Self-supervised Video Retrieval
9 papers with code • 2 benchmarks • 2 datasets
Most implemented papers
Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework
With the proposed Inter-Intra Contrastive (IIC) framework, we can train spatio-temporal convolutional networks to learn video representations.
Similarity Contrastive Estimation for Image and Video Soft Contrastive Self-Supervised Learning
A good data representation should contain relations between the instances, or semantic similarity and dissimilarity, that contrastive learning harms by considering all negatives as noise.
Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning
As a proxy task, it converts rich self-supervised representations into video clip operations (options), which enhances the flexibility and reduces the complexity of representation learning.
Video Playback Rate Perception for Self-Supervised Spatio-Temporal Representation Learning
The generative perception model acts as a feature decoder to focus on comprehending high temporal resolution and short-term representation by introducing a motion-attention mechanism.
Pretext-Contrastive Learning: Toward Good Practices in Self-supervised Video Representation Learning
It is convenient to treat PCL as a standard training strategy and apply it to many other works in self-supervised video feature learning.
TCLR: Temporal Contrastive Learning for Video Representation
However, prior work on contrastive learning for video data has not explored the effect of explicitly encouraging the features to be distinct across the temporal dimension.
Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting
Instance-level contrastive learning techniques, which rely on data augmentation and a contrastive loss function, have found great success in the domain of visual representation learning.
Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity
We present CrissCross, a self-supervised framework for learning audio-visual representations.
SLIC: Self-Supervised Learning with Iterative Clustering for Human Action Videos
One of the key reasons for this is that sampling pairs of similar video clips, a required step for many self-supervised contrastive learning methods, is currently done conservatively to avoid false positives.
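Several of the entries above refer to instance-level contrastive learning, in which two augmented views of the same clip form a positive pair and the other clips in the batch serve as negatives. As a rough illustration only, and not the loss of any specific paper listed here, the following is a minimal NumPy sketch of an InfoNCE-style contrastive loss. The function name `info_nce_loss`, the temperature value, and the embedding shapes are assumptions made for this example.

```python
import numpy as np


def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss between two batches of clip embeddings.

    z_a[i] and z_b[i] are assumed to be embeddings of two augmented views
    of the same clip i (the positive pair). Every other row of z_b is
    treated as a negative for z_a[i].
    """
    # L2-normalise so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # Pairwise similarities, scaled by the temperature: shape (N, N).
    logits = (z_a @ z_b.T) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_prob)))


# Illustrative usage with random stand-in embeddings (8 clips, 16 dims).
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss_aligned = info_nce_loss(z, z)                         # positives match
loss_shuffled = info_nce_loss(z, np.roll(z, 1, axis=0))    # positives misaligned
```

In this sketch the loss is low when each row's positive is its most similar entry and grows when the pairing is scrambled. Methods in the list above change which pairs count as positives or negatives (for example across time, across streams, or via soft similarity targets), which this example does not attempt to reproduce.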