Semi-Supervised Action Recognition with Temporal Contrastive Learning




Learning to recognize actions from only a handful of labeled videos is a challenging problem due to the scarcity of tediously collected activity labels. We approach this problem by learning a two-pathway temporal contrastive model from unlabeled videos played at two different speeds. Specifically, we propose to maximize the similarity between encoded representations of the same video at two different speeds, while minimizing the similarity between different videos played at different speeds. In this way, we leverage the rich supervisory information, in terms of 'time', that is present in an otherwise unsupervised pool of videos. With this simple yet surprisingly effective strategy of manipulating the playback rates of unlabeled videos, we considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods across multiple datasets and network architectures. Interestingly, our approach also benefits from out-of-domain unlabeled videos, demonstrating its robustness and generalizability. We also perform rigorous ablations and analyses to validate our approach.
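The core idea above, pairing embeddings of the same video at two playback speeds as positives and embeddings from different videos as negatives, can be sketched with a standard NT-Xent-style contrastive loss. This is a minimal illustration, not the paper's implementation: the function and variable names (`temporal_contrastive_loss`, `z_fast`, `z_slow`, temperature `tau`) are our own, and the encoder pathways are assumed to have already produced the embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def temporal_contrastive_loss(z_fast, z_slow, tau=0.1):
    """NT-Xent-style loss: row i's positive is the same video's embedding
    at the other playback speed (column i); all other rows are negatives.

    z_fast, z_slow: (N, D) embeddings of N videos at fast/slow speeds.
    tau: temperature scaling the similarity logits (an assumed value).
    """
    z_fast = l2_normalize(z_fast)
    z_slow = l2_normalize(z_slow)
    sim = z_fast @ z_slow.T / tau                       # (N, N) similarity logits
    # Log-softmax over each row; the diagonal entries are the positive pairs.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
# Perfectly aligned fast/slow embeddings should incur a lower loss
# than embeddings that carry no shared information.
loss_aligned = temporal_contrastive_loss(z, z)
loss_random = temporal_contrastive_loss(z, rng.normal(size=(4, 8)))
print(loss_aligned, loss_random)
```

Minimizing this loss pulls the two speed pathways of the same video together while pushing apart representations of different videos, which is the instance-level half of the objective described in the abstract.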

This paper has been published at CVPR 2021

Please cite our work using the BibTeX below.

@InProceedings{Singh_2021_CVPR,
    author    = {Singh, Ankit and Chakraborty, Omprakash and Varshney, Ashutosh and Panda, Rameswar and Feris, Rogerio and Saenko, Kate and Das, Abir},
    title     = {Semi-Supervised Action Recognition With Temporal Contrastive Learning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2021},
    pages     = {10389-10399}
}