Research

Multimodal Clustering Networks for Self-Supervised Learning From Unlabeled Videos

Authors

Hildegard Kühne
Rameswar Panda
Rogerio Feris
James Glass
Brian Chen
Andrew Rouditchenko
Kevin Duarte
Samuel Thomas
Angie Boggust
Brian Kingsbury
David Harwath
Michael Picheny
Shih-Fu Chang

ICCV

Published on

10/17/2021

Multimodal self-supervised learning is attracting growing attention because it makes it possible not only to train large networks without human supervision but also to search and retrieve data across modalities. In this context, this paper proposes a framework that, starting from a pre-trained backbone, learns a common multimodal embedding space that shares representations across modalities and also enforces a grouping of semantically similar instances. To this end, we extend instance-level contrastive learning with a multimodal clustering step in the training pipeline to capture semantic similarities across modalities. The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains. To evaluate our approach, we train our model on the HowTo100M dataset and evaluate its zero-shot retrieval capabilities in two challenging domains, text-to-video retrieval and temporal action localization, showing state-of-the-art results on four different datasets.
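As a rough illustration of the idea described above, the following NumPy sketch combines an instance-level contrastive term with a clustering term computed on joint embeddings. It is not the authors' implementation: the function names, the temperature, the number of clusters `k`, the loss weight, the plain k-means routine, and the squared-distance clustering term are all assumptions chosen for brevity, and the embeddings are random placeholders standing in for projected video, audio, and text features.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def info_nce(a, b, temperature=0.1):
    """Symmetric instance-level contrastive loss (assumed form).
    Row i of `a` and row i of `b` come from the same clip."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (xent(logits) + xent(logits.T))

def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns centroids and per-row assignments."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = x[assign == j].mean(axis=0)
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dist.argmin(axis=1)

def clustering_loss(embs, k, seed=0):
    """Pull every modality toward the centroid of its clip's joint embedding
    (hypothetical squared-distance formulation)."""
    normed = [l2_normalize(e) for e in embs]
    joint = np.mean(normed, axis=0)  # average over modalities
    centroids, assign = kmeans(joint, k, seed=seed)
    targets = centroids[assign]
    return float(np.mean([((e - targets) ** 2).sum(axis=1).mean() for e in normed]))

def combined_loss(video, audio, text, k=4, weight=1.0):
    """Pairwise contrastive terms plus a weighted clustering term."""
    embs = [video, audio, text]
    contrastive = sum(
        info_nce(embs[i], embs[j]) for i in range(3) for j in range(i + 1, 3)
    )
    return float(contrastive + weight * clustering_loss(embs, k))

# Placeholder embeddings: 8 clips, 32 dimensions per modality.
rng = np.random.default_rng(0)
video, audio, text = (rng.normal(size=(8, 32)) for _ in range(3))
loss = combined_loss(video, audio, text)
```

In this sketch the contrastive term aligns the same clip across modalities, while the clustering term groups clips whose joint embeddings fall near the same centroid. A trainable version would compute the same quantities in an autodiff framework and update the projection heads on top of the pre-trained backbone.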

Please cite our work using the BibTeX below.

@InProceedings{Chen_2021_ICCV,
    author    = {Chen, Brian and Rouditchenko, Andrew and Duarte, Kevin and Kuehne, Hilde and Thomas, Samuel and Boggust, Angie and Panda, Rameswar and Kingsbury, Brian and Feris, Rogerio and Harwath, David and Glass, James and Picheny, Michael and Chang, Shih-Fu},
    title     = {Multimodal Clustering Networks for Self-Supervised Learning From Unlabeled Videos},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {8012-8021}
}