Grounding Spoken Words in Unlabeled Video





In this paper, we explore deep learning models that learn joint multi-modal embeddings in videos where the audio and visual streams are loosely synchronized. Specifically, we consider cooking show videos from the YouCook2 dataset and a subset of the YouTube-8M dataset. We introduce varying levels of supervision into the learning process to guide the sampling of audio-visual pairs for training the models. This includes (1) a fully-unsupervised approach that samples audio-visual segments uniformly from an entire video, and (2) sampling audio-visual segments using weak supervision from off-the-shelf automatic speech and visual recognition systems. Although these models are preliminary, they learn cross-modal correlations even without supervision, and weak supervision substantially strengthens this cross-modal learning.
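The fully-unsupervised setting (1) can be illustrated with a minimal sketch: draw co-occurring audio and visual windows at uniformly random timestamps within a video, relying only on the loose synchronization of the two streams. The function name and segment representation below are illustrative assumptions, not the paper's actual implementation.

```python
import random

def sample_segment_pairs(video_duration, segment_length, num_pairs, seed=None):
    """Hypothetical sketch: uniformly sample co-occurring audio-visual
    segments from a single video for unsupervised pair training.

    Each pair is an (audio, visual) window taken at the same timestamp;
    no transcript or visual labels are used.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        # Pick a start time so the whole segment fits inside the video.
        start = rng.uniform(0.0, video_duration - segment_length)
        window = (start, start + segment_length)
        # The audio crop and the visual crop share the same time window.
        pairs.append({"audio": window, "visual": window})
    return pairs

# Example: four 10-second segment pairs from a 5-minute video.
pairs = sample_segment_pairs(video_duration=300.0, segment_length=10.0,
                             num_pairs=4, seed=0)
```

The weakly-supervised variant (2) would replace the uniform draw with timestamps proposed by off-the-shelf speech and visual recognizers.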

Please cite our work using the BibTeX below.

@InProceedings{Boggust_2019_CVPR_Workshops,
  author    = {Boggust, Angie W. and Audhkhasi, Kartik and Joshi, Dhiraj and Harwath, David and Thomas, Samuel and Feris, Rogerio and Gutfreund, Danny and Zhang, Yang and Torralba, Antonio and Picheny, Michael and Glass, James},
  title     = {Grounding Spoken Words in Unlabeled Video},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2019}
}