Learning to Separate Object Sounds by Watching Unlabeled Video

Computer Vision


Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to learn audio-visual object models from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies on a deep multi-instance multi-label learning framework to disentangle the audio frequency bases that map to individual visual objects, even without observing or hearing those objects in isolation. We show how the recovered disentangled bases can be used to guide audio source separation to obtain better-separated, object-level sounds. Our work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video. We obtain state-of-the-art results on visually-aided audio source separation and audio denoising.
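The separation step described above can be illustrated with a minimal sketch: given frequency bases already learned for each visual object, a mixture spectrogram is decomposed by holding those bases fixed, solving only for their activations, and reconstructing each object's sound with a soft mask. The object names (`W_violin`, `W_dog`), dimensions, and the synthetic data below are illustrative assumptions, not the paper's actual learned bases; the update rule shown is a standard supervised-NMF multiplicative update, used here as a stand-in for the paper's guided separation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-object frequency bases (in the paper these would come
# from the MIML framework trained on unlabeled video; here they are random).
F, K = 64, 4                      # frequency bins, bases per object
W_violin = rng.random((F, K))
W_dog = rng.random((F, K))
W = np.hstack([W_violin, W_dog])  # fixed dictionary of shape (F, 2K)

# A synthetic mixture magnitude spectrogram with T time frames.
T = 100
H_true = rng.random((2 * K, T))
V = W @ H_true

# Solve for activations H while keeping the bases W fixed
# (Euclidean multiplicative updates; a standard supervised-NMF recipe).
H = rng.random((2 * K, T)) + 1e-3
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)

# Reconstruct each object's spectrogram with a soft (Wiener-style) mask,
# so the per-object estimates sum back to the mixture.
V_hat = W @ H + 1e-9
S_violin = V * (W_violin @ H[:K]) / V_hat
S_dog = V * (W_dog @ H[K:]) / V_hat
```

The soft mask guarantees the object-level estimates partition the mixture energy, which is why the separated spectrograms sum back to the input.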

Please cite our work using the BibTeX below.

@article{DBLP:journals/corr/abs-1804-01665,
  author    = {Ruohan Gao and
               Rog{\'{e}}rio Schmidt Feris and
               Kristen Grauman},
  title     = {Learning to Separate Object Sounds by Watching Unlabeled Video},
  journal   = {CoRR},
  volume    = {abs/1804.01665},
  year      = {2018},
  url       = {https://arxiv.org/abs/1804.01665},
  archivePrefix = {arXiv},
  eprint    = {1804.01665},
  timestamp = {Mon, 13 Aug 2018 16:47:22 +0200},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
