We Have So Much in Common: Modeling Semantic Relational Set Abstractions in Videos

ECCV

Authors

Alex Andonian, Camilo Fosco, Mathew Monfort, Allen Lee, Rogerio Feris, Carl Vondrick, Aude Oliva

Published on

08/28/2020

Categories

Computer Vision, ECCV

Identifying common patterns among events is a key ability in human and machine perception, as it underlies intelligent decision making. We propose an approach for learning semantic relational set abstractions on videos, inspired by human learning. We combine visual features with natural language supervision to generate high-level representations of similarities across a set of videos. This allows our model to perform cognitive tasks such as set abstraction (which general concept is in common among a set of videos?), set completion (which new video goes well with the set?), and odd one out detection (which video does not belong to the set?). Experiments on two video benchmarks, Kinetics and Multi-Moments in Time, show that robust and versatile representations emerge when learning to recognize commonalities among sets. We compare our model to several baseline algorithms and show that significant improvements result from explicitly learning relational abstractions with semantic supervision.
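
As a rough illustration of the idea (not the authors' published implementation, which is available from the project website, abstraction.csail.mit.edu), the sketch below pools per-video features into a single set representation, projects it into a semantic embedding space, and scores it against word embeddings of candidate abstraction labels; the same similarity can also rank candidate videos for set completion or flag the odd one out. All names here (SetAbstractionSketch, video_dim, label_embs, and so on) are illustrative assumptions, and random tensors stand in for real video features and label embeddings.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SetAbstractionSketch(nn.Module):
    """Toy sketch: pool per-video embeddings into one set representation and
    project it into a shared semantic space where label (word) embeddings live."""
    def __init__(self, video_dim=512, semantic_dim=300):
        super().__init__()
        self.project = nn.Linear(video_dim, semantic_dim)

    def forward(self, video_feats):
        # video_feats: (set_size, video_dim) features from a pretrained video encoder
        set_repr = video_feats.mean(dim=0)  # simple average pooling over the set
        return F.normalize(self.project(set_repr), dim=-1)

# Toy usage with random stand-ins for real video features and label embeddings.
torch.manual_seed(0)
model = SetAbstractionSketch()
videos = torch.randn(4, 512)                              # a set of 4 video feature vectors
label_embs = F.normalize(torch.randn(200, 300), dim=-1)   # vocabulary of abstraction labels

abstraction = model(videos)                 # shared-space embedding of the whole set
scores = label_embs @ abstraction           # cosine similarity to each candidate label
print("predicted abstraction label index:", scores.argmax().item())

# Odd one out: the video whose projected embedding is least similar to the set abstraction.
per_video = F.normalize(model.project(videos), dim=-1)
odd_one = (per_video @ abstraction).argmin().item()
print("odd one out (index):", odd_one)

The actual model's set-combination and natural language supervision are richer than the simple average pooling and cosine scoring used above, so treat this purely as a conceptual scaffold.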

This paper has been published at ECCV 2020.

Please cite our work using the BibTeX below.

@InProceedings{10.1007/978-3-030-58523-5_2,
author="Andonian, Alex
and Fosco, Camilo
and Monfort, Mathew
and Lee, Allen
and Feris, Rogerio
and Vondrick, Carl
and Oliva, Aude",
editor="Vedaldi, Andrea
and Bischof, Horst
and Brox, Thomas
and Frahm, Jan-Michael",
title="We Have So Much in Common: Modeling Semantic Relational Set Abstractions in Videos",
booktitle="Computer Vision -- ECCV 2020",
year="2020",
publisher="Springer International Publishing",
address="Cham",
pages="18--34",
abstract="Identifying common patterns among events is a key capability for human and machine perception, as it underlies intelligent decision making. Here, we propose an approach for learning semantic relational set abstractions on videos, inspired by human learning. Our model combines visual features as input with natural language supervision to generate high-level representations of similarities across a set of videos. This allows our model to perform cognitive tasks such as set abstraction (which general concept is in common among a set of videos?), set completion (which new video goes well with the set?), and odd one out detection (which video does not belong to the set?). Experiments on two video benchmarks, Kinetics and Multi-Moments in Time, show that robust and versatile representations emerge when learning to recognize commonalities among sets. We compare our model to several baseline algorithms and show that significant improvements result from explicitly learning relational abstractions with semantic supervision. Code and models are available online (Project website: abstraction.csail.mit.edu).",
isbn="978-3-030-58523-5"
}
