CLEVRER: The first video dataset for neuro-symbolic reasoning

In the near future, it will not be enough for artificial intelligence to simply correlate information and make predictions based on those relationships. For AI to advance, it must understand not only what happened but why it happened, as well as the cause-and-effect relationships that led to that particular outcome. At the MIT-IBM Watson AI Lab, we are leading the development of explainable AI that understands such causal relationships and will eventually be able to apply a measure of common sense to solve problems. Our latest contribution is a dataset that for the first time helps AI recognize objects in videos, analyze their movement, and reason about their behaviors. We describe this work in our paper, CLEVRER: Collision Events for Video Representation and Reasoning, published as a Spotlight presentation at ICLR 2020. This work has been a collaboration between MIT CSAIL, IBM Research, Harvard University, and Google DeepMind.

CoLlision Events for Video REpresentation and Reasoning

The new CoLlision Events for Video REpresentation and Reasoning (CLEVRER) dataset enabled us to simplify the problem of visual recognition. We used CLEVRER to benchmark the performance of neural networks and of neuro-symbolic reasoning, a hybrid of neural networks and symbolic programming, using only a fraction of the data required for traditional deep learning systems.

CLEVRER was inspired by the Compositional Language and Elementary Visual Reasoning (CLEVR) diagnostic dataset, introduced in 2017 to test a range of visual reasoning abilities. CLEVR contained 100,000 rendered images and about one million automatically generated questions, of which 853,000 were unique. CLEVRER, by contrast, is a diagnostic video dataset of 20,000 videos, each five seconds long, that simulate colliding objects of different colors and shapes.

The table below shows a comparison between CLEVRER and other visual reasoning benchmarks on images and videos. CLEVRER is a well-annotated video reasoning dataset created under a controlled environment. It introduces a wide range of reasoning tasks including description, explanation, prediction and counterfactuals.

Dataset | Video | Diagnostic Annotations | Temporal Relation | Explanation | Prediction | Counterfactual
VQA (Antol et al., 2015)
CLEVR (Johnson et al., 2017a)
COG (Yang et al., 2018)
VCR (Zellers et al., 2019)
GQA (Hudson & Manning, 2019)
TGIF-QA (Jang et al., 2017)
MovieQA (Tapaswi et al., 2016)
MarioQA (Mun et al., 2017)
TVQA (Lei et al., 2018)
Social-IQ (Zadeh et al., 2019)
CLEVRER (ours)

Objects in the CLEVRER videos share the compositional intrinsic attributes used in the CLEVR dataset: three shapes, two materials and eight colors. CLEVRER also includes 300,000 questions and answers about the action depicted in the videos. The questions fall into four categories:

  1. Descriptive: e.g. “What is the material of the last object to collide with the cyan cylinder?”
  2. Explanatory: e.g. “What is responsible for the collision between rubber and metal cylinders?”
  3. Predictive: e.g. “What will happen next?”
  4. Counterfactual: e.g. “What will happen without the cyan cylinder?”
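
The attribute space mentioned above (three shapes, two materials, eight colors) is small enough to enumerate directly. The sketch below is illustrative only: the specific shape, material and color names are assumed to follow the CLEVR convention and are not taken from this article, which states only the counts.

```python
from itertools import product

# Assumed attribute vocabularies (CLEVR-style names; only the counts
# 3 / 2 / 8 are stated in the text).
SHAPES = ("cube", "sphere", "cylinder")
MATERIALS = ("rubber", "metal")
COLORS = ("gray", "red", "blue", "green", "brown", "cyan", "purple", "yellow")

# Every distinct (color, material, shape) combination an object can take.
ATTRIBUTE_COMBINATIONS = list(product(COLORS, MATERIALS, SHAPES))

print(len(ATTRIBUTE_COMBINATIONS))  # 8 * 2 * 3 = 48
```

Because every object is fully described by one of these 48 combinations, a question such as "the cyan cylinder" can be resolved by simple attribute filtering once the objects in a video have been recognized.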

Developing new AI methods using CLEVRER

We initially used CLEVRER to evaluate the visual reasoning ability of various state-of-the-art deep learning models. These models thrived on perception-based tasks, meaning they did well when answering the descriptive questions. But they performed poorly on all other types of questions, which are aimed at establishing cause-and-effect relationships.

Next, we created and tested a neuro-symbolic dynamic reasoning (NS-DR) model to see if it could succeed where neural networks could not. NS-DR combined object-centric representations, scene dynamics modeling and symbolic program execution. We used a neural network to recognize the objects’ colors, shapes and materials and a symbolic system to understand the physics of their movements as well as the causal relationships between them.

More specifically, NS-DR first parsed an input video into an abstract, object-based, frame-wise representation that essentially cataloged the objects appearing in the video. Second, a dynamics model learned to infer the motion and dynamic relationships among the different objects. Third, a semantic parser turned each question into a functional program. Finally, a symbolic program executor ran the program, using information about the objects and their relationships to produce an answer to the question.
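
The four stages above can be illustrated with a minimal Python sketch. It is not the NS-DR implementation: the outputs of the video parser and dynamics model are hard-coded as hypothetical `Obj` and `Collision` records, the functional program is written by hand rather than produced by a semantic parser, and the operation names (`filter_color`, `last_collision_partner`, and so on) are invented for illustration. Only the final stage, symbolic execution, is actually carried out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    """Hypothetical object record, standing in for the video parser's output."""
    obj_id: int
    color: str
    shape: str
    material: str


@dataclass(frozen=True)
class Collision:
    """Hypothetical collision event, standing in for the dynamics model's output."""
    frame: int
    a: int  # id of one colliding object
    b: int  # id of the other colliding object


# Stand-ins for stages 1 and 2 (made-up scene, not real CLEVRER data).
OBJECTS = [
    Obj(0, "cyan", "cylinder", "metal"),
    Obj(1, "red", "sphere", "rubber"),
    Obj(2, "gray", "cube", "metal"),
]
COLLISIONS = [Collision(40, 0, 1), Collision(95, 0, 2)]


def execute(program, objects, collisions):
    """Stage 4: run (op, arg) steps in order, threading one value through."""
    value = list(objects)
    for op, arg in program:
        if op == "filter_color":
            value = [o for o in value if o.color == arg]
        elif op == "filter_shape":
            value = [o for o in value if o.shape == arg]
        elif op == "unique":
            if len(value) != 1:
                raise ValueError(f"expected exactly one object, got {len(value)}")
            value = value[0]
        elif op == "last_collision_partner":
            events = [c for c in collisions if value.obj_id in (c.a, c.b)]
            if not events:
                raise ValueError("object is not involved in any collision")
            last = max(events, key=lambda c: c.frame)
            partner_id = last.b if last.a == value.obj_id else last.a
            value = next(o for o in objects if o.obj_id == partner_id)
        elif op == "query":
            if arg not in ("color", "shape", "material"):
                raise ValueError(f"unknown attribute: {arg}")
            value = getattr(value, arg)
        else:
            raise ValueError(f"unknown operation: {op}")
    return value


# Stand-in for stage 3: a hand-written program for the question
# "What is the material of the last object to collide with the cyan cylinder?"
PROGRAM = [
    ("filter_color", "cyan"),
    ("filter_shape", "cylinder"),
    ("unique", None),
    ("last_collision_partner", None),
    ("query", "material"),
]

answer = execute(PROGRAM, OBJECTS, COLLISIONS)
print(answer)  # metal
```

Because each step is an explicit symbolic operation over object records, the intermediate values can be inspected to see how an answer was reached, which is the sense in which this style of model is explainable.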

By using neural networks and symbolic programming in this complementary way, NS-DR outperformed the deep learning models alone by a large margin across all categories of questions. The model’s ability to perform well on a dataset portraying physical movement means it could be used to address very fundamental problems, such as autonomous navigation, that have important applications in the real world. It is the closest research has gotten to granting AI the ability to effectively navigate and make inferences in the physical world.

The neuro-symbolic paradigm shift

Neuro-symbolic paradigms will be integral to AI’s ability to learn and reason across a variety of tasks without a heavy training burden, all while being more secure, fair, scalable and explainable. By combining the best of neural-based learning with symbolic-based reasoning, we continue to explore how to create AI systems that require less data to tackle significantly more diverse and complex tasks.

We have made the CLEVRER dataset available to the larger research community as a way for them to test different AI models. Next-generation AI systems need the ability to infer causal relationships from data and begin to understand cause and effect, instead of relying exclusively on pattern recognition. Our goal is to facilitate the advances essential to developing AI that ultimately demonstrates common-sense thinking.

Please cite our work using the BibTeX below.

@inproceedings{yi2020clevrer,
  title={CLEVRER: Collision Events for Video Representation and Reasoning},
  author={Kexin Yi* and Chuang Gan* and Yunzhu Li and Pushmeet Kohli and Jiajun Wu and Antonio Torralba and Joshua B. Tenenbaum},
  booktitle={International Conference on Learning Representations},
  year={2020}
}