Multimodal Learning

Imagine trying to interpret the world using only one of your senses at a time. Or even worse, employing only one aspect of one of your senses. Static imagery, yes – but no action sequences. Or spoken word – but no song. This is pretty much the state of AI today. It’s good at recognizing language or imagery in silos, but when it comes to combining interpretations or incorporating action sequences, there’s still a long way to go.

We want to build AI systems that are better at what’s known as multimodal learning. They’ll draw from more than one input at a time and be capable of deciphering complex scenes that incorporate imagery as well as actions and sound. In short, they’ll have a human-like ability to make sense of the world.

Consider parsing audio and video. As humans, we can simultaneously see and hear a person playing a violin and identify the source of the sound. In other words, we integrate our senses. Through multimodal learning, AI models are gaining the same ability. This is a big step toward autonomous systems that can interact with our complex world.

All Work

Neural-Network Can Identify a Melody Through Musicians’ Body Movements
Interesting Engineering
Identifying a melody by studying a musician’s body language
MIT News
Undergraduates develop next-generation intelligence tools
MIT News
Self-supervised Moving Vehicle Tracking with Stereo Sound
The sound of motions
New tricks from old dogs: multi-source transfer learning
The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
Dialog-based Interactive Image Retrieval
Learning to Separate Object Sounds by Watching Unlabeled Video