Multimodal Learning

Imagine trying to interpret the world using only one of your senses at a time. Or even worse, employing only one aspect of one of your senses. Static imagery, yes – but no action sequences. Or spoken word – but no song. This is pretty much the state of AI today. It’s good at recognizing language or imagery in silos, but when it comes to combining interpretations or incorporating action sequences, there’s still a long way to go.

We want to build innovative AI systems that are better at what’s known as multimodal learning. They’ll draw from more than one input at a time and be capable of deciphering complex scenes that incorporate imagery as well as actions and sound. In short, they’ll have a human-like ability to make sense of the world.

Consider parsing audio and video. As humans, we can simultaneously see and hear a person playing a violin and identify the source of the sound. In other words, we integrate our senses. Through multimodal learning, AI models are gaining the same ability. This is a major step toward autonomous systems that can interact with our complex world. A minimal sketch of one common approach, late fusion, appears below.
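To make the idea concrete, here is a minimal sketch of late fusion: each modality is encoded separately, projected into a shared space, and the combined representation drives a downstream prediction (for example, which object on screen is producing the sound we hear). The model, layer sizes, and names are illustrative assumptions for this page, not an implementation of any of the works listed below.

```python
# Illustrative late-fusion multimodal classifier (a sketch, not any specific
# model from the works listed on this page). Assumes precomputed per-clip
# image and audio embeddings; all dimensions and names are made up.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, image_dim=512, audio_dim=128, shared_dim=256, num_classes=10):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.image_proj = nn.Sequential(nn.Linear(image_dim, shared_dim), nn.ReLU())
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, shared_dim), nn.ReLU())
        # Classify from the concatenated (fused) representation.
        self.classifier = nn.Linear(2 * shared_dim, num_classes)

    def forward(self, image_emb, audio_emb):
        fused = torch.cat(
            [self.image_proj(image_emb), self.audio_proj(audio_emb)], dim=-1
        )
        return self.classifier(fused)

# Toy usage: a batch of 4 clips, each with one image and one audio embedding.
model = AudioVisualFusion()
image_emb = torch.randn(4, 512)   # e.g. from a pretrained vision backbone
audio_emb = torch.randn(4, 128)   # e.g. from a log-mel spectrogram encoder
logits = model(image_emb, audio_emb)
print(logits.shape)               # torch.Size([4, 10])
```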

All Work

Multiple AI models help robots execute complex plans more transparently
MIT News
Using language to give robots a better grasp of an open-ended world
MIT News
Scaling audio-visual learning without labels
MIT News
Some glimpse AGI in ChatGPT. Others call it a mirage
WIRED
This AI can harness sound to reveal the structure of unseen spaces
Popular Science
Perceptron: AI that sees with sound, learns to walk and predicts seismic physics
TechCrunch
Using sound to model the world
MIT News
Daniel Huttenlocher
MIT ILP
Converting several audio streams into one voice makes it easier for AI to learn
IBM Research
More Language, Less Labeling with Kate Saenko
This Week in Machine Learning & AI (TWIML) podcast
Hallucinating to better text translation
MIT News
Artificial intelligence system learns concepts shared across video, audio, and text
MIT News
AdaShare: Learning What To Share For Efficient Deep Multi-Task Learning
Neural-Network Can Identify a Melody Through Musicians’ Body Movements
Interesting Engineering
Identifying a melody by studying a musician’s body language
MIT News
Undergraduates develop next-generation intelligence tools
MIT News
The sound of motions
Self-supervised Moving Vehicle Tracking with Stereo Sound
New tricks from old dogs: multi-source transfer learning
The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
Dialog-based Interactive Image Retrieval
Learning to Separate Object Sounds by Watching Unlabeled Video