Multimodal Learning

Imagine trying to interpret the world using only one of your senses at a time. Or, even worse, using only one aspect of one sense. Static imagery, yes – but no action sequences. Or spoken word – but no song. This is pretty much the state of AI today. It's good at recognizing language or imagery in isolation, but when it comes to combining those interpretations or incorporating action sequences, there's still a long way to go.

We want to build AI systems that are better at what's known as multimodal learning. They'll draw on more than one input at a time and be capable of deciphering complex scenes that combine imagery with actions and sound. In short, they'll have a human-like ability to make sense of the world.

Consider parsing audio and video. As humans, we can simultaneously see and hear a person playing a violin and identify the source of the sound. In other words, we integrate our senses. Through multimodal learning, AI models are gaining the same ability. This is a big step toward autonomous systems that can interact with our complex world.

All Work

AI-enabled control system helps autonomous drones stay on target in uncertain environments

MIT News

AI learns how vision and sound are connected, without human intervention

MIT News

IBM Granite now has eyes

IBM Research

From surf to satellites: Campbell Watson is bringing AI to Earth science

IBM Research

Participatory AI highlights paths to sustainability

MIT ILP

MIT researchers advance automated interpretability in AI models

MIT News

Multiple AI models help robots execute complex plans more transparently

MIT News

Using language to give robots a better grasp of an open-ended world

MIT News

Scaling audio-visual learning without labels

MIT News

Some glimpse AGI in ChatGPT. Others call it a mirage

WIRED

This AI can harness sound to reveal the structure of unseen spaces

Popular Science

Perceptron: AI that sees with sound, learns to walk and predicts seismic physics

TechCrunch

Using sound to model the world

MIT News

Daniel Huttenlocher

MIT ILP

Converting several audio streams into one voice makes it easier for AI to learn

IBM Research

More Language, Less Labeling with Kate Saenko

This Week in Machine Learning & AI (TWIML) podcast

Hallucinating to better text translation

MIT News

Artificial intelligence system learns concepts shared across video, audio, and text

MIT News

AdaShare: Learning What To Share For Efficient Deep Multi-Task Learning

Computer Vision, Multimodal Learning

Neural-Network Can Identify a Melody Through Musicians’ Body Movements

Interesting Engineering

Identifying a melody by studying a musician’s body language

MIT News

Undergraduates develop next-generation intelligence tools

MIT News

The sound of motions

Computer Vision

Self-supervised Moving Vehicle Tracking with Stereo Sound

Multimodal Learning, Computer Vision

New tricks from old dogs: multi-source transfer learning

Transfer Learning, Explainability

The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision

Neuro-Symbolic AI, Computer Vision

Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

ECCV, Multimodal Learning

Dialog-based Interactive Image Retrieval

NeurIPS, Computer Vision

The Sound of Pixels

Multimodal Learning, Computer Vision

Learning to Separate Object Sounds by Watching Unlabeled Video

Computer Vision