Imagine trying to interpret the world using only one of your senses at a time. Or even worse, employing only one aspect of one of your senses. Static imagery, yes – but no action sequences. Or spoken word – but no song. This is pretty much the state of AI today. It’s good at recognizing language or imagery in silos, but when it comes to combining interpretations or incorporating action sequences, there’s still a long way to go.
We want to build innovative AI systems that are better at what’s known as multimodal learning. They’ll draw on more than one input at a time and be capable of deciphering complex scenes that incorporate imagery as well as actions and sound. In short, they’ll have a human-like ability to make sense of the world.
Consider parsing audio and video. As humans, we can simultaneously see and hear a person playing a violin and identify the source of the sound. In other words, we integrate our senses. Through multimodal learning, AI models are gaining the same ability. This represents a big step toward autonomous systems that can interact with our complex world.