When a dog barks at a squeaky toy or a mechanic suddenly stops talking mid-sentence, you don’t need a PhD in cognitive science to figure out what’s happening—you just watch, listen, and understand. But for multimodal AI models, this simple human reflex remains surprisingly hard to replicate. Despite all the recent hype around “frontier” models like GPT-4o and Gemini 1.5 Pro, most of them still fumble when forced to truly synthesize what they see and hear. That’s exactly the problem MAVERIX is trying to solve.
Where benchmarks fall short—and MAVERIX steps in
Today’s leading multimodal benchmarks might claim they test real-world reasoning, but many of them cheat. They reward models that can get by with just vision or just text transcripts, instead of forcing them to integrate multiple senses like humans do. MAVERIX (short for Multimodal Audio-Visual Evaluation Reasoning IndeX) is a new benchmark that finally raises the bar by requiring tightly coupled audio-visual reasoning across 700 videos and more than 2,500 questions.
Think of it as a crash course in common sense for AI: if you hear a buzzing and see a bee near the camera, you should probably rule out “mechanical device off-screen.” But MAVERIX doesn’t just hand models a few easy puzzles. It comes with eight-option multiple-choice questions (to kill the guesswork) and open-ended prompts (to test true understanding), pushing models beyond pattern recognition into full-on cognitive coordination.
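To make the question design concrete, here is a minimal sketch of what one MAVERIX-style item might look like as a data record. The field names and the example item are illustrative assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass

@dataclass
class AVQuestion:
    """One hypothetical MAVERIX-style item (field names are illustrative)."""
    video_id: str
    question: str
    options: list      # eight choices, to suppress lucky guessing
    answer_index: int  # index of the correct choice
    open_ended: bool = False  # open-ended variants omit the options

# Example item built around the buzzing-bee scenario from the text.
item = AVQuestion(
    video_id="vid_0042",
    question="What is the most likely source of the buzzing sound?",
    options=[
        "A bee near the camera", "A mechanical device off-screen",
        "A phone vibrating", "An electric razor", "A drone overhead",
        "A faulty microphone", "A fly trapped indoors", "Background music",
    ],
    answer_index=0,
)

# With eight options, random guessing yields only 12.5% accuracy,
# versus 25% on the four-option format many benchmarks use.
chance_accuracy = 1 / len(item.options)
```

The design choice is the point: eight options push chance performance low enough that scores above it plausibly reflect understanding rather than luck.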
Real-world questions, real human complexity
MAVERIX’s questions are designed like psychological Rorschach tests for machines—covering causal reasoning, emotional inference, spatial awareness, and dynamic context. Picture a video of two people arguing. Are they fighting for real, acting in a movie, or just mimicking WWE wrestling for laughs? That answer could hinge on the slap and the laugh track. You need to see and hear to understand.
To make this all work, the MAVERIX team built a meticulous pipeline that blends human expertise with AI validation. Every video comes with subtitles, categorized sounds (speech, music, natural noise), and annotated keyframes. Every question is vetted to ensure that unimodal shortcuts—like just reading the subtitles—don’t cut it. If a model could answer without using both modalities, the question gets rewritten or tossed.
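The vetting step above can be sketched as a simple filter: probe each candidate question under each input condition, and keep it only when no single modality suffices. The condition names and the probe setup here are assumptions for illustration; the actual MAVERIX criteria may differ:

```python
def needs_both_modalities(results):
    """Keep a question only if no single modality suffices.

    `results` maps an input condition (e.g. "subtitles_only") to whether
    a probe model answered correctly under that condition. Conditions
    are illustrative, not the benchmark's actual protocol.
    """
    unimodal = ("subtitles_only", "audio_only", "video_only")
    solvable_unimodally = any(results.get(cond, False) for cond in unimodal)
    # The question must be answerable with both streams, and ONLY then.
    return results.get("audio_and_video", False) and not solvable_unimodally

# A "leaky" question solvable from subtitles alone gets rewritten or tossed;
# a tightly coupled one survives.
leaky = {"subtitles_only": True, "audio_and_video": True}
coupled = {"subtitles_only": False, "audio_only": False,
           "video_only": False, "audio_and_video": True}
```

A question that fails the filter goes back for rewriting rather than straight into the benchmark, which is what keeps unimodal shortcuts from inflating scores.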
So, how well do today’s AIs actually perform?
Not great. Even with direct access to audio and video, the top performer, Gemini 1.5 Pro, scored around 71.9% accuracy. That approaches human performance but still falls short: humans with full audiovisual input clock in at over 80%. And here's the kicker: some open-source models barely crack 30%. Strip away audio or video, and performance drops like a mic.
In open-ended tasks where models must generate their own explanations, things get messier. The average model scored just 1.9 out of 5 in GPT-4o-judged coherence and reasoning. Humans scored 2.79. That gap widens even more when tasks involve complex emotional cues or off-screen events—like guessing why a crowd shifts tables at a poker game or whether two dancers are fighting or just rehearsing.
Not all models struggle the same way
One of MAVERIX’s most revealing contributions is how it exposes what different models actually rely on. Gemini performs best when given raw audio, while most other models do better with subtitles. That says a lot about what’s going on under the hood—some models “listen,” others just “read.” But neither matches human-level perception across the board.
Interestingly, tasks like shopping—where structured, factual data matters—are where machines shine. But for sports commentary, gaming strategy, or interpreting human emotions? Humans crush them. These gaps show that current AI is much better at scanning catalogs than parsing social nuance or context that evolves over time.
Difficulty levels matter, and so does modality
Multimodal inputs gave their biggest boost on easy tasks, suggesting that some models use audio and video mainly to refine answers they were already close to. When questions got harder, many models leaned heavily on vision and ignored audio. Claude 3.5 Sonnet, for example, improved by 41.5% on easy videos with multimodal input, but only 17% on hard ones.
This highlights a deeper issue: most models aren’t really fusing modalities. They’re stacking them. You can give them both audio and video, but unless the model needs both to solve the task, it’ll pick a favorite. MAVERIX aims to change that by designing questions that demand true fusion—where the answer hinges on the interplay between sound and sight.
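The stacking-versus-fusion distinction can be illustrated with a toy in plain Python. This is not any model's actual architecture, just a minimal sketch: a "stacked" predictor trusts whichever single modality is most confident, while a "fused" one scores each answer by joint agreement across modalities:

```python
def stacked_prediction(vision_scores, audio_scores):
    """'Stacking': trust whichever single modality is most confident.

    Both arguments map candidate answers to confidences in [0, 1].
    Toy illustration only, not a real model's decision rule.
    """
    best_vision = max(vision_scores, key=vision_scores.get)
    best_audio = max(audio_scores, key=audio_scores.get)
    if vision_scores[best_vision] >= audio_scores[best_audio]:
        return best_vision
    return best_audio

def fused_prediction(vision_scores, audio_scores):
    """'Fusion': score each answer by agreement between the two streams."""
    candidates = set(vision_scores) | set(audio_scores)
    joint = {a: vision_scores.get(a, 0.0) * audio_scores.get(a, 0.0)
             for a in candidates}
    return max(joint, key=joint.get)

# Sight alone favors "mechanical device"; sound alone favors the bee;
# only combining both settles on the answer the two senses support.
vision = {"bee near camera": 0.3, "mechanical device": 0.8}
audio = {"bee near camera": 0.7, "mechanical device": 0.1}
```

On this example the stacked predictor follows the louder single modality and answers "mechanical device", while the fused predictor picks "bee near camera", the only answer both streams support. MAVERIX-style questions are built so that the second behavior is the one that scores.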
To bridge the performance gap, we’ll need better architectures that treat audio as more than an afterthought. We’ll need new training strategies that reward synchronized understanding rather than isolated predictions. And most of all, we’ll need benchmarks like MAVERIX that don’t settle for what’s easy to measure, but ask the hard questions about how machines really understand.
So the next time your AI assistant messes up a simple command or misreads a tone, remember: it might not be deaf—it just hasn’t passed the MAVERIX test yet.