Vision-Language Models Move Past Object Detection: The MLLM Perception Shift

Vision-language models now reason over tables, charts, and documents, but detection-era benchmarks still rank them on box localization and undercount comprehension.

Vision-language models have outgrown the evaluation layer built for them. A run of multimodal-LLM survey work from late 2024 through early 2026 argues that current models do unified structural and cross-modal reasoning, yet the benchmarks ranking them were designed around single-task object detection. Practitioners who select on detection leaderboards are reading the wrong dial, and the field is starting to admit it.

Why do vision-language benchmarks still reward detection?

Most public vision benchmarks descend from the image-classification and object-detection tradition, where a model’s job was to localize boxes or label pixels, and a single train-eval-test split on one task decided who won. That inheritance is now the bottleneck.

The MME-Survey, produced by the teams behind MME, MMBench, and LLaVA, frames this directly. It argues that MLLM versatility has driven a proliferation of new benchmarks and evaluation methods, breaking from what the authors call the “traditional train-eval-test paradigm that only favors a single task like image classification.” The survey reorganizes evaluation into foundation capabilities, model self-analysis, and extended applications, an explicit move away from one-task scoring.

The cost of the old framing shows up downstream. A model that can read a table, reason over its cells, and answer a question gets the same evaluation surface as one that merely draws a box around it. Detection metrics like mAP and IoU were never meant to credit comprehension, so they do not. The benchmark is not wrong; it is measuring a narrower thing than the model now does, and the difference is where entire capability classes go uncounted.

Can one model handle detection and comprehension together?

TabPedia, a large vision-language model from ByteDance and USTC released in mid-2024, treats the four core visual table understanding tasks (table detection, table structure recognition, table querying, and table question answering) as one job for one model rather than four jobs for four specialist architectures.

The authors’ framing is blunt: previous methods “generally design task-specific architectures and objectives for individual tasks, resulting in modal isolation and intricate workflows” (TabPedia, arXiv:2406.01326). TabPedia instead abstracts every visual table understanding task and its multi-source visual embeddings as a unified “concept,” and its concept synergy mechanism lets perception and comprehension tasks share clues instead of running as separate pipelines.

The comprehension argument is the point of the unification. The authors built ComTQA precisely because comprehension-heavy evaluation was missing, and their reported experiments span both perception and comprehension benchmarks. When the task demands reasoning over real table structure, the paper argues, the specialist-plus-LLM stitching falls behind a model trained to perceive and comprehend together.

The detection story cuts the other way, and it is worth stating plainly. The TabPedia paper scopes its case to unified comprehension, not to beating specialist structure recognizers on narrow benchmarks. Unified models do not automatically win a detection race they were not built to run.

Why do detection metrics undercount unified models?

Detection metrics measure whether a box is in the right place. They are silent on whether the model understood what was inside it. That gap matters because the interesting failures have moved upstream of localization.
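
To make that concrete, here is a minimal sketch of intersection over union, the overlap score behind most detection metrics. The `Box` type and the example coordinates are assumptions made up for illustration; the point is only that the inputs are two rectangles and nothing about their contents.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box: (x1, y1) is the top-left corner, (x2, y2) the bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float


def area(box: Box) -> float:
    """Area of a box; inverted or empty boxes count as zero."""
    return max(0.0, box.x2 - box.x1) * max(0.0, box.y2 - box.y1)


def iou(pred: Box, truth: Box) -> float:
    """Intersection over union of two boxes, a value in [0, 1]."""
    overlap = Box(
        max(pred.x1, truth.x1),
        max(pred.y1, truth.y1),
        min(pred.x2, truth.x2),
        min(pred.y2, truth.y2),
    )
    intersection = area(overlap)
    union = area(pred) + area(truth) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical table region and a prediction shifted 10 px to the right.
truth = Box(0, 0, 100, 50)
pred = Box(10, 0, 110, 50)
score = iou(pred, truth)  # 4500 / 5500, about 0.82
```

Nothing in the function's signature can carry an answer to a question about the table, so a model that reads the cells correctly and one that does not receive the same score for the same box.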

The “True (VIS) Lies” study evaluated 16 frontier multimodal LLMs (15 open-weight models spanning roughly 12B to 1,000B parameters, plus OpenAI’s GPT-5.4) on detecting misleading visualizations across 2,336 COVID-19 tweets, supplemented by a visualization-expert user study. The task is not “find the chart.” It is to decide whether a chart is being used to deceive. An IoU score has nothing to say about that judgment, and a detection leaderboard will never rank models on it.

This is the undercounting mechanism in the wild. When a field measures perception as box-localization, models that improved on rhetoric, intent, and authorial purpose look flat, because the yardstick does not extend to those capabilities. The model did not stop improving. The benchmark stopped covering the thing it improved at. For anyone selecting models, that is a silent mis-ranking, and it biases selection toward detection-strong, reasoning-weak systems.

What replaces detection-only evaluation?

A wave of new benchmarks reorganizes evaluation around reasoning depth and cross-task structure rather than single-task accuracy. Three of them, each cited to a primary source, mark the direction.

ComTQA, released with TabPedia, is a table VQA benchmark of approximately 9,000 QA pairs (TabPedia, arXiv:2406.01326). It was designed to test comprehension in real-world scenarios, not detection, which is why the unified model’s case rests on it.

The Cognitive Compass framework aggregates 24 affective datasets into a three-tier cognitive hierarchy: Emotion Perception and Recognition, Emotion Understanding and Analysis, and Emotion Cognition and Reasoning. The point is to move MLLM evaluation past flat classification accuracy toward reasoning depth. The tiers are not cosmetic: tasks like OSA and FESD sit at Level 1 (perception), while reasoning tasks like sarcasm detection and emotion interpretation sit at Level 3. A flat accuracy leaderboard rolls all three tiers into one number and hides capability the model already has.
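
A small sketch of why the roll-up hides information, assuming per-item results tagged with one of the three tiers. The model names and outcomes below are invented for illustration and are not Cognitive Compass results.

```python
from collections import defaultdict

# Hypothetical per-item outcomes as (tier, answered_correctly).
# Tier 1 = perception, tier 2 = understanding, tier 3 = reasoning.
results = {
    "model_a": [(1, True), (1, True), (1, True), (1, True),
                (2, True), (2, True), (2, False),
                (3, True), (3, False), (3, False)],
    "model_b": [(1, True), (1, True), (1, False), (1, False),
                (2, True), (2, True), (2, False),
                (3, True), (3, True), (3, True)],
}


def flat_accuracy(rows):
    """One number over all items, ignoring which tier each item belongs to."""
    return sum(ok for _, ok in rows) / len(rows)


def accuracy_by_tier(rows):
    """Accuracy reported separately for each tier."""
    hits, totals = defaultdict(int), defaultdict(int)
    for tier, ok in rows:
        totals[tier] += 1
        hits[tier] += int(ok)
    return {tier: hits[tier] / totals[tier] for tier in sorted(totals)}


flat = {name: flat_accuracy(rows) for name, rows in results.items()}
tiered = {name: accuracy_by_tier(rows) for name, rows in results.items()}
```

In this toy data both models score 7 out of 10 on the flat measure, while the tiered view shows one of them is perfect on reasoning items and the other answers only a third of them.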

The Visual Document Retrieval survey self-describes as “the first comprehensive survey of the VDR landscape, specifically through the lens of the Multimodal Large Language Model (MLLM) era,” covering multimodal embedding models, multimodal rerankers, and retrieval-augmented and agentic integration for document intelligence. It reframes perception through the MLLM era rather than the detection era.

| Benchmark or framework | What it tests | What it replaces |
| --- | --- | --- |
| ComTQA (TabPedia) | Table VQA with a comprehension focus (~9k pairs) | Detection-only table metrics on cropped images |
| Cognitive Compass tiers | Three cognitive levels: perception, understanding, reasoning | Flat emotion-classification accuracy |
| VDR survey framing | Embedding, reranking, and RAG for document intelligence | Detection-centric document pipelines |

How should you pick a vision-language model now?

Weight benchmarks that match the failure mode you actually fear, and read detection scores as a floor rather than a ceiling.

If the task is narrow detection (find tables, find objects in a fixed domain), a specialist detector or a DETR-style pipeline is still the cheaper and often higher-scoring choice. The TabPedia paper scopes its contribution to unified comprehension, not to beating specialist detectors on narrow structure-recognition tasks. Treating unified LVLMs as universally superior is the mirror error of treating detection as the whole story.

If the task is comprehension over real-world structure (answer questions from a table image, judge whether a chart misleads, reason about layout), look at ComTQA, Cognitive Compass’s higher tiers, and document-retrieval results. Those are where unified LVLMs separate from detection pipelines, and where the undercounting was loudest.
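
One way to put "detection as a floor, comprehension as the ranking signal" into practice is sketched below. Everything in it is an assumption for illustration: the model names, the benchmark keys, the scores, the 0.75 floor, and the weights are placeholders you would replace with your own measurements and priorities.

```python
# Hypothetical scores in [0, 1]; these are not published benchmark results.
candidates = {
    "model_a": {"detection": 0.92, "table_qa": 0.55, "chart_judgment": 0.40},
    "model_b": {"detection": 0.81, "table_qa": 0.74, "chart_judgment": 0.66},
    "model_c": {"detection": 0.58, "table_qa": 0.80, "chart_judgment": 0.70},
}

# Assumed minimum acceptable detection score, and assumed weights that
# reflect which comprehension failures matter most for the task at hand.
DETECTION_FLOOR = 0.75
WEIGHTS = {"table_qa": 0.6, "chart_judgment": 0.4}


def shortlist(models, floor, weights):
    """Drop models below the detection floor, then rank by weighted comprehension."""
    passing = {name: s for name, s in models.items() if s["detection"] >= floor}
    scored = {
        name: sum(weight * s[key] for key, weight in weights.items())
        for name, s in passing.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


picks = shortlist(candidates, DETECTION_FLOOR, WEIGHTS)
```

With these placeholder numbers the model with the highest detection score ranks below a weaker detector that clears the floor and comprehends better, while the third model is excluded for failing the floor despite the best comprehension scores.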

Treat prompt framing as part of the evaluation. Cognitive Compass frames emotion tasks as Theory-of-Mind-guided reasoning rather than flat classification, so a model that looks weak under a generic prompt can behave differently under a structured one. The corollary cuts hard: a leaderboard that fixes one prompt is also undercounting, and it is undercounting in a direction you cannot predict without running the alternative.
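
A minimal sketch of checking prompt sensitivity before trusting a ranking, assuming you can score each candidate under at least two prompt styles. The model names, prompt labels, and numbers are invented for illustration and are not results from any benchmark.

```python
# Hypothetical accuracy (percent) per model under two prompt styles.
scores = {
    "model_a": {"generic": 61.0, "structured": 63.0},
    "model_b": {"generic": 59.5, "structured": 66.0},
    "model_c": {"generic": 60.0, "structured": 60.5},
}


def ranking(table, prompt):
    """Model names ordered best-first under a single prompt style."""
    return sorted(table, key=lambda name: table[name][prompt], reverse=True)


def prompt_spread(table):
    """Gap between each model's best and worst score across prompt styles."""
    return {
        name: max(by_prompt.values()) - min(by_prompt.values())
        for name, by_prompt in table.items()
    }


generic_order = ranking(scores, "generic")        # model_a, model_c, model_b
structured_order = ranking(scores, "structured")  # model_b, model_a, model_c
spread = prompt_spread(scores)
```

In this toy table the model ranked last under the generic prompt ranks first under the structured one, and the per-model spread shows which candidates are most sensitive to framing.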

Watch the dates. The strongest reframings are recent survey work (the VDR survey, March 2026; the MME-Survey, late 2024), and the model versions they test move fast. A benchmark’s relevance decays as architectures move past the assumptions it was built on. ComTQA, the Cognitive Compass tiers, and the VDR roadmap are current signals, but they are not permanent ones, and they will need refreshes on a six-to-twelve-month cycle.

The detection era is not over for detection. It is over as the default yardstick for models that have already moved past it.

Frequently Asked Questions

What encoder setup do unified table models use to merge perception and comprehension?

TabPedia pairs two vision encoders: ViT-L for global low-resolution features and Swin-B for fine-grained high-resolution features, connected through mediative tokens that let the perception and comprehension heads share clues. The TabPedia paper text alternates the spellings mediative and meditative for those tokens, a typo; the intended term is mediative.

How much can prompt framing swing a model’s score on these benchmarks?

On Cognitive Compass Level-1 perception tasks, Theory-of-Mind-style prompting lifted GPT-4o by up to 7.34 points on the OSA task and GPT-4.1 by up to 3.28 points on FESD. A leaderboard that fixes one prompt can reorder models by several positions, and the direction of that bias is not predictable without running the alternative prompt.

How do the MME-Survey and VDR survey reframe evaluation differently?

The MME-Survey redraws the measurement taxonomy horizontally into foundation capabilities, model self-analysis, and extended applications. The VDR survey takes a vertical cut instead, organizing one application domain around multimodal embedding models, multimodal rerankers, and retrieval-augmented or agentic integration. The first redefines what gets scored across MLLM tasks; the second redefines the retrieval stack for document intelligence.

Does the unified-comprehension claim extend to video or real-time perception?

The cited evidence sits in static, text-rich imagery: tables via ComTQA, emotion images via Cognitive Compass, documents via the VDR survey, and static charts via the True (VIS) Lies study. None of those surveys extend the unified-comprehension claim to video understanding, 3D scene perception, or real-time robotics, where specialist detectors and tracking pipelines still lead on their native benchmarks.
