Indexing Images for RAG: kapa.ai's Approach to Multimodal Retrieval

Technical documentation is full of images: architecture diagrams, pin configuration tables, annotated screenshots, circuit schematics. Standard RAG pipelines strip them out, index the text, and hope the answer lives in prose. kapa.ai published its production image-indexing methodology on June 1, 2026, with enough cost and accuracy data to let practitioners weigh the tradeoff: pay once at ingestion to caption images, or pay per query to stuff raw images into the prompt.

Why Text-Only RAG Misses Half the Signal

Across kapa.ai’s customer base, images were cited in generated answers on 10% to 64% of queries depending on the project, according to the company’s writeup. For hardware documentation, kapa.ai reports 99%+ answer accuracy and 30%+ support ticket deflection for semiconductor customers working with register maps, pin configurations, and code examples. Image-dependent queries are therefore not edge cases. For a question about a peripheral, the register map is the answer; the surrounding text is commentary. If the retrieval pipeline discards the image, the answer degrades in ways that are hard to diagnose, because the system never signals that it is missing something.

The problem is structural. Conventional RAG chunking splits on text boundaries. Images are either stripped entirely or passed through OCR, which captures labels and axis text but loses spatial relationships, color coding, and the layout of a block diagram. A flowchart with six nodes and eight edges carries information in the edges. OCR gives you six labels.
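To make the loss concrete, here is a minimal sketch of a hypothetical six-node, eight-edge flowchart. The node names are invented for illustration. The edge list can answer "what does the gateway call?", while the set of labels that OCR would roughly recover cannot.

```python
# Hypothetical flowchart: six labeled nodes joined by eight directed edges.
EDGES = [
    ("Client", "Gateway"),
    ("Gateway", "Auth"),
    ("Gateway", "Router"),
    ("Auth", "Router"),
    ("Router", "Service"),
    ("Service", "Database"),
    ("Database", "Service"),
    ("Service", "Gateway"),
]

# Roughly what OCR recovers: the node labels, with no record of the arrows.
OCR_LABELS = {name for edge in EDGES for name in edge}


def downstream(edges, node):
    """Return the nodes that `node` points to directly."""
    return {dst for src, dst in edges if src == node}


print(len(OCR_LABELS), "labels,", len(EDGES), "edges")
# Answerable from the edge list, not from OCR_LABELS alone.
print(sorted(downstream(EDGES, "Gateway")))
```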

The Query-Time Vision Trap

The naive fix is to send images alongside text at query time. This works in demos. It falls apart in production.

kapa.ai’s measurements show that raw images increased per-query cost by 27% on GPT 5.1 and 51% on Claude 4.6 Sonnet as of their June 2026 benchmarks. At high retrieval volumes, queries approach model context limits. Cost and reliability both degrade, and the degradation is nonlinear because larger contexts trigger more expensive compute tiers and higher latency.

The economic argument for query-time multimodal is strongest when image retrieval is rare. The moment it becomes routine, which it does for technical documentation, the cost profile inverts. You are paying a vision-model tax on every query, including the ones where the image adds nothing.

kapa.ai’s Index-Time Pipeline: Filter, Caption, Store

kapa.ai’s approach shifts the vision-model work to ingestion. The pipeline has three stages.

Filter. A zero-shot classifier filters junk images (decorative icons, spacer GIFs, low-information screenshots) using multimodal embeddings. The classifier is conservative, erring toward keeping images rather than discarding them. That is the right tradeoff for retrieval, but it means noise still enters the pipeline.
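A keep-biased zero-shot filter of this kind can be sketched as follows. This is an illustration under stated assumptions, not kapa.ai's implementation: the tiny three-dimensional vectors stand in for real multimodal embeddings, and the prompt vectors, the `margin` parameter, and the function names are all hypothetical. An image is discarded only when it resembles a "junk" prompt clearly more than a "keep" prompt.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_junk(image_vec, junk_prompt_vecs, keep_prompt_vecs, margin=0.1):
    """Zero-shot decision: compare the image to text-prompt embeddings.

    The margin biases the filter toward keeping: an image is dropped only
    if its best junk score beats its best keep score by more than `margin`.
    """
    junk_score = max(cosine(image_vec, v) for v in junk_prompt_vecs)
    keep_score = max(cosine(image_vec, v) for v in keep_prompt_vecs)
    return junk_score - keep_score > margin


# Toy stand-ins for embeddings of prompts such as "a decorative icon"
# and "a technical diagram" (hypothetical values).
JUNK_PROMPTS = [[1.0, 0.0, 0.0]]
KEEP_PROMPTS = [[0.0, 1.0, 0.0]]

icon = [0.9, 0.1, 0.0]          # clearly junk-like
diagram = [0.1, 0.9, 0.0]       # clearly informative
ambiguous = [0.55, 0.5, 0.1]    # slightly junk-like, inside the margin

for name, vec in [("icon", icon), ("diagram", diagram), ("ambiguous", ambiguous)]:
    print(name, "-> discard" if is_junk(vec, JUNK_PROMPTS, KEEP_PROMPTS) else "-> keep")
```

The ambiguous image is kept even though it leans slightly toward the junk prompt, which is the conservative behavior described above and also the path by which noise enters the pipeline.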

Caption. A cheap vision model generates a text caption for each surviving image. The key finding: caption quality depends more on surrounding text context than on model size. A secondary summary of kapa.ai’s findings reports that GPT-4 mini produced near-identical caption quality to models four times its price. The temptation is to throw the largest vision model at every image; the data suggests the marginal return on model capacity is small when the surrounding document context is rich.
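One way to act on that finding is to pass the text around the image to the captioning model. The sketch below only assembles the prompt text; the function name, the prompt wording, and the `max_context_chars` budget are assumptions for illustration, and the call to an actual vision model is deliberately left out.

```python
def build_caption_prompt(text_before: str, text_after: str,
                         alt_text: str = "", max_context_chars: int = 1500) -> str:
    """Assemble a captioning prompt that includes surrounding document text.

    Keeps the text closest to the image: the tail of what precedes it and
    the head of what follows it, splitting the character budget evenly.
    """
    half = max_context_chars // 2
    before = text_before[-half:] if half else ""
    after = text_after[:half]
    parts = [
        "Describe the attached image for a documentation search index.",
        "Use the surrounding text to name components precisely.",
        f"Text before the image:\n{before}",
        f"Text after the image:\n{after}",
    ]
    if alt_text:
        parts.append(f"Alt text: {alt_text}")
    return "\n\n".join(parts)


prompt = build_caption_prompt(
    text_before="Pin PA5 drives the SPI clock. The pinout is shown below.",
    text_after="Connect PA6 to MISO before enabling the peripheral.",
    alt_text="Pinout diagram",
)
print(prompt)
```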

Store. Captions are stored as separate text chunks rather than inlined into the parent document’s text stream. This turned out to matter more than expected.
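A minimal sketch of what "separate chunks" could look like as a data structure follows. The `Chunk` fields, the ID scheme, and the example document are hypothetical; the point is only that each caption becomes its own retrievable record linked back to its parent document, rather than being spliced into a text chunk.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Chunk:
    chunk_id: str
    doc_id: str                      # link back to the parent document
    kind: str                        # "text" or "image_caption"
    text: str
    image_url: Optional[str] = None  # set only for caption chunks


def build_chunks(doc_id: str, sections: List[str],
                 captioned_images: List[Tuple[str, str]]) -> List[Chunk]:
    """Emit text sections and image captions as independent chunks."""
    chunks = [
        Chunk(f"{doc_id}#text-{i}", doc_id, "text", section)
        for i, section in enumerate(sections)
    ]
    chunks += [
        Chunk(f"{doc_id}#img-{i}", doc_id, "image_caption", caption, image_url=url)
        for i, (url, caption) in enumerate(captioned_images)
    ]
    return chunks


chunks = build_chunks(
    "spi-guide",
    ["The SPI bus uses four signals.", "Configure the clock divider first."],
    [("img/spi-timing.png", "Timing diagram: chip select falls before the first clock edge.")],
)
for chunk in chunks:
    print(chunk.chunk_id, chunk.kind)
```

Because the caption is its own chunk, a retriever can return it without pulling in the parent section, and the parent section can be returned without the caption.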

Production Metrics: Cost, Accuracy, and Placement

The headline numbers from kapa.ai’s experiments:

Per-query overhead: 1% to 6% versus text-only baselines, depending on project and model.
Caption placement accuracy: 94% to 99% across projects.
Image citation rate: 10% to 64% of queries, depending on documentation density.
Answer preference: an LLM judge preferred answers with image context by a statistically significant margin (McNemar’s test, p < 0.05) across three customer projects and two models.
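For readers unfamiliar with McNemar’s test, the sketch below implements the exact (binomial) form on paired outcomes. It assumes each query yields a paired binary result for the two systems, for example "judged acceptable" or not, which is an assumption about the setup rather than something the source spells out. The counts in the example are invented, not kapa.ai’s data.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: pairs where only system A succeeded; c: pairs where only system B
    succeeded. Concordant pairs carry no information and are ignored.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts: text-only won 5 pairs, image-aware won 20.
p_value = mcnemar_exact(5, 20)
print(f"p = {p_value:.4f}, significant at 0.05: {p_value < 0.05}")
```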

The separate-chunks-versus-inline comparison is the most architecturally relevant finding. kapa.ai’s benchmarks show that inline captions raised per-query cost by 19% with GPT, versus 6% for separate chunks. Separate chunks are cheaper because the retrieval system can surface them independently, only when the caption itself is relevant. Inline captions travel with their parent chunk and bloat the context window with information a given query may not need.

The Caption-Fidelity Risk

This is the part that does not show up in benchmark tables.

A bad text chunk degrades recall in obvious ways: the wrong document ranks, the answer is incomplete, or the system returns a low-confidence refusal. These are detectable failure modes. A bad image caption degrades recall silently. If the caption misdescribes a diagram, labeling a feedback loop as a feedforward path, the retrieval system confidently returns the wrong image, the generator confidently describes the wrong architecture, and the user gets a plausible but incorrect answer. There is no obvious signal that the failure originated in the caption rather than in retrieval or generation.

kapa.ai acknowledges this indirectly. The classifier’s degraded performance on ambiguous images, per their own measurements (59.8% accuracy, versus 96.8% on clear-cut ones), means some fraction of indexed captions describe images that resist reliable classification in the first place.

The risk scales with documentation type. API reference docs with annotated screenshots are forgiving; the caption mostly needs to identify which UI element is shown. Architecture diagrams and circuit schematics are not. A caption that gets the direction of a data flow wrong poisons every downstream query that retrieves it.

This is the real tradeoff that kapa.ai’s data implies but does not fully name: indexing images shifts cost from query time to ingestion time, but it also shifts the fidelity requirement from “does the model see the image?” to “did we describe it correctly?” The first is a binary check. The second is a quality gate with no automated feedback loop.

How This Fits the 2026 Multimodal Landscape

kapa.ai’s caption-and-index approach is one of four architectures now in production use. Big Data Boutique’s 2026 survey identifies the others: unified embeddings (encoding text and images into the same vector space), page-as-image retrieval with late interaction (the ColPali family), and hybrid late-fusion that combines multiple retrieval strategies.

Architecture	Ingestion cost	Query cost	Recall dependency	Complexity
Caption-and-index (kapa.ai)	Medium (vision model per image)	Low (text-only retrieval)	Caption fidelity	Low
Unified embeddings	High (joint encoder per chunk)	Medium (vector similarity)	Embedding quality	Medium
ColPali / page-as-image	Low (embed page image)	High (late interaction scoring)	Page layout fidelity	High
Hybrid late-fusion	High (multiple pipelines)	Medium (fusion logic)	Fusion weighting	High

kapa.ai’s data gives practitioners concrete benchmarks for the first row. The 1% to 6% overhead figure, the 94% to 99% placement accuracy, and the separate-versus-inline cost comparison are the kind of production numbers that make architecture decisions tractable. They are also specific to kapa.ai’s domain, which is technical documentation with structured images, and should not be generalized to medical imaging or satellite data without independent validation.

The broader lesson is economic. Query-time vision is a variable cost that scales with traffic. Index-time captioning is a fixed cost that scales with corpus size. For documentation corpora that change infrequently and serve high query volumes, the fixed-cost model wins. For corpora that change constantly and serve low query volumes, the calculus reverses. kapa.ai’s numbers confirm the intuition; they do not change it.

Frequently Asked Questions

How does corpus size shift the economics of index-time captioning?

kapa.ai indexes knowledge bases holding millions of images, amortizing the one-time vision-model cost across millions of future queries. For a corpus of a few thousand images serving under a hundred daily queries, that amortization period stretches to months. The payback then hinges on reindexing frequency: a daily-updated API changelog may never break even, because each rebuild reruns the vision model on changed images before the prior batch has paid for itself.
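The payback reasoning can be written as simple arithmetic. The sketch below is a simplified model with invented prices: the per-caption cost, the per-query saving, and the daily churn are all hypothetical inputs, not figures from kapa.ai.

```python
import math


def break_even_days(num_images: int, cost_per_caption: float,
                    queries_per_day: float, saving_per_query: float,
                    changed_images_per_day: float = 0.0) -> float:
    """Days until the up-front captioning cost is recovered.

    Daily saving from cheaper queries is reduced by the daily cost of
    re-captioning changed images. Returns infinity if the net is not positive.
    """
    upfront = num_images * cost_per_caption
    net_daily = (queries_per_day * saving_per_query
                 - changed_images_per_day * cost_per_caption)
    if net_daily <= 0:
        return math.inf
    return upfront / net_daily


# Hypothetical small corpus: 5,000 images, 80 queries a day.
static_docs = break_even_days(5000, 0.002, 80, 0.001)
churning_docs = break_even_days(5000, 0.002, 80, 0.001, changed_images_per_day=50)
print(f"static corpus: {static_docs:.0f} days")
print(f"churning corpus: {churning_docs}")
```

With these made-up inputs the static corpus pays back in roughly four months, while the churning corpus never does, because daily re-captioning costs more than the daily query saving.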

Which multimodal RAG architecture are production teams actually shipping?

A 2026 Big Data Boutique survey found that most production deployments use unified embeddings or hybrid late-fusion, not caption-and-index or ColPali. Unified embeddings avoid the caption-fidelity problem by encoding text and images into a shared vector space, but require a joint encoder per chunk, raising ingestion compute. ColPali’s page-as-image approach attracts research interest but its late-interaction scoring carries query-time cost that limits production adoption.

What accuracy does the junk-image classifier achieve, and where does it fail?

The zero-shot classifier reaches 96.8% accuracy (F1 0.974) on clear-cut images but drops to 59.8% on ambiguous ones. Because kapa.ai tunes the classifier to err on the side of keeping images, ambiguous images that pass the filter are still captioned, and those lower-confidence captions enter the retrieval index without any quality gate to flag them as unreliable.

How many retrieved images does it take to hit model payload limits?

kapa.ai reports that typical queries retrieve 20 to 30 images, with a long tail past 130. The ceiling that matters here is request payload size rather than token count: Claude’s API caps payloads at 30 MB and OpenAI’s at 50 MB. A single high-resolution architecture diagram can consume several megabytes, so a query retrieving 30 or more images routinely approaches these ceilings, making query-time vision unreliable precisely on the queries that need it most.
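A rough way to estimate how many images fit is shown below. It assumes images are sent inline as base64, which inflates each file by about one third, and that the limit applies to the encoded request; both are assumptions to verify against the provider’s documentation. The 1 MB average image size is a hypothetical input.

```python
import math


def max_images_in_payload(limit_mb: float, avg_image_mb: float,
                          text_overhead_mb: float = 0.0) -> int:
    """Estimate how many base64-encoded images fit under a payload limit."""
    encoded_mb = avg_image_mb * 4 / 3  # base64 emits 4 bytes per 3 input bytes
    return math.floor((limit_mb - text_overhead_mb) / encoded_mb)


# Limits as quoted above; average image size is a made-up example.
for provider, limit in [("Claude", 30), ("OpenAI", 50)]:
    print(provider, max_images_in_payload(limit, avg_image_mb=1.0))
```

Under these assumptions a 1 MB average puts the ceiling inside the typical 20-to-30-image range for the smaller limit, well short of the long tail past 130.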
