1. Introduction
Reasoning-intensive retrieval [
1,
2,
3] has become central to information systems built around large language models. Users no longer issue keyword queries: they paste error logs, attach circuit schematics, screenshot a chart from a paper, or photograph a whiteboard, and expect the system to retrieve a document that addresses the underlying question. This multimodal phrasing is now the norm in technical question answering, scientific search, and AI assistants embedded in productivity tools, and it requires the retriever to perform genuine cross-modal reasoning rather than surface-level matching [
4,
5].
MM-BRIGHT [
3] is a benchmark of 2803 real-world reasoning-intensive queries spanning 29 technical domains, of which multimodal-to-text—
multimodal query (text + image) to text document— is the headline multimodal-retrieval task. The benchmark exposes a striking failure mode of current multimodal encoders: across seven state-of-the-art dense vision–language retrievers, the best model (Nomic-Vision [
6]) reaches only 27.6 nDCG@10 on multimodal-to-text, with the runner-up Jina-CLIP [
7] at 23.0, GME-7B [
8] at 22.0, GME-2B at
, and CLIP [
9] and BGE-VL [
10] at
and
respectively [
3]. Adding visual signal does not help retrieval here—it actively introduces noise. Abdallah et al. [
3] further observe that even augmenting strong text retrievers with vision-language captions can
degrade performance: a reasoning-enhanced retriever loses up to 12.0
nDCG@10 points when its query is concatenated with a generated image caption. The question is therefore not how to make the visual encoder bigger or the caption longer, but how to introduce the image into the retrieval pipeline at all without contaminating the textual signal that current retrievers rely on.
We argue the bottleneck is the
substrate, not the encoder. Existing multimodal retrievers route the query image through a dense visual encoder whose output must compete with text in a single shared embedding space, and existing caption-augmentation pipelines splice an unconstrained natural-language description into a query string that the retriever was never trained to consume. Both choices send noisy signal to a model that has only one place to put it. Recent results in chart understanding [
11,
12,
13], scientific document parsing [
14], and chart question answering [
15,
16] demonstrate that, for images with regular structure, vision–language models can extract that structure into clean typed text—table cells, LaTeX formulas, code, node–edge graphs—which is competitive with the original image for downstream reasoning. We adopt this insight as a retrieval primitive:
convert the image into typed structured text and then run text retrieval, rather than encoding the image into a dense vector that must be matched against text, or attaching a free-form caption that derails the reasoning retriever.
We present
VISA (code: https://github.com/HarnessLab/VISA-Agent, accessed on 9 June 2026;
Figure 1), a multimodal retrieval agent that re-casts multimodal-to-text as a text-retrieval problem over three parallel streams. A multimodal query $(q, x)$, with text $q$ and image $x$,
is dispatched by a Vision LLM in three roles via separate prompts: a zero-shot
router that classifies the image
x into up to three parser types from a fixed taxonomy of nine (
chart,
circuit,
equation,
screenshot,
code,
figure,
diagram,
map,
photograph); typed
parsers, one per chosen type, each extracting the image into structured text $s_t$; and a holistic
captioner producing a natural-language description
c. The agent then constructs three text streams—the raw query (Stream
A), the query augmented with the parsed symbolic content (Stream
B), and the query augmented with the holistic caption (Stream
C)—scores each with a single frozen
retrieval LLM (a 4B-parameter decoder-only embedding model), and fuses the per-document scores via Reciprocal Rank Fusion [
17] or a confidence-weighted linear combination conditioned on the router’s output. The whole agent contains no trainable parameters: a single Vision LLM (
Qwen3-VL [
18]) served via vLLM [
19] answers different prompts to fill all three roles, and the retrieval LLM is used unchanged. Concretely,
Figure 1 reads left to right: the query image first enters the
router (top), which emits up to three typed parser calls; each typed
parser returns structured text that is concatenated into the symbolic block
S and cached by image hash, while the
captioner returns a holistic description
c; the raw query,
S, and
c then form the three streams
A,
B,
C that are independently scored by the frozen retrieval LLM and merged by the fusion block on the right into the final document ranking.
On all 29
MM-BRIGHT multimodal-to-text domains,
VISA achieves
32.4 nDCG@10, an absolute improvement of
+4.8 over the strongest dense multimodal encoder (Nomic-Vision, 27.6),
+9.4 over Jina-CLIP (23.0),
+10.4 over GME-7B (22.0), and more than triple the score of the weakest dense vision–language retriever evaluated by Abdallah et al. [
3]. Crucially,
VISA is the
first system to make the multimodal-to-text score
exceed the strongest dense visual baselines on
MM-BRIGHT without using a multimodal encoder at all: every retrieval call in our pipeline is a plain text-side query. Per-domain analysis (
Section 5) shows
VISA maintains substantial margins over the multimodal baselines across STEM and software domains, where image content is structure-heavy.
The hardest multimodal-to-text domains are precisely those whose query images are
structured: charts in physics and bioinformatics, equations in mathematics, code or terminal screenshots in cryptography and software domains, maps in geographic-information systems. Holistic captions verbalise these images at a level of abstraction too coarse for retrieval (“a line chart showing emissions over time”); the answering documents contain the actual axis values, the actual variable names, the actual function signatures.
VISA’s typed parsers extract those literal tokens, so the retrieval LLM can match them directly against the corpus. For images that are not structured (natural photographs in travel, gaming, or religious-history queries), the router falls back to the
photograph parser, whose output reduces to a generic caption—by construction
VISA cannot perform worse than the existing caption baseline, and in particular avoids the nDCG@10 collapse (up to 12.0 points) that Abdallah et al. [
3] report when reasoning-aware text retrievers are augmented with raw image captions.
What is new relative to prior multimodal retrievers and caption-augmentation pipelines is not a bigger encoder or a better caption, but a different retrieval substrate: VISA is the first training-free agent to replace the dense multimodal embedding with typed symbolic text and to keep every retrieval call text-side, and the first to turn a vision LLM into a routed parser bank rather than an encoder or a single captioner. Concretely, our contributions are:
We identify dense visual embedding as the substrate-level bottleneck in reasoning-intensive multimodal retrieval and propose
symbolic grounding—parsing the query image into typed structured text—as an alternative substrate that sidesteps this bottleneck without sacrificing image content (
Section 3).
We instantiate this substrate as
VISA, a Vision LLM agent that performs zero-shot image-type routing over a fixed taxonomy of nine parser prompts and dispatches typed parsers in parallel, composing their outputs with a holistic caption into three text streams scored by a single frozen 4B retrieval LLM (
Section 3).
We propose a per-query confidence-weighted linear fusion scheme that adjusts the contribution of the symbolic and caption streams according to the router’s confidence in a structure-friendly type (
Section 3.9).
We evaluate
VISA on all 29
MM-BRIGHT multimodal-to-text domains and report nDCG@10 of
32.4, beating the strongest dense multimodal encoder by +4.8 absolute. We further provide ten ablations isolating the contribution of each pipeline component, a parser leave-one-out matrix, and a linear-weight sensitivity sweep (
Section 5 and
Section 6).
VISA is training-free, model-agnostic, and deployable on top of any existing text-retrieval stack: it adds reasoning-intensive multimodal capability without training or hosting a multimodal encoder, its per-image parsing is computed once and cached so warm-cache queries cost only three text encodes, and each typed parser prompt can be swapped for a specialist model (e.g., DePlot, Nougat) without touching the rest of the pipeline.
Section 2 surveys reasoning-intensive multimodal retrieval, multimodal retrievers, and image-to-structure parsing.
Section 3 formalises
VISA.
Section 4 describes the experimental setup and
Section 5 reports the headline result.
Section 6 isolates the contribution of each component,
Section 7 discusses limitations and future work, and
Section 8 concludes.
2. Related Work
Early retrieval benchmarks such as
BEIR [
20] and
MTEB [
21] target zero-shot generalisation under surface-level relevance, where lexical or shallow-semantic overlap between query and document is sufficient. The benchmarks closest to ours impose two simultaneous constraints—genuine reasoning beyond keyword overlap, and multimodal queries that combine text with images.
BRIGHT [
2] and
RAR-b [
5] establish the reasoning side of this problem in the text-only regime; multimodal benchmarks such as
ViDoRe,
UNIIR,
MMEB, and
MRMR [
22] introduce the modality side but largely focus on surface visual–textual alignment.
MM-BRIGHT [
3] is the first benchmark to combine both: 2803 reasoning-intensive queries spanning 29 StackExchange-style technical domains, with multimodal-to-text (
Query+Image → Documents) the headline retrieval configuration we target. The benchmark is curated such that the relevant documents require multi-step inference rather than literal token overlap, making it a strict generalisation of
BRIGHT into the multimodal regime.
ARK [
23] confirms the same pattern under a different annotation protocol: visual reasoning under retrieval pressure remains an open problem.
Bi-encoder retrievers grounded in DPR [
24] have driven most progress on
BEIR and
MTEB, with subsequent encoders such as
E5-Mistral [
25],
GritLM [
26], instruction tuning, and multi-task training. Under the reasoning pressure of
BRIGHT and
MM-BRIGHT, however, these retrievers underperform their
MTEB numbers by large margins.
ReasonIR [
27] and a recent line of 4B-parameter decoder-only retrievers [
28] address this by training on reasoning-annotated pairs with chain-of-thought augmentation and hard negatives. Throughout this paper we treat the choice of base
retrieval LLM as orthogonal to the contribution of
VISA: any text retriever can be substituted, and the rest of the pipeline is unchanged. We report all results with the same frozen 4B-parameter retrieval LLM applied identically to each of the three streams.
A line of work scales contrastively pre-trained vision–language encoders [
7,
9,
10] to the retrieval setting, with
Nomic Embed Vision [
6],
VLM2Vec [
29], and
GME [
8] extending the substrate to instructed multimodal embeddings. On
MM-BRIGHT multimodal-to-text, the strongest of these models (Nomic-Vision) achieves 27.6 nDCG@10, with the rest clustered between
and
[
3]. These models excel on shallow image–text alignment benchmarks but, as observed on
MM-BRIGHT, their visual signal is added as noise rather than evidence under the reasoning constraint. Two findings from Abdallah et al. [
3] sharpen this picture and directly motivate
VISA. First, multimodal retrievers underperform on the
essential-image subset of multimodal-to-text, where the image carries information not derivable from the text, which is exactly the setting where visual signal should help. Second, even augmenting strong text-only retrievers with vision–language captions can degrade performance: a reasoning-aware retriever loses up to 12.0
nDCG@10 points when its query is concatenated with a generated caption. This shows that the problem is not solved by “encoding the image more carefully” or by “describing the image more naturally”; the substrate of dense multimodal embedding and free-form caption-augmentation is itself the bottleneck.
VISA sidesteps this regime entirely: it never encodes the image into a vector, and it never splices a free-form caption directly onto the retrieval query as the only image-derived signal. The image is converted into typed structured text consumed by a text retrieval LLM, with the holistic caption retained only as one stream of three under explicit fusion.
Table 1 summarises these representative multimodal-retrieval approaches, contrasting their representation substrate, advantages, and limitations on reasoning-intensive multimodal-to-text retrieval with those of
VISA.
LLM rerankers [
30,
31] consistently improve over dense retrieval alone;
RankGPT and BracketRank [
4,
32] introduced listwise reasoning, and
Rank1 [
33] and
RankR1 [
34] added chain-of-thought reasoning and reinforcement-learned ranking. Query expansion [
35,
36,
37] and agentic retrieval pipelines [
38,
39] improve recall by issuing LLM-generated subqueries and interleaving retrieval with chain-of-thought. These rerankers and agentic pipelines, however, still operate downstream of a (multimodal) dense retriever and therefore inherit the encoder-bottleneck behaviour observed on
MM-BRIGHT multimodal-to-text.
VISA is complementary to all of them: it eliminates the multimodal encoder altogether and produces purely textual streams, so any such reranker can operate on top of
VISA’s outputs in a future combination.
A separate research thread has built specialist models for converting visual structure into text:
DePlot [
13] and
MatCha [
11] for chart-to-table translation,
Nougat [
14] for academic-document OCR producing LaTeX, and chart-question-answering benchmarks such as
ChartQA [
15] and
PlotQA [
16] that motivated this line. These models are typically used as components of document-QA or summarisation pipelines; we are not aware of prior work that uses typed structural parsers as a
retrieval substrate. Where existing work asks
what is in this chart?, we ask
which document in a 30k-passage corpus answers a question that mentions this chart?, and show that converting the image into clean structured text once, caching the result, and then running text retrieval is markedly stronger than encoding the same image into a dense multimodal vector. The approach is also
parser-agnostic: each typed parser prompt in our taxonomy can be replaced one-for-one with a specialist model such as
DePlot for charts or
Nougat for equations, without altering the rest of the pipeline.
Table 1.
Representative multimodal-retrieval approaches, their representation substrate, advantages, and limitations on reasoning-intensive multimodal-to-text retrieval. Existing methods inject the image either as a dense vector or as a free-form caption;
VISA instead converts it into typed symbolic text and keeps retrieval entirely text-side. nDCG@10 figures are the
MM-BRIGHT multimodal-to-text numbers of Abdallah et al. [
3].
| Method/Family | Substrate | Advantage | Limitation |
|---|---|---|---|
| CLIP, SigLIP [9] | Contrastive dual encoder | Strong shallow image–text alignment; cheap inference | Visual vector competes with text; collapses under reasoning (≤10.8) |
| Jina-CLIP [7], BGE-VL [10] | Contrastive VL encoder | Multilingual/KG-augmented embeddings | Same dense-vector bottleneck; mid-range (10.0–23.0) |
| GME-2B/7B [8], VLM2Vec [29] | Instructed multimodal LLM embedding | Unified instructed embedding; scales with model size | Needs multimodal training; visual signal added as noise |
| Nomic-Vision [6] | Contrastive VL encoder (expanded latent) | Strongest dense baseline (27.6) | Single shared vector; no symbolic grounding |
| Caption augmentation [3] | Free-form caption ⊕ query | No multimodal encoder required | Caption text derails the reasoning retriever (up to −12.0) |
| LLM reranking [32,33] | Listwise/reasoning reranker over a base retriever | Strong reranking gains; chain-of-thought reasoning | Operates downstream of a dense multimodal retriever; inherits its bottleneck |
| VISA (ours) | Typed symbolic text streams | Training-free; text-side only; literal-token match; cacheable (32.4) | Relies on parser quality for unstructured photographs |
Several recent systems augment retrieval with episodic memory [
40], case-based RAG [
41], or multimodal episode indexing [
42]. We do not maintain such memory:
VISA’s only persistent state is a parsed-content cache keyed by image hash, used to avoid recomputing parser outputs across runs. Episodic factor memory of this kind remains compatible with
VISA and is left to future work.
A growing body of 2025–2026 work reinforces our central premise that
explicit structure, rather than a larger dense encoder, is what stabilises vision under reasoning or matching pressure. Ma et al. [
43] enhance a vision–language model with structured spatio-temporal data for traffic-scene understanding, showing that injecting structured side information sharpens visual understanding beyond what raw visual features provide. Zhu et al. [
44] improve image–text matching through multi-level semantic consistency alignment, decomposing text into hierarchical semantic levels—a structuring of the
text side that is complementary to our structuring of the
image side. Li et al. [
45] exploit a known gravity direction as a geometric prior to make point-cloud registration outlier-robust, another instance of a structural prior buying robustness in a vision task. These works target scene understanding, image–text matching, and 3D registration respectively; none re-casts the query image as a text-side symbolic substrate consumed by a frozen reasoning retriever, which is the specific contribution of
VISA to reasoning-intensive multimodal retrieval.
3. Method
VISA is a multimodal retrieval agent that re-casts the multimodal-to-text problem as text retrieval over three parallel streams. A multimodal query —text q and image x—is processed by a single Vision LLM acting in three distinct roles via separate prompts (a zero-shot router, a typed parser per chosen type, and a holistic captioner). The resulting symbolic and caption signals, together with the raw query, form three text streams that are scored against the document corpus by a single frozen retrieval LLM and fused into a final ranking. The whole agent is inference-only.
3.1. Problem Setup and Notation
Let $\mathcal{D} = \{d_1, \dots, d_N\}$ denote the text-only document corpus of a target MM-BRIGHT domain, where $N$ is the number of passages in that domain. A multimodal-to-text query is a pair $(q, x)$, where $q$ is the natural-language question and $x$ is one query image. (When a query has multiple attached images we use the first one as the primary input; the multi-image extension is straightforward and orthogonal to our contribution.) The goal is to produce a ranking of $\mathcal{D}$ that maximises a relevance metric (nDCG@10) against the gold relevance judgments released with the benchmark.
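We denote the parser-type taxonomy by $\mathcal{T}$:
$$\mathcal{T} = \{\texttt{chart}, \texttt{circuit}, \texttt{equation}, \texttt{screenshot}, \texttt{code}, \texttt{figure}, \texttt{diagram}, \texttt{map}, \texttt{photograph}\}.$$
We partition the taxonomy as
$$\mathcal{T} = \mathcal{T}_{\mathrm{struct}} \cup \{\texttt{screenshot}\} \cup \{\texttt{photograph}\},$$
where
$$\mathcal{T}_{\mathrm{struct}} = \{\texttt{chart}, \texttt{circuit}, \texttt{equation}, \texttt{code}, \texttt{figure}, \texttt{diagram}, \texttt{map}\}.$$
The set $\mathcal{T}_{\mathrm{struct}}$ contains the seven structure-friendly types whose parser prompts produce machine-readable artefacts, such as tables, LaTeX, code, and node–edge graphs. The fallback type $\texttt{photograph}$ produces a generic caption-style description, while $\texttt{screenshot}$ is treated as a mixed type whose parser yields verbatim visible text but no formal structure.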
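The partition above can be written down directly. The following is a minimal sketch; the constant and function names (`TAXONOMY`, `STRUCTURED`, `kind`) are illustrative and not taken from the released code.

```python
# Nine parser types named in the text; all identifiers here are illustrative.
TAXONOMY = ("chart", "circuit", "equation", "screenshot", "code",
            "figure", "diagram", "map", "photograph")
FALLBACK = "photograph"   # generic caption-style output
MIXED = "screenshot"      # verbatim visible text, no formal structure
STRUCTURED = frozenset(t for t in TAXONOMY if t not in (FALLBACK, MIXED))


def kind(parser_type: str) -> str:
    """Return which part of the partition a parser type belongs to."""
    if parser_type not in TAXONOMY:
        raise ValueError(f"unknown parser type: {parser_type!r}")
    if parser_type == FALLBACK:
        return "fallback"
    if parser_type == MIXED:
        return "mixed"
    return "structured"
```

Keeping the partition as data rather than as branches in the router makes it easy to check that the three parts are disjoint and cover all nine types.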
3.2. Pipeline Overview
VISA executes five stages per query (Figure 1):
1. Routing. The Vision LLM classifies $x$ into up to three parser types from $\mathcal{T}$ with self-reported confidences (Section 3.3).
2. Parsing. For each chosen type $t$, a typed prompt is issued to the same Vision LLM, producing structured text $s_t$ (Section 3.4).
3. Symbolic block construction. The per-type outputs are concatenated into a single symbolic block $S$ (Section 3.5).
4. Stream construction and retrieval. Three text streams $A$, $B$, $C$ are built from $q$, $S$, and a holistic caption $c$, and each is scored independently by the retrieval LLM (Sections 3.7 and 3.8).
5. Fusion. Per-document scores from the three streams are combined via Reciprocal Rank Fusion or confidence-weighted linear fusion to produce the final ranking (Section 3.9).
We denote the Vision LLM by $\mathcal{V}$ and the retrieval LLM by $\mathcal{R}$. Both are frozen pre-trained models and receive no fine-tuning at any stage of VISA; the only persistent state is a parsed-content cache keyed by image hash and parser type (Section 3.10).
3.3. Vision LLM Router
The router is a single zero-shot call to the Vision LLM $\mathcal{V}$ with a fixed classification prompt $p_{\mathrm{route}}$ (shown verbatim in Figure 2, top) that asks for a JSON list of up to three (type, confidence, reason) triples drawn from the taxonomy $\mathcal{T}$. Formally, the router is given by Equation (1),
$$\rho(x) = \mathrm{parse}\big(\mathcal{V}(x, p_{\mathrm{route}})\big) = \{(t_i, \gamma_i, r_i)\}_{i=1}^{k}, \qquad k \le 3, \tag{1}$$
where each $t_i \in \mathcal{T}$, each $\gamma_i \in [0, 1]$ is a self-reported confidence clipped to the unit interval, and $\mathrm{parse}(\cdot)$ extracts the JSON array, validates the type vocabulary, and clips numerical confidences. We denote by $\gamma^{\star}$ the top-1 router confidence, used downstream for fusion-weight modulation. If the parsed list is empty (an LLM-side parsing failure), the router falls back deterministically to a single photograph entry with a fixed default confidence, which guarantees that Stream B never collapses to an undefined state and that VISA reduces, in the worst case, to caption-level evidence plus the raw query.
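The validation and fallback behaviour of the router's post-processing step can be sketched as below. This is a minimal illustration, not the released implementation: the function name, the regular expression used to locate the JSON array, the sorting by confidence, and in particular the value of `FALLBACK_CONFIDENCE` are assumptions.

```python
import json
import re

TAXONOMY = ("chart", "circuit", "equation", "screenshot", "code",
            "figure", "diagram", "map", "photograph")
FALLBACK_CONFIDENCE = 0.5  # hypothetical default; not a value from the paper


def parse_router_output(raw: str, max_types: int = 3) -> list[tuple[str, float, str]]:
    """Extract (type, confidence, reason) triples from raw Vision LLM text."""
    entries = []
    match = re.search(r"\[.*\]", raw, flags=re.DOTALL)  # first '[' to last ']'
    if match:
        try:
            payload = json.loads(match.group(0))
        except json.JSONDecodeError:
            payload = []
        for item in payload:
            if not isinstance(item, dict) or item.get("type") not in TAXONOMY:
                continue  # reject entries outside the type vocabulary
            try:
                conf = float(item.get("confidence", 0.0))
            except (TypeError, ValueError):
                continue
            conf = min(1.0, max(0.0, conf))  # clip to the unit interval
            entries.append((item["type"], conf, str(item.get("reason", ""))))
    entries.sort(key=lambda e: e[1], reverse=True)  # top-1 first (assumed order)
    entries = entries[:max_types]
    if not entries:
        # Deterministic fallback so the symbolic stream is always defined.
        entries = [("photograph", FALLBACK_CONFIDENCE, "fallback")]
    return entries
```

With this shape, a reply that wraps the JSON in extra prose, reports a confidence above 1, or names an unknown type is repaired or rejected entry by entry instead of failing the whole query.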
3.4. Parser Toolkit
We define a family of parser prompts $\{p_t\}_{t \in \mathcal{T}}$ that share the router's input format but enforce typed output schemata: for example, $p_{\texttt{equation}}$ requests LaTeX, named variables, and domain conditions; $p_{\texttt{code}}$ requests language, signatures, and verbatim source; and so on. For each chosen type $t$ from the router output, we issue the typed parser of Equation (2),
$$s_t = \mathcal{V}(x, p_t), \tag{2}$$
calling the same Vision LLM $\mathcal{V}$ in parallel across the chosen types. The number of parallel parser calls per query is $k \le 3$.
Figure 2 shows the exact prompts used: the zero-shot router prompt $p_{\mathrm{route}}$ (top) and, as a representative typed parser, the equation parser prompt (bottom). All nine parser prompts and the captioner prompt are released with the code at
https://github.com/HarnessLab/VISA-Agent (accessed on 9 June 2026).
3.5. Symbolic Block Construction
The per-type structured outputs $s_{t_1}, \dots, s_{t_k}$ of the $k$ chosen types are composed into a single symbolic block by concatenating type-tagged segments, as in Equation (3):
$$S = \bigoplus_{i=1}^{k} \big( \texttt{[}\mathrm{UP}(t_i)\texttt{]} \oplus s_{t_i} \big), \tag{3}$$
where ⊕ denotes string concatenation and $\mathrm{UP}(t)$ writes the type name in upper-case (e.g., [CHART], [FIGURE]). The type tags act as soft markers that the retrieval LLM can attend to, and they help the per-stream analysis disentangle which parser type contributed which tokens. When $k = 0$ (e.g., $x$ is missing or the router fails entirely), $S$ degenerates to the empty string and Stream B collapses to the raw query.
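A minimal sketch of the type-tagged concatenation follows. The newline separator between tag and content, and the function name, are assumptions; the text only specifies upper-case type tags and string concatenation.

```python
def build_symbolic_block(parsed: dict[str, str]) -> str:
    """Concatenate type-tagged parser outputs; no outputs gives ''."""
    segments = [
        f"[{parser_type.upper()}]\n{text.strip()}"  # e.g. "[CHART]\n..."
        for parser_type, text in parsed.items()
        if text.strip()  # skip parsers that returned nothing
    ]
    return "\n".join(segments)
```

Because an empty input yields the empty string, a caller can detect the degenerate case and fall back to the raw query for the symbolic stream.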
3.6. Caption Provider
The third stream uses a holistic image caption $c$. In our experiments $c$ is the caption field released with MM-BRIGHT [3], ensuring direct comparability with caption-augmented baselines reported in the original benchmark paper. VISA does not depend on this specific caption source; substituting the output of an in-house captioning prompt for $c$ leaves the rest of the pipeline unchanged (Section 6).
3.7. Stream Construction
VISA builds three parallel text streams per query, defined in Equations (4)–(6):
$$q_A = q, \tag{4}$$
$$q_B = q \oplus \delta_S \oplus S, \tag{5}$$
$$q_C = q \oplus \delta_c \oplus c, \tag{6}$$
where $\delta_S$ and $\delta_c$ denote the symbolic-content and caption delimiters, respectively. These short strings announce the visual evidence to the retrieval LLM. Stream A is the unmodified text query, a safety anchor intended to keep VISA from regressing below text-only retrieval. Stream B injects the typed symbolic content extracted from $x$. Stream C injects the holistic caption.
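The three streams can be assembled with plain string concatenation, as in the sketch below. The two delimiter strings are hypothetical placeholders chosen for illustration; the actual delimiters are part of the released prompts.

```python
SYMBOLIC_DELIM = "\n\nParsed image content:\n"  # hypothetical delimiter
CAPTION_DELIM = "\n\nImage caption:\n"          # hypothetical delimiter


def build_streams(query: str, symbolic: str, caption: str) -> dict[str, str]:
    """Return the query text for streams A, B and C."""
    return {
        "A": query,  # raw query, untouched
        # An empty symbolic block or caption collapses the stream to the raw query.
        "B": query + SYMBOLIC_DELIM + symbolic if symbolic else query,
        "C": query + CAPTION_DELIM + caption if caption else query,
    }
```

Stream A never depends on any Vision LLM output, which is what lets it act as the text-only anchor in the fusion step.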
3.8. Per-Stream Retrieval
Let $\mathcal{R}$ be the retrieval LLM, producing $d$-dimensional embeddings ($\mathcal{R}$ is shared between queries and documents; in our implementation it is a frozen 4B-parameter decoder-only retrieval LLM, but the pipeline is agnostic to this choice). Document embeddings are pre-computed once per domain and cached to disk following Equation (7):
$$\mathbf{e}_j = \mathcal{R}(d_j) \in \mathbb{R}^{d}, \qquad j = 1, \dots, N. \tag{7}$$
For each stream $s \in \{A, B, C\}$, the agent encodes the stream's realisation $q_s$ of the query and scores every document by inner product (Equation (8)):
$$\sigma_s(d_j) = \langle \mathcal{R}(q_s), \mathbf{e}_j \rangle. \tag{8}$$
A single multimodal query therefore induces three sets of per-document scores $\sigma_A$, $\sigma_B$, $\sigma_C$. Document embeddings are encoded once and reused across streams, so the cost of the three-stream construction is dominated by three query encodes per query.
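The per-stream scoring reduces to inner products between one query embedding per stream and the cached document embeddings. The sketch below assumes the embeddings have already been produced by some text encoder (which is not modelled here) and uses plain Python lists to stay dependency-free; a real system would use a vectorised matrix product.

```python
def dot(u: list[float], v: list[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score_streams(stream_embeddings: dict[str, list[float]],
                  doc_embeddings: list[list[float]]) -> dict[str, list[float]]:
    """Score every cached document embedding against each stream's query embedding."""
    return {
        stream: [dot(query_vec, doc_vec) for doc_vec in doc_embeddings]
        for stream, query_vec in stream_embeddings.items()
    }
```

The document embeddings are passed in once and reused for all three streams, mirroring the caching described above.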
3.9. Score Fusion
The three per-stream score vectors are fused into a single ranking. We support two fusion strategies that switch under a single CLI flag.
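The default robust setting is RRF [17]. Let $r_s(d_j)$ denote the rank of document $d_j$ in stream $s$ (best is 1). Then, as in Equation (9),
$$\mathrm{RRF}(d_j) = \sum_{s \in \{A, B, C\}} \frac{1}{\kappa + r_s(d_j)}, \tag{9}$$
with the smoothing constant $\kappa$ set following standard practice. RRF is parameter-light, robust to score-magnitude differences across streams, and ignores the router's confidence signal entirely.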
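Reciprocal Rank Fusion over the three streams can be sketched as follows. The default `k=60` is the value conventionally used for RRF and is an assumption here, as are the function names; ties within a stream are broken by document index.

```python
def ranks(scores: list[float]) -> dict[int, int]:
    """1-based rank of each document index under descending score."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return {doc: rank for rank, doc in enumerate(order, start=1)}


def rrf(per_stream_scores: dict[str, list[float]], k: int = 60) -> list[float]:
    """Fuse per-stream scores by summing 1 / (k + rank) over streams."""
    n_docs = len(next(iter(per_stream_scores.values())))
    fused = [0.0] * n_docs
    for scores in per_stream_scores.values():
        for doc, rank in ranks(scores).items():
            fused[doc] += 1.0 / (k + rank)
    return fused
```

Only ranks enter the fused score, so a stream whose raw scores sit on a different scale cannot dominate the others.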
When the router signal is reliable, we can do better by routing
trust towards the stream most aligned with the image type. Let
denote the per-stream min–max-normalised score (Equation (
10)), and let
be base weights with
(default
). The per-query weights are conditioned on the router payload via Equation (
11):
where the shifts
are tied to the top-1 router type
and confidence
as defined in Equation (
12):
with
and
in our experiments. When the router is highly confident the image is parseable into structure (charts, equations, code, …), Stream
B is up-weighted at the expense of the holistic caption stream; when the image is photograph-like, the holistic caption is trusted more. The textual anchor
A keeps a fixed minimum weight under all conditions, ensuring
VISA never falls below the text-only retrieval baseline. The fused score is given by Equation (
13):
The final ranking is obtained by sorting documents in descending order of
and returning the top
K.
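One possible reading of the confidence-weighted scheme is sketched below. It is an illustration under stated assumptions, not the paper's exact rule: the base weights `(0.4, 0.3, 0.3)`, the shift sizes `alpha` and `beta`, and the mapping from router type to shift are all hypothetical placeholders. What it preserves from the description is that scores are min–max normalised per stream, that weight moves from Stream C to Stream B for structure-friendly types and in the opposite direction for photographs, and that the weight of Stream A is never changed.

```python
STRUCTURED = {"chart", "circuit", "equation", "code", "figure", "diagram", "map"}


def minmax(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def stream_weights(top_type: str, top_conf: float,
                   base=(0.4, 0.3, 0.3), alpha=0.2, beta=0.2) -> dict[str, float]:
    """Shift weight between streams B and C; all numeric defaults are hypothetical."""
    w_a, w_b, w_c = base
    if top_type in STRUCTURED:
        shift = alpha * top_conf       # trust the symbolic stream more
    elif top_type == "photograph":
        shift = -beta * top_conf       # trust the holistic caption more
    else:
        shift = 0.0                    # mixed type: keep base weights
    shift = max(-w_b, min(w_c, shift))  # keep both weights non-negative
    return {"A": w_a, "B": w_b + shift, "C": w_c - shift}


def linear_fuse(per_stream_scores: dict[str, list[float]],
                weights: dict[str, float]) -> list[float]:
    """Weighted sum of min-max-normalised per-stream scores."""
    norm = {s: minmax(v) for s, v in per_stream_scores.items()}
    n_docs = len(next(iter(norm.values())))
    return [sum(weights[s] * norm[s][j] for s in norm) for j in range(n_docs)]
```

Because the shift is added to one stream and subtracted from the other, the weights keep summing to the same total for every query, so fused scores stay comparable across queries.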
3.10. Caching and Compute Footprint
Vision LLM calls are the dominant per-query cost. We cache parser outputs persistently, keyed by $(\mathrm{hash}(x), t)$, with the router payload cached separately under the synthetic key $(\mathrm{hash}(x), \texttt{\_\_router\_\_})$ and namespaced by the router strategy so that ablation runs (Section 6) do not collide. Re-runs and stream-level ablations therefore incur only retrieval and fusion cost, not Vision LLM cost.
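The cache layout can be illustrated with a small in-memory sketch. The class name, the use of SHA-256 as the image hash, and the `namespace` argument are assumptions; the described cache is persistent, whereas this sketch keeps entries in a dictionary for brevity.

```python
import hashlib

ROUTER_KEY = "__router__"  # synthetic key for the router payload


class ParseCache:
    """In-memory stand-in for a persistent cache keyed by (image hash, parser type)."""

    def __init__(self):
        self._store = {}
        self.misses = 0

    @staticmethod
    def image_hash(image_bytes: bytes) -> str:
        # SHA-256 is an assumed choice of content hash.
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_compute(self, image_bytes: bytes, key: str, compute,
                       namespace: str = "default"):
        """Return the cached value, calling compute() only on a miss."""
        full_key = (namespace, self.image_hash(image_bytes), key)
        if full_key not in self._store:
            self.misses += 1
            self._store[full_key] = compute()
        return self._store[full_key]
```

Passing a different `namespace` for each router strategy keeps ablation runs from reading each other's entries.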
The retrieval LLM document embeddings are computed once per domain and serialised to disk. Across the 29 domains the combined corpus exceeds 1.4M passages; encoding it once and reusing the cache across the headline run, all ablations, and the linear-weight sweep keeps the marginal cost of an ablation roughly equal to one mini-batch of $O(|Q|)$ query encodes plus the fusion arithmetic, where $|Q|$ is the number of queries in the domain.
Let $n_{\mathcal{V}}$ denote the (uncached) Vision LLM calls and $n_{\mathcal{R}}$ the retrieval LLM query-encoder calls. For an image $x$ on which the router selects $k \le 3$ parser types and the caption is supplied externally, the per-query call counts follow Equation (14),
$$n_{\mathcal{V}} = 1 + k, \qquad n_{\mathcal{R}} = 3, \tag{14}$$
on a cold cache, and zero Vision LLM calls plus $n_{\mathcal{R}} = 3$ on a warm cache. Total Vision LLM token budget per cold-cache query is approximately 300 tokens of routing output and up to $3 \times 900 = 2700$ tokens of parser output, well within standard serving limits. With a 27B-parameter Vision LLM served on vLLM [19], this translates to roughly 5–10 s per cold query and sub-second per warm query in our setting.
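As a worked check of these counts, the arithmetic can be written out as below. The function names are illustrative; the figures of 300 routing tokens and 900 tokens per parser are the budget stated in the text, and the caption is assumed to be supplied externally.

```python
def per_query_calls(num_parser_types: int, warm_cache: bool) -> dict[str, int]:
    """Vision LLM calls and retrieval query encodes for one query."""
    vision_calls = 0 if warm_cache else 1 + num_parser_types  # router + parsers
    return {"vision_llm": vision_calls, "retrieval_encodes": 3}  # streams A, B, C


def cold_token_budget(num_parser_types: int,
                      router_tokens: int = 300,
                      parser_tokens: int = 900) -> int:
    """Upper bound on Vision LLM output tokens for one cold-cache query."""
    return router_tokens + num_parser_types * parser_tokens
```

For the maximum of three parser types this gives four Vision LLM calls and a bound of 3000 output tokens on a cold cache, and no Vision LLM calls at all on a warm cache.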
It is worth making explicit how the cost of generating multiple symbolic text streams compares with a traditional dense multimodal embedding pipeline. Both paradigms share the dominant cost: a one-time corpus encode of
$N$ passages ($O(N)$ encoder passes, computed once and cached) and an $O(Nd)$ nearest-neighbour scan at query time. A dense multimodal retriever then adds one multimodal-encoder forward pass per query (image+text fused into a single vector).
VISA instead adds, on a cold cache, one router call and up to three parser calls to the Vision LLM (bounded by ≈300 + 3 × 900 output tokens, Equation (
14)) plus three lightweight text query-encodes; on a
warm cache the Vision LLM cost is zero and the only query-time overhead over single-stream text retrieval is two extra query-encodes and a constant-time fusion step. Because parser outputs are cached per unique image hash, the Vision LLM cost is paid once per distinct image and then amortised across all re-runs, every ablation, and the linear-weight sweep. The trade-off is therefore a small constant factor of cheap
text-side encodes and a one-time, fully-cacheable parsing pass, in exchange for (i) eliminating the multimodal encoder entirely—no multimodal model is trained, fine-tuned, or even hosted at retrieval time—and (ii) the substantial accuracy gain reported in
Section 5. In contrast to dense pipelines,
VISA never re-encodes the corpus with a multimodal model and never pays multimodal inference for repeated images.
3.11. Algorithm
Algorithm 1 summarises the per-domain procedure executed once at evaluation time. The outer loop over queries is the only concurrency dimension; within each query, the parser calls in line 2b run in parallel via asynchronous I/O against the Vision LLM server, and the three retrieval calls in line 2e reuse the same cached document embeddings.
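| Algorithm 1: VISA retrieval for a single MM-BRIGHT domain. |
Input: multimodal queries $\{(q_i, x_i)\}_{i=1}^{M}$, document corpus $\mathcal{D} = \{d_j\}_{j=1}^{N}$, captions $\{c_i\}_{i=1}^{M}$. Models: Vision LLM $\mathcal{V}$, retrieval LLM $\mathcal{R}$. Output: per-query top-K ranking.
1. Pre-compute (and cache) $\mathbf{e}_j = \mathcal{R}(d_j)$ for all j. (Equation (7))
2. For each query $(q_i, x_i)$:
(a) Load the router payload $\rho(x_i)$ from cache, else compute $\mathrm{parse}(\mathcal{V}(x_i, p_{\mathrm{route}}))$. (Equation (1))
(b) For each chosen type $t$ with cache miss, compute $s_t = \mathcal{V}(x_i, p_t)$ in parallel and persist. (Equation (2))
(c) Compose $S_i$ from the type-tagged outputs $s_t$. (Equation (3))
(d) Build the streams $q_A, q_B, q_C$ from $(q_i, S_i, c_i)$. (Equations (4)–(6))
(e) For each $s \in \{A, B, C\}$: score $\sigma_s(d_j) = \langle \mathcal{R}(q_s), \mathbf{e}_j \rangle$ for all j. (Equation (8))
(f) Fuse $\sigma_A, \sigma_B, \sigma_C$ to obtain the fused ranking via Equation (9) or Equations (11)–(13).
3. return the top-K ranking for every query. |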
3.12. Design Discussion
Three design choices are worth flagging. (i)
No multimodal encoder is used at any stage. Both the per-stream retrieval and the caching are text-only—the Vision LLM is the only place the image is consumed, and only to produce text. This eliminates the dense multimodal-embedding bottleneck identified in the introduction. (ii)
The router and parsers share the same Vision LLM. The parser-prompt taxonomy is therefore extensible without retraining: a specialist substitution (DePlot [13], Nougat [14]) is a single-class swap. (iii)
Stream A is a structural anchor. By construction, removing Streams
B and
C recovers a single-stream text-only retrieval baseline, so
VISA cannot underperform that baseline by more than the score-normalisation noise of the fusion function—a property we verify empirically in
Section 6.
3.13. Robustness to Routing Errors and Prompt Design
A natural concern is that VISA depends on a zero-shot router and on hand-written parser prompts: how does it behave when the router mis-classifies an image or a prompt is sub-optimal? VISA is engineered so that such errors degrade performance gracefully rather than catastrophically, through four mechanisms.
(i) A lower-bounding text anchor. Stream
A (the raw query) is always present and keeps a fixed minimum weight under all router outputs (Equations (11)–(12)). Even a completely wrong router cannot drag the fused ranking below text-only retrieval by more than the normalisation noise of the fusion function, so routing errors cannot trigger the nDCG@10 collapse (up to 12.0 points) that afflicts caption-augmented queries.
(ii) Redundant top-3 dispatch. The router emits up to three types rather than committing to one, so a single mis-classification is usually accompanied by a correct co-dispatched type, and the retrieval LLM attends to whichever symbolic tokens actually match the corpus. The top-1 ablation (
Section 6.3,
nDCG@10 vs. top-3) quantifies exactly this redundancy: removing it is the single largest router-side degradation.
(iii) Deterministic fallback. If the router returns an empty or malformed payload, the pipeline falls back deterministically to a single photograph entry (Section 3.3), so Stream B never collapses to an undefined state; in the worst case
VISA reduces to caption evidence plus the raw query. Router outputs are additionally validated—the JSON is parsed, the type vocabulary is checked against
, and confidences are clipped to
—so malformed generations are rejected rather than propagated.
(iv) Selection, not coverage, is what matters. Two ablations test whether the gain comes from typed selection or merely from invoking parsers at all. Disabling the router and firing all nine parsers (all 9) actually hurts () because off-target parser outputs contaminate the symbolic block, and a deterministic-random router matches it almost exactly (). The gain therefore comes from the router’s selective typed choice, and over-dispatching (the failure mode a noisy router would induce) is explicitly penalised by the pipeline rather than rewarded.
Finally, because all Vision LLM calls use temperature 0, parser outputs are deterministic and reproducible (which is also what makes hash-keyed caching sound). Because the taxonomy is parser-agnostic, any individual prompt can be replaced by a specialist model without altering the rest of the pipeline, which decouples overall robustness from the wording of any single prompt.
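The validation and fallback behaviour described in mechanism (iii) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the payload shape (a JSON list of `{"type", "confidence"}` objects), the clipping range [0, 1], the fallback confidence of 1.0, and the names `TYPE_VOCAB` and `validate_router_output` are all assumptions. The nine-type vocabulary is partly guessed: `map` and `diagram` are placeholders for types the text does not name explicitly.

```python
import json

# Hypothetical nine-type vocabulary. "map" and "diagram" are guesses; the
# remaining names appear in the text.
TYPE_VOCAB = frozenset({
    "equation", "circuit", "screenshot", "code", "chart",
    "figure", "map", "diagram", "photograph",
})
FALLBACK_TYPE = "photograph"


def _fallback():
    # Assumed fallback payload: a single photograph entry with confidence 1.0.
    return [{"type": FALLBACK_TYPE, "confidence": 1.0}]


def validate_router_output(raw, top_k=3):
    """Parse, validate, and clip a router payload; fall back if nothing survives."""
    try:
        payload = json.loads(raw)
    except (TypeError, ValueError):  # JSONDecodeError is a ValueError
        return _fallback()
    if not isinstance(payload, list):
        return _fallback()

    valid = []
    for entry in payload:
        if not isinstance(entry, dict):
            continue
        type_name = entry.get("type")
        conf = entry.get("confidence")
        # Reject unknown types and non-numeric confidences instead of propagating them.
        if not isinstance(type_name, str) or type_name not in TYPE_VOCAB:
            continue
        if isinstance(conf, bool) or not isinstance(conf, (int, float)):
            continue
        clipped = min(1.0, max(0.0, float(conf)))  # assumed range [0, 1]
        valid.append({"type": type_name, "confidence": clipped})

    # Keep at most top_k types, highest confidence first.
    valid.sort(key=lambda e: e["confidence"], reverse=True)
    return valid[:top_k] or _fallback()
```

Under these assumptions an empty list, unparseable text, and a payload containing only unknown types all yield the same single photograph entry, so the downstream symbolic stream always receives a well-defined input.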
5. Results
5.1. Main Result
Across all 29 MM-BRIGHT multimodal-to-text domains, VISA achieves 32.4 nDCG@10, an absolute improvement of +4.8 over the strongest dense multimodal encoder (Nomic-Vision,
),
+9.4 over Jina-CLIP,
+10.4 over GME-7B, and a roughly
improvement over BGE-VL (
) and SigLIP (
).
Table 2 reports the full per-domain breakdown alongside every multimodal-to-text baseline evaluated by Abdallah et al. [3].
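For reference, the metric reported throughout can be sketched as below. This is the textbook nDCG@k with linear gain and a log2 position discount; whether the benchmark's evaluation code uses linear or exponential gain is not stated here, and the function names are illustrative. With binary relevance labels the two gain variants coincide.

```python
import math


def dcg(gains):
    # Discounted cumulative gain: the item at 0-based position i is discounted by log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k for one query.

    ranked_ids: document ids in retrieved order.
    relevance:  dict mapping relevant document ids to a positive gain.
    """
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A single relevant document retrieved at rank 2 scores 1/log2(3), roughly 0.63, which shows how quickly the metric penalises small rank displacements near the top of the list.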
5.2. Where the Gains Concentrate
The per-domain breakdown in Table 2 sharpens the headline into three patterns.
VISA delivers its biggest wins on salesforce (, over the best baseline GME-7B), askubuntu (, over Nomic-Vision), gaming (, over Jina-CLIP), travel (, over Nomic-Vision), and pm (, over GME-7B). What these domains share is that their query images are highly structured artefacts—UI screenshots, admin-console settings, item-stat tables, and route maps—whose contents the typed parsers can transcribe verbatim into Stream B, giving the retrieval LLM exact tokens to match against the corpus.
On STEM domains VISA matches or modestly beats the best dense multimodal model in seven of nine cases. The two losses (bioinformatics: vs. Nomic-Vision; math: vs. Nomic-Vision) coincide with images dominated by complex multi-panel figures whose layout exceeds the parser schema.
For economics, psychology, law, and christianity VISA wins by to points. These domains have queries where the image is supporting context (a graph illustrating an argument, a portrait accompanying a historical question); Stream A alone already gets close to baseline scores, and the symbolic stream provides a small but consistent additional signal.
VISA loses to a dense baseline by more than 2 nDCG@10 in only one domain (bioinformatics, ); two close losses occur on apple ( vs. Nomic-Vision) and robotics (). In these two close losses the queries are dominated by photographic content (product photos, robot images), for which the dense visual encoders are well-tuned and the structured parser yield is small.
5.3. Qualitative Analysis: Where VISA Wins and Loses
To make the structured-vs-unstructured distinction concrete, we summarise the representative patterns behind VISA’s largest wins and its three losses, read off the per-domain results of Table 2 together with the parser type the router selects for those images.
The wins concentrate where the query image is recoverable structure that the typed parsers can linearise into literal tokens present in the answering documents:
salesforce (, over GME-7B; over Nomic-Vision): admin-console screenshots are routed to the screenshot/code parsers, which transcribe field names, setting paths, and menu labels verbatim; these strings match the configuration-documentation passages almost lexically, whereas a single dense vector blurs them.
gaming (, over Jina-CLIP): item-stat tables and inventory panels are routed to the chart/screenshot parsers, recovering the exact stat names and numeric values that the wiki answer pages enumerate.
askubuntu (, over Nomic-Vision): terminal and error-dialog screenshots are parsed into verbatim commands and error strings, which are high-precision retrieval keys.
The losses are exactly the mirror image—images whose information is not recoverable as typed structure, where a dense encoder tuned to such imagery retains the advantage:
bioinformatics (, vs. Nomic-Vision): queries are dominated by dense multi-panel composite figures (e.g., alignment mosaics beside phylogenetic trees) whose layout exceeds any single typed schema; the router fires figure/chart but the parser cannot serialise the cross-panel relationships, so Stream B contributes little while the dense encoder still aligns the whole image.
apple (, vs. Nomic-Vision): product photographs carry no parseable structure, so the router falls back to photograph and Stream B degenerates to a caption-like list — precisely the regime in which the dense visual encoders are strongest.
math (, vs. Nomic-Vision): most equation images are captured well by the equation parser, but a few multi-line derivations spill beyond its single-expression schema, leaving a small residual gap.
The common rule is that VISA’s margin tracks how much of the image is typed, serialisable structure; when that fraction is high it wins decisively, and when it approaches zero the text anchor (Stream A) and the photograph fallback keep it within a point or two of the best dense model rather than collapsing.
6. Ablations
We isolate the contribution of every
VISA component through four families of ablations: (i) stream leave-out (
Section 6.1, configurations
–
), (ii) fusion mode (
Section 6.2,
–
), (iii) router strategy (
Section 6.3,
–
), and (iv) parser leave-one-out (
Section 6.4,
–
). Together they test which design choices in
VISA carry the headline gain over dense multimodal retrieval and which are nearly free hyperparameter knobs. All twenty-one ablations re-use the parser cache populated by the headline run, so the marginal cost of each row in the tables below is one retrieval pass plus one fusion arithmetic step per domain. Every aggregate value reported below is read directly from the
aggregated.NDCG@10 field of that configuration’s summary.json; we never recompute aggregates by averaging the per-domain metrics.json files.
The headline result is recovered only by the full pipeline at nDCG@10, and every one of the twenty-one ablations – scores below it. The drops range from (dropping the photograph fallback parser) to (router top-1 with the symbolic and caption streams still active but the parser dispatch capped to a single type). VISA is therefore not dominated by any single component; the gain over dense multimodal retrieval comes from the joint action of router, parsers, streams, and fusion.
6.1. Stream Leave-Out
Table 3 reports the aggregate nDCG@10 for every non-empty subset of
. The full three-stream pipeline beats every subset by a clear margin.
No single stream and no two-stream pair exceeds
nDCG@10, yet the three-stream union scores
, a clear super-additive jump of
over the best two-stream variant. This contrasts with naive caption-augmentation pipelines, where mixing visual signal into the query text often degrades performance on reasoning-aware retrievers [3]. Because VISA keeps the three signals on separate retrieval channels and fuses only the score vectors, evidence from the typed symbolic content is combined with the holistic caption and the raw query without polluting the retrieval LLM’s input.
Stream A alone clears the previous-best dense multimodal baseline () by points, confirming that text-only retrieval over the multimodal queries is already a strong baseline. Stream B alone () almost matches Nomic-Vision but loses to the text baseline A — symbolic content is informative but cannot substitute for the question text it summarises. Stream C alone () is intermediate. The closeness of A, C, , and (all in ) shows that any single image-derived signal behaves as a flat ceiling once one is present; only adding the second image-derived signal recovers the headline.
Removing B alone (configuration , ) costs nDCG@10, the second-largest single drop in the stream table after removing both A and C. The symbolic stream is therefore not redundant with the caption; it carries information the holistic caption fails to surface.
6.2. Fusion Mode
Table 4 compares the two fusion strategies of
Section 3.9. Reciprocal Rank Fusion drops
nDCG@10 against the default confidence-weighted linear fusion, and a uniform linear fusion (no router-confidence shift) drops
. The router-confidence weighting in Equations (11) and (12) is therefore load-bearing: it accounts for roughly 2 points of headline gain.
The router-confidence shift up-weights Stream B when the router is confident the image is structure-friendly, and up-weights Stream C when it is photograph-like (Equation (12)). RRF’s confidence-agnostic rank merging does not exploit this signal, and equal-weight linear fusion treats all images alike, so both fall short of the adaptive default.
6.3. Router Strategy
We replace the Vision-LLM router with several alternatives: top-k caps, an all-parsers-fire baseline that skips routing entirely, and a deterministic-random baseline that picks
types per image seeded by
.
Table 5 shows the result.
The router behaves as a structured classifier whose value comes from both which types it selects and how many are dispatched. Capping at top-1 loses nDCG@10—a single typed parser is not enough to cover the visual content of most queries. Top-2 recovers over top-1, and top-3 (the default) recovers another ; beyond that, dispatching all nine parsers (all 9) actively hurts by because off-target parser outputs contaminate the symbolic block. The deterministic-random baseline matches the all-9 strategy almost exactly (), confirming that without typed selection the parsers are no better than randomly chosen text augmentations. The specific identity of the typed parsers chosen by the Vision LLM is therefore load-bearing, not just whether typed parsing is invoked.
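The top-k cap and the deterministic-random baseline can be sketched as below. This is an assumption-laden illustration: the seed used in the paper is not recoverable from the text, so the sketch derives one from a SHA-256 digest of a hypothetical image key, and the vocabulary reuses the type names mentioned in the text plus two placeholders (`map`, `diagram`).

```python
import hashlib
import random

# Hypothetical type vocabulary; "map" and "diagram" are placeholders.
VOCAB = ("equation", "circuit", "screenshot", "code", "chart",
         "figure", "map", "diagram", "photograph")


def top_k_types(router_output, k):
    """Keep the k highest-confidence types from (type, confidence) pairs."""
    ranked = sorted(router_output, key=lambda pair: pair[1], reverse=True)
    return [type_name for type_name, _ in ranked[:k]]


def random_types(image_key, vocab=VOCAB, k=3):
    """Pick k distinct types pseudo-randomly, but reproducibly per image.

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and would therefore change between runs.
    """
    digest = hashlib.sha256(image_key.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return rng.sample(sorted(vocab), k)
```

Because the random baseline is seeded per image, repeated runs dispatch the same parsers for the same image, which keeps it compatible with a parser cache while still removing any dependence on image content.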
6.4. Parser Leave-One-Out
Table 6 drops each parser type from the taxonomy and re-runs the full pipeline. Every parser registers a measurable contribution;
equation is the most important (
when dropped),
photograph the least (
).
Unlike the previous-generation result where most leave-one-out drops were silent, the corrected ablation makes the contribution of each parser visible: dropping any one of the nine parsers degrades performance by between and nDCG@10. The taxonomy is not over-parameterised—there is no parser whose removal can be absorbed by the others.
The four largest drops (equation, circuit, screenshot, code) are all heavily structure-friendly types from , with parser outputs (LaTeX, node–edge graphs, verbatim source, UI-element trees) that have no equivalent in a holistic caption. The smallest drop comes from photograph: when the image is unstructured, the caption stream C already covers most of the signal, and the photograph parser’s object-list description is partially redundant with it.
Mapped onto the four domain categories of Table 2, this ordering directly reflects the structured-vs-unstructured composition of each category: the high-impact structure-friendly parsers (equation, circuit, screenshot, code, chart) are the ones that fire in the STEM & Life Sciences and Software & Technical Systems categories, where
VISA’s per-category average margin over the strongest dense baseline is largest (
and
over Nomic-Vision, respectively, in
Table 2); whereas the photograph parser, whose removal costs the least, is the one that dominates the more natural-image queries in the Applied and Humanities categories, where the caption stream already carries the signal. The parser leave-one-out contribution is therefore a parser-type-resolved restatement of the structured-vs-unstructured pattern seen across categories in the main table.
6.5. Summary of Findings
The twenty-one ablations – support three clean conclusions about the source of VISA’s headline improvement.
Every component matters. All twenty-one ablations score below the full pipeline, with drops ranging from to . VISA’s headline gain is the joint product of router, parsers, streams, and fusion—no single component is responsible for it, and removing any of them costs measurable performance.
Streams combine super-additively. The strongest single stream scores
and the full three-stream union scores
—a
super-additive jump. Confidence-weighted linear fusion (Equations (
11)–(
12)) is responsible for most of this gain: replacing it with RRF or with equal weights costs
and
respectively.
Router top-3 is the right operating point. Capping the router below three types loses up to , dispatching all nine parsers loses , and random dispatch loses . The router’s specific typed selection at is therefore not a free hyperparameter but a load-bearing component of the pipeline.
7. Limitations and Future Work
VISA converts a query image into typed symbolic text, so its advantage is, by design, tied to how much of the image is recoverable structure. This framing also delimits where the method is weakest.
(i) A central question is how VISA behaves when symbolic patterns are limited or absent, as with ordinary photographs of objects, scenes, or people in the apple and robotics domains. Here the router falls back to the photograph type and Stream B degenerates to a generic caption-style description, so VISA carries no structural advantage over a contrastive visual encoder that was trained precisely on such imagery. Two safety mechanisms nonetheless prevent failure: the raw-query anchor (Stream A) lower-bounds performance at text-only retrieval, and the caption stream (Stream C) is kept on a separate fusion channel rather than spliced into the query. As a result, VISA stays close to, but does not exceed, the strongest dense encoder on natural-image domains, and avoids the nDCG@10 collapse of naive caption augmentation. In other words, the current design degrades gracefully on unstructured content but does not yet turn it into a win. Closing that gap is the most important direction for future work, e.g., a hybrid scheme that adds a learned dense visual stream specifically when the router’s structure confidence is low and fuses it with the symbolic and caption streams under the same confidence-weighted scheme.
(ii) Composite layouts. Multi-panel or densely composited figures (the bioinformatics loss) exceed any single typed schema; our parsers serialise within a type but do not yet capture cross-panel relationships.
(iii) Single primary image. We use the first attached image per query; principled multi-image (and video) fusion is left to future work.
(iv) Dependence on a capable Vision LLM. Routing and parsing assume a sufficiently strong vision model; although the redundancy, fallback, and validation mechanisms of Section 3.13 make the pipeline tolerant of individual errors, very small vision models may degrade parse quality.
(v) Self-reported confidence. Fusion weights are modulated by the router’s self-reported confidence, whose calibration we do not independently verify.
Future work. Beyond the hybrid natural-image stream above, promising directions include: calibrated or learned routing with adaptive parser selection; substituting specialist parsers (DePlot, Nougat) and extending the taxonomy with new typed parsers (tables, molecular structures, musical notation) to broaden the definition of “structure”; composing VISA’s purely textual streams with a downstream factor-decomposed LLM reranker; and validating the approach across additional reasoning-intensive multimodal benchmarks.
8. Conclusions
We introduced VISA, a training-free Visual Symbolic Agent for reasoning-intensive multimodal retrieval. Instead of forcing image content into a dense multimodal embedding space, VISA converts the query image into typed symbolic text through a Vision LLM router and a set of structured parser prompts. The resulting symbolic evidence is kept separate from both the raw query and a holistic caption, yielding three complementary text streams that are scored by a frozen retrieval LLM and combined through confidence-weighted fusion. On MM-BRIGHT multimodal-to-text, VISA achieves 32.4 nDCG@10 across 29 domains, improving by 4.8 points over the strongest dense multimodal baseline. The gains are especially clear in domains where images contain recoverable structure, such as equations, charts, code, screenshots, maps, circuits, and technical diagrams. Ablation results show that this improvement does not come from a single component: the raw-query, symbolic, and caption streams are all necessary, the router’s top-3 typed dispatch is load-bearing, and confidence-weighted fusion substantially outperforms confidence-agnostic alternatives. These results suggest that the main bottleneck in reasoning-intensive multimodal retrieval is not only model scale, but the representation substrate used to inject visual evidence into retrieval. For structured technical images, symbolic grounding provides a cleaner and more retriever-compatible interface than dense visual embeddings or free-form captions alone. More broadly, VISA shows that strong multimodal retrieval can be obtained without training a multimodal retriever: a Vision LLM can act as a parser, while retrieval itself remains purely text-based.