VISA-Agent: A Visual Symbolic Agent for Reasoning-Intensive Multimodal Retrieval

Abdalla, Mahmoud; Kasem, Mahmoud SalahEldin; Mahmoud, Mohamed; Senussi, Mostafa Farouk; Abdallah, Abdelrahman; Kang, Hyun Soo

doi:10.3390/math14122197

Open AccessArticle

VISA-Agent: A Visual Symbolic Agent for Reasoning-Intensive Multimodal Retrieval

by

Mahmoud Abdalla

¹

,

Mahmoud SalahEldin Kasem

^1,2

,

Mohamed Mahmoud

^1,3

,

Mostafa Farouk Senussi

^1,3

,

Abdelrahman Abdallah

^3,4 and

Hyun Soo Kang

^1,*

¹

Department of Information and Communication Engineering, School of Electrical and Computer Engineering, Chungbuk National University, Cheongju-si 28644, Republic of Korea

²

Multimedia Department, Faculty of Computers and Information, Assiut University, Assiut 71526, Egypt

³

Information Technology Department, Faculty of Computers and Information, Assiut University, Assiut 71526, Egypt

⁴

Department of Computer Science, Innsbruck University, 6020 Innsbruck, Austria

^*

Author to whom correspondence should be addressed.

Mathematics 2026, 14(12), 2197; https://doi.org/10.3390/math14122197

Submission received: 12 May 2026 / Revised: 14 June 2026 / Accepted: 16 June 2026 / Published: 18 June 2026

(This article belongs to the Special Issue New Advances in Image Processing and Computer Vision)

Download

Browse Figures

Versions Notes

Abstract

Reasoning-intensive multimodal retrieval suffers from a counter-intuitive bottleneck: on MM-BRIGHT multimodal-to-text (Query+Image → Documents), the strongest dense multimodal encoder reaches only 27.6 nDCG@10 and the rest of the dense vision–language retrievers cluster between 10.0 and 23.0. The visual signal, encoded as a dense vector, adds noise rather than evidence; even augmenting strong text retrievers with raw image captions degrades performance by up to 12.0 points. We propose VISA, a Visual Symbolic Agent that re-casts multimodal-to-text as text retrieval over three parallel streams. A Vision LLM is dispatched in three roles via separate prompts: a zero-shot router that classifies the query image into up to three parser types from a fixed taxonomy of nine (chart, circuit, equation, screenshot, code, figure, diagram, map, photograph); typed parsers that extract structured text per type; and a holistic captioner. The agent constructs three text streams (raw query, query ⊕ symbolic, query ⊕ caption), scores each with a single frozen 4B-parameter retrieval LLM, and fuses the per-document scores via Reciprocal Rank Fusion or a confidence-weighted linear combination. The whole agent contains no trainable parameters. The key novelty is a change of substrate: rather than projecting the query image into a dense multimodal vector that competes with text, VISA is, to our knowledge, the first retrieval system to convert the image into typed symbolic text and keep retrieval entirely text-side, so that a frozen text retriever can match the literal tokens (axis values, variable names, function signatures) that answering documents actually contain. Across all 29 MM-BRIGHT multimodal-to-text domains, VISA achieves 32.4 nDCG@10, an absolute improvement of +4.8 over the strongest dense multimodal encoder and substantially larger margins over the remaining six dense vision–language baselines. Per-domain analysis shows VISA maintains its margin across STEM and software domains where image content is structure-heavy. In practical terms, VISA is training-free and model-agnostic: it requires no fine-tuning, reuses any off-the-shelf vision LLM and text retriever, caches all per-image parsing so re-runs cost only three query encodes, and can therefore be dropped into an existing text-retrieval stack to add reasoning-intensive multimodal capability without building or training a multimodal encoder.

Keywords:

multimodal retrieval; reasoning-intensive retrieval; vision–language models; symbolic parsing; LLM agents; MM-BRIGHT

MSC:

68T42

1. Introduction

Reasoning-intensive retrieval [1,2,3] has become central to information systems built around large language models. Users no longer issue keyword queries: they paste error logs, attach circuit schematics, screenshot a chart from a paper, or photograph a whiteboard, and expect the system to retrieve a document that addresses the underlying question. This multimodal phrasing is now the norm in technical question answering, scientific search, and AI assistants embedded in productivity tools, and it requires the retriever to perform genuine cross-modal reasoning rather than surface-level matching [4,5].

MM-BRIGHT [3] is a benchmark of 2803 real-world reasoning-intensive queries spanning 29 technical domains, of which multimodal-to-text—multimodal query (text + image) to text document— is the headline multimodal-retrieval task. The benchmark exposes a striking failure mode of current multimodal encoders: across seven state-of-the-art dense vision–language retrievers, the best model (Nomic-Vision [6]) reaches only

27.6

nDCG@10 on multimodal-to-text, with the runner-up Jina-CLIP [7] at

23.0

, GME-7B [8] at

22.0

, GME-2B at

19.5

, and CLIP [9] and BGE-VL [10] at

10.4

and

10.0

respectively [3]. Adding visual signal does not help retrieval here—it actively introduces noise. Abdallah et al. [3] further observe that even augmenting strong text retrievers with vision-language captions can degrade performance: a reasoning-enhanced retriever loses

12.0

nDCG@10 points when its query is concatenated with a generated image caption. The question is therefore not how to make the visual encoder bigger or the caption longer, but how to introduce the image into the retrieval pipeline at all without contaminating the textual signal that current retrievers rely on.

We argue the bottleneck is the substrate, not the encoder. Existing multimodal retrievers route the query image through a dense visual encoder whose output must compete with text in a single shared embedding space, and existing caption-augmentation pipelines splice an unconstrained natural-language description into a query string that the retriever was never trained to consume. Both choices send noisy signal to a model that has only one place to put it. Recent results in chart understanding [11,12,13], scientific document parsing [14], and chart question answering [15,16] demonstrate that, for images with regular structure, vision–language models can extract that structure into clean typed text—table cells, LaTeX formulas, code, node–edge graphs—which is competitive with the original image for downstream reasoning. We adopt this insight as a retrieval primitive: convert the image into typed structured text and then run text retrieval, rather than encoding the image into a dense vector that must be matched against text, or attaching a free-form caption that derails the reasoning retriever.

We present VISA https://github.com/HarnessLab/VISA-Agent (accessed on 9 June 2026) (Figure 1), a multimodal retrieval agent that re-casts multimodal-to-text as a text-retrieval problem over three parallel streams. A multimodal query

(q, x)

is dispatched by a Vision LLM in three roles via separate prompts: a zero-shot router that classifies the image x into up to three parser types from a fixed taxonomy of nine (chart, circuit, equation, screenshot, code, figure, diagram, map, photograph); typed parsers, one per chosen type, each extracting the image into structured text

S_{t}

; and a holistic captioner producing a natural-language description c. The agent then constructs three text streams—the raw query (Stream A), the query augmented with the parsed symbolic content (Stream B), and the query augmented with the holistic caption (Stream C)—scores each with a single frozen retrieval LLM (a 4B-parameter decoder-only embedding model), and fuses the per-document scores via Reciprocal Rank Fusion [17] or a confidence-weighted linear combination conditioned on the router’s output. The whole agent contains no trainable parameters: a single Vision LLM (Qwen3-VL [18]) served via vLLM [19] answers different prompts to fill all three roles, and the retrieval LLM is used unchanged. Concretely, Figure 1 reads left to right: the query image first enters the router (top), which emits up to three typed parser calls; each typed parser returns structured text that is concatenated into the symbolic block S and cached by image hash, while the captioner returns a holistic description c; the raw query, S, and c then form the three streams A, B, C that are independently scored by the frozen retrieval LLM and merged by the fusion block on the right into the final document ranking.

On all 29 MM-BRIGHT multimodal-to-text domains, VISA achieves 32.4 nDCG@10, an absolute improvement of +4.8 over the strongest dense multimodal encoder (Nomic-Vision,

27.6

), +9.4 over Jina-CLIP (

23.0

), +10.4 over GME-7B (

22.0

), and more than triple the score of the weakest dense vision–language retriever evaluated by Abdallah et al. [3]. Crucially, VISA is the first system to make the multimodal-to-text score exceed the strongest dense visual baselines on MM-BRIGHT without using a multimodal encoder at all: every retrieval call in our pipeline is a plain text-side query. Per-domain analysis (Section 5) shows VISA maintains substantial margins over the multimodal baselines across STEM and software domains, where image content is structure-heavy.

The hardest multimodal-to-text domains are precisely those whose query images are structured: charts in physics and bioinformatics, equations in mathematics, code or terminal screenshots in cryptography and software domains, maps in geographic-information systems. Holistic captions verbalise these images at a level of abstraction too coarse for retrieval (“a line chart showing emissions over time”); the answering documents contain the actual axis values, the actual variable names, the actual function signatures. VISA’s typed parsers extract those literal tokens, so the retrieval LLM can match them directly against the corpus. For images that are not structured (natural photographs in travel, gaming, or religious-history queries), the router falls back to the photograph parser, whose output reduces to a generic caption—by construction VISA cannot perform worse than the existing caption baseline, and in particular avoids the

- 12.0

nDCG@10 collapse that Abdallah et al. [3] report when reasoning-aware text retrievers are augmented with raw image captions.

Contributions.

What is new relative to prior multimodal retrievers and caption-augmentation pipelines is not a bigger encoder or a better caption, but a different retrieval substrate: VISA is the first training-free agent to replace the dense multimodal embedding with typed symbolic text and to keep every retrieval call text-side, and the first to turn a vision LLM into a routed parser bank rather than an encoder or a single captioner. Concretely, our contributions are:

We identify dense visual embedding as the substrate-level bottleneck in reasoning-intensive multimodal retrieval and propose symbolic grounding—parsing the query image into typed structured text—as an alternative substrate that sidesteps this bottleneck without sacrificing image content (Section 3).
We instantiate this substrate as VISA, a Vision LLM agent that performs zero-shot image-type routing over a fixed taxonomy of nine parser prompts and dispatches typed parsers in parallel, composing their outputs with a holistic caption into three text streams scored by a single frozen 4B retrieval LLM (Section 3).
We propose a per-query confidence-weighted linear fusion scheme that adjusts the contribution of the symbolic and caption streams according to the router’s confidence in a structure-friendly type (Section 3.9).
We evaluate VISA on all 29 MM-BRIGHT multimodal-to-text domains and report nDCG@10 of 32.4, beating the strongest dense multimodal encoder by $+ 4.8$ absolute. We further provide ten ablations isolating the contribution of each pipeline component, a parser leave-one-out matrix, and a linear-weight sensitivity sweep (Section 5 and Section 6).
VISA is training-free, model-agnostic, and deployable on top of any existing text-retrieval stack: it adds reasoning-intensive multimodal capability without training or hosting a multimodal encoder, its per-image parsing is computed once and cached so warm-cache queries cost only three text encodes, and each typed parser prompt can be swapped for a specialist model (e.g., DePlot, Nougat) without touching the rest of the pipeline.

Section 2 surveys reasoning-intensive multimodal retrieval, multimodal retrievers, and image-to-structure parsing. Section 3 formalises VISA. Section 4 describes the experimental setup and Section 5 reports the headline result. Section 6 isolates the contribution of each component, Section 7 discusses limitations and future work, and Section 8 concludes.

2. Related Work

Early retrieval benchmarks such as BEIR [20] and MTEB [21] target zero-shot generalisation under surface-level relevance, where lexical or shallow-semantic overlap between query and document is sufficient. The benchmarks closest to ours impose two simultaneous constraints—genuine reasoning beyond keyword overlap, and multimodal queries that combine text with images. BRIGHT [2] and RAR-b [5] establish the reasoning side of this problem in the text-only regime; multimodal benchmarks such as ViDoRe, UNIIR, MMEB, and MRMR [22] introduce the modality side but largely focus on surface visual–textual alignment. MM-BRIGHT [3] is the first benchmark to combine both: 2803 reasoning-intensive queries spanning 29 StackExchange-style technical domains, with multimodal-to-text (Query+Image → Documents) the headline retrieval configuration we target. The benchmark is curated such that the relevant documents require multi-step inference rather than literal token overlap, making it a strict generalisation of BRIGHT into the multimodal regime. ARK [23] confirms the same pattern under a different annotation protocol: visual reasoning under retrieval pressure remains an open problem.

Bi-encoder retrievers grounded in DPR [24] have driven most progress on BEIR and MTEB, with subsequent encoders such as E5-Mistral [25], GritLM [26], instruction tuning, and multi-task training. Under the reasoning pressure of BRIGHT and MM-BRIGHT, however, these retrievers underperform their MTEB numbers by large margins. ReasonIR [27] and a recent line of 4B-parameter decoder-only retrievers [28] address this by training on reasoning-annotated pairs with chain-of-thought augmentation and hard negatives. Throughout this paper we treat the choice of base retrieval LLM as orthogonal to the contribution of VISA: any text retriever can be substituted, and the rest of the pipeline is unchanged. We report all results with the same frozen 4B-parameter retrieval LLM applied identically to each of the three streams.

A line of work scales contrastively pre-trained vision–language encoders [7,9,10] to the retrieval setting, with Nomic Embed Vision [6], VLM2Vec [29], and GME [8] extending the substrate to instructed multimodal embeddings. On MM-BRIGHT multimodal-to-text, the strongest of these models (Nomic-Vision) achieves

27.6

nDCG@10, with the rest clustered between

10.0

and

23.0

[3]. These models excel on shallow image–text alignment benchmarks but, as observed on MM-BRIGHT, their visual signal is added as noise rather than evidence under the reasoning constraint. Two findings from Abdallah et al. [3] sharpen this picture and directly motivate VISA. First, multimodal retrievers underperform on the essential-image subset of multimodal-to-text, where the image carries information not derivable from the text—exactly the setting where visual signal should help. Second, even augmenting strong text-only retrievers with vision–language captions can degrade performance: a reasoning-aware retriever loses up to

12.0

nDCG@10 points when its query is concatenated with a generated caption. This shows that the problem is not solved by “encoding the image more carefully” or by “describing the image more naturally”; the substrate of dense multimodal embedding and free-form caption-augmentation is itself the bottleneck. VISA sidesteps this regime entirely: it never encodes the image into a vector, and it never splices a free-form caption directly onto the retrieval query as the only image-derived signal. The image is converted into typed structured text consumed by a text retrieval LLM, with the holistic caption retained only as one stream of three under explicit fusion. Table 1 summarises these representative multimodal-retrieval approaches, contrasting their representation substrate, advantages, and limitations on reasoning-intensive multimodal-to-text retrieval with those of VISA.

LLM rerankers [30,31] consistently improve over dense retrieval alone; RankGPT and BracketRank [4,32] introduced listwise reasoning, and Rank1 [33] and RankR1 [34] added chain-of-thought reasoning and reinforcement-learned ranking. Query expansion [35,36,37] and agentic retrieval pipelines [38,39] improve recall by issuing LLM-generated subqueries and interleaving retrieval with chain-of-thought. These rerankers and agentic pipelines, however, still operate downstream of a (multimodal) dense retriever and therefore inherit the encoder-bottleneck behaviour observed on MM-BRIGHT multimodal-to-text. VISA is complementary to all of them: it eliminates the multimodal encoder altogether and produces purely textual streams, so any such reranker can operate on top of VISA’s outputs in a future combination.

A separate research thread has built specialist models for converting visual structure into text: DePlot [13] and MatCha [11] for chart-to-table translation, Nougat [14] for academic-document OCR producing LaTeX, and chart-question-answering benchmarks such as ChartQA [15] and PlotQA [16] that motivated this line. These models are typically used as components of document-QA or summarisation pipelines; we are not aware of prior work that uses typed structural parsers as a retrieval substrate. Where existing work asks what is in this chart?, we ask which document in a 30k-passage corpus answers a question that mentions this chart?, and show that converting the image into clean structured text once, caching the result, and then running text retrieval is markedly stronger than encoding the same image into a dense multimodal vector. The approach is also parser-agnostic: each typed parser prompt in our taxonomy can be replaced one-for-one with a specialist model such as DePlot for charts or Nougat for equations, without altering the rest of the pipeline.

Table 1. Representative multimodal-retrieval approaches, their representation substrate, advantages, and limitations on reasoning-intensive multimodal-to-text retrieval. Existing methods inject the image either as a dense vector or as a free-form caption; VISA instead converts it into typed symbolic text and keeps retrieval entirely text-side. nDCG@10 figures are the MM-BRIGHT multimodal-to-text numbers of Abdallah et al. [3].

Method/Family	Substrate	Advantage	Limitation
CLIP, SigLIP [9]	Contrastive dual encoder	Strong shallow image–text alignment; cheap inference	Visual vector competes with text; collapses under reasoning (≤10.8)
Jina-CLIP [7], BGE-VL [10]	Contrastive VL encoder	Multilingual/KG-augmented embeddings	Same dense-vector bottleneck; mid-range (10.0–23.0)
GME-2B/7B [8], VLM2Vec [29]	Instructed multimodal LLM embedding	Unified instructed embedding; scales with model size	Needs multimodal training; visual signal added as noise
Nomic-Vision [6]	Contrastive VL encoder (expanded latent)	Strongest dense baseline (27.6)	Single shared vector; no symbolic grounding
Caption augmentation [3]	Free-form caption ⊕ query	No multimodal encoder required	Caption text derails the reasoning retriever (up to −12.0)
LLM reranking [32,33]	Listwise/reasoning reranker over a base retriever	Strong reranking gains; chain-of- thought reasoning	Operates downstream of a dense multimodal retriever; inherits its bottleneck
VISA(ours)	Typed symbolic text streams	Training-free; text-side only; literal-token match; cacheable (32.4)	Relies on parser quality for unstructured photographs

Several recent systems augment retrieval with episodic memory [40], case-based RAG [41], or multimodal episode indexing [42]. We do not maintain such memory: VISA’s only persistent state is a parsed-content cache keyed by image hash, used to avoid recomputing parser outputs across runs. Episodic factor memory of this kind remains compatible with VISA and is left to future work.

Recent structure-aware vision–language work.

A growing body of 2025–2026 work reinforces our central premise that explicit structure, rather than a larger dense encoder, is what stabilises vision under reasoning or matching pressure. Ma et al. [43] enhance a vision–language model with structured spatio-temporal data for traffic-scene understanding, showing that injecting structured side information sharpens visual understanding beyond what raw visual features provide. Zhu et al. [44] improve image–text matching through multi-level semantic consistency alignment, decomposing text into hierarchical semantic levels—a structuring of the text side that is complementary to our structuring of the image side. Li et al. [45] exploit a known gravity direction as a geometric prior to make point-cloud registration outlier-robust, another instance of a structural prior buying robustness in a vision task. These works target scene understanding, image–text matching, and 3D registration respectively; none re-casts the query image as a text-side symbolic substrate consumed by a frozen reasoning retriever, which is the specific contribution of VISA to reasoning-intensive multimodal retrieval.

3. Method

VISA is a multimodal retrieval agent that re-casts the multimodal-to-text problem as text retrieval over three parallel streams. A multimodal query

(q, x)

—text q and image x—is processed by a single Vision LLM acting in three distinct roles via separate prompts (a zero-shot router, a typed parser per chosen type, and a holistic captioner). The resulting symbolic and caption signals, together with the raw query, form three text streams that are scored against the document corpus

D

by a single frozen retrieval LLM and fused into a final ranking. The whole agent is inference-only.

3.1. Problem Setup and Notation

Let

D = {d_{j}}_{j = 1}^{N}

denote the text-only document corpus of a target MM-BRIGHT domain, with N on the order of

10^{4}

to

10^{5}

passages per domain. A multimodal-to-text query is a pair

q^{★} = (q, x)

, where

q \in Σ^{★}

is the natural-language question and x is one query image (When a query has multiple attached images we use the first one as the primary input; multi-image extension is straightforward and orthogonal to our contribution). The goal is to produce a ranking

π (q^{★}) = (d_{(1)}, \dots, d_{(K)})

that maximises a relevance metric (nDCG@10) against the gold relevance judgments

rel (q^{★}, d) \in {0, 1}

released with the benchmark.

We denote the parser-type taxonomy by

T

:

\begin{matrix} T = & {chart, circuit, equation, screenshot, code, \\ figure, diagram, map, photograph}, | T | = 9 . \end{matrix}

We partition the taxonomy as

T = T_{struct} \cup {photograph},

where

\begin{matrix} T_{struct} = & {chart, circuit, equation, code, \\ figure, diagram, map} . \end{matrix}

The set

T_{struct}

contains the seven structure-friendly types whose parser prompts produce machine-readable artefacts, such as tables, LaTeX, code, and node–edge graphs. The fallback type

photograph

produces a generic caption-style description, while

screenshot

is treated as a mixed type whose parser yields verbatim visible text but no formal structure.

3.2. Pipeline Overview

VISA executes five stages per query (Figure 1):

1.: Routing. A Vision LLM classifies x into up to $K_{r} = 3$ parser types from $T$ with calibrated confidences (Section 3.3).
2.: Parsing. For each chosen type $τ$ , a typed prompt is re-issued to the same Vision LLM, producing structured text $S_{τ}$ (Section 3.4).
3.: Symbolic block construction. The per-type outputs are concatenated into a single symbolic block S (Section 3.5).
4.: Stream construction and retrieval. Three text streams $A, B, C$ are built from q, S, and a holistic caption $ψ$ , and each is scored independently by the retrieval LLM (Section 3.7).
5.: Fusion. Per-document scores from the three streams are combined via Reciprocal Rank Fusion or confidence-weighted linear fusion to produce $π (q^{★})$ (Section 3.9).

We denote the Vision LLM by

f_{v}

and the retrieval LLM by

ϕ

. Both are frozen pre-trained models and receive no fine-tuning at any stage of VISA; the only persistent state is a parsed-content cache keyed by image hash and parser type (Section 3.10).

3.3. Vision LLM Router

The router is a single zero-shot call to

f_{v}

with a fixed classification prompt

Π_{r}

(shown verbatim in Figure 2, top) that asks for a JSON list of up to three (type, confidence, reason) triples drawn from

T

. Formally, the router is given by Equation (1),

r (x) = {parse}_{T} (f_{v} (Π_{r}, x)) = [(τ_{1}, c_{1}), \dots, (τ_{k}, c_{k})], k \leq K_{r},

(1)

where each

τ_{i} \in T

, each

c_{i} \in [0, 1]

is a self-reported confidence clipped to the unit interval, and

{parse}_{T} (\cdot)

extracts the JSON array, validates the type vocabulary, and clips numerical confidences. We denote

c (x) : = c_{1}

as the top-1 router confidence, used downstream for fusion-weight modulation. If the parser is empty (an LLM-side parsing failure), the router falls back deterministically to a single photograph entry with confidence

0.3

, which guarantees Stream B never collapses to an undefined state and that VISA reduces, in the worst case, to caption-level evidence plus the raw query.

3.4. Parser Toolkit

We define a family of parser prompts

{Π_{τ}}_{τ \in T}

that share input format with the router but enforce typed output schemata:

Π_{chart}

requests

〈 title, axes, (x, y) - pairs, trend 〉

;

Π_{equation}

requests LaTeX, named variables, and domain conditions;

Π_{code}

requests language, signatures, and verbatim source; and so on. For each chosen type

τ

from the router output, we issue the typed parser of Equation (2),

S_{τ} = f_{v} (Π_{τ}, x) \in Σ^{★},

(2)

calling the same Vision LLM

f_{v}

in parallel across the chosen types. The number of parallel parser calls per query is

| {τ : (τ, c) \in r (x)} | \leq K_{r} = 3

. Figure 2 shows the exact prompts used: the zero-shot router prompt

Π_{r}

(top) and, as a representative typed parser, the equation parser prompt (bottom). All nine parser prompts and the captioner prompt are released with the code at https://github.com/HarnessLab/VISA-Agent (accessed on 9 June 2026).

3.5. Symbolic Block Construction

The per-type structured outputs are composed into a single symbolic block by concatenating type-tagged segments, as in Equation (3):

S = ⨁_{τ \in types (r (x))} [TYPE (τ)] \oplus S_{τ},

(3)

where ⊕ denotes string concatenation and

TYPE (τ)

writes the type name in upper-case (e.g., [CHART], [FIGURE]). The type tags act as soft markers that the retrieval LLM can attend to, and they help the per-stream analysis disentangle which parser type contributed which tokens. When

r (x) = ⌀

(e.g., x is missing or the router fails entirely), S degenerates to the empty string and Stream B collapses to the raw query.

3.6. Caption Provider

The third stream uses a holistic image caption

ψ (x) \in Σ^{★}

. In our experiments

ψ (x)

is the

caption_gpt 4 o

field released with MM-BRIGHT [3], ensuring direct comparability with caption-augmented baselines reported in the original benchmark paper. VISA does not depend on this specific caption source; substituting

ψ (x) = f_{v} (Π_{c}, x)

for an in-house captioning prompt

Π_{c}

leaves the rest of the pipeline unchanged (Section 6).

3.7. Stream Construction

VISA builds three parallel text streams per query, defined in Equations (4)–(6):

\begin{matrix} A (q^{★}) & = q, \end{matrix}

(4)

\begin{matrix} B (q^{★}) & = q \oplus ζ_{s} \oplus S, \end{matrix}

(5)

\begin{matrix} C (q^{★}) & = q \oplus ζ_{c} \oplus ψ (x), \end{matrix}

(6)

where

ζ_{s}

and

ζ_{c}

denote the symbolic-content and caption delimiters, respectively:

\begin{matrix} ζ_{s} & = “ \ n [Visual content extracted as structure] : \ n ”, \\ ζ_{c} & = “ \ nImage Description : \ n ” . \end{matrix}

These short strings announce the visual evidence to the retrieval LLM. Stream A is the unmodified text query—a safety anchor that ensures VISA cannot regress below text-only retrieval. Stream B injects the typed symbolic content extracted from x. Stream C injects the holistic caption.

3.8. Per-Stream Retrieval

Let

ϕ : Σ^{★} \to R^{d}

be the retrieval LLM, producing d-dimensional embeddings (

ϕ

is shared between queries and documents. In our implementation

ϕ

is a frozen 4B-parameter decoder-only retrieval LLM, but the pipeline is agnostic to this choice). Document embeddings are pre-computed once per domain and cached to disk following Equation (7):

e_{j} = ϕ (d_{j}), j = 1, \dots, N .

(7)

For each stream

s \in {A, B, C}

, the agent encodes the stream’s realisation of the query and scores every document by inner product (Equation (8)):

R_{s} (d_{j} ∣ q^{★}) = 〈ϕ (s (q^{★})), e_{j}〉 .

(8)

A single multimodal query therefore induces three sets of per-document scores

{R_{A} (d_{j})}_{j}

,

{R_{B} (d_{j})}_{j}

,

{R_{C} (d_{j})}_{j}

. Document embeddings are encoded once and reused across streams, so the cost of the three-stream construction is dominated by three query encodes per

q^{★}

.

3.9. Score Fusion

The three per-stream score vectors are fused into a single ranking. We support two fusion strategies that switch under a single CLI flag.

Reciprocal Rank Fusion.

The default robust setting is RRF [17]. Let

{rank}_{s} (d_{j})

denote the rank of

d_{j}

in stream s (best is 1). Then, as in Equation (9),

{score}_{RRF} (d_{j}) = \sum_{s \in {A, B, C}} \frac{1}{k_{RRF} + {rank}_{s} (d_{j})},

(9)

with

k_{RRF} = 60

following standard practice. RRF is parameter-light, robust to score-magnitude differences across streams, and ignores the router’s confidence signal entirely.

Confidence-weighted linear fusion.

When the router signal is reliable, we can do better by routing trust towards the stream most aligned with the image type. Let

{\tilde{R}}_{s} (d_{j}) = \frac{R_{s} (d_{j}) - \min_{j^{'}} R_{s} (d_{j^{'}})}{\max_{j^{'}} R_{s} (d_{j^{'}}) - \min_{j^{'}} R_{s} (d_{j^{'}})} \in [0, 1]

(10)

denote the per-stream min–max-normalised score (Equation (10)), and let

({\bar{w}}_{A}, {\bar{w}}_{B}, {\bar{w}}_{C})

be base weights with

{\bar{w}}_{A} + {\bar{w}}_{B} + {\bar{w}}_{C} = 1

(default

0.30, 0.40, 0.30

). The per-query weights are conditioned on the router payload via Equation (11):

(w_{A}, w_{B}, w_{C}) = normalise ({\bar{w}}_{A}, {\bar{w}}_{B} + δ_{B} (r (x)), {\bar{w}}_{C} + δ_{C} (r (x))),

(11)

where the shifts

(δ_{B}, δ_{C})

are tied to the top-1 router type

τ_{1}

and confidence

c_{1}

as defined in Equation (12):

(δ_{B}, δ_{C}) = \{\begin{matrix} (+ α c_{1}, - α c_{1}) & if τ_{1} \in T_{struct}, \\ (- β c_{1}, + β c_{1}) & otherwise (e . g ., photograph), \end{matrix}

(12)

with

α = 0.30

and

β = 0.20

in our experiments. When the router is highly confident the image is parseable into structure (charts, equations, code, …), Stream B is up-weighted at the expense of the holistic caption stream; when the image is photograph- like, the holistic caption is trusted more. The textual anchor A keeps a fixed minimum weight under all conditions, ensuring VISA never falls below the text-only retrieval baseline. The fused score is given by Equation (13):

{score}_{LIN} (d_{j}) = w_{A} {\tilde{R}}_{A} (d_{j}) + w_{B} {\tilde{R}}_{B} (d_{j}) + w_{C} {\tilde{R}}_{C} (d_{j}) .

(13)

The final ranking is obtained by sorting documents in descending order of

score (\cdot)

and returning the top K.

3.10. Caching and Compute Footprint

Parser cache.

Vision LLM calls are the dominant per-query cost. We cache parser outputs persistently keyed by

(SHA 1 (x) [: 20], τ)

, with the router payload cached separately under the synthetic key

(SHA 1 (x) [: 20],)

_ _router_ _), namespaced by the router strategy so that ablation runs (Section 6) do not collide. Re-runs and stream-level ablations therefore incur only retrieval and fusion cost, not Vision LLM cost.

Document embedding cache.

The retrieval LLM document embeddings

{e_{j}}

are computed once per domain and serialised to disk. Across the 29 domains the combined corpus exceeds 1.4M passages; encoding it once and reusing the cache across the headline run, all ablations, and the linear-weight sweep keeps the marginal cost of an ablation roughly equal to one mini-batch of

3 N_{q}

query encodes plus the fusion arithmetic, where

N_{q}

is the number of queries in the domain.

Per-query cost.

Let

# router, # parser, # caption

denote the (uncached) Vision LLM calls and

# enc

the retrieval LLM query-encoder calls. For an image x on which the router selects

k \leq 3

parser types and the caption is supplied externally, the per-query call counts follow Equation (14),

# router = 1, # parser \leq 3, # caption = 0, # enc = 3,

(14)

on a cold cache, and zero Vision LLM calls plus

# enc = 3

on a warm cache. Total Vision LLM token budget per cold-cache query is approximately 300 tokens of routing output and up to

3 \times 900 = 2700

tokens of parser output, well within standard serving limits. With a 27B-parameter Vision LLM served on vLLM [19], this translates to roughly 5–10 s per cold query and sub-second per warm query in our setting.

Comparison to dense multimodal retrieval.

It is worth making explicit how the cost of generating multiple symbolic text streams compares with a traditional dense multimodal embedding pipeline. Both paradigms share the dominant cost: a one-time corpus encode of N passages (

O (N)

, computed once and cached) and an

O (N)

nearest-neighbour scan at query time. A dense multimodal retriever then adds one multimodal-encoder forward pass per query (image+text fused into a single vector). VISA instead adds, on a cold cache, one router call and

k \leq 3

parser calls to the Vision LLM (bounded by ≈300 + 3 × 900 output tokens, Equation (14)) plus three lightweight text query-encodes; on a warm cache the Vision LLM cost is zero and the only query-time overhead over single-stream text retrieval is two extra query-encodes and a constant-time fusion step. Because parser outputs are cached per unique image hash, the Vision LLM cost is paid once per distinct image and then amortised across all re-runs, every ablation, and the linear-weight sweep. The trade-off is therefore a small constant factor of cheap text-side encodes and a one-time, fully-cacheable parsing pass, in exchange for (i) eliminating the multimodal encoder entirely—no multimodal model is trained, fine-tuned, or even hosted at retrieval time—and (ii) the substantial accuracy gain reported in Section 5. In contrast to dense pipelines, VISA never re-encodes the corpus with a multimodal model and never pays multimodal inference for repeated images.

3.11. Algorithm

Algorithm 1 summarises the per-domain procedure executed once at evaluation time. The outer loop over queries is the only concurrency dimension; within each query, the parser calls in line 2b run in parallel via asynchronous I/O against the Vision LLM server, and the three retrieval calls in line 2e reuse the same cached document embeddings.

Algorithm 1: VISA retrieval for a single MM-BRIGHT domain.

Input: multimodal queries

{q_{i}^{★} = (q_{i}, x_{i})}_{i = 1}^{N_{q}}

, document corpus

D = {d_{j}}_{j = 1}^{N}

, captions

{ψ_{i}}_{i = 1}^{N_{q}}

.
Models: Vision LLM

f_{v}

, retrieval LLM

ϕ

.
Output: per-query top-K ranking

π_{i}

.
1. Pre-compute (and cache)

e_{j} = ϕ (d_{j})

for all j. (Equation (7))
2. For each query

i \in {1, \dots, N_{q}}

:
(a)

r_{i} \leftarrow

router payload from cache, else

{parse}_{T} (f_{v} (Π_{r}, x_{i}))

. (Equation (1))
(b) For each

τ \in types (r_{i})

with cache miss, compute

S_{τ} \leftarrow f_{v} (Π_{τ}, x_{i})

in parallel and persist. (Equation (2))
(c) Compose

S_{i} \leftarrow ⨁_{τ} [TYPE (τ)] \oplus S_{τ}

. (Equation (3))
(d) Build

A_{i}, B_{i}, C_{i}

from

q_{i}, S_{i}, ψ_{i}

. (Equations (4)–(6))
(e) For each

s \in {A, B, C}

: score

R_{s} (d_{j} ∣ q_{i}^{★}) = 〈 ϕ (s_{i}), e_{j} 〉

for all j. (Equation (8))
(f) Fuse to obtain

π_{i}

via Equation (9) or Equations (11)–(13).
3. return

{π_{i}}_{i = 1}^{N_{q}}

.

3.12. Design Discussion

Three design choices are worth flagging. (i) No multimodal encoder is used at any stage. Both the per-stream retrieval and the caching are text-only—the Vision LLM is the only place the image is consumed, and only to produce text. This eliminates the dense multimodal-embedding bottleneck identified in the introduction. (ii) The router and parsers share the same Vision LLM. The parser-prompt taxonomy is therefore extensible without retraining: a specialist substitution (

Π_{chart} \to DEPLOT

[13],

Π_{equation} \to NOUGAT

[14]) is a single-class swap. (iii) Stream A is a structural anchor. By construction, removing Streams B and C recovers a single-stream text-only retrieval baseline, so VISA cannot underperform that baseline by more than the score-normalisation noise of the fusion function—a property we verify empirically in Section 6.

3.13. Robustness to Routing Errors and Prompt Design

A natural concern is that VISA depends on a zero-shot router and on hand-written parser prompts: how does it behave when the router mis-classifies an image or a prompt is sub-optimal? VISA is engineered so that such errors degrade performance gracefully rather than catastrophically, through four mechanisms.

(i) A lower-bounding text anchor. Stream A (the raw query) is always present and keeps a fixed minimum weight

{\bar{w}}_{A}

under all router outputs (Equations (11)–(12)). Even a completely wrong router cannot drag the fused ranking below text-only retrieval by more than the normalisation noise of the fusion function, so routing errors cannot trigger the

- 12.0

nDCG@10 collapse that afflicts caption-augmented queries.

(ii) Redundant top-3 dispatch. The router emits up to three types rather than committing to one, so a single mis-classification is usually accompanied by a correct co-dispatched type, and the retrieval LLM attends to whichever symbolic tokens actually match the corpus. The top-1 ablation (Section 6.3,

- 3.96

nDCG@10 vs. top-3) quantifies exactly this redundancy: removing it is the single largest router-side degradation.

(iii) Deterministic fallback. If the router returns an empty or malformed payload, it falls back deterministically to a single photograph entry (Section 3.3), so Stream B never collapses to an undefined state; in the worst case VISA reduces to caption evidence plus the raw query. Router outputs are additionally validated—the JSON is parsed, the type vocabulary is checked against

T

, and confidences are clipped to

[0, 1]

—so malformed generations are rejected rather than propagated.

(iv) Selection, not coverage, is what matters. Two ablations show the design is robust to the question of whether typed routing helps at all. Disabling the router and firing all nine parsers (all 9) actually hurts (

- 2.99

) because off-target parser outputs contaminate the symbolic block, and a deterministic-random router matches it almost exactly (

- 3.07

). The gain therefore comes from the router’s selective typed choice, and over-dispatching—the failure mode a noisy router would induce—is explicitly penalised by the pipeline rather than rewarded. Finally, because all Vision LLM calls use temperature 0, parser outputs are deterministic and reproducible (which is also what makes hash-keyed caching sound), and because the taxonomy is parser-agnostic, any individual prompt can be replaced by a specialist model without altering the rest of the pipeline, decoupling overall robustness from the wording of any single prompt.

4. Experimental Setup

4.1. Benchmark and Task

We evaluate VISA on multimodal-to-text of MM-BRIGHT [3]: Query+Image → Documents retrieval over the full 29-domain corpus, with 2803 multimodal queries in total. Each query consists of a natural- language question and one or more attached images (we use the first image as the primary visual input). The corpus per domain ranges from roughly 9k passages (Crypto, Salesforce) to over 300k passages (Physics), and gold relevance judgements are released with the benchmark. We report nDCG@10 across all 29 domains and their unweighted average, following the protocol of Abdallah et al. [3].

4.2. Models

The Vision LLM

f_{v}

used by the router, parsers, and captioner is Qwen3-VL-30B-A3B [18], served locally via vLLM [19] with tensor parallelism across four NVIDIA H100 80 GB GPUs. We use temperature 0,

\max_new_tokens

of 300 for the router and 900 for each parser, with the OpenAI- compatible HTTP endpoint exposed by vLLM. All Vision LLM calls share the same model instance; the only difference between the three roles is the prompt template (

Π_{r}

,

Π_{τ}

,

Π_{c}

).

The retrieval LLM

ϕ

is a frozen 4B-parameter decoder-only embedding model [28]. Document embeddings are pre-computed once per domain and cached to disk as raw float16 matrices; the three per-stream calls of VISA therefore re-encode only the query side of each stream, not the corpus.

4.3. Hyperparameters

We adopt the following defaults, fixed across every experiment unless an ablation explicitly varies them:

Router cap $K_{r} = 3$ parser types per image (Equation (1)).
Linear-fusion base weights ${\bar{w}}_{A} = 0.30$ , ${\bar{w}}_{B} = 0.40$ , ${\bar{w}}_{C} = 0.30$ , with router-confidence shifts $α = 0.30$ (structure-friendly types) and $β = 0.20$ (photograph fallback) (Equations (11)–(12)).
RRF constant $k_{RRF} = 60$ (Equation (9)).
Top-K for the final ranking is 25.

4.4. Baselines

We compare against every dense multimodal retriever evaluated on MM-BRIGHT multimodal-to-text by Abdallah et al. [3]: BGE-VL, CLIP [9], GME-2B, GME-7B [8], Jina-CLIP [7], Nomic-Vision [6], and SigLIP. All baseline numbers are taken verbatim from Table 4 of Abdallah et al. [3]. None of these models is fine-tuned for our experiments—VISA likewise introduces no trainable parameters.

5. Results

5.1. Main Result

Across all 29 MM-BRIGHT multimodal-to-text domains, VISA achieves 32.4 nDCG@10, an absolute improvement of +4.8 over the strongest dense multimodal encoder (Nomic-Vision,

27.6

), +9.4 over Jina-CLIP, +10.4 over GME-7B, and a roughly

3 \times

improvement over BGE-VL (

10.0

) and SigLIP (

10.8

). Table 2 reports the full per-domain breakdown alongside every multimodal-to-text baseline evaluated by Abdallah et al. [3].

5.2. Where the Gains Concentrate

The per-domain breakdown in Table 2 sharpens the headline into three patterns.

VISA delivers its biggest wins on salesforce (

54.9

,

+ 8.7

over the best baseline GME-7B), askubuntu (

47.8

,

+ 13.5

over Nomic-Vision), gaming (

61.2

,

+ 15.6

over Jina-CLIP), travel (

44.3

,

+ 7.6

over Nomic-Vision), and pm (

40.3

,

+ 7.1

over GME-7B). What these domains share is that their query images are highly structured artefacts—UI screenshots, admin-console settings, item-stat tables, and route maps—whose contents the typed parsers can transcribe verbatim into Stream B, giving the retrieval LLM exact tokens to match against the corpus.

On STEM domains VISA matches or modestly beats the best dense multimodal model in seven of nine cases. The two losses (bioinformatics:

- 11.7

vs. Nomic-Vision; math:

- 0.6

vs. Nomic-Vision) coincide with images dominated by complex multi-panel figures whose layout exceeds the parser schema.

For economics, psychology, law, and christianity VISA wins by

+ 4.8

to

+ 6.8

points. These domains have queries where the image is supporting context (a graph illustrating an argument, a portrait accompanying a historical question); Stream A alone already gets close to baseline scores, and the symbolic stream provides a small but consistent additional signal.

VISA loses to a dense baseline by more than 2 nDCG@10 in only one domain (bioinformatics,

- 11.7

); two close losses occur on apple (

- 8.9

vs. Nomic-Vision) and robotics (

- 0.5

). In each of these the queries are dominated by photographic content (product photos, robot images) for which the dense visual encoders are well-tuned and the structured parser yield is small.

5.3. Qualitative Analysis: Where VISA Wins and Loses

To make the structured-vs-unstructured distinction concrete, we summarise the representative patterns behind VISA’s largest wins and its three losses, read off the per-domain results of Table 2 together with the parser type the router selects for those images.

Where VISA beats the dense baselines.

The wins concentrate where the query image is recoverable structure that the typed parsers can linearise into literal tokens present in the answering documents:

salesforce ( $54.9$ , $+ 7.6$ over GME-7B; $+ 28.7$ over Nomic-Vision): admin-console screenshots are routed to the screenshot/code parsers, which transcribe field names, setting paths, and menu labels verbatim; these strings match the configuration-documentation passages almost lexically, whereas a single dense vector blurs them.
gaming ( $61.2$ , $+ 15.6$ over Jina-CLIP): item-stat tables and inventory panels are routed to the chart/screenshot parsers, recovering the exact stat names and numeric values that the wiki answer pages enumerate.
askubuntu ( $47.8$ , $+ 13.5$ over Nomic-Vision): terminal and error-dialog screenshots are parsed into verbatim commands and error strings, which are high-precision retrieval keys.

Where VISA loses.

The losses are exactly the mirror image—images whose information is not recoverable as typed structure, where a dense encoder tuned to such imagery retains the advantage:

bioinformatics ( $22.1$ , $- 11.7$ vs. Nomic-Vision): queries are dominated by dense multi-panel composite figures (e.g., alignment mosaics beside phylogenetic trees) whose layout exceeds any single typed schema; the router fires figure/chart but the parser cannot serialise the cross-panel relationships, so Stream B contributes little while the dense encoder still aligns the whole image.
apple ( $19.8$ , $- 8.9$ vs. Nomic-Vision): product photographs carry no parseable structure, so the router falls back to photograph and Stream B degenerates to a caption-like list — precisely the regime in which the dense visual encoders are strongest.
math ( $33.4$ , $- 0.6$ vs. Nomic-Vision): most equation images are captured well by the equation parser, but a few multi-line derivations spill beyond its single-expression schema, leaving a small residual gap.

The common rule is that VISA’s margin tracks how much of the image is typed, serialisable structure; when that fraction is high it wins decisively, and when it approaches zero the text anchor (Stream A) and the photograph fallback keep it within a point or two of the best dense model rather than collapsing.

6. Ablations

We isolate the contribution of every VISA component through four families of ablations: (i) stream leave-out (Section 6.1, configurations

c_{1}

–

c_{6}

), (ii) fusion mode (Section 6.2,

c_{11}

–

c_{12}

), (iii) router strategy (Section 6.3,

c_{7}

–

c_{10}

), and (iv) parser leave-one-out (Section 6.4,

c_{13}

–

c_{21}

). Together they test which design choices in VISA carry the headline gain over dense multimodal retrieval and which are nearly free hyperparameter knobs. All twenty-one ablations re-use the parser cache populated by the headline run, so the marginal cost of each row in the tables below is one retrieval pass plus one fusion arithmetic step per domain. Every aggregate value reported below is read directly from the aggregated. NDCG@10 field of that configuration’s summary.json; we never recompute aggregates by averaging the per-domain metrics.json files.

The headline result is recovered only by the full

A + B + C

pipeline at

32.35

nDCG@10, and every one of the twenty-one ablations

c_{1}

–

c_{21}

scores below it. The drops range from

- 1.75

(dropping the photograph fallback parser) to

- 4.96

(router top-1 with the symbolic and caption streams still active but the parser dispatch capped to a single type). VISA is therefore not dominated by any single component; the gain over dense multimodal retrieval comes from the joint action of router, parsers, streams, and fusion.

6.1. Stream Leave-Out

Table 3 reports the aggregate nDCG@10 for every non-empty subset of

{A, B, C}

. The full three-stream pipeline beats every subset by a clear margin.

No single stream and no two-stream pair exceeds

30.34

nDCG@10, yet the three-stream union scores

32.35

, a clear super-additive jump of

+ 2.0

over the best two-stream variant. This contrasts with naive caption-augmentation pipelines, where mixing visual signal into the query text often degrades performance on reasoning-aware retrievers [3]. Because VISA keeps the three signals on separate retrieval channels and fuses only the score vectors, evidence from the typed symbolic content is combined with the holistic caption and the raw query without polluting the retrieval LLM’s input.

Stream A alone clears the previous-best dense multimodal baseline (

27.6

) by

+ 2.5

points, confirming that text-only retrieval over the multimodal queries is already a strong baseline. Stream B alone (

27.32

) almost matches Nomic-Vision but loses to the text baseline A — symbolic content is informative but cannot substitute for the question text it summarises. Stream C alone (

29.94

) is intermediate. The closeness of A, C,

A + C

,

A + B

and

B + C

(all in

[29.68, 30.34]

) shows that any single image-derived signal behaves as a flat ceiling once one is present; only adding the second image-derived signal recovers the headline.

Stream B is the largest single source of gain.

Removing B alone (configuration

c_{4}

,

A + C

) costs

- 2.64

nDCG@10, the second-largest single drop in the stream table after removing both A and C. The symbolic stream is therefore not redundant with the caption; it carries information the holistic caption fails to surface.

6.2. Fusion Mode

Table 4 compares the two fusion strategies of Section 3.9. Reciprocal Rank Fusion drops

- 2.98

nDCG@10 against the default confidence-weighted linear fusion, and a uniform linear fusion (no router-confidence shift) drops

- 1.96

. The router- confidence weighting in Equations (11) and (12) is therefore load-bearing: it accounts for ≈+2 points of headline gain.

The router-confidence shift up-weights Stream B when the router is confident the image is structure-friendly, and up-weights Stream C when it is photograph-like (Equation (12)). RRF’s confidence- agnostic rank merging does not exploit this signal, and equal-weight linear fusion treats all images alike, both falling short of the adaptive default.

6.3. Router Strategy

We replace the Vision-LLM router with several alternatives: top-k

\in {1, 2, 3}

caps, an all-parsers-fire baseline that skips routing entirely, and a deterministic-random baseline that picks

k = 3

types per image seeded by

sha 1 (x) [: 20]

. Table 5 shows the result.

The router behaves as a structured classifier whose value comes from both which types it selects and how many are dispatched. Capping at top-1 loses

- 3.96

nDCG@10—a single typed parser is not enough to cover the visual content of most queries. Top-2 recovers

+ 1.5

over top-1, and top-3 (the default) recovers another

+ 2.5

; beyond that, dispatching all nine parsers (all 9) actively hurts by

- 2.99

because off-target parser outputs contaminate the symbolic block. The deterministic-random baseline matches the all-9 strategy almost exactly (

- 3.07

), confirming that without typed selection the parsers are no better than randomly chosen text augmentations. The specific identity of the typed parsers chosen by the Vision LLM is therefore load-bearing, not just whether typed parsing is invoked.

6.4. Parser Leave-One-Out

Table 6 drops each parser type from the taxonomy and re-runs the full pipeline. Every parser registers a measurable contribution; equation is the most important (

- 3.85

when dropped), photograph the least (

- 1.75

).

Every parser earns its place.

Unlike the previous-generation result where most leave-one-out drops were silent, the corrected ablation makes the contribution of each parser visible: dropping any one of the nine parsers degrades performance by between

- 1.75

and

- 3.85

nDCG@10. The taxonomy is not over-parameterised—there is no parser whose removal can be absorbed by the others.

Symbolic types matter most where the image is most structured.

The four largest drops (equation, circuit, screenshot, code) are all heavily structure-friendly types from

T_{struct}

, with parser outputs (LaTeX, node–edge graphs, verbatim source, UI-element trees) that have no equivalent in a holistic caption. The smallest drop comes from photograph: when the image is unstructured, the caption stream C already covers most of the signal, and the photograph parser’s object-list description is partially redundant with it.

Mapped onto the four domain categories of Table 2, this ordering directly reflects the structured-vs- unstructured composition of each category: the high-impact structure-friendly parsers (equation, circuit, screenshot, code, chart) are the ones that fire in the STEM & Life Sciences and Software & Technical Systems categories, where VISA’s per-category average margin over the strongest dense baseline is largest (

+ 1.1

and

+ 5.9

over Nomic-Vision, respectively, in Table 2); whereas the photograph parser, whose removal costs the least, is the one that dominates the more natural-image queries in the Applied and Humanities categories, where the caption stream already carries the signal. The parser leave-one-out contribution is therefore a parser-type-resolved restatement of the structured-vs.-unstructured pattern seen across categories in the main table.

6.5. Summary of Findings

The twenty-one ablations

c_{1}

–

c_{21}

support three clean conclusions about the source of VISA’s headline improvement.

Every component matters. All twenty-one ablations score below the full $A + B + C$ pipeline, with drops ranging from $- 1.75$ to $- 4.96$ . VISA’s headline gain is the joint product of router, parsers, streams, and fusion—no single component is responsible for it, and removing any of them costs measurable performance.
Streams combine super-additively. The strongest single stream scores $30.34$ and the full three-stream union scores $32.35$ —a $+ 2.0$ super-additive jump. Confidence-weighted linear fusion (Equations (11)–(12)) is responsible for most of this gain: replacing it with RRF or with equal weights costs $- 2.98$ and $- 1.96$ respectively.
Router top-3 is the right operating point. Capping the router below three types loses up to $- 3.96$ , dispatching all nine parsers loses $- 2.99$ , and random dispatch loses $- 3.07$ . The router’s specific typed selection at $k = 3$ is therefore not a free hyperparameter but a load-bearing component of the pipeline.

7. Limitations and Future Work

VISA converts a query image into typed symbolic text, so its advantage is, by design, tied to how much of the image is recoverable structure. This framing also delimits where the method is weakest.

(i) A central question is how VISA behaves when symbolic patterns are limited or absent—ordinary photographs of objects, scenes, or people, as in the apple and robotics domains. Here the router falls back to the photograph type and Stream B degenerates to a generic caption-style description, so VISA carries no structural advantage over a contrastive visual encoder that was trained precisely on such imagery. Two safety mechanisms nonetheless prevent failure: the raw-query anchor (Stream A) lower-bounds performance at text-only retrieval, and the caption stream (Stream C) is kept on a separate fusion channel rather than spliced into the query, so VISA matches but does not exceed the strongest dense encoder on natural-image domains, and avoids the

- 12.0

nDCG@10 collapse of naive caption augmentation. In other words, the current design degrades gracefully on unstructured content but does not yet turn it into a win; closing that gap is the most important direction for future work, e.g., a hybrid scheme that adds a learned dense visual stream specifically when the router’s structure confidence is low, fusing it with the symbolic and caption streams under the same confidence-weighted scheme.

(ii) Composite layouts. Multi-panel or densely composited figures (the bioinformatics loss) exceed any single typed schema; our parsers serialise within a type but do not yet capture cross-panel relationships.

(iii) Single primary image. We use the first attached image per query; principled multi-image (and video) fusion is left to future work.

(iv) Dependence on a capable Vision LLM. Routing and parsing assume a sufficiently strong vision model; although the redundancy, fallback, and validation mechanisms of Section 3.13 make the pipeline tolerant of individual errors, very small vision models may degrade parse quality.

(v) Self-reported confidence. Fusion weights are modulated by the router’s self-reported confidence, whose calibration we do not independently verify.

Future work.

Beyond the hybrid natural-image stream above, promising directions include: calibrated or learned routing with adaptive parser selection; substituting specialist parsers (DePlot, Nougat) and extending the taxonomy with new typed parsers (tables, molecular structures, musical notation) to broaden the definition of “structure”; composing VISA’s purely textual streams with a downstream factor-decomposed LLM reranker; and validating the approach across additional reasoning-intensive multimodal benchmarks.

8. Conclusions

We introduced VISA, a training-free Visual Symbolic Agent for reasoning-intensive multimodal retrieval. Instead of forcing image content into a dense multimodal embedding space, VISA converts the query image into typed symbolic text through a Vision LLM router and a set of structured parser prompts. The resulting symbolic evidence is kept separate from both the raw query and a holistic caption, yielding three complementary text streams that are scored by a frozen retrieval LLM and combined through confidence-weighted fusion. On MM-BRIGHT multimodal-to-text, VISA achieves 32.4 nDCG@10 across 29 domains, improving by 4.8 points over the strongest dense multimodal baseline. The gains are especially clear in domains where images contain recoverable structure, such as equations, charts, code, screenshots, maps, circuits, and technical diagrams. Ablation results show that this improvement does not come from a single component: the raw-query, symbolic, and caption streams are all necessary, the router’s top-3 typed dispatch is load-bearing, and confidence-weighted fusion substantially outperforms confidence-agnostic alternatives. These results suggest that the main bottleneck in reasoning-intensive multimodal retrieval is not only model scale, but the representation substrate used to inject visual evidence into retrieval. For structured technical images, symbolic grounding provides a cleaner and more retriever-compatible interface than dense visual embeddings or free-form captions alone. More broadly, VISA shows that strong multimodal retrieval can be obtained without training a multimodal retriever: a Vision LLM can act as a parser, while retrieval itself remains purely text-based.

Author Contributions

Conceptualization, M.A. and M.S.K.; methodology, M.A., M.S.K. and M.M.; software, M.A. and M.F.S.; validation, M.A., M.S.K. and A.A.; formal analysis, M.A. and M.M.; investigation, M.A., M.S.K. and M.F.S.; resources, H.S.K.; data curation, M.A. and A.A.; writing—original draft preparation, M.A., M.S.K. and M.M.; writing—review and editing, A.A. and H.S.K.; visualization, M.A. and A.A.; supervision, H.S.K.; project administration, H.S.K.; funding acquisition, H.S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Innovative Human Resource Development for Local Intellectualization program through the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (IITP-2026-RS-2020-II201462, 50%), partly supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) (RS-2023-NR076833), and partly by the Regional Innovation System & Education (RISE) program through the (Chungbuk Regional Innovation System & Education Center), funded by the Ministry of Education (MOE) and the (Chungcheongbuk-do), Republic of Korea (2026-RISE-11-014-03).

Data Availability Statement

Our study are publicly available. Specifically, the MM-BRIGHT dataset is available through the following repository: https://huggingface.co/datasets/mm-bright/MM-BRIGHT (accessed on 9 June 2026), Additional information about the dataset and benchmark is also available on the project website: https://mm-bright.github.io/ (accessed on 9 June 2026).

Conflicts of Interest

The authors declare no conflict of interest.

References

Abdallah, A.; Ali, M.; Abdul-Mageed, M.; Jatowt, A. TEMPO: A Realistic Multi-Domain Benchmark for Temporal Reasoning-Intensive Retrieval. arXiv 2026, arXiv:2601.09523. [Google Scholar]
Su, H.; Yen, H.; Xia, M.; Shi, W.; Muennighoff, N.; Wang, H.Y.; Haisu, L.; Shi, Q.; Siegel, Z.; Tang, M.; et al. Bright: A realistic and challenging benchmark for reasoning-intensive retrieval. In Proceedings of the International Conference on Learning Representations, Singapore, 24–28 April 2025; International Conference on Learning Representations (ICLR): Amherst, MA, USA; Volume 2025, pp. 48941–48991.
Abdallah, A.; Mounis, M.D.; Abdalla, M.; Kasem, M.S.; Senussi, M.F.; Mahmoud, M.; Ali, M.; Jatowt, A.; Kang, H.S. MM-BRIGHT: A Multi-Task Multimodal Benchmark for Reasoning-Intensive Retrieval. arXiv 2026, arXiv:2601.09562. [Google Scholar]
Abdallah, A.; Ali, M.; Piryani, B.; Jatowt, A. BracketRank: Large Language Model Document Ranking via Reasoning-based Competitive Elimination. arXiv 2026, arXiv:2604.08834. [Google Scholar]
Xiao, C.; Hudson, G.T.; Moubayed, N.A. Rar-b: Reasoning as retrieval benchmark. arXiv 2024, arXiv:2404.06347. [Google Scholar]
Nussbaum, Z.; Duderstadt, B.; Mulyar, A. Nomic embed vision: Expanding the latent space. arXiv 2024, arXiv:2406.18587. [Google Scholar]
Koukounas, A.; Mastrapas, G.; Eslami, S.; Wang, B.; Akram, M.K.; Günther, M.; Mohr, I.; Sturua, S.; Wang, N.; Xiao, H. jina-clip-v2: Multilingual multimodal embeddings for text and images. arXiv 2024, arXiv:2412.08802. [Google Scholar]
Zhang, X.; Zhang, Y.; Xie, W.; Li, M.; Dai, Z.; Long, D.; Xie, P.; Zhang, M.; Li, W.; Zhang, M. GME: Improving universal multimodal retrieval by multimodal LLMs. arXiv 2024, arXiv:2412.16855. [Google Scholar]
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR,, Online, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
Efthymiou, A.; Rudinac, S.; Kackovic, M.; Wijnberg, N.; Worring, M. VL-KGE: Vision–Language Models Meet Knowledge Graph Embeddings. In Proceedings of the ACM Web Conference 2026, Dubai, United Arab Emirates, 13–17 April 2026; pp. 7552–7563. [Google Scholar]
Liu, F.; Piccinno, F.; Krichene, S.; Pang, C.; Lee, K.; Joshi, M.; Altun, Y.; Collier, N.; Eisenschlos, J. Matcha: Enhancing visual language pretraining with math reasoning and chart derendering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 12756–12770. [Google Scholar]
Abdallah, A.; Mozafari, J.; Piryani, B.; Jatowt, A. Dear: Dual-stage document reranking with reasoning agents via llm distillation. arXiv 2025, arXiv:2508.16998. [Google Scholar]
Liu, F.; Eisenschlos, J.; Piccinno, F.; Krichene, S.; Pang, C.; Lee, K.; Joshi, M.; Chen, W.; Collier, N.; Altun, Y. DePlot: One-shot visual language reasoning by plot-to-table translation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; pp. 10381–10399. [Google Scholar]
Blecher, L.; Cucurull Preixens, G.; Scialom, T.; Stojnic, R. Nougat: Neural optical understanding for academic documents. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024; International Conference on Learning Representations (ICLR): Amherst, MA, USA; Volume 2024, pp. 37646–37663.
Masry, A.; Do, X.L.; Tan, J.Q.; Joty, S.; Hoque, E. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Irelamd, 22–27 May 2022; pp. 2263–2279. [Google Scholar]
Methani, N.; Ganguly, P.; Khapra, M.M.; Kumar, P. Plotqa: Reasoning over scientific plots. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1527–1536. [Google Scholar]
Cormack, G.V.; Clarke, C.L.; Buettcher, S. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Boston, MA, USA, 19–23 July 2009; pp. 758–759. [Google Scholar]
Bai, S.; Cai, Y.; Chen, R.; Chen, K.; Chen, X.; Cheng, Z.; Deng, L.; Ding, W.; Gao, C.; Ge, C.; et al. Qwen3-vl technical report. arXiv 2025, arXiv:2511.21631. [Google Scholar]
Kwon, W.; Li, Z.; Zhuang, S.; Sheng, Y.; Zheng, L.; Yu, C.H.; Gonzalez, J.; Zhang, H.; Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, Koblenz, Germany, 23–26 October 2023; pp. 611–626. [Google Scholar]
Thakur, N.; Reimers, N.; Rücklé, A.; Srivastava, A.; Gurevych, I. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. arXiv 2021, arXiv:2104.08663. [Google Scholar]
Muennighoff, N.; Tazi, N.; Magne, L.; Reimers, N. Mteb: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, 2–6 May 2023; pp. 2014–2037. [Google Scholar]
Zhang, S.; Gao, Y.; Zhou, X.; Zhao, Y.; Song, T.; Cohan, A.; Luu, A.T.; Zhao, C. MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval. arXiv 2025, arXiv:2510.09510. [Google Scholar]
Lin, Y.; Ding, G.; Zhou, H.; Li, H.; Yang, M.; Peng, X. Ark: A dual-axis multimodal retrieval benchmark along reasoning and knowledge. arXiv 2026, arXiv:2602.09839. [Google Scholar]
Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.t. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6769–6781. [Google Scholar]
Wang, L.; Yang, N.; Huang, X.; Yang, L.; Majumder, R.; Wei, F. Improving text embeddings with large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 11897–11916. [Google Scholar]
Muennighoff, N.; Su, H.; Wang, L.; Yang, N.; Wei, F.; Yu, T.; Singh, A.; Kiela, D. Generative representational instruction tuning. In Proceedings of the International Conference on Learning Representations, Singapore, 24–28 April 2025; International Conference on Learning Representations (ICLR): Amherst, MA, USA; Volume 2025, pp. 45544–45613.
Shao, R.; Qiao, R.; Kishore, V.; Muennighoff, N.; Lin, X.V.; Rus, D.; Low, B.K.H.; Min, S.; Yih, W.t.; Koh, P.W.; et al. Reasonir: Training retrievers for reasoning tasks. arXiv 2025, arXiv:2504.20595. [Google Scholar]
Long, M.; Sun, D.; Yang, D.; Wang, J.; Luo, Y.; Shen, Y.; Wang, J.; Zhou, H.; Guo, C.; Wei, P.; et al. Diver: A multi-stage approach for reasoning-intensive information retrieval. arXiv 2025, arXiv:2508.07995. [Google Scholar]
Jiang, Z.; Meng, R.; Yang, X.; Yavuz, S.; Zhou, Y.; Chen, W. VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks. arXiv 2025, arXiv:2410.05160. [Google Scholar]
Zhuang, H.; Qin, Z.; Hui, K.; Wu, J.; Yan, L.; Wang, X.; Bendersky, M. Beyond yes and no: Improving zero-shot llm rankers via scoring fine-grained relevance labels. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers); ACL: Stroudsburg, PA, USA, 2024; pp. 358–370. [Google Scholar]
Pradeep, R.; Sharifymoghaddam, S.; Lin, J. Rankvicuna: Zero-shot listwise document reranking with open-source large language models. arXiv 2023, arXiv:2309.15088. [Google Scholar]
Sun, W.; Yan, L.; Ma, X.; Wang, S.; Ren, P.; Chen, Z.; Yin, D.; Ren, Z. Is ChatGPT good at search? investigating large language models as re-ranking agents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 14918–14937. [Google Scholar]
Weller, O.; Ricci, K.; Yang, E.; Yates, A.; Lawrie, D.; Van Durme, B. Rank1: Test-time compute for reranking in information retrieval. arXiv 2025, arXiv:2502.18418. [Google Scholar]
Zhang, L.; Wang, B.; Qiu, X.; Reddy, S.; Agrawal, A. Rearank: Reasoning re-ranking agent via reinforcement learning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, 4–9 November 2025; pp. 2458–2471. [Google Scholar]
Gao, L.; Ma, X.; Lin, J.; Callan, J. Precise zero-shot dense retrieval without relevance labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 1762–1777. [Google Scholar]
Wang, L.; Yang, N.; Wei, F. Query2doc: Query expansion with large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 9414–9423. [Google Scholar]
Zhong, Y.; Yang, J.; Fan, Y.; Su, L.; de Rijke, M.; Zhang, R.; Cheng, X. Reasoning-enhanced Query Understanding through Decomposition and Interpretation. arXiv 2025, arXiv:2509.06544. [Google Scholar]
Trivedi, H.; Balasubramanian, N.; Khot, T.; Sabharwal, A. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 10014–10037. [Google Scholar]
Singh, A.; Ehtesham, A.; Kumar, S.; Khoei, T.T.; Vasilakos, A.V. Agentic retrieval-augmented generation: A survey on agentic rag. arXiv 2025, arXiv:2501.09136. [Google Scholar]
Pink, M.; Wu, Q.; Vo, V.A.; Turek, J.; Mu, J.; Huth, A.; Toneva, M. Position: Episodic memory is the missing piece for long-term llm agents. arXiv 2025, arXiv:2502.06975. [Google Scholar]
Wiratunga, N.; Abeyratne, R.; Jayawardena, L.; Martin, K.; Massie, S.; Nkisi-Orji, I.; Weerasinghe, R.; Liret, A.; Fleisch, B. CBR-RAG: Case-based reasoning for retrieval augmented generation in LLMs for legal question answering. In Proceedings of the International Conference on Case-Based Reasoning; Springer: Berlin/Heidelberg, Germany, 2024; pp. 445–460. [Google Scholar]
Yeo, W.; Kim, K.; Yoon, J.; Hwang, S.J. Worldmm: Dynamic multimodal memory agent for long video reasoning. arXiv 2025, arXiv:2512.02425. [Google Scholar]
Ma, J.; Wang, J.; Zhao, W.X.; Liu, G.; Wen, X. Spatio-Temporal Data Enhanced Vision-Language Model for Traffic Scene Understanding. IEEE Trans. Intell. Transp. Syst. 2025, 27, 1248–1266. [Google Scholar]
Zhu, L.; Han, D.; Shen, X.; Chen, C.; Li, K.C. Enhancing image–text matching through multi-level semantic consistency alignment: L. Zhu et al. Vis. Comput. 2025, 41, 9555–9570. [Google Scholar] [CrossRef]
Li, X.; Huang, Z.; Liu, Y.; Wang, Y. Accelerating Outlier-Robust Point Cloud Registration by Known Gravity Directions. IEEE Trans. Autom. Sci. Eng. 2026, 23, 2310–2323. [Google Scholar] [CrossRef]

Figure 1. Overview of the VISA pipeline: one Vision LLM routes the query image to typed parsers and a captioner, and a frozen text retrieval LLM scores and fuses the resulting three streams (A raw query, B query ⊕ symbolic, C query ⊕ caption) into the final ranking. The stages are described in the text.

Figure 2. Exact VISA prompts (verbatim). (Top:) the zero-shot router prompt that classifies the query image into up to three of the nine parser types with calibrated confidences. (Bottom:) a representative typed parser prompt (the equation parser), which enforces a strict output schema.

Table 2. nDCG@10 on MM-BRIGHT multimodal-to-text (Query+Image → Documents) across 29 domains. Baseline numbers from Table 4 of Abdallah et al. [3]. The right-most column is VISA (ours). Bold marks the best model in each row. The strongest dense baseline (Nomic-Vision) is placed immediately to the left of VISA for direct comparison, and an unweighted per-category average row is added after each of the four domain groups.

Domain	BGE-VL	CLIP	GME-2B	GME-7B	Jina-CLIP	SigLIP	Nomic-Vision	VISA
STEM & Life Sciences
academia	4.2	4.8	16.2	27.6	22.3	3.6	22.6	31.2
biology	5.7	14.8	22.9	15.2	20.5	11.9	26.9	32.0
chemistry	10.8	9.6	27.2	21.9	30.6	11.6	30.6	33.2
physics	6.8	6.1	13.3	14.0	14.4	7.3	17.2	19.9
math	13.1	17.9	16.4	9.3	27.0	15.3	34.0	33.4
earthscience	10.1	10.9	20.5	26.2	24.6	11.8	30.1	33.2
bioacoustics	13.3	11.4	10.5	13.4	19.4	14.8	23.4	24.4
bioinformatics	11.6	9.4	21.1	19.2	23.7	16.8	33.8	22.1
medicalsciences	12.6	9.8	22.7	19.0	26.8	9.1	33.9	33.0
Category Avg	9.8	10.5	19.0	18.4	23.3	11.4	28.1	29.2
Software & Technical Systems
apple	7.2	12.3	23.9	17.0	24.3	4.4	28.7	19.8
askubuntu	11.6	5.5	25.9	34.2	26.1	12.6	34.3	47.8
bitcoin	8.9	8.3	18.2	19.6	22.6	10.0	22.7	31.7
crypto	11.3	14.8	9.8	7.1	15.5	10.2	22.4	21.3
quantumcomputing	4.5	2.6	5.9	5.6	10.8	2.6	12.1	12.2
robotics	16.1	10.6	15.8	18.7	19.0	14.3	30.3	29.8
salesforce	14.2	2.3	31.1	47.3	32.3	6.5	26.2	54.9
Category Avg	10.5	8.1	18.7	21.4	21.5	8.7	25.2	31.1
Social Sciences & Humanities
economics	9.5	6.0	10.0	12.6	13.5	9.8	21.1	27.9
psychology	6.4	8.7	15.6	18.6	20.8	7.9	23.9	31.6
philosophy	2.4	5.4	15.2	18.0	19.4	7.0	21.7	18.5
law	10.2	19.7	30.7	35.0	35.3	16.4	47.6	50.6
christianity	8.9	15.0	20.0	26.5	21.0	13.0	30.9	37.1
islam	12.0	10.7	25.8	32.0	24.3	6.5	28.9	31.2
Category Avg	8.2	10.9	19.6	23.8	22.4	10.1	29.0	32.8
Applied Domains
aviation	9.6	15.4	16.2	17.0	24.3	9.2	24.1	26.8
gaming	17.5	19.1	41.6	43.9	45.6	21.4	43.1	61.2
gis	13.8	13.1	15.5	15.6	20.3	16.5	25.8	32.3
pm	8.6	8.9	21.9	33.2	20.5	12.4	27.6	40.3
sustainability	10.1	9.0	16.7	25.6	24.3	11.5	24.7	34.9
travel	10.1	16.1	23.9	30.8	26.6	13.1	36.7	44.3
quant	8.1	2.1	12.4	15.3	11.6	5.8	16.2	21.4
Category Avg	11.1	12.0	21.2	25.9	24.7	12.8	28.3	37.3
Average	10.0	10.4	19.5	22.0	23.0	10.8	27.6	32.4

Table 3. Stream leave-out ablation on MM-BRIGHT multimodal-to-text. Every row is averaged across all 29 domains. The

Δ

column is relative to the full VISA configuration (

A + B + C

).

Table 3. Stream leave-out ablation on MM-BRIGHT multimodal-to-text. Every row is averaged across all 29 domains. The

Δ

column is relative to the full VISA configuration (

A + B + C

).

Configuration	nDCG@10	Δ vs. Full
A only (raw query)	30.11	$- 2.24$
B only (symbolic)	27.32	$- 5.03$
C only (caption)	29.94	$- 2.41$
$A + C$ (no symbolic)	29.71	$- 2.64$
$A + B$ (no caption)	29.68	$- 2.67$
$B + C$ (no raw query)	30.34	$- 2.01$
$A + B + C$ (full VISA)	32.35	$0.00$

Table 4. Fusion-mode ablation on MM-BRIGHT multimodal-to-text, all 29 domains.

Fusion	nDCG@10	Δ vs. Default
Reciprocal Rank Fusion ( $k_{RRF} = 60$ )	29.37	$- 2.98$
Linear, equal weights ( $\bar{w} = 1 / 3$ each)	30.39	$- 1.96$
Linear, confidence-weighted (default)	32.35	$0.00$

Table 5. Router-strategy ablation. The Vision-LLM router with

k = 3

is the default; all 9 disables the router and runs every parser; random picks three types uniformly at random per image hash.

Table 5. Router-strategy ablation. The Vision-LLM router with

k = 3

is the default; all 9 disables the router and runs every parser; random picks three types uniformly at random per image hash.

Router Strategy	nDCG@10	Δ vs. Default
Top-1 (most-confident type)	28.39	$- 3.96$
Top-2	29.89	$- 2.46$
Top-3 (default)	32.35	$0.00$
All 9 parsers fire (no router)	29.36	$- 2.99$
Random parser type ( $k = 3$ )	29.28	$- 3.07$

Table 6. Parser leave-one-out: each row drops a single parser type from the nine-type taxonomy. The default is full VISA with all nine parser prompts available to the router.

Parser Dropped	nDCG@10	Δ vs. Default
`equation`	28.50	$- 3.85$
`circuit`	28.87	$- 3.48$
`screenshot`	28.92	$- 3.43$
`code`	29.73	$- 2.62$
`figure`	29.77	$- 2.58$
`diagram`	29.71	$- 2.64$
`map`	29.92	$- 2.43$
`chart`	30.15	$- 2.20$
`photograph`	30.60	$- 1.75$
None (default)	32.35	$0.00$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Abdalla, M.; Kasem, M.S.; Mahmoud, M.; Senussi, M.F.; Abdallah, A.; Kang, H.S. VISA-Agent: A Visual Symbolic Agent for Reasoning-Intensive Multimodal Retrieval. Mathematics 2026, 14, 2197. https://doi.org/10.3390/math14122197

AMA Style

Abdalla M, Kasem MS, Mahmoud M, Senussi MF, Abdallah A, Kang HS. VISA-Agent: A Visual Symbolic Agent for Reasoning-Intensive Multimodal Retrieval. Mathematics. 2026; 14(12):2197. https://doi.org/10.3390/math14122197

Chicago/Turabian Style

Abdalla, Mahmoud, Mahmoud SalahEldin Kasem, Mohamed Mahmoud, Mostafa Farouk Senussi, Abdelrahman Abdallah, and Hyun Soo Kang. 2026. "VISA-Agent: A Visual Symbolic Agent for Reasoning-Intensive Multimodal Retrieval" Mathematics 14, no. 12: 2197. https://doi.org/10.3390/math14122197

APA Style

Abdalla, M., Kasem, M. S., Mahmoud, M., Senussi, M. F., Abdallah, A., & Kang, H. S. (2026). VISA-Agent: A Visual Symbolic Agent for Reasoning-Intensive Multimodal Retrieval. Mathematics, 14(12), 2197. https://doi.org/10.3390/math14122197

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

VISA-Agent: A Visual Symbolic Agent for Reasoning-Intensive Multimodal Retrieval

Abstract

1. Introduction

2. Related Work

3. Method

3.1. Problem Setup and Notation

3.2. Pipeline Overview

3.3. Vision LLM Router

3.4. Parser Toolkit

3.5. Symbolic Block Construction

3.6. Caption Provider

3.7. Stream Construction

3.8. Per-Stream Retrieval

3.9. Score Fusion

3.10. Caching and Compute Footprint

3.11. Algorithm

3.12. Design Discussion

3.13. Robustness to Routing Errors and Prompt Design

4. Experimental Setup

4.1. Benchmark and Task

4.2. Models

4.3. Hyperparameters

4.4. Baselines

5. Results

5.1. Main Result

5.2. Where the Gains Concentrate

5.3. Qualitative Analysis: Where VISA Wins and Loses

6. Ablations

6.1. Stream Leave-Out

6.2. Fusion Mode

6.3. Router Strategy

6.4. Parser Leave-One-Out

6.5. Summary of Findings

7. Limitations and Future Work

8. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI