Retrieval-Guided and Semantically Grounded Image Captioning for Open-Domain Scenes

Lin, Shanshan; Xie, Xiaoxuan; Yang, Zexian; Chen, Chao

doi:10.3390/math14101667


by Shanshan Lin ¹, Xiaoxuan Xie ¹, Zexian Yang ¹ and Chao Chen ²,*

¹ College of Computer and Data Science, Fuzhou University, Fuzhou 350116, China

² School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518000, China

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(10), 1667; https://doi.org/10.3390/math14101667

Submission received: 3 April 2026 / Revised: 2 May 2026 / Accepted: 10 May 2026 / Published: 13 May 2026

(This article belongs to the Section E1: Mathematics and Computer Science)


Abstract

Recent image captioning methods based on pre-trained vision–language models can generate fluent and coherent descriptions, yet they still struggle in open-domain scenes that contain long-tail concepts, uncommon object combinations, and ambiguous visual evidence. Two limitations are especially important. First, the knowledge needed to recognize and name rare or domain-specific entities is only weakly represented in model parameters, causing captions to be generic, incomplete, or biased toward frequent concepts. Second, token generation is typically grounded mainly by local visual matching, making it sensitive to clutter, occlusion, and visually similar distractors, and therefore prone to attribute errors, relation confusion, and object hallucination. To address these issues, we propose R2G (retrieval- and grounding-guided captioning), a lightweight plug-in framework for frozen image captioning backbones. R2G consists of two complementary components. The first, retrieval-guided visual prompting, retrieves image-relevant concepts from an external visual concept memory, converts them into a continuous prompt representation, and injects this representation into selected layers of the visual encoder, so that external semantic information can influence visual feature formation before decoding begins. The second, global–local semantic grounding, derives a global semantic prior from an auxiliary vision–language encoder and adaptively fuses it with token-level local visual evidence through a decoder-state-dependent gating mechanism, thereby improving semantic stability while preserving fine-grained visual support. The resulting framework is lightweight, compatible with frozen pre-trained backbones, and designed to improve both concept coverage and semantic faithfulness. Experimental results on MS-COCO and NoCaps show that R2G consistently improves caption quality over the baseline and yields particularly clear gains in open-domain and out-of-domain settings.

Keywords: image captioning; open-domain captioning; retrieval augmentation; visual prompting; semantic grounding; vision–language models; hallucination mitigation

MSC: 68T45; 68T50

1. Introduction

Image captioning aims to generate natural language descriptions that are fluent, informative, and faithful to visual content. The field has evolved from early encoder–decoder [1] and attention-based models [2,3] to large-scale pre-trained vision–language systems such as SimVLM [4], BLIP [5], and BLIP-2 [6]. These advances have improved language quality and transferability. However, robust caption generation remains challenging in open-domain settings, where images often contain long-tail objects, uncommon scene compositions, and fine-grained semantic distinctions.

This difficulty arises from two limitations. The first is insufficient concept coverage. Modern captioning models largely rely on knowledge stored in parameters, which makes them less reliable when rare entities or domain-specific concepts appear at test time. In such cases, the generated captions are often vague, overly generic, or biased toward frequent language patterns. Retrieval augmentation offers a practical alternative by consulting external memory at inference time instead of forcing all knowledge into model weights [7,8,9]. However, most existing retrieval-based captioning methods inject retrieved captions, textual descriptions, or concept cues mainly on the decoder side. While such late fusion may improve lexical access, it does not directly influence how the vision encoder forms visual evidence before sentence generation begins.

The second limitation is unstable semantic grounding during decoding. Even when relevant visual evidence is available, standard token-level cross-attention may still be distracted by background clutter, partial occlusion, or visually similar objects. Consequently, the model may generate descriptions that are linguistically plausible but not fully supported by the image, including mistaken relations and hallucinated objects [10,11,12]. This problem is especially severe in open-domain settings, where weak local evidence and long-tail concepts increase the tendency of the decoder to rely on language priors.

More importantly, these two limitations are coupled. If external knowledge is introduced only after visual evidence has already been formed, retrieval can enrich lexical choices but cannot guide the model toward better visual representations. Conversely, even when richer concepts are available, caption quality may still deteriorate if decoding is not stably grounded in the image. This observation suggests that a reliable open-domain captioning framework should improve both knowledge accessibility before generation and semantic grounding during generation in a unified manner.

To this end, we propose R2G (retrieval- and grounding-guided captioning), a lightweight plug-in framework for frozen image captioning backbones. R2G contains two complementary modules. The first, retrieval-guided visual prompting, retrieves image-relevant concepts from an external visual concept memory and converts them into a continuous prompt code that modulates selected layers of the visual encoder. Unlike decoder-side retrieval prompting, this mechanism injects external semantic information directly into the visual stream, allowing retrieved concepts to influence feature formation before decoding starts. The second, global–local semantic grounding, extracts a persistent global semantic prior from an auxiliary CLIP-style encoder [13] and fuses the prior with token-specific local visual evidence through a decoder-state-dependent gating mechanism. In this way, the model maintains scene-level semantic consistency while preserving fine-grained grounding for objects, attributes, and relations.

Experimental results on MS-COCO and NoCaps demonstrate the effectiveness of R2G. Specifically, R2G consistently outperforms the frozen BLIP-2 baseline across all reported metrics, both in the standard MS-COCO setting and on the more challenging NoCaps benchmark. The ablation studies further show that the two modules play complementary roles. These results support the central claim that open-domain captioning benefits from combining encoder-side knowledge injection with decoder-side semantic grounding.

Although this work is motivated by open-domain image captioning, it is also closely related to several themes of mathematical interest. Specifically, R2G can be viewed as a composition of three interacting mappings: a retrieval operator from an external memory, a prompt-conditioned modulation operator acting on latent visual representations, and a state-dependent fusion operator that combines global and local semantic evidence during decoding. Retrieval-guided visual prompting may be interpreted as a structured perturbation of encoder features induced by an auxiliary memory, while global–local semantic grounding can be regarded as an adaptive weighting mechanism over multiple information sources. From this perspective, R2G provides an application-driven setting in which ideas from operator analysis, dynamical systems, optimization, and statistical learning can help explain when external knowledge improves generation and when it may instead amplify noise or instability.

The main contributions of this paper are summarized as follows:

We identify open-domain image captioning as a joint problem of concept coverage and semantic grounding, and we introduce a unified framework that addresses both within a frozen-backbone setting.
We propose retrieval-guided visual prompting, which converts retrieved concepts into continuous prompt codes and injects them into selected visual encoder layers through lightweight modulation, enabling external knowledge to influence visual evidence formation early in the pipeline.
We propose global–local semantic grounding, which fuses a persistent scene-level semantic prior with token-specific local visual evidence through a gating mechanism, thereby improving semantic stability and reducing hallucination risk during decoding.
Experimental results on MS-COCO and NoCaps show that R2G consistently improves caption quality, confirming the effectiveness of the two key modules.

2. Related Work

2.1. Pre-Trained and Instruction-Tuned Vision–Language Captioning

Recent image captioning systems increasingly build on large pre-trained vision–language models (VLMs) and multimodal large language models (MLLMs), often in frozen or lightly tuned settings. BLIP-2 [6] and InstructBLIP [14] show that strong generation performance can be obtained by coupling frozen visual encoders and language models through lightweight bridging modules. Building on this trend, recent work has pushed captioning toward more detailed, controllable, and faithful generation. Examples include Altogether, which improves caption quality by re-aligning noisy alt-text supervision [15]; Painting with Words, which studies detailed captioning benchmarks and alignment learning [16]; Patch Matters, which enhances fine-grained captioning through local perception and hierarchical aggregation [17]; and RICO, which refines captions through visual reconstruction to improve accuracy and completeness [18]. Related grounded-generation systems such as GLaMM [19], DAM [20], and FINECAPTION [21] further emphasize localized and compositional description. Our work is complementary to this line: instead of redesigning the whole captioning stack or relying on richer supervision, we target open-domain captioning under a frozen-backbone regime and focus on how external concepts and semantic grounding can be injected efficiently into the generation process.

2.2. Retrieval-Augmented Captioning and External Memory

Retrieval augmentation has become an effective strategy for improving captioning without storing all knowledge in model parameters. Early works such as SmallCap [8] and EXTRA [22] retrieve related captions from external datastores and use them as additional textual evidence for generation. Follow-up work studies the robustness of such retrieval-conditioned captioners and shows that retrieved text can strongly shape final outputs, both positively and negatively [23]. More recent methods expand the form of retrieved knowledge: EVCap uses an external visual-name memory to retrieve object names for open-world captioning [24], while MeaCap introduces a memory-augmented framework for zero-shot captioning [9]. External context has also been explored in domain-specific captioning, for example in news image captioning with visually aware context modeling [25]. Compared with these approaches, our method differs in where retrieval is injected. Existing retrieval-based captioners mainly use retrieved captions, names, or context on the language side. In contrast, R2G converts retrieved concepts into a continuous prompt code and uses that code to modulate selected layers of the visual encoder, allowing retrieval to influence visual evidence formation before decoding begins.

2.3. Grounding, Hallucination, and Semantic Faithfulness

A central concern in modern captioning and multimodal generation is semantic faithfulness. Recent work has developed stronger benchmarks and metrics for hallucination analysis, such as ALOHa for open-vocabulary hallucination measurement [26] and THRONE for free-form hallucination evaluation [27]. On the mitigation side, LURE analyzes and revises object hallucinations in large VLMs [28]; M3ID reduces hallucination by increasing the influence of visual information during decoding [29]; and MOCHa introduces an open-vocabulary framework for measuring and reducing caption hallucinations, together with the OpenCHAIR benchmark [30]. More recent inference-time grounding methods include Visual Description Grounding, which uses grounded visual descriptions to reduce hallucinations [31]; DeCo, which performs dynamic correction decoding for hallucination mitigation [32]; and ICT, which intervenes across image- and object-level representations to improve trustworthiness during generation [33]. These works provide important insight into faithfulness, but many of them operate through post hoc revision, decoding-time correction, or explicit hallucination-oriented control. Our global–local semantic grounding module is different in purpose and mechanism: it builds a persistent global semantic prior once per image and fuses it with token-level local evidence through a decoder-state-dependent gate, so semantic stabilization is integrated into the standard captioning forward pass rather than added only as a correction step.

2.4. Prompting and Parameter-Efficient Adaptation

Prompt-based and parameter-efficient adaptation methods provide a natural foundation for lightweight multimodal generation. In the vision–language literature, MaPLe [34], PromptSRC [35], and DAPT [36] show that carefully designed prompts can adapt frozen vision–language backbones with limited trainable parameters. PromptKD further studies prompt-based knowledge distillation for efficient transfer [37]. In the generative VLM setting, Learning by Correction demonstrates that lightweight tuning tasks can improve zero-shot generative reasoning without full model updating [38]. Our work is related to this efficient adaptation perspective but differs in two important ways. First, our prompt is image-conditioned and retrieval-derived, rather than a static task prompt learned only from downstream supervision. Second, the prompt is injected into the visual encoder to reshape visual representations for captioning, rather than being used only as a generic adaptation.

2.5. Position of Our Work

Overall, our method lies at the intersection of retrieval-augmented captioning, grounded caption generation, and parameter-efficient adaptation. Relative to recent retrieval-based captioning methods, our novelty is to transform retrieved concepts into continuous encoder-side visual prompts rather than decoder-side textual cues. Relative to hallucination mitigation and grounding methods, our contribution is to stabilize generation through an explicit global–local fusion mechanism embedded in the autoregressive decoder. The novelty of R2G does not come from retrieval augmentation or gating alone, but from the integration strategy: external concepts are injected into the visual encoder before decoding through retrieval-guided modulation, and the resulting visual evidence is further stabilized during autoregressive generation by global–local semantic grounding. By combining early semantic injection with state-dependent grounding, R2G addresses concept coverage and semantic faithfulness within a unified frozen-backbone framework.

3. Method

3.1. Problem Formulation and Model Overview

Given an image I, the goal is to generate a caption

Y = \{y_{1}, \dots, y_{T}\}

that is both informative and visually faithful. We build on a frozen image captioning backbone consisting of a visual encoder

E_{v}

and an autoregressive language decoder D, following the parameter-efficient adaptation paradigm used in models such as BLIP-2 [6]. As Figure 1 shows, we introduce a plug-in method called R2G (retrieval- and grounding-guided captioning), which consists of two lightweight modules that address two complementary sources of captioning errors: missing knowledge before generation and unstable grounding during generation.

The idea behind R2G is simple: captioning errors in open-domain images arise not only from missing concepts, but also from unstable grounding. Accordingly, R2G addresses both aspects within a unified and parameter-efficient design. On the one hand, retrieval-guided visual prompting (RVP) improves access to long-tail concepts by injecting external knowledge into the visual encoder. Unlike prior retrieval-augmented captioning methods, which inject retrieved captions or object names primarily on the language side [22,24], RVP conditions visual feature formation before decoding begins. On the other hand, global–local semantic grounding (GLSG) stabilizes autoregressive generation by balancing persistent scene-level semantics with token-level visual evidence. Because both modules are lightweight, the framework remains compatible with frozen pre-trained backbones and standard decoding strategies such as greedy search and beam search.

3.2. Retrieval-Guided Visual Prompting

3.2.1. External Visual Concept Memory

The external visual concept memory is constructed offline from the training image set. Each memory item consists of a visual key and a concept set:

M = \{(k_{i}, C_{i})\}_{i = 1}^{M},

(1)

where

k_{i} \in \mathbb{R}^{d_{r}}

is extracted from a frozen visual retrieval encoder and

C_{i}

contains the normalized visual concepts associated with the reference image. In our implementation, the memory is built from the MS-COCO training split. Concepts are obtained from object annotations and salient noun or noun-phrase candidates in the reference captions. We lowercase all concepts, remove duplicates, filter extremely rare or non-visual words, and merge concepts with identical lemmas. Thus, each memory entry stores compact visual–semantic cues rather than full captions.

Given an input image I, we compute the retrieval query q from the visual global token and retrieve the top-K memory entries according to cosine similarity:

R_{K} (q, M) = {TopK}_{i} \cos (q, k_{i}) .

The retrieved concept sets are merged into the image-specific concept pool

\hat{C} (I) = ⋃_{i \in R_{K} (q, M)} C_{i} .

This memory provides semantically relevant candidates for open-domain concepts that may be under-represented in the parametric captioning model.
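The retrieval and merging steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `cosine` and `retrieve_concepts`, the two-dimensional keys, and the toy concept sets are all hypothetical, and an exact linear scan is assumed in place of any approximate nearest-neighbor index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def retrieve_concepts(query, memory, k):
    """Return (top-k indices, merged concept pool) for a query vector.

    `memory` is a list of (key_vector, concept_set) pairs, mirroring (k_i, C_i).
    """
    ranked = sorted(
        range(len(memory)),
        key=lambda i: cosine(query, memory[i][0]),
        reverse=True,
    )
    top = ranked[:k]
    pool = set()
    for i in top:
        pool |= memory[i][1]  # union of the retrieved concept sets
    return top, pool

# Toy memory with hypothetical keys and concepts (not taken from the paper).
toy_memory = [
    ([1.0, 0.0], {"dog", "frisbee"}),
    ([0.9, 0.1], {"dog", "grass"}),
    ([0.0, 1.0], {"pizza", "table"}),
]
top_idx, concept_pool = retrieve_concepts([1.0, 0.05], toy_memory, k=2)
```

With this toy memory, the two entries whose keys point in nearly the same direction as the query are selected, and their concept sets are merged with duplicates removed.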

3.2.2. Prompt Prediction from Retrieved Concepts

Let

\hat{C} (I) = \{c_{1}, \dots, c_{m}\}

denote the final concept pool for image I. Each concept is mapped by a pre-trained embedding layer to a vector

e (c_{j})

. The concept embeddings are summarized by attention pooling:

u = AttnPool ({e (c_{j})}_{j = 1}^{m}),

(2)

where u is a global concept representation. The prompt code is then predicted as

p = {MLP}_{p} (u),

(3)

where p is the conditioning vector used by the visual modulation layers.

This design keeps the retrieval branch lightweight: external concepts are injected as a continuous control signal rather than being serialized into text tokens and re-processed by the decoder.
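Equations (2) and (3) can be illustrated with a small sketch. The text does not specify the form of AttnPool or the MLP, so the code below assumes the simplest common choices: a single scoring vector `w` followed by a softmax for pooling, and a two-layer perceptron with a tanh nonlinearity for the prompt predictor. All names and shapes are hypothetical.

```python
import math

def softmax(scores):
    """Softmax with max-subtraction for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attn_pool(embeddings, w):
    """Pool concept embeddings e(c_j) into u using a scoring vector w (assumed form)."""
    scores = [sum(wi * ei for wi, ei in zip(w, e)) for e in embeddings]
    weights = softmax(scores)
    dim = len(embeddings[0])
    u = [
        sum(weights[j] * embeddings[j][d] for j in range(len(embeddings)))
        for d in range(dim)
    ]
    return u, weights

def linear(x, W, b):
    """Affine map W x + b with W given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def mlp_prompt(u, W1, b1, W2, b2):
    """Two-layer MLP mapping the pooled concept vector u to the prompt code p."""
    hidden = [math.tanh(v) for v in linear(u, W1, b1)]
    return linear(hidden, W2, b2)

# Two toy concept embeddings; a zero scoring vector gives uniform attention.
u, weights = attn_pool([[1.0, 0.0], [0.0, 1.0]], w=[0.0, 0.0])
identity = [[1.0, 0.0], [0.0, 1.0]]
p = mlp_prompt(u, identity, [0.0, 0.0], identity, [0.0, 0.0])
```

Because the output is a single fixed-size vector regardless of how many concepts were retrieved, the downstream modulation heads do not depend on the size of the concept pool.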

3.2.3. Prompt-Conditioned Visual Modulation

Let

E_{v}^{(l)}

denote the l-th block of the visual encoder, with

l \in \{1, \dots, L\}

. The hidden states are updated as

H^{(l)} = E_{v}^{(l)} (H^{(l - 1)}) .

(4)

For each selected block

l \in L_{\mathrm{mod}}

, the prompt code predicts a feature-wise scale and bias:

[γ^{(l)}; β^{(l)}] = \mathrm{MLP}_{\mathrm{mod}}^{(l)} (p) .

(5)

The block output is then modulated as

H^{(l)} \leftarrow H^{(l)} + γ^{(l)} ⊙ LN (H^{(l)}) + β^{(l)},

(6)

where

γ^{(l)}

and

β^{(l)}

are broadcast across all spatial tokens. The final retrieval-enhanced visual memory is

\tilde{V} = H^{(L)} \in \mathbb{R}^{(N + 1) \times d} .

(7)

Compared with decoder-side prompting [39], this encoder-side conditioning allows retrieved concepts to influence visual representation learning earlier in the forward pass, thereby improving the accessibility of visually grounded long-tail entities before sentence generation starts.
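The residual modulation in Equation (6) can be sketched as follows. This is an illustrative reading under stated assumptions: layer normalization is applied per token without a learned affine transform, and `gamma` and `beta` are plain per-feature lists that would in practice come from the modulation head in Equation (5). The function names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Per-token layer normalization without learned affine parameters (assumption)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def modulate(H, gamma, beta):
    """Apply H <- H + gamma * LN(H) + beta, broadcasting gamma and beta over tokens.

    H is a list of token vectors; gamma and beta are per-feature vectors.
    """
    out = []
    for h in H:
        ln = layer_norm(h)
        out.append([hi + g * li + b for hi, g, li, b in zip(h, gamma, ln, beta)])
    return out

H = [[1.0, 3.0], [2.0, 2.0]]
unchanged = modulate(H, [0.0, 0.0], [0.0, 0.0])  # zero scale and bias leave H intact
shifted = modulate(H, [0.0, 0.0], [1.0, 1.0])    # bias only: every token shifts by beta
scaled = modulate(H, [1.0, 1.0], [0.0, 0.0])     # scale only: adds LN(H) to each token
```

Because the update is residual, zero scale and bias reproduce the frozen encoder output exactly, which is one reason this form is a natural fit for a frozen backbone.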

3.3. Global–Local Semantic Grounding (GLSG)

3.3.1. Global Semantic Prior

Although RVP improves concept accessibility, it does not by itself guarantee stable token generation. We therefore introduce a persistent scene-level prior using an auxiliary CLIP-style visual encoder [13]. Let

f_{g} (I)

be the global image representation extracted by the auxiliary encoder. We project it into the decoder hidden space as

g = LN ({MLP}_{g} (f_{g} (I))) .

(8)

The vector g captures coarse scene semantics, such as dominant objects, scene type, and overall semantic scope. Because it is computed once per image outside the autoregressive decoding loop, it provides a more stable anchor than token-level attention alone.

3.3.2. State-Dependent Fusion of Global and Local Evidence

At decoding step t, let

h_{t} \in R^{d}

denote the decoder hidden state. Token-specific local evidence is obtained by cross-attention over the retrieval-enhanced visual memory:

c_{t} = CrossAttn (h_{t}, \tilde{V}) .

(9)

A gating network then balances the global prior and local context according to the current decoding state:

α_{t} = sigmoid (W_{h} h_{t} + W_{c} c_{t} + W_{g} g + b),

(10)

and then fuses the two information sources as

{\hat{c}}_{t} = LN (h_{t} + α_{t} ⊙ g + (1 - α_{t}) ⊙ c_{t}) .

(11)

The next-token distribution is then predicted by the frozen language head:

P (y_{t} ∣ y_{< t}, I, M) = D_{lm} ({\hat{c}}_{t}) .

(12)

When the current token mainly depends on scene-level semantics, the gate can place more weight on g. When the model needs specific entities, attributes, or relations, it can rely more on local visual evidence

c_{t}

. The result is a decoding process that is guided by a stable global anchor without losing token-level grounding.
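A minimal sketch of Equations (10) and (11) is given below. It assumes that the gate is a d-dimensional vector (so that the elementwise products are well defined), that the weight matrices are square, and that layer normalization has no learned affine parameters. The helper names and the toy inputs are hypothetical.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def matvec(W, x):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def gated_fusion(h, c, g, W_h, W_c, W_g, b):
    """Return (fused context, gate) following Equations (10) and (11)."""
    pre = [
        a + u + v + w
        for a, u, v, w in zip(matvec(W_h, h), matvec(W_c, c), matvec(W_g, g), b)
    ]
    alpha = [sigmoid(p) for p in pre]
    # alpha weights the global prior g; (1 - alpha) weights the local context c.
    mixed = [hi + a * gi + (1.0 - a) * ci for hi, a, gi, ci in zip(h, alpha, g, c)]
    return layer_norm(mixed), alpha

zeros = [[0.0, 0.0], [0.0, 0.0]]
fused, alpha = gated_fusion(
    h=[0.1, -0.2], c=[1.0, 0.0], g=[0.0, 1.0],
    W_h=zeros, W_c=zeros, W_g=zeros, b=[0.0, 0.0],
)
```

With all gate parameters at zero the gate sits at 0.5 in every dimension, so global and local evidence are averaged; training moves the gate away from this neutral point depending on the decoder state.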

3.4. Learning Framework

3.4.1. Algorithm Description

Algorithm 1 summarizes the unified training and inference procedure of R2G. We divide the procedure into four stages: retrieval, retrieval-guided visual prompting, global prior extraction, and autoregressive caption generation. The first three stages are performed once for each input image, while the global–local semantic grounding module is applied at every decoding step.

3.4.2. Training and Inference

During training, RVP and GLSG are optimized jointly using the standard captioning objective. Under teacher forcing, the model minimizes the negative log-likelihood of the reference caption:

L_{\mathrm{R2G}} = - \sum_{t = 1}^{T} \log P (y_{t}^{*} ∣ y_{< t}^{*}, I, M),

(13)

where

y_{t}^{*}

denotes the ground-truth token at step t, and

y_{< t}^{*}

denotes the corresponding ground-truth prefix. The gradients from

L_{\mathrm{R2G}}

update only the newly introduced lightweight components, while the large captioning backbone remains frozen. Optional lightweight regularization can be added to constrain modulation magnitudes or prevent gate saturation, but the core objective remains standard next-token prediction.
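As a small worked example of the objective in Equation (13), the sketch below sums the negative log-probabilities that the model assigns to the reference tokens under teacher forcing. The function name and the toy probability tables are hypothetical; in practice the per-step distributions would come from the frozen language head.

```python
import math

def caption_nll(step_probs, target_ids):
    """Negative log-likelihood of a reference caption.

    step_probs[t] is the predicted distribution over the vocabulary at step t,
    and target_ids[t] is the index of the ground-truth token y_t^*.
    """
    return -sum(math.log(probs[y]) for probs, y in zip(step_probs, target_ids))

# Toy two-step caption over a two-word vocabulary (illustrative values only).
loss = caption_nll([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```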

At inference time, retrieval is performed once for each image. The retrieved concepts are converted into a prompt code

p

, which produces the retrieval-enhanced visual memory

\tilde{V}

. Meanwhile, the auxiliary encoder extracts the global semantic prior g. The decoder then generates the caption autoregressively, using GLSG at each time step to fuse the global prior with token-level local visual evidence. This procedure is compatible with standard greedy decoding and beam search.

Algorithm 1 Unified training and inference procedure for R2G

Require: Image I, external memory $M$, reference caption $Y^{*}$ for training, trainable parameters $Θ$

Ensure: Updated parameters $Θ$ during training, or generated caption Y during inference
Stage 1: Retrieval from external memory

1: Extract the global retrieval token $v_{cls}$ from the base visual encoder and compute the retrieval query q.
2: Retrieve the top-K nearest neighbors from $M$ and merge their concept sets into $\hat{C} (I)$.

Stage 2: Retrieval-guided visual prompting

3: Embed the retrieved concepts and aggregate them into the concept representation u using Equation (2).
4: Predict the prompt code $p$ from u using Equation (3).
5: for each selected encoder block $l \in L_{\mathrm{mod}}$ do
6:     Predict the modulation parameters $γ^{(l)}$ and $β^{(l)}$ from $p$ using Equation (5).
7:     Modulate the visual representation $H^{(l)}$ to obtain ${\tilde{H}}^{(l)}$ using Equation (6).
8: end for
9: Obtain the retrieval-enhanced visual memory $\tilde{V}$.

Stage 3: Global semantic prior extraction

10: Compute the global semantic prior g using Equation (8).

Stage 4: Training or inference with global–local semantic grounding

11: if training then
12:     for $t = 1$ to T do
13:         Compute the decoder hidden state $h_{t}$ from the ground-truth prefix $y_{< t}^{*}$.
14:         Compute the local visual context $c_{t} = CrossAttn (h_{t}, \tilde{V})$ using Equation (9).
15:         Compute the gate $α_{t}$ and grounded context ${\hat{c}}_{t}$ using Equations (10) and (11).
16:         Predict $P (y_{t}^{*} ∣ y_{< t}^{*}, I, M)$.
17:     end for
18:     Update $Θ$ by minimizing $L_{\mathrm{R2G}}$.
19: else
20:     Initialize the decoder with the start token and, if used, a beam of hypotheses.
21:     for $t = 1$ to T do
22:         Compute the decoder hidden state $h_{t}$ from previously generated tokens $y_{< t}$.
23:         Compute $c_{t}$, $α_{t}$, and ${\hat{c}}_{t}$ as in training.
24:         Generate the next token from $P (y_{t} ∣ y_{< t}, I, M)$.
25:     end for
26:     Return the generated caption Y.
27: end if

3.4.3. Time Complexity

We further analyze the computational complexity of R2G to clarify the additional overhead. For RVP, exact top-K retrieval by cosine similarity costs

O (M d_{r})

, which is performed once per image. Concept embedding and attention pooling cost

O (m d)

. Prompt-conditioned modulation is applied only to the selected layers

l \in L_{\mathrm{mod}}

, and the modulation cost is

O (| L_{\mathrm{mod}} | N d)

. For GLSG, the global semantic prior is computed once per image. Computing the gate and the fused context costs O (d^{2}) per decoding step, which adds

O (T d^{2})

over the full sequence. Therefore, the total additional complexity of R2G is

O (M d_{r}) + O (m d) + O (| L_{\mathrm{mod}} | N d) + O (T d^{2}) .

(14)

Since retrieval and global prior extraction are performed once per image, the backbone remains frozen, and modulation is applied to only three layers (

L_{\mathrm{mod}} = \{6, 9, 12\}

), R2G introduces moderate overhead while preserving the efficiency of the frozen-backbone setting.
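To give a feel for the relative size of the four terms in Equation (14), the sketch below evaluates them for one set of sizes. Only the number of modulated layers (three) comes from the text; the memory size, dimensions, token count, and caption length are hypothetical values chosen for illustration, and constant factors hidden by the O-notation are ignored.

```python
def r2g_overhead_terms(M, d_r, m, d, N, num_mod_layers, T):
    """Leading-order operation counts for each term of Equation (14)."""
    return {
        "retrieval": M * d_r,                  # exact cosine scan over the memory
        "concept_pooling": m * d,              # embedding and attention pooling
        "modulation": num_mod_layers * N * d,  # scale and bias on selected layers
        "gated_fusion": T * d * d,             # gate and fusion over T decoding steps
    }

# Hypothetical sizes, except num_mod_layers=3, which follows the text.
terms = r2g_overhead_terms(
    M=100_000, d_r=512, m=20, d=768, N=256, num_mod_layers=3, T=20
)
total = sum(terms.values())
```

Under these assumed sizes the retrieval scan dominates, which suggests that an approximate nearest-neighbor index would be the first place to look if the memory grows much larger.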

3.5. Formal Analysis

3.5.1. Operator View of R2G

R2G can be written as the composition of a retrieval operator, an encoder-side modulation operator, and a decoder-side grounding operator. The retrieval operator selects the top-K memory entries according to cosine similarity:

R_{K} (q, M) = {TopK}_{i} \cos (q, k_{i}),

and returns the concept pool

\hat{C} (I)

. The retrieved concepts are then mapped to the prompt code p, which conditions the selected visual encoder layers through

M_{θ}^{(l)} (H^{(l)}, p) = H^{(l)} + γ^{(l)} (p) ⊙ LN (H^{(l)}) + β^{(l)} (p), \quad l \in L_{\mathrm{mod}} .

Thus, the final visual memory can be expressed compactly as

\tilde{V} = E_{θ} (I, M),

where

E_{θ}

denotes the frozen visual encoder equipped with retrieval-guided modulation.

During decoding, R2G fuses global and local semantic evidence by

{\hat{c}}_{t} = F_{ϕ} (h_{t}, c_{t}, g) = LN (h_{t} + α_{t} ⊙ g + (1 - α_{t}) ⊙ c_{t}),

where

α_{t} = σ (W_{h} h_{t} + W_{c} c_{t} + W_{g} g + b) .

The next-token distribution is then predicted as

P (y_{t} ∣ y_{< t}, I, M) = D_{lm} ({\hat{c}}_{t}) .

This formulation also clarifies the difference between R2G and decoder-side retrieval. In decoder-side retrieval, the visual memory

V = E_{v} (I)

is fixed before retrieved information is used:

P (y_{t} ∣ y_{< t}, I, R) = D_{lm} (h_{t}, V, τ (R (I))),

where

τ (R (I))

denotes retrieved textual or concept cues added to the language side. In contrast, R2G makes the visual memory itself retrieval-dependent:

\tilde{V} = E_{θ} (I, M) .

Therefore, retrieved concepts can influence visual feature formation before autoregressive generation begins, rather than only biasing lexical prediction after visual encoding has already been completed.

3.5.2. Stability Under Retrieval Perturbations

We next analyze how retrieval noise affects the prompt-conditioned visual representation. Let u be the aggregated concept representation before prompt prediction, and suppose noisy retrieval perturbs it to

u + Δ u

, with

∥ Δ u ∥ \leq ϵ_{r} .

Assume that the prompt predictor is Lipschitz continuous with constant

L_{p}

. Then the induced prompt perturbation satisfies

∥ Δ p ∥ \leq L_{p} ϵ_{r} .

For a modulated layer l, assume that the scale and bias predictors

γ^{(l)} (\cdot)

and

β^{(l)} (\cdot)

are Lipschitz continuous with constants

L_{γ}^{(l)}

and

L_{β}^{(l)}

, respectively, and that

∥ LN (H^{(l)}) ∥_{F} \leq B_{l}

. The perturbation introduced by noisy retrieval at this layer is bounded by

∥ Δ H^{(l)} ∥_{F} \leq (B_{l} L_{γ}^{(l)} + L_{β}^{(l)}) L_{p} ϵ_{r} .

This bound shows that retrieval noise is filtered through the prompt predictor and the lightweight modulation heads. Therefore, the effect of noisy retrieval can be controlled by bounded prompt magnitudes, normalization, and applying modulation only to selected layers.
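The bound can be traced step by step. The sketch below makes two assumptions that the text leaves implicit: the input H^{(l)} to the modulated layer is held fixed, so that only the scale and bias change, and the Lipschitz constant of the bias head is measured in the Frobenius norm of the bias after broadcasting across tokens.

```latex
\begin{aligned}
\Delta H^{(l)}
  &= \Delta\gamma^{(l)} \odot \mathrm{LN}(H^{(l)}) + \Delta\beta^{(l)}, \\
\|\Delta H^{(l)}\|_{F}
  &\le \|\Delta\gamma^{(l)}\| \, \|\mathrm{LN}(H^{(l)})\|_{F} + \|\Delta\beta^{(l)}\| \\
  &\le \bigl(B_{l} L_{\gamma}^{(l)} + L_{\beta}^{(l)}\bigr) \, \|\Delta p\| \\
  &\le \bigl(B_{l} L_{\gamma}^{(l)} + L_{\beta}^{(l)}\bigr) \, L_{p} \, \epsilon_{r} .
\end{aligned}
```

The first inequality uses the triangle inequality and the fact that an elementwise product with a broadcast vector scales the Frobenius norm by at most the norm of that vector; the second applies the Lipschitz assumptions on the two heads; the third applies the bound on the prompt perturbation.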

3.5.3. Robustness of Global–Local Fusion

The global–local fusion in GLSG can be interpreted as reliability-aware semantic estimation. Let

z_{t}

denote the ideal semantic evidence for predicting token

y_{t}

. Suppose the global prior and local context are two noisy estimates of

z_{t}

:

g = z_{t} + e_{g}, c_{t} = z_{t} + e_{c},

where

e_{g}

and

e_{c}

are zero-mean independent errors with variances

σ_{g}^{2}

and

σ_{c}^{2}

. For a scalar fusion coefficient

α \in [0, 1]

, the fused estimate

{\bar{z}}_{t} = α g + (1 - α) c_{t}

has expected squared error

E ∥ {\bar{z}}_{t} - z_{t} ∥^{2} = α^{2} σ_{g}^{2} + {(1 - α)}^{2} σ_{c}^{2} .

The optimal coefficient is

α^{*} = \frac{σ_{c}^{2}}{σ_{g}^{2} + σ_{c}^{2}},

and the corresponding error is

\frac{σ_{g}^{2} σ_{c}^{2}}{σ_{g}^{2} + σ_{c}^{2}},

which is no larger than the error obtained by using either g or

c_{t}

alone. This result explains why combining global and local evidence can improve robustness: when local evidence is noisy, more weight should be assigned to the global prior; when the local evidence is reliable, the model should rely more on

c_{t}

. The proposed gate

α_{t}

learns a state-dependent approximation to this reliability-aware weighting.
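The closed-form weight can be checked numerically. The sketch below evaluates the expected squared error on a grid of fusion coefficients and compares the grid minimum with the analytical optimum; the variance values are arbitrary illustrations, not measurements from the paper.

```python
def fused_mse(alpha, var_g, var_c):
    """Expected squared error of alpha * g + (1 - alpha) * c_t under independent noise."""
    return alpha ** 2 * var_g + (1.0 - alpha) ** 2 * var_c

def optimal_alpha(var_g, var_c):
    """Closed-form minimizer: weight on the global prior grows with local noise."""
    return var_c / (var_g + var_c)

# Illustrative variances: the local evidence is four times noisier than the prior.
var_g, var_c = 1.0, 4.0
alpha_star = optimal_alpha(var_g, var_c)
best_error = fused_mse(alpha_star, var_g, var_c)
grid_error = min(fused_mse(i / 1000.0, var_g, var_c) for i in range(1001))
```

Here the optimal weight on the global prior is 0.8 and the fused error is 0.8, which is below both the error of the global prior alone (1.0) and that of the local context alone (4.0).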

3.5.4. Perturbation Control in the Gated Context

We further examine how perturbations propagate through the fused context. Let

Δ h_{t}

,

Δ c_{t}

, and

Δ g

denote perturbations in the decoder state, local evidence, and global prior. Since the sigmoid function is

1 / 4

-Lipschitz, the gate perturbation satisfies

∥ Δ α_{t} ∥ \leq \frac{1}{4} ∥W_{h} Δ h_{t} + W_{c} Δ c_{t} + W_{g} Δ g∥ .

Ignoring higher-order terms, the perturbation of the fused context is bounded by

∥ Δ {\hat{c}}_{t} ∥ \leq L_{LN} (∥ Δ h_{t} ∥ + ∥ α_{t} ⊙ Δ g ∥ + ∥ (1 - α_{t}) ⊙ Δ c_{t} ∥ + ∥ Δ α_{t} ⊙ (g - c_{t}) ∥),

where

L_{LN}

is the local Lipschitz constant of layer normalization. This inequality shows that the gate controls the contribution of each evidence source. A larger

α_{t}

suppresses perturbations from local evidence, while a smaller

α_{t}

reduces the influence of a coarse or noisy global prior.

3.5.5. Error Propagation During Autoregressive Decoding

Autoregressive captioning is sensitive to early semantic errors because each predicted token affects subsequent hidden states. Let

δ_{t}

denote the deviation between the actual decoder trajectory and an ideal grounded trajectory at step t, and let

ϵ_{t} = ∥ {\hat{c}}_{t} - z_{t} ∥

be the semantic evidence error. If the decoder transition is locally Lipschitz with respect to its previous state and fused context, then

δ_{t + 1} \leq L_{D} δ_{t} + L_{C} ϵ_{t},

where L_{D} and L_{C} are local sensitivity constants. Unrolling this recursion gives

δ_{t} \leq L_{D}^{t} δ_{0} + L_{C} \sum_{j = 0}^{t - 1} L_{D}^{t - 1 - j} ϵ_{j} .

Thus, reducing the per-step evidence error ϵ_{t} helps limit the accumulation of semantic errors across decoding steps. GLSG reduces this error by grounding each step in both a stable image-level prior and token-specific visual evidence.
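The unrolled bound can be compared against direct iteration of the one-step inequality (taken with equality). This is a minimal sketch; the constants L_D and L_C, the initial deviation, and the per-step error sequences are illustrative assumptions chosen only to show the mechanics.

```python
def unrolled_bound(delta0, eps, L_D, L_C):
    """L_D^t * delta0 + L_C * sum_j L_D^(t-1-j) * eps[j], with t = len(eps)."""
    t = len(eps)
    return L_D ** t * delta0 + L_C * sum(
        L_D ** (t - 1 - j) * eps[j] for j in range(t)
    )


def iterated_bound(delta0, eps, L_D, L_C):
    """Apply delta <- L_D * delta + L_C * eps_t step by step."""
    delta = delta0
    for e in eps:
        delta = L_D * delta + L_C * e
    return delta


# Assumed constants and error sequences (not measured values).
L_D, L_C, delta0 = 0.9, 1.0, 0.1
eps_high = [0.4, 0.4, 0.4, 0.4]  # larger per-step evidence error
eps_low = [0.2, 0.2, 0.2, 0.2]   # smaller per-step evidence error

bound_high = unrolled_bound(delta0, eps_high, L_D, L_C)
bound_low = unrolled_bound(delta0, eps_low, L_D, L_C)
print(bound_high, bound_low)
```

Because every coefficient in the sum is non-negative, lowering any ϵ_{j} can only lower the bound on the later deviations.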

Finally, if a hallucination-risk score H (\cdot) is locally Lipschitz with respect to the semantic evidence, then

| H (a) - H (b) | \leq L_{H} ∥ a - b ∥ .

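Therefore, taking a = {\hat{c}}_{t} and b = z_{t}, any reduction in E ∥ {\hat{c}}_{t} - z_{t} ∥ also tightens an upper bound on the expected gap between the hallucination risk under the fused evidence and the risk under the ideal evidence. Under the stated Lipschitz assumptions, this offers an analytical interpretation of the empirical CHAIR improvements observed in the ablation studies.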

4. Experiments

4.1. Experimental Setup

Datasets and evaluation metrics. We evaluate R2G on two standard image captioning benchmarks: MS-COCO under the Karpathy split [1] and the NoCaps validation set [40]. MS-COCO measures caption quality in a conventional closed-domain setting, whereas NoCaps emphasizes transfer to novel objects and open-domain concepts. Following standard practice, we report BLEU-4 [41], METEOR [42], ROUGE-L [43], CIDEr [44], and SPICE [45]; higher values indicate better caption quality. To further evaluate semantic faithfulness in the ablation study, we additionally report CHAIR-S and CHAIR-I [10]. CHAIR-S measures the proportion of generated sentences containing at least one hallucinated object, while CHAIR-I measures the proportion of hallucinated object mentions among all generated object mentions; lower values are better for both metrics.
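The two CHAIR definitions above can be made concrete with a small sketch. It assumes that object mentions have already been extracted from each caption and mapped to a fixed object vocabulary; the official CHAIR tooling performs that mapping with synonym lists over MS-COCO categories, which is not reproduced here. The function name and the toy data are hypothetical.

```python
def chair_scores(caption_mentions, image_objects):
    """Return (CHAIR-S, CHAIR-I) for parallel lists of mentions and ground truth.

    caption_mentions: list of lists, object mentions in each generated caption.
    image_objects: list of sets, objects actually present in each image.
    """
    hallucinated_sentences = 0
    hallucinated_mentions = 0
    total_mentions = 0
    for mentioned, present in zip(caption_mentions, image_objects):
        unsupported = [obj for obj in mentioned if obj not in present]
        hallucinated_mentions += len(unsupported)
        total_mentions += len(mentioned)
        if unsupported:
            hallucinated_sentences += 1
    chair_s = hallucinated_sentences / len(caption_mentions) if caption_mentions else 0.0
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    return chair_s, chair_i


# Toy example: only the second caption mentions an object ("remote") not in its image.
mentions = [["dog", "frisbee"], ["cat", "sofa", "remote"], ["person", "bench"]]
objects = [{"dog", "frisbee", "person"}, {"cat", "sofa"}, {"person", "bench"}]
chair_s, chair_i = chair_scores(mentions, objects)
print(chair_s, chair_i)
```

In this toy case one of three captions contains a hallucinated object (CHAIR-S = 1/3), and one of seven object mentions is hallucinated (CHAIR-I = 1/7).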

Implementation details. We build on a BLIP-2-style frozen-backbone captioner [6]. Unless otherwise stated, all optimization and decoding settings follow the base implementation. The learning rate is set to 2 \times 10^{-4} and the weight decay is 0.05. During decoding, we use beam search with beam size B = 5 and length penalty 0.9.

For RVP, the number of retrieved neighbors is set to K = 5. Each retrieved concept is embedded into a 512-dimensional vector, and the concept sequence is aggregated by a single-head attention-pooling module. The hidden dimension of the prompt-prediction MLP is 256. Prompt-conditioned modulation is applied to visual encoder layers {6, 9, 12}, and each modulation head is implemented as a two-layer linear network.

For GLSG, the global semantic prior is extracted from the penultimate-layer global representation of the auxiliary vision–language encoder f_{g}. The projected prior dimension is 768, and the hidden dimension of the gating network is 256.
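For reference, the hyperparameters stated above can be collected in a single configuration object. The sketch below is only a convenience summary: the class name and field names are hypothetical and do not correspond to any released R2G code, while the default values are the ones reported in this section.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class R2GConfig:
    """Hypothetical container for the settings reported in Section 4.1."""

    # Optimization and decoding
    learning_rate: float = 2e-4
    weight_decay: float = 0.05
    beam_size: int = 5
    length_penalty: float = 0.9

    # RVP (retrieval-guided visual prompting)
    num_retrieved: int = 5                      # K
    concept_dim: int = 512                      # concept embedding size
    prompt_mlp_hidden: int = 256                # prompt-prediction MLP
    modulation_layers: Tuple[int, ...] = (6, 9, 12)

    # GLSG (global-local semantic grounding)
    prior_dim: int = 768                        # projected global prior
    gate_hidden: int = 256                      # gating network


config = R2GConfig()
print(config)
```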

To ensure fair comparison across all methods, all experiments were performed under the same environment. Specifically, the platform used an Intel(R) Xeon(R) Silver 4314 processor with a base frequency of 2.40 GHz and an NVIDIA A30 GPU. The models were implemented in Python 3.9.19 based on the PyTorch 2.1.0 deep learning framework.

Compared methods. R2G is compared against several representative captioning baselines. BLIP [5] is a pre-trained VLM that bootstraps noisy image-text pairs through caption generation and filtering, and serves as a strong captioning baseline. SimVLM (huge) [4] is a large-scale VLM trained end-to-end with a simple prefix language modeling objective under weak supervision, representing a strong high-capacity pre-trained captioning system. EVCap (Vicuna-13B) [24] is a retrieval-augmented captioning method that uses an external visual-name memory to retrieve object names for open-world image captioning. BLIP-2 (OPT-6.7B) [6] connects a frozen image encoder and a frozen LLM through a lightweight Querying Transformer, achieving strong captioning performance with parameter-efficient adaptation. R2G adopts BLIP-2 as part of its frozen backbone.

4.2. Main Results on MS-COCO

Table 1 reports the main results on the MS-COCO Karpathy test split. Both RVP and GLSG improve the BLIP-2 baseline, and the full R2G model achieves the best overall performance. In particular, R2G reaches 43.5 BLEU-4, 32.0 METEOR, 149.2 CIDEr, 25.6 SPICE, and 59.2 ROUGE-L, corresponding to gains of +2.8, +2.8, +13.2, +3.2, and +2.0 over BLIP-2, respectively.
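The reported gains can be reproduced directly from the BLIP-2 and R2G rows of Table 1. The short sketch below only restates that arithmetic; the dictionary layout and variable names are illustrative.

```python
# Scores copied from the BLIP-2 (OPT-6.7B) and R2G rows of Table 1.
blip2 = {"BLEU-4": 40.7, "METEOR": 29.2, "CIDEr": 136.0, "SPICE": 22.4, "ROUGE-L": 57.2}
r2g = {"BLEU-4": 43.5, "METEOR": 32.0, "CIDEr": 149.2, "SPICE": 25.6, "ROUGE-L": 59.2}

# Absolute gain per metric, rounded to one decimal as in the table.
gains = {metric: round(r2g[metric] - blip2[metric], 1) for metric in blip2}

for metric, gain in gains.items():
    print(f"{metric}: +{gain}")
```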

Several conclusions for each module follow from Table 1. RVP yields the larger single-module gain in CIDEr, suggesting that retrieval-guided visual prompting improves concept accessibility and content coverage, especially for entities that are weakly represented in the base model parameters. GLSG, in turn, consistently improves SPICE and ROUGE-L over the baseline, which is consistent with its role in stabilizing object–attribute–relation grounding during decoding. When the two modules are combined, the gains are larger than those obtained by either module alone, indicating that encoder-side concept injection and decoder-side semantic grounding are complementary.

4.3. Open-Domain Generalization on NoCaps

We further evaluate open-domain generalization on the NoCaps validation set. Following the standard NoCaps protocol, the validation set is divided into three subsets: in-domain, which contains images whose object classes are covered by MS-COCO; near-domain, which contains images with both MS-COCO classes and novel classes; and out-of-domain, which contains images composed only of novel classes outside the MS-COCO label space. This partition makes it possible to assess not only overall caption quality, but also robustness under progressively stronger domain shift.
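The three-way partition described above can be expressed as a simple rule on the object classes present in an image. This is a sketch of the description only: the official subset labels are distributed with the NoCaps data and should be used in practice, and the function name and the tiny class list here are hypothetical.

```python
def nocaps_domain(image_classes, coco_classes):
    """Assign an image to in-domain, near-domain, or out-of-domain.

    image_classes: iterable of object classes annotated in the image.
    coco_classes: set of classes covered by MS-COCO.
    """
    image_classes = set(image_classes)
    if not image_classes:
        raise ValueError("image must contain at least one annotated class")
    novel = image_classes - coco_classes
    covered = image_classes & coco_classes
    if not novel:
        return "in-domain"       # only MS-COCO classes
    if covered:
        return "near-domain"     # mix of MS-COCO and novel classes
    return "out-of-domain"       # only novel classes


# Tiny illustrative subset of MS-COCO classes (assumed for the example).
coco_subset = {"person", "dog", "bicycle"}
print(nocaps_domain(["person", "dog"], coco_subset))
print(nocaps_domain(["person", "accordion"], coco_subset))
print(nocaps_domain(["accordion", "jellyfish"], coco_subset))
```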

Table 2 reports results on the NoCaps validation set. R2G consistently outperforms the BLIP-2 baseline in all domains and achieves the best overall performance, reaching 124.2 CIDEr and 15.8 SPICE. Relative to BLIP-2, this corresponds to gains of +9.5 CIDEr and +1.4 SPICE overall. The improvement is particularly clear on the out-of-domain subset, where R2G improves CIDEr from 116.8 to 127.2 and improves SPICE from 14.3 to 15.4.

The NoCaps results further clarify the role of each component under distribution shift. RVP is particularly helpful in the out-of-domain setting, where retrieval-derived prompts improve the visibility of novel or long-tail concepts during visual feature formation. GLSG yields the stronger single-module overall result, suggesting that stable semantic grounding remains crucial when local evidence is ambiguous, cluttered, or weak.

4.4. Ablation Studies

We next examine each key component on the MS-COCO validation set. Table 3 and Table 4 report separate ablations for GLSG and RVP, respectively. We separate the two studies for clarity because they target different failure modes: GLSG primarily addresses grounding stability and hallucination, whereas RVP primarily addresses concept accessibility and encoder-side knowledge injection.

Table 3 shows that the complete GLSG design improves both caption quality and semantic faithfulness. Removing the global prior causes CIDEr to drop by 3.9 points and increases CHAIR-I from 5.50% to 7.50%, indicating that scene-level semantics are important for suppressing unsupported object mentions. Replacing adaptive gating with concatenation or a fixed fusion weight also degrades both SPICE and CHAIR, which shows that the benefit comes not only from exposing the decoder to more information but from dynamically balancing global semantics and token-level local evidence.

Table 4 shows that the improvement is not explained by retrieval alone. Removing retrieval causes a clear drop, which confirms that external knowledge is beneficial. More importantly, replacing encoder-side modulation with decoder-side textual prompting or simple feature concatenation also underperforms the full model. This result supports the central design choice of RVP: retrieved concepts are most useful when they influence visual evidence formation before autoregressive decoding begins, rather than being injected only after the visual representation has already been formed.

4.5. Sensitivity Analysis

We finally examine the sensitivity of R2G to two key design choices and verify whether the default settings used in the main experiments are empirically justified. Overall, the results show that R2G is robust within a reasonable range of configurations, while the proposed defaults provide the most favorable balance between effectiveness and efficiency.

First, the retrieval size K directly controls the amount of external semantic information injected into the visual encoder. As shown in Figure 2, performance improves steadily when K increases from 1 to 5, with CIDEr rising from 145.0 to 147.8 and SPICE rising from 24.9 to 25.3. This trend indicates that retrieving a small set of additional neighbors helps expose the model to more relevant long-tail or visually related concepts. However, when K is further increased to 7 or 10, the gains saturate and then slightly decline. A likely reason is that lower-ranked neighbors are less semantically aligned with the query image and therefore introduce weaker or noisier concepts, which dilute the usefulness of the prompt signal.

We also evaluate the effect of modulation depth in the visual encoder. Figure 3 shows that applying modulation only in deep layers is suboptimal, because the injected semantic cues arrive too late to sufficiently affect feature formation, whereas shallow-biased modulation can interfere with more generic low-level representations. In contrast, modulation in middle-to-late layers consistently performs better, suggesting that these layers provide a better stage for integrating retrieved concepts with increasingly semantic visual features.

The retrieval size is set to K = 5 based on the sensitivity analysis in Figure 2. Increasing K from 1 to 5 improves CIDEr and SPICE because more relevant concepts are retrieved, while larger values introduce weaker neighbors and slightly reduce performance. Prompt-conditioned modulation is applied to layers {6, 9, 12}, because middle-to-late layers provide the best trade-off between semantic abstraction and computational cost. Shallow layers are more sensitive to low-level visual patterns, while very deep-only modulation leaves insufficient opportunity for retrieved concepts to influence visual feature formation. The hidden dimensions of the prompt-prediction MLP and gating network are selected to keep the additional parameter count small while preserving enough capacity for concept aggregation and adaptive fusion.

4.6. Robustness to Retrieval Noise

To assess the sensitivity of R2G to noisy or weakly relevant retrieval results, we corrupt the retrieved concept pool by replacing a proportion η \in {0, 0.2, 0.6, 1.0} of the retrieved concepts with randomly sampled concepts from unrelated images.
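The corruption procedure can be sketched as follows. The text does not specify how positions are chosen or how the proportion is rounded, so the uniform sampling of positions, the rounding rule, the function name, and the example concepts are all assumptions made for illustration.

```python
import random


def corrupt_concepts(retrieved, distractor_pool, eta, seed=0):
    """Replace a proportion eta of retrieved concepts with unrelated ones.

    retrieved: list of concepts retrieved for the query image.
    distractor_pool: concepts drawn from unrelated images.
    eta: noise ratio in [0, 1].
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    rng = random.Random(seed)
    num_replace = round(eta * len(retrieved))  # assumed rounding rule
    positions = rng.sample(range(len(retrieved)), num_replace)
    corrupted = list(retrieved)
    for pos in positions:
        corrupted[pos] = rng.choice(distractor_pool)
    return corrupted


# Hypothetical retrieved concepts for one image (K = 5) and an unrelated pool.
retrieved = ["zebra", "savanna", "grass", "stripe", "herd"]
pool = ["laptop", "pizza", "violin"]
for eta in (0.0, 0.2, 0.6, 1.0):
    print(eta, corrupt_concepts(retrieved, pool, eta))
```

With K = 5, the four noise ratios correspond to replacing 0, 1, 3, and 5 of the retrieved concepts under this rounding rule.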

As shown in Table 5, the performance decreases progressively as the noise ratio increases. This trend indicates that the proposed prompt prediction module and global–local grounding mechanism do not rely excessively on any single retrieved concept. Under moderate retrieval noise (η \leq 0.6), R2G still outperforms the frozen BLIP-2 baseline, suggesting that the continuous prompt representation and adaptive gating mechanism help reduce the influence of weakly relevant retrievals.

We also observe that, when η = 0.6, the performance becomes comparable to the variant without retrieval (the “w/o Retrieval” row in Table 4). This indicates that heavily corrupted retrieval provides limited useful semantic guidance. When η = 1.0, all retrieved concepts are replaced by unrelated concepts, leading to a clear performance drop.

4.7. Computational Overhead

As analyzed in Section 3.4.3, R2G is designed as a lightweight plug-in for frozen captioning backbones. To quantify the additional computation, we report the relative FLOPs of RVP, GLSG, and the full R2G model compared with the BLIP-2 baseline.

As shown in Table 6, RVP and GLSG introduce only 4% and 3% additional FLOPs, respectively, while the full R2G model increases FLOPs by 8% over the baseline. This moderate overhead comes from one-time retrieval, lightweight prompt-conditioned modulation, and the compact global–local gating module. These results confirm that R2G improves caption quality and semantic faithfulness while preserving the efficiency of the frozen-backbone setting.

5. Limitations and Future Work

Despite the encouraging results, several limitations remain and point to important directions for future research.

The effectiveness of RVP depends on the quality, diversity, and relevance of the external visual concept memory. When the retrieved neighbors are noisy, weakly aligned, or insufficiently diverse, the predicted prompt code may inject suboptimal semantic cues into the visual encoder. Thus, a promising next step is to develop noise-aware and confidence-aware retrieval strategies, such as learned reranking, retrieval calibration, or memory filtering based on semantic agreement between the image and retrieved concepts.
The GLSG module uses a compact scene-level prior to stabilize decoding, which is effective for improving overall semantic consistency. However, a single global representation may be too coarse for highly crowded scenes, fine-grained object distinctions, or complex relational descriptions involving multiple interacting entities. Therefore, future work could extend GLSG toward multi-granularity semantic grounding, for example by combining scene-level priors with region-level, object-level, or relation-level representations. Token-type-aware or relation-aware gating mechanisms may further help the decoder decide when to rely on global semantics, local evidence, or structured scene representations during caption generation.
The current experimental validation is mainly based on a BLIP-2-style frozen captioning backbone. Although the proposed RVP and GLSG modules are designed around general interfaces, including modifiable visual representations, decoder hidden states, and token-level visual contexts, we have not yet exhaustively evaluated R2G across a wider range of captioning architectures. Future work will extend R2G to more diverse vision–language backbones, such as encoder–decoder captioners, query-based VLMs, and recent multimodal large language models, to further verify its architectural generality and practical transferability.

6. Conclusions

In this paper, we presented R2G, a retrieval- and grounding-guided framework for open-domain image captioning that addresses two coupled challenges: insufficient concept coverage before generation and unstable semantic grounding during decoding. To tackle these issues in a unified frozen-backbone setting, we introduced retrieval-guided visual prompting, which injects retrieved external concepts into the visual encoder to influence feature formation early, and global–local semantic grounding, which stabilizes autoregressive generation by adaptively fusing scene-level semantic priors with token-level local visual evidence.

Experimental results on MS-COCO and NoCaps show that the proposed design consistently improves caption quality over the frozen BLIP-2 baseline, with particularly clear gains in open-domain and out-of-domain settings. The ablation studies further suggest that the two modules are complementary: RVP mainly strengthens concept accessibility and long-tail coverage, while GLSG improves semantic faithfulness and reduces hallucination by providing more stable grounding during decoding.

Overall, the proposed framework demonstrates that open-domain captioning benefits not only from accessing external knowledge, but also from integrating that knowledge at the right stage of the generation pipeline and maintaining stable semantic grounding throughout decoding. We hope this work provides a useful step toward image captioning systems that are more accurate, more faithful, and more robust in visually diverse real-world scenes.

7. Implications

The significance of the proposed R2G framework extends beyond the immediate improvement of caption quality on benchmark datasets. From an application perspective, the framework supports practical vision–language systems that must operate in open and changing environments, where rare concepts, visually ambiguous scenes, and incomplete parametric knowledge are common. By combining retrieval-guided concept injection with adaptive semantic grounding, R2G provides a lightweight strategy for improving descriptive accuracy and reducing unsupported content generation without requiring full retraining of a large backbone. This design is potentially useful in downstream scenarios such as assistive image description, multimedia indexing, human–robot interaction, digital content understanding, and domain-specific captioning systems in science, medicine, or remote sensing, where concept coverage and semantic faithfulness are both essential.

From a mathematical perspective, R2G also suggests several directions for further study. The retrieval-guided prompting module can be interpreted as a memory-dependent perturbation of latent representations, raising natural questions about the stability, boundedness, and sensitivity of encoder features under noisy or mismatched retrieval. Likewise, the global–local grounding module defines an adaptive fusion mechanism whose behavior may be studied in terms of information weighting, robustness, and error propagation across autoregressive decoding steps. These observations indicate that modern retrieval-augmented captioning systems can serve as concrete testbeds for mathematical analysis of coupled operators, controlled perturbations in high-dimensional representation spaces, and sequential inference with multiple interacting information sources.

Author Contributions

Conceptualization, S.L. and X.X.; methodology, X.X. and Z.Y.; validation, X.X.; formal analysis, S.L.; data curation, X.X.; visualization, S.L.; writing—original draft preparation, S.L.; writing—review and editing, S.L., Z.Y. and C.C.; supervision, Z.Y. and C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 62476060.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in the Deep Visual–Semantic Alignments for Generating Image Descriptions repository at https://cs.stanford.edu/people/karpathy/deepimagesent/ (accessed on 10 February 2025; paper link: https://arxiv.org/pdf/1412.2306, accessed on 10 February 2025), and via the public dataset link on the official NoCaps website at https://nocaps.org/download (accessed on 10 February 2025; paper link: https://arxiv.org/pdf/1812.08658, accessed on 10 February 2025). These data were derived from publicly available resources in the public domain.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Karpathy, A.; Li, F. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2015; pp. 3128–3137.
2. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2015; pp. 2048–2057.
3. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; pp. 6077–6086.
4. Wang, Z.; Yu, J.; Yu, A.W.; Dai, Z.; Tsvetkov, Y.; Cao, Y. SimVLM: Simple visual language model pretraining with weak supervision. In Proceedings of the International Conference on Learning Representations, Virtual, 25 April 2022.
5. Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2022; pp. 12888–12900.
6. Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2023; pp. 19730–19742.
7. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.T.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, BC, Canada, 6–12 December 2020; Volume 33, pp. 9459–9474.
8. Ramos, R.; Martins, B.; Elliott, D.; Kementchedjhieva, Y. Smallcap: Lightweight image captioning prompted with retrieval augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2023; pp. 2840–2849.
9. Zeng, Z.; Xie, Y.; Zhang, H.; Chen, C.; Chen, B.; Wang, Z. Meacap: Memory-augmented zero-shot image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 14100–14110.
10. Rohrbach, A.; Hendricks, L.A.; Burns, K.; Darrell, T.; Saenko, K. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 4035–4045.
11. Biten, A.F.; Gómez, L.; Karatzas, D. Let there be a clock on the beach: Reducing object hallucination in image captioning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: New York, NY, USA, 2022; pp. 1381–1390.
12. Petryk, S.; Whitehead, S.; Gonzalez, J.E.; Darrell, T.; Rohrbach, A.; Rohrbach, M. Simple token-level confidence improves caption correctness. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: New York, NY, USA, 2024; pp. 5742–5752.
13. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2021; pp. 8748–8763.
14. Dai, W.; Li, J.; Li, D.; Tiong, A.; Zhao, J.; Wang, W.; Li, B.; Fung, P.N.; Hoi, S. Instructblip: Towards general-purpose vision-language models with instruction tuning. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023; Volume 36, pp. 49250–49267.
15. Xu, H.; Huang, P.Y.; Tan, X.; Yeh, C.F.; Kahn, J.; Jou, C.; Ghosh, G.; Levy, O.; Zettlemoyer, L.; Yih, W.T.; et al. Altogether: Image captioning via re-aligning alt-text. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 19302–19318.
16. Ye, Q.; Zeng, X.; Li, F.; Li, C.; Fan, H. Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning. In Proceedings of the International Conference on Learning Representations, Singapore, 24–28 April 2025.
17. Peng, R.; He, H.; Wei, Y.; Wen, Y.; Hu, D. Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2025; pp. 3963–3973.
18. Wang, Y.; Cai, Y.; Ren, S.; Yang, S.; Yao, L.; Liu, Y.; Zhang, Y.; Wan, P.; Sun, X. Rico: Improving accuracy and completeness in image recaptioning via visual reconstruction. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 21796–21815.
19. Rasheed, H.; Maaz, M.; Shaji, S.; Shaker, A.; Khan, S.; Cholakkal, H.; Anwer, R.M.; Xing, E.; Yang, M.H.; Khan, F.S. Glamm: Pixel grounding large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 13009–13018.
20. Lian, L.; Ding, Y.; Ge, Y.; Liu, S.; Mao, H.; Li, B.; Pavone, M.; Liu, M.Y.; Darrell, T.; Yala, A.; et al. Describe anything: Detailed localized image and video captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2025; pp. 21766–21777.
21. Hua, H.; Liu, Q.; Zhang, L.; Shi, J.; Kim, S.Y.; Zhang, Z.; Wang, Y.; Zhang, J.; Lin, Z.; Luo, J. Finecaption: Compositional image captioning focusing on wherever you want at any granularity. In Proceedings of the Computer Vision and Pattern Recognition Conference; IEEE: New York, NY, USA, 2025; pp. 24763–24773.
22. Ramos, R.; Elliott, D.; Martins, B. Retrieval-augmented image captioning. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 3666–3681.
23. Li, W.; Li, J.; Ramos, R.; Tang, R.; Elliott, D. Understanding retrieval robustness for retrieval-augmented image captioning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 9285–9299.
24. Li, J.; Vo, D.M.; Sugimoto, A.; Nakayama, H. Evcap: Retrieval-augmented image captioning with external visual-name memory for open-world comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 13733–13742.
25. Qu, T.; Tuytelaars, T.; Moens, M.F. Visually-aware context modeling for news image captioning. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 2927–2943.
26. Petryk, S.; Chan, D.; Kachinthaya, A.; Zou, H.; Canny, J.; Gonzalez, J.; Darrell, T. Aloha: A new measure for hallucination in captioning models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 342–357.
27. Kaul, P.; Li, Z.; Yang, H.; Dukler, Y.; Swaminathan, A.; Taylor, C.; Soatto, S. Throne: An object-based hallucination benchmark for the free-form generations of large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 27228–27238.
28. Zhou, Y.; Cui, C.; Yoon, J.; Zhang, L.; Deng, Z.; Finn, C.; Bansal, M.; Yao, H. Analyzing and Mitigating Object Hallucination in Large Vision-Language Models. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
29. Favero, A.; Zancato, L.; Trager, M.; Choudhary, S.; Perera, P.; Achille, A.; Swaminathan, A.; Soatto, S. Multi-modal hallucination control by visual information grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 14303–14312.
30. Ben-Kish, A.; Yanuka, M.; Alper, M.; Giryes, R.; Averbuch-Elor, H. Mitigating open-vocabulary caption hallucinations. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 22680–22698.
31. Ghosh, S.; Evuru, C.K.R.; Kumar, S.; Tyagi, U.; Nieto, O.; Jin, Z.; Manocha, D. Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs. In Proceedings of the International Conference on Learning Representations, Singapore, 24–28 April 2025.
32. Wang, C.; Chen, X.; Zhang, N.; Tian, B.; Xu, H.; Deng, S.; Chen, H. MLLM Can See? Dynamic Correction Decoding for Hallucination Mitigation. In Proceedings of the International Conference on Learning Representations, Singapore, 24–28 April 2025.
33. Chen, J.; Zhang, T.; Huang, S.; Niu, Y.; Zhang, L.; Wen, L.; Hu, X. Ict: Image-object cross-level trusted intervention for mitigating object hallucination in large vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference; IEEE: New York, NY, USA, 2025; pp. 4209–4221.
34. Khattak, M.U.; Rasheed, H.; Maaz, M.; Khan, S.; Khan, F.S. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2023; pp. 19113–19122.
35. Khattak, M.U.; Wasim, S.T.; Naseer, M.; Khan, S.; Yang, M.H.; Khan, F.S. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2023; pp. 15190–15200.
36. Cho, E.; Kim, J.; Kim, H.J. Distribution-aware prompt tuning for vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2023; pp. 22004–22013.
37. Li, Z.; Li, X.; Fu, X.; Zhang, X.; Wang, W.; Chen, S.; Yang, J. Promptkd: Unsupervised prompt distillation for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 26617–26626.
38. Li, R.; Wu, Y.; He, X. Learning by correction: Efficient tuning task for zero-shot generative vision-language reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 13428–13437.
39. Kim, D.; Lee, G.; Shim, K.; Shim, B. Preserving pre-trained representation space: On effectiveness of prefix-tuning for large multi-modal models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 797–819.
40. Agrawal, H.; Desai, K.; Wang, Y.; Chen, X.; Jain, R.; Johnson, M.; Batra, D.; Parikh, D.; Lee, S.; Anderson, P. Nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2019; pp. 8948–8957.
41. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2002; pp. 311–318.
42. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization; Association for Computational Linguistics: Stroudsburg, PA, USA, 2005; pp. 65–72.
43. Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81.
44. Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2015; pp. 4566–4575.
45. Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. Spice: Semantic propositional image caption evaluation. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 382–398.

Figure 1. Overall architecture of R2G. RVP retrieves image-relevant concepts from an external visual concept memory and injects them into the visual encoder as prompt-conditioned modulation, while GLSG fuses a global semantic prior with local visual evidence during decoding to generate more faithful captions.

Figure 2. Sensitivity of the RVP module to the retrieval size $K$. Performance improves up to $K = 5$ and then saturates as additional neighbors introduce weaker concepts.

Figure 3. Effect of modulation-layer selection on performance and computational overhead. Middle-to-late encoder layers provide the best trade-off.

Table 1. Main results on the MS-COCO Karpathy test split. The best result in each column is shown in bold.

Method	BLEU-4	METEOR	CIDEr	SPICE	ROUGE-L
BLIP	40.4	–	136.7	–	–
SimVLM (huge)	40.6	–	143.3	–	–
EVCap (Vicuna-13B)	41.5	31.2	140.1	24.7	–
BLIP-2 (OPT-6.7B)	40.7	29.2	136.0	22.4	57.2
RVP only	42.8	30.7	147.8	25.3	58.9
GLSG only	42.3	30.4	144.9	24.9	58.6
R2G	43.5	32.0	149.2	25.6	59.2

Table 2. Results on the NoCaps validation set, reported as CIDEr∣SPICE. The best result in each column is shown in bold.

Method	In-Domain	Near-Domain	Out-of-Domain	Overall
BLIP	114.9∣15.2	112.1∣14.9	115.3∣14.4	113.2∣14.8
EVCap	111.7∣15.3	119.5∣15.6	116.5∣14.7	118.3∣15.3
BLIP-2	114.5∣14.5	112.5∣14.0	116.8∣14.3	114.7∣14.4
GLSG only	124.1∣16.0	118.9∣15.6	124.5∣15.3	122.5∣15.5
RVP only	123.5∣15.9	118.5∣15.5	124.9∣15.4	122.1∣15.5
R2G	124.9∣16.4	121.8∣15.9	127.2∣15.4	124.2∣15.8

Table 3. Ablation results for the GLSG module on the MS-COCO validation set. Lower CHAIR is better. The best result in each column is shown in bold.

Variant	CIDEr	SPICE	CHAIR-S	CHAIR-I
Full GLSG	144.9	24.9	0.110	0.055
w/o Global Prior	141.0	23.5	0.130	0.075
w/o Gating (Concat)	142.1	24.1	0.122	0.063
w/o Gating (Fixed $α = 0.5$)	141.9	24.0	0.120	0.061
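
The two gating variants in Table 3 can be contrasted with a small sketch. It is only an illustration under stated assumptions: the exact parameterization of the gate is given in Section 3.3.2 and is not reproduced here, so the scalar sigmoid gate, the convention that $α$ weights the global prior, and the names `fixed_fusion`, `gated_fusion`, `gate_weights`, and `gate_bias` are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fixed_fusion(global_prior, local_ctx, alpha=0.5):
    """Fixed-weight variant: the same alpha is used at every decoding step."""
    return [alpha * g + (1.0 - alpha) * l for g, l in zip(global_prior, local_ctx)]


def gated_fusion(global_prior, local_ctx, decoder_state, gate_weights, gate_bias=0.0):
    """State-dependent variant (assumed form): alpha is recomputed from the
    current decoder state, so the mix can change from token to token."""
    logit = sum(w * h for w, h in zip(gate_weights, decoder_state)) + gate_bias
    alpha = sigmoid(logit)  # assumption: alpha is the weight on the global prior
    fused = [alpha * g + (1.0 - alpha) * l for g, l in zip(global_prior, local_ctx)]
    return fused, alpha


# Toy vectors, chosen only to make the behaviour visible.
g = [1.0, 0.0]   # global semantic prior
l = [0.0, 1.0]   # token-level local visual evidence
w = [2.0, -1.0]  # hypothetical gate parameters

print(fixed_fusion(g, l))
print(gated_fusion(g, l, [1.0, 0.0], w))  # this state pushes alpha above 0.5
print(gated_fusion(g, l, [0.0, 2.0], w))  # this state pushes alpha below 0.5
```

With a zero decoder state and zero bias, the assumed gate returns $α = 0.5$ and coincides with the fixed-weight variant; the difference between the two rows of the table comes from letting $α$ vary with the decoder state.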

Table 4. Ablation results for the RVP module on the MS-COCO validation set. Lower CHAIR is better. The best result in each column is shown in bold.

Variant	CIDEr	SPICE	CHAIR-S	CHAIR-I
Full RVP	147.8	25.3	0.112	0.055
w/o Retrieval	142.1	24.6	0.124	0.064
Retrieval + TextPrompt (late fusion)	144.3	24.5	0.122	0.061
w/o Modulation (Concat)	144.9	24.8	0.128	0.070

Table 5. Robustness of R2G under noisy retrieval. A fraction $η$ of retrieved concepts is replaced by unrelated concepts on MS-COCO. Lower CHAIR is better.

Noise Ratio $η$	CIDEr	SPICE	CHAIR-S	CHAIR-I
0.0	149.2	25.6	0.100	0.049
0.2	146.1	25.3	0.117	0.055
0.6	142.9	24.6	0.124	0.063
1.0	141.2	24.2	0.130	0.070
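
The corruption procedure behind Table 5 can be sketched as follows. This is a minimal illustration rather than the authors' protocol: the caption states only that a fraction $η$ of retrieved concepts is replaced by unrelated ones, so the function name `inject_retrieval_noise`, the distractor pool, the rounding rule, and the seeding are all assumptions.

```python
import random


def inject_retrieval_noise(retrieved, distractor_pool, eta, seed=0):
    """Replace a fraction eta of the retrieved concepts with unrelated ones."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    rng = random.Random(seed)
    noisy = list(retrieved)
    # Assumption: the number of replaced concepts is eta * K, rounded to an integer.
    n_replace = round(eta * len(noisy))
    # Only use distractors that are not already among the retrieved concepts.
    candidates = [c for c in distractor_pool if c not in retrieved]
    if n_replace > len(candidates):
        raise ValueError("distractor pool is too small")
    positions = rng.sample(range(len(noisy)), n_replace)
    replacements = rng.sample(candidates, n_replace)
    for pos, concept in zip(positions, replacements):
        noisy[pos] = concept
    return noisy


# Hypothetical example with K = 5 retrieved concepts.
retrieved = ["dog", "frisbee", "grass", "park", "leash"]
pool = ["toaster", "submarine", "violin", "glacier", "stapler", "cactus"]
for eta in (0.0, 0.2, 0.6, 1.0):
    print(eta, inject_retrieval_noise(retrieved, pool, eta))
```

At $η = 0$ the retrieved set is unchanged, and at $η = 1$ every concept is a distractor, which corresponds to the first and last rows of the table.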

Table 6. Relative FLOPs compared with the BLIP-2 baseline.

Method	BLIP-2	RVP Only	GLSG Only	R2G
Relative FLOPs	$1.00 \times$	$1.04 \times$	$1.03 \times$	$1.08 \times$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
