1. Introduction
Image captioning aims to generate natural language descriptions that are fluent, informative, and faithful to visual content. The field has evolved from early encoder–decoder [1] and attention-based models [2,3] to large-scale pre-trained vision–language systems such as SimVLM [4], BLIP [5], and BLIP-2 [6]. These advances have improved language quality and transferability. However, robust caption generation remains challenging in open-domain settings, where images often contain long-tail objects, uncommon scene compositions, and fine-grained semantic distinctions.
This difficulty arises from two limitations. The first is insufficient concept coverage. Modern captioning models largely rely on knowledge stored in parameters, which makes them less reliable when rare entities or domain-specific concepts appear at test time. In such cases, the generated captions are often vague, overly generic, or biased toward frequent language patterns. Retrieval augmentation offers a practical alternative by consulting external memory at inference time instead of forcing all knowledge into model weights [7,8,9]. However, most existing retrieval-based captioning methods inject retrieved captions, textual descriptions, or concept cues mainly on the decoder side. While such late fusion may improve lexical access, it does not directly influence how the vision encoder forms visual evidence before sentence generation begins.
The second limitation is unstable semantic grounding during decoding. Even when relevant visual evidence is available, standard token-level cross-attention may still be distracted by background clutter, partial occlusion, or visually similar objects. Consequently, the model may generate descriptions that are linguistically plausible but not fully supported by the image, including mistaken relations and hallucinated objects [10,11,12]. This problem is especially severe in open-domain settings, where weak local evidence and long-tail concepts increase the tendency of the decoder to rely on language priors.
More importantly, these two limitations are coupled. If external knowledge is introduced only after visual evidence has already been formed, retrieval can enrich lexical choices but cannot guide the model toward better visual representations. Conversely, even when richer concepts are available, caption quality may still deteriorate if decoding is not stably grounded in the image. This observation suggests that a reliable open-domain captioning framework should improve both knowledge accessibility before generation and semantic grounding during generation in a unified manner.
To this end, we propose R2G (retrieval- and grounding-guided captioning), a lightweight plug-in framework for frozen image captioning backbones. R2G contains two complementary modules. The first, retrieval-guided visual prompting, retrieves image-relevant concepts from an external visual concept memory and converts them into a continuous prompt code that modulates selected layers of the visual encoder. Unlike decoder-side retrieval prompting, this mechanism injects external semantic information directly into the visual stream, allowing retrieved concepts to influence feature formation before decoding starts. The second, global–local semantic grounding, extracts a persistent global semantic prior from an auxiliary CLIP-style encoder [13] and fuses the prior with token-specific local visual evidence through a decoder-state-dependent gating mechanism. In this way, the model maintains scene-level semantic consistency while preserving fine-grained grounding for objects, attributes, and relations.
Experimental results on MS-COCO and NoCaps demonstrate the effectiveness of R2G. Specifically, R2G consistently outperforms the frozen BLIP-2 baseline across all reported metrics in both a standard MS-COCO setting and a challenging NoCaps benchmark. The ablation studies further show that the two modules play complementary roles. These results support the central claim that open-domain captioning benefits from combining encoder-side knowledge injection with decoder-side semantic grounding.
Although this work is motivated by open-domain image captioning, it is also closely related to several themes of mathematical interest. Specifically, R2G can be viewed as a composition of three interacting mappings: a retrieval operator from an external memory, a prompt-conditioned modulation operator acting on latent visual representations, and a state-dependent fusion operator that combines global and local semantic evidence during decoding. Retrieval-guided visual prompting may be interpreted as a structured perturbation of encoder features induced by an auxiliary memory, while global–local semantic grounding can be regarded as an adaptive weighting mechanism over multiple information sources. From this perspective, R2G provides an application-driven setting in which ideas from operator analysis, dynamical systems, optimization, and statistical learning can help explain when external knowledge improves generation and when it may instead amplify noise or instability.
The main contributions of this paper are summarized as follows:
We identify open-domain image captioning as a joint problem of concept coverage and semantic grounding, and we introduce a unified framework that addresses both within a frozen-backbone setting.
We propose retrieval-guided visual prompting, which converts retrieved concepts into continuous prompt codes and injects them into selected visual encoder layers through lightweight modulation, enabling external knowledge to influence visual evidence formation early in the pipeline.
We propose global–local semantic grounding, which fuses a persistent scene-level semantic prior with token-specific local visual evidence through a gating mechanism, thereby improving semantic stability and reducing hallucination risk during decoding.
Experimental results on MS-COCO and NoCaps show that R2G consistently improves caption quality, confirming the effectiveness of the two key modules.
2. Related Work
2.1. Pre-Trained and Instruction-Tuned Vision–Language Captioning
Recent image captioning systems increasingly build on large pre-trained vision–language models (VLMs) and multimodal large language models (MLLMs), often in frozen or lightly tuned settings. BLIP-2 [6] and InstructBLIP [14] show that strong generation performance can be obtained by coupling frozen visual encoders and language models through lightweight bridging modules. Building on this trend, recent work has pushed captioning toward more detailed, controllable, and faithful generation. Examples include Altogether, which improves caption quality by re-aligning noisy alt-text supervision [15]; Painting with Words, which studies detailed captioning benchmarks and alignment learning [16]; Patch Matters, which enhances fine-grained captioning through local perception and hierarchical aggregation [17]; and RICO, which refines captions through visual reconstruction to improve accuracy and completeness [18]. Related grounded-generation systems such as GLaMM [19], DAM [20], and FINECAPTION [21] further emphasize localized and compositional description. Our work is complementary to this line: instead of redesigning the whole captioning stack or relying on richer supervision, we target open-domain captioning under a frozen-backbone regime and focus on how external concepts and semantic grounding can be injected efficiently into the generation process.
2.2. Retrieval-Augmented Captioning and External Memory
Retrieval augmentation has become an effective strategy for improving captioning without storing all knowledge in model parameters. Early works such as SmallCap [8] and EXTRA [22] retrieve related captions from external datastores and use them as additional textual evidence for generation. Follow-up work studies the robustness of such retrieval-conditioned captioners and shows that retrieved text can strongly shape final outputs, both positively and negatively [23]. More recent methods expand the form of retrieved knowledge: EVCap uses an external visual-name memory to retrieve object names for open-world captioning [24], while MeaCap introduces a memory-augmented framework for zero-shot captioning [9]. External context has also been explored in domain-specific captioning, for example in news image captioning with visually aware context modeling [25]. Compared with these approaches, our method differs in where retrieval is injected. Existing retrieval-based captioners mainly use retrieved captions, names, or context on the language side. In contrast, R2G converts retrieved concepts into a continuous prompt code and uses that code to modulate selected layers of the visual encoder, allowing retrieval to influence visual evidence formation before decoding begins.
2.3. Grounding, Hallucination, and Semantic Faithfulness
A central concern in modern captioning and multimodal generation is semantic faithfulness. Recent work has developed stronger benchmarks and metrics for hallucination analysis, such as ALOHa for open-vocabulary hallucination measurement [26] and THRONE for free-form hallucination evaluation [27]. On the mitigation side, LURE analyzes and revises object hallucinations in large VLMs [28]; M3ID reduces hallucination by increasing the influence of visual information during decoding [29]; and MOCHa introduces an open-vocabulary framework for measuring and reducing caption hallucinations, together with the OpenCHAIR benchmark [30]. More recent inference-time grounding methods include Visual Description Grounding, which uses grounded visual descriptions to reduce hallucinations [31]; DeCo, which performs dynamic correction decoding for hallucination mitigation [32]; and ICT, which intervenes across image- and object-level representations to improve trustworthiness during generation [33]. These works provide important insight into faithfulness, but many of them operate through post hoc revision, decoding-time correction, or explicit hallucination-oriented control. Our global–local semantic grounding module is different in purpose and mechanism: it builds a persistent global semantic prior once per image and fuses it with token-level local evidence through a decoder-state-dependent gate, so semantic stabilization is integrated into the standard captioning forward pass rather than added only as a correction step.
2.4. Prompting and Parameter-Efficient Adaptation
Prompt-based and parameter-efficient adaptation methods provide a natural foundation for lightweight multimodal generation. In the vision–language literature, MaPLe [34], PromptSRC [35], and DAPT [36] show that carefully designed prompts can adapt frozen vision–language backbones with limited trainable parameters. PromptKD further studies prompt-based knowledge distillation for efficient transfer [37]. In the generative VLM setting, Learning by Correction demonstrates that lightweight tuning tasks can improve zero-shot generative reasoning without full model updating [38]. Our work is related to this efficient adaptation perspective but differs in two important ways. First, our prompt is image-conditioned and retrieval-derived, rather than a static task prompt learned only from downstream supervision. Second, the prompt is injected into the visual encoder to reshape visual representations for captioning, rather than being used only as a generic adaptation.
2.5. Position of Our Work
Overall, our method lies at the intersection of retrieval-augmented captioning, grounded caption generation, and parameter-efficient adaptation. Relative to recent retrieval-based captioning methods, our novelty is to transform retrieved concepts into continuous encoder-side visual prompts rather than decoder-side textual cues. Relative to hallucination mitigation and grounding methods, our contribution is to stabilize generation through an explicit global–local fusion mechanism embedded in the autoregressive decoder. The novelty of R2G does not come from retrieval augmentation or gating alone, but from the integration strategy: external concepts are injected into the visual encoder before decoding through retrieval-guided modulation, and the resulting visual evidence is further stabilized during autoregressive generation by global–local semantic grounding. By combining early semantic injection with state-dependent grounding, R2G addresses concept coverage and semantic faithfulness within a unified frozen-backbone framework.
3. Method
3.1. Problem Formulation and Model Overview
Given an image I, the goal is to generate a caption Y that is both informative and visually faithful. We build on a frozen image captioning backbone consisting of a visual encoder and an autoregressive language decoder D, following the parameter-efficient adaptation paradigm used in models such as BLIP-2 [6]. As Figure 1 shows, we introduce a plug-in method called R2G (retrieval- and grounding-guided), consisting of two lightweight modules that address two complementary sources of captioning errors: missing knowledge before generation and unstable grounding during generation.
The idea behind R2G is simple: captioning errors in open-domain images arise not only from missing concepts, but also from unstable grounding. Accordingly, R2G addresses both aspects within a unified and parameter-efficient design. On the one hand, retrieval-guided visual prompting (RVP) improves access to long-tail concepts by injecting external knowledge into the encoder rather than limiting retrieval to the language side. Unlike prior retrieval-augmented captioning methods that inject retrieved captions or object names primarily on the language side [22,24], RVP conditions visual feature formation before decoding begins. On the other hand, global–local semantic grounding (GLSG) stabilizes autoregressive generation by balancing persistent scene-level semantics with token-level visual evidence. Because both modules are lightweight, the framework remains compatible with frozen pre-trained backbones and standard decoding strategies such as greedy search and beam search.
3.2. Retrieval-Guided Visual Prompting
3.2.1. External Visual Concept Memory
The external visual concept memory is constructed offline from the training image set. Each memory item pairs a visual key with a concept set: the visual key is extracted from a frozen visual retrieval encoder, and the concept set contains the normalized visual concepts associated with the reference image. In our implementation, the memory is built from the MS-COCO training split. Concepts are obtained from object annotations and salient noun or noun-phrase candidates in the reference captions. We lowercase all concepts, remove duplicates, filter extremely rare or non-visual words, and merge concepts with identical lemmas. Thus, each memory entry stores compact visual–semantic cues rather than full captions.
Given an input image I, we compute the retrieval query q from the visual global token and retrieve the top-K memory entries according to cosine similarity. The retrieved concept sets are merged into an image-specific concept pool. This memory provides semantically relevant candidates for open-domain concepts that may be under-represented in the parametric captioning model.
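The retrieval step described above can be illustrated with a minimal sketch. It assumes the memory is held as a NumPy array of keys plus a parallel list of concept lists; the function name `retrieve_concepts`, the toy keys, and the toy concepts are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

def retrieve_concepts(query, keys, concept_sets, k=3):
    """Top-k retrieval by cosine similarity, then merge the concept sets.

    query: (d,) retrieval query; keys: (n, d) memory keys;
    concept_sets: list of n concept lists (all names here are illustrative).
    """
    q = query / np.linalg.norm(query)
    kn = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sims = kn @ q                                  # cosine similarity to every key
    top = np.argsort(-sims, kind="stable")[:k]     # indices of the k most similar keys
    pool = []
    for i in top:                                  # merge concept sets, dropping duplicates
        for concept in concept_sets[i]:
            if concept not in pool:
                pool.append(concept)
    return top.tolist(), pool

# Toy memory with four entries (assumed data, for illustration only).
keys = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
concept_sets = [["dog", "frisbee"], ["dog", "grass"], ["cat"], ["car"]]
top, pool = retrieve_concepts(np.array([1.0, 0.1]), keys, concept_sets, k=2)
```

Exact search is used here for clarity; an approximate nearest-neighbor index could replace the dense matrix product for large memories.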
3.2.2. Prompt Prediction from Retrieved Concepts
Each concept in the final concept pool for image I is embedded by a pre-trained embedding layer. The concept sequence is summarized by attention pooling into a global concept representation u. The prompt code p is then predicted from u and serves as the conditioning vector used by the visual modulation layers.
This design keeps the retrieval branch lightweight: external concepts are injected as a continuous control signal rather than being serialized into text tokens and re-processed by the decoder.
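A minimal sketch of this branch is given below, assuming single-head attention pooling with one learned scoring vector and a two-layer MLP prompt predictor. The 512-dimensional embeddings and the 256-unit hidden layer follow the implementation details reported later in the paper; the scoring form, the ReLU activation, the prompt dimension of 64, and all function names are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def attention_pool(E, w):
    """E: (n, d) concept embeddings; w: (d,) scoring vector (assumed form)."""
    weights = softmax(E @ w)        # one attention weight per concept
    return weights @ E, weights     # pooled representation u and the weights

def predict_prompt(u, W1, b1, W2, b2):
    """Two-layer MLP mapping the pooled representation u to a prompt code p."""
    hidden = np.maximum(0.0, u @ W1 + b1)   # ReLU is an assumption
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
n, d, d_hidden, d_prompt = 5, 512, 256, 64  # d_prompt is hypothetical
E = rng.normal(size=(n, d))
w = rng.normal(size=d) / np.sqrt(d)
u, weights = attention_pool(E, w)
p = predict_prompt(
    u,
    0.02 * rng.normal(size=(d, d_hidden)), np.zeros(d_hidden),
    0.02 * rng.normal(size=(d_hidden, d_prompt)), np.zeros(d_prompt),
)
```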
3.2.3. Prompt-Conditioned Visual Modulation
Consider the l-th block of the visual encoder, whose hidden states are updated block by block. For each selected block, the prompt code p predicts a feature-wise scale and bias, which are broadcast across all spatial tokens and used to modulate the block output. The final retrieval-enhanced visual memory is the output of the modulated encoder. Compared with decoder-side prompting [39], this encoder-side conditioning allows retrieved concepts to influence visual feature formation earlier in the forward pass, thereby improving the accessibility of visually grounded long-tail entities before sentence generation starts.
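The feature-wise modulation can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact layer: each head is a single linear map (the paper reports two-layer heads), and the scale is applied in the residual form (1 + scale), which is an assumption chosen so that zero-initialized heads leave the frozen features unchanged.

```python
import numpy as np

def modulate(H, p, W_gamma, b_gamma, W_beta, b_beta):
    """Feature-wise scale and bias predicted from the prompt code.

    H: (num_tokens, d) block output; p: (d_p,) prompt code.
    The (d,) scale and bias broadcast across all tokens.
    """
    gamma = p @ W_gamma + b_gamma
    beta = p @ W_beta + b_beta
    return (1.0 + gamma) * H + beta   # residual-style scaling (assumed form)

rng = np.random.default_rng(0)
num_tokens, d, d_p = 4, 8, 3          # toy sizes, for illustration only
H = rng.normal(size=(num_tokens, d))
p = rng.normal(size=d_p)
zero_W, zero_b = np.zeros((d_p, d)), np.zeros(d)

# Zero heads reproduce the frozen features exactly.
H_identity = modulate(H, p, zero_W, zero_b, zero_W, zero_b)
# Small random heads perturb every token with the same scale and bias.
H_mod = modulate(H, p, 0.1 * rng.normal(size=(d_p, d)), zero_b,
                 0.1 * rng.normal(size=(d_p, d)), zero_b)
```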
3.3. Global–Local Semantic Grounding (GLSG)
3.3.1. Global Semantic Prior
Although RVP improves concept accessibility, it does not by itself guarantee stable token generation. We therefore introduce a persistent scene-level prior using an auxiliary CLIP-style visual encoder [13]. The global image representation extracted by the auxiliary encoder is projected into the decoder hidden space to obtain the global prior g. The vector g captures coarse scene semantics, such as dominant objects, scene type, and overall semantic scope. Because it is computed once per image outside the autoregressive decoding loop, it provides a more stable anchor than token-level attention alone.
3.3.2. State-Dependent Fusion of Global and Local Evidence
At decoding step t, token-specific local evidence is obtained by cross-attention from the decoder hidden state over the retrieval-enhanced visual memory. A gating network then balances the global prior and local context according to the current decoding state and fuses the two information sources into a grounded context. The next-token distribution is then predicted from this context by the frozen language head. When the current token mainly depends on scene-level semantics, the gate can place more weight on g. When the model needs specific entities, attributes, or relations, it can rely more on local visual evidence. The result is a decoding process that is guided by a stable global anchor without losing token-level grounding.
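One way to picture the state-dependent fusion is the sketch below. It assumes a scalar gate computed by a single linear layer over the concatenated decoder state, local context, and global prior, with the gate weighting the global prior and layer normalization applied to the mixture. The paper's gate uses a hidden layer and may be vector-valued, so the names and exact form here are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def glsg_fuse(h_t, l_t, g, w, b):
    """Gate between the global prior g and the local context l_t.

    h_t: decoder state; w, b: gate parameters (a single linear layer here,
    which is a simplification of the gating network).
    """
    alpha = sigmoid(np.concatenate([h_t, l_t, g]) @ w + b)
    c_t = layer_norm(alpha * g + (1.0 - alpha) * l_t)
    return c_t, alpha

rng = np.random.default_rng(0)
d = 6
h_t, l_t, g = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
w = np.zeros(3 * d)

# A saturated gate recovers one source alone, which shows the two extremes.
c_global, alpha_hi = glsg_fuse(h_t, l_t, g, w, b=50.0)
c_local, alpha_lo = glsg_fuse(h_t, l_t, g, w, b=-50.0)
```

In practice the gate sits between these extremes and changes from token to token, since it depends on the decoder state.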
3.4. Learning Framework
3.4.1. Algorithm Description
Algorithm 1 summarizes the unified training and inference procedure of R2G. We divide the procedure into four stages: retrieval, retrieval-guided visual prompting, global prior extraction, and autoregressive caption generation. The first three stages are performed once for each input image, while the global–local semantic grounding module is applied at every decoding step.
3.4.2. Training and Inference
During training, RVP and GLSG are optimized jointly using the standard captioning objective. Under teacher forcing, the model minimizes the negative log-likelihood of the reference caption, where the ground-truth token at each step t is predicted from the corresponding ground-truth prefix. The gradients of this loss update only the newly introduced lightweight components, while the large captioning backbone remains frozen. Optional lightweight regularization can be added to constrain modulation magnitudes or prevent gate saturation, but the core objective remains standard next-token prediction.
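The teacher-forced objective can be written compactly as the sketch below. It assumes the per-step logits have already been computed from the ground-truth prefixes and that the loss is summed over steps (whether the paper sums or averages is not stated); the function name `caption_nll` is illustrative.

```python
import numpy as np

def caption_nll(logits, targets):
    """Negative log-likelihood of a reference caption under teacher forcing.

    logits: (T, V) next-token scores, one row per decoding step;
    targets: (T,) ground-truth token ids.
    """
    z = logits - logits.max(axis=1, keepdims=True)            # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

# With uniform predictions over 4 tokens, each step costs log(4).
loss_uniform = caption_nll(np.zeros((3, 4)), np.array([0, 2, 1]))
```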
At inference time, retrieval is performed once for each image. The retrieved concepts are converted into a prompt code p, which produces the retrieval-enhanced visual memory. Meanwhile, the auxiliary encoder extracts the global semantic prior g. The decoder then generates the caption autoregressively, using GLSG at each time step to fuse the global prior with token-level local visual evidence. This procedure is compatible with standard greedy decoding and beam search.
Algorithm 1 Unified training and inference procedure for R2G
Require: Image I, external memory, reference caption (for training), trainable parameters
Ensure: Updated parameters during training, or generated caption Y during inference
Stage 1: Retrieval from external memory
  Extract the global retrieval token from the base visual encoder and compute the retrieval query q.
  Retrieve the top-K nearest neighbors from the external memory and merge their concept sets into the concept pool.
Stage 2: Retrieval-guided visual prompting
  Embed the retrieved concepts and aggregate them into the concept representation u.
  Predict the prompt code p from u using Equations (2) and (3).
  for each selected encoder block do
    Predict the modulation scale and bias from p.
    Modulate the visual representation using Equation (6).
  end for
  Obtain the retrieval-enhanced visual memory.
Stage 3: Global semantic prior extraction
  Compute the global semantic prior g using Equation (8).
Stage 4: Training or inference with global–local semantic grounding
  if training then
    for t = 1 to T do
      Compute the decoder hidden state from the ground-truth prefix.
      Compute the local visual context using Equation (9).
      Compute the gate and grounded context using Equation (11).
      Predict the next-token distribution.
    end for
    Update the trainable parameters by minimizing the captioning loss.
  else
    Initialize the decoder with the start token and, if used, a beam of hypotheses.
    for t = 1 to T do
      Compute the decoder hidden state from previously generated tokens.
      Compute the local visual context, the gate, and the grounded context as in training.
      Generate the next token from the predicted distribution.
    end for
    Return the generated caption Y.
  end if
3.4.3. Time Complexity
We further analyze the computational complexity of R2G to clarify the additional overhead. For RVP, exact top-K retrieval by cosine similarity is performed once per image, and its cost grows linearly with the number of memory entries and the key dimension. The cost of concept embedding and attention pooling depends only on the number of retrieved concepts and the embedding dimension. Prompt-conditioned modulation is applied only to the selected layers, so its cost depends on the number of selected layers, the number of visual tokens, and the feature dimension. For GLSG, the global semantic prior is computed once per image. At each decoding step, the gate and fused context add a fixed per-step cost, which accumulates over the full sequence. The total additional complexity of R2G is the sum of these terms. Since retrieval and global prior extraction are performed once per image, the backbone remains frozen, and modulation is applied to only three layers, R2G introduces moderate overhead while preserving the efficiency of the frozen-backbone setting.
3.5. Formal Analysis
3.5.1. Operator View of R2G
R2G can be written as the composition of a retrieval operator, an encoder-side modulation operator, and a decoder-side grounding operator. The retrieval operator selects the top-K memory entries according to cosine similarity and returns the concept pool. The retrieved concepts are then mapped to the prompt code p, which conditions the selected visual encoder layers through feature-wise modulation. Thus, the final visual memory can be expressed compactly as the output of the frozen visual encoder equipped with retrieval-guided modulation, applied to the image I and conditioned on p. During decoding, R2G fuses global and local semantic evidence through the state-dependent gate, and the next-token distribution is then predicted from the fused context. This formulation also clarifies the difference between R2G and decoder-side retrieval. In decoder-side retrieval, the visual memory is fixed before retrieved information is used, and the retrieved textual or concept cues are added only on the language side. In contrast, R2G makes the visual memory itself retrieval-dependent. Therefore, retrieved concepts can influence visual feature formation before autoregressive generation begins, rather than only biasing lexical prediction after visual encoding has already been completed.
3.5.2. Stability Under Retrieval Perturbations
We next analyze how retrieval noise affects the prompt-conditioned visual representation. Let u be the aggregated concept representation before prompt prediction, and suppose noisy retrieval perturbs it by a bounded amount. Assume that the prompt predictor is Lipschitz continuous. Then the induced perturbation of the prompt code is at most the Lipschitz constant of the predictor times the perturbation of u. For a modulated layer l, assume that the scale and bias predictors are Lipschitz continuous and that the hidden states entering the modulation are bounded in norm. The perturbation introduced by noisy retrieval at this layer is then bounded by a constant multiple of the perturbation of u, where the constant depends only on these Lipschitz constants and the hidden-state bound. This bound shows that retrieval noise is filtered through the prompt predictor and the lightweight modulation heads. Therefore, the effect of noisy retrieval can be controlled by bounded prompt magnitudes, normalization, and applying modulation only to selected layers.
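For concreteness, the argument can be written out as follows. This is a sketch in assumed notation, since the original symbols are not recoverable here: $u'$ is the perturbed concept representation with $\|u' - u\| \le \delta$, $L_p$, $L_\gamma$, $L_\beta$ are the Lipschitz constants of the prompt, scale, and bias predictors, $B$ bounds the norm of a token's hidden state $h$ at the modulated layer, and the modulation is taken to be $\tilde{h} = \gamma(p) \odot h + \beta(p)$ with $h$ held fixed.

```latex
\begin{align*}
\|p' - p\| &\le L_p \,\|u' - u\| \le L_p\,\delta, \\
\|\tilde{h}' - \tilde{h}\|
  &= \bigl\| (\gamma(p') - \gamma(p)) \odot h + \beta(p') - \beta(p) \bigr\| \\
  &\le \|\gamma(p') - \gamma(p)\|\,\|h\| + \|\beta(p') - \beta(p)\| \\
  &\le (L_\gamma B + L_\beta)\,\|p' - p\|
   \le (L_\gamma B + L_\beta)\,L_p\,\delta .
\end{align*}
```

The second inequality uses $\|a \odot b\|_2 \le \|a\|_\infty \|b\|_2 \le \|a\|_2 \|b\|_2$. The same bound holds if the scale is applied in the residual form $(1 + \gamma(p)) \odot h$, because the constant term cancels in the difference.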
3.5.3. Robustness of Global–Local Fusion
The global–local fusion in GLSG can be interpreted as reliability-aware semantic estimation. Consider the ideal semantic evidence for predicting the token at step t, and suppose the global prior and local context are two noisy estimates of it whose errors are zero-mean and independent. For a scalar fusion coefficient, the expected squared error of the fused estimate is a weighted sum of the two error variances, and the optimal coefficient weights each source in inverse proportion to its error variance. The corresponding error is no larger than that of using either g or the local context alone. This result explains why combining global and local evidence can improve robustness: when local evidence is noisy, more weight should be assigned to the global prior; when the local evidence is reliable, the model should rely more on the local context. The proposed gate learns a state-dependent approximation to this reliability-aware weighting.
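The standard derivation behind this statement is short. The notation below is assumed for illustration and may differ from the paper's: $s_t$ is the ideal evidence, $g = s_t + \varepsilon_g$ and $\ell_t = s_t + \varepsilon_\ell$ are the two noisy estimates in a scalar setting, $\sigma_g^2$ and $\sigma_\ell^2$ are the error variances, and $\lambda \in [0, 1]$ is the weight placed on the global prior.

```latex
\begin{align*}
\hat{s}_t &= \lambda\, g + (1 - \lambda)\,\ell_t, \\
\mathbb{E}\bigl[(\hat{s}_t - s_t)^2\bigr]
  &= \lambda^2 \sigma_g^2 + (1 - \lambda)^2 \sigma_\ell^2, \\
\frac{d}{d\lambda}:\quad
  2\lambda\,\sigma_g^2 - 2(1 - \lambda)\,\sigma_\ell^2 = 0
  \;&\Longrightarrow\;
  \lambda^\star = \frac{\sigma_\ell^2}{\sigma_g^2 + \sigma_\ell^2}, \\
\mathbb{E}\bigl[(\hat{s}_t - s_t)^2\bigr]\Big|_{\lambda = \lambda^\star}
  &= \frac{\sigma_g^2\,\sigma_\ell^2}{\sigma_g^2 + \sigma_\ell^2}
  \;\le\; \min\bigl(\sigma_g^2,\ \sigma_\ell^2\bigr).
\end{align*}
```

The cross term vanishes because the two errors are independent with zero mean. As $\sigma_\ell^2$ grows, $\lambda^\star$ approaches 1 and the estimate leans on the global prior; as $\sigma_\ell^2$ shrinks, it leans on the local context.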
3.5.4. Perturbation Control in the Gated Context
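We further examine how perturbations propagate through the fused context. Consider perturbations in the decoder state, local evidence, and global prior. Since the sigmoid function is 1/4-Lipschitz, the gate perturbation is at most one quarter of the perturbation of the gate's pre-activation, which is in turn controlled by the perturbations of the gate inputs. Ignoring higher-order terms, the perturbation of the fused context is bounded by the sum of three contributions: the global-prior perturbation weighted by the gate, the local-evidence perturbation weighted by one minus the gate, and the gate perturbation multiplied by the gap between the global prior and the local context, all scaled by the local Lipschitz constant of layer normalization. This inequality shows that the gate controls the contribution of each evidence source. A larger gate value suppresses perturbations from local evidence, while a smaller gate value reduces the influence of a coarse or noisy global prior.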
3.5.5. Error Propagation During Autoregressive Decoding
Autoregressive captioning is sensitive to early semantic errors because each predicted token affects subsequent hidden states. Consider the deviation between the actual decoder trajectory and an ideal grounded trajectory at step t, together with the semantic evidence error at that step. If the decoder transition is locally Lipschitz with respect to its previous state and fused context, then the deviation at step t is bounded by a weighted sum of the deviation at the previous step and the current evidence error, where the two weights are local sensitivity constants. Unrolling this recursion bounds the deviation at step t by a weighted sum of the initial deviation and all evidence errors up to step t. Thus, reducing the per-step evidence error helps limit the accumulation of semantic errors across decoding steps. GLSG reduces this error by grounding each step in both a stable image-level prior and token-specific visual evidence.
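The unrolled bound can be checked numerically. The sketch below assumes the one-step inequality has the form e_t <= a * e_{t-1} + b * eps_t, with a and b standing for the two sensitivity constants; the symbols and the numeric values are illustrative and not taken from the paper.

```python
def unrolled_bound(e0, eps, a, b):
    """Iterate e_t = a * e_{t-1} + b * eps_t, i.e. the bound taken with equality."""
    e = e0
    for x in eps:
        e = a * e + b * x
    return e

def closed_form_bound(e0, eps, a, b):
    """Closed form: a^T * e_0 + b * sum_{s=1..T} a^(T-s) * eps_s."""
    T = len(eps)
    return a**T * e0 + b * sum(a**(T - s) * eps[s - 1] for s in range(1, T + 1))

eps = [0.3, 0.1, 0.2]                      # assumed per-step evidence errors
iterated = unrolled_bound(0.05, eps, a=0.9, b=0.5)
closed = closed_form_bound(0.05, eps, a=0.9, b=0.5)

# With zero initial deviation the bound is linear in the evidence errors,
# so halving every eps_t halves the bound.
full = unrolled_bound(0.0, eps, a=0.9, b=0.5)
half = unrolled_bound(0.0, [x / 2 for x in eps], a=0.9, b=0.5)
```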
Finally, if a hallucination-risk score is locally Lipschitz with respect to the semantic evidence, then the increase in risk is bounded by the Lipschitz constant times the semantic evidence error. Therefore, any reduction in the evidence error also reduces an upper bound on the expected hallucination risk. This provides an analytical explanation for the empirical CHAIR improvements observed in the ablation studies.
4. Experiments
4.1. Experimental Setup
Datasets and evaluation metrics. We evaluate R2G on two standard image captioning benchmarks: MS-COCO under the Karpathy split [1] and the NoCaps validation set [40]. MS-COCO measures caption quality in a conventional closed-domain setting, whereas NoCaps emphasizes transfer to novel objects and open-domain concepts. Following standard practice, we report BLEU-4 [41], METEOR [42], ROUGE-L [43], CIDEr [44], and SPICE [45]; higher values indicate better caption quality. To further evaluate semantic faithfulness in the ablation study, we additionally report CHAIR-S and CHAIR-I [10]. CHAIR-S measures the proportion of generated sentences containing at least one hallucinated object, while CHAIR-I measures the proportion of hallucinated object mentions among all generated object mentions; lower values are better for both metrics.
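The two CHAIR definitions above translate directly into a few lines of code. The sketch assumes that object mentions have already been extracted from each caption and mapped to the same vocabulary as the image annotations, which is the part a real CHAIR implementation handles with synonym lists; the function name and toy data are hypothetical.

```python
def chair_scores(mentioned, present):
    """Compute (CHAIR-S, CHAIR-I) from per-caption object mentions.

    mentioned[i]: list of objects mentioned in caption i;
    present[i]: set of objects actually in image i.
    """
    hallucinated_sentences = 0
    hallucinated_mentions = 0
    total_mentions = 0
    for objects, truth in zip(mentioned, present):
        bad = [o for o in objects if o not in truth]
        hallucinated_sentences += bool(bad)     # caption has at least one hallucination
        hallucinated_mentions += len(bad)
        total_mentions += len(objects)
    chair_s = hallucinated_sentences / len(mentioned)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    return chair_s, chair_i

# Toy example: only the second caption hallucinates ("sofa").
mentioned = [["dog", "frisbee"], ["cat", "sofa"], ["car"]]
present = [{"dog", "frisbee"}, {"cat"}, {"car", "road"}]
chair_s, chair_i = chair_scores(mentioned, present)
```

Here one of three captions contains a hallucinated object and one of five mentions is hallucinated, so CHAIR-S is 1/3 and CHAIR-I is 1/5.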
Implementation details. We build on a BLIP-2 style frozen-backbone captioner [
6]. Unless otherwise stated, all optimization and decoding settings follow the base implementation. The learning rate is set to
and the weight decay is 0.05. During decoding, we use beam search with beam size
and length penalty 0.9.
For RVP, the number of retrieved neighbors is set to . Each retrieved concept is embedded into a 512-dimensional vector, and the concept sequence is aggregated by a single-head attention-pooling module. The hidden dimension of the prompt-prediction MLP is 256. Prompt-conditioned modulation is applied to visual encoder layers , and each modulation head is implemented as a two-layer linear network.
For GLSG, the global semantic prior is extracted from the penultimate-layer global representation of the auxiliary vision–language encoder . The projected prior dimension is 768, and the hidden dimension of the gating network is 256.
To ensure fair comparison across all methods, all experiments were performed under the same environment. Specifically, the platform used an Intel(R) Xeon(R) Silver 4314 processor with a base frequency of 2.40 GHz and an NVIDIA A30 GPU. The models were implemented in Python 3.9.19 based on the PyTorch 2.1.0 deep learning framework.
Compared methods. R2G is compared against several representative captioning baselines.
BLIP [5] is a pre-trained VLM that bootstraps noisy image-text pairs through caption generation and filtering, and serves as a strong captioning baseline.
SimVLM (huge) [4] is a large-scale VLM trained end-to-end with a simple prefix language modeling objective under weak supervision, representing a strong high-capacity pre-trained captioning system.
EVCap (Vicuna-13B) [24] is a retrieval-augmented captioning method that uses an external visual-name memory to retrieve object names for open-world image captioning.
BLIP-2 (OPT-6.7B) [6] connects a frozen image encoder and a frozen LLM through a lightweight Querying Transformer, achieving strong captioning performance with parameter-efficient adaptation. R2G adopts BLIP-2 as its frozen backbone.
4.2. Main Results on MS-COCO
Table 1 reports the main results on the MS-COCO Karpathy test split. Both RVP and GLSG improve the BLIP-2 baseline, and the full R2G model achieves the best overall performance. In particular, R2G reaches 43.5 BLEU-4, 32.0 METEOR, 149.2 CIDEr, 25.6 SPICE, and 59.2 ROUGE-L, improving on BLIP-2 on every metric.
Several conclusions for each module follow from Table 1. RVP yields the larger single-module gain in CIDEr, suggesting that retrieval-guided visual prompting improves concept accessibility and content coverage, especially for entities that are weakly represented in the base model parameters. GLSG, in turn, consistently improves SPICE and ROUGE-L over the baseline, which is consistent with its role in stabilizing object–attribute–relation grounding during decoding. When the two modules are combined, the gains are larger than those obtained by either module alone, indicating that encoder-side concept injection and decoder-side semantic grounding are complementary.
4.3. Open-Domain Generalization on NoCaps
We further evaluate open-domain generalization on the NoCaps validation set. Following the standard NoCaps protocol, the validation set is divided into three subsets: in-domain, which contains images whose object classes are covered by MS-COCO; near-domain, which contains images with both MS-COCO classes and novel classes; and out-of-domain, which contains images composed only of novel classes outside the MS-COCO label space. This partition makes it possible to assess not only overall caption quality, but also robustness under progressively stronger domain shift.
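The three-way partition described above amounts to a simple rule on the object classes present in an image. The sketch below is a simplified illustration: it assumes the image-level class labels and the MS-COCO class set are available as Python collections, and the function name and the toy class list are hypothetical.

```python
def nocaps_domain(image_classes, coco_classes):
    """Assign an image to in-domain, near-domain, or out-of-domain."""
    classes = set(image_classes)
    seen = classes & coco_classes      # classes covered by MS-COCO
    novel = classes - coco_classes     # classes outside the MS-COCO label space
    if not novel:
        return "in-domain"
    if seen:
        return "near-domain"
    return "out-of-domain"

coco = {"dog", "cat", "car"}           # toy subset, for illustration only
d_in = nocaps_domain(["dog", "car"], coco)
d_near = nocaps_domain(["dog", "accordion"], coco)
d_out = nocaps_domain(["accordion", "jellyfish"], coco)
```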
Table 2 reports results on the NoCaps validation set. R2G consistently outperforms the BLIP-2 baseline in all domains and achieves the best overall performance, reaching 124.2 CIDEr and 15.8 SPICE. The improvement is particularly clear on the out-of-domain subset, where R2G improves CIDEr from 116.8 to 127.2 (+10.4) and SPICE from 14.3 to 15.4 (+1.1).
The NoCaps results further clarify the role of each component under distribution shift. RVP is particularly helpful in the out-of-domain setting, where retrieval-derived prompts improve the visibility of novel or long-tail concepts during visual feature formation. GLSG yields the stronger single-module overall result, suggesting that stable semantic grounding remains crucial when local evidence is ambiguous, cluttered, or weak.
4.4. Ablation Studies
We next examine each key component on the MS-COCO validation set. Table 3 and Table 4 report separate ablations for GLSG and RVP, respectively. We separate the two studies for clarity because they target different failure modes: GLSG primarily addresses grounding stability and hallucination, whereas RVP primarily addresses concept accessibility and encoder-side knowledge injection.
Table 3 shows that the complete GLSG design improves both caption quality and semantic faithfulness. Removing the global prior causes CIDEr to drop by 3.9 points and increases CHAIR-I from 5.50% to 7.50%, indicating that scene-level semantics are important for suppressing unsupported object mentions. Replacing adaptive gating with concatenation or a fixed fusion weight also worsens both SPICE and CHAIR, which suggests that the benefit comes not merely from exposing the decoder to more information but from dynamically balancing global semantics and token-level local evidence.
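The contrast between a fixed fusion weight and adaptive gating can be illustrated with a minimal sketch. It assumes a scalar gate computed from the concatenated decoder state, global prior, and local context. The names `adaptive_gate_fusion`, `fixed_fusion`, `gate_weights`, and `gate_bias` are hypothetical, and the gating network actually used in GLSG may differ in form and dimensionality.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping a real score to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fixed_fusion(global_prior: Sequence[float], local_ctx: Sequence[float],
                 alpha: float = 0.5) -> List[float]:
    """Ablation-style baseline: the same mixing weight at every decoding step."""
    return [alpha * g + (1.0 - alpha) * l for g, l in zip(global_prior, local_ctx)]


def adaptive_gate_fusion(hidden: Sequence[float], global_prior: Sequence[float],
                         local_ctx: Sequence[float], gate_weights: Sequence[float],
                         gate_bias: float = 0.0) -> Tuple[List[float], float]:
    """Assumed form: a scalar gate predicted from [hidden; global; local]."""
    features = list(hidden) + list(global_prior) + list(local_ctx)
    if len(features) != len(gate_weights):
        raise ValueError("gate_weights must match the concatenated feature length")
    score = sum(w * f for w, f in zip(gate_weights, features)) + gate_bias
    gate = sigmoid(score)
    # gate near 1 favors the scene-level prior, gate near 0 favors local evidence
    fused = [gate * g + (1.0 - gate) * l for g, l in zip(global_prior, local_ctx)]
    return fused, gate
```

With all gate weights set to zero the gate equals 0.5 and the adaptive variant reduces to the fixed-weight baseline, which makes the ablation comparison easy to reason about: any difference between the two must come from the gate responding to the current decoding state.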
Table 4 shows that the improvement is not explained by retrieval alone. Removing retrieval causes a clear drop, which confirms that external knowledge is beneficial. More importantly, replacing encoder-side modulation with decoder-side textual prompting or simple feature concatenation also underperforms the full model. This result supports the central design choice of RVP: retrieved concepts are most useful when they influence visual evidence formation before autoregressive decoding begins, rather than being injected only after the visual representation has already been formed.
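To make the distinction between encoder-side modulation and simple feature concatenation concrete, the sketch below assumes a scale-and-shift modulation of each visual token. This form, along with the names `modulate_tokens`, `concat_prompt`, `scale`, and `shift`, is an illustrative assumption and is not taken from the definition of RVP.

```python
from typing import List, Sequence


def modulate_tokens(tokens: Sequence[Sequence[float]], scale: Sequence[float],
                    shift: Sequence[float]) -> List[List[float]]:
    """Assumed modulation: x' = (1 + scale) * x + shift, applied per channel.

    Every visual token is changed, so the prompt affects feature formation.
    """
    return [[(1.0 + s) * x + b for x, s, b in zip(tok, scale, shift)]
            for tok in tokens]


def concat_prompt(tokens: Sequence[Sequence[float]],
                  prompt: Sequence[float]) -> List[List[float]]:
    """Concatenation baseline: the prompt is appended as one extra token.

    The original visual tokens are passed through unchanged.
    """
    return [list(tok) for tok in tokens] + [list(prompt)]


if __name__ == "__main__":
    visual = [[1.0, 2.0], [3.0, 4.0]]
    print(modulate_tokens(visual, scale=[1.0, 0.0], shift=[0.0, 0.5]))
    print(concat_prompt(visual, prompt=[9.0, 9.0]))
```

Under this assumption, modulation keeps the token count fixed while altering every token, whereas concatenation lengthens the sequence and leaves the existing visual evidence as it was.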
4.5. Sensitivity Analysis
We finally examine the sensitivity of R2G to two key design choices and verify whether the default settings used in the main experiments are empirically justified. Overall, the results show that R2G is robust within a reasonable range of configurations, while the proposed defaults provide the most favorable balance between effectiveness and efficiency.
First, the retrieval size K directly controls the amount of external semantic information injected into the visual encoder. As shown in Figure 2, performance improves steadily when K increases from 1 to 5, with CIDEr rising from 145.0 to 147.8 and SPICE rising from 24.9 to 25.3. This trend indicates that retrieving a small set of additional neighbors helps expose the model to more relevant long-tail or visually related concepts. However, when K is further increased to 7 or 10, the gains saturate and then slightly decline. A likely reason is that lower-ranked neighbors are less semantically aligned with the query image and therefore introduce weaker or noisier concepts, which dilute the usefulness of the prompt signal.
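The role of K can be illustrated with a minimal top-K retrieval sketch. It assumes cosine similarity between a query embedding and a small in-memory concept store. The function names and the toy embeddings are hypothetical, and the similarity measure and index used by R2G are not specified in this section.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query: Sequence[float], memory: Dict[str, Sequence[float]],
                   k: int) -> List[Tuple[str, float]]:
    """Return the k concepts most similar to the query, best first."""
    scored = [(name, cosine(query, emb)) for name, emb in memory.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-D embeddings for illustration only.
    toy_memory = {"dog": [1.0, 0.0], "wolf": [0.9, 0.1], "car": [0.0, 1.0]}
    for name, score in retrieve_top_k([1.0, 0.0], toy_memory, k=3):
        print(name, round(score, 3))
```

Because the list is sorted by similarity, each additional neighbor admitted by a larger K is no better aligned with the query than the ones before it, which is consistent with the saturation described above.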
We also evaluate the effect of modulation depth in the visual encoder.
Figure 3 shows that applying modulation only in deep layers is suboptimal, because the injected semantic cues arrive too late to sufficiently affect feature formation, whereas shallow-biased modulation can interfere with more generic low-level representations. In contrast, modulation in middle-to-late layers consistently performs better, suggesting that these layers provide a better stage for integrating retrieved concepts with increasingly semantic visual features.
The retrieval size is set to K = 5 based on the sensitivity analysis in Figure 2. Increasing
K from 1 to 5 improves CIDEr and SPICE because more relevant concepts are retrieved, while larger values introduce weaker neighbors and slightly reduce performance. Prompt-conditioned modulation is applied to layers
, because middle-to-late layers provide the best trade-off between semantic abstraction and computational cost. Shallow layers are more sensitive to low-level visual patterns, while very deep-only modulation leaves insufficient opportunity for retrieved concepts to influence visual feature formation. The hidden dimensions of the prompt-prediction MLP and gating network are selected to keep the additional parameter count small while preserving enough capacity for concept aggregation and adaptive fusion.
4.6. Robustness to Retrieval Noise
To assess the sensitivity of R2G to noisy or weakly relevant retrieval results, we corrupt the retrieved concept pool by replacing a proportion of the retrieved concepts with randomly sampled concepts from unrelated images.
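One way to implement such a corruption procedure is sketched below. The function name `corrupt_retrieved`, the parameter `noise_ratio`, the rounding rule, and the fixed seed are assumptions made for illustration, since the exact sampling protocol is not given in the text.

```python
import random
from typing import List, Sequence


def corrupt_retrieved(retrieved: Sequence[str], distractors: Sequence[str],
                      noise_ratio: float, seed: int = 0) -> List[str]:
    """Replace a proportion of retrieved concepts with unrelated ones.

    `distractors` stands in for concepts sampled from unrelated images.
    """
    if not 0.0 <= noise_ratio <= 1.0:
        raise ValueError("noise_ratio must lie in [0, 1]")
    rng = random.Random(seed)  # fixed seed keeps the corruption reproducible
    n_replace = round(noise_ratio * len(retrieved))
    if n_replace > len(distractors):
        raise ValueError("not enough distractor concepts to sample from")
    positions = rng.sample(range(len(retrieved)), n_replace)
    replacements = rng.sample(list(distractors), n_replace)
    corrupted = list(retrieved)
    for pos, new_concept in zip(positions, replacements):
        corrupted[pos] = new_concept
    return corrupted


if __name__ == "__main__":
    pool = ["dog", "leash", "park", "grass", "ball"]
    unrelated = ["keyboard", "submarine", "violin", "glacier", "stapler"]
    print(corrupt_retrieved(pool, unrelated, noise_ratio=0.4))
```

A ratio of 0 leaves the pool untouched and a ratio of 1 replaces every concept, matching the two extremes of the experiment.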
As shown in
Table 5, performance decreases gradually as the noise ratio increases. This gradual degradation suggests that the proposed prompt prediction module and global–local grounding mechanism do not rely excessively on any single retrieved concept. Under moderate retrieval noise (
), R2G still outperforms the frozen BLIP-2 baseline, suggesting that the continuous prompt representation and adaptive gating mechanism help reduce the influence of weakly relevant retrievals.
We also observe that, when
, the performance becomes comparable to the variant without retrieval (the “w/o Retrieval” row in
Table 4). This indicates that heavily corrupted retrieval provides limited useful semantic guidance. When
, all retrieved concepts are replaced by unrelated concepts, leading to a clear performance drop.
4.7. Computational Overhead
As analyzed in
Section 3.4.3, R2G is designed as a lightweight plug-in for frozen captioning backbones. To quantify the additional computation, we report the relative FLOPs of RVP, GLSG, and the full R2G model compared with the BLIP-2 baseline.
As shown in
Table 6, RVP and GLSG introduce only
and
additional FLOPs, respectively, while the full R2G model increases FLOPs by
over the baseline. This moderate overhead comes from one-time retrieval, lightweight prompt-conditioned modulation, and the compact global–local gating module. These results confirm that R2G improves caption quality and semantic faithfulness while preserving the efficiency of the frozen-backbone setting.
5. Limitations and Future Work
Despite the encouraging results, several limitations remain and point to important directions for future research.
The effectiveness of RVP depends on the quality, diversity, and relevance of the external visual concept memory. When the retrieved neighbors are noisy, weakly aligned, or insufficiently diverse, the predicted prompt code may inject suboptimal semantic cues into the visual encoder. Thus, a promising next step is to develop noise-aware and confidence-aware retrieval strategies, such as learned reranking, retrieval calibration, or memory filtering based on semantic agreement between the image and retrieved concepts.
The GLSG module uses a compact scene-level prior to stabilize decoding, which is effective for improving overall semantic consistency. However, a single global representation may be too coarse for highly crowded scenes, fine-grained object distinctions, or complex relational descriptions involving multiple interacting entities. Therefore, future work could extend GLSG toward multi-granularity semantic grounding, for example by combining scene-level priors with region-level, object-level, or relation-level representations. Token-type-aware or relation-aware gating mechanisms may further help the decoder decide when to rely on global semantics, local evidence, or structured scene representations during caption generation.
The current experimental validation is mainly based on a BLIP-2-style frozen captioning backbone. Although the proposed RVP and GLSG modules are designed around general interfaces, including modifiable visual representations, decoder hidden states, and token-level visual contexts, we have not yet exhaustively evaluated R2G across a wider range of captioning architectures. Future work will extend R2G to more diverse vision–language backbones, such as encoder–decoder captioners, query-based VLMs, and recent multimodal large language models, to further verify its architectural generality and practical transferability.
6. Conclusions
In this paper, we presented R2G, a retrieval- and grounding-guided framework for open-domain image captioning that addresses two coupled challenges: insufficient concept coverage before generation and unstable semantic grounding during decoding. To tackle these issues in a unified frozen-backbone setting, we introduced retrieval-guided visual prompting, which injects retrieved external concepts into the visual encoder to influence feature formation early, and global–local semantic grounding, which stabilizes autoregressive generation by adaptively fusing scene-level semantic priors with token-level local visual evidence.
Experimental results on MS-COCO and NoCaps show that the proposed design consistently improves caption quality over the frozen BLIP-2 baseline, with particularly clear gains in open-domain and out-of-domain settings. The ablation studies further suggest that the two modules are complementary: RVP mainly strengthens concept accessibility and long-tail coverage, while GLSG improves semantic faithfulness and reduces hallucination by providing more stable grounding during decoding.
Overall, the proposed framework demonstrates that open-domain captioning benefits not only from accessing external knowledge, but also from integrating that knowledge at the right stage of the generation pipeline and maintaining stable semantic grounding throughout decoding. We hope this work provides a useful step toward image captioning systems that are more accurate, more faithful, and more robust in visually diverse real-world scenes.
7. Implications
The significance of the proposed R2G framework extends beyond the immediate improvement of caption quality on benchmark datasets. From an application perspective, the framework supports practical vision–language systems that must operate in open and changing environments, where rare concepts, visually ambiguous scenes, and incomplete parametric knowledge are common. By combining retrieval-guided concept injection with adaptive semantic grounding, R2G provides a lightweight strategy for improving descriptive accuracy and reducing unsupported content generation without requiring full retraining of a large backbone. This design is potentially useful in downstream scenarios such as assistive image description, multimedia indexing, human–robot interaction, digital content understanding, and domain-specific captioning systems in science, medicine, or remote sensing, where concept coverage and semantic faithfulness are both essential.
From a mathematical perspective, R2G also suggests several directions for further study. The retrieval-guided prompting module can be interpreted as a memory-dependent perturbation of latent representations, raising natural questions about the stability, boundedness, and sensitivity of encoder features under noisy or mismatched retrieval. Likewise, the global–local grounding module defines an adaptive fusion mechanism whose behavior may be studied in terms of information weighting, robustness, and error propagation across autoregressive decoding steps. These observations indicate that modern retrieval-augmented captioning systems can serve as concrete testbeds for mathematical analysis of coupled operators, controlled perturbations in high-dimensional representation spaces, and sequential inference with multiple interacting information sources.
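As one illustration of the boundedness question, suppose (as an assumption made here, not a statement of the actual RVP and GLSG definitions) that the prompt code p modulates an encoder feature x by a per-channel scale γ(p) and shift β(p), and that the decoder fuses a global prior and a local context with a scalar gate g in [0, 1]. Two elementary bounds then follow from the triangle inequality and the convexity of the norm:

```latex
\begin{align}
\tilde{x} &= \bigl(1 + \gamma(p)\bigr) \odot x + \beta(p)
\;\Longrightarrow\;
\|\tilde{x} - x\|_2
  = \|\gamma(p) \odot x + \beta(p)\|_2
  \le \|\gamma(p)\|_\infty \, \|x\|_2 + \|\beta(p)\|_2 , \\
\bigl\| g\, c^{\mathrm{glob}} + (1 - g)\, c^{\mathrm{loc}} \bigr\|_2
  &\le \max\bigl( \|c^{\mathrm{glob}}\|_2 ,\, \|c^{\mathrm{loc}}\|_2 \bigr),
  \qquad g \in [0, 1].
\end{align}
```

Under these assumptions, the perturbation caused by a noisy or mismatched retrieval is controlled by the magnitude of the predicted scale and shift, and the fused context can never exceed the larger of its two inputs in norm. Whether comparable bounds hold for the modules as actually defined, and how such errors accumulate over decoding steps, remains an open question of the kind raised above.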