4.1. Datasets and Knowledge Bases
To comprehensively evaluate the performance of the BRIDGE framework, this paper conducts experiments on three mainstream KB-VQA benchmark datasets.
OK-VQA [44] is one of the most challenging datasets in this field. Its distinguishing characteristic is that answering the questions depends heavily on external knowledge beyond the image, and no explicit background context is provided. The dataset covers 11 diverse categories, including vehicles, sports, and cooking, and contains 14,055 questions, placing demanding requirements on the model’s generalized knowledge reasoning capability.
FVQA [2] is the first dataset to provide explicit supporting facts. Each sample consists of an image, a question, an answer, and a corresponding supporting knowledge entry, totaling 5286 questions. It is primarily used to evaluate the reasoning accuracy of models given explicit structured knowledge.
A-OKVQA [45], as an enhanced version of OK-VQA [44], contains approximately 25,000 questions and requires models to reason with more diverse types of external knowledge, encompassing multiple capability dimensions such as commonsense reasoning, world knowledge, and visual knowledge. The dataset provides both Multiple-Choice (MC) and Direct Answer (DA) evaluation settings and annotates reasoning rationales for each question in the training set. It has become a more prevalent evaluation benchmark in the KB-VQA field in recent years.
To support multi-modal reasoning, this study employs two publicly available, static knowledge corpora that are merged into a unified knowledge base for dense retrieval. The first is a Wikipedia corpus, specifically the OKVQA_passages subset (344K entries) from the M2KR benchmark, which is the same static Wikipedia corpus used by TRiG [5], RA-VQA [13], and RA-VQA-v2 [14]. It provides broad and detailed general world knowledge and serves as the primary source from which the Dense Knowledge Retriever (DKR) obtains open-domain explicit knowledge. The second is a Google Search corpus (okvqa_full_corpus, 168,306 entries) pre-collected by Luo et al. [46] based on the OK-VQA training and testing data. This corpus supplements Wikipedia by covering web-sourced information that may not appear in encyclopedia-style entries, thereby improving knowledge coverage on diverse and long-tail queries. Importantly, both corpora are pre-collected static datasets rather than real-time retrieval results; no live API calls to Wikipedia or Google Search are made during training or inference, ensuring full reproducibility across experimental runs.
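The merging of the two static corpora into one retrieval index can be sketched as follows. This is only an illustrative toy under stated assumptions: the passages are invented, the bag-of-words `embed` function stands in for a learned dense encoder, and the names `build_index` and `retrieve` are hypothetical rather than part of DKR.

```python
import math
from collections import Counter

def embed(text, vocab):
    """Toy bag-of-words embedding, L2-normalised (stand-in for a learned encoder)."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(corpora):
    """Merge {source_name: [passages]} corpora into one flat, source-tagged index."""
    merged = [(source, p) for source, passages in corpora.items() for p in passages]
    vocab = sorted({w for _, p in merged for w in p.lower().split()})
    vectors = [embed(p, vocab) for _, p in merged]
    return merged, vocab, vectors

def retrieve(query, index, k=2):
    """Return the k passages with the highest inner product to the query."""
    merged, vocab, vectors = index
    q = embed(query, vocab)
    scores = [sum(a * b for a, b in zip(q, v)) for v in vectors]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [merged[i] for i in top]

# Invented passages; the real corpora hold 344K and 168,306 entries.
corpora = {
    "wikipedia": ["a tow truck moves disabled vehicles", "shampoo is a hair care product"],
    "google_search": ["airport tugs push aircraft on the tarmac"],
}
index = build_index(corpora)
hits = retrieve("what product is used for hair", index, k=1)
print(hits)
```

Because the index is built once from fixed files, repeated runs return identical results, which is the reproducibility property the static corpora are meant to guarantee.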
4.4. Implementation Details
The experiments are conducted on an Ubuntu 24.04.1 system using an NVIDIA RTX 4090 GPU, based on Python 3.8 and the PyTorch 1.10.0 deep learning framework. Model training employs the AdamW [55] optimizer for parameter optimization, with the learning rate decaying by a factor of 0.75 per epoch.
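As a rough illustration of the decay schedule, the sketch below lists the per-epoch learning rate using the base rate (1 × 10⁻⁵) and epoch count (10) reported for this setup. It assumes that the first epoch runs at the base rate and that the factor is applied once after every epoch; the function name `lr_schedule` is hypothetical.

```python
def lr_schedule(base_lr, decay, num_epochs):
    """Learning rate used in each epoch under per-epoch exponential decay."""
    return [base_lr * decay ** epoch for epoch in range(num_epochs)]

rates = lr_schedule(base_lr=1e-5, decay=0.75, num_epochs=10)
for epoch, lr in enumerate(rates, start=1):
    print(f"epoch {epoch:2d}: lr = {lr:.3e}")
```

Under these assumptions the rate in the final epoch is 0.75⁹ times the base rate, i.e., roughly 7.5% of its initial value.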
The parameter settings in Table 1 are determined as follows. The hidden dimension (768) and the number of attention heads (12) are structurally determined by the RoBERTa-base [34] encoder adopted in our framework, and the dropout rate (0.2) follows its default configuration. The contrastive projection dimension (256) is consistent with standard practice in CLIP [31] and ALBEF [32]. The pre-alignment layer counts follow the configuration in [25], and the CRGF layer count is selected based on the sensitivity analysis in Section 4.7. For training, the learning rate (1 × 10⁻⁵) falls within the recommended range for fine-tuning pretrained Transformers [32,34], the weight decay (1 × 10⁻⁴) follows the AdamW [55] default, and the batch size (32) represents the maximum feasible size on our hardware while being consistent with comparable systems [5,13]. Three of the loss balancing coefficients are determined through the sensitivity analysis in Section 4.7, while one further coefficient reflects the standard equal-weighting convention for generation and auxiliary losses. The maximum knowledge length (k_max_len = 10) and question length (q_max_len = 15) are set to cover approximately 95% of training samples, the retrieval count (k_max_num = 10) and answer frequency threshold (min_occurrence = 3) follow established KB-VQA conventions [5,13], and the number of training epochs (10) is determined by validation convergence.
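One simple way to pick a maximum length that covers a target share of samples is to take the corresponding order statistic of the length distribution. The sketch below is only an assumption about how such a threshold could be derived; the token counts are invented and `length_for_coverage` is a hypothetical helper.

```python
def length_for_coverage(lengths, coverage_pct=95):
    """Smallest max length such that at least coverage_pct% of samples fit untruncated."""
    ordered = sorted(lengths)
    # Number of samples that must fit, rounded up (integer arithmetic avoids float error).
    rank = (coverage_pct * len(ordered) + 99) // 100
    return ordered[rank - 1]

# Invented token counts for 20 questions (not taken from any dataset).
question_lengths = [5, 6, 6, 7, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10, 11, 11, 12, 13, 15, 22]
q_max_len = length_for_coverage(question_lengths, coverage_pct=95)
print(q_max_len)
```

With this toy distribution, a single long outlier is truncated while the remaining 19 of 20 samples fit, which is the trade-off a 95% coverage rule makes.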
The detailed parameter settings of the BRIDGE framework are shown in Table 1.
4.5. Comparison with Existing Methods
Table 2 presents the quantitative comparison between BRIDGE and current mainstream methods on the OK-VQA dataset. To comprehensively evaluate the performance of the proposed method, the compared methods cover three technical approaches: traditional retrieval-based methods, LLM prompt-based methods, and end-to-end multi-modal large models. The retrieval-alignment modules of the BRIDGE framework remain unchanged, and only the backbone model of the reader (SAMR) is replaced to verify the generality of the framework.
Compared with traditional retrieval-based methods (ConceptBERT [8], KRISP [47], TRiG [5], etc.), the advantage of BRIDGE primarily stems from the multi-source semantic anchors provided by MSAC. These methods typically rely on only a single textual query for knowledge retrieval, whereas BRIDGE constructs a more comprehensive query representation by integrating captions, tags, and OCR text. Even under the Wikipedia-only setting, BRIDGE (BLIP, Wiki only) achieves 63.0%, outperforming RA-VQA [13] (54.5%) and TRiG [5] (50.5%), both of which also rely solely on Wikipedia, by 8.5 and 12.5 percentage points, respectively. This indicates that the gains come mainly from the proposed alignment architecture rather than from additional knowledge source coverage. We attribute the improvement to the symmetric pre-alignment mechanism of VACE and QACE, which bridges the cross-modal semantic gap, and to the cross-residual gating mechanism of CRGF, which suppresses noise introduced during retrieval through semantic residual-driven dynamic gating.
Compared with LLM prompt-based methods (Prophet [17], PromptCap [18], SKP [51], etc.), BRIDGE achieves comparable or even superior performance with substantially fewer parameters. With BLIP as the reader, BRIDGE (64.2%, 4B) surpasses Prophet (61.2%, 175B) and PromptCap [18] (60.4%, 175B) by 3.0 and 3.8 percentage points, respectively, while using less than 2.3% of their parameter count. This improvement is attributed to the cross-residual gating mechanism of CRGF that effectively suppresses fusion noise, and the cross-modal contrastive learning of VACL that enhances alignment quality at the representation space level, compensating for the gap in model scale.
Compared with end-to-end multi-modal large models (LLaVA, Qwen2-VL [41], PaLM-E [21], etc.), BRIDGE consistently outperforms the corresponding base models when using readers of the same scale. BRIDGE (Qwen2-VL [41]) (66.2%) surpasses the original Qwen2-VL-7b [41] (58.8%) by 7.4 percentage points, and BRIDGE (LLaVA-NeXT [42]) (67.8%) surpasses the original LLaVA-NeXT-8b [42] (62.2%) by 5.6 percentage points. This consistent improvement indicates that the retrieval-alignment modules of BRIDGE provide effective knowledge enhancement and semantic alignment support for different reader backbones. Although PaLM-E-562B [21] (66.1%) achieves high accuracy with its massive 562B parameter count, BRIDGE (LLaVA-NeXT [42]) surpasses it with only approximately 8B parameters, roughly 1.4% of the former.
Parameter efficiency analysis. The BRIDGE framework demonstrates significant advantages in parameter efficiency. With BLIP as the reader, the total parameter count is approximately 4B, only 2.3% of the GPT-3 model (175B) used by Prophet [17] and 0.7% of PaLM-E-562B [21], yet it achieves 64.2% accuracy on OK-VQA. Even with LLaVA-NeXT-8b [42] as the reader (8B), BRIDGE remains far below the parameter counts of PromptCap [18] (175B) and REVIVE [15] (175B) while surpassing them in accuracy by 7.4 and 11.2 percentage points, respectively. This result indicates that, without relying on extreme parameter scales, a high-performance visual question answering system can still be constructed in resource-constrained scenarios through refined cross-modal semantic alignment design (MSAC + VACE/QACE pre-alignment + CRGF gated fusion + VACL contrastive learning). Moreover, the consistent performance improvement of BRIDGE when equipped with different readers (BLIP 64.2% → Qwen2-VL 66.2% → LLaVA-NeXT 67.8%) indicates that the proposed retrieval-alignment modules possess strong generality and can work synergistically with generative models of varying scales.
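The parameter ratios quoted in this comparison can be checked with a few lines of arithmetic. The sketch uses only the approximate parameter counts stated in the text (4B, 8B, 175B, 562B); the helper `percent_of` is hypothetical.

```python
def percent_of(small, large):
    """Size of `small` expressed as a percentage of `large`."""
    return 100.0 * small / large

ratios = {
    "BLIP reader (4B) vs GPT-3 (175B)": percent_of(4, 175),
    "BLIP reader (4B) vs PaLM-E (562B)": percent_of(4, 562),
    "LLaVA-NeXT reader (8B) vs PaLM-E (562B)": percent_of(8, 562),
}
for name, value in ratios.items():
    print(f"{name}: {value:.2f}%")
```

Since the parameter counts themselves are approximate, the resulting percentages should be read as order-of-magnitude indicators rather than exact figures.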
To further disentangle the contribution of the proposed architecture from the benefit of incorporating dual knowledge sources, we conduct a knowledge source ablation study. As shown in Table 2, when Wikipedia is used as the sole knowledge source, BRIDGE (BLIP, Wiki only) achieves 63.0% on OK-VQA, and BRIDGE (LLaVA-NeXT, Wiki only) achieves 66.5%; introducing Google Search on top of this yields only an additional gain of 1.2 to 1.3 percentage points. This result indicates that the architectural design itself is the primary driver of the overall performance improvement. Taking LLaVA-NeXT as an example, the BRIDGE architecture alone raises accuracy from 62.2% to 66.5%, a gain of 4.3 percentage points, whereas the addition of the extra knowledge source contributes only 1.3 percentage points (66.5% to 67.8%), with the former exceeding the latter by more than a factor of three.
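The decomposition for the LLaVA-NeXT reader can be reproduced directly from the three accuracies quoted above; the variable names are illustrative only.

```python
baseline = 62.2      # LLaVA-NeXT-8b without BRIDGE
wiki_only = 66.5     # BRIDGE (LLaVA-NeXT, Wiki only)
wiki_google = 67.8   # BRIDGE (LLaVA-NeXT) with both corpora

architecture_gain = round(wiki_only - baseline, 1)     # gain from the alignment modules
knowledge_gain = round(wiki_google - wiki_only, 1)     # gain from adding Google Search
ratio = architecture_gain / knowledge_gain

print(architecture_gain, knowledge_gain, round(ratio, 2))
```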
Table 3 reports the comparison between BRIDGE and existing methods on the FVQA benchmark dataset.
BRIDGE achieves the best results on both Top-1 and Top-3 accuracy, reaching 66.53% and 79.87%, respectively. Compared with the previous best method, DEDR+MM-FiD [4], BRIDGE achieves a 4.73 percentage point improvement in Top-1 accuracy while substantially leading UnifER+ViLT [3] by 10.15 percentage points in Top-3 accuracy. The significant leap in the Top-3 metric validates the effectiveness of the MSAC multi-source semantic anchor mechanism in the knowledge retrieval stage: by uniformly transforming heterogeneous visual information into fine-grained semantic symbols, DKR successfully retrieves more relevant candidate knowledge. Meanwhile, the cross-residual gating mechanism of CRGF dynamically filters the inevitable modality noise during retrieval, achieving a breakthrough in Top-1 precision.
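For readers unfamiliar with the two metrics, Top-k accuracy counts a question as correct when the gold answer appears among the k highest-ranked candidates. The sketch below illustrates this definition on invented predictions; `top_k_accuracy` is a hypothetical helper, not the official FVQA evaluation script.

```python
def top_k_accuracy(ranked_predictions, gold_answers, k):
    """Percentage of questions whose gold answer is among the first k ranked candidates."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_predictions, gold_answers))
    return 100.0 * hits / len(gold_answers)

# Invented ranked candidates for three questions.
ranked = [["dog", "cat", "wolf"], ["red", "orange", "pink"], ["bus", "train", "tram"]]
gold = ["dog", "orange", "car"]

top1 = top_k_accuracy(ranked, gold, k=1)
top3 = top_k_accuracy(ranked, gold, k=3)
print(f"Top-1: {top1:.2f}%, Top-3: {top3:.2f}%")
```

Top-3 is never lower than Top-1 by construction, so the gap between them reflects how often the correct answer is retrieved but not ranked first.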
Table 4 reports the comparison between BRIDGE and mainstream methods on the A-OKVQA benchmark dataset.
BRIDGE achieves 76.8% under the DA setting, surpassing Prophet [17] (76.4%), and 66.4% under the MC setting, surpassing SKP [51] (65.3%). The consistent performance across datasets validates the generalization capability of the BRIDGE framework under different knowledge reasoning scenarios, demonstrating that the multi-source bridging mechanism of semantic anchors and the representation space alignment established through cross-modal contrastive learning are not overfitting designs tailored to a specific dataset but rather universally applicable cross-modal reasoning enhancement strategies.
Discussion on cross-method comparability. The methods in Table 2, Table 3 and Table 4 span fundamentally different paradigms (retrieval-based, prompt-based, and end-to-end) and inevitably differ in backbone scale and knowledge source configuration, a characteristic common to KB-VQA benchmarking rather than a limitation of our evaluation. The cross-method tables are therefore intended to situate BRIDGE within the broader performance landscape. To isolate the contribution of the proposed architecture, the controlled ablation studies in Table 6, Table 7 and Table 8 fix the backbone (BLIP), knowledge sources, and training protocol, varying only the module under evaluation. We further report a knowledge source ablation in Table 2: under the Wikipedia-only setting, BRIDGE (LLaVA-NeXT) achieves 66.5% on OK-VQA, substantially outperforming TRiG [5] (50.5%) and RA-VQA [13] (54.5%) under the same knowledge source, confirming that the architectural innovations are the principal drivers of the observed improvements.
Table 5 reveals three findings regarding computational efficiency. First, the alignment modules of BRIDGE (VACE, QACE, CRGF, RoBERTa encoding, and dense retrieval) contribute 48 GFLOPs, accounting for 6.8% of BRIDGE (BLIP) and 1.7% of BRIDGE (LLaVA-NeXT). The proposed cross-modal alignment architecture thus imposes minimal overhead relative to the reader backbone. Second, BRIDGE (BLIP) achieves 64.2% accuracy at 711 GFLOPs—less than one-third of SKP (2500 G, 63.3%) and approximately one-quarter of LLaVA-NeXT-8b (2720 G, 62.2%)—yielding a favorable accuracy-efficiency trade-off for resource-constrained deployment. Third, augmenting LLaVA-NeXT-8b with BRIDGE incurs a 7.0% increase in FLOPs (2720 → 2911 G) and a 5.6 percentage point accuracy improvement (62.2 → 67.8%), confirming that the retrieval-alignment pipeline provides substantial gains at marginal computational cost. The latency increase (33.9%) exceeds the FLOPs increase because preprocessing modules (VinVL, Oscar, Umi-OCR) execute as sequential pipeline stages; in deployment scenarios with pre-indexed visual features, this sequential overhead is eliminated.
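The overhead figures can be recomputed from the GFLOPs values quoted above. The sketch uses the rounded numbers as printed in the text, so small deviations from the quoted percentages may arise from rounding of the inputs; the variable names are illustrative.

```python
align = 48.0           # alignment modules (GFLOPs), as quoted
bridge_blip = 711.0    # BRIDGE (BLIP)
bridge_llava = 2911.0  # BRIDGE (LLaVA-NeXT)
llava_base = 2720.0    # LLaVA-NeXT-8b alone

share_blip = 100 * align / bridge_blip
share_llava = 100 * align / bridge_llava
flops_increase = 100 * (bridge_llava - llava_base) / llava_base
accuracy_gain = 67.8 - 62.2  # percentage points

print(f"alignment share, BLIP reader:  {share_blip:.2f}%")
print(f"alignment share, LLaVA reader: {share_llava:.2f}%")
print(f"FLOPs increase over LLaVA-NeXT-8b: {flops_increase:.2f}%")
print(f"accuracy points per 1% extra FLOPs: {accuracy_gain / flops_increase:.2f}")
```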
4.6. Ablation Studies
4.6.1. Core Component Ablation
To verify the effectiveness of each core component in the BRIDGE framework, we design a series of ablation experiments on the three benchmark datasets. Specifically, starting from the complete model (Full Model), each key component is sequentially removed to evaluate its contribution. The variants are defined as follows.
w/o Cross-Residual Gating: CRGF is degraded to a standard Q-V cross-modal encoder by removing the cross-residual gated modulation operations (semantic residual computation and gated scaling) after each encoder block layer, retaining only the standard CA → SA → FFN stacking structure.
w/o Pre-Alignment (VACE+QACE): The Visual–Anchor Cross-Modal Encoder (VACE) and Question–Anchor Cross-Modal Encoder (QACE) are removed. The linearly projected visual features and question features are directly fed into CRGF for Q-V fusion without the pre-alignment bridging through semantic anchors.
w/o contrastive loss (VACL): The Visual–Anchor Contrastive Learning module (VACL) is removed, and the training objective does not include the cross-modal contrastive learning loss.
w/o information bottleneck (VIB): The Variational Information Bottleneck regularization (VIB) is removed. The answer representation is directly used for classification and retrieval without stochastic constraints.
w/o Semantic Anchors (MSAC): The entire Multi-Source Semantic Anchor Construction module is removed, and no captions, tags, or OCR text are used. In this case, VACE and QACE become simultaneously ineffective due to the absence of anchor inputs; the semantic residuals in CRGF degenerate to zero without the reference, and VACL cannot compute the contrastive loss without the anchor representation. This variant is equivalent to retaining only the visual encoder + question encoder + standard Q-V encoder.
Baseline: Only the visual encoder and question encoder are retained, with answers predicted through simple feature concatenation and an MLP classifier, without using any cross-modal alignment mechanisms, semantic anchors, or auxiliary loss functions.
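The dependency between the variants, in particular that removing the anchors also disables every anchor-dependent component, can be expressed as a small configuration object. This is a hypothetical sketch: the class `AblationConfig`, its field names, and the `resolved` method are illustrative and not taken from the BRIDGE code base. The text does not state whether VIB remains active in the anchor-free variant, so it is left unchanged here as an assumption, and the Baseline variant is not modeled.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AblationConfig:
    use_anchors: bool = True         # MSAC: captions, tags, OCR text
    use_pre_alignment: bool = True   # VACE + QACE
    use_gating: bool = True          # cross-residual gating inside CRGF
    use_contrastive: bool = True     # VACL loss
    use_vib: bool = True             # VIB regularization

    def resolved(self):
        """Switch off every component that cannot operate without anchors."""
        if self.use_anchors:
            return self
        # Assumption: VIB is left as configured because the text does not specify it.
        return replace(self, use_pre_alignment=False, use_gating=False,
                       use_contrastive=False)

FULL = AblationConfig()
VARIANTS = {
    "Full Model": FULL,
    "w/o Cross-Residual Gating": replace(FULL, use_gating=False),
    "w/o Pre-Alignment": replace(FULL, use_pre_alignment=False),
    "w/o VACL": replace(FULL, use_contrastive=False),
    "w/o VIB": replace(FULL, use_vib=False),
    "w/o Semantic Anchors": replace(FULL, use_anchors=False).resolved(),
}
for name, cfg in VARIANTS.items():
    print(f"{name}: {cfg}")
```

Encoding the dependency in one place makes it harder to run an inconsistent variant, for example one that requests the contrastive loss while no anchors are available.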
Table 6 reports the ablation results of each core component on the three benchmark datasets. The removal of each component leads to a consistent performance decline, and the contribution trends remain highly consistent across datasets, validating the generality of the proposed designs.
At the feature representation level, the semantic anchor mechanism (MSAC) contributes the most significant performance gain: removing it causes a 5.97 percentage point accuracy drop on OK-VQA, along with drops of 5.86 (Top-1) and 7.11 (DA) percentage points on FVQA and A-OKVQA, respectively, confirming the core value of multi-source semantic anchors as a cross-modal bridging hub. The removal of the pre-alignment stage (VACE+QACE) also degrades performance (OK-VQA −1.33, FVQA Top-1 −1.22, A-OKVQA DA −1.64 percentage points), indicating that skipping pre-alignment and directly performing Q-V fusion causes visual and question features to interact ineffectively due to the lack of semantic bridging. This supports the two-stage design of pre-alignment followed by deep fusion.
At the fusion module level, degrading CRGF to a standard Q-V cross-modal encoder without cross-residual gating results in a 0.74 percentage point drop on OK-VQA and a 0.78 percentage point drop on A-OKVQA DA. Although this margin is less pronounced than those of MSAC and pre-alignment, it is consistently observed across all three datasets, validating that the semantic residual-driven gating mechanism provides additional noise suppression capability on top of the standard encoder block.
At the training objective level, the removal of the cross-modal contrastive learning loss causes a 1.15 percentage point drop on OK-VQA and a 1.45 percentage point drop on A-OKVQA DA, validating the effectiveness of constraining cross-modal alignment at the representation space structure level. The gain from the VIB regularization is relatively modest (a 0.38 percentage point drop on OK-VQA when it is removed), but it remains stable across all datasets, indicating that the variational information bottleneck compresses redundant information in the fused representation under different scenarios and complements the deterministic noise reduction achieved by cross-residual gating with a probabilistic constraint.
4.6.2. Semantic Anchor Component Ablation
Table 7 shows that the complete combination of all three components significantly outperforms any single component or pairwise combination, validating the synergistic complementary value of multi-source information. From the single-component results, captions (60.94%) contribute the most, as they capture the global scene semantics of the image and provide foundational context for most questions. Tags rank second (59.68%), with their value lying in providing fine-grained entity and attribute information that captions may omit. OCR text alone yields the weakest performance (57.41%), since only a subset of images contains recognizable scene text. The complete anchor (64.2%) exceeds the mean accuracy of the three single components (59.34%) by 4.86 percentage points, indicating significant synergistic effects among the three components.
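The synergy figure follows from the three single-component accuracies and the full-anchor accuracy quoted above; the dictionary keys are illustrative labels.

```python
single = {"caption": 60.94, "tags": 59.68, "ocr": 57.41}
full_anchor = 64.2

mean_single = sum(single.values()) / len(single)
synergy_gap = full_anchor - mean_single
best_single = max(single, key=single.get)

print(f"mean of single components: {mean_single:.2f}%")
print(f"gap to full anchor: {synergy_gap:.2f} points")
print(f"strongest single component: {best_single}")
```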
4.6.3. Contrastive Learning Configuration Ablation
Table 8 compares the performance of different contrastive learning configurations, validating the core hypothesis of BRIDGE. V-Sem pairs (64.2%) outperform V-C pairs (63.58%) and Q-V pairs (63.21%) on the OK-VQA dataset, confirming that multi-source semantic anchors exhibit a stronger semantic association with visual content than single captions. The trend that V-C pairs outperform Q-V pairs is consistent with the findings of prior research, further validating that descriptive text (captions/anchors) is more suitable than interrogative text (questions) as the anchoring target for cross-modal contrastive learning. Notably, the combination of V-Sem and Q-V (63.87%) not only fails to surpass V-Sem alone but causes a 0.33 percentage point performance drop. This indicates that simultaneously employing two contrastive learning objectives introduces redundant gradient signals that interfere with the alignment direction of the representation space.
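To make the contrastive configurations concrete, the following sketch computes a symmetric InfoNCE loss between paired embeddings, in the style popularized by CLIP. It is an assumption that VACL uses this exact form; the toy three-dimensional vectors are invented, and the projection to the 256-dimensional contrastive space is omitted. A V-Sem configuration would pass visual and anchor embeddings, whereas a Q-V configuration would pass question and visual embeddings.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys in the batch."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)

def symmetric_contrastive_loss(visual, anchor, temperature=0.07):
    """Average of the visual-to-anchor and anchor-to-visual InfoNCE terms."""
    return 0.5 * (info_nce(visual, anchor, temperature) + info_nce(anchor, visual, temperature))

# Invented embeddings: row i of `aligned` is close to row i of `visual`.
visual = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
aligned = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.2, 1.0]]
shuffled = [aligned[1], aligned[2], aligned[0]]

loss_aligned = symmetric_contrastive_loss(visual, aligned)
loss_shuffled = symmetric_contrastive_loss(visual, shuffled)
print(loss_aligned, loss_shuffled)
```

The loss is small when each visual embedding is closest to its own partner and large when the pairing is broken, which is the signal that pulls matched pairs together in the shared space.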
4.6.4. Robustness Under Noisy Anchor Inputs
Anchor generators—object detectors and OCR engines—are imperfect in practice; we therefore assess whether the cross-residual gating in CRGF attenuates input-side corruption. Noise is injected at inference only, leaving the trained weights untouched.
Two corruption modes are considered at several corruption ratios p. In tag noise, each predicted tag is independently replaced with probability p by a category drawn uniformly from the COCO vocabulary; in OCR noise, each character is independently substituted with probability p by a random alphabetic character. The Full Model is compared against the w/o CRG variant, in which CRGF reduces to a standard cross-modal encoder. Results on OK-VQA are reported in Table 9.
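The two corruption modes described above can be sketched as follows. The category list is a small hypothetical subset standing in for the full COCO vocabulary, the choice of upper- and lower-case ASCII letters as the replacement alphabet is an assumption, and the function names are illustrative. As written, spaces count as characters and may also be substituted.

```python
import random
import string

def corrupt_tags(tags, p, vocabulary, rng):
    """Replace each tag independently with probability p by a uniformly drawn category."""
    return [rng.choice(vocabulary) if rng.random() < p else tag for tag in tags]

def corrupt_ocr(text, p, rng):
    """Substitute each character independently with probability p by a random letter."""
    return "".join(rng.choice(string.ascii_letters) if rng.random() < p else ch
                   for ch in text)

# Hypothetical subset of category names; the full vocabulary would be used in practice.
VOCAB = ["person", "car", "dog", "bottle", "chair", "laptop", "book", "clock"]

rng = random.Random(0)  # fixed seed so the corruption is repeatable
noisy_tags = corrupt_tags(["laptop", "book", "bottle"], p=0.6, vocabulary=VOCAB, rng=rng)
noisy_text = corrupt_ocr("Go ahead and push me", p=0.6, rng=rng)
print(noisy_tags)
print(noisy_text)
```

Note that a uniform draw can return the original tag or character, so the effective corruption rate is slightly below p.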
Two trends emerge from Table 9. First, the Full Model exhibits consistently slower accuracy decay than the w/o CRG variant; at 60% tag noise, the Full Model loses 1.26 percentage points, whereas the ablated variant loses 1.75 percentage points, a relative degradation 38.9% larger. Second, the gap between the Full Model and the w/o CRG variant grows monotonically with p under both corruption modes, from 0.74 percentage points on clean inputs to 1.23 percentage points at 60% tag noise and 1.01 percentage points at 60% OCR noise. The gating mechanism thus contributes little when anchors are reliable but engages more strongly as their quality deteriorates.
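The relative degradation and the widening gap can be verified from the quoted numbers. The check below also shows that the two accuracy losses (1.26 and 1.75 points) are consistent with the 60% tag-noise gap of 1.23 points, which suggests that they refer to that noise level; the variable names are illustrative.

```python
full_drop = 1.26     # Full Model accuracy loss under tag noise (percentage points)
ablated_drop = 1.75  # w/o CRG accuracy loss at the same noise level
clean_gap = 0.74     # Full Model minus w/o CRG on clean inputs

relative_excess = 100 * (ablated_drop / full_drop - 1)
noisy_gap = clean_gap + (ablated_drop - full_drop)

print(f"relative degradation of ablated variant: {relative_excess:.1f}% larger")
print(f"implied gap under noise: {noisy_gap:.2f} points")
```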
Two mechanisms underlie this behavior. The multi-source anchor is intrinsically redundant: corruption in any single stream is diluted by the remaining streams, constraining the noise that reaches the gating reference. Complementarily, the sigmoid function in the gating operation saturates at extreme residuals, bounding the modulation and preventing anomalous anchors from disproportionately perturbing the fused output.
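The saturation argument can be illustrated with a scalar toy. The sketch below is not the CRGF formulation, which involves learned projections over feature vectors and is not reproduced in this section; it only assumes that a sigmoid gate driven by a residual scales the feature, and shows that the gate stays within [0, 1] however extreme the residual becomes. The function `gated_scale` is hypothetical.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def gated_scale(feature, anchor_ref):
    """Scale a feature by a sigmoid gate driven by its residual to the anchor reference."""
    residual = feature - anchor_ref
    gate = sigmoid(residual)
    return gate * feature, gate

feature = 1.0
for anchor_ref in (1.0, 0.0, -50.0, 50.0, 1e6):
    scaled, gate = gated_scale(feature, anchor_ref)
    print(f"anchor_ref={anchor_ref:>9}: gate={gate:.4f}, scaled={scaled:.4f}")
```

Even an anchor reference that is off by six orders of magnitude can at most switch the gate fully on or fully off, so the magnitude of the modulated feature never exceeds that of the input feature.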
4.7. Hyperparameter Sensitivity Analysis
To evaluate the sensitivity of the BRIDGE framework to key hyperparameters, this paper systematically analyzes the effects of the CRGF layer count, contrastive loss weight, information bottleneck weight, and temperature coefficient on the OK-VQA dataset. The results are shown in Figure 4.
The pre-alignment layer counts are fixed following prior work [25]. As shown in Figure 4a, performance peaks at a CRGF layer count of 5 (64.20%), with shallower networks underperforming due to insufficient cross-modal refinement and deeper ones exhibiting marginal overfitting. For the contrastive loss weight, Figure 4b shows the best result at 0.1, where the contrastive loss provides effective alignment without dominating the total gradient. Regarding the information bottleneck weight, Figure 4c indicates that a value of 0.01 achieves optimal performance; notably, this is the most sensitive hyperparameter, as increasing it to 0.1 causes a sharp drop to 62.18% due to excessive compression of critical semantic information. Similarly, Figure 4d reveals that the optimal temperature is 0.07, balancing discrimination sharpness and gradient stability. Overall, BRIDGE is reasonably robust across hyperparameters, with only the information bottleneck weight requiring careful tuning (recommended below 0.02).
4.8. Qualitative Analysis
To intuitively illustrate the reasoning process of BRIDGE, Figure 5 presents four representative examples, including three correct and one incorrect prediction.
In Figure 5a, the question asks about the contents of visible bottles. The bathroom scene anchor in the caption effectively constrains DKR to retrieve bath-product-related knowledge rather than generic “bottle” knowledge, enabling the model to correctly predict “shampoo.” In Figure 5b, the question asks which desk item could help with a cold. Among multiple candidate objects (laptop, book, tissue box), the fine-grained tag “(box, tissue)” provides DKR with a precise retrieval anchor, linking the functional association between “tissue” and “cold” to yield the correct prediction. In Figure 5c, the question asks what a sign on the cart instructs. OCR extracts “Go ahead and push me,” and SAMR precisely aligns this action instruction with the intent semantics of “tell you to do” in the question, correctly predicting “push it.”
Figure 5d illustrates a typical failure mode: correct knowledge but answer generation drift. DKR accurately retrieves that “orange vehicles are tow trucks used to move aircraft on the tarmac,” yet the reader shifts attention toward the operational object “aircraft” rather than extracting the core functional semantics “transport,” producing the incorrect answer “airplane.” This reveals that the reader struggles to distinguish functional descriptions from operational objects when both co-occur in the retrieved knowledge.
Table 10 reveals a clear pattern. In Example (a), where the caption already provides sufficient scene context (“bathroom interior”), all three methods correctly answer “shampoo”—demonstrating that when caption-level information is adequate, even single-caption retrieval (TRiG) and pure visual reasoning (LLaVA-NeXT) can succeed. However, this parity breaks down in Examples (b) and (c), where the critical information lies outside the coverage of the caption. In Example (b), the caption describes the overall scene (person, desk, laptop) but entirely omits the visually inconspicuous tissue box. Without the fine-grained tag “(box, tissue),” the retrieval query of TRiG lacks the key object anchor, leading it to a commonsense guess (“medicine”), while LLaVA-NeXT hallucinates an absent item (“tea”). Only BRIDGE, anchored by the tag, correctly identifies the target. In Example (c), both BRIDGE and LLaVA-NeXT arrive at correct answers but through fundamentally different mechanisms. BRIDGE explicitly extracts the sign text via OCR (“Go ahead and push me”) and aligns it with the question through SAMR, producing a complete and faithful answer: “push it.” LLaVA-NeXT, benefiting from the visual text recognition capability inherent in its vision encoder, also recognizes part of the text and outputs “push”—acceptable under the VQA soft voting protocol. However, this success is contingent on the text being clearly visible and in a common font; for degraded, occluded, or multilingual scene text, explicit OCR anchoring provides a more reliable extraction pathway. In contrast, TRiG, which lacks any text recognition capability, resorts to inferring the content of the sign from scene context and produces the functionally plausible but factually incorrect answer “carry luggage.”
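The remark that “push” is acceptable under the soft voting protocol can be made concrete with the commonly used simplified VQA accuracy, min(number of matching annotators / 3, 1). The annotations below are invented for illustration and are not taken from the dataset; `vqa_soft_accuracy` is a hypothetical helper rather than the official evaluation script, which additionally normalizes answers.

```python
def vqa_soft_accuracy(prediction, human_answers):
    """Simplified VQA accuracy: min(number of matching annotators / 3, 1)."""
    matches = sum(answer == prediction for answer in human_answers)
    return min(matches / 3.0, 1.0)

# Invented annotations for Example (c): ten annotators, several phrasings.
annotations = ["push it", "push it", "push it", "push", "push",
               "push", "push it", "push cart", "push it", "push"]

score_full = vqa_soft_accuracy("push it", annotations)
score_short = vqa_soft_accuracy("push", annotations)
score_wrong = vqa_soft_accuracy("carry luggage", annotations)
print(score_full, score_short, score_wrong)
```

Under this scoring, any answer given by at least three annotators receives full credit, which is why a shorter but commonly chosen phrasing can score as well as the complete one.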