This experimental evaluation is designed to systematically examine the effectiveness and practical implications of the proposed framework. Rather than focusing solely on performance improvements of individual models, the experiments aim to assess whether scientific reasoning benefits from organizing hypothesis validation, hypothesis refinement, and external evidence retrieval into a unified workflow. Specifically, we investigate the following three research questions:
All experiments are conducted on scientific texts in the chemistry domain, chosen for its rich domain-specific terminology and the structured experimental reasoning found in practical scientific research scenarios.
4.2. Comparison of the Proposed Workflow with Baselines (RQ1)
We evaluate the proposed workflow by comparing several representative configurations that progressively incorporate external evidence retrieval and modular retrieval controls. This comparison aims to quantify the system-level benefit of organizing hypothesis validation, refinement, and evidence retrieval as a unified and controllable pipeline rather than introducing a new task-specific model.
For hypothesis validation, we compare four workflow configurations: standalone NLI and three retrieval-augmented variants (Naive RAG, Advanced RAG, and our Modular RAG setting). As shown in Table 1, NLI with Naive RAG yields only marginal improvements over the NLI-only baseline, suggesting that simply incorporating external knowledge is not sufficient for chemical NLI when evidence quality is uncontrolled. In contrast, Advanced RAG shows a clearer performance improvement, indicating that stronger indexing and retrieval design can partially mitigate this problem. Our Modular RAG-based framework achieves the best performance across all metrics, with 97.09% accuracy, 97.28% F1-score, and 99.28% AUC, outperforming the NLI-only baseline by 5.86 percentage points in accuracy, 5.71 in F1-score, and 2.30 in AUC. These results suggest that chemical NLI benefits from the retrieval quality control and vector–entity joint retrieval in our Modular RAG-based framework, which help yield more reliable validation decisions. The confusion matrix and ROC curve of this evaluation are shown in Figure 2.
Because the proposed hypothesis refinement module follows the context-aware text infilling method introduced in [20], we evaluate four text infilling configurations, mirroring the hypothesis validation setup. The results are shown in Table 2 and Table 3.
Table 2 reports token- and semantic-level similarity between the generated infillings and the ground-truth spans. Here, BERTScore is computed using SciBERT [26]. Interestingly, Naive RAG does not improve over the infill-only setting, indicating that directly incorporating external knowledge may introduce irrelevant or noisy cues that harm span infilling. Advanced RAG and our Modular RAG-based framework both show substantial improvements. In particular, our framework achieves the best performance (48.43 BLEU, 48.8 ROUGE, and 0.9258 BERTScore), suggesting that it reproduces the lexical structure of the ground-truth infillings while maintaining semantic consistency.
Additionally, we evaluate the textual fluency of completed hypotheses by measuring perplexity (PPL) with a separate evaluation model, Phi-3.5-mini-instruct [27]. As shown in Table 3, with the original hypothesis serving as the baseline, Naive RAG yields the lowest PPL of 10.64, indicating that the textual fluency of completed hypotheses does not benefit from incorporating external knowledge. In addition to BERTScore, which evaluates span-level similarity, we employ LANLI [21] to compute the NLI score, which assesses whether the completed hypotheses are semantically aligned with the conclusion. Since all hypothesis–conclusion pairs in the hypothesis revision dataset are labeled as entailment, the average NLI score of 0.832 obtained on the unmasked test set (Table 3) serves as an upper-bound reference. The score drops to 0.5357 when high-attribution spans within the hypotheses are masked. All infilling variants raise the score above this masked level, demonstrating the effectiveness of hypothesis refinement via span infilling. In particular, our Modular RAG-based framework achieves the highest NLI score of 0.8292, which approaches the original score (0.832), indicating that the revised hypotheses better align with the conclusions. Finally, we report SCR to quantify the models’ ability to produce outputs that conform to the expected output format (e.g., a well-formed set of infillings that matches the number of masked spans and aligns one-to-one with each corresponding mask). As shown in Table 3, SCR is near-saturated across all configurations, indicating that introducing external knowledge has little influence on SCR and that all configurations reliably follow the expected output format.
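The format check behind SCR can be illustrated with a minimal sketch. It assumes that each masked span is marked by a literal `<mask>` token and that the model output has already been parsed into a list of strings; the token, the function names, and the parsing step are hypothetical and not taken from the evaluated systems.

```python
from typing import List, Sequence, Tuple

MASK_TOKEN = "<mask>"  # assumed mask marker; the actual marker may differ


def is_format_compliant(masked_hypothesis: str, infillings: Sequence[str]) -> bool:
    """True when there is exactly one non-empty infilling per masked span."""
    n_masks = masked_hypothesis.count(MASK_TOKEN)
    same_count = len(infillings) == n_masks
    all_filled = all(span.strip() for span in infillings)
    return same_count and all_filled


def format_compliance_rate(samples: List[Tuple[str, Sequence[str]]]) -> float:
    """Fraction of (masked hypothesis, infillings) pairs that pass the check."""
    if not samples:
        raise ValueError("at least one sample is required")
    passed = sum(is_format_compliant(text, spans) for text, spans in samples)
    return passed / len(samples)


# Toy samples: only the first one has one non-empty infilling per mask.
toy_samples = [
    ("The <mask> increases the <mask>.", ["catalyst", "yield"]),
    ("The <mask> dissolves.", ["salt", "quickly"]),  # too many infillings
    ("The <mask> dissolves.", [""]),                 # empty infilling
]
toy_rate = format_compliance_rate(toy_samples)
```

A stricter check could also verify that each infilling is placed at its own mask position, which this sketch does not attempt.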
To assess the effectiveness of the attribution method (SHAP [28]) used in our framework, we conduct a word masking-based faithfulness evaluation on the hypothesis revision dataset. Specifically, we mask each of the top 10 high-attribution words ranked by SHAP individually and measure the NLI score drop $\Delta$:
$$\Delta = s_{\text{orig}} - s_{\text{mask}},$$
where $s_{\text{orig}}$ denotes the baseline NLI score without masking, and $s_{\text{mask}}$ denotes the NLI score after masking. Since the hypothesis revision dataset only consists of entailment-labeled samples, the NLI score is expected to drop after keyword masking, and a larger $\Delta$ indicates that the masked word is more important for the entailment decision. The results are illustrated in
Figure 3 (left). We observe that removing the top-ranked word leads to the largest NLI score decrease (Mean
), while the drops for lower-ranked words quickly shrink toward zero. We further measure MoRF (most relevant first) and LeRF (least relevant first) by jointly masking the five highest-attribution and the five lowest-attribution words ranked by SHAP, respectively. As illustrated in Figure 3 (right), the NLI score drops under MoRF are substantially larger and more dispersed than those under LeRF, which center tightly around 0. The consistent rank-wise decay in the NLI score drop and the clear separation between MoRF and LeRF indicate that SHAP reliably identifies the words that contribute most to the validation decision, supporting its use as the attribution signal for the subsequent refinement steps in our framework.
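The masking protocol described above can be sketched as follows. This is a minimal illustration and not the implementation used in the experiments: the `[MASK]` placeholder, the whitespace word splitting, and all function names are assumptions, and `toy_score` is a stand-in for a real NLI scorer that would also take the conclusion as input.

```python
from typing import Callable, List, Sequence, Set, Tuple

MASK_TOKEN = "[MASK]"  # assumed placeholder; the real mask string may differ


def mask_words(words: Sequence[str], positions: Set[int]) -> str:
    """Rebuild the hypothesis with the words at `positions` replaced by the mask."""
    return " ".join(MASK_TOKEN if i in positions else w for i, w in enumerate(words))


def rank_by_attribution(attributions: Sequence[float]) -> List[int]:
    """Word indices sorted from highest to lowest attribution."""
    return sorted(range(len(attributions)), key=lambda i: attributions[i], reverse=True)


def rank_wise_drops(words: Sequence[str], attributions: Sequence[float],
                    nli_score: Callable[[str], float], top_k: int = 10) -> List[float]:
    """Score drop when each of the top-k words is masked on its own."""
    base = nli_score(" ".join(words))
    ranked = rank_by_attribution(attributions)[:top_k]
    return [base - nli_score(mask_words(words, {i})) for i in ranked]


def morf_lerf_drops(words: Sequence[str], attributions: Sequence[float],
                    nli_score: Callable[[str], float], k: int = 5) -> Tuple[float, float]:
    """Score drops when the k highest (MoRF) or k lowest (LeRF) words are masked together."""
    base = nli_score(" ".join(words))
    ranked = rank_by_attribution(attributions)
    morf = base - nli_score(mask_words(words, set(ranked[:k])))
    lerf = base - nli_score(mask_words(words, set(ranked[-k:])))
    return morf, lerf


def toy_score(text: str) -> float:
    """Stand-in scorer that rewards two keywords; not a real NLI model."""
    return 0.1 + 0.5 * ("catalyst" in text) + 0.3 * ("yield" in text)


words = "the catalyst improves the yield".split()
attributions = [0.0, 0.6, 0.1, 0.0, 0.3]  # made-up attribution values
drops = rank_wise_drops(words, attributions, toy_score, top_k=3)
morf, lerf = morf_lerf_drops(words, attributions, toy_score, k=2)
```

With a faithful attribution method, `drops` should decay with rank and `morf` should clearly exceed `lerf`, which is the pattern the evaluation looks for.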
To mitigate potential evaluator coupling, we evaluate hypothesis refinement using three independent verifiers that span different model architectures: Roberta-large-mnli [29] (encoder-only), Flan-T5-xxl [30] (encoder–decoder), and Qwen2.5-7B-Instruct [31] (decoder-only). We apply the three verifiers in a zero-shot manner, without any additional training or fine-tuning. Specifically, for all contradiction-labeled samples in the CRNLI test set, we compute NLI scores for both the original hypothesis $h$ and the refined hypothesis $h'$ and report the NLI score difference $\Delta = s(h') - s(h)$. For Roberta-large-mnli, the NLI score is the entailment class probability; for Flan-T5-xxl and Qwen2.5-7B-Instruct, we restrict model outputs to exactly one token,
yes or
no, and compute NLI scores
based on the token-level probabilities of
yes and
no tokens:
Detailed prompts can be found in
Appendix D. The results can be found in
Table 4. We observe that Roberta-large-mnli yields the largest gain in NLI score (
), followed by Flan-T5-xxl (
) and Qwen2.5-7B-Instruct (
). While absolute scores are not directly comparable across verifiers because their scoring mechanisms differ, the uniformly positive NLI score differences suggest that the attribution-guided hypothesis refinement is robust to the choice of verifier.
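For the generative verifiers, one common way to turn the two answer-token probabilities into a single score is to renormalize the probability of yes over the two allowed answers. The sketch below assumes this normalization, which may differ from the exact expression used in the experiments; the function names and the example logits are hypothetical.

```python
import math


def yes_no_score(logit_yes: float, logit_no: float) -> float:
    """Probability of `yes` renormalized over {yes, no}.

    Equivalent to p(yes) / (p(yes) + p(no)) computed from the full-vocabulary
    softmax, because the shared normalizer cancels out.
    """
    shift = max(logit_yes, logit_no)  # subtract the max for numerical stability
    exp_yes = math.exp(logit_yes - shift)
    exp_no = math.exp(logit_no - shift)
    return exp_yes / (exp_yes + exp_no)


def score_difference(score_original: float, score_refined: float) -> float:
    """NLI score difference: positive when the refined hypothesis scores higher."""
    return score_refined - score_original


# Made-up logits for one sample before and after refinement.
s_original = yes_no_score(logit_yes=-1.0, logit_no=1.0)
s_refined = yes_no_score(logit_yes=2.0, logit_no=0.0)
gain = score_difference(s_original, s_refined)
```

Because the score depends only on the difference between the two logits, it remains comparable across samples even when the verifier places most of its probability mass on other tokens.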
In our setting, hypothesis refinement is implemented as attribution-guided span infilling rather than as a rewrite of the entire hypothesis. To quantify the revision locality of the hypothesis refinement module, we compute the average changed-token ratio, defined as the total number of infilled tokens normalized by the token length of the original hypothesis. On the evaluation dataset, we obtain an average changed-token ratio of 0.232. Moreover, because of the text infilling mechanism, all unmasked tokens remain unchanged during hypothesis refinement. Together, these observations support the revision locality of the hypothesis refinement module.
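The changed-token ratio defined above can be computed as in the following sketch. It assumes that hypotheses and infilled spans are already tokenized into lists; the tokenizer, the function names, and the toy sample are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def changed_token_ratio(original_tokens: Sequence[str],
                        infilled_spans: Sequence[Sequence[str]]) -> float:
    """Total number of infilled tokens divided by the original hypothesis length."""
    if not original_tokens:
        raise ValueError("the original hypothesis must contain at least one token")
    n_infilled = sum(len(span) for span in infilled_spans)
    return n_infilled / len(original_tokens)


def average_changed_token_ratio(
        samples: List[Tuple[Sequence[str], Sequence[Sequence[str]]]]) -> float:
    """Mean changed-token ratio over (original tokens, infilled spans) pairs."""
    if not samples:
        raise ValueError("at least one sample is required")
    ratios = [changed_token_ratio(tokens, spans) for tokens, spans in samples]
    return sum(ratios) / len(ratios)


# Toy sample: a 10-token hypothesis with two infilled spans (2 + 1 tokens).
toy_tokens = "the catalyst raises the yield of the product at reflux".split()
toy_spans = [["palladium", "catalyst"], ["selectivity"]]
toy_ratio = changed_token_ratio(toy_tokens, toy_spans)
```

A low ratio means that most of the original hypothesis is carried over unchanged, which is the sense in which the refinement is local.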
4.3. Component-Wise Analysis of the Framework (RQ2)
To investigate how each component contributes to the overall performance, we conduct an ablation study of components including pre-retrieval processing, vector–entity joint retrieval, and post-retrieval filtering under the same experimental settings as in RQ1.
For hypothesis validation, we report the experimental results in
Table 5. Removing either vector retrieval or entity retrieval leads to the most severe performance drops. In particular, removing vector retrieval reduces accuracy from 97.09% to 92.73% relative to the full framework, confirming that dense semantic similarity retrieval is a fundamental component for capturing high-level contextual relevance: while entity retrieval provides strong domain specificity, relying solely on structured entities limits the model’s ability to retrieve semantically related information. Removing entity-based retrieval also causes a notable accuracy drop, from 97.09% to 94.10%, suggesting that the framework benefits from the knowledge retrieved through entity retrieval. Removing pre- or post-retrieval processing causes a smaller drop, but both components still contribute to the effectiveness of the proposed framework.
For hypothesis refinement (
Table 6), we observe a trend consistent with hypothesis validation: removing core retrieval modules such as vector retrieval leads to a larger performance drop than removing quality-control components. Additionally,
Table 6 indicates that retrieval settings affect not only span-level similarity metrics (BLEU and BERTScore) but also the semantic alignment of the completed hypothesis with the conclusion (NLI score).
To evaluate the impact of the NLI backbone on the overall framework, we conduct an ablation study using different decoder-only LLMs on the CRNLI test set. We restrict this comparison to decoder-only models because encoder-only and encoder–decoder architectures are constrained by input length and are less suitable for integrating external knowledge. As illustrated in
Table 7, the chemistry-adapted LANLI (without RAG) achieves 91.23% accuracy, outperforming the general-purpose models. After incorporating external chemistry knowledge, all backbones improve consistently across the three evaluation metrics, with LANLI achieving the best performance. This consistent improvement suggests that hypothesis validation (NLI) benefits from the Modular RAG mechanism regardless of the NLI backbone, indicating the effectiveness of the proposed scientific workflow.
Next, to investigate the effectiveness of the post-retrieval evaluator, we employ several alternative evaluation LLMs using the same prompt illustrated in
Figure A2c, with all other retrieval settings unchanged. Here, we intentionally select a set of instruction-tuned LLMs with comparable capacity to conduct a controlled ablation study. Specifically, we report the chunk retention rate (average number of kept chunks out of retrieved candidates) and the downstream hypothesis validation performance with respect to the post-retrieval evaluator. The results are summarized in
Table 8. Despite considerable variation in chunk retention rate, the downstream hypothesis validation performance remains consistently high, with only slight fluctuations. This indicates that the proposed framework is robust to the choice of post-retrieval evaluator among models of comparable capacity.
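As an illustration of the retention statistic, the sketch below computes the chunk retention rate as the mean per-query fraction of retrieved chunks that the evaluator keeps. Averaging per query rather than pooling all chunks is an assumption, and the function name and counts are made up for the example.

```python
from typing import List, Tuple


def chunk_retention_rate(counts: List[Tuple[int, int]]) -> float:
    """Mean fraction of kept chunks, given (kept, retrieved) counts per query."""
    fractions = []
    for kept, retrieved in counts:
        if retrieved <= 0 or not 0 <= kept <= retrieved:
            raise ValueError("expected 0 <= kept <= retrieved and retrieved > 0")
        fractions.append(kept / retrieved)
    if not fractions:
        raise ValueError("at least one query is required")
    return sum(fractions) / len(fractions)


# Toy counts for two queries: 2 of 4 chunks kept, then 3 of 3 kept.
toy_retention = chunk_retention_rate([(2, 4), (3, 3)])
```

Reporting this rate alongside downstream accuracy shows whether an evaluator that discards more chunks also changes the validation outcome.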
Furthermore, we conduct ablation experiments on the attribution method that guides the targeted hypothesis revision. Here, we compare SHAP with two representative feature attribution methods: integrated gradients and attention weights. Following the protocol of Figure 3 (left), we mask the top 10 high-attribution words ranked by each of the three attribution methods; the results are summarized in Figure 4. We observe a consistent rank-wise decay in the NLI score drop across all three attribution methods, which indicates their effectiveness as attribution methods. Notably, SHAP exhibits a higher NLI score drop (
) when the top-ranked high-attribution word is masked. It clearly outperforms integrated gradients and attention weights, suggesting that SHAP is more reliable for identifying high-contribution words and providing accurate guidance for targeted hypothesis revision.