A Modular Framework for Automated Hypothesis Validation and Refinement in Scientific Research

Chen, Chenhao; Masuda, Taiga; Hirakawa, Tsubasa; Yamashita, Takayoshi; Fujiyoshi, Hironobu

doi:10.3390/info17030244

by Chenhao Chen ¹,*, Taiga Masuda ¹, Tsubasa Hirakawa ², Takayoshi Yamashita ³ and Hironobu Fujiyoshi ¹

¹ Department of Artificial Intelligence and Robotics, Chubu University, Kasugai-shi 487-8501, Aichi, Japan
² Center for Mathematical Science and Artificial Intelligence, Chubu University, Kasugai-shi 487-8501, Aichi, Japan
³ Department of Computer Science, Chubu University, Kasugai-shi 487-8501, Aichi, Japan
* Author to whom correspondence should be addressed.

Information 2026, 17(3), 244; https://doi.org/10.3390/info17030244

Submission received: 15 January 2026 / Revised: 14 February 2026 / Accepted: 26 February 2026 / Published: 2 March 2026

(This article belongs to the Section Information Theory and Methodology)

Abstract

Scientific research typically follows an iterative cycle where hypotheses are proposed, validated against experimental conclusions, and refined accordingly. While recent advances in large language models (LLMs) have enabled significant progress in automating individual stages of this process, existing systems are typically developed as standalone solutions, making it difficult to coordinate multiple research activities within a coherent research workflow. In this study, we present a modular framework for automated hypothesis validation and refinement in scientific research. Rather than introducing new task-specific models, the framework integrates established techniques, including natural language inference (NLI)-based hypothesis validation, attribution-guided hypothesis refinement, and retrieval-augmented generation (RAG)-based external evidence retrieval, into a unified and controllable workflow. We evaluate the proposed framework on scientific texts in the chemistry domain to assess its applicability in practical scientific research scenarios. Extensive experiments demonstrate the effectiveness of the proposed framework and suggest that it produces reliable intermediate signals that enhance transparency and traceability throughout hypothesis validation and refinement. Our work offers a modular solution for deploying LLM-based systems in scientific research workflows.

Keywords: hypothesis validation; hypothesis refinement; natural language inference; retrieval-augmented generation; LLM-based scientific workflow

1. Introduction

Traditionally, scientific discovery requires human researchers to collect background knowledge, draft initial hypotheses, construct evaluation procedures, assess evidence, and refine their hypotheses accordingly. However, this iterative process of hypothesis formulation, validation, and refinement is inherently limited by human researchers’ ingenuity [1]. As the scale of domain-specific knowledge expands continuously, researchers face increasing challenges in efficiently advancing this scientific research workflow. Early efforts to automate scientific discovery focused on providing computer-assisted support for specific stages of the scientific process, such as Automated Mathematician [2,3] and DENDRAL [4]. With the rapid development of artificial intelligence (AI), large language models (LLMs) have demonstrated remarkable capabilities in understanding and reasoning over scientific texts, which has introduced a paradigm shift to individual stages of scientific research workflows. For example, LLMs have been applied to understanding the scientific literature [5,6], hypothesis generation [1], and experimental planning [7]. Despite these advances, existing approaches for scientific research remain task-oriented. This means most systems are independently designed for specific research activities, without explicitly modeling the interactions and dependencies among multiple stages within a coherent research workflow. Additionally, many LLM-based systems adopt end-to-end architectures that restrict user interaction and provide limited transparency regarding the decision-making and reasoning process.

To address these limitations, in this work, we present a modular framework for automated hypothesis validation and refinement in scientific research. Rather than introducing new task-specific models, the framework systematically integrates hypothesis validation and hypothesis refinement as two interconnected yet decoupled components within a unified scientific research workflow. An overview of the proposed framework is illustrated in Figure 1, where all components are instantiated using existing models and techniques. As shown in the figure, the framework is organized around a common scientific reasoning cycle and consists of three core components: hypothesis validation, hypothesis refinement, and external evidence retrieval.

Specifically, given a hypothesis proposed by a researcher and a conclusion drawn from experiments, the hypothesis validation module assesses the logical consistency between the hypothesis and the conclusion using a natural language inference (NLI) model to provide validation decisions with confidence scores for downstream reasoning. If the hypothesis is assessed as unsupportive (which also covers contradictory or inconclusive hypotheses), the hypothesis refinement module is activated to identify and revise hypothesis spans that contribute to the semantic inconsistency. This refinement process is guided by attribution information derived from the NLI model, enabling targeted edits rather than unconstrained full-text rewriting. To support both the validation and refinement process with external scientific knowledge, the framework incorporates an evidence retrieval module based on retrieval-augmented generation (RAG) [8]. This module serves as an external knowledge base to provide relevant domain-specific evidence that can be leveraged during the process of hypothesis validation and refinement. The proposed framework offers a modular solution for deploying LLM-based systems in scientific research workflows. It is designed to assist researchers by providing intermediate research signals such as validation decisions, attribution information, and retrieved external knowledge, enabling researchers to maintain control and oversight over the research workflow.

We conduct extensive experiments and investigate the following three research questions:

(RQ1) Does the proposed framework improve end-to-end hypothesis validation and refinement compared to standalone baselines?
(RQ2) How does each component contribute to the performance of hypothesis validation and refinement?
(RQ3) How do intermediate signals produced by the framework support reliability and interpretability in practical scientific reasoning?

We evaluate the proposed framework on scientific texts in the chemistry domain, chosen for its rich domain-specific terminology and structured experimental reasoning, both of which are typical of practical scientific research scenarios. The experimental results support the effectiveness and practical applicability of the proposed framework.

The main contributions of this work are summarized as follows:

We propose a modular scientific reasoning framework that systematically organizes hypothesis validation, hypothesis refinement, and external evidence retrieval into a unified and controllable workflow.
We introduce a validation–refinement loop anchored by NLI confidence and attribution signals, enabling targeted local hypothesis revision instead of unconstrained end-to-end rewriting.
We conduct a comprehensive evaluation that demonstrates the effectiveness, transparency, and traceability of the proposed framework.

2. Related Works

2.1. Automated Hypothesis Generation

Recent studies [9,10] have explored automated hypothesis generation, which commonly follows an iterative refinement paradigm, a loop of hypothesis formulation, validation, and refinement. To improve hypothesis quality, existing approaches incorporate external feedback such as evaluation metrics, factual verification, and agent reviews to guide the refinement process. For instance, HypoGeniC [11] evaluates and updates hypotheses using accuracy; KG-CoI [12] integrates knowledge graphs to verify factual consistency; and ResearchAgent [13] coordinates multiple LLM-based reviewing agents whose evaluation criteria are elicited from human judgments via LLM prompting. However, these approaches lack fine-grained control over the revision process, relying on full-sentence rewriting rather than targeted edits. Additionally, they provide limited semantic transparency; while hypotheses are improved through iterative refinement, the semantic rationale behind each revision often remains unclear.

2.2. RAG Paradigms

Ref. [14] categorizes the RAG research paradigm into three stages: Naive RAG, Advanced RAG, and Modular RAG. Naive RAG represents the earliest RAG methodology, following a traditional process that includes indexing, retrieval, and generation, which is also characterized as a “Retrieve-Read” framework [15]. However, Naive RAG suffers from several limitations such as sensitivity to retrieval noise. As a result, Advanced RAG introduces specific improvements to overcome these limitations. For example, Advanced RAG optimizes the indexing structure by enhancing data granularity, adding metadata, mixed retrieval, etc. [16,17]. Recently, Modular RAG has been proposed as a flexible and extensible architectural framework that decomposes the RAG pipeline into loosely coupled functional modules. Compared to Naive RAG and Advanced RAG, innovations like restructured RAG modules [18] and rearranged RAG pipelines [19] have been introduced to tackle specific challenges.

3. Framework Overview

The proposed framework aims to provide a coherent and controllable reasoning process for hypothesis validation and refinement in scientific research. Rather than focusing on optimizing individual models, the framework emphasizes the organization and coordination of multiple reasoning components within a unified research workflow. As shown in Figure 1, the proposed framework models the iterative process of scientific discovery where the initial hypothesis is continuously validated and refined based on experimental conclusions. From a macro perspective, the framework adopts a modular architecture composed of three core components: hypothesis validation, hypothesis refinement, and external evidence retrieval. Each component is implemented as an independent module with clearly defined inputs and outputs, making the overall workflow flexible and interpretable. The framework takes as input a hypothesis proposed by a researcher and a conclusion derived from experimental results. Its primary objective is to evaluate the logical relationship between these inputs and revise contradictory spans within the hypothesis in a controlled manner. The following subsections describe each component in detail.

3.1. Hypothesis Validation as the Decision Anchor

The hypothesis validation module serves as the central decision anchor within the proposed framework. It is responsible for assessing whether a given hypothesis is logically supported by an experimental conclusion, thereby establishing a reliable basis for subsequent reasoning stages. Given a hypothesis–conclusion pair $(h, c)$, the hypothesis validation module assesses their logical relationship and estimates a probability distribution $p_{val}$ over two validation classes, supportive and unsupportive:

$$p_{val} := \{ p(y \mid h, c) \}_{y \in \{\text{supportive}, \text{unsupportive}\}} \tag{1}$$

Here, supportive corresponds to entailment in standard NLI tasks, while unsupportive covers contradiction or, depending on the NLI setting, neutral cases where the hypothesis is not sufficiently supported by the conclusion. This validation process is performed using an NLI model, which is well-suited for capturing the fine-grained semantic consistency between two statements.

The validation results include the predicted entailment relationship and a corresponding confidence score. Rather than functioning as a standalone classifier, this module provides validation signals that guide the overall research workflow. The predicted validation confidence is compared against a pre-defined threshold, as illustrated in Figure 1. When the confidence score meets or exceeds this threshold, the experimental conclusion is considered sufficiently supportive of the hypothesis, and no further refinement is required. Otherwise, the framework proceeds to the hypothesis refinement stage. In addition to the validation results, the hypothesis validation module produces fine-grained attribution information that identifies the parts of the hypothesis contributing most to the entailment classification. This information is fed back to researchers and subsequently used to guide targeted hypothesis refinement.
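The thresholding rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `ValidationResult` and `decide` are hypothetical, the default threshold of 0.49 is the value reported in Section 4.1.2, and treating a score exactly equal to the threshold as supportive follows the strict inequality used in Algorithm 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationResult:
    """Hypothetical container for the control signal of the workflow."""
    label: str
    p_supportive: float
    needs_refinement: bool


def decide(p_supportive: float, tau: float = 0.49) -> ValidationResult:
    """Map the supportive-class probability to a workflow decision.

    Refinement is triggered only when p_supportive < tau, mirroring the
    condition in Algorithm 1. `tau` defaults to the value in Section 4.1.2.
    """
    if not 0.0 <= p_supportive <= 1.0:
        raise ValueError("p_supportive must be a probability in [0, 1]")
    supportive = p_supportive >= tau
    return ValidationResult(
        label="supportive" if supportive else "unsupportive",
        p_supportive=p_supportive,
        needs_refinement=not supportive,
    )
```

Keeping the decision in a small pure function makes the control signal easy to log and audit alongside the attribution information.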

3.2. Hypothesis Refinement via Attribution-Guided Local Editing

When a hypothesis is assessed as unsupportive by the validation module, the proposed framework activates the hypothesis refinement module to revise the hypothesis in a controlled and interpretable manner. The process of hypothesis refinement builds upon a pipeline of segmentation, alignment, tagging, and revision that has been systematically investigated in our prior work [20]. In this study, we integrate it as an established hypothesis refinement backbone. The input hypothesis is first decomposed into semantically coherent clauses. Attribution information produced by the hypothesis validation module is then aligned with these clauses to identify key spans within the original hypothesis that contribute most to the semantic inconsistency. Based on this alignment, these spans are tagged as refinement candidates while the remaining contents are preserved unchanged. This attribution-guided mechanism is more semantically transparent than rewriting the entire hypothesis using LLMs. Subsequently, the tagged spans are masked and revised through context-aware text infilling.

Notably, the hypothesis refinement stage is not designed to force entailment between the hypothesis and the experimental conclusion. Instead, its primary goal is to perform controllable revisions on the hypothesis which not only revise local contradictory spans inconsistent with the conclusion, but also explicitly reveal the source of contradiction, enabling researchers to reassess the original hypothesis and providing transparent evidence for subsequent decision-making (e.g., formulating alternative hypotheses, redesigning experiments, etc.).

3.3. RAG-Based External Evidence Retrieval

Scientific hypotheses and conclusions often include dense domain-specific terminology that requires critical background knowledge for scientific reasoning. For example, consider a hypothesis claiming that a palladium-based catalyst exhibits enhanced catalytic activity under mild oxidative conditions and an experimental conclusion that reports an increased turnover frequency after introducing molecular oxygen. Analyzing whether the hypothesis can be supported by such a conclusion requires background knowledge on aspects including catalytic mechanisms, the role of oxidizing agents, and experimental conditions.

To address this problem, the proposed framework incorporates a RAG-based external evidence retrieval module as an auxiliary component. Notably, this RAG module serves as an external knowledge provider that supplies relevant domain-specific knowledge to support hypothesis validation and refinement in a transparent and controllable manner. By explicitly separating evidence retrieval from decision-making, the framework ensures that the results of hypothesis validation and refinement remain anchored to the original hypothesis and conclusion, while retrieved evidence functions as an optional and interpretable source of contextual support.

As shown in Figure 1, we implement the RAG module following the paradigm of Modular RAG [14], which consists of four functional submodules: routing, pre-retrieval, search, and post-retrieval. The routing module serves as a task-level orchestration component. Given an input query composed of a hypothesis $h$, a conclusion $c$, and an instruction $I$, the routing module identifies the task context (e.g., hypothesis validation or hypothesis refinement) and activates the task-specific model. The pre-retrieval module aims to enhance query quality before retrieval. Specifically, it performs query rewriting and expansion on the original query. For query rewriting, we generate the rewritten query $(h', c')$ using LLMs:

$$(h', c') \leftarrow \mathrm{QueryRewrite}(h, c; I_{rewrite}). \tag{2}$$

For query expansion, inspired by hypothetical document embeddings (HyDE) [16], we generate a hypothetical conclusion $\hat{c'}$ based on the rewritten hypothesis $h'$ as an auxiliary retrieval probe:

$$\hat{c'} \leftarrow \mathrm{QueryExpand}(h'; I_{expand}). \tag{3}$$

Here, neither the generated $(h', c')$ nor $\hat{c'}$ is treated as factual evidence; both serve only to explore the retrieval space. The search module bridges refined user queries and external knowledge bases. This module employs a vector–entity joint retrieval strategy that combines semantic similarity and entity relevance, enabling precise and context-aware information retrieval. The post-retrieval module is responsible for assessing the relevance and potential usefulness of the retrieved chunks $E$, which is performed using an evaluation LLM:

$$\tilde{E} \leftarrow \mathrm{Evaluate}(E; I_{eval}). \tag{4}$$

To provide an intuitive overview, we summarize the overall workflow of the proposed framework in Algorithm 1. Given an input hypothesis–conclusion pair, the framework first generates auxiliary representations to facilitate evidence retrieval from external knowledge bases. The retrieved evidence is then incorporated as contextual support for hypothesis validation (and hypothesis refinement when necessary), enabling the validation module to assess semantic consistency while remaining anchored to the original inputs. The validation result serves as the control signal of the framework. When the hypothesis is assessed as supportive, the workflow terminates without hypothesis refinement and returns validation results as well as attribution information and retrieved evidence as interpretable feedback to researchers. Otherwise, the framework activates the refinement module and leverages attribution information and retrieved evidence to guide targeted, transparent hypothesis revision. All models and prompt instructions used in the proposed framework are introduced in Appendix A, and we report the statistics of the external knowledge base in Appendix B.

Algorithm 1 High-level workflow of the proposed framework (hypothesis validation and refinement driven by RAG retrieval)

Require: hypothesis $h$, conclusion $c$, validation module $M_{val}$ with threshold $\tau$, refinement module $M_{ref}$
Ensure: validation result $y$, validation distribution $p_{val}$, attribution information $a$, revised hypothesis $\tilde{h}$, retrieved chunks $\tilde{E}$

1: $(h', c') \leftarrow \mathrm{QueryRewrite}(h, c)$ {Pre-retrieval processing}
2: $\hat{c'} \leftarrow \mathrm{QueryExpand}(h')$
3: $Q \leftarrow \{h', c', \hat{c'}\}$
4: $E \leftarrow \emptyset$
5: for each query $q \in Q$ do
6:     $E \leftarrow E \cup \mathrm{VectorEntityRetrieval}(q)$ {External evidence retrieval}
7: end for
8: $\tilde{E} \leftarrow \mathrm{RelevanceFilter}(E)$ {Post-retrieval processing}
9: $(y, p_{val}, a) \leftarrow M_{val}(h, c \mid \tilde{E})$ {Hypothesis validation}
10: $O \leftarrow \{y, p_{val}, a, \tilde{E}\}$
11: if $p_{val}(\text{supportive}) < \tau$ then
12:     $\tilde{h} \leftarrow M_{ref}(h, c \mid a, \tilde{E})$ {Hypothesis refinement}
13:     $O \leftarrow O \cup \{\tilde{h}\}$
14: end if
15: return $O$
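The control flow of Algorithm 1 can be mirrored as a short orchestration function. The sketch below is illustrative only: every module is passed in as a plain callable, and the parameter names (`rewrite`, `expand`, `retrieve`, `relevance_filter`, `validate`, `refine`) and the dictionary-based output are assumptions made for this example, not interfaces described in the paper. Evidence is de-duplicated in retrieval order to approximate the set union in line 6.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple


def run_workflow(
    h: str,
    c: str,
    *,
    rewrite: Callable[[str, str], Tuple[str, str]],
    expand: Callable[[str], str],
    retrieve: Callable[[str], Sequence[str]],
    relevance_filter: Callable[[List[str]], List[str]],
    validate: Callable[[str, str, List[str]], Tuple[str, Dict[str, float], Any]],
    refine: Callable[[str, str, Any, List[str]], str],
    tau: float = 0.49,
) -> Dict[str, Any]:
    """Hypothetical orchestration of Algorithm 1 with pluggable modules."""
    # Lines 1-3: pre-retrieval processing builds the query set Q.
    h_rw, c_rw = rewrite(h, c)
    c_hyp = expand(h_rw)
    queries = [h_rw, c_rw, c_hyp]

    # Lines 4-7: union of retrieved chunks, keeping first-seen order.
    evidence: List[str] = []
    for q in queries:
        for chunk in retrieve(q):
            if chunk not in evidence:
                evidence.append(chunk)

    # Line 8: post-retrieval relevance filtering.
    kept = relevance_filter(evidence)

    # Lines 9-10: validation is run on the ORIGINAL (h, c), with evidence as context.
    y, p_val, attribution = validate(h, c, kept)
    out: Dict[str, Any] = {
        "y": y,
        "p_val": p_val,
        "attribution": attribution,
        "evidence": kept,
    }

    # Lines 11-14: refinement is triggered only below the threshold.
    if p_val["supportive"] < tau:
        out["revised_hypothesis"] = refine(h, c, attribution, kept)
    return out
```

Passing the modules as callables keeps each stage swappable, which matches the modular intent of the framework; in practice each callable would wrap an LLM, a retriever, or an NLI model.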

4. Experimental Evaluation

This experimental evaluation is designed to systematically examine the effectiveness and practical implications of the proposed framework. Rather than focusing solely on performance improvements of individual models, the experiments aim to assess whether scientific reasoning benefits from organizing hypothesis validation, hypothesis refinement, and external evidence retrieval into a unified workflow. Specifically, we investigate the following three research questions:

(RQ1) Does the proposed framework improve end-to-end hypothesis validation and refinement compared to standalone baselines?
(RQ2) How does each component contribute to the performance of hypothesis validation and refinement?
(RQ3) How do intermediate signals produced by the framework support reliability and interpretability in practical scientific reasoning?

All experiments are conducted on scientific texts in the chemistry domain, chosen for its rich domain-specific terminology and structured experimental reasoning, both of which are typical of practical scientific research scenarios.

4.1. Experimental Settings

4.1.1. Evaluation Datasets

Since hypothesis validation can be formulated as an NLI problem, we select CRNLI [21], a structured NLI corpus in the chemistry domain, to evaluate the performance of the hypothesis validation module. For hypothesis refinement, we employ the datasets proposed in [20], which are built for hypothesis revision via text infilling in the general and chemistry domains. Details of these datasets are provided in Appendix C. In this section, all experiments are conducted on the respective test sets, which comprise 3812 instances for hypothesis validation and 1809 instances for hypothesis refinement.

4.1.2. Implementation Details

For hypothesis validation, the decision threshold $\tau$ of the NLI confidence score is set to 0.49, determined by maximizing Youden’s index on the CRNLI validation set to balance sensitivity and specificity.
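Youden's index is $J = \text{sensitivity} + \text{specificity} - 1$, which equals the true positive rate minus the false positive rate. A minimal sketch of threshold selection by maximizing $J$ is shown below, assuming binary labels (1 for supportive) and a "score at or above threshold means supportive" rule; the function name `youden_threshold` and the exhaustive scan over observed scores are choices made for this example, not details taken from the paper.

```python
from typing import Optional, Sequence, Tuple


def youden_threshold(
    scores: Sequence[float], labels: Sequence[int]
) -> Tuple[float, float]:
    """Return (threshold, J) maximizing Youden's J = TPR - FPR.

    A sample is predicted positive when score >= threshold. Candidate
    thresholds are the distinct observed scores; ties keep the lowest one.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    best_t: Optional[float] = None
    best_j = float("-inf")
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_t, best_j = t, j
    assert best_t is not None
    return best_t, best_j
```

The scan is quadratic in the number of samples, which is acceptable for a validation set of a few thousand instances; an ROC-curve routine from a numerical library would give the same result more efficiently.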

For hypothesis refinement, feature attribution scores are first normalized, and tokens with contribution scores below $\delta$ (default $\delta = 0.02$) are filtered out to remove negligible signals. Among the remaining tokens, the top $m$ tokens (default $m = 3$) are selected for each hypothesis clause as refinement anchors. To preserve local semantic coherence during infilling, we construct editable spans centered at each selected token using a symmetric window of size $w$ (default $w = 5$). Overlapping spans are merged to avoid redundant edits and ensure structural consistency.
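The anchor selection and span construction steps above can be sketched as follows. Several details are not specified in the text and are assumed here for illustration: scores are normalized by dividing absolute values by their sum, the window size $w$ is read as the total span width (so each side receives $(w - 1) / 2$ tokens), windows are clipped to the hypothesis boundaries rather than to clause boundaries, and only strictly overlapping spans are merged. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)


def normalize(scores: Sequence[float]) -> List[float]:
    """Assumed normalization: absolute attribution divided by the total."""
    total = sum(abs(s) for s in scores)
    if total == 0:
        return [0.0] * len(scores)
    return [abs(s) / total for s in scores]


def select_spans(
    scores: Sequence[float],
    clauses: Sequence[Span],
    delta: float = 0.02,
    m: int = 3,
    w: int = 5,
) -> List[Span]:
    """Pick up to m anchors per clause and expand them into merged edit spans."""
    norm = normalize(scores)
    n = len(scores)
    half = (w - 1) // 2  # assumption: w is the full window width

    anchors: List[int] = []
    for start, end in clauses:
        # Drop negligible tokens, then keep the m strongest in this clause.
        candidates = [i for i in range(start, end) if norm[i] >= delta]
        candidates.sort(key=lambda i: norm[i], reverse=True)
        anchors.extend(candidates[:m])

    # Symmetric window around each anchor, clipped to the hypothesis.
    spans = sorted((max(0, i - half), min(n, i + half + 1)) for i in anchors)

    merged: List[Span] = []
    for s, e in spans:
        if merged and s < merged[-1][1]:  # strict overlap with previous span
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged
```

For a 12-token hypothesis with two clauses and anchors at positions 2, 9, and 10, the windows around 9 and 10 overlap and collapse into a single editable span, so the infilling model receives two masks rather than three.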

For retrieval-augmented settings, the external evidence retrieval module returns the top $k$ (default $k = 5$) most relevant chunks per query. The same external knowledge base is shared across all retrieval-based methods to ensure fair comparison.

4.1.3. Evaluation Metrics

We employ accuracy, F1-score, and AUC to measure the performance of hypothesis validation. For hypothesis refinement, we evaluate the revision quality using several metrics. We report BLEU-4 [22] and ROUGE-L [23] to measure token-level overlap between the generated infillings and the ground truth, and BERTScore [24] to assess contextual semantic similarity. BLEU emphasizes n-gram precision, ROUGE-L captures structural overlap via the longest common subsequence, and BERTScore measures embedding-based semantic alignment. We also report perplexity (PPL) to assess the textual fluency of the overall revised hypotheses and the NLI score to quantify their entailment relationship with the conclusion. Finally, we introduce a binary evaluation metric named Span Completion Rate (SCR) to assess the model’s ability to produce outputs that conform to the expected output format:

$$\mathrm{SCR} = \begin{cases} 1, & \text{if } \#G = \#M \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

Here, $\#M$ denotes the number of masked spans in a hypothesis and $\#G$ the number of generated infillings; the model is expected to generate infillings that correspond one-to-one with the masked spans. To assess the effectiveness of feature attribution, we report MoRF (most relevant first) and LeRF (least relevant first) [25] by progressively masking top- and low-ranked tokens and tracking the change in NLI confidence. To evaluate the reliability of intermediate results, we employ the expected calibration error (ECE) and the Brier score (BS) to assess the calibration quality of the validation confidence scores.
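The format and calibration metrics above are simple enough to state in code. The sketch below is a plain-Python illustration under stated assumptions: the ECE variant uses equal-width bins over the confidence of the predicted class, with 10 bins and a 0.5 argmax rule, none of which are specified in the text, and all function names are hypothetical.

```python
from typing import List, Sequence


def span_completion_rate(n_generated: int, n_masked: int) -> int:
    """Eq. (5): 1 if the number of infillings matches the number of masks."""
    return 1 if n_generated == n_masked else 0


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared error between positive-class probability and 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binary ECE: sample-weighted |accuracy - confidence| over confidence bins.

    Assumptions: predicted class is positive when p >= 0.5, confidence is
    max(p, 1 - p), and bins are equal-width over [0, 1].
    """
    bin_conf: List[List[float]] = [[] for _ in range(n_bins)]
    bin_hit: List[List[int]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to last bin
        bin_conf[idx].append(conf)
        bin_hit[idx].append(1 if pred == y else 0)

    n = len(probs)
    ece = 0.0
    for confs, hits in zip(bin_conf, bin_hit):
        if confs:
            acc = sum(hits) / len(hits)
            avg_conf = sum(confs) / len(confs)
            ece += (len(confs) / n) * abs(acc - avg_conf)
    return ece
```

Lower values are better for both calibration metrics: a perfectly confident and always-correct predictor scores 0 on each.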

4.2. Comparison of the Proposed Workflow with Baselines (RQ1)

We evaluate the proposed workflow by comparing several representative configurations that progressively incorporate external evidence retrieval and modular retrieval controls. This comparison aims to quantify the system-level benefit of organizing hypothesis validation, refinement, and evidence retrieval as a unified and controllable pipeline rather than introducing a new task-specific model.

For hypothesis validation, we compare four workflow configurations: standalone NLI and three retrieval-augmented variants (Naive RAG, Advanced RAG, and our Modular RAG setting). As shown in Table 1, NLI with Naive RAG yields only marginal improvements over the NLI-only baseline, suggesting that simply incorporating external knowledge is not sufficient for chemical NLI when evidence quality is uncontrolled. In contrast, Advanced RAG shows a clearer performance improvement, indicating that stronger indexing and retrieval design can partially mitigate this problem. Our Modular RAG-based framework achieves the best performance across all metrics, with 97.09% accuracy, 97.28% F1-score, and 99.28% AUC, outperforming the NLI-only baseline by 5.86, 5.71, and 2.3 percentage points, respectively. These results suggest that chemical NLI benefits from the retrieval quality control and vector–entity joint retrieval in our Modular RAG-based framework, which helps yield more reliable validation decisions. The confusion matrix and ROC curve of this evaluation are illustrated in Figure 2.

The proposed hypothesis refinement module follows the context-aware text infilling method introduced in [20]. As with hypothesis validation, we evaluate four text infilling configurations, as shown in Table 2 and Table 3. Table 2 reports token- and semantic-level similarity between the generated infillings and the ground truth spans. Here, BERTScore is computed using SciBERT [26]. Interestingly, Naive RAG does not improve over the infill-only setting, indicating that directly incorporating external knowledge may introduce irrelevant or noisy cues that harm span infilling. Advanced RAG and our Modular RAG-based framework both show substantial improvements. In particular, our framework achieves the best performance (48.43 BLEU-4, 48.8 ROUGE-L, and 0.9258 BERTScore), suggesting that it can reproduce the lexical structure of the ground truth infillings while maintaining semantic consistency.

Additionally, we evaluate the textual fluency (PPL) of completed hypotheses using a different evaluation model: Phi-3.5-mini-instruct [27]. As illustrated in Table 3, compared to the original hypothesis that serves as a baseline, Naive RAG yields the lowest PPL of 10.64, indicating that the textual fluency of completed hypotheses cannot benefit from incorporating external knowledge. In addition to BERTScore, which evaluates the span-level similarity, we employ LANLI [21] to compute the NLI score, which assesses whether the completed hypotheses are semantically aligned with the conclusion. Since all hypothesis–conclusion pairs in the hypothesis revision dataset are labeled as entailment, the test set achieves an average NLI score of 0.832, as shown in Table 3, which serves as an upper bound reference. It drops to 0.5357 when high-attribution spans within hypotheses are masked. We observe that all infilling variants recover the score, demonstrating the effectiveness of hypothesis refinement via span infilling. In particular, our Modular RAG-based framework achieves the highest NLI score of 0.8292, which approaches the original score (0.832), indicating that the revised hypotheses better align with the conclusions. Finally, we report SCR to quantify the models’ ability to produce outputs that conform to the expected output format (e.g., a well-formed set of infillings that matches the number of masked spans and aligns one-to-one with each corresponding mask). As illustrated in Table 3, SCR is near-saturated across all configurations, indicating that introducing external knowledge has little influence on SCR and all baselines can reliably follow the expected output format.

To assess the effectiveness of the attribution method (SHAP [28]) used in our framework, we conduct a word masking-based faithfulness evaluation on the hypothesis revision dataset. Specifically, we individually mask each of the top 10 high-attribution words ranked by SHAP and measure the NLI score drop $\Delta$:

$$\Delta = S_{orig} - S_{mask}, \tag{6}$$

where $S_{orig}$ denotes the baseline NLI score without masking, and $S_{mask}$ denotes the NLI score after masking. Since the hypothesis revision dataset consists only of entailment-labeled samples, the NLI score is expected to drop after keyword masking, and a larger $\Delta$ indicates that the masked word is more important for the entailment decision. The results are illustrated in Figure 3 (left). We observe that removing the top-ranked word leads to the largest NLI score decrease (mean $\Delta = 0.37$), while the drops for lower-ranked words quickly shrink toward zero. We further measure MoRF and LeRF by masking all of the top five and all of the bottom five words ranked by SHAP, respectively. As illustrated in Figure 3 (right), MoRF has a substantially larger and more dispersed $\Delta$ distribution than LeRF, whose values center tightly around 0. The consistent rank-wise decay in $\Delta$ and the clear separation between MoRF and LeRF indicate that SHAP reliably identifies the words that contribute most to the validation decision, supporting its use as the attribution signal for the subsequent refinement steps in our framework.
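The masking protocol behind Eq. (6) can be illustrated with a scorer-agnostic sketch. The NLI model is abstracted as a callable that maps a token list to a score in [0, 1]; the `[MASK]` placeholder, the function names, and the toy keyword scorer in the usage example are all assumptions made for this illustration and do not reflect the models used in the paper.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[List[str]], float]
MASK = "[MASK]"  # assumed placeholder; the actual mask token is model-specific


def mask_tokens(tokens: Sequence[str], positions: Sequence[int]) -> List[str]:
    """Replace the tokens at the given positions with the mask placeholder."""
    hidden = set(positions)
    return [MASK if i in hidden else t for i, t in enumerate(tokens)]


def score_drop(tokens: Sequence[str], positions: Sequence[int], scorer: Scorer) -> float:
    """Eq. (6): Delta = S_orig - S_mask for one masking configuration."""
    return scorer(list(tokens)) - scorer(mask_tokens(tokens, positions))


def morf_lerf(
    tokens: Sequence[str],
    attributions: Sequence[float],
    scorer: Scorer,
    k: int = 5,
) -> Tuple[float, float]:
    """Score drops after masking the k most and the k least attributed tokens."""
    order = sorted(range(len(tokens)), key=lambda i: attributions[i], reverse=True)
    return score_drop(tokens, order[:k], scorer), score_drop(tokens, order[-k:], scorer)


# Toy usage: a stand-in scorer that counts surviving keywords.
KEYWORDS = ("palladium", "oxygen")


def toy_scorer(tokens: List[str]) -> float:
    return sum(1 for kw in KEYWORDS if kw in tokens) / len(KEYWORDS)


toy_tokens = "the palladium catalyst is activated by oxygen".split()
toy_attr = [0.0, 0.6, 0.1, 0.0, 0.05, 0.0, 0.25]
morf_drop, lerf_drop = morf_lerf(toy_tokens, toy_attr, toy_scorer, k=2)
```

A faithful attribution method should yield a large MoRF drop and a LeRF drop near zero, which is the pattern the toy example reproduces by construction.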

To mitigate potential evaluator coupling, we evaluate hypothesis refinement using three independent verifiers that span different model architectures: Roberta-large-mnli [29] (encoder-only), Flan-T5-xxl [30] (encoder–decoder), and Qwen2.5-7B-Instruct [31] (decoder-only). We test the three verifiers in a zero-shot manner without any additional training or fine-tuning. Specifically, for all the contradiction-labeled samples in the CRNLI test set, we compute NLI scores for both the original hypothesis $h$ and the refined hypothesis $\tilde{h}$ and report the NLI score difference $\Delta$. For Roberta-large-mnli, we compute NLI scores using the entailment class probability; for Flan-T5-xxl and Qwen2.5-7B-Instruct, we restrict model outputs to exactly one token, yes or no, and compute the NLI score $S(h, c)$ from the token-level probabilities of the yes and no tokens:

$$S(h, c) = \frac{P(\text{yes} \mid h, c)}{P(\text{yes} \mid h, c) + P(\text{no} \mid h, c)}. \tag{7}$$

Detailed prompts can be found in Appendix D, and the results are reported in Table 4. We observe that Roberta-large-mnli yields the largest gain in NLI score ($\Delta = 48.95\%$), followed by Flan-T5-xxl ($\Delta = 33.79\%$) and Qwen2.5-7B-Instruct ($\Delta = 23.2\%$). While absolute scores are not directly comparable across verifiers due to different scoring mechanisms, the uniformly positive $\Delta$ NLI scores suggest the robustness of the attribution-guided hypothesis refinement.
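Equation (7) renormalizes the probability mass over the two answer tokens. The sketch below shows the computation from probabilities and, equivalently, from raw logits: because the softmax normalizer cancels, the score reduces to a sigmoid of the logit difference. The function names are hypothetical, and how the yes and no token logits are obtained from a particular model is left out, since it depends on the tokenizer and prompt.

```python
import math


def score_from_probs(p_yes: float, p_no: float) -> float:
    """Eq. (7): P(yes) / (P(yes) + P(no))."""
    total = p_yes + p_no
    if total <= 0.0:
        raise ValueError("P(yes) + P(no) must be positive")
    return p_yes / total


def score_from_logits(logit_yes: float, logit_no: float) -> float:
    """Same quantity from logits: sigmoid(logit_yes - logit_no).

    Written in a numerically stable form so large logit gaps do not overflow.
    """
    d = logit_yes - logit_no
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def refinement_gain(score_original: float, score_refined: float) -> float:
    """Score difference between the refined and the original hypothesis."""
    return score_refined - score_original
```

Working from logits avoids materializing the full-vocabulary softmax when only two tokens matter.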

In our setting, hypothesis refinement is implemented as attribution-guided span infilling rather than rewriting the entire hypothesis. To quantitatively evaluate the revision locality of the hypothesis refinement module, we further compute the average changed-token ratio, defined as the total number of infilled tokens normalized by the token length of the original hypothesis. On the evaluation dataset, we obtain an average changed-token ratio of 0.232. Additionally, due to the text infilling mechanism, all unmasked tokens remain unchanged during hypothesis refinement. Together, these observations support the revision locality of the hypothesis refinement module.
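The changed-token ratio can be written down directly from its definition. In this sketch the tokenization is assumed to be supplied by the caller, and the dataset-level figure is assumed to be the mean of per-hypothesis ratios; the text does not state whether the average is taken per hypothesis or over the pooled token counts, and the function names are hypothetical.

```python
from typing import List, Sequence


def changed_token_ratio(
    original_tokens: Sequence[str], infills: Sequence[Sequence[str]]
) -> float:
    """Infilled tokens divided by the token length of the original hypothesis."""
    if not original_tokens:
        raise ValueError("original hypothesis must contain at least one token")
    n_infilled = sum(len(span) for span in infills)
    return n_infilled / len(original_tokens)


def average_changed_token_ratio(ratios: Sequence[float]) -> float:
    """Assumed aggregation: unweighted mean over hypotheses."""
    return sum(ratios) / len(ratios)
```

For example, a 10-token hypothesis whose two masked spans are filled with two tokens and one token respectively has a ratio of 0.3.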

4.3. Component-Wise Analysis of the Framework (RQ2)

To investigate how each component contributes to the overall performance, we conduct an ablation study of components including pre-retrieval processing, vector–entity joint retrieval, and post-retrieval filtering under the same experimental settings as in RQ1.

For hypothesis validation, we report the experimental results in Table 5. Removing either vector retrieval or entity retrieval leads to the most severe drops in performance. In particular, removing vector retrieval decreases accuracy from 97.09% for the full framework to 92.73%, which suggests that dense semantic similarity retrieval is a fundamental component for capturing high-level contextual relevance: while entity retrieval provides strong domain specificity, relying solely on structured entities limits the model’s capability to retrieve semantically related information. Removing entity retrieval also causes a notable accuracy drop, from 97.09% to 94.10%, suggesting that the framework benefits from the knowledge retrieved via entities. Removing pre- or post-retrieval processing causes a smaller drop (to 96.27% and 96.73%, respectively), but both components still contribute to the effectiveness of the proposed framework.

For hypothesis refinement (Table 6), we observe a trend consistent with hypothesis validation: removing core retrieval modules such as vector retrieval leads to a larger performance drop than removing quality-control components. Additionally, Table 6 indicates that retrieval settings affect not only span-level similarity metrics (BLEU and BERTScore) but also the semantic alignment of the completed hypothesis with the conclusion (NLI score).

To evaluate the impact of the NLI backbone on the overall framework, we conduct an ablation study using different decoder-only LLMs on the CRNLI test set. We restrict this comparison to decoder-only models because encoder-only and encoder–decoder architectures are constrained by input length and are therefore less suitable for integrating external knowledge. As reported in Table 7, the chemistry-adapted LANLI (without RAG) achieves 91.23% accuracy, outperforming the general-purpose backbones. After incorporating external chemistry knowledge, all backbones improve consistently across the three evaluation metrics, with LANLI achieving the best performance. Because the improvement holds for every backbone, it suggests that hypothesis validation (NLI) benefits from the Modular RAG mechanism regardless of the backbone choice, indicating the effectiveness of the proposed scientific workflow.

Next, to investigate the effectiveness of the post-retrieval evaluator, we employ several alternative evaluation LLMs using the same prompt illustrated in Figure A2c, with all other retrieval settings unchanged. Here, we intentionally select a set of instruction-tuned LLMs with comparable capacity to conduct a controlled ablation study. Specifically, we report the chunk retention rate (the average fraction of retrieved candidate chunks that are kept) and the downstream hypothesis validation performance for each post-retrieval evaluator. The results are summarized in Table 8. Despite substantial variation in chunk retention rate (from 23.8% to 55.3%), the downstream hypothesis validation performance remains consistently high with only slight fluctuations. This indicates that the proposed framework is robust to the choice of post-retrieval evaluator among models of comparable capacity.
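A minimal sketch of post-retrieval filtering and the retention rate follows. The `is_relevant` callable is a hypothetical stand-in for the evaluation LLM's keep/drop judgment, and averaging the rate per query (rather than pooling all chunks) is an assumption, since the text does not specify the aggregation.

```python
from typing import Callable, List, Sequence, Tuple


def filter_chunks(chunks: Sequence[str],
                  is_relevant: Callable[[str], bool]) -> List[str]:
    """Keep only the chunks that the evaluator judges relevant."""
    return [chunk for chunk in chunks if is_relevant(chunk)]


def average_retention_rate(per_query: Sequence[Tuple[int, int]]) -> float:
    """Mean of kept / retrieved over queries.

    per_query holds (kept, retrieved) counts; queries that retrieved
    nothing are skipped because their rate is undefined.
    """
    rates = [kept / retrieved for kept, retrieved in per_query if retrieved > 0]
    if not rates:
        raise ValueError("no query retrieved any chunk")
    return sum(rates) / len(rates)
```

In practice, `is_relevant` would wrap a call to the evaluation model with the prompt of Figure A2c and parse its verdict.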

Furthermore, we conduct ablation experiments on the attribution method that guides the targeted hypothesis revision. Here, we compare SHAP with two representative feature attribution methods: integrated gradients and attention weights. Following Figure 3 (left), we individually mask each of the top 10 high-attribution words ranked by each of the three attribution methods; the results are summarized in Figure 4. We observe a consistent rank-wise decay in $\Delta$ across all three attribution methods, which indicates that each of them produces a meaningful ranking. Notably, SHAP exhibits the largest NLI score drop ($\Delta_{\mathrm{SHAP}}^{1} = 0.37$) when the top-ranked word is masked, exceeding that of integrated gradients and attention weights. This suggests that SHAP is more reliable for identifying high-contribution words and for providing accurate guidance for targeted hypothesis revision.
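The rank-wise masking evaluation can be sketched as below: each of the top-k words is masked on its own and the drop in NLI score relative to the unmasked hypothesis is recorded. The scoring callable `score_fn` and the mask token are hypothetical placeholders for the NLI model and its masking convention.

```python
from typing import Callable, List, Sequence


def rankwise_score_drop(tokens: Sequence[str],
                        attributions: Sequence[float],
                        score_fn: Callable[[List[str]], float],
                        top_k: int = 10,
                        mask_token: str = "[MASK]") -> List[float]:
    """Return the score drop for masking each top-k word individually.

    Entry r of the result is S_orig - S_mask when only the word with the
    (r + 1)-th highest attribution is replaced by mask_token.
    """
    if len(tokens) != len(attributions):
        raise ValueError("tokens and attributions must have the same length")
    s_orig = score_fn(list(tokens))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: attributions[i], reverse=True)[:top_k]
    drops = []
    for idx in ranked:
        masked = list(tokens)
        masked[idx] = mask_token  # mask exactly one word per evaluation
        drops.append(s_orig - score_fn(masked))
    return drops
```

A faithful attribution method should give drops that shrink as the rank increases, which is the decay pattern described above.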

4.4. Intermediate Signal Analysis (RQ3)

In this subsection, we investigate the intermediate signals produced by our workflow, including the validation decision and its confidence score, attribution scores over hypothesis words, and retrieved external knowledge chunks.

To evaluate the reliability of confidence estimates for hypothesis validation (NLI), we compute ECE and BS for different NLI configurations to assess whether their predicted probabilities reflect the true likelihoods of entailment. The results are reported in Table 9. Compared with the baselines, the proposed framework achieves the lowest ECE (0.029) and BS (0.051), indicating the best-calibrated confidence among all compared configurations and supporting the use of confidence as a reliable decision signal. Given the already low ECE and BS, we do not apply extra calibration, which also avoids potential overfitting.
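For reference, a minimal sketch of the two calibration metrics for binary entailment prediction is given below. It assumes that each prediction is the probability of entailment, that labels are 1 for entailment and 0 otherwise, and that ECE uses 10 equal-width bins; the binning scheme actually used in the experiments is not stated here, so these are assumptions.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs: Sequence[float],
                               labels: Sequence[int],
                               n_bins: int = 10) -> float:
    """Bin-weighted gap between mean predicted probability and positive rate."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        frac_pos = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / total * abs(mean_p - frac_pos)
    return ece
```

Both metrics are zero for perfectly confident and correct predictions, and lower values indicate better calibration.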

Next, we present an end-to-end case study in Figure 5 and Figure 6 to provide an intuitive understanding of the overall pipeline of the proposed framework. Figure 5 illustrates evidence-driven hypothesis validation, feature attribution, and hypothesis revision via span infilling. Figure 6 illustrates the detailed chunk information retrieved from the external chemical knowledge base. Given an input hypothesis–conclusion pair, the framework first generates auxiliary representations to facilitate evidence retrieval from the external chemical knowledge base. Specifically, as illustrated in Figure 6, the rewritten hypothesis–conclusion pair and the hypothetical conclusion are generated to enhance query quality before retrieval. Following query rewriting and expansion, the framework searches the external chemical knowledge base for relevant evidence using a vector–entity joint retrieval strategy. The retrieved chunks are then evaluated and filtered by an evaluation LLM to ensure retrieval quality.

As illustrated in Figure 5, the retrieved chunks are incorporated as contextual support for hypothesis validation. In this case, the NLI model outputs an NLI confidence of 0.1364, which is below the threshold $\tau$ (0.49), so the input pair is classified as unsupportive. The attribution module produces word-level contribution scores over the hypothesis. We visualize them by highlighting each word in red according to its score (higher contribution scores correspond to darker red). In this case, words such as “not”, “only”, and “less” contribute most to the NLI decision. We then mask spans (the top three for each hypothesis clause) centered on these words with a context window size of 5 and revise the hypothesis via context-aware text infilling. All infilled contents are highlighted in blue. The refined hypothesis is re-evaluated against the same conclusion, resulting in a new NLI confidence score of 0.6921. This closes the workflow, and the validation results are returned to researchers together with the retrieved chunks, attribution information, and the refined hypothesis. The case study not only indicates the effectiveness of the proposed framework but also suggests that it produces reliable intermediate signals that enhance transparency and traceability throughout hypothesis validation and refinement.
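The thresholding and span-masking steps of the case study can be sketched as follows for a single clause. Several details are assumptions made for illustration: a window of 5 is read as five tokens centered on the selected word, overlapping windows are merged, each merged span is replaced by one placeholder, and a confidence equal to the threshold counts as supported.

```python
from typing import List, Sequence, Tuple


def is_supported(confidence: float, tau: float = 0.49) -> bool:
    """Binary validation decision from the NLI confidence and threshold."""
    return confidence >= tau


def mask_spans(tokens: Sequence[str],
               attributions: Sequence[float],
               top_n: int = 3,
               window: int = 5,
               mask_token: str = "<mask>") -> Tuple[List[str], List[Tuple[int, int]]]:
    """Mask windows centered on the top-n attributed words of one clause.

    Returns the masked token list and the merged [start, end) spans.
    """
    half = window // 2
    top = sorted(range(len(tokens)),
                 key=lambda i: attributions[i], reverse=True)[:top_n]
    spans = sorted((max(0, i - half), min(len(tokens), i + half + 1)) for i in top)
    merged: List[Tuple[int, int]] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent windows become a single span.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    out: List[str] = []
    pos = 0
    for start, end in merged:
        out.extend(tokens[pos:start])
        out.append(mask_token)  # one placeholder per merged span
        pos = end
    out.extend(tokens[pos:])
    return out, merged
```

With the confidences reported in the case study, `is_supported(0.1364)` is false before refinement and `is_supported(0.6921)` is true afterwards.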

For this end-to-end case study, we also report the runtime and GPU memory usage of each stage. Specifically, we measure the wall-clock latency (including model loading, data pre-processing, etc.) and the peak GPU memory allocated on a single NVIDIA A100 GPU. The results are reported in Table 10. Feature attribution is the most time-consuming stage. While its runtime can be reduced by decreasing the number of perturbation samples or by adopting a more aggressive approximation strategy, doing so would reduce the stability and faithfulness of the attributions and thereby affect the subsequent hypothesis refinement.
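To put the per-stage runtimes in proportion, the short calculation below sums the values from Table 10 and reports each stage's share of the total. The numbers are copied from the table; the helper name is hypothetical.

```python
from typing import Dict

# Per-stage runtimes in seconds, copied from Table 10.
RUNTIME_S: Dict[str, float] = {
    "Query rewriting": 11.2,
    "Query expansion": 10.3,
    "Retrieval": 1.7,
    "Post-retrieval evaluation": 8.8,
    "Hypothesis validation": 6.7,
    "Feature attribution": 67.5,
    "Hypothesis refinement": 8.1,
}


def runtime_shares(runtimes: Dict[str, float]) -> Dict[str, float]:
    """Fraction of the total runtime spent in each stage."""
    total = sum(runtimes.values())
    return {stage: seconds / total for stage, seconds in runtimes.items()}


total_s = sum(RUNTIME_S.values())
shares = runtime_shares(RUNTIME_S)
slowest = max(RUNTIME_S, key=RUNTIME_S.get)
print(f"total: {total_s:.1f} s; slowest: {slowest} ({shares[slowest]:.1%})")
```

On these figures, feature attribution accounts for roughly 59% of the approximately 114 s end-to-end runtime, which is why it is the natural target for any speed-up.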

5. Discussion

The proposed framework aims to organize hypothesis validation and refinement into an inspectable scientific workflow rather than simply to align hypotheses with experimental conclusions. This design introduces several practical limitations. First, hypothesis refinement is constrained to attribution-guided local editing. While this design helps enhance the precision and interpretability of hypothesis revision, it may be insufficient for cases that require global restructuring. Second, the refinement quality is bounded by the reliability of upstream results. Because revision spans are derived from attributions over the validation result, errors in hypothesis validation can propagate to masking decisions and subsequently affect the refined hypothesis, making the workflow sensitive to NLI calibration and attribution noise. Third, the external knowledge base plays a key role in the proposed framework. Even with pre- and post-retrieval processing, low-quality evidence can limit the performance of both validation and refinement, especially when relevant domain-specific knowledge is absent or poorly represented in the corpus. These limitations suggest that the framework is most suitable when researchers inspect intermediate outputs as checkpoints and apply manual review to difficult cases in practical scientific research workflows.

6. Conclusions

In this study, we present a modular framework for automated hypothesis validation and refinement in scientific research. Rather than introducing new task-specific models, the framework integrates established techniques, including NLI-based hypothesis validation, attribution-guided hypothesis refinement, and Modular RAG-based external evidence retrieval, into a unified and controllable workflow. We conduct extensive experiments on scientific texts in the chemistry domain. Comparisons with baseline configurations verify the effectiveness of the proposed framework in hypothesis validation, hypothesis refinement, and feature attribution. Ablation studies further show that the quality of retrieval and attribution plays a crucial role throughout the overall research workflow. Additionally, the end-to-end case study suggests that the framework produces reliable intermediate signals that serve as checkpoints, enabling researchers to trace the research workflow and apply manual review to difficult cases if necessary. Our work offers a modular solution for deploying LLM-based systems in scientific research workflows. As future work, we plan to explore domain transfer and further optimize the framework with regard to the limitations discussed in Section 5.

Author Contributions

Conceptualization, C.C.; methodology, C.C.; software, C.C.; validation, C.C. and T.M.; formal analysis, C.C.; investigation, C.C. and T.M.; resources, C.C.; data curation, C.C. and T.M.; writing—original draft preparation, C.C.; writing—review and editing, C.C.; visualization, C.C.; supervision, T.H., T.Y., and H.F.; project administration, H.F.; funding acquisition, H.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by JST Moonshot R&D grant number JPMJMS2236.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The experimental code and data related to this paper can be obtained by contacting the corresponding author.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive comments, which greatly improved the quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Models and Prompts Used in the Framework

All models and prompts used in the proposed framework are instantiated with chemistry-specific variants, reflecting the target application of this study.

Appendix A.1. Hypothesis Validation Module

The hypothesis validation module is implemented using a chemistry-domain NLI model, LANLI [21], which is a decoder-only method tailored for long-form hypothesis validation in chemical contexts. LANLI takes as input a hypothesis–conclusion pair and external reference knowledge and outputs a binary validation decision with a confidence score. Furthermore, it integrates NLI with SHAP [28] to identify key tokens within the hypothesis that contribute most to the NLI result. The prompt for hypothesis validation is illustrated in Figure A1a.

Figure A1. Prompt instructions for (a) hypothesis validation and (b) hypothesis refinement.

Appendix A.2. Hypothesis Refinement Module

The hypothesis refinement module employs our previous method [20], which performs clause-level attribution-guided span masking and context-aware text infilling. The model takes as input a masked hypothesis, a conclusion, and external reference knowledge and outputs infilling contents. In this work, the module is used as a local editing operator to revise contradictory spans identified by the validation module. The prompt for hypothesis refinement is illustrated in Figure A1b.

Appendix A.3. RAG-Based External Evidence Retrieval Module

The external evidence retrieval module follows the Modular RAG paradigm introduced in [14] and is composed of four submodules: routing, pre-retrieval, search, and post-retrieval. For the semantic routing strategy (routing module) and chunk embedding, we employ the Sentence-BERT model all-MiniLM-L6-v2 [32,33] to encode the texts. Vector retrieval (search module) is conducted using FAISS [34] with a loose similarity threshold of 0.3. We employ the same LLM, DeepSeek-R1-Distill-Llama-8B [35,36], for query rewriting and expansion (pre-retrieval module) as well as evidence evaluation (post-retrieval module). The task-specific prompts are illustrated in Figure A2.
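The search step can be illustrated with a small, dependency-free sketch. It is not the FAISS-based implementation: a brute-force cosine similarity stands in for the vector index, the threshold of 0.3 is assumed to apply to cosine similarity, and combining vector hits and entity hits by union is an assumption about how the joint retrieval merges its two sources. All function names are hypothetical.

```python
import math
from typing import Dict, Iterable, List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def vector_retrieve(query_vec: Sequence[float],
                    chunk_vecs: Dict[str, Sequence[float]],
                    threshold: float = 0.3) -> List[Tuple[str, float]]:
    """Chunks whose similarity to the query reaches the threshold, best first."""
    scored = ((cid, cosine(query_vec, vec)) for cid, vec in chunk_vecs.items())
    return sorted((hit for hit in scored if hit[1] >= threshold),
                  key=lambda hit: hit[1], reverse=True)


def entity_retrieve(query_entities: Iterable[str],
                    entity_index: Dict[str, Set[str]]) -> Set[str]:
    """Chunks that mention at least one query entity (case-insensitive)."""
    hits: Set[str] = set()
    for entity in query_entities:
        hits.update(entity_index.get(entity.lower(), set()))
    return hits


def joint_retrieve(query_vec: Sequence[float],
                   query_entities: Iterable[str],
                   chunk_vecs: Dict[str, Sequence[float]],
                   entity_index: Dict[str, Set[str]],
                   threshold: float = 0.3) -> List[str]:
    """Union of vector hits (ranked) followed by entity-only hits (sorted)."""
    vec_hits = [cid for cid, _ in vector_retrieve(query_vec, chunk_vecs, threshold)]
    ent_hits = entity_retrieve(query_entities, entity_index)
    return vec_hits + sorted(ent_hits - set(vec_hits))
```

In this sketch, a chunk that shares no semantic similarity with the query can still be returned through an entity match, which is the complementary behavior the ablation in Table 5 attributes to entity retrieval.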

Figure A2. Prompt instructions for (a) query (hypothesis and conclusion) rewriting, (b) query expansion (hypothetical conclusion generation), and (c) post-retrieval evaluation.

Appendix B. External Knowledge Base

In this work, we employ a chemistry-specific knowledge base that is expected to be released in the near future. It is constructed from publicly available scientific literature, primarily chemistry pre-prints (collected from ChemRxiv [37]) and their referenced publications (collected from Crossref [38]). All sources are open-access and used solely for research purposes. Specifically, we include ChemRxiv meta-papers released from August 2017 to August 2025 and their references, and apply document-level de-duplication as well as basic quality filtering to remove duplicated versions and withdrawn records. Figure A3 and Figure A4 illustrate the distribution of disciplines and publication dates of the collected ChemRxiv meta-papers, respectively. All documents are converted into structured texts using the optical character recognition (OCR) engine Nougat [39] and then segmented into section-aware chunks with a pre-defined chunk size ranging from 200 to 300 tokens and an overlap of 60 tokens. Furthermore, we employ the named entity recognition (NER) tool SciSpaCy [40] to extract chemical entities from each chunk. These entities are organized into an entity database that maps each entity to its source chunks. To provide a comprehensive overview, we report the statistics of the chemistry knowledge base in Table A1 with a breakdown by meta-papers and references.
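The chunking and entity-indexing steps can be sketched as follows. This is a simplified illustration: it uses a fixed chunk size of 250 tokens (within the stated 200 to 300 range) instead of section-aware boundaries, and it takes the entities of each chunk as given rather than running an NER model. Function names are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Set


def chunk_tokens(tokens: Sequence[str],
                 size: int = 250,
                 overlap: int = 60) -> List[List[str]]:
    """Split a token sequence into windows that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def build_entity_index(chunk_entities: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each lower-cased entity to the set of chunk ids that mention it."""
    index: Dict[str, Set[str]] = {}
    for chunk_id, entities in chunk_entities.items():
        for entity in entities:
            index.setdefault(entity.lower(), set()).add(chunk_id)
    return index
```

For a 500-token section, this yields three chunks of 250, 250, and 120 tokens, with consecutive chunks sharing 60 tokens.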

Table A1. Statistics of the chemistry knowledge base.

	Meta-Paper	Referenced Paper	Total
Source data	ChemRxiv	Crossref	ChemRxiv, Crossref
Data volume	35,373	747,693	783,066
w/ PDF	35,224	4181	39,372
w/ Abstract only	149	357,634	357,783
Chunk volume	529,569	394,912	924,481
Entity volume	7,607,812	2,359,335	9,967,147

Figure A3. Discipline distribution of collected ChemRxiv meta-papers from which the chemistry knowledge base is built.

Figure A4. Publication date distribution of collected ChemRxiv meta-papers from which the chemistry knowledge base is built.

Appendix C. Evaluation Datasets Used in Experiments

For hypothesis validation, the evaluation dataset CRNLI is a structured NLI dataset in the chemistry domain. We compare CRNLI with SNLI (a general-domain NLI dataset) [41] and summarize their statistical information in Table A2. In addition to being chemistry-specific, CRNLI features an average token length of 122.8, which is substantially longer than that of traditional NLI datasets such as SNLI (11.2 tokens). For hypothesis refinement, we employ the hypothesis revision datasets proposed in [20], which are constructed from SNLI and CRNLI by applying attribution-guided masking to entailment-labeled NLI samples. Detailed dataset statistics are reported in Table A3. Notably, the external chemistry knowledge base is built independently of both evaluation datasets by excluding their source data.

Table A2. SNLI and CRNLI statistical information.

	SNLI	CRNLI
Domain	General	Chemistry
Source data	Flickr30k, VisualGenome	ChemRxiv
Dataset volume	570 k	53.5 k
Label distribution	3 (E/C/N)	2 (E/C)
Data distribution	Balanced	Balanced
Train/Dev/Test	550 k/10 k/10 k	45.5 k/4.2 k/3.8 k
Avg. token length	11.2	122.8

Table A3. Dataset statistics of hypothesis revision datasets in the general and chemistry domains.

	General	Chemistry
Source data	SNLI	CRNLI
Data volume	174 k	25 k
Train/Dev/Test	147 k/14 k/13 k	21.3 k/1.9 k/1.8 k
Avg. token length	11.2	122.8
Avg. # of masked spans	1.7	2.6
Avg. masking window size	2.8	7.1

Appendix D. Prompts for Independent Verifier Evaluation

Prompts for Flan-T5-xxl (encoder–decoder) and Qwen2.5-7B-Instruct (decoder-only) are illustrated in Figure A5.

Figure A5. Prompt instructions for (a) the encoder–decoder model Flan-T5-xxl and (b) the decoder-only model Qwen2.5-7B-Instruct.

References

1. Lu, C.; Lu, C.; Lange, R.T.; Foerster, J.N.; Clune, J.; Ha, D. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. arXiv 2024, arXiv:2408.06292.
2. Lenat, D.B. Automated Theory Formation in Mathematics. In Proceedings of the 5th International Joint Conference on Artificial Intelligence, Cambridge, MA, USA, 22–25 August 1977; Volume 2, pp. 833–842.
3. Lenat, D.B.; Brown, J.S. Why AM and EURISKO Appear to Work. Artif. Intell. 1984, 23, 269–294.
4. Buchanan, B.G.; Feigenbaum, E.A. Dendral and Meta-Dendral: Their Applications Dimension. Artif. Intell. 1978, 11, 5–24.
5. Freire, J.; Fan, G.; Feuer, B.; Koutras, C.; Liu, Y.; Peña, E.; Santos, A.S.; Silva, C.; Wu, E. Large Language Models for Data Discovery and Integration: Challenges and Opportunities. IEEE Data Eng. Bull. 2025, 49, 3–31.
6. Luo, Z.; Yang, Z.; Xu, Z.; Yang, W.; Du, X. LLM4SR: A Survey on Large Language Models for Scientific Research. arXiv 2025, arXiv:2501.04306.
7. Zhu, Y.; Qiao, S.; Ou, Y.; Deng, S.; Zhang, N.; Lyu, S.; Shen, Y.; Liang, L.; Gu, J.; Chen, H. KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents. In Findings of the Association for Computational Linguistics: NAACL 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 3709–3732.
8. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv 2020, arXiv:2005.11401.
9. Qi, B.; Zhang, K.; Tian, K.; Li, H.; Chen, Z.; Zeng, S.; Hua, E.; Hu, J.; Zhou, B. Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation. arXiv 2024, arXiv:2407.08940.
10. Hu, X.; Fu, H.; Wang, J.; Wang, Y.; Li, Z.; Xu, R.; Lu, Y.; Jin, Y.; Pan, L.; Lan, Z. Nova: An Iterative Planning and Search Approach to Enhance Novelty and Diversity of LLM Generated Ideas. arXiv 2024, arXiv:2410.14255.
11. Zhou, Y.; Liu, H.; Srivastava, T.; Mei, H.; Tan, C. Hypothesis Generation with Large Language Models. arXiv 2024, arXiv:2404.04326.
12. Xiong, G.; Xie, E.; Shariatmadari, A.H.; Guo, S.; Bekiranov, S.; Zhang, A. Improving Scientific Hypothesis Generation with Knowledge Grounded Large Language Models. arXiv 2024, arXiv:2411.02382.
13. Baek, J.; Jauhar, S.K.; Cucerzan, S.; Hwang, S.J. ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models. arXiv 2024, arXiv:2404.07738.
14. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Guo, Q.; Wang, M.; et al. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2023, arXiv:2312.10997.
15. Ma, X.; Gong, Y.; He, P.; Zhao, H.; Duan, N. Query Rewriting for Retrieval-Augmented Large Language Models. arXiv 2023, arXiv:2305.14283.
16. Gao, L.; Ma, X.; Lin, J.J.; Callan, J. Precise Zero-Shot Dense Retrieval without Relevance Labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2022.
17. Zheng, H.S.; Mishra, S.; Chen, X.; Cheng, H.; Chi, E.H.; Le, Q.V.; Zhou, D. Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models. arXiv 2023, arXiv:2310.06117.
18. Yu, W.; Iter, D.; Wang, S.; Xu, Y.; Ju, M.; Sanyal, S.; Zhu, C.; Zeng, M.; Jiang, M. Generate rather than Retrieve: Large Language Models are Strong Context Generators. arXiv 2022, arXiv:2209.10063.
19. Shao, Z.; Gong, Y.; Shen, Y.; Huang, M.; Duan, N.; Chen, W. Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy. arXiv 2023, arXiv:2305.15294.
20. Chen, C.; Hirakawa, T.; Yamashita, T.; Fujiyoshi, H. Hypothesis Alignment via Clause-level Attribution-guided Span Masking and Infilling. In Proceedings of the 5th International Conference on Communications, Networking and Machine Learning, Singapore, 24–26 October 2025.
21. Chen, C.; Masuda, T.; Ushiku, Y.; Tanaka, S.; Saito, K.; Hirakawa, T.; Yamashita, T.; Fujiyoshi, H. CRNLI: A Textual Entailment Dataset in the Chemistry Domain. In Text, Speech and Dialogue; Springer: Cham, Switzerland, 2025.
22. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2002.
23. Lin, C. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Stroudsburg, PA, USA, 2004.
24. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. arXiv 2019, arXiv:1904.09675.
25. Li, X.; Du, M.; Chen, J.; Chai, Y.; Lakkaraju, H.; Xiong, H. M4: A Unified XAI Benchmark for Faithfulness Evaluation of Feature Attribution Methods across Metrics, Modalities and Models. In NIPS’23: Proceedings of the 37th International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2023.
26. Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019.
27. Abdin, M.; Jacobs, S.A.; Awan, A.A.; Aneja, J.; Awadallah, A.; Awadalla, H.H.; Bach, N.; Bahree, A.; Bakhtiari, A.; Behl, H.S.; et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv 2024, arXiv:2404.14219.
28. Lundberg, S.M.; Lee, S. A Unified Approach to Interpreting Model Predictions. In NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017.
29. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
30. Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, E.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling Instruction-Finetuned Language Models. arXiv 2022, arXiv:2210.11416.
31. Yang, Q.A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Dong, G.; et al. Qwen2.5 Technical Report. arXiv 2024, arXiv:2412.15115.
32. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv 2019, arXiv:1908.10084.
33. Yin, C.; Zhang, Z. A Study of Sentence Similarity Based on the All-minilm-l6-v2 Model with “Same Semantics, Different Structure” After Fine Tuning. In Proceedings of the 2024 2nd International Conference on Image, Algorithms and Artificial Intelligence (ICIAAI 2024); Atlantis Press: Paris, France, 2024.
34. Douze, M.; Guzhva, A.; Deng, C.; Johnson, J.; Szilvasy, G.; Mazaré, P.; Lomeli, M.; Hosseini, L.; Jégou, H. The Faiss library. arXiv 2024, arXiv:2401.08281.
35. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971.
36. Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 2025, 645, 633–638.
37. ChemRxiv. Available online: https://chemrxiv.org/ (accessed on 9 February 2026).
38. Crossref. Available online: https://www.crossref.org/ (accessed on 9 February 2026).
39. Blecher, L.; Cucurull, G.; Scialom, T.; Stojnic, R. Nougat: Neural Optical Understanding for Academic Documents. arXiv 2023, arXiv:2308.13418.
40. Neumann, M.; King, D.; Beltagy, I.; Ammar, B.W. ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. arXiv 2019, arXiv:1902.07669.
41. Bowman, S.R.; Angeli, G.; Potts, C.; Manning, C.D. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015.

Figure 1. An overview of the proposed modular framework for automated hypothesis validation and refinement. This framework consists of three modules: NLI-based hypothesis validation, attribution-guided hypothesis refinement, and RAG-based external evidence retrieval.

Figure 2. Confusion matrix and ROC curve of (a) NLI w/o RAG, (b) NLI w/ Naive RAG, (c) NLI w/ Advanced RAG, and (d) NLI w/ Modular RAG (ours).

Figure 3. Evaluation of attribution effectiveness. Left: Individual masking of top 10 high-attribution words ranked by SHAP. Right: Comparison of MoRF and LeRF, which mask all top 5 and bottom 5 high-attribution words, respectively. The vertical axis denotes the drop $\Delta$ in NLI score by keyword masking ($\Delta = S_{\mathrm{orig}} - S_{\mathrm{mask}}$).


Figure 4. Ablation of attribution methods under individual masking of top 10 high-attribution words ranked by SHAP (blue), integrated gradients (red), and attention weights (green).

Figure 5. An end-to-end case study of the proposed framework, including input hypothesis and conclusion, validation decision with confidence, attribution map highlighting decision critical words in red, context-aware span masking highlighted in gray, and hypothesis revision results highlighting infilled spans in blue. Detailed prompts used can be found in Appendix A.

Figure 6. Detailed chunk information retrieved from external chemical knowledge base, including the results of query rewriting and expansion, chunks retrieved by vector–entity joint retrieval, and chunks filtered by post-retrieval LLM evaluation. Red strikethrough indicates chunks that are filtered out by the post-retrieval evaluator.

Table 1. Comparison of the proposed framework with baselines in terms of accuracy, F1-score, and AUC in hypothesis validation (NLI) on the evaluation dataset.

Method	Accuracy (%)	F1-Score (%)	AUC (%)
NLI w/o RAG	91.23	91.57	96.98
NLI w/ Naive RAG	91.56	91.62	97.08
NLI w/ Advanced RAG	94.10	94.22	98.12
NLI w/ Modular RAG (Ours)	97.09	97.28	99.28

Table 2. Comparison of the proposed framework with baselines in terms of BLEU-4, ROUGE-L, and BERTScore in hypothesis refinement (text infilling) on the evaluation dataset.

Method	BLEU-4	ROUGE-L	BERTScore (%)
Infill w/o RAG	44.21	45.72	90.97
Infill w/ Naive RAG	43.04	45.29	90.74
Infill w/ Advanced RAG	47.15	46.91	91.36
Infill w/ Modular RAG (Ours)	48.43	48.8	92.58

Table 3. Comparison of the proposed framework with baselines in terms of PPL, NLI score, and SCR in hypothesis refinement (text infilling) on the evaluation dataset.

Method	PPL ↓	NLI Score (%) ↑	SCR (%) ↑
Original	10.1	83.2	N/A
Masked	N/A	53.57	N/A
Infill w/o RAG	10.72	79.23	100
Infill w/ Naive RAG	10.64	80.41	99.83
Infill w/ Advanced RAG	11.2	82.7	100
Infill w/ Modular RAG (Ours)	10.81	82.92	100

Table 4. Independent verifier evaluation of hypothesis refinement under zero-shot NLI scoring. We report NLI scores for original hypotheses and refined hypotheses using three off-the-shelf NLI methods.

NLI Model	Model Size (B)	Original NLI Score (%)	Refined NLI Score (%)	$Δ$ NLI Score (%)
Roberta-large-mnli (encoder-only)	0.4	21.57	70.52	48.95
Flan-T5-xxl (encoder–decoder)	11	33.62	67.41	33.79
Qwen2.5-7B-Instruct (decoder-only)	7	37.16	60.36	23.2
LANLI (ours)	7	19.57	72.84	53.27

Table 5. Performance of different ablation settings in terms of accuracy, F1-score, and AUC in hypothesis validation (NLI) on the evaluation dataset, with NLI-only and proposed full framework as reference.

NLI Method	Accuracy (%)	F1-Score (%)	AUC (%)
w/o RAG	91.23	91.57	96.98
Ours w/o pre-retrieval	96.27	95.91	99.09
Ours w/o entity retrieval	94.10	94.22	98.12
Ours w/o vector retrieval	92.73	93.04	97.55
Ours w/o post-retrieval	96.73	96.50	99.12
w/ Modular RAG (ours)	97.09	97.28	99.28

Table 6. Performance of different ablation settings in terms of BLEU-4, BERTScore, and NLI score in hypothesis refinement (text infilling) on the evaluation dataset, with infill-only and the proposed full framework as reference.

Infill Method	BLEU-4	BERTScore (%)	NLI Score (%)
Original	100	100	83.2
Masked	N/A	N/A	53.57
w/o RAG	44.21	90.97	79.23
Ours w/o pre-retrieval	47.63	91.86	82.86
Ours w/o entity retrieval	47.15	91.38	82.7
Ours w/o vector retrieval	45.92	91.24	80.75
Ours w/o post-retrieval	46.18	91.4	82.51
w/ Modular RAG (ours)	48.43	92.58	82.92

Table 7. Ablation of NLI backbone in terms of accuracy, F1-score, and AUC on the evaluation dataset.

NLI Backbone	Accuracy (%)	F1-Score (%)	AUC (%)
Llama-3.1-8B-Instruct w/o RAG	86.12	87.54	91.3
Qwen2.5-7B-Instruct w/o RAG	85.43	85.26	89.37
Phi-3.5-mini-instruct w/o RAG	84.9	84.24	88.96
LANLI w/o RAG	91.23	91.57	96.98
Llama-3.1-8B-Instruct	92.76	93.38	97.6
Qwen2.5-7B-Instruct	92.12	92.85	97.54
Phi-3.5-mini-instruct	91.37	91.43	96.71
LANLI (ours)	97.09	97.28	99.28

Table 8. Ablation of post-retrieval evaluator in terms of retention rate, accuracy, F1-score, and AUC in hypothesis validation (NLI) on the evaluation dataset.

Evaluator	Retention Rate (%)	Accuracy (%)	F1-Score (%)	AUC (%)
Llama-3.1-8B-Instruct	40.7	96.95	97.13	99.2
Qwen2.5-7B-Instruct	55.3	97.12	97.25	99.24
Phi-3.5-mini-instruct	23.8	96.7	96.51	99.13
DeepSeek-R1-Distill-Llama-8B (ours)	48.1	97.09	97.28	99.28

Table 9. Comparison of the proposed framework with three baselines in terms of ECE and BS in hypothesis validation (NLI).

Method	ECE ↓	BS ↓
NLI w/o RAG	0.068	0.094
NLI w/ Naive RAG	0.065	0.090
NLI w/ Advanced RAG	0.042	0.068
NLI w/ Modular RAG (Ours)	0.029	0.051
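
For readers unfamiliar with the two calibration metrics in Table 9, the sketch below shows one common way to compute the expected calibration error (ECE) and the Brier score (BS) for a binary classifier. It assumes the standard textbook definitions with 10 equal-width confidence bins; the binning scheme, the function names, and the toy inputs are assumptions for illustration and may differ from the setup used in the experiments.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between the predicted probability of the
    positive class and the 0/1 ground-truth label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """ECE with equal-width bins over the confidence of the predicted class."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((conf, 1.0 if pred == y else 0.0))

    n = len(labels)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy inputs (not from the paper): probability of the positive class and true label.
toy_probs = [0.95, 0.85, 0.25, 0.65]
toy_labels = [1, 1, 0, 0]
print("ECE:", expected_calibration_error(toy_probs, toy_labels))
print("BS: ", brier_score(toy_probs, toy_labels))
```

Both metrics are lower-is-better, which is why Table 9 marks them with a downward arrow.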

Table 10. Per-stage runtime and peak GPU memory usage for the end-to-end case study.

Stage	Runtime (s)	Peak GPU Memory (MiB)
Query rewriting	11.2	15,467
Query expansion	10.3	15,465
Retrieval	1.7	2174
Post-retrieval evaluation	8.8	15,283
Hypothesis validation	6.7	4676
Feature attribution	67.5	18,009
Hypothesis refinement	8.1	17,792
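
The per-stage figures in Table 10 can be aggregated with a few lines of Python. The sketch below assumes that the stages run strictly one after another and that GPU memory is released between stages, so the end-to-end runtime is the sum of the stage runtimes and the overall peak memory is the maximum over stages; neither assumption is stated in the table itself.

```python
# Values transcribed from Table 10: (stage, runtime in seconds, peak GPU memory in MiB)
stages = [
    ("Query rewriting", 11.2, 15467),
    ("Query expansion", 10.3, 15465),
    ("Retrieval", 1.7, 2174),
    ("Post-retrieval evaluation", 8.8, 15283),
    ("Hypothesis validation", 6.7, 4676),
    ("Feature attribution", 67.5, 18009),
    ("Hypothesis refinement", 8.1, 17792),
]

# End-to-end runtime under the sequential-execution assumption.
total_runtime = sum(runtime for _, runtime, _ in stages)

# Stage that dominates the runtime, and its share of the total.
slowest_stage, slowest_runtime, _ = max(stages, key=lambda s: s[1])
slowest_share = slowest_runtime / total_runtime

# Overall peak memory, assuming memory is freed between stages.
peak_memory = max(memory for _, _, memory in stages)

print(f"Total runtime: {total_runtime:.1f} s")
print(f"Slowest stage: {slowest_stage} ({slowest_share:.1%} of total)")
print(f"Peak GPU memory: {peak_memory} MiB")
```

Under these assumptions, feature attribution accounts for more than half of the end-to-end runtime, which makes it the natural first target for optimization.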

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Chen, C.; Masuda, T.; Hirakawa, T.; Yamashita, T.; Fujiyoshi, H. A Modular Framework for Automated Hypothesis Validation and Refinement in Scientific Research. Information 2026, 17, 244. https://doi.org/10.3390/info17030244

