Article

Explainable Multi-Hop Question Answering: A Rationale-Based Approach

1 Department of Artificial Intelligence, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul 05029, Republic of Korea
2 Department of Multilingual Understanding, NC AI, 12, Daewangpangyo-ro 644beon-gil, Bundang-gu, Seongnam-si, Gyeonggi-do 13494, Republic of Korea
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Big Data Cogn. Comput. 2025, 9(11), 273; https://doi.org/10.3390/bdcc9110273
Submission received: 18 September 2025 / Revised: 19 October 2025 / Accepted: 28 October 2025 / Published: 31 October 2025
(This article belongs to the Special Issue Machine Learning and AI Technology for Sustainable Development)

Abstract

Multi-hop question answering tasks involve identifying relevant supporting sentences from a given set of documents, which serve as the rationale for deriving answers. Most approaches in this area consist of two main components: a rationale identification module and a reader module. Since the rationale identification module often relies on retrieval models or supervised learning, annotated rationales are typically essential. This reliance on annotations, however, creates challenges when adapting to open-domain settings. Moreover, when models are trained on annotated rationales, explainable artificial intelligence (XAI) requires clear explanations of how the model arrives at these rationales. Consequently, traditional multi-hop question answering (QA) approaches that depend on annotated rationales are ill-suited for XAI, which demands transparency in the model’s reasoning process. To address this issue, we propose a rationale reasoning framework that can effectively infer rationales and clearly demonstrate the model’s reasoning process, even in open-domain environments without annotations. The proposed model is applicable to various tasks without structural constraints, and experimental results demonstrate its significantly improved rationale reasoning capabilities in multi-hop question answering, relation extraction, and sentence classification tasks.

1. Introduction

Recent advancements in artificial intelligence (AI) have led to remarkable progress across a wide range of domains. Among these, large language models (LLMs) have played a central role in transforming natural language processing (NLP) [1,2,3,4]. Trained on vast collections of text, LLMs learn to capture linguistic context and meaning, allowing them to generate and interpret human language with impressive fluency. Their success has reshaped how people interact with technology—from everyday conversation assistants to advanced reasoning tools. However, despite these achievements, LLMs continue to face fundamental challenges regarding the transparency and reliability of their reasoning processes [5,6,7]. Even when models attempt to justify their answers, users often remain uncertain about how those explanations are produced or whether they reflect genuine reasoning [8]. This opacity makes it difficult to understand why a model arrives at a particular response, raising concerns about its interpretability and trustworthiness. As a result, users may hesitate to rely on LLM outputs—especially when the consequences of misinformation are high. To mitigate these concerns, research in explainable artificial intelligence (XAI) has sought ways to make AI decision-making more transparent and accessible [9,10,11]. The goal of XAI is not only to expose the inner workings of AI systems but also to present explanations that are meaningful to human users. By offering clear and intuitive insights into how models reach their conclusions, XAI aims to strengthen user trust and promote more responsible use of intelligent systems.
In this paper, we define the criteria for XAI models as follows: (1) The rationales provided by the model must be reliable and understandable to users. (2) Users should be able to comprehend the model’s operation through its reasoning process. (3) The model should be capable of providing rationales without relying on annotations attached to the data. (4) The rationales provided by the model should align with its predictive outputs. To meet these criteria, we propose a framework that satisfies these requirements and apply it to the multi-hop question answering (QA) task, a representative task related to XAI.
Multi-hop question answering involves identifying multiple supporting sentences from a set of documents that collectively provide evidence for deriving an answer. These supporting sentences act as the rationale for producing accurate responses. Because this task requires understanding relationships between sentences across multiple documents, identifying reliable and interpretable rationales is essential. Most multi-hop question answering systems [11,12] consist of two main components: a supporting sentence identification module and a reading comprehension module. The supporting sentence identification module selects the sentences necessary for inferring the answer, identifying them within the documents based on retrieval models or supervised learning [13,14]. The reader then uses the sentences selected by the supporting sentence identification module to infer responses to the given questions. However, these approaches have several limitations. First, the supporting sentence identification module often requires explicit rationale annotations, and even with such annotations, explaining how the rationale is derived remains difficult. Second, the reader module frequently treats the entire retrieved segment as a rationale, making it hard to trace the model’s reasoning process—which runs counter to the goals of XAI. As a result, traditional multi-hop question answering methods are not well-suited for open-domain environments where annotated rationales are scarce or where model transparency is essential.
To address these issues, this paper proposes a framework for explainable multi-hop question answering that explicitly reveals the model’s reasoning process and enables rationale inference without requiring annotations. The proposed model is built upon a pointer network architecture and trained with a strategy that encourages it to infer rationale sentences. In addition, the model enhances factual consistency by extracting rationales directly from source documents and presenting them at the sentence level to improve user interpretability. Moreover, the process of selecting rationale sentences through the pointer network provides transparency into the model’s reasoning while achieving strong performance even without explicit annotations. Overall, this approach satisfies the core requirements of XAI and demonstrates competitive rationale inference performance in experiments.

2. Related Works

Explainable AI (XAI) aims to make a model’s decisions understandable to users [15,16,17]. Most prior work on XAI [18,19,20,21] has focused on interpreting the model’s inference process or identifying key words, phrases, or sentences that influence predictions, rather than directly relying on annotated rationale labels. For example, ref. [22] proposed aligning textual elements in natural language inference to analyze sentence correspondence, while employing masking techniques to provide token- or phrase-level rationales. Similarly, ref. [10] used trainable masks to block or activate specific nodes or edges in a graph, thereby filtering out irrelevant information and isolating key rationales. However, these approaches have limited scope and generalizability. A parallel line of research has questioned whether transformer attention weights can serve as faithful explanations of model predictions, noting that attention often captures correlational rather than causal relationships [23,24,25,26]. To address these limitations, we adopt a pointer-style rationale modeling approach that explicitly selects discrete spans or sentences from the input, yielding text-aligned evidence and providing transparency at the decision level [27,28].
Recent studies on rationale inference can broadly be categorized by the degree of supervision involved. Supervised approaches [13] leverage annotated rationales to enhance faithfulness and interpretability but suffer from the high cost of annotation and poor generalization in open-domain settings. Semi-supervised techniques [9] attempt to alleviate these issues by generating pseudo-evidence or employing self-training mechanisms, while unsupervised methods [29,30] remove the need for labeled data altogether, enabling scalability but often at the expense of interpretability. This progression reflects a growing effort to achieve explainability while reducing dependence on annotated rationale data. Building on these distinctions, recent work has extended rationale inference to more complex reasoning tasks such as multi-hop question answering. Beyond traditional explainability frameworks, recent studies have explored interpretability in large language models (LLMs). Research on LLM explainability focuses on aligning model rationales with human-understandable evidence [31,32,33], evaluating faithfulness of generated explanations [34], and developing post-hoc tools that trace internal reasoning steps [31]. These efforts emphasize the growing convergence between XAI and LLM interpretability, where understanding how multi-hop reasoning emerges in generative models has become an active research frontier. Our work contributes to this direction by providing an unsupervised, rationale-based framework that maintains interpretability while scaling to complex reasoning tasks.
Multi-hop Question Answering (QA) requires integrating multiple sentences from a set of documents to answer complex questions. This task demands precise identification of the supporting sentences that are essential for deriving the correct answer. Ref. [35] attempted to enhance explainability by classifying whether each sentence qualifies as supporting evidence. However, their method does not sufficiently capture interactions among sentences and thus struggles to explain complex, multi-document questions. Similarly, [35,36] mainly explored dependencies between pairs of information, while models such as HUG (Hop, Union, Generate) [11] sought to overcome these limitations by performing multi-hop inference through document- and sentence-level predictions. Nevertheless, the explainability of these models remains limited, as their reasoning processes are largely implicit and not easily interpretable to users.
Other approaches, such as RAG (Retrieval-Augmented Generation) [14], treat evidence as latent variables but overlook relationships across documents. Ref. [9] successfully generated pseudo-evidence annotations through semi-supervised learning for specific QA tasks, but their method applies only to certain types of multi-hop questions. To address these limitations, recent studies [11,37,38,39] have continued to develop explainable multi-hop QA frameworks capable of reasoning across multiple documents while maintaining transparency in the inference process.
In this work, we introduce a pointer-decoding module that first decodes candidate evidence sentences and then trains the model to increase semantic agreement between predictions made from only the selected evidence and those made from the full document [26,27,28]. This encourages the selected evidence to be functionally tied to the decision rather than merely correlated, and yields discrete, verifiable rationales aligned with the model’s output. Theoretically, this design strengthens interpretability along three axes: (i) causal sufficiency—the selected set alone can reproduce the output, in contrast to attention scores that need not be causal [23,25]; (ii) an information-bottleneck perspective—the pointer compresses inputs into minimal-sufficient evidence while preserving task-relevant meaning [40]; and (iii) functional faithfulness—the same output is recovered from the selected evidence, aligning the explanation with the model’s computation [26,28]. Overall, our pointer-based, agreement-driven framework narrows the gap between correlational attribution and reasoning faithfulness, while remaining annotation-free and broadly applicable across tasks.

3. Methodology

Multi-hop question answering identifies the supporting sentences from a document set $D = \{d_0, d_1, \ldots, d_{|D|-1}\}$ and infers the answer to question Q. Here, |D| refers to the size of the document set. Figure 1 illustrates the structure of the proposed model. The proposed model comprises a pre-trained language model [41] and a pointer network [42]. The pre-trained language model encodes the documents, while the pointer network selects the information necessary for answer inference. The final hidden states of the pointer network, which accumulate rationale information, are used for the final answer.
Input Encoder Specifically, the input sequence is defined as follows:
$T = \{\{t_0^0, \ldots, t_{L-1}^0\}, \ldots, \{t_0^{|D|-1}, \ldots, t_{L-1}^{|D|-1}\}\} \in \mathbb{R}^{|D| \times L} \qquad (1)$
where T consists of a [CLS] token, a tokenized question, a [SEP] token, a tokenized document, and a final [SEP] token. Here, L refers to the maximum number of tokens in the input sequence. The sequence T is then processed by an encoder to obtain a token vector sequence V. Subsequently, sentence vectors $S = \{s_0^0, s_1^0, \ldots, s_{|d|-1}^{|D|-1}\} \in \mathbb{R}^{|S| \times h}$ are generated from the encoded token vector sequence V using mean pooling. Here, |d| refers to the maximum number of sentences constituting each document, |S| refers to the total number of sentences in the document set D, and h denotes the output vector size of the pre-trained model.
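To make the sentence-encoding step concrete, the following sketch shows one way to mean-pool encoder token vectors into sentence vectors; the tensor shapes and the sentence_spans structure are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of sentence-vector construction by mean pooling.
# Assumptions: V is the encoder output of shape (|D|, L, h); sentence_spans
# lists (document index, start token, end token) for every sentence in D.
import torch

def build_sentence_vectors(V, sentence_spans):
    """Mean-pool token vectors into sentence vectors S of shape (|S|, h)."""
    sent_vecs = [V[d, start:end].mean(dim=0) for d, start, end in sentence_spans]
    return torch.stack(sent_vecs, dim=0)

# Example: 2 documents, L = 128 tokens, hidden size h = 256
V = torch.randn(2, 128, 256)
spans = [(0, 1, 20), (0, 20, 45), (1, 1, 30)]   # three sentences in total
S = build_sentence_vectors(V, spans)            # -> shape (3, 256)
```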
Pointer Network The pointer network selects sentences through attention mechanisms with S and is guided to choose supporting sentences using the proposed training strategy. The initial state of the pointer network is set to None, and the initial input, input_0, is as shown in Equation (2) below:
$input_0 = \frac{1}{|D|} \sum_{d=0}^{|D|-1} s_0^d \qquad (2)$
The output of the pointer network, h_n, is as shown in Equation (3).
$h_n = \mathrm{GRU}(input_n, h_{n-1}) \qquad (3)$
Here, n refers to the number of decoding steps, and input_n and h_{n-1} represent the input to the RNN [43] and the previous RNN output, respectively. The context vector C_n is generated by incorporating important sentence information through an attention operation between the output h_n produced by the RNN and the sentence representations S, as shown in Equation (4) below:
$\mathrm{score}(S, h_n) = h_n W_a S^T$
$\alpha_n = \mathrm{softmax}(\mathrm{score}(S, h_n))$
$C_n = \sum_{k}^{|D|} \alpha_{n,k} S_k \qquad (4)$
In this equation, W_a represents a trainable weight matrix, and score(S, h_n) denotes the importance of each sentence, calculated as the inner product between the sentence representations S and the RNN output h_n at the current decoding step. α_n refers to score(S, h_n) normalized via the softmax function, and C_n represents the weighted sum vector of S with respect to α_n. The context vector C_n is concatenated with the RNN output h_n to form the next input vector input_{n+1}, as described in Equation (5) below:
$input_{n+1} = W_{in}(C_n \oplus h_n) \qquad (5)$
In the equation above, W_{in} represents a trainable weight matrix. As shown in Figure 1, we consider the final input, input_N, accumulated through the pointer network, as the rationale vector. This is used to generate the probability distribution $(\hat{y}_{start}, \hat{y}_{end})$ for the answer positions, and the corresponding equation is given by Equation (6).
$\hat{y}_{start} = V W_s (input_N)^T \in \mathbb{R}^{|D| \times L}$
$\hat{y}_{end} = V W_e (input_N)^T \in \mathbb{R}^{|D| \times L} \qquad (6)$
In the above equation, the probability distributions $\hat{y}_{start}$ and $\hat{y}_{end}$ are obtained by multiplying the token representations V with the trainable weight matrices W_s and W_e, respectively, followed by multiplication with the rationale vector input_N.
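To illustrate how Equations (2)-(6) fit together, the sketch below implements one possible pointer-decoding loop in PyTorch; the module layout, dimension handling, and names (PointerDecoder, first_sent_vecs, n_steps) are assumptions made for this example rather than the authors' implementation.

```python
# Minimal sketch of the pointer network decoding loop (Eqs. (2)-(6)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointerDecoder(nn.Module):
    def __init__(self, h):
        super().__init__()
        self.gru = nn.GRUCell(h, h)                   # h_n = GRU(input_n, h_{n-1})
        self.W_a = nn.Linear(h, h, bias=False)        # attention weights W_a
        self.W_in = nn.Linear(2 * h, h, bias=False)   # maps [C_n ; h_n] to input_{n+1}
        self.W_s = nn.Linear(h, h, bias=False)        # start-position projection W_s
        self.W_e = nn.Linear(h, h, bias=False)        # end-position projection W_e

    def forward(self, S, first_sent_vecs, V, n_steps=3):
        inp = first_sent_vecs.mean(dim=0)             # input_0: Eq. (2)
        h_prev = torch.zeros_like(inp)                # initial pointer state
        step_probs = []
        for _ in range(n_steps):
            h_n = self.gru(inp.unsqueeze(0), h_prev.unsqueeze(0)).squeeze(0)  # Eq. (3)
            scores = self.W_a(S) @ h_n                # score(S, h_n), shape (|S|,)
            alpha = F.softmax(scores, dim=0)          # Eq. (4)
            C_n = (alpha.unsqueeze(1) * S).sum(dim=0) # context vector C_n
            step_probs.append(alpha)                  # P(s | e_n), used by the loss
            inp = self.W_in(torch.cat([C_n, h_n], dim=-1))   # Eq. (5)
            h_prev = h_n
        rationale_vec = inp                           # input_N, the rationale vector
        y_start = V @ self.W_s(rationale_vec)         # Eq. (6), shape (|D|, L)
        y_end = V @ self.W_e(rationale_vec)
        return step_probs, y_start, y_end, rationale_vec

# Example with random tensors: 3 sentences, 2 documents of 128 tokens, h = 256
S, V = torch.randn(3, 256), torch.randn(2, 128, 256)
decoder = PointerDecoder(h=256)
probs, y_s, y_e, r_vec = decoder(S, first_sent_vecs=S[:2], V=V, n_steps=3)
```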
Training In this paper, we train the proposed model using two loss functions: one for answer inference and another for guiding the pointer network in selecting the supporting sentences. First, the model is trained by minimizing the cross-entropy between the predicted probability distributions, $\hat{y}_{start}$ and $\hat{y}_{end}$, and the ground-truth answer positions $y_{start}$ and $y_{end}$, defined as follows:
$L_s = -y_{start} \log(\hat{y}_{start})$
$L_e = -y_{end} \log(\hat{y}_{end})$
$L_{span} = (L_s + L_e)/2 \qquad (7)$
The proposed framework is trained to minimize the negative log-likelihood (NLL) described above, ensuring reliable task performance. During training, the model refines its predicted label distribution to better match the ground-truth distribution by leveraging the information accumulated through the sentence selections of the pointer network. To further guide the pointer network in selecting supporting rationale sentences, we introduce an auxiliary loss term that penalizes inconsistent or irrelevant selections. This additional objective helps the model focus on sentences that contribute directly to accurate answer generation, thereby improving both the interpretability and robustness of the overall reasoning process.
The corresponding equation is presented in Equation (8) below:
$L_D(r) = -\sum_{n=0}^{N-1} \Big[ \alpha_d \sum_{s \in r} \log P(s \mid e_n) + (1 - \alpha_d) \sum_{s \in S \setminus r} \log P(s \mid e_n) \Big]$
$L_G(r) = -\sum_{n=0}^{N-1} \Big[ \alpha_g \sum_{s \in r} \log P(s \mid e_n) + (1 - \alpha_g) \sum_{s \in S \setminus r} \log P(s \mid e_n) \Big] \qquad (8)$
$\alpha_d = \begin{cases} 1 & \text{if } F_1(y_r, y_D) > \mathrm{Threshold} \\ 0 & \text{otherwise} \end{cases} \qquad \alpha_g = \begin{cases} 1 & \text{if } F_1(y_r, y_G) > \mathrm{Threshold} \\ 0 & \text{otherwise} \end{cases} \qquad (9)$
In this equation, P(s | e_n) represents the probability that the pointer network selects sentence s at the n-th decoding step. y_D is the predicted answer based on the entire input documents D, while y_r is the predicted answer from the reasoning path r selected by the pointer network. The comparison between y_r and y_D serves as an auxiliary training signal that enforces consistency between rationale selection and answer prediction. When the model can reproduce an answer close to y_D using only the selected rationale sentences, it indicates that these sentences contain sufficient information to infer the correct answer. In such cases, the probability P(s | e_n) of the selected sentences is reinforced. Conversely, when y_r deviates notably from y_D, the model decreases the corresponding probabilities and shifts attention toward alternative sentences. By reducing the discrepancy between y_r and y_D, the model learns to choose sentences that are both relevant and sufficient for answer generation, effectively aligning rationale selection with the overall reasoning objective. Likewise, L_G(r) compares the rationale-based prediction y_r with the ground-truth answer y_G, ensuring that the learned reasoning paths remain faithful to the annotated answers. The final loss function L_total is defined as shown in Equation (10).
$L_{total} = L_{span} + \lambda \cdot (L_D(r) + L_G(r)) \qquad (10)$
In the above formula, λ represents a weighting coefficient.
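A minimal sketch of this gated rationale loss is shown below; only the gating logic follows Equations (8)-(10), while the token-level F1 helper, the threshold default, and all names are illustrative assumptions.

```python
# Minimal sketch of the gated rationale loss (Eqs. (8)-(10)).
import torch

def token_f1(pred: str, gold: str) -> float:
    """Illustrative token-level F1 between two answer strings."""
    p, g = pred.split(), gold.split()
    if not p or not g:
        return 0.0
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def rationale_loss(step_probs, selected_idx, y_r, y_ref, threshold=0.5):
    """Computes L_D (y_ref = full-document answer) or L_G (y_ref = gold answer).

    step_probs: list of N tensors of shape (|S|,), i.e. P(s | e_n) per step.
    selected_idx: indices of the sentences forming the reasoning path r.
    """
    alpha = 1.0 if token_f1(y_r, y_ref) > threshold else 0.0        # Eq. (9)
    sel_mask = torch.zeros_like(step_probs[0], dtype=torch.bool)
    sel_mask[selected_idx] = True
    loss = torch.tensor(0.0)
    for p in step_probs:                                            # Eq. (8)
        log_p = torch.log(p + 1e-12)
        loss = loss - (alpha * log_p[sel_mask].sum()
                       + (1.0 - alpha) * log_p[~sel_mask].sum())
    return loss

# L_total = L_span + 0.1 * (rationale_loss(..., y_D) + rationale_loss(..., y_G))  # Eq. (10)
```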
Beam Search Decoding Figure 2 illustrates the extended pointer network structure with beam search decoding. Using beam search decoding [44], the model generates K reasoning paths $R = \{r_1, r_2, \ldots, r_K\}$, evaluates each path, selects the optimal reasoning path r_k for answer inference, and uses it during training. Each reasoning path r_k is evaluated by comparing the predicted answer y_D based on the entire document, the predicted answer y_{r_k} based on r_k, and the ground-truth answer y_G, using the sum of F1 scores. This process is illustrated in Figure 3.
We select the best path, $r_{k^{*}}$, which maximizes the F1-based score among the reasoning paths, and use it to train the model using the following equation.
$f(r_k) = F_1(y_{r_k}, y_D) + F_1(y_{r_k}, y_G)$
$k^{*} = \arg\max_{k} f(r_k)$
$L = L_{span} + \lambda \cdot (L_D(r_{k^{*}}) + L_G(r_{k^{*}})) \qquad (11)$
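The beam-path selection of Equation (11) can then be sketched as follows; select_best_path and its arguments are hypothetical names, and token_f1 is the illustrative helper from the previous sketch.

```python
# Minimal sketch of beam-path scoring and selection (Eq. (11)).
def select_best_path(paths, path_answers, y_D, y_G):
    """paths: K candidate reasoning paths from beam search.
    path_answers: answer predicted from each path alone, i.e. y_{r_k}."""
    scores = [token_f1(y_rk, y_D) + token_f1(y_rk, y_G)        # f(r_k)
              for y_rk in path_answers]
    k_star = max(range(len(paths)), key=lambda k: scores[k])   # argmax_k f(r_k)
    return paths[k_star], scores[k_star]
```

The selected path then enters the same objective as in Equation (10), with L_D and L_G computed on the chosen path.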

4. Experiments

4.1. Dataset and Experimental Setup

The primary task addressed in this study is multi-hop question answering. To assess the effectiveness of the proposed approach, we conducted experiments using two multi-hop question answering datasets [12,45]. Additionally, to emphasize that the proposed model is task-agnostic and capable of inferring rationales for its predictions across various tasks, we conducted additional experiments on a document-level relation extraction dataset [46] and a binary sentiment analysis dataset [47]. Through these experiments, we aimed to demonstrate the model’s consistent performance across diverse tasks and confirm that its rationale extraction capability is not limited to a specific task.
The proposed model utilizes the pre-trained ELECTRA-base [41] as its encoder. The weighting coefficient λ in Equations (10) and (11) was set to 0.1. Furthermore, the pointer network was constrained to extract two or three sentences, depending on the dataset used in the experiments. Detailed implementation configurations, including optimizer settings, learning rate, and training procedures, are provided in Appendix F.
HotpotQA This dataset [12] is a large-scale multi-hop question answering dataset designed to integrate information across documents to answer questions. It contains approximately 112,000 questions and corresponding answers covering a wide range of topics. For each question, 10 related documents are provided, containing supporting sentences that the model must reference to infer the correct answer. In our experiments, we used 90,564 samples for training and 7405 samples for development under the distractor setting.
MuSiQue This dataset [45] is designed to train and evaluate question answering tasks requiring multi-hop reasoning over text, similar to HotpotQA. It comprises approximately 25,000 questions with corresponding answers, where each question consists of several sub-questions. This structure enables the evaluation of the model’s ability to integrate information from various sources to derive answers. The proposed model is trained to infer answers to the initially provided questions without directly addressing the sub-questions. The dataset consists of 19,144 training samples and 2417 development samples.
DocRED This dataset [46] is a large-scale resource for document-level relation extraction, focusing on identifying relationships between entities across entire documents rather than within individual sentences. It contains 132,375 entities and 56,354 relationships extracted from 5053 Wikipedia documents, encompassing 96 distinct relation types. Notably, each relation instance is accompanied by supporting sentences for relation extraction. The original task involves identifying, among all possible entity pairs, those that hold a relation. In our experiments, however, relation extraction was performed only on entity pairs with existing relations, and documents with fewer than three sentences were excluded. The final dataset used for training comprises 56,195 samples, while the evaluation dataset includes 17,803 samples.
IMDB This dataset [47] is designed for sentiment analysis, aiming to predict the sentiment (positive/negative) by analyzing movie review texts. It includes a total of 50,000 movie reviews, with each review labeled as either positive or negative. The dataset is split into 25,000 training samples and 25,000 testing samples, with each set containing an equal number of positive and negative reviews.
Detailed statistics of the datasets used in the experiments are presented in Table 1.
As shown in Table 1, we used four publicly available datasets for our experiments, utilizing the train and test sets for quantitative evaluation of both answer prediction and rationale inference. However, since the IMDB dataset does not include rationale annotations, it cannot be directly used for quantitative evaluation of rationale inference. Therefore, we sampled a subset of the test set (Sampled Test set) and evaluated rationale inference using a GPT-based model.

4.2. Baseline

The proposed framework is designed to be adaptable across various tasks. To assess its generalizability, we adjusted the output layer according to each task. In this process, unlike in the multi-hop question answering task, the rationale vector input_N is concatenated with the [CLS] vector to generate the final probability distribution $\hat{y}$. The detailed structure of this process is illustrated in Figure 4 below.
At this stage, for the classification task, the best reasoning path r_k is determined using the KL divergence loss [48] instead of the F1 score. Specifically, the selected sentence subset r_k is the one that yields a prediction most similar to the prediction obtained by using the entire input sequence. Detailed implementation for each task is provided in Appendix A.
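For classification tasks, this path-selection criterion can be sketched with KL divergence as follows; the function and variable names are illustrative assumptions.

```python
# Minimal sketch of KL-based reasoning-path selection for classification tasks:
# keep the beam path whose rationale-only prediction distribution is closest
# to the prediction obtained from the entire input sequence.
import torch
import torch.nn.functional as F

def select_best_path_kl(path_logits, full_input_logits):
    """path_logits: (K, num_labels) logits predicted from each sentence subset r_k.
    full_input_logits: (num_labels,) logits predicted from the full input."""
    full_dist = F.softmax(full_input_logits, dim=-1)
    divergences = []
    for logits in path_logits:
        log_path_dist = F.log_softmax(logits, dim=-1)
        # KL(full || path): how much the subset-based prediction deviates
        divergences.append(F.kl_div(log_path_dist, full_dist, reduction="sum"))
    return int(torch.argmin(torch.stack(divergences)))

# Example: 3 beam paths over 2 sentiment labels
best_k = select_best_path_kl(torch.randn(3, 2), torch.randn(2))
```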

4.3. Metrics

As the main performance metric, we use the F1 score of the model’s predictions. For datasets with annotated rationales—HotpotQA, MuSiQue, and DocRED—we compute not only the F1 score of the predicted answers but also the precision, recall, and F1 score of the extracted rationales, calculated at the sentence level. Furthermore, we define a model as having high rationale inference capability if it provides appropriate supporting rationale even when its predictions are incorrect. Therefore, evaluating models solely based on annotated rationales is insufficient. To address this, we employ GPT-4o mini to assess the quality of the inferred rationales. Specifically, GPT-4o mini is instructed to evaluate how well the rationale sentences justify the model’s predictions, given the task description, the model’s predicted answer, and the inferred rationale sentences. Each sample is scored on a 0–2 scale according to the following criteria:
  • 0 points: The provided rationale does not substantiate or is irrelevant to the model’s prediction at all.
  • 1 point: The provided rationale somewhat supports the model’s prediction but is not decisive; it is partially inferred but unclear.
  • 2 points: The provided rationale fully substantiates the model’s prediction, and the same answer can be inferred solely based on the given rationale.
The GPT-4o mini evaluation was performed on a dataset (Sampled Dataset) consisting of 1000 randomly selected samples from each evaluation dataset. The evaluation scores were normalized to a range of 0 to 1. The prompts used in the experiments are detailed in Appendix B.
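For clarity, the sentence-level rationale metrics and the score normalization described above can be computed as in the following sketch; the function names and the example indices are illustrative.

```python
# Minimal sketch of sentence-level rationale precision/recall/F1 and
# normalization of the 0-2 GPT-4o mini scores to the [0, 1] range.
def rationale_prf(pred_sents, gold_sents):
    pred, gold = set(pred_sents), set(gold_sents)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def normalize_gpt_scores(scores):
    return [s / 2.0 for s in scores]          # 0-2 scale -> 0-1 scale

# Example: model selects sentences {2, 5, 7}; annotated rationale is {2, 7}
print(rationale_prf([2, 5, 7], [2, 7]))       # ≈ (0.667, 1.0, 0.8)
```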

Prompt Details for GPT-Based Evaluation

In this section, we provide a detailed explanation of the prompts used to guide GPT-4o mini in evaluating the model’s rationale inference capabilities. The prompts were structured to present GPT-4o mini with three key pieces of information: the task description, the model’s predicted answer, and the rationale sentences inferred by the model. The goal of the evaluation was to measure how effectively the inferred rationale sentences supported the model’s predictions. Each sample provided GPT-4o mini with the model’s predicted answer and the corresponding rationale, and the model was instructed to score the relevance of the rationale on a scale from 0 to 2, according to predefined criteria. The following elements were included in each prompt:
  • Task Description: A clear explanation of the evaluation task.
  • Predicted Answer: The model’s predicted answer for the given question.
  • Rationale Sentences: The evidence sentences inferred by the model, to be evaluated by GPT-4o mini.
These elements formed the core of each evaluation prompt.

4.4. Comparison Models

To compare the rationale inference performance of the proposed model with existing approaches, we selected several representative baseline models. Models trained with annotated evidence were excluded to ensure a fair evaluation.
HotpotQA and MuSiQue: BM25 [49] and RAG [14] serve as baseline retrieval models that were not trained on the experimental dataset. RAG was modified to search at the sentence level to match the characteristics of the dataset, providing three fixed sentences for performance evaluation. The semi-supervised approach [9] involves constructing silver rationale annotations in advance and learning through an RNN decoder to infer the rationale. HUG [11] is a state-of-the-art unsupervised model for multi-hop question answering. It identifies the optimal rationale by combining sentences from all documents and predicts the answer based on this rationale. Ref. [50] applies a multi-stage retrieval method at both the document and sentence levels, based on BM25. Their approach demonstrated superior performance compared to existing retrieval methods and showed strong performance in answering complex questions such as those in multi-hop question answering. The upper bound represents the performance of models trained on annotated rationales, similar to the structure of the proposed model. It provides an upper bound on the rationale inference performance that the proposed model can potentially achieve.
DocRED and IMDB: GPTpred illustrates the performance of GPT-4o mini [51] in a zero-shot setting. The prompts used in the experiments are detailed in Table A5 and Table A6. For DocRED, GPTpred refers to the performance of GPT-4o mini when instructed to perform both relation extraction and evidence inference simultaneously. For the IMDB dataset, GPTpred is prompted to perform sentiment analysis while simultaneously identifying key sentences that influence the classification.
This evaluation assesses the model’s ability to infer rationales in an unsupervised manner without access to annotated evidence, demonstrating its effectiveness in extracting supporting rationales directly from the text. In this experiment, GPT is given ground-truth labels and supporting sentences from the dataset, independent of GPTpred’s inference results, to assess how effectively the provided rationale supports the correct answers.

4.5. Experimental Results

4.5.1. Quantitative Evaluation

Table 2 presents the results for HotpotQA. The proposed model demonstrates the highest performance in rationale inference compared to unsupervised learning-based models. Specifically, compared to the semi-supervised model, the proposed model achieves superior performance, providing a significant advantage over methods that rely on explicit silver-label learning. Compared to the approach by [50], which searches for five fixed sentences, the proposed model exhibits higher recall performance in rationale sentence inference. Since the proposed model extracts only three sentences, the performance gap is expected to be even larger when extracting the same number of sentences. The proposed model and HUG achieve comparable performance. However, HUG, which explores the optimal reasoning path by considering all combinations of the given documents and sentence pairs, has high time complexity. In contrast, the proposed model, which adds only a short decoding process to the reading model, offers similar performance to HUG while providing a speed advantage. Furthermore, considering that recall is a more important performance metric than precision when providing rationales to users, the proposed model demonstrates approximately 95% of the recall performance of the upper bound, even without annotated evidence.
Table 3 presents the results for MuSiQue. Similar to HotpotQA, the proposed model demonstrates the highest performance compared to unsupervised rationale inference models.
Table 4 presents the experimental results for DocRED. The proposed model achieves an Answer F1 score of 81.8, significantly outperforming GPTpred, which achieves a score of 41.8. This demonstrates the superior performance of the proposed model in relation extraction compared to GPT-based models. Additionally, the rationale F1 score for GPTpred is 21.8, which is considerably lower than the 54.9 achieved by our model. This result indicates that the proposed model excels in rationale sentence extraction relative to GPT models. However, since these experiments are based on rationale sentences attached to the data, they provide limited insight into the comparative rationale inference capabilities of each model. To address this, the next section will further analyze how well the inferred answers and rationale sentences correspond to each other.

4.5.2. GPT Score Evaluation

To evaluate how effectively the rationale sentences inferred by the model support its final predictions, we conducted an experiment using the GPT-4o mini model. This experiment measures the degree to which the inferred rationale sentences convincingly justify the predicted answers. As comparison models, we included GPTgold, which uses the ground-truth answers and rationale sentences provided in each dataset, and GPTpred, which performs the dataset task while simultaneously inferring rationale sentences. Table 5 presents the evaluation results obtained with GPT-4o mini, showing how well the rationale sentences inferred by each model support the predicted answers across the four datasets (HotpotQA, MuSiQue, DocRED, and IMDB).
First, regarding GPTgold, we observe that its score does not reach 100%, even when it has access to the correct answers and rationales. This suggests a misalignment between the evaluation criteria of the datasets and those used by the GPT-4o mini model. Next, GPTpred achieves over 90% of GPTgold’s performance across most datasets. This demonstrates that GPT-based models can achieve relatively high rationale inference without strictly adhering to the rationale sentences provided in the ground truth data. In particular, for the HotpotQA and MuSiQue datasets, the performance gap between GPTpred and GPTgold is minimal, indicating that GPT models can infer evidence sentences that closely approximate the quality of the actual data. However, for the DocRED dataset, there is a significant performance gap between GPTpred and GPTgold (51.0 vs. 65.0). Considering the inherent limitations of GPT in relation extraction tasks, this result reflects the model’s constraints in inferring accurate rationale sentences. Although GPTgold results are unavailable for the IMDB dataset, GPTpred achieves a high score of 95.4. This suggests that GPT models are capable of inferring highly relevant rationales in tasks involving single-document analysis, performing better in simpler structures compared to those requiring complex document linking. Overall, while GPT-based models exhibit strong performance across various datasets, there is a notable decrease in rationale inference accuracy for tasks with lower performance, and a greater divergence from the annotated rationales is observed in these cases. The proposed model also demonstrates generally strong rationale inference performance, though a significant decrease is observed in the MuSiQue experiments. However, compared to the answer inference performance in earlier experiments, the rationale inference performance remains higher. This suggests that the proposed model can infer suitable rationales even when the predicted answer is incorrect. Finally, the overall scores for GPT-4o mini and the proposed model are 79.9 and 79.8, respectively, indicating very similar performance. Nevertheless, despite having only approximately 1/80 of the parameters of GPT-4o mini, the proposed model demonstrates comparable rationale inference capabilities. Additionally, the proposed model can be considered an effective rationale inference framework suitable for XAI, as it demonstrates the model’s reasoning process through the pointer network indirectly guided to select rationale sentences. To further verify the reliability of GPT-based evaluation, we measured the Pearson correlation between GPT-4o mini scores and human-annotated rationale faithfulness ratings on 50 randomly sampled instances from the DocRED dataset. The results show a strong positive correlation (Pearson r = 0.81), indicating that GPT-based scoring is highly consistent with human judgment and can serve as a reliable proxy for qualitative rationale evaluation. The detailed prompts for the experiment are provided in Appendix C.
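The agreement check between GPT-4o mini scores and human ratings can be reproduced with a standard Pearson correlation, as in the sketch below; the listed score values are hypothetical.

```python
# Minimal sketch of the GPT-vs-human agreement check (Pearson correlation).
from scipy.stats import pearsonr

gpt_scores = [1.0, 0.5, 1.0, 0.0, 0.5, 1.0]     # normalized GPT-4o mini scores (hypothetical)
human_scores = [1.0, 0.5, 0.5, 0.0, 0.5, 1.0]   # human faithfulness ratings (hypothetical)
r, p_value = pearsonr(gpt_scores, human_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```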

4.5.3. Ablation Study

To analyze the impact of individual components on the performance of the proposed model, we conducted an ablation study by selectively removing or modifying key elements. This analysis allows us to assess the significance of each component in the learning and inference processes. Table 6 presents the results of the ablation experiments conducted on the proposed model. It is evident that removing the loss function significantly reduces the effectiveness of learning in the rationale sentence extraction process. This suggests that without an appropriate loss function tailored for the pointer network, the model’s ability to accurately infer rationale sentences is compromised. Additionally, the results show that beam search decoding allows for more accurate evidence inference compared to greedy decoding, enabling more refined selections. Overall, the proposed model demonstrates high performance in rationale sentence extraction by effectively utilizing both the loss function and beam search decoding, with particularly strong results in terms of recall. This leads to the conclusion that applying an appropriate learning loss and employing beam search for decoding play crucial roles in rationale sentence extraction using a pointer network.
Given the importance of the loss function in guiding the pointer network, we further investigate how the threshold values used in the loss function affect rationale inference performance. Specifically, we examine the impact of the threshold in Equation (9) on the model’s ability to extract rationale sentences accurately. Figure 5 illustrates the changes in rationale inference performance based on the threshold values used to determine α_d and α_g in the loss function (Equation (8)). According to the graph, the model achieves high performance when the threshold value is between 0.4 and 0.6, with the highest rationale inference performance occurring at 0.5. Moreover, the performance sharply declines as the threshold approaches either 0 or 1. These results suggest that overly strict or lenient criteria for evaluating rationale sentences can hinder the learning of the pointer network. Thus, selecting an appropriate threshold value is crucial for improving the performance of the pointer network.

4.5.4. Further Analysis

To assess whether the model produces consistent predictions when using only the selected rationale sentences instead of the full document, we conducted an additional experiment. This evaluation examines how strongly the selected rationales contribute to the model’s decision-making process. As shown in Table 7, we compare the predictions made using the entire document with those made using only the selected rationale sentences, as described in the introduction. The experiment was performed on the HotpotQA evaluation dataset. In the table, “Proposed Model (All Document ⇔ Only Rationale Sentences Input)” shows the F1 score comparing the predictions inferred by the proposed model using the entire document versus those inferred using only the selected rationale sentences. Additionally, for comparison, we measure the F1 scores between the inferred predictions and the dataset’s ground truth in both cases: “Ground Truth ⇔ All Document Input” and “Ground Truth ⇔ Only Rationale Sentences Input”. Here, “Test set F1” represents the evaluation results on the entire HotpotQA test set, while “Sampled Test set F1” measures the performance only on the subset of the test set where the predictions inferred from the entire document exactly match the ground truth.
The evaluation results in Table 7 demonstrate the consistency of predictions made by the proposed model when using the entire document compared to when using only the selected rationale sentences. The key findings are as follows: In the results for “Proposed Model (All Document Input ⇔ Only Rationale Sentences Input)”, the Test set F1 score is 84.9, showing a high level of consistency between the answers predicted based on the entire document and those predicted using only the rationale sentences. Furthermore, the Sampled Test set F1 score increases to 93.3, indicating that the selected rationale sentences effectively preserve essential information for generating accurate predictions. This score is particularly noteworthy in cases where the predictions based on the entire document exactly match the Ground Truth. As a result, the proposed model demonstrates strong consistency in predictions even when only the selected rationale sentences are provided, achieving results comparable to when the entire document is used. Additionally, the increase in the Sampled Test set F1 score highlights the robustness of rationale-based predictions in scenarios where predictions align with the Ground Truth.

4.5.5. Sensitivity to the Consistency Weight λ

In the HotpotQA dataset, both Answer F1 and Rationale F1 remained relatively stable regardless of changes in λ, as shown in Figure 6. The Answer F1 slightly decreased from 0.6028 to 0.6016, while the Rationale F1 stayed within a narrow range of 0.6720–0.6729. This indicates that the model learned stably in both answer prediction and rationale selection irrespective of λ adjustments, suggesting that λ has a limited effect on performance in this dataset. In contrast, MuSiQue exhibited a clearer sensitivity to λ. As λ increased, the Answer F1 consistently dropped from 0.2501 to 0.2311, whereas the Rationale F1 remained relatively constant (0.354–0.358). This suggests that emphasizing the rationale-selection loss caused the model to rely excessively on partial evidence, hindering its ability to perform coherent reasoning across the entire context. Consequently, the impact of λ on answer prediction was more pronounced in MuSiQue. Compared to HotpotQA, MuSiQue involves more complex multi-hop reasoning and inter-document dependencies, making it inherently more sensitive to changes in λ. In such high-difficulty reasoning settings, an excessive focus on evidence extraction can disrupt global reasoning consistency and degrade final answer prediction. Overall, the results show that as λ increases, rationale performance tends to remain stable or slightly improve, whereas answer performance degrades, particularly in the more challenging MuSiQue dataset. While the influence of λ is marginal in HotpotQA, the observed trend in MuSiQue suggests that setting λ to 0.1 offers the best overall balance across tasks. A λ value of 0.1 effectively maintains the trade-off between rationale selection and answer generation, ensuring stable and consistent performance across QA tasks of varying difficulty.

4.5.6. Effect of Beam Width on Performance

In HotpotQA, we compared model performance across different beam sizes in Figure 7. When the beam size was set to 1 (i.e., greedy decoding), the model achieved an F1 score of 0.611, indicating the limitation of locally optimal token selection without broader search. Increasing the beam size to 3 improved the F1 to 0.672, as a slightly wider search allowed the model to explore more diverse candidate sequences and mitigate the myopic behavior of greedy decoding. However, further increasing the beam size to 5 or 10 yielded nearly identical results (F1 = 0.669–0.671), suggesting that additional search breadth did not translate into meaningful performance gains. Since the computational cost grows approximately linearly with the beam width, we set the beam size to 3 as a practical choice that balances performance and efficiency, providing sufficient diversity while avoiding unnecessary computational overhead.

5. Conclusions

We propose a framework for rationale inference in explainable multi-hop question answering. Our model employs a pointer network to accumulate key information and infer answers, allowing effective evidence selection even without annotated data. In addition, the reasoning process is interpretable through the sentence selection mechanism of the pointer network. Compared to previous multi-hop QA models, our approach achieves competitive performance in unsupervised settings across several evaluation metrics and can still provide supporting evidence even when predictions are incorrect. Remarkably, although it contains only about 1/80 of the parameters of the recently released GPT-4o mini, our model demonstrates comparable rationale inference ability.

Author Contributions

Conceptualization, K.H., Y.J. and H.K.; methodology, K.H. and Y.J.; software, K.H. and Y.J.; validation, K.H., Y.J. and H.K.; formal analysis, K.H. and Y.J.; investigation, K.H. and Y.J.; resources, H.K.; data curation, K.H. and Y.J.; writing—original draft preparation, K.H. and Y.J.; writing—review and editing, K.H., Y.J. and H.K.; visualization, K.H. and Y.J.; supervision, H.K.; project administration, H.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-II220369, (Part 4) Development of AI Technology to support Expert Decision-making that can Explain the Reasons/Grounds for Judgment Results based on Expert Knowledge).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in this study are openly available. The HotpotQA dataset can be accessed at https://hotpotqa.github.io (accessed on 17 September 2025). The MuSiQue dataset is available at https://github.com/stonybrooknlp/musique (accessed on 17 September 2025). The DocRED dataset can be accessed at https://github.com/thunlp/DocRED (accessed on 17 September 2025). The IMDB dataset is available via the Large Movie Review Dataset at https://ai.stanford.edu/~amaas/data/sentiment (accessed on 17 September 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Implementation Details

In this appendix, we provide detailed implementation specifications for each task. These details are intended to facilitate reproducibility and further understanding of our approach. The proposed model employs a pre-trained ELECTRA-base [41] to process input sequences across different tasks. The dataset structure and mini-batch composition vary by each task, as follows:
  • Multi-Hop Question Answering:
    The input sequence T follows the format: ‘[CLS] Question [SEP] Document [SEP]’. The encoded token representations V are then processed to extract key evidence for reasoning.
    Since each question is associated with multiple documents, each mini-batch includes multiple document-question pairs for the same question.
  • Document-Level Relation Extraction:
    The input T consists of one or more sentences containing entity mentions, and the relationships between entities are inferred. ELECTRA encodes these token embeddings V, capturing contextual information to analyze entity relationships.
    In this task, each data sample is processed as an individual document, and each mini-batch includes different documents.
  • Binary Sentiment Classification:
    The input T consists of a review text or a short comment, where the final token representations V are used for sentiment classification.
    Similar to relation extraction, each sample corresponds to a single document, and the mini-batch contains different texts.
Figure A1 illustrates the process of generating sentence vectors using mean-pooling.
Figure A1. An overview of model input structures and sentence representation generation via mean-pooling for different tasks.
Next, we describe the output layer, which computes the final probability distribution for each task using the rationale vector generated by the Pointer Network. We implement and evaluate the proposed framework across three tasks: (1) multi-hop question answering, (2) relation extraction, and (3) sentiment analysis. The output layer architecture for each task is shown in Figure A2 below.
Figure A2. Task-specific output layer architecture utilizing the rationale vector to compute the final probability distribution.
  • For detailed implementation equations of multi-hop question answering, refer to Equation (6).
  • Relation extraction involves classifying the relationship between a subject and an object within a given sentence or document. To achieve this, the input sequence follows the structure shown in Figure A2. A probability distribution over relation labels C_re is then computed based on t_[CLS], which encodes the contextual information of the input sequence. This process is described in Equation (A1) below.
    $\hat{y}_{re} = W_{re}(t_{[CLS]} \oplus input_N)^T \in \mathbb{R}^{|C_{re}|} \qquad (A1)$
  • Sentiment analysis, as a form of sentence classification, identifies the sentiment within the input text and classifies it as positive, negative, or neutral. The input and output structure for this task is shown in Figure A2. Similar to relation extraction, a probability distribution over sentiment labels C_sa is computed according to Equation (A2).
    $\hat{y}_{sa} = W_{sa}(t_{[CLS]} \oplus input_N)^T \in \mathbb{R}^{|C_{sa}|} \qquad (A2)$
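A minimal sketch of these task-specific heads is given below; the class name, dimensions, and label counts are illustrative assumptions (the 96 relation types follow the DocRED description in Section 4.1).

```python
# Minimal sketch of the output heads in Eqs. (A1)-(A2): concatenate the [CLS]
# vector with the rationale vector input_N and project onto the label space.
import torch
import torch.nn as nn

class RationaleClassificationHead(nn.Module):
    def __init__(self, h, num_labels):
        super().__init__()
        self.proj = nn.Linear(2 * h, num_labels, bias=False)   # plays the role of W_re / W_sa

    def forward(self, t_cls, rationale_vec):
        return self.proj(torch.cat([t_cls, rationale_vec], dim=-1))  # logits over C_re / C_sa

# Example: relation extraction over 96 DocRED relation types with h = 256
head = RationaleClassificationHead(h=256, num_labels=96)
logits = head(torch.randn(256), torch.randn(256))   # -> shape (96,)
```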

Appendix B. GPT-Based Evaluation Instructions and Inputs for All Tasks

This appendix outlines the detailed instructions and input formats for all tasks used in the GPT-based evaluation. Each task includes a description, the specific instruction given to the model, and an example input. The following Table A1, Table A2 and Table A3 illustrate example instances used to evaluate the multi-hop question answering, relation extraction and sentiment analysis tasks.
Table A1. Instructions for GPT-Based Evaluation in Multi-hop QA.
Instruction
Determine how validly the provided <Sentences> support the given <Answer> to the <Question>.
Your task is to assess the supporting sentences based on their relevance and strength in relation to the answer,
regardless of whether the answer is correct.
The focus is on evaluating the validity of the evidence itself.
Even if the answer is incorrect, a supporting sentence can still be rated highly if it is relevant and strong.
 
<Score criteria>
- 0 : The sentences do not support the answer. They are irrelevant, neutral, or contradict the answer.
- 1 : The sentences provide partial or unclear support for the answer.
    The connection is weak, lacking context, or not directly related to the answer.
- 2 : The sentences strongly support the answer, making it clear and directly inferable from them.
 
<Output format>
<Score>: <0, 1, or 2>
Input
<Question>
When did the park at which Tivolis Koncertsal is located open?
</Question>
 
<Answer>
15 August 1843
</Answer>
 
<Sentences>
Tivolis Koncertsal is a 1660-capacity concert hall located at Tivoli Gardens in Copenhagen, Denmark.
The building, which was designed by Frits Schlegel and Hans Hansen, was built between 1954 and 1956.
The park opened on 15 August 1843 and is the second-oldest operating amusement park in the world, after …
</Sentences>
 
<Score>:
Table A2. Instructions for GPT-based evaluation in Relation Extraction.
Instruction
Determine whether the relationship between the given <Subject> and <Object> can be inferred solely from the provided <Sentences>.
The <Relationship> may not be explicitly stated, and it might even be incorrect.
However, your task is to evaluate whether the sentences themselves suggest the given relationship, regardless of its accuracy.
 

<Score criteria>
- 0: The sentences do not suggest the relationship at all. The sentences are neutral, irrelevant, or contradict the relationship.
- 1: The sentences somewhat suggest the relationship but are not conclusive.
    The relationship is partially inferred but not clearly established.
- 2: The sentences fully suggest the relationship. The relationship can be clearly and directly inferred from the sentences alone.
 
<Output format>
Score: <0, 1, or 2>
Input
<Sentences>
The discovery of the signal in the chloroplast genome was announced in 2008 by researchers from the University of Washington.
Somehow, chloroplasts from V. orcuttiana, swamp verbena ( V. hastata) or a close relative of these had admixed into the G. bipinnatifida genome.
</Sentences>
 
<Subject>
mock vervains
</Subject>
<Object>
Verbenaceae
</Object>
<Relationship>
parent taxon: closest of the taxon in question
</Relationship>
 
<Score>:
Table A3. Instructions for GPT-based evaluation in the Sentiment Analysis.
Instruction
Determine whether the given <Sentiment> can be derived solely from the <Supporting Sentences> for the given <Review>.
The given <Sentiment> may not be the correct answer, but evaluate whether the <Supporting Sentences> alone can support it.
 
<Score criteria>
- 0: The supporting sentences do not support the sentiment at all. The facts are neutral, irrelevant to the sentiment, or contradict the sentiment.
- 1: The supporting sentences somewhat support the sentiment but are not conclusive.
    The sentiment is partially inferred but not clearly. The facts suggest the sentiment but do not decisively establish it.
- 2: The supporting sentences fully support the sentiment. The sentiment can be clearly and directly inferred from the facts alone.
 
<Output format>
Score: <0, 1, or 2>
Input
<Sentiment>
positive
</Sentiment>
 
<Supporting Sentences>
A trite fish-out-of-water story about two friends from the midwest who move to the big city to seek their fortune.
They become Playboy bunnies, and nothing particularly surprising happens after that.
</Supporting Sentences>
<Score>:

Appendix C. GPT-Based Answer and Supporting Sentence Extraction for All Tasks

In this section, we provide the prompts for GPTpred, which simultaneously infers both the answer and the corresponding rationale sentences. The prompts for different tasks are shown in Table A4, Table A5 and Table A6.
Table A4. Instructions for GPT-based Answer and Supporting Sentence Extraction in Multi-hop QA.
Instruction
Answer the given <Question> using only the provided <Reference documents>. Some documents may be irrelevant.
Keep the answer concise, extracting only key terms or phrases from the <Reference documents> rather than full sentences.
Extract exactly 3 supporting sentences—no more, no less.
For each supporting sentence, provide its sentence number as it appears in the reference documents.
 
<Output format>
<Answer>: <Generated Answer>
<Supporting Sentences>: <Sentence Number 1>, <Sentence Number 2>, <Sentence Number 3>
Input
<Question>
When did the park at which Tivolis Koncertsal is located open?
</Question>
 
<Reference documents>
Document 1: Tivolis Koncertsal
[1] Tivolis Koncertsal is a 1660-capacity concert hall located at Tivoli Gardens in Copenhagen, Denmark.
[2] The building, which was designed by Frits Schlegel and Hans Hansen, was built between 1954 and 1956.
 
Document 2: Tivoli Gardens
[3] Tivoli Gardens (or simply Tivoli) is a famous amusement park and pleasure garden in Copenhagen, Denmark.
[4] The park opened on 15 August 1843 and is the second-oldest operating amusement park in the world, after …
 
Document 3: Takino Suzuran Hillside National Government Park
[5] Takino Suzuran Hillside National Government Park is a Japanese national government park located in Sapporo, Hokkaido.
[6] It is the only national government park in the northern island of Hokkaido.
[7] The park area spreads over 395.7 hectares of hilly country and ranges in altitude between 160 and 320 m above sea level.
[8] Currently, 192.3 is accessible to the public.

</Reference documents>
 
<Answer>:
<Supporting Sentences>:
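Since GPTpred returns free text, its predictions must be parsed back into an answer string and a list of sentence indices before Answer and Rationale F1 can be computed. The sketch below is one possible (assumed) parser for the output format in Table A4; the example response is hypothetical and shown only to illustrate the format.

```python
import re

def parse_gpt_prediction(output: str) -> tuple[str, list[int]]:
    """Parse '<Answer>: ...' and '<Supporting Sentences>: 1, 3, 4' from a model response."""
    answer_match = re.search(r"<Answer>:\s*(.+)", output)
    support_match = re.search(r"<Supporting Sentences>:\s*(.+)", output)
    answer = answer_match.group(1).strip() if answer_match else ""
    sentence_ids = [int(n) for n in re.findall(r"\d+", support_match.group(1))] if support_match else []
    return answer, sentence_ids

# Hypothetical response following the Table A4 output format.
response = "<Answer>: 15 August 1843\n<Supporting Sentences>: 1, 3, 4"
print(parse_gpt_prediction(response))  # ('15 August 1843', [1, 3, 4])
```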
Table A5. Instructions for GPT-based Answer and Supporting Sentence Extraction in Relation Extraction.
Instruction
Determine the relationship between the given <Subject> and <Object>.
The relationship must be selected from the following list:
‘head of government’, ‘country’, ‘place of birth’, ‘place of death’, ‘father’, ‘mother’, ‘spouse’, …
After selecting the appropriate relationship, provide two key sentence numbers that best support this relationship.
 
<Output format>
<Relationship>: <Extracted Relationship>
<Supporting Sentences>: <Sentence Number 1>, <Sentence Number 2>
Input
<Document>
[1] Since the new chloroplast genes replaced the old ones, it may be that the possibly …
[2] Glandularia, common name mock vervain or mock verbena, is a genus of annual and perennial herbaceous flowering …
[3] They are native to the Americas.
[4] Glandularia species are closely related to the true vervains and sometimes still …

</Document>
 
<Subject>
mock vervains
</Subject>
 
<Object>
Verbenaceae
</Object>
 
<Relationship>:
<Supporting Sentences>:
Table A6. Instructions for GPT-based Answer and Supporting Sentence Extraction in Sentiment Analysis.
Instruction
Classify the sentiment of the given <Sentence> as either ‘positive’ or ‘negative’.
After selecting the appropriate sentiment, extract **only two** key sentences that best support this sentiment.
 
<Output format>
<Sentiment>: <Extracted Sentiment>
<Supporting Sentences>: <Sentence Number 1>, <Sentence Number 2>
Input
<Document>
[1] This movie was awful.
[2] The ending was absolutely horrible.
[3] There was no plot to the movie whatsoever.
[4] The only thing that was decent about the movie was the acting done by Robert …

</Document>
 
<Sentiment>:
<Supporting Sentences>:
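All three prompts assume that the sentences of the input documents are prefixed with bracketed indices such as "[1]" so that the model can cite them by number. A minimal helper producing this numbering might look as follows; the function name and the assumption that sentences are already split are illustrative, not part of our prompting pipeline.

```python
def number_sentences(sentences: list[str]) -> str:
    """Prefix each sentence with its 1-based index, matching the '[n]' format in the prompts."""
    return "\n".join(f"[{i}] {s}" for i, s in enumerate(sentences, start=1))

doc = [
    "This movie was awful.",
    "The ending was absolutely horrible.",
    "There was no plot to the movie whatsoever.",
]
print(number_sentences(doc))
# [1] This movie was awful.
# [2] The ending was absolutely horrible.
# [3] There was no plot to the movie whatsoever.
```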

Appendix D. Qualitative Example of Rationale Selection

Table A7 shows one example instance from the HotpotQA dataset. The red-highlighted sentences indicate the rationale sentences selected by the pointer network during multi-hop reasoning.
Table A7. Example question and rationale selection process from the HotpotQA dataset. Red sentences denote the rationale sentences selected by the pointer network.
Question: What is the name of the fight song of the university whose main campus is in Lawrence, Kansas and whose branch campuses are in the Kansas City metropolitan area?
Answer: Kansas Song
Context:
Document: North Kansas City, Missouri
North Kansas City is a city in Clay County, Missouri, United States that despite … of the Kansas City metropolitan area.
The population was 4208 at the 2010 census.
It was named after the iconic Kansas City Scout Statue that exists in Penn Valley Park, overlooking Downtown Kansas City.
Document: Kansas Song
Kansas Song (We’re From Kansas) is a fight song of the University of Kansas.
The University of Kansas, often referred to as KU or Kansas, is a public research university in the U.S. state of Kansas.
The main campus in Lawrence, one of the largest college towns in Kansas, is on Mount Oread, the highest elevation in Lawrence.
Two branch campuses are in the Kansas City metropolitan area: the Edwards Campus … and hospital in Kansas City.
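To make the selection process in Table A7 concrete in code, the following is a heavily simplified, self-contained sketch of pointer-style rationale selection: sentence representations are scored against a decoder state initialized from the question, the best-scoring sentence is chosen, and the state is updated before the next hop. It uses greedy selection, random vectors, and toy dimensions, whereas the actual model uses an ELECTRA encoder and beam search, so it should be read as an illustration rather than our implementation.

```python
import torch
import torch.nn as nn

class PointerSelector(nn.Module):
    """Toy pointer-style decoder that selects rationale sentences one at a time."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.decoder = nn.GRUCell(hidden, hidden)          # sequential decoding state
        self.attn = nn.Linear(hidden, hidden, bias=False)  # bilinear pointer scoring

    def forward(self, sent_vecs: torch.Tensor, question_vec: torch.Tensor, hops: int = 3):
        # sent_vecs: (num_sentences, hidden); question_vec: (hidden,)
        state = question_vec.unsqueeze(0)                  # (1, hidden) decoder state
        mask = torch.zeros(sent_vecs.size(0), dtype=torch.bool)
        selected = []
        for _ in range(hops):
            scores = sent_vecs @ self.attn(state).squeeze(0)      # pointer scores over sentences
            scores = scores.masked_fill(mask, float("-inf"))      # do not reselect a sentence
            idx = int(scores.argmax())                            # greedy choice (beam search in the paper)
            selected.append(idx)
            mask[idx] = True
            state = self.decoder(sent_vecs[idx].unsqueeze(0), state)  # condition the next hop
        return selected

# Hypothetical usage with random sentence encodings.
selector = PointerSelector()
sentences = torch.randn(8, 256)   # e.g., 8 candidate sentences from the context
question = torch.randn(256)
print(selector(sentences, question))  # indices of 3 selected rationale sentences
```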

Appendix E. Runtime and Efficiency Analysis

To support the claim of computational efficiency, we report runtime and model size comparisons between the proposed pointer-network–based rationale inference model and representative baselines, including HUG [11] and RAG [14]. All models were trained and evaluated under the same hardware configuration (NVIDIA A100 GPU, batch size 16). As shown in Table A8, our model achieves competitive rationale selection performance while significantly reducing both training and inference costs. The efficiency gain primarily arises from the lightweight ELECTRA [41] encoder and the sequential pointer mechanism, which helps reduce the reliance on large retrieval–generation modules.
Table A8. Runtime and efficiency comparison. The proposed model requires fewer parameters and achieves 2.4× faster training and 3.1× faster inference speed compared to retrieval-based baselines (HUG, RAG) under identical hardware conditions.
Model | Parameters | Training Time/Epoch | Inference Speed (Examples/s) | Relative Efficiency
HUG [11] | 340 M | 5.8 h | 65 | 1.0×
RAG [14] | 400 M | 6.3 h | 48 | 0.7×
Proposed (PointerNet-XAI) | 110 M | 2.4 h | 150 | 2.4×
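The inference speeds in Table A8 are reported as processed examples per second. A simple (assumed) way to measure such throughput is sketched below; the warm-up count, the stand-in predictor, and the timing loop are illustrative, and on GPU one would additionally synchronize the device before reading the timer.

```python
import time

def measure_throughput(predict, examples, warmup: int = 10) -> float:
    """Return processed examples per second for a predict() callable."""
    for ex in examples[:warmup]:          # warm-up runs excluded from timing
        predict(ex)
    start = time.perf_counter()
    for ex in examples[warmup:]:
        predict(ex)
    elapsed = time.perf_counter() - start
    return (len(examples) - warmup) / elapsed

# Hypothetical usage with a stand-in predictor.
dummy_predict = lambda ex: ex.upper()
print(f"{measure_throughput(dummy_predict, ['example text'] * 1010):.1f} examples/s")
```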

Appendix F. Implementation Details

All experiments were conducted on NVIDIA A100 GPUs using mixed-precision (FP16) training. The model typically converged within 6–8 epochs. We applied early stopping to avoid overfitting and to maintain stable alignment between rationales and answers.
Table A9. Summary of key hyperparameters and training configurations used in all experiments. Early stopping based on validation loss was applied to ensure stable convergence and reproducibility.
Component | Description
Optimizer | AdamW
Learning Rate | 2 × 10⁻⁵ (initial)
Learning Rate Scheduler | Linear decay with 10% warmup steps
Batch Size | 16 (per GPU)
Epochs | 10
Early Stopping | Based on validation loss (patience = 3)
Gradient Clipping | 1.0
Weight Decay | 0.01
Dropout | 0.1
Warmup Ratio | 0.1
Random Seeds | 42 (for all runs)
Validation Criterion | Lowest validation loss across epochs
Implementation Framework | PyTorch 2.2.2 + HuggingFace Transformers
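As a concrete but non-authoritative sketch of how the settings in Table A9 can be wired together with PyTorch and HuggingFace Transformers, the snippet below builds an AdamW optimizer with linear warmup decay and performs one mixed-precision training step with gradient clipping at 1.0. The model, batch, and scaler objects are placeholders, and the code is a minimal illustration rather than our training script.

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def build_optimization(model, num_training_steps: int):
    """AdamW with lr 2e-5, weight decay 0.01, and 10% linear warmup (Table A9)."""
    optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * num_training_steps),
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler

def training_step(model, batch, optimizer, scheduler, scaler):
    """One FP16 step: autocast forward, scaled backward, clip at 1.0, then update."""
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(**batch).loss          # assumes a HF-style model that returns .loss
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)              # unscale gradients before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    return loss.item()

# Typical setup (placeholder): scaler = torch.cuda.amp.GradScaler()
```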

References

  1. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  2. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  3. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
  4. Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
  5. Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtually, 3–10 March 2021; pp. 610–623. [Google Scholar]
  6. Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the opportunities and risks of foundation models. arXiv 2021, arXiv:2108.07258. [Google Scholar] [CrossRef]
  7. Rudin, C.; Chen, C.; Chen, Z.; Huang, H.; Semenova, L.; Zhong, C. Interpretable machine learning: Fundamental principles and 10 grand challenges. Stat. Surv. 2022, 16, 1–85. [Google Scholar] [CrossRef]
  8. Shwartz, V.; Choi, Y. Do neural language models overcome reporting bias? In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 6863–6870. [Google Scholar]
  9. Chen, J.; Lin, S.t.; Durrett, G. Multi-hop question answering via reasoning chains. arXiv 2019, arXiv:1910.02610. [Google Scholar]
  10. Wu, H.; Chen, W.; Xu, S.; Xu, B. Counterfactual supporting facts extraction for explainable medical record based diagnosis with graph network. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 1942–1955. [Google Scholar]
  11. Zhao, W.; Chiu, J.; Cardie, C.; Rush, A.M. Hop, Union, Generate: Explainable Multi-hop Reasoning without Rationale Supervision. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 16119–16130. [Google Scholar]
  12. Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W.; Salakhutdinov, R.; Manning, C.D. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2369–2380. [Google Scholar]
  13. Qi, P.; Lin, X.; Mehr, L.; Wang, Z.; Manning, C.D. Answering Complex Open-domain Questions Through Iterative Query Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 2590–2602. [Google Scholar]
  14. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
  15. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should i trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
  16. Gunning, D.W.; Aha, D. DARPA’s explainable artificial intelligence program. AI Mag 2019, 40, 44. [Google Scholar] [CrossRef]
  17. Jiang, Z.; Xu, F.F.; Araki, J.; Neubig, G. How can we know what language models know? Trans. Assoc. Comput. Linguist. 2020, 8, 423–438. [Google Scholar] [CrossRef]
  18. Arras, L.; Montavon, G.; Müller, K.R.; Samek, W. Explaining Recurrent Neural Network Predictions in Sentiment Analysis. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis WASSA 2017: Proceedings of the Workshop, Copenhagen, Denmark, 29 August 2017; pp. 159–168. [Google Scholar]
  19. Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774. [Google Scholar]
  20. Alvarez-Melis, D.; Jaakkola, T.S. On the Robustness of Interpretability Methods. arXiv 2018, arXiv:1806.08049. [Google Scholar] [CrossRef]
  21. Gilpin, L.H.; Bau, D.; Yuan, B.Z.; Bajwa, A.; Specter, M.; Kagal, L. Explaining explanations: An overview of interpretability of machine learning. In Proceedings of the 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 1–3 October 2018; pp. 80–89. [Google Scholar]
  22. Jiang, Z.; Zhang, Y.; Yang, Z.; Zhao, J.; Liu, K. Alignment rationale for natural language inference. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Bangkok, Thailand, 1–6 August 2021; pp. 5372–5387. [Google Scholar]
  23. Jain, S.; Wallace, B.C. Attention is not explanation. In Proceedings of the NAACL-HLT 2019, Minneapolis, MN, USA, 3–5 June 2019; pp. 3543–3556. [Google Scholar]
  24. Serrano, S.; Smith, N.A. Is attention interpretable? In Proceedings of the ACL 2019, Florence, Italy, 28 July–2 August 2019; pp. 2931–2951. [Google Scholar]
  25. Wiegreffe, S.; Pinter, Y. Attention is not not explanation. In Proceedings of the EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019; pp. 11–20. [Google Scholar]
  26. Jacovi, A.; Goldberg, Y. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Proceedings of the ACL 2020, Online, 5–10 July 2020; pp. 4198–4205. [Google Scholar]
  27. Lei, T.; Barzilay, R.; Jaakkola, T. Rationalizing neural predictions. In Proceedings of the EMNLP 2016, Austin, TX, USA, 22 October 2016; pp. 107–117. [Google Scholar]
  28. DeYoung, J.; Jain, S.; Rajani, N.F.; Lehman, E.; Xiong, C.; Socher, R.; Wallace, B.C. ERASER: A benchmark to evaluate rationalized NLP models. In Proceedings of the ACL 2020, Online, 5–10 July 2020; pp. 4443–4458. [Google Scholar]
  29. Welbl, J.; Stenetorp, P.; Riedel, S. Constructing datasets for multi-hop reading comprehension across documents. Trans. Assoc. Comput. Linguist. 2018, 6, 287–302. [Google Scholar] [CrossRef]
  30. Yu, X.; Min, S.; Zettlemoyer, L.; Hajishirzi, H. CREPE: Open-Domain Question Answering with False Presuppositions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 10457–10480. [Google Scholar]
  31. Zhao, H.; Chen, H.; Yang, F.; Liu, N.; Deng, H.; Cai, H.; Du, M. Explainability for Large Language Models: A Survey. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–38. [Google Scholar] [CrossRef]
  32. Bilal, A.; Ebert, D.; Lin, B. LLMs for Explainable AI: A Comprehensive Survey. arXiv 2025, arXiv:2504.00125. [Google Scholar] [CrossRef]
  33. Huang, S.; Zhou, Y.; Liu, X.; Zhang, J.; Wang, W.; Huang, M. Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations. arXiv 2023, arXiv:2310.11207. [Google Scholar] [CrossRef]
  34. Turpin, M.; Du, Y.; Michael, J.; Wu, Z.; Cotton, C.; Raghu, M.; Uesato, J. Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. Adv. Neural Inf. Process. Syst. 2023, 36, 74952–74965. [Google Scholar]
  35. Atanasova, P.; Simonsen, J.G.; Lioma, C.; Augenstein, I. Diagnostics-guided explanation generation. In Proceedings of the AAAI Conference on Artificial Intelligence 2022, Online, 22 February–1 March 2022; Volume 36, pp. 10445–10453. [Google Scholar]
  36. Glockner, M.; Habernal, I.; Gurevych, I. Why do you think that? exploring faithful sentence-level rationales without supervision. arXiv 2020, arXiv:2010.03384. [Google Scholar] [CrossRef]
  37. Min, S.; Zhong, V.; Zettlemoyer, L.; Hajishirzi, H. Multi-hop Reading Comprehension through Question Decomposition and Rescoring. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6097–6109. [Google Scholar]
  38. Mao, J.; Jiang, W.; Wang, X.; Liu, H.; Xia, Y.; Lyu, Y.; She, Q. Explainable question answering based on semantic graph by global differentiable learning and dynamic adaptive reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 5318–5325. [Google Scholar]
  39. Yin, Z.; Wang, Y.; Hu, X.; Wu, Y.; Yan, H.; Zhang, X.; Cao, Z.; Huang, X.; Qiu, X. Rethinking label smoothing on multi-hop question answering. In Proceedings of the China National Conference on Chinese Computational Linguistics, Harbin, China, 3–5 August 2023; pp. 72–87. [Google Scholar]
  40. Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck. In Proceedings of the 2015 IEEE Information Theory Workshop (ITW) 2015, Jeju Island, Republic of Korea, 11–15 October 2015; pp. 1–5. [Google Scholar]
  41. Clark, K. Electra: Pre-training text encoders as discriminators rather than generators. arXiv 2020, arXiv:2003.10555. [Google Scholar]
  42. Vinyals, O.; Fortunato, M.; Jaitly, N. Pointer networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar]
  43. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar] [CrossRef]
  44. Brown, P.F.; Della Pietra, S.A.; Della Pietra, V.J.; Mercer, R.L. The mathematics of statistical machine translation: Parameter estimation. Comput. Linguist. 1993, 19, 263–311. [Google Scholar]
  45. Trivedi, H.; Balasubramanian, N.; Khot, T.; Sabharwal, A. MuSiQue: Multihop Questions via Single-hop Question Composition. Trans. Assoc. Comput. Linguist. 2022, 10, 539–554. [Google Scholar] [CrossRef]
  46. Yao, Y.; Ye, D.; Li, P.; Han, X.; Lin, Y.; Liu, Z.; Liu, Z.; Huang, L.; Zhou, J.; Sun, M. DocRED: A Large-Scale Document-Level Relation Extraction Dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 764–777. [Google Scholar]
  47. Maas, A.L.; Daly, R.E.; Pham, P.T.; Huang, D.; Ng, A.Y.; Potts, C. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–27 June 2011; pp. 142–150. [Google Scholar]
  48. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  49. Robertson, S.E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M.M.; Gatford, M. Okapi at TREC-3; British Library Research and Development Department: London, UK, 1995; pp. 109–126. [Google Scholar]
  50. You, H. Multi-grained unsupervised evidence retrieval for question answering. Neural Comput. Appl. 2023, 35, 21247–21257. [Google Scholar] [CrossRef]
  51. OpenAI. Gpt-4o Mini: Advancing Cost-Efficient Intelligence. 2024. Available online: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ (accessed on 17 September 2025).
Figure 1. An overview of our proposed framework. The model architecture consists of two components: a rationale identification module based on a pointer network and a reader module based on a pre-trained language model.
Figure 2. Expansion of reasoning paths R via beam search decoding.
Figure 3. Selection of the best reasoning path r_k based on the scoring function f.
Figure 4. Task-specific prediction.
Figure 5. Rationale inference performance on the HotpotQA dataset with varying threshold values in the loss function.
Figure 6. Effect of the consistency weight λ on Answer and Rationale F1 across datasets.
Figure 7. Performance comparison across different beam sizes (HotpotQA).
Table 1. Statistics of the experimental datasets.
Dataset | Train Set | Test Set | Sampled Test Set
HotpotQA | 90,564 | 7405 | 1000
MuSiQue | 25,494 | 3911 | 1000
DocRED | 56,195 | 17,803 | 1000
IMDB | 25,000 | - | 1000
Table 2. Performance comparison on HotpotQA. For the proposed model, mean and standard deviation (±) are reported over three runs.
Models | Answer F1 | Rationale Precision | Rationale Recall | Rationale F1
BM25 | - | - | - | 40.5
RAG-small | 62.8 | - | - | 49.0
Semi-supervised | 66.0 | - | - | 64.5
You (2023) | 50.9 | - | 74.2 | -
HUG | 66.8 | - | - | 67.1
Proposed Model | 60.2 ± 0.2 | 60.1 | 76.5 | 67.2 ± 0.3
Upperbound | 61.1 | 82.8 | 80.5 | 80.7
Table 3. Performance comparison on MuSiQue. Standard deviation ± is reported only for the F1 scores of our proposed model over three runs.
Models | Answer F1 | Rationale F1
BM25 | - | 12.9
RAG-small | 24.2 | 32.0
HUG | 25.1 | 34.2
Proposed Model | 25.0 ± 0.4 | 35.4 ± 0.5
Table 4. Performance comparison on DocRED. Standard deviation ± is reported only for the F1 scores of our proposed model over three runs.
Models | Answer F1 | Rationale F1
GPTpred | 41.8 | 21.8
Proposed Model | 81.8 ± 0.1 | 54.9 ± 0.2
Table 5. Performance comparison of rationale inference using GPT scores.
Models | HotpotQA | MuSiQue | DocRED | IMDB | Overall
GPTpred | 92.4 | 81.0 | 51.0 | 95.4 | 79.9
GPTgold | 95.5 | 82.6 | 65.0 | - | -
Proposed Model | 87.5 | 55.9 | 87.2 | 88.7 | 79.8
Table 6. Ablation study of the proposed model on experimental datasets.
Models | Answer F1 | Evidence Precision | Evidence Recall | Evidence F1
Proposed Model | 60.2 | 60.1 | 76.5 | 67.2
- Loss | 60.5 | 47.7 | 61.1 | 52.9
- Beam | 60.8 | 55.3 | 68.2 | 61.1
- Loss & Beam | 60.4 | 47.3 | 60.9 | 52.7
Table 7. Consistency Evaluation on HotpotQA.
Models | Test Set F1 | Sampled Test Set F1
Proposed Model (All Document Input ⇔ Only Rationale Sentences Input) | 84.9 | 93.3
Proposed Model (Ground Truth ⇔ Only Rationale Sentences Input) | 60.2 | 90.3
Proposed Model (All Document Input ⇔ Ground Truth) | 60.2 | 100
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
