Testing Explainability of Chain of Thought for Large Language Models

Chen, Hao; Zhao, Zhuang; Shuai, Ziqi; Xuan, Jifeng

doi:10.3390/app16073112

School of Computer Science, Wuhan University, Wuhan 430072, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(7), 3112; https://doi.org/10.3390/app16073112

Submission received: 4 February 2026 / Revised: 11 March 2026 / Accepted: 17 March 2026 / Published: 24 March 2026

(This article belongs to the Special Issue Intelligent Computing in Software Engineering)

Abstract

Large Language Models (LLMs) have demonstrated superior abilities in complex tasks such as text generation, reasoning, and question answering. However, the explainability of LLMs becomes weaker as their parameter counts and complexity increase. Chains of Thought (CoTs) guide the model to perform step-by-step reasoning and effectively enhance its reasoning ability. The multi-step rationales verbalized in a CoT are widely regarded as the model's own explanation of its behavior. This paper proposes an automated approach to testing the behavioral sensitivity of responses to self-cited evidence in CoTs from sufficiency and necessity perspectives under context intervention. Specifically, we intervene in the reasoning chain by changing the input context and measure the behavioral consistency as a proxy for the faithfulness of the CoT. We test the CoT rationales of mainstream open-source LLMs on multi-hop question-answering tasks. The experimental results show that the evidence cited in the self-stated reasoning chain is neither sufficient nor necessary for the model's answers. The CoT cannot fully explain the behavior of LLMs.

Keywords:

program comprehension; explainable AI; software testing; chain of thought; multi-hop question answering

1. Introduction

Large Language Models (LLMs) have been widely applied and have demonstrated superior capabilities in many complex tasks. However, the large parameter scale and complex internal calculations of LLMs make their decision-making process like a black box. The lack of transparency seriously hinders the application of LLMs in high-risk fields such as healthcare, finance, and justice. Explainable AI (XAI) aims to uncover this black box. XAI utilizes explainability technology to create artificial intelligence systems that users can understand and trust, while maintaining a high level of performance [1]. Explainability refers to explaining the decision-making basis and process of a model [2].

Unlike traditional machine learning models that mainly focus on classification and prediction, LLMs can handle complex tasks, including text generation, reasoning, and question answering. This brings challenges and opportunities for the development of explainability technology. The challenge is that LLMs with large scales and parameters are more difficult to explain. Traditional feature attribution methods, such as SHAP [3], may generate huge computational costs in the process of explaining language models. The opportunity is that the text generation and common-sense reasoning capabilities of LLMs can drive the innovation of explainability technologies. Chain-of-Thought (CoT) prompting guides models to perform logical reasoning step by step and generate prediction results [4]. CoT prompting can be achieved by using direct instructions such as “think step by step” or providing examples of step-by-step reasoning and has become an effective method for improving the reasoning ability of LLMs. The CoT is the step-by-step reasoning process output by LLMs before they provide the final answer. The CoT rationales are widely interpreted as explanations for model decisions [5].

Based on the CoT, users can understand the thinking process of the model and then carry out model optimization and trust building. However, the premise is that the CoT truly reflects the decision-making behavior of the model. If this assumption is not valid, the CoT will mislead users. The model might get the answer first, while the CoT is just a seemingly reasonable explanation fabricated afterwards. By biasing model inputs, Turpin et al. [5] find that CoTs may systematically misrepresent the true reasons for model predictions. For example, they reorder the options of multiple-choice questions so that the answers in the few-shot prompt are always “(A)”. When the model is guided to incorrect answers, it frequently generates a CoT that rationalizes these answers. Although this CoT seems reasonable, it is actually misleading. Such explanations can trigger a trust crisis for LLMs among users and are more likely to cause security risks and serious consequences in high-risk and sensitive fields.

It is unconvincing to claim the successful construction of explainability merely because the explanation seems reasonable without scientific testing and evaluation [6]. At present, there are few studies on evaluating explainability methods, and the standardization level is low. Vilone and Longo [7] investigated 406 papers related to explainability from 1975 to 2020, of which only 70 focused on explainability assessment. Only by systematically testing explainability and comprehensively verifying the faithfulness of explanation can we discover the defects in explanation. Explainability testing can guide researchers to optimize algorithms in a targeted manner. This will further promote the continuous development of explainability research and ultimately help artificial intelligence models achieve safe applications.

The purpose of explanation is to enable humans to understand the basis for model decisions. The difficulty in explaining LLMs lies in the lack of ground truth, which also poses challenges to the credibility and accuracy of testing explanations. There is no formal definition of the correct explanation in many application scenarios. The method of evaluating an explanation is to capture the subjective perception of the user or observe the impact of the explanation on user behavior [8]. For example, Chen et al. [9] evaluate the counterfactual simulatability of natural language explanations by observing whether explanations can enable humans to accurately infer the results of models under different counterfactual inputs. Although subjective evaluation can directly reflect the degree of human understanding and trust in explanations, costs are high, and the results are subjective. Researchers may unconsciously influence the experimental procedures to meet user expectations. In addition, subjective evaluation is the comparison of the similarity between the explanation and human reasoning, which often ignores the principle of faithfulness. Faithfulness refers to the degree of closeness between the explanation and the decision-making process of the explained model [9]. Even if the explanation is similar to the human reasoning process, it may not be faithful to the true reasoning process of the model. We adopt a behavioral notion of faithfulness in this paper. We frame faithfulness in terms of behavioral alignment: a CoT is faithful if the model’s output is dependent on its self-cited evidence. Objective evaluation does not involve humans. Evaluating the quality of explanation through indicators is beneficial for achieving repeatable, automated, and low-cost explainability testing. The automated explainability testing method proposed in this paper aims to evaluate the behavioral faithfulness of CoTs generated by LLMs through quantifiable indicators.

In this paper, we consider a CoT as the self-description explanation of the model and test whether this explanation is consistent with the actual behavior of the model. We propose an automated approach to test whether the self-cited evidence in CoT is behaviorally influential on the output through input perturbation. This approach does not directly capture the internal computational processes of LLMs. Rather, it provides a scalable, black-box method for assessing behavioral consistency as a proxy for faithfulness: the model’s output should depend on the evidence it reports. Figure 1 shows an overview of the proposed approach. Based on the two assumptions of necessity and sufficiency, this paper constructs a test framework comprising questioning, explanation extraction, reasoning chain intervention, and answer comparison. We change the inputs and observe the changes in the outputs to test the explanatory quality of the CoT. We test whether the model’s output is robust to input perturbations by measuring behavioral consistency. The testing process includes two-step questioning and consistency determination. Different strategies are used to intervene in the reasoning chain, and the answers of LLMs to “complete context” and “context after reasoning chain intervention” are compared to test the explainability of the CoT.

The main contributions of this paper are as follows:

We propose an automated framework to test the behavioral sensitivity of responses to self-cited evidence in CoTs to help users analyze explainability.
We design testing indicators to quantitatively measure the behavioral faithfulness of CoTs based on consistency judgment.
We reveal that the CoT explanations of mainstream open-source LLMs cannot fully reflect the decision-making basis and behavior of the model. The cited evidence is often neither sufficient nor necessary for the answer.

The rest of this paper is organized as follows. Section 2 shows the background and motivation of testing the explainability of CoTs. Section 3 introduces our proposed method. Section 4 presents the experimental setup, including three research questions and the test process. Section 5 describes the experimental results. Section 6 discusses the threats to validity. Section 7 lists the related work, and Section 8 concludes.

2. Background and Motivation

We present the background and the motivation of our work.

2.1. Explainability

There is currently no generally accepted definition of explainability. It is difficult to unify the definition of explainability due to the different demands in various application fields [10,11]. Biran and Cotton [12] define explainability as the extent to which an observer can understand the reasons for a decision. Kim et al. [13] believe that explainability refers to the ability of users to accurately and effectively predict the outcome of a method. Doshi-Velez and Kim [6] define explainability as the ability to explain or present in terms that are understandable to humans. They believe that explainability can help qualitatively determine whether other demands, such as fairness, reliability, and availability, are being met.

Explainability can be divided into ante hoc explainability and post hoc explainability. Ante hoc explainability refers to the explainability of the model itself, such as decision trees, linear models, naive Bayes, etc. Post hoc explainability refers to the generation of explanations for black-box models with opaque decision-making mechanisms. Post hoc explainability can be divided into global explainability and local explainability. Global explainability explains the overall behavior of a model, while local explainability explains individual decisions. In real-world applications, the global distributions of samples cannot be obtained. Thus, the explainability in this paper only focuses on local explainability.

For LLMs trained with the pre-training and fine-tuning paradigms, local explainability approaches are mainly divided into feature attribution-based explanation, attention-based explanation, example-based explanation, and natural language explanation [14]. Among them, natural language explanation refers to the process of generating text to explain the decision-making process of a model. The language model is trained jointly with raw data and manually labeled explanations to enable the model to automatically generate explanations in the form of natural language [15]. Additionally, LLMs can also be used to explain other models. For example, a study by OpenAI shows that GPT-4 can generate natural language explanations for single neuron activation in GPT-2 XL [16].

For LLMs based on the prompting paradigm, the significant increase in model size makes traditional computationally intensive explainability technologies no longer applicable. In addition, the overly complex internal decision-making mechanism of the model cannot be represented by a simplified model. The CoT provides a new way and perspective for understanding the behavior of large language models.

2.2. Mechanistic vs. Behavioral Interpretability

Mechanistic interpretability aims to reveal the causal role of specific model components (such as attention heads, neurons, or circuits) in generating output. These approaches often involve techniques like attention pattern visualization [17] or activation patching [18]. While powerful, mechanistic methods require access to internal states and do not scale easily to black-box models.

Behavioral interpretability treats the model as a black box and relies on input–output manipulations to infer its reasoning. Behavioral consistency verifies the matching degree between the external explanation (such as CoT) and the decision behavior, focusing only on the input and output behaviors of the model. The common strategy is to measure the change in prediction by interfering with the input [19,20].

Given the distributed, probabilistic nature of transformer-based LLMs, there is no single ground-truth reasoning chain encoded in weights or activations. We focus on behavioral interpretability in this paper. The CoT is regarded as the self-reporting explanation of the model. We test whether the model’s behavior depends on these self-reporting paths by intervening in the input. Black-box testing avoids the difficulty of accessing internal representations and provides a practical, replicable evaluation framework.

2.3. Chain of Thought

Wei et al. [21] and Wang et al. [22] proposed the chain-of-thought prompting technique. The CoT can guide the language model to generate a reasoning path that breaks down complex reasoning into multiple simple steps by inputting step-by-step reasoning examples. Kojima et al. [4] demonstrated that large language models perform equally well in zero-shot reasoning by adding a simple prompt “Let’s think step by step” to guide the model to think step by step before answering each question.

Prompt-based approaches may not trigger LLMs to provide explanations. DeepSeek-R1 [23] trains the model to provide explanations before making the final prediction and only gives positive rewards if the prediction is correct. Even without human-labeled explanations as ground truths to guide training, LLMs can still learn to provide human-understandable explanations and maintain high accuracy when answering reasoning questions.

The CoT effectively improves the accuracy of LLMs in solving complex problems by decomposing complex problems into a series of sub-problems [21]. Outputting intermediate steps also helps users adjust prompts when they observe abnormal model behavior [24,25]. In addition, the step-by-step reasoning process of LLMs can also serve as a fine-tuning dataset for small LLMs to help improve their reasoning capabilities [26].

2.4. Multi-Hop Question Answering

Multi-hop question answering (Multi-hop QA) focuses on complex reasoning capabilities. Unlike single-hop QA, which can directly match answers through a single document or knowledge fragment, multi-hop QA requires the model to integrate information from multiple data sources or information fragments, and indirectly deduce the answer through logical associations and reasoning chains [27,28,29]. The multi-hop QA task aims to break through the limitations of traditional retrieval-based matching in QA systems. It is closer to the cognitive process of gradually associating and deducing when humans solve complex problems. Multi-hop QA has become a key benchmark for evaluating the deep semantic understanding and logical reasoning ability of LLMs.

The complexity and step-by-step nature of the multi-hop QA reasoning process make it an ideal scenario for testing the CoT. In multi-hop QA tasks, the CoT guides the LLM to output reasoning steps and visually presents the decision-making path of the model from information retrieval to answer generation. For example, when answering the question “The establishment time of the institution where a key figure of a certain historical event once worked”, the LLM needs to first identify the key figure, then associate it with the institution where he or she worked, and finally deduce the establishment time of the institution.

2.5. Motivation

A CoT seems intuitive. However, there is a key question that needs to be confirmed: Does the CoT truly describe the reasoning process of LLMs? Studies have shown that large language models still rely on sentence-level memory capabilities and corpus-level statistical patterns rather than having strong reasoning abilities [30,31]. In multi-hop QA tasks, there may be deviations between the CoT and the actual inference of the model. It is necessary to explore whether the CoT reflects the decision-making behavior of the model, which can provide a key research direction for further optimizing the explainability of LLMs. We propose an automated explainability testing framework to quantitatively test the behavioral consistency of CoTs from the dimensions of sufficiency and necessity.

3. Proposed Approach

The approach proposed in this paper is based on context intervention and answer consistency measurement. We evaluate the behavioral robustness of model responses under input perturbation and measure the degree of behavioral correlation between the information cited in CoT explanations and the outputs. We design testing indicators to measure the impact of CoT reasoning steps on model behavior.

3.1. Assumptions of Explainability Testing

Assumption 1 (Sufficiency Assumption).

If the CoT conforms to the actual behavior of the model, and only the context cited in the CoT is retained, the regenerated answers of the model will maintain high consistency with the original answers.

Assumption 2 (Necessity Assumption).

If the CoT conforms to the actual behavior of the model, removing the context cited in the CoT will significantly reduce the consistency between the regenerated answers and the original answers generated by the model.

3.2. Explainability Testing

The testing framework includes questioning, explanation extraction, reasoning chain intervention, and answer comparison, to test the behavioral consistency of the CoT. Figure 2 shows the test process and the strategies for intervening in the reasoning chain. This proposed approach changes the input of the model and observes the changes in the outputs to determine whether the CoT conforms to the actual decision-making behavior.

In the multi-hop question-answering task, we provide background knowledge paragraphs to the LLM and ask it questions. Our framework is designed to test whether the paragraphs cited in the CoT of the LLM are behaviorally influential on its final answer. The testing process consists of questioning and consistency judgment. We compare the responses of the LLM to “complete context” and “context after reasoning chain intervention” to test the explainability. Firstly, we provide a complete context (the background knowledge in Figure 1) to the model and require the model to think step by step to answer the question. Secondly, we extract the set of background paragraphs cited in the CoT. Then, we carry out targeted intervention on the context, including paragraph filtering or deletion. Thirdly, we question the model again and obtain an answer. Finally, we compare the original answer with the answer after interventions to determine whether there is a significant change in the model outputs, to test the behavioral consistency of the CoT.
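The two-step questioning described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `AskFn` interface, the `run_test` helper, and the `toy_ask` stub are hypothetical names introduced here, and a real run would replace the stub with a call to an LLM plus parsing of the cited paragraph indexes.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model interface (an assumption of this sketch): it receives the
# indexed background paragraphs and the question, and returns the paragraph
# indexes cited in the CoT (in reasoning order) together with the final answer.
AskFn = Callable[[Dict[int, str], str], Tuple[List[int], str]]
InterveneFn = Callable[[Dict[int, str], List[int]], Dict[int, str]]


def run_test(ask: AskFn, paragraphs: Dict[int, str], question: str,
             intervene: InterveneFn) -> Tuple[str, str]:
    """Return (original answer, answer after the reasoning chain intervention)."""
    cited, original = ask(paragraphs, question)   # question with the complete context
    perturbed = intervene(paragraphs, cited)      # filter or delete cited paragraphs
    _, regenerated = ask(perturbed, question)     # question again on the new context
    return original, regenerated


def toy_ask(paragraphs: Dict[int, str], question: str) -> Tuple[List[int], str]:
    """Stub standing in for an LLM: it can only answer when paragraph 2 is present."""
    if 2 in paragraphs:
        return [2], "paris"
    return [], "unknown"
```

The returned answer pair is what the answer comparison step consumes; the intervention itself is passed in as a function so that the same loop can serve every strategy.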

3.3. Reasoning Chain Intervention Strategies

The background knowledge (complete input context) consists of multiple paragraphs: $P = \{p_{1}, p_{2}, \dots, p_{n}\}$. The background knowledge cited in the CoT is referred to as the supporting paragraphs, which are sorted according to the reasoning steps: $P^{'} = \{p_{s_{1}}, p_{s_{2}}, \dots, p_{s_{m}}\}$. $PI$ is the background knowledge set obtained after applying various strategies for the reasoning chain intervention.

Strategy 1 (Keep only supporting paragraphs cited in CoT).

We retain the supporting paragraphs cited in the CoT and remove all paragraphs not cited in the CoT from the original background knowledge set:

$PI = P^{'}.$

(1)

Strategy 1 verifies the sufficiency of the supporting paragraphs set (information claimed by the CoT for inference). Sufficiency measures whether the information declared by the CoT alone is sufficient for the model to generate the original answer. If the model fails to output a consistent answer while supporting paragraphs are retained, it indicates that the information declared by the CoT is insufficient to support the decision, and the model actually relies on other undeclared information.

Strategy 2 (Remove the supporting paragraph cited in the first step of the CoT).

We remove the supporting paragraph cited in the first step of the CoT from the original background knowledge set, while all other paragraphs (including the supporting paragraphs cited later in the CoT and those not cited) are retained:

$PI = P \setminus \{p_{s_{1}}\}.$

(2)

Strategy 3 (Remove all supporting paragraphs cited in the CoT).

We remove all supporting paragraphs cited in the CoT from the original background knowledge set and only retain paragraphs not cited by the CoT:

$PI = P \setminus P^{'}.$

(3)
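The three strategies are plain set operations on the indexed paragraphs, so they can be written directly as dictionary filters. The following Python sketch is illustrative only; the function names and the representation of the context as a `dict` from paragraph index to text are assumptions made here, not details taken from the paper.

```python
from typing import Dict, List


def keep_only_cited(paragraphs: Dict[int, str], cited: List[int]) -> Dict[int, str]:
    """Strategy 1: PI = P', keep only the supporting paragraphs cited in the CoT."""
    cited_set = set(cited)
    return {i: text for i, text in paragraphs.items() if i in cited_set}


def remove_first_cited(paragraphs: Dict[int, str], cited: List[int]) -> Dict[int, str]:
    """Strategy 2: PI = P minus {p_s1}, drop only the paragraph cited in the first step."""
    if not cited:
        return dict(paragraphs)  # nothing was cited, so nothing can be removed
    first = cited[0]
    return {i: text for i, text in paragraphs.items() if i != first}


def remove_all_cited(paragraphs: Dict[int, str], cited: List[int]) -> Dict[int, str]:
    """Strategy 3: PI = P minus P', drop every supporting paragraph cited in the CoT."""
    cited_set = set(cited)
    return {i: text for i, text in paragraphs.items() if i not in cited_set}
```

Here `cited` is assumed to be ordered by reasoning step, so `cited[0]` corresponds to $p_{s_{1}}$.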

A fundamental characteristic of transformer-based LLMs is that they operate as probabilistic next-token predictors. Language generation is inherently stochastic: the same input may result in different computational pathways depending on the sampling choices made at each step. LLMs represent information in superposition across parameters and layers. The same input can activate different combinations of attention heads and neurons, leading to multiple computational pathways that yield the same output distribution. Therefore, the test framework proposed in this paper does not assume that the CoT represents the only possible reasoning path. This paper assesses whether the cited information in the CoT is behaviorally relevant to the model’s output. For situations where there may be multiple reasoning chains that can reach the final answer, we adopt a multi-round removal strategy. After removing the supporting paragraphs cited in the CoT for the first time, we ask the LLM again and remove the supporting paragraphs again based on the CoT to intervene in the reasoning chain multiple times.
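The multi-round removal can be sketched as a loop that repeatedly asks the model and removes whatever the latest CoT cites. This is a hedged illustration: the `ask` interface, the `rounds` parameter, and the `first_only` switch (first cited paragraph only versus all cited paragraphs) are hypothetical choices made for this example, and the number of rounds actually used is not assumed here.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model interface: returns (cited paragraph indexes, final answer).
AskFn = Callable[[Dict[int, str], str], Tuple[List[int], str]]


def multi_round_removal(ask: AskFn, paragraphs: Dict[int, str], question: str,
                        rounds: int = 2, first_only: bool = False) -> Tuple[str, str]:
    """Return (original answer, answer after the last removal round)."""
    cited, original = ask(paragraphs, question)
    context = dict(paragraphs)
    answer = original
    for _ in range(rounds):
        if not cited:
            break  # the latest CoT cited nothing, so there is nothing left to remove
        to_remove = set(cited[:1] if first_only else cited)
        context = {i: text for i, text in context.items() if i not in to_remove}
        cited, answer = ask(context, question)  # ask again on the reduced context
    return original, answer


def two_chain_stub(paragraphs: Dict[int, str], question: str) -> Tuple[List[int], str]:
    """Stub model with two alternative reasoning chains (via paragraph 2 or 5)."""
    for index in (2, 5):
        if index in paragraphs:
            return [index], "paris"
    return [], "unknown"
```

With the stub, one round of removal leaves the answer unchanged because the second chain is still available, while a second round removes it as well.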

Strategy 2 and Strategy 3 are used to test the necessity of the CoT explanation. If the CoT is faithful, removing either the supporting paragraph cited in the first step of the CoT or all the cited supporting paragraphs should affect the model behavior. A significant change in the answer after removing cited paragraphs suggests that the model’s output is sensitive to those paragraphs, i.e., the model behaviorally relies on them. Otherwise, the model may either not depend on those paragraphs or rely on external information. It is easier to test the impact of input perturbations on model behavior in Strategy 2 than in Strategy 3: part of the information related to answering the question is still retained, while the self-described reasoning chain in the CoT cannot be formed. This makes it easy to test whether the inference of the LLM relies on matching ability while measuring the correlation between the CoT and the decision-making behavior of the LLM. Strategy 3 imposes more lenient requirements on the explainability of LLMs than Strategy 2, as the model is prone to output different answers when all relevant information is removed.

3.4. Test Indicators

We name the original background knowledge set $P$ and the background knowledge set after intervention $PI$. $Y$ is the set of answers output by the LLM based on the original background knowledge set $P$. $Y^{'}$ is the set of answers output by the LLM based on the background knowledge set after intervention $PI$. When the CoT mentions $P_{i} \subseteq P$, the answer of the LLM is ${\hat{y}}_{i} \in Y$. After intervening in the input context, the answer of the LLM is ${\hat{y}}_{i}^{'} \in Y^{'}$ when mentioning $PI_{i} \subseteq PI$ in the CoT. The consistency score measures the average degree of consistency between the answer pairs, while the inconsistency score is the opposite:

$\mathrm{consistency}({\hat{y}}_{i}, {\hat{y}}_{i}^{'}) = \frac{1}{|Y|} \sum_{\substack{{\hat{y}}_{i} \in Y, \\ {\hat{y}}_{i}^{'} \in Y^{'}}} \delta({\hat{y}}_{i}, {\hat{y}}_{i}^{'}),$

(4)

$\mathrm{inconsistency}({\hat{y}}_{i}, {\hat{y}}_{i}^{'}) = 1 - \mathrm{consistency}({\hat{y}}_{i}, {\hat{y}}_{i}^{'}),$

(5)

where $\delta({\hat{y}}_{i}, {\hat{y}}_{i}^{'})$ measures the degree of consistency of the answers. The definition of $\delta({\hat{y}}_{i}, {\hat{y}}_{i}^{'})$ can be found later in Equations (9) and (10). Based on the F1 score and Exact Match (EM), which are universally adopted to assess performance of models on multi-hop QA tasks [32,33], we design two evaluation metrics for calculating $\delta({\hat{y}}_{i}, {\hat{y}}_{i}^{'})$: F1 score and Contain. Specifically, we take F1 as a string-based similarity measure that quantifies the token-level information overlap. Contain measures consistency based on containment relationship and refers to the EM for measuring a strict match. The definition of $\delta({\hat{y}}_{i}, {\hat{y}}_{i}^{'})$, which is calculated using F1 and Contain, is as follows:

After standardizing the text, we perform word segmentation, then calculate the precision and recall based on word-level overlap to derive the F1 score. In this case, we assign the value of F1 to $\delta({\hat{y}}_{i}, {\hat{y}}_{i}^{'})$:

$\mathrm{precision} = \frac{\mathrm{same}({\hat{y}}_{i}, {\hat{y}}_{i}^{'})}{\mathrm{len}({\hat{y}}_{i}^{'})},$

(6)

$\mathrm{recall} = \frac{\mathrm{same}({\hat{y}}_{i}, {\hat{y}}_{i}^{'})}{\mathrm{len}({\hat{y}}_{i})},$

(7)

$F1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$

(8)

$\delta({\hat{y}}_{i}, {\hat{y}}_{i}^{'}) = F1,$

(9)

where $\mathrm{same}({\hat{y}}_{i}, {\hat{y}}_{i}^{'})$ counts the number of word matches between the answers, $\mathrm{len}({\hat{y}}_{i})$ measures the total number of words of answer ${\hat{y}}_{i}$, and $\mathrm{len}({\hat{y}}_{i}^{'})$ measures the total number of words of answer ${\hat{y}}_{i}^{'}$.
The calculation rule of Contain in this paper is that after standardizing the text, if ${\hat{y}}_{i}^{'}$ contains ${\hat{y}}_{i}$ or ${\hat{y}}_{i}$ contains ${\hat{y}}_{i}^{'}$, it is regarded as a match. In this case, we define $\delta({\hat{y}}_{i}, {\hat{y}}_{i}^{'})$ as follows:

$\delta({\hat{y}}_{i}, {\hat{y}}_{i}^{'}) = \mathrm{contain}({\hat{y}}_{i}, {\hat{y}}_{i}^{'}),$

(10)

where $\mathrm{contain}({\hat{y}}_{i}, {\hat{y}}_{i}^{'})$ measures whether ${\hat{y}}_{i}$ is a subsequence of ${\hat{y}}_{i}^{'}$ or whether ${\hat{y}}_{i}^{'}$ is a subsequence of ${\hat{y}}_{i}$.
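Both definitions of the per-pair score can be sketched in Python. The sketch assumes the answers have already been standardized and that words are separated by spaces. Two details are assumptions of this example rather than statements from the paper: word matches are counted as a multiset overlap (as in common QA evaluation scripts), and containment is checked on the standardized strings with Python's `in` operator, which also matches partial words.

```python
from collections import Counter


def f1_delta(original: str, regenerated: str) -> float:
    """Token-level F1 between two standardized answers (Equations (6) to (9))."""
    orig_tokens = original.split()
    new_tokens = regenerated.split()
    if not orig_tokens or not new_tokens:
        # Assumption: two empty answers count as identical, otherwise no overlap.
        return float(orig_tokens == new_tokens)
    # same(.,.): number of shared words, counted with multiplicity (assumption).
    same = sum((Counter(orig_tokens) & Counter(new_tokens)).values())
    if same == 0:
        return 0.0
    precision = same / len(new_tokens)   # overlap relative to the regenerated answer
    recall = same / len(orig_tokens)     # overlap relative to the original answer
    return 2 * precision * recall / (precision + recall)


def contain_delta(original: str, regenerated: str) -> float:
    """Contain score (Equation (10)): 1.0 if either answer contains the other."""
    if not original or not regenerated:
        return float(original == regenerated)
    return float(original in regenerated or regenerated in original)
```

For the pair “new york city” and “new york”, the overlap is two words, so precision is 1, recall is 2/3, and F1 is 0.8, while Contain returns 1.0 because the shorter answer is contained in the longer one.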

Consistency reflects whether the answer after the intervention is consistent with the original answer. When retaining supporting paragraphs, a high consistency score generally denotes satisfactory behavioral relevance between the CoT and the model’s behavior. In the setting where supporting paragraphs are excluded, a low consistency score implies superior relevance; in other words, a high inconsistency score is indicative of strong relevance.

Additionally, we use the unknown rate to reflect the proportion of samples for which the model directly answers “unknown” after the supporting paragraphs are removed. We define the unknown rate as an indicator of behavioral sensitivity to contextual intervention. The reasons for the abstention behavior include not only the removal of supporting paragraphs but also alignment tuning, conservative response strategies, or safety mechanisms. In addition, abstention is related to format compliance. Therefore, we only consider the unknown rate as a referential indicator of the behavioral correlation between the cited paragraphs and the decision of the model. A high rate of answering “unknown” indicates sensitivity to the cited paragraphs:

$\mathrm{unknown\ rate} = \frac{1}{|Y^{'}|} \sum_{{\hat{y}}_{i}^{'} \in Y^{'}} u({\hat{y}}_{i}^{'}),$

(11)

$u({\hat{y}}_{i}^{'}) = \begin{cases} 1, & {\hat{y}}_{i}^{'} \text{ is “unknown”} \\ 0, & \text{otherwise} \end{cases}.$

(12)

The metrics defined above are operational indicators designed to capture different facets of behavioral dependence. They are not direct measures of semantic equivalence or causal alignment. Rather, they provide reproducible, scalable signals that support inferences about model behavior.
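The aggregate indicators in Equations (4), (5), (11), and (12) reduce to simple averages over answer pairs. The Python sketch below is illustrative: the function names are introduced here, the per-pair score is passed in as a hypothetical `delta` callable (for example an F1-based or Contain-based score), and the exact string comparison with “unknown” after lowercasing is an assumption of this example.

```python
from typing import Callable, List, Tuple


def consistency(pairs: List[Tuple[str, str]],
                delta: Callable[[str, str], float]) -> float:
    """Equation (4): mean per-pair score over (original, regenerated) answer pairs."""
    if not pairs:
        raise ValueError("at least one answer pair is required")
    return sum(delta(y, y_new) for y, y_new in pairs) / len(pairs)


def inconsistency(pairs: List[Tuple[str, str]],
                  delta: Callable[[str, str], float]) -> float:
    """Equation (5): complement of the consistency score."""
    return 1.0 - consistency(pairs, delta)


def unknown_rate(regenerated: List[str]) -> float:
    """Equations (11) and (12): share of regenerated answers that are 'unknown'."""
    if not regenerated:
        raise ValueError("at least one regenerated answer is required")
    hits = sum(1 for y_new in regenerated if y_new.strip().lower() == "unknown")
    return hits / len(regenerated)
```

As a small worked case with an exact-match score, the pairs ("a", "a"), ("b", "c"), ("d", "unknown"), and ("e", "e") give a consistency of 0.5, an inconsistency of 0.5, and an unknown rate of 0.25.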

4. Experimental Setup

To test the explainability of mainstream open-source LLMs, we conducted experiments to verify the two assumptions of sufficiency and necessity by applying three strategies (see Section 3). We propose three Research Questions (RQs) to test the behavioral consistency of CoTs for LLMs.

4.1. Research Questions

RQ1. Will the LLM output an answer consistent with the original answer when Strategy 1 is applied? The purpose of this research question is to test the sufficiency of the CoT explanation. Specifically, if the CoT is consistent with the behavior of the model, the model can infer an answer consistent with the original one only based on the supporting paragraphs used in the CoT.

RQ2. Will the LLM output an answer inconsistent with the original answer when Strategy 2 and Strategy 3 are applied? This research question aims to test the necessity of CoT explanation. Removing the supporting paragraphs cited in the CoT will interfere with reasoning behavior. If the model can still return a result consistent with the original answer, it indicates that the self-stated explanation of the LLM is not a necessary reasoning basis of its decision-making.

RQ3. How consistent are the model outputs with the original answers after multiple rounds of Strategy 2 and Strategy 3? Since there may be multiple reasoning chains that can lead to the answer, the model may still be able to obtain the answer after the reasoning chain is interfered with for the first time. Therefore, we remove the supporting paragraphs again based on the second reasoning chain, and ask the model to regenerate the answer. Then, we compare the consistency between the last answer and the original answer.

4.2. Data Preparation

We employed an open-source dataset to evaluate the test approach proposed in this paper. The MuSiQue dataset [27] is a multi-hop question-answering dataset with 25K 2–4 hop questions. Hop refers to the number of steps for reasoning. k-Hop means that the model needs to break down the question into k sub-questions and go through k steps of reasoning before the answer can be obtained. When answering the questions in this dataset, the answer to each sub-question must rely on the answer of the previous sub-questions. Each sample includes 20 background paragraphs marked with an index, questions, and reference answers, which are suitable for verifying the dependency relationship between paragraphs and answers. We used the dev set with a total of 2417 samples, including 1252 two-hop questions, 760 three-hop questions, and 405 four-hop questions.

4.3. Implementation

We selected three mainstream open-source large language models from the Open LLM Leaderboard (https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, accessed on 16 March 2026) as test models. Qwen3-Next-80B-A3B-Instruction (https://www.aliyun.com/, accessed on 16 March 2026) and Llama-4-Maverick (https://ai.meta.com/llama/, accessed on 16 March 2026) have reliable general capabilities, and DeepSeek-V3.2-Exp (https://www.deepseek.com/, accessed on 16 March 2026) achieves strong performance on the long context reasoning task. We explicitly report the model call parameters in Appendix A.1.

4.4. Test Process

Step 1. We input background knowledge, questions and format constraints into the model. We designed a prompt to explicitly instruct the model to indicate which paragraphs (by their index) were used to support the answer. The model was required to think step by step, declare supporting paragraph indexes in the order of inference, and give the final answer.
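
The extraction of the cited indexes and the final answer from a Step 1 response can be sketched as follows. This is a minimal illustration, not the parsing rules of Appendix A.4: the function name `parse_response` and the two regular expressions are assumptions that simply mirror the `idx: x, y, z` and `Answer: your_answer` formats requested in the prompt.

```python
import re

# Hypothetical patterns mirroring the formats requested in the prompt.
IDX_PATTERN = re.compile(r"idx:\s*(\d+(?:\s*,\s*\d+)*)", re.IGNORECASE)
ANSWER_PATTERN = re.compile(r"Answer:\s*(.+)", re.IGNORECASE)


def parse_response(text):
    """Return (cited_indexes, answer), or None if either part is missing."""
    idx_matches = IDX_PATTERN.findall(text)
    answer_matches = ANSWER_PATTERN.findall(text)
    if not idx_matches or not answer_matches:
        return None
    # Assumption: if a field appears several times, the last one is final.
    indexes = [int(token) for token in idx_matches[-1].split(",")]
    # Drop repeated indexes while preserving the order of first use.
    ordered = list(dict.fromkeys(indexes))
    return ordered, answer_matches[-1].strip()
```

Returning `None` on a missing field lets the caller treat the response as an extraction failure instead of guessing a value.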

Step 2. Based on different intervention strategies, we filtered or removed background paragraphs according to the support paragraph indexes obtained in the first step. Then, we questioned the model again and recorded the answer. We prompted the model to directly answer “unknown” if it could not infer the answer from the provided background knowledge.
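
A minimal sketch of the three context interventions is given below, assuming that each background paragraph is represented as a dictionary with an `idx` key and that `cited` holds the indexes in the order declared in the CoT. The function name `apply_strategy` and this data layout are illustrative assumptions.

```python
def apply_strategy(paragraphs, cited, strategy):
    """Return the background paragraphs that remain after an intervention.

    Strategy 1 keeps only the cited paragraphs, Strategy 2 removes the first
    cited paragraph, and Strategy 3 removes all cited paragraphs.
    """
    cited_set = set(cited)
    if strategy == 1:
        return [p for p in paragraphs if p["idx"] in cited_set]
    if strategy == 2:
        first = cited[0] if cited else None
        return [p for p in paragraphs if p["idx"] != first]
    if strategy == 3:
        return [p for p in paragraphs if p["idx"] not in cited_set]
    raise ValueError(f"unknown strategy: {strategy}")
```

The original paragraph order is preserved in all three cases, so only the presence of paragraphs changes between the first and second query.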

Step 3. We measured test indicators such as consistency between two answers and the unknown rate in the second step. Before calculating consistency, we performed text standardization, including converting the text to lowercase letters, removing all punctuation marks, filtering stop words, and standardizing spaces.
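
The standardization and the two consistency measures can be sketched as follows. Two points are assumptions: `STOP_WORDS` is only an excerpt of the list in Appendix A.3, and the containment rule is taken here to mean that one normalized answer contains the other at token boundaries.

```python
import string
from collections import Counter

# Excerpt only: the full stop-word list is given in Appendix A.3.
STOP_WORDS = {"the", "a", "an", "in", "on", "at", "to", "of", "for", "with",
              "is", "are", "was", "were", "and", "or", "but"}

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, strip punctuation, drop stop words, collapse whitespace."""
    tokens = text.lower().translate(_PUNCT_TABLE).split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


def contain(answer_a, answer_b):
    """Assumed containment rule: one normalized answer contains the other."""
    a, b = " ".join(normalize(answer_a)), " ".join(normalize(answer_b))
    if not a or not b:
        return a == b
    # Pad with spaces so that matches align with token boundaries.
    return f" {a} " in f" {b} " or f" {b} " in f" {a} "


def token_f1(answer_a, answer_b):
    """Token-overlap F1 between two normalized answers."""
    a, b = normalize(answer_a), normalize(answer_b)
    if not a or not b:
        return float(a == b)
    overlap = sum((Counter(a) & Counter(b)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a), overlap / len(b)
    return 2 * precision * recall / (precision + recall)
```

Because both measures operate on the normalized tokens, differences in case, punctuation, or articles between two answers do not count as inconsistency.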

We specify the full prompt template used in Step 1 and Step 2 in Appendix A.2 and the preprocessing procedures with the tokenization rules and the stop-word list in Appendix A.3. If there was a call error or if the answer or the supporting paragraph indexes could not be extracted, the retry mechanism was triggered, with a maximum of 3 retries. If the retries failed, the sample was marked as an error sample. To ensure that extraction failures did not bias our comparisons, we quantified the frequency of errors. We describe the parsing rules and report the error rate in Appendix A.4. Errors were mainly due to the security checks on the input data. The extraction failed mainly because the API returned an empty response. Error samples were excluded from quantitative analysis. The low error rate and extraction failure rate indicated that the extraction mechanism was robust and that the vast majority of samples were successfully processed.
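
The retry mechanism can be sketched as below. The names `query_with_retries`, `call_model`, and `parse` are hypothetical, and the sketch assumes that "a maximum of 3 retries" means one initial call followed by up to three further attempts.

```python
def query_with_retries(call_model, parse, prompt, max_retries=3):
    """Query the model and parse the reply, retrying on failure.

    Returns the parsed result, or None if every attempt fails, in which
    case the sample is treated as an error sample.
    """
    for _ in range(1 + max_retries):
        try:
            raw = call_model(prompt)
        except Exception:
            # Call error (for example, a rejected request): try again.
            continue
        parsed = parse(raw) if raw else None
        if parsed is not None:
            return parsed
    return None
```

Empty responses and unparsable responses follow the same path, so both kinds of failure consume one attempt.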

5. Experimental Results and Analysis

We tested the CoT explainability of LLMs by answering research questions. We quantitatively tested the quality of the supporting indexes cited in CoTs by comparing them with the ground-truth supporting facts provided in the MuSiQue dataset. We calculated the average Jaccard similarity and recall of the five experiments in RQ1–3 to observe the reliability of supporting paragraph indexes. Recall refers to the proportion of ground-truth indexes covered by cited indexes. Table 1 shows the degree of overlap. We evaluated the model performance on the original complete context against MuSiQue ground-truth answers. DeepSeek-V3.2-Exp obtained 68.51%, Llama 4 Maverick 66.46%, and Qwen3-Next-80B-A3B-Instruction 62.69%.
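
For a single sample, the overlap between the cited indexes and the ground-truth supporting indexes can be computed as in the following sketch. The function names are illustrative, and the handling of empty sets is an assumption.

```python
def jaccard(cited, gold):
    """Jaccard similarity between cited and ground-truth index sets."""
    cited_set, gold_set = set(cited), set(gold)
    if not cited_set and not gold_set:
        return 1.0  # assumption: two empty sets count as identical
    return len(cited_set & gold_set) / len(cited_set | gold_set)


def recall(cited, gold):
    """Proportion of ground-truth indexes covered by the cited indexes."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0  # assumption: undefined recall is reported as zero
    return len(set(cited) & gold_set) / len(gold_set)
```

For example, if the ground truth is paragraphs 11 and 14 and the CoT cites paragraphs 1 and 14, the Jaccard similarity is 1/3 and the recall is 1/2.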

We added matched control baselines for context manipulation. For Strategy 1, we compared “keep only supporting paragraphs” to “keep the same number of randomly chosen paragraphs” and to “keep the same number of non-supporting paragraphs”. For Strategy 2 and Strategy 3, we added “remove the same number of random paragraphs” and “remove the same number of non-supporting paragraphs” as baselines. Table 2, Table 3 and Table 4 show the baselines. The test results all exceeded the baselines, which indicates that the changes in consistency were specific to CoT-cited evidence rather than to generic context reduction.
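
A matched removal baseline can be sketched as follows, assuming paragraphs are dictionaries with a unique `idx` key. The function `control_removal`, its `mode` values, and the use of a seeded `random.Random` instance are illustrative assumptions rather than the implementation used in the experiments.

```python
import random


def control_removal(paragraphs, cited, k, mode, rng):
    """Remove k paragraphs chosen independently of the CoT citations.

    mode "random" samples from all paragraphs; mode "non_supporting"
    samples only from paragraphs that the CoT did not cite.
    """
    cited_set = set(cited)
    if mode == "random":
        pool = [p["idx"] for p in paragraphs]
    elif mode == "non_supporting":
        pool = [p["idx"] for p in paragraphs if p["idx"] not in cited_set]
    else:
        raise ValueError(f"unknown mode: {mode}")
    removed = set(rng.sample(pool, min(k, len(pool))))
    return [p for p in paragraphs if p["idx"] not in removed]
```

Setting `k` to 1 matches the number of paragraphs removed by Strategy 2, and setting it to the number of cited paragraphs matches Strategy 3.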

5.1. Will the LLM Output an Answer Consistent with the Original Answer When Strategy 1 Is Applied?

To answer RQ1, we input the original background knowledge into the LLM and asked it to answer the question. We retained the supporting paragraphs cited in the CoT based on the index extracted from the CoT and removed the others. We re-input the filtered background knowledge and questioned the LLM again. Then, we compared whether the two answers output by the LLM were consistent. Table 5 shows the test results when Strategy 1 was applied. We calculated the consistency for two-hop, three-hop and four-hop questions by using Contain and F1 evaluation metrics. Avg represents the average scores. The reason we used the containment rule is to make the metric more robust to superficial variations in model outputs which do not change the semantic content. We assessed the extent of potential overestimation. We conducted manual validation on a stratified sample of instances, where Contain judged the answers as consistent, while EM judged them as inconsistent. These were the high-risk false positive samples. The samples were drawn proportionally by hop count: 90 from two-hop, 45 from three-hop, and 15 from four-hop, for a total of 150 instances. We manually determined whether the answers were semantically equivalent. The results showed that 119 out of 150 instances (79.33%) were semantically equivalent, which indicated that Contain captured semantic equivalence in most cases compared to EM. Based on this analysis, we adopted Contain as the primary metric. We report the EM results in Appendix B.1.

The average consistency-Contain of the three large language models was around 60%. Llama 4 Maverick had the highest value (66.64%), followed by DeepSeek-V3.2-Exp (63.96%). Qwen3-Next-80B-A3B-Instruction was the lowest (55.56%). The trend of consistency-F1 was consistent with consistency-Contain, ranging from 56% to 65%. DeepSeek performed better when the number of hops was high. We performed statistical analyses on inconsistency-Contain scores to assess the model differences. Between-model comparisons via one-way ANOVA with Tukey HSD post hoc tests revealed significant overall differences at each hop (all p < 0.001). At two hops, DeepSeek and Llama significantly outperformed Qwen3 (p < 0.001). At three hops, all pairwise differences were significant (p < 0.05), with Llama leading. At four hops, DeepSeek and Llama both exceeded Qwen3 (p < 0.01). If the CoT is related to the decision-making behavior of the model, retaining only supporting paragraphs should enable the model to generate consistent answers, with a consistency rate close to 100%. However, the actual result was far lower than this expectation. The inconsistent responses observed after intervention suggest behavior resembling ex-post rationalization, which appears most prominently in the CoT of Qwen3-Next-80B-A3B-Instruction. From a behavioral perspective, the information declared by the CoT is insufficient to support complete inference. In addition, the consistency rate decreases as the number of hops increases. We performed paired t-tests to evaluate the hop-related degradation. The results showed that Qwen3-Next-80B-A3B-Instruction exhibited significant declines between all consecutive hops (two hops → three hops: p < 0.001; three hops → four hops: p = 0.003). As question complexity increased, the proportion of samples for which the models could derive the original answer from the supporting paragraphs decreased.
The model could not fully reproduce the original reasoning process while retaining only the supporting paragraphs cited in the CoT, possibly because it relied on other implicit information that was not cited in the CoT. This phenomenon indicates that CoTs can only partially reflect the decision-making basis of the model and cannot fully cover all key information. There is a significant defect in the sufficiency of the explanation.

To reduce confounds from context length changes, we conducted a length-controlled replacement experiment. We replaced non-supporting paragraphs with an equal-length ellipsis placeholder while keeping the supporting paragraphs unchanged. This ensured the total context length was the same as that of the original input. We re-ran the consistency evaluation under this length-controlled setting on Qwen3-Next-80B-A3B-Instruction. We conducted a paired Wilcoxon signed-rank test between the non-supporting removal setting and the length-controlled replacement setting. Results showed no significant differences across inconsistency-Contain, inconsistency-F1, and unknown rate (all p > 0.05). This confirmed that context length did not confound our main findings.
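
The length-controlled replacement can be sketched as below. The sketch assumes that paragraphs are dictionaries with `idx` and `text` keys, that length is measured in characters, and that the ellipsis placeholder is a run of dots. All three are assumptions made for illustration.

```python
def length_controlled(paragraphs, cited):
    """Mask non-supporting paragraphs with an equal-length placeholder."""
    cited_set = set(cited)
    masked = []
    for paragraph in paragraphs:
        if paragraph["idx"] in cited_set:
            # Supporting paragraphs are kept unchanged.
            masked.append(dict(paragraph))
        else:
            # Same character length as the original text, but no content.
            placeholder = "." * len(paragraph["text"])
            masked.append({**paragraph, "text": placeholder})
    return masked
```

The total character count of the context is unchanged, so any difference from the original answer cannot be attributed to a shorter input.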

Conclusion. When only the supporting paragraphs cited in the CoT were retained, the average consistency of the model answers was below 70%, which means the CoT explanation was insufficient. In more than 30% of cases, the supporting paragraphs declared by the CoT were not sufficient conditions for the decision behavior of the model.

5.2. Will the LLM Output an Answer Inconsistent with the Original Answer When Strategy 2 and Strategy 3 Are Applied?

To answer RQ2, after questioning the LLM, we removed the supporting paragraphs involved in the CoT. We re-input the modified background knowledge and questioned the LLM again. Then, we compared whether the two answers output by the LLM were inconsistent. Table 6 shows the test results when Strategy 2 was applied.

The average inconsistency-Contain of all three models was below 76%, with Qwen3-Next-80B-A3B-Instruction having the highest (75.37%), DeepSeek-V3.2-Exp following closely (73.53%), and Llama 4 Maverick having the lowest (58.24%). The trend of inconsistency-F1 was consistent with inconsistency-Contain. Llama 4 Maverick returned consistent answers after removing the first supporting paragraph in the CoT rationales in more than 40% of the cases. Even when the first supporting paragraph was removed, the model could still give an answer consistent with the original answer based on other information (such as the remaining paragraphs and prior knowledge obtained during training). This suggests that the reasoning steps described by the CoT were not necessary conditions for decision-making.

The average unknown rate of Qwen3-Next-80B-A3B-Instruction was the highest, reaching 71.31%. After removing the first supporting paragraph, the model was unable to generate clear answers on over 70% of the samples based on the remaining context. This indicates that the decisions of this model depended relatively strongly on the paragraphs it cited. Llama 4 Maverick performed the worst, with an average unknown rate of only 37.79%. This indicates that there was a gap between the model’s behavior and its self-described thinking path. Even though the supporting paragraph was removed, this model still provided an answer.

We selected a case to introduce the situation when the answer in the second step of Llama 4 Maverick was not “unknown”. In that case, the model made inference errors after removing the first supporting paragraph in the CoT and returned an unrelated answer, as shown in Figure 3. The red part in the figure represents key information. The blue part represents irrelevant interference information, while the green part represents incorrect reasoning. When the background paragraph of “Philippe, Duke of Orléans, was the younger son of Louis XIII of France” is removed, the model assumes that “Philippe, Duke of Orléans” refers to “Louis Philippe I”. The model first derives “Louis Philippe II and his wife are the grandparents of François d’Orléans” based on the interference information. Then, the model makes its first inference error. Françoise Marie de Bourbon is the wife of Philippe II. According to the previous analysis of the model, it should be concluded that Françoise Marie de Bourbon is the grandmother of François d’Orléans. However, the model concludes that “Françoise Marie de Bourbon is the grandmother of Louis Philippe I”. Moreover, the subsequent reasoning of the model does not use this information. Instead, it directly concludes “For François d’Orléans, his paternal grandmother would be the mother of Louis Philippe I. Louis Philippe I’s mother was Louise Marie Adélaïde de Bourbon”. It is worth noting that “Louise Marie Adélaïde de Bourbon” does not appear in the context, which is directly dependent on memory retrieval by the LLM. Then, the model makes another inference error and concludes that “Louis Philippe I…, his grandmother on the father’s side is Louise Marie Adélaïde de Bourbon”, which directly changed the “mother” in the previous conclusion to “grandmother”. Finally, the wrong answer was given: “Louise Marie Adélaïde de Bourbon”. 
From this case, we can see that the large language model relied on matching similar texts and on memorized knowledge to provide a guessed answer. The CoT provided by the LLM was not reliable.

As shown in Table 7, we report how often models produced an incorrect but confident answer versus a correct answer versus “unknown” under Strategy 2. The results show that Llama 4 Maverick is inclined to produce an incorrect but confident answer rather than “unknown”. Its relatively high correct rate compared to the other two models also suggests that Llama 4 Maverick may answer based on prior knowledge or shortcuts.
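
The three-way breakdown can be sketched as follows. How correctness was judged for Table 7 is not restated here, so the containment check against the reference answers is an assumption, as are the function names.

```python
from collections import Counter


def outcome(answer, references):
    """Classify a second-step answer as unknown, correct, or incorrect."""
    norm = answer.strip().lower().rstrip(".")
    if norm == "unknown":
        return "unknown"
    refs = [ref.strip().lower() for ref in references]
    # Assumed correctness rule: answer and a reference contain each other.
    if norm and any(ref and (ref in norm or norm in ref) for ref in refs):
        return "correct"
    return "incorrect"


def outcome_rates(samples):
    """samples: iterable of (answer, references); returns rate per outcome."""
    counts = Counter(outcome(answer, refs) for answer, refs in samples)
    total = sum(counts.values())
    return {label: counts[label] / total
            for label in ("correct", "incorrect", "unknown")}
```

A model with a low unknown rate and a high incorrect rate is one that prefers to guess once its cited evidence is removed.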

The test results for removing all supporting paragraphs are shown in Table 8. The average inconsistency-Contain of all three models exceeded 85%, with DeepSeek-V3.2-Exp having the highest (93.54%), followed by Qwen3-Next-80B-A3B-Instruction (92.68%), and Llama 4 Maverick having the lowest (85.43%). The results of applying Strategy 3 were significantly higher than those of applying Strategy 2. As shown in Figure 4, after removing all the supporting paragraphs cited in the CoT rationales, the models still output answers consistent with the original ones on some samples, but the overall inconsistency was higher than after removing only the first supporting paragraph. This indicates that when the complete information basis cited in the CoT is removed, the ability of the model to generate the original answers is limited. When the first supporting paragraph is removed, although the reasoning chain is affected, the LLM may guess the answer based on the context related to the question. After removing all supporting paragraphs, almost all relevant contexts are discarded, and the difficulty for models to obtain the same answer through matching significantly increases. However, the CoT does not fully cover the necessary basis for model decision-making. Although the deficiency in the necessity of explanation has been alleviated, it has not been eliminated.

The average unknown rate of Qwen3-Next-80B-A3B-Instruction is the highest when removing all supporting paragraphs, reaching 91.05%. The model cannot generate clear answers on over 90% of the samples based on remaining knowledge. DeepSeek-V3.2-Exp ranks second, with an average unknown rate of 88.65%, and its overall performance is similar to that of Qwen3-Next-80B-A3B-Instruction. The abstention behavior might be related to the conservative response strategy or security mechanism of these models. Llama 4 Maverick still performs the worst, with an average unknown rate of only 65.11%. Even if all supporting paragraphs are removed, the model can generate clear answers on over 30% of the samples. Although the inconsistency rate is high after removing all supporting paragraphs, the low unknown rate reflects that the model is more inclined to guess an incorrect answer than to answer “unknown”. In such cases, the CoT rationales are likely to contain many flawed reasoning steps, which severely affects the quality of the CoT as an explanation.

Conclusion. After removing supporting paragraphs cited in the CoT rationales, all three models still produced some outputs that were consistent with the original answer, which indicates that there is a gap between CoT explanation and model behavior. The self-described CoT appears to rationalize after the fact: it serves as the reasoning basis that the model chooses to present, rather than necessarily the basis on which the model must rely. The necessity of the CoT explanation is limited.

5.3. How Is the Consistency Between the Model Outputs After Multiple Rounds of Strategy 2 and Strategy 3?

The distributed representation of LLMs allows for multiple reasonable reasoning chains. A correct answer may be obtained through different combinations of evidence. The faithfulness, as defined behaviorally, is not about identifying a unique ground-truth chain but about measuring the influence of cited evidence on the output. When multiple paths exist, deleting one set of evidence may not change the answer if another path remains accessible. Our multi-round removal strategy was precisely designed to address this multi-path situation. For example, as shown in Figure 5, when the question is “What type of animal is Xiao Liwu’s mother”, the standard problem decomposition step given by the dataset is to deduce that Xiao Liwu’s mother is Bai Yun based on the 11th paragraph. Then, according to the 14th paragraph, it is known that Bai Yun is a panda. However, in reality, there is other implicit information that can also lead to the answer. According to the first paragraph, Xiao Liwu is Zhen Zhen’s full sibling, and Zhen Zhen’s mother is Bai Yun. It can be concluded that Xiao Liwu’s mother is also Bai Yun. Then, based on the 14th paragraph, it can be inferred that Bai Yun is a panda. The final answer can be deduced: Xiao Liwu’s mother is a panda. For the situation where the model can obtain the answer after the first disruption of the thinking chain, we adopted a multi-round removal strategy. We removed the context again based on the CoT of the second response of the model and compared the consistency between the final answer and the original answer to test explainability to a loose extent. The experimental results are shown in Table 9 and Table 10.
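
The multi-round removal can be sketched as the loop below. The callable `ask` is a hypothetical stand-in for querying the model on the remaining context and returning the newly cited indexes together with the answer. Stopping early once the model answers “unknown” is also an assumption of this sketch.

```python
def multi_round_removal(paragraphs, first_cited, ask, rounds=2,
                        remove_all=False):
    """Repeatedly remove cited paragraphs and re-query the model.

    paragraphs: list of dicts with an "idx" key.
    first_cited: indexes cited in the original CoT, in order of use.
    ask: hypothetical callable, ask(context) -> (cited_indexes, answer).
    Returns the final answer and the remaining context.
    """
    context = list(paragraphs)
    cited, answer = list(first_cited), None
    for _ in range(rounds):
        if not cited:
            break
        # Strategy 3 removes every cited paragraph, Strategy 2 only the first.
        targets = set(cited) if remove_all else {cited[0]}
        context = [p for p in context if p["idx"] not in targets]
        cited, answer = ask(context)
        if answer.strip().lower() == "unknown":
            break
    return answer, context
```

In the Xiao Liwu example, the first round removes the 11th paragraph, the model may then switch to the chain through the first paragraph, and the second round removes that paragraph as well. The final answer is then compared with the original answer.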

From the experimental results, we can see that when the first supporting paragraph was removed twice, Llama 4 Maverick performed the worst, with an average inconsistency-Contain of 73.87%, and its average unknown rate (52.35%) was much lower than that of the other two models. This further indicated that, among the three models, it depended most heavily on information not declared in the CoT. Even after multiple rounds of removing the first supporting paragraph, the model still provided consistent answers on over 25% of the samples. After multiple rounds of reasoning chain intervention, all three large language models still had some outputs that were consistent with the original answers.

The intervention effect of removing all supporting paragraphs twice was significantly enhanced. The average inconsistency-Contain of the three models exceeded 90%, with DeepSeek-V3.2-Exp having the highest (95.48%), followed by Qwen3-Next-80B-A3B-Instruction (94.59%). The average unknown rates of DeepSeek-V3.2-Exp and Qwen3-Next-80B-A3B-Instruction both exceeded 95%, which means that the models were unable to generate answers based on the remaining knowledge on over 90% of the samples. Although the average inconsistency-Contain of Llama 4 Maverick increased to 90.96%, its average unknown rate (71.93%) was still lower than that of the other two models. This suggests that Llama 4 Maverick may be more inclined to fabricate an answer, even a wrong one, based on related information or on memorized knowledge.

We compared test results under multiple rounds of removal and a single round of removal. The comparison results can be found in Appendix B.2. To evaluate the impact of multi-round removal of supporting paragraphs on model performance, we conducted paired t-tests and Wilcoxon signed-rank tests (α = 0.05) on inconsistency-Contain, inconsistency-F1, and unknown rate. For all three LLMs, multi-round removal led to significantly higher inconsistency and unknown rate compared to single-round removal (all p < 0.05), which confirms that repeated removal of supporting paragraphs exacerbates model inconsistency and reduces its ability to generate valid answers. As more relevant contexts are discarded, the difficulty for the model to output the same answer increases. Removing all background paragraphs cited in the CoT leaves less information related to the answer compared to removing the first paragraph cited in the CoT. Multiple rounds of removal delete more relevant information compared to a single round of removal. When removing a small amount of relevant information, the self-narrative inference chain in the CoT is broken, while some information related to the answer is still retained. In this case, it is easier to test the CoT of the LLM. If the LLM can provide consistent answers, it means that it may rely on matching and not conform to the reasoning behavior described in the CoT.

Conclusion. Considering that multiple reasoning chains can lead to the final answer, we conducted multiple rounds of removal to repeatedly intervene in the self-stated reasoning chains in the CoT and test explainability. Compared with single-round removal, the LLMs showed higher inconsistency and unknown rates under multiple rounds of paragraph removal. Because a large amount of relevant information was discarded, the difficulty for LLMs to obtain consistent answers through matching increased.

5.4. Summary of Answers to RQs

Based on the experimental results and the analysis of the three research questions, we can summarize the following two conclusions:

The sufficiency of the CoT explanation is flawed. When only the supporting paragraphs cited in CoT rationales are retained, the LLMs cannot reproduce the original answer in a relatively high proportion of cases. This indicates that CoT rationales are not sufficient conditions for decision-making.
The necessity of the CoT explanation is flawed. After removing the supporting paragraphs cited in CoT rationales, the models can still generate consistent answers in a non-negligible proportion of cases. This indicates that CoT rationales are not necessary conditions for decision-making.

These conclusions indicate that the CoT explanations of current mainstream LLMs cannot fully reflect the behavior of the model: they are neither sufficient nor necessary. In the future, the explainability of CoTs can be optimized with sufficiency and necessity as goals.

6. Threats to Validity

We now discuss the threats to the validity of our results from two dimensions.

6.1. Internal Validity

The reasoning process that humans consider correct can be obtained based on experience, while the true reasoning process that is faithful to the decision-making mechanism of the LLM is unknown. Our proposed testing method avoids the threat of exploring the ground truth by observing the changes in the results of LLMs to indirectly test whether the CoT is consistent with the model’s behavior.

Transformer-based LLMs operate via distributed representations and probabilistic next-token prediction. There is no clearly defined, single symbolic reasoning chain internally encoded in the model. We dealt with the situation where there are multiple effective reasoning paths from two aspects. First, faithfulness was defined as behavioral consistency rather than path consistency; second, we introduced a multi-round removal strategy for testing multiple reasoning paths.

The test indicators proposed in this paper are operational measures designed to capture different facets of behavioral dependence, not direct measures of semantic equivalence or causal alignment. The metrics have limitations related to answer formulation, semantic matching, and abstention behavior. The F1-based indicator is a function of string normalization choices related to the preprocessing pipeline. We employed it as a similarity measure quantifying the token overlap. Although it is not a principled measure, it can reflect behavioral consistency in a practical and reproducible manner.

The interventions operated at the paragraph level, which means we could only remove or replace full supporting paragraphs. This limited the granularity of the analysis, as we could not isolate individual sentences or facts within paragraphs.

Behavioral consistency does not measure direct causal alignment between internal states and explanations. Despite this, we believe it can offer valuable insights. We evaluated the behavioral robustness under context intervention as a proxy for explanatory faithfulness. It tested a minimal necessary condition for faithfulness: if the model’s answer depended on the evidence it cited, the CoT was at least behaviorally relevant. Future work could combine behavioral interventions with finer-grained analyses, such as attention attribution or causal tracing.

6.2. External Validity

Test-case generation is challenging [34,35,36]. Due to time-cost limitations, we could not conduct detailed testing on all LLMs and datasets. As a preliminary step toward broader generalization, we added experiments on the commercial model GPT-4o-mini (see Appendix B.3). We re-ran the experiment with the temperature set to zero to improve reproducibility and report the results in Appendix B.4. We will extend the testing framework to other prominent models in future work. The testing concept proposed in this paper can be extended to other scenarios, as long as the explainability technology can provide explanations for a single sample and the context dependencies can be extracted from the explanation. Based on the testing schema proposed in this paper, users can test explainability by changing the inference preconditions and comparing the consistency of the results.

7. Related Work

We list related work in three categories: explainability based on input perturbations, evaluation of traditional feature attribution algorithms, and evaluation of CoTs.

7.1. Explainability Based on Input Perturbations

The research on using input perturbations to explain models is constantly increasing. The core idea is to infer which features are influential by modifying inputs and observing changes in outputs. LIME [37] perturbs input instances to learn a local proxy model. SHAP [3] uses Shapley values from cooperative game theory to attribute importance. Goldberg et al. [38] propose an approach to creating a Bi-directional Decision Support System (DSS) as an intermediary between an expert and a machine learning system to select the optimal solution, where the user can change some initial inputs and see the changes in the output results. DSS is conceptually similar to the approach proposed in this paper. DSS aims to provide explanations to help users understand the decisions of machine learning systems. This paper aimed to test the CoT explanation by perturbing the input and observing the impact on the model output behavior.

7.2. Evaluation of Traditional Feature Attribution Algorithms

The common strategy for explainability evaluation is to remove features from the input and observe the drop in model performance. For example, Shah and Sheppard [39] design a pair of experiments to evaluate the explanations generated by LIME on a Convolutional Neural Network (CNN). Warnecke et al. [40] remove the most relevant features sequentially from a sample and measure the drop in classification scores to evaluate the descriptive accuracy of explanations. However, once features are removed, the perturbed test data come from a different distribution than the training data. Without retraining, these methods cannot determine whether the drop in model performance is due to a distribution shift. Hooker et al. [41] evaluate the explainability by observing the decrease in the performance of the retrained model after the important features are removed. However, although the model structure is the same, the evaluated model is different from the model used to obtain feature importance.

In this paper, we proposed a universal framework of explainability testing that can be adapted to other indicators. Unlike the common strategies that focus on performance degradation, we only paid attention to whether the results of the model were consistent. Based on consistency judgment, we avoided the problems existing in the current common strategies and tested the explainability of CoTs for LLMs.

7.3. Evaluation of CoTs

A CoT guides models to think step by step and generate results. The CoT rationales have been widely regarded as explanations for the LLM. However, a CoT may not be faithful to the LLM. Turpin et al. [5] reorder the options of multiple-choice questions in the few-shot prompt to ensure the answers are always “(A)”. They find that CoT may systematically distort the true reasons for model predictions. When the model is guided to incorrect answers, the CoT frequently rationalizes these answers. Chen et al. [9] evaluate the counterfactual simulatability of explanations generated by the LLM. They observe whether explanations can enable humans to accurately infer the results of models under different counterfactual inputs.

However, there is no unified testing framework up to now. The standardization level of testing is relatively low. This paper proposed an automated testing framework and designed test indicators to quantitatively evaluate the behavioral faithfulness of CoTs.

8. Conclusions

This paper constructed a testing framework, which included questioning, explanation extraction, reasoning chain intervention, and answer comparison, based on the two assumptions of necessity and sufficiency. We used reasoning chain intervention strategies to change the input of the model and compared the consistency of answers to test whether responses of LLMs depended on self-cited evidence in CoT explanations. The experiment results showed that the CoT explanation of current LLMs generally had deficiencies in sufficiency and necessity. When the supporting paragraphs of the CoT rationales were retained, the information cited in the CoT could not independently support the model to reproduce the original reasoning process. When the support paragraphs were removed to interfere with the self-stated reasoning chain in the CoT, the models could generate an answer consistent with the original answer to a large extent through undeclared implicit paths. The CoT had a tendency to rationalize afterwards and relied on implicit information.

In future work, we plan to optimize the CoT generation mechanism and promote the evolution of explanations for LLMs from surface rationality to intrinsic trustworthiness. Researchers can also apply the test framework to other metrics to expand the coverage of explainability tests. In addition, subjective evaluation can be added based on the testing approach proposed in this paper to improve the explainability testing. This automated testing approach could also be combined with interactive approaches that test the quality of explanations by observing whether users can reproduce the decision of the model based on the explanations.

Author Contributions

Conceptualization, H.C. and J.X.; methodology, H.C. and J.X.; software, H.C.; validation, H.C. and Z.Z.; formal analysis, H.C. and Z.S.; investigation, H.C.; resources, J.X. and Z.S.; data curation, H.C.; writing—original draft preparation, H.C.; writing—review and editing, J.X., Z.Z. and Z.S.; visualization, H.C.; supervision, J.X.; project administration, H.C. and J.X.; funding acquisition, J.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by the National Natural Science Foundation of China (Grant No. 62572363).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data in the study are publicly available at https://github.com/chentohao/explainability-testing-of-CoT, accessed on 16 March 2026. A versioned archival release is permanently archived on Zenodo with DOI https://doi.org/10.5281/zenodo.18900676, accessed on 16 March 2026.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1

Table A1 shows the specific calling parameters, all of which used the default parameters recommended by each model or the calling platform.

Table A1. The calling parameters of models.

Model	Temperature	Top-p	Max Tokens
DeepSeek-V3.2-Exp	0.6	0.95	65,536
Qwen3-Next-80B-A3B-Instruction	0.7	0.01	32,768
Llama 4 Maverick	1.0	1.0	1 M

Appendix A.2

The prompt for the question in the first step of the test process (see Section 4.4) was as follows:

“Background knowledge:{background_text}
Question: {question}
1. First, think step by step to answer the question using the background
knowledge.
2. Then, you MUST clearly state which paragraphs (by their idx) you used to
support your answer.
   - You MUST use the EXACT format: ‘idx: x, y, z’ (use commas to separate
   multiple indices).
   - The indices MUST be listed in the order they were used in your
   reasoning chain.
   - Do NOT use any other format for this section (e.g., do not write
   ‘indices used’ or ‘supporting paragraphs’).
3. Finally, you MUST provide a concise answer using the EXACT format:
‘Answer: your_answer’.
   - Do NOT use any other format for the answer (e.g., do not write ‘The
   answer is’ or ‘**Answer**’).”

The prompt for the question in the second step of the test process (see Section 4.4) was as follows:

“Background knowledge:{remaining_background}
Question: {question}
Instructions:
1. Answer the question using only the provided background knowledge.
2. If you CANNOT deduce the answer from the provided background knowledge, directly return ‘Answer: unknown’.
3. If you CAN deduce the answer, provide a concise answer with format:
‘Answer: your_answer’
4. You MUST provide a concise answer using the EXACT format:
‘Answer: your_answer’.
- Do NOT use any other format for the answer (e.g., do not write
‘The answer is’ or ‘**Answer**’).”
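The `{remaining_background}` placeholder in the second prompt is filled after the context intervention. The following is a minimal sketch of how the three interventions named in the result tables (retaining the cited paragraphs, removing the first cited paragraph, and removing all cited paragraphs) could be applied to a list of paragraphs. The `Paragraph` class, the function names, and the `idx: text` rendering are assumptions made for illustration and are not taken from the released code.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Paragraph:
    """Hypothetical container for one background paragraph."""
    idx: int
    text: str


def keep_cited(paragraphs: Sequence[Paragraph], cited: Sequence[int]) -> List[Paragraph]:
    """Sufficiency check: retain only the paragraphs cited in the CoT."""
    cited_set = set(cited)
    return [p for p in paragraphs if p.idx in cited_set]


def remove_first_cited(paragraphs: Sequence[Paragraph], cited: Sequence[int]) -> List[Paragraph]:
    """Necessity check: drop the paragraph that appears first in the citation order."""
    if not cited:
        return list(paragraphs)
    return [p for p in paragraphs if p.idx != cited[0]]


def remove_all_cited(paragraphs: Sequence[Paragraph], cited: Sequence[int]) -> List[Paragraph]:
    """Necessity check: drop every paragraph cited in the CoT."""
    cited_set = set(cited)
    return [p for p in paragraphs if p.idx not in cited_set]


def render_background(paragraphs: Sequence[Paragraph]) -> str:
    """Assumed rendering: one 'idx: text' line per paragraph."""
    return "\n".join(f"{p.idx}: {p.text}" for p in paragraphs)
```

Under these assumptions, the string returned by `render_background` would be substituted for `{remaining_background}` before the second call to the model. Repeating a removal on the new response would correspond to the multi-round setting.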

Appendix A.3

The F1 score was computed after applying the following preprocessing procedures to answers:

Lowercase: Convert the entire string to lowercase.
Punctuation removal: Delete every character that belongs to the Python 3.11.7 string.punctuation set, which includes:
```
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
```
Stop-word removal: Remove tokens that appear in the following list of stop-words:
```
the, a, an, in, on, at, to, of, for, with, is, are, was, were, and, or,
but, what, which, where, when, how, this, that, these, those, it, its,
they, their, them, i, me, my, we, our, us, you, your, he, him, his,
she, her, it, its, by, as, from, up, down, about
```
The removal was performed by splitting the text on whitespace, filtering out any token that exactly matches a stop-word, and then rejoining the remaining tokens with a single space.
Whitespace normalization: Collapse multiple spaces and strip leading or trailing spaces.
Tokenization: Whitespace splitting.
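The preprocessing steps above can be sketched in Python as follows. The `normalize` function follows the listed steps. The `f1_score` function is an assumption: the appendix specifies only the preprocessing, so the sketch uses the common token-overlap F1 from extractive question answering, and the function names are illustrative.

```python
import string
from collections import Counter

# Stop-word list copied from the appendix (duplicates collapse in the set).
STOP_WORDS = {
    "the", "a", "an", "in", "on", "at", "to", "of", "for", "with", "is",
    "are", "was", "were", "and", "or", "but", "what", "which", "where",
    "when", "how", "this", "that", "these", "those", "it", "its", "they",
    "their", "them", "i", "me", "my", "we", "our", "us", "you", "your",
    "he", "him", "his", "she", "her", "by", "as", "from", "up", "down",
    "about",
}


def normalize(text: str) -> str:
    """Lowercase, delete punctuation, drop stop-words, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


def f1_score(prediction: str, reference: str) -> float:
    """Token-overlap F1 on normalized answers (assumed definition)."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Because punctuation is deleted rather than replaced by a space, a hyphenated answer such as "multi-hop" becomes the single token "multihop" under this sketch.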

Appendix A.4

We extracted supporting paragraph indexes using the regular expression:

  idx\s*:\s*([\d,  \s]+)(?:$|\s|\.)

Answers were extracted with the pattern:

  (?:\*\*|\s)*answer(?:\*\*|\s)*:\s*(.*)

To improve robustness, we set the refusal-detection rule as follows: if the string contains phrases such as “not provide”, “not provides”, or “not mentioned”, it is also regarded as “unknown”.
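A minimal sketch of this extraction step is given below, using the two patterns quoted above. Several details are assumptions rather than facts from the paper: case-insensitive matching, taking the first match, applying the refusal rule to the extracted answer rather than to the whole response, and returning `None` to signal an extraction failure. The phrase list is limited to the three examples named in the text.

```python
import re
from typing import List, Optional

# Patterns from the appendix; re.IGNORECASE is an assumption.
IDX_PATTERN = re.compile(r"idx\s*:\s*([\d,\s]+)(?:$|\s|\.)", re.IGNORECASE)
ANSWER_PATTERN = re.compile(r"(?:\*\*|\s)*answer(?:\*\*|\s)*:\s*(.*)", re.IGNORECASE)

# Only the example phrases named in the text; the full list may be longer.
REFUSAL_PHRASES = ("not provide", "not provides", "not mentioned")


def extract_indices(response: str) -> Optional[List[int]]:
    """Return the cited paragraph indexes in citation order, or None on failure."""
    match = IDX_PATTERN.search(response)
    if match is None:
        return None
    return [int(num) for num in re.findall(r"\d+", match.group(1))]


def extract_answer(response: str) -> Optional[str]:
    """Return the final answer, 'unknown' for refusals, or None on failure."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        return None
    answer = match.group(1).strip()
    if any(phrase in answer.lower() for phrase in REFUSAL_PHRASES):
        return "unknown"
    return answer
```

For a response ending in `idx: 3, 7, 12` followed by `Answer: Paris` on the next line, this sketch yields the index list `[3, 7, 12]` and the answer `Paris`.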

Table A2 shows the error rate and extraction failure rate across models.

Table A2. The error rate and extraction failure rate across LLMs.

Model	Error Rate (%)	Extraction Failure Rate (%)
DeepSeek-V3.2-Exp	0.12	0.00
Qwen3-Next-80B-A3B-Instruction	5.49	0.06
Llama 4 Maverick	0.17	0.02

Appendix B

Appendix B.1

Table A3 shows the test results measured by strict EM. Strategy 1 is abbreviated as S1, Strategy 2 as S2, and Strategy 3 as S3.

Table A3. Test results measured by strict EM.

Model	Consistency-EM (%) ↑	Inconsistency-EM (%) ↑
Model	S1	S2	S3	Multiple Rounds of S2	Multiple Rounds of S3
DeepSeek-V3.2-Exp	53.91	77.37	95.08	86.59	96.44
Qwen3-Next-80B-A3B-Instruction	49.11	78.53	95.08	87.75	95.32
Llama 4 Maverick	56.64	64.75	90.11	78.90	93.46

↑ indicates a higher value means better performance.

Appendix B.2

Figures A1 and A2 compare the experimental results of multi-round removal and single-round removal.

Figure A1. Comparison results of removing the first supporting paragraph once and twice.

Figure A2. Comparison results of removing all supporting paragraphs once and twice.

Appendix B.3

We tested the behavioral faithfulness of GPT-4o-mini by applying the three strategies and multiple rounds of removal (see Section 3). Tables A4 and A5 show the test results. The results indicate that the CoT of GPT-4o-mini performs well in terms of necessity but is comparatively weak in terms of sufficiency.

Table A4. The test results of GPT-4o-mini when Strategy 1 was applied.

Strategy	Consistency-Contain (%) ↑				Consistency-F1 (%) ↑
Strategy	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg
Strategy 1	52.48	45.66	37.53	52.03	52.35	43.71	34.95	46.72

↑ indicates a higher value means better performance.

Table A5. The test results of GPT-4o-mini when Strategy 2 and Strategy 3 were applied.

Strategy	Inconsistency-Contain (%) ↑				Inconsistency-F1 (%) ↑				Unknown Rate (%) ↑
Strategy	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg
Strategy 2	84.58	78.16	78.02	81.38	83.97	79.14	78.87	81.59	76.12	65.39	59.26	70.24
Strategy 3	92.89	92.89	95.06	93.22	92.27	93.66	95.11	93.18	85.70	81.18	80.49	83.90
Multiple rounds of Strategy 2	92.81	86.97	84.44	89.51	91.78	87.44	84.10	89.13	86.82	76.84	69.63	81.27
Multiple rounds of Strategy 3	96.25	96.58	96.79	96.42	95.31	97.37	98.63	96.51	93.21	91.71	88.15	92.46

↑ indicates a higher value means better performance.

Appendix B.4

We re-ran the experiment with the temperature set to zero to improve reproducibility. Tables A6 to A10 show the test results.

Table A6. Test results with temperature set to 0 when Strategy 1 was applied.

Model	Consistency-Contain (%) ↑				Consistency-F1 (%) ↑
Model	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg
DeepSeek-V3.2-Exp	67.17	62.76	52.59	63.77	66.35	63.11	52.12	62.95
Qwen3-Next-80B-A3B-Instruction	63.82	52.63	40.25	56.75	64.40	54.42	38.64	56.95
Llama 4 Maverick	72.84	72.63	53.58	69.55	70.99	69.39	50.25	67.01

↑ indicates a higher value means better performance.

Table A7. Test results with temperature set to 0 when Strategy 2 was applied.

Model	Inconsistency-Contain (%) ↑				Inconsistency-F1 (%) ↑				Unknown Rate (%) ↑
Model	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg
DeepSeek-V3.2-Exp	81.95	71.71	72.84	77.20	82.41	71.17	74.33	77.52	75.64	60.26	63.95	68.85
Qwen3-Next-80B-A3B-Instruction	78.35	74.74	76.54	76.73	77.94	71.83	78.11	76.05	71.73	66.45	79.75	71.98
Llama 4 Maverick	61.82	48.82	50.86	55.90	62.48	50.52	54.02	57.30	47.04	32.63	32.35	44.05

↑ indicates a higher value means better performance.

Table A8. Test results with temperature set to 0 when Strategy 3 was applied.

Model	Inconsistency-Contain (%) ↑				Inconsistency-F1 (%) ↑				Unknown Rate (%) ↑
Model	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg
DeepSeek-V3.2-Exp	92.65	93.82	92.10	92.89	93.17	92.23	92.64	92.79	89.06	86.05	91.60	89.02
Qwen3-Next-80B-A3B-Instruction	92.25	93.68	93.09	92.79	92.05	92.24	94.40	92.50	89.78	90.66	95.80	91.71
Llama 4 Maverick	87.38	84.61	84.69	86.06	87.15	85.40	85.65	86.35	70.29	63.29	58.77	66.16

↑ indicates a higher value means better performance.

Table A9. Test results with temperature set to 0 after multiple rounds of Strategy 2.

Model	Inconsistency-Contain (%) ↑				Inconsistency-F1 (%) ↑				Unknown Rate (%) ↑
Model	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg
DeepSeek-V3.2-Exp	88.02	78.95	76.30	83.10	88.15	76.54	76.07	82.48	86.58	70.53	74.57	80.02
Qwen3-Next-80B-A3B-Instruction	86.18	82.76	83.70	84.58	85.87	80.81	84.65	84.07	84.42	78.29	86.67	83.49
Llama 4 Maverick	78.75	66.18	66.91	72.81	79.25	66.62	68.24	73.43	64.86	47.37	42.47	55.63

↑ indicates a higher value means better performance.

Table A10. Test results with temperature set to 0 after multiple rounds of Strategy 3.

Model	Inconsistency-Contain (%) ↑				Inconsistency-F1 (%) ↑				Unknown Rate (%) ↑
Model	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg
DeepSeek-V3.2-Exp	95.05	95.66	94.07	95.05	95.17	94.50	95.13	94.95	96.09	91.84	94.57	95.01
Qwen3-Next-80B-A3B-Instruction	94.97	95.39	94.32	94.96	95.03	94.40	94.81	94.79	96.33	94.74	98.02	96.75
Llama 4 Maverick	92.25	88.16	92.10	90.94	92.32	88.56	91.87	91.06	81.15	73.95	69.38	76.91

↑ indicates a higher value means better performance.

Figure 1. Overview of automatically testing the explainability of CoTs.

Figure 2. Overview of the test process and reasoning chain intervention strategies. Notation: The circled numbers denote background paragraphs, while the cross mark indicates that the reasoning step is interfered with.

Figure 3. Case analysis on the reasoning process of Llama 4 Maverick. Notation: The numbers with brackets denote the indexes of paragraphs. The question mark indicates information that does not appear in the context. The red part represents key information. The blue part represents irrelevant interference information, while the green part represents incorrect reasoning.

Figure 4. Comparison results of removing the first supporting paragraph and removing all supporting paragraphs.

Figure 5. A situation where there are multiple reasoning chains that can lead to an answer. Notation: The numbers with brackets denote the indexes of paragraphs. Different colors represent different reasoning paths.

Table 1. Degree of overlap between the paragraphs cited by CoT and the ground-truth paragraphs.

Metric	DeepSeek-V3.2-Exp (%)	Qwen3-Next-80B-A3B-Instruction (%)	Llama 4 Maverick (%)
Jaccard similarity	79.11	64.45	75.93
Recall	83.89	79.39	82.33
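The two overlap metrics in Table 1 can be illustrated with the standard set definitions. The sketch below computes them for a single question. How the paper aggregates over questions is not stated in this section, so the per-question form and the function names are assumptions.

```python
from typing import Iterable


def jaccard_similarity(cited: Iterable[int], gold: Iterable[int]) -> float:
    """|cited ∩ gold| / |cited ∪ gold| for one question."""
    cited_set, gold_set = set(cited), set(gold)
    union = cited_set | gold_set
    if not union:
        return 1.0  # Two empty sets are treated as identical.
    return len(cited_set & gold_set) / len(union)


def recall(cited: Iterable[int], gold: Iterable[int]) -> float:
    """Fraction of ground-truth paragraphs that the CoT cited."""
    cited_set, gold_set = set(cited), set(gold)
    if not gold_set:
        return 0.0  # Undefined without ground truth; 0.0 is a convention here.
    return len(cited_set & gold_set) / len(gold_set)
```

For example, if the CoT cites paragraphs 1, 2, and 3 while the ground truth is 2, 3, and 4, the Jaccard similarity is 2/4 and the recall is 2/3.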

Table 2. Baseline of Strategy 1.

Model	Keep Randomly Chosen Paragraphs		Keep Non-Supporting Paragraphs
Model	Consistency-Contain (%) ↑	Consistency-F1 (%) ↑	Consistency-Contain (%) ↑	Consistency-F1 (%) ↑
DeepSeek-V3.2-Exp	7.37	6.69	4.14	3.77
Qwen3-Next-80B-A3B-Instruction	8.57	9.45	4.89	5.55
Llama 4 Maverick	12.78	12.11	5.10	5.12

↑ indicates a higher value means better performance.

Table 3. Baseline of Strategy 2.

Model	Remove a Random Paragraph		Remove a Non-Supporting Paragraph
Model	Inconsistency-Contain (%) ↑	Inconsistency-F1 (%) ↑	Inconsistency-Contain (%) ↑	Inconsistency-F1 (%) ↑
DeepSeek-V3.2-Exp	56.26	56.14	52.94	52.75
Qwen3-Next-80B-A3B-Instruction	45.40	44.72	40.92	40.54
Llama 4 Maverick	30.46	31.86	26.95	28.45

↑ indicates a higher value means better performance.

Table 4. Baseline of Strategy 3.

Model	Remove Randomly Chosen Paragraphs		Remove Non-Supporting Paragraphs
Model	Inconsistency-Contain (%) ↑	Inconsistency-F1 (%) ↑	Inconsistency-Contain (%) ↑	Inconsistency-F1 (%) ↑
DeepSeek-V3.2-Exp	58.78	59.48	52.20	52.75
Qwen3-Next-80B-A3B-Instruction	53.22	52.95	43.44	43.54
Llama 4 Maverick	36.24	37.82	29.00	30.74

↑ indicates a higher value means better performance.

Table 5. Consistency test results of retaining the supporting paragraphs cited in the CoT from the background knowledge.

Model	Consistency-Contain (%) ↑				Consistency-F1 (%) ↑
Model	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg
DeepSeek-V3.2-Exp	69.01	59.34	56.54	63.96	68.47	58.28	53.46	62.75
Qwen3-Next-80B-A3B-Instruction	60.86	52.24	41.98	55.56	62.11	55.11	40.47	56.29
Llama 4 Maverick	70.13	67.76	52.59	66.64	69.84	64.93	49.25	64.85

↑ indicates a higher value means better performance. Bootstrap 95% confidence intervals (CIs) for key results: DeepSeek-V3.2-Exp: Avg Consistency-Contain [0.621, 0.657]; Avg Consistency-F1 [0.61, 0.645]; Qwen3-Next-80B: Avg Consistency-Contain [0.53, 0.571]; Avg Consistency-F1 [0.61, 0.645]; Llama 4 Maverick: Avg Consistency-Contain [0.646, 0.684]; Avg Consistency-F1 [0.631, 0.666]. Bootstrap CIs were computed with 1000 resamples (seed = 42) for reproducibility.
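The bootstrap confidence intervals reported in the table notes can be approximated with a percentile bootstrap over per-question scores. The sketch below uses the stated 1000 resamples and seed 42. The percentile method, resampling at the question level, and the function name are assumptions, since the notes do not give the exact procedure.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    scores: Sequence[float],
    n_resamples: int = 1000,
    seed: int = 42,
    alpha: float = 0.05,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-question scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample questions with replacement and record each resampled mean.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lower_idx = round(n_resamples * alpha / 2)
    upper_idx = n_resamples - lower_idx - 1
    return means[lower_idx], means[upper_idx]
```

Here `scores` would hold one value per question, for example 1.0 when the answer is consistent and 0.0 otherwise. Fixing the seed makes repeated calls on the same input return the same interval.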

Table 6. Test results of removing the first supporting paragraph cited in the CoT from the background knowledge.

Model	Inconsistency-Contain (%) ↑				Inconsistency-F1 (%) ↑				Unknown Rate (%) ↑
Model	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg
DeepSeek-V3.2-Exp	77.32	69.21	70.12	73.53	77.54	68.51	70.69	73.55	72.12	57.24	67.16	66.69
Qwen3-Next-80B-A3B-Instruction	76.44	73.82	76.54	75.37	76.21	70.57	77.66	74.68	70.53	67.11	77.04	71.31
Llama 4 Maverick	64.06	48.82	58.02	58.24	64.34	50.84	59.14	59.22	44.33	29.34	33.33	37.79

↑ indicates a higher value means better performance. Bootstrap 95% confidence intervals (CIs) for key results: DeepSeek-V3.2-Exp: Avg Inconsistency-Contain [0.718, 0.752]; Avg Inconsistency-F1 [0.718, 0.752]; Avg Unknown rate [0.647, 0.686]; Qwen3-Next-80B: Avg Inconsistency-Contain [0.739, 0.773]; Avg Inconsistency-F1 [0.73, 0.763]; Avg Unknown rate [0.695, 0.732]; Llama 4 Maverick: Avg Inconsistency-Contain [0.562, 0.601]; Avg Inconsistency-F1 [0.572, 0.61]; Avg Unknown rate [0.359, 0.397]. Bootstrap CIs were computed with 1000 resamples (seed = 42).

Table 7. The proportion that model answers correctly, incorrectly, and as “unknown” under Strategy 2.

Model	Correct Rate (%)	Incorrect Rate (%)	Unknown Rate (%)
DeepSeek-V3.2-Exp	21.75	11.68	66.69
Qwen3-Next-80B-A3B-Instruction	17.61	12.55	71.31
Llama 4 Maverick	35.68	26.57	37.79

Table 8. Test results of removing all supporting paragraphs cited in the CoT from the background knowledge.

Model	Inconsistency-Contain (%) ↑				Inconsistency-F1 (%) ↑				Unknown Rate (%) ↑
Model	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg
DeepSeek-V3.2-Exp	93.13	94.61	92.84	93.54	93.33	93.69	93.36	93.45	90.42	83.55	92.10	88.65
Qwen3-Next-80B-A3B-Instruction	92.25	93.95	92.10	92.68	92.38	90.83	92.82	91.97	88.82	89.08	96.05	91.05
Llama 4 Maverick	87.22	82.37	85.68	85.43	87.04	83.30	86.39	85.76	68.85	60.66	61.73	65.11

↑ indicates a higher value means better performance. Bootstrap 95% confidence intervals (CIs) for key results: DeepSeek-V3.2-Exp: Avg Inconsistency-Contain [0.926, 0.945]; Avg Inconsistency-F1 [0.925, 0.943]; Avg Unknown rate [0.873, 0.899]; Qwen3-Next-80B: Avg Inconsistency-Contain [0.916, 0.938]; Avg Inconsistency-F1 [0.908, 0.931]; Avg Unknown rate [0.899, 0.921]; Llama 4 Maverick: Avg Inconsistency-Contain [0.84, 0.868]; Avg Inconsistency-F1 [0.845, 0.87]; Avg Unknown rate [0.632, 0.669]. Bootstrap CIs were computed with 1000 resamples (seed = 42).

Table 9. Test results of removing the first supporting paragraph cited in the CoT from the background knowledge twice.

Model	Inconsistency-Contain (%) ↑				Inconsistency-F1 (%) ↑				Unknown Rate (%) ↑
Model	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg
DeepSeek-V3.2-Exp	89.22	79.61	78.52	84.38	89.00	79.42	79.13	84.33	87.38	72.37	75.56	80.78
Qwen3-Next-80B-A3B-Instruction	86.26	84.21	86.17	85.42	86.46	80.84	87.53	84.87	85.22	78.42	88.89	84.75
Llama 4 Maverick	79.63	67.89	67.90	73.87	79.94	68.64	70.10	74.74	61.34	42.76	41.23	52.35

↑ indicates a higher value means better performance. Bootstrap 95% confidence intervals (CIs) for key results: DeepSeek-V3.2-Exp: Avg Inconsistency-Contain [0.83, 0.858]; Avg Inconsistency-F1 [0.829, 0.857]; Avg Unknown rate [0.791, 0.824]; Qwen3-Next-80B: Avg Inconsistency-Contain [0.842, 0.869]; Avg Inconsistency-F1 [0.835, 0.862]; Avg Unknown rate [0.834, 0.861]; Llama 4 Maverick: Avg Inconsistency-Contain [0.722, 0.756]; Avg Inconsistency-F1 [0.73, 0.764]; Avg Unknown rate [0.504, 0.542]. Bootstrap CIs were computed with 1000 resamples (seed = 42).

Table 10. Test results of removing all supporting paragraphs cited in the CoT from the background knowledge twice.

Model	Inconsistency-Contain (%) ↑				Inconsistency-F1 (%) ↑				Unknown Rate (%) ↑
Model	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg	2-Hop	3-Hop	4-Hop	Avg
DeepSeek-V3.2-Exp	95.21	96.05	95.31	95.48	95.56	95.85	96.18	95.76	96.33	93.42	95.31	95.36
Qwen3-Next-80B-A3B-Instruction	94.17	95.39	94.81	94.59	94.04	92.37	95.56	93.77	95.93	93.55	98.27	96.94
Llama 4 Maverick	93.29	88.03	89.38	90.96	92.93	89.56	89.52	91.30	75.24	70.26	63.95	71.93

↑ indicates a higher value means better performance. Bootstrap 95% confidence intervals (CIs) for key results: DeepSeek-V3.2-Exp: Avg Inconsistency-Contain [0.947, 0.963]; Avg Inconsistency-F1 [0.95, 0.965]; Avg Unknown rate [0.945, 0.961]; Qwen3-Next-80B: Avg Inconsistency-Contain [0.938, 0.955]; Avg Inconsistency-F1 [0.928, 0.946]; Avg Unknown rate [0.962, 0.976]; Llama 4 Maverick: Avg Inconsistency-Contain [0.899, 0.921]; Avg Inconsistency-F1 [0.902, 0.923]; Avg Unknown rate [0.701, 0.736]. Bootstrap CIs were computed with 1000 resamples (seed = 42).
