We tested the CoT explainability of LLMs by answering research questions. We quantitatively tested the quality of the supporting indexes cited in CoTs by comparing them with the ground-truth supporting facts provided in the MuSiQue dataset. We calculated the average Jaccard similarity and recall of the five experiments in RQ1–3 to observe the reliability of supporting paragraph indexes. Recall refers to the proportion of ground-truth indexes covered by cited indexes.
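The two overlap measures can be sketched as follows. This is a minimal illustration, assuming that cited and ground-truth supporting paragraphs are both available as lists of integer indexes; the function names and the empty-set convention are our own choices for illustration, not part of the original evaluation code.

```python
def jaccard(cited, gold):
    """Jaccard similarity between cited and ground-truth paragraph indexes."""
    cited, gold = set(cited), set(gold)
    union = cited | gold
    # Assumption: two empty index sets are treated as a perfect match.
    return len(cited & gold) / len(union) if union else 1.0


def recall(cited, gold):
    """Proportion of ground-truth indexes that are covered by cited indexes."""
    cited, gold = set(cited), set(gold)
    # Assumption: recall is 1.0 when there is nothing to cover.
    return len(cited & gold) / len(gold) if gold else 1.0


# Hypothetical example: the CoT cites paragraphs 1, 11 and 14,
# while the ground truth lists paragraphs 11 and 14.
cited_idx = [1, 11, 14]
gold_idx = [11, 14]
print(jaccard(cited_idx, gold_idx))  # 2 shared indexes out of 3 in the union
print(recall(cited_idx, gold_idx))   # both ground-truth indexes are covered
```

In this example the extra cited paragraph lowers the Jaccard similarity but leaves recall unaffected, which is why the two measures are reported together.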
Table 1 shows the degree of overlap. We evaluated the model performance on the original complete context against MuSiQue ground-truth answers. DeepSeek-V3.2-Exp obtained 68.51%, Llama 4 Maverick 66.46%, and Qwen3-Next-80B-A3B-Instruction 62.69%.
We added matched control baselines for context manipulation. For Strategy 1, we compared “keep only supporting paragraphs” to “keep the same number of randomly chosen paragraphs” and to “keep the same number of non-supporting paragraphs”. For Strategy 2 and Strategy 3, we added “remove the same number of random paragraphs” and “remove the same number of non-supporting paragraphs” as baselines.
Table 2, Table 3 and Table 4 show the baselines. The test results all exceeded the baselines, which indicates that the changes in consistency were specific to CoT-cited evidence rather than a generic effect of context reduction.
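The matched controls for Strategy 1 could be constructed along the following lines. This is a sketch under assumptions: each paragraph is represented as a dictionary with hypothetical keys "idx" and "text", the random draws use a fixed seed, and the fallback when too few non-supporting paragraphs exist is our own choice.

```python
import random


def keep_baselines(paragraphs, cited_idx, seed=0):
    """Return the Strategy 1 context and two size-matched control contexts."""
    rng = random.Random(seed)  # fixed seed so the controls are reproducible
    cited = set(cited_idx)
    supporting = [p for p in paragraphs if p["idx"] in cited]
    non_supporting = [p for p in paragraphs if p["idx"] not in cited]
    k = len(supporting)
    # Control 1: the same number of paragraphs, chosen at random from all.
    random_keep = rng.sample(paragraphs, k)
    # Control 2: the same number of non-supporting paragraphs
    # (assumption: use all of them if fewer than k are available).
    non_supp_keep = rng.sample(non_supporting, min(k, len(non_supporting)))
    return supporting, random_keep, non_supp_keep
```

The removal baselines for Strategy 2 and Strategy 3 can be built in the same way by sampling the paragraphs to delete instead of the paragraphs to keep.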
5.1. Will the LLM Output an Answer Consistent with the Original Answer When Strategy 1 Is Applied?
To answer RQ1, we input the original background knowledge into the LLM and asked it to answer the question. We retained the supporting paragraphs cited in the CoT based on the index extracted from the CoT and removed the others. We re-input the filtered background knowledge and questioned the LLM again. Then, we compared whether the two answers output by the LLM were consistent.
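The two-pass procedure described above can be sketched as follows. The callable ask_llm is a hypothetical stand-in for the model call and is assumed to return the answer together with the paragraph indexes extracted from the CoT; the paragraph representation with "idx" and "text" keys is likewise an assumption.

```python
def keep_cited(paragraphs, cited_idx):
    """Strategy 1: keep only the paragraphs whose index is cited in the CoT."""
    cited = set(cited_idx)
    return [p for p in paragraphs if p["idx"] in cited]


def strategy1_pair(question, paragraphs, ask_llm):
    """Ask on the full context, then ask again on the filtered context."""
    # ask_llm is hypothetical: it returns (answer, cited_indexes).
    first_answer, cited_idx = ask_llm(question, paragraphs)
    filtered = keep_cited(paragraphs, cited_idx)
    second_answer, _ = ask_llm(question, filtered)
    return first_answer, second_answer
```

The pair of answers returned by strategy1_pair is what the consistency metrics are then computed on.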
Table 5 shows the test results when Strategy 1 was applied. We calculated the consistency for two-hop, three-hop and four-hop questions using the Contain and F1 evaluation metrics. Avg represents the average scores. We used the containment rule to make the metric more robust to superficial variations in model outputs that do not change the semantic content. Because containment is more lenient than exact match (EM), we assessed the extent of potential overestimation. We conducted manual validation on a stratified sample of instances that Contain judged as consistent but EM judged as inconsistent. These were the samples at highest risk of being false positives. The samples were drawn proportionally by hop count: 90 from two-hop, 45 from three-hop, and 15 from four-hop, for a total of 150 instances. We manually determined whether the answers were semantically equivalent. The results showed that 119 out of 150 instances (79.33%) were semantically equivalent, which indicates that, in most cases where the two metrics disagreed, Contain captured semantic equivalence that EM missed. Based on this analysis, we adopted Contain as the primary metric. We report the EM results in Appendix B.1.
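To make the difference between the three metrics concrete, the sketch below shows one common way to implement them. The normalization steps (lowercasing, removing ASCII punctuation and English articles) and the symmetric substring rule for Contain are assumptions borrowed from standard question-answering evaluation practice; the exact rules used in our experiments may differ in detail.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(a, b):
    return normalize(a) == normalize(b)


def contain(a, b):
    """Assumed rule: consistent if either normalized answer contains the other."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return a == b
    return a in b or b in a


def token_f1(a, b):
    """Token-level F1 between two normalized answers."""
    ta, tb = normalize(a).split(), normalize(b).split()
    if not ta or not tb:
        return float(ta == tb)
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(ta)
    recall = overlap / len(tb)
    return 2 * precision * recall / (precision + recall)


# Hypothetical answer pair that differs only in surface form.
first, second = "Bai Yun", "The answer is Bai Yun."
print(exact_match(first, second), contain(first, second))
```

For this pair, EM reports a mismatch while Contain reports a match, which is the kind of disagreement examined in the manual validation. Note that a character-level substring test can also accept unrelated answers that happen to share a substring, which is one source of the potential overestimation.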
The average consistency-Contain of the three large language models was around 60%. Llama 4 Maverick had the highest value (66.64%), followed by DeepSeek-V3.2-Exp (63.96%); Qwen3-Next-80B-A3B-Instruction had the lowest (55.56%). Consistency-F1 followed the same trend as consistency-Contain, ranging from 56% to 65%. DeepSeek-V3.2-Exp performed better when the number of hops was high. We performed statistical analyses on inconsistency-Contain scores to assess the model differences. Between-model comparisons via one-way ANOVA with Tukey HSD post hoc tests revealed significant overall differences at each hop count (all p < 0.001). At two hops, DeepSeek-V3.2-Exp and Llama 4 Maverick significantly outperformed Qwen3-Next-80B-A3B-Instruction (p < 0.001). At three hops, all pairwise differences were significant (p < 0.05), with Llama 4 Maverick leading. At four hops, DeepSeek-V3.2-Exp and Llama 4 Maverick both exceeded Qwen3-Next-80B-A3B-Instruction (p < 0.01). If the CoT fully reflected the decision-making behavior of the model, retaining only the cited supporting paragraphs should enable the model to generate consistent answers, with a consistency rate close to 100%. However, the actual result was far lower than this expectation. The inconsistent responses observed after intervention suggest behavior resembling ex-post rationalization, which appears most prominently in the CoT of Qwen3-Next-80B-A3B-Instruction. From a behavioral perspective, the information declared by the CoT is insufficient to support complete inference. In addition, the consistency rate decreased as the number of hops increased. We performed paired t-tests to evaluate the hop-related degradation. The results showed that Qwen3-Next-80B-A3B-Instruction exhibited significant declines between all consecutive hop counts (two hops → three hops: p < 0.001; three hops → four hops: p = 0.003). As the complexity increased, the proportion of cases in which the models could derive the original answer from the supporting paragraphs decreased.
The model could not fully reproduce the original reasoning process when only the supporting paragraphs cited in the CoT were retained, possibly because it relied on other implicit information that was not cited in the CoT. This phenomenon indicates that CoTs can only partially reflect the decision-making basis of the model and cannot fully cover all key information. There is a significant defect in the sufficiency of the explanation.
To reduce confounds from context length changes, we conducted a length-controlled replacement experiment. We replaced non-supporting paragraphs with an equal-length ellipsis placeholder while keeping the supporting paragraphs unchanged. This ensured that the total context length was the same as that of the original input. We re-ran the consistency evaluation under this length-controlled setting on Qwen3-Next-80B-A3B-Instruction. We conducted a paired Wilcoxon signed-rank test between the non-supporting removal setting and the length-controlled replacement setting. The results showed no significant differences in inconsistency-Contain, inconsistency-F1, or unknown rate (all p > 0.05). This suggests that context length did not confound our main findings.
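The length-controlled replacement can be sketched as follows. This illustration assumes that length is measured in characters and that the placeholder is a run of dots; the paragraph representation with "idx" and "text" keys is hypothetical, and a token-based length would require the tokenizer of the model under test.

```python
def length_controlled(paragraphs, cited_idx, filler="."):
    """Replace non-cited paragraphs with a placeholder of the same length."""
    # Assumption: filler is a single character, so lengths match exactly.
    cited = set(cited_idx)
    out = []
    for p in paragraphs:
        if p["idx"] in cited:
            out.append(dict(p))  # supporting paragraph kept unchanged
        else:
            out.append({**p, "text": filler * len(p["text"])})
    return out
```

Because every paragraph keeps its position and length, any change in the answers relative to plain removal can be attributed to the removed content rather than to a shorter prompt.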
Conclusion. When only the supporting paragraphs cited in the CoT were retained, the average consistency of the model answers was below 70%, which means the CoT explanation was insufficient. In more than 30% of cases, the supporting paragraphs declared by the CoT were not sufficient conditions for the decision behavior of the model.
5.2. Will the LLM Output an Answer Inconsistent with the Original Answer When Strategy 2 and Strategy 3 Are Applied?
To answer RQ2, after questioning the LLM, we removed the supporting paragraphs involved in the CoT. We re-input the modified background knowledge and questioned the LLM again. Then, we compared whether the two answers output by the LLM were inconsistent.
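The two removal operations can be sketched as below. The sketch assumes that the cited indexes are listed in the order in which the CoT cites them, so that the first element identifies the first supporting paragraph; the function names and the paragraph representation with an "idx" key are hypothetical.

```python
def remove_first_cited(paragraphs, cited_idx):
    """Strategy 2: remove only the first paragraph cited in the CoT."""
    if not cited_idx:
        return list(paragraphs)  # nothing cited, nothing to remove
    first = cited_idx[0]
    return [p for p in paragraphs if p["idx"] != first]


def remove_all_cited(paragraphs, cited_idx):
    """Strategy 3: remove every paragraph cited in the CoT."""
    cited = set(cited_idx)
    return [p for p in paragraphs if p["idx"] not in cited]
```

After either removal, the question is asked again on the remaining paragraphs and the new answer is compared with the original one.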
Table 6 shows the test results when Strategy 2 was applied.
The average inconsistency-Contain of all three models was below 76%, with Qwen3-Next-80B-A3B-Instruction having the highest (75.37%), DeepSeek-V3.2-Exp following closely (73.53%), and Llama 4 Maverick having the lowest (58.24%). Inconsistency-F1 followed the same trend as inconsistency-Contain. Llama 4 Maverick returned consistent answers in more than 40% of the cases after the first supporting paragraph cited in its CoT rationales was removed. Even when the first supporting paragraph was removed, the model could still give an answer consistent with the original answer based on other information (such as the remaining paragraphs and prior knowledge obtained during training). In these cases, the reasoning steps described by the CoT were not necessary conditions for decision-making.
The average unknown rate of Qwen3-Next-80B-A3B-Instruction was the highest, reaching 71.31%. After the first supporting paragraph was removed, the model was unable to generate a clear answer from the remaining context on over 70% of the samples. This indicates a relatively strong behavioral link between the cited paragraphs and the model’s decisions. Llama 4 Maverick performed the worst, with an average unknown rate of only 37.79%. This indicates a gap between the model’s behavior and its self-described reasoning path: even though the supporting paragraph was removed, the model still provided an answer.
We selected a case to illustrate what happens when the second-step answer of Llama 4 Maverick was not “unknown”. In this case, the model made inference errors after the first supporting paragraph cited in the CoT was removed and returned an unrelated answer, as shown in Figure 3. The red part in the figure represents key information. The blue part represents irrelevant interference information, while the green part represents incorrect reasoning. When the background paragraph stating that “Philippe, Duke of Orléans, was the younger son of Louis XIII of France” is removed, the model assumes that “Philippe, Duke of Orléans” refers to “Louis Philippe I”. The model first derives “Louis Philippe II and his wife are the grandparents of François d’Orléans” based on the interference information. Then, the model makes its first inference error. Françoise Marie de Bourbon is the wife of Philippe II. According to the model’s own preceding analysis, it should conclude that Françoise Marie de Bourbon is the grandmother of François d’Orléans. However, the model concludes that “Françoise Marie de Bourbon is the grandmother of Louis Philippe I”. Moreover, the subsequent reasoning of the model does not use this information. Instead, it directly concludes “For François d’Orléans, his paternal grandmother would be the mother of Louis Philippe I. Louis Philippe I’s mother was Louise Marie Adélaïde de Bourbon”. It is worth noting that “Louise Marie Adélaïde de Bourbon” does not appear in the context, so the name must have come from the LLM’s memory rather than from the input. Then, the model makes another inference error and concludes that “Louis Philippe I…, his grandmother on the father’s side is Louise Marie Adélaïde de Bourbon”, which changes the “mother” of the previous conclusion to “grandmother”. Finally, the model gives the wrong answer: “Louise Marie Adélaïde de Bourbon”. This case shows that the large language model relied on matching similar text and on memory to produce a guessed answer. The CoT provided by the LLM was not reliable in this case.
As shown in Table 7, we report how often the models produced an incorrect but confident answer, a correct answer, or “unknown” under Strategy 2. The results show that Llama 4 Maverick is inclined to produce an incorrect but confident answer rather than “unknown”. Its relatively high correct rate compared to the other two models also suggests that Llama 4 Maverick may answer based on prior knowledge or shortcuts.
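The three-way breakdown can be illustrated with the following sketch. It assumes that abstentions appear as the literal answer “unknown” and that correctness is judged by a simple containment check against the ground-truth answer; both rules and all names are simplifications chosen for illustration.

```python
from collections import Counter


def classify(answer, gold):
    """Label one answer as 'unknown', 'correct' or 'incorrect'."""
    a = answer.strip().strip(".!?").lower()
    g = gold.strip().lower()
    if a in ("", "unknown"):  # assumed abstention format
        return "unknown"
    return "correct" if g in a else "incorrect"


def outcome_rates(pairs):
    """Share of each outcome over (answer, gold) pairs."""
    labels = ("correct", "incorrect", "unknown")
    if not pairs:
        return {label: 0.0 for label in labels}
    counts = Counter(classify(a, g) for a, g in pairs)
    return {label: counts[label] / len(pairs) for label in labels}
```

An answer counted as "incorrect" here is one that the model gave instead of abstaining, which corresponds to the incorrect but confident category.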
The test results for removing all supporting paragraphs are shown in Table 8. The average inconsistency-Contain of all three models exceeded 85%, with DeepSeek-V3.2-Exp having the highest (93.54%), followed by Qwen3-Next-80B-A3B-Instruction (92.68%), and Llama 4 Maverick having the lowest (85.43%). The inconsistency rates under Strategy 3 were substantially higher than those under Strategy 2. As shown in Figure 4, after all the supporting paragraphs cited in the CoT rationales were removed, the models still output answers consistent with the original ones on some samples, but the overall inconsistency was higher than when only the first supporting paragraph was removed. This indicates that when the complete information basis cited in the CoT is removed, the ability of the model to generate the original answers is limited. When only the first supporting paragraph is removed, the reasoning chain is affected, but the LLM may still guess the answer based on the context related to the question. After all supporting paragraphs are removed, almost all relevant context is discarded, and it becomes much more difficult for the models to obtain the same answer through matching. However, the CoT still does not fully cover the necessary basis for model decision-making. Although the deficiency in the necessity of the explanation is alleviated, it is not eliminated.
The average unknown rate of Qwen3-Next-80B-A3B-Instruction is the highest when all supporting paragraphs are removed, reaching 91.05%. The model cannot generate a clear answer from the remaining knowledge on over 90% of the samples. DeepSeek-V3.2-Exp ranks second, with an average unknown rate of 88.65%, and its overall performance is similar to that of Qwen3-Next-80B-A3B-Instruction. This abstention behavior might be related to the conservative response strategy or safety mechanism of these models. Llama 4 Maverick still performs the worst, with an average unknown rate of only 65.11%. Even when all supporting paragraphs are removed, the model generates clear answers on over 30% of the samples. Although the inconsistency rate is high after all supporting paragraphs are removed, the low unknown rate reflects that the model is more inclined to guess an incorrect answer than to answer “unknown”. Such guessed answers are likely to come with flawed reasoning steps in the CoT rationales, which severely affects the quality of the CoT as an explanation.
Conclusion. After the supporting paragraphs cited in the CoT rationales were removed, all three models still produced some outputs that were consistent with the original answer, which indicates that there is a gap between the CoT explanation and model behavior. The self-described CoT appears to be a post hoc rationalization: it serves as the reasoning basis that the model chooses to present, not necessarily as the basis that the model must rely on. The necessity of the CoT explanation is limited.
5.3. How Is the Consistency Between the Model Outputs After Multiple Rounds of Strategy 2 and Strategy 3?
The distributed representation of LLMs allows for multiple reasonable reasoning chains. A correct answer may be obtained through different combinations of evidence. Faithfulness, as defined behaviorally, is not about identifying a unique ground-truth chain but about measuring the influence of cited evidence on the output. When multiple paths exist, deleting one set of evidence may not change the answer if another path remains accessible. Our multi-round removal strategy was designed to address this multi-path situation. For example, as shown in Figure 5, when the question is “What type of animal is Xiao Liwu’s mother”, the standard problem decomposition given by the dataset is to deduce from the 11th paragraph that Xiao Liwu’s mother is Bai Yun and then to learn from the 14th paragraph that Bai Yun is a panda. However, other implicit information can also lead to the answer. According to the first paragraph, Xiao Liwu is Zhen Zhen’s full sibling, and Zhen Zhen’s mother is Bai Yun, so Xiao Liwu’s mother is also Bai Yun. Then, based on the 14th paragraph, it can be inferred that Bai Yun is a panda, which yields the final answer: Xiao Liwu’s mother is a panda. For cases where the model could still obtain the answer after the first disruption of the reasoning chain, we adopted a multi-round removal strategy. We removed context again based on the CoT of the model’s second response and compared the final answer with the original answer, which tests explainability under a looser criterion. The experimental results are shown in Table 9 and Table 10.
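The multi-round procedure can be sketched as a short loop. The callable ask_llm is a hypothetical stand-in that returns an answer and the paragraph indexes cited in its CoT, and fake_llm below is a toy stub that mimics the two evidence paths of the Xiao Liwu example; neither reflects the actual prompts or model behavior.

```python
def remove_first(paragraphs, cited_idx):
    """Remove the first paragraph cited in the CoT (assumed citation order)."""
    if not cited_idx:
        return list(paragraphs)
    return [p for p in paragraphs if p["idx"] != cited_idx[0]]


def multi_round(question, paragraphs, ask_llm, remove, rounds=2):
    """Remove cited context after each response; return original and final answers."""
    original, cited = ask_llm(question, paragraphs)
    answer, context = original, list(paragraphs)
    for _ in range(rounds):
        if not cited:  # the model cited nothing, so there is nothing to remove
            break
        context = remove(context, cited)
        answer, cited = ask_llm(question, context)
    return original, answer


def fake_llm(question, paragraphs):
    """Toy stub: answers via paragraphs 11 and 14, or via 1 and 14."""
    idxs = {p["idx"] for p in paragraphs}
    if {11, 14} <= idxs:
        return "panda", [11, 14]
    if {1, 14} <= idxs:
        return "panda", [1, 14]
    return "unknown", []


paragraphs = [{"idx": i, "text": "paragraph %d" % i} for i in range(1, 21)]
print(multi_round("q", paragraphs, fake_llm, remove_first, rounds=1))
print(multi_round("q", paragraphs, fake_llm, remove_first, rounds=2))
```

With the stub, one round of removal leaves the alternative path intact and the answer unchanged, while the second round removes that path as well and the stub abstains.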
The experimental results show that when the first supporting paragraph was removed twice, Llama 4 Maverick performed the worst, with an average inconsistency-Contain of 73.87%, and its average unknown rate (52.35%) was much lower than that of the other two models. This is consistent with Llama 4 Maverick depending most heavily on information that is not declared in its CoT. Even after multiple rounds of removing the first supporting paragraph, the model still provided consistent answers on over 25% of the samples. After multiple rounds of reasoning chain intervention, all three large language models still had some outputs that were consistent with the original answers.
The intervention effect of removing all supporting paragraphs twice was significantly stronger. The average inconsistency-Contain of the three models exceeded 90%, with DeepSeek-V3.2-Exp having the highest (95.48%), followed by Qwen3-Next-80B-A3B-Instruction (94.59%). The average unknown rates of DeepSeek-V3.2-Exp and Qwen3-Next-80B-A3B-Instruction both exceeded 95%, which means that the models were unable to generate answers based on the remaining knowledge on over 90% of the samples. Although the average inconsistency-Contain of Llama 4 Maverick increased to 90.96%, its average unknown rate (71.93%) was still lower than that of the other two models. This suggests that Llama 4 Maverick may be more inclined to fabricate an answer, even a wrong one, based on loosely relevant information or on memory.
We compared the test results under multiple rounds of removal with those under a single round of removal. The comparison results can be found in Appendix B.2. To evaluate the impact of multi-round removal of supporting paragraphs on model performance, we conducted paired t-tests and Wilcoxon signed-rank tests (α = 0.05) on inconsistency-Contain, inconsistency-F1, and unknown rate. For all three LLMs, multi-round removal led to significantly higher inconsistency and unknown rates compared to single-round removal (all p < 0.05), which confirms that repeated removal of supporting paragraphs increases model inconsistency and reduces the ability of the models to generate valid answers. As more relevant context is discarded, it becomes more difficult for the model to output the same answer. Removing all background paragraphs cited in the CoT leaves less information related to the answer than removing only the first paragraph cited in the CoT. Multiple rounds of removal delete more relevant information than a single round of removal. When only a small amount of relevant information is removed, the self-narrated inference chain in the CoT is broken, while some information related to the answer is still retained. This setting is therefore a sharper test of the CoT: if the LLM can still provide a consistent answer, it may be relying on matching rather than on the reasoning behavior described in the CoT.
Conclusion. Considering that multiple reasoning chains can lead to the final answer, we conducted multiple rounds of removal to repeatedly intervene in the self-stated reasoning chains in the CoT and test explainability. Compared with single-round removal, the LLMs showed higher inconsistency and unknown rates under multiple rounds of paragraph removal. Because a large amount of relevant information was discarded, it became more difficult for the LLMs to obtain consistent answers through matching.