1. Introduction
Hallucination remains a major reliability problem in LLMs, especially when fluent outputs hide factual errors that users may not notice [1,2]. In deployed systems, such errors can move directly into downstream automation rather than remain isolated text mistakes, so the issue is not only about surface fluency but also about system behavior under use. This problem becomes harder under resource constraints. Edge-facing systems increasingly rely on compressed or quantized LLMs, but the corresponding safety checks still need to be accurate, inexpensive, and easy to add to the inference pipeline without adding another costly pass. Recent articles in Electronics reflect the same concern through work on rule-augmented LLM systems, retrieval-augmented generation (RAG), and compression techniques for deployable AI pipelines [3,4,5]. Recent low-bit deployment methods such as QLoRA, AWQ, and MobileQuant also show how central quantization has become for on-device and edge-oriented LLM use [6,7,8].
Most detection methods fall into two broad groups. Extrinsic methods compare generated content against retrieved evidence or external knowledge. They can improve factuality, but they also add latency and system complexity. Intrinsic methods instead rely on self-consistency, uncertainty, or sampling-based diagnostics [9,10]. These methods are easier to attach to an existing model, but repeated decoding or multi-sample validation may still be too expensive for edge deployment because the cost grows with each additional check. Recent white-box work further suggests that internal activations themselves retain useful signals for truthfulness and hallucination detection [11,12]. At the same time, edge-oriented deployment increasingly depends on aggressive quantization, which reduces memory pressure but may also distort the very internal states on which lightweight monitoring would rely [5,6,7,8]. The central question is therefore not only whether hallucinations can be detected, but whether they can be detected efficiently within a single forward pass during quantized inference. This distinction also matters methodologically: controlled target formulations and free-running generation do not expose exactly the same failure modes once error accumulation and exposure bias enter the decoding process [13].
This paper takes an empirical, engineering-oriented white-box view of the problem. Rather than presenting a new detection algorithm, we ask three narrower questions: where truthful and incorrect responses become most separable in the network, which lightweight internal feature family offers the best balance of reliability and simplicity under a shared candidate-answer benchmark, and whether the same picture survives limited scale-up and auxiliary compression checks. We address these questions with a repeated cross-architecture benchmark on Qwen2.5-1.5B-Instruct and Llama-3.2-1B-Instruct, a preliminary larger-model two-seed probe on Qwen2.5-7B-Instruct, and auxiliary Int8 checks on Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct. The evidence does not support a strong claim that one detector is uniformly superior in every setting. It instead supports a more useful engineering conclusion: separability tends to peak in the middle-to-late part of the network, and simple residual-space statistics already provide practical guidance for low-cost monitoring under resource constraints.
This framing leads to three contributions. First, the paper provides cross-model evidence that the strongest separability of truthful and incorrect internal states appears in the middle-to-late layers, roughly 50–70% of total depth rather than at a single absolute layer index. Second, it benchmarks three white-box detectors under the same five-fold protocol and repeated reruns, showing that the depth-relative localization is more stable than any single detector ranking and that simple residual-space statistics remain competitive with fused alternatives. Third, it adds supporting checks beyond the main benchmark: a recovered preliminary two-seed Qwen2.5-7B BF16 probe under the same protocol and auxiliary Qwen Int8 checks at 1.5B and 7B showing that the peak-layer pattern remains visible after quantization. Taken together, the study treats internal hallucination detection less as a race for a more complicated detector and more as a question of representational dynamics, detector choice, and deployment tradeoffs.
2. Materials and Methods
2.1. Study Design, Models, and Hardware
The paper combines two complementary experiment tracks. The first is a controlled cross-architecture benchmark in which Qwen2.5-1.5B-Instruct and Llama-3.2-1B-Instruct are evaluated under the same five-fold white-box protocol. These models were selected because they allow repeated cross-validation on limited hardware while still spanning different open model families, different depths, and different attention configurations (both use grouped-query attention, but with different numbers of query and key-value heads). This choice makes it possible to test whether the observed layerwise phenomenon is tied to one architecture or persists across model families. All runs in this benchmark used bfloat16 (BF16) for numerical stability and were executed locally on a workstation built around an AMD Ryzen AI Max+ 395 APU with integrated Radeon 8060S graphics (Advanced Micro Devices, Santa Clara, CA, USA) and 128 GB of unified memory. The host system ran Ubuntu 25.10, and the software stack used a local ROCm container based on local/rocm-llm-runtime:py312, with Python 3.12.3, PyTorch 2.9.1+rocm7.2.0, and transformers 4.57.6; auxiliary data-processing, evaluation, and quantization packages were installed as required for individual runs. To test whether the detector ranking survives compression in the same benchmark setting, we additionally reran the Qwen2.5-1.5B benchmark under 8-bit integer (Int8) quantization using Quanto.
The second track is a larger-model scale-up check on Qwen2.5-7B-Instruct [14]. This branch uses the same local workstation and software stack, and it is evaluated under the same candidate-answer white-box protocol rather than as a separate end-to-end deployment benchmark. Two independent BF16 reruns are reported in the Results section as a preliminary scale-up probe to test whether the middle-to-late peak remains visible at a larger scale. After that BF16 branch was stabilized, we also completed one auxiliary Int8 seed at 7B under the same protocol as a confirmatory low-bit check. This larger-model branch is treated as supporting evidence for layerwise localization rather than as a full architecture sweep or a complete variance estimate at 7B scale.
2.2. Datasets and Sample Construction
TruthfulQA was used as the main benchmark for truthfulness-sensitive question answering [2]. CommonsenseQA was added to test whether the same internal signals remain measurable in a structured commonsense setting [15]. For the cross-architecture benchmark, each question was converted into a pair of candidate-answer examples: a positive example that used the reference answer and a negative example that used a sampled distractor. The primary benchmark summary in this revision reports five independent reruns with 256 examples per dataset and a positive rate of 0.5; an additional 512-example stress test is used only to check whether the same depth-relative pattern remains visible at a larger sample size. Earlier 128-example development runs are retained for continuity with the initial experimental pass, but they are not the main quantitative evidence reported below. The goal of this design was not to imitate the exact surface form of free generation, but to create a fair and reproducible white-box benchmark in which all detectors operate on the same internal states.
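The pairing step can be sketched as follows. This is a minimal illustration, not the benchmark code: the record fields (`question`, `reference`, `distractors`), the helper name `build_candidate_pairs`, and the assumption that 256 examples correspond to 128 questions with one positive and one negative each are all hypothetical.

```python
import random

def build_candidate_pairs(records, n_examples, seed=0):
    """Turn question records into labeled (question, answer, label) examples.

    Each sampled question yields one positive example (reference answer,
    label 1) and one negative example (a sampled distractor, label 0),
    so the positive rate is 0.5 by construction.
    """
    rng = random.Random(seed)
    chosen = rng.sample(records, n_examples // 2)
    examples = []
    for rec in chosen:
        examples.append((rec["question"], rec["reference"], 1))
        examples.append((rec["question"], rng.choice(rec["distractors"]), 0))
    rng.shuffle(examples)
    return examples

# Toy records standing in for TruthfulQA or CommonsenseQA items (hypothetical).
toy_records = [
    {
        "question": f"q{i}",
        "reference": f"true{i}",
        "distractors": [f"false{i}a", f"false{i}b"],
    }
    for i in range(200)
]

examples = build_candidate_pairs(toy_records, n_examples=256)
positive_rate = sum(label for _, _, label in examples) / len(examples)
```

Fixing the seed per rerun keeps the sampled questions and distractors reproducible, which is what allows all detectors to be compared on identical internal states.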
2.3. Residual, Mahalanobis, and Fused Spectral Representations
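Let $\mathbf{h}_l \in \mathbb{R}^d$ denote the residual-state vector of the final token at layer $l$. Following the probing literature [16], we first test whether truthfulness is approximately linearly separable at different depths by fitting a logistic probe to the residual state in Equation (1):
$$p(y = 1 \mid \mathbf{h}_l) = \sigma\left(\mathbf{w}_l^\top \mathbf{h}_l + b_l\right), \tag{1}$$
where $\sigma$ is the logistic sigmoid and $y = 1$ indicates a truthful response. Probe parameters are estimated by minimizing the binary cross-entropy objective over the $N$ training examples in Equation (2):
$$\mathcal{L}(\mathbf{w}_l, b_l) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right], \qquad p_i = p\left(y = 1 \mid \mathbf{h}_l^{(i)}\right). \tag{2}$$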
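A logistic probe of this kind can be sketched in a few lines of NumPy. The example below is illustrative only: the plain gradient-descent optimizer, the learning rate, the small L2 penalty, and the synthetic 16-dimensional "residual states" are assumptions made for the sketch and are not taken from the study.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bce(p, y, eps=1e-12):
    """Mean binary cross-entropy between probabilities p and labels y."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def fit_logistic_probe(H, y, lr=0.1, n_steps=500, l2=1e-3):
    """Fit p(y=1|h) = sigmoid(w.h + b) by gradient descent on the mean BCE.

    H: (n_samples, d) residual states at one layer; y: (n_samples,) in {0, 1}.
    The optimizer and the L2 penalty are assumptions of this sketch.
    """
    n, d = H.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_steps):
        p = sigmoid(H @ w + b)
        w -= lr * (H.T @ (p - y) / n + l2 * w)  # gradient of BCE plus L2 term
        b -= lr * float(np.mean(p - y))
    return w, b

# Synthetic stand-in for residual states: class 1 is shifted along one axis.
rng = np.random.default_rng(0)
y = np.repeat([0.0, 1.0], 128)
direction = np.zeros(16)
direction[0] = 1.0
H = rng.normal(size=(256, 16)) + 3.0 * y[:, None] * direction

w, b = fit_logistic_probe(H, y)
final_loss = bce(sigmoid(H @ w + b), y)
```

Running the same fit once per layer and comparing held-out scores is what produces a layer-wise separability curve.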
To test whether a purely statistical white-box detector is sufficient, we also evaluate a Mahalanobis baseline on the same residual states. For numerical stability, each training fold is first projected into a lower-dimensional subspace by a fold-specific PCA operator $P$, after which a class-conditioned shrinkage covariance is estimated. The class-wise Mahalanobis distance of a residual state $\mathbf{h}_l$ is defined in Equation (3):
$$D_c(\mathbf{h}_l) = \sqrt{\left(P \mathbf{h}_l - \boldsymbol{\mu}_c\right)^\top \Sigma_c^{-1} \left(P \mathbf{h}_l - \boldsymbol{\mu}_c\right)}, \tag{3}$$
where $\boldsymbol{\mu}_c$ and $\Sigma_c$ denote the class-specific mean and shrinkage covariance for class $c \in \{0, 1\}$. The Mahalanobis score is computed from the difference between the truthful and incorrect class distances, which makes the baseline sensitive to distributional shifts in the residual space rather than only to a single separating hyperplane.
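The Mahalanobis baseline can be sketched as below. Several details are assumptions of this sketch rather than facts from the study: PCA is computed by SVD with 8 components, the shrinkage is a fixed-weight blend toward a scaled identity, and the score is oriented as incorrect-class distance minus truthful-class distance so that higher values indicate the truthful class.

```python
import numpy as np

def fit_mahalanobis(H_train, y_train, n_components=8, shrinkage=0.1):
    """Fit fold-specific PCA plus class-conditioned shrinkage covariances."""
    center = H_train.mean(axis=0)
    # PCA via SVD of the centered training fold; rows of Vt are components.
    _, _, Vt = np.linalg.svd(H_train - center, full_matrices=False)
    P = Vt[:n_components]
    Z = (H_train - center) @ P.T
    stats = {}
    for c in (0, 1):
        Zc = Z[y_train == c]
        cov = np.cov(Zc, rowvar=False)
        # Assumed shrinkage target: identity scaled by the average variance.
        target = np.trace(cov) / cov.shape[0] * np.eye(cov.shape[0])
        cov_shrunk = (1.0 - shrinkage) * cov + shrinkage * target
        stats[c] = (Zc.mean(axis=0), np.linalg.inv(cov_shrunk))
    return center, P, stats

def mahalanobis_score(H, center, P, stats):
    """Return D_0 - D_1 per sample (higher means closer to the truthful class)."""
    Z = (H - center) @ P.T
    dist = {}
    for c, (mu, precision) in stats.items():
        diff = Z - mu
        dist[c] = np.sqrt(np.einsum("ij,jk,ik->i", diff, precision, diff))
    return dist[0] - dist[1]

# Synthetic stand-in for residual states: class 1 is shifted along one axis.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 128)
direction = np.zeros(16)
direction[0] = 1.0
H = rng.normal(size=(256, 16)) + 3.0 * y[:, None] * direction

center, P, stats = fit_mahalanobis(H, y)
scores = mahalanobis_score(H, center, P, stats)
```

At inference time only `center`, `P`, and the two precomputed (mean, precision) pairs are needed, so scoring one sample costs a projection and two small quadratic forms.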
To complement residual-space features, we represent attention as a weighted graph over token positions. For each head $h$ in layer $l$, the attention matrix $A^{(l,h)}$ defines the edge weights and the corresponding combinatorial Laplacian is given in Equation (4):
$$L^{(l,h)} = D^{(l,h)} - A^{(l,h)}, \tag{4}$$
where $D^{(l,h)}$ is the corresponding degree matrix. Spectral analysis of graph-structured signals is widely used to characterize whether information flow is concentrated or diffuse [17]. We summarize the spectrum of each head through the normalized spectral entropy in Equation (5),
$$S^{(l,h)} = -\sum_{i} \tilde{\lambda}_i \log \tilde{\lambda}_i, \tag{5}$$
where $\tilde{\lambda}_i$ denotes the normalized eigenvalues of $L^{(l,h)}$. A steeper spectrum indicates concentrated attention, whereas a flatter spectrum implies a more diffuse information flow. In the revised benchmark, no single attention head is selected. Instead, the head-level spectral entropies are fused within each layer through their mean and standard deviation as shown in Equation (6),
$$\bar{S}^{(l)} = \frac{1}{H_l} \sum_{h=1}^{H_l} S^{(l,h)}, \qquad s^{(l)} = \sqrt{\frac{1}{H_l} \sum_{h=1}^{H_l} \left(S^{(l,h)} - \bar{S}^{(l)}\right)^2}, \tag{6}$$
where $H_l$ is the number of heads in layer $l$. The fused spectra-linear representation in Equation (7) is then
$$\mathbf{z}_l = \left[\mathbf{h}_l;\ \bar{S}^{(l)};\ s^{(l)}\right], \tag{7}$$
where $\mathbf{h}_l$ is the residual state of the final token at layer $l$, and the corresponding detector is a logistic probe trained on $\mathbf{z}_l$. In the revised paper, this fused representation is treated as one feature family among several rather than as an assumed best detector.
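The spectral descriptors can be sketched as follows. This is an illustration under stated assumptions: the attention matrix is symmetrized before forming the Laplacian so that the eigenvalues are real, eigenvalues are normalized to sum to one, the standard deviation is the population form, and the fused vector is taken to be the residual state concatenated with the two layer-level statistics. None of these choices is confirmed by the text above.

```python
import numpy as np

def spectral_entropy(A, eps=1e-12):
    """Entropy of the normalized Laplacian spectrum of one attention head."""
    A_sym = 0.5 * (A + A.T)              # assumption: symmetrize the graph
    D = np.diag(A_sym.sum(axis=1))       # degree matrix
    L = D - A_sym                        # combinatorial Laplacian
    lam = np.clip(np.linalg.eigvalsh(L), 0.0, None)
    p = lam / (lam.sum() + eps)          # eigenvalues normalized to sum to 1
    p = p[p > eps]                       # drop zero modes before taking logs
    return float(-(p * np.log(p)).sum())

def fuse_heads(attn):
    """attn: (n_heads, T, T). Returns mean and std of head-level entropies."""
    ent = np.array([spectral_entropy(A) for A in attn])
    return float(ent.mean()), float(ent.std())

# Synthetic causal attention for one layer: 4 heads over 6 token positions.
rng = np.random.default_rng(0)
n_heads, T, d = 4, 6, 16
logits = rng.normal(size=(n_heads, T, T))
causal = np.tril(np.ones((T, T), dtype=bool))
logits = np.where(causal, logits, -np.inf)
attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

mean_ent, std_ent = fuse_heads(attn)
h_final = rng.normal(size=d)             # stand-in for the residual state
z = np.concatenate([h_final, [mean_ent, std_ent]])
```

Because the fused vector adds only two scalars per layer to the residual state, the probe trained on it has almost the same size as the plain residual probe.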
2.4. Evaluation Protocol
All white-box detectors were evaluated with five-fold stratified cross-validation. For each dataset, samples were partitioned into five folds of approximately equal size, each preserving the class balance. In each iteration, model statistics and detectors were fitted on four folds and evaluated on the remaining unseen fold. All area under the receiver operating characteristic curve (AUROC) values reported in the Results section are macro-averages across the five folds. For the primary 256-example benchmark, the paper additionally reports mean performance and run-to-run standard deviation across five independent reruns. This protocol was used for the residual probe, the Mahalanobis baseline, and the fused spectra-linear detector.
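The fold-level protocol can be sketched as below. The fold assignment, the rank-based AUROC, and the mean-difference detector used for the demonstration are assumptions of this sketch; the study's own implementation may differ.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC from the Mann-Whitney U statistic, with average ranks for ties."""
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores), dtype=float)
    i = 0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0   # average 1-based rank
        i = j + 1
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return float(u / (n_pos * n_neg))

def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each sample to a fold while preserving the class balance."""
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        fold_of[idx] = np.arange(len(idx)) % n_folds
    return fold_of

def cross_validated_auroc(X, y, fit, score, n_folds=5, seed=0):
    """Fit on four folds, score the held-out fold, and macro-average AUROC."""
    fold_of = stratified_folds(y, n_folds, seed)
    aucs = []
    for k in range(n_folds):
        train, test = fold_of != k, fold_of == k
        model = fit(X[train], y[train])
        aucs.append(auroc(score(model, X[test]), y[test]))
    return float(np.mean(aucs))

# Demonstration with a mean-difference detector on synthetic features.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 128)
direction = np.zeros(16)
direction[0] = 1.0
X = rng.normal(size=(256, 16)) + 3.0 * y[:, None] * direction

fold_sizes = np.bincount(stratified_folds(y)).tolist()
cv_auc = cross_validated_auroc(
    X, y,
    fit=lambda Xtr, ytr: Xtr[ytr == 1].mean(axis=0) - Xtr[ytr == 0].mean(axis=0),
    score=lambda w, Xte: Xte @ w,
)
```

With 256 balanced examples, this assignment gives folds of 52, 52, 52, 50, and 50 samples, which is why the folds are only approximately equal in size.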
For each evaluation prompt, the white-box pipeline runs the model once, extracts the residual state of the final token at a candidate layer, computes multi-head spectral descriptors from the attention tensor of that same forward pass, and scores the sample with the corresponding detector. This design avoids retrieval, repeated sampling, or auxiliary decoding loops. The detector-side computation of the Mahalanobis baseline is also lightweight at inference time because it reduces to projected distance evaluations against precomputed means and precision matrices. In practice, this additional computation is dominated by matrix-vector products and remains fully compatible with edge-oriented deployment.
Figure 1 summarizes the revised workflow. The paper separates a controlled cross-architecture white-box benchmark from a preliminary larger-model scale-up probe and auxiliary quantized checks, while keeping all tracks within the same single-pass monitoring perspective.
3. Results
3.1. Layer-Wise Dynamics of Hallucination Features
The clearest result in the revised study is not a single winning detector, but a shared depth pattern.
Figure 2 plots layer-wise AUROC against relative network depth for the residual probe. Across five independent 256-example BF16 reruns, the strongest separability for Qwen2.5-1.5B-Instruct appears at layers 16–19 of 28, corresponding to 57.1–67.9% of total depth. For Llama-3.2-1B-Instruct, the peak appears at layers 8–10 of 16, or 50.0–62.5% of total depth. The additional 512-example stress test keeps the same pattern: Qwen peaks at layers 15–19 of 28 and Llama at layers 8–9 of 16. Despite the difference in absolute layer counts, both model families show their cleanest truthful-versus-incorrect separation in the middle-to-late part of the network rather than near the embedding side or immediately before the output surface.
This turns a model-specific observation into a depth-relative one. The informative region is better described as a percentage of network depth than as a fixed layer number. In practical terms, the most useful monitoring location appears after early lexical and local-context processing but before the final collapse into output logits. By that point, semantic state is already consolidated while generation-specific surface effects have not yet dominated the representation, which makes this region a plausible insertion point for low-overhead runtime monitoring.
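The conversion from absolute layer indices to relative depth is simple arithmetic, sketched below. The helper names are hypothetical, and the sketch assumes layers are counted from 1, as in the "layers 16–19 of 28" phrasing above.

```python
def relative_depth(layer, n_layers):
    """Relative depth of a 1-based layer index, in percent of total depth."""
    return 100.0 * layer / n_layers

def layers_in_band(n_layers, lo=0.50, hi=0.70):
    """1-based layer indices whose relative depth falls within [lo, hi]."""
    return [l for l in range(1, n_layers + 1) if lo <= l / n_layers <= hi]

# Peak ranges reported for the 256-example BF16 reruns.
qwen_peak = [round(relative_depth(l, 28), 1) for l in (16, 19)]   # [57.1, 67.9]
llama_peak = [round(relative_depth(l, 16), 1) for l in (8, 10)]   # [50.0, 62.5]

# Candidate monitoring layers in the 50-70% band for each depth.
qwen_band = layers_in_band(28)    # layers 14 to 19
llama_band = layers_in_band(16)   # layers 8 to 11
```

Expressing the band as a fraction of depth lets the same monitoring rule select candidate layers for a model with a different layer count.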
3.2. Benchmarking White-Box Detectors
Table 1 summarizes the primary white-box benchmark as mean ± run-to-run standard deviation across five independent 256-example BF16 reruns. The more stable result is the localization of the peak layers rather than a single detector ranking. On both datasets, the residual probe and the fused spectra-linear detector are nearly indistinguishable on average, while Mahalanobis remains competitive and still wins individual TruthfulQA reruns on Qwen. At the same time, the CommonsenseQA result on Qwen2.5-1.5B shows a clear exception: Mahalanobis falls to 0.6802 ± 0.0570, well below the residual probe at 0.7862 ± 0.0399 and the fused detector at 0.7857 ± 0.0402. These results show that stronger performance does not simply follow from adding more feature types, but they also caution against claiming a universally best detector from a single small run.
For deployment-oriented work, the implication is therefore conditional rather than absolute. Residual-space detectors are often strong, cheap, and reproducible, but the Qwen2.5-1.5B CommonsenseQA case shows that a simpler detector can still lose by a clear margin on a harder task. The more useful practical question is therefore not which detector wins once, but which internal statistic remains robust enough for the target task while preserving low monitoring cost.
3.3. Scale-Up Check on Qwen2.5-7B and Auxiliary Int8 Checks
The cross-architecture benchmark above was intentionally repeated in BF16 so that the detector comparison would remain controlled. To test whether the same depth-relative phenomenon survives at a larger scale, we reran Qwen2.5-7B-Instruct under the same candidate-answer protocol in BF16 for two independent seeds. Because a full five-rerun 7B sweep would impose a much heavier local memory and runtime burden, Table 2 should be read as a preliminary two-seed scale-up probe rather than as a complete variance estimate at 7B scale. Even with that narrower scope, the larger model reproduces the same middle-to-late localization. On TruthfulQA, the strongest residual and fused scores fall between layers 16 and 19 of 28, and on CommonsenseQA the best layer is 18 of 28 in both reruns. The main value of this 7B branch is therefore not a new detector ranking, but the confirmation that the same layerwise pattern is still visible beyond the smaller 1B–1.5B models used in the cross-architecture benchmark.
To probe compression more cautiously, we also reran the Qwen2.5-1.5B white-box benchmark under Int8-Quanto and compared it with the original 128-example BF16 development run under the same five-fold protocol.
Table 3 shows that the main qualitative conclusions are stable. On TruthfulQA, Mahalanobis remains the best detector and changes only from 0.7015 in BF16 to 0.7079 in Int8. On CommonsenseQA, the residual probe and the fused detector remain tied, increasing from 0.8296 to 0.8522, while Mahalanobis changes from 0.7010 to 0.7121. The best layers also remain in the same middle-to-late region, with layers 18–19 of 28 still dominating after quantization. We then completed one auxiliary 256-example Qwen2.5-7B Int8 seed under the same candidate-answer protocol. That run preserved the same peak layers as the BF16 7B probe, with TruthfulQA peaking at layer 19 of 28 and CommonsenseQA at layer 18 of 28; residual and fused AUROC reached 0.9274/0.9277 on TruthfulQA and 0.9118/0.9118 on CommonsenseQA, while Mahalanobis remained lower at 0.9185 and 0.8472. Because this 7B Int8 branch currently contains only one seed, we treat it as confirmatory auxiliary evidence rather than as a new benchmark table. Within these auxiliary Int8 settings, the main effect of quantization is therefore not a collapse of the layerwise pattern, but a preservation of the same depth-relative localization.
4. Discussion
The main takeaway of the paper changes with the revised evidence. The strongest result is no longer that one detector dominates all alternatives. The more defensible claim is that hallucination-related separability follows a depth-relative pattern that is visible across model families and remains visible in the preliminary 7B probe. In both Qwen and Llama, the clearest truthful-versus-incorrect separation appears after early token processing but before the final output surface. This makes the middle-to-late region a natural target for instrumentation because it captures semantic consolidation without waiting until the model has already collapsed its state into output logits.
The detector benchmark also suggests a more modest interpretation of internal hallucination features. Repeated reruns show that no single detector dominates every setting; the more stable phenomenon is the peak-layer localization itself. Mahalanobis scoring remains competitive, especially on TruthfulQA and in several individual Qwen reruns, which supports using residual-space statistics as a practical baseline. However, the CommonsenseQA result on Qwen2.5-1.5B shows a meaningful counterexample: Mahalanobis drops well below both the residual probe and the fused detector. In that setting, the simpler residual-space statistic is still useful, but it is not the most robust detector. The more defensible engineering conclusion is therefore not that one detector is universally best, but that several low-cost summaries recover much of the useful signal while detector ranking remains task dependent.
The recovered Qwen2.5-7B BF16 reruns strengthen this reading. They do not create a full new architecture benchmark, and they should be interpreted as a preliminary two-seed probe rather than as a full 7B variance study, but they still show that the same middle-to-late localization is visible at a larger scale, with TruthfulQA peaking at layers 16–19 of 28 and CommonsenseQA at layer 18 of 28. The auxiliary Int8 checks sharpen a different point: moderate compression does not erase the layerwise phenomenon even at larger scale. In particular, the completed Qwen2.5-7B Int8 seed preserves the same layer-19 and layer-18 peaks seen in the BF16 7B probe. What the current revision does not justify is a strong ranking among low-bit formats or a universal claim that one quantizer is best for white-box monitoring. The present quantization evidence should therefore be read as a narrow stability check, not as a definitive quantization leaderboard.
Table 4 condenses these revised findings into a deployment-oriented summary.
Limitations and Failure Cases
A few limitations should be kept in view. First, the cross-architecture benchmark uses controlled candidate-answer pairs rather than free-running long-form generation. This design makes detector comparisons reproducible, but it cannot capture exposure bias, multi-step error accumulation, or self-correction behavior during autonomous decoding [
13]. The present results should therefore be read as evidence about white-box separability under a controlled benchmark, not as a full substitute for generation-time evaluation in open-ended pipelines.
Second, the quantization evidence in this revision is narrower than the BF16 benchmark. The fully repeated main comparison is reported in BF16, the larger-model extension is currently strongest in BF16, and the auxiliary low-bit checks include Qwen2.5-1.5B under Int8-Quanto plus a single completed 7B Int8 seed. The 7B branch is therefore best read as a preliminary two-seed BF16 probe plus one confirmatory Int8 seed, not as a full run-to-run variance estimate at that scale. The paper supports a layered conclusion: the repeated white-box benchmark establishes the depth-relative phenomenon, the recovered 7B probe shows that the same phenomenon remains visible at a larger scale, and the auxiliary Int8 checks indicate that moderate quantization does not immediately destroy it. Broader claims about low-bit format superiority require additional controlled reruns. We also tested a cross-lingual alignment branch during development, but the signal was too weak under low-bit distortion to support a reliable multilingual claim, so that branch remains outside the main engineering contribution of the paper.
5. Conclusions
This paper frames internal hallucination detection as an empirical study of representational dynamics, detector choice, and deployment constraints. The revised evidence shows that truthful and incorrect candidate answers become most separable in the middle-to-late part of the network, with the peak appearing around 50–70% of total depth across the tested Qwen and Llama model families. The repeated benchmark also shows that simple residual-space statistics remain competitive with more complex fused features, while the depth-relative localization itself is more stable than any single detector ranking. A separate preliminary two-seed Qwen2.5-7B BF16 probe reproduces the same middle-to-late peak at a larger scale, and auxiliary Int8 checks on Qwen2.5-1.5B and Qwen2.5-7B suggest that this layerwise pattern remains visible after moderate quantization. The value of the paper therefore lies less in naming a universally best detector or quantizer and more in identifying where hallucination cues emerge, which internal statistics remain reliable in benchmarking, and where the limits of the current controlled protocol begin.
Author Contributions
Conceptualization, H.L. and J.X.; methodology, H.L. and J.X.; software, H.L.; validation, H.L. and J.X.; formal analysis, H.L.; investigation, H.L.; data curation, H.L.; writing—original draft preparation, H.L.; writing—review and editing, H.L. and J.X.; visualization, H.L.; supervision, J.X.; funding acquisition, J.X. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Natural Science Foundation of China, grant number 11701075.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
TruthfulQA and CommonsenseQA are publicly available from the cited sources. The derived white-box benchmark outputs and the cross-validation summaries supporting the findings of this study are available from the corresponding author upon reasonable request.
Acknowledgments
The authors thank colleagues who provided feedback on earlier drafts of this study.
Conflicts of Interest
The authors declare no conflicts of interest. The funder had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Abbreviations
The following abbreviations are used in this manuscript:
| LLM | Large language model |
| AUROC | Area under the receiver operating characteristic curve |
| BF16 | bfloat16 |
| Int8 | 8-bit integer |
| RAG | Retrieval-augmented generation |
References
1. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.; Madotto, A.; Fung, P. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 1–38.
2. Lin, S.; Hilton, J.; Evans, O. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, May 2022; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 3214–3252.
3. Panagoulias, D.P.; Virvou, M.; Tsihrintzis, G.A. Augmenting large language models with rules for enhanced domain-specific interactions: The case of medical diagnosis. Electronics 2024, 13, 320.
4. Wagenpfeil, S. Multimedia graph codes for fast and semantic retrieval-augmented generation. Electronics 2025, 14, 2472.
5. Han, S.; Wang, M.; Zhang, J.; Li, D.; Duan, J. A review of large language models: Fundamental architectures, key technological evolutions, interdisciplinary technologies integration, optimization and compression techniques, applications, and challenges. Electronics 2024, 13, 5040.
6. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient finetuning of quantized LLMs. arXiv 2023, arXiv:2305.14314.
7. Lin, J.; Tang, J.; Tang, H.; Yang, S.; Dang, X.; Han, S. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv 2023, arXiv:2306.00978.
8. Park, J.; Yao, Z.; Gholami, A.; Mahoney, M.W.; Keutzer, K.; Yao, J. MobileQuant: Mobile-friendly quantization for on-device language models. arXiv 2024, arXiv:2402.14914.
9. Manakul, P.; Liusie, A.; Gales, M. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, December 2023; Association for Computational Linguistics: Singapore, 2023; pp. 9004–9017.
10. Kuhn, L.; Gal, Y.; Farquhar, S. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv 2023, arXiv:2302.09664.
11. Azaria, A.; Mitchell, T. The internal state of an LLM knows when it is lying. arXiv 2023, arXiv:2304.13734.
12. Kossen, J.; Jumelet, J.; Paulus, R. INSIDE LLMs: Internal state representations retain the power of hallucination detection. arXiv 2024, arXiv:2402.03744.
13. Pozzi, A.; Incremona, A.; Tessera, D.; Toti, D. Mitigating exposure bias in large language model distillation: An imitation learning approach. Neural Comput. Appl. 2025, 37, 12013–12029.
14. Hui, B.; Yang, J.; Cui, C.; Li, S.; Zhang, Y.; Li, X.; Yang, J.; Wang, G.; Bai, L.; Li, C.; et al. Qwen2.5-Coder technical report. arXiv 2024, arXiv:2409.12186.
15. Talmor, A.; Herzig, J.; Lourie, N.; Berant, J. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, June 2019; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4149–4158.
16. Alain, G.; Bengio, Y. Understanding intermediate layers using linear classifier probes. arXiv 2018, arXiv:1610.01644.
17. Ortega, A.; Frossard, P.; Kovacevic, J.; Moura, J.M.F.; Vandergheynst, P. Graph signal processing: Overview, challenges, and applications. Proc. IEEE 2018, 106, 808–828.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.