Article

TECP: Token-Entropy Conformal Prediction for LLMs

1 School of Engineering, Shenzhen MSU-BIT University, Shenzhen 518000, China
2 MSU-BIT-SMBU Joint Research Center of Applied Mathematics, Shenzhen MSU-BIT University, Shenzhen 518000, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(20), 3351; https://doi.org/10.3390/math13203351
Submission received: 8 September 2025 / Revised: 5 October 2025 / Accepted: 15 October 2025 / Published: 21 October 2025
(This article belongs to the Topic Challenges and Solutions in Large Language Models)

Abstract

Uncertainty quantification (UQ) for open-ended language generation remains a critical yet underexplored challenge, particularly in settings where token-level log-probabilities are available during decoding. We present Token-Entropy Conformal Prediction (TECP), which treats a log-probability-based token-entropy statistic as a nonconformity score and integrates it with split conformal prediction to construct prediction sets with finite-sample coverage guarantees. We work in a white-box regime in which per-token log-probabilities are accessible during decoding. TECP estimates epistemic uncertainty from the token-entropy structure of sampled generations and calibrates thresholds via conformal quantiles to ensure provable error control. Empirical evaluations across six large language models and two QA benchmarks (CoQA and TriviaQA) show that TECP consistently achieves reliable coverage and compact prediction sets, outperforming prior self-UQ methods. These results provide a principled and efficient solution for trustworthy generation in white-box, log-probability-accessible LLM settings.

1. Introduction

Large Language Models (LLMs) are increasingly serving as the core technological substrate across diverse tasks and exhibit outstanding cross-domain capabilities [1,2,3,4,5]. With the continued accumulation and effective utilization of high-quality data, their performance has consistently improved, revealing substantial application potential in healthcare, code generation, scientific research, and psychological counseling [6,7]. Exemplified by systems such as ChatGPT (GPT-4o, version 2025-10), which are pretrained on large-scale corpora and carefully aligned with human preferences, these models not only adapt flexibly to heterogeneous demands but also significantly enhance the efficiency and reliability with which AI handles both everyday workflows and complex professional tasks [8,9], laying a firm foundation for the deeper integration of intelligent technologies and sustained innovation.
Despite their strong performance across many tasks, LLMs still exhibit pervasive reliability issues [10,11,12,13], including hallucinations and factual errors, and may produce responses that appear well-formed yet are fabricated or detached from reality. These failures reduce their suitability for deployment in high-stakes scenarios. Uncertainty quantification (UQ) for open-ended language generation remains a critical yet underexplored challenge, particularly in white-box settings where per-token log-probabilities are accessible during decoding (e.g., via API-exposed log-probabilities) [14,15,16,17,18,19]. UQ is especially critical in high-risk professional domains, such as medicine and psychology, where decision-making often relies heavily on LLM-generated content. However, uncertainty arises from heterogeneous sources, including epistemic and aleatoric components, so developing effective and well-grounded approaches to quantifying uncertainty remains an urgent and important issue [20,21,22,23].
Our motivation stems from an intuitive observation: surface fluency alone is insufficient to adjudicate the reliability of model outputs in the presence of hallucinations, thereby necessitating the incorporation of uncertainty quantification (UQ) to assess answer credibility. Contemporary approaches typically extract salient generative signals—such as attention distributions, hidden-state dynamics, or, as in our setting, per-token log-probabilities—during decoding [24,25]; use these signals to compute uncertainty scores at multiple granularities (e.g., token- and sentence-level) [26,27]; and subsequently aggregate them into a holistic credibility indicator. This indicator is then surfaced to end users via concise interface cues (e.g., “high confidence” or “potentially unreliable”), providing real-time decision support in high-stakes domains such as medicine and psychology, and thereby mitigating the risks associated with hallucinated content.
Existing uncertainty estimation methods have demonstrated significant effectiveness in distinguishing correct from incorrect answers; however, such heuristic approaches fail to provide a provable risk guarantee (i.e., correctness coverage). To address this limitation, we propose Token-Entropy Conformal Prediction for LLMs (TECP), which adopts the conformal prediction (CP) framework [28,29,30,31,32,33] on the basis of token entropy; Figure 1 shows the overall workflow. CP is a statistical learning paradigm designed to generate interpretable prediction sets with guaranteed confidence levels alongside model outputs. Its core principle lies in specifying a significance level (or allowable error rate), computing a nonconformity score from historical data, and comparing it with the prediction of a new instance, thereby constructing a prediction set that, under the assumption of exchangeability, contains the true label with the desired long-run frequency. In TECP, the nonconformity score is computed from token entropy. We further discuss the potential implications of filtering calibration samples as "assessable" and its impact on coverage.
Compared to traditional probabilistic prediction, CP imposes minimal assumptions on data distribution, requiring only that samples satisfy the property of exchangeability—which endows it with strong adaptability and robustness across diverse tasks and model architectures. Moreover, the size of the prediction set can directly indicate the model’s uncertainty: when confidence is low, the set expands to preserve coverage; when confidence is high, the set contracts to enhance decision precision. Owing to its provable coverage guarantee, CP is of particular importance in high-stakes applications, such as medical diagnosis, legal text analysis, and scientific research, as it not only quantifies the reliability of predictions but also enables flexible trade-offs between risk and utility under varying confidence thresholds. In recent years, researchers have further extended CP beyond the traditional classification and regression tasks to more complex domains such as open-ended natural language generation and structured prediction. By incorporating techniques such as sampling and reprompting, these advancements effectively address the challenges posed by vast output spaces and the inherent diversity of possible answers.
We conduct a systematic empirical investigation of six state-of-the-art large language models (LLaMA-3.2-1B, Qwen2.5-3B-Instruct, Vicuna-7B-v1.5, Qwen2.5-7B-Instruct, LLaMA-3.1-8B-Instruct, and Vicuna-13B-v1.5) on the CoQA and TriviaQA benchmarks, with the aim of rigorously evaluating their uncertainty quantification capabilities. The evaluation is grounded in two theoretically motivated metrics: (1) Empirical Miscoverage Rate (EMR), which measures the proportion of instances in which the constructed prediction set does not contain the ground-truth answer (lower is better; coverage = 1 − EMR); and (2) Average Prediction Set Size (APSS), which captures epistemic uncertainty by quantifying the expected cardinality of the prediction set. Experimental analyses reveal substantial disparities in EMR and APSS across different models and datasets, highlighting the sensitivity of uncertainty estimation to both model-specific generative competence and dataset-inherent statistical properties. TECP enables the construction of calibrated prediction sets under minimal distributional assumptions and without additional model training. The empirical results demonstrate that the proposed method achieves rigorous error control while attaining high coverage of ground-truth answers, thereby providing robust reliability guarantees for model predictions in open-domain generative tasks.

2. Related Work

Uncertainty Analysis: Framed by uncertainty analysis, prior work traces a progression from general UQ to white-box UQ for LLMs and, ultimately, to CP-based coverage guarantees. In ML and NLP, UQ is foundational for decision-making and risk control, yet confidence-based metrics (e.g., entropy) are vulnerable to calibration mismatch, while Bayesian and ensemble paradigms, despite their theoretical appeal, incur prohibitive computational and engineering overheads at LLM scale. Early LLM efforts relied on white-box signals (token likelihoods and internal activations) and fine-tuning for improved calibration. However, API-level access constraints and resource budgets limit the practicality of such approaches, motivating an output-level route that remains compatible with white-box, log-probability access: multi-sample generation captures semantic consistency and dispersion, with the "most frequent generation" serving as a self-consistency anchor to construct actionable, internal-agnostic uncertainty surrogates. Building on this, conformal prediction (CP) offers a systematic bridge from heuristic scores to coverage-guaranteed prediction sets: a small number of i.i.d. calibration instances map external UQ scores to set thresholds, achieving ground-truth coverage at a user-specified error rate while remaining model- and distribution-agnostic and computationally feasible. Further, to accommodate the vast output space of open-ended generation, recent work extends CP from classification to sequence generation via stopping rules and sampling approximations, thereby making explicit both "when confidence is sufficient" and "how large the set should be to remain robust" [34].
Conformal Prediction: As a distribution-free statistical calibration framework, conformal prediction (CP) offers a systematic pathway from heuristic uncertainty scores to coverage-guaranteed prediction sets and has garnered increasing attention in ML and NLP [32]. Its core mechanism, in the classification setting, excludes implausible labels via nonconformity scores to form a candidate set that, with high probability, contains the ground truth, while using the set size to quantify predictive uncertainty [35]; this paradigm has been validated in part-of-speech tagging, paraphrase detection, and fact verification. Owing to its nonparametric, distribution-free, model-agnostic, and computationally efficient characteristics, CP is well suited to large language models; moreover, its ability to deliver distribution-free, finite-sample calibration of general risk functions provides a methodological bridge from discriminative tasks to generative ones. For open-ended text generation with an effectively unbounded output space, recent work extends CP from classification to sequences via sampling and stopping rules, making explicit both when confidence is sufficient and how large a set must be to remain robust—without enumerating the entire candidate space. By contrast, CP-style confidence intervals developed for diffusion-based image generation remain at the pixel level and do not directly transfer to the combinatorial and semantic structure of language, underscoring the need for CP frameworks tailored to free-form generation in LLMs.

3. Method

This section proposes a general framework grounded in token-entropy scoring and split conformal prediction (SCP) to assess uncertainty in outputs generated by white-box language models. We assume access to token-level log-probabilities (or normalized probabilities) during decoding via API, but do not require access to parameters, gradients, or hidden states. Under this setting, the central challenge is to transform the available probability information from a finite set of sampled outputs into a calibrated prediction set with finite-sample coverage guarantees.

3.1. Problem Setup and Candidate Generation Mechanism

Let $x \in \mathcal{X}$ denote the input to a natural language task (e.g., a question in QA or a passage in summarization), with reference answer $y \in \mathcal{Y}$. For a white-box language model $f(\cdot)$, we generate, for input $x$, a set of candidate outputs $\hat{Y}(x) = \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_M\}$, where $M$ is a fixed number of generations. The decoding can employ various strategies (e.g., temperature sampling, beam search, or top-p sampling), which we abstract as
$$\hat{y}_m \sim f(x;\theta), \qquad m = 1, \ldots, M,$$
where $\theta$ collects control hyperparameters (e.g., temperature and beam width) and is treated as fixed. Given the openness and diversity of natural language generation, a single input $x$ often admits multiple semantically equivalent yet lexically distinct outputs. We, thus, adopt a set-generation perspective, viewing all the candidates as uncertain responses to $x$ and seeking to extract information about output quality and confidence from their distributional structure. Our ultimate objective is to estimate the uncertainty $U(\hat{y}_m)$ for each candidate $\hat{y}_m$ and, at a user-specified error tolerance $\alpha$, to construct a prediction set $\Gamma(x)$ that satisfies the following coverage guarantee:
$$P\bigl(y \in \Gamma(x)\bigr) \geq 1 - \alpha.$$
The construction is realized via quantile calibration in Section 3.3.
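As an illustration of the candidate-generation step, the sketch below samples M candidates and keeps the per-step logits needed for the entropy score of Section 3.2. It assumes a Hugging Face transformers causal LM; the checkpoint name, the example prompt, and the decoding hyperparameters (temperature, max_new_tokens, M = 10) are illustrative choices, not values prescribed by the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.2-1B"  # one of the backbones evaluated later; any causal LM works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

prompt = "Q: In which year did the Apollo 11 mission land on the Moon?\nA:"  # hypothetical QA prompt
inputs = tok(prompt, return_tensors="pt").to(model.device)

M = 10  # number of sampled candidates per input
out = model.generate(
    **inputs,
    do_sample=True,            # temperature sampling, i.e., one choice of theta in Eq. (1)
    temperature=1.0,
    max_new_tokens=32,
    num_return_sequences=M,
    return_dict_in_generate=True,
    output_scores=True,        # keep per-step logits for the token-entropy score
    pad_token_id=tok.eos_token_id,
)

gen_tokens = out.sequences[:, inputs["input_ids"].shape[1]:]          # strip the prompt tokens
candidates = tok.batch_decode(gen_tokens, skip_special_tokens=True)   # the candidate set Y_hat(x)
# out.scores is a tuple (one entry per generated position) of (M, vocab_size) logits.
```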

3.2. Uncertainty Estimation via Token Entropy

Because per-token log-probabilities are available in our white-box setting, we adopt token entropy as the primary uncertainty (nonconformity) score used throughout our conformal procedure. Concretely, for a candidate $\hat{y}_m$ of length $L_m$,
$$U(\hat{y}_m) = \sum_{t=1}^{L_m} H_t = -\sum_{t=1}^{L_m} \sum_{v \in V} p_t(v) \log p_t(v),$$
where $V$ is the vocabulary and $p_t(v)$ is the model's predictive distribution at position $t$. Higher cumulative entropy indicates greater uncertainty (lower confidence). To mitigate potential output-length effects, we also report a mean-entropy variant $\bar{U}(\hat{y}_m) = \frac{1}{L_m} \sum_{t=1}^{L_m} H_t$ in ablations.
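Under these definitions, the cumulative entropy can be computed directly from the per-step logits retained during sampling. The following is a minimal sketch, assuming the `out.scores` tuple produced by the generation example in Section 3.1; masking of positions after a candidate's EOS token is left to the caller via `gen_len`.

```python
import torch.nn.functional as F

def token_entropy(scores, cand_idx, gen_len=None, mean=False):
    """Token-entropy nonconformity score U(y_m) for one sampled candidate.

    scores   : tuple of per-step logits from generate(..., output_scores=True),
               each tensor of shape (num_candidates, vocab_size).
    cand_idx : row index of the candidate within the sampled batch.
    gen_len  : number of tokens generated before EOS/padding for this candidate;
               steps beyond it should not contribute to the sum.
    mean     : if True, return the length-normalized variant (1/L_m) * sum_t H_t.
    """
    steps = scores if gen_len is None else scores[:gen_len]
    entropies = []
    for step_logits in steps:
        log_p = F.log_softmax(step_logits[cand_idx].float(), dim=-1)  # log p_t(v)
        h_t = -(log_p.exp() * log_p).sum().item()                     # H_t = -sum_v p_t(v) log p_t(v)
        entropies.append(h_t)
    total = sum(entropies)
    return total / len(entropies) if mean else total

# Example: score every candidate for the current input.
# u_scores = [token_entropy(out.scores, m) for m in range(M)]
```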
Optional (analysis/baseline) semantic self-consistency. For completeness, we include a semantic self-consistency score as an optional analysis/baseline (not used to set CP thresholds). For the candidate set $\hat{Y}(x)$, the uncertainty of $\hat{y}_m$ is defined as a convex combination of (i) the frequency with which other candidates are semantically equivalent to $\hat{y}_m$ and (ii) the average semantic similarity:
$$U(\hat{y}_m) = 1 - \lambda \cdot \underbrace{\frac{|\{\hat{y}_j \in \hat{Y}(x) : \hat{y}_j \equiv \hat{y}_m\}|}{M}}_{\text{semantic-consistency frequency}} - (1-\lambda) \cdot \underbrace{\frac{1}{M} \sum_{j=1}^{M} S(\hat{y}_m, \hat{y}_j)}_{\text{average semantic similarity}},$$
where $S(\cdot,\cdot)$ denotes sentence-level semantic similarity and $\lambda \in [0,1]$ is a trade-off parameter.
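For reference, a sketch of this optional score is given below. The similarity function, the equivalence criterion (here a threshold on the same similarity), and the values of $\lambda$ and the equivalence threshold are illustrative assumptions; the score is reported only for analysis and is not used to calibrate CP thresholds.

```python
def self_consistency_score(m, candidates, sim, lam=0.5, equiv_thresh=0.8):
    """Optional semantic self-consistency uncertainty for candidate m (analysis only).

    sim(a, b)    : sentence-level similarity S(a, b) in [0, 1].
    lam          : trade-off parameter lambda in [0, 1] (illustrative value).
    equiv_thresh : similarity above which two candidates are treated as
                   semantically equivalent (illustrative operationalization).
    """
    M = len(candidates)
    sims = [sim(candidates[m], c) for c in candidates]
    freq = sum(s >= equiv_thresh for s in sims) / M   # semantic-consistency frequency
    avg_sim = sum(sims) / M                           # average semantic similarity
    return 1.0 - lam * freq - (1.0 - lam) * avg_sim
```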

3.3. Uncertainty Calibration and Prediction Set Construction

Given an uncertainty score for each candidate, we adopt the split conformal prediction (SCP) framework: the available data are partitioned into a calibration set and a test set; the former is used to estimate a confidence threshold, which is then applied to the latter to build prediction sets with formal guarantees.
  • Data partitioning and filtering. For each sample $(x_i, y_i)$, if there exists a candidate $\hat{y}_m$ such that $S(\hat{y}_m, y_i) \geq \tau$, the sample is deemed assessable (i.e., of acceptable generation quality) and is retained in $\mathcal{D}_{\text{filtered}}$, where $\tau$ is a semantic-matching threshold typically in the range 0.8–0.9. We then randomly split $\mathcal{D}_{\text{filtered}}$ into a calibration subset $\mathcal{D}_{\text{cal}}$ and a test subset $\mathcal{D}_{\text{test}}$ in a user-specified ratio.
  • Construction of the nonconformity score multiset $\mathcal{R}$. On $\mathcal{D}_{\text{cal}}$, we collect the uncertainty scores of the (semantically correct) candidates:
    $$\mathcal{R} = \bigl\{ U(\hat{y}_m^{(i)}) \,\big|\, (x_i, y_i) \in \mathcal{D}_{\text{cal}},\ \hat{y}_m^{(i)} \text{ is semantically correct} \bigr\}.$$
    Using all the candidates is a permissible, more conservative variant.
  • Quantile-based confidence threshold. For target coverage $1-\alpha$, we order $\mathcal{R}$ increasingly and take the $\lceil (1-\alpha)(n+1) \rceil$-th order statistic as the empirical threshold $\hat{q}_\alpha$:
    $$\hat{q}_\alpha = \mathrm{Quantile}\bigl(\mathcal{R},\, q_{\text{level}}\bigr), \qquad q_{\text{level}} = \frac{\lceil (1-\alpha)(n+1) \rceil}{n},$$
    where $n = |\mathcal{R}|$. We use a "higher" interpolation rule to avoid underestimating the threshold and thereby preserve coverage.
  • Prediction set construction. For any test input $x$, we define
    $$\Gamma(x) = \{\, \hat{y}_m \in \hat{Y}(x) : U(\hat{y}_m) \leq \hat{q}_\alpha \,\}.$$
    By construction, $\Gamma(x)$ achieves a nominal coverage of at least $1-\alpha$ for the ground-truth answer under exchangeability. We caution that pre-filtering calibration samples as "assessable" may affect exchangeability; we, therefore, report sensitivity to the semantic threshold $\tau$ and also consider a no-filter variant as a conservative baseline. A minimal calibration and set-construction sketch follows this list.
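The calibration and set-construction steps above reduce to a few lines once nonconformity scores are available. The sketch below uses NumPy's "higher" quantile interpolation to match the conservative rule described above; the guard on q_level handles the small-calibration-set edge case in which $(1-\alpha)(n+1)/n$ exceeds 1.

```python
import numpy as np

def calibrate_threshold(cal_scores, alpha):
    """Split-conformal threshold q_hat from the calibration multiset R."""
    n = len(cal_scores)
    q_level = np.ceil((1.0 - alpha) * (n + 1)) / n
    q_level = min(q_level, 1.0)  # guard: very small calibration sets
    # 'higher' interpolation avoids underestimating the threshold, preserving coverage.
    return float(np.quantile(np.asarray(cal_scores), q_level, method="higher"))

def prediction_set(candidates, u_scores, q_hat):
    """Gamma(x): keep every candidate whose uncertainty does not exceed q_hat."""
    return [c for c, u in zip(candidates, u_scores) if u <= q_hat]

# Example: with alpha = 0.1 and n = 499 calibration scores,
# q_level = ceil(0.9 * 500) / 499 = 450 / 499, i.e., the 450th smallest score.
```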

4. Experiments

4.1. Experimental Setup

Backbone LLMs and Evaluation Tasks. Since we incorporated token entropy into the conformal prediction framework for open-ended question-answering (QA) tasks, it was necessary to evaluate the performance of the proposed architecture rigorously. To this end, we selected a diverse set of open-source large language models (LLMs) as backbone models, including the Llama series (Llama-3.2-1B and Llama-3.1-8B-Instruct), the Qwen series (Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct), and the Vicuna series (Vicuna-7B-v1.5 and Vicuna-13B-v1.5).
Datasets. We employed TriviaQA and CoQA as our experimental datasets. The TriviaQA dataset contains over 650,000 question–answer–evidence triples, covering a wide range of topics such as history, science, and entertainment, and supports multiple tasks including open-domain question-answering and extractive question-answering. CoQA is a large-scale conversational question-answering dataset intended to facilitate the development and construction of conversational question-answering systems.
Baseline. The baseline method adopts a predictive uncertainty-based strategy, which constructs the prediction set by quantifying the model’s uncertainty over its generated outputs:
  • ConU. ConU is a conformal-uncertainty method that evaluates candidate responses generated for each input prompt and quantifies predictive uncertainty by incorporating both model confidence scores and calibration error. In contrast to heuristic approaches that rely solely on response-level variability, ConU offers a principled mechanism for constructing prediction sets with theoretical risk-control guarantees. Notably, the method does not require access to model log-probabilities, yet is still able to retain high-confidence responses selectively. It enables the construction of prediction sets that are statistically guaranteed to meet predefined risk levels, while remaining representative, stable, and interpretable.
Metrics. We used the following metrics for evaluation:
  • Empirical miscoverage rate (EMR) quantifies the proportion of instances for which the prediction set contains no output that passes the correctness criterion (lower is better; coverage = 1 − EMR). It captures the set's validity.
  • Average prediction set size (APSS) computes the mean cardinality of prediction sets across the evaluation corpus. Lower values indicate greater efficiency. A computation sketch for both metrics follows this list.
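A minimal sketch of both metrics, assuming each test example carries its prediction set and a correctness predicate of the kind defined in the following correctness-evaluation paragraph:

```python
def emr_and_apss(pred_sets, correct_fn):
    """Empirical miscoverage rate and average prediction set size.

    pred_sets  : list of prediction sets, one list of candidate strings per test example.
    correct_fn : correct_fn(example_idx, candidate) -> True if the candidate passes
                 the semantic correctness criterion for that example.
    """
    n = len(pred_sets)
    missed = sum(not any(correct_fn(i, c) for c in s) for i, s in enumerate(pred_sets))
    emr = missed / n                            # lower is better; coverage = 1 - EMR
    apss = sum(len(s) for s in pred_sets) / n   # lower is better (more efficient sets)
    return emr, apss
```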
Correctness Evaluation. We adopted a similarity-based criterion to evaluate the consistency between model predictions and ground-truth answers. Each generated response was paired with its corresponding reference answer and evaluated using a semantic matching model. To this end, we employed a DistilRoBERTa-based cross-encoder to score the semantic closeness between the two texts. We set a fixed threshold of $\tau = 0.7$: a prediction was considered correct if its similarity score exceeded this threshold, thereby ensuring that only semantically faithful outputs were retained. The same $\tau$ value was used for both calibration filtering ("assessable" samples) and final evaluation to avoid selection bias. For datasets with multiple reference answers, we took the maximum similarity score among all the references; if any reference exceeded $\tau$, the candidate prediction was regarded as correct. This approach enabled a more fine-grained evaluation of correctness beyond surface-level token overlap. The exact checkpoint identifier and configuration of the cross-encoder are provided in the reproducibility section.
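A sketch of this check, assuming the sentence-transformers CrossEncoder API and a publicly available DistilRoBERTa STS cross-encoder checkpoint (the exact checkpoint used in the experiments is deferred to the reproducibility section):

```python
from sentence_transformers import CrossEncoder

# Assumed checkpoint; swap in the identifier from the reproducibility section.
scorer = CrossEncoder("cross-encoder/stsb-distilroberta-base")
TAU = 0.7  # semantic-matching threshold used for both calibration filtering and evaluation

def is_correct(prediction, references):
    """A candidate is correct if its best similarity to any reference exceeds tau."""
    scores = scorer.predict([(prediction, ref) for ref in references])
    return float(max(scores)) > TAU
```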

4.2. Results for QA

We conducted systematic experiments on two open-domain question-answering datasets, TriviaQA (Figure 2 and Figure 3) and CoQA (Figure 4 and Figure 5), to evaluate the coverage performance and prediction set size of the proposed method across different large language models. Figure 2 reports the EMR results of six models (Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, LLaMA-3.2-1B, LLaMA-3.1-8B-Instruct, Vicuna-7B-v1.5, and Vicuna-13B-v1.5) on the TriviaQA dataset, while Figure 4 presents the corresponding results on the CoQA dataset. All the experiments were conducted with a sampling size of 10, where model predictions were ranked according to predictive uncertainty and calibrated through the conformal prediction framework to construct output sets with formal coverage guarantees (under the assumption of exchangeability). At risk levels $\alpha < 0.2$, the EMR remained consistently below 0.1 for all the models except the relatively weaker Vicuna-7B-v1.5 (EMR = 0.11), indicating that the proposed method can generate reliable and high-confidence prediction sets under risk-controlled conditions. Furthermore, as illustrated in Figure 4, EMR exhibited a decreasing trend as model capacity increased (e.g., from Qwen2.5-3B to Qwen2.5-7B to Vicuna-13B), suggesting that more capable models produce more compact and selective prediction sets under the same uncertainty-based ranking and conformal calibration strategy. Moreover, we observed that, across all six models and for any value of $\alpha$, both the mean and variance of EMR remained below $\alpha$, demonstrating that $\alpha$ serves as a reliable, well-calibrated upper bound on the miscoverage risk of the prediction sets.
Meanwhile, the APSS results (Table 1) show that the size of the prediction sets decreases monotonically as the risk level increases: when $\alpha = 0.1$, the average prediction set size was close to nine candidates, while at $\alpha = 0.9$ it contracted to about one candidate. This trend remained consistent across both the TriviaQA and CoQA datasets, with only minor differences observed among the models, highlighting the stability of the proposed method and its cross-task generalizability. Notably, larger-scale models exhibited a faster reduction in prediction set size at medium to high risk levels, indicating that their prediction sets were more selective. In addition, as shown in Table 1, the prediction set sizes were highly similar across all the models, suggesting that the conformal calibration behaves in a largely model-agnostic manner.

4.3. Ablation Study

To comprehensively evaluate the effect of different split ratios on prediction set coverage, we conducted ablation studies on the open-domain QA datasets TriviaQA and CoQA. This experiment comprised three parts.
Ablation on split ratio. We set the calibration–test split ratios to 0.3, 0.5, and 0.7, corresponding to different amounts of calibration data. All the evaluations used a fixed risk level of $\alpha = 0.1$, and the results were averaged over 100 random seeds to mitigate sampling variability. The resulting empirical coverage (defined as 1 − EMR) is presented as bar charts in Figure 6 and Figure 7, showing model performance under each split configuration. All the ablations used the same semantic threshold $\tau = 0.7$ for both calibration filtering (assessable samples) and evaluation to avoid selection bias, with decoding parameters and sampling size M held fixed. When the split ratio was reduced, risk control remained intact because the test-time average coverage consistently exceeded $1 - \alpha$, evidencing the robustness of our method. Moreover, the approach was efficient: it attained reliable coverage guarantees on large test sets using only limited calibration data. These results indicate that the method is both robust and practical under varying data partitioning schemes. From a cross-dataset perspective, despite differences in language style, task formulation, and reasoning complexity, our method exhibited strong transferability and consistent coverage control across both distributions. Under all three split settings, most models maintained coverage within the narrow interval [0.94, 1.00] without noticeable degradation or volatility. Notably, Vicuna-13B achieved perfect coverage (1.00) across all splits on CoQA, while LLaMA-3.1-8B and Qwen2.5-7B consistently exceeded 0.99, indicating reliable calibration even with limited calibration data.
Ablation on semantic similarity threshold τ . We varied τ from 0.1 to 0.9 (Figure 8). Semantic similarity substantially affected EMR; however, overall performance did not degrade, as EMR < α held for all τ , demonstrating consistent miscoverage-risk control irrespective of the threshold choice.
Ablation on sampling size. We evaluated sampling sizes $M \in \{5, 7, 10\}$ (Figure 9). The results remained relatively stable across settings: performance was slightly better at $M = 10$ and weaker at $M = 5$, yet miscoverage-risk control was achieved in all the cases.

4.4. Comparison with Baseline

Comparison with ConU. We compared our proposed method (TECP) with the ConU baseline; the results on the TriviaQA dataset are shown in Figure 2 and Figure 3, and the results on the CoQA dataset are shown in Figure 4 and Figure 5. We found that TECP consistently produced empirical miscoverage rate (EMR) curves with lower variance across different random seeds and calibration–test splits, indicating stronger stability. Furthermore, while ConU exhibited wider confidence intervals and higher fluctuations across multiple models, TECP demonstrated closer alignment with the theoretical upper bound EMR = $\alpha$ (under exchangeability) and significantly narrower uncertainty bands across all the models. This suggests that TECP maintains consistent performance under random partitioning and better achieves the target theoretical risk control. All the curves use the same semantic threshold $\tau = 0.7$ for both calibration filtering ("assessable") and evaluation to avoid selection bias.
Token entropy captures the internal uncertainty of the language model by modeling the probability distribution over tokens in an autoregressive generation process. In contrast, frequency-based approaches assess semantic diversity solely from sampled outputs and are thus vulnerable to issues such as hallucination. Specifically, when a language model is highly confident yet repeatedly generates an incorrect answer, frequency-based uncertainty estimates may misleadingly appear low, failing to reflect the model’s true epistemic uncertainty. This discrepancy introduces bias into the construction of nonconformity scores, undermining the reliability of the measured disagreement between the input and output. As a result, when calibration and test sets are randomly split, such errors are further amplified, leading to large variance and unstable uncertainty quantification.
Comparison with VL-Uncertainty. VL-Uncertainty [30] treats “uncertainty” as an intrinsic metric of LVLM hallucination, without requiring any external annotations or auxiliary models. The model directly uses 1 as the uncertainty threshold and autonomously relies on the internal uncertainty of the LVLM as the sole decision criterion. We conducted ablation experiments on CoQA and TriviaQA to evaluate VL-Uncertainty, and the results are shown in Table 2. In these experiments, we hard-coded the prediction set thresholds to 0.1, 0.3, 0.5, 0.7, and 1.0, and observed that the EMR exhibited no discernible pattern as the threshold varied. This indicates that such a heuristic threshold-setting strategy in VL-Uncertainty fails to achieve effective control over the miscoverage risk.

5. Conclusions

We propose and validate a method for prediction set construction that leverages uncertainty-based ranking in conjunction with conformal calibration to achieve rigorous coverage control. Comprehensive evaluations on the TriviaQA and CoQA datasets—spanning multiple models and calibration–test splits—demonstrate that the approach maintains stable coverage and compact prediction set sizes across variations in model scale, task distribution, and data partitioning. As model capacity increases, the resulting prediction sets exhibit greater concentration and selectivity; meanwhile, coverage remains robust under different split ratios, with all the findings averaged over repeated random trials to ensure statistical stability. Overall, this white-box framework achieves provable coverage (under exchangeability) and shows cross-dataset consistency within QA, mitigates calibration bias in uncertainty scores, and offers a practical solution for reliable prediction and risk control in large language model applications.

6. Limitations

Our approach reflects a typical limitation of white-box conformal prediction frameworks—namely, the reliance on uncertainty scores derived from model output probabilities. Although predictive uncertainty supports the creation of prediction sets with theoretical coverage guarantees, the effectiveness of calibration largely depends on the stability and accuracy of the underlying uncertainty signal, which can be affected by instruction tuning or reinforcement learning. In practice, extracting per-token log-probabilities via APIs entails latency, cost, and request-rate constraints, which we do not optimize or benchmark against baselines under equal resource budgets in this study. Furthermore, the method presumes that such uncertainty scores can be reliably extracted across diverse models and tasks, an assumption that may break down under distribution shifts or non-standard decoding protocols. Future directions include developing scoring functions that are more robust to model-specific artifacts, or exploring hybrid frameworks that integrate both internal and external uncertainty cues to improve generalization.

Author Contributions

Conceptualization, B.X. and Y.L.; methodology, B.X. and Y.L.; software, B.X.; validation, B.X.; formal analysis, B.X.; investigation, B.X.; resources, Y.L.; data curation, B.X.; writing—original draft, B.X.; writing—review and editing, B.X. and Y.L.; visualization, B.X.; project administration, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by General Project under the Stable Support Plan for Basic Research, Shenzhen Municipality (grant number 20231128113233002).

Data Availability Statement

The data presented in this study are available in public domain repositories: CoQA (Stanford NLP) at https://stanfordnlp.github.io/coqa/ (accessed on 14 October 2025) and TriviaQA (University of Washington NLP) at https://nlp.cs.washington.edu/triviaqa/ (accessed on 14 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yao, Y.; Duan, J.; Xu, K.; Cai, Y.; Sun, Z.; Zhang, Y. A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly. High-Confid. Comput. 2024, 4, 100211. [Google Scholar] [CrossRef]
  2. Veeramachaneni, V. Large Language Models: A Comprehensive Survey on Architectures, Applications, and Challenges. Adv. Innov. Comput. Program. Lang. 2025, 7, 20–39. [Google Scholar]
  3. Chen, H.; Zhang, Y.; Bi, Y.; Zhang, Y.; Liu, T.; Bi, J.; Lan, J.; Gu, J.; Grosser, C.; Krompass, D.; et al. Does Machine Unlearning Truly Remove Model Knowledge? A Framework for Auditing Unlearning in LLMs. arXiv 2025, arXiv:2505.23270. [Google Scholar] [CrossRef]
  4. Rong, X.; Huang, W.; Liang, J.; Bi, J.; Xiao, X.; Li, Y.; Du, B.; Ye, M. Backdoor Cleaning without External Guidance in MLLM Fine-Tuning. arXiv 2025, arXiv:2505.16916. [Google Scholar] [CrossRef]
  5. Lu, R.; Bi, J.; Ma, Y.; Xiao, F.; Du, Y.; Tian, Y. MV-Debate: Multi-View Agent Debate with Dynamic Reflection Gating for Multimodal Harmful Content Detection in Social Media. arXiv 2025, arXiv:2508.05557. [Google Scholar]
  6. Yang, D.; Wei, J.; Xiao, D.; Wang, S.; Wu, T.; Li, G.; Li, M.; Wang, S.; Chen, J.; Jiang, Y.; et al. PediatricsGPT: Large Language Models as Chinese Medical Assistants for Pediatric Applications. Adv. Neural Inf. Process. Syst. 2024, 37, 138632–138662. [Google Scholar]
  7. Chen, J.; Yang, D.; Jiang, Y.; Lei, Y.; Zhang, L. MISS: A Generative Pre-Training and Fine-Tuning Approach for Med-VQA. In Proceedings of the International Conference on Artificial Neural Networks (ICANN), Lugano, Switzerland, 17–20 September 2024. [Google Scholar]
  8. Zhang, G.; Bi, J.; Gu, J.; Chen, Y.; Tresp, V. SPOT! Revisiting Video-Language Models for Event Understanding. arXiv 2023, arXiv:2311.12919. [Google Scholar] [CrossRef]
  9. Chen, J.; Jiang, Y.; Yang, D.; Li, M.; Wei, J.; Qian, Z.; Zhang, L. Can LLMs’ Tuning Methods Work in Medical Multimodal Domain? In Proceedings of the MICCAI, Marrakesh, Morocco, 6–10 October 2024. [Google Scholar]
  10. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst. 2025, 43, 1–55. [Google Scholar] [CrossRef]
  11. Lavrinovics, E.; Biswas, R.; Bjerva, J.; Hose, K. Knowledge Graphs, Large Language Models, and Hallucinations: An NLP Perspective. J. Web Semant. 2025, 85, 100844. [Google Scholar] [CrossRef]
  12. Zhou, X.; Zhang, M.; Lee, Z.; Ye, W.; Zhang, S. Hademif: Hallucination Detection and Mitigation in Large Language Models. In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), Virtual, 24–28 April 2025. [Google Scholar]
  13. Jiang, Y.; Chen, J.; Yang, D.; Li, M.; Wang, S.; Wu, T.; Li, K.; Zhang, L. CoMT: Chain-of-Medical-Thought Reduces Hallucination in Medical Report Generation. In Proceedings of the ICASSP 2025—IEEE International Conference on Acoustics, Speech and Signal Processing, Hyderabad, India, 6–11 April 2025. [Google Scholar]
  14. Farquhar, S.; Kossen, J.; Kuhn, L.; Gal, Y. Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature 2024, 85, 100844. [Google Scholar] [CrossRef]
  15. Duan, J.; Cheng, H.; Wang, S.; Zavalny, A.; Wang, C.; Xu, R.; Kailkhura, B.; Xu, K. Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand, 11–16 August 2024. [Google Scholar]
  16. Qiu, X.; Miikkulainen, R. Semantic Density: Uncertainty Quantification for Large Language Models Through Confidence Measurement in Semantic Space. Adv. Neural Inf. Process. Syst. 2024, 37, 134507–134533. [Google Scholar]
  17. Wang, Z.; Duan, J.; Yuan, C.; Chen, Q.; Chen, T.; Zhang, Y.; Wang, R.; Shi, X.; Xu, K. Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form Medical Question Answering Applications and Beyond. Eng. Appl. Artif. Intell. 2025, 139, 109553. [Google Scholar] [CrossRef]
  18. Wang, Z.; Duan, J.; Cheng, L.; Zhang, Y.; Wang, Q.; Shi, X.; Xu, K.; Shen, H.T.; Zhu, X. ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees. In Findings of the Association for Computational Linguistics: EMNLP 2024; Association for Computational Linguistics: Miami, FL, USA, 2024. [Google Scholar]
  19. Mora-Cross, M.; Calderon-Ramirez, S. Uncertainty Estimation in Large Language Models to Support Biodiversity Conservation. In Proceedings of NAACL 2024: Industry Track, Mexico City, Mexico, 16–21 June 2024. [Google Scholar]
  20. Wang, K.; Shen, C.; Li, X.; Lu, J. Uncertainty Quantification for Safe and Reliable Autonomous Vehicles: A Review of Methods and Applications. IEEE Trans. Intell. Transp. Syst. 2025, 26, 2880–2896. [Google Scholar] [CrossRef]
  21. Wang, Y.; Lei, Y.; Li, N.; Feng, K.; Wang, Z.; Tan, Y.; Li, H. Machinery Multimodal Uncertainty-Aware RUL Prediction: A Stochastic Modeling Framework for Uncertainty Quantification and Informed Fusion. IEEE Internet Things J. 2025, 12, 31643–31653. [Google Scholar] [CrossRef]
  22. Chen, J.; Yang, D.; Wu, T.; Jiang, Y.; Hou, X.; Li, M.; Wang, S.; Xiao, D.; Li, K.; Zhang, L. Detecting and Evaluating Medical Hallucinations in Large Vision Language Models. arXiv 2024, arXiv:2406.10185. [Google Scholar] [CrossRef]
  23. Wang, Y.; Bi, J.; Ma, Y.; Pirk, S. ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM. arXiv 2025, arXiv:2506.14766. [Google Scholar] [CrossRef]
  24. Kadavath, S.; Conerly, T.; Askell, A.; Henighan, T.; Drain, D.; Perez, E.; Schiefer, N.; Dodds, Z.; German, M.; Johnston, S.; et al. Language Models (Mostly) Know What They Know. arXiv 2022, arXiv:2207.05221. [Google Scholar] [CrossRef]
  25. Kuhn, L.; Gal, Y.; Farquhar, S. Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR), Virtual, 1–5 May 2023. [Google Scholar]
  26. Malinin, A.; Gales, M. Uncertainty Estimation in Autoregressive Structured Prediction. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 4–8 May 2021. [Google Scholar]
  27. Lin, S.; Hilton, J.; Evans, O. TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), Dublin, Ireland, 22–27 May 2022. [Google Scholar]
  28. Peng, Q.; Bao, Y.; Ren, H.; Wang, Z.; Zou, C. Conformal Prediction with Cellwise Outliers: A Detect-then-Impute Approach. In Proceedings of the 42nd International Conference on Machine Learning (ICML), Vancouver, BC, Canada, 13–19 July 2025. [Google Scholar]
  29. Snell, J.C.; Griffiths, T.L. Conformal Prediction as Bayesian Quadrature. In Proceedings of the 42nd International Conference on Machine Learning (ICML), Vancouver, BC, Canada, 13–19 July 2025. [Google Scholar]
  30. Wang, Z.; Wang, Q.; Zhang, Y.; Chen, T.; Zhu, X.; Shi, X.; Xu, K. SConU: Selective Conformal Uncertainty in Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), Vienna, Austria, 27 July–1 August 2025. [Google Scholar]
  31. Barber, R.F.; Candès, E.J.; Ramdas, A.; Tibshirani, R.J. Conformal Prediction Beyond Exchangeability. Ann. Stat. 2023, 51, 816–845. [Google Scholar] [CrossRef]
  32. Angelopoulos, A.N.; Bates, S. Conformal Prediction: A Gentle Introduction. Found. Trends® Mach. Learn. 2023, 16, 494–591. [Google Scholar] [CrossRef]
  33. Wang, Z.; Duan, J.; Wang, Q.; Zhu, X.; Chen, T.; Shi, X.; Xu, K. COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees. arXiv 2025, arXiv:2506.20178. [Google Scholar] [CrossRef]
  34. Iutzeler, F.; Mazoyer, A. Risk-Controlling Prediction with Distributionally Robust Optimization. Trans. Mach. Learn. Res. 2025. [Google Scholar]
  35. Wang, Q.; Geng, T.; Wang, Z.; Wang, T.; Fu, B.; Zheng, F. Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models. In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), Virtual, 24–28 April 2025. [Google Scholar]
Figure 1. TECP flowchart.
Figure 2. EMR vs. Alpha for six models on the TriviaQA dataset using TECP.
Figure 3. EMR vs. Alpha for six models on the TriviaQA dataset using ConU.
Figure 4. EMR vs. Alpha for six models on the CoQA dataset using TECP.
Figure 5. EMR vs. Alpha for six models on the CoQA dataset using ConU.
Figure 6. Coverage rate under different split ratios on CoQA.
Figure 7. Coverage rate under different split ratios on CoQA.
Figure 8. Semantic similarity threshold vs. EMR.
Figure 9. Sampling size vs. EMR.
Table 1. Result of the prediction set size at various risk levels.

Dataset    LLM / α                  0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
TriviaQA   Llama-3.1-8B-Instruct    9.02   8.05   7.06   6.10   5.10   4.06   3.03   2.00   1.02
           Llama-3.2-1B             9.00   7.99   6.99   6.00   5.00   4.00   3.01   1.99   1.01
           Qwen2.5-7B-Instruct      9.01   8.03   7.02   6.00   4.97   3.95   2.96   2.00   1.00
           Qwen2.5-3B-Instruct      8.99   7.99   7.00   6.00   4.99   4.00   3.02   2.03   1.05
           vicuna-7b-v1.5           9.01   8.00   6.99   6.02   5.02   4.00   3.00   2.00   1.01
           vicuna-13b-v1.5          9.00   7.98   6.97   5.98   4.97   3.99   3.00   1.99   1.02
CoQA       Llama-3.1-8B-Instruct    9.02   8.02   7.01   6.02   5.03   4.03   3.03   2.01   1.00
           Llama-3.2-1B             9.00   7.99   7.01   6.01   5.01   4.03   3.03   2.02   1.02
           Qwen2.5-7B-Instruct      8.95   8.01   7.03   6.01   5.02   4.03   3.04   2.05   1.01
           Qwen2.5-3B-Instruct      8.95   7.96   6.99   6.02   5.03   4.02   3.00   2.03   1.01
           vicuna-7b-v1.5           9.00   8.00   7.00   6.01   5.01   4.01   3.00   2.00   1.00
           vicuna-13b-v1.5          8.98   7.98   6.98   5.98   4.98   3.98   2.98   2.00   1.02
Table 2. EMR across fixed prediction set thresholds.

Dataset    LLM / Threshold          0.1    0.3    0.5    0.7    1.0
TriviaQA   Llama-3.2-1B             0.97   0.90   0.84   0.80   0.73
           llama-3.1-8B-Instruct    0.68   0.52   0.39   0.31   0.23
           vicuna-7b-v1.5           1.00   1.00   1.00   1.00   1.00
           vicuna-13b-v1.5          1.00   1.00   1.00   1.00   1.00
           Qwen2.5-3B-Instruct      0.84   0.74   0.66   0.59   0.52
           Qwen2.5-7B-Instruct      0.74   0.62   0.55   0.49   0.42
CoQA       Llama-3.2-1B             1.00   1.00   1.00   1.00   1.00
           llama-3.1-8B-Instruct    1.00   0.99   0.99   0.97   0.87
           vicuna-7b-v1.5           1.00   1.00   1.00   1.00   1.00
           vicuna-13b-v1.5          1.00   1.00   1.00   1.00   1.00
           Qwen2.5-3B-Instruct      0.99   0.97   0.97   0.95   0.90
           Qwen2.5-7B-Instruct      0.99   0.97   0.95   0.94   0.89
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Xu, B.; Lu, Y. TECP: Token-Entropy Conformal Prediction for LLMs. Mathematics 2025, 13, 3351. https://doi.org/10.3390/math13203351
