Article

Soft-Community Kernel Rényi Spectrum for Semantic Uncertainty Estimation in Large Language Models

1 Centre for Advanced Robotics, School of Engineering and Materials Science, Queen Mary University of London, London E1 4NS, UK
2 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai 200240, China
* Author to whom correspondence should be addressed.
Entropy 2026, 28(4), 442; https://doi.org/10.3390/e28040442
Submission received: 24 January 2026 / Revised: 7 March 2026 / Accepted: 11 March 2026 / Published: 14 April 2026
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

Uncertainty estimation is critical for deploying large language models (LLMs) in safety-sensitive and decision-critical applications. Recent approaches estimate semantic uncertainty by clustering multiple sampled responses into equivalence classes and measuring their diversity via entropy-based criteria. However, existing methods typically rely on greedy hard clustering and von Neumann entropy, which suffer from sensitivity to clustering order, noise in semantic equivalence judgments, and limited control over spectral contributions. In this work, we propose a principled information-theoretic framework for LLM semantic uncertainty estimation based on soft semantic communities and kernel Rényi entropy. Given multiple generations for a query, we construct a weighted semantic graph using pairwise semantic similarity scores and infer soft community assignments via weighted graph community detection. These soft assignments induce a positive semi-definite semantic kernel that captures the distribution of semantic modes without enforcing hard equivalence relations. Uncertainty is then quantified by the Rényi entropy of the kernel spectrum, yielding a tunable measure that interpolates between sensitivity to dominant semantic modes and long-tail semantic diversity. Compared to prior von Neumann entropy-based estimators, the proposed Rényi spectral uncertainty offers improved robustness to semantic noise, reduced dependence on clustering heuristics, and greater flexibility through its order parameter. Extensive experiments on question answering tasks demonstrate that our method provides more stable and discriminative uncertainty estimates, particularly under limited sampling budgets and noisy semantic judgments.

1. Introduction

Large language models (LLMs) have demonstrated remarkable performance across a wide range of natural language generation tasks, including question answering, reasoning, and code generation [1,2]. Despite these advances, reliably estimating the uncertainty of LLM outputs remains a fundamental challenge, particularly in safety-critical and decision-sensitive applications such as medical consultation, scientific assistance, and automated programming [3,4].
The practical importance of semantic uncertainty estimation extends across a range of real-world LLM deployments. In medical question answering, a model may produce multiple fluent responses that differ subtly in diagnosis, treatment suggestion, or risk interpretation [5]; in legal or financial assistance, semantically distinct answers may imply different obligations, recommendations, or decisions [6]; and in retrieval-augmented generation, a model may generate responses that sound confident despite deviating from the retrieved evidence [7]. In such cases, token-level confidence alone is often insufficient, because linguistic fluency does not guarantee semantic consistency or factual reliability. This makes semantic uncertainty estimation a practically important component for trustworthy LLM deployment, especially in high-stakes settings where undetected semantic variation may lead to harmful downstream consequences.
A common approach to uncertainty estimation in LLMs is based on token-level predictive entropy, which quantifies uncertainty from the model’s output distribution over tokens [8,9]. However, token-level measures often fail to reflect uncertainty at the semantic level: multiple generated responses may differ substantially in surface form while conveying the same meaning, or conversely, appear linguistically similar while implying contradictory semantic interpretations. As a result, token entropy can severely underestimate or misrepresent the true uncertainty of a model’s answers.
To address this limitation, recent work has shifted attention toward semantic uncertainty estimation, where uncertainty is defined over the space of meanings expressed by multiple sampled generations rather than over individual tokens [10,11,12]. A representative line of work samples multiple responses for a given query, clusters them into semantic equivalence classes using natural language inference (NLI) [13,14], and computes uncertainty as the entropy of the resulting semantic distribution [5,11]. This paradigm marks an important conceptual step toward meaning-aware uncertainty estimation.
Nevertheless, existing semantic uncertainty estimators typically rely on hard, greedy clustering procedures and von Neumann entropy [15] applied to kernelized semantic representations. Such designs introduce several limitations. First, greedy hard clustering is inherently sensitive to sampling order and noise in pairwise semantic judgments, and it enforces strict equivalence relations that are often violated in practice due to ambiguity and paraphrasing variability. Second, von Neumann entropy corresponds to a Shannon entropy over the kernel spectrum and provides limited flexibility in controlling the contribution of dominant versus long-tail semantic modes. Finally, the tight coupling between clustering heuristics and entropy computation makes the resulting uncertainty estimates brittle under limited sampling budgets.
In this work, we propose a principled information-theoretic framework for semantic uncertainty estimation in LLMs that overcomes these limitations by combining soft semantic community discovery with kernel Rényi entropy [16,17]. Instead of enforcing hard semantic equivalence classes, we represent sampled generations as nodes in a weighted semantic graph and infer soft community memberships via weighted graph community detection [18,19]. These soft assignments naturally capture overlapping and ambiguous semantic relationships while avoiding the order dependence of greedy clustering.
The inferred communities induce a positive semi-definite semantic kernel whose spectrum characterizes the distribution of semantic modes expressed by the LLM. We quantify uncertainty by computing the Rényi entropy of the kernel spectrum, yielding a tunable uncertainty measure that interpolates between sensitivity to dominant semantic interpretations and robustness to rare or noisy semantic variations. This spectral perspective decouples semantic structure discovery from uncertainty quantification and generalizes existing entropy-based estimators as a special case.
Through extensive experiments on question-answering tasks, we demonstrate that the proposed framework produces more stable, discriminative, and sample-efficient uncertainty estimates compared to prior semantic entropy methods. Our results highlight the advantages of soft community modeling and Rényi spectral analysis for semantic uncertainty quantification in LLMs.
The main contributions of this work are summarized as follows:
  • We propose a soft-community formulation for semantic uncertainty estimation in LLMs, representing multiple sampled generations as a weighted semantic graph and inferring soft community memberships instead of enforcing hard semantic equivalence classes.
  • We introduce a kernel-based Rényi spectral uncertainty estimator, which quantifies semantic uncertainty via the Rényi entropy of the semantic kernel spectrum, generalizing von Neumann entropy and enabling tunable sensitivity to dominant and long-tail semantic modes.
  • We present a unified information-theoretic framework that decouples semantic structure discovery from uncertainty quantification, providing a principled and extensible perspective on semantic uncertainty estimation for LLMs.
  • Extensive experiments demonstrate that the proposed method yields more stable, discriminative, and sample-efficient uncertainty estimates under limited sampling budgets and noisy semantic judgments compared to existing semantic entropy approaches.

2. Related Work

2.1. Uncertainty Estimation in Large Language Models

Uncertainty estimation is a fundamental component in assessing the reliability of large language model (LLM) outputs, particularly for hallucination detection and risk-aware deployment. Early approaches primarily relied on token-level predictive distributions, such as entropy or variance computed from output probabilities. While effective for capturing lexical uncertainty, these measures often fail to reflect ambiguity at the semantic level, where responses may differ in meaning despite similar surface forms [8,9].
Motivated by this limitation, recent work has increasingly focused on semantic uncertainty estimation, which characterizes uncertainty over sets of sampled responses rather than individual tokens. In this paradigm, uncertainty reflects the diversity of semantic interpretations produced by an LLM for a given query and has been shown to correlate more closely with hallucination likelihood [10]. A common strategy is to evaluate semantic similarity between sampled responses using natural language inference (NLI) models or sentence embeddings, followed by aggregation of these similarities into a global uncertainty measure [20].
Several black-box semantic uncertainty estimators adopt graph-based formulations, where sampled responses are represented as nodes and edge weights encode semantic similarity. Uncertainty is then inferred from structural properties of the resulting graph. For example, some methods estimate uncertainty via node degree statistics, eccentricity measures, or spectral characteristics of the graph Laplacian [20]. While these approaches provide flexible representations of semantic relationships, they often rely on specific graph proxies whose sensitivity to semantic noise and sampling variability may be difficult to control.
Beyond structural graph measures, entropy-based semantic uncertainty estimators have been proposed to directly quantify the dispersion of semantic interpretations. Semantic entropy [10] computes uncertainty by clustering responses into semantic equivalence classes and applying Shannon entropy to the resulting distribution. Subsequent extensions have explored finer-grained semantic modeling, including discrete semantic units [11], equivalence-aware decompositions [21], and smooth uncertainty estimation based on transformer-derived sentence embeddings [22]. These methods highlight the benefits of moving beyond strict entailment decisions, but many still rely on hard clustering assumptions or fixed entropy formulations.
Graph- and kernel-based perspectives further generalize semantic uncertainty estimation by embedding semantic relationships into positive semi-definite matrices and analyzing their spectral properties [23]. Von Neumann entropy has been employed to summarize semantic dispersion via the normalized kernel spectrum [24]. However, this formulation corresponds to a fixed Shannon entropy over eigenvalues and offers limited flexibility in controlling the contribution of dominant versus long-tail semantic modes. Moreover, hard equivalence assumptions or rigid structural proxies may still limit robustness under noisy semantic judgments. Taken together, these methods suggest that semantic uncertainty can be viewed as a spectral property of semantic similarity structures, rather than a by-product of token-level confidence.
In parallel, there is growing interest in uncertainty estimation for long-form and open-ended generation tasks, where responses may span multiple sentences or paragraphs and exhibit richer semantic variation [25,26,27]. In contrast, the present work focuses on short-form, proposition-level responses, where uncertainty arises primarily from competing semantic interpretations rather than extended discourse structure [10]. This setting provides a controlled testbed for studying semantic uncertainty and evaluating the robustness of uncertainty estimators.

2.2. Hallucinations and Confabulations in Large Language Models

Hallucinations in LLMs broadly refer to the generation of fluent but unsupported, incorrect, or internally inconsistent content. Prior studies have identified multiple manifestations of hallucination, including factual fabrication, instruction inconsistency, and reasoning failures, across tasks such as question answering, summarization, and code generation [10,28,29,30]. These phenomena pose a significant challenge to the reliable deployment of LLMs, particularly in scenarios requiring factual accuracy and logical consistency.
Within this broad category, a specific and practically important form of hallucination is often referred to as confabulation, also described as fabrication in short-form question-answering settings [10]. Confabulation occurs when an LLM produces an answer despite lacking sufficient knowledge to support it, leading to responses that are arbitrary or mutually inconsistent across repeated generations. For example, when presented with the same factual query, a model may output different, incompatible answers in separate samples, indicating the absence of a stable underlying semantic belief.
A commonly cited explanation for confabulation is the tendency of LLMs to generate a response even when a query exceeds their effective knowledge boundaries. This behavior is closely tied to training objectives that reward fluent answer generation rather than calibrated abstention, resulting in an overconfident or over-eager response pattern [10,28]. As a consequence, models may prefer to produce a plausible-sounding answer instead of signaling uncertainty or deferring the question.
A variety of approaches have been proposed to mitigate or detect hallucinations and confabulations. Some methods rely on external knowledge sources, cross-referencing generated content with curated databases or retrieval systems to verify factual correctness [31]. Other approaches employ auxiliary models, such as using an external LLM as a judge to assess the consistency or plausibility of generated responses [32]. While effective in certain settings, these strategies typically require additional resources, supervision, or task-specific infrastructure.
An alternative line of research formulates hallucination detection as a supervised classification problem, training models to distinguish accurate from fabricated content using internal representations of the LLM [33]. Although promising, such methods depend on labeled data and may struggle to generalize across domains or prompt distributions.
More recently, uncertainty-based approaches have gained attention as unsupervised indicators of hallucination risk. In particular, semantic uncertainty has been shown to correlate strongly with confabulation in short-form question answering, where competing semantic interpretations emerge across multiple sampled responses [10]. In such settings, uncertainty estimation provides a lightweight and task-agnostic proxy for identifying hallucination-prone queries.
Despite these advances, the effectiveness of uncertainty-based hallucination detection critically depends on the robustness of the underlying uncertainty estimator. Sensitivity to clustering heuristics, rigid equivalence assumptions, or fixed entropy formulations can limit reliability under noisy semantic judgments. This highlights the need for more flexible and principled semantic uncertainty frameworks that can better support the detection of hallucinations and confabulations in large language models.

3. Method

3.1. Problem Setup

Let $x$ denote an input query, such as a short-form factual question. Given a large language model $M$ and a sampling strategy (e.g., temperature sampling), we generate a set of $N$ responses
$$S(x) = \{ s_1, s_2, \ldots, s_N \},$$
where each $s_i$ represents a complete natural language response sampled independently from $M$ conditioned on $x$.
Our goal is to quantify the semantic uncertainty of M with respect to x, defined as the degree of disagreement among the semantic interpretations expressed by the sampled responses. Unlike token-level uncertainty measures, which operate on predictive distributions over vocabulary items, semantic uncertainty is defined over the space of meanings induced by S ( x ) .
Figure 1 provides a step-by-step overview of the proposed semantic uncertainty estimation framework, illustrating how sampled responses are organized into soft semantic communities and summarized via Rényi spectral entropy. We next formalize each component of the framework. Throughout the paper, we focus on query-level semantic uncertainty estimation and do not assume access to model internals.
While several ingredients of the framework, such as semantic similarity graphs, sentence embedding/NLI-based similarity estimation, and spectral graph analysis, are adapted from existing literature, they are not the primary novelty of this work. The main methodological contribution lies in their integration into a unified semantic uncertainty framework based on soft community inference and Rényi spectral kernel entropy. In particular, the proposed method replaces hard semantic equivalence classes with soft community assignments, constructs a community-induced positive semi-definite semantic kernel, and quantifies uncertainty through the Rényi entropy of its spectrum. This design decouples semantic structure discovery from uncertainty quantification and generalizes existing Shannon- or von Neumann-style formulations.

3.2. Semantic Similarity and Graph Construction

To model semantic relationships among sampled responses, we represent S ( x ) as a weighted undirected graph
G = ( V , E , W ) ,
where each node v i V corresponds to a response s i , and edge weights W i j [ 0 , 1 ] encode the semantic similarity between responses s i and s j .
We compute semantic similarity by jointly leveraging sentence-level semantic embeddings and NLI scores, which capture complementary aspects of semantic relatedness [13,34,35]. Specifically, sentence embeddings provide a continuous notion of global semantic proximity, while NLI scores capture directional, logic-aware semantic entailment. NLI is a fundamental task in natural language understanding that aims to determine the semantic relationship between a premise and a hypothesis, typically categorized as entailment, contradiction, or neutrality. In the context of semantic uncertainty estimation, NLI provides a logic-aware measure of semantic consistency that goes beyond surface-level semantic similarity. While sentence embeddings primarily capture distributional or topical proximity between responses, they may assign high similarity to statements that are semantically related yet factually inconsistent. In contrast, NLI explicitly models whether one response semantically supports another, making it particularly suitable for detecting factual disagreement and mutual inconsistency among short-form generated answers.
Let $e(s_i) \in \mathbb{R}^d$ denote the fixed sentence embedding of response $s_i$, obtained from a pretrained sentence embedding model (e.g., all-mpnet-base-v2 [14]) that is independent of the language model used for response generation. We first compute an embedding-based similarity
$$S_{ij}^{\mathrm{emb}} = \frac{\langle e(s_i), e(s_j) \rangle}{\| e(s_i) \| \, \| e(s_j) \|},$$
which is then linearly rescaled to $[0, 1]$ for numerical consistency.
In parallel, we compute a symmetric NLI-based similarity score
$$S_{ij}^{\mathrm{nli}} = \sigma\big(\mathrm{NLI}(s_i \to s_j)\big) \cdot \sigma\big(\mathrm{NLI}(s_j \to s_i)\big),$$
where $\mathrm{NLI}(s_i \to s_j)$ denotes the entailment score from $s_i$ to $s_j$, and $\sigma(\cdot)$ denotes the sigmoid activation function that maps raw scores to $[0, 1]$. This formulation captures mutual semantic support while remaining agnostic to strict equivalence decisions.
We combine the two similarity measures through a multiplicative fusion:
$$W_{ij} = \big(S_{ij}^{\mathrm{emb}}\big)^{\eta} \cdot \big(S_{ij}^{\mathrm{nli}}\big)^{1-\eta}, \quad \eta \in [0, 1],$$
which yields the final edge weight in the semantic graph. This fusion emphasizes response pairs that are both semantically close in embedding space and mutually entailed under NLI, while suppressing spurious similarity arising from either measure alone. The fusion parameter $\eta$ plays a role analogous to a temperature that balances geometric and logical notions of semantic similarity. In our experiments, we fix $\eta = 0.5$ and do not tune it per dataset. An ablation study on the effect of the fusion weight $\eta$ is provided in Section 4.5.2.
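As a concrete illustration, the fusion step can be sketched in NumPy. The function name `fused_similarity` and its inputs are hypothetical: the embeddings and NLI logits below are random placeholders standing in for the outputs of a real sentence-embedding model and NLI model, so this is a minimal sketch of the fusion rule rather than the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fused_similarity(emb, nli_logits, eta=0.5):
    """Fuse embedding cosine similarity with symmetric NLI entailment
    into edge weights W_ij = (S_emb_ij)^eta * (S_nli_ij)^(1 - eta)."""
    # Cosine similarity between row-normalized embeddings,
    # linearly rescaled from [-1, 1] to [0, 1] (clipped for fp safety).
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    s_emb = np.clip((e @ e.T + 1.0) / 2.0, 0.0, 1.0)
    # Symmetric NLI similarity: sigma(NLI(i -> j)) * sigma(NLI(j -> i)).
    s_nli = sigmoid(nli_logits) * sigmoid(nli_logits.T)
    return s_emb**eta * s_nli**(1.0 - eta)

# Toy inputs standing in for real sentence-embedding and NLI model outputs.
rng = np.random.default_rng(0)
W = fused_similarity(rng.normal(size=(6, 8)), rng.normal(size=(6, 6)))
```

By construction, the resulting weight matrix is symmetric with entries in $[0, 1]$, as required for the weighted undirected semantic graph.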

3.3. Soft Community Representation

Rather than enforcing hard semantic equivalence classes, we infer a soft community structure over the semantic graph to capture graded and overlapping semantic relationships among sampled responses. Specifically, we estimate a community assignment matrix
$$P \in \mathbb{R}^{N \times K},$$
where $P_{ik} \ge 0$ denotes the membership strength of response $s_i$ in semantic community $k$, and $\sum_{k=1}^{K} P_{ik} = 1$ for all $i$. In practice, $K$ corresponds to the number of leading non-trivial spectral components used in the graph embedding and is treated as a fixed hyperparameter across datasets.
To compute the soft community assignment matrix $P$, we first perform a spectral analysis of the semantic similarity graph. Let $W \in \mathbb{R}^{N \times N}$ denote the semantic similarity matrix and $D$ the corresponding degree matrix with $D_{ii} = \sum_{j} W_{ij}$. We construct the symmetric normalized graph Laplacian
$$L_{\mathrm{sym}} = I - D^{-1/2} W D^{-1/2},$$
which is positive semi-definite and admits an orthogonal eigendecomposition.
We compute the eigenpairs of $L_{\mathrm{sym}}$,
$$L_{\mathrm{sym}} u_k = \lambda_k u_k,$$
with eigenvalues ordered as $0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_N$. A spectral embedding is then formed by stacking the eigenvectors corresponding to the smallest non-trivial eigenvalues:
$$U = [u_2, u_3, \ldots, u_{K+1}] \in \mathbb{R}^{N \times K}.$$
Each row of $U$ provides a low-dimensional semantic representation of a sampled response. The first eigenvector $u_1$ is excluded, as it corresponds to a trivial uniform mode associated with the zero eigenvalue and does not encode discriminative semantic structure.
Given the spectral embedding $U$, we obtain soft community memberships via a row-wise softmax operation:
$$P_{ik} = \frac{\exp(\tau U_{ik})}{\sum_{k'=1}^{K} \exp(\tau U_{ik'})},$$
where $U_{ik}$ denotes the $(i, k)$-th element of $U$ and $\tau > 0$ controls the sharpness of the assignment. This probabilistic formulation yields overlapping communities and avoids the order sensitivity and instability of hard clustering procedures.
Finally, each semantic community is represented by a weighted aggregate embedding
$$c_k = \sum_{i=1}^{N} P_{ik} \, e(s_i),$$
where $e(s_i)$ denotes the sentence-level semantic embedding of response $s_i$. This representation summarizes each community as a soft semantic prototype and serves as the basis for subsequent kernel construction and spectral uncertainty estimation.
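The spectral-embedding and soft-assignment steps above can be sketched as follows. This is a minimal NumPy sketch under assumed defaults ($K = 3$, $\tau = 5$); `soft_communities` is a hypothetical helper name, and the similarity matrix and embeddings are random stand-ins for the quantities defined earlier in this section.

```python
import numpy as np

def soft_communities(W, K=3, tau=5.0):
    """Soft community memberships from the spectrum of the
    normalized graph Laplacian L_sym = I - D^{-1/2} W D^{-1/2}."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, 1:K + 1]  # drop the trivial uniform eigenvector u_1
    # Row-wise softmax with temperature tau (shifted for stability).
    logits = tau * U
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
W = rng.uniform(0.1, 1.0, size=(8, 8))
W = (W + W.T) / 2.0                      # symmetric similarity matrix
P = soft_communities(W, K=3)
C = P.T @ rng.normal(size=(8, 16))       # prototypes c_k = sum_i P_ik e(s_i)
```

Each row of `P` is a valid membership distribution over the $K$ communities, and `C` stacks the community prototypes $c_k$ as rows.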

3.4. Semantic Kernel and Rényi Spectral Uncertainty

Given the soft community representations $\{ c_k \}_{k=1}^{K}$ obtained from the semantic graph, we construct a kernel-based representation that summarizes the global semantic structure of the sampled responses. Specifically, we define a positive semi-definite semantic kernel
$$G = \sum_{k=1}^{K} \omega_k \, c_k c_k^{\top},$$
where
$$\omega_k = \frac{1}{N} \sum_{i=1}^{N} P_{ik}$$
denotes the relative prevalence of semantic community $k$ among the sampled responses. This formulation aggregates community-level semantic prototypes while weighting them according to their empirical support, yielding a compact second-order representation of semantic structure. Here, each community prototype $c_k \in \mathbb{R}^{d}$ is a sentence embedding-level representation, and the resulting semantic kernel $G \in \mathbb{R}^{d \times d}$ captures second-order semantic structure in the embedding space.
The semantic kernel $G$ captures both the diversity and dominance of semantic communities in a unified matrix form that is amenable to spectral analysis. To ensure numerical stability and comparability across different queries, we normalize the kernel to unit trace:
$$\tilde{G} = \frac{G}{\mathrm{Tr}(G)}.$$
The resulting normalized kernel can be interpreted as a distribution over semantic modes, with its eigenvalues reflecting the relative importance of distinct semantic directions.
We quantify semantic uncertainty by computing the Rényi entropy of the spectrum of $\tilde{G}$. Let $\{ \lambda_1, \ldots, \lambda_d \}$ denote the eigenvalues of the normalized kernel $\tilde{G}$. The Rényi semantic uncertainty of order $\alpha > 0$, $\alpha \neq 1$, is defined as
$$U_\alpha(x) = \frac{1}{1 - \alpha} \log \sum_{i=1}^{d} \lambda_i^{\alpha}.$$
The order parameter $\alpha$ controls the sensitivity of the uncertainty measure to the kernel spectrum. Larger values of $\alpha$ emphasize dominant semantic modes, corresponding to widely shared interpretations, whereas smaller values increase sensitivity to long-tail semantic variation that may arise from minority or unstable interpretations. In the limit $\alpha \to 1$, the Rényi semantic uncertainty recovers the Shannon (von Neumann) entropy as a special case.
This kernel-based spectral formulation decouples semantic structure discovery from uncertainty quantification and provides a flexible information-theoretic framework for measuring semantic disagreement among sampled responses. The complete procedure is summarized in Algorithm 1.
Algorithm 1: Soft-Community Rényi Semantic Uncertainty
Input:
Query $x$, language model $M$, number of samples $N$, Rényi order $\alpha$, fusion weight $\eta \in [0, 1]$, soft assignment temperature $\tau > 0$, number of communities $K$
Output:
Semantic uncertainty score $U_\alpha(x)$
Sampling. Sample $N$ responses $S(x) = \{ s_1, \ldots, s_N \}$ from $M$ conditioned on $x$;
Semantic similarity. Obtain sentence embeddings $e(s_i)$ for each response. Compute the embedding similarity $S_{ij}^{\mathrm{emb}}$ (rescaled to $[0, 1]$) and the symmetric NLI similarity $S_{ij}^{\mathrm{nli}} = \sigma(\mathrm{NLI}(s_i \to s_j)) \cdot \sigma(\mathrm{NLI}(s_j \to s_i))$. Fuse similarities to form the graph weights $W_{ij} = (S_{ij}^{\mathrm{emb}})^{\eta} (S_{ij}^{\mathrm{nli}})^{1-\eta}$;
Spectral embedding. Compute the degree matrix $D$ with $D_{ii} = \sum_j W_{ij}$ and the normalized Laplacian $L_{\mathrm{sym}} = I - D^{-1/2} W D^{-1/2}$. Compute eigenpairs $L_{\mathrm{sym}} u_k = \lambda_k u_k$ and form $U = [u_2, \ldots, u_{K+1}] \in \mathbb{R}^{N \times K}$;
Soft communities. Compute soft memberships by row-wise softmax: $P_{ik} = \exp(\tau U_{ik}) / \sum_{k'=1}^{K} \exp(\tau U_{ik'})$;
Kernel construction. Compute community prototypes $c_k = \sum_{i=1}^{N} P_{ik} e(s_i)$ and weights $\omega_k = \frac{1}{N} \sum_{i=1}^{N} P_{ik}$. Construct the kernel $G = \sum_{k=1}^{K} \omega_k c_k c_k^{\top}$ and normalize $\tilde{G} = G / \mathrm{Tr}(G)$;
Rényi spectral uncertainty. Compute the eigenvalues $\{ \lambda_i \}$ of $\tilde{G}$ and output
$$U_\alpha(x) = \frac{1}{1 - \alpha} \log \sum_i \lambda_i^{\alpha};$$
return $U_\alpha(x)$;
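The kernel-construction and entropy steps of Algorithm 1 admit a compact NumPy sketch. The function below is a hypothetical illustration, assuming soft assignments `P` and embeddings `emb` have already been computed; the inputs in the usage example are random placeholders rather than outputs of real models.

```python
import numpy as np

def renyi_spectral_uncertainty(P, emb, alpha=2.0, eps=1e-12):
    """Kernel construction and Renyi spectral entropy steps of Algorithm 1.

    P   : (N, K) soft community assignments, rows summing to 1.
    emb : (N, d) sentence embeddings e(s_i).
    """
    omega = P.mean(axis=0)            # w_k = (1/N) sum_i P_ik
    C = P.T @ emb                     # prototypes c_k, shape (K, d)
    G = (omega[:, None] * C).T @ C    # G = sum_k w_k c_k c_k^T, shape (d, d)
    G = G / np.trace(G)               # unit-trace normalization
    lam = np.linalg.eigvalsh(G)
    lam = lam[lam > eps]              # drop numerically zero eigenvalues
    if abs(alpha - 1.0) < 1e-8:       # alpha -> 1 recovers von Neumann entropy
        return float(-np.sum(lam * np.log(lam)))
    return float(np.log(np.sum(lam**alpha)) / (1.0 - alpha))

# Toy usage with random stand-ins for soft assignments and embeddings.
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(3), size=10)
emb = rng.normal(size=(10, 5))
```

As a sanity check, when every response shares one embedding the kernel is rank one and the uncertainty vanishes, matching the degeneracy property discussed in Section 3.5; smaller orders $\alpha$ yield entropies at least as large as larger orders.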
We next analyze the computational complexity of the proposed approach and compare it with the original hard-clustering semantic entropy framework. Let N denote the number of sampled responses, d the dimensionality of the sentence embeddings, and K the number of semantic communities.
Computing pairwise semantic similarity scores, including both embedding-based and NLI-based components, requires $O(N^2)$ operations. Constructing the normalized graph Laplacian and performing the eigendecomposition that yields the spectral embedding for soft community inference scales as $O(N^3)$ in the worst case. Compared with the original hard-clustering pipeline, this step introduces additional overhead, but the cost remains modest in practice due to the small number of sampled responses.
Given the spectral embedding, computing soft community assignments via a row-wise softmax requires $O(NK)$ operations. Constructing the semantic kernel incurs $O(K d^2)$ cost, and computing the Rényi spectral uncertainty involves eigendecomposition of a $d \times d$ kernel matrix, which scales as $O(d^3)$ in the worst case. This kernel-level spectral computation also constitutes additional cost relative to Shannon-entropy-based semantic entropy, but enables a richer uncertainty characterization through the proposed spectral Rényi formulation.
Overall, the computational complexity is dominated by the eigendecomposition steps on the response-level graph and the embedding-level kernel. In practice, since N is small (typically on the order of tens) and d is moderate and fixed by the embedding model, the overall computation remains efficient and tractable for query-level semantic uncertainty estimation. Therefore, the additional computation can be viewed as a reasonable trade-off for improved robustness, yielding uncertainty estimates that are less sensitive to clustering order, semantic ambiguity, and noisy pairwise relations.

3.5. Theoretical Properties of Rényi Spectral Uncertainty

We analyze several fundamental properties of the proposed Rényi spectral uncertainty measure. Let $\tilde{G}$ be a unit-trace positive semi-definite semantic kernel with eigenvalues $\{ \lambda_i \}_{i=1}^{d}$. For $\alpha > 0$, $\alpha \neq 1$, the Rényi semantic uncertainty is defined as
$$U_\alpha(\tilde{G}) = \frac{1}{1 - \alpha} \log \sum_{i=1}^{d} \lambda_i^{\alpha}.$$
Proposition 1
(Degeneracy). If $\tilde{G}$ has rank one, i.e., $\lambda_1 = 1$ and $\lambda_i = 0$ for all $i > 1$, then $U_\alpha(\tilde{G}) = 0$ for any $\alpha > 0$.
Proof. 
All proofs are provided in Appendix A. □
Proposition 2
(Unitary Invariance). For any orthogonal matrix $U$, the Rényi spectral uncertainty is invariant under orthogonal similarity transformations:
$$U_\alpha(\tilde{G}) = U_\alpha(U \tilde{G} U^{\top}).$$
Proposition 3
(Monotonicity under Spectral Dispersion). For fixed trace, $U_\alpha(\tilde{G})$ increases as the eigenvalue distribution of $\tilde{G}$ becomes more uniform. In particular, kernels with more evenly distributed eigenvalues exhibit higher semantic uncertainty.
Proposition 4
(Order Sensitivity). The Rényi order $\alpha$ controls the sensitivity of $U_\alpha$ to dominant semantic modes. Larger values of $\alpha$ emphasize large eigenvalues, making the uncertainty measure more sensitive to dominant semantic interpretations, whereas smaller values of $\alpha$ increase sensitivity to long-tail semantic variation.
Together, these properties characterize the behavior of the proposed semantic uncertainty measure. Proposition 1 ensures that uncertainty vanishes when all sampled responses collapse to a single semantic interpretation, corresponding to maximal semantic certainty. Proposition 2 guarantees that the uncertainty depends only on the intrinsic spectral structure of the semantic kernel and is invariant to the choice of basis or representation.
Proposition 3 formalizes the intuition that semantic uncertainty reflects the dispersion of competing semantic modes: concentration of semantic mass on a small number of dominant modes leads to low uncertainty, whereas a more uniform distribution across modes yields higher uncertainty. Finally, Proposition 4 highlights a key advantage of the Rényi formulation, namely the ability to explicitly control the relative contribution of dominant versus minor semantic interpretations through the order parameter α . This flexibility is particularly valuable under limited sampling, where small eigenvalues may correspond either to noise or to meaningful but infrequent semantic alternatives.
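The propositions above can also be checked numerically. The following is a minimal sketch in which `renyi_entropy` is a hypothetical helper operating directly on a spectrum, and the kernel is a random unit-trace PSD matrix rather than one derived from actual sampled responses.

```python
import numpy as np

def renyi_entropy(eigvals, alpha, eps=1e-12):
    """Renyi entropy of order alpha over a unit-trace spectrum."""
    lam = eigvals[eigvals > eps]
    return float(np.log(np.sum(lam**alpha)) / (1.0 - alpha))

rng = np.random.default_rng(3)

# A random unit-trace positive semi-definite kernel.
A = rng.normal(size=(5, 5))
G = A @ A.T
G = G / np.trace(G)
lam = np.linalg.eigvalsh(G)

# Proposition 1: a rank-one spectrum (1, 0, ..., 0) yields zero uncertainty.
rank_one = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

# Proposition 2: an orthogonal rotation Q G Q^T leaves the spectrum intact.
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
lam_rot = np.linalg.eigvalsh(Q @ G @ Q.T)

# Proposition 3: the uniform spectrum attains the maximum entropy log(d).
uniform = np.full(5, 1.0 / 5.0)
```

For $\alpha = 2$, the rank-one spectrum gives zero entropy, the rotated kernel reproduces the original entropy, and the uniform spectrum attains the maximum value $\log d$.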

4. Experiments

4.1. Experimental Setup

We name our approach Rényi spectral uncertainty (RSU), a kernel-based framework for semantic uncertainty estimation in large language models. We evaluate its performance in a black-box question-answering setting. For each input query x, we sample multiple independent responses from a pretrained instruction-tuned language model to induce semantic variability. Unless otherwise specified, all experiments are conducted using a fixed sampling budget per query and identical decoding configurations across methods.
Sentence-level semantic embeddings are computed using a fixed pretrained sentence embedding model, which is independent of the language model used for generation. Logical semantic relations are estimated using a pretrained natural language inference (NLI) model. Both models are held fixed throughout all experiments and are not fine-tuned.
For semantic similarity construction, we employ the multiplicative fusion of embedding-based similarity and symmetric NLI entailment scores described in Section 3, with fusion weight η controlling their relative contributions. Soft community assignments are obtained via spectral embedding of the normalized graph Laplacian followed by a row-wise softmax with temperature τ. The number of communities K, the Rényi order α, and all other hyperparameters are fixed across datasets unless explicitly varied in ablation studies.
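As an illustration of the soft community inference step, the following is a minimal sketch, assuming a precomputed N × N similarity matrix S; the function name and the exact Laplacian normalization details are illustrative rather than a reproduction of our implementation.

```python
import numpy as np

def soft_communities(S, K=4, tau=0.5):
    """Soft community assignments from an N x N symmetric similarity matrix S.

    Sketch of the described pipeline: spectral embedding of the normalized
    graph Laplacian, followed by a row-wise softmax with temperature tau.
    """
    N = S.shape[0]
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(N) - D_inv_sqrt @ S @ D_inv_sqrt   # normalized graph Laplacian
    _, evecs = np.linalg.eigh(L)                  # eigenvalues in ascending order
    U = evecs[:, :K]                              # K smallest-eigenvalue eigenvectors
    P = np.exp(U / tau)
    return P / P.sum(axis=1, keepdims=True)       # row-wise softmax: rows sum to 1
```

Each row of the returned matrix is a probability distribution over the K communities, so a response can contribute to several semantic communities at once.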
For each input query, we sample N = 10 responses using a combination of top-k sampling (k = 50) and nucleus sampling (p = 0.9) at temperature T = 1. This setting balances semantic diversity against generation stability and is consistent with our sensitivity analysis in Section 4.6.

4.2. Evaluation Tasks and Data

We consider short-form generative question-answering tasks for evaluation, where semantic uncertainty arises primarily from competing interpretations or incomplete knowledge rather than long-form discourse. This setting provides a controlled environment for analyzing semantic disagreement across sampled responses and its relationship to hallucination and confabulation.
Experiments are conducted on a diverse collection of open-domain and domain-specific question-answering benchmarks, covering conversational, closed-book, and specialized knowledge settings. Specifically, we evaluate on the open-book conversational QA dataset CoQA [36], the closed-book QA dataset TriviaQA [37], the biomedical QA dataset BioASQ [38], and the Natural Questions (NQ) benchmark [39]. These datasets span a wide range of domains and question styles, enabling a comprehensive evaluation of semantic uncertainty estimation under varying knowledge and reasoning requirements.
Following standard practice, we use the development split of CoQA, the deduplicated validation split of TriviaQA (rc.nocontext subset), the validation split of NQ, and the training split of BioASQ. Across all datasets, uncertainty is evaluated at the query level by aggregating information from multiple sampled responses.
For each query, we assess whether uncertainty estimates can reliably distinguish between correct and incorrect model outputs. Correctness labels are obtained using dataset-provided ground-truth answers combined with automated verification procedures, following standard practices in prior uncertainty estimation work.
We utilize four widely adopted off-the-shelf instruction-tuned large language models for evaluation, with model sizes ranging from 1B to 12B parameters. These models include Llama-3.2-1B (https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct, accessed on 1 January 2026), Llama-3.1-8B (https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct, accessed on 1 January 2026), Mistral-7B-v0.3 (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3, accessed on 5 January 2026), and Mistral-Nemo-12B (https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407, accessed on 10 January 2026), allowing us to assess the robustness of the proposed uncertainty estimator across different model scales and architectures.

4.3. Evaluation Metrics and Baselines

4.3.1. Evaluation Metrics

We evaluate uncertainty estimates using two complementary metrics. First, we report the area under the receiver operating characteristic curve (AUROC), which measures how well uncertainty scores distinguish between correct and incorrect model outputs. An AUROC of 0.5 corresponds to random discrimination, whereas higher values indicate stronger alignment between uncertainty and answer correctness.
Second, we report the area under the accuracy–rejection curve (AUARC), which quantifies the potential accuracy improvement obtained by rejecting answers with high estimated uncertainty. This metric captures the practical utility of uncertainty estimates in risk-aware deployment scenarios.
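The two metrics can be computed directly from per-query uncertainty scores and correctness labels. The sketch below uses the rank (Mann–Whitney) formulation of AUROC and one common discretization of the accuracy–rejection curve (rejecting the most uncertain answers first); the exact integration scheme used for AUARC in the experiments may differ.

```python
import numpy as np

def auroc(uncertainty, incorrect):
    """AUROC of uncertainty as a detector of incorrect answers,
    via the rank (Mann-Whitney) formulation; ties count 0.5."""
    u = np.asarray(uncertainty, float)
    y = np.asarray(incorrect, bool)
    pos, neg = u[y], u[~y]                      # incorrect vs. correct scores
    wins = (pos[:, None] > neg[None, :]).sum() \
         + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def auarc(uncertainty, correct):
    """Area under the accuracy-rejection curve: keep the k most confident
    answers for k = 1..N and average the retained accuracy."""
    order = np.argsort(uncertainty)             # most confident first
    c = np.asarray(correct, float)[order]
    accs = np.cumsum(c) / np.arange(1, len(c) + 1)
    return accs.mean()
```

With perfectly separating scores, `auroc` returns 1.0, and `auarc` approaches the accuracy achievable by rejecting exactly the wrong answers.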

4.3.2. Baselines

We compare the proposed Rényi spectral uncertainty estimator with a diverse set of representative uncertainty estimation baselines that reflect different modeling assumptions about semantic variability in generated responses. Rather than focusing on individual implementations, we group these baselines according to their underlying principles for quantifying semantic uncertainty.
The first category consists of semantic entropy-based methods, which cluster sampled responses into semantic equivalence classes and compute uncertainty as the entropy of the resulting empirical distribution. These approaches include semantic entropy (SE) [10], Discrete Semantic Entropy (DSE) [10], and Kernel Language Entropy (KLE) [11], which typically rely on hard clustering decisions and treat semantic interpretations as mutually exclusive.
The second category consists of graph-based semantic uncertainty methods, including Ecc [20], EigV [20], Deg [20], and D-UE [21]. These methods construct a semantic similarity graph over sampled responses and derive uncertainty from structural or spectral properties of the graph, such as node centrality, degree statistics, or eigenvalue-based measures.
We additionally include other state-of-the-art approaches, including Number of Semantic Sets (NSS) [24] and Semantic Embedding Uncertainty (SEU) [22]. All baselines are evaluated using the same sampled responses and semantic similarity representations to ensure a fair comparison.

4.4. Main Results

Table 1 and Table 2 report the AUROC and AUARC performance of different uncertainty estimation methods across a diverse set of model–dataset combinations. Overall, the proposed RSU achieves consistently strong performance and outperforms existing baselines in the majority of settings.

4.4.1. AUROC Performance

Across most model–dataset pairs, RSU attains the highest or second-highest AUROC, indicating improved discrimination between correct and incorrect responses based on uncertainty estimates. Performance gains are particularly evident on datasets where semantic ambiguity or confabulation is more prevalent, such as open-domain and knowledge-intensive question-answering benchmarks. These results suggest that explicitly modeling semantic disagreement at the community level provides a more reliable signal of answer correctness than token-level uncertainty or hard semantic clustering.
Compared to entropy-based baselines that rely on hard semantic equivalence assumptions, RSU benefits from soft community representations that capture graded and overlapping semantic interpretations. Relative to graph-based baselines that exploit response-level relational structure, RSU further improves discrimination by combining probabilistic community memberships with spectral aggregation.

4.4.2. AUARC Performance

As shown in Table 2, RSU also yields consistent improvements in accuracy–rejection performance. By selectively rejecting responses with high estimated uncertainty, the proposed method achieves higher retained accuracy across a wide range of rejection thresholds. This demonstrates that RSU provides uncertainty estimates that are not only discriminative but also practically useful for risk-aware decision making.
Across language models of varying sizes and architectures, RSU exhibits robust behavior, indicating that its effectiveness does not depend on specific model internals. Instead, performance gains stem from the explicit modeling of semantic structure among sampled responses, reinforcing the applicability of RSU in black-box settings.
Across all evaluated models and datasets, RSU consistently achieves the best or second-best performance, with only moderate variance across runs. Taken together, these results demonstrate that the proposed framework offers a robust and effective measure of semantic uncertainty, generalizing existing entropy- and graph-based approaches while providing improved discrimination and practical utility.

4.5. Ablation Studies

We further conduct a series of ablation studies to analyze the contribution of key design choices in RSU and to better understand why the proposed framework yields robust semantic uncertainty estimates. All ablations are performed using the same evaluation protocol as in the main experiments, while varying one factor at a time.

4.5.1. Effect of Rényi Order α

We first study the effect of the Rényi order α, which controls the spectral sensitivity of the proposed RSU. Recall that smaller values of α emphasize low-magnitude eigenvalues corresponding to long-tail or rare semantic variations, whereas larger values of α increasingly focus on dominant semantic modes.
We conduct this ablation on Llama-3.2-1B across all four datasets, varying α ∈ {0.2, 0.5, 1, 2, 5, 10} while keeping all other components fixed. The Shannon (von Neumann) entropy case is approximated using α = 1.01 for numerical stability.
Table 3 reports the AUROC results. Across all datasets, RSU exhibits a clear and consistent inverted U-shaped performance profile as a function of α. Very small values of α lead to degraded performance, as the uncertainty measure becomes overly sensitive to minor eigenvalues that often reflect sampling noise or spurious semantic variations. As α increases, performance improves steadily and peaks around α = 2, indicating an optimal balance between dominant and secondary semantic interpretations.
For larger values of α (e.g., α = 5 and α = 10), performance slightly decreases, suggesting that over-emphasizing the dominant spectral components suppresses meaningful semantic disagreement. Importantly, this trend is consistent across all datasets, including open-domain (NQ, TriviaQA) and knowledge-intensive (BioASQ) benchmarks. This observation is consistent with the theoretical role of α as a control over spectral sensitivity, as discussed in Section 3.4.
Based on these observations, we adopt α = 2 as the default setting in all main experiments. This choice is empirically robust and aligns with the theoretical motivation that effective semantic uncertainty estimation requires balancing dominant interpretations against structured semantic diversity.
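The quantity varied in this ablation can be sketched compactly: the Rényi entropy of the trace-normalized kernel spectrum, with the α ≈ 1 case falling back to the Shannon (von Neumann) limit. The function name is illustrative.

```python
import numpy as np

def renyi_spectral_uncertainty(K, alpha):
    """Renyi entropy of the spectrum of a PSD semantic kernel K.

    The kernel is trace-normalized so its eigenvalues form a probability
    distribution; alpha near 1 recovers the Shannon (von Neumann) entropy.
    """
    lam = np.linalg.eigvalsh(K)
    lam = np.clip(lam, 0.0, None)           # clip tiny negative round-off
    lam = lam / lam.sum()                   # unit trace
    lam = lam[lam > 1e-12]                  # drop numerically zero modes
    if abs(alpha - 1.0) < 1e-8:             # Shannon limit
        return float(-(lam * np.log(lam)).sum())
    return float(np.log((lam ** alpha).sum()) / (1.0 - alpha))
```

A rank-one kernel yields uncertainty 0 (Proposition 1), while a maximally spread spectrum with d equal eigenvalues yields log d for any α.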

4.5.2. Effect of Fusion Weight η

RSU constructs the semantic similarity graph by fusing two complementary signals: sentence embedding similarity and entailment-based semantic consistency. The fusion weight η ∈ [0, 1] controls their relative contributions, where η → 0 yields a graph dominated by entailment-based similarity, and η → 1 relies primarily on embedding-based similarity.
We evaluate the effect of η on Llama-3.2-1B across all datasets by varying η ∈ {0.01, 0.25, 0.5, 0.75, 0.99}, while keeping all other components fixed. The corresponding AUROC results are reported in Table 4.
Overall, intermediate values of η consistently achieve the best performance, with η = 0.5 yielding the highest average AUROC. When η approaches either extreme, performance degrades, indicating that relying exclusively on a single similarity source is suboptimal. In particular, graphs constructed solely from embedding-based similarity (η → 1) exhibit the largest performance drop. This behavior can be attributed to the fact that sentence embeddings primarily capture distributional and topical similarity, which may conflate semantically related but factually inconsistent answers. As a result, embedding-only graphs tend to underestimate semantic disagreement in short-form question answering, leading to overconfident uncertainty estimates.
In contrast, entailment-only graphs (η → 0) perform more competitively, as entailment models are explicitly trained to detect logical consistency and contradiction, making them more sensitive to factual conflicts among sampled responses. However, entailment predictions are inherently noisy and discrete, which may introduce instability in the induced graph structure and subsequent spectral analysis.
The best performance is achieved by fusing the two signals. Embedding-based similarity provides a smooth geometric structure that stabilizes the semantic graph, while entailment-based similarity supplies strong discriminative cues for logical inconsistency. Their combination allows RSU to capture both continuous semantic proximity and discrete logical disagreement, resulting in a more robust and expressive semantic kernel for uncertainty estimation.
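One plausible reading of the multiplicative fusion with weight η is a weighted geometric mean of the two similarity matrices; the sketch below is illustrative, and the exact parameterization in our implementation may differ.

```python
import numpy as np

def fuse_similarity(S_emb, S_nli, eta=0.5, eps=1e-8):
    """Illustrative multiplicative fusion of embedding-based similarity and
    symmetric NLI entailment scores, weighted by eta.

    eta -> 1 emphasizes the embedding signal; eta -> 0 the entailment signal.
    Inputs are clipped into (eps, 1] so the geometric mean is well defined.
    """
    S_emb = np.clip(np.asarray(S_emb, float), eps, 1.0)
    S_nli = np.clip(np.asarray(S_nli, float), eps, 1.0)
    return S_emb ** eta * S_nli ** (1.0 - eta)   # weighted geometric mean
```

At η = 0.5 the fused weight of an edge is the geometric mean of the two signals, so an edge is strong only when both the embedding and the entailment evidence support it.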

4.5.3. Soft vs. Hard Semantic Community Assignment

We further investigate the impact of soft semantic community modeling by comparing RSU with a hard community variant. In the hard assignment setting, each response is assigned to a single semantic community by selecting the community with the highest spectral embedding activation. Formally, given the spectral embedding U ∈ ℝ^{N×K}, hard assignments are obtained via P_{ik} = 1[k = arg max_j U_{ij}], resulting in mutually exclusive semantic communities.
Table 5 reports the AUROC comparison. Across all datasets, soft community assignment consistently outperforms hard clustering. This performance gap highlights the importance of modeling graded and overlapping semantic relationships among sampled responses.
Hard community assignment enforces strict semantic boundaries, which can be problematic in short-form question answering where responses may partially agree, share common entities, or differ only in subtle factual details. Such rigid partitioning may artificially fragment semantically related answers or collapse distinct but related interpretations, leading to distorted semantic structure.
In contrast, soft community modeling allows each response to contribute to multiple semantic communities with different strengths. This probabilistic representation preserves nuanced semantic overlap and yields a smoother and more stable semantic kernel. As a result, the subsequent spectral analysis captures semantic disagreement more faithfully, leading to improved uncertainty estimation.
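The hard-assignment baseline of this ablation is a one-line operation on the spectral embedding; a minimal sketch (function name illustrative):

```python
import numpy as np

def hard_assignments(U):
    """Hard community variant: each response joins only the community with
    the highest spectral-embedding activation, P_ik = 1[k = argmax_j U_ij]."""
    N, K = U.shape
    P = np.zeros((N, K))
    P[np.arange(N), U.argmax(axis=1)] = 1.0   # one-hot rows
    return P
```

Unlike the row-softmax used by RSU, every row here is one-hot, so partial membership in several communities is impossible by construction.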

4.6. Sensitivity Analysis

RSU estimates semantic uncertainty by aggregating information across multiple sampled responses. We therefore finally analyze its sensitivity to the number of sampled responses N, varying N ∈ {3, 5, 10, 20, 50} while keeping all other components fixed.
Table 6 reports the AUROC results across all datasets. When the number of samples is very small (N = 3), performance is substantially degraded, indicating that insufficient sampling fails to capture the underlying semantic structure. As N increases, performance improves rapidly and stabilizes around N = 10.
Notably, further increasing the number of samples beyond N = 10 yields only marginal improvements. This saturation behavior suggests that RSU is able to recover reliable semantic uncertainty estimates with a relatively small number of samples, making it computationally efficient in practice. Across all datasets, the relative performance trends remain consistent, further confirming the robustness of the proposed method with respect to sampling size.

5. Conclusions and Future Work

This work adopts a spectral, information-theoretic perspective on semantic uncertainty estimation in large language models. A central design choice is the use of Rényi entropy over the spectrum of a semantic kernel, which generalizes existing Shannon- and von Neumann-based formulations and defines a parametric family that explicitly controls the relative contribution of dominant versus long-tail spectral components. In the context of semantic uncertainty, this flexibility is crucial: dominant eigenvalues typically correspond to widely shared semantic interpretations, whereas smaller eigenvalues capture minority, unstable, or competing semantic alternatives. The Rényi order α therefore provides a principled mechanism for balancing sensitivity to semantic disagreement against robustness under limited and noisy sampling. This spectral formulation is further strengthened by the use of soft, community-aware semantic representations, which model semantic similarity as inherently graded and avoid the noise amplification and order dependence associated with hard semantic equivalence classes.
Building on these design principles, we introduced Rényi spectral uncertainty (RSU), a principled and flexible framework for semantic uncertainty estimation in large language models. By modeling semantic relationships among multiple sampled responses as a weighted graph and adopting a soft community representation, RSU captures graded and overlapping semantic structures beyond hard equivalence assumptions. Extensive experiments across multiple question-answering benchmarks and language models demonstrate that RSU consistently provides stronger discrimination between correct and incorrect responses and yields more effective accuracy–rejection trade-offs than existing uncertainty estimators. These results highlight the importance of integrating semantic structure discovery with information-theoretic uncertainty measures for uncertainty-aware deployment of large language models.
Several directions remain open for future work. While RSU is particularly effective for short-form, proposition-level responses, the overall framework is not restricted to question answering. More generally, it can be applied to other LLM generation tasks, such as summarization, dialogue, retrieval-augmented generation, and long-form text generation, whenever multiple sampled outputs can be interpreted as alternative semantic realizations of the same prompt. In such settings, the core pipeline of semantic graph construction, soft community inference, and Rényi spectral uncertainty estimation remains applicable, while the main challenge lies in designing task-appropriate semantic similarity measures. For instance, long-form generation may require discourse-aware or segment-level semantic comparisons, whereas retrieval-grounded generation may benefit from evidence-aware similarity modeling.
In addition, integrating more adaptive or task-specific semantic similarity models may further enhance robustness across domains, since semantic similarity directly determines the response graph structure and therefore strongly influences the resulting uncertainty estimate. While the current embedding- and NLI-based similarity design works well for the short-form QA setting considered here, other applications may require more specialized modeling, such as domain-adapted sentence encoders, task-specific fusion of different similarity signals, evidence-aware similarity for retrieval-grounded generation, or discourse-aware similarity for long-form generation. Moreover, if supervised signals such as hallucination or factuality labels are available, the proposed framework could be further extended through supervised calibration or learned similarity modeling. For example, one may learn task-specific fusion weights for the semantic graph, replace the fixed similarity components with trainable semantic comparators, or calibrate the resulting Rényi spectral uncertainty score into a downstream hallucination risk predictor.

Author Contributions

Conceptualization, Z.L. and J.D.; methodology, Z.L. and J.D.; software, Z.L.; validation, Z.L. and J.D.; formal analysis, Z.L.; investigation, Z.L. and J.D.; writing—original draft preparation, Z.L.; writing—review and editing, J.D.; visualization, Z.L.; supervision, J.D.; funding acquisition, J.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 52108399) and the Shanghai Municipal Science and Technology Major Project (Grant No. 2021SHZDZX0102).

Data Availability Statement

This study evaluates semantic uncertainty estimation using publicly available question-answering benchmarks and pretrained language models. All evaluation datasets are publicly accessible from their original sources. The sampled model responses used for uncertainty estimation are generated following the experimental protocols described in the paper and can be fully reproduced using the same prompts, decoding configurations, and pretrained models. Sentence embedding models and natural language inference models employed in this work are also publicly available. The code for semantic similarity construction, soft community inference, and Rényi spectral uncertainty computation will be available from the corresponding authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Proofs of Theoretical Properties

In this appendix, we provide proofs for the theoretical properties of the Rényi spectral uncertainty measure introduced in Section 3.5. Let G̃ denote a unit-trace positive semi-definite matrix with eigenvalues {λ_i}_{i=1}^{d}, which define a probability distribution over spectral modes, and let α > 0, α ≠ 1.
Proof of Proposition 1 (Degeneracy).
If G̃ has rank one, then its spectrum satisfies λ_1 = 1 and λ_i = 0 for all i > 1. Substituting into the definition of the Rényi entropy yields
U_α(G̃) = (1/(1 − α)) log Σ_{i=1}^{d} λ_i^α = (1/(1 − α)) log(1) = 0.
Thus, the Rényi spectral uncertainty vanishes for rank-one kernels, completing the proof. □
Proof of Proposition 2 (Unitary Invariance).
Let U be an orthogonal matrix. Since G̃ is symmetric, the matrices G̃ and U G̃ U^⊤ share the same eigenvalue spectrum. Because U_α(G̃) depends only on the eigenvalues, it follows immediately that
U_α(G̃) = U_α(U G̃ U^⊤).
This establishes unitary invariance. □
Proof of Proposition 3 (Monotonicity under Spectral Dispersion).
For fixed trace Σ_i λ_i = 1, the Rényi entropy
H_α(λ) = (1/(1 − α)) log Σ_i λ_i^α
is a Schur-concave function of the eigenvalue vector λ for all α > 0. Therefore, if λ majorizes μ, then H_α(λ) ≤ H_α(μ). This implies that the Rényi spectral uncertainty increases as the eigenvalue distribution becomes more uniform, formalizing the notion that greater spectral dispersion corresponds to higher semantic uncertainty. □
Proof of Proposition 4 (Order Sensitivity).
Consider two Rényi orders α_1 < α_2. For a fixed eigenvalue distribution, larger values of α assign greater relative weight to larger eigenvalues in the sum Σ_i λ_i^α. This follows from the fact that, for λ ∈ (0, 1), the function λ^α decreases monotonically with increasing α at a relative rate |log λ| that is larger for smaller eigenvalues, so increasing α amplifies the relative contribution of dominant spectral components. As a result, U_{α_2} becomes increasingly dominated by the largest eigenvalues, whereas U_{α_1} remains more sensitive to long-tail spectral variation, completing the proof. □
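The four propositions can also be checked numerically on small examples. The following illustrative script (not part of the formal proofs) verifies degeneracy, unitary invariance, Schur-concavity, and the monotone effect of the order α on toy spectra.

```python
import numpy as np

def renyi(lam, alpha):
    """Renyi entropy of a nonnegative spectrum, normalized to unit trace."""
    lam = np.asarray(lam, float)
    lam = lam / lam.sum()
    lam = lam[lam > 0]
    return float(np.log((lam ** alpha).sum()) / (1.0 - alpha))

# Proposition 1 (degeneracy): a rank-one spectrum has zero uncertainty.
assert abs(renyi([1.0, 0.0, 0.0], 2.0)) < 1e-12

# Proposition 2 (unitary invariance): conjugation by an orthogonal matrix
# preserves the spectrum of a symmetric PSD kernel.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
G = A @ A.T
G /= np.trace(G)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))       # random orthogonal matrix
assert np.allclose(np.linalg.eigvalsh(G), np.linalg.eigvalsh(Q @ G @ Q.T))

# Proposition 3 (Schur-concavity): a more uniform spectrum has higher entropy.
assert renyi([0.25] * 4, 2.0) > renyi([0.7, 0.1, 0.1, 0.1], 2.0)

# Proposition 4 (order sensitivity): Renyi entropy is non-increasing in alpha,
# so a long-tailed spectrum scores lower at alpha = 5 than at alpha = 0.5.
tail = [0.6, 0.2, 0.1, 0.1]
assert renyi(tail, 5.0) < renyi(tail, 0.5)
```

All assertions pass, matching the qualitative behavior established in the proofs above.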

References

  1. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  2. Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. Emergent Abilities of Large Language Models. arXiv 2022, arXiv:2206.07682. [Google Scholar] [CrossRef]
  3. Shorinwa, O.; Mei, Z.; Lidard, J.; Ren, A.Z.; Majumdar, A. A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions. ACM Comput. Surv. 2025, 58, 63. [Google Scholar] [CrossRef]
  4. Liu, X.; Chen, T.; Da, L.; Chen, C.; Lin, Z.; Wei, H. Uncertainty quantification and confidence calibration in large language models: A survey. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2; Association for Computing Machinery: New York, NY, USA, 2025; pp. 6107–6117. [Google Scholar]
  5. Penny-Dimri, J.C.; Bachmann, M.; Cooke, W.R.; Mathewlynn, S.; Dockree, S.; Tolladay, J.; Kossen, J.; Li, L.; Gal, Y.; Jones, G.D. Measuring large language model uncertainty in women’s health using semantic entropy and perplexity: A comparative study. Lancet Obstet. Gynaecol. Women’s Health 2025, 1, e47–e56. [Google Scholar] [CrossRef]
  6. Dahl, M.; Magesh, V.; Suzgun, M.; Ho, D.E. Large legal fictions: Profiling legal hallucinations in large language models. J. Leg. Anal. 2024, 16, 64–93. [Google Scholar] [CrossRef]
  7. Hu, H.; He, C.; Xie, X.; Zhang, Q. Lrp4rag: Detecting hallucinations in retrieval-augmented generation via layer-wise relevance propagation. arXiv 2024, arXiv:2408.15533. [Google Scholar] [CrossRef]
  8. Gal, Y.; Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning; PMLR: New York, NY, USA, 2016; pp. 1050–1059. [Google Scholar]
  9. Kendall, A.; Gal, Y. What uncertainties do we need in bayesian deep learning for computer vision? In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  10. Farquhar, S.; Kossen, J.; Kuhn, L.; Gal, Y. Detecting hallucinations in large language models using semantic entropy. Nature 2024, 630, 625–630. [Google Scholar] [CrossRef]
  11. Nikitin, A.; Kossen, J.; Gal, Y.; Marttinen, P. Kernel language entropy: Fine-grained uncertainty quantification for llms from semantic similarities. Adv. Neural Inf. Process. Syst. 2024, 37, 8901–8929. [Google Scholar]
  12. Qiu, X.; Miikkulainen, R. Semantic density: Uncertainty quantification for large language models through confidence measurement in semantic space. Adv. Neural Inf. Process. Syst. 2024, 37, 134507–134533. [Google Scholar]
  13. Bowman, S.; Angeli, G.; Potts, C.; Manning, C.D. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; pp. 632–642. [Google Scholar]
  14. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; p. 3982. [Google Scholar]
  15. Von Neumann, J. Mathematical Foundations of Quantum Mechanics: New Edition; Princeton University Press: Princeton, NJ, USA, 2018. [Google Scholar]
  16. Giraldo, L.G.S.; Rao, M.; Principe, J.C. Measures of entropy from data using infinitely divisible kernels. IEEE Trans. Inf. Theory 2014, 61, 535–548. [Google Scholar] [CrossRef]
  17. Bach, F. Information theory with kernel methods. IEEE Trans. Inf. Theory 2022, 69, 752–775. [Google Scholar] [CrossRef]
  18. Blondel, V.D.; Guillaume, J.L.; Lambiotte, R.; Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008, 2008, P10008. [Google Scholar] [CrossRef]
  19. Traag, V.A.; Waltman, L.; Van Eck, N.J. From Louvain to Leiden: Guaranteeing well-connected communities. Sci. Rep. 2019, 9, 5233. [Google Scholar] [CrossRef]
  20. Lin, Z.; Trivedi, S.; Sun, J. Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models. arXiv 2024, arXiv:2305.19187. [Google Scholar] [CrossRef]
  21. Da, L.; Chen, T.; Cheng, L.; Wei, H. Llm uncertainty quantification through directional entailment graph and claim level response augmentation. arXiv 2024, arXiv:2407.00994. [Google Scholar] [CrossRef]
  22. Grewal, Y.S.; Bonilla, E.V.; Bui, T.D. Improving uncertainty quantification in large language models via semantic embeddings. arXiv 2024, arXiv:2410.22685. [Google Scholar] [CrossRef]
  23. Li, Z.; Shen, S.; Yang, W.; Jin, R.; Chen, H.; Ren, J. Enhancing Uncertainty Quantification in Large Language Models through Semantic Graph Density. In Proceedings of the 41st Conference on Uncertainty in Artificial Intelligence; PMLR: New York, NY, USA, 2025. [Google Scholar]
  24. Kuhn, L.; Gal, Y.; Farquhar, S. Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  25. Zhang, C.; Liu, F.; Basaldella, M.; Collier, N. LUQ: Long-text Uncertainty Quantification for LLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 5244–5262. [Google Scholar]
  26. Jiang, M.; Ruan, Y.; Sattigeri, P.; Roukos, S.; Hashimoto, T. Graph-based uncertainty metrics for long-form language model generations. Adv. Neural Inf. Process. Syst. 2024, 37, 32980–33006. [Google Scholar]
  27. Fang, X.; Huang, Z.; Tian, Z.; Fang, M.; Pan, Z.; Fang, Q.; Wen, Z.; Pan, H.; Li, D. Zero-resource hallucination detection for text generation via graph-based contextual knowledge triples modeling. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2025; Volume 39, pp. 23868–23877. [Google Scholar]
  28. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 2025, 43, 42. [Google Scholar] [CrossRef]
  29. Berglund, L.; Tong, M.; Kaufmann, M.; Balesni, M.; Stickland, A.C.; Korbak, T.; Evans, O. The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  30. Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Adv. Neural Inf. Process. Syst. 2023, 36, 46595–46623. [Google Scholar]
  31. Sui, Y.; Ren, J.; Tan, H.; Chen, H.; Li, Z.; Wang, J. Enhancing LLM's Reliability by Iterative Verification Attributions with Keyword Fronting. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2024; pp. 251–268.
  32. Cohen, R.; Hamri, M.; Geva, M.; Globerson, A. LM vs LM: Detecting Factual Errors via Cross Examination. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023.
  33. Azaria, A.; Mitchell, T. The Internal State of an LLM Knows When It's Lying. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 967–976.
  34. Williams, A.; Nangia, N.; Bowman, S. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 1112–1122.
  35. He, P.; Liu, X.; Gao, J.; Chen, W. DeBERTa: Decoding-Enhanced BERT with Disentangled Attention. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021.
  36. Reddy, S.; Chen, D.; Manning, C.D. CoQA: A Conversational Question Answering Challenge. Trans. Assoc. Comput. Linguist. 2019, 7, 249–266.
  37. Joshi, M.; Choi, E.; Weld, D.S.; Zettlemoyer, L. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 1601–1611.
  38. Tsatsaronis, G.; Balikas, G.; Malakasiotis, P.; Partalas, I.; Zschunke, M.; Alvers, M.R.; Weissenborn, D.; Krithara, A.; Petridis, S.; Polychronopoulos, D.; et al. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinform. 2015, 16, 138.
  39. Kwiatkowski, T.; Palomaki, J.; Redfield, O.; Collins, M.; Parikh, A.; Alberti, C.; Epstein, D.; Polosukhin, I.; Devlin, J.; Lee, K.; et al. Natural Questions: A Benchmark for Question Answering Research. Trans. Assoc. Comput. Linguist. 2019, 7, 453–466.
Figure 1. Overview of the proposed RSU framework for semantic uncertainty estimation in large language models. Given an input prompt, the language model first samples multiple candidate responses. Pairwise semantic similarities between responses are then computed by combining embedding-based similarity and natural language inference (NLI) signals, and used to construct a weighted response-level semantic similarity graph. Based on this graph, the proposed method performs spectral embedding and infers overlapping soft semantic communities, rather than relying on hard semantic clustering. These soft communities are subsequently used to construct a community-induced positive semi-definite semantic kernel matrix K. Finally, the eigenvalue spectrum of K is analyzed through Rényi entropy to produce a query-level semantic uncertainty score H α ( K ) . The two key innovations of the framework are the soft community inference module and the Rényi spectral uncertainty estimation module, which together decouple semantic structure discovery from uncertainty quantification.
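The pipeline summarized in Figure 1 can be sketched end to end in a few lines. The sketch below is an illustrative reading, not the paper's reference implementation: soft assignments are obtained here by a softmax over distances to randomly initialized centroids in a Laplacian spectral embedding (a stand-in for the paper's weighted community detection), the kernel is K = CCᵀ normalized to unit trace, and the uncertainty score is the Rényi entropy of its eigenvalue spectrum.

```python
import numpy as np

def renyi_spectral_uncertainty(S, n_communities=3, alpha=2.0):
    """Illustrative sketch of the RSU pipeline in Figure 1.

    S: (N, N) symmetric pairwise semantic-similarity matrix in [0, 1].
    Returns H_alpha of the community-induced kernel spectrum.
    """
    N = S.shape[0]
    # Spectral embedding of the weighted similarity graph.
    d = S.sum(axis=1)
    L = np.diag(d) - S                       # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    X = eigvecs[:, :n_communities]           # low-frequency embedding

    # Soft community assignments: softmax over squared distances to
    # randomly chosen centroids (one illustrative choice, not the paper's).
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(N, size=n_communities, replace=False)]
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    C = np.exp(-d2) / np.exp(-d2).sum(axis=1, keepdims=True)  # rows sum to 1

    # Community-induced PSD kernel, normalized to unit trace.
    K = C @ C.T
    K /= np.trace(K)

    lam = np.clip(np.linalg.eigvalsh(K), 0.0, None)
    lam = lam[lam > 1e-12]
    if abs(alpha - 1.0) < 1e-8:              # Shannon / von Neumann limit
        return float(-(lam * np.log(lam)).sum())
    return float(np.log((lam ** alpha).sum()) / (1.0 - alpha))
```

Because the kernel has unit trace, its spectrum behaves like a probability distribution over semantic modes, so the usual Rényi monotonicity in α applies directly.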
Table 1. Performance (AUROC, %) comparison of uncertainty estimation methods. Results are reported as mean ± standard deviation. For each model–dataset combination, the best performance is highlighted in bold, and the second-best performance is underlined.
Method groups — entropy-based: SE, DSE, KLE; graph-based: Ecc, EigV, Deg; consistency-based: D-UE, NSS, SEU; ours: RSU.

Llama-3.2-1B

| Dataset | SE | DSE | KLE | Ecc | EigV | Deg | D-UE | NSS | SEU | RSU |
|---|---|---|---|---|---|---|---|---|---|---|
| NQ | 77.48 ± 0.55 | 76.50 ± 0.53 | 75.36 ± 0.50 | 76.66 ± 0.62 | 75.87 ± 0.57 | 77.18 ± 0.51 | 71.93 ± 0.63 | 76.43 ± 0.56 | 67.38 ± 0.72 | 77.90 ± 0.54 |
| CoQA | 73.73 ± 0.21 | 73.22 ± 0.22 | 74.59 ± 0.28 | 73.84 ± 0.29 | 70.82 ± 0.25 | 75.73 ± 0.27 | 73.95 ± 0.28 | 72.51 ± 0.22 | 69.80 ± 0.27 | 75.75 ± 0.28 |
| BioASQ | 86.87 ± 0.45 | 86.76 ± 0.47 | 86.73 ± 0.40 | 86.79 ± 0.45 | 85.53 ± 0.42 | 87.25 ± 0.39 | 85.62 ± 0.44 | 86.36 ± 0.44 | 78.78 ± 0.51 | 87.55 ± 0.39 |
| TriviaQA | 82.17 ± 0.18 | 81.15 ± 0.16 | 80.41 ± 0.18 | 81.13 ± 0.18 | 78.84 ± 0.16 | 81.64 ± 0.16 | 79.25 ± 0.13 | 80.51 ± 0.15 | 77.04 ± 0.14 | 82.23 ± 0.16 |
| Average | 80.06 | 79.41 | 79.27 | 79.61 | 77.77 | 80.45 | 77.69 | 78.95 | 73.25 | 80.86 |

Llama-3.1-8B

| Dataset | SE | DSE | KLE | Ecc | EigV | Deg | D-UE | NSS | SEU | RSU |
|---|---|---|---|---|---|---|---|---|---|---|
| NQ | 78.30 ± 0.43 | 77.88 ± 0.47 | 77.55 ± 0.44 | 77.73 ± 0.46 | 76.26 ± 0.41 | 78.64 ± 0.44 | 75.00 ± 0.42 | 77.48 ± 0.47 | 71.03 ± 0.44 | 78.86 ± 0.43 |
| CoQA | 75.26 ± 0.36 | 74.89 ± 0.35 | 78.92 ± 0.27 | 76.97 ± 0.40 | 71.75 ± 0.33 | 80.04 ± 0.24 | 77.90 ± 0.30 | 74.14 ± 0.35 | 72.71 ± 0.33 | 80.32 ± 0.27 |
| BioASQ | 83.40 ± 0.47 | 83.35 ± 0.47 | 84.28 ± 0.45 | 83.03 ± 0.58 | 81.32 ± 0.42 | 84.73 ± 0.46 | 82.59 ± 0.57 | 82.45 ± 0.51 | 74.81 ± 0.74 | 84.92 ± 0.48 |
| TriviaQA | 85.95 ± 0.11 | 85.23 ± 0.13 | 85.67 ± 0.12 | 84.97 ± 0.24 | 83.27 ± 0.12 | 86.23 ± 0.12 | 84.51 ± 0.38 | 84.42 ± 0.13 | 81.95 ± 0.13 | 87.11 ± 0.13 |
| Average | 80.73 | 80.34 | 81.61 | 80.68 | 78.15 | 82.41 | 80.00 | 79.62 | 75.13 | 82.80 |

Mistral-7B-v0.3

| Dataset | SE | DSE | KLE | Ecc | EigV | Deg | D-UE | NSS | SEU | RSU |
|---|---|---|---|---|---|---|---|---|---|---|
| NQ | 76.88 ± 0.60 | 76.88 ± 0.60 | 77.58 ± 0.55 | 77.24 ± 0.46 | 76.62 ± 0.37 | 77.42 ± 0.56 | 76.15 ± 0.45 | 76.67 ± 0.57 | 71.85 ± 0.43 | 77.79 ± 0.48 |
| CoQA | 75.82 ± 0.33 | 75.76 ± 0.29 | 77.60 ± 0.21 | 78.11 ± 0.35 | 72.18 ± 0.27 | 79.61 ± 0.28 | 78.44 ± 0.26 | 75.32 ± 0.28 | 73.47 ± 0.25 | 80.21 ± 0.24 |
| BioASQ | 80.86 ± 0.53 | 80.90 ± 0.50 | 83.66 ± 0.41 | 83.05 ± 0.50 | 82.66 ± 0.50 | 83.57 ± 0.53 | 80.54 ± 0.55 | 80.98 ± 0.49 | 67.84 ± 0.41 | 84.38 ± 0.52 |
| TriviaQA | 83.76 ± 0.29 | 83.53 ± 0.28 | 83.86 ± 0.28 | 83.74 ± 0.11 | 82.80 ± 0.12 | 85.04 ± 0.28 | 83.58 ± 0.28 | 82.98 ± 0.28 | 79.59 ± 0.12 | 85.22 ± 0.26 |
| Average | 79.33 | 79.27 | 80.68 | 80.54 | 78.57 | 81.41 | 79.68 | 78.99 | 73.19 | 81.90 |

Mistral-Nemo-12B

| Dataset | SE | DSE | KLE | Ecc | EigV | Deg | D-UE | NSS | SEU | RSU |
|---|---|---|---|---|---|---|---|---|---|---|
| NQ | 76.78 ± 0.59 | 76.35 ± 0.57 | 77.78 ± 0.58 | 76.55 ± 0.47 | 76.28 ± 0.58 | 76.92 ± 0.52 | 73.04 ± 0.44 | 75.84 ± 0.56 | 69.53 ± 0.39 | 78.38 ± 0.51 |
| CoQA | 76.08 ± 0.19 | 75.72 ± 0.24 | 78.09 ± 0.19 | 77.25 ± 0.19 | 71.11 ± 0.24 | 79.10 ± 0.14 | 77.01 ± 0.16 | 75.05 ± 0.23 | 72.41 ± 0.20 | 78.88 ± 0.23 |
| BioASQ | 81.66 ± 0.48 | 81.58 ± 0.56 | 84.54 ± 0.44 | 82.20 ± 0.39 | 81.90 ± 0.61 | 83.60 ± 0.38 | 79.55 ± 0.44 | 80.91 ± 0.57 | 69.64 ± 0.49 | 84.24 ± 0.41 |
| TriviaQA | 85.44 ± 0.10 | 84.88 ± 0.19 | 86.10 ± 0.10 | 84.61 ± 0.14 | 83.31 ± 0.11 | 86.29 ± 0.11 | 84.29 ± 0.11 | 84.07 ± 0.09 | 81.47 ± 0.11 | 86.93 ± 0.11 |
| Average | 79.99 | 79.63 | 81.63 | 80.15 | 78.15 | 81.48 | 78.47 | 78.97 | 73.26 | 82.11 |
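All AUROC values reported in Table 1 (and in the ablations below) measure how well an uncertainty score separates incorrect from correct answers. For reference, AUROC admits a closed pairwise form as the Mann–Whitney statistic; the function below is a generic sketch of that computation, not the authors' evaluation code.

```python
import numpy as np

def auroc(uncertainty, is_incorrect):
    """AUROC (in %) of flagging incorrect answers by uncertainty score.

    Equals the probability that a randomly chosen incorrect answer gets a
    higher uncertainty score than a correct one, counting ties as 1/2.
    """
    u = np.asarray(uncertainty, dtype=float)
    y = np.asarray(is_incorrect, dtype=bool)
    pos, neg = u[y], u[~y]
    # Pairwise-comparison (Mann-Whitney) form of the statistic.
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return 100.0 * wins / (len(pos) * len(neg))
```

A score of 50% corresponds to a random ranking; the ~78–88% values above indicate a strong separation.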
Table 2. Performance (AUARC, %) comparison of various uncertainty metrics. All results are presented as percentages. For each model–dataset combination, the best performance is highlighted in bold, and the second-best performance is underlined.
Method groups — entropy-based: SE, DSE, KLE; graph-based: Ecc, EigV, Deg; consistency-based: D-UE, NSS, SEU; ours: RSU.

Llama-3.2-1B

| Dataset | SE | DSE | KLE | Ecc | EigV | Deg | D-UE | NSS | SEU | RSU |
|---|---|---|---|---|---|---|---|---|---|---|
| NQ | 27.54 ± 0.63 | 27.43 ± 0.63 | 27.08 ± 0.58 | 27.75 ± 0.49 | 26.50 ± 0.53 | 28.10 ± 0.57 | 25.91 ± 0.52 | 27.20 ± 0.54 | 23.70 ± 0.50 | 28.66 ± 0.56 |
| CoQA | 86.97 ± 0.28 | 86.32 ± 0.16 | 87.65 ± 0.11 | 87.25 ± 0.14 | 85.28 ± 0.12 | 87.98 ± 0.28 | 87.45 ± 0.10 | 86.08 ± 0.26 | 85.80 ± 0.11 | 87.89 ± 0.16 |
| BioASQ | 71.25 ± 0.86 | 71.01 ± 0.85 | 70.64 ± 0.93 | 70.63 ± 0.90 | 69.37 ± 0.88 | 70.88 ± 0.83 | 70.11 ± 0.86 | 70.62 ± 0.91 | 65.72 ± 0.81 | 71.33 ± 0.82 |
| TriviaQA | 52.04 ± 0.24 | 51.48 ± 0.23 | 51.39 ± 0.37 | 51.55 ± 0.24 | 49.35 ± 0.21 | 52.22 ± 0.24 | 50.47 ± 0.21 | 50.89 ± 0.23 | 48.30 ± 0.22 | 52.46 ± 0.26 |
| Average | 59.45 | 59.06 | 59.19 | 59.30 | 57.63 | 59.80 | 58.49 | 58.70 | 55.88 | 60.09 |

Llama-3.1-8B

| Dataset | SE | DSE | KLE | Ecc | EigV | Deg | D-UE | NSS | SEU | RSU |
|---|---|---|---|---|---|---|---|---|---|---|
| NQ | 51.74 ± 1.02 | 51.10 ± 1.11 | 51.39 ± 1.06 | 51.11 ± 0.90 | 49.51 ± 1.00 | 52.10 ± 0.92 | 49.66 ± 0.97 | 50.71 ± 1.20 | 46.83 ± 0.81 | 52.27 ± 0.88 |
| CoQA | 94.74 ± 0.29 | 94.79 ± 0.34 | 96.02 ± 0.19 | 95.80 ± 0.17 | 94.13 ± 0.24 | 96.30 ± 0.15 | 95.92 ± 0.16 | 94.69 ± 0.18 | 95.15 ± 0.15 | 96.39 ± 0.15 |
| BioASQ | 82.48 ± 0.77 | 82.30 ± 0.80 | 83.57 ± 0.67 | 82.97 ± 0.63 | 81.05 ± 0.79 | 83.94 ± 0.53 | 82.82 ± 0.54 | 83.21 ± 0.37 | 81.84 ± 0.36 | 84.23 ± 0.49 |
| TriviaQA | 84.12 ± 0.31 | 83.60 ± 0.33 | 84.18 ± 0.32 | 83.85 ± 0.32 | 82.50 ± 0.33 | 84.49 ± 0.29 | 83.57 ± 0.33 | 83.21 ± 0.37 | 81.84 ± 0.36 | 84.77 ± 0.32 |
| Average | 78.27 | 77.95 | 78.79 | 78.43 | 76.80 | 79.21 | 77.99 | 77.96 | 76.42 | 79.42 |

Mistral-7B-v0.3

| Dataset | SE | DSE | KLE | Ecc | EigV | Deg | D-UE | NSS | SEU | RSU |
|---|---|---|---|---|---|---|---|---|---|---|
| NQ | 51.96 ± 0.61 | 51.49 ± 0.71 | 52.53 ± 0.50 | 52.22 ± 0.60 | 51.16 ± 0.56 | 52.75 ± 0.48 | 52.01 ± 0.51 | 51.27 ± 0.70 | 49.52 ± 0.64 | 53.19 ± 0.58 |
| CoQA | 92.83 ± 0.26 | 93.11 ± 0.22 | 94.29 ± 0.20 | 94.17 ± 0.16 | 91.95 ± 0.32 | 94.47 ± 0.13 | 94.32 ± 0.16 | 93.02 ± 0.18 | 93.28 ± 0.17 | 94.63 ± 0.16 |
| BioASQ | 80.46 ± 0.77 | 80.07 ± 0.66 | 82.27 ± 0.70 | 81.07 ± 0.59 | 80.34 ± 0.86 | 81.65 ± 0.63 | 80.18 ± 0.61 | 80.05 ± 0.78 | 73.21 ± 0.62 | 82.18 ± 0.61 |
| TriviaQA | 82.58 ± 0.25 | 82.53 ± 0.31 | 83.02 ± 0.25 | 82.31 ± 0.26 | 81.81 ± 0.33 | 83.11 ± 0.30 | 82.11 ± 0.30 | 82.23 ± 0.37 | 79.48 ± 0.35 | 84.26 ± 0.33 |
| Average | 76.96 | 76.80 | 78.03 | 77.44 | 76.32 | 78.00 | 77.16 | 76.64 | 73.87 | 78.57 |

Mistral-Nemo-12B

| Dataset | SE | DSE | KLE | Ecc | EigV | Deg | D-UE | NSS | SEU | RSU |
|---|---|---|---|---|---|---|---|---|---|---|
| NQ | 51.32 ± 1.30 | 51.27 ± 1.17 | 52.12 ± 1.16 | 51.18 ± 1.10 | 50.47 ± 1.28 | 51.70 ± 1.12 | 49.83 ± 1.12 | 50.82 ± 1.39 | 47.12 ± 1.02 | 52.28 ± 1.14 |
| CoQA | 93.35 ± 0.24 | 93.15 ± 0.24 | 94.24 ± 0.18 | 94.15 ± 0.17 | 91.82 ± 0.17 | 94.54 ± 0.13 | 94.10 ± 0.14 | 93.02 ± 0.23 | 93.03 ± 0.18 | 94.98 ± 0.16 |
| BioASQ | 82.31 ± 0.56 | 82.00 ± 0.66 | 83.55 ± 0.53 | 82.33 ± 0.53 | 81.63 ± 0.70 | 83.50 ± 0.45 | 81.49 ± 0.56 | 81.73 ± 0.59 | 76.35 ± 0.54 | 84.28 ± 0.60 |
| TriviaQA | 85.35 ± 0.26 | 85.07 ± 0.38 | 85.85 ± 0.32 | 85.14 ± 0.26 | 84.14 ± 0.32 | 85.93 ± 0.27 | 85.14 ± 0.27 | 84.63 ± 0.27 | 83.31 ± 0.27 | 86.34 ± 0.30 |
| Average | 78.08 | 77.87 | 78.94 | 78.20 | 77.02 | 78.92 | 77.64 | 77.55 | 74.95 | 79.47 |
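AUARC in Table 2 scores selective answering: responses are ranked by increasing uncertainty, accuracy is computed over the retained most-certain fraction at each coverage level, and the resulting accuracy-coverage curve is averaged. One common discrete formulation is sketched below; the paper's exact variant (e.g. interpolation or rejection-rate weighting) may differ.

```python
import numpy as np

def auarc(uncertainty, is_correct):
    """Area under the accuracy-rejection curve (in %), one common variant.

    Ranks responses by increasing uncertainty, then averages the accuracy
    of the retained prefix over all coverage levels 1/N, 2/N, ..., 1.
    """
    order = np.argsort(uncertainty)                 # most certain first
    acc = np.asarray(is_correct, dtype=float)[order]
    cum_acc = np.cumsum(acc) / np.arange(1, len(acc) + 1)
    return 100.0 * cum_acc.mean()
```

A well-calibrated uncertainty score concentrates correct answers at low uncertainty, pushing the early (low-coverage) accuracies, and hence the area, upward.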
Table 3. Effect of the Rényi order α on AUROC (%) using Llama-3.2-1B. Results are reported as mean ± standard deviation over multiple runs. The best performance is highlighted in bold.
| α | 0.2 | 0.5 | 1.01 | 2 | 5 | 10 |
|---|---|---|---|---|---|---|
| NQ | 73.25 ± 0.68 | 74.68 ± 0.64 | 75.64 ± 0.52 | 77.90 ± 0.54 | 76.91 ± 0.54 | 76.32 ± 0.56 |
| CoQA | 73.11 ± 0.38 | 74.33 ± 0.32 | 74.81 ± 0.28 | 75.75 ± 0.28 | 75.36 ± 0.29 | 74.78 ± 0.30 |
| BioASQ | 86.20 ± 0.48 | 86.77 ± 0.49 | 87.23 ± 0.40 | 87.55 ± 0.39 | 87.36 ± 0.40 | 87.34 ± 0.42 |
| TriviaQA | 79.92 ± 0.26 | 80.18 ± 0.26 | 80.76 ± 0.17 | 82.23 ± 0.16 | 81.45 ± 0.16 | 81.42 ± 0.18 |
| Average | 78.12 | 78.99 | 79.61 | 80.86 | 80.27 | 79.97 |
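The trend in Table 3 reflects a basic property of Rényi entropy: large α weights the dominant eigenvalues of the kernel spectrum, while small α amplifies the long tail. This can be checked directly on a toy spectrum (illustrative values, not taken from the paper):

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Rényi entropy H_alpha of a discrete distribution p
    (here standing in for a unit-trace kernel spectrum)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if abs(alpha - 1.0) < 1e-8:
        return float(-(p * np.log(p)).sum())        # Shannon limit
    return float(np.log((p ** alpha).sum()) / (1.0 - alpha))

# One dominant semantic mode plus a long tail of minor modes.
spectrum = [0.7, 0.1, 0.1, 0.05, 0.05]
for a in [0.2, 0.5, 1.01, 2, 5, 10]:                # the α grid of Table 3
    print(a, renyi_entropy(spectrum, a))
```

H_α is non-increasing in α, so the printed values shrink as α grows: the measure progressively ignores the tail and focuses on whether a single semantic mode dominates.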
Table 4. Effect of the fusion weight η on AUROC (%) using Llama-3.2-1B. Results are reported as mean ± standard deviation. The best performance is highlighted in bold.
| η | 0.01 | 0.25 | 0.5 | 0.75 | 0.99 |
|---|---|---|---|---|---|
| NQ | 76.98 ± 0.54 | 77.39 ± 0.55 | 77.90 ± 0.54 | 76.23 ± 0.62 | 75.31 ± 0.58 |
| CoQA | 75.21 ± 0.30 | 75.43 ± 0.28 | 75.75 ± 0.28 | 75.02 ± 0.33 | 74.71 ± 0.36 |
| BioASQ | 87.41 ± 0.42 | 87.57 ± 0.41 | 87.55 ± 0.39 | 87.23 ± 0.39 | 86.45 ± 0.40 |
| TriviaQA | 81.88 ± 0.16 | 82.16 ± 0.17 | 82.23 ± 0.16 | 81.38 ± 0.19 | 81.24 ± 0.20 |
| Average | 80.37 | 80.64 | 80.86 | 79.97 | 79.43 |
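Table 4 sweeps the weight η that fuses the two similarity signals described in Figure 1. The exact fusion rule is not shown in this excerpt; a natural reading is a convex combination of the embedding-based and NLI-based similarity matrices, sketched below with hypothetical argument names:

```python
import numpy as np

def fused_similarity(sim_emb, sim_nli, eta=0.5):
    """Convex fusion of embedding-based and NLI-based similarity matrices.

    A plausible reading of the fusion weight η swept in Table 4; the
    paper's exact fusion rule may differ.
    """
    S = eta * np.asarray(sim_emb, dtype=float) \
        + (1.0 - eta) * np.asarray(sim_nli, dtype=float)
    return 0.5 * (S + S.T)   # symmetrize so S defines a valid weighted graph
```

Under this reading, the table suggests that neither signal alone suffices: performance peaks near a balanced mixture and degrades as η approaches either extreme.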
Table 5. Comparison between soft and hard semantic community assignments in RSU using Llama-3.2-1B. Results are reported as AUROC (%) with mean ± standard deviation. The best performance is highlighted in bold.
| Dataset | Soft Community | Hard Community |
|---|---|---|
| NQ | 77.90 ± 0.54 | 75.69 ± 0.55 |
| CoQA | 75.75 ± 0.28 | 74.31 ± 0.30 |
| BioASQ | 87.55 ± 0.39 | 86.36 ± 0.40 |
| TriviaQA | 82.23 ± 0.16 | 81.23 ± 0.17 |
| Average | 80.86 | 79.40 |
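The hard-community ablation in Table 5 can be viewed as collapsing each soft assignment row to a one-hot vector before building the kernel: with one-hot rows, K = CCᵀ becomes block-constant and its normalized spectrum reduces to the relative cluster sizes, recovering an ordinary cluster-count entropy and discarding the graded membership information. A minimal sketch of the collapse (assuming rows of C sum to one):

```python
import numpy as np

def harden(C):
    """Collapse soft community assignments (rows summing to 1) into
    one-hot hard assignments, the ablated variant of Table 5."""
    C = np.asarray(C, dtype=float)
    H = np.zeros_like(C)
    H[np.arange(C.shape[0]), C.argmax(axis=1)] = 1.0
    return H
```

Responses near a community boundary, whose soft rows spread mass across several communities, are exactly the ones this collapse misrepresents, which is consistent with the consistent drop on the right-hand column of Table 5.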
Table 6. Sensitivity of RSU to the number of sampled responses N. Results are reported as AUROC (%) with mean ± standard deviation using Llama-3.2-1B. The best performance is highlighted in bold.
N35102050
NQ69.38 ± 0.5674.55 ± 0.5677.90 ± 0.5477.91 ± 0.5477.92 ± 0.56
CoQA67.51 ± 0.3373.42 ± 0.3075.75 ± 0.2875.82 ± 0.2975.81 ± 0.30
BioASQ78.29 ± 0.4183.61 ± 0.4087.55 ± 0.3987.55 ± 0.4187.52 ± 0.43
TriviaQA76.45 ± 0.2278.22 ± 0.2082.23 ± 0.1682.20 ± 0.1682.28 ± 0.17
Average72.9177.4580.8680.8780.88

Share and Cite

Li, Z.; Du, J. Soft-Community Kernel Rényi Spectrum for Semantic Uncertainty Estimation in Large Language Models. Entropy 2026, 28, 442. https://doi.org/10.3390/e28040442
