1. Introduction
With the rapid development of social media and online platforms, the speed and scale of information dissemination have reached an unprecedented level. At the same time, misinformation and rumors can rapidly propagate through social networks, exerting sustained and far-reaching impacts on public cognition, social stability, and collective decision-making. Accurately identifying the origin of rumors in such complex diffusion environments has therefore emerged as a critical research problem in the fields of online public opinion analysis and information governance.
Rumor source detection not only facilitates a deeper understanding of the underlying diffusion mechanisms of misinformation, but also provides essential evidence for subsequent intervention, debunking, and accountability tracing. Compared with merely judging the veracity of information, locating the initial source of a rumor enables more effective containment of its spread and offers more actionable support for platform governance and emergency response. In parallel with the development of graph-based approaches, recent advances in large language models (LLMs) have significantly reshaped misinformation analysis. A growing body of work in 2024–2025 explores LLM-powered fake news detection, multimodal reasoning, and graph-enhanced semantic modeling, demonstrating strong capabilities in contextual understanding and cross-document inference. However, most of these studies focus on veracity classification at the document level or adopt full-sample reasoning strategies, without explicitly addressing node-level source localization within propagation graphs or the allocation of reasoning resources under uncertainty. Not all rumor propagation graphs require equally complex reasoning, yet most existing methods treat them uniformly. Consequently, developing rumor source detection methods that are accurate, stable, and practically deployable is of substantial theoretical significance and practical value.
Early studies on rumor source detection were primarily grounded in classical probabilistic and diffusion-based models. These approaches typically assume that rumor propagation follows specific infection models, such as the SI, SEIR, or Independent Cascade (IC) models, and infer the most likely source node via techniques including maximum likelihood estimation, belief propagation, or centrality analysis. While such methods offer strong theoretical interpretability and provide clear mathematical formulations for modeling diffusion processes, their effectiveness is highly dependent on the validity of model assumptions and network structural properties. In real-world social networks, however, information diffusion often exhibits substantial heterogeneity, noise, and incomplete observations, which violate these strict assumptions and significantly limit the applicability of such methods on real social media data.
With the development of deep learning techniques, researchers have gradually introduced deep learning and graph neural network methods to reduce the reliance on explicit diffusion models. These approaches learn representations over propagation graphs through graph convolution or attention mechanisms, reformulating rumor source detection as a node classification or scoring problem, and thereby achieving strong predictive performance under complex network structures. Compared with traditional methods, graph neural networks can automatically capture multi-hop neighborhood information and integrate node features with topological structures, significantly enhancing model expressiveness. However, most existing methods adopt an end-to-end prediction paradigm, applying the same inference complexity to all samples, and lack explicit modeling and differentiation of prediction uncertainty. When facing large-scale propagation graphs or samples with substantially varying levels of difficulty, they struggle to achieve an effective balance between detection performance and computational cost, which limits overall inference efficiency. In addition, the outputs of these methods are typically presented as scores or probabilities, without explicit semantic interpretation, which remains insufficient for practical application scenarios.
To address the three core issues identified above—over-reliance on explicit diffusion assumptions, uniform inference complexity across samples, and limited semantic interpretability—we design TSR-RSD following a stage-wise functional decomposition.
For structural modeling, we adopt a node-level Graph Neural Network rather than diffusion-based estimators or heuristic centrality measures. Classical diffusion models rely on strict infection assumptions and complete observations, which are rarely satisfied in real-world social media graphs. Heuristic approaches, in contrast, fail to capture multi-hop relational dependencies. GNNs provide a data-driven mechanism to aggregate structural information without requiring predefined diffusion equations, making them more suitable for heterogeneous propagation environments.
For selective invocation, we introduce entropy-based uncertainty filtering to regulate inference complexity. Instead of relying solely on top-1 confidence, entropy measures the dispersion over Top-K candidate probabilities, which better reflects ambiguity among structurally similar nodes. This lightweight uncertainty indicator can be directly computed from model outputs without additional training overhead.
For semantic refinement, we incorporate large language models to resolve cases where structural signals alone are insufficient. Deepening structural models cannot fully distinguish nodes that share similar topological positions but differ in semantic inheritance or discourse evolution. By restricting LLM reasoning to high-uncertainty samples, the framework preserves efficiency while leveraging semantic reasoning only when necessary.
Through this principled stage-wise design, each component in TSR-RSD serves a distinct role: GNNs ensure structural expressiveness, entropy filtering controls computational cost, and LLM reasoning enhances semantic discrimination. This coordinated design enables the model to balance accuracy, efficiency, and interpretability in rumor source detection.
The main contributions of this paper are summarized as follows:
1. We propose a node-level graph neural network modeling approach for rumor source detection, which formulates rumor tracing as a node ranking and localization problem within propagation graphs, thereby improving adaptability to real-world social network diffusion structures.
2. We introduce an entropy-based uncertainty filtering mechanism that enables selective reasoning on difficult samples, significantly reducing unnecessary computational overhead while preserving detection performance.
3. We design a multi-stage reasoning framework based on large language models to conduct semantic- and propagation-structure-level interpretative analysis of candidate source nodes, enhancing the credibility, interpretability, and practical usability of rumor source detection results.
3. TSR-RSD: A Rumor Source Detection Model
This paper proposes a tri-stage selective reasoning model for rumor source detection, termed Tri-stage Selective Reasoning for Rumor Source Detection (TSR-RSD). In the first stage, a graph neural network is employed to perform node-level modeling over the propagation graph of a single event, learning representations for each node and producing source likelihood scores, which are normalized within the graph to obtain a ranked list of candidate source nodes. In the second stage, an entropy-based uncertainty filtering mechanism is introduced to evaluate the GNN predictions, allowing only samples with high predictive uncertainty to proceed to the subsequent reasoning stage. In the third stage, the filtered candidate results are fed into a multi-agent large language model workflow constructed with Dify, where hierarchical prompting is applied to jointly reason over propagation structures and node-level semantics. This process ultimately generates structured and interpretable rumor source predictions, achieving an effective balance between inference efficiency and performance. The overall architecture of the proposed model is illustrated in Figure 1.
3.1. GNN-Based Rumor Source Detection Model
Graph neural networks (GNNs) are typically used for graph-level classification, where the core procedure consists of first computing node embeddings, then applying global pooling to generate a graph representation, and finally producing classification outputs. Leveraging the strong adaptability of GNNs to graph-structured data, this work modifies the standard GNN architecture to better suit the rumor source detection task.
Specifically, rumor source detection on propagation graphs is formulated as a node-level localization problem. Given the reposting or interaction graph of a single event as input, the GNN learns representations for each node, and an MLP is applied to output a score for every node. A softmax operation is then performed over node scores within each graph to obtain source probabilities, and supervised training is conducted using the ground-truth source node index source_idx. During inference, the Top-K nodes are selected as candidate sources and can be explicitly exported. Compared with conventional graph-level GNN classification models, the main modifications include removing global pooling and the graph-level classification head, replacing them with node-wise scoring and an intra-graph cross-entropy loss, switching evaluation metrics from accuracy/F1 to Hit@K and MRR, and incorporating early stopping and candidate extraction procedures. An illustrative overview of the model architecture is shown in Figure 2.
Task Definition. In this work, rumor source detection is formulated as a source node localization task on propagation graphs. For each event, a propagation graph is constructed, where the node set $V$ represents entities involved in the diffusion process (e.g., users or posts), and the edge set $E$ denotes reposting, replying, or interaction relations. Each node $v$ is associated with a feature vector $x_v$, and the graph structure is specified by edge_index. The objective is to predict, given the entire propagation graph, the most likely source node among all nodes and output its position (index) in the node sequence of the graph. This formulation differs from graph-level classification tasks such as veracity prediction, as it emphasizes intra-graph comparison and ranking of nodes to identify the diffusion origin.
Model Output. In standard GNN-based graph classification, node representations are obtained via message passing, aggregated into a graph-level representation through global pooling, and fed into a classification head to produce graph-level class probabilities, supervised by a graph label y. To adapt this paradigm for rumor source detection, we instead adopt a node-level scoring framework. Specifically, the GNN preserves the latent representation of each node without applying global pooling. A linear layer or MLP is then applied to output a scalar score node_scores for each node, indicating its relative confidence of being the source.
Supervision Signal. In conventional GNN models, supervision is provided in the form of graph-level labels. In this work, the graph label is instead transformed into a source node index source_idx (one integer per graph). Accordingly, the loss function is reformulated from graph-level negative log-likelihood to an intra-graph “softmax + cross-entropy” objective. Specifically, a softmax operation is applied over node scores within each graph to obtain a source probability distribution, and the negative log-probability at the ground-truth source index is summed or averaged. This design explicitly encourages the model to distinguish the source node from other nodes within the same propagation graph.
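The intra-graph "softmax + cross-entropy" objective described above can be sketched in plain Python (in the actual implementation this would be computed batch-wise with per-graph softmax over PyTorch tensors; the function name here is illustrative):

```python
import math

def intra_graph_ce(node_scores, source_idx):
    """Cross-entropy over a softmax taken across all nodes of ONE graph.

    node_scores: raw scalar scores, one per node in the graph.
    source_idx:  index of the ground-truth source node.
    Returns the negative log-probability assigned to the source node.
    """
    m = max(node_scores)                       # max-shift for numerical stability
    exps = [math.exp(s - m) for s in node_scores]
    z = sum(exps)
    p_source = exps[source_idx] / z            # intra-graph softmax at the source
    return -math.log(p_source)

# A score vector that is confidently correct yields a smaller loss than a flat one.
loss_sharp = intra_graph_ce([5.0, 0.0, 0.0, 0.0], source_idx=0)
loss_flat  = intra_graph_ce([1.0, 1.0, 1.0, 1.0], source_idx=0)  # = ln 4
```

Minimizing this quantity pushes the source node's score above those of all other nodes in the same propagation graph, which is exactly the intra-graph discrimination the supervision signal is designed to encourage.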
Evaluation Metrics and Training Procedure. Since the model output shifts from category prediction to node ranking, ranking- and retrieval-based metrics are adopted to evaluate source detection quality. Concretely, nodes within each graph are sorted in descending order according to their scores: the ground-truth source ranked at the first position is counted as Hit@1, while inclusion within the top three and top five positions corresponds to Hit@3 and Hit@5, respectively. In addition, Mean Reciprocal Rank (MRR) is employed to measure the average rank of the true source, jointly capturing both hit performance and ranking position. During training, validation-set MRR is used as the model selection criterion, and early stopping with a patience parameter is introduced to improve training stability and mitigate overfitting.
3.2. Entropy-Based Uncertainty Filtering
Entropy-based uncertainty is commonly used to quantify the confidence of a model’s predictive distribution: the more uniform the predicted probability distribution, the higher the entropy value, indicating greater model uncertainty. In the literature, entropy uncertainty is widely applied to sample selection, active learning, and selective reasoning, where prioritizing high-entropy samples helps improve overall model performance and efficiency under limited computational resources.
Entropy Input. In the proposed selective invocation framework, the input to entropy-based uncertainty estimation is not derived from all nodes in the original propagation graph, but rather from a candidate set produced by the GNN-based source detection stage. Specifically, the source detection model outputs scores for nodes within each propagation graph and generates a Top-K candidate list accordingly. From the candidates field, the score of each candidate is extracted to form a score sequence scores = [s1,…, sK], which serves as the direct input for uncertainty estimation. In this work, the entropy measures whether the GNN model is uncertain among the K most probable source candidates, rather than computing uncertainty over all nodes. This design reduces computational and storage overhead while preserving decision relevance, and is consistent with the subsequent interaction paradigm in which the LLM only needs to reason over a small set of candidates.
Uncertainty Computation. Uncertainty is quantified using Shannon entropy and is computed in two steps. First, the candidate scores are normalized into a probability distribution via the softmax function, and entropy is then calculated over this distribution. Let $s_i$ denote the raw score of the $i$-th candidate among the Top-$K$ candidates, and

$$p_i = \frac{\exp(s_i)}{\sum_{j=1}^{K} \exp(s_j)}$$

denote the corresponding normalized probability. The computation is defined as

$$H = -\sum_{i=1}^{K} p_i \log p_i.$$

To avoid numerical overflow, we apply a standard max-shift by subtracting $\max_j s_j$ from each $s_i$, and truncate extremely small probabilities to prevent $\log 0$. Intuitively, as illustrated in Figure 3, when one candidate has a substantially higher probability than the others, the distribution becomes more peaked and the entropy is low; when multiple candidates have similar probabilities and the distribution is flatter, the entropy is higher, indicating greater uncertainty of the GNN.
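The two-step computation (softmax with max-shift, then entropy with probability truncation) can be sketched as follows; the function name and the epsilon value are illustrative choices, not taken from the implementation:

```python
import math

def topk_entropy(scores):
    """Shannon entropy of the softmax distribution over Top-K candidate scores."""
    m = max(scores)                               # max-shift avoids exp overflow
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    eps = 1e-12                                   # truncation prevents log(0)
    return -sum(p * math.log(max(p, eps)) for p in probs)

# Peaked distribution -> low entropy; near-uniform -> entropy close to ln K.
h_peaked  = topk_entropy([9.0, 1.0, 1.0, 1.0, 1.0])
h_uniform = topk_entropy([1.0, 1.0, 1.0, 1.0, 1.0])   # = ln 5
```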
Decision Rule. We adopt a threshold-based gating strategy to convert entropy into a binary decision indicating whether LLM-based reasoning should be invoked. For each propagation graph, the entropy $H$ computed over the Top-$K$ candidates is compared against a predefined threshold $\tau$: when $H > \tau$, the prediction is regarded as uncertain and an LLM is triggered; otherwise, the GNN prediction is considered sufficiently confident and the LLM is skipped. Formally, the decision rule is defined as

$$\text{invoke} = \mathbb{1}\left[H > \tau\right],$$

where $\mathbb{1}[\cdot]$ denotes the indicator function.
In our implementation, the default threshold $\tau$ is fixed and can be adjusted via command-line arguments. Since the entropy range depends on $K$, in theory $0 \le H \le \log K$; for example, when $K = 5$, the upper bound is approximately $\log 5 \approx 1.609$. In addition, invalid cases such as fewer than two candidates or all-zero scores are directly treated as certain to avoid erroneous triggering. Overall, a lower $\tau$ results in broader LLM coverage over difficult samples at higher computational cost, whereas a higher $\tau$ leads to fewer invocations and greater savings, but may miss borderline cases.
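Putting the entropy computation, the degenerate-case guard, and the threshold comparison together, the gating decision can be sketched as a single self-contained function (the name and the example threshold are illustrative):

```python
import math

def should_invoke_llm(scores, tau):
    """Entropy-gated decision: True -> forward the sample to the LLM stage.

    Degenerate candidate lists (fewer than two entries, or all-zero scores)
    are treated as certain, mirroring the guard described above.
    """
    if len(scores) < 2 or all(s == 0 for s in scores):
        return False
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    h = -sum((e / z) * math.log(max(e / z, 1e-12)) for e in exps)
    return h > tau

# With K = 5 the entropy is bounded by ln 5 ~= 1.609, so tau lives in (0, ln K).
assert should_invoke_llm([1.0, 1.0, 1.0, 1.0, 1.0], tau=1.0)       # flat -> invoke
assert not should_invoke_llm([9.0, 0.1, 0.1, 0.1, 0.1], tau=1.0)   # peaked -> skip
assert not should_invoke_llm([3.0], tau=1.0)                       # degenerate -> skip
```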
3.3. LLM-Based Natural Language Reasoning
To enhance the semantic understanding and interpretability of rumor source determination, we introduce a multi-stage reasoning module based on large language models for candidate samples. This module adopts the principle of Hierarchical Prompting, decomposing the complex source detection task into multiple low-cognitive-load subtasks, which are sequentially handled by several agents with clearly defined roles. By doing so, the design avoids reasoning bias caused by information overload within a single prompt. This strategy is consistent with Cognitive Load Theory, which posits that controlling the complexity of inputs at each stage allows the model to focus on the current sub-goal within a limited context window, thereby improving overall reasoning stability and consistency.
In implementation, we construct a multi-agent workflow based on Dify, with the overall process illustrated in Figure 4. The workflow consists of four sequential stages: data parsing, propagation structure analysis, source node determination, and structured output generation. Each stage is handled by a dedicated agent with a specific function, and strict sequential control is enforced to ensure the stability of the reasoning process.
First, a Data Parsing Agent extracts the node_id, timestamps, and core factual descriptions from the candidate nodes produced by the GNN, and generates a compact structured input. Next, a Propagation Structure Analysis Agent assesses whether each node is a plausible source based on information inheritance, content expansion, and temporal cues, and provides evidence-based descriptions. On this basis, a Source Node Decision Agent ranks the nodes according to predefined rules and assigns normalized confidence scores. Finally, a Structured Output Agent consolidates the reasoning results into a standardized JSON format for subsequent evaluation and comparison. The entire workflow is governed by strict input–output constraints, ensuring the parsability and reproducibility of the reasoning outcomes.
The above reasoning procedure is realized through a stage-wise inference pipeline. Given a set of candidate nodes output by the GNN model, the system first extracts and compresses node information to reduce input complexity; it then analyzes content inheritance and diffusion relationships at the node level to construct structured propagation cues; and finally completes candidate ranking and result generation under explicit rule constraints. Through this multi-stage, multi-agent reasoning design, semantic reasoning is restricted to a small set of high-uncertainty samples filtered by the structural model, enabling the large language model to perform effective judgments within a controlled input space. As illustrated in Figure 4, this module primarily serves candidate re-ranking and explanation generation within the overall TSR-RSD framework, complementing the preceding structural modeling stage.
Implementation Details and Reproducibility. The hierarchical reasoning module is implemented using Qwen3-32B-instruct-Q4 as the underlying large language model within the Dify workflow engine.
For each graph instance, we select the Top-5 nodes from the GNN ranking as the candidate pool. The four agents described above are invoked in a stage-wise and sequential manner. Specifically, the semantic re-scoring, comparative decision, and consistency validation stages each invoke the LLM once, resulting in exactly three LLM calls per triggered sample.
Decoding parameters follow the deployment's default configuration, including the effective temperature. We do not manually tune sampling hyperparameters such as temperature or top-p. Although a non-zero temperature introduces stochasticity, structured JSON constraints and deterministic post-validation significantly reduce output variance in practice.
Failure Handling and Retry Policy. The LLM output is required to strictly follow a predefined JSON schema. We treat the following conditions as failures: (i) invalid or unparsable JSON; (ii) missing required keys; (iii) a predicted node that does not belong to the Top-5 candidate pool; (iv) inconsistency between the predicted top node and the ranked list.
When validation fails, the system retries generation up to a fixed maximum number of attempts. If repeated failures occur, the final output falls back to the GNN Top-1 prediction to ensure robustness and reproducibility.
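The validation checks (i)–(iv) and the retry-with-fallback policy can be sketched as follows; `generate` stands in for one LLM call through the workflow, and the function names and the attempt limit of 3 are illustrative assumptions rather than the actual implementation's values:

```python
import json

REQUIRED_KEYS = {"graph_id", "llm_top1_node", "llm_ranked_nodes",
                 "llm_scores", "reasoning"}

def validate_llm_output(raw, candidate_pool):
    """Return the parsed dict if all schema checks pass, else None."""
    try:
        out = json.loads(raw)                        # (i) must be valid JSON
    except (json.JSONDecodeError, TypeError):
        return None
    if not REQUIRED_KEYS.issubset(out):              # (ii) required keys present
        return None
    if out["llm_top1_node"] not in candidate_pool:   # (iii) inside Top-5 pool
        return None
    ranked = out["llm_ranked_nodes"]
    if not ranked or ranked[0] != out["llm_top1_node"]:  # (iv) consistent top node
        return None
    return out

def run_with_fallback(generate, candidate_pool, gnn_top1, max_attempts=3):
    """Retry generation up to max_attempts; fall back to the GNN Top-1 on failure."""
    for _ in range(max_attempts):
        out = validate_llm_output(generate(), candidate_pool)
        if out is not None:
            return out["llm_top1_node"]
    return gnn_top1
```

The fallback guarantees that every triggered sample still yields a prediction, so the selective LLM stage can only refine, never lose, the structural model's output.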
Standardized Output Schema. The final reasoning output is constrained to the following JSON structure:
{
"graph_id": string,
"llm_top1_node": int,
"llm_ranked_nodes": list[int],
"llm_scores": {node_id: float},
"reasoning": string
}
All prompts are constructed using a fixed template-based hierarchical prompting strategy and are shared across all datasets without dataset-specific manual tuning. The strict input–output constraints enforced by Dify ensure that the reasoning process remains parsable, controlled, and reproducible.
4. Experiment
All experiments run on Ubuntu 24.04.1 LTS with two RTX A6000 GPUs (96 GB total VRAM) and an Intel i7-12700F CPU. Implementations use Python 3.9.20 and the Dify framework.
For each dataset, we evaluate three GNN backbones, namely GCN, GraphSAGE, and GAT, combined with four types of node features: BERT, spaCy, profile-based, and content-based representations. The best-performing backbone–feature configuration is selected based on validation MRR within each run and then evaluated on the corresponding test set. Specifically, for GossipCop we adopt a 2-layer GAT with BERT features and hidden size 128; for PolitiFact, a 2-layer GraphSAGE with BERT features and hidden size 128; and for PHEME, a 2-layer GCN with content-based features and hidden size 128. Across all datasets, the common hyperparameters are set as follows: learning rate 0.001, dropout ratio 0.3, a fixed weight decay, batch size 64, a maximum of 50 training epochs, and early stopping with patience 10 based on validation MRR. We split each dataset at the propagation-graph (event) level into 70%/10%/20% for training, validation, and test, respectively, and all compared methods share identical splits within each run. To reduce variance introduced by random partitioning, we repeat each experiment over five independent runs with different random seeds and report mean test-set results.
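An event-level 70%/10%/20% split with a per-run seed can be sketched as below; the function name and ratio encoding are illustrative, not the paper's actual data-loading code:

```python
import random

def event_level_split(graph_ids, seed, ratios=(0.7, 0.1, 0.2)):
    """Shuffle propagation-graph (event) ids with a fixed seed, then split 70/10/20.

    Splitting at the event level keeps every propagation graph intact within
    exactly one partition, so no graph leaks nodes across train/val/test.
    """
    ids = list(graph_ids)
    random.Random(seed).shuffle(ids)          # seed fixes the partition per run
    n = len(ids)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = event_level_split(range(100), seed=0)
```

Sharing the same seed across all compared methods within a run is what makes the splits identical, while varying the seed across the five runs produces the independent partitions that are averaged.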
4.1. Datasets
GossipCop [39]: The GossipCop dataset is derived from the FakeNewsNet repository and is designed for rumor detection and diffusion analysis in the domains of entertainment and social news. It contains a total of 3,825 news instances. Each news item is annotated with a veracity label based on fact-checking results from the GossipCop website, and its propagation process on Twitter is further collected, including the original post, retweets, replies, and corresponding temporal information. Compared with political news, stories in GossipCop typically spread more rapidly, involve a larger number of participating users, and exhibit highly heterogeneous and noisy propagation structures, often characterized by multi-branch diffusion and information overlap. Such complex propagation graphs pose greater challenges for rumor source localization, making GossipCop an important benchmark for evaluating the robustness and generalization ability of source detection models in large-scale, unstructured social diffusion scenarios.
PolitiFact [39]: The PolitiFact dataset is also drawn from the FakeNewsNet repository and primarily focuses on political news and public affairs. Its veracity labels are provided by the professional fact-checking organization PolitiFact, and the dataset contains a total of 219 news instances. Each sample is associated with its propagation traces on social media platforms, forming diffusion graphs centered on reposting and discussion relationships. Compared with GossipCop, PolitiFact typically exhibits smaller-scale diffusion but more targeted user interactions, clearer propagation paths, and stronger stance polarization reflected in structured discussion chains. These characteristics make PolitiFact particularly suitable for analyzing rumor diffusion processes with relatively explicit structures and strong semantic correlations. Using PolitiFact in conjunction with GossipCop enables a more comprehensive evaluation of rumor source detection models across different domains and diffusion patterns, thereby assessing their stability and applicability under diverse social contexts.
In FakeNewsNet-based datasets (GossipCop and PolitiFact), the source node corresponds to the original tweet that initiates the event-level propagation cascade. While this node is structurally the root of the diffusion tree, its identification in our framework is not directly derived from trivial structural heuristics such as in-degree = 0 or earliest timestamp alone, but learned through node-level ranking under varying feature configurations.
PHEME [40]: The PHEME dataset is a well-established benchmark for rumor analysis in breaking news scenarios, originally introduced to study the propagation, evolution, and verification of rumors on social media. Constructed from Twitter data, it covers a range of real-world public events, including natural disasters, terrorist attacks, political incidents, and other socially salient topics. Each event consists of a source tweet and its subsequent replies and retweets, forming propagation structures organized as conversation threads, with veracity labels annotated by human experts.
In this study, GossipCop and PolitiFact are processed using a unified FNNDataset data interface to ensure experimental consistency and reproducibility. The original FakeNewsNet-style data are reorganized into graph-structured formats compatible with PyTorch Geometric 2.7.0, where each sample corresponds to an independent propagation graph containing node features, edge structures, and graph-level labels. The PHEME dataset is uniformly reformatted into event-level propagation graphs to support node-level rumor source detection. Specifically, global diffusion edges are first read from the original adjacency files, and the concatenated large graph is then partitioned into multiple propagation subgraphs via node–graph mappings, with self-loop addition and edge deduplication applied for normalization. Node features support multiple representations (e.g., text-based or user-attribute-based features) and are uniformly converted into numerical matrices as model inputs. To facilitate rumor source detection, an additional source index source_idx is introduced into the dataset to supervise node ranking and localization within each propagation graph. Moreover, fixed training, validation, and test splits are adopted to enable fair comparison across models. In addition to the information required for model training, node-level tweet identifiers, timestamps, and graph-level raw texts are retained as metadata for candidate export and subsequent LLM-based explanatory reasoning, but are not involved in the main GNN training process.
4.2. Evaluation Metrics
The objective of rumor source detection is to accurately localize the true source node from all nodes within a given event-specific propagation graph. Consequently, this task is inherently a node-level ranking and retrieval problem, rather than a conventional graph-level classification or binary node classification task. In light of this formulation, we adopt Hit@K and Mean Reciprocal Rank (MRR) as the primary evaluation metrics to assess the model’s ability to rank source nodes within propagation graphs. In addition, to evaluate the system-level efficiency advantages of the proposed approach, auxiliary metrics such as inference time and the ratio of LLM invocations are also reported, enabling a comprehensive evaluation from both accuracy and efficiency perspectives.
Hit@K. Hit@K is a commonly used hit-rate metric in rumor source detection and information retrieval tasks, which measures whether the ground-truth source node appears among the top-K candidate nodes predicted by the model. Specifically, for each propagation graph sample, the model assigns a source confidence score to every node and ranks all nodes in descending order of their scores to produce a candidate list. A hit is recorded if the true source node is ranked within the top-K positions of this list.
Assume that the test set contains $N$ propagation graph samples, where the ground-truth source node of the $i$-th sample is denoted as $v_i^{*}$, and the Top-$K$ candidate set predicted by the model is denoted as $\mathcal{C}_i^{K}$. Hit@$K$ is defined as

$$\text{Hit@}K = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[v_i^{*} \in \mathcal{C}_i^{K}\right],$$

where $\mathbb{1}[\cdot]$ is the indicator function, which equals 1 if the condition holds and 0 otherwise.
Hit@1, Hit@3, and Hit@5 respectively reflect the model’s source detection capability under strict and more relaxed hit conditions. The Hit@K metric is intuitive and easy to interpret, as it directly captures the model’s ability to provide correct candidate sources for manual verification or subsequent reasoning modules in practical applications.
MRR. While Hit@K measures whether the true source node is covered by the predicted candidates, it does not distinguish the exact ranking position of the source within the Top-K list. To further characterize the quality of node ranking, we introduce Mean Reciprocal Rank (MRR) as a complementary metric.
For the $i$-th propagation graph sample, let $r_i$ denote the position of the ground-truth source node in the predicted ranking list. Its reciprocal rank is defined as $1/r_i$: when the true source is ranked first, this value equals 1, and as the rank worsens, the reciprocal rank correspondingly diminishes. MRR is defined as the average reciprocal rank over all samples:

$$\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{r_i}.$$
MRR jointly accounts for both hit coverage and ranking position, making it more sensitive to the model’s ability to distinguish among multiple highly similar candidate nodes within a propagation graph. In the context of rumor source detection, MRR is particularly suitable for evaluating whether a model can consistently rank the true source near the top of the candidate list, rather than merely achieving marginal coverage.
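Both metrics reduce to a few lines of Python over per-graph rankings; this sketch assumes the true source always appears somewhere in each ranking list (the real evaluation ranks all nodes of a graph, so this always holds):

```python
def hit_at_k(ranked_lists, sources, k):
    """Fraction of graphs whose true source appears in the top-k of its ranking."""
    hits = sum(1 for ranked, s in zip(ranked_lists, sources) if s in ranked[:k])
    return hits / len(sources)

def mrr(ranked_lists, sources):
    """Mean reciprocal rank of the true source (rank positions are 1-based)."""
    total = 0.0
    for ranked, s in zip(ranked_lists, sources):
        total += 1.0 / (ranked.index(s) + 1)   # index() is 0-based, rank is 1-based
    return total / len(sources)

# Two graphs: the true source is ranked 1st in the first and 3rd in the second,
# so Hit@1 = 0.5, Hit@3 = 1.0, and MRR = (1 + 1/3) / 2.
ranked = [[0, 4, 2], [7, 5, 3]]
truth  = [0, 3]
```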
4.3. Overall Performance Evaluation
In this section, we conduct the main experimental evaluation of the proposed TSR-RSD framework on three real-world datasets: GossipCop, PolitiFact, and PHEME. A GNN-based rumor source detection model that relies solely on structural reasoning is adopted as the baseline. Since rumor source detection is inherently a node ranking task within a propagation graph, we employ Hit@1, Hit@3, Hit@5, and Mean Reciprocal Rank (MRR) to evaluate the model’s ability to rank the true source node among top candidates.
To avoid overwhelming the main text with extensive combinations of structural features and GNN variants, we report, for each dataset, only the most stable and best-performing GNN configuration as the representative baseline. TSR-RSD is then evaluated under the same structural configuration, with selective semantic reasoning incorporated on top. This controlled comparison is designed to highlight the performance gains brought by the proposed reasoning paradigm upgrade, rather than by architectural or feature-level variations.
Table 1 summarizes the core results across the three datasets. On GossipCop, the baseline model already achieves near-saturated performance, with Hit@K values approaching 1.0. This observation suggests that large-scale entertainment news propagation graphs often exhibit relatively stable source signals that can be effectively captured by structural models alone.
Under such high-confidence scenarios, TSR-RSD does not perturb the original ranking through unnecessary semantic reasoning. Instead, it preserves performance almost identical to the upper bound established by the baseline, while still achieving a marginal improvement on Hit@1. This behavior is particularly important, as it indicates that the proposed selective reasoning mechanism does not sacrifice the strengths of structural models on easy samples merely for the sake of introducing additional reasoning steps. Rather, TSR-RSD remains conservative on high-confidence cases and reserves semantic reasoning capacity for samples that genuinely require correction, resulting in a more disciplined and robust overall inference strategy.
The near-saturated performance observed on GossipCop primarily reflects the structural characteristics of the FakeNewsNet propagation graphs rather than any particular modeling artifact. As evidenced by our ablation over feature types and GNN backbones, performance varies substantially across configurations. In the main text, we report only the best-performing configuration per dataset for clarity, while the complete GNN results under all feature–backbone combinations are provided in Appendix A.
For example, when using content-only features, Hit@1 decreases sharply to 0.0105 (GraphSAGE) and 0.4232 (GAT), suggesting that accurate localization depends on meaningful structural–textual alignment rather than trivial cues. In contrast, when combining high-quality semantic embeddings (BERT) with graph modeling, the strong regularities in FakeNewsNet cascades enable near-deterministic source identification.
On PolitiFact, TSR-RSD yields a more pronounced improvement in ranking quality. Compared with GossipCop, PolitiFact exhibits smaller propagation graphs but stronger semantic polarization, where cascades with highly similar structures may nonetheless differ substantially in stance and factual entailment. In such cases, structural models often achieve reasonable Top-K coverage, yet remain constrained in the fine-grained ordering of the Top-K candidates. The results show that TSR-RSD delivers a clear gain on Hit@3 and a consistent improvement on Hit@1, suggesting that semantic reasoning does not merely enlarge the candidate pool; rather, it tends to promote the true source from being within the top few to being ranked even higher, thereby improving strict-hit performance and ranking stability.
However, MRR shows a moderate decrease compared to the structural baseline. This suggests that while semantic reasoning promotes certain hard samples into the top-K region, it may slightly perturb the fine-grained ordering among already well-ranked candidates. Given the relatively small size of PolitiFact (219 events), minor rank shifts between positions 1 and 2 can have a disproportionate effect on MRR.
The PHEME results are the most representative for TSR-RSD, as the dataset comprises breaking-news-driven conversational threads with more complex structures and intertwined temporal signals, where topology-only aggregation often fails to reliably distinguish the earliest initiator from the earliest diffuser. The baseline achieves a Hit@1 of only ∼0.29 on PHEME, indicating that the true source is frequently ranked far from the top. With TSR-RSD, Hit@1 increases to ∼0.42 and MRR improves from 0.46 to 0.49, demonstrating the corrective value of semantic reasoning in challenging scenarios: it leverages semantic cues such as content inheritance, expressive consistency, and propagation logic to re-rank structurally similar candidates, pushing the true source more reliably toward the front of the list. Meanwhile, Hit@5 decreases on PHEME, which is consistent with a side effect of more aggressive re-ranking: when semantic reasoning strongly corrects a small fraction of samples, the true source may move from around rank 5 to near rank 1, but a few cases may also be over-corrected, thereby affecting the looser Hit@5 metric. Given the concurrent gains on Hit@1 and MRR, this phenomenon does not weaken our main conclusion: the key benefit of TSR-RSD lies in improving stricter and more goal-aligned ranking metrics, rather than pursuing superficial gains in Top-K coverage.
Overall, the main results indicate that the gains of TSR-RSD do not stem from reprocessing every sample with an LLM. Instead, TSR-RSD first relies on the structural model to produce strong candidates and an initial ranking, and then applies selective semantic reasoning to high-ambiguity samples. This design improves the rank position of the true source in complex propagation environments while preserving the baseline upper-bound performance when structural signals are clear. Such a behavior pattern, stable on easy cases and corrective on hard cases, captures the key value proposition of selective reasoning: it emphasizes not only effectiveness, but also controllability of inference decisions and principled allocation of expensive reasoning resources.
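The inference pattern discussed above can be sketched in a few lines. This is a minimal illustration, assuming the structural model exposes a per-node probability map and the semantic stage is a black-box reranker; the names `selective_rerank`, `llm_rerank`, and `tau` are hypothetical and do not denote the framework's actual API.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def selective_rerank(gnn_scores, llm_rerank, tau):
    """Tri-stage inference: structural ranking, entropy gate, optional LLM rerank.

    gnn_scores: {node_id: probability} over candidate source nodes.
    llm_rerank: callable that reorders a candidate list (semantic stage).
    tau: entropy threshold; semantic reasoning triggers only above it.
    Returns the final ranking and whether the LLM was invoked.
    """
    ranked = sorted(gnn_scores, key=gnn_scores.get, reverse=True)
    if entropy(list(gnn_scores.values())) <= tau:
        return ranked, False          # confident case: keep the GNN ranking
    return llm_rerank(ranked), True   # ambiguous case: invoke semantic reasoning

# A peaked distribution skips the LLM; a near-uniform one triggers it
peaked = {"a": 0.9, "b": 0.05, "c": 0.05}
flat = {"a": 0.34, "b": 0.33, "c": 0.33}
print(selective_rerank(peaked, lambda c: c, tau=0.5)[1])  # False
print(selective_rerank(flat, lambda c: c, tau=0.5)[1])    # True
```

The gate guarantees the conservative behavior observed on GossipCop: whenever the structural distribution is sharply peaked, the GNN ranking is returned untouched.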
4.4. Comparison of Generation Time and Inference Cost
Beyond source detection accuracy, practical rumor source detection systems must also account for inference efficiency and overall system cost. This consideration is particularly critical in large-scale social media monitoring or online early-warning scenarios, where inference latency directly affects deployability. Accordingly, this section conducts a system-level comparison of generation-stage inference time across different model configurations to assess the practical efficiency gains introduced by TSR-RSD.
Figure 5 presents a system-level comparison of the trade-off between inference time and source detection performance (MRR) for the evaluated methods.
We compare four different inference strategies, corresponding to purely structural modeling, purely semantic reasoning, an unscreened structure–semantic cascaded scheme, and the proposed tri-stage selective reasoning framework. All experiments are conducted under identical hardware environments and inference settings. The reported metric is the average inference time per propagation graph, measured from input to the generation of the complete source detection result, in seconds.
Table 2 reports the average generation-time results of different methods on the GossipCop, PolitiFact, and PHEME datasets.
As illustrated in Figure 5, the pure GNN-based method lies in the low-latency but performance-limited region, whereas the LLM-only and unscreened GNN+LLM approaches achieve relatively strong performance on some datasets at the cost of substantially increased inference time and higher overall system cost. In contrast, TSR-RSD resides in the Pareto-optimal region of the inference time–performance trade-off, significantly reducing inference latency while maintaining near-optimal source detection performance. This result highlights the effectiveness of the proposed selective reasoning mechanism from a system-level perspective.
The overall experimental results are summarized in Table 2, which reports the average generation time of different methods on the three datasets under identical hardware environments and inference settings. The results show that the pure GNN-based method consistently exhibits low generation latency, as it only involves structural modeling and a single forward pass. However, its performance gains are inherently limited in complex diffusion scenarios. In contrast, both the LLM-only and the unscreened GNN+LLM approaches incur substantially higher inference time across all datasets, particularly on GossipCop and PolitiFact, where larger diffusion scales cause the overall latency to be dominated by multi-round reasoning of large language models.
After introducing the selective reasoning mechanism, TSR-RSD demonstrates clear advantages in generation efficiency. The experimental results indicate that, on the GossipCop dataset, the average generation time of TSR-RSD is significantly reduced compared with both the GNN+LLM and LLM-only methods, and a consistent trend is observed on PolitiFact and PHEME. These findings suggest that the uncertainty filtering mechanism, which triggers semantic reasoning only for samples with uncertain predictions, effectively reduces the number of LLM invocations, thereby substantially shortening overall inference time without sacrificing source detection performance.
A further comparison across datasets reveals that the efficiency advantages of TSR-RSD are particularly pronounced in datasets with larger diffusion scales or relatively clear propagation structures. In such scenarios, the GNN model alone is able to produce high-confidence predictions for the majority of samples, allowing the entropy-based uncertainty filtering mechanism to effectively exclude a large portion of instances that do not require semantic reasoning. Conversely, in settings with more complex propagation structures and higher sample uncertainty, although the LLM invocation ratio increases, TSR-RSD still maintains substantially lower overall inference time than the unscreened cascaded scheme.
In summary, the experimental results in this section clearly demonstrate that the reduced generation-stage inference time of TSR-RSD primarily stems from its uncertainty-driven selective reasoning design. By concentrating high-cost semantic reasoning on a small number of critical samples, TSR-RSD significantly lowers overall generation overhead while preserving source detection performance, providing strong efficiency guarantees for practical deployment in large-scale or near-real-time rumor source detection tasks.
4.5. Analysis of LLM Invocation Ratio
As observed in the previous generation-time comparison experiments, TSR-RSD substantially reduces overall inference time while maintaining strong rumor source detection performance. To further validate the core mechanism by which TSR-RSD achieves efficient inference at the system level, this section focuses on analyzing the proportion of samples that actually invoke large language model (LLM) reasoning under different model configurations. By comparing the LLM invocation ratios of TSR-RSD with those of an unscreened cascaded scheme and a purely semantic reasoning approach on the test set, we examine the practical effectiveness of uncertainty-driven selective reasoning in reducing computational overhead from an execution-oriented perspective. The corresponding statistics are summarized in Table 3. Furthermore, to investigate the global regulatory effect of the entropy threshold on LLM utilization, we analyze how the LLM invocation ratio varies under different threshold settings, as illustrated in Figure 6.
The evaluation metric is defined as the ratio of propagation graph samples that trigger LLM-based reasoning to the total number of samples in the test set. For the LLM-only and GNN + LLM methods, since all samples indiscriminately enter the semantic reasoning stage, the theoretical LLM invocation ratio is 100%. In contrast, for TSR-RSD, this ratio is adaptively determined by the entropy-threshold-based gating mechanism.
Figure 6 illustrates the variation of the LLM invocation ratio of TSR-RSD under different entropy threshold settings across the three datasets.
The results show that for both the LLM-only and the GNN+LLM approaches, all samples are indiscriminately forwarded to the semantic reasoning stage, leading to an LLM invocation ratio of 100% on all three datasets. Although such full or unscreened reasoning paradigms can achieve competitive source detection performance in certain cases, they inevitably introduce substantial computational redundancy: for samples where structural signals are already sufficient to support accurate decisions, the additional cost of semantic reasoning does not yield commensurate benefits. As shown in Figure 6, TSR-RSD instead introduces an entropy threshold as a gating mechanism for triggering semantic reasoning, yielding an LLM invocation ratio that is controllable and decreases monotonically as the threshold increases.
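The monotone relationship between the threshold and the invocation ratio follows directly from how the gate is defined, as the small sketch below illustrates; the per-sample entropy values are hypothetical, not measured quantities from our experiments.

```python
def invocation_ratio(entropies, tau):
    """Fraction of test graphs whose GNN prediction entropy exceeds the
    gate tau and therefore triggers LLM-based semantic reasoning."""
    return sum(h > tau for h in entropies) / len(entropies)

# Hypothetical per-sample entropies for one dataset (illustrative only)
entropies = [0.1, 0.3, 0.5, 0.9, 1.2, 1.4]
ratios = [invocation_ratio(entropies, t) for t in (0.0, 0.4, 0.8, 1.3)]
print(ratios)  # non-increasing as the threshold grows
```

Raising the threshold can only shrink the set of samples whose entropy exceeds it, which is why every curve in Figure 6 is non-increasing regardless of the dataset.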
In contrast, TSR-RSD significantly reduces the actual LLM invocation ratio across all three datasets, with clear dataset-specific patterns. On GossipCop, as illustrated in Figure 6, the LLM invocation ratio decreases smoothly and substantially as the threshold increases, with only a small fraction of highly uncertain samples triggering semantic reasoning at higher thresholds. This behavior indicates that in scenarios with large propagation scales and salient structural patterns, the GNN model is able to produce high-confidence predictions for the majority of samples, allowing the selective reasoning mechanism to effectively filter out unnecessary LLM calls. On PolitiFact, the LLM invocation ratio remains relatively high and the curve stays elevated across a wide range of thresholds, suggesting a stronger reliance on semantic reasoning for rumor source detection. This observation is consistent with the dataset’s characteristics: smaller propagation graphs but higher semantic discriminability, where semantic reasoning plays a more critical role in correcting the ranking produced by structural models. For PHEME, TSR-RSD similarly maintains a certain level of semantic reasoning coverage while keeping the LLM invocation ratio well below that of full reasoning.
The entropy threshold is selected to balance inference efficiency and ranking performance. Since the entropy of the Top-K candidate distribution is theoretically bounded by log K, the chosen threshold corresponds to a fixed fraction of this maximum uncertainty level. Empirically, this value effectively distinguishes highly ambiguous samples from structurally confident ones.
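Concretely, normalizing the entropy of a Top-K candidate distribution by its log K bound yields a value in [0, 1], which makes thresholds comparable across candidate-set sizes. The distributions below are illustrative only.

```python
import math

def normalized_entropy(probs):
    """Entropy of a Top-K candidate distribution divided by its maximum
    possible value log(K), giving a value in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

# A uniform distribution attains the bound; a peaked one sits far below it
print(round(normalized_entropy([0.2] * 5), 9))                   # 1.0
print(round(normalized_entropy([0.96, 0.01, 0.01, 0.01, 0.01]), 3))  # 0.139
```

Because the uniform distribution over K candidates has entropy exactly log K, any fixed threshold on normalized entropy can be read directly as a percentage of the maximum uncertainty level.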
As shown in Figure 6, the selected threshold achieves a favorable trade-off across datasets: (i) on GossipCop, it substantially reduces the LLM invocation ratio and inference time while preserving near-saturated ranking performance; (ii) on PolitiFact and PHEME, where structural ambiguity is higher, it does not prematurely suppress semantic reasoning, thereby maintaining necessary coverage for difficult samples.
The experimental results in this section clearly demonstrate that the entropy-based uncertainty filtering mechanism can substantially reduce the participation ratio of large language models without significantly compromising source detection performance, thereby achieving an effective balance between accuracy and efficiency. This property endows TSR-RSD with stronger deployability in practical scenarios where computational resources are limited or rapid response is required, and also provides a solid foundation for subsequent analyses on uncertain sample characteristics and ablation studies.
4.6. Analysis of Uncertainty-Guided Sample Selection
To further investigate the practical role of the entropy-based uncertainty filtering mechanism in the TSR-RSD framework, this section conducts an in-depth analysis at the sample level, focusing on propagation graphs that are selected to enter the semantic reasoning stage. Unlike the previous analysis, which examined LLM invocation ratios purely from a system-level perspective, we here aim to answer two key questions: (1) whether samples classified as uncertain indeed exhibit higher source detection difficulty, and (2) whether the uncertainty-based filtering mechanism can effectively distinguish propagation scenarios that are inherently challenging for structural models to resolve. The corresponding results are summarized in Table 4, and the structural and predictive differences between certain and uncertain samples are further illustrated in Figure 7.
The results demonstrate that across all three datasets, propagation graphs categorized as certain consistently achieve substantially higher Hit@1, Hit@3, and MRR scores at the initial GNN prediction stage. As shown in Figure 7, these samples typically exhibit clearer propagation structures and more salient source node characteristics, enabling the GNN model to produce high-confidence rankings. This observation indicates that for the majority of samples with well-defined structural patterns and prominent source signals, reliable source ranking can be achieved without semantic reasoning assistance.
In contrast, propagation graphs classified as uncertain show a pronounced degradation in GNN-based source detection performance. The true source nodes are more widely dispersed within the candidate lists, and ranking stability is significantly reduced. As illustrated on the right side of Figure 7, such samples often correspond to propagation structures with multi-branch diffusion patterns or overlapping temporal signals. In these cases, the predicted probability distribution over Top-K candidates becomes flatter, resulting in higher entropy values that subsequently trigger the semantic reasoning stage.
A cross-dataset comparison further reveals that the performance gap between certain and uncertain samples is particularly pronounced on the GossipCop and PolitiFact datasets. This suggests that samples selected for semantic reasoning in these datasets are often associated with scenarios such as multi-branch diffusion, closely scored candidate nodes, or ambiguous temporal cues, where structural information alone is insufficient to support confident decisions. For the PHEME dataset, although the overall source detection difficulty is higher, uncertain samples still exhibit larger performance variance under the structural model, reflecting the additional challenges posed by complex, event-driven propagation dynamics.
Importantly, the uncertainty-based filtering mechanism does not select samples for semantic reasoning in a random or indiscriminate manner. Instead, it statistically concentrates on propagation graphs where structural models are most prone to ambiguity. This property is directly evidenced by the consistently lower Hit@K and MRR scores of uncertain samples at the initial GNN prediction stage. When combined with the LLM invocation ratio analysis in the previous section, these results confirm that TSR-RSD allocates expensive LLM reasoning resources primarily to high-difficulty samples, thereby avoiding redundant computation on cases that are already easy to resolve.
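The certain/uncertain comparison described above can be reproduced in miniature as follows. The sample distributions and ranks are fabricated for illustration and do not correspond to the values in Table 4.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a candidate distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def split_by_uncertainty(samples, tau):
    """Partition (distribution, true-source-rank) pairs by GNN entropy."""
    certain = [r for p, r in samples if entropy(p) <= tau]
    uncertain = [r for p, r in samples if entropy(p) > tau]
    return certain, uncertain

def mrr(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

# Fabricated samples: peaked distributions rank the source on top,
# flat distributions bury it deeper in the candidate list
samples = [
    ([0.90, 0.05, 0.05], 1),
    ([0.80, 0.10, 0.10], 1),
    ([0.40, 0.35, 0.25], 3),
    ([0.34, 0.33, 0.33], 2),
]
certain, uncertain = split_by_uncertainty(samples, tau=0.7)
print(mrr(certain), mrr(uncertain))  # certain group scores markedly higher
```

When flat distributions systematically coincide with poorly ranked sources, the certain group's initial MRR dominates the uncertain group's, which is exactly the selection property the gating mechanism is designed to exploit.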
4.7. Ablation Study
To further validate the role and necessity of each constituent module within the TSR-RSD framework, this section conducts a systematic ablation study to analyze the impact of graph neural network modeling, the uncertainty filtering mechanism, and the large language model reasoning module on the final source tracing performance. Specifically, the corresponding modules are removed or replaced on each of the three datasets while keeping all other settings unchanged, enabling a clear examination of how these components affect source localization accuracy and ranking stability.
We further observe that when LLM-based reranking is applied to all samples (i.e., the GNN+LLM configuration without entropy gating), the final performance becomes nearly identical to the LLM-only setting. This is expected, since the ultimate ranking is fully determined by the LLM stage in both cases. However, such a design introduces redundant structural computation without additional performance gains. Therefore, the primary contribution of TSR-RSD does not lie in combining GNN and LLM indiscriminately, but in selectively invoking semantic reranking only for high-uncertainty samples. This selective mechanism ensures that structural priors are preserved for confident cases while semantic reasoning is reserved for ambiguous scenarios.
4.7.1. GNN Model
To assess the fundamental role of the graph neural network (GNN) module within the TSR-RSD framework, we conduct an ablation study focusing on the structural modeling component. This module is responsible for node-level representation learning over propagation graphs and for providing a ranked candidate source set for subsequent reasoning stages. Consequently, removing this component directly affects the model’s ability to exploit structural information embedded in propagation patterns. In the experimental setup, we remove the GNN-based source tracing module entirely, such that no structural modeling or candidate filtering is performed. Instead, event-related information is directly fed into the large language model for source identification. This setting is equivalent to the LLM-only scheme and is compared against the full TSR-RSD framework. The corresponding results are summarized in Table 5.
The experimental results show that removing the GNN module leads to varying degrees of performance change across different datasets. On the GossipCop dataset, although the LLM-only approach still achieves relatively strong overall performance, it exhibits noticeable degradation on stricter metrics such as Hit@1 and MRR compared to the full framework. This indicates that in scenarios with large-scale propagation and salient structural patterns, ignoring propagation structure weakens the model’s ability to precisely rank source nodes. On the PolitiFact and PHEME datasets, while LLM-only achieves advantages on certain metrics, its predictions rely heavily on semantic cues and lack structural constraints, resulting in noticeably reduced overall stability. In contrast, the full TSR-RSD framework leverages the GNN module to provide structural priors and candidate constraints, allowing semantic reasoning to operate within a more reasonable and controllable scope.
4.7.2. Entropy Uncertainty Filtering
Subsequently, to evaluate the practical effect of the entropy-based uncertainty filtering mechanism within the TSR-RSD framework, we conduct an ablation study on the selective reasoning module. The core function of this module is to assess the confidence of the predictive distribution over the GNN-generated candidates and to determine whether subsequent large language model reasoning should be triggered. Removing this component causes all samples to enter the semantic reasoning stage indiscriminately. In the experimental setup, we eliminate the entropy-threshold-based gating mechanism, causing the model to degenerate into an unfiltered GNN+LLM cascading scheme. We then compare its performance with that of the full TSR-RSD framework on the same datasets. The corresponding results are summarized in Table 6.
The experimental results indicate that, after removing the uncertainty filtering mechanism, the overall source tracing performance declines to varying degrees across all three datasets, with the degradation being particularly pronounced on stricter ranking-oriented metrics such as Hit@1 and MRR. This observation suggests that indiscriminately applying semantic reasoning to all samples does not yield performance gains commensurate with its computational cost; instead, it may introduce additional noise in certain scenarios, interfering with the otherwise stable predictions of the structural model. In contrast, the full TSR-RSD framework leverages the entropy-based uncertainty filtering mechanism to concentrate semantic reasoning on samples with more dispersed predictive distributions and lower candidate separability under the structural model, thereby achieving more consistent and robust source tracing performance across multiple datasets. These results demonstrate that the uncertainty filtering mechanism not only substantially reduces redundant reasoning overhead at the system level, but also constitutes a critical component for improving the ranking stability and overall performance of TSR-RSD.
4.7.3. LLM Reasoning
To assess the performance gains contributed by the large language model (LLM) reasoning module within the TSR-RSD framework, we conduct an ablation study on the semantic reasoning component. This module is primarily responsible for incorporating semantic and propagation-level logical analysis to refine the ranking of candidate sources when the structural model produces uncertain predictions. Consequently, removing this component causes the model to directly output the GNN predictions without invoking semantic reasoning. In the experimental setup, we remove the LLM reasoning module from the full framework while retaining the graph neural network and the uncertainty filtering mechanism, and compare its performance against the complete TSR-RSD framework on the same datasets. The corresponding results are summarized in Table 7.
The experimental results show that removing the LLM reasoning module leads to a clear degradation in source tracing performance across all three datasets, with the decline being particularly pronounced in datasets characterized by complex propagation structures or less salient source nodes. Notably, on the PHEME dataset, models that rely solely on structural information exhibit substantial instability in source ranking, with the true source nodes more likely to be placed toward the lower end of the candidate lists. In contrast, the full TSR-RSD framework effectively mitigates ambiguities arising from complex propagation scenarios by introducing semantic reasoning for highly uncertain samples, thereby promoting the true source nodes to higher ranks within the candidate lists. These results demonstrate that the LLM reasoning module is not merely an auxiliary performance enhancement, but a critical corrective component for challenging samples, and a necessary condition for improving both the accuracy and stability of overall source tracing performance.