1. Introduction
With the rapid development of social media and online platforms, the speed and scale of information dissemination have reached an unprecedented level. At the same time, misinformation and rumors can rapidly propagate through social networks, exerting sustained and far-reaching impacts on public cognition, social stability, and collective decision-making. Accurately identifying the origin of rumors in such complex diffusion environments has therefore emerged as a critical research problem in the fields of online public opinion analysis and information governance.
Rumor source detection not only facilitates a deeper understanding of the underlying diffusion mechanisms of misinformation, but also provides essential evidence for subsequent intervention, debunking, and accountability tracing. Compared with merely judging the veracity of information, locating the initial source of a rumor enables more effective containment of its spread and offers more actionable support for platform governance and emergency response. In parallel with the development of graph-based approaches, recent advances in large language models (LLMs) have significantly reshaped misinformation analysis. A growing body of work in 2024–2025 explores LLM-powered fake news detection, multimodal reasoning, and graph-enhanced semantic modeling, demonstrating strong capabilities in contextual understanding and cross-document inference. However, most of these studies focus on veracity classification at the document level or adopt full-sample reasoning strategies, without explicitly addressing node-level source localization within propagation graphs or the allocation of reasoning resources under uncertainty. Not all rumor propagation graphs require equally complex reasoning, yet most existing methods treat them uniformly. Consequently, developing rumor source detection methods that are accurate, stable, and practically deployable is of substantial theoretical significance and practical value.
Early studies on rumor source detection were primarily grounded in classical probabilistic and diffusion-based models. These approaches typically assume that rumor propagation follows specific infection models, such as the SI, SEIR, or Independent Cascade (IC) models, and infer the most likely source node via techniques including maximum likelihood estimation, belief propagation, or centrality analysis. While such methods offer strong theoretical interpretability and provide clear mathematical formulations for modeling diffusion processes, their effectiveness is highly dependent on the validity of model assumptions and network structural properties. In real-world social networks, however, information diffusion often exhibits substantial heterogeneity, noise, and incomplete observations, which violate these strict assumptions and significantly limit the applicability of such methods on real social media data.
With the development of deep learning techniques, researchers have gradually introduced deep learning and graph neural network methods to reduce the reliance on explicit diffusion models. These approaches learn representations over propagation graphs through graph convolution or attention mechanisms, reformulating rumor source detection as a node classification or scoring problem, and thereby achieving strong predictive performance under complex network structures. Compared with traditional methods, graph neural networks can automatically capture multi-hop neighborhood information and integrate node features with topological structures, significantly enhancing model expressiveness. However, most existing methods adopt an end-to-end prediction paradigm, applying the same inference complexity to all samples, and lack explicit modeling and differentiation of prediction uncertainty. When facing large-scale propagation graphs or samples with substantially varying levels of difficulty, they struggle to achieve an effective balance between detection performance and computational cost, which limits overall inference efficiency. In addition, the outputs of these methods are typically presented as scores or probabilities, without explicit semantic interpretation, which remains insufficient for practical application scenarios.
To address the three core issues identified above—over-reliance on explicit diffusion assumptions, uniform inference complexity across samples, and limited semantic interpretability—we design TSR-RSD following a stage-wise functional decomposition.
For structural modeling, we adopt a node-level Graph Neural Network rather than diffusion-based estimators or heuristic centrality measures. Classical diffusion models rely on strict infection assumptions and complete observations, which are rarely satisfied in real-world social media graphs. Heuristic approaches, in contrast, fail to capture multi-hop relational dependencies. GNNs provide a data-driven mechanism to aggregate structural information without requiring predefined diffusion equations, making them more suitable for heterogeneous propagation environments.
For selective invocation, we introduce entropy-based uncertainty filtering to regulate inference complexity. Instead of relying solely on top-1 confidence, entropy measures the dispersion over Top-K candidate probabilities, which better reflects ambiguity among structurally similar nodes. This lightweight uncertainty indicator can be directly computed from model outputs without additional training overhead.
For semantic refinement, we incorporate large language models to resolve cases where structural signals alone are insufficient. Deepening structural models cannot fully distinguish nodes that share similar topological positions but differ in semantic inheritance or discourse evolution. By restricting LLM reasoning to high-uncertainty samples, the framework preserves efficiency while leveraging semantic reasoning only when necessary.
Through this principled stage-wise design, each component in TSR-RSD serves a distinct role: GNNs ensure structural expressiveness, entropy filtering controls computational cost, and LLM reasoning enhances semantic discrimination. This coordinated design enables the model to balance accuracy, efficiency, and interpretability in rumor source detection.
The main contributions of this paper are summarized as follows:
1. We propose a node-level graph neural network modeling approach for rumor source detection, which formulates rumor tracing as a node ranking and localization problem within propagation graphs, thereby improving adaptability to real-world social network diffusion structures.
2. We introduce an entropy-based uncertainty filtering mechanism that enables selective reasoning on difficult samples, significantly reducing unnecessary computational overhead while preserving detection performance.
3. We design a multi-stage reasoning framework based on large language models to conduct semantic- and propagation-structure-level interpretative analysis of candidate source nodes, enhancing the credibility, interpretability, and practical usability of rumor source detection results.
3. TSR-RSD: A Rumor Source Detection Model
This paper proposes a tri-stage selective reasoning model for rumor source detection, termed Tri-stage Selective Reasoning for Rumor Source Detection (TSR-RSD). In the first stage, a graph neural network is employed to perform node-level modeling over the propagation graph of a single event, learning representations for each node and producing source likelihood scores, which are normalized within the graph to obtain a ranked list of candidate source nodes. In the second stage, an entropy-based uncertainty filtering mechanism is introduced to evaluate the GNN predictions, allowing only samples with high predictive uncertainty to proceed to the subsequent reasoning stage. In the third stage, the filtered candidate results are fed into a multi-agent large language model workflow constructed with Dify, where hierarchical prompting is applied to jointly reason over propagation structures and node-level semantics. This process ultimately generates structured and interpretable rumor source predictions, achieving an effective balance between inference efficiency and performance. The overall architecture of the proposed model is illustrated in Figure 1.
3.1. GNN-Based Rumor Source Detection Model
Graph neural networks (GNNs) are typically used for graph-level classification, where the core procedure consists of first computing node embeddings, then applying global pooling to generate a graph representation, and finally producing classification outputs. Leveraging the strong adaptability of GNNs to graph-structured data, this work modifies the standard GNN architecture to better suit the rumor source detection task.
Specifically, rumor source detection on propagation graphs is formulated as a node-level localization problem. Given the reposting or interaction graph of a single event as input, the GNN learns representations for each node, and an MLP is applied to output a score for every node. A softmax operation is then performed over node scores within each graph to obtain source probabilities, and supervised training is conducted using the ground-truth source node index source_idx. During inference, the Top-K nodes are selected as candidate sources and can be explicitly exported. Compared with conventional graph-level GNN classification models, the main modifications include removing global pooling and the graph-level classification head, replacing them with node-wise scoring and an intra-graph cross-entropy loss, switching evaluation metrics from accuracy/F1 to Hit@K and MRR, and incorporating early stopping and candidate extraction procedures. An illustrative overview of the model architecture is shown in Figure 2.
Task Definition. In this work, rumor source detection is formulated as a source node localization task on propagation graphs. For each event, a propagation graph is constructed, where the node set $V$ represents entities involved in the diffusion process (e.g., users or posts), and the edge set $E$ denotes reposting, replying, or interaction relations. Each node $v$ is associated with a feature vector $x_v$, and the graph structure is specified by edge_index. The objective is to predict, given the entire propagation graph, the most likely source node among all nodes and output its position (index) in the node sequence of the graph. This formulation differs from graph-level classification tasks such as veracity prediction, as it emphasizes intra-graph comparison and ranking of nodes to identify the diffusion origin.
Model Output. In standard GNN-based graph classification, node representations are obtained via message passing, aggregated into a graph-level representation through global pooling, and fed into a classification head to produce graph-level class probabilities, supervised by a graph label y. To adapt this paradigm for rumor source detection, we instead adopt a node-level scoring framework. Specifically, the GNN preserves the latent representation of each node without applying global pooling. A linear layer or MLP is then applied to output a scalar score node_scores for each node, indicating its relative confidence of being the source.
Supervision Signal. In conventional GNN models, supervision is provided in the form of graph-level labels. In this work, the graph label is instead transformed into a source node index source_idx (one integer per graph). Accordingly, the loss function is reformulated from graph-level negative log-likelihood to an intra-graph “softmax + cross-entropy” objective. Specifically, a softmax operation is applied over node scores within each graph to obtain a source probability distribution, and the negative log-probability at the ground-truth source index is summed or averaged. This design explicitly encourages the model to distinguish the source node from other nodes within the same propagation graph.
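The intra-graph "softmax + cross-entropy" objective described above can be sketched in plain Python (in the actual implementation this would be computed batch-wise with per-graph softmax over PyTorch tensors; the function name here is illustrative):

```python
import math

def intra_graph_ce(node_scores, source_idx):
    """Cross-entropy over a softmax taken across all nodes of ONE graph.

    node_scores: raw scalar scores, one per node in the graph.
    source_idx:  index of the ground-truth source node.
    Returns the negative log-probability assigned to the source node.
    """
    m = max(node_scores)                       # max-shift for numerical stability
    exps = [math.exp(s - m) for s in node_scores]
    z = sum(exps)
    p_source = exps[source_idx] / z            # intra-graph softmax at the source
    return -math.log(p_source)

# A score vector that is confidently correct yields a smaller loss than a flat one.
loss_sharp = intra_graph_ce([5.0, 0.0, 0.0, 0.0], source_idx=0)
loss_flat  = intra_graph_ce([1.0, 1.0, 1.0, 1.0], source_idx=0)  # = ln 4
```

Minimizing this quantity pushes the source node's score above those of all other nodes in the same propagation graph, which is exactly the intra-graph discrimination the supervision signal is designed to encourage.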
Evaluation Metrics and Training Procedure. Since the model output shifts from category prediction to node ranking, ranking- and retrieval-based metrics are adopted to evaluate source detection quality. Concretely, nodes within each graph are sorted in descending order according to their scores: the ground-truth source ranked at the first position is counted as Hit@1, while inclusion within the top three and top five positions corresponds to Hit@3 and Hit@5, respectively. In addition, Mean Reciprocal Rank (MRR) is employed to measure the average rank of the true source, jointly capturing both hit performance and ranking position. During training, validation-set MRR is used as the model selection criterion, and early stopping with a patience parameter is introduced to improve training stability and mitigate overfitting.
3.2. Entropy-Based Uncertainty Filtering
Entropy-based uncertainty is commonly used to quantify the confidence of a model’s predictive distribution: the more uniform the predicted probability distribution, the higher the entropy value, indicating greater model uncertainty. In the literature, entropy uncertainty is widely applied to sample selection, active learning, and selective reasoning, where prioritizing high-entropy samples helps improve overall model performance and efficiency under limited computational resources.
Entropy Input. In the proposed selective invocation framework, the input to entropy-based uncertainty estimation is not derived from all nodes in the original propagation graph, but rather from a candidate set produced by the GNN-based source detection stage. Specifically, the source detection model outputs scores for nodes within each propagation graph and generates a Top-K candidate list accordingly. From the candidates field, the score of each candidate is extracted to form a score sequence scores = [s1,…, sK], which serves as the direct input for uncertainty estimation. In this work, the entropy measures whether the GNN model is uncertain among the K most probable source candidates, rather than computing uncertainty over all nodes. This design reduces computational and storage overhead while preserving decision relevance, and is consistent with the subsequent interaction paradigm in which the LLM only needs to reason over a small set of candidates.
Uncertainty Computation. Uncertainty is quantified using Shannon entropy and is computed in two steps. First, the candidate scores are normalized into a probability distribution via the softmax function, and entropy is then calculated over this distribution. Let $s_i$ denote the raw score of the $i$-th candidate among the Top-$K$ candidates, and

$$p_i = \frac{\exp(s_i)}{\sum_{j=1}^{K} \exp(s_j)}$$

denote the corresponding normalized probability. The computation is defined as

$$H = -\sum_{i=1}^{K} p_i \log p_i.$$

To avoid numerical overflow, we apply a standard max-shift by subtracting $\max_j s_j$ from each $s_i$, and truncate extremely small probabilities to prevent $\log 0$. Intuitively, as illustrated in Figure 3, when one candidate has a substantially higher probability than the others, the distribution becomes more peaked and the entropy is low; when multiple candidates have similar probabilities and the distribution is flatter, the entropy is higher, indicating greater uncertainty of the GNN.
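The two-step computation (softmax with max-shift, then entropy with probability truncation) can be sketched as follows; the function name and the epsilon value are illustrative choices, not taken from the implementation:

```python
import math

def topk_entropy(scores):
    """Shannon entropy of the softmax distribution over Top-K candidate scores."""
    m = max(scores)                               # max-shift avoids exp overflow
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    eps = 1e-12                                   # truncation prevents log(0)
    return -sum(p * math.log(max(p, eps)) for p in probs)

# Peaked distribution -> low entropy; near-uniform -> entropy close to ln K.
h_peaked  = topk_entropy([9.0, 1.0, 1.0, 1.0, 1.0])
h_uniform = topk_entropy([1.0, 1.0, 1.0, 1.0, 1.0])   # = ln 5
```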
Decision Rule. We adopt a threshold-based gating strategy to convert entropy into a binary decision indicating whether LLM-based reasoning should be invoked. For each propagation graph, the entropy $H$ computed over the Top-$K$ candidates is compared against a predefined threshold $\tau$: when $H > \tau$, the prediction is regarded as uncertain and an LLM is triggered; otherwise, the GNN prediction is considered sufficiently confident and the LLM is skipped. Formally, the decision rule is defined as

$$\text{invoke} = \mathbb{1}\left[H > \tau\right],$$

where $\mathbb{1}[\cdot]$ denotes the indicator function.
In our implementation, the default threshold $\tau$ is fixed and can be adjusted via command-line arguments. Since the entropy range depends on $K$, in theory $0 \le H \le \log K$; for example, when $K = 5$, the upper bound is approximately $\log 5 \approx 1.609$. In addition, invalid cases such as fewer than two candidates or all-zero scores are directly treated as certain to avoid erroneous triggering. Overall, a lower $\tau$ results in broader LLM coverage over difficult samples at higher computational cost, whereas a higher $\tau$ leads to fewer invocations and greater savings, but may miss borderline cases.
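Putting the entropy computation, the degenerate-case guard, and the threshold comparison together, the gating decision can be sketched as a single self-contained function (the name and the example threshold are illustrative):

```python
import math

def should_invoke_llm(scores, tau):
    """Entropy-gated decision: True -> forward the sample to the LLM stage.

    Degenerate candidate lists (fewer than two entries, or all-zero scores)
    are treated as certain, mirroring the guard described above.
    """
    if len(scores) < 2 or all(s == 0 for s in scores):
        return False
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    h = -sum((e / z) * math.log(max(e / z, 1e-12)) for e in exps)
    return h > tau

# With K = 5 the entropy is bounded by ln 5 ~= 1.609, so tau lives in (0, ln K).
assert should_invoke_llm([1.0, 1.0, 1.0, 1.0, 1.0], tau=1.0)       # flat -> invoke
assert not should_invoke_llm([9.0, 0.1, 0.1, 0.1, 0.1], tau=1.0)   # peaked -> skip
assert not should_invoke_llm([3.0], tau=1.0)                       # degenerate -> skip
```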
3.3. LLM-Based Natural Language Reasoning
To enhance the semantic understanding and interpretability of rumor source determination, we introduce a multi-stage reasoning module based on large language models for candidate samples. This module adopts the principle of Hierarchical Prompting, decomposing the complex source detection task into multiple low-cognitive-load subtasks, which are sequentially handled by several agents with clearly defined roles. By doing so, the design avoids reasoning bias caused by information overload within a single prompt. This strategy is consistent with Cognitive Load Theory, which posits that controlling the complexity of inputs at each stage allows the model to focus on the current sub-goal within a limited context window, thereby improving overall reasoning stability and consistency.
In implementation, we construct a multi-agent workflow based on Dify, with the overall process illustrated in Figure 4. The workflow consists of four sequential stages: data parsing, propagation structure analysis, source node determination, and structured output generation. Each stage is handled by a dedicated agent with a specific function, and strict sequential control is enforced to ensure the stability of the reasoning process.
First, a Data Parsing Agent extracts the node_id, timestamps, and core factual descriptions from the candidate nodes produced by the GNN, and generates a compact structured input. Next, a Propagation Structure Analysis Agent assesses whether each node is a plausible source based on information inheritance, content expansion, and temporal cues, and provides evidence-based descriptions. On this basis, a Source Node Decision Agent ranks the nodes according to predefined rules and assigns normalized confidence scores. Finally, a Structured Output Agent consolidates the reasoning results into a standardized JSON format for subsequent evaluation and comparison. The entire workflow is governed by strict input–output constraints, ensuring the parsability and reproducibility of the reasoning outcomes.
The above reasoning procedure is realized through a stage-wise inference pipeline. Given a set of candidate nodes output by the GNN model, the system first extracts and compresses node information to reduce input complexity; it then analyzes content inheritance and diffusion relationships at the node level to construct structured propagation cues; and finally completes candidate ranking and result generation under explicit rule constraints. Through this multi-stage, multi-agent reasoning design, semantic reasoning is restricted to a small set of high-uncertainty samples filtered by the structural model, enabling the large language model to perform effective judgments within a controlled input space. As illustrated in Figure 4, this module primarily serves candidate re-ranking and explanation generation within the overall TSR-RSD framework, complementing the preceding structural modeling stage.
Implementation Details and Reproducibility. The hierarchical reasoning module is implemented using Qwen3-32B-instruct-Q4 as the underlying large language model within the Dify workflow engine.
For each graph instance, we select the Top-5 nodes from the GNN ranking as the candidate pool. The four agents described above are invoked in a stage-wise and sequential manner. Specifically, the semantic re-scoring, comparative decision, and consistency validation stages each invoke the LLM once, resulting in exactly three LLM calls per triggered sample.
Decoding parameters follow the deployment's default configuration, including the effective temperature. We do not manually tune sampling hyperparameters such as temperature or top-p. Although a non-zero temperature introduces stochasticity, structured JSON constraints and deterministic post-validation significantly reduce output variance in practice.
Failure Handling and Retry Policy. The LLM output is required to strictly follow a predefined JSON schema. We treat the following conditions as failures: (i) invalid or unparsable JSON; (ii) missing required keys; (iii) a predicted node that does not belong to the Top-5 candidate pool; (iv) inconsistency between the predicted top node and the ranked list.
When validation fails, the system retries generation up to a fixed maximum number of attempts. If repeated failures occur, the final output falls back to the GNN Top-1 prediction to ensure robustness and reproducibility.
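The validation checks (i)–(iv) and the retry-with-fallback policy can be sketched as follows; `generate` stands in for one LLM call through the workflow, and the function names and the attempt limit of 3 are illustrative assumptions rather than the actual implementation's values:

```python
import json

REQUIRED_KEYS = {"graph_id", "llm_top1_node", "llm_ranked_nodes",
                 "llm_scores", "reasoning"}

def validate_llm_output(raw, candidate_pool):
    """Return the parsed dict if all schema checks pass, else None."""
    try:
        out = json.loads(raw)                        # (i) must be valid JSON
    except (json.JSONDecodeError, TypeError):
        return None
    if not REQUIRED_KEYS.issubset(out):              # (ii) required keys present
        return None
    if out["llm_top1_node"] not in candidate_pool:   # (iii) inside Top-5 pool
        return None
    ranked = out["llm_ranked_nodes"]
    if not ranked or ranked[0] != out["llm_top1_node"]:  # (iv) consistent top node
        return None
    return out

def run_with_fallback(generate, candidate_pool, gnn_top1, max_attempts=3):
    """Retry generation up to max_attempts; fall back to the GNN Top-1 on failure."""
    for _ in range(max_attempts):
        out = validate_llm_output(generate(), candidate_pool)
        if out is not None:
            return out["llm_top1_node"]
    return gnn_top1
```

The fallback guarantees that every triggered sample still yields a prediction, so the selective LLM stage can only refine, never lose, the structural model's output.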
Standardized Output Schema. The final reasoning output is constrained to the following JSON structure:
{
"graph_id": string,
"llm_top1_node": int,
"llm_ranked_nodes": list[int],
"llm_scores": {node_id: float},
"reasoning": string
}
All prompts are constructed using a fixed template-based hierarchical prompting strategy and are shared across all datasets without dataset-specific manual tuning. The strict input–output constraints enforced by Dify ensure that the reasoning process remains parsable, controlled, and reproducible.
4. Experiment
All experiments run on Ubuntu 24.04.1 LTS with two RTX A6000 GPUs (96 GB total VRAM) and an Intel i7-12700F CPU. Implementations use Python 3.9.20 and the Dify framework.
For each dataset, we evaluate three GNN backbones, namely GCN, GraphSAGE, and GAT, combined with four types of node features: BERT, spaCy, profile-based, and content-based representations. The best-performing backbone–feature configuration is selected based on validation MRR within each run and then evaluated on the corresponding test set. Specifically, for GossipCop we adopt a 2-layer GAT with BERT features and hidden size 128; for PolitiFact, a 2-layer GraphSAGE with BERT features and hidden size 128; and for PHEME, a 2-layer GCN with content-based features and hidden size 128. Across all datasets, the common hyperparameters are set as follows: learning rate 0.001, dropout ratio 0.3, a fixed weight decay, batch size 64, a maximum of 50 training epochs, and early stopping with patience 10 based on validation MRR. We split each dataset at the propagation-graph (event) level into 70%/10%/20% for training, validation, and test, respectively, and all compared methods share identical splits within each run. To reduce variance introduced by random partitioning, we repeat each experiment over five independent runs with different random seeds and report mean test-set results.
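An event-level 70%/10%/20% split with a per-run seed can be sketched as below; the function name and ratio encoding are illustrative, not the paper's actual data-loading code:

```python
import random

def event_level_split(graph_ids, seed, ratios=(0.7, 0.1, 0.2)):
    """Shuffle propagation-graph (event) ids with a fixed seed, then split 70/10/20.

    Splitting at the event level keeps every propagation graph intact within
    exactly one partition, so no graph leaks nodes across train/val/test.
    """
    ids = list(graph_ids)
    random.Random(seed).shuffle(ids)          # seed fixes the partition per run
    n = len(ids)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = event_level_split(range(100), seed=0)
```

Sharing the same seed across all compared methods within a run is what makes the splits identical, while varying the seed across the five runs produces the independent partitions that are averaged.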
4.1. Datasets
GossipCop [39]: The GossipCop dataset is derived from the FakeNewsNet repository and is designed for rumor detection and diffusion analysis in the domains of entertainment and social news. It contains a total of 3,825 news instances. Each news item is annotated with a veracity label based on fact-checking results from the GossipCop website, and its propagation process on Twitter is further collected, including the original post, retweets, replies, and corresponding temporal information. Compared with political news, stories in GossipCop typically spread more rapidly, involve a larger number of participating users, and exhibit highly heterogeneous and noisy propagation structures, often characterized by multi-branch diffusion and information overlap. Such complex propagation graphs pose greater challenges for rumor source localization, making GossipCop an important benchmark for evaluating the robustness and generalization ability of source detection models in large-scale, unstructured social diffusion scenarios.
PolitiFact [39]: The PolitiFact dataset is also drawn from the FakeNewsNet repository and primarily focuses on political news and public affairs. Its veracity labels are provided by the professional fact-checking organization PolitiFact, and the dataset contains a total of 219 news instances. Each sample is associated with its propagation traces on social media platforms, forming diffusion graphs centered on reposting and discussion relationships. Compared with GossipCop, PolitiFact typically exhibits smaller-scale diffusion but more targeted user interactions, clearer propagation paths, and stronger stance polarization reflected in structured discussion chains. These characteristics make PolitiFact particularly suitable for analyzing rumor diffusion processes with relatively explicit structures and strong semantic correlations. Using PolitiFact in conjunction with GossipCop enables a more comprehensive evaluation of rumor source detection models across different domains and diffusion patterns, thereby assessing their stability and applicability under diverse social contexts.
In FakeNewsNet-based datasets (GossipCop and PolitiFact), the source node corresponds to the original tweet that initiates the event-level propagation cascade. While this node is structurally the root of the diffusion tree, its identification in our framework is not directly derived from trivial structural heuristics such as in-degree = 0 or earliest timestamp alone, but learned through node-level ranking under varying feature configurations.
PHEME [40]: The PHEME dataset is a well-established benchmark for rumor analysis in breaking news scenarios, originally introduced to study the propagation, evolution, and verification of rumors on social media. Constructed from Twitter data, it covers a range of real-world public events, including natural disasters, terrorist attacks, political incidents, and other socially salient topics. Each event consists of a source tweet and its subsequent replies and retweets, forming propagation structures organized as conversation threads, with veracity labels annotated by human experts.
In this study, GossipCop and PolitiFact are processed using a unified FNNDataset data interface to ensure experimental consistency and reproducibility. The original FakeNewsNet-style data are reorganized into graph-structured formats compatible with PyTorch Geometric 2.7.0, where each sample corresponds to an independent propagation graph containing node features, edge structures, and graph-level labels. The PHEME dataset is uniformly reformatted into event-level propagation graphs to support node-level rumor source detection. Specifically, global diffusion edges are first read from the original adjacency files, and the concatenated large graph is then partitioned into multiple propagation subgraphs via node–graph mappings, with self-loop addition and edge deduplication applied for normalization. Node features support multiple representations (e.g., text-based or user-attribute-based features) and are uniformly converted into numerical matrices as model inputs. To facilitate rumor source detection, an additional source index source_idx is introduced into the dataset to supervise node ranking and localization within each propagation graph. Moreover, fixed training, validation, and test splits are adopted to enable fair comparison across models. In addition to the information required for model training, node-level tweet identifiers, timestamps, and graph-level raw texts are retained as metadata for candidate export and subsequent LLM-based explanatory reasoning, but are not involved in the main GNN training process.
4.2. Evaluation Metrics
The objective of rumor source detection is to accurately localize the true source node from all nodes within a given event-specific propagation graph. Consequently, this task is inherently a node-level ranking and retrieval problem, rather than a conventional graph-level classification or binary node classification task. In light of this formulation, we adopt Hit@K and Mean Reciprocal Rank (MRR) as the primary evaluation metrics to assess the model’s ability to rank source nodes within propagation graphs. In addition, to evaluate the system-level efficiency advantages of the proposed approach, auxiliary metrics such as inference time and the ratio of LLM invocations are also reported, enabling a comprehensive evaluation from both accuracy and efficiency perspectives.
Hit@K. Hit@K is a commonly used hit-rate metric in rumor source detection and information retrieval tasks, which measures whether the ground-truth source node appears among the top-K candidate nodes predicted by the model. Specifically, for each propagation graph sample, the model assigns a source confidence score to every node and ranks all nodes in descending order of their scores to produce a candidate list. A hit is recorded if the true source node is ranked within the top-K positions of this list.
Assume that the test set contains $N$ propagation graph samples, where the ground-truth source node of the $i$-th sample is denoted as $v_i^{*}$, and the Top-$K$ candidate set predicted by the model is denoted as $\mathcal{C}_i^{K}$. Hit@$K$ is defined as

$$\text{Hit@}K = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[v_i^{*} \in \mathcal{C}_i^{K}\right],$$

where $\mathbb{1}[\cdot]$ is the indicator function, which equals 1 if the condition holds and 0 otherwise.
Hit@1, Hit@3, and Hit@5 respectively reflect the model’s source detection capability under strict and more relaxed hit conditions. The Hit@K metric is intuitive and easy to interpret, as it directly captures the model’s ability to provide correct candidate sources for manual verification or subsequent reasoning modules in practical applications.
MRR. While Hit@K measures whether the true source node is covered by the predicted candidates, it does not distinguish the exact ranking position of the source within the Top-K list. To further characterize the quality of node ranking, we introduce Mean Reciprocal Rank (MRR) as a complementary metric.
For the $i$-th propagation graph sample, let $r_i$ denote the position of the ground-truth source node in the predicted ranking list. Its reciprocal rank is defined as $1/r_i$: when the true source is ranked first, this value equals 1, and as the rank worsens, the reciprocal rank correspondingly diminishes. MRR is defined as the average reciprocal rank over all samples:

$$\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{r_i}.$$
MRR jointly accounts for both hit coverage and ranking position, making it more sensitive to the model’s ability to distinguish among multiple highly similar candidate nodes within a propagation graph. In the context of rumor source detection, MRR is particularly suitable for evaluating whether a model can consistently rank the true source near the top of the candidate list, rather than merely achieving marginal coverage.
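Both metrics reduce to a few lines of Python over per-graph rankings; this sketch assumes the true source always appears somewhere in each ranking list (the real evaluation ranks all nodes of a graph, so this always holds):

```python
def hit_at_k(ranked_lists, sources, k):
    """Fraction of graphs whose true source appears in the top-k of its ranking."""
    hits = sum(1 for ranked, s in zip(ranked_lists, sources) if s in ranked[:k])
    return hits / len(sources)

def mrr(ranked_lists, sources):
    """Mean reciprocal rank of the true source (rank positions are 1-based)."""
    total = 0.0
    for ranked, s in zip(ranked_lists, sources):
        total += 1.0 / (ranked.index(s) + 1)   # index() is 0-based, rank is 1-based
    return total / len(sources)

# Two graphs: the true source is ranked 1st in the first and 3rd in the second,
# so Hit@1 = 0.5, Hit@3 = 1.0, and MRR = (1 + 1/3) / 2.
ranked = [[0, 4, 2], [7, 5, 3]]
truth  = [0, 3]
```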
4.3. Overall Performance Evaluation
In this section, we conduct the main experimental evaluation of the proposed TSR-RSD framework on three real-world datasets: GossipCop, PolitiFact, and PHEME. A GNN-based rumor source detection model that relies solely on structural reasoning is adopted as the baseline. Since rumor source detection is inherently a node ranking task within a propagation graph, we employ Hit@1, Hit@3, Hit@5, and Mean Reciprocal Rank (MRR) to evaluate the model’s ability to rank the true source node among top candidates.
To avoid overwhelming the main text with extensive combinations of structural features and GNN variants, we report, for each dataset, only the most stable and best-performing GNN configuration as the representative baseline. TSR-RSD is then evaluated under the same structural configuration, with selective semantic reasoning incorporated on top. This controlled comparison is designed to highlight the performance gains brought by the proposed reasoning paradigm upgrade, rather than by architectural or feature-level variations.
Table 1 summarizes the core results across the three datasets. On GossipCop, the baseline model already achieves near-saturated performance, with Hit@K values approaching 1.0. This observation suggests that large-scale entertainment news propagation graphs often exhibit relatively stable source signals that can be effectively captured by structural models alone.
Under such high-confidence scenarios, TSR-RSD does not perturb the original ranking through unnecessary semantic reasoning. Instead, it preserves performance almost identical to the upper bound established by the baseline, while still achieving a marginal improvement on Hit@1. This behavior is particularly important, as it indicates that the proposed selective reasoning mechanism does not sacrifice the strengths of structural models on easy samples merely for the sake of introducing additional reasoning steps. Rather, TSR-RSD remains conservative on high-confidence cases and reserves semantic reasoning capacity for samples that genuinely require correction, resulting in a more disciplined and robust overall inference strategy.
The near-saturated performance observed on GossipCop primarily reflects the structural characteristics of the FakeNewsNet propagation graphs rather than any particular modeling artifact. As evidenced by our ablation over feature types and GNN backbones, performance varies substantially across configurations. In the main text, we report only the best-performing configuration per dataset for clarity, while the complete GNN results under all feature–backbone combinations are provided in Appendix A.
For example, when using content-only features, Hit@1 decreases sharply to 0.0105 (GraphSAGE) and 0.4232 (GAT), suggesting that accurate localization depends on meaningful structural–textual alignment rather than trivial cues. In contrast, when combining high-quality semantic embeddings (BERT) with graph modeling, the strong regularities in FakeNewsNet cascades enable near-deterministic source identification.
On PolitiFact, TSR-RSD yields a more pronounced improvement in ranking quality. Compared with GossipCop, PolitiFact exhibits smaller propagation graphs but stronger semantic polarization, where cascades with highly similar structures may nonetheless differ substantially in stance and factual entailment. In such cases, structural models often achieve reasonable Top-K coverage, yet remain constrained in the fine-grained ordering of the Top-K candidates. The results show that TSR-RSD delivers a clear gain on Hit@3 and a consistent improvement on Hit@1, suggesting that semantic reasoning does not merely enlarge the candidate pool; rather, it tends to promote the true source from being within the top few to being ranked even higher, thereby improving strict-hit performance and ranking stability.
However, MRR shows a moderate decrease compared to the structural baseline. This suggests that while semantic reasoning promotes certain hard samples into the top-K region, it may slightly perturb the fine-grained ordering among already well-ranked candidates. Given the relatively small size of PolitiFact (219 events), minor rank shifts between positions 1 and 2 can have a disproportionate effect on MRR.
The PHEME results are the most representative for TSR-RSD, as the dataset comprises breaking-news-driven conversational threads with more complex structures and intertwined temporal signals, where topology-only aggregation often fails to reliably distinguish the earliest initiator from the earliest diffuser. The baseline achieves a Hit@1 of only ∼0.29 on PHEME, indicating that the true source is frequently ranked far from the top. With TSR-RSD, Hit@1 increases to ∼0.42 and MRR improves from 0.46 to 0.49, demonstrating the corrective value of semantic reasoning in challenging scenarios: it leverages semantic cues such as content inheritance, expressive consistency, and propagation logic to re-rank structurally similar candidates, pushing the true source more reliably toward the front of the list. Meanwhile, Hit@5 decreases on PHEME, which is consistent with a side effect of more aggressive re-ranking: when semantic reasoning strongly corrects a small fraction of samples, the true source may move from around rank 5 to near rank 1, but a few cases may also be over-corrected, thereby affecting the looser Hit@5 metric. Given the concurrent gains on Hit@1 and MRR, this phenomenon does not weaken our main conclusion: the key benefit of TSR-RSD lies in improving stricter and more goal-aligned ranking metrics, rather than pursuing superficial gains in Top-K coverage.
Overall, the main results indicate that the gains of TSR-RSD do not stem from reprocessing every sample with an LLM. Instead, TSR-RSD first relies on the structural model to produce strong candidates and an initial ranking, and then applies selective semantic reasoning to high-ambiguity samples. This design improves the rank position of the true source in complex propagation environments while preserving the baseline upper-bound performance when structural signals are clear. Such a behavior pattern, stable on easy cases and corrective on hard cases, captures the key value proposition of selective reasoning: it emphasizes not only effectiveness, but also controllability of inference decisions and principled allocation of expensive reasoning resources.
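The inference pattern discussed above can be sketched in a few lines. This is a minimal illustration, assuming the structural model exposes a per-node probability map and the semantic stage is a black-box reranker; the names `selective_rerank`, `llm_rerank`, and `tau` are hypothetical and do not denote the framework's actual API.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def selective_rerank(gnn_scores, llm_rerank, tau):
    """Tri-stage inference: structural ranking, entropy gate, optional LLM rerank.

    gnn_scores: {node_id: probability} over candidate source nodes.
    llm_rerank: callable that reorders a candidate list (semantic stage).
    tau: entropy threshold; semantic reasoning triggers only above it.
    Returns the final ranking and whether the LLM was invoked.
    """
    ranked = sorted(gnn_scores, key=gnn_scores.get, reverse=True)
    if entropy(list(gnn_scores.values())) <= tau:
        return ranked, False          # confident case: keep the GNN ranking
    return llm_rerank(ranked), True   # ambiguous case: invoke semantic reasoning

# A peaked distribution skips the LLM; a near-uniform one triggers it
peaked = {"a": 0.9, "b": 0.05, "c": 0.05}
flat = {"a": 0.34, "b": 0.33, "c": 0.33}
print(selective_rerank(peaked, lambda c: c, tau=0.5)[1])  # False
print(selective_rerank(flat, lambda c: c, tau=0.5)[1])    # True
```

The gate guarantees the conservative behavior observed on GossipCop: whenever the structural distribution is sharply peaked, the GNN ranking is returned untouched.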
4.4. Comparison of Generation Time and Inference Cost
Beyond source detection accuracy, practical rumor source detection systems must also account for inference efficiency and overall system cost. This consideration is particularly critical in large-scale social media monitoring or online early-warning scenarios, where inference latency directly affects deployability. Accordingly, this section conducts a system-level comparison of generation-stage inference time across different model configurations to assess the practical efficiency gains introduced by TSR-RSD.
Figure 5 presents a system-level comparison of the trade-off between inference time and source detection performance (MRR) for the evaluated methods.
We compare four different inference strategies, corresponding to purely structural modeling, purely semantic reasoning, an unscreened structure–semantic cascaded scheme, and the proposed tri-stage selective reasoning framework. All experiments are conducted under identical hardware environments and inference settings. The reported metric is the average inference time per propagation graph, measured from input to the generation of the complete source detection result, in seconds.
Table 2 reports the average generation-time results of different methods on the GossipCop, PolitiFact, and PHEME datasets.
As illustrated in Figure 5, the pure GNN-based method lies in the low-latency but performance-limited region, whereas the LLM-only and unscreened GNN+LLM approaches achieve relatively strong performance on some datasets at the cost of substantially increased inference time and higher overall system cost. In contrast, TSR-RSD resides in the Pareto-optimal region of the inference time–performance trade-off, significantly reducing inference latency while maintaining near-optimal source detection performance. This result highlights the effectiveness of the proposed selective reasoning mechanism from a system-level perspective.
The overall experimental results are summarized in Table 2, which reports the average generation time of different methods on the three datasets under identical hardware environments and inference settings. The results show that the pure GNN-based method consistently exhibits low generation latency, as it only involves structural modeling and a single forward pass. However, its performance gains are inherently limited in complex diffusion scenarios. In contrast, both the LLM-only and the unscreened GNN+LLM approaches incur substantially higher inference time across all datasets, particularly on GossipCop and PolitiFact, where larger diffusion scales cause the overall latency to be dominated by multi-round reasoning of large language models.
After introducing the selective reasoning mechanism, TSR-RSD demonstrates clear advantages in generation efficiency. The experimental results indicate that, on the GossipCop dataset, the average generation time of TSR-RSD is significantly reduced compared with both the GNN+LLM and LLM-only methods, and a consistent trend is observed on PolitiFact and PHEME. These findings suggest that the uncertainty filtering mechanism, which triggers semantic reasoning only for samples with uncertain predictions, effectively reduces the number of LLM invocations, thereby substantially shortening overall inference time without sacrificing source detection performance.
A further comparison across datasets reveals that the efficiency advantages of TSR-RSD are particularly pronounced in datasets with larger diffusion scales or relatively clear propagation structures. In such scenarios, the GNN model alone is able to produce high-confidence predictions for the majority of samples, allowing the entropy-based uncertainty filtering mechanism to effectively exclude a large portion of instances that do not require semantic reasoning. Conversely, in settings with more complex propagation structures and higher sample uncertainty, although the LLM invocation ratio increases, TSR-RSD still maintains substantially lower overall inference time than the unscreened cascaded scheme.
In summary, the experimental results in this section clearly demonstrate that the reduced generation-stage inference time of TSR-RSD primarily stems from its uncertainty-driven selective reasoning design. By concentrating high-cost semantic reasoning on a small number of critical samples, TSR-RSD significantly lowers overall generation overhead while preserving source detection performance, providing strong efficiency guarantees for practical deployment in large-scale or near-real-time rumor source detection tasks.
4.5. Analysis of LLM Invocation Ratio
As observed in the previous generation-time comparison experiments, TSR-RSD substantially reduces overall inference time while maintaining strong rumor source detection performance. To further validate the core mechanism by which TSR-RSD achieves efficient inference at the system level, this section focuses on analyzing the proportion of samples that actually invoke large language model (LLM) reasoning under different model configurations. By comparing the LLM invocation ratios of TSR-RSD with those of an unscreened cascaded scheme and a purely semantic reasoning approach on the test set, we examine the practical effectiveness of uncertainty-driven selective reasoning in reducing computational overhead from an execution-oriented perspective. The corresponding statistics are summarized in Table 3. Furthermore, to investigate the global regulatory effect of the entropy threshold on LLM utilization, we analyze how the LLM invocation ratio varies under different threshold settings, as illustrated in Figure 6.
The evaluation metric is defined as the ratio of propagation graph samples that trigger LLM-based reasoning to the total number of samples in the test set. For the LLM-only and GNN + LLM methods, since all samples indiscriminately enter the semantic reasoning stage, the theoretical LLM invocation ratio is 100%. In contrast, for TSR-RSD, this ratio is adaptively determined by the entropy-threshold-based gating mechanism.
Figure 6 illustrates the variation of the LLM invocation ratio of TSR-RSD under different entropy threshold settings across the three datasets.
The results show that for both the LLM-only and the GNN+LLM approaches, all samples are indiscriminately forwarded to the semantic reasoning stage, leading to an LLM invocation ratio of 100% on all three datasets. Although such full or unscreened reasoning paradigms can achieve competitive source detection performance in certain cases, they inevitably introduce substantial computational redundancy: for samples where structural signals are already sufficient to support accurate decisions, the additional cost of semantic reasoning does not yield commensurate benefits. As shown in Figure 6, TSR-RSD instead introduces an entropy threshold as a gating mechanism for triggering semantic reasoning, yielding an LLM invocation ratio that is controllable and decreases monotonically as the threshold increases.
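The monotone relationship between the threshold and the invocation ratio follows directly from how the gate is defined, as the small sketch below illustrates; the per-sample entropy values are hypothetical, not measured quantities from our experiments.

```python
def invocation_ratio(entropies, tau):
    """Fraction of test graphs whose GNN prediction entropy exceeds the
    gate tau and therefore triggers LLM-based semantic reasoning."""
    return sum(h > tau for h in entropies) / len(entropies)

# Hypothetical per-sample entropies for one dataset (illustrative only)
entropies = [0.1, 0.3, 0.5, 0.9, 1.2, 1.4]
ratios = [invocation_ratio(entropies, t) for t in (0.0, 0.4, 0.8, 1.3)]
print(ratios)  # non-increasing as the threshold grows
```

Raising the threshold can only shrink the set of samples whose entropy exceeds it, which is why every curve in Figure 6 is non-increasing regardless of the dataset.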
In contrast, TSR-RSD significantly reduces the actual LLM invocation ratio across all three datasets, with clear dataset-specific patterns. On GossipCop, as illustrated in Figure 6, the LLM invocation ratio decreases smoothly and substantially as the threshold increases, with only a small fraction of highly uncertain samples triggering semantic reasoning at higher thresholds. This behavior indicates that in scenarios with large propagation scales and salient structural patterns, the GNN model is able to produce high-confidence predictions for the majority of samples, allowing the selective reasoning mechanism to effectively filter out unnecessary LLM calls. On PolitiFact, the LLM invocation ratio remains relatively high and the curve stays elevated across a wide range of thresholds, suggesting a stronger reliance on semantic reasoning for rumor source detection. This observation is consistent with the dataset’s characteristics: smaller propagation graphs but higher semantic discriminability, where semantic reasoning plays a more critical role in correcting the ranking produced by structural models. For PHEME, TSR-RSD similarly maintains a certain level of semantic reasoning coverage while keeping the LLM invocation ratio well below that of full reasoning.
The entropy threshold is selected to balance inference efficiency and ranking performance. Since the entropy of the Top-K candidate distribution is theoretically bounded by log K, the chosen threshold corresponds to a fixed fraction of this maximum uncertainty level. Empirically, this value effectively distinguishes highly ambiguous samples from structurally confident ones.
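Concretely, normalizing the entropy of a Top-K candidate distribution by its log K bound yields a value in [0, 1], which makes thresholds comparable across candidate-set sizes. The distributions below are illustrative only.

```python
import math

def normalized_entropy(probs):
    """Entropy of a Top-K candidate distribution divided by its maximum
    possible value log(K), giving a value in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

# A uniform distribution attains the bound; a peaked one sits far below it
print(round(normalized_entropy([0.2] * 5), 9))                   # 1.0
print(round(normalized_entropy([0.96, 0.01, 0.01, 0.01, 0.01]), 3))  # 0.139
```

Because the uniform distribution over K candidates has entropy exactly log K, any fixed threshold on normalized entropy can be read directly as a percentage of the maximum uncertainty level.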
As shown in Figure 6, the selected threshold achieves a favorable trade-off across datasets: (i) on GossipCop, it substantially reduces the LLM invocation ratio and inference time while preserving near-saturated ranking performance; (ii) on PolitiFact and PHEME, where structural ambiguity is higher, it does not prematurely suppress semantic reasoning, thereby maintaining necessary coverage for difficult samples.
The experimental results in this section clearly demonstrate that the entropy-based uncertainty filtering mechanism can substantially reduce the participation ratio of large language models without significantly compromising source detection performance, thereby achieving an effective balance between accuracy and efficiency. This property endows TSR-RSD with stronger deployability in practical scenarios where computational resources are limited or rapid response is required, and also provides a solid foundation for subsequent analyses on uncertain sample characteristics and ablation studies.
4.6. Analysis of Uncertainty-Guided Sample Selection
To further investigate the practical role of the entropy-based uncertainty filtering mechanism in the TSR-RSD framework, this section conducts an in-depth analysis at the sample level, focusing on propagation graphs that are selected to enter the semantic reasoning stage. Unlike the previous analysis, which examined LLM invocation ratios purely from a system-level perspective, we here aim to answer two key questions: (1) whether samples classified as uncertain indeed exhibit higher source detection difficulty, and (2) whether the uncertainty-based filtering mechanism can effectively distinguish propagation scenarios that are inherently challenging for structural models to resolve. The corresponding results are summarized in Table 4, and the structural and predictive differences between certain and uncertain samples are further illustrated in Figure 7.
The results demonstrate that across all three datasets, propagation graphs categorized as certain consistently achieve substantially higher Hit@1, Hit@3, and MRR scores at the initial GNN prediction stage. As shown in Figure 7, these samples typically exhibit clearer propagation structures and more salient source node characteristics, enabling the GNN model to produce high-confidence rankings. This observation indicates that for the majority of samples with well-defined structural patterns and prominent source signals, reliable source ranking can be achieved without semantic reasoning assistance.
In contrast, propagation graphs classified as uncertain show a pronounced degradation in GNN-based source detection performance. The true source nodes are more widely dispersed within the candidate lists, and ranking stability is significantly reduced. As illustrated on the right side of Figure 7, such samples often correspond to propagation structures with multi-branch diffusion patterns or overlapping temporal signals. In these cases, the predicted probability distribution over Top-K candidates becomes flatter, resulting in higher entropy values that subsequently trigger the semantic reasoning stage.
A cross-dataset comparison further reveals that the performance gap between certain and uncertain samples is particularly pronounced on the GossipCop and PolitiFact datasets. This suggests that samples selected for semantic reasoning in these datasets are often associated with scenarios such as multi-branch diffusion, closely scored candidate nodes, or ambiguous temporal cues, where structural information alone is insufficient to support confident decisions. For the PHEME dataset, although the overall source detection difficulty is higher, uncertain samples still exhibit larger performance variance under the structural model, reflecting the additional challenges posed by complex, event-driven propagation dynamics.
Importantly, the uncertainty-based filtering mechanism does not select samples for semantic reasoning in a random or indiscriminate manner. Instead, it statistically concentrates on propagation graphs where structural models are most prone to ambiguity. This property is directly evidenced by the consistently lower Hit@K and MRR scores of uncertain samples at the initial GNN prediction stage. When combined with the LLM invocation ratio analysis in the previous section, these results confirm that TSR-RSD allocates expensive LLM reasoning resources primarily to high-difficulty samples, thereby avoiding redundant computation on cases that are already easy to resolve.
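The certain/uncertain comparison described above can be reproduced in miniature as follows. The sample distributions and ranks are fabricated for illustration and do not correspond to the values in Table 4.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a candidate distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def split_by_uncertainty(samples, tau):
    """Partition (distribution, true-source-rank) pairs by GNN entropy."""
    certain = [r for p, r in samples if entropy(p) <= tau]
    uncertain = [r for p, r in samples if entropy(p) > tau]
    return certain, uncertain

def mrr(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

# Fabricated samples: peaked distributions rank the source on top,
# flat distributions bury it deeper in the candidate list
samples = [
    ([0.90, 0.05, 0.05], 1),
    ([0.80, 0.10, 0.10], 1),
    ([0.40, 0.35, 0.25], 3),
    ([0.34, 0.33, 0.33], 2),
]
certain, uncertain = split_by_uncertainty(samples, tau=0.7)
print(mrr(certain), mrr(uncertain))  # certain group scores markedly higher
```

When flat distributions systematically coincide with poorly ranked sources, the certain group's initial MRR dominates the uncertain group's, which is exactly the selection property the gating mechanism is designed to exploit.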
4.7. Ablation Study
To further validate the role and necessity of each constituent module within the TSR-RSD framework, this section conducts a systematic ablation study to analyze the impact of graph neural network modeling, the uncertainty filtering mechanism, and the large language model reasoning module on the final source tracing performance. Specifically, the corresponding modules are removed or replaced on each of the three datasets while keeping all other settings unchanged, enabling a clear examination of how these components affect source localization accuracy and ranking stability.
We further observe that when LLM-based reranking is applied to all samples (i.e., the GNN+LLM configuration without entropy gating), the final performance becomes nearly identical to the LLM-only setting. This is expected, since the ultimate ranking is fully determined by the LLM stage in both cases. However, such a design introduces redundant structural computation without additional performance gains. Therefore, the primary contribution of TSR-RSD does not lie in combining GNN and LLM indiscriminately, but in selectively invoking semantic reranking only for high-uncertainty samples. This selective mechanism ensures that structural priors are preserved for confident cases while semantic reasoning is reserved for ambiguous scenarios.
4.7.1. GNN Model
To assess the fundamental role of the graph neural network (GNN) module within the TSR-RSD framework, we conduct an ablation study focusing on the structural modeling component. This module is responsible for node-level representation learning over propagation graphs and for providing a ranked candidate source set for subsequent reasoning stages. Consequently, removing this component directly affects the model’s ability to exploit structural information embedded in propagation patterns. In the experimental setup, we remove the GNN-based source tracing module entirely, such that no structural modeling or candidate filtering is performed. Instead, event-related information is directly fed into the large language model for source identification. This setting is equivalent to the LLM-only scheme and is compared against the full TSR-RSD framework. The corresponding results are summarized in Table 5.
The experimental results show that removing the GNN module leads to varying degrees of performance change across different datasets. On the GossipCop dataset, although the LLM-only approach still achieves relatively strong overall performance, it exhibits noticeable degradation on stricter metrics such as Hit@1 and MRR compared to the full framework. This indicates that in scenarios with large-scale propagation and salient structural patterns, ignoring propagation structure weakens the model’s ability to precisely rank source nodes. On the PolitiFact and PHEME datasets, while LLM-only achieves advantages on certain metrics, its predictions rely heavily on semantic cues and lack structural constraints, resulting in noticeably reduced overall stability. In contrast, the full TSR-RSD framework leverages the GNN module to provide structural priors and candidate constraints, allowing semantic reasoning to operate within a more reasonable and controllable scope.
4.7.2. Entropy Uncertainty Filtering
Subsequently, to evaluate the practical effect of the entropy-based uncertainty filtering mechanism within the TSR-RSD framework, we conduct an ablation study on the selective reasoning module. The core function of this module is to assess the confidence of the predictive distribution over the GNN-generated candidates and to determine whether subsequent large language model reasoning should be triggered. Removing this component causes all samples to enter the semantic reasoning stage indiscriminately. In the experimental setup, we eliminate the entropy-threshold-based gating mechanism, causing the model to degenerate into an unfiltered GNN+LLM cascading scheme. We then compare its performance with that of the full TSR-RSD framework on the same datasets. The corresponding results are summarized in Table 6.
The experimental results indicate that, after removing the uncertainty filtering mechanism, the overall source tracing performance declines to varying degrees across all three datasets, with the degradation being particularly pronounced on stricter ranking-oriented metrics such as Hit@1 and MRR. This observation suggests that indiscriminately applying semantic reasoning to all samples does not yield performance gains commensurate with its computational cost; instead, it may introduce additional noise in certain scenarios, interfering with the otherwise stable predictions of the structural model. In contrast, the full TSR-RSD framework leverages the entropy-based uncertainty filtering mechanism to concentrate semantic reasoning on samples with more dispersed predictive distributions and lower candidate separability under the structural model, thereby achieving more consistent and robust source tracing performance across multiple datasets. These results demonstrate that the uncertainty filtering mechanism not only substantially reduces redundant reasoning overhead at the system level, but also constitutes a critical component for improving the ranking stability and overall performance of TSR-RSD.
4.7.3. LLM Reasoning
To assess the performance gains contributed by the large language model (LLM) reasoning module within the TSR-RSD framework, we conduct an ablation study on the semantic reasoning component. This module is primarily responsible for incorporating semantic and propagation-level logical analysis to refine the ranking of candidate sources when the structural model produces uncertain predictions. Consequently, removing this component causes the model to directly output the GNN predictions without invoking semantic reasoning. In the experimental setup, we remove the LLM reasoning module from the full framework while retaining the graph neural network and the uncertainty filtering mechanism, and compare its performance against the complete TSR-RSD framework on the same datasets. The corresponding results are summarized in Table 7.
The experimental results show that removing the LLM reasoning module leads to a clear degradation in source tracing performance across all three datasets, with the decline being particularly pronounced in datasets characterized by complex propagation structures or less salient source nodes. Notably, on the PHEME dataset, models that rely solely on structural information exhibit substantial instability in source ranking, with the true source nodes more likely to be placed toward the lower end of the candidate lists. In contrast, the full TSR-RSD framework effectively mitigates ambiguities arising from complex propagation scenarios by introducing semantic reasoning for highly uncertain samples, thereby promoting the true source nodes to higher ranks within the candidate lists. These results demonstrate that the LLM reasoning module is not merely an auxiliary performance enhancement, but a critical corrective component for challenging samples, and a necessary condition for improving both the accuracy and stability of overall source tracing performance.