Graph-Conditioned Stochastic Modeling of Twitter Information Cascades with Dual-Head Transformers for Early Virality Prediction
Abstract
1. Introduction
1.1. Research Background and Motivation
1.2. Research Problem and Challenges
1.3. Main Contributions
1.4. Organization of the Paper
2. Related Works
2.1. Information Cascade Prediction in Social Networks
2.2. Stochastic Diffusion Modeling and Point Processes
2.3. Graph Representation Learning and Transformer-Based Modeling
2.4. Summary and Research Positioning
3. Materials and Methods
3.1. Mathematical Formulation of Twitter Information Cascades
3.1.1. Cascade State Space and Historical Process
3.1.2. Causal Transformer Transition Kernel
3.1.3. Virality and Log-Final-Size Learning Objectives
3.2. Dataset Construction and Prefix-Based Cascade Modeling
3.2.1. Higgs Twitter Dataset and Retweet Cascade Construction
3.2.2. Temporal Split and Vocabulary Construction
3.2.3. Prefix-Based Virality and Size Prediction Samples
3.3. Graph-Pretrained Dual-Head Transformer Framework
3.3.1. Node2vec-Based User Embedding Pretraining
3.3.2. Dual-Head Causal Transformer Architecture
3.3.3. Feature Fusion and Ablation Design
3.4. Baselines and Statistical Evaluation Protocol
3.4.1. Handcrafted-Feature Baselines
3.4.2. Evaluation Metrics and Calibration Measures
3.4.3. Bootstrap Confidence Intervals and Statistical Tests
4. Results
4.1. Experimental Setup
4.2. Empirical Cascade Dynamics and Data Sparsity
4.3. Feasibility of User-Level Transition Prediction
4.4. Overall Performance and Graph-Pretraining Effects
4.5. Feature Fusion, Calibration, and Reliability
5. Discussion and Conclusions
5.1. Main Findings
5.2. Discussion and Future Work
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Full Term |
|---|---|
| AdamW | Adaptive Moment Estimation with Decoupled Weight Decay |
| AUROC | Area Under the Receiver Operating Characteristic Curve |
| AUPRC | Area Under the Precision–Recall Curve |
| BCE | Binary Cross-Entropy |
| CE | Cross-Entropy |
| CI | Confidence Interval |
| ECE | Expected Calibration Error |
| FFN | Feed-Forward Network |
| GNN | Graph Neural Network |
| LR | Logistic Regression |
| MHA | Multi-Head Attention |
| MLP | Multilayer Perceptron |
| OOV | Out-Of-Vocabulary |
| PPL | Perplexity |
| RF | Random Forest |
| RMSE | Root Mean Squared Error |
| ROC | Receiver Operating Characteristic |
| RT | Retweet |
| SE | Standard Error |
Appendix A
Appendix A.1. Pairwise AUROC Significance: DeLong’s Test
| Model A | Model B | AUROC (A) | AUROC (B) | ΔAUROC | SE | Z | Sig. |
|---|---|---|---|---|---|---|---|
|  | LR | 0.775 | 0.748 | +0.027 | 0.013 | 2.08 | * |
|  |  | 0.775 | 0.751 | +0.024 | 0.012 | 2.00 | * |
|  | RF | 0.775 | 0.762 | +0.013 | 0.011 | 1.18 | NS |
|  |  | 0.775 | 0.764 | +0.011 | 0.010 | 1.10 | NS |
|  | LR | 0.771 | 0.748 | +0.023 | 0.012 | 1.92 | NS |
|  |  | 0.775 | 0.771 | +0.004 | 0.009 | 0.44 | NS |
| RF | LR | 0.762 | 0.748 | +0.014 | 0.012 | 1.17 | NS |
|  |  | 0.764 | 0.751 | +0.013 | 0.011 | 1.18 | NS |
|  | LR | 0.751 | 0.748 | +0.003 | 0.013 | 0.23 | NS |
|  |  | 0.757 | 0.751 | +0.006 | 0.010 | 0.60 | NS |
|  |  | 0.769 | 0.764 | +0.005 | 0.009 | 0.56 | NS |
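
The Z column in the table above is consistent with dividing each AUROC difference by its standard error. The following minimal sketch reproduces only that final step and converts Z into a two-sided p-value under a standard normal. It does not implement the full DeLong covariance estimate, and the function name `z_test_from_delta` and the 0.05 cut-off for the `*` marker are assumptions made for illustration.

```python
import math


def z_test_from_delta(delta_auroc: float, se: float) -> tuple[float, float]:
    """Return (z, two-sided p) for an AUROC difference and its standard error."""
    z = delta_auroc / se
    # Two-sided p-value under a standard normal: p = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# (delta, SE) pairs copied from three rows of the table above.
rows = [(0.027, 0.013), (0.013, 0.011), (0.004, 0.009)]
for delta, se in rows:
    z, p = z_test_from_delta(delta, se)
    flag = "*" if p < 0.05 else "NS"  # assumed significance threshold
    print(f"delta={delta:+.3f} SE={se:.3f} Z={z:.2f} p={p:.3f} {flag}")
```

In a full DeLong test the standard error itself comes from the covariance of the per-sample placement values of the two models on the same test set; here it is taken as given.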
Appendix A.2. Lead-Time Sensitivity at Varying Prefix Lengths
| Model | K | AUROC [95% CI] | AUPRC [95% CI] | RMSE [95% CI] |
|---|---|---|---|---|
|  | 3 | 0.741 [0.698, 0.783] | 0.588 [0.531, 0.642] | 0.681 [0.641, 0.724] |
|  | 5 | 0.775 [0.736, 0.812] | 0.621 [0.567, 0.674] | 0.604 [0.568, 0.641] |
|  | 7 | 0.791 [0.754, 0.826] | 0.639 [0.586, 0.690] | 0.573 [0.538, 0.609] |
|  | 10 | 0.803 [0.768, 0.836] | 0.652 [0.601, 0.703] | 0.551 [0.517, 0.587] |
|  | 3 | 0.724 [0.680, 0.768] | 0.568 [0.511, 0.624] | 0.712 [0.671, 0.756] |
|  | 5 | 0.751 [0.710, 0.791] | 0.601 [0.546, 0.655] | 0.648 [0.609, 0.688] |
|  | 7 | 0.762 [0.722, 0.801] | 0.614 [0.559, 0.667] | 0.619 [0.580, 0.659] |
|  | 10 | 0.774 [0.736, 0.812] | 0.623 [0.569, 0.677] | 0.598 [0.560, 0.638] |
|  | 3 | 0.738 [0.694, 0.781] | 0.611 [0.554, 0.666] | 0.694 [0.653, 0.737] |
|  | 5 | 0.771 [0.732, 0.809] | 0.660 [0.607, 0.711] | 0.617 [0.579, 0.656] |
|  | 7 | 0.786 [0.749, 0.822] | 0.673 [0.621, 0.723] | 0.587 [0.550, 0.625] |
|  | 10 | 0.798 [0.762, 0.832] | 0.681 [0.630, 0.731] | 0.563 [0.527, 0.601] |
| LR (baseline) | 3 | 0.728 [0.683, 0.771] | 0.571 [0.514, 0.627] | – |
|  | 5 | 0.748 [0.706, 0.789] | 0.589 [0.534, 0.643] | – |
|  | 7 | 0.757 [0.716, 0.797] | 0.601 [0.547, 0.654] | – |
|  | 10 | 0.768 [0.729, 0.807] | 0.613 [0.559, 0.666] | – |
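
The bracketed intervals above are 95% confidence intervals. As a minimal sketch of how such an interval could be obtained for AUROC, the code below uses a percentile bootstrap over test samples. The percentile variant, the number of resamples, and the helper names `auroc` and `bootstrap_ci` are assumptions; the source does not state which bootstrap procedure was used.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC (assumed procedure)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain a single class
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

The quadratic pairwise AUROC is chosen for readability; a rank-based implementation would be preferable for test sets with many samples.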
Appendix A.3. Temperature Scaling Calibration Results
| Model | T | ECE (raw) | ECE (cal.) | ΔECE | Brier (raw) | ΔBrier |
|---|---|---|---|---|---|---|
|  | 1.31 | 0.094 | 0.048 | 0.046 | 0.203 | 0.007 |
|  | 1.19 | 0.082 | 0.043 | 0.039 | 0.198 | 0.007 |
|  | 1.12 | 0.058 | 0.031 | 0.027 | 0.182 | 0.004 |
|  | 0.88 | 0.087 | 0.051 | 0.036 | 0.201 | 0.006 |
|  | 0.97 | 0.071 | 0.038 | 0.033 | 0.195 | 0.006 |
|  | 0.94 | 0.076 | 0.041 | 0.035 | 0.191 | 0.005 |
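
Temperature scaling, named in the title of this appendix, divides the logit by a scalar T before the sigmoid and picks T on held-out data. The sketch below is one possible binary version under stated assumptions: T is chosen by a grid search over the negative log-likelihood, and ECE is computed on the positive-class probability with ten equal-width bins. The grid, the bin count, and the function names are illustrative and not taken from the source.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def nll(logits, labels, temperature: float) -> float:
    """Mean binary negative log-likelihood after scaling logits by 1/T."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits, labels, grid=None) -> float:
    """Pick the T that minimizes validation NLL (grid search is an assumption)."""
    grid = grid or [0.5 + 0.05 * i for i in range(91)]  # 0.5 .. 5.0
    return min(grid, key=lambda t: nll(logits, labels, t))


def ece(probs, labels, n_bins: int = 10) -> float:
    """Expected calibration error over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = 0.0
    for items in bins:
        if items:
            conf = sum(p for p, _ in items) / len(items)
            acc = sum(y for _, y in items) / len(items)
            total += len(items) / len(probs) * abs(acc - conf)
    return total
```

A fitted T above 1 softens overconfident probabilities, while T below 1 sharpens underconfident ones; ranking metrics such as AUROC are unchanged because the transformation is monotone.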
References
1. Cheng, Z.; Zhou, F.; Xu, X.; Zhang, K.; Trajcevski, G.; Zhong, T.; Yu, P.S. Information Cascade Popularity Prediction via Probabilistic Diffusion. IEEE Trans. Knowl. Data Eng. 2024, 36, 8541–8555.
2. Cheng, Z.; Liu, Y.; Zhong, T.; Zhang, K.; Zhou, F.; Yu, P.S. Disentangling Inter- and Intra-Cascades Dynamics for Information Diffusion Prediction. IEEE Trans. Knowl. Data Eng. 2025, 37, 4548–4563.
3. Dubovskaya, A.; Pena, C.B.; O’Sullivan, D.J.P. Modeling Diffusion in Networks with Communities: A Multitype Branching Process Approach. Phys. Rev. E 2024, 111, 034310.
4. Jing, X.; Jing, Y.; Lu, Y.; Deng, B.; Yang, S.; Yang, D. On Your Mark, Get Set, Predict! Modeling Continuous-Time Dynamics of Cascades for Information Popularity Prediction. IEEE Trans. Knowl. Data Eng. 2024, 37, 5436–5451.
5. Kashuv, Y.; Alharbi, R.; Thai, M.T. Predicting User Tipping in Online Social Networks with Temporal Graph Neural Networks. IEEE Trans. Comput. Soc. Syst. 2026, 13, 1228–1240.
6. Li, H.; Xia, C.; Wang, T.; Wang, Z.; Cui, P.; Li, X. GRASS: Learning Spatial–Temporal Properties from Chainlike Cascade Data for Microscopic Diffusion Prediction. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 16313–16327.
7. Li, L.; Chen, Z.J.; Ye, H.; Zhang, Y. Incorporating Topical Stance into Signed Bipartite Networks for User Retweet Prediction. PLoS ONE 2026, 21, e0342677.
8. Li, L.; Duan, L.; Wang, J.; He, C.; Chen, Z.; Xie, G.; Deng, S.; Luo, Z. Memory-Enhanced Transformer for Representation Learning on Temporal Heterogeneous Graphs. Data Sci. Eng. 2023, 8, 98–111.
9. Liu, C.; Zhang, J.; Wang, S.; Fan, W.; Li, Q. Score-Based Generative Diffusion Models for Social Recommendations. IEEE Trans. Knowl. Data Eng. 2024, 37, 6666–6679.
10. Liu, X.; Wang, H.; Bouyer, A. A Cascade Information Diffusion Prediction Model Integrating Topic Features and Cross-Attention. J. King Saud Univ. Comput. Inf. Sci. 2023, 35, 101852.
11. Mashayekhi, Y.; Rezvanian, A.; Vahidipour, S. A Novel Regularized Weighted Estimation Method for Information Diffusion Prediction in Social Networks. Appl. Netw. Sci. 2023, 8, 81.
12. Peng, H.; Zhang, J.; Huang, X.; Hao, Z.; Li, A.; Yu, Z.; Yu, P.S. Unsupervised Social Bot Detection via Structural Information Theory. ACM Trans. Inf. Syst. 2024, 42, 148.
13. Qi, O.; Chen, H.; Liu, S.; Pu, L.; Ge, D.; Fan, K. DMHANT: DropMessage Hypergraph Attention Network for Information Propagation Prediction. Big Data 2024, 13, 364–378.
14. Sallah, A.; Abdellaoui Alaoui, E.A.; Agoujil, S.; Wani, M.A.; Hammad, M.; Maleh, Y.; Abd El-Latif, A.A. Fine-Tuned Understanding: Enhancing Social Bot Detection with Transformer-Based Classification. IEEE Access 2024, 12, 118250–118269.
15. Tai, Y.; Yang, H.; He, H.; Wu, X.; Shao, Y.; Zhang, W.; Sangaiah, A.K. Topic-Aware Masked Attentive Network for Information Cascade Prediction. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2024, 23, 126.
16. De Domenico, M.; Lima, A.; Mougel, P.; Musolesi, M. The Anatomy of a Scientific Rumor. Sci. Rep. 2013, 3, 2980.
17. Tang, Y.; Piao, J.; Wang, H.; Wang, Y.; Li, Y. MSA-Net: A Multi-Scale Information Diffusion Model Awaring User Activity Level. ACM Trans. Web 2025, 19, 17.
18. Vinod, D.; Kumar T, G.; Kumar, P.N. Effects of the Evolution of Network Structural Properties on Information Diffusion in Dynamic Social Networks. J. Intell. Fuzzy Syst. 2025, 49, 611–626.
19. Wang, Z.; Wang, X.; Xiong, F.; Chen, H. A Survey of Deep Learning-Based Information Cascade Prediction. Symmetry 2024, 16, 1436.
20. Wang, B.; Li, Z.; Xu, Z.; Zhang, J. Casformer: Information Popularity Prediction with Adaptive Cascade Sampling and Graph Transformer in Social Networks. IEEE Trans. Big Data 2025, 11, 1652–1663.
21. Almanza, M.; Lattanzi, S.; Panconesi, A.; Re, G. Twin Peaks, a Model for Recurring Cascades. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 681–692.
22. Bonifazi, G.; Cauteruccio, F.; Corradini, E.; Giannelli, E.; Marchetti, M.; Ursino, D.; Virgili, L. A Multilayer Network-Based Framework for Investigating the Evolution and Resilience of Multimodal Social Networks. Soc. Netw. Anal. Min. 2023, 14, 5.
23. Ye, J.; Bao, Q.; Xu, M.; Xu, J.; Qiu, H.; Jiao, P. RD-GCN: A Role-Based Dynamic Graph Convolutional Network for Information Diffusion Prediction. IEEE Trans. Netw. Sci. Eng. 2024, 11, 4923–4937.
24. Zeng, Y.; Xiang, K. Persistence Augmented Graph Convolution Network for Information Popularity Prediction. IEEE Trans. Netw. Sci. Eng. 2023, 10, 3331–3342.
25. Perozzi, B.; Al-Rfou, R.; Skiena, S. DeepWalk: Online Learning of Social Representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 701–710.
26. Grover, A.; Leskovec, J. node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 855–864.
27. Zhai, P.; Yang, Y.; Zhang, C.H. Causality-Based CTR Prediction Using Graph Neural Networks. Inf. Process. Manag. 2023, 60, 103137.
28. Zhang, Y.; Wang, Z.; Zhuang, H.; Song, L.; Wen, G.; Guan, J.; Zhou, S. Predicting Participation Shift of Users at the Next Stage in Social Networks. IEEE Trans. Netw. Sci. Eng. 2025, 12, 1066–1079.
29. Zhang, G.; Zhang, S.; Yuan, G. Bayesian Graph Local Extrema Convolution with Long-Tail Strategy for Misinformation Detection. ACM Trans. Knowl. Discov. Data 2024, 18, 89.
30. Zhao, J.; Lyu, X.; Rong, H.; Zhao, J. TRGCN: A Prediction Model for Information Diffusion Based on Transformer and Relational Graph Convolutional Network. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2024, 23, 143.

| Study Category | Diffusion/Cascade Mechanism | Temporal Dynamics | Structural Representation | Semantic/Topic Information | User Behavior Modeling |
|---|---|---|---|---|---|
| Probabilistic and stochastic diffusion modeling [1,3,11] | ✓ | partial | partial | – | partial |
| Inter-/intra-cascade and continuous-time dynamics [2,4] | ✓ | ✓ | partial | – | partial |
| Graph neural network-based cascade prediction [6,23,24,30] | partial | partial | ✓ | – | partial |
| Transformer and attention-based cascade modeling [8,10,15,20] | partial | ✓ | ✓ | partial | partial |
| Hypergraph and high-order propagation modeling [13] | partial | partial | ✓ | – | partial |
| Topic-, stance-, and content-aware diffusion modeling [7,10,15] | partial | partial | partial | ✓ | partial |
| User activity and behavioral transition modeling [5,17,28] | partial | ✓ | partial | – | ✓ |
| Dynamic social network and structural evolution studies [12,18,29] | partial | partial | ✓ | partial | partial |
| Survey and adjacent predictive learning studies [9,19,27] | partial | partial | partial | partial | partial |
| This work | Stochastic cascade growth from reaction traces. | Inter-arrival, length evolution, and position-wise loss. | Root influence, outdegree growth, sequence length, and reaction-chain structure. | Token-level auxiliary cascade signals. | Reaction timing, sequence position, and participation patterns. |

| Item | Setting |
|---|---|
| Prediction setting | Early prefix-based cascade outcome prediction |
| Main input | First K = 5 cascade events |
| Classification target | Virality indicator |
| Regression target | Log-final size |
| Virality threshold | 10 |
| Loss weight | Equal weighting of BCE and regression terms |
| Baseline classifiers | Logistic Regression, Random Forest |
| Baseline regressors | Ridge Regression, Random Forest Regression |
| Transformer variants | Random, node2vec, frozen-node2vec, and feature-fusion variants |
| Classification metrics | AUROC, AUPRC, Brier score, ECE |
| Regression metric | RMSE on log-final size |
| Statistical protocol | Bootstrap confidence intervals and paired tests |
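
The settings above can be illustrated with a minimal sketch of how one prefix sample and the two-term training loss might be formed. Several details are assumptions not given in the source: cascades shorter than K are skipped, a cascade counts as viral when its final size is at least the threshold, the natural logarithm is used for the size target, the regression term is a squared error, and equal weighting is expressed as a weight of 1.0 on each term. All names are hypothetical.

```python
import math
from dataclasses import dataclass

K = 5            # prefix length from the settings table
THRESHOLD = 10   # virality threshold from the settings table


@dataclass
class PrefixSample:
    prefix: list[int]   # user ids of the first K events
    viral: int          # 1 if the final size reaches the threshold
    log_size: float     # log of the final cascade size


def make_sample(cascade: list[int], k: int = K, threshold: int = THRESHOLD):
    """Build one sample from a full cascade, or None if it is too short (assumption)."""
    if len(cascade) < k:
        return None
    final_size = len(cascade)
    return PrefixSample(cascade[:k], int(final_size >= threshold), math.log(final_size))


def joint_loss(p_viral: float, y_viral: int, pred_log_size: float,
               y_log_size: float, weight: float = 1.0) -> float:
    """BCE on the virality head plus weighted squared error on the size head."""
    eps = 1e-12
    p = min(max(p_viral, eps), 1.0 - eps)
    bce = -(y_viral * math.log(p) + (1 - y_viral) * math.log(1.0 - p))
    mse = (pred_log_size - y_log_size) ** 2
    return bce + weight * mse
```

For example, a twelve-event cascade yields a five-event prefix with a positive virality label, and the label is computed from events the model never sees at prediction time.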

| Category | Statistic | Value |
|---|---|---|
| Cascade scale | Number of cascades | 41,426 |
|  | Tail exponent | −1.008 |
| Temporal dynamics | Number of inter-arrival intervals | 301,637 |
|  | Inter-arrival median | 1.25 min |
|  | Inter-arrival p99 | 25.47 h |
| Graph signal | Spearman correlation, followers vs. size | 0.494 |
|  | Cascades with positive follower count | 98.60% |
| User sparsity | Vocabulary size | 256,493 |
|  | Training target tokens | 505,146 |
|  | Test target tokens | 11,036 |
|  | Rank for 90% token coverage | 208,547 |
| Temporal split | Training users | 246,726 |
|  | Test users | 13,718 |
|  | Train/test user overlap | 3,953 |
|  | Test users seen in training | 28.80% |
| Sequence length | Median train/test length | 2/2 |
|  | p99 train/test length | 132/24 |

| Model | CE ↓ | PPL ↓ | Hits@1 | Hits@10 | Hits@50 |
|---|---|---|---|---|---|
| First-order Markov | 12.408 | 244,871.40 | 0.009 | 0.014 | 0.014 |
| Hawkes exponential kernel | 14.023 | 1,230,807.80 | 0.006 | 0.014 | 0.022 |
| MiniTransformer | 12.485 [12.472, 12.499] | 264,344.3 [260,809.5, 268,098.4] | 0.000 [0.000, 0.001] | 0.001 [0.001, 0.002] | 0.002 [0.001, 0.003] |
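
The CE and PPL columns above agree with perplexity being the exponential of the cross-entropy in nats, up to rounding of the reported CE. The sketch below shows that relation together with a simple Hits@k computation. The function names are hypothetical, and the uniform reference value assumes the model's output space equals the vocabulary size of 256,493 reported in the dataset statistics.

```python
import math


def perplexity(cross_entropy_nats: float) -> float:
    """Perplexity as exp of the mean cross-entropy measured in nats."""
    return math.exp(cross_entropy_nats)


def hits_at_k(ranked_candidates, targets, k: int) -> float:
    """Fraction of targets that appear among the top-k ranked candidates."""
    hits = sum(t in cands[:k] for cands, t in zip(ranked_candidates, targets))
    return hits / len(targets)


# Reported CE values from the table above.
for name, ce in [("First-order Markov", 12.408),
                 ("Hawkes exponential kernel", 14.023),
                 ("MiniTransformer", 12.485)]:
    print(f"{name}: PPL ~ {perplexity(ce):,.0f}")

# Reference point: cross-entropy of a uniform guess over the assumed vocabulary.
uniform_ce = math.log(256_493)
print(f"uniform CE ~ {uniform_ce:.3f} nats")
```

Comparing each CE against the uniform reference gives a quick sense of how little signal a next-user predictor can extract when most test users never appear in training.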

| Model | Type | Structural Prior | AUC ↑ | F1 ↑ | Brier ↓ | RMSE ↓ | MAE ↓ |
|---|---|---|---|---|---|---|---|
| Logistic Regression | Feature-based | No | 0.731 | 0.684 | 0.194 | — | — |
| Ridge Regression | Feature-based | No | — | — | — | 0.219 | 0.171 |
| Random Forest | Feature-based | No | 0.792 | 0.731 | 0.166 | 0.203 | 0.157 |
| Random initialization Transformer | Neural | No | 0.778 | 0.718 | 0.174 | 0.209 | 0.162 |
| Causal Transformer w/o node2vec prior | Neural | No | 0.805 | 0.742 | 0.158 | 0.198 | 0.153 |
| Trainable node2vec Transformer | Neural | Yes, trainable | 0.812 | 0.748 | 0.154 | 0.195 | 0.151 |
| Frozen node2vec Transformer (proposed) | Neural | Yes, frozen | 0.819 | 0.754 | 0.151 | 0.192 | 0.149 |
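
For reference, the Brier, RMSE, and MAE columns above follow standard definitions, sketched here in plain Python. The function names are illustrative, and the assumption that RMSE and MAE are computed on the log-final-size target follows the regression metric named in the settings table.

```python
import math


def brier_score(probs, labels) -> float:
    """Mean squared difference between predicted probability and binary label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def rmse(pred, true) -> float:
    """Root mean squared error between predictions and targets."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pred, true)) / len(pred))


def mae(pred, true) -> float:
    """Mean absolute error between predictions and targets."""
    return sum(abs(a - b) for a, b in zip(pred, true)) / len(pred)
```

Lower is better for all three, which matches the arrows in the table header.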

| Comparison | Metric | Effect Size | 95% CI | p-Value | Sig. |
|---|---|---|---|---|---|
| Frozen node2vec vs. Logistic Regression | AUC |  |  | 0.004 | Yes |
| Frozen node2vec vs. Random Initialization | AUC |  |  | 0.018 | Yes |
| Frozen node2vec vs. Random Forest | AUC |  |  | 0.112 | No |
| Frozen node2vec vs. Trainable node2vec | AUC |  |  | 0.438 | No |
| Frozen node2vec vs. Causal Transformer w/o node2vec | AUC |  |  | 0.196 | No |
| Frozen node2vec vs. Random Initialization | RMSE |  |  | 0.024 | Yes |
| Frozen node2vec vs. Random Forest | RMSE |  |  | 0.137 | No |
| Frozen node2vec vs. Trainable node2vec | RMSE |  |  | 0.521 | No |
| Frozen node2vec vs. Causal Transformer w/o node2vec | RMSE |  |  | 0.284 | No |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Dong, B.; Zhang, X.; Yan, C.; Zhu, W.; Hou, L.; Feng, Y. Graph-Conditioned Stochastic Modeling of Twitter Information Cascades with Dual-Head Transformers for Early Virality Prediction. Mathematics 2026, 14, 2288. https://doi.org/10.3390/math14132288

