Article

Multi-Strategy Improvement and Comparative Research on Data-Driven Social Network Construction in Edge-Deficient Scenarios for Social Bot Account Detection

Joint Laboratory of Cyberspace Security, School of Intelligent Science and Engineering, School of Cyberspace Security, Qinghai Minzu University, Xining 810007, China
*
Authors to whom correspondence should be addressed.
Information 2026, 17(4), 360; https://doi.org/10.3390/info17040360
Submission received: 7 February 2026 / Revised: 30 March 2026 / Accepted: 7 April 2026 / Published: 9 April 2026
(This article belongs to the Section Information Security and Privacy)

Abstract

Accurate social bot detection relies on simulated data to alleviate the scarcity of labeled real-world datasets. Synthetic graph data serves as the core training resource for detection models within simulated data; however, edge deficiency in real social networks (induced by privacy constraints and data collection limitations) gives rise to “pseudo-isolated nodes” and degrades the quality of synthetic graph data. Furthermore, mainstream data-driven synthetic graph generation methods lack systematic and credible comparative analyses. To tackle these problems, this study optimizes two representative synthetic graph generation approaches (the Chung-Lu model and the Rejection-Controlled Metropolis–Hastings (RCMH) sampling + diffusion model) and puts forward an edge completion strategy grounded in sociological theory. Multiple groups of comparative experiments are conducted to assess the performance of the improved methods and the edge completion strategy. Experimental results demonstrate that the “interest + social association” edge completion strategy achieves an F1-score of 0.7051, and the improved sampling + diffusion model integrated with edge completion reaches an F1-score of 0.7071, modestly outperforming the traditional, unmodified methods. This work preliminarily enhances the reliability of synthetic graph generation methods and provides relatively high-quality synthetic social graph data for social bot detection. It should be noted that the proposed methods are validated solely on Twitter-derived datasets, and their effectiveness remains to be verified in cross-platform and dynamic social network scenarios.

1. Introduction

Social bots pose severe threats to the information ecosystem security of social networks, and accurate social bot detection is an urgent research need for network security governance [1]. However, the increasingly stringent privacy protection policies of social platforms have led to the scarcity of high-quality real labeled data for bot detection, making simulated social network data the primary training source for bot detection models. Privacy constraints and malicious behavior threats are core challenges in social network security. In the field of general network security, researchers have explored rule-based defense against collusion attacks in cloud environments [2] and machine learning-based secure data transmission for IoT [3], which provide security-oriented research ideas for network data processing. For social bot detection, the privacy-induced edge deficiency in social networks further distorts topological structure, which is the key problem addressed in this study.
Existing social network simulation, data synthesis and augmentation methods—including diffusion-model-based synthetic graph generation, GNN/rule-based topology replication, and variational generative adversarial network-assisted minority sample synthesis—all aim to enhance the consistency between simulated social network data and real-world data. Yet these methods share an unvalidated core assumption: they treat real data as a complete and reliable reference benchmark. For instance, diffusion-generated social graphs [4] train models on real datasets (e.g., TwiBot-22) without considering topological distortion caused by edge deficiency; comparative studies of GNN and rule-based methods [5] assume sampled subgraphs preserve original network patterns while ignoring the pre-existing community fragmentation; Topology-Aware Gated GNN [6] relies on distorted topological metrics of real nodes for sampling; and HMM-VGAN [7] only removes superficial feature noise while neglecting deep flaws such as missing social relations and label-related noise in real data.
The construction of realistic synthetic graph data must rely on the topological characteristics and interaction preferences of real social network data as its core reference [8]. Unfortunately, the limited real social network data available for reference is widely plagued by edge deficiency, which has two core causes: on the one hand, users’ active privacy settings make part of the social connection edges invisible; on the other hand, the restricted scope of data collection makes it impossible to fully enumerate all the real connection targets of each node [9,10]. Both factors generate a large number of “pseudo-isolated nodes” in the real network. Based on the following-relationship data of 1 million nodes in the TwiBot-22 dataset, our experiments show that the proportion of isolated nodes in the real social network constructed from this data reaches 30.62%. Moreover, as the proportion of isolated nodes was increased, downstream detection performance declined most sharply before the proportion reached 35%. This observation indirectly confirms that if the real reference data serving as the benchmark for synthetic social graph construction suffers from increasingly severe edge deficiency, the quality of the resulting synthetic graph data is inevitably compromised: the structural deficiencies of the real data directly cause serious topological distortion and interaction preference deviation in the synthetic social graph data built from it, leaving such data far from meeting the training requirements of bot detection models [11].
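For illustration, the isolated-node proportion discussed above can be estimated directly from a follow edge list. The sketch below uses toy node IDs, not TwiBot-22 data; the helper name is our own.

```python
# Sketch: estimating the pseudo-isolated node proportion from a follow edge list.
# Node IDs and edges below are toy placeholders, not TwiBot-22 data.

def isolated_node_ratio(node_ids, edges):
    """Fraction of nodes that appear in no follow edge (neither source nor target)."""
    connected = set()
    for src, dst in edges:
        connected.add(src)
        connected.add(dst)
    isolated = [v for v in node_ids if v not in connected]
    return len(isolated) / len(node_ids)

nodes = [0, 1, 2, 3, 4]
edges = [(0, 1), (1, 2)]  # nodes 3 and 4 have no visible edges
print(isolated_node_ratio(nodes, edges))  # 0.4
```

On TwiBot-22, applying the same counting to the 1 million nodes and their visible follow edges yields the 30.62% figure reported above.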
To address the critical challenges caused by edge deficiency in real reference data, this study is dedicated to improving the quality of synthetic social graph data for bot detection. We systematically uncover the structural causes of edge deficiency in real social networks and its distortion mechanism for synthetic social graph data and propose a real data processing approach of “completion first, reference later”. On this basis, we optimize two mainstream data-driven synthetic social graph generation methods (the Chung-Lu model and the RCMH sampling + diffusion model) to enhance their performance in generating high-quality task-adaptive synthetic social graph data (see Figure 1). Furthermore, we design a node label-independent edge completion strategy that integrates interest identification and social association mechanisms and validate its effectiveness using the improved synthetic social graph generation methods as a consistent experimental setup for edge-deficient scenarios—by evaluating the topological quality enhancement of completed real data and the performance improvement of downstream bot detection (refer to Appendix A for relevant experimental methods).
Therefore, the main contributions of this paper are threefold, as follows:
(a) It systematically reveals the structural causes and synthetic social graph data distortion mechanisms of edge deficiency in real social networks and proposes a “complete first, then reference” research approach for real reference data, which mitigates the negative impact of incomplete real data on high-quality synthetic social graph construction.
(b) It systematically compares and improves two mainstream data-driven synthetic social graph generation strategies, node degree-driven models and real incomplete network-driven methods, enhancing their ability to generate high-quality synthetic social graph data for the social bot detection task in edge-deficient scenarios.
(c) It designs and verifies a node label-independent edge completion strategy for real edge-deficient social network data, which integrates interest identification and social association mechanisms based on user behavior logic, helps infer and supplement potential topological connections in real data, improves the structural authenticity of real reference data, and supports the construction of high-quality synthetic social graph data.

2. Improved Strategies for Social Network Construction in Edge-Deficient Scenarios

2.1. Principle, Defect Analysis, and Improvement of Node Degree-Driven Strategy

The Chung-Lu model is a classic node degree-driven directed network generation method, widely used in synthetic social graph generation for its advantage of retaining the real node degree distribution. To address its adaptability defects in edge-deficient scenarios for social bot detection, this section clarifies the model’s core principle, quantifies its key defects and their impacts on detection tasks, and proposes targeted improvement strategies with clear design motivations, while explaining the physical meaning of the key formulas and design choices.

2.1.1. Core Principle of the Original Chung-Lu Model

The model takes the out-degree/in-degree sequence of real social network nodes as the core input and must satisfy the hard constraint for directed network topology rationality: $\sum_{i=1}^{n} d_i^{out} = \sum_{i=1}^{n} d_i^{in} = M$ (where $d_i^{out}$/$d_i^{in}$ is the out-degree/in-degree of node $i$, $n$ is the total number of nodes, and $M$ is the total number of edges).
For any non-self-loop node pair $(i, j)$, the edge connection probability is calculated as $p_{ij} = \frac{d_i^{out} \cdot d_j^{in}}{M}$. This formula reflects the basic social network feature that higher-degree nodes have a higher connection probability: the product of $d_i^{out}$ and $d_j^{in}$ captures the basic connection potential of the node pair, and dividing by $M$ normalizes the probability to the $[0, 1]$ interval. The model generates the synthetic social graph by calibrating the degree sequence to meet the constraint, calculating the connection probability of all node pairs, and sampling source/target nodes by remaining-degree weighting and normalized probability, respectively, while retaining the human–bot labels of real nodes (refer to Appendix B for the relevant pseudocode) [12].
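A minimal sketch of this generation loop follows. It is an illustrative reading of the Chung-Lu procedure described above, not the pseudocode from Appendix B; the function name and the tiny degree sequences are our own.

```python
import random

def chung_lu_directed(out_deg, in_deg, seed=0):
    """Toy Chung-Lu-style directed graph sampler.
    out_deg/in_deg: per-node degree sequences whose totals both equal M."""
    rng = random.Random(seed)
    n = len(out_deg)
    M = sum(out_deg)
    # hard constraint: total out-degree == total in-degree == M
    assert M == sum(in_deg)
    edges = set()
    rem_out = list(out_deg)          # remaining out-degree per node
    nodes = list(range(n))
    for _ in range(M):
        # source node sampled by remaining-degree weighting
        src = rng.choices(nodes, weights=[w + 1e-9 for w in rem_out])[0]
        # target sampled proportionally to d_j^in (implements p_ij ∝ d_i^out * d_j^in),
        # with self-loops excluded by zeroing the source's weight
        dst = rng.choices(nodes, weights=[in_deg[j] if j != src else 0 for j in nodes])[0]
        edges.add((src, dst))
        rem_out[src] = max(rem_out[src] - 1, 0)
    return edges

g = chung_lu_directed([2, 1, 1], [1, 2, 1])
```

Because `edges` is a set, duplicate draws collapse, so the realized edge count can fall slightly below $M$; the full method additionally retains node labels and normalizes probabilities exactly.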

2.1.2. Core Defects and Impacts on Bot Detection

Experimental results indicate that the original model has two core defects in edge-deficient scenarios, which lead to notable topological distortions of the synthetic social graph and further reduce downstream bot detection performance:
  • The randomization of edge connections leads to distortions in high-order topology and associations. Without a real network as reference, edge connections lack constraints from association logic. Quantities such as the human–bot interaction ratio, community structure, and node attribute associations can only be extracted from real networks, which the node degree-driven strategy does not consult. This inevitably results in topological and associative distortions, making this flaw inherent to the strategy [13].
  • There is a mismatch between a node’s labeled degree and its actual degree in the network, stemming from the lack of complete network topology. While degree information such as the in-degrees and out-degrees of social network nodes is easy to collect, the target network fails to cover all associated nodes due to limitations in data collection scope and node selection rules. This leads to significant differences between the global association degree labeled in node attributes and the local actual degree within the network. This has been verified on the TwiBot-22 dataset [14]: the summed global follower and following counts of the 1 million nodes in the dataset reach 43.7 billion, while the number of actual effective association edges in the target network constructed from them is only 3.74 million, an enormous scale gap.

2.1.3. Targeted Improvement Strategies

In the absence of full and complete data, should we construct the network based on the global real degrees of nodes or restore the topology using available local node degrees? Both approaches have their disadvantages. Aiming to address the above two defects, this study proposes targeted improvement strategies based on Group A (70% of real data, reference set), with all optimizations retaining the model’s core advantage of degree distribution preservation.
  • Proportional scaling and fine-tuning of the real degree sequence
Design motivation: Direct use of global/local degree alone either aggravates distortion or loses the real degree distribution difference (humans are mostly high-degree nodes, bots are mostly low-degree). This method balances the two by Gini coefficient matching + local statistical feature alignment.
Implementation: Extract the out-degree/in-degree distribution, Gini coefficient, total edges and average degree of human/bot nodes in Group A; proportionally scale the global degree sequence to align its total edges and average degree with Group A, while keeping the Gini coefficient consistent to retain the relative difference in the original degree distribution; and, finally, calibrate the scaled sequence to meet the model’s hard constraint [15]. This method reduces pseudo-isolated nodes by making the degree sequence consistent with the actual data collection range and real distribution characteristics.
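As an illustrative sketch (not the authors' implementation), proportional scaling with remainder calibration can be written as follows. The Gini-coefficient and local-statistic alignment steps are simplified here to preserving the relative inequality of the sequence; the function name and toy numbers are our own.

```python
def scale_degree_sequence(global_deg, target_total):
    """Proportionally scale a global degree sequence so its total matches the
    edges observable in the reference set (Group A), then calibrate rounding
    leftovers so the hard constraint sum(deg) == target_total holds."""
    total = sum(global_deg)
    scaled = [int(d * target_total / total) for d in global_deg]
    # distribute the rounding remainder to the highest-degree nodes first,
    # preserving the relative (Gini-like) inequality of the original sequence
    remainder = target_total - sum(scaled)
    order = sorted(range(len(scaled)), key=lambda i: -global_deg[i])
    for i in range(remainder):
        scaled[order[i % len(order)]] += 1
    return scaled

# global degrees inflated by out-of-scope follows, scaled down to 50 local edges
out_deg = scale_degree_sequence([1000, 100, 10, 1], 50)
print(out_deg)      # [46, 4, 0, 0]
print(sum(out_deg)) # 50
```

The same routine would be applied separately to the human and bot out-/in-degree sequences so that the relative human–bot degree difference survives the scaling.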
  • Integration of human–bot interaction preference weights into connection probability
Design motivation: Introduce the real human–bot interaction preference as a soft constraint to reduce the randomness of edge connection and restore the differential interaction characteristics between human and bot nodes.
Implementation: First count the edge numbers of the four interaction types in Group A ($count_{BB}$, $count_{BH}$, $count_{HB}$, $count_{HH}$), and normalize to obtain the interaction preference weight:
$$\omega_{XY} = \frac{count_{XY}}{\sum_{X,Y \in \{B,H\}} count_{XY}}, \quad X, Y \in \{B, H\} \tag{1}$$
$\omega_{XY}$ reflects the proportion of connections that $X$-type nodes initiate to $Y$-type nodes and satisfies $\omega_{BB} + \omega_{BH} + \omega_{HB} + \omega_{HH} = 1$. On the premise of meeting the model’s hard constraint, $\omega_{XY}$ is integrated into the original formula to obtain the optimized connection probability [16]:
$$p_{ij} = \frac{d_i^{out} \cdot d_j^{in}}{M} \times \omega_{t(i)t(j)} \tag{2}$$
where $t(i)$/$t(j)$ is the type (human/bot) of node $i$/$j$. In edge generation, the target node sampling probability is subject to the dual constraint of degree product and interaction preference (the source node is still sampled by remaining-degree weighting), ensuring the synthetic social graph aligns with real-world interaction preference.
Physical meaning: Formula (2) combines the degree distribution feature (first factor) and the real human–bot interaction preference (second factor), so edge connection is no longer random.
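Formulas (1) and (2) can be sketched directly; the toy edge list and labels below are placeholders, not Group A data.

```python
def interaction_weights(edges, labels):
    """Normalized human-bot interaction preference weights (Formula (1)).
    labels[v] in {'B', 'H'}; the returned weights sum to 1."""
    counts = {(x, y): 0 for x in 'BH' for y in 'BH'}
    for src, dst in edges:
        counts[(labels[src], labels[dst])] += 1
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

def connection_prob(d_out_i, d_in_j, M, w, t_i, t_j):
    """Optimized connection probability (Formula (2)): degree product
    scaled by the interaction preference of the node-type pair."""
    return d_out_i * d_in_j / M * w[(t_i, t_j)]

labels = {0: 'H', 1: 'B', 2: 'H'}
edges = [(0, 2), (2, 0), (1, 0), (0, 1)]   # 2x H->H, 1x B->H, 1x H->B
w = interaction_weights(edges, labels)
print(w[('H', 'H')])                        # 0.5
p = connection_prob(3, 2, len(edges), w, 'H', 'B')
print(p)                                    # 0.375
```

Note that with $\omega_{BB} = 0$ in this toy example, bot-to-bot edges would never be generated; in practice Group A contains all four interaction types, so every weight is positive.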

2.2. Research and Improvement of Real Incomplete Network-Driven Strategy

To alleviate labeled data scarcity for bot detection, the subgraph sampling + diffusion model is a mainstream data-driven synthetic social graph generation method. It uses Rejection-Controlled Metropolis–Hastings (RCMH) sampling [17] to extract representative nodes from incomplete real networks and then generates synthetic social graphs via a diffusion model [4,18]. However, the original method is designed for general social network topology restoration, which is ill-suited to the core goal of bot cluster feature capture. This section first clarifies the original logic and task inadaptability and then proposes targeted improvements with clear design motivations.

2.2.1. Original RCMH Sampling + Diffusion Model: Core Logic

The original RCMH sampling performs node extraction via a degree-biased random walk, aiming to retain the global topology of social networks. For any node $v$ in the real network, its total degree is defined as the sum of the out-degree and in-degree (Formula (3)):
$$deg(v) = outdeg(v) + indeg(v) = \sum_{u \in V} A^{(0)}_{v,u} + \sum_{u \in V} A^{(0)}_{u,v} \tag{3}$$
Equation (3) is the basic degree definition of directed social networks, where $A^{(0)}$ is the 0–1 adjacency matrix (1 means a follow relationship exists).
The random walk transition acceptance probability (Formula (4)) controls the walk’s tendency toward high-degree nodes:
$$\alpha(u \to v) = \min\left(1, \left(\frac{deg(v)}{deg(u)}\right)^{\alpha}\right) \tag{4}$$
The larger $\alpha \in [0, 1]$ is, the more the walk prefers high-degree nodes, which facilitates rapid capture of the network’s core skeleton. To avoid the walk stagnating without new nodes, a restart mechanism is set (Formula (5)):
$$u_{new} = \begin{cases} v & \text{if } \alpha(u \to v) \ge \epsilon,\ \epsilon \sim Uniform(0, 1) \\ Random(V \setminus S) & \text{if idle steps} \ge max\_idle \end{cases} \tag{5}$$
When the walk is stuck for too long, an unsampled node is randomly selected to restart, ensuring full coverage of the network.
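The walk described by Formulas (3)–(5) can be sketched as below. This is a simplified reading under stated assumptions: the graph is treated as an undirected adjacency view (so `len(adj[v])` stands in for $deg(v)$), and a restarted node is immediately added to the sample; function names and the toy graph are our own.

```python
import random

def rcmh_sample(adj, n_samples, alpha=0.95, max_idle=10, seed=0):
    """Degree-biased Metropolis-Hastings node sampling with restart.
    adj: dict node -> list of neighbors (undirected view of follow edges)."""
    rng = random.Random(seed)
    nodes = list(adj)
    u = rng.choice(nodes)
    sampled, idle = {u}, 0
    while len(sampled) < n_samples:
        v = rng.choice(adj[u]) if adj[u] else rng.choice(nodes)
        deg_u, deg_v = max(len(adj[u]), 1), max(len(adj[v]), 1)
        accept = min(1.0, (deg_v / deg_u) ** alpha)   # Formula (4)
        if rng.random() <= accept:
            u, idle = v, (0 if v not in sampled else idle + 1)
            sampled.add(v)
        else:
            idle += 1
        if idle >= max_idle:                          # restart, Formula (5)
            rest = [w for w in nodes if w not in sampled]
            if not rest:
                break
            u = rng.choice(rest)
            sampled.add(u)                            # simplification: count restart node as sampled
            idle = 0
    return sampled

adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2], 4: []}
s = rcmh_sample(adj, 3)
```

With $\alpha$ close to 1, transitions toward higher-degree neighbors are almost always accepted, which is exactly the bias the improved sampler in Section 2.2.3 corrects.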
After sampling, the original diffusion model generates synthetic graphs based on the entire graph density, without considering the human–bot interaction characteristics required for bot detection (refer to Appendix B for the relevant pseudocode).

2.2.2. Core Defects and Impacts on Bot Detection

The original method has two defects for bot detection, directly verified by the TwiBot-22 dataset (Table 1):
Sampling bias misses bot nodes: Social network degrees follow a power-law distribution; 97.69% of high-degree nodes ($deg > 10^5$) are humans, while 96.09% of bots concentrate in medium- and low-degree nodes. The original degree-biased RCMH oversamples high-degree humans and ignores low-degree bots, leading to the loss of bot cluster features.
Diffusion distorts topology and interaction ratios: The original diffusion uses the entire graph density to calculate synthetic edges, causing inconsistency between the synthetic graph’s sparsity and the real subgraph. It also does not retain the real human–bot interaction ratio, diluting bot-specific interaction preference (e.g., high-density bot–bot connections).

2.2.3. Improved RCMH Sampling: Node Importance Weighting and Human–Bot Balance

Improvement motivation: Eliminate sampling bias, increase the sampling probability of medium and low-degree bots, and keep the human–bot ratio of the sampled set balanced.
  • Step 1: Degree interval division
Based on the power-law degree distribution of social networks and the human–bot degree statistics in Table 1, nodes are divided into 5 intervals: (0, 10), (10, 100), (100, 1000), (1000, 10,000), (10,000, +∞). Bots are mainly concentrated in 0–100, while humans dominate intervals above 10,000. This division enables precise node-type differentiation.
  • Step 2: Node importance weight assignment
A differentiated weighting rule is designed based on node type (bot/human) and degree interval (grounded in Table 1):
Bot nodes: Apply positive enhancement weighting to low- and medium-degree intervals (0–100) to raise the sampling probability; apply negative weakening weighting to high-degree intervals (>10,000) to reduce redundant sampling of human-dominated nodes [19].
Human nodes: Adopt the reverse weighting strategy—enhance high-degree intervals and weaken low/medium-degree intervals to balance the human–bot ratio. All weights ω v are uniformly calculated and normalized by the code in Appendix H, serving as the sole weight source in the model.
  • Step 3: Improved transition acceptance probability
The node importance weight is integrated into the original transition probability to retain topological rationality while correcting sampling bias (Formula (6)):
$$\alpha(u \to v) = \min\left(1, \left(\frac{deg(v)}{deg(u)}\right)^{\alpha} \times \omega(v)\right), \quad \alpha = 0.95 \tag{6}$$
The first factor preserves the degree-biased topological logic of the original RCMH; the second factor $\omega(v)$ (the automatically computed weight) boosts the sampling chances of medium- and low-degree bots, alleviating the under-sampling of bot nodes in traditional sampling. Weighted sampling is conducted after normalization to better retain the key bot cluster features in the sampled subgraph.
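A sketch of Formula (6) follows. The concrete weight values are illustrative placeholders for the Appendix H weighting rule (which we do not reproduce), chosen only to show the intended direction: boost low/medium-degree bots and high-degree humans, weaken the opposite cases.

```python
def importance_weight(label, deg):
    """Illustrative per-node weight following the rule of thumb in Step 2:
    boost low/medium-degree bots and high-degree humans.
    The numeric values are placeholders, not the Appendix H weights."""
    if label == 'B':
        return 1.5 if deg <= 100 else (0.5 if deg > 10_000 else 1.0)
    return 1.5 if deg > 10_000 else (0.5 if deg <= 100 else 1.0)

def accept_prob(deg_u, deg_v, w_v, alpha=0.95):
    """Improved transition acceptance probability (Formula (6))."""
    return min(1.0, (deg_v / deg_u) ** alpha * w_v)

# a low-degree bot is now accepted more readily than a same-degree human
p_bot = accept_prob(500, 50, importance_weight('B', 50))
p_hum = accept_prob(500, 50, importance_weight('H', 50))
print(p_bot > p_hum)  # True
```

The `min(1, ...)` clamp keeps the result a valid acceptance probability even after weighting.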

2.2.4. Improved Diffusion Model: Subgraph Density Constraint and Human–Bot Interaction Preservation

Improvement motivation: Solve topological sparsity distortion and bot feature dilution caused by the original diffusion model.
  • Step 1: Subgraph density constraint (Formula (7))
The original diffusion uses the entire graph density, leading to structural distortion. We use the density of the optimized RCMH sampled subgraph to calculate the target number of synthetic edges:
$$M_{synth} = sample\_density \times N_{synth}^2 \tag{7}$$
This ensures the synthetic graph has the same sparsity as the real sampled subgraph, avoiding over-dense or over-sparse topological distortion.
  • Step 2: Human–bot interaction weight preservation (Formula (8))
To retain real bot interaction preference, we calculate the weight of four interaction types (bot → bot, bot → human, human → bot, human → human) from real data:
$$\omega_{XY} = \frac{count_{XY}}{\sum_{X,Y \in \{B,H\}} count_{XY}} \tag{8}$$
Synthetic edges are then allocated according to these weights to avoid diluting bot cluster features (e.g., high-probability bot→bot connections).
Finally, the GraphMaker diffusion model is used: SVD dimensionality reduction (256 dimensions) reconstructs node association features [20], reverse diffusion generates a node similarity matrix, and the top $M_{synth}$ non-self-loop edges are selected to generate a synthetic graph that meets bot detection requirements.
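Formulas (7) and (8) together fix the edge budget and its split across interaction types. A minimal sketch (function names and numbers are illustrative, not the paper's implementation):

```python
def synth_edge_budget(n_sub, m_sub, n_synth):
    """Target number of synthetic edges from subgraph density (Formula (7)):
    M_synth = sample_density * N_synth^2."""
    sample_density = m_sub / (n_sub * n_sub)
    return int(sample_density * n_synth ** 2)

def allocate_edges(m_synth, weights):
    """Split the edge budget across the four interaction types
    using the preference weights of Formula (8)."""
    return {pair: int(m_synth * w) for pair, w in weights.items()}

m = synth_edge_budget(n_sub=1000, m_sub=5000, n_synth=2000)
print(m)  # 20000
alloc = allocate_edges(m, {('B', 'B'): 0.1, ('B', 'H'): 0.2,
                           ('H', 'B'): 0.2, ('H', 'H'): 0.5})
print(alloc[('H', 'H')])  # 10000
```

The per-type quotas would then cap how many of the top-similarity edges from the reverse diffusion step are kept for each interaction type.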

2.3. Research on Edge Completion Strategies for Social Networks with Edge Deficiency

To address the prevalent edge deficiency in real social networks, this section proposes a data-driven and node label-independent edge completion strategy. Supported by the sociological “common identity and common bond theory” as its core theoretical foundation [21,22], this strategy aims to use publicly observable user behavior data in Group A to capture two types of potential connection patterns [23], “interest identification” and “social association”. On the premise of protecting user privacy, it constructs a more complete and structurally authentic social network topology, providing high-quality data support for downstream bot detection tasks.
The formation of connections in social networks is essentially driven by two core mechanisms, which are highly consistent with the core connotation of the common identity and common bond theory: first, the common identity mechanism—users with similar interests form identity recognition due to shared topic preferences, leading to a significantly higher connection probability; second, the common bond mechanism—frequently interacting users form emotional or behavioral bonds through direct social associations, making it easier to generate explicit connections. Therefore, inferring potential follow relationships based on users’ public behavior trajectories (e.g., posted content and mention relationships) is theoretically logical and practically grounded.
This strategy adheres to a key principle: using the behavioral statistical laws of public nodes to infer the potential connection patterns of hidden nodes. All completion rules are learned from the visible public follow relationship network in Group A and applied equally to all nodes (whether human or bot), thereby minimizing the introduction of human bias or label information leakage. Combining the above theories and principles, the completion strategy is materialized into the following three sequential and complementary logical modules—with the first two corresponding to the two core connection mechanisms, respectively, and the third serving as supplementary enhancement [5].

2.3.1. Interest Identification: Edge Completion Based on Topic Participation

The topic categories that users participate in are the core representation of their interests. All topics in the dataset are classified into 17 macro categories based on IPTC Media Topics [24] by invoking the Gemma3-12b model. First, the average in-degree and out-degree of public nodes under each topic category are calculated to establish a mapping model of “topic category and connection strength”. For a hidden node, its potential interest community can be inferred by analyzing the topic categories of its historical posts. Subsequently, based on the prior connection strength of its main topic category and combined with the number of follows and followers indicated in its profile for weighted allocation, directed edges pointing to other nodes in the same topic community are generated for it. This method aims to restore social bonds based on shared interests (refer to Appendix D for detailed methods).
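The allocation step above can be sketched as follows. This is a heavily simplified illustration, not the Appendix D method: the mapping names, the topic community structure, and the budget rule (capping by profile follow count and the topic's prior average out-degree) are our own reading.

```python
def complete_by_interest(hidden_node, topic_of, avg_out_by_topic,
                         community, profile_follows):
    """Sketch of interest-based completion: give a hidden node directed
    edges toward members of its main topic community, capped by its
    profile follow count and the topic's prior connection strength.
    All argument names are illustrative."""
    topic = topic_of[hidden_node]
    budget = min(profile_follows[hidden_node], int(avg_out_by_topic[topic]))
    targets = [v for v in community[topic] if v != hidden_node][:budget]
    return [(hidden_node, v) for v in targets]

topic_of = {'u0': 'sports', 'u1': 'sports', 'u2': 'sports', 'u3': 'news'}
community = {'sports': ['u0', 'u1', 'u2'], 'news': ['u3']}
edges = complete_by_interest('u0', topic_of, {'sports': 5.0},
                             community, {'u0': 1})
print(edges)  # [('u0', 'u1')]
```

In the full method, the target choice within the community would itself be weighted rather than taken in list order.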

2.3.2. Social Association: Edge Completion Based on Mention Relationships

Mutual mentions between users are core evidence of strong social interaction. The relevant edge completion rules and extension strategies are as follows:
If user A and user B have had direct mentions in any post (including A actively @B or B actively @A), both parties are confirmed to have explicit interaction intentions. Regardless of how many times such mention behaviors occur, a directed edge is added between these two nodes with high confidence. This module can directly capture strong social interaction signals and restore explicit social connections that were broken due to data deficiency.
For users with no direct mentions, the “similarity of common mention preferences among user groups” (i.e., the overlapping characteristics of the groups frequently mentioned by different users) is combined, and the verified association data distribution rules in real social networks are referenced to add high-probability relationship edges for such users with potential associations (refer to Appendix D for detailed methods).
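The direct-mention rule of this module can be sketched as below. One point is an assumption on our part: since the text states that both parties are confirmed to have interaction intent, the sketch adds the high-confidence edge in both directions; the function name and toy mention log are our own.

```python
def complete_by_mentions(mentions):
    """Sketch of the direct-mention rule in 2.3.2: any pair with at least
    one mention in either direction gets a high-confidence edge, regardless
    of how many times the mention occurred.
    mentions: iterable of (author, mentioned_user) pairs.
    Assumption: edges are added in both directions for the pair."""
    edges = set()
    for a, b in mentions:
        edges.add((a, b))
        edges.add((b, a))  # both parties showed explicit interaction intent
    return edges

e = complete_by_mentions([('u1', 'u2'), ('u1', 'u2'), ('u3', 'u1')])
print(len(e))  # 4
```

The common-mention-preference extension for users without direct mentions would then compare the overlap of each user's mentioned groups rather than the raw mention log.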

2.3.3. Edge Completion Based on Link Prediction Technology

Link prediction technology can infer potential high-probability connections between nodes from partially known network topology. Most existing link prediction methods (such as those based on common neighbors, the Jaccard coefficient, the Adamic–Adar index, node embeddings, etc.) mainly rely on local topological similarity for statistical inference [25,26]. Whether such methods can reveal the deep-seated differences between humans and bots in network structure remains to be verified. To this end, this study systematically tests a variety of link prediction methods and their combination strategies, constructs completed networks based on them, and comprehensively evaluates the impact of the different completed networks on model training and testing performance through bot detection tasks (among the following four categories, each method adds the 1 million non-existing relational edges with the highest predicted probability). The specific methods fall into the following four categories:
  • Link prediction method combining traditional topological features + network embedding + tree-model supervised learning [27,28] (basic supervised paradigm);
  • Link prediction method combining enhanced topological features + community-aware embedding [29] + ensemble tree model (enhanced supervised paradigm);
  • End-to-end graph neural network + hard negative sampling link prediction method [30,31,32] (representation learning paradigm);
  • Link prediction method combining multi-embedding fusion + multi-feature integration + meta-learning [33] (hybrid integration paradigm).
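As a concrete instance of the topological-similarity scoring that underlies the first two paradigms, the Adamic–Adar index with top-k candidate selection can be sketched as follows (the paper keeps the top 1 million edges per method; k is tiny here for illustration, and the toy graph is our own):

```python
import math

def adamic_adar(adj, u, v):
    """Adamic-Adar score: sum of 1/log(deg(z)) over common neighbors z.
    adj: dict node -> set of neighbors (undirected view)."""
    common = adj[u] & adj[v]
    return sum(1.0 / math.log(len(adj[z])) for z in common if len(adj[z]) > 1)

def top_k_candidates(adj, k):
    """Score every non-existing pair and keep the k highest-scoring ones."""
    nodes = sorted(adj)
    scored = []
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if v not in adj[u]:
                scored.append(((u, v), adamic_adar(adj, u, v)))
    scored.sort(key=lambda t: -t[1])
    return [pair for pair, _ in scored[:k]]

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cands = top_k_candidates(adj, 2)
```

The supervised and GNN paradigms replace this hand-crafted score with learned features or embeddings, but the candidate-ranking-and-truncation step stays the same.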

3. Results

To systematically evaluate the effectiveness of different social network construction strategies for social bot detection tasks in edge-deficient scenarios, this section designs comparative experiments with raw real social network data as the unified benchmark, aiming to comprehensively compare the performance of various methods in topological restoration and detection performance. The experiments are mainly divided into two groups of controlled tests.
First, we compared six basic and improved network construction methods under the baseline condition without introducing the edge completion strategy.
Second, we integrated the edge completion strategy proposed in this paper (as detailed in Section 2.3) with each of these network construction methods to form another five groups of experimental conditions. Therefore, this section conducts performance comparison and in-depth analysis of social networks under 11 different construction conditions in total.
In addition, to clarify the contribution of each improved module, this section also sets up corresponding ablation experiments. All experiments adopt a unified feature set (9-dimensional-node features) and classification model (random forest) and use accuracy, precision, recall, and F1-score as the core evaluation metrics on the same real test set (Group B data). The specific experimental group design is shown in Table 2 and Table 3 (refer to Appendix A, Appendix E for specific details of the experimental design; see Appendix F for feature engineering, Appendix G for results of feature importance analysis and ablation experiments, and Appendix I for data preprocessing and partitioning methods).
Comparative experiments were conducted on the TwiBot-22 and TwiBot-20 datasets, which preliminarily verified the effectiveness and stability of the improved network construction methods proposed in this study. With the raw real social network data as the unified benchmark, both the Improved Degree-Based Method and the Improved Subgraph Sampling + Diffusion Model exhibited stable detection performance on the two datasets. In contrast to the traditional approaches, the improved methods show no extreme metric collapses, and their overall results remain consistent and close to those obtained from real data. On TwiBot-22, the Improved Subgraph Sampling + Diffusion Model achieved an accuracy of 0.6190, close to the detection performance of real data; on TwiBot-20, it obtained an F1-score of 0.6365, slightly outperforming the traditional Subgraph Sampling + Diffusion Model, and it avoided the severe performance fluctuations seen in the traditional degree-based methods.
The two traditional degree-based methods showed extreme detection metrics with distinct failure causes. On TwiBot-22, the recall of the Attribute-Based Degree Method reached 1.0000 because this method relied solely on degree information and could not distinguish humans from bots in the network at all; in addition, the attribute degree of each node in the dataset was higher than the average level of the test data, which further deprived the model of discriminative ability and is the key reason for this method’s failure in the detection task. The recall of the Network-Based Degree Method was only 0.0024, mainly due to the extremely high proportion of isolated nodes in the network constructed from the dataset. Even with random 1:1 human–bot sampling, many isolated nodes were likely to be sampled; since such nodes cannot be used for network construction, the actual human–bot sample balance was indirectly broken. When differences in network features were small, the model tended to predict the majority class, ultimately collapsing the detection metrics. To address this experimental bias, this study additionally ran a Network-Based Degree (non-isolated) experiment, which first sampled a sufficient number of non-isolated bot nodes and then an equal number of non-isolated human nodes. Consequently, the recall of this method on TwiBot-22 rebounded to 0.8353 with an F1-score of 0.6297, which verifies the necessity of sampling-rule optimization for eliminating such biases.
Objectively, the poor performance of traditional degree-based methods on the social bot detection task does not mean they are ineffective for synthetic social graph generation in general. These methods retain basic topological features such as the node degree distribution and remain applicable to generic synthetic social graph generation scenarios. However, they perform random edge connection based solely on the product of node degrees, without integrating the core logic of social networks (e.g., user interest, social association, and human–bot interaction preference), which makes them poorly suited to bot detection requirements. By integrating human–bot interaction preference, node importance weights and hard degree constraints, the improved methods in this study are specifically designed for social bot detection in edge-deficient scenarios, and thus exhibit more stable detection performance on datasets with different characteristics, providing a reference for similar studies.
Taking the detection performance of raw real social network data as the benchmark, this table compares the core performance metrics of social bot detection for different edge completion strategies and improved network construction methods after edge completion, which verifies the significant improvement effect of edge completion strategies on downstream detection tasks.
Among the single edge completion strategies, both interest-based and social association-based edge completion achieve a positive performance improvement, with interest-based edge completion performing better, reaching an F1-score of 0.6888 versus 0.6463 for the raw data. The interest + social association fusion edge completion strategy achieves a further breakthrough, with its F1-score rising to 0.7051; all of its metrics surpass those of the raw data and the single edge completion strategies, repairing the topological defects of edge-deficient networks. Meanwhile, edge completion strategies and improved network construction methods show a good synergistic optimization effect: the detection performance of both the Improved Degree-Based Method and the Improved Subgraph Sampling + Diffusion Model improves to varying degrees after combination with edge completion. In particular, the F1-score of the Improved Subgraph Sampling + Diffusion Model rises to 0.7071 after edge completion, making it the overall optimal solution, and the Improved Degree-Based Method also achieves a comprehensive increase in all core metrics. This indicates that edge completion strategies not only directly improve the bot detection performance of real edge-deficient networks but also cooperate with improved network construction methods to further optimize detection performance, verifying the effectiveness of the proposed strategies for the downstream task of social bot detection in edge-deficient scenarios.
To further verify the differences between the proposed method and traditional link prediction methods, we separately applied four groups of link prediction methods for edge completion under the same experimental configuration and obtained the experimental results in Table 4.
Taking the detection performance of raw real social network data as the benchmark, this table presents the core performance of four types of traditional link prediction methods in social bot detection after edge completion. The results show that none of the methods achieves an effective improvement in detection metrics, some metrics even decline, and the overall performance remains basically consistent with that of the raw data. This suggests that networks completed by traditional link prediction methods are insufficiently adapted to the specific downstream task of social bot detection.
To verify the reliability of the edge completion of the method proposed in this paper, we designed relevant validation experiments, with the detailed contents provided in Appendix C.

4. Discussion

This work provides incremental improvements for social network construction in edge-deficient bot detection, helping mitigate data scarcity and topological distortion. Our task-adaptive optimizations and label-free edge completion offer a practical reference for related research under constrained data conditions.
It should be emphasized that traditional synthetic social graph generation methods have made important contributions to general social network simulation by effectively retaining basic topological features such as the node degree distribution and global network structure, and they remain highly valuable for generic network analysis tasks. Their limitations lie only in the mismatch with the specific requirements of social bot detection in edge-deficient scenarios, rather than in inherent defects of the methods themselves; yet this mismatch leads to drastic performance fluctuations in bot detection tasks.
Ablation experiments indicate that optimizing only with human–bot interaction preference weights fails to deliver ideal performance, with the single-component model showing a sharp drop in core metrics. In an ideal scenario without data collection limitations and full access to users’ social relationships, the attribute-degree network and network-degree network would be fully consistent, and the interaction preference weight alone would work effectively. However, in real edge-deficient scenarios, such a single constraint can only adjust macro connection ratios between human and bot nodes and cannot resolve the fundamental discrepancy between attribute degree (global public follow/follower counts) and network degree (actual collected valid edges) caused by limited collection scope and user privacy settings. It also fails to mitigate sampling bias toward high-degree human nodes and topological fragmentation from pseudo-isolated nodes. Since social bot detection relies on complete local topology and fine-grained behavioral association features, single-module optimization is far from repairing the inherent structural defects of incomplete real social networks.
In addition, link prediction-based edge completion fails to bring effective performance gains and even degrades some metrics. On the TwiBot-22 dataset, basic supervised link prediction obtains an F1-score of only 0.6474 (close to the original data's 0.6463), and the hybrid integration paradigm drops to 0.6321. Such methods rely solely on topological statistical inference (e.g., common neighbors, node embeddings) to generate potentially high-probability edges, which do not correspond to real, behavior-driven social connections and fail to restore human–bot interaction patterns. This reflects only the task misalignment of link prediction methods, not a lack of intrinsic value.
The high-recall and low-precision imbalance in some baseline models is also reflected in quantitative results, as the topological discrepancy between synthetic graphs and real networks makes the model misclassify human nodes as bots, which further verifies the necessity of edge completion and task-adaptive synthetic graph optimization.

5. Conclusions

This study focuses on multi-strategy improvement and comparative analysis of data-driven social network construction for social bot detection in edge-deficient scenarios. We optimize two classic synthetic social graph generation methods and propose a node label-independent edge completion strategy combining interest identification and social association. Experimental results on the TwiBot-20 and TwiBot-22 datasets validate that the "completion first, reference later" paradigm and multi-module collaborative optimization effectively alleviate the topological distortion caused by edge deficiency. Under the specific experimental settings for social bot detection, the improved method yields performance metrics that closely track those of real-world data and remain stable, achieving the best F1-score of 0.7071, which demonstrates the effectiveness of the proposed approach.
Despite the stable performance improvement, this study has several limitations. First, all experiments are conducted on Twitter-derived datasets, and the generalization ability across multilingual, cross-platform and different topological social networks remains to be verified. Second, the proposed edge completion strategy is static and cannot adapt to the dynamic evolution of real social networks. Third, the balance between data quality and user privacy protection in edge completion and synthetic graph generation needs further exploration. Fourth, the interpretability of topological and behavioral features for bot detection requires enhancement.
Future research will be carried out in four directions: (1) verify the generalization of the proposed methods on more diverse social network datasets; (2) explore dynamic edge completion mechanisms adaptive to network evolution; (3) develop privacy-preserving technologies for edge completion and synthetic graph generation; (4) improve model interpretability to clarify the contribution mechanism of network features to bot detection.

Author Contributions

Conceptualization, J.W. and M.T.; methodology, J.W.; software, J.W.; validation, J.W. and M.T.; formal analysis, J.W.; investigation, J.W.; resources, J.W.; data curation, J.W.; writing—original draft, J.W.; writing—review and editing, M.T.; visualization, J.W.; supervision, M.T.; project administration, M.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Kunlun Talent Project (No project number).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analyzed and generated during this study are derived from the publicly available TwiBot-22 benchmark dataset (https://arxiv.org/abs/2206.04564, accessed on 24 December 2025). The original raw data supporting the findings of this study is available from the corresponding author of the TwiBot-22 dataset, subject to their terms and conditions. The custom code supporting the construction, synthetic social graph generation, and completion strategies presented in this paper can be requested from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RCMH: Rejection-Controlled Metropolis–Hastings
SVD: Singular Value Decomposition
GNN: Graph Neural Network
IPTC: International Press Telecommunications Council

Appendix A. Impact of Isolated Node Proportion on Bot Detection and Reliability Test Method for Synthetic Social Graphs

To systematically explore the impact of edge deficiency on social bot detection performance, the experimental scheme is designed as follows: taking a network with an isolated node proportion of 30.62% as the benchmark (without implementing the edge completion strategy), 15 groups of experimental networks with isolated node proportions ranging from 30% to 95% are constructed by random edge deletion, with an edge deletion gradient of 5%.
Nine types of network structure features are extracted for model training, and the feature selection focuses on network topological attributes to accurately evaluate the reliability of synthetic social graphs. A random forest algorithm (100 decision trees) is used to construct the detection model. The dataset is divided into a training set, test set and validation set by stratified sampling in the ratio of 7:2:1, and the balance between human and bot samples in the training set and test set is achieved through random downsampling. This method is also applicable to the subsequent reliability test of synthetic social graphs, with the core difference being that the latter uses synthetic social graph data as the training set and real data as the test set and validation set.
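The edge-deletion sweep described above can be sketched as follows. This is a minimal, standard-library illustration (the function names and the toy ring graph are ours, not the paper's implementation): a target deletion rate is applied to the edge list, and the resulting isolated-node proportion is measured.

```python
import random

def delete_edges(edges, rate, seed=0):
    # Randomly drop a fraction `rate` of edges (the paper sweeps a 5% gradient).
    rng = random.Random(seed)
    return [e for e in edges if rng.random() >= rate]

def isolated_proportion(nodes, edges):
    # Fraction of nodes left without any incident edge ("pseudo-isolated" nodes).
    touched = set()
    for u, v in edges:
        touched.update((u, v))
    return 1 - len(touched) / len(nodes)

# Toy example: 100 nodes on a directed ring, swept over the first gradient steps.
nodes = list(range(100))
edges = [(i, (i + 1) % 100) for i in range(100)]
isolation_by_rate = {
    rate: isolated_proportion(nodes, delete_edges(edges, rate, seed=42))
    for rate in (0.30, 0.35, 0.40)
}
```

In the paper's setting, the nine topological features would then be extracted from each degraded graph and fed to the 100-tree random forest.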
Algorithm A1: Edge Deficiency Rate vs. Bot Detection
INPUT: Edge/label files; Missing rate [0.24, 0.99] (step = 0.05); RF n_est = 100
OUTPUT: Metrics

1: Func BuildGraph(edges, rate):
2:  Return directed graph G
3: Func ExtractFeat(G, nodes, labels):
4:  Return 9-dim node features, binary labels
5: Func TrainEval(X, y):
6:  Return metrics, feat_imp
7: MAIN:
8: Load edges→real_nodes, labels→label_dict
9: For rate in [0.24,0.29,…,0.99]:
10:  G = BuildGraph(edges, rate);
11:  X,y = ExtractFeat(G, real_nodes, label_dict)
12:  metrics, feat_imp = TrainEval(X, y)
13:  Save results; Return metrics

Appendix B. Pseudocode for Two Types of Classic Network Synthesis Methods

This appendix provides pseudocode for the unimproved baseline implementations of the two classic data-driven network synthesis methods used in the experiments: the attribute-degree-based Chung-Lu model and the real-degree-based synthesis method (with its non-isolated-node sampling variant), all serving as core baselines for performance comparison.
Algorithm A2: Traditional Attribute-Degree Chung-Lu Network
INPUT: Label file L, Degree file D, Bot count B, Human count H
OUTPUT: Synthetic edge set E, Report R

1: label_map ← Load(id → label) from L
2: V, out_deg, in_deg ← Sample B Bots + H Humans, match degrees from D
3: Adjust in_deg to make Σ out_deg = Σ in_deg
4: rem_out, rem_in ← out_deg, in_deg
5: While sum(rem_out) > 0:
6:   u ← Random sample by rem_out weight
7:   v ← Random non-u sample by P = (rem_out[u]·rem_in[v])/total_edges
8:   Add edge (u,v) to E; rem_out[u] -= 1; rem_in[v] -= 1
9: Save E to CSV; Generate report R
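The degree-product edge generation of Algorithm A2 can be sketched as runnable Python using only the standard library. This is a simplified illustration, not the paper's implementation: parallel edges are not deduplicated, and generation stops early if only self-connection capacity remains.

```python
import random

def chung_lu_edges(out_deg, in_deg, seed=0):
    # Degree-product generator: P(u -> v) is proportional to rem_out[u] * rem_in[v].
    # Residual degrees are decremented after each accepted edge.
    rng = random.Random(seed)
    rem_out, rem_in = dict(out_deg), dict(in_deg)
    edges = []
    while sum(rem_out.values()) > 0 and sum(rem_in.values()) > 0:
        nodes = list(rem_out)
        u = rng.choices(nodes, weights=[rem_out[n] for n in nodes])[0]
        cand = [n for n in rem_in if n != u and rem_in[n] > 0]
        if not cand:
            break  # only self-connection capacity remains
        v = rng.choices(cand, weights=[rem_in[n] for n in cand])[0]
        edges.append((u, v))
        rem_out[u] -= 1
        rem_in[v] -= 1
    return edges

out_deg = {"a": 2, "b": 1, "c": 0}   # toy sampled out-degrees
in_deg = {"a": 0, "b": 1, "c": 2}    # matched so that the sums are equal
E = chung_lu_edges(out_deg, in_deg, seed=7)
```

Because nodes are drawn with probability proportional to their residual degrees, the expected edge probability matches the rem_out[u]·rem_in[v] product rule of the pseudocode.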
Algorithm A3: Traditional Real-Degree Network Synthesis
INPUT: Label set L, Real follow graph G_real, Bot number B, Human number H
OUTPUT: Synthetic edge set E, Network report R

1: label_map ← Load(user id → label) from L
2: real_out, real_in ← Calculate in/out-degree from G_real
3: V, out_deg, in_deg ← Sample B Bots + H Humans, match real degrees
4: Adjust in_deg to satisfy Σ out_deg = Σ in_deg
5: rem_out, rem_in ← out_deg, in_deg
6: While sum(rem_out) > 0:
7:   u ← Random sample weighted by rem_out
8:   v ← Random non-u node with P = (rem_out[u]·rem_in[v])/total_edges
9:   Add edge (u,v) to E; rem_out[u] -= 1; rem_in[v] -= 1
10: Save E to CSV; Generate network report R
Algorithm A4: Traditional Real-Degree Network Synthesis (Non-Isolated Nodes)
INPUT: Label set L, Real follow graph G_real
OUTPUT: Synthetic edge set E, Network report R

1: label_map ← Load(user id → label) from L
2: real_out, real_in ← Calculate in/out-degree from G_real
3: V, out_deg, in_deg ← Select non-isolated nodes (out_deg > 0), 1:1 Bots/Humans
4: rem_out, rem_in ← out_deg, in_deg
5: While sum(rem_out) > 0:
6:   u ← Random sample weighted by rem_out
7:   v ← Random non-u node with P = (rem_out[u]·rem_in[v])/total_edges
8:   Add edge (u,v) to E; rem_out[u] -= 1; rem_in[v] -= 1
9: Save E to CSV; Generate network report R

Appendix C. Reliability Validation Experiments for the Edge Completion Strategy

This appendix takes the native real follow network of the TwiBot-22 dataset as the ground truth and designs five groups of controlled validation experiments, with random edge completion as the control group, to verify the scientific soundness and reliability of the "Interest + Social Association" edge completion strategy along five core dimensions: actual existence of completed edges, reliability of interest-driven edge completion, fidelity of community structure, consistency of human–bot interaction, and convergence of the underlying topological structure. Quantitative indicators (authenticity rate, clustering coefficient, etc.) and statistical tests (K-S test) are adopted for objective verification. Only the definitions of core formulas are retained, and key data are shown in Table A1, Table A2, Table A3, Table A4 and Table A5.
Experiment 1: Validation of the Actual Existence of Completed Edges.
Design: Taking the native user follow relationship of the dataset as the real benchmark, the experimental group is the social association-completed edges based on user mention relationships in this paper, the control group is the random completed edges with the same quantity, and the authenticity rate (the overlap ratio of completed edges and real edges) is used as the evaluation metric.
Table A1. Comparison of Results for the Actual Existence of Completed Edges.

| Experimental Method | Number of True Matched Edges | Total Number of Completed Edges | Authenticity Rate (Precision) |
| --- | --- | --- | --- |
| Social Association-Based Edge Completion | 476,439 | 1,466,912 | 32.4790% |
| Random Edge Completion | 25 | 1,466,912 | 0.0017% |
Rational explanation for only 25 true matches in random edge completion: let the number of network nodes be N = 693,761; then the theoretical maximum number of directed edges in the network is N × (N − 1) = 481,303,631,360 (about 481.3 billion), while the number of valid directed edges in the real dataset is 3,743,634 (about 3.74 million). The probability that a single randomly generated edge matches a true edge is therefore about 0.0000078 (Formula (A1)), and the theoretical matching expectation for 1,466,912 random completed edges is accordingly about 11.4. Thus, the actual result of 25 true matches falls within a reasonable range.

$$ P = \frac{E_{\text{real}}}{N(N-1)} \tag{A1} $$

Symbol Explanation: P = probability that a single random directed edge matches a true edge; E_real = total number of valid directed real edges in the network; N = total number of network nodes. Core Purpose: quantitatively calculate the matching probability between randomly generated edges and true edges, to explain the low authenticity of random edge completion.
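The arithmetic behind Formula (A1) can be checked directly with the values quoted in the text (the variable names below are ours):

```python
N = 693_761            # total network nodes
E_real = 3_743_634     # valid directed real edges
n_random = 1_466_912   # randomly completed edges

max_directed_edges = N * (N - 1)          # theoretical directed-edge capacity
p_match = E_real / max_directed_edges     # Formula (A1), ~7.8e-6
expected_matches = n_random * p_match     # ~11.4, vs. 25 actually observed
```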
The implementation of interest-driven edge completion differs significantly from that of social association-based edge completion: the former sets stringent criteria for screening reference nodes, which must be information-public nodes. Among the 1,000,000 nodes in the dataset, only 10,009 have the out-degree feature, and such nodes are further required to have records of topic participation behaviors, leaving only 8575 valid reference nodes after screening; this also makes the number of true benchmark edges extremely scarce. For these 8575 nodes, this study conducted a network reconstruction experiment via the interest-driven method and compared it with reconstruction by random edge completion. The interest-driven edge completion matches 6039 true benchmark edges, while random edge completion matches only 589, indicating that the reliability of interest-driven edge completion is about 10 times that of random edge completion. However, this study holds that the reliability of interest-driven edge completion cannot be fully verified in this experimental scenario, the core reason being the insufficient number of referenceable true benchmark edges: among the 1,000,000 nodes, only 8575 satisfy both information disclosure and topic interaction, while 810,656 nodes show topic interaction behaviors but do not disclose information. Therefore, this study further designs subsequent validation experiments to verify the reliability of interest-driven edge completion indirectly.
Experiment 2: Reliability Validation of Interest-Driven Edge Completion.
Purpose: Indirectly verify the rationality of interest-driven edge completion (direct verification is impossible due to the scarcity of public nodes).
Design: (1) Count the proportion of “connected and sharing fine-grained topics” among public nodes, and compare the average number of common topics between real edges and random edges; (2) add the same number of interest-driven completed edges and random completed edges to the original network, and use the clustering coefficient to measure the ability of network topological feature retention (Formula (A2)).
$$ C = \frac{1}{N}\sum_{i=1}^{N}\frac{T_i}{k_i(k_i-1)} \tag{A2} $$

Symbol Explanation: C = average clustering coefficient of the network; N = total number of network nodes; T_i = number of directed triangular closures of node i; k_i = total degree of node i (in-degree + out-degree). Core Purpose: to measure the local aggregation characteristics of network nodes and verify that the completed network retains the aggregation of the real topology.
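A standard-library sketch of Formula (A2) on a toy directed graph follows. Here T_i is read as the number of ordered neighbor pairs of node i joined by a directed edge, which is one literal reading of "directed triangular closures"; the paper's exact counting convention may differ.

```python
def avg_clustering(nodes, edges):
    # Formula (A2): C = (1/N) * sum_i T_i / (k_i * (k_i - 1)).
    eset = set(edges)
    nbrs = {n: set() for n in nodes}
    deg = {n: 0 for n in nodes}
    for u, v in edges:
        nbrs[u].add(v); nbrs[v].add(u)
        deg[u] += 1; deg[v] += 1
    total = 0.0
    for i in nodes:
        k = deg[i]
        if k < 2:
            continue  # degree < 2 cannot close a triangle
        t = sum(1 for j in nbrs[i] for l in nbrs[i] if j != l and (j, l) in eset)
        total += t / (k * (k - 1))
    return total / len(nodes)

# Directed 3-cycle plus one isolated node: each cycle node closes one triangle.
C = avg_clustering(["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "a")])
```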
Results: We found that 60.30% of public node connections contain common fine-grained topics, with the average number of common topics of real edges being 4.24 (0.78 for random edges); the clustering coefficient is shown in Table A2.
Table A2. Comparison of Topological Characteristic Retention (Clustering Coefficient) of Interest-Driven Edge Completion.

| Network Type | Clustering Coefficient |
| --- | --- |
| Raw Real Follow Network | 0.0618 |
| Network with Interest-Driven Edge Completion | 0.0329 |
| Network with Random Edge Completion | 0.0091 |
Conclusion: There is a positive correlation between user connections and common interests, and interest-driven edge completion can retain the local topological aggregation characteristics of real networks.
The nodes targeted for edge completion in this experiment are the isolated nodes in the dataset (306,239 in total, i.e., 1,000,000 − 693,761), which have no original social connections. The overall clustering coefficient of the network therefore drops to 0.0329 after edge completion; this decline is normal in social network edge completion and indicates that interest-driven edge completion still retains the core topological structure of the real network to a certain extent. In contrast, the clustering coefficient of the network with random edge completion is only 0.0091, and that of interest-driven edge completion is 3.6 times this value. The core reason is that random edge completion performs unstructured random connection, cannot form the triadic closure structure inherent to social networks, and only introduces a large number of invalid topological edges, whereas interest-driven edge completion follows social-behavioral regularities and retains the local topological structure of the network, yielding a significantly better clustering coefficient.
Experiment 3: Validation of Community Structure Fidelity.
Purpose: We aimed to verify the ability of the completion strategy to retain the community aggregation characteristics of real social networks.
Design: Randomly extract three groups of nodes with different scales from the dataset to construct subgraphs, generate the original network, interest-driven completed network and random completed network, respectively, take Modularity Coefficient Q as the evaluation metric (Formula (A3)), and calculate the relative error with the real network.
$$ Q = \frac{1}{2E}\sum_{i,j}\left[A_{ij}-\frac{k_i k_j}{2E}\right]\delta(c_i,c_j) \tag{A3} $$

Symbol Explanation: Q = Modularity Coefficient; E = total number of edges in the network; A_ij = adjacency matrix entry (A_ij = 1 if there is an edge between nodes i and j, otherwise 0); k_i, k_j = total degrees of nodes i and j; c_i, c_j = communities to which nodes i and j belong; δ(c_i, c_j) = indicator function (1 for the same community, 0 otherwise). Core Purpose: to quantify the community division of the network and judge whether the completed network retains the real community aggregation characteristics.
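A minimal standard-library sketch of Formula (A3), treating the edge list as undirected (the usual setting for Newman modularity; the names are illustrative). On the classic "two triangles joined by one bridge" toy graph, the natural two-community split yields Q = 5/14 ≈ 0.357.

```python
def modularity(edges, community):
    # Formula (A3): Q = (1/2E) * sum_{i,j} [A_ij - k_i*k_j/(2E)] * delta(c_i, c_j).
    und = {frozenset(e) for e in edges}      # symmetrize A_ij
    E = len(und)
    deg = {}
    for e in und:
        for n in e:
            deg[n] = deg.get(n, 0) + 1
    q = 0.0
    for i in deg:
        for j in deg:
            if community[i] != community[j]:
                continue
            a = 1.0 if i != j and frozenset((i, j)) in und else 0.0
            q += a - deg[i] * deg[j] / (2 * E)
    return q / (2 * E)

edges = [("a", "b"), ("b", "c"), ("c", "a"),
         ("d", "e"), ("e", "f"), ("f", "d"), ("c", "d")]
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
Q = modularity(edges, community)   # 5/14, about 0.357
```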
Results: See Table A3.
Table A3. Comparison of Modularity Coefficients and Relative Errors for Interest-Driven Edge Completion.

| Network Type | Sampled Node Scale | Modularity Coefficient (Q) | Relative Error vs. Real Network (%) |
| --- | --- | --- | --- |
| Raw Real Network | 50,000 | 0.7472 | - |
| Raw Real Network | 100,000 | 0.6859 | - |
| Raw Real Network | 200,000 | 0.6241 | - |
| Interest-Driven Edge Completion Network | 50,000 | 0.6311 | 15.54 |
| Interest-Driven Edge Completion Network | 100,000 | 0.5376 | 21.62 |
| Interest-Driven Edge Completion Network | 200,000 | 0.4605 | 26.21 |
| Random Edge Completion Network | 50,000 | 0.1104 | 85.22 |
| Random Edge Completion Network | 100,000 | 0.1575 | 77.04 |
| Random Edge Completion Network | 200,000 | 0.2332 | 62.63 |
Conclusion: The decrease in Modularity Coefficient Q after interest-driven edge completion is the regression of the network from a “pseudo-sparse” state to a real dense state, not the destruction of community structure; its relative error is much lower than that of random edge completion, which can retain the community structure of real networks.
Experiment 4: Validation of Human–Bot Interaction Consistency.
Purpose: This was to verify the ability of the completion strategy to retain the inherent distribution law of human–bot interaction in real social networks.
Design: Screen the public active nodes in the original network, count the proportion of HH (human–human), BH (bot–human) and BB (bot–bot) interaction edges; add the same number of interest-driven/random completed edges to the original network, and use the Euclidean distance to measure the similarity of interaction distribution between the completed network and the benchmark network (Formula (A4)).
$$ d = \sqrt{(x_1-y_1)^2+(x_2-y_2)^2+(x_3-y_3)^2} \tag{A4} $$

Symbol Explanation: d = Euclidean distance; x = (x₁, x₂, x₃) = HH/BH/BB interaction-ratio vector of the completed network; y = (y₁, y₂, y₃) = HH/BH/BB interaction-ratio vector of the original benchmark network. Core Purpose: to measure the similarity between the completed network and the real distribution of human–bot interactions; the smaller the distance, the higher the consistency.
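Formula (A4) applied to the interest-driven vectors of Table A4 reproduces the reported distance (a quick numerical check; the variable names are ours):

```python
import math

def interaction_distance(x, y):
    # Formula (A4): Euclidean distance between HH/BH/BB ratio vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

benchmark = (0.8483, 0.1464, 0.0053)   # HH, BH, BB of the uncompleted network
interest = (0.8166, 0.1624, 0.0021)    # interest-driven completed network
d = interaction_distance(interest, benchmark)   # about 0.0357, as in Table A4
```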
Results: See Table A4.
Table A4. Comparison of Human–Bot Interaction Ratios and Euclidean Distances for Interest-Driven Edge Completion.

| Network Type | HH | BH | BB | Euclidean Distance |
| --- | --- | --- | --- | --- |
| No Edge Completion (Benchmark) | 0.8483 | 0.1464 | 0.0053 | - |
| Interest-Driven Edge Completion | 0.8166 | 0.1624 | 0.0021 | 0.0357 |
| Random Edge Completion | 0.7431 | 0.2376 | 0.0193 | 0.2708 |
Conclusion: The Euclidean distance between the interest-driven completed network and the benchmark network is extremely small, which basically retains the real human–bot interaction ratio; random edge completion seriously damages the topological structure of human–bot interaction.
Experiment 5: Validation of the Convergence of Underlying Topological Structure.
Purpose: This was to verify the ability of the completion strategy to retain the core topological scale-free, high-clustering and small-world characteristics of real social networks.
Design: Taking the original real network as the benchmark, construct the interest-driven/random completed network, and calculate three indicators: power-law exponent, average clustering coefficient, and Average Path Length. Adopt the K-S Test to verify the consistency of the degree distribution between the completed network and the benchmark network (Formulas (A5) and (A6)).
$$ p(k) \sim c\,k^{-\gamma} \tag{A5} $$

Symbol Explanation: p(k) = probability of the occurrence of nodes with degree k; c = normalization constant; γ = power-law exponent (core observed metric of the experiment); "∼" means "subject to the distribution". Core Purpose: to describe the scale-free property of social networks and verify, through the power-law exponent, whether the completed network retains the real degree distribution law.

$$ D = \sup_x \lvert F_n(x) - F(x) \rvert \tag{A6} $$

Symbol Explanation: D = K-S test statistic; sup_x denotes the supremum over all x; F_n(x) = empirical cumulative distribution function of the completed network's degree distribution; F(x) = cumulative distribution function of the benchmark network's degree distribution. Core Purpose: to quantitatively test whether the degree distributions of two networks are statistically consistent; the smaller the D and the larger the p-value, the higher the consistency.
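Both diagnostics can be sketched with the standard library: a continuous maximum-likelihood estimate of the power-law exponent (a Clauset-style estimator; the paper's exact fitting procedure is not specified) and the two-sample K-S statistic of Formula (A6), without the p-value computation.

```python
import bisect
import math

def powerlaw_exponent(degrees, k_min=1):
    # MLE for Formula (A5): gamma = 1 + n / sum(ln(k / k_min)) over k >= k_min.
    ks = [k for k in degrees if k >= k_min]
    return 1 + len(ks) / sum(math.log(k / k_min) for k in ks)

def ks_statistic(sample_a, sample_b):
    # Formula (A6): D = sup_x |F_a(x) - F_b(x)| over the pooled support.
    a, b = sorted(sample_a), sorted(sample_b)
    ecdf = lambda s, x: bisect.bisect_right(s, x) / len(s)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))

gamma = powerlaw_exponent([1, 2, 4])          # toy degree list
d_same = ks_statistic([1, 2, 3], [1, 2, 3])   # identical samples give D = 0
```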
Results: See Table A5.
Table A5. Comparison of Underlying Topological Characteristics and K-S Test Results for Interest-Driven Edge Completion.

| Network Type | Power-Law Exponent | Average Clustering Coefficient | Average Path Length | K-S p-Value |
| --- | --- | --- | --- | --- |
| No Edge Completion (Real Observation) | 1.4053 | 0.1571 | 3.1166 | - |
| Our Interest-Driven Edge Completion | 1.3967 | 0.1635 | 3.0789 | 0.033 |
| Random Edge Completion | 1.4855 | 0.0709 | 2.8148 | 0.000 |
Conclusion: The underlying topological characteristics of the interest-driven completed network are highly consistent with the real network, with good topological convergence; random edge completion seriously damages the core statistical laws of social networks.
The above five groups of experiments verify the reliability of the “Interest + Social Association” edge completion strategy from multiple dimensions: social association-based edge completion can directly restore real missing edges, interest-driven edge completion conforms to the laws of social behavior, and both can retain the community structure, human–bot interaction mode and underlying topological characteristics of real social networks. Due to the limitation of user privacy protection, it is impossible to obtain the 100% complete topological structure of real social networks. Interest-driven edge completion can only be verified indirectly due to the scarcity of public benchmark edges, and there is a slight deviation between the experimental results and the topological characteristics of the real complete network. However, this does not affect the core effectiveness of this strategy for social network topology repair in edge-deficient scenarios.
The core goal of this study is not to fully restore the complete topological structure of real social networks but to optimize the quality of synthetic social graph data through edge completion, so as to provide high-quality training data for social bot detection. Future research can incorporate more public behavioral data to further enhance the accuracy of the edge completion strategy.

Appendix D. Specific Implementation Details of the Edge Completion Strategy (Based on the Common Identity and Common Bond Theory)

Logic of Edge Completion Based on Interest Identification: the Gemma3 model generates a 17-dimensional IPTC topic probability distribution for the posted content of all nodes. For each node participating in multiple topics, the single core topic with the highest probability is selected via the argmax function, and all secondary topics are discarded without traversal. Nodes are grouped by their core topic to form standardized topic communities. Meanwhile, the average in-degree and out-degree of public benchmark nodes in each topic category are calculated as the prior connection strength. For edge-deficient nodes, their core participating topics are determined from the topic mapping results, the number of edges to be completed is allocated by combining the node's follow and follower counts with the average connection strength of its topic, and directed supplementary edges are finally generated at random strictly within the same topic community. This method supplements more than 1.66 million edges in total, reducing the proportion of isolated nodes in the network to within 17%. Subsequent experiments found that bots have a wider topic coverage: the minimum number of topics required to cover 80% of bot users is more than twice that required for human users.
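The allocation logic above can be sketched as follows, using only the standard library. All names are illustrative and the budget rule is simplified to min(follow count, topic-average out-degree); for brevity the sketch is applied to every node, whereas the paper targets only edge-deficient nodes.

```python
import random

def complete_interest_edges(topic_dist, follows, avg_out, seed=0):
    # Hypothetical sketch: argmax core topic -> topic communities ->
    # per-node edge budget -> random same-topic directed edges.
    rng = random.Random(seed)
    core = {v: max(d, key=d.get) for v, d in topic_dist.items()}  # argmax topic
    groups = {}
    for v, k in core.items():
        groups.setdefault(k, []).append(v)
    new_edges = []
    for v, k in core.items():
        budget = min(follows.get(v, 0), round(avg_out.get(k, 0)))
        peers = [u for u in groups[k] if u != v]
        for u in rng.sample(peers, min(budget, len(peers))):
            new_edges.append((v, u))   # strictly within the same topic community
    return new_edges

topic_dist = {
    "v1": {"sport": 0.7, "tech": 0.3},
    "v2": {"sport": 0.6, "tech": 0.4},
    "v3": {"tech": 0.9, "sport": 0.1},
}
edges_add = complete_interest_edges(
    topic_dist,
    follows={"v1": 2, "v2": 1, "v3": 1},   # toy public follow counts
    avg_out={"sport": 1, "tech": 1},       # per-topic prior connection strength
    seed=1,
)
```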
Logic of Edge Completion Based on Social Association: first, corresponding directed edges are added for users with mutual @ behaviors. For users without direct @ behaviors, the Jaccard similarity of their mention sets is computed within the public benchmark nodes, and a "Jaccard similarity → edge existence probability" mapping is learned empirically from the real edge distribution of the public nodes; no fixed manual threshold is set. Supplementary edges are then generated for edge-deficient (information-hidden) nodes via probabilistic random sampling based on this learned mapping. This method supplements more than 1.4 million edges in total, reducing the proportion of isolated nodes in the network to within 24%.
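One simple realization of the learned "similarity → edge probability" mapping is a bucketed estimate, sketched below; the function names and the bucket count are ours, not the paper's implementation.

```python
import random

def jaccard(a, b):
    # Jaccard similarity of two users' mention sets.
    return len(a & b) / len(a | b) if (a or b) else 0.0

def learn_bucket_probs(pairs, edge_set, n_buckets=5):
    # Empirical map: bucketed Jaccard similarity -> P(edge exists),
    # estimated on public-node pairs whose true edges are observable.
    hits, tot = [0] * n_buckets, [0] * n_buckets
    for u, v, s in pairs:
        b = min(int(s * n_buckets), n_buckets - 1)
        tot[b] += 1
        hits[b] += (u, v) in edge_set
    return [h / t if t else 0.0 for h, t in zip(hits, tot)]

def sample_edges(candidates, probs, n_buckets=5, seed=0):
    # Probabilistic completion: keep (u, v) with its bucket's learned probability.
    rng = random.Random(seed)
    return [(u, v) for u, v, s in candidates
            if rng.random() < probs[min(int(s * n_buckets), n_buckets - 1)]]

# Toy fit on two public pairs: a high-similarity pair that is truly connected
# and a low-similarity pair that is not.
probs = learn_bucket_probs([("a", "b", 0.9), ("c", "d", 0.1)], {("a", "b")})
completed = sample_edges([("x", "y", 0.95)], probs, seed=3)
```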
The edge completion strategy in this paper does not use human–bot labels. It only infers the connection patterns of hidden nodes from the statistical behavioral regularities of public nodes. The core goal is to make the characteristics of hidden nodes consistent with those of public nodes, rather than deliberately fitting bot characteristics. In future work, the stability of the edge completion rules under different label distributions will be further verified.
Algorithm A5: Interest-Based Edge Completion
INPUT: Node set V, Post data D, Topic categories K = 17, Attribute data Attr
OUTPUT: Supplementary edge set E_add

1: Func TopicClassify(D):
  // Use Gemma3-12b to generate 17-dimensional topic probability distribution
  // Return: node -> {topic 1: prob, …, topic 17: prob}
  Return topic_dict
2: Func BuildTopicGroups(topic_dict):
  // Complete Topic Mapping (Enhanced)
  Initialize node_main_topic = empty dict // node -> its core topic
  Initialize same_topic_nodes = empty dict // topic -> node set
  For each node v in V:
    dist = topic_dict[v], main_k = argmax(dist) // Select the ONLY core topic
    node_main_topic[v] = main_k, Add v to same_topic_nodes[main_k]
  Return same_topic_nodes, node_main_topic
3: Func CalcAvgDegree(V_public, topic_dict):
  Calculate average in/out degree for each topic k
  Return avg_out[k], avg_in[k] (∀k ∈ K)
4: Func AssignTargets(v, main_k, Attr[v], avg_degree):
  Return target_out, target_in
5: Func GenEdges(v, targets, same_topic_nodes):
  Return E_supp
6: MAIN:
7: topic_dict = TopicClassify(D)
8: same_topic_nodes, node_main_topic = BuildTopicGroups(topic_dict)
9: avg_degree = CalcAvgDegree(V_public, topic_dict)
10: For each edge-deficient node v:
11:  main_k = node_main_topic[v] // Directly use mapped core topic
12:  targets = AssignTargets(v, main_k, Attr[v], avg_degree)
13:  E_add += GenEdges(v, targets, same_topic_nodes[main_k])
14: Save E_add; Return E_add
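As a minimal illustration, the core of Algorithm A5 (selecting each node's single core topic via argmax and randomly wiring edge-deficient nodes inside their topic community) can be sketched in Python. The Gemma3 topic classification is assumed to have already produced `topic_dict`; all function and variable names here are illustrative, not the paper's implementation.

```python
import random
from collections import defaultdict

def build_topic_groups(topic_dict):
    """Map each node to its single highest-probability (core) topic
    and group nodes into topic communities (mirrors BuildTopicGroups)."""
    node_main_topic, same_topic_nodes = {}, defaultdict(list)
    for v, dist in topic_dict.items():
        main_k = max(dist, key=dist.get)  # argmax over the topic probabilities
        node_main_topic[v] = main_k
        same_topic_nodes[main_k].append(v)
    return node_main_topic, same_topic_nodes

def complete_edges(deficient, node_main_topic, same_topic_nodes, targets):
    """Randomly wire each edge-deficient node strictly within its own
    topic community; targets[v] is the allocated number of new edges."""
    e_add = set()
    for v in deficient:
        pool = [u for u in same_topic_nodes[node_main_topic[v]] if u != v]
        for u in random.sample(pool, min(targets[v], len(pool))):
            e_add.add((v, u))  # directed supplementary edge v -> u
    return e_add
```

In the paper's pipeline, `targets` would come from AssignTargets (follower/following counts combined with the topic's average connection strength); here it is passed in directly.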
Algorithm A6: Social Association Edge Completion
INPUT: User set V, Mention data M, Edge set E_existing
OUTPUT: Supplementary edge set E_social

1: # Stage 1: Direct mention edges
2: For (u,v) In M where @ relation exists:
3:  If (u,v) Not In E_existing: E_social.add((u,v))

4: # Stage 2: Similar mention group edges
5: mention_sets = {u: set(mentioned_by(u)) For u In V_public}
6: For u In V Where is_edge_deficient(u):
7:  For v In V_public Where v ≠ u:
8:   sim = jaccard(mention_sets[u], mention_sets[v])
9:   prob = learn_edge_prob_from_A(sim)
10:   If random() < prob: E_social.add((u,v))
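The Jaccard-based stage of Algorithm A6 can be sketched as follows. `learn_edge_prob` is a hypothetical stand-in for the paper's learned "similarity → edge existence probability" mapping, implemented here as a simple binned empirical frequency over public-node pairs; the exact estimator used in the paper is not specified.

```python
import random

def jaccard(a, b):
    """Jaccard similarity of two mention sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def learn_edge_prob(public_pairs, bins=10):
    """Empirical P(edge | similarity bin), estimated from observed
    public-node pairs given as (similarity, has_edge) tuples."""
    hits, counts = [0] * bins, [0] * bins
    for sim, has_edge in public_pairs:
        i = min(int(sim * bins), bins - 1)
        counts[i] += 1
        hits[i] += int(has_edge)
    return [h / c if c else 0.0 for h, c in zip(hits, counts)]

def sample_edge(sim, prob_table, rng=random.random):
    """Probabilistic random sampling: add an edge with the learned probability."""
    i = min(int(sim * len(prob_table)), len(prob_table) - 1)
    return rng() < prob_table[i]
```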

Appendix E. Experimental Environment and Hyperparameter Configuration

The Python version used is 3.12.3, relying on core libraries such as numpy, networkx, pandas, torch, and scikit-learn. For the synthetic social graph, the target number of bot and human nodes is 70,000 each, with a batch size of 10,000. The improved RCMH sampling + diffusion model uses random seeds 42, 123, 456, 789, and 10,086, with results averaged across the five runs. Both the sampled and synthetic graph sizes are 5000 nodes, with 1000 diffusion steps, a learning rate of 1 × 10⁻⁴, 50 training epochs, a noise scale of 0.1, and an embedding dimension of 256. The RCMH sampling transition parameter is 0.95, with a maximum idle step count of 100. For bot nodes, weights are 1.1–1.2 in the low-degree interval and 0.7–0.8 in the high-degree interval; the weight assignment for human nodes is the reverse.
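For reproducibility, the hyperparameters above can be collected into a single configuration object. This is a hypothetical consolidation; the key names are illustrative and not taken from the paper's code.

```python
# Hypothetical consolidation of the hyperparameters listed in Appendix E.
CONFIG = {
    "seeds": [42, 123, 456, 789, 10086],  # results averaged over five runs
    "target_nodes_per_class": 70000,      # bots and humans each
    "batch_size": 10000,
    "graph_nodes": 5000,                  # sampled and synthetic graph sizes
    "diffusion_steps": 1000,
    "learning_rate": 1e-4,
    "epochs": 50,
    "noise_scale": 0.1,
    "embedding_dim": 256,
    "rcmh_transition": 0.95,
    "max_idle_steps": 100,
}
```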

Appendix F. Feature Engineering Explanation

The nine key structural features are: in-degree (number of followers); out-degree (number of follows); total degree (in-degree + out-degree); Proportion of Bot In-Neighbors/Out-Neighbors (share of bots in the corresponding neighbor set); Proportion of Human In-Neighbors/Out-Neighbors (share of humans in the corresponding neighbor set); Degree Centrality (ratio of the node's total degree to the total number of labeled nodes in the network); and Node Activity Status (1 if total degree > 0, otherwise 0).
Note: Some features are calculated from node labels. The original intention is to compare the topological quality of network construction methods, not to adapt to real detection scenarios; in unlabeled scenarios, these features should be approximated via unsupervised or self-supervised methods. All features are computed directly from network topology and node labels without redundant dimensions, so the extraction process is not elaborated further.
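The nine features above can be computed directly from an edge list, as in the minimal pure-Python sketch below. The function and key names are illustrative; the paper does not specify its extraction code.

```python
def structural_features(edges, labels):
    """edges: iterable of directed (u, v) pairs meaning 'u follows v';
    labels: node -> 1 (bot) / 0 (human). Returns the nine structural
    features described in Appendix F for every labeled node."""
    preds = {u: set() for u in labels}  # in-neighbors (followers)
    succs = {u: set() for u in labels}  # out-neighbors (follows)
    for u, v in edges:
        succs[u].add(v)
        preds[v].add(u)
    n_labeled = len(labels)

    def frac(nbrs, cls):
        # share of neighbors with the given class label
        return sum(labels[x] == cls for x in nbrs) / len(nbrs) if nbrs else 0.0

    feats = {}
    for v in labels:
        din, dout = len(preds[v]), len(succs[v])
        feats[v] = {
            "in_degree": din,
            "out_degree": dout,
            "total_degree": din + dout,
            "bot_in": frac(preds[v], 1),
            "human_in": frac(preds[v], 0),
            "bot_out": frac(succs[v], 1),
            "human_out": frac(succs[v], 0),
            "degree_centrality": (din + dout) / n_labeled,
            "active": int(din + dout > 0),
        }
    return feats
```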

Appendix G. Feature Importance and Ablation Experiments

Table A6. Feature Importance.

| Feature Name | Importance (Pre-Completion) | Importance (Post-Completion) |
|---|---|---|
| In-Degree | 0.378 | 0.245 |
| Total Degree | 0.257 | 0.167 |
| Degree Centrality | 0.161 | 0.160 |
| Proportion of Bot In-Neighbors | 0.070 | 0.087 |
| Proportion of Human In-Neighbors | 0.066 | 0.142 |
| Out-Degree | 0.024 | 0.059 |
| Proportion of Bot Out-Neighbors | 0.023 | 0.069 |
| Proportion of Human Out-Neighbors | 0.022 | 0.071 |
| Node Activity Status | 0.001 | 0.001 |
Table A7. Ablation Experiment Performance (Two Improved Strategies).

| Method Name | ACC (95% CI) | Pre (95% CI) | Rec (95% CI) | F1 (95% CI) |
|---|---|---|---|---|
| Degree Scaling Only | 0.5772 [0.5716, 0.5822] | 0.5515 [0.5454, 0.5579] | 0.8262 [0.8206, 0.8322] | 0.6614 [0.6561, 0.6666] |
| Interaction Preference Weight Only | 0.5124 [0.5070, 0.5176] | 0.5459 [0.5316, 0.5600] | 0.1477 [0.1417, 0.1530] | 0.2324 [0.2247, 0.2402] |
| Node Importance Weight Only | 0.5286 [0.5237, 0.5336] | 0.6578 [0.6427, 0.6737] | 0.1191 [0.1143, 0.1237] | 0.2017 [0.1944, 0.2088] |
| Balance + Degree Constraint Only | 0.5255 [0.5207, 0.5305] | 0.6220 [0.6071, 0.6379] | 0.1301 [0.1251, 0.1349] | 0.2152 [0.2079, 0.2223] |
Regarding feature importance, in-degree and total degree are the two most important features before edge completion. After edge completion, the importance of traditional degree-related features decreases markedly, while that of features reflecting interaction patterns increases. This reduces the model's reliance on any single degree feature, yields a more balanced importance distribution, better matches the diverse topological patterns required for social bot detection, and thus plays a positive role.

Appendix H. Data-Driven Balanced Node Weight Calculation

The node weight calculation method proposed in this paper first partitions nodes into multiple degree bins by their degrees and counts the number of bot and human nodes in each bin. Raw weights are then computed based on the class distribution of the node’s bin: the weight of a bot node is positively correlated with the number of human nodes in the bin, while the weight of a human node is positively correlated with the number of bot nodes, to amplify the sampling probability of underrepresented classes. A lower bound constraint prevents weights from approaching zero, followed by global normalization to set the mean weight to 1, yielding balanced weights directly applicable for sampling and mitigating class bias in social network sampling.
Algorithm A7: Data-Driven Balanced Node Weight Calculation
INPUT: Graph G, node labels label: V → {0, 1}, degree bins B
OUTPUT: Node weights ω: V → ℝ⁺

1: For each bin i, count bot and human nodes:
   B_i ← {u ∈ V : label(u) = 1, deg(u) ∈ B_i},  H_i ← {u ∈ V : label(u) = 0, deg(u) ∈ B_i}
2: For each node u ∈ V do
3:  Find bin i with deg(u) ∈ B_i
4:  If label(u) = 1 then ω_raw(u) ← |H_i| / (|B_i| + 1) else ω_raw(u) ← |B_i| / (|H_i| + 1)
5:  ω_raw(u) ← max(0.1, ω_raw(u))
6: μ ← (1 / |V|) · Σ_u ω_raw(u)
7: For each node u ∈ V do ω(u) ← ω_raw(u) / μ
8: Return ω
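Algorithm A7 translates almost directly into Python. The sketch below assumes degree bins are given as inclusive (lo, hi) ranges covering all observed degrees, and applies the 0.1 lower bound and mean-1 normalization described above; names are illustrative.

```python
def balanced_weights(degrees, labels, bins):
    """degrees: node -> degree; labels: node -> 1 (bot) / 0 (human);
    bins: list of inclusive (lo, hi) degree ranges covering all degrees."""
    def bin_of(d):
        return next(i for i, (lo, hi) in enumerate(bins) if lo <= d <= hi)

    # Step 1: per-bin bot/human counts
    B = [0] * len(bins)
    H = [0] * len(bins)
    for u, d in degrees.items():
        (B if labels[u] == 1 else H)[bin_of(d)] += 1

    # Steps 2-5: raw weights favor the underrepresented class in each bin
    raw = {}
    for u, d in degrees.items():
        i = bin_of(d)
        w = H[i] / (B[i] + 1) if labels[u] == 1 else B[i] / (H[i] + 1)
        raw[u] = max(0.1, w)  # lower bound keeps weights away from zero

    # Steps 6-7: global normalization so that the mean weight is 1
    mu = sum(raw.values()) / len(raw)
    return {u: w / mu for u, w in raw.items()}
```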

Appendix I. Data Preprocessing and Division Details

The dataset used in this study has undergone basic preprocessing by the original team. For data division, the complete dataset is first randomly split into two independent groups: Group A (70%) and Group B (30%). In subsequent experiments, all synthetic social graph construction and strategy learning are based solely on Group A, with the generated synthetic social graph data used as the training set; Group B is further randomly divided into a test set and a validation set at a ratio of 2:1 for model evaluation.
In comparative experiments using real data directly, to maintain the stability of data distribution, we adopt the stratified random sampling method. The entire real data is divided into a training set, a validation set, and a test set at a ratio of 7:2:1, ensuring that the class ratio of human and bot nodes in each subset is consistent with the original population. Class balance is achieved through random downsampling during the training phase to improve the stability and generalization ability of model training.
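The stratified 7:2:1 split described above can be sketched in plain Python, allocating each class proportionally to every subset so the human–bot ratio is preserved. This is an illustrative implementation under that assumption; the original team's exact splitting code is not specified.

```python
import random
from collections import defaultdict

def stratified_split(items, labels, ratios=(0.7, 0.2, 0.1), seed=42):
    """Stratified random split: each class is shuffled and divided
    proportionally, preserving the class ratio in every subset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, lab in zip(items, labels):
        by_class[lab].append(item)

    splits = [[] for _ in ratios]
    for members in by_class.values():
        rng.shuffle(members)
        start = 0
        for i, r in enumerate(ratios):
            # last split absorbs rounding remainders
            end = len(members) if i == len(ratios) - 1 else start + round(r * len(members))
            splits[i].extend(members[start:end])
            start = end
    return splits  # [train, validation, test]
```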

Figure 1. The Impact of the Proportion of Isolated Nodes on Social Network Bot Detection Performance.
Table 1. Human–Bot Proportion of High-Degree Nodes and Their Interaction Objects.

| Statistical Dimension | Human Proportion (%) | Bot Proportion (%) |
|---|---|---|
| Composition of high-degree nodes (deg > 105) | 97.69 | 2.31 |
| Composition of interaction objects of high-degree human nodes | 95.05 | 4.95 |
Table 2. Comparison of Social Bot Detection Performance of Different Network Construction Methods.

| Dataset | Method Name | ACC (95% CI) | Pre (95% CI) | Rec (95% CI) | F1 (95% CI) |
|---|---|---|---|---|---|
| TwiBot-22 | Real Data (train:val:test = 7:2:1) | 0.6170 [0.6115, 0.6224] | 0.6006 [0.5928, 0.6076] | 0.6997 [0.6929, 0.7064] | 0.6463 [0.6404, 0.6521] |
| TwiBot-22 | Attribute-Based Degree | 0.4999 [0.4947, 0.5052] | 0.5000 [0.4947, 0.5048] | 1.0000 [1.0000, 1.0000] | 0.6666 [0.6620, 0.6710] |
| TwiBot-22 | Network-Based Degree | 0.5009 [0.4957, 0.5062] | 0.7734 [0.6557, 0.8724] | 0.0024 [0.0017, 0.0031] | 0.0047 [0.0033, 0.0061] |
| TwiBot-22 | Network-Based Degree (Non-Isolated) | 0.5088 [0.5034, 0.5139] | 0.5054 [0.4998, 0.5105] | 0.8353 [0.8297, 0.8409] | 0.6297 [0.6246, 0.6343] |
| TwiBot-22 | Subgraph Sampling + Diffusion Model | 0.6061 [0.6014, 0.6111] | 0.5760 [0.5703, 0.5821] | 0.8041 [0.7984, 0.8093] | 0.6712 [0.6664, 0.6762] |
| TwiBot-22 | Improved Degree-Based Method | 0.5788 [0.5732, 0.5839] | 0.5529 [0.5468, 0.5595] | 0.8235 [0.8179, 0.8295] | 0.6615 [0.6564, 0.6666] |
| TwiBot-22 | Improved Subgraph Sampling + Diffusion Model | 0.6190 [0.6142, 0.6241] | 0.6042 [0.5975, 0.6108] | 0.6902 [0.6833, 0.6970] | 0.6443 [0.6386, 0.6501] |
| TwiBot-20 | Real Data (train:val:test = 7:2:1) | 0.5750 [0.5532, 0.5978] | 0.5951 [0.5597, 0.6347] | 0.4752 [0.4424, 0.5087] | 0.5286 [0.4974, 0.5565] |
| TwiBot-20 | Attribute-Based Degree | 0.4023 [0.3822, 0.4223] | 0.5077 [0.3709, 0.6305] | 0.0218 [0.0143, 0.0297] | 0.0417 [0.0274, 0.0576] |
| TwiBot-20 | Network-Based Degree | 0.4320 [0.4111, 0.4530] | 0.5542 [0.5160, 0.5929] | 0.2540 [0.2308, 0.2761] | 0.3490 [0.3224, 0.3784] |
| TwiBot-20 | Network-Based Degree (Non-Isolated) | 0.4423 [0.4209, 0.4642] | 0.5438 [0.5123, 0.5743] | 0.4144 [0.3876, 0.4407] | 0.4702 [0.4446, 0.4945] |
| TwiBot-20 | Subgraph Sampling + Diffusion Model | 0.5750 [0.5535, 0.5917] | 0.6596 [0.6384, 0.6902] | 0.6058 [0.5804, 0.6273] | 0.6297 [0.6127, 0.6528] |
| TwiBot-20 | Improved Degree-Based Method | 0.5656 [0.5447, 0.5861] | 0.6425 [0.6146, 0.6686] | 0.6153 [0.5875, 0.6426] | 0.6287 [0.6059, 0.6512] |
| TwiBot-20 | Improved Subgraph Sampling + Diffusion Model | 0.5722 [0.5530, 0.5915] | 0.6497 [0.6237, 0.6742] | 0.6282 [0.6050, 0.6514] | 0.6365 [0.6153, 0.6566] |
Bold values indicate the highest performance of the corresponding metric among all methods under the same dataset (This explanation applies to all figures and tables in this paper).
Table 3. Bot Detection Performance of Edge Completion Strategies and Improved Methods After Edge Completion.

| Method Name | ACC (95% CI) | Pre (95% CI) | Rec (95% CI) | F1 (95% CI) |
|---|---|---|---|---|
| Real Data (train:val:test = 7:2:1) | 0.6170 [0.6115, 0.6224] | 0.6006 [0.5928, 0.6076] | 0.6997 [0.6929, 0.7064] | 0.6463 [0.6404, 0.6521] |
| Real Data (Interest-Based Edge Completion) | 0.6833 [0.6784, 0.6881] | 0.6771 [0.6704, 0.6840] | 0.7008 [0.6936, 0.7075] | 0.6888 [0.6833, 0.6945] |
| Real Data (Social Association Edge Completion) | 0.6299 [0.6246, 0.6348] | 0.6151 [0.6078, 0.6212] | 0.6943 [0.6878, 0.7012] | 0.6523 [0.6467, 0.6575] |
| Real Data (Interest + Social Association Edge Completion) | 0.6875 [0.6829, 0.6923] | 0.6675 [0.6619, 0.6735] | 0.7472 [0.7410, 0.7537] | 0.7051 [0.7002, 0.7104] |
| Improved Degree-Based Method (Post Edge Completion) | 0.5908 ↑ [0.5861, 0.5956] | 0.5587 ↑ [0.5530, 0.5643] | 0.8648 ↑ [0.8599, 0.8695] | 0.6788 ↑ [0.6741, 0.6834] |
| Improved Subgraph Sampling + Diffusion (Post Edge Completion) | 0.6427 ↑ [0.6383, 0.6474] | 0.5991 ↓ [0.5934, 0.6049] | 0.8626 ↑ [0.8580, 0.8677] | 0.7071 ↑ [0.7025, 0.7117] |
Up/down arrows indicate whether the corresponding metric improved or declined after edge completion under the same network synthesis model, compared with the baseline without edge completion.
Table 4. Comparison of Bot Detection Performance (Link Prediction Edge Completion, Edge-Deficient).

| Method Name | ACC (95% CI) | Pre (95% CI) | Rec (95% CI) | F1 (95% CI) |
|---|---|---|---|---|
| Real Data (20% Test Set) | 0.6170 [0.6115, 0.6224] | 0.6006 [0.5928, 0.6076] | 0.6997 [0.6929, 0.7064] | 0.6463 [0.6404, 0.6521] |
| Link Prediction (Basic Supervision) | 0.6176 [0.6124, 0.6229] | 0.6007 [0.5937, 0.6076] | 0.7021 [0.6953, 0.7093] | 0.6474 [0.6418, 0.6533] |
| Link Prediction (Enhanced Supervision) | 0.5941 [0.5890, 0.5993] | 0.5743 [0.5675, 0.5811] | 0.7280 [0.7208, 0.7347] | 0.6420 [0.6368, 0.6474] |
| Link Prediction (Representation Learning) | 0.6166 [0.6113, 0.6221] | 0.6000 [0.5928, 0.6069] | 0.7002 [0.6933, 0.7073] | 0.6462 [0.6405, 0.6520] |
| Link Prediction (Hybrid Integration) | 0.6058 [0.6007, 0.6110] | 0.5928 [0.5856, 0.5999] | 0.6770 [0.6701, 0.6841] | 0.6321 [0.6264, 0.6380] |

Share and Cite

MDPI and ACS Style

Wang, J.; Tang, M. Multi-Strategy Improvement and Comparative Research on Data-Driven Social Network Construction in Edge-Deficient Scenarios for Social Bot Account Detection. Information 2026, 17, 360. https://doi.org/10.3390/info17040360

