Article

CB-MTE: Social Bot Detection via Multi-Source Heterogeneous Feature Fusion

1 School of Computer Science, Qinghai Normal University, Xining 810008, China
2 The State Key Laboratory of Tibetan Intelligence, Qinghai Normal University, Xining 810008, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(11), 3549; https://doi.org/10.3390/s25113549
Submission received: 15 April 2025 / Revised: 2 June 2025 / Accepted: 3 June 2025 / Published: 4 June 2025
(This article belongs to the Section Sensors and Robotics)

Abstract

Social bots increasingly mimic real users and collaborate in large-scale influence campaigns, distorting public perception and making their detection both critical and challenging. Traditional bot detection methods, constrained by single-source features, often fail to capture the complete behavioral and contextual characteristics of social bots, especially their dynamic behavioral evolution and group coordination tactics, resulting in feature incompleteness and reduced detection performance. To address this challenge, we propose CB-MTE, a social bot detection framework based on multi-source heterogeneous feature fusion. CB-MTE adopts a hierarchical architecture: user metadata is used to construct behavioral portraits, deep semantic representations are extracted from textual content via DistilBERT, and community-aware graph embeddings are learned through a combination of random walk and Skip-gram modeling. To mitigate feature redundancy and preserve structural consistency, manifold learning is applied for nonlinear dimensionality reduction, ensuring both local and global topology are maintained. Finally, a CatBoost-based collaborative reasoning mechanism enhances model robustness through ordered target encoding and symmetric tree structures. Experiments on the TwiBot-22 benchmark dataset demonstrate that CB-MTE significantly outperforms mainstream detection models in recognizing dynamic behavioral traits and detecting collaborative bot activities. These results confirm the framework’s capability to capture the complete behavioral and contextual characteristics of social bots through multi-source feature integration.

1. Introduction

The widespread adoption of digital technologies has reshaped how human society is connected. The ITU's Facts and Figures 2024 report predicted that the number of Internet users worldwide would reach 5.5 billion by the end of 2024. With the spread of the mobile Internet and the growing functionality of social media platforms, the Internet has become the main channel for information acquisition and social interaction. At the same time, large numbers of social bots threaten the security of cyberspace through intelligent means. Social bots have evolved into agents capable of identity camouflage, behavior simulation, and group coordination, seriously damaging the ecological balance of the network through spam dissemination and public opinion manipulation [1,2,3]. Such technological misuse poses a severe challenge to existing detection systems, making it urgent to build multi-source intelligent detection frameworks that can cope with this rapidly evolving threat.
Social bot detection faces two core challenges: efficient identification from massive data and adaptation to dynamically evolving behavior. Traditional methods construct explicit indicators such as user activity, user portraits, and text features, and use algorithms such as Random Forest and Support Vector Machine to build classification models [4,5]. Although these methods can effectively identify the programmed behavioral features of early bots, they have two fundamental defects. First, they depend heavily on hand-designed features and therefore struggle against feature camouflage attacks. Second, although NLP-based text analysis can extract semantic features [6], it suffers from text noise and insufficient semantic generalization caused by content manipulation strategies.
To overcome these limitations, recent research has shifted to graph-based analysis, whose core value lies in mining group behavior patterns in depth. By constructing a user interaction graph, the topological features and propagation patterns of bot clusters can be effectively captured [7]. Furthermore, dynamic community division techniques supported by structural information theory can identify abnormal group behaviors [8]. However, existing techniques still face two major constraints: first, the sparsity of social network structural data leads to incomplete feature extraction; second, there is a significant tension between the real-time computing demands of evolving dynamic networks and the static processing mechanisms of existing models.
To sum up, it is difficult for traditional feature engineering to effectively resist feature camouflage attacks; text analysis is limited in identifying the content manipulation strategies; graph structure methods are often limited by structural data sparsity. Therefore, multi-source heterogeneous feature fusion and dynamic behavior perception have become the inevitable evolution direction of social bot detection.
We propose innovative solutions for the detection task of social bots. Core contributions are as follows:
(1)
Multi-source heterogeneous feature collaborative modeling
To address the characterization limitations of single-source features in social bot detection, we propose a metadata–text–social-topology multimodal collaborative analysis framework. Ten new metadata features, including device entropy and tweet mutation rate, are designed to construct a multi-granularity behavior profile, and semantic features of the text are extracted with a lightweight pre-trained language model. Topological anomaly patterns in community communication are extracted via deep graph embedding, and differences in node influence are quantified by multi-scale network centrality indices. A cross-modal feature space is then constructed through joint optimization.
(2)
Hierarchical double fusion detection framework CB-MTE
We propose CB-MTE, a two-tier fusion framework that synergizes feature-level and model-level integration. At the feature level, the heterogeneous feature space mapping is realized by nonlinear dimensionality reduction algorithms based on manifold learning to solve the dimension mismatch problem of metadata, text and graph embeddings. At the model level, a dynamic weight allocation mechanism is constructed using the CatBoost gradient boosting tree, and the classification boundary is optimized by combining ordered target encoding, which significantly improves the detection accuracy compared to traditional methods.
(3)
Fine-grained evaluation system
We established a cross-scenario evaluation benchmark on the TwiBot-22 dataset. To systematically validate framework robustness, five social sub-datasets covering political, entertainment, medical, and other domains were selected for cross-topic migration testing. Experiments show that the framework reaches a macro F1 score of 80.84% under new-topic camouflage attacks and dynamically evolving social network structures, 23.34 percentage points higher than the mainstream model BotRGCN, and its performance indicators significantly exceed those of mainstream baseline methods, confirming the strong adaptability of the multi-source feature fusion mechanism to complex evolving situations.

2. Related Work

2.1. Social Bot Detection

Social bots are automated accounts controlled by programs, and their behavior patterns are highly organized. Although some service bots exist, most of them are used in malicious activities such as forging interactive data, spreading junk information and manipulating public opinion [9]. In response to such threats, the current detection technology mainly adopts three types of technical routes: research paradigms based on feature engineering, text semantic analysis, and graph structure mining.
Feature-based approach: Relevant features are extracted from user metadata and published tweets, and then combined with a traditional binary classification model for bot detection [10]. Typical studies include Echeverria et al. [11], who developed an adaptive classifier for 20 new bot types through cross-class generalization tests; Alarfaj et al. [12], who integrated message features, part-of-speech tagging, and emotion polarity to construct feature sets for fine-grained detection; and Abreu et al. [13], who obtained a real-world dataset of tweets, extracted and selected five core features, and used four types of machine learning models for training and evaluation.
Text-based approach: This path focuses on content analysis of tweets, mining the semantic features and generation patterns of bot texts. Wang et al. [14] used an LSA model to extract four similarity measures from tweets for social bot detection. Duki et al. [6] used a BERT-base model and multiple labels for bot detection based on tweet content. Kumar et al. [15] classify tweets as bot or non-bot based on their text content.
Graph structure-based approach: This type of approach reveals group behavior characteristics by modeling user interaction networks. Feng et al. [7] capture topological patterns of bot interactions by constructing heterogeneous graph neural networks. Yu et al. [16] proposed a multi-modal detection framework based on social relationship subgraphs, which constructed user interaction subgraphs and linear splicing of text content features and behavioral gene sequences to generate social behavior features for bot recognition. Lin et al. [17] jointly encode user semantics, attributes, and neighborhood information, and adopt an improved graph attention network model to carry out parallel computation of large-scale graphs through subgraph sampling to realize bot detection.
Although the above methods have achieved remarkable results in their respective fields, single detection paradigms have significant shortcomings in complex adversarial scenarios: feature engineering is vulnerable to reverse engineering, text analysis struggles to distinguish semantically obfuscated content, and graph-structure methods rely on a complete social topology. Therefore, multi-source heterogeneous feature fusion has become a key direction for breaking through these limitations. In terms of multi-source feature fusion, Lei et al. [18] combined text and graph structure to recognize bots, and Kudugunta et al. [19] used tweet content and metadata to detect Twitter bots, but coverage of feature sources remains insufficient. Therefore, we propose the CB-MTE framework, which uses a triple collaborative mechanism of metadata behavior modeling, deep semantic understanding, and graph relationship inference to accurately identify social bots.

2.2. Social Bot Detection Technology

2.2.1. Text Modeling

In social media text analysis, BERT is widely used for feature extraction from user-generated content such as tweets, profiles, and interactive comments [20]. Although existing studies have explored the potential of BERT in social media analysis, such as generating tweet embeddings [21], capturing bot features in sentiment classification [22], and aggregating heterogeneous graph neighbors [23], they generally rely on a single text modality and neglect metadata fusion. Moreover, while Transformer-based pre-trained language models improve semantic understanding through bidirectional context encoding [24], their huge parameter counts and high computing costs severely restrict deployment in real-time streaming data scenarios. Given this core contradiction, we use DistilBERT, a lightweight semantic modeling framework obtained from BERT through knowledge distillation, which is smaller, faster, and lighter. It maintains 97% of the original semantic understanding performance while reducing the number of parameters by 40% and improving inference speed by 60% [25], which is critical for processing massive streams of tweets. We use DistilBERT to dynamically encode text, and integrate user metadata and social topology to realize multi-dimensional collaborative enhancement.

2.2.2. Graph Structure Embedding

Social bots often evade detection through star topologies or community infiltration. We use DeepWalk's random walk strategy to model local structural features [26]; its truncated walks can effectively capture both attack modes: the walk paths around the center node of a star topology are highly repetitive, while the walk sequences of infiltrating bots show cross-community migration characteristics. Compared with the homogeneity bias of Node2Vec [27] and the computational complexity of Struc2Vec's global structural modeling [28], DeepWalk is simpler to implement and faster on large-scale graphs. In addition, Berriche et al. [29] proposed a hybrid approach that integrates DeepWalk to learn low-dimensional structural embeddings, reducing computational complexity and enhancing robustness in complex networks; by further incorporating Beam Search, their method achieves significant improvements in both efficiency and robustness for community detection. Leskovec et al. [30] obtained graph embeddings using node feature and structure information, with outstanding performance in node classification and link prediction. In our work, we further enhance DeepWalk embeddings by integrating classical centrality measures and reducing dimensionality via UMAP, yielding a compact and structurally informative feature representation.

2.2.3. Decision Classification

We adopt CatBoost, a categorical gradient boosting algorithm [31], for multi-source feature classification. Compared with the traditional gradient boosting algorithms, CatBoost can process the original category data directly without the need for one-hot encoding when processing categorical features, thus avoiding the problems of data bloat and information loss. For example, Zhang et al. [32] used the bag-of-words approach to process TF-IDF for feature extraction and integrated the CatBoost algorithm for training, achieving a spam detection accuracy rate of over 98%. Ibrahim et al. [33] compared the performance of CatBoost with other classifiers on business process datasets, and found that CatBoost outperformed the alternatives. In our framework, CatBoost serves as the final decision layer, effectively integrating metadata, textual, and graph-based features.

3. Framework and Methods

3.1. Framework Architecture

Figure 1 shows an overview of the proposed CB-MTE framework, which consists of four modules that work together to improve the robustness and accuracy of social bot detection:
(1) Metadata feature extraction module: We extract user metadata from the TwiBot-22 dataset, including account attributes, behavioral attributes, and social attributes. Metadata is cleaned and standardized to eliminate noise and handle missing values. We then compute statistical and derived features based on prior work on behavioral profiling, and generate user embeddings to support downstream modules.
(2) Text Feature Extraction Module: We collect user-generated text and adopt DistilBERT [25] to produce high-dimensional, context-aware semantic embeddings. To reduce dimensionality while preserving structure, we apply UMAP (Uniform Manifold Approximation and Projection) [34], which has shown effectiveness in retaining global and local relationships in textual embeddings. The final text feature is a 16-dimensional vector suitable for lightweight classification.
(3) Graph structure feature extraction module: An interaction graph is constructed based on users’ social connections. While we adopt the DeepWalk strategy [26] to obtain initial node embeddings, we enhance this representation by further integrating structural properties. Specifically, we compute three classical structural indicators—closeness centrality, eigenvector centrality, and the clustering coefficient [35,36,37]—to quantify users’ topological importance. To capture the intrinsic structural patterns more effectively, we apply UMAP [34] to reduce the dimensionality of the node embeddings. The final graph-based feature vector is formed by concatenating these reduced representations with the structural indicators, enabling a comprehensive integration of local structural embeddings and global influence metrics for improved user modeling.
(4) Data feature fusion and decision module: The semantic, structural, and metadata embeddings are fused via vector concatenation to form a joint multi-source representation. We then feed this vector into the CatBoost classifier [31], a gradient boosting framework that employs ordered target statistics and symmetric tree structures to improve performance and reduce overfitting. We train the model in an end-to-end fashion using 10-fold cross-validation to evaluate classification performance.
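For illustration, the following minimal Python sketch shows how the four modules' outputs compose into a single user representation; the three extractor functions return random placeholders purely to show the shapes involved, and the real computations are described in Sections 3.2, 3.3, 3.4 and 3.5.

```python
# Minimal sketch of the CB-MTE feature pipeline; extractors are placeholders.
import numpy as np

rng = np.random.default_rng(0)

def extract_metadata(user):        # Section 3.2: 32-dim behavioral portrait
    return rng.random(32)

def extract_text_feature(user):    # Section 3.3 + UMAP: 16-dim semantic vector
    return rng.random(16)

def extract_graph_feature(user):   # Section 3.4 + UMAP: 19-dim structural vector
    return rng.random(19)

def cb_mte_features(user):
    """Concatenate the three views into the 67-dim joint representation (Section 3.5)."""
    return np.concatenate([extract_metadata(user),
                           extract_text_feature(user),
                           extract_graph_feature(user)])

print(cb_mte_features("user_1").shape)   # (67,)
```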

3.2. Metadata Feature Extraction

In the task of social bot detection, metadata feature extraction plays an important role. For each Twitter user, in addition to the published text and friend relationships, we extract user metadata and subdivide it into account attributes, behavioral attributes, and social attributes, defined as $M_u = [A_u, B_u, S_u] \in \mathbb{R}^{32}$, to comprehensively depict the user's basic characteristics and behavioral patterns. These attribute groups, summarized in Table 1, are as follows:
(1) Account attributes: record the user’s basic information and account settings, provide a preliminary basis for understanding the user’s background and identifying the account type, and effectively distinguish ordinary users from bots.
(2) Behavioral attributes: involve the user’s activity frequency and interaction characteristics, reflect the user’s activity degree and behavior law on the platform, and reveal abnormal signals in the operation mode.
(3) Social attributes: reflect the user’s interaction and social structure in the social network and can show the closeness and location distribution of the connection between the user and other accounts.
Building on previous feature extraction approaches [23,38], we further design 10 novel metadata indicators (denoted with ‘*’ in Table 1) that capture fine-grained behavioral dynamics and social interaction patterns. These new features are motivated by observations in user behavior not addressed in prior work and are tailored to better distinguish bots from human users.
After normalization and imputation, we construct a 32-dimensional metadata feature vector, which encodes both classical and novel indicators. This representation serves as a foundational user embedding, supporting downstream modules such as text-based and graph-based detection. The combination of established and newly proposed features demonstrates both continuity with prior research and our technical innovation.
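For illustration, the sketch below shows how two of the newly proposed indicators might be computed; these formulas are illustrative assumptions, not the exact definitions used in Table 1.

```python
# Illustrative interpretations of two derived metadata indicators (assumptions).
import numpy as np
from collections import Counter

def device_entropy(device_list):
    """Shannon entropy over the posting-source distribution of a user's tweets."""
    counts = np.array(list(Counter(device_list).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def tweet_mutation_rate(daily_counts):
    """Mean relative day-to-day change in tweet volume."""
    x = np.asarray(daily_counts, dtype=float)
    return float(np.mean(np.abs(np.diff(x)) / (x[:-1] + 1.0)))

print(device_entropy(["web", "web", "android"]))     # higher = more varied sources
print(tweet_mutation_rate([3, 3, 40, 2, 5]))         # higher = burstier posting
```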

3.3. Text Feature Extraction

We extract historical tweet texts from a large number of Twitter accounts in the public TwiBot-22 dataset. All retweets and quote Tweets are filtered out, and only the original tweet content is retained. The remaining text is processed using the DistilBERT tokenizer [25], which transforms each tweet into a sequence of tokens in the following format
$s = [\,[\mathrm{CLS}],\, x_1, x_2, \ldots, x_n,\, [\mathrm{SEP}]\,],$
where $x_t$ is the vocabulary index of the $t$-th token, [CLS] is a special classification token used by DistilBERT to aggregate sentence-level representations, and [SEP] is a separator token indicating the end of the input sequence. The sequence length is unified to $L_{\max}$ by dynamic padding and truncation, and the tokenized sequence is fed into the DistilBERT encoder, which iteratively updates the text representation through six Transformer blocks. The output of the $L$-th layer is represented as
$H^{(L)} = \mathrm{LayerNorm}\left(H^{(L-1)} + \mathrm{MultiHeadAttn}\left(H^{(L-1)}\right)\right),$
where $H^{(L-1)}$ denotes the output of the previous Transformer layer, MultiHeadAttn is the multi-head self-attention mechanism, and LayerNorm is the layer normalization operation.
Next, we use the hidden state of the last layer’s [CLS] token as the textual representation, since the [CLS] token is specifically designed in DistilBERT as a classification token. The hidden state of this token aggregates global information from the entire sequence via the self-attention mechanism:
$h_{[\mathrm{CLS}]}^{(6)} = H^{(6)}[0, :] \in \mathbb{R}^{768},$
where $H^{(6)}$ represents the hidden-state matrix of all tokens in the 6th layer of DistilBERT, and each row corresponds to a token's contextual embedding. As a lightweight alternative to BERT, DistilBERT consists of 6 Transformer layers and outputs 768-dimensional embeddings per token. The [CLS] token is automatically inserted at the beginning of each input sequence and is designed to aggregate global contextual information through multi-head self-attention. For $k$ tweets from the same user, we obtain $k$ such [CLS] vectors and compute their average as the user's semantic-level representation,
$f = \frac{1}{k} \sum_{i=1}^{k} h_{[\mathrm{CLS}],\,i}^{(6)} \in \mathbb{R}^{768},$
where $f$ is the averaged textual representation that encodes the user's overall semantic behavior. Through the forward propagation of the DistilBERT encoder, each tweet is dynamically converted into a semantic embedding. The complete textual representation across users is defined as $T = [f_0, f_1, \ldots, f_n]$, laying the data foundation for subsequent dimensionality reduction and multi-dimensional feature fusion. Although we employ DistilBERT for tokenization and text embedding, we use the model in its pre-trained form without fine-tuning on domain-specific Twitter data, considering factors such as computational cost and the generalizability of its pretrained knowledge. This design choice balances performance and efficiency, though we acknowledge that domain-specific fine-tuning could enhance model performance in future work.
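For illustration, a minimal sketch of this per-user text representation using the pre-trained distilbert-base-uncased checkpoint from the Hugging Face transformers library is given below; the maximum sequence length of 64 is a placeholder rather than the $L_{\max}$ used in our experiments.

```python
# Sketch of the per-user [CLS] averaging with a pre-trained (not fine-tuned) DistilBERT.
import torch
from transformers import DistilBertTokenizerFast, DistilBertModel

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
model.eval()

def user_text_embedding(tweets, max_len=64):
    """Average the last-layer [CLS] states of a user's original tweets (768-dim)."""
    inputs = tokenizer(tweets, padding=True, truncation=True,
                       max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # shape (k, L, 768)
    cls_vectors = hidden[:, 0, :]                    # [CLS] is the first token
    return cls_vectors.mean(dim=0)                   # user-level 768-dim vector
```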

3.4. Feature Extraction of Graph Structure

The topological properties of social networks, as illustrated in Figure 2, are used to ascertain the global structural position of user nodes and their local propagation behavioral patterns. These are captured jointly through quantitative analysis of structural statistics and DeepWalk embedding learning, thereby offering multi-level topological evidence for bot detection.

3.4.1. Global Location Extraction

First, the directed relationship graph $G = (V, E)$ is constructed, and core metrics are calculated for each node $v_i \in V$. Degree centrality typically indicates the extent of a node's direct connections to other nodes, which is already reflected by follower and following counts. Betweenness centrality mainly measures a node's role as a bridge for transmitting information between communities; however, since the nodes in our dataset are already divided by community, this indicator contributes relatively little. Consequently, we select the following three indicators for further examination.
Closeness Centrality [35]: the reciprocal of the average shortest-path distance from a node to all other nodes in the network.
$C_c(v_i) = \dfrac{N - 1}{\sum_{v_j \neq v_i} d(v_i, v_j)}.$
Eigenvector Centrality [36]: this metric serves to evaluate the extent to which a node exerts influence over the entire network, taking into account the impact of its significant neighbors.
$A E = \lambda_{\max} E,$
where A is the adjacency matrix of the graph, E is the eigenvector, and λmax is the largest eigenvalue.
Clustering Coefficient [37]: indicates the tightness of the connection between neighboring nodes.
$CC(v_i) = \dfrac{2\, T(v_i)}{\deg(v_i) \times (\deg(v_i) - 1)},$
where $T(v_i)$ is the number of triangles containing node $v_i$.
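For illustration, the three structural indicators can be computed with networkx as sketched below; how directionality is handled and the symmetrization of the graph before computing the clustering coefficient are assumptions of this sketch.

```python
# Sketch of the three structural indicators for each node of the interaction graph.
import networkx as nx

def structural_indicators(G: nx.DiGraph):
    closeness = nx.closeness_centrality(G)
    eigenvector = nx.eigenvector_centrality(G, max_iter=500)
    clustering = nx.clustering(G.to_undirected())   # triangle-based coefficient
    return {v: (closeness[v], eigenvector[v], clustering[v]) for v in G}
```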

3.4.2. Local Propagation Mode Learning

To capture the propagation patterns of the user population, the DeepWalk algorithm is employed to generate node embedding vectors $z_i \in \mathbb{R}^{128}$. The embeddings are learned from random walk sequences with a walk length of λ(u) and a window size of δ(u), and optimized with the Skip-gram model. The objective function maximizes the co-occurrence probability of context nodes as follows:
$\max_{z} \dfrac{1}{T} \sum_{i=1}^{T} \sum_{j \in \mathrm{window}(i)} \log p(v_j \mid z_i),$
where $z_i$ is the embedding vector of node $v_i$, $v_j$ is a context node, and $\mathrm{window}(i)$ is the set of context nodes within the sliding window of size δ(u) centered at $v_i$. The structural indices $\{C, CC, E\}$ are then concatenated with the embedding vector $z_i$ and, after standardization, fed into the downstream detection module:
$h_i = \mathrm{Normalize}\left([C(v_i), CC(v_i), e(v_i)] \,\|\, z_i\right).$
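For illustration, the random-walk and Skip-gram steps can be sketched with gensim as follows; the walk length, number of walks, and window size are placeholders rather than the λ(u) and δ(u) settings of our experiments, and the walks here traverse an undirected view of the graph.

```python
# Minimal DeepWalk sketch: truncated random walks + Skip-gram (sg=1) with gensim.
import random
import networkx as nx
from gensim.models import Word2Vec

def deepwalk_embeddings(G: nx.Graph, num_walks=10, walk_len=40, dim=128, window=5):
    walks = []
    nodes = list(G.nodes())
    for _ in range(num_walks):
        random.shuffle(nodes)
        for start in nodes:
            walk, cur = [start], start
            for _ in range(walk_len - 1):
                nbrs = list(G.neighbors(cur))
                if not nbrs:
                    break
                cur = random.choice(nbrs)
                walk.append(cur)
            walks.append([str(v) for v in walk])
    model = Word2Vec(walks, vector_size=dim, window=window, sg=1, min_count=0, workers=4)
    return {v: model.wv[str(v)] for v in G.nodes()}
```

The resulting 128-dimensional vectors would then be concatenated with the standardized structural indices, as in Equation (9).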

3.5. Feature Fusion

To alleviate the curse of dimensionality caused by high-dimensional feature fusion, we adopt UMAP to reduce the dimensionality of the text embedding $T \in \mathbb{R}^{N \times 768}$ and the node embedding $h \in \mathbb{R}^{N \times 128}$, respectively [39]. UMAP is a nonlinear dimensionality reduction algorithm based on manifold learning that aims to preserve the local and global structure of the data. Its core idea is to map high-dimensional data to a low-dimensional space through fuzzy topology and probabilistic optimization [34]. First, the data are standardized to eliminate the effects of different feature scales:
$X_{\mathrm{scaled}} = \dfrac{X - \min(X)}{\max(X) - \min(X) + \varepsilon},$
where $\min(X)$ and $\max(X)$ are the minimum and maximum values of each feature, respectively, and $\varepsilon = 1 \times 10^{-5}$ prevents division by zero. Then, for each sample $x_i$, the local connection probability to its $k$ nearest neighbors is calculated as
$\mu_{ij} = \exp\!\left(-\dfrac{\max(0,\, d(x_i, x_j) - \rho_i)}{\sigma_i}\right),$
where $d(x_i, x_j)$ is the distance between $x_i$ and $x_j$, $\rho_i$ is the distance to the nearest neighbor, and the parameter $\sigma_i$ is usually solved iteratively such that
$\sum_{j} \mu_{ij} = \log_2(k).$
The fuzzy set thus constructed represents the local structure of the high-dimensional data. In the low-dimensional space, we then seek an embedding $y_i \in \mathbb{R}^{16}$ whose similarity distribution is as consistent as possible with the high-dimensional fuzzy set $\mu_{ij}$. To this end, the probability distribution in the low-dimensional space is defined as
$\nu_{ij} = \dfrac{1}{1 + a \|y_i - y_j\|^{2b}},$
where the parameters $a$ and $b$ control the effect of distance on similarity and are usually estimated automatically by UMAP. Next, UMAP optimizes the low-dimensional embedding by minimizing the cross-entropy between the high- and low-dimensional distributions, with the objective function
$L = \sum_{i \neq j} \left[ \mu_{ij} \log \dfrac{\mu_{ij}}{\nu_{ij}} + (1 - \mu_{ij}) \log \dfrac{1 - \mu_{ij}}{1 - \nu_{ij}} \right].$
By minimizing the above objective function, UMAP maps high-dimensional data X to low-dimensional embedding Y, capturing the global data distribution while preserving the local structure. In our implementation, three key hyperparameters are configured: the number of neighbors N(p), the minimum distance between embedded points M(p), and the output dimensionality D(p). These settings are designed to balance the preservation of local neighborhood relationships and the compactness of the low-dimensional representation.
Based on the resulting embeddings, we obtain the reduced text embedding $f_i \in \mathbb{R}^{16}$ and the composite graph feature $h_i \in \mathbb{R}^{19}$ for each user, and finally fuse the reduced semantic embedding, composite graph features, and metadata representation into a joint feature vector, defined as $\psi_i = [M_i \,\|\, f_i \,\|\, h_i] \in \mathbb{R}^{67}$, providing strong data support for the subsequent framework decision.
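For illustration, the feature-level fusion can be sketched with the umap-learn library as follows; the UMAP hyperparameters mirror those examined in Section 4.4.4, and assembling the 19-dimensional graph feature as the 16-dimensional reduced DeepWalk embedding plus the three structural indices is an assumption consistent with the stated dimensions.

```python
# Sketch of UMAP reduction and vector concatenation into the 67-dim joint feature.
import numpy as np
import umap

def fuse_features(meta, text_emb, node_emb, indices):
    """meta: (N,32), text_emb: (N,768), node_emb: (N,128), indices: (N,3) -> (N,67)."""
    text_16 = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=16).fit_transform(text_emb)
    node_16 = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=16).fit_transform(node_emb)
    graph_19 = np.hstack([node_16, indices])        # 16 + 3 = 19 dims
    return np.hstack([meta, text_16, graph_19])     # 32 + 16 + 19 = 67 dims
```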

3.6. CatBoost Classification

To optimize the fusion efficiency of multi-source heterogeneous features, this paper uses CatBoost to make the final classification decision. To prevent information leakage, CatBoost adopts ordered target encoding based on the sample order when processing categorical features. The categorical components of the joint feature vector $\psi_i$ are encoded as follows:
$TE(\psi_i) = \dfrac{\sum_{j:\, \mathrm{order}(j) < \mathrm{order}(i)} y_j + \alpha}{\sum_{j:\, \mathrm{order}(j) < \mathrm{order}(i)} 1 + \beta},$
where $\alpha$ and $\beta$ are smoothing parameters and $\mathrm{order}(\cdot)$ denotes the position of a sample in the training order. Using only preceding samples to generate feature statistics effectively avoids the target leakage problem of traditional mean encoding. After this transformation, the original categorical features are mapped to numerical features, and the smoothing of the posterior probability makes subsequent model training more stable. CatBoost uses the gradient boosting tree algorithm to solve the binary classification task by iteratively optimizing the following objective function:
$L(F) = -\sum_{i=1}^{N} \left[ y_i \log\!\left(\dfrac{1}{1 + e^{-F(x_i)}}\right) + (1 - y_i) \log\!\left(1 - \dfrac{1}{1 + e^{-F(x_i)}}\right) \right],$
where $1/(1 + e^{-F(x_i)})$ is the sigmoid function, which maps the model output $F(x_i)$ to a prediction probability. In the initialization phase, the initial prediction is usually set to the log-odds of the positive class in the training set:
$F_0(x) = \log\!\left(\dfrac{p}{1 - p}\right),$
where $p$ is the proportion of positive samples. In each iteration, the negative gradient is first calculated as the residual, avoiding the prediction bias caused by the full-data dependence of traditional GBDT:
$g_i = y_i - \dfrac{1}{1 + e^{-F(x_i)}}.$
At the same time, the second derivative can be obtained by using the sigmoid function:
$h_i = \dfrac{1}{1 + e^{-F(x_i)}} \left(1 - \dfrac{1}{1 + e^{-F(x_i)}}\right).$
In the process of decision tree construction, the optimal splitting point is determined by second-order Taylor expansion, and the splitting gain formula is
$\mathrm{Gain} = \dfrac{1}{2} \left[ \dfrac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \dfrac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \dfrac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda} \right],$
where $\lambda$ is the regularization parameter, $I$ denotes the sample set of the current node, and $I_L$ and $I_R$ are the sample sets of the left and right child nodes, respectively. For each leaf node, the leaf weight is calculated using a Newton update:
$\gamma_j = \dfrac{\sum_{x_i \in R_j} g_i}{\sum_{x_i \in R_j} h_i + \lambda}.$
Finally, with learning rate $\eta^{(c)}$, the output $T(x)$ of the new tree is added to the current model:
$F(x) \leftarrow F(x) + \eta^{(c)} T(x).$
After several rounds of iterative optimization, the above process continuously reduces the value of the objective function, and finally realizes the efficient prediction of the binary classification task.
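For illustration, the decision layer can be sketched with the catboost and scikit-learn libraries as follows; the hyperparameter values are placeholders rather than those listed in Table 3, and X_fused and y denote the fused feature matrix and labels from the preceding steps.

```python
# Sketch of 10-fold cross-validated training of the CatBoost decision layer.
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate_catboost(X_fused, y):
    """Return mean and std of macro F1 over 10 stratified folds on the 67-dim features."""
    clf = CatBoostClassifier(iterations=500, learning_rate=0.05, depth=6,
                             loss_function="Logloss", verbose=False)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    scores = cross_val_score(clf, X_fused, y, cv=cv, scoring="f1_macro")
    return scores.mean(), scores.std()
```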

3.7. Global Structure

In Algorithm 1, tweet data serve as the input. In Step 2, a 32-dimensional metadata feature vector is extracted through a series of heuristic-based computations. In Step 3, a 768-dimensional textual embedding is derived using the DistilBERT language model. Step 4 employs the DeepWalk algorithm to learn a 128-dimensional structural representation (node embedding) for each user. Subsequently, Steps 5 through 9 compute three categories of centrality metrics—namely, closeness centrality, eigenvector centrality, and clustering coefficient—which are concatenated with the node embeddings to enhance structural characterization. To address the issue of high dimensionality, UMAP is applied to jointly reduce the dimensionality of the textual and structural embeddings. The resulting low-dimensional representations are fused via vector concatenation with the metadata features to form a unified feature vector, which is subsequently fed into a CatBoost classifier for model training and prediction.
Algorithm 1: Bot detection via the CB-MTE framework
Input: Twitter bot detection dataset $T = \{u_1, u_2, \ldots, u_n\}$
Output: Predicted labels $F(x_i)$ (0: human, 1: bot)
1: for each user $u_i$ in $T$ do
2:   metadata feature extraction: $M_i = f_{\mathrm{meta}}(u_i) \in \mathbb{R}^{32}$
3:   textual feature extraction: $f_i \in \mathbb{R}^{768}$ ← Equations (1)–(4)
4:   compute structural features for user $u_i$:
5:     $C(u_i), CC(u_i), e(u_i)$ ← Equations (5)–(7)
6:   graph feature extraction:
7:     $z_i = \mathrm{DeepWalk}(u_i) \in \mathbb{R}^{128}$ ← Equation (8)
8:   concatenate with the DeepWalk embedding:
9:     $h_i$ ← Equation (9)
10:  reduce dimensionality via UMAP:
11:    $f_i \in \mathbb{R}^{16}$, $h_i \in \mathbb{R}^{19}$ ← Equations (10)–(14)
12:  fuse via vector concatenation: $\psi_i = \mathrm{concat}\{M_i, f_i, h_i\} \in \mathbb{R}^{67}$
13:  predict label using the CatBoost classifier: $F(x_i)$ ← Equations (15)–(22)
14: end for

4. Experimental Results and Analysis

4.1. Dataset Preparation

The dataset used in this article is TwiBot-22, the largest and most comprehensive Twitter bot detection benchmark to date [40]. Due to the massive volume of data in TwiBot-22, five sub-datasets are used for the experiments, each evaluated separately, as summarized in Table 2.

4.2. Experimental Setup

4.2.1. Data Preprocessing

In this study, we use user data from the TwiBot-22 dataset and extract three types of information: user metadata, tweet content, and friend relationships. These data provide the necessary features and context for training and bot detection. To evaluate performance and generalization ability more comprehensively, 10-fold cross-validation is used. During cross-validation, the dataset is divided into 10 mutually exclusive subsets, each containing different user samples, ensuring that users in the training and validation sets are completely independent to avoid data leakage.
For tweet data, this paper adopts the preprocessing methods of word segmentation, removal of stop words, and filtering of special symbols. Firstly, the word_tokenize function in the nltk library is used to segment the English tweets. This method effectively splits the text into individual words and is able to recognize punctuation and other grammatical features. Next, stop words (e.g., common English words like ‘the’ and ‘and’), which usually do not carry useful information in the text analysis, are removed. In addition, in order to avoid the influence of noise in the analysis process, special symbols in the text (such as emojis, punctuation symbols, etc.) are filtered to ensure the purity of the analysis results.
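For illustration, this preprocessing can be sketched with nltk as follows; the lower-casing, URL removal, and the rule of keeping only alphabetic tokens are assumptions of the sketch rather than the exact filtering rules used in our pipeline.

```python
# Sketch of tweet preprocessing: tokenization, stop-word removal, symbol filtering.
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
STOP = set(stopwords.words("english"))

def preprocess(tweet: str):
    tweet = re.sub(r"http\S+", "", tweet)                         # drop URLs (assumption)
    tokens = word_tokenize(tweet.lower())                          # word segmentation
    return [t for t in tokens if t.isalpha() and t not in STOP]    # drop stop words/symbols
```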

4.2.2. Experimental Parameters

We provide our hyperparameter settings in Table 3 to facilitate reproducibility.

4.2.3. Baseline Methods

We compare CB-MTE with the following baselines:

  • BGSRD [41] combines a BERT pre-trained model with a graph convolutional network (GCN) by constructing heterogeneous graphs that integrate text semantics and social relations and jointly training the two components.
  • Botometer [42] uses more than 1000 features derived from user metadata, content, and interactions.
  • BotRGCN [7] builds heterogeneous graphs from Twitter networks and employs graph convolutional networks for user representation learning and Twitter bot detection.
  • SimpleHGN [43] achieves superior performance on the heterogeneous graph benchmark HGB by building a multi-graph neural network structure that fuses node features with heterogeneous information.
  • Lee et al. [44] realize efficient bot detection across social media platforms by extracting statistical features from user metadata and using a lightweight logistic regression model.
  • Deshmukh et al. [45] propose a social bot detection model that integrates GraphSage and BERT, enhancing detection accuracy by fusing graph structure and textual features, and demonstrates outstanding performance in experiments.
  • RGT [46], which stands for Relational Graph Converter, models the inherent heterogeneity in the Twitter domain to improve Twitter bot detection.
  • SGBot [47] extracts features from the user’s metadata and feeds them into a Random Forest classifier for scalable and generalizable bot recognition.
  • T5 [48] achieves state-of-the-art performance across a range of Natural Language Processing tasks by unifying all NLP tasks into a text-to-text format and by pre-training on large amounts of unlabeled data followed by fine-tuning.
  • BotDGT [49] is a hybrid model combining GNNs and Transformers, enabling dynamicity-aware detection of evolving social bots.

4.2.4. Evaluation Indicators

To evaluate framework performance, the following commonly used classification metrics are adopted: $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, $\mathrm{Precision} = \frac{TP}{TP + FP}$, $\mathrm{Recall} = \frac{TP}{TP + FN}$, and $F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$. Here, TP (true positive) is the number of social bots correctly detected, FN (false negative) is the number of social bots missed (misclassified as human), FP (false positive) is the number of human users incorrectly flagged as bots, and TN (true negative) is the number of human users correctly identified.
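For illustration, these four metrics can be computed with scikit-learn as follows, treating bots as the positive class.

```python
# Minimal evaluation helper for the four metrics (bots are the positive class, label 1).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    return {"accuracy": accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred),
            "recall": recall_score(y_true, y_pred),
            "f1": f1_score(y_true, y_pred)}
```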

4.3. Experimental Results and Comparative Analysis

The five sub-datasets of TwiBot-22 are trained with the proposed CB-MTE framework, and the results are shown in Table 4.
The results in Table 4 show that CB-MTE exhibits excellent cross-scenario detection capability on the five balanced sub-datasets of TwiBot-22: TwiBot_3 has a moderate scale of tweets and edges, and the framework achieves the highest accuracy of 0.856 and F1 score of 0.847, indicating that a moderate scale of social interaction data is conducive to multi-source feature fusion. The TwiBot_2 sub-dataset has the largest number of edges, yet the framework still maintains a high recall of 0.785, verifying the ability of the multi-source interaction module to parse dense social relations. Overall, CB-MTE achieves an average accuracy of 0.8214, precision of 0.7924, recall of 0.8254, and F1 score of 0.8084 across the five sub-datasets, demonstrating consistency and stability. This performance verifies the effectiveness of CB-MTE's multi-source heterogeneous feature fusion: it can accurately identify bot users in complex social media data and is particularly robust in detecting bot accounts.
This paper evaluates CB-MTE against ten representative baseline models on the TwiBot-22 dataset, as shown in Table 5, and the experiments show the following:
The CB-MTE framework comprehensively outperforms all baseline methods in terms of accuracy, recall, and F1 score, improves the F1 score by 23.34 percentage points over the current optimal graph model BotRGCN, and achieves a 120.8% relative improvement in F1 score over the traditional feature engineering method SGBot. This breakthrough performance stems from the deep interaction mechanism between metadata, text semantics, and social graph features, proving the effectiveness of multi-source heterogeneous feature fusion for the detection of complex adversarial scenarios.
In the graph neural model population, BotRGCN and RGT significantly outperform static metadata methods (such as SGBot) and plain text methods (such as T5), confirming that social relationship topology is the key clue for bot detection. However, the single graph structure model is limited by data sparsity, and its recall is 35.74% lower than that of CB-MTE, which highlights the necessity of multi-source complementarity.
Models based on manual features (e.g., SGBot) are generally ineffective at detecting new bots, and their average F1 score is 11.5% lower than that of the graph models. This indicates that traditional feature engineering struggles against feature camouflage enabled by large language models, while CB-MTE significantly improves its anti-camouflage capability through the collaboration between dynamic semantic parsing and social propagation pattern mining.

4.4. Robustness Experiment

4.4.1. Noise Robustness Experiment

To validate the robustness of the CB-MTE framework against data noise, this paper conducts systematic noise injection experiments on the five sub-datasets of TwiBot-22. The experiment adopts an additive Gaussian noise strategy applied to the fused joint feature matrix $X \in \mathbb{R}^{n \times 67}$ ($n$ is the number of samples; the 67 dimensions comprise 32 metadata features, a 16-dimensional text embedding, a 16-dimensional graph embedding, and 3 centrality indices). Independently sampled noise terms are added element-wise:
$X_{\mathrm{noisy}} = X + \upsilon, \quad \upsilon_{ij} \sim \mathcal{N}(0, \sigma^2),$
where $X$ denotes the original feature matrix and $\sigma$ is the noise standard deviation, set to {0, 0.05, 0.1, 0.2, 0.5} in the experiment, covering the no-noise baseline (σ = 0), mild disturbance (σ = 0.05), and strong interference (σ = 0.5) scenarios. Under each noise condition, the stability and generalization ability of the framework are evaluated through 10-fold cross-validation using the mean and standard deviation of accuracy, precision, recall, and F1 score. As shown in Figure 3, heatmaps visualize the performance changes of each sub-dataset under different σ values, and Figure 4 further reveals the trend of the framework's noise resistance through global mean analysis.
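For illustration, the noise injection step can be sketched as follows; the surrounding evaluation loop follows the 10-fold cross-validation procedure described above.

```python
# Sketch of additive Gaussian noise injection into the fused feature matrix.
import numpy as np

def add_noise(X, sigma, seed=0):
    """Return X perturbed by element-wise Gaussian noise with standard deviation sigma."""
    rng = np.random.default_rng(seed)
    return X + rng.normal(0.0, sigma, size=X.shape)

# e.g., for sigma in (0.05, 0.1, 0.2, 0.5): evaluate the classifier on add_noise(X_fused, sigma)
```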
The experimental results show that CB-MTE exhibits strong robustness under multi-dimensional noise interference. When the noise level increases from 0 to 0.5, the F1 score variation across sub-datasets ranges from 0.68% to 4.06% and the accuracy variation from 0.46% to 3.59%, while the global mean F1 score decreases by only 2.2% and accuracy by 1.51%, indicating that multi-source heterogeneous feature fusion mitigates noise propagation through feature diversity. Recall decreases by only 3.09%, highlighting the framework's ability to suppress missed detections, and precision drops by only 1.34%, the smallest decline among all metrics, indicating that its judgment of positive samples remains robust. Although this experiment does not test the noise resistance of each module's features separately (see the module-level analysis below), the stability of the fused features not only validates CB-MTE's practicality in noisy environments but also empirically supports multi-modal collaborative detection.
To further analyze the sensitivity of different feature modules in the CB-MTE framework to noise perturbation, this paper conducts a module-level robustness evaluation on the three types of features, namely metadata, text, and graph, based on the TwiBot_3 sub-dataset. The experiments adopt the same Gaussian noise injection strategy as that for the fused features, with the standard deviation σ set to {0, 0.05, 0.10, 0.20, 0.50}. Under the condition of keeping other training parameters consistent, noise perturbation tests are conducted on each type of feature separately. The performance changes of each module under the three metrics of AUC, accuracy, and F1 are shown in Figure 5.
The experimental results show that different types of features differ significantly in their robustness to additive Gaussian noise. The metadata module shows the smallest performance fluctuation across noise levels, with AUC decreasing from 0.889 to 0.873, accuracy from 0.807 to 0.793, and F1 by only 2.2%, demonstrating very strong stability; structured user metadata resists perturbation well and is insensitive to noise variation. The graph module ranks second in overall robustness: its AUC remains in the range of 0.83 to 0.802 across noise intensities, F1 decreases by only about 2.7%, and accuracy shows no significant change, staying between 0.768 and 0.739, indicating that graph structure features retain information well under local noise perturbation. In contrast, the text module is the most sensitive to noise, with the most pronounced performance decline: as the noise level increases from 0 to 0.50, AUC drops from 0.603 to 0.554, accuracy from 0.584 to 0.552, and F1 from 0.468 to 0.371. This reflects that text features, owing to the high dimensionality and sparsity of semantic representations, are easily disrupted by noise that destroys their word-vector structure, weakening the model's semantic discrimination ability.
In conclusion, the module-level robustness assessment results further confirm the effectiveness of the multi-source feature fusion strategy in CB-MTE: structured information and graph structure features have inherent advantages in the noise environment, while text features need to introduce strategies such as adversarial training, noise filtering, or semantic enhancement to further improve their robustness, thereby enhancing the reliability and stability of the overall system in actual complex social environments.

4.4.2. Robustness Testing of Structured Camouflage Attacks

To further verify the robustness of the proposed CB-MTE framework in real attack scenarios, we conduct a series of structured camouflage attack experiments. This attack strategy simulates the behavior of intelligent adversaries by modifying the structured features of certain bots in the test set to resemble those of normal users, thereby misleading the detection framework.
Specifically, we use the mean and standard deviation of the structured features of human users in the training set as a reference to camouflage a specified proportion of bot samples in the test set. The camouflage feature vectors are generated through Gaussian sampling to simulate the behavior of “imitating normal users.” We set different camouflage proportions (10%, 20%, 30%) and evaluate the performance fluctuation of the CB-MTE framework using 10-fold cross-validation for each proportion. The framework's performance under different camouflage proportions is shown in Figure 6.
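For illustration, the camouflage procedure can be sketched as follows; which columns constitute the structured features is passed in via `cols` and is left unspecified here, so this is an assumption-laden sketch rather than the exact attack implementation.

```python
# Sketch of the structured camouflage attack: selected bot samples in the test set
# have their structured-feature columns overwritten with Gaussian samples drawn
# from the human-user statistics of the training set.
import numpy as np

def camouflage(X_test, y_test, X_train, y_train, cols, ratio, seed=0):
    rng = np.random.default_rng(seed)
    human = X_train[y_train == 0][:, cols]
    mu, sd = human.mean(axis=0), human.std(axis=0)
    X_adv = X_test.copy()
    bot_idx = np.where(y_test == 1)[0]
    chosen = rng.choice(bot_idx, size=int(ratio * len(bot_idx)), replace=False)
    X_adv[np.ix_(chosen, cols)] = rng.normal(mu, sd, size=(len(chosen), len(cols)))
    return X_adv
```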
The experimental results show that the CB-MTE framework performs well in the absence of attacks, achieving an AUC of 0.9332, an accuracy of 0.8559, and an F1 score of 0.8466, which demonstrates the effectiveness of multi-source feature integration in user behavior modeling. However, as the camouflage ratio increases, the framework’s performance exhibits a noticeable decline.
When only 10% of bot samples are camouflaged, the AUC decreases to 0.9206 (a drop of 1.3%) and the F1 score drops to 0.8258 (a decrease of 2.5%), indicating that even small-scale structural perturbations already impair the framework’s discriminative ability. As the camouflage ratio rises to 20%, the AUC and F1 scores further decrease to 0.9089 and 0.8059, respectively. When the attack intensity increases to 30%, the AUC and F1 scores drop to 0.8957 and 0.7827, suggesting that structured attacks continuously erode the framework’s detection performance.
Nevertheless, even under the highest attack intensity, the CB-MTE framework still maintains an AUC above 0.89 and an F1 score above 0.78, reflecting its inherent robustness resulting from strong feature redundancy and complementarity in the multi-source fusion process. In summary, the structured camouflage attack experiment reveals the robustness boundary of the CB-MTE framework in adversarial environments and highlights its potential vulnerability when facing adversarial samples with intelligent camouflage capabilities.

4.4.3. Dimensional Analysis

In order to systematically evaluate the impact of embedding dimension on multimodal feature fusion, this study designs parametric sensitivity experiments for textual semantic embeddings and social graph node embeddings, respectively. UMAP is used to reduce the dimensionality of two types of high-dimensional embeddings. The target dimension d ∈ {8, 16, 32, 64, 128}, and cross-dimensional performance comparison is conducted based on the balanced subset of TwiBot-22 dataset. Experimental results are shown in Figure 7.
This experiment reveals the robustness of the framework to feature compression through systematic comparison of multi-dimensional feature spaces: 16-dimensional embedding becomes the optimal feature representation dimension by balancing information retention and feature redundancy suppression. The experiment shows that the framework has a strong tolerance to the variation of dimensional parameters (F1 score variation < 0.6%), and the 16-dimensional embeddings achieve peak accuracy (85.6%) and recall (86.7%), which verifies the full representation ability of medium and low dimensional feature space for the behavior pattern of social bots [38]. This finding provides a theoretical basis for the design of lightweight detection framework to avoid the risk of overfitting caused by blind pursuit of high-dimensional features.

4.4.4. UMAP Hyperparameter Analysis

To assess the robustness and stability of dimensionality reduction via UMAP, we conduct ablation experiments on its key hyperparameters: the number of neighbors (n_neighbors) and the minimum distance between embedded points (min_dist). The output dimensionality (n_components) is fixed at 16 for fair comparison. Table 6 presents the performance under different configurations.
We vary n_neighbors among 5, 15, and 30 to control the balance between local and global structure preservation. As shown in Table 6, all three settings yield consistently strong performance, with AUC values of 0.935 (A), 0.934 (B), and 0.935 (C), respectively. Other metrics such as accuracy, precision, recall, and F1 score also show marginal fluctuation. These results indicate that our framework is relatively insensitive to the specific value of n_neighbors, suggesting stable behavior under various local connectivity assumptions. To examine the influence of compactness in the low-dimensional space, we compare the performance under min_dist = 0.1 and min_dist = 0.5, keeping other parameters unchanged. The performance metrics between these two settings are nearly identical, with only negligible changes. This demonstrates that the model is robust against variations in min_dist and does not heavily rely on a specific embedding compactness.
The ablation results confirm that our framework maintains stable and reliable performance under different UMAP configurations, validating the robustness of our dimensionality reduction strategy.

4.4.5. Classifier Selection

In order to select a suitable decision model, this paper systematically evaluates five types of classical machine learning models—logistic regression (LR), Random Forest (RF), Decision Tree (DT), XGBoost, and CatBoost—and conducts experiments using fusion features, as shown in Figure 8.
Cross-model comparison experiments confirm that CatBoost, with its symmetric tree structure and dynamic gradient optimization mechanism, presents significant advantages in dealing with heterogeneous features such as discrete metadata and sparse social relations. Its classification performance outperforms the traditional models across the board, and it shows the lowest performance fluctuation across datasets. The algorithm effectively balances model complexity and generalization ability through adaptive regularization strategy, and its low variance characteristics meet the strict stability requirements of real-time detection systems, providing an efficient and reliable decision engine for the engineering deployment of multi-modal fusion detection frameworks.

4.5. Ablation Experiment Design

In order to evaluate the effectiveness of the generation features of each module in the framework, ablation experiments are conducted in this paper. Based on the evaluation results of the TwiBot-22 sub-dataset in Table 4, TwiBot_3 is significantly ahead of other sub-datasets in four core indicators, with outstanding comprehensive performance and balanced indicators, which could better help identify which modules contributed the most to the overall performance. Therefore, the TwiBot_3 dataset is used for experimental evaluation of different feature combinations. Four different ablation settings were designed in the experiment, as shown in Table 7:
Through multidimensional ablation experiment, this study verified the synergistic enhancement mechanism of heterogeneous feature fusion on social bot detection, as shown in Figure 9. Specific conclusions are as follows:
Validation of the effectiveness of the innovative features: After removing the 10 metadata metrics proposed in this paper (M − M*), the framework accuracy decreases from 0.806 to 0.797, the F1 score from 0.788 to 0.777, and the recall from 0.789 to 0.775. This suggests that the novel metadata metrics, such as device entropy, are critical for discriminating the dynamic camouflage behavior of adversarial bots, which is further supported by the concurrent decline in precision and recall after their removal. This finding indicates that the novel feature set strengthens the framework's cross-dimensional anomaly association detection and thereby its overall accuracy.
Single-modal limitation analysis: Although the single metadata model (M) presents an apparent balance between recall and accuracy, its F1 score in the adversarial subset has a significant gap of 7.5% compared with the full-modal framework, revealing the inherent defects of traditional detection methods relying on static single-dimensional data, especially in the face of adversarial feature obfuscation attacks.
Multimodal enhancement effect: After the metadata feature is fused with the text feature (M + T), the recall increases by 3.0% and the F1 score increases by 2.9%, confirming that the natural language feature can effectively identify the text style anomaly of the generative bot. After the metadata feature is fused with the topological feature (M + G), the recall increases by 5.9% and the F1 score increases by 3.9%, which highlights the advantages of social topological analysis for detecting large-scale bot clusters.
Multi-source collaboration advantage: The proposed CB-MTE framework (M + T + G) achieves an F1 score of 0.847, which is 7.5% higher than the single-mode benchmark, and the difference between accuracy and recall is reduced to 3.8 percentage points. This performance improvement is due to the metadata–text–graph topology triple check mechanism. The cross-verification of the three forms a dynamic defense system, which makes the framework robust to multi-source counterattack.
The experimental results show that the text feature expands the detection boundary of the framework under different attack scenarios by quantifying the content generation pattern deviation and the graph structure feature by analyzing the social relationship topological anomaly. The collaborative fusion of metadata and multi-source features significantly improves the generalization ability of the system to the new adversarial bot through the cross-dimensional inconsistency detection.

5. Conclusions

In order to address the limitations of single-source feature characterization and the challenges of multimodal feature camouflage faced by the bot detection task in social networks, this paper proposes a CB-MTE detection framework based on the heterogeneous fusion of metadata, text, and graph topology. The paper’s contributions can be seen in three aspects: theoretical innovation, method optimization, and practical validation. (1) At the theoretical level, we construct a behavioral-semantic-topological defense system to capture hardware and behavioral timing anomalies through dynamic metadata metrics (e.g., device entropy, mutation rate of tweets, etc.), extract context-sensitive semantic features based on lightweight DistilBERT to identify generative text disguises, and analyze social topology anomalies by combining graph embedding algorithms to form a multi-dimensional joint defense system. (2) On the technical level, we propose a feature-model bi-level fusion framework, which utilizes UMAP nonlinear dimensionality reduction to eliminate the cross-source dimensionality gap between text, graph topology, and metadata, and integrates CatBoost Gradient Boosting Tree and Ordered Objective Coding to achieve efficient decision-making, with an F1 score improvement of 23.34 percentage points over the single-source graph model BotRGCN. (3) In terms of application, experiments on the TwiBot-22 benchmark set demonstrate that the discrepancy between precision and recall of CB-MTE diminishes to 3.3 percentage points, with an accuracy of 82.14%, which is notably superior to all baseline models in Table 5. On the adversarial subset TwiBot_3, CB-MTE achieves an F1 score of 84.7%, representing a 317.8% relative improvement over T5 (20.27%), which verifies the strong adaptability of the framework to complex scenarios.
At the theoretical level, this paper demonstrates how multi-feature fusion strengthens the modeling of social entities; at the technical level, it provides a scalable solution for fake-account identification in social-network environments; and at the application level, it confirms the advantages of CB-MTE in detection performance and adversarial robustness. Future research should focus on key technologies such as heterogeneous feature fusion, lightweight graph computation, and adversarial-sample defense to build an intelligent detection system suited to the complex ecology of social networks. In addition, the structural principles of embodied intelligent systems in robotics offer promising inspiration for extending CB-MTE. Recent developments in autonomous transportation robots and adaptive impedance control systems [50,51] highlight the benefits of modular perception–action coupling, real-time environmental adaptation, and task-aware control, which could inform future improvements to the robustness and adaptability of social bot detection frameworks. Drawing on such architectural parallels, CB-MTE may evolve into a more generalizable, interactive detection paradigm capable of self-adjustment and deployment in dynamic, adversarial online ecosystems.

Author Contributions

Conceptualization, M.C.; methodology, M.C.; software, M.C.; validation, M.C., Y.X., T.H., C.L., and C.Z.; formal analysis, M.C.; investigation, M.C.; resources, M.C.; data curation, M.C.; writing—original draft preparation, M.C.; writing—review and editing, T.H. and Y.X.; visualization, M.C.; supervision, T.H. and Y.X.; project administration, M.C. and Y.X.; funding acquisition, Y.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China (314) and the State Key Laboratory Independent Project of China (2024-SKL-005).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. The data were obtained from Shangbin Feng and are available at https://github.com/LuoUndergradXJTU/TwiBot-22 (accessed on 31 October 2024) with the permission of Shangbin Feng.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Martini, F.; Samula, P.; Keller, T.R.; Klinger, U. Bot, or not? Comparing three methods for detecting social bots in five political discourses. Big Data Soc. 2021, 8, 1–13. [Google Scholar] [CrossRef]
  2. Hagen, L.; Neely, S.; Keller, T.E.; Scharf, R.; Vasquez, F.E. Rise of the machines? Examining the influence of social bots on a political discussion network. Soc. Sci. Comput. Rev. 2022, 40, 264–287. [Google Scholar] [CrossRef]
  3. Shahid, W.; Li, Y.; Staples, D.; Amin, G.; Hakak, S.; Ghorbani, A. Are you a cyborg, bot or human?—A survey on detecting fake news spreaders. IEEE Access 2022, 10, 27069–27083. [Google Scholar] [CrossRef]
  4. Kantepe, M.; Ganiz, M.C. Preprocessing framework for Twitter bot detection. In Proceedings of the 2017 International Conference on Computer Science and Engineering (UBMK), Antalya, Turkey, 5–8 October 2017; pp. 630–634. [Google Scholar]
  5. Ouni, S.; Fkih, F.; Omri, M.N. Bots and gender detection on Twitter using stylistic features. In Proceedings of the International Conference on Computational Collective Intelligence, Hammamet, Tunisia, 28–30 September 2022; Springer International Publishing: Cham, Switzerland, 2022; Volume 1653, pp. 650–660. [Google Scholar]
  6. Dukić, D.; Keča, D.; Stipić, D. Are you human? Detecting bots on Twitter Using BERT. In Proceedings of the 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), Sydney, NSW, Australia, 6–9 October 2020; pp. 631–636. [Google Scholar]
  7. Feng, S.; Wan, H.; Wang, N.; Luo, M. BotRGCN: Twitter bot detection with relational graph convolutional networks. In Proceedings of the 2021 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, New York, NY, USA, 8–11 November 2021; pp. 236–239. [Google Scholar]
  8. Peng, H.; Zhang, J.; Huang, X.; Hao, Z.; Li, A.; Yu, Z.; Yu, P.S. Unsupervised social bot detection via structural information theory. ACM Trans. Inf. Syst. 2024, 42, 1–42. [Google Scholar] [CrossRef]
  9. Aljabri, M.; Zagrouba, R.; Shaahid, A.; Alnasser, F.; Saleh, A.; Alomari, D.M. Machine learning-based social media bot detection: A comprehensive literature review. Soc. Netw. Anal. Min. 2023, 13, 20. [Google Scholar] [CrossRef]
  10. Wu, J.; Ye, X.; Mou, C. Botshape: A novel social bots detection approach via behavioral patterns. arXiv 2023, arXiv:2303.10214. [Google Scholar]
  11. Echeverra, J.; De Cristofaro, E.; Kourtellis, N.; Leontiadis, I.; Stringhini, G.; Zhou, S. LOBO: Evaluation of generalization deficiencies in Twitter bot classifiers. In Proceedings of the 34th Annual Computer Security Applications Conference, San Juan, PR, USA, 3–7 December 2018; pp. 137–146. [Google Scholar]
  12. Alarfaj, F.K.; Ahmad, H.; Khan, H.U.; Alomair, A.M.; Almusallam, N.; Ahmed, M. Twitter bot detection using diverse content features and applying machine learning algorithms. Sustainability 2023, 15, 6662. [Google Scholar] [CrossRef]
  13. Abreu, J.V.F.; Ralha, C.G.; Gondim, J.J.C. Twitter bot detection with reduced feature set. In Proceedings of the 2020 IEEE International Conference on Intelligence and Security Informatics (ISI), Arlington, VA, USA, 9–10 November 2020; pp. 1–6. [Google Scholar]
  14. Wang, Y.; Wu, C.; Zheng, K.; Wang, X. Social bot detection using tweets similarity. In Proceedings of the International Conference on Security and Privacy in Communication Systems, Singapore, 8–10 August 2018; Springer International Publishing: Cham, Switzerland, 2018; Volume 255, pp. 63–78. [Google Scholar]
  15. Kumar, S.; Garg, S.; Vats, Y.; Parihar, A.S. Content based bot detection using bot language model and bert embeddings. In Proceedings of the 2021 5th International Conference on Computer, Communication and Signal Processing (ICCCSP), Chennai, India, 24–25 May 2021; pp. 285–289. [Google Scholar]
  16. Yu, Z.; Bai, L.; Ye, O.; Cong, X. Social Bot Detection Method with Improved Graph Neural Networks. Comput. Mater. Contin. 2024, 78, 1773. [Google Scholar]
  17. Lin, H.; Chen, N.; Chen, Y.; Li, X.; Li, C. BotScout: A Social Bot Detection Algorithm Based on Semantics, Attributes and Neighborhoods. In Proceedings of the International Conference on Intelligent Computing, Tianjin, China, 5–8 August 2024; pp. 343–355. [Google Scholar]
  18. Lei, Z.; Wan, H.; Zhang, W.; Feng, S.; Chen, Z.; Li, J.; Zheng, Q.; Luo, M. BIC: Twitter Bot Detection with Text-Graph Interaction and Semantic Consistency. In Proceedings of the 2023 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; pp. 10326–10340. [Google Scholar]
  19. Kudugunta, S.; Ferrara, E. Deep neural networks for bot detection. Inf. Sci. 2018, 467, 312–322. [Google Scholar] [CrossRef]
  20. Heidari, M.; Zad, S.; Hajibabaee, P.; Malekzadeh, M.; HekmatiAthar, S.; Uzuner, O.; Jones, J.H. Bert model for fake news detection based on social bot activities in the COVID-19 pandemic. In Proceedings of the 2021 IEEE 12th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), New York, NY, USA, 1–4 December 2021; pp. 103–109. [Google Scholar]
  21. Anggrainingsih, R.; Hassan, G.M.; Datta, A. BERT based classification system for detecting rumours on Twitter. IEEE Trans. Comput. Soc. Syst. 2021; submitted. [Google Scholar]
  22. Heidari, M.; Jones, J.H. Using bert to extract topic-independent sentiment features for social media bot detection. In Proceedings of the 2020 11th IEEE Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), New York, NY, USA, 28–31 October 2020; pp. 542–547. [Google Scholar]
  23. Wang, W.; Wang, Q.; Zang, T.; Zhang, X.; Liu, L.; Yang, T.; Wang, Y. BotRGA: Neighborhood-Aware Twitter Bot Detection with Relational Graph Aggregation. In Proceedings of the International Conference on Computational Science, Seattle, WA, USA, 6–8 October; Springer Nature: Cham, Switzerland, 2024; Volume 14838, pp. 162–176. [Google Scholar]
  24. Sallah, A.; Agoujil, S.; Wani, M.A.; Hammad, M.; Maleh, Y.; El-Latif, A.A.A. Fine-tuned understanding: Enhancing social bot detection with transformer-based classification. IEEE Access 2024, 12, 118250–118269. [Google Scholar] [CrossRef]
  25. Sanh, V. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
  26. Perozzi, B.; Al-Rfou, R.; Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York City, NY, USA, 24–27 August 2014; pp. 701–710. [Google Scholar]
  27. Grover, A.; Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 855–864. [Google Scholar]
  28. Ribeiro, L.F.R.; Saverese, P.H.P.; Figueiredo, D.R. struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 385–394. [Google Scholar]
  29. Berriche, A.; Nair, M.; Yamani, K.M.; Adjal, M.; Bendaho, S.; Chenni, N.; Tayeb, F.; Bessedik, M. A Novel Hybrid Approach Combining Beam Search and DeepWalk for Community Detection in Social Networks. In Proceedings of the WEBIST, Rome, Italy, 15–17 November 2023; pp. 454–463. [Google Scholar]
  30. Leskovec, J. GraphSAGE: Inductive Representation Learning on Large Graphs; SNAP: Long Beach, CA, USA, 2017. [Google Scholar]
  31. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, QC, Canada, 3–8 December 2018; Volume 31, pp. 1–11. [Google Scholar]
  32. Zhang, T.L.; Niu, Y.F.; Ma, R.; Zhao, M.; Song, D.; Liu, H. Spam detection using Catboost integration algorithm. In Proceedings of the Second International Conference on Statistics, Applied Mathematics, and Computing Science, Nanjing, China, 25–27 November 2023; SPIE: Bellingham, WA, USA, 2023; Volume 12597, pp. 853–857. [Google Scholar]
  33. Ibrahim, A.A.; Ridwan, R.L.; Muhammed, M.M.; Abdulaziz, R.O.; Saheed, G.A. Comparison of the CatBoost classifier with other machine learning methods. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 738–748. [Google Scholar] [CrossRef]
  34. McInnes, L.; Healy, J.; Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv 2018, arXiv:1802.03426. [Google Scholar]
  35. Zhang, J.; Luo, Y. Degree centrality, betweenness centrality, and closeness centrality in social network. In Proceedings of the 2017 2nd International Conference on Modelling, Simulation and Applied Mathematics (MSAM2017), Bangkok, Thailand, 26–27 March 2017; pp. 300–303. [Google Scholar]
  36. Bonacich, P. Some unique properties of eigenvector centrality. Soc. Netw. 2007, 29, 555–564. [Google Scholar] [CrossRef]
  37. Li, Y.; Shang, Y.; Yang, Y. Clustering coefficients of large networks. Inf. Sci. 2017, 382, 350–358. [Google Scholar] [CrossRef]
  38. Wang, X.; Chen, K.; Wang, K.; Wang, Z.; Zheng, K.; Zhang, J. FedKG: A knowledge distillation-based federated graph method for social bot detection. Sensors 2024, 24, 3481. [Google Scholar] [CrossRef]
  39. Dehghan, A.; Siuta, K.; Skorupka, A.; Dubey, A.; Betlen, A.; Miller, D.; Xu, W.; Kamiński, B.; Prałat, P. Detecting bots in social-networks using node and structural embeddings. J. Big Data 2023, 10, 119. [Google Scholar] [CrossRef]
  40. Feng, S.; Tan, Z.; Wan, H.; Wang, N.; Chen, Z.; Zhang, B.; Zheng, Q.; Zhang, W.; Lei, Z.; Yang, S.; et al. Twibot-22: Towards graph-based twitter bot detection. Adv. Neural Inf. Process. Syst. 2022, 35, 35254–35269. [Google Scholar]
  41. Guo, Q.; Xie, H.; Li, Y.; Ma, W.; Zhang, C. Social bots detection via fusing bert and graph convolutional networks. Symmetry 2021, 14, 30. [Google Scholar] [CrossRef]
  42. Yang, K.C.; Ferrara, E.; Menczer, F. Botometer 101: Social bot practicum for computational social scientists. J. Comput. Soc. Sci. 2022, 5, 1511–1528. [Google Scholar] [CrossRef] [PubMed]
  43. Lv, Q.; Ding, M.; Liu, Q.; Chen, Y.; Feng, W.; He, S.; Zhou, C.; Jiang, J.; Dong, Y.; Tang, J. Are we really making much progress? revisiting, benchmarking and refining heterogeneous graph neural networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 1150–1160. [Google Scholar]
  44. Lee, K.; Eoff, B.; Caverlee, J. Seven months with the devils: A long-term study of content polluters on twitter. In Proceedings of the Fifth International Conference on Weblogs and Social Media, Barcelona, Spain, 17–21 July 2011; Volume 5, pp. 185–192. [Google Scholar]
  45. Deshmukh, A.; Moh, M.; Moh, T.S. Bot Detection in Social Media Using GraphSage and BERT. In Proceedings of the 2024 IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Bangkok, Thailand, 9–12 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 804–811. [Google Scholar]
  46. Feng, S.; Tan, Z.; Li, R.; Luo, M. Heterogeneity-aware twitter bot detection with relational graph transformers. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 3977–3985. [Google Scholar]
  47. Yang, K.C.; Varol, O.; Hui, P.M.; Menczer, F. Scalable and generalizable social bot detection through data selection. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 1096–1103. [Google Scholar]
  48. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. [Google Scholar]
  49. He, B.; Yang, Y.; Wu, Q.; Liu, H.; Yang, R.; Peng, H.; Wang, X.; Liao, Y.; Zhou, P. Dynamicity-aware social bot detection with dynamic graph transformers. In Proceedings of the IJCAI 2024, Jeju, Republic of Korea, 3–9 August 2024; pp. 5844–5852. [Google Scholar]
  50. Xu, Y.; Bao, R.; Zhang, L.; Wang, J.; Wang, S. Embodied intelligence in RO/RO logistic terminal: Autonomous intelligent transportation robot architecture. Sci. China Inf. Sci. 2025, 68, 150210. [Google Scholar] [CrossRef]
  51. Chen, Z.; Zhan, G.; Jiang, Z.; Zhang, W.; Rao, Z.; Wang, H.; Li, J. Adaptive impedance control for docking robot via Stewart parallel mechanism. ISA Trans. 2024, 155, 361–372. [Google Scholar] [CrossRef]
Figure 1. CB-MTE: Multi-source heterogeneous feature fusion framework.
Figure 2. Multi-dimensional topological representation of user relationships with centrality metrics and clustering patterns.
Figure 3. Robustness training results on the five sub-datasets.
Figure 4. Mean robustness results: color gradient (red → green), line width, and numerical annotations jointly present the framework's robustness under different noise levels.
Figure 5. Injection of noise into different modules.
Figure 6. Disguise attacks at different camouflage rates.
Figure 7. Multi-class dimension comparison diagram.
Figure 8. Comparison diagram of multi-class models.
Figure 9. Ablation experiment.
Table 1. Metadata feature introduction.

| Feature Type | Symbol | Description |
| --- | --- | --- |
| Account attributes: A_u | E_total | Total number of devices used by the user |
| | H_device* | Device entropy: −Σ_i p_i log p_i (p_i: usage proportion of device i) |
| | V | Whether the user is verified: V ∈ {0, 1} |
| | L_bio | Profile character length |
| | L_nick | Nickname character length |
| | L_user | Username character length |
| Behavior attributes: B_u | T | Total number of tweets |
| | I_tweet | Tweet audience index: T / F_e |
| | C_col | Total number of favorites |
| | R_tweet/act* | Tweet mutation rate: T / D_act (D_act: active days) |
| | R_col/act* | Collection mutation rate: C_col / D_act |
| | R_com/orig* | Original-tweet comment rate: C / T_orig |
| | R_com/tweet | Tweet comment ratio: C / T (C: comment count) |
| | L̄_tweet | Average tweet length: T̄ / T (T̄: total tweet character count) |
| | S_sim | Content similarity of tweets |
| | S_sim/day* | Single-day tweet similarity |
| | R_media | Media rate: M / T (M: media count) |
| | H_time* | Tweet time distribution entropy |
| | T̄_month | Average number of tweets per month |
| | D_recent | Number of days since the last tweet |
| | R_url | Link rate: L / T (L: link count) |
| Social attributes: S_u | F_o | Number of accounts the user follows |
| | F_e | Number of followers |
| | I_attn | Follow index: F_o / (F_o + F_e) |
| | I_pop | Popularity index: F_e / (F_o + F_e) |
| | R_mut* | Mutual-follow rate: M_mut / F_o (M_mut: mutual followers) |
| | R_fo/act* | Follow mutation rate: F_o / D_act |
| | R_fe/act* | Follower mutation rate: F_e / D_act |
| | I_interact | Engagement intensity: R + L_k (R: retweets, L_k: likes) |
| | R_fwd/like* | Retweet–like ratio: R / L_k |
| | R_fwd/tweet | Retweet rate: R / T |
| | R_like/tweet | Like rate: L_k / T |
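As a worked illustration of how a few of the Table 1 features could be computed, the sketch below derives the device entropy H_device, the tweet mutation rate R_tweet/act, and the follow index I_attn from a raw user record; the input field names and example values are assumptions, not the dataset's actual schema.

```python
# Illustrative computation of a few Table 1 metadata features; the inputs
# (device list, tweet count, active days, follow counts) are assumed fields.
import math
from collections import Counter

def device_entropy(devices):
    """H_device = -sum(p_i * log p_i) over the user's device-usage distribution."""
    counts = Counter(devices)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def tweet_mutation_rate(tweet_count, active_days):
    """R_tweet/act = T / D_act (tweets per active day)."""
    return tweet_count / max(active_days, 1)

def follow_index(follows, followers):
    """I_attn = F_o / (F_o + F_e)."""
    denom = follows + followers
    return follows / denom if denom else 0.0

print(device_entropy(["iPhone", "iPhone", "Android", "Web"]))  # ≈ 1.04 (natural log)
print(tweet_mutation_rate(1200, 300))                          # 4.0
print(follow_index(150, 50))                                   # 0.75
```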
Table 2. TwiBot-22 sub-datasets.

| | TwiBot_1 | TwiBot_2 | TwiBot_3 | TwiBot_4 | TwiBot_5 |
| --- | --- | --- | --- | --- | --- |
| Human | 5000 | 5000 | 5000 | 5000 | 5000 |
| Bot | 5000 | 5000 | 5000 | 5000 | 5000 |
| User | 10,000 | 10,000 | 10,000 | 10,000 | 10,000 |
| Tweet | 1,156,640 | 1,333,018 | 1,138,480 | 1,151,362 | 1,142,717 |
| Edge | 1,535,397 | 1,924,616 | 1,508,054 | 1,511,824 | 1,526,627 |
Table 3. Configuration of experimental parameters.

| Module | Parameter Name | Parameter Size |
| --- | --- | --- |
| DistilBERT | Sequence length L_max | 128 |
| DeepWalk | Number of random walks | 100 |
| | Random walk step λ(u) | 10 |
| | Window size δ(u) | 5 |
| CatBoost | Learning rate η(c) | 0.03 |
| | Iterative training times | 500 |
| | Tree depth | 6 |
| UMAP | n_neighbors N(p) | 15 |
| | min_dist M(p) | 0.1 |
| | n_components D(p) | 16 |
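One way to wire up the Table 3 settings is sketched below; the library choices (transformers, gensim for Skip-gram, umap-learn, catboost) and the Skip-gram embedding size are assumptions, since the original implementation is not reproduced here.

```python
# Hedged sketch of the Table 3 configuration; library choices are assumptions.
from transformers import AutoTokenizer
from gensim.models import Word2Vec
import umap
from catboost import CatBoostClassifier

# DistilBERT: maximum sequence length of 128 tokens.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encode = lambda text: tokenizer(text, truncation=True, max_length=128)

# DeepWalk: 100 random walks of length 10 per node, Skip-gram window of 5
# (vector_size=128 is an assumed embedding dimension, not given in Table 3).
deepwalk_cfg = {"num_walks": 100, "walk_length": 10}
train_skipgram = lambda walks: Word2Vec(walks, window=5, sg=1, vector_size=128)

# UMAP: reduce the fused features to a 16-dimensional manifold embedding.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=16)

# CatBoost: 500 symmetric trees of depth 6 with learning rate 0.03.
clf = CatBoostClassifier(iterations=500, depth=6, learning_rate=0.03)
```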
Table 4. Evaluation results on the TwiBot-22 sub-datasets.

| | TwiBot_1 | TwiBot_2 | TwiBot_3 | TwiBot_4 | TwiBot_5 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 0.8380 | 0.7770 | 0.8560 | 0.8180 | 0.8180 | 0.8214 |
| Precision | 0.8100 | 0.7450 | 0.8280 | 0.7880 | 0.7910 | 0.7924 |
| Recall | 0.8390 | 0.7850 | 0.8670 | 0.8200 | 0.8160 | 0.8254 |
| F1 | 0.8240 | 0.7640 | 0.8470 | 0.8040 | 0.8030 | 0.8084 |
Table 5. Comparison of detection performance of CB-MTE and baselines on the TwiBot-22 dataset.

| Model | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| BGSRD [41] | 0.7188 | 0.2255 | 0.1990 | 0.2114 |
| Botometer [42] | 0.4987 | 0.3081 | 0.6980 | 0.4257 |
| BotRGCN [7] | 0.7966 | 0.7480 | 0.4680 | 0.5750 |
| SimpleHGN [43] | 0.7672 | 0.7257 | 0.3290 | 0.4544 |
| Lee et al. [44] | 0.7628 | 0.6723 | 0.1965 | 0.3041 |
| Deshmukh et al. [45] | 0.7462 | – | – | 0.5169 |
| RGT [46] | 0.7647 | 0.7503 | 0.3010 | 0.4294 |
| SGBot [47] | 0.7508 | 0.7311 | 0.2432 | 0.3659 |
| T5 [48] | 0.7205 | 0.6327 | 0.1209 | 0.2027 |
| BotDGT [49] | 0.7933 | 0.7242 | 0.4846 | 0.5815 |
| CB-MTE | 0.8214 | 0.7924 | 0.8254 | 0.8084 |
Table 6. Results of UMAP under different parameter settings.

| Number | n_neighbors | min_dist | n_components | AUC | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A | 5 | 0.1 | 16 | 0.9350 | 0.8550 | 0.8290 | 0.8630 | 0.8460 |
| B | 15 | 0.1 | 16 | 0.9340 | 0.8560 | 0.8270 | 0.8670 | 0.8470 |
| C | 30 | 0.1 | 16 | 0.9350 | 0.8570 | 0.8290 | 0.8670 | 0.8470 |
| D | 15 | 0.5 | 16 | 0.9350 | 0.8560 | 0.8270 | 0.8670 | 0.8460 |
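The parameter sensitivity study in Table 6 could be reproduced with a small sweep like the following sketch; the `evaluate` helper (returning AUC, accuracy, precision, recall, and F1) and the fused-feature matrix `X_raw` are hypothetical placeholders.

```python
# Sketch of the Table 6 parameter sweep over UMAP settings A-D.
import umap

settings = {
    "A": dict(n_neighbors=5,  min_dist=0.1, n_components=16),
    "B": dict(n_neighbors=15, min_dist=0.1, n_components=16),
    "C": dict(n_neighbors=30, min_dist=0.1, n_components=16),
    "D": dict(n_neighbors=15, min_dist=0.5, n_components=16),
}
# for name, params in settings.items():
#     X_reduced = umap.UMAP(**params).fit_transform(X_raw)   # X_raw: fused features
#     print(name, evaluate(X_reduced))                        # hypothetical metrics helper
```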
Table 7. Ablation experiments for CB-MTE.

| Ablation Setting | Representation |
| --- | --- |
| w/o graph & text & M* | M − M* |
| w/o graph & text | M |
| w/o graph | M + T |
| w/o text | M + G |
| CB-MTE | M + T + G |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
