Social Bots Detection via Fusing BERT and Graph Convolutional Networks

Abstract: The online social media ecosystem is becoming increasingly polluted by fake information and by content posted by malicious users, which causes considerable harm. Existing social robot detection mostly relies on supervised classification over manually engineered features. However, these methods raise user privacy concerns, and they ignore hidden feature information: semi-supervised algorithms and graph features remain underused. In this work, we symmetrically combine BERT and GCN (Graph Convolutional Network) and propose BGSRD, a novel model that combines large-scale pre-training and transductive learning for social robot detection. BGSRD constructs a heterogeneous graph over the dataset and represents tweets as nodes using BERT representations. Following the text graph convolutional network approach, a single text graph is built for the whole corpus based on word co-occurrence and document-word relationships. The BERT and GCN modules can be jointly trained in BGSRD to get the best of both: large-scale pre-training on massive raw data, and transductive learning of joint representations in which label influence spreads between training data and unlabeled test data through graph convolution. Experiments show that BGSRD achieves better performance on a wide range of social robot detection datasets.


Introduction
The introduction of social media has made news content easier to consume [1]. However, the development of social media is a double-edged sword that also has negative effects. Unlike traditional media (newspapers, television and radio), social media has fostered the new trend of "fake news", in which news containing intentionally misleading information spreads quickly. The openness and sharing that characterize online social networks also enable the malicious activities of attackers, spammers and fraudsters. Social robots, which are easily exploited by attackers, are one of the most serious security threats in online social networks. A social robot is computer software that automatically generates content and interacts with humans on social media, imitating human behaviour and attempting to change it. The main goal of these social robots is to create an illusion: they are used to actively influence public opinion on social networks [2], to carry out political penetration [3], and to spread malicious content widely. These malicious social robots have a negative impact on popular social networks, and mainly on their human users.
At present, the detection of social robots on Twitter faces three main challenges. The first challenge is that it is difficult to fully extract features, because social robots are complex. To avoid being discovered, a social robot needs to pretend to be an ordinary user, so describing social robots accurately requires considering their characteristics and content from multiple angles. Many existing studies extract the features of social robots from a single angle only [4,5], which cannot fully describe them. Other works study the features of social robots from several perspectives [6,7], but build their detection models with only a small number of features. The second challenge is that it is difficult to obtain large-scale labeled research datasets from Twitter. Because research on social robot detection is relatively scarce, large-scale reliable datasets are lacking on Twitter; manual labeling and proofreading requires substantial expertise and takes a lot of time. Most existing studies are therefore based on small-scale datasets [4,7,8], and scaling up datasets accurately and efficiently remains a great challenge for current research. The third challenge is that classical detection methods do not perform well when detecting social robots on Twitter. Previous work has improved detection performance by using machine learning methods [4,9], but there is still much work to be done. Therefore, high-performance detection methods based on deep neural networks need to be developed further.
BERT learns the semantic information of text in advance on large-scale corpora and is then fine-tuned on the Twitter dataset to learn the distribution characteristics of the Twitter data, which mitigates the lack of a large-scale dataset. GCN, in turn, is well suited to capturing and learning the propagation and co-occurrence relationships among tweets, and can learn the complex features of Twitter robots along multiple dimensions.
As shown in Figure 1, this work proposes a new model, BGSRD, which symmetrically combines BERT and GCN for the detection of social robots. The model combines the advantages of large-scale pre-training and transductive learning. BGSRD constructs a heterogeneous graph over the corpus with word and document nodes, initializes the node embeddings with pre-trained BERT, and uses a GCN for robot classification. Jointly training the BERT and GCN modules yields a model that takes advantage of the best of both worlds: (1) large-scale pre-training on massive raw data; and (2) transductive learning, which learns representations for both training data and unlabeled test data by propagating label influence along the edges of the graph. Combining the pre-trained model with a graph neural network addresses the three challenges above. BGSRD thus successfully combines the power of large-scale pre-training with that of graph networks, and obtains better performance on a wide range of social robot detection datasets. The major contributions of this work are as follows:
• We combine the pre-trained language model BERT with Graph Convolutional Networks to detect social bots;
• We fuse semantic information by applying BERT multi-head attention, generating a better-integrated representation for each text;
• We adopt a novel graph neural network method to detect social robots: a heterogeneous graph is built over the whole corpus, and a graph neural network learns word and document embeddings from it.

Related Works
With the spread of robot accounts on social networks, a growing number of studies address social robot account detection. Existing methods can be divided into the following categories [10]: crowdsourcing-based detection platforms, detection based on traditional machine learning, detection based on deep learning, and detection based on social network graphs, among others.

Crowdsourcing Social Machine Account Detection Platform
Reference [11] proposes a crowdsourcing platform for social machine account detection. It assumes that detecting machine accounts is a relatively simple task for human beings, and therefore creates an online Turing test platform. A large number of workers and experts are employed to examine account data from Facebook and Renren; the same account data are given to multiple workers, and the majority opinion is taken as the final judgment.
However, its disadvantages are also obvious. This approach may have been feasible in the early days of social networking, but its cost is unrealistic for established social networking platforms. The number of users on the mainstream social platforms has grown explosively in the past few years. For example, the number of monthly active users of Twitter reached 336 million in 2019, which was an increase of 2.5 times compared with 2012 [12]. Given such a massive number of users and volume of daily data, a costly and inefficient service of this kind is not applicable: the scheme can only remain at the theoretical and experimental stage and cannot be put into practical use.
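The majority-vote aggregation used by such a crowdsourcing platform can be illustrated with a minimal sketch. The account identifiers, labels and tie-handling rule below are hypothetical and are not taken from Reference [11].

```python
from collections import Counter

def majority_label(votes):
    """Return the most common label among worker votes, or None on a tie/no votes."""
    counts = Counter(votes).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: assumed to be escalated, e.g. to an expert reviewer
    return counts[0][0]

# Hypothetical judgments: account id -> labels given by independent workers.
judgments = {
    "account_a": ["bot", "bot", "human"],
    "account_b": ["human", "human", "human"],
    "account_c": ["bot", "human"],
}
final = {account: majority_label(votes) for account, votes in judgments.items()}
```

Every account needs several independent judgments, which is why the cost of this scheme grows with the number of accounts.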

Detection Technology Based on Machine Learning
Machine learning is the most common approach to machine account detection and is currently the mainstream detection technology. In essence, it treats detection as a binary classification problem. After the required features are extracted from accounts, a classification algorithm is trained on the data to produce a detection model. The model is then applied to the data of each account to be classified.
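The pipeline of feature extraction, training and classification can be sketched as follows. This is only an illustration under stated assumptions: the three features, the toy accounts and the choice of logistic regression are hypothetical and do not correspond to any specific method cited here.

```python
import math

def extract_features(account):
    """Map a raw account record to a numeric feature vector (hypothetical features)."""
    total = account["followers"] + account["friends"] + 1
    return [
        min(account["tweets_per_day"], 100) / 100.0,  # capped posting rate
        account["followers"] / total,                 # follower share
        1.0 if account["default_profile_image"] else 0.0,
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the logistic loss; returns weights and bias."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def bot_probability(account, w, b):
    x = extract_features(account)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def make(tweets, followers, friends, default_image):
    return {"tweets_per_day": tweets, "followers": followers,
            "friends": friends, "default_profile_image": default_image}

# Toy labeled accounts (1 = bot, 0 = human); values are made up for illustration.
training = [
    (make(90, 10, 900, True), 1), (make(60, 30, 500, True), 1),
    (make(80, 5, 1000, False), 1), (make(3, 300, 200, False), 0),
    (make(10, 150, 180, False), 0), (make(1, 50, 60, True), 0),
]
X = [extract_features(account) for account, _ in training]
y = [label for _, label in training]
w, b = train_logistic_regression(X, y)

suspect = make(95, 3, 800, True)
regular = make(2, 400, 250, False)
```

Calling `bot_probability(suspect, w, b)` and `bot_probability(regular, w, b)` then scores two unseen accounts with the trained model.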

Detection Technology Based on Deep Learning
With the rapid development of deep learning, more and more studies have applied it to machine account detection. Deep learning is a branch of machine learning that uses artificial neural networks as the basic framework for data representation learning [13]. Unlike traditional machine learning, deep learning needs more data and time to train a model. At the same time, it can use unsupervised or semi-supervised feature learning and hierarchical feature extraction algorithms to replace manual feature engineering [14], which saves time and can uncover hidden features.
LSTM (Long Short-Term Memory) is a kind of recurrent neural network (RNN), first published in 1997 [15]. It was designed specifically to address the long-term dependency problem of general RNNs. LSTMs are suitable for processing and predicting events with long intervals and delays in time series, and are now often used as components of larger deep neural networks. Researchers in machine account detection also use LSTM in related experiments and projects [16,17]. In [16], a CNN (convolutional neural network) and an LSTM network are combined for machine account detection. The CNN is used to extract the features and relations of the tweet text content. The second layer treats the Twitter metadata as temporal information and feeds it to the LSTM to extract the temporal characteristics of users' social activities. Finally, in the feature fusion layer, the content features and metadata features are fused to detect machine accounts and obtain the final detection result.
Reference [17] used Twitter content and metadata to detect machine accounts at the level of individual tweets: contextual features are extracted from user metadata and provided as auxiliary input to the LSTM network that processes the tweet text. The model needs only one tweet to determine whether it comes from a machine account. Reference [18] used the BiLSTM (Bi-directional Long Short-Term Memory) algorithm to detect machine accounts. A BiLSTM network consists of two LSTMs that run in opposite directions. The model takes the content of tweets as input, which enters the BiLSTM network after word embedding. Finally, the outputs of the forward LSTM and the backward LSTM are concatenated, and a normalization function is used for classification to obtain the detection result. This model uses only the content of tweets as input and no other features. The advantage of this method is that it saves a lot of feature extraction time, needs neither manual features nor prior knowledge, improves work efficiency, and is more convenient to deploy for batch detection. Reference [19] proposed a two-stage, graph-based machine account detection system that uses both supervised and unsupervised learning. Reference [20] uses incremental learning to process data in real time. Although the model takes longer to converge, the final model achieves superior classification performance and is suitable for stream-based detection systems.
Similarly, the detection technology based on deep learning also has its disadvantages. When the dataset is not large enough, the effect of the neural network is often poor and the phenomenon of over-fitting easily occurs.

Detection Technology Based on Social Graph
Detection based on a social graph relies on the social network graph formed between users in the social network. The social network graph can be used to understand and analyze the relationships between users on the social network platform, so this kind of detection focuses on those relationships. After all, no account in a social network exists in isolation; accounts are all connected to each other. The social graphs of normal users and machine accounts are often very different. For example, a large share of a normal user's connections are real-life friends, who follow each other and interact often. Machine accounts, on the other hand, do not have such features. They have fewer mutual friends, which is obvious in the social graph. They also receive fewer comments and likes, and most of them tweet or retweet to expand their influence. The proportion of friends also differs between normal users and machine accounts. Therefore, the structure of the social graph of normal users is significantly different from that of machine accounts, and detection schemes based on the social graph use this difference, together with the network characteristics of users, to identify and detect machine accounts.
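One graph feature of this kind, the share of mutual follow relationships, can be computed from a follow graph as in the sketch below. The account names and the adjacency structure are hypothetical, and real systems combine many such features.

```python
def reciprocity(follows, user):
    """Fraction of the accounts that `user` follows which also follow `user` back."""
    following = follows.get(user, set())
    if not following:
        return 0.0
    mutual = sum(1 for other in following if user in follows.get(other, set()))
    return mutual / len(following)

# Hypothetical follow graph: account -> set of accounts it follows.
follows = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol"},
    "carol": {"alice", "bob"},
    "bot_1": {"alice", "bob", "carol", "dave"},
    "dave": {"alice"},
}
scores = {user: reciprocity(follows, user) for user in follows}
```

In this toy graph the three mutually connected users obtain a reciprocity of 1.0, while the account that follows many users without being followed back obtains 0.0.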
SybilRank [21] is an example of this approach. An adversary can control multiple social machine accounts (often referred to as Sybils in this context) to impersonate different identities and launch attacks or infiltrations. Proposed strategies for detecting Sybil accounts often rely on examining the structure of the social graph. For example, SybilRank assumes that Sybil accounts have only a small number of links to legitimate users and connect primarily to other Sybil accounts, because they need a large number of social connections to appear trustworthy. This property can be used to identify densely interconnected groups of social machine accounts. In addition, SybilWalk [22], GANG [23], SybilSCAR [24], and SybilFuse [25] are all machine account detection methods based on social interaction graphs.

Methods
The proposed BGSRD model uses BERT to initialize the representations of document nodes in the text graph, and these representations are used as the input to the GCN. The GCN then iteratively updates the document representations, which correspond to the texts posted by social robots, based on the graph structure. Its output is the final representation of each document node, which is sent to a softmax classifier for prediction. In this way, we exploit the complementary advantages of the pre-trained model and the graph model. (Code for replicating the experiments is available at https://github.com/shanmon110/BGSRD (accessed on 22 November 2021).)

Textual Representation via BERT
In essence, BERT provides better feature representations for words by running a self-supervised learning method on a massive corpus. As shown in Figure 2, BERT further increases the generalization ability of word embedding models and fully describes character-level, word-level, sentence-level and even inter-sentence relationship features. BERT is trained with multi-task objectives. The first is the MLM (Masked Language Model) objective, which is similar to a cloze test: all position information is still visible, but the words to be predicted are replaced by special symbols, which allows bidirectional encoding. BERT uses the Transformer as its encoder to capture context; compared with BiLSTM, the Transformer can be stacked deeper and parallelizes better. In addition, the Transformer is less affected by mask tokens than LSTM: it only needs to reduce the weight of the mask tokens through self-attention, whereas LSTM behaves like a black box and it is difficult to determine how it processes mask tokens internally. The second objective is NSP (Next Sentence Prediction), which learns sentence and sentence-pair relationship representations through sentence-level negative sampling. Given a sentence, the positive example is the sentence that actually follows it and the negative example is a randomly sampled sentence; the model performs sentence-level binary classification (that is, it judges whether a sentence is the next sentence of the current sentence or noise), similar to the word-level negative sampling of word2vec.

Figure 2. BERT input-output representation [26]. We use BERT to generate word embeddings.
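The cloze-style masking behind the MLM objective can be sketched as follows. This is a simplified illustration: it replaces every selected token with a mask symbol, whereas the actual BERT procedure also sometimes keeps or randomly replaces selected tokens. The function name, the 15% default and the example sentence are assumptions made for the sketch.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens by MASK and record the tokens to predict."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = token  # the model must recover this token
        else:
            masked.append(token)
    return masked, targets

tokens = "follow me for free crypto giveaways today".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
```

The sequence length and all positions are preserved, so a bidirectional encoder can use the context on both sides of each masked position.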

TextGCN
In order to model global word co-occurrence explicitly, a large heterogeneous text graph containing word nodes and document nodes is constructed, as shown in Figure 3, so that graph convolution can be easily adapted. The number of nodes in the text graph, $|V|$, is the number of documents (corpus size) plus the number of unique words (vocabulary size) in the corpus. As the input of Text GCN, every word or document is represented as a one-hot vector, that is, the feature matrix is simply set to the identity matrix, $X = I$. Edges between nodes are built from word occurrence in documents (document-word edges) and word co-occurrence in the whole corpus (word-word edges). The weight of the edge between a document node and a word node is the term frequency-inverse document frequency (TF-IDF) of the word in the document, where the term frequency is the number of times the word appears in the document and the inverse document frequency is the logarithmically scaled inverse fraction of the number of documents that contain the word. Using the TF-IDF weight is better than using term frequency only. To make use of global word co-occurrence information, we use a fixed-size sliding window over all documents in the corpus to collect co-occurrence statistics. We calculate the weight between two word nodes by PPMI (positive point-wise mutual information), a popular measure of word association. In our preliminary experiments, PMI produced better results than using word co-occurrence counts. Formally, the weight of the edge between node $i$ and node $j$ is defined as

$$A_{ij} = \begin{cases} \mathrm{PPMI}(i, j), & i, j \text{ are words and } i \neq j \\ \text{TF-IDF}_{ij}, & i \text{ is a document}, j \text{ is a word} \\ 1, & i = j \\ 0, & \text{otherwise} \end{cases}$$

The PPMI value of a word pair $i, j$ is $\mathrm{PPMI}(i, j) = \max(\mathrm{PMI}(i, j), 0)$, with

$$\mathrm{PMI}(i, j) = \log \frac{p(i, j)}{p(i)\,p(j)}, \qquad p(i, j) = \frac{\#W(i, j)}{\#W}, \qquad p(i) = \frac{\#W(i)}{\#W},$$

where $\#W(i)$ is the number of sliding windows in the corpus that contain word $i$, $\#W(i, j)$ is the number of sliding windows that contain both words $i$ and $j$, and $\#W$ is the total number of sliding windows in the corpus. A positive PMI value implies a high semantic correlation of words in the corpus.
A negative PMI value indicates little or no semantic correlation in the corpus. Therefore, we only add edges between word pairs with positive PMI values. After building the text graph, we feed it into a simple two-layer GCN as in Reference [27]; the second-layer node (word/document) embeddings have the same size as the label set and are fed into a softmax classifier:

$$Z = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W_0\right) W_1\right),$$

where $\hat{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is the same as in Equation (1), with $D$ the degree matrix of $A$, and $\mathrm{softmax}(x_i) = \frac{1}{Z}\exp(x_i)$ with $Z = \sum_i \exp(x_i)$. The loss function is defined as the cross-entropy error over all labeled documents:

$$\mathcal{L} = -\sum_{d \in \mathcal{Y}_D} \sum_{f=1}^{F} Y_{df} \ln Z_{df},$$

where $\mathcal{Y}_D$ is the set of indices of the documents that have labels, $F$ is the dimension of the output features, which is equal to the number of classes, and $Y$ is the label indicator matrix. The weight parameters $W_0$ and $W_1$ can be trained by gradient descent. Figure 3 is a schematic diagram of the overall Text GCN model.

Figure 3. Schematic of Text GCN [28]. Nodes that begin with "O" are document nodes, and the others are word nodes. Thick black edges are document-word edges and thin grey edges are word-word edges. R(x) denotes the representation (embedding) of x. Different colours represent different document classes (only four sample classes are shown to avoid clutter). CVD: cardiovascular diseases; Neo: neoplasms (tumours); Resp: respiratory diseases; Immun: immune diseases.
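The construction of the text-graph edge weights can be sketched as follows. This is an illustrative sketch rather than the released implementation: the helper names, the window size, the toy documents and the exact TF-IDF variant (relative term frequency times log(N/df)) are assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(docs, window_size=3):
    """Word-word edges weighted by PMI; only positive values are kept."""
    windows = []
    for tokens in docs:
        if len(tokens) <= window_size:
            windows.append(tokens)
        else:
            windows.extend(tokens[i:i + window_size]
                           for i in range(len(tokens) - window_size + 1))
    n_windows = len(windows)
    word_count, pair_count = Counter(), Counter()
    for window in windows:
        unique = sorted(set(window))
        word_count.update(unique)                    # #W(i)
        pair_count.update(combinations(unique, 2))   # #W(i, j)
    edges = {}
    for (a, b), n_ab in pair_count.items():
        # log( p(i,j) / (p(i) p(j)) ) with p(.) = count / #W
        pmi = math.log(n_ab * n_windows / (word_count[a] * word_count[b]))
        if pmi > 0:
            edges[(a, b)] = pmi
    return edges

def tfidf_edges(docs):
    """Document-word edges weighted by TF-IDF (one common variant)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    edges = {}
    for d, tokens in enumerate(docs):
        for word, count in Counter(tokens).items():
            idf = math.log(n_docs / doc_freq[word])
            edges[(d, word)] = (count / len(tokens)) * idf
    return edges

# Toy tokenized tweets, made up for illustration.
docs = [["free", "bitcoin", "giveaway"],
        ["free", "bitcoin", "now"],
        ["lunch", "with", "friends"]]
word_word = pmi_edges(docs)
doc_word = tfidf_edges(docs)
```

Word pairs that never share a window receive no edge, and word pairs are stored with the two words in sorted order, so each undirected edge appears once.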

A two-layer GCN allows message passing among nodes that are at most two steps away. Therefore, although there are no direct document-document edges in the graph, the two-layer GCN allows information exchange between pairs of documents. In our preliminary experiments, a two-layer GCN performed better than a single-layer GCN, while adding more layers did not improve performance. These results are similar to those in [27,29].
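The two-hop behaviour can be checked on a toy graph with two documents that share a single word. The sketch below implements the normalized adjacency and a two-layer GCN forward pass in plain Python; the graph, the weight values and the helper names are arbitrary assumptions chosen for illustration.

```python
import math

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2}; A must contain self-loops."""
    deg = [sum(row) for row in A]
    n = len(A)
    return [[A[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(P, Q):
    return [[sum(p * Q[k][j] for k, p in enumerate(row))
             for j in range(len(Q[0]))] for row in P]

def relu(M):
    return [[max(0.0, v) for v in row] for row in M]

def softmax_rows(M):
    out = []
    for row in M:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def gcn_forward(A_hat, X, W0, W1):
    """Z = softmax(A_hat ReLU(A_hat X W0) W1)."""
    H = relu(matmul(A_hat, matmul(X, W0)))
    return softmax_rows(matmul(A_hat, matmul(H, W1)))

# Nodes: 0 = doc0, 1 = shared word, 2 = doc1. No doc-doc edge; diagonal = self-loops.
A = [[1.0, 1.0, 0.0],
     [1.0, 1.0, 1.0],
     [0.0, 1.0, 1.0]]
A_hat = normalize_adjacency(A)
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot features, X = I

one_hop = matmul(A_hat, X)         # a single propagation step
two_hop = matmul(A_hat, one_hop)   # two propagation steps

W0 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]  # arbitrary untrained weights
W1 = [[0.7, -0.5], [-0.3, 0.8]]
Z = gcn_forward(A_hat, X, W0, W1)
```

After one propagation step, the entry `one_hop[2][0]` is zero, so doc1 has received nothing from doc0; after two steps, `two_hop[2][0]` is positive because information has travelled through the shared word node.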

Interpolating BERT and GCN Predictions
In practice, we find that optimizing BGSRD with an auxiliary classifier that operates directly on the BERT embeddings leads to faster convergence and better performance. Specifically, we build the auxiliary classifier by feeding the document embeddings (denoted by $X$) directly into a dense layer with softmax activation: $Z_{\mathrm{BERT}} = \mathrm{softmax}(WX)$.
The GCN prediction is $Z_{\mathrm{GCN}} = \mathrm{softmax}(g(X, A))$, where $g$ denotes the GCN model. The BERT and GCN parameters are jointly optimized using the cross-entropy loss over the labeled document nodes. The final training objective is the linear interpolation of the GCN-based prediction from BGSRD and the prediction from BERT, given by

$$Z = \lambda Z_{\mathrm{GCN}} + (1 - \lambda) Z_{\mathrm{BERT}},$$

where $\lambda$ controls the trade-off between the two objectives. $\lambda = 1$ corresponds to the complete BGSRD model and $\lambda = 0$ to the BERT module only. When $\lambda \in (0, 1)$, the predictions of the two models are balanced, and the BGSRD model can be better optimized.
Interpolation improves performance for the following reason: $Z_{\mathrm{BERT}}$ operates directly on the input of the GCN, which ensures that the GCN input is regulated and optimized towards the target. This helps the multi-layer GCN model overcome intrinsic drawbacks such as vanishing gradients or over-smoothing [29], and thus leads to better performance.
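The interpolation of the two predictions can be illustrated numerically. The sketch assumes the interpolation has the form Z = λ·Z_GCN + (1 − λ)·Z_BERT, which matches the stated endpoints (λ = 1 for the GCN-based prediction, λ = 0 for BERT only); the logits and function names are made up for the example.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def interpolate_predictions(z_gcn, z_bert, lam):
    """Convex combination of two class-probability vectors."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * g + (1.0 - lam) * b for g, b in zip(z_gcn, z_bert)]

# Hypothetical logits for one document over the classes (bot, human).
z_gcn = softmax([2.0, 0.5])
z_bert = softmax([0.2, 1.0])
z = interpolate_predictions(z_gcn, z_bert, lam=0.8)
```

Because both inputs are probability vectors and the weights sum to one, the interpolated output is again a probability vector, so the same cross-entropy loss can be applied to it.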

Datasets
We ran experiments on five widely used social bot detection benchmarks: cresci-rtbust [30], botometer-feedback [31], gilani [32], cresci-stock-2018 [33,34], and midterm [35]. These datasets share the same format, including crawling time, user profile, description, followers, location, URL, and so forth. We have released these datasets with our code on GitHub (https://github.com/shanmon110/BGSRD (accessed on 22 November 2021)). The datasets are summarized in Table 1. The difference between the number of accounts used and the original number is caused by removing invalid accounts from the datasets. Stefano Cresci's research team and the authors of Reference [35] have collected many datasets, which are of great help to the study of machine accounts on social networks.

Baselines
Cresci-rtbust [30]: A technique for detecting retweeting social robots that only needs the timestamps of retweets for each analyzed account, so there is no need to provide a complete user timeline or social graph.
Botometer [31]: A popular robot detection tool developed at Indiana University. Botometer is based on Random Forest classifiers; given a Twitter account, Botometer extracts over 1000 features of the account from data readily available through the Twitter API, and produces a classification score called the bot score: the higher the score, the greater the likelihood that the account is controlled completely or in part by software.
gilani: Reference [36] describes three methods with which to conduct experiments on gilani; we compare against these three methods as baselines. gilani has two main parts: a bot and an analyser. The bot fetches a trending topic or a popular tweet and breaks down the information in it, and the analyser is then used for analysis.
cresci-stock: References [33,34] proposed a method for detecting social robots in the financial domain. cresci-stock studies tweets related to the stocks of the five main financial markets in the US, together with bot detection techniques.
midterm [35,37]: A framework that uses minimal account metadata, enabling efficient analysis that scales to processing the full public Twitter stream in real time.

Experimental Setup
For BERT and RoBERTa, we use the output feature of the [CLS] token as the document embedding, followed by a feedforward layer that produces the final prediction. BGSRD is implemented using BERT-base and a two-layer GCN. The learning rate is initialized to $1 \times 10^{-3}$ for the GCN module and $1 \times 10^{-3}$ for the fine-tuned BERT module. We also implement our model with RoBERTa and GAT (Graph Attention Network) [38] variants. GAT variants are trained on the same graph as GCN variants, but they learn edge weights through the attention mechanism instead of using a predefined weight matrix. The BERT input length is set to 18, the batch size to 128, and the dimension of the GCN hidden layer to 200. The number of attention heads for GAT is set to 8, and the dropout rate is kept at its default value of 0.5. Parameters are updated using the Adam optimizer.

Results and Analysis
The bot detection results are shown in Tables 2-6. BGSRD achieves the best detection performance because it uses BERT together with a GNN for feature extraction; on most evaluation metrics, BGSRD outperforms the other competitors. One of the expected advantages of using a GNN is its ability to extract informative features from our referral time series. The second-best overall results are distributed among the other models. Interestingly, most of the evaluated techniques obtain their worst results in terms of accuracy, which indicates that many legitimate accounts are wrongly classified as bots. In this respect, the comparison differs from previous bot detection results.

Table 2. Bot detection results on the Cresci-rtbust dataset and comparison with a baseline and other techniques [30]. The best and second-best results for each metric are bold and underlined, respectively.

We also observe that the model with the BGSRD set of features performs consistently well overall, outperforming or obtaining similar results to the other models. The excellent performance of the model containing D on the stock dataset, where it performs the best, is also worth mentioning. This provides evidence that the compression statistics extracted from the Digital DNA can detect bots that behave in a coordinated way, as happens in the stock dataset. Moreover, by combining D with data selection, it is possible to build a classifier that generalises properly across different domains. The model with BGSRD, except on the stock dataset, produces results that outperform those of the other models on some occasions. Besides, it shows the best specificity in all cases and is scalable. BGSRD seems to be more robust against the bots in the five datasets, probably because its features cover more aspects than user metadata alone, and because BERT is used to learn more semantic information.
The results also confirm that it is possible to obtain competitive performance using just a small set of features, rather than a larger one such as that of Botometer.
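For reference, the relation between the evaluation metrics discussed in this section (accuracy, precision, recall, specificity and F1) can be sketched as follows. The labels and predictions are made-up examples, not values from Tables 2-6, and the function name is an assumption.

```python
def classification_metrics(y_true, y_pred, positive="bot"):
    """Binary classification metrics, treating `positive` as the bot class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),  # share of humans kept as human
        "f1": ratio(2 * precision * recall, precision + recall),
    }

y_true = ["bot", "bot", "bot", "human", "human", "human", "human", "human"]
y_pred = ["bot", "bot", "human", "bot", "human", "human", "human", "human"]
metrics = classification_metrics(y_true, y_pred)
```

A legitimate account that is wrongly flagged as a bot counts as a false positive, which lowers both precision and specificity.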

Ablation Study
Figures 4-8 present the evaluation metrics of each model. We can see that BGSRD and RoBERTaGCN perform the best across all datasets. Using BERT or RoBERTa with GCN generally performs better than using them with GAT, except on gilani. This is because content posted by social bots has propagation characteristics, and GCN can learn the propagation characteristics of fake content. roberta-base and roberta-large improve the performance on the datasets more significantly than bert-base-uncased and bert-large-uncased. The main reason is that the average text length in the datasets is relatively long: because the graph is constructed from word-document statistics, long texts produce more document connections through intermediate word nodes, which favours message passing over the graph and leads to better performance when combined with GCN. This also explains why the GCN model performs better than the GAT model on the cresci, botometer, stock and midterm datasets, whereas datasets with shorter documents (such as gilani) show less performance improvement because the benefit of the graph structure is limited. BERTGAT and RoBERTaGAT also benefit from the graph structure, but their performance is not as good as that of the GCN variants because of the lack of edge weight information.

Results for different models on the transductive social bot detection gilani dataset. We ran all models 10 times and report the mean test accuracy.
Figure 7. Results for different models on the transductive social bot detection stock dataset. We ran all models 10 times and report the mean test accuracy.

The Effect of λ
λ controls the trade-off between training the full BGSRD model and training BERT alone. The optimal value of λ differs across tasks. Figure 9 shows the accuracy of RoBERTaGCN for different values of λ. On cresci, F1 is consistently higher when λ is larger, which can be explained by the high performance of graph-based methods on this dataset. The model achieves its best performance at λ = 0.8, slightly better than using the GCN prediction alone (λ = 1).

Discussion
The experimental results show that BGSRD achieves strong robot detection results and learns predictive document and word embeddings. However, the GCN model is inherently transductive, which is a major limitation of this study: test document nodes (without labels) are included in GCN training. Therefore, Text GCN cannot quickly generate embeddings and predictions for unseen test documents. In addition, the best performance is achieved only when a small learning rate is set for the RoBERTa module and a fine-tuned RoBERTa is used.

Conclusions and Future Work
BGSRD takes full advantage of large-scale pre-trained models and transductive learning for social robot classification. BGSRD is trained efficiently using a memory store that holds all document embeddings, only part of which is updated with each mini-batch of samples. By constructing a large heterogeneous word-document graph over the whole corpus, social robot text detection is turned into a node classification problem. The BGSRD framework captures global word co-occurrence information and makes good use of a limited number of labeled documents. It can be built on top of any document encoder and any graph model. The method, which combines BERT with a simple two-layer GCN, performs excellently on multiple benchmark datasets.
We currently detect social robots only from semantic information and textual relationships, whereas social robot detection requires more complex features for better recognition. Future work may focus on digging for more account features under the surface, such as the sentiment analysis of tweets. The detection scheme also needs to be more comprehensive. For example, machine learning can be combined with social graphs to jointly analyze account characteristics and social network graphs, and human judgment can be introduced at certain stages; after all, humans can better identify the differences between machine accounts and human users. To further improve the robustness and detection capability of the detection technology, it is also necessary to analyze the likely directions in which machine accounts will evolve and to derive from this analysis the feature dimensions that can be used to detect new machine accounts. Adversarial thinking leads to more powerful, generalized, and even preventative detection techniques.