Deep Recurrent Neural Network and Data Filtering for Rumor Detection on Sina Weibo

Social media makes it easy for individuals to publish and consume news, but it also facilitates the spread of rumors. This paper proposes a novel deep recurrent neural model with a symmetrical network architecture for automatic rumor detection on social media such as Sina Weibo, which shows better performance than existing methods. In the data preparation phase, we filter the posts according to the follower counts of their authors. We then use sequential encoding for the posts and multiple embedding layers to obtain better feature representations, and multiple recurrent neural network layers to capture the dynamic temporal characteristics of rumor spreading. The experimental results on the Sina Weibo dataset show that: 1. sequential encoding performs better than the term frequency-inverse document frequency (TF-IDF) or doc2vec encoding schemes; 2. the model is more accurate when trained on the posts from users with more followers; and 3. the model achieves substantial improvements over existing works in detection accuracy, including early detection.


Introduction
Internet-based social media is becoming an important news source for many people. It was reported that in 2017 about two-thirds of Americans got their news from platforms such as Facebook, YouTube, and Twitter [1]. Sina Weibo, a Chinese microblogging platform, reported that over 100 million people discussed the news of the 2018 FIFA World Cup on the platform [2].
It is very convenient for people to publish and consume news on social media, but at the same time, rumors and fake news spread more easily there. Sina Weibo reported that about 28 thousand rumors were debunked in 2017 [3]. Social media rumors can be very harmful to the community. For example, on 23 April 2013, the official Twitter account of the Associated Press was hacked and used to publish fake news that two explosions had occurred in the White House and the President was injured, which immediately caused a dramatic crash of the stock market [4]. To prevent rumors from spreading, organizations such as the Associated Press, FactCheck.org, and Snopes.com provide fact-checking services to alert the public. Sina Weibo also set up an account, @weibopiyao, for users to report possible fake news. These efforts work, yet they depend on the intervention of users and experts.
In order to improve the efficiency of rumor detection, researchers have applied artificial intelligence to the problem. Automatic rumor detection is treated as a pattern recognition problem, since the veracity of news in social media can be judged from extracted features, such as the texts, the behaviors of the authors, the propagation modes, and the attached images or videos [5][6][7][8]. Many works [9][10][11][12][13][14][15][16] were based on traditional supervised learning methods, in which feature engineering is very important but painstakingly labor intensive.
In recent years, deep neural networks have made significant progress in pattern recognition problems, such as the deep convolutional neural network (CNN) for image processing [17] and short text classification [18,19]. Neural networks have also been applied to rumor detection. Ma et al. proposed a tree-structured recursive neural network to model the propagation of rumors [20]. Nguyen et al. [21] combined a CNN and an RNN to obtain a credibility score for a single post. Compared to CNNs, RNNs are more suitable for modeling the temporal characteristics of rumor spreading. Some researchers proposed RNN models for rumor detection at the event level; they obtained good results and avoided manual feature engineering [22,23].
However, data representation is still an important factor affecting the performance of existing RNN models. The authors of [22] chose TF-IDF to represent the texts, which captures global statistical information about the words but misses the structural information in the sentences. The authors of [23] used doc2vec, which preserves the structural information but requires considerable time and space for the model.
To address the above problems, in this paper we propose a novel deep recurrent neural network (DRNN) model for rumor detection based on textual information. The DRNN model comprises eight layers in total. The input layer receives the stream of posts related to an event, encoded with the sequential coding scheme. The first three hidden layers, a normalization layer and two fully-connected layers, are inserted for better representation. The next two hidden layers are RNN layers, which capture the dynamic temporal characteristics of the post stream. Lastly, a fully-connected layer outputs the probability of the event being a rumor. The proposed model is symmetric in architecture, in that each cell in the upper layer is fully connected to the cells in the layer below, and vice versa.
Considering that comments and replies to posts are often very short and contain only similar words, we propose a sequential encoding scheme for the texts. The experiments show that it performs better than the popular TF-IDF and doc2vec encodings.
Another contribution of this paper is a data filtering technique in the data preparation stage. We assume that social media users with many followers usually know more than others in certain fields, and, in order to keep their followers, they are more careful to state correct facts when making assertions. We can therefore rely more on their opinions when classifying rumors. Our work shows that the models are indeed more accurate when trained on posts submitted by users with more followers.
Experiments show that the proposed system achieves substantial improvements over existing works in detection accuracy, including early detection.

RNN
An RNN is a type of neural network with recurrent connections that can make predictions over time series. Given an input sequence (x_1, x_2, . . . , x_t, . . . ), simple RNN units generate the hidden states

$$h_t = \tanh(W x_t + U h_{t-1} + b),$$

where W and U are the weight matrices, b is the bias, and tanh is the hyperbolic tangent non-linearity. It was found that during training, long-term information often vanishes gradually with the time step. LSTM units were then proposed to memorize more long-term information [24]:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f),$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i),$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),$$
$$c_t = f_t * c_{t-1} + i_t * \tanh(W_c x_t + U_c h_{t-1} + b_c),$$
$$h_t = o_t * \tanh(c_t),$$

where σ is the logistic sigmoid function; f_t, i_t, and o_t are the activation vectors of the forget gate, input gate, and output gate; c_t is the cell state vector; and * denotes the Hadamard product. GRU is a simpler variant of LSTM with fewer parameters [25]:

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z),$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r),$$
$$h_t = (1 - z_t) * h_{t-1} + z_t * \tanh(W_h x_t + U_h (r_t * h_{t-1}) + b_h),$$

where z_t and r_t are the vectors of the update gate and reset gate. In many cases, recurrent neural networks with LSTM or GRU units perform better than the simple RNN.
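As an illustration of the GRU equations above, the following is a minimal NumPy sketch of one GRU step. The weight names, dimensions, and random initialization are our own illustrative choices, not part of the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU time step following the update/reset-gate equations."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev + bz)               # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev + br)               # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r_t * h_prev) + bh)   # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde              # new hidden state

# Tiny demo: 4-dim input, 3-dim hidden state, random weights.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
params = (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h),
          rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h),
          rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h))
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):  # run a 5-step input sequence
    h = gru_step(x, h, params)
print(h.shape)  # (3,)
```

Because the new state is a convex combination of the previous state and a tanh-bounded candidate, the hidden values always stay in (-1, 1), which is one reason gated units train more stably than simple RNN cells.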
The RNN states can be decomposed into multiple layers [26]. It was further suggested that separate multilayer perceptrons can be inserted before the input, the hidden, or the output layer of the RNN for better performance [27], leading to a "deeper" RNN (DRNN).
Ma et al. [22] first proposed RNN models for the rumor detection problem to avoid feature engineering, organizing the messages as a time series input. Ruchansky et al. [23] proposed an RNN model combining the message text and user information. Nguyen et al. [21] used a CNN and an RNN to obtain a credibility score for each post and then combined it with other hand-crafted features. Ma et al. [20] applied a recursive neural network to the problem by analyzing the tree structure of the propagation modes.
Our work is most closely related to [22,23], yet there are three major differences. First, they used TF-IDF [22] and doc2vec [23] to encode the input, while our work directly uses a simple sequential encoding scheme. TF-IDF only keeps the word frequencies and loses the structural information in the sentences, and the doc2vec scheme needs much more training to obtain suitable vectors. Second, our RNN model is much deeper than theirs; the new architecture helps prevent the system from underfitting and overfitting. Third, we filtered the dataset and used the posts from users with more followers to train the model.

Materials and Methods
Although our methods are mainly proposed to detect rumors on Sina Weibo, we also applied them to a public Twitter dataset to check their performance. In this section, we first describe a Sina Weibo dataset and a Twitter dataset, and then explain the details of the methods used to train and evaluate the models on these datasets.

Sina Weibo Dataset
We evaluate the proposed DRNN models on the dataset provided in [22], which was crawled from Sina Weibo. The details of the dataset are listed in Table 1. The dataset contains 4664 events in total. On average, about 816 posts relate to each event, and each user contributed 1.3 posts. For a detailed look, we drew two histograms to show the distribution of posts, as in Figure 1. Figure 1b shows the number of posts published by users with different follower counts; we can see that the users with more followers publish fewer posts. The user-based features include registration time, location, follower count, friend count, post count, etc.

Twitter Dataset
We also evaluated our model on a public Twitter dataset released in [12], which provides all the information we need: the text, the post time, and the follower count. The dataset includes 111 events (60 rumors and 51 non-rumors), and 192,350 posts altogether.

Problem Definition
An event in a microblogging platform is defined as e = {w_0, r_1, r_2, . . . , r_p}, where w_0 is the original post published by a user and r_1, r_2, . . . , r_p are the p posts that comment on or repost it, sorted by publication time. An event can be labeled as a rumor or not. We train a DRNN model to predict the label of an event.
We only use the text body of each post as the input of the DRNN, so in the following sections, "post" also refers to its text body.

Text Vectorization
(1) Word Segmenting and Sequential Encoding

Because words in Chinese sentences are adjacent, with no spaces to separate them, we first use a natural language processing tool named Jieba (https://pypi.org/project/jieba/) to segment each post into a list of words. Replacing each word with its index in the dictionary, we sequentially encode the post into a one-dimensional integer vector. Unlike the TF-IDF scheme in [22], our scheme preserves the sentence structure and semantic information, which should be important for the classification task.
Comparing to the doc2vec scheme, our sequential encoding scheme needs less time and space expenses. The shortcoming of the sequential encoding is that the distance of two words cannot be measured so it is hard to express the homoionym; however in the microblogging system, the comments of posts are usually very short and resemble one another, so the network can memorize most of them.
(2) Post Division Series

In general, p is very large (on average, an event contains about 816 posts in the Sina Weibo dataset), but it is complex to design an RNN supporting so many time steps, so we partition the comments and reposts equally into t groups by count. With k = p/t posts in a group, we get

e = {(w_0), (r_1, r_2, . . . , r_k), (r_{k+1}, r_{k+2}, . . . , r_{2k}), . . . , (r_{(t-1)k+1}, r_{(t-1)k+2}, . . . , r_p)}.

Each group is fed to the DRNN as one time step. Note that w_0 forms a group by itself because it is more important than the other comments and reposts.
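The grouping above can be sketched in a few lines. This is an illustrative reconstruction (the function name and the handling of a remainder group are our assumptions; the paper only specifies k = p/t posts per group with w_0 kept separate).

```python
def divide_posts(w0, replies, t):
    """Split the replies into t equal-sized groups by count; the source
    post w0 forms its own group. If len(replies) is not divisible by t,
    the remainder spills into an extra group (a simplification)."""
    k = max(1, len(replies) // t)
    groups = [[w0]]
    for i in range(0, len(replies), k):
        groups.append(replies[i:i + k])
    return groups

groups = divide_posts("w0", [f"r{i}" for i in range(1, 10)], t=3)
print(groups)  # [['w0'], ['r1', 'r2', 'r3'], ['r4', 'r5', 'r6'], ['r7', 'r8', 'r9']]
```

Each inner list then becomes one time step of the recurrent network, so the sequence length is bounded by t + 1 regardless of how many posts the event attracted.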
We do not partition by time division as in [22,23] because of the uneven distribution of the posts in the time dimension, which is discussed in Section 4.1.

(3) Data Filtering
We infer that if a microblog user has more followers, he or she probably knows better how to judge the truth of news and, furthermore, will be more careful when making assertions in order to keep those followers. We therefore sort the posts in each group by the number of followers in descending order, retain the posts from the users with more followers, and filter out the rest.
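The in-group sort-and-retain step can be sketched as below. The keep_ratio parameter is our illustrative assumption (the paper keeps roughly 30% of the posts overall but does not fix a per-group quota), as are the field names.

```python
def filter_by_followers(group, keep_ratio=0.3):
    """Sort a group's posts by the author's follower count (descending)
    and keep only the top fraction. keep_ratio is illustrative."""
    ranked = sorted(group, key=lambda post: post["followers"], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:n_keep]

group = [{"text": "a", "followers": 10},
         {"text": "b", "followers": 900},
         {"text": "c", "followers": 120}]
print([p["text"] for p in filter_by_followers(group)])  # ['b']
```

The max(1, ...) guard keeps at least one post per group, so no time step of the RNN input is ever empty.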

(4) Concatenating and Fixing
Finally, we concatenate the remaining posts in each group into one long vector and fix its length to m, keeping the first m elements if it is longer than m, or padding zeros at the end if it is shorter. Thus, we obtain a final sequence {x_0, x_1, . . . , x_t}, where each x_i is an m-element integer vector.
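The concatenate-then-fix step amounts to a truncate-or-pad operation. A minimal sketch (the function name is ours):

```python
def concat_and_fix(encoded_group, m):
    """Concatenate the encoded posts of one group, then truncate to m
    elements or zero-pad at the end so the result always has length m."""
    flat = [idx for post in encoded_group for idx in post]
    return flat[:m] if len(flat) >= m else flat + [0] * (m - len(flat))

print(concat_and_fix([[1, 2], [3, 4, 5]], m=4))  # [1, 2, 3, 4]
print(concat_and_fix([[1, 2]], m=4))             # [1, 2, 0, 0]
```

Fixing every time step to the same length m is what lets the groups be stacked into a dense (t + 1) x m tensor for the network input.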

DRNN Architecture
We propose a deep RNN architecture with eight layers, as shown in Figure 2; the structural information is listed in Table 2.

The first layer is the input layer; it accepts x_0, x_1, . . . , x_t sequentially, prepared as described in Section 2.2.2. In our work, m was set to 3000, so the dimension of the first layer is 3000.
The second layer is a normalization layer, which applies a transformation to each x_i so that the mean is close to 0 and the standard deviation close to 1 [28]; this layer accelerates the training of the network. With a batch of training samples, this layer outputs

$$h_t^{(2)} = \gamma \frac{x_t - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}} + \beta,$$

where μ_B and σ_B^2 are the mean and variance of the batch, γ and β are parameters to be learned, ε is a hyperparameter, and the superscript denotes the layer index.

The third and fourth layers are two fully-connected layers with ReLU activation. Their purpose is to reduce the dimension of the input data flow from 3000 to 800 and then to 256; we use two layers for the advantage of depth. The outputs of these two layers are

$$h_t^{(3)} = \mathrm{ReLU}(W^{(3)} h_t^{(2)} + b^{(3)}),$$
$$h_t^{(4)} = \mathrm{ReLU}(W^{(4)} h_t^{(3)} + b^{(4)}).$$

The fifth and sixth layers are two RNN layers, which can use simple RNN, LSTM, or GRU units as introduced in Section 1.1. We set the dimensions of both layers to 32 and obtain h_t^{(5)} and h_t^{(6)}. The seventh layer is a fully-connected layer that transforms the output of the sixth layer into a real number, and the last layer applies the sigmoid function to obtain a probability value:

$$p = \sigma(W^{(7)} h_t^{(6)} + b^{(7)}).$$
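As a concrete illustration, the eight-layer stack can be sketched in Keras. This is our reconstruction under stated assumptions, not the authors' code: the framework choice, layer options, and variable names are ours, while the layer sizes (m = 3000, dense dimensions 800 and 256, recurrent dimensions 32) and the training setup follow the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

t, m = 32, 3000  # number of reply groups and fixed vector length per group

model = tf.keras.Sequential([
    layers.Input(shape=(t + 1, m)),          # 1: input sequence (w0 group + t groups)
    layers.BatchNormalization(),             # 2: normalization layer
    layers.Dense(800, activation="relu"),    # 3: reduce 3000 -> 800 per time step
    layers.Dense(256, activation="relu"),    # 4: reduce 800 -> 256 per time step
    layers.LSTM(32, return_sequences=True),  # 5: first recurrent layer
    layers.LSTM(32),                         # 6: second recurrent layer
    layers.Dense(1, activation="sigmoid"),   # 7-8: probability of being a rumor
])
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])
print(model.output_shape)  # (None, 1)
```

Swapping the two LSTM layers for SimpleRNN or GRU layers yields the DSRNN and DGRU variants compared in the experiments.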

Results
In the experiments, we set t to 32 and m to 3000. We train the proposed DRNN models with the cross-entropy loss function and the Adagrad optimizer with a learning rate of 0.001. 80% of the data are randomly chosen as the training set and the rest as the test set. The metrics Accuracy, Precision, Recall, and F1 measure [29] are used to evaluate model performance:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},\quad \mathrm{Precision} = \frac{TP}{TP + FP},\quad \mathrm{Recall} = \frac{TP}{TP + FN},\quad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where TP, TN, FP, and FN are, respectively, the number of rumors correctly detected, the number of non-rumors correctly detected, the number of non-rumors wrongly detected as rumors, and the number of rumors wrongly detected as non-rumors. With three kinds of RNN layers in the DRNN architecture, we abbreviate the corresponding DRNN models as DSRNN (simple RNN), DLSTM, and DGRU in the following text, tables, and graphics to avoid confusion.
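The four metrics follow directly from the confusion-matrix counts; a small helper makes the definitions concrete (the function name and example counts are illustrative, not results from the paper):

```python
def metrics(tp, tn, fp, fn):
    """Compute Accuracy, Precision, Recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 40 rumors caught, 45 non-rumors passed,
# 5 false alarms, 10 rumors missed.
acc, prec, rec, f1 = metrics(40, 45, 5, 10)
print(acc, rec)  # accuracy 0.85, recall 0.8
```

F1 is the harmonic mean of precision and recall, so it penalizes a model that trades many false alarms for a high recall (or vice versa).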
The experiments in Sections 3.1-3.3 are on the Sina Weibo dataset, and the last experiment in Section 3.4 is on the Twitter dataset.

Filtering Data by the Followers
In this experiment, we study the relationship between the accuracy of the models and the follower counts of users. First, we extract four subsets from the original Sina Weibo dataset, limited to posts whose authors' follower counts fall in the intervals [0,100), [100,300), [200,500), and [500,∞). As shown in Table 3, the posts in each subset are about 24% to 32% of the total. (There are overlaps between the subsets [100,300) and [200,500), but we prioritized keeping nearly 30% of the posts per subset; the overlaps do not affect the conclusion that accuracy improves when training on the subset with more followers.) Next, we train the three kinds of DRNN models on each subset. The accuracy of the models is shown in Figure 3.

Comparisons with Different Models
We then compare our results with [22] and [23] on the Sina Weibo dataset in Table 4. The results of the three kinds of DRNN models are on the interval [500,∞).
The results of SVM-TS and GRU-2 are from [22], where SVM-TS is a support vector machine model based on time series and hand-crafted features, and GRU-2 is an RNN model with two layers of GRU. The results of CI are from [23], where the authors used an LSTM model.

Early Detection
It is important to detect rumors in the early stage of their spreading. As indicated in Figure 1a, early detection is possible because most of the posts are published early. We first filter the Sina Weibo dataset by publishing time to obtain the data subsets for the first 4, 8, 12, 24, 48, 72, and 96 h, and then train the proposed DRNN models on each subset with a further filtering that keeps only users with more than 500 followers.
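The two-stage selection (time window, then follower threshold) can be sketched as below. The tuple layout of a post and the function name are our illustrative assumptions; the 500-follower threshold comes from the paper.

```python
def early_subset(posts, hours, min_followers=500):
    """Keep posts published within `hours` of the source post and written
    by users with more than min_followers followers. Each post is an
    illustrative tuple: (text, seconds_since_source_post, follower_count)."""
    deadline = hours * 3600
    return [p for p in posts if p[1] <= deadline and p[2] > min_followers]

posts = [("early, popular user", 1000, 600),
         ("late, popular user", 5 * 3600, 700),
         ("early, few followers", 2000, 100)]
print(early_subset(posts, hours=4))  # only the first post survives
```

Retraining the model on each windowed subset gives the accuracy-over-time curves used to evaluate early detection.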
The accuracy curves of the proposed DRNN models are presented in Figure 4a accompanied by a curve of GRU-2 from [22]. The loss curves are presented in Figure 4b.


Extensions to Twitter Dataset
Rumor detection methods are not limited to one kind of dataset, so it is valuable to apply our method to the Twitter dataset. In this experiment, we again study the accuracy of our method and its relationship with the follower counts of users. We first choose four subsets in which the follower counts of the users are in the intervals [0,80), [80,250), [250,700), and [700,∞). The posts in each subset account for 21% to 30% of the total. We train and evaluate the DLSTM model on the Twitter dataset with the same parameters. The accuracy of the models is listed in Table 5.

Discussion

Data Filtering
As shown in Table 3, we filter the data by the followers of users and use only about 30% of the data to train the models, which means the data filtering technique can save considerable training resources, such as memory and CPU time.
As shown in Figure 3, all three DRNN models obtain better accuracy when trained with data from users with more followers. The accuracy increases by about 3% when trained on the interval [500,∞) compared to [0,100). Moreover, the accuracy on this interval is even better than the results on the whole dataset. This can be explained by the fact that social media users with many followers usually know more than others in certain fields and, in order to keep their followers, are more careful to state correct facts when making assertions. Therefore, we can rely more on their opinions when seeking to identify the truth or falseness of social media events. Table 5 shows that the data filtering technique also works well on the Twitter dataset: the accuracy of the DLSTM models increases when the training data come from users with more followers. Compared to the interval [0,80), the accuracy of the model increases by about 26% when trained on [700,∞), and it also outperforms the model trained on the whole dataset.


Sequential Encoding
Table 4 compares five kinds of RNN models: GRU-2 in [22] uses TF-IDF to vectorize the text, CI in [23] uses doc2vec, and our three proposed models use sequential encoding. The results show that sequential encoding gives the best performance in detecting rumors on Sina Weibo. All three proposed models outperform GRU-2 in accuracy, precision, and F1 measure, and DLSTM and DGRU outperform CI in accuracy and F1 measure.
Among the five RNN models, the GRU-2 model in [22] has the worst performance, which indicates that TF-IDF encoding limits the classification ability of the model. TF-IDF uses word frequencies as text features, but it loses the structural information both between words within a sentence and between sentences.
The doc2vec scheme is a distributed representation of text that can measure the distance or similarity between texts, an advantage sequential encoding does not have. However, it is interesting that sequential encoding still outperforms doc2vec in the experiments. This can be explained by the dataset. In social media, comments and replies to posts are often very short and contain only similar words. On Sina Weibo, a post usually contains fewer than 140 Chinese characters, and comments are even shorter than the original post; for example, most comments on reposts are fixed words like "reposting" supplied by the system. Because the posts in the dataset are short and similar, the proposed model can memorize most of the information. doc2vec is a powerful representation of text, but it sometimes loses information.

Performance of Different Models
In the experiments on the Sina Weibo dataset, the proposed DRNN models show good performance. From Figure 3, DLSTM is the most accurate model; its best result is about 1.5% higher than the DGRU model and 3% higher than DSRNN. From Table 4, DLSTM outperforms the GRU-2 model in [22] and the CI model in [23], which are also RNN models. Furthermore, the RNN models are about 10% more accurate than the SVM model, which shows that RNN models have substantial advantages over traditional methods.
DLSTM also performs well on the Twitter dataset. From Table 5, DLSTM achieves the best accuracy of 0.864, while the authors of [6] reported a random forest method with an accuracy of 0.84.

Early Detection
It is important for the models to detect rumors as early as possible. In Figure 4a, both the proposed DLSTM and DGRU models show better performance than [22] across all time intervals. The proposed DLSTM model achieves an accuracy of about 90.6% in the first 4 h, rising to 92.4%, 94.1%, and 94.5% in the following 12, 24, and 48 h as the posts gain more and more responses.
The loss curves presented in Figure 4b can explain why the DSRNN models are not as good as DLSTM and DGRU. In the first 4 h, the loss of DSRNN drops to 0.29 after 80 epochs, whereas the losses of DLSTM and DGRU drop to 0.25 within the first 40 epochs. The convergence of DSRNN improves as the time interval grows, so it can be inferred that the convergence of the DSRNN model depends more on the number of posts.

Conclusions
We present a novel deep RNN model for rumor detection on the social media platform Sina Weibo. This model uses two RNN layers to capture the dynamic temporal characteristics of the post stream, preceded by a normalization layer and two fully-connected layers that construct a better representation of the input. From the experimental results, we find that: 1. the sequential encoding scheme is suitable for text representation on Sina Weibo; 2. filtering the dataset and training on posts from users with more followers improves model accuracy; 3. the proposed DRNN model performs better than existing methods on the detection tasks, including early detection; and 4. among the three kinds of DRNN models, the model with LSTM layers achieves the best performance.
Future improvements could focus on new encoding schemes and data filtering. Recent developments in NLP models could be incorporated to represent the text. The filtering of the dataset could also be based on other information, such as educational background, occupation, etc.
Another possible improvement is to introduce a CNN into the proposed architecture. Previous works [18,19] showed that CNNs can classify short sentences, where the sentences are convolved with templates of words and structural patterns are detected. A CNN could extract features from the posts before the data enters the cells of the RNN layers, so that both the structural and the temporal information are fully utilized.