APSN: Adversarial Pseudo-Siamese Network for Fake News Stance Detection

—Fake news is a longstanding problem on social networks whose negative impact has been increasingly recognized since the 2016 US presidential election, during which a large amount of fake news about the candidates spread widely online. Quickly identifying inauthentic news, the task studied in this paper, is essential for enhancing the trustworthiness of news in online social networks. Fake news stance detection, which evaluates the relevance between a headline and its text body, can help detect a startling amount of fake news. There is a significant difference between a news headline and its text body: headlines, built around a few key phrases, are usually much shorter than text bodies. This information imbalance poses a serious challenge for the stance detection task. Moreover, news article data in online social networks is exposed to various types of noise and can be contaminated, which adds further difficulty. In this paper, we propose a novel fake news stance detection model, the Adversarial Pseudo-Siamese Network (APSN), to address these challenges. With coupled input components with imbalanced parameters, APSN learns feature vectors for news headlines and text bodies and computes their similarity score simultaneously. In addition, by adopting an adversarial setting, a set of noisy training instances is generated and fed to APSN alongside the regular training set during learning, which significantly enhances the robustness of the model. Extensive experiments on a real-world fake news dataset show that the proposed model outperforms competing suspicious-information detection models by a significant margin.


I. INTRODUCTION
Fake news deceives people for profit through ambiguous details, whether via traditional news media such as print and broadcast or via deliberate misinformation and hoaxes spread on online social networks. Nowadays, with the rapid expansion of the Internet and the proliferation of mobile terminals, fake news spreads especially fast on social networks, particularly in politics. During the 2016 and 2020 US presidential elections, all manner of inauthentic news about the candidates spread on social networks, which might have affected the outcome of the general election. According to an analysis by BuzzFeed [1], the top 20 fake news stories about the 2016 U.S. presidential election received more hits on Facebook than the top 20 election reports from 19 major media outlets. By adding or modifying only a few words, social network users can easily change the content of a news item, influencing the behavior of offline users. In some respects, Internet providers even acquiesce in such behavior. Improving the credibility of news on social networks has long been a difficult problem for practitioners. One approach is to recognize fake news articles quickly, which is the subject of this paper.

(Corresponding author: Zhoujun Li, lizj@buaa.edu.cn. Zhibo Zhou: zhouzhibo@buaa.edu.cn. Yang Yang: yangy103@spdb.com.cn. Feiran Huang: huangfr@jnu.edu.cn. The first and second authors contributed equally to this work.)
The fake news detection problem is difficult, and its serious negative impact has been increasingly recognized since the 2016 election. Fake news differs greatly from regular suspicious messages, such as the spam email studied in prior work [2][3][4][5][6], in several aspects: (1) Impact on society: spam emails are generally transmitted only to individuals or small forwarding groups; their social impact is limited and their spread is narrow. By contrast, because of the massive number of social network users, the spread of fake news is usually wide and influential, and reposting triggers new rounds of propagation [7][8][9][10]. (2) Audience initiative: in the dissemination of fake news, social network users actively forward fake news and even seek it out to spread. Most users who forward fake news merely want more readership, with no concern for its correctness; spammers, by contrast, are usually blocked outright. (3) Difficulty of identification: spam and fake reviews are generally easier to identify, whereas identifying fake news requires finding sufficient evidence, or requires users to have relevant background knowledge, owing to the shortage of related real news.
These features of fake news pose new challenges for the fake news detection task. Through a thorough analysis of a fake news dataset conducted before preparing this paper, we found some common defects of fake news, which we categorize as presentation defects. Literally, a "presentation defect" is an instantly visible defect in a news article's presentation, and it appears across presentation modalities such as titles and textual contents [11][12][13]. Specifically, "presentation defect" covers information-consistency defects within a news article. In marked contrast to regular news articles, which are written by professional journalists with well-polished words, live images, and videos, the information in fake news often suffers from inconsistency (for example, the text body is irrelevant to the headline), namely a presentation defect. These discovered defects point toward a direction for tackling the fake news detection problem in our study.
Based on the above defects, fake news stance detection can detect fake news exhibiting information-consistency defects. It aims to understand the relationship between a news headline and its text body, and can thereby find fake news articles whose headlines are irrelevant to, or even conflict with, their text bodies. In this paper, we propose a Pseudo-Siamese network model with coupled input components for fake news stance detection, which accepts news article input in various modalities and computes multi-modality consistency based on the learned modality-specific representations.
Fake news stance detection is not easy and faces several major challenges. First, stance detection is a multi-class classification problem, whereas the classical Siamese network [14] was proposed for binary classification. There are four kinds of relationship between a news headline and its text body: unrelated, conflicting, neutral, and consistent. The conflicting, consistent, and neutral cases all count as related news, because their headlines are related to their text bodies.
• Consistent News: The text body is consistent with the headline.
• Conflicting News: The text body contradicts the headline.
• Neutral News: The text body discusses the same topic as the headline but does not take a position.
• Unrelated News: The text body discusses a different topic from the headline.
A binary classification model could only distinguish unrelated news in the dataset, whereas a multi-class classification model can distinguish both unrelated and conflicting news and hence outperforms the binary model. This requires defining new loss functions for the Siamese network to train the multi-class classifier. Second, because the fake news dataset is small, the model is likely to overfit, so we use data augmentation to extend it: by adding negative words to text body sentences we derive conflicting news from consistent news, and by permuting headlines and text bodies we obtain many unrelated news pairs.
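As a concrete illustration, the permutation-based augmentation can be sketched as follows (function and variable names are ours, not from the original implementation; the sketch assumes at least two distinct text bodies):

```python
import random

def make_unrelated_pairs(consistent_pairs, seed=0):
    """Derive 'unrelated' training pairs by permuting text bodies
    across headlines, so no headline keeps its own body."""
    rng = random.Random(seed)
    headlines = [h for h, _ in consistent_pairs]
    bodies = [b for _, b in consistent_pairs]
    shuffled = bodies[:]
    # Re-shuffle until no body stays aligned with its original headline.
    while any(a is b for a, b in zip(shuffled, bodies)):
        rng.shuffle(shuffled)
    return [(h, b, "unrelated") for h, b in zip(headlines, shuffled)]
```

Each original pair thus yields one additional "unrelated" pair, which explains how this class can come to dominate the augmented dataset.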
By extending the traditional Siamese network model to the fake news stance detection scenario, we propose an exponential Pseudo-Siamese model for this task. Furthermore, in our experiments we find that perturbations in news text (e.g., stop words, incomplete sentences) can hurt the model's performance, so we use adversarial training to make the model more robust to perturbation. Our contributions are summarized as follows:

• Size imbalance of headline and text body: We are the first to propose an exponential Pseudo-Siamese network for fake news stance detection. A news headline is much shorter than its text body, which leads to an information imbalance; the proposed exponential Pseudo-Siamese network addresses this imbalance.

II. RELATED WORK

In recent years, studies on fake news and stance detection have emerged. Several concern stance detection for tweets [15][16][17][18][19]. Mohammad et al. [15] designed an automatic Twitter stance detection system to detect whether a tweeter agrees with, disagrees with, or is not relevant to a tweet, evaluating it on two tasks: the best classification F-score was 67.82 on task A and 56.28 on the other task. Augenstein et al. [16] experimented with conditional LSTM encoding, which builds a representation of the tweet dependent on the target, and showed that it outperforms encoding the tweet and target independently. Du et al. [17] introduced a novel attention module for neural stance classification that incorporates target-specific information; their model achieved state-of-the-art performance on both English and Chinese stance detection. Yang et al. [20] experimented with a two-step attention-based mechanism that transforms tweet stance detection into two binary classification problems and demonstrated that it outperforms strong baselines.
However, the shortage of a corpus of deceptive news is the main challenge in this field for models that predict or detect. There are several sources of fake content: fake product reviews [21][22][23], fudged online resumes [24], opinion spamming [25][26][27], fake social network profiles [28][29][30], fake dating profiles [31], and forged scientific work. Some data are available but restricted in content (e.g., to hotel and electronics reviews).
There are other studies on fake news detection. Rubin et al. [32] separate fake news into three categories, namely Serious Fabrications, Large-Scale Hoaxes, and Humorous Fakes, and use them, according to their characteristics, as a corpus for text analysis and prediction. Based on detection-tool impact theory, Zahedi et al. [33] presented a method to study how the performance of detection tools and cost-related factors of fake websites affect users' perception of tools and threats, the efficiency of processing threats, and users' reliance on such tools.
In addition, the Fake News Challenge (FNC-1) [34] contest concentrates on fake news and stance detection. Using its dataset, Chopra et al. [34] leveraged an SVM trained on TF-IDF cosine-similarity features for stance detection, then employed various neural architectures built on top of LSTM models, scoring 86.58 on FNC-1's performance metric. Yuxi et al. designed a weighted-average model combining gradient-boosted decision trees and a deep convolutional neural network, which scored 82.02 on the FNC metric.
There are also prior studies on the Siamese network itself. In 1994, Bromley et al. [35] designed a rudimentary Siamese network to judge whether two signatures came from the same person; their experiments showed that it could recognize forged signatures effectively. In recent years, the Siamese network has been applied to other problems [36][37][38][39]. Fu et al. [40] used a Siamese network for RGB-D object detection with joint learning and densely cooperative fusion. Ji et al. [37] put forward a Siamese-based cross-attention model for video salient object detection. Chen et al. [38] used a Siamese network with a spatial transformer layer for accurate pelvic fracture detection and achieved state-of-the-art performance. Huang et al. [39] employ a correlational multimodal VAE through a triplet Siamese network for social image representation.

III. APPROACH
In this paper we present an exponential Pseudo-Siamese network, a variation of the classic Siamese network. We exploit a contrastive loss designed specifically for text information, which greatly improves performance. In addition, adversarial training is embedded into the Siamese network to make the model more robust against perturbation.

A. Pseudo Siamese Network
The Siamese network is a special neural network architecture that usually contains two or more subnetworks whose weights, configuration, and other parameters are shared. Parameter updates are generally applied across subnetworks, as displayed on the left-hand side of Fig. 1. Finally, the Siamese network outputs a distance (e.g., Euclidean distance) that measures the similarity of the inputs: the more similar the two inputs, the smaller the distance. The figure shows the two-subnetwork case; some Siamese networks have more subnetworks, but in the following, "Siamese network" refers to the two-subnetwork case.
Siamese networks are well known for discovering similarities or associations between two comparable things. Bromley et al. [35] use a Siamese network for signature verification on American checks, i.e., to determine whether two signatures belong to the same person. The Siamese network has also been used to score a repeater's performance in paraphrase-judging systems, where the input is two sentences and the output is a score. In both cases, the Siamese network employs two subnetworks to process the two inputs, and another module integrates the subnetwork outputs to obtain the final result.
Siamese architectures achieve excellent results in these tasks for two reasons: 1) sharing weights among subnetworks means only a few parameters need to be trained, so less data is required and the tendency to overfit is reduced; 2) each subnetwork is essentially a representation of its input, so it is reasonable to use similar models to process similar input types, such as the paired sentences or signatures in the cases above.
In natural language processing, several recent studies have used Siamese architectures [41][42][43][44]. Das et al. [41] used the Siamese network to measure the semantic similarity between target and generated questions. Shonibare et al. [44] employ Siamese and triplet neural network architectures based on BERT (Bidirectional Encoder Representations from Transformers) to embed text into vectors.
The fake news stance detection task cannot be solved directly by classical Siamese networks. The headline is very short, most containing fewer than 40 words, whereas the text body is much longer and carries far more information. The two subnetworks of a classical Siamese network share parameters, on the assumption that the two inputs are similar in length and structure; consequently, classical Siamese networks perform poorly on fake news. We therefore make the two branches not share parameters. As displayed on the right-hand side of Fig. 1, the left branch handles only the headline, while the right branch handles only the text body. Experiments show that the proposed Pseudo-Siamese network outperforms the classical Siamese network on fake news.
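To make the non-sharing concrete, the following toy sketch contrasts two separately parameterized branches; the tiny linear encoder merely stands in for the Bi-LSTM subnetworks described later, and all names and dimensions are illustrative:

```python
import random

class BranchEncoder:
    """Stand-in for one branch of a Pseudo-Siamese network: a tiny
    linear encoder over mean-pooled word vectors. Each branch owns
    its parameters; nothing is shared between branches."""

    def __init__(self, in_dim, out_dim, seed):
        rng = random.Random(seed)
        # A private weight matrix per branch (no weight sharing).
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                  for _ in range(out_dim)]

    def encode(self, word_vectors):
        # Mean-pool the input word vectors, then apply this branch's map.
        in_dim = len(self.w[0])
        pooled = [sum(v[d] for v in word_vectors) / len(word_vectors)
                  for d in range(in_dim)]
        return [sum(wi * pi for wi, pi in zip(row, pooled)) for row in self.w]

# Pseudo-Siamese: independent parameters for short headlines vs. long bodies.
headline_branch = BranchEncoder(in_dim=4, out_dim=2, seed=1)
body_branch = BranchEncoder(in_dim=4, out_dim=2, seed=2)
```

In a classical Siamese network the two branches would be the same object (one parameter set); here each branch can adapt to the very different lengths and structures of its input.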

B. Model Architecture
We now present our exponential Pseudo-Siamese network architecture in detail. We employ two parallel bidirectional LSTMs to extract latent features from the news headline and text body simultaneously. As Fig. 2 shows, there are two major branches in our model, the headline branch and the text body branch. Each branch takes the headline or text body word sequence as input and extracts latent features with its subnetwork for the final prediction. We present our method by answering three questions: 1) How do we obtain latent features from the news text? 2) How do we combine the headline and text body features? 3) Why do we add noise to the output of the embedding layer?

1) Headline Branch: For the headline branch, the input is the word sequence of the news headline, T_h. Each word in this sequence is represented by its index in an English dictionary. Through the embedding layer, with pre-trained GloVe word vectors as weights, these indices are converted into word vectors V_h, which represent the latent features of the headline. After some noise is added to the word vectors, they are processed by the subnetwork, for which we employ a bidirectional LSTM. Long Short-Term Memory (LSTM) [45][46] is a special class of recurrent neural network. Owing to its memory mechanism, LSTM is well suited to handling and forecasting over very long horizons. A recurrent neural network (RNN) handles input sequences of arbitrary length through a hidden state unit h_t. At each time step t, the hidden unit takes the input vector x_t (e.g., a word vector) and its previous output y_{t-1}, and produces the output y_t according to

    y_t = tanh(W x_t + U y_{t-1} + B),

where W, U, and B are the parameters of the hidden unit. By recursing this process, an RNN passes information from one step of the network to the next, connecting previous information to the current computation. However,
if the sequence is too long and the needed information lies too far back, the RNN may fail to retrieve it: the gradient vector grows or decays exponentially during training [47]. This vanishing or exploding gradient makes plain RNNs ill-suited to the fake news stance detection task. The LSTM addresses this problem by introducing a memory cell and performs better on longer sequences. Here we use Zaremba's formulation [48] to describe the LSTM.
At each time step t, the LSTM unit is a set of vectors in R^d (d is the memory dimension of the LSTM). Unlike the RNN, which has only one transmitted state h_t, the LSTM has two: the memory cell state c_t and the hidden state h_t. The gating vectors i_t, f_t, and o_t lie in [0, 1]. Specifically, the LSTM computation at time step t is:

    f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
    i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
    o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
    u_t = tanh(W_u x_t + U_u h_{t-1} + b_u)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ u_t
    h_t = o_t ⊙ tanh(c_t)

where f_t, i_t, and o_t are the forget, input, and output gates at time step t, respectively,
σ(·) is the logistic sigmoid function, and ⊙ is element-wise multiplication. First, the forget gate combines the previous hidden state h_{t-1} with the current input x_t and decides, through the sigmoid, which old information to discard: the sigmoid's range is (0, 1), so values close to 0 are discarded and values close to 1 are kept. The input gate i_t together with the tanh candidate determines what new information is written. Next, the forget gate and the new information are combined to obtain the cell state c_t at the current step. Finally, the output gate is multiplied by the tanh of the cell state to determine what is output.
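For illustration, a minimal single-step, scalar (d = 1) realization of these gate equations follows; parameter names mirror the equations above, while real implementations operate on vectors and matrices:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step for scalar input/state (d = 1). `p` holds the
    weights W_*, U_* and biases b_* for each gate."""
    f_t = sigmoid(p["W_f"] * x_t + p["U_f"] * h_prev + p["b_f"])  # forget gate
    i_t = sigmoid(p["W_i"] * x_t + p["U_i"] * h_prev + p["b_i"])  # input gate
    o_t = sigmoid(p["W_o"] * x_t + p["U_o"] * h_prev + p["b_o"])  # output gate
    u_t = math.tanh(p["W_u"] * x_t + p["U_u"] * h_prev + p["b_u"])  # candidate
    c_t = f_t * c_prev + i_t * u_t        # new memory cell state
    h_t = o_t * math.tanh(c_t)            # new hidden state
    return h_t, c_t
```

Because c_t is updated additively (gated copy of c_{t-1} plus gated new content), gradients can flow through long sequences far more easily than in the plain RNN recurrence.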
Given these characteristics, the LSTM is well suited to learning high-level features from sequences, including word vectors. However, an LSTM's output depends only on previous and current states and ignores future context. To compensate, we use a bidirectional LSTM as the subnetwork.
2) Text Body Branch: The architecture of the text body branch is similar to the headline branch but has its own parameters. Its input is the word sequence of the news text body, T_b, and its output is the text body feature X_b. This branch's subnetwork is also a bidirectional LSTM.
3) Exponential Distance: In a classical Siamese network, the two branches output two vectors and the network outputs their Euclidean distance. In our multi-class setting, however, the Euclidean margin between two classes is too small, which harms performance and makes parameter tuning harder. We therefore use the exponential distance between the branch outputs X_h and X_b as the model output, which effectively enlarges the margin between classes.
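A minimal sketch of the exponential distance (strictly speaking a similarity in (0, 1], equal to 1 for identical feature vectors); the default parameter value 2.5 anticipates the setting used in our experiments:

```python
import math

def exponential_distance(x_h, x_b, alpha=2.5):
    """e^{-alpha * ||x_h - x_b||_2}: 1.0 for identical branch outputs,
    decaying toward 0 as they diverge. Larger alpha widens the margin
    between classes relative to plain Euclidean distance."""
    euclid = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_h, x_b)))
    return math.exp(-alpha * euclid)
```

Small differences in Euclidean distance near zero are stretched apart by the exponential, which is the margin-enlarging effect exploited by the model.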

4) Adversarial Training:
As shown in Fig. 2, before entering each branch's subnetwork, noise is added to the word vectors V_h and V_b; this adversarial training makes the proposed model more robust to perturbation. Goodfellow et al. [49] proposed adversarial network architectures in 2014, and they have since been used in many settings; Huang et al. [50][51] incorporate the attention mechanism and adversarial networks for multimodal representation. Let X = (X_h, X_b) be the input pair, y the label, and W the parameters of the neural network. The adversarial training loss is

    L_adv = -log p(y | X + r_adv; W),  with  r_adv = argmin_{r: ||r||_2 ≤ ε} log p(y | X + r; Ŵ),

where r_adv decomposes into r_adv^h and r_adv^b, the perturbations on the headline and on the text body, respectively. The purpose of the perturbation is to challenge the model to learn in the most difficult situations: r_adv is the worst-case perturbation for the model. Miyato et al. [52] proposed an approximation that estimates r_adv by linearizing log p(y | x; Ŵ), yielding r_adv = -ε g / ||g||_2 with g = ∇_x log p(y | x; Ŵ). An L2 norm is used to normalize the perturbation, and ε controls its intensity.
Then, we can derive the loss for headline and text body branch, respectively.
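A sketch of the L2-normalized perturbation used in Miyato et al.'s approximation; the gradient is assumed to have been computed externally by backpropagation, and the function name is ours:

```python
import math

def adversarial_perturbation(grad, epsilon=0.01):
    """Rescale the loss gradient w.r.t. the embeddings to an L2 norm of
    epsilon, giving the (approximate) worst-case perturbation r_adv."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)  # no gradient signal, no perturbation
    return [epsilon * g / norm for g in grad]
```

In training, this perturbation is added to the embedding outputs of the next batch, so the model repeatedly sees the inputs that are currently hardest for it.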

C. Contrastive Loss
A traditional machine learning loss function sums, over all samples, the difference between predicted and true values. The loss function of a Siamese network, by contrast, is based on the distance between pairs of samples. Let T_h be the word sequence of the news headline and T_b the word sequence of the news text body; T_h and T_b are the model inputs. X_h and X_b are the vectors output by the two subnetworks of the Siamese network, and the distance output by the network is usually defined as the Euclidean distance between X_h and X_b. Y is the label of each [T_h, T_b] pair: for a traditional Siamese network, Y = 0 if the pair is dissimilar and Y = 1 otherwise.
The loss function is as follows, where m is the number of samples, w denotes the model parameters, and (Y, X_h, X_b)^i is the i-th sample, composed of a [HEADLINE, TEXT BODY] pair and a label:

    L(w) = Σ_{i=1}^{m} L(w, (Y, X_h, X_b)^i),
    L(w, (Y, X_h, X_b)^i) = Y · L_S(D_W^i) + (1 - Y) · L_D(D_W^i),

where D_W^i = ||X_h - X_b||_2 is the distance for the i-th pair.
L_S(D_W^i) is the partial loss function for a similar pair, while L_D(D_W^i) is the partial loss function for a dissimilar pair. When Y equals 1, the inputs are similar and the distance between them should be as small as possible, so L(w, (Y, X_h, X_b)^i) equals (D_W^i)^2: the loss of the sample is proportional to the squared distance. When Y equals 0, the inputs are dissimilar and the distance between them should be as large as possible; hence we set a positive margin, and unless the distance between two dissimilar inputs exceeds this margin, the loss does not reach its minimum.
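The two partial losses can be written compactly as follows (the margin value in the usage below is illustrative):

```python
def contrastive_loss(y, d, margin=1.0):
    """Classical two-class contrastive loss on the branch distance d:
    similar pairs (y = 1) are pulled together, dissimilar pairs (y = 0)
    are pushed apart until d exceeds `margin`, where the loss bottoms out."""
    if y == 1:
        return d ** 2                      # L_S: grows with distance
    return max(0.0, margin - d) ** 2       # L_D: zero once d >= margin
```

For example, a dissimilar pair at distance 0.5 with margin 1.0 incurs loss 0.25, while the same pair at distance 2.0 incurs no loss at all.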
Based on the loss of the traditional Siamese network, we design a new multi-class loss for our model. The dataset labels are encoded as y = {0, 1, 2, 3}, representing consistent, conflicting, neutral, and unrelated, respectively. The indicator function is

    f_i(y) = 1 if y = i, and 0 otherwise.

In the loss function, α, β, γ, δ are the weights of the classes; (l_2, l_3), (l_4, l_5), and (l_1, +∞) are the target distance intervals for the classes; and g(D_w) is a transformation of D_w^2. Because different classes correspond to different partial loss functions, f_i(y) selects the right partial loss; as before, each partial loss reaches its minimum only when the sample's distance lies in the corresponding interval.
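Since the full formula is not reproduced here, the following hedged sketch captures only the interval property described above; the interval endpoints, weights, and quadratic penalty are placeholders, not the exact values or partial losses used in the paper:

```python
def interval_loss(y, d, intervals, weights):
    """Multi-class interval loss sketch: each class y has a target distance
    interval [lo, hi]; the weighted loss is zero when the pair's distance d
    falls inside its class interval and grows quadratically outside it."""
    lo, hi = intervals[y]
    if d < lo:
        penalty = (lo - d) ** 2
    elif d > hi:
        penalty = (d - hi) ** 2
    else:
        penalty = 0.0
    return weights[y] * penalty
```

With one interval per class, the loss pushes consistent pairs toward small distances and unrelated pairs toward large ones, with conflicting and neutral pairs in between.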

IV. EXPERIMENTS

A. Case Study
Our dataset is from the Fake News Challenge (FNC) contest. The dataset consists of news headlines and text bodies; one sample is a [HEADLINE, TEXT BODY] pair, as shown in Table II. In this table, the column "type" describes the relationship between headline and text body. For example, in the last news item, the headline and text body discuss different things, so its "type" is "unrelated". News of type "unrelated" or "conflicting" is judged to be fake news.
In our experiments, we use data augmentation to prevent overfitting. Most of the news we collect is "neutral" or "consistent". By adding negative words to text body sentences, we derive "conflicting" news from "consistent" news; by permuting headlines and text bodies, we obtain many "unrelated" news pairs. By our count, about 70% of the resulting news is "unrelated". The percentage of each type is shown in Table III; in total we obtain 49,979 pairs of news.

B. Experimental Setup
All experiments used Keras on GPUs. We use 70% of the dataset for training, 10% for validation, and the rest for testing. All experiments are run at least 10 times independently. As shown in Fig. 2, our model uses a bidirectional LSTM as the subnetwork and outputs a distance that measures the relevance between headline and text body. Before being fed into the model, every word in the headline and text body is converted to its dictionary index. Each headline sequence is padded to 40 words and each text body sequence to 400 words. We describe the parameter settings of each layer for the two branches.
1) Headline branch: For this branch, the GloVe embedding dimension is set to 100; detailed parameter selection is covered in the sensitivity analysis section. In adversarial training, after a batch of data is trained, the perturbation is computed and added to the embedding output of the next batch. The bidirectional LSTM layer has 128 units, followed by a dropout layer with rate 0.1 to avoid overfitting and, at the end of the branch, a dense layer with 128 neurons. The outputs of the headline and text body branches are combined via the exponential distance; in our experiments the distance is e^{-2.5·||O_1 - O_2||}, where O_1 and O_2 are the outputs of the headline and text body branches, respectively.
2) Text Body branch: For this branch, the GloVe embedding dimension is also set to 100, and the perturbation is generated in the same way as for the headline branch. The bidirectional LSTM layer has 128 units, and the dropout and dense layers are the same as in the headline branch.

C. Evaluation
Because our data is imbalanced, a dedicated scoring system is used. The evaluation score of a [HEADLINE, TEXT BODY] pair in the test set depends on whether its target label is related or not: correctly predicting an unrelated pair earns 0.25, correctly predicting a related pair's stance earns 1.00, and other predictions earn 0.00. The mean of the per-pair scores is the final score used to evaluate the model.
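A sketch of this scoring scheme; the partial credit of 0.25 for a related pair predicted as related but with the wrong stance follows the official FNC-1 metric and is an assumption here, as are the label strings:

```python
def fnc_score(gold, pred):
    """Mean FNC-style score over (gold, pred) label pairs: 0.25 for a
    correct unrelated call, 1.0 for a fully correct related stance,
    0.25 for the right side of the related/unrelated split, else 0."""
    related = {"consistent", "conflicting", "neutral"}

    def pair_score(g, p):
        if g == p:
            return 1.0 if g in related else 0.25
        if g in related and p in related:
            return 0.25  # related/unrelated split correct, stance wrong
        return 0.0

    return sum(pair_score(g, p) for g, p in zip(gold, pred)) / len(gold)
```

Weighting related stances more heavily counteracts the roughly 70% share of "unrelated" pairs in the augmented dataset.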

D. Experimental Results
We compare our results with several competitive methods in Table IV. The baseline model uses hand-coded features and a Gradient Boosting classifier. In the FNC contest, the score of the best model is 82.02. A Stanford team uses a cosine Siamese network and weighted bag-of-words features

E. Sensitivity Analysis
In this section, we study the effect of several parameters of the proposed model: the lengths of the news headline and text body, the exponential distance parameter, the perturbation parameter ε, and the dropout probability.
1) Length of News Headline and Text Body: Before the embedding layers, every headline sequence is padded to the same number of words, as is every text body sequence. We use grid search to choose the headline and text body lengths. As shown in Fig. 3, the model performs best when each headline has 37-49 words and each text body has 295-404 words; it does not perform best when the sequence is too long or too short. Specifically, with a 10-word headline and a 100-word text body, the FNC score is only 92.74, showing that a short sequence without enough information cannot yield the best FNC score. On the other hand, an overlong sequence often carries redundant information and does not significantly increase the FNC score. We therefore choose a 40-word headline and a 400-word text body, which yields a high FNC score while shortening training time.
2) Exponential distance parameter: In our model, the exponential distance is e^{-2.5·||O_1 - O_2||}, and 2.5 is the exponential distance parameter; the larger this parameter, the faster the exponential distance curve declines. We initially used the Euclidean distance and found that the categories lay close to one another and could not be classified effectively. After switching to the exponential distance, the inter-category distances expand, which aids classification. As shown in Fig. 4, we test a range of values and find that the model performs well when the parameter is in [1.5, 2.5]; in this paper we set it to 2.5.

4) Dropout probability: We analyze the dropout probabilities; the dropout layers are shown in Fig. 2 (model architecture). In Fig. 6, D_α is the probability of the dropout layer connected to the text body Bi-LSTM layer, while D_β is the probability of the dropout layer connected to the headline Bi-LSTM layer. We employ grid search to select appropriate dropout probabilities.

V. CONCLUSION

With the rapid development of social networks, fake news can spread all over the world in a very short time, and identifying it promptly, i.e., fake news detection, is very important. In this paper, we focus on fake news stance detection, which detects fake news by evaluating the relevance between a news headline and its text body. We propose a novel Pseudo-Siamese network that projects the features of the headline and text body into the same space; an exponential projection function then maps points from the high-dimensional space into a two-dimensional space. Experiments on a fake news challenge dataset show that our model outperforms many competitive baselines, with a highest score of 93.40.
The Siamese network is also a promising way to fuse multi-view data, for example to evaluate the relevance between image and text. Furthermore, since collecting large amounts of data for fake news detection is difficult, Generative Adversarial Nets (GANs) are a possible way to generate real and fake headlines from text bodies, which may greatly improve performance.

Fig. 3. Length of News Headline and Text Body and Performance of Model

Fig. 5. Perturbation Parameter and Performance of Model
to get a score of 89.6, which is their highest score. Initially, we combined the exponential Siamese network with a bidirectional LSTM and obtained a score of 90.12. After adding adversarial training to the model, we obtain a score of 89.12 with only 47% of the dataset, and 93.40 with the entire dataset.