A Review Structure Based Ensemble Model for Deceptive Review Spam

Abstract: Consumers' purchase behavior increasingly relies on online reviews. Accordingly, deceptive reviews, which are harmful to customers, are becoming more common. Existing methods for detecting spam reviews mainly treat the problem as a general text classification task and ignore important features of spam reviews. In this paper, we propose a novel model that splits a review into three parts: first sentence, middle context, and last sentence, based on the discovery that the first and last sentences express stronger emotion than the middle context. The model then uses four independent bidirectional long short-term memory (LSTM) models to encode the beginning, middle, and end of a review and the whole review into four document representations. These four representations are integrated into one document representation by a self-attention mechanism layer and an attention mechanism layer. On datasets from three domains, the results of in-domain and mix-domain experiments show that our proposed method performs better than the compared methods.


Introduction
Consumers' purchase behavior increasingly relies on online reviews. Accordingly, there are more and more deceptive reviews, which are written to deceive consumers for commercial purposes. To increase profits, some merchants hire writers to write positive reviews that promote their products or negative reviews that damage the business of their competitors [1]. With the spread and growth of deceptive reviews, more and more research [2][3][4][5][6][7][8][9] is focusing on the detection of deceptive comments.
Identifying whether a review is deceptive can be regarded as a binary classification problem. Spam reviews were first investigated by Jindal and Liu [1]. Early representative works [2][3][4][5] generally extract features manually and use machine learning algorithms to solve the problem. As neural network models have become widely used in natural language processing, more and more research [6,7] builds end-to-end neural network models that extract the document representation from the review automatically and obtain better classification results.
It is very difficult to identify deceptive comments. According to the experimental results of Ott et al. [2], the accuracy of three human judges is only 57.3%. Li et al. [4], however, built a model using n-grams, part of speech, and linguistic inquiry and word count (LIWC) as features, with SVM (Support Vector Machine) and Bayes classifiers, which performs much better than humans. Li et al. [6] and Ren et al. [7] built end-to-end neural network models to extract the representation of the review and obtained much better results than the method of Li et al. [4]. Their works indicate that the representation learned by neural networks can capture more information about a review than manually extracted features. Compared to the representation extracted by neural networks, manually extracted features are low-dimensional and sparse. According to Ren et al. [7], it is difficult to manually extract features that capture global semantic information over a sentence or discourse.
Although neural networks can learn complex nonlinear relationships from data, they have low bias and high variance, which means that they are sensitive to the statistical noise in the training data. Neural networks easily overfit on small training sets. The lack of annotated data, however, is a critical problem in deceptive review spam detection [1]. Hence, it is important to make full use of the annotated data and to use methods that improve the generalization performance of neural networks.
We carefully compared deceptive reviews with truthful reviews and came to the following conclusions: (1) deceptive comments express stronger emotions than real comments, which is consistent with the conclusion of Li et al. [4]; (2) the strongest expression of emotion in a comment is at the beginning and the end; and (3) deceptive reviews often start or end with similar sentences, which may be because deceptive reviews are usually created by dedicated writers, and the same person may create a large number of similar reviews. Table 1 shows some similar beginnings and ends of deceptive reviews.

Similar Beginnings
My husband and I arrived for a 3-night stay for our 10th wedding anniversary.
My husband and I stayed there when we went to visit my sister.
My wife and I checked in to this hotel after a rough flight from Los Angeles.

Similar Endings
I look forward to many visits to Joe's in the future.
I am looking forward to my next visit to Mike Ditka's-Chicago.
We definitely will be returning to this restaurant in the near future.

Based on these observations, we divide a review into three parts: first sentence, middle context, and last sentence, and propose an ensemble model based on this structure of the review. First, we use bidirectional long short-term memory (BiLSTM) to encode the first sentence, middle context, last sentence, and the whole review into four independent document representations. As the representations obtained from the first sentence, middle context, and last sentence each contain only one part of the information of a review, we use the self-attention mechanism to integrate the three local representations into a global representation that includes all information of the review. Since the representation encoded by the BiLSTM from the whole review also contains all information of the review, we use the attention mechanism to integrate the two global representations into a final representation. Finally, the classification result is obtained through a fully-connected neural network based on the final representation.
We compared the proposed model with the standard benchmark [4] and the state of the art [6] on the standard dataset [4], which contains three domains (Hotel, Restaurant, Doctor). Results of in-domain and mix-domain experiments show that our model outperforms the compared methods.
The major contributions of the work presented in this paper are as follows:
(1) We split a review into three parts: first sentence, middle context, and last sentence, to highlight the first and last sentences, based on the discovery that the first and last sentences express stronger emotion than the middle context.
(2) We used four independent bidirectional LSTM models to encode the first sentence, middle context, last sentence, and the whole review into four document representations. Rather than simply averaging them, we integrated them using a self-attention mechanism layer and an attention mechanism layer, which can learn a better combination through backward propagation.
(3) We verified the effectiveness of our method in three kinds of experiments and compared it with the baseline methods. We also visualized the weights in the attention mechanism, which showed that the weights of the first and last sentences were significantly higher than those of the middle context, as we expected.

Classification of Deceptive Reviews
Spam reviews were first investigated by Jindal and Liu [1], who divide spam reviews into three categories: (1) unreal reviews (deceptive reviews); (2) reviews on brands; and (3) irrelevant reviews. They also conclude that it is easy to identify spam reviews of the second and third categories, but difficult to identify the first category, the deceptive review, because of the lack of annotated data. Current research on deceptive reviews is mainly based on either the users' behavior or the text of reviews. The approach based on users' behavior focuses on filtering strategies to withstand faulty or malicious behavior in networks [8][9][10]. The approach based on the text of reviews focuses on extracting effective features and treats the problem as a classification task. In this paper, we mainly discuss the approach based on the text of reviews.
Ott et al. [2] created the first public deceptive review dataset by hiring online writers to write deceptive reviews. Their data included 400 deceptive reviews and 400 truthful reviews about hotels. Based on the data from Ott et al. [2], Feng et al. [11] applied context-free grammar parse trees to extract syntactic features to improve the performance of the model. Li et al. [5] proposed a topic model based on LDA for deceptive review detection. Xu and Zhao [12] exploited generative features to extract text features from the dependency parse tree. Banerjee and Chua [13] proposed a language framework to analyze the differences between truthful and deceptive reviews in terms of their writing style and readability. In addition, Donato et al. [14] found that character n-grams are better features than word n-grams for the detection of opinion spam.
The dataset proposed by Ott et al. is small, so unsupervised and semi-supervised methods have also been applied to this problem. Donato et al. [15] applied PU-learning to the problem using unlabeled data. Hai et al. [16] developed a multi-task learning method based on logistic regression. Feng et al. [17] studied the distributions of rating scores and introduced strategies to create a pseudo-standard dataset. Liu and Pang [18] trained multiple tree classifiers to generate labeled samples from unlabeled ones and trained a neural network on the extended dataset.
Li et al. [4] collected another deceptive review dataset based on the work of Ott et al. [2], which contains three domains: hotel, restaurant, and doctor, and explored a general method to detect deceptive reviews. In this paper, we use the dataset proposed by Li et al. [4], because it is the largest dataset of deceptive review spam to the best of our knowledge. Several neural network models have been proposed based on this dataset. Ren et al. [7] and Li et al. [6] built hierarchical (sentence-document) models and used the attention mechanism to learn the representation of the review, which achieved better results than the baseline model proposed by Li et al. [4]. Sun et al. [19] proposed a convolutional neural network model to integrate product-related review features through a product word composition model. This paper also uses a neural network model to learn the document representation of the review. Unlike Ren et al. [7] and Li et al. [6], however, we do not use the sentence-document structure. The structure of our model is based on the structure of the review: we divide a review into three parts, following the idea that the beginning and end of a review are more important for detecting a deceptive review, and stack LSTM models and attention mechanisms to learn the representation of the review.

Ensemble Learning
The idea of ensemble learning is to build multiple weak models and integrate them through some strategy to obtain a stronger model. Popular ensemble learning methods include bagging [20] and boosting [21]. Bagging randomly constructs several groups of training samples to train several different models, so the independence of the models comes from the independence of the training data. Random forest [22] is a representative model that uses bagging. Boosting trains a group of models iteratively and changes the distribution of the data according to the classification results. AdaBoost [23] is a representative model that uses boosting.
Ensemble learning is very popular with tree models, and it is also commonly used with neural network models to improve their generalization ability. In addition to general methods such as bagging, a group of models can be trained using different initialization parameters [24] or different hyper-parameters [25], and the models can be combined through a weighted average or stacking.
In this paper, we use four independent bidirectional LSTM models to encode the beginning, middle, end, and whole text of a review into four document representations, based on the discovery that the beginning and end of a review are more important than the middle context. To capture the information of the four document representations, we use attention mechanisms to integrate them into one document representation.

Materials and Methods
Based on the discovery that the first and last sentences of a review are more important than the middle context, we split a review into three parts, as shown in Figure 1.
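The three-part split described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing: the function name `split_review`, the regular-expression sentence splitter, and the handling of reviews with fewer than three sentences are all assumptions made here.

```python
import re

def split_review(text):
    """Split a review into (first sentence, middle context, last sentence).

    Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    A real pipeline would likely use a proper sentence tokenizer.
    """
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text.strip())
                 if s.strip()]
    if not sentences:
        return "", "", ""
    first = sentences[0]
    # Assumption: a one-sentence review has no separate last sentence.
    last = sentences[-1] if len(sentences) > 1 else ""
    # Everything between the first and last sentence is the middle context.
    middle = " ".join(sentences[1:-1])
    return first, middle, last

first, middle, last = split_review(
    "My husband and I stayed there. The room was clean. "
    "Staff were kind. We will return!"
)
```

Each of the three parts, together with the whole review, would then be passed to its own encoder.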

Bidirectional LSTM Encoder
The long short-term memory network (LSTM) [26] is commonly used to model sequences. The LSTM is a special architecture of the recurrent neural network (RNN) [27], designed to solve the vanishing gradient problem of the RNN. The LSTM adds a memory cell and a gating mechanism to the common RNN. The memory cell is designed to preserve the memory and gradients of neurons across time. The input, forgetting, and output of information in the memory cell are controlled by three adaptive gates ($g_i$, $g_f$, $g_o$), which are defined in Equations (1)-(3).
Here, $x_j$ is the current input at position $j$ in the sequence, and $h_{j-1}$ is the state of the previous cell. $g_i$, $g_f$, $g_o$ control the input, forgetting, and output of the memory cell. The values of $g_i$, $g_f$, $g_o$ are linear combinations of $x_j$ and $h_{j-1}$, passed through a sigmoid activation function. The new state $z$ is a linear combination of $x_j$ and $h_{j-1}$ passed through a tanh activation function, as shown in Equation (4). $z$ is then saved in the memory cell, but it does not replace the old value in the memory cell. The new memory cell is a linear combination of $z$ and the old value. Equation (5) shows the update of the memory cell.
Here, $c_j$ is the new memory cell, and $c_{j-1}$ is the old value of the memory cell. The forget gate $g_f$ controls how much of the old information should be forgotten, and the input gate $g_i$ controls how much of the new information should be saved. The final output of the cell is not $z$, but the memory cell $c_j$ passed through a tanh function and controlled by the output gate $g_o$. $g_o$ controls how much of the information in the memory cell should be output, as shown in Equation (6).
Here, $h_j$ is the output of the LSTM at position $j$. The memory cell and gating mechanism can effectively alleviate the vanishing and exploding gradient problems of the RNN. Hence, the LSTM can capture long-distance dependencies in sequences. Compared with an ordinary LSTM, the bidirectional LSTM [28] can extract information from a sequence in both directions, which is more effective than a one-directional LSTM. For convenience, we denote the bidirectional LSTM as BiLSTM in this paper. The output at each position of the BiLSTM is the concatenation of the output of the forward LSTM and the output of the backward LSTM, as shown in Equations (7)-(9).
Equations (7) and (8) are recursive definitions of the outputs of the forward and backward LSTM at position $t$. They show that the outputs of the forward and backward LSTM depend on the current input and the output of the previous position. Equation (9) shows that the output of the BiLSTM at each position is the concatenation of the forward and backward LSTM outputs. $e_t$ is the embedding of the word at position $t$. Word embeddings [29] are continuous real-valued vectors, which can be pre-trained on a large corpus. The word embeddings in this paper were pre-trained on a Wikipedia corpus using the fastText model [30].
As shown in Figure 2, the output of the BiLSTM encoder is the concatenation of the last states of the forward and backward LSTM, which contains bidirectional information about the whole sequence.
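For reference, a standard LSTM and BiLSTM formulation that is consistent with the description of Equations (1)-(9) can be written as below. The weight matrices $W_\ast$, $U_\ast$, the biases $b_\ast$, the order in which the three gates are numbered, and the function names $\mathrm{LSTM}_{\mathrm{fwd}}$ and $\mathrm{LSTM}_{\mathrm{bwd}}$ are notation assumed here; the model's exact parameterization may differ. For the backward LSTM, the "previous position" is taken in its own reading order.

```latex
\begin{align}
g_i &= \sigma(W_i x_j + U_i h_{j-1} + b_i) \tag{1}\\
g_f &= \sigma(W_f x_j + U_f h_{j-1} + b_f) \tag{2}\\
g_o &= \sigma(W_o x_j + U_o h_{j-1} + b_o) \tag{3}\\
z   &= \tanh(W_z x_j + U_z h_{j-1} + b_z) \tag{4}\\
c_j &= g_f \odot c_{j-1} + g_i \odot z \tag{5}\\
h_j &= g_o \odot \tanh(c_j) \tag{6}\\
\overrightarrow{h}_t &= \mathrm{LSTM}_{\mathrm{fwd}}(e_t, \overrightarrow{h}_{t-1}) \tag{7}\\
\overleftarrow{h}_t &= \mathrm{LSTM}_{\mathrm{bwd}}(e_t, \overleftarrow{h}_{t-1}) \tag{8}\\
H_t &= [\overrightarrow{h}_t : \overleftarrow{h}_t] \tag{9}
\end{align}
```

Here $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication.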
In this paper, we use the BiLSTM to encode the first, middle, and last parts of the review into three vectors $s_1$, $s_2$, $s_3$ with the same dimension ($s_1, s_2, s_3 \in \mathbb{R}^{d_m}$). Since $s_1$, $s_2$, $s_3$ can each represent only a part of the review, we use another BiLSTM to encode the whole review into a vector $s_c$ to capture the information of the whole review. Although $s_1$, $s_2$, $s_3$, $s_c$ come from the same architecture, each encoder is independent of the others and takes a different sequence as input; hence, their outputs are different.

Self-Attention Mechanism
$s_1$, $s_2$, $s_3$, which are encoded by the BiLSTM, contain the information of the first, middle, and last parts of the review, respectively. Since $[s_1, s_2, s_3]$ can be regarded as a sequence of length three, a sequence model is a better way to integrate them than a weighted average. As shown in Figure 3, we use the self-attention mechanism to encode the sequence composed of $s_1$, $s_2$, $s_3$.
Self-attention [31] is a special kind of attention mechanism, which can effectively extract the dependencies between different positions, like common sequence models such as RNN and CNN. Compared with RNN and CNN, it has fewer parameters and lower computational complexity. The output of the self-attention mechanism is the weighted average of the different positions of the input sequence, and the weights are obtained by a function of the input sequence. We denote the weights as Attention and the function as Adp. In our model, the input sequence is a matrix composed of $s_1$, $s_2$, $s_3$. We denote the input sequence as $S$, $S = [s_1 : s_2 : s_3]$. We use a multilayer perceptron (MLP) as the function Adp, because an MLP can fit any continuous function and adjust its parameters adaptively through backward propagation, and we use softmax to normalize its output into Attention. The output of the self-attention mechanism is the weighted average of $S$, with Attention as the weight matrix. We denote the output as $Z$, $Z \in \mathbb{R}^{3 \times d_m}$. $Z$ can be represented as $[z_1 : z_2 : z_3]$, and each $z_i$ is obtained as a weighted average of $S$.
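The self-attention step described above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the one-hidden-layer form of `adp`, its parameter names (`W1`, `b1`, `W2`, `b2`), and the choice of producing one score per input position for each $s_i$ are illustrative guesses, since only "an MLP followed by softmax" is specified in the text.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def adp(s, W1, b1, W2, b2):
    """Hypothetical one-hidden-layer MLP: maps s to one score per position."""
    hidden = [math.tanh(h + b) for h, b in zip(matvec(W1, s), b1)]
    return [o + b for o, b in zip(matvec(W2, hidden), b2)]

def self_attention(S, W1, b1, W2, b2):
    """Return (Z, attention) with Z[i] = sum_k attention[i][k] * S[k]."""
    attention = [softmax(adp(s, W1, b1, W2, b2)) for s in S]
    d = len(S[0])
    Z = [[sum(a * s_k[j] for a, s_k in zip(row, S)) for j in range(d)]
         for row in attention]
    return Z, attention

# Toy input: three 2-dimensional part representations s_1, s_2, s_3.
S = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W1 = [[0.0, 0.0] for _ in range(4)]   # 4 hidden units, all-zero weights
b1 = [0.0] * 4
W2 = [[0.0] * 4 for _ in range(3)]    # 3 output scores (one per position)
b2 = [0.0] * 3
Z, attention = self_attention(S, W1, b1, W2, b2)
```

With all-zero parameters the scores are equal, so each row of `attention` is uniform and every $z_i$ reduces to the plain average of $s_1$, $s_2$, $s_3$; trained parameters would instead shift weight toward the more informative parts.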
The output of the self-attention mechanism is still a sequence, and each element of the sequence can be viewed as a document representation. Compared with the input sequence, however, the output sequence carries no information about the order of the sequence. To keep the positional information of the sequence, we add a positional encoding to $[z_1 : z_2 : z_3]$. In this paper, we use the sine and cosine positional encoding proposed by Vaswani et al. [31].
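The sinusoidal encoding of Vaswani et al. [31] sets $PE_{pos,2i} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{pos,2i+1} = \cos(pos / 10000^{2i/d_{model}})$. A minimal sketch follows; the function names, and the assumption that the three elements $z_1$, $z_2$, $z_3$ are assigned positions 0, 1, 2, are choices made here for illustration.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal positional encoding vector for one position."""
    pe = [0.0] * d_model
    for k in range(0, d_model, 2):          # k plays the role of 2i
        angle = pos / (10000 ** (k / d_model))
        pe[k] = math.sin(angle)             # even index: sine
        if k + 1 < d_model:
            pe[k + 1] = math.cos(angle)     # odd index: cosine
    return pe

def add_positions(Z):
    """Add the encoding of position i to the i-th element of the sequence."""
    d_model = len(Z[0])
    return [[z_j + p_j for z_j, p_j in zip(z, positional_encoding(i, d_model))]
            for i, z in enumerate(Z)]

Z_with_positions = add_positions([[0.5, 0.5, 0.5, 0.5]] * 3)
```

Because the encoding is a fixed function of the position, it adds no trainable parameters to the model.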

Attention Mechanism
The output of the self-attention mechanism can be regarded as a sequence in which each element includes information about the whole review. The vector $s_c$, encoded by the BiLSTM from the whole review, also contains the information of the whole review. Since $Z$ and $s_c$ are encoded by different models, we can integrate them to obtain a better document representation. However, $Z$ and $s_c$ have different dimensions: $Z \in \mathbb{R}^{3 \times d_m}$, $s_c \in \mathbb{R}^{d_m}$. $Z$ can be represented as $[z_1 : z_2 : z_3]$, so we can view $Z$ as the concatenation of $z_1$, $z_2$, $z_3$. Hence, we are actually integrating four representations: $z_1$, $z_2$, $z_3$, $s_c$. We take $[z_1 : z_2 : z_3]$ as a sequence and use the attention mechanism to encode it. We use the attention mechanism because we cannot simply append $s_c$ to the sequence $[z_1 : z_2 : z_3]$, given the difference between $s_c$ and $z_i$. We can, however, take $s_c$ as the query and $Z$ as the key-value pair, which is natural in the attention mechanism (Figure 4).
The idea behind the attention mechanism is to compare each element of a sequence with a query vector. The higher the similarity, the larger the weight the element receives. In our model, the query vector is $s_c$, and the sequence is $[z_1 : z_2 : z_3]^T$. The weight of $z_i$ is denoted as $a_i$, and the matrix $[a_1 : a_2 : a_3]$ formed by concatenating $a_1$, $a_2$, $a_3$ is denoted as Attention. $a_i$ is obtained from a similarity function Sim of $s_c$ and $z_i$. We use a multilayer perceptron (MLP) to compute the similarity of $s_c$ and $z_i$, and softmax is applied to normalize the similarities. We take Attention as the weights and form a weighted combination of $Z$ to get the output $O$, as shown in Equation (18). $O$ is the integration of $s_1$, $s_2$, $s_3$, $s_c$, and is the final representation of the review.
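The query-based attention step can be sketched as follows. This is an illustration under assumptions: the additive form of `sim` (a vector `v` applied to `tanh(W_q q + W_z z)`) and the parameter names `W_q`, `W_z`, `v` are hypothetical, since the text only states that an MLP computes the similarity and softmax normalizes it.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sim(query, z, W_q, W_z, v):
    """Hypothetical MLP similarity: v . tanh(W_q query + W_z z)."""
    hidden = [math.tanh(a + b)
              for a, b in zip(matvec(W_q, query), matvec(W_z, z))]
    return sum(v_k * h_k for v_k, h_k in zip(v, hidden))

def attend(query, Z, W_q, W_z, v):
    """Return (O, weights): O is the attention-weighted combination of Z."""
    weights = softmax([sim(query, z, W_q, W_z, v) for z in Z])
    d = len(Z[0])
    O = [sum(a * z[j] for a, z in zip(weights, Z)) for j in range(d)]
    return O, weights

# Toy example: s_c is the query, Z holds z_1, z_2, z_3.
identity = [[1.0, 0.0], [0.0, 1.0]]
s_c = [1.0, 0.0]
Z = [[1.0, 0.0], [0.0, 0.0], [-1.0, 0.0]]
O, weights = attend(s_c, Z, identity, identity, [1.0, 1.0])
```

In this toy setting, the element most aligned with the query ($z_1$) receives the largest weight and the opposed element ($z_3$) the smallest, so $O$ is pulled toward $z_1$.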

Classifier
The classifier is a shallow fully-connected neural network applied to the final document representation. Note that this classifier could also be applied to other representations such as $s_1$, $z_1$, or $s_c$, but we only use $O$ for classification because it combines the information of all other representations.
The fully-connected neural network maps the multi-dimensional representation to a 2-dimensional vector $y = [y_0, y_1]$, where $y_0$ and $y_1$ are the scores that the model assigns to the review for the two categories. Softmax normalizes $y_0$ and $y_1$; the normalized result, denoted as $p$ ($p \in \mathbb{R}^2$), can be viewed as the probability distribution of the model over the two categories.
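As a rough sketch of this last step, the snippet below applies a single linear layer followed by softmax. The single-layer form and the names `classify`, `W`, and `b` are assumptions for illustration; the actual classifier is described only as a shallow fully-connected network and may contain hidden layers.

```python
import math

def classify(o, W, b):
    """Map representation o to scores y = W o + b and probabilities p = softmax(y)."""
    y = [sum(w * x for w, x in zip(row, o)) + b_k for row, b_k in zip(W, b)]
    m = max(y)                                   # subtract max for stability
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    p = [e / total for e in exps]
    return y, p

# Toy example with a 2-dimensional representation and hand-picked weights.
y, p = classify([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
predicted_class = max(range(2), key=lambda k: p[k])
```

The predicted category is the index with the larger probability; which index corresponds to "deceptive" is a labeling convention not fixed by this sketch.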

Results
We evaluated our model in three experiments (in-domain, mix-domain, and cross-domain) on datasets from three domains (Hotel, Restaurant, Doctor). Compared with the baseline models of Li et al. [4] and Li et al. [6], our model obtained better results in the in-domain and mix-domain experiments.

Datasets and Evaluation Metrics
We evaluated the proposed model using the standard dataset proposed by Li et al. [4], which is the largest dataset of deceptive reviews to the best of our knowledge. The dataset contains three domains (Hotel, Restaurant, and Doctor). Table 2 shows the distribution of the data. There are three types of data in each domain: "Turker", "Expert", and "Customer". Reviews of type "Turker" and "Expert" are deceptive reviews, while reviews of type "Customer" are truthful reviews written by customers with high credibility. The "Turker" reviews were collected by Li et al. [4] and Ott et al. [2] through the Amazon online crowdsourcing market. The "Expert" reviews were written by experts with domain knowledge. However, there are far fewer "Expert" reviews than "Turker" and "Customer" reviews; hence, we do not use them in the experiments.
We compared the proposed model with the baseline method [4] and the state-of-the-art method [6]. Li et al. [4] and Li et al. [6] evaluated their models in three kinds of experiments: in-domain, cross-domain, and mix-domain. To allow a comparison with them, we also tested our model in these three experiments. To make the experimental results more reliable, we used five-fold cross-validation: the data was split into five equal folds, four folds were used as training data, and the remaining fold was used for testing. Li et al. [4] and Li et al. [6] used the F1 score, precision, recall, and accuracy to evaluate model performance, so we also used these four metrics to evaluate our model.
As shown in Tables 3-5, we compared our model (RSBE) with the model of Li et al. [4] (SAGE) and the model of Li et al. [6] (SWNN) in three experiments. The method of Li et al. [4] uses n-gram features with the SAGE model, and the SWNN model proposed by Li et al. [6] is a hierarchical model based on convolutional neural networks and a hard attention mechanism.
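The evaluation protocol above can be sketched as follows. The helper names `k_fold_indices` and `binary_metrics` are hypothetical, the folds are taken contiguously (in practice the data would normally be shuffled first), and treating label 1 as the positive class is an assumption.

```python
def k_fold_indices(n, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds of n items."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

folds = list(k_fold_indices(10, k=5))
metrics = binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

In a full experiment, a model would be trained on each `train` split, evaluated on the matching `test` split, and the four metrics averaged over the five folds.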

In-Domain Experiments
Table 3 shows the results of the in-domain experiments. In the hotel and doctor domains, our proposed model (RSBE) performed significantly better than SAGE and SWNN. In the restaurant domain, SWNN obtained the best result, but RSBE achieved the highest recall and performed much better than SAGE. Although SWNN performed best in the restaurant domain, its performance in the other two domains was much worse. The performance of RSBE was stable across the three domains, at about 85% on every metric, although the restaurant and doctor datasets are much smaller than the hotel dataset. In contrast, the performance of SWNN and SAGE in the doctor domain was much worse than in the restaurant and hotel domains.
As argued in this paper, the first and last sentences of a review are more important than the middle context, so they provide more information for detecting deceptive reviews than the middle context does. By emphasizing these sentences, RSBE extracts more information relevant to detecting deceptive reviews than SWNN, which increases its sensitivity to spam reviews. This view is supported by the experimental results in Table 3: the recall of RSBE was the best in all three domains. In general, a classifier with high sensitivity achieves good recall but may lose precision, which is why the F1 score was adopted to judge the overall quality of a classifier.
The average F1 score of RSBE (85.5%) was higher than that of SAGE (79.6%) and SWNN (84.7%). In general, our method performs better than SWNN and SAGE in the in-domain experiment.

Mix-Domain Experiments
Table 4 shows the results of the mix-domain experiment. In this experiment, we gathered the data from all domains into a mix-domain dataset and compared our method with SWNN and several other neural network models. The results of Basic LSTM, Hier-LSTM, and Basic CNN are taken from the paper of Li et al. [6]. We did not compare our model with SAGE because the paper of Li et al. [4] reports no mix-domain result. All of the methods in Table 4 learn a document representation using neural network models. Basic LSTM uses an LSTM to extract the document representation, and Hier-LSTM uses an LSTM to extract sentence representations and combines them into a document representation. Basic CNN uses convolutional neural networks to learn the document representation. SWNN is a modification of the Basic CNN model.
Table 4 shows that the RSBE and SWNN models perform significantly better than the other neural network models. Although Hier-LSTM achieved a very high recall, its accuracy and precision were very low, which means that the model fails to fit the data. The RSBE model achieved the highest accuracy and precision, which are important metrics for classification, while SWNN achieved the best recall and F1 score. In general, our method performs comparably with SWNN and better than the other neural networks in the mix-domain experiment.

Cross-Domain Experiments
The cross-domain experiment is designed to test the robustness of the model. In this experiment, we trained a model on one dataset and evaluated it on the other datasets. Since the hotel dataset is the largest (1600 samples), compared with the doctor dataset (400) and the restaurant dataset (400), we trained the model on the hotel dataset and tested it on the restaurant and doctor datasets.
Table 5 shows the results of the cross-domain experiment. On the restaurant dataset, the method of Li et al. [4] obtained the best results, while RSBE performed better than SWNN on accuracy and precision. In the doctor domain, the method of Li et al. [6] obtained the best result because of its high recall, but it failed to achieve good accuracy and precision, which are important metrics for evaluating a model. In general, the performance of all three methods in the cross-domain experiment is worse than in the in-domain and mix-domain experiments. All models trained on Hotel reviews performed better on Restaurant reviews than on Doctor reviews, which is reasonable because the vocabulary of the Hotel domain is more similar to that of the Restaurant domain.

Hyper-Parameters Tuning
As shown in Table 6, we found four kinds of hyper-parameters that are important to the results of the experiment. Dropout is a common method for avoiding overfitting in neural network models [32]; hence, we applied dropout to the outputs of the BiLSTM, the self-attention mechanism, the attention mechanism, and the fully-connected layer. Recurrent dropout is a special kind of dropout used inside the BiLSTM to avoid overfitting [33]. Table 6 shows the best setting of the four kinds of hyper-parameters in the in-domain experiment. The best hyper-parameters for the three domains are very similar, which means that a single hyper-parameter setting can be applied to all three domains.
To test the model's robustness, we compared the influence of different hyper-parameters and different domains on the model's performance. As shown in Table 7 and Figure 5, we chose three important hyper-parameters and used the F1 score to evaluate the model's performance. To make the results clearer, we computed the standard deviation of the F1 scores. All three hyper-parameters had a small standard deviation (under 0.014), which indicates that the model is robust to variation in these three hyper-parameters. We also noticed that the standard deviation for the second hyper-parameter (dimension of the fully-connected layer) was clearly smaller than for the others, which indicates that the model is most robust to the dimension of the fully-connected layer. Comparing the standard deviation of the same hyper-parameter across domains, we noticed that the hotel domain has the smallest standard deviation, which is reasonable because the hotel dataset is much larger than the other datasets. To test the model's robustness across domains, we also computed the standard deviation of the model's performance over the different domains. The standard deviations were all smaller than 0.016, which indicates that the model's performance is stable across domains.
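The robustness measure used above is simply the standard deviation of F1 scores across settings, which can be computed as in the sketch below. The F1 values are made-up placeholders rather than results from Table 7, and the use of the population standard deviation (`pstdev`) instead of the sample standard deviation is an assumption.

```python
from statistics import pstdev

def f1_spread(f1_scores):
    """Population standard deviation of F1 scores over a set of settings."""
    return pstdev(f1_scores)

# Hypothetical F1 scores for three values of one hyper-parameter.
example_scores = [0.84, 0.85, 0.86]
spread = f1_spread(example_scores)
is_stable = spread < 0.014   # threshold quoted in the text above
```

A small spread means that changing the hyper-parameter moves the F1 score very little.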

Visualization of Attention
As shown in Figure 6, we visualized the attention weights of the self-attention mechanism and the attention mechanism on the three datasets. In the self-attention mechanism, the weights of the first and last sentences were clearly higher than those of the middle context, which validates our assumption that the first and last sentences are more important than the middle context in a review. This pattern is most pronounced in the Hotel dataset, which is the largest dataset. The weights of the attention mechanism, however, did not show a clear pattern. Because the attention mechanism is stacked on the self-attention mechanism, and the output of the self-attention mechanism contains no information about the order of the sequence except for the positional encoding we add to the output sequence, it is difficult to identify which of ($z_1$, $z_2$, $z_3$) represents the first or last sentence. We noticed, however, that the attention-mechanism weights on the hotel dataset were very similar to those of the self-attention mechanism. Since the hotel dataset is much larger than the other datasets, this may indicate that the sequence ($z_1$, $z_2$, $z_3$) still keeps the order (first sentence, middle context, last sentence).

Conclusions
This paper proposes an integrated model based on the structure of the review for deceptive review detection. First, we split a review into three parts: the first sentence, middle context, and last sentence. Then we used four independent bidirectional LSTM models to encode the three parts and the whole review. To integrate the outputs of the four LSTM encoders, we stacked two layers of attention mechanisms to obtain a final representation of the review. Finally, the classification result was obtained through a fully-connected neural network based on the final representation.
We compared the RSBE model with two baseline methods [4,6]. In general, RSBE performs better than the compared methods in the in-domain and mix-domain experiments, which verifies the effectiveness of our method for deceptive review detection. The results of the hyper-parameter tuning experiments indicate that our model is robust to different hyper-parameters and domains. The visualization of attention indicates that the structure of our model is reasonable, since the weights of the first and last sentences are significantly higher than those of the middle context, as we expected.
However, the model did not perform well in the cross-domain experiment. The cross-domain experiment is in fact a zero-shot learning task [34], because the test domain is unseen during training. The vocabularies of different domains can be very different, so it is difficult for a model trained on one domain to transfer to another. In future work, we may try two approaches to this problem. One is to train the model on domain-independent features, such as the syntactic structure of sentences and high-frequency words; the other is to make use of unlabeled data from the target domain. Unlabeled data is much easier to obtain than labeled data. It cannot indicate whether a review is deceptive, but it contains rich domain information that is useful for domain adaptation. Several domain-adaptive approaches make use of unlabeled data [35,36]. It is worth applying these methods to RSBE, and we will investigate this in the future.
Based on this three-part split of the review, we propose an ensemble model (RSBE). The model is composed of four bidirectional LSTM encoders and two layers of attention mechanisms. The four bidirectional LSTM encoders encode the review's first sentence, middle context, last sentence, and whole text into four document representations: representations 1, 2, and 3 represent the first sentence, middle context, and last sentence, respectively, and representation 4 represents the entire review. The two attention layers then integrate representations 1-4 into a final document representation. In detail, the self-attention mechanism integrates representations 1-3 (first sentence, middle context, and last sentence) into representation 5, while the attention mechanism integrates representation 4 and representation 5 to obtain the final representation. Finally, the classification results are obtained by a feedforward neural network. In the following sections, we present the details of the bidirectional LSTM encoder, the self-attention mechanism, and the attention mechanism.

Figure 1. The ensemble model based on the structure of the review.
$PE$ is the positional encoding of the sequence at position $pos$. $PE_{pos,2i}$ and $PE_{pos,2i+1}$ are the values of the vector $PE$ at indices $2i$ and $2i + 1$. According to Vaswani et al. [31], compared with other encoding methods, this method can capture relative positional information without adding any parameters to the model.

Figure 5. (a) The effect of the dimension of the fully-connected layer on the model's performance. (b) The effect of the dimension of the LSTM layer on the model's performance. (c) The effect of the dropout rate on the model's performance.

Figure 6. (a) The attention weights of the self-attention mechanism described in Section 3.2 on three datasets. (b) The attention weights of the attention mechanism described in Section 3.3 on three datasets.

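Table 1. Similar beginnings and endings in deceptive reviews.
In Equations (7)-(9), $\overrightarrow{h}_t$ and $\overrightarrow{h}_{t-1}$ denote the output of the forward LSTM at positions $t$ and $t - 1$, and $\overleftarrow{h}_t$ and $\overleftarrow{h}_{t-1}$ denote the output of the backward LSTM at positions $t$ and $t - 1$. $e_t$ denotes the input of the sequence at position $t$. $H_t$ denotes the output of the BiLSTM at position $t$, which is the concatenation of $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$.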

Table 2. Statistics of the three-domain dataset.

Table 7. The model's performance on different hyper-parameters and domains.