Tourism Review Sentiment Classification Using a Bidirectional Recurrent Neural Network with an Attention Mechanism and Topic-Enriched Word Vectors

Abstract: Sentiment analysis of online tourist reviews is playing an increasingly important role in tourism. Accurately capturing the attitudes of tourists regarding different aspects of scenic sites, or the overall polarity of their online reviews, is key to tourism analysis and application. However, the performance of current document sentiment analysis methods is not satisfactory, as they either neglect the topics of the document or do not consider that not all words contribute equally to the meaning of the text. In this work, we propose a bidirectional gated recurrent unit neural network model (BiGRULA) for sentiment analysis that combines a topic model (lda2vec) and an attention mechanism. Lda2vec is used to discover the main topics of the review corpus, which are then used to enrich the word vector representation of words with context. The attention mechanism learns to assign different weights to words according to their contribution to the overall meaning of the text. Experiments on the 20 Newsgroups and IMDB datasets demonstrate the effectiveness of our model. Furthermore, we applied our model to hotel review data analysis, which allowed us to obtain more coherent topics from these reviews and achieve good performance in sentiment classification.


Introduction
The availability of extensive online tourism reviews provides an unprecedented opportunity to analyze the emotions, preferences, feelings, and opinions expressed by visitors. Sentiment analysis is one of the major techniques for this purpose, and it provides insight into tourism services. Currently, a significant amount of research has been carried out on tourism analysis and applications based on sentiment analysis. Zheng [1] proposed a tourism destination recommender system by analyzing and quantifying users' sentiment tendency. Ren [2] proposed a topic-based sentiment analysis approach to measure online destination image. Li [3] designed a visual analytic system to analyze tourists' regional tendency and sentiment changes from user-generated content (UGC) data. Serna [4] analyzed the public bike share system in Spain to explore sustainable tourism through sentiment analysis of UGC. He [5] used sentiment analysis techniques to analyze online hotel reviews and to understand users' preferred hotel attributes or demands.
With an increasing demand for sentiment analysis of text data, a variety of research on improving the accuracy of document sentiment classification was carried out. The goal of document sentiment

Related Work
Major progress was made recently in sentiment analysis, ranging from word embedding methods to recurrent neural networks. For example, word2vec [10] is one of the most widely used word-embedding models and is used in a variety of text processing applications. However, it has limitations: it cannot handle polysemy, and the learned word vectors cannot represent global meaning. Latent Dirichlet allocation (LDA) [11], a probabilistic topic model that extracts latent topics from documents, addresses the latter limitation. It describes the topic distribution of documents and the word distribution of topics as probability distributions, and can therefore represent global rather than local contextual relationships.
Based on these methods, a hybrid document feature extraction method was put forward [12]. This method uses latent Dirichlet allocation and word2vec independently to train topic vectors, while the document vector is still the simple average of the word vectors. Liu et al. [13] proposed a topical word embedding (TWE) based on all the words and their topics. Compared to word2vec, it uses the topic of the words to predict the context and allows the same word to have different word vectors under different topics. Yao et al. [14] combined word2vec and LDA to mine coherent topics in documents. Zhang [15] used LDA to supervise the training of deep neural networks. All these works aim to combine word2vec with LDA, expecting to take advantage of both; however, they still cannot train a type of word vector that both represents the local meaning of documents and explains the global meaning of the topic distribution. In 2016, Moody [16] proposed a model named lda2vec by mixing Dirichlet topic models and word embedding. This model constructs a context vector by adding a document vector to the word vector, both of which are learned during the training process. It greatly improves the representation power of standard word vectors.
Another major research theme in sentiment analysis is how to compute the sentence and document vectors. Several works tried importing document topics and an attention mechanism into the neural network framework for document classification. Dieng [17] proposed TopicRNN, which integrates the merits of recurrent neural networks (RNNs) and the latent topic model to capture long-range semantic dependency. Li [18] proposed a recurrent attentional topic model for document embedding based on a novel recurrent attentional Bayesian process. A feed-forward network with attention was suggested by Raffel [19], which is a well-known work using the attention mechanism.
The attention mechanism is a powerful technique for solving the problem of long-term memory for sequences. It is widely used in various tasks, especially with recurrent neural network models. Both the topic model and the attention mechanism aim to extract more informative features, and they can be combined to achieve better sentiment analysis performance, as we show below. A combination of the attention mechanism with topic information was used in Reference [20] for text summarization based on a convolutional sequence-to-sequence model, which is different from the sentiment analysis task in this paper.

BiGRULA
We propose the BiGRULA document sentiment analysis model based on lda2vec and the attention mechanism. The overall architecture is shown in Figure 1. In the left part of this architecture, the lda2vec model, which is based on LDA and word2vec, is used to extract a topic-based word vector representation of documents. It adds context information to the word embedding. Through lda2vec, we obtain the word vectors and the topics of the text dataset. The topic-enhanced word vectors are then used to encode the text set, and the encoded texts are fed to the BiGRU recurrent neural network model with attention to obtain the document vectors and the classification model. Finally, the text documents are classified by this model.

Lda2vec Architecture
We exploited the lda2vec algorithm as the topic feature extractor of the BiGRULA model. Lda2vec can extract topics from texts and generate topic-adjusted word vectors, which makes these word vectors more interpretable by linking them to the topics. In other words, sparse word vectors in our model are enhanced with meaning by importing the interpretable document representation. The lda2vec model mainly takes advantage of the document representation and learns the topic weights of the documents by minimizing the objective function of skip-gram negative sampling (SGNS). The procedure of the lda2vec model is illustrated in Figure 2. Details of the model are explained below.

Document Vector
Document vectors are used to represent the topic tendency of a document. We used lda2vec to obtain an interpretable representation, similar to that of traditional LDA, for generating document vectors. To achieve this goal, a document vector was calculated as a weighted sum of topic vectors:
$$\vec{d}_j = \sum_{k=1}^{n} p_{jk}\,\vec{t}_k, \tag{1}$$
where $p_{jk}$ denotes the weight of topic $k$ on document $j$, $p_{jk} > 0$, $\vec{t}_k$ denotes the $k$-th topic vector, and $n$ is the number of topics. During the training process, the document vector was updated through these weights, which were normalized to ensure $\sum_k p_{jk} = 1$. The document vector, word vector, and topic vector were all in the same vector space. To determine the specific meaning of a topic vector, we only need to find the words most similar to the topic vector.
Finally, topic weights of the document were optimized using the Dirichlet likelihood $L^d$:
$$L^d = \gamma \sum_{j,k} (\alpha - 1)\log p_{jk}, \tag{2}$$
where $\gamma$ denotes the strength of $L^d$ in the training process of lda2vec, and $\alpha$ denotes a low concentration parameter. If $\alpha$ is less than 1, topics become sparse; if $\alpha$ is equal to 1, the Dirichlet distribution degenerates to a uniform distribution, which leads to poor topic consistency; if $\alpha$ is greater than 1, the differences between topics become small. In the beginning, the weights of all topics are initialized to be the same. As the iterative training goes on, they become sparser and concentrate on one or a few topics.
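The weighted-sum construction and the Dirichlet term can be illustrated with a small numerical sketch. It is not the authors' implementation: the softmax normalization of raw topic weights, the helper names, and all toy values are assumptions introduced only for illustration.

```python
import math

def softmax(weights):
    """Turn raw topic weights into positive proportions that sum to 1 (assumed normalization)."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def document_vector(proportions, topic_vectors):
    """Weighted sum of topic vectors: d_j = sum_k p_jk * t_k."""
    dim = len(topic_vectors[0])
    return [sum(p * t[i] for p, t in zip(proportions, topic_vectors))
            for i in range(dim)]

def dirichlet_term(proportions, alpha, gamma=1.0):
    """gamma * sum_k (alpha - 1) * log p_jk for a single document."""
    return gamma * sum((alpha - 1.0) * math.log(p) for p in proportions)

# Hypothetical toy setup: 3 topics in a 2-dimensional embedding space.
topic_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
raw_weights = [2.0, 0.0, 0.0]   # unnormalized topic weights of one document
p = softmax(raw_weights)
d = document_vector(p, topic_vectors)

uniform = [1 / 3] * 3
sparse = [0.98, 0.01, 0.01]
# With alpha < 1 the term is larger for the concentrated mixture than for the uniform one.
print(dirichlet_term(sparse, alpha=0.5) > dirichlet_term(uniform, alpha=0.5))
```

With alpha below 1 the term takes a higher value for a mixture concentrated on one topic than for a uniform mixture, which matches the sparsity behavior described above; with alpha equal to 1 the term vanishes for any mixture.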

Context Vectors
Context vectors are the topic/context-enhanced word vectors used in our BiGRULA model. They were calculated as follows: firstly, given a pivot word in the text corpus, five target words in a moving window before and after the pivot word were selected. This process was repeated across the whole corpus. Then, the pivot word was used to predict the nearby target words. For example, if the pivot word is "red", then the nearby words are probably predicted as "green" or "yellow". If we know that the document is about weather, the words near the pivot word "red" are more likely to be predicted as "typhoon", "heavy rain", or "high temperature".
Context vectors are inspired by the meaningful combination of word vectors through addition and subtraction, such as "king − man + woman = queen". Similarly, adding a word vector and the document vector yields a sum vector that captures both long- and short-term themes. In our model, the context vector $\vec{c}_j$ was defined as the sum of the pivot word vector $\vec{\omega}_j$ and the document vector $\vec{d}_j$:
$$\vec{c}_j = \vec{\omega}_j + \vec{d}_j, \tag{3}$$
where $\vec{\omega}_j$ is the word vector of the $j$-th word in the document.

SGNS (Skip-Gram Negative Sampling)
In our model, we used SGNS to jointly train the context vectors and topic-enhanced word vectors, as shown in Figure 1. SGNS attempts to differentiate the target words from the negative samples, which are randomly picked from a negative sampling pool. In SGNS, high-frequency words are selected as negative samples with higher probability, while low-frequency words are less likely to be selected. Let $\mu$ denote the word frequency normalized by the sample scale; then, the probability of the appearance of infrequent words is regulated by $\mu^{\beta}$, where $\beta$ is a smoothing parameter. When the target word is separated from the negative samples, the loss function $L^{neg}_{ij}$ is minimized:
$$L^{neg}_{ij} = \log\sigma\left(\vec{c}_j \cdot \vec{\omega}_i\right) + \sum_{l}\log\sigma\left(-\vec{c}_j \cdot \vec{\omega}_l\right), \tag{4}$$
where $\sigma$ is the sigmoid function, $\vec{c}_j$ is the context vector of pivot word $j$, $\vec{\omega}_i$ denotes the word vector of the target word $i$, and $\vec{\omega}_l$ denotes the word vector of negative sample $l$. To prevent co-adaptation, we used dropout on the document vector and the pivot vector before they were normalized.
In the process of negative sampling, we need to remove high-frequency stop words to reduce the noise of the model. We used Equation (5) to calculate the probability of a word being discarded:
$$P(\omega_i) = 1 - \sqrt{\frac{t}{f(\omega_i)}}, \tag{5}$$
where $t$ denotes the threshold, and $f(\omega_i)$ denotes the frequency of word $\omega_i$.
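The pieces described in this subsection can be sketched in a few lines of plain Python. The sketch assumes the usual skip-gram negative sampling form (log-sigmoid of the context-target dot product plus log-sigmoids of the negated context-negative dot products) and the common word2vec subsampling rule; the smoothing value 0.75 and the threshold 1e-5 are typical defaults rather than values reported here, and all function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def context_vector(pivot_vec, doc_vec):
    """Context vector as the element-wise sum of pivot word and document vectors."""
    return [w + d for w, d in zip(pivot_vec, doc_vec)]

def sgns_objective(context, target_vec, negative_vecs):
    """log sigma(c . w_target) + sum_l log sigma(-c . w_negative_l)."""
    value = math.log(sigmoid(dot(context, target_vec)))
    for neg in negative_vecs:
        value += math.log(sigmoid(-dot(context, neg)))
    return value

def discard_probability(freq, threshold=1e-5):
    """Subsampling rule 1 - sqrt(t / f(w)), clipped at 0 for rare words."""
    return max(0.0, 1.0 - math.sqrt(threshold / freq))

def negative_sampling_weights(frequencies, beta=0.75):
    """Raise normalized frequencies to the power beta and renormalize."""
    powered = [f ** beta for f in frequencies]
    total = sum(powered)
    return [p / total for p in powered]

# Hypothetical 2-dimensional vectors.
c = context_vector([0.5, 0.1], [0.2, 0.3])
aligned = sgns_objective(c, [1.0, 1.0], [[-1.0, -1.0]])    # target close to context
confused = sgns_objective(c, [-1.0, -1.0], [[1.0, 1.0]])   # negative close to context
print(aligned > confused)
print(negative_sampling_weights([0.9, 0.1]))
```

The objective value is higher when the target word aligns with the context vector and the negative sample points away from it. Sign conventions differ between implementations, so whether this quantity is maximized or its negation is minimized depends on the training code.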

Loss Function
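The whole loss function was as follows:
$$L = L^d + \sum_{ij} L^{neg}_{ij}, \tag{6}$$
where $L^d$ is the Dirichlet likelihood term over the document topic weights and $L^{neg}_{ij}$ is the SGNS loss for pivot word $j$ and target word $i$.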

BiGRU with Attention Mechanism
In our BiGRULA model for sentiment analysis, we used the BiGRU model with an attention mechanism to build document vectors from a sequence of word vectors, which were then used to classify documents. The attention mechanism has two benefits: firstly, it helps the model achieve better performance; secondly, it provides a mechanism to assign different importance to different words in document classification. Next, we introduce the BiGRU model and the attention mechanism in detail. The architecture of the model is illustrated in Figure 3.

BiGRU
In our model, BiGRU, a bidirectional recurrent neural network model, is used to map a sequence of word vectors of the document to sentiment categories. In BiGRU, a gated recurrent unit (GRU) uses gates to mitigate the vanishing gradient problem and preserve long-distance information. A GRU has two gates: a reset gate and an update gate, as illustrated in Figure 4. The reset gate $r_t$ controls how past information contributes to the candidate state $\tilde{h}_t$; the update gate $z_t$ determines how past information is preserved and how new information is added.
At time $t$, we compute the hidden vector $\overrightarrow{h}_t$ of the forward GRU. Similarly, we compute the hidden vector $\overleftarrow{h}_t$ of the backward GRU. Then, the hidden vector $h_t$ of BiGRU is calculated as follows: ...
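For readers who want to see the gating in action, the following sketch runs a forward and a backward GRU over a short sequence and joins the two hidden states at each position. It uses one common GRU formulation and plain concatenation of the two directions; both are assumptions, since the exact equations are not reproduced in this section. Weight names, sizes, and the random initialization are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def gru_step(x, h_prev, params):
    """One GRU update; params maps names to weight matrices and bias vectors."""
    def gate(name, h_input, activation):
        pre = [a + b + c for a, b, c in zip(matvec(params["W" + name], x),
                                            matvec(params["U" + name], h_input),
                                            params["b" + name])]
        return [activation(v) for v in pre]

    z = gate("z", h_prev, sigmoid)                     # update gate
    r = gate("r", h_prev, sigmoid)                     # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]   # reset gate filters the past state
    h_cand = gate("h", reset_h, math.tanh)             # candidate state
    # Update gate interpolates between the past state and the candidate state.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]

def run_gru(inputs, params, hidden_size):
    h = [0.0] * hidden_size
    states = []
    for x in inputs:
        h = gru_step(x, h, params)
        states.append(h)
    return states

def bigru(inputs, fwd_params, bwd_params, hidden_size):
    """Concatenate forward and backward hidden states at each time step."""
    forward = run_gru(inputs, fwd_params, hidden_size)
    backward = run_gru(inputs[::-1], bwd_params, hidden_size)[::-1]
    return [f + b for f, b in zip(forward, backward)]

def random_params(input_size, hidden_size, rng):
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    params = {}
    for name in "zrh":
        params["W" + name] = mat(hidden_size, input_size)
        params["U" + name] = mat(hidden_size, hidden_size)
        params["b" + name] = [0.0] * hidden_size
    return params

rng = random.Random(0)
sequence = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]  # 5 words, 4-dim vectors
annotations = bigru(sequence, random_params(4, 3, rng), random_params(4, 3, rng), hidden_size=3)
print(len(annotations), len(annotations[0]))
```

Each word annotation has twice the hidden size because it joins the state of the forward pass, which has seen the words up to that position, with the state of the backward pass, which has seen the words after it.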

Attention Mechanism
Since not all words contribute equally to the meaning of the text, we added the attention mechanism to the BiGRU neural network model to emphasize the words that are important to the meaning of the text during sentiment classification. A document vector is then formed from these word vectors, weighted by their importance to the document, and the document is finally classified.
The importance of a word in a document can be computed from the context of the word. The BiGRU model uses information in both forward and backward directions to obtain contextual information, which captures the word connotation. A given text $c$ contains $T$ words; $w_t$ denotes the $t$-th word in the document and $x_t$ denotes the $t$-th word vector, $t \in [1, T]$. The forward GRU reads the text $c$ from $w_1$ to $w_T$, while the backward GRU reads it in reverse. Specifically, we first use a one-layer multilayer perceptron (MLP) to get $u_t$ as the hidden representation of the word annotation $h_t$, and then we measure the importance of the word as the similarity of $u_t$ with a word-level context vector $u_w$, which is randomly initialized. The context vector $u_w$ can be seen as a high-level representation of the informative word, and the value of $u_w$ is updated during the training process. Finally, we get the importance weight $\alpha_t$, normalized by a softmax function.
After that, we compute the document vector $v$ as a weighted sum of the word annotations using Equation (14):
$$v = \sum_{t=1}^{T} \alpha_t h_t. \tag{14}$$
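The attention steps described above (MLP projection, similarity with a context vector, softmax normalization, weighted sum) can be sketched as follows. The tanh activation in the MLP, the dot product as the similarity measure, and the names `W`, `b`, and `u_w` are assumptions in the style of hierarchical attention networks, not details confirmed by this section; the toy values are arbitrary.

```python
import math

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def attention_pool(annotations, W, b, u_w):
    """Return attention weights and the attention-weighted document vector."""
    scores = []
    for h in annotations:
        # Hidden representation of the word annotation (one-layer MLP, tanh assumed).
        u_t = [math.tanh(v + bi) for v, bi in zip(matvec(W, h), b)]
        # Similarity with the word-level context vector (dot product assumed).
        scores.append(sum(a * c for a, c in zip(u_t, u_w)))
    # Softmax over time steps gives the importance weights.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Document vector: weighted sum of the word annotations.
    dim = len(annotations[0])
    doc = [sum(a * h[i] for a, h in zip(alphas, annotations)) for i in range(dim)]
    return alphas, doc

# Hypothetical toy input: 3 word annotations of size 2, attention size 2.
annotations = [[0.1, 0.0], [0.9, 0.8], [0.0, 0.2]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
u_w = [2.0, 2.0]
alphas, doc = attention_pool(annotations, W, b, u_w)
print(alphas.index(max(alphas)))
```

In this toy input the second annotation is most aligned with the context vector, so it receives the largest weight and dominates the document vector. In training, `W`, `b`, and `u_w` would be learned jointly with the BiGRU parameters.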
To further illustrate the performance of lda2vec, Figure 5 shows the confusion matrix with the percentages of samples for each class predicted by Liblinear. Each column of the confusion matrix represents the predicted label (output class), while each row represents the true label (target class). As shown in Figure 5, the largest percentage of true positive classification was 88% for category misc.forsale. The smallest percentage of true positive classification was 68% for talk.religion.misc, while 15% of talk.religion.misc was misclassified as talk.politics.guns.

Dataset and Parameter Settings
We used the IMDB movie review dataset [23] from Scikit-learn to evaluate our sentiment classification model. The dataset had 50,000 reviews, allowing no more than 30 reviews per movie. The whole dataset was split into 25,000 training samples and 25,000 testing samples. There were 25,000 positive reviews and 25,000 negative reviews in this dataset.
In our model, we used GoogleNews-vectors-negative300.bin as the pretrained word vectors in the lda2vec module of BiGRULA. The number of topics for each document was set to 10 after evaluating values from four to 20, where 10 was identified as the best. We considered the 10,000 most frequent words in IMDB, and set the sequence length to 250. In the training of BiGRU with attention, we set the hidden size to 150, the attention size to 50, the batch size to 256, and the keep probability of training samples to 0.8.

Results and Analysis
To evaluate our BiGRULA model, we compared our results with those of various other neural networks. At the same time, we also trained a set of BiGRU models with the attention mechanism using three other common word-vector-encoding methods to evaluate the performance of lda2vec. These models were all trained on the IMDB dataset with the same parameter settings, and the results are presented in Table 2. We found that the topic-enhanced word vectors learned from lda2vec achieved the lowest accuracy, 0.914, among the four word-embedding methods on the training set. However, they achieved the highest accuracy, 0.894, on the test dataset, compared to 0.864, 0.869, and 0.872 for the other three word-embedding methods. When comparing the BiGRU network with other neural networks such as Convolutional Neural Network (CNN), Long Short-term Memory (LSTM), and CNN+LSTM, our BiGRULA achieved a better accuracy of 0.894 on the test dataset than the other three network models, whose accuracies were 0.881, 0.812, and 0.858, respectively.
Table 2 also compares the accuracy values with those of other known machine learning approaches available in the literature [6,24,25] using the IMDB dataset. FPCD feature vectors combined with the generalized TF_IDF vectors + Naïve Bayes (G_TF-IDF + FPCD + NB), Word2vec + K-Nearest Neighbor (Word2vec + KNN), and the frequent, pseudo-consecutive phrase feature with high discriminative ability + Support Vector Machine (FPCD + SVM) achieved the highest accuracy among their respective feature extraction methods, but their accuracy values remained below that of our model. When we compared our model with Support Vector Machine (SVM), Naïve Bayes (NB), and Maximum Entropy (ME), which had the best accuracy values among n-gram methods, and with word2vec + LR, which had the best value among three different features, our model still showed the best accuracy value.

Application of BiGRULA to Sentiment Analysis of Tourism Reviews
Here, we demonstrate the utility of our BiGRULA model in tourism review analysis. We chose the ChnSentiCorp-Htl-unba-10000 hotel review dataset, a set of Chinese hotel reviews collected by Songbo Tan from Ctrip [26]. It had 7000 positive reviews and 3000 negative reviews. Through our model, we extracted and utilized the useful information hidden in the hotel review data and captured customers' sentiment toward the hotels.
Firstly, we used lda2vec to extract the topics of the hotel reviews. We used sgns.Weibo.word, a set of pretrained Chinese word vectors [27]. Common topics of hotel reviews include hotel environment, hygiene, transportation, diet, supporting facilities, price, hotel service, tourism network service, entertainment, and surroundings. Therefore, we set the number of topics to 10 for the BiGRULA model in our experiments. By training the lda2vec model, we acquired the topic-enhanced word vectors and the topics of these hotel reviews after 400 epochs. Furthermore, we estimated the most relevant terms within the selected topic [28]:
$$\mathrm{relevance}(w, t \mid \lambda) = \lambda \log p(w \mid t) + (1 - \lambda)\log\frac{p(w \mid t)}{p(w)}, \tag{17}$$
where $p(w)$ indicates the probability of term $w$, $p(w \mid t)$ indicates the probability of term $w$ under topic $t$, and $\lambda$ determines the weight given to $p(w \mid t)$.
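To make the role of λ concrete, the sketch below ranks two terms under one topic, assuming the widely used relevance definition that mixes log p(w|t) with the log lift, log(p(w|t)/p(w)). The probabilities and the two example terms are hypothetical and are not taken from the hotel review data.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Hypothetical probabilities for two terms under one topic.
terms = {
    "good":      {"p_w_given_t": 0.050, "p_w": 0.060},  # frequent in every topic
    "breakfast": {"p_w_given_t": 0.030, "p_w": 0.005},  # concentrated in this topic
}

def rank(lam):
    """Order the terms by decreasing relevance for a given lambda."""
    return sorted(
        terms,
        key=lambda w: relevance(terms[w]["p_w_given_t"], terms[w]["p_w"], lam),
        reverse=True,
    )

print(rank(1.0))  # ranking by topic-specific probability only
print(rank(0.3))  # ranking that emphasizes lift over raw probability
```

With λ = 1 the ranking follows p(w|t) alone, so globally frequent words lead. Lower values of λ favor terms whose probability under the topic is high relative to their overall probability.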
The topics and their most relevant terms extracted by lda2vec are displayed in Table 3. As the baseline, we also trained an LDA model for 50,000 epochs on the same hotel review dataset. The number of topics was also set to 10, and the results are presented in Table 3. From the results in Table 3, we can observe that the topics extracted by both LDA and lda2vec were incoherent. We suspect that the main reason was that these hotel reviews were much shorter than the other types of documents that LDA is commonly applied to, which makes it hard for the models to extract topics well. However, on close examination, we found that the topics extracted by lda2vec were closer than those of LDA to the common topics of hotel reviews as recognized by humans. These results demonstrate that lda2vec performed better than LDA in topic extraction. In addition, to illustrate how informative the terms extracted by lda2vec were, we computed the term saliency for each term and obtained the most salient terms [29]:
$$\mathrm{saliency}(w) = P(w) \times \mathrm{distinctiveness}(w), \tag{18}$$
where $w$ denotes a given term, and we define the distinctiveness of $w$ as the Kullback-Leibler divergence between $P(T \mid w)$ and $P(T)$:
$$\mathrm{distinctiveness}(w) = \sum_{T} P(T \mid w)\log\frac{P(T \mid w)}{P(T)}, \tag{19}$$
where $P(T \mid w)$ is the conditional probability that term $w$ belongs to topic $T$. This distinctiveness describes how informative the specific term $w$ is for determining the generating topic. If a term occurs in all topics, it tells us little about the document's topical mixture, and the term receives a low distinctiveness score.
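A short numerical sketch shows why a frequent term can still have low saliency. It assumes saliency is the product of the term probability and the distinctiveness defined above; the two-topic setup and all probabilities are hypothetical.

```python
import math

def distinctiveness(p_t_given_w, p_t):
    """Kullback-Leibler divergence between P(T|w) and P(T)."""
    return sum(pw * math.log(pw / pt)
               for pw, pt in zip(p_t_given_w, p_t) if pw > 0)

def saliency(p_w, p_t_given_w, p_t):
    """Term probability times distinctiveness."""
    return p_w * distinctiveness(p_t_given_w, p_t)

p_t = [0.5, 0.5]                                   # two equally likely topics
frequent_generic = saliency(0.10, [0.5, 0.5], p_t)    # common term spread evenly over topics
rare_specific = saliency(0.02, [0.95, 0.05], p_t)     # rarer term tied to one topic
print(frequent_generic, rare_specific)
```

The evenly spread term has zero distinctiveness and therefore zero saliency despite being five times more frequent, which mirrors the observation that a very frequent word is not necessarily a salient one.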
Ranking all the terms by their term saliency, we obtained the top-30 most salient terms, as shown in Figure 6. We can observe that the most salient terms were positive, which shows that the customers' general impression of the hotels was good. The top most salient terms, such as "environment", "price", "shopping", and "transportation", convey topical information. From Figure 6, we can see that frequent terms were not always salient. For example, "good" ranked first in frequency, while its saliency was far below first. In addition, most of the top salient terms can be found among the topic terms extracted by lda2vec, as shown in Table 3, which demonstrates the effectiveness of the lda2vec model.
Secondly, we input the word vectors trained by lda2vec into the BiGRU model with the attention mechanism for sentiment analysis of hotel reviews. In our experiment, we set the sequence length and the hidden size to 100, the attention size to 50, the batch size to 300, and the keep probability of training samples to 0.8. After training for four epochs, our model converged to a high classification accuracy of 93.1% in binary sentiment classification over the hotel reviews.
In order to further observe the process of sentiment classification and demonstrate the effectiveness of our model, we used t-distributed stochastic neighbor embedding (t-SNE), a nonlinear dimensionality reduction algorithm [30], to visualize the results. Figure 7 shows the t-SNE visualization of the BiGRULA document embeddings of the test set over four epochs. We can see that the negative reviews (tagged 0) and positive reviews (tagged 1) were gradually separated over the four epochs. This further demonstrates that our model can differentiate well between the negative and positive emotions in the review comments.

Conclusions and Future Work
In this work, we proposed the BiGRULA model for sentiment classification and applied it to Chinese hotel review analysis. This model is characterized by its synergistic combination of lda2vec, a topic-enhanced word-embedding approach, with a bidirectional recurrent neural network model with an attention mechanism. Through experimental evaluation, we showed that this model can achieve better performance than other popular neural network models such as CNN, LSTM, and CNN+LSTM. Application of our BiGRULA model to hotel review analysis showed that it can extract rich information from the text dataset, which is also closer to the meaning of the text. These features extracted from the text further improve the performance of the subsequent sentiment classification.
Several aspects of our BiGRULA model can be further improved. We note that the BiGRULA model only has a word-level attention mechanism, which may limit the training ability of the model. This issue may be addressed by the hierarchical attention network (HAN) proposed by Yang [31], which builds the representation of a sentence using a word-level attention mechanism and builds a document classifier using a sentence-level attention mechanism. In the near future, we aim to add sentence-level attention to our model.
Our study has some practical implications and applications. The model can extract topics from online hotel reviews, which gives hoteliers insight into these reviews and captures different determinants of guest satisfaction. This allows them to realign their strategies in service and product development, for example by defining meaningful hotel competitive sets that better reflect guests' perspectives. Also, our BiGRULA model can analyze consumers' sentiment and satisfaction, which can effectively help hoteliers evaluate the performance of the hotel operation and further formulate their strategies in the marketplace. Our model proved effective in sentiment prediction for Chinese online hotel reviews. Our algorithm can also be used by hotel recommendation websites or booking platforms such as TripAdvisor and Ctrip to automatically rank hotels by sentiment analysis of their online reviews.

Figure 1. The bidirectional gated recurrent unit neural network model (BiGRULA) framework for sentiment analysis.

Figure 3. Workflow of BiGRU neural network model with attention mechanism.

Figure 5. Confusion matrix for multi-class classification with lda2vec word embedding.

Figure 6. Top-30 most salient terms from top to bottom.

Figure 7. Evolution of the separation of positive and negative reviews during the training process.

Table 2. Comparison of models for sentiment classification.

Table 3. Comparison of topic terms extracted by LDA and lda2vec.