An Effective ELECTRA-Based Pipeline for Sentiment Analysis of Tourist Attraction Reviews

Abstract: In the era of information explosion, it is difficult for people to decide on a tourist destination quickly. Online travel review texts provide valuable references and suggestions to assist in decision making. However, tourist attraction reviews are primarily informal and noisy. Most works in this field rely on shallow machine learning models or non-pretrained deep learning models, which struggle to produce satisfactory classification results. To solve this issue, this paper proposes a pipeline model. In the first step, we preprocess tourist attraction reviews by performing stopword removal, special character removal, redundancy deletion and negation substitution to reduce noise. Then, we propose an ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) classifier for sentiment analysis of tourist attraction reviews. Finally, we compare our pipeline model with several representative deep text classification models. Extensive experiments demonstrate the effectiveness of our approach to sentiment analysis of tourist attraction reviews. We not only provide a high-quality dataset of tourist attraction reviews, but our work can also expand and promote the development of sentiment analysis in other domains.


Introduction
In recent years, review texts have provided a novel source of data for travel research. Review texts contain insightful feedback that is spontaneously provided by users. Online travel review texts are naturally considered one of the most essential sources of user opinion. It is widely understood that reputation plays an important role in informing and affecting user decisions. User-generated Internet content, such as reviews, facilitates this process. Therefore, it is very important to analyze user opinions from data such as online reviews. The growing number of user reviews is the most readily available source of the opinions of crowds. However, the analysis of such opinions from reviews, especially from websites with a large number of users worldwide, is challenging. Sentiment analysis has been introduced to discover knowledge through user reviews. Sentiment analysis of review texts is a rapidly growing field of study and application, and has been applied in the domain of tourism as well. However, compared to many other domains of sentiment analysis, online travel reviews are usually short texts written informally, meaning writers use slang, misspelled words and emoticons. Travel is a low-frequency activity, and travel frequency among users approximates Zipf's law from a statistical perspective. The sparsity of noisy short texts, coupled with uneven sentiment distribution, makes it difficult to obtain ideal sentiment analysis results. Although several works have recognized these problems and treated the sentiment analysis of travel review texts as a domain-specific classification problem [1][2][3][4], they use either shallow machine learning models that lack the ability to extract deep semantic features or non-pretrained deep learning models that depend heavily on a large amount of labeled text. It is still possible to improve the classification results of sentiment analysis of travel review texts.
Our research objectives mainly focus on the following two aspects: First, we plan to carry out a series of pre-processing procedures to transform tourist attraction reviews into text that is more appropriate for learning emotional characteristics to ensure the performance gain of learning models. Second, we aim to use a pre-trained model, namely, ELECTRA, as a smart classifier to work on a sentiment analysis task based on tourist attraction reviews together with pre-processing procedures. This is due to ELECTRA's more efficient pre-training process and better performance on small data sets. To verify our approach, we build an expert-annotated review dataset in Chinese related to tourist attractions.
Overall, we provide resources and a benchmark model for those who use NLP technology to perform sentiment analysis of tourist attraction reviews.
In summary, this paper has the following contributions:

• We propose a pre-trained pipeline model for the sentiment analysis of tourist attraction reviews, which exploits the gains of pre-trained language models and outperforms other baseline models.

• We develop annotation specifications and manually construct a Chinese tourist attraction review dataset to fill the research gap.

• We conduct a detailed comparison between our model and other baseline models in terms of performance evaluation and model evaluation, with an additional ablation experiment. The discoveries from these analyses can promote the development of sentiment analysis of other reviews.

Travel Review Sentiment Analysis
Text sentiment analysis (TSA) is the process of extracting users' opinions, sentiments and demands from unstructured subjective texts in a specific domain and distinguishing their polarity. Existing TSA methods roughly fall into three main categories: sentiment lexicon and rule-based methods [5][6][7]; traditional machine learning-based methods [8][9][10][11]; and deep learning-based methods [12][13][14][15][16][17][18][19][20]. Category I has proved to perform poorly when the texts are rich in new words, contextual words or multilingual words. Category II focuses on extracting sentiment features and combining different classifiers (e.g., KNN, SVM and Naïve Bayes). Without fully utilizing the contextual information of the text, their classification accuracy is affected to a certain extent. In order to obtain better classification results, category III introduces deep neural networks, such as CNN, RNN and LSTM networks, attention networks and Transformer networks, to automatically extract sentiment features and make good use of contextual semantic information.
In recent years, research on sentiment analysis of review texts in tourism has received more and more attention. It is driven by the goal of helping tourism stakeholders better understand and more quickly grasp relevant tourism information, providing a reliable basis for their decision-making process. Sentiment analysis is considered a key step in restaurant or hotel recommendations for tourists in many works [21][22][23]. However, only a handful of studies on the sentiment analysis of attraction reviews can be found [24][25][26]. It is worth noting that there is no existing work applying a pre-trained model for sentiment analysis of tourist attraction review texts. Pre-trained language models (PLMs) can achieve comparable or even SOTA (state-of-the-art) results on NLP tasks with a small number of supervised corpora.

Pre-Trained Language Models
Recently, PLMs, which consist of an extensive neural network previously trained on a large amount of unlabeled data and fine-tuned on downstream tasks, have achieved outstanding performance in several natural language understanding tasks. In the text classification task, Howard and Ruder [27] put forward Universal Language Model Fine-Tuning (ULMFiT) and achieved SOTA results. The encoder part of an encoder-decoder architecture based on deep Transformers, such as the Generative Pre-trained Transformer (GPT) [28] and BERT [29], is nowadays one of the most popular task-specific models. In particular, many works [15,16,20] proposed using BERT and its variants for sentiment analysis, with excellent results. However, BERT and its variants have some fatal flaws: slow convergence of model training, high computational cost, and inconsistency between inputs during pre-training and fine-tuning. To address these issues, ELECTRA [30], which uses a different pre-training method in which the model acts as a discriminator rather than a generator, has been proposed. In some tasks, such as similarity comparison [31] and sequence annotation [32], ELECTRA has been shown to outperform BERT.

Text Pre-Processing
Text pre-processing, especially for informal texts, is an integral step in sentiment analysis and PLMs. The pre-processing that may be involved includes tokenization, part-of-speech tagging, stemming, lemmatization, text cleaning [33], text clarity [34], tagging [35], lexical-grammatical checking, spellchecking, stopword removal and negation handling [36]. Pre-processing has a direct impact on the performance of sentiment classification. For instance, Reference [33] indicates that the inappropriate processing of negations leads to biases and misclassification of sentiments. Reference [34] proposes cleaning and normalizing data, negation handling and intensification handling to improve sentiment classification performance. Several studies have shown that pre-processing also contributes to the performance of pre-trained models. Reference [35] leverages lexical simplification to effectively improve the performance of PLMs in text classification. Pre-processing has been verified to address the limitations of word embeddings for affective tasks [36]. Nowadays, the pre-processing for sentiment analysis mainly focuses on English text, and there are only a few works for Chinese text. Reference [37] proposes a method based on Chinese characters rather than words to address the problem of requiring complex pre-processing steps in Chinese text sentiment analysis. Reference [38] studies text pre-processing in Chinese, such as document segmentation, word segmentation and text representation, but only to unify the format of documents before text classification.

Methodology
In this section, we introduce our proposed approach to classify tourist attraction reviews and perform sentiment analysis, given the large number of tourist attraction reviews with significantly different sentiment tendencies, the presence of much meaningless or inauthentic content, and the extensive use of informal terms. These characteristics mean the task of tourist attraction review classification requires necessary pre-processing and efficient classifiers. Our approach is essentially a two-step ELECTRA-based pipeline, as shown in Figure 1. In detail, the first step of the pipeline applies a series of pre-processing procedures to convert travel reviews from Ctrip (https://www.ctrip.com/, accessed on 15 July 2022) into filtered text, while the second step feeds the processed data into a classification system based on an ELECTRA language model that has been pre-trained on plain text corpora. In particular, our proposed pre-processing procedures are outlined and the architecture of the adopted classification system is explained.

Pre-Processing Procedures
Collecting raw travel reviews from the Ctrip Travel Website (the largest Online Travel Agency in China) using a Web crawler generally results in a very noisy dataset due to the spontaneity and creativity of the posted comments. Since tourist attraction reviews reflect the satisfaction of tourists and their evaluations of the quality of services, perceptions of the image of attractions or information provision, they are filled with many modal words and emotional words, coupled with other noisy sources, such as phone numbers, amounts of money, times, dates, addresses and questions.

Our pre-processing procedures include the following four sub-procedures:

Stopword Removal
Stopwords (abbreviated as Stop) are the most common words, typically filtered out before the classification task. Thus, we remove stopwords from the reviews. Here, we use the "remove_stopwords" method in the Gensim (https://radimrehurek.com/gensim/index.html, accessed on 24 July 2022) library with a stopword dictionary that integrates Chinese words from the HIT (Harbin Institute of Technology), Baidu and SCU (Sichuan University) thesauruses. Given that the removal of stopwords has a significant impact on the set of feature vectors required for classification and on the classification effect [39], an optimal selection of stopwords is required. It is not preferable to have a stopword dictionary with as many entries as possible; rather, it is better to have a targeted dictionary. Since exclamations, onomatopoeia and pronouns often carry a strong emotional tendency in combination with other words in emotional texts, all words other than punctuation and non-semantic words are kept out of the stopword dictionary as much as possible. The final stopword dictionary contains 1336 stopwords. We will make the final stopword dictionary publicly accessible for the research community.
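As a minimal illustration, the filtering step can be sketched in pure Python (the stopword set below is a tiny hypothetical stand-in for the 1336-word dictionary; Gensim's `remove_stopwords` performs the equivalent filtering over its own dictionary):

```python
# Hypothetical subset of the merged HIT/Baidu/SCU stopword dictionary.
STOPWORDS = {"的", "了", "呢", "吧"}

def remove_stopwords(tokens):
    """Drop tokens found in the stopword dictionary, keeping order."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = ["风景", "真", "的", "很", "美", "了"]
print(remove_stopwords(tokens))  # ['风景', '真', '很', '美']
```

Keeping the dictionary small and targeted, as argued above, means emotionally loaded function words survive this filter.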

Special Character Removal
Special characters (abbreviated as Symbol), such as periods, commas, parentheses and brackets ((), [], {}), should be removed in order to eliminate differences when assigning polarity. Here, we use the "remove_stopwords" method in the Gensim library with a symbol dictionary that integrates special characters from the HIT thesaurus. Gensim is an open-source software library that uses modern statistical machine learning and is designed to process large collections of text using data streaming and incremental online algorithms, unlike most other machine learning packages, which target only in-memory processing. The final symbol dictionary contains 263 special characters.
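A minimal sketch of this step, using a regex character class as a stand-in for the 263-entry symbol dictionary:

```python
import re

# Hypothetical subset of the symbol dictionary, expressed as a
# regex character class; the real dictionary has 263 entries.
SYMBOL_RE = re.compile(r"[.,!?;:(){}\[\]“”…~]+")

def remove_symbols(text):
    """Strip special characters so they cannot bias polarity assignment."""
    return SYMBOL_RE.sub("", text)

print(remove_symbols("好玩!!(强烈推荐)…"))  # 好玩强烈推荐
```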

Redundancy Deletion
Repeated statements may disrupt the polarity distribution, and phone numbers, amounts of money, times, dates and questions do not contribute to sentiment tendencies. Thus, we delete these statements and phrases (abbreviated as Deletion). We identify redundant statements in the text using the NLTK RegexpTokenizer and delete them. Several regular-expression matching examples are illustrated in Table A1 in Appendix A.
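The deletion step can be sketched with a few illustrative patterns (hypothetical regexes in the spirit of Table A1, not the exact ones used in the pipeline):

```python
import re

# Illustrative patterns for sentiment-irrelevant spans; the actual
# patterns used with NLTK's RegexpTokenizer are listed in Table A1.
PATTERNS = [
    re.compile(r"1[3-9]\d{9}"),                # mainland-China mobile number
    re.compile(r"\d+(\.\d+)?元"),              # amount of money, e.g. 120元
    re.compile(r"\d{4}年\d{1,2}月\d{1,2}日"),  # date, e.g. 2022年7月15日
]

def delete_redundancy(text):
    """Remove spans matching any redundancy pattern."""
    for pat in PATTERNS:
        text = pat.sub("", text)
    return text

print(delete_redundancy("门票120元, 电话13812345678"))  # 门票, 电话
```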

Negation Substitution
Negation substitution (abbreviated as Negation) plays a critical role, as negation words invert word or sentence polarity in sentiment analysis. Thus, we substitute the negation word and the negated word with the latter's antonym. First, we identify the negation words in the tokenized text using a negation dictionary (containing 243 negation words). Then, the antonym of the token following the negation word is looked up in an antonym dictionary (containing 18,797 pairs). If an antonym is found, the negation word and the negated word are replaced with the antonym. For example, we replace 不开心 (not happy) with 沮丧 (depressed).
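A sketch of the substitution logic with toy dictionaries (the real negation dictionary holds 243 words and the antonym dictionary 18,797 pairs):

```python
# Toy stand-ins for the negation and antonym dictionaries.
NEGATIONS = {"不", "没", "没有"}
ANTONYMS = {"开心": "沮丧", "喜欢": "讨厌"}

def substitute_negation(tokens):
    """Replace a negation word plus the token it negates with the
    latter's antonym; leave everything else unchanged."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens) and tokens[i + 1] in ANTONYMS:
            out.append(ANTONYMS[tokens[i + 1]])  # 不 + 开心 -> 沮丧
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(substitute_negation(["我", "不", "开心"]))  # ['我', '沮丧']
```

When no antonym is found, the pair is left as-is, matching the rule described above.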

ELECTRA System Architecture
While BERT and its variants produce excellent results on downstream NLP tasks, they require a large amount of computation to be effective. This is because such masked language models (MLMs) mask a few words randomly during training and predict only those very limited words. Furthermore, because of the arbitrary choice of masked tokens, it is challenging to learn as many meaningful tokens as possible, such as emotional words and opinion words, for the sentiment analysis task. As an alternative, ELECTRA proposes a more efficient pre-training task that compensates well for these shortcomings, so we introduce it as the sentiment classification model for attraction reviews to achieve more satisfying classification performance. The ELECTRA system architecture is shown in Figure 2.

To go into more detail, just like Generative Adversarial Networks (GANs), the ELECTRA architecture consists of two networks: a generator and a discriminator. Both use transformer-based encoding networks to obtain the vector representation of the input word sequence. For the Chinese dataset, as Chinese characters differ from English, which is formed of alphabet-like symbols, the tokenizer uses a traditional Chinese Word Segmentation (CWS) tool to split the text into words rather than small fragments. In this way, whole-word masking in Chinese can be adopted to mask whole words instead of individual Chinese characters.
In the generator, the goal is to train a masked language model. Its structure is similar to BERT; i.e., given an initial input sequence x = {我 (I), 不 (not), 爱 (like), 吃 (eat), 苹果 (apple)}, the Chinese words 爱 (like) and 苹果 (apple) are first replaced by [MASK] according to a certain percentage to obtain the generator's input. The process can be formulated as:

x^masked = REPLACE(x, index, [MASK]),

where index = [id_1, ..., id_k] is the index sequence of the selected positions. Then, a vector representation is obtained through a generative network, typically a small MLM. Followed by a softmax layer, a sample word x̂_i is predicted for each masked position in the generator's input:

x̂_i ~ p_G(x_i | x^masked), i ∈ index.

The objective of training is to maximize the likelihood of the masked words. The prediction results replace the original masked words, e.g., the Chinese words 爱 (like) and 苹果 (apple) are replaced by 爱 (like) and 梨 (pear), respectively:

x^corrupt = REPLACE(x, index, x̂),

where index is the same as above.
In the discriminator, a new pre-training task known as "Replaced Token Detection (RTD)" is applied. More specifically, a discriminative model is trained to predict whether the word at each position of the discriminator's input sequence has been replaced by the generator, with "Original" or "Replaced" as the classification result. Here, only the word 苹果 (apple) can be found to have changed.
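The RTD targets themselves amount to a position-wise comparison of the original and corrupted sequences, which can be sketched as:

```python
def rtd_labels(original, corrupted):
    """Label each position 'Original' if the token is unchanged and
    'Replaced' if the generator substituted it."""
    return ["Original" if o == c else "Replaced"
            for o, c in zip(original, corrupted)]

orig    = ["我", "不", "爱", "吃", "苹果"]
corrupt = ["我", "不", "爱", "吃", "梨"]
print(rtd_labels(orig, corrupt))
# ['Original', 'Original', 'Original', 'Original', 'Replaced']
```

Note that 爱 (like), although masked, was predicted back to itself by the generator, so its label is "Original"; only 苹果 (apple) → 梨 (pear) is "Replaced".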
The generator's loss function L_MLM(x, θ_G) can be formulated as:

L_MLM(x, θ_G) = E[ Σ_{i ∈ index} −log p_G(x_i | x^masked) ],

and the discriminator's loss function L_Disc(x, θ_D) can be formulated as:

L_Disc(x, θ_D) = E[ Σ_{t=1}^{n} −1(x_t^corrupt = x_t) log D(x^corrupt, t) − 1(x_t^corrupt ≠ x_t) log(1 − D(x^corrupt, t)) ].

The final loss function is the weighted sum of the two loss functions:

min_{θ_G, θ_D} Σ_{x ∈ X} L_MLM(x, θ_G) + λ L_Disc(x, θ_D),

where λ is a weighting factor and X is a large raw text corpus. After pre-training, we use the discriminator to fine-tune on the sentiment analysis task. The process of replacement identification using a discriminator converts a prediction problem into a binary classification one, characterized by allowing the words in all positions to be predicted, resulting in increased efficiency and faster convergence. This new pre-training task is more effective than MLM because the model learns from all input words, not just from a small subset of masked ones. The contextual representation learned by ELECTRA is substantially better than that learned by an MLM such as BERT under the same model size, data and computational conditions, with particularly large gains on small datasets.

Dataset
In recent years, a large number of datasets have become publicly available to facilitate the study of sentiment analysis. Most of these datasets are movie or product reviews, such as Yelp (https://www.kaggle.com/yelp-dataset/yelp-dataset, accessed on 16 July 2022), Amazon (https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products, accessed on 16 July 2022), IMDb (https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews, accessed on 16 July 2022) and SST (https://www.kaggle.com/atulanandjha/stanford-sentiment-treebank-v2-sst2, accessed on 16 July 2022). The tourism-related datasets only cover restaurants or hotels, mostly in English, such as SemEval2014-ABSA-Restaurant-Reviews (https://alt.qcri.org/semeval2014/task4, accessed on 16 July 2022) and ChnSentiCorp-htl (https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus, accessed on 16 July 2022). Therefore, we have constructed a sentiment dataset of tourist attraction reviews in Chinese from Ctrip. Our dataset is called SenTARev (Sentiment analysis for Tourist Attraction Reviews). In the adopted labeling scheme, there are two types of labels, positive (PL) and negative (NL). There are three types of affective outcomes: negative (Neg), positive (Pos) and neutral (Neu). Considering labeling as a classification task, it is defined as follows: if positive sentiment is detected, PL outputs a value of 1 as a positive class and 0 otherwise; if negative sentiment is detected, NL outputs a value of 1 as a negative class and 0 otherwise. Thus, for PL, a value of 1 means a positive or neutral sentiment and a value of 0 means a neutral or negative sentiment. For NL, a value of 1 implies a negative or neutral sentiment and a value of 0 implies a neutral or positive sentiment. We organize annotators to perform polarity annotation and data inspection on the raw data, which is finally used as a supervised corpus for fine-tuning the ELECTRA model.
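One consistent reading of this (PL, NL) scheme maps the label pair to the three affective outcomes as sketched below (treating PL=0, NL=0 as invalid is our assumption; the scheme does not define that combination):

```python
def outcome(pl, nl):
    """Map a (PL, NL) label pair to one of the three affective outcomes."""
    if pl == 1 and nl == 0:
        return "Pos"   # positive detected, no negative
    if pl == 0 and nl == 1:
        return "Neg"   # negative detected, no positive
    if pl == 1 and nl == 1:
        return "Neu"   # both labels admit a neutral reading
    raise ValueError("PL=0, NL=0 is not covered by the labeling scheme")

print(outcome(1, 0))  # Pos
print(outcome(1, 1))  # Neu
```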
In the quality-control process of human-annotated data, each attraction was assigned an annotator and an inspector. We set a unified annotation specification before annotation to ensure the consistency of the data. Every annotator had to perform real-time inspection, and every inspector had to complete full sample inspection and sampling inspection. Table 1 illustrates our annotation scheme and SenTARev's label distribution.

Baseline
To evaluate the effectiveness of the proposed approach, we compared it with several baselines. Each baseline is a pipeline model that includes the same pre-processing procedures mentioned above together with one of the following classifiers.

Evaluation Metrics
In this paper, we adopt precision (P), recall (R) and F1-score (F) as the main evaluation metrics for sentiment analysis performance. They are respectively calculated as:

P_p = TP_p / (TP_p + FP_p), R_p = TP_p / (TP_p + FN_p), F_p = 2 P_p R_p / (P_p + R_p),

where p denotes the polarity, i.e., Neu, Neg or Pos. We also utilize the macro average (m_avg) and weighted average (w_avg) as comprehensive evaluation metrics of the deep learning model. These are calculated as:

m_avg X = (1/3) Σ_p X_p, w_avg X = Σ_p sup_p X_p / Σ_p sup_p,

where X denotes P, R or F, and sup_p denotes the number of support samples of polarity p.
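These metrics can be computed with a short stdlib sketch (illustrative only; the function names are ours):

```python
from collections import Counter

def prf(y_true, y_pred, label):
    """Per-class precision, recall and F1 for one polarity label."""
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    P = tp / (tp + fp) if tp + fp else 0.0
    R = tp / (tp + fn) if tp + fn else 0.0
    F = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F

def averages(y_true, y_pred, labels):
    """Macro average (unweighted mean over classes) and weighted
    average (support-weighted mean) of (P, R, F)."""
    sup = Counter(y_true)
    scores = {l: prf(y_true, y_pred, l) for l in labels}
    m_avg = [sum(s[i] for s in scores.values()) / len(labels) for i in range(3)]
    total = sum(sup[l] for l in labels)
    w_avg = [sum(scores[l][i] * sup[l] for l in labels) / total for i in range(3)]
    return m_avg, w_avg

y_true = ["Pos", "Neg", "Neu", "Pos"]
y_pred = ["Pos", "Neg", "Pos", "Pos"]
print(averages(y_true, y_pred, ["Pos", "Neg", "Neu"]))
```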

Model Training
Under the Hugging Face framework, the ELECTRA model used is made available by the Joint Laboratory of HIT and iFLYTEK Research. The Chinese ELECTRA is pre-trained on two corpora: the first corpus is the source data consisting of the Chinese Wikipedia dump, while the second corpus is further extended with data from encyclopedia, news and question-answering websites; it has 5.4 billion words and is over ten times bigger than the Chinese Wikipedia. Fine-tuning of the model was performed using labelled reviews in the training set of the SenTARev dataset. Categorical cross entropy was used as the loss function during training, and the fully connected classification layer was learned correspondingly. For the fine-tuned ELECTRA, the hyper-parameters used are shown in Table A2 in Appendix A.

Result and Discussion
The comparison results of the performance evaluation on the SenTARev dataset are presented in Table 2. For each method, the results are obtained with the best model. As shown in Table 2, firstly, we can see that the classification performance of our ELECTRA-based pipeline model is better than that of all selected classical text classification models in precision, recall and F1-score by a significant margin. Secondly, we observe that, with complex pre-training targets and large model parameters, large-scale PLMs can effectively capture knowledge from large amounts of labeled and unlabeled data. By storing knowledge in a large number of parameters and fine-tuning it for specific tasks, the rich knowledge implicitly encoded in those parameters can benefit the sentiment analysis task. Hence, from Table 2, we can observe that the pre-trained models outperform the other models. Thirdly, our ELECTRA-based pipeline model also achieves better performance than the other pre-trained models on all measurements. This is due to its more efficient pre-training task, which makes our ELECTRA-based text classification model learn deeper contextual sentiment features in reviews than other pre-trained based models. In terms of text classification, arbitrary masking patterns make it difficult for BERT-based models to learn all meaningful information about sentiment tendencies. This is especially true for the RoBERTa-based model, whose dynamic masking further undermines the effectiveness of learning sentiment tokens, making its performance even worse than that of CNN/RNN models. These patterns further blur the boundary between biased and unbiased polarity during training and increase the difficulty of classification. The subtle masking design enables our ELECTRA-based model to obtain better generalization performance on text classification.

Ablation Studies
For further evaluation, an ablation study was performed. This involved firstly discarding all pre-processing procedures, then enabling only one pre-processing sub-procedure at a time, then enabling pairs of sub-procedures, and finally disabling one sub-procedure while retaining the other three. The results of the ablation studies are detailed in Table 3. A primary goal of this work is to identify the most effective sub-procedure for PLMs in sentiment analysis. Observing the results of the individual sub-procedures on the SenTARev dataset, it is worth noting that even a single pre-processing sub-procedure can bring improvements. Among the four pre-processing sub-procedures, negation substitution appears to be the most effective, verifying its importance in sentiment classification. We find that redundancy deletion and special character removal also contribute to improvement, while stopword removal has minimal impact. Performance with negation pre-processing is generally better than without it when the same number of pre-processing sub-procedures is included. Although it does not present a multiplicative or proportional relationship, performance improves as the number of pre-processing sub-procedures increases. We note that the best performance comes from combining all the pre-processing procedures.

Conclusions
In this paper, we set out to improve sentiment analysis on tourist review data. Firstly, we constructed a Chinese tourist attraction review dataset, SenTARev, from Ctrip. Then, we proposed a two-step pipeline approach for the sentiment analysis of tourist attraction reviews. We found that an ELECTRA-based pipeline model is highly efficient for the sentiment analysis of tourist attraction reviews in terms of model performance. This represented, on average, a 6.33 percent improvement in m_avg F and a 1.5 percent improvement in w_avg F. In addition, we also found that pre-processing can further enhance pre-trained sentiment classification models, especially negation substitution. Our work can expand and promote the development of sentiment analysis in other domains. In the future, we will explore pre-trained models with travel knowledge enhancement to improve their ability to understand and represent domain knowledge. Furthermore, we will integrate pre-processing with deep learning models through transformation and filtering with the chi-squared method to further improve the efficiency of Chinese sentiment analysis.