Anomaly Detection on Natural Language Processing to Improve Predictions on Tourist Preferences

Argumentation-based dialogue models have been shown to be appropriate for decision contexts in which the aim is to overcome the lack of interaction between decision-makers, either because they are dispersed, because there are too many of them, or because they are simply not known. However, to support decision processes with argumentation-based dialogue models, it is necessary to know certain aspects that are specific to each decision-maker, such as preferences, interests, and limitations, among others. Failure to obtain this knowledge could ruin the model's success. In this work, we sought to facilitate the information acquisition process by studying strategies to automatically predict tourists' preferences (ratings) in relation to points of interest based on their reviews. We explored different Machine Learning methods to predict users' ratings. We used Natural Language Processing strategies to predict whether a review is positive or negative and the rating assigned by users on a scale of 1 to 5. We then applied supervised methods such as Logistic Regression, Random Forest, Decision Trees, K-Nearest Neighbors, and Recurrent Neural Networks to determine whether a tourist likes or dislikes a given point of interest. We also used a distinctive approach in this field through unsupervised techniques for anomaly detection problems. The goal was to improve the supervised model in identifying only those tourists who truly like or dislike a particular point of interest; the main objective is not to identify everyone, but fundamentally not to fail those who are identified in those conditions. The experiments showed that the developed models could predict with high accuracy whether a review is positive or negative but have some difficulty in accurately predicting the rating assigned by users. The unsupervised Local Outlier Factor method improved the results, reducing Logistic Regression false positives at the cost of increasing false negatives.


Introduction
Argumentation-based dialogue models are extremely useful in contexts where a group of agents seeks solutions to complex decision problems using negotiation and deliberation mechanisms [1][2][3]. In addition, they allow human decision-makers to understand the reasons that led to a given decision (enhancing the acceptance of decisions) and to define mechanisms for intelligent explanations [4,5]. These models receive the decision-makers' preferences as input (for instance, regarding criteria and alternatives), which are typically used to model the agents that represent them [6]. However, obtaining these preferences is not a simple process: first, in the contemporary and highly dynamic world in which we live, decision-makers are less and less willing to answer questionnaires and, second, it is sometimes difficult to express preferences through questionnaires [7,8]. To facilitate this task, strategies that aim to automatically identify users' preferences have been proposed. One of these strategies consists in using Machine Learning (ML) algorithms and Natural Language Processing (NLP) to automatically extract users' opinions from a text corpus through techniques such as text wrangling and pre-processing, named entity recognition, and sentiment analysis [9,10]. However, many algorithms and strategies can be applied. Therefore, specific procedures must be developed according to the application topic in order to achieve the best results.
In this work, we studied the problem described above under the topic of group recommendation systems, more specifically in the context of tourism, in which there has been increased interest in the development of technologies capable of making recommendations according to the interests of each group member. We assumed that users/tourists habitually express their opinions regarding points of interest (POI) on social networks (such as TripAdvisor, Facebook, or Booking.com, accessed on 3 January 2022) and we sought to take advantage of this to automatically predict their preferences non-intrusively. For this, we used a public dataset (available on Kaggle) and applied the development lifecycle for intelligent systems using concepts of NLP defined in [11]. More specifically, we developed forecast models using five supervised ML algorithms (Logistic Regression [12], Random Forest [13], Decision Trees [14], K-Nearest Neighbors [15], and Long Short-Term Memory [16]), using them both as classification and regression methods. We also applied three unsupervised ML algorithms used for anomaly detection (One-Class Nearest Neighbor (OCKNN) [17], Isolation Forest [18], and Local Outlier Factor (LOF) [19]) to improve the supervised ML methods' results. In addition, we used NLP to extract more knowledge from the users' reviews and several Sentiment Analysis libraries (Vader, TextBlob, and Flair) to find those that best fit this context.
The rest of the paper is organized as follows: Section 2 reviews state-of-the-art works in the field of recommendation systems. Section 3 presents the methodology. In the last section, some conclusions are put forward, alongside suggestions for future work.

Related Work
Several works have been proposed for the development of recommender systems in the tourism context. Nilashi et al. [20] applied multi-criteria ratings in developing a new method for hotel recommendations in e-tourism platforms. The authors used supervised and unsupervised ML techniques to analyze the customers' online reviews. Cenni and Goethals [21] examined 100 reviews written in English, Dutch, and Italian and analyzed three features, namely the types of speech acts that users used, the specific topics that they evaluated, and the extent to which they up-scaled or down-scaled their evaluative statements. The authors found a general trend towards similarity between the three language user groups under examination.
Valdivia et al. [22] propose TripAdvisor as a source of data for sentiment analysis tasks. The authors study how well users' sentiments match the output of automatic sentiment-detection algorithms and describe some of the challenges of sentiment analysis on TripAdvisor. In [23], the authors present a review focused on multi-criteria review-based recommender systems (RS), in which they explain the elements of user reviews in detail and how these can be integrated into the RS to help develop its criteria and enhance its performance. Based on the survey, the authors presented four future trends to support researchers who wish to pursue studies in this field.
The work of Kbaier et al. [24] focused on building personalized RS in the tourism field. They proposed a hybrid RS that combines the three best-known recommender methods: collaborative filtering (CF), content-based filtering (CB), and demographic filtering (DF). To implement these recommender methods, the authors applied different ML algorithms: K-Nearest Neighbors (K-NN) for both CB and CF, and a Decision Tree for DF. They conducted an extensive experimental study based on different evaluation metrics using data extracted from TripAdvisor.
Logesh et al. [25] proposed an Activity and Behavior-Induced Personalized RS (ABiPRS) as a hybrid approach to predict persuasive POI recommendations. Their RS is designed to support travelling users by providing a compelling list of POIs as recommendations. As an extension, the authors designed a new group recommendation model to meet the requirements of a group of users by exploiting the relationships between them. They also developed a novel hybridization approach for aggregating recommendations from multiple RSs to improve the effectiveness of recommendations. The authors evaluated their approach on real-time large-scale datasets from Yelp and TripAdvisor.
In [26], the authors studied users' evaluations of serendipity in urban recommender systems through a survey among 1641 citizens. They studied which characteristics of recommended items contribute to serendipitous experiences and to what extent this increases user satisfaction and conversion. Their results are aligned with findings in other application domains in the sense that there is a strong relation between the relevance and novelty of recommendations and the corresponding experienced serendipity. They found that serendipitous recommendations increase the chance of users following up on these recommendations.

Methods
In this section, we describe the methodology in detail. We start by clarifying the problem that we intend to address. Next, we justify the choice of the dataset and carry out its analysis, covering preprocessing and feature engineering. Finally, we present the computational techniques used and describe the tests and results obtained.

Understanding the Problem Statement
The problem we want to overcome is to predict, non-intrusively and with a high level of accuracy, how much a tourist likes or dislikes a given POI. Subsequently, we intend to use the predicted preferences to model intelligent agents that represent tourists in a group recommendation system, who seek to jointly decide (using an argumentation-based dialogue model) and recommend to the group of tourists the set of POIs to visit. For this, we chose to use the reviews that tourists wrote on social media (TripAdvisor) to predict their preferences.

Collecting Dataset
The dataset was selected based on two criteria: it needed to be public and it should best represent the context in which this work is intended to be applied. We therefore selected a dataset available on Kaggle [27] that is composed of more than 20 thousand hotel reviews extracted from TripAdvisor. Because many works in Kaggle's repository already use this dataset, we knew beforehand that it would be very difficult to obtain good results: for example, when predicting 5 classes, the accuracy reported by the vast majority of them varies between 30% and 60%.

Analyzing Dataset, Preprocessing, and Feature Engineering
The dataset is composed of the attributes "Review" and "Rating". Table 1 shows some examples of the type of records that make up the dataset. The "Rating" is between 1 and 5, where 1 is the worst and 5 is the best possible evaluation. The dataset consists of 20,491 records and 2 attributes, and it has no missing data. Figure 1 shows the distribution by "Rating". As can be seen, the dataset is quite unbalanced, with many more records with a positive evaluation (Rating 5: 9054; Rating 4: 6039) than with a negative evaluation (Rating 2: 1793; Rating 1: 1421). Furthermore, the number of records with an intermediate evaluation (Rating 3: 2184) is also much lower than the number of records with a positive evaluation.
To study possible correlations between the "Review" and the assigned "Rating", we created 3 new attributes: "Word_Count", "Char_Count", and "Average_Word_Length". The "Word_Count" is the number of words used in the "Review", the "Char_Count" is the number of characters used in the "Review", and the "Average_Word_Length" is the average length of the words used in the "Review". The "Average_Word_Length" did not show statistical relevance, but we found that the most negative reviews tended to contain more words than the most positive reviews (Figure 2), which led us to believe that the attribute "Word_Count" would be very relevant for the creation of the model.
In the next step, we analyzed which words were most used in the reviews overall, in negative reviews (Rating 1 and 2), and in positive reviews (Rating 3, 4, and 5). We found that many of the most used words were the same in both positive and negative reviews. Table 2 presents the most used words across all reviews. This overlap made us wonder whether eliminating these words would be a good strategy in creating the model.
Then, we used some libraries to perform sentiment analysis. Sentiment analysis techniques allow the identification of people's opinions, feelings, or attitudes through their comments. These techniques make it possible to classify the sentiment of a given sentence as positive, negative, or neutral using scalar values, and also through polarity (quantifying the sentiment as positive or negative through a single value). These techniques are widely used in domains such as social networks, and they are well suited to interpreting and analyzing data from this particular field. We therefore applied 3 different libraries: Textblob, Vader, and Flair. Textblob and Vader presented similar results, while Flair did not obtain results that correlated with the "Rating". With Textblob, we obtained 2 new attributes (Polarity and Subjectivity), and with Vader, we obtained 3 new attributes (Positive_Sentiment, Negative_Sentiment, and Neutral_Sentiment).
Figure 3 presents the density of the "Polarity" attribute obtained with Textblob. We found that the "Polarity" is mostly positive, which makes sense since, as we saw earlier, most reviews are also positive. Figure 4 presents the relationship between "Polarity" and "Rating". The polarity rises as the rating increases, which indicates a positive correlation. However, the boxplots of the rating levels overlap, which is a strong indicator of how difficult it is to create successful classification models. In addition, we observed many points flagged as outliers, some of which may not be true outliers; for "Rating" equal to 1, for example, many records have a polarity between −1 and −0.65. Figure 5 presents the relationship between "Subjectivity" and "Rating". As we can see, there does not seem to be any correlation between subjectivity and rating.
To create a simplified version of the assessment made by tourists, we generated a new attribute called "Sentiment", with a value equal to 1 for records where the "Rating" was equal to or greater than 3 and with a value equal to 0 for records where the "Rating" was less than 3. This attribute allows us to distinguish positive ratings from negative ratings.
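The handcrafted attributes and the binary "Sentiment" label can be sketched in a few lines. This is a minimal illustration, not the code used in the study: the function names are hypothetical, words are assumed to be whitespace-separated, and "Char_Count" is assumed to include spaces.

```python
def review_features(review):
    """Compute the three length-based attributes for one review.

    Assumptions (not stated in the text): words are split on whitespace
    and Char_Count counts every character, including spaces.
    """
    words = review.split()
    word_count = len(words)
    avg_word_length = (
        sum(len(w) for w in words) / word_count if word_count else 0.0
    )
    return {
        "Word_Count": word_count,
        "Char_Count": len(review),
        "Average_Word_Length": avg_word_length,
    }


def sentiment_label(rating):
    """Map a 1-5 rating to the binary Sentiment attribute (1 if rating >= 3)."""
    return 1 if rating >= 3 else 0


features = review_features("nice hotel")
labels = [sentiment_label(r) for r in (1, 2, 3, 4, 5)]
```

With this rule, ratings 1 and 2 map to 0 and ratings 3, 4, and 5 map to 1, matching the split between negative and positive reviews used in the word analysis.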
We also carried out preprocessing activities that allowed us to prepare the dataset and discover some important aspects. First, we converted the entire corpus to lowercase. Then, we tokenized the corpus, performed lemmatization, and removed all punctuation. We also tried other techniques, such as removing stopwords, stemming, and keeping only alphabetic characters; however, these did not lead to better results. Finally, we used the MinMaxScaler to normalize the data.
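The following sketch illustrates the lowercasing, punctuation removal, tokenization, and min-max normalization steps using only the standard library. It is an approximation under stated assumptions: lemmatization is omitted (it needs an external NLP library), tokenization is a plain whitespace split, punctuation is stripped before splitting for simplicity, and `min_max_scale` is a hypothetical helper that mirrors what a min-max scaler computes for a single feature.

```python
import string


def preprocess(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace.

    Lemmatization, which the text also applies, is intentionally left out
    of this sketch.
    """
    lowered = text.lower()
    without_punctuation = lowered.translate(
        str.maketrans("", "", string.punctuation)
    )
    return without_punctuation.split()


def min_max_scale(values):
    """Rescale one numeric feature to the [0, 1] range."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


tokens = preprocess("Great hotel, GREAT staff!")
scaled_word_counts = min_max_scale([10, 20, 30])
```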

Computational Techniques
Considering the objective of this work, we believed that it would be important to test the results obtainable with different supervised learning algorithms, used both as classification methods and as regression methods. We anticipated that if the algorithms failed as classification methods due to the previously identified limitations, the same algorithms used as regression methods could be an acceptable alternative in the context of this work. Given the vast number of existing methods, we chose classic methods that are among the most widely used in the literature. Our main criterion was the diversity of the mechanisms on which these methods are built. Hence, we chose methods from different categories, based on decision trees, distances, neural networks, and decision boundaries. The algorithms used were: Logistic Regression, Random Forest, Decision Tree, K-Nearest Neighbors, and Bidirectional Long Short-Term Memory (BiLSTM). The first 4 were implemented with the Scikit-learn library and the last one with the Keras library.

Tests and Evaluation
Several experiments were carried out with the selected algorithms to tune their parameters. However, as no significant differences were found, the default configuration provided by the libraries was used for all algorithms. To estimate the performance of the ML models, we performed cross-validation with five repetitions.
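As a rough illustration of this evaluation setup, the sketch below runs cross-validation on a default-configured Scikit-learn model. It assumes that "five repetitions" corresponds to 5-fold cross-validation, which the text does not state explicitly, and the ten toy reviews are invented placeholders rather than records from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Invented toy corpus: 5 positive and 5 negative reviews.
reviews = [
    "great location and friendly staff",
    "lovely room and clean bathroom",
    "excellent breakfast and wonderful pool",
    "amazing view and helpful reception",
    "comfortable bed and perfect stay",
    "dirty room and rude staff",
    "terrible service and noisy nights",
    "awful food and broken shower",
    "horrible smell and tiny room",
    "poor breakfast and unhelpful reception",
]
sentiment = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Default hyperparameters, as in the text; the vectorizer is refit inside
# every fold so that no test-fold vocabulary leaks into training.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(pipeline, reviews, sentiment, cv=5, scoring="accuracy")
mean_accuracy = scores.mean()
```

Wrapping the vectorizer and the classifier in one pipeline is a design choice of this sketch; it keeps the text transformation inside the cross-validation loop.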
We defined six different scenarios to create models. In the first three scenarios (#1, #2 and #3), the set of most used words that did not express feelings were removed (hotel, room, staff, did, stay, rooms, stayed, location, service, breakfast, beach, food, night, day, hotel, pool, place, people, area, restaurant, bar, went, water, bathroom, bed, restaurants, trip, desk, make, floor, room, booked, nights, hotels, say, reviews, street, lobby, took, city, think, days, husband, arrived, check, and told), and in the other three (#4, #5 and #6), all words were kept.
For all scenarios, we used the TfidfVectorizer class from the Scikit-learn library to transform the "Review_new" feature into feature vectors, and we set max_features to 5000. In scenarios #1 and #4, the features considered were "Review_new", "Polarity", "Word_Count", "Char_Count", "Average_Word_Length", "Positive_Vader_Sentiment", and "Negative_Vader_Sentiment"; in scenarios #2 and #5, the features considered were "Review_new" and "Polarity"; and in scenarios #3 and #6, only the feature "Review_new" was considered. We applied each supervised learning algorithm to each scenario with both the classification and regression methods. Thus, all combinations were used for a 5-class problem (Y = "Rating") and a 2-class problem (Y = "Sentiment"). Finally, we applied three anomaly detection methods to the output of the best classification model (2-class problem).
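To make the scenario construction concrete, the sketch below builds a feature matrix in the spirit of scenarios #2 and #5 (TF-IDF vectors of "Review_new" plus "Polarity") and fits a default Logistic Regression. The reviews and polarity values are invented for illustration, and stacking the numeric column next to the sparse TF-IDF matrix with `scipy.sparse.hstack` is an assumption about how the features could be combined, not a description of the authors' code.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy data standing in for "Review_new", "Polarity", "Sentiment".
reviews = [
    "great location and friendly staff",
    "lovely room with a clean bathroom",
    "excellent breakfast and wonderful pool",
    "dirty room and rude staff",
    "terrible service and noisy nights",
    "awful food and broken shower",
]
polarity = np.array([0.6, 0.5, 0.9, -0.6, -0.8, -0.7])  # hypothetical values
sentiment = np.array([1, 1, 1, 0, 0, 0])

# TF-IDF representation capped at 5000 terms, as stated in the text.
vectorizer = TfidfVectorizer(max_features=5000)
X_text = vectorizer.fit_transform(reviews)

# Append the numeric feature as one extra sparse column.
X = hstack([X_text, csr_matrix(polarity.reshape(-1, 1))]).tocsr()

model = LogisticRegression()  # default configuration
model.fit(X, sentiment)
predictions = model.predict(X)
```

Scenarios #1 and #4 would add further numeric columns in the same way, while scenarios #3 and #6 would use `X_text` alone.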

Classification and Regression Results with Supervised Methods
Figure 6 presents the results obtained with the five algorithms for each of the scenarios defined with the classification method for the 5-class problem (Y = "Rating"). Note that the Logistic Regression method is, by default, limited to two-class classification problems. However, with the Scikit-learn library, Logistic Regression can handle multi-class classification problems using the one-vs-rest approach [48]. As Figure 6 shows, the Logistic Regression algorithm obtained the best results for all scenarios, with an accuracy always higher than 0.6, followed by the Random Forest algorithm. The other three algorithms obtained considerably lower results, and in the case of the BiLSTM algorithm, the results were very poor, as it classified all cases with a "Rating" of 4. Since scenario 4 was the one that achieved the best accuracy, Table 3 presents precision and recall for each of the algorithms in scenario 4 with the classification method for the 5-class problem. The Logistic Regression and Random Forest algorithms presented interesting results: relatively high values were obtained for the extreme cases ("Rating" = 1 and "Rating" = 5), but the quality was quite low in the classification of intermediate values.
Figure 7 presents the results obtained with the 5 algorithms for each of the scenarios defined with the classification method for the 2-class problem (Y = "Sentiment"). As can be seen, the results were quite good. Once again, the Logistic Regression and Random Forest algorithms obtained the best results, with the Logistic Regression algorithm showing an accuracy very close to 0.95. The Decision Tree and K-Nearest Neighbors algorithms obtained reasonable results, mainly in scenarios where more features were considered. The BiLSTM algorithm returned the worst results. Table 4 presents precision and recall for each of the algorithms in scenario 4 with the classification method for the 2-class problem. The results presented by the Logistic Regression algorithm are quite solid. The recall for the negative class (Sentiment = 0) is lower than desirable, but this is probably explained by the dataset being unbalanced.
The next experiments concern the application of the algorithms to the previously presented scenarios with the regression method. Figure 8 presents the Mean Absolute Error obtained with the 5 algorithms for each of the scenarios defined with the regression method for the 5-class problem (Y = "Rating"). We found that most algorithms obtained poor results. However, the Random Forest algorithm presented very interesting results, obtaining a Mean Absolute Error of 0.69 in scenario 4 (which is quite good considering the problem in question). Table 5 presents the Mean Squared Error, Root Mean Square Error, and Mean Absolute Error for each of the algorithms in scenario 4 with the regression method for the 5-class problem. Once again, the Random Forest algorithm obtained very good results, unlike the other algorithms. Although the BiLSTM algorithm seems to give reasonable results, this happens only because it always generates the same output and most reviews are positive.
Figure 9 presents the Mean Absolute Error obtained with the 5 algorithms for each of the scenarios defined with the regression method for the 2-class problem (Y = "Sentiment"). In this case, all algorithms, with the exception of the BiLSTM algorithm, obtained very good results. Table 6 presents the Mean Squared Error, Root Mean Square Error, and Mean Absolute Error for each of the algorithms in scenario 4 with the regression method for the 2-class problem. The Logistic Regression algorithm again presented very good results that were consistent across all experiments. In this scenario, the K-Nearest Neighbors algorithm also presented interesting results.
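The three regression metrics reported above follow directly from their definitions, and a small sketch helps to see why a model that always outputs the same value can still look reasonable on an unbalanced dataset. The helper name is hypothetical, and the constant prediction of 4 is chosen purely for illustration (the text does not state which value the BiLSTM regressor produced); the rating counts are the ones given in the dataset analysis.

```python
import math


def regression_errors(y_true, y_pred):
    """Return MSE, RMSE, and MAE for two equally long sequences."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae}


# Small worked example.
example = regression_errors([5, 4, 1, 3], [4, 4, 3, 3])

# Constant-output baseline on the rating distribution of the dataset.
rating_counts = {1: 1421, 2: 1793, 3: 2184, 4: 6039, 5: 9054}
y_true = [rating for rating, count in rating_counts.items() for _ in range(count)]
y_constant = [4] * len(y_true)  # illustrative constant prediction
baseline = regression_errors(y_true, y_constant)
```

In the small example the errors are 1, 0, 2, and 0, giving an MSE of 1.25 and an MAE of 0.75. For the constant baseline the MAE is 19087/20491, about 0.93, which shows how far a trivial predictor can get when most ratings are 4 or 5.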

Anomaly Detection Results
For our next experiments, we selected one-class classification (OCC) methods used in anomaly detection problems. We applied them to score the predicted output of the best classification algorithm, in this case the Logistic Regression. These anomaly detectors are trained with normal data and identify patterns that deviate from normality, which are considered anomalies. The main goal is to analyze whether these techniques can help the recommendation system that we intend to develop to correctly classify as many users as possible (that is, to detect whether they like a POI), improving the Logistic Regression performance. As Figure 10 shows, class 0 (shown as red dots), which we considered the anomalous class in this scenario, is dispersed throughout the graph for the Isolation Forest and OCKNN methods. For the LOF method, users with negative sentiments are at the top, with the highest scores. However, some of them overlap with users with positive sentiments, which means that although reducing false positives is possible, it comes at the cost of increasing false negatives. We identified LOF as the best method for this purpose, as its score separated the Y = "Sentiment" classes better than the other methods.
We then performed four different experiments with this technique, analyzing the precision and recall metrics, as we intended to reduce false positives (increase precision) while taking the increase in false negatives (decrease in recall) into account. In the first two experiments, we applied LOF to separate the Y = "Sentiment" classes: in the first experiment, we trained with users with positive sentiments to isolate users with negative sentiments, while in the second experiment, we switched the classes (training with users with negative sentiments to isolate users with positive sentiments). We repeated the process for experiments three and four, this time using Y = "Rating" to isolate the extreme rating values: in experiment three, we trained with users who rated 5 in order to isolate users who rated 1, and vice versa for the fourth experiment. To visualize the experiments, we built different graphics (Figures 11-14). The y-axis represents precision and recall as percentages, and the x-axis gives the percentile thresholds of the LOF scores. That is, the instances with the highest LOF scores are considered the isolated class. For example, a threshold percentile of 95% means that instances whose score exceeds the 95th percentile of the LOF scores are considered the isolated class.
In all experiments, recall increases linearly as the threshold increases, while precision shows a slight decrease for high threshold values. It is essential to mention that a threshold percentile of 100% represents the output of Logistic Regression without cuts. At that threshold recall is always 100%, meaning there are no false negatives, because we are using only the values that the Logistic Regression method predicted for a specific class.
In the first experiment (Figure 11), we aimed to discard class 0 from the Logistic Regression output, reducing the class 1 population in order to keep as many users who truly liked a POI as possible. With a threshold percentile of 50%, we obtain approximately 99% precision, but at a high cost for recall (52%). Precision increases slightly when the class 1 population is reduced by 20% (threshold percentile 80%), reaching 98%, while recall decreases to 82%. This yields an acceptable recall value while precision approaches its highest value. Figure 12 shows the experiment in which we intended to correctly identify the highest number of users who did not like a POI. In this scenario, LOF performs poorly since it could not adequately separate class 1 from class 0: to increase precision by only 2 percentage points (from 84% to 86%), recall drops from 99% to 51%.
Regarding the third experiment, shown in Figure 13, our goal was to discard users who rated a POI as 1 while reaching the maximum number of users who rated a specific POI as 5. Precision can increase from 73% to 78% when the population of users who rated 5 is reduced by 50%. This increase of 5 percentage points is the same as that visible in the first (Figure 11) and last (Figure 14) experiments; however, the highest precision value is much higher in the first scenario.
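The way an anomaly detector trained only on "normal" data produces scores can be sketched with Scikit-learn's `LocalOutlierFactor` in novelty mode. This is an illustration on synthetic two-dimensional points, not the feature space used in the study, which the text does not specify; negating `score_samples` so that higher values mean "more anomalous" is likewise an assumption chosen to match the orientation described for Figure 10.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic stand-ins: the "normal" class is a cluster around the origin,
# the anomalous class sits far away from it.
normal_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
normal_test = rng.normal(loc=0.0, scale=1.0, size=(20, 2))
anomalous_test = rng.normal(loc=8.0, scale=0.5, size=(20, 2))

# novelty=True lets the fitted model score unseen samples.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(normal_train)

# score_samples returns the negated LOF (higher = more normal), so the
# sign is flipped to obtain a score where higher = more anomalous.
normal_scores = -lof.score_samples(normal_test)
anomalous_scores = -lof.score_samples(anomalous_test)
```

When the two classes overlap in feature space, as the text reports for some users, their score ranges overlap as well, which is the source of the precision/recall trade-off.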
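One possible reading of the percentile threshold is sketched below: among the instances that the classifier assigned to the target class, those whose anomaly score lies above the chosen percentile are discarded as the isolated class, and precision and recall are computed on what remains. The function name, the toy scores, and this exact interpretation of the threshold are assumptions for illustration only.

```python
import numpy as np


def precision_recall_at_percentile(scores, is_target, percentile):
    """Keep instances scoring at or below the percentile; evaluate the rest.

    scores     -- anomaly scores (higher = more anomalous) of the instances
                  the classifier predicted as the target class
    is_target  -- True where the instance really belongs to the target class
    percentile -- threshold percentile in [0, 100]
    """
    scores = np.asarray(scores, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)
    cutoff = np.percentile(scores, percentile)
    kept = scores <= cutoff
    true_positives = np.sum(kept & is_target)
    precision = true_positives / kept.sum()
    recall = true_positives / is_target.sum()
    return float(precision), float(recall)


# Toy example: ten predicted positives, two of which are actually negative
# and happen to receive high anomaly scores.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
is_target = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0]

no_cut = precision_recall_at_percentile(scores, is_target, 100)
cut_70 = precision_recall_at_percentile(scores, is_target, 70)
```

In this toy case, no cut gives a precision of 0.8 with a recall of 1.0, while a 70% threshold raises precision to 1.0 and lowers recall to 0.875, the same kind of trade-off discussed for the experiments.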

Discussion
In this work, we carried out several experiments to understand the ability of Machine Learning models to predict user ratings from reviews on the TripAdvisor platform. We started with the classification and regression of two problems, multi-class (Y = "Rating") and binary (Y = "Sentiment"), to observe the models' behavior. The results in the multi-class problems were not very high, especially in identifying the intermediate classes (Rating 2, 3, 4), due to the composition of the dataset. In the dataset analysis, we verified that, in addition to the classes being unbalanced (Figure 1), there is an overlap in the user evaluations (Figure 4). On the one hand, the dataset may not be sufficiently representative (for example, in comments with a level 3 rating). On the other hand, the fact that users are different can also have a large impact on a scale from 1 to 5: the same words have different meanings or weights for different people, and people who evaluate a POI with the same rating may express it in completely different ways. As expected, the binary problem (Y = "Sentiment") results were higher since the data were aggregated around the extreme ratings (1 and 5), where overlapping observations were fewer than for the intermediate ratings. Since our goal was to ensure that reviews classified as positive actually received a positive rating from the user (and vice versa), we applied anomaly detection techniques to improve the Logistic Regression precision. We verified that LOF differentiated the classes in the Logistic Regression output better than OCKNN and Isolation Forest. The LOF algorithm could reduce false positives, but with an associated cost of increasing false negatives (with linear growth derived from the noise present in the dataset). This trade-off suits our purpose, since it is essential that the recommendation system we intend to develop can identify with certainty the POIs that users will or will not like.

Conclusions
This work aimed to study strategies to automatically predict tourists' preferences regarding tourism points of interest. The method consisted in using Machine Learning algorithms and Natural Language Processing techniques on reviews that tourists posted on TripAdvisor® to predict their assigned ratings. The chosen dataset had many issues that made it difficult to achieve better results (the top three were being unbalanced, having comments that were not about the POI, and having comments with very poor writing quality). Since this was a public dataset, we already knew it would be extremely challenging because most existing works present accuracy rates between 30% and 60%. However, we decided to use this dataset as it is a good example of the reality and type of problems that exist in the context of this work.
The work carried out allowed us to reach important conclusions. First, the inclusion of sentiment analysis had a much smaller positive impact than expected. Furthermore, for this dataset, the Vader and TextBlob models obtained a good correlation with the ratings associated with comments, while Flair did not. Second, although negative comments are usually longer, the inclusion of the "Word_Count" attribute did not prove to be relevant. Third, the Logistic Regression algorithm achieved the highest accuracy for classification, while the Random Forest algorithm obtained the smallest error for regression. The Bidirectional LSTM algorithm obtained poor results for both classification and regression, most likely because the dataset was not large enough and contained several outliers, making it difficult for the LSTM to extract patterns and generalize from the data. Finally, we verified that the precision of a model can be improved using anomaly detection techniques, albeit with a certain decrease in recall. The cost of increasing false negatives is defined by the anomaly threshold, which is a user-specified parameter. The threshold can therefore be adjusted to reach a beneficial trade-off between precision and recall. We intended to create a model to identify only those tourists who truly like or dislike a particular point of interest; the main objective is not to identify everyone, but fundamentally not to fail those who are identified in those conditions. Our experiments provide valuable information, as they give an idea of the behavior of Machine Learning models in a real scenario, helping those who intend to create a recommendation system for decision support in the tourism field. As future work, we intend to replicate this study with a much larger dataset in which comments and evaluations cover different points of interest.

Figure 2. Correlation between the average number of words in the "Review" with the assigned "Rating".

Figure 10. Anomaly scores distinguishing sentiment 1 from sentiment 0. The first graphic shows the Isolation Forest results, the second shows the OCKNN results, and the third shows the LOF results. The y-axis represents the scores and the x-axis represents the sample indices.

Table 1. Small example of the used dataset.

Table 2. List of the most used words in reviews.

Table 3. Precision and recall for scenario 4 with the classification method (Y = "Rating").

Table 4. Precision and recall for scenario 4 with the classification method (Y = "Sentiment").