Statistics-Based Outlier Detection and Correction Method for Amazon Customer Reviews

People nowadays use the internet to share their assessments, impressions, ideas, and observations about various subjects or products on numerous social networking sites. These sites serve as great sources of data for data analytics, sentiment analysis, natural language processing, etc. Conventionally, the true sentiment of a customer review matches its corresponding star rating. There are exceptions, however, when the star rating of a review is opposite to its true nature; such reviews are labeled as outliers in this work. State-of-the-art methods for anomaly detection involve manual searching, predefined rules, or traditional machine learning techniques to detect such instances. This paper conducts a sentiment analysis and outlier detection case study on Amazon customer reviews and proposes a statistics-based outlier detection and correction method (SODCM), which helps identify such reviews and rectify their star ratings to enhance the performance of a sentiment analysis algorithm without any data loss. The paper focuses on applying SODCM to datasets containing customer reviews of various products, which are (a) scraped from Amazon.com and (b) publicly available. It also studies the datasets and evaluates the effect of SODCM on the performance of a sentiment analysis algorithm. The results exhibit that SODCM achieves a higher accuracy and recall percentage than other state-of-the-art anomaly detection algorithms.


Introduction
Sentiment analysis, emotion artificial intelligence, and intent analysis are often used to describe the same concept, i.e., opinion mining. Sentiment analysis uses a combination of natural language processing (NLP), computational linguistics, and text mining to analyze, derive, calibrate, and evaluate textual information in the form of sentences, phrases, documents, etc. [1]. NLP has gained considerable attention in recent years.
People have started to rely on consumer reviews and sentiments shared over social media sites, blogs, and consumer feedback websites on the internet before purchasing or opting for a particular product or service. It has also become a vital tool for decision-makers who plan to improve, modify, or perform necessary actions based on public opinions. Sentiment analysis is used extensively in various domains such as marketing, politics, sports, and stocks for information extraction, improvement of an automated chatbot response system, or product modification. Most companies use sentiment analysis to research consumer requirements and understand market trends. Positive reviews of a product or service drive online marketing, while negative comments motivate companies to improve their products or services based on customer demands. Social media has become a robust platform that helps understand public opinions, acceptance, or issues regarding products and services.

Related Work
Social media has become a powerful platform for people to share their opinions and concerns on topics ranging from socio-economic to political to technological advancements. Iglesias et al., in [8], discussed advancements in various approaches in the field of sentiment analysis, their contributions, and their applications in various domains. The work in [9] compiles studies related to various limitations of sentiment analysis on social media datasets. It discusses problems ranging from trivial ones, such as spelling and grammatical mistakes, to critical situations such as rumor-mongering, community shaming, riots, and protests arising from posts or comments on the internet. It also highlights the increasing impact of research conducted on sentiment analysis applied to social media datasets. The study in [10] analyzed previous literature based on modern social media applications. It also featured their impacts in healthcare, disaster management, and business.
In [11], Wang et al. explained that a sentence that holds an opinion consists of quintuple parameters (e, a, s, h, t), where e is the target or entity, a is the aspect or feature of e, s is the nature of the opinion on e or a, h is the opinion holder, and t is the time when h expresses the sentiment. For instance, in this 5-star Amazon review for a hand sanitizer, "With having to use hand sanitizers so much due to the COVID situation, this is the best one I have found. Love the residual effects and the fact that is doesn't dry out my skin. Would recommend over other brands.", e is the hand sanitizer, a is the residual effect, the nature of the opinion is positive, and the opinion holder is the Amazon reviewer, while the time is during the COVID-19 pandemic. Sentiment analysis focuses explicitly on s, the nature of the opinion.
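As an illustration, the quintuple (e, a, s, h, t) maps naturally onto a simple data structure. The sketch below encodes the hand-sanitizer review from the text; the field names are ours, chosen for readability:

```python
from typing import NamedTuple

class Opinion(NamedTuple):
    entity: str     # e: the target of the opinion
    aspect: str     # a: the feature of e being discussed
    sentiment: str  # s: the nature of the opinion on e or a
    holder: str     # h: who expresses the opinion
    time: str       # t: when the opinion is expressed

# The 5-star hand-sanitizer review, encoded as a quintuple
review = Opinion(
    entity="hand sanitizer",
    aspect="residual effect",
    sentiment="positive",
    holder="Amazon reviewer",
    time="COVID-19 pandemic",
)
```

Sentiment analysis then amounts to predicting the `sentiment` field from the review text.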
Sentiments or emotions tenaciously drive a consumer's decisions and views regarding a product or service. The research in [12] focused on social media's impact on people from a spatial and temporal vantage point. Using Alteryx, it filtered the tweets based on residential users from the 2016 United States Geo-tweets dataset. The results show a higher impact of tweets, especially those with positive sentiments, based on several features such as location, content, and time. Cosmetic brands apply sentiment analysis to obtain a clear and comprehensive insight into consumers' thoughts on product quality and desires. In [13], Park implemented Term Frequency-Inverse Document Frequency to analyze the polarity of customer opinions and brand satisfaction for 26 different cosmetic companies. The research also focused on the factors affecting the nature of consumers' views.
Understanding a consumer's buying choices is a challenging assignment for a machine learning algorithm. Hu et al., in [14], introduced a credibility, interest, and sentiment enhanced recommendation model, which consists of five segments, namely, (a) feature extraction of the review, (b) interest mining on the aesthetic of the comment, (c) candidate feature sentiment assignment based on the nature of the fastText sentiment, (d) reviewer credibility evaluation, which helps weigh the credibility of the reviewer to avoid fake reviewers, and (e) a recommendation module that utilizes the credibility-weighted sentiment score of the feature selected by the buyer. Reviews also depend on a reviewer's experience, which might differ from one customer to another. Li et al. focused on this problem in [15] by proposing an algorithm inspired by Dempster-Shafer evidence theory. They used hotel customer reviews of four different properties as a case study and extracted information from various travel websites to verify the practicability and capability of the algorithm. Their approach can help managers develop strategies based on the customer reviews to outrun their competitors.
Aspect-based sentiment analysis (ABSA) identifies the feature/aspect of an entity/target in an opinion/review and then performs sentiment analysis on each aspect identified. In this 3-star Amazon review on gloves, "Good value for the money, however, they do not hold up very well. They rip easily", the two aspects the consumer discusses are (a) affordability, whose sentiment is positive as the gloves are cheap, and (b) durability, which carries a negative polarity. In [16], feature-focused sentiment analysis was applied to the customer comments and review votes of various mobile products collected from Amazon. The result indicated that the method helps manufacturers in product development and helps buyers make personalized decisions based on multiple features of a product. Ali et al. [17] studied the customer reviews and feedback for ridesharing services to modify and uplift several organizations for Kansei engineering in India and Pakistan. Since the languages commonly used are Urdu/Hindi and English, the work converted all the reviews into English and performed ABSA. They also extracted the most frequently used aspects to further improve the services provided based on customer demands. ABSA also helps classify reviews or comments based on various product or service features related to the opinion. ABSA has several challenges; for instance, attention-based models may sometimes (a) lead to a given aspect incorrectly targeting grammatically irrelevant words, (b) fail to diagnose special sentence structures such as double negatives, and (c) weigh only one vector to depict context and target. In [18], Zhang et al. proposed a knowledge-guided capsule network to address the above limitations using a Bi-LSTM and a capsule attention network. The study in [19] summarizes the state-of-the-art ABSA methods using lexicon-based, machine learning, and deep learning approaches.
In this digital age, since information is so readily available, buyers tend to read customer reviews and comments before purchasing a product, which affects their purchasing decision. Researchers usually focus on the review body, but a review contains more information than that, which generally goes unexploited, such as the review time, number of helpful votes, reviewer id, and review rating. In [20], Benlahbib and Nfaoui visualized the reputation of a product differently by considering all these parameters and projecting the reputation value, opinion category, top positive review, and top negative review. They used the time of review and the number of helpful votes for each review together with Bidirectional Encoder Representations from Transformers (BERT) to predict the probability of the nature of the review sentiment. They also proposed equations that calculate the reputation value for a product. Extensive research is being conducted not only on sentiment analysis in English but also in several other languages such as Arabic [21], Persian [22], Urdu [23], Hindi [24], Russian [25], Chinese [26], and Indonesian [27].
Several studies have been conducted on sentiment analysis [28] and its application to e-commerce. With the increase in online consumption, e-commerce enhancement has become a hot research topic. Many scholars have introduced methods focusing on deep neural networks [29], probabilistic classifiers [30], linear classifiers [31], lexicon-based approaches [32], or decision trees [33] to increase accuracy and efficiency. In [34], Wang et al. proposed an iterative sentiment analysis model called SentiDiff, which predicts polarities in Twitter messages by considering the interconnections between the textual information of Twitter messages and sentiment diffusion patterns. Shofia and Abidi [35] used a support vector machine to identify the keywords and extract the sentiment polarity of Twitter data specific to Canada on social distancing due to COVID-19. Zhang et al. [36] introduced a convolutional multi-head self-attention memory network to glean valuable and intricate semantic information from the sequences and aspects of a sentence. This algorithm uses a convolutional network to capture n-gram grammatical knowledge and multi-head self-attention to acknowledge the linguistic information of the sequence via the memory network. Abdalgader et al. [37] applied a lexicon-based word polarity identification method by studying the semantic relatedness between the target word and the synonyms of words surrounding the target on several benchmark datasets. The results outperformed several existing methods that use pairwise relatedness between words at the term level around the target over a fixed window size. The performance of various sentiment analysis methods differs due to factors such as datasets, feature representations, or classification processes. Liu et al. [19] conducted a detailed survey of several deep learning approaches for aspect-based sentiment analysis, covering benchmark datasets, evaluation metrics, and the performance of existing deep learning methods.
Outliers are extreme values that diverge from the rest of the data samples [38,39]. They might occur due to an imbalanced dataset, experimental error, or novelty. The research in [39] defines an outlier in its experiment as any tweet in a Twitter dataset that is not relevant to the topic under consideration. Once the outliers are detected and eliminated, the algorithm's accuracy improves significantly. Similarly, in [40], it was observed that if outliers are identified and erased using a density-based clustering algorithm before a convolutional neural network is applied to the documents to be classified, the efficiency increases and the computational cost decreases. Kim et al. [41] applied a combination of four outlier detection methods, namely (a) Gaussian density estimation, (b) Parzen window density estimation, (c) principal component analysis, and (d) K-means clustering, to identify malicious activities in an institution using a user log database. Outlier identification methods can be broadly categorized into statistics-based [42], distance-based [43], graph-based [44], clustering-based [45], density-based [46], and ensemble-based [47] approaches. Once the outliers are detected, it is crucial to decide whether to delete, keep, or modify them. This usually depends on an outlier's effect on the dataset if it is deleted or tampered with. The condition of an outlier can vary for different applications and datasets; for instance, if in a population estimation survey the number of people over 7 ft tall is very low, then these data can be verified and kept, as they are natural outliers. In contrast, if in a dataset of various brands of shoes the prices of one or two are extraordinarily high, then those outliers can be deleted before calculating the average cost of a pair of shoes.

Datasets
With the advancement in the field of the internet and cloud computing [48], data collection has become more accessible. Public datasets are found in abundance for research purposes. Amazon is one of the many colossal data sources that encourage scholars to scrape publicly available data from their websites for research purposes. Based on a survey from Feedvisor, an article in Forbes concluded that 89% of buyers choose Amazon instead of other e-commerce websites to make online purchases [49]. There are two types of datasets used in this paper: (a) collected datasets and (b) publicly available datasets. The collected datasets used in this paper [50] consist of product reviews we ourselves collected from Amazon.com, from 2008 to 2020, spanning seven different domains, namely, book (Becoming by Michelle Obama), pharmaceutical (Turmeric Curcumin Supplement by Natures Nutrition), electronics (Echo Dot 3rd Gen by Amazon), grocery (Sparkling Ice Blue Variety Pack), healthcare (EnerPlex 3-Ply Reusable Face Mask), entertainment (Harry Potter: The Complete 8-Film Collection), and personal care (Nautica Voyage By Nautica).
Each review carries multiple pieces of information, such as the reviewer name, date and place of the comment, star rating, verified purchase status, the number of buyers who found the review helpful, and the images added by the reviewer. This dataset scraped from Amazon consists of 35,000 Amazon customer reviews, including the product name, comment date, star rating, and the number of helpful votes. Figure 1 shows the number of reviews against each star rating accumulated for all seven collected datasets. It can be observed that the extremely positive star rating (5-star) dominates the dataset, and there are very few negative (1- and 2-star) and moderately positive (3- and 4-star) star ratings. The skewed nature of the dataset results in a J-shaped distribution. Multiple reasons exist behind such bias towards extremely positive reviews. People usually agree with and write positive ratings and comments quickly but are generally skeptical about negative ratings or comments. When a consumer notices an extremely positive review, it usually influences the consumer's opinion, resulting in a switch of the star rating. A higher rating was also observed to easily influence a consumer to increase their valuation, while the reverse is not true [51]. Table 1 presents the consumer review distribution across the different star ratings for each collected dataset individually; the results show the same bias of customer reviews towards a 5-star rating as compared to the rest.

Table 1. Consumer review distribution across star ratings in the collected datasets.

Dataset          5-Star  4-Star  3-Star  2-Star  1-Star
Book              4104     219      62      46     569
Electronics       3567     770      94      51     518
Entertainment     2485    1062     797     271     385
Grocery           3402     683     134     104     677
Health Care       3014     910     263     196     617
Personal Care     3287     338     641     200     534
Pharmaceutical    3190     855     184     114     657

Figure 2 presents a graphical distribution of the average number of helpful votes per review. It can be inferred that customers find the extremely negative reviews the most helpful ones for making buying decisions or understanding a product.
Extremely negative reviews are usually critical about the product, its features, packaging, delivery, usefulness, cost, and authenticity. It becomes easier for consumers to decide about buying a product if they understand the various aspects of a product and the extremely negative experiences of former buyers. Table 2 compiles the average helpful votes per customer review in each dataset. It can be observed that most customers find extremely negative reviews the most informative and beneficial.
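The per-rating averages behind Table 2 can be computed in a few lines. The (star rating, helpful votes) pairs below are hypothetical, not taken from the paper's datasets:

```python
from collections import defaultdict

# Hypothetical (star_rating, helpful_votes) pairs; extremely negative
# reviews typically accumulate the most helpful votes.
reviews = [(5, 0), (5, 2), (1, 10), (1, 14), (4, 1), (3, 3), (1, 12), (5, 1)]

totals = defaultdict(lambda: [0, 0])  # star -> [vote_sum, review_count]
for star, votes in reviews:
    totals[star][0] += votes
    totals[star][1] += 1

avg_helpful = {star: vote_sum / count
               for star, (vote_sum, count) in totals.items()}
```

On this toy sample, 1-star reviews average far more helpful votes than 5-star ones, mirroring the pattern reported for the collected datasets.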

Interquartile Range
Traditionally, a dataset can be represented by using the five-number summary, which includes the lowest and highest values, the median, and the first and third quartiles, i.e., the middle numbers between the median and the first and last values, respectively [52]. These values exhibit more information about a dataset than just rows and columns. Figure 3 is an example of the box plot distribution of a dataset. Q_1 and Q_3 are the intermediate points of the first and second halves of an ordered dataset, respectively, and Q_2 is the median value of the dataset. For example, in the ordered dataset A = {1, 1, 2, 3, 5, 6, 7}, Q_2 is 3, the median value or fourth number of the dataset; Q_1 is 1, the center value of the first half; and Q_3 is 6, the midpoint of the second half. The difference between Q_1 and Q_3 is the interquartile range (I_QR), which reflects the spread of the dataset about the median.
The lower and upper fences can be represented as:

F_L = Q_1 − 1.5 × I_QR,
F_U = Q_3 + 1.5 × I_QR.

Data points that lie beyond the bounds of F_L and F_U are outliers. The scale factor 1.5 preserves the sensitivity of the dataset: a larger scale than 1.5 would treat true outliers as ordinary data points, while a smaller one would flag ordinary data points as outliers.
In a dataset, there are two types of outliers: suspected (or potential) outliers and definite outliers. A potential outlier (O_P) is a data point x that is suspected as a possible outlier if it satisfies:

x < Q_1 − 1.5 × I_QR or x > Q_3 + 1.5 × I_QR.

A definite outlier (O_D) is a data point x that is an absolute outlier if it complies with:

x < Q_1 − 3 × I_QR or x > Q_3 + 3 × I_QR.
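The quartile, fence, and outlier definitions above can be sketched as follows. Quartiles are computed as midpoints of the two halves of the ordered data, matching the worked example for A = {1, 1, 2, 3, 5, 6, 7}; the factor 3 for definite outliers is the conventional outer-fence choice:

```python
def median(xs):
    xs = sorted(xs)
    n, mid = len(xs), len(xs) // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def quartiles(xs):
    # Q1/Q3 are the midpoints of the first and second halves of the
    # ordered dataset; the median is excluded when the length is odd.
    xs = sorted(xs)
    half = len(xs) // 2
    return median(xs[:half]), median(xs), median(xs[half + len(xs) % 2:])

def classify(x, xs):
    q1, _, q3 = quartiles(xs)
    iqr = q3 - q1
    f_l, f_u = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # inner fences
    if x < q1 - 3 * iqr or x > q3 + 3 * iqr:     # outer fences
        return "definite outlier"
    if x < f_l or x > f_u:
        return "potential outlier"
    return "inlier"
```

For A = {1, 1, 2, 3, 5, 6, 7}, this yields Q_1 = 1, Q_2 = 3, Q_3 = 6, as in the text.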

Definitions for SODCM
R consists of all the customer reviews in a dataset such that R = {r_1, r_2, r_3, . . . , r_N}, where r_i denotes the i-th review and r*_i is the star rating of r_i. In order to understand the proposed statistics-based outlier detection and correction method (SODCM), the following definitions are presented.
Definition 1. Any review with a star rating of four or more is considered a positive star-rated review, denoted by S+.
Definition 2. Any review with less than a four-star rating is considered a negative star-rated review, denoted by S−.

Definition 3. T_V(r_i) = 1 if r_i ∈ S+ and T_V(r_i) = −1 if r_i ∈ S−. The target value of review r_i, denoted by T_V, is 1 if it is a positive star-rated review and −1 otherwise.
Definition 4. C_V(r_i) is the compound sentiment score of r_i predicted by a sentiment analysis algorithm.

Definition 5. The value difference of review r_i, denoted by V_D(r_i), is the Euclidean distance between T_V(r_i) and C_V(r_i), i.e., V_D(r_i) = |T_V(r_i) − C_V(r_i)|. Since the range of both T_V and C_V is [−1, 1], the range of V_D is [0, 2].
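The definitions above translate directly into code. C_V would come from a sentiment analyzer such as TextBlob, so a numeric score simply stands in for it here:

```python
def target_value(star_rating):
    # T_V: +1 for positive star-rated reviews (4 stars or more), -1 otherwise
    return 1 if star_rating >= 4 else -1

def value_difference(t_v, c_v):
    # V_D: distance between T_V and C_V; both lie in [-1, 1],
    # so V_D always falls in [0, 2]
    return abs(t_v - c_v)
```

A 5-star review whose text scores C_V = −1 gives the maximum V_D of 2, the clearest possible mismatch between rating and sentiment.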

Proposed Algorithm
The star rating assigned to a customer's review is generally considered the ideal sentiment of the comment. There are instances, however, when a customer assigns a positive star rating, but the nature of the feedback is negative. This 4-star Amazon customer review on a thermometer, "Purchased the thermometer to have a method to check temperatures by non-contact. The thermometer's box and content was not sealed which bothered me because of COVID.", carries a negative sentiment but has a positive rating, which is contradictory. Such ratings can be corrected to their true star ratings to improve the efficiency of a sentiment analysis algorithm.
SODCM consists of two major parts, namely the (a) detection of these outliers and (b) correction of the identified anomalies. It has the following six steps:

Input: The input for SODCM is any dataset containing customer reviews (r_i) and their corresponding star ratings (r*_i).

Step 1: T_V is calculated using r*_i. If r_i belongs to S+, then T_V = 1, and if r_i belongs to S−, then T_V = −1. Since this work focuses on binary classification of the sentiments of customer reviews, the values assigned to T_V are −1 or 1.

Step 2: V_D is calculated between T_V and C_V. The value of V_D is always nonnegative. Since both T_V and C_V lie in [−1, 1], the range of V_D is [0, 2]. Figure 4 is an example of the box plot distribution of S+. Since the minimum value V_D can hold is 0, Figure 4a depicts the box plot of S+ when F_L is negative and Figure 4b depicts the box plot of S+ when F_L is positive. Figure 5 is an example of the box plot distribution of S−. Since the maximum value V_D can hold is 2, Figure 5a depicts the box plot of S− when F_U > 2 and Figure 5b depicts the box plot of S− when F_U ≤ 2.

Step 3: After analyzing the dataset, it can be construed that S+ has some reviews whose sentiment does not match the nature of the star rating; hence, they are considered outliers. On the other hand, S− has very few reviews whose opinions match the essence of their respective star ratings; hence, the reviews which are correctly assigned to their corresponding star ratings are considered outliers. This implies that most negative comments are incorrectly rated; therefore, the outliers, in this case, are the correctly rated comments. In other words, the incorrectly labeled reviews are all the reviews in S−, excluding the ones which are outliers. Hence, the dataset is split into S+ and S−.

Step 4: The threshold O_s is computed for each part. In S+, if F_L is negative, then O_s = Q_3 + I_QR; otherwise, O_s = F_U − I_QR.
Since the range of V_D is [0, 2], the least value it can hold is 0; accordingly, in S−, if F_U > 2, then O_s = Q_1 − I_QR; otherwise, O_s = Q_3 − I_QR.

Step 5: For S+, customer comments whose V_D(r_i) ≥ O_s are outliers; for S−, customer comments whose V_D(r_i) ≤ O_s are outliers. These five steps complete the outlier detection process.

Step 6: T_V of reviews labeled as outliers in S+ is reversed, meaning a comment with T_V = 1 is re-labeled as −1 and vice versa. On the contrary, for S−, T_V of reviews that are not labeled as outliers is reversed. This step is vital as it performs outlier correction by changing the nature of r*_i.

Output: The output of the proposed algorithm is the dataset consisting of reviews with the corrected nature of their star ratings, which means a positive-natured review is labeled as 1 and a negative-natured review as −1. SODCM detects the outliers and corrects them without eliminating or modifying any review.
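The steps above can be sketched end to end as follows. This is a minimal interpretation rather than the authors' reference implementation: the compound scores are assumed to come from any sentiment analyzer (the paper uses TextBlob), and the O_s formulas follow our reading of Step 4:

```python
from statistics import median

def _quartiles(xs):
    # Q1/Q3 as midpoints of the first and second halves of the ordered data
    xs = sorted(xs)
    half = len(xs) // 2
    return median(xs[:half]), median(xs[half + len(xs) % 2:])

def sodcm(star_ratings, compound_scores):
    # Step 1: target values T_V from star ratings (>= 4 stars -> +1, else -1)
    tv = [1 if s >= 4 else -1 for s in star_ratings]
    # Step 2: value difference V_D between target value and compound score
    vd = [abs(t - c) for t, c in zip(tv, compound_scores)]
    # Step 3: split the dataset into S+ (tv == 1) and S- (tv == -1)
    pos = [i for i, t in enumerate(tv) if t == 1]
    neg = [i for i, t in enumerate(tv) if t == -1]

    corrected = list(tv)
    if len(pos) >= 2:
        # Steps 4-6 for S+: reviews with large V_D are outliers; flip them.
        q1, q3 = _quartiles([vd[i] for i in pos])
        iqr = q3 - q1
        o_s = q3 + iqr if q1 - 1.5 * iqr < 0 else (q3 + 1.5 * iqr) - iqr
        for i in pos:
            if vd[i] >= o_s:              # Step 5 (S+)
                corrected[i] = -corrected[i]
    if len(neg) >= 2:
        # Steps 4-6 for S-: reviews with small V_D (the correctly rated
        # ones) are the outliers; flip every review that is NOT an outlier.
        q1, q3 = _quartiles([vd[i] for i in neg])
        iqr = q3 - q1
        o_s = q1 - iqr if q3 + 1.5 * iqr > 2 else q3 - iqr
        for i in neg:
            if vd[i] > o_s:               # Step 5 (S-): non-outliers flip
                corrected[i] = -corrected[i]
    return corrected
```

In a batch of 5-star reviews where one compound score is strongly negative, only that review's label is flipped; in a batch of 1-star reviews where most scores are positive, only the genuinely negative review keeps its label, matching Step 6.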
The above steps are realized in SODCM. After its execution, we can perform a more accurate sentiment analysis of the revised dataset, and the performance metrics of SODCM are obtained.
Proof. Each of Steps 1 to 6 requires time complexity O(N), except Step 4, which needs O(1). Hence, the entire algorithm (Algorithm 1) has complexity O(N). □

Experimental Results
The proposed SODCM identifies and rectifies outliers in all the datasets consisting of Amazon customer reviews of products from various domains. All three algorithms are executed on both (a) the collected Amazon review datasets and (b) Amazon review datasets publicly available in the amazon-reviews-pds S3 bucket in the AWS US East Region [53]. There are several datasets consisting of product reviews from various domains, and we chose Amazon product review datasets for seven domains, namely, apparel, beauty, fashion, furniture, jewelry, luggage, and toys. Each of these datasets consists of 100,000 customer reviews. The algorithm used for sentiment analysis is TextBlob [54], a Python library for NLP. The experiment is performed in two stages. Initially, the algorithm is applied to each star rating of a dataset separately to study the results. SODCM then evaluates the complete dataset at a later stage of the research.
Tables A1-A5 in Appendix A present the results from reviews evaluated based on the star ratings individually. For Tables A1 and A2, the starting value for O_s is taken as F_U, and O_s is then decremented by 0.1 until it reaches 0.8. For Tables A3-A5, the starting value for O_s is taken as F_L, and O_s is then incremented by 0.1 until it reaches 1.2. The results are then saved in a CSV file and evaluated manually to check the number of outliers detected correctly and incorrectly. In all the tables, O_D represents the total number of outliers detected, O_I is the number of reviews incorrectly labeled as outliers, and O_C equals the number of reviews correctly labeled as outliers. O_I and O_C are validated manually for cross-verification. SODCM is implemented for all the datasets and ratings separately.
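The threshold grid used for Tables A1-A5 can be generated as below; the endpoints here are illustrative, since the actual starting fences depend on each dataset:

```python
def sweep(start, stop, step):
    # Generate O_s candidate thresholds from `start` to `stop` inclusive,
    # in increments of `step` (a negative step sweeps downward).
    values, v = [], start
    while (step < 0 and v >= stop - 1e-9) or (step > 0 and v <= stop + 1e-9):
        values.append(round(v, 10))  # round away floating-point drift
        v += step
    return values
```

For example, sweeping downward from a hypothetical F_U of 1.3 to 0.8 in steps of 0.1 mirrors the grid behind Tables A1 and A2.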
The performance of SODCM is compared with two recently published state-of-the-art outlier detection methods: (a) a class-based approach [55] and (b) a deep-learning-based approach [56]. Tables 3 and 4 present the performance comparison of SODCM with those in [55,56] on the collected datasets and on the publicly available datasets, respectively. The bold numbers in all tables indicate the best results among the three methods. Table 5 compiles the metrics comparison for SODCM using the p-value, T-score, and CI, where CI represents the 95% confidence interval in the form [x, y]. From Tables A1-A5, it can be concluded that SODCM detects an optimal number of outliers in all the datasets and shows a favorable ratio between the correctly and incorrectly detected outliers, thus resulting in a high degree of accuracy. The accuracy decreases considerably once the value of O_s reaches one. Moreover, an increase or decrease in O_s for positive or negative star-rated reviews, respectively, results in a rise in incorrectly labeled outliers. It can also be concluded from Tables 3 and 4 that the accuracy and recall percentage of SODCM for all the datasets outperform those of [55,56]. Hence, it is inferred that SODCM outperforms the methods in [55,56] in outlier detection and correction. Table 5 reflects that the p-value is less than 0.001, which is robust evidence against the null hypothesis. An extremely low p-value signifies that the results are not accidental and that the improvement is due to SODCM. The T-score for all the datasets is high, indicating stronger evidence against the null hypothesis. This means that there is a considerable difference between the star ratings collected from the website and the improved star ratings based on the nature of the reviews as corrected by SODCM. CI in Table 5 represents a 95% chance that the actual error of the model is between x ± y. Hence, the smaller the CI, the more precise the estimate of the model.
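For reference, a T-score and a normal-approximation 95% confidence interval can be computed from per-dataset accuracy differences as sketched below. The numbers are hypothetical, and the paper's exact statistical procedure may differ (e.g., using the t distribution's critical value for small samples):

```python
from math import sqrt
from statistics import mean, stdev

def t_and_ci(differences, z=1.96):
    # differences: per-dataset accuracy gains of SODCM over a baseline.
    # z = 1.96 gives a normal-approximation 95% confidence interval.
    n = len(differences)
    m = mean(differences)
    se = stdev(differences) / sqrt(n)   # standard error of the mean
    t_score = m / se
    return t_score, (m - z * se, m + z * se)
```

A confidence interval that excludes zero, paired with a large T-score, is what supports rejecting the null hypothesis of no improvement.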

Conclusions and Future Work
SODCM is a novel approach for identifying anomalies in a customer review dataset and rectifying them by correcting their corresponding star ratings. The results exhibit that the performance of the proposed algorithm surpasses other state-of-the-art approaches, and they also give evidence for SODCM's rejection of the null hypothesis. The advantage of SODCM over most other methods is that this data analysis pipeline preserves the outliers in order to correct them and thus prevents any information loss. From this dataset study, it can also be inferred that the outlier definition differs for positive and negative reviews: in positive star-rated reviews, the minority consists of reviews whose nature contradicts their star ratings, while the reverse is true for negative star-rated reviews. Moreover, Amazon customer review datasets are generally highly imbalanced irrespective of the product or its department, and they follow a J-shaped distribution. By studying the count of helpful votes in the datasets, it is noticed that extremely negative reviews are the most critical ones, which help in the decision-making of the majority of the customers.
Since it can be concluded that SODCM performs well on datasets consisting of Amazon customer reviews, future work should focus on applying the proposed method to product reviews from other marketplace datasets such as eBay, Etsy, Best Buy, Target, Walmart, etc., to obtain better insight into the discrepancies between star ratings and the related reviews. This will help establish that SODCM can detect and rectify anomalies without deleting any data, thus preserving the overall dataset knowledge. This algorithm can be implemented in several real-life scenarios, such as assessing product performance [57-62], conducting market research, and flagging reviews through rating and review irregularity detection, thus rectifying them without any data loss [63,64]. In this paper, the sentiment analysis algorithm used is TextBlob, a Python-based NLP package. It would be interesting to study the behavior and impact of SODCM when combined with other state-of-the-art sentiment analysis algorithms such as BERT, XLNet, ELECTRA, OpenAI's GPT-3, RoBERTa, or StructBERT. From Tables A1-A5, it can be observed that an optimal number of outliers is successfully detected in all the datasets by the proposed SODCM. This leads to a high degree of accuracy, since the numbers of correctly and incorrectly detected outliers reach a good balance. As the value of O_s reaches 1, the sentiment analysis accuracy of the modified dataset decreases considerably, and an increase or decrease in O_s for positive or negative star-rated reviews, respectively, results in a rise in incorrectly labeled outliers.