Evaluating the Effectiveness of Text Pre-Processing in Sentiment Analysis

Abstract: Practical demands and academic challenges have both contributed to making sentiment analysis a thriving area of research. Given that a great deal of sentiment analysis work is performed on social media communications, where text frequently ignores the rules of grammar and spelling, pre-processing techniques are required to clean the data. Pre-processing is also required to normalise the text before undertaking the analysis, as social media is inundated with abbreviations, emoticons, emojis, truncated sentences, and slang. While pre-processing has been widely discussed in the literature, and it is considered indispensable, recommendations for best practice have not been conclusive. Thus, we have reviewed the available research on the subject and evaluated various combinations of pre-processing components quantitatively. We have focused on the case of Twitter sentiment analysis, as Twitter has proved to be an important source of publicly accessible data. We have also assessed the effectiveness of different combinations of pre-processing components for the overall accuracy of a couple of off-the-shelf tools and one algorithm implemented by us. Our results confirm that the order of the pre-processing components matters and significantly improves the performance of naïve Bayes classifiers. We also confirm that lemmatisation is useful for enhancing the performance of an index, but it does not notably improve the quality of sentiment analysis.


Introduction
Sentiment analysis encompasses a series of methods and heuristics for detecting and extracting subjective information, such as opinions, emotions, and attitudes, from language [1]. Although it originated in the text subjectivity analysis performed by computational linguists in the 1990s [2,3], which was later enhanced by studies about public opinion at the beginning of the 20th century, the proliferation of publications on sentiment analysis did not start until the Web became widespread [4]. The Web, and social media in particular, have created large corpora for academic and industrial research on sentiment analysis [5]. Examples of this can be found in the various applications of sentiment analysis investigated thus far: product pricing [6], competitive intelligence [7], market analysis [8], election prediction [9], public health research [10], syndromic surveillance [11], and many others. The vast majority of papers on sentiment analysis have been written after 2004 [12], making it one of the fastest growing research areas.
Sentiment analysis requires pre-processing components to structure the text and extract features that can later be exploited by machine learning algorithms and text mining heuristics. Generally, the purpose of pre-processing is to separate a set of characters from a text stream into classes, with transitions from one state to the next on the occurrence of particular characters. By careful consideration of the set of characters (punctuation, white spaces, emoticons, and emojis), arbitrary text sequences can be handled efficiently.
Tokenising a stream of characters into a sequence of word-like elements is one of the most critical components of text pre-processing [13]. For the English language, it appears trivial to split words by spaces and punctuation, but some additional knowledge should be taken into consideration, such as opinion phrases, named entities, and stop-words [14]. Previous research suggests that morphological transformations of language (that is, an analysis of what we can infer about language based on structural features [15]) can also improve our understanding of subjective text. Examples of these are stemming [16] and lemmatisation [17]. Lately, researchers working on word embeddings and deep-learning-based approaches have recommended techniques such as word segmentation, part-of-speech tagging, and parsing [18].
All the decisions made about text pre-processing have proved crucial to capturing the content, meaning, and style of language. Therefore, pre-processing has a direct impact on the validity of the information derived from sentiment analysis. However, recommendations for the best pre-processing practice have not been conclusive. Thus, we aim to evaluate various combinations of pre-processing components quantitatively. To this end, we have acquired a large collection of Twitter [19] data through Kaggle [20], the online data science platform, and tested a number of pre-processing flows, that is, sequences of components that implement methods to clean and normalise text. We also assessed the impact of each flow on the accuracy of a couple of off-the-shelf sentiment analysis tools and one supervised algorithm implemented by us.
It is important to clarify that we do not intend to develop a new sentiment analysis algorithm. Instead, we are interested in identifying pre-processing components that can help existing algorithms to improve their accuracy. Thus, we have tested our various pre-processing components with two off-the-shelf classifiers and a basic naïve Bayes classifier [21]. As their name indicates, the off-the-shelf classifiers are generic and not tailored to specific domains. However, those classifiers were chosen specifically because they are used widely and therefore our conclusions may be relevant to a potentially larger audience. Lastly, we chose the naïve Bayes classifier as a third option, because it can easily be reproduced to achieve the same benefits that will be discussed later. Our results confirm that the order of the pre-processing components makes a difference. We have also found that the use of lemmatisation, while useful for reducing inflectional forms of a word to a common base, does not improve sentiment analysis significantly.
The remainder of this paper is organised as follows. Section 2 reviews the related work. Section 3 describes the corpus used for our experiments, and the text pre-processing components and flows tested. Section 4 presents our results, and, finally, Section 5 offers our conclusions.

Related Work
As the number of papers on sentiment analysis continues to increase, largely as a result of social media becoming an integral part of our everyday lives, the number of publications on text pre-processing has increased too. According to Mäntylä et al. [12], nearly 7000 papers on sentiment analysis have been indexed by the Scopus database [22] thus far. However, 99% of those papers were indexed after 2004 [12].
Over the past two decades, YouTube [23] and Facebook [24] have grown considerably. Indeed, YouTube and Facebook have reached a larger audience than any other social platform since 2019 [25]. Facebook, in particular, has held a steady dominance over the social media market throughout recent years. In the UK, for example, which is where we carried out our study, Facebook has a market share of approximately 52.40%, making it the most popular platform as of January 2021. Twitter, on the other hand, has achieved a market share of 25.45%, emerging as the second leading platform as of January 2021 [26].
Although Twitter's market share falls behind Facebook's, the amount of Twitter's publicly available data is far greater than that corresponding to Facebook. This makes Twitter remarkably attractive within the research community, and that is why we have undertaken all our work using it. Table 1 lists four of the most cited papers on text pre-processing available on Scopus. Such papers, along with their references, account for 279 publications, which are all represented in Figure A1 in Appendix A. Pink circles in Figure A1 represent the four papers displayed in Table 1, red circles represent the earliest publications on the subject (these are publications mostly related to the Lucene Search Engine [27]), and blue circles represent the rest of the papers. The size of the circles depends on the number of citations of the corresponding paper: the larger the circle is, the more citations the paper has. The links between the papers denote the citation relationship: paper A is linked to paper B if A cites B. Figure A1 was produced with Gephi [28], an open-source network analysis and visualisation software package. The size and connectivity of the network displayed in Figure A1 in Appendix A, which is based on only four papers, shows that the body of literature on text pre-processing is large and keeps growing. However, it is still unclear which pre-processing tools should be employed, and in which order. Thus, we have evaluated various pre-processing flows quantitatively and we will present our conclusions here.
A couple of influential publications that have followed a similar approach to what we aim to achieve are Angiani et al. [32] and Jianqiang et al. [30]. These publications are included in Table 1 and have stated the sequences of pre-processing components that they have examined and the order in which they have examined them. Table 2 displays these sequences as pre-processing flows. The first row of Table 2 indicates the datasets used by the corresponding authors to test their approaches. The remaining rows in Table 2 display the actual steps included in the pre-processing flows. As explained before, researchers have recommended the use of text pre-processing techniques before performing sentiment analysis; an example of this can be found in Fornacciari et al. [33]. However, we are not only interested in using pre-processing techniques but also in comparing different pre-processing flows and identifying the best.
Pre-processing is often seen as a fundamental part of sentiment analysis [32], but it is rarely evaluated in detail. Consequently, we wanted to assess the effect of pre-processing on some off-the-shelf sentiment analysis classifiers that have become popular, namely VADER [34] and TextBlob [35]. To compare and contrast these off-the-shelf classifiers with other alternatives, we have implemented our own classifier, based on the naïve Bayes algorithm [21]. Additionally, we took advantage of this opportunity to examine some components which have not been researched broadly in the literature. For instance, most of the existing literature on English sentiment analysis refers to stemming as a pre-processing step (see, for instance, Angiani et al. [32]). However, we have also evaluated lemmatisation [36], which is a pre-processing alternative that has been frequently overlooked.
Despite all the recent NLP developments, determining the sentiment expressed in a piece of text remains a problem that has not been fully solved. Issues such as sarcasm or negation remain largely unsolved [37]. Our main contribution lies precisely in identifying pre-processing components that can pave the way to improving the state of the art.
Pre-processing flow of Angiani et al. [32]:
1. In order to keep only significant information, remove URLs, hashtags (for example, #happy) and user mentions (for example, @BarackObama).
2. Replace tabs and line breaks with a blank, and quotation marks with apostrophes.
3. Remove punctuation, except for apostrophes (because apostrophes are part of grammar constructs, such as the genitive).
4. Remove vowels repeated in sequence at least three times to normalise words. For example, the words cooooool and cool will become the same.
5. Replace sequences of a and h with a laugh tag (laughs are typically represented as sequences of the characters a and h).
6. Convert emoticons into corresponding tags. For example, convert :) into smile_happy. The list of emoticons is taken from Wikipedia [41].
7. Convert the text to lower case, and remove extra blank spaces.
8. Use PyEnchant [43] for the detection and correction of misspellings.
9. Replace insults with the tag bad_word.
10. Use the Iterated Lovins Stemmer [44] to reduce nouns, verbs, and adverbs which share the same radix.
11. Remove stop words.

Pre-processing flow of Jianqiang et al. [30]:
1. Replace negative mentions, that is, replace won't, can't, and n't with will not, cannot, and not, respectively.
2. Remove URLs. Most researchers consider that URLs do not carry any valuable information regarding the sentiment of a tweet.
3. Revert words that contain repeated characters to their original English forms. For example, revert cooooool to cool.
4. Remove numbers. In general, numbers are of no use when measuring sentiment.
5. Remove stop words. Multiple lists are available, but the classic Van Rijsbergen stop list [40] was selected by Jianqiang et al. [30].
6. Expand acronyms to the original words by using an acronym dictionary, such as the Internet &

Experimental Dataset
Our research was conducted using a dataset retrieved from Twitter. As stated above, this dataset has been made publicly available by Kaggle [20], a platform that hosts data science competitions for business problems, recruitment, and academic research. Our dataset was originally taken from Crowdflower's Data for Everyone Library, now known as the Datasets Resource Center [45]. It contains 13,871 tweets published in relation to the first Grand Old Party (GOP) debate held among the candidates of the Republican Party who were seeking nomination in the 2016 US Presidential election. There were 12 debates in total, but the one that corresponds to our experimental dataset is the first one.
The debate commented on by the tweets in our dataset took place in Cleveland, Ohio, on 6 August 2015, starting at 17:00 EDT and finishing at 21:00 EDT. However, the tweets were collected over a 17 h period, approximately, starting at 17:44:53 EDT and finishing the following day, 7 August 2015, at 10:12:32 EDT [45]. Figure 1 shows the number of tweets captured per hour. As can be seen, the largest number of tweets was published between 01:00 and 08:00, with two clear peaks at around 02:00 and 06:00. While the corpus does not seem to be evenly balanced across the time of collection, we took it exactly as it was published by Kaggle. The debate was seen on television by 24 million viewers, making it the most watched live broadcast for a non-sporting event in cable television history [46]. A group of contributors created the necessary metadata to annotate the tweets with information, such as how relevant the tweets were to the GOP debate, which candidates were mentioned in each tweet, what subjects were discussed, and what sentiment polarity value (positive, negative, or neutral) could be associated with each tweet [47]. A total of 61% of the tweets were classified as negative, 23% as positive, and 16% as neutral. Each tweet was associated with a confidence sentiment value, which is an indicator of the degree of reliance the contributors have regarding the overall sentiment expressed in the tweet. A total of 5370 tweets had a confidence sentiment value between 0.96 and 1; 6779 tweets between 0.59 and 0.72; and 1621 between 0.31 and 0.39. Figure 2 shows the proportion of positive, negative, and neutral tweets in the dataset which had a high confidence sentiment value, that is, between 0.96 and 1.
Predictably, the hashtag #GOPDebate, which is the hashtag employed to refer generically to the twelve presidential debates and nine forums that were held between the candidates for the Republican Party's nomination for the presidency in the 2016 US election, was the most frequent term in the dataset. The names and last names of the candidates (Donald Trump, Jeb Bush, Scott Walker, Mike Huckabee, Ben Carson, Ted Cruz, Marco Rubio, Rand Paul, Chris Christie, and John Kasich) were also among the most frequent terms. The television network, Fox News Channel, is constantly referred to in the dataset as well. Figure 3 displays a word cloud created with the terms found in the experimental dataset. Given that frequency is not the best indicator of relevance [48], we chose TF-IDF [15] to rank the terms. The higher the TF-IDF score of a term is, the larger the font size used to display it on the cloud. The cloud was produced using the Python WordCloud library [49].
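To illustrate why TF-IDF can rank terms differently from raw frequency, the following is a minimal sketch over a toy corpus. The function name tfidf_scores, the toy tweets, and the weighting itself (raw term count multiplied by log(N/df) and summed over documents) are assumptions made for illustration; the paper does not state which TF-IDF variant was used.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Return a dict mapping each term to its TF-IDF score summed over all docs.

    docs is a list of token lists. Assumed weighting: tf is the raw count of the
    term in a document, and idf is log(N / df), where df is the number of
    documents that contain the term.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))          # count each term once per document
    scores = Counter()
    for tokens in docs:
        for term, count in Counter(tokens).items():
            scores[term] += count * math.log(n_docs / df[term])
    return dict(scores)

# Toy corpus (invented for illustration only).
toy_tweets = [
    ["gopdebate", "trump", "wins"],
    ["gopdebate", "bush", "loses"],
    ["gopdebate", "trump", "trump"],
]
scores = tfidf_scores(toy_tweets)
```

In this toy corpus, gopdebate is the most frequent term, yet it receives a score of zero because it occurs in every document, whereas trump and bush receive positive scores.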

Pre-Processing Components
To assess the impact of pre-processing on the accuracy of Twitter sentiment classifiers, we implemented the following 12 pre-processing components. We will describe them first, and we will later discuss the order in which we execute them.
Lowercasing: We convert to lowercase all the characters in the dataset, as opposed to capitalising proper names, initials, and words at the start of the sentence. Capitalisation is complex in Twitter, because users tend to avoid capital letters, as this lends a tone of informality to the conversation that makes the messages more speech-like [50]. Therefore, lowercasing everything seems to be the most practical approach.
Removal of URLs and Twitter features: We remove URLs entirely, because we cannot estimate the sentiment of an online resource (web page, picture, or anything else associated with a URL). We also remove the hashtag character, but not the keywords or phrases comprised within the hashtags; for example, we remove the # character from #happy, but not the word happy, because it can have a sentiment-related connotation. We also remove the characters RT at the beginning of the tweets. RT is used to indicate the re-tweeting, or re-posting, of someone else's tweets.
Removal of unnecessary spaces: Identifying spaces between characters is critical, as spaces are considered word boundaries. Regrettably, splitting a character sequence where spaces occur can also split what should be regarded as a single "token". This happens commonly with names (for example, New York, The Netherlands, and Côte d'Ivoire) but also with a number of borrowed foreign phrases (for instance, au fait). Although word segmentation remains an issue, we handle it by removing spaces that only increase the length of the text without adding any meaningful value.
Removal of punctuation: Eliminating punctuation before performing sentiment analysis is common. However, some punctuation characters are related to emoticons which express sentiment; thus, their removal may reduce the accuracy of the classification. Indeed, punctuation sequences such as :), :D, ;), or <3 are references to emoticons that convey sentiment. As we want to convert the emoticons into words, we remove the punctuation after handling the emoticons. Other researchers have recommended the same approach to speed up the analysis and improve the performance; for instance, see Kim's work on dimensionality reduction [51].
Negation handling: Negations are words such as no, not yet, and never, which express the opposite meaning of words or phrases. For the purpose of sentiment analysis, it is common to replace a negation followed by a word with an antonym of the word. For example, the phrase not good is replaced with bad, which is an antonym of good. Consequently, a sentence such as the car is not good is transformed into the car is bad. However, certain negative words, such as no, not, and never, and negative contractions, such as mustn't, couldn't, and doesn't, are often part of stop-word lists. Thus, we replace all negative words and contractions with nnot, and then we correct this issue after stop-word removal as part of the misspelling correction [52].
Stop-word removal: Extremely common words which are of little value in matching an information need (such as prepositions, definite and indefinite articles, pronouns, and conjunctions) are known as stop-words [53]. These words are typically removed to reduce the amount of processing involved in the analysis [54]. The general strategy for constructing a stop-word list is to sort all the terms in the corpus by frequency and then add the most frequent terms, often hand-filtered for their semantic content, to a stop-word list. The terms in this list are then discarded from any further processing [55]. In the case of sentiment analysis, the utilisation of various stop-word lists is reported in the literature. We have chosen the list used by the NLTK library [56], which is a well-regarded text mining tool.
Emoticons and emojis translation: An emoticon is a representation of a facial expression, such as a smile or frown, formed by a combination of keyboard characters. An emoji is a small digital image or icon used to express an idea or emotion [57]. Previous studies have shown that emoticons and emojis play a role in both building sentiment lexicons and training classifiers for sentiment analysis [58,59]. Wang and Castanon [58] have concluded that emoticons are strong signals of sentiment polarity. Acknowledging this, we translate emoticons and emojis into their corresponding words using emot, the Open-Source Emoticons and Emoji Detection Library [60]. Then, :D, which translates into laughing, increases the positive score of a tweet, whereas :-(, which translates into a frown, increases the negative score.
Acronym and slang expansion: Computer-mediated communication has generated slang and acronyms often referred to as microtext [61]. Examples of microtext are expressions such as "c u 2morrow" (see you tomorrow), which are not found in standard English but are widely seen in short message service (SMS) texts, tweets, and Facebook updates. In addition, Palomino et al. [62] have explained how important it is to expand acronyms to reveal concealed messages with sentiment repercussions. Additionally, Satapathy et al. [61] have shown that acronym expansion improves sentiment analysis accuracy by 4%. Therefore, replacing microtext by conventional meanings is indispensable. We perform this via the SMS and Slang Translator dictionary [63].
Spelling correction: Language features are likely to be missed due to misspellings. Using tools that automatically correct misspellings enhances classification effectiveness [64]. Although no spelling corrector is perfect, some of them have demonstrated a reasonably good accuracy [52]. While Kim [51] made use of AutoMap [65], a text mining tool developed by CASOS at Carnegie Mellon, we favoured the Python library pyspellchecker [66], which employs the Levenshtein Distance algorithm [67].
Tokenisation: This is the process of splitting a piece of text into its parts, called tokens, while disposing of certain characters and sequences which are not considered useful [53]. We tokenised the tweets in the dataset using the TweetTokenizer from the NLTK package [56], which we are also using for stop-word removal.
Short-word removal: To avoid further noise, we remove all the words that consist of only one or two characters. Such words are unlikely to contribute to the analysis of sentiment and may even be typos.
Lemmatisation: There are families of derivationally related words with similar meanings, such as president, presidency, and presidential. The goal of lemmatisation is to reduce inflectional and derivational forms of a word to a common base [53]. Lemmatisation employs vocabulary and morphological analysis to remove inflectional endings and return the base or dictionary form of a word, namely the lemma. If lemmatisation encountered the word saw, it would attempt to return either see or saw depending on whether it was used as a verb or a noun. Sentiment analysis has used lemmatisation in other languages, for example, Indonesian [68] and Vietnamese [69]. However, in the English language, lemmatisation has been mostly used in classification studies, where it has improved the performance, as indicated by Haynes et al. [70]. Thus, we opted to test the use of lemmatisation in sentiment analysis, too.
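Several of the components described above can be sketched with the Python standard library alone. The following is an illustrative sketch, not the implementation used in our experiments: the lookup tables EMOTICONS and SLANG are hypothetical, tiny stand-ins for the emot library and the SMS and Slang Translator dictionary; whitespace splitting stands in for the NLTK TweetTokenizer; the order of the steps inside preprocess is an assumption; and stop-word removal, spelling correction, and lemmatisation are omitted because they depend on external resources.

```python
import re

# Hypothetical, deliberately tiny lookup tables (assumptions for illustration).
EMOTICONS = {":)": "happy", ":D": "laughing", ":-(": "frown"}
SLANG = {"c": "see", "u": "you", "2morrow": "tomorrow"}
# Negative words and negative contractions such as can't or doesn't.
NEGATION = re.compile(r"\b(?:no|not|never|cannot)\b|\b\w+n't\b")

def preprocess(tweet):
    """Clean one tweet and return its tokens (the step order is an assumption)."""
    text = re.sub(r"^RT\s+", "", tweet)             # drop the re-tweet marker
    text = re.sub(r"https?://\S+", " ", text)       # remove URLs entirely
    text = text.replace("#", "")                    # keep the hashtag word, drop '#'
    for emoticon, word in EMOTICONS.items():        # translate emoticons first,
        text = text.replace(emoticon, f" {word} ")  # while ':D' is still intact
    text = text.lower()                             # lowercasing
    text = NEGATION.sub("nnot", text)               # negations become nnot
    text = re.sub(r"[^\w\s]", " ", text)            # punctuation, after emoticons
    tokens = text.split()                           # split on whitespace runs
    tokens = [SLANG.get(t, t) for t in tokens]      # slang and acronym expansion
    return [t for t in tokens if len(t) > 2]        # short-word removal

tokens = preprocess("RT Can't wait 2 c u 2morrow :D #Happy https://t.co/abc")
```

For the example tweet, tokens is ['nnot', 'wait', 'see', 'you', 'tomorrow', 'laughing', 'happy']. Note that slang expansion runs before short-word removal in this sketch; otherwise c and u would be discarded before they could be expanded.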

Pre-Processing Flows
Having described the pre-processing components that we implemented, we now discuss the order in which we executed them. We created five pre-processing flows, which are listed below. Each pre-processing flow is a linear sequence of pre-processing components that are applied to the input dataset in a specific order.
Flow 1: Our first pre-processing flow is named the Reference Flow, because it was created as a reference, so that we can compare it with the rest. Figure 4 shows the full list of components contained in this flow and their corresponding order.
Flow 2: The second flow, which is shown in Figure 5, includes the same components as the Reference Flow, but in a different order. After reading the dataset, the first three components of Flow 1 and Flow 2, as well as the last three, are executed in the same order. However, the components for pre-processing emoticons, emojis, acronyms, slang, negation handling, and stop words have been rearranged.
Flow 3: The third flow, which is shown in Figure 6, is the same as Flow 2, but without lemmatisation. As explained above, lemmatisation identifies families of derivationally related words with similar meanings, such as democracy, democratic, and democratisation. While it would be useful for a retrieval system to search for one of these words to return documents that contain any other word in the family, the sentiment conveyed in a piece of text is unlikely to change due to the derivational forms. Hence, lemmatisation may not have a strong contribution to accuracy; thus, we tested the effectiveness of the flow without spending any time on lemmatisation.
Flow 4: The fourth flow is the same as Flow 2, but without both lemmatisation and spelling correction (see Figure 7).
Flow 5: The fifth flow is the same as Flow 4, but without stop-word removal (see Figure 8).
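A flow can be modelled as an ordered list of functions that is folded over the input text. The sketch below is only an illustration of why the order of the components can matter; flow_a and flow_b are hypothetical and do not correspond to Flows 1 to 5, whose exact order is given in Figures 4 to 8, and the two-entry emoticon table is an assumption.

```python
import re
from functools import reduce

def translate_emoticons(text):
    # Hypothetical two-entry table standing in for a real emoticon dictionary.
    for emoticon, word in {":)": "happy", ":-(": "frown"}.items():
        text = text.replace(emoticon, f" {word} ")
    return text

def remove_punctuation(text):
    return re.sub(r"[^\w\s]", " ", text)

def squeeze_spaces(text):
    return re.sub(r"\s+", " ", text).strip()

def run_flow(flow, text):
    """Apply each component of the flow to the text, in the order listed."""
    return reduce(lambda acc, component: component(acc), flow, text)

# Same components, different order.
flow_a = [translate_emoticons, remove_punctuation, squeeze_spaces]
flow_b = [remove_punctuation, translate_emoticons, squeeze_spaces]

tweet = "great debate :) but a poor moderator :-("
result_a = run_flow(flow_a, tweet)
result_b = run_flow(flow_b, tweet)
```

Here result_a keeps the sentiment carried by the emoticons ("great debate happy but a poor moderator frown"), whereas result_b loses it ("great debate but a poor moderator"), because the punctuation that forms the emoticons is removed before they can be translated.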

Sentiment Classifiers
We chose three sentiment classifiers to assess the effectiveness of the different pre-processing flows. These classifiers are VADER [34], TextBlob [71], and a naïve Bayes classifier. The first two are off-the-shelf tools which are described below.

TextBlob
TextBlob is a Python library for processing text. It offers an API to perform natural language processing (NLP) tasks, such as noun phrase extraction, language translation, and spelling correction [35]. While the NLTK is one of the most commonly used Python libraries for NLP, we favoured the selection of TextBlob because it is simpler and more user-friendly than the NLTK [72]. As far as sentiment analysis is concerned, TextBlob provides two options for detecting the polarity of text: PatternAnalyzer, which is based on the data mining Pattern library developed by the Centre for Computational Linguistics and Psycholinguistics (CLiPS) [73], and NaiveBayesAnalyzer, which is a classifier trained on movie reviews [74].
The default option for sentiment analysis in TextBlob is PatternAnalyzer, and that is precisely the option that we used, because we are not working with movie reviews. We may consider the use of the NaiveBayesAnalyzer in the future, provided that we can train it suitably for the domain of our experimental dataset.

VADER
The Valence Aware Dictionary and sEntiment Reasoner (VADER) is a rule-based Python tool specifically designed to identify sentiments expressed in social media [75]. We chose VADER for two reasons: it is fast and computationally economical [72], and its lexicon and rules are publicly available [75]. The developers of VADER have built a public list of lexical features, which combine grammatical rules and syntactical conventions for expressing and emphasising sentiment intensity.

Naïve Bayes Classifier
Naïve Bayes classifiers are based on Bayes' theorem with the "naïve" assumption of conditional independence between every pair of features. Naïve Bayes has proved useful in many real-world situations (for instance, spam filtering [76]) and requires a small amount of training data to estimate the necessary parameters.
Generally, naïve Bayes performs its tasks faster than other more sophisticated methods [77]. In our case, we used it to implement sentiment analysis as a two-category classification question: positive or negative? Neutral tweets were removed.
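For readers who wish to reproduce a comparable classifier, the following is a minimal sketch of a two-class naïve Bayes classifier over token lists. It assumes a multinomial model with add-one (Laplace) smoothing, which is a common choice but is not specified above; the class name NaiveBayes and the toy training data are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (an assumed variant)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # word counts per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            n_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for w in tokens:
                if w in self.vocab:                # ignore unseen words
                    count = self.word_counts[label][w]
                    score += math.log((count + 1) / (n_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data, invented for illustration.
train_docs = [["great", "debate"], ["love", "trump"],
              ["terrible", "debate"], ["nnot", "good"]]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict(["great", "love"])
```

Working with log-probabilities avoids numerical underflow when many word likelihoods are multiplied, and the smoothing prevents a single word that was never seen with a class from reducing the score of that class to zero.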

Results
Our experiments ensured that the entire dataset was processed by each of the three classifiers described above. However, we also tested the classifiers separately with the part of the dataset for which we had a high confidence sentiment value, because we had a higher expectation that this part of the dataset would work as a gold standard, that is, a collection of references against which the classifiers can be compared.
All the flows were assessed by determining the sentiment using the three classifiers on both the raw data (that is, the dataset without any pre-processing) and the pre-processed data (that is, the dataset processed by the different combinations of the components that constitute the proposed flows).
After determining the sentiment using the classifiers, we evaluated their accuracy. Table 3 shows the accuracy of the classifiers when tested with the entire dataset: first without pre-processing (raw data) and then with each of the flows. Table 3, as well as the rest of the tables in this section, shows in bold font the entry that corresponds to the highest accuracy achieved by each of the classifiers. For instance, TextBlob achieves its highest accuracy when using Flow 4; thus, the accuracy that corresponds to the combination of TextBlob and Flow 4 is shown using bold font.
Table 4 shows the accuracy of the classifiers when tested with the gold standard, that is, the part of the dataset for which we have a high confidence value. First, we tested the classifiers without any pre-processing and then with each of the pre-processing flows.
According to Tables 3 and 4, there is at least one flow that increases the accuracy of each of the three classifiers evaluated, which demonstrates the importance and benefits of pre-processing. From Tables 3 and 4, we can confirm that Flow 4 offers the greatest advantages for two of the three classifiers. TextBlob achieves its greatest accuracy with Flow 4 when tested with both the entire dataset and the gold standard. Similarly, naïve Bayes achieves its greatest accuracy with Flow 4 when tested with the entire dataset and the gold standard. Recall that Flow 4 includes all the pre-processing components listed in Section 3.2, except for spelling correction and lemmatisation.
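The two evaluation settings, the entire dataset and the gold standard, can be sketched as follows. The record fields (sentiment, confidence, predicted) and the toy values are hypothetical and are not taken from our dataset; the threshold of 0.96 mirrors the lower bound of the high-confidence range mentioned in Section 3.

```python
def accuracy(predicted, expected):
    """Fraction of predictions that match the annotated labels."""
    if not expected:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)

def gold_standard(tweets, threshold=0.96):
    """Keep only the tweets whose annotation confidence reaches the threshold."""
    return [t for t in tweets if t["confidence"] >= threshold]

# Toy records with hypothetical field names and invented values.
tweets = [
    {"sentiment": "positive", "confidence": 1.00, "predicted": "positive"},
    {"sentiment": "negative", "confidence": 0.97, "predicted": "positive"},
    {"sentiment": "negative", "confidence": 0.65, "predicted": "negative"},
    {"sentiment": "negative", "confidence": 0.35, "predicted": "negative"},
]

full_accuracy = accuracy([t["predicted"] for t in tweets],
                         [t["sentiment"] for t in tweets])
gold = gold_standard(tweets)
gold_accuracy = accuracy([t["predicted"] for t in gold],
                         [t["sentiment"] for t in gold])
```

With these toy records, full_accuracy is 0.75 and gold_accuracy is 0.5, which shows that the two settings can rank the same classifier differently.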
Given that naïve Bayes performed better than the other two classifiers in all cases, regardless of the pre-processing flow chosen, we opted to evaluate this classifier further. We started by testing naïve Bayes exclusively with negative tweets, then exclusively with positive tweets, and finally with all the tweets. The results of this additional evaluation are presented in Table 5. Note that the results of testing exclusively with negative tweets are shown in the row titled "Naïve-Bayes Negative", the results of testing exclusively with positive tweets are shown in the row titled "Naïve-Bayes Positive", and the results of testing with all the tweets are shown in the row titled "Naïve-Bayes Total". We also evaluated naïve Bayes when tested, separately, with the positive and negative tweets included in the gold standard. The results are presented in Table 6.
Overall, Flow 4 appears to be the best pre-processing option. It allows the naïve Bayes classifier to achieve its highest accuracy with the entire dataset, when tested exclusively with positive tweets and with all the tweets together (see Table 5). Additionally, it allows the naïve Bayes classifier to achieve its highest accuracy with the gold standard, when tested exclusively with positive tweets and with all the tweets together (see Table 6).
Our results can be summarised as follows:
The importance of pre-processing: For each of the three classifiers evaluated, there is at least one pre-processing flow that increases its accuracy, regardless of the dataset employed for testing (the entire dataset or the gold standard), which confirms the benefits of pre-processing.
Sensitivity: As can be seen in Tables 5 and 6, naïve Bayes positive achieves its worst accuracy overall without pre-processing. Additionally, naïve Bayes positive appears to be the most sensitive classifier to the pre-processing flows; that is, naïve Bayes positive is the classifier that improves its accuracy the most with the help of pre-processing. Indeed, Flow 4 enhances its accuracy significantly.
Insensitivity: Both VADER and TextBlob achieve similar performance, and they seem to be insensitive to pre-processing, in the sense that pre-processing does not increase their accuracy considerably. According to Tables 3 and 4, the variations in their accuracy, with and without any pre-processing, are minimal.
Pre-processing loss: Naïve Bayes negative achieves the highest accuracy overall when tested with the entire dataset and the gold standard. However, such a high accuracy is achieved without any pre-processing.
Figures 9 and 10 illustrate our results graphically.
The developers of VADER claim to have implemented a number of heuristics that people use to assess sentiment [75]. Such heuristics include, among others, pre-processing punctuation, capitalisation, degree modifiers (also called intensifiers, booster words, or degree adverbs) and dealing with the conjunction "but", which typically signals a shift in sentiment polarity, with the sentiment of the text following the conjunction being the dominant part [75]. Hutto and Gilbert use the sentence "The food here is great, but the service is horrible" to show an example of mixed sentiment, where the latter half of the sentence dictates the overall polarity [75].
In other words, VADER performs its own text pre-processing, and this is likely to interfere with our flows. For instance, if we removed all the occurrences of the word "but" from the dataset while using Flow 4, because "but" is a stop-word, we would be disturbing VADER's own pre-processing and, consequently, damaging its accuracy. Hence, VADER's accuracy is not improved by Flow 4 or any other flow where we remove stop-words. Unsurprisingly, VADER achieved its best performance with Flow 5, which is the only flow that does not remove stop-words. We can expect that TextBlob also performs its own pre-processing. After all, TextBlob and VADER are both off-the-shelf tools that implement solutions to the most common needs in sentiment analysis, without expecting their users to perform any pre-processing. This would explain why they are insensitive to our flows.
Tables 5 and 6 show some high accuracy values for the naïve Bayes negative and total classifiers, which highlight the possibility of overfitting [78]. Overfitting is a fundamental issue in supervised machine learning, which prevents algorithms from performing correctly against unseen data. We need assurance that our classifier is not picking up too much noise. Thus, we have applied cross-validation [79]. We have separated our dataset into k subsets (k = 5), so that each time we test the classifier, one of the subsets is used as the test set, and the other k − 1 subsets are put together to form the training set. Table 7 shows the accuracy values after applying the k-fold cross-validation. While the accuracy is reduced by a small percentage after cross-validation, Flow 4 remains the best pre-processing option for Naïve Bayes Total (see Table 7). Of course, we are considering further training, testing, and validation with other datasets as part of our future work.
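The k-fold procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the code used in our experiments: the function names are hypothetical, the folds are taken as contiguous blocks without shuffling or stratification, the per-fold accuracies are averaged, and a majority-class baseline stands in for the naïve Bayes classifier.

```python
from collections import Counter


def k_fold_splits(n_samples, k=5):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(samples, labels, train_fn, predict_fn, k=5):
    """Return the mean test accuracy over k folds (averaging is an assumption)."""
    accuracies = []
    for train_idx, test_idx in k_fold_splits(len(samples), k):
        model = train_fn([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Placeholder "classifier": always predicts the most frequent training label.
def train_majority(train_samples, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, sample):
    return model


# Hypothetical toy data: 12 tweets, 8 negative and 4 positive.
tweets = [f"tweet {i}" for i in range(12)]
sentiments = ["negative"] * 8 + ["positive"] * 4
mean_accuracy = cross_validate(tweets, sentiments, train_majority, predict_majority, k=5)
print(mean_accuracy)
```

Each sample is used for testing exactly once and for training k − 1 times, so the resulting figure reflects performance on data the classifier did not see during training.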

Conclusions
We have reviewed the available research on text pre-processing, focusing on how to improve the accuracy of sentiment analysis classifiers for social media. We have implemented several pre-processing components and evaluated various combinations of them. Our work has been tested with a collection of tweets obtained through Kaggle to quantitatively assess the accuracy improvements derived from pre-processing.
For each of the sentiment analysis classifiers evaluated in our study, there is at least one combination of pre-processing components that increases its accuracy, which confirms the importance and benefits of pre-processing. In the particular case of our naïve Bayes classifier, the experiments confirm that the order of the pre-processing components matters, and pre-processing can significantly improve its accuracy.
We have also discussed some challenges which require further research. Evidently, the accuracy of supervised learning classifiers depends heavily on the training data. Therefore, there is an opportunity to extend our work with new training samples. Our current results are promising and motivate further work.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Our experimental dataset came originally from Crowdflower's Data for Everyone library, available at https://appen.com/pre-labeled-datasets/, accessed on 29 July 2022. However, we employed the version provided by Kaggle, available at https://www.kaggle.com/datasets/crowdflower/first-gop-debate-twitter-sentiment, accessed on 29 July 2022. Kaggle's version is slightly reformatted from the original source and includes both a CSV file and an SQLite database.