Machine Learning and Deep Learning Sentiment Analysis Models: Case Study on the SENT-COVID Corpus of Tweets in Mexican Spanish

Abstract: This article presents a comprehensive evaluation of traditional machine learning and deep learning models in analyzing sentiment trends within the SENT-COVID Twitter corpus, curated during the COVID-19 pandemic. The corpus, filtered by COVID-19 related keywords and manually annotated for polarity, is a pivotal resource for conducting sentiment analysis experiments. Our study investigates various approaches, including classic vector-based systems such as word2vec, doc2vec, and diverse phrase modeling techniques, alongside Spanish pre-trained BERT models. We assess the performance of readily available sentiment analysis libraries for Python users, including TextBlob, VADER, and Pysentimiento. Additionally, we implement and evaluate traditional classification algorithms such as Logistic Regression, Naive Bayes, Support Vector Machines, and simple neural networks like Multilayer Perceptron. Throughout the research, we explore different dimensionality reduction techniques. This methodology enables a precise comparison among classification methods, with BETO-uncased achieving the highest accuracy of 0.73 on the test set. Our findings underscore the efficacy and applicability of traditional machine learning and deep learning models in analyzing sentiment trends within the context of low-resource Spanish language scenarios and emerging topics like COVID-19.


Introduction
Social media communication is crucial in all sectors of the population's life. Companies use social media to massively promote products and services, while people use them to transmit experiences and opinions. Natural Language Processing (NLP) and Text Mining have been of great interest in exploring this source of textual communication to generate information about mass behavior, thoughts, and emotions on a wide variety of topics, such as product reviews [1], political trends [2], and stock market sentiment [3]. During the Coronavirus pandemic, people expressed how they experienced the consequences of quarantine, the way it altered the daily rhythm of life, and how they changed their day-to-day activities.
Among the most used social media platforms during the pandemic was Twitter, which at the time functioned as a freely accessible universal microexpression tool. This made it an ideal platform to capture the population's feelings during this historic moment. Many studies have been presented that analyze various aspects of the epidemic, some of them on Twitter and mainly in English.
This article presents the work carried out to study the emotional impact of COVID-19 on the Mexican population. The MIOPERS platform responded to UNAM's initiative to develop models for the analysis and visualization of information that support strategic decision-making, especially during lockdown. During the pandemic, there were two main motivations for starting such work: (a) to evaluate people's behavior, moods, and the popularity of the measures taken by the government, and (b) to monitor users with possible symptoms.
This initiative, which covers the two years of the pandemic (2020-2022), allowed a compilation of many tweets related to COVID-19. This facilitated the study of topic-related lexicon, mentions, and hashtags, which in turn served as a basis for studying other important NLP topics, such as sentiment analysis.
This article focuses on developing a specific corpus for polarity analysis of COVID-19, the SENT-COVID corpus, taking a subset of the tweets collected by the MIOPERS system during the pandemic. Furthermore, polarity classification experiments are performed, applying both traditional ML and DL methods. To do this, the article follows the structure explained below. Related work is discussed in Section 2, especially on sentiment analysis in social networks or specifically oriented to the topic of COVID-19. Section 3 explains the compilation of the corpus, the annotation protocol, and the agreement results. The methodology followed to carry out the analysis is described in Section 4, including pre-processing, forms of text representation, and algorithms used. The results are presented and discussed in Section 5. The article ends with the conclusions in Section 6.

Related Work
Numerous toolkits are available to process textual data, which makes complex NLP tasks more accessible through user-friendly interfaces. In the context of sentiment analysis, several researchers have used libraries such as TextBlob, VADER, and Pysentimiento, among others. TextBlob and VADER have the advantage of not requiring training data, as they are lexicon-based approaches. Therefore, they have been popular tools for analyzing comments on social networks, such as tweets [4][5][6][7][8], YouTube [9][10][11][12][13], or Reddit [14][15][16][17] comments. Although the lexicon-based approach is suitable for general use, its main limitation lies in its difficulty adapting to changing contexts and linguistic uses [18]. Examples are texts such as tweets, which have a lively and casual tone [19]. In addition, if we look at those related to COVID-19, we find new terms associated with the phenomenon. Additionally, since TextBlob and VADER were designed mainly for English-language texts, they may not be as effective when used on texts in other languages. A toolkit for analyzing text sentiments and emotions in a wide range of languages is the Pysentimiento library, which offers support for multiple languages [20][21][22], including Spanish [23,24]. Furthermore, Pysentimiento uses state-of-the-art machine learning models, such as BERT (Bidirectional Encoder Representations from Transformers) models, for sentiment analysis. However, this requires more computing resources than TextBlob or VADER.
From the beginning of the quarantine period, several researchers studied social media information to measure people's feelings about their situation during the COVID-19 pandemic [25]. This has been done considering the language and domain of the comments posted on the different social platforms [26]. Many studies have used the TextBlob, VADER, and Pysentimiento tools for sentiment analysis on social networks [6,23,[27][28][29][30][31]. Moreover, machine learning approaches have been widely adopted to categorize sentiments into two (negative and positive) or three classes (positive, negative, and neutral). For example, Long Short-Term Memory (LSTM) recurrent neural networks have been applied to Reddit comments, achieving 81.15% accuracy [32]. Chunduri and Perera [33] have used advanced deep learning models, such as Spiking Neural Networks (SNN), for polarity-based classification. SNNs belong to what is known as brain-based computing and attempt to mimic the distinctive functionalities of the human brain in terms of energy efficiency, computational power, and robust learning. Although they report 100% accuracy with their model, their main claim is that SNNs have lower energy consumption than ANNs.
For public tweets related to COVID-19, the TClustVID model [34] was developed, achieving a high accuracy of 98.3%.
Researchers have also analyzed the performance of language models for sentiment analysis in Spanish. Specifically, for COVID-19 tweet polarity, Contreras et al. [35] found that pre-trained BERT models in Spanish (BETO), with domain adjustment, achieved a high accuracy of 97% in training and 81% in testing. Such performance was the best compared to multilingual BERT models and other classification methods such as Decision Trees, Support Vector Machines, Naive Bayes, and Logistic Regression.
Research has focused not only on creating computational models for text classification but also on annotated datasets, which help to train and evaluate models in supervised learning approaches. An example is COVIDSENTI [36], which consists of 90,000 COVID-19-related English-language tweets collected in the early stage of the pandemic, from February to March 2020. Each tweet has been labeled as positive, negative, or neutral. Furthermore, state-of-the-art BERT models have been applied to the data, obtaining a high precision of 98.3%.
For sentiment analysis, several corpora of annotated tweets related to COVID-19, mainly in English, have been released [36][37][38][39][40]. However, since the behavior of social media users also varies with language [41], having datasets in various languages besides English is crucial. Therefore, efforts have been made to compile multilingual corpora [42,43] as well as language-specific datasets such as Portuguese [44,45], Arabic [46,47], French [48], among others [49][50][51]. For the Spanish language, there are annotated tweet datasets for tasks such as hate speech detection [52], aggression detection [53], LGBT-phobia detection [54], and automatic stance detection [55], among others. However, to our knowledge, there is no manually annotated public corpus for the sentiment polarity of COVID-19-related tweets in Spanish, given that research tends to use an automatic labeling process, like the work by Contreras et al. mentioned above [35]. Therefore, we present a corpus with a manual labeling process and an annotation guideline. Furthermore, we provide an extensive analysis of the agreement between the annotators.

Data Collection and Annotation: The SENT-COVID Corpus
We collected COVID-19 tweets using the Twitter API in Python. The messages range from 1 April 2020 to the end of 2022. About 4,000,000 tweets were collected, including only messages labeled as written in Mexican Spanish. We also included tweets that were responses or retweets, i.e., the type and form of the tweet did not matter to the extraction and annotation process.
Once the data was obtained, we filtered the messages with a dictionary of appropriate terms, hashtags, and mentions depending on the development of the pandemic. In the first lexicon, the terms focused on different variants of the word COVID-19 (coronavirus, el virus, covid, lo del contagio) and symptoms (dolor de cabeza agudo, cuerpo cortado, diarrea, fiebre [leve], tos [seca], dolor de garganta, altas temperaturas). Regarding hashtags, many of them were government messages or slogans used to support its policies, such as #QuedateEnCasa, #TecuidasTúNosCuidamosTodos, #SusanaDistancia. This is shown in Table 1.
After applying the initial filter, our corpus was reduced to only 4986 tweets. We then removed 120 tweets that were not in Spanish, as well as those that contained fewer than three words, because they did not provide enough information to assign them any label. Therefore, our final corpus consists of 4799 tweets.

Annotation Protocol
We created an annotation guideline, summarized in this section, based on the polarity of sentiments. This describes how we labeled tweets and the criteria we used to categorize sentiments into three classes: positive, negative, and neutral. Each tweet in the corpus was manually assigned to one of the three categories.
We used as a reference Robert Plutchik's description of the eight primary emotions [56]: anger, fear, sadness, disgust, surprise, anticipation, trust, and joy. This allowed us to describe the polarity categories as follows.

POSITIVE TAGS. Positive tags are used to identify tweets that communicate joy/trust.
1. Predominance of pleasure or well-being:
• 'No se ustedes pero yo he sido muy feliz durante esta cuarentena'. [I don't know about you but I have been very happy during this quarantine].
• 'jaja Ana me acaba de alegrar la cuarentena'. [haha Ana just made my quarantine happy].

2. Cultivation of personal strengths and virtues that lead to happiness.

NEUTRAL TAGS. Neutral tags are used to identify tweets expressing some kind of doubt, without attacking something or someone:
• [Aren't parking meters supposed to be suspended due to the contingency?]
• 'Mateo cuando escucho la pregunta de la vacuna del jabón dijo: pero como de jabón? Que no sabe que en las vacunas van la mitad de los virus??' [When Mateo heard the question about the soap vaccine, he said: but how, of soap? Doesn't he know that vaccines contain half of the viruses??]

To annotate the corpus, we had a preliminary step. We designed an experiment to check the best way to proceed to categorize the messages in this type of corpus. Two students were asked to label a sample of 100 tweets with the tags positive, negative, or neutral, without any guidance, based solely on their own opinions.
At the same time, two more students were in charge of labeling the same messages following the guide that had been developed. The analysis of the results showed that the guide favors agreement between the annotators. Thus, we moved on to a second phase with new students, with the help of the guide. Therefore, the final tagging process involved three annotators who labeled the tweets according to our guide.

Data Statement/Annotators Data
We followed the guidelines specified by [57] to create this data statement.
A. Curation Rationale: We collected tweets from the widely used social media platform, Twitter, due to its convenience in acquiring concise statements from the general user population on diverse topics within a digital context. We used specific key terms and hashtags commonly used to refer to the pandemic.
B. Language variety: We systematically extracted a set of tweets by filtering for specific keywords and ensuring that they are in Spanish and geographically associated with the designated region (Mexico).
C. Tweet author demographic: The data is likely to come from a wide range of users with different characteristics such as age, gender, nationality, race, socioeconomic status, and educational background. This is because we collected the data using Twitter's data collection API, which is expected to have a diverse user base in Mexico.
D. Annotator demographic: We selected three annotators from the UNAM Language Engineering Group to label the tweets. All of them were undergraduate students from this university, between 20 and 25 years old, Spanish native speakers with Mexican nationality and residence.
E. Speech Situation: All tweets are about the pandemic. The years of extraction are 2020 to 2022.
F. Text characteristics: The tweets collected come from a pandemic context, so they followed a specific global trend. They could be a unique tweet or a response to another tweet. The limited length of tweets is an important factor to consider, as is the social media policy. All the data are public.
G. Recording Quality: We extracted the tweets from the Twitter API.
H. Ethical Statements: We collected all tweets for academic use according to Twitter's privacy policy.

Results of the Annotation Process
Interrater reliability [58] is a measurement of the extent to which data annotators (raters) assign the same score or label to the same variable. Frequently, this quantity is calculated with Cohen's kappa coefficient (κ). We assessed the agreement of the corpus annotators by using the Inter-Annotator Agreement (IAA) score, which is defined in terms of Cohen's kappa. The annotators who did not use the guide presented a Cohen's κ score of 0.178, a very slight agreement. In contrast, the annotators who used the guide presented a Cohen's κ score of 0.4369, a moderate agreement. Table 2 below shows the percent agreement and Cohen's κ score for each pair of annotators who did not use the guide, while Table 3 below shows the same for each pair of annotators who used the guide. Cohen's kappa is well suited for estimating the agreement between no more than two annotators. So, given the characteristics of our annotation process, where we have at least three annotators for each tweet, we used Fleiss' kappa to measure the agreement between the three annotators who used the guide. This analysis resulted in an overall agreement of 0.4369, reflecting a moderate inter-annotator agreement.
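As an illustration of how such pairwise agreement scores can be computed, the following is a minimal sketch using scikit-learn's cohen_kappa_score; the annotator labels are invented for the example and are not the actual corpus annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical polarity labels from two annotators over the same six tweets
annotator_a = ["pos", "neg", "neu", "neg", "pos", "neu"]
annotator_b = ["pos", "neg", "neg", "neg", "pos", "neu"]

# Kappa corrects the raw percent agreement (5/6 here) for chance agreement
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(round(kappa, 3))  # 0.75
```
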
Having three annotators for each tweet allowed us to identify the labels by majority vote. With three independent labels, in cases of disagreement we could seek agreement between two out of three annotators and set the repeated label as the definitive label in the final corpus. At the end of the annotation process, some tweets did not have an assigned label since the three annotators disagreed. For this reason, all the annotators together decided on the final label for the tweets without agreement.
Our final corpus consists of 4799 tweets, of which 1834 (38.21%) contain negative sentiments, 1126 (23.46%) contain positive sentiments, and 1839 (38.33%) contain neutral sentiments. Finally, Table 4 presented below shows the general statistics computed from word counts on each tweet of our corpus. The minimum number of words across all categories is three. We can ascertain the range of words in each category by computing the maximum. Tweets with a negative sentiment have the highest maximum number of words, while those with a positive sentiment have a lower range. The maximum count of words varies significantly between categories. However, the average number of words is quite similar. Additionally, on average, tweets contain a low number of words, which could explain why the standard deviation of the count is so high.

Sentiment Analysis Methods
Once the corpus has been annotated, the next step is to process the raw tweets into data that we can use for classification.This section outlines the sentiment analysis methods we evaluate to build a classification model on COVID-19-related tweets.
Figure 1 shows the workflow of our experimentation, starting with the preprocessing of the text data before feature extraction, which includes removing digits, separating words based on patterns, normalizing words, wrapping special tokens, transcribing emojis, lemmatizing, and removing stop words. The results of processing the tweets using these methods are presented, demonstrating the transformation from raw tweets to processed text. The feature extraction techniques we evaluated are the Bag of Words (BoW) model, Term Frequency-Inverse Document Frequency (Tf-Idf), word embedding, and phrase modeling. Once the features were extracted using the BoW or n-gram models, we used feature selection techniques (explained in Sections 5.1 and 5.2) to reduce the vector dimensions while retaining as much information as possible. The models and algorithms for text classification we evaluated include ready-to-use libraries such as TextBlob, VADER, and the Pysentimiento toolkit. With respect to the supervised learning models, we evaluated Logistic Regression, Naive Bayes classification, Support Vector Machines, and Multilayer Perceptron (MLP). Moreover, transformer networks are also explored. Finally, for evaluating the sentiment classification models, we used cross-validation for the traditional machine learning algorithms that required training; for the ready-to-use libraries, we used the entire dataset because no training is required; for the transformer models, we only did a single train-test split due to the computational cost of the experiments.
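The cross-validation setup used for the trained models can be sketched as follows; this is a minimal example assuming scikit-learn, with an invented toy corpus standing in for the processed tweets.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Invented mini-corpus standing in for the processed SENT-COVID tweets
tweets = ["me siento feliz hoy", "odio esta cuarentena", "hoy hay sol",
          "muy triste por la pandemia", "gran noticia de la vacuna",
          "dia normal sin novedades"] * 5
labels = ["pos", "neg", "neu", "neg", "pos", "neu"] * 5

# BoW features feeding a classifier, evaluated with 5-fold cross-validation
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, tweets, labels, cv=5)
print(scores.mean())  # mean accuracy over the five folds
```
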

Text Processing
To build a Bag-of-Words (BoW) representation (we consider bi-grams and other structures in Section 4.2), we compose a vocabulary of all unique words in the corpus following the process outlined below:
1. Remove digits, double blanks, and line breaks: Extracting information from digits presents a challenge since they can represent different things such as magnitudes, times and dates, directions, etc. Consequently, digits are intentionally omitted.
2. Separate words with patterns: Tweets contain misspellings or camel-case writing in hashtags. So, we identified common patterns (such as dots before a capital letter or symbols like '#' or '¿') and separated them using regular expressions.
3. Normalize words: We transformed text to lowercase and removed punctuation marks and other special characters. (Had we done this before the previous step, the splitting would not have worked properly.) In addition, consecutive repeated characters (usually used for laughs) were reduced to two repetitions to prevent the formation of new tokens for words already present in the vocabulary.
4. Wrap special tokens: Tweets often contain mentions, web links, and pictures. Thus, we replaced common objects with general labels. For example, mentions of users were wrapped with the token 'usuario' and web links with the token 'url'.
5. Transcribe emojis to words: We convert emojis into words, positioning them properly within the tweet, using the emoji Python module (https://pypi.org/project/emoji/, accessed on 20 February 2024). All of these are normalized to lowercase.
6. Lemmatize: We find each word's dictionary form (or lemma) using the Spacy library. This allowed us to reduce vocabulary size by avoiding multiple tokens for different inflections of the same word.
7. Remove stop words: Common words that do not carry semantic information, called 'stop words', are removed to reduce the vocabulary size.
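The regex-based steps above can be sketched as follows; lemmatization, stop-word removal, and emoji transcription with spaCy and the emoji package are omitted for brevity, and the example tweet is invented.

```python
import re

def preprocess(tweet: str) -> str:
    """Sketch of steps 1-4 plus character normalization; lemmatization,
    stop-word removal, and emoji transcription are omitted here."""
    text = re.sub(r"\d+", " ", tweet)                  # 1. remove digits
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)   # 2. split camel-case hashtags
    text = re.sub(r"https?://\S+", " url ", text)      # 4. wrap web links
    text = re.sub(r"@\w+", " usuario ", text)          # 4. wrap user mentions
    text = text.lower()                                # 3. lowercase
    text = re.sub(r"[^\w ]", " ", text)                # 3. drop punctuation and symbols
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)         # 3. cap repeated characters at two
    return re.sub(r"\s+", " ", text).strip()           # 1. collapse double blanks

print(preprocess("@Juan JAJAJA #QuedateEnCasa 2021 https://t.co/abc"))
# -> usuario jajaja quedate en casa url
```
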
In the following, we show the results of processing comments using the described method. Table 5 shows the original raw tweet in the first column and the processed text in the second column. After processing, the entire text is in lowercase. Links are changed to the token 'url', while mentioned users are replaced with 'usuario'. Furthermore, emojis are converted into words, and any repetition is avoided.

Feature Extraction
After processing the text, we investigated and experimented with different numerical data representations in the training stage.We compared some simple feature extraction techniques and algorithms to build a vocabulary from all words in the corpus.Also, we tried out some pre-trained embedding models, which provide a more complex yet more effective way of extracting text features.

Bag of Words Model
In the BoW model, each token in the text corresponds to a given dimension (feature) in a vector representation. Each token in a text has a weight that can be given by the number of occurrences of the word in the text (term frequency) or by multiplying that term frequency by a factor that penalizes words occurring in many documents of the corpus (Tf-Idf).

1. Term Frequency: To transform tweets into vector representations, we used the Scikit-learn library (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html?highlight=countvectorizer#sklearn.feature_extraction.text.CountVectorizer (accessed on 20 February 2024)). The size of the vector of each text is equal to the size of the vocabulary of the corpus, and, as mentioned before, the value of each dimension is the number of times the word appears in the tweet. We extracted different feature sets by modifying the following parameters:
• n_gram_range: It specifies whether each token is formed by single words or by n-grams of words (an n-gram is a contiguous sequence of n items from a given sequence of text); n-grams try to preserve the local ordering of words, but at the cost of greatly increasing the number of features.
• min_df: The minimum number of times a word must appear across all documents to be considered part of the vocabulary. Many words are present in the corpus with few appearances, so small variations of this parameter greatly increase the vocabulary size.
• stop_words: We used the stopword list provided by the Nltk library, which consists of 313 words.
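A minimal sketch of this feature extraction with scikit-learn's CountVectorizer, using the parameters discussed above on a few invented example texts:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented example texts
docs = ["quedate en casa", "quedate en casa por favor", "el virus sigue"]

# Unigrams plus bigrams; keep only terms appearing in at least 2 documents
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=2)
X = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))  # ['casa', 'en', 'en casa', 'quedate', 'quedate en']
print(X.shape)                         # (3, 5): one row per text, one column per term
```
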
Table 6 shows in the last column the vocabulary size extracted by modifying the parameters mentioned above.The first column indicates whether the text is lemmatized or not, the second column indicates whether stopwords are removed, the third column indicates the n_gram range, and the fourth column shows the value of the minimum document frequency.

2. Tf-Idf: The term frequency weighting treats all words as having the same importance. This can be improved by considering the significance of each term in the corpus in order to give each term a weight. A high Tf-Idf score indicates that a word is present in the tweet but not in many other tweets in the corpus, whereas a low Tf-Idf value implies that most tweets frequently use the word. This process emphasizes words that are significant to the tweet. Table 7 shows terms with the smallest and largest Tf-Idf values, using bigram normalized tokenization with stop words included and min_df = 2.

Word Embedding
Word embedding is a technique for representing words as vectors. The purpose is to reduce the high-dimensional word features to low-dimensional feature vectors while preserving context similarity.

1. Word2Vec: A popular natural language processing technique that represents words as dense vectors in a continuous vector space. It relies on the assumption that words with similar meanings often appear in similar contexts. Word2Vec uses a neural network to learn these word embeddings, capturing semantic relationships and similarities between words [59]. This allows for word representations that can be used in various NLP tasks like sentiment analysis, machine translation, and information retrieval. CBOW (Continuous Bag of Words) and Skip-gram are the two fundamental architectures in Word2Vec. CBOW aims to predict a target word based on its surrounding context words. Skip-gram predicts the context words given a target word. Both architectures contribute to creating meaningful word embeddings. In the Gensim library, we can specify whether to use CBOW or Skip-gram. We decided to use the Skip-gram technique since it performed better in our experiments.

2. Doc2Vec: Extends the principles of Word2Vec to generate fixed-length vector representations for entire documents, such as sentences, paragraphs, or even whole documents. Doc2Vec employs neural networks to learn these document embeddings while considering the context of words within the document. It assigns a unique vector to each document, capturing its semantic content and allowing for similarity comparisons between documents [60].

3. Phrase Modeling: The Gensim library offers phrase detection, similar to the n-gram representation. However, instead of getting all n-grams by sliding a window, it detects frequently used phrases and sticks them together. Hence, we integrated the vector representation of sentences to capture the collective meaning of a group of words, rather than merely aggregating the meanings of individual words. This process allows us to extract many reasonable phrases while keeping the vocabulary at a manageable size [59]. We built 2-gram and 3-gram models to detect and combine frequently used two- and three-word phrases within the corpus. After we obtained the corpus with the phrases, we applied the same Doc2Vec process previously used on unigram tokens. Thus, we present the results for each, using both the Distributed Bag of Words (DBOW) and Distributed Memory (DM) algorithms, as well as their combination. DBOW and DM are two distinct training algorithms used to generate vector representations of documents.
Table 8 shows the phrases detected by Gensim's phrase detection algorithm in a given tweet. We can see that the unigram model yields 13 tokens, while the 2-gram and 3-gram models yield 10 and 9 tokens, respectively. Phrases like 'no es justo' and 'quedate en casa' become a trigram, but the rest of the tokens remain as unigrams. This is because the algorithm only extracts the most significant n-grams.

Pre-Trained BERT Models
BERT models are considered state-of-the-art models for various NLP tasks that involve text representation. BERT possesses the significant benefit of supporting transfer learning. These language models have undergone extensive training over several days on powerful machines, using a large amount of text from platforms like Wikipedia and news websites. A pre-trained model can then be fine-tuned to align with our specific classification task.
We used some variants of BERT models in Spanish. These are particularly useful since they have been built for NLP tasks with high-dimensional analysis. Spanish models are hard to come by, and when available, they are frequently developed using substantial proprietary datasets and resources. As a result, the relevant algorithms and techniques are restricted to large technology corporations. However, a fundamental objective of these models is to promote openness by making them available as open-source resources. Examples of these models are:
1. BERTIN: A series of BERT-based models in Spanish; the current model hub points to the best of the RoBERTa-base models trained from scratch on the Spanish portion of mC4 using Flax, a neural network library ecosystem for JAX designed for flexibility [61].

2. ROBERTUITO: A language model for user-generated content in Spanish, trained following the RoBERTa guidelines on 500 million tweets. RoBERTuito comes in cased and uncased versions [62].

3. BETO: Another model trained on a large Spanish corpus. It is similar in size to BERT-Base and was trained with the whole-word masking technique, which outperforms some other models [63].
In recent years, there have been further advances in pre-trained BERT models [64]. As a result of their growing popularity, several lighter and faster versions of BERT (e.g., DistilBERT) have been made available to accelerate training and inference processes. However, such versions are lacking for languages other than English.

Ready-to-Use Libraries
We evaluated NLP libraries that do not need training machine learning-specific models:
• TextBlob: A Python library that allows users to perform various textual data processing tasks. For sentiment analysis, this tool uses a lexicon-based approach. It makes use of a vocabulary consisting of around 3000 words in English along with their corresponding scores. Thus, for a given text, the TextBlob sentiment analyzer returns two outputs. The polarity value belongs to [−1, 1], where −1 indicates a negative sentiment text and +1 a positive one. The subjectivity value ranges from 0 to 1, with 0 indicating an objective text and 1 a subjective text. Table 9 shows examples of how TextBlob works with different sentences in Spanish. Prior to analysis, these sentences were translated into English.
• VADER (Valence Aware Dictionary and sEntiment Reasoner): Similar to TextBlob, this tool uses a sentiment analyzer based on a lexicon. However, this tool is specifically tuned to the sentiments expressed in social media, since its lexicon (of approximately 9000 token features) includes slang and emoticons [65]. The words in the lexicon have a valence score that ranges from extremely negative (−4) to extremely positive (+4). VADER performs well on English texts due to the quality of its lexicon [66]. However, our texts are in Spanish, so we need a Spanish lexicon (and this resource is not easy to obtain specifically tailored to Latin American Spanish and social media) or to translate the tweets into English. We opted for the SentiSense lexicon [67], which contains a list of Spanish words classified according to their emotional connotation, with information about the intensity of the emotion transmitted by each word.

Machine Learning Algorithms
We also trained and evaluated supervised classification models with the SENT-COVID corpus.This was done under the hypothesis that the models trained with the corpus outperform the ready-to-use libraries.The algorithms we evaluated were: • Logistic Regression: A generalized linear model widely employed in machine learning applications for classification purposes.It is especially useful for text mining tasks because of its ability to handle large, sparse data sets with robust performance [69].Logistic regression is used to compress the output of a given set of data into discrete values to a categorical response value.As ŷ output is the probability that the input instance belongs to a certain class, we use a binary 'one vs. the-rest' model for each class.This is interpreted as the probability of being or not being within the class.Hence, three binary classifiers ['neg', 'neu', 'pos'] are created, which we trained with the following parameters using the scikit-learn library: 1.
Regularization: We use the 'L2' penalty (i.e., the usual Euclidean norm of the estimated coefficients, as in Ridge regression), which can be controlled using the 'C' parameter. Higher values of 'C' correspond to weaker regularization, allowing the model to prioritize fitting the training data. For lower values, the model instead prioritizes coefficients close to zero, even if this means a slightly worse fit to the training data.

2.
Multi_class: The training algorithm uses the one-vs-rest scheme if the 'multi_class' option is set to 'ovr', and uses the cross-entropy loss if the 'multi_class' option is set to 'multinomial'.

3.
Solver: The algorithm to use in the optimization problem. Only the 'newton-cg', 'sag', and 'lbfgs' solvers support L2 regularization with the primal formulation, which is what we selected for the penalty.

4.
Formulation: (Dual is only implemented for the L2 penalty with the 'liblinear' solver.) We prefer the primal formulation when the number of training instances exceeds the number of features, and the dual formulation otherwise.
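The one-vs-rest logistic regression setup described above can be sketched with scikit-learn as follows (a minimal illustration with toy sentences; the texts, labels, and C value are placeholders, not the corpus or tuned parameters):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

train_texts = ["me encanta esto", "odio esto", "es un dia normal",
               "muy feliz hoy", "que horror de dia", "nada nuevo que contar"]
train_labels = ["pos", "neg", "neu", "pos", "neg", "neu"]

X = CountVectorizer().fit_transform(train_texts)  # BoW features
# L2 (Ridge-like) penalty; smaller 'C' means stronger regularization.
# OneVsRestClassifier fits one binary classifier per label ('neg', 'neu', 'pos').
clf = OneVsRestClassifier(LogisticRegression(penalty="l2", C=1.0, solver="lbfgs"))
clf.fit(X, train_labels)
print(len(clf.estimators_))  # three binary classifiers, one per class
```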
• Naive Bayes: The Naive Bayes (NB) classifier is a probabilistic classifier based on Bayes's theorem with prior distributions. It is applied in problems such as spam filtering, text classification, and hybrid recommender systems [69]. The NB classifier has been shown to be optimal and efficient in many machine learning text classification tasks (especially under the assumption of independence between features given the document labels) [70,71]. For our needs, we chose Multinomial Naive Bayes, which is designed for discrete features (such as word counts in documents) and assumes that features follow a multinomial distribution. This makes the model particularly useful for sentiment analysis, since it handles text data modeled as word frequencies efficiently. The BoW model is used as the feature representation because it has been found to produce results comparable to those obtained with Support Vector Machines and logistic regression. The predicted label ŷ is the y that maximizes the probability of Y given X. We considered the following parameters for NB: 1. alpha: We conducted tests with the smoothing parameter, whose default value is 1.

2.
force_alpha: If set to 'False' and alpha is less than 10⁻¹⁰, alpha is set to 10⁻¹⁰; if 'True', alpha remains unchanged. In the latter case, an alpha too close to zero may cause numerical errors. Moreover, alpha = 0 with force_alpha = True means no smoothing.

3.
fit_prior: Whether to learn class prior probabilities or not. If False, a uniform prior is used.

4.
class_prior: The prior class probabilities of the model. If not specified, these priors are adjusted according to the data frequencies. This is useful for an imbalanced class distribution.
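A minimal Multinomial Naive Bayes sketch with scikit-learn, using toy word counts in place of the tweet BoW features (texts, labels, and parameter values are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["muy bueno", "muy malo", "ni bueno ni malo"]
labels = ["pos", "neg", "neu"]
vec = CountVectorizer()
X = vec.fit_transform(texts)  # word counts, the discrete features MNB expects

# alpha=1.0 is the default (Laplace) smoothing; fit_prior=True learns the
# class priors from the label frequencies.
nb = MultinomialNB(alpha=1.0, fit_prior=True)
nb.fit(X, labels)
print(nb.predict(vec.transform(["muy bueno"]))[0])  # → pos
```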
• Support Vector Machines (SVM): Various studies show that SVM outperforms other classification algorithms for text classification problems [72]. The SVM objective (in its primal form) is to find the decision surface that maximizes the margin between the data points of different classes. In the linear case, this classifier rewards the amount of separation between classes and applies a sign function to produce a categorical output. Consequently, we handle multi-class classification by creating a single linear binary classifier for each class. Once this criterion is defined as a decision rule, we define the decision boundaries and corresponding classification margins for each classifier. The resulting classifier is the linear support vector machine, characterized by the maximum margin of separation between points. The advantage of this method is that slight modifications to the data of a particular document will not alter the label the classifier assigns, making the approach more resistant to noise or perturbations. It also remains effective when the number of features exceeds the number of data instances, and it requires only a limited amount of training data to learn the decision function, making it memory efficient. We chose LinearSVC in scikit-learn rather than SVC: LinearSVC is implemented in terms of liblinear rather than libsvm, so the choice of penalties and loss functions is more flexible and it scales better to large numbers of samples. We tested the following parameters: 1.
Regularization: We applied the L2 penalty with 'C' = 1, as in logistic regression. The parameter determines the importance of correctly labeling individual documents: smaller values of 'C' (stronger regularization) indicate a greater tolerance for errors on individual documents.

2.
Kernel: A linear kernel usually works better when using text data, but we also tested polynomial kernels.

3.
Multi_class: We preferred learning a smaller number of classifiers, so we used one-versus-rest over one-versus-one.
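The linear SVM configuration above can be sketched as follows (toy BoW features; texts, labels, and C are placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

texts = ["excelente noticia", "terrible noticia", "una noticia cualquiera",
         "me gusta mucho", "no me gusta nada", "sin opinion al respecto"]
labels = ["pos", "neg", "neu", "pos", "neg", "neu"]
X = CountVectorizer().fit_transform(texts)

# L2 penalty with C=1, as above; LinearSVC handles the three classes with a
# one-vs-rest scheme, learning one weight vector per class.
svm = LinearSVC(penalty="l2", C=1.0)
svm.fit(X, labels)
print(svm.coef_.shape[0])  # → 3 (one linear classifier per class)
```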
• Multilayer Perceptron (MLP): One of the simplest neural network models. These networks have achieved remarkable results on various classification problems, from object classification in images to fast, accurate machine translation [73]. Their approach is similar to logistic regression but goes a step further by adding 'hidden layers' composed of 'hidden units'. Each layer applies a non-linear transformation (an activation function) to the input features. These functions adopt 'S-shaped' curves, and various forms of them were considered during training. Introducing the extra hidden layers makes the prediction model more complex than logistic regression, but the computational cost is greater, as predicting the response requires computing a distinct initial weighted sum of feature values for each hidden unit.

1.
Hidden layer sizes: A list with one element for each hidden layer giving the number of hidden units in that layer. We passed two values of 100 (two hidden layers, 100 units per layer).

2.
Activation function: The activation function for the hidden layers. Options include the logistic sigmoid, hyperbolic tangent, and rectified linear unit functions. In this work, we used the rectified linear unit function.

3.
Solver: The solver for weight optimization. 'lbfgs' is an optimizer in the family of quasi-Newton methods, 'sgd' refers to stochastic gradient descent, and 'adam' is the stochastic gradient-based optimizer proposed by Kingma and Ba [74]. The default solver 'adam' works well on relatively large datasets in terms of both training time and validation score.
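The MLP configuration above can be sketched with scikit-learn (toy random features stand in for the tweet representations; only the architecture mirrors the description):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((60, 10))          # stand-in features (e.g., a reduced BoW)
y = rng.integers(0, 3, 60)        # three sentiment classes

# Two hidden layers of 100 units each, ReLU activation, 'adam' solver,
# mirroring the configuration described above.
mlp = MLPClassifier(hidden_layer_sizes=(100, 100), activation="relu",
                    solver="adam", max_iter=300, random_state=0)
mlp.fit(X, y)
print(len(mlp.coefs_))  # → 3 weight matrices: input→h1, h1→h2, h2→output
```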

Transformers
With their self-attention mechanisms, transformer-based models have brought a paradigm shift in natural language processing (NLP) [75]. Unlike traditional recurrent or convolutional neural networks, transformers can process entire sequences of input tokens in parallel, which enables efficient computation of contextual representations. The multi-head self-attention mechanisms empower the model to weigh each token in the input sequence according to its contextual relevance to the other tokens. This allows transformers to effectively capture long-range dependencies and contextual information, making them an ideal choice for tasks that demand an understanding of complex linguistic patterns, such as sentiment analysis. To use transformer-based models for sentiment analysis in Spanish, one typically fine-tunes a pre-trained transformer model on a labeled dataset of Spanish text. Pre-training initializes the model's parameters with weights learned from a large corpus of Spanish text. During fine-tuning, the model adjusts its parameters to better capture the nuances of sentiment, leveraging the contextual information encoded in the transformer's self-attention mechanisms. Once fine-tuned, the model can predict the sentiment of new text inputs by feeding them through the network and interpreting the output probabilities or scores assigned to each sentiment class.
Transformer pre-trained models are computationally expensive to train, limiting our ability to run comprehensive cross-validation tests for score metrics or conduct extensive hyperparameter searches. The experiments were carried out with a single train-test split of the data, using 75% for training and 25% for testing. We created the neural network with a single hidden layer and a single output unit. Essential parameters such as input size, hidden units, output size, batch size, dropout, and learning rate were considered. We then randomly initialized the dummy input and output target tensors and, using built-in functions, created a simple sequential model with an output sigmoid layer. We defined the corresponding loss function, which computed the mean-squared error between the prediction and the target. For gradient descent, the 'torch.optim' package provides various optimization algorithms; we used a stochastic gradient descent (SGD) optimizer. Finally, we defined the training loop with the following steps:

•
Forward propagation: This computed the predicted label ŷ and calculated the current loss, which let us see how the model trains over each epoch (we considered 5 epochs).
• Backward propagation: After each epoch, we set the gradients to zero before starting the backward pass.
• Gradient descent: Finally, we updated the model parameters by calling the optimizer function.
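The training loop described above can be sketched in PyTorch (a minimal illustration with dummy tensors; the layer sizes and learning rate are arbitrary choices, not the exact values used in the experiments):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Dummy input and target tensors standing in for tweet features and labels.
X = torch.rand(32, 20)   # batch of 32 instances, 20 input features
y = torch.rand(32, 1)    # targets in [0, 1]

# Simple sequential model with a single hidden layer and a sigmoid output.
model = nn.Sequential(nn.Linear(20, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
loss_fn = nn.MSELoss()                                   # mean-squared error
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # SGD from torch.optim

for epoch in range(5):            # 5 epochs, as in the paper
    y_hat = model(X)              # forward propagation
    loss = loss_fn(y_hat, y)      # current loss
    optimizer.zero_grad()         # reset gradients before the backward pass
    loss.backward()               # backward propagation
    optimizer.step()              # gradient descent update
print(float(loss))
```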

Results of Sentiment Analysis Methods
We split the SENT-COVID corpus with a single random train-test division into approximately 3600 tweets for training and 1200 for testing (75-25%), using the same seed to preserve the same partitions across experiments. Table 11 shows the distribution of labels in each partition; both partitions maintain the same class distribution. We established a fundamental baseline for benchmarking performance among the different classification methods. For this, we used the Zero Rule (ZeroR), which predicts the most frequent class in the training dataset. If a model performs worse than ZeroR under the same conditions, the model is effectively useless. As shown in Table 11, the majority class is neutral, with 44%. This implies that a ZeroR classifier that consistently predicts the neutral class for each test data point would achieve an accuracy of 44%.
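The ZeroR baseline can be reproduced with scikit-learn's DummyClassifier, which always predicts the most frequent training class (the label distribution below is only a rough stand-in matching the 44% neutral majority):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Label distribution loosely mirroring the corpus (44% neutral majority).
y_train = np.array(["neu"] * 44 + ["neg"] * 30 + ["pos"] * 26)
X_train = np.zeros((len(y_train), 1))   # features are ignored by ZeroR

zeror = DummyClassifier(strategy="most_frequent")
zeror.fit(X_train, y_train)
print(zeror.score(X_train, y_train))  # → 0.44, the majority-class share
```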
The initial experiments were conducted using the BoW model to determine the best vocabulary size for the classification models.Subsequently, a second set of experiments was carried out using word embeddings, Doc2Vec models, and phrase modeling instead of the classical BoW model.The goal was to compare different text representations and their usefulness in machine learning algorithms.

Vocabulary Size Evaluation
Initially, we wanted to determine how many features are suitable for the model and to seek insights that could guide a simple criterion for feature selection.
However, since text feature selection is a broad topic, we do not address it in this paper. To preserve some of the context lost when using the BoW model, we considered n-grams and the removal of stop words (or just some of them, using frequency of use as a criterion). We thus tested different numbers of n-grams and stopwords, comparing them with a simple logistic regression and computing the accuracy on the test set for the different vocabulary sizes.
In addition to n-grams and stopwords, we compared the results produced by the TF and TF-IDF weighting schemes. These findings were then combined with the previous experiments, along with various text processing approaches such as lemmatization, stemming, and normalization applied to the tweets.
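These vocabulary experiments can be sketched with scikit-learn vectorizers (toy sentences; the ngram_range and max_features values are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["el virus llego a la ciudad", "la ciudad cerro por el virus"]

# Unigrams + bigrams via ngram_range; max_features caps the vocabulary size,
# the quantity varied in the experiments above.
tf = CountVectorizer(ngram_range=(1, 2), max_features=20)
X_tf = tf.fit_transform(texts)

# Same vocabulary settings with TF-IDF weighting instead of raw counts.
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=20)
X_tfidf = tfidf.fit_transform(texts)
print(X_tf.shape, X_tfidf.shape)
```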
As Figure 2a shows, prediction accuracy improves as more features are included in the model. Interestingly, removing the stopwords does not increase the accuracy of the model, even though these words do not carry semantic information. Figure 2b shows that, as features increase, unigrams (i.e., bag of words) work better than bigrams or trigrams. This means the n-grams were unable to capture the desired context, so single-word tokens seem to do the work better for sentiment classification.

Dimension Reduction Evaluation
When constructing a BoW model, we encounter a high number of features, necessitating a reduction in feature dimensions prior to their integration into learning models.We explored several feature selection methods for comparative analysis in our experiments.

χ 2 Feature Selection
The chi-squared (χ2) statistic measures the degree of dependence between a feature (here, a term within a tweet) and a class (the sentiment of the tweet, whether positive or negative). Through a contingency table showing the frequency distribution, we see the relationship between a term within a tweet and the class the tweet belongs to.
Initially, we assessed this method using the BoW model. This involved transforming the training data into TF vectors, followed by calculating the χ2 statistic between each feature and class. This score helps select the features with the highest values relative to the classes. We then used the χ2 statistic to determine which features were useful and presented our findings graphically to show which word features were important for prediction. For better visualization, only the top 20 features are shown in Figure 3. We then reduced the dimensions to different numbers of features and assessed the accuracy on the test set.
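This selection step can be sketched with scikit-learn (toy tweets and labels; k = 5 is illustrative, the experiments above vary it up to thousands of features):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

texts = ["muy feliz con la vacuna", "triste por el encierro",
         "hoy salio el sol", "feliz fin de semana", "que triste noticia"]
labels = ["pos", "neg", "neu", "pos", "neg"]
X = CountVectorizer().fit_transform(texts)  # TF vectors

# Keep the k features with the highest chi-squared score w.r.t. the labels.
selector = SelectKBest(chi2, k=5)
X_reduced = selector.fit_transform(X, labels)
print(X_reduced.shape)  # → (5, 5): five documents, five selected features
```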
Figure 3a shows the 20 most significant words identified, some of which are, surprisingly, considered stopwords. The plot in Figure 3b shows that better accuracy was achieved by selecting around 8000 features with the χ2 criterion than with the unconstrained BoW model. Despite the slight increase in accuracy, the main objective of dimension reduction was not completely achieved: fewer features do not necessarily yield better results. However, using 3000 χ2-selected features (accuracy of 0.56) gives better results than employing the 9000 most frequent features (accuracy of 0.55).

Truncated Singular Value Decomposition
Another method to reduce the dimension is Singular Value Decomposition (SVD). Contrary to the Principal Component Analysis (PCA) method, this estimator does not center the data before computing the singular value decomposition. In SVD, the term-document matrix is decomposed into three matrices, U, Σ (sigma), and V, retaining the top-k singular values and their corresponding columns of U and V [76]. These retained singular vectors effectively represent the most important features in the text data and can subsequently be used as a reduced feature set of n components.
In Figure 4, we observe that with 1000 components more than 90% of the variance is already accounted for, which is a considerable reduction. To determine whether these new features yield good predictions, we tested the accuracy as in previous experiments. Table 12 shows that the best results are obtained with around 2000 components.
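A minimal truncated-SVD sketch with scikit-learn (toy tweets; n_components = 3 is illustrative, far below the ~2000 used above):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["el virus se propaga", "la vacuna llega pronto",
         "el encierro continua", "la vacuna contra el virus",
         "se levanta el encierro"]
X = TfidfVectorizer().fit_transform(texts)  # term-document matrix

# Keep the top-k singular vectors as a reduced feature set; unlike PCA,
# TruncatedSVD does not center the (sparse) data first.
svd = TruncatedSVD(n_components=3, random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # → (5, 3): five documents, three components
```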

Document Embeddings Evaluation
To test this vector representation model, we obtained the embeddings using pre-trained vectors, specifically the Word2Vec embeddings from the Spanish Billion Word Corpora (https://crscardellino.github.io/SBWCE/ (accessed on 20 February 2024)). Employing these 300-dimensional vectors, we were able to represent a tweet as a vector more precisely through the Doc2Vec models.
Next, regarding phrase identification, we first constructed the 1-gram, 2-gram, and 3-gram representations of the tokens from all documents. With the Gensim library, we implemented Doc2Vec to learn the paragraph and document embeddings via distributed memory (DM) and distributed bag of words (DBOW). Specifically, for DM we used the Distributed Memory Concatenation (DMC) and Distributed Memory Mean (DMM) alternative training algorithms for document vector generation. DMC enhances the DM model by concatenating the document vector with the average of the context word vectors, aiming to capture both overall document semantics and specific word context. DMM, on the other hand, simplifies this process by directly averaging the context word vectors without concatenation. We then compared the resulting models by evaluating accuracy on the test set, testing these methods both separately and in combination.
Table 13 shows that the best results are obtained when combining the DBOW and DMM models. Although these results are not better than what we have obtained so far, it is remarkable that the representations using 2-grams and 3-grams were increasingly effective, which can potentially be a direction for further exploration.

Hyperparameter Tuning
Having obtained corpus representations suitable for the learning models, we shifted our focus to model selection and hyperparameter optimization to achieve the best performance and accuracy in our predictions. Our decisions were driven by optimizing accuracy. We therefore performed a 10-fold grid-search cross-validation with a repeated stratified k-fold to identify the optimal values within the range examined for each classification algorithm. For logistic regression, we explored regularization strengths (C) from 10⁻⁵ to 10² (i.e., from 0.00001 to 100) on a logarithmic scale. For the solvers, we considered {'newton-cg', 'lbfgs', 'liblinear'}, and for the penalty we experimented with {'none', 'l1', 'l2', 'elasticnet'}. The Naive Bayes model was tested with the smoothing parameter alpha taking twenty values from 0 to 1, with fit_prior both True and False. For SVM, we used the same set of regularization strengths as in logistic regression, and we conducted trials with the {'poly', 'rbf', 'sigmoid'} kernels with their default coefficients. Lastly, for our neural network architecture, we explored 1 to 5 hidden layers, tested the {'relu', 'tanh', 'logistic'} activation functions, and considered the {'lbfgs', 'sgd', 'adam'} solvers. Table 14 summarizes the values tested for the different models.
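The grid-search setup can be sketched with scikit-learn (toy random features stand in for the corpus representations; the reduced grid and n_repeats value are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

rng = np.random.default_rng(0)
X = rng.random((90, 5))
y = np.repeat(["neg", "neu", "pos"], 30)

# A small slice of the search space described above.
param_grid = {"C": [1e-3, 1e-1, 1e1], "solver": ["lbfgs", "newton-cg"]}
# 10-fold, repeated and stratified, as in the paper.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=200), param_grid,
                      cv=cv, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)
```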
Table 14 presents the parameters chosen as optimal for each supervised learning algorithm, using the unconstrained BoW model as the feature set. Regrettably, there is little variation in performance across the different penalty values and types, showing only slight enhancements. The same holds for the alpha smoothing parameter of the Naive Bayes model.

Sentiment Analysis Final Models
Finally, we fitted the supervised learning algorithms using the hyperparameters found with the grid search, using as features the unconstrained and reduced BoW representations, as well as Doc2Vec with DBOW+DMM on 3-grams. Table 15 reports the results of the obtained classification models in terms of accuracy, precision, recall, and the micro F1-score using 10-fold cross-validation. Additionally, the results of fitting a KNN model with k = 5 (which does not estimate coefficients) were included only as a reference point for the performance of the models we analyzed. This non-parametric model was included for empirical comparison purposes. Although we cannot affirm that the parametric models are more effective, initial observations suggest that they are more successful in this task.
For the black-box libraries, there is no need for matrix transformations or data splitting; the evaluation is done over the complete set because there is no algorithm to train. We only need to pre-process the corpus. This is especially important for libraries such as VADER, which needs to match as many words as possible to obtain better predictions. The same applies to Pysentimiento, since it is not recommended to use lemmatized words (in fact, light processing is recommended for this library). VADER has a low computational cost, while Pysentimiento requires a bit more time to compute the results. For the neural models, the dropout rate is the probability that a neuron is deactivated during training; in this case, it is set to 0.3, which means that each neuron has a 30% probability of being deactivated during each training step.
Fine-tuning the pre-trained models yielded the best results. However, the more epochs we use for training, the more overfitting we observe, as can be seen in Table 18, which shows the accuracy on the train and test sets when training for 3, 5, and 10 epochs. This increase in variance is not as easy to interpret as the training error of the previous models, where we could take a closer look at the error rates. Finally, Table 19 compares the best accuracy scores of the various sentiment models evaluated in our study. In particular, the table shows that BERT-based models perform better than rule-based or machine learning models. These levels of accuracy are also achieved by the ready-to-use Pysentimiento library. Among these, BETO-uncased is the most accurate model, with an accuracy of 73.26%. The performance of the BERT-based models may be due to the deep understanding of the context and nuances of Mexican Spanish discourse acquired through extensive pre-training on a variety of text corpora, which allows them to capture the expressions of sentiment and complex linguistic structures of COVID-19-related tweets. In contrast, rule-based models such as VADER and TextBlob exhibit the lowest accuracy scores in our analysis, reflecting their limited ability to adapt to the specific linguistic structure and domain.

Conclusions
This paper presents SENT-COVID, a Twitter corpus of COVID-19 in Mexican Spanish manually annotated with polarity.We have designed several classification experiments with this resource using ready-to-use libraries, classical machine learning methods, and deep learning approaches based on transformers.
In light of the temporal context surrounding the compilation and presentation of our corpus, it is crucial to emphasize its value in hindsight. While we acknowledge that the corpus's arrival may seem overdue, we firmly assert that it remains relevant to our understanding of linguistic patterns and public discourse. As a historical archive of Mexican Spanish tweets during the pandemic, our corpus offers unique insights into the evolution of societal responses, linguistic shifts, and sentiment fluctuations over time. Despite the availability of other resources, the retrospective nature of our corpus provides researchers with an invaluable opportunity to conduct comparative analyses, trace the trajectory of linguistic trends, and evaluate the enduring impact of COVID-19 discourse on societal norms and behaviors. Furthermore, we emphasize the corpus's potential to complement existing datasets and tools, enriching interdisciplinary research endeavors in fields such as linguistics, public health communication, and computational social science.
Given the experiments, we observe that, among the black-box libraries, neither TextBlob nor VADER demonstrated satisfactory performance, probably due to the difficulty of obtaining a suitable lexicon in Spanish. In contrast, Pysentimiento exhibits better performance because it employs machine learning models trained on large Spanish corpora to classify text into sentiment categories such as positive, negative, or neutral, and to detect emotions such as joy, anger, sadness, and fear with higher accuracy and contextual understanding. By leveraging machine learning techniques, Pysentimiento can capture the nuances of sentiment expressed in Spanish text more effectively, overcoming the limitations faced by lexicon-based approaches like TextBlob and VADER.
The supervised models revealed that, contrary to our initial expectations, removing common words is not as effective as we had thought. However, the models showed that including a broader range of features and observations improved performance without requiring too much computing power. The dimension reduction models managed to improve the prediction results with fewer features, so we can conclude that this is a viable alternative to tackle the problem, although there is still much to explore. Furthermore, the penalty parameter selection did not make a major difference as expected; neither Ridge nor Lasso regularization performed much differently from the default parameters.
The results of the Doc2Vec models did not meet expectations, as they could not outperform the basic BoW models. Additionally, training these models comes with a higher computational cost.
Finally, the pre-trained BERT models yielded the best results. However, they are the most expensive in terms of computational cost. It is also difficult to run different tests, since cross-validation is impractical, so the parameters and configuration settings must be chosen based on other criteria. Despite these challenges, for datasets that are not too large, pre-trained BERT models are the most suitable choice.
Informed Consent Statement: Not applicable.

Figure 3 .
Figure 3. (a) Most significant words given by χ2 and (b) accuracy on the test set for different numbers of features. We show results for the term frequency vector reduced by term frequency (solid line) and by χ2 (dashed line).

Table 1 .
Lexicon used to filter the COVID-19 related tweets for the corpus creation.

Table 2 .
Agreement score by the annotators of the classification of sentiments without a guide.

Table 3 .
Agreement scores by each pair of annotators of the classification with the guide.

Table 4 .
General statistics computed from word counts on each tweet.

Table 5 .
Original (raw) and processed versions of a sample of tweets.

Table 6 .
Vocabulary size by different parameter settings.

Table 7 .
Sorted features with smallest and largest Tf-Idf values.

Table 8 .
Phrase detection tokens yielded by each model.
to extremely negative [−4], with [0] representing neutral sentiment. These scores are determined based on the semantic orientation of the lexical features. Table 10 illustrates the functioning of VADER. The first column displays the input text. The 'compound' column shows the normalized sum of the valence scores of each word in the text: a value of −1 indicates a negative sentiment, while +1 indicates positive polarity. The columns 'neg', 'neu', and 'pos' indicate the percentage likelihood of the text belonging to the negative, neutral, or positive class, respectively. VADER has achieved good results [68]. • Pysentimiento Multilingual Toolkit: A very useful transformer-based library for Text Mining and Social NLP tasks such as sentiment analysis and hate speech detection [68]. For text classification in Spanish, this library uses the BETO (https://github.com/dccuchile/beto (accessed on 20 February 2024)) and RoBERTuito (https://github.com/pysentimiento/robertuito (accessed on 20 February 2024)) language models. Pysentimiento is trained with 'pos', 'neg', 'neu' labels on the 'TASS-2020 task-1' corpus (http://tass.sepln.org/2020/ (accessed on 20 February 2024)) merged with the Spanish subsets for each dialect, summing up to 6000 tweets.

Table 9 .
TextBlob outputs for different statements in Spanish.

Table 10 .
Vader outputs for different statements (in Spanish).

Table 11 .
Distribution of labels in the train and test partitions.

Table 12 .
Accuracy for n components.

Table 13 .
Test accuracy for Doc2Vec models.The best result is highlighted in bold.

Table 14 .
Optimal hyperparameter settings selected for each model based on the optimization of accuracy through grid search and cross-validation.

Table 17 .
Classification results of the Spanish BERT models.The best result is highlighted in bold.

Table 18 .
Results of an increasing number of epochs using BETO.The best result is highlighted in bold.

Table 19 .
Summary of the performance evaluated based on the accuracy of different sentiment analysis models on the SENT-COVID corpus.The best result is highlighted in bold.