Detecting Aggressiveness in Tweets: A Hybrid Model for Detecting Cyberbullying in the Spanish Language

Abstract: In recent years, the use of social networks has increased exponentially, which has led to a significant increase in cyberbullying. In Computer Science, research has addressed how to detect aggressiveness in texts, which is a prelude to detecting cyberbullying. Most of this work has been done for English language texts, mainly using Machine Learning (ML) approaches, Lexicon approaches to a lesser extent, and hybrid approaches in very few cases. Hybrid approaches combine Lexicons and Machine Learning algorithms, for example by counting the number of bad words in a sentence using a Lexicon of bad words and passing that count as an input feature to classification algorithms. This research aims to contribute to detecting aggressiveness in Spanish language texts by creating different models that combine the Lexicon and ML approaches. Twenty-two models that combine techniques and algorithms from both approaches are proposed. For their application, certain hyperparameters are adjusted on the training datasets of the corpora to obtain the best results on the test datasets. Three Spanish language corpora are used in the evaluation: Chilean, Mexican, and Chilean-Mexican corpora. The results indicate that hybrid models obtain the best results in the 3 corpora, ahead of the implemented models that do not use Lexicons. This shows that mixing approaches improves aggressiveness detection. Finally, a web application is developed that makes each model usable for classifying tweets, which allows the performance of the models to be evaluated with an external corpus and feedback on each model's predictions to be collected for future research. In addition, an API is available that can be integrated into technological tools for parental control, online plugins for writing analysis in social networks, and educational tools, among others.


Introduction
The growing use of social networks has provided a channel to unrestrictedly express feelings and opinions on a mass scale. However, one of the negative aspects is that this has caused an increase in harassment, the so-called cyberbullying, defined as the use of information and communication technologies, like e-mails, text messages from cell phones, social networks, to support the deliberate, repeated, and hostile behavior of an individual or group to harm others, through personal attacks, disclosure of confidential or fake information, among other aspects [1].
According to [2], between 2005 and 2018, there was an increase in cyberbullying cases in Latin America. This study detected the existence of a high percentage of related situations: between 3.5% and 58% of cyber-victims; and between 2.5% and 32% of cyberaggressors. Those involved are mainly men. In the particular case of Chile, according to [3] study led by PUCV with financing from MINEDUC, and under the agreement with

Table 1. Types of cyberbullying [6].
Type of Cyberbullying | Description
Flaming | Sending aggressive, rude, and vulgar messages, targeting one or more people, privately or in an online group.
Bullying | Repetitively sending aggressive, rude, and vulgar messages to a person.
Cyberstalking | Harassment that includes threats to harm or that is highly intimidating.
Denigration | Sending or publishing harmful, aggressive, fake, or cruel statements about one person to others.
Identity theft | Pretending to be another person and sending or publishing material to make them either look bad or to endanger them.
Outing and trickery | Sending or publishing material about a person that contains sensitive, private, or embarrassing information, including forwarding private messages or images. Tricking people to request embarrassing information that is then made public.
Exclusion | Actions that specifically and intentionally exclude a person from an online group.

Natural Language Processing (NLP) can provide important mechanisms to detect aggression in texts. NLP is an area of research and application that explores how computers can be used to understand and manipulate human expressions in text [7]. NLP draws on different areas, like computing and computer science, linguistics, math, artificial intelligence, and psychology, among others. In recent years, different techniques have become popular for identifying the emotions that the author of a comment or message wishes to transmit. Text subjectivity analysis is an NLP subcategory that deals with extracting and classifying the different emotions that the author of a text wants to transmit, and thereby obtaining valuable information to analyze its content. It provides different tools for detecting aggression in text written on social media. There is consensus on the benefits of detecting aggression early in messages sent by users, since this allows preventive measures to be taken and the consequences of cyberbullying to be avoided.
Given that most aggression detection works in texts prior to 2018 have been made for the English language, it seems to be important to focus this work on the analysis of aggression in texts written in the Spanish language. Using the previous work of the Universidad del Bio-Bio research group SoMos, the use of a hybrid approach based on lexicons and Machine Learning (ML) is proposed. Specifically, a hybrid model is proposed, and its results are compared with models that do not use lexicons in the detection of text aggression.
The rest of this article is organized as follows: the next section presents the background and related works on cyberbullying detection that use machine learning and lexicon-based approaches. Section 3 describes the methodology applied, including a detailed description of the resources used. Section 4 describes the models proposed for classifying aggressive texts. The implementation and performance evaluation of the proposed models are shown in Section 5, using a software specifically created for this purpose. Section 6 presents the discussion of the results achieved. Finally, the conclusions and lines of future works are presented.

Background
A review of the literature covered works that propose some aggression detection mechanism. As has already been mentioned, most of these works were applied to texts in English [8][9][10][11][12][13][14][15][16][17]. Regarding the approach used, more than 50% of the articles used ML techniques, using different corpora to train classifiers. The most commonly used algorithms are Naïve Bayes, Support Vector Machine and Random Forest, due to their widespread use in text classification. A smaller body of works combines, in one way or another, the ML approach with the use of lexicons, for example, by keeping predefined lists of bad words that, once detected, are used as features in ML [13,15,17]. In [17], the lexicon-based approach was used exclusively, including 9 bad words chosen by the authors for their high frequency in situations labeled as cyberbullying, and applying morphological analysis and information retrieval techniques to determine the degree of aggression.
A growing interest can be seen in applying aggression detection to Spanish texts. In [18], 3 corpora are created from Twitter: a small corpus (25,304 tweets), a medium corpus (229,801 tweets), and a big corpus (960,578 tweets). Each tweet is labeled automatically as "Presumed cyberbullying" or "Without cyberbullying", bearing in mind the "General insult inventory" [19], as well as Ecuadorian insults and the patterns detected in [20]. The model is created using solely ML algorithms (Naive Bayes, Support Vector Machine and Logistic Regression), and the feature vector is formed using the TF-IDF technique. On evaluating the model with the different corpora, the average accuracy is between 80% and 91%, with Support Vector Machine obtaining the best result when applied to the medium corpus, with a peak of 94% accuracy. In addition, a web application is implemented, where the percentage of cyberbullying can be evaluated in real time on Twitter under 3 scenarios: phrase analysis, analysis of a Twitter profile (bearing in mind the most recent tweets), and trend analysis.
In [21], a proposal that analyzes Peruvian phrases was presented. In this work, a Naive Bayes classifier is trained through the NLTK library [22] for Python, using a lexicon [23] and 595 manually labeled words. The Bag of Words method is used to represent the phrases, through a collection of bad words. The model provides the probability that a phrase contains bullying characteristics.
The workshop organized by IberEval has provided important room for aggression detection initiatives in Spanish texts [24]. In this event, the participants faced the challenge of proposing models that detect aggression in a corpus which, in its 2018 version, comprised 10,856 instances (7700 for training and 3153 for evaluation). The corpus gathered tweets of Mexican users, in Spanish. The participants proposed a variety of methodologies, which comprised content-based features (frequencies, scores, POS, specific elements of Twitter, etc.), as well as classical ML algorithms (Naïve Bayes, SVM, Logistic regression, etc.) and Neural Networks. The winning model, presented in [25], reached an average F-measure of 0.620 and an accuracy of 0.667. It uses a classifier based on a Support Vector Machine and two lexicons, and makes the final prediction through genetic programming. In 2019, the winning work extracted features using Word Embedding and n-grams, and used a Multilayer Perceptron to classify [26]. The winning team of the 2020 workshop used a classifier trained to predict aggression, with majority and weighted vote schemes [27]. The training is done by adjusting the model. Table 2 summarizes the works analyzed that perform aggression detection in Spanish, indicating the approach, corpus, algorithms, and features vector reported.
Appl. Sci. 2021, 11, 10706

Materials and Methods
Figure 1 describes the methodology used in this work. In the first stage, the literature was reviewed to identify the models recently applied to aggression detection in Spanish texts. Then, the corpora to be used in this work were chosen. After this, the hybrid classification models were defined and implemented, considering the literature review. In the next stage, the models were evaluated through experiments, measuring their performance with the metrics used in the literature. Finally, the results of the evaluation were analyzed, reaching conclusions on the hybrid models implemented.
Three corpora are used, which contain tweets in Spanish manually labeled as "aggressive" and "non-Aggressive". The first corpus consists of tweets by Chilean users and, for this reason, is called the Chilean corpus. It comprises the corpus prepared by [31], with 1470 tweets in the context of aggression against women, and the corpus prepared in [32], which has 1000 tweets. 41% of the total correspond to tweets labeled as "aggressive" and 59% as "non-Aggressive". The second corpus was created by [24]. It consists of tweets in Spanish collected in Mexico City, filtered to keep only the labeled instances. This corpus of 7332 tweets will be called the Mexican corpus, with 28.8% of the tweets labeled as "aggressive" and 71.2% as "non-Aggressive". The third corpus is the union of the previous two and will be called the Chilean-Mexican corpus. The merger is made with the goal of having a larger corpus, with tweets from different countries, to test the different models. This last corpus has a total of 9802 tweets, with 31.9% labeled "aggressive" and 68.1% as "non-Aggressive".
To train the models, 70% of the corpus instances were used, with the remaining 30% used to run the performance tests. Table 3 shows the number of instances that were set aside for training and testing in the 3 corpora used.

Proposed Aggressiveness Detection Models
The approaches differ mainly in how they represent the features vector of each tweet, which the ML algorithm receives as input. All the proposed approaches are implemented with 3 supervised Machine Learning classification algorithms: Support Vector Machine (SVM), Naïve Bayes (NB) and Random Forest (RF).
Hyperparameters and candidate values are defined in each approach, while the definitive values are chosen using the GridSearchCV algorithm, applied to the training datasets of each corpus. GridSearchCV creates an execution matrix where all the possible combinations of candidate values are evaluated, and the best combination is kept. The cross-validation technique is used to evaluate the performance of each execution, to minimize overfitting, and the metric used to choose the best combination is the F-measure. Figure 2 shows this process and how the final evaluation of the models is done.
The results are presented in the Experiments and Results section.
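The selection process described above can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the data are synthetic, the candidate values in `param_grid` are hypothetical (the actual grids are not listed in this section), and the stratified split, the 5 folds, and the binary `"f1"` scorer are choices made for the example rather than settings confirmed by the text.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a features matrix and aggressive/non-aggressive labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 70% for training, 30% held out for the final test (stratification is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Hypothetical candidate values: 3 x 2 = 6 combinations are evaluated.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Each combination is scored by cross-validated F-measure on the training set only.
search = GridSearchCV(SVC(), param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

best_params = search.best_params_
# The refitted best model is evaluated once on the untouched test set.
test_f1 = f1_score(y_test, search.predict(X_test))
```

Keeping the test split out of the search means the reported score is not influenced by the hyperparameter choice.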

TF-IDF Approach
This first approach is the simplest and uses the most traditional technique to obtain features of a text, called TF-IDF (Term frequency-Inverse document frequency) (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html, accessed on 5 November 2021). The purpose of creating this model is to use it as a baseline against which to compare the results of the rest of the models, which mix Lexicons with ML classifiers.
First, the text is preprocessed, depending on the definitive values of the defined hyperparameters. TF-IDF is then used to obtain the features vector of the text: it weights each word of a phrase by how often it occurs in that phrase, discounted by how many texts of the corpus contain it. After obtaining the features vector, the different ML classifiers are applied.
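A minimal sketch of this baseline, using the `TfidfVectorizer` class referenced above, is shown below. The four phrases and their labels are invented for illustration, and the choice of `MultinomialNB` and of the vectorizer options are assumptions, not the configuration selected in the experiments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy, invented examples (1 = aggressive, 0 = non-aggressive).
texts = [
    "eres un idiota insoportable",
    "callate idiota",
    "que lindo dia para pasear",
    "me gustan los gatos",
]
labels = [1, 1, 0, 0]

# Features vector: one row per text, one column per vocabulary term.
vectorizer = TfidfVectorizer(lowercase=True)
X = vectorizer.fit_transform(texts)

# The same vectorizer chained with a classifier, so raw text can be classified directly.
model = make_pipeline(TfidfVectorizer(lowercase=True), MultinomialNB())
model.fit(texts, labels)
prediction = model.predict(["idiota"])
```

The width of `X` equals the vocabulary size, which is why the vector length of this approach changes from corpus to corpus.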

Lexicon Approach
The second approach combines emotion analysis using Lexicons, which forms the features vector, with ML classifiers. The Lexicon used is the one proposed in [33], an affective Lexicon in Spanish, based on an enriched Lexicon, that records the emotion intensity of each word, as shown in the example in Table 4. This Lexicon covers so-called emotional words. Initially, the text is preprocessed: special characters (stress marks, punctuation, signs, etc.) are filtered, and stopwords are removed and each word is lemmatized, depending on the definitive values of the defined hyperparameters. Tokenization is done using spaCy (https://spacy.io/models/es, accessed on 5 November 2021) for Spanish. After preprocessing, the text is analyzed with the Lexicons to form the features vector of each phrase, which comprises 10 columns, detailed below.
Each of the first 8 columns holds the sum of the intensities of the phrase's words that appear in the Lexicon for the corresponding affective class. The affective classes proposed by [34] are considered (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust).
Column 9 holds the number of bad words (BW) found in the phrase divided by the number of words in the phrase.
Finally, column 10 holds the number of words in the phrase (NW). Table 5 shows an example of the features vector obtained for the offensive phrase "Oyyyyy feo culiao insoportable chucha nota esta cagao miedo" (Oi, ugly unbearable fucker, fuck, look they're fucking scared) to exemplify the process. The anger column has a value of 56 because, in the Lexicon for that affective class, the word "nota" (look) has an intensity of 10 and the word "miedo" (scared) an intensity of 46; adding these two intensities gives 56. The other affective class columns are computed in the same way. The column that holds the ratio between the number of bad words and the number of words in the phrase has the value 0.333, since 3 bad words are found in the defined Lexicon, "feo" (ugly), "culiao" (fucker), and "chucha" (fuck), out of a total of 9 words in the phrase.
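The worked example above can be reproduced with a short sketch. The function name and the two miniature dictionaries are hypothetical: they contain only the two anger intensities and the three bad words quoted in the text, so every other affective column comes out as zero here, which will not match the full Lexicon.

```python
AFFECTIVE_CLASSES = ["anger", "anticipation", "disgust", "fear",
                     "joy", "sadness", "surprise", "trust"]

# Hypothetical miniature lexicons holding only the values quoted in the example.
AFFECTIVE_LEXICON = {
    "nota": {"anger": 10},
    "miedo": {"anger": 46},
}
BAD_WORDS = {"feo", "culiao", "chucha"}


def lexicon_features(tokens):
    """Return the 10-column vector: 8 intensity sums, BW/NW ratio, NW."""
    sums = [
        sum(AFFECTIVE_LEXICON.get(token, {}).get(cls, 0) for token in tokens)
        for cls in AFFECTIVE_CLASSES
    ]
    n_words = len(tokens)
    n_bad = sum(1 for token in tokens if token in BAD_WORDS)
    ratio = n_bad / n_words if n_words else 0.0  # avoid dividing by zero on empty text
    return sums + [ratio, n_words]


tokens = "oyyyyy feo culiao insoportable chucha nota esta cagao miedo".split()
features = lexicon_features(tokens)
# features[0] is the anger sum (10 + 46), features[8] is 3 / 9, features[9] is 9.
```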

TF-IDF Lexicon Approach
This approach mixes the TF-IDF and Lexicon approaches: the features vector is a concatenation of the TF-IDF vector and the vector derived from the Lexicon analysis. Figure 3 shows the process this approach performs. The corpus is processed along two routes, one per approach, and each route preprocesses the corpus following its own previously defined hyperparameters. Finally, the two vectors are concatenated, as shown in Figure 4, and the ML algorithms are applied. The size of the TF-IDF vector depends on the vocabulary of each corpus, hence the final size of the vector depends on the corpus used for training.
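As a rough illustration of the concatenation, the sketch below stacks a sparse TF-IDF matrix and a dense 10-column Lexicon matrix side by side. The phrases and the Lexicon values are placeholders invented for the example, and the use of `scipy.sparse.hstack` is an assumption about one convenient way to do it, not a description of the original implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["feo culiao insoportable", "me gustan los gatos"]

# Route 1: TF-IDF vectors, whose width equals the corpus vocabulary size.
tfidf = TfidfVectorizer().fit_transform(texts)

# Route 2: placeholder Lexicon vectors (8 affective sums, BW/NW ratio, NW).
lexicon = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0, 2 / 3, 3],
    [0, 0, 0, 0, 0, 0, 0, 0, 0.0, 4],
], dtype=float)

# Column-wise concatenation keeps one row per text.
combined = hstack([tfidf, csr_matrix(lexicon)]).tocsr()
```

Keeping the result sparse avoids materializing the full TF-IDF matrix, which matters when the vocabulary is large.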

Word Embedding Approach
This approach represents the features vector using the Word Embedding technique. As with the TF-IDF approach, it is implemented as a basis for comparison with the approaches that include Lexicons. Word Embedding is a distributional semantics approach that represents the words of a phrase as vectors of real numbers. This representation has useful grouping properties, as it places semantically and syntactically similar words close together. For example, the words "dolphin" and "seal" are expected to be close, but "Paris" and "dolphin" are not, since there is no strong relationship between them. The words are thus represented as real-valued vectors, where each value captures a dimension of the meaning of the word, so semantically similar words have similar vectors. In other words, each dimension of the vectors represents a meaning, and the numerical value in each dimension captures how closely the word is associated with that meaning. The power of Word Embedding was shown in [35], which establishes this tool as highly effective in different Natural Language Processing tasks and presents a neural network architecture on which many current approaches are based.
First, just as in the previous approaches, the text is preprocessed: special characters (stress marks, punctuation, signs, among others) are filtered, and stopwords are removed and each word is lemmatized, depending on the definitive values of the defined hyperparameters. Then, the feature vector of each text is obtained as the sum of the Word Embedding vectors of the words present in the phrase. In this way, a single vector represents the entire text. After the sum is made, the resulting vector is standardized. Figure 5 shows, as an example, a vectorial representation of the phrase "me gustan los gatos" (I like cats), without standardizing it.
A pretrained Word Embedding model is used to obtain the features vector from the phrases. This was implemented with FastText and Skipgram [36] and was trained with 1.4 billion words, using the Spanish Billion Word Corpus [37]. Each vector has 300 dimensions; therefore, each text is represented with a vector of size 300. This vector is the input for the classification algorithms implemented.
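The sum-then-standardize step can be sketched with toy vectors. Everything numeric here is invented: the embeddings have 4 dimensions instead of 300, and their values are not taken from the pretrained model. Two further assumptions are made explicit in the code: "standardization" is interpreted as scaling to unit L2 norm, and words missing from the table contribute a zero vector.

```python
import numpy as np

DIM = 4  # toy size; the pretrained model described above uses 300 dimensions

# Invented embedding values for the words of the example phrase.
EMBEDDINGS = {
    "me": np.array([0.1, 0.0, -0.2, 0.3]),
    "gustan": np.array([0.4, 0.1, 0.0, -0.1]),
    "los": np.array([0.0, 0.2, 0.1, 0.0]),
    "gatos": np.array([-0.3, 0.5, 0.2, 0.1]),
}


def text_vector(tokens):
    """Sum the vectors of the known tokens, then scale the sum to unit L2 norm."""
    total = np.zeros(DIM)
    for token in tokens:
        # Assumption: unknown words are skipped by adding a zero vector.
        total += EMBEDDINGS.get(token, np.zeros(DIM))
    norm = np.linalg.norm(total)
    return total / norm if norm > 0 else total


raw_sum = sum(EMBEDDINGS[t] for t in "me gustan los gatos".split())
vec = text_vector("me gustan los gatos".split())
```

Whatever the length of the phrase, the result has the same dimensionality as a single word vector, which is what lets it feed a fixed-width classifier.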

WE_Lexicon Approach
This approach represents the features vector as a concatenation of the output vectors of Word Embedding and Lexicon approaches. Figure 6 shows the process performed, doing this following its previously defined hyperparameters. Finally, these two vectors are concatenated, as shown in Figure 7, to apply Machine Learning algorithms. The vector size is 310, 300 boxes for the Word Embedding vector, and 10 for the Lexicon analysis vector.  A pretrained Word Embedding model is used to obtain the features vector from the phrases. This was implemented with FastText and Skipgram [36] and was trained with 1.4 billion words, using the Spanish Billion Word Corpus [37]. Each vector has 300 dimensions; therefore, each text will be represented with a 300-size vector. This vector is received as input for the classification algorithms implemented.

In a Word Embedding, each value of a word's vector captures a dimension of the meaning of the word, so semantically similar words have similar vectors. In other words, each dimension of the vectors represents a meaning, and the numerical value in that dimension captures how closely the word is associated with that meaning. The power of Word Embedding was shown in [35], which establishes this tool as highly effective in different Natural Language Processing tasks and presents a neural network architecture that many current approaches are based upon.
First, just as in the previous approaches, the text is preprocessed: special characters (stress marks, punctuation, signs, among others) are filtered, and stopwords are eliminated and each word is lemmatized depending on the definitive values of the defined hyperparameters. Then, the feature vector of each text is computed as the sum of the Word Embedding vectors of the words present in the phrase, so that a single vector represents the entire text. After the sum is made, the resulting vector is standardized. Figure 5 shows, as an example, the vector representation of the phrase "me gustan los gatos" (I like cats), without standardization.
A pretrained Word Embedding model is used to obtain the features vector from the phrases. It was implemented with FastText and Skipgram [36] and trained on 1.4 billion words from the Spanish Billion Word Corpus [37]. Each vector has 300 dimensions; therefore, each text is represented with a 300-dimensional vector, which is the input for the classification algorithms implemented.
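The sum-and-standardize step described above can be sketched as follows. This is only an illustration under stated assumptions: the 3-dimensional embeddings are hypothetical toy values (the pretrained model uses 300 dimensions), out-of-vocabulary tokens are simply skipped, and the standardization is assumed to be a z-score over the dimensions of each text vector, since the text does not specify which standardization is applied.

```python
import math

# Hypothetical 3-dimensional embeddings for two lemmas; the pretrained
# model described in the text produces 300-dimensional vectors.
TOY_EMBEDDINGS = {
    "gustar": [0.2, -0.1, 0.4],
    "gato": [0.5, 0.3, -0.2],
}


def standardize(vector):
    """Z-score standardization (assumed): zero mean, unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    std = math.sqrt(variance)
    if std == 0.0:
        return [0.0] * len(vector)
    return [(x - mean) / std for x in vector]


def text_vector(tokens, embeddings, dim):
    """Sum the embeddings of the known tokens, then standardize the sum."""
    total = [0.0] * dim
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            # Assumption: tokens without an embedding are skipped.
            continue
        total = [t + v for t, v in zip(total, vector)]
    return standardize(total)


# "me gustan los gatos" after an assumed stopword removal and lemmatization.
features = text_vector(["gustar", "gato"], TOY_EMBEDDINGS, dim=3)
print(features)
```

Summing keeps the vector size fixed regardless of the length of the text, which is what allows every tweet to be fed to the same classifier.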

WE_Lexicon Approach
This approach represents the features vector as a concatenation of the output vectors of the Word Embedding and Lexicon approaches. Figure 6 shows the process performed, in which each approach is executed following its previously defined hyperparameters. Finally, the two vectors are concatenated, as shown in Figure 7, and the Machine Learning algorithms are applied to the result. The resulting vector has 310 values: 300 for the Word Embedding vector and 10 for the Lexicon analysis vector.
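As a minimal sketch of the concatenation, assuming both vectors are plain Python lists, the 310-value features vector could be built as shown below. The function name and the placeholder values are hypothetical; only the sizes (300 and 10) come from the text.

```python
WE_DIM = 300      # size of the Word Embedding vector (from the text)
LEXICON_DIM = 10  # size of the Lexicon analysis vector (from the text)


def we_lexicon_features(we_vector, lexicon_vector):
    """Concatenate the Word Embedding and Lexicon vectors of one text."""
    if len(we_vector) != WE_DIM:
        raise ValueError("expected a 300-value Word Embedding vector")
    if len(lexicon_vector) != LEXICON_DIM:
        raise ValueError("expected a 10-value Lexicon vector")
    return list(we_vector) + list(lexicon_vector)


# Placeholder inputs: real values would come from the two approaches.
we_vector = [0.0] * WE_DIM
lexicon_vector = [float(i) for i in range(LEXICON_DIM)]

features = we_lexicon_features(we_vector, lexicon_vector)
print(len(features))
```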

WE_Lexicon_TF-IDF Approach
As in the previous approach, the features vector is the concatenation of the Word Embedding and Lexicon vectors, but the TF-IDF vector is also added. Figure 8 shows the process used for this approach. Unlike the previous approach, the corpus now takes 3 paths to execute the 3 approaches with their preprocessing, following the previously defined hyperparameters. Finally, the 3 vectors are concatenated, as shown in Figure 9, and the Machine Learning classifier is applied. Because the TF-IDF representation is included, the vector size depends on the corpus and its vocabulary.
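The dependence of the vector size on the corpus can be illustrated with a small sketch. It assumes a plain TF-IDF weighting (term frequency times the logarithm of the inverse document frequency); library implementations such as scikit-learn's TfidfVectorizer apply smoothing and normalization by default, so their values differ. The toy corpus and the zero-filled WE_Lexicon vector are hypothetical.

```python
import math


def tfidf_vectors(tokenized_texts):
    """Return the vocabulary and one plain TF-IDF vector per text."""
    vocabulary = sorted({tok for text in tokenized_texts for tok in text})
    n_texts = len(tokenized_texts)
    doc_freq = {
        term: sum(1 for text in tokenized_texts if term in text)
        for term in vocabulary
    }
    vectors = []
    for text in tokenized_texts:
        row = []
        for term in vocabulary:
            tf = text.count(term) / len(text) if text else 0.0
            idf = math.log(n_texts / doc_freq[term])
            row.append(tf * idf)
        vectors.append(row)
    return vocabulary, vectors


# Hypothetical preprocessed corpus with a vocabulary of 4 terms.
corpus = [["gato", "bonito"], ["perro", "feo", "feo"], ["gato", "feo"]]
vocabulary, tfidf = tfidf_vectors(corpus)

# Placeholder for the 310-value WE_Lexicon vector of each text.
we_lexicon = [0.0] * 310
full_vectors = [we_lexicon + row for row in tfidf]
print(len(vocabulary), len(full_vectors[0]))
```

With this toy corpus, each text ends up with 310 values plus one value per vocabulary term; a larger corpus would yield a longer vector.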

Ensemble Approach
Below, four models implemented under the "Ensemble Learning" technique are described. Ensemble learning is the process of combining the decisions of several trained Machine Learning models to improve overall performance. From the decisions of the different models, a final prediction is made using a rule such as the majority vote. The reason for using Ensemble models is to reduce the generalization error of the prediction: the error drops when this technique is used, provided that the combined models are diverse and independent. The approach relies on the wisdom of the crowd to make a prediction. Although the Ensemble model contains multiple base models, it acts and behaves as a single model [38]. In the models developed, the final prediction is made using a majority vote, which is implemented with VotingClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html, accessed on 5 November 2021) from the scikit-learn library.
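The majority vote rule itself is simple, and a plain-Python sketch may help to fix ideas. This is not the scikit-learn implementation, only an illustration of the rule under the assumption of an odd number of base models and binary labels (1 for aggressive, 0 for non-aggressive), so that ties cannot occur. The predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine label predictions, given as one list of labels per model."""
    n_samples = len(predictions_per_model[0])
    final = []
    for i in range(n_samples):
        votes = Counter(preds[i] for preds in predictions_per_model)
        # The most voted label wins; ties are not handled in this sketch.
        final.append(votes.most_common(1)[0][0])
    return final


# Hypothetical predictions of three base models on four tweets.
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]

final = majority_vote([model_a, model_b, model_c])
print(final)
```

A single model can be wrong on a tweet without changing the final label, as long as the other two agree, which is why diversity among the base models matters.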

TF-IDF_Lexicon_E_Clfs Model
The first model created under this approach combines the three models implemented under the TF-IDF_Lexicon approach, as shown in Figure 10. Here, the corpus feeds 3 individually trained models, and a final prediction on the test corpus is then made using the majority vote technique. This model is created under the hypothesis that the combination of the 3 models that use the TF-IDF_Lexicon approach will provide better results than each one of them separately, as they use different classifiers. Each model uses the definitive hyperparameter values found previously using GridSearchCV on the training datasets of the corpora.
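The idea behind the hyperparameter search can be sketched as an exhaustive loop over every combination of candidate values. The sketch below is an assumption-laden illustration, not the GridSearchCV implementation: the grid, the parameter names (`remove_stopwords`, `lemmatize`, `C`) and the scoring function are hypothetical, and in a real search the score would be a cross-validated metric computed on the training dataset.

```python
import math
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in the grid and keep the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Made-up scoring rule, standing in for a cross-validated metric."""
    score = 0.0
    if params["remove_stopwords"]:
        score += 1.0
    if params["lemmatize"]:
        score += 1.0
    score -= abs(math.log10(params["C"]))
    return score


grid = {
    "remove_stopwords": [True, False],
    "lemmatize": [True, False],
    "C": [0.1, 1.0, 10.0],
}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The cost grows with the product of the number of candidate values, which is one reason to search on the training datasets once and reuse the definitive values in the Ensemble models.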

TF-IDF_Lexicon_E_SVM Model
This model combines the models created with the Support Vector Machine classifier in the TF-IDF, Lexicon and TF-IDF_Lexicon approaches, as shown in Figure 11. Just as in the previous model, the models are trained separately, and a final prediction on the test corpus is then made using the majority vote technique. It is implemented under the hypothesis that combining the different ways of obtaining the features vector can improve the final classification result. The Support Vector Machine classifier is used because it obtained the best performance in the preliminary tests. Each model uses the definitive hyperparameter values found beforehand using GridSearchCV on the training datasets of the corpora.

WE_Lexicon_TF-IDF_E_SVM Model
The third model created combines the models implemented with the Support Vector Machine classifier in the Word Embedding, WE_Lexicon and WE_Lexicon_TF-IDF approaches, as shown in Figure 12. The models are trained separately, and a final prediction on the test corpus is then made using the majority vote technique. This model, just as the previous one, is implemented under the hypothesis that combining the different ways of obtaining the features vector can improve the final classification result. The hyperparameter values used in each model were found beforehand using GridSearchCV on the training datasets of the corpora.

E_SVM_Approach Model
The final Ensemble model combines all the models implemented with the Support Vector Machine classifier in the different approaches, as shown in Figure 13. Just as in all the previous models, these are trained separately, and a final prediction on the test corpus is then made using the majority vote technique. The hypothesis behind this model is that a greater diversification of the ways of obtaining the features vector can improve the result. It is worth mentioning that this model is more costly in terms of memory and time to train and test. The hyperparameter values found beforehand for each model in the different corpora are used.

Implementations and Experimentation
A web application (http://35.247.212.145/, accessed on 28 October 2021) was developed to show the applicability of the proposed models and to evaluate their performance through experiments. It implements the proposed models to classify the aggressiveness of texts from the different corpora and of new texts written by a user. The application allows classifying comments, receiving feedback on the classification results, and building a base of labeled tweets for future research. It also allows evaluating the performance of a chosen model with suitably structured test corpora. The performance results are reported using the F-measure, Accuracy, Precision, and Recall metrics. For back-end development, the FastAPI (https://fastapi.tiangolo.com/, accessed on 5 November 2021) framework was used, which allows building APIs using the Python programming language. The front-end was implemented using the Vue.js (https://vuejs.org/, accessed on 5 November 2021) JavaScript framework, along with the Vuetify (https://vuetifyjs.com, accessed on 5 November 2021) user interface library for Vue.js. Docker (https://docs.docker.com, accessed on 5 November 2021) and Docker-compose (https://docs.docker.com/compose, accessed on 5 November 2021) were used for deployment. The web application code can be downloaded at https://gitlab.com/ManuelLepeF/lexicon_ml_agresividad_web (accessed on 28 October 2021). Figure 14 shows the user interface of the web application.

Description of the Experiments
Using this application, the implemented models were tested on the test datasets of the three corpora: Chilean, Mexican, and Chilean-Mexican. The datasets used in the experiments are available at https://gitlab.com/ManuelLepeF/lexicon_ml_agresividad (accessed on 5 November 2021). It is worth highlighting that these datasets were not used in the training process, as Figure 2 shows. In this way, the generalization capacity of the models was measured using the F-measure and Accuracy metrics. The hyperparameters used in each model were found using the GridSearchCV technique in the different training datasets of the corpora.
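For reference, the metrics mentioned above can be computed from the predictions on a test dataset as sketched below. The sketch assumes binary labels, with the aggressive class (1) as the positive class, and an F-measure defined as the harmonic mean of Precision and Recall; whether the reported F-measure is averaged over classes is not stated in this section. The labels and predictions are hypothetical.

```python
def evaluate(y_true, y_pred, positive=1):
    """Compute Accuracy, Precision, Recall and F-measure for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical labels and predictions for eight test tweets.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

results = evaluate(y_true, y_pred)
print(results)
```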
The experiments were made on a server that has the following hardware and software features.

Results
Table 6 shows the F-measure obtained by all the models described above in the 3 corpora used.
As a means of complementing the results, Table 7 shows the results obtained with the Accuracy metric in the different corpora used.

Discussion
Graphs are presented for each metric as a means of visually comparing the performance of the models of each approach in the different corpora, with the models shown in different colors. Figure 15 shows the F-measure of the models in the 3 corpora used. The model that obtains the best performance in this metric in the Chilean corpus is WE_Lexicon_SVM, with 0.8908. For the Mexican and the Chilean-Mexican corpora, it is WE_Lexicon_TF-IDF_SVM, with 0.8394 and 0.8507, respectively.

Figure 15. Comparison of F-measure obtained by the models.
Figure 16, in turn, shows that the model with the best performance in the Accuracy metric for the Chilean corpus is WE_Lexicon_SVM, with 0.892. For the Mexican and Chilean-Mexican corpora, it is the WE_Lexicon_TF-IDF_SVM model, with 0.8431 and 0.8548, respectively.
In general terms, the models behave similarly across the different metrics. The models generally perform better in the Chilean corpus, followed by the Chilean-Mexican one, and finally, the Mexican one. The graphs also show that the advantage of the hybrid models over the approaches that do not use Lexicons is largest in the Chilean corpus, followed by the Mexican one, and finally, the Chilean-Mexican one. As can be seen in Table 8, 8 hybrid models outperform the best model that does not use Lexicons in the Chilean corpus. In the Mexican corpus, 3 hybrid models obtain better performance than the best model that does not use Lexicons, as seen in Table 9. Finally, Table 10 shows that only one hybrid model outperforms the best model that does not use Lexicons in the Chilean-Mexican corpus.
Table 9. Hybrid models outperforming the best non-Lexicon model in the Mexican corpus.

Table 10. Hybrid model outperforming the best non-Lexicon model in the Chilean-Mexican corpus.
Model                                  F-Measure
WE_Lexicon_TF-IDF_SVM                  0.8507
TF-IDF_SVM (does not use Lexicons)     0.8424

As a summary, Table 11 shows the models that obtain the best results according to the F-measure and Accuracy metrics for each corpus used. For the Chilean corpus, this is the WE_Lexicon_SVM model, and for the Mexican and Chilean-Mexican corpora, it is WE_Lexicon_TF-IDF_SVM. In all three cases, the best model uses Word Embedding and Lexicons to extract the features of the texts, which shows that the best results are obtained by combining these techniques. The best result is obtained in the Chilean corpus, with a value of 0.89 for both F-measure and Accuracy, followed by the Chilean-Mexican corpus, and finally, the Mexican one. This can be explained by the lack of specifically Mexican words in the different Lexicons used, especially the bad words Lexicon.
Finally, the models with the best results use Support Vector Machine as the Machine Learning classifier. This reasserts that, of the three algorithms tested, it seems to be a good algorithm for text classification.
Table 12 compares the models that obtain the best F-measure in each corpus (Table 11) with the models of the base approaches that obtain the best results. The broadest difference is found in the Chilean corpus, followed by the Mexican one, and finally, the Chilean-Mexican one. It can also be seen that the difference is broader with the models that use the Word Embedding-based approach.

Conclusions
This article presented several hybrid models that combine the Lexicon and Machine Learning approaches to analyze emotions in user comments, specifically to detect aggressiveness in texts written in Spanish. Five approaches are proposed to create different models: Lexicon, TF-IDF_Lexicon, WE_Lexicon, WE_Lexicon_TF-IDF, and the Ensemble approach, which differ mainly in the way the feature vector is extracted from the text. The TF-IDF and Word Embedding approaches, which do not use Lexicons, are also implemented to compare them with the other models.
For each of the models created, the best hyperparameters are sought on the training dataset of each corpus using GridSearchCV; experiments are then run on the test datasets to compare the results of the models and select the best ones for each corpus. The models that obtained the best results use approaches that mix Word Embedding, Lexicons, and ML classifiers, outperforming the base models. The results indicate that hybrid models obtain the best results in the 3 corpora, over the implemented models that do not use Lexicons. This shows that, by mixing the approaches, aggressiveness detection improves. It is worth highlighting that hybrid models perform better in the Chilean corpus, because the Lexicons have better coverage of the Spanish words used in Chile than of those used in Mexico.
In addition, all the models that obtain the best results in the corpora use the Support Vector Machine as a classifier. Based on the experiments that were run, it can be reasserted that this is one of the best algorithms for aggressiveness classification compared to the other algorithms used.
Finally, a web application was created to show the applicability of the proposed models. It allows classifying tweets or comments, evaluating the implemented models, and receiving user feedback on the predictions of the models, which makes it possible to build a database for future research. It is worth mentioning that the back-end of the web application is implemented as an API, meaning it can be used by external services.
In future work, incorporating Mexican words into the different Lexicons used is considered, especially into the bad words one, to check whether the performance of the models on the Mexican corpus improves. Likewise, using different dictionary-type Lexicons is considered, as these include more words than the Lexicons used in this work. The intention is also to handle quantifiers, negations, and emojis in text preprocessing, as this work does not consider them. It is also considered important to incorporate other Ensemble models in the experimentation, using different ML classifiers, and to incorporate models based on neural networks, to classify and combine their results with Lexicon-based models.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.