Identifying Polarity in Tweets from an Imbalanced Dataset about Diseases and Vaccines Using a Meta-Model Based on Machine Learning Techniques

Abstract: Sentiment analysis is one of the hottest topics in the area of natural language processing. It has attracted huge interest from both the scientific and the industrial perspective. Identifying the sentiment expressed in a piece of text is a challenging task that several commercial tools have tried to address. In our aim of capturing the sentiment expressed in a set of tweets retrieved for a study about vaccines and diseases during the period 2015–2018, we found that some of the main commercial tools did not allow an accurate identification of the sentiment expressed in a tweet. For this reason, we aimed to create a meta-model which uses the results of the commercial tools to improve on the results of the tools individually. As part of this research, we had to deal with the problem of imbalanced data. This paper presents the main results in creating a meta-model from three commercial tools for the correct identification of sentiment in tweets by using different machine learning techniques and methods while dealing with the imbalanced data problem.


Introduction
Nowadays, information can be obtained from a vast number of sources, most of which are available on the Internet. It is well known that the Internet has become our main knowledge engine, redefining the way we communicate and gain understanding about the world around us. This implies that information of every kind and from every field can be publicly accessed just by typing a few words into our browser.
Among the wide range of knowledge areas, searches for information on the Internet related to healthcare are common. People can resolve health doubts and find data in a quick and easy manner. If a certain health issue concerns the population, such searches tend to increase. In the past, some studies about these web queries have corroborated that fact [1][2][3] and have been used in the pursuit of improving public healthcare strategies and social wellbeing. Some of these health issues can be related to the early detection of disease outbreaks [4][5][6][7], disease surveillance [8][9][10][11] or epidemic intelligence [5,12]; even in the current COVID-19 outbreak, infoveillance studies can aid in tackling the pandemic situation [13][14][15]. However, the diseases that have been screened under this type of analysis are not limited to COVID-19, as will be discussed later.
The Internet has allowed the emergence of new services and applications. Some of the most popular, and those with increasing use, can be grouped under the term "social media". Social media is a subject of study for a broad variety of domains related to computer science, since it can be seen as an important source of data to analyze users' feelings and opinions. Approaches that have taken advantage of the information present in social media have traditionally included political [16] and marketing campaigns [17][18][19][20] or financial predictions [21][22][23]. Lately, it has been proven that social media information can be relevant in the health scope too, by assessing epidemiological patterns [24], predicting epidemic outbreaks [25,26] or detecting drug side effects [27] and medication safety issues [28]. Moreover, concerns about vaccines and vaccination are widely expressed in social media [29,30]. With the expansion of anti-vaccine movements in recent years, the debate in social media has only grown and several conversations about vaccines have been monitored [31][32][33][34]. Special controversy has surrounded the human papillomavirus (HPV) vaccines [29,[35][36][37][38][39][40][41][42][43]. Besides, the diseases that have been studied under these types of Internet and social media mining approaches comprise largely Influenza [4,5,9,11,25,[44][45][46] but also others such as Zika virus disease [47,48], cholera [24], obesity [49] or diabetes [50].
Twitter, amongst all the possibilities in social media, has been explored in numerous works, as it is the perfect platform to share opinions that can be mined [17,19,22,23]. Sentiment analysis or opinion mining stands for natural language processing (NLP) methods that aim to computationally extract, interpret and classify emotions and subjective information from unstructured resources. The attitude of authors towards a topic in a text can be categorized as being positive, negative or neutral. In other words, we can assign a polarity to a text. The applications of these methods involve multiple domains (political science, social sciences, market research, etc.) and have evolved over time [51]. Nowadays, the most important sources to mine in this context come from the Internet and, as has already been mentioned, Twitter can be seen as one of the most popular. The research on sentiment analysis in Twitter is called Tweet Sentiment Classification (TSC) [52] and multiple works have been developed under this approach [19,20,22,23,53]. One important topic that has attracted attention on Twitter has been vaccination. Analysis of discussions, opinions and feelings about certain vaccines in tweets has been performed to detect the sentiments related to vaccine promotion [36,[39][40][41].
In the current paper, we analyze the possibilities of exploring social media information, and in particular Twitter, in order to extract feelings related to different vaccine messages. Thus, the expansion of negative opinions related to a set of vaccines and their related diseases could be monitored. The analysis has been performed by mining tweets in Spanish published during the period 2015–2018. We have created several classification meta-models according to different machine learning techniques and different datasets. As the collected data were imbalanced, sampling methods were used to address the situation. The work is included in the MAVIS study and is an extension of previous work [54]. The structure of the present paper consists of the following sections: Section 2 includes the methodology performed, Section 3 details the obtained results, while Section 4 discusses them. Finally, Section 5 summarizes the achieved conclusions and the future work to be carried out.

Materials and Methods
In this section, the pipeline to gather the data from Twitter, perform the sentiment analysis and create classification models is detailed. Some of the processes explained here were previously presented in [54]. The first objective was, using a set of vaccines and their related diseases, to discover whether or not a negative opinion about them was spreading on Twitter. To this aim, Twitter messages associated with those concepts were first extracted. Afterwards, the tweets' polarity was classified both by three commercial tools and by five evaluators, who annotated the tweets manually. As the class assigned to tweets was distributed in an imbalanced manner, sampling methods were performed to obtain different datasets. Finally, several machine learning techniques were applied to the data to generate different classification models.

Twitter Data Extraction and Sentiment Analysis
The keywords on which this study has focused its interest are related to a set of vaccines and their associated diseases. Although Instagram data were also considered for the current methodology, the amount of data compared to Twitter was very low, so for the sake of quality model generation, Instagram data were discarded.
Twitter data were obtained by using the official API (Application Programming Interface), from which all the required information for the current study could be extracted. The execution of the extraction process obtained a total of 1,028,742 tweets, of which 318,302 were different/original tweets. The number of retweets was 10,440 and the number of quotes of the original tweets was 65,806. The keywords used in the search were mentioned 1,187,046 times. After extraction, all the tweets were submitted to a cleaning process in order to obtain a consistent and understandable version of the texts. Hashtags (#), user mentions (@), URLs, email addresses, retweet markers (RT:), emojis and other non-representable characters were removed.

Sentiment analysis examines the content of free-text natural language to identify opinions and emotions. One of the principal and most important sources of such text on the Internet is social media. Sentiment analysis of social media has been a common research topic, and diverse software tools have been developed to automate its processes, enabling the classification of large numbers of texts [51]. Methods may focus on the polarity of texts ("positive", "negative", "neutral") but can also be centered on feelings and emotions ("angry", "happy", "sad") or intentions ("interested", "not interested"). Sentiment analysis approaches have been categorized into three main groups: knowledge-based, statistical and hybrid methods [55]. Efforts have also been made to extract the sentiment associated with a specific subject of a text, instead of classifying the whole text as positive or negative [56].
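The exact cleaning rules of the pipeline are not published; the tweet cleaning step described above can be sketched in Python as follows (illustrative only: the function name `clean_tweet` and the decision to drop the whole hashtag token, rather than only the # symbol, are assumptions):

```python
import re

def clean_tweet(text: str) -> str:
    """Sketch of the cleaning step: strip retweet markers, URLs, email
    addresses, user mentions, hashtags, emojis and other
    non-representable characters, then normalize whitespace."""
    text = re.sub(r"\bRT\b:?", " ", text)        # retweet markers (RT:)
    text = re.sub(r"https?://\S+", " ", text)    # URLs
    text = re.sub(r"\S+@\S+\.\S+", " ", text)    # email addresses
    text = re.sub(r"[@#]\w+", " ", text)         # user mentions and hashtags
    # keep letters (including Spanish accents), digits and basic
    # punctuation; this drops emojis and other symbols
    text = re.sub(r"[^\w\s.,;:!?¡¿()'\"-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("RT: @usuario La #vacuna es segura https://t.co/abc 😀"))
```

Because `\w` matches accented characters in Python 3, the Spanish text itself is preserved while emojis and markers are removed.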

Different Datasets Creation
To get the different datasets that would be input to the machine learning methods to generate the classification models, a process to annotate the tweets was implemented. Such annotations were performed in two ways: (1) using different commercial tools and (2) being manually revised by expert evaluators.

•	Annotation with commercial tools

Three tools were chosen to automatically annotate the whole set of tweets and quotations: IBM Watson (https://www.ibm.com/watson/services/tone-analyzer/) (now called Watson Tone Analyzer), Google Cloud Natural Language (https://cloud.google.com/natural-language) and Meaning Cloud (https://www.meaningcloud.com/es). The three tools returned the polarity of tweets in different formats. IBM and Google returned numerical values (a score between −1 and 1), while Meaning Cloud returned one of six classes (P+, P, NEU, N, N+ and NONE). The analysis was simplified so that tweets were discretized to either "negative" or "non-negative"; this way, "non-negative" includes other classes such as "neutral". Models were generated considering (i) the original values without discretizing ("original"), (ii) the adapted discretized values ("adapted") and (iii) both of them ("both").
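The discretization described above can be sketched as follows (a minimal Python sketch; the cut-off of 0 for the IBM and Google scores, and the mapping of Meaning Cloud's N and N+ classes to "negative", are assumptions, since the exact thresholds are not stated):

```python
def discretize(tool: str, value) -> str:
    """Map each commercial tool's raw polarity output onto the binary
    scheme used in the study: "negative" vs "non-negative".
    The threshold of 0 for the numerical scores is an assumption."""
    if tool in ("ibm", "google"):      # numerical score in [-1, 1]
        return "negative" if value < 0 else "non-negative"
    if tool == "meaningcloud":         # one of: P+, P, NEU, N, N+, NONE
        return "negative" if value in ("N", "N+") else "non-negative"
    raise ValueError(f"unknown tool: {tool}")

print(discretize("google", -0.4))          # a negative score
print(discretize("meaningcloud", "NEU"))   # a neutral class
```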

•	Manual annotation by experts
Five experts in the field classified the tweets' polarity manually, determining whether each tweet contained a "negative" or a "non-negative" opinion. An iterative annotation process was performed three times: in each iteration, a set of 100 tweets was annotated, classifying the sentiment expressed in them. A total of 300 tweets were annotated. Of those, a very low number were identified as negative, leading to a class imbalance problem that could impact the quality of the models to be generated later.
To increase the number of negative tweets, a sample containing words with negative sentiments was extracted from the original dataset. These words were selected from the Meaning Cloud platform, since it allows retrieving the polarity of the words extracted in the sentiment analysis process. A subset of 459 tweets containing words from this list was obtained, ensuring they had not been previously selected for the first three iterations. Therefore, the total number of tweets classified by the five evaluators amounted to 759.
Analyzing the resulting manually annotated dataset, it was found that two of the experts had a high degree of disagreement with the other three evaluators. The annotations of those three evaluators, and of all five, were compared to the annotations produced by the commercial tools. The level of agreement with the commercial tools was higher for the three evaluators than for the five. For the three evaluators, there were 128 tweets classified as negative and 631 as non-negative; for the five evaluators, there were 142 tweets annotated as negative and 617 as non-negative. Nevertheless, as will be stated in the next subsection, models were generated with both the three and the five experts' annotations, as the difference between them was small and the five-evaluator scenario increased the number of negative tweets.
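For illustration, merging several evaluators' labels into one class and measuring the agreement between two annotation sources can be sketched as below (hypothetical helpers; the paper does not state how the individual annotations were merged, so majority voting is an assumption):

```python
from collections import Counter

def majority_label(labels):
    """Aggregate several evaluators' annotations for one tweet by
    majority vote (an assumed merging rule; with an odd number of
    binary annotations there are no ties)."""
    return Counter(labels).most_common(1)[0][0]

def agreement_rate(a, b):
    """Fraction of tweets on which two annotation sources coincide."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

print(majority_label(["negative", "non-negative", "negative"]))
```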

Sampling and Models Generation
Learning from imbalanced data is still a challenge for machine learning methods nowadays [57]. A classifier trained on a dataset with an imbalanced class distribution tends to be biased towards the more frequent class. Therefore, efforts to counteract such skewed distributions are of paramount importance to ensure this bias is not learned by the model. Pre-processing or training methods should then focus on alleviating this disadvantage. The aim is to create a learning system that is able to predict the minority class without sacrificing performance on the majority one [58].
As this is not a trivial challenge and has a major impact on the accuracy of the generated model, multiple approaches to solve it have been proposed in the literature [59,60]. These approaches are called sampling methods. There are two main ways of balancing an imbalanced class set: downsizing the large class (also known as under-sampling) or upsizing the small class (also known as over-sampling). Generally, over-sampling is preferred [61]. Both can be performed on a random basis, by randomly adding or dropping instances of the minority or majority class, respectively. However, this approach may cause overfitting or loss of information.
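Random under- and over-sampling as described above can be sketched in a few lines (illustrative helpers, not the implementation used in the study, which relied on the R package UBL):

```python
import random

def random_undersample(majority, minority, seed=0):
    """Randomly drop majority-class instances until both classes
    have the same number of samples."""
    rng = random.Random(seed)
    return rng.sample(list(majority), len(minority)) + list(minority)

def random_oversample(majority, minority, seed=0):
    """Randomly duplicate minority-class instances until both classes
    have the same number of samples."""
    rng = random.Random(seed)
    extra = [rng.choice(list(minority))
             for _ in range(len(majority) - len(minority))]
    return list(majority) + list(minority) + extra
```

Duplicating minority instances verbatim is what can lead to overfitting, and dropping majority instances is what can lose information, motivating the synthetic techniques discussed next.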
Thus, other techniques aim to overcome such limitations. One of them [62] introduced a cluster-based under-sampling approach, where it was proposed that clusters in the dataset that have more majority class samples and fewer minority class samples will behave like the majority class samples, and vice versa. Therefore, it would be reasonable to select a suitable number of majority class samples from each cluster by considering the ratio of majority class samples to minority class samples in the cluster. Regarding over-sampling, SMOTE [63] (synthetic minority over-sampling technique) was developed to generate synthetic minority class examples by selecting new samples close to the existing ones in the feature space. On the other hand, ADASYN [64] (adaptive synthetic sampling) also generates synthetic sample points for the minority class, but considers a density distribution to decide the number of synthetic samples to be generated for a particular point.
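The core idea of SMOTE can be sketched compactly (a simplified sketch of the technique in [63], not the exact algorithm nor the implementation used in this work; parameter names are illustrative):

```python
import math
import random

def smote(minority, k=3, n_new=10, seed=0):
    """Minimal SMOTE sketch: each synthetic sample is an interpolation
    between a random minority point and one of its k nearest minority
    neighbours, so new points lie on segments inside the minority
    region of the feature space."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment x -> nb
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nb)))
    return synthetic
```

ADASYN follows the same interpolation scheme but allocates more synthetic points to minority samples that are harder to learn (those surrounded by majority-class neighbours).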
The main objective of the current research work was to generate different classification models by means of supervised machine learning techniques. Such models were obtained starting from the different datasets mentioned in the previous subsection: considering the different numbers of evaluators (3 or 5) and discretizing or not the inputs (original, adapted or both). The variables in such datasets were the following: the output values from the (i) IBM (originally a numerical score), (ii) Google Cloud (originally a numerical score) and (iii) Meaning Cloud (originally a discrete class) tools, and (iv) the manually annotated class.
On the other hand, different sampling techniques were performed in order to balance the number of negative and non-negative classes associated with tweets. As discussed above, imbalanced datasets lead to poor-quality classification models. There are two main ways of addressing this challenge: under- or down-sampling (i.e., reducing the number of samples of the majority class) and over- or up-sampling (i.e., increasing the number of samples of the minority class). For down-sampling, two methods were performed: random sampling and clustering; for up-sampling, three methods were implemented: random sampling, SMOTE and ADASYN (see the references above). In both cases of random sampling, the method was iterated 10 times to account for the randomness. For clustering, k-means was implemented. Both clustering and ADASYN were only applied to the original, non-discretized data, as they cannot handle categorical input variables.
There is a wide variety of machine learning methods that are used to generate classification models. This kind of machine learning is also known as supervised machine learning. A classifier is a system that can predict, based on the previous learning, the class of a new input instance. Some of the methods that have been discussed in the literature and that will be used for our work are the following: C5.0, Logit Boost, Bayesian Generalized Linear Models (BayesGLM), Multilayer Perceptron, Random Forests (RF) and Support Vector Machine models (SVM).
C5.0 [65] (p. 5) is an improvement of the classic C4.5 algorithm [66], which generates decision trees based on the concept of information entropy. LogitBoost [67], also known as additive logistic regression, applies a boosting approach to build a logit model using decision trees as weak learners [68]. BayesGLM [69] uses an approximate Expectation-Maximization (EM) algorithm to fit a GLM with the Student-t prior distribution. The Multilayer Perceptron [70,71] is a class of artificial neural network that uses hidden layers and the back-propagation error algorithm. Random Forests [72] is a meta-classifier that builds multiple decision trees and combines their outputs by a voting process. Finally, SVM [73] builds models as function estimation and optimization problems, in a linear or non-linear way, separating classes by hyperplanes.
Some of the literature works related to health sentiment analysis have used machine learning methods such as SVM [29,39], Naïve-Bayes [74], Random Forests and Random Decision Trees [48].
Six supervised learning algorithms were implemented to obtain the models: C5.0, Logit Boost, Bayes GLM, Multilayer Perceptron, Random Forest and SVM. The hyperparameters of each algorithm were either set to their default values or optimized over 4 to 12 parameterizations; for further details, the code and packages are provided as indicated below. The Multilayer Perceptron used weight decay in the loss function and the SVM was implemented with a linear kernel. Each model was evaluated using 10-fold cross-validation, and performance was summarized by the mean and standard deviation of the ROC values.
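The original evaluation was implemented in R with the caret package (see below). Purely as an illustration, the cross-validated ROC summary can be sketched in Python; per-fold model training is omitted for brevity, so the sketch assumes out-of-fold prediction scores are already available:

```python
import statistics

def roc_auc(scores, labels):
    """ROC AUC via the Mann-Whitney formulation: the probability that
    a randomly chosen positive is scored above a randomly chosen
    negative (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def kfold_auc(scores, labels, k=10):
    """Split the predictions into k folds and report the mean and
    standard deviation of the per-fold AUC, mirroring the 10-fold
    cross-validated ROC summaries used in the study."""
    folds = [([], []) for _ in range(k)]
    for i, (s, y) in enumerate(zip(scores, labels)):
        folds[i % k][0].append(s)
        folds[i % k][1].append(y)
    aucs = [roc_auc(s, y) for s, y in folds]
    return statistics.mean(aucs), statistics.stdev(aucs)
```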
All the sampling and classification methods were implemented in R, mainly with the UBL (https://www.rdocumentation.org/packages/UBL) and caret (https://www.rdocumentation.org/packages/caret) packages, respectively. All the code is included in the Supplementary Materials.

Results
The principal objective of the present work is the generation of sentiment classification models learned from annotated vaccine-related tweets. Six different supervised learning methods were used for this task: C5.0, Logit Boost, Bayes GLM, Multilayer Perceptron, Random Forests and SVM. Each algorithm was run on each of the sampling subsets (non-sampling, down-sampling and up-sampling) and considering (i) the different numbers of evaluators (3 or 5) and (ii) the distinct input predictors (original, adapted or both). In the random modalities of up-sampling and down-sampling, the techniques were executed a total of 10 times and the independent results were analyzed to guarantee their significance.
The full tables with all the mean values and standard deviations of the ROC curves derived from 10-fold cross-validation of the previous analysis are included in the Supplementary Materials (https://medal.ctb.upm.es/internal/gitlab/mavis/mavis/blob/master/SA_ASC/MAVIS_tables_by_sampling.xlsx). In the provided .xlsx file, each sheet corresponds to a sampling method ("NO sampling", "DOWN-random", "DOWN-clustering", "UP-random", "UP-smote" and "UP-ADASYN"). All the results are shown for every Machine Learning (ML) method and every combination of the number of evaluators and predictors.
The highest mean ROC values in each sampling subset obtained by each ML method have been visualized (https://medal.ctb.upm.es/internal/gitlab/mavis/mavis/blob/master/SA_ASC/MAVIS_SA_best_results.xlsx). We have represented three figures: Figure 1 represents the highest accuracy of the models in the initial subset without performing sampling, Figure 2 in the two under-sampling subsets and Figure 3 in the three over-sampling subsets. The colors of the bars stand for the different classification methods and the textures represent the input predictor from which the model has been generated. All the best generated models have been obtained from either the original predictor ("ORIG") or the combination of the original and the adapted ("BOTH"), but none of the best models in any subset have been generated from the adapted predictor ("ADAP"). There were two cases (1 for the non-sampling subset and 1 for up-sampling) in which the highest mean ROC value was the same for both predictors. In those cases, for simplicity, the predictor has been represented as the original one.
The width of the bars in Figure 1 represents the number of evaluators, being 3 in the thinner bars and 5 in the wider ones. In the other two figures, all the bars represent models coming from the three evaluators' annotations. There were five cases (1 for the non-sampling subset, 2 for down-clustering and 2 for up-sampling) in which the highest mean ROC value was the same for both numbers of evaluators. For the sake of clarity, in those cases the number of evaluators has been set to three.
Overall, averaging the mean accuracy of the models generated with the ML methods in the different sampling subsets led to the results presented in Table 1. In this table, the mean values of the different ML methods' accuracies for each of the sets have been represented. Results are displayed by ranking the applied ML tools from left to right, so that higher global accuracy averages are shown on the left while lower ones appear on the right.
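The ranking in Table 1 amounts to averaging each method's accuracy across the sampling subsets and sorting in descending order; a sketch with illustrative numbers (not the paper's actual values):

```python
def rank_methods(results):
    """Average each ML method's accuracy over the sampling subsets and
    order the methods from best to worst, reproducing the
    left-to-right layout of Table 1."""
    averaged = {m: sum(v) / len(v) for m, v in results.items()}
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative accuracies per sampling subset, not the study's figures
demo = {"RF": [0.72, 0.61, 0.90],
        "MLP": [0.72, 0.63, 0.80],
        "SVM": [0.52, 0.49, 0.73]}
print([name for name, _ in rank_methods(demo)])
```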
Appl. Sci. 2020, 10, 9019

Figure 2. Highest mean accuracy values in the down-sampling subsets, corresponding to each ML method. Each colored bar represents a different classification model. The texture of the bar represents the input predictor from which each model has been generated. The two down-sampling methods are grouped on the X axis.

Discussion
The model that provided the highest accuracy from all the studied possibilities was the one generated by the subset obtained from up-sampling with the ADASYN method and corresponding to the Random Forest technique. Such a data subset was formed by the original values of the commercial tools and was annotated by three evaluators.
When no sampling was performed, the highest accuracy values ranged from 0.52 to 0.72. Such accuracies were not much improved by under-sampling (0.49 to 0.72). The accuracy tended to present larger values when the dataset was balanced by over-sampling the minority class (0.66 to 0.9). The three highest accuracy values were 0.9, 0.9 and 0.89, all obtained by up-sampling methods (Random Forests with ADASYN and C5.0 with SMOTE obtained a mean ROC value of 0.9; Random Forests with SMOTE got 0.89; and C5.0 with ADASYN got 0.87).
Most of the best results were obtained when the dataset was annotated manually by the three evaluators. When the annotation was carried out by the five experts, the accuracy mainly decreased. There is just one case in which the five evaluators' annotations worked better (SVM without sampling) and five cases in which the accuracy was equal to that obtained with the three evaluators' annotations. Nevertheless, even in those situations, the accuracy was not very high (0.49, 0.52, 0.60, 0.64, 0.66 and 0.76). This can be explained by the disagreement among the experts: when considering the five experts, as two of them did not agree with the other three, the dataset increased its noise level, and therefore the accuracy tended to be lower. On the other hand, adapted values used as input to generate models did not perform as well as the original or combined ones.
Random Forests was one of the models that obtained the best accuracy values across the different sampled subsets (0.7 with no sampling, 0.61 for down-clustering, 0.72 for down-random, 0.9 for ADASYN, 0.89 for SMOTE and 0.87 for up-random). On the contrary, SVM usually got the lowest values (0.52 with no sampling, 0.49 for down-clustering, 0.69 for down-random, 0.73 for ADASYN, 0.7 for SMOTE and 0.66 for up-random). The Multilayer Perceptron and C5.0 also obtained high accuracies (the former with no sampling and for down-clustering, and the latter for SMOTE).
There was a trend in the global accuracy averages of the different ML methods: while Random Forests performed best when up-sampling and in overall accuracy, the Multilayer Perceptron showed the highest mean ROC values with no sampling and with down-sampling. Analogously, whilst LogitBoost performed better than BayesGLM when up-sampling, BayesGLM obtained higher results under no sampling and down-sampling; the same global accuracy average was derived in both cases.

Conclusions
The current manuscript presents research on the different options for analyzing the polarity of tweets regarding a specific set of vaccines and their related diseases. The polarity of tweets published on Twitter in Spanish was annotated by commercial tools and experts. Generally, opinions about vaccines expressed in that social network tend to be non-negative; therefore, the imbalance between classes had to be overcome through sampling. Different combinations of sampled classified tweets were used as inputs to generate classification models.
The results showed that the highest accuracy was obtained with the Random Forest model when up-sampling with ADASYN. Over-sampling methods exhibited better accuracy in most of the cases. However, it must be noted that the analyzed techniques have shown, in some cases, results very close to both the random approaches and other more complex techniques. These results show that, although more complex techniques such as ADASYN or SMOTE would help to obtain more accurate and stable results, the dataset does not seem to have sufficient differences in the distribution of its data to allow such techniques to significantly improve the results compared with random selection of the tweets.
Some other limitations might be pointed out; in particular, the development of the classifiers required expert manual annotations to provide trustworthy labelling. Such expert annotations may not always be available when building a text sentiment classifier in a completely different domain. In future work, more effort should be devoted to assessing the classification process with non-binary labels, such as "neutral", "negative" and "positive".
However, the results provided in this paper are significant and allowed us to demonstrate that it is possible to apply sentiment identification techniques by using a meta-model based on different commercial tools, even when the number of available tweets for training it is low and imbalanced.
In a more general context, the present work corroborates the fact, already stated in the literature, that over-sampling produces better results than under-sampling. Moreover, this study showed that some ML methods for generating classifiers outperform others, with Random Forests and the Multilayer Perceptron achieving the highest accuracies among those used. The generated classifiers provide a reasonable solution for annotating tweets in the scope of the mentioned vaccines and related diseases, avoiding the effort required to classify each tweet manually. Other research questions regarding sentiment classification of texts, wherever they come from, can be addressed by following the described workflow or a similar one.
Supplementary Materials: The code developed for the current analysis is fully available and accessible at the public repository https://medal.ctb.upm.es/internal/gitlab/mavis/mavis.