Zero-Shot Emotion Detection for Semi-Supervised Sentiment Analysis Using Sentence Transformers and Ensemble Learning

Abstract: We live in a digitized era where our daily life depends on using online resources. Businesses consider the opinions of their customers, while people rely on the reviews/comments of other users before buying specific products or services. These reviews/comments are usually provided in non-normative natural language within different contexts and domains (in social media, forums, news, blogs, etc.). Sentiment classification plays an important role in analyzing such texts collected from users by assigning positive, negative, and sometimes neutral sentiment values to each of them. Moreover, these texts typically contain many expressed or hidden emotions (such as happiness, sadness, etc.) that could contribute significantly to identifying sentiments. We address the emotion detection problem as part of the sentiment analysis task and propose a two-stage emotion detection methodology. The first stage is the unsupervised zero-shot learning model based on a sentence transformer returning the probabilities for subsets of 34 emotions (anger, sadness, disgust, fear, joy, happiness, admiration, affection, anguish, caution, confusion, desire, disappointment, attraction, envy, excitement, grief, hope, horror, joy, love, loneliness, pleasure, fear, generosity, rage, relief, satisfaction, sorrow, wonder, sympathy, shame, terror, and panic). The output of the zero-shot model is used as an input for the second stage, which trains the machine learning classifier on the sentiment labels in a supervised manner using ensemble learning. The proposed hybrid semi-supervised method achieves the highest accuracy of 87.3% on the English SemEval 2017 dataset.


Introduction
For many years, humans have had to adjust their communication style to be 'understood' by computers, but communication in natural language has recently become a new trend. Huge amounts of text available online are in unstructured/unannotated form and therefore do not have much value. Such noisy data can be converted into useful information only after proper processing. However, manual processing is cumbersome and time-consuming. In contrast, automatic techniques can help save manual labor, get the result faster, filter through huge amounts of unnecessary data to find appropriate material, and deliver the machine output in the desired format [1]. Natural language processing (NLP) tackles language technology problems by employing Artificial Intelligence (AI) methods for intelligent human-machine interaction. With AI technologies that use data mining, pattern recognition, and NLP, the computer can mimic the way the human brain works. NLP applications, such as machine translation systems, web search engines, natural language assistants, and opinion analysis, are resolving societal problems [2].
Today, the mood (sentiments, emotions) of texts is as important as their content [3]. Sentiment and emotion detection plays a crucial role in analyzing social moods [4,5]. Explosive social media growth enables users to share their opinions more and more and leave feedback online; this, in turn, makes sentiment analysis a powerful NLP tool.
We present the following contributions to the research field.
• The zero-shot model detects emotions first, and later they are used to assign positive, negative, and neutral sentiments. Such a method gradually decreases the dimensionality: the high-dimensional sentence transformer input (i.e., vectorized text) is mapped into probability values of different emotions, and these probability values are further mapped into the sentiment labels.
• The second-stage input does not require complicated feature extraction or sophisticated machine learning methods able to catch sentiments directly from the text, which, in turn, speeds up the whole sentiment analysis process.
• The performance of the proposed method is evaluated on three benchmark datasets (IMDB, Sentiment140, and SemEval-2017) using multiple classifiers, including machine learning, neural network, and ensemble learning.
• The proposed emotion-sentiment detection model requires less training data than traditional sentiment analysis.
This paper is divided into five more sections. In Section 2, we present the related work of existing solutions. The methods used in this experiment are described in Section 3. In Section 4, we present our experiment results and discuss the results obtained in Section 5. Section 6 summarizes, concludes our work, and provides our thoughts on possible future research directions.

Related Work
Sentiment analysis is among the principal tasks of NLP that strives to predict opinion polarity. It often predicts the sentiment as belonging to one of the three categories (negative, neutral and positive) that can be used in many areas such as customer product review [35], political forecasting [11], telehealth services [36], finance [37], etc. [38].
Following Medhat et al. [39], we divide sentiment analysis techniques into two main paradigms: rule-/lexicon-based and machine learning. Lexicon-based methods [40] rely on the assumption that the overall sentiment depends on the words that explicitly express these sentiments. Words (adjectives, adverbs, sometimes verbs, and nouns) that define different sentiments are searched in the text and counted: the overall sentiment of the text depends on the majority. In machine learning, the sentiment analysis task is typically formulated as a text classification task and, therefore, can be solved with a whole spectrum of methods for this purpose: traditional machine learning methods (e.g., Support Vector Machine (SVM)), deep learning methods (e.g., Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN)), or innovative sentence transformer models (e.g., Bidirectional Encoder Representations from Transformers (BERT)).
For many years, different languages were treated separately by training monolingual models from scratch for separate tasks and separate languages [41,42]. Recently, many pre-trained models provided by open-source NLP libraries, such as BERT and NLTK, were introduced to minimize the efforts and resources required to learn general knowledge about the language and its structure (i.e., existing words, their meanings, and similarities). These transformer models are typically trained on very large monolingual or multilingual unannotated corpora (i.e., on pure texts) in a self-supervised manner and therefore are not adjusted for specific NLP problems [43]. With the help of transfer learning, the previously acquired general knowledge in the pre-trained word- or sentence-transformer models can be augmented and fine-tuned to tackle specific NLP problems (including the sentiment analysis task). Once the model is already 'familiar' with the language, it is much easier to adapt it to a specific NLP task: that is, typically, less training data is needed. Moreover, some multilingual transformer models are trained on parallel corpora and tuned for similar tasks in a way that lets them cross barriers between languages. Cross-lingual methods have recently received more attention from the NLP community, demonstrating promising results when augmented transformer layers are fine-tuned on languages different from those they are later tested on (a good example is the group of cross-lingual language model (XLM) transformers) [44].
Pre-trained transformer models can be used for sentiment analysis tasks in very different manners [45,46]: as text classifiers (by adding additional layers connecting the output of the transformer model with the sentiment labels); for the evaluation (by calculating distances between the unseen texts and texts of which the sentiments are already known); as zero-shot models that can evaluate the relatedness of some word (category, narrative, etc.) with the text. Zero-shot models can act as advanced dictionary-based methods that seek emotion or sentiment words both explicitly and implicitly and return probabilities determining how much these words are related to the text. The zero-shot models do not require training data, but they are not directly adjusted for sentiment analysis tasks and, therefore, may need additional mechanisms to overcome their limitations. In Table 2, we summarize the sentiment analysis methods that are the most influential in solving our problem. The ensemble of the three CNN models achieves the highest accuracy of 68.7%.

Outline
The proposed two-stage method combines unsupervised and supervised machine learning paradigms in one pipeline (Figure 1). The core of the first stage is the pre-trained zero-shot model, which is applied to (1) the emotion labels (see Table 3) and (2) the input text vectorized with the sentence transformer. The output of the zero-shot model is a list of emotion labels mapped to their probabilities for the input text. This output becomes the input to the second stage (Figure 2): emotion probabilities are converted into a one-hot encoding format and then fed into the sentiment classifier trained to detect positive/negative/neutral (three-class classification scenario) or positive/negative (binary classification scenario) sentiments (see Section 3). For classification, we have used supervised machine learning methods, including neural networks and ensemble learning.
Table 3. Set of emotions used for zero-shot classification.

Emotions and Sentiments
The emotion models that define the categorization process are a crucial factor to consider for systems that recognize emotions. Although there are various ideas on how to portray emotions, two stand out as the most popular in the field of NLP: Ekman's fundamental emotions [49] and Plutchik's wheel of emotions [50]. Six fundamental emotions are included in the Ekman model: surprise, sadness, happiness, fear, disgust, and anger. Four opposing pairs of axes make up Plutchik's model, which uses a multidimensional representation to characterize emotions as points along these axes (dimensions). Under this approach, an emotion is determined by its axis and its intensity. The axis pairings are surprise-anticipation, trust-disgust, anger-fear, and joy-sadness. Other emotions can be produced as combinations of these emotions and their intensities, as shown in Figure 3, which is an extract of the Plutchik model. The axes and intensities are marked with colors in the concentric rings. Most studies on emotion detection only consider a limited selection of these emotions. In this paper, we have subdivided the entire set of emotions into four subsets, as outlined in Table 3. Each subset consists of several emotions taken from the wheel of emotions (Figure 3): anger, sadness, disgust, fear, joy, happiness, admiration, affection, anguish, caution, confusion, desire, disappointment, attraction, envy, excitement, grief, hope, horror, joy, love, loneliness, pleasure, fear, generosity, rage, relief, satisfaction, sorrow, wonder, sympathy, shame, terror, and panic. Emotions are nonexclusive in Plutchik's model as they are composable; there are also some correlations between them.

Vectorization
Most machine learning methods perform mathematical calculations during training or testing and, therefore, cannot be directly applied to pure texts. Thus, vectorization becomes a crucial step in any NLP task. In our experiments, we have used the one-hot encoding vectorization technique. It is a discrete token representation method in which the vector length is equal to the size of the vocabulary. Each token is represented with a unique vector having all zero values except for one value equal to 1. This type of representation was used for the output of categorical data.
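The one-hot representation described above can be illustrated with a minimal sketch. The helper names `build_vocabulary` and `one_hot` are hypothetical and chosen only for this illustration; the three sentiment labels are used as an example vocabulary.

```python
def build_vocabulary(tokens):
    """Map each distinct token to an index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def one_hot(token, vocab):
    """Return a vector of len(vocab) zeros with a single 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec


# Illustrative categorical values: the three sentiment labels.
labels = ["negative", "neutral", "positive"]
vocab = build_vocabulary(labels)
encoded = {label: one_hot(label, vocab) for label in labels}
```

Each vector has exactly as many components as there are distinct values, so the representation grows linearly with the vocabulary size.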

First Stage Zero-Shot Classifiers (Sentence Transformers)
The proposed method has two stages (see its schematic representation in Figure 3). In the first stage, the emotion detection problem is tackled (see Figure 4). The core of this stage is the zero-shot classifier that does not require any training. The idea is to transfer the knowledge it already has to a new task. Zero-shot learning involves training a classifier on a set of labels and then testing it on new data having different labels that the classifier has not been trained on. Classical zero-shot learning requires that a descriptor of an unknown class be provided to the model so that it can predict that class without being trained on known representatives of it [51]. This machine learning method is based on a pre-trained model that can recognize classes that were not observed during training and predicts which class the input text belongs to. The zero-shot model returns probabilities for the given emotions and thus determines their relations to the input text. In our experiments, we have tested four zero-shot transformer models as follows:
• The bart-large-mnli model [52] is a zero-shot sequence classifier proposed in [53]. The model was trained on tweets, emotional occurrences, fairy tales, and artificial sentences. It has nine emotions (anger, disgust, fear, guilt, joy, love, sadness, shame, surprise), as well as the "none" class (if no emotion applies). The approach poses the sequence to be classified as the premise of a multi-genre natural language inference (MNLI) problem and creates a hypothesis from each possible label. Then, label probabilities are derived from the entailment and contradiction probabilities.
• The Fb-improved-zeroshot model [54] is a zero-shot model for German and English academic searchlog classification created by ETH Zürich students and based on [53]. The bart-large-mnli model was used as the basis and then fine-tuned to obtain this model.
• The COVID-Twitter-BERT (CT-BERT), a transformer-based model pre-trained on a corpus of Twitter conversations about COVID-19 [56], is the foundation of the covid-twitter-bert-v2-mnli model [55]. CT-BERT was designed to work with COVID-19 content, particularly from social media. The model captures the sentiment toward vaccines. The dataset comprises three classes: positive (towards vaccinations), negative, and neutral/others.
• The bart-large-mnli-yahoo-answers model [57] is the bart-large-mnli model fine-tuned on Yahoo Answers topic classification. The model may be used to predict whether a topic label can be assigned to a given sequence.
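The entailment-based scoring mentioned for bart-large-mnli can be sketched as follows. This is a minimal illustration, not the models' actual code: the logits are hypothetical numbers rather than real model output, and the helper names are ours. To our understanding, the template "This example is {}." and the two scoring modes mirror the defaults of the zero-shot classification pipeline in the Hugging Face transformers package, where each label is either scored independently from its contradiction and entailment logits or normalized against the other labels.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def make_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]


def label_probabilities(nli_logits, multi_label=True):
    """Convert per-label (contradiction, entailment) logits into probabilities.

    multi_label=True: each label is scored on its own as
        softmax([contradiction, entailment])[1].
    multi_label=False: entailment logits are normalized across all labels.
    """
    if multi_label:
        return {lab: softmax([c, e])[1] for lab, (c, e) in nli_logits.items()}
    labels = list(nli_logits)
    probs = softmax([nli_logits[lab][1] for lab in labels])
    return dict(zip(labels, probs))


# Hypothetical logits for one input text (assumed values, not model output).
example_logits = {"joy": (-1.0, 2.0), "anger": (1.5, -2.0), "fear": (0.5, -0.5)}
hypotheses = make_hypotheses(example_logits)
independent = label_probabilities(example_logits, multi_label=True)
normalized = label_probabilities(example_logits, multi_label=False)
```

Independent scoring suits emotion detection, where several emotions may apply to the same text, whereas normalized scoring forces the probabilities to sum to one.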

Second Stage Machine Learning and Ensemble Learning Classifiers
The second stage uses the output of the first stage by transforming it into a one-hot encoding format. These feature vectors were then fed into a classifier that learns, in a supervised manner, to predict positive, negative, and neutral sentiment labels. In our experiments, we used two types of classifiers: single-model machine learning classifiers and ensemble learning classifiers.

Single-Model Machine Learning Classifiers
Traditional machine learning (as implemented in [58,59]) and deep learning [60][61][62] classifiers have already been applied to the sentiment analysis problem. Recently, deep learning methods were combined with ensemble learning [63]. However, the main innovation of our study is that we classify the output of the zero-shot model rather than the vectorized text directly. For this reason, we cannot use the whole spectrum of deep learning models such as CNN, LSTM, etc. In our experiments, we have used and evaluated the classifiers described below:
• Feed-forward neural network (FFNN) is suitable for such tasks because it can learn relationships between independent features. In addition, it is a simple and fast network that learns how to adapt the weights of connections between units until the correct output is produced. In this paper, we have used this architecture because of its simplicity of feature selection. The architecture of the model we used in our experiment has one layer of 64 neurons, the Rectified Linear Unit (ReLU) activation function in the hidden layer, and the sigmoid activation function in the output layer. During training, we used the Adam optimizer with binary cross-entropy loss and the accuracy metric.
• Linear regression (LR) is an algorithm used to estimate how strong the relationship between two variables is and the value of a dependent variable at a certain value of the independent variables. The parameters of this classifier were set to their default values.
• K-nearest neighbors (KNN). In KNN, objects of the same class are assumed to lie in close proximity. KNN can be used for multiclass classification, and it is useful when the amount of labeled data is small. We chose to test this method because of the small amount of data used in this experiment. The parameters of this classifier were set to their default values.
• Support Vector Machine (SVM) is a supervised learning method that is used for classification, regression, and outlier detection. Default values were used for the parameters of this classifier.
• Naive Bayes (NB) predicts the probability of different classes based on several attributes. We chose this algorithm because it is widely used for text classification with multiple classes and does not require much training data. We used the default values of its parameters in our experiment.
• Classification and Regression Tree (CART) is a decision tree algorithm used for the classification task. CART can capture non-linear relationships within the dataset, and there is no need for standardization of data when using this model. We used the default values for the parameters of this classifier.
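The FFNN architecture described above (one hidden layer of 64 ReLU units, a sigmoid output, binary cross-entropy loss) can be sketched as a plain-Python forward pass. This is an illustration only: the experiments used TensorFlow and Keras, the weights here are random rather than trained, the Adam optimizer is not shown, and the six-value input vector is a hypothetical stand-in for the second-stage features.

```python
import math
import random


def relu(x):
    return x if x > 0.0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_ffnn(n_inputs, n_hidden=64, seed=0):
    """Small random weights for one hidden layer and one sigmoid output unit."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    b2 = 0.0
    return w1, b1, w2, b2


def forward(features, params):
    """Hidden ReLU layer followed by a sigmoid output in (0, 1)."""
    w1, b1, w2, b2 = params
    hidden = [
        relu(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w1, b1)
    ]
    return sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Loss for a single example; y_prob is clipped to avoid log(0)."""
    y_prob = min(max(y_prob, eps), 1.0 - eps)
    return -(y_true * math.log(y_prob) + (1 - y_true) * math.log(1.0 - y_prob))


# Hypothetical feature vector for one text (assumed values).
features = [0.05, 0.10, 0.02, 0.03, 0.70, 0.10]
params = init_ffnn(n_inputs=len(features))
p_positive = forward(features, params)
loss = binary_cross_entropy(1, p_positive)
```

Because the input is a short vector of emotion scores rather than a token sequence, a single dense hidden layer is enough to express the mapping, which is why sequence models such as CNN or LSTM are not needed here.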

Ensemble Learning Classifiers
Ensemble learning methods use multiple combined machine learning classifiers (instead of a single classifier) to achieve better predictive performance. Each of these methods is trained to solve the same problem, but their results are combined. In our experiments, we have used the following ensemble learning methods:

• Adaptive Boosting (AdaBoost) classifier re-assigns weights to each data sample, i.e., higher weights are assigned to wrongly classified data. AdaBoost is less likely to overfit because input parameters are not optimized jointly.
• AdaBoost regressor is a meta-estimator that first fits a regressor on the original dataset and then fits subsequent copies of the regressor while the weights of the instances are changed in accordance with the error of the most recent prediction.
• Bagging classifier is used to lower the variance within a noisy dataset. A bagging classifier fits base classifiers on randomly selected subsets of the dataset and then combines their predictions (by averaging or by voting) to get a final prediction.
• Bagging regressor is a meta-estimator that fits base regressors to individual random subsets of the dataset and then combines each prediction to get the final prediction. By adding randomization to the process of building a black-box estimator (such as a decision tree), a meta-estimator lowers the variance of the estimator.
• Extremely Randomized Trees (ExtraTrees) classifier is similar to Random Forest but has two key differences: it samples without replacement (bootstrap is equal to False by default), and nodes are split based on random splits rather than best splits. The advantage of this estimator is its low variance.
• Histogram Gradient Boosting (HistGradientBoost) classifier buckets continuous feature values into discrete bins and then uses these bins to generate feature histograms during training. The histogram-based algorithm is very efficient in both memory consumption and training speed.
• Stacking classifier stacks several machine learning classifiers such as Random Forest Classifier, KNN, decision tree, SVM, NB, and Support Vector Regression.
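The bagging idea from the list above (fit base classifiers on random subsets, then combine their predictions by voting) can be sketched in a few lines. This is a toy illustration under our own assumptions: the base classifier is a simple one-dimensional threshold rule rather than the estimators used in the experiments, and the training pairs are invented values.

```python
import random
from collections import Counter


def fit_stump(samples):
    """Base classifier: threshold halfway between the two class means (1-D feature)."""
    by_class = {0: [], 1: []}
    for x, y in samples:
        by_class[y].append(x)
    if not by_class[0] or not by_class[1]:
        # Degenerate bootstrap sample containing one class: always predict it.
        only = 0 if by_class[0] else 1
        return lambda x: only
    mean0 = sum(by_class[0]) / len(by_class[0])
    mean1 = sum(by_class[1]) / len(by_class[1])
    threshold = (mean0 + mean1) / 2.0
    high_class = 1 if mean1 > mean0 else 0
    return lambda x: high_class if x > threshold else 1 - high_class


def fit_bagging(samples, n_estimators=11, seed=0):
    """Fit each base classifier on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        bootstrap = [rng.choice(samples) for _ in samples]
        models.append(fit_stump(bootstrap))
    return models


def predict_bagging(models, x):
    """Combine the base predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Toy data (assumed): one emotion score -> sentiment (1 = positive, 0 = negative).
train = [(0.05, 0), (0.10, 0), (0.20, 0), (0.30, 0),
         (0.70, 1), (0.80, 1), (0.90, 1), (0.95, 1)]
ensemble = fit_bagging(train)
```

An odd number of base classifiers avoids ties in binary voting. Stacking differs in that a further classifier is trained on the base predictions instead of taking a fixed vote.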

Evaluation and Statistical Analysis of Performance
The tested methods were evaluated with the commonly used accuracy, precision, recall, and F-score metrics. To determine whether the performance differences between sentence transformers (used as a baseline) and the suggested method were statistically significant, we used the Wilcoxon rank-sum test with the null hypothesis that the medians of the two variables are equal, reporting the null hypothesis indicator H and the p-value. We have used the Friedman test and the post hoc Nemenyi test to examine the effectiveness of various machine learning techniques. The Friedman test is a strong nonparametric statistical ranking test that does not require the assumption of normality. It has been used in various studies in the past to evaluate the effectiveness of machine learning techniques. All pairwise algorithm comparisons were performed using the non-parametric Nemenyi test with a 0.05 significance level. The critical distance (CD) diagram [64] is used to represent the outcomes (mean rankings of compared methods).
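For reference, the four evaluation metrics can be computed from predicted and true labels as in the following minimal sketch for the binary case. The function name `binary_metrics` and the example label vectors are hypothetical and serve only as an illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F-score for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Invented labels: 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
```

For the three-class scenario, precision, recall, and F-score would be computed per class and then averaged; the averaging scheme is not specified in the text.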

Settings
Sentiment analysis is a text classification task, where given written text as an input, positive, neutral, or negative class is returned as the output. Here we perform the binary (2-class, positive and negative) and 3-class (positive, negative, and neutral) sentiment classification.
Our method was implemented in the Python programming language using the TensorFlow and Keras libraries. Our experiments were executed with the datasets described in Section 4.2 and using the methods described in Sections 3.4 and 3.5. The results of the experiments are presented in Tables 4-9.

Datasets
We have used the following sentiment datasets (for detailed statistics, see Figure 5):
• IMDB [65] is an English dataset that has 50K movie reviews (with ~300 words per review on average) annotated with positive or negative labels. This dataset contains only highly polarized reviews (with a score of ≤ 4 out of 10 for negative and ≥ 7 out of 10 for positive). It is highly researched, with more than 1000 research papers using it. The task analyzed in this paper differs from traditional text classification, and it does not require a large annotated dataset. Therefore, we have randomly selected 5000 samples of positive and negative classes to create a new dataset used for our experiments.
• Sentiment140 [66] is an English dataset that has 1.6 million tweets extracted using the Twitter API and annotated with two classes (positive and negative). For our experiments, we randomly selected a subset of 5000 texts for each class.
The Sentiment140 and SemEval-2017 datasets are retrieved from the Twitter social network, and they contain emoji symbols and weblinks that were filtered out in the data pre-processing step.
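The two data preparation steps mentioned above, filtering out weblinks and emojis and drawing a random class-balanced subset, could look roughly like the sketch below. The regular expressions are our assumption (the emoji pattern covers only common pictograph blocks, not the full Unicode emoji set), the helper names are hypothetical, and the records are invented examples rather than dataset entries.

```python
import random
import re

# Weblinks: anything starting with http://, https://, or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji coverage (assumption): common pictograph and symbol blocks only.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def clean_tweet(text):
    """Remove weblinks and emojis, then collapse repeated whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = EMOJI_PATTERN.sub(" ", text)
    return " ".join(text.split())


def balanced_sample(records, per_class, seed=0):
    """Randomly pick the same number of (text, label) records for every label."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in records:
        by_label.setdefault(label, []).append((text, label))
    sample = []
    for label in sorted(by_label):
        sample.extend(rng.sample(by_label[label], per_class))
    return sample


# Invented records for illustration.
records = [(f"great movie {i} \U0001F600 http://t.co/x{i}", "positive") for i in range(10)]
records += [(f"awful movie {i} www.example.com", "negative") for i in range(10)]
subset = [(clean_tweet(t), y) for t, y in balanced_sample(records, per_class=3)]
```

Fixing the random seed keeps the selected subset reproducible across runs.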

Results
The results of experiments on zero-shot classification (first stage) are summarized in Table 4. We compared four zero-shot models (i.e., bart-large-mnli, Fb_improved_zeroshot, covid-twitter-bert-v2-mnli, and bart-large-mnli-yahoo-answers). The models were employed for zero-shot classification via a pipeline in Hugging Face's transformers package. The most accurate zero-shot model (bart-large-mnli), which gave the best accuracy of 0.747 on the Sentiment140 dataset with the single-model machine learning classifiers, was used in our further experiments.
Table 4. The impact of zero-shot models on the accuracy of machine learning classifiers for the binary sentiment classification with the Sentiment140 dataset. The best result is shown in bold.
Table 5 presents the accuracies of single-model machine learning and ensemble classifiers with different sets of emotions (from Table 3) on the SemEval-2017 dataset using three-class classification. The best overall accuracy was achieved by the stacking classifier on the first set of emotions (0.627).
Table 6 shows the accuracy of single-model machine learning and ensemble classifiers on the SemEval-2017 dataset (two-class classification without the neutral class) with different sets of emotions. The best overall accuracy was again achieved by the stacking classifier, this time on the third set of emotions (0.873).
Table 6. Accuracy of classifiers on the SemEval-2017 dataset (of two-class classification without considering the neutral class) with different sets of emotions. The best result is shown in bold.
Table 7 compares the accuracy of single-model machine learning and ensemble classifiers on the three analyzed datasets. The best overall accuracy was shared by the stacking classifier and FFNN on the SemEval-2017 dataset without the neutral class (0.873). The experiment with single-model and ensemble learning methods shows the superiority of ensemble methods (see Tables 5 and 6). This is explainable: they combine the knowledge from several classifiers. The highest accuracy for both the binary and 3-class classification problems was achieved with ensemble learning methods (0.873 and 0.627, respectively) using the SemEval-2017 dataset.

The confusion matrix for the three-class classification case is presented in Figure 6. Note that the most common misclassifications occur between the "adjacent" classes, i.e., between neutral and negative sentiments and between neutral and positive sentiments.
Table 8. Performance result comparison for binary and 3-class classification.
Table 9 shows some examples of misclassifications. Note that many misclassifications may have occurred due to mislabeling of the original text in the dataset.
Example text from Table 9: "Did anybody notice Jurassic World is currently the 3rd highest grossing film in domestic box office history Damm"

Ablation Study
To compare the traditional sentiment analysis classification approach with our proposed method, we performed an experiment on the SemEval-2017 dataset (without the neutral class) using a sentence transformer and single-model machine learning classifiers.
We analyzed different sizes of training datasets (from 100 to 1000 samples, see the vertical axis in Figure 7) with a fixed-size testing set of 500 instances. The results show that our proposed method achieves accuracy that is almost the same as, and in most cases better than, that of sentence transformers, while requiring only a small training dataset.

Figure 7. Accuracy vs. number of training instances for sentence transformer + machine learning classifiers and our proposed method.

Statistical Analysis
We analyzed the results statistically to compare our approach with the results achieved using sentence transformers (Figure 8). We used the ranking-based non-parametric Wilcoxon test. The improvement in accuracy was statistically significant for the Decision Tree (p < 0.001), FFNN (p < 0.001), KNN (p < 0.01), and Random Forest (p < 0.001) classifiers; however, there was no significant difference for the Logistic Regression and Naïve Bayes classifiers. Overall, the results of the Wilcoxon test indicate that the performance of the sentence transformers and that of the proposed two-stage semi-supervised methodology are statistically different.
Figure 9 shows the critical distance diagram from the post hoc Nemenyi test for the two-class and three-class classification scenarios. The best performance across the four emotion subsets was demonstrated by FFNN (mean rank 1.33) and the Histogram Gradient Boosting classifier (mean rank 2.88), although the performance of the other machine learning classifiers (excluding Bagging regressor and AdaBoost regressor) was not significantly different (within a critical distance of 10.534 for the 2-class classification scenario, and within a critical distance of 9.123 for the 3-class classification scenario).
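For reference, the critical distance used in a Nemenyi diagram is commonly computed with the standard formula below. The symbols are assumptions for illustration, since the text does not state them: $k$ is the number of compared classifiers, $N$ is the number of evaluation settings over which ranks are averaged, and $q_{\alpha}$ is the critical value derived from the Studentized range statistic at significance level $\alpha$.

```latex
\mathrm{CD} = q_{\alpha} \sqrt{\frac{k(k+1)}{6N}}
```

Two classifiers are considered significantly different when their mean ranks differ by at least $\mathrm{CD}$, which is why a large critical distance leaves most classifiers in the same group.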

Discussion
Previous studies (see the discussion in Section 2) have demonstrated that multilingual pre-trained transformer models can be adjusted for the sentiment analysis problem. These multilingual transformer models already store semantics about the languages they support, thus decreasing the need for very large supervised training datasets. However, text sentiments (positive, negative, neutral) often depend not only on the text content but also on different emotions (joy, sadness, anger, etc.) that are often mixed and ambiguous.
In this study, we assume that sentiment labels are easier to determine if we already know what emotion the text represents. For this reason, we solve the sentiment analysis problem in two stages: we detect emotions in the first stage and, based on them, detect the sentiments in the second. Our experimental investigation shows that such a methodology is effective. When detecting emotions, we rely on the zero-shot classification method, which does not require any training but returns the probabilities of emotions for the input text. These probabilities represent the strength/impact of the detected emotions in the text. We then map these emotion probabilities into one-hot encoding vectors, strengthening the impact of the dominant emotion. During the second stage, we train machine learning classifiers (single-model or ensemble) on the training data using the one-hot encodings as feature vectors. Thus, our method of solving the sentiment analysis problem is very different from the typical solutions (see a review of methods described in [2,4,5]), which rely on the textual content directly. By relying on the semantics kept in the zero-shot method and its ability to determine emotions, we reduce the need for larger training data (see Figure 7), which is important for resource-poor languages [68].
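The mapping from emotion probabilities to one-hot feature vectors can be sketched as below. This is a minimal illustration that assumes the mapping is a plain argmax (1 for the highest-scoring emotion, 0 elsewhere); the function name `to_one_hot` and the sample scores are hypothetical, and the exact mapping used in the experiments may differ.

```python
from typing import List, Sequence


def to_one_hot(scores: Sequence[float]) -> List[int]:
    """Set 1 for the highest-scoring emotion and 0 elsewhere (first index wins ties)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1 if i == best else 0 for i in range(len(scores))]


# Hypothetical zero-shot scores for two texts over three emotions.
zero_shot_scores = [[0.10, 0.70, 0.20], [0.55, 0.05, 0.40]]
features = [to_one_hot(row) for row in zero_shot_scores]
print(features)  # [[0, 1, 0], [1, 0, 0]]
```

The resulting vectors, paired with the sentiment labels, would then serve as the training input for the second-stage classifier.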
The proposed method can be further investigated and potentially improved by:
1. Applying a classification threshold. We have performed an error analysis of the misclassified instances, and most of them received the lowest probability scores for certain emotions (see Table 9). Correctly classified instances have the highest probability scores when classified using the zero-shot model. Therefore, setting a threshold on the emotion scores may increase the accuracy of the model: texts whose emotion scores fall below the threshold could potentially be assigned to the neutral class. In our classification, the most frequently misclassified class was the neutral class (see Figure 6), which can be confused with either the positive or the negative class.
2. Skipping one-hot vectorization. The current method transforms the outputs of the zero-shot method into one-hot encoding vectors used as features in the supervised training. We may expect an improvement if, instead of determining one dominant emotion, we provide the whole spectrum of emotion influence (i.e., the returned probabilities). The supervised machine learning model can then be trained on real values instead of binary (i.e., one-hot encoded) vectors.
In our experiments, we tested four sets of emotions. The third set achieved the best result compared to all other tested sets (see Table 6). Using a larger set of emotions and a different split of emotions into subsets may allow for improving the result.
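The first two proposed improvements can be sketched as follows. This is a hypothetical illustration of future work, not something evaluated in the paper: the threshold value of 0.5, the function names, and the sample scores are all assumptions.

```python
from typing import List, Optional, Sequence


def dominant_emotion(
    scores: Sequence[float], labels: Sequence[str], threshold: float
) -> Optional[str]:
    """Return the top emotion, or None when its score is below `threshold`.

    A None result marks the text as a candidate for the neutral class.
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best] if scores[best] >= threshold else None


def soft_features(scores: Sequence[float]) -> List[float]:
    """Keep the full probability spectrum as features instead of a one-hot vector."""
    return [float(s) for s in scores]


labels = ["joy", "anger", "sadness"]
print(dominant_emotion([0.80, 0.10, 0.10], labels, threshold=0.5))  # joy
print(dominant_emotion([0.40, 0.35, 0.25], labels, threshold=0.5))  # None
print(soft_features([0.40, 0.35, 0.25]))  # [0.4, 0.35, 0.25]
```

In practice, the threshold would need to be tuned on held-out data, since a value that is too high would push clearly emotional texts into the neutral class.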

Conclusions
In this paper, we have addressed the binary (positive, negative) and three-class (positive, negative, neutral) sentiment analysis problems for the English language, using three datasets for evaluation. Our proposed method differs substantially from how such tasks are usually solved. We formulate the sentiment analysis problem as a two-stage classification problem: the first stage determines emotions, and based on them, the second stage determines sentiments. The core of the first stage is the zero-shot transformer model, which does not require any training and can extract the probabilities of emotions for a given text. The second stage takes the zero-shot classification results, converts them into a one-hot encoding vector (used as features), and trains a supervised machine learning classifier.
In our experiments, we have investigated a large variety of machine learning methods, i.e., traditional machine learning, deep learning, single-model, and ensemble methods. The best accuracies, 0.87 and 0.63 for the binary and three-class classification problems, were achieved with the sets of 10 and 6 emotions, respectively. We have determined that the best zero-shot model is bart-large-mnli, and the best classifier is based on ensemble learning (a stacking classifier of Random Forest, KNN, decision tree, SVM, Naïve Bayes, and Support Vector Regression). Compared with previous research in [13], our proposed method achieved an improvement of 44%. The performance of our method is stable (differences are insignificant), even with small training datasets.
Our proposed method reduces both the effort of training vectorizers that map the text into a real vector space and the need for a large training dataset. Due to its simplified structure, under-researched languages can benefit from our research findings. Most importantly, our research validates that emotion detection can help to detect the sentiment of a given text.
In the future, we will consider testing (1) all possible emotions and (2) domain-dependent emotions. Theoretically, different emotions in different contexts and domains may lead to different sentiments. It would be interesting to test this idea experimentally.

Conflicts of Interest:
The authors declare no conflict of interest.