A New Sentiment-Enhanced Word Embedding Method for Sentiment Analysis

Since some sentiment words share similar syntactic and semantic features in a corpus, existing pre-trained word embeddings often perform poorly in sentiment analysis tasks. This paper proposes a new sentiment-enhanced word embedding (S-EWE) method to improve the effectiveness of sentence-level sentiment classification. This sentiment enhancement method takes full advantage of the mapping relationship between word embeddings and their corresponding sentiment orientations. The method first converts words to word embeddings and assigns a sentiment mapping vector to each word embedding. Then, word embeddings and their corresponding sentiment mapping vectors are fused into S-EWEs. After reducing the dimensions of the S-EWEs through a fully connected layer, the predicted sentiment orientations are obtained. The S-EWE method adopts the cross-entropy function to calculate the loss between predicted and true sentiment orientations, and backpropagates the loss to train the sentiment mapping vectors. Experiments show that the accuracy and macro-F1 values of six sentiment classification models using Word2Vec and GloVe with the S-EWEs are on average 1.07% and 1.58% higher than those without the S-EWEs on the SemEval-2013 dataset, and on average 1.23% and 1.26% higher on the SST-2 dataset. Among all baseline models with S-EWEs, the convergence time of the attention-based bidirectional CNN-RNN deep model (ABCDM) with S-EWEs decreased significantly, by 51.21% compared with that of ABCDM on the SemEval-2013 dataset, and the convergence time of CNN-LSTM with S-EWEs decreased by 41.34% compared with that of CNN-LSTM on the SST-2 dataset. In addition, the S-EWE method is not effective for contextualized word embedding models. The main reason is that the S-EWE method only enhances the embedding layer of the models and has no effect on the models themselves.


Introduction
Word embedding maps a word into a vector space and plays an important role in natural language processing (NLP). Word vectors contain rich information, such as local contextual information [1], global co-occurrence information [2], and global contextual information [3]. Moreover, word embedding can be applied to many NLP tasks, such as sentiment analysis [4][5][6][7][8], part-of-speech tagging [9], and named entity recognition [10], as well as to many interdisciplinary tasks, such as influence maximization [11] and emotion role identification [12].
Word embedding learned by deep learning carries contextual semantic and syntactic information, which is beneficial for NLP tasks. Sitaula et al. [13] evaluated many machine learning and deep learning methods on sentiment analysis tasks. However, for feature-based word embedding, such as Word2Vec [1], if two words with opposite sentiment polarities have similar contexts in a corpus, the performance of the word embedding may suffer [14,15]. This is because feature-based word embeddings assign only a single word embedding to each word and cannot generate word embeddings based on the context in downstream tasks. For example, the words "good" and "terrible" often occur in similar contexts, so their embeddings end up close together despite their opposite sentiment polarities. Injecting sentiment information into word embeddings to address this raises two problems:
1. Since sentiment word embeddings and sentiment orientations have different dimensions, they are hard to fuse into one vector.
2. Since sentiment orientations and word embeddings belong to two different vector spaces, sentiment classification models cannot directly operate on them.
To solve the above problems, we seek a method that builds the mapping relationship between words and their sentiment orientations in sentiment lexicons and fuses sentiment information into these words. Fortunately, we are inspired by translations in the embedding space (TransE) [19] and contrastive learning [20,21]. TransE is a knowledge graph embedding method. It shortens the distance between two related entities and increases the distance between two unrelated entities in vector space. Meanwhile, contrastive learning narrows the distance between positive examples and increases the distance between negative examples in vector space. Bootstrap your own latent (BYOL) [20] minimizes the distance between two similar images; supervised contrastive pre-training (SCAPT) [21] used supervised contrastive learning to cluster explicit and implicit positive sentiments, cluster explicit and implicit negative sentiments, and separate these two clusters. Borrowing these ideas, we propose a novel sentiment-enhanced word embedding (S-EWE) method to improve the performance of sentiment classification models. Specifically, we first convert the words in a sentiment lexicon to word embeddings and assign each word embedding a sentiment mapping vector. Second, we add the original word embeddings and their sentiment mapping vectors to obtain sentiment-enhanced word embeddings that carry both word and sentiment information. Thirdly, we adopt one fully connected layer to reduce the dimensions of the sentiment-enhanced word embeddings and obtain the predicted sentiment orientations. Finally, we calculate the loss between the predicted and true sentiment orientations with the cross-entropy function. We further train the sentiment mapping vectors by backpropagating the loss, so that the trained sentiment mapping vectors capture the mapping relationship between words in sentiment lexicons and their sentiment orientations.
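As a minimal sketch of this fusion idea (pure Python with toy dimensions; the vector values, 8-dimensional embeddings, and 3-class label set are illustrative assumptions, not the paper's actual settings), a single forward pass looks like:

```python
import math
import random

random.seed(0)
ED = 8  # toy embedding size (assumption; real embeddings are larger)
K = 3   # toy label set: 0 = negative, 1 = neutral, 2 = positive

def rand_vec(d):
    return [random.uniform(-0.5, 0.5) for _ in range(d)]

w_happy = rand_vec(ED)      # frozen word embedding of "happy"
nabla_happy = rand_vec(ED)  # trainable sentiment mapping vector

# Fuse by element-wise addition -> sentiment-enhanced word embedding.
e_happy = [w + n for w, n in zip(w_happy, nabla_happy)]

# One fully connected layer reduces ED dimensions to K class logits.
W = [rand_vec(ED) for _ in range(K)]
b = [0.0] * K
logits = [sum(w * e for w, e in zip(row, e_happy)) + bi
          for row, bi in zip(W, b)]

# Softmax, then cross-entropy against the true label "positive" (index 2);
# this loss is what gets backpropagated into nabla_happy.
m = max(logits)
exps = [math.exp(z - m) for z in logits]
probs = [v / sum(exps) for v in exps]
loss = -math.log(probs[2])
```

The training loop that actually updates the sentiment mapping vector is described in Section 3.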
Under different sentiment classification models, we confirmed the effectiveness of the S-EWE method. The main contributions are as follows:

1. A new sentiment-enhanced word embedding method is proposed. This method establishes the mapping relationship between words and their sentiment orientations by vector addition.
The subsequent parts of this paper are organized as follows. Section 2 introduces the related work. The sentiment-enhanced word embedding method is shown in Section 3. Datasets, empirical results, and model analysis are presented in Section 4. Finally, we summarize our paper and put forward a future outlook in Section 5.

Word Embedding
Word embedding is a knowledge enhancement technique because it can capture syntactic and semantic information in textual corpora. Word2Vec [1] has two models: a continuous skip-gram model (Skip-gram) and a continuous bag-of-words model (CBOW). Both use sliding windows to capture contextual information between words. Word2Vec achieved the best results on the Semantic-Syntactic Word Relationship dataset and the Microsoft Sentence Completion Challenge dataset. However, Word2Vec could not capture global information about words. GloVe [2] used a co-occurrence probability matrix to obtain the global information of words, and also used sliding windows to capture contextual information between words. Experimental results showed that GloVe achieved state-of-the-art performance on word analogy, similarity, and named entity recognition tasks. Since word embeddings trained by Word2Vec and GloVe assign only one vector to each word, they could not solve the ambiguity problem of words. In addition, Word2Vec and GloVe could only capture the contextual information of words within a window due to limited computing power. With increased computing power, the embedding from language models (ELMo) [29] used two reversed recurrent neural network (RNN) layers to encode one input sequence. For a word i, the forward RNN captured the preceding context of the word i, and the backward RNN captured the following context of the word i. By concatenating the forward and backward hidden states of the word i, its contextual information was obtained. The ELMo model achieved remarkable performance on the CoNLL-2003 NER and CoNLL-2000 chunking tasks. Owing to further improvements in computing power, BERT [3] used bidirectional transformers [30] to encode the information of a whole sentence. Since the self-attention mechanism [30] can capture the information of an entire sequence, the contextual information between words is no longer limited to the sliding window.
BERT broke all records for the general language understanding evaluation (GLUE), with an average improvement of 6.7% over the other models. Moreover, BERT achieved significant performance on SQuAD, CoNLL-2003, and SWAG datasets. Liu et al. [31] averaged Chinese and English word embeddings in multilingual BERT (m-BERT) [3], respectively. They obtained the Chinese-English difference vector by subtracting the average English vector from the average Chinese vector. The experimental results showed that a Chinese word embedding minus the difference vector could get English word embedding and vice versa.
Even though word embedding has shown its power in many natural language processing tasks, it cannot distinguish well the sentiment orientations of semantically similar words. Singh et al. [32] extracted hashtags, mentions, and keywords from tweets. They performed centrality-aware random walks on them to obtain their word/node embeddings. Then, they put the embeddings into a deep learning model for sentiment classification. The performance of their method is better than those of the original baseline models. Sent2Vec [33] used the smoothed inverse frequency (SIF) [34] or the unsupervised smoothed inverse frequency (uSIF) [35] to encode a sentence into a vector. Then, this vector is input into a multi-layer perceptron (MLP) to learn sentimental embedding. These methods could achieve good results in different sentiment classification tasks based on the syntactic and semantic knowledge inside a dataset. However, they could not utilize knowledge external to the dataset.

Knowledge Enhancement
Knowledge enhancement is a way of injecting external knowledge into models, improving the performances of the models on downstream tasks. Liang and Yi [36] proposed the two-stage three-way enhanced technique for inclusive policy text classification. They first used an ensemble convolutional neural network to extract text representations in the first stage. Then, they used a new determination method to reduce the decision risk. If the classification accuracy at the first stage was not within the confidence interval, they used a traditional machine learning technique to reclassify texts. They conducted experiments on the Chengdu and Xiamen datasets, and their experimental results showed that their method outperformed baseline models. Bai et al. [37] proposed multi-view document clustering with enhanced semantic embedding (MDCE) to address the sparseness, high dimensionality, and inconsistency problems of document clustering. This model used two deep neural networks to obtain document representations and neighbor representations. Then, the model adopted one fusion layer to fuse the different representations and used a clustering layer to output the cluster results. The proposed MDCE model achieved the best performance on the Aminer, Aminer (700), 3-sources, Reuters, and Multisource news datasets. ERNIE [38] incorporated knowledge graphs (KGs) into BERT [3]. It used two encoders [30] to encode tokens and KGs obtained by TransE [19], respectively. It also adopted a fusion layer to fuse these two representations into the same vector space. This method achieved excellent performance on the FIGER, Open Entity, FewRel, and TACRED datasets. It also obtained results comparable with BERT on GLUE. However, it took much time in the process of entity alignment.
To shorten the running time of entity alignment, the knowledge embedding and pre-trained language representation (KEPLER) [39] used the robustly optimized BERT approach (RoBERTa) [40] to represent the textual descriptions of entities in knowledge graphs. This method combined knowledge graph and text representations, achieving better performance than ERNIE on the TACRED, FewRel, and OpenEntity datasets. Saxena et al. [41] proposed a temporal question answering system. They first extracted knowledge from temporal knowledge graphs. Then, they used their proposed temporal KGE model to obtain temporal knowledge representations. They further used BERT [3] to obtain text representations. Finally, they fused the temporal knowledge representations and text representations to answer questions. Their model achieved strong performance on the CRONQUESTIONS dataset. Even though injecting external knowledge into models could improve the performances of different models [38,42,43], none of the above models adopted sentiment information.

Sentiment Enhancement
Recently, some scholars have used sentiment as external knowledge. Shi et al. [44] proposed a sentiment-enhanced neural graph recommender. This model applied an attention mechanism [30] to extract a representation of a user's review and an item review representation. They further extracted another user representation and item representation based on a graph convolutional network (GCN) [45]. The model obtained the final user and item representations by concatenating the two user representations and the two item representations. Then, the model adopted a fully connected network to extract an auxiliary sentiment representation from the output of the self-attention network. Finally, they trained their sentiment-enhanced neural graph recommender with a loss between the auxiliary sentiment representation and the user-item representation. Their experiments achieved the best performance on the Toys, Kindle, Yelp2017, and Yelp2018 datasets. However, this model only adopted sentiment as an additional feature to implement a recommendation system; it did not enhance word embedding with sentiments. Gavilanes et al. [46,47] proposed an unsupervised system with sentiment propagation across dependencies (USS-PAD) model [48]. Their method found the relationship between each emoji's description and sentiment orientation. The classification results of their model are close to those of manual annotations. Yu et al. [15] used k-nearest neighbors (KNN) to pull positive samples closer and push negative samples away in a sentiment lexicon. Their refinement method outperformed the original baseline models on the SemEval-2013 and SST-2 datasets. Wei et al. [49] used a sentiment lexicon as external sentiment knowledge and injected this knowledge into a bidirectional long short-term memory (BiLSTM) to analyze implicit sentiment. They achieved the best results on the SMP2019-ECISA, COAE (2015), and SemEval (2013-2017) datasets. Wang et al. [6] used a sentiment lexicon and labeled sentiment corpora as the pre-training corpora for two bidirectional gated recurrent unit (BiGRU) layers. They pre-trained their model with three objectives: predicting target words, predicting word sentiment, and predicting sentence sentiment. After pre-training, they concatenated word embeddings and the hidden states of the two BiGRUs to obtain sentiment-enhanced word embeddings. The experimental results showed that their model achieved results comparable to contextualized word embedding models (such as BERT) on three datasets with only 1-7 M parameters. However, these models treated sentiment as an additional feature or combined it with other features to analyze sentiment; few scholars have injected sentiment information into word embeddings themselves.
In the field of injecting sentiment into word embedding, Sent2Vec [33] used SIF [34] or uSIF [35] to encode a sentence into a vector. The vector is then input into a fully connected network to learn sentimental embedding. This method achieved better results than other word embeddings on the DBpedia and Yahoo datasets. Naderalvojoud and Sezer [14] proposed two approaches for enhancing sentiment: Approach 1 adopted an MLP to find the mapping relationships between words and their sentiment intensities, and Approach 2 trained Word2Vec [1] and GloVe [2] with an additional sentiment vector whenever a target word is a sentiment word. Both approaches achieved better results than Word2Vec and GloVe on the SemEval-2013 and SST datasets. Overall, only a few studies have focused on injecting sentiment information into word vectors.

The Proposed Method and Its Applications on Downstream Tasks
This section illustrates a novel sentiment-enhanced word embedding method and its applications on downstream tasks. Figure 2 shows their workflows. For example, in the pre-training stage, given the word "happy" and its sentiment orientation "1" in a sentiment lexicon, we assign "happy" a randomly initialized sentiment mapping vector ∇_happy. Then, we add the word embedding of "happy" to the sentiment mapping vector and obtain a sentiment-enhanced word embedding. We feed it to a fully connected layer to predict the probability of the sentiment label of "happy". By computing the loss between the predicted sentiment label and the gold standard (i.e., 1), we can train ∇_happy by backpropagation. The trained ∇_happy then serves as the unique sentiment mapping vector for "happy". In the application stage, we add the well-trained ∇_happy to the original word embedding of "happy" to obtain its sentiment-enhanced embedding, which we apply to the embedding layer of the downstream tasks.

The Sentiment-Enhanced Word Embedding Method
The S-EWE method aims to establish the mapping relationship between words in sentiment lexicons and their sentiment orientations, so as to obtain trained sentiment-enhanced word embeddings for downstream tasks. When we convert a word to a word embedding, the dimensions of the word embedding are much larger than those of the sentiment orientation, and the two belong to different vector spaces. Therefore, we assign a sentiment mapping vector to the word embedding. This sentiment mapping vector helps the word embedding establish the mapping relationship to its sentiment orientation.
Let V be the set of all words in a sentiment lexicon, and Y be the set of the sentiment orientations of all words in the sentiment lexicon. Given a word w_i ∈ V, its word embedding w_i, and its sentiment orientation y_i ∈ Y, we assign a sentiment mapping vector ∇_i to the word embedding w_i. The goal of the S-EWE method is then to solve the mapping relationship function among y_i, w_i, and ∇_i, denoted by f(y_i | w_i, ∇_i). Figure 3 shows the running process of the S-EWE method. The method implements f(y_i | w_i, ∇_i) with four modules: the embedding layer, the addition layer, the MLP layer, and the loss function, as follows.
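In symbols, and assuming a softmax output activation (a natural reading given the cross-entropy loss used later, though not spelled out at this point), the four modules compose as:

```latex
f(y_i \mid \mathbf{w}_i, \nabla_i)
  = \operatorname{softmax}\!\left(\mathbf{W}\,(\mathbf{w}_i + \nabla_i) + \mathbf{b}\right)
```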

Embedding Layer
The embedding layer converts words to word embeddings, which is the first step in finding the relationship between w_i and y_i. Let the sentiment words be w = {w_1, w_2, ..., w_n}, where w_i ∈ V and n is the total number of sentiment words, and let their sentiment orientations be {y_1, y_2, ..., y_n} ⊆ Y. Then, we first turn the words into word embeddings as follows:

w_i = embed(w_i),

where w_i ∈ R^ed is the word embedding of the word w_i (ed is the dimension of the embedding size), and embed(·) is the function that turns words into their word embeddings.
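A toy embed(·) can be sketched as a frozen lookup table (the words and vector values below are illustrative assumptions, not actual pre-trained values):

```python
# A frozen lookup table standing in for a pre-trained embedding matrix.
embedding_table = {
    "happy":    [0.9, 0.1, -0.3],
    "terrible": [0.8, 0.2, -0.4],  # similar context can yield a similar vector
}

def embed(word):
    """embed(.): map a word to its (frozen) pre-trained embedding w_i."""
    return embedding_table[word]

vectors = [embed(w) for w in ["happy", "terrible"]]
```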

Addition Layer
The addition layer implements simple addition between vectors. We define a sentiment mapping matrix R = {∇_1, ∇_2, ..., ∇_n}, where ∇_i ∈ R^ed is the mapping relationship vector between w_i and y_i. Let E = {e_1, e_2, ..., e_n} be the sentiment-enhanced word embeddings of the sentiment words w = {w_1, w_2, ..., w_n}, where e_i ∈ R^ed. Then, by adding the word embedding w_i to its sentiment mapping vector ∇_i, we obtain the sentiment-enhanced word embedding e_i of the word w_i as follows:

e_i = w_i + ∇_i.

Then, the sentiment-enhanced word embeddings of w are as follows:

E = {w_1 + ∇_1, w_2 + ∇_2, ..., w_n + ∇_n}.
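The addition layer itself is just element-wise vector addition; a minimal sketch (the vector values are assumptions for illustration):

```python
def addition_layer(w_i, nabla_i):
    """e_i = w_i + nabla_i; the dimensionality ed is unchanged."""
    assert len(w_i) == len(nabla_i)
    return [a + c for a, c in zip(w_i, nabla_i)]

w1 = [0.9, 0.1, -0.3]    # word embedding (toy values)
n1 = [0.05, -0.2, 0.1]   # its sentiment mapping vector (toy values)
e1 = addition_layer(w1, n1)
```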

MLP Layer
The MLP layer contains one fully connected network. It reduces the dimensions of the sentiment-enhanced word embeddings and maps them and their sentiment orientation labels into the same vector space. Since the addition of vectors does not change the vectors' dimensions, the dimensions of e_i are still much larger than those of y_i. We use one MLP to reduce the dimensionality of the sentiment-enhanced word vector; the MLP maps the sentiment-enhanced word embedding and label to the same vector space [38]. The predicted sentiment orientation label of the word w_i, denoted by ŷ_i, is as follows:

ŷ_i = softmax(W e_i + b),

where W and b are the weight matrix and bias of the output layer, respectively.
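A sketch of the MLP layer as a single fully connected projection followed by softmax (the weights and input values below are illustrative assumptions):

```python
import math

def mlp_layer(e_i, W, b):
    """Project e_i (dim ed) down to k sentiment logits, then softmax."""
    logits = [sum(w * x for w, x in zip(row, e_i)) + bi
              for row, bi in zip(W, b)]
    m = max(logits)               # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [v / s for v in exps]  # predicted distribution ŷ_i

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # toy 2-class weight matrix (ed = 3)
b = [0.0, 0.0]
y_hat = mlp_layer([0.95, -0.1, -0.2], W, b)
```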

Loss Function
The loss function measures the difference between the true sentiment orientation labels in a sentiment lexicon and the predicted sentiment orientation labels of words at the final module of the S-EWE method. We use the cross-entropy function to calculate the loss:

L = −(1/N) Σ_{i=1}^{N} y_i · log(ŷ_i),

where N is the total number of samples and y_i is the one-hot vector of the true sentiment orientation label. In order to better learn the mapping relationship between original word embeddings and their true sentiment orientation labels, the S-EWE method freezes the word embedding layer, so all word embeddings and true sentiment orientation labels are fixed. The parameters W, b, and the sentiment mapping matrix are trainable. By backpropagation, the method can better learn the sentiment mapping matrix and find the mapping relationship between original word embeddings and their corresponding sentiment orientations. After training, we obtain the well-trained sentiment mapping matrix, denoted by R^t = {∇^t_1, ∇^t_2, ..., ∇^t_n}. Finally, the well-trained sentiment-enhanced word embeddings, denoted by E^t = {e^t_1, e^t_2, ..., e^t_n}, are obtained as follows:

e^t_i = w_i + ∇^t_i.
The detailed running procedure of the S-EWE method is described in Algorithm 1.
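The whole training procedure can be sketched end to end as follows (pure Python with manual gradients; the toy lexicon, dimensions, learning rate, and epoch count are illustrative assumptions). Word embeddings stay frozen; only the sentiment mapping vectors and the output-layer parameters W and b are updated:

```python
import math
import random

random.seed(1)
ED, K, LR, EPOCHS = 6, 3, 0.3, 300  # toy sizes/hyperparameters (assumptions)

# Toy lexicon: word -> sentiment orientation (0 = neg, 1 = neu, 2 = pos).
lexicon = {"happy": 2, "sad": 0, "table": 1}

# Frozen word embeddings and trainable sentiment mapping vectors.
emb = {w: [random.uniform(-0.5, 0.5) for _ in range(ED)] for w in lexicon}
R = {w: [0.0] * ED for w in lexicon}
W = [[random.uniform(-0.1, 0.1) for _ in range(ED)] for _ in range(K)]
b = [0.0] * K

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(word):
    e_vec = [a + c for a, c in zip(emb[word], R[word])]  # addition layer
    logits = [sum(w * x for w, x in zip(row, e_vec)) + bi
              for row, bi in zip(W, b)]                  # MLP layer
    return e_vec, softmax(logits)

for _ in range(EPOCHS):
    for word, y in lexicon.items():
        e_vec, p = forward(word)
        g = [p[j] - (1.0 if j == y else 0.0) for j in range(K)]  # dL/dlogits
        # Backpropagate into nabla only (embeddings frozen): dL/dnabla = W^T g.
        for k in range(ED):
            R[word][k] -= LR * sum(g[j] * W[j][k] for j in range(K))
        # Update the output layer W and b.
        for j in range(K):
            for k in range(ED):
                W[j][k] -= LR * g[j] * e_vec[k]
            b[j] -= LR * g[j]

def predict(word):
    _, p = forward(word)
    return max(range(K), key=lambda j: p[j])
```

In practice this gradient flow would live in a deep learning framework rather than hand-written loops, but the update rule for the sentiment mapping vectors is the same.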

Applications of the Sentiment-Enhanced Word Embedding Method on Downstream Tasks
Since the S-EWE method needs to work with a sentiment lexicon, we need a robust sentiment lexicon. We selected the extended version of the Affective Norms of English Words (E-ANEW) [50] and the Subjectivity Clue Lexicon [51] as our base sentiment lexicons, for they are well-known English sentiment lexicons. For the E-ANEW [50], we define sentiment words with sentiment intensity in [1, 5) as negative, sentiment words with sentiment intensity of 5 as neutral, and sentiment words with sentiment intensity in (5, 9] as positive. For the Subjectivity Clue Lexicon [51], we directly extract the sentiment orientation (positive, neutral, or negative).
To improve the robustness of these two lexicons, we adopted the following four rules to integrate them: (1) if a word appears in only one lexicon, then the word is directly added into the fused lexicon; (2) if a word has the same sentiment orientation in the two lexicons, then the word is directly added into the fused lexicon; (3) if the sentiment orientation of a word is neutral in one lexicon and not neutral in the other, then the word is assigned the non-neutral sentiment orientation and added into the fused lexicon; and (4) if a word has opposite sentiment orientations (positive and negative) in the two lexicons, then the word is discarded and does not belong to the fused lexicon. After applying the above four rules, we obtain the fused lexicon. Table 1 shows the details of the E-ANEW, Subjectivity Clue Lexicon, and fused lexicon, respectively. In most cases, not all words in the sentiment lexicon appear in the corpus. Training on words absent from the corpus wastes storage space and may not yield good sentiment-enhanced word embeddings. Inspired by sentiment-aware word embeddings (SAWE) [14], we extract all words from the corpus and build a final lexicon by the following two rules: (1) if a word appears in both the corpus and the fused lexicon, we directly add it and its corresponding sentiment orientation in the fused lexicon to the final lexicon; (2) if a word appears in the corpus but not in the fused lexicon, we add it to the final lexicon with the neutral sentiment.
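The four fusion rules and the two final-lexicon rules can be sketched directly (the example words below are assumptions for illustration, not entries of the actual lexicons):

```python
def fuse_lexicons(lex_a, lex_b):
    """Apply the four fusion rules to two {word: orientation} lexicons,
    with orientations in {'positive', 'neutral', 'negative'}."""
    fused = {}
    for word in set(lex_a) | set(lex_b):
        a, b = lex_a.get(word), lex_b.get(word)
        if a is None or b is None:        # rule 1: in only one lexicon
            fused[word] = a if a is not None else b
        elif a == b:                      # rule 2: same orientation
            fused[word] = a
        elif "neutral" in (a, b):         # rule 3: neutral vs. non-neutral
            fused[word] = b if a == "neutral" else a
        # rule 4: positive vs. negative -> the word is discarded
    return fused

def build_final_lexicon(corpus_words, fused):
    """Final-lexicon rules: keep fused entries for corpus words (rule 1),
    default the remaining corpus words to neutral (rule 2)."""
    return {w: fused.get(w, "neutral") for w in corpus_words}

fused = fuse_lexicons(
    {"good": "positive", "odd": "neutral", "bad": "negative"},
    {"good": "positive", "odd": "negative", "bad": "positive"},
)
final = build_final_lexicon(["good", "bad", "unseen"], fused)
```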
Based on the fused and final lexicons, the sentiment-enhanced word embeddings are applied to downstream tasks for sentiment analysis, as shown in Figure 4. When the sentiment-enhanced word embeddings are trained by the S-EWE method on the fused lexicon (respectively, the final lexicon), we denote the well-trained embeddings by S-EWE_f (respectively, S-EWE_c). Figure 4a shows how we apply S-EWE_f to downstream tasks, and Figure 4b shows how we apply S-EWE_c to downstream tasks. When the fused lexicon does not contain some words in the corpus, we adopt the following two methods to build the embedding layer of the downstream sentiment analysis models (Figure 4a): (1) if a word is in the fused lexicon, we input its well-trained sentiment-enhanced word embedding into the embedding layer of the downstream tasks; (2) if a word is not in the fused lexicon, we input its original word embedding into the embedding layer of the downstream tasks. When the final lexicon contains all words in the corpus (Figure 4b), we can directly input the well-trained sentiment-enhanced word embeddings trained on the final lexicon into the embedding layer of the downstream tasks.
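The Figure 4a fallback logic amounts to a lookup with a default (the function name and toy vectors are illustrative assumptions):

```python
def embedding_layer_entry(word, sewe_f, original_embeddings):
    """Use the S-EWE_f vector when the word is in the fused lexicon,
    otherwise fall back to the original word embedding (Figure 4a case);
    with the final lexicon (Figure 4b) every corpus word has an S-EWE_c."""
    if word in sewe_f:
        return sewe_f[word]
    return original_embeddings[word]

sewe_f = {"happy": [1.0, 0.0]}                       # trained on fused lexicon
orig = {"happy": [0.9, 0.1], "chair": [0.2, 0.3]}    # original embeddings
row_in = embedding_layer_entry("happy", sewe_f, orig)
row_out = embedding_layer_entry("chair", sewe_f, orig)
```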

Datasets and Evaluation Metrics
The SemEval-2013 [27] and SST-2 [28] datasets were selected as experimental datasets (the preprocessed SST-2 data are at https://github.com/clairett/pytorch-sentiment-classification, accessed on 26 November 2021). The SemEval-2013 dataset is a Tweet sentiment analysis dataset containing five categories of labels (positive, negative, neutral, objective, and objective-or-neutral). Our experiments only considered the positive and negative data, denoted SemEval-2013 (binary). The SST-2 dataset contains movie reviews with two labeling schemes: (1) positive and negative; and (2) very negative, negative, neutral, positive, and very positive. From this dataset, our experiments only used the positive and negative data. Tables 2 and 3 show the details of all the datasets. The accuracy and macro-F1 score [14] were adopted to evaluate the performance of the proposed S-EWE method. Let TP (resp. TN) be the number of samples whose true labels and model prediction labels are both positive (resp. negative), and let FN (resp. FP) be the number of samples whose true labels are positive (resp. negative) and model prediction labels are negative (resp. positive). Accuracy and macro-F1 score are defined as

Accuracy = (TP + TN) / (TP + TN + FP + FN)

and

macro-F1 = (1/k) Σ_{i=1}^{k} F1_i, with F1_i = 2 P_i R_i / (P_i + R_i),

where P_i, R_i, and F1_i stand for the precision, recall, and F1 scores of the category i, and k is the total number of categories.
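The two metrics can be computed as follows (a straightforward sketch; the multi-class macro-F1 shown here reduces to the binary case for k = 2):

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores over the k classes."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)
```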
TextCNN. TextCNN [22] convolves word vectors with convolution kernels and applies max pooling to the resulting feature maps.
TextRNN. TextRNN [23] is a bidirectional LSTM. It takes word embeddings as input and outputs the hidden state of each word.

TextRCNN. TextRCNN [24] uses an LSTM (resp. BiLSTM) to obtain the forward (resp. bidirectional) hidden state of each word. It applies max pooling to get sentence embeddings.
ABCDM. ABCDM [25] adopts BiLSTM and BiGRU to encode one sentence to obtain long-term and short-term hidden states. Then, it uses two parallel attention layers to obtain the attention score of each hidden state. Finally, ABCDM applies 1D-CNN, 1D-max pooling, and 1D-average pooling to get the final sentence embedding.

Experimental Results
Let the sentiment classification (SC) models enhanced by S-EWE_c and S-EWE_f be denoted by SC_S-EWE_c and SC_S-EWE_f, respectively. When Word2Vec was used to pre-train word embeddings for sentiment classification tasks on the SemEval-2013 [27] and SST-2 [28] datasets, the accuracies and macro-F1 scores of the sentiment classification models TextCNN, TextRNN, TextRCNN, TextBiRCNN, ABCDM, and CNN-LSTM with and without S-EWE_c and S-EWE_f are shown in Table 5. We conclude that: (1) the sentiment classification models with Word2Vec enhanced by S-EWE_f and S-EWE_c had higher accuracies and macro-F1 scores than those with the original Word2Vec, except for TextRNN and CNN-LSTM; (2) in most cases, the sentiment classification models with Word2Vec enhanced by S-EWE_c achieved the best accuracies and macro-F1 scores on the two datasets; (3) TextRNN with Word2Vec enhanced by S-EWE_f and S-EWE_c achieved the same accuracies and macro-F1 scores as TextRNN with the original Word2Vec; that is, the S-EWE method did not improve the performance of TextRNN on the two datasets; and (4) on the two datasets, the variance of the results obtained by the models using the proposed method was small, indicating that the S-EWE method can make the models more stable.
When GloVe was used to pre-train word embeddings for sentiment classification tasks on the SemEval-2013 [27] and SST-2 [28] datasets, the accuracies and macro-F1 scores of the sentiment classification models TextCNN, TextRNN, TextRCNN, TextBiRCNN, ABCDM, and CNN-LSTM enhanced by S-EWE_c and S-EWE_f are shown in Table 6. From these results, we can conclude that: (1) the sentiment classification models with GloVe enhanced by S-EWE_f and S-EWE_c had higher accuracies and macro-F1 scores than those with the original GloVe, except for TextRNN; (2) in most cases, the sentiment classification models with GloVe enhanced by S-EWE_c achieved the best accuracies and macro-F1 scores on the two datasets; (3) TextRNN is not improvable on these two datasets with the proposed method; and (4) models using our method achieved more robust results on both datasets. Table 5. The accuracies and macro-F1 scores of different models using Word2Vec with and without S-EWE_c and S-EWE_f on the SemEval-2013 and SST-2 datasets (%). For each model, we ran it five times to obtain five accuracy values and five macro-F1 values, respectively, and took their averages as the final results. The value in parentheses is the variance of the five values for each model. The bold text represents the best result achieved by each model, and the underlined text stands for the lowest variance achieved by each model.

Table 6. The accuracies and macro-F1 scores of different models using GloVe with/without S-EWE_c and S-EWE_f on the SemEval-2013 and SST-2 datasets (%). The accuracies and macro-F1 values were obtained in the same way as those in Table 5. Moreover, the values in parentheses, bold text, and underlined text have the same meanings as in Table 5.

As shown in Tables 5 and 6, TextCNN, TextRCNN, TextBiRCNN, ABCDM, and CNN-LSTM with Word2Vec or GloVe, enhanced by S-EWE_c and S-EWE_f, achieved better classification performances than the same models without the sentiment-enhanced word embeddings on the two datasets. In particular, the models with S-EWE_c obtained better classification performances than those with S-EWE_f. TextRNN with Word2Vec or GloVe enhanced by S-EWE_c and S-EWE_f kept the same classification performance as the model without the sentiment-enhanced word embedding on the two datasets. Meanwhile, models using S-EWE_c and S-EWE_f achieved more robust results on the two datasets.

Analysis of Convergence Time for Downstream Tasks
The convergence times of TextCNN, TextRNN, TextRCNN, TextBiRCNN, ABCDM, and CNN-LSTM using Word2Vec and GloVe with or without S-EWE_c and S-EWE_f on the SemEval-2013 and SST-2 datasets were analyzed on one NVIDIA 2080Ti (12 GB), as shown in Table 7. For all baseline models using Word2Vec enhanced by S-EWEs on the SemEval-2013 dataset, the convergence times of TextCNN_S-EWE_c, TextRCNN_S-EWE_c, and TextBiRCNN_S-EWE_c increased by 22.56% on average over their original models, whereas the convergence times of TextRNN_S-EWE_c, ABCDM_S-EWE_c, and CNN-LSTM_S-EWE_c decreased by 31.16% on average. Meanwhile, for all baseline models using Word2Vec enhanced by S-EWEs on the SST-2 dataset, the convergence time of CNN-LSTM_S-EWE_c showed the largest increase, 440.32% over that of CNN-LSTM; the convergence time of TextRNN_S-EWE_c was 0.35% lower than that of TextRNN; and the convergence times of TextCNN_S-EWE_c, TextRCNN_S-EWE_c, TextBiRCNN_S-EWE_c, and ABCDM_S-EWE_f increased by 21.98% on average over their original models.
For all baseline models using GloVe enhanced by S-EWEs on the SemEval-2013 dataset, the convergence times of TextCNN, TextRCNN, TextBiRCNN, and ABCDM with S-EWE_c increased by 50.17% on average compared to their original models, while the convergence time of CNN-LSTM with S-EWE_c decreased by 26.24% compared to CNN-LSTM. Meanwhile, for all baseline models using GloVe enhanced by S-EWEs on the SST-2 dataset, the convergence times of TextBiRCNN and ABCDM with S-EWE_c increased by 19.67% on average compared to their original models, whereas the convergence times of TextCNN, TextRCNN, and CNN-LSTM with S-EWE_c decreased by 25.34% on average.
Across all baseline models with the S-EWE method, the convergence time of ABCDM with S-EWE_f decreased significantly, by 51.21% compared to ABCDM, on the SemEval-2013 dataset, and the convergence time of CNN-LSTM with S-EWE_c decreased substantially, by 41.34% compared to CNN-LSTM, on the SST-2 dataset.
Based on the classification performance of the various sentiment classification models and their convergence times, we conclude that the proposed S-EWE method can improve sentiment classification on the SemEval-2013 and SST-2 datasets.

Comparisons with Contextualized Word Embedding Models and BERT with/without S-EWEs
The contextualized word embedding models BERT [3] and CoSE-T [6] were selected for comparison with the sentiment-enhanced word embedding method.
BERT [3]. BERT is a classic transformer-based [30] model whose backbone is the transformer encoder. It adopts masked language modeling (MLM) and next sentence prediction (NSP) as its pre-training objectives. For the downstream sentiment analysis task, the output hidden state of the special [CLS] token passes through a pooler layer to produce the classification result. We adopt BERT-base as our experimental model.
CoSE-T [6]. CoSE-T has two BiGRU layers. In the pre-training stage, the target words, word sentiment, and sentence sentiment are predicted by training on a labeled sentiment corpus with a sentiment lexicon. In the fine-tuning stage, the pre-trained word embedding and the hidden states of the two BiGRUs are concatenated as the sentiment representation. Finally, this representation is fed into a fully connected layer with a softmax function to predict the sentence sentiment.
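The [CLS]-based classification step described above can be sketched as follows. This is a minimal NumPy illustration, not BERT itself: the hidden state, pooler weights, and classifier weights are random stand-ins, and only the layer shapes (768-dimensional hidden state, dense-plus-tanh pooler, linear head with softmax) follow BERT-base.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size, num_classes = 768, 2               # BERT-base hidden width; binary sentiment
cls_hidden = rng.standard_normal(hidden_size)   # stand-in for the [CLS] token's hidden state

# Pooler layer: dense projection followed by tanh, as in BERT-base.
W_pool = rng.standard_normal((hidden_size, hidden_size)) * 0.02
b_pool = np.zeros(hidden_size)
pooled = np.tanh(cls_hidden @ W_pool + b_pool)

# Classification head: linear layer plus softmax over sentiment classes.
W_cls = rng.standard_normal((hidden_size, num_classes)) * 0.02
logits = pooled @ W_cls
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs.shape)  # one probability per sentiment class
```

In the real model the pooler and head weights are learned during fine-tuning; here they only demonstrate the data flow from the [CLS] hidden state to class probabilities.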
BERT with S-EWEs. We extract the weights of the embedding layer of BERT and use S-EWE_f and S-EWE_c to enhance these weights. After the sentiment enhancement, we replace the original weights with the S-EWEs and feed them into the BERT model.
Table 8 shows the accuracies and macro-F1 values of the BERT and CoSE-T models on the SemEval-2013 and SST-2 datasets. From these results, we observe that: (1) Contextualized word embeddings achieve better results than feature-based word embeddings (shown in Tables 5 and 6) in sentiment analysis tasks. Specifically, CoSE-T achieved 90.80% accuracy on the SemEval-2013 dataset and 89.80% accuracy on the SST-2 dataset; BERT achieved 90.80% accuracy and 88.88% macro-F1 on the SemEval-2013 dataset, and 90.38% accuracy and 90.38% macro-F1 on the SST-2 dataset. (2) BERT with S-EWE embeddings performs worse than BERT and CoSE-T without them. Specifically, on the SemEval-2013 dataset, BERT with S-EWE_f was 9.14% less accurate than BERT and CoSE-T, and 11.34% lower than BERT in macro-F1; BERT with S-EWE_c was 14.97% less accurate than BERT and CoSE-T, and 20.21% lower than BERT in macro-F1. On the SST-2 dataset, BERT with S-EWE_f was 10.39% lower than CoSE-T and 10.97% lower than BERT in accuracy, and 11.34% lower than BERT in macro-F1; BERT with S-EWE_c was 17.14% lower than CoSE-T and 17.72% lower than BERT in accuracy, and 17.73% lower than BERT in macro-F1. From these observations, we conclude that the S-EWE method is unsuitable for enhancing contextualized word embeddings. The main reason is that the S-EWE method only enhances the embedding layer of the model and has no effect on the model itself.
Table 8. The accuracies and macro-F1 scores of BERT (with and without S-EWE) and CoSE-T on the SemEval-2013 and SST-2 datasets (%). † indicates that the results are from [6].
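The weight-replacement step for BERT can be sketched as follows. This is a toy NumPy illustration under stated assumptions: the vocabulary, the 8-dimensional embedding table standing in for BERT's embedding matrix, the sentiment mapping vectors, and the additive fusion are all hypothetical placeholders, introduced only to show that lexicon rows are modified while all other rows are left untouched.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy vocabulary and embedding table (stand-in for BERT's embedding layer weights).
vocab = {"good": 0, "bad": 1, "movie": 2, "the": 3}
emb = rng.standard_normal((len(vocab), 8))

# Hypothetical trained sentiment mapping vectors for sentiment-lexicon words only.
sent_vecs = {"good": rng.standard_normal(8), "bad": rng.standard_normal(8)}

enhanced = emb.copy()
for word, vec in sent_vecs.items():
    # One plausible fusion: element-wise addition of the mapping vector.
    enhanced[vocab[word]] = emb[vocab[word]] + vec

# Rows for non-lexicon words are unchanged; lexicon rows are shifted.
print(np.allclose(enhanced[vocab["the"]], emb[vocab["the"]]))   # True
print(np.allclose(enhanced[vocab["good"]], emb[vocab["good"]]))  # False
```

In the actual experiment, `enhanced` would be written back into BERT's embedding layer before fine-tuning; the point of the sketch is that only the embedding table changes, which is why the rest of the model is unaffected.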

Conclusions
This paper proposes a sentiment enhancement method, i.e., the sentiment-enhanced word embedding (S-EWE) method. The method learns the mapping between words in a sentiment lexicon and their corresponding sentiment orientations. After the sentiment mapping matrix is trained, it is fused with the word embeddings to form sentiment-enhanced word embeddings. These embeddings are then fed into sentiment classification models to classify the sentiment orientations of sentences on the SemEval-2013 and SST-2 datasets. Experimental results show that models using Word2Vec and GloVe enhanced by the sentiment-enhanced word embeddings perform better than those using the original Word2Vec and GloVe embeddings. Moreover, the convergence times of the enhanced models were acceptable. Since the S-EWE method only enhances the embedding layer of a model and has no effect on the model itself, it does not effectively enhance contextualized word embeddings.
In the future, several research directions remain open:
(1) The proposed sentiment mapping embeddings could be considered external knowledge in other sentiment classification models.
(2) Since sentiment orientations carry less information than sentiment intensity, sentiment intensity could be incorporated into the S-EWE method.
(3) Since Chinese words contain one or more tokens, Chinese sentiment-enhanced word embedding methods that extend the S-EWE method require further study.
(4) Methods for injecting sentiment into contextualized word embeddings, such as BERT, require further study for sentiment analysis.
Author Contributions: Methodology, software, data curation and writing-original draft, Q.L.; supervision, writing-review and editing, funding acquisition, and formal analysis, X.L.; funding acquisition, investigation, and validation, Y.D.; formal analysis and software, Y.F.; funding acquisition and formal analysis, X.C. All authors have read and agreed to the published version of the manuscript.
Funding: This work is partially supported by the Sichuan Science and Technology Program (2022YFG0378, 2021YFQ0008), the National Natural Science Foundation of China (61802316, 61872298, 61902324) and the Innovation Fund of Postgraduate, Xihua University (grant number YCJJ2021025).

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data used to support the findings of this study are available from the corresponding author upon request. All source codes and experiment details are available at https://github.com/Balding-Lee/ESWV, accessed on 13 March 2022.

Conflicts of Interest:
The authors declare no conflict of interest.