Multi-Task Learning for Sentiment Analysis with Hard-Sharing and Task Recognition Mechanisms

Abstract: In the era of big data, multi-task learning has become one of the crucial technologies for sentiment analysis and classification. Most existing multi-task learning models for sentiment analysis are built on the soft-sharing mechanism, which causes less interference between different tasks than the hard-sharing mechanism. However, the soft-sharing method also allows the model to extract fewer essential features, resulting in unsatisfactory classification performance. In this paper, we propose a multi-task learning framework based on a hard-sharing mechanism for sentiment analysis in various fields. The hard-sharing mechanism is achieved by a shared layer that builds the interrelationship among multiple tasks. We then design a task recognition mechanism to reduce the interference in the hard-shared feature space and to enhance the correlation between multiple tasks. Experiments on two real-world sentiment classification datasets show that our approach achieves the best results and significantly improves classification accuracy over existing methods. The task recognition training process gives the features of different tasks unique representations in the shared feature space, providing a new solution for reducing interference in the shared feature space for sentiment analysis.


Introduction
With the fast development of e-commerce, automated sentiment classification (ASC) methods for reviews of various products are in demand in the field of natural language processing (NLP) [1]. ASC methods classify reviews into positive/negative sentiment classes with satisfactory efficiency and accuracy [2]. More specifically, ASC aims to uncover the in-depth attitudes and perceptions (such as positive or negative) expressed in the text body, reflecting the user's natural awareness.
Recently, many forms of neural networks (NN) have been proposed for ASC [3][4][5]. Inspired by human behavior in handling multiple tasks simultaneously, the multi-task learning neural network (MTL-NN) has been proposed, extending the NN with a more sophisticated internal structure. The MTL-NN is a hierarchical NN structure that performs sentiment analysis on input data containing multiple tasks [1]. For example, an online shopping website contains review comments associated with various products, such as books, televisions, handphones, etc. Traditional single-task learning (STL) NNs experience difficulties analyzing text pieces that mix different product types, whereas an MTL-NN handles the entire text piece involving comments on different products. There are, in general, two main mechanisms for multi-task learning methods: (a) the soft-sharing mechanism, which applies a task-specific layer to different tasks [6][7][8]; (b) the hard-sharing mechanism, which uses a single shared layer for all tasks. The main contributions of this paper are as follows:
• The proposed model addresses the issue of interference and generalization of the shared feature space during multi-task learning.
• The proposed model comprises three encoders, including a lexicon encoder, a shared encoder, and a private encoder, to improve the quality of extracted features.
• We propose a task recognition mechanism that gives the shared feature space a unique representation for each task.

Related Works
As one of the popular fields of natural language processing (NLP) [13,14], various sentiment classification methods have been proposed in recent years. For example, the Word2vec technique [15], proposed by Google in 2013, significantly improves on traditional feature engineering methods for text classification. Word2vec maps characters into low-dimensional vectors that represent the intrinsic connections between words [16,17], and it has accelerated the development of deep learning techniques in the field of sentiment classification. More recently, various deep learning algorithms have been proposed for sentiment analysis, such as TextCNN [18], TextRNN [7], HAN [19], etc. These algorithms use different neural networks to process text of different lengths. For example, convolutional neural networks [20,21] are used to extract features from sentences, recurrent neural networks are used to extract features from paragraphs [22,23], and attention mechanisms are used to extract features from articles [19,24]. However, these algorithms cannot be directly applied to multi-task sentiment analysis.
The MTL approach allows the model to extract features from multiple tasks simultaneously. The MTL technique was first used in the field of computer vision [25,26]. Numerous experiments have demonstrated that MTL outperforms single-task learning methods on multi-task sentiment analysis [27,28]. The latent correlations among similar tasks that MTL can extract are potentially helpful in improving the classification results.
Based on the neural network structure, MTL can be divided into soft-sharing MTL and hard-sharing MTL [29]. The soft-sharing mechanism divides the features into shared features and private features, which reduces the interference between multiple tasks [30,31]. However, it requires learning separate features for each task as private features. Because these private features are not shared, the parameters are not used effectively [32]. The hard-sharing mechanism allows the shared layer to extract features from all tasks, which can then be used by every task [6]. Multiple tasks interfere with each other in the shared layer, but this interference is exploited to improve generalizability. To reduce the interference, the hard-sharing mechanism provides a private layer for each task [33].

Methodology
The overall structure of the model is shown in Figure 1. The lexicon encoder encodes the data for each task into a lexicon embedding, and the shared encoder extracts semantic features from the embedding. These semantic features form a shared feature space. The private encoder consists of two parts: the task-specific layers, which learn semantic features related to the source of the review, and the task recognition layer.


The Lexicon Encoder
The lexicon encoder is a feature extraction encoder that converts the input text into word vectors. The input to the lexicon encoder can be a sentence or a paragraph. The output of the encoder is the sum of the corresponding token, segment, and position embeddings. The position embedding is calculated from the positions of the input vectors, as shown in Equation (1):
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), (1)
where pos is the position of the input vector; i is the dimension index, and d_model is the dimension of the word vector. The lexicon encoder converts the input X into d_model-dimensional embeddings for the shared encoder (Section 3.2).
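As an illustration, Equation (1) can be sketched in a few lines of NumPy. This is a minimal sketch; the sequence length and dimension below are arbitrary example values, not values from the paper.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal position embedding from Equation (1):
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    (d_model is assumed to be even here)."""
    pe = np.zeros((n_positions, d_model))
    pos = np.arange(n_positions)[:, None]         # (n_positions, 1)
    i = np.arange(0, d_model, 2)                  # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)  # (n_positions, d_model/2)
    pe[:, 0::2] = np.sin(angle)                   # even dims: sine
    pe[:, 1::2] = np.cos(angle)                   # odd dims: cosine
    return pe

pe = positional_encoding(50, 16)   # 50 positions, 16-dimensional embedding
```

The resulting matrix is added to the token and segment embeddings before the shared encoder.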

Shared Encoder
The shared encoder is used to extract the common sentence features across multiple tasks and place them into a shared feature space. To make the shared feature space contain richer semantic features, the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model [34,35] is introduced into our shared encoder. The pre-trained BERT model consists of multiple Transformer encoders, which are used to encode sentences. The structure of the shared encoder is shown in Figure 2. From Figure 2, the Transformer encoder is composed of a stack of N = 6 identical layers and specifically addresses the issue of learning long-term dependencies. Each layer is composed of a multi-head attention mechanism and a position-wise feed-forward network. The two sub-layers are connected by residual connections [20] and layer normalization [21].
The multi-head attention allows the model to attend to information at different positions. Multi-head attention is composed of multi-dimensional self-attention, which linearly projects the queries, keys, and values h times. The self-attention takes queries and keys of dimension d_k and values of dimension d_v. We compute the self-attention products with Equation (3):
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, (3)
where Q is the queries matrix; K is the keys matrix; V is the values matrix; and d_k is the dimension of the queries and keys. Finally, we utilize the softmax function to calculate the weight of every input token.
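Equation (3) can be sketched directly in NumPy. The matrices below are random illustrative inputs (4 queries, 6 keys/values, dimension 8), not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Equation (3): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # one weight per input token
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 8))
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of `w` is a probability distribution over the input tokens, as described above.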
On each of the projected versions, the self-attention is computed in parallel. The outputs are concatenated. The final output can be calculated by projecting them again.
MultiHead(Q, K, V) = (h_1 ⊕ h_2 ⊕ ... ⊕ h_h) W^O, with h_i = Attention(QW_i^Q, KW_i^K, VW_i^V),
where ⊕ is the concatenation operator; h_i is the i-th attention representation of the multi-head attention; and the projections W_i^Q, W_i^K, W_i^V, and W^O are parameter matrices. The position-wise feed-forward network is composed of two linear transformations with a nonlinear ReLU activation in between:
FFN(x) = ReLU(xW_1 + b_1) W_2 + b_2.
We utilize residual connections and layer normalization to connect the input layer, the multi-head attention mechanism, and the position-wise feed-forward network:
LN(x) = σ(x + F(x)),
where F(x) represents the output of a sub-layer; σ represents layer normalization; and LN(x) represents the output of the layer normalization.
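Under illustrative assumptions (random weights, 5 tokens, d_model = 16, h = 4 heads, a 32-unit hidden feed-forward layer; none of these are values from the paper), the two sub-layers of one encoder layer can be sketched as follows.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

d_model, h = 16, 4
d_k = d_model // h
x = rng.standard_normal((5, d_model))                       # 5 input tokens

# Multi-head attention: project h times, attend, concatenate (⊕), project back.
W_q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_k = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_v = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_o = rng.standard_normal((d_model, d_model))
heads = [attention(x @ W_q[i], x @ W_k[i], x @ W_v[i]) for i in range(h)]
mha = np.concatenate(heads, axis=-1) @ W_o

# Residual connection + layer normalization: LN(x) = sigma(x + F(x)).
x1 = layer_norm(x + mha)

# Position-wise feed-forward: FFN(x) = ReLU(x W1 + b1) W2 + b2.
W1 = rng.standard_normal((d_model, 32)); b1 = np.zeros(32)
W2 = rng.standard_normal((32, d_model)); b2 = np.zeros(d_model)
ffn = np.maximum(0.0, x1 @ W1 + b1) @ W2 + b2
out = layer_norm(x1 + ffn)
```

Stacking N = 6 such layers yields the encoder described above.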

Private Encoder
The private encoder is composed of a task-specific layer and a task recognition layer. The task-specific layer is used to extract sentiment features that are independent between tasks; therefore, there are multiple multi-scale CNN layers, each designed for a different task. The task recognition layer is used to learn task-recognized features. The overall structure of the private encoder is shown in Figure 3.
From Figure 3, the multi-scale CNN is composed of multiple convolution layers. Each convolution layer is composed of multiple convolution kernels of different sizes, which are used to extract text features of different scales from the shared feature space.
Information 2021, 12, 207 5 of 13

Let x_{i:i+j} refer to the concatenation of words x_i, x_{i+1}, ..., x_{i+j}. A convolution operation slides a convolution filter w_h ∈ R^{hk} over a window of h words to generate new features. For example, when the convolution is calculated on the words x_{i:i+h−1}, a new feature is generated by
c_i = f(w_h · x_{i:i+h−1} + b),
where b ∈ R is a bias term and f is the ReLU activation function. We apply the convolution filter to all possible word windows {x_{1:h}, x_{2:h+1}, ..., x_{n−h+1:n}} to generate the feature map
c_h = [c_1, c_2, ..., c_{n−h+1}],
where c_h ∈ R^{n−h+1}. We then apply the max-pooling operation [3] to further process the feature map, taking the maximum value of c_h as the feature ĉ_h.
Multiple features ĉ_h of different window sizes h are extracted by convolution filters of different sizes, representing token information of different lengths. The final feature vector ĉ is the concatenation of the features ĉ_h extracted by the kernels of different sizes.
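The convolution-and-max-pooling step above can be sketched in NumPy. The window sizes (2, 3, 4), five filters per size, and random inputs are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

def conv_max_feature(x, w, b=0.0):
    """Slide a filter over word vectors x (n words, dim k):
    c_i = ReLU(w . x[i:i+h] + b), then max-pool the map to one feature."""
    n, k = x.shape
    h = w.shape[0] // k                               # window size of this filter
    c = np.array([max(0.0, float(w @ x[i:i + h].ravel()) + b)
                  for i in range(n - h + 1)])         # feature map c_h
    return c.max()                                    # max-pooling -> c_hat_h

x = rng.standard_normal((20, 8))                      # 20 words, dim-8 vectors
features = []
for h in (2, 3, 4):                                   # kernels of different sizes
    for _ in range(5):                                # 5 filters per size
        features.append(conv_max_feature(x, rng.standard_normal(h * 8)))
c_hat = np.array(features)                            # concatenated final feature
```

Each entry of `c_hat` corresponds to one filter, i.e., one ĉ_h value; concatenating them gives the multi-scale representation.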

The Task Recognition Mechanism
Inspired by adversarial training [32], we propose a task recognition mechanism that uses the three encoders to learn the different features between each task while performing sentiment classification.
In the training process, for a text dataset containing N samples {x_i, y_i}, we utilize the cross-entropy between the true and predicted distributions, computed over all tasks, as the loss function. The model is optimized in the direction of minimizing the cross-entropy value:
L_CE = −∑_{i=1}^{N} ∑_{j=1}^{C} y_i^j log(ŷ_i^j),
where y_i^j is the ground-truth label; ŷ_i^j is the prediction probability, and C is the class number.
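The cross-entropy above can be computed in a few lines; the one-hot labels and predicted probabilities below are hypothetical values for illustration.

```python
import numpy as np

def task_loss(y_true, y_pred, eps=1e-12):
    """Cross-entropy over N samples and C classes:
    L = -sum_i sum_j y_i^j * log(y_hat_i^j)."""
    return -np.sum(y_true * np.log(y_pred + eps))

# Two samples, two sentiment classes (negative / positive), one-hot labels.
y_true = np.array([[0., 1.],
                   [1., 0.]])
y_pred = np.array([[0.2, 0.8],
                   [0.9, 0.1]])
loss = task_loss(y_true, y_pred)
```

Only the probability assigned to the true class contributes to the sum, so confident correct predictions drive the loss toward zero.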
During the task recognition training process, a separate multi-scale CNN layer is designed for each task. The different multi-scale CNN layers have independent parameters; therefore, the interference between different tasks is relieved. Suppose that the input sample belongs to task k; the corresponding multi-scale CNN is MCNN^(k), and its output is
ŷ^(k) = MCNN^(k)(x^(k)),
where x^(k) is a sample of task k and ŷ^(k) is the prediction probability for task k. For the data of multiple tasks, we calculate the weighted sum of the losses over the tasks:
L_task = ∑_{k=1}^{K} α_k L_CE^(k),
where L_CE^(k) is the cross-entropy loss of task k, α_k is the weight for each task k, and K is the number of tasks. A task recognition training process is designed to learn the distinguishing features among tasks and to influence the representation in the shared feature space through backpropagation. The schematic diagram of task recognition training is shown in Figure 4.
Task Discriminator. The task discriminator is used to map the shared representation of sentences into a probability distribution, estimating the probability of the original task for each encoded sentence.
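The weighted sum over task losses can be sketched as follows; the per-task loss values and weights α_k are hypothetical examples.

```python
import numpy as np

def weighted_multitask_loss(task_losses, alphas):
    """Weighted sum over K tasks: L_task = sum_k alpha_k * L_k."""
    assert len(task_losses) == len(alphas)
    return float(np.dot(alphas, task_losses))

# Hypothetical per-task cross-entropy values for K = 3 tasks.
losses = [0.40, 0.25, 0.60]
alphas = [0.5, 0.3, 0.2]        # task weights alpha_k
total = weighted_multitask_loss(losses, alphas)
```

The weights α_k let tasks with larger or more important datasets contribute more to the gradient of the shared layer.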
Recognition Loss. Unlike most existing multi-task learning algorithms, we add an extra recognition loss L_rec to bring task-recognized features into the shared feature space. The recognition loss is used to train the model to produce task-recognized features such that a classifier can reliably predict the task from these features. The original loss of the task recognition training process is limited since it can only be used in binary situations.
To overcome this, we extend it to a multi-class form, which allows our model to be trained with multiple tasks together:
L_rec = −∑_{i=1}^{K} ∑_{j=1}^{C} p_i^j log(p̂_i^j),
where K is the total number of tasks, N is the total number of samples, and C is the number of samples for task i. For each task i, there are samples j ∈ (1, C); p_i^j is the task label indicating the task that sample j belongs to, and p̂_i^j is the predicted probability for p_i^j. Note that L_rec requires only the input sentence x and does not require the corresponding sentiment label y. The final loss function of the model can be written as
L = L_task + λ L_rec,
where L_task is the weighted sentiment classification loss and λ is a constant coefficient.
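The recognition loss and the combined objective can be sketched as below. The one-hot task labels, discriminator outputs, classification loss value, and λ are hypothetical illustration values.

```python
import numpy as np

def recognition_loss(task_labels, task_probs, eps=1e-12):
    """L_rec = -sum_i sum_j p_i^j * log(p_hat_i^j): cross-entropy between
    one-hot task labels p and the task-discriminator predictions p_hat."""
    return -np.sum(task_labels * np.log(task_probs + eps))

# 3 samples, K = 2 tasks; each row is a one-hot task label.
p = np.array([[1., 0.], [0., 1.], [1., 0.]])
p_hat = np.array([[0.7, 0.3], [0.4, 0.6], [0.9, 0.1]])

l_task = 0.5                    # hypothetical sentiment-classification loss
lam = 0.05                      # constant coefficient lambda
l_total = l_task + lam * recognition_loss(p, p_hat)
```

Because L_rec needs only the input sentence and its task of origin, it can use unlabeled reviews that have no sentiment annotation.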

Dataset and Metrics
As shown in Table 1, the dataset that we employed in this experiment contains 16 different datasets from several popular review corpora: books, electronics, DVD, kitchen, apparel, camera, health, music, toys, video, baby, magazines, software, sports, IMDB, and MR. The first 14 datasets are product reviews, collected by Blitzer et al. [33]. The remaining two datasets are movie reviews from the IMDB dataset [34] and the MR dataset [18]. There are about 2000 reviews for each commodity, for a total of about 32,000 reviews. The goal is to classify a review as either positive or negative. All the datasets in each task are partitioned randomly into a training set, validation set, and test set with proportions of 70%, 20%, and 10%, respectively. The detailed statistics for all the datasets are listed in Table 2. Table 1. Instances of the testing dataset I.

Commodity Type | Example | Label
Books | this is a resource used by all nps i have talked to. great addition to your library. | 1
Books | it was a mistake to buy it. only few pages were interestin | 0
Electronics | great product but is only $ 30 at iriver.com's stor | 1
Electronics | i dont like this mouse, i brought, and never work, its useles | 0
DVD | an awesome film with some suspense and raunchiness all rolled in to one | 1
DVD | i love pablo's act on comedy central. this one does n't even touch it | 0
Kitchen | it is very light and worm. i love it. definitely worth the price! | 1
Kitchen | for the price, you get what you pay for. they are not the best quality | 0
Apparel | recipient was very satisfied with this blanket as pb are his initials | 1
Apparel | a red star !?!? i bet this wo n't sell well in eastern europe. | 0

In addition, we collected four different types of commodity review datasets, covering daily necessities, literature, entertainment, and media, from the raw data provided by Blitzer et al. [33] to form dataset II. Each item in dataset II has more entries compared to dataset I. We also divided dataset II into training, validation, and test sets, and ensured that the numbers of positive and negative samples in each set did not differ much. Instances and statistics of dataset II are shown in Tables 3 and 4. Table 3. Instances of the testing dataset II.

Commodity Type | Example | Label
Daily Necessities | great product-i heard from other mommies that this was the pump to get; i agree | 1
Daily Necessities | rent a hospital grade medalia pump. you wont be sorr | 0
Literature | an excellent book for anyone that barbecues | 1
Literature | imposible to do so with no item received | 0
Entertainment | thank you, i like this program and it does what i need it to do | 1
Entertainment | i would not buy it ! hard to use. my machine runs slower since the install. | 0
Media | i received "the piano" promptly, and in pristine, excellent condition. | 1
Media | if this is n't worst dead album then in the dark is | 0

Table 4. Statistics of dataset II (positive/negative counts per split).

Commodity Type | Train Pos | Train Neg | Valid Pos | Valid Neg | Test Pos | Test Neg | Total
Daily Necessities | 1609 | 1486 | 199 | 199 | 187 | 213 | 3893
Literature | 2257 | 2305 | 308 | 292 | 420 | 380 | 5962
Entertainment | 2285 | 2219 | 299 | 301 | 407 | 393 | 5904
Media | 2978 | 3007 | 389 | 411 | 613 | 584 | 7982
Total | 9129 | 9017 | 1195 | 1203 | 1627 | 1570 | 23,741

In the experiment, we use the same evaluation criteria, accuracy and F1-score, for each commodity review dataset and each method.
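The random 70%/20%/10% partition described above for dataset I can be sketched as follows; the seed and the placeholder review names are illustrative assumptions.

```python
import random

def split_dataset(samples, seed=42):
    """Random 70/20/10 train/validation/test partition of one task's reviews."""
    samples = samples[:]                       # copy so the input is untouched
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train, n_val = int(0.7 * n), int(0.2 * n)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

reviews = [f"review_{i}" for i in range(2000)]   # ~2000 reviews per commodity
train, val, test = split_dataset(reviews)
```

Applying the same split per commodity keeps each task's proportions identical across the 16 datasets.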
The experimental results are shown in Tables 5 and 6. Table 5 shows the accuracy of the 16 sentiment classification tasks, and Table 6 shows the accuracy of the four sentiment classification tasks. The Avg. column shows the average accuracy of the preceding four single models. The highest accuracy rates are bolded in Tables 5 and 6.
From Table 5, we can see that multi-task learning models work better than single-task models on most tasks. From Table 6, we can see that our proposed MTL-REC model outperforms all compared existing methods in all cases. The classification accuracy improvements are visualized in Figure 5, where the improvement of the proposed method over all compared methods ranges from 2% to 7%. The significant classification accuracy improvement is mainly achieved by the hard-sharing mechanism and the task recognition training process: the task sharing layer reduces the interference between multiple tasks. Table 5 also shows that the CNN extracts text features similarly to the GRU and LSTM encoders while taking less time. In Table 6, the average accuracy of the multi-task learning model is almost the same as that of the single-task learning model.
Table 7 shows a statistical test over the results in Table 5. The difference between the proposed method and each compared method is evaluated using the Wilcoxon signed-rank test; the p-values show that the proposed method is significantly different from the compared methods. Table 8 shows the overall time and memory used by the different methods on datasets I and II. The proposed MTL-REC encoder improves the sentiment classification performance significantly but requires more running time and memory. The time complexity of the sentiment analysis is usually not the main concern, since the feature extraction part can always be performed offline.
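For illustration, the Wilcoxon signed-rank statistic used in Table 7 can be computed by hand as below (in practice, scipy.stats.wilcoxon would supply the p-value). The per-task accuracies here are hypothetical numbers, not results from the paper.

```python
def wilcoxon_statistic(a, b):
    """Signed-rank statistic W: rank the |differences|, sum the ranks for each
    sign, and take the smaller sum (smaller W = stronger evidence)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]   # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):                 # average ranks over tied |diff| values
        j = i
        while j < len(order) and abs(diffs[order[j]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j + 1) / 2.0           # ranks are 1-based
        for k in range(i, j):
            ranks[order[k]] = avg
        i = j
    w_pos = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_neg = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_pos, w_neg)

# Hypothetical per-task accuracies: proposed method vs. one baseline, 8 tasks.
proposed = [0.90, 0.88, 0.91, 0.87, 0.93, 0.89, 0.92, 0.90]
baseline = [0.85, 0.86, 0.88, 0.84, 0.90, 0.87, 0.88, 0.86]
w = wilcoxon_statistic(proposed, baseline)
```

Here every difference is positive, so the negative-rank sum (and thus W) is 0, the strongest possible evidence at this sample size.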

Model Self-Comparison
To demonstrate the effectiveness of the proposed method, a comparative experiment is conducted. Tables 9 and 10 reflect that both BERT and the task recognition training process are helpful in sentiment classification tasks. Table 9. Performance improvement using BERT and task recognition training on the multi-task dataset I.

In Tables 9 and 10, the highest accuracy rates and F1-scores are highlighted in bold. According to Tables 9 and 10, the sentiment classification performance is further improved with BERT and the proposed task recognition mechanism.

Conclusions
In this paper, we propose a multi-task learning framework for sentiment classification with a novel task recognition mechanism. We introduce the pre-trained BERT model as our shared encoder to further improve its performance. In addition, we propose a task recognition training process, which enhances the shared feature space with more task-recognized features. We designed a series of experiments to validate the proposed method. The experimental results show that the sentiment classification results of our model are superior to those of existing state-of-the-art methods. Both semantic features and task-recognized features are extracted, enhancing the overall classification performance.
It is noted that introducing the pre-trained BERT model reduces the efficiency of the algorithm and leads to longer computation times, although the proposed method shows a significant improvement in the accuracy of sentiment classification.
As future work, we will improve the shared encoder to reduce the time complexity of the proposed multi-task learning algorithm. In addition, more challenging datasets, such as unbalanced or noisy datasets and datasets in different languages, will be tested with the proposed method.