Hierarchical Transformer Network for Utterance-level Emotion Recognition

While there have been significant advances in detecting emotions in text, many problems remain unsolved in utterance-level emotion recognition (ULER). In this paper, we address three challenges of ULER in dialog systems. (1) The same utterance can deliver different emotions when it appears in different contexts or comes from different speakers. (2) Long-range contextual information is hard to capture effectively. (3) Unlike traditional text classification, this task is supported by a limited number of datasets, most of which contain inadequate conversations or speech. To address these problems, we propose a hierarchical transformer framework (apart from descriptions of other studies, "transformer" in this paper usually refers to the encoder part of the transformer) with a lower-level transformer to model the word-level input and an upper-level transformer to capture the context of utterance-level embeddings. We use the pretrained language model bidirectional encoder representations from transformers (BERT) as the lower-level transformer, which is equivalent to introducing external data into the model and alleviates the data shortage to some extent. In addition, we add speaker embeddings to the model for the first time, which enables our model to capture the interaction between speakers. Experiments on three dialog emotion datasets, Friends, EmotionPush, and EmoryNLP, demonstrate that our proposed hierarchical transformer network models achieve 1.98%, 2.83%, and 3.94% improvement, respectively, over the state-of-the-art methods on each dataset in terms of macro-F1.


Introduction
Sentiment analysis, considered one of the most important methods for analyzing real-world communication, is a kind of classification task for extracting emotion from language. It can help us progress in many fields, such as data mining and developing empathetic machines for people. In this paper, we consider one of the tasks in this research direction, utterance-level emotion recognition (ULER). In ULER, an utterance [Olson, 1977] is a unit of speech bounded by breaths or pauses, and the goal of the task is to tag each utterance in a dialog with the indicated emotion (e.g., happy, sad, or angry). Traditional sentiment analysis methods are confined to analyzing a single sentence or document, regardless of its surrounding information. However, in ULER, contextual information is indispensable for discriminating emotions. For example, in Figure 1, the utterance "Yes, I agree with this point." can deliver different emotions in different contexts. To identify a speaker's emotion precisely, [Hazarika et al., 2018] produced contextual representations for prediction with a recurrent neural network (RNN), where each utterance is represented by a feature vector extracted by convolutional neural networks (CNNs) at an earlier stage. Similarly, [Jiao et al., 2019] proposed a hierarchical gated recurrent unit (HiGRU) framework with a lower-level GRU to model the word-level inputs and an upper-level GRU to capture the contexts of utterance-level embeddings.
Theoretically, RNNs such as long short-term memory (LSTM) networks and gated recurrent units (GRUs) should propagate long-term contextual information. However, in practice, this is not always the case [Bradbury et al., 2017]. When the input sequence is long, RNNs may suffer from exploding or vanishing gradients. Moreover, unlike traditional text classification problems, ULER has a limited number of datasets, and most of them contain inadequate conversations. This issue limits the possibility of training larger models for this task.
In this paper, we propose a hierarchical transformer framework to solve the above issues. First, we use transformers [Vaswani et al., 2017] to model the word-level input and to capture the contexts of utterance-level embeddings; the transformer has been shown to be a powerful representation learning model in many NLP tasks and can exploit contextual information more efficiently than RNNs and CNNs. Second, for the data scarcity issue, we use a pretrained language model, bidirectional encoder representations from transformers (BERT) [Devlin et al., 2018], as the lower-level transformer, which is equivalent to introducing external data into the model and helps our model obtain better utterance embeddings. Third, the same utterance can deliver different emotions in the same context. For example, in Figure 2, the utterance "Yes, I agree. I think so, too." can deliver different emotions, joy and sadness. However, previous studies have not addressed this situation because those models did not capture the interaction between the speakers and did not consider the emotional dynamics of the speakers in a dialog. To solve this problem, we introduce speaker embeddings into our model. To the best of our knowledge, this is the first model for ULER with speaker embeddings. After obtaining the contextual utterance embedding vectors with the hierarchical transformer framework, we feed them into fully connected layers for classification.
We employ dropout on the fully connected layers to prevent overfitting. Finally, we obtain an utterance category with a softmax layer. We summarize our contributions as follows:
• We propose a hierarchical transformer framework to better learn both the individual utterance embeddings and the contextual information of utterances.
• We use a pretrained language model, BERT, to obtain better dialog embeddings, which is equivalent to introducing external data into the model and solves the problem of data shortage to some extent.
• For the first time, we use speaker embedding in the model for the ULER task, which allows our model to capture the interaction between speakers and better understand emotional dynamics in dialog systems.
• Our model outperforms state-of-the-art models on three benchmark datasets, Friends, EmotionPush, and EmoryNLP.

Related work
Text-based emotion recognition is a long-standing research topic, and there have been many excellent studies. However, these models do not perform well in ULER because they treat texts independently and thus cannot capture the interdependence of utterances in dialogs. To capture the contexts of utterance-level embeddings more effectively, we propose a hierarchical transformer framework, which builds on the following lines of work.

Individual Utterance Information Extraction
Among traditional methods, a common way of representing text is the bag-of-words method. However, the bag-of-words method loses the order of the words. The n-gram model is a very popular statistical language model and usually performs well [Thorsten, 1998]. However, the n-gram model has a major defect in that it is affected by data sparsity [Bengio et al., 2013]. Recently, neural network methods have become increasingly popular, and there is a trend of moving from traditional methods to deep learning methods to obtain better text representations. Some prominent models include recursive autoencoders (RAEs) [Socher et al., 2011], convolutional neural networks (CNNs) [Kim, 2014], and recurrent neural networks (RNNs) [Abdul-Mageed and Ungar, 2017]. Although a neural network allows us to train a more complex model, it does not perform well when the quantity of data is small.

Pretrained Language Models
Unsupervised pretraining is a special case of semi-supervised learning where the goal is to find a good initialization point. Pretrained language models, such as ELMo [Peters et al., 2018], OpenAI GPT [Radford et al., 2018], and BERT [Devlin et al., 2018], have achieved great success in a variety of NLP tasks, such as sentiment analysis and textual classification. They can generate deep contextualized embeddings since they are pretrained on a massive unlabeled corpus (e.g., English Wikipedia). Some proposed models [Sun et al., 2019] with pretrained language models have obtained outstanding results on the sentiment analysis task of individual sentences.
[Reimers et al., 2019] proposed Siamese BERT-networks (SBERT) to obtain sentence embeddings and showed that their model outperforms other state-of-the-art sentence embedding methods.

Contextual Information Extraction
The RNN architecture is a standard method for capturing the sequential relationship of data. [Poria et al., 2015] captured the contextual information with a bidirectional long short-term memory (BiLSTM) network and obtained great

[Figure: transformer encoder block with Multi-Head Self-Attention and Feed Forward Network components]

Transformer
The transformer learns the dependencies between words based entirely on self-attention, without any recurrent or convolutional layers. Due to its rich representation and fast computation, it has been applied to many NLP tasks, e.g., response matching in dialog systems [Zhou et al., 2018] and language modeling [Dai et al., 2019].
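To make the idea of self-attention concrete, the following is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It is an illustration only: it assumes identity query, key, and value projections, whereas an actual transformer layer uses learned projection matrices and several heads. The function names `softmax` and `self_attention` are ours and do not come from any library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention.

    x: list of n vectors, each a list of d floats.
    Assumption: queries, keys, and values are all x itself
    (identity projections), which keeps the sketch short.
    Returns (outputs, weights), where weights[i][j] is the attention
    that position i pays to position j.
    """
    d = len(x[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for query in x:
        # Similarity of this position to every position in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in x]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of all value vectors.
        outputs.append(
            [sum(row[j] * x[j][t] for j in range(len(x))) for t in range(d)]
        )
    return outputs, weights
```

Every output position is a convex combination of all input positions, which is why the dependency between two words does not weaken with their distance in the sequence.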

Approach
In this section, we present the task definition and our proposed hierarchical transformer (HiTransformer) network. In addition, we propose a variant of HiTransformer that adds speaker embeddings, named HiTransformer-s. The overall architecture of our models is illustrated in Figure 3.

Task Definition
Let there be a set of speakers, $S = \{s_i\}_{i=1}^{N_s}$, where $N_s$ is the number of speakers, and a set of emotions, $E = \{e_i\}_{i=1}^{N_e}$, where $N_e$ is the number of emotions, such as anger, joy, sadness, and neutral. Assume we are given a set of dialogs, $D = \{D_i\}_{i=1}^{N_d}$, where $N_d$ is the number of dialogs. Each dialog $D_i = \{(u_j, s_j, e_j)\}_{j=1}^{N_i}$ is a sequence of $N_i$ utterances, where the utterance $u_j$ is spoken by $s_j \in S$ with an emotion $e_j \in E$. Our goal is to train a model to find the most likely emotion from $E$ for each new utterance.

HiTransformer: Hierarchical Transformer
Our HiTransformer consists of two-level transformers: the lower-level transformer models the word-level input and obtains the individual utterance embedding. The upper-level transformer captures the contextual information and obtains utterance-level embeddings.

Individual Utterance Embedding
Consider an input utterance $u_j = \{w_k\}_{k=1}^{N_j}$, where $u_j$ is the $j$-th utterance in a dialog $D_i$ and $N_j$ is the number of words in $u_j$. First, the utterance is lower-cased and tokenized according to a byte pair encoding (BPE) algorithm. If the number of tokens exceeds the preset maximum input length, the excess tokens are discarded. Then, we embed the tokens through WordPiece embeddings [Wu et al., 2016] and obtain the token embeddings $T = \{t_k\}_{k=1}^{N_j}$. Finally, the input embeddings $X = \{x_k\}_{k=1}^{N_j}$ are the sum of the token embeddings and the positional embeddings $P = \{p_k\}_{k=1}^{N_j}$:
$$x_k = t_k \odot p_k,$$
where $\odot$ denotes element-wise addition. We feed the input embeddings into the lower-level transformer to learn the individual utterance embedding. We adopt the transformer-based pretrained language model BERT as the lower-level transformer, which is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both the left and right contexts in all layers. Its detailed structure is shown in Figure 4. The language model converts the input embeddings into contextual word embeddings $H = \{h_k\}_{k=1}^{N_j}$.
The individual utterance embedding is then obtained by max-pooling over the contextual word embeddings within an utterance, which helps retain the most important information in each dimension:
$$\mathrm{emb}(u_j) = \mathrm{maxpool}\left(\{h_k\}_{k=1}^{N_j}\right).$$
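The two embedding operations described above, the element-wise sum of token and positional embeddings and the max-pooling over word positions, can be sketched in a few lines of plain Python. This is an illustrative sketch with made-up toy values: the helper names `add_positional` and `max_pool` are ours, and the BERT encoder that sits between the two steps in the paper is replaced by a pass-through.

```python
def add_positional(token_embs, pos_embs):
    """Element-wise sum of token and positional embeddings, one position at a time."""
    return [
        [t + p for t, p in zip(tok, pos)]
        for tok, pos in zip(token_embs, pos_embs)
    ]


def max_pool(word_embs):
    """Max over the word axis, taken independently for each embedding dimension."""
    return [max(values) for values in zip(*word_embs)]


# Toy utterance: 3 tokens, 4-dimensional embeddings (values are made up).
token_embs = [
    [0.1, 0.5, -0.2, 0.0],
    [0.3, -0.1, 0.4, 0.2],
    [-0.6, 0.2, 0.1, 0.9],
]
pos_embs = [
    [0.0, 0.1, 0.0, 0.1],
    [0.1, 0.0, 0.1, 0.0],
    [0.2, 0.2, 0.0, 0.0],
]

inputs = add_positional(token_embs, pos_embs)
# Assumption: in the paper, BERT maps `inputs` to contextual word embeddings.
# Here `inputs` is reused directly as a stand-in for the encoder output.
utterance_emb = max_pool(inputs)
```

Max-pooling yields a fixed-size vector regardless of utterance length, so utterances with different numbers of tokens can be stacked as inputs to the upper-level transformer.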

Contextual Utterance Embedding
Then, we feed the contextual utterance embedding vector into the classifier, which consists of two linear layers, one activation function and dropout. Finally, we obtain the predicted vector over all emotions with a softmax function.
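A classifier of this shape can be sketched as follows in plain Python. The text specifies two linear layers, one activation function, dropout, and a softmax; the choice of ReLU as the activation, the placement of dropout after the activation, and the random initialization in `init_params` are assumptions made for this sketch, not details confirmed by the paper.

```python
import math
import random


def linear(x, weight, bias):
    """Affine map: one output per row of `weight`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def dropout(x, rate, training, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(in_dim, hidden_dim, n_classes, seed=0):
    """Small random weights; a stand-in for trained parameters."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": matrix(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
        "w2": matrix(n_classes, hidden_dim), "b2": [0.0] * n_classes,
    }


def classify(utterance_vec, params, rate=0.5, training=False, rng=None):
    """Linear -> ReLU (assumed) -> dropout -> linear -> softmax."""
    rng = rng or random.Random(0)
    hidden = relu(linear(utterance_vec, params["w1"], params["b1"]))
    hidden = dropout(hidden, rate, training, rng)
    logits = linear(hidden, params["w2"], params["b2"])
    return softmax(logits)
```

With `training=False`, dropout is a no-op, so the predicted distribution over emotions is deterministic at test time.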

HiTransformer-s: Hierarchical Transformer with speaker embeddings
The main limitation of HiTransformer is that it cannot capture the interaction of speakers in a dialog. For example, in Figure 2, the utterance "Yes, I agree. I think so, too." delivers different emotions, sadness and joy, but HiTransformer cannot tag it correctly. To solve this problem, we propose the hierarchical transformer with speaker embeddings (HiTransformer-s), which can model the interaction of speakers in a dialog. Finally, we concatenate the sum of the individual utterance embedding and the positional embedding with the speaker embedding of every utterance, and use the result as the input of the upper-level transformer:
$$z_j = \left(\mathrm{emb}(u_j) \odot p_j\right) \oplus \mathrm{emb}(s_j),$$
where $\mathrm{emb}(u_j)$ is the individual utterance embedding of the $j$-th utterance, $p_j$ is its positional embedding, $\mathrm{emb}(s_j)$ is the embedding of its speaker, $\odot$ denotes element-wise addition, and $\oplus$ is the concatenation operator.
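The construction of the upper-level input described above (add the positional embedding to the utterance embedding, then concatenate the speaker embedding) can be sketched as below. The function name `build_upper_inputs` and the dictionary-based speaker lookup table are hypothetical choices for illustration; how the speaker embeddings are obtained is not specified in this passage.

```python
def build_upper_inputs(utt_embs, pos_embs, speaker_ids, speaker_table):
    """Build one input vector per utterance for the upper-level transformer.

    utt_embs:      list of utterance embeddings (lists of floats)
    pos_embs:      list of positional embeddings, same shape as utt_embs
    speaker_ids:   speaker identifier for each utterance
    speaker_table: hypothetical lookup from speaker id to speaker embedding
    """
    inputs = []
    for utt, pos, speaker in zip(utt_embs, pos_embs, speaker_ids):
        # Element-wise addition of utterance and positional embeddings.
        summed = [u + p for u, p in zip(utt, pos)]
        # Concatenation with the speaker embedding.
        inputs.append(summed + list(speaker_table[speaker]))
    return inputs
```

Because the speaker embedding is concatenated rather than added, each input vector has dimension equal to the utterance embedding size plus the speaker embedding size.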

Model Training
To address the issue of class imbalance, following previous work [Khosla, 2018], we use weighted cross-entropy as the training loss, so that samples of minority classes receive larger weights. The weight of each emotion $c$ depends on $N_c$, the number of utterances with emotion $c$ in the training set.
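As an illustration of how a class-weighted cross-entropy behaves, the sketch below uses inverse-frequency weights. This weighting scheme is an assumption chosen for the example: the exact weight formula used in the paper is not reproduced here, and the function names are ours.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed scheme: weight_c = total / (n_classes * N_c).

    N_c is the number of training utterances with emotion c. The weights
    are all 1.0 when classes are perfectly balanced and grow for rare classes.
    """
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {c: total / (n_classes * n) for c, n in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean negative log-likelihood.

    probs:  one dict per utterance, mapping emotion -> predicted probability
    labels: gold emotion per utterance
    """
    numerator = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    denominator = sum(weights[y] for y in labels)
    return numerator / denominator
```

Under this scheme, a misclassified utterance from a rare emotion contributes more to the loss than one from a frequent emotion such as neutral.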

Experimental Settings
In this section, we present the datasets, evaluation metrics, baselines and experimental results of our model.

Dataset
Friends [Hsu and Ku, 2018]: The dataset is annotated from the Friends TV scripts, and each dialog in the dataset consists of a scene with multiple speakers. In total, there are 1,000 dialogs, which are split into three parts: 720 for training, 80 for validation, and 200 for testing. Each utterance is tagged with one emotion from the set {anger, joy, sadness, neutral, surprise, disgust, fear, non-neutral}.

Evaluation Metrics
Following [Jiao et al., 2019], which achieved the best performance on several ULER datasets, we choose the macro-averaged F1-score as the primary metric for evaluating the performance of our models:
$$\text{macro-F1} = \frac{1}{N_e} \sum_{c=1}^{N_e} F1_c,$$
where $N_e$ is the number of emotion classes and $F1_c$ is the F1-score of emotion $c$. We also report the weighted accuracy (WA) and unweighted accuracy (UWA), which were adopted in a previous work [Hsu and Ku, 2018]:
$$\text{WA} = \sum_{c=1}^{N_e} p_c \, a_c, \qquad \text{UWA} = \frac{1}{N_e} \sum_{c=1}^{N_e} a_c,$$
where $p_c$ is the percentage of class $c$ in the testing set, and $a_c$ is the corresponding accuracy. As shown in Table 1 and Table 2, most of the datasets in this paper have an imbalanced emotion distribution, so the macro-F1 score is a more suitable measure of model performance.
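The three metrics can be computed from gold and predicted labels as in the sketch below. One interpretation is assumed here and should be read as such: the per-class accuracy is taken to be the fraction of test utterances of that class that are predicted correctly. The function names are ours.

```python
from collections import Counter


def f1_per_class(y_true, y_pred, classes):
    """F1-score of each class from true-positive, false-positive, false-negative counts."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    classes = sorted(set(y_true))
    scores = f1_per_class(y_true, y_pred, classes)
    return sum(scores.values()) / len(classes)


def class_accuracies(y_true, y_pred):
    """Assumed definition: share of each class's utterances predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: correct[c] / totals[c] for c in totals}


def weighted_accuracy(y_true, y_pred):
    """Per-class accuracies weighted by each class's share of the test set."""
    acc = class_accuracies(y_true, y_pred)
    totals = Counter(y_true)
    n = len(y_true)
    return sum(totals[c] / n * acc[c] for c in totals)


def unweighted_accuracy(y_true, y_pred):
    """Plain average of the per-class accuracies."""
    acc = class_accuracies(y_true, y_pred)
    return sum(acc.values()) / len(acc)
```

Under this reading, WA is dominated by frequent classes, while UWA and macro-F1 give every emotion equal influence, which is why they differ sharply on imbalanced test sets.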

Compared Methods
We compare our models, HiTransformer and HiTransformer-s, with the following state-of-the-art baselines:

Parameters
We adopt the pretrained uncased BERT-Base model as the lower-level transferable language model, where the maximum input length is 512 and the number of stacked layers, each combining multi-head attention with a feed-forward neural network, is 12. For the upper-level transformer, the number of transformer layers is 4 and the number of heads in the multi-head attention is 8. For the classification layer, the internal hidden size is set to 300, and the dropout rate is 0.5 to prevent overfitting. We adopt Adam [Kingma and Ba, 2015] as the optimizer with a batch size of 1 and a learning rate of $1 \times 10^{-5}$.
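For reference, the hyperparameters listed above can be collected in a single configuration object. The sketch below is one possible way to do so; the class name, the field names, and the model identifier string `"bert-base-uncased"` are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HiTransformerConfig:
    """Hyperparameters as reported in the text; all names are hypothetical."""

    # Lower-level transformer (pretrained uncased BERT-Base).
    bert_model: str = "bert-base-uncased"  # assumed identifier string
    max_input_length: int = 512
    bert_layers: int = 12
    # Upper-level transformer.
    upper_layers: int = 4
    upper_heads: int = 8
    # Classification layer.
    classifier_hidden_size: int = 300
    dropout: float = 0.5
    # Optimization.
    optimizer: str = "adam"
    batch_size: int = 1
    learning_rate: float = 1e-5


config = HiTransformerConfig()
```

Freezing the dataclass prevents accidental changes to the settings during an experiment run.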

Result Analysis
We report the empirical results in Table 3, which presents the overall performance of our models on all datasets. From these results, we make the following observations.

Comparison with Baselines
Our proposed HiTransformer-s outperforms the state-of-the-art methods by significant margins on all the datasets in terms of macro-F1 score. Specifically, HiTransformer-s obtains 1.98%, 2.83%, and 3.94% absolute improvement on Friends, EmotionPush, and EmoryNLP, respectively. In addition, on Friends, HiTransformer-s obtains a 0.88% improvement over the best previous result in terms of WA, and is 0.12% below the best previous result, obtained by HiGRU-sf, in terms of UWA; however, it obtains an 8.18% improvement over HiGRU-sf in terms of WA. On EmotionPush, although HiTransformer-s is 0.78% lower than SA-BiLSTM in terms of WA, it is 8.03% above SA-BiLSTM in terms of UWA. Similarly, HiTransformer-s is 5.07% lower than HiGRU-sf in terms of UWA but 13.92% above HiGRU-sf in terms of WA. On EmoryNLP, HiTransformer-s obtains 1.88% and 2.37% absolute improvement in terms of WA and UWA, respectively. HiTransformer also outperforms the state-of-the-art methods on all the datasets in terms of the macro-F1 score. The above results demonstrate the effectiveness of HiTransformer-s and HiTransformer.

HiTransformer vs. HiTransformer-s
By analyzing ULER, we find that speaker information plays an important role in utterance classification. Therefore, we proposed HiTransformer-s on the basis of HiTransformer. From Table 3, we observe that HiTransformer-s outperforms HiTransformer on all three datasets in terms of macro-F1, WA, and UWA. Specifically, on Friends, HiTransformer-s attains 1.22%, 0.07% and 5.07% improvement over HiTransformer in terms of macro-F1, WA, and UWA, respectively.
On EmotionPush, HiTransformer-s attains 1.53%, 0.05% and 1.52% improvement over HiTransformer in terms of macro-F1, WA, and UWA, respectively. On EmoryNLP, HiTransformer-s attains 1.68%, 0.73% and 3.43% improvement over HiTransformer in terms of macro-F1, WA, and UWA, respectively. These results demonstrate that incorporating speaker information, as HiTransformer-s does, indeed improves on the performance of HiTransformer.

Conclusion
In this work, to address utterance-level emotion recognition in dialog systems, we propose a hierarchical transformer (HiTransformer) framework with a lower-level transformer to model word-level input and an upper-level transformer to capture the contexts of utterance-level embeddings. To obtain better individual utterance embeddings, we adopt BERT, which is pretrained on a massive unlabeled corpus, as the lower-level transformer. To enable HiTransformer to exploit speaker information, we propose HiTransformer-s. Our proposed hierarchical transformer models outperform the state-of-the-art methods on all three datasets in terms of macro-F1, which demonstrates that hierarchical transformer models can effectively capture the available utterance information in a dialog. In the future, we plan to pretrain a transformer model, similar to BERT, that captures the relationship between utterances, and to adopt it as the upper-level transformer to capture the textual information more fully, which can also address the problem of data scarcity in ULER.