Domain Adversarial Network for Cross-Domain Emotion Recognition in Conversation

Abstract: Emotion Recognition in Conversation (ERC) aims to automatically recognize the emotion of each utterance in a conversation. Because conversational data are difficult to collect and label, large-scale corpora are scarce for this task, which makes the supervised training required by large neural networks difficult. Introducing a large-scale generative conversational dataset can assist with modeling dialogue. However, after introducing the external dataset, the spatial distributions of feature vectors in the source and target domains are inconsistent. To alleviate this problem, we propose a Domain Adversarial Network for Cross-Domain Emotion Recognition in Conversation (DAN-CDERC) model, consisting of a domain adversarial model and an emotion recognition model. The domain adversarial model consists of the encoders, a generator and a domain discriminator. First, the encoders and generator learn contextual features from a large-scale source dataset. The discriminator performs domain adaptation by discriminating the domain, making the feature spaces of the source and target domains consistent and thereby yielding domain-invariant features. DAN-CDERC then transfers the learned domain-invariant dialogue context knowledge from the domain adversarial model to the emotion recognition model to assist in modeling the dialogue context. Due to the use of a domain adversarial network, DAN-CDERC obtains dialogue-level contextual information that is domain invariant, thereby reducing the negative impact of the inconsistency in domain spaces. Empirical studies illustrate that the proposed model outperforms the baseline models on three benchmark emotion recognition datasets.


Introduction
Emotion plays a significant role in daily life and in intelligent dialogue systems. Emotion recognition in conversation (ERC), one of the important tasks of Natural Language Processing, has attracted more and more attention in recent years. ERC aims to predict the emotion of each utterance in a conversation. It is particularly challenging because it must consider the sequential information of the conversation as well as self-speaker and inter-speaker dependencies [1].
In the literature, many neural network models have been applied to model dialogue and dependencies, such as recurrent neural networks [2,3], graph-based convolutional neural networks [4,5], and attention mechanisms [6][7][8]. However, some problems should not be ignored. A vital issue of emotion recognition in conversation is the lack of available labeled data, which is hard to collect and annotate. The emotion of the same statement in different dialogue scenarios is determined by the context, rather than remaining the same [9], and it is difficult for annotators to figure out the contextual information. As a result, only a relatively small number of datasets are available. Cross-domain sentiment classification, which aims to transfer knowledge from the source domain to the target domain, is one of the effective ways to alleviate the lack of data in the target domain, and many excellent achievements have been made in this area [12][13][14]. Cross-domain sentiment classification generally requires the source domain data to be labeled. Unlike sentiment classification, ERC lacks large-scale datasets and aims to identify the emotion of each utterance in a conversation, rather than of a single sentence. An utterance is affected by the speaker's own and external factors, such as topic, the speaker's personality, argumentation logic, viewpoint, intent, etc. [9]. In turn, the utterance reflects these factors to a certain extent, and these factors may lead to improved conversation understanding, including emotion recognition [9]. For the above reasons, aiming at the problem of the lack of labeled data, Hazarika et al. [15] pre-trained a hierarchical generative model jointly on whole conversations and transferred the knowledge to the target domain.
In summary, the generation task can be used to assist the ERC task, since the dialogue generation task and ERC have some similarities. However, the spatial distribution of feature vectors in the source and target domains is inconsistent after introducing external generation datasets.
Inspired by cross-domain sentiment classification, to alleviate the inconsistency between the feature spaces of the source and target domain datasets, we propose a Domain Adversarial Network for Cross-Domain Emotion Recognition in Conversation (DAN-CDERC) model, which transfers knowledge from the domain adversarial model to the emotion recognition model instead of directly modeling historical and speaker information. The domain adversarial model consists of the encoders, a generator and a domain discriminator. The encoders are used to learn sequence knowledge from the large-scale generative dataset. The generator generates an utterance in the source domain. The discriminator performs domain adaptation by discriminating the domain, which plays an essential role in reducing domain inconsistency. Since the generative conversational task and the ERC task both need to model the dialogue context, DAN-CDERC can use the dialogue sequence knowledge from the large-scale dialogue generation dataset to assist the ERC model in modeling the dialogue context. Due to the use of the domain adversarial network, our DAN-CDERC mitigates the mismatch between the domains' feature spaces during the transfer process. In this paper, we try to achieve the same effect as speaker-aware models without explicitly modeling speakers.
In sentence-level classification, the domain discriminator is used to discriminate individual sentences. Our discriminator instead judges, for each utterance in the conversation, whether it belongs to the source domain or the target domain, rather than classifying the whole conversation. We believe this is important. On the one hand, our model is relatively simple, with only two encoder layers, and it is challenging to represent a whole conversation effectively with a single vector. On the other hand, using sequential utterances is more conducive to learning and transferring sequence knowledge.
In summary, our contributions are as follows:
• To alleviate the problem of the small scale of ERC datasets, we propose the Domain Adversarial Network for Cross-Domain Emotion Recognition in Conversation, which not only learns knowledge from large-scale generative conversational datasets, but also utilizes adversarial networks to reduce the difference between the source and target domains;
• We use two large-scale generative conversational datasets and three emotion recognition datasets to verify model performance. The empirical studies illustrate the effectiveness of the proposed model, even without modeling information dependencies such as speakers.
The rest of the paper is organized as follows: Section 2 discusses related work; Section 3 provides details of our model; Section 4 shows and interprets the experimental results; Section 5 analyses and discusses the experimental results; and finally, Section 6 concludes the paper.

Related Work
Inspired by sentence-level cross-domain sentiment analysis, this paper utilizes large-scale dialogue generation datasets, adversarial networks and transfer learning for Emotion Recognition in Conversation. The related work includes dialogue generation, Emotion Recognition in Conversation, Cross-Domain Sentiment Analysis, Adversarial Networks and Transfer Learning.

Dialogue Generation
The Hierarchical Recurrent Encoder-Decoder (HRED) [16] is a classic generative hierarchical neural network. It has three key components: the utterance encoder, the context encoder and the decoder. The latent variable hierarchical recurrent encoder-decoder (VHRED) [17] extended HRED by adding a latent variable at the decoder, which is trained by maximizing a variational lower bound on the log-likelihood. The Variational Hierarchical Conversation RNN (VHCR) [18] augmented a global conversational latent variable along with local utterance latent variables to build a hierarchical latent structure, with a new regularization technique called utterance drop.
Moreover, ERC is a vital step toward endowing the dialogue system with emotional perception, and some researchers are interested in how to give dialogue systems such perception. Zhou et al. [19] proposed three mechanisms to make responses more emotional: embedding emotion categories, capturing the change of implicit internal emotion states, and using explicit emotion expressions from an external emotion vocabulary. Deeksha et al. [20] employed a multi-task learning framework to predict emotion labels, and used the emotion labels to guide the modeling of empathetic conversations. Li et al. [21] proposed a multi-resolution interactive empathetic dialogue model combining coarse-grained dialogue-level and fine-grained token-level emotions, which contains an interactive adversarial learning framework to judge emotional feedback. Xie et al. [22] combined 32 emotions and eight additional emotion regulation intentions to complete the task of empathetic response generation. Ide et al. [23] made the generated responses more emotional by adding emotion recognition tasks.

Emotion Recognition in Conversation
Unlike document-level sentiment and emotion classification, models for sentiment and emotion classification in conversation must both learn representations of words and documents and understand self- and inter-speaker dependencies. Most models for ERC are hierarchical network structures, including at least one utterance encoder layer to encode utterances and one context encoder layer to encode contextual content.
A dialogue, generally composed of multi-turn utterances, happens in a natural sequence, which is suitable for modeling with RNNs, so the RNN has become a fundamental component for emotion detection in conversation. Poria et al. [2] employed an LSTM-based model to capture dependencies and relations among the utterances. Majumder et al. [3] used three GRUs to model the speaker, the context and the emotion of the preceding utterances. In addition, the attention mechanism is also an important component. Wei et al. [24] employed GRUs and hierarchical attention to model the self- and inter-speaker influences of utterances. Jiang et al. [25] proposed a hierarchical model and introduced a convolutional self-attention network as the utterance encoder layer.
Due to the rise of graph neural network models and the context-propagation problems of current RNN-based methods, some works replace RNN-based networks with graph networks. Ghosal et al. [4] proposed the Dialogue Graph Convolutional Network (DialogueGCN) to model self- and inter-speaker dependencies. Zhang et al. [5] tried to address context-sensitive and speaker-sensitive dependencies using a conversational graph-based convolutional neural network in multi-speaker conversation. Sheng et al. [26] introduced a two-stage Summarization and Aggregation Graph Inference Network, which models inference over topic-related emotional phrases and local dependency reasoning over neighboring utterances. Zhang et al. [27] proposed a dual-level graph attention mechanism that augments the semantic information of the utterance, together with multi-task learning to alleviate the confusion between a few non-neutral utterances and the much more numerous neutral ones. Ma et al. [28] used a multi-view network to explore the emotion representation of a query from word-level and utterance-level views. TODKAT [29] used a topic-augmented language model (LM) with an additional layer specialized for topic detection, and combined the LM with commonsense statements derived from the knowledge base ATOMIC. SKAIG [30] used commonsense knowledge to enrich the edges of the graph with knowledge representations from the model COMET.

Cross-Domain Sentiment Analysis
Cross-domain sentiment analysis is one of the areas where a classifier is trained in one source domain and applied to one target domain. Due to different expressions of emotions across several domains, many pivot-based methods [14,31,32] have been proposed to address domain adaptation problems by learning non-pivot words and pivot words. The selection of non-pivot words and pivot words will directly affect the performance of the target domain. Another effective way is adversarial training [13,[33][34][35], which obtains domain-invariant features by deceiving the discriminator.

Adversarial Network and Transfer Learning
Multi-source transfer learning can also lay a foundation for modeling various aspects of different emotions (e.g., mood, anxiety), where only a limited number of datasets with a small number of data samples are available.
Liang et al. [36] treated emotion recognition and culture recognition as two adversarial tasks for cross-culture emotion recognition to address the problem of generalization across different cultures. Lian et al. [37] and Li et al. [38] treated the speaker characteristics and emotion recognition as two adversarial tasks to reduce the speaker's influence on emotion recognition. Parthasarathy et al. [39] proposed an Adversarial Autoencoder (AAE) to perform variational inference over the latent factors, including age, gender, emotional state, and content of speech.
Furthermore, some researchers utilize transfer learning for emotion recognition. Gideon et al. [40] showed that emotion recognition can benefit from representations originally learned for different paralinguistic tasks and domains. Felbo et al. [41] used 1246 million tweets to train a pre-training model for emoji recognition. Li et al. [42] utilized a low-level transformer as the utterance encoder layer and a high-level transformer as the context encoder layer. EmotionX-IDEA [43] and PT-Code [44] learn emotional knowledge from BERT. Hazarika et al. [15] pre-trained a hierarchical generative model jointly on whole conversations and transferred it to the target domain.
Our work strives to tackle the scarcity of ERC datasets. Hence we use a large amount of publicly available generative conversational data to model conversation, and introduce a domain discrimination task to enhance domain adaptability.

Domain Adversarial Network for Emotion Recognition in Conversation
In this paper, there are two domains: a source domain D_s and a target domain D_t. Because the source domain dataset is used to train the generative task, it has no emotion labels. For the source domain, given a dialogue containing m utterances d_s = {u_1, u_2, ..., u_m}, where m is the length of the dialogue, we can leverage {u_1, u_2, ..., u_{m-1}} and {û_2, û_3, ..., û_m} to train a generative conversational task. For the target domain, given a dialogue containing n utterances d_t = {u_1, u_2, ..., u_n} and n labels Y_d = {ŷ_1, ŷ_2, ..., ŷ_n}, where n is the length of the dialogue, our goal is to predict the emotion labels of d_t.
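Concretely, each source-domain dialogue yields the generative training pair by shifting the utterance sequence one position: the model reads u_1..u_{m-1} and is trained to produce u_2..u_m. A minimal sketch (function name illustrative, not from the paper):

```python
def make_generative_pair(dialogue):
    """Split a dialogue [u_1, ..., u_m] into the (input, target) pair
    for the generative task: read u_1..u_{m-1}, predict u_2..u_m."""
    if len(dialogue) < 2:
        raise ValueError("need at least two utterances")
    return dialogue[:-1], dialogue[1:]

inputs, targets = make_generative_pair(["u1", "u2", "u3", "u4"])
```

Each target utterance is thus the gold "next response" for the prefix ending at the corresponding input utterance.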
This study proposes a Domain Adversarial Network for Cross-Domain Emotion Recognition in Conversation (DAN-CDERC) model to address emotion recognition with generative conversation. DAN-CDERC contains two key components: the domain adversarial model and the emotion recognition model. Figure 2 shows the architecture of the domain adversarial model, where the input is {u_1, u_2, ..., u_m} for the source domain and {u_1, u_2, ..., u_n} for the target domain. The output of the generator is the generated response sequence {û_2, û_3, ..., û_m}, and the output of the discriminator is the domain labels. Figure 3 shows the architecture of the emotion recognition model, where the input is {u_1, u_2, ..., u_n} and the output is the emotion labels Y_d = {ŷ_1, ŷ_2, ..., ŷ_n}. The encoders, following the hierarchical recurrent encoder-decoder [16], are used to learn sequence knowledge from the massive generative dataset. The discriminator performs domain adaptation by discriminating the domain, which plays an essential role in reducing domain inconsistency. For the emotion recognition model, we leverage BERT [45] to encode utterances and an LSTM (the context encoder) to encode context, which inherits context weights from the generative conversational model. First, we leverage d_s = {u_1, u_2, ..., u_{m-1}} and {û_2, û_3, ..., û_m} to train a generative conversational model, and leverage d_s and d_t to train the domain-discrimination task. Then part of the parameters of the generative model are transferred to the emotion recognition model (the target task). Finally, we leverage d_t = {u_1, u_2, ..., u_n} and Y_d = {ŷ_1, ŷ_2, ..., ŷ_n} to train the emotion recognition model.

Encoder Layers
The encoder layers include an utterance encoder and a context encoder. A bidirectional LSTM is used as the utterance encoder, and a unidirectional LSTM is used as the context encoder. Given a dialogue d_s = {u_1, u_2, ..., u_{m-1}}, the utterance encoder uses Equation (1) to represent each u_i as a high-dimensional vector h_i. Then, the context encoder uses Equation (2) to learn the sequence knowledge of the context and represents d_s as {H_1, H_2, ..., H_{m-1}}.
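The two-level encoding above can be sketched in a few lines of Python; plain tanh-RNN cells stand in for the paper's BiLSTM/LSTM, and all dimensions, weights and names are toy illustrative values, not the trained model:

```python
import math
import random

random.seed(0)
D = 4  # toy hidden/word-vector size


def rand_mat(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def rnn_step(W, U, b, x, h):
    # h' = tanh(W x + U h + b): a plain-RNN stand-in for one LSTM step
    return [math.tanh(sum(W[i][j] * x[j] for j in range(D))
                      + sum(U[i][j] * h[j] for j in range(D)) + b[i])
            for i in range(D)]


W_u, U_u, b_u = rand_mat(D, D), rand_mat(D, D), [0.0] * D  # utterance encoder
W_c, U_c, b_c = rand_mat(D, D), rand_mat(D, D), [0.0] * D  # context encoder


def encode_dialogue(dialogue):
    """dialogue: list of utterances, each a list of D-dim word vectors.
    Returns the context states H_1..H_m (the Equation (2) analogue)."""
    H, ctx = [], [0.0] * D
    for utterance in dialogue:
        h = [0.0] * D                              # utterance state (Eq. (1) analogue)
        for word_vec in utterance:
            h = rnn_step(W_u, U_u, b_u, word_vec, h)
        ctx = rnn_step(W_c, U_c, b_c, h, ctx)      # fold h_i into the running context
        H.append(ctx)
    return H


dialogue = [[[0.1] * D, [0.2] * D], [[0.3] * D]]   # 2 utterances, 2 and 1 words
H = encode_dialogue(dialogue)
```

The key structural point is that the inner loop summarizes one utterance into h_i, while the outer recurrence threads those summaries through the dialogue to produce H_i.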

Conversational Generator
The generator is used to decode and generate the response u_{i+1}. At the decoding stage, the generator produces the new utterance u_{i+1} by computing a distribution over the vocabulary V for each target token of u_{i+1}, projecting the output of the decoder through a linear layer with weights W_o and bias b_o.
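This projection amounts to a softmax over the vocabulary; a minimal numeric sketch, where the tiny vocabulary, W_o and b_o are illustrative values rather than trained parameters:

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def next_token_distribution(decoder_state, W_o, b_o):
    """p(next token = w) = softmax(W_o s + b_o)[w] over the vocabulary."""
    d = len(decoder_state)
    logits = [sum(W_o[w][j] * decoder_state[j] for j in range(d)) + b_o[w]
              for w in range(len(W_o))]
    return softmax(logits)


# toy vocabulary of 3 tokens, 2-dim decoder output
W_o = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_o = [0.0, 0.0, 0.0]
p = next_token_distribution([2.0, 0.0], W_o, b_o)
best = p.index(max(p))  # greedy choice of the next token
```

At inference time the decoder would feed the chosen token back in and repeat until an end-of-utterance symbol is produced.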

Domain Discriminator
The role of the domain discriminator is to predict whether an utterance u_i comes from the source domain or the target domain. The generator and the discriminator are trained in parallel. For each u_i, an H_i can be obtained through the encoding stage of Section 3.1.1. Before H_i is fed to the domain classifier, it goes through the gradient reversal layer (GRL) [33].
During backpropagation, the role of the GRL is to reverse the gradient. In the forward pass the GRL acts as the identity, Ĥ_i = GRL(H_i) = H_i, while in the backward pass it scales the incoming gradient by -λ, i.e., ∂L/∂H_i = -λ ∂L/∂Ĥ_i. We denote the hidden state H_i after the GRL as Ĥ_i.
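A minimal sketch of the GRL's two passes, assuming the standard formulation of [33] (identity on the forward pass, gradient negated and scaled by λ on the backward pass); function names are illustrative:

```python
def grl_forward(H):
    # Forward pass: the GRL is the identity, Ĥ_i = GRL(H_i) = H_i
    return list(H)


def grl_backward(grad_wrt_output, lam=1.0):
    # Backward pass: the gradient is reversed and scaled, dL/dH_i = -λ * dL/dĤ_i
    return [-lam * g for g in grad_wrt_output]


H = [0.3, -0.2, 0.7]
H_hat = grl_forward(H)                       # unchanged features for the discriminator
grad = grl_backward([0.1, 0.1, -0.5], lam=0.5)
```

Because the reversed gradient flows back into the encoders, the encoders are pushed to make the two domains indistinguishable while the discriminator simultaneously tries to tell them apart.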

Emotion Recognition with Transfer Learning
Given a dialogue containing n utterances d = {u_1, u_2, ..., u_n}, where n is the length of the dialogue, our goal is to predict the labels {ŷ_1, ŷ_2, ..., ŷ_n}. This model has two components: an utterance encoder and a context encoder.

Utterance Encoder
BERT is a classic pre-trained model and has achieved good results on many NLP tasks. Consequently, BERT [45] is used to encode utterances; we choose the BERT-base uncased pre-trained model as our utterance encoder. Through BERT, we obtain the representations of the utterances.

Context Encoder and Transfer Learning
The context encoder of the classification model is the same as the context encoder of the generative conversational model, and the parameters of the latter are used for initialization. The input of the context encoder is h = {h_1, h_2, ..., h_n}, and the output is H = {H_1, H_2, ..., H_n}. We transfer {W_hr, W_hz, W_hn, b_hr, b_hz, b_hn, W_p, b_p} of the adversarial generative model to the context encoder of the classification model. Each H_i is then fed into a softmax output layer, P_i = softmax(W_p H_i + b_p), where W_p and b_p are model parameters and P_i is used to predict the emotion.
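The transfer step amounts to copying the listed parameters from the trained adversarial generative model into the classifier's context encoder before fine-tuning. A sketch with parameters held in plain dicts; the parameter names follow the paper, while the values and function name are illustrative:

```python
TRANSFERRED = ["W_hr", "W_hz", "W_hn", "b_hr", "b_hz", "b_hn", "W_p", "b_p"]


def transfer_context_encoder(generative_params, classifier_params):
    """Initialize the classifier's context encoder with the generative model's
    weights; all other classifier parameters are left untouched."""
    for name in TRANSFERRED:
        classifier_params[name] = generative_params[name]
    return classifier_params


gen = {name: [0.5] for name in TRANSFERRED}   # stand-in trained weights
clf = {name: [0.0] for name in TRANSFERRED}   # freshly initialized classifier
clf["W_softmax"] = [0.1]                      # task-specific head, not transferred
clf = transfer_context_encoder(gen, clf)
```

Only the context encoder is warm-started; the task-specific output head keeps its own initialization and is trained from scratch on the target data.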

Conversational Generator
The goal is to maximize the probability of the output X_H given the original input X_O. Therefore, we optimize the negative log-likelihood loss function L_gen(θ) = -Σ_{(X_O, X_H) ∈ τ} log p(X_H | X_O; θ), where θ denotes the model parameters, (X_O, X_H) is an (original utterance, new utterance) pair in the training set τ, and p(X_H | X_O; θ) is calculated by the decoder.

Domain Discriminator and Joint Learning
We feed Ĥ_i, obtained through the GRL, to the domain discriminator, which outputs a probability distribution over the two domains. Our training objective is to minimize the cross-entropy loss of this domain classification over a set of training examples. We jointly train the conversational generator and the domain discriminator, and the final loss is the sum of the losses of the two tasks: L = L_gen + L_dis.
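The joint objective, the generator's negative log-likelihood plus the discriminator's cross-entropy, can be sketched with toy probabilities (all values and names illustrative):

```python
import math


def nll_loss(token_probs):
    # Generator loss L_gen: -sum of log p(gold next token)
    return -sum(math.log(p) for p in token_probs)


def domain_ce_loss(domain_probs, domain_labels):
    # Discriminator loss L_dis: -sum of log p(true domain) over utterances
    return -sum(math.log(p[y]) for p, y in zip(domain_probs, domain_labels))


gen_probs = [0.9, 0.8]                  # p of each gold token under the generator
dom_probs = [[0.7, 0.3], [0.4, 0.6]]    # per-utterance [p(source), p(target)]
dom_labels = [0, 1]                     # 0 = source domain, 1 = target domain

loss = nll_loss(gen_probs) + domain_ce_loss(dom_probs, dom_labels)  # L = L_gen + L_dis
```

In training, both terms are backpropagated together; the GRL sits between the encoders and the discriminator, so minimizing this sum drives the encoders toward domain-invariant features.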

Emotion Recognition in Conversation
Given a dialogue d including n utterances and the pre-defined emotion y_i of each u_i, our training objective is to minimize the cross-entropy loss over a set of training examples, with an ℓ2-regularization term, where ŷ_i is the predicted label and θ_y is the set of model parameters.

We use IEMOCAP [10], MELD [11] and DailyDialog [48] to evaluate the performance of our model. Table 1 and Figure 4 show the statistics of the datasets.
• Ubuntu [47] consists of dyadic conversations extracted from the Ubuntu chat logs, which are used to receive technical support for various Ubuntu-related problems. It contains more than one million dialogues and 7 million sentences.
• IEMOCAP [10] is a multimodal dataset, of which we only use the textual data; emotion recognition on multimodal data is beyond the scope of this paper. It contains 151 dialogues and 7433 utterances. Each conversation consists of multi-turn utterances, and each utterance is annotated with one of the following emotions: angry, happy, sad, neutral, excited, and frustrated.
• MELD [11] is a multimodal emotion classification dataset that contains textual, acoustic and visual information. MELD is extended from the EmotionLines dataset [49]. It contains 1432 dialogues and 13,708 utterances from the Friends TV series. Each conversation consists of multi-turn utterances, and each utterance is annotated with one of the following emotions: angry, joy, neutral, disgust, sad, surprise and fear.
• DailyDialog [48] is a daily conversation dataset that reflects our daily ways of communication. It contains 13,118 multi-turn dialogues, with roughly eight speaker turns per dialogue. Each utterance is labeled with one of the following emotions: angry, happy, sad, surprise, fear, disgust and neutral (no_emotion).

Experiments
As displayed in Figure 4, the distribution of labels is relatively balanced in IEMOCAP. Unlike in IEMOCAP and MELD, the no_emotion label accounts for 83.1% of DailyDialog; because the dataset is so unbalanced, no_emotion is not evaluated. Table 2 shows the hyper-parameters of the model. For the baseline models, we use the hyper-parameters provided in the original papers or the same hyper-parameters as our setting. We employ AdaGrad [50] to optimize the classification model parameters. We use the F1-score to measure the classification performance for each category, and the average F1-score and accuracy to measure the overall performance. Table 3 presents the results of using different source domains for the three target datasets, where "×" means that only transfer learning is used without the adversarial network [15], and "√" means our proposed model with the adversarial network. As can be seen from Table 3, on the three target datasets, our DAN-CDERC achieves a significant performance improvement over the method without the adversarial network, higher by around 5%, 2%, and 7%, respectively. This verifies the effectiveness of our DAN-CDERC model, which builds a good bridge between the source domain and the target domain, and demonstrates the importance of domain adaptation in transfer between different domains for ERC.

Experimental Results
Across the different source domains, our method shows a consistent trend on the three target datasets: Cornell as the source domain is better than Ubuntu as the source domain, by 0.46%, 0.21%, and 0.6%, respectively. For transfer learning, in general, the larger the source dataset, the better the experimental performance on the target dataset, and as shown in Table 1, the scale of Ubuntu is an order of magnitude larger than that of Cornell. To explore the reason, we analyze the characteristics of these datasets. Cornell is composed of movie scripts from multiple websites, while Ubuntu mainly consists of various technical Ubuntu-related problems. As for the target domain datasets, IEMOCAP comes from drama scripts, MELD comes from movie scripts, and DailyDialog is daily dialogue. In terms of content, the similarity between the three target datasets and Cornell is greater than that with Ubuntu. This explains why, although Ubuntu is larger in scale, it is not as effective as Cornell as the source domain dataset.
It can be observed that, compared with the method without the adversarial network, our model has a pronounced effect on IEMOCAP (5%) and DailyDialog (7%), but a smaller impact on MELD (2%). This may be because IEMOCAP is relatively small and DailyDialog is relatively unbalanced: the knowledge brought by domain transfer can compensate for these deficiencies and improve performance. MELD, however, is relatively large and balanced, so the transfer from a different domain contributes less to performance improvement.

Analysis and Discussion
In this section, we give some analysis and discussion.

Comparison with Baselines
We compare our model with various baseline approaches for emotion recognition in conversation.
• bc-LSTM is a basic model which employs a BiLSTM to capture contextual content from the surrounding utterances without distinguishing different speakers;
• CMN [51] is the Conversational Memory Network, which models utterance context from the dialogue history using two GRUs for the speakers. The utterance representation is then obtained by feeding the current utterance as the query to two memory networks for the different speakers;
• ICON [6] uses GRUs to model the self- and inter-speaker sentiment influences and employs a memory network to store contextual summaries for classification. In our implementation, we only use the uni-modal classification;
• DialogueRNN [3] employs three GRUs (global GRU, party GRU, and speaker GRU) to model the speaker, the context and the emotion of the preceding utterances;
• DialogueGCN [4] uses a graph convolutional neural network to model self- and inter-speaker dependencies. It represents each utterance as a node and models the dependencies between the speakers of those utterances by leveraging the edges between pairs of nodes/utterances;
• DAN Cornell means that the source domain is Cornell;
• DAN Ubuntu means that the source domain is Ubuntu.
Table 4 presents the results of our proposed DAN-CDERC model and strong baselines. DAN Cornell and DAN Ubuntu achieve average F1-scores of 64.40% and 63.94% and accuracies of 65.07% and 64.61%, respectively. To our surprise, the F1-score of DAN Cornell outperforms DialogueGCN (when the learning rate is 0.000085, DAN Cornell achieves an F1-score of 64.68%, a 0.5% improvement over DialogueGCN). Although DAN Ubuntu does not perform as well as DialogueGCN, the difference is small, and it improves 1.19% over DialogueRNN; as mentioned in Section 4.2 above, the similarity between IEMOCAP and Ubuntu is small, and we mainly use adversarial networks to reduce the difference between the source and target domains, without resorting to complex modeling of inter- and self-party dependency.
For individual labels, our method also achieves good performance. Table 5 presents the results of our proposed model and strong baselines on MELD. DAN Cornell and DAN Ubuntu achieve F1-scores of 59.44% and 59.23%, which are 1.34% and 1.13% better than DialogueGCN, while DialogueGCN is only 1.06% better than DialogueRNN. MELD is a multi-party conversation dataset with more than 300 speakers, and normally there are several participants in each conversation; in the example in Figure 1, there are five participants. Additionally, we observe that speakers in a conversation often do not utter alternately: one speaker may utter several utterances continuously. Hence, it is not easy for models such as DialogueGCN to model the speaker information successfully.

DailyDialog: Table 6 presents the results of DialogueRNN, DialogueGCN, DAN Cornell and DAN Ubuntu (since DailyDialog is seriously unbalanced, we add two additional evaluation metrics, Micro F1 and Macro F1). Table 6 clearly shows that DialogueGCN performs poorly. Performance improvement is difficult due to the imbalance of the DailyDialog dataset, but compared with DialogueRNN, DAN Cornell and DAN Ubuntu still achieve a 1.42% and 0.82% improvement in weighted F1-score, respectively. Besides, our proposed DAN-CDERC model outperforms the baseline models in terms of Micro F1 and Macro F1. To explain this gap in performance, it is essential to understand the distribution of emotions in DailyDialog. As Figure 4 shows, most of the utterances are emotionless, so it may not be possible to model the speaker information successfully using DialogueRNN and DialogueGCN, in contrast to IEMOCAP. As shown in Table 7, we report the average length of the dialogues and of the utterances in IEMOCAP, MELD and DailyDialog. The average dialogue length of IEMOCAP is around 50 utterances, while that of MELD is around 10 and that of DailyDialog is around eight. Moreover, IEMOCAP has five sessions and two participants in each conversation; MELD has more than 300 participants; and although there are only two participants in each DailyDialog conversation, the dialogues are collected from different scenarios, that is, there is no relationship between the conversations. IEMOCAP, with its long conversations, few participants, and strong correlation between conversations, lends itself more easily to modeling inter- and self-dependencies than MELD (short conversations and many participants) and DailyDialog (short conversations and weak correlation between conversations). This is why, compared with the other best models, our model does not perform as well on IEMOCAP as on MELD and DailyDialog.
Those models are more conducive to establishing dependencies, while our model lacks this ability.
Experiments show that our model is effective on three datasets. In addition, since the proposed model does not model the speakers' information, it is effective not only for dyadic conversations, but also for multi-party conversations.

Effectiveness of the Utterance Encoder Layer
We try replacing BERT with an LSTM as the classification utterance encoder; the utterance encoder parameters of the domain adversarial model are transferred to the encoder of the classification model, and the results are shown in Table 8. The results demonstrate that BERT provides better representations of utterances than the LSTM. When Cornell is the source domain, the gap is 3.16% on IEMOCAP and 2.12% on DailyDialog; the other gaps are all around 1%. However, the LSTM as an utterance encoder still exceeds the performance of using BERT as the utterance encoder without the adversarial network. This indicates that a suitable utterance encoder and the domain adversarial network jointly promote performance improvement.
Moreover, we try to employ the utterance encoder parameters of the domain adversarial network to initialize the utterance encoder parameters of the emotion recognition model. However, we find that this method is not helpful for performance improvement. The possible reason for this phenomenon is that the representations of utterances differ between generative and emotion recognition tasks in different domains.

Source Domain Size
We compare the results with different sizes of the source domain dataset: 0%, 10%, 20%, 50% and 100% of the available source domain data (source domain: Cornell; target domain: IEMOCAP). Figure 5 presents the results on the IEMOCAP dataset. A clear trend can be seen: as the size of the source domain dataset increases, the classification performance on the target domain improves. Compared with the method without transfer learning (0% source domain data), using only 10% of the source domain dataset already brings a significant improvement of 2.15%. This shows the effectiveness of our method: the adversarial-network-based approach improves performance by learning some inherent sequence knowledge, not merely through the increasing scale of the dataset.

Comparison of Time and Number of Parameters
As shown in Table 9, we count the time required per epoch and the number of parameters for different methods on IEMOCAP in the inference stage. Due to the significant amount of time required to process the pre-trained 840B GloVe vectors [52], the total time used by DialogueGCN is much more than that of our DAN model, so the proposed DAN-CDERC model takes the least total time. In addition, our model is simple, with only two layers (a BERT layer and a unidirectional LSTM layer), whereas DialogueGCN has three layers (a CNN layer, a bidirectional LSTM layer and a GCN layer) and DialogueRNN has a CNN, three GRUs and an attention layer; they also need to model various kinds of information. Table 10 presents an example from IEMOCAP. This dialogue is carried out in a pessimistic atmosphere and alternates between emotions. Due to the misclassification of utterance U45 (speaker F) and the emotional alternation in the dialogue, DialogueGCN does not perform well on the next several utterances. Paying too much attention to the previous utterances and the speaker may cause this phenomenon.
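The parameter counts in such a comparison are just the summed sizes of each model's weight tensors. The sketch below illustrates this counting on a toy two-layer model; the layer names, sizes, and the `count_parameters` helper are assumptions for illustration, and the real counts in Table 9 come from the actual models.

```python
from math import prod

def count_parameters(shapes):
    """Count trainable parameters given a name -> tensor-shape mapping.
    Illustrative only; frameworks expose this via their parameter lists."""
    return sum(prod(shape) for shape in shapes.values())

# Toy model: an embedding matrix plus one unidirectional LSTM layer.
# An LSTM with input size d_in and hidden size h has 4*h*(d_in + h)
# weights (four gates) plus bias terms.
d_in, h = 300, 128
toy_model = {
    "embedding.weight": (10000, d_in),       # 3,000,000
    "lstm.weight": (4 * h, d_in + h),        # 219,136
    "lstm.bias": (4 * h,),                   # 512
}
print(count_parameters(toy_model))  # 3219648
```

A side effect of this exercise: the embedding (or pre-trained encoder) typically dominates the count, which is why architectural simplicity mainly pays off in runtime rather than raw parameter totals.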

Case Studies
We analyze the predicted labels on the IEMOCAP dataset. From the confusion matrix, we find that our model mainly misclassifies in two cases: it mistakes "Sad" and "Frustrated" for "Neutral", and it mistakes "Neutral" for "Frustrated". As can be seen from Figure 4a, "Neutral" and "Frustrated" account for a large proportion of the labels, so these errors may be caused by the imbalanced distribution of emotion labels. Recognizing these emotions depends on contextual emotions, which is a shortcoming of our model compared to DialogueGCN and DialogueRNN.
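The error analysis above reads off-diagonal cells of a confusion matrix. As a minimal sketch (the `confusion_matrix` helper and the toy gold/predicted labels are hypothetical, not the paper's actual predictions):

```python
from collections import Counter

def confusion_matrix(gold, pred, labels):
    """Build a nested dict cm[gold_label][pred_label] -> count."""
    counts = Counter(zip(gold, pred))
    return {g: {p: counts[(g, p)] for p in labels} for g in labels}

labels = ["Neutral", "Sad", "Frustrated"]
# Toy predictions exhibiting the two dominant error types described above.
gold = ["Sad", "Sad", "Neutral", "Frustrated", "Neutral"]
pred = ["Neutral", "Sad", "Frustrated", "Neutral", "Neutral"]
cm = confusion_matrix(gold, pred, labels)
print(cm["Sad"]["Neutral"], cm["Neutral"]["Frustrated"])  # 1 1
```

Off-diagonal cells such as `cm["Sad"]["Neutral"]` directly quantify each confusion pattern, which is how the two dominant error cases are identified.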

Conclusions
Given the lack of large-scale publicly available ERC datasets, transfer learning is an effective way to alleviate the data-scarcity problem. We present a Domain Adversarial Network for Cross-Domain Emotion Recognition in Conversation (DAN-CDERC) model, consisting of two parts: the domain adversarial model and the emotion recognition model. The domain adversarial network employs a conversational dataset to train on the generative task while simultaneously using source-domain and target-domain data to train the domain discriminator for domain adaptation. The emotion recognition model receives the transferred sequence knowledge and recognizes the emotions. When Cornell is the source dataset, DAN-CDERC achieves F1 scores of 64.40%, 59.44% and 55.20% on the three datasets, all outperforming the baselines, without resorting to complex modeling of inter-speaker and self-speaker dependencies. In addition, the data scale of the source domain affects emotion recognition, but across source-domain datasets, domain similarity matters more than data scale. Since DAN-CDERC does not model speaker information, it is effective not only for dyadic conversations but also for multi-party conversations. Our method demonstrates the feasibility of using conversational datasets and domain adaptation for ERC.
Although this paper attempts to solve the domain adaptation problem for ERC, the inconsistency of the domain space between different tasks and different datasets has not been fully considered. In addition, due to the introduction of large-scale data and the use of adversarial networks, the training time of the model on the source task is long, which is one of the shortcomings of adversarial transfer networks.
In the future, exploring more effective and faster adaptation strategies for ERC is worthy of continued in-depth research.