Diversifying Emotional Dialogue Generation via Selective Adversarial Training

Emotional perception and expression are essential for building intelligent conversational systems that are human-like and engaging. Although deep neural approaches have made great progress in conversation generation, guiding systems to generate responses with appropriate emotions remains an open problem. Meanwhile, the tendency of such systems to generate high-frequency, generic responses remains largely unsolved. To address these problems, we propose a method that generates diverse emotional responses through selective perturbation. Our model includes a selective word perturbation module and a global emotion control module. The former introduces disturbance factors into the generated responses to enhance their expressive diversity. The latter maintains the coherence of the response by constraining its emotional distribution, preventing excessive deviation in emotion and meaning. Experiments on two datasets show that our model outperforms existing baselines in terms of emotional expression and response diversity.


Introduction
Building dialogue systems with the ability to communicate naturally with people is a fundamental task of building intelligent agents. Emotional expression is a key characteristic of a human-like dialogue system. Enabling dialogue systems to understand and express emotions has multiple benefits [1,2]:
• More natural communication: Emotions are an important part of human communication. When dialogue systems can understand and express emotions, they can more accurately capture and respond to users' emotional expressions, making conversations more natural and human-like.
• Emotion recognition: By understanding the user's emotions, the dialogue system can better understand the user's intentions and needs. Emotion recognition helps to parse user input more precisely and provide responses and support based on emotional information.
• Emotional support: The dialogue system can express emotions and provide users with emotional support and emotional management. When users need reassurance, encouragement, or understanding, the emotional expression of dialogue systems can provide a positive impact and an emotional connection.
• Improved user experience: Emotion plays an important role in user experience. When the dialogue system is able to recognize and respond to the user's emotions, the user feels understood and cared for, which helps to build a better user experience and enhance user satisfaction with the dialogue system.
• Emotion research and application: The ability of dialogue systems to understand and express emotions also contributes to the field of emotion research and its applications.
Early approaches relied on hand-crafted rules to generate emotional responses, but these methods had significant shortcomings in cost and flexibility. The development of deep neural networks has greatly advanced the field, and recent research has achieved promising results in this area [3][4][5][6]. On one hand, these advancements have benefited from the successful application of general models such as seq2seq, CVAE, and Transformers to dialogue response generation, which has significantly improved model performance. On the other hand, the increasing focus on affective computing within the academic community has led to the development of affective dialogue datasets such as the one introduced with the Emotional Chatting Machine [3] and Empathetic Dialogues [7]. These datasets provide valuable data support for model training.
Another important capability a dialogue system should possess is the ability to generate diverse responses. However, one problem with neural training approaches is that the resulting models tend to produce high-frequency responses, often meaningless statements such as "I don't know". This problem arises from the MLE training objective, which leads to overconfident probability estimates for high-frequency tokens [8] and in turn to decreased diversity [9,10]. Since dialogue is a one-to-many mapping, multiple responses are appropriate for the same input. The ideal training target should therefore be a soft target that assigns probability weights to multiple valid candidates [11]. However, studies have shown that the per-target perplexity of real text fluctuates significantly, making such soft targets hard to construct [12].
To tackle the problems mentioned above, we propose introducing perturbations to the decoding process of the system, which can reduce the generation of high-frequency words to some extent. In order to maintain emotional balance, we use the response's emotion label to regulate the impact of perturbations on the system's output. This ensures that the generated response maintains emotional consistency, preventing large deviations that could disrupt the overall emotional context.
To achieve this, we propose a CVAE-based model architecture. During training, the encoder processes both the input and the response, and the recognition network captures the latent variable z, which guides response generation and emotion recognition. To introduce perturbations, we incorporate a perturbation word selector that predicts the type of each decoded word y_t and determines whether to include a disturbance factor r. A global emotion label constraint, which uses an emotion classifier to identify the emotion of the hidden state s_t during decoding, determines the value of the disturbance factor r. This keeps the generated response emotionally consistent with the real response. To better learn the characteristics of the real response, we use KL divergence to narrow the gap between the prior network and the recognition network. This paper's contributions can be summarized as follows:
• We propose a selective disturbance module that uses a perturbation word selector to perturb a portion of the response words based on learned latent variables, thereby improving response diversity.
• We introduce a global emotion label constraint to control the impact of perturbations during decoding, ensuring that the model improves response diversity while maintaining emotional expression.
• Extensive experiments on two standard datasets demonstrate that our model generates more diverse emotional responses than the baselines.

Emotional Response Generation
In recent years, emotional dialogue generation has attracted significant interest. For instance, Zhou et al. proposed the Emotional Chatting Machine (ECM), which leverages external emotional vocabulary and internal emotional state memory to enable the system to generate responses of specific emotional categories [3]. Huang et al. used a special dictionary word representing a specific emotion as an emotion marker at the encoder or decoder side, pushing the decoder to generate responses with target emotions [13]. Song et al. proposed an emotional dialogue system (EmoDS) that leverages utterance-level classifiers and extra emotion vocabulary for generation [6]. Colombo et al. used a continuous representation of emotion to produce emotional responses in a controlled manner [14].
Using an emotion dictionary poses a challenge as the inclusion of fixed emotion words can result in a lack of consistency and diversity in the generated responses' content. To address this issue, a CVAE-based emotion regularization method called Emo-CVAE has been developed to enhance the emotional expression of responses [15]. This approach greatly enhances the accuracy of predicting response emotions and also promotes diversity. However, Emo-CVAE only incorporates the emotion label as an additional input condition and does not explore the interplay between emotion and the content generated in responses.
Moreover, Rashkin et al. introduced the Empathetic Dialogues dataset, the first dialogue dataset focused on empathy, which categorizes dialogues into 32 emotional categories [7]. In a similar vein, Lin et al. developed a specialized decoder that can generate responses tailored to the emotions expressed by the interlocutor [16]. Majumder et al. explored the concept of emotional imitation [17] and developed a generation model that relies on similar examples [18]. Subsequently, classical models emerged, such as EmpTransfo, a GPT-based model that incorporates an empathy prediction task [19], and CoMAE, which employs a hierarchical approach to model empathy factors [20]. Nevertheless, these models have yet to generate dialogue responses that authentically and accurately express emotions as humans do. In addition to empathetic conversation generation, there are also studies from the perspective of emotion regulation that combine emotion and conversational intention to generate responses [21]. The studies mentioned above primarily emphasize enhancing the emotional representation of the model; research specifically targeting the diversity of generated emotional responses is lacking.

Response Diversity
In Section 1, we mentioned that neural dialogue systems tend to produce high-frequency but boring responses. Avoiding this is a long-standing problem in response generation research, and researchers have approached it in different ways. Some methods refine the MLE training objective [10,22,23]. Others design auxiliary loss terms that impose a penalty on the response [24,25]. In addition, alternatives to MLE have been tested continually. Li et al. proposed a diversity-promoting objective based on maximum mutual information (MMI) [26]. Building on this, Zhang et al. proposed optimizing an adversarial information maximization objective [27]. Some researchers use constraints on target responses to enhance diversity [28,29]. An adaptive label smoothing method has been proposed to adaptively estimate target distributions during decoding in different contexts [30]. Negative training strategies have also been used to improve response diversity [31]. Although these methods are effective to a certain extent, they also have disadvantages. Take MMI as an example: although it increases mutual information, the resulting response is likely to carry the same content as the input and bring no new information. A possible reason is that the model can easily find a shortcut to maximize mutual information by simply copying a portion of the tokens from the preceding utterance, rather than learning conversational features.
Inspired by related studies and by the idea of adversarial training [32,33], we apply controlled disturbances to the response generation process to make the model generate more diverse responses. It should be noted that dialogue models based on adversarial learning are difficult to train and may suffer from mode collapse, which is not conducive to response diversity. Therefore, we choose to perturb the decoded word embedding rather than the decoder's hidden state. At the same time, the response's emotion label is used to constrain the disturbance, ensuring that increased diversity does not come at the cost of emotional expression.

Formalized Definition
For a given input utterance X = (x_1, x_2, . . . , x_n), we aim to generate an appropriate response Y = (y_1, y_2, . . . , y_m) that expresses an appropriate emotion e, where n is the number of words in X, m is the length of the response, e ∈ {e_1, . . . , e_k}, and k is the number of emotion categories. Concatenating the inputs gives the dialogue context c = [X; e]. The target of response generation is

P(Y | c) = P(z | c) P(Y | z, c),

where z is a latent variable that learns the characteristic distribution of Y, P(z | c) denotes the sampling of z from the input, and P(Y | z, c) is the decoding process that generates the response from the latent variable and context. It can be expressed as

P(Y | z, c) = ∏_{t=1}^{m} P(y_t | y_{<t}, z, c),

where y_t is the word decoded at the current time step and y_{<t} denotes the first t − 1 words generated by the decoder.

Model Framework
An overview of our model, which is built on the CVAE framework [34], is shown in Figure 1. The encoder encodes the input and the response separately, and its outputs feed the recognition network to obtain the latent variable z. A classifier performs emotion recognition on z. The perturbation word selector predicts the type of each generated word y_t from z and the decoder hidden state s_t and controls the addition of perturbation. At the same time, emotion recognition is carried out on the decoder-generated response. By fitting the generated response's emotion to the real response's emotion distribution, the disturbance factor r is dynamically constrained, so that the generated response carries an emotion similar to that of the real response. The perturbation factor is applied to the word embedding of y_{t−1} to influence decoding, thereby enhancing response diversity.

Basic Encoder-Decoder
Our model is implemented on the Encoder-Decoder framework, which we introduce in this section. Here, h_t denotes the current hidden state of the encoder and s_t the current hidden state of the decoder; h_{t−1} and s_{t−1} denote the hidden states of the encoder and decoder at the previous time step, respectively. The Encoder and Decoder can be any concrete structure such as an RNN, LSTM, or Transformer, so we do not commit to a specific network architecture.
For each word x_t in input X, we first obtain its embedding representation w(x_t) and feed it to the encoder. The hidden state h_t is then computed from the current input w(x_t) and the previous hidden state h_{t−1}:

h_t = Encoder(w(x_t), h_{t−1}).
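As a concrete instance of this recurrence, the sketch below implements a minimal GRU encoder cell in numpy. This is an illustration only: the paper does not fix the cell type here, and the weight names and random toy inputs are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU encoder cell: h_t = GRU(w(x_t), h_{t-1}).
    Weight shapes and names are illustrative, not from the paper."""
    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        # stacked parameters for the update (z), reset (r), and candidate (n) gates
        self.W = rng.normal(0, 0.1, (3 * hidden_size, input_size))
        self.U = rng.normal(0, 0.1, (3 * hidden_size, hidden_size))
        self.b = np.zeros(3 * hidden_size)
        self.hidden_size = hidden_size

    def step(self, x, h_prev):
        H = self.hidden_size
        g = self.W @ x + self.b   # input contribution to all three gates
        u = self.U @ h_prev       # recurrent contribution
        z = sigmoid(g[:H] + u[:H])            # update gate
        r = sigmoid(g[H:2*H] + u[H:2*H])      # reset gate
        n = np.tanh(g[2*H:] + r * u[2*H:])    # candidate state
        # interpolate between the previous state and the candidate
        return (1 - z) * h_prev + z * n

# encode a toy sequence of word embeddings w(x_1)..w(x_5)
cell = GRUCell(input_size=4, hidden_size=3)
h = np.zeros(3)
for w_x in np.random.default_rng(1).normal(size=(5, 4)):
    h = cell.step(w_x, h)
```

The final `h` plays the role of the encoder state h_t after reading the sequence.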
To improve the decoder's performance, dynamic attention is used to allow the decoder to focus on different content at different time steps:

α_ti = exp(score(s_t, h_i)) / ∑_j exp(score(s_t, h_j)),    a_t = ∑_i α_ti h_i.

Here, α_ti represents the weight between the decoder's state s_t and the encoder's state h_i.
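A minimal numpy sketch of this attention step follows. The dot-product score function is an assumption (the paper leaves the score unspecified), and the toy encoder states are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def attention(s_t, H):
    """Dot-product attention: alpha_ti = softmax_i(s_t . h_i),
    a_t = sum_i alpha_ti * h_i. H stacks encoder states h_i row-wise."""
    scores = H @ s_t          # one score per encoder state
    alpha = softmax(scores)   # attention weights alpha_ti
    return alpha @ H, alpha   # context vector a_t and the weights

H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # encoder states h_1..h_3
s_t = np.array([1.0, 0.0])                           # current decoder state
a_t, alpha = attention(s_t, H)
```

The weights sum to one, and states aligned with s_t receive more mass.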
Based on the above, the decoder's hidden state s_t is updated from the previous hidden state s_{t−1}, the dynamic attention a_{t−1}, and the previously generated word y_{t−1}:

s_t = Decoder(s_{t−1}, a_{t−1}, w(y_{t−1})).

A softmax layer is employed to predict the current word y_t from the decoder's hidden state s_t:

P(y_t | y_{<t}, c) = softmax(W s_t),

where w(y_{t−1}) is the embedding of y_{t−1}.

Figure 1. The overall framework. During training, the input and the real response are encoded separately. The latent variable z is sampled by the recognition network, and a classifier identifies the emotion category of z. During decoding, the perturbation word selector jointly predicts the type of the current generated word from the hidden state s_t and z, and selectively applies the perturbation factor r to the decoding according to that type. At each decoding time step, the emotion of the currently generated response is identified, and the disturbance factor r is dynamically constrained by bridging the gap between the emotion expressed in the generated response and the actual emotion. The section enclosed by the blue dotted line is the selective word disturbance module. The response encoder is used only during training.

Latent Variable Learning
Building on the basic framework in Section 3.2, we incorporate the two networks of CVAE, namely the recognition network and the prior network, and sample from the input and response during training and testing, respectively. The latent variable z contains rich features and plays a crucial role in the selection of disturbance words and the emotion classification of the utterance.
We assume that z follows a multivariate Gaussian distribution with a diagonal covariance matrix. Specifically, during training, the recognition network conditions on the real response and yields the posterior distribution q_θ(z | Y, c) ∼ N(μ, σ²I).
During testing, the prior network p_θ(z | c) is used to sample the latent variable, which participates in the decoder's response generation. Since the goal of the system is to make the generated response close to the real response, KL divergence is used to estimate the difference between the two probability distributions. Minimizing the KL divergence between the prior network and the recognition network allows the former to better fit the latter. We therefore take the KL term as part of the total system loss and denote it L_1:

L_1 = KL( q_θ(z | Y, c) ‖ p_θ(z | c) ).

Both the recognition network and the prior network can be parameterized by MLPs that output the Gaussian parameters [μ, σ²].
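The sampling and the L_1 term can be sketched as follows: the standard reparameterization trick for drawing z, and the closed-form KL divergence between two diagonal Gaussians. Both are standard CVAE machinery; the toy parameter values are assumptions.

```python
import numpy as np

def sample_z(mu, logvar, rng):
    """Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I)."""
    return mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, var_q I) || N(mu_p, var_p I) ) for diagonal
    Gaussians, i.e. the L1 term between recognition and prior networks."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

rng = np.random.default_rng(0)
mu_q, logvar_q = np.array([0.5, -0.2]), np.array([0.0, 0.0])  # recognition net output
mu_p, logvar_p = np.zeros(2), np.zeros(2)                     # prior net output
z = sample_z(mu_q, logvar_q, rng)
kl = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
```

The KL is zero exactly when the two distributions coincide and positive otherwise.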

Adversarial Word Selector
Adding perturbations to models to enhance robustness has been practiced in several studies [32]. However, in dialogue system training, adversarial learning models are difficult to train and may suffer from mode collapse, making them more inclined to generate boring responses. To address the lack of diversity in generated responses, we propose reducing the model's generation of high-frequency words by introducing perturbations that influence the decoding process. We introduce the emotional category of the response as a regulator: the extent to which the disturbance affects the system's output depends on the emotional category of the response. The emotional category serves as an auxiliary input to the decoder, concatenated with the decoder's state and used to regulate the perturbation process. Specifically, a simple feedforward neural network maps the emotional category to a weight vector, which adjusts the magnitude of the disturbance for each token in the response. This allows us to balance diversity and emotional relevance in the generated responses. The emotional labels constrain this process as described in Section 3.6.
It is important to note that not all generated words are suitable for perturbation. Research shows that topic words play a very important role in dialogue interaction [35], and random topic drift is not conducive to dialogue continuity. Therefore, our model needs to distinguish whether the current generated word is a topic word or a general word, so as to selectively perturb the decoding process. To identify topic words, we follow the PMI method [36]: for any word x_i in utterance X and y_j in response Y,

PMI(x_i, y_j) = log( p(x_i, y_j) / (p(x_i) p(y_j)) ).

PMI measures the co-occurrence of words in a corpus and can be used to identify words that frequently occur together. A higher PMI score indicates a stronger association between two words, which can be interpreted as their being more likely related to the main topic.
Further, we compute the PMI value between the sequence X = (x_1, . . . , x_n) and y_i by aggregating over the words of X:

PMI(X, y_i) = ∑_{j=1}^{n} PMI(x_j, y_i).

Each word in X is thus assessed for relevance to y_i, and a higher score indicates that y_i is more likely to be relevant to the topic.
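The PMI computation can be sketched from raw co-occurrence counts as below. The zero-return for unseen pairs and the summed aggregation over X are simplifying assumptions; the toy corpus is illustrative.

```python
from collections import Counter
from math import log

def pmi(x, y, pair_counts, x_counts, y_counts, n_pairs):
    """PMI(x, y) = log( p(x, y) / (p(x) p(y)) ) from co-occurrence counts.
    Returns 0 for unseen pairs (a simplifying assumption)."""
    if pair_counts[(x, y)] == 0:
        return 0.0
    p_xy = pair_counts[(x, y)] / n_pairs
    p_x = x_counts[x] / n_pairs
    p_y = y_counts[y] / n_pairs
    return log(p_xy / (p_x * p_y))

def sequence_pmi(X, y, pair_counts, x_counts, y_counts, n_pairs):
    """Relevance of a candidate response word y to the whole input X,
    scored as the sum of word-level PMI values."""
    return sum(pmi(x, y, pair_counts, x_counts, y_counts, n_pairs) for x in X)

# toy corpus of (input word, response word) co-occurrences
pairs = [("rain", "umbrella"), ("rain", "umbrella"), ("rain", "the"),
         ("sun", "the"), ("sun", "beach")]
pair_counts = Counter(pairs)
x_counts = Counter(x for x, _ in pairs)
y_counts = Counter(y for _, y in pairs)
n = len(pairs)

topic_score = sequence_pmi(["rain"], "umbrella", pair_counts, x_counts, y_counts, n)
generic_score = sequence_pmi(["rain"], "the", pair_counts, x_counts, y_counts, n)
```

Here the topical word "umbrella" scores above the generic word "the", which is exactly the distinction the selector needs.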
In the decoding process, the adversarial word selector combines the current state s_t and the latent variable z to predict the category of the currently generated word. If it is a topic word, it is not disturbed; otherwise, it is disturbed:

P(tp | s_t, z) = softmax(W_o MLP_adv([s_t; z])),

where MLP_adv is the prediction network for the current word's class, W_o is the corresponding weight matrix, and tp is the marker indicating whether the current word belongs to the main topic, with values 1 and 2.
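A minimal sketch of the selector follows, with MLP_adv collapsed to a single linear layer for brevity (the paper's network may be deeper); all weights and inputs here are random placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def select_word_type(s_t, z, W_o):
    """Perturbation word selector: predicts tp in {1, 2} (topic word vs.
    general word) from the decoder state s_t and latent variable z.
    W_o acting on [s_t; z] stands in for W_o MLP_adv([s_t; z])."""
    logits = W_o @ np.concatenate([s_t, z])
    probs = softmax(logits)
    return int(np.argmax(probs)) + 1, probs  # tp = 1 (topic) or 2 (general)

rng = np.random.default_rng(0)
s_t = rng.normal(size=3)       # decoder hidden state
z = rng.normal(size=2)         # latent variable
W_o = rng.normal(size=(2, 5))  # output weight matrix
tp, probs = select_word_type(s_t, z, W_o)
```

Downstream, tp = 1 leaves the decoding step untouched while tp = 2 triggers the perturbation of Section 3.5.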

Selective Adversarial Decoding
Building on the framework introduced in Section 3.2, we decode jointly with the context, the latent variable, and the prediction of the adversarial word selector:

P(y_t | y_{<t}, c, z) = P(y_t | y_{t−1}, s_t, c, z).

Here, tp ∈ {1, 2} is the word category predicted by the perturbation word selector, indicating whether the current generated word is a topic word and hence whether a disturbance is added to it. When decoding y_t, we add the disturbance to the embedding of y_{t−1} rather than directly to the hidden state s_t, to keep the disturbance effects independent across generated response words. If tp = 1, the current generated word is a topic word, and its generation probability is

P(y_t | y_{t−1}, s_t, c, z, tp = 1) = softmax(W_1 s_t).

Otherwise, the current generated word is not a topic word, and a disturbance is added to the embedding of y_{t−1}:

r = −g / ‖g‖_2,

where r is the perturbation term added to the embedding of y_{t−1}, and g is the gradient of the emotional consistency loss (described in Section 3.7), in which e is the emotion category and θ̂ is the emotional classifier's parameter set. The perturbation r uses L_2 normalization, dividing each dimension of the gradient by its L_2 norm, in order to preserve the direction of the gradient. The prediction of the current word is then

P(y_t | y_{t−1}, s_t, c, z, tp = 2) = softmax(W_2 s_t).
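The normalized-gradient perturbation r = −g / ‖g‖_2 applied to the previous word's embedding can be sketched directly; the toy gradient and embedding values are assumptions for illustration.

```python
import numpy as np

def perturb_embedding(w_prev, grad):
    """Adversarial perturbation r = -g / ||g||_2 applied to the embedding of
    y_{t-1}; L2 normalization keeps the gradient's direction but unit scale."""
    r = -grad / np.linalg.norm(grad)  # np.linalg.norm defaults to the L2 norm
    return w_prev + r

g = np.array([3.0, 4.0])       # gradient of the emotional consistency loss
w_prev = np.array([0.1, 0.2])  # embedding of y_{t-1}
w_perturbed = perturb_embedding(w_prev, g)
```

With ‖g‖_2 = 5, r = (−0.6, −0.8), so the embedding always moves exactly unit distance against the gradient regardless of the gradient's raw magnitude.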
Thus, the loss of the decoding process is

L_2 = − ∑_{t=1}^{m} log P(y_t | y_{t−1}, s_t, c, z, tp).

Emotional Label Constraint
To prevent significant deviation between the generated and actual responses, it is crucial to control the amount of disturbance during decoding. To this end, we introduce a global emotional label constraint. In the training process, the recognition network yields the latent variable z, whose emotion is identified by the emotion classifier, producing the emotion distribution

q_ψ(e | z) = softmax(W_E MLP_emo(z)),

where MLP_emo is an emotion classifier implemented as an MLP, which identifies the emotion category of the real response from the latent variable z, and W_E is the corresponding weight matrix.
When decoding, the current emotion distribution p_ψ(e | s_t, z) is obtained by the emotion classifier from the decoder's hidden state s_t and the latent variable z. We expect the response generated under perturbed decoding to be emotionally consistent with the real response. To limit the deviation caused by excessive perturbation during decoding, we use KL divergence to measure the distance between the two emotion distributions and constrain the perturbation signal r accordingly. The resulting emotional constraint loss is

L_3 = ∑_{t=1}^{m} KL( q_ψ(e | z) ‖ p_ψ(e | s_t, z) ).
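This distance between the two categorical emotion distributions can be sketched as follows; the two example distributions are placeholders, not classifier outputs from the paper.

```python
import numpy as np

def categorical_kl(q, p, eps=1e-12):
    """KL( q || p ) between two discrete emotion distributions; used here to
    keep the emotion of the perturbed decoding close to the real response's."""
    q = np.clip(q, eps, 1.0)  # avoid log(0)
    p = np.clip(p, eps, 1.0)
    return float(np.sum(q * np.log(q / p)))

q_real = np.array([0.7, 0.2, 0.1])  # q_psi(e | z): real-response emotions
p_gen = np.array([0.6, 0.3, 0.1])   # p_psi(e | s_t, z): generated emotions
loss_emo = categorical_kl(q_real, p_gen)
```

The loss vanishes when the generated emotion distribution matches the real one, so minimizing it pulls the perturbed decoding back toward the target emotion.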

Loss
The total loss of the model can be expressed as

L = L_2 + α L_1 + β L_3,

where L_2 is the reconstruction loss, L_1 is the KL divergence loss, and L_3 is the adversarial loss. α and β are hyperparameters that control the trade-off between the losses.
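The combined objective is then a simple weighted sum; the default α and β values below are placeholders, since the paper treats them as tuned hyperparameters.

```python
def total_loss(l1_kl, l2_rec, l3_adv, alpha=0.5, beta=0.5):
    """Total objective L = L2 + alpha * L1 + beta * L3.
    alpha and beta here are illustrative defaults, not the paper's values."""
    return l2_rec + alpha * l1_kl + beta * l3_adv

# toy per-term values: reconstruction 2.0, KL 0.4, adversarial 0.6
loss = total_loss(l1_kl=0.4, l2_rec=2.0, l3_adv=0.6)
```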

Datasets
We use two datasets in our experiments: DailyDialog [37] and OpenSubtitles2018 [38]. DailyDialog covers ten topics and seven emotions, totaling 13,118 multi-turn dialogues, with an average of 7.9 turns per conversation and 14.7 tokens per utterance. OpenSubtitles2018 is a dialogue dataset drawn from movie subtitles. We filtered it to conversations with utterances 5-30 words long, each containing at least four utterances; the filtered dataset contains 25,000 utterances.
Since the OpenSubtitles2018 dataset does not contain emotion labels, we need to train a dialogue emotion recognition model on other datasets to label it. IEMOCAP [39] and MELD [40] are two datasets commonly used in conversational emotion recognition tasks. IEMOCAP contains 151 dialogues with a total of 7433 utterances, labeled with six emotion types, of which non-neutral emotions account for 77%. MELD consists of 1433 dialogues and 13,708 utterances, labeled with seven emotion categories, of which 53% are non-neutral. It should be noted that IEMOCAP was performed by professional actors, so emotions are expressed more clearly than in natural dialogue; its advantage is high quality and its limitation is its small size. MELD, on the other hand, comes from the TV series Friends and several movies, so its dialogue is more natural. However, the dialogue in MELD depends heavily on plot background, which makes its emotions difficult to identify.
To balance accuracy and generality, we trained several popular dialogue emotion classification models on MELD and IEMOCAP. Since the emotion categories of MELD and IEMOCAP are not exactly the same, we filtered the raw data to retain the six emotion categories shared by the two datasets. We selected M2FNet [41], the model that achieved the best performance on both datasets, to label OpenSubtitles2018. The relevant classification results are shown in Table 1. Note that M2FNet is a multi-modal dialogue emotion recognition model, but since OpenSubtitles2018 contains only text data, only the text modality is used in our training and annotation.
Baselines

We compare our model with the following baselines. CVAE obtains the posterior distribution of the latent variable z during training on top of the seq2seq framework and uses the prior distribution to fit the posterior during testing, minimizing the reconstruction error. ECM combines implicit internal emotional state changes with explicit external emotional vocabulary to generate responses with specific emotions. EmoDS captures the emotional features of words and sentences to generate responses. Emo-CVAE introduces an emotion-regularized conditional variational autoencoder, which regularizes the latent space of CVAE with an additional emotion recognizer.
For fairness, we implemented the basic modules of all the above models with bidirectional GRUs.

Settings
In the experiments, we used bidirectional GRUs with a hidden size of 256 for the encoder and a GRU with a hidden size of 512 for the decoder. Pre-trained 300-dimensional word embeddings [44] are used for initialization. The dimension of the latent variable z is set to 300. We used the Adam optimizer [45] with a learning rate of 1 × 10⁻⁵, a batch size of 128, and a dropout rate of 0.2. Beam search with a beam size of 5 is used for decoding.

Automatic Evaluation
We use four automatic metrics, namely emotional accuracy (Acc), Dist-1, Dist-2, and perplexity (PPL), to evaluate responses in terms of emotional expression, diversity, and content. Acc evaluates the consistency of the emotion category between the generated response and the ground-truth response: it measures the percentage of responses that are correctly classified into the corresponding emotion categories. Dist-1/2 evaluates diversity as the proportion of distinct unigrams and bigrams in the generated responses; it is an n-gram-level evaluation. PPL stands for perplexity, which is commonly used to evaluate language models and measures how well a model predicts a sequence of words in a given corpus. A lower perplexity indicates that the model predicts the next word more accurately and with less uncertainty, reflecting a better fit to natural language, i.e., generated content that is relevant and syntactically correct. Tables 2 and 3 show the results of the automatic evaluation on DailyDialog and OpenSubtitles2018, respectively. ↑ indicates that larger values are better and ↓ the opposite. The best results are shown in bold.
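The Dist-n metric described above can be computed as below: the number of distinct n-grams divided by the total number of n-grams across all generated responses. The toy responses are illustrative.

```python
def distinct_n(responses, n):
    """Dist-n: distinct n-grams / total n-grams over all generated responses."""
    ngrams = []
    for tokens in responses:
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

responses = [["i", "do", "not", "know"], ["i", "love", "hiking"]]
d1 = distinct_n(responses, 1)  # unigram diversity
d2 = distinct_n(responses, 2)  # bigram diversity
```

Here "i" repeats across the two responses, so Dist-1 is 6/7 while all five bigrams are distinct and Dist-2 is 1.0; repetitive, generic responses drive both scores down.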
The automatic evaluation shows that our model significantly improves the accuracy of emotional expression and the diversity of responses compared with the baseline methods. In terms of perplexity, the seq2seq-based ECM and EmoDS generally performed better than the CVAE-based models. This makes sense, because good diversity usually increases a model's perplexity. Compared with Emo-CVAE, our model achieves better results in emotional accuracy and response diversity and is almost the same in perplexity. To further study our model's performance on emotional expression, the per-category emotional accuracy on the DailyDialog dataset is given in Table 4, and the corresponding confusion matrix of emotion classification is shown in Figure 2. The results demonstrate that our model outperforms the baselines not only on average but also on most individual emotion categories. This indicates that the proposed global emotional label constraint has a positive effect on generating emotional responses. Notably, our model also performs well on emotions that baseline models such as CVAE and ECM find difficult to identify, such as anger and frustration. This partially explains why our model is better at expressing emotions: accurately identifying the emotion of the actual response is essential for accurately expressing the corresponding emotion.
As can be seen from Figure 2, our model tends to produce a relatively neutral response in more cases when the generated response emotion category is wrong (that is, inconsistent with the true response). In addition, our model is better at generating neutral, happy, and surprised responses than negative emotions such as anger and disgust. Our model is relatively less likely to generate responses where the emotion category is fear, possibly because it is the least represented in the original dataset.

Manual Evaluation
In addition to the automatic evaluation above, we designed a manual evaluation to further verify the effectiveness of the proposed method, using pairwise comparisons between our model and each of the four baselines.
The manual evaluation was conducted on the DailyDialog dataset following the methodology of [15]: samples are drawn by non-uniform random sampling according to whether the emotion categories of the responses generated by the baseline model and by our model are correct. The evaluation uses the following notation: TT denotes responses for which both our model and the baseline generated the correct emotion category, TF denotes samples where our model's response has the correct emotional expression but the baseline's does not, and FT and FF are defined analogously. The distribution of response samples is shown in Table 5. For each case in Table 5, 30 samples were randomly selected. We asked three evaluators to select the response that was better in terms of accuracy of emotional expression and variety of content; ties were allowed.
The results of the manual evaluation are shown in Tables 6 and 7. Combining the results from the perspectives of emotional expression and diversity, our model generates more appropriate emotions under the three conditions TT, TF, and FF. In particular, when our model correctly identifies the emotion of the response, the response produced by the decoder not only expresses the emotion better than the baseline models but also holds an advantage in diversity.
We also observe that in the FT case, where our model did not correctly identify the emotion category but the baseline did, the advantage of our generated responses in emotional expression declines significantly. However, given the sample distribution in Table 5, the probability of this case is very small, so it does not cancel out our model's advantage in most cases.

Ablation Study
An ablation experiment was conducted on the DailyDialog dataset to verify the effectiveness of the proposed selective word perturbation module and global emotion control module. Two submodels were designed. The first, without selective adversarial training (w/o SA), adds no disturbance to the decoding process and only uses the global emotion constraint to fit the emotion distributions of the generated and real responses. The second, without the emotion constraint (w/o EC), adds fixed disturbances to non-topic words according to the perturbation word selector's prediction, without using the global emotion constraint. The experimental results are shown in Table 8. They demonstrate that when the decoding process is not disturbed, the model's emotional expression ability remains largely unaffected, but the diversity of the generated responses decreases significantly, approaching that of the ECM baseline. On the other hand, when the perturbation is not constrained, the diversity of the generated responses increases significantly, but their emotional expression and quality decrease markedly, suggesting that unconstrained perturbation is insufficient for generating high-quality responses. The ablation study thus validates the effectiveness of the proposed selective perturbation and global emotional constraint modules in improving the diversity and emotional expression of generated responses.

Conclusions
In this study, we propose a selective perturbation method for emotional response generation that produces content-rich responses with appropriate emotion categories. The model is based on CVAE, and perturbation training is used to improve response diversity. To keep the dialogue topic from drifting under perturbation, the selective perturbation module predicts the type of the current generated word from the decoder state and latent variable, so as to selectively apply disturbance to the decoding process. The global emotion constraint module uses the difference between the emotion distributions of the real response and the currently generated response to constrain the decoding disturbance, ensuring that the generated response is emotionally appropriate. Through the synergy of these two modules, the proposed method achieves good results in emotional expression and response diversity. Experiments on two standard datasets validate that our model outperforms the baselines in generating more diverse responses with accurate emotions.
One potential direction is to explore how this method could be adapted for use with pre-trained models such as GPT-3 or BERT. Further research could also investigate the applicability of this approach to other language generation tasks beyond emotional response generation, such as machine translation or text summarization.