AdMISC: Advanced Multi-Task Learning and Feature-Fusion for Emotional Support Conversation

Abstract: Emotional support conversation is an emerging and challenging task in natural language processing that aims to alleviate people’s emotional distress. Each utterance in the dialogue has features such as emotion, intent, and commonsense knowledge. Previous research has shown subpar strategy prediction accuracy and response generation quality because certain underlying factors were overlooked. To address these issues, we propose Advanced Multi-Task Learning and Feature-Fusion for Emotional Support Conversation (AdMISC), which extracts various potential factors influencing dialogue through neural networks, thereby improving the accuracy of strategy prediction and the quality of generated responses. Specifically, we extract features affecting dialogue through dynamic emotion extraction and commonsense enhancement and then model strategy prediction. Additionally, the model learns these features through attention networks to generate higher-quality responses. Furthermore, we introduce a method for automatically averaging loss function weights to improve the model’s performance. Experimental results on the emotional support conversation dataset ESConv demonstrate that our proposed model outperforms baseline methods in both strategy label prediction accuracy and a range of automatic and human evaluation metrics.


Introduction
In the ever-evolving landscape of society, individuals are encountering increasing mental stress in their daily lives. Research shows that over 50% of adults have grappled with mental illnesses or disorders at some point, yet only approximately 20% of these individuals have sought or received relevant treatment. Recent studies have highlighted the growing significance of emotional support conversation (ESC) as a form of mental health therapy, and it has garnered considerable attention [1]. More and more researchers are integrating emotional support conversation with dialogue systems as a novel, intelligent mental-health therapy approach, and it has been applied in fields such as intelligent customer service and intelligent psychological counseling, for example Woebot [2]. This emerging field paves the way for innovative developments in dialogue systems and offers a promising avenue for addressing mental health challenges.
As shown in Figure 1, an emotional support conversation takes place over multiple dialogue rounds between the seeker and the supporter. It requires the supporter to employ a specific support strategy and respond empathetically to alleviate the seeker's distress. Existing research mainly focuses on two aspects. The first is the accurate prediction of dialogue strategies so that responses can be tailored accordingly; for instance, Tu et al. [3] utilized a mixed strategy prediction method. The second is the enhancement of the model's comprehension of the dialogue context, such as the work of Peng et al. [4], who designed a hierarchical graph network to capture user intent. Despite these achievements, the task still faces the following challenges:

1.
As the conversation progresses, users' emotions subtly evolve. Accurately identifying these emotional changes is essential for the model to predict strategy labels and provide empathetic responses [5].

2.
Dialogue strategy as a linguistic pattern is a highly complex concept encompassing many language features [6]. Previous studies have modeled it using a single vector (i.e., category labels), which is insufficient for fully representing the complexity of strategy information. Integrating the contextual information that influences strategies has become a challenge.

3.
Existing emotional support dialogue models tend to generate generalized responses [7], which fail to provide effective emotional support. To address this issue, introducing more contextually relevant concepts can help the model generate more meaningful suggestions tailored to specific situations. How to explore these relevant concepts and integrate them is a crucial task.

4.
In multi-task joint training, the model's performance heavily relies on the weights assigned to each task's loss function [8], which makes manual weight adjustment difficult.
In response to the issues above, we propose a method of multi-task learning and feature fusion for emotional support conversations, termed AdMISC. This method is based on the pre-trained transformer neural network model [9] and addresses the identified problems. The main contributions of our work are as follows:

1.
Addressing the oversight of dynamic emotional changes in existing models, AdMISC incorporates an Emotion Detector module to detect these changes. These dynamic emotional characteristics then guide strategy prediction learning.

2.
To address the limitations of single-vector modeling in strategy prediction, we propose a mixed strategy approach, which utilizes neural networks to enhance dialogue history and problem descriptions with commonsense reasoning. Additionally, it integrates commonsense-enhanced information and dynamic emotional information to jointly model strategy prediction.

3.
To alleviate the generalization tendency observed in the generated text of existing models during the emotional support generation stage, we propose a feature fusion approach. This method leverages multi-head attention and cross-attention mechanisms to focus on the original dialogue history, commonsense-enhanced dialogue history, commonsense-enhanced problem descriptions, dynamic emotional information, and strategy selection information in the feedforward network. These context-related concepts can guide the model in generating more targeted and suggestive responses.

4.
We propose a dynamic multi-task loss function weight balancing method to address the challenge of manually adjusting task weights in multi-task joint training. This method balances the impact of multiple loss functions on model training.
The experimental results demonstrate that the AdMISC model outperforms other baseline models in both automatic and human evaluation metrics on the ESConv dataset, confirming the feasibility and effectiveness of our approach.

Conversation Strategy
In emotional support dialogue systems, the selection of dialogue strategies plays a pivotal role in shaping the seeker experience, as distinct strategies yield varied response generation outcomes [10]. Existing emotional support dialogue systems commonly utilize deep learning for strategy selection. For instance, Tu et al. [3] proposed a mixed strategy learning method grounded in deep learning principles, whereas Peng et al. [11] incorporated seeker emotional feedback information for dialogue strategy selection. Xu et al. [12] employed a prior knowledge method in predicting dialogue strategy labels, and Cheng et al. [13] considered forward-looking heuristic strategy planning and selection.
Despite these achievements, existing work still faces the challenges of intricate strategy modeling and the variability of emotions. Furthermore, Zeng et al. [14] noted that strategy selection is intricately linked to context. Integrating the implicit information present in the conversational context therefore becomes imperative when classifying strategies.

Emotional Response Generation
The emotional response generation module within the emotional support dialogue system produces responses that carry emotional support, align with the selected dialogue strategy, and are delivered back to the seeker. Recent studies have suggested that augmenting the generation process with additional information can enhance the overall performance of emotional response generation. For instance, Zhong et al. [15] leveraged ConceptNet [16] to enhance response generation and emotional states. Tu et al. [3] captured the seeker's mental state by incorporating the generative commonsense model COMET [17], which interacts with various factors to generate emotional responses. Deng et al. [18] enhanced the system with knowledge from the field of mental health. Other studies focus on acquiring the seeker's situation, emotions, and intentions. For example, Xu et al. [12] explored contextual semantic relations and emotional states, while Zhao et al. [19] considered the transitions of semantics, strategies, and emotions in the model.
Although these improvements allow models to generate fluent text and significantly reduce logical errors, challenges remain in the emotional support dialogue task: replies tend to be general, lack pertinence, and offer few concrete suggestions for the seeker's questions, which makes it difficult to achieve the purpose of emotional support. In this regard, Wang et al. [20] emphasized that emotional dialogue replies should prioritize improving references in context and strengthening the connections among themes, emotions, and knowledge. This approach aims to generate replies that align more closely with thematic logic, offer accurate references, and convey rich emotion. Additionally, Wang et al. [21] proposed that the language model should iteratively infer the psychological and emotional state of the interlocutor from the dialogue history as a chain of thought, thereby enhancing the quality of responses.

Influencing Features in ESC
In research on emotional support conversations, current work largely relies on seekers' emotional labels and contextual cues to guide models in perceiving emotional information within dialogues. However, due to the multifaceted nature of human emotion perception and expression, models trained solely on emotional labels and contextual cues may overlook other underlying influential factors. According to Yang et al. [22], a range of potential dialogue information could impact the effectiveness of models in learning strategy selection and generating emotionally supportive responses. These factors include conversation-level emotions, sentence-level emotions, seeker intentions in inquiries, dialogue history, and the human common sense involved in the conversation. Additionally, psychological studies by Hill et al. [23] indicated that emotional support conversations involve a complex interactive process requiring consideration of various information, such as seeker emotions, intentions, emotional fluctuations, and commonsense content, which can be mined from dialogue contexts.

Commonsense Knowledge Generation Model COMET
To enhance the model's understanding of emotional support conversations using additional knowledge, past approaches have typically utilized pre-constructed commonsense knowledge bases or semantic networks, applying known relationships from these knowledge bases to entities within the dialogue. However, Bosselut et al. [17] argued that commonsense knowledge does not entirely suit the pattern of combining two entities with known relationships, and instead proposed automatic knowledge base construction to generate commonsense knowledge. They therefore introduced a commonsense knowledge generation model called COMET, based on a large-scale pre-trained transformer neural network. This model can adaptively generate knowledge representations: given a head entity s and a relation r, it generates a tail entity o, forming high-quality commonsense semantic relation triples {s, r, o}. These relations are derived from the sets of relations defined in ConceptNet. Once trained, the COMET model can generate reasonable, rich, and novel commonsense semantic triples, even when faced with commonsense events unseen by the model.

Task Definition
In training the emotional support dialogue model, the training dataset is composed of $M$ samples. Each sample includes $S_i$, the seeker's situation; $C_i$, the dialogue context; $h_i$, the support strategy; and $R_i$, the target response. $C_i$ contains the history utterances between the seeker and the supporter. In the token sequences of $C_i$ and $R_i$, CLS is the start token and also describes the state, and EOS is the separation token between two utterances; $C_i$ and $R_i$ include $N$ tokens. The goal of the ESC task is to build a model $F$ that generates the expected supportive response $r_i^g$ from $C_i$ and $S_i$, that is, $r_i^g = F(C_i, S_i; \Theta)$, where $\Theta$ is the set of learned parameters of $F$.
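The sample structure described above can be sketched as a small data type. This is only an illustration under stated assumptions: the class name `ESCSample`, the helper `flatten_context`, and the literal token strings `[CLS]` and `[EOS]` are hypothetical, and the serialization order is one plausible reading of the definition rather than the exact format used by AdMISC.

```python
from dataclasses import dataclass
from typing import List

# Placeholder token strings; the actual vocabulary entries are an assumption.
CLS, EOS = "[CLS]", "[EOS]"


@dataclass
class ESCSample:
    situation: str       # S_i: the seeker's situation
    context: List[str]   # C_i: history utterances between seeker and supporter
    strategy: str        # h_i: support strategy label
    response: str        # R_i: target supporter response


def flatten_context(utterances: List[str]) -> str:
    """Serialize utterances as: CLS u_1 EOS u_2 EOS ... u_n."""
    return f"{CLS} " + f" {EOS} ".join(utterances)


sample = ESCSample(
    situation="I failed my exam.",
    context=["I feel terrible.", "I am sorry to hear that."],
    strategy="Reflection of feelings",
    response="It sounds like this exam meant a lot to you.",
)
encoder_input = flatten_context(sample.context)
```

Keeping the situation separate from the context mirrors the task definition, where the model conditions on both $C_i$ and $S_i$.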

Method
The comprehensive architecture of AdMISC is depicted in Figure 2. The process begins with obtaining information about the conversation through the encoder. At this stage, each sentence undergoes emotion recognition via the Emotion Detector, and the natural language labels corresponding to each emotion are combined to capture dynamic emotional changes. Simultaneously, COMET processes the seeker's situation and the seeker's last reply for commonsense enhancement. This additional information and the dialogue context serve as inputs to the encoder.
Subsequently, within the Mixed-Strategy learning module, fine-grained dynamic emotional information, commonsense-enhanced historical conversations, and commonsense-enhanced conversation descriptions are integrated to model strategy prediction. Finally, the information obtained above is focused through a multi-layer attention network and injected into the emotional expression generation module and the decoder.

Emotion Detector
Mental health research underscores the significance of empathy in emotional support [24] and emphasizes that an essential aspect of enhancing empathy capabilities involves providing fine-grained emotional information [25]. Consequently, in training emotional support dialogue systems, it is highly advantageous for the model to gain a coherent understanding of the seeker's emotional state by capturing dynamic and fine-grained emotional changes, as opposed to relying solely on static emotional signals. To address this, we proposed the Emotion Detector module to discern the dynamic changes in the seeker's fine-grained emotions throughout the conversation.
Specifically, we utilized a BERT-based pre-trained emotion detection model, EmoRoBERTa [26], capable of discerning the emotion categories present in the input text. The model's output comprises 28 distinct emotions; we merged these so that they correspond to the 7 emotions in the dataset. Emotion recognition was executed by inputting each utterance of the current round's dialogue context into EmoRoBERTa. The emotion category word predicted by the model signifies the emotion detected in that utterance, and these emotion category words are subsequently input into the encoder in their natural language form. This design avoids introducing additional parameters that could disrupt model learning. With this method, the emotional support dialogue model can effectively extract the dynamic emotional changes over the course of the dialogue.
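The per-utterance labeling and label merging described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the mapping `FINE_TO_COARSE` is a hypothetical and deliberately partial stand-in for the 28-to-7 merge (the actual mapping is not given here), the `fallback` label is an assumption, and `toy_classifier` is a keyword stub that stands in for a real call to EmoRoBERTa.

```python
from typing import Callable, Dict, List

# Hypothetical, partial mapping from fine-grained labels to dataset-level emotions.
FINE_TO_COARSE: Dict[str, str] = {
    "nervousness": "anxiety",
    "fear": "fear",
    "grief": "sadness",
    "sadness": "sadness",
    "annoyance": "anger",
    "anger": "anger",
    "disgust": "disgust",
    "embarrassment": "shame",
}


def emotion_trajectory(
    utterances: List[str],
    classify: Callable[[str], str],
    fallback: str = "sadness",  # assumed default for unmapped labels
) -> List[str]:
    """Label every utterance, then map each label to a coarse emotion."""
    return [FINE_TO_COARSE.get(classify(u), fallback) for u in utterances]


def emotions_as_text(labels: List[str]) -> str:
    """Join emotion words so they can be fed to the encoder as plain text."""
    return " ".join(labels)


def toy_classifier(text: str) -> str:
    """Keyword stub standing in for the pre-trained emotion classifier."""
    lowered = text.lower()
    if "scared" in lowered:
        return "fear"
    if "annoy" in lowered:
        return "annoyance"
    return "sadness"


labels = emotion_trajectory(["I am scared of failing.", "It annoys me."], toy_classifier)
encoder_text = emotions_as_text(labels)
```

Because the emotions enter the encoder as ordinary words, no additional embedding table or classifier head is needed, which matches the stated motivation of avoiding extra parameters.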

Commonsense Enhance
We utilized COMET to generate commonsense knowledge for $S_i$ and $u_N^i$, where $S_i^e$ and $u_N^{i,e}$ denote the parts of the generated commonsense knowledge that carry emotional factors, and $S_i^g$ and $u_N^{i,g}$ denote the remaining parts. Afterwards, we input them together with $C_i$ into the attention networks S-C Attn and u-C Attn, allowing the model to learn the commonsense knowledge. After obtaining the representation enhanced with commonsense knowledge, we input the CLS representation into a multi-layer perceptron (MLP) to obtain a probability distribution $p_g$ over strategies. Multiplying $p_g$ with the strategy labels $T$ from the dataset yields the initial strategy selection $h_g = p_g \cdot T$. Through these steps, the model obtains commonsense knowledge about the dialogue via COMET together with the initial prediction of strategy labels.
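The last step, turning a strategy distribution into a mixed strategy representation, can be illustrated with a short sketch. It assumes that $T$ is a matrix with one embedding row per strategy and that $p_g$ comes from a softmax over MLP logits; both are assumptions made for illustration, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_strategy(
    logits: List[float], codebook: List[List[float]]
) -> Tuple[List[float], List[float]]:
    """Return (p_g, h_g), where h_g = p_g . T is a probability-weighted mix of rows of T."""
    p_g = softmax(logits)
    dim = len(codebook[0])
    h_g = [sum(p * row[d] for p, row in zip(p_g, codebook)) for d in range(dim)]
    return p_g, h_g


# Two strategies with 2-dimensional embeddings (toy values).
T = [[1.0, 0.0], [0.0, 1.0]]
p_g, h_g = mixed_strategy([0.0, 0.0], T)
```

Using the full distribution rather than the single most likely strategy lets the representation blend several strategies when the model is uncertain.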

Feature-Fusion Learning to Predict Strategy Labels
Existing emotional support dialogue models commonly employ a single-vector (i.e., seeker's emotional state) modeling method during strategy selection learning. However, dialogue strategy learning is a multifaceted process encompassing various language features [7], and a single-vector model is insufficient for adequately representing intricate strategy pattern information. We proposed the Mixed-Strategy module to capture information in the dialogue and effectively model strategy selection.
This module integrated an initial strategy representation with the dialogue history, the commonsense-enhanced seeker's last reply, the emotion cause composed of commonsense-enhanced dialogue descriptions, and the dynamic sentence-level emotional state, thus modeling the strategy prediction learning process by combining the above information. The network structure of the module is depicted in Figure 3.
The dialogue history is embedded by the transformer encoder to obtain the corresponding representation. Similarly, by linearly combining the seeker's situation in each round of dialogue with the sentence-level dynamic emotional states obtained through the Emotion Detector, we obtain the Emotion Cause sequence $U_t$. These results are concatenated into a long vector in the Concat layer and fed into a linear layer followed by the ReLU activation function to obtain $\mu$, where $W_\mu$ and $b_\mu$ are trainable parameters. A weighted operation is then performed on $\mu$ and the comprehensive strategy representation, yielding $\hat{h}_g$. In the feed-forward network with residual connections in the input sublayer, the generated hidden state is used as input, and the output $\breve{h}_g$ is obtained through the SoftMax activation function, where $\sigma$ denotes the hidden-layer calculation with trainable parameters. The obtained information is multiplied with the comprehensive strategy representation $\breve{h}_g$ to update the strategy $h_g$, with $\beta$ as a hyperparameter. After processing through these network layers, the predicted strategy labels are derived. To train the improved comprehensive strategy, the negative log-likelihood of the ground-truth strategy label is used as the loss function. In summary, this module uses the encoder structure of the transformer network to encode the dialogue history and emotion cause information in the dialogue text and combines it with the initial strategy label to obtain a new strategy label.

Fusion of Dialogue Features to Generate Responses
To mitigate the generalization tendency of emotionally supportive response texts generated by the model, we proposed a feature-fusion-based response generation method. Specifically, we utilized commonsense-enhanced seeker descriptions, the seeker's last reply, dynamic emotional states obtained from the Emotion Detector, strategy labels obtained through the Mixed-Strategy network, and the dialogue history to guide the model's response generation via attention mechanisms. We therefore proposed the emotional response generation module, Response-Generate, whose network structure is shown in Figure 4.


[Figure 4. Network structure of the Response-Generate module. Panel labels: Cross Attention Decoder; Dialogue History and Seeker's Last Reply.]

We input $H_t$ and $h_g$, and $U_t$ and $h_g$, into the multi-head attention layer for attention enhancement, yielding $\breve{H}_t$ and $\breve{U}_t$. We then combined $\breve{H}_t$ and $\breve{U}_t$ with the hidden state $O$ of the decoder through the cross-attention network.

Electronics 2024, 13, 1484

Similarly, we enhanced $O$ with attention to improve the model's learning of commonsense knowledge during response generation. The updated strategy label $h_g$ was input along with $O$ into the cross-attention network, and the new information obtained was combined with the other information of the model. Finally, we combined all the information enhanced through attention mechanisms with $O$, resulting in a fused representation $O'$ that incorporates the various kinds of dialogue information. This fused representation guided the model in generating the final response.
Similarly, the negative log-likelihood of the ground-truth target reply was used as the loss function for training the final reply, where $n_r$ is the length of the reply. This module adopted the decoder structure, the multi-head attention mechanism, and the cross-attention mechanism, integrating crucial dialogue information to enhance the quality of the response generated by the model.
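As a minimal sketch of the cross-attention step used in this module, the code below implements generic single-head scaled dot-product attention in which decoder states attend over an encoded source. It assumes no learned projection matrices and a single head, so it illustrates the mechanism only and is not the AdMISC layer itself; the names `cross_attention`, `decoder_states`, and `source_states` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    decoder_states: List[Vector], source_states: List[Vector]
) -> List[Vector]:
    """Each decoder state (query) attends over the source states (keys and values)."""
    d = len(source_states[0])
    attended = []
    for q in decoder_states:
        # Scaled dot-product score between the query and every source position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in source_states]
        weights = softmax(scores)
        # Weighted sum of source states.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, source_states)) for j in range(d)]
        )
    return attended


O = [[1.0, 0.0]]              # one decoder hidden state (toy values)
H = [[2.0, 3.0], [2.0, 3.0]]  # two identical source states
O_attended = cross_attention(O, H)
```

In a full model, the same operation would be repeated for each feature source (dialogue history, emotion cause, commonsense knowledge, strategy), and the results would be combined with $O$ to form the fused representation $O'$.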

Multi-Task Joint Training Loss Function
In existing multi-task jointly trained neural-network emotional support dialogue system models, there are primarily two approaches to handling the loss function: 1.
Determine the relationship between each loss function during the initial training by manually tuning the weights, which remain constant throughout the training process.

2.
Observe changes in various indicators of the loss function during the network training process and manually adjust the weights accordingly.
However, the performance of multi-task joint learning models is highly dependent on these weights, making the process of finding optimal weights through manual adjustment complex and challenging. Building on the proposition of Peng et al. [27] that the loss functions in multi-task joint training should be assigned specific weights, this paper introduced a hyperparameter for the loss function of each of the two tasks, assigning them distinct weights. The overall loss $L$ is the weighted sum of the two task losses, where $\alpha_1$ and $\alpha_2$ are the weights of the two loss functions, and minimizing $L$ is the optimization goal of the entire model.
To dynamically adjust the weight values, inspired by the research of Liu et al. [28], this paper introduced a dynamic weight averaging method, which learns the weights by considering each task's rate of loss change. The methodology can be expressed as follows:

$$\alpha_k(t) = \frac{K \exp\left(w_k(t-1)/T\right)}{\sum_i \exp\left(w_i(t-1)/T\right)} \quad (32)$$

where $w_k(t-1) = L_k(t-1)/L_k(t-2)$ is the relative decline rate; $t$ is the iteration index; $K$ is the number of tasks; and $T$ is a temperature that controls the softness of the task weighting, with a larger $T$ making the distribution across tasks more even. $L_k(t)$ is the average loss of each epoch in the iteration step. At the start of training, $w_k(t)$ is initialized to 1 based on experience. After obtaining the final weights, we evaluate them on the validation set to select the optimal weights.
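The dynamic weight averaging rule can be sketched in a few lines. The function below follows the formulation of Liu et al. [28]: each task's weight is a temperature-scaled softmax over the ratio of its two most recent average losses, rescaled so that the weights sum to the number of tasks. The function name `dwa_weights` and the default temperature of 2.0 are assumptions made for illustration, not settings reported for AdMISC.

```python
import math
from typing import List, Sequence


def dwa_weights(
    prev_losses: Sequence[float],
    prev_prev_losses: Sequence[float],
    temperature: float = 2.0,  # assumed default, not taken from the source
) -> List[float]:
    """Compute task weights from the losses at steps t-1 and t-2."""
    num_tasks = len(prev_losses)
    # Relative decline rate w_k(t-1) = L_k(t-1) / L_k(t-2).
    rates = [p / pp for p, pp in zip(prev_losses, prev_prev_losses)]
    scaled = [r / temperature for r in rates]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [num_tasks * e / total for e in exps]


# Task 0 (e.g., strategy prediction) improves slowly; task 1 improves quickly.
weights = dwa_weights(prev_losses=[0.9, 0.5], prev_prev_losses=[1.0, 1.0])
total_loss = weights[0] * 0.9 + weights[1] * 0.5
```

A task whose loss is falling slowly has a ratio close to 1 and therefore receives a larger weight, while a rapidly improving task is down-weighted. During the first two epochs, when no loss history exists, the ratios can be set to 1 so that all tasks are weighted equally.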

Experimental Setup
Platform Settings. We employed the proposed AdMISC model to conduct experiments on the emotional support dialogue dataset ESConv and performed ablation experiments on the above modules. The details of the experimental environment are outlined in Table 1.
Dataset and processing. We conducted experiments on the ESConv [29] dataset, a high-quality English dataset for emotional support conversation tasks. Its builders recruited multiple crowd-workers who understood emotional support conversation procedures and strategies to talk with volunteers in need of emotional support through an online platform. The strategy adopted in each round of conversation was annotated, and seekers provided feedback on their emotional status after every two conversation rounds, indicating reduced emotional distress. Each sample in the dataset is a conversation between a seeker and a supporter, and each conversation contains additional information, such as a description of the problem faced by the seeker and annotations of the strategy categories in the supporter's responses. The conversation unfolded in three stages (inquiry, reassurance, and advice), followed by an assessment of the intensity of the seeker's current emotions. The dataset contains 1300 long conversations, with an average of 29.5 utterances per conversation. The conversations cover 5 topics, 7 emotions, and 8 support strategies. To facilitate a more effective comparison with the baseline models, we performed similar preprocessing on ESConv: the conversation samples were truncated every 10 rounds and randomly divided into training, validation, and test sets in a ratio of 8:1:1. The detailed statistics are shown in Table 2.
Implementation Details. We implemented the model based on BlenderBot-small [30], utilized AdamW as the optimizer, tuned the parameters, and incorporated the Dropout mechanism to mitigate overfitting. The detailed experimental parameter settings are presented in Table 3.
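The preprocessing described above can be sketched as follows. This is an illustrative reading, not the authors' script: it assumes that "truncated every 10 rounds" means cutting each conversation into consecutive windows of at most 10 utterances, and the function names and the fixed random seed are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def truncate(dialogue: Sequence[T], size: int = 10) -> List[Sequence[T]]:
    """Cut one conversation into consecutive windows of at most `size` utterances."""
    return [dialogue[i:i + size] for i in range(0, len(dialogue), size)]


def split_8_1_1(samples: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle the samples and split them into train/validation/test at 8:1:1."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_valid = len(items) // 10
    return (
        items[:n_train],
        items[n_train:n_train + n_valid],
        items[n_train + n_valid:],
    )


windows = truncate(list(range(25)))
train, valid, test = split_8_1_1(list(range(100)))
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing against baselines trained on the same partition.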

Evaluation Metrics
For a comprehensive evaluation, we conducted both automatic and human evaluations.
Automatic Evaluation. We employed a set of automatic evaluation metrics to assess the performance of the proposed model and the baseline models. These included:

1. Strategy Prediction Accuracy (ACC): This metric evaluates the model's accuracy in strategy prediction. For the same dataset, a higher ACC indicates more accurate predictions by the model.
2. Perplexity (PPL): Perplexity measures how well the model predicts the sequence of words. A lower perplexity score indicates better performance.
3. BLEU-2 (B-2) and BLEU-4 (B-4): These scores represent the similarity between the response generated by the model and the ground truth. Higher BLEU scores indicate better alignment with the real answer.
4. ROUGE-L (R-L): This metric evaluates the overlap of words and sequences between the generated response and the ground truth. A higher ROUGE-L score signifies a closer resemblance to the real answer.
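To make the automatic metrics concrete, the sketch below computes simplified versions of them in plain Python. It is illustrative only and is not the evaluation code used for Table 4: `ngram_precision` is just the clipped n-gram precision that BLEU-n builds on (no brevity penalty, no averaging over n), `rouge_l_f1` is an unweighted F1 over the longest common subsequence, perplexity is taken as the exponential of the mean negative log-likelihood, and all function names are hypothetical.

```python
import math
from collections import Counter


def strategy_accuracy(predicted, gold):
    """Fraction of turns whose predicted strategy equals the gold strategy."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision, the building block of BLEU-n."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Unweighted F1 of LCS-based precision and recall (a ROUGE-L variant)."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(strategy_accuracy(["Question", "Others"], ["Question", "Information"]))
    print(perplexity([math.log(0.5)] * 4))
    print(ngram_precision("the cat sat".split(), "the cat sat down".split(), 2))
    print(rouge_l_f1("a b c d".split(), "a c d e".split()))
```

In practice an established implementation of BLEU and ROUGE would normally be used so that scores remain comparable with those reported by the baselines.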
Human Evaluation. To comprehensively evaluate the improvement brought by the AdMISC emotional support dialogue system, we also conducted a human evaluation with real participants. Five participants assumed the role of the seeker and interacted with the AdMISC, FADO, and MISC models. For each comparison, participants judged which of the two models performed better in specific scenarios, and a judgment was counted only when at least half of the participants agreed. An additional reviewer randomly sampled and reviewed 10% of the assessment results to ensure assessment quality. The assessment covered the following aspects:

1. Fluency: determining which model can generate more fluent and coherent responses.
2. Accuracy: assessing which model better identifies the seeker's problem.
3. Empathy: evaluating which model better understands the seeker's feelings and situation.
4. Suggestion: analyzing which model provides more effective suggestions.
5. Overall: considering which model provides a more effective emotional support effect.

Baselines
We briefly introduce the baseline models that are compared with the proposed AdMISC model. All parameters use the default settings.

1. Transformer [9]: a common Seq2Seq model trained with the MLE loss function;
2. MT Transformer [31]: takes sentiment prediction as an additional learning task and uses the sentiment labels provided in ESConv to learn sentiment prediction;
3. MoEL [32]: combines output states from multiple decoders to enhance empathetic reply generation for different emotions;
4. BlenderBot-Joint [30]: prepends a special strategy token before generating an emotional support reply;
6. MISC [3]: an emotional support dialogue model based on ESConv that predicts emotional labels and generates emotionally supportive replies through a hybrid strategy learning module;
7. GLGH [4]: establishes a global-to-local hierarchical graph structure to generate supportive emotional responses through the seeker's global emotional states and local intentions;
8. FADO [11]: designs a two-level feedback strategy selector to punish or encourage the strategy during the strategy selection process.

Model Comparison and Analysis
We conducted comparative experiments between AdMISC and baseline models from automatic and human evaluation perspectives.
Automatic Evaluation. The experimental results comparing the AdMISC model with the above baseline models are presented in Table 4. The results indicate the following:

1. AdMISC outperforms the baselines in most metrics, which strongly supports the effectiveness of the proposed method.
2. Models that come after BlenderBot-Joint combine the dialogue history with static emotion labels to guide strategy prediction, resulting in improved performance on ACC compared to previous models. AdMISC, in addition to the information above, incorporates additional dialogue information, further enhancing the ACC metric. This demonstrates that our approach of mining additional information and integrating it into the model is effective.
3. In the remaining metrics related to dialogue diversity and fluency, AdMISC also excels, indicating that response generation benefits from the strategy and from other information in the dialogue, such as commonsense knowledge and emotion, when generating supportive responses.

Human Evaluation. The evaluation results, expressed as the percentage of participants choosing a certain model out of the total, are presented in Table 5. We report the comparison results between our model and the two baselines (i.e., FADO and MISC). In particular, for each pair of model comparisons and each metric, we show the number of samples where our model achieves a better (denoted as "Win"), equal (denoted as "Tie"), and worse performance (denoted as "Lose") compared with the baselines. As seen, AdMISC outperforms both baselines across the different evaluation metrics, as the number of "Win" cases is always significantly larger than that of "Lose" cases in each pair of model comparisons, which is consistent with the results in Table 4. In addition, the number of "Win" cases is the largest for the Suggestion metric compared with other metrics, which demonstrates that integrating all the methods we proposed can supply meaningful information for emotional support.
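The Win/Tie/Lose bookkeeping can be illustrated with a short sketch. It reflects one possible reading of the protocol, not the exact procedure behind Table 5: each sample is assumed to receive one vote per participant, a sample is counted only if its most frequent vote is shared by at least half of the voters, and the function names `agreed_label` and `tally` are hypothetical.

```python
from collections import Counter

LABELS = ("win", "tie", "lose")


def agreed_label(votes, min_share=0.5):
    """Return the most frequent vote if at least `min_share` of voters chose it."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_share * len(votes) else None


def tally(judgments):
    """Percentage of Win/Tie/Lose among the judgments that were counted."""
    counted = [j for j in judgments if j is not None]
    counts = Counter(counted)
    return {label: 100.0 * counts[label] / len(counted) for label in LABELS}


if __name__ == "__main__":
    # Votes of five participants on four samples for one metric (made-up data).
    votes_per_sample = [
        ["win", "win", "win", "tie", "lose"],
        ["win", "win", "win", "win", "tie"],
        ["tie", "tie", "tie", "win", "lose"],
        ["win", "win", "tie", "tie", "lose"],  # no label reaches half: dropped
    ]
    judgments = [agreed_label(v) for v in votes_per_sample]
    print(tally(judgments))
```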

Ablation Study
We compared the full AdMISC model with the following derived variants and examined the contribution of each designed module by comparing the changes in ACC, D-1, B-2, R-L, Precision, and Recall.
w/o u. To show the benefit of the Mixed-Strategy learning module, we removed the corresponding loss function by setting α1 = 0 in Equation (31).
w/o a. To show the effect of the Response-Generate module, we removed the corresponding loss function by setting α2 = 0 in Equation (31).
w/o f. To show the enhancement brought by the multi-task joint learning loss function, we set α1 and α2 to 1 and kept them unchanged during the iterative training process.
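The three ablation settings can be summarized in a small sketch. It assumes, for illustration only, that Equation (31) combines the two task losses as a weighted sum with α1 on the Mixed-Strategy loss and α2 on the Response-Generate loss; the function names `joint_loss` and `ablation_weights` are hypothetical, and how the weights are adjusted automatically in the full model is not reproduced here.

```python
def joint_loss(strategy_loss, generation_loss, alpha1, alpha2):
    """Weighted sum of the two task losses (assumed form of Equation (31))."""
    return alpha1 * strategy_loss + alpha2 * generation_loss


def ablation_weights(variant, alpha1, alpha2):
    """Map an ablation variant to the loss weights that are actually applied.

    `alpha1` and `alpha2` are the weights the full model would use at this step.
    """
    if variant == "w/o u":   # drop the Mixed-Strategy loss
        return 0.0, alpha2
    if variant == "w/o a":   # drop the Response-Generate loss
        return alpha1, 0.0
    if variant == "w/o f":   # fixed weights, no automatic adjustment
        return 1.0, 1.0
    return alpha1, alpha2    # full model


if __name__ == "__main__":
    for variant in ("full", "w/o u", "w/o a", "w/o f"):
        a1, a2 = ablation_weights(variant, alpha1=0.25, alpha2=0.5)
        print(variant, joint_loss(2.0, 3.0, a1, a2))
```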
We provide the ablation study results on the ESConv dataset in Table 6. From this table, we make observations as follows:

Case Study
We present a conversation from the test set to give an intuitive comparison between our model and some baselines; the results are shown in Table 7. Various problems appear in the responses of the compared models, such as inconsistency, repetition, and contradiction. The FADO model does not make obvious narrative errors, but it generates interrogative sentences, which are inappropriate for the emotional support task. By contrast, AdMISC intuitively produces the best response.

Top-k Prediction and Label Prediction Results
We further demonstrated the effectiveness of our method by comparing Top-k prediction accuracy and visually comparing the distribution of strategy labels.
Top-k prediction. As shown in Figure 5, a comparison between AdMISC and the FADO model was conducted on Top-k prediction accuracy. AdMISC consistently outperforms the FADO model across all Top-k accuracy metrics, indicating more accurate label classification.
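For reference, Top-k accuracy counts a prediction as correct when the gold strategy is among the k highest-scoring labels. The sketch below is a generic illustration of that definition, not the code behind Figure 5; the function name `top_k_accuracy` and the toy scores are hypothetical.

```python
def top_k_accuracy(scores, gold, k):
    """Share of samples whose gold label index is among the k top-scoring labels.

    `scores` holds one list of per-label scores per sample; `gold` holds the
    index of the correct label for each sample.
    """
    hits = 0
    for row, target in zip(scores, gold):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += target in ranked[:k]
    return hits / len(gold)


if __name__ == "__main__":
    # Three samples scored over three strategy labels (made-up numbers).
    scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
    gold = [1, 1, 0]
    for k in (1, 2, 3):
        print(k, top_k_accuracy(scores, gold, k))
```

Top-1 accuracy coincides with the ACC metric, and the value can only stay the same or increase as k grows.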

Strategy label distribution. To further evaluate the performance of the AdMISC model in strategy label prediction, we compared the strategy labels predicted by FADO and AdMISC on the same samples with the ground-truth labels, as shown in Figure 6.
The analysis revealed that FADO tended to over-predict the "Question" label, producing significantly more "Question" predictions than appear in the ground truth, and that it correctly classified only a handful of samples as "Self-disclosure". FADO did not correctly classify any samples for the four labels "Reflection of feelings", "Restatement or Paraphrasing", "Information", and "Others". In contrast, the label prediction distribution of the AdMISC model was more reasonable, with more labels correctly predicted, highlighting the positive role of the mixed strategy learning module in predicting strategy labels.
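The kind of per-label comparison discussed above can be sketched as follows. This is an illustrative helper rather than the script used for Figure 6; the function name `label_distribution` and the toy labels are hypothetical.

```python
from collections import Counter


def label_distribution(predicted, gold):
    """Per-label counts of predictions, gold labels, and correct predictions."""
    pred_counts = Counter(predicted)
    gold_counts = Counter(gold)
    correct = Counter(p for p, g in zip(predicted, gold) if p == g)
    labels = sorted(set(pred_counts) | set(gold_counts))
    return {
        label: {
            "predicted": pred_counts[label],
            "gold": gold_counts[label],
            "correct": correct[label],
        }
        for label in labels
    }


if __name__ == "__main__":
    gold = ["Question", "Information", "Others", "Self-disclosure"]
    predicted = ["Question", "Question", "Question", "Self-disclosure"]
    for label, row in label_distribution(predicted, gold).items():
        print(label, row)
```

A label whose "predicted" count far exceeds its "gold" count, as in the toy data for "Question", corresponds to the over-prediction pattern described for FADO.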

Conclusions
In this paper, we propose a multi-task joint learning method for emotionally supportive dialogue models. This method extracts dynamic emotional change information at the sentence level and combines commonsense-enhanced historical dialogue information with the seeker's situation description to guide strategy selection. Additionally, the method introduces multi-head attention and cross-attention layers to enhance the dialogue with the features extracted in feedforward networks, improving the quality of response generation. A series of experiments demonstrated the feasibility and effectiveness of this method. Furthermore, our approach may provide information for other downstream tasks in dialogue systems. For example, in open-domain dialogue systems or recommendation systems, strengthening the connection between contextually relevant information and target responses may allow the model to generate better quality responses. In future work, we will continue to explore other dialogue features that affect emotional support effects and implement our method using lighter-weight networks.

Limitations
Although our method improves over existing baseline models, we believe that many issues remain to be solved in emotional support conversation models. First, the accuracy of our method in dialogue strategy prediction has improved, but it is still not high enough, and the model produced some errors at the prediction stage. One reason for these errors may be that the model needs more semantic information to better establish the connection between context and supporting strategies. A larger corpus and clearer dialogue strategy annotations are also needed. Second, we used COMET to enhance the model with commonsense knowledge. However, more professional domain knowledge may be required for emotional support tasks, such as human health or mental health knowledge. In addition, we evaluated model performance using both automatic and human evaluation. However, the currently used automatic evaluation metrics are still not reasonable enough and cannot judge the emotional support ability of the model. Better evaluation indicators should be established for this purpose, and

Seeker: Hi there! Can you help me?
Supporter: I will do my best. What do you need help with
Seeker: I feel depressed because I had to quit my job and stay home with my kids because of their remote school.
Supporter: I can understand why that make you feel depressed.
Seeker: Do you have any advice on how to feel better?
Supporter: Yes of course. It is good that you are acknowledging your feelings. To improve mood you could practice hobbies or other things you enjoy doing.

Figure 1. An example of an emotional support conversation from the ESConv dataset. The red font in the figure indicates the characteristics of emotional support conversation: the supporter expresses its willingness to help, the seeker explains its emotional state and emotional support needs, and the supporter provides comfort and advice.

Figure 2. The overall architecture of our proposed AdMISC model mainly consists of two modules: Feature-Fusion learning and Response-Generate. The Feature-Fusion learning module also contains two sub-modules: Emotion Detector with EmoRoBERTa to recognize the specific emotion in the seeker's utterances, and COMET to generate commonsense knowledge based on the conversation. * in the figure represents linear matrix multiplication.

Figure 3. Network structure of the Mixed-Strategy module. The dialogue history, coupled with the emotion cause and the seeker's last reply enriched by common sense, serves as input to the encoder. Following processing through various network layers, the predicted strategy labels are derived.

Figure 4. The network structure of Response-Generate takes three parts as input: the commonsense-enhanced dialogue history and seeker's last reply, the strategy, and the emotion cause. These are combined through different layers to update the hidden state of the decoder. Finally, the output is a generated response with feature fusion.

Table 4. Comparison between AdMISC and baseline models. The upward arrow in the table indicates that higher values are better, and the downward arrow indicates that lower values are better.

Table 5. Human evaluation results.

Table 6. Ablation experiment results. The upward arrow in the table indicates that higher values are better.

Table 7. Comparison between AdMISC and baseline model responses (some contextual content has been omitted).