Article

Hierarchical Cross-Modal Interaction and Fusion Network Enhanced with Self-Distillation for Emotion Recognition in Conversations

1 College of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, China
2 Hubei Province Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan 430081, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(13), 2645; https://doi.org/10.3390/electronics13132645
Submission received: 31 May 2024 / Revised: 1 July 2024 / Accepted: 3 July 2024 / Published: 5 July 2024
(This article belongs to the Section Artificial Intelligence)

Abstract

Emotion recognition in conversations (ERC), which aims to capture the dynamic changes in emotions during conversations, has recently attracted a huge amount of attention due to its importance in providing engaging and empathetic services. Considering that it is difficult for unimodal ERC approaches to capture emotional shifts in conversations, multimodal ERC research is on the rise. However, this still suffers from the following limitations: (1) failing to fully explore richer multimodal interactions and fusion; (2) failing to dynamically model speaker-dependent context in conversations; and (3) failing to employ model-agnostic techniques to eliminate semantic gaps among different modalities. Therefore, we propose a novel hierarchical cross-modal interaction and fusion network enhanced with self-distillation (HCIFN-SD) for ERC. Specifically, HCIFN-SD first proposes three different mask strategies for extracting speaker-dependent cross-modal conversational context based on the enhanced GRU module. Then, the graph-attention-based multimodal fusion (MF-GAT) module constructs three directed graphs for representing different modality spaces, implements in-depth cross-modal interactions for propagating conversational context, and designs a new GNN layer to address over-smoothing. Finally, self-distillation is employed to transfer knowledge from both hard and soft labels to supervise the training process of each student classifier for eliminating semantic gaps between different modalities and improving the representation quality of multimodal fusion. Extensive experimental results on IEMOCAP and MELD demonstrate that HCIFN-SD is superior to the mainstream state-of-the-art baselines by a significant margin.

1. Introduction

Emotion pervades daily human communication [1]. Thus, it is of great significance to automatically understand and recognize human emotions to support related decision-making for providing engaging and empathetic services. As a crucial branch under emotion recognition, emotion recognition in conversations (ERC) has recently attracted a huge amount of attention from both academia and industry due to its promising applications, such as public opinion monitoring [2], emotion-aware dialogue generation [3], intelligent healthcare assistance [4], and so forth. ERC plays an increasingly important role in building an emotion-aware human–computer interaction environment, which is expected to generate appropriate emotional responses during conversations. The task of ERC is to capture the dynamic changes in emotions during conversations based on human communication signals such as audio and text and accurately identify the emotional status of each utterance [5].
Earlier ERC studies have usually relied on a single modality [6,7], which results in relatively poor performance, robustness, and generalization of the ERC system. For instance, if the relevant modality cannot be collected or is of poor quality, the ERC system cannot work. In addition, it is quite difficult for a single modality to provide enough emotional clues to accurately capture emotional shifts in conversations [5,8], which leads to unsatisfactory emotion recognition performance in unimodal ERC systems. It is known that humans express their emotions through multiple forms such as text, speech, and facial expressions in daily communication and interpret the emotions of other speakers based on the history of utterances, as shown in Figure 1. This indicates that different modalities complement each other in emotion expression and recognition, and that capturing the contextual information in conversations is critical for accurate emotion identification. Thus, research that leverages multimodal information for ERC has recently been on the rise [8,9,10,11], but many multimodal ERC approaches generate multimodal representations via simple concatenation without considering cross-modality interaction [9,10,11]. Due to the data heterogeneity of different modalities, how to effectively perform multimodal interaction and fusion to support the extraction of comprehensive conversational context and achieve satisfactory emotion recognition performance remains a challenging research topic.
A literature review shows that existing multimodal ERC research mainly depends on sequence-based algorithms and/or graph neural networks for capturing conversational context. However, these methods typically fail to fully explore richer multimodal interactions and fusion: they either perform cross-modal interaction and fusion before feeding representations into sequence-based algorithms or graphs, or extract cross-modal complementary information only within graphs. For instance, IMAN [12] generates multimodal fusion representations based on the concatenation of cross-modal interactions and then employs gated recurrent units (GRUs) for modeling conversational context. SMFNM [5] initializes the representations of nodes in a graph based on the multimodal fusion representations. In MMGCN [13], which aims to explore cross-modal interactions during information propagation, each node is connected with the nodes that correspond to the same utterance but come from different modalities. MM-DFN [14] employs graph convolution operations to explore both intra- and inter-modality interactions. Cross-modal interactions are not only useful for simple multimodal fusion but also run through the entire conversational context. Therefore, this study attempts to explore hierarchical cross-modal interaction and fusion.
In addition, many existing ERC approaches cannot model speaker-dependent context in conversations well. For instance, BC-LSTM [15], MMMU-BA [16], and AGHMN [17] employ sequence-based algorithms to learn contextual information from the neighboring utterances; thus, they can be considered speaker-independent ERC approaches. Due to the importance of speaker information, speaker-embedding-based approaches, such as CTNet [18] and MMGCN [13], have been developed for enhancing the performance of emotion recognition in conversations. However, the obtained speaker embedding is static information and ignores the complementary information from other modalities [13,18]. Considering that the speaker dependencies in conversations, including intra-speaker and inter-speaker dependencies, are dynamic and complicated, it is highly crucial to dynamically model speaker-dependent context in ERC. In recent years, some graph-based approaches (such as DialogueGCN [7], MMGCN [13], LR-GCN [19], and GraphCFC [20]) have attempted to capture speaker-sensitive contextual information through pre-defined relation types of edges, but they need to construct fully connected graphs for capturing speaker-dependent context under this explicit declaration of edge relations, which inevitably results in a heavy computational load. Drawing inspiration from the existing research, this work explores speaker-dependent cross-modal interactions at different stages for extracting high-quality conversational context.
Moreover, recent ERC approaches mainly implement geometric manipulation (such as matrix computation) in the feature spaces to learn unified multimodal representations, but they fail to consider the semantic gaps between different modalities during the cross-modal interaction [5,13,21,22]. It may bring the risk of misalignment and further influence the final emotion prediction performance. Thus, how to learn complementary information with more similar semantics from other modalities to address the issue of the semantic gap between different modalities for enhancing the performance of emotion identification is also a great challenge.
To address the above-mentioned issues, we propose a novel multimodal framework for emotion recognition in conversations, called “hierarchical cross-modal interaction and fusion network enhanced with self-distillation (HCIFN-SD)”. In HCIFN-SD, the speaker-dependent context modeling is efficiently integrated with cross-modal interactions via an improved GRU structure for the purpose of better capturing multi-view cross-modal contextual information in conversations. Then, the obtained representations are utilized for constructing three directed graphs to further explore speaker-dependent multimodal interaction and fusion. Finally, the self-distillation mechanism is employed to eliminate the semantic gaps between different modalities for achieving more accurate emotion recognition. The overall structure of HCIFN-SD is illustrated in Figure 2.
To demonstrate the effectiveness of our proposed HCIFN-SD, we conduct extensive experiments on two popular benchmark datasets (i.e., IEMOCAP [23] and MELD [24]). Experimental results show that our HCIFN-SD is superior to existing baselines and sets new state-of-the-art records on ERC tasks. The main contributions of this paper can be summarized as follows:
(1)
We propose a novel multimodal framework for conversational emotion recognition called HCIFN-SD. It effectively integrates hierarchical cross-modal interaction and a fusion network with self-distillation for learning high-quality multimodal contextual representations, filling the gaps in existing approaches for ERC.
(2)
A new GRU structure named MCI-GRU is designed, which not only retains the speaker-dependent cross-modal long-distance conversational context but also captures local cross-modal conversational context from recent neighboring utterances. The proposed MF-GAT module constructs three directed graphs representing different modality views, which aims to capture both the long-distance conversational context and temporal complementary information from other modalities. In addition, self-distillation is introduced to minimize the semantic gaps between different modalities for achieving more satisfactory emotion prediction performance.
(3)
Experimental results demonstrate that our proposed HCIFN-SD outperforms the existing mainstream ERC models and achieves new state-of-the-art records on the benchmark datasets. Furthermore, extensive ablation studies systematically demonstrate the importance and rationality of each component in HCIFN-SD.
The rest of this paper is organized as follows: The most relevant works about emotion recognition in conversations, multimodal interaction and fusion, and graph neural networks are briefly reviewed in Section 2. The proposed methodology is introduced in Section 3. Section 4 presents the experimental data and setup. The experimental results and related discussion are reported in Section 5. Finally, Section 6 concludes this paper and outlines future work.

2. Related Work

2.1. Emotion Recognition in Conversations

Emotion recognition is an interdisciplinary research field, involving cognitive science, psychology, pattern recognition, artificial intelligence, and so on [25]. ERC, as a crucial branch under emotion recognition, has recently attracted increasing research interest due to its critical role in developing an empathetic human–computer interaction environment. Compared with single-sentence emotion recognition, ERC is more complicated because human emotion is heavily influenced by multiple parties during the conversation process [26]. The influence includes emotion inertia and emotion contagion [11]. As shown in Figure 1, upon receiving the message that the man will leave, the woman’s emotion is pushed into the frustrated state (u5), which illustrates emotion contagion; she then maintains this frustrated emotional state in the following utterances (u7 and u8), which illustrates emotion inertia. In other words, although one party tends to maintain his/her initial emotion during the conversation (i.e., intra-speaker influence), his/her emotional state may also shift to a new state due to interactions with the other speakers (i.e., inter-speaker influence). In summary, human emotion in a conversation is largely driven by the intra-speaker and inter-speaker context dependencies of the whole conversation. However, many existing ERC approaches do not model intra- and inter-speaker dependencies in conversations well [12,13,16,17], which results in relatively poor emotion recognition performance. Therefore, this study proposes a speaker-dependent framework for capturing intra- and inter-speaker context dependencies.
Based on the adopted methods for conversational context modeling, existing ERC research can be categorized into sequence-based approaches and graph-based approaches. Sequence-based Approaches: They usually employ sequence-based algorithms (such as recurrent neural networks (RNNs), long short-term memory (LSTM), and gated recurrent units (GRUs)) to extract conversational context dependencies. For example, BC-LSTM [15] utilizes the bidirectional LSTM architecture to learn conversational context for emotion classification. DialogueRNN [6] employs two distinct GRUs to update the context and the speaker state, respectively, and the updated speaker state is then fed into a third GRU to update the emotion representation for final emotion classification. HiGRU [27] obtains the individual utterance embedding based on a GRU encoder and then captures the contextual utterance embedding via another GRU encoder. CMN [9] generates multimodal features through concatenation and then employs two distinct GRUs to capture inter-speaker dependency relations. ICON [10] adopts multiple GRUs to learn both intra- and inter-speaker dependencies and uses memory networks to integrate contextual information for final emotion prediction. A-DMN [11] explores both intra- and inter-speaker dependencies in a conversation via RNNs for updating the memory for the current utterance. Different from the above approaches, COSMIC [28] not only utilizes GRUs for conversational context extraction but also integrates commonsense knowledge to interpret the speaker’s intention. To sum up, these sequence-based models typically use recent neighboring utterances for contextual information extraction; thus, it is difficult for them to maintain contextual information from long-distance utterances, even though previous ERC research has demonstrated the significance of long-distance emotion dependency [6,18].
Graph-based Approaches: They typically build a graph for a conversation [7], where each node represents one utterance, and the edges denote dependencies among utterances. For instance, DialogueGCN [7] constructs one graph for each conversation and sets a fixed window size for learning contextual information. ConGCN [29] symbolizes the entire conversation as a large graph, where all utterances and all speakers are treated as graph nodes. MMGCN [13] belongs to a multimodal ERC model, which generates three nodes for each utterance to represent different modalities to learn context propagation under the same modality and then integrates the contextual information of different modalities through concatenation. Similar to MMGCN [13], MM-DFN [14] not only constructs three semantic spaces representing different modalities in a graph to learn complementary information among modalities, but also develops a dynamic fusion module to reduce the redundant information accumulated at each graph layer. In general, these graph-based models capture the contextual dependencies among utterances based on the graph structure, but they fail to consider the sequential information in conversations. It is intuitive to combine the advantages of sequence-based and graph-based approaches for capturing the conversational context from both local and long-distance utterances [5,21]. Therefore, this study attempts to combine the advantages of recurrence and graph structures for learning more comprehensive conversational context information.

2.2. Multimodal Interaction and Fusion

Earlier ERC studies mainly rely on unimodal information, especially the text modality. For example, DialogXL [30] employs the pre-trained language model (XLNet) to extract utterance-level textual features for ERC. COMPM [31], which integrates the speaker’s pre-trained memory with the context model, is a text-based ERC framework. The aforementioned studies [6,7,17,19] also depend on only the textual modality to achieve ERC. However, it is quite difficult for a single modality to provide enough emotional clues to accurately capture emotional shifts in conversations [5,8], which leads to relatively poor recognition performance and robustness. It is known that humans express their emotions through multiple forms in daily communication, and different signals carry complementary information for emotion expression and recognition. Therefore, recent ERC research efforts [8,9,10,11,13,14] mainly focus on mining complementary information from multimodal signals to obtain high-quality emotional clues for improving emotion recognition performance.
Existing multimodal research can be broadly categorized into feature-level fusion, decision-level fusion, and model-level fusion [32]. Feature-level fusion, also called early fusion, usually generates multimodal representations through concatenation or addition strategies at the input level. For example, CMN [9], ICON [10], and A-DMN [11] obtain utterance-level multimodal representations via feature concatenation. However, feature-level fusion cannot efficiently exploit the complementary relations among multiple modalities. Decision-level fusion, which is built on the modality independence assumption, typically aggregates the unimodal decision values through simple voting, averaging, or weighted summation to generate the final decision. For instance, Sahoo and Routray [33] utilized speech and facial expressions separately to obtain per-modality emotion recognition results and then adopted a voting strategy at the decision level to achieve final emotion recognition. Nevertheless, this modality independence assumption is invalid in practice, and decision-level fusion cannot capture the cross-modal interactions between different modalities. To sum up, the above-mentioned fusion methods cannot fully exploit the complementary information from different modalities. Model-level fusion, as a trade-off between feature-level and decision-level fusion, has become a promising solution for multimodal information integration. For example, SMFNM [5], CTNet [18], and IMAN [12] employ multi-head attention or transformer structures to implement cross-modal interactions for learning complementary information to enhance the representation ability of the original modality, and they all demonstrate the importance of learning cross-modal complementary information. Therefore, this study follows the model-level fusion idea to implement multi-stage multimodal fusion for dealing with the data heterogeneity of different modalities on ERC tasks.

2.3. Graph Neural Networks

With the rapid emergence of non-Euclidean data (such as social relations, interaction relationships, and knowledge relations) in various fields, researchers have begun to represent them as graphs, which consist of node, edge, and/or edge weights [34]. Due to the complexity of graph data, it is difficult for traditional neural networks to process them. Thus, graph neural networks (GNNs) have been proposed to meet the demand and have attracted increasing research interest for multiple applications (such as item recommendation [35], opinion detection [36], and knowledge graph construction [37]). GNNs update the node’s representation through iteratively aggregating and propagating neighboring information, followed by a graph pooling operation, to generate a graph representation. Graph convolutional networks (GCNs) [38], as a typical architecture of GNNs, can directly operate convolution on graphs to learn hierarchical representations and update the node’s features. GCNs can be classified into two streams: spectral and spatial. Spectral GCNs transform graph convolution into a frequency domain filtering operation based on the signal processing theory, while spatial GCNs implement graph convolution by information propagation, which is more similar to the convolutional operations in convolutional neural networks (CNNs). Compared with spectral GCNs, spatial GCNs have gained greater attention due to their efficiency and flexibility. In addition, an attention mechanism, as a resource allocation scheme, has also been introduced to GCNs to make the models focus on the task-related information, which is called a graph attention network (GAT) [39]. Multiple studies [19,20] have demonstrated the effectiveness of utilizing the attention mechanism to assign different weights to different neighboring nodes. Therefore, this study also follows the information propagation idea and employs the attention mechanism in the graph to extract multimodal conversational context.

3. Proposed Method

In this section, we first formalize the problem definition and then introduce the proposed HCIFN-SD in detail. The overall structure of HCIFN-SD is illustrated in Figure 2. It shows that our HCIFN-SD consists of three stages: (1) unimodal feature extraction, which aims to extract unimodal alignment features and mark the location information of the speakers; (2) speaker-dependent cross-modal context modeling, which proposes the MCI-GRU module to implement speaker-dependent cross-modal conversational context learning; (3) graph-attention-based multimodal context extraction and fusion, which aims to capture both the long-distance conversational context and temporal complementary information from other modalities based on three directed graphs representing different modality views. Finally, the prediction module introduces the self-distillation strategy to supervise the training of the MCI-GRU module for the purpose of eliminating the semantic gap between different modalities and obtaining better emotion prediction performance.

3.1. Problem Formulation

Assume there are M speakers s_1, s_2, …, s_M in a conversation C, where C can be denoted as C = {u_1, u_2, u_3, …, u_n} in temporal order. Here, u_i is the i-th utterance in the conversation, and n is the total number of utterances. We also define a mapping function φ to specify the relation between an utterance and its corresponding speaker, i.e., utterance u_i is uttered by speaker s_φ(u_i). Furthermore, in the multimodal ERC scenario, each utterance involves the textual and acoustic modalities, denoted as u_i = {u_i^T, u_i^A}, where T and A denote the textual and acoustic modalities, respectively. Consequently, the conversation C can be further represented as C^T = {u_1^T, u_2^T, u_3^T, …, u_n^T} and C^A = {u_1^A, u_2^A, u_3^A, …, u_n^A}. Given the defined emotion label set Y = {y_1, y_2, …, y_c}, where c is the number of emotion categories in the conversational dataset, HCIFN-SD aims to predict the discrete emotion labels of the constituent utterances in the conversation C, i.e., the task of HCIFN-SD is to recognize the emotion state y_i of each utterance u_i based on the captured speaker-dependent multimodal multi-stage context information in conversations.

3.2. Unimodal Feature Extraction

For the textual modality, we employ the pre-trained language model DistilRoBERTa [40] for textual feature extraction. RoBERTa has 24 stacked transformer layers and 355 M parameters in total [40]; in order to reduce the computational load, speed up training, and largely preserve RoBERTa’s performance, knowledge distillation is leveraged during the pre-training phase to obtain the smaller and lighter DistilRoBERTa model. In addition, the sentence transformer is adopted to obtain the utterance-level textual feature f_i^T for utterance u_i, which is a 768-dimensional vector. For the acoustic modality, the pre-trained Wav2vec2.0 [41] model is utilized to extract utterance-level acoustic features. In this study, we also fine-tune the Wav2vec2.0 model on the first three sessions of the IEMOCAP [23] dataset, generating a 768-dimensional acoustic feature f_i^A for each utterance.
Then, we employ two distinct 1D convolutional layers to map the extracted unimodal features into a low-dimensional feature space, which can be described as follows:
\tilde{U}_i^m = \mathrm{Conv1D}(f_i^m, k_m)
where m ∈ {A, T}; k_m denotes the convolutional filter size, which is set to 1 in this study; and Ũ_i^m ∈ R^d, where d represents the hidden layer size and d = 200.
Meanwhile, inspired by MMGCN [13], we also extract speaker embeddings to represent the identity information of the specific speaker, which can be computed as follows:
S_{spk} = \mathrm{SpkEmbedding}(S, D)
where S is the set of speakers, and D is the number of speakers in the current conversation. S_spk ∈ R^d is a feature vector containing the speaker identity information. Then, the speaker embedding is added to generate utterance-level unimodal representations carrying the speaker identity information, which can be described as follows:
X_i^m = \lambda S_{spk} + \tilde{U}_i^m
where λ denotes the trade-off parameter, and X_i^m ∈ R^d represents the unimodal representation enhanced with the speaker identity information. Through this addition operation, the model can learn both who is speaking (captured by the speaker embedding) and what is being said (captured by the utterance-level unimodal representation).
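For concreteness, the two steps above amount to a kernel-size-1 convolutional projection followed by the addition of a scaled speaker embedding. The following PyTorch sketch illustrates this step; the module name, the default dimensions, and the value of the trade-off parameter λ are illustrative assumptions rather than the authors’ released implementation.

```python
import torch
import torch.nn as nn

class SpeakerEnhancedProjection(nn.Module):
    """Illustrative sketch of the unimodal projection plus speaker embedding:
    X_i^m = lambda * S_spk + Conv1D(f_i^m, k_m=1)."""

    def __init__(self, feat_dim=768, d=200, num_speakers=2, lam=0.1):
        super().__init__()
        self.proj = nn.Conv1d(feat_dim, d, kernel_size=1)   # k_m = 1
        self.spk_embedding = nn.Embedding(num_speakers, d)  # SpkEmbedding(S, D)
        self.lam = lam                                      # trade-off parameter lambda (assumed value)

    def forward(self, feats, speaker_ids):
        # feats: (n, feat_dim) utterance-level unimodal features f_i^m
        # speaker_ids: (n,) integer id of the speaker of each utterance
        x = feats.unsqueeze(0).transpose(1, 2)              # (1, feat_dim, n)
        u = self.proj(x).transpose(1, 2).squeeze(0)         # (n, d), i.e., U~_i^m
        s = self.spk_embedding(speaker_ids)                 # (n, d), i.e., S_spk
        return self.lam * s + u                             # X_i^m
```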

3.3. Speaker-Dependent Cross-Modal Context Modeling

3.3.1. Speaker-Dependent Cross-Modal Interactions

In this subsection, the “Speaker-dependent Cross-modal Interactions (SCIs)” component is proposed for capturing both intra- and inter-speaker-dependent cross-modal interactions. Our SCI focuses on extracting speaker-dependent complementary emotion clues from another modality to enhance the representations of the current modality. For instance, when a higher pitch appears in the acoustic modality, the speaker-dependent cross-modal interaction (e.g., text→audio) can accurately capture this emotional shift in the conversation. However, how to effectively distinguish different speakers when implementing intra-speaker and inter-speaker cross-modal dependencies is a challenge. It is known that the decoder architecture of the transformer introduces the concept of a mask to effectively control the information flow and block out irrelevant information during attention computation by assigning a minus-infinity score to the masked parts [42]. Inspired by this, our SCI component leverages the mask mechanism to distinguish different speakers for capturing intra-speaker and inter-speaker cross-modal dependencies in multi-party conversation scenarios.
Cross-modal Transformers: This component follows the encoder architecture of the transformer but replaces the standard multi-head attention with masked multi-head attention and stacks three identical layers to reduce the computational load. Each layer of our cross-modal transformer thus consists of a masked multi-head attention sublayer and a fully connected feedforward sublayer; the specific mask strategies are introduced in detail later. In addition, a residual connection is adopted around each sublayer, followed by layer normalization. Specifically, we consider two different modalities α and β (α, β ∈ {T, A} and α ≠ β). The cross-modal interaction from modality α to modality β is denoted as CT(X^β, X^α), which is mathematically described as follows:
CT(X^\beta, X^\alpha) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^{CT}
\mathrm{head}_i = \mathrm{Attention}(Q_i^\beta, K_i^\alpha, V_i^\alpha) = \mathrm{softmax}\left(\frac{Q_i^\beta (K_i^\alpha)^{T}}{\sqrt{d_h}}\right) V_i^\alpha
where the query Q_i^β = X^β (W_Q^β)_i, the key K_i^α = X^α (W_K^α)_i, and the value V_i^α = X^α (W_V^α)_i. Here, (W_Q^β)_i ∈ R^{d×d_h}, (W_K^α)_i ∈ R^{d×d_h}, (W_V^α)_i ∈ R^{d×d_h}, and W^{CT} ∈ R^{h d_h×d} are learnable weight matrices. CT(X^β, X^α) ∈ R^d indicates extracting complementary information from modality α to enhance the representation ability of modality β, and h represents the number of attention heads, meaning that the feature vector is divided into h heads for parallel computation. Based on the above description, cross-modal interactions can be implemented. The following introduces the different mask strategies used to identify different cross-modal dependencies.
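As a point of reference, the masked cross-modal attention described above can be realized with PyTorch’s built-in multi-head attention, where the query comes from modality β, the key/value come from modality α, and the speaker-dependent mask is supplied as an additive attention mask. The layer below is a minimal sketch under these assumptions (including the head count and feedforward size), not the authors’ implementation.

```python
import torch.nn as nn

class CrossModalLayer(nn.Module):
    """One cross-modal transformer layer: masked multi-head attention from
    modality alpha (key/value) to modality beta (query), followed by a
    feedforward sublayer, each with a residual connection and layer norm."""

    def __init__(self, d=200, heads=4, d_ff=512, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x_beta, x_alpha, mask):
        # x_beta: (1, n, d) query modality; x_alpha: (1, n, d) key/value modality
        # mask: (n, n) additive float mask implementing one of the strategies below
        ctx, _ = self.attn(x_beta, x_alpha, x_alpha, attn_mask=mask)
        x = self.norm1(x_beta + ctx)
        return self.norm2(x + self.ffn(x))
```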
Intra-speaker Mask: We employ the intra-speaker mask strategy in the multi-head attention to identify the same speaker. It is essentially an attentive masking matrix as shown in Figure 3a, which can be denoted as follows:
m_{ij}^{intra} = \begin{cases} -\infty & \text{if } s_{\varphi(u_i)} \neq s_{\varphi(u_j)} \\ 0 & \text{otherwise} \end{cases}
where s_φ(u_i) denotes the speaker of utterance u_i given by the mapping function φ. The masking operation is applied after the multiplication of the query and the key but before the softmax calculation. Therefore, the intra-speaker mask strategy allows the multi-head attention to concentrate on the cross-modal interactions of the same speaker (i.e., emotion inertia) while blocking information from other speakers.
Inter-speaker Mask: On the contrary, the inter-speaker mask strategy treats other speakers as a unified identity (denoted as “other speakers”) to simplify the inter-speaker modeling process in multi-party conversation scenarios. The illustration of the inter-speaker mask is shown in Figure 3b, which can be described as follows:
m_{ij}^{inter} = \begin{cases} -\infty & \text{if } s_{\varphi(u_i)} = s_{\varphi(u_j)} \\ 0 & \text{otherwise} \end{cases}
Therefore, utilizing the inter-speaker mask can effectively explore cross-modal interaction between different speakers, which means it only takes into account the emotion contagion from other speakers.
Local Information Mask: Given the great importance of local context information in conversational emotion recognition, we also design a local information mask strategy to behave as a context window for exploring the local context from neighboring utterances. The illustration of the local information mask is shown in Figure 3c, which can be described as follows:
m_{ij}^{local} = \begin{cases} 0 & \text{if } u_j \in \{u_j \mid i - p < j < i + p\} \\ -\infty & \text{otherwise} \end{cases}
where the hyperparameter p denotes the size of the context window. Therefore, the local information mask can selectively focus on important local cross-modal interaction information.
To sum up, based on the different mask strategies, we can obtain multi-view cross-modal interactions, including the intra-speaker interaction CT^intra(X^β, X^α), the inter-speaker interaction CT^inter(X^β, X^α), and the local interaction CT^local(X^β, X^α).
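The three strategies can be implemented as additive attention masks that are added to the scaled query–key scores before the softmax. A minimal sketch is given below; it substitutes a large negative constant for −∞ so that no attention row becomes entirely masked (a practical assumption, not part of the original description).

```python
import torch

NEG = -1e9  # stands in for minus infinity in the additive masks

def build_masks(speakers, p):
    """speakers: 1-D tensor/list of speaker ids per utterance; p: local window size.
    Returns (n, n) additive masks for the intra-speaker, inter-speaker, and local views."""
    spk = torch.as_tensor(speakers)
    n = spk.numel()
    same = spk.unsqueeze(0) == spk.unsqueeze(1)                # same[i, j]: same speaker?
    idx = torch.arange(n)
    near = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() < p     # |i - j| < p

    blocked = torch.full((n, n), NEG)
    zeros = torch.zeros(n, n)
    m_intra = torch.where(same, zeros, blocked)   # keep the same speaker, block others
    m_inter = torch.where(same, blocked, zeros)   # block the same speaker, keep others
    m_local = torch.where(near, zeros, blocked)   # keep a window of width p around position i
    return m_intra, m_inter, m_local
```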

3.3.2. Multi-View Cross-Modal-Interaction-Enhanced GRU

Considering that the extracted intra-speaker and inter-speaker cross-modal interactions contain long-term context dependencies, we propose a “Multi-view Cross-modal-Interaction-Enhanced GRU (MCI-GRU)” module for learning both long-distance and local conversational context. The illustration of MCI-GRU is shown in Figure 4. Given that the reset gate in a GRU specializes in learning local context dependencies in a sequence, we use CT^local(X^β, X^α) as the input of the GRU in order to leverage the reset gate for extracting short-term context information. Meanwhile, due to the powerful ability of the update gate to preserve long-term context dependencies, we utilize both CT^intra(X^β, X^α) and CT^inter(X^β, X^α) in the calculation of the update gate for exploring speaker-dependent long-distance conversational context. This can be described as follows:
H_{\alpha \to \beta} = \mathrm{MCI\text{-}GRU}(CT^{local}(X^\beta, X^\alpha), CT^{intra}(X^\beta, X^\alpha), CT^{inter}(X^\beta, X^\alpha))
where α, β ∈ {T, A}, α ≠ β, and H_{α→β} denotes the extracted speaker-dependent cross-modal conversational context.
Then, the calculation process of MCI-GRU is introduced. For utterance u_i, the multi-view cross-modal interactions from α to β, namely CT^local(X^β, X^α), CT^intra(X^β, X^α), and CT^inter(X^β, X^α), are abbreviated as CT_i^local, CT_i^intra, and CT_i^inter in the subsequent equations. Specifically, the computation of the reset gate can be described as follows:
r_i = \sigma(W_{rl} CT_i^{local} + W_{rh} h_{i-1} + W_{rs} CT_i^{intra} + W_{rc} CT_i^{inter} + b_r)
The update gate is computed based on the simultaneous leverage of CT_i^{intra} and CT_i^{inter} to extract speaker-dependent long-distance context dependencies, which can be formulated as follows:
z_i = \sigma(W_{zl} CT_i^{local} + W_{zh} h_{i-1} + W_{zs} CT_i^{intra} + W_{zc} CT_i^{inter} + b_z)
The candidate hidden state and the hidden state are then obtained as follows:
\tilde{h}_i = \tanh(W_h [r_i \odot h_{i-1};\, CT_i^{local}] + W_{hs} CT_i^{intra} + W_{hc} CT_i^{inter})
h_i = (1 - z_i) \odot h_{i-1} + z_i \odot \tilde{h}_i
where W_{rl}, W_{rh}, W_{rs}, W_{rc}, W_{zl}, W_{zh}, W_{zs}, W_{zc}, W_h, W_{hs}, W_{hc}, b_r, and b_z are learnable parameters; r_i, z_i, h̃_i, and h_i denote the output of the reset gate, the output of the update gate, the candidate hidden state, and the hidden state, respectively; σ and tanh denote the sigmoid and tanh activation functions, respectively; and ⊙ denotes point-wise multiplication. Therefore, the obtained hidden states of MCI-GRU are the captured speaker-dependent cross-modal context representations, denoted as H_{α→β}. Considering that H_{α→β} extracts complementary context information from modality α to enhance the representation of modality β, H_{α→β} is essentially the enhanced representation of modality β. Thus, H_{A→T} can be denoted as H^T, and the enhanced textual representation of utterance u_i can be marked as h_i^T. Similarly, H_{T→A} can be denoted as H^A, and the enhanced acoustic representation of utterance u_i can be marked as h_i^A.
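To make the gating concrete, the following PyTorch cell sketches one MCI-GRU step under the assumption that the separate weight matrices acting on CT_i^local, h_{i−1}, CT_i^intra, and CT_i^inter can be folded into single linear layers over their concatenation (an equivalent parameterization); it is an illustrative re-implementation rather than the authors’ code.

```python
import torch
import torch.nn as nn

class MCIGRUCell(nn.Module):
    """Sketch of one MCI-GRU step: the local interaction feeds the reset gate,
    while the intra- and inter-speaker interactions additionally participate in
    the update gate and the candidate hidden state."""

    def __init__(self, d):
        super().__init__()
        self.W_r = nn.Linear(4 * d, d)               # gates see [CT_local, h_prev, CT_intra, CT_inter]
        self.W_z = nn.Linear(4 * d, d)
        self.W_h = nn.Linear(2 * d, d, bias=False)   # candidate sees [r * h_prev, CT_local]
        self.W_hs = nn.Linear(d, d, bias=False)
        self.W_hc = nn.Linear(d, d, bias=False)

    def forward(self, ct_local, ct_intra, ct_inter, h_prev):
        gate_in = torch.cat([ct_local, h_prev, ct_intra, ct_inter], dim=-1)
        r = torch.sigmoid(self.W_r(gate_in))         # reset gate
        z = torch.sigmoid(self.W_z(gate_in))         # update gate
        h_tilde = torch.tanh(self.W_h(torch.cat([r * h_prev, ct_local], dim=-1))
                             + self.W_hs(ct_intra) + self.W_hc(ct_inter))
        return (1 - z) * h_prev + z * h_tilde        # new hidden state h_i
```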
In addition, we also generate a fusion representation H̃^fuse based on the concatenation strategy at this stage for the purpose of supervising the subsequent multimodal fusion, which can be obtained as follows:
\tilde{H}^{fuse} = \mathrm{Concat}(CT^{intra}_{A \to T}, CT^{inter}_{A \to T}, G_{A \to T}, CT^{intra}_{T \to A}, CT^{inter}_{T \to A}, G_{T \to A})
Then, we propose the “Local Context Enhancement (LCE)” component to further enhance the contextual information of the fusion modality, which can be described as follows:
q_i = W_q \tilde{h}_i^{fuse} + b_q
\alpha_i = \mathrm{softmax}(W_1 (q_i \odot k_i) + b_\alpha), \quad k_i = \{v_{i'} \mid i - p < i' < i + p\}
v_i = \alpha_i k_i^{T}
h_i^{fuse} = \mathrm{Norm}(\tanh(v_i) \odot \tilde{h}_i^{fuse} + \tilde{h}_i^{fuse})
where h̃_i^fuse represents the fusion representation H̃^fuse for utterance u_i; W_q and W_1 are learnable parameters, and p is a hyperparameter; v_i denotes the context information of the fusion modality for utterance u_i (if i = 1, we set v_i to be h̃_i^fuse); Norm denotes layer normalization; and h_i^fuse ∈ R^d indicates the enhanced contextual representation for utterance u_i. Therefore, the enhanced contextual representation of the fusion modality can be denoted as H^fuse.

3.4. Graph-Attention-Based Multimodal Interaction and Fusion

3.4.1. Directed Graph Construction

In this stage, three directed graphs, namely the A→T graph, the T→A graph, and the fusion graph, are constructed, denoted as G_{A→T} = {V_A ∪ V_T, E_{A→T}, W_{A→T}}, G_{T→A} = {V_T ∪ V_A, E_{T→A}, W_{T→A}}, and G_fuse = {V_fuse, E_fuse, W_fuse}, respectively. Here, V denotes the set of corresponding utterance nodes, E denotes the set of edges between nodes, R denotes the relation types of edges, and W represents the set of edge weights. Specifically, these directed graphs are constructed as follows:
Nodes: The utterance u i   is represented by two nodes v i A and v i T both in the A T graph and the T A graph, which are initialized with h i A and h i T corresponding to the enhanced acoustic and textual modalities. It means that if there are n utterances in a conversation, there are 2n nodes both in the A T graph and the T A graph accordingly. But each utterance is represented by one node in the fusion graph, initialized with h i f u s e . Thus, given a conversation with n utterances, we construct the fusion graph with n nodes.
Edges: The edge from utterance u_i to utterance u_j (i < j) is denoted as r_{ij} ∈ E, which means that the context information in u_i is propagated to u_j. Given that a future utterance cannot propagate information backwards, we define that u_j cannot propagate information to its predecessor u_i, i.e., r_{ji} ∉ E, in our directed graphs. If each utterance were connected to all its predecessors, the large number of edges would result in high computational costs. When the speaker utters u_ρ, he/she has already absorbed the current context information needed to express the emotion of utterance u_ρ. Assuming that u_ρ is the s-th nearest utterance spoken by s_φ(u_i) before u_i, we construct a directed edge r_{ρi} ∈ E, and for every κ with ρ < κ < i, the directed edge r_{κi} ∈ E is also constructed in the graphs. However, for every τ < ρ, we ignore the edge from u_τ to u_i, i.e., r_{τi} ∉ E. This indicates that the conversational context before utterance u_ρ is relatively less crucial for identifying the emotional state of utterance u_i (ρ < i), and that u_ρ propagates the learnt conversational context to u_i. Therefore, it is important for utterance u_i to find the cut-off point u_ρ when constructing directed edges, so as to reduce computational complexity while preserving the remote context. Here, s is a hyperparameter that needs to be determined during training. Based on the above edge definitions, the edges among the n utterances in a conversation can be constructed in the fusion graph; for example, v_ρ^fuse is connected with v_i^fuse in the fusion graph, as shown in Figure 5.
Furthermore, for the A T graph and the T A graph, we also define that any two nodes under the same modality have no edges, and each node will be connected with the nodes which correspond to its previous utterances but from another modality as shown in Figure 6a,b. It indicates that v ρ A is connected with v i T in the A T graph, and v ρ T will be connected with v i A in the T A graph, which enables the cross-modal graphs to further explore cross-modal interactions.
Edge types: The relation types of the above-defined edges depend on speaker dependency. If two nodes (v_ρ^A and v_i^T) have an edge (r_{ρi}^{A→T}) and their corresponding utterances are uttered by the same speaker, i.e., s_φ(u_ρ) = s_φ(u_i), the relation type of r_{ρi}^{A→T} is the intra-speaker type; otherwise, the relation type of the edge is set to the inter-speaker type. For example, as shown in Figure 6a, supposing there are two speakers in a conversation, the edge type of r_{12}^{A→T} is inter-speaker, while the type of r_{14}^{A→T} is intra-speaker.
Edge weights: The edge weights determine the significance of neighboring nodes. In this study, we employ cross-modal graph attention (GAT) [43] to update the edge weights for characterizing the significance of diverse neighboring information. The edge weight of r i j A T can be calculated as follows:
\alpha_{ij}^{A \to T} = \frac{\exp(\sigma(a^{T}[W_\alpha h_i^{A} \,\|\, W_\alpha h_j^{T}]))}{\sum_{u_k \in N(u_i)} \exp(\sigma(a^{T}[W_\alpha h_i^{A} \,\|\, W_\alpha h_k^{T}]))}
where α_{ij}^{A→T} represents the edge weight between node v_i^A and node v_j^T; h_i^A is the enhanced acoustic representation of utterance u_i; h_j^T denotes the enhanced textual representation of utterance u_j; W_α and a are learnable parameters; || is the concatenation operation; σ denotes the non-linear activation function LeakyReLU; and N(u_i) is the set of neighboring nodes of utterance u_i. Similarly, based on Equation (19), we can also calculate the edge weights in the T→A graph (α_{ij}^{T→A}) and the fusion graph (α_{ij}^{fuse}), respectively.
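For illustration, the GAT-style edge weighting above can be sketched as follows: both endpoint features are projected, scored against a shared attention vector after concatenation, passed through LeakyReLU, and softmax-normalized over the neighborhood. The class name and the use of a linear layer in place of the attention vector a are assumptions made for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalEdgeWeight(nn.Module):
    """Sketch of the cross-modal edge weights alpha_{ij}^{A->T}: project the
    acoustic node and its textual neighbours, score the concatenations, and
    normalize over the neighbourhood."""

    def __init__(self, d):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)              # plays the role of W_alpha
        self.attn_vec = nn.Linear(2 * d, 1, bias=False)   # plays the role of a^T

    def forward(self, h_dst, h_neigh):
        # h_dst: (d,) enhanced acoustic feature of node v_i^A
        # h_neigh: (k, d) enhanced textual features of its k neighbours v_j^T
        dst = self.W(h_dst).expand(h_neigh.size(0), -1)   # (k, d)
        nbr = self.W(h_neigh)                             # (k, d)
        scores = F.leaky_relu(self.attn_vec(torch.cat([dst, nbr], dim=-1))).squeeze(-1)
        return torch.softmax(scores, dim=0)               # alpha over the neighbourhood
```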

3.4.2. Graph-Attention-Based Multimodal Fusion

In order to further explore cross-modal context learning and relieve the over-smoothing issue of GNNs [44], we make several improvements to the vanilla GAT, which we call “Graph-Attention-based Multimodal Fusion (MF-GAT)”. Specifically, taking the node v_i^T in the A→T graph as an example, MF-GAT employs the attention mechanism to aggregate information from its neighboring nodes, which can be described as follows:
g_i^{A \to T} = \sum_{v_j^{A} \in N(v_i^{T})} \alpha_{ij}^{A \to T} W_{ij}^{A \to T} h_j^{A}
where α_{ij}^{A→T} is the edge weight (i.e., the attention coefficient) between node v_j^A and node v_i^T; W_{ij}^{A→T} is a learnable parameter; v_j^A is a neighboring node of v_i^T; N(v_i^T) denotes the set of neighboring nodes of v_i^T; and g_i^{A→T} is the aggregated information of utterance u_i in the A→T graph. Similarly, the aggregated information of utterance u_i in both the T→A graph and the fusion graph can be obtained based on Equation (20), denoted as g_i^{T→A} and g_i^{fuse}, respectively.
Then, we further explore the cross-modal interaction through the above-mentioned MCI-GRU cell to update the representation of utterances under each modality, which can be expressed as follows:
m_i^{A \to T}(l) = \mathrm{MCI\text{-}GRU}(g_i^{A \to T}(l), g_i^{T \to A}(l), g_i^{fuse}(l), m_i^{A \to T}(l-1))
where g_i^{A→T}(l), g_i^{T→A}(l), and g_i^{fuse}(l) denote the aggregated information of utterance u_i at layer l in the A→T, T→A, and fusion graphs, respectively; m_i^{A→T}(l−1) is the updated representation of node v_i^T at layer (l−1); and m_i^{A→T}(l) indicates the updated representation of node v_i^T at layer l. Here, the gate mechanism in MCI-GRU is utilized to preserve long-term context dependencies. The specific calculations of MCI-GRU have been described in Equations (10)–(14).
Inspired by ResNet [44] and Transformer, we also propose a new GNN layer in MF-GAT to address the over-smoothing issue and generate the final representation in each graph, which can be formulated as follows:
\tilde{m}_i^{A \to T}(l) = \mathrm{Norm}(m_i^{A \to T}(l-1) + m_i^{A \to T}(l))
f_i^{A \to T}(l) = \mathrm{Norm}(\mathrm{FFN}(\tilde{m}_i^{A \to T}(l)) + \tilde{m}_i^{A \to T}(l))
\mathrm{FFN}(\tilde{m}_i^{A \to T}(l)) = \mathrm{Drop}(\mathrm{Linear}(\mathrm{Drop}(\sigma(\mathrm{Linear}(\tilde{m}_i^{A \to T}(l); \Theta_1))); \Theta_2))
where Norm and FFN denote the layer normalization and feedforward functions, respectively; Drop and Linear are the dropout operation and the linear transformation, respectively; σ denotes the non-linear activation function; Θ_1 and Θ_2 are learnable parameters; and f_i^{A→T}(l) ∈ R^d denotes the output representation of utterance u_i at layer l in the A→T graph. Similarly, we can obtain the output representations of utterance u_i at layer l in the T→A and fusion graphs, denoted as f_i^{T→A}(l) ∈ R^d and f_i^{fuse}(l) ∈ R^d. Here, l is a hyperparameter that needs to be determined in the training process.
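The residual-plus-feedforward post-processing above mirrors a Transformer block applied to node features. The sketch below illustrates it; the hidden feedforward size and dropout rate are assumed values.

```python
import torch.nn as nn

class ResidualGNNLayer(nn.Module):
    """Sketch of the MF-GAT output block: residual connection and layer norm
    around the MCI-GRU update, followed by a feedforward sublayer with another
    residual connection, intended to relieve over-smoothing."""

    def __init__(self, d, d_ff=512, dropout=0.5):
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(
            nn.Linear(d, d_ff), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(d_ff, d), nn.Dropout(dropout),
        )

    def forward(self, m_prev, m_curr):
        # m_prev: node representation from layer (l-1); m_curr: MCI-GRU output at layer l
        m_tilde = self.norm1(m_prev + m_curr)
        return self.norm2(self.ffn(m_tilde) + m_tilde)   # layer output f_i(l)
```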
Then, the output representations of the utterance   u i are concatenated based on the skip connection [45] to generate the final representations of the utterance   u i , which can be described as follows:
F_i^{A \to T} = \mathrm{Concat}(f_i^{A \to T}(0), f_i^{A \to T}(1), f_i^{A \to T}(2), \ldots, f_i^{A \to T}(l))
F_i^{T \to A} = \mathrm{Concat}(f_i^{T \to A}(0), f_i^{T \to A}(1), f_i^{T \to A}(2), \ldots, f_i^{T \to A}(l))
F_i^{fuse} = \mathrm{Concat}(f_i^{fuse}(0), f_i^{fuse}(1), f_i^{fuse}(2), \ldots, f_i^{fuse}(l))
where F_i^{A→T} ∈ R^{d(l+1)}, F_i^{T→A} ∈ R^{d(l+1)}, and F_i^{fuse} ∈ R^{d(l+1)} denote the final representations of utterance u_i obtained in the A→T, T→A, and fusion graphs, respectively.
Finally, we concatenate the representations of utterance u_i learned from the three directed graphs to obtain the final multimodal fusion representation F_i ∈ R^{3d(l+1)}, which can be described as follows:
F_i = \mathrm{Concat}(F_i^{A \to T}, F_i^{T \to A}, F_i^{fuse})

3.5. Self-Distillation-Based Emotion Prediction

Emotion Prediction. The final multimodal fusion representation F i is fed into a fully connected layer followed with a Softmax layer for emotion prediction:
p_i = \mathrm{Softmax}(W_p(\mathrm{ReLU}(W_c F_i + b_c)) + b_p)
\hat{y}_i = \arg\max(p_i)
where W_c ∈ R^{d_1×d_1}, b_c ∈ R^{d_1}, W_p ∈ R^{C×d_1}, and b_p ∈ R^C are trainable parameters; p_i is the predicted probability distribution; and ŷ_i is the predicted emotion label for utterance u_i. Here, d_1 = 3d(l+1), and C denotes the number of pre-defined emotion categories; thus, C is set to 6 for IEMOCAP [23] and 7 for MELD [24]. We employ the cross-entropy loss with L2 regularization to calculate the emotion classification loss during training for the purpose of optimizing all learnable parameters, which can be calculated as follows:
L_{cls} = -\frac{1}{\sum_{i=1}^{N_C} N} \sum_{i=1}^{N_C} \sum_{j=1}^{N} y_{ij} \log(p_{ij}) + \lambda_{cls} \|\Theta_{cls}\|_2
where N is the total number of utterances in the i-th conversation, N_C is the number of conversations in the dataset, y_{ij} is the true emotion label of utterance u_j in the i-th conversation, p_{ij} is the corresponding predicted probability, λ_{cls} is the L2 regularization coefficient, and Θ_{cls} denotes the trainable parameters.
Self-Distillation. However, there are semantic gaps between different modalities during the cross-modal interactions, which may bring the risk of misalignment and further influence the final emotion prediction performance. Therefore, we introduce self-distillation [45], which transfers knowledge from both the hard labels and the teacher's soft predictions to supervise the training of each MCI-GRU branch, so that the learnt representation distributions of the modalities are drawn closer to each other. Concretely, we treat the main fusion module built on the three directed graphs as the teacher module. The three modality outputs of the second stage (i.e., h_i^T, h_i^A, and h_i^fuse) are fed into fully connected layers followed by Softmax layers, which are treated as the student modules, denoted as student_T, student_A, and student_f. Taking student_T as an example, its predicted probability can be expressed as follows:
p_i^{T} = \mathrm{Softmax}(W_T(\mathrm{ReLU}(W_{ST} h_i^{T} + b_{ST})) + b_T)
where W_{ST}, W_T, b_{ST}, and b_T are learnable parameters, and p_i^T is the prediction probability obtained for utterance u_i from student_T. Similarly, we can obtain p_i^A and p_i^f for utterance u_i from student_A and student_f, respectively. Meanwhile, we use the true emotion labels and the cross-entropy loss function to supervise and optimize the training of each student module, which can be described as follows:
L_{CE}^{T} = -\frac{1}{\sum_{i=1}^{N_C} N} \sum_{i=1}^{N_C} \sum_{j=1}^{N} y_{ij} \log p_{ij}^{T}
where y_{ij} is the true emotion label of utterance u_j in the i-th conversation, and p_{ij}^T is the predicted probability from student_T. Similarly, we can obtain L_{CE}^A and L_{CE}^f for student_A and student_f, respectively. Therefore, the total cross-entropy loss of the three student modules can be expressed as follows:
L_{CE} = \lambda_1 L_{CE}^{T} + \lambda_2 L_{CE}^{A} + \lambda_3 L_{CE}^{f}
where λ_1, λ_2, and λ_3 are hyperparameters. In addition, we also introduce the KL divergence to bring the probability distributions of the student modules close to that of the teacher module, further promoting high-quality multimodal fusion. It can be calculated as follows:
L_{KL}^{T} = \frac{1}{\sum_{i=1}^{N_C} N} \sum_{i=1}^{N_C} \sum_{j=1}^{N} p_{ij}^{T} \log\left(\frac{p_{ij}^{T}}{p_{ij}}\right)
where p_{ij} is the predicted probability for utterance u_j in the i-th conversation from the teacher module, and p_{ij}^T is the predicted probability from student_T. Similarly, we can obtain L_{KL}^A and L_{KL}^f for student_A and student_f, respectively. Therefore, the total KL divergence loss of the three student modules can be expressed as follows:
L_{KL} = \beta_1 L_{KL}^{T} + \beta_2 L_{KL}^{A} + \beta_3 L_{KL}^{f}
where β 1 , β 2 , and β 3 are hyperparameters.
Therefore, we combine the above L c l s , L C E , and L K L together to generate the final objective function:
L = \mu_1 L_{cls} + \mu_2 L_{CE} + \mu_3 L_{KL}
where μ 1 , μ 2 , and μ 3 are hyperparameters. In this study, we set μ 1 = μ 2 = μ 3 = 1.
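Putting the three terms together, the training objective combines the classification loss of the teacher (fusion) head, the cross-entropy of the three student heads, and the KL terms that pull each student distribution toward the teacher distribution. The function below is a sketch of this combination; it assumes that the L2 penalty is handled by the optimizer's weight decay and that per-utterance logits are already batched.

```python
import torch.nn.functional as F

def total_loss(teacher_logits, student_logits, labels,
               lambdas=(1.0, 1.0, 1.0), betas=(1.0, 1.0, 1.0), mus=(1.0, 1.0, 1.0)):
    """Sketch of L = mu1 * L_cls + mu2 * L_CE + mu3 * L_KL.
    student_logits is a sequence of three tensors (text, acoustic, fusion heads)."""
    l_cls = F.cross_entropy(teacher_logits, labels)          # teacher classification loss

    teacher_prob = F.softmax(teacher_logits, dim=-1)
    l_ce, l_kl = 0.0, 0.0
    for lam, beta, s_logits in zip(lambdas, betas, student_logits):
        l_ce = l_ce + lam * F.cross_entropy(s_logits, labels)    # hard-label supervision
        # KL(student || teacher): target = student probs, input = log teacher probs
        l_kl = l_kl + beta * F.kl_div(teacher_prob.log(),
                                      F.softmax(s_logits, dim=-1),
                                      reduction="batchmean")
    return mus[0] * l_cls + mus[1] * l_ce + mus[2] * l_kl
```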

4. Experimental Datasets and Setup

In this section, the experimental datasets as well as evaluation metrics are first introduced; then, the mainstream ERC baselines are implemented to evaluate the performance of our proposed HCIFN-SD, and finally, the implementation details are reported.

4.1. Experimental Datasets and Evaluation Metrics

(1) Datasets: We evaluate our proposed HCIFN-SD on two widely used multimodal datasets: IEMOCAP [23] and MELD [24], which include textual, acoustic, and visual signals. In this paper, we only use the textual and acoustic modalities for experiments. The basic statistic information of these two datasets is listed in Table 1.
IEMOCAP is one of the most widely used multimodal datasets in ERC tasks. It contains five conversation sessions recorded under various scenarios involving ten speakers, and each dialogue in IEMOCAP is a dyadic conversation. The approximately 12 h of recorded multimodal data contain 151 conversations, which are further segmented into 7433 utterances. Each utterance is labeled with one of the following six emotion categories: happiness, sadness, neutral, anger, excitement, and frustration. Following previous studies [5,13], the first four sessions are used for model training and validation, and the remaining session is used for testing.
MELD is a popular multimodal multi-party conversational dataset collected from the Friends TV series. It contains a total of 1433 conversations, involving 13,708 utterances uttered by 304 speakers. The dialogues in MELD are relatively short (Table 1 indicates that the average number of utterances per dialogue is 9.6), but each dialogue involves more than two speakers. Each utterance is labeled with one of the following seven emotion labels: neutral, surprise, fear, sadness, joy, disgust, and anger. Experiments on MELD are conducted following the same dataset partition as previous studies [5,13].
(2) Evaluation Metrics: Considering that both IEMOCAP and MELD are imbalanced categorical datasets, we use the weighted average accuracy (WAA) and weighted average F1-score (WAF1) as evaluation metrics for the purpose of conveniently comparing our HCIFN-SD with the state-of-the-art ERC approaches. In addition, F1 scores for each emotion category are also reported for comprehensive comparison with the baselines.
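Both metrics can be computed directly with scikit-learn; the snippet below is a sketch of how the weighted scores and the per-class F1 values reported in the result tables are typically obtained (here WAA is taken as the overall utterance-level accuracy).

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    """y_true / y_pred: flat lists of utterance-level emotion labels."""
    waa = accuracy_score(y_true, y_pred)                   # weighted average accuracy (overall accuracy)
    waf1 = f1_score(y_true, y_pred, average="weighted")    # class-frequency-weighted F1
    per_class_f1 = f1_score(y_true, y_pred, average=None)  # F1 score for each emotion category
    return waa, waf1, per_class_f1
```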

4.2. Baseline Approaches

To demonstrate the effectiveness of our proposed HCIFN-SD, we compare it with the following ERC baselines.
  • CMN [9] employs two distinct GRUs to explore inter-speaker context dependencies for predicting utterance-level emotions, but it cannot work on multi-party conversation tasks.
  • ICON [10], as an improved approach of CMN, employs distinct GRUs to explore both intra- and inter-speaker dependencies for emotion identification, but it still cannot deal with multi-party conversation scenarios.
  • BC-LSTM [15] utilizes the bidirectional LSTM architecture to learn conversational context for emotion classification, but it is a speaker-independent model.
  • DialogueRNN [6] adopts three distinct GRUs, including Global GRU, Speaker GRU, and Emotion GRU, to capture conversational context, learn speaker identity information, and identify utterance-level emotions, respectively.
  • A-DMN [11] is a multimodal sequence-based ERC approach, which utilizes bidirectional LSTM layers to model intra-speaker and inter-speaker influences and then employs the attention mechanism to combine them to update the memory for final emotion prediction.
  • DialogueGCN [7] first extracts sequential context based on the Bi-GRU layer, then explores speaker-level dependencies based on the GCN, and finally integrates these two types of context for final emotion recognition. Considering that the original DialogueGCN is a unimodal approach, we implement multimodal DialogueGCN by concatenating features of each modality.
  • MMGCN [13] is a multimodal fusion GCN, in which each utterance can be represented by three nodes for distinguishing different modalities to capture both cross-modal dependencies and speaker-dependent dependencies.
  • MM-DFN [14] is a graph-based multimodal dynamic fusion approach, which explores the dynamics of conversational context under different modality spaces for the purpose of reducing redundancy and improving complementarity between modalities.
  • GraphCFC [20] utilizes multimodal subspace extractors and cross-modal interactions to achieve multimodal fusion. It is a speaker-dependent ERC model.
  • GA2MIF [46] is a multimodal fusion approach, involving the textual, acoustic, and visual modalities. It constructs three graphs for modeling intra-modal local and long-distance context as well as cross-modal interactions.
  • SMFNM [5] first implements intra-modal interactions based on the semi-supervised strategy, then selects the textual modality as the main modality to guide the cross-modal interaction, and finally explores the conversational context from both multimodal and main-modal perspectives for obtaining more comprehensive context information.

4.3. Implementation Details

We implement HCIFN-SD through the Pytorch framework. The hyperparameters and parameters have been optimized through the grid searching strategy. The hyperparameters are set as follows: The dimension of the unimodal representations enhanced with the speaker identity information is set to be d = 200. The context window size in the local information mask is set to be p = 27 for IEMOCAP and p = 3 for MELD, respectively. Another context window size in the LCE component is set to be p = 4 for both datasets. In the directed graphs, we set s = 1, which indicates that only the latest intra-speaker dependency is considered as the cut-off point to construct the directed edges for context dependency propagation. The number of graph layers in the MF-GAT module is set to be l = 2 for IEMOCAP and l = 1 for MELD, respectively. Therefore, the dimensions of the final multimodal fusion representation are 1800 for the IEMOCAP dataset and 1200 for the MELD dataset, respectively. In addition, we set λ 1 = λ 2   = λ 3   = 1, β 1   =   β 2   =   β 3   = 1, and μ 1   = μ 2   = μ 3   = 1, respectively.
During training, the Adam optimizer is used with a learning rate of 0.0002 and a maximum of 100 epochs to optimize the trainable parameters. An early stopping strategy is employed: if the validation loss does not decrease for 5 consecutive epochs, training stops. The L2 regularization coefficient is set to 0.00001, and the dropout rate is set to 0.5 to alleviate the overfitting problem.
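The optimization settings above translate into a standard PyTorch training loop; the sketch below reflects the stated hyperparameters (learning rate 0.0002, L2 weight 1e-5, early-stopping patience of 5 epochs, at most 100 epochs), while the model, data loaders, and loss computation are placeholders.

```python
import torch

def train(model, train_loader, val_loader, compute_loss, max_epochs=100, patience=5):
    """Illustrative training loop with Adam, L2 regularization via weight_decay,
    and early stopping on the validation loss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=1e-5)
    best_val, wait = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            compute_loss(model, batch).backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(compute_loss(model, b).item() for b in val_loader)
        if val_loss < best_val:
            best_val, wait = val_loss, 0
        else:
            wait += 1
            if wait >= patience:   # stop if no improvement for 5 consecutive epochs
                break
```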

5. Results and Discussion

In this section, the results of all comparison experiments and ablation studies are reported and discussed in detail to demonstrate the effectiveness and superiority of our proposed HCIFN-SD. Then, a quantitative analysis on the hyperparameters is conducted. Finally, we qualitatively discuss our proposed HCIFN-SD through a case study.

5.1. Comparison with Baseline Models

We compare HCIFN-SD with state-of-the-art baselines on the IEMOCAP and MELD datasets under multimodality features. Experimental results, including the overall performance and F1 scores of each emotion category, are reported in Table 2 and Table 3. Here, results with “*” indicate that the baseline is a unimodal approach, but for a fair comparison, we rerun the corresponding open-source codes using our extracted unimodal features to generate multimodal features for emotion prediction; “-” indicates that the results are unavailable. Other baseline results are obtained from the existing literature.

5.1.1. Comparison of Overall Performance

Table 2 shows that our proposed HCIFN-SD performs better than all the baselines and achieves 73.42% WAF1 and 73.07% WAA on the IEMOCAP dataset, an absolute improvement of 2.52% in WAF1 and 2.27% in WAA over the currently advanced ERC model (i.e., SMFNM). Table 3 shows that our proposed approach surpasses the current state-of-the-art ERC approach (i.e., SMFNM) by 1.94% in WAF1 and 2.11% in WAA on MELD. In summary, our proposed HCIFN-SD outperforms all the baseline models in terms of WAF1 and WAA, which demonstrates the superiority of our approach in dealing with multimodal ERC tasks. The significant improvements can be explained by the following reasons:
(1)
Our HCIFN-SD is significantly superior to BC-LSTM on both datasets. It is difficult to accurately identify human emotion in a conversation scenario because the emotional state can be easily influenced by the interlocutor’s behaviors [47]; it is therefore important to model the speaker-sensitive dependencies in conversations [11]. However, BC-LSTM is a typical speaker-independent approach: it only relies on LSTM structures to capture conversational context from the surrounding utterances, without considering intra-speaker and inter-speaker influences, which explains its relatively poor performance. All the other models take speaker-sensitive modeling into account. For instance, CMN and ICON use two distinct GRUs, one per speaker, to separately explore intra-speaker dependencies, and a further GRU to model inter-speaker dependencies. The graph-based models (such as DialogueGCN, MMGCN, MM-DFN, GraphCFC, GA2MIF, and SMFNM) use nodes to represent individual utterances and edges between nodes to capture intra-speaker and inter-speaker influences. Our proposed HCIFN-SD not only explicitly models intra-speaker and inter-speaker dependencies through the proposed mask strategies but also leverages the different types of edges in the directed graph construction to comprehensively explore speaker-sensitive conversational context.
(2)
Our proposed approach significantly outperforms the multimodal sequence-based approaches (such as CMN, ICON, and A-DMN). These approaches directly concatenate multimodal information without incorporating cross-modal interaction information, although a large body of multimodal research has demonstrated the importance of exploring cross-modal interaction for enhancing the quality of the multimodal fusion representation [5,18,48]. In addition, these recurrence-based models tend to rely on local utterances for conversational context extraction and usually ignore long-term context information. To address these issues, our HCIFN-SD designs an improved GRU structure named MCI-GRU, which incorporates speaker-sensitive dependency information into the computation of the update gate to explore speaker-dependent long-distance context. Furthermore, our HCIFN-SD performs multi-stage cross-modal interaction and fusion and employs the self-distillation strategy to enhance the representation ability of the multimodal information.
(3)
HCIFN-SD outperforms the multimodal graph-based approaches (such as MMGCN, MM-DFN, GraphCFC, and GA2MIF). MMGCN and MM-DFN generate three nodes for each utterance to represent the different modalities and construct edges between the modality nodes of the same utterance to explore complementary information from the other modalities. This graph construction not only introduces redundancy, due to the inconsistent data distributions among modalities, but also fails to extract diverse information from the conversational graph, both of which weaken the representation ability of the multimodal fusion information. GraphCFC defines different types of edges from the perspectives of speakers and modalities to extract speaker-dependent cross-modal context information, but it neither captures the sequential information of conversations nor addresses the semantic gaps among modalities. GA2MIF concentrates on cross-modal contextual modeling based on its two proposed modules, but it only relies on static speaker identity information without an in-depth exploration of speaker-dependent cross-modal context extraction and fusion. To address these issues, our HCIFN-SD proposes the MCI-GRU module to explore speaker-dependent cross-modal context information, uses the MF-GAT module to further explore speaker-dependent multimodal interaction and fusion, and finally employs self-distillation to eliminate the semantic gaps among modalities.
In summary, our proposed approach more adequately extracts both long-distance and local conversational context through speaker-sensitive modeling and multi-stage cross-modal interactions, and it effectively reduces the effect of inconsistent data distributions to eliminate the semantic gaps among modalities. It is therefore not surprising that, compared with the baseline models, our HCIFN-SD achieves the best emotion recognition performance.

5.1.2. Comparison for Each Emotion Category

For performance comparison in each emotion category, we also visualize the confusion matrices of HCIFN-SD on the benchmark testing sets as shown in Figure 7.
Table 2 indicates that, compared with the baseline models, our HCIFN-SD shows remarkable improvements for happiness and neutral of 9.25~38.84% and 7.6~24.1%, respectively, on the IEMOCAP dataset. Table 2 also shows that HCIFN-SD slightly outperforms the currently advanced model (i.e., SMFNM) for frustration and performs on par with the state-of-the-art baselines for sadness, anger, and excitement. Meanwhile, as shown in Figure 7a, the happiness category is easily confused with excitement and vice versa, which is consistent with previous research [5]. A possible explanation is that happiness and excitement are quite similar in some cases. Figure 7a also shows that sadness, neutral, and anger can easily be misclassified as frustration, while frustration can be confused with neutral and anger. These phenomena are strongly related to the class-imbalanced distribution in IEMOCAP, because frustration and neutral belong to the majority categories [23]. Furthermore, humans have a similar perception and interpretation of negative emotions (such as sadness, anger, and frustration), which can also explain these observations.
Table 3 shows that, for the MELD dataset, our proposed HCIFN-SD achieves higher F1 scores than the baseline models for the majority of emotions, including neutral, joy, disgust, fear, and surprise. Due to the highly imbalanced class distribution in MELD [24], some existing baselines (such as BC-LSTM and DialogueRNN) cannot identify disgust and fear at all, and others (such as MM-DFN and GraphCFC) do not report F1 scores for these two classes, whereas our proposed HCIFN-SD reaches 26.47% and 34.00%, respectively. These results demonstrate the effectiveness of HCIFN-SD in recognizing most emotion categories. Table 3 also indicates that our proposed HCIFN-SD performs on par with the state-of-the-art baselines for sadness and anger. Figure 7b clearly shows that most emotion categories can easily be confused with neutral, which can also be explained by the class imbalance in MELD.

5.2. Ablation Study

To demonstrate the effectiveness of each component in our proposed approach, we conduct extensive ablation experiments on the two benchmark datasets.

5.2.1. Impact of Different Mask Strategies

In this subsection, we discuss the impact of the different mask strategies on the performance of HCIFN-SD. To verify the effectiveness of the intra-speaker mask, the inter-speaker mask, and the local information mask in modeling speaker-sensitive cross-modal context information, the following comparison settings are used. Experimental results are listed in Table 4, where “w/o” means removing one specific mask strategy.
Table 4 shows that HCIFN-SD outperforms w/o intra-speaker mask by 0.24~0.42% on WAF1 and 0.15~0.24% on WAA, which verifies the effectiveness of exploring cross-modal interaction under the intra-speaker mask strategy. Table 4 also reports that HCIFN-SD exhibits an absolute improvement of 0.14~0.48% on WAF1 and 0.01~0.49% on WAA over w/o inter-speaker mask, which demonstrates the effectiveness of exploring cross-modal interaction under the inter-speaker mask strategy. These results suggest that the speaker-sensitive dependencies in conversations are well modeled by our proposed intra- and inter-speaker mask strategies. In addition, our HCIFN-SD is superior to w/o local information mask by 0.20~1.27% on WAF1 and 0.19~1.17% on WAA, which confirms the effectiveness of exploring cross-modal interaction under the local information mask strategy.
Table 4 also shows that removing the intra-speaker mask leads to a slight performance degradation, whereas removing the local information mask causes a more significant degradation on the IEMOCAP dataset, which indicates that learning local conversational context from neighboring utterances is more important for IEMOCAP. In contrast, the results in Table 4 indicate that exploring local conversational context from neighboring utterances is about as important as exploring intra-speaker dependencies for the MELD dataset. A possible reason for these inconsistent findings is that the number of speakers per conversation differs between the two datasets. IEMOCAP consists of dyadic conversations [23], so neighboring utterances contain plenty of intra-speaker dependencies, whereas the average number of speakers per dialogue in MELD is 4.7 [24], so neighboring utterances rarely contain intra-speaker dependency information. It is therefore necessary to comprehensively explore cross-modal interactions based on the proposed mask strategies.
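As a rough illustration of how such speaker-sensitive masks can be derived (the paper’s exact formulation may differ), the sketch below builds boolean intra-speaker, inter-speaker, and local-window attention masks from the per-utterance speaker identities of a dialogue; the function name and tensor layout are assumptions.

```python
import torch

def build_masks(speakers, p):
    """Build intra-speaker, inter-speaker, and local-window attention masks
    for one dialogue from its per-utterance speaker IDs (illustrative only)."""
    n = len(speakers)
    idx = torch.arange(n)
    same_speaker = torch.tensor(
        [[speakers[i] == speakers[j] for j in range(n)] for i in range(n)]
    )
    intra_mask = same_speaker                                # attend to the same speaker's utterances
    inter_mask = ~same_speaker                               # attend to the other speakers' utterances
    local_mask = (idx[:, None] - idx[None, :]).abs() <= p    # neighbouring utterances within window p
    return intra_mask, inter_mask, local_mask

# Example: a dyadic dialogue of six utterances with window size p = 2
intra, inter, local = build_masks(["A", "B", "A", "A", "B", "A"], p=2)
print(local.int())
```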

5.2.2. Impact of Multi-View Cross-Modal-Interaction-Enhanced GRU

Considering that MCI-GRU leverages three types of cross-modal interactions to explore speaker dependencies in both long-distance and local conversational context, the following four comparison settings are evaluated to assess the impact of the proposed MCI-GRU module on the performance of HCIFN-SD. Experimental results are listed in Table 5.
Table 5 shows that, compared with w/o intra-CI, our HCIFN-SD achieves improvements of 0.07~0.94% on WAF1 and 0~0.86% on WAA, which verifies the effectiveness of exploring intra-speaker cross-modal interaction. Table 5 also reports that, compared with w/o inter-CI, using inter-CI (i.e., our HCIFN-SD) yields an improvement of 0.46~1.65% on WAF1 and 0.07~1.66% on WAA, which shows that it is important to explore inter-speaker cross-modal interaction. Furthermore, HCIFN-SD performs better than w/o local-CI by 0.37~1.11% on WAF1 and 0.42~1.04% on WAA, which confirms the effectiveness of exploring local-information-based cross-modal interaction. It is thus not surprising that w/o MCI-GRU, which removes both the three types of cross-modal interactions and the GRU structure, achieves the worst performance in Table 5. Our HCIFN-SD, whose MCI-GRU module leverages the three types of cross-modal interactions together with the GRU structure to explore conversational context, therefore achieves the best emotion prediction performance on the two benchmark datasets.
Table 5 indicates that modeling the inter-speaker cross-modal interaction has a greater impact on performance, especially for the IEMOCAP dataset, which is consistent with the experimental results in Table 4. A possible reason is that capturing emotion contagion through inter-speaker modeling is particularly helpful for identifying emotion shifts and inferring the correct emotions in dyadic conversations. In addition, Table 5 shows that exploring intra-speaker modeling has little impact on performance for the MELD dataset. We infer that this is due to the relatively short dialogues in MELD, which contain relatively few consecutive utterances from the same speaker and thus weaken the benefits of exploring intra-speaker dependencies.
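The following sketch illustrates the general idea of letting a speaker-dependent cross-modal interaction vector participate in the update-gate computation of a GRU cell, as described for MCI-GRU; the class name, gate layout, and input arguments are illustrative assumptions rather than the authors’ exact design.

```python
import torch
import torch.nn as nn

class SpeakerAwareGRUCell(nn.Module):
    """Illustrative GRU cell whose update gate additionally receives a
    speaker-dependent cross-modal interaction vector c_t; a sketch of the
    idea behind MCI-GRU, not the published formulation."""

    def __init__(self, input_dim, hidden_dim, context_dim):
        super().__init__()
        self.update = nn.Linear(input_dim + hidden_dim + context_dim, hidden_dim)
        self.reset = nn.Linear(input_dim + hidden_dim, hidden_dim)
        self.candidate = nn.Linear(input_dim + hidden_dim, hidden_dim)

    def forward(self, x_t, h_prev, c_t):
        # Update gate also conditions on the cross-modal context c_t
        z = torch.sigmoid(self.update(torch.cat([x_t, h_prev, c_t], dim=-1)))
        r = torch.sigmoid(self.reset(torch.cat([x_t, h_prev], dim=-1)))
        h_tilde = torch.tanh(self.candidate(torch.cat([x_t, r * h_prev], dim=-1)))
        return (1 - z) * h_prev + z * h_tilde
```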

5.2.3. Impact of Graph-Attention-Based Multimodal Fusion

The vanilla GAT aggregates information from neighboring nodes, whereas our MF-GAT uses the MCI-GRU cell to further explore cross-modal interactions and designs a new GNN layer to address the over-smoothing issue of GAT when generating the final graph representation. Therefore, the following three comparison settings are implemented to assess the impact of the proposed MF-GAT module on the performance of HCIFN-SD. Here, “w/o MCI-GRU cell” removes the MCI-GRU cell from the MF-GAT module, i.e., g_i^{A→T}(l−1) and g_i^{A→T}(l) are used directly in the computation of the GNN layer. “w/o GNN layer” denotes that the output of the MCI-GRU cell (i.e., m_i^{A→T}(l)) is used directly to generate the final graph representation. “Vanilla GAT” removes both the MCI-GRU cell and the GNN layer. The corresponding results are listed in Table 6.
Table 6 shows that using the MCI-GRU cell (i.e., our HCIFN-SD) brings improvements of 0.23~1.26% in WAF1 and 0.01~1.04% in WAA, which demonstrates the effectiveness of the MCI-GRU cell in further exploring cross-modal interactions. Table 6 also indicates that, compared with w/o GNN layer, our HCIFN-SD improves WAF1 by 0.41~1.95% and WAA by 0.38~1.54%, which shows that the proposed GNN layer is crucial for relieving the over-smoothing issue, especially on the IEMOCAP dataset. Accordingly, the Vanilla GAT setting, which removes both the MCI-GRU cell and the GNN layer, presents a significant performance degradation compared with our HCIFN-SD. In addition, the GNN layer in the MF-GAT module has a greater influence on emotion recognition performance than the MCI-GRU cell. A possible reason is that GAT updates a node’s representation by aggregating information from its neighboring nodes; stacking more layers is thus an intuitive way to extract remote context information, but it inevitably leads to over-smoothing. It is therefore necessary to address over-smoothing to avoid a decline in representation quality.
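One common way to counteract over-smoothing when stacking graph attention layers is to combine the aggregated neighborhood features with the node’s own representation through a residual connection and layer normalization, as sketched below using PyTorch Geometric’s GATConv; this is an assumed stand-in for illustration, not the GNN layer actually designed in MF-GAT.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class ResidualGATLayer(nn.Module):
    """Graph attention layer with a residual connection and layer normalization,
    a generic remedy for over-smoothing; a sketch only, not the MF-GAT design."""

    def __init__(self, dim, heads=4):
        super().__init__()
        # dim must be divisible by heads so that the concatenated heads match dim
        self.gat = GATConv(dim, dim // heads, heads=heads)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, edge_index):
        # Residual term keeps node-specific information alongside aggregated neighbors
        return self.norm(x + self.gat(x, edge_index))
```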

5.2.4. Impact of Self-Distillation

Considering that there are semantic gaps between modalities during the cross-modal interactions, which may cause misalignment and degrade the final emotion prediction performance, our HCIFN-SD introduces the self-distillation strategy, which adds two additional loss terms to eliminate the potential semantic gaps among modalities. To demonstrate the effectiveness of these loss functions, the following three comparison settings are implemented, and the experimental results are reported in Table 7.
The setting that considers neither L_CE nor L_KL (i.e., using only L_cls as the final loss function during training) achieves the worst prediction performance, as shown in the 1st row of Table 7. Incorporating L_CE (i.e., the 2nd row of Table 7) brings improvements of 0.18~2.26% in WAF1 and 0.04~2.15% in WAA, which indicates that adding L_CE helps optimize the training of each student module in the second stage and further improves the final prediction performance. Similarly, incorporating L_KL (i.e., the 3rd row of Table 7) improves WAF1 by 0.09~1.56% and WAA by 0~1.35%, which suggests that introducing the KL divergence indeed reduces the difference between the probability distributions of the teacher module and the three student modules and thereby enhances the final classification performance. In addition, the experimental results show that, compared with L_CE, L_KL appears more important for enhancing performance. Therefore, our HCIFN-SD, which introduces both L_CE and L_KL on top of L_cls, achieves the best prediction performance on both datasets.
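For reference, a minimal sketch of such a self-distillation objective is given below: the fused (teacher) classifier is trained with cross-entropy, while each unimodal student classifier receives a hard-label cross-entropy term (L_CE) and a soft-label KL term (L_KL) against the teacher’s distribution. The temperature value is an assumption, and the equal weighting follows the λ = β = μ = 1 setting reported in Section 4.3.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(teacher_logits, student_logits_list, labels, temperature=2.0):
    """Sketch of the combined objective: CE on the fused (teacher) prediction,
    plus hard-label CE and soft-label KL terms for each unimodal student."""
    loss_cls = F.cross_entropy(teacher_logits, labels)                  # L_cls (teacher, hard labels)
    soft_teacher = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    loss_ce, loss_kl = 0.0, 0.0
    for student_logits in student_logits_list:
        loss_ce += F.cross_entropy(student_logits, labels)              # L_CE: hard labels for each student
        loss_kl += F.kl_div(                                            # L_KL: soft labels from the teacher
            F.log_softmax(student_logits / temperature, dim=-1),
            soft_teacher,
            reduction="batchmean",
        )
    return loss_cls + loss_ce + loss_kl   # all weights set to 1, as in Section 4.3
```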

5.3. Analysis on Hyperparameters

The hyperparameter p determines how many neighboring utterances are considered when exploring local context in conversations, and p also denotes the window size in the proposed LCE component. The hyperparameter s determines the selection of the cut-off utterance u_ρ, and the number of graph layers l directly influences the information propagation in conversations. Therefore, we analyze these hyperparameters to find their optimal settings.

5.3.1. Analysis of Context Window Size

In this subsection, we discuss the impact of the context window size p on the performance of HCIFN-SD and conduct comparative experiments to select the optimal p for each dataset. Experimental results are presented in Figure 8 and Figure 9.
As shown in Figure 8 and Figure 9, the performance of HCIFN-SD shows an increasing trend on both datasets as the context window size p grows. When p = 27, HCIFN-SD reaches its best prediction performance of 73.42% WAF1 and 73.07% WAA on the IEMOCAP dataset, while for the MELD dataset the optimum is obtained at p = 3. These results indicate that, whereas only the six neighboring utterances (i.e., a past and a future context window of three each) need to be considered when calculating cross-modal interactions for utterance u_i in MELD, considerably more neighboring utterances need to be considered for IEMOCAP. A possible reason is the different average number of utterances per dialogue: it exceeds 40 in IEMOCAP, whereas the conversations in MELD are very short, which changes the range of neighboring utterances required for computing local context information. However, when p > 27 for IEMOCAP or p > 3 for MELD, a decreasing trend in performance is observed. We infer that an overly large context window involves too many utterances in context learning, so the model degenerates into a vanilla Transformer without a mask that attends to all utterances in the conversation. Therefore, the context window size p affects the performance of HCIFN-SD, and we set p = 27 for IEMOCAP and p = 3 for MELD in our experiments.

5.3.2. Analysis of Window Size in LCE

The hyperparameter p influences the selection of neighboring fusion information in the LCE component. Comparative experiments are implemented to select the optimal p. Experimental results on the IEMOCAP dataset are presented in Figure 10.
Figure 10 shows that increasing the window size p from 1 to 4 consistently improves the prediction ability of HCIFN-SD. When p = 4, HCIFN-SD obtains its optimal prediction performance of 73.42% WAF1 and 73.07% WAA on the IEMOCAP dataset, which indicates that using an appropriate number of neighboring utterances in the LCE component further enhances the contextual information of the fusion modality and improves the final prediction performance of HCIFN-SD. However, when p > 4, HCIFN-SD suffers a significant performance decline, as shown in Figure 10. It is also found that, with an inappropriate value of p, the performance of HCIFN-SD is even worse than without the LCE component. Therefore, the window size in LCE has a clear impact on the prediction performance of HCIFN-SD.

5.3.3. Analysis of Different Number of Graph Layers

In this subsection, we discuss the impact of different numbers of MF-GAT layers on the performance of HCIFN-SD. We conduct comparative experiments for selecting the optimal l for each dataset. Experimental results on the IEMOCAP dataset are presented in Figure 11.
Figure 11 shows that HCIFN-SD obtains the best prediction performance on the IEMOCAP dataset when l = 2, which indicates that stacking two graph attention layers captures high-quality context dependencies. When l > 2, a slight performance degradation is observed; however, compared with Vanilla GAT, the decline of our HCIFN-SD is relatively gentle, as shown in Figure 11, which again verifies the effectiveness of the proposed GNN layer in MF-GAT for relieving the over-smoothing issue. Similarly, the optimal number of graph layers is 1 for MELD. Therefore, the number of graph layers in MF-GAT has an impact on the prediction performance of HCIFN-SD.

5.3.4. Analysis of the Cut-Off Point Selection in Graph Construction

In this subsection, we explore the impact of the cut-off point selection (i.e., s) when constructing the graph. Experimental results on the IEMOCAP dataset are presented in Table 8.
Table 8 shows that increasing s from 1 to 3 does not bring performance improvements but instead leads to a significant decline. When s = 1, our HCIFN-SD performs best with 73.07% WAA and 73.42% WAF1, which means that choosing the most recent utterance spoken by p_{u_i} before u_i (i.e., s = 1) as the cut-off point for constructing edges offers a good trade-off between context extraction and computational load. Because the number of speakers per dialogue differs, s = 1 determines a different range of utterances to be considered during graph construction for each dataset. For instance, for the IEMOCAP dataset, s = 1 implies that edges are constructed from the current utterance to the past two utterances for propagating both local and long-distance context. When s > 1, however, the performance of HCIFN-SD declines significantly, which indicates that selecting a larger range for edge construction may introduce noise and disturb the context propagation. Therefore, the hyperparameter s is set to 1.
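The sketch below illustrates one plausible reading of this cut-off strategy: for each utterance, the s-th most recent earlier utterance by the same speaker is taken as the cut-off, and directed edges are built from every utterance after that point to the current one. The function name and the fallback to the dialogue start are assumptions.

```python
def edges_with_cutoff(speakers, s=1):
    """Illustrative edge construction: for utterance i, locate the s-th most
    recent earlier utterance by the same speaker and connect i to every
    utterance from that cut-off point onward (a simplified reading only)."""
    edges = []
    for i, spk in enumerate(speakers):
        same = [j for j in range(i) if speakers[j] == spk]
        cutoff = same[-s] if len(same) >= s else 0    # fall back to the dialogue start
        edges += [(j, i) for j in range(cutoff, i)]   # directed edges from past utterances to i
    return edges

# Dyadic example: with s = 1, later utterances link back to their two preceding utterances
print(edges_with_cutoff(["A", "B", "A", "B", "A"], s=1))
```

For a dyadic, strictly alternating dialogue, this reading of s = 1 connects each utterance to its two immediately preceding utterances, matching the behavior described above for IEMOCAP.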

5.4. Case Study

When recognizing utterances with non-neutral emotions, such as “All right, Bye-bye. [happy]”, “Okay, thanks. [happy]”, and “Brian, I need help. [sad]”, many existing ERC models usually misclassify these utterances as neutral, but our proposed HCIFN-SD can identify the correct emotions of these utterances by the multi-stage extraction of speaker-dependent multimodal conversational context. Furthermore, when a conversation contains consecutive utterances with the same emotion label, many baseline ERC models tend to classify the emotion of the next utterance as the same label. Taking the case selected from IEMOCAP as an example, there is an emotion shift from neutral to happy in the fourth utterance as shown in Figure 12. Baseline models cannot correctly identify the emotion, but our proposed HCIFN-SD can accurately capture this emotion shift, demonstrating the superiority of our proposed HCIFN-SD in capturing emotion shift in conversations.
It is known that emotion contagion is a critical factor in ERC systems. To qualitatively demonstrate the effectiveness of the proposed inter-speaker mask, we examine the differences between using and not using the inter-speaker mask when learning cross-modal interactions, as shown in Figure 13. The woman’s emotion is neutral at the beginning, but when the man expresses frustration about leaving, her emotion is also affected and shifts toward frustration, and this emotional contagion influences the emotions in the subsequent conversation. Figure 13 shows that using the inter-speaker mask in the cross-modal interactions accurately captures this emotional contagion.
Like most ERC approaches, our proposed HCIFN-SD also encounters challenges in accurately distinguishing similar emotions, as shown in Figure 7. Specifically, on the IEMOCAP dataset, HCIFN-SD misclassifies anger as frustration with a probability of 25.29% and excitement as happiness with a probability of 21.07%. Nevertheless, our method alleviates this issue to some extent: HCIFN-SD identifies happiness as excitement with a probability of 13.19%, whereas MMGCN and GA2MIF do so with probabilities of 31.94% [13] and 25.0% [46], respectively. Thus, our HCIFN-SD achieves a more promising performance in distinguishing similar positive emotions.

6. Conclusions

In this paper, we propose a novel ERC approach, called the "hierarchical cross-modal interaction and fusion network enhanced with self-distillation (HCIFN-SD)", to address the research gaps of existing ERC methods. HCIFN-SD first proposes different mask strategies to explore both intra-speaker and inter-speaker emotional dependencies between modalities, and then uses speaker-dependent and local cross-modal interactions to extract speaker-dependent long-distance and local conversational context based on the proposed MCI-GRU module. In the MF-GAT module, we construct three directed graphs to represent the different modality spaces, implement cross-modal interactions to update the node information based on the MCI-GRU cell, and design a new GNN layer to alleviate the over-smoothing issue of GAT when generating the final graph representation. Finally, self-distillation is introduced to transfer knowledge from both hard and soft labels in order to eliminate the semantic gaps between modalities and obtain better emotion recognition performance. We have demonstrated the effectiveness and rationality of HCIFN-SD on two benchmark datasets, achieving new state-of-the-art results of 73.42% WAF1 and 73.07% WAA on IEMOCAP and 64.34% WAF1 and 64.71% WAA on MELD. In addition, we analyze and discuss the impact of different hyperparameter settings on the performance of HCIFN-SD, and the case study qualitatively demonstrates its effectiveness.
This study only focuses on the textual and acoustic modalities. In the future, our work will explore how to effectively extract visual information to enhance the representation quality of multimodal information to accurately identify similar emotions in ERC tasks. In addition, we will explore how to efficiently extract high-quality multimodal conversational context to further improve the identification performance for ERC in our future work.

Author Contributions

P.W.: conceptualization, methodology, software, validation, writing—original draft, and writing—review and editing; J.Y.: conceptualization, methodology, software, validation, writing—original draft, writing—review and editing, and supervision; Y.X.: conceptualization, validation. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 62107032. The authors are grateful to the anonymous reviewers for their helpful comments.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Van Kleef, G.A.; Côté, S. The social effects of emotions. Annu. Rev. Psychol. 2022, 73, 629–658. [Google Scholar] [CrossRef] [PubMed]
  2. Li, R.; Wu, Z.; Jia, J.; Bu, Y.; Zhao, S.; Meng, H. Towards discriminative representation learning for speech emotion recognition. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Beijing, China, 10–16 August 2019; pp. 5060–5066. [Google Scholar]
  3. Ma, Y.; Nguyen, K.L.; Xing, F.Z.; Cambria, E. A survey on empathetic dialogue systems. Inf. Fusion 2020, 64, 50–70. [Google Scholar] [CrossRef]
  4. Nimmagadda, R.; Arora, K.; Martin, M.V. Emotion recognition models for companion robots. J. Supercomput. 2022, 78, 13710–13727. [Google Scholar] [CrossRef]
  5. Yang, J.; Dong, X.; Du, X. SMFNM: Semi-supervised multimodal fusion network with main-modal for real-time emotion recognition in conversations. J. King Saud Univ. Comput. Inf. Sci. 2023, 35, 101791. [Google Scholar] [CrossRef]
  6. Majumder, N.; Poria, S.; Hazarika, D.; Mihalcea, R.; Gelbukh, A.; Cambria, E. DialogueRNN: An Attentive RNN for Emotion Detection in Conversations. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6818–6825. [Google Scholar]
  7. Ghosal, D.; Majumder, N.; Poria, S.; Chhaya, N.; Gelbukh, A. Dialoguegcn: A graph convolutional neural network for emotion recognition in conversation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, 3–7 November 2019; pp. 154–164. [Google Scholar]
  8. Zou, S.; Huang, X.; Shen, X.; Liu, H. Improving multimodal fusion with Main Modal Transformer for emotion recognition in conversation. Knowl.-Based Syst. 2022, 258, 109978. [Google Scholar] [CrossRef]
  9. Hazarika, D.; Poria, S.; Zadeh, A.; Cambria, E.; Morency, L.-P.; Zimmermann, R. Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; Volume 1 (Long Papers), pp. 2122–2132. [Google Scholar]
  10. Hazarika, D.; Poria, S.; Mihalcea, R.; Cambria, E.; Zimmermann, R. Icon: Interactive conversational memory network for multimodal emotion detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2594–2604. [Google Scholar]
  11. Xing, S.; Mai, S.; Hu, H. Adapted Dynamic Memory Network for Emotion Recognition in Conversation. IEEE Trans. Affect. Comput. 2020, 13, 1426–1439. [Google Scholar] [CrossRef]
  12. Ren, M.; Huang, X.; Shi, X.; Nie, W. Interactive Multimodal Attention Network for Emotion Recognition in Conversation. IEEE Signal Process. Lett. 2021, 28, 1046–1050. [Google Scholar] [CrossRef]
  13. Hu, J.; Liu, Y.; Zhao, J.; Jin, Q. MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Bangkok, Thailand, 1–6 August 2021; pp. 5666–5675. [Google Scholar]
  14. Hu, D.; Hou, X.; Wei, L.; Jiang, L.; Mo, Y. MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 7037–7041. [Google Scholar]
  15. Poria, S.; Cambria, E.; Hazarika, D.; Majumder, N.; Zadeh, A.; Morency, L.P. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 873–883. [Google Scholar]
  16. Ghosal, D.; Akhtar, M.S.; Chauhan, D.; Poria, S.; Ekbal, A.; Bhattacharyya, P. Contextual inter-modal attention for multimodal sentiment analysis. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 3454–3466. [Google Scholar]
  17. Jiao, W.; Lyu, M.; King, I. Real-Time Emotion Recognition via Attention Gated Hierarchical Memory Network. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8002–8009. [Google Scholar]
  18. Lian, Z.; Liu, B.; Tao, J. CTNet: Conversational Transformer Network for Emotion Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 985–1000. [Google Scholar] [CrossRef]
  19. Ren, M.; Huang, X.; Li, W.; Song, D.; Nie, W. LR-GCN: Latent Relation-Aware Graph Convolutional Network for Conversational Emotion Recognition. IEEE Trans. Multimed. 2021, 24, 4422–4432. [Google Scholar] [CrossRef]
  20. Li, J.; Wang, X.; Lv, G.; Zeng, Z. GraphCFC: A Directed Graph Based Cross-Modal Feature Complementation Approach for Multimodal Conversational Emotion Recognition. IEEE Trans. Multimed. 2023, 26, 77–89. [Google Scholar] [CrossRef]
  21. Joshi, A.; Bhat, A.; Jain, A.; Singh, A.V.; Modi, A. COGMEN: Contextualized GNN based Multimodal Emotion recognition. arXiv 2022, arXiv:2205.02455v1. [Google Scholar]
  22. Shen, W.; Wu, S.; Yang, Y.; Quan, X. Directed Acyclic Graph Network for Conversational Emotion Recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 1–6 August 2021; pp. 1551–1560. [Google Scholar]
  23. Busso, C.; Bulut, M.; Lee, C.-C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
  24. Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. Meld: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 527–536. [Google Scholar]
  25. Picard, R.W. Affective Computing: From Laughter to IEEE. IEEE Trans. Affect. Comput. 2010, 1, 11–17. [Google Scholar] [CrossRef]
  26. Gross, J.J.; Barrett, L.F. Emotion Generation and Emotion Regulation: One or Two Depends on Your Point of View. Emot. Rev. 2011, 3, 8–16. [Google Scholar] [CrossRef] [PubMed]
  27. Jiao, W.; Yang, H.; King, I.; Lyu, M.R. HiGRU: Hierarchical gated recurrent units for utterance-level emotion recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2–7 June 2019; pp. 397–406. [Google Scholar]
  28. Ghosal, D.; Majumder, N.; Gelbukh, A.; Mihalcea, R.; Poria, S. COSMIC: Commonsense knowledge for emotion identification in conversations. In Proceedings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 16–20 November 2020; pp. 2470–2481. [Google Scholar]
  29. Zhang, D.; Wu, L.; Sun, C.; Li, S.; Zhu, Q.; Zhou, G. Modeling both Context- and Speaker-Sensitive Dependence for Emotion Detection in Multi-speaker Conversations. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence {IJCAI-19}, Macao, China, 10–16 August 2019; pp. 5415–5421. [Google Scholar]
  30. Shen, W.; Chen, J.; Quan, X.; Xie, Z. DialogXL: All-in-One XLNet for Multi-Party Conversation Emotion Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; pp. 13789–13797. [Google Scholar]
  31. Lee, J.; Lee, W. CoMPM: Context Modeling with Speaker’s Pre-trained Memory Tracking for Emotion Recognition in Conversation. arXiv 2022, arXiv:2108.11626v3. [Google Scholar]
  32. Yang, J.; Du, X.; Hung, J.-L.; Tu, C.-H. Analyzing online discussion data for understanding the student’s critical thinking. Data Technol. Appl. 2021, 56, 303–326. [Google Scholar] [CrossRef]
  33. Sahoo, S.; Routray, A. Emotion recognition from audio-visual data using rule based decision level fusion. In Proceedings of the 2016 IEEE Students’ Technology Symposium (TechSym), Kharagpur, India, 30 September–2 October 2016; pp. 7–12. [Google Scholar]
  34. Zhou, Y.; Zheng, H.; Huang, X.; Hao, S.; Li, D.; Zhao, J. Graph Neural Networks: Taxonomy, Advances and Trends. ACM Trans. Intell. Syst. Technol. 2022, 13, 15. [Google Scholar] [CrossRef]
  35. Li, X.; Sun, L.; Ling, M.; Peng, Y. A survey of graph neural network based recommendation in social networks. Neurocomputing 2023, 549, 126441. [Google Scholar] [CrossRef]
  36. Lu, G.; Li, J.; Wei, J. Aspect sentiment analysis with heterogeneous graph neural networks. Inf. Process. Manag. 2022, 59, 102953. [Google Scholar] [CrossRef]
  37. Dai, G.; Wang, X.; Zou, X.; Liu, C.; Cen, S. MRGAT: Multi-Relational Graph Attention Network for knowledge graph completion. Neural Netw. 2022, 154, 234–245. [Google Scholar] [CrossRef]
  38. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  39. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
  40. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  41. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proceedings of the 34th Conference on Neural Information Processing Systems, Online, 6–12 December 2020; pp. 12449–12460. [Google Scholar]
  42. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 1–15. [Google Scholar]
  43. Brody, S.; Alon, U.; Yahav, E. How attentive are graph attention networks? arXiv 2021, arXiv:2105.14491. [Google Scholar]
  44. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  45. Liu, F.; Ren, X.; Zhang, Z.; Sun, X.; Zou, Y. Rethinking skip connection with layer normalization. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 3586–3598. [Google Scholar]
  46. Li, J.; Wang, X.; Lv, G.; Zeng, Z. GA2MIF: Graph and Attention Based Two-Stage Multi-Source Information Fusion for Conversational Emotion Detection. IEEE Trans. Affect. Comput. 2024, 15, 130–143. [Google Scholar] [CrossRef]
  47. Lee, C.-C.; Busso, C.; Lee, S.; Narayanan, S.S. Modeling mutual influence of interlocutor emotion states in dyadic spoken interactions. In Proceedings of the International Speech Communication Association, Brighton, UK, 6–10 September 2009; pp. 1983–1986. [Google Scholar]
  48. Yang, J.; Li, Z.; Du, X. Analyzing audio-visual data for understanding user’s emotion in human-computer interaction environment. Data Technol. Appl. 2023, 58, 318–343. [Google Scholar]
Figure 1. An example of a multimodal dialogue system. Each utterance includes textual and acoustic modalities.
Figure 2. Overall structure of the proposed hierarchical cross-modal interaction and fusion network enhanced with self-distillation (HCIFN-SD).
Figure 3. Illustration of the three mask strategies for cross-modal interactions.
Figure 4. Illustration of the proposed MCI-GRU.
Figure 5. Illustrations of the fusion graph.
Figure 6. Illustrations of the edges and edge types in the A→T graph and the T→A graph, respectively. (a) The A→T graph shows the propagation of information from the audio modality to the text modality. (b) The T→A graph shows the propagation of information from the text modality to the audio modality.
Figure 7. Visualization confusion matrices of HCIFN-SD: (a) results on the IEMOCAP dataset; (b) results on the MELD dataset. Here, Hap, Sad, Neu, Ang, Exc, and Fru are abbreviations and denote Happy, Sad, Neutral, Angry, Excited, and Frustrated, respectively. The number in the confusion matrices represents the accuracy performance.
Figure 8. Experimental results of our proposed HCIFN-SD with different size of context windows on the IEMOCAP dataset.
Figure 9. Experimental results of our proposed HCIFN-SD with different size of context windows on the MELD dataset.
Figure 10. Experimental results of our proposed HCIFN-SD with different sizes of context windows in LCE on the IEMOCAP dataset.
Figure 11. Experimental results of our HCIFN-SD with different numbers of graph layers on the IEMOCAP dataset.
Figure 12. Example of conversational emotion recognition on the IEMOCAP dataset.
Figure 13. Example of exploring emotional contagion in ERC on the IEMOCAP dataset.
Table 1. Statistic information of both the IEMOCAP and MELD datasets.
Dataset | Dialogue Count (Train / Valid / Test) | Utterance Count (Train / Valid / Test) | Categories | Speakers per Dialogue
IEMOCAP | 120 (train + valid) / 31 | 5810 (train + valid) / 1623 | 6 | 2
MELD | 1039 / 114 / 280 | 9989 / 1109 / 2610 | 7 | more than 2
Table 2. Performance of different approaches on the IEMOCAP testing set under multimodal features. We report F1 scores for each emotion category and overall scores using WAA and WAF1. Best results are highlighted in bold.
Model (& Params) | Happiness | Sadness | Neutral | Anger | Excitement | Frustration | WAA | WAF1
CMN (-) | 30.38 | 62.41 | 52.39 | 59.83 | 60.25 | 60.69 | 56.56 | 56.13
ICON (-) | 29.91 | 64.57 | 57.38 | 63.04 | 63.42 | 60.81 | 59.09 | 58.54
BC-LSTM * (-) | 35.60 | 69.20 | 53.50 | 66.30 | 61.10 | 62.40 | 59.80 | 59.10
DialogueRNN * (8.8 M) | 32.90 | 78.00 | 59.10 | 63.30 | 73.60 | 59.40 | 63.40 | 62.80
A-DMN (-) | 50.60 | 77.20 | 63.90 | 60.10 | 77.90 | 63.20 | 64.90 | 64.10
DialogueGCN * (2.7 M) | 36.70 | 81.90 | 65.10 | 64.50 | 68.80 | 65.40 | 66.40 | 65.80
MMGCN (2.6 M) | 45.45 | 77.53 | 61.99 | 66.67 | 72.04 | 64.12 | 65.56 | 65.71
MM-DFN (2.2 M) | 42.22 | 78.98 | 66.42 | 69.77 | 75.56 | 66.33 | 68.21 | 68.18
GraphCFC (5.4 M) | 43.08 | 84.99 | 64.70 | 71.35 | 78.86 | 63.70 | 69.13 | 68.91
GA2MIF (9.3 M) | 41.65 | 84.50 | 68.38 | 70.29 | 75.99 | 66.49 | 69.75 | 70.00
SMFNM (-) | 59.50 | 81.50 | 70.00 | 62.90 | 74.40 | 70.10 | 70.80 | 70.90
HCIFN-SD-small (4 M) | 67.36 | 73.47 | 77.60 | 68.24 | 74.92 | 69.03 | 72.58 | 72.88
HCIFN-SD (12 M) | 68.75 | 75.51 | 77.60 | 69.41 | 72.91 | 70.34 | 73.07 | 73.42
Table 3. Performance of different approaches on the MELD testing set under multimodal features. We report F1 scores for each emotion category and overall scores using WAA and WAF1. Best results are highlighted in bold.
Model (& Params) | Neutral | Surprise | Fear | Sadness | Joy | Disgust | Anger | WAA | WAF1
BC-LSTM * (-) | 76.40 | 48.47 | 0.0 | 15.60 | 49.70 | 0.0 | 44.50 | 58.60 | 56.80
DialogueRNN * (8.8 M) | 73.20 | 51.90 | 0.0 | 24.80 | 53.20 | 0.0 | 45.60 | 58.00 | 57.00
A-DMN (-) | 77.10 | 57.73 | 12.06 | 29.12 | 57.41 | 7.16 | 43.97 | - | 60.45
DialogueGCN * (2.7 M) | 73.30 | 57.00 | 16.50 | 41.10 | 61.80 | 20.10 | 55.30 | 60.20 | 60.40
MMGCN (2.6 M) | 76.96 | 49.63 | 3.64 | 20.39 | 53.76 | 2.82 | 45.23 | 59.31 | 57.82
MM-DFN (2.2 M) | 75.80 | 50.42 | - | 23.72 | 55.48 | - | 48.27 | 62.49 | 59.46
GraphCFC (5.4 M) | 76.98 | 49.36 | - | 26.89 | 51.88 | - | 47.59 | 61.42 | 58.86
SMFNM (-) | 75.00 | 57.50 | 16.80 | 36.80 | 62.30 | 25.00 | 50.30 | 62.60 | 62.40
HCIFN-SD-small (4 M) | 77.55 | 62.28 | 28.00 | 35.10 | 66.17 | 25.00 | 46.96 | 64.41 | 64.05
HCIFN-SD (12 M) | 77.87 | 62.99 | 34.00 | 35.10 | 65.92 | 26.47 | 46.67 | 64.71 | 64.34
Table 4. Ablation results for different mask strategies on IEMOCAP and MELD datasets.
Approaches | IEMOCAP WAA | IEMOCAP WAF1 | MELD WAA | MELD WAF1
w/o S_spk & mask | 70.36 | 70.57 | 64.13 | 63.96
w/o intra-speaker mask | 72.83 | 73.00 | 64.56 | 64.10
w/o inter-speaker mask | 72.58 | 72.94 | 64.70 | 64.30
w/o local information mask | 71.90 | 72.15 | 64.52 | 64.14
Ours (HCIFN-SD) | 73.07 | 73.42 | 64.71 | 64.34
Table 5. Ablation study for the proposed MCI-GRU module on IEMOCAP and MELD datasets.
Approaches | IEMOCAP WAA | IEMOCAP WAF1 | MELD WAA | MELD WAF1
w/o intra-CI | 72.21 | 72.48 | 64.71 | 64.27
w/o inter-CI | 71.41 | 71.77 | 64.64 | 63.88
w/o local-CI | 72.03 | 72.31 | 64.29 | 63.97
w/o MCI-GRU | 71.04 | 70.93 | 64.71 | 64.22
Ours (HCIFN-SD) | 73.07 | 73.42 | 64.71 | 64.34
Table 6. Ablation study for the proposed MF-GAT module on IEMOCAP and MELD datasets.
Approaches | IEMOCAP WAA | IEMOCAP WAF1 | MELD WAA | MELD WAF1
w/o MCI-GRU cell | 72.03 | 72.16 | 64.70 | 64.11
w/o GNN layer | 71.53 | 71.47 | 64.33 | 63.93
Vanilla GAT | 71.53 | 71.47 | 64.21 | 64.03
Ours (HCIFN-SD) | 73.07 | 73.42 | 64.71 | 64.34
Table 7. Ablation study for the self-distillation on IEMOCAP and MELD datasets.
L_CE | L_KL | IEMOCAP WAA | IEMOCAP WAF1 | MELD WAA | MELD WAF1
w/o | w/o | 69.38 | 69.53 | 63.87 | 63.75
w | w/o | 70.92 | 71.16 | 64.67 | 64.16
w/o | w | 71.72 | 71.86 | 64.71 | 64.25
w | w (Ours, HCIFN-SD) | 73.07 | 73.42 | 64.71 | 64.34
Table 8. Performance of HCIFN-SD on the IEMOCAP dataset with different numbers of s.
s | WAA (IEMOCAP, 6-way) | WAF1 (IEMOCAP, 6-way)
1 | 73.07 | 73.42
2 | 71.84 | 72.12
3 | 71.29 | 71.73
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
