Multimodal Emotion Recognition in Conversation Based on Hypergraphs

Abstract: In recent years, sentiment analysis in conversation has garnered increasing attention due to its widespread applications in areas such as social media analytics, sentiment mining, and electronic healthcare. Existing research primarily focuses on sequence learning and graph-based approaches, yet these approaches overlook the high-order interactions between different modalities and the long-term dependencies within each modality. To address these problems, this paper proposes a novel hypergraph-based method for multimodal emotion recognition in conversation (MER-HGraph). MER-HGraph extracts features from three modalities: acoustic, text, and visual. It treats each modality utterance in a conversation as a node and constructs intra-modal hypergraphs (Intra-HGraph) and inter-modal hypergraphs (Inter-HGraph) using hyperedges. The hypergraphs are then updated using hypergraph convolutional networks. Additionally, to mitigate noise in acoustic data and reduce the impact of fixed time scales, we introduce a dynamic time window module to capture local-global information from acoustic signals. Extensive experiments on the IEMOCAP and MELD datasets demonstrate that MER-HGraph outperforms existing models in multimodal emotion recognition tasks, leveraging high-order information from multimodal data to enhance recognition capabilities.


Introduction
Language is a means of expressing and communicating emotions, and accurately perceiving the emotions of others is a crucial factor in effective interpersonal communication.
In conversation, humans transmit information from the speaker's brain to the listener's brain through speech. Through verbal communication, speakers not only translate their thoughts into linguistic information but also engage in the exchange and transmission of information. The task of Emotion Recognition in Conversation (ERC) aims to capture the emotional states of users in conversation, and it plays a significant role in various domains such as conversational agents, sentiment analysis, and electronic healthcare services.
For machines to effectively communicate with humans through emotions, they must possess sufficient capabilities for emotion analysis and judgment. Key factors of the ERC task include emotional stimuli (acoustic, text, visual), data collection (EEG recordings, MRI scans, facial expressions), and the ability of models to extract rich semantic features from conversation [1,2]. Traditional emotion analysis tasks employ single-modal feature extraction methods, meaning they recognize emotions from only one aspect such as acoustic, text, or video. Due to the diverse sources of emotional fluctuations, using a single modality can lead to misidentification issues, resulting in lower accuracy. Moreover, human cognitive levels are directly related to how emotions are expressed. Therefore, relying solely on a single modality makes it challenging to accurately determine emotional states. In recent years, multimodal machine learning has gained popularity as it helps compensate for the limitations of a single modality in reflecting real-world situations in certain cases. Effectively modeling the interaction between utterances in conversation and enhancing the semantic relevance of emotional information is of paramount significance for improving the performance of multimodal ERC tasks.
Currently, most multimodal ERC methods rely on Recurrent Neural Networks (RNNs) to extract sequential feature information from conversations. However, RNN-based approaches primarily propagate context and sequential information within the conversation. They simply concatenate single-modal features, ignoring the interaction between different modalities and the semantic relevance of the conversation context. This limitation hampers the effectiveness of multimodal ERC tasks. Since Kipf and Welling [3] introduced Graph Convolutional Networks (GCNs), GCNs have found wide application in various fields, such as natural language processing, computer vision, and recommendation systems. With their powerful relationship modeling capabilities, GCNs effortlessly capture long-distance contextual information in multimodal ERC tasks, modeling interactions within modalities and between different modalities. However, existing GCN-based models employ a one-to-one mapping between data points, which becomes more complex when dealing with multimodal data due to the need to model data correlations.
To solve these problems, this paper proposes a multimodal ERC method based on hypergraphs (MER-HGraph). Firstly, acoustic, visual, and text features are extracted from the conversation. Considering that acoustic data are susceptible to noise and fixed time scales, a dynamic time window is designed to process acoustic features using a Transformer model and an attention mechanism. Then, the utterances from the three modalities are treated as nodes, and hypergraph convolution operations are applied to capture data correlations in the representation learning process. By constructing separate intra-modality and inter-modality hypergraphs, the modeling of modal data becomes more flexible, and it effectively facilitates interactions within the current conversation as well as between modalities. The main contributions of this paper can be summarized as follows:

•
The MER-HGraph model, a multimodal conversation emotion analysis approach based on hypergraphs, is introduced. Through the design of intra-modality hypergraphs and inter-modality hypergraphs, it effectively captures context dependencies within modalities and interaction relationships between different modalities. This leads to a significant improvement in the accuracy of emotion analysis.

•
The use of a dynamic time window in processing extracted acoustic features involves dynamically segmenting and re-evaluating speech signal window information through an attention mechanism.This approach effectively alleviates the impact of noise and fixed time scales inherent to acoustic signals.

•
Extensive experiments were conducted on two real datasets, IEMOCAP and MELD.
The results indicate that the MER-HGraph model outperforms all baseline models in the task of multimodal conversation emotion analysis.
The remaining part of the article is structured as follows. Section 2 describes related work; Section 3 offers a detailed explanation of the model's architecture and the method used for constructing hypergraphs; Section 4 presents an analysis of the experimental results, along with a breakdown of experimental parameters; and Section 5 concludes the paper.

Single Modal Feature Processing
Acoustic features can be broadly categorized into two types: classical handcrafted features and those based on deep learning. Classical handcrafted features involve extracting characteristics from each frame of the acoustic signal. On the other hand, deep learning-based features dynamically capture inter-frame characteristics. Schuller [4] and others classify these two types of features into aspects such as signal energy, fundamental frequency, speech quality, cepstral coefficients, and spectrograms. Tripathi et al. [5] found that Mel-frequency cepstral coefficients (MFCCs) outperform spectrogram features. When it comes to processing acoustic features, Wang et al. [6] proposed a method of handling different time frames of speech signals through LSTM and integrating two sequences for acoustic feature processing. Lee et al. [7] introduced a parallel fusion model that extracts temporal information from spectrograms using the BERT model and utilizes a CNN for spectrogram information extraction. Ye et al. [8] proposed a time-aware bidirectional multiscale network, which employs a time-aware module to capture speech signal features and utilizes a bidirectional structure to model long-term dependencies.
Language features represent one way to realize speech information. With the emergence of pre-trained models, Mikolov et al. [9] introduced the Word2Vec word representation method. This approach employs a simple model to learn continuous word vectors and trains the model based on distributed representations. Devlin et al. [10] proposed the BERT pre-trained model, a novel language representation model for semantic understanding. BERT is trained based on contextual representations and excels at capturing the semantic relationships of words in different contexts. In the processing of text features, Wang et al. [11] put forward an automated method for constructing a fine-grained sentiment lexicon that encompasses sentiment information. They achieved this by extending the sentiment seed lexicon using a graph propagation method. Jassim et al. [12] combined sentiment lexicons with TF-IDF weight distribution to obtain sentence vectors, resulting in a substantial improvement over conventional sentiment lexicon methods. Xu et al. [13] proposed a CNN-based sentiment classification model, training distributed word embeddings for each word using both CNN_TEXT and the Word2Vec method. All of these methods find wide application in text sentiment recognition, with deep learning-based approaches making significant strides in handling text sentiment recognition tasks.
Visual features also reflect changes in human emotions. Yang et al. [14] employed a Convolutional Neural Network (CNN) to obtain the output of the last layer as the global emotional feature map. By coupling this emotional heat map with the original output, a local emotional representation is formed; combining the global and local emotional feature maps yields the classification result, enriching the extracted features. Guo et al. [15] utilized DenseNet to extract features from images and compared it with ResNet, BERT, and BERT-ResNet. The results demonstrated that DenseNet is more adept at feature extraction from images. Considering that excessive focus on locality may neglect overall discriminative information in target regions of the image, Li et al. [16] introduced a weakly supervised discriminative enhancement network strategy. This approach applies emotional maps and discriminative enhancement maps to features, then aggregates them into an emotional vector as the basis for classification. This method better utilizes both the overall and local information in emotional images, leading to improved classification accuracy.

Hypergraph Neural Network
In recent years, hypergraph learning has garnered attention due to its effectiveness in modeling high-order relationships among samples. Hypergraph learning is capable of extracting features from high-order relationships, thereby reducing information loss. This progress addresses the issue of relationships between data points extending beyond pairwise interactions. Jiang et al. [17] proposed a dynamic hypergraph convolutional neural network that dynamically updates the hypergraph structure using KNN and K-Means, enhancing its ability to capture data relationships. This allows for better extraction of both global and local relationships in the data. To better apply graph learning strategies to hypergraphs, Bai et al. [18] utilized an attention mechanism to dynamically update the hyperedge weights in the hypergraph. This not only addresses the oversmoothing issue in deep hypergraph convolution but also significantly enhances the representational capacity of the hypergraph by incorporating attention mechanisms.
There are also a few studies that integrate hypergraph learning with prediction tasks. For instance, Ding et al. [19] proposed learning two types of project embeddings based on hypergraph convolutional networks and gated recurrent units. They flexibly combined these two embeddings using an attention mechanism to obtain conversation representations. Xia et al. [20] introduced a graph convolutional network based on hypergraphs and line graphs. They maximized the interaction between conversation representations learned by the two networks and integrated it into the network's training through self-supervised learning to enhance recommendation tasks. Ren et al. [21] treated conversations as hyperedges, merging users' repetitive behaviors within these hyperedges to form a hypergraph. This not only expresses complex relationships between unique items but also captures relationships between repetitive behaviors. In studies on other tasks, it has been observed that hypergraph neural networks are better at capturing high-order relationships within conversation, leading to improved predictive performance.

Multimodal Emotion Recognition in Conversation
In multimodal ERC tasks, many studies adopt sequence modeling methods to model the dependencies within each modality. For example, DialogueRNN [22] proposes the use of three GRUs to model speaker information, contextual information from the preceding dialogue, and emotional information. The global GRU and party GRU are employed to compute and update the global contextual state and the participant's state, while the emotion GRU calculates the emotional representation of the current dialogue content. AF-CAN [23] utilizes a context-aware recurrent neural network to simulate interactions and dependencies between speakers. It employs bidirectional GRU network units to capture past and future feature information. BiERU [24] extracts features from the conversation using long short-term memory units and one-dimensional convolutional neural networks. It designs a generalized neural tensor block and a dual-channel feature extractor to obtain contextual information and emotional features. However, sequence modeling methods tend to focus attention on dependencies within each modality, thereby neglecting complementary information between different modalities. This limitation restricts the performance of multimodal conversation emotion analysis. With the growing popularity of Graph Convolutional Networks (GCNs) in solving various graph-based problems, including prediction tasks and recommendation systems, DialogueGCN [25] employs a relational GCN to describe the dependencies between speakers. In the graph, nodes represent individual utterances, and edges between utterances represent dependencies between speakers and their relative positions in the conversation. RGCN [26] designs a residual convolutional neural network, generating a complex contextual feature for each individual utterance using an internal feature extractor based on ResNet. MMGCN [27] introduces a spectral domain graph convolutional network to encode multimodal contextual information, capturing speech-level contextual dependencies across multiple modalities. DSAGCN [28] proposes a conversation emotion analysis model that combines dependency parsing with GCN. It inputs feature vectors from three modalities into a Bi-LSTM, and then utilizes attention mechanisms and GCN for emotion classification. GraphMFT [29] suggests constructing three graphs (V-A graph, V-T graph, and A-T graph) for each conversation and extracts intra-modal and inter-modal interaction relationships using an improved Graph Attention Network (GAT). These methods either fail to capture interactions between different modalities or overlook the heterogeneity of multimodal data. Moreover, existing multimodal ERC approaches mostly employ GCN to model interactions between different modalities. However, GCN simplifies the relationships between feature data to binary relations, resulting in the loss of many high-order associations present in the original data. Thus, the MER-HGraph is proposed. The Laplacian matrix of the hypergraph extends the node neighborhoods, enabling it to aggregate richer high-order information and, consequently, more accurately model multi-order associations. The information of the multimodal conversation sentiment analysis models is shown in Table 1 (Acc/wa-F1: 67.9/68.07 on IEMOCAP; 61.30/58.37 on MELD).

Model and Methods
In the task of multimodal emotion analysis in conversation, this paper proposes the MER-HGraph model, as shown in Figure 1. The model comprises single-modality feature extraction, a multimodal conversation hypergraph network, and an emotion prediction layer. The multimodal conversation hypergraph network encompasses speaker embedding, hypergraph construction, and hypergraph convolutional networks.


Problem Definition
In multimodal ERC tasks, each conversation consists of a total of n utterances, which can be defined as {u_1, u_2, . . . , u_n}. Each utterance u_i = (u_i^V, u_i^A, u_i^T) is represented in three modalities: V for visual, A for acoustic, and T for text. The objective of the multimodal ERC task is to learn to predict the corresponding emotion of u_i by leveraging both the dependencies within each modality and the interactions across modalities.
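As a concrete illustration of this setup, the structures below sketch how a conversation and its per-utterance modality views might be represented. The container names and feature values are illustrative assumptions, not part of the paper:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    """One utterance u_i with its three modality feature vectors."""
    visual: List[float]    # u_i^V
    acoustic: List[float]  # u_i^A
    text: List[float]      # u_i^T
    emotion: str           # label to be predicted by the ERC model

# A toy conversation {u_1, ..., u_n} with n = 2 utterances.
conversation = [
    Utterance(visual=[0.1, 0.2], acoustic=[0.3, 0.4], text=[0.5, 0.6], emotion="happy"),
    Utterance(visual=[0.7, 0.8], acoustic=[0.9, 1.0], text=[1.1, 1.2], emotion="sad"),
]
```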

Single Modal Feature Extraction
We employ DenseNet, OpenSMILE, and TextCNN to extract features from the visual, acoustic, and text modalities, respectively. Considering that acoustic signals are susceptible to noise interference and face issues related to fixed time scales, we designed a Dynamic Temporary Window Block (DTWB) to process the extracted acoustic features, as illustrated in Figure 2.


The local dynamic window module first computes frame-level importance scores for the encoded acoustic sequence u_i^{A1} ∈ R^{T×D}:

sc_i^{A1} = Softmax(W_1 u_i^{A1}) ∈ R^T,

where W_1 denotes the projection mapping of u_i^{A1}. These scores are then used to partition the time sequence into several strong or weak emotion windows based on a threshold set at the median of the scores. To handle acoustic signals in batches, window segmentation is implemented using an attention mask mechanism:

M_ij = 0 if b_{w_k} ≤ i, j ≤ e_{w_k} for some window k, and −∞ otherwise,

where M_ij denotes the value of the attention mask M ∈ R^{T×T} at the i-th row and j-th column, and b_{w_k} and e_{w_k} (k = 1, . . . , n) are the start and end indices of the k-th window's rows and columns, respectively. Masked self-attention over u_i^{A1} produces the output u_i^{A2} ∈ R^{T×D}. The global dynamic window module re-evaluates the importance between windows by taking u_i^{A2} as input and calculating scores sc_i^{A2}. In this process, each window first generates a new token through a weighted summation of its frames. Duplicating (upsampling) each window token to match its window's length and concatenating the results generates the sequence u_i^{A3} ∈ R^{T×D}. Finally, we fuse the features u_i^{A2} and u_i^{A3} to obtain the acoustic features. We employ fully connected networks to process the visual modality features, enhancing their representational power. For the text modality, we utilize a bidirectional LSTM network to extract contextual information from the utterances. The single-modality encoding can be summarized as

h_i^A = DTWB(u_i^A), h_i^T = BiLSTM(u_i^T), h_i^V = FC(u_i^V),

where u_i^A, u_i^T, u_i^V denote the inputs for the acoustic, text, and visual modalities, and h_i^A, h_i^T, h_i^V respectively denote the encoded outputs.
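The dynamic time window mechanism can be sketched numerically as follows. This is a minimal NumPy illustration, not the paper's implementation: the projection W1, the median threshold, the block-diagonal attention mask, and the additive fusion of u^{A2} and u^{A3} follow the description in the text, while the dimensions and the single-head attention in place of a full Transformer are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 4  # illustrative sequence length and feature dimension

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# --- local dynamic window: per-frame importance scores ---
u_a1 = rng.normal(size=(T, D))          # encoded acoustic frames u^{A1}
W1 = rng.normal(size=(D,))              # stand-in projection mapping
scores = softmax(u_a1 @ W1)             # sc^{A1} in R^T

# split the sequence into strong/weak emotion windows at the median score
strong = scores >= np.median(scores)
bounds = [0] + [t for t in range(1, T) if strong[t] != strong[t - 1]] + [T]
windows = list(zip(bounds[:-1], bounds[1:]))  # (begin, end) index pairs

# block-diagonal attention mask: frames attend only within their window
M = np.full((T, T), -np.inf)
for b, e in windows:
    M[b:e, b:e] = 0.0

# masked self-attention over the window structure -> u^{A2}
attn = softmax(u_a1 @ u_a1.T / np.sqrt(D) + M, axis=-1)
u_a2 = attn @ u_a1

# --- global dynamic window: one token per window via weighted summation,
#     then upsample each token back to its window's length -> u^{A3} ---
tokens = [(scores[b:e, None] * u_a2[b:e]).sum(0) / scores[b:e].sum()
          for b, e in windows]
u_a3 = np.concatenate([np.tile(tok, (e - b, 1))
                       for tok, (b, e) in zip(tokens, windows)])
fused = u_a2 + u_a3                     # fused acoustic feature
```

The mask keeps attention local to each emotion window, while the window tokens re-inject global, between-window information before fusion.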

Speaker Embedding
Since there can be many participants in each conversation, speaker information plays a crucial role in multimodal ERC tasks. To fully leverage this information, we use a one-hot vector s_i to represent the speaker of each utterance. Before hypergraph construction, the speaker encoding projects s_i into the feature space and integrates it with the multimodal features, yielding a new fused representation of each utterance with integrated speaker information.
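A minimal sketch of this step, assuming the one-hot vector s_i is projected by a learnable matrix and added to the utterance feature (the projection matrix W_s and the additive fusion are illustrative assumptions; the paper does not spell out the fusion operator):

```python
import numpy as np

rng = np.random.default_rng(0)
n_speakers, D = 2, 4   # illustrative sizes

# Learnable projection of the one-hot speaker vector (an assumption).
W_s = rng.normal(size=(n_speakers, D))

def fuse_speaker(u_i, speaker_id):
    """Add a speaker embedding to the utterance feature u_i."""
    s_i = np.eye(n_speakers)[speaker_id]   # one-hot vector s_i
    return u_i + s_i @ W_s                 # speaker-aware representation

u_i = rng.normal(size=(D,))
fused = fuse_speaker(u_i, speaker_id=1)
```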

Hypergraph Learning
Due to the powerful expressive capabilities of hypergraph neural networks (HGNNs) in representation learning, we adopt an HGNN to describe relationships within the conversation. Let G = (U, E, H) represent a hypergraph, where the vertex set u_i ∈ U and hyperedge set ε ∈ E contain N unique nodes and M hyperedges, respectively, and H ∈ R^{N×M} is the incidence matrix between hyperedges and vertices, defined as H(u, ε) = 1 if u ∈ ε and H(u, ε) = 0 otherwise. We treat each utterance in the conversation as a vertex, forming the set U, and the hyperedges constructed over these vertices form the set E. By sharing vertices, we connect hyperedges to construct the hypergraph for multimodal conversation emotion analysis. For the hypergraph G, D ∈ R^{N×N} is the diagonal matrix of vertex degrees and B ∈ R^{M×M} is the diagonal matrix of hyperedge degrees, defined as d(u) = Σ_{ε∈E} H(u, ε) and b(ε) = Σ_{u∈U} H(u, ε). To address the node classification problem on the hypergraph, where node labels should be smooth across the hypergraph's structure, a regularization framework is employed for hypergraph classification:

arg min_f { R_emp(f) + λ Ω(f) },

where Ω(f) is the hypergraph regularization term; R_emp(f) is the supervised empirical loss; f(·) is the classification function; and λ is a non-negative parameter.
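The incidence matrix H and the degree matrices D and B can be illustrated on a toy hypergraph (four nodes, two hyperedges; the values are illustrative):

```python
import numpy as np

# H[i, e] = 1 iff node i belongs to hyperedge e (N = 4 nodes, M = 2 edges).
H = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [1, 1]], dtype=float)

node_deg = H.sum(axis=1)   # d(u): number of hyperedges containing node u
edge_deg = H.sum(axis=0)   # b(eps): number of nodes in each hyperedge
D = np.diag(node_deg)      # N x N diagonal matrix of vertex degrees
B = np.diag(edge_deg)      # M x M diagonal matrix of hyperedge degrees
```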

Hypergraph Construction
In order to capture the dependencies between utterances in the conversation and the interactions between different modalities, we use the extracted features from the three modalities as input to construct intra-modal hypergraphs (Intra-HGraph) and inter-modal hypergraphs (Inter-HGraph) for each conversation. (1) The Intra-HGraph captures the contextual dependency relationships between utterances in a conversation. In the conversation, let u^A, u^T, u^V represent nodes, where each node denotes an utterance in its modality. There are 3 × m nodes, where m is the number of utterances per modality in the conversation. Three types of hyperedges, denoted as (ε^A, ε^T, ε^V), are created within each modality. Each node is connected to its past P and future F context nodes within the current modality. (2) The Inter-HGraph captures the interaction relationships between different modalities within the same utterance. The nodes in the Inter-HGraph are the same as those in the Intra-HGraph. We connect each node to nodes from the same utterance but belonging to different modalities, constructing inter-modality hyperedges (ε_1, ε_2, . . . , ε_n).

Considering that different adjacent nodes may have varying impacts on the current utterance node, and that different modalities of the same node may interact differently, we assign a weight W_εε to each hyperedge; these weights form the diagonal matrix W ∈ R^{M×M} of hyperedge weights for this hypergraph. Additionally, we construct incidence matrices H_ra and H_er between nodes and hyperedges for the Intra-HGraph and Inter-HGraph, respectively.
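A sketch of how the Intra-HGraph and Inter-HGraph incidence matrices H_ra and H_er could be assembled for one conversation. The node indexing scheme, window sizes, and helper function are illustrative assumptions:

```python
import numpy as np

# One conversation with m utterances and three modalities (A, T, V).
# Node index = modality * m + utterance (an illustrative convention).
m, P, F = 4, 1, 1          # utterances per modality; past/future context
n_nodes = 3 * m

# Intra-HGraph: one hyperedge per (modality, utterance) connecting the
# node to its past-P and future-F neighbours within the same modality.
intra_edges = []
for mod in range(3):
    for i in range(m):
        lo, hi = max(0, i - P), min(m, i + F + 1)
        intra_edges.append([mod * m + j for j in range(lo, hi)])

# Inter-HGraph: one hyperedge per utterance connecting its three
# modality nodes (eps_1 ... eps_m).
inter_edges = [[mod * m + i for mod in range(3)] for i in range(m)]

def incidence(edges, n):
    """Build the node-by-hyperedge incidence matrix H."""
    H = np.zeros((n, len(edges)))
    for e, nodes in enumerate(edges):
        H[nodes, e] = 1.0
    return H

H_ra = incidence(intra_edges, n_nodes)              # Intra-HGraph
H_er = incidence(inter_edges, n_nodes)              # Inter-HGraph
H_full = np.concatenate([H_ra, H_er], axis=1)       # combined incidence
```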

Hypergraph Convolution
The hypergraph convolution operation efficiently utilizes higher-order relationships and local clustering structures to achieve effective information propagation between vertices. The process of multimodal conversation hypergraph convolution is illustrated in Figure 3. This process can be divided into two stages: (1) information aggregation from vertices to hyperedges; (2) information aggregation from hyperedges to vertices. Specifically, the information from each vertex is aggregated into the corresponding hyperedge, resulting in a representation for each hyperedge. Then, the hyperedges connected to each vertex are located, and their information is aggregated into the vertex, generating a representation for each vertex.

In the Intra-HGraph, we aggregate the information of vertices u_{i−1}^A, u_i^A, u_{i+1}^A to edge e^A, the information of vertices u_{i−1}^T, u_i^T, u_{i+1}^T to edge e^T, and the information of vertices u_{i−1}^V, u_i^V, u_{i+1}^V to edge e^V. In the Inter-HGraph, we aggregate the information of vertices u_{i−1}^A, u_{i−1}^T, u_{i−1}^V to edge e_1, the information of vertices u_i^A, u_i^T, u_i^V to edge e_2, and the information of vertices u_{i+1}^A, u_{i+1}^T, u_{i+1}^V to edge e_3. Through this process, MER-HGraph obtains representations for all vertices in both intra-modal and cross-modal aspects, further enhancing the learning of conversation representations. We define the hypergraph convolution as

x_i^{(l+1)} = Σ_{ε=1}^{M} Σ_{j=1}^{N} H_{iε} W_{εε} H_{jε} x_j^{(l)},   (11)

where x_i^{(l+1)} denotes the i-th node at the (l+1)-th layer. Each hyperedge weight W_εε is set to 1, and we do not employ non-linear activation functions or convolutional filter parameter matrices. The row-normalized matrix form of Equation (11) is

X^{(l+1)} = D^{−1} H W B^{−1} H^T X^{(l)}.

Hypergraph convolution can be viewed as a two-stage evolution of feature transformation on the hypergraph structure, performing a "node-hyperedge-node" transformation. By concatenating the two incidence matrices H_ra and H_er for the Intra-HGraph and Inter-HGraph, we obtain the final incidence matrix H. The multiplication H^T X^{(l)} aggregates information from nodes to hyperedges, and pre-multiplying by H then aggregates information from hyperedges back to nodes.
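The two-stage "node-hyperedge-node" propagation can be sketched with the row-normalized form X^{(l+1)} = D^{-1} H W B^{-1} H^T X^{(l)}, with all hyperedge weights set to 1 as in the paper. The toy incidence matrix and node features here are illustrative:

```python
import numpy as np

# Toy incidence matrix: N = 6 nodes, M = 3 hyperedges (illustrative).
H = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 0],
              [0, 1, 1],
              [0, 0, 1],
              [1, 0, 1]], dtype=float)
N, M = H.shape
W = np.eye(M)                                     # hyperedge weights, all 1
X = np.arange(N * 4, dtype=float).reshape(N, 4)   # node features X^{(l)}

Dv = np.diag(H @ W @ np.ones(M))  # diagonal vertex-degree matrix D
B = np.diag(H.sum(axis=0))        # diagonal hyperedge-degree matrix B

# Stage 1: aggregate node information into each hyperedge.
edge_feat = np.linalg.inv(B) @ H.T @ X
# Stage 2: aggregate hyperedge information back into each node.
X_next = np.linalg.inv(Dv) @ H @ W @ edge_feat    # X^{(l+1)}

# Without filter matrices or non-linearities, the combined operator is
# row-stochastic: each new node feature is an average over its neighbourhood.
Propagator = np.linalg.inv(Dv) @ H @ W @ np.linalg.inv(B) @ H.T
```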

Multimodal Emotion Prediction in Conversation
We use the obtained final feature vectors h_i as input to a fully connected network for emotion prediction, computed as

p_i = Softmax(W ReLU(w h_i + b) + b'), ŷ_i = arg max_k p_i[k],

where h_i denotes the final feature vector of the i-th utterance u_i; ReLU denotes the non-linear activation function; p_i denotes the predicted emotion probability distribution of u_i; ŷ_i denotes the predicted emotion; and w, W, b, b' denote trainable parameters. We employ the cross-entropy loss function as the objective function for training, which is computed as

L = −(1 / Σ_{i=1}^{N} n(i)) Σ_{i=1}^{N} Σ_{j=1}^{n(i)} y_{ij} log p_{ij} + λ ||W_ls||_2,

where N is the total number of conversations in the dataset; n(i) is the number of utterances in the i-th conversation; y_{ij} denotes the true emotion of the j-th utterance in the i-th conversation; p_{ij} denotes the predicted emotion probability distribution of the j-th utterance in the i-th conversation; λ denotes the L2-regularization weight; and W_ls denotes the set of learnable parameters.
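A minimal sketch of the prediction head and the regularized cross-entropy objective. The weight shapes, the squared-L2 form of the penalty, and the toy data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D_feat, n_classes = 4, 6   # feature dim and number of emotions (illustrative)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Trainable parameters of the two-layer head (shapes are assumptions).
w1, b1 = rng.normal(size=(D_feat, D_feat)), np.zeros(D_feat)
w2, b2 = rng.normal(size=(D_feat, n_classes)), np.zeros(n_classes)

def predict(h_i):
    """Map the final utterance feature h_i to (p_i, y_hat)."""
    hidden = np.maximum(0.0, h_i @ w1 + b1)   # ReLU
    p_i = softmax(hidden @ w2 + b2)           # emotion distribution p_i
    return p_i, int(np.argmax(p_i))           # predicted emotion y_hat

def loss(p, y_true, params, lam=1e-4):
    """Cross-entropy averaged over all utterances of all conversations,
    plus an L2 penalty (squared form used here for simplicity)."""
    total = sum(len(conv) for conv in y_true)
    ce = -sum(np.log(p[i][j][y])
              for i, conv in enumerate(y_true) for j, y in enumerate(conv))
    l2 = lam * sum(np.sum(w ** 2) for w in params)
    return ce / total + l2

p_i, y_hat = predict(rng.normal(size=(D_feat,)))
```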

Experimental Environment
The experimental environment was based on the Ubuntu 20.04 operating system, equipped with an Intel i7-11800H CPU, an NVIDIA GeForce RTX 3060 GPU, and 12 GB of memory. The development environment utilized the TensorFlow deep learning framework, Python 3.8, and CUDA 14.1. A cross-entropy criterion was employed as the objective function for model training, and the Adam optimization algorithm was used to update model parameters.

Datasets
In this study, extensive experiments were conducted on two widely used public datasets, MELD and IEMOCAP. Both datasets are multimodal, containing acoustic, text, and visual modalities. The statistical summary of these two datasets is presented in Table 2. The IEMOCAP dataset consists of recordings of ten actors engaged in dyadic interactions, organized into five sessions, each involving one male and one female participant. This dataset provides three modalities: acoustic, text, and visual, with 7433 (5810 + 1623) text utterances and approximately 12 h of audio and video. Each utterance is labeled with one of six emotion categories: happy, sad, neutral, angry, excited, and frustrated.
The MELD dataset is derived from the EmotionLines dataset and features multiple speakers in conversation. It consists of 1039, 114, and 280 conversations for training, validation, and testing, respectively, all from the TV series "Friends". Each utterance is labeled with one of the following emotion categories: anger, disgust, sadness, joy, surprise, fear, and neutral.

Baseline Methods
To validate the effectiveness of the model, it was compared against several baseline methods: (1) bc-LSTM [30] is a method proposed to capture contextual features from surrounding utterances using a bidirectional LSTM; however, it does not take into account the interdependence between speakers. (2) ICON [31] utilizes two separate GRUs to model the contextual information of utterances from the two speakers in the dialogue history. The current utterance serves as the query input to two distinct speaker memory networks, generating utterance representations. Another GRU connects the outputs of the individual speaker GRUs in the CMN, explicitly modeling the interplay between speakers.
speakers and contextual information. However, GCN adopts a pairwise interaction approach, which overlooks higher-order information in the conversation. MER-HGraph, on the other hand, leverages HGNN to aggregate information from each vertex to its corresponding hyperedge, obtaining hyperedge representations. It then aggregates hyperedge information back to vertices, yielding high-order information within the conversation. Additionally, MER-HGraph employs a dynamic time window to reduce the impact of noise and fixed time scales in acoustic signals. Furthermore, on the MELD dataset, MER-HGraph achieves accuracy and weighted-average F1 improvements of 1.46% and 0.76%, respectively, compared to the best-performing model. Overall, the proposed MER-HGraph model effectively enhances the capability of multimodal conversation sentiment analysis by modeling contextual dependencies and cross-modal interactions using HGNN, and by introducing a dynamic time window to mitigate the impact of acoustic signal noise, as demonstrated on both the IEMOCAP and MELD datasets.

Impact of Dynamic Temporary Window Block
Considering the influence of temporal information localization on sentiment analysis performance, experiments were conducted using both fixed and dynamic time windows, as illustrated in Figure 4. Here, the horizontal axis represents the chronological order, and the vertical axis represents the importance scores over time. The experimental results indicate that the dynamic temporary window module, by partitioning the signal into strong and weak emotion windows of different lengths locally and assessing the interaction information between emotions globally, achieves better performance. For instance, in areas of interest such as laughter and positive semantics ("Cool"), the dynamic window provides smoother signal processing compared to the fixed window. In areas that should not be focused on, such as noise and silence, the dynamic window almost disregards the signal information, while the fixed window shows significant fluctuations in information processing. This experiment validates that the dynamic time window effectively captures both local and global features in acoustic signals.


Impact of the Number of Context Nodes
When capturing intra-modal dependencies, the current node needs to connect to P past and F future contextual nodes. The influence of the number of contextual nodes (P, F) on MER-HGraph is considered. In the IEMOCAP dataset, it is set to (4, 4), (8, 8), ..., (40, 40), and in the MELD dataset, it is set to (1, 1), (2, 2), ..., (10, 10). The impact of different numbers of contextual nodes on the accuracy scores and weighted average F1 scores of MER-HGraph is shown in Figure 5.
In Figure 5a, it can be observed that the performance of MER-HGraph on the IEMOCAP dataset increases as P or F increases. When (P, F) reaches the threshold of (16, 16), the accuracy score and F1 score of the MER-HGraph model achieve the best performance. In Figure 5b, it can be observed that on the MELD dataset, the accuracy score and F1 score of the MER-HGraph model reach the best performance when (P, F) reaches the threshold of (5, 5). A possible explanation is that the IEMOCAP dataset requires longer contextual modeling, while in the MELD dataset some adjacent utterances are not necessarily adjacent in actual scenarios, so longer contextual modeling is unnecessary.
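For illustration, one plausible way to realize the (P, F) context setting as intra-modal hyperedges is sketched below; the function and construction are our own reading of the description, not code from the paper:

```python
def context_hyperedges(num_nodes, past, future):
    """For each utterance node i in a conversation, build one hyperedge
    containing i, its `past` preceding nodes, and its `future` following
    nodes (clipped at the conversation boundaries)."""
    edges = []
    for i in range(num_nodes):
        lo = max(0, i - past)
        hi = min(num_nodes, i + future + 1)
        edges.append(tuple(range(lo, hi)))
    return edges

# With (P, F) = (2, 2) on a 6-utterance conversation:
print(context_hyperedges(6, 2, 2))
```

Each hyperedge groups up to P + F + 1 utterances at once, which is exactly the kind of higher-order (non-pairwise) connection a plain graph edge cannot express.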

Ablation Study
The ablation experiments were conducted to further validate the roles and importance of the different parts of the model. The results are shown in Table 4, where "-" indicates that the corresponding part was removed and "+" indicates that it was used. In the first case, the HGNN network was replaced with a GCN network. In the second case, the DTWB was replaced with a fully connected network. The third case is the proposed MER-HGraph model. Based on the results in Table 4, we observe that HGNN outperforms GCN. This is because GCN acquires intra-modal information and cross-modal interactions only through pairwise connections, whereas HGNN constructs hypergraphs in which a single hyperedge can connect multiple utterance nodes and can simultaneously link the acoustic, text, and visual modalities. As a result, HGNN captures higher-order information in the data, reducing information loss and thereby improving performance on the emotion analysis task. Additionally, the DTWB module is designed to handle audio signals, reducing the impact of noise and fixed time scales; this allows temporal structure in speech signals to be captured more effectively, further enhancing emotion classification performance.
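To make the HGNN-vs-GCN distinction concrete, the standard spectral hypergraph convolution (Feng et al.'s HGNN formulation, with unit hyperedge weights) can be written in a few lines of NumPy. The incidence matrix H below is a made-up three-node example, not data from the paper:

```python
import numpy as np

def hgnn_layer(X, H, Theta):
    """One HGNN convolution: X' = Dv^{-1/2} H De^{-1} H^T Dv^{-1/2} X Theta,
    where H is the node-by-hyperedge incidence matrix (unit edge weights)."""
    n, m = H.shape
    dv = H @ np.ones(m)              # node degrees
    de = H.T @ np.ones(n)            # hyperedge degrees
    Dv = np.diag(1.0 / np.sqrt(dv))
    De = np.diag(1.0 / de)
    return Dv @ H @ De @ H.T @ Dv @ X @ Theta

# Three utterance nodes; hyperedge 0 links nodes {0, 1}, hyperedge 1 links {1, 2}.
H = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
X = np.eye(3)                        # toy node features
out = hgnn_layer(X, H, np.eye(3))
```

Because aggregation goes node → hyperedge → node, a single column of H can pool an arbitrary set of nodes (or all three modalities of an utterance) in one step, which is the higher-order mixing a pairwise GCN adjacency matrix cannot represent directly.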

Conclusions
For the task of multimodal emotion analysis in conversation, we propose a method based on a hypergraph neural network. Unlike previous studies, we introduce an HGNN to construct Intra-HGraph and Inter-HGraph within a conversation to capture dependencies between utterances and interactions between different modalities. Additionally, to address the issues of noise and fixed time scales in speech signals, we design a dynamic time window to extract local and global information from the audio signals. Through this approach, MER-HGraph acquires richer feature information, thereby improving the effectiveness of emotion analysis. The proposed model is evaluated on the IEMOCAP and MELD datasets and compared with other baseline models, demonstrating that MER-HGraph outperforms them. Given the wide application of hypergraph neural networks in other research fields, this study introduces hypergraphs into the task of multimodal emotion analysis, and future improvements to HGNNs could further enhance the performance of multimodal conversational emotion analysis.

Figure 1. Multimodal emotion recognition in conversation based on hypergraphs.
Figure 2. Dynamic time window block. First, the local dynamic window processes the acoustic signal u_i^A through a Transformer model to obtain u_i^{A1} and computes scores sc_i^{A1}, which give the start and end indices of the k-th window's rows and columns; then, the global dynamic window module re-evaluates the importance between windows by taking u_i^{A2} as input and calculating scores sc_i^{A2}.

4.3. Experimental Result and Analysis
4.3.1. Evaluation Metrics
To evaluate the performance of the model, accuracy and the weighted average F1 score are used as evaluation metrics. Accuracy measures the correctness of the model's predictions, while the weighted average F1 score considers both precision and recall. They are calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 * Precision * Recall / (Precision + Recall),

where TP represents True Positives, TN represents True Negatives, FP represents False Positives, and FN represents False Negatives.
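These metrics can be computed directly from the per-class confusion counts; below is a minimal pure-Python version (function and variable names are ours), where the weighted average F1 weights each class's F1 by its support:

```python
from collections import Counter

def accuracy(y_true, y_pred):
    # Accuracy = correct predictions / all predictions
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def weighted_f1(y_true, y_pred):
    # Per-class F1 from TP/FP/FN, averaged with weights proportional to class support.
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c, n_c in support.items():
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (n_c / total) * f1
    return score
```

The same quantities are available in scikit-learn as `accuracy_score` and `f1_score(..., average="weighted")`.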
Figure 4. Impact of dynamic windows and fixed windows: (a) dynamic time window; (b) fixed time window. The horizontal axis represents chronological order and the vertical axis represents the importance score.

Figure 5. Impact of the number of context nodes.

Table 2. Data distribution of IEMOCAP and MELD.
[lau] represents laughter, [y] denotes affirmative tone, [noise] indicates noise, and [s] denotes silence. The yellow region represents the area of interest, while the blue region indicates the area that should not be considered.

Table 4. Ablation study for the main components in MER-HGraph on the IEMOCAP and MELD datasets.