Improvement of Multimodal Emotion Recognition Based on Temporal-Aware Bi-Direction Multi-Scale Network and Multi-Head Attention Mechanisms

Abstract: Emotion recognition is a crucial research area in natural language processing (NLP), aiming to identify emotional states such as happiness, anger, and sadness from various sources like speech, text, and facial expressions. In this paper, we propose an improved multimodal emotion recognition (MMER) method using TIM-Net (Temporal-Aware Bi-Direction Multi-Scale Network) and attention mechanisms. First, we introduce the methods for extracting and fusing the multimodal features. Then, we present TIM-Net and the attention mechanisms used to enhance the MMER algorithm. We evaluate our approach on the IEMOCAP and MELD datasets, where it outperforms existing methods: the weighted average recall (WAR) is 83.9% on IEMOCAP and 62.7% on MELD. Finally, ablation experiments further investigate the impact of the TIM-Net model and the attention mechanism on emotion recognition performance.


Introduction
Emotion is an important way for humans to express their inner world, and its complexity and diversity play a crucial role in human communication and interaction [1]. The American psychologist Ekman [2] proposed six basic emotions based on research needs. In 1997, the concept of affective computing was introduced [3], and emotion recognition is an important direction in affective computing [4]. In recent years, significant progress has been made in unimodal approaches using text, audio, and video for emotion recognition. Emotion recognition has found applications in various fields, such as traffic safety [5], intelligent interaction [6,7], and healthcare [8][9][10].
Emotion recognition in speech data is known as speech emotion recognition (SER). SER extracts feature information from speech data and uses different models to establish a mapping between these speech features and emotional states [11]. The selection and design of speech emotional features are crucial steps in SER, which directly affect its performance. In 2021, Wang et al. [12] achieved excellent results by using a shared-weight multimodal Transformer [13] to capture the dependencies between modalities. More recently, the success of pre-trained models in speech recognition tasks has garnered increasing attention in speech emotion recognition research. In 2022, Zou et al. [14] utilized the Wav2vec [15] pre-trained model to extract deep features from speech and combined them with traditional acoustic features for emotion recognition. Their ablation experiment showed that adding deep features achieved better performance.
Research on the video modality primarily focuses on facial emotion recognition (FER). AffectNet [16] is a widely recognized dataset for video-based emotion recognition. Bakariya et al. [17] developed a real-time system capable of detecting faces, assessing human emotions, and recommending music to users. Meena et al. [18] proposed a CNN-based facial image emotion analysis model. The study of emotion in textual data is called emotion analysis (EA). With the rapid increase in text content created by users on the Internet, such as product reviews, social media posts, and blogs, we have access to abundant public opinion information. EA has also been proven to be beneficial in critical events such as the COVID-19 pandemic [19,20].
It is difficult to obtain accurate emotional information through a single modality alone [21,22]. With the continuous development and application of artificial intelligence technology, against a background of multimodal information fusion, emotion recognition technology can comprehensively analyze the information from different modalities, such as text, speech, and facial expressions, to obtain more complementary information [23], thereby improving the accuracy of emotion recognition [24][25][26]. Multimodal sentiment analysis has a wide range of applications in human-computer interaction, business, and education and holds great social value [27]. The core challenge of multimodal sentiment analysis is representation fusion, which aims to learn representations reflecting the cross-modal interactions between the "individual elements" of different modalities, effectively reducing the number of individual representations [28]. Recent works [24,29,30] have achieved success in the problem of multimodal sentiment analysis. The essence of multimodal sequences is time series, and there are local interactions between modalities at the same time node [31]. For example, the meaning expressed by a word at a particular moment is related to the pronunciation of the word and also the accompanying facial expression. Combining multimodal emotional features with artificial intelligence is an important research direction in the field of emotion recognition. The cartoon animal characters developed by Bates [32] can express set emotions and provide us with a new emotional communication experience. The virtual tiger (Tigrito) developed by Hayes-Roth et al. [33], which can predict and generate emotions, further demonstrates the broad application prospects of multimodal sentiment analysis.
In the process of delving deeper into the field of multimodal emotion recognition, we have found that integrating information from different modalities often leads to more precise and comprehensive analysis results. Especially in the current flourishing landscape of deep learning technologies, some advanced model architectures provide powerful tools for processing both speech and text data. Based on this, we propose a multimodal MMER-TAB model focusing on two modalities: speech and text. Building upon previous research, we have made improvements to the existing MMER model. While MMER has achieved significant results in some respects, it also has limitations that motivate our work. In particular, the MMER model faces challenges in handling long-term dependencies and global context. To address this issue, we introduce a new structure consisting of three layers of TAB modules combined with attention mechanisms to enhance the modeling capability of the model. These improvements enable our model to better capture the long-term dependencies and global contextual information in speech signals, thereby improving its performance and robustness. The main contributions of our study are as follows:
• We particularly focus on two Transformer-based feature extraction methods, namely Wav2vec and BERT (Bidirectional Encoder Representations from Transformers) [34]. Additionally, we investigate the TIM-Net [35] model and make corresponding improvements to better adapt it to our application scenario in multi-modal emotion recognition tasks.
• We introduce a multi-head attention mechanism, which captures significant emotional features from both speech and text, thereby enhancing the model's sensitivity and capability to capture emotional information. This innovation gives our model a competitive edge in emotion recognition.
• To validate the effectiveness of our proposed model, we conduct extensive experiments on the IEMOCAP [36] and MELD [37] datasets. The experimental results demonstrate that our MMER-TAB (Multi-Modal Emotion Recognition-Temporal-Aware Block) model exhibits outstanding performance in multi-modal conversation emotion analysis tasks. This not only confirms the effectiveness of the feature extraction methods and model improvements we have adopted but also provides new insights into and methods for research in the field of multi-modal emotion recognition.
The rest of the paper is structured as follows: Section 2 introduces the feature extraction methods for the speech and text modalities, respectively. Section 3 presents the TIM-Net network architecture and attention mechanism. Section 4 elaborates on the proposed model's structure. Section 5 discusses and analyzes the experimental results. Finally, Section 6 provides a summary of the paper.

Wav2vec Speech Features
Early approaches to speech features involved manually crafted Low-Level Descriptor (LLD) features, such as prosodic features, acoustic features, and spectral features. Mel-Frequency Cepstral Coefficients (MFCCs) are classic audio features known for their simplicity and efficiency. However, MFCCs only consider frequency information and overlook the temporal correlation in audio data. Additionally, MFCCs often require manual parameter settings, such as for the window size and stride. With the advancement of deep learning, researchers have turned to deep learning methods for extracting speech emotion features. Common deep learning methods for speech feature extraction include Convolutional Neural Networks (CNNs) [38,39], Recurrent Neural Networks (RNNs) [40], Bidirectional Long Short-Term Memory (BiLSTM) [41][42][43], etc. The Wav2vec2.0 version used in this paper is an end-to-end training approach that can learn representative feature descriptions directly from audio data through self-supervised learning, eliminating the need for manual parameter tuning.
For speech emotion recognition tasks, the Wav2vec2.0 method effectively captures the emotional information in audio. Furthermore, Wav2vec2.0 can enhance the model performance through pre-training; for example, Wav2vec2-base-960h is pre-trained on 960 h of audio data. This pre-training allows the model to capture more audio patterns and structures, resulting in improved robustness in speech emotion recognition tasks. In practical applications, fine-tuning can be performed based on specific tasks and data to further enhance the model performance. It is important to note that Wav2vec2-base-960h is a relatively complex model, requiring significant computational resources and longer training times.
The Wav2vec2.0 process mainly involves taking an input speech signal X and encoding it using a seven-layer CNN to obtain the latent variable Z. The latent variable Z is quantized into the quantized variable Q through the Gumbel softmax quantization module. Simultaneously, Z is randomly masked at some positions and put into the Transformer [28] network to obtain the contextual feature vector C. The structure and overall process of Wav2vec2.0 are illustrated in Figure 1.
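The quantization step from Z to Q can be illustrated with a minimal NumPy sketch of Gumbel softmax sampling. This is an illustration under simplifying assumptions, not the Wav2vec2.0 implementation: it uses a single codebook and a forward pass only, and the names `gumbel_softmax`, `quantize`, `proj`, and `codebook` are hypothetical.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Sample a soft one-hot vector for each row of `logits`."""
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / tau
    y = y - y.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

def quantize(z, proj, codebook, tau=1.0, rng=None):
    """Map latent frames z (T, d_z) to quantized vectors q (T, d_q).

    proj:     (d_z, n_codes) hypothetical projection to codebook logits
    codebook: (n_codes, d_q) hypothetical codebook entries
    """
    logits = z @ proj                           # (T, n_codes)
    probs = gumbel_softmax(logits, tau, rng)    # (T, n_codes), rows sum to 1
    # Hard selection of the most probable code for each frame
    hard = np.eye(codebook.shape[0])[probs.argmax(axis=-1)]
    return hard @ codebook, probs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=(5, 4))                 # 5 latent frames of dimension 4
    q, probs = quantize(z, rng.normal(size=(4, 8)), rng.normal(size=(8, 3)), rng=rng)
    print(q.shape, probs.shape)
```

A lower temperature `tau` makes the sampled distribution closer to a one-hot choice, while the added Gumbel noise keeps the selection stochastic during training.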

BERT Text Features
The text information corresponding to speech is the most fundamental and intuitive carrier of emotion, often used to infer the emotional state of the speaker [44]. The process of emotion state recognition based on text is illustrated in Figure 2.
Text emotion recognition first requires collecting text data from various sources, such as social media, comments, news articles, etc. The collected text data are then subjected to preprocessing operations, including cleaning, tokenization, the removal of stop words, stemming, and punctuation handling. Text feature extraction involves converting the text into numerical features suitable for machine learning algorithms. Common feature extraction methods include the bag-of-words model, TF-IDF (Term Frequency-Inverse Document Frequency) [45] features, word embeddings [46], syntactic features, etc.
Word embedding is currently the most commonly used method in text feature extraction, aiming to map words from the text data to a continuous vector space. This mapping process is achieved by capturing the contextual relationships between words. Common word embedding techniques include Word2Vec [47], GloVe (Global Vectors for Word Representation) [48], ELMo (Embeddings from Language Models) [49], etc. However, it is important to note that using a single vector to represent a word in different contexts may lead to some semantic understanding errors.
Considering these factors, this paper adopts the language representation model BERT for feature extraction. BERT is a pre-trained language model based on the Transformer architecture, utilizing a bidirectional training approach during pre-training, allowing it to consider both the left and right context information simultaneously. BERT's strength lies in its powerful context modeling and multi-task pre-training, enabling the model to learn richer, more universal semantic representations, resulting in outstanding performance across various natural language processing tasks. BERT, proposed by Google in 2018 as an alternative to Word2Vec, is essentially composed of stacked Transformer encoders. It follows a two-phase framework comprising pre-training and fine-tuning on specific tasks. BERT's innovation lies in using two pre-training tasks: a Masked Language Model (MLM), which predicts masked words in a sequence, and Next Sentence Prediction (NSP), which predicts whether the next sentence is related to the current one. This addresses the issue of different semantic expressions for the same word in different contexts. The structure of the BERT model is illustrated in Figure 3.
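To make the MLM objective concrete, the following is a minimal sketch of how training pairs could be built by masking tokens. It is a simplification for illustration: the function name `mask_tokens` and the default `mask_prob` are assumptions, and the refinements used in actual BERT pre-training (such as occasionally substituting random tokens) are omitted.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Return (masked_tokens, labels).

    labels[i] holds the original token where position i was masked,
    and None where the token was left unchanged.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)   # hide the token from the model
            labels.append(tok)          # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)         # no prediction target here
    return masked, labels

if __name__ == "__main__":
    sentence = "i am so happy to see you".split()
    masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
    print(masked)
    print(labels)
```

Because the model sees the unmasked tokens on both sides of each masked position, the prediction task encourages representations that depend on the left and right context together.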

Multimodal Feature Fusion
Multimodal feature fusion [50] enables the acquisition and interpretation of information from different dimensions, providing a more comprehensive and accurate understanding. The two most commonly used modalities in speech emotion recognition are audio and text data. Audio data contain information such as speech rate, intonation, and volume, which can represent the speaker's emotions but may struggle to convey contextual semantic information. Text data can capture rich semantic information but may suffer from ambiguity and are significantly influenced by the text recognition accuracy. This paper proposes an algorithm that integrates the strengths and mitigates the weaknesses of both audio and text data, achieving complementary multimodal feature information.
The focus of multimodal feature fusion lies in the fusion stage and the fusion method, which will be discussed separately below.

Classification Based on the Fusion Stage
The fusion stage can be categorized into three types, as illustrated in Figure 4: feature-level fusion, model-level fusion, and decision-level fusion.
Feature-level fusion, or early fusion, involves extracting different modality features and concatenating them to form an overall multimodal feature representation. Since various modality information often exhibits high correlation, extracting this correlation after feature-level fusion can be challenging. Therefore, this method may not fully capture the correlation between different modalities, and in the temporal dimension, simple feature fusion may not achieve cross-temporal fusion of multimodal data. As the number of modalities increases, concatenating feature vectors may lead to high-dimensional features, difficulty in training models, and information redundancy.
Model-level fusion involves merging two features at an intermediate stage in the model. Afterward, independent models continue to extract the features, and finally, both types of features are combined before the final classification. Taking the example of Multi-Layer LSTM (ML-LSTM), this approach combines multiple layers of neural networks with a traditional LSTM (Long Short-Term Memory) model. The fusion process is as follows: The text features are put into the first LSTM layer (Layer 1), producing hidden layer states for each neuron. Subsequently, the audio features are concatenated with the hidden layer states from Layer 1 and put into the second LSTM layer (Layer 2), generating hidden layer states for each neuron in the second layer. The visual features are then concatenated with the hidden layer states from Layer 2 and put into the third LSTM layer (Layer 3), producing hidden layer states for each neuron in the third layer. Finally, the fused features are input into the fully connected layer (FC) to obtain the prediction result.
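The ML-LSTM fusion order described above (text, then audio, then visual) can be sketched in NumPy as follows. This is a forward-pass illustration only, with randomly initialised weights; it assumes the three modality sequences are already aligned to the same length, and the names `LSTMLayer`, `ml_lstm_fusion`, and `hidden_dim` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMLayer:
    """A single LSTM layer with random weights (forward pass only)."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One weight matrix covering the input, forget, cell, and output gates
        self.W = rng.normal(scale=0.1, size=(input_dim + hidden_dim, 4 * hidden_dim))
        self.b = np.zeros(4 * hidden_dim)

    def forward(self, x):
        """x: (T, input_dim) -> hidden states of shape (T, hidden_dim)."""
        h = np.zeros(self.hidden_dim)
        c = np.zeros(self.hidden_dim)
        outputs = []
        for x_t in x:
            gates = np.concatenate([x_t, h]) @ self.W + self.b
            i, f, g, o = np.split(gates, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
            outputs.append(h)
        return np.stack(outputs)

def ml_lstm_fusion(text, audio, visual, hidden_dim=16, seed=0):
    """Fuse three aligned sequences, each of shape (T, feature_dim)."""
    rng = np.random.default_rng(seed)
    layer1 = LSTMLayer(text.shape[1], hidden_dim, rng)
    layer2 = LSTMLayer(audio.shape[1] + hidden_dim, hidden_dim, rng)
    layer3 = LSTMLayer(visual.shape[1] + hidden_dim, hidden_dim, rng)
    h1 = layer1.forward(text)                                  # Layer 1: text only
    h2 = layer2.forward(np.concatenate([audio, h1], axis=1))   # Layer 2: audio + h1
    h3 = layer3.forward(np.concatenate([visual, h2], axis=1))  # Layer 3: visual + h2
    return h3[-1]  # last hidden state, which would be passed to the FC layer

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    fused = ml_lstm_fusion(rng.normal(size=(6, 8)),
                           rng.normal(size=(6, 5)),
                           rng.normal(size=(6, 3)))
    print(fused.shape)
```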
Decision-level fusion, also known as late fusion, primarily involves using different network architectures for feature extraction and textual and audio information fusion. Decision-level fusion models each modality separately, treating different modalities as mutually independent. The features are extracted for each modality, and the emotion recognition results for individual modalities are obtained through emotion classifiers. Subsequently, a decision method is applied to combine the results of each modality, ultimately yielding the final emotion classification result. Designing the decision rules for decision fusion is a challenging task. If the decision rules are too simple, they may not accurately reflect the correlation between different modalities.
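One simple decision rule is a weighted average of the class probabilities produced by the per-modality classifiers. The sketch below illustrates this rule only as an example; the function name `late_fusion` and the weight `w_audio` are hypothetical, and other rules (voting, learned combiners) are equally possible.

```python
import numpy as np

def late_fusion(prob_audio, prob_text, w_audio=0.5):
    """Weighted average of per-modality class probabilities.

    Returns the fused probability vector and the index of the predicted class.
    """
    prob_audio = np.asarray(prob_audio, dtype=float)
    prob_text = np.asarray(prob_text, dtype=float)
    fused = w_audio * prob_audio + (1.0 - w_audio) * prob_text
    return fused, int(np.argmax(fused))

if __name__ == "__main__":
    # The audio classifier favours class 0, the text classifier favours class 1
    fused, label = late_fusion([0.5, 0.3, 0.2], [0.2, 0.7, 0.1], w_audio=0.5)
    print(fused, label)  # fused = [0.35, 0.5, 0.15], label = 1
```

A fixed weight such as this cannot express how the modalities interact on a given utterance, which is the limitation of overly simple decision rules noted above.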

Classification Using the Fusion Method
The simplest way to perform multimodal feature fusion is concatenation, such as concatenating using CONCAT or stacking operations. Another approach is to employ attention mechanisms. If a single layer of attention is insufficient, multiple attention operations can be applied. For example, attention operations can be performed from text to audio and vice versa. For instance, the query matrix $W^Q_S$ from speech is used together with the key matrix and value matrix from the text, $W^K_T$ and $W^V_T$, while the key matrix and value matrix from the speech, $W^K_S$ and $W^V_S$, are used together with the query matrix from the text, $W^Q_T$. This type of attention mechanism is also known as a cross-modal attention mechanism.
Based on the above analysis, the chosen fusion stage in this paper is model-level fusion, and the selected fusion method uses cross-modal attention mechanisms. A multimodal emotion recognition framework has been developed to fuse the features from both speech and text, and this fusion framework will be introduced in Section 4.

The TIM-Net Emotion Recognition Network Model
TIM-Net is capable of learning contextual representations from different temporal scales. The network structure is illustrated in Figure 5 [35]. Specifically, TIM-Net first utilizes a temporal-aware block to learn the temporal emotion representations. It then integrates supplementary information from both the past and the future to enrich the contextual representations. Finally, it fuses features from multiple temporal scales with the aim of better adapting to emotional changes. TIM-Net outperforms the other methods in terms of its accuracy on each corpus.
The superior generality and performance exhibited by TIM-Net can be attributed to its core module, called the temporal-aware block (TAB). This core module captures temporal-aware representations. Each TAB consists of three sub-blocks and a sigmoid function for learning the temporal attention map $A$. The temporal-aware feature $F$ is generated through element-wise multiplication of the input and $A$.
For the sub-blocks of the $j$-th TAB ($\mathrm{TAB}_j$), a dilated causal convolution (DC Conv) [51] with a dilation rate of $2^{j-1}$ is applied at the beginning of each sub-block. The dilated convolution enlarges and refines the receptive field, while the causal constraint ensures that future information does not leak into the past. Batch normalization, activation functions, and spatial dropout follow the convolution operation. In this paper, we modify the TAB structure, changing it from a 2-layer structure to a 3-layer one. To reduce the complexity of the model, we replace the dilated causal convolution with regular convolution and the spatial pool with a regular pool. The modified structure is illustrated in Figure 6.
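As an illustration of the TAB described above with its dilated causal convolution (not the simplified variant with regular convolution adopted in this paper), the following NumPy sketch computes the attention map $A$ and the temporal-aware feature $F$. Batch normalization and dropout are omitted, the weights are random, and the names `causal_dilated_conv1d`, `temporal_aware_block`, and `kernels` are hypothetical.

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation):
    """Dilated causal 1D convolution.

    x:      (T, C_in) input sequence
    kernel: (K, C_in, C_out) weights
    The output at step t depends only on x[t], x[t - d], x[t - 2d], ...
    """
    T = x.shape[0]
    K, _, c_out = kernel.shape
    pad = (K - 1) * dilation
    # Pad on the left only, so no future frame is ever visible
    x_pad = np.concatenate([np.zeros((pad, x.shape[1])), x], axis=0)
    out = np.zeros((T, c_out))
    for k in range(K):
        start = pad - k * dilation      # tap k looks back k * dilation steps
        out += x_pad[start:start + T] @ kernel[k]
    return out

def temporal_aware_block(x, kernels, j):
    """TAB_j: stacked sub-blocks, a sigmoid attention map A, and F = x * A.

    Each kernel must keep the channel count (C_in == C_out) so that A
    has the same shape as the input.
    """
    dilation = 2 ** (j - 1)
    h = x
    for kernel in kernels:
        h = np.maximum(causal_dilated_conv1d(h, kernel, dilation), 0.0)  # ReLU
    A = 1.0 / (1.0 + np.exp(-h))    # temporal attention map
    return x * A                     # temporal-aware feature F

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(10, 4))
    kernels = [rng.normal(scale=0.5, size=(2, 4, 4)) for _ in range(3)]
    F = temporal_aware_block(x, kernels, j=3)   # dilation rate 2^(3-1) = 4
    print(F.shape)
```

Doubling the dilation rate with each successive TAB lets deeper blocks cover longer time spans without increasing the kernel size.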

Attention Mechanism
Attention mechanisms enable neural networks to automatically learn and selectively focus on important information in the input. Multi-head attention is one implementation of attention mechanisms, achieved by running the attention mechanism in parallel multiple times, concatenating the independently computed attention outputs, and linearly transforming them into the desired dimensions. Under the multi-head attention mechanism, the input sequence data are projected differently for each of multiple heads, each head independently computes its own output, and these outputs are then concatenated to form the final output. The output for each head can be expressed as follows:

$$\mathrm{head}_i = \mathrm{Attention}\left(Q W_i^Q,\; K W_i^K,\; V W_i^V\right)$$

where $W_i^Q$, $W_i^K$, $W_i^V$ are the query, key, and value transformation matrices for the $i$-th head. In summary, the multi-head attention mechanism is an effective implementation of attention that can enhance the model's performance and generalization ability.
We use both the multi-head attention mechanism and the cross-modal attention mechanism.These two attention mechanisms have different positions and functions in the model.
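The per-head computation and concatenation can be sketched in NumPy as follows. This is a minimal self-attention illustration with random weights; the scaled dot-product form with a $1/\sqrt{d_{head}}$ factor is assumed, and the names `multi_head_attention`, `W_q`, `W_k`, `W_v`, and `W_o` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o):
    """Multi-head self-attention over one sequence.

    x:             (T, d_model)
    W_q, W_k, W_v: (n_heads, d_model, d_head), one projection per head
    W_o:           (n_heads * d_head, d_model) output projection
    """
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        Q, K, V = x @ Wq_i, x @ Wk_i, x @ Wv_i
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # (T, T) scaled dot products
        heads.append(softmax(scores) @ V)          # (T, d_head) output of head i
    # Concatenate all heads, then map back to the model dimension
    return np.concatenate(heads, axis=-1) @ W_o

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_heads, d_model, d_head, T = 8, 64, 8, 5     # 8 heads; small sizes for brevity
    x = rng.normal(size=(T, d_model))
    W_q, W_k, W_v = (rng.normal(scale=0.1, size=(n_heads, d_model, d_head))
                     for _ in range(3))
    W_o = rng.normal(scale=0.1, size=(n_heads * d_head, d_model))
    print(multi_head_attention(x, W_q, W_k, W_v, W_o).shape)
```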

Model Design and Interpretation
Considering the two modal features of text and audio in speech emotion recognition, we improve the TAB design based on the TIM-Net network structure and combine it with multi-head attention to construct the network framework shown in Figure 7. The features extracted by Wav2vec2.0 and RoBERTa have a dimensionality of 768. The multi-head attention mechanism employs 8 heads, with 2 layers in the encoder. By increasing the number of heads, the model can capture more contextual information. The TAB consists of 3 layers, with ReLU used as the activation function within the TAB. The input dimensionality received by the fully connected layer is 768 × 2 (concatenation of audio and text), with the GELU activation function and the AdamW optimizer used and a learning rate of $5 \times 10^{-5}$. To prevent overfitting, the dropout is set to 0.1.
The first part is the feature extraction module. This section primarily focuses on extracting features from the input data, which include two modalities: audio and text. The process for extracting the features from audio modality data is as follows: first, encode the audio using Wav2vec2.0, and then introduce a multi-head self-attention mechanism to learn more discriminative speech emotion features. The feature extraction process for text modality data involves using the BERT model, followed by introducing a multi-head self-attention mechanism to focus on significant emotional features within the text sequence.
The second part is the Cross-Modal Encoder (CME) attention module. This section primarily models the cross-modal interactions of the multi-modal features, utilizing a cross-modal attention mechanism to jointly optimize the feature embeddings for audio and text. The cross-modal attention mechanism achieves this by learning two sets of semantic interaction weights separately and readjusting the feature representations of audio and text. This enables capturing interactive information between the audio and text modalities, achieving semantic consistency in the multi-modal context.
The third part is the emotion classification module. In this section, the multi-modal fusion features of audio and text are first concatenated. Then, the TAB module is employed to learn the temporal dimension features with context dependencies. These features are then put into an FC layer, and a softmax classifier is used to obtain a probability matrix. The class with the maximum probability is taken as the final emotion recognition result.
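The final steps of the classification module (concatenation, FC layer, softmax, and selecting the most probable class) can be sketched as below. This is an illustration under assumptions: the TAB step between concatenation and the FC layer is omitted, the weights are random, and the hidden size, the number of classes, and the names `classify`, `W1`, `b1`, `W2`, and `b2` are hypothetical.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def softmax(x):
    e = np.exp(x - np.max(x))    # subtract the max for numerical stability
    return e / e.sum()

def classify(audio_feat, text_feat, W1, b1, W2, b2):
    """Concatenate both modality features and predict an emotion class.

    audio_feat, text_feat: (768,) pooled feature vectors
    Returns the class probabilities and the index of the predicted class.
    """
    fused = np.concatenate([audio_feat, text_feat])   # (768 * 2,)
    hidden = gelu(fused @ W1 + b1)                    # FC layer with GELU
    probs = softmax(hidden @ W2 + b2)                 # class probabilities
    return probs, int(np.argmax(probs))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_hidden, n_classes = 32, 4                       # hypothetical sizes
    W1 = rng.normal(scale=0.01, size=(768 * 2, n_hidden))
    W2 = rng.normal(scale=0.01, size=(n_hidden, n_classes))
    probs, label = classify(rng.normal(size=768), rng.normal(size=768),
                            W1, np.zeros(n_hidden), W2, np.zeros(n_classes))
    print(probs, label)
```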
The following sections will provide separate introductions to the cross-modal attention module and the emotion classification module of this model.

Cross-Modal Attention Module
This paper employs cross-modal attention to focus on the interaction between different modal data, learning the semantic interaction weights for the speech and text modalities and readjusting the feature representations.
Firstly, both speech and text representations are projected into the same space using a 1D-CNN, and the representations are as follows:

$$\hat{H}_{\{S,T\}} = \mathrm{Conv1D}\left(H_{\{S,T\}}, k_{\{S,T\}}\right)$$

where S represents the speech modality, T represents the text modality, $H_{\{S,T\}}$ represents the final emotional feature representations for speech and text obtained from the feature extraction module, $k_{\{S,T\}}$ represents the convolution kernel size for modality {S, T}, and d denotes the dimension of the projected features for speech and text. The speech and text embeddings mapped using the 1D-CNN are denoted as $\hat{H}_S$ and $\hat{H}_T$, respectively.
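As a rough illustration of this projection step, the NumPy sketch below applies a 1D convolution over the time axis to map a feature sequence onto a common dimension d. The function name `conv1d_project`, the zero "same" padding, the random weights, and the toy shapes are assumptions made for the example; the paper does not specify these details.

```python
import numpy as np

def conv1d_project(h, weight, bias):
    """Project a feature sequence with a 1D convolution over time.

    h:      (length, d_in) feature sequence of one modality
    weight: (k, d_in, d_out) convolution kernel, k = kernel size
    bias:   (d_out,)
    Zero "same" padding is assumed so the sequence length is preserved.
    """
    k = weight.shape[0]
    pad_left = (k - 1) // 2
    pad_right = k - 1 - pad_left
    padded = np.pad(h, ((pad_left, pad_right), (0, 0)))
    # Slide a window of k frames over time and contract it with the kernel.
    out = np.stack([
        np.einsum("kd,kdo->o", padded[t:t + k], weight)
        for t in range(h.shape[0])
    ])
    return out + bias

rng = np.random.default_rng(0)
d = 16                                  # shared projected dimension (toy value)
h_speech = rng.normal(size=(50, 24))    # 50 speech frames, 24-dim features
h_text = rng.normal(size=(12, 32))      # 12 text tokens, 32-dim features

w_s, b_s = rng.normal(size=(3, 24, d)), np.zeros(d)   # kernel size k_S = 3
w_t, b_t = rng.normal(size=(1, 32, d)), np.zeros(d)   # kernel size k_T = 1

proj_s = conv1d_project(h_speech, w_s, b_s)
proj_t = conv1d_project(h_text, w_t, b_t)
print(proj_s.shape, proj_t.shape)
```

After this step both modalities share the same feature dimension, which is what allows the query of one modality to be compared with the keys of the other.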
We denote the process of transferring information from the speech modality to the text modality as S → T, and correspondingly, T → S is used to represent the information transfer from the text modality to the speech modality. To learn the relationship between speech and text, linear projection is first employed to transform each feature sequence into a query matrix Q, a key matrix K, and a value matrix V. Here, $Q_l, K_l, V_l \in \mathbb{R}^{d \times d}$ represent the query matrix Q, key matrix K, and value matrix V for the feature sequence of modality $l \in \{S, T\}$.
Next, the dot product operation is performed on the query matrix and the key matrix for both speech and text. The softmax function is then applied to scale and normalize the results row-wise to obtain the attention weights. Finally, the feature sequences are aggregated using the corresponding weights to obtain the interactive information transferred between the two modalities.

Cross-Modal Transfer S → T
The information from the speech modality is transferred to the text modality using a cross-modal attention mechanism with h heads. Unlike the original multi-head self-attention mechanism, where Q = K = V, in the cross-modal attention mechanism the query matrix is $Q_S$, and the key matrix and value matrix are $K_T$ and $V_T$, respectively. This mechanism facilitates the transfer of speech information to the text modality, enabling the text feature representations to be learned as guided by the speech information. The similarity is computed by taking the dot product of the query matrix $Q_S$ from the speech and the key matrix $K_T$ from the text. The Softmax function is applied to scale and normalize the results, which are then multiplied with the value matrix $V_T$ to obtain the attention output. The specific formula is as follows:

$$\mathrm{Att}_{S \to T} = \mathrm{Softmax}\left(\frac{Q_S K_T^{\top}}{\sqrt{d_k}}\right) V_T$$

where $d_k$ is the dimension of the keys in each head. Then, the results from the h heads are concatenated and mapped. The specific process is as follows:

$$M_{S \to T} = \mathrm{Concat}\left(\mathrm{Att}_{S \to T}^{(1)}, \ldots, \mathrm{Att}_{S \to T}^{(h)}\right) W^{O}$$

where $\mathrm{Att}_{S \to T}^{(i)}$ represents the i-th ($i \in [1, h]$) cross-modal attention output and $W^{O}$ is the output mapping matrix.
Finally, residual connection and layer normalization are applied to the single-modal speech features $\hat{H}_S$ mapped using the 1D-CNN and the interactive multi-modal features $M_{S \to T}$. The specific formula is as follows:

$$CM_{S \to T} = \mathrm{LayerNorm}\left(\hat{H}_S + M_{S \to T}\right)$$

where $CM_{S \to T}$ is the cross-modal output representation from the speech-to-text modality, containing not only complementary information from both modalities but also the original speech emotion features, effectively reducing the information loss.

Cross-Modal Transfer T → S
The process of transferring information from the text modality to the speech modality mirrors the process of transferring it from the speech modality to the text modality, with the roles of the two modalities exchanged. All the computation formulas are as follows:

$$\mathrm{Att}_{T \to S} = \mathrm{Softmax}\left(\frac{Q_T K_S^{\top}}{\sqrt{d_k}}\right) V_S$$

$$M_{T \to S} = \mathrm{Concat}\left(\mathrm{Att}_{T \to S}^{(1)}, \ldots, \mathrm{Att}_{T \to S}^{(h)}\right) W^{O}$$

$$CM_{T \to S} = \mathrm{LayerNorm}\left(\hat{H}_T + M_{T \to S}\right)$$
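The cross-modal transfer described in this section can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the random projection weights in `params`, the per-head scaling by the square root of the head dimension, and the layer normalization without learnable gain and bias are all choices made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stabilise before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise each time step over the feature axis (no learnable gain/bias).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def cross_modal_attention(h_query, h_context, params, num_heads):
    """Queries come from one modality; keys and values come from the other."""
    d = h_query.shape[-1]
    d_head = d // num_heads
    q = h_query @ params["w_q"]
    k = h_context @ params["w_k"]
    v = h_context @ params["w_v"]
    heads = []
    for i in range(num_heads):
        sl = slice(i * d_head, (i + 1) * d_head)
        scores = q[:, sl] @ k[:, sl].T / np.sqrt(d_head)
        weights = softmax(scores, axis=-1)     # row-wise attention weights
        heads.append(weights @ v[:, sl])
    m = np.concatenate(heads, axis=-1) @ params["w_o"]   # concatenate and map
    return layer_norm(h_query + m)             # residual connection + LayerNorm

rng = np.random.default_rng(0)
d, num_heads = 16, 4
h_s = rng.normal(size=(50, d))   # projected speech features (50 frames)
h_t = rng.normal(size=(12, d))   # projected text features (12 tokens)
params = {name: rng.normal(scale=d ** -0.5, size=(d, d))
          for name in ("w_q", "w_k", "w_v", "w_o")}

cm_s_to_t = cross_modal_attention(h_s, h_t, params, num_heads)  # Q from speech
cm_t_to_s = cross_modal_attention(h_t, h_s, params, num_heads)  # Q from text
print(cm_s_to_t.shape, cm_t_to_s.shape)
```

Because the query sequence sets the output length, each output keeps the length of the modality that supplies the queries, which is what makes the residual connection with that modality's own features well defined. In practice the two directions would use separate learned weights; they share `params` here only to keep the sketch short.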

The Emotion Classification Module
In the emotion classification module, the TAB sub-block is employed to focus on the multi-modal fused feature representation after the cross-modal information interaction. Initially, the two cross-modal fused features are concatenated to obtain the ultimate representation of the multi-modal fused emotion features, denoted as $E_{fusion}$ and expressed as follows:

$$E_{fusion} = \mathrm{Concat}\left(CM_{S \to T}, CM_{T \to S}\right)$$

Subsequently, the fused features are fed into the TAB3 sub-block, which is designed to capture the contextual relationships between the features and thereby obtain temporal-aware representations. The resulting multi-modal features are denoted as P. These multi-modal features P are then fed into a fully connected layer, where linear transformations are applied to learn the correlations between features and map them to the output space. A softmax classifier is utilized to obtain the multi-modal emotion recognition results based on both speech and text.
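The flow of the classification module (concatenate, temporal modelling, fully connected layer, softmax, maximum) can be sketched as below. Two points are assumptions rather than details from the paper: the concatenation is taken along the time axis, and the TAB sub-block is replaced by a simple mean over time as a placeholder, since its internals are not reproduced here. The weights are random and the label order in `EMOTIONS` is illustrative.

```python
import numpy as np

EMOTIONS = ["anger", "sadness", "neutral", "happiness"]  # illustrative order

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classify(cm_s_to_t, cm_t_to_s, w_fc, b_fc):
    # Concatenate the two cross-modal outputs (time-axis concat is assumed).
    e_fusion = np.concatenate([cm_s_to_t, cm_t_to_s], axis=0)
    # Placeholder for the TAB sub-block: mean over time, NOT the real TAB.
    p = e_fusion.mean(axis=0)
    logits = p @ w_fc + b_fc             # fully connected layer
    probs = softmax(logits)              # class probabilities
    return probs, EMOTIONS[int(np.argmax(probs))]

rng = np.random.default_rng(0)
d = 16
probs, label = classify(
    rng.normal(size=(50, d)),            # stand-in for CM_{S->T}
    rng.normal(size=(12, d)),            # stand-in for CM_{T->S}
    rng.normal(size=(d, len(EMOTIONS))),
    np.zeros(len(EMOTIONS)),
)
print(probs.round(3), label)
```

The final prediction is simply the class whose probability is largest, matching the description of taking the maximum value of the softmax output.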

Experimental Simulation Parameters
The experimental environment for this study is shown in Table 1.

This study was primarily conducted on the publicly available IEMOCAP dataset [36]. It was created by the SAIL (Signal Analysis and Interpretation Laboratory) at the University of Southern California and is a multimodal database widely used in emotion recognition. The dataset comprises approximately 12 h of audiovisual data, including videos, speech, facial motion capture, and text. It consists of five sessions recorded with ten different actors, with each session featuring recordings of two speakers, one male and one female.
The dataset is labeled into four main emotion categories (Figure 8): anger (1102), sadness (1083), neutral (1708), and happiness (1636). To ensure fair comparisons when evaluating our model on the IEMOCAP dataset, we employed a five-fold cross-validation method, where one session was held out as the test set for each training iteration. The training-test split for each iteration is illustrated in Figure 9. It can be observed that the data distribution for each cross-validation fold is relatively uniform.

Multimodal EmotionLines Dataset (MELD) [37]
MELD evolved from the EmotionLines dataset and is a multimodal emotional dialogue dataset. It consists of organized dialogue from the popular American TV series Friends, comprising 1433 instances of dialogue with a total of 13,708 utterances. Each utterance in the dialogue is annotated with one of seven emotion labels: anger, disgust, fear, joy, surprise, sadness, or neutral. Additionally, each utterance in MELD is annotated with a sentiment label: positive, negative, or neutral. The distribution of the training/test and validation data samples in MELD is shown in Figure 10.

Evaluation Metrics
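To evaluate the performance of the model, we use weighted average recall (WAR) and unweighted average recall (UAR) as the evaluation metrics. UAR averages the recall of each class without considering the number of samples per class, so every class is treated equally. WAR, on the other hand, weights the recall of each class by its sample size. The formulas for calculating recall, UAR, and WAR are as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{8}$$

$$\mathrm{UAR} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Recall}_i$$

$$\mathrm{WAR} = \sum_{i=1}^{N} \frac{\text{Class Size}_i}{\text{Total Sample Size}} \times \mathrm{Recall}_i$$

where TP represents True Positives; TN represents True Negatives; FP represents False Positives; FN represents False Negatives; N is the number of classes; $\text{Class Size}_i$ is the sample size of the i-th class; and Total Sample Size is the total sample size across all classes.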
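As a worked illustration of the two metrics, the following sketch computes per-class recall, UAR, and WAR from a confusion matrix. The function name `recall_metrics` and the toy two-class matrix are assumptions for the example; rows are taken to be the true classes and columns the predicted classes.

```python
import numpy as np

def recall_metrics(confusion):
    """Return per-class recall, UAR, and WAR from a confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    class_sizes = confusion.sum(axis=1)               # samples per true class
    recalls = np.diag(confusion) / class_sizes        # TP / (TP + FN)
    uar = recalls.mean()                              # every class counts equally
    war = (class_sizes / class_sizes.sum() * recalls).sum()  # size-weighted
    return recalls, uar, war

# Toy, deliberately imbalanced example: 10 samples of class 0, 90 of class 1.
confusion = [[8, 2],
             [10, 80]]
recalls, uar, war = recall_metrics(confusion)
print(recalls, uar, war)   # recalls = [0.8, 0.888...], UAR ≈ 0.844, WAR = 0.88
```

In this toy case WAR equals the overall accuracy, (8 + 80) / 100 = 0.88, because weighting each recall by its class share recovers the fraction of all samples classified correctly. UAR is lower because the small class, which has the lower recall, counts as much as the large one.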

Mul-TAB Result Analysis
This section analyzes the accuracy, robustness, hyperparameters, TAB3, and multi-head attention mechanism of Mul-TAB.

Accuracy Analysis
In this study, we proposed a speech emotion recognition model based on the improved TIM-Net network and multimodal fusion and conducted experiments on the IEMOCAP and MELD datasets. The test accuracy of our model is shown in the comparative line charts in Figures 11 and 12. On the IEMOCAP dataset (Figure 11), the best WAR reached 83.9% and the best UAR reached 82.0%. For MELD, the best testing accuracy was 63.6%, and the training loss was 0.359, as illustrated in Figure 12. The corresponding confusion matrices are shown in Figure 13. On the IEMOCAP dataset (Figure 13a), our model performed best in recognizing the "anger" emotion category, with an accuracy of 90.5%, while the recognition accuracy of the "neutral" emotion category was the lowest at 73.5%. On MELD (Figure 13b), our model performed best in recognizing the "neutral" emotion category, with an accuracy of 89.9%, while the recognition accuracy of the "disgust" emotion category was the lowest at 8.82%.

Robustness Analysis
During training, we used 80% of the dataset as the training set and 20% as the test set. As the IEMOCAP dataset is divided into five sessions, we performed five-fold cross-validation by leaving one session out as the test set in each training iteration. Across these experiments, our model exhibits strong robustness. A comparative line chart of the model's test accuracy is depicted in Figure 14.
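The leave-one-session-out protocol can be sketched as follows. The record layout (dictionaries with hypothetical `id` and `session` fields) and the function name are assumptions for illustration; the toy data have four utterances per session, whereas the real sessions differ in size.

```python
from collections import defaultdict

def leave_one_session_out(utterances):
    """Yield (held_out_session, train, test) for each session in turn."""
    by_session = defaultdict(list)
    for utt in utterances:
        by_session[utt["session"]].append(utt)
    for held_out in sorted(by_session):
        test = by_session[held_out]
        train = [u for s, items in by_session.items() if s != held_out
                 for u in items]
        yield held_out, train, test

# Toy data: five sessions with four utterances each.
utterances = [{"id": f"s{s}_u{i}", "session": s}
              for s in range(1, 6) for i in range(4)]

folds = list(leave_one_session_out(utterances))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))   # 16 training and 4 test items per fold
```

With five sessions of similar size, each fold uses roughly 80% of the data for training and 20% for testing, and no speaker from the held-out session appears in the training set.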

Hyperparameter Analysis
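Given that we used RoBERTa and Wav2Vec2.0 for the feature extraction, the feature dimension was 768. After multiple rounds of training and testing, the optimal configuration was found to be a batch size of 2, 100 epochs, a learning rate of $5 \times 10^{-5}$, three layers in the TAB sub-block, and two layers in the multi-head attention mechanism. Using a single GPU, each training and inference iteration takes approximately 4-5 min. The entire training process, with five iterations and 100 epochs, requires approximately $\frac{5 \times 100 \times 5}{60 \times 24} \approx 1.7$ days. The Mul-TAB model has a total of 143 million parameters.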

TAB3 Analysis
The emotion classification module of the model was modified by removing the TAB sub-block, and the resulting emotion classification results are shown in Figure 15. The best achieved WAR is 82.9%. This indicates that the TAB sub-block, which is capable of capturing the contextual relationships in the features and obtaining temporal-aware representations, contributes to the improvement of the model's performance.

Multi-Head Attention Mechanism Analysis
The feature extraction module of the model was modified by removing the multi-head attention mechanism. In this setup, only Wav2Vec2.0 feature extraction is performed for speech data, and only BERT feature extraction is performed for text data, while the subsequent steps remain unchanged. The resulting emotion classification results are shown in Figure 16, with the best achieved WAR being 82.3%. This indicates that the multi-head attention mechanism helps in learning more discriminative features for both speech and text, thereby enhancing the model's performance.
Comparing the variant without the TAB module and the variant without the multi-head attention mechanism against the final algorithm yields the results shown in Table 2. Analysis of Figure 17 reveals that using the TAB module or the multi-head attention mechanism alone is not as effective as using them together.

Comparison and Analysis with Other Experiments
In this paper, we proposed a speech emotion recognition model based on the improved TIM-Net network and multimodal fusion, which was experimentally validated on the IEMOCAP and MELD datasets. To evaluate our approach, we compared it with other models. The results show that our method performs well regarding WAR and UAR. The comparative results are presented in Tables 3 and 4, where "CV" stands for cross-validation, "5-fold" indicates five-fold cross-validation, and "10-fold" indicates ten-fold cross-validation. (Note: The UAR and WAR for MMER in the table are results reproduced using only the pure IEMOCAP dataset after removing the enhanced speech and text data added by the authors of the MMER paper.)

Conclusions
This study investigated a deep-learning-based multimodal emotion recognition approach. The proposed multimodal fusion method based on TIM-Net and a multi-head attention mechanism demonstrates clear advantages in speech emotion recognition tasks, effectively improving the accuracy of emotion classification. This paper discussed the methods for fusing multimodal features, analyzed the TIM-Net model, and proposed enhancements to it. Finally, it combined the multimodal features to accomplish emotion recognition. Analyzing the experimental results provided further insights into the contributions of different modal features. These findings offer guidance for future studies and applications in speech emotion recognition.
The proposed model also has shortcomings: it does not yet identify emotions by capturing the context of the conversation, nor does it apply common sense reasoning to understand the emotional changes in the conversation between the listener and the speaker. Future research directions include further optimizing the multimodal feature extraction and fusion methods to enhance the collaboration between different feature modalities. Additionally, an in-depth exploration of the design and implementation details of the TIM-Net model is necessary to further optimize the model's structure and parameters. Exploring more effective training methods and optimization strategies to improve the model's generalization ability, studying cross-domain and cross-language speech emotion recognition, and incorporating common sense reasoning are also required.

Figure 3. Structure diagram of the BERT model.

Figure 4. Multimodal fusion method. AU is an analysis unit.

The TIM-Net Emotion Recognition Network Model
TIM-Net is capable of learning contextual representations from different temporal scales. The network structure is illustrated in Figure 5 [35]. Specifically, TIM-Net utilizes a temporal-aware block to learn the temporal emotion representations initially. It then integrates supplementary information from both the past and the future to enrich the contextual representations. Finally, it fuses features from multiple temporal scales with the aim of better adapting to emotional changes. TIM-Net outperforms the other methods in terms of its accuracy on each corpus.

Figure 6. Schematic diagram of the modified TAB structure.

Figure 9. Each training test's data. The x-axis represents the distribution of training and testing data when session i is used as the testing set, where the left column represents the training data and the right column represents the testing data. The y-axis represents the quantity of data.

Figure 10. Training/testing and validation data samples for MELD.

Figure 11. Test accuracy comparison when using Session 2 as the test set. (Experiments conducted on the IEMOCAP dataset).

Figure 12. Training loss and testing accuracy on MELD.

Figure 14. Comparative line chart of cross-validation accuracy.

Figure 15. Cross-validation accuracy comparison without the TAB module.

Figure 17. The performance of the TAB and multi-head attention modules on different datasets. The first five columns are the experimental results on IEMOCAP, and the last column is the experimental result on MELD.
, primarily encompassing data preprocessing, text feature extraction, model training, and emotion recognition.

Table 2. Results of ablation experiments. The table has been converted into a bar chart, as shown in Figure 17.

Table 3. Comparative results with other models on the IEMOCAP dataset. In the modal column, S represents speech, T represents text, and V represents vision.

Table 4. Comparative results with other models on MELD.