Applied Sciences
  • Article
  • Open Access

1 September 2023

Uni2Mul: A Conformer-Based Multimodal Emotion Classification Model by Considering Unimodal Expression Differences with Multi-Task Learning

Smart Policing Academy, China People’s Police University, Langfang 065000, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Advanced Technologies for Emotion Recognition

Abstract

Multimodal emotion classification (MEC) has been extensively studied in human–computer interaction, healthcare, and other domains. Previous MEC research has utilized identical multimodal annotations (IMAs) to train unimodal models, hindering the learning of effective unimodal representations due to differences between unimodal expressions and multimodal perceptions. Additionally, most MEC fusion techniques fail to consider the unimodal–multimodal inconsistencies. This study addresses two important issues in MEC: learning satisfactory unimodal representations of emotion and accounting for unimodal–multimodal inconsistencies during the fusion process. To tackle these challenges, the authors propose the Two-Stage Conformer-based MEC model (Uni2Mul) with two key innovations: (1) in stage one, unimodal models are trained using independent unimodal annotations (IUAs) to optimize unimodal emotion representations; (2) in stage two, a Conformer-based architecture is employed to fuse the unimodal representations learned in stage one and predict IMAs, accounting for unimodal–multimodal differences. The proposed model is evaluated on the CH-SIMS dataset. The experimental results demonstrate that Uni2Mul outperforms baseline models. This study makes two key contributions: (1) the use of IUAs improves unimodal learning; (2) the two-stage approach addresses unimodal–multimodal inconsistencies during Conformer-based fusion. Uni2Mul advances MEC by enhancing unimodal representation learning and Conformer-based fusion.

1. Introduction

Emotion classification is a crucial task that has been extensively studied in domains such as human–computer interaction, healthcare, and medicine. Early research on emotion classification primarily utilized unimodal data, such as text [1,2], audio [3], or visual cues [4], and achieved notable success. More recently, researchers have begun incorporating two or more modalities to perform multimodal emotion classification (MEC) [5,6,7]. MEC (all abbreviations are explained in Appendix A, Table A1) can furnish salient clues for more accurately discerning the genuine emotional state of the opinion holder and thereby improve classification accuracy [8]. With the rapid growth of short-video applications, MEC has become an active area of research [9].
The two most salient components in MEC tasks are unimodal emotion representation learning and multimodal fusion. In many scenarios, unimodal emotion expression differs from multimodal emotion perception, and multimodal annotation stems from the interaction between each modality. Taking a video clip from the CH-SIMS dataset as an exemplar, as illustrated in Figure 1, the text “Isn’t it about to go bankrupt”, with an emotional annotation of −0.8, indicates a negative emotion, and the audio modality annotation is also −0.8. The visual modality expression includes a smile and corresponds to a positive emotional annotation of 1.0. The combination of these three modalities results in a multimodal emotional annotation of 0.6, indicating a positive emotion for the video clip.
Figure 1. An example of inconsistency between unimodal emotion expression and multimodal emotion perception. Stage One: unimodal emotion recognition, Stage Two: multimodal fusion for emotion recognition. M: Multimodal, V: Vision, A: Audio, T: Text, ⊕: multimodal fusion.
The majority of existing research employs identical multimodal annotations (IMAs) for both unimodal model pre-training and multimodal fusion [10,11,12]. Unimodal models trained with IMAs cannot learn satisfactory representations; instead, they learn forcibly aligned ones. Such forced alignment introduces discrepancies into the distribution of each modality and produces an irregular semantic space, as depicted in Figure 2a. Multimodal fusion built on this space yields distorted and convoluted classification boundaries (see Figure 2c). If the unimodal models are instead trained with independent unimodal annotations (IUAs), the representations of each modality lie roughly within a unified region while retaining certain differences (see Figure 2b). Leveraging such differentiated unimodal representations for multimodal fusion leads to superior fusion outcomes [13]. The fusion strategy itself is equally important for a comprehensive MEC model: a satisfactory strategy fully harnesses the information from each modality and captures the differences and interactions between modalities. On this basis, the distributions of the emotion categories become more uniform and the classification boundaries more regular (see Figure 2d). However, most current fusion studies pay little attention to the inconsistency between unimodal and multimodal emotions [14,15,16,17] and struggle to learn the differences between modalities [13].
Figure 2. Hypothesized unimodal and multimodal emotion representation. In (a,b), the markers of different colors represent different modalities, and shapes with dotted curves of different colors represent semantic spaces of respective modalities. The black solid curves indicate the envelope of these colored lines. In (c,d), the markers of different colors represent different emotion categories, and the black solid curves represent classification boundaries. V: Vision, A: Audio, T: Text.
The CH-SIMS dataset provides not only IMAs for each clip but also IUAs for each modality [13]. This is consistent with our position and with its authors' attempt to train unimodal models with IUAs; however, that work does not explore the two aforementioned problems, unimodal representation learning and the multimodal fusion strategy, in sufficient depth.
Problem Statement: Most existing multimodal datasets provide only IMAs:

$$D_1 = \{X_\sigma, Y_M\},$$

where $\sigma$ indexes the modalities, $X_\sigma$ is the data, and $Y_M$ denotes the IMAs. MEC models built on these datasets take $X_\sigma$ as input and $Y_M$ as output throughout the entire training stage. This process can be expressed as

$$X_\sigma \xrightarrow{f} Y_M,$$

where $f$ is responsible for both unimodal representation and multimodal fusion.
We define CH-SIMS as

$$D_2 = \{X_\sigma, Y_\sigma, Y_M\},$$

where $Y_\sigma$ represents the IUAs and $Y_M$ represents the IMAs. We express the training process based on CH-SIMS as

$$X_\sigma \xrightarrow{f_1} Y_\sigma \xrightarrow{f_2} Y_M,$$

where $f_1$ is responsible for unimodal representation and $f_2$ for multimodal fusion.
To find the optimal $f_1$ and $f_2$, we propose a Conformer-based MEC model (Uni2Mul) that integrates sub-tasks for predicting both IUAs and IMAs. We argue that achieving a satisfactory unimodal representation requires not only a suitable feature extraction network but also training it with the correct IUAs rather than IMAs. Moreover, a satisfactory fusion strategy can better exploit the unimodal representations under a multi-task framework.
The contributions of this work can be summarized as follows:
(1) We train the unimodal models with IUAs, which yields better unimodal emotion representations (see Figure 2b).
(2) We adopt a two-stage approach that accounts for unimodal–multimodal inconsistencies during Conformer-based fusion (see Figure 2d).
Uni2Mul thus advances MEC by improving both unimodal representation learning and fusion.
The remainder of the paper is organized as follows: Section 2 describes the related work, Section 3 discusses the proposed methods and overall architecture of the Uni2Mul model, Section 4 explains the experimental setup and results, Section 5 presents the discussion, and Section 6 summarizes our work.

3. Methods

Given a video clip containing multimodal information (text, audio, and vision), the task is to predict the emotional category of the clip, i.e., positive, negative, or neutral. The key is to extract and fuse the representations of each modality. To achieve this, we propose a Conformer-based MEC model (Uni2Mul); Figure 3 provides a detailed illustration. Visual, acoustic, and textual features are extracted from the video data with three pre-trained models: CLIP (ViT-B/32), Wav2Vec 2.0, and BERT (bert-base-chinese), respectively. These features are fed into unimodal neural networks to predict IUAs. Meanwhile, hidden representations are taken from these unimodal networks and fused to predict IMAs. More details about our model are provided below.
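The offline feature extraction step could look roughly as follows with the Hugging Face transformers library. This is a minimal sketch under our own assumptions: the checkpoint names openai/clip-vit-base-patch32 and facebook/wav2vec2-base are assumptions (only ViT-B/32 and bert-base-chinese are named in the paper), and the extracted arrays are later fed to the Keras models described below.

```python
# Minimal sketch of offline feature extraction for one clip, assuming sampled
# frames (PIL images), a 16 kHz waveform (numpy array), and the transcript.
# Checkpoint names other than bert-base-chinese are our assumptions.
import torch
from transformers import (CLIPProcessor, CLIPModel,
                          Wav2Vec2FeatureExtractor, Wav2Vec2Model,
                          BertTokenizer, BertModel)

clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
w2v_fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
tok = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

@torch.no_grad()
def extract_features(frames, waveform, text):
    # Visual: one CLIP image embedding per sampled frame.
    img_in = clip_proc(images=frames, return_tensors="pt")
    x_v = clip.get_image_features(**img_in)
    # Acoustic: frame-level Wav2Vec 2.0 hidden states.
    aud_in = w2v_fe(waveform, sampling_rate=16000, return_tensors="pt")
    x_a = w2v(**aud_in).last_hidden_state.squeeze(0)
    # Textual: token-level BERT hidden states (special tokens added automatically).
    txt_in = tok(text, return_tensors="pt")
    x_t = bert(**txt_in).last_hidden_state.squeeze(0)
    return x_v.numpy(), x_a.numpy(), x_t.numpy()
```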
Figure 3. Conformer-based MEC model (Uni2Mul). The model contains two stages: emotional classification models for each modality (left part) aim to optimize unimodal representation, and multi-task fusion network with Conformer (right part) performs multimodal fusion and MEC. The different colors correspond to different modalities. Pink, green, and purple represent the visual, acoustic, and textual modalities respectively. M : Multimodal, V : Vision, A : Audio, T : Text, H : representation from hidden layer, Y : emotional annotation, q : query, v : value, FC: Fully Connected layer. GAP: Global Average Pooling1D, ⊗: operation of multiplication, ⊕: operation of addition.

3.1. Unimodal Neural Networks

3.1.1. Vision

CLIP is a transferable visual model trained on 400 million (image, text) pairs collected from the internet [72]. We use this pre-trained model as a visual feature extractor to make full use of its prior knowledge, which includes not only visual but also textual information. To extract better unimodal emotional representations, we construct an emotion classification model for the visual modality using a CNN and an LSTM, inspired by [30].
Specifically, we use CLIP to extract features from the images and define the visual feature as $X^V$. We then feed $X^V$ into three CNN layers, whose operation can be formulated as

$$R_{cnn}^V = W_{cnn}^V \cdot X^V + b_{cnn}^V,$$

where $R_{cnn}^V$ denotes the output of the CNNs and can be written as $R_{cnn}^V = [r_0^V, r_1^V, \dots, r_t^V]$, where $t$ is the timestep. $W_{cnn}^V$ denotes the weight matrix and $b_{cnn}^V$ the bias vector.
Next, the LSTM layer takes the outputs of the CNNs as input and performs the following operations:

$$\begin{aligned}
i_t^V &= \sigma\!\left(W_i^V\,[h_{t-1}^V : r_t^V] + b_i^V\right),\\
f_t^V &= \sigma\!\left(W_f^V\,[h_{t-1}^V : r_t^V] + b_f^V\right),\\
o_t^V &= \sigma\!\left(W_o^V\,[h_{t-1}^V : r_t^V] + b_o^V\right),\\
C_t^V &= f_t^V \cdot C_{t-1}^V + i_t^V \cdot \tanh\!\left(W_c^V\,[h_{t-1}^V : r_t^V] + b_c^V\right),\\
H_t^V &= o_t^V \cdot \tanh\!\left(C_t^V\right),
\end{aligned}$$

where $i_t^V$, $f_t^V$, and $o_t^V$ are the outputs of the input, forget, and output gates of the LSTM, respectively; $C_t^V$ is the cell state at the current step; and $H_t^V$ is the hidden output of the LSTM, defined as $H_t^V = [h_0^V, h_1^V, \dots, h_t^V]$ (where $t$ is the timestep). $\sigma(\cdot)$ denotes the sigmoid activation function and $\tanh(\cdot)$ the hyperbolic tangent. $W_i^V$, $W_f^V$, $W_o^V$, and $W_c^V$ are weight matrices, and $b_i^V$, $b_f^V$, $b_o^V$, and $b_c^V$ are bias vectors.
Finally, we define the hidden representation as $H^V = H_t^V$, which is fed into a fully connected layer for visual emotional classification:

$$Y^V = W^V \cdot H^V + b^V,$$

where $Y^V$ represents the annotation of the visual modality, $W^V$ the weight matrix, and $b^V$ the bias vector.
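A minimal Keras sketch of this visual branch is given below. It assumes CLIP features with the shapes reported in Section 4.2 (timestep 10, feature length 512) and three emotion classes; the kernel sizes and filter counts follow that section, while the layer and function names are ours and this is not the authors' released code.

```python
# Sketch of the visual branch (CNN + LSTM over CLIP features).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_visual_model(timesteps=10, feat_dim=512, num_classes=3):
    x_v = layers.Input(shape=(timesteps, feat_dim), name="X_V")   # CLIP features
    h = layers.Conv1D(64, 3, strides=1, padding="same", activation="relu")(x_v)
    h = layers.Conv1D(64, 1, strides=1, padding="same", activation="relu")(h)
    h = layers.Conv1D(64, 3, strides=1, padding="same", activation="relu")(h)
    h_v = layers.LSTM(32, name="H_V")(h)                          # hidden representation H^V
    y_v = layers.Dense(num_classes, activation="softmax", name="Y_V")(h_v)
    return models.Model(x_v, y_v, name="visual_unimodal")

visual_model = build_visual_model()
visual_model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])
```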

3.1.2. Audio

Wav2Vec 2.0 is a self-supervised framework for speech representation learning. It can capture information about the speaker and performs well in speech recognition tasks, especially in ultra-low-resource cases [46]. Therefore, we use it as the feature extractor for acoustic data. Inspired by Atila et al. [41], we propose a network that consists of a CNN followed by a self-attention mechanism and a BiLSTM.
Firstly, we employ Wav2Vec 2.0 to extract acoustic features from the waveform signal and define the acoustic feature as $X^A$. We then feed $X^A$ into a CNN layer:

$$R_{cnn}^A = W_{cnn}^A \cdot X^A + b_{cnn}^A,$$

where $R_{cnn}^A$ represents the CNN output, $W_{cnn}^A$ the weight matrix, and $b_{cnn}^A$ the bias vector.
We use the “Scaled Dot-Product Attention” from [56] as the attention mechanism. The matrix of outputs is computed as

$$\begin{aligned}
Que^A &= W_{Que}^A \cdot R_{cnn}^A,\\
Key^A &= W_{Key}^A \cdot R_{cnn}^A,\\
score^A &= \mathrm{softmax}\!\left(W_{attn}^A \cdot \tanh\!\left(Que^A \cdot Key^A\right)\right),\\
R_{attn}^A &= score^A \cdot \left(W_{Val}^A \cdot Val^A\right),
\end{aligned}$$

where $Que^A$, $Key^A$, and $Val^A$ denote the query, key, and value, respectively. $Val^A$ is equal to $Key^A$, and $score^A$ represents the weights on the value. $\mathrm{softmax}$ is the softmax function and $\tanh(\cdot)$ the hyperbolic tangent. $R_{attn}^A$ is the output of the attention module and can be written as $R_{attn}^A = [r_0^A, r_1^A, \dots, r_t^A]$, where $t$ is the timestep. $R_{attn}^A$ is fed into a BiLSTM to extract deep representations. $W_{Que}^A$, $W_{Key}^A$, $W_{attn}^A$, and $W_{Val}^A$ are weight matrices. The outputs of the BiLSTM are computed as follows:
$$\begin{aligned}
i_t^A &= \sigma\!\left(W_i^A\,[h_{t-1}^A : r_t^A] + b_i^A\right),\\
f_t^A &= \sigma\!\left(W_f^A\,[h_{t-1}^A : r_t^A] + b_f^A\right),\\
o_t^A &= \sigma\!\left(W_o^A\,[h_{t-1}^A : r_t^A] + b_o^A\right),\\
C_t^A &= f_t^A \cdot C_{t-1}^A + i_t^A \cdot \tanh\!\left(W_c^A\,[h_{t-1}^A : r_t^A] + b_c^A\right),\\
H_t^A &= o_t^A \cdot \tanh\!\left(C_t^A\right),
\end{aligned}$$

where $i_t^A$, $f_t^A$, and $o_t^A$ are the outputs of the input, forget, and output gates of the LSTM, respectively; $C_t^A$ is the cell state at the current step; and $H_t^A$ is the hidden output of the LSTM, defined as $H_t^A = [h_1^A, \dots, h_t^A]$, where $t$ is the timestep. $\sigma(\cdot)$ denotes the sigmoid activation function and $\tanh(\cdot)$ the hyperbolic tangent. $W_i^A$, $W_f^A$, $W_o^A$, and $W_c^A$ are weight matrices, and $b_i^A$, $b_f^A$, $b_o^A$, and $b_c^A$ are bias vectors.
Finally, we feed the output of the BiLSTM into a fully connected layer for acoustic emotional classification:

$$H^A = [\,\overrightarrow{H_t^A} : \overleftarrow{H_t^A}\,], \qquad Y^A = W^A \cdot H^A + b^A,$$

where $H^A$ and $Y^A$ represent the hidden representation and the annotation of the acoustic modality, respectively; $W^A$ is the weight matrix and $b^A$ the bias vector.
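A minimal Keras sketch of this acoustic branch follows, assuming Wav2Vec 2.0 features of shape (128, 512) as in Section 4.2. The built-in keras Attention layer stands in for the scaled dot-product attention above, the query/key projections are written as pointwise convolutions for concreteness, and the placement of dropout inside the LSTM is our choice; all names are ours.

```python
# Sketch of the acoustic branch (CNN -> attention -> BiLSTM), not the
# authors' released code.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_acoustic_model(timesteps=128, feat_dim=512, num_classes=3):
    x_a = layers.Input(shape=(timesteps, feat_dim), name="X_A")   # Wav2Vec 2.0 features
    r = layers.Conv1D(64, 3, strides=2, padding="same", activation="relu")(x_a)
    que = layers.Conv1D(64, 1, padding="same")(r)                 # query projection
    key = layers.Conv1D(64, 1, padding="same")(r)                 # key (= value) projection
    r_attn = layers.Attention(use_scale=True)([que, key])         # scaled dot-product attention
    h_a = layers.Bidirectional(layers.LSTM(32, dropout=0.5), name="H_A")(r_attn)
    y_a = layers.Dense(num_classes, activation="softmax", name="Y_A")(h_a)
    return models.Model(x_a, y_a, name="acoustic_unimodal")
```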

3.1.3. Text

We use BERT to obtain embeddings from the texts. Owing to the characteristics of BERT, we do not use word segmentation tools. We add two special tokens to mark the beginning and the end of each text. We define the textual feature extracted by BERT as $X^T$.
During the training of the unimodal emotional classification models, we find that the textual emotion classification model performs better than those of the other modalities, so we aim to extract more information from the textual modality. Inspired by [54], we design a CNN-attention-based model.
Specifically, we generate the query and key for the textual modality using two 1D CNNs, and again use the “Scaled Dot-Product Attention” from [56] as the attention mechanism. The matrix of outputs is computed as

$$\begin{aligned}
Que^T &= W_{Que}^T \cdot X^T + b_{Que}^T,\\
Key^T &= W_{Key}^T \cdot X^T + b_{Key}^T,\\
score^T &= \mathrm{softmax}\!\left(W_{attn}^T \cdot \tanh\!\left(Que^T \cdot Key^T\right)\right),\\
R_{attn}^T &= score^T \cdot \left(W_{Val}^T \cdot Val^T\right),
\end{aligned}$$

where $Que^T$, $Key^T$, and $Val^T$ denote the query, key, and value, respectively. $Val^T$ is equal to $Key^T$, and $score^T$ represents the weights on the value. $\mathrm{softmax}$ is the softmax function and $\tanh(\cdot)$ the hyperbolic tangent. $R_{attn}^T$ is the output of the attention module. $W_{Que}^T$, $W_{Key}^T$, $W_{attn}^T$, and $W_{Val}^T$ are weight matrices; $b_{Que}^T$ and $b_{Key}^T$ are bias vectors.
Subsequently, we apply global average pooling to $Que^T$ and $R_{attn}^T$ and concatenate the results to form the textual representation $H^T$:

$$H^T = [\,GAP(Que^T) : GAP(R_{attn}^T)\,],$$

where $GAP(\cdot)$ denotes the global average pooling operation.
Lastly, we send $H^T$ to a fully connected layer for emotional classification:

$$Y^T = W^T \cdot H^T + b^T,$$

where $Y^T$ represents the annotation of the textual modality, $W^T$ the weight matrix, and $b^T$ the bias vector.
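The textual branch can be sketched in Keras as follows, assuming BERT token features of shape (36, 512) as stated in Section 4.2; the Attention layer again approximates the scaled dot-product attention above, and all names are ours.

```python
# Sketch of the textual branch (two 1D CNNs for query/key, attention, GAP, concat).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_textual_model(seq_len=36, feat_dim=512, num_classes=3):
    x_t = layers.Input(shape=(seq_len, feat_dim), name="X_T")     # BERT features
    que = layers.Conv1D(64, 1, strides=1, padding="same")(x_t)    # query
    key = layers.Conv1D(64, 1, strides=1, padding="same")(x_t)    # key (= value)
    r_attn = layers.Attention(use_scale=True)([que, key])
    h_t = layers.Concatenate(name="H_T")([
        layers.GlobalAveragePooling1D()(que),
        layers.GlobalAveragePooling1D()(r_attn),
    ])                                                            # H^T
    y_t = layers.Dense(num_classes, activation="softmax", name="Y_T")(h_t)
    return models.Model(x_t, y_t, name="textual_unimodal")
```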
These modality-specific networks are trained with IUAs so that the differences between the modality representations are preserved; the trained models are then saved to construct the Uni2Mul framework.

3.2. Multi-Task Multimodal Fusion Network with Conformer

Inspired by Gulati et al. [73], we propose a Conformer-based fusion method. The Conformer [73] (see Figure 4) is a convolution-augmented Transformer for speech recognition. It combines convolutional neural networks and Transformers to model both the local and global dependencies of audio sequences in a parameter-efficient way, and achieves better accuracy with fewer parameters than previous work on the LibriSpeech dataset.
Figure 4. Conformer architecture. (a) is Conformer architecture, (b) is feed-forward module, (c) is convolution module, and (d) is multi-head self-attention module. ⊕: operation of addition.
The Conformer block (see Figure 4a) is composed of four modules stacked together: a feed-forward module, a multi-head self-attention module, a convolution module, and a second feed-forward module at the end. In the multi-head self-attention module (see Figure 4d), we employ pre-norm residual units with dropout, which helps to train and regularize deeper models. The convolution module (see Figure 4c) starts with a layernorm layer, followed by a convolution layer, a gated linear unit (GLU), and the remaining convolution, normalization, and dropout layers of the module; batchnorm is applied immediately after the convolution to aid the training of deep models. The feed-forward module (see Figure 4b) starts with a layernorm layer, followed by two linear layers interleaved with dropout layers, and uses the Swish activation.
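A compact Keras sketch of one Conformer block is given below, assuming the hyperparameters of Section 4.2 (64 convolution filters, kernel size 3, 2 attention heads, dropout rate 0.5). The layer ordering follows the description above and [73]; the simplifications (a full rather than half-step feed-forward residual, a GLU built from two pointwise convolutions) and all names are ours.

```python
# Sketch of a single Conformer block (FFN -> MHSA -> conv module -> FFN) with
# pre-norm residual connections; a simplification of [73], not the authors' code.
# Requires TF >= 2.9 for DepthwiseConv1D.
import tensorflow as tf
from tensorflow.keras import layers

def feed_forward(x, dim, dropout=0.5):
    h = layers.LayerNormalization()(x)
    h = layers.Dense(4 * dim, activation="swish")(h)
    h = layers.Dropout(dropout)(h)
    h = layers.Dense(dim)(h)
    h = layers.Dropout(dropout)(h)
    return layers.Add()([x, h])   # [73] uses half-step residuals; a full residual is kept here for brevity

def conformer_block(x, dim=64, heads=2, kernel_size=3, dropout=0.5):
    x = feed_forward(x, dim, dropout)
    # Multi-head self-attention module with pre-norm residual and dropout.
    h = layers.LayerNormalization()(x)
    h = layers.MultiHeadAttention(num_heads=heads, key_dim=dim // heads)(h, h)
    x = layers.Add()([x, layers.Dropout(dropout)(h)])
    # Convolution module: layernorm -> pointwise convs with GLU gating ->
    # depthwise conv -> batchnorm -> swish -> pointwise conv -> dropout -> residual.
    h = layers.LayerNormalization()(x)
    lin = layers.Conv1D(dim, 1)(h)
    gate = layers.Conv1D(dim, 1, activation="sigmoid")(h)
    h = layers.Multiply()([lin, gate])                    # GLU
    h = layers.DepthwiseConv1D(kernel_size, padding="same")(h)
    h = layers.BatchNormalization()(h)
    h = layers.Activation("swish")(h)
    h = layers.Conv1D(dim, 1)(h)
    x = layers.Add()([x, layers.Dropout(dropout)(h)])
    x = feed_forward(x, dim, dropout)
    return layers.LayerNormalization()(x)
```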
Specifically, we first load the unimodal models saved above and freeze them (set them as non-trainable) to obtain the unimodal representations from their hidden layers. We then concatenate these representations to form the multimodal representation:

$$X^M = [\,H^V : H^A : H^T\,].$$
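This loading-and-freezing step could look as follows in Keras, continuing the sketches above. The file names, the hidden-layer names ("H_V", "H_A", "H_T"), and the helper function are our assumptions; the only constraint taken from the paper is that the encoders are frozen and their hidden representations are concatenated.

```python
# Sketch of stage-two input preparation: load the pre-trained unimodal models,
# freeze them, and concatenate their hidden representations into X^M.
# Assumes the models were built in one session so that layer names are unique.
import tensorflow as tf
from tensorflow.keras import layers, models

def load_frozen_encoder(path, hidden_layer_name):
    m = tf.keras.models.load_model(path)
    m.trainable = False                                   # keep H^sigma unchanged in stage two
    return models.Model(m.input, m.get_layer(hidden_layer_name).output)

enc_v = load_frozen_encoder("visual_unimodal.h5", "H_V")
enc_a = load_frozen_encoder("acoustic_unimodal.h5", "H_A")
enc_t = load_frozen_encoder("textual_unimodal.h5", "H_T")

x_v, x_a, x_t = enc_v.input, enc_a.input, enc_t.input
# Section 5.2.2 reports 64-, 64-, and 128-dimensional hidden vectors, giving a 256-d X^M.
x_m = layers.Concatenate(name="X_M")([enc_v.output, enc_a.output, enc_t.output])
```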
We use a Conformer-based method to fuse the multimodal representation and predict the emotional annotation of the video. Specifically, we send $X^M$ into the first feed-forward module and define the result as $R_{ffm1}^M$, which is fed into the multi-head self-attention module. The operations of the multi-head self-attention module in the Conformer can be formulated as follows:

$$\begin{aligned}
R_{LN}^M &= LN(R_{ffm1}^M),\\
Que^M &= W_{Que}^M \cdot R_{LN}^M,\\
Key^M &= W_{Key}^M \cdot R_{LN}^M,\\
score^M &= \mathrm{softmax}\!\left(W_{attn}^M \cdot \tanh\!\left(Que^M \cdot Key^M\right)\right),\\
R_{attn}^M &= score^M \cdot \left(W_{Val}^M \cdot Val^M\right),\\
R_{mha}^M &= W_{MHA}^M \cdot \mathrm{Concat}\!\left(R_{attn,1}^M, \dots, R_{attn,h}^M\right),\\
R_{DP}^M &= DP(R_{mha}^M),
\end{aligned}$$

where $LN(\cdot)$ is the layer normalization operation, and $Que^M$, $Key^M$, and $Val^M$ denote the query, key, and value, respectively. $Val^M$ is equal to $Key^M$, and $score^M$ represents the weights on the value. $\mathrm{softmax}$ is the softmax function and $\tanh(\cdot)$ the hyperbolic tangent. $R_{attn}^M$ is the output of the attention module of one head and $R_{mha}^M$ is the output of the multi-head attention module. $W_{Que}^M$, $W_{Key}^M$, $W_{attn}^M$, and $W_{Val}^M$ are weight matrices. $\mathrm{Concat}(\cdot)$ denotes the concatenation operation and $DP(\cdot)$ the dropout operation.
$R_{DP}^M$ is then fed into the convolution module, whose operations can be formulated as

$$\begin{aligned}
R_{cnn1}^M &= W_{cnn1}^M \cdot LN(R_{DP}^M) + b_{cnn1}^M,\\
R_{cnn2}^M &= W_{cnn2}^M \cdot BN\!\left(GLU(R_{cnn1}^M)\right) + b_{cnn2}^M,\\
R_{CON}^M &= DP(R_{cnn2}^M) + R_{DP}^M,
\end{aligned}$$

where $LN(\cdot)$ is the layer normalization operation, $GLU(\cdot)$ the gated linear unit, $BN(\cdot)$ the batch normalization operation, and $DP(\cdot)$ the dropout operation. $R_{cnn1}^M$ and $R_{cnn2}^M$ are the convolution outputs. $W_{cnn1}^M$ and $W_{cnn2}^M$ are weight matrices, and $b_{cnn1}^M$ and $b_{cnn2}^M$ are bias vectors. We feed $R_{CON}^M$ into the second feed-forward module and define the result as $R_{ffm2}^M$.
Based on the results reported for CH-SIMS, the textual and visual modalities achieve relatively high accuracy in unimodal emotion recognition models; we refer to them as the dominant modalities. In contrast, the acoustic modality achieves relatively low accuracy; we refer to it as the less salient modality. To avoid neglecting information from the less salient modality during multimodal fusion, we adopt a multi-task framework. In other words, our multimodal model has four outputs: the unimodal classification results for the visual, acoustic, and textual modalities, and the overall MEC result. The MEC operation can be formulated as
$$\begin{aligned}
Y^M &= W^M \cdot GAP(R_{ffm2}^M) + b^M,\\
Y^V &= W^V \cdot H^V + b^V,\\
Y^A &= W^A \cdot H^A + b^A,\\
Y^T &= W^T \cdot H^T + b^T,
\end{aligned}$$

where $Y^M$ represents the annotation of each video clip; $W^M$, $W^V$, $W^A$, and $W^T$ are weight matrices; $b^M$, $b^V$, $b^A$, and $b^T$ are bias vectors; and $GAP(\cdot)$ denotes the global average pooling operation.
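Putting the pieces together, the stage-two multi-task model might be assembled as below, continuing the sketches above. The loss weights 0.3/0.2/0.2/0.3 and the optimizer settings follow Section 4.2; reshaping the 256-d vector into a short 4 x 64 sequence before the Conformer is purely our assumption (the paper only states that global average pooling is applied to the Conformer output), and all names are ours.

```python
# Sketch of the stage-two multi-task fusion network: Conformer fusion of X^M plus
# four softmax heads (M, V, A, T). Builds on x_m, x_v/x_a/x_t, enc_*, and
# conformer_block from the sketches above.
x_m_seq = layers.Reshape((-1, 64))(x_m)            # (4, 64) sequence; our choice, not stated in the paper
h_m = conformer_block(x_m_seq, dim=64, heads=2, kernel_size=3, dropout=0.5)
h_m = layers.GlobalAveragePooling1D()(h_m)         # GAP(R_ffm2)

y_m = layers.Dense(3, activation="softmax", name="Y_M")(h_m)
y_v = layers.Dense(3, activation="softmax", name="Y_V")(enc_v.output)
y_a = layers.Dense(3, activation="softmax", name="Y_A")(enc_a.output)
y_t = layers.Dense(3, activation="softmax", name="Y_T")(enc_t.output)

uni2mul = models.Model([x_v, x_a, x_t], [y_m, y_v, y_a, y_t], name="Uni2Mul")
uni2mul.compile(
    optimizer=tf.keras.optimizers.Adam(1e-4),
    loss="sparse_categorical_crossentropy",
    loss_weights={"Y_M": 0.3, "Y_V": 0.2, "Y_A": 0.2, "Y_T": 0.3},  # per Section 4.2
    metrics=["accuracy"],
)
```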

4. Experiment

4.1. Dataset

The CH-SIMS dataset [13] comprises 60 videos collected from movies, TV series, and variety shows, divided into 2281 video clips. Each clip has three unimodal annotations and one multimodal annotation. Given that few currently available datasets contain unimodal annotations, we conduct experiments only on CH-SIMS.

4.2. Parameters’ Setting

All experiments are carried out using a single NVIDIA (Santa Clara, CA, USA) Quadro P520 card. We adopt Adam as the optimizer with a learning rate of 1 × 10−4, and set the number of epochs to 30 and batch size to 32. All models are trained using sparse categorical cross-entropy on each softmax output. We evaluate the models with accuracy (Acc.) and F1 score. Acc. is equal to the proportion of correctly classified samples to the total number of samples, and can be formulated as follows:
$$Acc. = \frac{TP + TN}{TP + FN + TN + FP},$$

where $TP$ denotes true positives, $TN$ true negatives, $FP$ false positives, and $FN$ false negatives.
Precision is the ability of the classifier not to label as positive a sample that is negative, and recall is the ability of the classifier to find all the positive samples. The F1 score can be interpreted as a weighted harmonic mean of the precision and recall, and can be formulated as follows:
$$precision = \frac{TP}{TP + FP}, \qquad recall = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times precision \times recall}{precision + recall}.$$
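For concreteness, these metrics can be computed with scikit-learn as sketched below. The paper does not state whether the reported F1 is macro- or weighted-averaged over the three classes, so the averaging mode here is our assumption, and the labels are toy values.

```python
# Minimal sketch: computing Acc. and F1 from model predictions with scikit-learn.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0, 1, 2, 1, 0])          # toy labels: 0=negative, 1=neutral, 2=positive
y_pred = np.array([0, 1, 1, 1, 0])          # toy predictions

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="weighted")   # averaging mode assumed
print(f"Acc. = {acc:.4f}, F1 = {f1:.4f}")
```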
Higher values of Acc. and F1 score represent better performance. We use the “save_best_only = True” parameter in Keras to save only the best model during training. This parameter ensures that the model weights are saved whenever the monitored metric improves and overwrites the previously saved weights only if there is an improvement. Thus, at the end of training, the saved weights correspond to the best performing model on the validation set.
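The checkpointing behaviour described above corresponds to Keras's ModelCheckpoint callback. A minimal sketch follows; the monitored quantity is our assumption, since the paper only states that the best model on the validation set is kept, while the epoch count, batch size, and optimizer follow Section 4.2.

```python
# Sketch of training with best-model checkpointing (save_best_only=True).
import tensorflow as tf

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_model.h5",
    monitor="val_accuracy",      # assumed; the paper does not name the monitored metric
    save_best_only=True,
    save_weights_only=False,
)

# model.fit(x_train, y_train,
#           validation_data=(x_val, y_val),
#           epochs=30, batch_size=32,
#           callbacks=[checkpoint])
```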
In addition to the general parameters mentioned above, personalized parameters for unimodal and multimodal models are set as follows:
Vision: In selecting the visual timestep, we conduct several pilot experiments to assess the impact of different timesteps on the performance of the visual models. We observe that increasing the timestep results in a larger memory footprint for the feature matrix without a significant improvement in model performance. To strike a balance between computational complexity and performance, we set the visual timestep to 10. The output length of CLIP is 512, so the input shape of the visual modality is 10 × 512. The following three 1D convolution layers have 64 filters, with kernel sizes of 3, 1, and 3, respectively, and strides of 1. The subsequent BiLSTM has 32 units.
Audio: Similarly, in line with our findings for the visual timestep, our pilot experiments show that an acoustic timestep of 128 maintains a good level of performance without an excessive memory footprint, so we set the acoustic timestep to 128. The output length of Wav2Vec 2.0 is 512, so the input shape of the acoustic modality is 128 × 512. The following 1D convolution layer has 64 filters, with a kernel size of 3 and a stride of 2. The subsequent BiLSTM has 32 units, and dropout with a rate of 0.5 is used to prevent overfitting.
Text: The longest sentence in the CH-SIMS dataset has 36 tokens. The output length of BERT used in this paper is 512. Hence, the input shape of the textual modality is 36 × 512 . The subsequent two 1D convolution layers have 64 filters, and both the kernel size and strides are set to 1.
Multimodal: The number of attention heads in our multi-head attention module in Conformer is 2. Dropout is used with a rate of 0.5 to prevent overfitting. The convolution layer in Conformer has 64 filters, and the kernel size is set to 3. The loss weights for multimodal, vision, audio, and text are 0.3, 0.2, 0.2, and 0.3, respectively.

4.3. Experimental Results

We employ the models described in [13] as our initial benchmark. Following established research practices, we conduct experiments involving single-task and multi-task scenarios to evaluate the fusion of multimodal data. We test two variations of unimodal models, namely pre-trained and without pre-training, within both the single-task and multi-task frameworks. The evaluation of these models is based on the Acc. metric, and the outcomes are presented in Table 1, comparing our models to the baseline performance on the CH-SIMS dataset.
Table 1. The results of the baseline and our models on the CH-SIMS dataset. The models with * are multi-task models, extended from single-task models by introducing independent unimodal annotations. “S” in model names stands for “Single-task”, while “M” denotes “multi-task”. The model names with “w/o pre-train” indicate that the corresponding unimodal models have not been pre-trained, while the ones without “w/o pre-train” indicate that the unimodal models have been pre-trained. “Conformer” in model names means the fusion method of the models.
The results indicate that all four of our models outperform the baseline models mentioned in the CH-SIMS paper in terms of Acc. Particularly, the Uni2Mul-M-Conformer model exhibits an accuracy improvement of 7.75 points compared to the MLF-DNN∗ baseline.
We can observe that the pre-training method in stage one improves the performance of the MEC models. For example, Uni2Mul-S-Conformer demonstrates a higher Acc. than Uni2Mul-S-Conformer (w/o pre-train), and Uni2Mul-M-Conformer shows a higher Acc. than Uni2Mul-M-Conformer (w/o pre-train). These results suggest that pre-trained unimodal models can capture good representations and pass them unchanged to the fusion network for MEC.
Additionally, a multi-task framework can exploit complementary information among multiple related tasks, improving the generalization ability of the MEC models. For example, Uni2Mul-M-Conformer (w/o pre-train) outperforms Uni2Mul-S-Conformer (w/o pre-train), and Uni2Mul-M-Conformer outperforms Uni2Mul-S-Conformer.
These findings demonstrate that Uni2Mul models can achieve superior performance, and our proposed methods for representation extraction and fusion strategy prove effective for MEC.

5. Discussion

5.1. Ablation Study

5.1.1. Unimodal Representation

To validate our hypotheses, we conducted the following experiments to examine the impact of different feature extraction methods and neural network structures on the performance of unimodal emotion classification models. We trained individual models for each modality, utilizing two types of annotations: IMAs and IUAs. The evaluation of these models was based on the Acc. and F1 score. The results are shown in Table 2. To compare the performance differences more clearly between IMA and IUA models, we plotted Figure 5 based on the data in Table 2.
Table 2. Performance of unimodal models in stage one. “ATTN” in the model names represents attention mechanism. Mel is the Mel spectrogram feature.
Figure 5. Performance comparison chart of unimodal models in stage one. (a–c) depict the model performance metrics, Acc. and F1 score, of three visual models: CNN, CNN-LSTM, and CNN-ATTN-LSTM. (d–f) depict the model performance metrics, Acc. and F1 score, of three acoustic models: BiLSTM, CNN-BiLSTM, and CNN-ATTN-BiLSTM. (g–i) depict the model performance metrics, Acc. and F1 score, of three textual models: BiLSTM, ATTN-BiLSTM, and CNN-ATTN. V: Vision, A: Audio, T: Text.
In the visual modality, we compared our unimodal model, the CNN-LSTM-based model, with the CNN-based and CNN-ATTN-LSTM-based models. Two types of features were tested for each model: video images and the outputs of CLIP. As presented in Table 2 and Figure 5a–c, the Acc. (IUAs) consistently outperformed Acc. (IMAs), and our CNN-LSTM-based model yielded the highest performance among the three models. Specifically, the CNN-LSTM-based model demonstrated a 0.65 percentage point improvement in Acc. (IUAs) and a 5.53 percentage point improvement in F1 (IUAs) when compared to Acc. (IMAs) and F1 (IMAs). Furthermore, CLIP proved to be an effective method for visual feature extraction. Taking the CNN-LSTM model trained using IUAs as an example, employing the output of CLIP as the input feature resulted in a 16.85 percentage point increase in Acc. and a 32.52 percentage point increase in F1 score, compared to using images as the input. This improvement can be attributed to the ability of CLIP to leverage prior knowledge of both vision and text, allowing it to extract textual information along with visual information.
For the acoustic modality, we compared our unimodal model, the CNN-ATTN-BiLSTM-based model, with the BiLSTM-based and CNN-BiLSTM-based models. Two types of features were examined for each model: Mel-spectrogram features and the outputs of Wav2Vec 2.0. As indicated in Table 2 and Figure 5d–f, most models exhibited higher F1 scores for IUAs compared to IMAs, and our CNN-ATTN-BiLSTM-based model delivered the best performance among the three models. Additionally, Wav2Vec 2.0 proved to be an effective method for acoustic feature extraction. Although the CNN-ATTN-BiLSTM-based model had a 1.1 percentage point lower Acc. (IUAs) compared to Acc. (IMAs), it demonstrated a 9.58 percentage point higher F1 score for IUAs compared to the F1 score for IMAs.
In the textual modality, we compared our unimodal model, the CNN-ATTN-based model, with the BiLSTM-based and ATTN-BiLSTM-based models. Two types of features were tested for each model: outputs of Word2Vec and BERT. As depicted in Table 2 and Figure 5g–i, in most cases, the F1 score for IUAs outperformed the F1 score for IMAs, and in some instances, the Acc. (IUAs) surpassed Acc. (IMAs). Our CNN-ATTN-based model yielded the best performance among the three models. As expected, BERT proved to be an effective method for textual feature extraction.
The results shown in Table 2 and Figure 5 are consistent with our idea: unimodal models trained using IUAs can learn better emotional representation.

5.1.2. Multimodal Fusion

We compared four fusion strategies: concatenate, multi-head attention, Transformer, and Conformer. The evaluation of these fusion strategies was conducted using Acc. (accuracy) and F1 score. The results are presented in Table 3.
Table 3. Results for MEC with different fusion strategies in stage two.
Based on the overall trend observed in Table 3, Conformer emerged as the most effective fusion method, followed by Transformer, multi-head attention, and concatenate. Specifically, the Uni2Mul-M-Conformer model achieved an Acc. (IUAs) that was 0.22, 0.44, and 1.76 percentage points higher than those of Uni2Mul-M-Transformer, Uni2Mul-M-Attention, and Uni2Mul-M-Concatenate, respectively.
In most cases, Acc. (IUAs) outperformed Acc. (IMAs), and F1 (IUAs) outperformed F1 (IMAs). The Uni2Mul-M-Conformer model demonstrated an Acc. (IUAs) that was 3.72 percentage points higher than that of Acc. (IMAs), while F1 (IUAs) was 3.58 percentage points higher than that of F1 (IMAs). These results align precisely with our expectations.
Furthermore, the multimodal fusion models constructed using pre-trained unimodal models exhibited higher Acc. and F1 score compared to those built without pre-training. Additionally, the multi-task framework outperformed the single-task framework, which is consistent with previous research findings.

5.2. Visualization

5.2.1. Visualization of Hidden Representations

To assess the impact of IMAs and IUAs on model feature extraction, we utilized t-SNE to visualize the hidden representations of our models (refer to Figure 6).
Figure 6. Unimodal and multimodal representation. (a,b) are the results from unimodal models trained using IMAs and IUAs, respectively; (c,d) are the results from models trained using IMAs and IUAs with Conformer-based fusion strategy. In (a,b), the markers of different colors represent different modalities, and shapes filled with different colors represent semantic spaces of different modalities. In (c,d), the markers of different colors represent different emotional categories, and shapes filled with different colors represent classification distribution. V: Vision, A: Audio, T: Text.
For unimodal representations, models trained with IMAs displayed irregular clustering in the semantic space with diffuse distributions (see Figure 6a). Conversely, models trained with IUAs formed more regular, roughly spherical clusters with an approximately Gaussian distribution (see Figure 6b).
In terms of multimodal representations, the representations of the Conformer-based fusion model trained with IMAs were not well concentrated and showed multiple overlapping regions, resulting in suboptimal classification performance and limited generalizability (see Figure 6c). In contrast, the model trained with IUAs exhibited distinct classification boundaries and minimal overlap in the class distributions, yielding comparatively superior classification performance and adequate generalizability (see Figure 6d).
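A minimal sketch of this visualization with scikit-learn's t-SNE and matplotlib is shown below; the variable names and plotting details are our assumptions.

```python
# Sketch: t-SNE projection of hidden representations, colored by modality (for
# Figure 6a,b) or by emotion class (for Figure 6c,d). `hidden` is an (N, d)
# array of H^V / H^A / H^T (or fused) vectors and `labels` the modality/class ids.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(hidden, labels, title):
    emb = TSNE(n_components=2, random_state=0).fit_transform(hidden)
    for lab in np.unique(labels):
        mask = labels == lab
        plt.scatter(emb[mask, 0], emb[mask, 1], s=8, label=str(lab))
    plt.title(title)
    plt.legend()
    plt.show()
```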

5.2.2. Visualization of Attention Weights

In stage one, we obtained a 64-dimensional visual feature vector, a 64-dimensional acoustic feature vector, and a 128-dimensional textual feature vector. In stage two, we concatenated these three vectors into a 256-dimensional multimodal vector and fed it into the fusion network. To explore the differences between fusion methods, we visualized the attention weights of three fusion strategies: multi-head attention (see Figure 7a), Transformer (see Figure 7b), and Conformer (see Figure 7c). Each panel of Figure 7a–c shows an attention weight matrix over the 256-dimensional multimodal feature vector. Along the horizontal axis, the positions correspond to the visual, acoustic, and textual feature components from left to right, and along the vertical axis from top to bottom. The value (color) at each position indicates the degree of correlation between the corresponding positions, with the diagonal representing autocorrelation; the higher the value, the brighter the color. The weights of the different heads were averaged.
Figure 7. Attention weights of three fusion methods: (a) for “multi-head attention”, (b) for “Transformer”, and (c) for “Conformer”. The lighter the color, the higher the weight.
In Figure 7a, we observed a few scattered regions with relatively high weights in the distribution, indicating that the multi-head attention fusion model successfully identified important emotional features. In Figure 7b, the weights were predominantly higher along the main diagonal, suggesting that the Transformer model discovered significant intra-modal relationships. Although there were a few high-weight regions near the bottom left corner, their weights were lower than those along the main diagonal, implying that the model was less proficient at fusing inter-modal information and primarily focused on intra-modal relationships. Moving on to Figure 7c, we noticed that the main diagonal had a wider area of prominence than in Figure 7b, and some vertical highlighted regions appeared to the bottom left of the main diagonal. These observations indicate that the Conformer model, benefiting from its convolutional kernel, assigns higher weights to intra-modal relationships while showing greater confidence in inter-modal relationships. These findings align with our expectations and suggest that the Conformer network serves as an effective fusion model.
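Such a heatmap can be produced as sketched below. How the attention scores are pulled out of the model depends on the attention implementation and is an assumption here (the shape shown matches, e.g., a Keras MultiHeadAttention layer called with return_attention_scores=True); only the head averaging, the modality blocks, and the brighter-is-higher convention follow the description above.

```python
# Sketch: visualizing head-averaged attention weights over the 256-d multimodal
# vector as a heatmap. `attn` is assumed to have shape (num_heads, 256, 256).
import numpy as np
import matplotlib.pyplot as plt

def plot_attention(attn, title):
    avg = attn.mean(axis=0)                  # average over heads
    plt.imshow(avg, cmap="viridis")          # brighter = higher weight
    # Boundaries between the visual (0-63), acoustic (64-127), and textual
    # (128-255) blocks, per the dimensions stated above.
    for b in (64, 128):
        plt.axhline(b - 0.5, color="white", linewidth=0.5)
        plt.axvline(b - 0.5, color="white", linewidth=0.5)
    plt.title(title)
    plt.colorbar()
    plt.show()
```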

5.2.3. Visualization of Confusion Matrix

To provide a more intuitive representation of the impact of IMAs, IUAs, and fusion strategies on the performance of multi-modal fusion models, we visualize the confusion matrices of these models. The results are shown in Figure 8.
Figure 8. Confusion matrix for MEC models. (a,c,e,g) are from models trained with IMAs, while (b,d,f,h) are from models trained with IUAs. (a,b) belong to the concatenate-based fusion models, (c,d) belong to the multi-head attention-based fusion models, (e,f) belong to the transformer-based fusion models, and (g,h) belong to the Conformer-based fusion models.
Comparing the left side of Figure 8a,c,e,g with the right side Figure 8b,d,f,h, it can be observed that regardless of the fusion strategy employed, models trained with IUAs outperform those trained with IMAs. Models trained with IMAs tend to have higher recognition accuracy for the negative class but lower accuracy for the positive and neutral classes. On the other hand, models trained with IUAs exhibit better recognition capabilities for both the positive and neutral classes, especially those based on the Conformer fusion method, which show improvements in recognition accuracy for all three classes.
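The confusion matrices themselves can be generated with scikit-learn as sketched below; the labels are toy values and the plotting details are our assumptions.

```python
# Sketch: confusion matrix for one fusion model's multimodal predictions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_true = np.array([0, 1, 2, 2, 0, 1])       # toy labels: 0=negative, 1=neutral, 2=positive
y_pred = np.array([0, 0, 2, 1, 0, 1])       # toy predictions from a fusion model

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
ConfusionMatrixDisplay(cm, display_labels=["negative", "neutral", "positive"]).plot(cmap="Blues")
plt.show()
```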
This research, however, has several limitations. First, additional datasets with unimodal annotations are lacking for more extensive validation. Second, CH-SIMS is an imbalanced dataset with relatively few positive and, especially, neutral samples; the model's adaptability to imbalanced data could be further improved.

6. Conclusions

In this paper, we propose a robust Conformer-based MEC model called Uni2Mul, which focuses on optimizing unimodal representation and multimodal fusion. We divide the implementation of Uni2Mul into two stages. In stage one, we construct individual unimodal neural networks for each modality and train them using IUAs to optimize the unimodal representations. This results in pre-trained unimodal models with superior performance. In stage two, we concatenate the hidden representations of these pre-trained unimodal models and feed the concatenated feature into the Conformer-based fusion network. This fusion network includes a sub-task that predicts IUAs for MEC. We also perform ablation experiments for the two stages separately. For stage one, we construct three different neural network structures for each modality and train these networks using IUAs and IMAs, respectively. For stage two, we try four fusion methods: concatenate, multi-head attention, Transformer, and Conformer. We summarize our overall findings as follows:
(1)
Unimodal models trained using IUAs can learn more differentiated information and improve the complementarity between modalities compared to those trained using IMAs.
(2)
The hidden representations of the pre-trained unimodal models serve as effective inputs for the fusion network. This ensures that the differentiated information learned using the unimodal models is passed unchanged to the fusion network.
(3)
The Conformer module, with its multi-head attention mechanism and convolutional kernel, excels at attending to important intra-modal information and capturing inter-modal relationships. It performs best among the four fusion strategies mentioned above.

Author Contributions

Conceptualization, L.Z.; Funding Acquisition, N.J.; Methodology, L.Z.; Resources, N.J.; Software, L.Z.; Validation, C.L.; Visualization, C.L.; Writing—Original Draft, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Doctoral Research Innovation Program of China People’s Police University, grant number BSKY202201; 2022 Humanities and Social Science Research Youth Foundation Project of Ministry of Education, grant number 22YJC860014; and Hebei Province Science and Technology Support Program, grant number 18215601. The APC was funded by the 2022 Humanities and Social Science Research Youth Foundation Project of the Ministry of Education.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

https://github.com/thuiar/MMSA, accessed on 6 March 2023.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Abbreviations used in this paper. Each abbreviation retains the underlined letter from its corresponding phrase.
Abbreviation | Stands for
Uni2Mul | Unimodal to Multimodal
MEC | Multimodal Emotion Classification
IUAs | Independent Unimodal Annotations
IMAs | Identical Multimodal Annotations
CNN | Convolutional Neural Network
RNN | Recurrent Neural Network
LSTM | Long Short-Term Memory
SER | Speech Emotion Recognition
LFCC | Linear Frequency Cepstral Coefficients
MFCC | Mel-scale Frequency Cepstral Coefficients
ROC | Receiver Operating Characteristic
BiLSTM | Bidirectional LSTM
BN | Batch Normalization
ReLU | Rectified Linear Unit
BERT | Bidirectional Encoder Representation from Transformers
GRU | Gated Recurrent Unit
VGG | Visual Geometry Group
SVM | Support Vector Machine
EEG | Electroencephalogram
GSR | Galvanic Skin Response
CLIP | Contrastive Language-Image Pre-training
GLU | Gated Linear Unit
LN | Layer Normalization
DP | Dropout Operation
GAP | Global Average Pooling1D

References

  1. Taboada, M.; Brooke, J.; Tofiloski, M.; Voll, K.; Stede, M. Lexicon-Based Methods for Sentiment Analysis. Comput. Linguist. 2011, 37, 267–307. [Google Scholar] [CrossRef]
  2. Thelwall, M.; Buckley, K.; Paltoglou, G. Sentiment Strength Detection for the Social Web. J. Am. Soc. Inf. Sci. Technol. 2012, 63, 163–173. [Google Scholar] [CrossRef]
  3. Trigeorgis, G.; Ringeval, F.; Brueckner, R.; Marchi, E.; Nicolaou, M.A.; Schuller, B.; Zafeiriou, S. Adieu Features? End-to-End Speech Emotion Recognition Using a Deep Convolutional Recurrent Network. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5200–5204. [Google Scholar]
  4. Hoffmann, H.; Kessler, H.; Eppel, T.; Rukavina, S.; Traue, H.C. Expression Intensity, Gender and Facial Emotion Recognition: Women Recognize Only Subtle Facial Emotions Better than Men. Acta Psychol. 2010, 135, 278–283. [Google Scholar] [CrossRef] [PubMed]
  5. Collignon, O.; Girard, S.; Gosselin, F.; Roy, S.; Saint-Amour, D.; Lassonde, M.; Lepore, F. Audio-Visual Integration of Emotion Expression. Brain Res. 2008, 1242, 126–135. [Google Scholar] [CrossRef]
  6. Cho, J.; Pappagari, R.; Kulkarni, P.; Villalba, J.; Carmiel, Y.; Dehak, N. Deep Neural Networks for Emotion Recognition Combining Audio and Transcripts. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018. [Google Scholar]
  7. Pampouchidou, A.; Simantiraki, O.; Fazlollahi, A.; Pediaditis, M.; Manousos, D.; Roniotis, A.; Giannakakis, G.; Meriaudeau, F.; Simos, P.; Marias, K.; et al. Depression Assessment by Fusing High and Low Level Features from Audio, Video, and Text. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, Amsterdam, The Netherlands, 16 October 2016; pp. 27–34. [Google Scholar]
  8. Dardagan, N.; Brđanin, A.; Džigal, D.; Akagic, A. Multiple Object Trackers in OpenCV: A Benchmark. In Proceedings of the 2021 IEEE 30th International Symposium on Industrial Electronics (ISIE), Kyoto, Japan, 20–23 June 2021. [Google Scholar]
  9. Guo, W.; Wang, J.; Wang, S. Deep Multimodal Representation Learning: A Survey. IEEE Access 2019, 7, 63373–63394. [Google Scholar] [CrossRef]
  10. Ghaleb, E.; Niehues, J.; Asteriadis, S. Multimodal Attention-Mechanism For Temporal Emotion Recognition. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 251–255. [Google Scholar]
  11. Deng, J.J.; Leung, C.H.C.; Li, Y. Multimodal Emotion Recognition Using Transfer Learning on Audio and Text Data. In Computational Science and Its Applications—ICCSA 2021; Lecture Notes in Computer Science; Gervasi, O., Murgante, B., Misra, S., Garau, C., Blečić, I., Taniar, D., Apduhan, B.O., Rocha, A.M.A.C., Tarantino, E., Torre, C.M., Eds.; Springer International Publishing: Cham, Switzerland, 2021; Volume 12951, pp. 552–563. ISBN 978-3-030-86969-4. [Google Scholar]
  12. Li, J.; Wang, S.; Chao, Y.; Liu, X.; Meng, H. Context-Aware Multimodal Fusion for Emotion Recognition. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18 September 2022; pp. 2013–2017. [Google Scholar]
  13. Yu, W.; Xu, H.; Meng, F.; Zhu, Y.; Ma, Y.; Wu, J.; Zou, J.; Yang, K. CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-Grained Annotations of Modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5 July 2020. [Google Scholar]
  14. Gunes, H.; Piccardi, M. Bi-Modal Emotion Recognition from Expressive Face and Body Gestures. J. Netw. Comput. Appl. 2007, 30, 1334–1345. [Google Scholar] [CrossRef]
  15. Cimtay, Y.; Ekmekcioglu, E.; Caglar-Ozhan, S. Cross-Subject Multimodal Emotion Recognition Based on Hybrid Fusion. IEEE Access 2020, 8, 168865–168878. [Google Scholar] [CrossRef]
  16. Huan, R.-H.; Shu, J.; Bao, S.-L.; Liang, R.-H.; Chen, P.; Chi, K.-K. Video Multimodal Emotion Recognition Based on Bi-GRU and Attention Fusion. Multimed. Tools Appl. 2021, 80, 8213–8240. [Google Scholar] [CrossRef]
  17. Du, Y.; Liu, Y.; Peng, Z.; Jin, X. Gated Attention Fusion Network for Multimodal Sentiment Classification. Knowl.-Based Syst. 2022, 240, 108107. [Google Scholar] [CrossRef]
  18. Jabid, T. Robust Facial Expression Recognition Based on Local Directional Pattern. ETRI J. 2010, 32, 784–794. [Google Scholar] [CrossRef]
  19. Zhu, Y.; Li, X.; Wu, G. Face Expression Recognition Based on Equable Principal Component Analysis and Linear Regression Classification. In Proceedings of the 2016 3rd International Conference on Systems and Informatics (ICSAI), Shanghai, China, 19–21 November 2016; pp. 876–880. [Google Scholar]
  20. Barman, A.; Dutta, P. Facial Expression Recognition Using Distance Signature Feature. In Advanced Computational and Communication Paradigms; Bhattacharyya, S., Chaki, N., Konar, D., Chakraborty, U.K., Singh, C.T., Eds.; Advances in Intelligent Systems and Computing; Springer: Singapore, 2018; Volume 706, pp. 155–163. ISBN 978-981-10-8236-8. [Google Scholar]
  21. Liu, S.; Tian, Y. Facial Expression Recognition Method Based on Gabor Wavelet Features and Fractional Power Polynomial Kernel PCA. In Advances in Neural Networks—ISNN 2010; Zhang, L., Lu, B.-L., Kwok, J., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6064, pp. 144–151. ISBN 978-3-642-13317-6. [Google Scholar]
  22. Chao, W.-L.; Ding, J.-J.; Liu, J.-Z. Facial Expression Recognition Based on Improved Local Binary Pattern and Class-Regularized Locality Preserving Projection. Signal Process. 2015, 117, 1–10. [Google Scholar] [CrossRef]
  23. Sánchez, A.; Ruiz, J.V.; Moreno, A.B.; Montemayor, A.S.; Hernández, J.; Pantrigo, J.J. Differential Optical Flow Applied to Automatic Facial Expression Recognition. Neurocomputing 2011, 74, 1272–1282. [Google Scholar] [CrossRef]
  24. Saravanan, A.; Perichetla, G.; Gayathri, D.K.S. Facial Emotion Recognition Using Convolutional Neural Networks. SN Appl. Sci. 2019, 2, 446. [Google Scholar]
  25. Yu, Z.; Zhang, C. Image Based Static Facial Expression Recognition with Multiple Deep Network Learning. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, Seattle, WA, USA, 9 November 2015; pp. 435–442. [Google Scholar]
  26. Ebrahimi Kahou, S.; Michalski, V.; Konda, K.; Memisevic, R.; Pal, C. Recurrent Neural Networks for Emotion Recognition in Video. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, Seattle, WA, USA, 9 November 2015; pp. 467–474. [Google Scholar]
  27. Ding, H.; Zhou, S.K.; Chellappa, R. FaceNet2ExpNet: Regularizing a Deep Face Recognition Net for Expression Recognition. In Proceedings of the 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), Washington, DC, USA, 30 May–3 June 2017; pp. 118–126. [Google Scholar]
  28. Verma, M.; Kobori, H.; Nakashima, Y.; Takemura, N.; Nagahara, H. Facial Expression Recognition with Skip-Connection to Leverage Low-Level Features. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 51–55. [Google Scholar]
  29. Yang, H.; Ciftci, U.; Yin, L. Facial Expression Recognition by De-Expression Residue Learning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2168–2177. [Google Scholar]
  30. Li, T.-H.S.; Kuo, P.-H.; Tsai, T.-N.; Luan, P.-C. CNN and LSTM Based Facial Expression Analysis Model for a Humanoid Robot. IEEE Access 2019, 7, 93998–94011. [Google Scholar] [CrossRef]
  31. Ming, Y.; Qian, H.; Guangyuan, L. CNN-LSTM Facial Expression Recognition Method Fused with Two-Layer Attention Mechanism. Comput. Intell. Neurosci. 2022, 2022, 1–9. [Google Scholar] [CrossRef]
  32. Iliou, T.; Anagnostopoulos, C.-N. Statistical Evaluation of Speech Features for Emotion Recognition. In Proceedings of the 2009 Fourth International Conference on Digital Telecommunications, Colmar, France, 20–25 July 2009; pp. 121–126. [Google Scholar]
  33. Wang, K.; An, N.; Li, B.N.; Zhang, Y.; Li, L. Speech Emotion Recognition Using Fourier Parameters. IEEE Trans. Affect. Comput. 2015, 6, 69–75. [Google Scholar] [CrossRef]
  34. Lahaie, O.; Lefebvre, R.; Gournay, P. Influence of Audio Bandwidth on Speech Emotion Recognition by Human Subjects. In Proceedings of the 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Montreal, QC, USA, 22 July 2017; pp. 61–65. [Google Scholar]
  35. Bandela, S.R.; Kumar, T.K. Stressed Speech Emotion Recognition Using Feature Fusion of Teager Energy Operator and MFCC. In Proceedings of the 2017 8th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Delhi, India, 3–5 July 2017; pp. 1–5. [Google Scholar]
  36. Han, K.; Yu, D.; Tashev, I. Speech Emotion Recognition Using Deep Neural Network and Extreme Learning Machine. In Proceedings of the Interspeech 2014, Singapore, 14–18 September 2014. [Google Scholar]
  37. Mao, Q.; Dong, M.; Huang, Z.; Zhan, Y. Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks. IEEE Trans. Multimed. 2014, 16, 2203–2213. [Google Scholar] [CrossRef]
  38. Lee, J.; Tashev, I. High-Level Feature Representation Using Recurrent Neural Network for Speech Emotion Recognition. In Proceedings of the Interspeech 2015, Dresden, Germany, 6–10 September 2015. [Google Scholar] [CrossRef]
  39. Kumbhar, H.S.; Bhandari, S.U. Speech Emotion Recognition Using MFCC Features and LSTM Network. In Proceedings of the 2019 5th International Conference On Computing, Communication, Control And Automation (ICCUBEA), Pune, India, 19–21 September 2019; pp. 1–3. [Google Scholar]
  40. Etienne, C.; Fidanza, G.; Petrovskii, A.; Devillers, L.; Schmauch, B. CNN+LSTM Architecture for Speech Emotion Recognition with Data Augmentation. In Proceedings of the Workshop on Speech, Music and Mind (SMM 2018), Hyderabad, India, 1 September 2018; pp. 21–25. [Google Scholar]
  41. Atila, O.; Şengür, A. Attention Guided 3D CNN-LSTM Model for Accurate Speech Based Emotion Recognition. Appl. Acoust. 2021, 182, 108260. [Google Scholar] [CrossRef]
  42. Baevski, A.; Zhou, H.; Mohamed, A.; Auli, M. Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations 2020. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
  43. Chung, Y.-A.; Hsu, W.-N.; Tang, H.; Glass, J. An Unsupervised Autoregressive Model for Speech Representation Learning. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019. [Google Scholar]
  44. Liu, A.T.; Li, S.-W.; Lee, H. TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech. IEEEACM Trans. Audio Speech Lang. Process. 2021, 29, 2351–2366. [Google Scholar] [CrossRef]
  45. Liu, A.T.; Yang, S.; Chi, P.-H.; Hsu, P.; Lee, H. Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6419–6423. [Google Scholar]
  46. Fan, Z.; Li, M.; Zhou, S.; Xu, B. Exploring Wav2vec 2.0 on Speaker Verification and Language Identification 2021. In Proceedings of the Interspeech 2021, Brno, Czechia, 30 August–3 September 2021. [Google Scholar]
  47. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space 2013. In Proceedings of the International Conference on Learning Representations, Scottsdale, AZ, USA, 2–4 May 2013. [Google Scholar]
  48. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality 2013. In Proceedings of the International Conference on Learning Representations, Scottsdale, AZ, USA, 2–4 May 2013. [Google Scholar]
  49. Javed, N.; Muralidhara, B.L. Emotions During COVID-19: LSTM Models for Emotion Detection in Tweets. In Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications; Gunjan, V.K., Zurada, J.M., Eds.; Lecture Notes in Networks and Systems; Springer: Singapore, 2022; Volume 237, pp. 133–148. ISBN 9789811664069. [Google Scholar]
  50. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019. [Google Scholar]
  51. Gou, Z.; Li, Y. Integrating BERT Embeddings and BiLSTM for Emotion Analysis of Dialogue. Comput. Intell. Neurosci. 2023, 2023, 6618452. [Google Scholar] [CrossRef] [PubMed]
  52. Gui, L.; Zhou, Y.; Xu, R.; He, Y.; Lu, Q. Learning Representations from Heterogeneous Network for Sentiment Classification of Product Reviews. Knowl.-Based Syst. 2017, 124, 34–45. [Google Scholar] [CrossRef]
  53. Chen, F.; Ji, R.; Su, J.; Cao, D.; Gao, Y. Predicting Microblog Sentiments via Weakly Supervised Multimodal Deep Learning. IEEE Trans. Multimed. 2018, 20, 997–1007. [Google Scholar] [CrossRef]
  54. Liu, G.; Guo, J. Bidirectional LSTM with Attention Mechanism and Convolutional Layer for Text Classification. Neurocomputing 2019, 337, 325–338. [Google Scholar] [CrossRef]
  55. Xie, H.; Feng, S.; Wang, D.; Zhang, Y. A Novel Attention Based CNN Model for Emotion Intensity Prediction. In Natural Language Processing and Chinese Computing; Lecture Notes in Computer Science; Zhang, M., Ng, V., Zhao, D., Li, S., Zan, H., Eds.; Springer International Publishing: Cham, Switzerland, 2018; Volume 11108, pp. 365–377. ISBN 978-3-319-99494-9. [Google Scholar]
  56. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need 2017. In Proceedings of the 2017 Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  57. Akula, R.; Garibay, I. Interpretable Multi-Head Self-Attention Architecture for Sarcasm Detection in Social Media. Entropy 2021, 23, 394. [Google Scholar] [CrossRef] [PubMed]
  58. Pérez-Rosas, V.; Mihalcea, R.; Morency, L.-P. Utterance-Level Multimodal Sentiment Analysis. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria, 4–9 August 2013; pp. 973–982. [Google Scholar]
  59. Xu, N.; Mao, W. MultiSentiNet: A Deep Semantic Network for Multimodal Sentiment Analysis. In Proceedings of the Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6 November 2017; pp. 2399–2402. [Google Scholar]
  60. Deng, D.; Zhou, Y.; Pi, J.; Shi, B.E. Multimodal Utterance-Level Affect Analysis Using Visual, Audio and Text Features. arXiv 2018, arXiv:1805.00625. [Google Scholar]
  61. Poria, S.; Cambria, E.; Gelbukh, A. Deep Convolutional Neural Network Textual Features and Multiple Kernel Learning for Utterance-Level Multimodal Sentiment Analysis. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 2539–2544. [Google Scholar]
  62. Yu, Y.; Lin, H.; Meng, J.; Zhao, Z. Visual and Textual Sentiment Analysis of a Microblog Using Deep Convolutional Neural Networks. Algorithms 2016, 9, 41. [Google Scholar] [CrossRef]
  63. Li, Y.; Zhao, T.; Shen, X. Attention-Based Multimodal Fusion for Estimating Human Emotion in Real-World HRI. In Proceedings of the Companion of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, Cambridge, UK, 23 March 2020; pp. 340–342. [Google Scholar]
  64. Wang, H.; Yang, M.; Li, Z.; Liu, Z.; Hu, J.; Fu, Z.; Liu, F. SCANET: Improving Multimodal Representation and Fusion with Sparse-and Cross-attention for Multimodal Sentiment Analysis. Comput. Animat. Virtual Worlds 2022, 33, e2090. [Google Scholar] [CrossRef]
  65. Li, P.; Li, X. Multimodal Fusion with Co-Attention Mechanism. In Proceedings of the 2020 IEEE 23rd International Conference on Information Fusion (FUSION), Rustenburg, South Africa, 6–9 July 2020. [Google Scholar] [CrossRef]
  66. Zhu, H.; Wang, Z.; Shi, Y.; Hua, Y.; Xu, G.; Deng, L. Multimodal Fusion Method Based on Self-Attention Mechanism. Wirel. Commun. Mob. Comput. 2020, 2020, 1–8. [Google Scholar] [CrossRef]
  67. Thao, H.T.P.; Balamurali, B.T.; Roig, G.; Herremans, D. AttendAffectNet–Emotion Prediction of Movie Viewers Using Multimodal Fusion with Self-Attention. Sensors 2021, 21, 8356. [Google Scholar] [CrossRef]
  68. Gu, D.; Wang, J.; Cai, S.; Yang, C.; Song, Z.; Zhao, H.; Xiao, L.; Wang, H. Targeted Aspect-Based Multimodal Sentiment Analysis: An Attention Capsule Extraction and Multi-Head Fusion Network. IEEE Access 2021, 9, 157329–157336. [Google Scholar] [CrossRef]
  69. Ahn, C.-S.; Kasun, C.; Sivadas, S.; Rajapakse, J. Recurrent Multi-Head Attention Fusion Network for Combining Audio and Text for Speech Emotion Recognition. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18 September 2022; pp. 744–748. [Google Scholar]
  70. Xie, B.; Sidulova, M.; Park, C.H. Robust Multimodal Emotion Recognition from Conversation with Transformer-Based Crossmodality Fusion. Sensors 2021, 21, 4913. [Google Scholar] [CrossRef] [PubMed]
  71. Wang, D.; Guo, X.; Tian, Y.; Liu, J.; He, L.; Luo, X. TETFN: A Text Enhanced Transformer Fusion Network for Multimodal Sentiment Analysis. Pattern Recognit. 2023, 136, 109259. [Google Scholar] [CrossRef]
  72. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision 2021. In Proceedings of the 2021 International Conference on Machine Learning, Virtual Event, 18–24 July 2021. [Google Scholar]
  73. Gulati, A.; Qin, J.; Chiu, C.-C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-Augmented Transformer for Speech Recognition 2020. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020. [Google Scholar]
  74. Williams, J.; Kleinegesse, S.; Comanescu, R.; Radu, O. Recognizing Emotions in Video Using Multimodal DNN Feature Fusion. In Proceedings of the Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML), Melbourne, Australia, 20 July 2018; pp. 11–19. [Google Scholar]
  75. Zadeh, A.; Liang, P.P.; Mazumder, N.; Poria, S.; Cambria, E.; Morency, L.-P. Memory Fusion Network for Multi-View Sequential Learning 2018. In Proceedings of the AAAI conference on artificial intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  76. Tsai, Y.-H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.-P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6558–6569. [Google Scholar]
  77. Liu, Z.; Shen, Y.; Lakshminarasimhan, V.B.; Liang, P.P.; Zadeh, A.; Morency, L.-P. Efficient Low-Rank Multimodal Fusion with Modality-Specific Factors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018. [Google Scholar]
  78. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.-P. Tensor Fusion Network for Multimodal Sentiment Analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017. [Google Scholar]