1. Introduction
The distinction between human beings and machines lies in their respective emotional states. Human beings are emotional creatures, whereas machines are devoid of such sentiments. If artificial intelligence is to be genuinely integrated into our lives, it is imperative that it comprehend the nuances of human emotion. Lungarella et al. have underscored the pivotal role of emotional understanding in the advancement of true artificial intelligence. Their findings highlight the need for greater consideration and analysis of emotions in the development of AI systems [1,2]. Sentiment analysis is one of the most fundamental areas of affective computing research and plays an indispensable role in advancing current AI systems towards the next generation of affective intelligence [3].
The expression and perception of human emotion are naturally multisensory. Most early research on sentiment analysis concentrated exclusively on single-modal data, such as facial expressions, voice audio, verbal text, or human physiological signals. This approach overlooked the multidimensionality of human emotions and failed to capture the emotional richness that can be derived from synthesizing multisensory data. In recent years, the growth of multimedia social networks and improvements in computer processing capabilities have led to a significant increase in the amount of multimedia data on the Internet. As a result, the ways and carriers through which people can express their emotions on the Internet are becoming increasingly diverse. It is thus imperative to integrate data from diverse modalities, including text, voice, image, and video, to enhance the precision and granularity of sentiment analysis. Compared with unimodal data, multimodal data encompass a more diverse range of information, which supports decisions from multiple perspectives and therefore more accurate predictions [4].
In the past, researchers often studied sentiment analysis based on single-modal data such as text, speech, or facial expressions. For example, text sentiment analysis has been intensively studied by a large number of researchers, and many industrial solutions have been widely applied to different domains [5,6]. However, in recent years, the rise of multimedia social networks, increased computer processing power, and advances in complex sensors have made more and more multimodal data available for sentiment analysis, enabling more detailed and accurate predictions. Human emotional expression is naturally multimodal. People make full use of spoken language, facial expressions, voice tones, and body movements to express their emotions in daily communication. The same sentence, expressed with different facial expressions or intonations, may convey very different emotions. In multimodal sentiment analysis, multisensory data in the form of different modalities are the most important information carriers. Multimodal data commonly used in sentiment analysis research include textual language, sound and audio, visual facial expressions, scene images, and human physiological signals.
In multimodal sentiment analysis, the dynamic association between multimodal data is a pivotal factor in predicting sentiment. Consequently, researchers have extensively investigated the interaction of multimodal dynamics. Multimodal dynamics can be broadly classified into two categories: intra-modal dynamics and inter-modal dynamics. As the term implies, intra-modal dynamics pertain to the affective states intrinsic to data originating from a single sensory source, such as changes in facial expression or voice intonation that reflect shifts in affective state. In contrast, inter-modal dynamics pertain to the information interactions and affective associations between data from different sensory sources. For instance, a quivering voice combined with a smiling face may indicate excitement, whereas a quivering voice accompanied by dilated eyes and a protruding mouth may suggest fear. Both aspects must be modeled concurrently to capture multimodal dynamics effectively. To this end, researchers have put forth a number of multimodal fusion methods. Most research in this field takes an aggregation perspective, whereby individual unimodal branches are combined at a predetermined point in the propagation process [7,8,9]. For example, the early fusion model EF-LSTM presented by Hou et al. [10] merges the feature representations of three modalities (text, speech, and image) and then feeds them into an LSTM for encoding. However, early fusion often results in redundant input vectors. Late fusion is primarily concerned with fusion at the decision level, which occurs after decoding, but it is unable to extract information about the interaction between modalities. The tensor fusion network (TFN), proposed by Zadeh et al. [11], uses an outer product to compute multidimensional tensors that capture interaction information between different modalities. However, this approach suffers from a large number of parameters and low efficiency. The low-rank multimodal fusion (LMF) algorithm, proposed by Liu et al. [12], is another notable approach. This low-rank tensor fusion technique improves on TFN, but it still causes parameter explosion once the features become too long. Zadeh et al. [13] proposed the memory fusion network (MFN), which captures both temporal and inter-modal interaction information through Delta attention and multi-view gated memory. However, the MFN does not sufficiently consider the heterogeneity between different data types. Tsai et al. [14] proposed the Multimodal Transformer (MulT), which addresses long-term dependencies across modalities by extending the Transformer structure. However, this approach introduces a more complex network structure.
In light of the shortcomings of existing multimodal sentiment analysis methods, including low modal fusion efficiency, large parameters, high computational complexity, and difficulties capturing deep inter-modal interaction information, we propose a multimodal sentiment analysis network structure based on a single-stream transformer. This structure employs a joint self-attention mechanism and a co-attention mechanism after feature extraction to address the issue of inter-modal variability and enhance the inter-modal fusion effect. Our contributions are summarized as follows:
- We propose a more efficient single-stream transformer model that can handle multiple modalities simultaneously, which utilizes an attentional mechanism to solve the long-term dependency problem, aiming to enhance the multimodal sentiment analysis ability. 
- We put forth two pre-training tasks that permit the joint multimodal representation to be fine-tuned based on other tasks, thereby enhancing the model’s capacity to effectively integrate multimodal data. 
- Experiments were conducted on the CMU-MOSI and CMU-MOSEI datasets, and the results demonstrated that our method outperformed other methods in terms of ACC-2 and F1 value metrics, thereby substantiating the efficacy of the proposed method. 
The remainder of this paper is organized as follows: Section 2 reviews the current state of research on multimodal sentiment analysis. Section 3 describes the proposed modeling framework. Section 4 presents the experiments conducted to evaluate the proposed model. Finally, Section 5 summarizes the proposed method.
  3. Results
  3.1. Overall Architecture
As depicted in Figure 1, the SS-Trans model is composed of three parts: the joint embedding, the backbone, and the prediction part. Initially, the joint embedding part links the modal pairs, such as (T, V) and (T, A), before the SS-Trans part, in which V′ and A′
are generated by the fusion component during the joint embedding. Subsequently, the backbone part of SS-Trans is used, which is a single-stream transformer model utilizing the textual, visual, and audio modalities. SS-Trans is pre-trained on two tasks simultaneously: masked language modeling (MLM) and text and image alignment (TIA). The MLM task is comparable to the pre-training task employed in BERT: a random selection of words or image regions in the input is masked, and the model learns to predict them at the output. TIA learns the relationship between modalities by predicting, from the model output, whether the multimodal data are paired. Multimodal sentiment analysis aims to learn a function that maps the multimodal input to sentiment scores or categories. This function may be learned through either a regression task or a classification task.
  3.2. Joint Embedding
First, a set of multimodal signals, comprising text T, audio A, and visual signal V, is used to predict emotions and moods. These signals can be regarded as a triple (T, A, V), defined as follows:
These signals are sequences with a fixed length of L, each with its own feature dimension. Inputs shorter than L are padded with zero vectors.
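The zero-padding step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `pad_to_length` is hypothetical, and truncating inputs longer than L is an assumption, since the text only describes padding shorter inputs.

```python
from typing import List


def pad_to_length(seq: List[List[float]], length: int) -> List[List[float]]:
    """Zero-pad a sequence of feature vectors to `length` steps.

    Sequences longer than `length` are truncated (an assumption, see lead-in).
    """
    if not seq:
        raise ValueError("need at least one feature vector to infer the dimension")
    dim = len(seq[0])
    clipped = [list(vec) for vec in seq[:length]]
    # range() of a non-positive number is empty, so no padding is added then.
    padding = [[0.0] * dim for _ in range(length - len(clipped))]
    return clipped + padding


audio = [[0.2, 0.1, 0.4], [0.3, 0.0, 0.5]]  # 2 time steps, 3 features each
padded = pad_to_length(audio, length=4)      # 4 time steps, last 2 are zeros
```

Each modality would be padded independently, so that all three sequences share the same length L while keeping their own feature dimension.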
In SS-Trans, text, visual, and audio aspects are combined in pairs, such as 
 and 
. The three pairs, 
, 
, and 
, are used as inputs such that 
 and 
are generated by the fusion component. Since the importance of the textual modality has been demonstrated by previous researchers, the text modality T serves as the anchor point. The modal pairs are defined as follows:
When the three modal pairs are processed at the same time in one training step, backpropagation is performed for all of them.
For joint embedding, two main components can be identified: text embedding and fusion. Text embedding encompasses two sub-components, namely token embedding and position embedding. The former converts word tokens in the text into numerical vectors, whereas the latter encodes each token's position in the text. However, it is important to note that video and audio signals lack continuous features and position embedding is not required for them. The model uses the standard BERT token embedding and position embedding, and the output of the text embedding serves as the input to the fusion component.
  3.3. Fusion Component
In the fusion stage, initially, a linear layer is used to align the scale of the text and other modalities. Next, the two distinct modalities are joined, and segment embedding is incorporated to differentiate them. Ultimately, layer normalization is executed. The process of coding can be presented as follows:
In Equation (3),  denotes the combination of two modalities, A and B, with A indicating the text modality and B indicating the audio modality. In Equation (4), the fusion of two modalities, A and C, is presented, where A is the text modality and C is the visual modality. Since previous research has highlighted the significance of the textual modality, we utilize it as the foundation and expand on it through the use of two augmented modalities, with their dimensions varying according to the text. Specifically, Equations (5) and (6) show that B and C are connected in series with A. Afterwards,  and  must be added to the segment embedding by merging A with B′ and C′, respectively, based on the length of the sequence in item . Similarly, modals are distinguished using the segment embedding. Fragment embedding distinguishes sentences belonging to BERT, which are composed of 0 and 1 of A, B, and C. Finally, the Layer Norm (LN) refers to the regularized dimension . The dimensions and length of  and  are , and the length of  is L.
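The fusion steps (linear alignment, joining, segment embedding, layer normalization) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper names are hypothetical, the two modalities are assumed to be joined along the sequence axis as in BERT sentence pairs, segment id 0 is assumed to mark text and 1 the other modality, and the layer normalization omits the learned gain and bias.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(x: Matrix, weight: Matrix, bias: Vector) -> Matrix:
    """Project each row of x (n x d_in) to d_out using weight (d_in x d_out)."""
    return [
        [sum(xi * w_row[j] for xi, w_row in zip(row, weight)) + bias[j]
         for j in range(len(bias))]
        for row in x
    ]


def layer_norm(x: Matrix, eps: float = 1e-5) -> Matrix:
    """Normalize each row to zero mean and (near) unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def fuse(text: Matrix, other: Matrix, weight: Matrix, bias: Vector,
         segment_emb: Matrix) -> Matrix:
    """Align `other` to the text width, join, add segment embeddings, normalize."""
    projected = linear(other, weight, bias)             # B -> B'
    joined = text + projected                           # join along the sequence axis
    segments = [0] * len(text) + [1] * len(projected)   # 0 = text, 1 = other modality
    summed = [[v + s for v, s in zip(row, segment_emb[seg])]
              for row, seg in zip(joined, segments)]
    return layer_norm(summed)


rng = random.Random(0)
text = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]    # A: 2 steps, width 4
audio = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]   # B: 2 steps, width 3
weight = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]  # maps width 3 to 4
bias = [0.0] * 4
segment_emb = [[0.0] * 4, [0.1] * 4]                                 # one vector per segment id
fused = fuse(text, audio, weight, bias, segment_emb)
```

The same function would be called once for the text and audio pair and once for the text and visual pair, each with its own projection weights.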
  3.4. SS-Trans
The diagram in Figure 2 illustrates the internal configuration of SS-Trans. The E tokens, which are the output of the joint embedding, are transformed into H tokens, the final output of SS-Trans. Within the H tokens, the [CLS] token is passed to the Pooler to accomplish downstream tasks. The Pooler combines a fully connected layer with the tanh activation function. By processing the three pairs T, V′, A′ through the Pooler, we obtain one pooled representation for each, as follows:
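A pooler of this kind, a fully connected layer followed by tanh, can be sketched as below. The sketch assumes, as in BERT, that the layer is applied to the hidden state at the first ([CLS]) position; the function name and the toy weights are hypothetical.

```python
import math
from typing import List


def pooler(hidden_states: List[List[float]],
           weight: List[List[float]],
           bias: List[float]) -> List[float]:
    """Dense layer followed by tanh, applied to the first token's hidden state.

    `weight` holds one row of input weights per output unit.
    """
    first = hidden_states[0]  # assumed to be the [CLS] position
    return [
        math.tanh(sum(h * w for h, w in zip(first, row)) + b)
        for row, b in zip(weight, bias)
    ]


hidden = [[1.0, -1.0], [5.0, 5.0]]   # two H tokens of width 2
weight = [[0.5, 0.5], [1.0, 0.0]]
bias = [0.0, 0.0]
pooled = pooler(hidden, weight, bias)
```

Only the first row of `hidden` influences the result, which is what makes the pooled vector a fixed-size summary regardless of sequence length.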
  3.5. Pre-Training Tasks
  3.5.1. MLM
Masked Language Modelling (MLM): In the MLM pre-training task, we follow the same strategy as BERT and randomly mask 15% of the input text tokens. The goal of this task is to recover the original tokens from the unmasked text tokens and the tokens of the other modality. As the loss function of the MLM task, we minimize the negative log-likelihood:
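The masking step and the loss can be illustrated with a short sketch. It is a simplification under stated assumptions: every selected token is replaced by `[MASK]` (BERT additionally swaps some selected tokens for random tokens or leaves them unchanged), special tokens are assumed to be exempt from masking, and the probabilities passed to `mlm_nll` stand in for the model's predicted probability of each original token. All names are hypothetical.

```python
import math
import random
from typing import Dict, List, Optional, Tuple

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}  # assumed to be exempt from masking


def mask_tokens(tokens: List[str], mask_prob: float = 0.15,
                rng: Optional[random.Random] = None
                ) -> Tuple[List[str], Dict[int, str]]:
    """Replace each ordinary token by [MASK] with probability `mask_prob`.

    Returns the masked sequence and a map from masked position to original token.
    """
    rng = rng or random.Random()
    masked: List[str] = []
    targets: Dict[int, str] = {}
    for i, tok in enumerate(tokens):
        if tok not in SPECIAL_TOKENS and rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


def mlm_nll(target_probs: List[float]) -> float:
    """Mean negative log-likelihood of the original tokens at masked positions."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


tokens = ["[CLS]", "the", "movie", "was", "great", "[SEP]"]
masked, targets = mask_tokens(tokens, rng=random.Random(0))
loss = mlm_nll([0.5, 0.25])  # model assigns 0.5 and 0.25 to the two true tokens
```

The loss shrinks towards zero as the model assigns higher probability to the original tokens at the masked positions.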
  3.5.2. TIA
Text and Image Alignment (TIA): Each image is randomly replaced with another image with a probability of 0.5. The TIA head, a single linear layer, projects the aggregated output features onto a binary class that indicates whether the text and image match. We compute the negative log-likelihood loss as our TIA loss:
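The construction of TIA training examples and the binary loss can be sketched as follows. The label convention (1 for a matched pair, 0 for a replaced image), the function names, and the placeholder image identifiers are assumptions made for illustration.

```python
import math
import random
from typing import List, Optional, Tuple


def make_tia_example(index: int, images: List[str], replace_prob: float = 0.5,
                     rng: Optional[random.Random] = None) -> Tuple[str, int]:
    """Return (image, label) for the text at `index`.

    With probability `replace_prob` the paired image is swapped for a
    different one and the label is 0 (mismatch); otherwise the label is 1.
    """
    rng = rng or random.Random()
    if len(images) > 1 and rng.random() < replace_prob:
        other = rng.choice([j for j in range(len(images)) if j != index])
        return images[other], 0
    return images[index], 1


def tia_nll(p_match: float, label: int) -> float:
    """Negative log-likelihood of the true label given the predicted match probability."""
    p = p_match if label == 1 else 1.0 - p_match
    return -math.log(p)


images = ["img_a", "img_b", "img_c"]
image, label = make_tia_example(0, images, rng=random.Random(0))
loss = tia_nll(0.9, 1)  # confident and correct prediction gives a small loss
```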
  3.6. Predicting Outcomes
To achieve a coherent joint representation, a self-attention layer and a co-attention layer are applied to the three pooled representations derived from the [CLS] token of each modal pair. Notably, the [CLS] tokens are solely employed for the classification task. First, the self-attention layer is applied to the three representations. The purpose of the self-attention layer is to enable the model to allocate attention between different parts of the sequence and capture internal dependencies. Figure 3 displays the self-attention layer. The formulas are as follows:
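For orientation, a self-attention layer of the standard scaled dot-product form can be sketched as below. This assumes the commonly used formulation, softmax(QKᵀ/√d)V, and omits the learned query, key, and value projections; it is not a reproduction of the paper's own formulas, and the function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """For each query row, return the attention-weighted average of the value rows."""
    d_k = len(k[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
                  for k_row in k]
        weights = softmax(scores)
        out.append([sum(w * v_row[c] for w, v_row in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def self_attention(x: Matrix) -> Matrix:
    """Self-attention: queries, keys, and values all come from the same input."""
    return scaled_dot_product_attention(x, x, x)


x = [[1.0, 0.0], [0.0, 1.0]]
attended = self_attention(x)
```

Each output row is a convex combination of the input rows, weighted towards the rows most similar to the query.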
The representations of the different modalities are then further fused through a co-attention mechanism that allows the model to allocate attention across the modalities and capture cross-modal dependencies. Figure 4 showcases the co-attention layer, where Q, K, and V are all identical modalities. The formulas are as follows:
Finally, the co-representations of each pair of different modalities are connected and transmitted to a dense connectivity layer to predict the results.
        
  3.7. Full Pre-Training Loss
To optimize our model parameters, we implemented an alternating optimization strategy, iteratively improving our pre-training task. We employed distinct loss functions for sentiment analysis and emotion recognition, each comprising the sum of a task loss and a cooperative loss. The cooperative loss covers the pre-training tasks mentioned earlier. Because it is computed on three different modal pairs, it has three instances, one per pair. Hence, the final objective function is the mean of the three cooperative losses plus the task loss. The task loss is the loss function of the downstream task: since sentiment analysis is a regression task, the mean squared error (MSE) loss is employed, and for emotion recognition we used the cross-entropy loss. The objective function can be summarized as follows:
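The combination described in the text, a task loss plus the mean of the three per-pair cooperative losses, can be sketched as follows. Equal weighting of the two terms is an assumption (the text mentions no weighting coefficient), and the function names and numbers are illustrative only.

```python
import math
from typing import List


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error, used here for the sentiment regression task."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def cross_entropy(probs: List[float], label: int) -> float:
    """Cross-entropy for one sample, used here for emotion recognition."""
    return -math.log(probs[label])


def total_loss(task_loss: float, cooperative_losses: List[float]) -> float:
    """Task loss plus the mean of the per-pair cooperative losses."""
    return task_loss + sum(cooperative_losses) / len(cooperative_losses)


task = mse([1.0, 2.0], [1.0, 4.0])   # regression-style task loss
pair_losses = [0.3, 0.6, 0.9]         # one cooperative loss per modal pair
objective = total_loss(task, pair_losses)
```

For emotion recognition, `cross_entropy` would replace `mse` as the task loss while the cooperative term stays the same.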
  4. Experiments
  4.1. Datasets
In the broader domain of machine learning and artificial intelligence, a diverse range of sentiment analysis datasets has emerged in response to the growing demand for data. The models employed for multimodal sentiment analysis must demonstrate robust generalization capabilities, enabling the inference of valuable social information and their effective deployment in industrial applications. The dataset utilized for this task should possess specific attributes, including the participation of diverse speakers, gender representation, discussion topics, colloquialisms, vocabulary usage, intensity of sentiment, and a spectrum of data volumes. Table 1 enumerates the frequently utilized multimodal sentiment datasets.
In this study, the CMU-MOSI and CMU-MOSEI datasets were selected for experimentation. The CMU-MOSEI dataset is one of the largest multimodal sentiment analysis datasets currently available, encompassing comprehensive sentiment and emotion annotations. The CMU-MOSEI dataset comprises over 23,000 sentences from more than 1000 YouTube speakers, encompassing a diverse array of topics. The use of this training set renders the model more robust. The CMU-MOSEI dataset is also labeled with the polarity of each sentiment and provides a careful classification and intensity score for each sentiment, which offers more detailed information about the sentiment and improves the accuracy of sentiment analysis. Both datasets are also similar in terms of their categorization of sentiments, which ensures consistency in the experiments, allowing for fusion at the decision level.
  4.1.1. CMU-MOSI
CMU-MOSI (Multimodal Opinion Level Sentiment Intensity) was developed by Amir Zadeh et al. [21]. The dataset comprises 93 YouTube videos, covering 2199 utterances. Each utterance is labeled on a continuum from −3 to +3 by five different annotators, with negative sentiment indicated by values below 0 and positive sentiment by values above 0.
  4.1.2. CMU-MOSEI
The CMU-MOSEI dataset [22] is the most extensive dataset for sentiment analysis and emotion recognition at the utterance level. It features over 65 h of annotated videos, with 1000 speakers and 250 topics, drawn from the online video-sharing platform YouTube. Since many industrial products use similar data, this makes it one of the most useful datasets. To reduce potential bias, every video utterance is annotated by three different sources. The sentiment polarity of every utterance was labeled with a value ranging from −3 to +3.
  4.2. Data Preprocessing
In this paper, we investigate sentiment analysis based on multimodal data, focusing on feature extraction and fusion analysis for text, image, and audio data. Due to the heterogeneity of multimodal data, preprocessing must be performed separately for each modality with a modality-specific method. We converted the data into different formats to train different types of tasks. For example, for the visual–semantic task, the features of the textual part are the superposition of the word embeddings of the tokens, the positional encoding, and the clause encoding. The word embeddings are mapped to the vocabulary used in BERT, with the implementation based on the open-source Hugging Face interface. Following the BERT model, [CLS] and [SEP] markers were inserted at the beginning and end of the text. The audio part contains sound quality features superimposed on positional encoding, the image part contains region features superimposed on positional encoding, and the two parts were spliced according to the modal pairs. For the plain text task, the [CLS] and [SEP] markers were likewise inserted at the beginning and end of the text. Subsequently, word embeddings were overlaid with position and clause encoding, and the other modal parts were replaced by an all-zero matrix.
  4.3. Feature Extraction
  4.3.1. Text Modal Feature Extraction
Previous methods for extracting textual modal features have several shortcomings. For instance, the quality of the extracted textual features is directly affected by the evaluation function, and features cannot be extracted efficiently when a word has multiple meanings. In contrast, BERT can not only extract semantic feature information but also capture contextual information, thus providing a more effective solution. For this reason, we used the BERT model to extract text features and provide contextual features.
  4.3.2. Visual Modal Feature Extraction
Facial expressions serve as the primary means by which humans communicate emotions, with morphological changes in facial organs and muscles conveying crucial information that supports the recognition of facial emotions [23]. Recently, a variety of deep learning-based methods have been developed for facial feature extraction, including FaceNet [24], MTCNN [25], and other network models. For the CMU-MOSI and CMU-MOSEI datasets, we extracted key frames from the videos at a rate of 30 frames per second and then used the MTCNN network to detect faces in the images. Figure 5 shows the three-stage image feature extraction process using the MTCNN network. First, the image is fed into the fully convolutional network P-Net, which generates candidate windows and bounding-box regression vectors; then, the output of P-Net is fed into R-Net to remove the erroneous windows generated by P-Net; finally, O-Net produces the final predicted face images and the face position information. Once the aligned face images are obtained, the OpenFace [26] tool is used to extract emotional features from the faces.
  4.3.3. Audio Modal Feature Extraction
Because of the inherent complexity of audio data and the variety of features collected by traditional signal-processing methods, audio data can be challenging to understand and analyze. To improve efficiency, the COVAREP [27] speech processing toolkit was used to extract audio features. This tool extracts the fundamental frequency, voiced/unvoiced (VUV) decisions, the normalized amplitude quotient (NAQ), Mel Frequency Cepstral Coefficients (MFCCs), and other features from audio data at a rate of 30 frames per second. In this paper, we use the fundamental frequency (F0), the Constant-Q transform (CQT), and MFCCs.
  4.4. Experimental Setup and Evaluation Metrics
After data preprocessing, the text feature dimension was 768, the audio feature dimensions of CMU-MOSI and CMU-MOSEI data were 74 and 74, and the visual feature dimensions were 47 and 35.
In our experiments, we used the BERT-base pre-trained model with the Adam optimizer. We set the batch size to 64, the initial learning rate to 3 × 10⁻⁴, and the seq_len to 50. In the MLM pre-training task, we followed BERT’s strategy of randomly masking the input text tokens with a 15% probability. In the TIA pre-training task, we randomly replaced the matched image with another image with a probability of 0.5. The experimental metadata statistics are presented in Table 2.
The PyTorch 1.5 deep learning framework was used to build the model, which takes pre-processed data as the input and outputs the results of sentiment analysis or emotion recognition. In this experiment, training used a batch size of 32 and 40 iterations, and the parameters of the model were optimized with the Adam optimizer. Early stopping was employed to prevent overfitting. Keeping the hyperparameters above fixed, we conducted ablation experiments by selectively removing parts of the model’s structure to assess their respective effectiveness. The concrete results of our experiments are explained in detail in Sections 4.5 and 4.6.
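Early termination of training is commonly implemented with a patience counter on the validation loss. The sketch below shows that common scheme only as an illustration; the patience value, the monitored quantity, and the function name are hypothetical, since the text does not state the stopping criterion that was used.

```python
from typing import List


def early_stopping_epoch(val_losses: List[float], patience: int = 3) -> int:
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs; otherwise it runs to the end.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.6]
stop_at = early_stopping_epoch(losses, patience=3)
```

With these made-up losses, the best value is reached at epoch 2 and the next three epochs fail to improve on it, so training stops before the later improvement is seen; a larger patience trades training time for that risk.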
The MOSI and MOSEI datasets were evaluated using 7-category accuracy (A7: sentiment scores range over [−3, 3]), 2-category accuracy (A2: positive/negative sentiment), F1 score, mean absolute error (MAE), and Pearson correlation (Corr).
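These metrics can be computed from predicted and annotated sentiment scores as sketched below. Several conventions here are assumptions rather than details taken from the text: scores of zero or above are counted as positive, the 7-category label is obtained by clipping to [−3, 3] and rounding with Python's `round` (which rounds halves to the nearest even integer), and F1 is computed for the positive class only. The scores are made-up illustration data.

```python
import math
from typing import List


def mae(pred: List[float], true: List[float]) -> float:
    """Mean absolute error between predicted and annotated scores."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def pearson(pred: List[float], true: List[float]) -> float:
    """Pearson correlation coefficient (undefined for constant inputs)."""
    n = len(true)
    mean_p, mean_t = sum(pred) / n, sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    norm_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    norm_t = math.sqrt(sum((t - mean_t) ** 2 for t in true))
    return cov / (norm_p * norm_t)


def acc2(pred: List[float], true: List[float]) -> float:
    """Binary accuracy: a score >= 0 counts as positive (assumed convention)."""
    return sum((p >= 0) == (t >= 0) for p, t in zip(pred, true)) / len(true)


def acc7(pred: List[float], true: List[float]) -> float:
    """Seven-class accuracy after clipping to [-3, 3] and rounding."""
    def to_class(x: float) -> int:
        return int(round(max(-3.0, min(3.0, x))))
    return sum(to_class(p) == to_class(t) for p, t in zip(pred, true)) / len(true)


def f1_positive(pred: List[float], true: List[float]) -> float:
    """F1 score of the positive class."""
    tp = sum(p >= 0 and t >= 0 for p, t in zip(pred, true))
    fp = sum(p >= 0 and t < 0 for p, t in zip(pred, true))
    fn = sum(p < 0 and t >= 0 for p, t in zip(pred, true))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


pred = [2.4, -1.2, 0.3, -2.8]
true = [3.0, -1.0, -0.4, -3.0]
report = {"A2": acc2(pred, true), "A7": acc7(pred, true),
          "F1": f1_positive(pred, true), "MAE": mae(pred, true),
          "Corr": pearson(pred, true)}
```

Lower MAE is better, while higher values are better for the other four metrics.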
  4.5. Experimental Results and Analysis
Table 3 and Table 4 present the results of SS-Trans on multimodal sentiment analysis for the CMU-MOSI and CMU-MOSEI datasets, where SS-Trans was compared with the baselines for these datasets. As evidenced by Table 3 and Table 4, the proposed SS-Trans achieves the best performance among the compared multimodal sentiment analysis models. This demonstrates its ability to effectively capture local semantic associations within modalities, as well as the synergistic associations of sentiment semantics with various local information across multiple modalities. Specifically, SS-Trans improves ACC-2 on CMU-MOSI and CMU-MOSEI by 1.06% and 1.33%, respectively, and improves F1 by 1.50% and 1.62%, respectively. Furthermore, Table 3 shows that LMF [12] and TFN [11] have severely restricted performance, with a mean absolute error exceeding 0.90. This emphasizes the important role of modal fusion in sentiment analysis. Table 3 also demonstrates that MulT [14] outperforms MFM [28] and ICCN [29], indicating the transformer model’s ability to efficiently extract and represent the correlations among diverse modalities. In addition, the MMIM [30] model exceeds most baseline models in sentiment classification. However, whereas MMIM handles information interactions between modalities via a modal gating mechanism, SS-Trans incorporates an attention mechanism that allows the model to selectively attend to crucial features across modalities. As a result, SS-Trans outperforms MMIM in sentiment classification, indicating the benefit of an appropriately designed attention mechanism.
Table 5 displays the 2-category accuracy (A2) and F1 values of SS-Trans for emotion recognition on the CMU-MOSEI dataset. SS-Trans achieved the best results for most of the emotions in the emotion recognition task, including a superior A2 score for “Surprise”. To further explore the emotion recognition results of SS-Trans on the CMU-MOSEI dataset, Figure 6 shows that the F1 values of “Happiness” and “Anger” increased by 0.8% and 0.9%, respectively, with an even larger improvement of 1.3% for “Disgust”. The poor results of previous models have been attributed to the difficulty that traditional Convolutional Neural Network (CNN)-based methods have in capturing long-term dependencies, which matters because the dataset depends on modeling temporal relations. The transformer-based SS-Trans model is more adept at handling time series, can efficiently attend to and exploit changes in emotion and contextual information over time, and exhibits superior emotional perception. However, the dataset is unevenly distributed in terms of emotion labels; specifically, out of a total of 23,453 samples, only about 2000 relate to the emotions “Fear” and “Surprise”. Therefore, this result suggests that the SS-Trans model can identify rare emotions even when the data are unevenly distributed.
   4.6. Ablation Study
To evaluate the contribution of the individual modules in the proposed method, ablation experiments were carried out on the multimodal signals and the attention module. The experiments were performed on the CMU-MOSI and CMU-MOSEI datasets with consistent training parameters, using MAE and A2 as the evaluation metrics. The ablation scheme is shown in Table 6.
		
- -T: the text modality is removed from the multimodal signals.
- -V: the visual modality is removed from the multimodal signals.
- -A: the audio modality is removed from the multimodal signals.
- -co-self-attention: the attention module in SS-Trans is removed.
Table 7 shows the comparison results obtained from the model ablation experiments. First, to demonstrate the significance of verbal and non-verbal signals for multimodal sentiment analysis, individual modalities were removed from the multimodal signals. As presented in Table 7, removing either the visual or the audio modality resulted in a decline in performance, indicating the importance of non-verbal signals (both visual and audio) for multimodal sentiment analysis. This also highlights the complementary relationship between the textual, visual, and audio modalities. In addition, the audio modality was found to be more important than the visual modality for SS-Trans. Furthermore, to assess the attention module’s contribution to multimodal representation learning, we excluded the co-self-attention module. Removing it increased the mean absolute error by 0.092 on the CMU-MOSI dataset and by 0.104 on the CMU-MOSEI dataset, while the A2 value decreased by 5.2 and 4.5, respectively, indicating that attention-based multimodal representation learning is beneficial for multimodal sentiment analysis. A carefully designed attention module thus improves accuracy by improving cross-modal fusion. To facilitate observation, Figure 7 displays the A2 and MAE results of the ablation study. The results show that the full SS-Trans model outperforms all ablated variants, supporting the effectiveness of each module in our SS-Trans model.
   5. Conclusions
This study presents SS-Trans, a multimodal sentiment analysis model based on a single-stream transformer. In this method, feature fusion is performed over three modalities (text, visual, and audio), which enhances sentiment analysis by exploiting the more diverse information in multimodal data compared to unimodal data. The model fine-tunes the multimodal joint representation by passing the fused modalities through the proposed SS-Trans network and training it on two pre-training tasks, which further enhances the precision of sentiment analysis and emotion recognition. In addition, we integrated the acoustic and visual modal representations into multilevel textual features, introduced self-attention and co-attention mechanisms, and employed the attention mechanism to address the long-term dependency problem. The findings demonstrate that our proposed model, SS-Trans, outperforms other models on comprehensive performance metrics, including A2 and F1. While the accuracy of the model on both the CMU-MOSI and CMU-MOSEI datasets is commendable, the evaluation scope remains somewhat narrow. In subsequent research, we intend to refine the network model and extend SS-Trans to more tasks, with the aim of further improving model performance.