Article

Dynamic Tuning and Multi-Task Learning-Based Model for Multimodal Sentiment Analysis

1 School of Computer Science and Technology, Xinjiang University, Urumqi 830017, China
2 Information Research Center of Military Science, Beijing 100142, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(11), 6342; https://doi.org/10.3390/app15116342
Submission received: 9 April 2025 / Revised: 11 May 2025 / Accepted: 16 May 2025 / Published: 5 June 2025

Abstract:
Multimodal sentiment analysis aims to uncover human affective states by integrating data from multiple sensory sources. However, previous studies have focused on optimizing model architecture while neglecting the impact of objective function settings on model performance. In view of this, this study introduces a new framework, DMMSA, which exploits the intrinsic correlation of sentiment signals and enhances the model’s understanding of complex sentiments. DMMSA incorporates coarse-grained sentiment analysis to reduce task complexity. Meanwhile, it embeds a contrastive learning mechanism within each modality, which decomposes unimodal features into similar and dissimilar components, thus allowing the simultaneous consideration of both unimodal and multimodal emotions. We tested DMMSA on the CH-SIMS, MOSI, and MOSEI datasets. When only the optimization objectives were changed, DMMSA achieved accuracy gains of 3.2%, 1.57%, and 1.95% over the baseline in the five-class and seven-class classification tasks. In regression tasks, DMMSA reduced the Mean Absolute Error (MAE) by 1.46%, 1.5%, and 2.8% compared to the baseline.

1. Introduction

Multimodal sentiment analysis (MSA) involves integrating and synergistically parsing diverse heterogeneous data modalities [1,2,3], encompassing many information forms such as text, visual, auditory, and even biometric markers [4]. With the evolution of social media ecosystems and the proliferation of multimedia content, information presentation has evolved from pure text to richly illustrated content, culminating in today’s prevalent video-based information [5,6].
Traditional unimodal sentiment analysis is confined mainly to the textual domain [7]. In contrast, MSA encompasses a comprehensive interpretation of multiple perceptual channels, including visual cues (e.g., facial expressions, scene color, and body movement) and audio characteristics (e.g., pitch amplitude, frequency distribution, and speech tempo) [8,9]. MSA has garnered significant attention in recent research. On the one hand, human emotional expression inherently possesses cross-modal properties, with text, speech, and even haptic cues intricately interwoven to form sentiments, rendering a single modality insufficient to fully unveil the complexity of these sentiments [10,11]. On the other hand, through their deep fusion of multiple signal sources, MSA technologies significantly enhance the accuracy of sentiment recognition and understanding, fulfilling the high-precision sentiment intelligence demands in domains such as intelligent customer service, VR/AR experience optimization, and mental health assessment [12,13].
Existing deep learning-based MSA models typically consist of three core components: a single-modality feature extraction module, a multimodal feature fusion module, and a classification head [14]. The use of pre-trained models as the single-modality feature extraction module has been widely recognized as an effective strategy. Consequently, current researchers often focus on optimizing multimodal feature fusion methods and model training processes to further enhance the overall performance of the models.
Multimodal fusion has emerged as a core technique for understanding video contexts, demonstrating its value across numerous downstream tasks [15,16,17]. Prior research has proposed a series of fusion techniques for MSA. For instance, Yu et al. [39] employed a self-supervised joint learning strategy called Self-MM, which integrates discriminative information learned from individual unimodal tasks with shared similarity information from the multimodal task during the late fusion stage, thereby enhancing model performance.
While multimodal fusion techniques are crucial for models [19,20], setting optimization objectives is equally indispensable in model construction [5]. Suitable optimization objectives effectively guide the model toward continuous performance optimization throughout training [5]. Moreover, as shown in Figure 1, the setting of optimization objectives and model structure optimization focus on different modules, complementing each other. Early researchers typically used only the cross-entropy loss function to supervise the training of MSA models [14]. However, relying solely on single-task training makes it difficult to fully exploit the correlations between different modalities [5]. Therefore, subsequent researchers attempted to use multi-task learning methods, allowing MSA models to complete multiple related tasks simultaneously during training, thereby enhancing the utilization of different modality information. In this approach, the selection of related tasks is crucial, as different tasks can influence each other. Only by choosing the right combination of tasks can the model achieve better performance in multi-task training.
As shown in Figure 2, the sentiment intensity derived solely from analyzing text data is +1. When analyzing audio data alone, the sentiment intensity is −0.2. Similarly, when considering only visual information, the sentiment intensity is −1. However, when taking into account these three modalities of data comprehensively, the overall sentiment intensity of the information is 0. If we ignore the text information, we might conclude that the sentiment is negative. Conversely, if we overlook the visual information, we might infer that the sentiment is positive. In any case, it is impossible to accurately analyze the overall sentiment of the entire piece of information. Therefore, in the MSA task, single-modality sentiment has a direct impact on the overall sentiment [12,21,22]. Nevertheless, many existing methods tend to overlook the significance of unimodal sentiment [18,23]. Thus, enabling MSA models to focus on both uni- and multimodal sentiments concurrently has emerged as a formidable challenge in this field. Another challenge faced by MSA tasks lies in their broad range of sentiment ratings. For example, the MOSI dataset requires models to accurately map samples to a sentiment intensity scale of [−3, +3], which increases the prediction difficulty.
Given these challenges, we propose DMMSA, a Dynamic Tuning and Multi-Task Learning-Based MSA model. DMMSA ensures that the model can capture unimodal signals in detail and integrate multimodal information through the collaborative optimization of unimodal and multimodal tasks. The model is equipped with a text-oriented contrastive learning module to promote feature decoupling and enhance the depth and accuracy of sentiment understanding. Furthermore, incorporating coarse-grained sentiment classification tasks to converge the prediction range improves the accuracy of sentiment intensity determination. We implement Global Dynamic Weight Generation (GDWG) to avoid negative transfer effects and achieve the joint adjustment of model parameters, thereby maximizing overall performance.
The main contributions of this paper can be summarized as follows:
  • We propose the Multi-NT-Xent loss to guide the model in decomposing unimodal features and establishing text-centered contrastive relations.
  • By employing coarse-grained sentiment analysis tasks, we effectively converge the prediction range, reducing the complexity of modeling sentiment intensity.
  • To address the issue of unequal convergence rates among different tasks during multi-task training, we propose the GDWG strategy, effectively mitigating the negative transfer effects arising from such mismatches.
Our model is evaluated on three benchmark datasets: CH-SIMS [18], MOSI [24], and MOSEI [23]. The results show that DMMSA outperforms the baseline method in classification and regression tasks when the model structure remains unchanged and only the optimization objectives are replaced. Additionally, we conduct comprehensive ablation studies, substantiating the efficacy of each component within our proposed architecture.

2. Related Works

2.1. Multimodal Sentiment Analysis

As a core topic in affective computing, MSA has primarily been studied from two angles: representation learning and multimodal fusion strategies. In representation learning, Wang et al. [25] introduced the Recurrent Attended Variation Embedding Network (RAVEN), tailored for fine-grained structural modeling of non-verbal subword sequences and dynamically adjusting word-level representations in response to non-verbal cues. Regarding multimodal fusion techniques, Zadeh et al. [26] designed the Tensor Fusion Network to deeply model intra-modal and inter-modal relationships in online video analysis, addressing the transient variability of spoken language, gestures, and audio signals. Subsequently, they proposed the Memory Fusion Network [23], employing attention mechanisms for interactive information integration across different views. Sun et al. [4], attentive to heterogeneity issues, proposed an attention-based cross-modal fusion scheme that facilitates modal interactions through attention mechanisms, promoting the effective alignment of distinct modal features. However, these efforts overlooked the potential impact of optimization objective settings on model performance.
Yu et al. [18] recognized the significant influence of unimodal sentiment expressions on overall affective states, leading to the construction of the CH-SIMS dataset, which encompasses both uni- and multimodal sentiment intensity measures. Their study employed the L1 loss function for multimodal and unimodal sentiment analysis as a joint optimization objective. Experimental results revealed that incorporating unimodal sentiment analysis tasks enhanced the model’s accuracy in predicting holistic emotional dispositions. Yang et al. [5] decomposed unimodal representations into similarity and dissimilarity components, utilizing a text-centric contrastive learning approach. However, when implementing multi-task learning, they failed to adequately account for the potential negative transfer effects resulting from pronounced disparities in task convergence rates and loss scales.
In contrast, the DMMSA model incorporates a GDWG mechanism, enabling the model to adaptively adjust task weights based on the relative rates of loss decrease during training, effectively mitigating the detrimental impact of negative transfer on model performance. Moreover, DMMSA incorporates coarse-grained sentiment analysis tasks to constrain the prediction scope.

2.2. Contrastive Learning

Contrastive learning systematically constructs and discriminates between feature differences in positive and negative sample pairs to reveal intrinsic structural relationships within data [8,27,28,29]. This strategy has proven particularly effective in multimodal feature fusion research [5,30]. Specifically, Radford et al. [31] employed multimodal contrastive learning techniques to align image–text pairs, effectively alleviating the inherent data heterogeneity between visual and textual modalities and fostering widespread application in diverse multimodal downstream tasks such as visual question answering and caption generation. Similarly, Akbari et al. [32] trained a vision–audio–text translation model using the same contrastive learning approach, successfully achieving deep alignment among these three modalities.
For performing MSA tasks, Yang et al. [5] devised two contrastive learning mechanisms, intra-modal contrast and inter-modal contrast, to guide the model toward generating features that embody the homogeneity across modalities and capture the heterogeneity between them. This strategy ensures that the model attends equally to commonalities and differences in modal interactions during modeling. Nonetheless, while this method yielded promising results, it did not address the limitation of the traditional NT-Xent loss function, which is tailored for single-positive-pair settings and ill suited to scenarios involving multiple positive pairs. The NT-Xent loss is given by
$$\mathcal{L}_{NTX} = -\sum_{(a,p)\in P} \log \frac{\exp\big(\mathrm{sim}(a,p)/\tau_m\big)}{\sum_{(a,k)\in N\cup P} \exp\big(\mathrm{sim}(a,k)/\tau_m\big)}$$
where $\tau_m$ is the temperature coefficient that controls the similarity distribution, $(a,p)$ and $(a,k)$ denote positive and negative sample pairs, respectively, $N$ represents the set of negative pairs, and $P$ signifies the set of positive pairs.
However, the NT-Xent loss manifestly fails to converge to zero in situations involving more than one positive pair, i.e., $n > 1$. In light of this limitation, while leveraging contrastive learning strategies to aid the model in extracting both similarity and dissimilarity features, this paper proposes an improvement to the NT-Xent loss function tailored to accommodate multiple positive instances, namely the Multi-NT-Xent loss:
$$\mathcal{L}_{MNTX} = -\log \frac{\sum_{(a,p)\in P} \exp\big(\mathrm{sim}(a,p)/\tau_m\big)}{\sum_{(a,k)\in N\cup P} \exp\big(\mathrm{sim}(a,k)/\tau_m\big)}$$
Consequently, the model is effectively guided in its contrastive learning tasks, whether faced with a single positive pair or multiple ones.
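To make the contrast between the two losses concrete, the following is a minimal PyTorch sketch of the Multi-NT-Xent loss as defined above. It assumes sim(·,·) is cosine similarity and operates on a single anchor for clarity; the tensor shapes and the temperature value are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def multi_nt_xent(anchor, positives, negatives, tau=0.1):
    """Multi-NT-Xent loss for one anchor with several positive pairs.

    anchor:    (d,)       anchor feature, e.g., the text similarity feature
    positives: (n_pos, d) features forming positive pairs with the anchor
    negatives: (n_neg, d) features forming negative pairs with the anchor
    tau:       temperature controlling the sharpness of the similarity distribution
    """
    pos_sim = F.cosine_similarity(anchor.unsqueeze(0), positives, dim=-1) / tau
    neg_sim = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=-1) / tau
    all_sim = torch.cat([pos_sim, neg_sim], dim=0)
    # -log( sum_P exp(sim/tau) / sum_{N ∪ P} exp(sim/tau) )
    return torch.logsumexp(all_sim, dim=0) - torch.logsumexp(pos_sim, dim=0)
```

With a single positive pair this reduces to the standard NT-Xent term, so the same routine covers both cases.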

3. Methodology

In this section, we first provide a brief introduction to the task definition. We then describe the overall operational process of the DMMSA model, and finally, we detail our proposed model training method based on multi-task learning and dynamic tuning.

3.1. Task Definition

MSA aims to decipher sample sentiment states by harnessing multiple signals, encompassing the text ($I_t$), visual ($I_v$), and audio ($I_a$) modalities. Task types within this domain are typically categorized into two broad classes: classification and regression. Focusing on the latter, the proposed DMMSA model takes $I_t$, $I_v$, and $I_a$ as inputs, yielding an output sentiment intensity value $y^{*}$ constrained within the real interval $[-R, R]$, where $R$ defines the upper and lower bounds of the sentiment score.

3.2. Model Architecture

This section primarily introduces the overall architecture of the DMMSA model to facilitate the reader’s understanding. The overall architecture of the DMMSA model is depicted in Figure 3.
In the single-modality feature extraction stage, we employ the pre-trained BERT model to extract text features. BERT is a pre-trained language model based on the transformer architecture. During pre-training, it leverages a vast amount of unsupervised text data to learn rich linguistic knowledge and semantic information. Therefore, compared to models such as RNN, LSTM, and GRU, it can encode text more effectively. For processing visual and audio modality data, we use the transformer model as the visual encoder and audio encoder in DMMSA. The overall formulation is as follows:
$$T = \mathrm{BERT}(I_t)$$
$$V = \mathrm{Transformer}_v(I_v)$$
$$A = \mathrm{Transformer}_a(I_a)$$
where $T$, $V$, and $A$ represent the text, visual, and audio features extracted by the encoders, respectively; $\mathrm{Transformer}_v(\cdot)$ denotes the encoder for visual information, and $\mathrm{Transformer}_a(\cdot)$ denotes the encoder for audio information.
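As an illustration of this extraction stage, the sketch below wires a Hugging Face BERT encoder together with two transformer encoders for the visual and audio streams. The input projection layers, the use of the [CLS] token for text, and mean pooling for the other modalities are assumptions made for the sketch; the paper does not specify these details.

```python
import torch
import torch.nn as nn
from transformers import BertModel  # Hugging Face Transformers (assumed dependency)

class UnimodalEncoders(nn.Module):
    """Sketch of the single-modality feature extraction stage (T, V, A above)."""

    def __init__(self, visual_dim, audio_dim, d_model=128, n_layers=1, n_heads=4):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Project raw visual/audio descriptors to the transformer width.
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        layer_v = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        layer_a = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer_v = nn.TransformerEncoder(layer_v, num_layers=n_layers)
        self.transformer_a = nn.TransformerEncoder(layer_a, num_layers=n_layers)

    def forward(self, input_ids, attention_mask, visual, audio):
        # T: [CLS] representation of the utterance text.
        T = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state[:, 0]
        # V, A: mean-pooled outputs of the modality-specific transformer encoders.
        V = self.transformer_v(self.visual_proj(visual)).mean(dim=1)
        A = self.transformer_a(self.audio_proj(audio)).mean(dim=1)
        return T, V, A
```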
Next, we input the features from the different modalities into the feature decomposition module to obtain modality-similar and modality-dissimilar features. The feature decomposition module consists of six parallel projection modules, each of which comprises a fully connected layer with a Tanh activation function followed by layer normalization [5]. The formulas are as follows:
$$T_s = \sigma\big(\mathrm{Linear}_s^{T}(T)\big), \qquad T_d = \sigma\big(\mathrm{Linear}_d^{T}(T)\big)$$
$$V_s = \sigma\big(\mathrm{Linear}_s^{V}(V)\big), \qquad V_d = \sigma\big(\mathrm{Linear}_d^{V}(V)\big)$$
$$A_s = \sigma\big(\mathrm{Linear}_s^{A}(A)\big), \qquad A_d = \sigma\big(\mathrm{Linear}_d^{A}(A)\big)$$
where $T_s$, $V_s$, and $A_s$ represent the modality-similar features extracted by the mapping modules; $T_d$, $V_d$, and $A_d$ represent the modality-dissimilar features; $\sigma$ denotes the Tanh activation function; and $\mathrm{Linear}$ denotes a fully connected layer. The feature decomposition module decomposes features from different modalities into modality-similar and modality-dissimilar features. This is attributed to the incorporation of two tasks, single-modality sentiment classification and contrastive learning, in the training process, which helps the model effectively perform feature decomposition. The details are discussed in the section on multi-task learning.
Finally, we concatenate the extracted modality-similar and modality-dissimilar features and input them into a multi-layer perceptron (MLP) network to obtain the final sentiment prediction result. The formula is as follows:
$$y^{*} = \mathrm{MLP}\big([T_s; T_d; A_s; A_d; V_s; V_d]\big)$$
where $y^{*}$ represents the predicted result and “;” denotes concatenation.
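The following sketch puts the decomposition and fusion steps together: six parallel projection modules (Linear → Tanh → LayerNorm, as described above) split each modality into similar and dissimilar parts, and an MLP maps the concatenation to a sentiment intensity. The hidden size and MLP depth are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ProjectModule(nn.Module):
    """One of the six parallel projection modules: Linear -> Tanh -> LayerNorm."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x):
        return self.norm(torch.tanh(self.linear(x)))

class DecomposeAndFuse(nn.Module):
    """Sketch: decompose T/V/A into similar and dissimilar parts, then fuse with an MLP."""
    def __init__(self, dim_t, dim_v, dim_a, hidden=128):
        super().__init__()
        self.proj = nn.ModuleDict({
            name: ProjectModule(dim, hidden)
            for name, dim in [("t_s", dim_t), ("t_d", dim_t),
                              ("v_s", dim_v), ("v_d", dim_v),
                              ("a_s", dim_a), ("a_d", dim_a)]
        })
        self.mlp = nn.Sequential(
            nn.Linear(6 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, T, V, A):
        feats = {
            "t_s": self.proj["t_s"](T), "t_d": self.proj["t_d"](T),
            "v_s": self.proj["v_s"](V), "v_d": self.proj["v_d"](V),
            "a_s": self.proj["a_s"](A), "a_d": self.proj["a_d"](A),
        }
        # Concatenation order follows [T_s; T_d; A_s; A_d; V_s; V_d].
        fused = torch.cat([feats["t_s"], feats["t_d"], feats["a_s"],
                           feats["a_d"], feats["v_s"], feats["v_d"]], dim=-1)
        y_star = self.mlp(fused)   # final sentiment intensity prediction
        return y_star, feats
```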

3.3. Multi-Task Learning

In this section, we focus on the setup of multi-task learning and how we mitigate negative transfer through dynamic tuning.
As shown in Figure 3, we employ a hard parameter sharing method for multi-task learning. In this approach, all tasks share the hidden layer parameters of the neural network, and each task has its own output layer. This method not only reduces the risk of overfitting but also encourages the model to find a general representation that covers all tasks.
In terms of task setup, we first draw inspiration from previous studies and introduce a unimodal sentiment classification task to help the model better analyze the sentiment of the information [5]. Given that the MSA task requires the model to predict a value from a wide range, which is quite challenging, we further add a coarse-grained sentiment analysis task to help the model narrow down the prediction range and thus improve prediction accuracy. Moreover, to assist the model in better distinguishing between similarity and dissimilarity features, we also incorporate a contrastive learning task. Additionally, by leveraging the collaborative effect of the unimodal sentiment analysis and contrastive learning tasks, we enable the model to accurately extract similarity and dissimilarity features from the three information modalities. This, in turn, enhances the model’s effective utilization of the intrinsic relevance and distinctive emotional inclinations embedded in the sentiment signals of different modalities. The final multi-task learning objective function is given by
$$\mathcal{L}_{MSA} = \mathcal{L}_{MSR} + \mathcal{L}_{MSC} + \lambda_{Uni}\,\mathcal{L}_{Uni} + \lambda_{CL}\,\mathcal{L}_{CL}$$
where $\lambda_{Uni}$ and $\lambda_{CL}$ are the weights generated by the GDWG method, $\mathcal{L}_{MSR}$ is the multimodal sentiment regression loss, $\mathcal{L}_{MSC}$ is the multimodal sentiment classification loss, $\mathcal{L}_{Uni}$ is the unimodal sentiment analysis loss, and $\mathcal{L}_{CL}$ is the contrastive learning loss.
$\mathcal{L}_{MSR}$: Multimodal Sentiment Regression Loss. The purpose of this task is to guide the model in integrating signals from different modalities to estimate the sentiment intensity of samples accurately. Herein, we feed the fused decomposed similarity and dissimilarity features into a multimodal MLP for sentiment intensity prediction, associating its output with the given multimodal sentiment intensity labels via a smooth L1 loss function to derive this loss. The formulaic expression is as follows:
$$y^{*} = \mathrm{MLP}\big([T_s; T_d; A_s; A_d; V_s; V_d]\big)$$
$$\mathcal{L}_{MSR} = \begin{cases} 0.5\,(y^{*}-y)^{2}/\varphi, & \text{if } |y^{*}-y| < \varphi \\ |y^{*}-y| - 0.5\,\varphi, & \text{otherwise} \end{cases}$$
where $y^{*}$ represents the predicted result, “;” denotes concatenation, $y$ represents the multimodal sentiment label, and $\varphi$ controls the smoothness.
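This piecewise form matches PyTorch’s built-in smooth L1 loss, where $\varphi$ plays the role of the `beta` argument. A minimal sketch follows; the value of $\varphi$ and the sample tensors are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

phi = 1.0                                   # smoothness parameter (illustrative value)
msr_criterion = nn.SmoothL1Loss(beta=phi)   # 0.5*(x)^2/beta if |x| < beta, else |x| - 0.5*beta

y_star = torch.tensor([0.8, -0.3])          # hypothetical multimodal predictions
y      = torch.tensor([1.0, -0.5])          # multimodal sentiment intensity labels
loss_msr = msr_criterion(y_star, y)
```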
$\mathcal{L}_{MSC}$: Multimodal Sentiment Classification Loss. The purpose of this task is to direct the model toward the coarse-grained classification of sentiment, thus narrowing down the prediction scope. Unlike the fine-grained sentiment analysis task, this task does not require the model to predict the exact intensity of sentiment, but rather the interval to which the sentiment intensity belongs. Initially, we map the sentiment intensity labels of samples to preset sentiment polarity classes (such as positive, negative, and neutral) according to predefined rules, thereby creating a set of sentiment polarity labels. Next, the decomposed multimodal features are combined and fed into a sentiment classifier, which produces the probability distribution of each sample over the different sentiment polarities. Finally, the classifier’s predicted probability distribution is compared with the assigned sentiment polarity labels, and the cross-entropy loss function is used to measure the discrepancy between the two. The specific formulaic expression is as follows:
$$y^{MSC} = \mathrm{Classifier}\big([T_s; T_d; A_s; A_d; V_s; V_d]\big)$$
$$\mathcal{L}_{MSC} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log\big(y^{MSC}_{i,c}\big)$$
where $N$ denotes the number of samples and $C$ represents the number of categories.
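A small sketch of how this coarse-grained supervision could be built: continuous intensity labels are bucketed into polarity classes and compared against classifier logits with cross-entropy. The ±0.1 neutral band mirrors the three-class interval scheme described later for CH-SIMS and is an assumption here, as are the sample tensors.

```python
import torch
import torch.nn as nn

def to_polarity(y, neutral_band=0.1):
    """Map continuous sentiment intensities to coarse polarity classes.
    0 = negative, 1 = neutral, 2 = positive (assumed class indices)."""
    labels = torch.full_like(y, 1, dtype=torch.long)   # default: neutral
    labels[y <= -neutral_band] = 0                     # negative: y <= -0.1
    labels[y > neutral_band] = 2                       # positive: y > 0.1
    return labels

y = torch.tensor([-0.5, 0.0, 0.7])          # multimodal intensity labels (hypothetical)
logits = torch.randn(3, 3)                  # classifier output over 3 polarities
loss_msc = nn.CrossEntropyLoss()(logits, to_polarity(y))
```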
$\mathcal{L}_{Uni}$: Unimodal Sentiment Analysis Loss. The purpose of this task is to guide the model in delving into the sentiment information embedded within each modality. Here, to ensure the consistent treatment of modal features, we feed the similarity features ($T_s$, $V_s$, and $A_s$) and dissimilarity features ($T_d$, $V_d$, and $A_d$) of each modality separately into a weight-sharing MLP layer. The MLP layer outputs six sentiment predictions, $u^{*} = \mathrm{MLP}(T_s/T_d/A_s/A_d/V_s/V_d)$, with the similarity features used to infer the multimodal sentiment label $y$ and the dissimilarity features used to predict the corresponding unimodal sentiment labels $y_{t/v/a}$. In the absence of unimodal labels, the dissimilarity-feature prediction task adjusts to instead predict the multimodal label $y$, maintaining the coherence of model training. Finally, a smooth L1 loss function is employed for each prediction to measure the loss between the prediction and the respective ground-truth label $u = [y/y/y/y_t/y_v/y_a]$. The specific formulaic expression is as follows:
$$\mathcal{L}_{Uni} = \begin{cases} 0.5\,(u^{*}-u)^{2}/\varphi, & \text{if } |u^{*}-u| < \varphi \\ |u^{*}-u| - 0.5\,\varphi, & \text{otherwise} \end{cases}$$
$\mathcal{L}_{CL}$: Contrastive Learning Loss. Contrastive learning tasks help models learn the commonalities among positive examples and the differences from negative examples by comparing positive and negative samples. Through this task, models can extract high-quality similarity and dissimilarity features, thereby paying attention to both unimodal and multimodal sentiment information. Here, considering that text modality data often assume a dominant role in MSA tasks, with other modalities providing auxiliary information to enhance prediction accuracy, we opt to use text data as the reference anchor for constructing positive and negative sample pairs [12]. The specific configuration is as follows:
$$N = \{(T_s, T_d),\ (T_s, V_d),\ (T_s, A_d)\}$$
$$P = \{(T_s, V_s),\ (T_s, A_s)\}$$
Subsequently, we employ our proposed Multi-NT-Xent loss to guide the model in maximizing the similarity between positive sample pairs while minimizing the similarity between negative sample pairs. The calculation formula for the Multi-NT-Xent loss is as follows:
$$\mathcal{L}_{MNTX} = -\log \frac{\sum_{(a,p)\in P} \exp\big(\mathrm{sim}(a,p)/\tau_m\big)}{\sum_{(a,k)\in N\cup P} \exp\big(\mathrm{sim}(a,k)/\tau_m\big)}$$
where $\tau_m$ is the temperature coefficient that controls the similarity distribution; $(a,p)$ and $(a,k)$ denote positive and negative sample pairs, respectively; $N$ represents the set of negative pairs; and $P$ signifies the set of positive pairs.
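For illustration, the text-anchored pair construction above can be fed to the `multi_nt_xent` sketch from Section 2.2 roughly as follows; the feature tensors are random stand-ins for the decomposed features.

```python
import torch

d = 128                                        # illustrative feature dimension
T_s, T_d = torch.randn(d), torch.randn(d)      # text similarity / dissimilarity features
V_s, V_d = torch.randn(d), torch.randn(d)      # visual features
A_s, A_d = torch.randn(d), torch.randn(d)      # audio features

positives = torch.stack([V_s, A_s])            # P = {(T_s, V_s), (T_s, A_s)}
negatives = torch.stack([T_d, V_d, A_d])       # N = {(T_s, T_d), (T_s, V_d), (T_s, A_d)}
loss_cl = multi_nt_xent(T_s, positives, negatives, tau=0.1)  # sketch from Section 2.2
```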

3.4. Global Dynamic Weight Generation

In multi-task learning scenarios, distinct tasks often exhibit asynchronous convergence patterns, leading specific tasks to stabilize either too early or too late [22,24,33,34]. This inconsistency in convergence rates can engender negative transfer, where the learning process of one task adversely impacts the performance of other tasks, thereby compromising overall model effectiveness [35]. To address this challenge, we introduce the GDWG mechanism, which adaptively adjusts the relative weights of the individual tasks during training. Specifically, it assesses the descent rate of each task’s loss function at every training stage and, based on these assessments, generates a weight value for each task. The specific mathematical expression is as follows:
$$w_k(t-1) = \frac{\mathcal{L}_k(t-1)}{\mathcal{L}_k(1)}$$
$$\lambda_k(t-1) = \frac{\exp\big(w_k(t-1)/\tau\big)}{\sum_{j=1}^{J}\exp\big(w_j(t-1)/\tau\big)}$$
where $w_k(t)$ denotes the relative decay rate of task $k$ at the $t$-th training stage, $\lambda_k(t)$ represents the weight value assigned to task $k$ at stage $t$, and $\mathcal{L}_k(t)$ signifies the loss incurred by task $k$ at stage $t$. $J$ is the total number of tasks subject to adjustment, and $\tau$ is a temperature coefficient that governs the magnitude of weight updates, with smaller values indicating greater amplitudes of weight updates. All tasks under consideration are initially assigned equal weights during the model’s initialization phase. Subsequently, their actual loss values at the first training stage, $\mathcal{L}_k(1)$, serve as the respective baseline loss references.
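A minimal sketch of this weight computation, assuming the two dynamically weighted tasks are the unimodal and contrastive losses from the overall objective; the loss values below are hypothetical.

```python
import torch

def gdwg_weights(current_losses, baseline_losses, tau=1.0):
    """Global Dynamic Weight Generation (see the two equations above).

    current_losses:  per-task losses at stage t-1, i.e., L_k(t-1) (detached scalars)
    baseline_losses: per-task losses at the first stage, i.e., L_k(1)
    tau:             temperature; smaller values produce larger weight updates
    """
    w = current_losses / baseline_losses      # relative decay rate w_k(t-1)
    return torch.softmax(w / tau, dim=0)      # normalized weights lambda_k(t-1)

# Hypothetical usage with J = 2 adjusted tasks (unimodal and contrastive losses):
baseline = torch.tensor([1.20, 0.90])         # L_Uni(1), L_CL(1)
current  = torch.tensor([0.40, 0.75])         # losses at the current stage
lam_uni, lam_cl = gdwg_weights(current, baseline, tau=1.0)
# total objective: L_MSR + L_MSC + lam_uni * L_Uni + lam_cl * L_CL
```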

4. Experiments

4.1. Experimental Settings

All experiments in this study were conducted on a single NVIDIA RTX 3090 GPU. For the CH-SIMS dataset, the “bert-base-chinese” model was employed, while for the MOSI and MOSEI datasets, the “bert-base-uncased” model was selected for fine-tuning. During fine-tuning, the learning rate was set to 0.00001, with a batch size of 64 and a total of 150 training epochs. For modality information extraction, two single-layer transformers were used to extract the audio and visual information of the CH-SIMS and MOSI datasets, whereas for the MOSEI dataset, given its data characteristics and complexity, a three-layer transformer was adopted for the visual and audio modalities. Throughout the entire training process, the learning rate was uniformly set to 0.0001, with a batch size of 128 and a total of 300 training epochs.

4.2. Datasets

To evaluate the performance of the DMMSA model, we selected three representative MSA datasets: CH-SIMS [18], MOSI [24], and MOSEI [23]. CH-SIMS, a resource for MSA in Chinese, comprises 2281 video samples, with sentiment labels expressed as scores within the continuous interval [−1, +1]. MOSI, an English dataset, includes 2199 video clips and employs a [−3, +3] sentiment intensity rating system. MOSEI, an extended English MSA collection derived from MOSI, significantly expands the scale to 22,856 video segments, maintaining the [−3, +3] sentiment intensity scoring range. The specific details of the dataset division are presented in Table 1.

4.3. Baseline Models

LF-DNN: This model concatenates unimodal features and analyzes sentiment [18].
MFN: This model first employs LSTM for view-specific interaction, then utilizes the attention mechanism for cross-view interaction, and finally summarizes through time with a multi-view gated memory [23].
LMF: By decomposing tensors and weights in parallel, this model utilizes modality-specific low-rank factors to perform multimodal fusion [36].
TFN: This model learns end-to-end dynamics within and across modalities. It utilizes a new multimodal fusion method (tensor fusion) to model the dynamics across modalities [26].
MulT: The core of MulT lies in its cross-modal attention mechanism, which offers a potential cross-modal adaptation by directly attending to low-level features in other modalities to fuse multimodal information [37].
MISA: This model learns modality-invariant and modality-specific representation spaces for each modality to obtain better representations for the fused input [38].
MAG-BERT: This model enhances performance by applying multimodal adaptation gates at different layers of the BERT backbone [14].
Self-MM: This model first utilizes a self-supervised label generation module to obtain unimodal labels and then jointly learns multimodal and unimodal representations based on multimodal labels [39].
ConFEDE: This model first decomposes unimodal features into modality-invariant and modality-specific features through feature decomposition. Subsequently, it utilizes multi-task learning to jointly optimize multimodal sentiment analysis, unimodal sentiment analysis, and contrastive learning tasks [5].

4.4. Evaluation Metrics

Following prior work, we report the model’s performance on both classification and regression tasks. For classification, we compute the accuracy of three-class prediction (Acc-3) and five-class prediction (Acc-5) on CH-SIMS, as well as the accuracy of two-class prediction (Acc-2) and seven-class prediction (Acc-7) on MOSI and MOSEI. Acc-2 and the F1-score for MOSI and MOSEI are reported in two forms: “negative/non-negative” and “negative/positive” (excluding 0). For regression, we report the Mean Absolute Error (MAE) and Pearson correlation (Corr). All metrics except MAE are better when higher. Since the model outputs a continuous value, for classification tasks we determine whether a prediction is correct by checking whether this value falls within the correct interval. For example, in the three-class task on CH-SIMS, Acc-3 uses the intervals [−1, −0.1], (−0.1, 0.1], and (0.1, 1]. Suppose the ground truth of a sample is −0.5 and the model predicts −0.3; we then consider the three-class prediction correct, since the model identified the interval in which the sample’s sentiment lies.
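As a concrete illustration of this interval-based evaluation, here is a small NumPy sketch that buckets predictions and labels with the boundaries quoted above; the function name and the single-sample example are illustrative.

```python
import numpy as np

def interval_accuracy(preds, labels, edges):
    """Accuracy when both predictions and labels are bucketed into intervals.
    `edges` are the inner boundaries, e.g. [-0.1, 0.1] for the CH-SIMS Acc-3
    split [-1, -0.1], (-0.1, 0.1], (0.1, 1]."""
    pred_bins = np.digitize(preds, edges, right=True)
    label_bins = np.digitize(labels, edges, right=True)
    return float(np.mean(pred_bins == label_bins))

# The example from the text: ground truth -0.5 and prediction -0.3 both fall
# into the negative interval, so the sample counts as correct for Acc-3.
acc3 = interval_accuracy(np.array([-0.3]), np.array([-0.5]), edges=[-0.1, 0.1])
```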

4.5. Comparative Experiments

Table 2 and Table 3 summarize the results of the comparative experiments with various methods. The listed results are based on the average of five runs with different random seeds, with the performance data for all baseline models, except for ConFEDE, sourced from published studies.
On the CH-SIMS dataset, DMMSA outperformed all baseline models in classification and regression tasks. Compared to ConFEDE, it achieved increases of 1.27% and 3.20% in Acc-3 and Acc-5, respectively. This is mainly due to the integrated coarse-grained sentiment analysis task, which boosts classification performance. DMMSA also showed significant progress in the MAE and Corr metrics. It effectively captured the interdependencies of uni- and multimodal sentiment analysis and embedded a coarse-grained task that constrained the prediction scope and simplified the analysis. As shown in Figure 4, DMMSA exhibited smaller MAE fluctuations and faster convergence during training than models without coarse-grained analysis, confirming its positive role in optimization.
To further validate our approach, experiments were conducted on the MOSI and MOSEI datasets, which lack unimodal sentiment labels. Table 3 shows that DMMSA surpassed all baselines, demonstrating excellent performance even without unimodal labels. This is mainly due to the design of the contrastive learning task. As shown in Figure 5, the model was still able to identify and separate uni- and multimodal features effectively under the guidance of this task. Notably, DMMSA’s improvements in Acc-7, MAE, and Corr were more significant than those in Acc-2. This is because complex tasks demand higher model performance and feature quality: simple tasks can yield good results with lower-level features, whereas DMMSA’s advantage lies in extracting higher-quality features, so its performance gain becomes more prominent as the task difficulty increases.
To confirm this hypothesis, we designed an incremental experiment. Table 4 shows that DMMSA’s performance in Acc-2 converged when trained with 60% of the data and did not improve with more data. In contrast, its performance in Acc-7 and regression tasks continuously improved as the amount of training data increased.

4.6. Ablation Study and Analysis

We carried out an ablation study on the proposed method to explore the individual contributions of each module to model performance. The results are presented in Table 5.
We can observe that the model’s performance decreased to varying degrees under the three ablation strategies. The “w/o MSC” strategy resulted in decreases across all performance metrics, which can primarily be attributed to the loss of a practical constraint on the sentiment prediction range after removing the MSC task. Under the “w/o CL” configuration, the model’s MAE and Corr degraded noticeably (the MAE rose while Corr fell). This is mainly because the core objective of the CL task is to help the model distinguish and extract similarity and dissimilarity features from the individual modalities. Once the CL task was removed, the model lost the feature discrimination ability promoted by this mechanism, making it difficult to accurately distinguish and utilize these critical sentiment features, and its performance on the regression task was therefore weakened.
In the “w/o GDWG” setting, the model showed a slight upward trend in performance in regression tasks but a significant decline in classification tasks. The reason for this phenomenon is that the model lost the effective regulatory mechanism for gradient convergence rates and magnitude differences among the various tasks during training. As a result, the model tended to over-optimize a single task at the cost of neglecting the learning needs of other tasks, leading to an obvious imbalance in overall performance.

5. Conclusions

This study presents DMMSA, an affective analysis framework. It combines multi-task learning strategies with dynamic tuning mechanisms to improve the accuracy of understanding complex human sentiments. This is achieved by leveraging the inherent correlations between unimodal and multimodal sentiment signals. DMMSA systematically extracts and decomposes sentiment representations from multimodal inputs into similarity and dissimilarity parts. These are then enhanced via coarse-grained sentiment classification tasks and contrastive learning mechanisms that work on the interaction of sentiment representations. To fully assess DMMSA, it is evaluated on three typical MSA datasets: CH-SIMS, MOSI, and MOSEI. Experimental results show that DMMSA outperforms various benchmark models in all overall performance metrics across all datasets. Additionally, a series of ablation experiments further confirms the essential contribution of each component module in DMMSA to the overall performance enhancement, validating the methodological soundness and effectiveness of this design.

6. Discussion

The DMMSA model has achieved remarkable performance across multiple datasets, which can be attributed to several key factors. First, its multi-task learning strategy incorporates single-modality sentiment classification tasks and contrastive learning tasks, enabling the model to gain a more comprehensive understanding and analysis of sentiment signals. This strategy not only enhances the model’s comprehension of single-modality sentiment but also optimizes the feature extraction and decomposition process through the contrastive learning mechanism. Second, the Global Dynamic Weight Generation (GDWG) strategy dynamically adjusts task weights based on the convergence speed of tasks, effectively avoiding the negative transfer issues in multi-task learning. This mechanism ensures that the model balances the learning progress of different tasks during training, thereby improving overall performance. Additionally, by introducing coarse-grained sentiment classification tasks, the DMMSA model more accurately predicts sentiment intensity. This task simplifies the complexity of sentiment analysis by restricting the prediction range, thereby enhancing the model’s prediction accuracy. Compared with existing baseline models, the DMMSA model has achieved significant improvements in multiple key metrics. For instance, on the CH-SIMS dataset, the DMMSA model’s Acc-3 and Acc-5 values were 1.27% and 3.20% higher than those of the ConFEDE model. On the MOSI and MOSEI datasets, the DMMSA model outperformed all baseline models in metrics such as Acc-2, Acc-7, MAE, and Corr. These results indicate that while optimizing the model structure is important, optimizing the training objectives can also effectively enhance model performance.

7. Limitations

Although we have alleviated the negative transfer effects caused by differences in task convergence rates using the GDWG strategy, this problem still exists and is a key factor restricting the performance improvement of DMMSA. Table 4 shows that, as the training sample size increases beyond a certain point, the performance of DMMSA on Acc-2 declines slightly, while its performance on Acc-7, MAE, and Corr continues to improve. Therefore, future research will focus on exploring optimization paths for GDWG. For instance, dynamic weighting based on meta-learning, dynamic parameter adjustment based on reinforcement learning, and multi-task learning based on transfer learning will all be considered. These methods should help suppress negative transfer more effectively and thereby enhance the overall performance of the model.

Author Contributions

Conceptualization, Y.L. and B.K.; methodology, Y.L. and D.H.; software, Y.L.; validation, Y.L., T.Y., B.K. and D.H.; formal analysis, Y.L.; investigation, T.T. and W.H.; resources, T.T. and A.H.; data curation, Y.L.; writing—original draft preparation, Y.L.; writing—review and editing, Y.L., W.H. and T.T.; visualization, Y.L.; supervision, T.T., W.H. and B.K.; funding acquisition, T.T. and A.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (62166042); the Natural Science Foundation of Xinjiang, China (2021D01C076); and the Tianshan Talents Cultivation Program—Leadings Talents for Scientific and Technological Innovation (No. 2024TSYCLJ0002).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The CH-SIMS, MOSI, and MOSEI datasets are available at https://github.com/thuiar/MMSA (accessed on 12 March 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yang, D.; Li, M.; Xiao, D.; Liu, Y.; Yang, K.; Chen, Z.; Wang, Y.; Zhai, P.; Li, K.; Zhang, L. Towards multimodal sentiment analysis debiasing via bias purification. arXiv 2024, arXiv:2403.05023. [Google Scholar]
  2. Du, J.; Jin, J.; Zhuang, J.; Zhang, C. Hierarchical graph contrastive learning of local and global presentation for multimodal sentiment analysis. Sci. Rep. 2024, 14, 5335. [Google Scholar] [CrossRef] [PubMed]
  3. Liu, Z.; Zhou, B.; Chu, D.; Sun, Y.; Meng, L. Modality translation-based multimodal sentiment analysis under uncertain missing modalities. Inf. Fusion 2024, 101, 101973. [Google Scholar] [CrossRef]
  4. Sun, L.; Lian, Z.; Liu, B.; Tao, J. Efficient multimodal transformer with dual-level feature restoration for robust multimodal sentiment analysis. IEEE Trans. Affect. Comput. 2023, 15, 309–325. [Google Scholar] [CrossRef]
  5. Yang, J.; Yu, Y.; Niu, D.; Guo, W.; Xu, Y. ConFEDE: Contrastive Feature Decomposition for Multimodal Sentiment Analysis. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023. [Google Scholar]
  6. Huang, J.; Pu, Y.; Zhou, D.; Cao, J.; Gu, J.; Zhao, Z.; Xu, D. Dynamic hypergraph convolutional network for multimodal sentiment analysis. Neurocomputing 2024, 565, 126992. [Google Scholar] [CrossRef]
  7. Zeng, Y.; Yan, W.; Mai, S.; Hu, H. Disentanglement Translation Network for multimodal sentiment analysis. Inf. Fusion 2024, 102, 102031. [Google Scholar] [CrossRef]
  8. Hu, G.; Lin, T.E.; Zhao, Y.; Lu, G.; Wu, Y.; Li, Y. Unimse: Towards unified multimodal sentiment analysis and emotion recognition. arXiv 2022, arXiv:2211.11256. [Google Scholar]
  9. Ging, S.; Zolfaghari, M.; Pirsiavash, H.; Brox, T. Coot: Cooperative hierarchical transformer for video-text representation learning. Adv. Neural Inf. Process. Syst. 2020, 33, 22605–22618. [Google Scholar]
  10. Lu, Q.; Sun, X.; Gao, Z.; Long, Y.; Feng, J.; Zhang, H. Coordinated-joint translation fusion framework with sentiment-interactive graph convolutional networks for multimodal sentiment analysis. Inf. Process. Manag. 2024, 61, 103538. [Google Scholar] [CrossRef]
  11. Shi, H.; Pu, Y.; Zhao, Z.; Huang, J.; Zhou, D.; Xu, D.; Cao, J. Co-space Representation Interaction Network for multimodal sentiment analysis. Knowl.-Based Syst. 2024, 283, 111149. [Google Scholar] [CrossRef]
  12. Truong, Q.T.; Lauw, H.W. Vistanet: Visual aspect attention network for multimodal sentiment analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 305–312. [Google Scholar]
  13. Feng, W.; Wang, X.; Cao, D.; Lin, D. An Autoencoder-based Self-Supervised Learning for Multimodal Sentiment Analysis. Inf. Sci. 2024, 675, 120682. [Google Scholar] [CrossRef]
  14. Rahman, W.; Hasan, M.K.; Lee, S.; Zadeh, A.; Mao, C.; Morency, L.P.; Hoque, E. Integrating multimodal information in large pretrained transformers. Proc. Conf. Assoc. Comput. Linguist. Meet. 2020, 2020, 2359–2369. [Google Scholar] [PubMed]
  15. Liang, Y.; Tohti, T.; Hamdulla, A. Multimodal false information detection method based on Text-CNN and SE module. PLoS ONE 2022, 17, e0277463. [Google Scholar] [CrossRef] [PubMed]
  16. Mai, S.; Zeng, Y.; Zheng, S.; Hu, H. Hybrid contrastive learning of tri-modal representation for multimodal sentiment analysis. IEEE Trans. Affect. Comput. 2022, 14, 2276–2289. [Google Scholar] [CrossRef]
  17. Sun, Z.; Sarma, P.; Sethares, W.; Liang, Y. Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8992–8999. [Google Scholar]
  18. Yu, W.; Xu, H.; Meng, F.; Zhu, Y.; Ma, Y.; Wu, J.; Zou, J.; Yang, K. Ch-sims: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 3718–3727. [Google Scholar]
  19. Fu, Y.; Huang, B.; Wen, Y.; Zhang, P. FDR-MSA: Enhancing multimodal sentiment analysis through feature disentanglement and reconstruction. Knowl.-Based Syst. 2024, 297, 111965. [Google Scholar] [CrossRef]
  20. Jiang, X.; Xu, X.; Lu, H.; He, L.; Shen, H.T. Joint Objective and Subjective Fuzziness Denoising for Multimodal Sentiment Analysis. IEEE Trans. Fuzzy Syst. 2024, 33, 15–27. [Google Scholar] [CrossRef]
  21. Aslam, A.; Sargano, A.B.; Habib, Z. Attention-based multimodal sentiment analysis and emotion recognition using deep neural networks. Appl. Soft Comput. 2023, 144, 110494. [Google Scholar] [CrossRef]
  22. Liu, S.; Liang, Y.; Gitter, A. Loss-balanced task weighting to reduce negative transfer in multi-task learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019. [Google Scholar]
  23. Zadeh, A.B.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.P. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2236–2246. [Google Scholar]
  24. Zadeh, A.; Zellers, R.; Pincus, E.; Morency, L.P. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. IEEE Intell. Syst. 2016, 31, 82–88. [Google Scholar] [CrossRef]
  25. Wang, Y.; Shen, Y.; Liu, Z.; Liang, P.P.; Zadeh, A.; Morency, L.P. Words can shift: Dynamically adjusting word representations using nonverbal behaviors. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 7216–7223. [Google Scholar]
  26. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.P. Tensor fusion network for multimodal sentiment analysis. arXiv 2017, arXiv:1707.07250. [Google Scholar]
  27. Huang, W.; Han, A.; Chen, Y.; Cao, Y.; Xu, Z.; Suzuki, T. On the comparison between multi-modal and single-modal contrastive learning. Adv. Neural Inf. Process. Syst. 2024, 37, 81549–81605. [Google Scholar]
  28. Hu, H.; Wang, X.; Zhang, Y.; Chen, Q.; Guan, Q. A comprehensive survey on contrastive learning. Neurocomputing 2024, 610, 128645. [Google Scholar] [CrossRef]
  29. Lei, J.; Li, L.; Zhou, L.; Gan, Z.; Berg, T.L.; Bansal, M.; Liu, J. Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling. arXiv 2021, arXiv:2102.06183. [Google Scholar]
  30. Li, L.; Chen, Y.C.; Cheng, Y.; Gan, Z.; Yu, L.; Liu, J. HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training. arXiv 2020, arXiv:2005.00200. [Google Scholar]
  31. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  32. Akbari, H.; Yuan, L.; Qian, R.; Chuang, W.H.; Chang, S.F.; Cui, Y.; Gong, B. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Adv. Neural Inf. Process. Syst. 2021, 34, 24206–24221. [Google Scholar]
  33. Hervella, Á.S.; Rouco, J.; Novo, J.; Ortega, M. Multi-Adaptive Optimization for multi-task learning with deep neural networks. Neural Netw. 2024, 170, 254–265. [Google Scholar] [CrossRef]
  34. Ren, Y.; Li, Y.; Kong, A.W.K. Adaptive Multi-task Learning for Few-Shot Object Detection. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 297–314. [Google Scholar]
  35. Agiza, A.; Neseem, M.; Reda, S. Mtlora: Low-rank adaptation approach for efficient multi-task learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16196–16205. [Google Scholar]
  36. Liu, Z.; Shen, Y.; Lakshminarasimhan, V.B.; Liang, P.P.; Zadeh, A.; Morency, L.P. Efficient low-rank multimodal fusion with modality-specific factors. arXiv 2018, arXiv:1806.00064. [Google Scholar]
  37. Tsai, Y.H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.P.; Salakhutdinov, R. Multimodal transformer for unaligned multimodal language sequences. Proc. Conf. Assoc. Comput. Linguist. Meet. 2019, 2019, 6558–6569. [Google Scholar]
  38. Hazarika, D.; Zimmermann, R.; Poria, S. Misa: Modality-invariant and-specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1122–1131. [Google Scholar]
  39. Yu, W.; Xu, H.; Yuan, Z.; Wu, J. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 10790–10797. [Google Scholar]
  40. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443. [Google Scholar] [CrossRef]
Figure 1. (a) Critical and secondary modules in the process of optimizing model architecture. Such approaches focus on enhancing the unimodal feature extraction and multimodal feature fusion modules. (b) The modules of primary concern in our work. We concentrate on improving model performance by setting appropriate optimization objectives while keeping the structure of other modules unchanged.
Figure 2. An example from the CH-SIMS dataset.
Figure 3. The overall framework of the DMMSA model.
Figure 4. Evolution of the MAE over epochs for different optimization strategies on the MOSI validation dataset. “Cross” represents DMMSA incorporating the coarse-grained sentiment analysis task, whereas “No_Cross” corresponds to DMMSA with the coarse-grained sentiment analysis task removed. The circles indicate the lowest MAE values.
Figure 5. Similarity between similarity and dissimilarity features. “CL” denotes the similarity of a model incorporating the contrastive learning task. “No_CL” represents the similarity of a model without the contrastive learning task. “T”, “I”, and “A” denote text, image, and audio modalities, respectively. “S” and “D” denote similarity features and dissimilarity features, respectively.
Table 1. Dataset-specific partitioning details.
Dataset  | Training | Valid. | Test | Total
CH-SIMS  | 1368     | 456    | 457  | 2281
MOSI     | 1284     | 229    | 686  | 2199
MOSEI    | 16,326   | 1871   | 4659 | 22,856
Table 2. Results of the comparative experiment on the CH-SIMS dataset. “Acc” denotes “accuracy”.
Model    | Acc-3 (↑) | Acc-5 (↑) | MAE (↓) | Corr (↑)
LF-DNN   | 66.91     | 41.62     | 0.420   | 0.612
MFN(A)   | 65.73     | 39.47     | 0.435   | 0.582
LMF      | 64.68     | 40.53     | 0.441   | 0.576
TFN      | 65.12     | 39.30     | 0.432   | 0.591
MulT(A)  | 64.77     | 37.94     | 0.453   | 0.561
Self-MM  | 64.73     | 43.15     | 0.414   | 0.598
ConFEDE  | 68.36     | 43.72     | 0.3924  | 0.6351
DMMSA    | 69.63     | 46.92     | 0.3778  | 0.66
Table 3. Results of the comparative experiment on the MOSI and MOSEI datasets. In Acc-2 and F1, the left side of “/” represents “negative/non-negative”, and the right side represents “negative/positive”.
MOSI:
Model          | Acc-2       | F1          | Acc-7 | MAE   | Corr
LF-DNN [18]    | 77.52/78.63 | 77.46/78.63 | 34.52 | 0.955 | 0.658
MFN(A) [23]    | 77.4/-      | 77.3/-      | 34.1  | 0.965 | 0.632
LMF [40]       | -/82.5      | -/82.4      | 33.2  | 0.917 | 0.695
TFN [26]       | -/80.8      | -/80.7      | 34.9  | 0.901 | 0.698
MulT(A) [37]   | -/83.0      | -/82.8      | 40.0  | 0.871 | 0.698
MISA(A) [38]   | 81.8/83.4   | 81.7/83.6   | 42.3  | 0.783 | 0.776
MAG-BERT [14]  | 82.13/83.54 | 81.12/83.58 | 41.43 | 0.790 | 0.766
ConFEDE [5]    | 83.85/85.55 | 83.83/85.76 | 43.82 | 0.725 | 0.789
DMMSA          | 83.97/85.70 | 83.92/85.70 | 45.39 | 0.710 | 0.793

MOSEI:
Model          | Acc-2       | F1          | Acc-7 | MAE   | Corr
LF-DNN [18]    | 80.60/82.74 | 80.85/82.52 | 50.83 | 0.58  | 0.709
MFN(A) [23]    | 78.94/82.86 | 79.55/82.85 | 51.53 | 0.573 | 0.718
LMF [40]       | 80.54/83.48 | 80.94/83.36 | 51.59 | 0.576 | 0.717
TFN [26]       | 78.50/81.89 | 78.96/81.74 | 51.60 | 0.573 | 0.714
MulT(A) [37]   | 81.15/84.63 | 81.56/84.52 | 52.84 | 0.559 | 0.733
MISA(A) [38]   | 83.6/85.5   | 83.8/85.3   | 52.2  | 0.555 | 0.756
MAG-BERT [14]  | 79.86/86.86 | 80.47/83.88 | 50.41 | 0.583 | 0.741
ConFEDE [5]    | 80.7/84.38  | 81.2/84.32  | 51.96 | 0.555 | 0.753
DMMSA          | 82.63/86.27 | 83.04/86.21 | 53.91 | 0.527 | 0.777
Table 4. Performance of DMMSA under varying amounts of training data.
Data       | Acc-2       | F1          | Acc-7 | MAE   | Corr
MOSEI      | 82.63/86.27 | 83.04/86.21 | 53.91 | 0.527 | 0.777
MOSEI*0.8  | 82.41/86.15 | 82.87/86.14 | 53.39 | 0.532 | 0.772
MOSEI*0.6  | 83.77/86.12 | 84.01/85.99 | 52.77 | 0.538 | 0.769
MOSEI*0.4  | 82.78/86.09 | 83.16/86.03 | 53.36 | 0.540 | 0.767
MOSEI*0.2  | 80.68/85.03 | 81.27/85.06 | 52.37 | 0.551 | 0.760
MOSEI*0.1  | 81.89/85.08 | 82.33/85.04 | 52.18 | 0.552 | 0.755
Table 5. The ablation experiments on CH-SIMS. “w/o CL” signifies the exclusion of the contrastive learning (CL) task.
Model     | Acc-3 | Acc-5 | MAE    | Corr
DMMSA     | 69.63 | 46.92 | 0.3778 | 0.66
w/o MSC   | 68.41 | 44.68 | 0.3807 | 0.656
w/o CL    | 69.41 | 46.74 | 0.3828 | 0.651
w/o GDWG  | 69.32 | 46.17 | 0.3776 | 0.663

Citation: Liang, Y.; Tohti, T.; Hu, W.; Kong, B.; Han, D.; Yan, T.; Hamdulla, A. Dynamic Tuning and Multi-Task Learning-Based Model for Multimodal Sentiment Analysis. Appl. Sci. 2025, 15, 6342. https://doi.org/10.3390/app15116342
