Article

Image–Text Sentiment Analysis Based on Dual-Path Interaction Network with Multi-Level Consistency Learning

1 Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China
2 Shandong Key Laboratory of Intelligent Oil & Gas Industrial Software, Qingdao 266580, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(3), 581; https://doi.org/10.3390/electronics15030581
Submission received: 23 December 2025 / Revised: 19 January 2026 / Accepted: 21 January 2026 / Published: 29 January 2026
(This article belongs to the Special Issue AI-Driven Image Processing: Theory, Methods, and Applications)

Abstract

With the continuous evolution of social media, users are increasingly inclined to express their personal emotions on digital platforms by combining information from multiple modalities. Within this context, research on image–text sentiment analysis has garnered significant attention. Prior work has made notable progress by leveraging shared emotional concepts across the visual and textual modalities. However, existing cross-modal sentiment analysis methods face two key challenges: first, previous approaches often focus excessively on fusion, so the learned features may not achieve emotional alignment; second, traditional fusion strategies are not optimized for sentiment tasks, leading to insufficient robustness in the final sentiment discrimination. To address these issues, this paper proposes a Dual-path Interaction Network with Multi-level Consistency Learning (DINMCL). It employs a multi-level feature representation module to decouple the global and local features of both text and image. The decoupled features are then fed into the Global Congruity Learning (GCL) and Local Crossing-Congruity Learning (LCL) modules, respectively. GCL models global semantic associations through cross-modal attention and adaptive key-prompt injection, while LCL captures the local consistency of fine-grained emotional cues across modalities through the Crossing Prompter and graph attention. Finally, a CLIP-based adaptive fusion layer integrates the multimodal representations in a sentiment-oriented manner. Experiments on the MVSA-Single, MVSA-Multiple, and TumEmo datasets against baseline models such as CTMWA and CLMLF demonstrate that DINMCL significantly outperforms mainstream models in sentiment classification accuracy and F1-score and exhibits strong robustness when handling samples containing highly noisy symbols.

1. Introduction

A few years ago, people primarily used textual data to express their thoughts. However, relying solely on text modality can sometimes hinder accurate sentiment prediction. Today, driven by the flourishing social media era, people increasingly convey emotions and opinions through multimodal formats combining images and text, propelling multimodal sentiment analysis into a research hotspot. Image–text multimodal sentiment analysis is a technique that identifies users’ emotional states by integrating visual and linguistic information. By fusing complementary information from both modalities, it captures complex emotions more comprehensively and accurately.
Unimodal sentiment analysis relies exclusively on a single data source, making it prone to misjudgment due to information gaps. Multimodal sentiment analysis overcomes the data limitations of unimodal approaches by fusing multi-source information, leveraging cross-modal complementarity to capture complex emotions with greater precision.
Early multimodal sentiment analysis research primarily relied on manually engineered feature systems. Although such methods could partially characterize affective attributes, their inherent shortcomings significantly constrained research depth. Traditional feature engineering suffered not only from labor-intensive operational bottlenecks but also from limitations imposed by researchers’ cognitive boundaries, making it difficult to adequately model the abstract semantics and non-linear correlations inherent in emotional expression. Consequently, analytical efficacy struggled to break through its theoretical ceiling. The paradigm shift brought by current deep learning technologies has revolutionized this landscape. Intelligent algorithms represented by neural networks overcome these limitations by autonomously mining latent representations within multimodal data, enabling synergistic optimization of cross-modal features. This data-driven feature learning mechanism not only circumvents the subjective biases of manual intervention but also achieves leapfrog improvements in core metrics such as sentiment classification accuracy.
For the multimodal sentiment analysis task, this paper focuses specifically on image and text data acquired from social media. In image–text sentiment analysis, the consistency between the emotional semantics expressed by the text modality and the sentiment indicated by the image modality is crucial for subsequent polarity judgment. Relying solely on a single modality to detect emotional information may fail to accurately identify the true sentiment of a post. This is illustrated in Figure 1: In Figure 1a, the text suggests a positive sentiment, while the image conveys the opposite emotion. In Figure 1b, the image appears positive but the accompanying text reveals it to be a negative post. In Figure 1c,d, both the image and text convey congruent emotions. Such cases reveal the limitations of simple fusion, demanding that the model possess fine-grained, conditional information filtering and alignment capabilities. Moreover, the lack of emotional orientation explanation indicates that the fusion mechanism must incorporate emotional semantics as a supervisory signal for alignment. Correspondingly, we achieve precise emotional alignment through multi-level consistency learning and inject clear emotional orientation through emotional contrast learning.
Existing methods predominantly suffer from key limitations. Most approaches either simply concatenate features extracted from different modalities [1,2] or learn image–text relationships at a coarse level [3]. They fail to adequately explore the intricate associations between images and text. Furthermore, previous methods often prioritize fusion mechanisms over sentiment-specific consistency, potentially resulting in learned features lacking affective coherence. Crucially, as most methods are not explicitly designed for sentiment analysis, their fusion processes lack sentiment-oriented constraints. This deficiency can lead to learned cross-modal representations failing to capture deep-level semantic emotional associations.
To address the aforementioned challenges, this study proposes DINMCL, specifically tailored for image–text sentiment analysis. The method fully fuses textual and visual features using fusion techniques incorporating both local and global consistency learning. The global contrastive alignment of the GCL module complements the fine-grained semantic exploration of the LCL module, effectively uncovering deep cross-modal semantic associations. Moreover, to address the limitations of existing single-encoder models in comprehensively capturing the complex semantics required for multimodal sentiment analysis, we have designed a multi-level feature representation module. For instance, while a single text encoder like RoBERTa excels at understanding contextual and syntactic information, it may overlook the intuitive descriptive information of an image. Similarly, a single vision encoder like ViT is adept at capturing global scene features but might miss specific emotion-bearing objects. Distinctive features are comprehensively extracted through the synergistic combination of a dual-text encoder (RoBERTa + BLIP) and multi-perspective visual representations (ViT + Faster R-CNN). To evaluate model performance, experiments were conducted on three public multimodal datasets: MVSA-Single, MVSA-Multiple, and TumEmo. On all three datasets, the proposed model demonstrated superior performance compared to baseline models.
The main contributions of this work are summarized as follows:
  • We construct a multi-level feature representation module that establishes a multi-dimensional, multi-granularity representation space through four parallel feature extraction paths. This separation of global and local features avoids the loss of emotional cues often observed in traditional feature concatenation methods.
  • We propose a complementary consistency learning framework GCL and LCL. The GCL module implements adaptive filtering of cross-modal global correlations through a collaborative mechanism combining multi-head cross-attention and self-attention gating. Dynamically generated Key Prompts guide the network to focus on semantically salient global regions across modalities. The LCL module introduces an innovative Crossing Prompter, which is specifically designed to enhance fine-grained local feature interactions and is integrated with a Graph Attention Network (GAT), constructing a fine-grained local feature interaction system.
  • Furthermore, we perform contrastive learning between the concatenated feature vector of GCL and LCL and the text embeddings generated by the CLIP text encoder, which incorporate sentiment polarity. By contrasting the similarity with texts of different polarities, we achieve sentiment classification decisions. We conducted a series of validation experiments on the aforementioned datasets, which verified the effectiveness of the adopted approach.
The remainder of this paper is structured as follows: Section 2 delivers a comprehensive review of prior studies; Section 3 delves into the details of our adopted approach; Section 4 showcases the conducted experiments alongside their outcomes; Section 5 discusses this work; Section 6 concludes the paper.

2. Related Work

2.1. Multimodal Fusion

Multimodal fusion refers to integrating information from different modalities to produce a more comprehensive, accurate, and robust understanding, decision, or prediction than any single modality alone. Based on the level of data processing, multimodal fusion can be categorized into three types:
  • Data-Level Fusion [4]: Also known as pixel-level or raw-data fusion, this operates at the lowest data level. Raw data from different modalities (e.g., images and depth maps) are combined during preprocessing to form a new dataset. It is suitable when the raw data exhibit high inter-modal correlation and complementarity.
  • Feature-Level Fusion [4]: Performed after feature extraction but before decision-making. Features extracted separately from each modality are fused at a specific feature layer. It is widely used in tasks such as image classification, speech recognition, and sentiment analysis.
  • Decision-Level Fusion [4]: Also called object-level fusion, this combines the outputs after unimodal models have independently generated predictions (e.g., classification labels); the final decision is derived by integrating these results. It is well suited to scenarios requiring aggregated predictions (e.g., multi-sensor systems or expert opinion synthesis).
The objective of multimodal fusion is to retain useful and relevant information from input modalities while eliminating noise and ambiguous features from individual modalities. By preserving complementary information and suppressing noise and redundancy, model performance is enhanced.
In recent years, image–text sentiment analysis has gained significant attention due to its potential applications in opinion mining and decision-making domains. Research on multimodal fusion methods centers on designing innovative data integration mechanisms to achieve effective cross-modal feature collaboration. The current mainstream paradigm involves three fusion stages: Early Fusion enhances modal complementarity through raw data-level interaction. Mid-level Fusion establishes cross-modal correlations during feature abstraction. Late Fusion aggregates multi-dimensional information at the decision level. These three strategies address feature integration requirements at distinct levels of data processing pipelines. Yu et al. [5] proposed a late fusion model using weights to adaptively learn modality-specific variations. Zadeh et al. [6] introduced a tensor fusion network that explicitly models high-order interactions between modalities, making it a widely adopted method for multimodal sentiment analysis. Nguyen et al. [7] developed a hierarchical co-attention network featuring symmetric co-attention for bidirectional text–visual feature interaction, moving beyond unidirectional dependencies. This approach represents joint multimodal features by computing the tensor product of text and image representations. Huang et al. [8] proposed a Deep Multimodal Attention Fusion (DMAF) method integrating late and mid-level fusion to combine unimodal features with intrinsic cross-modal correlations within a hybrid framework. Zeng et al. [9] presented a multi-source feature fusion model incorporating heterogeneous images. Liu et al. [10] demonstrated the effectiveness of low-rank fusion methods through a technique utilizing low-rank tensors for sentiment analysis.
Cross-modal attention mechanisms [11] have emerged as a core technical paradigm, achieving breakthrough progress in representation learning architectures. The research focus has shifted from basic feature concatenation to deep interactive modeling based on Transformer architectures. By establishing dynamic cross-modal interaction matrices, these methods systematically capture latent semantic relationships among speech, text, and visual modalities. Zhu et al. [12] focused on relationships between affective image regions and textual information, introducing a cross-modal alignment module to capture region-word correspondences coupled with an adaptive cross-modal gating module for feature fusion. Tsai et al. [13] proposed directional pairwise cross-modal attention, modeling interactions across time steps in multimodal sequences. Xue et al. [14] constructed a Multilayer Attention Map Network (MAMN) to filter noise prior to fusion while capturing coherent and heterogeneous correlations across multi-granularity features. Based on the predominant technologies employed, F. Zhao et al. [15] propose a fine-grained taxonomy that categorizes state-of-the-art (SOTA) models into five classes: encoder–decoder methods, attention mechanisms, graph neural network methods, generative neural network methods, and other constraint-based approaches.

2.2. Image–Text Sentiment Analysis

In recent years, extensive experiments have demonstrated that emotion formation often relies on the combined effect of multimodal information, rather than being determined solely by a single source (such as images or text). Image–text sentiment analysis [16] leverages features from both modalities for holistic emotion prediction. Wang et al. [17] proposed a Cross-media Bag-of-Words Model (CBM) that unifies images and text into a bag-of-words structure for Weibo sentiment classification. You et al. [18] introduced a Cross-modality Consistent Regression (CCR) method, utilizing visual and textual features for joint sentiment prediction. Xu et al. [19] developed a Hierarchical Semantic Attentional Network (HSAN), employing image captions as semantic information to aid multimodal sentiment analysis and comprehensively capture detailed semantics. Xu et al. [2] proposed MultiSentiNet, a deep network leveraging scene and object features of images with attention mechanisms to identify significant sentence words. Xu et al. [3] constructed a Shared Memory Network designed to capture interactions between textual content and visual data. Truong et al. [20] introduced the Visual Aspect Attention Network (VistaNet), using images as attention to identify sentences critical for document-level sentiment classification. Zhao et al. [21] developed an image–text alignment-based model integrating text, social cues, mid-level image features, and cross-modal correlations. Poria et al. [22] proposed an LSTM-based approach to model interdependencies between utterances for multimodal sentiment prediction. Basu et al. [23] presented the Multimodal Bi-Transformer (MMBT) model, leveraging image and text features to predict expressed stance and sentiment. Thuseethan et al. [24] fused multiple salient visual cues with highly focused textual cues through stacked learning to capture inter-data relationships. Inspired by aspect-level sentiment analysis, Xu et al. [25] introduced a multimodal sentiment analysis framework focusing on aspects and released a publicly accessible dataset for such analysis. Yang et al. [26] designed a Multi-channel Graph Neural Network (MGNNS) employing affect-aware learning to derive new fusion features capturing global co-occurrence patterns. Xiao et al. [27] proposed a Bidirectional Interaction Translator (BIT) architecture featuring interactive encoding components with bidirectional information flow. Yu et al. [28] constructed a Hierarchical Image-Target Matching Network (HITM) that synchronously decouples global semantic associations and local feature correspondences through dual-path attention. Li et al. [29] developed a Contrastive Learning-Multilayer Fusion (CLMLF) framework combining multi-scale contrastive loss with gated attention residual connections. This enables simultaneous modality-specific feature disentanglement and cross-modal affective semantic transfer via latent-space distillation pipelines. Wang et al. [30] adopted a neural network leveraging holistic dataset characteristics to recognize emotions in text–image compositions. Zhang et al. [31] proposed CTMWA (Crossmodal Translation-Based Meta Weight Adaptation), a robust image–text sentiment analysis method dynamically adjusting unimodal weights through meta-learning strategies. 
BLIP-2 [32] is an advanced vision-language pre-training framework whose core innovation lies in introducing a learnable Q-Former as a bridge that aligns a frozen image encoder with a frozen large language model in a parameter-efficient manner. This paper selects its variant BLIP-2-FLAN-T5-XL, which leverages the powerful FLAN-T5-XL as its text decoder, as a baseline model to evaluate the zero-shot performance of general-purpose vision-language models on sentiment analysis tasks. InstructBLIP [33] is an advanced framework built upon the BLIP-2 architecture, which significantly enhances vision-language instruction-following and reasoning capabilities through fine-tuning on multi-task instruction-tuning datasets. Accordingly, this study selects its most representative variant, InstructBLIP-13B, as a comparative baseline; this version achieves a favorable balance between model capability and computational cost, providing a strong baseline for evaluating the zero-shot performance of large-scale general-purpose models on fine-grained sentiment analysis tasks. LLaVA [34] is a representative end-to-end vision-language model that achieves efficient cross-modal alignment and instruction comprehension by directly projecting visual encoder features and jointly training with a large language model; this study selects the widely adopted LLaVA-1.6-34B version as a comparative baseline, which strikes a favorable balance between computational efficiency and multimodal understanding capability and provides a moderate reference benchmark for evaluating the zero-shot performance of general-purpose vision-language models on sentiment analysis tasks. Qwen2.5-VL is a large-scale vision-language model released by Alibaba Group [35]. As the latest iteration in the Qwen-VL series, it deeply integrates a powerful vision encoder based on the Vision Transformer (ViT) architecture with the large-scale language model Qwen2.5.

Previous research on multimodal fusion can be broadly categorized into five main types. Traditional deep learning models (e.g., CNN, BiLSTM, ResNet50) primarily perform unimodal feature extraction and lack explicit cross-modal interaction and an emotion-oriented design. Language foundation models (e.g., BERT), being unimodal, are not designed for cross-modal alignment. Graph neural network models (e.g., TGNN, SGN, OGN, MGNNS), while capable of modeling structured relationships, tend to focus on local object-word alignment and are not explicitly emotion-guided. Attention-based models (e.g., OSDA, DuIG, HSAN, MultiSentiNet, CoMN, MVAN, CLMLF, ITIN, CTMWA) introduce customized interaction mechanisms, but their emotional alignment is often implicit and not explicitly optimized. Large multimodal models (e.g., Qwen2.5-VL-72B, BLIP-2, InstructBLIP, LLaVA) possess powerful general-purpose alignment capabilities, yet their design is not specifically tailored to sentiment analysis. These methods are not specifically designed for image–text sentiment analysis and often fail to achieve deep integration of the two modalities. Most approaches rely on shallow feature concatenation to handle multimodal data; such mechanistic fusion struggles to bridge the semantic gap caused by modal heterogeneity, leading to significant limitations in modeling visual–textual dynamic correlations.
Furthermore, prior work predominantly focuses on fusion itself, where the learned features may not achieve true emotional consistency, and deeper emotional semantics within each modality remain underexplored. To address these challenges, we propose DINMCL. It employs a multi-level feature representation module that conducts four parallel streams of feature extraction for the image and text modalities, comprehensively capturing informative features. In addition, it incorporates parallel GCL and LCL modules for feature fusion, ensuring thorough integration and providing a robust foundation for subsequent sentiment polarity analysis.

3. Method

In this section, we present the overall structure of DINMCL and then describe each of its modules in detail.

3.1. Overview

For image–text sentiment analysis, the input is a multimodal set M = {(T_1, I_1), (T_2, I_2), …, (T_n, I_n)}, where T denotes the text modality, I denotes the image modality, and n is the number of tweets. The objective of image–text sentiment analysis is to identify the sentiment of a given tweet, formally expressed as (T_i, I_i) → y_i, where y_i is derived from the sentiment labels. The label set L varies across datasets: for the MVSA-Single and MVSA-Multiple datasets, L ∈ {Positive, Neutral, Negative}; for the TumEmo dataset, L ∈ {Angry, Bored, Calm, Fear, Happy, Love, Sad}.

3.2. Model Design

In this section, we elaborate on the dual-path interaction network based on multi-level consistency learning proposed in this study. Figure 2 illustrates the overall architecture of the proposed model, which primarily consists of the following modules: the Multi-level Feature Representation module, the Global Congruity Learning module, the Local Crossing-Congruity Learning module, and the Fusion & Prediction module.

The Multi-level Feature Representation module includes a text encoder, an image encoder, and an image caption generator. In the text pathway, RoBERTa serves as the primary encoder for deeply understanding the contextual emotional semantics of the text, while the BLIP image caption generator functions as an auxiliary pathway: the descriptive captions it generates provide a human-language summary of the image content, which naturally filters visual information and acts as a “semantic bridge” connecting the image and text. In the vision pathway, ViT acts as the primary encoder for extracting the global scene features and overall ambiance of the image, and Faster R-CNN serves as the auxiliary pathway, detecting and extracting features of specific, potentially emotion-bearing objects, which are key clues for fine-grained sentiment analysis.

The Global Congruity Learning module achieves global semantic alignment through self-attention mechanisms and key-value prompt injection, optimizing a cross-entropy loss to drive sentiment classification. Within this module, we design an adaptive key prompt generation mechanism that dynamically generates a set of sparse “Key Prompt” vectors by analyzing the current cross-modal context. These prompts function as an “emotional filter” injected into the attention computation to adaptively enhance the focus on key regions of cross-modal emotional expression while suppressing irrelevant or distracting information, thereby achieving adaptive alignment of global emotional content.

The Local Crossing-Congruity Learning module captures fine-grained cross-modal correlations via the Crossing Prompter module (shown in Figure 3), whose core function is to act as a “question generator” for cross-modal interaction. Specifically, it takes the features of one modality as input and dynamically generates a set of prompt vectors designed to “ask” about the features of the other modality, guiding the model, within the GAT network, to actively explore and focus on the fine-grained local semantic regions most relevant to the input modality. This mechanism avoids aimless feature mixing and achieves directed fine-grained alignment while constraining local semantic distribution matching with a KL divergence loss.

The Fusion & Prediction module leverages the CLIP text encoder to generate polarity queries, aligns the key-value features through multi-head cross-attention, and outputs multi-polarity similarity scores to finalize the sentiment decision.

3.3. Multi-Level Feature Representation

This section introduces the encoding approaches for the text and image modalities. For the text modality T = {t_1, …, t_n}, we employ two feature extraction methods. RoBERTa is an enhanced pre-trained language model based on BERT. It significantly improves text semantic representation through dynamic masking, full-sentence masking strategies, and training on large-scale data, and is now widely used. We utilize the RoBERTa model to extract text features for subsequent processing in the LCL module:
T text = RoBERTa ( T )
where T_text ∈ R^{n_t × d_t} denotes the output feature matrix and d_t represents the feature dimension.
In the image captioning task, the BLIP model has significantly enhanced the accuracy and semantic relevance of generated text through its multimodal joint modeling capability, noisy-data purification mechanism, and unified dual-task framework integrating generation and understanding; it has become a predominant model for image captioning. We use the pre-trained BLIP model as an image caption generator to produce the corresponding text sequence X_cap = {x_1, x_2, …, x_N}:
X cap = BLIP ( I )
The corresponding text sequence is then fed into the pre-trained language model RoBERTa to generate textual features for the subsequent GCL module:
T cap = RoBERTa ( X cap )
where T_cap ∈ R^{n_t × d_t}, with d_t denoting the dimensionality of the output feature vector T_cap.
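To make the dual-text pathway concrete, the following is a minimal sketch of Equations (1)–(3) using Hugging Face Transformers; the checkpoint names (roberta-base, Salesforce/blip-image-captioning-base) and the truncation settings are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Minimal sketch of the dual-text pathway (assumed Hugging Face checkpoints).
import torch
from PIL import Image
from transformers import (RobertaTokenizer, RobertaModel,
                          BlipProcessor, BlipForConditionalGeneration)

roberta_tok = RobertaTokenizer.from_pretrained("roberta-base")
roberta = RobertaModel.from_pretrained("roberta-base")
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

@torch.no_grad()
def text_features(text: str, image: Image.Image):
    # T_text: RoBERTa token embeddings of the raw tweet text (Eq. (1)).
    enc = roberta_tok(text, return_tensors="pt", truncation=True)
    t_text = roberta(**enc).last_hidden_state                 # (1, n_t, d_t)

    # X_cap: BLIP-generated caption of the image (Eq. (2)).
    cap_ids = blip.generate(**blip_proc(images=image, return_tensors="pt"))
    caption = blip_proc.decode(cap_ids[0], skip_special_tokens=True)

    # T_cap: RoBERTa embeddings of the generated caption (Eq. (3)).
    cap_enc = roberta_tok(caption, return_tensors="pt", truncation=True)
    t_cap = roberta(**cap_enc).last_hidden_state               # (1, n_c, d_t)
    return t_text, t_cap, caption
```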
For the image modality I = {i_1, i_2, …, i_n}, we use two methods to extract features. Pretrained Vision Transformer (ViT) models have demonstrated efficient and robust feature extraction capabilities in tasks such as image classification, object detection, and segmentation; they achieve this by splitting images into a sequence of patches and leveraging the Transformer architecture to capture global context. For the image modality, feature encodings are extracted using a pretrained ViT model: the input image I is first uniformly partitioned into several image patches (PatchSplit), each patch is transformed into a vector representation via a linear projection (Linear), and positional encodings are added to preserve spatial information. A multi-layer Transformer encoder then models global interactions over the embedded vectors and outputs the global image feature V_i:
V i = Transformer ( Linear ( PatchSplit ( I ) ) + P )
where V_i ∈ R^{p_i × p_i × d_i}, d_i is the dimension of the image feature vector V_i, and P denotes the positional encoding. The multi-head self-attention mechanism enables the model to capture long-range dependencies and understand the image content from a holistic perspective.
Faster R-CNN is an efficient object detection framework. By introducing the Region Proposal Network (RPN) and combining it with deep convolutional networks to extract image features, it has been widely used in computer vision fields such as object detection and image segmentation due to its efficiency and accuracy. Here, a pre-trained Faster R-CNN model with ResNet-50 as the backbone network is used to extract image features. First, the Region Proposal Network generates candidate regions that may contain objects. Then, ResNet-50 is used to extract features and perform classification regression on the candidate regions, and finally outputs the local target feature V i * :
V i * = ResNet ( I )
where V_i* ∈ R^{p_i × p_i × d_i}. This process focuses on prominent objects in the image and extracts local detail features through convolution operations, complementing the global perspective of ViT.
Next, since the sizes of the two types of image features differ from each other, we perform the following transformations to convert the image representation size to match that of the text representation:
V g = Flatten ( V i W i + b i )
V r = Flatten ( V i * W i * + b i * )
The Flatten function denotes the flattening operation, which converts the first two dimensions of each image feature into a single dimension. Specifically, V_i ∈ R^{p_i × p_i × d_i} → V_g ∈ R^{n_i × d_t} and V_i* ∈ R^{p_i × p_i × d_i} → V_r ∈ R^{n_i × d_t}, where n_i = p_i × p_i and, after the dimensionality transformation, d_i = d_t. Here, W denotes learnable weights and b represents learnable bias terms. This operation projects both image and text features into a shared space, facilitating the subsequent feature fusion modules.
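The two visual pathways and the shared-space projection in Equations (4)–(7) can be sketched as follows; the ViT checkpoint, the use of torchvision's Faster R-CNN box head as the region-feature extractor, and the projection dimensions are assumptions for illustration only.

```python
# Minimal sketch of the dual visual pathway and the shared-space projection;
# checkpoints, torchvision internals, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF
from transformers import ViTImageProcessor, ViTModel
from torchvision.models.detection import fasterrcnn_resnet50_fpn

vit_proc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

d_t = 768                                                  # shared (text) dimension
proj_global = nn.Linear(vit.config.hidden_size, d_t)       # W_i, b_i in Eq. (6)
proj_region = nn.Linear(1024, d_t)                         # W_i*, b_i* in Eq. (7)

@torch.no_grad()
def visual_features(pil_image):
    # V_i: global patch features from ViT (Eq. (4)); the [CLS] token is dropped.
    pixels = vit_proc(images=pil_image, return_tensors="pt").pixel_values
    v_global = vit(pixel_values=pixels).last_hidden_state[:, 1:, :]   # (1, p*p, d_i)

    # V_i*: region features from Faster R-CNN's RPN + box head (Eq. (5)).
    img = TF.to_tensor(pil_image)
    images, _ = detector.transform([img])                  # resize + normalise
    feats = detector.backbone(images.tensors)
    proposals, _ = detector.rpn(images, feats)
    pooled = detector.roi_heads.box_roi_pool(feats, proposals, images.image_sizes)
    v_region = detector.roi_heads.box_head(pooled)         # (n_boxes, 1024)

    # Eqs. (6)-(7): project both views into the shared d_t space.
    v_g = proj_global(v_global)                            # (1, n_i, d_t)
    v_r = proj_region(v_region).unsqueeze(0)               # (1, n_boxes, d_t)
    return v_g, v_r
```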

3.4. Multi-Level Consistency Learning Model

After obtaining image and text features, we propose a novel interactive fusion module consisting of two sub-models: the GCL model and the LCL model. This framework enables multi-level fusion of global and local features: The GCL module employs self-attention mechanisms and cross-modal matrix multiplication to capture holistic semantic consistency between text and image at the global level, optimizing macro-level inter-modal correlations through cross-entropy loss. The LCL module utilizes cross-prompters and a Graph Attention Network (GAT) to focus on fine-grained alignment of local features, enhancing micro-level emotional expressiveness correlations via cross-modal interactions. Operating synergistically, these modules ensure both global matching of high-level semantic information across modalities and in-depth exploration of dynamic relationships among local elements.
First, we introduce the GCL model. The overall process achieves multimodal information fusion through cross-modal attention mechanisms and global congruity feedback. The text feature T_cap first passes through a multi-head self-attention layer, which performs learnable linear projections in h parallel subspaces (with parameter matrices W_q^{t,i}, W_k^{t,i}, W_v^{t,i} ∈ R^{d × d/h}, i = 1, 2, …, h), generating context-aware intermediate features. Residual connections then preserve the original feature information, and layer normalization calibrates the mean and variance of the feature distribution. Finally, the dynamically weight-adjusted Q value matrix is output:
Q = LayerNorm( T_cap + Σ_{i=1}^{h} Softmax( (T_cap W_q^{t,i})(T_cap W_k^{t,i})^T / √(d_t/h) ) (T_cap W_v^{t,i}) )
where Q ∈ R^{d_t × d_t}. The multi-head attention mechanism captures heterogeneous semantic patterns of the text (such as sentiment polarity and modification relationships) through multiple sets of independent parameters {W_q^{t,i}, W_k^{t,i}, W_v^{t,i}}. The normalized Q value then serves as a semantic-alignment guiding signal in the subsequent cross-modal attention, interacting with the K/V values in the image feature space to drive global congruity representation learning.
To enhance the expressive ability of the image feature V_g, an injectable prompt vector V_p is introduced. V_g is projected through a learnable matrix to generate the prompt matrix V_p = V_g W_p ∈ R^{n_i × d_p}, where W_p ∈ R^{d_g × d_p} is a learnable parameter and d_p is the dimension of the prompt feature. The original image feature V_g and the prompt feature V_p are concatenated along the feature dimension, and the final key and value are generated from the concatenated vector:
V concat = Concat ( V g , V p )
K = V concat W k
V = V concat W v
where V_concat ∈ R^{n_i × (d_t + d_p)}, W_k ∈ R^{(d_t + d_p) × d_k}, and W_v ∈ R^{(d_t + d_p) × d_v}.
Then, the similarity between the text query Q and the image key K is calculated by matrix multiplication to generate the attention weights, which are then applied to V, and after adding a residual connection, the cross-modal attention-fused feature representation Z is generated:
Z = Softmax( Q K^T / √d_k ) V + V_g
where Z ∈ R^{d_t × d_t}. Here, the addition of the residual connection V_g ensures gradient stability. It not only retains the low-order information of the original image features to avoid information loss, but also alleviates the gradient vanishing problem and accelerates model convergence. Finally, a non-linear transformation is applied to the fusion result:
Z out = FFN ( Z ) = ReLU ( Z W 1 + b 1 ) W 2 + b 2
The final output Z_out ∈ R^{d_t × d_t} breaks the limitation of purely linear interaction in the attention mechanism by introducing the ReLU activation function and multi-layer linear mappings. It maps the fused cross-modal features into a higher-dimensional non-linear space, thereby enhancing the model’s ability to model complex semantic relationships. This transformation can excavate the implicit deep-layer associations in the interaction between text and image, so that the fused features not only retain the local information of the cross-modal alignment but also capture global semantic consistency through non-linear recombination, ultimately improving the robustness and discriminability of the model for multimodal data understanding. Z_out is then input into the classifier for semantic discrimination. Taking the MVSA-Single dataset as an example, there are three labels: positive, neutral, and negative. The classifier predicts one of these three polarities, and the loss is calculated through the cross-entropy loss module:
L_GCL = − Σ_i y_i log p_i
where y_i denotes the ground-truth sentiment label and p_i is the corresponding class probability predicted by the classifier.
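For reference, a minimal PyTorch sketch of the GCL path (Equations (8)–(14)) is given below; the hidden sizes, head counts, and the mean-pooled residual used to reconcile the shapes of the residual term are simplifying assumptions, not the exact configuration.

```python
# Minimal sketch of the GCL path (Eqs. (8)-(14)); dimensions, head counts and the
# mean-pooled residual are simplifying assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalCongruityLearning(nn.Module):
    def __init__(self, d_t=768, d_p=64, n_heads=8, n_classes=3):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_t, n_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(d_t)
        self.prompt_proj = nn.Linear(d_t, d_p)             # W_p: key-prompt generator
        self.to_k = nn.Linear(d_t + d_p, d_t)              # W_k on Concat(V_g, V_p)
        self.to_v = nn.Linear(d_t + d_p, d_t)              # W_v on Concat(V_g, V_p)
        self.ffn = nn.Sequential(nn.Linear(d_t, 4 * d_t), nn.ReLU(),
                                 nn.Linear(4 * d_t, d_t))
        self.classifier = nn.Linear(d_t, n_classes)

    def forward(self, t_cap, v_g, labels=None):
        # Eq. (8): self-attention over caption features, residual + LayerNorm -> Q.
        attn_out, _ = self.self_attn(t_cap, t_cap, t_cap)
        q = self.norm_q(t_cap + attn_out)                            # (B, n_c, d_t)

        # Eqs. (9)-(11): key-prompt injection, then K and V from Concat(V_g, V_p).
        v_p = self.prompt_proj(v_g)                                   # (B, n_i, d_p)
        v_cat = torch.cat([v_g, v_p], dim=-1)
        k, v = self.to_k(v_cat), self.to_v(v_cat)

        # Eq. (12): text-queries-image cross-attention; the V_g residual is
        # mean-pooled here so that the shapes broadcast (a simplification).
        scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
        z = F.softmax(scores, dim=-1) @ v + v_g.mean(dim=1, keepdim=True)

        # Eq. (13): non-linear FFN; pool, classify, and apply Eq. (14) cross-entropy.
        z_out = self.ffn(z)
        logits = self.classifier(z_out.mean(dim=1))
        loss = F.cross_entropy(logits, labels) if labels is not None else None
        return z_out, logits, loss
```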
In the LCL model, the global context information of the two input vectors is first learned before the exchange step. The two modality inputs T_text and V_r are each passed through GAT layers for neighborhood feature aggregation and context modeling to form enhanced independent representations; stacked Crossing Prompter layers then realize multi-level cross-modal interaction:

T = GAT(T_text)

V = GAT(V_r)

where T ∈ R^{n_t × d_t} and V ∈ R^{n_i × d_t}. This helps the model screen out the most useful local relationships within the text and the image, laying a good foundation for subsequent cross-modal fusion.
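The intra-modal GAT step in Equations (15) and (16) can be illustrated with the following single-head graph-attention sketch over a fully connected token graph; the graph construction and layer width are assumptions, since the exact GAT configuration is not restated here.

```python
# Single-head graph-attention sketch for Eqs. (15)-(16) over a fully connected
# token graph (an assumption; the paper's exact graph construction may differ).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGATLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.a = nn.Linear(2 * d_out, 1, bias=False)       # attention scorer a^T[h_i ; h_j]

    def forward(self, x):                                   # x: (B, n, d_in)
        h = self.W(x)                                       # (B, n, d_out)
        n = h.size(1)
        hi = h.unsqueeze(2).expand(-1, -1, n, -1)           # h_i repeated over j
        hj = h.unsqueeze(1).expand(-1, n, -1, -1)           # h_j repeated over i
        e = F.leaky_relu(self.a(torch.cat([hi, hj], dim=-1)).squeeze(-1))   # (B, n, n)
        alpha = F.softmax(e, dim=-1)                        # attention over neighbours j
        return F.elu(alpha @ h)                             # aggregated node features

# Usage: T = SimpleGATLayer(d_t, d_t)(T_text); V = SimpleGATLayer(d_t, d_t)(V_r)
```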
Then, we use the cls token that produces the sentence-level embedding as a reference and prepend it to the text feature T and the image feature V:

T_cls = Concat(cls_t, T)

V_cls = Concat(cls_v, V)

where T_cls ∈ R^{(n_t+1) × d_t}, V_cls ∈ R^{(n_i+1) × d_t}, and cls_t and cls_v represent the initial embeddings of the cls tokens for the text and visual modalities, respectively. The concatenated vectors are then input into α Intra-attention layers to further explore and capture local and global semantic correlations within each modality, yielding the updated embedding matrices T̂_cls ∈ R^{(n_t+1) × d_t} and V̂_cls ∈ R^{(n_i+1) × d_t}.
Then, the exchange-fusion of the two feature vectors is carried out using an exchange mechanism. The specific exchange rule is as follows: for each modality, the tokens with the smallest attention-score proportion with respect to the cls token are selected, and their embedding vectors are replaced with the average embedding of all tokens from the other modality. More precisely, for both modalities, we select a θ-proportion of tokens with the smallest attention scores to cls and perform the information exchange. For example, if the token corresponding to the a-th row is selected from the updated text feature matrix T̂_cls, the embedding update of this token can be expressed as follows:

T̂_cls[a, :] = (1/n) Σ_{i=1}^{n} V̂_cls[i, :] + T̂_cls[a, :]

Similarly, the update process for the V̂_cls vector is as follows:

V̂_cls[a, :] = (1/n) Σ_{i=1}^{n} T̂_cls[i, :] + V̂_cls[a, :]
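A minimal sketch of this θ-proportion exchange rule (Equations (19) and (20)) is given below; the attention scores to the cls token are assumed to be supplied by the preceding Intra-attention layers, and the per-sample loop is kept for clarity rather than efficiency.

```python
# Sketch of the theta-proportion token exchange (Eqs. (19)-(20)); attn_t / attn_v
# are the assumed attention scores of each token to the cls token.
import torch

def cross_exchange(t_cls, v_cls, attn_t, attn_v, theta=0.1):
    """t_cls: (B, n_t+1, d); v_cls: (B, n_i+1, d); attn_*: (B, n+1) scores to cls."""
    def exchange(x, attn_to_cls, other_mean):
        k = max(1, int(theta * (x.size(1) - 1)))            # number of tokens to swap
        # Lowest-scoring tokens, excluding position 0 (the cls token itself).
        idx = attn_to_cls[:, 1:].topk(k, dim=1, largest=False).indices + 1
        out = x.clone()
        for b in range(x.size(0)):
            # Mean embedding of the other modality plus a residual of the old token.
            out[b, idx[b]] = other_mean[b] + x[b, idx[b]]
        return out

    t_mean, v_mean = t_cls.mean(dim=1), v_cls.mean(dim=1)   # (B, d) per-modality means
    return exchange(t_cls, attn_t, v_mean), exchange(v_cls, attn_v, t_mean)
```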
During the cross-modal information exchange, we introduce residual connections to directly transmit historical states to the current update step, suppressing feature degradation and information loss during the iterative process. After b layers of cross-modal interaction in the Crossing Prompter, the exchanged image and text features are denoised through the cross-modal feature adapter projection, outputting p_t^c and p_v^c. These vectors are then fed into c layers of attention modules, which conduct global dependency modeling on the fused features; through the self-attention weights, important features are dynamically enhanced while noise is further suppressed. Finally, the updated embedding matrices T_text and V_r are input into a Feed-Forward Network (FFN) with layer normalization:

T_e = FFN(T_text)

V_e = FFN(V_r)
In the cross-modal feature fusion stage, the model embeds the Transformer outputs of the text branch and the image branch, concatenates them along the feature dimension, and then realizes cross-modal interaction mapping through a multi-layer perceptron to generate a joint representation matrix in a unified semantic space. Here, the KL Loss method is used to assist in aligning the two vectors:
L_LCL = KL( Softmax(Proj&Norm(T_text)) ∥ Softmax(Classifier(V_r)) )
The Classifier outputs the classification probability of V r , and Projection and Normalization maps T text to the dimension matching the classification probability. This design emphasizes the dominance of the visual modality, with the text modality providing auxiliary semantic alignment.
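The alignment objective in Equation (23) can be sketched as follows; the projection-and-normalization head and the classifier are assumed to be externally defined modules, and F.kl_div's log-input convention is used to realize the KL divergence between the two softmax distributions.

```python
# Sketch of the LCL alignment loss in Eq. (23); proj_norm and classifier are the
# assumed Projection & Normalization head and the visual classifier.
import torch.nn as nn
import torch.nn.functional as F

def lcl_loss(t_feat, v_feat, proj_norm: nn.Module, classifier: nn.Module):
    p_text = F.softmax(proj_norm(t_feat), dim=-1)           # text-side distribution
    log_p_vis = F.log_softmax(classifier(v_feat), dim=-1)   # visual-side log-probs
    # F.kl_div(log_q, p) computes KL(p || q): here KL(text distribution || visual one).
    return F.kl_div(log_p_vis, p_text, reduction="batchmean")
```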

3.5. Fusion & Prediction

The multi-modal pre-training model CLIP proposed by OpenAI was the first to achieve large-scale image–text alignment through contrastive learning, opening a new era of multi-modal pre-trained large models. In this module, CLIP plays a core role in cross-modal semantic alignment and prior-knowledge injection. First, we construct a task-related text prompt, “They are mostly expressing a [polarity] feeling”, where [polarity] is a replaceable emotional label word. In the MVSA-Single and MVSA-Multiple datasets, polarity corresponds to positive, neutral, and negative; in the TumEmo dataset, polarity corresponds to Angry, Bored, Calm, Fear, Happy, Sad, and Love. The Text Encoder of CLIP generates the corresponding embedding vector for each polarity label, which serves as an anchor for contrastive learning:
t polarity = E CLIP - Text ( Prompt )
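Generating the polarity anchors of Equation (24) can be sketched with the Hugging Face CLIP interface as follows; the checkpoint openai/clip-vit-base-patch32 is an illustrative assumption.

```python
# Sketch of Eq. (24): CLIP text embeddings of the polarity prompts serve as anchors.
import torch
from transformers import CLIPTokenizer, CLIPModel

clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def polarity_anchors(polarities=("positive", "neutral", "negative")):
    prompts = [f"They are mostly expressing a {p} feeling" for p in polarities]
    tokens = clip_tok(prompts, padding=True, return_tensors="pt")
    return clip.get_text_features(**tokens)                 # (num_polarities, d_clip)
```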
Concatenate the global context representation (GCL) and the local fine-grained representation (LCL) into a fused vector:
h_fuse = Concat(h_GCL, h_LCL)
Use the concatenated fused vector h fuse as Key and Value, and the CLIP-generated polarity text embedding t polarity as Query:
Q = t_polarity W_Q,  K = h_fuse W_K,  V = h_fuse W_V
where W_Q, W_K, and W_V are learnable projection matrices. The queries, keys, and values are fed into an H-head multi-head attention mechanism to achieve cross-modal interaction, and the result is finally mapped to sentiment polarity scores through a projection layer:
h_attn = Concat( Softmax(Q_1 K_1^T / √d_k) V_1, …, Softmax(Q_H K_H^T / √d_k) V_H ) · W_O
S cls = W proj · LayerNorm ( h attn ) + b proj
where W O is the output projection matrix, h attn is the visual feature representation after the final attention pooling, W proj and b proj are classification parameters. The output S cls corresponds to the similarity of different sentiment polarities. For example, in the MVSA-Single and MVSA-Multiple datasets, the similarity scores [ S pos , S neu , S neg ] are compared with all polarity texts generated by CLIP, maximizing the similarity of the target polarity and suppressing others. Finally, the probability distribution is generated through Softmax normalization:
P(polarity) = Softmax(S_cls) = [ exp(S_pos) / Σ_k exp(S_k),  exp(S_neu) / Σ_k exp(S_k),  exp(S_neg) / Σ_k exp(S_k) ]
Here, the multi-polarity contrastive loss function is adopted, mainly because it effectively combines the cross-modal alignment ability of CLIP with the discriminative advantage of contrastive learning:
L_contrast = − (1/N) Σ_{i=1}^{N} log [ exp(S_{i,target}/τ) / ( exp(S_{i,target}/τ) + Σ_{j=1}^{M} exp(S_{j,non-target}/τ) ) ]
where S_{i,target} is the target-polarity embedding similarity of the i-th sample, S_{j,non-target} is the embedding similarity of the j-th non-target polarity, M is the number of negative samples per sample, and τ is the temperature coefficient.
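A compact sketch of the Fusion & Prediction head (Equations (25)–(30)) is shown below; note that the multi-polarity contrastive loss reduces to a cross-entropy over temperature-scaled similarity scores, and the dimensions, head count, and temperature are illustrative assumptions.

```python
# Sketch of the Fusion & Prediction head (Eqs. (25)-(30)); the contrastive loss is
# implemented as cross-entropy over temperature-scaled polarity scores, which is
# algebraically equivalent to Eq. (30). Sizes and tau are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionPrediction(nn.Module):
    def __init__(self, d_fuse=768, d_clip=512, n_heads=8, tau=0.07):
        super().__init__()
        self.q_proj = nn.Linear(d_clip, d_fuse)              # map CLIP anchors to d_fuse
        self.cross_attn = nn.MultiheadAttention(d_fuse, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_fuse)
        self.score = nn.Linear(d_fuse, 1)                    # W_proj, b_proj -> one score per polarity
        self.tau = tau

    def forward(self, h_gcl, h_lcl, anchors, target=None):
        # Eq. (25): concatenate the global (GCL) and local (LCL) token features.
        h_fuse = torch.cat([h_gcl, h_lcl], dim=1)                        # (B, n, d_fuse)
        # Eqs. (26)-(27): polarity anchors query the fused key/value features.
        q = self.q_proj(anchors).unsqueeze(0).expand(h_fuse.size(0), -1, -1)
        h_attn, _ = self.cross_attn(q, h_fuse, h_fuse)                   # (B, n_pol, d_fuse)
        s_cls = self.score(self.norm(h_attn)).squeeze(-1)                # Eq. (28): (B, n_pol)
        p = F.softmax(s_cls, dim=-1)                                     # Eq. (29)
        loss = None
        if target is not None:
            loss = F.cross_entropy(s_cls / self.tau, target)             # Eq. (30)
        return p, s_cls, loss
```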
We combine the cross-entropy loss (L_GCL), the KL divergence loss (L_LCL), and the multi-polarity contrastive loss (L_contrast) to form the overall objective for model training:
L = L GCL + L LCL + L contrast
The procedural steps for implementing the model are presented in Algorithm 1.
Algorithm 1 DINMCL
Require: Training dataset M, sentiment labels L, text T, image I, with {T, I, L} ∈ M
Ensure: Sentiment prediction results y_i
 1: for each iteration do
 2:     for each batch do
 3:         Obtain the textual feature T_text for LCL using the RoBERTa encoder via Equation (1)
 4:         Obtain the textual feature T_cap for GCL using the BLIP caption generator and the RoBERTa encoder via Equations (2) and (3)
 5:         Obtain the image feature V_i through the ViT Transformer via Equation (4)
 6:         Obtain the image feature V_i* through ResNet-50-based Faster R-CNN via Equation (5)
 7:         Obtain the image feature V_g for GCL after transformation via Equation (6)
 8:         Obtain the image feature V_r for LCL after transformation via Equation (7)
 9:         Obtain the key-prompt-injected feature V_concat via Equation (9)
10:         Obtain the Q value matrix from T_cap through attention projection and normalization, as shown in Equation (8)
11:         Obtain the K and V value matrices via Equations (10) and (11)
12:         Obtain the cross-modal attention-fused feature representation Z via Equation (12)
13:         Obtain the final fused result Z_out after the non-linear transformation via Equation (13)
14:         Obtain the enhanced independent representations T and V by feeding T_text and V_r into the GAT layers, as shown in Equations (15) and (16)
15:         Obtain T_cls and V_cls using the sentence-level cls embeddings as references, as shown in Equations (17) and (18)
16:         Obtain the multimodal fusion features T_e and V_e via Equations (19)–(22)
17:         Obtain the contrastive-learning anchors t_polarity via Equation (24)
18:         Obtain the fused vector h_fuse from the concatenation of the GCL and LCL outputs via Equation (25)
19:         Obtain the cross-modal interaction vector h_attn via Equations (26) and (27)
20:         Obtain the sentiment polarity scores S_cls via Equation (28)
21:         Obtain the probability distribution P(polarity) via Equation (29)
22:         Calculate the cross-entropy loss L_GCL via Equation (14)
23:         Calculate the KL loss L_LCL via Equation (23)
24:         Calculate the multi-polarity contrastive loss L_contrast using the label information via Equation (30)
25:         Calculate the total model loss L via Equation (31)
26:         Optimize the model
27:     end for
28: end for
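To complement Algorithm 1, the following is a minimal training-loop sketch combining the three losses of Equation (31); gcl, lcl, and head stand for the module sketches above, anchors are assumed to come from the polarity_anchors sketch, and the frozen feature extractors, optimizer choice, and learning rate are assumptions.

```python
# Minimal training-loop sketch for Eq. (31); gcl, lcl and head denote the module
# sketches above, and feature extraction follows the earlier pathway sketches.
import torch

params = list(gcl.parameters()) + list(lcl.parameters()) + list(head.parameters())
optimizer = torch.optim.AdamW(params, lr=2e-5)               # optimizer choice assumed

for epoch in range(num_epochs):
    for text, image, label in train_loader:
        t_text, t_cap, _ = text_features(text, image)        # Eqs. (1)-(3)
        v_g, v_r = visual_features(image)                    # Eqs. (4)-(7)

        z_out, _, loss_gcl = gcl(t_cap, v_g, label)          # cross-entropy, Eq. (14)
        t_e, v_e, loss_lcl = lcl(t_text, v_r)                # KL alignment, Eq. (23)
        _, _, loss_con = head(z_out, torch.cat([t_e, v_e], dim=1), anchors, label)

        loss = loss_gcl + loss_lcl + loss_con                # total objective, Eq. (31)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```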

4. Experiment

4.1. Dataset

We evaluated the proposed model on three public multimodal sentiment analysis datasets: MVSA-Single, MVSA-Multiple, and TumEmo.
The MVSA-Single and MVSA-Multiple datasets are sourced from Twitter and represent two of the most widely-used image–text sentiment polarity prediction datasets. Each instance in these datasets is annotated with one of three sentiment categories: positive, neutral, or negative. In the MVSA-Single dataset, the numbers of positive, neutral, and negative labeled data are 2683, 470, and 1358 respectively. In the MVSA-Multiple dataset, the numbers of positive, neutral, and negative labeled data are 11,739, 1339, and 3946, respectively.
The TumEmo dataset, derived from Tumblr, is a large-scale, weakly supervised multimodal emotion classification dataset containing roughly 190,000 image–text pairs with fine-grained sentiment annotations. Unlike the MVSA datasets, it is labeled for fine-grained emotion classification rather than simple sentiment polarity (positive/neutral/negative). The numbers of Happy, Love, Bored, Sad, Fear, Calm, and Angry instances are 50,267, 34,511, 32,283, 25,277, 20,264, 18,109, and 14,554, respectively.
These three datasets collectively capture users’ emotional expressions across diverse topics and domains. Their data distribution aligns well with the requirements for sentiment analysis in social media scenarios, effectively supporting model training and application in real-world contexts.
We partitioned each dataset into training, validation, and test sets at an 8:1:1 ratio, with specific partitioning details provided in Table 1.
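A minimal sketch of the 8:1:1 partition is given below (the exact per-split counts are listed in Table 1); the fixed random seed is an assumption for reproducibility.

```python
# Sketch of the 8:1:1 train/validation/test partition; the seed is an assumption.
import torch
from torch.utils.data import random_split

def split_dataset(dataset, seed=42):
    n = len(dataset)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    n_test = n - n_train - n_val                  # remainder goes to the test split
    gen = torch.Generator().manual_seed(seed)     # reproducible partition
    return random_split(dataset, [n_train, n_val, n_test], generator=gen)
```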

4.2. Experimental Setup

The DINMCL model was implemented with the PyTorch framework [31] and trained on an NVIDIA GeForce RTX 3090 GPU with CUDA 12.0 and PyTorch 1.9.1. The experimental parameter settings are detailed in Table 2.

4.3. Baselines

To evaluate the effectiveness of the proposed DINMCL method, we compared our model with both multimodal sentiment models that use identical modalities and unimodal baseline models.
Text Models:
  • CNN [36]: A convolutional neural network commonly employed for text classification tasks.
  • BiLSTM [37]: A bidirectional long short-term memory framework recognized for its effectiveness in recognizing key semantic elements within sentences.
  • BERT [38]: A large-scale pre-trained language model demonstrating exceptional performance, particularly in text classification tasks.
  • TGNN [39]: A graph neural network dedicated to text-level classification.
Image Models:
  • ResNet-50 [40]: A deep convolutional neural network model comprising 50 convolutional layers.
  • OSDA [41]: A foundational model for multi-view image sentiment analysis. Its image modality-specific variants include SGN (focuses on scene features), OGN (focuses on object features), and DuIG (integrates the dual perspectives of objects and scenes).
Multimodal Models:
  • HSAN [19]: A deep learning model based on hierarchical semantic attention mechanisms. It employs layered semantic attention to generate image descriptions.
  • MultiSentiNet [2]: Introduces a visual feature-guided model to extract keywords from text. It performs basic sentiment discrimination through linear fusion of textual and visual features combined with simple threshold-based classification.
  • CoMN [3]: Incorporates a stacked co-memory framework to capture interactions between visual content and textual words. Primarily achieves simple cross-modal information exchange through basic feature concatenation and shallow memory units.
  • MGNNS [26]: A multi-channel graph neural network comprising text channels, image scene channels, and image object channels. Conducts multimodal sentiment analysis based on global features of the dataset.
  • MVAN [41]: Learns sentiment representations via a multi-view attention network. This simplified multi-view feature fusion model integrates shallow features from different views (e.g., text and images) through basic weighted averaging or concatenation.
  • CLMLF [29]: A lightweight cross-modal fusion model that combines text bag-of-words features with image edge detection results using linear weighted concatenation.
  • ITIN [12]: Designs a cross-modal alignment module that enhances fine-grained discriminative capabilities for sentiment classification by mining implicit associations between local image regions and textual words.
  • CTMWA [42]: Proposes a novel Cross-modal Translation-Based Meta Weight Adaptation method. It constructs a cross-modal translation network as an encoder and incorporates a unimodal weight adaptation strategy to optimize feature fusion.
  • BLIP-2 [32]: This study employs the BLIP-2-FLAN-T5-XL variant of the BLIP-2 framework as a baseline model, which efficiently bridges vision and language modules via a learnable Q-Former and utilizes the FLAN-T5-XL decoder to evaluate the zero-shot performance of vision-language models on sentiment analysis tasks.
  • InstructBLIP [33]: A vision-language model built upon the BLIP-2 architecture and enhanced through instruction tuning, with its 13B-parameter version striking a favorable balance between capability and efficiency.
  • LLaVA [34]: This study selects the LLaVA-1.6-34B version as a comparative baseline, which achieves efficient cross-modal understanding through direct projection of visual features and joint training with a large language model, striking a favorable balance between computational efficiency and multimodal capability.
  • Qwen2.5-VL [35]: Qwen2.5-VL-72B, the largest and most capable model in Alibaba’s Qwen2.5-VL series, comprises 72 billion parameters. As a leading dense large vision-language model, it deeply integrates a powerful vision encoder with a language model of comparable scale, achieving state-of-the-art zero-shot performance on a range of general multimodal understanding benchmarks. While its extensive parameter count enables strong in-context learning and complex reasoning capabilities, it also entails significant computational resources for deployment. We conducted inference experiments on the three datasets using a machine equipped with four NVIDIA A100 GPUs (each with 80 GB VRAM).

4.4. Comparison Experiments

To evaluate the effectiveness of our proposed model for sentiment analysis, we compared its experimental results with several standard methods. Detailed results are presented in Table 3.
First, across all datasets, multimodal models significantly outperformed their unimodal counterparts. This stems from the complementarity of information between image and text modalities, where multimodal modeling effectively integrates complementary cues to enhance prediction accuracy. Second, single-image modality models demonstrated the weakest performance, potentially due to the sparsity of emotional representations in images that increases the difficulty of extracting discriminative features. In contrast, single-text modality models performed better, benefiting from the directness, explicitness, richness, and contextual depth of textual sentiment expressions.
Table 3 shows that our model outperforms all baseline methods. On the MVSA-Single dataset, our method achieves an accuracy 1.20% higher than Qwen2.5-VL-72B and 1.25% higher than the CTMWA model, and an F1 value 0.72% higher than Qwen2.5-VL-72B and 0.79% higher than CTMWA. On the MVSA-Multiple dataset, our model achieves an accuracy 1.48% higher and an F1 value 1.43% higher than Qwen2.5-VL-72B. On the TumEmo dataset, our model achieves an accuracy 1.16% higher than CTMWA and 1.30% higher than Qwen2.5-VL-72B, and an F1 value 1.09% higher than CTMWA and 1.29% higher than Qwen2.5-VL-72B.
This study demonstrates that, while general-purpose large models such as Qwen2.5-VL-72B exhibit remarkable capabilities, they present limitations in the specific task of multimodal sentiment analysis: their generalist design struggles to capture fine-grained emotional cues, while simultaneously incurring prohibitively high computational costs. Most of the aforementioned baseline models are not specifically designed for image–text sentiment analysis [43], and their fusion is suboptimal, which raises uncertainty about whether the learned features genuinely capture sentiment. The majority of existing research tends to adopt relatively simplistic strategies: for instance, features extracted from different modalities (such as images and text) are often directly concatenated, or only coarse-grained interactions are captured at the macro level. Such approaches struggle to delve into the fine-grained, deep-level semantic correlations embedded between images and text. More critically, the core focus of these methods often overemphasizes feature fusion itself, neglecting the essential requirement of the sentiment analysis task: ensuring that cross-modal representations possess high consistency in sentiment semantics. Due to the lack of targeted sentiment-oriented constraints, the fusion process fails to effectively guide the model in learning semantic information highly relevant to sentiment. Consequently, the final learned cross-modal representations struggle to precisely capture and represent the complex sentiment semantic relationships deeply embedded within the image and text information, which fundamentally limits the model’s discriminative power and accuracy in sentiment analysis tasks.
Our model first employs multi-level feature representations, going beyond simple feature concatenation to establish a richer foundation for subsequent analysis. At its core lies a dual mechanism comprising Global Congruity Learning (GCL) and Local Crossing-Congruity Learning (LCL): GCL ensures alignment and consistency of overall sentiment semantics across modalities at the global level through techniques such as self-attention and prompt injection, while LCL focuses on fine-grained local interactions between modalities, delving into the nuanced sentiment correlations embedded between image patches and text segments. Finally, during the fusion and prediction stage, we innovatively incorporate CLIP-based contrastive learning. Using CLIP’s powerful image–text alignment capability, the classifier and projection layer compute similarity scores based on sentiment semantics. This design significantly strengthens the sentiment-oriented constraints within the fusion process, compelling the learned cross-modal representations to precisely reflect the deep-level, consistent sentiment semantics inherent in the image–text information. Consequently, it effectively captures complex sentiment relationships, ultimately improving the accuracy of sentiment analysis.

4.5. Ablation Experiments

To validate the performance of the model building blocks, we conducted ablation studies on three core datasets, aiming to dissect the core modules of the model and measure their individual impact, with detailed results presented in Table 4.
w/o Captioner: Based on the full model, we removed the BLIP image captioner from the Multi-level Feature Representation module and instead directly used the original text processed by RoBERTa to generate the feature vectors. As evidenced in Table 4, this experiment confirms the crucial role of the Captioner’s semantic distillation capability in the model.
w/o ViT: In the global consistency learning pathway, when the local object features extracted by Faster R-CNN are used to replace the global scene features extracted by ViT, the model’s performance on all datasets experiences a significant and consistent decline. This strongly demonstrates that the global visual context provided by ViT is indispensable for the model to achieve effective sentiment understanding and cannot be completely replaced by local object features.
w/o Faster-RCNN: This experiment aims to verify the irreplaceability of local visual cues by replacing the input features in the LCL pathway—originally the local object features extracted by Faster-RCNN—with the global scene features provided by ViT. The experimental results clearly show that this replacement leads to a consistent and significant performance decline of the model across all datasets. This finding strongly confirms that the fine-grained information of salient objects in images captured by Faster-RCNN constitutes a crucial foundation for the LCL module to achieve effective fine-grained cross-modal alignment.
w/o GCL: Based on the full model, we removed the GCL module and relied exclusively on the LCL module for fine-grained cross-modal local interactions. The significant performance drop observed after removing GCL demonstrates the indispensable role of Global Congruity Learning in scene-level sentiment understanding for multimodal affective tasks.
w/o LCL: With all other components intact, we ablated the LCL module and retained only the GCL module. The results indicate that relying solely on GCL's global correlations fails to capture fine-grained semantics, leading to the loss of local affective cues.
w/o GCL Key Prompt: With all other modules unchanged, we replaced the dynamically generated Key Prompt in the GCL module with static vectors. Results confirm that the dynamically generated Key Prompt provides semantic guidance for cross-modal global alignment. By enabling real-time adaptation to sample content, it directs the network to precisely identify high-value cross-modal interaction regions while suppressing extraneous noise, thereby enhancing the robustness and generalization capability of affective decision-making.
w/o LCL Crossing Prompter: In the LCL module, we removed the Crossing Prompter while retaining the graph attention network (GAT) for feature processing, and replaced the Crossing Prompter with a standard Transformer block that lets image and text features interact through multi-head attention. The results show a significant performance drop. GAT operates only within a single modality and cannot directly capture interactions between the image and text modalities, so if the subsequent module fails to model cross-modal relationships effectively, the quality of the final fused features degrades markedly. Moreover, the attention of a standard Transformer is global and cannot specifically optimize local interactions, whereas the Crossing Prompter better captures the local interaction relationships between the image and text modalities.
w/o CLIP: In the Fusion & Prediction module, we substituted standard RoBERTa embeddings for the affect-polarized textual features generated by the CLIP text encoder. The results show that conventional RoBERTa cannot produce CLIP-style structured affective features, which undermines the effectiveness of contrastive learning. This also verifies that the cross-modal semantic correlations established through contrastive pretraining provide a highly discriminative anchor space for affective classification.
w/o Fusion & Prediction: We removed the Fusion & Prediction module and instead directly concatenated the output vectors of GCL and LCL as input to the classification layer for affective prediction. The significant performance degradation demonstrates the critical importance of this module.
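The ablation variants above can be assembled from a single configuration object, as in the following sketch; the class and flag names are hypothetical and only illustrate how each module is toggled while the rest of the pipeline stays unchanged.

```python
# Hypothetical configuration-driven ablation harness; the flag names are
# illustrative and do not reproduce the authors' code.
from dataclasses import dataclass

@dataclass
class AblationConfig:
    use_captioner: bool = True      # BLIP image-text captioner branch
    use_vit: bool = True            # global scene features from ViT
    use_faster_rcnn: bool = True    # local object features from Faster R-CNN
    use_gcl: bool = True            # Global Congruity Learning pathway
    use_lcl: bool = True            # Local Crossing-Congruity Learning pathway
    use_clip_anchor: bool = True    # CLIP sentiment anchors in Fusion & Prediction

def build_variant(name: str) -> AblationConfig:
    """Return the configuration corresponding to one 'w/o X' row of Table 4."""
    overrides = {
        "w/o Captioner":   {"use_captioner": False},
        "w/o ViT":         {"use_vit": False},
        "w/o Faster-RCNN": {"use_faster_rcnn": False},
        "w/o GCL":         {"use_gcl": False},
        "w/o LCL":         {"use_lcl": False},
        "w/o CLIP":        {"use_clip_anchor": False},
    }
    cfg = AblationConfig()
    for key, value in overrides.get(name, {}).items():
        setattr(cfg, key, value)
    return cfg

print(build_variant("w/o GCL"))   # AblationConfig(..., use_gcl=False, ...)
```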

4.6. Case Study

To validate the effectiveness of the proposed model, we conducted a case study building upon the ablation experiments. By analyzing the four cases depicted in Figure 4, we further confirm the validity of our approach. In the figure, DINMCL-GCL denotes the variant with the global congruity learning module removed, DINMCL-LCL the variant with the local crossing-congruity learning module removed, and DINMCL-CLIP the variant without the final contrastive learning module.
In Case 1, the text alone clearly reads as negative; combined with the image, however, it can be inferred to be an instance of sarcasm. Here, the LCL module aligns image–text semantics through fine-grained feature learning: its core function is to establish fine-grained associations between local image objects and textual sentiment phrases. Without LCL, the model cannot align the ironic word “rubbish” with the visual cues in the image, such as the beautiful scenery and the joy of cycling, so the irony signal is lost. In Case 2, the emotional signals in both image and text are pronounced and focused, resulting in superior model performance. Case 3 (from the TumEmo dataset, whose labels are Angry, Bored, Calm, Fear, Happy, Love, and Sad) shows that removing GCL eliminates the global consistency constraint and amplifies the LCL module's sensitivity to local features, causing overfitting to partial signals. The fact that only DINMCL-GCL incorrectly predicts “Love” indicates that, without the global context provided by GCL, the model's grasp of fine-grained sentiment boundaries becomes ambiguous, even though LCL still captures precise local contextual cues. Case 4 demonstrates that, even in the ideal scenario where the sentiment signals of image and text are highly consistent, the absence of the LCL module can still lead to misjudgment. This reveals that fine-grained cross-modal alignment plays an irreplaceable role in suppressing semantic noise and ensuring decision robustness.
Therefore, synergistic collaboration between GCL and LCL is essential to sufficiently learn and fuse multimodal representations from both global and local perspectives.

4.7. Parametric Experiments

The choice of the number of layers significantly impacts model performance: an appropriate number helps the model learn rich and effective features while preserving its generalization ability. The core of the experiment lies in determining the optimal number of layers to balance the model’s fitting on training data against its generalization capability on unseen data.
Final Multi-Head Cross Attention Layer: This layer employs CLIP-generated sentiment polarity text embeddings as guiding signals to dynamically select and enhance the most sentiment-relevant cross-modal semantic associations from the concatenated GCL and LCL multimodal features, ultimately driving the emotion classification decision. As illustrated in Figure 5, emotional cues may be distributed across different segments of global and local features. A single attention computation may be insufficient to capture all critical interactions, leading to inadequate information filtering and fusion that compromises classification accuracy. Deep architectures help the model progressively filter out noise unrelated to target emotions while amplifying key cues. However, excessively deep attention layers may cause the model to overfit to specific patterns or noise in the training data, thereby diminishing generalization capability.
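A minimal sketch of such a configurable-depth cross-attention stack is given below; the dimensions, the pooling step, and the exact query/key roles are illustrative assumptions rather than the published implementation.

```python
# Minimal sketch of a configurable-depth cross-attention stack; dimensions,
# pooling, and the query/key roles are assumptions for illustration.
import torch
import torch.nn as nn

class CrossAttentionStack(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, depth: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(depth)])

    def forward(self, fused_tokens, sentiment_anchors):
        # fused_tokens: (B, N, dim) concatenated GCL/LCL features
        # sentiment_anchors: (B, C, dim) CLIP sentiment-polarity text embeddings
        x = fused_tokens
        for attn, norm in zip(self.layers, self.norms):
            out, _ = attn(query=x, key=sentiment_anchors, value=sentiment_anchors)
            x = norm(x + out)          # residual + norm: progressively filter noise
        return x.mean(dim=1)           # pooled representation for the classifier

stack = CrossAttentionStack(depth=3)   # 3 layers is the MVSA-Single setting in Table 2
pooled = stack(torch.randn(4, 16, 512), torch.randn(4, 3, 512))
print(pooled.shape)                    # torch.Size([4, 512])
```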
LCL Crossing Prompter Layer: Within the LCL module, the Crossing Prompter's core function is to leverage dynamically generated Key Prompts to guide the network toward cross-modal local regions, enabling fine-grained feature interactions. As shown in Figure 6, both accuracy and F1 scores peak at two layers on the MVSA-Single, MVSA-Multiple, and TumEmo datasets. When the layer count increases from two to three, performance declines on all datasets, and the four-layer and five-layer configurations also perform clearly worse than the two-layer one. Excessively deep prompters may introduce redundant computation or noise interference, compromising the effectiveness of local feature interactions. The two-layer Crossing Prompter therefore achieves the best balance for the emotion classification task, capturing fine-grained cross-modal interactions while avoiding the degradation caused by over-complexity. These experiments were conducted under the optimal GCL configuration.
GCL Layer: The core function of the GCL module is to establish high-level semantic alignment between text and image by dynamically filtering cross-modal global associations; its self-attention gating mechanism suppresses irrelevant noise while amplifying emotion-relevant global cues. As shown in Figure 7, a single-layer structure achieves only shallow interactions and fails to distill complex semantics sufficiently. In contrast, the two-layer GCL configuration performs best, enabling deeper and more meaningful cross-modal interactions and effectively capturing intricate semantic dependencies. Increasing the depth beyond two layers, however, introduces overfitting and feature homogenization, where the model becomes overly specialized and loses its ability to generalize across diverse samples. These experiments were conducted under the optimal LCL configuration.
Temperature Parameter: This hyperparameter controls the temperature coefficient in contrastive learning, whose core function is to regulate how sharply sample similarities are distributed in the embedding space. As shown in Figure 8, when the temperature is too low, the model becomes overly sensitive to the similarity distribution; this enhances sample distinctiveness but tends to overfit local features, making it difficult to capture global consistency. Conversely, when the temperature is too high, sensitivity to the similarity distribution decreases, the sample distribution becomes more uniform, and distinctiveness weakens, ultimately degrading classification performance. Experiments show that a temperature of 0.07 balances distinctiveness and robustness in the embedding space, mitigating overfitting to local features while preserving global feature consistency, and yields the best classification performance across the datasets.
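The role of the temperature can be seen in a generic symmetric contrastive loss; the sketch below is a standard InfoNCE-style formulation with τ = 0.07 and is not claimed to reproduce the paper's exact objective.

```python
# Generic symmetric InfoNCE-style contrastive loss with a temperature term
# (an illustrative sketch, not the paper's exact objective).
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature: float = 0.07):
    """Matched image-text pairs share the same batch index."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)          # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)      # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

print(contrastive_loss(torch.randn(16, 512), torch.randn(16, 512)).item())
```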

4.8. Model Complexity Analysis

As shown in Table 5, the proposed DINMCL model strikes a good balance between computational resource consumption and efficiency. Compared with Qwen2.5-VL-72B, a general-purpose model with a massive parameter scale (72,000 M parameters, 145 GB memory), DINMCL (158.7 M parameters, 6.8 GB memory) is a dedicated sentiment analysis model with lightweight resource requirements, making practical deployment feasible. Compared with CTMWA, the strongest specialized model in the same field, DINMCL trades a moderate increase in computational cost (approximately 30% more parameters and 31% more FLOPs) for stable performance gains. This indicates that the added complexity of DINMCL stems mainly from the specialized fusion modules designed to enhance sentiment alignment rather than from parameter redundancy, and its computational cost remains within a reasonable range for specialized models.

The total parameter count of DINMCL is 158.7 M, of which only 35% are trainable. Most of the parameters and computational overhead come from the four frozen pre-trained encoders. To fully exploit their general representation knowledge while avoiding catastrophic forgetting and overfitting, we adopt a freezing strategy: the frozen components (65% of the total parameters) comprise the text encoders of RoBERTa and BLIP, pre-trained on massive text corpora, and ViT-large and Faster R-CNN, pre-trained on large-scale image datasets with strong visual feature extraction capabilities. Freezing prevents these powerful feature extractors from being skewed by the smaller multimodal sentiment datasets and significantly reduces training overhead. The key modules (GCL and LCL) contribute more than half of the performance gain at a relatively light parameter cost (approximately 41% combined). Compared with baselines such as CTMWA, DINMCL achieves a 1.25% accuracy improvement on MVSA-Single with only about 1.3 times the parameter count, and its modular architecture provides a clear pathway for future compression and optimization, confirming that the current complexity is both necessary and efficient for addressing the fundamental challenge of sentiment alignment.
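The freezing strategy and the trainable-parameter accounting described above can be expressed with a short utility sketch; the encoder attribute names in the usage comment are hypothetical placeholders.

```python
# Sketch of the encoder-freezing strategy and trainable-parameter accounting;
# the sub-module names in the usage comment are hypothetical placeholders.
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Disable gradients so a pre-trained encoder is not updated during training."""
    for p in module.parameters():
        p.requires_grad = False

def param_report(model: nn.Module) -> str:
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return f"total={total / 1e6:.1f}M, trainable={trainable / 1e6:.1f}M ({100 * trainable / total:.0f}%)"

# Usage outline, assuming a model that exposes these encoder attributes:
# for encoder in (model.roberta, model.blip_text, model.vit, model.faster_rcnn):
#     freeze(encoder)
# print(param_report(model))
```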

4.9. Cross-Dataset Generalization Evaluation

MVSA-Single → MVSA-Multiple: In this setup, all samples from the MVSA-Single dataset are used as the training set, so the model learns from this data distribution. After training, the model is evaluated directly on all test samples of the MVSA-Multiple dataset, without using any training or validation information from that dataset. This simulates learning from one data distribution and applying the model to a different but related distribution.
MVSA-Multiple → MVSA-Single: This is the reverse transfer setup. The model is trained on all data from MVSA-Multiple and then evaluated on the complete test set of MVSA-Single. This setting further tests whether the model's generalization is independent of the transfer direction and robust to changes in the scale of the training data.
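The transfer protocol amounts to a simple train-on-source, test-on-target loop, as the sketch below illustrates; the dataset objects and helper functions are hypothetical placeholders rather than the released code.

```python
# Sketch of the bidirectional cross-dataset protocol; dataset objects, the
# trainer, and the evaluator are hypothetical placeholders.
def cross_dataset_eval(train_data, test_data, build_model, train_fn, eval_fn):
    """Train on the full source dataset, then test directly on the target test split."""
    model = build_model()
    train_fn(model, train_data)       # the target dataset is never seen during training
    return eval_fn(model, test_data)  # e.g., returns (accuracy, F1) on the target set

# Usage outline (placeholders):
# acc_s2m, f1_s2m = cross_dataset_eval(mvsa_single_all, mvsa_multiple_test,
#                                      build_dinmcl, train_epochs, evaluate)
# acc_m2s, f1_m2s = cross_dataset_eval(mvsa_multiple_all, mvsa_single_test,
#                                      build_dinmcl, train_epochs, evaluate)
```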
As shown in Table 6, under both strict cross-dataset transfer settings, the proposed DINMCL model achieves significant and consistent leading performance. In the MVSA-Single → MVSA-Multiple task, DINMCL's accuracy far surpasses that of the best baseline, CTMWA, and in the reverse MVSA-Multiple → MVSA-Single task it maintains a clear advantage. Moreover, compared to the results on the original in-distribution test sets, DINMCL's performance remains nearly unchanged, whereas the other baselines decline markedly. This strongly indicates that the emotion alignment signals captured by DINMCL through multi-level consistency learning generalize well rather than overfitting to dataset-specific biases.

4.10. Fine-Grained Performance Analysis

To comprehensively evaluate model robustness under class imbalance, we further analyzed the fine-grained performance of all models on the three datasets using balanced accuracy and the macro-F1 score. Balanced accuracy is the average of the per-class recalls and treats every class equally; macro-F1 is the unweighted arithmetic mean of the per-class F1 scores. As shown in Table 7, DINMCL significantly outperforms the baselines on both metrics, indicating that its advantage does not stem from overfitting to the majority classes but reflects genuine cross-category generalization.
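Both metrics can be computed directly with scikit-learn, as the illustrative snippet below shows with toy labels.

```python
# Computing balanced accuracy and macro-F1 with scikit-learn (illustrative toy labels).
from sklearn.metrics import balanced_accuracy_score, f1_score

y_true = [0, 0, 1, 1, 2, 2, 2, 2]   # imbalanced toy ground truth
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]   # toy predictions

bal_acc = balanced_accuracy_score(y_true, y_pred)       # mean of per-class recall
macro_f1 = f1_score(y_true, y_pred, average="macro")    # unweighted mean of per-class F1
print(f"balanced accuracy = {bal_acc:.3f}, macro-F1 = {macro_f1:.3f}")
```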

5. Discussion

The multimodal fusion model proposed in this paper systematically addresses the core challenges of image–text sentiment analysis through a three-tier progressive design. At the feature representation level, the four-branch parallel encoding architecture (RoBERTa / ViT / Faster R-CNN / BLIP+RoBERTa) constructs a multi-granularity representation space that disentangles global and local features, avoiding the loss of emotional cues inherent in traditional concatenation. At the interaction level, the consistency learning network divides modal interaction into global and local pathways: the GCL module employs multi-head cross-attention and self-attention gating to dynamically filter cross-modal associations, while the LCL module combines Crossing Prompters with GAT to achieve fine-grained semantic alignment of local regions guided by Key Prompts. This hierarchical interaction mechanism enables deep coupling of emotional semantics between the two modalities. At the decision level, the sentiment-polarized text embeddings generated by CLIP serve as contrastive learning anchors, allowing the model to explicitly capture sentiment-oriented features within the multi-head cross-attention layer and to reach interpretable decisions through similarity computation. Experimental results show that the proposed model significantly outperforms baseline methods on three public datasets, and the ablation studies further validate the contribution of each module.

However, this study has limitations that warrant further investigation. Although the four-encoder structure enriches the features, it also incurs considerable computational cost and memory overhead, limiting its use in resource-constrained environments. While the ablation experiments confirm the necessity of each encoder, how to achieve model lightweighting via knowledge distillation or parameter sharing remains a pressing issue.
Future work will focus on the following directions: (1) developing a lightweight dynamic feature selection mechanism to balance computational efficiency and model performance; (2) introducing domain adaptation strategies to mitigate data bias; (3) constructing a more systematic error analysis framework to clarify failure boundaries.

6. Conclusions

This study constructs a multimodal fusion framework that not only improves sentiment analysis performance but also advances cross-modal understanding through hierarchically structured interaction mechanisms. In terms of application expansion, the architecture offers threefold generic applicability: (1) the modular design of GCL/LCL can be transferred to domains such as multimodal retrieval and video analysis; (2) the CLIP-driven contrastive learning mechanism provides a general paradigm for interpretable decision-making; (3) the integration of the Crossing Prompter with graph neural networks opens new pathways for structured cross-modal reasoning. Future work will further explore lightweight deployment solutions and conduct in-depth validation in scenarios such as social media event analysis and affective computing for human–computer interaction, thereby advancing multimodal understanding from specialized tasks toward general cognitive capabilities.

7. Abbreviations

To ensure terminological clarity and consistency, all abbreviations for models, methods, and technical terms used in this paper are compiled in Table 8 for quick reference.

Author Contributions

Conceptualization, supervision, and validation, C.W.; investigation, software, and writing—original draft, Z.J.; methodology and data curation, Q.X.; formal analysis and writing—review and editing, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the grants from the Natural Science Foundation of Shandong Province (ZR2024MF145), the National Natural Science Foundation of China (62072469), and the Qingdao Natural Science Foundation (23-2-1-162-zyyd-jch).

Informed Consent Statement

Informed consent for participation was obtained from all subjects involved in the study.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, C.; Hu, Z. Multimodal sentiment analysis of social media based on top-layer fusion. In Proceedings of the 2022 IEEE 8th International Conference on Computer and Communications (ICCC), Chengdu, China, 9–12 December 2022; pp. 1–6. [Google Scholar]
  2. Xu, N.; Mao, W. MultiSentiNet: A deep semantic network for multimodal sentiment analysis. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; pp. 2399–2402. [Google Scholar]
  3. Xu, N.; Mao, W.; Chen, G. A co-memory network for multimodal sentiment analysis. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 929–932. [Google Scholar]
  4. Li, S.; Tang, H. Multimodal Alignment and Fusion: A Survey. Available online: https://arxiv.org/abs/2411.17040 (accessed on 12 January 2024).
  5. Yu, W.; Xu, H.; Yuan, Z.; Wu, J. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. Proc. AAAI Conf. Artif. Intell. 2021, 35, 10790–10797. [Google Scholar] [CrossRef]
  6. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.-P. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017; pp. 1103–1114. [Google Scholar]
  7. Ghosal, D.; Akhtar, M.S.; Chauhan, D.; Poria, S.; Ekbal, A.; Bhattacharyya, P. Contextual inter-modal attention for multi-modal sentiment analysis. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 3454–3466. [Google Scholar]
  8. Huang, F.; Zhang, X.; Zhao, Z.; Xu, J.; Li, Z. Image-text sentiment analysis via deep multimodal attentive fusion. Knowl.-Based Syst. 2019, 167, 26–37. [Google Scholar] [CrossRef]
  9. Zeng, Y.; Li, Z.; Tang, Z.; Chen, Z.; Ma, H. Heterogeneous graph convolution based on in-domain self-supervision for multimodal sentiment analysis. Expert Syst. Appl. 2023, 213, 119240. [Google Scholar] [CrossRef]
  10. Liu, Z.; Shen, Y.; Lakshminarasimhan, V.B.; Liang, P.P.; Zadeh, A.B.; Morency, L.-P. Efficient low-rank multimodal fusion with modality-specific factors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1:Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2247–2256. [Google Scholar]
  11. Hao, Z.; Jin, Y.; Yan, X.; Wang, C.; Yang, S.; Ge, H. Cross-modal hashing retrieval with compatible triplet representation. Neurocomputing 2024, 602, 128293. [Google Scholar] [CrossRef]
  12. Zhu, T.; Li, L.; Yang, J.; Zhao, S.; Liu, H.; Qian, J. Multimodal sentiment analysis with image-text interaction network. IEEE Trans. Multimed. 2023, 25, 3375–3385. [Google Scholar] [CrossRef]
  13. Tsai, Y.-H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.-P.; Salakhutdinov, R. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6558–6569. [Google Scholar]
  14. Xue, X.; Zhang, C.; Niu, Z.; Wu, X. Multi-level attention map network for multimodal sentiment analysis. IEEE Trans. Knowl. Data Eng. 2023, 35, 5105–5118. [Google Scholar] [CrossRef]
  15. Zhao, F.; Zhang, C.; Geng, B. Deep multimodal data fusion. ACM Comput. Surv. 2024, 56, 1–36. [Google Scholar] [CrossRef]
  16. Tuerhong, G.; Dai, X.; Tian, L.; Wushouer, M. An end-to-end image-text matching approach considering semantic uncertainty. Neurocomputing 2024, 607, 128386. [Google Scholar] [CrossRef]
  17. Wang, M.; Cao, D.; Li, L.; Li, S.; Ji, R. Microblog sentiment analysis based on cross-media bag-of-words model. In Proceedings of the International Conference on Internet Multimedia Computing and Service, Xiamen, China, 10–12 July 2014; pp. 76–80. [Google Scholar]
  18. You, Q.; Luo, J.; Jin, H.; Yang, J. Cross-modality consistent regression for joint visual-textual sentiment analysis of social multimedia. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, San Francisco, CA, USA, 22–25 February 2016; pp. 13–22. [Google Scholar]
  19. Xu, N. Analyzing multimodal public sentiment based on hierarchical semantic attentional network. In Proceedings of the 2017 IEEE International Conference on Intelligence and Security Informatics (ISI), Beijing, China, 22–24 July 2017; pp. 152–154. [Google Scholar]
  20. Truong, Q.-T.; Lauw, H.W. VistaNet: Visual aspect attention network for multimodal sentiment analysis. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 305–312. [Google Scholar]
  21. Zhao, Z.; Zhu, H.; Xue, Z.; Liu, Z.; Tian, J.; Chua, M.C.H.; Liu, M. An image-text consistency driven multimodal sentiment analysis approach for social media. Inf. Process. Manag. 2019, 56, 102097. [Google Scholar] [CrossRef]
  22. Poria, S.; Cambria, E.; Hazarika, D.; Majumder, N.; Zadeh, A.; Morency, L.-P. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 873–883. [Google Scholar]
  23. Basu, P.; Tiwari, S.; Mohanty, J.; Karmakar, S. Multimodal sentiment analysis of #MeToo tweets using focal loss (grand challenge). In Proceedings of the 2020 IEEE Sixth International Conference on Multimedia Big Data (BigMM), New Delhi, India, 24–26 September 2020; pp. 461–465. [Google Scholar]
  24. Thuseethan, S.; Janarthan, S.; Rajasegarar, S.; Kumari, P.; Yearwood, J. Multimodal deep learning framework for sentiment analysis from text-image web data. In Proceedings of the 2020 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Melbourne, Australia, 14–17 December 2020; pp. 267–274. [Google Scholar]
  25. Xu, N.; Mao, W.; Chen, G. Multi-interactive memory network for aspect based multimodal sentiment analysis. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 371–378. [Google Scholar]
  26. Yang, X.; Feng, S.; Zhang, Y.; Wang, D. Multimodal sentiment detection based on multi-channel graph neural networks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual, 1–6 August 2021; pp. 328–339. [Google Scholar]
  27. Xiao, X.; Pu, Y.; Zhao, Z.; Gu, J.; Xu, D. BIT: Improving image-text sentiment analysis via learning bidirectional image-text interaction. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN), Gold Coast, Australia, 18–23 June 2023; pp. 1–9. [Google Scholar]
  28. Yu, J.; Chen, K.; Xia, R. Hierarchical interactive multimodal transformer for aspect-based multimodal sentiment analysis. IEEE Trans. Affect. Comput. 2023, 14, 1966–1978. [Google Scholar] [CrossRef]
  29. Li, Z.; Xu, B.; Zhu, C.; Zhao, T. CLMLF: A contrastive learning and multi-layer fusion method for multimodal sentiment detection. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, USA, 10–15 July 2022; pp. 2282–2294. [Google Scholar]
  30. Wang, H.; Ren, C.; Yu, Z. Multimodal sentiment analysis based on cross-instance graph neural networks. Appl. Intell. 2024, 54, 3403–3416. [Google Scholar] [CrossRef]
  31. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems 32, Vancouver, BC, Canada, 8–14 December 2019; pp. 8024–8035. [Google Scholar]
  32. Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv 2023, arXiv:2301.12597. [Google Scholar]
  33. Dai, W.; Li, J.; Li, D.; Tiong, A.M.H.; Zhao, J.; Wang, W.; Li, B.; Fung, P.; Hoi, S. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv 2023, arXiv:2305.06500. [Google Scholar]
  34. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. arXiv 2023, arXiv:2304.08485. [Google Scholar]
  35. Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; et al. Qwen2.5-vl technical report. arXiv 2025, arXiv:2502.13923. [Google Scholar] [CrossRef]
  36. Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1746–1751. [Google Scholar]
  37. Zhou, P.; Shi, W.; Tian, J.; Qi, Z.; Li, B.; Hao, H.; Xu, B. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany, 7–12 August 2016; pp. 207–212. [Google Scholar]
  38. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar] [CrossRef]
  39. Huang, L.; Ma, D.; Li, S.; Zhang, X.; Wang, H. Text Level Graph Neural Network for Text Classification. arXiv 2019, arXiv:1910.02356. [Google Scholar] [CrossRef]
  40. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  41. Yang, X.; Feng, S.; Wang, D.; Zhang, Y. Image-text multimodal emotion classification via multi-view attentional network. IEEE Trans. Multimed. 2021, 23, 4014–4026. [Google Scholar] [CrossRef]
  42. Zhang, B.; Yuan, Z.; Xu, H.; Gao, K. Crossmodal translation based meta weight adaption for robust image-text sentiment analysis. IEEE Trans. Multimed. 2024, 26, 9949–9961. [Google Scholar] [CrossRef]
  43. Ramezani, E.B. Sentiment analysis applications using deep learning advancements in social networks: A systematic review. Neurocomputing 2025, 634, 129862. [Google Scholar] [CrossRef]
Figure 1. Four examples of tweets on image–text sentiment analysis are shown.
Figure 2. The architecture of the Dual-path Interaction Network with Multi-level Consistency Learning (DINMCL).
Figure 3. The structure of the Crossing Prompter in the Local Crossing-Congruity Learning (LCL) module.
Figure 4. Label prediction results.
Figure 5. Experimental results of the Final Multi-Head Cross Attention at different layers.
Figure 6. Experimental results of the LCL Crossing Prompter at different layers.
Figure 7. Experimental results of the GCL at different layers.
Figure 8. Experimental results for different temperature parameters.
Table 1. Statistics of the MVSA and TumEmo datasets.
Dataset | Train | Val | Test | Total
MVSA-Single | 3611 | 450 | 450 | 4511
MVSA-Multiple | 13,642 | 2409 | 2410 | 17,024
TumEmo | 156,204 | 19,525 | 19,536 | 195,265
Table 2. Experimental parameter settings.
Parameter | MVSA-Single | MVSA-Multiple | TumEmo
Learning rate | 2 × 10⁻⁵ | 2 × 10⁻⁵ | 2 × 10⁻⁵
Dropout | 0.1 | 0.1 | 0.1
Epoch | 30 | 30 | 60
Optimizer | AdamW | AdamW | AdamW
GCL layer | 2 | 2 | 2
LCL Crossing Prompter layer | 2 | 2 | 2
Final Multi-Head Cross Attention layer | 3 | 4 | 4
Temperature parameter | 0.07 | 0.07 | 0.07
Batch size | 32 | 32 | 16
Table 3. Experimental results on the MVSA-Single, MVSA-Multiple, and TumEmo datasets.
Modality | Model | MVSA-Single Acc | MVSA-Single F1 | MVSA-Multiple Acc | MVSA-Multiple F1 | TumEmo Acc | TumEmo F1
Text | CNN | 0.6819 | 0.5590 | 0.6564 | 0.5766 | 0.6154 | 0.4774
Text | BiLSTM | 0.7012 | 0.6506 | 0.6790 | 0.6790 | 0.6188 | 0.5126
Text | BERT | 0.7111 | 0.6970 | 0.6759 | 0.6624 | - | -
Text | TGNN | 0.7034 | 0.6594 | 0.6967 | 0.6180 | 0.6379 | 0.6362
Image | ResNet50 | 0.6467 | 0.6155 | 0.6188 | 0.6098 | - | -
Image | OSDA | 0.6675 | 0.6651 | 0.6662 | 0.6623 | 0.4770 | 0.3438
Image | SGN | 0.6620 | 0.6248 | 0.6765 | 0.5864 | 0.4353 | 0.4232
Image | OGN | 0.6659 | 0.6191 | 0.6743 | 0.6010 | 0.4564 | 0.4446
Image | DuIG | 0.6822 | 0.6538 | 0.6819 | 0.6081 | 0.4636 | 0.4561
Image–Text | HSAN | 0.6988 | 0.6690 | 0.6796 | 0.6776 | 0.6309 | 0.5398
Image–Text | MultiSentiNet | 0.6984 | 0.6963 | 0.6886 | 0.6811 | 0.6418 | 0.5962
Image–Text | CoMN | 0.7051 | 0.7001 | 0.6992 | 0.6983 | 0.6426 | 0.5909
Image–Text | MGNNS | 0.7377 | 0.7270 | 0.7249 | 0.6934 | 0.6672 | 0.6669
Image–Text | MVAN | 0.7298 | 0.7298 | 0.7236 | 0.7230 | 0.6646 | 0.6339
Image–Text | CLMLF | 0.7533 | 0.7346 | 0.7200 | 0.6983 | - | -
Image–Text | ITIN | 0.7519 | 0.7497 | 0.7352 | 0.7349 | - | -
Image–Text | CTMWA | 0.7591 | 0.7574 | 0.7402 | 0.7384 | 0.6857 | 0.6860
Image–Text | Blip2-flan-t5-xl | 0.7321 | 0.7310 | 0.7213 | 0.7186 | 0.6544 | 0.6528
Image–Text | Instructblip-vicuna-13B | 0.7386 | 0.7406 | 0.7277 | 0.7243 | 0.6612 | 0.6604
Image–Text | LLaVA-v1.6-34B | 0.7483 | 0.7476 | 0.7308 | 0.7289 | 0.6693 | 0.6686
Image–Text | Qwen2.5-VL-72B | 0.7596 | 0.7581 | 0.7413 | 0.7389 | 0.6843 | 0.6840
Image–Text | DINMCL (our) | 0.7716 | 0.7653 | 0.7561 | 0.7532 | 0.6973 | 0.6969
Table 4. The modules in the model were ablated on the MVSA-Single, MVSA-Multiple, and TumEmo datasets.
Model | MVSA-Single Acc | MVSA-Single F1 | MVSA-Multiple Acc | MVSA-Multiple F1 | TumEmo Acc | TumEmo F1
w/o Captioner | 0.7701 | 0.7612 | 0.7533 | 0.7476 | 0.6912 | 0.6921
w/o ViT | 0.7681 | 0.7587 | 0.7486 | 0.7413 | 0.6859 | 0.6840
w/o Faster-RCNN | 0.7677 | 0.7538 | 0.7457 | 0.7388 | 0.6813 | 0.6809
w/o GCL | 0.7588 | 0.7576 | 0.7453 | 0.7445 | 0.6758 | 0.6736
w/o LCL | 0.7610 | 0.7583 | 0.7471 | 0.7456 | 0.6783 | 0.6767
w/o GCL Key Prompt | 0.7686 | 0.7588 | 0.7549 | 0.7468 | 0.6812 | 0.6803
w/o LCL Crossing Prompter | 0.7688 | 0.7563 | 0.7492 | 0.7484 | 0.6801 | 0.6796
w/o CLIP | 0.7667 | 0.7576 | 0.7512 | 0.7428 | 0.6906 | 0.6911
w/o Fusion & Prediction | 0.7611 | 0.7522 | 0.7488 | 0.7411 | 0.6856 | 0.6883
DINMCL (our) | 0.7716 | 0.7653 | 0.7561 | 0.7532 | 0.6973 | 0.6969
Table 5. Computational efficiency comparison of models.
Models | Params (M) | FLOPs (G) | Inference Time (ms) | Memory (GB)
CNN | 1.5 | 0.4 | 5.2 | 1.1
BiLSTM | 3.2 | 0.1 | 3.8 | 0.9
BERT | 110.0 | 22.0 | 32.5 | 3.5
TGNN | 4.8 | 1.2 | 12.1 | 2.2
ResNet50 | 25.6 | 4.1 | 8.9 | 2.5
OSDA | 15.7 | 3.5 | 15.3 | 2.8
SGN | 5.5 | 1.5 | 13.5 | 2.3
OGN | 6.1 | 1.8 | 14.2 | 2.4
DuIG | 12.3 | 2.9 | 14.8 | 2.6
HSAN | 8.9 | 2.1 | 16.7 | 2.7
MultiSentiNet | 18.5 | 4.3 | 18.5 | 3.1
CoMN | 22.4 | 5.7 | 20.2 | 3.4
MGNNS | 9.7 | 3.2 | 18.7 | 3.0
MVAN | 10.5 | 3.6 | 19.1 | 3.2
CLMLF | 14.2 | 4.8 | 21.3 | 3.5
ITIN | 28.3 | 6.9 | 23.6 | 3.9
CTMWA | 122.1 | 18.5 | 25.3 | 5.1
BLIP2-FLAN-T5-XL | 3500.0 | 280.0 | 420.0 | 28.0
InstructBLIP-Vicuna-13B | 13,000.0 | 850.0 | 680.0 | 52.0
LLaVA-1.6-34B | 34,000.0 | 1150.0 | 950.0 | 89.0
Qwen2.5-VL-72B | 72,000.0 | 1520.0 | 2150.0 | 145.0
DINMCL (Ours) | 158.7 | 24.3 | 35.6 | 6.8
Table 6. Bidirectional cross-dataset transfer evaluation.
Model | Single → Multiple Acc | Single → Multiple F1 | Multiple → Single Acc | Multiple → Single F1
HSAN | 0.6532 | 0.6455 | 0.6568 | 0.6456
MultiSentiNet | 0.6612 | 0.6573 | 0.6525 | 0.6519
CoMN | 0.6678 | 0.6632 | 0.6568 | 0.6533
MGNNS | 0.7053 | 0.7006 | 0.6963 | 0.6947
MVAN | 0.7066 | 0.7021 | 0.6989 | 0.6997
CLMLF | 0.7235 | 0.7107 | 0.6879 | 0.6726
ITIN | 0.7266 | 0.7215 | 0.7049 | 0.6987
CTMWA | 0.7423 | 0.7388 | 0.7263 | 0.7187
DINMCL (our) | 0.7705 | 0.7638 | 0.7557 | 0.7528
Table 7. Balanced accuracy and macro-F1 scores of all models on the three datasets.
Model | MVSA-Single Balanced Acc | MVSA-Single Macro-F1 | MVSA-Multiple Balanced Acc | MVSA-Multiple Macro-F1 | TumEmo Balanced Acc | TumEmo Macro-F1
HSAN | 0.6876 | 0.6538 | 0.6773 | 0.6682 | 0.6288 | 0.5147
MultiSentiNet | 0.6842 | 0.6773 | 0.6835 | 0.6679 | 0.6283 | 0.5779
CoMN | 0.6743 | 0.6689 | 0.6786 | 0.6735 | 0.6275 | 0.5659
MGNNS | 0.6935 | 0.6872 | 0.6780 | 0.6679 | 0.6357 | 0.6342
MVAN | 0.6897 | 0.6826 | 0.6771 | 0.6753 | 0.6375 | 0.6249
CLMLF | 0.7356 | 0.7188 | 0.7163 | 0.6830 | - | -
ITIN | 0.7388 | 0.7378 | 0.7156 | 0.7189 | - | -
CTMWA | 0.7371 | 0.7389 | 0.7347 | 0.7286 | 0.6673 | 0.6649
Blip2-flan-t5-xl | 0.7156 | 0.7143 | 0.7049 | 0.7086 | 0.6375 | 0.6353
Instructblip-vicuna-13B | 0.7186 | 0.7173 | 0.7082 | 0.7010 | 0.6383 | 0.6377
LLaVA-v1.6-34B | 0.7204 | 0.7213 | 0.7086 | 0.7097 | 0.6473 | 0.6347
Qwen2.5-VL-72B | 0.7277 | 0.7276 | 0.7120 | 0.7167 | 0.6486 | 0.6471
DINMCL (our) | 0.7711 | 0.7642 | 0.7537 | 0.7518 | 0.6947 | 0.6915
Table 8. Abbreviations of models, methods, and technical terms used in this paper.
Abbreviation | Full Name
DINMCL | Dual-path Interaction Network with Multi-level Consistency Learning
GCL | Global Congruity Learning
LCL | Local Crossing-Congruity Learning
BERT | Bidirectional Encoder Representations from Transformers
BLIP | Bootstrapping Language-Image Pre-training
CLIP | Contrastive Language-Image Pre-training
CLMLF | Contrastive Learning and Multi-Layer Fusion
CTMWA | Crossmodal Translation-based Meta Weight Adaption
FFN | Feed-Forward Network
GAT | Graph Attention Network
ITIN | Image–Text Interaction Network
LLaVA | Large Language and Vision Assistant
MGNNS | Multi-channel Graph Neural Network
MVAN | Multi-view Attention Network
ReLU | Rectified Linear Unit
RoBERTa | A Robustly Optimized BERT Pretraining Approach
ViT | Vision Transformer
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
