Article

Image Captioning Using Enhanced Cross-Modal Attention with Multi-Scale Aggregation for Social Hotspot and Public Opinion Monitoring

1 Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE, Minzu University of China, Beijing 100086, China
2 Key Laboratory of Trustworthy Data Intelligence and Security Governance, Minzu University of China, Lingshui 572423, China
* Author to whom correspondence should be addressed.
Inventions 2026, 11(1), 13; https://doi.org/10.3390/inventions11010013
Submission received: 5 January 2026 / Revised: 26 January 2026 / Accepted: 29 January 2026 / Published: 2 February 2026

Abstract

Large volumes of images shared on social media have made image captioning an important tool for social hotspot identification and public opinion monitoring, where accurate visual–language alignment is essential for reliable analysis. However, existing image captioning models based on BLIP-2 (Bootstrapped Language–Image Pre-training) often struggle with complex, context-rich, and socially meaningful images in real-world social media scenarios, mainly due to insufficient cross-modal interaction, redundant visual token representations, and an inadequate ability to capture multi-scale semantic cues. As a result, the generated captions tend to be incomplete or less informative. To address these limitations, this paper proposes ECMA (Enhanced Cross-Modal Attention), a lightweight module integrated into the Querying Transformer (Q-Former) of BLIP-2. ECMA enhances cross-modal interaction through bidirectional attention between visual features and query tokens, enabling more effective information exchange, while a multi-scale visual aggregation strategy is introduced to model semantic representations at different levels of abstraction. In addition, a semantic residual gating mechanism is designed to suppress redundant information while preserving task-relevant features. ECMA can be seamlessly incorporated into BLIP-2 without modifying the original architecture or fine-tuning the vision encoder or the large language model, and is fully compatible with OPT (Open Pre-trained Transformer)-based variants. Experimental results on the COCO (Common Objects in Context) benchmark demonstrate consistent performance improvements, where ECMA improves the CIDEr (Consensus-based Image Description Evaluation) score from 144.6 to 146.8 and the BLEU-4 score from 42.5 to 43.9 on the OPT-6.7B model, corresponding to relative gains of 1.52% and 3.29%, respectively, while also achieving competitive METEOR (Metric for Evaluation of Translation with Explicit Ordering) scores. Further evaluations on social media datasets show that ECMA generates more coherent, context-aware, and socially informative captions, particularly for images involving complex interactions and socially meaningful scenes.

1. Introduction

With the development of deep learning, image description generation has become an important direction in the intersection of computer vision and natural language processing. This task requires the model to understand the semantic content of the input image and generate a structured, coherent, and semantically accurate natural language description [1]. Image captioning not only plays a critical role in practical applications such as image retrieval, human–computer interaction, and assistive technologies for the visually impaired, but also serves as a fundamental building block for more complex multimodal understanding tasks, including visual question answering, visual reasoning, and vision–language alignment [2].
Alongside the continuous advancement of large visual models (such as ViT [3], EVA-CLIP [4], CLIP-L [5]) and large language models (such as OPT [6], LLaMA [7], Flan-T5 [8]), image description generation has gradually entered an era of “knowledge enhancement” and “reasoning enhancement”. Nevertheless, multimodal information fusion remains the core bottleneck limiting performance gains. In particular, current models often underperform in several key description tasks: fine-grained object attribute depiction, modeling of relationships between multiple objects, scene semantic inference, and the construction of abstract semantic structures. This lack of capability stems essentially from the inadequate transfer and fusion of visual features during the cross-modal interaction stage. Against this backdrop, large-scale pre-trained vision–language models (LVLMs) have emerged as a key paradigm for advancing multimodal understanding, among which Bootstrapping Language-Image Pre-training 2 (BLIP-2) [9] has rapidly established itself as one of the foundational models widely cited in both academia and industry, thanks to its modular design, powerful cross-modal alignment capability, and exceptional data efficiency. The core innovation of BLIP-2 lies in the introduction of a mediator module—the Querying Transformer (Q-Former)—which builds an effective and compact semantic bridge between the visual encoder [10] and the large language model (LLM) through learnable Query Tokens. This design effectively reduces the computational cost of feeding visual features into LLMs, enabling BLIP-2 to efficiently reuse existing large language models.
Despite its success, BLIP-2 still exhibits several limitations when applied to image captioning tasks. First, the interaction between visual features and query tokens is relatively limited, which restricts the model’s ability to fully exploit rich visual semantics. Second, aggressive compression of visual information may lead to the loss of fine-grained and context-sensitive cues that are critical for accurate caption generation. Third, the lack of refined semantic modeling during the cross-modal fusion stage can result in suboptimal performance on metrics that emphasize semantic consistency and structural fidelity, such as SPICE and CIDEr [11].
To explicitly address these research gaps, this paper focuses on the following objectives:
(1)
enhancing cross-modal interaction between visual features and query tokens to facilitate more effective information exchange;
(2)
preserving and aggregating visual semantics at multiple levels of abstraction to mitigate information loss during feature compression;
(3)
improving the refinement of semantic representations during caption generation to enhance semantic coherence and structural consistency.
Based on these objectives, we propose Enhanced Cross-Modal Attention (ECMA), a lightweight module integrated into the Q-Former of BLIP-2. ECMA redesigns the cross-modal attention mechanism to strengthen vision–language fusion through bidirectional semantic alignment, multi-head visual-guided attention, and hierarchical feature aggregation. Importantly, ECMA can be seamlessly incorporated into the original BLIP-2 framework without requiring additional large-scale pre-training or fine-tuning of the vision encoder and large language model, thereby improving captioning performance while maintaining nearly unchanged computational cost.
The remainder of this paper is organized as follows: Section 2 reviews relevant work on image description and visual language pre-training, and details the proposed ECMA module and its integration in BLIP-2; Section 3 reports the experimental results and provides a comprehensive analysis of benchmark datasets and social media datasets; finally, Section 4 summarizes the entire paper and discusses potential directions for future research.

1.1. Motivation for Social-Oriented Image Captioning

Although vision–language models like BLIP-2 have demonstrated strong performance on standard image captioning benchmarks, their utility in real-world social media scenarios remains constrained. Social platform images are typically defined by intricate scene layouts, implicit semantic cues, and tight contextual links to ongoing social events and public discourse. Generating precise, informative captions for such content demands more than basic object recognition; it requires fine-grained modeling of object interactions, dynamic action sequences, and contextually relevant social signals [12]. While BLIP-2’s Q-Former module establishes an efficient cross-modal communication pipeline, it primarily focuses on lightweight alignment rather than deep semantic interaction, a limitation that prevents it from fully capturing the complex semantic relationships inherent in visual data. In practical deployments, this deficiency in cross-modal interaction, combined with over-compression of visual information, often results in captions that are semantically incomplete or ambiguous. Such flaws directly compromise the reliability of downstream tasks such as social hotspot detection and public opinion analysis.
To tackle these issues, this study focuses on refining BLIP-2’s native cross-modal attention mechanism, rather than introducing overly complex architectural overhauls. The goal is to strengthen visual-linguistic semantic alignment while retaining the model’s computational efficiency. To this end, we propose an Enhanced Cross-Modal Attention (ECMA) module. This module is engineered to capture richer local visual details, deepen the interactive depth between visual and linguistic representations, and enable Query Tokens to move beyond superficial visual feature extraction toward a genuine understanding of the logical relationships among image elements. By doing so, the optimized framework elevates the quality, accuracy, and semantic consistency of generated captions, providing a more robust basis for image captioning tasks in socially oriented, context-dependent scenarios. The full workflow of image description generation in this research is depicted in Figure 1.
Figure 1a shows the original input image of the target scene, serving as the visual basis for subsequent cross-modal processing. Figure 1b shows the heatmap visualization of the image, highlighting the high attention weight regions assigned by the ECMA module during visual feature extraction. Figure 1c depicts the step-by-step process of image description generation, demonstrating how the enhanced cross-modal attention mechanism fuses visual features with linguistic context to generate coherent text output.

1.2. Research Objectives and Innovations

The goal of this study is to improve the cross-modal semantic fusion capability of BLIP-2 image description generation without significantly increasing computational cost. To this end, this paper proposes the following innovations:
  • This paper proposes an enhanced cross-modal attention mechanism (ECMA). ECMA reconstructs the fusion of visual and Query Tokens through bidirectional interaction, multi-scale aggregation, and cross-head guided attention. Compared to the native Q-Former, ECMA is able to capture deeper semantic relationships;
  • ECMA’s design emphasizes seamless compatibility with the existing BLIP-2 system. Its lightweight structure enhances cross-modal coupling capabilities without disrupting the pre-training alignment space. Since there is no need to adjust the parameters of ViT or large language models, ECMA can be directly inserted into the interactive phase of Q-Former, thereby maintaining the stability and controllability of training. This modular extension approach also provides a portable design paradigm for other visual language tasks such as VQA and cross-modal retrieval;
  • Improved multiple indicators. Experimental results show that CIDEr, BLEU, METEOR, and ROUGE-L have all been improved. In particular, the improvement of CIDEr indicates that ECMA has indeed improved the semantic structure modeling capability.

2. Materials and Methods

2.1. Related Work

Image description generation has developed rapidly over the past decade, undergoing significant technological evolution from early encoder–decoder models to today’s generative systems built on large-scale vision–language pre-training (VLP) models [13]. In this section, we systematically review three key research directions closely related to our work: visual encoders [14], cross-modal pre-trained models [15], and cross-modal attention mechanisms [16]. These studies form the theoretical foundation of the proposed Enhanced Cross-Modal Attention (ECMA) module.

2.1.1. Visual Encoder

Modern image description models commonly use Vision Transformer (ViT) [17] as the visual encoder. It obtains global visual semantics based on Patch Tokenization and a multi-layer self-attention structure, and is the mainstream choice for cross-modal alignment tasks.
Among the many ViT variants, CLIP’s ViT-L/14 [18] and EVA-CLIP’s ViT-g/14 [19] are the two most widely used cross-modal visual encoders. CLIP achieves strong semantic alignment through large-scale image–text contrastive learning, while EVA-CLIP’s ViT-g/14 further expands the model’s depth and width, making it more advantageous for visual feature representation. The high-performance version of BLIP-2 uses ViT-g/14 as the visual encoder and keeps it completely frozen during training. Therefore, improvements to cross-modal interaction must be made within the Q-Former rather than by fine-tuning the visual encoder. The ECMA proposed in this paper features a pluggable design, enabling improvements to Query–Vision interaction quality without modifying the visual encoder.

2.1.2. Cross-Modal Pre-Trained Models

Cross-modal pre-trained models aim to learn robust alignment between visual and linguistic representations, providing a powerful foundation for image description generation and other vision–language tasks. This section reviews three representative models: CLIP, BLIP [20], and BLIP-2, which progressively improve cross-modal learning efficiency and generation capability.
CLIP employs a dual-tower architecture and focuses on large-scale image–text contrastive learning, enabling strong global semantic alignment between visual and textual embeddings. By training on hundreds of millions of image–text pairs, CLIP significantly enhances the discriminative ability of visual representations. Its vision encoders, such as ViT-L/14, have become widely adopted backbones for subsequent multimodal models and are also used as the visual representation basis in both BLIP and BLIP-2. However, CLIP itself is primarily designed for representation alignment and does not directly support fine-grained cross-modal generation.
BLIP introduces a unified pre-training framework that combines image–text contrastive learning, image–text matching, and language modeling, allowing a single model to support both retrieval and generation tasks. This unified design advances multimodal modeling by enabling flexible task adaptation. Nevertheless, BLIP relies on end-to-end training of large multimodal components, which limits its scalability and efficiency when interfacing with large language models.
BLIP-2 addresses this limitation by proposing the lightweight Q-Former, which serves as an efficient bridge between frozen visual encoders (e.g., CLIP or EVA-CLIP) and large language models. Through staged alignment and training only the Q-Former, BLIP-2 achieves strong performance while substantially reducing training cost. Despite these advantages, the cross-modal interaction in BLIP-2 remains largely unidirectional, and the utilization of visual features is limited, which constrains its ability to capture fine-grained semantic relationships.
To overcome these limitations, the proposed ECMA is designed to enhance cross-modal interaction and semantic extraction by improving information flow between visual features and query tokens, enabling more effective and fine-grained vision–language alignment for image captioning.

2.1.3. Limitations of Cross-Modal Attention Mechanisms

Cross-modal attention is one of the core technologies in multimodal research.
Standard cross-modal attention is unidirectional: Q typically comes from one modality, such as text, while K and V come from another modality, such as an image [21]. This form of attention only allows the Query Tokens to read information from visual features; the information flow is one-way, cannot exploit local relationships among visual features, and offers limited interaction between Query Tokens, making it difficult to deepen semantic understanding. This is the key limitation of the BLIP-2 Q-Former.
Bidirectional attention enables more comprehensive inter-modal interaction and is helpful for semantic modeling of complex images. Some visual question answering models, such as ViLBERT, have introduced the concept of bidirectional alignment [22].
Multi-scale attention allows the model to establish relationships at different semantic levels, which is especially valuable for complex scenes, small object details, actions [23], and interactive relationships. This mechanism can capture multi-granular visual information from low-level texture to high-level semantics and performs well in complex scene understanding. However, existing models such as BLIP-2 have not fully exploited it: BLIP-2’s Q-Former ignores multi-scale visual semantics.
In summary, existing studies exhibit the following characteristics and limitations:
  • ViT-based visual encoders provide strong visual representations but are typically frozen in large-scale vision–language models, limiting the potential for performance gains through encoder fine-tuning.
  • Cross-modal pre-trained models such as BLIP-2 achieve high efficiency via lightweight alignment modules, yet suffer from limited cross-modal interaction and unidirectional attention.
  • Current cross-modal attention mechanisms often overlook multi-scale semantic modeling, which is critical for accurately describing complex and socially rich scenes.
These observations motivate the proposed ECMA module, which aims to enhance cross-modal interaction, exploit multi-scale visual semantics, and improve semantic coherence in image captioning without increasing training cost.
The following section introduces the Enhanced Cross-Modal Attention (ECMA) module proposed in this paper and explains how it enhances the semantic alignment between images and text based on the BLIP-2 framework.

2.2. Overview of the BLIP-2 Architecture

BLIP-2 (Bootstrapping Language–Image Pre-training) proposes an architecture of “frozen large model, lightweight bridging”. Its core idea is to use a powerful visual encoder and a pre-trained large language model (LLM) and Q-Former as an intermediate adapter to achieve semantic mapping from image to language.
The core components of the model include a visual encoder, a Q-Former, and a large language model (LLM): the visual encoder adopts a ViT-L/14 or ViT-g/14 architecture; the Q-Former contains several learnable query tokens, and its overall structure is a two-stream Transformer (visual stream and text stream); the LLM parameters are kept frozen, and the output of the Q-Former is used as a prefix prompt to guide the generation of natural language descriptions. The model inference process is as follows: the image is encoded by the visual encoder and fed into the Q-Former, where the Query Tokens attend over the visual features to achieve cross-modal interaction; finally, the LLM completes the natural language generation. Cross-modal information interaction mainly occurs within the Q-Former module.
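To make this inference pipeline concrete, the following minimal sketch runs captioning with a frozen BLIP-2 checkpoint through the Hugging Face transformers implementation; the checkpoint name, device, and generation settings shown here are illustrative choices rather than the exact configuration used in this paper.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load a frozen BLIP-2 checkpoint (vision encoder + Q-Former + OPT decoder).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda").eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)

# The Q-Former output acts as a soft visual prefix for the frozen LLM,
# which then generates the caption autoregressively.
with torch.no_grad():
    generated_ids = model.generate(**inputs, num_beams=5, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```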

2.3. Limitations Analysis of BLIP-2 Cross-Modal Attention

2.3.1. One-Way Attention

In Q-Former, the Query Token only reads information from the visual features via Formula (1). The visual features are not updated, nor do they receive feedback from the Query Token. This design allows information flow from visual features to query tokens, while feedback from query tokens to visual features is not supported. Visual features are frozen after extraction and cannot be dynamically adjusted according to the semantic emphasis of the current description. When describing complex scenes, models struggle to actively focus on the visual regions most relevant to the text’s semantics.
$Q' = \mathrm{LayerNorm}\left(Q + \mathrm{Attention}(Q, K_v, V_v)\right)$    (1)
where $Q \in \mathbb{R}^{N_q \times d}$ denotes the learnable query token embeddings, and $Q'$ represents the updated query features after cross-attention. $K_v \in \mathbb{R}^{N_v \times d}$ and $V_v \in \mathbb{R}^{N_v \times d}$ are the key and value matrices projected from the visual feature tokens extracted by the vision encoder, with $N_q$ and $N_v$ indicating the numbers of query tokens and visual tokens, respectively. The function $\mathrm{Attention}(\cdot)$ corresponds to the standard multi-head cross-attention operation, and $\mathrm{LayerNorm}(\cdot)$ denotes layer normalization. A residual connection is applied to stabilize training.
Query tokens have limited guiding capabilities; visual features cannot respond to the semantic needs of query tokens. Query tokens can only passively retrieve information from fixed visual representations, making it difficult to achieve accurate visual-language alignment.
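The one-way update of Equation (1) can be summarized by the following minimal PyTorch sketch, which assumes a single cross-attention layer with a shared hidden dimension; module and tensor names are illustrative.

```python
import torch
import torch.nn as nn

class UnidirectionalCrossAttention(nn.Module):
    """Equation (1): query tokens read from frozen visual features;
    the visual features themselves are never updated."""
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # queries: (B, N_q, d) learnable query tokens
        # visual:  (B, N_v, d) visual tokens serving as keys and values
        attended, _ = self.attn(query=queries, key=visual, value=visual)
        return self.norm(queries + attended)  # residual connection + LayerNorm
```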

2.3.2. Excessive Semantic Compression Leads to Information Flow Bottlenecks

BLIP-2 uses a fixed set of 32 Query Tokens to compress visual features, which introduces an information bottleneck with an excessively high compression ratio: approximately 196 visual patches are compressed into 32 Query Tokens, a ratio exceeding 6:1. While high-level semantic information is preserved, fine-grained details (such as small objects, textures, and spatial relationships) may be lost during compression. For simple images, 32 Query Tokens may be redundant; for complex scenes, 32 tokens may be insufficient to encode all important information. A fixed compression strategy means that the information density cannot be dynamically adjusted according to the complexity of the image.

2.4. ECMA: Enhancing Cross-Modal Attention

To address the aforementioned issues, this paper proposes the Enhanced Cross-Modal Attention (ECMA) module, which is designed to strengthen fine-grained semantic interaction between Query Tokens and visual features. The core idea of ECMA is to improve cross-modal information exchange without altering the overall architecture of BLIP-2 or fine-tuning the frozen vision encoder and large language model (LLM), thereby preserving training efficiency and model stability. ECMA is seamlessly integrated into the Q-Former, serving as a lightweight yet effective enhancement to the original cross-modal attention mechanism.
Specifically, ECMA introduces three complementary mechanisms: bidirectional attention, which enables mutual information flow between visual features and query tokens; multi-scale visual aggregation, which enriches visual representations by incorporating hierarchical semantic cues; and semantic gating, which adaptively suppresses irrelevant or noisy information while preserving task-critical signals. Through these mechanisms, ECMA enhances the expressiveness and robustness of cross-modal representations, facilitating more accurate and context-aware image caption generation.
Figure 2 illustrates the overall architecture of the ECMA enhancement model, highlighting how ECMA can be integrated into BLIP-2 without introducing additional complexity. Figure 3 shows the internal structure of the proposed E-Former, where each component employs a modular and plug-and-play design. Each module will be described in detail in subsequent sections.
Figure 2 illustrates the end-to-end image captioning workflow: a frozen image encoder first extracts features from the input (e.g., a dog wearing a hat), then the E-Former (integrated with ECMA) processes these features alongside learned queries; the output is mapped via a fully connected layer to the LLM decoder, which generates context-aware captions (e.g., “a close up of a dog wearing a hat”), demonstrating ECMA’s role in bridging visual features and language generation.
Figure 3 presents two components: the end-to-end workflow (top): a frozen image encoder processes the input (a dog in a hat), with the E-Former (enhanced by ECMA) bridging visual features to a frozen LLM for caption generation; the E-Former block (bottom): stacked N-times, it integrates self-attention, cross-attention, and ECMA’s Enhanced Cross-Modal Attention, plus attention masking, to enable multimodal alignment—supporting the generation of context-matched captions (e.g., “a close up of a dog wearing a hat”).

2.4.1. Bidirectional Cross-Modal Attention

The bidirectional cross-modal attention introduced in this paper includes:
  • Visual-to-Query (V2Q): the Query Tokens extract information from the visual features, the same as in the original BLIP-2, as shown in Equation (2):
    $Q_{v2q} = \mathrm{Attention}(Q, K_v, V_v)$    (2)
    where $Q$ denotes the learnable query token embeddings, and $K_v$ and $V_v$ are the key and value matrices projected from the visual feature tokens.
  • Query-to-Visual (Q2V): the visual features are adjusted according to the semantic requirements of the Query Tokens, as shown in Equation (3):
    $F_{q2v} = \mathrm{Attention}(F, K_q, V_q)$    (3)
    where $F$ represents the visual feature tokens extracted by the vision encoder, and $K_q$ and $V_q$ are the key and value matrices obtained by linear projections of the query tokens $Q$.
  • Enhanced visual features: the original visual features are added to the query-guided update to obtain the enhanced features, as shown in Equation (4) (a code sketch combining Equations (2)–(4) follows this list):
    $F_{enhanced} = F + \alpha \cdot F_{q2v}$    (4)
    where $\alpha$ is a learnable scaling factor, initialized to 0.1, which enables the model to smoothly transition from the original unidirectional attention to bidirectional attention.
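The following minimal PyTorch sketch combines Equations (2)–(4) into a single module; dimensions, head counts, and projection details are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class BidirectionalCrossModalAttention(nn.Module):
    """Equations (2)-(4): V2Q attention, Q2V attention, and the
    alpha-scaled residual enhancement of the visual features."""
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.v2q = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.q2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.alpha = nn.Parameter(torch.tensor(0.1))  # learnable scale, initialized to 0.1

    def forward(self, queries: torch.Tensor, visual: torch.Tensor):
        # Equation (2): query tokens read from the visual features.
        q_v2q, _ = self.v2q(query=queries, key=visual, value=visual)
        # Equation (3): visual features attend back to the query tokens.
        f_q2v, _ = self.q2v(query=visual, key=queries, value=queries)
        # Equation (4): residual enhancement of the visual features.
        visual_enhanced = visual + self.alpha * f_q2v
        return q_v2q, visual_enhanced
```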

2.4.2. Multi-Scale Visual Aggregation

Vision Transformer (ViT) has the ability to output features across multiple layers, while the BLIP-2 model only uses the output of its last layer during feature extraction. To fully explore the semantic information contained in different layers of ViT, this paper proposes a multi-scale feature extraction strategy, specifically extracting feature representations from layers {4, 8, 12, 16, 20, 24} of ViT. These levels correspond to a progressive semantic abstraction hierarchy from low to high: the shallow layers (layers 4 and 8) mainly encode basic visual features such as edges and textures, the middle layers (layers 12 and 16) focus on the representation of object parts and local structural information, and the deep layers (layers 20 and 24) tend to capture global semantic content and scene category information. For each target layer l, this paper calculates the attention score between the Query Token and the features of that layer, and uses it as the weight coefficient in the feature fusion process. The specific calculation method is shown in Equation (5). To make the multi-scale feature extraction strategy clearer and easier to understand, the process is shown in Figure 4.
$w_l = \dfrac{\exp\left(\mathrm{sim}(\bar{Q}, \bar{F}_l)/\tau\right)}{\sum_{l' \in L}\exp\left(\mathrm{sim}(\bar{Q}, \bar{F}_{l'})/\tau\right)}$    (5)
where $\bar{Q}$ denotes the aggregated query representation obtained by pooling the query tokens, and $\bar{F}_l$ represents the pooled visual feature of the l-th ViT layer. The index $l' \in L$ is a dummy variable that iterates over all selected layers and is used for softmax normalization. The function $\mathrm{sim}(\cdot)$ denotes cosine similarity, and $\tau$ is a temperature parameter controlling the sharpness of the attention distribution. The weights $w_l$ are normalized across all selected layers using a softmax operation, reflecting the relative importance of different semantic levels for the current query.
Multi-scale features are fused and weighted into a unified visual representation, as shown in Equation (6):
$F_{multi} = \sum_{l \in L} w_l \cdot \mathrm{Proj}_l(F_l)$    (6)
where $F_l$ denotes the visual features extracted from the l-th layer of the Vision Transformer, and $w_l$ is the corresponding layer-wise weight computed in Equation (5). $\mathrm{Proj}_l(\cdot)$ represents a layer-specific linear projection that maps features from different layers into a unified feature dimension. The resulting $F_{multi}$ is the fused multi-scale visual representation.
In addition to simple weighted fusion, this paper also introduces a cross-level attention mechanism, which enables features from different levels to enhance each other, as shown in Equation (7):
$F'_l = F_l + \sum_{l' \in L,\, l' \neq l} \mathrm{Attention}(F_l, K_{l'}, V_{l'})$    (7)
where $F'_l$ denotes the updated visual features at level l after cross-level interaction. For each $l' \in L$ with $l' \neq l$, $K_{l'}$ and $V_{l'}$ are the key and value matrices obtained by projecting the features $F_{l'}$ from the corresponding layer. The cross-level attention operation enables visual features at different semantic levels to interact and mutually enhance each other.
This multi-scale aggregation mechanism effectively improves the model’s sensitivity to fine-grained visual cues and significantly enhances its ability to represent small objects, local structures, and complex semantic relationships, thereby bringing stable performance gains in cross-modal generation tasks [24].
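A minimal sketch of the weighted multi-scale fusion in Equations (5) and (6) is given below; the layer count, feature dimensions, and temperature value are illustrative assumptions, and the cross-level attention of Equation (7) is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleVisualAggregation(nn.Module):
    """Equations (5)-(6): softmax layer weights from query-feature
    similarity, followed by weighted fusion of per-layer projections."""
    def __init__(self, vit_dim: int = 1408, out_dim: int = 768,
                 num_layers: int = 6, tau: float = 0.07):
        super().__init__()
        self.tau = tau
        self.proj = nn.ModuleList([nn.Linear(vit_dim, out_dim) for _ in range(num_layers)])

    def forward(self, queries: torch.Tensor, layer_feats: list) -> torch.Tensor:
        # queries:     (B, N_q, out_dim) current query tokens
        # layer_feats: list of (B, N_v, vit_dim), one entry per selected ViT layer
        q_bar = queries.mean(dim=1)                                     # pooled query, (B, d)
        projected = [p(f) for p, f in zip(self.proj, layer_feats)]      # per-layer projection
        f_bar = torch.stack([f.mean(dim=1) for f in projected], dim=1)  # pooled layers, (B, L, d)
        # Equation (5): softmax over temperature-scaled cosine similarities.
        w = (F.cosine_similarity(q_bar.unsqueeze(1), f_bar, dim=-1) / self.tau).softmax(dim=-1)
        # Equation (6): weighted sum of the projected layer features.
        stacked = torch.stack(projected, dim=1)                         # (B, L, N_v, d)
        return (w[:, :, None, None] * stacked).sum(dim=1)               # (B, N_v, d)
```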

2.4.3. Semantic Residual Gating Fusion

To stabilize training, this paper adds a semantic residual gating mechanism to the query output. The update of the Query Token is controlled by gating as shown in Equations (8)–(10):
$\Delta Q = \mathrm{Attention}(Q, K_v^{enhanced}, V_v^{enhanced})$    (8)
$G = \sigma\left(W_g [Q ; \Delta Q] + b_g\right)$    (9)
$Q_{new} = Q + G \odot \Delta Q$    (10)
where $K_v^{enhanced}$ and $V_v^{enhanced}$ are the key and value matrices projected from the enhanced visual features. $\sigma(\cdot)$ denotes the sigmoid activation function, $\odot$ represents element-wise multiplication, and $W_g$ and $b_g$ are learnable parameters of the gating module. The gate $G$ adaptively controls the contribution of the query update $\Delta Q$, allowing the model to balance between preserving the original query representation and incorporating new semantic information from visual features, thereby stabilizing training.
The gate value G is based not only on the current state Q of the Query Token, but also on textual context information, as shown in Equation (11):
$G = \sigma\left(W_g [Q ; \Delta Q ; C_{text}] + b_g\right)$    (11)
where $C_{text}$ denotes the contextual representation produced by the text encoder. By incorporating textual context, the gating mechanism can adaptively regulate the degree of visual information fusion according to language priors [25].
The gating mechanism keeps the update within a reasonable range, preventing gradient explosion or vanishing. When $G \to 0$, $Q_{new}$ is approximately equal to $Q$, maintaining feature stability; when $G \to 1$, $Q_{new}$ absorbs nearly the entire increment $\Delta Q$, fully integrating the new features, as shown in Equation (12):
$Q_{new} = 0 \cdot (Q + \Delta Q) + (1 - 0) \cdot Q = Q, \quad G \to 0$
$Q_{new} = 1 \cdot (Q + \Delta Q) + (1 - 1) \cdot Q = Q + \Delta Q, \quad G \to 1$    (12)
Different gating strategies are applied to different semantic levels, as shown in Equation (13):
$G_l = \sigma\left(W_{g,l} [Q ; \Delta Q_l] + b_{g,l}\right)$    (13)
where $\Delta Q_l$ denotes the attention-based query update obtained from the l-th semantic level. The layer-specific gating parameters $W_{g,l}$ and $b_{g,l}$ enable the model to independently control the degree of information fusion at different levels of abstraction.
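The basic gated update of Equations (8)–(10) can be sketched as follows; the text-conditioned gate of Equation (11) and the level-specific gates of Equation (13) extend the same pattern. Dimensions and module names are illustrative.

```python
import torch
import torch.nn as nn

class SemanticResidualGating(nn.Module):
    """Equations (8)-(10): a sigmoid gate computed from [Q ; delta_Q]
    controls how much of the attention update is absorbed."""
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, queries: torch.Tensor, visual_enhanced: torch.Tensor) -> torch.Tensor:
        # Equation (8): attend over the enhanced visual features.
        delta_q, _ = self.attn(query=queries, key=visual_enhanced, value=visual_enhanced)
        # Equation (9): gate from the concatenation of Q and its update.
        g = torch.sigmoid(self.gate(torch.cat([queries, delta_q], dim=-1)))
        # Equation (10): gated residual update of the query tokens.
        return queries + g * delta_q
```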
The network architecture of ECMA is illustrated in Figure 5, providing an overview of the module’s structural composition and the information flow between visual features and query tokens. The figure shows the sequential workflow of ECMA’s three core sub-components: Bidirectional Cross-Modal Attention (enabling mutual visual–query information exchange, with heatmaps showing attention weights), Multi-Scale Visual Aggregation (fusing local and global visual features to capture fine-grained details of input images such as the dog), and Semantic Residual Gating Fusion (filtering noise to retain task-relevant semantics); the processed Q_new is passed to the Text Encoder for caption generation, embodying ECMA’s modular, cross-modal alignment-enhancing design.
The dual “Q” design aims to achieve a synergistic output of “basic fusion features + additional semantic information”: the first “Q” corresponds to the basic fusion query features extracted from ECMA’s multi-stage processing (including bidirectional cross-modal attention, multi-scale aggregation, and semantic residual gated fusion). This processing targets visual encoder features and text encoder features, representing the core features after multimodal interaction; “Q_new” represents the additionally introduced new query (a new text instruction, supplementary semantic information, or input prompts from subsequent modules); the last “Q” is composed of these two components, used to further integrate the basic fusion features and additional semantic information, thereby generating a more accurate and task-appropriate multimodal output (e.g., an image description that better matches the new instruction). Essentially, this constitutes a secondary enhancement design of “pre-existing fusion features + new information”, enabling the output to simultaneously accommodate the results of early multimodal fusion and the supplementary needs of later stages.
Figure 6: Overview of the proposed image captioning framework. The pipeline starts with an input image, which is fed into a frozen visual encoder to extract visual features. These features then undergo a query-vision interaction module to establish alignment between learnable query tokens and visual content. The enhanced feature representation is processed by our core E-Former module, which integrates bidirectional cross-modal attention, multi-scale visual aggregation, and semantic residual gating fusion to model fine-grained vision language interactions. The resulting query-aware visual tokens are finally passed to a frozen large language model, which generates the final image caption in an autoregressive manner.

2.5. Dataset

The dataset used in this study is MS COCO [26], a widely recognized benchmark for image captioning tasks. MS COCO includes a total of 82,783 training images, 40,504 validation images, and 40,775 test images, with each image annotated by five human-generated descriptions. These descriptions provide rich textual information that is essential for training and evaluating image captioning models. The experiment utilizes the COCO Karpathy split, which is commonly used in state-of-the-art models such as BLIP, BLIP-2, and SimVLM, offering a standardized evaluation framework that facilitates comparability with previous works.
Regarding dataset preprocessing, since MS COCO is already a well-established dataset, no additional preprocessing or balancing was necessary. The dataset itself is diverse and well-distributed across categories, ensuring a natural balance. Furthermore, no data augmentation techniques such as image flipping or rotation were applied, as the model is designed to learn from the inherent variability present in the dataset. Images were resized to 224 × 224 pixels to meet the input requirements of BLIP-2. For feature extraction, CLIP ViT-L/14 features were employed, and text was tokenized using BPE segmentation. To preserve the semantic integrity of the descriptions, punctuation was retained, and no lowercase conversion was performed.
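As an illustration of the preprocessing described above, the following sketch resizes an image to 224 × 224 with CLIP-style normalization and applies BPE tokenization to a caption without lowercasing or punctuation removal; the normalization constants and the tokenizer checkpoint (here the OPT tokenizer) are illustrative assumptions.

```python
from PIL import Image
from torchvision import transforms
from transformers import AutoTokenizer

# 224 x 224 resize with CLIP-style normalization (illustrative constants).
image_transform = transforms.Compose([
    transforms.Resize((224, 224), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])
pixel_values = image_transform(Image.open("coco_example.jpg").convert("RGB"))

# BPE tokenization of the caption; casing and punctuation are preserved.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b")
caption_ids = tokenizer("A man riding a wave on top of a surfboard.",
                        return_tensors="pt").input_ids
```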
MS COCO includes a wide range of object categories, with 80 object types such as “person”, “car”, “dog”, “cat”, and “tree”. These categories are distributed across various scenes and contexts, offering a comprehensive representation of real-world environments. This makes the dataset suitable for training models that need to handle a diverse range of visual and textual information.

2.6. Experimental Setup

All experiments are conducted following the BLIP-2 training paradigm. The visual encoder is kept identical to BLIP-2, and the parameters of the large language model are fully frozen, while only the Q-Former and the proposed ECMA module are optimized. Experiments are conducted on two language model backbones, OPT-2.7B and OPT-6.7B. We use the AdamW optimizer [27] with $\beta_1 = 0.9$ and $\beta_2 = 0.98$, which control the exponential decay rates of the first and second moment estimates of the gradients, respectively, providing stable and smooth updates. A weight decay of 0.05 is applied to prevent overfitting by constraining the magnitude of model parameters. The learning rate follows a cosine decay schedule with a peak of $1 \times 10^{-4}$, combined with a linear warm-up of 2000 steps, and a minimum learning rate of $5 \times 10^{-5}$ in the second stage. The learning rate is defined as shown in Equation (14). The training batch size is set to 32 for both OPT-2.7B and OPT-6.7B, while the evaluation batch size is 16 for OPT-2.7B and 8 for OPT-6.7B to accommodate memory limitations. Gradient checkpointing and FP16 precision are enabled for OPT-6.7B to further reduce GPU memory usage. Input images are resized to 224 × 224 and augmented with random resized cropping and horizontal flipping. Beam search with a size of 5 is applied during caption generation, with maximum and minimum output lengths of 30 and 8 tokens, respectively. All experiments are performed on a single NVIDIA RTX 4090 GPU with 24 GB memory, and a fixed random seed of 42 ensures reproducibility.
$\eta_i = \begin{cases} 1.0 \times 10^{-4}, & \text{if } \theta_i \in \Theta_{ECMA} \\ 5.0 \times 10^{-5}, & \text{if } \theta_i \in \Theta_{Q\text{-}Former} \end{cases}$    (14)
where $\Theta_{ECMA}$ represents the parameter set of the ECMA module, and $\Theta_{Q\text{-}Former}$ represents the parameter set of the original Q-Former. This differentiated learning rate setting enables the newly introduced module to converge faster while avoiding destructive updates to existing knowledge structures.
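The differentiated learning rates of Equation (14) can be realized with separate AdamW parameter groups, as in the following sketch; the parameter-name filter used to separate ECMA weights from Q-Former weights is an illustrative assumption.

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    """Equation (14): separate AdamW parameter groups for the newly added
    ECMA parameters and the original Q-Former parameters."""
    ecma_params = [p for n, p in model.named_parameters()
                   if "ecma" in n.lower() and p.requires_grad]
    qformer_params = [p for n, p in model.named_parameters()
                      if "ecma" not in n.lower() and p.requires_grad]
    return torch.optim.AdamW(
        [{"params": ecma_params, "lr": 1.0e-4},      # Theta_ECMA
         {"params": qformer_params, "lr": 5.0e-5}],  # Theta_Q-Former
        betas=(0.9, 0.98), weight_decay=0.05)
```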
This paper adopts a two-stage learning rate adjustment strategy to ensure smooth start and effective convergence of training: a linear warm-up strategy is adopted in the first two epochs of training [28], and the learning rate is linearly increased from the initial value to the target learning rate. This strategy helps avoid gradient instability issues in the early stages of training and improves the robustness of the optimization process. In the t-th training step, the learning rate in the warm-up stage is calculated as shown in Equation (15):
$\eta_t = \eta_{target} \times \dfrac{t}{T_{warmup}}$    (15)
where $\eta_{target}$ is the target learning rate and $T_{warmup}$ is the total number of steps in the warm-up phase.
After the warm-up phase, this paper adopts a cosine annealing scheduling strategy with restart [29], and the learning rate decays according to Formula (16):
$\eta_t = \eta_{min} + \dfrac{1}{2}\left(\eta_{max} - \eta_{min}\right)\left(1 + \cos\left(\dfrac{T_{cur}}{T_{cycle}}\pi\right)\right)$    (16)
where $\eta_{min} = 1.0 \times 10^{-6}$ is the minimum learning rate, $\eta_{max}$ is the maximum learning rate of the current cycle, $T_{cur}$ is the number of training steps completed in the current cycle, and $T_{cycle}$ is the total number of steps in each cycle. This paper adopts automatic mixed precision (AMP) [30] technology to accelerate the training process while maintaining numerical stability. The specific configuration is as follows: FP16 precision is used for computations in the forward pass, FP16 precision is used for gradient computation in the backward pass, and FP32 precision is used for parameter updates. Mixed-precision training increases training speed by approximately 30% while reducing memory usage by approximately 40%.
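Equations (15) and (16) together define the schedule sketched below, which combines linear warm-up with cosine annealing and restarts; the cycle length shown is an illustrative value.

```python
import math

def learning_rate(step: int, target_lr: float = 1.0e-4, min_lr: float = 1.0e-6,
                  warmup_steps: int = 2000, cycle_steps: int = 20000) -> float:
    """Equations (15)-(16): linear warm-up followed by cosine annealing with restarts."""
    if step < warmup_steps:
        # Equation (15): learning rate grows linearly during warm-up.
        return target_lr * step / warmup_steps
    # Equation (16): cosine decay within the current restart cycle.
    t_cur = (step - warmup_steps) % cycle_steps
    return min_lr + 0.5 * (target_lr - min_lr) * (1.0 + math.cos(math.pi * t_cur / cycle_steps))
```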

3. Results and Discussion

To verify the effectiveness of the proposed ECMA in image description generation tasks, this section will systematically evaluate the method on mainstream datasets and conduct quantitative and qualitative comparisons with advanced models such as BLIP. Meanwhile, by analyzing the contribution of each component of ECMA through ablation experiments, we further illustrate the improvement of our proposed method in the direction of image description generation.

3.1. Experimental Results

Table 1 shows the performance comparison results of ECMA with several representative image description models. From the overall results, it can be observed that ECMA achieves leading performance on key evaluation metrics such as BLEU@1, BLEU@4, METEOR, CIDEr, ROUGE_L, and SPICE.
  • BLEU@1 and BLEU@4—Measure n-gram precision between the generated captions and reference captions. BLEU@1 evaluates unigram overlap, while BLEU@4 considers 4-gram sequences, capturing fluency and local coherence.
  • METEOR—Considers unigram matches with stemming and synonymy, balancing precision and recall. It emphasizes semantic similarity and alignment with reference captions.
  • ROUGE_L—Computes the longest common subsequence between generated and reference captions, reflecting sentence-level structure consistency.
  • CIDEr—Evaluates consensus between generated captions and multiple references by weighting n-grams according to their rarity, focusing on informativeness and content coverage.
  • SPICE—Analyzes scene-graph-based semantic propositional content of captions, assessing how well visual relationships and objects are described.
These metrics collectively evaluate fluency, semantic correctness, informativeness, and visual-language alignment of the generated captions. In particular, it improves by approximately 13.5 points over BLIP in CIDEr, indicating that ECMA significantly enhances the model’s ability to capture fine-grained visual semantics and effectively improves the accuracy and richness of the descriptive text. The best results for each evaluation metric are highlighted in bold. To provide a clearer and more interpretable comparison between ECMA and other models, a visual bar chart is shown in Figure 7. In this chart, the X-axis represents the different image captioning models evaluated, and the Y-axis indicates the corresponding values of each performance metric, including BLEU@1, BLEU@4, METEOR, ROUGE_L, CIDEr, and SPICE. This presentation allows readers to easily compare the performance of ECMA against baseline models across multiple metrics.
Table 1. ECMA compared with other models.
Model | BLEU@1 | BLEU@4 | METEOR | ROUGE_L | CIDEr | SPICE
BLIP | 79.1 | 39.9 | 31.0 | 60.0 | 133.5 | 23.8
ORT [31] | 80.5 | 38.6 | 28.7 | 58.4 | 128.3 | 22.6
AoANet [32] | 80.2 | 38.9 | 29.2 | 58.8 | 129.8 | 22.4
DLCT [33] | 81.4 | 39.8 | 29.5 | 59.1 | 133.8 | 23.0
FlanT5 | 82.2 | 40.7 | 30.8 | 60.8 | 138.0 | 24.7
ECMA | 83.5 | 44.0 | 31.9 | 62.2 | 147.0 | 25.2
This multi-metric bar chart compares ECMA against representative image captioning models (BLIP, ORT, AoANet, DLCT, FlanT5) across six standard evaluation metrics (BLEU@1, BLEU@4, METEOR, ROUGE_L, CIDEr, SPICE) on the COCO dataset. ECMA outperforms all baselines in CIDEr (the most critical metric for caption informativeness) and achieves competitive results in other metrics, demonstrating its effectiveness in enhancing visual-language alignment for image description. Dotted lines denote the performance ceiling of each metric across baselines, highlighting ECMA’s consistent gains.
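For reference, the evaluation metrics listed above are typically computed with the pycocoevalcap toolkit, as in the following sketch; the toy caption dictionaries are illustrative, and the METEOR and SPICE scorers additionally require a Java runtime.

```python
# Toy example; the actual evaluation uses the full COCO Karpathy test split.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice

references = {"img1": ["a dog wearing a hat sitting on a couch"]}   # ground-truth captions
candidates = {"img1": ["a close up of a dog wearing a hat"]}        # generated captions

scorers = [(Bleu(4), "BLEU@1-4"), (Meteor(), "METEOR"), (Rouge(), "ROUGE_L"),
           (Cider(), "CIDEr"), (Spice(), "SPICE")]
for scorer, name in scorers:
    score, _ = scorer.compute_score(references, candidates)
    print(name, score)
```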
ECMA was applied to OPT-2.7B and OPT-6.7B, and the ablation results are shown in Table 2 and Table 3, respectively. Table 2 shows the independent and combined effects of each ECMA module on OPT-2.7B. The results show that after introducing bidirectional cross-modal attention, CIDEr and ROUGE_L improve slightly, indicating that bidirectional interaction helps to enhance semantic alignment; after adding multi-scale visual aggregation, BLEU@4 improves significantly, indicating that multi-layer visual features provide richer contextual information; when all three modules are enabled simultaneously, all metrics reach their optimal levels, verifying that the overall design of ECMA yields complementary and synergistic gains.
The results in Table 3 further validate the effectiveness of ECMA on a larger language model. Similar to OPT-2.7B, each component exhibits a gradual gain trend: bidirectional cross-modal attention brings an initial improvement; multi-scale visual aggregation continuously enhances BLEU@4 and CIDEr; and the semantic residual gating module plays a stabilizing and optimizing role on OPT-6.7B, enabling the final performance to reach its optimum when all modules are combined. Overall, the results demonstrate that the ECMA design exhibits robust adaptability and improvement effects across LLMs of different sizes.
The goal of the proposed ECMA module is not simply to improve overall captioning accuracy, but to enhance the model’s capability in selectively interacting with visual features across different semantic levels. Therefore, we analyze the experimental results from the perspective of semantic alignment and fine-grained representation quality. As shown in the ablation results (Table 2 and Table 3), removing the ECMA module leads to a significant performance degradation, especially on SPICE. This suggests that ECMA plays a crucial role in enhancing fine-grained visual-semantic alignment rather than merely increasing model capacity.
Compared with BLIP-2, our method achieves consistent improvements across all metrics, with particularly notable gains on CIDEr. These improvements indicate that the proposed ECMA module enhances the model’s ability to capture fine-grained semantic relations and object-level details, which aligns with the design motivation of enhanced cross-modal interaction.
We further conduct an ablation study by incorporating the SPICE metric. Unlike n-gram or consensus-driven metrics, SPICE evaluates captions by matching scene graph structures, focusing on objects, attributes, and relations at a symbolic level. As shown in the results, the SPICE score does not exhibit a significant increase. We attribute this to the fact that ECMA primarily strengthens cross-modal feature alignment and contextual grounding rather than explicitly optimizing scene graph structure prediction. Consequently, improvements in descriptive richness and semantic relevance may not be directly reflected by SPICE, which is consistent with observations reported in prior image captioning studies. Although the absolute numerical improvements over recent methods are moderate, our approach consistently outperforms BLIP-2 across all metrics, with particularly notable gains on CIDEr, which emphasizes semantic relevance and consensus. This improvement aligns with the goal of enhanced cross-modal interaction, especially in capturing fine-grained object semantics and contextual relationships.
Although both multi-level aggregation and semantic gating are integral components of the proposed ECMA module, their contributions are not strictly symmetric. As observed from the ablation results, the primary performance gains are driven by multi-level aggregation, which enriches the representation by integrating visual features across different semantic scales. In contrast, the semantic gating mechanism contributes more prominently to stabilizing the performance and preserving semantic consistency.
Specifically, semantic gating acts as a selective control mechanism that suppresses noisy or weakly relevant visual responses while emphasizing semantically aligned features during cross-modal interaction. This selective modulation helps prevent semantic drift and mitigates overfitting to dominant visual patterns, thereby ensuring more reliable object-level and relation-level alignment. As a result, although its direct contribution to absolute metric gains may appear moderate, semantic gating plays a crucial role in maintaining robustness and semantic fidelity, which is essential for high-quality caption generation.
In this work, semantic alignment improvement is not defined solely in terms of a single evaluation metric, but rather in terms of the reduction of specific captioning error patterns commonly observed in BLIP-2. In particular, we focus on three representative semantic error types: (1) object omission, where visually salient entities are not mentioned in the generated caption; (2) attribute or relation inconsistency, where object properties or inter-object relations are inaccurately described; and (3) semantic redundancy or confusion, where repeated or weakly relevant semantic content is introduced.
From this perspective, the proposed ECMA module contributes to reducing these errors by selectively regulating cross-modal interactions and improving the coherence between visual tokens and linguistic representations. Qualitative analysis and consistent gains on complementary metrics indicate that ECMA enhances fine-grained semantic reasoning and stabilizes visual–textual alignment, particularly in complex scenes with multiple objects and interactions.

3.2. Discussion

Extensive experiments demonstrate that the proposed ECMA framework consistently improves the performance of BLIP-2 on image caption generation tasks, particularly in capturing fine-grained visual details and semantic relationships. Ablation studies further indicate that the contributions of different ECMA components are not symmetric. Specifically, the multi-scale visual aggregation module serves as the primary driver of performance improvement by enriching visual representations across different semantic levels, while the bidirectional cross-modal attention module facilitates effective visual–textual interaction and semantic alignment. In contrast, the semantic residual gating fusion module plays a more critical role in stabilizing cross-modal information integration and preserving semantic consistency rather than directly boosting absolute performance scores. By selectively filtering and regulating cross-modal features, this module mitigates noisy interactions and enhances robustness, which is especially beneficial for maintaining reliable fine-grained semantic alignment. The integration of all three components results in the best overall performance across key evaluation metrics.
Moreover, the proposed method introduces only a small number of additional parameters, is easy to reproduce, and does not increase the training cost of the large language model, achieving superior performance over existing approaches on CIDEr, ROUGE_L, and other evaluation metrics. Qualitative image captioning results are presented in Figure 8.
Figure 8 displays ECMA-generated captions for six diverse images (covering animals, human activities, and scenes): examples include fine-grained descriptions (e.g., “a dog wearing a hat”) and context-aware details (e.g., “the singapore skyline at sunset”), demonstrating ECMA’s ability to capture visual attributes, actions, and scene contexts across social media-style content.
While the ECMA method proposed in this study performs well in image caption generation tasks, it still has some limitations. First, although ECMA performs well on standard datasets (such as COCO), its performance in some complex scenes is still lacking, especially in images containing multiple objects of different types; the model’s accuracy in fine-grained description and contextual understanding needs further improvement. Second, although the COCO dataset used in this study is widely used in image captioning, it mainly focuses on everyday scenes and lacks more challenging domain-specific data, limiting the model’s generalization ability in specific application scenarios. Furthermore, although this study did not perform image augmentation, data augmentation in certain visual contexts may further improve the model’s robustness and generalization ability. Finally, the ECMA model has relatively high computational complexity. Although its parameter count and training cost remain manageable compared to other large language models (such as the OPT series), balancing model performance and computational cost remains a pressing issue in large-scale practical applications.
For future improvements, further research can be carried out in the following aspects: First, introduce more diverse and challenging domain datasets to enhance the model’s cross-domain adaptability; second, explore more efficient computational optimization methods to reduce the computational cost of ECMA in large-scale applications; third, combine dynamic attention mechanisms to enable the model to handle complex scenes and interactions more flexibly; and fourth, try to combine data augmentation techniques with model design to improve the robustness and accuracy of the model in specific vision tasks.

3.3. Social Hotspot and Public Opinion Monitoring: An Application-Oriented Evaluation

As social media platforms generate increasing volumes of user-generated images, image captioning is expected to play a critical role in social hotspot detection and public opinion monitoring. Beyond basic content understanding, accurate and context-sensitive captions can support the identification of emerging events, trending topics, and shifts in public sentiment, thereby improving the efficiency and reliability of downstream social analysis and crisis response. However, standard benchmark datasets such as COCO provide only limited coverage of real-world social media scenarios. User-generated content is highly diverse, encompassing daily-life snapshots, food photography, human activities, landscapes, events, and object-centric posts, each posing distinct challenges for visual–language alignment. Accurate captioning in these settings requires fine-grained modeling of objects, actions, interactions, and contextual cues that are often absent or underrepresented in laboratory-controlled datasets.
In such complex environments, generating semantically rich and visually consistent descriptions is essential for downstream applications, including content indexing, intelligent retrieval, personalized recommendation, and assistive services. To address the limitations of aggregated benchmark metrics and to better evaluate real-world applicability, we therefore design an application-oriented evaluation framework that targets representative content categories commonly observed in social media platforms, enabling a more comprehensive assessment of the practical effectiveness and cross-scenario generalization ability of ECMA.
Figure 9 shows a user-friendly deployment interface for ECMA-based image captioning: users can upload images (e.g., a COCO dataset sample of a pitcher and vase), and the system generates a precise description (“a glass pitcher and a metal vase on a table”)—illustrating ECMA’s practical usability for real-world image description tasks.
Specifically, we resample the COCO dataset to construct a social media–oriented evaluation subset covering five representative content categories: animals, landscapes, human-centered activities, food, and object-centered scenes. These categories reflect the dominant types of user-generated images on social media and span a wide range of visual complexities and semantic characteristics. While the evaluation task remains standard image captioning, ECMA generates a natural language description for each image.
Figure 10, Figure 11, Figure 12, Figure 13 and Figure 14 present qualitative comparisons across all five categories, demonstrating ECMA’s adaptability to diverse semantic requirements. For animal and food images, ECMA captures fine-grained visual details and contextual cues, producing more informative descriptions than baseline models. In human-centered scenarios, ECMA more accurately models continuous actions and interactions, which are essential for understanding user-generated content. For landscape and object-centered images, ECMA generates captions that are more coherent and context-aware, reflecting a better grasp of scene-level structure and object relevance. Beyond caption quality, these enhanced descriptions have strong potential for downstream social media applications. By capturing detailed visual attributes, human activities, and contextual relationships, ECMA can support social hotspot detection and public opinion monitoring, enabling the identification of emerging events, trends, and sentiment changes. As a result, ECMA provides a practical foundation for transforming large-scale user-generated visual data into actionable insights for social analysis and decision-making.
Figure 10, Figure 11, Figure 12, Figure 13 and Figure 14 present ECMA’s generated descriptions for 12 representative images in each of five content categories: animals, landscapes, food, human activities, and object-centric scenes. For animals, descriptions capture fine-grained visual attributes, behaviors, and contextual settings (e.g., “a cat sitting on top of a sink in a bathroom”, “two giraffes are eating leaves from a tree”). Landscape and urban scenes highlight spatial relationships and scene attributes (e.g., “a view of a body of water from a high point”, “a busy city street filled with traffic”). Food images focus on item details and interaction contexts (e.g., “a pizza with pepperoni and olives on a plate”, “a person holding a half-eaten doughnut with icing”). Human activity images emphasize actions and social interactions (e.g., “a male tennis player in action on the court”, “a bride and groom cutting their wedding cake”). Object-focused images illustrate precise object attributes and spatial placement (e.g., “a blue vase filled with white flowers”, “a motorcycle parked on the side of the road”). Together, these examples demonstrate ECMA’s ability to generate rich, context-aware, and category-specific captions across diverse visual content.
In addition to the MS COCO dataset, we further evaluated the proposed method on the PhotoChat dataset [34,35], which contains 1,743,042 images collected from real-world interpersonal chat conversations and is therefore representative of social communication scenarios. Unlike well-curated benchmark datasets, images in PhotoChat are typically produced during spontaneous conversations and are often taken on an ad hoc basis, so they may suffer from incomplete object framing, motion blur, occlusion, or poor lighting. These characteristics closely match real-world image-sharing behavior in everyday social interactions and provide a more challenging testbed for evaluating the robustness of image captioning models under unconstrained conditions. Evaluation on PhotoChat therefore complements the quantitative results on COCO and helps bridge the gap between standard benchmarks and real-world social use cases, as shown in Figure 15.
For the qualitative evaluation of recent social media scenarios, we used images taken by the authors, as shown in Figure 16. These images were chosen because there is currently no large-scale, up-to-date social media image dataset that is publicly available for free redistribution in research, and collecting images directly from online social media platforms can raise copyright, privacy, and ethical concerns. Using self-captured images ensures full compliance with data usage regulations while reflecting typical characteristics of social media content, such as casual composition, diverse settings, and unrestricted shooting conditions.
Figure 7, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14, Figure 15 and Figure 16 present qualitative comparisons that illustrate the descriptive differences among the compared methods. Beyond visual illustration, we further conduct a qualitative error analysis based on these examples.
We observe that the errors can be broadly categorized into several representative types:
(1) object omission, where salient objects in complex scenes are missing from the generated captions;
(2) attribute mismatch, including incorrect color, quantity, or object states;
(3) imprecise relational description, where spatial or interaction relationships are inaccurately expressed.
Compared with BLIP-2, the proposed method shows a reduced tendency toward object omission and relational errors, particularly in cluttered or unconstrained scenes. Nevertheless, failure cases still occur under severe occlusion or extreme visual ambiguity, indicating remaining challenges for future work.
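To make the first error type more operational, the following minimal heuristic (a sketch under simplifying assumptions, not the analysis pipeline used above) flags candidate object omissions by collecting the noun lemmas that appear in the reference captions but not in the generated caption. Exact lemma matching is assumed, so synonyms such as "bike" versus "bicycle" are not handled.

```python
# Minimal heuristic sketch (simplifying assumption: exact lemma matching only).
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def candidate_omissions(generated: str, references: list[str]) -> set[str]:
    """Return noun lemmas found in the references but missing from the caption."""
    generated_nouns = {t.lemma_.lower() for t in nlp(generated) if t.pos_ == "NOUN"}
    reference_nouns = set()
    for ref in references:
        reference_nouns |= {t.lemma_.lower() for t in nlp(ref) if t.pos_ == "NOUN"}
    return reference_nouns - generated_nouns

# Toy example: the generated caption drops the fire hydrant mentioned in the reference.
print(candidate_omissions(
    "a man riding a bicycle on the street",
    ["a man rides a bicycle past a fire hydrant on a busy street"],
))  # expected to flag 'hydrant' (and possibly 'fire' from the compound noun)
```

Such automatic flags are only a coarse complement to manual inspection, since paraphrases and objects that are visible but unmentioned in the references are not distinguished.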

4. Conclusions

This study presents the ECMA module, a plug-and-play enhancement for BLIP-2 that addresses limitations in cross-modal semantic alignment and information interaction. By integrating bidirectional cross-modal attention, multi-scale visual feature aggregation, and semantically gated fusion, ECMA overcomes the one-way interaction limitation of traditional methods, captures fine-grained visual cues across ViT feature levels, and improves the controllability of cross-modal fusion while maintaining training stability. Extensive experiments on the COCO dataset demonstrate that ECMA consistently improves performance on key metrics such as CIDEr, BLEU, METEOR, and ROUGE-L, while generating more informative and context-aware captions across diverse content categories. Ablation studies further verify the independent contributions and complementary effects of each ECMA component, confirming the effectiveness of bidirectional interaction, multi-scale visual aggregation, and semantic residual gating fusion. A discussion of the method's limitations highlights remaining challenges in handling complex multi-object scenes and domain-specific data, providing guidance for potential improvements. Future work will explore extending ECMA to occluded or complex scenes [36], incorporating lightweight visual fine-tuning strategies such as LoRA [37,38] and Adapters to optimize feature utilization, and adapting the module to temporal cross-modal tasks such as video understanding, with the aim of further enhancing the model's generalization, robustness, and applicability.
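As one possible direction for the LoRA-based extension mentioned above, the sketch below shows how low-rank adapters could be attached to the otherwise frozen vision encoder. It assumes the Hugging Face transformers implementation of BLIP-2 and the peft library; the checkpoint name, rank, and the "qkv" target module are illustrative choices, not settings validated in this work.

```python
# Illustrative sketch only: attaching LoRA adapters to the BLIP-2 vision encoder
# with Hugging Face transformers + peft. Hyperparameters and the "qkv" target
# module name are assumptions, not settings evaluated in this paper.
from transformers import Blip2ForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

lora_config = LoraConfig(
    r=8,                     # low-rank dimension (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["qkv"],  # fused query/key/value projection in the ViT attention blocks
)

# Wrapping the full model while targeting only "qkv" leaves the language model
# untouched, since that module name occurs only inside the vision encoder.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```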

Author Contributions

Conceptualization, S.J. and Y.C.; methodology, Y.C.; software, Y.C.; validation, S.J., Y.C., R.C. and Z.L.; formal analysis, S.J., R.C. and Z.L.; investigation, Y.C.; resources, S.J., R.C. and Z.L.; data curation, Y.C.; writing—original draft preparation, Y.C.; writing—review and editing, S.J., R.C. and Z.L.; visualization, Y.C.; supervision, S.J., R.C. and Z.L.; project administration, S.J., R.C. and Z.L.; funding acquisition, S.J. All authors have read and agreed to the published version of the manuscript.

Funding

The research was funded by the Hainan Provincial Natural Science Foundation of China (No. 624LALH009) and the Fundamental Research Funds for the Central Universities (2024QNYL06).

Data Availability Statement

The data used in this study consist of three datasets. The MS COCO dataset is publicly available at https://cocodataset.org/. The PhotoChat dataset is publicly available at https://github.com/google-research/google-research/tree/master/multimodalchat, accessed on 28 January 2026. In addition, a small set of images captured by the authors was used for supplementary experiments and is not publicly available.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Albadarneh, I.A.; Hammo, B.H.; Al-Kadi, O.S. Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation. arXiv 2025, arXiv:2506.05399. [Google Scholar] [CrossRef]
  2. Fei, J.; Wang, T.; Zhang, J.; He, Z.; Wang, C.; Zheng, F. Transferable Decoding with Visual Entities for Zero-Shot Image Captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 2788–2798. [Google Scholar]
  3. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), New York, NY, USA, 23–26 June 2021. [Google Scholar]
  4. Fang, Y.; Wang, W.; Xie, B.; Sun, Q.; Wu, L.; Wang, X.; Huang, T.; Wang, X.; Cao, Y. EVA: Exploring the Limits of Masked Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  5. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
  6. Zhang, S.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M.; Li, X.; Lin, X.V.; et al. OPT: Open Pre-trained Transformers. Meta AI Technical Report. arXiv 2022, arXiv:2205.01068. [Google Scholar] [CrossRef]
  7. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  8. Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling Instruction-Finetuned Language Models. J. Mach. Learn. Res. 2024, 25, 1–53. [Google Scholar]
  9. Li, D.; Li, J.; Le, H.; Wang, G.; Savarese, S.; Hoi, S.C.H. LAVIS: A one-stop library for language-vision intelligence. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Toronto, ON, Canada, 10–12 July 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 31–41. [Google Scholar]
  10. Eom, S.; Shim, J.; Koo, G.; Na, H.; Hasegawa-Johnson, M.A. Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM. In Findings of the Association for Computational Linguistics; EMNLP: Miami, FL, USA, 2024; pp. 14158–14167. [Google Scholar]
  11. Guo, H.; Xie, Z.; Cao, S.; Wang, B.; Liu, W.; Le, A.; Li, L.; Li, Z. SNS-Bench-VL: Benchmarking multimodal large language models in social networking services. arXiv 2025, arXiv:2505.23065. [Google Scholar]
  12. Ma, Y.; Ji, J.; Sun, X.; Zhou, Y.; Ji, R. Towards Local Visual Modeling for Image Captioning. Pattern Recognit. 2023, 138, 109420. [Google Scholar] [CrossRef]
  13. Liu, Y.; Li, X.; Zhang, L.; Wang, Z.; Zheng, Z.; Zhou, Y.; Xie, C. OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning. arXiv 2025, arXiv:2509.01644. [Google Scholar] [CrossRef]
  14. Kim, S.; Xiao, R.; Georgescu, M.-I.; Alaniz, S.; Akata, Z. COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 14690–14700. [Google Scholar]
  15. Tschannen, M.; Gritsenko, A.; Wang, X.; Naeem, M.F.; Alabdulmohsin, I.; Parthasarathy, N.; Evans, T.; Beyer, L.; Xia, Y.; Mustafa, B.; et al. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. arXiv 2025, arXiv:2502.14786. [Google Scholar]
  16. Zheng, C. The Linear Attention Resurrection in Vision Transformer. arXiv 2025, arXiv:2501.16182. [Google Scholar] [CrossRef]
  17. Zhou, J.; Jiang, J.; Zhu, Z. Improving Visual Discriminability of CLIP for Training-Free Open-Vocabulary Semantic Segmentation. arXiv 2025, arXiv:2510.23894. [Google Scholar]
  18. Zhan, G.; Liu, Y.; Han, K.; Xie, W.; Zisserman, A. EIP: Enhanced Visual-Language Foundation Models for Image Retrieval. arXiv 2025, arXiv:2502.15682. [Google Scholar]
  19. Tschannen, M.; Kumar, M.; Steiner, A.; Zhai, X.; Houlsby, N.; Beyer, L. Image Captioners Are Scalable Vision Learners Too. arXiv 2023, arXiv:2306.07915. [Google Scholar] [CrossRef]
  20. Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In Proceedings of the 39th International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900. [Google Scholar]
  21. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  22. Guo, Q.; Yao, K.; Chu, W. Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–24 October 2022. [Google Scholar]
  23. Hu, W.; Dou, Z.; Li, L.H.; Kamath, A.; Peng, N.; Chang, K.-W. Matryoshka Query Transformer for Large Vision-Language Models. arXiv 2024, arXiv:2405.19315. [Google Scholar] [CrossRef]
  24. Hashmi, K.A.; Badrinarayanan, V.; Udpa, S.; Kundu, A.; Khandelwal, Y. FeatEnHancer: Enhancing Hierarchical Features for Object Detection and Beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023. [Google Scholar]
  25. Ma, X.; Zhou, C.; Kong, X.; He, J.; Gui, L.; Neubig, G.; May, J.; Zettlemoyer, L. MEGA: Moving Average Equipped Gated Attention. In Proceedings of the 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  26. Wu, M.; Zhang, X.; Sun, X.; Zhou, Y.; Chen, C.; Gu, J.; Sun, X.; Ji, R. DIFNet: Boosting Visual Information Flow for Image Captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 18020–18029. [Google Scholar]
  27. Wortsman, M.; Dettmers, T.; Zettlemoyer, L.; Morcos, A.; Farhadi, A.; Schmidt, L. Stable and Low-Precision Training for Large-Scale Vision-Language Models. arXiv 2023, arXiv:2304.13013. [Google Scholar]
  28. Kalra, D.S.; Barkeshli, M. Why Warmup the Learning Rate? Underlying Mechanisms and Improvements. arXiv 2024, arXiv:2406.09405. [Google Scholar] [CrossRef]
  29. Beck, M.; Pöppel, K.; Spanring, M.; Auer, A.; Prudnikova, O.; Kopp, M.; Klambauer, G.; Brandstetter, J.; Hochreiter, S. xLSTM: Extended Long Short-Term Memory. arXiv 2024, arXiv:2405.04517. [Google Scholar]
  30. Koryakovskiy, I.; Yakovleva, A.; Buchnev, V.; Isaev, T.; Odinokikh, G. One-Shot Model for Mixed-Precision Quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7939–7949. [Google Scholar]
  31. Herdade, S.; Kappeler, A.; Boakye, K.; Soares, J. Image captioning: Transforming objects into words. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; ACM: New York, NY, USA, 2019; pp. 11137–11147. [Google Scholar]
  32. Huang, L.; Wang, W.; Chen, J.; Wei, X.-Y. Attention on attention for image captioning. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Long Beach, CA, USA, 2019; pp. 4634–4643. [Google Scholar]
  33. Luo, Y.; Ji, J.; Sun, X.; Cao, L.; Wu, Y.; Huang, F.; Lin, C.-W.; Ji, R. Dual-level collaborative Transformer for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, AAAI, Virtual, 2–9 February 2021; pp. 2286–2293. [Google Scholar]
  34. Schumann, C.; Ricco, S.; Prabhu, U.; Ferrari, V.; Pantofaru, C. A Step Toward More Inclusive People Annotations for Fairness. arXiv 2021, arXiv:2105.02317. [Google Scholar]
  35. Zang, X.; Liu, L.; Wang, M.; Song, Y.; Zhang, H.; Chen, J. PhotoChat: A Human-Human Dialogue Dataset with Photo Sharing Behavior for Joint Image-Text Modeling. arXiv 2021, arXiv:2108.01453. [Google Scholar]
  36. Honda, U.; Watanabe, T.; Matsumoto, Y. Switching to Discriminative Image Captioning by Relieving a Bottleneck of Reinforcement Learning. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
  37. Zanella, M.; Ayed, B. Low-Rank Few-Shot Adaptation of Vision-Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 16–22 June 2024; pp. 1593–1603. [Google Scholar]
  38. Chen, X.; Liu, J.; Wang, Y.; Wang, P.; Brand, M.; Wang, G. SuperLoRA: Parameter-Efficient Unified Adaptation for Large Vision Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 16–22 June 2024; pp. 8050–8055. [Google Scholar]
Figure 1. Overall workflow of the proposed image captioning framework.
Figure 2. ECMA model diagram. This flowchart shows an E-Former + LLM image captioning pipeline: an image is split into features, paired with learned queries, encoded by E-Former, transformed, then fed to an LLM decoder to generate the caption.
Figure 3. ECMA internal diagram. This diagram explains an E-Former multimodal image captioning framework: the image is feature-extracted via a frozen encoder, fused with a prompt in E-Former, then passed to a frozen LLM to generate the caption, enabling end-to-end image-to-text generation.
Figure 4. Multi-scale feature extraction diagram. This diagram shows a ViT-based multi-scale feature pipeline for ECMA: A Vision Transformer (ViT) outputs hierarchical features (scales 4–24), which are fed into the ECMA module to leverage varied-granularity visual information.
Figure 5. ECMA network structure diagram. This diagram shows ECMA's multimodal pipeline: a dog image (via Visual Encoder) and text (via Text Encoder) feed into ECMA's three sub-modules; ECMA outputs a fused query (Q), which pairs with Q_new for the final result.
Figure 6. Methodology diagram. This system flowchart depicts an E-Former-based end-to-end image captioning workflow: a dog image is encoded, paired with queries, processed by E-Former into Query-Aware Visual Tokens, and fed to a frozen LLM for caption generation.
Figure 7. Performance comparison of ECMA and representative image captioning models on the COCO dataset.
Figure 8. Image description generation result diagram. This figure showcases a set of image-caption pairs generated by an image captioning model, featuring six examples: a close-up of a dog wearing a hat, a dog sticking its head out of a car window, a group of people riding bikes across a street, the Singapore skyline at sunset with the Merlion statue, a person riding a horse in front of a crowd, and a cat standing on a table next to a bottle of water.
Figure 9. Image description generation diagram. This screenshot demonstrates the interface of an image captioning demo.
Figure 10. ECMA’s effectiveness in animal identification. This figure presents 12 image-caption pairs generated by an image captioning model, covering a diverse range of scenes and subjects including domestic animals (a cat on a sink, a dog on ice, a puppy on a toilet), wildlife (giraffes, zebras, elephants, polar bears, a brown bear cub), and farm animals (a sheep, a horse pulling a carriage), with each paired caption providing a concise, contextually accurate description that reflects the key visual elements of the corresponding image, demonstrating the model’s ability to generate reliable descriptions for varied animal-centric scenarios.
Figure 11. ECMA’s effectiveness in landscape recognition. This figure displays 12 image-caption pairs generated by an image captioning model, showcasing a variety of urban, natural, and transportation scenes—such as a high-angle view of a river, a city street with a car, a downtown intersection with a street sign, boats in a harbor, a snow-covered hill, an airplane on a runway, a church with a tall spire, a train passing a forest, a tennis player in action, birds flying over a lighthouse, a busy city street, and airplanes in a cloudy sky—where each caption concisely and accurately captures the key visual elements of its corresponding image, highlighting the model’s versatility in describing diverse landscape and cityscape scenarios.
Figure 12. ECMA’s effectiveness in food identification. This figure presents 12 image-caption pairs generated by an image captioning model, focusing on diverse food and dining scenarios—including a cupcake, pizza, a shared meal tray, ice cream, a half-eaten doughnut, hot dogs, a loaded dinner plate, sandwiches, quiche, a dressed hot dog, and a pineapple candle-lighting setup—where each caption concisely and accurately captures the key food items and context of the corresponding image, demonstrating the model’s ability to generate precise descriptions for varied culinary and dining scenes.
Figure 13. ECMA’s performance in recognizing people. This figure displays 12 image-caption pairs generated by an image captioning model, featuring a wide range of human activities—such as a cyclist standing by his bike, a tennis player in action, a snowboarder posing in the snow, a surfer riding a wave, a newlywed couple cutting their wedding cake, a little girl eating in a high chair, two men playing frisbee, a toddler with a teddy bear, a baseball player swinging a bat, a skateboarder performing a trick, a couple playing video games, and two women with a large teddy bear—where each caption concisely and accurately captures the core action and context of the corresponding image, demonstrating the model’s proficiency in describing diverse human-centric scenarios and daily activities.
Figure 14. ECMA’s performance in object recognition. This figure presents 12 image-caption pairs generated by an image captioning model, showcasing a diverse set of everyday objects and scenes—including a teddy bear, a vase of flowers, a parked motorcycle, a laptop on a bed, a bathroom toilet, gold scissors, a pile of scissors on rope, a purse with accessories, a refrigerator on a sidewalk, a building-mounted clock, a fire hydrant, and a glass pitcher with a metal vase—where each caption concisely and accurately identifies the key subject and context of the corresponding image, demonstrating the model’s ability to reliably describe common household, outdoor, and personal items.
Figure 15. Performance of ECMA on the PhotoChat Validation Dataset. This figure presents 12 image-caption pairs sampled from the validation results of an image captioning model on the PhotoChat dataset. The examples showcase a diverse set of everyday objects and scenes, including a tree-lined urban street, a desktop calculator, a military fighter jet, a partially open laptop, a wheelchair positioned by stairs, a solved Rubik’s Cube, a bicycle locked to a utility pole, a pair of eyeglasses, a grand piano in a minimalist interior, a barber cutting a customer’s hair, a compact digital camera on a bed, and a television displaying a video of a man. Each caption concisely and accurately identifies the key subject and contextual details of the corresponding image, demonstrating the model’s ability to reliably describe common household, outdoor, and public scenes.
Figure 16. Qualitative Validation Results on a Custom-Collected Image Dataset. This figure presents 12 image-caption pairs sampled from the validation results of an image captioning model on a custom dataset of personally captured photographs. The examples showcase a diverse set of everyday objects and scenes, including a Captain America-themed birthday cake, pedestrians walking across an urban park, a parking lot filled with cars, a polar bear walking on a black surface, a watermelon bowl filled with ice cream and fruit, rabbits resting on a grassy field, a small chair with a vase of tulips, turtles in a glass enclosure, two pizzas with assorted toppings, a duck standing on a puddle, a fruit platter with watermelon and mango, and a black-and-white cat lying on the ground. Each caption concisely and accurately identifies the key subject and contextual details of the corresponding image, demonstrating the model’s ability to reliably describe a wide range of self-collected household, outdoor, and public scenes.
Table 2. Ablation experiments using ECMA in OPT-2.7b.
| Bidirectional Cross-Modal Attention | Multi-Scale Visual Aggregation | Semantic Residual Gating Fusion | BLEU@4 | METEOR | CIDEr | ROUGE-L | SPICE |
| × | × | × | 42.8 | 32.1 | 145.7 | 61.8 | 25.5 |
| ✓ | × | × | 42.8 | 32.0 | 145.5 | 61.9 | 25.5 |
| ✓ | ✓ | × | 43.8 | 31.6 | 140.0 | 61.8 | 25.1 |
| ✓ | ✓ | ✓ | 44.0 | 31.9 | 147.0 | 62.2 | 25.5 |
Table 3. Ablation experiments using ECMA in OPT-6.7b.
| Bidirectional Cross-Modal Attention | Multi-Scale Visual Aggregation | Semantic Residual Gating Fusion | BLEU@4 | METEOR | CIDEr | ROUGE-L | SPICE |
| × | × | × | 42.5 | 31.7 | 144.6 | 61.5 | 25.4 |
| ✓ | × | × | 43.2 | 31.4 | 145.3 | 61.6 | 25.0 |
| ✓ | ✓ | × | 43.9 | 31.8 | 146.7 | 62.0 | 25.2 |
| ✓ | ✓ | ✓ | 43.9 | 31.8 | 146.8 | 62.0 | 25.2 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
