Article

Image Captioning Model Based on Multi-Step Cross-Attention Cross-Modal Alignment and External Commonsense Knowledge Augmentation

1 School of Computer Science and Technology, Shenyang University of Chemical Technology, Shenyang 110142, China
2 Liaoning Key Laboratory of Intelligent Technology for Chemical Process Industry, Shenyang 110142, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(16), 3325; https://doi.org/10.3390/electronics14163325
Submission received: 26 July 2025 / Revised: 17 August 2025 / Accepted: 18 August 2025 / Published: 21 August 2025

Abstract

To address the semantic mismatch between the limited textual descriptions in image captioning training datasets and the multi-semantic nature of images, as well as the underutilization of external commonsense knowledge, this article proposes a novel image captioning model based on multi-step cross-attention cross-modal alignment and external commonsense knowledge augmentation. The model employs a backbone architecture comprising CLIP’s ViT visual encoder, Faster R-CNN, a BERT text encoder, and a GPT-2 text decoder, and incorporates two core mechanisms. The first is a multi-step cross-attention mechanism that iteratively aligns image and text features over multiple rounds, progressively enhancing inter-modal semantic consistency for more accurate cross-modal representation fusion. The second employs Faster R-CNN to extract region-based object features, which are mapped to corresponding entities within the dataset through entity probability calculation and entity linking. External commonsense knowledge associated with these entities is then retrieved from the ConceptNet knowledge graph, followed by knowledge embedding via TransE and multi-hop reasoning. Finally, the fused multimodal features are fed into the GPT-2 decoder to steer caption generation, enhancing the lexical richness, factual accuracy, and cognitive plausibility of the generated descriptions. In the experiments, the model achieves CIDEr scores of 142.6 on MSCOCO and 78.4 on Flickr30k, and ablation studies confirm that both modules enhance caption quality.

1. Introduction

Image captioning denotes a machine’s capability to accurately recognize objects, scenes, and their interrelationships within an image, coupled with autonomous generation of linguistically coherent and semantically dense descriptions [1]. Owing to the unstructured nature of images, in-depth semantic parsing of visual content holds critical importance for domains including cross-modal retrieval, autonomous driving, and intelligent medical imaging analysis. Consequently, image captioning has become a pivotal task in multimodal research.
Within the domain of image captioning, training models from scratch on large-scale datasets confronts two fundamental constraints: scarcity of paired image-text resources and prohibitively high computational costs. Consequently, transferring the potent vision-language pretrained model Contrastive Language-Image Pre-Training (CLIP) [2] as the visual encoder, and coupling it with Generative Pre-Trained Transformer 2 (GPT-2) [3] as the decoder to leverage its text generation capabilities has emerged as an efficacious approach for image captioning. This paradigm not only mitigates training expenditures, but also yields semantically richer and higher-quality textual descriptions.
However, prevailing frameworks integrating CLIP with GPT-2 exhibit two enduring limitations:
  • The inherent richness of image semantics versus the limited textual descriptions in training data leads to insufficient cross-modal alignment precision, thereby degrading the quality of visual–textual feature integration. Consequently, the generated captions suffer from attribute omission, reduced accuracy, and impoverished diversity.
  • Existing approaches often fail to effectively incorporate external commonsense knowledge to assist decoding during caption generation. This lack of integration prevents the GPT-2 decoder from being adequately guided by such knowledge, which in turn constrains caption diversity, semantic depth, and commonsense plausibility. Consequently, despite advances in knowledge graph technology, developing efficient mechanisms to integrate external commonsense knowledge to enhance descriptive capabilities remains a critical challenge.
To address these challenges, this article proposes an image captioning model that integrates multi-step cross-attention cross-modal alignment with external commonsense knowledge augmentation, building on the transfer of pretrained CLIP and GPT-2 to the captioning task. The main contributions are as follows:
  • This article proposes an image captioning architecture built upon the CLIP and GPT-2 backbone, integrating the following key components: the Vision Transformer (ViT) visual encoder (from CLIP) for global semantics, a region-based visual encoder using Faster Region-based Convolutional Neural Network (Faster R-CNN) for regional object-centric details, the Bidirectional Encoder Representations from Transformers (BERT) text encoder, a multi-step cross-attention cross-modal alignment module, an external commonsense knowledge augmentation module (leveraging ConceptNet), and the GPT-2 text decoder for description generation. This architecture embodies a trilevel enhancement through “global + regional + knowledge” integration: ViT captures global image semantics, Faster R-CNN extracts regional object particulars, and ConceptNet provides implicit relational commonsense knowledge. By synergizing multimodal deep integration with external knowledge augmentation, the framework achieves notable improvements in terms of fine-grained semantic understanding, cross-modal alignment precision, hallucination suppression, and text generation quality.
  • This article presents a multi-step cross-attention mechanism that iteratively performs alignment and fusion of visual and textual features across multiple rounds. This iterative process significantly enhances semantic consistency and cross-modal fusion accuracy between image and text, while simultaneously improving the model’s robustness and generalization capabilities.
  • This article introduces the open knowledge graph ConceptNet and maps region-specific object features extracted by Faster R-CNN to entities within ConceptNet. Subsequently, it retrieves multiple triplet sets, which are combined with TransE embeddings and multi-hop reasoning, enabling a deep integration of external commonsense knowledge and cross-modal image-text features. Through knowledge association and expansion, this approach enhances the semantic expression of text generation, further improving the location awareness, detail awareness, richness, accuracy, and cognitive plausibility of objects in the generated text.
  • This article presents an empirical study on the Microsoft Common Objects in COntext (MSCOCO) and Flickr30k Entities (Flickr30k) datasets. Experimental results demonstrate that the proposed model outperforms existing methods across multiple evaluation metrics. Furthermore, ablation studies and qualitative analysis validate the effectiveness of the multi-step cross-attention cross-modal alignment module and the external commonsense knowledge enhancement module.
The remainder of this article is structured as follows: Section 2 systematically surveys pertinent literature, including pretrained models (CLIP and GPT-2), recent advances in image captioning frameworks based on pretrained models, and state-of-the-art techniques incorporating cross-attention mechanisms with ConceptNet knowledge graphs. Section 3 details the architecture of the image captioning framework, delineating its core modules: the multi-step cross-attention cross-modal fusion module, the region-based visual feature extraction module using Faster R-CNN, the entity linking module with ConceptNet, and the joint training procedure. Section 4 presents experimental results on MSCOCO and Flickr30K datasets through comparative analysis, ablation studies, and qualitative evaluation. Section 5 concludes with key contributions, limitations, and future research directions.

2. Related Works

2.1. Vision–Language Pre-Trained Model CLIP and Generative Pre-Trained Model GPT-2

CLIP (Contrastive Language-Image Pre-Training) [4], proposed by OpenAI, is a contrastive-learning-based multimodal model. It is designed to embed images and text into a unified semantic space through large-scale image–text pair training, enabling cross-modal understanding and zero-shot transfer capabilities. To this end, OpenAI constructed a massive dataset of approximately 400 million image–text pairs, where textual descriptions serve as fine-grained labels for unsupervised training.
CLIP comprises a vision encoder and a text encoder, which employ mainstream feature extraction networks (e.g., ViT and BERT) to encode images and text, respectively, into semantic embeddings. During training, the model treats semantically aligned image-text pairs as positive samples and mismatched pairs as negative samples. By computing the similarity between image and text embeddings and employing a symmetric contrastive loss to maximize positive-pair similarity while minimizing negative-pair similarity, the model achieves effective cross-modal semantic alignment. This training strategy enables CLIP to demonstrate exceptional cross-modal understanding and transfer generalization capabilities across multiple downstream tasks—such as image captioning, video captioning, and visual question answering—establishing it as a foundational model for multimodal pre-training.
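As a concrete illustration of the symmetric contrastive objective described above, the following minimal PyTorch sketch computes a CLIP-style loss over a batch of matched image-text embeddings. The function name and the fixed temperature are illustrative rather than taken from the CLIP implementation, which additionally learns a logit scale.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss over a batch of matched pairs.

    image_emb, text_emb: (B, d) embeddings; the i-th image and i-th text form the
    positive pair, and all other in-batch combinations act as negatives.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature           # (B, B) scaled cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)               # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)           # text-to-image direction
    return (loss_i2t + loss_t2i) / 2
```

Averaging the two directions keeps the objective symmetric in images and text, which is what yields the shared embedding space described above.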
GPT-2 (Generative Pre-Trained Transformer 2) [5], developed by OpenAI, is an open-source auto-regressive language generation model based on the Transformer decoder architecture. It substantially enhances text generation capabilities through large-scale unsupervised pre-training. Leveraging masked self-attention mechanisms, GPT-2 exhibits strong zero-shot/few-shot learning capabilities, coherent long-text generation, and flexible adaptation to diverse linguistic styles. It achieves superior performance across natural language processing tasks, including dialogue systems, text summarization, and machine translation.
GPT-2 extends the standard Transformer architecture through increased network depth and scaled parameter size, thereby augmenting its capacity for modeling complex linguistic features. It employs a self-supervised learning approach, pre-trained on large-scale corpora to predict the next token from its preceding context, which enables the model to learn word relationships and grammatical patterns, thereby producing more natural and coherent text. Given the typically short text length in image captioning, employing the lightweight GPT-2 as the text decoder—compared to larger models like Generative Pre-Trained Transformer 4 (GPT-4) and GPT-4 Omni (GPT-4o) [6]—delivers comparable generation quality while offering superior computational efficiency, enhanced stability, and simplified training, making it architecturally better suited for this task.

2.2. Image Captioning Method Based on the Pre-Trained Model

OpenAI’s CLIP model offers a novel approach for cross-modal tasks such as image captioning. Mokady et al. [7] proposed the ClipCap approach, demonstrating how CLIP models can be combined with pretrained language models to generate image captions with exceptional relevance and creativity. Li et al. [8] proposed the COS-Net model, which leverages CLIP for cross-modal retrieval to extract initial semantic information. Additionally, it further constructs a Transformer-based semantic understanding module capable of performing fine-grained filtering and content completion on retrieved objects. However, the COS-Net model exhibits dependency on training data, posing potential bias risks. Fang et al. [9] proposed the ViTCAP model, which employs a vision-only Transformer architecture. To enhance the semantic richness of generated captions, ViTCAP incorporates the Concept Token Network (CTN) module.
ConZIC [10] adopts BERT as its textual decoder, formally casting caption generation within a Gibbs sampling framework. It utilizes CLIP's loss function to provide image-text matching guidance, while incorporating a stylistic score of generated sentences as a constraint to guide the decoder in selecting contextually appropriate tokens from candidate vocabularies, thereby enhancing caption quality. However, ConZIC has been observed to overlook minor objects in images, compromising detail depiction. CapDec [11] injects Gaussian noise into the text inputs of CLIP encoding, training a Transformer decoder (following the GPT-2 architecture) end-to-end to map noisy text embeddings back to the original text. Additionally, it models the distribution characteristics of different datasets from a limited amount of matched data. However, although CapDec outperforms text-only training approaches, it has been observed to underperform fully supervised models and remains confined to predominantly English pretraining corpora, constraining its scalability. Wang et al. [12] leverage CLIP's cross-modal matching capabilities to retrieve textual corpora with maximal semantic similarity to training captions during fine-tuning, mapping these into learnable vector embeddings. At inference time, the approach retrieves texts exhibiting the highest relevance to input images, converting them via a pre-trained mapping network into control vectors that are fed into the language decoder.

2.3. Image Captioning Model with Knowledge Graph Grounding and Cross-Attention Mechanisms

The ConceptNet knowledge graph [13,14] originated from the Open Mind Common Sense initiative at the MIT Media Lab, which aimed to collect commonsense data through public collaboration. It was later expanded by incorporating expert annotations, gamified crowdsourcing (e.g., the Verbosity game), and other open knowledge sources (e.g., WordNet and Wiktionary). Huang et al. [15] proposed integrating knowledge graphs into image captioning models, enhancing visual content comprehension and descriptive expressiveness by injecting external semantic information into the encoder–decoder architecture. Li et al. [16] proposed an end-to-end visual storytelling model that integrates external knowledge from knowledge graphs. By introducing a knowledge-augmented attention mechanism, the model fuses semantic knowledge with visual features to enhance scene comprehension and event reasoning capabilities.
Cross Attention (CA) [17], as a pivotal variant of attention mechanisms, plays a foundational role in multimodal learning tasks. In image captioning models, related work utilizes cross-attention to achieve cross-modal feature alignment and fusion [18]. For example, when generating a caption such as “a bird is standing on a branch”, the cross-attention module enables the model to attend primarily to semantically relevant regions in the image—such as the bird and the supporting branch—while filtering out irrelevant background information. Lukovnikov and Fischer [19] proposed a training-free cross-attention control mechanism that enhances semantic alignment and image quality by modulating attention weights in diffusion models to achieve precise mapping between local text tokens and target image regions. EHAT [20] employs masked heterogeneous cross-attention to achieve local alignment between visual content and monolingual text, while simultaneously constructing interaction mappings between images and English-Chinese bilingual text through a heterogeneous reasoning network. This approach enhances both the accuracy and consistency of image captions. Cao et al. [21] introduced CLIP retrieval assistance in the CAST model and enabled cross-attention fusion between images and retrieved text through a dual-attention decoder, effectively improving alignment with visual information while reducing hallucinated content. However, these methods still exhibit shared limitations: while leveraging cross-attention mechanisms to enhance multimodal fusion between vision and language, they predominantly focus on local alignment or retrieval augmentation. Consequently, the models struggle to comprehensively understand multi-level semantics and complex relationships within images, leading to descriptions that fall short in both detail accuracy and relational expressiveness.
This article employs commonsense knowledge from the ConceptNet knowledge graph for model knowledge enhancement, utilizing a TransE-based multi-hop reasoning knowledge graph embedding approach to realize knowledge vectorization [22]. Specifically, TransE maps entities and relations into low-dimensional vectors, while sequence encoders (e.g., RNN/BiLSTM) generate path representations by aggregating entity and relation embeddings along the path. This methodology integrates TransE's local relationship modeling with multi-hop reasoning's global path exploration, significantly improving the embedding quality and reasoning capability of the ConceptNet knowledge graph. The enhanced knowledge vectors are subsequently integrated with vision-language features through multi-step cross-attention fusion for joint model training. The proposed methodology demonstrates considerable potential for application in multimodal tasks. Gu et al. [23] proposed the TextKG model, which significantly enhances video captioning accuracy by integrating knowledge graph embeddings with visual features and employing cross-attention mechanisms in a dual-stream Transformer architecture. Han et al. [24] proposed the CMCR model, which employs cross-attention mechanisms to fuse multimodal information for more accurate event localization. Additionally, it integrates unbiased knowledge graph-based scene graphs to enhance caption logicality and consistency, resulting in significant performance improvements.

3. Model Design

3.1. Model Structure

As shown in Figure 1, the model architecture mainly consists of the following components: ViT visual encoder, BERT text encoder, multi-step cross-attention cross-modal fusion module, external commonsense knowledge enhancement module, and GPT-2 decoder. The ViT visual encoder processes input images to represent them as sequences of image features, while the BERT text encoder encodes descriptive captions of input images into textual feature sequences. Subsequently, multi-step cross-attention cross-modal mechanisms are leveraged to fuse cross-modal features from images and texts, achieving complementary information integration and enabling similarity measurement within a unified embedding space. The external knowledge enhancement module retrieves relevant knowledge triplets associated with the current image from a knowledge base. These triplets, along with the fused features, are then fed into the GPT-2 decoder to generate descriptive outputs.

3.2. Multi-Step Cross-Attention Cross-Modal Alignment Module

Building upon the advantages of the above cross-attention mechanism in cross-modal alignment and fusion for image captioning, this article proposes a multi-step cross-attention-based cross-modal fusion method for the proposed image captioning model. The core principle of multi-step cross-attention lies in progressively refining inter-modal alignment through iterative updates to the query representation during training, which contrasts with conventional single-step cross-attention that performs one-time cross-modal alignment between images and text. During generation, it alternately employs visual and textual features as the query source, facilitating bidirectional enhancement of vision–language representations [25].
Let $X_I \in \mathbb{R}^{n \times d}$ denote the image modality's local features (specifically, patch embeddings extracted by the Vision Transformer (ViT) from image grids), $n$ the number of local features, $d$ the feature dimensionality ($d = 512$), $X_T \in \mathbb{R}^{m \times d}$ the textual modality's local features (e.g., word embeddings), $m$ the number of textual tokens, $Q_k \in \mathbb{R}^{d}$ the query vector at decoding step $k$ (initialized via global feature aggregation), and $K$ the total number of iterative steps. The global image and text features are obtained by mean pooling the local features (Equations (1) and (2)), and the proposed multi-step cross-attention cross-modal fusion operates as follows:
$V_I = \mathrm{MeanPooling}(X_I)$ (1)
$V_T = \mathrm{MeanPooling}(X_T)$ (2)
Step 1: Initialize the query vector. The initial query vector $Q_0$ is generated by first applying mean pooling to the local features of both the image and text modalities to produce their respective global features, and then performing element-wise addition of the image global feature and the text global feature (as shown in Equation (3)):
$Q_0 = V_I + V_T$ (3)
In Equation (3), + denotes element-wise addition.
Step 2: Iteratively compute cross-attention and update the query vector ($k = 1, 2, \ldots, K$), where each iteration consists of two cross-attention computations and one query vector update:
  • Text-aligned representation computation. Taking the current query $Q_{k-1}$ as the query and the local visual features $X_I$ as the keys and values, the cross-attention is computed to update the textual query vector $Q_T^k$ (as shown in Equation (4)):
    $Q_T^k = \mathrm{softmax}\!\left(\frac{Q_{k-1} X_I^{\top}}{\sqrt{d}}\right) X_I$ (4)
  • Image-aligned representation computation. Taking the current query $Q_{k-1}$ as the query and the local textual features $X_T$ as the keys and values, the cross-attention is computed to update the image query vector $Q_I^k$ (as shown in Equation (5)):
    $Q_I^k = \mathrm{softmax}\!\left(\frac{Q_{k-1} X_T^{\top}}{\sqrt{d}}\right) X_T$ (5)
  • Query vector update. The aligned representations $Q_I^k$ and $Q_T^k$ are multiplied element-wise to generate a new query vector, as shown in Equation (6), and the process then returns to Step 2 for the next iteration:
$Q_k = Q_I^k \odot Q_T^k$ (6)
In Equation (6), $\odot$ denotes element-wise multiplication.
Step 3: After $K$ iterative steps, the final query vector $Q_K$ incorporates deep cross-modal interactive information. This vector is then used to generate the cross-modally fused feature $V_{\mathrm{fused}}$ through weighted summation (as shown in Equation (7)):
$V_{\mathrm{fused}} = Q_K + \alpha V_I + \beta V_T$ (7)
In Equation (7), the weighting coefficients are set as $\alpha = 0.7$ and $\beta = 0.3$.
This multi-step cross-attention mechanism enables dynamic cross-modal interaction, where each iteration dynamically adjusts inter-modal relevance weights through attention computation. Furthermore, the model implements parameter sharing across iterative steps, enhancing architectural flexibility.
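For clarity, the following PyTorch sketch implements Equations (1)–(7) for a single image-text pair; batching, learned projections, and the parameter-sharing details are omitted, and the function name is illustrative.

```python
import torch

def multi_step_cross_attention(X_I, X_T, K=5, alpha=0.7, beta=0.3):
    """Multi-step cross-attention fusion (Eqs. (1)-(7)).

    X_I: (n, d) image patch features; X_T: (m, d) text token features.
    Returns the fused feature V_fused of dimension d.
    """
    d = X_I.size(-1)
    V_I = X_I.mean(dim=0)                                    # Eq. (1): global image feature
    V_T = X_T.mean(dim=0)                                    # Eq. (2): global text feature
    Q = V_I + V_T                                            # Eq. (3): initial query

    for _ in range(K):
        # Eq. (4): attend over image features to obtain the text-aligned representation
        Q_T = torch.softmax(Q @ X_I.t() / d ** 0.5, dim=-1) @ X_I
        # Eq. (5): attend over text features to obtain the image-aligned representation
        Q_I = torch.softmax(Q @ X_T.t() / d ** 0.5, dim=-1) @ X_T
        # Eq. (6): element-wise product forms the next query
        Q = Q_I * Q_T

    return Q + alpha * V_I + beta * V_T                      # Eq. (7): weighted fusion
```

With d = 512 and K = 5 (Table 1), calling multi_step_cross_attention(torch.randn(49, 512), torch.randn(20, 512)) returns a single 512-dimensional fused vector V_fused.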

3.3. ConceptNet Commonsense Knowledge Augmentation Module Based on TransE and Multi-Hop Reasoning

3.3.1. ConceptNet Entity Linking with Region-Based Visual Features Extracted by Faster R-CNN

In image captioning models, performing entity recognition on the bounding boxes $\{B_i\}$ detected by Faster R-CNN [26] (each corresponding to an object) constitutes a critical step for knowledge augmentation. This pipeline comprises four stages: region object feature extraction, entity probability computation, entity linking, and knowledge encoding. The specific workflow for the first three stages is as follows:
Step 1: Region-Based Object Feature Extraction.
Given an input image, Faster R-CNN is employed to extract region features for candidate bounding boxes $\{B_i\}$. First, a backbone network (e.g., ResNet-101) generates the feature map $\Phi(I) \in \mathbb{R}^{h \times w \times 2048}$. Then, the RoI Align operation transforms each candidate box $B_i = (x, y, h, w)$ into fixed-size features (as shown in Equation (8)):
$V_i = \mathrm{RoIAlign}(\Phi(I), B_i) \in \mathbb{R}^{2048}, \quad i = 1, 2, \ldots, N$ (8)
In Equation (8), $N$ denotes the number of detected objects.
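The RoI Align step of Equation (8) can be reproduced with torchvision's roi_align operator, as in the short sketch below. The feature-map stride of 16 and the 7 × 7 output size (Table 1) are assumptions about the backbone configuration, and averaging the pooled grid to obtain the 2048-dimensional $V_i$ is one common choice.

```python
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 2048, 25, 38)                    # Phi(I): backbone output for one image
boxes = torch.tensor([[30.0, 40.0, 200.0, 180.0],             # candidate boxes B_i in image coordinates
                      [120.0, 60.0, 310.0, 240.0]])
rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)  # prepend batch index: (N, 5)
pooled = roi_align(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
V = pooled.mean(dim=(2, 3))                                   # (N, 2048) region features V_i (Eq. (8))
```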
Step 2: Entity Probability Computation.
Based on the object feature $V_i$, a pre-trained language model (e.g., BERT) is used to compute the entity probability distribution. The feature $V_i$ is used as input and passed through a classifier head (i.e., a single-layer linear projection) to map it into the entity space (as shown in Equation (9)):
$Z_i = W_c V_i + b_c \in \mathbb{R}^{d_e}$ (9)
In Equation (9), $W_c \in \mathbb{R}^{d_e \times 2048}$ denotes the weight matrix, $b_c$ represents the bias vector, and $d_e$ is the entity embedding dimension (set to 256 in this work).
The softmax function is then applied to generate the entity probability distribution (as shown in Equation (10)):
$P(e \mid V_i) = \mathrm{softmax}(Z_i) \in \mathbb{R}^{|E|}$ (10)
In Equation (10), $E$ denotes the predefined entity set in the dataset (e.g., object categories in COCO).
Step 3: Entity Linking.
The entity $e_i$ with the highest probability is selected for knowledge triple retrieval. First, entity determination is performed via the argmax operation (as shown in Equation (11)). This step maps visual objects to structured knowledge entities (e.g., “apple” → “/c/en/apple” in ConceptNet), followed by querying the ConceptNet knowledge graph to retrieve $M$ related triples $\{(e_i, r_j, e_k)\}$ (e.g., (apple, UsedFor, eating)).
$e_i = \arg\max_{e \in E} P(e \mid V_i)$ (11)
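A minimal sketch of Steps 2 and 3 is given below. The two-stage projection (a 256-dimensional entity embedding followed by a scoring layer over the entity set E) and the toy ConceptNet URI vocabulary are assumptions of this sketch; the actual entity set and its ConceptNet mapping come from the dataset.

```python
import torch
import torch.nn as nn

class EntityLinker(nn.Module):
    """Maps RoI features V_i to ConceptNet entity URIs (Eqs. (9)-(11))."""

    def __init__(self, entity_vocab, feat_dim=2048, entity_dim=256):
        super().__init__()
        self.entity_vocab = entity_vocab                        # index -> ConceptNet URI
        self.proj = nn.Linear(feat_dim, entity_dim)             # Eq. (9): Z_i = W_c V_i + b_c
        self.scorer = nn.Linear(entity_dim, len(entity_vocab))  # assumed layer producing |E| scores

    def forward(self, roi_feats):
        # roi_feats: (N, 2048) region features from Faster R-CNN / RoI Align
        probs = torch.softmax(self.scorer(self.proj(roi_feats)), dim=-1)  # Eq. (10): P(e | V_i)
        entity_ids = probs.argmax(dim=-1)                                 # Eq. (11)
        return [self.entity_vocab[i.item()] for i in entity_ids]

# Toy usage with an illustrative three-entity vocabulary:
vocab = {0: "/c/en/apple", 1: "/c/en/dog", 2: "/c/en/bench"}
uris = EntityLinker(vocab)(torch.randn(4, 2048))   # four detected regions -> four ConceptNet URIs
```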

3.3.2. ConceptNet Knowledge Embedding via TransE and Multi-Hop Reasoning

The workflow for ConceptNet knowledge graph embedding leveraging Translating Embeddings (TransE) and multi-hop reasoning is as follows:
Step 1: Data Preparation and Initialization.
The triplet embedding set $S = \{(h, r, t)\}$ is extracted from ConceptNet, where $h$ denotes the head entity vector, $r$ the relation vector, and $t$ the tail entity vector. Low-dimensional vectors are initialized for entities and relations, $h, r, t \in \mathbb{R}^{d}$, and normalization is applied to ensure vector stability.
Step 2: TransE Foundational Embedding Training.
Under TransE’s [27] translational invariance principle $h + r \approx t$, the objective minimizes the distance for positive triplets $(h, r, t)$ while maximizing the distance for negative triplets. A margin-based loss function is adopted to maximize the distance difference between positive and negative samples, formally defined in Equation (12):
$\mathcal{L}_{\mathrm{TransE}} = \sum_{(h, r, t) \in T_{\mathrm{pos}}} \sum_{(h', r, t') \in T_{\mathrm{neg}}} \max\left(0, f(h, r, t) - f(h', r, t') + \gamma\right)$ (12)
In Equation (12), $T_{\mathrm{pos}}$ denotes the set of all positive triplets in the training set, $T_{\mathrm{neg}}$ represents negative triplets generated by random head/tail entity replacement, and $f(h, r, t) = \|h + r - t\|_2$ is the triplet scoring function, where lower scores indicate higher plausibility. The hyperparameter $\gamma > 0$ defines the preset margin constraint (typically $\gamma \in [0.5, 2.0]$), empirically set to 1.0 in this work to enforce a minimum score differential of $\gamma$ between positive and negative samples.
Step 3: Multi-hop Reasoning Path Modeling.
  • Path Search: Extract multi-hop paths from ConceptNet that form relational chains (as shown in Equation (13)):
    $P = (e_1, r_1, e_2, r_2, e_3, \ldots, e_k, r_k, e_{k+1})$ (13)
    In Equation (13), entities in the path are denoted by $e_i \in \mathbb{R}^{d}$ $(i = 1, 2, \ldots, k+1)$, relations by $r_j \in \mathbb{R}^{d}$ $(j = 1, 2, \ldots, k)$, and the hop depth is $k = 4$.
  • Path Vector Encoding: An LSTM sequence model [28,29] is used to encode the path information; the final hidden state $h_n$ constitutes the path’s global representation $V_P \in \mathbb{R}^{d}$, which comprehensively integrates the temporal dependencies among all entities and relations along the path (as shown in Equation (14)):
    $V_P = \mathrm{LSTMEncoder}([e_1; r_1; e_2; \ldots; e_{k+1}])$ (14)
  • Path Score Function: The path score is computed as the similarity between the path vector and the target entity vector, while enforcing alignment between the path’s endpoint and the target entity. The path scoring and loss functions are formalized in Equations (15) and (16), respectively:
    $\mathrm{score}(P) = \|V_P - t\|$ (15)
    $\mathcal{L}_{\mathrm{path}} = \left\| V_P - \left( e_1 + \sum_{i=1}^{k} r_i \right) \right\|_2^2$ (16)
Step 4: Joint Training and Optimization.
  • Joint Loss Function: The unified loss function combining TransE’s single-hop loss (direct relations) and the multi-hop path reasoning loss is formalized in Equation (17), where the hyperparameters are $\alpha_1 = 1.0$ and $\beta_1 = 0.5$:
    $\mathcal{L}_{\mathrm{kg}} = \alpha_1 \mathcal{L}_{\mathrm{TransE}} + \beta_1 \mathcal{L}_{\mathrm{path}}$ (17)
  • Parameter Update: Optimize entity vectors, relation vectors, and path encoder parameters via gradient descent.
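The sketch below illustrates Steps 2–4 in PyTorch: a TransE margin loss (Eq. (12)), an LSTM path encoder with the alignment loss of Eq. (16), and their weighted combination (Eq. (17)). Negative sampling, vector normalization, and the optimization loop are omitted; the interleaving of entities and relations into one sequence follows the [e1; r1; e2; ...] ordering of Eq. (14).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KGEmbedder(nn.Module):
    """TransE embeddings with an LSTM path encoder for multi-hop reasoning."""

    def __init__(self, n_entities, n_relations, dim=256, margin=1.0):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)
        self.path_lstm = nn.LSTM(dim, dim, batch_first=True)
        self.margin = margin

    def transe_loss(self, pos, neg):
        # pos, neg: (B, 3) index triplets (h, r, t); Eq. (12) margin ranking loss
        def distance(triples):  # f(h, r, t) = ||h + r - t||_2; lower = more plausible
            h = self.ent(triples[:, 0])
            r = self.rel(triples[:, 1])
            t = self.ent(triples[:, 2])
            return (h + r - t).norm(p=2, dim=-1)
        return F.relu(distance(pos) - distance(neg) + self.margin).mean()

    def path_loss(self, ent_ids, rel_ids):
        # ent_ids: (k+1,) entities along one path; rel_ids: (k,) relations (Eq. (13))
        e, r = self.ent(ent_ids), self.rel(rel_ids)
        seq = [None] * (len(ent_ids) + len(rel_ids))
        seq[0::2], seq[1::2] = list(e), list(r)                 # interleave [e1; r1; e2; ...]
        _, (h_n, _) = self.path_lstm(torch.stack(seq).unsqueeze(0))
        V_P = h_n[-1, 0]                                        # Eq. (14): final hidden state
        return ((V_P - (e[0] + r.sum(dim=0))) ** 2).sum()       # Eq. (16)

    def joint_loss(self, pos, neg, ent_ids, rel_ids, alpha1=1.0, beta1=0.5):
        return alpha1 * self.transe_loss(pos, neg) + beta1 * self.path_loss(ent_ids, rel_ids)  # Eq. (17)
```

In training, positive triplets come from the retrieved ConceptNet subgraph and negatives from random head/tail replacement; the learned entity vectors are later looked up for the entities linked in Section 3.3.1 to form the knowledge vector fed to the decoder.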

3.4. GPT-2-Based Text Decoder Module

This article channels the outputs of the multi-step cross-attention cross-modal fusion module and the external commonsense knowledge augmentation module into the GPT-2 text decoder via the following fusion operations:
  • Cross-Modal Feature Fusion and Joint Representation Generation
Concatenating the fused feature $V_{\mathrm{fused}}$ and the knowledge vector $V_{\mathrm{kg}}$ dynamically generates the joint representation $V_H$. A Sigmoid gating mechanism is then applied to modulate the weight of the knowledge based on the image content, thereby suppressing irrelevant commonsense knowledge (as shown in Equation (18)):
$g = \sigma(W_g [V_{\mathrm{fused}}; V_{\mathrm{kg}}] + b_g)$ (18)
In Equation (18), $g \in \mathbb{R}^{d}$, $\sigma$ denotes the Sigmoid function, $W_g$ and $b_g$ are learnable parameters, and $[\,;\,]$ represents feature concatenation.
2. Dimensionality Transformation
After applying dimensional transformation using an MLP [30], the resulting features are fed into GPT-2 (as shown in Equation (19)):
$E_{\mathrm{vis}} = \mathrm{MLP}(V_H)$ (19)
In Equation (19), $E_{\mathrm{vis}}$ denotes the converted joint representation.
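A possible realization of the gated fusion and the MLP projection (Equations (18) and (19)) is sketched below. Equation (18) only defines the gate $g$, so combining the gated knowledge vector with $V_{\mathrm{fused}}$ by concatenation before the MLP is an assumption of this sketch, as are the hidden sizes.

```python
import torch
import torch.nn as nn

class GatedKnowledgeFusion(nn.Module):
    """Sigmoid-gated fusion of V_fused and V_kg, projected into the GPT-2 space."""

    def __init__(self, dim=512, gpt2_dim=768):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)                    # W_g, b_g in Eq. (18)
        self.mlp = nn.Sequential(                              # Eq. (19) dimensionality transform
            nn.Linear(2 * dim, gpt2_dim), nn.GELU(), nn.Linear(gpt2_dim, gpt2_dim)
        )

    def forward(self, V_fused, V_kg):
        g = torch.sigmoid(self.gate(torch.cat([V_fused, V_kg], dim=-1)))  # Eq. (18): knowledge gate
        V_H = torch.cat([V_fused, g * V_kg], dim=-1)           # assumed joint representation
        return self.mlp(V_H)                                   # E_vis, fed to GPT-2 as a prefix
```

With dim = 512 and gpt2_dim = 768 (Table 1), the module maps a pair of 512-dimensional vectors to a single 768-dimensional prefix embedding E_vis.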

3.5. Joint Training and Fine-Tuning

  • CLIP Cross-Modal Image-Text Feature Alignment
The ViT and BERT encoders in CLIP are frozen, while CLIP’s contrastive learning objective, which computes the cosine similarity between image embeddings $V_I$ and text embeddings $V_T$, is retained. Cross-modal alignment capability is optimized via a cross-entropy loss, with the CLIP alignment loss function formalized in Equation (20):
$\mathcal{L}_{\mathrm{clip}} = -\frac{1}{B} \sum_{j=1}^{B} \log \frac{\exp\left(\mathrm{sim}(V_I^j, V_T^j)/\tau\right)}{\sum_{i=1}^{B} \exp\left(\mathrm{sim}(V_I^j, V_T^i)/\tau\right)}$ (20)
In Equation (20), $B$ denotes the batch size, and $\tau$ is the temperature coefficient, set to $\tau = 0.07$.
2. Autoregressive Text Generation Optimization
Taking the transformed joint representation $E_{\mathrm{vis}}$ as the initial state, the GPT-2 decoder autoregressively generates the descriptive text. The objective function $\mathcal{L}_{\mathrm{gen}}$ employs a cross-entropy loss (as shown in Equation (21)):
$\mathcal{L}_{\mathrm{gen}} = -\sum_{t=1}^{m} \log P(w_t \mid w_{<t}, E_{\mathrm{vis}})$ (21)
In Equation (21), $w_t$ denotes the reference token at sequence position $t$, and $w_{<t}$ represents the tokens generated prior to timestep $t$.
3. Total Loss Function for Joint Training
During the training phase, the ViT encoder and the BERT encoder in CLIP are kept frozen, while the cross-modal fusion layers and the projection layer of GPT-2 are primarily trained. The aggregate loss is a weighted summation of the multi-task losses defined above (as shown in Equation (22)):
$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{gen}} + \lambda_1 \mathcal{L}_{\mathrm{clip}} + \lambda_2 \mathcal{L}_{\mathrm{kg}}$ (22)
In Equation (22), the weighting hyperparameters are initialized as $\lambda_1 = 0.5$ and $\lambda_2 = 0.3$.
During model training, the loss term $\mathcal{L}_{\mathrm{clip}}$ is emphasized in the initial stages, while the weights of $\mathcal{L}_{\mathrm{gen}}$ and $\mathcal{L}_{\mathrm{kg}}$ are progressively increased in later phases to mitigate knowledge noise interference during the critical early feature alignment. Through synergistic training that integrates cross-modal alignment, dynamic knowledge fusion, and generation optimization, the model achieves deep integration of visual semantics with commonsense reasoning, significantly enhancing descriptive accuracy and logical coherence.
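The following sketch ties the pieces together for a single caption, assuming the HuggingFace transformers GPT-2 implementation: E_vis is prepended as one prefix embedding whose position is excluded from the loss, and the three losses are combined as in Equation (22). The epoch-wise re-weighting described above is not shown; the fixed λ values follow the text.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def caption_training_step(E_vis, caption, gpt2, tokenizer, L_clip, L_kg,
                          lambda1=0.5, lambda2=0.3):
    """Compute L_total = L_gen + lambda1 * L_clip + lambda2 * L_kg (Eqs. (21)-(22))."""
    ids = tokenizer(caption, return_tensors="pt").input_ids                # (1, m) caption tokens
    tok_emb = gpt2.transformer.wte(ids)                                    # (1, m, 768) token embeddings
    inputs = torch.cat([E_vis.view(1, 1, -1), tok_emb], dim=1)             # prefix E_vis + caption
    labels = torch.cat([torch.full((1, 1), -100, dtype=torch.long), ids], dim=1)  # ignore prefix position
    L_gen = gpt2(inputs_embeds=inputs, labels=labels).loss                 # Eq. (21) cross-entropy (token-averaged)
    return L_gen + lambda1 * L_clip + lambda2 * L_kg                       # Eq. (22)

# Usage sketch (the CLIP encoders are frozen elsewhere with requires_grad_(False)):
# gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
# tok = GPT2Tokenizer.from_pretrained("gpt2")
# loss = caption_training_step(E_vis, "a man riding a horse", gpt2, tok, L_clip, L_kg)
```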

4. Experiments

4.1. Experimental Datasets

To validate the model’s efficacy, extensive experiments were conducted on two widely recognized benchmark datasets for image captioning research: MSCOCO and Flickr30K. Comprehensive analysis of experimental results was performed from both quantitative and qualitative perspectives, with detailed performance comparisons against state-of-the-art models. The MSCOCO dataset contains 123,287 images curated for image captioning experiments, partitioned into 82,783 training samples and 40,504 test samples. Each image features five manually annotated descriptive captions. The Flickr30K dataset consists of 31,783 images, each paired with five distinct annotations, totaling 158,915 sentences. Since Flickr30K does not provide a predefined training/testing split, in this work the training and testing sets are partitioned at an 8:2 ratio.

4.2. Evaluation Metrics

In image captioning tasks, five established evaluation metrics [31] are conventionally employed: BLEU (Bilingual Evaluation Understudy) [32] was originally developed to assess translation quality in natural language processing, computing scores based on n-gram overlap between generated captions and ground-truth reference descriptions; ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [33]—unlike BLEU, which measures precision, ROUGE exclusively evaluates recall, representing a fundamental distinction between these metrics, with ROUGE-L as its typical variant; METEOR (Metric for Evaluation of Translation with Explicit Ordering) [34] is an improvement over BLEU, designed to address the limitation of BLEU’s reliance solely on precision; CIDEr (Consensus-based Image Description Evaluation) [35] is an important metric for evaluating the performance of image captioning tasks, assessing the alignment between the generated descriptions and the image content based on the consensus of human annotations; and SPICE (Semantic Propositional Image Caption Evaluation) [36] measures semantic accuracy via semantic scene graph matching.
These metrics employ distinct algorithms to evaluate the proximity of generated outputs to the ground-truth annotations, with each emphasizing different aspects due to their algorithmic divergence. While conventional metrics effectively assess generation quality (and remain prevalent in most studies), this work retains the five widely adopted metrics above as evaluation standards, maintaining methodological continuity with existing research and ensuring comparability across benchmarks. A composite summation metric (Sum) aggregating the five aforementioned sub-metrics is introduced for the ablation studies. Higher numerical values across all metrics indicate superior captioning performance.
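As an illustration, the sketch below computes the five metrics with the commonly used pycocoevalcap package and aggregates them into the composite Sum score. Which sub-scores enter the Sum (here, all reported columns) is an assumption, and METEOR and SPICE additionally require a Java runtime.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.spice.spice import Spice

def evaluate_with_sum(gts, res):
    """gts/res: dicts mapping image ids to lists of tokenized captions
    (multiple references vs. one generated caption per image)."""
    scores = {}
    bleu, _ = Bleu(4).compute_score(gts, res)
    scores["BLEU-1"], scores["BLEU-4"] = bleu[0], bleu[3]
    scores["METEOR"], _ = Meteor().compute_score(gts, res)
    scores["ROUGE-L"], _ = Rouge().compute_score(gts, res)
    scores["CIDEr"], _ = Cider().compute_score(gts, res)
    scores["SPICE"], _ = Spice().compute_score(gts, res)
    scores["Sum"] = sum(scores.values())          # composite score over the reported sub-metrics
    return scores
```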

4.3. Experimental Setup

4.3.1. Experimental Environment Configuration

The experimental environment configuration is specified as follows: Windows 10 64-bit operating system; NVIDIA GeForce RTX 4090 GPU (24 GB GDDR6X VRAM); Intel Core i9-13900K processor operating at 3.0 GHz base frequency (24 cores/32 threads); Python 3.9 and PyTorch 1.11.0.

4.3.2. Experimental Parameter Configuration

Building upon an extensive literature review and iterative empirical experimentation, the parameter configuration employed in this study is shown in Table 1.

4.4. Model Performance Benchmarking

4.4.1. Comparative Analysis of Loss Functions

Figure 2 illustrates the evolution of the total loss function across epochs during both training and testing phases on the MSCOCO and Flickr30k datasets. From the figure, three distinct phases can be observed: the loss value exhibits a steep decline in the initial phase; the reduction rate becomes significantly decelerated during the intermediate phase, gradually approaching a stable region; and in the final phase, the loss function enters a steady-state oscillation stage, aligning with the expected characteristics of the model’s terminal convergence state.
These observations collectively demonstrate the model’s reproducible optimization stability, with the loss function effectively exhibiting the intended guiding functionality.

4.4.2. Comparative Analysis of Performance Metrics

Comparative experimental results on the MSCOCO and Flickr30k datasets are presented as follows:
To comprehensively evaluate the proposed model’s performance, quantitative and qualitative benchmarking was conducted on the MSCOCO and Flickr30K datasets. As quantified in Table 2, the model achieves significant improvements over the top-performing baseline Pure Transformer-based (PureT) on MSCOCO, with absolute gains of +0.7% in BLEU-1, +2.9% in BLEU-4, +2.3% in METEOR, +0.5% in ROUGE-L, +3.2% in CIDEr, and +0.4% in SPICE. On Flickr30K (Table 3), the model attains scores of 76.3 (B-1), 33.8 (B-4), 30.5 (M), 52.6 (R), 78.4 (C), and 16.8 (S), demonstrating substantial improvements over the baseline Spatial Guided Attention (SGA). Collectively, these comparative results demonstrate the model’s exceptional image captioning capabilities with marked advantages across all performance metrics.
To ensure reliability, this article conducted five independent experiments. The results show the following: on MSCOCO, CIDEr fluctuated within ±0.17 and BLEU-4 within ±0.22; on Flickr30k, CIDEr varied by ±0.18 and BLEU-4 by ±0.27.

4.5. Ablation Study

Ablation studies are conducted on an image-captioning backbone with a CLIP encoder and GPT-2 decoder to validate the efficacy of the proposed multi-step cross-attention cross-modal fusion and external knowledge enhancement module, enabling fine-grained analysis of each module’s contribution. The ablation protocol establishes a CLIP-encoder and GPT-2-decoder based image captioning architecture as the baseline model, with sequential integration of the proposed enhancement modules. The results on the MSCOCO and Flickr30K datasets are presented in Table 4 and Table 5, respectively.
  • Effectiveness of the Multi-Step Cross-Attention Cross-Modal Fusion Module
On the MSCOCO dataset, after adding this module (comparing Row 1 and Row 2 of Table 4), the BLEU-1, BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE metrics increased by 1.7%, 5.3%, 5.3%, 3.1%, 4.4%, and 5.4%, respectively, with the SUM total score improving by 3.8%. On the Flickr30k dataset (comparing Row 1 and Row 2 of Table 5), the BLEU-1, BLEU-4, METEOR, ROUGE-L, and CIDEr metrics increased by 1.7%, 2.2%, 4.1%, 4.4%, and 11.7%, respectively, with the SUM total score improving by 5.2%.
2. Effectiveness of the External Commonsense Knowledge Enhancement Module
On the MSCOCO dataset, after adding this module (comparing Row 2 and Row 3 of Table 4), the BLEU-1, BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE metrics increased by 1.3%, 0.5%, 3.7%, 1.2%, 1.6%, and 3.0%, respectively, with the SUM total score improving by 1.6%. On the Flickr30k dataset (comparing Row 2 and Row 3 of Table 5), the BLEU-1, BLEU-4, METEOR, ROUGE-L, and CIDEr metrics increased by 0.7%, 2.1%, 0.7%, 1.3%, and 6.7%, respectively, with the SUM total score improving by 2.6%.
Collectively, the integration of the multi-step cross-attention cross-modal fusion module and external commonsense knowledge enhancement module yields significant improvements across primary evaluation metrics in the baseline model. The model demonstrates substantial gains in CIDEr scores in both comparative experiments and ablation studies, further validating the efficacy of these modules and their notable performance enhancement.

4.6. Qualitative Results Analysis

To more intuitively validate that the proposed model can generate finer-grained descriptive sentences, Figure 3 and Figure 4 demonstrate example descriptions generated by our method in comparison with both reference descriptions from the dataset and those generated by the PureT model.
As evidenced in Figure 3 and Figure 4, the model generates captions demonstrating superior performance in detail fidelity, descriptive accuracy, contextual richness, and linguistic fluency. Significant improvements are observed in spatial localization capabilities. In Figure 3, the model successfully captures fine-grained elements such as “motorcyclist with a stuffed toy and a water cup”, highlighting its detailed perception ability. The captions further exhibit precise spatial relationship understanding, exemplified by the accurate localization of “sidewalk next to a stone bench”. Additionally, the model demonstrates enhanced contextual integration capability, as shown in Figure 4 where it generates richer descriptions including contextual elements like “a man riding a horse is traveling along a treelined road”.
However, when dealing with images containing a large number of small objects as depicted in the figure, the richness and accuracy of the generated descriptions tend to decline. For instance, in Figure 4, the generated description is overly reliant on local features. It only pays attention to local elements such as people, computers, and tables. As a result, the generated description is lacking in both richness and accuracy and fails to comprehensively understand the overall scene.

5. Conclusions

This article addresses limitations inherent in CLIP-encoder- and GPT-2-decoder-based image-captioning architectures by proposing a novel model incorporating multi-step cross-attention cross-modal alignment and ConceptNet-based external commonsense knowledge enhancement. The solution employs iterative multi-step cross-attention mechanisms to achieve precise image-text alignment through progressive fusion cycles. Additionally, an external commonsense knowledge enhancement module based on ConceptNet is incorporated to supplement commonsense reasoning, thereby ensuring that the generated descriptions are more consistent with commonsense knowledge and contextual coherence. Experimental validation demonstrates that the model achieves consistent improvements across various metrics on the MSCOCO and Flickr30K datasets compared to state-of-the-art methods. Both the ablation studies and qualitative analysis further validate the effectiveness and performance gains contributed by these two modules.
Despite the improved image captioning richness and accuracy achieved through the multi-step cross-attention cross-modal alignment module and external knowledge augmentation module, the proposed model exhibits the following limitations. First, while the multi-step cross-attention mechanism significantly enhances semantic alignment capability, its heavy reliance on local feature interactions may compromise description precision when processing images containing numerous small objects or complex scenes. Second, although leveraging ConceptNet enhances commonsense reasoning, its coverage of domain-specific knowledge remains limited, constraining the accuracy and technical rigor of generated descriptions in specialized contexts. Furthermore, as a crowdsourced knowledge repository, ConceptNet carries inherent risks of inheriting biases, which must be acknowledged [46]. Third, current reliance on ConceptNet as the sole knowledge source precludes investigation into differential impacts of knowledge characteristics (e.g., quality) on model performance. This limitation may constrain generalizability in heterogeneous knowledge environments.
To address these limitations, future work will focus on three key optimizations. First, sparse attention mechanisms (e.g., Sparse Transformer) will be integrated [47] to explicitly select the most relevant key-query pairs, mitigating attention dispersion. This will enhance focus on critical regions, improve attention concentration and semantic alignment efficacy, thereby boosting description precision and robustness. Alternatively, implementing retrieval-augmented multi-scale semantic guidance mechanisms [48] will be considered, incorporating externally matched textual references as auxiliary signals during attention computation to further enhance semantic extraction accuracy and descriptive consistency. Second, consider introducing domain-specific knowledge graphs (e.g., Unified Medical Language System, biomedical knowledge bases) and integrating domain knowledge into vision–language models via relational transformers [49] or relation-aware semantic transfer mechanisms [50]. This methodology not only improves the technical accuracy of generated descriptions but also reduces inherent bias risks in crowdsourced data through cross-verification with multi-source knowledge. Third, we will conduct attribution analysis to isolate individual effects of knowledge properties (quality, coverage, formality) on performance, thereby developing decision-centric frameworks for optimal knowledge source selection.

Author Contributions

Writing—original draft preparation, L.W. and M.J.; Writing—review and editing, L.W., M.J., Z.L., Y.M. and J.W.; Supervision, Z.L. and M.J.; Data curation, Z.L. and M.Z.; Visualization, H.W., H.A. and J.L.; Project administration, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by Industry-University-Research Innovation Fund for Chinese Universities under grant 2021LD06009; in part by the Natural Science Foundation of Liaoning Province under grant 2022-MS-291; in part by Research Project of Liaoning Provincial Department of Education under grant LJ2020024; in part by Basic Research Project of Liaoning Provincial Department of Education under grant LJKMZ20220781; in part by General Basic Research Projects of Liaoning Provincial Education Department under grant JYTMS20231488; in part by the Applied Basic Research Program of Liaoning Provincial Department of Science and Technology (2025): Research on Intrusion Detection Technology and Intelligent Defense Strategies for Industrial Internet (Acceptance No. 1746669597594), and in part by the Basic Research Project of Liaoning Provincial Department of Education (2025): Research on Optimization Technologies for Key Metrics such as Energy Consumption and Coverage in WSNs.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bernardi, R.; Cakici, R.; Elliott, D.; Erdem, A.; Erdem, E.; Ikizler-Cinbis, N.; Keller, F.; Muscat, A.; Plank, B. Automatic Description Generation from Images: A Surveyof Models, Datasets, and Evaluation Measures. J. Artif. Intell. Res. 2016, 55, 409–442. [Google Scholar] [CrossRef]
  2. Barraco, M.; Cornia, M.; Cascianelli, S.; Baraldi, L.; Cucchiara, R. The Unreasonable Effectiveness of CLIP Features for Image Captioning: An Experimental Analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 4662–4670. [Google Scholar] [CrossRef]
  3. Luo, Z.; Hu, Z.; Xi, Y.; Zhang, R.; Ma, J. I-Tuning: Tuning Frozen Language Models with Image for Lightweight Image Captioning. In Proceedings of the IEEE/CVF Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–8 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  4. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; ML Research Press: New York, NY, USA, 2021; pp. 8748–8763. Available online: https://proceedings.mlr.press/v139/radford21a (accessed on 12 July 2024).
  5. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models Are Unsupervised Multitask Learners. OpenAI Blog. 2019. Available online: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (accessed on 1 June 2024).
  6. Ramos, R.; Martins, B.; Elliott, D.; Kementchedjhieva, Y. SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 2840–2849. [Google Scholar] [CrossRef]
  7. Mokady, R.; Hertz, A.; Bermano, A.H. ClipCap: CLIP Prefix for Image Captioning. arXiv 2021, arXiv:2111.09734. [Google Scholar] [CrossRef]
  8. Li, Y.; Pan, Y.; Yao, T.; Mei, T. Comprehending and Ordering Semantics for Image Captioning. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17990–17999. [Google Scholar] [CrossRef]
  9. Fang, Z.; Wang, J.; Hu, X.; Liang, L.; Gan, Z.; Wang, L.; Yang, Y.; Liu, Z. Injecting Semantic Concepts into End-to-End Image Captioning. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 17988–17998. [Google Scholar] [CrossRef]
  10. Zeng, Z.; Zhang, H.; Lu, R.; Wang, D.; Chen, B.; Wang, Z. ConZIC: Controllable Zero-Shot Image Captioning by Sampling-Based Polishing. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 23465–23476. [Google Scholar] [CrossRef]
  11. Nukrai, D.; Mokady, R.; Globerson, A. Text-only Training for Image Captioning Using Noise-Injected CLIP. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 4055–4063. Available online: https://aclanthology.org/2022.findings-emnlp.299/ (accessed on 16 July 2025).
  12. Wang, J.; Yan, M.; Zhang, Y.; Sang, J. From Association to Generation: Text-Only Captioning by Unsupervised Cross-Modal Mapping. arXiv 2023, arXiv:2304.13273. [Google Scholar] [CrossRef]
  13. Liu, H.; Singh, P. ConceptNet—A practical commonsense reasoning tool kit. BT Technol. J. 2004, 22, 211–226. [Google Scholar] [CrossRef]
  14. Speer, R.; Chin, J.; Havasi, C. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), San Francisco, CA, USA, 4–9 February 2017; pp. 4444–4451. [Google Scholar] [CrossRef]
  15. Huang, F.; Li, Z.; Wei, H.; Zhang, C.; Ma, H. Boost Image Captioning with Knowledge Reasoning. Mach. Learn. 2020, 109, 2313–2332. [Google Scholar] [CrossRef]
  16. Li, T.; Wang, H.; He, B.; Chen, C.W. Knowledge-Enriched Attention Network with Group-Wise Semantic for Visual Storytelling. arXiv 2022, arXiv:2203.05346. [Google Scholar] [CrossRef]
  17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar] [CrossRef]
  18. Wei, X.; Zhang, T.; Li, Y.; Wang, Y.; Wu, F. Multi-Modality Cross Attention Network for Image and Sentence Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10938–10947. [Google Scholar] [CrossRef]
  19. Lukovnikov, D.; Fischer, A. Layout-to-Image Generation with Localized Descriptions Using ControlNet with Cross-Attention Control. arXiv 2024, arXiv:2402.13404. [Google Scholar] [CrossRef]
  20. Song, Z.; Hu, Z.; Zhou, Y.; Zhao, Y.; Hong, R.; Wang, M. Embedded Heterogeneous Attention Transformer for Cross-Lingual Image Captioning. IEEE Trans. Multimed. 2024, 26, 9008–9020. [Google Scholar] [CrossRef]
  21. Cao, S.; An, G.; Cen, Y.; Yang, Z.; Lin, W. CAST: Cross-modal retrieval and visual conditioning for image captioning. Pattern Recognit. 2024, 153, 110555. [Google Scholar] [CrossRef]
  22. Cao, J.; Fang, J.; Meng, Z.; Liang, S. Knowledge graph embedding: A survey from the perspective of representation spaces. ACM Comput. Surv. 2024, 56, 1–42. [Google Scholar] [CrossRef]
  23. Gu, X.; Chen, G.; Wang, Y.; Zhang, L.; Luo, T.; Wen, L. Text with knowledge graph augmented transformer for video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 19–24 June 2023; pp. 18941–18951. [Google Scholar] [CrossRef]
  24. Han, S.; Liu, J.; Zhang, J.; Gong, P.; Zhang, X.; He, H. Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph. Complex Intell. Syst. 2023, 9, 4995–5012. [Google Scholar] [CrossRef]
  25. Chen, H.; Ding, G.; Liu, X.; Lin, Z.; Liu, J.; Han, J. IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 12673–12682. [Google Scholar] [CrossRef]
  26. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  27. Bordes, A.; Usunier, N.; Garcia-Durán, A.; Weston, J.; Yakhnenko, O. Translating Embeddings for Modeling Multi-Relational Data. In Proceedings of the 27th International Conference on Neural Information Processing Systems—Volume 2 (NIPS’13), Lake Tahoe, NV, USA, 5–10 December 2013; Curran Associates Inc.: Red Hook, NY, USA, 2013; pp. 2787–2795. Available online: https://dl.acm.org/doi/10.5555/2999792.2999923 (accessed on 16 July 2025).
  28. Van Houdt, G.; Mosquera, C.; Nápoles, G. A Review on the Long Short Term Memory Model. Artif. Intell. Rev. 2020, 53, 5929–5955. [Google Scholar] [CrossRef]
  29. Wang, J.; Ge, C.; Li, Y.; Zhao, H.; Fu, Q.; Cao, K.; Jung, H. A Two-Layer Network Intrusion Detection Method Incorporating LSTM and Stacking Ensemble Learning. Comput. Mater. Contin. 2025, 83, 5129–5153. [Google Scholar] [CrossRef]
  30. Bisong, E. The Multilayer Perceptron (MLP). In Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners; Apress: Springer Nature: Berkeley, CA, USA, 2019; pp. 401–405. [Google Scholar] [CrossRef]
  31. Berger, U.; Stanovsky, G.; Abend, O.; Frermann, L. Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy, Trends and Metrics Analysis. arXiv 2024, arXiv:2408.04909. [Google Scholar] [CrossRef]
  32. Evtikhiev, M.; Bogomolov, E.; Sokolov, Y.; Bryksin, T. Out of the BLEU: How should we assess quality of the Code Generation models? J. Syst. Softw. 2023, 203, 111741. [Google Scholar] [CrossRef]
  33. Lin, C. ROUGE: A Package for Automatic Evaluation of Summaries. In Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2004; Available online: https://api.semanticscholar.org/CorpusID:964287 (accessed on 20 July 2025).
  34. Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; Association for Computational Linguistics: Stroudsburg, PA, USA, 2005; pp. 65–72. Available online: https://aclanthology.org/W05-0909/ (accessed on 20 July 2025).
  35. Vedantam, R.; Zitnick, C.L.; Parikh, D. CIDEr: Consensus-Based Image Description Evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 4566–4575. [Google Scholar] [CrossRef]
  36. Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. SPICE: Semantic Propositional Image Caption Evaluation. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Springer Nature: Cham, Switzerland, 2016; pp. 382–398. [Google Scholar] [CrossRef]
  37. Cornia, M.; Stefanini, M.; Baraldi, L.; Cucchiara, R. Meshed-Memory Transformer for Image Captioning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10575–10584. [Google Scholar] [CrossRef]
  38. Ji, J.; Luo, Y.; Sun, X.; Chen, F.; Luo, G.; Wu, Y.; Gao, Y.; Ji, R. Improving Image Captioning by Leveraging Intra- and Inter-Layer Global Representation in Transformer Network. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; pp. 1655–1663. [Google Scholar] [CrossRef]
  39. Song, Z.; Zhou, X.; Dong, L.; Tan, J.; Guo, L. Direction Relation Transformer for Image Captioning. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 5056–5064. [Google Scholar] [CrossRef]
  40. Tewel, Y.; Shalev, Y.; Schwartz, I.; Wolf, L. Zerocap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17918–17928. [Google Scholar] [CrossRef]
  41. Wang, Y.; Xu, J.; Sun, Y. End-to-End Transformer Based Model for Image Captioning. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; pp. 2585–2594. [Google Scholar] [CrossRef]
  42. Wang, J.; Wang, W.; Wang, L.; Wang, Z.; Feng, D.D.; Tan, T. Learning Visual Relationship and Context-Aware Attention for Image Captioning. Pattern Recognit. 2020, 98, 107075. [Google Scholar] [CrossRef]
  43. Zhong, Y.; Wang, L.; Chen, J.; Yu, D.; Li, Y. Comprehensive Image Captioning via Scene Graph Decomposition. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 211–229. [Google Scholar] [CrossRef]
  44. Zhang, W.; Shi, H.; Tang, S.; Xiao, J.; Yu, Q.; Zhuang, Y. Consensus Graph Representation Learning for Better Grounded Image Captioning. arXiv 2021, arXiv:2112.00974. [Google Scholar] [CrossRef]
  45. Du, R.; Zhang, W.; Li, S.; Chen, J.; Guo, Z. Spatial Guided Image Captioning: Guiding Attention with Object’s Spatial Interaction. IET Image Process. 2024, 18, 3368–3380. [Google Scholar] [CrossRef]
  46. Denton, R.; Díaz, M.; Kivlichan, I.; Prabhakaran, V.; Rosen, R. Whose Ground Truth? Accounting for Individual and Collective Identities Underlying Dataset Annotation. arXiv 2021, arXiv:2112.04554. [Google Scholar] [CrossRef]
  47. Lei, Z.; Zhou, C.; Chen, S.; Huang, Y.; Liu, X. A Sparse Transformer-Based Approach for Image Captioning. IEEE Access 2020, 8, 213437–213446. [Google Scholar] [CrossRef]
  48. Wang, L.; Hu, Y.; Xia, Z.; Chen, E.; Jiao, M.; Zhang, M.; Wang, J. Video Description Generation Method Based on Contrastive Language–Image Pre-Training Combined Retrieval-Augmented and Multi-Scale Semantic Guidance. Electronics 2025, 14, 299. [Google Scholar] [CrossRef]
  49. Chen, T.; Li, Z.; Wei, J.; Xian, T. Mixed Knowledge Relation Transformer for Image Captioning. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022. [Google Scholar] [CrossRef]
  50. Li, Z.; Tang, H.; Peng, Z.; Qi, G.; Tang, J. Knowledge-Guided Semantic Transfer Network for Few-Shot Image Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2023, 1–15. [Google Scholar] [CrossRef]
Figure 1. The structure of the model.
Figure 2. The total loss.
Figure 3. Captioning examples of the model, ground truth (GT), and PureT.
Figure 4. Captioning examples of the model and ground truth (GT).
Table 1. Experimental parameter settings.
Parameter Name | Parameter Value
Lr (learning rate) | 1 × 10−4
Lr decay | 0.9
lr_step_size | 1
batch_size | 64
Optimizer | AdamW
Beam size | 5
CLIP ViT | ViT-B/32
CLIP Encoder_dim | 512
GPT-2_dim | 768
Epochs | 30
iterations_K | 5
Anchor_scale | [8, 16, 32]
RoI pooling size | 7 × 7
RPN_POST_NMS_TOP_N_TEST | 1000
CrossModal_Adapter_dim | 2048 → 512
Weight decay rate | 1 × 10−2
Dropout probability | 0.3
Temperature (τ) | 0.07
Random seed | 42
In the table, the optimizer, beam size, CLIP ViT variant, CLIP Encoder_dim, GPT-2_dim, iterations_K, anchor scales, RoI pooling size, and RPN_POST_NMS_TOP_N_TEST are fixed. The remaining parameters (such as the learning rate and batch size) were tuned manually. A minimal configuration sketch based on these values is given below.
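For readers who wish to reproduce the setup, the following minimal sketch shows how the values in Table 1 could be wired into a PyTorch training configuration. The class and function names (CaptioningConfig, build_optimizer) are illustrative assumptions, not the authors' released code; only the hyperparameter values are taken from the table.

```python
# Hypothetical configuration sketch reflecting Table 1; not the authors' released code.
from dataclasses import dataclass

import torch


@dataclass
class CaptioningConfig:
    # Manually tuned hyperparameters (Table 1)
    lr: float = 1e-4                   # learning rate
    lr_decay: float = 0.9              # multiplicative decay factor
    lr_step_size: int = 1              # epochs between decay steps
    batch_size: int = 64
    weight_decay: float = 1e-2
    dropout: float = 0.3
    epochs: int = 30
    # Fixed settings (Table 1)
    beam_size: int = 5
    clip_vit: str = "ViT-B/32"
    clip_encoder_dim: int = 512
    gpt2_dim: int = 768
    iterations_k: int = 5              # multi-step cross-attention rounds
    anchor_scales: tuple = (8, 16, 32)
    roi_pool_size: int = 7
    rpn_post_nms_top_n_test: int = 1000
    adapter_dims: tuple = (2048, 512)  # cross-modal adapter: input dim -> output dim
    temperature: float = 0.07
    seed: int = 42


def build_optimizer(model: torch.nn.Module, cfg: CaptioningConfig):
    """AdamW with step-wise learning-rate decay, matching Table 1."""
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=cfg.lr, weight_decay=cfg.weight_decay
    )
    scheduler = torch.optim.lr_scheduler.StepLR(
        optimizer, step_size=cfg.lr_step_size, gamma=cfg.lr_decay
    )
    return optimizer, scheduler


if __name__ == "__main__":
    cfg = CaptioningConfig()
    torch.manual_seed(cfg.seed)  # reproducibility (random seed = 42)
    # A toy module stands in for the full captioning model in this sketch.
    toy_model = torch.nn.Linear(cfg.clip_encoder_dim, cfg.gpt2_dim)
    optimizer, scheduler = build_optimizer(toy_model, cfg)
    print(optimizer.defaults["lr"], scheduler.gamma)
```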
Table 2. Experimental comparison results of the MSCOCO dataset.
Method Category | Method | B-1 | B-4 | M | R | C | S
Semantic Perception Enhancement models | M2Transformer [37] | 80.8 | 39.1 | 29.2 | 58.6 | 131.2 | 22.6
Semantic Perception Enhancement models | GET [38] | 81.5 | 39.5 | 29.3 | 58.9 | 131.6 | 22.8
Spatial Relation Enhancement models | DRT [39] | 81.7 | 40.4 | 29.5 | 59.3 | 133.2 | 23.3
Zero-Shot Learning models | Zerocap [40] | 81.9 | 40.7 | 29.8 | 58.7 | 136.4 | 23.5
Grid Feature Enhancement models | PureT [41] | 82.1 | 40.9 | 30.2 | 60.1 | 138.2 | 24.2
The method of this article | Ours | * 82.7 | * 42.1 | * 30.9 | * 60.4 | * 142.6 | * 24.3
In the table, B-1, B-4, M, R, C, and S denote BLEU-1, BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE, respectively. Bold indicates the best result for each metric, and * marks the results of the proposed model; entries that are both bold and asterisked are cases where the proposed model achieves the best performance. Improvements are statistically significant (CIDEr: p = 0.03, SPICE: p = 0.028); a sketch of one way such significance values can be estimated follows.
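The paper reports p-values for CIDEr and SPICE without spelling out the test in this section. The sketch below illustrates one common procedure, a paired bootstrap over per-image CIDEr scores computed with the pycocoevalcap toolkit; the function paired_bootstrap_p, the toy data, and the choice of test are assumptions for illustration and do not describe the authors' exact protocol.

```python
# Illustrative paired-bootstrap significance test on per-image CIDEr scores.
# Assumes pycocoevalcap is installed; gts/res_* map image ids to caption lists.
import numpy as np
from pycocoevalcap.cider.cider import Cider


def paired_bootstrap_p(scores_ours, scores_base, n_resamples=10000, seed=42):
    """Two-sided p-value for the mean per-image score difference (ours - baseline)."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_ours) - np.asarray(scores_base)
    n = len(diffs)
    # Resample image indices with replacement and track the resampled mean difference.
    boot_means = np.array(
        [diffs[rng.integers(0, n, n)].mean() for _ in range(n_resamples)]
    )
    p = 2 * min((boot_means <= 0).mean(), (boot_means >= 0).mean())
    return diffs.mean(), p


if __name__ == "__main__":
    # Toy ground truths and system outputs keyed by image id (placeholders only).
    gts = {"1": ["a dog runs on the grass"], "2": ["a man rides a bike"]}
    res_ours = {"1": ["a dog running on grass"], "2": ["a man riding a bike"]}
    res_base = {"1": ["a dog"], "2": ["a man on a road"]}

    cider = Cider()
    _, per_img_ours = cider.compute_score(gts, res_ours)  # per-image CIDEr scores
    _, per_img_base = cider.compute_score(gts, res_base)
    diff, p = paired_bootstrap_p(per_img_ours, per_img_base, n_resamples=1000)
    print(f"mean CIDEr gain = {diff:.3f}, bootstrap p = {p:.3f}")
```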
Table 3. Experimental comparison results of the Flickr30k dataset.
Method Category | Method | B-1 | B-4 | M | R | C | S
Semantic Perception Enhancement models | A_R_L [42] | 69.8 | 27.7 | 21.5 | 48.5 | 57.4 | -
Semantic Perception Enhancement models | Sub-GC [43] | 70.7 | 28.5 | 22.3 | - | 61.9 | 16.4
Structured Semantic Enhancement models | CGRL [44] | 72.5 | 27.8 | 22.4 | - | 65.2 | 16.8
Spatial Relation Enhancement models | SGA [45] | 73.2 | 30.4 | 22.6 | 50.8 | 68.4 | 16.4
The method of this article | Ours | * 76.3 | * 33.8 | * 30.5 | * 52.6 | * 78.4 | * 16.8
In the table, metric abbreviations follow Table 2, and '-' denotes a result not reported by the original work. Bold indicates the best result for each metric, and * marks the results of the proposed model; entries that are both bold and asterisked are cases where the proposed model achieves the best performance. Improvements are statistically significant (CIDEr: p = 0.03, SPICE: p = 0.028).
Table 4. Ablation experiment results for the MSCOCO dataset.
Baseline | Multi-Step Cross-Attention | External Knowledge Augmentation | B-1 | B-4 | M | R | C | S | SUM
✓ |   |   | 80.2 | 39.8 | 28.3 | 57.9 | 134.5 | 22.4 | 363.1
✓ | ✓ |   | 81.6 | 41.9 | 29.8 | 59.7 | 140.4 | 23.6 | 377.0
✓ | ✓ | ✓ | * 82.7 | * 42.1 | * 30.9 | * 60.4 | * 142.6 | * 24.3 | * 383.0
In the table, SUM is the sum of the six metric scores in each row. Bold indicates the best result for each metric, and * marks the results of the full model of this article; entries that are both bold and asterisked are cases where the full model achieves the best performance. Improvements are statistically significant (CIDEr: p = 0.03, SPICE: p = 0.028).
Table 5. Ablation experiment results for the Flickr30k dataset.
Baseline | Multi-Step Cross-Attention | External Knowledge Augmentation | B-1 | B-4 | M | R | C | SUM
✓ |   |   | 74.5 | 32.4 | 29.1 | 49.7 | 65.8 | 251.5
✓ | ✓ |   | 75.8 | 33.1 | 30.3 | 51.9 | 73.5 | 264.6
✓ | ✓ | ✓ | * 76.3 | * 33.8 | * 30.5 | * 52.6 | * 78.4 | * 271.6
In the table, SUM is the sum of the five metric scores in each row. Bold indicates the best result for each metric, and * marks the results of the full model of this article; entries that are both bold and asterisked are cases where the full model achieves the best performance. Improvements are statistically significant (CIDEr: p = 0.03). A sketch of how the ablation toggles and the SUM column can be expressed in code is given below.
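To make the ablation setup in Tables 4 and 5 concrete, the following sketch shows one way the two modules could be toggled and the SUM column computed. The flag names (use_multistep_cross_attention, use_external_knowledge) and the helper metric_sum are illustrative assumptions and are not taken from the authors' implementation; only the full-model MSCOCO scores are copied from Table 4.

```python
# Illustrative ablation toggles and SUM computation for Tables 4 and 5; names are hypothetical.
from dataclasses import dataclass


@dataclass
class AblationConfig:
    use_multistep_cross_attention: bool = True  # multi-step cross-modal alignment module
    use_external_knowledge: bool = True         # ConceptNet-based knowledge augmentation


def metric_sum(scores: dict) -> float:
    """SUM column: the plain sum of the reported metric scores for one ablation row."""
    return round(sum(scores.values()), 1)


if __name__ == "__main__":
    variants = {
        "baseline": AblationConfig(False, False),
        "+ multi-step cross-attention": AblationConfig(True, False),
        "full model": AblationConfig(True, True),
    }
    # Full-model MSCOCO scores from Table 4 (B-1, B-4, METEOR, ROUGE-L, CIDEr, SPICE).
    full_mscoco = {"B-1": 82.7, "B-4": 42.1, "M": 30.9, "R": 60.4, "C": 142.6, "S": 24.3}
    print(metric_sum(full_mscoco))  # 383.0, matching the SUM column
    for name, cfg in variants.items():
        print(name, cfg)
```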
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
