Article

Image Captioning Method Based on CLIP-Combined Local Feature Enhancement and Multi-Scale Semantic Guidance

1 School of Computer Science and Technology, Shenyang University of Chemical Technology, Shenyang 110142, China
2 Liaoning Key Laboratory of Intelligent Technology for Chemical Process Industry, Shenyang 110142, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(14), 2809; https://doi.org/10.3390/electronics14142809
Submission received: 8 June 2025 / Revised: 4 July 2025 / Accepted: 10 July 2025 / Published: 12 July 2025

Abstract

To address two challenges in image captioning, namely modeling the relationships among multiple local region objects while enhancing their features, and mapping global image semantics to global text semantics and local region image semantics to local text semantics, a novel CLIP-based image captioning method integrating local feature enhancement and multi-scale semantic guidance is proposed. The model employs ViT as the global visual encoder, Faster R-CNN as the local region visual encoder, BERT as the text encoder, and GPT-2 as the text decoder. By constructing a KNN graph over local image features, the model captures the relationships between local region objects and then enhances the local region features with a graph attention network. In addition, a multi-scale semantic guidance method computes global and local semantic weights, improving the accuracy of the scene descriptions and attribute details generated by the GPT-2 decoder. Evaluated on the MSCOCO and Flickr30k datasets, the model achieves significant improvements in the core CIDEr metric over established strong baselines: 4.7% higher than OFA on MSCOCO and 16.6% higher than Unified VLP on Flickr30k. Ablation studies and qualitative analysis validate the effectiveness of each proposed module.

1. Introduction

Image captioning is defined as a machine’s capability to accurately recognize objects, scenes, and their interrelationships within an image, while automatically generating natural language descriptions that are concise, fluent, accurate, and rich in information [1]. As an important research topic in cross-modal understanding and generation, it finds wide applications in fields such as intelligent healthcare, recommendation systems, autonomous driving, and AI-driven marketing [2]. Recently, with the emergence of vision-language pre-trained models, generative pre-trained models, and advances in multimodal learning techniques, research on image captioning methods has attracted significant attention globally.
The image captioning approach using Contrastive Language-Image Pre-training (CLIP) as the encoder and Generative Pre-trained Transformer 2 (GPT-2) as the decoder offers advantages such as low training costs and high-quality caption generation, making it one of the mainstream architectures for current image captioning models [3].
CLIP [4], proposed by OpenAI, undergoes self-supervised pre-training on 400 million image-text pairs. It embeds both vision and language into a unified semantic space, enabling zero-shot transfer to cross-modal understanding tasks. CLIP comprises a visual encoder and a text encoder, both leveraging state-of-the-art feature extraction networks. It can flexibly combine arbitrary images and texts as training samples, where aligned image-text pairs serve as positive samples and mismatched pairs as negative samples. This contrastive learning framework for joint vision-language training endows CLIP with exceptional image-text comprehension and transfer learning capabilities.
GPT-2 [5] is an open-source autoregressive language model developed by OpenAI in 2019. Utilizing masked self-attention mechanisms, GPT-2 exhibits strengths in zero-shot/few-shot inference, coherent long-text generation, and multi-style adaptability. It has achieved remarkable performance across numerous NLP tasks and excels in applications such as text generation, dialog systems, summarization, and translation.
In CLIP-based image captioning, the visual encoder typically employs pre-trained models such as ViT (Vision Transformer) or ResNet. While ViT demonstrates significant advantages in global semantic modeling and zero-shot transfer due to its global attention mechanism, it suffers from inherent limitations in local feature perception [6]: absence of convolutional neural networks’ local inductive bias, insufficient representation of fine-grained features (e.g., small objects, complex textures), and vulnerability to interference (e.g., occlusion, illumination variations) causing critical information omission.
To address ViT’s limitations, researchers propose combining Faster R-CNN’s [7] local region feature extraction capability with ViT’s global semantic modeling strength. Faster R-CNN excels at spatially hierarchical feature extraction through its convolutional architecture, effectively compensating for ViT’s detail representation shortcomings. However, existing solutions primarily rely on feature concatenation or attention weighting, resulting in two persistent challenges for generating high-quality captions:
  • Local Relation Modeling and Feature Enhancement: How to model relationships among multiple local-region objects in images, differentiate their relative importance, and subsequently enhance fine-grained local features while computing object-specific weights to improve semantic accuracy in generated text.
  • Multi-scale Semantic Alignment: How to map the correspondence between global image semantics and global text semantics, as well as local region image semantics and local text semantics, and thereby guide GPT-2 to generate more accurate descriptions.
To resolve these issues, this paper proposes an image captioning method integrating graph attention network-based local feature enhancement with multi-scale semantic guidance. Building on CLIP transferred to image captioning, the model adopts ViT + Faster R-CNN as the visual encoder and GPT-2 as the decoder. Graph attention network (GAT)-based local feature enhancement is employed because the graph structure can effectively model the complex spatial and semantic relationships between local objects in an image, overcoming the limitations of simple concatenation or weighting, while the graph attention mechanism dynamically learns importance weights for different local objects and their relationships, enabling adaptive enhancement of critical local regions. To address global-local semantic alignment for generation guidance, a multi-scale semantic guidance mechanism is proposed: it aligns ViT's global image semantic feature with GPT-2's global textual feature, and aligns the graph-enhanced local object features with specific attributes and object descriptions in the text, thereby guiding GPT-2 to generate descriptions that are more accurate and richer in both overall scene composition and fine-grained detail. Experiments on the MSCOCO and Flickr30k datasets validate the model's effectiveness.
The key innovations include:
  • A novel architecture combining CLIP encoder, Faster R-CNN local visual encoder, and GPT-2 text decoder.
  • A K-Nearest Neighbors (KNN) graph constructed from local image features to model inter-object relationships, enhanced via GAT to refine local feature representation—providing a new approach for image detail characterization.
  • A multi-scale semantic guidance mechanism that computes global and local semantic weights to enhance the richness and accuracy of complex scene descriptions and attribute details generated by the GPT-2 decoder.
The remainder of this paper is organized as follows: Section 2 reviews related works; Section 3 elaborates on the model design; Section 4 presents the experimental setup and result analysis; and finally, Section 5 concludes the research findings and current limitations, while outlining future research directions.

2. Related Works

With the advancement of deep learning, pre-trained models have emerged as a cornerstone technology in image captioning research. Leveraging powerful semantic priors captured through self-supervised learning, they have significantly propelled progress in this field. The dominant approaches predominantly adhere to an encoder–decoder paradigm, where a visual encoder extracts image features and a language decoder generates descriptive text. However, existing methods still exhibit significant limitations in multi-scale semantic fusion, particularly concerning the modeling of local relationships and achieving fine-grained alignment. These shortcomings constrain the accuracy and richness of the generated captions. This paper focuses on addressing these core challenges. Relevant prior work is reviewed as follows.

2.1. Local Feature Extraction and Relationship Modeling in Image Captioning

Local relationship modeling is crucial for enhancing the detail of captions. Zhong et al. [8] proposed RegionCLIP, which achieves region-text alignment and lays the foundation for fusing global and local features. However, it lacks modeling of relationships between regions, and simple feature concatenation is susceptible to interference. Li et al. [9] used graph neural networks to model action-object-scene triplets, exploring the potential of structured semantic mapping, but they overly rely on predefined labels and cannot capture scene details at a fine-grained level. Qian et al. [10] proposed the TLGSA model, which uses graph convolutions to construct local semantic graphs to improve semantic focus, but its relationship modeling depends on label co-occurrence and lacks geometric interaction and fine-grained alignment. Wang et al. [11] proposed the LCM-Captioner model, which reduces the number of parameters and improves performance through collaborative attention. However, its capture of complex spatial and semantic relationships between regions mainly relies on implicit associations established by attention mechanisms, lacking more direct and structured explicit modeling. BLIP-2 [12] achieves efficient cross-modal alignment through Q-Former, but it has significant limitations in visual relation modeling: its 32 fixed-length generic query vectors focus on global vision-language associations, making it difficult to explicitly model fine-grained object-level spatial and semantic interactions within images (such as positional relationships and attribute associations). This constraint limits the accuracy of complex scene description generation.
Although recent studies have made progress in directions such as feature extraction efficiency (e.g., Ramos et al. combining ConvNeXt and LSTM [13]), data synthesis (e.g., Ma et al. using multi-context synthetic data [14]), detailed description architectures (e.g., Yang et al.’s SAMT-generator multi-stage Transformer [15]), and style adaptability (e.g., Mandava and Vinta fusing NST with ViT/GPT-2 [16]), their core mechanisms have not systematically solved the problem of explicit modeling of local object relationships. For example, while SAMT-generator enhances detailed descriptions, it does not introduce relationship reasoning between objects; style transfer models optimize generation diversity but do not improve semantic consistency. In essence, existing methods still have significantly limited capabilities to model fine-grained relationships such as spatial interactions and attribute associations between objects in complex scenes.
The key differentiator of this work: unlike existing methods that rely on implicit attention mechanisms or constrained graph structures, this paper proposes an explicit and dynamic local object relationship modeling mechanism. It directly captures and reasons about complex spatial relationships (e.g., position, relative orientation) and semantic associations (e.g., attributes, action interactions) between objects, addressing the limitations of current approaches (e.g., SAMT-generator) in structured relation representation.

2.2. Challenges of Multi-Scale Alignment

CLIP-based generation architectures face the core deficiency of insufficient local alignment. Mokady et al. [3] proposed ClipCap, which was the first to combine CLIP with GPT-2 for caption generation, innovatively fusing CLIP and GPT-2. However, it only aligns global image-text semantics while ignoring region-phrase-level associations; methods such as Ramesh et al.’s [17] DALL-E and Barraco et al.’s [18] CLIP-Captioner similarly rely on CLIP global features to complete image descriptions, lacking local detail guidance and leading to detail omissions in generated content.
Subsequent studies attempted to optimize local semantic guidance. Su et al. [19] proposed a plug-and-play framework that uses CLIP matching scores to guide GPT-2 generation, improving cross-modal efficiency. Fei et al. [20] proposed ViECap, which concatenates entity prompts with soft prompts generated by CLIP image embeddings and inputs them into GPT-2, significantly improving zero-shot cross-domain captioning capabilities. However, these methods still share common deficiencies: they rely primarily on global features or retrieval results, and even when entity prompts are introduced, they neither explicitly model complex relationships between entities (such as spatial interactions and attribute associations) nor provide a systematic hierarchical (global scene vs. local object/attribute) semantic guidance and alignment mechanism. This lack of explicit modeling of inter-object relationships and hierarchical alignment is a key factor leading to significant shortcomings in generated captions in terms of keyword accuracy (e.g., color, position) and relationship expression (e.g., “a cat sitting on a table”).
The key differentiator of this work: in contrast to ClipCap’s purely global alignment and ViECap’s prompt-concatenation strategies, this paper designs a hierarchical multi-scale semantic alignment mechanism. This mechanism not only explicitly models local objects and their relationships (as described in Section 2.1) but also synergistically aligns these structured local features with global scene semantics. This systematic approach guides the language model to generate more accurate keywords (object attributes, locations) and relational expressions, resolving the systemic shortcomings of existing methods in fine-grained alignment and relational consistency.

2.3. Necessity of Feature Fusion

Image captioning models need to have strong robustness in complex scenes such as occlusions and viewpoint changes. Vishniakov et al. [21] systematically compared the performance of convolutional networks and ViTs under different training paradigms. The results showed that convolutional networks have significantly better robustness in occluded scenes than ViTs. This conclusion provides important inspiration for introducing Faster R-CNN into image captioning models to enhance their adaptability and description reliability in complex/occluded scenes.
Existing research has made progress in the application of pre-trained models, feature extraction efficiency, and architectural design, but there are systematic deficiencies in explicitly and dynamically modeling complex relationships between local objects (including spatial, semantic, and interactive relationships) and achieving effective hierarchical multi-scale semantic alignment. Meanwhile, the significant differences in the robustness of visual encoders highlight the value of fusing complementary visual features (such as CLIP’s strong semantic priors and Faster R-CNN’s strong robustness in occluded scenes) to improve model performance in complex scenes.
The key differentiator of this work: addressing the deficiencies in robustness of existing methods under complex scenes and the limitation of single encoders in concurrently ensuring robust global semantics and local details, this paper innovatively integrates a dual-encoder architecture combining CLIP and Faster R-CNN. This goes beyond simple feature concatenation; instead, through a deliberately designed fusion strategy, the model leverages both CLIP’s powerful semantic priors and global comprehension capabilities and Faster R-CNN’s high-robustness detection of local objects under complex/occluded conditions, thereby significantly enhancing overall performance and descriptive reliability in diverse and challenging scenarios.

3. Model Design

3.1. Model Structure

This model adopts a progressive four-stage architecture (as illustrated in Figure 1). Its core advantage lies in efficiently fusing multi-scale visual information with text semantics, while achieving powerful cross-modal understanding and generation capabilities. As shown in Figure 1, the core components include the ViT Global Visual Encoder module (top-left), the Faster R-CNN Local Region Encoder module (left-center), the BERT Text Encoder module (bottom-left), the Graph Attention Network Local Feature Enhancement (GAT-LFE) module (center), the Multi-scale Semantic Guidance (MSG) module (bottom-right), and the GPT-2 Decoder module (right). The core workflow is as follows:
Step 1: Obtain visual-text features via a dual-path visual encoder (ViT extracts global image features, Faster R-CNN extracts local image features) and the BERT text encoder.
Step 2: Construct a KNN graph for the local region features and use the Graph Attention Network to enhance the features of local region objects in the image. Establish multi-scale alignment through a dual-path semantic guidance mechanism: at the global level, use CLIP cross-modal attention to fuse scene features with text semantics; at the local level, use a Sigmoid gating module to filter object attribute keywords.
Step 3: Project the enhanced multi-scale features into the GPT-2 latent space via linear projection to output the generated text description.
Based on pre-trained models CLIP, BERT, and GPT-2, this system constructs a progressive “Global Perception-Local Reasoning-Semantic Calibration-Conditional Generation” architecture through dynamic graph construction, hierarchical attention (GAT for region interaction and CLIP for cross-modal alignment), and the synergistic operation of the dual-path gating module. The framework fuses global and local visual information and precisely aligns multi-scale semantics. It enhances the accuracy, detail richness, and robustness of complex image descriptions at a low computational cost, addressing key challenges such as multi-scale alignment, local relationship modeling, and computational efficiency.

3.2. Visual Encoder Module and Text Encoder Module

3.2.1. Visual Encoder Module Based on ViT and Faster R-CNN

This module adopts a dual-stream visual feature extraction architecture. It uses the ViT model from CLIP to extract the global semantic features of the image and Faster R-CNN to extract features of local regions within the image. The specific feature extraction process is as follows:
1. Global Image Feature Extraction Based on ViT
Given an input image $X \in \mathbb{R}^{H \times W \times C}$, the global image features are extracted using the CLIP visual encoder ViT, as shown in Equation (1):
$f_{global\_image} = \mathrm{CLIP}_{ViT}(X) \in \mathbb{R}^{d}$  (1)
2. Local Region Feature Extraction and Node Definition Based on Faster R-CNN
Faster R-CNN generates candidate regions (proposals) through a Region Proposal Network (RPN) and extracts local features for each region via RoI Pooling. Each candidate region yields a feature vector (e.g., a d-dimensional vector output by RoI Pooling). To avoid redundant nodes, only high-confidence regions (e.g., top-K classification scores or filtered by an IoU threshold) are retained as nodes in the graph. The set of local region feature vectors is represented as in Equation (2):
$\{f_{local\_image}^{i}\}_{i=1}^{N} = \mathrm{Faster\,R\text{-}CNN}(X) \in \mathbb{R}^{N \times d}$  (2)
In the above two equations, d is the dimension of the feature vector, and N is the number of regions.
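For illustration, the dual-path visual encoding of Equations (1) and (2) can be sketched as follows. This is not the authors' code: it assumes the Hugging Face openai/clip-vit-base-patch32 checkpoint and torchvision's pre-trained Faster R-CNN, and simply hooks the detector's box head to obtain RoI-pooled proposal features, approximating the top-K confidence filtering described above.

```python
# Sketch of the dual-path visual encoder (Eqs. 1-2); checkpoints and top-K handling are assumptions.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection
from torchvision.models.detection import fasterrcnn_resnet50_fpn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Global path: CLIP's ViT yields one d-dimensional image embedding (Eq. 1).
clip_vit = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Local path: Faster R-CNN; a forward hook captures the RoI-pooled box features.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").to(device).eval()
captured = {}
detector.roi_heads.box_head.register_forward_hook(
    lambda mod, inp, out: captured.update(feats=out))  # (num_proposals, 1024)

@torch.no_grad()
def encode_image(image: Image.Image, top_k: int = 36):
    pixels = processor(images=image, return_tensors="pt").to(device)
    f_global = clip_vit(**pixels).image_embeds.squeeze(0)           # (d,)
    img = torch.from_numpy(np.array(image)).permute(2, 0, 1).float() / 255.0
    detections = detector([img.to(device)])[0]                      # boxes / scores / labels
    f_local = captured["feats"][:top_k]                             # (N, 1024) region features
    return f_global, f_local, detections
```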
To model relationships between different local region objects within the same image and enhance their semantic representation (this part prepares the input for the subsequent GAT module in Section 3.3), a graph structure needs to be constructed based on the extracted local region features.
3. Graph Structure Modeling for Local Region Features Based on a KNN Graph
In the model, a dynamic undirected KNN graph $G = \langle V, E \rangle$ is constructed from the local object features extracted by Faster R-CNN. The nodes $V$ represent the normalized regional object feature vectors $f_{norm\_fused\_image}$, and the edges $E$ connect each node only to its top-K nearest neighbors in feature space, as measured by the distance defined below.
The process of constructing a KNN graph based on Faster R-CNN is as follows:
Step 1: Feature Standardization. Perform L2 normalization on the features (as shown in Equation (3)) to eliminate the impact of scale differences.
$f_{norm\_fused\_image}^{i} = \dfrac{f_{local\_image}^{i}}{\|f_{local\_image}^{i}\|_{2}}$  (3)
Step 2: Similarity Metric and Distance Matrix Calculation. Compute the cosine distance between all feature pairs, producing an $N \times N$ distance matrix $D$ (as shown in Equation (4)).
$d(i,j) = 1 - \dfrac{f_{local\_image}^{i} \cdot f_{local\_image}^{j}}{\|f_{local\_image}^{i}\|\,\|f_{local\_image}^{j}\|}$  (4)
Step 3: K-Nearest Neighbor Screening. First, set K = 5 and, for each node, sort the distances in ascending order to select the K nearest neighbor nodes $V_{j} \in \{V_{1}, V_{2}, \ldots, V_{K}\}$, $j = 1, 2, \ldots, N$; then convert the distances into similarity weights via Equation (5), where $\sigma$ is the Gaussian kernel parameter.
$e_{ij} = \exp\!\left(-\dfrac{d(i,j)^{2}}{2\sigma^{2}}\right)$  (5)
Step 4: Constructing the KNN Graph Structure. Finally, generate a sparse weighted adjacency matrix $A \in \mathbb{R}^{N \times N}$ to complete the construction of the KNN graph.
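A minimal PyTorch sketch of Steps 1 to 4 is given below (not the authors' implementation; K and σ take the values quoted above, and the region count and feature dimension are illustrative).

```python
# Steps 1-4: build the weighted KNN graph over the N region features.
import torch
import torch.nn.functional as F

def build_knn_graph(local_feats: torch.Tensor, k: int = 5, sigma: float = 1.0):
    """local_feats: (N, d) region features from Faster R-CNN."""
    f = F.normalize(local_feats, p=2, dim=1)            # Step 1: L2 normalisation (Eq. 3)
    dist = 1.0 - f @ f.t()                              # Step 2: cosine-distance matrix D (Eq. 4)
    dist.fill_diagonal_(float("inf"))                   # exclude self-matches
    knn_dist, knn_idx = dist.topk(k, dim=1, largest=False)    # Step 3: K nearest neighbours
    weights = torch.exp(-knn_dist.pow(2) / (2 * sigma ** 2))  # Gaussian-kernel weights (Eq. 5)
    n = local_feats.size(0)                                    # Step 4: sparse adjacency A
    rows = torch.arange(n).repeat_interleave(k)
    A = torch.sparse_coo_tensor(
        torch.stack([rows, knn_idx.reshape(-1)]), weights.reshape(-1), (n, n))
    return f, A

f_norm, adjacency = build_knn_graph(torch.randn(36, 1024))     # e.g. N = 36 regions, d = 1024
```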
The core functions of this graph are:
  • Transforming unordered region features into structured representations, explicitly capturing proximity relationships between objects;
  • Providing the necessary computational structure for GAT, enabling information propagation and aggregation over this adjacency;
  • Efficiently guiding local semantic reasoning (e.g., enhancing action understanding through neighbor information), providing key support for subsequently generating detailed descriptions with accurate object relationships (e.g., position, interaction), while the sparse connections ensure computational efficiency.

3.2.2. Text Encoder Module Based on BERT

The text encoder module in this model employs the BERT model. The feature vector representation of the descriptive text output by the BERT model is shown in Equation (6):
$f_{text} = \mathrm{Clip}_{BERT}(text)$  (6)
where $f_{text} \in \mathbb{R}^{L \times d}$ is the normalized text feature matrix and $L$ is the number of context vectors, i.e., the number of descriptive texts.
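As an illustration only, text encoding might look like the sketch below; the bert-base-uncased checkpoint and the use of the [CLS] vector as the caption representation are assumptions, not details given in the paper.

```python
# Sketch of the text encoder of Eq. (6) using Hugging Face BERT.
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def encode_text(captions):
    batch = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
    hidden = bert(**batch).last_hidden_state       # (L, seq_len, 768)
    return F.normalize(hidden[:, 0], dim=-1)       # [CLS] vectors -> (L, d), L2-normalised

f_text = encode_text(["a man riding a wave on a surfboard",
                      "a busy city street with traffic lights"])
```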

3.3. Image Local Region Object Feature Enhancement Module Based on Graph Attention Network

3.3.1. Graph Attention Network

The Graph Attention Network (GAT) [22] advances spatial-domain graph neural network modeling by introducing an attention mechanism. Compared with traditional Graph Convolutional Networks (GCNs) that rely on spectral-domain decomposition, GAT discards complex Laplacian matrix operations and instead dynamically computes the relevance between nodes to update features, offering advantages such as weighted neighbor aggregation, noise robustness, and interpretability. The main roles of GAT in this paper are summarized as follows:
  • Local Object Relationship Modeling: Based on the KNN graph built from Faster R-CNN’s local object features, GAT dynamically computes attention weights between nodes (region object feature vectors), capturing co-occurrence and spatial relationships between objects (e.g., “person-riding horse-grassland”), enhancing scene understanding capabilities.
  • Fine-grained Local Feature Enhancement: Through multi-layer message passing, GAT performs high-order relational reasoning on local object features, effectively compensating for CLIP’s global features’ lack of detail in descriptions.
  • Dynamic Attention Allocation for Local Objects: Using an adaptive attention mechanism, GAT assigns differentiated weights to different local objects (e.g., highlighting the main object “person”, weakening the background “cloud”), improving the semantic accuracy of the generated text.

3.3.2. Node Attention Coefficient Calculation

According to the attention weight calculation formulas for nodes in GAT, the attention score $a_{ij}$ from node $i$ to neighbor node $j$ and the normalized attention score $\hat{a}_{ij}$ are calculated using Equations (7) and (8):
$a_{ij} = \mathrm{LeakyReLU}\!\left(a^{T}[W V_{i}; W V_{j}]\right)$  (7)
$\hat{a}_{ij} = \mathrm{Softmax}(a_{ij}) = \dfrac{\exp(a_{ij})}{\sum_{k \in N(i)} \exp(a_{ik})}$  (8)
where LeakyReLU [23] is the activation function used to mitigate the vanishing gradient problem, $W \in \mathbb{R}^{d \times d}$ is a learnable weight matrix, $a_{ij}$ is the attention score of node $i$ with respect to its neighbor $j$, $\hat{a}_{ij}$ is obtained by normalizing the attention scores with Softmax, $a \in \mathbb{R}^{2d}$ is the attention vector, and $[;]$ denotes vector concatenation.

3.3.3. Node Feature Update

According to the GAT node update formula, the features of the region object at node $i$ are updated by aggregating the weighted features of its neighbor nodes, as shown in Equation (9):
$V_{i}' = \mathrm{ELU}\!\left(\sum_{j \in N(i)} \hat{a}_{ij} W V_{j}\right)$  (9)
where the activation function is ELU (Exponential Linear Unit) [24]. After multi-layer stacking, the enhanced node features $\{V_{i}'\}_{i=1}^{N}$ are output, dynamically capturing semantic relationships between objects.
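A single-head, dense-adjacency GAT layer following Equations (7) to (9) could be sketched as follows; this is a simplified illustration, not the authors' implementation, and self-loops are added so that the neighborhood softmax is always well defined.

```python
# One GAT layer: attention scores (Eq. 7), neighbourhood softmax (Eq. 8), update (Eq. 9).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)               # W in R^{d x d}
        self.a = nn.Parameter(torch.randn(2 * d) * 0.01)   # attention vector a in R^{2d}

    def forward(self, V: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        """V: (N, d) node features; adj: (N, N) dense KNN adjacency (non-zero = edge)."""
        n = V.size(0)
        adj = adj + torch.eye(n, device=adj.device)        # self-loops avoid empty neighbourhoods
        Wv = self.W(V)                                     # (N, d)
        pairs = torch.cat([Wv.unsqueeze(1).expand(n, n, -1),
                           Wv.unsqueeze(0).expand(n, n, -1)], dim=-1)   # [W V_i ; W V_j]
        scores = F.leaky_relu(pairs @ self.a)              # Eq. (7), shape (N, N)
        scores = scores.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(scores, dim=1)               # Eq. (8): softmax over neighbours
        return F.elu(alpha @ Wv)                           # Eq. (9): weighted aggregation + ELU

gat = GATLayer(d=1024)
V_enhanced = gat(torch.randn(36, 1024), (torch.rand(36, 36) > 0.8).float())
```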

3.3.4. Image Global-Local Region Multi-Scale Feature Fusion

The global image features output by CLIP's ViT are concatenated with the mean-pooled, GAT-enhanced local region features and projected into a joint space. This achieves complementary multi-scale visual features, providing multi-scale visual conditions for the subsequent decoder (e.g., GPT-2). The fused vector is given by Equation (10):
$f_{fused\_image} = W_{fused}\left[f_{global\_image}; \dfrac{1}{N}\sum_{i=1}^{N} V_{i}'\right]$  (10)
where $f_{fused\_image} \in \mathbb{R}^{d}$ is the fused image feature vector and $W_{fused} \in \mathbb{R}^{d \times 2d}$ is the learned fusion weight matrix.
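A brief sketch of Equation (10) follows; the dimensions and region count are illustrative stand-ins.

```python
# Eq. (10): fuse the CLIP global feature with mean-pooled GAT-enhanced region features.
import torch
import torch.nn as nn

d = 1024
W_fused = nn.Linear(2 * d, d, bias=False)      # the concatenated vector is 2d-dimensional

f_global_image = torch.randn(d)                # from CLIP ViT (Eq. 1)
V_enhanced = torch.randn(36, d)                # GAT outputs (Eq. 9)
f_fused_image = W_fused(torch.cat([f_global_image, V_enhanced.mean(dim=0)]))  # (d,)
```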

3.4. Multi-Scale Semantic Guidance Module

Based on the global image features obtained via ViT, the local features obtained via Faster R-CNN, and the semantic relationships between local features modeled by the Graph Attention Network, this paper employs a multi-scale semantic guidance method to obtain a global feature text description and local feature keywords. These are then jointly input into GPT-2 along with the fused image features to enhance the accuracy of image scene descriptions and attribute detail descriptions.

3.4.1. Global Semantic Guidance

This module aims to enhance the overall scene description accuracy and consistency of the generated text. Its core process is:
Step 1: Use the CLIP text encoder to map keywords describing the overall image scene into a global description vector $E_{global} \in \mathbb{R}^{d}$. Taking Figure 2 as an example, keywords such as “sunlight”, “beach”, and “waves” are encoded for a surfing scene image, while keywords such as “skyscrapers”, “traffic lights”, and “zebra crossing” are encoded for an urban street scene image.
Step 2: Interact $E_{global}$ with the visual context via cross-modal attention, emphasizing the global semantics most relevant to the current visual context (e.g., if a surfing scene image contains “beach”, increase the weights of words such as “sunlight” and “waves”). The global semantic weight vector is generated as shown in Equation (11):
$W_{global} = \mathrm{Softmax}\!\left(\dfrac{f_{fused\_image} E_{global}^{T}}{\sqrt{d}}\right) E_{global}$  (11)
where $W_{global} \in \mathbb{R}^{d}$ is the generated global semantic weight vector.
This mechanism continuously and dynamically adjusts the contribution of the global scene description during the generation process, ensuring the generated description is semantically coherent overall and avoiding contradictory details (e.g., “snowfield” appearing in a “beach” description).
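As a sketch of Equation (11), the snippet below treats $E_{global}$ as a stack of the CLIP embeddings of M scene keywords; the keyword count M and the random features are assumptions for illustration only.

```python
# Eq. (11): scaled dot-product attention over CLIP-encoded scene keywords.
import math
import torch

d, M = 512, 8
f_fused_image = torch.randn(d)                 # fused visual feature (Eq. 10)
E_global = torch.randn(M, d)                   # CLIP text embeddings of M scene keywords

attn = torch.softmax(f_fused_image @ E_global.t() / math.sqrt(d), dim=-1)  # (M,)
W_global = attn @ E_global                     # (d,) global semantic weight vector
```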

3.4.2. Local Semantic Guidance

This module aims to improve the fine-grained description accuracy of object attributes in the generated text. Its core process is:
Step 1: Use CLIP’s text encoder together with the previously obtained enhanced node features $\{V_{i}'\}_{i=1}^{N}$ to extract textual vectors $f_{local\_text}$ for the local object keywords in the image.
Step 2: Filter key local semantics through a Sigmoid gating mechanism [25], generating the local semantic weight vector $W_{local}$.
The local semantic weight vector is generated as shown in Equation (12):
$W_{local} = \mathrm{Sigmoid}\!\left(W_{g}[f_{fused\_image}; f_{local\_text}]\right) \odot f_{local\_text}$  (12)
where $W_{g} \in \mathbb{R}^{d \times 2d}$ is a learnable weight matrix and $\odot$ denotes element-wise multiplication. The Sigmoid function computes weights independently for each feature dimension, enabling fine-grained filtering of local semantics.
This mechanism enhances fine-grained attribute descriptions, highlighting key object attribute details, such as accurately generating the color (“red”) and shape (“square”) in “red square surfboard”. It simultaneously addresses the problem of redundant information interference in multi-scale semantic fusion, improving the conciseness of the subsequent GPT-2 generated text and resolving semantic conflicts, such as avoiding contradictions between global scene and local object descriptions.
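A sketch of Equation (12) is given below; the element-wise gating follows the per-dimension description above, and the feature values are illustrative.

```python
# Eq. (12): Sigmoid gate over the concatenated visual and local-text features.
import torch
import torch.nn as nn

d = 512
W_g = nn.Linear(2 * d, d, bias=False)          # learnable gate weights, R^{d x 2d}

f_fused_image = torch.randn(d)                 # fused visual feature (Eq. 10)
f_local_text = torch.randn(d)                  # CLIP embedding of local object keywords

gate = torch.sigmoid(W_g(torch.cat([f_fused_image, f_local_text])))  # per-dimension weights
W_local = gate * f_local_text                  # fine-grained filtering of local semantics
```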

3.5. Text Decoder Module Based on GPT-2

Traditional GPT-2 decoders often use a greedy strategy, selecting the word with the highest probability at each time step. However, this locally optimal choice does not guarantee the maximum joint probability of the full sequence; the lack of global optimization leads to myopic decisions that can reduce the coherence and rationality of the generated text. Beam search, by retaining num_beams high-probability candidates at each time step and finally selecting the sequence with the highest global probability, balances local and global optimization and effectively improves the quality of the generated text.
This paper adopts a hybrid greedy search and beam search strategy to improve generation efficiency and quality. In the initial stage, greedy search is used to quickly generate the first n words (e.g., the first 3–5 words), leveraging its high speed to determine the main direction of the text and reduce computational complexity. Subsequently, it switches to beam search (setting num_beams = k). By retaining multiple candidate paths, it avoids local optima, optimizing the coherence and diversity of long text. This method quickly locates the core semantics in the greedy phase and optimizes details in the beam phase, reducing redundancy while balancing computational efficiency and text coherence.
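A hedged sketch of this greedy-then-beam strategy using the Hugging Face GPT-2 interface follows; the text prompt stands in for the visual prefix described later in this section, and the values of n and k are illustrative.

```python
# Hybrid decoding: greedy search for the first few tokens, then beam search.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def hybrid_generate(prompt: str, n_greedy: int = 4, num_beams: int = 5, max_new: int = 25):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    # Phase 1: greedy search quickly fixes the opening words of the caption.
    prefix = model.generate(ids, do_sample=False, max_new_tokens=n_greedy,
                            pad_token_id=tokenizer.eos_token_id)
    # Phase 2: beam search continues from that prefix, keeping num_beams candidates.
    out = model.generate(prefix, num_beams=num_beams, max_new_tokens=max_new,
                         early_stopping=True, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(hybrid_generate("A photo of"))
```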
The input, output, and processing flow for text generation based on GPT-2 are as follows:
Step 1: Concatenate the fused image feature $f_{fused\_image}$, the weighted global text feature $f_{weighted\_global\_text}$, and the weighted local text feature $f_{weighted\_local\_text}$ to obtain the fused conditioning vector $C_{fused}$, as shown in Equation (13):
$C_{fused} = [f_{fused\_image}; f_{weighted\_global\_text}; f_{weighted\_local\_text}]$  (13)
Step 2: Use a Multilayer Perceptron (MLP) [26] to map the fused feature $C_{fused}$ to GPT-2's embedding dimension for alignment, where the hidden dimension $h$ of GPT-2 is typically 768 or 1024, as shown in Equation (14):
$E_{vis} = \mathrm{MLP}(C_{fused})$  (14)
where $E_{vis} \in \mathbb{R}^{h}$ is the final aligned visual feature vector.
Step 3: Use $E_{vis}$ as the initial hidden state of the GPT-2 decoder, or concatenate it with the text embeddings for embedding-space injection. GPT-2 then autoregressively generates the text sequence $w_{1}, w_{2}, \ldots, w_{t}$ conditioned on $E_{vis}$, predicting the word probability distribution at each step, as shown in Equation (15):
$P(w_{t} \mid w_{<t}, E_{vis}) = \mathrm{Softmax}(W_{o} h_{t})$  (15)
where $W_{o}$ is the output layer weight and $h_{t}$ is the hidden state at the current time step.
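The projection and conditioning in Steps 1 to 3 might be sketched as follows; a one-token visual prefix is used for brevity (ClipCap-style multi-token prefixes work the same way), and the MLP shape and random conditioning features are assumptions.

```python
# Eqs. (13)-(15): project the fused conditioning vector and prepend it to GPT-2.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

d, h = 512, 768                                   # feature dim, GPT-2 hidden size
mlp = nn.Sequential(nn.Linear(3 * d, h), nn.Tanh(), nn.Linear(h, h))  # Eq. (14)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Eq. (13): concatenate fused image, weighted global-text, and weighted local-text features.
C_fused = torch.cat([torch.randn(d), torch.randn(d), torch.randn(d)])
E_vis = mlp(C_fused).view(1, 1, h)                # (1, 1, h) visual prefix token

# Eq. (15): next-word distribution conditioned on the visual prefix.
text_ids = tokenizer("a cat", return_tensors="pt").input_ids
text_emb = gpt2.transformer.wte(text_ids)         # (1, T, h) token embeddings
logits = gpt2(inputs_embeds=torch.cat([E_vis, text_emb], dim=1)).logits
next_word_probs = torch.softmax(logits[0, -1], dim=-1)   # P(w_t | w_<t, E_vis)
```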

3.6. Model Joint Training and Fine-Tuning

This model adopts a three-stage compound loss function design, achieving a balance between text generation quality and cross-modal alignment through multi-objective optimization.
1. Text Generation Loss: cross-entropy loss constrains the match between the generated text and the ground-truth description, as defined in Equation (16):
$L_{gen} = -\sum_{t=1}^{T} \log P(w_{t} \mid w_{<t}, E_{vis})$  (16)
where $w_{t}$ is the t-th generated token, $w_{<t}$ is the history token sequence, and $E_{vis}$ is the visual encoding feature.
2. Semantic Alignment Loss: CLIP's contrastive loss forces alignment between image features and text embeddings in a joint semantic space, as shown in Equation (17):
$L_{align} = 1 - \dfrac{f_{global\_image} \cdot f_{global\_text}}{\|f_{global\_image}\|\,\|f_{global\_text}\|}$  (17)
where $f_{global\_image}$ and $f_{global\_text}$ are the global image feature vector and the text embedding vector, respectively.
3. Gating Sparsity Constraint Loss: L1 regularization prevents excessive activation of the gating weights, as shown in Equation (18):
$L_{gate} = \lambda \|W_{local}\|_{1}$  (18)
where $W_{local}$ is the weight of the gating module and $\lambda$ is the regularization coefficient.
4. Total Loss Function: the weighted sum of the three losses above, as shown in Equation (19); a code sketch of Equations (16)-(19) follows this list.
$L_{total} = L_{gen} + \alpha_{1} L_{align} + \alpha_{2} L_{gate}$  (19)
where the weight hyperparameters are set to $\alpha_{1} = 0.7$ and $\alpha_{2} = 0.3$.
5. Training Strategy: This paper employs a phased training strategy. In the pre-training stage, the CLIP parameters are frozen and the GAT and gating modules are trained separately to optimize local semantic extraction. In the joint fine-tuning stage, selected CLIP parameters (e.g., the last two Transformer layers) are unfrozen, and the GAT, gating module, and GPT-2 are jointly optimized to improve cross-modal alignment. The training process achieves high-precision image description generation through multi-scale feature fusion and dynamic gating filtering, combining CLIP's cross-modal alignment capability, GAT's local relationship modeling, and GPT-2's generation capability. The core lies in balancing the text generation loss and the semantic alignment constraints, while the phased training strategy improves convergence efficiency, ensuring that the generated descriptions are both accurate and semantically coherent.
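The composite loss of Equations (16) to (19) can be sketched as follows; the tensor shapes and the value of λ are illustrative stand-ins, not values reported in the paper.

```python
# Composite training loss: generation CE + global alignment + gate sparsity.
import torch
import torch.nn.functional as F

def total_loss(logits, target_ids, f_img, f_txt, W_local,
               alpha1: float = 0.7, alpha2: float = 0.3, lam: float = 1e-4):
    # Eq. (16): token-level cross-entropy between generated and ground-truth captions.
    L_gen = F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
    # Eq. (17): 1 - cosine similarity between global image and text features.
    L_align = 1.0 - F.cosine_similarity(f_img, f_txt, dim=-1).mean()
    # Eq. (18): L1 sparsity penalty on the local gating weights.
    L_gate = lam * W_local.abs().sum()
    # Eq. (19): weighted sum with the hyperparameters quoted above.
    return L_gen + alpha1 * L_align + alpha2 * L_gate

loss = total_loss(torch.randn(2, 12, 50257), torch.randint(0, 50257, (2, 12)),
                  torch.randn(2, 512), torch.randn(2, 512), torch.randn(512))
```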

4. Experiments

4.1. Experimental Setup

The experimental environment configuration is shown in Table 1, and the specific parameter settings are shown in Table 2.

4.2. Data Preprocessing

Both the MSCOCO and Flickr30k datasets require preprocessing before they are used to train and test the image captioning model.

4.2.1. Visual Modal Preprocessing

This article adopts the following preprocessing pipeline to ensure the model obtains high-quality visual feature representations:
Step 1: Resize images to the standard input size of 224 × 224 pixels.
Step 2: Extract region features using Faster R-CNN, augmented with Graph Attention Networks (GAT) to strengthen the representational capacity of local regions.

4.2.2. Text Modal Preprocessing

Step 1: Preprocess the dataset text, including converting to lowercase (optional), removing punctuation (while retaining select symbols such as periods), replacing numbers with the special token <num> to handle special characters and numerals, and removing excess whitespace.
Step 2: Generate word embeddings using the BERT model and align them with visual features through cross-modal attention to resolve ambiguity and achieve semantic alignment.
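A minimal sketch of the cleaning rules in Step 1 is shown below; the regular expressions are assumptions about how those rules might be realized, not the authors' exact pipeline.

```python
# Step 1 text cleaning: lowercase, strip punctuation, replace numerals, trim whitespace.
import re

def clean_caption(text: str, lowercase: bool = True) -> str:
    if lowercase:
        text = text.lower()
    text = re.sub(r"\d+", "<num>", text)        # replace numbers with the <num> token
    text = re.sub(r"[^\w\s<>.]", "", text)      # drop punctuation, keep periods and <num>
    return re.sub(r"\s+", " ", text).strip()    # remove excess whitespace

print(clean_caption("Two dogs, 3 cats and a ball!"))   # -> "two dogs <num> cats and a ball"
```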

4.2.3. Dataset Partitioning and Validation Strategy

The MSCOCO dataset available for image captioning experiments contains a total of 123,287 images, of which 82,783 are used for training and 40,504 are used for testing. Each image has five manually annotated sentences describing it.
The Flickr30K dataset contains 31,783 images, with five distinct captions per image, totaling 158,915 captions. As Flickr30K does not have predefined training and test sets, this paper divides the data into training and test sets in a random 8:2 ratio.
The validation strategy adopts a multi-metric monitoring mechanism, with BLEU-4 and CIDEr as the core metrics and METEOR, ROUGE-L, and SPICE as auxiliary metrics. To ensure the reliability of the results, five independent runs were conducted to record the fluctuation range of each metric.

4.3. Model Evaluation

4.3.1. Experimental Comparison on MSCOCO and Flickr30k Datasets

To comprehensively evaluate the performance of the proposed model, quantitative and qualitative comparative analyses were conducted through experiments on the MSCOCO and Flickr30k datasets. The results of the quantitative comparison are presented in Table 3 and Table 4, with all values reported as percentages (%). For the MeaCap model, the optimal results from its text-only-trained version (MeaCapToT) in the in-domain experiment were used.
As quantified in Table 3, the model achieves state-of-the-art performance on MSCOCO, with particular breakthroughs in semantic richness (CIDEr: 156.7, +4.7% over OFA) and fine-grained accuracy (METEOR: 32.8, +2.8% over OFA). The 13.4% CIDEr gain over PureT demonstrates the method's superiority in capturing contextual details.
On the Flickr30k dataset (Table 4), the model demonstrates comprehensive performance advantages: compared to the SR-PL model, it achieves a significant improvement in fine-grained description accuracy (BLEU-4 + 16.4%, ROUGE-L + 4.8%); compared to the Unified VLP model, it achieves significant improvements in semantic coherence metrics (METEOR + 47.4%, CIDEr + 16.6%). This substantial advance validates the effectiveness of the GAT-based local enhancement and multi-scale alignment mechanism in complex scene description tasks.
In summary, based on the results of the comparative experiments, the proposed model demonstrates superior image captioning capabilities. Its advantages hold across most performance metrics, with particular strength in semantic generation quality (CIDEr) and local scene understanding (BLEU-4). To ensure reliability, this article conducted five independent experiments. Results show that on MSCOCO, CIDEr fluctuated within ±0.16 and BLEU-4 within ±0.21, while on Flickr30k, CIDEr varied by ±0.16 and BLEU-4 by ±0.27.

4.3.2. Ablation Study

To further validate the effectiveness of the proposed Graph Attention Network-based Local Feature Enhancement (GAT-LFE) module and the Multi-scale Semantic Guidance (MSG) module, ablation experiments were conducted using an image captioning backbone composed of a CLIP encoder, a Faster R-CNN image region feature encoder, and a GPT-2 decoder. The contribution of each module was analyzed in detail. Starting from this baseline model, the experiments sequentially introduced the GAT-LFE module and the MSG module. Results on the MSCOCO and Flickr30k datasets are presented in Table 5 and Table 6, respectively.
The comparative results of the ablation experiments in Table 5 and Table 6 show that:
1. Effectiveness of the Graph Attention Network Local Feature Enhancement Module
On the MSCOCO dataset, after adding this module (comparing Row 1 and Row 2 of Table 5), the BLEU-1, BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE metrics increased by 1.2%, 4.3%, 7.4%, 1.6%, 6.5%, and 2.2%, respectively, with the SUM total score improving by 4.1%. On the Flickr30k dataset (comparing Row 1 and Row 2 of Table 6), the BLEU-1, BLEU-4, METEOR, ROUGE-L, and CIDEr metrics increased by 1.1%, 1.5%, 2.1%, 4.2%, and 8.5%, respectively, with the SUM total score improving by 3.8%.
2. Effectiveness of the Multi-Scale Semantic Guidance Module
On the MSCOCO dataset, after adding this module based on the existing graph attention network (comparing Row 2 and Row 3 of Table 5), the BLEU-1, BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE metrics increased by 2.1%, 2.4%, 7.9%, 3.1%, 9.4%, and 5.2%, respectively, with the SUM total score improving by 5.7%. On the Flickr30k dataset (comparing Row 2 and Row 3 of Table 6), the BLEU-1, BLEU-4, METEOR, ROUGE-L, and CIDEr metrics increased by 0.7%, 3.6%, 14.1%, 1.0%, and 10.1%, respectively, with the SUM total score improving by 5.2%.
Overall, after adding the local feature enhancement module of the graph attention network and the multi-scale semantic guidance module, the benchmark model’s scores on major indicators significantly increased. The notable improvement in the CIDEr index from both comparative and ablation experiments further demonstrates the effectiveness of the two modules and their significant impact on performance metrics.

4.3.3. Qualitative Result Analysis

To visually demonstrate that the proposed method can generate more fluent, rich, and fine-grained descriptive sentences for images, Figure 3 and Figure 4 show comparisons between the example descriptions generated by this method, the reference descriptions in the dataset, and the descriptions generated by the OFA model. Since the official implementation of the baseline OFA model lacks an interface for occlusion scenario testing, Figure 5 only presents qualitative results from the model of this article.
As illustrated in Figure 3, the model demonstrates superior scene understanding compared to OFA. While OFA provides generic descriptions (e.g., describing urban cyclists simply as “a group of people”), the model captures crucial contextual details and spatial relationships, such as the proximity of buildings to the cyclists and the specific attributes and location of an old truck with surfboards. Furthermore, the model successfully resolves ambiguities present in OFA’s outputs (e.g., replacing UNK with meaningful concepts like “surfboard”).
Figure 4 highlights the model’s enhanced semantic precision and ability to incorporate environmental context. Where OFA misinterpreted specific elements (e.g., confusing political advertisements on a bus for flags), the model accurately identifies key visual details. Additionally, the model consistently enriches descriptions with relevant scene context (e.g., specifying the “field” environment for the elephant rider), constructing more complete semantic representations of actions within their settings.
Figure 5 showcases the model’s robustness in handling occlusion. Despite significant occlusion (e.g., the partially visible keyboard), the model generates coherent and informative descriptions that correctly infer the spatial relationships between objects (e.g., the cat lying beside the keyboard on the table).
As can be seen from the figures, the descriptions generated by the model of this article perform well in terms of detail, accuracy, richness, and scene understanding.

5. Conclusions

This study addresses the challenges of modeling inter-object relationships among local regions and achieving multi-scale semantic alignment in image captioning. This article proposes an innovative model that integrates the GAT-LFE and MSG mechanisms. Extensive experiments on the MSCOCO and Flickr30k datasets demonstrate that the model achieves consistent performance improvements over state-of-the-art baselines across multiple evaluation metrics. Ablation studies and qualitative analyses robustly validate the core contributions and synergistic effects of both the GAT-based local enhancement module and the MSG module.
The core contributions encompass three aspects: first, a dual-stream feature extraction architecture combines CLIP’s ViT encoder for visual global semantics and Faster R-CNN’s local region features, strengthened by a BERT-GPT-2 backbone for contextual generation. Second, a local feature enhancement mechanism leverages GAT-based adaptive weighting to fuse interactive region features, overcoming limitations of traditional modeling in statically and indiscriminately processing local features. Third, a multi-scale semantic-guided decoding strategy hierarchically aligns global image-text semantics (e.g., scene categories) and fine-grained region-keyword mappings (e.g., object attributes), refining description granularity.
Despite its strengths, the approach in this article has limitations. First, regarding computational efficiency, the attention mechanism in GAT results in high memory consumption, hindering deployment on resource-constrained devices, while the multi-scale alignment process significantly increases inference latency, becoming a bottleneck for real-time deployment. Second, the model's reliance on off-the-shelf pre-trained models (e.g., CLIP and GPT-2) with largely fixed parameters restricts opportunities for exploring performance improvements in cross-modal alignment through end-to-end fine-tuning. Furthermore, the distribution characteristics of BERT's pre-training corpus constrain the model's generalization to out-of-domain text styles (e.g., poetic descriptions). Finally, despite significant improvements in automated scores such as CIDEr, these metrics fail to quantify human subjective judgments on text fluency and logical coherence; this article has not yet incorporated human evaluation, which constitutes a critical direction for future work.
To address the current limitations, future work will focus on breakthroughs in three key areas: first, exploring BLIP-2's Q-Former [12] to optimize graph neighborhood selection, combined with multi-scale knowledge distillation into lightweight decoders (e.g., T5-small [42]), to significantly reduce computational complexity; second, investigating Flamingo-style [43] continual multimodal learning mechanisms and cross-modal LoRA [44] fine-tuning strategies to reduce annotation dependency and enable efficient adaptation to new domains (e.g., medical imaging, remote sensing); third, developing interpretability-constrained and diffusion-guided generation methods, along with vision-language adapters, to achieve precise attribute manipulation under multi-scale semantics (e.g., changing object color or position in descriptions), potentially drawing inspiration from efficient modeling strategies in network systems [45]. Finally, to enhance evaluation quality, future work will prioritize a large-scale human assessment: at least three native English-speaking experts will be invited, each independently evaluating 10% of the images (i.e., 4000 images) randomly selected from the MSCOCO test dataset, assessing the fluency and logical coherence of the generated descriptions.

Author Contributions

Writing—original draft preparation, L.W. and M.Z.; Writing—review and editing, L.W., M.Z., M.J., E.C. and J.W.; Supervision, E.C. and M.J.; Data curation, E.C.; Visualization, Y.M.; Project administration, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by Industry-University-Research Innovation Fund for Chinese Universities under Grant 2021LD06009; in part by the Natural Science Foundation of Liaoning Province under Grant 2022-MS-291; in part by Research Project of Liaoning Provincial Department of Education under Grant LJ2020024; in part by Basic Research Project of Liaoning Provincial Department of Education under Grant LJKMZ20220781; in part by General Basic Research Projects of Liaoning Provincial Education Department under Grant JYTMS20231488; in part by the Applied Basic Research Program of Liaoning Provincial Department of Science and Technology (2025): Research on Intrusion Detection Technology and Intelligent Defense Strategies for Industrial Internet (Acceptance No. 1746669597594), and in part by the Basic Research Project of Liaoning Provincial Department of Education (2025): Research on Optimization Technologies for Key Metrics such as Energy Consumption and Coverage in WSNs.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bernardi, R.; Cakici, R.; Elliott, D.; Erdem, A.; Erdem, E.; Ikizler-Cinbis, N.; Keller, F.; Muscat, A.; Plank, B. Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures. J. Artif. Intell. Res. 2016, 55, 409–442. [Google Scholar] [CrossRef]
  2. Hossain, M.Z.; Sohel, F.; Shiratuddin, M.F.; Laga, H. A Comprehensive Survey of Deep Learning for Image Captioning. ACM Comput. Surv. 2019, 51, 1–36. [Google Scholar] [CrossRef]
  3. Mokady, R.; Hertz, A.; Bermano, A.H. ClipCap: CLIP Prefix for Image Captioning. arXiv 2021, arXiv:2111.09734. [Google Scholar] [CrossRef]
  4. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
  5. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019. Available online: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (accessed on 1 June 2024).
  6. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar] [CrossRef]
  7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  8. Zhong, Y.; Yang, J.; Zhang, P.; Li, C.; Codella, N.; Li, L.H.; Zhou, L.; Dai, X.; Yuan, L.; Li, Y.; et al. RegionCLIP: Region-based Language-Image Pretraining. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16772–16782. [Google Scholar] [CrossRef]
  9. Li, M.; Xu, R.; Wang, S.; Zhou, L.; Lin, X.; Zhu, C.; Zeng, M.; Ji, H.; Chang, S.-F. CLIP-Event: Connecting Text and Images with Event Structures. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 16399–16408. [Google Scholar] [CrossRef]
  10. Qian, K.; Pan, Y.; Xu, H.; Tian, L. Transformer Model Incorporating Local Graph Semantic Attention for Image Caption. Vis. Comput. 2024, 40, 6533–6544. [Google Scholar] [CrossRef]
  11. Wang, Q.; Deng, H.; Wu, X.; Yang, Z.; Liu, Y.; Wang, Y.; Hao, G. LCM-Captioner: A Lightweight Text-Based Image Captioning Method with Collaborative Mechanism between Vision and Text. Neural Netw. 2023, 162, 318–329. [Google Scholar] [CrossRef]
  12. Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv 2023, arXiv:2301.12597. [Google Scholar] [CrossRef]
  13. Ramos, L.; Casas, E.; Romero, C.; Rivas-Echeverría, F.; Morocho-Cayamcela, M.E. A Study of ConvNeXt Architectures for Enhanced Image Captioning. IEEE Access 2024, 12, 13711–13728. [Google Scholar] [CrossRef]
  14. Ma, F.; Zhou, Y.; Rao, F.; Zhang, Y.; Sun, X. Image Captioning with Multi-Context Synthetic Data. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 4089–4097. [Google Scholar] [CrossRef]
  15. Yang, X.; Yang, Y.; Ma, S.; Li, Z.; Dong, W.; Woźniak, M. SAMT-generator: A second-attention for image captioning based on multi-stage transformer network. Neurocomputing 2024, 593, 127823. [Google Scholar] [CrossRef]
  16. Mandava, M.; Vinta, S.R. Image Captioning with Neural Style Transfer Using GPT-2 and Vision Transformer Architectures. In Machine Vision and Augmented Intelligence; Kumar Singh, K., Singh, S., Srivastava, S., Bajpai, M.K., Eds.; Springer Nature: Singapore, 2025; pp. 537–548. ISBN 978-981-97-4359-9. [Google Scholar] [CrossRef]
  17. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022, arXiv:2204.06125. [Google Scholar] [CrossRef]
  18. Barraco, M.; Cornia, M.; Cascianelli, S.; Baraldi, L.; Cucchiara, R. The Unreasonable Effectiveness of CLIP Features for Image Captioning: An Experimental Analysis. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–24 June 2022; pp. 4661–4669. [Google Scholar] [CrossRef]
  19. Su, Y.; Lan, T.; Liu, Y.; Liu, F.; Yogatama, D.; Wang, Y.; Kong, L.; Collier, N. Language Models Can See: Plugging Visual Controls in Text Generation. arXiv 2022, arXiv:2205.02655. [Google Scholar] [CrossRef]
  20. Fei, J.; Wang, T.; Zhang, J.; He, Z.; Wang, C.; Zheng, F. Transferable Decoding with Visual Entities for Zero-Shot Image Captioning. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 3113–3123. [Google Scholar] [CrossRef]
  21. Vishniakov, K.; Shen, Z.; Liu, Z. ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy. arXiv 2023, arXiv:2311.09215. [Google Scholar] [CrossRef]
  22. Vrahatis, A.G.; Lazaros, K.; Kotsiantis, S. Graph Attention Networks: A Comprehensive Review of Methods and Applications. Future Internet 2024, 16, 318. [Google Scholar] [CrossRef]
  23. Xu, B.; Wang, N.; Chen, T.; Li, M. Empirical Evaluation of Rectified Activations in Convolutional Network. arXiv 2015, arXiv:1505.00853. [Google Scholar] [CrossRef]
  24. Clevert, D.-A.; Unterthiner, T.; Hochreiter, S. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). arXiv 2015, arXiv:1511.07289. [Google Scholar] [CrossRef]
  25. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef]
  26. Bisong, E. The Multilayer Perceptron (MLP). In Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners; Apress: Berkeley, CA, USA, 2019; pp. 401–405. [Google Scholar] [CrossRef]
  27. Cornia, M.; Stefanini, M.; Baraldi, L.; Cucchiara, R. Meshed-Memory Transformer for Image Captioning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10575–10584. [Google Scholar] [CrossRef]
  28. Ji, J.; Luo, Y.; Sun, X.; Chen, F.; Luo, G.; Wu, Y.; Gao, Y.; Ji, R. Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Conference, USA, 2–9 February 2021; Volume 35, pp. 1655–1663. [Google Scholar] [CrossRef]
  29. Zhang, J.; Fang, Z.; Sun, H.; Wang, Z. Adaptive Semantic-Enhanced Transformer for Image Captioning. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 1785–1796. [Google Scholar] [CrossRef]
  30. Wang, Y.; Xu, J.; Sun, Y. End-to-End Transformer Based Model for Image Captioning. arXiv 2022, arXiv:2203.15350. [Google Scholar] [CrossRef]
  31. Zhang, H.; Xu, C.; Xu, B.; Jian, M.; Liu, H.; Li, X. TSIC-CLIP: Traffic Scene Image Captioning Model Based on Clip. Inf. Technol. Control 2024, 53, 35095. [Google Scholar] [CrossRef]
  32. Chen, L.; Li, K. Multi-Modal Graph Aggregation Transformer for Image Captioning. Neural Netw. 2025, 181, 106813. [Google Scholar] [CrossRef] [PubMed]
  33. Cao, S.; An, G.; Cen, Y.; Yang, Z.; Lin, W. CAST: Cross-Modal Retrieval and Visual Conditioning for Image Captioning. Pattern Recognit. 2024, 153, 110555. [Google Scholar] [CrossRef]
  34. Wang, P.; Yang, A.; Men, R.; Lin, J.; Bai, S.; Li, Z.; Ma, J.; Zhou, C.; Zhou, J.; Yang, H. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S., Eds.; PMLR: San Diego, CA, USA, 2022; Volume 162, pp. 23318–23340. Available online: https://proceedings.mlr.press/v162/wang22al.html (accessed on 11 July 2025).
  35. Yu, J.; Wang, Z.; Vasudevan, V.; Yeung, L.; Seyedhosseini, M.; Wu, Y. CoCa: Contrastive Captioners are Image-Text Foundation Models. arXiv 2022, arXiv:2205.01917. [Google Scholar] [CrossRef]
  36. Kuo, C.-W.; Kira, Z. Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17948–17958. [Google Scholar] [CrossRef]
  37. Wang, J.; Wang, W.; Wang, L.; Wang, Z.; Feng, D.D.; Tan, T. Learning Visual Relationship and Context-Aware Attention for Image Captioning. Pattern Recognit. 2020, 98, 107075. [Google Scholar] [CrossRef]
  38. Zhou, L.; Palangi, H.; Zhang, L.; Hu, H.; Corso, J.; Gao, J. Unified Vision-Language Pre-Training for Image Captioning and VQA. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13041–13049. [Google Scholar] [CrossRef]
  39. Liu, X.; Li, H.; Shao, J.; Chen, D.; Wang, X. Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data. In Computer Vision—ECCV 2018, Proceedings of the 15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part XV; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Berlin/Heidelberg, Germany, 2018; pp. 353–369. [Google Scholar] [CrossRef]
  40. Zeng, Z.; Xie, Y.; Zhang, H.; Chen, C.; Chen, B.; Wang, Z. MeaCap: Memory-Augmented Zero-shot Image Captioning. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 14100–14110. [Google Scholar] [CrossRef]
  41. Tu, H.; Yang, B.; Zhao, X. ZeroGen: Zero-Shot Multimodal Controllable Text Generation with Multiple Oracles. In Natural Language Processing and Chinese Computing, Proceedings of the 12th National CCF Conference, NLPCC 2023, Foshan, China, 12–15 October 2023; Proceedings, Part II; Liu, F., Duan, N., Xu, Q., Hong, Y., Eds.; Springer: Berlin/Heidelberg, Germany, 2023; pp. 494–506. [Google Scholar] [CrossRef]
  42. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. Available online: http://www.jmlr.org/papers/v21/20-074.html (accessed on 11 July 2025).
  43. Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A Visual Language Model for Few-Shot Learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar] [CrossRef]
  44. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar] [CrossRef]
  45. Wang, J.; Tang, J.; Li, C.; Ma, Z.; Yang, J.; Fu, Q. Modeling and Analysis in the Industrial Internet with Dual Delay and Nonlinear Infection Rate. Electronics 2025, 14, 2058. [Google Scholar] [CrossRef]
Figure 1. The structure of the proposed model.
Figure 2. Example images of surfing scenes and urban street scenes.
Figure 3. Comparative captioning examples from the proposed model, the ground truth (GT), and OFA (a).
Figure 4. Comparative captioning examples from the proposed model, the ground truth (GT), and OFA (b).
Figure 5. Comparative captioning examples from the proposed model and the ground truth (GT).
Table 1. Experimental Environment Configuration.

| Experimental Environment | Specific Information |
|---|---|
| Operating System | Windows 10 |
| GPU | NVIDIA GeForce RTX 4090 (24 GB GDDR6X) (NVIDIA, Santa Clara, CA, USA) |
| CPU | Intel Core i9-13900K (24 cores/32 threads) (Intel, Santa Clara, CA, USA) |
| Memory | 128 GB DDR5 |
| Development Language | Python 3.9 |
| Development Platform | PyTorch 1.11.0 |
Table 2. Experimental Parameter Setting.

| Parameter Name | Parameter Value |
|---|---|
| lr (learning rate) | 1 × 10−4 |
| lr decay | 0.9 |
| lr_step_size | 1 |
| batch size | 64 |
| Optimizer | AdamW |
| Beam size | 5 |
| CLIPViT | ViT-B/32 |
| CLIPBERT_Encoder_dim | 512 |
| GPT-2_dim | 768 |
| Epochs | 30 |
| K | 5 |
| Anchor_scale | [64, 128, 256] |
| RoI pooling size | 7 × 7 |
| RPN_POST_NMS_TOP_N_TEST | 100 |
| CrossModal_Adapter_dim | 2048 → 512 |
In the table, the Optimizer, Beam size, CLIPViT, CLIPBERT_Encoder_dim, Anchor_scale, RoI pooling size, and RPN_POST_NMS_TOP_N_TEST settings are fixed, while the remaining parameters (learning rate, batch size, training epochs, etc.) were tuned manually.
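As a minimal sketch of how the tunable settings in Table 2 could be assembled in PyTorch (the development platform listed in Table 1), the snippet below wires AdamW with a learning rate of 1 × 10−4, a step decay of 0.9 per epoch, and a batch size of 64; the `model` and `train_dataset` objects are hypothetical placeholders rather than the authors' released code.

```python
import torch
from torch.utils.data import DataLoader

def build_training_setup(model, train_dataset):
    """Hypothetical helper reflecting the tunable settings in Table 2."""
    # AdamW optimizer with the learning rate from Table 2 (lr = 1e-4).
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    # Step decay: multiply the learning rate by 0.9 after every epoch
    # (lr decay = 0.9, lr_step_size = 1).
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
    # Mini-batches of 64 samples, as listed in Table 2.
    loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    return optimizer, scheduler, loader
```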
Table 3. Experimental Comparison Results on the MSCOCO Dataset.

| Method Categories | Methods | B-1 | B-4 | M | R | C | S |
|---|---|---|---|---|---|---|---|
| Semantic capture enhancement models | M2 Transformer [27] | 80.8 | 39.1 | 29.2 | 58.6 | 131.2 | 22.6 |
| | GET (w/MAC) [28] | 81.5 | 39.5 | 29.3 | 58.9 | 131.6 | 22.8 |
| | AS-Transformer (w/VinVL) [29] | 82.3 | 41.0 | 29.8 | 60.0 | 136.1 | 23.8 |
| Multimodal feature fusion models | PureT [30] | 82.1 | 40.9 | 30.2 | 60.1 | 138.2 | 24.2 |
| | TSIC-CLIP [31] | - | 40.3 | 30.1 | 59.6 | 137.9 | - |
| | MMGAT [32] | 83.9 | 42.5 | 31.1 | 60.8 | 144.6 | 24.6 |
| | CAST [33] | 83.1 | 42.2 | 30.8 | 60.6 | 140.6 | 24.7 |
| Pre-training-based cross-modal representation models | CLIPCap [3] | - | 32.2 | 27.1 | - | 108.4 | 21.2 |
| | OFA [34] | - | 43.5 | 31.9 | - | 149.6 | 26.1 |
| | CoCa [35] | - | 40.9 | 33.9 | - | 143.6 | 24.7 |
| Robustness-enhanced models | PTOD [36] | 81.5 | 39.7 | 30.0 | 59.5 | 135.9 | 23.7 |
| The method of this article | Ours | * 82.9 | * 42.5 | * 32.8 | * 60.6 | * 156.7 | * 24.1 |
In the table, bold indicates the best value for each metric, and * marks the results of the model proposed in this article; when our model achieves the best value, the entry is shown in bold with an asterisk. The reported improvements are statistically significant (p = 0.04).
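The scores in Tables 3 and 4 are the standard COCO caption metrics (BLEU-1/BLEU-4, METEOR, ROUGE-L, CIDEr, and SPICE). The paper does not state which evaluation implementation was used; the snippet below is a sketch of how such scores are commonly computed with the public pycocoevalcap toolkit, with illustrative file paths.

```python
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Ground-truth captions and model outputs in COCO caption JSON format
# (file names are illustrative only).
coco = COCO("annotations/captions_val2014.json")
coco_res = coco.loadRes("results/generated_captions.json")

evaluator = COCOEvalCap(coco, coco_res)
evaluator.params["image_id"] = coco_res.getImgIds()  # score only captioned images
evaluator.evaluate()

# Prints Bleu_1 ... Bleu_4, METEOR, ROUGE_L, CIDEr, SPICE
for metric, score in evaluator.eval.items():
    print(f"{metric}: {score:.3f}")
```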
Table 4. Experimental Comparison Results on the Flickr30k Dataset.

| Method Categories | Methods | B-1 | B-4 | M | R | C | S |
|---|---|---|---|---|---|---|---|
| Semantic capture enhancement models | A_R_L [37] | 69.8 | 27.7 | 21.5 | 48.5 | 57.4 | - |
| Multimodal feature fusion models | TSIC-CLIP [31] | - | 26.8 | 23.3 | 48.1 | 63.4 | - |
| Pre-training-based cross-modal representation models | Unified VLP [38] | 30.1 | - | 23.0 | - | 67.4 | 17.0 |
| Robustness-enhanced models | SR-PL [39] | 72.9 | 29.3 | 21.8 | 49.9 | 65.0 | 15.8 |
| Unsupervised/weakly supervised zero-shot generation models | MeaCap [40] | - | 15.3 | 20.6 | - | 50.2 | 14.5 |
| | MAGIC [19] | 44.5 | 6.4 | 13.1 | 31.6 | 20.4 | 7.1 |
| | ZeroGen [41] | 54.9 | 13.1 | 15.2 | 37.4 | 26.4 | 8.3 |
| The method of this article | Ours | * 75.8 | * 34.1 | * 33.9 | * 52.3 | * 78.6 | - |
In the table, bold indicates the best value for each metric, and * marks the results of the model proposed in this article; when our model achieves the best value, the entry is shown in bold with an asterisk. The reported improvements are statistically significant (p = 0.04).
Table 5. Ablation Experiment Results on the MSCOCO Dataset.

| Baseline | GAT-LFE | MSG | B-1 | B-4 | M | R | C | S | SUM |
|---|---|---|---|---|---|---|---|---|---|
| ✓ | | | 80.2 | 39.8 | 28.3 | 57.9 | 134.5 | 22.4 | 363.1 |
| ✓ | ✓ | | 81.2 | 41.5 | 30.4 | 58.8 | 143.3 | 22.9 | 378.1 |
| ✓ | ✓ | ✓ | * 82.9 | * 42.5 | * 32.8 | * 60.6 | * 156.7 | * 24.1 | * 399.6 |
In the table, bold indicates the best value for each metric, and * marks the results of the model proposed in this article; when our model achieves the best value, the entry is shown in bold with an asterisk. The reported improvements are statistically significant (p = 0.04).
Table 6. Ablation Experiment Results on the Flickr30k Dataset.

| Baseline | GAT-LFE | MSG | B-1 | B-4 | M | R | C | SUM |
|---|---|---|---|---|---|---|---|---|
| ✓ | | | 74.5 | 32.4 | 29.1 | 49.7 | 65.8 | 251.5 |
| ✓ | ✓ | | 75.3 | 32.9 | 29.7 | 51.8 | 71.4 | 261.1 |
| ✓ | ✓ | ✓ | * 75.8 | * 34.1 | * 33.9 | * 52.3 | * 78.6 | * 274.7 |
In the table, bold indicates the best value for each metric, and * marks the results of the model proposed in this article; when our model achieves the best value, the entry is shown in bold with an asterisk. The reported improvements are statistically significant (p = 0.04).
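To make the ablation settings in Tables 5 and 6 explicit, the sketch below encodes the three configurations (baseline, baseline + GAT-LFE, and the full model with multi-scale semantic guidance) as boolean flags; the AblationConfig class and flag names are hypothetical and are not taken from the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class AblationConfig:
    """Hypothetical toggles for the ablation settings in Tables 5 and 6."""
    use_gat_lfe: bool = False  # graph-attention-based local feature enhancement
    use_msg: bool = False      # multi-scale semantic guidance

ABLATION_SETTINGS = {
    "baseline": AblationConfig(),
    "baseline + GAT-LFE": AblationConfig(use_gat_lfe=True),
    "full model": AblationConfig(use_gat_lfe=True, use_msg=True),
}
```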
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
