DMFormer: Dense Memory Linformer for Image Captioning

He, Yuting; Jiang, Zetao

doi:10.3390/electronics14091716

Open AccessArticle

DMFormer: Dense Memory Linformer for Image Captioning

by

Yuting He

and

Zetao Jiang

^*

Guangxi Key Lab of Image and Graphic Intelligent Processing, Guilin University of Electronic Technology, Guilin 541004, China

^*

Author to whom correspondence should be addressed.

Electronics 2025, 14(9), 1716; https://doi.org/10.3390/electronics14091716

Submission received: 16 March 2025 / Revised: 19 April 2025 / Accepted: 22 April 2025 / Published: 23 April 2025

(This article belongs to the Special Issue AI/Machine Learning in Computer Vision/Image Processing and Natural Language Processing)

Download

Browse Figures

Versions Notes

Abstract

Image captioning is a cross-task of computer vision and natural language processing, aiming to describe image content in natural language. Existing methods still have deficiencies in modeling the spatial location and semantic correlation between image regions. Furthermore, these methods often exhibit insufficient interaction between image features and text features. To address these issues, we propose a Linformer-based image captioning method, the Dense Memory Linformer for Image Captioning (DMFormer), which has lower time and space complexity than the traditional Transformer architecture. The DMFormer contains two core modules: the Relation Memory Augmented Encoder (RMAE) and the Dense Memory Augmented Decoder (DMAD). In the RMAE, we propose Relation Memory Augmented Attention (RMAA), which combines explicit spatial perception and implicit spatial perception. It explicitly uses geometric information to model the geometric correlation between image regions and implicitly constructs memory unit matrices to learn the contextual information of image region features. In the DMAD, we introduce Dense Memory Augmented Cross Attention (DMACA). This module fully utilizes the low-level and high-level features generated by the RMAE through dense connections, and constructs memory units to store prior knowledge of image and text. It learns the cross-modal associations between visual and linguistic features through an adaptive gating mechanism. Experimental results on the MS-COCO dataset show that the descriptions generated by the DMFormer are richer and more accurate, with significant improvements in various evaluation metrics compared to mainstream methods.

Keywords:

image captioning; Linformer; relation memory augmented attention; dense memory augmented cross attention

1. Introduction

The image captioning task integrates the fields of computer vision and natural language processing. It aims to enable the computer to recognize the content of a given image and automatically generate natural language text that describes the image, thereby achieving a modality conversion from vision to language. This task is highly challenging as it not only requires the accurate identification of objects, their actions, and their attributes in the image, but also necessitates an understanding of the relationships between objects and the generation of grammatically correct descriptions. Image captioning has broad application prospects in the fields of human–computer interaction, image retrieval, and navigation for the visually impaired. It is a research topic with significant practical value [1].

Currently, most of the image captioning methods are based on the encoder–decoder framework. Early studies typically employed a Convolutional Neural Network (CNN) as the encoder to extract visual features from the image and used a Recurrent Neural Network (RNN) as the decoder to generate the description statement. However, these methods have shortcomings in capturing long-distance dependencies. In recent years, the Transformer architecture has become the mainstream approach for image captioning due to its strong ability to model long-distance dependencies and associate contextual information [2,3,4]. The core component of the Transformer is the Multi-Head Self-Attention (MHA) module, which has a time and space complexity of

O (n^{2})

. Recent studies have proposed a new module called Multi-Head Linear Self-Attention (MHLA), which has a complexity of

O (n)

and has comparable performance to traditional MHA [5,6]. The linear Transformer built upon MHLA has shown significant advantages in processing large-scale image captioning tasks. However, the application of Linformer in the field of image captioning is still limited, and its potential has not been fully explored. Therefore, we attempt to build an image captioning model using Linformer to leverage its efficiency and scalability.

In the encoder–decoder-based image captioning methods, the encoder models the contextual features of the image and passes the refined visual signals to the decoder to generate fluent text description. Therefore, designing efficient encoders and decoders is crucial for the image captioning task. Although recent methods have greatly promoted the development of the field of image captioning [5,6,7,8,9], but there still are the following shortcomings: (1) Each region in an image does not exist in isolation, but has complex spatial and semantic correlations with other regions. These spatial relationships are crucial for understanding the overall semantics of the image. However, existing methods are insufficient in capturing the spatial correlations between the internal features of image regions, lacking prior knowledge of the relationships between regional features. As a result, the generated description often fails to accurately describe the spatial structure of the image and struggles to associate contextual visual information. (2) The image captioning task is essentially an interactive process between image and text information. However, existing methods have shortcomings in the interaction between image features and text features, and cannot fully explore their semantic associations. This results in information loss and affects the accuracy and richness of the descriptions.

To address the aforementioned problems, Wang et al. [10] attempted to extend the absolute position encoding in the Transformer architecture from one-dimensional to two-dimensional, aiming to enhance the spatial perception of the model. However, this simple extension often performs poorly when dealing with complex and dense image content. Meanwhile, Kuo et al. [11] tried to introduce external knowledge to assist the cross-modal semantic interaction between images and text. Although this approach is theoretically innovative, it inevitably brings higher computational and time costs, which may hinder the development of subsequent related research. In contrast, multi-level spatial perception and cross-modal modeling show the potential to solve these problems. This strategy not only helps the model to understand the image content in depth, but also more effectively captures the complex spatial relationships and semantic correlations between image regions. Moreover, cross-modal modeling, by integrating the semantic information of images and text, is able to more comprehensively explore the semantic associations between them. This not only improves the accuracy of the descriptions, but also enhances the model’s ability to perceive contextual information.

To this end, we propose a novel image captioning method, the Dense Memory Linformer for Image Captioning (DMFormer), based on Linformer. By reducing time and space complexity, Linformer enables the construction of an efficient encoder–decoder structure. We design a low-complexity multi-layer cross encoder–decoder structure that includes two core components: the Relation Memory Augmented Encoder (RMAE) and the Dense Memory Augmented Decoder (DMAD). Specifically, in the RMAE, we introduce Relation Memory Augmented Attention (RMAA). This module encodes the complex spatial relationships between image regions using geometric information in an explicit spatially aware manner. Meanwhile, it learns the contextual information of relationships between image region features through memory units in an implicit spatially aware manner. The combination of explicit and implicit spatial perception enhances the model’s ability to capture image features rich in spatial information. In the DMAD, we propose Dense Memory Augmented Cross Attention (DMACA). This module adopts a multi-layer structure with dense connections to fully utilize both low-level and high-level features from the RMAE. It employs an adaptive gating mechanism to learn cross-modal associations between visual and language features. Additionally, memory units are constructed to store prior knowledge of images and text, capturing the intrinsic semantic information and global dependencies of objects and reducing information loss.

Our major contributions are summarized as follows:

We propose the RMAE. On the one hand, it encodes the complex spatial relationships between image regions using geometric information in an explicit spatially aware manner. On the other hand, it introduces memory units to design the Relation Memory Augmented Attention (RMAA), which learns the contextual information of relationships between image region features in an implicit spatially aware manner. The collaborative operation of the two spatial perception methods is conducive to capturing image features containing rich spatial information.
We propose the DMAD to fully utilize both low-level and high-level features from the RMAE through dense connections. We design the DMACA to associate visual and linguistic information across different layers, thereby capturing deeper correlations between image features and text descriptions.
We build a novel DMFormer model by integrating the above two modules based on Linformer. This model effectively handles the interaction between image and text features while capturing the spatial semantic correlations between image regions. Experimental results on the MS-COCO dataset show that compared to existing mainstream methods, our proposed method generates richer and more accurate sentences, with significant improvements in all evaluation metrics.

2. Related Work

2.1. Image Captioning

The task of image captioning is an interdisciplinary effort combining computer vision and natural language processing. Its goal is to enable the computer to automatically generate natural language text that describes the image content. Depending on the different implementation methods, image captioning methods can be divided into traditional methods and deep learning-based methods [12,13,14].

Traditional methods mainly include retrieval-based and template-based approaches. Retrieval-based methods identify the most similar image to the input image within a large-scale image database and use its corresponding caption as the output. However, this approach depends heavily on the accuracy of image retrieval and it is difficult to process new images that have not been seen before. Template-based methods generate captions using predefined grammatical rules and templates. However, the generated descriptions often lack diversity and flexibility, making it difficult to adapt to complex or new image content [15].

With the rise of deep learning, encoder–decoder frameworks have become mainstream. These methods typically use CNNs as encoders to extract visual features and RNNs as decoders to generate text. Despite their success, RNN-based methods suffer from gradient vanishing and low computational efficiency when handling long sequences. To address these issues, researchers have explored Transformer-based architectures, which leverage multi-head self-attention mechanisms to process sequence data in parallel, significantly improving caption quality. In recent years, Transformer-based image captioning methods have made significant progress [9,10,11,12,15,16]. Fang et al. [17] introduced a pure vision Transformer-based model that incorporates a novel Concept Token Network (CTN). The CTN predicts semantic concepts and integrates them into an end-to-end captioning framework, enhancing the semantic richness of the generated captions. Zeng et al. [18] proposed the Progressive Tree-Structured Prototype Network (PTSN), which is the first attempt to narrow down the scope of predicted words by modeling hierarchical textual semantics. This approach aims to improve the semantic consistency and accuracy of generated captions. Dubey et al. [19] proposed a label-attention Transformer that leverages geometrically coherent objects. They used a Deep Neural Network (DNN) to generate proposals of geometrically coherent objects and then employed a Label Attention Mechanism (LAM) to investigate the relationships among these objects for caption generation. Ge et al. [20] developed a double decoding Transformer framework designed to correct wrong words and thereby enhance caption quality. Their method employs a rectifier module to identify and correct errors in the generated text, resulting in more accurate and coherent captions. Zeng et al. [21] proposed a novel zero-shot image captioning framework called MeaCap, which significantly improves the consistency between generated captions and image content while reducing hallucination phenomena by introducing text memory and a retrieval-filtering module. Yang et al. [22] proposed SAMT-generator, an image captioning method based on a multi-stage Transformer network. It enriches the semantic information of visual features by integrating multi-scale features through a Multi-stage Transformer Feature Enhancement Network (MT-FEN).

Despite these advancements, existing image captioning methods still have some limitations [19,20,21,22,23,24,25,26]: (1) Due to the time and space complexity of

O (n^{2})

, the computational efficiency is low when dealing with large-scale image data, which limits its scalability in practical applications. (2) They fail to effectively capture the spatial and semantic correlations between the internal features of image regions. (3) The interaction between image features and text features remains underdeveloped, resulting in captions that lack semantic accuracy and richness. To overcome these challenges, we propose the DMFormer, a novel image captioning model based on Linformer. The DMFormer introduces the RMAE and the DMAD, incorporating two new attention mechanisms and dense connections between encoder and decoder layers to enhance feature interaction and caption quality.

2.2. Linformer and Its Advantages

Linformer is an improved Transformer architecture, which reduces the time and space complexity from

O (n^{2})

to

O (n)

by low-rank matrix approximation, and significantly improves the memory and time efficiency of the model [27]. Its core idea is to project the Key and Value components of the input sequence into a lower-dimensional space, thereby reducing the amount of calculation while maintaining the performance of the Transformer. This high efficiency of Linformer gives it significant advantages in dealing with large-scale image captioning tasks. However, applications of Linformer in the field of image captioning are still limited, and its full potential remains to be explored.

Linformer is mainly based on Multi-Head Linear Self-Attention (MHLA), which enables different positions within the model to notice feature information in different sub-spaces. MHLA concatenates multiple attention mechanisms, achieving efficient feature interaction. Its structure is shown in Figure 1 below, and the computation is as follows:

M H L A (Q, K, V) = C o n c a t (h e a d_{1}, h e a d_{2}, …, h e a d_{h}) W^{o},

(1)

where the attention inputs are composed of queries

Q

, keys

K

, and values

V

with dimensions

d_{k}

,

d_{k}

, and

d_{n}

;

n

is the sequence length,

W^{o}

is the trainable projection matrix, and

h

is the number of attention heads. The computation for each attention head is as follows:

h e a d_{i} = A t t e n t i o n (Q W_{i}^{Q}, E_{i} K W_{i}^{K}, F_{i} V W_{i}^{V}) = softmax [\frac{Q W_{i}^{Q} {(E_{i} K W_{i}^{K})}^{T}}{\sqrt{d_{k}}}] \cdot F_{i} V W_{i}^{V}

(2)

where

W_{i}^{Q}, W_{i}^{K} \in R^{d_{m} \times d_{k}}

,

W_{i}^{V} \in R^{d_{m} \times d_{v}}

,

W_{i}^{O} \in R^{h d_{v} \times d_{m}}

are the trainable projection matrices.

E_{i}, F_{i} \in R^{n \times k}

are the trainable linear projection matrices.

Despite the progress made by existing methods in the task of image captioning [28,29], the existing methods still fall short in capturing the spatial and semantic correlations between the internal features of image regions, as well as in fully leveraging the interaction between image and text features. Additionally, the high complexity of the Transformer architecture also limits its application efficiency on large-scale datasets. To address these issues, we apply Linformer to the image captioning task and construct the DMFormer model. By introducing two mechanisms, RMAA and DMACA, the model effectively overcomes the limitations of the existing methods. The contributions of this paper lie in proposing a new efficient framework for image captioning. This framework significantly enhances the accuracy and richness of captions by fully utilizing the interaction between image and text features.

3. Proposed Method

We propose a novel low-complexity multi-layer cross encoder–decoder structure based on Linformer, namely the DMFormer. The DMFormer mainly consists of the encoder-RMAE for extracting visual features from the images and the decoder-DMAD for generating sentences. Specifically, in the RMAE, we introduce RMAA. On the one hand, we encode the complex spatial relationships between image regions using geometric information, explicitly perceiving spatial signals in the image. On the other hand, we construct a memory unit matrix to implicitly learn the contextual information of the image region features. The combination of these two spatial perception methods is conducive to deepening the model’s understanding of image content. In the DMAD, we propose DMACA. To fully explore the semantic associations between image and text features, on the one hand, we adopt a multi-layer structure with dense connections to better utilize both the low-level and high-level features generated by the RMAE. On the other hand, we employ an adaptive gating mechanism to learn cross-modal associations between visual and language features and construct memory units to store prior knowledge of images and text. This approach captures the intrinsic semantic information and global dependencies of objects, reducing information loss. The specific model structure is shown in Figure 2.

3.1. Positional Encoding

Since the model does not use any recurrence or convolution, it cannot directly capture the order of sequence. To enable the model to utilize the order information of the input sequence, it is essential to fully incorporate the relative or absolute positional information of the sequence tokens. In this paper, we apply positional encoding to the input before feeding it into the encoder and decoder. Most existing methods only model the positional relationships of regions in a relative manner. However, to better leverage the spatial geometric information between different types of image features, we integrate both absolute and relative positional information in an explicit spatially aware manner. This approach aims to simulate the complex visual and positional relationships between input features. Specifically, we employ the following two types of positional encoding:

Firstly, we use sine and cosine functions to obtain the absolute positional encodings. For each position

i

, the absolute positional encoding

P E (i)

is defined as follows:

P E (p o s, 2 i) = \sin (p o s / {10,000}^{2 i / d_{m o d e l}}),

(3)

P E (p o s, 2 i + 1) = \cos (p o s / {10,000}^{2 i / d_{m o d e l}}),

(4)

where

p o s

is the position index,

i

is the dimension index, and

d_{m o d e l}

is the embedding dimension.

Secondly, to better integrate the positional information of the visual features, we add relative positional information based on the geometric structure of the bounding boxes. Assume the bounding box of a region can be represented as

(x, y, w, h)

, where

(x, y)

denotes the center coordinates, and

w

and

h

represent the width and height of the box, respectively. The relative positional encoding is defined as follows:

r (i, j) = (\log (\frac{| x_{i} - x_{j} |}{w_{i}}), \log (\frac{| y_{i} - y_{j} |}{h_{i}}), \log (\frac{w_{i}}{w_{j}}), \log (\frac{h_{i}}{h_{j}})),

(5)

r^{'} (i, j) = R e L U (W_{G} F C (r (i, j))),

(6)

where

i

and

j

denote the indices of two regions, and

r (i, j)

represents the positional encoding of region

i

relative to region

j

.

By combining absolute and relative positional encodings, our method can more comprehensively capture the spatial geometric relationships between image regions, thereby generating more accurate and richer descriptions.

3.2. Relation Memory Augmented Encoder

In the image captioning task, the role of the encoder is to refine the visual representation of the image, and its ability to capture visual semantic information is crucial. Each region in the image does not exist in isolation, but has complex spatial and semantic correlations with other regions. These spatial relationships are vital for understanding the overall semantics of the image. However, existing methods fall short in capturing the spatial position correlations between the internal features of image regions, lacking prior knowledge of the relationships between regional features. As a result, the generated descriptions often cannot accurately depict the spatial structure of the image and struggle to associate contextual visual information. To address this problem, we design the encoder-RMAE. Combined with explicit and implicit spatial perception, we propose RMAA to embed geometric information into the attention module with memory units. This approach enables the modeling of complex spatial relationships between regional features and captures the overall semantic structure of the image, allowing the model to generate richer semantic information.

The RMAE consists of three identical encoder layers stacked together. Each encoder layer contains two sub-layers: the first sub-layer is Multi-Head Relation Memory Augmented Attention, which is used to capture the correlation coefficients in multiple dimensions between features. By the combination of explicit and implicit spatial perception, it enhances the model’s understanding of the complex spatial relationships between image regions. The second sub-layer is the Position-wise Feed-Forward Network (FFN), which performs dimension transformation and enhances the expression ability of the model through a fully connected network of two linear layers. Residual connection and layer normalization are used between all sub-layers, which can not only accelerate the convergence speed, but also effectively avoid the vanishing gradient problem during training. The architecture of the encoder layer is shown in Figure 3 and is defined as follows:

Z = L a y e r N o r m (M u l t i H e a d R M A A (X) + X),

(7)

\bar{X} = L a y e r N o r m (F F N (Z) + Z),

(8)

F F N (Z) = G e L U (W_{1} Z + b_{1}) W_{2} + b_{2},

(9)

where

M u l t i H e a d R M A A (\cdot)

denotes the Multi-Head Relation Memory Augmented Attention module,

L a y e r N o r m (\cdot)

denotes the normalization operation of the layers,

W_{1}

and

W_{2}

are trainable weight matrices,

b_{1}

and

b_{2}

are bias terms, and

G e L U

denotes the Gaussian error linear unit.

By stacking the aforementioned encoder layers

N

times and using the output of the

i - 1

-th layer as the input to the

i

-th layer, the result is equivalent to creating a multi-level encoding of relationships between image regions. The shallow encodings tend to model the association between single visual elements, while deeper encodings focus on the semantic relationship after cascading multiple visual elements. There are complementary advantages between the encoding results of these different layers. Therefore, we concatenate these results, denoted as

\bar{X} = ({\bar{X}}_{1}, {\bar{X}}_{2}, …, {\bar{X}}_{N})

, and feed them into the decoder for cross-modal modeling.

Relation Memory Augmented Attention

To more accurately model the spatial and semantic relationships between image regions, thereby generating descriptions with explicit spatial structures, we propose RMAA. RMAA explicitly leverages geometric information to model the geometric relationships between image regions, and implicitly constructs a memory unit matrix to store the contextual information of image region features. The combination of these two structures enhances the model’s understanding of image content, enabling it to capture rich image features that contain deep spatial relationships. It not only strengthens the understanding of image spatial structure, but also provides a solid foundation for generating more detailed and accurate descriptions.

Firstly, to enhance the relative position representation of image features, we incorporate explicit spatially aware region geometry into the computation of the Self-Attention module. Specifically, we add the computed position representation after the Scaled Dot-Product Attention in Self-Attention. This design allows the model to directly leverage geometric information to enhance its understanding of the spatial relationships between image regions. This is defined as follows:

A t t e n t i o n (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}} + λ) V,

(10)

where

λ

is the position encoding used to enhance the position representation of image features. The introduction of the position encoding

λ

enables the model to capture the positional information between image regions, thereby more accurately modeling the spatial structure of the image.

Secondly, to further enhance the model’s understanding of the contextual relationships between image regions, RMAA introduces implicit spatial perception. Specifically, additional memory unit matrices

M_{K}

and

M_{V}

are added to the Key and Value components. These matrices can be automatically learned through Stochastic Gradient Descent (SGD), allowing the model to capture prior knowledge that cannot be expressed by the input features. This is defined as follows:

R M A A (X) = A t t e n t i o n (W_{Q} X, K, V),

(11)

K = [W_{K} E X, M_{K}],

(12)

V = [W_{V} F X, M_{V}],

(13)

where

M_{K}

and

M_{V}

are learnable matrices with

n_{m}

rows, which are used to store the contextual information between image region features. By introducing these memory unit matrices, the model is able to more effectively capture the semantic associations between image regions, thereby enhancing its understanding of the image content.

Finally, to fully leverage the advantages of the multi-head attention mechanism, RMAA employs a multi-head approach. The outputs from different heads are concatenated and then projected through a linear layer. This design enables the model to capture complex relationships between image features from multiple perspectives and integrates this information via linear projection, thereby generating more accurate descriptions. This is defined as follows:

M u l t i H e a d R M A A (Q, K, V) = C o n c a t (R M A A_{1}, R M A A_{2}, …, R M A A_{h}) W^{o},

(14)

R M A A_{i} = A t t e n t i o n (Q W_{i}^{Q}, E_{i} M_{i}^{K} K W_{i}^{K}, F_{i} M_{i}^{V} V W_{i}^{V}) = softmax [\frac{Q W_{i}^{Q} {(E_{i} M_{i}^{K} K W_{i}^{K})}^{T}}{\sqrt{d_{k}}}] \cdot F_{i} M_{i}^{V} V W_{i}^{V}

(15)

where

W_{i}^{Q}

,

W_{i}^{K}

, and

W_{i}^{V}

are the weight matrices for the

i

-th head, and

W^{o}

is the linear projection matrix for the output. The multi-head attention mechanism allows the model to capture the relationship between features from multiple sub-spaces, while linear projection integrates this information into a unified representation.

Through the joint modeling of explicit and implicit spatial perception, RMAA not only leverages explicit spatial information to enhance the model’s understanding of image region locations, but also captures more complex contextual relationships via implicit spatial perception. This integrated approach enables the model to more comprehensively grasp the overall semantic structure of the image, thereby generating richer semantic information.

3.3. Dense Memory Augmented Decoder

The role of the decoder is to transform the visual features extracted by the encoder into a text sequence that describes the image content. This process is essentially an interaction between image and text information. However, existing methods have limitations in the interaction between image features and text features, failing to fully explore the semantic associations between them. This leads to information loss, which affects the accuracy and richness of the generated description. To address this problem, we propose the DMAD. The DMAD employs a multi-layer structure that densely connects all multi-layer feature representations from the encoder. This design aims to reduce the loss of feature information extracted from the RMAE during sentence generation, ensuring that both low-level and high-level features are fully utilized. Furthermore, the DMAD introduces DMACA. DMACA incorporates memory units that collect current image information as persistent memory, which is continuously updated during training. These memory units serve as additional semantic information sources to help the model capture the semantic relevance in the input sequence. They further capture the intrinsic semantic information and global dependencies of the object, thereby reducing information loss. In this way, the model can better utilize information from previous positions to guide the generation of descriptions at the current position. This enhances the context-aware ability of the model and improves the quality and consistency of the generated descriptions. Additionally, DMACA employs an adaptive gating mechanism to learn cross-modal associations between visual and language features, further enhancing the model’s ability to capture semantic associations.

The DMAD consists of three stacked decoder layers with identical structures. Each decoder layer contains three sub-layers: The first is Masked Multi-Head Linear Attention, which ensures that the output at the current position is only related to the input of the previous position to prevent the position from involving subsequent positions. The second sub-layer is Multi-Head Dense Memory Augmented Cross Attention, which captures the correlation coefficients between image features and words in multiple dimensions. The third sub-layer is FFN, which enhances the expression ability of the model through a fully connected network with two linear layers for dimension transformation. Each sub-layer employs residual connection and layer normalization, which can not only accelerate the convergence speed, but also effectively prevent the vanishing gradient problem during training. The architecture of the decoder layers is shown in Figure 4, which is defined as follows.

Z_{1} = L a y e r N o r m (M u l t i H e a d_{m a s k} (Y) + Y),

(16)

Z_{2} = L a y e r N o r m (M u l t i H e a d D M A C A (\bar{X}, Z_{1}) + Z_{1}),

(17)

\bar{Y} = L a y e r N o r m (F F N (Z_{2}) + Z_{2}),

(18)

where

Y

represents the input features,

M u l t i H e a d_{m a s k} (\cdot)

denotes the Masked Multi-Head Linear Attention module,

M u l t i H e a d D M A C A (\cdot)

denotes the Multi-Head Dense Memory Augmented Cross Attention module, and

L a y e r N o r m (\cdot)

represents the layer normalization operation.

Dense Memory Augmented Cross Attention

To fully explore the semantic associations between visual and textual features, we propose DMACA. DMACA effectively integrates the semantic associations between visual and textual features by introducing additional memory unit matrices, cross-attention of multi-level features, and an adaptive gating mechanism. Firstly, DMACA introduces additional memory unit matrices

M_{K}

and

M_{V}

to store the prior knowledge of image region features. These memory unit matrices are automatically updated through backpropagation during training and can capture prior knowledge that cannot be expressed by the input features. This design enables the model to leverage richer semantic information to enhance its understanding of image content. Secondly, to fully utilize the multi-level features output by the encoder output, DMACA performs cross-attention between the decoder input

Y

and the output information from all encoder layers. In this way, the model can capture different levels of feature information, so as to understand the image content more comprehensively. Specifically, the cross-attention mechanism enables the decoder to dynamically focus on multiple layers of image features when generating textual descriptions. Finally, to further integrate cross-attention information from different encoder layers, DMACA introduces an additional adaptive gating mechanism. The mechanism dynamically adjusts the importance of each layer’s features by learning the contribution weights of the different encoding layers. This adaptive adjustment enables the model to make more flexible use of multi-level feature information, thereby better capturing the semantic associations between visual and textual features. Specifically, this is defined as follows:

D M A C A (\bar{X}, Y) = \sum_{i = 1}^{N} α_{i} ⊙ C_{m e m} ({\bar{X}}_{i}, Y),

(19)

C_{m e m} ({\bar{X}}_{i}, Y) = A t t e n t i o n (W_{Q} Y, K {\bar{X}}_{i}, V {\bar{X}}_{i}),

(20)

K = [W_{K} E {\bar{X}}_{i}, M_{K}],

(21)

V = [W_{V} F {\bar{X}}_{i}, M_{V}],

(22)

where

α_{i}

is a weight matrix that represents the contribution of each encoding layer and the relative importance between the different encoding layers. It is calculated based on the correlation between the cross-attention results of each encoder layer and the input query. This is defined as follows:

α_{i} = G e L U (W_{i} [Y, C ({\bar{X}}_{i}, Y)] + b_{i}),

(23)

where

[,]

denotes the concatenation,

W_{i}

is the weight matrix of

2 d \times d

, and

b_{i}

is a learnable bias vector.

Through the above design, DMACA is capable of effectively integrating multi-level feature information, thereby enhancing the model’s understanding of the relationships between image regions. This design not only makes full use of the detailed information of low-level features, but also incorporates the semantic information from high-level features, so as to more comprehensively mine the semantic association between image features and text features. Ultimately, DMACA is able to generate more accurate and richer descriptions, thereby improving the overall performance of the model.

3.4. Training Strategy

We adopt two-stage training. Firstly, we use the cross-entropy loss function [30] for pre-training, followed by fine-tuning using a reinforcement learning-based method [31]. The specific process is as follows:

Firstly, we employ the cross-entropy loss function for pre-training. Given a word sequence

Y = (y_{1}, y_{2}, …, y_{T})

of a real annotated dataset, the optimization objective is to minimize the cross-entropy loss as follows:

L (θ) = - \sum_{t = 1}^{T} \log (p_{θ} (y_{t} | y_{1}, …, y_{t - 1})),

(24)

where

θ

is the model parameter and

T

denotes the length of the word sequence.

Then, we fine-tune the model using reinforcement learning to optimize for the CIDEr metric. The goal of the training model is to minimize the negative expected reward score:

L (θ) = - E_{Y_{s} \sim p_{θ}} [r (Y_{s})],

(25)

where

Y_{s}

denotes the sentence obtained by the model through Monte Carlo sampling, and

r

denotes the score of CIDEr. The gradient of the above objective function is calculated using reinforcement learning methods as follows:

\nabla_{θ} L (θ) = - E_{Y_{s} \sim p_{θ}} [r (Y_{s}) \nabla_{θ} \log P_{θ} (Y_{s})],

(26)

Finally, to mitigate training instability and high variance in the gradient estimates, we use the reward of the sentence generated by the current model during the test stage as a baseline. Specifically, we employ beam search to generate sentences and use their CIDEr scores as the baseline rewards. This helps stabilize the training process and reduce the variance in gradient updates.

E = \nabla_{θ} L (θ) \approx - (r (Y_{s}) - r ({\hat{Y}}_{s})) \nabla_{θ} \log p_{θ} (Y_{s}),

(27)

where

r ({\hat{Y}}_{s})

is the baseline reward.

4. Experiments

4.1. Dataset and Evaluation Metrics

Dataset: To validate the effectiveness of our proposed DMFormer, we conducted extensive experiments on the MS-COCO dataset, which is the most widely used and largest benchmark dataset in image captioning. The dataset contains 82,783 training images, 40,504 validation images, and 40,775 testing images. Each image is annotated with at least five captions. For a fair comparison, we divided all those images and their corresponding captions into three pairs of sets, 113,287 for training, 5000 for validation, and 5000 for testing.

Caption Preprocessing and Tokenization: (1) Text Cleaning: Captions are converted to lowercase and stripped of special characters and digits to ensure uniformity. (2) Tokenization: We use the word_tokenize function from the NLTK library to split captions into individual words. (3) Sequence Padding: Sequences are padded to ensure consistent length, which is essential for batch processing during training.

We use the word_tokenize function from the NLTK library to split captions into individual words, and any words not included in the 10k token vocabulary are treated as unknown tokens. Additionally, we specify that the maximum sequence length for padding is set to 20, ensuring all sequences are of uniform length for batch processing.

Evaluation metrics: To evaluate the quality of the generated captions, we employed a variety of automatic evaluation metrics based on the similarity between ground-truth captions and generated captions. Following the standard evaluation protocol, we used the full set of captioning metrics, including BLEU [32], METEOR [33], ROUGE [34], CIDEr [35], and SPICE [36].

4.2. Implementation Details

To extract visual features, we used the pre-trained Faster-RCNN with ResNet-101 backbone, fine-tuned on the Visual Genome dataset. We extracted 2048 dimensional features from the first FC-layer of the detection head. These features correspond to the outputs of the last convolutional layer of ResNet-101, with a spatial resolution of 7 × 7 × 2048. In our implementation, we set the output dimension

d_{m o d e l}

to 512 and the number of attention heads to 8. The visual and textual memory vectors were set to 40 and 20, respectively. Both the encoder and decoder consisted of 3 layers. We employed dropout with a keep probability of 0.9 after each attention and feed-forward layer. We used the Adam optimizer to train the model, with the batch size set to 50 and the beam size for beam search set to 5. In the first stage of pre-training, which is based on cross-entropy loss, we adopt a standard learning rate strategy. The baseline learning rate is set to 1 × 10⁻⁴, and the number of training epochs is set to 28. In the second stage, which involves CIDEr-D optimization, we adjusted the learning rate in three stages based on the number of training epochs: (1) Before epoch 28, we used a fixed baseline learning rate of 5 × 10⁻⁶. (2) Between epochs 28 and 50, we adjusted the baseline learning rate to 5 × 10⁻⁷. (3) After epoch 50, we applied a composite exponential decay strategy to adjust the learning rate dynamically. The formula for the reinforcement learning rate is as follows:

l a m_l r = \{\begin{matrix} base_l r \\ b a s e_l r * 0.1 \\ d_{\mod a l}^{- 0.5} * \min (e^{- 0.5}, e * w^{- 1.5}) \end{matrix} \begin{matrix} e < = 28 \\ 28 < e < = 50 \\ e > 50 \end{matrix},

(28)

where

base_l r

is the baseline learning rate;

w

is the learning rate scheduling strategy period of the warming cycle, set to 20,000;

e

is the number of rounds of the current training;

\min ()

is the minimum value calculation function; and

d_{model}

is the input and output dimension of each layer of the model.

In our work, the memory units

M_{k}

and

M_{v}

are designed as learnable parameters optimized via stochastic gradient descent (SGD). Below are the key technical details:

(1) Initialization:

M_{k}

and

M_{v}

are initialized by sampling from a normal distribution with zero-mean and a small standard deviation (

δ

= 0.01). This low-variance initialization prevents unstable attention weights in early training stages, aligning with standard practices in Transformer-based models. It ensures that initial memory entries are diverse yet bounded, avoiding premature convergence to suboptimal patterns.

(2) Update Mechanism: During training,

M_{k}

and

M_{v}

are updated via SGD with momentum (

β

= 0.9), where the gradients are computed through backpropagation. Specifically, the updates are performed as follows:

M_{K}^{(t + 1)} = M_{K}^{(t)} - α (\frac{\partial L}{\partial M_{K}} + β \cdot m o m e n t u m_b u f f e r_{MK}),

(29)

M_{V}^{(t + 1)} = M_{V}^{(t)} - α (\frac{\partial L}{\partial M_{V}} + β \cdot m o m e n t u m_b u f f e r_{MV}),

(30)

where

α

is the learning rate,

\frac{\partial L}{\partial M_{K}}

and

\frac{\partial L}{\partial M_{V}}

represent the gradient of the loss

L

with respect to

M_{k}

and

M_{v}

,

β

controls the momentum term, and

m o m e n t u m_b u f f e r_{MK}

and

m o m e n t u m_b u f f e r_{MV}

store the accumulated gradients from previous updates. The use of momentum helps accelerate the convergence and smooth out noisy gradients, which is particularly beneficial for high-dimensional parameters like

M_{k}

and

M_{v}

.

This design ensures that the memory units are optimized in a stable and efficient manner, while allowing them to adapt dynamically during training.

4.3. Experimental Results and Analysis

4.3.1. Analysis of Module Ablation Experiments

In order to verify the effectiveness of each module proposed in the DMFormer, a number of ablation experiments were conducted to investigate the contribution of each module. Firstly, we use an image captioning model based on the original Linformer as the baseline model. Secondly, we use the RMAE module to replace the encoder of the baseline model. Thirdly, the decoder of the baseline model is replaced by the DMAD module. Finally, the RMAE and DMAD are fused to construct the DMFormer model. The above experiments are conducted on the standard MSCOCO dataset and the experimental results are shown in Table 1.

From the results of Table 1, it can be seen that the various modules of the DMFormer have made significant contributions to the performance improvement of the model. Specifically, the following contributions were made: (1) After incorporating the RMAE module, the model is able to more effectively extract spatial relationships and semantic information from image regions, increasing the CIDEr score from 121.8 to 127.1. (2) By replacing the decoder of the baseline model with the DMAD module, the model is able to dynamically focus on the bidirectional relationships between visual and textual features, providing richer reference information for text generation. This improvement raises the CIDEr score from 121.8 to 130.5. (3) When the RMAE and DMAD modules are used as the encoder and decoder, respectively, the DMFormer achieves significant improvements across all evaluation metrics. This shows that by integrating these two modules, the DMFormer not only performs well in semantic information extraction and feature interaction but also optimizes the overall architecture, reducing information loss and further enhancing the accuracy and richness of the generated descriptions.

4.3.2. Analysis of Memory Vectors Ablation Experiments

To validate our choice of memory vector dimensions, we conducted ablation studies comparing different numbers of memory vectors in both the encoder and decoder. The results, summarized in Table 2 and Table 3, demonstrate that

n_{m}

= 40 for the encoder and

n_{m}

= 20 for the decoder yield optimal performance across multiple evaluation metrics.

4.3.3. Analysis of RMAA Variants in Ablation Experiments

To validate the performance of RMAA, we have conducted ablation studies across different configurations, including explicit-only, implicit-only, and combined configurations. The results summarized in Table 4 consistently demonstrate that the combined approach outperforms the individual variants in terms of multiple metrics, providing solid empirical evidence for the effectiveness of our design choice.

4.3.4. Comparison and Analysis with Advanced Baseline Models

To verify that the performance improvement of the DMFormer is not dependent on the regional visual features extracted by the Faster R-CNN object detector, we conducted comparative experiments on the MSCOCO dataset. In the experiments, all models were trained based on the regional visual features extracted by Faster R-CNN to ensure the consistency of visual features. Additionally, we set the input and output dimensions

d_{m o d e l}

of the encoder and decoder layers of all models to 512 and fixed the number of training epochs to 50. The experimental results are shown in Table 5.

The experimental results demonstrate that the DMFormer can achieve significantly better performance than other state-of-the-art methods under the same visual features and architecture configurations. This indicates that the performance advantage of the DMFormer does not merely stem from high-quality visual features, but is primarily attributed to its distinctive module design and architectural optimization. These innovations allow the DMFormer to more effectively explore the semantic connections between visual and textual information, thereby producing more accurate and richer image captions.

4.3.5. Comparative Analysis with Advanced Models

The state-of-the-art of the DMFormer model is validated by comparing it to the existing mainstream models using the performance metrics on the MSCOCO Karpathy test. Several mainstream methods involved in the comparison include SCST [31], RFNet [40], Up-Down [41], GCN-LSTM [42], SGAE [37], ORT [38], AoANet [39], M2-Transformer [29], X-Transformer [24], RSTNet [43], DGET [44], GAT [10], ViTCAP [17], CATNet [45], MAENet [46], D2 Transformer [20], VaT [47], AS-Transformer [48], LATGeO [19]. As can be seen from Table 6, the DMFormer demonstrates excellent performance, especially in terms of the CIDEr metric, which is highly valued in the field of image captioning, achieving a score of 133.2. This indicates that the DMFormer is highly effective in generating captions that closely match human references. Additionally, the DMFormer achieves remarkable results in other key metrics, including BLEU-1, BLEU-4, METEOR, ROUGE-L, and SPICE. These results highlight the DMFormer’s ability to generate high-quality captions that are both accurate and diverse. The superior performance of the DMFormer can be attributed to its innovative modules and architecture. The RMAE module enhances the extraction of spatial relationships and semantic information from image regions, while the DMAD module improves the interaction between visual and textual features. Together, these modules enable the DMFormer to generate more accurate and contextually rich captions compared to other state-of-the-art models.

4.3.6. Computational Efficiency Analysis

To prove the superiority of the DMFormer in terms of the trade-off between the amount of calculations (FLOPS), parameter quantity (Params), and reasoning speed (Inference Time), we can follow a structured approach. Below is a step-by-step demonstration of the superiority of the DMFormer based on the information provided in Table 7:

The results are shown in Table 2. The DMFormer demonstrates the best trade-off among the amount of calculations, parameter quantity, and reasoning speed. It achieves significant improvements in captioning performance with moderate increases in computational resources and parameter count, while maintaining faster inference times compared to other state-of-the-art models. This makes the DMFormer a superior choice for image captioning tasks, offering a balanced and efficient solution.

4.3.7. Qualitative Analysis

To intuitively demonstrate the performance improvement of the DMFormer model, Figure 5 compares manually labeled statements, captions generated by the baseline model, and captions generated by the DMFormer model. For the first image, our method generates more informative and accurate captions than the baseline by capturing both spatial arrangement (e.g., “in front of a clock tower”) and event context (e.g., “festival type”). This demonstrates the model’s ability to incorporate spatial and contextual information effectively. For the second image, our method generates more precise captions by identifying specific details (e.g., “bananas and oranges”) and adding contextual information (e.g., “next to plates”). This highlights the model’s capability to capture finer details and relationships between objects. For the third image, while the DMFormer adds contextual details (“blue lights on the side of the road”), it fails to mention other relevant elements such as pedestrians or additional objects in the background. This indicates that the model may overlook less prominent details in complex scenes. For the fourth image, the DMFormer specifies key elements like a wooden table, chairs, and a couch. However, it does not accurately describe the color or style of these objects, showing that the model may struggle with finer aesthetic details. For the fifth image, our method captures the scene vividly with details like “adorable cat,” “big eyes,” and “luggage left open,” making the description more engaging and precise. For the sixth image, our method describes the scene more precisely with details like “fenced off field,” showing its ability to generate detailed and contextually rich captions.

4.3.8. Significance Analysis

Comparisons following standard evaluation criterion demonstrate that our DMFormer outperforms several existing methods. In order to verify the performance of the DMFormer more comprehensively, we perform significance analysis from a statistical point of view. First, we take 5000 images from the Karpathy test split as samples and make our DMFormer and existing methods predict their captions. We then use the CIDEr score, the most highly valued in image captioning, to measure of the quality of the generated captions. Finally, paired two-tailed t-tests were employed to assess whether there were significant differences, and the results are shown in Table 8.

The results of the t-tests indicate that the improvements are statistically significant at the 0.05 significance level. This provides strong evidence that our method outperforms the baselines in a reliable and repeatable manner.

5. Limitations and Future Work

5.1. Limitations

Inference Speed: While the DMFormer achieves competitive performance, its inference speed may not be sufficient for real-time applications requiring faster response times. Future work could explore model optimization techniques to further reduce inference time.

Memory Usage: The DMFormer requires approximately 4.5 GB of GPU memory during inference, which may be a limitation in environments with constrained computational resources. We plan to investigate more memory-efficient architectures and optimization strategies.

Robustness: The robustness of the DMFormer in real-world scenarios with diverse and complex imagery remains to be tested. We anticipate that extreme cases, such as highly ambiguous images or those with severe occlusions, may pose challenges. Future work will include testing on more diverse datasets.

Scalability: Deploying the DMFormer in real-world applications with massive and dynamic datasets will require further evaluation. We plan to explore distributed computing and incremental learning approaches to enhance scalability.

5.2. Future Work

Model Optimization: We plan to explore techniques such as quantization, pruning, and knowledge distillation to create more efficient versions of the DMFormer without significantly compromising performance.

Real-World Testing: We aim to conduct extensive testing in real-world scenarios to better understand and address the model’s limitations.

Hardware Acceleration: Investigating hardware acceleration options, such as TPUs and specialized GPUs, could further enhance inference speed and memory efficiency.

Enhanced Robustness: We will incorporate advanced data augmentation techniques and robustness training methods to improve the model’s performance on challenging and diverse datasets.

Multi-Modal Applications: Extending the DMFormer to multi-modal applications, such as video captioning and visual question answering, is another promising direction for future research.

6. Conclusions

We introduce a novel image captioning method named the DMFormer, which aims to address the limitations of existing methods in modeling spatial and semantic relationships between image regions, as well as the insufficient interaction between visual and textual features. The DMFormer is built upon the Linformer architecture, inheriting its low-complexity advantages, and incorporates a multi-layer RMAE-DMAD structure to more efficiently handle the image captioning task. The RMAE combines explicit spatial perception and implicit spatial perception, proposing a new attention module called RMAA. This module embeds geometric information into an attention mechanism augmented with memory units, effectively modeling the complex spatial relationships between image region features and capturing the overall semantic structure of the image. This design enables the model to generate richer and more accurate semantic information, thereby better reflecting the spatial structure and contextual information of the image. The DMAD adopts a multi-layer structure that densely connects multi-layer feature representations from all encoder layers. This module introduces DMACA, which constructs memory units to store prior knowledge of both visual and textual features. Through an adaptive gating mechanism, DMACA learns cross-modal associations between visual and language features. This design further explores the semantic associations between visual and textual features, reduces information loss, and generates more accurate and richer descriptions. Experiments conducted on the MS-COCO dataset demonstrate that the DMFormer generates more accurate and richer descriptions, with significant improvements in various evaluation metrics compared to mainstream methods.

Author Contributions

Conceptualization, Y.H. and Z.J.; methodology, Y.H.; software, Y.H. and Z.J.; validation, Y.H. and Z.J.; formal analysis, Y.H.; investigation, Y.H. and Z.J.; resources, Y.H. and Z.J.; data curation, Y.H.; writing—original draft preparation, Y.H.; writing—review and editing, Y.H. and Z.J.; visualization, Y.H. and Z.J.; supervision, Y.H. and Z.J.; project administration, Y.H. and Z.J.; funding acquisition, Y.H. and Z.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Nature Science Foundation of China under Grants No. 62473105 and No. 62172118; the Nature Science key Foundation of Guangxi under Grant No. 2021GXNSFDA196002; the Guangxi Key Laboratory of Image and Graphic Intelligent Processing under Grants No. GIIP2302, No.GIIP2303, and No.GIIP2304; and the Innovation Project of Guang Xi Graduate Education under Grants No. 2024YCXB09 and No. 2024YCXS039.

Data Availability Statement

The datasets generated during and/or analyzed during the current study are available in the MSCOCO repository, https://cocodataset.org/ (accessed on 2 January 2025). The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

References

Abdar, M.; Kollati, M.; Kuraparthi, S.; Pourpanah, F.; McDuff, D.; Ghavamzadeh, M.; Yan, S.; Mohamed, A.; Khosravi, A.; Cambria, E.; et al. A review of deep learning for video captioning. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 1–20. [Google Scholar] [CrossRef] [PubMed]
Sharma, H. A Survey on Image Encoders and Language Models for Image Captioning. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1116, 012118. [Google Scholar] [CrossRef]
De Silva, V.; Sumanathilaka, T. A Survey on Image Captioning Using Object Detection and NLP. In Proceedings of the 2024 4th International Conference on Advanced Research in Computing (ICARC), Belihuloya, Sri Lanka, 21–24 February 2024; pp. 270–275. [Google Scholar]
Daneshfar, F.; Bartani, A.; Lotfi, P. Image captioning by diffusion models: A survey. Eng. Appl. Artif. Intell. 2024, 138, 109288. [Google Scholar] [CrossRef]
Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv 2020, arXiv:2006.04768. [Google Scholar]
Verma, M. Revisiting Linformer with a modified self-attention with linear complexity. arXiv 2020, arXiv:2101.10277. [Google Scholar]
Guo, L.; Liu, J.; Zhu, X.; Yao, P.; Lu, S.; Lu, H. Normalized and geometry-aware self-attention network for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10327–10336. [Google Scholar]
Ji, J.; Luo, Y.; Sun, X.; Chen, F.; Luo, G.; Wu, Y.; Gao, Y.; Ji, R. Improving image captioning by leveraging intra-and inter-layer global representation in transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 1655–1663. [Google Scholar]
Hossen, M.B.; Ye, Z.; Abdussalam, A.; Hossain, M.I. GVA: Guided visual attention approach for automatic image caption generation. Multimed. Syst. 2024, 30, 50. [Google Scholar] [CrossRef]
Wang, C.; Shen, Y.; Ji, L. Geometry Attention Transformer with position-aware LSTMs for image captioning. Expert Syst. Appl. 2022, 201, 117174. [Google Scholar] [CrossRef]
Kuo, C.W.; Kira, Z. Beyond a pre-trained object detector: Cross-modal textual and visual context for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17969–17979. [Google Scholar]
Zohourianshahzadi, Z.; Kalita, J.K. Neural attention for image captioning: Review of outstanding methods. Artif. Intell. Rev. 2021, 55, 3833–3862. [Google Scholar] [CrossRef]
Hayat, M.; Ahmad, N.; Nasir, A.; Tariq, Z.A. Hybrid Deep Learning EfficientNetV2 and Vision Transformer (EffNetV2-ViT) Model for Breast Cancer Histopathological Image Classification. IEEE Access 2024, 12, 184119–184131. [Google Scholar] [CrossRef]
Hayat, M.; Aramvith, S. Transformer’s role in brain MRI: A Scoping Review. IEEE Access 2024, 12, 108876–108896. [Google Scholar] [CrossRef]
Nivedita, M. A Survey on Different Deep Learning Architectures for Image Captioning. WSEAS Trans. Syst. Control. 2020, 15, 635–646. [Google Scholar] [CrossRef]
Wu, Y.; Li, L.; Jiao, L.; Liu, F.; Liu, X.; Yang, S. TrTr-CMR: Cross-Modal Reasoning Dual Transformer for Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5643912. [Google Scholar] [CrossRef]
Fang, Z.; Wang, J.; Hu, X.; Liang, L.; Gan, Z.; Wang, L.; Yang, Y.; Liu, Z. Injecting semantic concepts into end-to-end image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18009–18019. [Google Scholar]
Zeng, P.; Zhu, J.; Song, J.; Gao, L. Progressive tree-structured prototype network for end-to-end image captioning. In Proceedings of the 30th ACM International Conference on Multimedia, New York, NY, USA, 10–14 October 2022; pp. 5210–5218. [Google Scholar]
Dubey, S.; Olimov, F.; Rafique, M.A.; Kim, J.; Jeon, M. Label-attention transformer with geometrically coherent objects for image captioning. Inf. Sci. 2023, 623, 812–831. [Google Scholar] [CrossRef]
Ge, G.; Han, Y.; Hao, L.; Hao, K.; Wei, B.; Tang, X.S. Show, tell and rectify: Boost image caption generation via an output rectifier. Neurocomputing 2024, 585, 127651. [Google Scholar] [CrossRef]
Zeng, Z.; Xie, Y.; Zhang, H.; Chen, C.; Chen, B.; Wang, Z. Meacap: Memory-augmented zero-shot image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 14100–14110. [Google Scholar]
Yang, X.; Yang, Y.; Ma, S.; Li, Z.; Dong, W.; Woźniak, M. SAMT-generator: A second-attention for image captioning based on multi-stage transformer network. Neurocomputing 2024, 593, 127823. [Google Scholar] [CrossRef]
Chen, F.; Li, X.; Tang, J.; Li, S.; Wang, T. A Survey on Recent Advances in Image Captioning. J. Phys. Conf. Ser. 2021, 1914, 012053. [Google Scholar] [CrossRef]
Pan, Y.; Yao, T.; Li, Y.; Mei, T. X-linear attention networks for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10971–10980. [Google Scholar]
Cui, M.; Li, C.; Yang, Y. Explicit Image Caption Reasoning: Generating Accurate and Informative Captions for Complex Scenes with LMM. Sensors 2024, 24, 3820. [Google Scholar] [CrossRef]
Chan, K.H.; Im, S.K.; Zhang, Y. Optimization of language models by word computing. In Proceedings of the 6th International Conference on Graphics and Signal Processing, Chengdu, China, 25–27 February 2022; pp. 39–43. [Google Scholar]
Stefanini, M.; Cornia, M.; Baraldi, L.; Cascianelli, S.; Fiameni, G.; Cucchiara, R. From Show to Tell: A Survey on Deep Learning-based Image Captioning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 539–559. [Google Scholar] [CrossRef]
Sarto, S.; Cornia, M.; Baraldi, L.; Nicolosi, A.; Cucchiara, R. Towards retrieval-augmented architectures for image captioning. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 242. [Google Scholar] [CrossRef]
Cornia, M.; Stefanini, M.; Baraldi, L.; Cucchiara, R. Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10578–10587. [Google Scholar]
Boer, P.; Kroese, D.P.; Mannor, S.; Rubinstein, R.Y. A tutorial on the cross-entropy method. Ann. Oper. Res. 2005, 134, 19–67. [Google Scholar] [CrossRef]
Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-critical sequence training for image captioning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1179–1195. [Google Scholar]
Papineni, S. Blue: A method for automatic evaluation of machine translation. In Proceedings of the Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2002; pp. 311–318. [Google Scholar]
Denkowski, M.; Lavie, A. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, MD, USA, 26–27 June 2014; pp. 376–380. [Google Scholar]
Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Stroudsburg, PA, USA, 2004; pp. 74–81. [Google Scholar]
Vedantam, R.; Zitnick, C.L.; Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar]
Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. Spice: Semantic propositional image caption evaluation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 382–398. [Google Scholar]
Yang, X.; Tang, K.; Zhang, H.; Cai, J. Auto-encoding scene graphs for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 10685–10694. [Google Scholar]
Herdade, S.; Kappeler, A.; Boakye, K.; Soares, J. Image captioning: Transforming objects into words. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 11137–11147. [Google Scholar]
Huang, L.; Wang, W.; Chen, J.; Wei, X. Attention on attention for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4634–4643. [Google Scholar]
Jiang, W.; Ma, L.; Jiang, Y.; Liu, W.; Zhang, T. Recurrent fusion network for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 499–515. [Google Scholar]
Anderson, P.; He, X.; Buehler, C.; Eney, T.D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086. [Google Scholar]
Yang, X.; Tang, K.; Zhang, H.; Cai, J. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 684–699. [Google Scholar]
Zhang, X.; Sun, X.; Luo, Y.; Ji, J.; Zhou, Y.; Wu, Y.; Huang, F.; Ji, R. RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words. In Proceedings of the 2021 IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15465–15474. [Google Scholar]
Xian, T.; Li, Z.; Zhang, C.; Ma, H. Dual global enhanced transformer for image captioning. Neural Netw. 2022, 148, 129–141. [Google Scholar] [CrossRef] [PubMed]
Gao, Y.; Wang, N.; Suo, W.; Sun, M.; Wang, P. Improving image captioning via enhancing dual-side context awareness. In Proceedings of the 2022 International Conference on Multimedia Retrieval, Newark, NJ, USA, 27–30 June 2022; pp. 389–397. [Google Scholar]
Hu, N.; Fan, C.; Ming, Y.; Feng, F. MAENet: A novel multi-head association attention enhancement network for completing intra-modal interaction in image captioning. Neurocomputing 2023, 519, 69–81. [Google Scholar] [CrossRef]
Yang, L.; He, L.; Hu, D.; Liu, Y.; Peng, Y.; Chen, H.; Zhou, M. Variational Transformer: A Framework Beyond the Tradeoff Between Accuracy and Diversity for Image Captioning. IEEE Trans. Neural Netw. Learn. Syst. 2024, 1–12. [Google Scholar] [CrossRef]
Zhang, J.; Fang, Z.; Sun, H.; Wang, Z. Adaptive semantic-enhanced transformer for image captioning. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 1785–1796. [Google Scholar] [CrossRef]

Figure 1. Architecture of Multi-Head Linear Self-Attention.

Figure 2. Architecture of the DMFormer.

Figure 3. Architecture of Encoder Layers.

Figure 4. Architecture of Decoder Layers.

Figure 5. Visualization for Generated Captions from GroundTruth, Baseline, and Our Method.

Table 1. Ablation Study Comparison of RMAE and DMAD.

RMAE	DMAD	BLEU-1	BLEU-4	METEOR	ROUGE-L	CIDEr	SPICE
×	×	79.1	36.2	27.7	56.9	121.8	20.9
×	√	80.9	38.6	28.5	58.6	127.1	22.5
√	×	80.7	39.1	28.8	58.8	130.5	22.6
√	√	81.3	39.6	29.4	59.6	133.2	23.2

Table 2. Ablation Study Comparison of Encoder Using Different Numbers of Memory Vectors.

Memories	BLEU-1	BLEU-4	METEOR	ROUGE-L	CIDEr	SPICE
No memory	79.1	36.2	27.7	56.9	121.8	20.9
20	79.6	37.4	28.5	57.8	126.5	21.8
40	81.3	39.6	29.4	59.6	133.2	23.2
60	80.5	38.2	29.1	58.4	131.0	22.3

Table 3. Ablation Study Comparison of Decoder Using Different Numbers of Memory Vectors.

Memories	BLEU-1	BLEU-4	METEOR	ROUGE-L	CIDEr	SPICE
No memory	79.1	36.2	27.7	56.9	121.8	20.9
20	81.3	39.6	29.4	59.6	133.2	23.2
40	80.9	38.5	28.7	58.9	132.3	21.9
60	79.6	37.4	28.2	57.5	130.6	21.1

Table 4. Performance of RMAA Variants.

Model	BLEU-1	BLEU-4	METEOR	ROUGE-L	CIDEr	SPICE
Explicit-only	79.6	37.1	28.4	57.3	126.9	21.3
Implicit-only	80.1	37.3	27.9	57.1	127.0	20.8
RMAA (combined)	81.3	39.6	29.4	59.6	133.2	23.2

Table 5. Experimental Comparison with State-of-the-Art Baseline Models on the MSCOCO Dataset.

Method	BLEU-1	BLEU-4	METEOR	ROUGE-L	CIDEr	SPICE
SGAE [37]	80.8	38.4	28.4	58.6	127.8	22.1
ORT [38]	80.5	38.6	28.7	58.4	128.3	22.6
AoA-Transformer [39]	80.2	38.9	29.1	58.8	129.8	22.4
M2-Transformer [29]	80.8	39.1	29.1	58.6	131.2	22.6
X-Transformer [24]	81.0	39.1	29.1	58.8	130.2	22.8
DMFormer (Ours)	81.3	39.6	29.4	59.6	133.2	23.2

Table 6. Experimental Comparison with Advanced Models on the MSCOCO Dataset.

Model	BLEU-1	BLEU-4	METEOR	ROUGE-L	CIDEr	SPICE
SCST [31]	-	34.2	26.7	55.7	114.0	-
RFNet [40]	79.1	36.5	27.7	57.3	121.9	21.2
Up-Down [41]	79.8	36.3	27.7	56.9	120.1	21.4
GCN-LSTM [42]	80.5	38.2	28.5	58.3	127.6	22.0
SGAE [37]	80.8	38.4	28.4	58.6	127.8	22.1
ORT [38]	80.5	38.6	28.7	58.4	128.3	22.6
AoANet [39]	80.2	38.9	29.1	58.8	129.8	22.4
M2-Transformer [29]	80.8	39.1	29.1	58.6	131.2	22.6
X-Transformer [24]	81.0	39.5	29.1	58.8	130.2	22.8
RSTNet [43]	81.1	39.3	29.3	58.8	133.3	23.0
DGET [44]	81.3	40.3	29.2	59.4	132.4	23.3
GAT [10]	80.8	39.7	29.1	59.0	130.5	22.9
ViTCAP [17]	-	40.1	29.4	59.4	133.1	23.0
CATNet [45]	81.1	39.7	29.5	59.3	133.0	22.5
MAENet [46]	81.2	39.8	29.6	59.1	133.1	23.1
D2 Transformer [20]	80.8	38.9	29.1	58.5	131.8	22.7
VaT [47]	80.9	39.8	29.2	59.0	131.2	23.1
AS-Transformer [48]	80.6	39.3	29.2	58.9	131.0	23.1
LATGeO [19]	81.0	38.8	29.2	58.7	131.7	22.9
DMFormer (Ours)	81.3	39.6	29.4	59.6	133.2	23.2

Table 7. Comparison of FLOPS, Params, and Inference Time.

Model	FLOPS (G)	Params (M)	Inference Time (ms)
X-Transformer [24]	265	210	276
M²-Transformer [29]	97	191	500
RSTNet [43]	18	213	86
DMFormer (Ours)	15	150	71

Table 8. Significance Analysis on CIDEr Scores.

Model	DMFormer (Ours)	SGAE [37]	ORT [38]	AoA-Transformer [39]	X-Transformer [24]
Mean of CIDEr	133.2	127.8	128.3	129.8	130.2
Compare with Ours	-	5.4↓	4.9↓	3.4↓	3.0↓
p-value	-	3 × 10⁻⁵	0.017	0.024	0.03

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

He, Y.; Jiang, Z. DMFormer: Dense Memory Linformer for Image Captioning. Electronics 2025, 14, 1716. https://doi.org/10.3390/electronics14091716

AMA Style

He Y, Jiang Z. DMFormer: Dense Memory Linformer for Image Captioning. Electronics. 2025; 14(9):1716. https://doi.org/10.3390/electronics14091716

Chicago/Turabian Style

He, Yuting, and Zetao Jiang. 2025. "DMFormer: Dense Memory Linformer for Image Captioning" Electronics 14, no. 9: 1716. https://doi.org/10.3390/electronics14091716

APA Style

He, Y., & Jiang, Z. (2025). DMFormer: Dense Memory Linformer for Image Captioning. Electronics, 14(9), 1716. https://doi.org/10.3390/electronics14091716

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

DMFormer: Dense Memory Linformer for Image Captioning

Abstract

1. Introduction

2. Related Work

2.1. Image Captioning

2.2. Linformer and Its Advantages

3. Proposed Method

3.1. Positional Encoding

3.2. Relation Memory Augmented Encoder

Relation Memory Augmented Attention

3.3. Dense Memory Augmented Decoder

Dense Memory Augmented Cross Attention

3.4. Training Strategy

4. Experiments

4.1. Dataset and Evaluation Metrics

4.2. Implementation Details

4.3. Experimental Results and Analysis

4.3.1. Analysis of Module Ablation Experiments

4.3.2. Analysis of Memory Vectors Ablation Experiments

4.3.3. Analysis of RMAA Variants in Ablation Experiments

4.3.4. Comparison and Analysis with Advanced Baseline Models

4.3.5. Comparative Analysis with Advanced Models

4.3.6. Computational Efficiency Analysis

4.3.7. Qualitative Analysis

4.3.8. Significance Analysis

5. Limitations and Future Work

5.1. Limitations

5.2. Future Work

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI