Article

VCC-DiffNet: Visual Conditional Control Diffusion Network for Remote Sensing Image Captioning

School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2024, 16(16), 2961; https://doi.org/10.3390/rs16162961
Submission received: 5 July 2024 / Revised: 3 August 2024 / Accepted: 9 August 2024 / Published: 12 August 2024

Abstract
Pioneering remote sensing image captioning (RSIC) works use autoregressive decoding for fluent and coherent sentences but suffer from high latency and high computation costs. In contrast, non-autoregressive approaches improve inference speed by predicting multiple tokens simultaneously, though at the cost of performance due to a lack of sequential dependencies. Recently, diffusion model-based non-autoregressive decoding has shown promise in natural image captioning with iterative refinement, but its effectiveness is limited by the intrinsic characteristics of remote sensing images, which complicate robust input construction and affect the description accuracy. To overcome these challenges, we propose an innovative diffusion model for RSIC, named the Visual Conditional Control Diffusion Network (VCC-DiffNet). Specifically, we propose a Refined Multi-scale Feature Extraction (RMFE) module to extract the discernible visual context features of RSIs as input of the diffusion model-based non-autoregressive decoder to conditionally control a multi-step denoising process. Furthermore, we propose an Interactive Enhanced Decoder (IE-Decoder) utilizing dual image–description interactions to generate descriptions finely aligned with the image content. Experiments conducted on four representative RSIC datasets demonstrate that our non-autoregressive VCC-DiffNet performs comparably to, or even better than, popular autoregressive baselines in classic metrics, achieving around an 8.22 × speedup in Sydney-Captions, an 11.61 × speedup in UCM-Captions, a 15.20 × speedup in RSICD, and an 8.13 × speedup in NWPU-Captions.

1. Introduction

Remote sensing image captioning (RSIC) aims to create captions for remote sensing images (RSIs) that are easily understood by humans. These captions describe the ground elements, their attributes, the dependencies among them, and their spatial relationships, supporting applications such as military intelligence generation and disaster early warning. In contrast to conventional template-based and retrieval-based methods, the encoder–decoder framework has become the prevalent approach in RSIC, owing to its ability to produce more flexible and variable sentences [1]. This framework employs an encoder to extract the visual semantic features from a given RSI, followed by a decoder that formulates target descriptions based on these features. The decoder typically utilizes an autoregressive (AR) strategy, using the previously generated outputs as context for generating the next token. This approach enables the decoder to capture dependencies and contextual information in the input sequence, resulting in more coherent and contextually relevant sentences. However, the AR decoding strategy is hindered by high captioning delay and computational expense, particularly with long sequences [2]. This limitation significantly impacts the efficiency and scalability of RSIC for real-time applications.
Non-autoregressive (NAR) decoding aims to overcome the limitations of autoregressive (AR) decoding by predicting multiple tokens simultaneously. While this approach holds potential for faster model inference, it may compromise the quality of descriptive sentences, as the generated words are not dependent on each other sequentially and are generated concurrently [3].
Recently, a new class of generative models known as diffusion models has been introduced into NAR description generation; these models iteratively refine a given input noise signal to generate a descriptive sentence with intermediate control, which allows for more controlled and precise descriptions [4]. In practical applications, efforts [2,5,6,7] to apply diffusion models to NAR natural image captioning have shown promising results in generating high-quality sentences with improved accuracy and fluency. Even so, the direct application of diffusion-based NAR models to RSIC is still challenging due to the intrinsic characteristics of remote sensing images, including intra-class diversity, inter-class similarity, multiple ground objects, and multiple scales, as shown in Figure 1. These complexities in the scene, along with the difficulty in distinguishing ground objects, present obstacles in constructing a robust input for a diffusion model-based non-autoregressive decoder. As a result, the precision of sentence-level semantic analysis may be compromised.
In response to challenges such as the inference latency of autoregressive methods, the performance degradation of general non-autoregressive methods, and the difficulty of constructing robust input for diffusion model-based non-autoregressive methods, we propose an innovative diffusion model for RSIC, called the Visual Conditional Control Diffusion Network (VCC-DiffNet). It imposes discernible visual context features of RSIs as the input of the decoder to conditionally control a multi-step denoising process, progressively generating descriptions that are relevant to the image content in a fine-grained manner. Specifically, to obtain more discernible visual features for the diffusion model-based non-autoregressive decoder, we introduce a Refined Multi-scale Feature Extraction (RMFE) module, which extracts robust visual features representing complex scenes and diverse ground objects by utilizing both global features and multi-scale grid features. Additionally, we propose an Interactive Enhanced Decoder (IE-Decoder) with dual interactions to concentrate on accurate scene and ground object information, thus ensuring precise description generation. Experimental results demonstrate that our VCC-DiffNet effectively captures the complexity of RSIs to generate semantically accurate descriptions. It achieves comparable or superior performance to popular autoregressive baselines, with around an 8.22× speedup in Sydney-Captions, an 11.61× speedup in UCM-Captions, a 15.20× speedup in RSICD, and an 8.13× speedup in NWPU-Captions.
In conclusion, the main contributions of our work can be summarized as follows:
  • A Visual Conditional Control Diffusion Network (VCC-DiffNet) is proposed to address the caption delay problem in existing RSIC tasks. To the best of our knowledge, this is the first attempt to apply a diffusion model-based non-autoregressive decoding paradigm to the RSIC task.
  • A Refined Multi-scale Feature Extraction (RMFE) module is proposed as a means of obtaining more discernible visual features as the input of a diffusion model-based non-autoregressive decoder. The input visual features robustly represent the complex scenes and various ground objects by capitalizing on both global features and multi-scale grid features.
  • An Interactive Enhanced Decoder (IE-Decoder) with dual interactions is proposed to focus on the accurate information pertaining to scenes and ground objects, thus ensuring the generation of accurate scene and ground object information from a description perspective.
The organization of this paper in the following sections is as follows. In Section 2, we review the related work on remote sensing image captioning from the autoregressive perspective and non-autoregressive perspective. Section 3 details the design of the proposed VCC-DiffNet. Section 4 introduces the experiments and analysis conducted on four datasets. Section 5 provides a summary of this paper.

2. Related Work

The review of relevant work is presented in this section from two perspectives: autoregressive remote sensing image captioning, and non-autoregressive image captioning. Since existing RSIC tasks mainly adopt the autoregressive decoding strategy, we first review the autoregressive remote sensing image captioning from the RNN-based and Transformer-based perspectives. To identify areas for improvement and innovation of RSIC, we also review the non-autoregressive decoding in natural image captioning.

2.1. Autoregressive Remote Sensing Image Captioning

Research on autoregressive remote sensing image captioning can be divided into two groups, RNN-based methods and Transformer-based methods, according to the underlying architectures and mechanisms employed for generating textual descriptions of images.
  • RNN-based methods. In the pioneering work, Qu et al. [8] introduced the CNN plus RNN architecture to the RSIC task, which can produce more flexible and variable sentences compared to the template-based [9] and retrieval-based methods [10]. In subsequent research, long short-term memory (LSTM) [11], a special RNN variant, has been widely used in a significant number of studies [12,13,14,15] since it can better capture and handle long-term dependencies in sequential data. Researchers have since provided many ideas to enhance the description generation capability of LSTM. For instance, Fu et al. [16] devised a persistent memory mechanism component to expand the information storage capacity of LSTM, thereby reducing information loss at the decoding end. Li et al. [17] introduced a semantic gate within the decoder with the aim of improving the comprehensiveness and accuracy of the description. Zhang et al. [18] proposed a novel approach to visual and textual feature fusion in the decoder, whereby linguistic state guidance was introduced to enhance the overall system’s performance. Wang et al. [19] proposed the integration of sequence attention and flexible word correlation within the decoder in order to generate accurate and sufficiently detailed descriptions.
  • Transformer-based methods. An alternative RSIC framework that uses a Transformer instead of an LSTM as the decoder appears in [20]. The Transformer-based decoder employs a stacked structure devoid of recursive connections, exhibiting superior performance and reduced training time in comparison to traditional LSTM models; thus, it has attracted increasing attention from researchers. Liu et al. [21] proposed a multi-layer aggregated Transformer architecture for text generation, which is capable of extracting information in a manner conducive to the generation of sentences of sufficient quality. Ren et al. [22] proposed a mask-guided Transformer network incorporating a topic token with the objective of enhancing the accuracy and variety of captioning. Zhao et al. [23] proposed a region attention Transformer model that integrates region features into the RSIC task. Subsequently, they proposed a cooperative connection Transformer [24], which employs both grid features and region features to improve the decoder’s perception.
While these models have achieved notable performance, they employ autoregressive decoding, a technique that incurs significant computational cost and results in caption delays and the accumulation of sequence-level errors.

2.2. Non-Autoregressive Image Captioning

The limitations of autoregressive decoding in terms of computational cost and inference time have led to the development of non-autoregressive decoding, initially proposed in [25] as a solution for neural machine translation. However, its text generation quality is worse than that of autoregressive decoding. In subsequent non-autoregressive image captioning tasks, researchers have proposed many improvements to text generation quality. Gao et al. [26] proposed masked non-autoregressive decoding to preserve semantic content and generate diverse text. Fei et al. [3] introduced a position alignment module to guide sentence generation. Both of the models described above are trained with cross-entropy loss, leading to a poor approximation of the ground-truth sentence distribution. Guo et al. [27] therefore proposed a counterfactuals-critical multi-agent learning (CMAL) method that treats the input sequence of the decoder as multiple agents learning to maximize a reward; CMAL uses a sentence-level CIDEr reward to encourage fluent sentences. Yu et al. [28] introduced a semantic retrieval module, using semantic information retrieved from image representations as the input of the decoder to reduce performance loss. To make a trade-off between autoregressive and non-autoregressive decoding, some partially non-autoregressive attempts [29,30,31] have been proposed, which generate words non-autoregressively locally but execute the autoregressive paradigm globally. Nevertheless, because dependencies within a sentence are broken, both non-autoregressive and partially non-autoregressive methods still lose generation accuracy.
Recently, a novel class of non-autoregressive methods, text diffusion models, has shown improved text generation quality [4]. Essentially, text diffusion models implement a multi-step denoising process to transform a noisy sequence into descriptive text. Existing models mostly follow the paradigm that performs the diffusion process in continuous latent representations. Xu et al. [7] proposed the CLIP-Diffusion-LM model, which iteratively modifies tokens built on sequence embeddings. He et al. [32] put forth a model called DiffCap, which applies continuous diffusion to image captioning and fuses extracted image features for diffusion caption generation. In order to achieve diverse captions, Liu et al. [33] introduced a mapping network that incorporates prefix image embeddings into the continuous diffusion model’s denoising process. Chen et al. [2] introduced the Bit Diffusion method, which trains a continuous diffusion model to describe discrete words by representing them as binary bits. Luo et al. [6] proposed SCD-Net, which improves Bit Diffusion by strengthening visual–language semantic alignment through a semantic-conditional diffusion process. Considering that text tokens are discrete in nature, Zhu et al. [5] explored a discrete diffusion model that performs the diffusion process on discrete text tokens to produce accurate image captions. Overall, existing text diffusion models show improved text generation quality in natural image captioning, but their ability on RSIC tasks remains to be verified because their network designs do not consider the intrinsic characteristics of remote sensing images (RSIs).

3. Methodology

To improve decoding efficiency and enhance the input for the diffusion model-based non-autoregressive decoder, we propose a Visual Conditional Control Diffusion Network (VCC-DiffNet) for the RSIC task. Overall, we first propose a Refined Multi-scale Feature Extraction (RMFE) module to obtain more discernible visual features as input for the diffusion model-based non-autoregressive decoder, enabling conditional control over a multi-step denoising process. Moreover, we propose an Interactive Enhanced Decoder (IE-Decoder) with dual interactions to concentrate on accurate scene and ground object information, thus ensuring precise description generation. Following that, we elaborate on some implementation designs. The architecture of VCC-DiffNet is shown in Figure 2.

3.1. Diffusion Model for VCC-DiffNet

Taking into account that descriptive tokens are discrete in nature, the diffusion process of our VCC-DiffNet follows a general discrete diffusion framework D3PM [34] and refers to the token perturbation process proposed in DDCap [5].
Forward Diffusion Process. Each descriptive token within a caption is represented as a discrete state $x \in \{1, \ldots, K\}$, where $K$ is the size of the vocabulary. The forward procedure progressively introduces noise into the descriptive token sequence through a transition matrix $[Q_t]_{ij} = q(x_t = j \mid x_{t-1} = i)$, where $x_t$ is the noisy token at step $t$. All tokens are converted into $[\mathrm{MASK}]$ tokens with a probability $\beta_t$. If $x_{t-1}$ is not $[\mathrm{MASK}]$, the transition probability vector from state $x_{t-1}$ to $x_t$ is defined as follows:
$$
p(x_t \mid x_{t-1}) =
\begin{cases}
\alpha_t, & x_t = x_{t-1}; \\
\beta_t, & x_t = [\mathrm{MASK}]; \\
1 - \alpha_t - \beta_t, & \text{otherwise}.
\end{cases}
$$
If $x_{t-1}$ is $[\mathrm{MASK}]$, the transition probability vector from state $x_{t-1}$ to $x_t$ is defined as:
$$
p(x_t \mid x_{t-1} = [\mathrm{MASK}]) =
\begin{cases}
1, & x_t = [\mathrm{MASK}]; \\
0, & \text{otherwise}.
\end{cases}
$$
More specifically, given a descriptive token sequence $x_0 \sim q(x_0)$, a series of latent variables $x_1, \ldots, x_T$ is created in the forward process by sampling from
$$
q(x_t \mid x_{t-1}) = \mathrm{Cat}\big(x_t;\, p(x_t \mid x_{t-1})\big),
$$
where $\mathrm{Cat}(\cdot)$ represents a categorical distribution over the token states $x$. Given enough iterations, all tokens will be transformed into $[\mathrm{MASK}]$.
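To make the forward corruption process concrete, the following is a minimal PyTorch sketch of a single forward step under the transition rules above. The function name, the tensor layout, and the uniform re-sampling of tokens that neither stay nor become $[\mathrm{MASK}]$ are illustrative assumptions, not the released implementation.

```python
import torch

def forward_diffusion_step(x_prev, alpha_t, beta_t, mask_id, vocab_size):
    """One forward step q(x_t | x_{t-1}) of the discrete diffusion sketched above.

    A non-[MASK] token stays unchanged with probability alpha_t, becomes [MASK]
    with probability beta_t, and is otherwise replaced by a uniformly drawn
    vocabulary token (an assumption for the "otherwise" branch); [MASK] is an
    absorbing state and never changes back.
    """
    u = torch.rand(x_prev.shape, device=x_prev.device)
    random_tokens = torch.randint_like(x_prev, 0, vocab_size)

    x_t = x_prev.clone()
    to_mask = (u >= alpha_t) & (u < alpha_t + beta_t)
    to_random = u >= alpha_t + beta_t
    x_t[to_mask] = mask_id
    x_t[to_random] = random_tokens[to_random]
    x_t[x_prev == mask_id] = mask_id          # [MASK] is absorbing
    return x_t

# Toy usage: corrupt a 4-token caption, with [MASK] assumed to be id vocab_size.
# x0 = torch.tensor([[12, 7, 431, 9]])
# x1 = forward_diffusion_step(x0, alpha_t=0.7, beta_t=0.2, mask_id=1000, vocab_size=1000)
```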
Reverse Diffusion Process. The reverse process $p_\theta(x_{t-1} \mid x_t, f_m, f_g)$ begins with a complete $[\mathrm{MASK}]$ sequence and progressively recovers the targeted caption through the IE-Decoder, where $f_m$ and $f_g$ are, respectively, the multi-scale grid features and global features extracted by the RMFE module. In other words, the decoder refines noisy tokens conditioned on the grid features and global features in the reverse diffusion process. The learning objective is expressed as:
$$
\mathcal{L}_{x_0} = -\log p_\theta(x_0 \mid x_t, f_m, f_g).
$$
Upon completing $T$ steps, the discrete diffusion model yields the final estimate of $x_0$. The visualization of the intermediate steps of our model’s reverse diffusion process is shown in Figure 3.
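At inference time, the reverse process can be sketched as an iterative mask-and-repredict loop conditioned on the visual features, in the spirit of confidence-based re-masking used by discrete diffusion captioners. The `decoder` callable, its argument order, and the linear re-masking schedule below are assumptions for illustration rather than the authors' exact inference code.

```python
import torch

@torch.no_grad()
def reverse_diffusion_decode(decoder, f_m, f_g, seq_len, steps, mask_id):
    """Iteratively denoise an all-[MASK] sequence, conditioned on the multi-scale
    grid features f_m and global features f_g. `decoder(tokens, f_m, f_g)` is
    assumed to return logits of shape (batch, seq_len, vocab_size).
    """
    batch = f_g.size(0)
    tokens = torch.full((batch, seq_len), mask_id, dtype=torch.long, device=f_g.device)

    for step in range(steps, 0, -1):
        logits = decoder(tokens, f_m, f_g)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)        # per-token confidence and prediction
        tokens = pred

        if step > 1:
            # Re-mask the least confident tokens so they are re-predicted next
            # iteration; the masked fraction shrinks linearly with the step count.
            num_mask = int(seq_len * (step - 1) / steps)
            if num_mask > 0:
                lowest = conf.topk(num_mask, dim=-1, largest=False).indices
                tokens.scatter_(1, lowest, mask_id)
    return tokens
```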

3.2. Refined Multi-Scale Feature Extraction (RMFE) Module

The RMFE module, shown in Figure 4, is proposed to delve into the discernible characteristics of scenes and ground objects in RSIs. When extracting visual features from RSIs, local features such as grid features offer an organized way of capturing localized information across various areas of the RSI, facilitating more accurate object localization and fine-grained object recognition; global features, on the other hand, capture broad characteristics and spatial relationships, providing a holistic understanding of the scene. Hence, our RMFE module combines the benefits of these two types of features to comprehend intricate scenes and differentiate between various ground objects in RSIs.
Given the multi-scale nature of RSIs, the Swin-Transformer is employed as the backbone network to extract multi-scale grid features through a hierarchical structure. Initially, patch partitioning is used to split the input RSI into patches, which are then passed through four hierarchical stages. In stage one, the patches are processed by the Swin-Transformer block after the channel number is adjusted via a linear embedding layer. In stage two, the feature maps are down-sampled by a patch merging layer before being input into the Swin-Transformer block. Stages three and four consist of modules identical to those in stage two. The output of each Swin-Transformer block $ST_i$ can be formulated as:
$$
X_i = ST_i(X_{i-1}),
$$
where $X_i \in \mathbb{R}^{\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}} \times 2^{i-1}C}$, $i = 1, 2, 3, 4$; $H$ and $W$ represent the input image’s height and width, and $C$ denotes the number of output channels. We fuse the multi-scale features $X_i$ through an FPN [35], obtaining the corresponding output feature maps $Y_i$, $i = 1, 2, 3, 4$. These feature maps are then fused into $Y$ by aligning them to $Y_1$ and concatenating.
After the initially fused feature $Y$ is extracted, we propose a feature refinement module, yielding multi-scale features $f_m$ and global features $f_g$. Specifically, we use a $3 \times 3$ convolution to further connect the fused feature $Y$ and reduce its channel number. The generated features $f_m$ are then input into a Transformer block for refinement. Furthermore, $f_m$ is fed to a global average pooling module to obtain a global representation $f_g$, which is also refined via a parameter-sharing Transformer.
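A compact sketch of how the RMFE computation described above could be organized is given below: lateral 1 × 1 convolutions and a top-down pathway fuse the four Swin stage outputs (FPN-style), a 3 × 3 convolution mixes the concatenated maps and reduces channels, a Transformer encoder layer refines the grid tokens $f_m$, and global average pooling followed by the same (parameter-sharing) layer produces $f_g$. The channel sizes, the nearest-neighbour alignment, and the single-layer refinement are illustrative assumptions; the Swin backbone itself is omitted and its stage outputs are taken as inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMFE(nn.Module):
    """Sketch of the RMFE module: FPN-style fusion of four backbone stages,
    3x3 conv fusion, Transformer refinement of f_m, and pooled/refined f_g."""

    def __init__(self, stage_dims=(96, 192, 384, 768), fpn_dim=256, d_model=512):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, fpn_dim, 1) for c in stage_dims])
        self.fuse = nn.Conv2d(4 * fpn_dim, d_model, 3, padding=1)   # 3x3 conv on concatenated maps
        self.refine = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)

    def forward(self, stage_feats):
        # stage_feats: four Swin outputs, each (B, C_i, H_i, W_i); index 0 is the finest level.
        target_hw = stage_feats[0].shape[-2:]
        fpn = [lat(x) for lat, x in zip(self.lateral, stage_feats)]
        for i in range(len(fpn) - 1, 0, -1):                        # top-down pathway
            fpn[i - 1] = fpn[i - 1] + F.interpolate(fpn[i], size=fpn[i - 1].shape[-2:], mode="nearest")
        aligned = [F.interpolate(y, size=target_hw, mode="nearest") for y in fpn]
        y = self.fuse(torch.cat(aligned, dim=1))                    # (B, d_model, H_1, W_1)

        tokens = y.flatten(2).transpose(1, 2)                       # (B, H_1*W_1, d_model) grid tokens
        f_g = tokens.mean(dim=1, keepdim=True)                      # global average pooling
        f_m = self.refine(tokens)                                   # refined multi-scale features
        f_g = self.refine(f_g)                                      # refined by the shared Transformer
        return f_m, f_g
```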

3.3. Interactive Enhanced Decoder (IE-Decoder)

The decoder, where image–description interaction occurs, is designed to generate word tokens in sentences utilizing the extracted visual features. To enhance integration between the image and description modalities, we use the powerful language model GPT-2 [36] as our decoder and integrate an interactive enhancement module to construct an Interactive Enhanced Decoder (IE-Decoder), as shown in Figure 5. In our decoder, there are dual interactions between natural language and visual content. The first interaction fuses the refined global features $f_g$ with the input of each decoder block via the interactive enhancement module to capture global visual context information, which can be formulated as:
$$
Z_{1:t-1}^{p,l} = \mathrm{LayerNorm}\Big(Z_{1:t-1}^{l-1} + W_{f1}\big[Z_{1:t-1}^{l-1}; f_g\big] \times \mathrm{GeLU}\big(W_{f2}\big[Z_{1:t-1}^{l-1}; f_g\big]\big)\Big),
$$
where $Z_{1:t-1}^{l-1}$ denotes the output of the $(l-1)$-th block and the input to the $l$-th block at timestep $t$, $W_{f1}$ and $W_{f2}$ are the learned parameters of the linear layers, and $[Z_{1:t-1}^{l-1}; f_g]$ denotes concatenation. The second interaction occurs in the cross-attention module, where the multi-scale features $f_m$ and $Z_{1:t-1}^{p,l}$ are fused to capture local visual contextual information, which can be formulated as:
$$
\hat{Z}_{1:t-1}^{l} = Z_{1:t-1}^{p,l} + \mathrm{MSA}\big(W_Q \cdot \mathrm{LayerNorm}(Z_{1:t-1}^{p,l}),\, W_K f_m,\, W_V f_m\big),
$$
$$
Z_{1:t-1}^{l} = \hat{Z}_{1:t-1}^{l} + \mathrm{FeedForward}\big(\mathrm{LayerNorm}(\hat{Z}_{1:t-1}^{l})\big),
$$
where $W_Q$, $W_K$, and $W_V$ are learned parameters, $Z_{1:t-1}^{p,l}$ represents the embedding vector for the word generated at timestep $t-1$, the layer normalization of $Z_{1:t-1}^{p,l}$ acts as the query in MSA, and the multi-scale features $f_m$ serve as the keys and values in MSA.
With the aforementioned image–description interactions, our VCC-DiffNet can focus better on the descriptive information of scenes and ground objects and generate more semantically accurate descriptions.
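The two interactions can be condensed into a single decoder block, as sketched below: a gated fusion of $f_g$ with the block input (the interactive enhancement), followed by cross-attention in which the enhanced tokens query the multi-scale features $f_m$. In the actual model these operations augment GPT-2 layers; this standalone block, together with its hidden size, head count, and feed-forward expansion, is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IEDecoderBlock(nn.Module):
    """Illustrative IE-Decoder block: interactive enhancement with f_g, then
    cross-attention over the multi-scale grid features f_m, then feed-forward."""

    def __init__(self, d_model=768, nhead=12):
        super().__init__()
        self.w_f1 = nn.Linear(2 * d_model, d_model)
        self.w_f2 = nn.Linear(2 * d_model, d_model)
        self.ln_enh = nn.LayerNorm(d_model)
        self.ln_q = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ln_ff = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, z, f_m, f_g):
        # z: (B, T, d) output of the previous block; f_g: (B, 1, d); f_m: (B, N, d).
        zg = torch.cat([z, f_g.expand(-1, z.size(1), -1)], dim=-1)    # concatenation [z; f_g]
        z_p = self.ln_enh(z + self.w_f1(zg) * F.gelu(self.w_f2(zg)))  # gated interactive enhancement

        attn, _ = self.cross_attn(self.ln_q(z_p), f_m, f_m)           # queries from z_p, keys/values from f_m
        z_hat = z_p + attn
        return z_hat + self.ff(self.ln_ff(z_hat))
```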

4. Experiments and Analysis

In this section, we first introduce the classic metrics, public datasets, and configuration settings employed in our experiments. Following this, we compare VCC-DiffNet with eight prominent autoregressive image captioning methods, two partially non-autoregressive methods, and two state-of-the-art non-autoregressive image captioning methods to validate the effectiveness of VCC-DiffNet. Additionally, we provide visualized generation results of VCC-DiffNet compared to the two non-autoregressive methods. Next, we explore the impact of iterative steps on decoding efficiency and the quality of generation. Finally, we analyze the contributions of the two modules (the feature refinement module and the interactive enhancement module) through ablation experiments.

4.1. Datasets and Evaluation Metric

Four public RSIC datasets, Sydney-Captions [8], UCM-Captions [8], RSICD [12], and NWPU-Captions [37], are used in this work. In the four datasets, the distribution of training, validation, and test sets is set at 80%, 10%, and 10%, respectively. The following is an introduction to the four datasets.
Sydney-Captions [8]: The Sydney-Captions dataset comprises 613 images showcasing 7 types of ground objects. It offers five descriptive statements for each image. Each image is sized at 500 × 500 pixels.
UCM-Captions [8]: The UCM-Captions dataset comprises 2100 images depicting 21 types of features. It includes 100 images per class, with each image sized at 256 × 256 pixels and associated with five descriptive statements.
RSICD [12]: The RSICD dataset comprises 10,921 images covering 30 categories of scenes. Each image is sized at 224 × 224 pixels and is meticulously annotated with five descriptive statements.
NWPU-Captions [37]: The NWPU-Captions dataset comprises 31,500 images, containing 45 scene categories. Each scene category contains 700 images, with each image sized at 256 × 256 pixels and associated with five different descriptive sentences.
We use five metrics to comprehensively evaluate model performance: BLEU@N [38], METEOR [39], ROUGE-L [40], CIDEr [41], and Latency [26]. Specifically, BLEU@N, METEOR, ROUGE-L, and CIDEr are classic metrics used to measure the quality of generated sentences, which place emphasis on the frequency of n-grams and the overall fluency of sentences. Latency, on the other hand, refers to the time it takes to decode a single sentence without using mini-batching, and is calculated by averaging the time taken across the entire offline test set.
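Because latency is defined as the average time needed to decode a single sentence without mini-batching, it can be measured with a simple loop such as the sketch below; `model.decode` is a hypothetical single-image captioning call, not an API from the paper's code.

```python
import time

def measure_latency(model, test_images):
    """Average wall-clock seconds to decode one caption at a time over the whole
    offline test set, i.e., the Latency metric described above."""
    total = 0.0
    for image in test_images:
        start = time.perf_counter()
        _ = model.decode(image)               # generate one caption, no mini-batching
        total += time.perf_counter() - start
    return total / len(test_images)

# Speedup is then reported relative to the autoregressive baseline, e.g.:
# speedup = measure_latency(ar_baseline, test_images) / measure_latency(vcc_diffnet, test_images)
```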

4.2. Experimental Settings

In our experiments, all RSIs are resized to 224 × 224 before being input into the model. A server equipped with an NVIDIA GeForce RTX 4090 is used for the research. Training is conducted with a batch size of 32 for 20 epochs. The AdamW optimizer [42] is employed, with a fixed weight decay of 0.01; the learning rate is first warmed up linearly to $2 \times 10^{-5}$ and then decayed to 0 following a cosine schedule. As for the decoder, we follow the strategy of Zhu et al. [5].
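The optimization schedule described above (AdamW with weight decay 0.01, linear warmup to a peak learning rate of $2 \times 10^{-5}$, then cosine decay to 0) can be reproduced with a standard LambdaLR scheduler, as in the sketch below; the warmup length is an assumption, since it is not reported in the paper.

```python
import math
import torch

def build_optimizer_and_scheduler(model, total_steps, warmup_steps=1000,
                                  peak_lr=2e-5, weight_decay=0.01):
    """AdamW with linear warmup followed by cosine decay to zero."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=weight_decay)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                        # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))             # cosine decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```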

4.3. Comparisons with State-of-the-Art Models

Table 1, Table 2, Table 3 and Table 4 present the comprehensive comparison results between our VCC-DiffNet and the state-of-the-art methods on four datasets. The highest score among autoregressive image captioning methods is underlined, while the highest score among non-autoregressive image captioning methods is highlighted in bold. All metrics are given as percentages (%), with the exception of latency and speedup. The symbol “†” indicates that the model is built using the same Transformer architecture. Our method is compared with the following eight prominent autoregressive image captioning methods, which demonstrated superior performance in previous research:
  • Soft-attention [12] proposes to integrate a “soft” attention mechanism with the LSTM decoder.
  • MLCA-Net [37] dynamically gathers visual features using a multi-level attention module at the encoding end. To explore the latent context, it integrates a contextual attention module within the decoder.
  • Deformable Transformer [43] is outfitted with a deformable scaled dot-product attention mechanism that facilitates the acquisition of multi-scale features from both the foreground and background.
  • Topic-Embed [44] proposes inputting topic-sensitive word embedding and multi-scale features into an adaptive attention-based decoder to generate novel captions.
  • GLCM [45] proposes an attention-based global–local captioning model for the purpose of obtaining a representation of global and local visual features for RSIC.
  • GVFGA [18] proposes a global visual feature-guided attention mechanism. In this mechanism, global visual features are introduced, and redundant feature components are removed, thus improving the overall performance of the system.
  • CCT [24] suggests a cooperative connection Transformer to fully capitalize on the benefits of both area features and grid features.
  • ClipCap [46] adopts the GPT-2 module as the autoregressive captioning decoder, and its architecture is the same as our suggested network, which acts as the AR baseline for runtime comparison.
Additionally, we compare our method with two partially non-autoregressive methods:
  • PNAIC [29] incorporates curriculum learning-based training tasks of group length prediction and invalid group deletion to generate accurate captions and prevent common incoherent errors.
  • SATIC [31] introduces a semi-autoregressive model that keeps the autoregressive property globally while generating words in parallel locally.
Also, we compare our method with two recently proposed non-autoregressive image captioning methods:
  • EENAIC [28] presents a semantic retrieval module that retrieves semantic information from image features to reduce the performance gap between autoregressive and non-autoregressive models.
  • DDCap [5] adopts the GPT-2 module as the diffusion-based non-autoregressive decoder, and the structure is identical to both ClipCap and our proposed network.
Table 1. Performance comparison of VCC-DiffNet and state-of-the-art methods on Sydney-Captions.
Method                         B@1     B@2     B@3     B@4     M       R       C        Latency   Speedup
Autoregressive Methods
Soft-attention [12]            73.22   66.74   62.23   58.20   39.42   71.27   249.93   -         -
MLCA-Net [37]                  83.10   74.20   65.90   58.00   39.00   71.10   232.40   -         -
Deformable Transformer [43]    83.73   77.71   71.98   66.59   45.48   78.60   303.69   -         -
Topic-Embed [44]               82.2    74.1    66.2    59.4    39.7    -       270.5    -         -
GLCM [45]                      80.41   73.05   67.45   62.59   44.21   69.65   243.37   -         -
GVFGA [18]                     84.2    75.7    67.2    60.1    42.1    73.3    285.1    -         -
CCT [24]                       86.6    79.9    74.7    69.1    46.0    79.1    286.7    -         -
ClipCap † [46]                 78.47   67.52   57.69   49.22   38.45   72.38   218.30   370 ms    1.00×
Partially Non-Autoregressive Methods
PNAIC [29]                     71.40   52.54   41.23   34.18   27.20   57.61   156.23   46 ms     8.04×
SATIC [31]                     72.16   53.14   42.51   34.95   27.40   57.48   157.14   49 ms     7.55×
Non-Autoregressive Methods
EENAIC [28]                    71.30   51.87   40.61   32.95   27.20   56.30   154.25   2 ms      185.00×
DDCap † [5]                    81.77   71.35   62.13   54.20   39.86   73.70   216.44   50 ms     7.40×
VCC-DiffNet † (Ours)           80.11   72.83   65.98   59.81   42.16   74.90   274.42   45 ms     8.22×
† indicates that the model is built using the same Transformer architecture.
Table 2. Performance comparison of VCC-DiffNet and state-of-the-art methods on UCM-Captions.
Method                         B@1     B@2     B@3     B@4     M       R       C        Latency   Speedup
Autoregressive Methods
Soft-attention [12]            74.54   65.45   58.55   52.50   38.86   72.37   261.24   -         -
MLCA-Net [37]                  82.60   77.00   71.70   66.80   43.50   77.20   324.00   -         -
Deformable Transformer [43]    82.30   77.00   72.28   67.92   44.39   78.39   346.29   -         -
Topic-Embed [44]               83.9    76.9    71.5    67.5    44.6    -       323.1    -         -
GLCM [45]                      81.82   75.40   69.86   64.68   46.19   75.24   302.79   -         -
GVFGA [18]                     84.3    77.5    71.1    65.1    45.3    78.5    338.1    -         -
CCT [24]                       92.2    89.0    86.4    83.3    57.3    88.3    415.6    -         -
ClipCap † [46]                 83.74   77.64   72.29   67.53   44.75   78.90   337.37   418 ms    1.00×
Partially Non-Autoregressive Methods
PNAIC [29]                     77.22   66.57   58.76   53.25   40.10   71.93   280.24   36 ms     11.61×
SATIC [31]                     77.85   67.13   58.98   53.88   40.61   72.00   280.53   38 ms     11.00×
Non-Autoregressive Methods
EENAIC [28]                    76.94   66.00   58.29   52.44   39.12   71.05   279.87   1 ms      418.00×
DDCap † [5]                    81.48   73.67   66.26   59.44   41.85   77.64   305.81   46 ms     9.09×
VCC-DiffNet † (Ours)           87.12   81.77   76.95   72.43   47.77   82.10   366.31   36 ms     11.61×
† indicates that the model is built using the same Transformer architecture.
Table 3. Performance comparison of VCC-DiffNet and state-of-the-art methods on RSICD.
Method                         B@1     B@2     B@3     B@4     M       R       C        Latency   Speedup
Autoregressive Methods
Soft-attention [12]            67.53   53.08   43.33   36.17   32.55   61.09   196.43   -         -
MLCA-Net [37]                  75.70   63.40   53.90   46.10   35.10   64.60   235.60   -         -
Deformable Transformer [43]    75.81   64.16   55.85   49.23   35.50   65.23   258.14   -         -
Topic-Embed [44]               79.8    64.7    56.9    48.9    28.5    -       240.4    -         -
GLCM [45]                      77.67   64.92   56.42   49.37   36.27   67.79   254.91   -         -
GVFGA [18]                     79.3    68.1    57.7    49.8    37.4    68.2    279.3    -         -
CCT [24]                       79.8    69.3    60.8    53.3    38.3    69.2    288.1    -         -
ClipCap † [46]                 71.65   58.6    48.3    40.26   35.7    62.91   222.8    456 ms    1.00×
Partially Non-Autoregressive Methods
PNAIC [29]                     68.50   53.10   42.95   35.12   31.34   57.22   180.37   35 ms     13.03×
SATIC [31]                     68.61   53.24   43.00   35.68   31.98   57.98   180.53   36 ms     12.67×
Non-Autoregressive Methods
EENAIC [28]                    68.20   52.67   42.48   34.95   30.2    56.66   179.32   2 ms      228.00×
DDCap † [5]                    69.40   56.65   46.70   39.08   34.47   63.32   201.58   44 ms     10.36×
VCC-DiffNet † (Ours)           78.85   68.07   59.29   52.02   39.47   70.18   290.74   30 ms     15.20×
† indicates that the model is built using the same Transformer architecture.
Table 4. Performance comparison of VCC-DiffNet and state-of-the-art methods on NWPU-Captions.
Method                         B@1     B@2     B@3     B@4     M       R       C        Latency   Speedup
Autoregressive Methods
Soft-attention [12]            73.10   60.90   52.50   46.20   33.90   59.90   113.60   -         -
MLCA-Net [37]                  74.50   62.40   54.10   47.80   33.70   60.10   126.40   -         -
Deformable Transformer [43]    75.15   62.91   54.57   48.28   31.87   58.58   120.71   -         -
ClipCap † [46]                 74.07   60.05   49.44   41.32   28.52   58.04   108.70   374 ms    1.00×
Partially Non-Autoregressive Methods
PNAIC [29]                     71.63   55.27   44.15   35.69   27.54   56.64   100.23   47 ms     7.96×
SATIC [31]                     71.89   55.75   44.87   36.28   28.46   57.13   101.86   49 ms     7.63×
Non-Autoregressive Methods
EENAIC [28]                    71.37   55.11   43.58   34.84   26.21   55.39   99.99    3 ms      124.67×
DDCap † [5]                    76.73   62.98   51.86   42.95   30.09   61.36   110.48   53 ms     7.06×
VCC-DiffNet † (Ours)           79.72   66.35   56.04   47.93   30.59   61.82   123.24   46 ms     8.13×
† indicates that the model is built using the same Transformer architecture.
From the experimental results, several key observations can be made. Firstly, our proposed VCC-DiffNet outperforms popular non-autoregressive captioning models on the majority of benchmark metrics, and its results are on par with, and in some cases surpass, those of the most advanced autoregressive captioning models. The results also indicate that training with a sufficient amount of labeled data helps VCC-DiffNet reach optimal parameters, leading to significant performance improvements. On the RSICD dataset, VCC-DiffNet surpasses leading autoregressive captioning models in METEOR and CIDEr, and on the NWPU-Captions dataset, it exceeds the performance of leading autoregressive models in BLEU and ROUGE. Secondly, despite employing an iterative inference mechanism, VCC-DiffNet maintains a low latency for sentence generation. Specifically, it achieves approximately an 8.22× speedup in Sydney-Captions, an 11.61× speedup in UCM-Captions, a 15.20× speedup in RSICD, and an 8.13× speedup in NWPU-Captions, compared to the autoregressive (AR) baseline. Although EENAIC boasts the lowest latency, owing to its simplified architecture with reduced layers in both the encoder and decoder, its lightweight design comes with limited generative capability, potentially resulting in performance degradation. Thirdly, when compared to DDCap, the current state-of-the-art non-autoregressive model in terms of generation quality, VCC-DiffNet achieves higher performance metrics, surpassing DDCap by more than 3.0 points in BLEU@4 on every dataset. This improvement is achieved by incorporating more distinct global features and multi-scale grid features for intermediate control, along with adding interactive enhancement prior to caption decoding. In summary, our method achieves a balance between generation quality and inference speed, providing the best overall performance.
We present the generated results for some test data to visually demonstrate the efficacy of our proposed method. Figure 6 depicts ground-truth image captions alongside the qualitative results from different approaches for images from six scene categories in the RSICD dataset. Blue highlights indicate captions that are grammatically incorrect or do not make sense with the content of the image. Furthermore, green highlights the descriptions of scenes and ground objects that match the content of the images.
Overall, all three methods produce semantically relevant captions. However, the rough modeling of image content in EENAIC and DDCap may result in missing words (e.g., the missing “swimming pools” in the second image) or description errors (e.g., the erroneous word “farmlands” in the first image; the erroneous word “two” in the last image). Additionally, both EENAIC and DDCap occasionally produce low-quality captions with word repetition issues (e.g., repetition of “industrial” and “area” in the first image; repetition of “resort” and “a” in the second image; repetition of “an airport” in the fifth image; and repetition of “playground” in the last image) and incoherent captions (e.g., “and are trees” in the fourth image; “a piece of some” in the last image). In contrast, our VCC-DiffNet addresses these issues by integrating discernible visual context features of RSIs to enhance captioning accuracy, resulting in more precise descriptions. Consequently, our VCC-DiffNet demonstrates a keen perception of complex scenes and diverse ground objects (the scenes and ground objects in the above images are almost all predicted correctly), even occasionally generating descriptions richer than the ground-truth sentences (as evident in the third image).

4.4. Effect of Iterative Steps

To visualize the effects of the number of iterative steps on decoding efficiency and generation quality, we plot the average time it takes to decode one sentence in Figure 7 and the scores of the four classic metrics on the four datasets in Figure 8.
As expected, in non-autoregressive decoding, latency is proportional to the iteration count, and the quality of the generated sentences also improves as the number of iterations increases, resulting in higher metric scores. We also find that VCC-DiffNet with 20 iterative steps achieves excellent scores, while the scores with 25 iterative steps differ only slightly from those with 20 steps. To strike a balance between quality and time complexity, we therefore adopt 20 iterative steps for refinement in our experiments.
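The trade-off analysis above amounts to sweeping the number of reverse-diffusion steps and recording both metric scores and per-sentence latency. A sketch of such a sweep is given below; `model.decode_all` and the `evaluate_fn` callback are hypothetical evaluation helpers, not part of the released code.

```python
def sweep_iteration_steps(model, test_set, evaluate_fn, step_grid=(5, 10, 15, 20, 25)):
    """Collect caption quality and per-sentence latency for different numbers of
    reverse-diffusion steps, mirroring the analysis in Figures 7 and 8."""
    results = {}
    for steps in step_grid:
        captions, latency = model.decode_all(test_set, steps=steps)   # hypothetical decoding helper
        scores = evaluate_fn(captions, test_set.references)           # BLEU/METEOR/ROUGE-L/CIDEr
        results[steps] = {"latency": latency, **scores}
    return results
```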

4.5. Ablation Study

The results of the ablation experiments carried out on the four datasets are shown in Table 5. “Baseline” refers to the baseline model, “+Fea-Refine” denotes the addition of the feature refinement module, “+Inter-Enhance” indicates the introduction of the interactive enhancement module, and “VCC-DiffNet” represents our proposed model.
The inclusion of the feature refinement module shows improved results compared to utilizing only coarse-grained grid features. This enhancement helps in better deciphering the complexities within RSIs and attaining more effective feature representation for the intricate scenes and diverse ground objects present in RSIs. Following the incorporation of the interactive enhancement module, there is an improvement in each metric when compared to the baseline. This indicates that introducing pre-interaction between the descriptive sentence and visual inputs prior to the decoder can boost multi-modal interactions and reduce semantic gaps. The pre-interaction enables the decoder to focus non-autoregressively on the descriptive details of the scenes and ground objects accurately. In conclusion, our proposed VCC-DiffNet demonstrates strong performance across all modules, effectively enabling high-quality captioning.

5. Conclusions and Outlooks

In this article, in response to issues such as the inference latency of autoregressive methods, the performance degradation of general non-autoregressive methods, and the difficulty of constructing robust input for diffusion model-based non-autoregressive methods, we make the first attempt to introduce a diffusion decoding paradigm (VCC-DiffNet) into the RSIC task. Specifically, we propose an RMFE module to extract discernible visual representations as the input of the diffusion model-based non-autoregressive decoder. The proposed interactive enhancement module adds a pre-interaction between descriptive sentences and visual inputs before the decoder to better focus on accurate information pertaining to scenes and ground objects. The IE-Decoder adopts a diffusion-based decoding paradigm to generate descriptive sentences in parallel and reduce inference time. Experiments conducted on four representative RSIC datasets demonstrate that our non-autoregressive VCC-DiffNet effectively closes the performance gap with popular autoregressive baselines, achieving around an 8.22× speedup in Sydney-Captions, an 11.61× speedup in UCM-Captions, a 15.20× speedup in RSICD, and an 8.13× speedup in NWPU-Captions. In future research, we will look into improving the current diffusion decoding techniques, including training with only a small amount of labeled data, and we hope our work will serve as inspiration for further efforts in applying diffusion models to remote sensing image captioning.

Author Contributions

Conceptualization, Q.C. and Y.X.; methodology, Y.X.; software, Y.X.; validation, Y.X. and Z.H.; formal analysis, Y.X.; investigation, Z.H.; resources, Y.X.; data curation, Y.X.; writing—original draft preparation, Y.X.; writing—review and editing, Q.C.; visualization, Y.X.; supervision, Q.C.; project administration, Q.C.; funding acquisition, Q.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (grant number 2023YFB3906101).

Data Availability Statement

The data used in this study are public datasets, which can be obtained from the following link: https://github.com/201528014227051/RSICD_optimal (accessed on 23 September 2022) and https://github.com/HaiyanHuang98/NWPU-Captions (accessed on 23 September 2022).

Acknowledgments

The authors would like to express their gratitude to the editors and the anonymous reviewers for their insightful comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhao, B. A Systematic Survey of Remote Sensing Image Captioning. IEEE Access 2021, 9, 154086–154111. [Google Scholar] [CrossRef]
  2. Chen, T.; Zhang, R.; Hinton, G.E. Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning. arXiv 2022, arXiv:2208.04202. [Google Scholar] [CrossRef]
  3. Fei, Z. Fast Image Caption Generation with Position Alignment. arXiv 2019, arXiv:1912.06365. [Google Scholar] [CrossRef]
  4. Li, Y.; Zhou, K.; Zhao, W.X.; Wen, J.-R. Diffusion Models for Non-autoregressive Text Generation: A Survey. arXiv 2023, arXiv:2303.06574. [Google Scholar] [CrossRef]
  5. Zhu, Z.; Wei, Y.; Wang, J.; Gan, Z.; Zhang, Z.; Wang, L.; Hua, G.; Wang, L.; Liu, Z.; Hu, H. Exploring Discrete Diffusion Models for Image Captioning. arXiv 2022, arXiv:2211.11694. [Google Scholar] [CrossRef]
  6. Luo, J.; Li, Y.; Pan, Y.; Yao, T.; Feng, J.; Chao, H.; Mei, T. Semantic-Conditional Diffusion Networks for Image Captioning. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 23359–23368. [Google Scholar]
  7. Xu, S. CLIP-Diffusion-LM: Apply Diffusion Model on Image Captioning. arXiv 2022, arXiv:2210.04559. [Google Scholar] [CrossRef]
  8. Qu, B.; Li, X.; Tao, D.; Lu, X. Deep semantic understanding of high-resolution remote sensing image. In Proceedings of the 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), Kunming, China, 6–8 July 2016; pp. 1–5. [Google Scholar] [CrossRef]
  9. Shi, Z.; Zou, Z. Can a Machine Generate Humanlike Language Descriptions for a Remote Sensing Image? IEEE Trans. Geosci. Remote Sens. 2017, 55, 3623–3634. [Google Scholar] [CrossRef]
  10. Wang, B.; Lu, X.; Zheng, X.; Li, X. Semantic Descriptions of High-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1274–1278. [Google Scholar] [CrossRef]
  11. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  12. Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring Models and Data for Remote Sensing Image Caption Generation. IEEE Trans. Geosci. Remote Sens. 2017, 56, 2183–2195. [Google Scholar] [CrossRef]
  13. Zhang, X.; Wang, X.; Tang, X.; Zhou, H.; Li, C. Description generation for remote sensing images using attribute attention mechanism. Remote Sens. 2019, 11, 612. [Google Scholar] [CrossRef]
  14. Zhang, Z.; Zhang, W.; Diao, W.; Yan, M.; Gao, X.; Sun, X. VAA: Visual Aligning Attention Model for Remote Sensing Image Captioning. IEEE Access 2019, 7, 137355–137364. [Google Scholar] [CrossRef]
  15. Zhang, Z.; Diao, W.; Zhang, W.; Yan, M.; Gao, X.; Sun, X. LAM: Remote Sensing Image Captioning with Label-Attention Mechanism. Remote Sens. 2019, 11, 2349. [Google Scholar] [CrossRef]
  16. Fu, K.; Li, Y.; Zhang, W.; Yu, H.; Sun, X. Boosting Memory with a Persistent Memory Mechanism for Remote Sensing Image Captioning. Remote Sens. 2020, 12, 1874. [Google Scholar] [CrossRef]
  17. Li, Y.; Zhang, X.; Gu, J.; Li, C.; Wang, X.; Tang, X.; Jiao, L. Recurrent Attention and Semantic Gate for Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5608816. [Google Scholar] [CrossRef]
  18. Zhang, Z.; Zhang, W.; Yan, M.; Gao, X.; Fu, K.; Sun, X. Global Visual Feature and Linguistic State Guided Attention for Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5615216. [Google Scholar] [CrossRef]
  19. Wang, J.; Wang, B.; Xi, J.; Bai, X.; Ersoy, O.K.; Cong, M.; Gao, S.; Zhao, Z. Remote Sensing Image Captioning With Sequential Attention and Flexible Word Correlation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6004505. [Google Scholar] [CrossRef]
  20. Shen, X.; Liu, B.; Zhou, Y.; Zhao, J. Remote sensing image caption generation via transformer and reinforcement learning. Multimed. Tools Appl. 2020, 79, 26661–26682. [Google Scholar] [CrossRef]
  21. Liu, C.; Zhao, R.; Shi, Z.X. Remote-Sensing Image Captioning Based on Multilayer Aggregated Transformer. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6506605. [Google Scholar] [CrossRef]
  22. Ren, Z.; Gou, S.; Guo, Z.; Mao, S.; Li, R. A Mask-Guided Transformer Network with Topic Token for Remote Sensing Image Captioning. Remote Sens. 2022, 14, 2939. [Google Scholar] [CrossRef]
  23. Zhao, K.; Xiong, W. Exploring region features in remote sensing image captioning. Int. J. Appl. Earth Obs. Geoinf. 2024, 127, 103672. [Google Scholar] [CrossRef]
  24. Zhao, K.; Xiong, W. Cooperative Connection Transformer for Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5607314. [Google Scholar] [CrossRef]
  25. Lee, J.; Mansimov, E.; Cho, K. Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement. arXiv 2018, arXiv:1802.06901. [Google Scholar] [CrossRef]
  26. Gao, J.; Meng, X.; Wang, S.; Li, X.; Wang, S.; Ma, S.; Gao, W. Masked Non-Autoregressive Image Captioning. arXiv 2019, arXiv:1906.00717. [Google Scholar] [CrossRef]
  27. Guo, L.; Liu, J.; Zhu, X.; He, X.; Jiang, J.; Lu, H. Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), Yokohama, Japan, 11–17 July 2020. [Google Scholar]
  28. Yu, H.; Liu, Y.; Qi, B.; Hu, Z.; Liu, H. End-to-End Non-Autoregressive Image Captioning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  29. Fei, Z. Partially Non-Autoregressive Image Captioning. Proc. AAAI Conf. Artif. Intell. 2021, 35, 1309–1316. [Google Scholar] [CrossRef]
  30. Yan, X.; Fei, Z.; Li, Z.; Wang, S.; Huang, Q.; Tian, Q. Semi-autoregressive image captioning. In Proceedings of the MM ’21: Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 2708–2716. [Google Scholar] [CrossRef]
  31. Zhou, Y.; Zhang, Y.; Hu, Z.; Wang, M. Semi-autoregressive transformer for image captioning. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021; pp. 3139–3143. [Google Scholar] [CrossRef]
  32. He, Y.; Cai, Z.; Gan, X.; Chang, B. DiffCap: Exploring Continuous Diffusion on Image Captioning. arXiv 2023, arXiv:2305.12144. [Google Scholar] [CrossRef]
  33. Liu, G.; Li, Y.; Fei, Z.; Fu, H.; Luo, X.; Guo, Y. Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning. arXiv 2023, arXiv:2309.04965. [Google Scholar] [CrossRef]
  34. Austin, J.; Johnson, D.D.; Ho, J.; Tarlow, D.; van den Berg, R. Structured Denoising Diffusion Models in Discrete State-Spaces. arXiv 2021, arXiv:2107.03006. [Google Scholar] [CrossRef]
  35. Lin, T.Y.; Dollár, P.; Girshick, R.B.; He, K.; Hariharan, B.; Belongie, S.J. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
  36. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  37. Cheng, Q.; Huang, H.; Xu, Y.; Zhou, Y.; Li, H.; Wang, Z. NWPU-captions dataset and MLCA-net for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5629419. [Google Scholar] [CrossRef]
  38. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the ACL ’02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar] [CrossRef]
  39. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
  40. Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Post-Conference Workshop of ACL 2004, Barcelona, Spain, 4–10 July 2004; pp. 74–81. [Google Scholar]
  41. Vedantam, R.; Zitnick, C.L.; Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar] [CrossRef]
  42. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar] [CrossRef]
  43. Du, R.; Cao, W.; Zhang, W.; Zhi, G.; Sun, X.; Li, S.; Li, J. From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2023, 16, 7704–7717. [Google Scholar] [CrossRef]
  44. Zia, U.; Riaz, M.M.; Ghafoor, A. Transforming remote sensing images to textual descriptions. Int. J. Appl. Earth Obs. Geoinf. 2022, 108, 102741. [Google Scholar] [CrossRef]
  45. Wang, Q.; Huang, W.; Zhang, X.; Li, X. GLCM: Global–Local Captioning Model for Remote Sensing Image Captioning. IEEE Trans. Cybern. 2023, 53, 6910–6922. [Google Scholar] [CrossRef]
  46. Mokady, R.; Hertz, A. ClipCap: CLIP Prefix for Image Captioning. arXiv 2021, arXiv:2111.09734. [Google Scholar] [CrossRef]
Figure 1. Illustration of the special characteristic of RS images.
Figure 2. Proposed VCC-DiffNet overview. A caption undergoes tokenization for forward diffusion operations, where tokens are converted into other descriptive tokens or [ M A S K ] tokens. Subsequently, all tokens will be transformed into [ M A S K ] and fed into the decoder. Meanwhile, to acquire more discernible visual features, the multi-scale feature refinement module refines the multi-scale grid features that have been extracted using the multi-scale feature encoder. By incorporating visual features as intermediate control conditions, the decoder initiates a reverse diffusion process, meaning iteratively denoising and predicting the true token sequence every timestep t. That is, the Interactive Enhanced Decoder iterates for T steps to gradually reconstruct the ground-truth sequence of the RSI from an all [ M A S K ] sequence, utilizing conditional visual information from the RSI as input.
Figure 3. Visualization of the intermediate steps of our model’s reverse diffusion process. The reverse diffusion process begins with a complete [ M A S K ] sequence. At each step, those low-confidence tokens are masked and re-predicted in parallel, conditioned on other tokens in the sequence and visual information from the RSI.
Figure 4. Detailed structure of the RMFE module.
Figure 5. Detailed structure of the IE-Decoder.
Figure 6. Examples of remote sensing image captioning results on RSICD.
Figure 7. Latency of decoding one image with different iterative steps on four datasets.
Figure 8. Scores of four classic metrics on four datasets.
Table 5. Ablation performance of our network on different remote sensing image caption datasets.
DATASET            METHOD           B@1     B@2     B@3     B@4     M       R       C
RSICD              Baseline         75.32   63.67   54.32   46.77   37.51   66.86   259.77
                   +Fea-Refine      76.53   65.08   55.88   48.25   37.94   68.72   264.31
                   +Inter-Enhance   77.67   66.14   56.55   48.61   37.99   68.66   260.36
                   VCC-DiffNet      78.85   68.07   59.29   52.02   39.47   70.18   290.74
UCM-Captions       Baseline         86.11   79.20   72.85   67.12   44.60   80.32   341.64
                   +Fea-Refine      86.38   81.05   76.21   71.45   46.66   81.88   356.09
                   +Inter-Enhance   87.04   81.11   75.59   70.28   46.56   82.47   355.31
                   VCC-DiffNet      87.12   81.77   76.95   72.43   47.77   82.10   366.31
Sydney-Captions    Baseline         80.70   67.98   57.34   48.22   39.42   73.09   216.89
                   +Fea-Refine      82.60   73.62   64.50   55.96   42.19   75.13   241.66
                   +Inter-Enhance   83.43   75.07   66.78   58.63   41.61   75.96   243.70
                   VCC-DiffNet      80.11   72.83   65.98   59.81   42.16   74.90   274.42
NWPU-Captions      Baseline         73.98   59.89   49.33   41.47   28.06   57.76   107.94
                   +Fea-Refine      75.84   62.16   51.58   43.31   28.79   58.93   112.50
                   +Inter-Enhance   75.75   62.84   52.67   44.71   29.46   59.96   116.30
                   VCC-DiffNet      79.72   66.35   56.04   47.93   30.59   61.82   123.24
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
