Article

Adaptive-Attention Completing Network for Remote Sensing Image

The Institute of Artificial Intelligence and Robotic, Xi’an Jiaotong University, Xian Ning West Road No. 28, Xi’an 710049, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(5), 1321; https://doi.org/10.3390/rs15051321
Submission received: 13 January 2023 / Revised: 20 February 2023 / Accepted: 22 February 2023 / Published: 27 February 2023

Abstract

The reconstruction of missing pixels is essential for remote sensing images, as they often suffer from problems such as covering, dead pixels, and scan line corrector (SLC)-off. Image inpainting techniques can solve these problems, as they can generate realistic content for the unknown regions of an image based on the known regions. Recently, convolutional neural network (CNN)-based inpainting methods have integrated the attention mechanism to improve inpainting performance, as it can capture long-range dependencies and adapt to inputs in a flexible manner. However, to obtain the attention map for each feature, these methods compute the similarities between the feature and the entire feature map, which may introduce noise from irrelevant features. To address this problem, we propose a novel adaptive attention (Ada-attention) that uses an offset position subnet to adaptively select the most relevant keys and values based on self-attention. This enables the attention to be focused on essential features and to model more informative dependencies over the global range. Ada-attention first employs an offset subnet to predict offset position maps on the query feature map; then, it samples the most relevant features from the input feature map based on the offset positions; next, it computes key and value maps for self-attention using the sampled features; finally, using the query, key and value maps, the self-attention outputs the reconstructed feature map. Based on Ada-attention, we customized a u-shaped adaptive-attention completing network (AACNet) to reconstruct missing regions. Experimental results on several digital remote sensing and natural image datasets, compared with two image inpainting models and two remote sensing image reconstruction approaches, demonstrate that the proposed AACNet achieves strong quantitative performance and good visual restoration results with regard to object integrity, texture/edge detail, and structural consistency. Ablation studies indicate that Ada-attention outperforms self-attention in terms of PSNR by 0.66%, SSIM by 0.74%, and MAE by 3.9%, and can focus on valuable global features using the adaptive offset subnet. Additionally, our approach has also been successfully applied to remove real clouds in remote sensing images, generating credible content for cloudy regions.

1. Introduction

Remote sensing images suffer from missing data problems such as dead pixels, scan line corrector (SLC)-off, and cloud cover [1], which are caused by sensor failures and complex atmospheric environments. These missing data hinder the observation of land surface features, real-time analysis, and further applications. Therefore, recovering lost information in remote sensing images is an urgent problem. Current reconstruction methods for remote sensing data can be divided into four categories: spatial-based, spectral-based, multitemporal-based, and hybrid-based methods [1,2]. Most methods require auxiliary information from other domains, such as other bands of spectral data and time-series data; therefore, they are limited to specific tasks. In contrast, spatial-based methods, also called image inpainting methods [2], try to generate realistic content for missing regions based only on the known regions of the same image, and are widely used to recover information in remote sensing images [2].
In recent years, many excellent learning-based methods have been proposed for image inpainting. Most utilize convolutional neural networks (CNNs) and model image completion as a conditional generation task [3,4]. They learn the distribution of large datasets and generate rich textures and realistic semantic patterns for missing image regions. However, these methods have receptive fields that are too small to cope with diverse and complex corrupted images, as the convolutional kernels are spatially invariant and small.
To address these problems, the dynamic selection network (DSN) inpainting method [5] introduced deformable convolution to select more valuable features at different input positions. This relieves the spatial-invariance problem, but the receptive field is still limited to local features. Some researchers plugged attention modules [6,7] into inpainting convolution networks to model global relationships. However, this requires computing similarities between each query feature and all key features, which is likely to introduce irrelevant relationships.
To eliminate irrelevant feature interference, we designed a new form of attention, namely, adaptive attention (Ada-attention), to dynamically select the most relevant feature for each query feature. The query head of self-attention first transforms the inputs into query feature maps. Then, we propose the offset subnet to predict the offset position maps, and the most relevant features are sampled from the input feature map based on the offset and used to compute key and value maps for self-attention. Finally, using the query, key, and value maps, the self-attention outputs the reconstructed feature map.
Due to the dynamic selection mechanism, our Ada-attention shows a superior inpainting performance compared to standard self-attention, as shown in Figure 1. In the figure, the gray-covered regions represent the corrupted regions. In the high-attention-score point images, the red stars show the specific query, and the purple circles show the high-attention-score points. These points show that our Adaptive Attention (Ada-attention) mainly focuses on the features that are most relevant to the specific query, e.g., the roof. In contrast, standard self-attention assigns high scores to regions that contain a large amount of irrelevant noise. In the key/value position images, cyan dots denote the original uniform coordinates used in self-attention, and the red dots of Ada-attention denote the sampled coordinates adjusted by the offset subnet, which are more inclined toward edges and textures with rich features. In the inpainting result images, local details are displayed in the red box, demonstrating that our attention is superior to self-attention in restoring detailed edges and complete objects.
Based on the proposed Ada-attention, we designed a U-net [8] style network with gated residual blocks for image inpainting, termed Adaptive-attention Completing Network (AACNet). Experiments are performed on three digital remote sensing datasets and two classical natural inpainting datasets to evaluate performance. The results demonstrate that our Ada-attention can capture more informative long-term relationships and improve the performance when reconstructing missing data.
To summarize, the contributions of this paper are as follows:
  • We proposed an Adaptive Attention (Ada-attention) that utilizes an offset position subnet to dynamically select more relevant keys and values and enhance the attention capacity when modeling more informative long-term dependencies.
  • We customized a U-shaped Adaptive-attention Completing Network (AACNet) for remote sensing images that stacks gated residual blocks and our proposed Ada-attention modules.
  • Extensive experiments were conducted on multiple datasets, and the results demonstrated that the proposed attention focused on more informative global features than standard self-attention, and our AACNet reconstruction results outperformed the state-of-the-art baselines.
This paper is organized as follows. Section 2 introduces related works on missing information reconstruction in remote sensing images, learning-based image inpainting, and attention mechanisms. Section 3 describes the AACNet architecture, the Ada-attention module design, and the inpainting objective loss. Section 4 first outlines the datasets, evaluation metrics, and implementation details. Then, it presents the qualitative and quantitative results against comparison baselines for simulated corrupted masks and applies our network to the removal of real clouds in remote sensing images. Finally, it validates our attention designs through ablation studies. Section 5 summarizes the conclusions and discusses future work.

2. Related Work

This section provides an overview of the most relevant works on missing information reconstruction of remote sensing images, learning-based image inpainting, and attention mechanisms.

2.1. Missing Information Reconstruction of Remote Sensing Images

Remote sensing imagery is an important way to observe the earth’s surface. However, sensor problems and the atmospheric environment often lead to information losses. According to the information obtained from different domains, missing-information reconstruction methods can be classified into four classes [1]: spatial-based methods to process spatial relationship data (e.g., digital images) [2]; spectral-based methods to process spectral data (e.g., multispectral/hyperspectral images) [9,10]; temporal-based methods to process time-series data (e.g., multitemporal-spatial, multitemporal–spectral data) [11]; and hybrid methods to process spatial, spectral, and temporal information [12].
The spectral-based methods utilize the relationships between spectral data in different bands to reconstruct corrupted information in some bands [9]. For example, Shen et al. [13] utilized a Bayesian dictionary to learn sparse spectrum relations to restore the Aqua moderate-resolution imaging spectroradiometer (MODIS) band 6 data. However, these methods fail to restore data that are missing in all bands, as occurs when thick clouds cover the scene. Therefore, temporal-based methods rely on time-series auxiliary information to improve restoration results. Some researchers applied histogram matching [14], the similar neighborhood interpolator [15], linear regression [11], sparse representation [16], Markov random fields [17], and deep image priors [18] to spatial–temporal or spectral–temporal information to recover remote sensing data. The hybrid methods [12,19] jointly utilize spatial, spectral, and temporal information but are unable to fully exploit the high correlations among them; for example, the adaptive weighted hybrid tensor completion [20] and nonlocal low-rank tensor completion [19] models only use some of these correlations to recover missing data.
The spatial-based methods are also called "image inpainting" methods, which are classical low-level processing tasks in computer vision. They use the available data of the same image to reconstruct missing regions and do not require auxiliary data; they include interpolation, diffusion-based, exemplar-based, and learning-based methods [1]. Interpolation methods [21] compute weighted averages of sampled values; they do not efficiently utilize all spatial information and can only handle simple smooth features. Diffusion-based methods [22] propagate information from known boundaries to the interior of corrupted regions (e.g., partial differential equation methods [23]); they do not consider the global information of the data and can only restore strong structures or small missing regions. Exemplar-based methods [24,25] search the remaining regions or external datasets for information similar to that in the missing regions. These non-learning methods lack the capacity to understand high-level semantic contexts; hence, they have a very limited performance for large damaged regions containing non-repetitive structures and cannot produce missing objects [3,6]. Driven by the development of deep learning, learning-based spatial methods have shown promising completion results in image inpainting [6]. These methods have a high non-linear fitting ability, can generate novel contents for corrupted regions, and can achieve good results when repairing large corrupted regions. Our Adaptive-attention Completing Network (AACNet) belongs to the learning-based inpainting methods, which are systematically reviewed in the following subsection.

2.2. Learning-Based Image Inpainting

Unlike non-learning approaches, learning-based methods use deep neural networks to capture high-level semantic information, produce more vivid and meaningful content, and significantly improve image inpainting performance.
Pathak et al. [3] first proposed the Context Encoders (CE) method, which adopted an encoder–decoder with generative adversarial loss and reconstruction loss to generate contents for corrupted regions. Iizuka et al. [4] stacked serialized dilated convolutions to capture distant contexts with a larger receptive field and combined this with a generative adversarial network (GAN) to encourage global and local consistency. Sem et al. [26] presented CloudGan to detect clouds via an auto-encoder and remove clouds using SN-PatchGAN. Shao et al. [27] proposed an efficient pyramidal GAN with a mask extraction network and a unified inpainting network to repair diverse degraded Remote Sensing (RS) images. Pan [28] utilized a spatial attention GAN to remove clouds in RS images. Andrea et al. [29] utilized the single-image super-resolution EDSR learning-based network to remove clouds from Sentinel-2 optical multispectral images.
Some two-stage methods first restore the auxiliary information, then refine middle-level data to improve inpainting performance, such as edge connect [30], structure flow [31], and foreground-aware [32]. Furthermore, some diverse plausible solutions also used the two-stage strategy to complete images, such as VQ-VAE [33], probabilistic diverse GAN [34], and PUT [35]. Shao et al. [2] used a two-stage generator with convolution and attention operators to reconstruct one source of RS data. Du et al. [36] introduced the coarse-to-fine network, which paid spatial semantic attention to the reconstruction of RS images. However, these networks commonly have more parameters and are more challenging to train end-to-end. Some researchers adopted a recursive structure to gradually improve the inpainting results, such as recurrent feature reasoning network (RFR) [37] and progressive generative network (PGN) [38]. Nevertheless, the recursive structure networks require more training and inference time. Some inpainting methods focus on parallel processing to improve training and inference efficiencies, such as parallel multi-resolution fusion network [39], dynamic selection network (DSN) [5], and conditional texture and structure dual generation (CTSDG) model [40]. Some researchers use optimized neural operators based on the characteristics of image inpainting to improve performance, such as partial convolution (PC) [41], gated convolution (GC) [42], bilateral convolution (BC) [43], deformable convolution [5], Fourier convolution [44], and region normalization (RN) [45]. Some methods use different attention modules to capture the global available features for image inpainting, such as contextual attention [6], fusion channel and spatial attention [46], multi-scale attention [47], and coherent semantic attention [48].
Learning-based methods that are trained on massive data demonstrate a powerful ability to reconstruct realistic contents for missing data. Considering the above approaches, we designed a one-stage U-Net [8] style network with gated residual blocks to efficiently train and achieve powerful results.

2.3. Attention Mechanism

Attention mechanisms have been extensively studied in natural language processing (NLP) and computer vision, and are widely used in various tasks [49]. Some channel or spatial attention modules use global average pooling or linear layers to enhance essential features, such as the squeeze-and-excitation module [50] and the convolutional block attention module (CBAM) [51]. Self-attention is a key component in many CNNs and transformers, and has been used in a large amount of research. However, the query of self-attention attends to all keys and values, and therefore suffers from irrelevant noise [52]. Window-based or shifted window-based self-attention [53] causes each query to attend to keys within local regions, which decreases the effect of noise but limits the receptive field. Sparse attentions introduce diverse patterns to attend to keys on specific global regions, such as sparse factorizations of the attention matrix [54], low-rank matrices [55], locality-sensitive hashing (LSH)-based attention [56], cluster-based sparse attention [57], and deformable attention [52]. These methods are limited to a small set of keys, which restricts their ability to obtain a globally informative description. In this paper, we extended self-attention with a learnable module to learn more relevant relations in the image inpainting task. Our proposed Ada-attention adaptively selects keys/values to suppress irrelevant information, which makes it robust when completing corrupted images.

3. Methodology

In this section, we first design a U-shaped network as our Adaptive-attention Completing Network (AACNet) to reconstruct corrupted images. Next, we introduce the Adaptive Attention (Ada-attention) module to select more relevant keys and values. Finally, we describe the objective loss used to train our AACNet.

3.1. Network Architecture

Based on the U-Net style architecture, we designed our Adaptive-attention Completing Network (AACNet) to effectively reconstruct missing data in images. An overview is provided in Figure 2a. Our AACNet contains an encoder and decoder with four pyramid stages. Each stage consists of several gated residual blocks or Ada-attention modules. The concatenation operator connects the feature maps of each encoder and decoder stage. We suppose that $I_{gt}$ is the ground-truth image, while $M$ is the binary mask, in which 0 denotes the missing region and 1 denotes the valid region. $I_{in} = I_{gt} \odot M$ indicates the missing data image and serves as our network input, where $\odot$ denotes element-wise multiplication. $I_{pred}$ is our network reconstruction image. The final result is $I_{comp} = I_{gt} \odot M + I_{pred} \odot (1 - M)$. The dimensions of $I_{gt}$, $I_{in}$, $I_{pred}$, and $I_{comp}$ are $H \times W \times 3$, and the dimension of $M$ is $H \times W \times 1$, where $H$ and $W$ are the height and width.
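As a quick illustration of this composition step, the following minimal PyTorch sketch applies the mask to form the network input and then pastes the prediction back into the missing region only; the `aacnet` call is a placeholder for the trained generator, not the actual released interface.

```python
import torch

aacnet = lambda x: torch.zeros_like(x)            # placeholder for the trained AACNet generator

I_gt = torch.rand(1, 3, 256, 256)                 # ground-truth image
M = (torch.rand(1, 1, 256, 256) > 0.3).float()    # binary mask: 1 = valid, 0 = missing
I_in = I_gt * M                                   # masked image, the network input
I_pred = aacnet(I_in)                             # network reconstruction
I_comp = I_gt * M + I_pred * (1 - M)              # keep valid pixels, fill only the holes
```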

3.1.1. Encoder

In the encoder, our network uses the downsampling module to halve the spatial size and double the channel number using a 3 × 3 convolution with a stride of two between adjacent stages. The input image $I_{in}$ is first processed by two gated residual blocks to return the first encoder stage feature $F_{en}^{1} \in \mathbb{R}^{H \times W \times C}$, where $C$ is the basic channel number of the feature. Then, feature $F_{en}^{1}$ is processed by the downsampling module to output $F_{en}^{2} \in \mathbb{R}^{H/2 \times W/2 \times 2C}$, the input feature of the second encoder stage. Subsequently, feature $F_{en}^{2}$ is processed by the second and third stages to return the input feature of the fourth encoder stage $F_{en}^{4} \in \mathbb{R}^{H/8 \times W/8 \times 8C}$. The fourth stage utilizes 5 gated residual blocks and 4 Ada-attention modules to return the whole encoder feature $F_{en}^{4} \in \mathbb{R}^{H/8 \times W/8 \times 8C}$.

3.1.2. Decoder

Between adjacent decoder stages, our network uses the upsampling module to double the spatial size with nearest-neighbor interpolation and halve the channel number with a 3 × 3 convolution. It then concatenates the upsampled feature of the previous decoder stage with the corresponding encoder stage feature and fuses them with a 1 × 1 convolution to return the next stage input feature. Our network first upsamples the encoder output feature $F_{en}^{4}$ to obtain the fourth decoder output feature $F_{de}^{4} \in \mathbb{R}^{H/4 \times W/4 \times 4C}$. Then, it concatenates and fuses $F_{de}^{4}$ and $F_{en}^{3}$ to return the third decoder stage input feature $F_{de}^{3} \in \mathbb{R}^{H/4 \times W/4 \times 4C}$. $F_{de}^{3}$ is processed by four gated residual blocks, the upsampling module, and the concatenation and fusion operations to return the second stage input feature $F_{de}^{2} \in \mathbb{R}^{H/2 \times W/2 \times 2C}$. Subsequently, $F_{de}^{2}$ is processed by the second and first decoder stages to return the reconstructed feature $F_{de}^{1} \in \mathbb{R}^{H \times W \times C}$. Finally, the network utilizes a 7 × 7 convolution and a Tanh function to convert the reconstructed feature $F_{de}^{1}$ back to the predicted image $I_{pred}$ in RGB space.
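The following PyTorch sketch illustrates, under the description above, how the downsampling, upsampling, and skip-fusion steps could look; the module names and exact layer ordering are assumptions for illustration rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Downsample(nn.Module):
    """Halve the spatial size and double the channels with a stride-2 3x3 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * 2, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.conv(x)

class Upsample(nn.Module):
    """Double the spatial size with nearest interpolation, halve the channels with a 3x3 conv."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels // 2, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(F.interpolate(x, scale_factor=2, mode="nearest"))

class SkipFuse(nn.Module):
    """Concatenate a decoder feature with its encoder counterpart and fuse with a 1x1 conv."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(channels * 2, channels, kernel_size=1)

    def forward(self, dec_feat, enc_feat):
        return self.fuse(torch.cat([dec_feat, enc_feat], dim=1))

x = torch.rand(1, 48, 256, 256)
deep = Downsample(48)(x)          # (1, 96, 128, 128)
up = Upsample(96)(deep)           # (1, 48, 256, 256)
fused = SkipFuse(48)(up, x)       # (1, 48, 256, 256)
```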

3.1.3. Gated Residual Block

In our AACNet, we reference the residual connection [58] and the gated convolution [42] to design the gated residual block, and its architecture is shown in Figure 2b. It can be formulated as follows:
$F^{l+1} = F^{l} + \text{GatedConv}\big(\phi\big(\text{Conv}\big(\phi(F^{l})\big)\big)\big),$
where $F^{l}$ and $F^{l+1}$ are the input and output of the $l$-th residual layer. The gated residual block consists of two convolution layers and an identity mapping. Each convolution layer adopts the pre-activation $\phi$ [59], consisting of an instance normalization (IN) operator [60] and a rectified linear unit (ReLU) [61]. Moreover, the second convolution layer adopts gated convolution, which was proposed for completing corrupted images in [42]. The gated convolution is defined as:
$\text{GatedConv}(F) = \text{Conv}(F) \odot \sigma\big(\text{Conv}(F)\big),$
where $\sigma$ is the sigmoid function that outputs soft gating values between zero and one.
This gated residual block utilizes the skip connection to facilitate information flow during training and avoid degradation. Meanwhile, it adopts gated convolution to learn a feature selection mechanism over local receptive fields and adaptively treat valid and invalid features. Furthermore, the Ada-attention module flexibly handles globally relevant dependencies to synthesize compelling features for missing data regions. These two designs enable our AACNet to handle various complex image inpainting problems.
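As a concrete reading of the two equations above, a minimal PyTorch sketch of the gated convolution and the gated residual block might look as follows; the layer hyperparameters are assumptions, not the released configuration.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Gated convolution [42]: GatedConv(F) = Conv(F) * sigmoid(Conv(F))."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)

    def forward(self, x):
        return self.feature(x) * torch.sigmoid(self.gate(x))

class GatedResidualBlock(nn.Module):
    """F^{l+1} = F^l + GatedConv(phi(Conv(phi(F^l)))), with phi = IN + ReLU pre-activation."""
    def __init__(self, channels):
        super().__init__()
        self.pre1 = nn.Sequential(nn.InstanceNorm2d(channels), nn.ReLU(inplace=True))
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.pre2 = nn.Sequential(nn.InstanceNorm2d(channels), nn.ReLU(inplace=True))
        self.gconv = GatedConv2d(channels, channels)

    def forward(self, x):
        return x + self.gconv(self.pre2(self.conv(self.pre1(x))))

block = GatedResidualBlock(channels=48)
y = block(torch.rand(1, 48, 64, 64))   # output has the same shape as the input
```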

3.2. Adaptive Attention (Ada-Attention)

To capture more useful long-range interaction to improve the inpainting performance, we proposed an Adaptive Attention (Ada-attention) module to ensure that each query attends to relevant keys and values. Specifically, it inputs queries into an offset position subnet to select the most relevant keys and values for computing the self-attention result. The architecture is shown in Figure 3.

3.2.1. Self-Attention

We first review the preliminary knowledge of self-attention and explain its problems regarding irrelevant noise relationships. Self-attention learns the response at every position using a weighted sum of features at all positions to capture the long-range dependency [49]. It first embeds the input feature map into three representations, namely, query, key, and value. Then, it connects all pairs of queries and keys to calculate the global similarity score as the attention map. Finally, it aggregates all values using the weighted attention map. The self-attention can be formulated as follows:
$Q = XW_{q}, \quad K = XW_{k}, \quad V = XW_{v}, \quad \text{Attention}(Q, K, V) = S(Q, K)V,$
where $W_{q}, W_{k}, W_{v} \in \mathbb{R}^{d_{i} \times d_{e}}$ are learnable embedding matrices that embed the input feature map $X \in \mathbb{R}^{n \times d_{i}}$ into the query $Q \in \mathbb{R}^{n \times d_{e}}$, key $K \in \mathbb{R}^{n \times d_{e}}$, and value $V \in \mathbb{R}^{n \times d_{e}}$. $n$ is the sequence length (or image resolution), $d_{i}$ is the input feature dimension, and $d_{e}$ is the embedding feature dimension. Commonly, $d_{i}$ equals $d_{e}$. The operator $S$ is the similarity operation that calculates the similarity score between $Q$ and $K$ per pixel; it typically uses the dot product to obtain the similarity score and then the softmax function to return the attention map. The product $S(Q, K)V$ aggregates all values $V$ to generate the attention output.
However, self-attention models correlations among all feature pairs and treats the input as an unordered sequence. This means that irrelevant correlations are incorporated into the attention map and the final result.
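For reference, a minimal PyTorch sketch of this formulation is given below; following the equation above, the common $1/\sqrt{d_{e}}$ scaling is omitted.

```python
import torch
import torch.nn.functional as F

def self_attention(x, Wq, Wk, Wv):
    """Plain self-attention over a flattened feature map x of shape (n, d_i)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                    # embed query, key, value
    attn = F.softmax(Q @ K.transpose(-2, -1), dim=-1)   # (n, n) attention map S(Q, K)
    return attn @ V                                      # aggregate all values

x = torch.rand(64, 32)                                   # n = 64 positions, d_i = 32
Wq, Wk, Wv = (torch.rand(32, 32) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)                      # shape (64, 32)
```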

3.2.2. Single Head Ada-Attention

We aimed to design an attention module that is more suitable for modeling long-range dependency and focuses on relevant features. Inspired by the deformable convolution [62], we designed an Adaptive Attention (Ada-attention) that learns the offset position via a simple offset subnet based on the query and uses the offset position to select the essential input features, which can be used to calculate the key and value. Therefore, our attention adapts to the input data and captures important long-range features to help in the reconstruction task.
For simplicity, we describe one attention head below; the architecture is shown in Figure 3. Given an input feature map $X \in \mathbb{R}^{H \times W \times C}$, $C$ is the channel number of the feature map, $H$ and $W$ are its height and width, and the length of the input feature is $n = HW$. The single-head Ada-attention is expressed as follows:
$Q = XW_{q}, \quad \Delta d = \text{Tanh}\big(\text{SGConv}(Q)\big) \cdot \gamma, \quad X' = \text{Sample}\big(X, (d + \Delta d)\big), \quad K' = X'W_{k}, \quad V' = X'W_{v}, \quad \text{Attention}(Q, K', V') = S(Q, K', B)V',$
where $\Delta d$ is the offset position and $X'$ is the sampled new feature map, which includes more essential features. Our Ada-attention result is calculated from the query $Q$ and the adjusted key $K'$ and value $V'$, and consists of five steps: (1) embedding the query, which converts the input feature map $X$ into the query $Q$; (2) sampling relevant features, which learns the offset position $\Delta d$ and samples the input feature map $X$ to obtain the more relevant features $X'$; (3) embedding the key and value, which converts the new feature map $X'$ into the adapted key $K'$ and value $V'$; (4) matching the query and key, which calculates the similarity scores of the attention map based on the query $Q$, key $K'$, and relative position $B$; (5) aggregating the value, which aggregates $V'$ according to the attention map to obtain the attention result.
The position offset subnet obtains the offset position $\Delta d \in \mathbb{R}^{H \times W \times 2}$ from the query and consists of one Separable Gated Convolution (SGConv), a Tanh function, and a scale operator $\gamma$. The flow diagram is shown in Figure 3a. The offset position $\Delta d$ corresponds to the 2D offset at each position of the input feature map. The SGConv follows the design idea of separable convolution in MobileNet [63] and sequentially includes a gated depthwise convolution (gated DConv) [42,63], layer normalization (LN) [64], Gaussian error linear units (GELU) [65], and a pointwise convolution (PConv) [63]. It is used to learn the offsets based on the query data with few parameters. Then, the Tanh function normalizes the offset values to $[-1, +1]$. The scalar $\gamma$ is a learnable parameter used to modulate the range of the offset positions, and $\gamma$ is initialized to two in our experiments.
The regular grid coordinates of the input feature map are denoted as $d$, with values in the range $[-1, +1]$. We then add the offset position $\Delta d$ to the regular grid coordinates $d$ to obtain the adjusted new coordinates $d'$. $d' = d + \Delta d$ is fractional, so we sample the input feature map $X$ at the new coordinates $d'$ using bilinear interpolation [66] to obtain a new feature map $X'$ for calculating the key and value.
Following the Swin transformer [53], we introduce spatial position information by adding the relative positional embedding $B \in \mathbb{R}^{HW \times HW}$ to the attention map, in which the values of $B$ are taken from a relative position bias matrix. The relative position bias matrix $\tilde{B} \in \mathbb{R}^{(2H-1) \times (2W-1)}$ contains learnable parameters that are optimized during the training stage. The similarity is calculated as the dot product of the query and key and then processed with the softmax function [49]. Therefore, our attention map $S(Q, K', B)$ can be expressed as $\text{softmax}(QK'^{T} + B)$.
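To make the data flow concrete, the sketch below implements a single-head Ada-attention in PyTorch under the description above; the relative position bias $B$ is omitted for brevity, layer normalization is approximated with a channel-wise GroupNorm, and the exact composition of SGConv is an assumption rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetSubnet(nn.Module):
    """SGConv-style offset predictor: gated depthwise conv -> norm -> GELU -> pointwise conv."""
    def __init__(self, channels, k=5):
        super().__init__()
        self.dconv_feat = nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
        self.dconv_gate = nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
        self.norm = nn.GroupNorm(1, channels)                # channel-wise stand-in for LayerNorm
        self.pconv = nn.Conv2d(channels, 2, kernel_size=1)   # two output channels: (x, y) offsets
        self.gamma = nn.Parameter(torch.tensor(2.0))         # learnable offset range, initialized to two

    def forward(self, q):
        gated = self.dconv_feat(q) * torch.sigmoid(self.dconv_gate(q))   # gated depthwise conv
        return torch.tanh(self.pconv(F.gelu(self.norm(gated)))) * self.gamma

class SingleHeadAdaAttention(nn.Module):
    """One-head sketch of Ada-attention; the relative position bias B is omitted."""
    def __init__(self, channels, k=5):
        super().__init__()
        self.proj_q = nn.Conv2d(channels, channels, 1)
        self.proj_k = nn.Conv2d(channels, channels, 1)
        self.proj_v = nn.Conv2d(channels, channels, 1)
        self.offset = OffsetSubnet(channels, k)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.proj_q(x)
        delta = self.offset(q)                                      # offsets, shape (b, 2, h, w)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=x.device),
                                torch.linspace(-1, 1, w, device=x.device), indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0)           # regular grid d in [-1, 1]
        grid = grid + delta.permute(0, 2, 3, 1)                     # adjusted coordinates d + offset
        x_new = F.grid_sample(x, grid, mode="bilinear", align_corners=True)  # sampled features X'
        q_ = q.flatten(2)                                           # (b, c, hw)
        k_ = self.proj_k(x_new).flatten(2)
        v_ = self.proj_v(x_new).flatten(2)
        attn = torch.softmax(torch.einsum("bcm,bcn->bmn", q_, k_), dim=-1)
        out = torch.einsum("bmn,bcn->bcm", attn, v_)
        return out.reshape(b, c, h, w)

ada = SingleHeadAdaAttention(channels=48)
y = ada(torch.rand(1, 48, 32, 32))   # -> torch.Size([1, 48, 32, 32])
```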

3.2.3. Multi-Group Multi-Head Ada-Attention

We expanded the single-head Ada-attention to multiple groups and multiple heads. One group includes one or several heads. Different groups input their features into the same offset position subnet to obtain different offset positions $\Delta d_{g} \in \mathbb{R}^{H \times W \times 2}$.
Based on the group offset position $\Delta d_{g}$, each head performs the same operation as the single head to obtain one head attention result $head_{g,j}$. Then, the single-group multi-head (SGMH) attention concatenates all head attention results $head_{g,1}, \ldots, head_{g,h}$ of this group; this is formulated as below:
$head_{g,j} = \text{Attention}(Q_{g,j}, K'_{g,j}, V'_{g,j}), \quad SGMH(Q_{g}, K'_{g}, V'_{g}) = \text{Concatenate}(head_{g,1}, \ldots, head_{g,h}),$
where $Q_{g,j} \in \mathbb{R}^{n \times C/(gh)}$ is the embedded query feature of one head of group $g$, and $K'_{g,j} \in \mathbb{R}^{n \times C/(gh)}$ and $V'_{g,j} \in \mathbb{R}^{n \times C/(gh)}$ are the adjusted key and value of one head of group $g$. $Q_{g}$, $K'_{g}$, and $V'_{g}$ concatenate all $Q_{g,j}$, $K'_{g,j}$, and $V'_{g,j}$ of this group, respectively.
Then, we concatenate all $SGMH(Q_{g}, K'_{g}, V'_{g})$ to obtain the multi-group multi-head Ada-attention, as follows:
$MGMH(Q, K', V') = \text{Concatenate}\big(SGMH(Q_{1}, K'_{1}, V'_{1}), \ldots, SGMH(Q_{g}, K'_{g}, V'_{g})\big) W_{o},$
where $W_{o}$ is a 1 × 1 convolution, and $MGMH(Q, K', V')$ is the final result of the multi-group multi-head Ada-attention. Pseudo-code for the multi-group multi-head Ada-attention is listed in Algorithm 1.
Algorithm 1 Pseudo-code for multi-group multi-head Ada-attention
Input: X, a feature map of shape [B, C, H, W] (batch size, channels, height, width); HD, the number of heads; G, the number of groups
Output: out, the result of multi-group multi-head Ada-attention with shape [B, C, H, W].
1: Q = proj_q(X)  # Embedding query, shape = [B, C, H, W]
2: offset_d = offset_position_subnet(Q)  # Predict offsets, shape = [B, G, 2, H, W]
3: d = get_regular_grid_coordinate(B, G, H, W)  # shape = [B, G, 2, H, W]
4: d_new = offset_d + d
5: d_new = d_new.reshape(B*G, 2, H, W)
6: X_re = X.reshape(B*G, C/G, H, W)
7: X_new = grid_sample(X_re, d_new)  # Sample relevant features, shape = [B*G, C/G, H, W]
8: X_new = X_new.reshape(B, C, H, W)
9: K = proj_k(X_new).reshape(B*HD, C/HD, HW)  # Embedding key
10: V = proj_v(X_new).reshape(B*HD, C/HD, HW)  # Embedding value
11: Q = Q.reshape(B*HD, C/HD, HW)
12: attn = einsum('b c m, b c n -> b m n', Q, K)  # Match query and key to obtain the attention map, attn shape = [B*HD, HW, HW]
13: attn = softmax(attn, dim=2)
14: out = einsum('b m n, b c n -> b c m', attn, V)  # Aggregating value, shape = [B*HD, C/HD, HW]
15: out = proj_out(out.reshape(B, C, H, W))  # shape = [B, C, H, W]

3.2.4. Complexity Analysis

We analyzed the computational complexity and parameter counts of our Ada-attention and of self-attention. For simplicity, we only analyzed attention with one head; the results are shown in Table 1.
The self-attention computation is $2(HW)^{2}C + 4HWC^{2}$ [49], where $H$, $W$, and $C$ are the input feature map height, width, and channel number. The $2(HW)^{2}C$ term is the computation of the attention map $S(Q, K)$ and the value aggregation $S(Q, K)V$. The $4HWC^{2}$ term is the computation used to embed the input feature map into the query $Q$, key $K$, and value $V$, and to convert the aggregated values to the output. The self-attention parameter count is $4C^{2}$, corresponding to the embedding matrices $W_{q}$, $W_{k}$, $W_{v}$ and the output transformation, i.e., four 1 × 1 convolution weights.
The Ada-attention computation is $2(HW)^{2}C + 4HWC^{2} + 2HWC(k^{2}+1)$; the additional $2HWC(k^{2}+1)$ term is the offset subnet computation. The Ada-attention parameter count is $4C^{2} + 2C(k^{2}+1)$; the additional $2C(k^{2}+1)$ term is the offset subnet parameter count. $k$ is the kernel size of SGConv; it is smaller than $C$, $H$, and $W$, and is set to 5 in our experiments. Therefore, the subnet computation and parameters are much smaller than those of self-attention and only slightly increase the complexity of Ada-attention. However, the more relevant selected global features improve the attention's ability to focus and improve the entire network's performance, so the slight increase in complexity can be ignored.
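To make this overhead concrete, consider a rough estimate for the fourth encoder stage of our network on a 256 × 256 input, where $H = W = 32$, $C = 8 \times 48 = 384$, and $k = 5$: the self-attention terms give $2(HW)^{2}C + 4HWC^{2} \approx 1.41 \times 10^{9}$ operations and $4C^{2} = 589{,}824$ parameters, while the offset subnet adds $2HWC(k^{2}+1) \approx 2.0 \times 10^{7}$ operations and $2C(k^{2}+1) = 19{,}968$ parameters, i.e., roughly 1.5% more computation and 3.4% more parameters per attention module.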

3.3. Inpainting Loss

We referenced several image inpainting approaches [30,37,41,42] and used four losses to jointly train our proposed inpainting network: the pixel-wise reconstruction loss, the perceptual loss, the style loss, and the adversarial loss. The reconstruction loss $L_{r}$ [41] reduces the pixel-wise difference; the perceptual loss $L_{p}$ [67] improves the semantic consistency; the style loss $L_{s}$ [68] suppresses "checkerboard" artifacts; and the adversarial loss $L_{a}$ [68] discriminates between the actual and predicted images to encourage realistic features. The objective loss can be formulated as follows:
$L_{obj} = \alpha_{r} L_{r} + \alpha_{p} L_{p} + \alpha_{s} L_{s} + \alpha_{a} L_{a},$
where $\alpha_{r}$, $\alpha_{p}$, $\alpha_{s}$, and $\alpha_{a}$ are hyperparameters, set to 1.0, 1.0, 250, and 0.1, respectively, in our experiments, following [30,40].
The reconstruction loss $L_{r}$ computes the pixel-wise $L_{1}$ distance between the predicted result $I_{pred}$ and the ground-truth image $I_{gt}$. It is defined as:
$L_{r} = \| I_{pred} - I_{gt} \|_{1}.$
The perceptual loss $L_{p}$ measures high-level semantic similarity by reducing the $L_{1}$ distance between high-level feature maps extracted from the pre-trained VGG-19 classification network [69], and is defined as:
$L_{p} = \sum_{i=0}^{N-1} \| \Psi_{i}(I_{pred}) - \Psi_{i}(I_{gt}) \|_{1},$
where $\Psi_{i}$ is the $i$-th selected feature map from the ReLU layers of the VGG-19 network. Five feature maps at different levels are used, i.e., $N = 5$ [30]. These feature maps are also utilized in the style loss.
The style loss $L_{s}$ computes the distance between the Gram matrices $G$ [70] of the feature maps $\Psi_{i}$ and is formulated as:
$L_{s} = \sum_{i=0}^{N-1} \| G(\Psi_{i}(I_{pred})) - G(\Psi_{i}(I_{gt})) \|_{1}.$
The adversarial loss $L_{a}$, via the multi-gradients network $D$ [71], discriminates between the prediction results $I_{pred}^{all}$ and the ground-truth images $I_{gt}^{all}$, which guides our generator AACNet to restore more realistic contents. It is defined as:
$L_{a} = \mathbb{E}_{I_{gt}^{all}}\big[\log D(I_{gt}^{all})\big] + \mathbb{E}_{I_{pred}^{all}}\big[\log\big(1 - D(I_{pred}^{all})\big)\big],$
where $I_{gt}^{all}$ denotes the ground truths, including the ground-truth image $I_{gt}$ and the downsampled ground-truth images $I_{gt,k}$, and $I_{pred}^{all}$ denotes the prediction results, including the reconstruction result $I_{pred}$ and the intermediate reconstruction images $I_{pred,k}$. The intermediate reconstruction images are obtained from the encoder features of stages 2–4 via two convolution layers, and the corresponding image sizes are $H/2 \times W/2 \times 3$, $H/4 \times W/4 \times 3$, and $H/8 \times W/8 \times 3$.
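A condensed PyTorch sketch of this objective, under the weights stated above, is given below; `vgg_features` and the discriminator `D` are assumed helpers, only the full-resolution adversarial term is shown (the multi-scale intermediate outputs are omitted), and the generator-side adversarial term is written in its usual non-saturating form rather than exactly as in the equation above.

```python
import torch
import torch.nn.functional as F

def gram(feat):
    """Gram matrix of a (B, C, H, W) feature map."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def generator_loss(I_pred, I_gt, vgg_features, D):
    """L_obj = 1.0*L_r + 1.0*L_p + 250*L_s + 0.1*L_a (single-scale sketch)."""
    l_r = F.l1_loss(I_pred, I_gt)                                     # reconstruction loss
    feats_pred, feats_gt = vgg_features(I_pred), vgg_features(I_gt)   # lists of VGG-19 features
    l_p = sum(F.l1_loss(fp, fg) for fp, fg in zip(feats_pred, feats_gt))              # perceptual
    l_s = sum(F.l1_loss(gram(fp), gram(fg)) for fp, fg in zip(feats_pred, feats_gt))  # style
    l_a = -torch.log(torch.sigmoid(D(I_pred)) + 1e-8).mean()          # adversarial (generator side)
    return 1.0 * l_r + 1.0 * l_p + 250 * l_s + 0.1 * l_a
```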

4. Experiments

We first describe the datasets, evaluation metrics, and implementation details of our experiments. Then, we compare our AACNet with state-of-the-art methods, including restoration approaches for RS and natural images, and apply it to cloud removal on real scene data. Finally, we conduct ablation studies to analyze the effectiveness of our Ada-attention, as well as its attention-focusing ability and parameter efficiency.

4.1. Experiments Details

4.1.1. Datasets

We used three public RS image datasets and two classical natural image inpainting datasets to verify the performance of our customized AACNet and the effectiveness of its key modules. Furthermore, we extended our AACNet to tackle the problem of cloud removal from remote sensing images. The details of all datasets are given in Table 2.
RS image datasets consist of the Aerial Image Dataset (AID) [72], PatternNet [73], and NWPU-RESISC45 [74]. AID contains 10,000 aerial images obtained by RGB rendering from the original optical aerial images, with a spatial resolution ranging from approximately 0.5 to 0.8 m [72]. PatternNet has 30,400 images with a spatial resolution ranging from 0.062 to 4.7 m [73]. NWPU-RESISC45 contains 31,500 RS images from Google Earth imagery with a spatial resolution ranging from 0.2 to 30 m [74]. We randomly selected 500, 400, and 500 images from AID, PatternNet, and NWPU-RESISC45, respectively, for testing.
Classical natural image inpainting datasets consist of Paris StreetView [75] and CelebA-HQ [76]. Paris StreetView dataset includes the buildings obtained from street views of Paris. It is suitable for reconstructing corrupted buildings and contains 14,900 training images and 100 test images [75]. CelebA-HQ [76] contains high-quality celebrity face images selected from CelebA [79], and is suitable for face restoration or synthesis. This includes 30,000 face images, and we used the first 2000 images as the test set.
For cloud removal, we trained and tested the network on two public datasets, namely the Remote sensing Image Cloud rEmoving dataset (RICE) [77] and SEN12MS-CR [78]. The RICE dataset includes RICE-I (collected from Google Earth) and RICE-II (collected from the Landsat 8 OLI/TIRS dataset). We conducted experiments on RICE-II, which consists of 736 sets of cloud, mask, and ground-truth images; we randomly divided these into 630 training sets and 106 test sets. The SEN12MS-CR dataset was specifically designed for cloud removal in multispectral images and consists of 101,615 training triplets and 7899 test triplets. Each triplet comprises a Sentinel-1 SAR image, a cloudy Sentinel-2 multispectral optical image, and a cloud-free Sentinel-2 multispectral optical image.
During training, we randomly generated masks following the GC paper [42] to corrupt images. During testing, we used the irregular mask dataset [41] to evaluate different ratios of corrupted regions. This dataset contains 12,000 test mask images, and the corrupted region ratio ranges from 0% to 60%. In our experiments, we evaluated our model on the 10–30% and 30–50% ratios of irregular masks. Furthermore, we manually generated mask images simulating cloud-covered and dead-line scenes to verify image restoration in these cases. In the real cloud removal experiment, we used all cloud-covered pixels as the mask regions to train and test the model.

4.1.2. Evaluation Metrics

We utilized six numerical metrics to quantitatively evaluate the image inpainting results $I_{comp}$ generated by our method in comparison to the ground-truth image $I_{gt}$. These metrics comprise the peak signal-to-noise ratio (PSNR) [80], structural similarity index measure (SSIM) [81], multi-scale structural similarity index (MS-SSIM) [82], feature similarity index measure (FSIM) [83], mean absolute error (MAE) [84], and learned perceptual image patch similarity (LPIPS) [85]. The PSNR and SSIM metrics assess the spatial quality and structural similarity, respectively, of the inpainted images. The MS-SSIM, an enhancement of the SSIM, incorporates a multi-scale evaluation approach to provide a more comprehensive evaluation. The FSIM relies on phase congruency and gradient magnitude to evaluate local image quality, weighting the evaluations by phase congruency to derive a final quality score. The MAE metric measures the absolute pixel-level error between the two images. The LPIPS is based on a perceptual similarity dataset and calculates a similarity distance that is consistent with human perception. In general, higher values of PSNR, SSIM, MS-SSIM, and FSIM indicate improved image quality, while lower values of MAE and LPIPS are favorable.
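As an illustration of how such metrics can be computed, the following sketch uses scikit-image (version 0.19 or later assumed, for the channel_axis argument) for PSNR and SSIM and NumPy for MAE; the LPIPS, MS-SSIM, and FSIM implementations used in the paper are not reproduced here.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(img_comp, img_gt):
    """img_comp, img_gt: H x W x 3 uint8 arrays; returns (PSNR, SSIM, MAE)."""
    psnr = peak_signal_noise_ratio(img_gt, img_comp, data_range=255)
    ssim = structural_similarity(img_gt, img_comp, channel_axis=-1, data_range=255)
    mae = np.mean(np.abs(img_gt.astype(np.float64) - img_comp.astype(np.float64)))
    return psnr, ssim, mae
```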

4.1.3. Implementation Details

Our AACNet was implemented in PyTorch [86] and trained on all datasets with 256 × 256 images as input and a batch size of 8. There are 28 layers in the whole network, including 23 gated residual block layers, 4 Ada-attention module layers, and one convolution layer; the details are shown in Table 3. We initialized the weights using a random Gaussian distribution with a standard deviation of 0.02 [87] and used AdamW [88] with betas (0.5, 0.9) for optimization. We set the basic channel $C$ to 48. Each dataset was first trained from scratch and then fine-tuned, using learning rates of $10^{-4}$ and $10^{-5}$, respectively. We used 200 epochs for initial training and 80 epochs for fine-tuning on the AID, Paris StreetView, and CelebA-HQ datasets, and 70 epochs for each phase on the NWPU-RESISC45 and PatternNet datasets. The code and pretrained models used in this study are publicly available at https://github.com/huangwenwenlili/AACNet (accessed on 1 January 2023).
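A minimal sketch of these training settings (Gaussian initialization with a standard deviation of 0.02 and AdamW with betas (0.5, 0.9)) is shown below; the placeholder module stands in for the AACNet generator and is not the actual model definition.

```python
import torch
import torch.nn as nn

def init_weights(m):
    """Gaussian initialization with std 0.02 for convolution layers."""
    if isinstance(m, nn.Conv2d):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model = nn.Conv2d(3, 48, kernel_size=3, padding=1)  # placeholder for the AACNet generator
model.apply(init_weights)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.5, 0.9))  # 1e-5 for fine-tuning
```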

4.2. Performance of AACNet

4.2.1. Comparison Baselines

To verify performance, we compared our AACNet with two popular image inpainting models and two RS image reconstruction approaches. The image inpainting models contain dynamic selection network (DSN) [5] and large mask inpainting (LaMa) [44]. The DSN model has the same U-Net architecture as our AACNet and utilizes the data-dependent idea to adaptively select features. LaMa is a high-performance model. Further details are given below.
  • DSN [5]. This adopts deformable convolution and regional composite normalization in a parallel U-Net model to dynamically select features and normalization styles and improve image completion.
  • LaMa [44]. This uses fast Fourier convolutions to capture broad receptive fields on large training masks with lower parameter and time costs and achieves an excellent performance.
RS image reconstruction models contain inpainting-on-RSI (InRSI) [2] and bilateral convolution (BC) [43]. Their details are as below.
  • InRSI [2]. Based on single source data, this adopts the two-stage generator with gated convolutions and attentions. It also introduces local and global patch-GAN to jointly generate better contents for the RS image.
  • BC [43]. This proposes a bilateral convolution to generate useful features from known data and utilizes multi-range window attention to capture wide dependencies and reconstruct missing data in the RS image.

4.2.2. Quantitative Comparison with Baselines

The performance of the AACNet model was compared to that of four other models, namely DSN [5], LaMa [44], InRSI [2], and BC [43], on five distinct datasets. The evaluation was conducted through experiments employing quantitative metrics, including PSNR, SSIM, MS-SSIM, FSIM, MAE, and LPIPS. The results are presented in Table 4 and Table 5, and a statistical sunburst chart in Figure 4 provides an overview of the performance metrics of the proposed AACNet on all datasets. The results demonstrate that the AACNet model achieved a superior performance in the PSNR, SSIM, and MS-SSIM metrics across all datasets. The FSIM and MAE values of our model for the 10–30% mask on the AID and PatternNet datasets were slightly inferior to those of the LaMa model, and the LPIPS values of our model were slightly behind those of the BC model.
In conclusion, the experimental results indicate that the AACNet model demonstrated a remarkable performance regarding its spatial, structural, and pixel-level characteristics, while exhibiting only a minimal decline in human perception. These findings affirm the ability of the AACNet model to effectively reconstruct both RS and natural images.

4.2.3. Qualitative Comparison with Baselines

We selected reconstruction results for simulated dead lines, simulated cloud occlusion, and random masks from the test images of the five datasets, shown in Figure 5 and Figure 6. The visual comparisons with the four baseline methods show that our AACNet achieved better reconstruction results than the other models. In remote sensing image restoration, LaMa, BC, and our model achieved good visual restoration results. Our method achieved the best recovery results in terms of object integrity (e.g., the middle pattern of the baseball field, bridge, airplane, storage tank, house, island, and baseball diamond), texture and edge (e.g., airport, basketball court, and runway), and structural consistency (e.g., farmland and harbor). In natural image restoration, our method could best restore the windows, vehicles, and pipes in the Paris StreetView dataset. In face restoration, better details could be generated by our model, such as for the forehead, nose, glasses, and eyes.

4.3. Real Data Experiment

Removing clouds for RS images is a crucial application; thus, we conducted experiments on real-scene data using our inpainting network, known as AACNet. We trained our model on the RICE-II and SEN12MS-CR datasets using the same implementation settings as other datasets. To evaluate the performance of our model, we performed a comparison with SpA-GAN [28] and BC [43] on the RICE-II dataset, and with DSen2-CR [29] on the SEN12MS-CR dataset. The SpA-GAN [28] model incorporates spatial attention in a generative adversarial network to generate cloud-free images, while BC [43] is a recent image inpainting model specifically designed for cloud removal in RS imagery. The DSen2-CR [29] is a deep residual neural network designed for removing clouds from multispectral data.
The results of the comparison are presented in Table 6. Our model outperformed BC in terms of PSNR, SSIM, and MAE by 4.68%, 0.76%, and 21.71%, and SpA-GAN by 9.01%, 8.76%, and 26.29%, respectively, on the RICE-II dataset. On the SEN12MS-CR dataset, our model achieved PSNR and SSIM scores that were 1.86% and 5.15% higher than those of DSen2-CR. The results of the cloud removal visualization on the RICE-II dataset are presented in Figure 7. The results demonstrate that our AACNet model provides the closest approximation to the ground truth for thin to thick clouds in different scenes, such as bare land, grassland, and ocean. SpA-GAN's results contain blurring and artifacts, while BC achieved good results, with only a few artifacts and messy textures. Figure 8 presents the cloud removal visualization results on the multispectral SEN12MS-CR dataset. Our model provides the closest results to the ground truth, while DSen2-CR's results are fuzzy and contain artifacts, with poor performance under thick clouds. In conclusion, the results of our quantitative and qualitative experiments indicate that our AACNet can be applied as a reliable solution for cloud removal in both RGB and multispectral RS imagery.

4.4. Ablation Study

In this section, we first performed experiments with multi-head self-attention on our backbone to verify the effectiveness of our attention. Then, we analyzed the attention-focusing ability with the offset position. Furthermore, we analyzed the impact of the head and group parameters of our Ada-attention on network performance. Experiments were conducted on the AID dataset.

4.4.1. Effectiveness of Our Ada-Attention

We designed three networks to verify the effectiveness of our Ada-attention in image inpainting. (1) The “Base” network only includes 28 layers of the gated residual block with U-Net architecture. (2) The “W/MHSA” network replaces the four blocks in the fourth stage of the base network with multi-head self-attention (MHSA). (3) Our AACNet replaces the MHSA in the “W/MHSA” network with multi-group multi-head Ada-attention; network details are shown in Table 3.
The quantitative results are shown in Table 7. In the "W/MHSA" experimental results, the precision metrics increased, indicating that MHSA could use global features to help restore damaged regions. In the experimental results of our AACNet, the average PSNR, SSIM, and MAE improved by 0.66%, 0.74%, and 3.9%, respectively, compared with the "W/MHSA" network. This shows that our proposed Ada-attention can capture valuable features in the global receptive field using the adaptive position subnet and improve the performance of the restoration network. We compared the visual inpainting results of these three networks, as shown in Figure 9. The visualization results demonstrate that our Ada-attention can capture useful global features to restore intact content with consistent and rich details. For example, in our Ada-attention results, the square and baseball field were restored intact in the first and third rows, the parking lot was reconstructed with consistency and realism in the second row, and the broken bridge was repaired with texture and edges in the fourth row. However, the "Base" and "W/MHSA" results suffered from incomplete objects, blurry contents, and sparse details.

4.4.2. Attention-Focusing Ability Analysis

To analyze whether Ada-attention learned the most relevant features, we visualized the top 5% attention scores and the locations adjusted by the offset subnet in Figure 10. The high-attention-score images show that our Ada-attention focused on the features most relevant to the query, such as the viaduct and the airport ground. In contrast, the self-attention scores were distributed over many unrelated features, such as keys containing grass features for the viaduct and ground queries. The red position dots in the key/value position images were sampled by the offset subnet and are more inclined toward edges and textures with rich features, such as the edges of buildings. Our Ada-attention obtained better inpainting results, showing that our position offset subnet attracts more useful keys and values to the attention map and obtains useful features for image completion.

4.4.3. Attention Module Position Analysis

An experimental investigation was conducted to evaluate the impact of the attention mechanism’s position on the AACNet network architecture. The study focused on three distinct position configurations of the Ada-attention module located in the fourth encoder stage, which comprised five gated residual blocks and four Ada-attention modules. The first configuration, referred to as “Edge”, placed four attention modules at the boundaries of the five residual blocks, i.e., [2 × Ada-attention, 5 × gated residual block, 2 × Ada-attention]. The second configuration, referred to as “Inter”, placed four consecutive attention modules in intermediary layers, i.e., [2 × gated residual block, 4 × Ada-attention, 3 × gated residual block]. The third configuration employed attention modules between residual blocks within the AACNet architecture, as illustrated in Figure 2.
The quantitative results, as presented in Table 8, show that there were only slight variations in the metrics across the three attention configurations, indicating that the proposed network is highly resilient to the position of the attention module.

4.4.4. Parameter Efficiency

Our multi-group multi-head Ada-attention has two hyperparameters: head and group numbers. Therefore, we conducted the following two experiments to analyze the impact of different settings on performance.
Experiment 1: In the AACNet (details in Table 3), we set the group number $g = 1$, the basic channel $C = 48$, and the head number to 1, 2, 4, 8, 12, 16, and 32, respectively. We compared and analyzed the performance of different head numbers on the AID dataset. The curves are shown in Figure 11. The PSNR, SSIM, and MAE curves show that the network performance metrics reached their optimal values when the head number was 4. The parameter and FLOP curves show that the parameter number and FLOPs of our attention increased with the head number, because each head needs to calculate its attention map with a complexity quadratic in the input length, which increases the necessary parameters and computations. Considering accuracy and complexity, we recommend setting the head number between 4 and 12.
Experiment 2: In the AACNet, we set the basic channel $C = 48$ and the head number to 12. In our Ada-attention design, the head number must be divisible by the group number, so we conducted experiments with group numbers of 1, 2, 3, 4, 6, and 12 on the AID dataset. The PSNR, SSIM, and MAE curves in Figure 12 show that a larger group number does not necessarily yield higher accuracy; our network obtained the best performance with a group number of 2. The parameter and FLOP curves of our attention show that the parameter count decreased as the group number increased, because the input feature channels of the offset position subnet decrease as the group number increases, leading to fewer subnet parameters. However, all channels still need to calculate the position offsets of all groups, so the FLOPs remain unchanged. We recommend setting the group number between 2 and 4.

4.4.5. Robustness Analysis

The robustness of the proposed AACNet model for image restoration was analyzed through experiments. The experiments were performed on various image types, including natural, remote sensing, and multispectral images, under different damage scenarios. These scenarios consisted of various mask ratios, ranging from 0% to 60%, and different means of corruption, such as random/cloud masks, dead lines, and real cloud-covered scenes. Furthermore, the model’s ability to restore images of different input resolutions was also tested.
The results of the experiments, presented in Table 4 and Table 5 and Figure 5 and Figure 6, demonstrate the efficacy of the AACNet model in restoring natural and remote sensing images with different corrupted ratios and shapes. The results of the real data cloud removal experiments, presented in Table 6 and Figure 7 and Figure 8, further demonstrate the model’s ability to effectively recover digital and multispectral images for real-world cloud-covered scenes. In terms of input resolution, the model was trained on 256-resolution images, and the results presented in Table 9 and Figure 13 indicate that AACNet is capable of processing images with varying resolutions, and achieves robust recovery results when the resolution of the input image is relatively close to the training data resolution of 256, such as 384 and 128.

5. Conclusions

This study introduced a U-shaped AACNet with Ada-attention and gated residual blocks to restore missing data in RS and natural images. The proposed Ada-attention selectively attends to relevant global features in the keys and values using a data-dependent offset position subnet rather than attending to all features, thereby reducing irrelevant feature dependencies and capturing essential features to model informative long-term dependencies. The experimental results demonstrate the effectiveness of the proposed network, which outperformed RS and natural image inpainting baselines in terms of quantitative and qualitative performance on various scenes with missing data. Ablation studies confirm the effectiveness of the Ada-attention mechanism in capturing essential features. We successfully applied our network to the removal of clouds from real RGB and multispectral data, and it could generate appropriate cloudless content. Furthermore, the proposed AACNet exhibited robustness when processing diverse image resolutions, random corrupted regions, and different forms of damage, making it suitable for processing digital remote sensing, multispectral, and natural images. However, the proposed Ada-attention has quadratic complexity with respect to the input length, restricting its use to middle or deep layers with short input lengths. Future work will investigate efficient attention modules for image restoration.

Author Contributions

Conceptualization, W.H. and Y.D.; data curation, W.H. and S.H.; formal analysis, W.H.; funding acquisition, J.W.; investigation, Y.D. and S.H.; methodology, W.H.; project administration, W.H. and Y.D.; resources, W.H. and Y.D.; software, W.H.; supervision, Y.D. and J.W.; validation, Y.D. and S.H.; visualization, Y.D.; writing—original draft, W.H. and S.H.; writing—review and editing, J.W. and Y.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China under Grant No. 2017YFA0700800.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to sincerely appreciate the reviewers and editors for their valuable suggestions and careful work in improving the presentation of this manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RS: remote sensing
AACNet: adaptive-attention completing network
Ada-attention: adaptive attention
SGConv: separable gated convolution
MHSA: multi-head self-attention
CNNs: convolutional neural networks
NLP: natural language processing
SLC: scan line corrector
MODIS: moderate-resolution imaging spectroradiometer
PC: partial convolution
GC: gated convolution
RN: region normalization
DSN: dynamic selection network
RFR: recurrent feature reasoning network
BC: bilateral convolution
LaMa: large mask inpainting
InRSI: inpainting-on-RSI
IN: instance normalization
ReLU: rectified linear unit
GELU: Gaussian error linear unit
LN: layer normalization
GAN: generative adversarial network
PSNR: peak signal-to-noise ratio
SSIM: structural similarity index measure
MAE: mean absolute error

References

  1. Shen, H.; Li, X.; Cheng, Q.; Zeng, C.; Yang, G.; Li, H.; Zhang, L. Missing information reconstruction of remote sensing data: A technical review. IEEE Geosci. Remote Sens. Mag. 2015, 3, 61–85. [Google Scholar] [CrossRef]
  2. Shao, M.; Wang, C.; Wu, T.; Meng, D.; Luo, J. Context-based multiscale unified network for missing data reconstruction in remote sensing images. IEEE Geosci. Remote Sens. Lett. 2020, 19, 1–5. [Google Scholar] [CrossRef]
  3. Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context Encoders: Feature Learning by Inpainting. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  4. Iizuka, S.; Simo-Serra, E.; Ishikawa, H. Globally and locally consistent image completion. ACM Trans. Graph. (ToG) 2017, 36, 1–14. [Google Scholar] [CrossRef]
  5. Wang, N.; Zhang, Y.; Zhang, L. Dynamic selection network for image inpainting. IEEE Trans. Image Process. 2021, 30, 1784–1798. [Google Scholar] [CrossRef] [PubMed]
  6. Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Generative Image Inpainting with Contextual Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  7. Zeng, Y.; Fu, J.; Chao, H.; Guo, B. Learning Pyramid-Context Encoder Network for High-Quality Image Inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  8. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  9. Li, X.; Shen, H.; Zhang, L.; Zhang, H.; Yuan, Q. Dead pixel completion of aqua MODIS band 6 using a robust M-estimator multiregression. IEEE Geosci. Remote Sens. Lett. 2013, 11, 768–772. [Google Scholar]
  10. Wang, Q.; Wang, L.; Li, Z.; Tong, X.; Atkinson, P.M. Spatial–spectral radial basis function-based interpolation for Landsat ETM+ SLC-off image gap filling. IEEE Trans. Geosci. Remote Sens. 2020, 59, 7901–7917. [Google Scholar] [CrossRef]
  11. Zeng, C.; Shen, H.; Zhang, L. Recovering missing pixels for Landsat ETM+ SLC-off imagery using multi-temporal regression analysis and a regularization method. Remote Sens. Environ. 2013, 131, 182–194. [Google Scholar] [CrossRef]
  12. Zhang, Q.; Yuan, Q.; Zeng, C.; Li, X.; Wei, Y. Missing data reconstruction in remote sensing image with a unified spatial–temporal–spectral deep convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2018, 56, 4274–4288. [Google Scholar] [CrossRef] [Green Version]
  13. Shen, H.; Li, X.; Zhang, L.; Tao, D.; Zeng, C. Compressed sensing-based inpainting of aqua moderate resolution imaging spectroradiometer band 6 using adaptive spectrum-weighted sparse Bayesian dictionary learning. IEEE Trans. Geosci. Remote Sens. 2013, 52, 894–906. [Google Scholar] [CrossRef]
  14. Scaramuzza, P.; Barsi, J. Landsat 7 scan line corrector-off gap-filled product development. In Proceedings of the Pecora, Sioux Falls, SD, USA, 23–27 October 2005; Volume 16, pp. 23–27. [Google Scholar]
  15. Chen, J.; Zhu, X.; Vogelmann, J.E.; Gao, F.; Jin, S. A simple and effective method for filling gaps in Landsat ETM+ SLC-off images. Remote Sens. Environ. 2011, 115, 1053–1064. [Google Scholar] [CrossRef]
  16. Li, X.; Shen, H.; Li, H.; Zhang, L. Patch matching-based multitemporal group sparse representation for the missing information reconstruction of remote-sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 3629–3641. [Google Scholar] [CrossRef]
  17. Cheng, Q.; Shen, H.; Zhang, L.; Yuan, Q.; Zeng, C. Cloud removal for remotely sensed images by similar pixel replacement guided with a spatio-temporal MRF model. ISPRS J. Photogramm. Remote Sens. 2014, 92, 54–68. [Google Scholar] [CrossRef]
  18. Zhang, Y.; Zhao, C.; Wu, Y.; Luo, J. Remote sensing image cloud removal by deep image prior with a multitemporal constraint. Opt. Contin. 2022, 1, 215–226. [Google Scholar] [CrossRef]
  19. Ji, T.Y.; Yokoya, N.; Zhu, X.X.; Huang, T.Z. Nonlocal tensor completion for multitemporal remotely sensed images’ inpainting. IEEE Trans. Geosci. Remote Sens. 2018, 56, 3047–3061. [Google Scholar] [CrossRef]
  20. Ng, M.K.P.; Yuan, Q.; Yan, L.; Sun, J. An adaptive weighted tensor completion method for the recovery of remote sensing images with missing data. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3367–3381. [Google Scholar] [CrossRef]
  21. Yu, C.; Chen, L.; Su, L.; Fan, M.; Li, S. Kriging interpolation method and its application in retrieval of MODIS aerosol optical depth. In Proceedings of the 19th International Conference on Geoinformatics, Shanghai, China, 24–26 June 2011; pp. 1–6. [Google Scholar]
  22. Bertalmio, M.; Sapiro, G.; Caselles, V.; Ballester, C. Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, New Orleans, LA, USA, 23–28 July 2000; pp. 417–424. [Google Scholar]
  23. Barcelos, C.A.Z.; Batista, M.A. Image inpainting and denoising by nonlinear partial differential equations. In Proceedings of the 16th Brazilian Symposium on Computer Graphics and Image Processing, Sao Carlos, Brazil, 12–15 October 2003; pp. 287–293. [Google Scholar]
  24. Criminisi, A.; Perez, P.; Toyama, K. Region Filling and Object Removal by Exemplar-Based Image Inpainting. IEEE Trans. Image Process. 2004, 13, 1200–1212. [Google Scholar] [CrossRef]
  25. Barnes, C.; Shechtman, E.; Finkelstein, A.; Goldman, D.B. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph. 2009, 28, 24–35. [Google Scholar] [CrossRef]
  26. Singh, P.; Komodakis, N. Cloud-gan: Cloud removal for sentinel-2 imagery using a cyclic consistent generative adversarial networks. In Proceedings of the 2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 1772–1775. [Google Scholar]
  27. Shao, M.; Wang, C.; Zuo, W.; Meng, D. Efficient Pyramidal GAN for Versatile Missing Data Reconstruction in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
  28. Pan, H. Cloud removal for remote sensing imagery via spatial attention generative adversarial network. arXiv 2020, arXiv:2009.13015. [Google Scholar]
  29. Meraner, A.; Ebel, P.; Zhu, X.X.; Schmitt, M. Cloud removal in Sentinel-2 imagery using a deep residual neural network and SAR-optical data fusion. ISPRS J. Photogramm. Remote Sens. 2020, 166, 333–346. [Google Scholar] [CrossRef]
  30. Nazeri, K.; Ng, E.; Joseph, T.; Qureshi, F.; Ebrahimi, M. Edgeconnect: Structure guided image inpainting using edge prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  31. Ren, Y.; Yu, X.; Zhang, R.; Li, T.H.; Li, G. StructureFlow: Image Inpainting via Structure-aware Appearance Flow. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  32. Xiong, W.; Yu, J.; Lin, Z.; Yang, J.; Lu, X.; Barnes, C.; Luo, J. Foreground-Aware Image Inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  33. Peng, J.; Liu, D.; Xu, S.; Li, H. Generating Diverse Structure for Image Inpainting With Hierarchical VQ-VAE. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021. [Google Scholar]
  34. Liu, H.; Wan, Z.; Huang, W.; Song, Y.; Han, X.; Liao, J. PD-GAN: Probabilistic Diverse GAN for Image Inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 9371–9381. [Google Scholar]
  35. Liu, Q.; Tan, Z.; Chen, D.; Chu, Q.; Dai, X.; Chen, Y.; Liu, M.; Yuan, L.; Yu, N. Reduce Information Loss in Transformers for Pluralistic Image Inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 11347–11357. [Google Scholar]
  36. Du, Y.; He, J.; Huang, Q.; Sheng, Q.; Tian, G. A Coarse-to-Fine Deep Generative Model with Spatial Semantic Attention for High-Resolution Remote Sensing Image Inpainting. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  37. Li, J.; Wang, N.; Zhang, L.; Du, B.; Tao, D. Recurrent Feature Reasoning for Image Inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  38. Zhang, H.; Hu, Z.; Luo, C.; Zuo, W.; Wang, M. Semantic image inpainting with progressive generative networks. In Proceedings of the 26th ACM international conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 1939–1947. [Google Scholar]
  39. Wang, W.; Zhang, J.; Niu, L.; Ling, H.; Yang, X.; Zhang, L. Parallel multi-resolution fusion network for image inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 14559–14568. [Google Scholar]
  40. Guo, X.; Yang, H.; Huang, D. Image Inpainting via Conditional Texture and Structure Dual Generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 14134–14143. [Google Scholar]
  41. Liu, G.; Reda, F.A.; Shih, K.J.; Wang, T.C.; Tao, A.; Catanzaro, B. Image Inpainting for Irregular Holes Using Partial Convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 85–100. [Google Scholar]
  42. Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Free-form image inpainting with gated convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 4471–4480. [Google Scholar]
  43. Huang, W.; Deng, Y.; Hui, S.; Wang, J. Image Inpainting with Bilateral Convolution. Remote Sens. 2022, 14, 6140. [Google Scholar] [CrossRef]
  44. Suvorov, R.; Logacheva, E.; Mashikhin, A.; Remizova, A.; Ashukha, A.; Silvestrov, A.; Kong, N.; Goka, H.; Park, K.; Lempitsky, V. Resolution-robust large mask inpainting with fourier convolutions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2022; pp. 2149–2159. [Google Scholar]
  45. Yu, T.; Guo, Z.; Jin, X.; Wu, S.; Chen, Z.; Li, W.; Zhang, Z.; Liu, S. Region normalization for image inpainting. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12733–12740. [Google Scholar]
  46. Ma, X.; Zhou, X.; Huang, H.; Chai, Z.; Wei, X.; He, R. Free-form image inpainting via contrastive attention network. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 9242–9249. [Google Scholar]
  47. Qin, J.; Bai, H.; Zhao, Y. Multi-scale attention network for image inpainting. Comput. Vis. Image Underst. 2021, 204, 103155. [Google Scholar] [CrossRef]
  48. Liu, H.; Jiang, B.; Xiao, Y.; Yang, C. Coherent semantic attention for image inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27–28 October 2019; pp. 4170–4179. [Google Scholar]
  49. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  50. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  51. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  52. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual Event, Austria, 3–7 May 2021. [Google Scholar]
  53. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  54. Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating long sequences with sparse transformers. arXiv 2019, arXiv:1904.10509. [Google Scholar]
  55. Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv 2020, arXiv:2006.04768. [Google Scholar]
  56. Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv 2020, arXiv:2001.04451. [Google Scholar]
  57. Roy, A.; Saffar, M.; Vaswani, A.; Grangier, D. Efficient content-based sparse attention with routing transformers. Trans. Assoc. Comput. Linguist. 2021, 9, 53–68. [Google Scholar] [CrossRef]
  58. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  59. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 630–645. [Google Scholar]
  60. Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Instance normalization: The missing ingredient for fast stylization. arXiv 2016, arXiv:1607.08022. [Google Scholar]
  61. Agarap, A.F. Deep learning using rectified linear units (relu). arXiv 2018, arXiv:1803.08375. [Google Scholar]
  62. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  63. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  64. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  65. Hendrycks, D.; Gimpel, K. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR 2016, arXiv:abs/1606.08415. [Google Scholar]
  66. Kirkland, E.J. Bilinear interpolation. In Advanced Computing in Electron Microscopy; Springer: Berlin/Heidelberg, Germany, 2010; pp. 261–263. [Google Scholar]
  67. Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  68. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  69. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  70. Drineas, P.; Mahoney, M.W.; Cristianini, N. On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. J. Mach. Learn. Res. 2005, 6, 2153–2175. [Google Scholar]
  71. Huang, W.; Deng, Y.; Hui, S.; Wang, J. Multi-receptions and multi-gradients discriminator for Image Inpainting. IEEE Access 2022, 10, 131579–131591. [Google Scholar] [CrossRef]
  72. Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef] [Green Version]
  73. Zhou, W.; Newsam, S.; Li, C.; Shao, Z. PatternNet: A benchmark dataset for performance evaluation of remote sensing image retrieval. ISPRS J. Photogramm. Remote Sens. 2018, 145, 197–209. [Google Scholar] [CrossRef] [Green Version]
  74. Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883. [Google Scholar] [CrossRef] [Green Version]
  75. Doersch, C.; Singh, S.; Gupta, A.; Sivic, J.; Efros, A. What makes paris look like paris? Commun. ACM 2015, 58, 103–110. [Google Scholar] [CrossRef] [Green Version]
  76. Lee, C.H.; Liu, Z.; Wu, L.; Luo, P. MaskGAN: Towards Diverse and Interactive Facial Image Manipulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  77. Lin, D.; Xu, G.; Wang, X.; Wang, Y.; Sun, X.; Fu, K. A remote sensing image dataset for cloud removal. arXiv 2019, arXiv:1901.00600. [Google Scholar]
  78. Ebel, P.; Meraner, A.; Schmitt, M.; Zhu, X.X. Multisensor data fusion for cloud removal in global and all-season sentinel-2 imagery. IEEE Trans. Geosci. Remote Sens. 2020, 59, 5866–5878. [Google Scholar] [CrossRef]
  79. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 3730–3738. [Google Scholar]
  80. Korhonen, J.; You, J. Peak signal-to-noise ratio revisited: Is simple beautiful? In Proceedings of the 2012 Fourth International Workshop on Quality of Multimedia Experience, Melbourne, Australia, 5–7 July 2012; pp. 37–38. [Google Scholar]
  81. Hassan, M.; Bhagvati, C. Structural similarity measure for color images. Int. J. Comput. Appl. 2012, 43, 7–12. [Google Scholar] [CrossRef]
  82. Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. In Proceedings of the Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 9–12 November 2003; Volume 2, pp. 1398–1402. [Google Scholar]
  83. Zhang, L.; Zhang, L.; Mou, X.; Zhang, D. FSIM: A feature similarity index for image quality assessment. IEEE Trans. Image Process. 2011, 20, 2378–2386. [Google Scholar] [CrossRef] [Green Version]
  84. Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE)?—Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 2014, 7, 1247–1250. [Google Scholar] [CrossRef] [Green Version]
  85. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
  86. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8024–8035. [Google Scholar]
  87. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef] [Green Version]
  88. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
Figure 1. Illustration of self-attention and Ada-attention, including high-attention-score points, key/value positions, and inpainting results.
Figure 2. (a) The architecture of our Adaptive-attention Completing Network (AACNet). (b) The architecture of the gated residual block. Our network consists of four pyramid stages for the encoder and decoder. Stages 1–3 include several gated residual blocks, and the fourth stage adds several Ada-attention modules to improve the long-range modeling ability. The AACNet uses upsampling or downsampling modules to change the feature map size and channel number between two adjacent stages.
Figure 3. The architecture of the single-head Ada-attention and the offset position subnet. The Ada-attention contains five steps: embedding the query, sampling relevant features, embedding the key and value, matching the query and key, and aggregating the value. $X \in \mathbb{R}^{H \times W \times C}$ is the input feature, where H, W, and C are the height, width, and channel, respectively. $X'$, $Q$, $K'$, and $V'$ denote the sampled input feature, the query, the adjusted key, and the adjusted value, respectively. $d$ is the offset position. The dimension of the attention map is $HW \times HW$. $\otimes$ denotes matrix multiplication.
Figure 4. Overview of the statistical sunburst chart of the average performance of our AACNet on all datasets. The data used for this analysis were obtained from the AVG values in Table 4 and Table 5. Higher values of PSNR, SSIM, MS-SSIM, and FSIM indicate a better performance; lower values of MAE and LPIPS indicate a better performance.
Figure 5. Visualization results with DSN [5], LaMa [44], InRSI [2], BC [43], and ours on three remote sensing datasets. Local details of the results are displayed in red boxes, and it is recommended to zoom in for details.
Figure 6. Visualization results with DSN [5], LaMa [44], InRSI [2], BC [43], and ours on two natural datasets. Local details of the results are displayed in red boxes, and it is recommended to zoom in for details.
Figure 7. Reconstruction visualization results for removing real data clouds on the RICE-II dataset with SpA-GAN [28], BC [43], and ours. Please zoom in for details.
Figure 8. Reconstruction visualization results for removing multispectral cloud images on the SEN12MS-CR dataset with DSen2-CR [29] and ours. The Sentinel-2 satellite data contain 13 spectral bands, and the visualization images are rendered as an RGB composite using RGB bands. Sentinel-1 SAR satellite data consist of two bands that are mapped onto the green and blue channels to provide a visual representation. Please zoom in for details.
Figure 9. Visualization results of the effectiveness of our Ada-attention on the AID dataset. Please zoom in for details.
Figure 10. Illustration of self-attention and Ada-attention on the AID dataset, including high-attention-score points, key/value positions, and inpainting results. The gray regions represent the corrupted regions. In the high-attention-score point images, the query is indicated by a red star, and the high-attention-score points of that query are represented by purple circles; a circle with a larger area represents a higher score. In the key/value position images, the original uniform coordinates used in the self-attention mechanism are represented by cyan dots, while the sampled coordinates adjusted by the offset subnetwork in Ada-attention are represented by red dots. Local details are displayed in red boxes in the inpainting results.
Figure 11. Performance curves of different head numbers on the AID dataset. Higher PSNR and SSIM values denote a better performance. Lower MAE values denote a better performance.
Figure 12. Performance curves of different group numbers on the AID dataset. Higher PSNR and SSIM values denote a better performance. Lower MAE values denote a better performance.
Figure 13. Inpainting results for different resolutions on the AID dataset. Please zoom in for details.
Table 1. Comparison of computation and parameters between self-attention and Ada-attention.
Module | Computation | Parameters
Self-attention [49] | $2(HW)^2C + 4HWC^2$ | $4C^2$
Ada-attention | $2(HW)^2C + 4HWC^2 + 2HWC(k^2+1)$ | $4C^2 + 2C(k^2+1)$
H, W, and C are the input feature map height, width and channel, respectively. k is the kernel size of SGConv.
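As a quick sanity check on the formulas in Table 1, the helper below evaluates both cost expressions; the function names and the example setting (a 32 × 32 feature map with 256 channels and k = 3) are our own illustrative assumptions rather than values taken from the paper.

```python
# Hypothetical helper for evaluating the Table 1 cost formulas.
# The plugged-in numbers (H = W = 32, C = 256, k = 3) are illustrative assumptions.

def self_attention_cost(H: int, W: int, C: int):
    """Multiply-accumulate count and parameter count of plain self-attention."""
    macs = 2 * (H * W) ** 2 * C + 4 * H * W * C ** 2
    params = 4 * C ** 2
    return macs, params

def ada_attention_cost(H: int, W: int, C: int, k: int = 3):
    """Ada-attention adds the offset subnet term 2*HW*C*(k^2 + 1) on top of self-attention."""
    base_macs, base_params = self_attention_cost(H, W, C)
    extra_macs = 2 * H * W * C * (k ** 2 + 1)
    extra_params = 2 * C * (k ** 2 + 1)
    return base_macs + extra_macs, base_params + extra_params

if __name__ == "__main__":
    sa = self_attention_cost(32, 32, 256)
    ada = ada_attention_cost(32, 32, 256, k=3)
    # The offset subnet overhead is tiny compared with the quadratic (HW)^2 term.
    print(f"self-attention : {sa[0] / 1e9:.3f} GMACs, {sa[1] / 1e6:.3f} M params")
    print(f"Ada-attention  : {ada[0] / 1e9:.3f} GMACs, {ada[1] / 1e6:.3f} M params")
```

With such settings, the offset subnet term $2HWC(k^2+1)$ is roughly two orders of magnitude smaller than the quadratic $2(HW)^2C$ term, so Ada-attention costs essentially the same as plain self-attention.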
Table 2. The main experimental datasets.
Name | Description | Training Images | Test Images | Release Organization
AID [72] | This contains 30 scene categories of RS aerial RGB images. | 9500 | 500 | Wuhan University
PatternNet [73] | This contains 45 scene categories of RS digital images. | 30,000 | 400 | Wuhan University
NWPU-RESISC45 [74] | This contains 38 scene categories of RS digital images. | 31,000 | 500 | Northwestern Polytechnical University
Paris StreetView [75] | This contains natural digital images of buildings in Paris. | 14,900 | 100 | Carnegie Mellon University
CelebA-HQ [76] | This contains celebrity face digital images for face restoration or synthesis. | 28,000 | 2000 | NVIDIA
RICE-II [77] | This contains cloud removal digital images with cloud, mask, and ground truth. | 630 | 106 | Chinese Academy of Sciences
SEN12MS-CR [78] | This contains cloudy and cloud-free Sentinel-2 multispectral optical images and Sentinel-1 SAR images for the cloud removal task. | 101,615 | 7899 | Technical University of Munich
Table 3. Detailed configurations of our AACNet. Input_size consists of height, width, and channel. C represents the number of basic channels. “∗ N” in operators and kernel_size means reusing the same module or configuration N times. “×” denotes multiplication. GRes Block represents the gated residual block.
Stage Name | Operator | Input_Size | Output_Size | Kernel_Size | Stride
Encoder 1 | GRes Block ∗ 2 | 256 × 256 × 4 | 256 × 256 × C | 5, 3 | 1
Downsample 1 | Conv | 256 × 256 × C | 128 × 128 × 2C | 3 | 2
Encoder 2 | GRes Block ∗ 3 | 128 × 128 × 2C | 128 × 128 × 2C | 3 | 1
Downsample 2 | Conv | 128 × 128 × 2C | 64 × 64 × 4C | 3 | 2
Encoder 3 | GRes Block ∗ 4 | 64 × 64 × 4C | 64 × 64 × 4C | 3 | 1
Downsample 3 | Conv | 64 × 64 × 4C | 32 × 32 × 8C | 3 | 2
Encoder 4 | [GRes Block, ASC_Attent] ∗ 4, GRes Block ∗ 1 | 32 × 32 × 8C | 32 × 32 × 8C | [3, 5] ∗ 4, 3 | 1
Upsample 3 | interpolation 2 times, Conv | 32 × 32 × 8C | 64 × 64 × 4C | 3 | 1
Fuse 3 | Concat(Upsample 3, Encoder 3), Conv | 64 × 64 × 8C | 64 × 64 × 4C | 1 | 1
Decoder 3 | GRes Block ∗ 4 | 64 × 64 × 4C | 64 × 64 × 4C | 3 | 1
Upsample 2 | interpolation 2 times, Conv | 64 × 64 × 4C | 128 × 128 × 2C | 3 | 1
Fuse 2 | Concat(Upsample 2, Encoder 2), Conv | 128 × 128 × 4C | 128 × 128 × 2C | 1 | 1
Decoder 2 | GRes Block ∗ 3 | 128 × 128 × 2C | 128 × 128 × 2C | 3 | 1
Upsample 1 | interpolation 2 times, Conv | 128 × 128 × 2C | 256 × 256 × C | 3 | 1
Fuse 1 | Concat(Upsample 1, Encoder 1), Conv | 256 × 256 × 2C | 256 × 256 × C | 1 | 1
Decoder 1 | GRes Block ∗ 2 | 256 × 256 × C | 256 × 256 × C | 3 | 1
Conv | Conv | 256 × 256 × C | 256 × 256 × 3 | 7 | 1
Output | Tanh | 256 × 256 × 3 | 256 × 256 × 3 | - | -
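For orientation, the skeleton below mirrors the stage layout of Table 3 in PyTorch. It is a simplified sketch: the gated residual blocks and the Ada-attention modules of stage 4 are replaced by plain convolutional stand-ins, and the basic channel number C = 32 and the 4-channel masked-image-plus-mask input are assumptions, with no losses or discriminator included.

```python
# Schematic u-shaped skeleton following Table 3 (simplified stand-ins, not the real AACNet).
import torch
import torch.nn as nn
import torch.nn.functional as F

def block(c):  # stand-in for a gated residual block (GRes Block)
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.InstanceNorm2d(c), nn.ReLU(inplace=True))

class AACNetSkeleton(nn.Module):
    def __init__(self, C=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(4, C, 5, padding=2), *[block(C) for _ in range(2)])
        self.down1 = nn.Conv2d(C, 2 * C, 3, stride=2, padding=1)
        self.enc2 = nn.Sequential(*[block(2 * C) for _ in range(3)])
        self.down2 = nn.Conv2d(2 * C, 4 * C, 3, stride=2, padding=1)
        self.enc3 = nn.Sequential(*[block(4 * C) for _ in range(4)])
        self.down3 = nn.Conv2d(4 * C, 8 * C, 3, stride=2, padding=1)
        # Stage 4 alternates blocks and Ada-attention; the attention stand-in is omitted here.
        self.enc4 = nn.Sequential(*[block(8 * C) for _ in range(5)])
        self.up3, self.fuse3 = nn.Conv2d(8 * C, 4 * C, 3, padding=1), nn.Conv2d(8 * C, 4 * C, 1)
        self.dec3 = nn.Sequential(*[block(4 * C) for _ in range(4)])
        self.up2, self.fuse2 = nn.Conv2d(4 * C, 2 * C, 3, padding=1), nn.Conv2d(4 * C, 2 * C, 1)
        self.dec2 = nn.Sequential(*[block(2 * C) for _ in range(3)])
        self.up1, self.fuse1 = nn.Conv2d(2 * C, C, 3, padding=1), nn.Conv2d(2 * C, C, 1)
        self.dec1 = nn.Sequential(*[block(C) for _ in range(2)])
        self.head = nn.Sequential(nn.Conv2d(C, 3, 7, padding=3), nn.Tanh())

    def forward(self, image, mask):
        # 4-channel input assumed: masked RGB plus binary mask (mask = 1 inside the holes).
        x = torch.cat([image * (1 - mask), mask], dim=1)
        e1 = self.enc1(x)
        e2 = self.enc2(self.down1(e1))
        e3 = self.enc3(self.down2(e2))
        e4 = self.enc4(self.down3(e3))
        up = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear", align_corners=False)
        d3 = self.dec3(self.fuse3(torch.cat([self.up3(up(e4)), e3], dim=1)))
        d2 = self.dec2(self.fuse2(torch.cat([self.up2(up(d3)), e2], dim=1)))
        d1 = self.dec1(self.fuse1(torch.cat([self.up1(up(d2)), e1], dim=1)))
        return self.head(d1)

# Usage on a dummy 256-resolution input:
# net = AACNetSkeleton(C=32)
# out = net(torch.rand(1, 3, 256, 256), torch.randint(0, 2, (1, 1, 256, 256)).float())
```

The feature shapes produced at each stage of this sketch match the Input_Size/Output_Size columns above, which makes the table easier to cross-check.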
Table 4. Quantitative results with DSN [5], LaMa [44], InRSI [2], BC [43], and ours on AID, PatternNet, and NWPU-RESISC45 datasets.
Dataset | AID | PatternNet | NWPU-RESISC45
Metric (Mask Ratio) | DSN / LaMa / InRSI / BC / Ours | DSN / LaMa / InRSI / BC / Ours | DSN / LaMa / InRSI / BC / Ours
PSNR↑ (10–30%) | 27.50 / 28.01 / 24.80 / 27.66 / 28.24 | 28.33 / 29.09 / 25.53 / 28.92 / 29.28 | 28.66 / 28.85 / 25.01 / 28.98 / 29.32
PSNR↑ (30–50%) | 23.42 / 23.77 / 16.68 / 23.58 / 24.28 | 23.85 / 24.61 / 21.02 / 24.43 / 24.97 | 24.25 / 24.36 / 19.21 / 24.61 / 25.10
PSNR↑ (AVG) | 25.46 / 25.89 / 20.74 / 25.62 / 26.26 | 26.09 / 26.85 / 23.28 / 26.68 / 27.13 | 26.46 / 26.61 / 22.11 / 26.79 / 27.21
SSIM↑ (10–30%) | 0.881 / 0.892 / 0.817 / 0.886 / 0.896 | 0.904 / 0.918 / 0.844 / 0.913 / 0.919 | 0.899 / 0.906 / 0.816 / 0.906 / 0.911
SSIM↑ (30–50%) | 0.717 / 0.734 / 0.476 / 0.727 / 0.753 | 0.761 / 0.792 / 0.630 / 0.782 / 0.798 | 0.745 / 0.758 / 0.556 / 0.762 / 0.775
SSIM↑ (AVG) | 0.799 / 0.813 / 0.646 / 0.806 / 0.825 | 0.833 / 0.855 / 0.737 / 0.848 / 0.858 | 0.822 / 0.832 / 0.686 / 0.834 / 0.843
MS-SSIM↑ (10–30%) | 0.912 / 0.920 / 0.833 / 0.915 / 0.924 | 0.927 / 0.936 / 0.862 / 0.933 / 0.937 | 0.922 / 0.927 / 0.835 / 0.927 / 0.929
MS-SSIM↑ (30–50%) | 0.777 / 0.790 / 0.498 / 0.783 / 0.804 | 0.809 / 0.831 / 0.662 / 0.825 / 0.834 | 0.793 / 0.803 / 0.595 / 0.806 / 0.813
MS-SSIM↑ (AVG) | 0.845 / 0.855 / 0.665 / 0.849 / 0.864 | 0.868 / 0.884 / 0.762 / 0.879 / 0.886 | 0.857 / 0.865 / 0.715 / 0.866 / 0.871
FSIM↑ (10–30%) | 0.871 / 0.880 / 0.818 / 0.869 / 0.879 | 0.880 / 0.891 / 0.823 / 0.882 / 0.888 | 0.872 / 0.876 / 0.808 / 0.873 / 0.879
FSIM↑ (30–50%) | 0.757 / 0.769 / 0.637 / 0.754 / 0.775 | 0.762 / 0.783 / 0.684 / 0.768 / 0.785 | 0.752 / 0.757 / 0.657 / 0.755 / 0.771
FSIM↑ (AVG) | 0.814 / 0.824 / 0.728 / 0.811 / 0.827 | 0.821 / 0.837 / 0.753 / 0.825 / 0.837 | 0.812 / 0.816 / 0.732 / 0.814 / 0.825
MAE↓ (10–30%) | 2.08% / 1.93% / 3.09% / 2.10% / 1.97% | 1.96% / 1.79% / 3.02% / 1.94% / 1.85% | 1.85% / 1.78% / 3.33% / 1.84% / 1.77%
MAE↓ (30–50%) | 4.54% / 4.35% / 10.96% / 4.54% / 4.19% | 4.42% / 4.11% / 7.22% / 4.30% / 4.02% | 4.17% / 4.11% / 9.03% / 4.10% / 3.88%
MAE↓ (AVG) | 3.31% / 3.14% / 7.02% / 3.32% / 3.08% | 3.19% / 2.95% / 5.12% / 3.12% / 2.94% | 3.01% / 2.94% / 6.18% / 2.97% / 2.83%
LPIPS↓ (10–30%) | 0.085 / 0.087 / 0.230 / 0.075 / 0.072 | 0.070 / 0.064 / 0.195 / 0.057 / 0.059 | 0.078 / 0.070 / 0.217 / 0.063 / 0.066
LPIPS↓ (30–50%) | 0.178 / 0.186 / 0.454 / 0.159 / 0.160 | 0.149 / 0.142 / 0.374 / 0.124 / 0.134 | 0.165 / 0.155 / 0.404 / 0.139 / 0.152
LPIPS↓ (AVG) | 0.131 / 0.137 / 0.342 / 0.117 / 0.116 | 0.109 / 0.103 / 0.284 / 0.090 / 0.096 | 0.122 / 0.112 / 0.310 / 0.101 / 0.109
Higher values of PSNR, SSIM, MS-SSIM, and FSIM denote a better performance (↑); lower values of MAE and LPIPS denote a better performance (↓). MAE values are reported as percentages (i.e., the raw value multiplied by 100). Bold numbers indicate the model that achieved the best result for the respective metric.
Table 5. Quantitative results with DSN [5], LaMa [44], InRSI [2], BC [43], and ours on Paris StreetView and CelebA-HQ datasets.
Dataset | Paris StreetView | CelebA-HQ
Metric (Mask Ratio) | DSN / LaMa / InRSI / BC / Ours | DSN / LaMa / InRSI / BC / Ours
PSNR↑ (10–30%) | 29.56 / 28.95 / 25.02 / 30.66 / 31.45 | 31.12 / 30.91 / 23.26 / 31.53 / 31.65
PSNR↑ (30–50%) | 24.99 / 24.55 / 19.09 / 25.87 / 26.75 | 26.24 / 25.93 / 17.34 / 26.52 / 26.75
PSNR↑ (AVG) | 27.27 / 26.75 / 22.05 / 28.26 / 29.10 | 28.68 / 28.42 / 20.30 / 29.02 / 29.20
SSIM↑ (10–30%) | 0.933 / 0.929 / 0.840 / 0.943 / 0.951 | 0.972 / 0.970 / 0.875 / 0.974 / 0.975
SSIM↑ (30–50%) | 0.825 / 0.817 / 0.590 / 0.844 / 0.865 | 0.924 / 0.919 / 0.679 / 0.930 / 0.933
SSIM↑ (AVG) | 0.879 / 0.873 / 0.715 / 0.894 / 0.908 | 0.948 / 0.945 / 0.777 / 0.952 / 0.954
MS-SSIM↑ (10–30%) | 0.946 / 0.943 / 0.855 / 0.952 / 0.958 | 0.972 / 0.970 / 0.880 / 0.975 / 0.975
MS-SSIM↑ (30–50%) | 0.852 / 0.847 / 0.633 / 0.865 / 0.881 | 0.925 / 0.919 / 0.699 / 0.930 / 0.932
MS-SSIM↑ (AVG) | 0.899 / 0.895 / 0.744 / 0.908 / 0.920 | 0.948 / 0.945 / 0.790 / 0.953 / 0.954
FSIM↑ (10–30%) | 0.880 / 0.880 / 0.814 / 0.885 / 0.894 | 0.895 / 0.896 / 0.809 / 0.898 / 0.900
FSIM↑ (30–50%) | 0.764 / 0.766 / 0.670 / 0.772 / 0.791 | 0.790 / 0.789 / 0.654 / 0.796 / 0.802
FSIM↑ (AVG) | 0.822 / 0.823 / 0.742 / 0.828 / 0.842 | 0.842 / 0.842 / 0.732 / 0.847 / 0.851
MAE↓ (10–30%) | 1.47% / 1.58% / 3.14% / 1.33% / 1.21% | 1.03% / 1.08% / 3.49% / 1.00% / 0.98%
MAE↓ (30–50%) | 3.36% / 3.58% / 8.33% / 3.10% / 2.78% | 2.40% / 2.57% / 8.93% / 2.36% / 2.29%
MAE↓ (AVG) | 2.42% / 2.58% / 5.74% / 2.21% / 1.99% | 1.71% / 1.82% / 6.21% / 1.68% / 1.64%
LPIPS↓ (10–30%) | 0.058 / 0.060 / 0.181 / 0.044 / 0.042 | 0.031 / 0.030 / 0.157 / 0.023 / 0.024
LPIPS↓ (30–50%) | 0.127 / 0.139 / 0.361 / 0.104 / 0.101 | 0.067 / 0.070 / 0.319 / 0.056 / 0.058
LPIPS↓ (AVG) | 0.093 / 0.100 / 0.271 / 0.074 / 0.072 | 0.049 / 0.050 / 0.238 / 0.039 / 0.041
Higher values of PSNR, SSIM, MS-SSIM, and FSIM denote a better performance (↑); lower values of MAE and LPIPS denote a better performance (↓). MAE values are reported as percentages (i.e., the raw value multiplied by 100). Bold numbers indicate the model that achieved the best result for the respective metric.
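The numbers in Tables 4 and 5 are standard full-reference metrics. As a hedged illustration of how such values are typically computed (not the authors' exact evaluation protocol; the SSIM window defaults, the data_range = 1.0 setting, and a recent scikit-image version are assumptions), a minimal sketch is given below. MS-SSIM, FSIM, and LPIPS require additional packages and are omitted.

```python
# Minimal sketch of the reported full-reference metrics; library defaults may differ
# from the paper's exact protocol.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(pred: np.ndarray, gt: np.ndarray):
    """pred/gt: float arrays in [0, 1] with shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    mae = float(np.abs(gt - pred).mean()) * 100.0  # reported as a percentage in the tables
    return psnr, ssim, mae

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.random((256, 256, 3))
    pred = np.clip(gt + 0.05 * rng.standard_normal(gt.shape), 0.0, 1.0)
    print(evaluate(pred, gt))
```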
Table 6. Quantitative comparisons of cloud removal on the RICE-II and SEN12MS-CR datasets.
RICE-II: Models | PSNR↑ | SSIM↑ | MAE↓
SpA-GAN [28] | 29.74 | 0.731 | 4.42%
BC [43] | 30.97 | 0.789 | 4.26%
Ours | 32.42 | 0.795 | 3.50%
SEN12MS-CR: Models | PSNR↑ | SSIM↑ | MAE↓
DSen2-CR [29] | 28.43 | 0.882 | 2.86%
Ours | 28.96 | 0.881 | 2.72%
Higher PSNR and SSIM values denote a better performance (↑); lower MAE values denote a better performance (↓). MAE values are reported as percentages (i.e., the raw value multiplied by 100). Bold numbers indicate the model that achieved the best result for the respective metric.
Table 7. Effectiveness of our Ada-attention on the AID dataset.
Metrics | PSNR↑ | SSIM↑ | MAE↓
Mask Ratio | Base / w/ MHSA / Ours | Base / w/ MHSA / Ours | Base / w/ MHSA / Ours
10–30% | 27.74 / 28.06 / 28.24 | 0.888 / 0.893 / 0.896 | 2.09% / 2.01% / 1.97%
30–50% | 23.85 / 24.12 / 24.28 | 0.734 / 0.746 / 0.753 | 4.42% / 4.27% / 4.19%
AVG | 25.79 / 26.09 / 26.26 | 0.811 / 0.819 / 0.825 | 3.26% / 3.14% / 3.08%
Higher PSNR and SSIM values denote a better performance (↑); lower MAE values denote a better performance (↓). MAE values are reported as percentages (i.e., the raw value multiplied by 100). Bold numbers indicate the model that achieved the best result for the respective metric.
Table 8. Position analysis of our Ada-attention on the AID dataset.
Metrics | PSNR↑ | SSIM↑ | MAE↓
Mask Ratio | Edge / Inter / Ours | Edge / Inter / Ours | Edge / Inter / Ours
10–30% | 28.26 / 28.19 / 28.24 | 0.897 / 0.896 / 0.896 | 1.96% / 1.98% / 1.97%
30–50% | 24.29 / 24.25 / 24.28 | 0.753 / 0.751 / 0.753 | 4.18% / 4.21% / 4.19%
AVG | 26.27 / 26.22 / 26.26 | 0.825 / 0.823 / 0.825 | 3.07% / 3.09% / 3.08%
Higher PSNR and SSIM values denote a better performance (↑); lower MAE values denote a better performance (↓). MAE values are reported as percentages (i.e., the raw value multiplied by 100). Bold numbers indicate the model that achieved the best result for the respective metric.
Table 9. Evaluation metrics on diverse resolution input images of the AID dataset.
Image Resolution | MAC (G) | PSNR↑ (10–30% / 30–50% / AVG) | SSIM↑ (10–30% / 30–50% / AVG) | MAE↓ (10–30% / 30–50% / AVG)
384 × 384 | 266.24 | 30.66 / 25.92 / 28.29 | 0.923 / 0.794 / 0.859 | 1.41% / 3.34% / 2.37%
256 × 256 | 113.66 | 28.24 / 24.28 / 26.26 | 0.896 / 0.753 / 0.825 | 1.97% / 4.19% / 3.08%
128 × 128 | 27.185 | 29.33 / 25.37 / 27.35 | 0.903 / 0.771 / 0.837 | 1.86% / 3.87% / 2.87%
64 × 64 | 6.756 | 24.03 / 21.02 / 22.53 | 0.763 / 0.625 / 0.694 | 4.67% / 8.02% / 6.34%
32 × 32 | 1.687 | 15.55 / 14.04 / 14.8 | 0.404 / 0.254 / 0.329 | 11.97% / 16.82% / 14.40%
MAC refers to the multiply–accumulate operation count. Higher PSNR and SSIM values denote a better performance (↑); lower MAE values denote a better performance (↓). MAE values are reported as percentages (i.e., the raw value multiplied by 100).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
