Article

A Flexible Spatiotemporal Thick Cloud Removal Method with Low Requirements for Reference Images

1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
2 Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Chinese Academy of Sciences, Beijing 100190, China
3 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(17), 4306; https://doi.org/10.3390/rs15174306
Submission received: 18 July 2023 / Revised: 22 August 2023 / Accepted: 28 August 2023 / Published: 31 August 2023

Abstract:
Thick clouds and shadows have a significant impact on the availability of optical remote sensing data. Although various methods have been proposed to address this issue, they still have some limitations. First, most approaches rely on a single clear reference image as complementary information, which becomes challenging when the target image has large missing areas. Second, the existing methods that can utilize multiple reference images require the complementary data to have high temporal correlation, which is not suitable for situations where the difference between the reference images and the target image is large. To overcome these limitations, a flexible spatiotemporal deep learning framework based on generative adversarial networks is proposed for thick cloud removal, which allows three arbitrary temporal images to be used as references. The framework incorporates a three-step encoder that leverages the uncontaminated information in the target image to assimilate the reference images, enhancing the model's ability to handle reference images with diverse temporal differences. A series of simulated and real experiments on Landsat 8 and Sentinel-2 data demonstrates the effectiveness of the proposed method, which is especially applicable to small- and large-scale regions with reference images that differ significantly from the target image.

1. Introduction

With the continuous development and improvement of remote sensing technology, remote sensing (RS) images with high temporal, spatial, and spectral resolution have become increasingly abundant and are used across various applications, such as target detection [1,2], image classification [3], and vegetation monitoring [4]. However, the quality of any optical RS image is inevitably degraded by atmospheric interference and other factors. Among these factors, thick cloud contamination is particularly common and can completely obscure valuable feature information. Therefore, mitigating the impact of cloud contamination is critical for the interpretation and application of RS images.

1.1. Related Works

Over the past few decades, significant endeavors have been dedicated to the removal of thick cloud from remote sensing images. Based on the source of supplementary data used, the existing methods can be broadly categorized into four types: spatially based, multisource-based, temporally based, and hybrid methods [5].
Spatially based methods usually employ resampling algorithms to reconstruct the contaminated areas using information from cloud-free areas within the same image. Various algorithms have been proposed, including geostatistically based [6], bandelet-transform-based [7], variation-based [8], and generative adversarial network (GAN)-based [9] algorithms. However, these methods often struggle when dealing with large or highly heterogeneous cloud-contaminated areas.
Multisource-based methods, on the other hand, aim to reconstruct cloud-contaminated areas by incorporating auxiliary information from other remote sensing data sources, such as synthetic aperture radar (SAR) images [10,11,12,13], which are less susceptible to clouds. However, the inconsistent spectral and spatial resolution of these different sources complicates the preprocessing steps required for these methods.
Temporally based methods rely on the assumption that satellites have the capability to periodically observe the same location and that ground features change slowly or remain constant over a short time. These approaches can be classified into temporal replacement [14,15], temporal filtering [16,17], and temporal learning [18,19] methods.
Temporal replacement methods, as their name suggests, reconstruct missing areas utilizing information from the same region in a cloud-free image. Depending on whether the temporal difference can be ignored, such methods can be further divided into two subtypes: direct replacement [20,21] and indirect replacement [22,23] methods. Temporal filtering methods are commonly employed when dealing with long time series of images, as opposed to temporal replacement methods, which are more suitable for short time series [24,25].
The two temporally based methods mentioned above aim to establish an explicit functional relationship between cloud-contaminated images and cloud-free images in the time domain. In contrast, temporal learning methods, such as compressed sensing [26], neural networks [27], and machine learning [28], learn this relationship directly from the data. One of the most typical temporal learning methods is dictionary learning, which seeks a representation of the data by learning a set of basis vectors (atoms) from the given data [29]. For example, Xu et al. [30] used dictionary learning for cloud removal by merging the coefficients obtained from the reference image with the dictionary learned from the target image. However, this approach tends to produce poor results when there is a significant difference between the target image and the reference image. To address this issue, Li et al. [31] introduced an image fusion model to minimize the temporal difference and performed error correction for the over-restored areas. Nevertheless, most of these methods cannot retain the pixel values in the uncontaminated areas, which leads to obvious reconstruction edges.
However, the performance of these temporally based methods is limited by the reference images: if the reference images differ substantially from the target image, the results may be unsatisfactory. Hence, it is crucial for the reference images to be strongly correlated with the target image in spatial, spectral, and temporal terms.
Hybrid methods combine spatial, temporal, and spectral information to remove thick cloud. For example, Zhu et al. [32] proposed an interpolation approach based on similar neighborhood pixels for cloud removal in Landsat images. This method utilizes similar pixels around the contaminated areas to estimate the changes between the cloud-free image and the cloudy image. However, when dealing with large contaminated areas, the accuracy may be reduced due to the limited availability of similar pixels. Since the reference image may also be cloud-contaminated in the same area as the target image, Zeng et al. [33] proposed an integrated approach for recovering Landsat-7 ETM+ imagery. This method applies a non-reference regularization algorithm to reconstruct the missing pixels when the supplementary reference data cannot completely recover the missing information. As the low-rank tensor completion method shows promising capabilities in image reconstruction, it has been increasingly adopted in cloud removal research. For instance, Ng et al. [34] proposed a low-rank tensor completion model that can adaptively select the weight for each dimension to recover missing data. However, in the case of multitemporal remote sensing images, the low-rank properties in the spatial and temporal dimensions differ, posing a challenge in weight selection. To solve this problem, Chen et al. [35] proposed a low-rank tensor completion method based on the frequency domain that regards cloud pixels as high-frequency components in the temporal dimension and performs low-rank completion for each frequency component after a Fourier transform. The final series of cloud-free images is obtained by inverse transformation.
Recently, deep learning has emerged as a powerful tool for reconstructing missing information in remote sensing images due to its ability to capture intricate relationships and patterns. For instance, Zhang et al. [36] introduced an integrated framework based on a convolutional neural network (CNN) that combines spatial, temporal, and spectral information to recover corrupted remote sensing data. Given the considerable amount of temporal information contained in time-series images, Ma et al. [37] proposed a temporal information injection network (TINN) based on ResNet [38]. Moreover, the advancement of generative adversarial networks (GANs) [39] has introduced new perspectives to cloud removal in remote sensing. For instance, image-to-image translation models such as cyclic GAN [40] and conditional GAN [41] have been applied to remove clouds. The generator component is responsible for translating the cloud-contaminated image into a cloud-free representation, while the discriminator component learns to distinguish the generated image from the ground truth. However, these methods can only utilize information from the cloud-contaminated image itself, which restricts their applicability. To address this issue and combine information from more reference images, Sarukkai et al. [56] cast cloud removal as an image synthesis problem and proposed a trainable spatiotemporal generator network (STGAN). Compared with traditional methods, deep-learning-based methods are less sensitive to variations in data quality, although they require a larger number of model parameters.

1.2. Motivation

Within these approaches, the importance of utilizing temporal auxiliary information for better cloud removal has been acknowledged [42]. However, these approaches have yet to fully address the challenge of effectively utilizing auxiliary information and still have the following problems:
(1)
Most cloud removal methods rely on a single reference image and have strict requirements for its quality, expecting it to be completely clear and close to the target image. However, when the target image and the reference image contain overlapping missing areas, the recovered image still contains missing areas, as shown in Figure 1a.
(2)
Cloud removal methods that can use the information from multiple reference images at the same time, such as low-rank tensor completion-based methods, require a long time-series dataset with high temporal correlation. When the auxiliary information is only available in a reference image that differs significantly from the target image, the recovered image contains abrupt areas, as shown in Figure 1a.
In summary, these methods require a cloud-free reference image or a high-quality time-series dataset to achieve good results. However, obtaining such data is often challenging in practical applications [43].
From the above perspectives, we want to develop a method that can utilize the information from multiple arbitrary temporal reference images, even if they are quite different from the target image. In this paper, a flexible spatiotemporal cloud removal method based on a generative adversarial network (GAN) is proposed, which is named FSTGAN. The main contributions can be summarized as follows.
(1)
In contrast to single-reference-image-based methods, FSTGAN can simultaneously use the information from multiple cloudy reference images to recover the target image. As shown in Step 1 of Figure 1b, FSTGAN effectively fuses temporal features extracted from each reference image, resulting in feature maps that are devoid of any missing areas.
(2)
In contrast to the existing multiple-reference-image-based methods, the proposed method can leverage the spatial information of uncontaminated areas as a guideline to assimilate the fused features, which eliminates the abrupt areas in the feature maps, as illustrated in Step 2 of Figure 1b. Consequently, FSTGAN is capable of achieving satisfactory results, even when the reference images exhibit significant differences, thereby reducing the strict requirements for reference images.
(3)
A series of experiments were carried out on Landsat 8 OLI and Sentinel-2 MSI data. The visual effects and quantitative evaluation demonstrate the practicality of FSTGAN in both simulated and real experiments.
The subsequent sections of this paper are organized as follows. Section 2 presents the methodology employed for cloud and cloud shadow removal. Section 3 exhibits the experimental results obtained from Landsat 8 OLI and Sentinel-2 MSI data. Section 4 analyzes the rationality of the network architecture and related details. Section 5 summarizes the conclusions drawn from this study and outlines future prospects.

2. Methodology

FSTGAN is an end-to-end network that consists of a generator and a multiscale discriminator. The network is designed with one target image and three arbitrary temporal reference images as input; its framework is displayed in Figure 2. The generator consists of an encoder and a decoder. The encoder is responsible for feature extraction, feature fusion, and feature assimilation of the input images, while the decoder is responsible for recovering the features to obtain the cloud removal image. The discriminator is composed of several Resblocks [38] and requires two groups of data as input: the cloud removal image generated by the generator and the target image serve as negative samples, while the ground truth and the target image serve as positive samples. More details about the generator, discriminator, and loss function are provided in the subsequent sections.
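To make this data flow concrete, the following minimal PyTorch-style sketch organizes one forward pass according to the description above. The call signatures (a generator taking the target image and a list of reference images, and a discriminator taking a channel-wise concatenation of an image pair) are illustrative assumptions, not the authors' exact interface.

```python
import torch

def forward_pass(generator, discriminator, target, refs, ground_truth):
    """One FSTGAN forward pass as described above (interfaces are assumptions).

    target, ground_truth: (N, C, H, W) tensors; refs: list of three (N, C, H, W) tensors.
    """
    recovered = generator(target, refs)                               # cloud-removed target image
    d_fake = discriminator(torch.cat([recovered, target], dim=1))     # negative sample: (output, target)
    d_real = discriminator(torch.cat([ground_truth, target], dim=1))  # positive sample: (ground truth, target)
    return recovered, d_fake, d_real
```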

2.1. Generator

From the perspective of the overall network architecture, the generator is composed of two components, an encoder and a decoder, both constructed based on the residual block [38], which enables the learning of residual information through shortcut connections. The encoder consists of three parts: the feature extraction block, the feature fusion block, and the feature assimilation block. Each module is composed of two basic blocks, the details of which are shown in Figure 3. Basic block 1 is simply composed of a 3 × 3 convolution and a 1 × 1 convolution, and its function is to extract features from the target image. Basic block 2 further integrates switchable normalization to better extract features from the reference images. In general, these three modules play a crucial role in improving the performance of the proposed method. The specific functions of each part are as follows:
  • Feature Extraction Block: The inputs of this module are divided into two main groups: the image/features of the target image and images/features of three reference images. Through the skip connections, the features extracted from the target image can be transferred to the reference image. This module is responsible for extracting features from the target image and three reference images. By analyzing the characteristics of each image, it effectively captures valuable information that can be used for further feature fusion.
  • Feature Fusion Block: The four inputs of this module are the features extracted from the target image and three reference images. After extracting features from the reference images, the feature fusion block sums these features to generate feature maps without any missing areas. Additionally, to prepare for the subsequent feature assimilation, this fusion process converts the four input features into two outputs. One output represents the features of the target image, while the other represents the features obtained by fusing the reference images.
  • Feature Assimilation Block: The inputs of this module consist of fused features extracted from three reference images, along with the features extracted from the target image. The feature assimilation block leverages the uncontaminated information of the target image to assimilate the fused features obtained from the last step. This assimilation process results in refined feature maps of fused features, which aid in detail recovery in the decoder.
The above three modules are all constructed from the two basic blocks (basic block 1 and basic block 2 in Figure 3) using different connection schemes. Throughout the encoding process, the features of the target image are transferred to the reference images by skip connections, playing a guiding role in extracting auxiliary information from the reference images. On the other hand, the decoder is composed of several decoder Resblocks, each containing a residual structure similar to that of the encoder, as shown in Figure 3. One branch of the block uses 3 × 3 and 1 × 1 convolutions for decoding, and the other branch uses a 1 × 1 convolution to adjust channels for the skip connection. The decoder is designed to have the same length as the encoder, ensuring consistency between the two. The input of the decoder is the concatenation of the assimilated features of the reference images and the spatial features of the target image, and its output is the cloud-free target image.
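As an illustration of the two building blocks described above, the sketch below implements a residual block with a 3 × 3 convolution followed by a 1 × 1 convolution on the main branch and a 1 × 1 convolution on the shortcut; passing a normalization layer turns it into basic block 2. The activation function and the exact layer ordering are assumptions, since Figure 3 is described here only at the level of its convolutions and normalization.

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """Sketch of basic blocks 1 and 2 (Figure 3): 3x3 conv + 1x1 conv on the main branch,
    1x1 conv on the shortcut; basic block 2 additionally applies switchable normalization.
    Activation type and layer order are assumptions."""
    def __init__(self, in_ch, out_ch, norm_layer=None):
        super().__init__()
        norm = norm_layer(out_ch) if norm_layer is not None else nn.Identity()
        self.branch = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            norm,
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=1),
        )
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # adjusts channels for the skip

    def forward(self, x):
        return self.branch(x) + self.shortcut(x)

block1 = BasicBlock(16, 32)                            # basic block 1: no normalization
# block2 = BasicBlock(16, 32, norm_layer=SwitchNorm2d) # basic block 2: see the sketch after Equation (1)
```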
For the selection of normalization in the encoder, when the batch size is small, batch normalization [44] amplifies the influence of cloud noise, thereby affecting the stability of training, while layer normalization [45] and instance normalization [46], due to their lack of regularization ability, often lead to obvious overfitting. To address these issues, we employ switchable normalization in the generator, as proposed in [47]. Switchable normalization can adaptively select different normalization methods according to the characteristics of the input data, thereby improving the generalization ability and performance of the model. It is formulated as Equation (1), where $h_{ncij}$ and $\hat{h}_{ncij}$ denote the values before and after switchable normalization, respectively; the mean ($\mu_k$) and standard deviation ($\sigma_k$) are calculated by the different normalization methods in $\Omega$; $\Omega$ represents the set of three normalization methods, i.e., batch normalization, layer normalization, and instance normalization; $\gamma$ and $\beta$ are learnable parameters that control the scaling and shifting of the normalized feature map, respectively; and $w_k$ and $w_k'$ are learned weights that adjust the importance of each normalization method for the mean and variance, respectively.
$$\hat{h}_{ncij} = \gamma \, \frac{h_{ncij} - \sum_{k \in \Omega} w_k \mu_k}{\sqrt{\sum_{k \in \Omega} w_k' \sigma_k^2 + \epsilon}} + \beta \quad (1)$$
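A minimal sketch of Equation (1) is given below, computing instance, layer, and batch statistics and combining them with softmax-normalized weights. It omits running statistics and other refinements of the reference implementation in [47]; the separate weights for means and variances follow the $w_k$ / $w_k'$ notation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchNorm2d(nn.Module):
    """Minimal sketch of switchable normalization (Equation (1)); running statistics and
    other refinements of the reference implementation [47] are omitted."""
    def __init__(self, num_channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.mean_w = nn.Parameter(torch.ones(3))  # weights w_k over {IN, LN, BN} means
        self.var_w = nn.Parameter(torch.ones(3))   # weights w_k' over {IN, LN, BN} variances

    def forward(self, x):  # x: (N, C, H, W)
        mean_in = x.mean(dim=(2, 3), keepdim=True)                 # instance norm statistics
        var_in = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        mean_ln = x.mean(dim=(1, 2, 3), keepdim=True)              # layer norm statistics
        var_ln = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        mean_bn = x.mean(dim=(0, 2, 3), keepdim=True)              # batch norm statistics
        var_bn = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)

        wm = F.softmax(self.mean_w, dim=0)
        wv = F.softmax(self.var_w, dim=0)
        mean = wm[0] * mean_in + wm[1] * mean_ln + wm[2] * mean_bn
        var = wv[0] * var_in + wv[1] * var_ln + wv[2] * var_bn

        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```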

2.2. Discriminator

Inspired by the multiscale discriminator proposed in [48], the discriminator of the proposed method is designed as a multiscale binary classifier composed of multiple Resblocks, as shown in Figure 2c. The inputs consist of the target image, the cloud removal image (considered as fake data), and the ground truth (used as a reference for training). To enable the discriminator to learn more information at different scales and generate a high-resolution cloud removal image, we downsample the inputs by factors of 0.5 and 0.25 and train three discriminators simultaneously. The final layer of features passes through a sigmoid activation function, mapping the input images to the false or true group. The details of each discriminator Resblock are illustrated in Figure 3. The 3 × 3 StridedConv is utilized to reduce the size of the feature map by half, and the 1 × 1 StridedConv is employed to downsample and adjust the number of channels for the skip connection. To enhance training speed and stability, batch normalization is incorporated into our discriminator. Furthermore, spectral normalization [49] is applied to the convolution weights to ensure Lipschitz continuity, thereby improving the performance and stability of the discriminator.
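The structure just described can be sketched as follows: a strided 3 × 3 convolution halves the feature map, a strided 1 × 1 convolution downsamples the shortcut and adjusts channels, spectral normalization is applied to the convolution weights, and batch normalization follows the main convolution. The activation and exact layer order are assumptions, and the helper shows how the three input scales (1, 0.5, 0.25) could be produced.

```python
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm

class DiscriminatorResblock(nn.Module):
    """Sketch of a discriminator Resblock: strided 3x3 conv + batch norm on the main branch,
    strided 1x1 conv on the shortcut, spectral normalization on both convolutions.
    Activation type and layer order are assumptions."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1))
        self.bn = nn.BatchNorm2d(out_ch)
        self.shortcut = spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=2))
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)) + self.shortcut(x))

def multiscale_inputs(x):
    """Copies of the discriminator input at the three scale factors 1, 0.5, and 0.25."""
    return [x,
            F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False),
            F.interpolate(x, scale_factor=0.25, mode="bilinear", align_corners=False)]
```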

2.3. Loss Function

Given the stability that the least squares GAN (LSGAN) [50] has shown in related image tasks, we adopt it as the basic loss function for our training, which can be expressed as Equation (2).
$$\min_D V_{\mathrm{LSGAN}}(D) = \frac{1}{2}\,\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[(D(x) - b)^2\big] + \frac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D(G(z)) - a)^2\big]$$
$$\min_G V_{\mathrm{LSGAN}}(G) = \frac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\big[(D(G(z)) - c)^2\big] \quad (2)$$
where $a$, $b$, and $c$ are constants that fulfill $b - c = 1$ and $b - a = 2$.
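As a concrete reading of Equation (2), the PyTorch sketch below uses $a = -1$, $b = 1$, and $c = 0$, one common choice satisfying $b - c = 1$ and $b - a = 2$; the paper does not state which constants it uses, so these values are an assumption.

```python
def lsgan_d_loss(d_real, d_fake, a=-1.0, b=1.0):
    """Discriminator objective of Equation (2); a = -1, b = 1 is an assumed choice."""
    return 0.5 * ((d_real - b) ** 2).mean() + 0.5 * ((d_fake - a) ** 2).mean()

def lsgan_g_loss(d_fake, c=0.0):
    """Generator objective of Equation (2) with the assumed choice c = 0."""
    return 0.5 * ((d_fake - c) ** 2).mean()
```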
The discriminator loss in the FSTGAN model is the same as $V_{\mathrm{LSGAN}}(D)$, while for the generator loss, inspired by [51], we add additional image content constraints on the basis of $V_{\mathrm{LSGAN}}(G)$. Specifically, the generator loss is composed of the corresponding LSGAN part ($\mathcal{L}_{\mathrm{LSGAN}}$), the image feature loss ($\mathcal{L}_{\mathrm{Feature}}$), the spectral loss ($\mathcal{L}_{\mathrm{Spectrum}}$), and the visual loss ($\mathcal{L}_{\mathrm{Vision}}$), which can be formulated as Equation (3).
$$\min_G V(G) = \lambda_1 \mathcal{L}_{\mathrm{Feature}} + \lambda_2 \mathcal{L}_{\mathrm{Spectrum}} + \lambda_3 \mathcal{L}_{\mathrm{Vision}} + \lambda_4 \mathcal{L}_{\mathrm{LSGAN}} \quad (3)$$
where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are the weight coefficients.
The feature loss is computed using the mean square error (MSE) function, which compares the reshaped cloud removal image ($x$) and the ground truth ($y$), as formulated in Equation (4). To make the color of the recovered areas closer to that of the ground truth, we employ a spectral angle loss, which measures the cosine similarity between the cloud removal image and the ground truth, as shown in Equation (5), where $I$ represents a tensor of ones. To recover finer details in the recovered areas, a vision loss based on multiscale structural similarity [52] is utilized, as formulated in Equation (6), where $l_M$, $c_i$, and $s_i$ measure the similarity in luminance, contrast, and structure, respectively, and the coefficients $\alpha_M$, $\beta_i$, and $\gamma_i$ represent the corresponding weights.
$$\mathcal{L}_{\mathrm{Feature}} = \frac{1}{N}\,\lVert x - y \rVert^2 \quad (4)$$
$$\mathcal{L}_{\mathrm{Spectrum}} = I - \frac{x \cdot y}{\lVert x \rVert\,\lVert y \rVert} \quad (5)$$
$$\mathcal{L}_{\mathrm{Vision}} = I - \big[l_M(x, y)\big]^{\alpha_M} \cdot \prod_{i=1}^{M} \big[c_i(x, y)\big]^{\beta_i} \big[s_i(x, y)\big]^{\gamma_i} \quad (6)$$
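A minimal sketch of the composite generator loss in Equation (3), using the content terms of Equations (4)–(6), is shown below. It assumes inputs scaled to [0, 1], interprets the spectral loss as the mean per-pixel cosine dissimilarity across bands, and relies on the third-party pytorch_msssim package for MS-SSIM; these choices are assumptions rather than the authors' exact implementation.

```python
import torch.nn.functional as F
from pytorch_msssim import ms_ssim  # assumed third-party MS-SSIM implementation

def generator_loss(x, y, d_fake, lambdas=(1.0, 1.0, 1.0, 0.001), c=0.0):
    """Sketch of Equation (3). x: recovered image, y: ground truth, both (N, C, H, W)
    scaled to [0, 1]; d_fake: discriminator output for the recovered image."""
    l_feature = F.mse_loss(x, y)                            # Equation (4)
    cos = F.cosine_similarity(x, y, dim=1)                  # per-pixel cosine over the band dimension
    l_spectrum = (1.0 - cos).mean()                         # Equation (5)
    l_vision = 1.0 - ms_ssim(x, y, data_range=1.0)          # Equation (6)
    l_lsgan = 0.5 * ((d_fake - c) ** 2).mean()              # LSGAN generator term (see the sketch above)
    l1, l2, l3, l4 = lambdas
    return l1 * l_feature + l2 * l_spectrum + l3 * l_vision + l4 * l_lsgan
```

The default weights follow the values reported in Section 3.2 ($\lambda_1 = \lambda_2 = \lambda_3 = 1$, $\lambda_4 = 0.001$).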

3. Experiments and Results

3.1. Study Areas and Datasets

This study utilized two datasets to evaluate the effectiveness of our method: Landsat 8 OLI L1 data and Sentinel-2 MSI L2A data. The Landsat 8 OLI data have a spatial resolution of 30 m and a temporal resolution of 16 days, while the Sentinel-2 MSI data have a spatial resolution of 10 m and a temporal resolution of 5 days.
The first dataset is derived from four scene subsets in Beijing characterized by diverse land cover types such as mountains, water, and impervious layers, as depicted in Figure 4a. Beijing has a warm, temperate, semihumid continental monsoon climate characterized by the frequent occurrence of thick clouds [53]. The second dataset comes from a subset of the Qinghai-Tibetan Plateau (QTP), where the predominant land cover types include soil, mountains, and water, as shown in Figure 4b. The QTP, known for its valuable natural resources, is characterized by subtropical monsoon climates and has attracted significant attention in environmental studies [54]. These two study areas were selected due to their significant differences in natural conditions and human activities. By evaluating the performance of the proposed method on these diverse land cover types and natural conditions, we can comprehensively assess its effectiveness and applicability.
In Figure 4, the distributions of four small subsets in Beijing (each measuring 1024 × 1024 pixels) and one large subset in the QTP (measuring 10,980 × 10,980 pixels) are displayed. The training dataset was created by dividing these subsets into patches of 512 × 512 pixels, yielding 540 patches for the Landsat data and 8680 patches for the Sentinel data. Masks were randomly generated in the data to simulate cloud contamination, so the image before masking can serve as the ground truth of the simulated cloud-contaminated image.
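The paper does not specify how the random masks are generated, so the sketch below simulates contamination with a few random rectangular blobs per patch; blob shape, count, size range, and fill value are all illustrative assumptions.

```python
import numpy as np

def simulate_cloud_mask(patch, max_blobs=5, max_size=200, fill_value=0.0, rng=None):
    """Mask random rectangular "cloud" blobs out of a clean patch (H, W, bands);
    the unmasked patch serves as the ground truth. All parameters are assumptions."""
    rng = rng if rng is not None else np.random.default_rng()
    h, w = patch.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    for _ in range(rng.integers(1, max_blobs + 1)):
        bh, bw = rng.integers(20, max_size, size=2)
        top, left = rng.integers(0, h - bh), rng.integers(0, w - bw)
        mask[top:top + bh, left:left + bw] = True
    cloudy = patch.copy()
    cloudy[mask] = fill_value
    return cloudy, mask
```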

3.2. Experimental Settings

The proposed framework was compared with adaptive weighted tensor completion (AWTC) [34], spatiotemporal tensor completion (ST-Tensor) [55], frequency spectrum-modulated tensor completion (FMTC) [35], and STGAN [56], which are multiple-reference-image-based methods. As mentioned in Section 1.2, most cloud removal methods use only a single reference image. Therefore, to demonstrate the wide application scope of the proposed method, we also compare it with three single-reference-image-based methods: the modified neighborhood similar pixel interpolator (MNSPI) [32], weighted linear regression (WLR) [33], and STGAN [56]. It should be noted that STGAN is a switchable network that can select a single reference image or multiple reference images as complementary information. It offers two different generator architectures, namely a branched U-Net and a ResNet. Due to the high memory usage of the ResNet architecture, we used the U-Net-based STGAN as the compared method.
The proposed network architecture consists of ten basic blocks in the generator, as depicted in Figure 2. The output channels of each block are empirically set to 16, 32, 64, 128, 256, 128, 64, 32, 16, and 7 or 4; notably, the number of output channels in the last layer of the generator matches the number of spectral bands in the Landsat and Sentinel datasets, respectively. The discriminator is trained with five discriminator Resblocks, whose output channels are set to 32, 64, 128, 256, and 256, respectively. Additionally, the coefficients in Equation (3) are set as $\lambda_1 = \lambda_2 = \lambda_3 = 1$ and $\lambda_4 = 0.001$. Other hyperparameters are set as follows: the batch size is 8, and training is run for 200 and 80 iterations for the Landsat and Sentinel datasets, respectively. The Adam optimizer is used with fixed parameters of 0.9 and 0.999, and the learning rate is set to $2 \times 10^{-4}$. The parameters of all the comparison methods follow the suggestions in the original papers. The hardware configuration and software environment are listed in Table 1.
For the training of FSTGAN, the reference images are randomly selected, which means the reference images for the same target image vary across training iterations. This training approach allows the model to learn a more robust encoder that can effectively handle situations where the reference images are far from the target image.
In order to assess the effectiveness of different cloud removal methods, we not only used visual interpretation for qualitative analysis but also introduced seven evaluation metrics for quantitative analysis: root mean square error (RMSE), structural similarity (SSIM) [57], square root of the average relative global error (ERGAS) [58], spectral angle mapper (SAM) [59], peak signal-to-noise ratio (PSNR), correlation coefficient (CC), and universal image quality index (UIQI) [60]. These evaluation metrics comprehensively evaluate the reconstructed results in terms of content, spectrum, and structure.
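For reference, three of these metrics can be computed as in the NumPy sketch below (images as (H, W, bands) arrays with reflectance scaled to [0, 1]); the remaining metrics follow their cited definitions.

```python
import numpy as np

def rmse(x, y):
    """Root mean square error between the recovered image x and the ground truth y."""
    return float(np.sqrt(np.mean((x - y) ** 2)))

def psnr(x, y, data_range=1.0):
    """Peak signal-to-noise ratio in dB, assuming values scaled to [0, data_range]."""
    return float(20 * np.log10(data_range) - 10 * np.log10(np.mean((x - y) ** 2)))

def sam(x, y, eps=1e-12):
    """Mean spectral angle (radians) between per-pixel spectra of (H, W, bands) arrays."""
    dot = np.sum(x * y, axis=-1)
    norms = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1) + eps
    return float(np.mean(np.arccos(np.clip(dot / norms, -1.0, 1.0))))
```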

3.3. Simulated Data Experiments

In this section, the applicability of the proposed method under different scenarios was verified. The specific scenarios were simulated as follows:
(1)
Case 1: The target image was contaminated in large areas, and multitemporal images with multiple small missing areas were employed as the complementary information, as shown in rows 1–4 in Figure 5a–d.
(2)
Case 2: The target image was contaminated by multiple small areas, and multitemporal images with multiple small missing areas were employed as the complementary information, as shown in rows 5–8 in Figure 5a–d.
(3)
Case 3: The target image was contaminated by multiple small areas, and a single cloud-free image was employed as the complementary information, as shown in Figures 8 and 9a,b.
In order to show the results of Case 1 and Case 2 more clearly, we compare only with the other multiple-reference-image-based methods, excluding AWTC, as shown in Figure 5f–i.
Rows 1–4 in Figure 5 show that when the missing areas are large, the recovery results of STGAN are blurred, with the model losing its effectiveness. This is because STGAN tries to learn a set of shared parameters that can extract features from both the target image and the reference images. When the missing area is large, the target image differs significantly from the reference images, causing the network to lose its effectiveness and fail to complete the reconstruction task. Although ST-Tensor and FMTC can maintain many image details, there are color differences between the reconstructed areas and the surroundings, especially for the region whose information only exists in the reference image that is far from the target image. This inconsistency arises because these methods are unable to automatically determine the weights of the spatiotemporal tensor and fully exploit the spatial information of uncontaminated regions. In contrast, FSTGAN preserves more spatial details, and the spectrum is not distorted, ensuring that there are no abrupt areas in the image. This is attributed to the fact that FSTGAN can make full use of the spatial information of the uncontaminated areas in the target image to assimilate the features extracted from reference images and transform its style into that of the target image.
According to the results of Case 2 displayed in rows 5–8 of Figure 5, it is evident that FMTC and ST-Tensor fail to effectively utilize the spatial information from the target image. As a result, the reconstructed images exhibit areas of abrupt change. In comparison, STGAN is able to leverage more spatial information from the surrounding areas of the missing region, but there still exists a noticeable discrepancy in color compared to the surroundings. On the other hand, FSTGAN preserves the majority of details in the reconstructed images, and the images are free of abrupt areas, confirming its efficacy in both Case 1 and Case 2.
Table 2 provides a quantitative assessment of the cloud removal results obtained by the different methods for the Landsat and Sentinel data presented in Figure 5. In Case 1, among the three low-rank-based methods, ST-Tensor shows slightly better performance than FMTC, while AWTC yields the least satisfactory results. This can be attributed to the fact that AWTC can only utilize the low-rank information of the experimental data; as a result, when dealing with large missing regions, the available spatial information becomes very limited. Similarly, STGAN achieves poor performance due to its ineffectiveness in this particular case. In Case 2, AWTC and STGAN achieve considerable performance but are still inferior to the two other low-rank-based methods. FSTGAN outperforms all other methods in all quantitative evaluation metrics; in particular, its SAM is significantly better than that of the three low-rank-based methods.
To further demonstrate the effectiveness of the proposed method, we select rows 4 and 6 in Figure 5 as representatives of Cases 1 and 2, respectively, and show the zoomed figures and scatter plots of the reconstructed regions in Figure 6 and Figure 7. The narrower the scatter region is, the closer the recovery results are to the ground truth. The scatter region of FSTGAN is clearly more concentrated, which signifies a higher level of consistency and accuracy in reconstructing the missing areas, further highlighting its superior performance.
For Case 3, to verify the limitation of multiple-reference-image-based methods, we also choose ST-Tensor for comparison, in addition to methods based on a single reference image. To ensure the fairness of comparison, the three inputs of FSTGAN and ST-Tensor are the same as the reference image used in the three single-temporal-image-based methods: MNSPI, WLR, and STGAN. In this way, it can be guaranteed that the two methods do not use any additional information beyond the reference image. The recovery results of these methods are shown in Figure 8 and Figure 9d–h, representing the cases where the difference between the reference image and the target image is small/large, respectively. It can be observed that when the difference between the reference image and the target image is small, all methods can achieve good results from a qualitative perspective, as shown in Figure 8. However, from the quantitative perspective, as indicated in Table 3, FSTGAN achieves the best results among all methods, which means that it recovers more detailed information. On the contrary, when there is a larger difference between the target image and the reference image, most of the recovery results tend to be poor, as shown in Figure 9. Due to the complexity of the missing area, the recovery details of MNSPI and WLR are seriously insufficient or even lose their effectiveness. Since the spatial information of the target image cannot be used reasonably, when all reference images have large differences from the target image, the limitations of traditional multiple-reference-image-based methods are also revealed, as shown in Figure 9g. Compared with the abovementioned methods, STGAN makes more effective use of the spatial information of the target image, so the tone on the recovery image is relatively continuous, despite insufficient detail. Among all methods, FSTGAN achieved the best results, which means that it is more robust to reference images and can better combine spatiotemporal information than the other methods.

3.4. Real-Data Experiments

To demonstrate the effectiveness of the proposed method on real cloud contamination and large-scale remote sensing images, we carried out real experiments on Landsat data. It is important to note that none of these cloud removal methods has the ability to detect clouds; therefore, clouds and their shadows must be masked in the remote sensing images before these methods are applied. It is worth mentioning that there are already well-established methods for Landsat cloud detection [61,62], which use the characteristics of clouds in each band to detect them. In this scene, however, the quality assessment (QA) band of Landsat is sufficient and, by visual interpretation, very accurate. Hence, all the images were preprocessed to mask the clouds and their shadows according to their QA bands. The masked results and the recovery results of the different methods are shown in Figure 10a–d and Figure 10e–h, respectively.
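A sketch of this QA-based masking step is given below. It assumes the Landsat Collection 2 QA_PIXEL bit layout (bit 3 = cloud, bit 4 = cloud shadow); the bit positions should be checked against the documentation of the actual product used.

```python
import numpy as np

def qa_cloud_shadow_mask(qa_pixel):
    """Boolean mask of cloud and cloud-shadow pixels from a Landsat QA band,
    assuming the Collection 2 QA_PIXEL layout (bit 3 = cloud, bit 4 = cloud shadow)."""
    cloud = (qa_pixel >> 3) & 1
    shadow = (qa_pixel >> 4) & 1
    return (cloud | shadow).astype(bool)
```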
It is evident that there exist temporal differences among the reference images due to seasonal changes and illumination differences; the dates of reference images b–d are progressively farther from that of the target image. Among the three reference images, only c is free of clouds and can be utilized by the single-reference-image-based methods. However, reference image b has the nearest date, and its uncontaminated regions can also offer considerable valuable information. In this case, as a method based on a single reference image may not be optimal, we compare FSTGAN with the other multiple-reference-image-based methods.
As observed in Figure 10e–h, the results of the ST-Tensor method exhibit abrupt changes. This is because when the differences between the target image and the reference images are large, ST-Tensor chooses inappropriate weights for the rearranged spatiotemporal tensor. Furthermore, due to the inherent limitations of the network, the recovery effect of STGAN in large missing areas is also unsatisfactory. On the other hand, both FMTC and FSTGAN demonstrate relatively good overall recovery effects, without any abrupt changes. However, in the zoomed details shown in Figure 10i–l, it becomes apparent that the details recovered by FMTC exhibit more noise, with a style more closely resembling that of reference images c and d, which is not reasonable. This can be attributed to the fact that FMTC operates in the frequency domain. Consequently, when the time series is short and there is a significant temporal difference, the results tend to be noisy and biased towards the style of the reference images with higher frequency. In contrast, our method effectively mitigates color distortion and ensures the continuity of spatial information, which demonstrates the efficacy of FSTGAN in handling large-scale remote sensing images.

4. Discussion

In this section, we explore the rationality behind the network design in the context of remote sensing. Furthermore, we discuss various factors associated with the proposed method, such as the choice of normalization technique and loss function.

4.1. Rationality of the Network

In this section, we discuss how the network operates. To provide a more intuitive representation of the network's working process, we choose the Landsat data in Figure 5 as inputs and display some feature maps from each layer of the encoder, as shown in Figure 11. It is important to note that the feature maps shown here are selected parts of each layer for display convenience and do not represent all the features extracted from each layer. The actual number of feature channels in each layer is consistent with the configuration provided in Section 3.2, i.e., 16, 32, 64, 128, and 256 for layers 1–5, respectively.
From this figure, it can be observed that after passing through the first two feature extraction blocks, various shallow and deep features of the target image and the three reference images are extracted, including texture and color information and other relevant characteristics. However, these feature maps are also missing in the corresponding cloud-contaminated areas of the reference images.
To address this issue and fully utilize the valuable information from each reference image, we designed the feature fusion block. This block sums the extracted features from each reference image, resulting in feature maps that do not contain any missing areas. Nevertheless, due to variations in seasons and ground objects between the reference images, the deeper features extracted in layer 2 exhibit discrepancies, resulting in “blocking artifacts” of the fused features, as depicted in layer 3.
To mitigate this phenomenon, we further designed the feature assimilation block, which leverages the surrounding spatial information to assimilate abrupt areas and promote greater consistency among the features. As we can observe in layer 4, the application of a single feature assimilation block significantly reduces the abruptness of the features, although a slight discontinuity remains. However, in layer 5, the final encoded features are entirely free of “blocking artifacts”, thereby serving as the fundamental features for the subsequent reconstruction of the target image.
During the entire encoding process, the features extracted from the target image are shared with the reference image, serving as a guide for the feature extraction process of the reference image. After the encoding process, the final encoded features can be divided into two parts: features extracted from the target image and features from the reference images, as indicated by orange and blue rectangles in layer 5, respectively. The former provides information about uncontaminated areas in the target image, and the latter provides the missing information about the target image. By concatenating these two sets of features, all the information of a completely clear image can be obtained, and the image can be reconstructed by passing them through the decoder.

4.2. Ablation Study

To demonstrate the effectiveness of switchable normalization in the generator, several ablation experiments were carried out. Specifically, we compared switchable normalization with the commonly used batch normalization method and examined a scenario without any normalization. The results shown in Figure 12 indicate that while batch normalization can slightly expedite the convergence of the model as the number of epochs increases, its impact on the validation dataset tends to fluctuate significantly. This observation also verifies what is described in Section 2.1. Specifically, due to the varying locations of cloud contamination areas in the image, the application of batch normalization introduces more noise and consequently leads to model instability. On the other hand, the switchable normalization approach adopted in this paper exhibits superior performance in both training and validation, which can be attributed to its ability to adaptively combine multiple normalization methods. These results further confirm the crucial role played by switchable normalization in the network architecture.
We also conducted another two sets of ablation experiments: one to verify the impact of each component in the loss function and the other to verify the importance of skip connections in each module. Specifically, for the former ablation experiment, we verified the effect of various combinations of the three losses (feature loss, spectrum loss, and vision loss). As modifying the loss function component would alter the training loss, we utilized the test loss, specifically the mean squared error (MSE) between the final recovered images and the ground truth, as the evaluation metric. Table 4 illustrates that each of these three losses contributes to an improvement in the results, with the combination of all three loss functions yielding the best outcomes.
For the latter ablation experiment, we removed the skip connections in each module (encoder, decoder, and discriminator) to verify its effectiveness. For quantitative comparison, we adopted the final training loss as the evaluation metric. The results are shown in Table 5. It can be concluded that the design of the skip connections in each module is reasonable, with the following order of importance: decoder, encoder, and discriminator.

4.3. The Influence of the Scale Factors in the Discriminator

As mentioned in Section 2.2, the multiscale discriminator of FSTGAN uses three scale factors: 1, 0.5, and 0.25. Since the spatial resolution of each remote sensing dataset is different, when the spatial resolution is low, some important information is likely lost after downsampling, diminishing the improvement in network performance. In this section, we further discuss the influence of the scale factors on the two datasets, Landsat 8 and Sentinel-2, with spatial resolutions of 30 m and 10 m, respectively. For quantitative comparison, we again adopted the final training loss as the evaluation metric. The results are shown in Table 6. According to the table, due to differences between the selected regions and the mismatch of bands used in the experiment, we cannot conclude whether the efficiency of the multiscale discriminator is correlated with resolution. However, it can be concluded that for both datasets, each additional scale of the discriminator has a positive impact on the results, which means that each scale introduces more information that is conducive to discrimination.

4.4. Computational Complexity Analysis

In this section, we provide a detailed analysis of the computational complexity of each algorithm compared in this study. To ensure a fair comparison, we evaluate the algorithms in the same scenario. Specifically, we select the second and seventh rows in Figure 5 as representatives of the Landsat and Sentinel data, respectively. For the two deep-learning-based methods, we also analyze the number of parameters, training time, and inference time. The specific results are shown in Table 7. Among the traditional algorithms, FMTC exhibits the smallest time consumption, while ST-Tensor has the longest. Regarding the deep learning methods, FSTGAN outperforms STGAN, with fewer parameters and a shorter training time, which further proves the superiority of the proposed method.

5. Conclusions

In this paper, a flexible spatiotemporal deep learning framework based on a generative adversarial network is proposed for thick cloud removal. Compared with existing temporally based methods, our network can not only utilize the information of multiple reference images but also achieve good results when there is a large difference between the reference images and the target image, corresponding to low requirements for reference images. Experiments conducted on Landsat 8 OLI and Sentinel-2 MSI data demonstrate the effectiveness of the proposed method in addressing both small- and large-scale missing-area scenarios. Although the proposed method achieves satisfactory results on the experimental data, some issues still need further consideration. First, the proposed method does not have the ability to detect clouds, so clouds in the input image must be masked in advance, which means the cloud removal results are affected by the reliability of the cloud masking process. Second, the proposed method cannot reconstruct areas for which no complementary temporal information is available. Third, the proposed method is aimed at addressing thick clouds and their shadows, so if it is applied to thin clouds, the ground feature information still available beneath them is ignored. Lastly, although this method has lower requirements for the reference images, the type of land cover must remain unchanged; otherwise, it produces unsatisfactory results. In future research, a model that integrates cloud detection and cloud removal will be studied, in which thin and thick cloud regions can be reconstructed separately to overcome the limitations of the method proposed in this paper.

Author Contributions

Conceptualization, Y.Z.; Methodology, Y.Z. and H.T.; Resources, L.J.; Writing—original draft, Y.Z.; Writing—review & editing, Y.Z.; Supervision, L.J., X.X., P.Z., K.J. and H.T.; Funding acquisition, L.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (grant number 31400), the Second Tibetan Plateau Scientific Expedition and Research Program (grant number 2019QZKK0206), and the High-resolution Earth Observation System Major Project (Civil Part) Phase II (Subject number: 01-Y30F05-9001-20/22-03).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chang, C.I.; Heinz, D.C. Constrained subpixel target detection for remotely sensed imagery. IEEE Trans. Geosci. Remote Sens. 2000, 38, 1144–1159. [Google Scholar] [CrossRef]
  2. Chen, C.P.; Li, H.; Wei, Y.; Xia, T.; Tang, Y.Y. A local contrast method for small infrared target detection. IEEE Trans. Geosci. Remote Sens. 2013, 52, 574–581. [Google Scholar] [CrossRef]
  3. Li, S.; Song, W.; Fang, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J.A. Deep learning for hyperspectral image classification: An overview. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6690–6709. [Google Scholar] [CrossRef]
  4. Berni, J.A.; Zarco-Tejada, P.J.; Suárez, L.; Fereres, E. Thermal and narrowband multispectral remote sensing for vegetation monitoring from an unmanned aerial vehicle. IEEE Trans. Geosci. Remote Sens. 2009, 47, 722–738. [Google Scholar] [CrossRef]
  5. Cao, R.; Chen, Y.; Chen, J.; Zhu, X.; Shen, M. Thick cloud removal in Landsat images based on autoregression of Landsat time-series data. Remote Sens. Environ. 2020, 249, 112001. [Google Scholar] [CrossRef]
  6. Zhang, C.; Li, W.; Travis, D. Gaps-fill of SLC-off Landsat ETM+ satellite image using a geostatistical approach. Int. J. Remote Sens. 2007, 28, 5103–5122. [Google Scholar] [CrossRef]
  7. Maalouf, A.; Carré, P.; Augereau, B.; Fernandez-Maloigne, C. A bandelet-based inpainting technique for clouds removal from remotely sensed images. IEEE Trans. Geosci. Remote Sens. 2009, 47, 2363–2371. [Google Scholar] [CrossRef]
  8. Cheng, Q.; Shen, H.; Zhang, L.; Li, P. Inpainting for remotely sensed images with a multichannel nonlocal total variation model. IEEE Trans. Geosci. Remote Sens. 2013, 52, 175–187. [Google Scholar] [CrossRef]
  9. Zhao, Y.; Shen, S.; Hu, J.; Li, Y.; Pan, J. Cloud removal using multimodal GAN with adversarial consistency loss. IEEE Geosci. Remote Sens. Lett. 2021, 19, 8015605. [Google Scholar] [CrossRef]
  10. Huang, B.; Li, Y.; Han, X.; Cui, Y.; Li, W.; Li, R. Cloud removal from optical satellite imagery with SAR imagery using sparse representation. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1046–1050. [Google Scholar] [CrossRef]
  11. Gao, J.; Yuan, Q.; Li, J.; Zhang, H.; Su, X. Cloud removal with fusion of high resolution optical and SAR images using generative adversarial networks. Remote Sens. 2020, 12, 191. [Google Scholar] [CrossRef]
  12. Grohnfeldt, C.; Schmitt, M.; Zhu, X. A conditional generative adversarial network to fuse SAR and multispectral optical data for cloud removal from Sentinel-2 images. In Proceedings of the IGARSS 2018–2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 1726–1729. [Google Scholar]
  13. Liu, H.; Huang, B.; Cai, J. Thick Cloud Removal Under Land Cover Changes Using Multisource Satellite Imagery and a Spatiotemporal Attention Network. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5601218. [Google Scholar] [CrossRef]
  14. Li, M.; Liew, S.C.; Kwoh, L.K. Producing cloud free and cloud-shadow free mosaic from cloudy IKONOS images. In Proceedings of the IGARSS 2003: 2003 IEEE International Geoscience and Remote Sensing Symposium, Toulouse, France, 21–25 July 2003; Volume 6, pp. 3946–3948. [Google Scholar]
  15. Scaramuzza, P.; Barsi, J. Landsat 7 scan line corrector-off gap-filled product development. Proc. Pecora 2005, 16, 23–27. [Google Scholar]
  16. Savitzky, A.; Golay, M.J. Smoothing and differentiation of data by simplified least squares procedures. Anal. Chem. 1964, 36, 1627–1639. [Google Scholar] [CrossRef]
  17. Roerink, G.; Menenti, M.; Verhoef, W. Reconstructing cloudfree NDVI composites using Fourier analysis of time series. Int. J. Remote Sens. 2000, 21, 1911–1917. [Google Scholar] [CrossRef]
  18. Lorenzi, L.; Melgani, F.; Mercier, G. Missing-area reconstruction in multispectral images under a compressive sensing perspective. IEEE Trans. Geosci. Remote Sens. 2013, 51, 3998–4008. [Google Scholar] [CrossRef]
  19. Latif, B.A.; Lecerf, R.; Mercier, G.; Hubert-Moy, L. Preprocessing of low-resolution time series contaminated by clouds and shadows. IEEE Trans. Geosci. Remote Sens. 2008, 46, 2083–2096. [Google Scholar] [CrossRef]
  20. Meng, Q.; Borders, B.E.; Cieszewski, C.J.; Madden, M. Closest spectral fit for removing clouds and cloud shadows. Photogramm. Eng. Remote Sens. 2009, 75, 569–576. [Google Scholar] [CrossRef]
  21. Ramoino, F.; Tutunaru, F.; Pera, F.; Arino, O. Ten-meter Sentinel-2A cloud-free composite—Southern Africa 2016. Remote Sens. 2017, 9, 652. [Google Scholar] [CrossRef]
  22. Gao, G.; Gu, Y. Multitemporal Landsat missing data recovery based on tempo-spectral angle model. Photogramm. Eng. Remote Sens. 2017, 55, 3656–3668. [Google Scholar] [CrossRef]
  23. Zhang, J.; Clayton, M.K.; Townsend, P.A. Missing data and regression models for spatial images. Photogramm. Eng. Remote Sens. 2014, 53, 1574–1582. [Google Scholar] [CrossRef]
  24. Vuolo, F.; Ng, W.T.; Atzberger, C. Smoothing and gap-filling of high resolution multi-spectral time series: Example of Landsat data. Int. J. Appl. Earth Obs. Geoinf. 2017, 57, 202–213. [Google Scholar] [CrossRef]
  25. Yang, G.; Shen, H.; Zhang, L.; He, Z.; Li, X. A moving weighted harmonic analysis method for reconstructing high-quality SPOT VEGETATION NDVI time-series data. IEEE Trans. Geosci. Remote Sens. 2015, 53, 6008–6021. [Google Scholar] [CrossRef]
  26. Zhang, Y.; Xiang, Y.; Zhang, L.Y.; Yang, L.X.; Zhou, J. Efficiently and securely outsourcing compressed sensing reconstruction to a cloud. Inf. Sci. 2019, 496, 150–160. [Google Scholar] [CrossRef]
  27. Sahoo, T.; Patnaik, S. Cloud removal from satellite images using auto associative neural network and stationary wavelet transform. In Proceedings of the 2008 First International Conference on Emerging Trends in Engineering and Technology, Nagpur, India, 16–18 July 2008; pp. 100–105. [Google Scholar]
  28. Tahsin, S.; Medeiros, S.C.; Hooshyar, M.; Singh, A. Optical cloud pixel recovery via machine learning. Remote Sens. 2017, 9, 527. [Google Scholar] [CrossRef]
  29. Gao, F.; Deng, X.; Xu, M.; Xu, J.; Dragotti, P.L. Multi-modal convolutional dictionary learning. IEEE Trans. Image Process. 2022, 31, 1325–1339. [Google Scholar] [CrossRef]
  30. Xu, M.; Jia, X.; Pickering, M.; Plaza, A.J. Cloud removal based on sparse representation via multitemporal dictionary learning. IEEE Trans. Geosci. Remote Sens. 2016, 54, 2998–3006. [Google Scholar] [CrossRef]
  31. Li, X.; Wang, L.; Cheng, Q.; Wu, P.; Gan, W.; Fang, L. Cloud removal in remote sensing images using nonnegative matrix factorization and error correction. ISPRS J. Photogramm. Remote Sens. 2019, 148, 103–113. [Google Scholar] [CrossRef]
  32. Zhu, X.; Gao, F.; Liu, D.; Chen, J. A modified neighborhood similar pixel interpolator approach for removing thick clouds in Landsat images. IEEE Geosci. Remote Sens. Lett. 2011, 9, 521–525.
  33. Zeng, C.; Shen, H.; Zhang, L. Recovering missing pixels for Landsat ETM+ SLC-off imagery using multi-temporal regression analysis and a regularization method. Remote Sens. Environ. 2013, 131, 182–194.
  34. Ng, M.K.P.; Yuan, Q.; Yan, L.; Sun, J. An adaptive weighted tensor completion method for the recovery of remote sensing images with missing data. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3367–3381.
  35. Chen, Z.; Zhang, P.; Zhang, Y.; Xu, X.; Ji, L.; Tang, H. Thick Cloud Removal in Multi-Temporal Remote Sensing Images via Frequency Spectrum-Modulated Tensor Completion. Remote Sens. 2023, 15, 1230.
  36. Zhang, Q.; Yuan, Q.; Zeng, C.; Li, X.; Wei, Y. Missing data reconstruction in remote sensing image with a unified spatial–temporal–spectral deep convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2018, 56, 4274–4288.
  37. Ma, X.; Wang, Q.; Tong, X.; Atkinson, P.M. A deep learning model for incorporating temporal information in haze removal. Remote Sens. Environ. 2022, 274, 113012.
  38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  39. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
  40. Singh, P.; Komodakis, N. Cloud-gan: Cloud removal for sentinel-2 imagery using a cyclic consistent generative adversarial networks. In Proceedings of the IGARSS 2018–2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 1772–1775.
  41. Enomoto, K.; Sakurada, K.; Wang, W.; Fukui, H.; Matsuoka, M.; Nakamura, R.; Kawaguchi, N. Filmy cloud removal on satellite imagery with multispectral conditional generative adversarial nets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 48–56.
  42. Shen, H.; Li, X.; Cheng, Q.; Zeng, C.; Yang, G.; Li, H.; Zhang, L. Missing information reconstruction of remote sensing data: A technical review. IEEE Geosci. Remote Sens. Mag. 2015, 3, 61–85.
  43. Zhu, Z.; Woodcock, C.E. Object-based cloud and cloud shadow detection in Landsat imagery. Remote Sens. Environ. 2012, 118, 83–94.
  44. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456.
  45. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450.
  46. Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Instance normalization: The missing ingredient for fast stylization. arXiv 2016, arXiv:1607.08022.
  47. Luo, P.; Ren, J.; Peng, Z.; Zhang, R.; Li, J. Differentiable learning-to-normalize via switchable normalization. arXiv 2018, arXiv:1806.10779.
  48. Tang, W.; Li, G.; Bao, X.; Nian, F.; Li, T. Mscgan: Multi-scale conditional generative adversarial networks for person image generation. In Proceedings of the 2020 Chinese Control and Decision Conference (CCDC), Hefei, China, 22–24 August 2020; pp. 1440–1445.
  49. Miyato, T.; Kataoka, T.; Koyama, M.; Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv 2018, arXiv:1802.05957.
  50. Mao, X.; Li, Q.; Xie, H.; Lau, R.Y.; Wang, Z.; Paul Smolley, S. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2794–2802.
  51. Tan, Z.; Gao, M.; Li, X.; Jiang, L. A flexible reference-insensitive spatiotemporal fusion model for remote sensing images using conditional generative adversarial network. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5601413.
  52. Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. In Proceedings of the Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 9–12 November 2003; Volume 2, pp. 1398–1402.
  53. Xia, M.; Jia, K. Reconstructing Missing Information of Remote Sensing Data Contaminated by Large and Thick Clouds Based on an Improved Multitemporal Dictionary Learning Method. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5605914.
  54. Wan, W.; Xiao, P.; Feng, X.; Li, H.; Ma, R.; Duan, H.; Zhao, L. Monitoring lake changes of Qinghai-Tibetan Plateau over the past 30 years using satellite remote sensing data. Chin. Sci. Bull. 2014, 59, 1021–1035.
  55. Chu, D.; Shen, H.; Guan, X.; Chen, J.M.; Li, X.; Li, J.; Zhang, L. Long time-series NDVI reconstruction in cloud-prone regions via spatio-temporal tensor completion. Remote Sens. Environ. 2021, 264, 112632.
  56. Sarukkai, V.; Jain, A.; Uzkent, B.; Ermon, S. Cloud Removal in Satellite Images Using Spatiotemporal Generative Networks. arXiv 2019, arXiv:1912.06838.
  57. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
  58. Wald, L. Data Fusion: Definitions and Architectures: Fusion of Images of Different Spatial Resolutions; Presses des MINES: Paris, France, 2002.
  59. Kruse, F.A.; Lefkoff, A.; Boardman, J.; Heidebrecht, K.; Shapiro, A.; Barloon, P.; Goetz, A. The spectral image processing system (SIPS)—Interactive visualization and analysis of imaging spectrometer data. Remote Sens. Environ. 1993, 44, 145–163.
  60. Wang, Z.; Bovik, A.C. A universal image quality index. IEEE Signal Process. Lett. 2002, 9, 81–84.
  61. Kwan, C.; Hagen, L.; Chou, B.; Perez, D.; Li, J.; Shen, Y.; Koperski, K. Simple and effective cloud- and shadow-detection algorithms for Landsat and Worldview images. Signal Image Video Process. 2020, 14, 125–133.
  62. Zhu, Z.; Wang, S.; Woodcock, C.E. Improvement and expansion of the Fmask algorithm: Cloud, cloud shadow, and snow detection for Landsats 4–7, 8, and Sentinel 2 images. Remote Sens. Environ. 2015, 159, 269–277.
Figure 1. The issues associated with existing temporally based cloud removal methods and our solution (The © symbol denotes the concatenation of the feature maps). (a) Limitations of existing single-reference-image- and multiple-reference-image-based cloud removal methods; (b) contribution of the proposed method.
Figure 2. Framework of FSTGAN and the structure of the generator and discriminator (the specific details of these blocks are shown in Figure 3; the generator's inputs are the cloud-contaminated target image and three reference images, and its output is the cloud-removed image; the discriminator's inputs comprise two pairs: the cloud-removed image with the target image, and the ground truth with the target image; for the multiscale inputs, both pairs are downsampled to different levels before being fed to the discriminator). (a) Whole architecture of the FSTGAN model for cloud removal; (b) network framework of the generator; (c) network framework of the discriminator.
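The input wiring described in the caption can be summarized in a short PyTorch sketch. This is only an illustration of the conditioning and multiscale pairing, not the authors' code: the band count, the placeholder generator output, and the 0.5×/0.25× scale factors (chosen to match Table 6) are assumptions.

```python
# Minimal sketch of the Figure 2 input wiring (illustrative assumptions only).
import torch
import torch.nn.functional as F

B, C, H, W = 2, 4, 256, 256                          # batch, bands, height, width (assumed)
target = torch.rand(B, C, H, W)                      # cloud-contaminated target image
refs = [torch.rand(B, C, H, W) for _ in range(3)]    # three reference images
truth = torch.rand(B, C, H, W)                       # cloud-free ground truth (training only)

recovered = torch.rand(B, C, H, W)                   # stand-in for generator(target, refs)

def multiscale_pairs(image, condition, scales=(1.0, 0.5, 0.25)):
    """Concatenate an image with the target (condition) and downsample to each scale."""
    pairs = []
    for s in scales:
        pair = torch.cat([image, condition], dim=1)  # (B, 2*C, H, W)
        if s != 1.0:
            pair = F.interpolate(pair, scale_factor=s,
                                 mode="bilinear", align_corners=False)
        pairs.append(pair)
    return pairs

fake_inputs = multiscale_pairs(recovered, target)    # presented to the discriminator as "fake"
real_inputs = multiscale_pairs(truth, target)        # presented to the discriminator as "real"
print([tuple(p.shape) for p in fake_inputs])         # [(2, 8, 256, 256), (2, 8, 128, 128), (2, 8, 64, 64)]
```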
Figure 3. Different components of the proposed architecture shown in Figure 2 (Conv and StridedConv denote convolution operations with strides of 1 and 2, respectively; the symbol ⊕ represents the element-wise addition of feature maps; SwitchNorm is a normalization method in which several normalization statistics are weighted and applied to the feature maps; BatchNorm and SpectralNorm normalize the feature maps and the convolution parameters, respectively).
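For readers less familiar with switchable normalization [47], the sketch below shows its core idea: the statistics used to normalize a feature map are a learned, softmax-weighted mixture of instance, layer, and batch statistics. This is a compact illustration without running statistics or weight sharing, not the implementation used in FSTGAN.

```python
# Simplified switchable normalization (SwitchNorm) for 2D feature maps.
import torch
import torch.nn as nn

class SwitchNorm2d(nn.Module):
    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(1, channels, 1, 1))   # affine scale
        self.bias = nn.Parameter(torch.zeros(1, channels, 1, 1))    # affine shift
        self.mean_w = nn.Parameter(torch.ones(3))   # weights over {IN, LN, BN} means
        self.var_w = nn.Parameter(torch.ones(3))    # weights over {IN, LN, BN} variances

    def forward(self, x):
        # Instance statistics: per sample, per channel.
        mean_in = x.mean(dim=(2, 3), keepdim=True)
        var_in = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        # Layer statistics: per sample, over all channels and positions.
        mean_ln = x.mean(dim=(1, 2, 3), keepdim=True)
        var_ln = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        # Batch statistics: over the whole batch, per channel.
        mean_bn = x.mean(dim=(0, 2, 3), keepdim=True)
        var_bn = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)

        mw = torch.softmax(self.mean_w, dim=0)
        vw = torch.softmax(self.var_w, dim=0)
        mean = mw[0] * mean_in + mw[1] * mean_ln + mw[2] * mean_bn
        var = vw[0] * var_in + vw[1] * var_ln + vw[2] * var_bn
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return x_hat * self.weight + self.bias

x = torch.rand(2, 16, 64, 64)
print(SwitchNorm2d(16)(x).shape)   # torch.Size([2, 16, 64, 64])
```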
Figure 4. Study area of Beijing and the Qinghai-Tibetan Plateau (QTP). (a) Study area of Beijing and the location of four subsets of the Landsat data; (b) the location of the study area in QTP and the Sentinel data of the subset region.
Figure 5. Qualitative comparisons of Case 1 and Case 2 on a simulated cloud-contaminated dataset, where rows 1–2 and 3–4 represent Landsat and Sentinel data of Case 1, respectively, and rows 5–6 and 7–8 represent Landsat and Sentinel data of Case 2, respectively. (a) Simulated cloud-contaminated images; (bd) three reference images; (e) ground truth; (fi) recovery results of ST-Tensor, FMTC, STGAN, and the proposed method, respectively.
Figure 6. Zoomed images from row 4 in Figure 5 for Case 1 and the corresponding scatter plots of recovery results obtained with different methods. (a) Ground truth; (be) recovery results of ST-Tensor, FMTC, STGAN, and FSTGAN, respectively; (fi) scatter plots of ST-Tensor, FMTC, STGAN, and FSTGAN, respectively, in reconstructed regions.
Figure 7. Zoomed images from row 6 in Figure 5 for Case 2 and the corresponding scatter plots of recovery results obtained with different methods. (a) Ground truth; (be) recovery results of ST-Tensor, FMTC, STGAN, and FSTGAN, respectively; (fi) scatter plots of ST-Tensor, FMTC, STGAN, and FSTGAN, respectively, in reconstructed regions.
Figure 8. Qualitative comparisons of Case 3, where the reference image and the target image have small temporal differences. (a) Cloud-contaminated image; (b) reference image; (c) ground truth; (dh) recovery results of MNSPI, WLR, STGAN, ST-Tensor, and FSTGAN, respectively.
Figure 9. Qualitative comparisons of Case 3, where the reference image and the target image have large temporal differences. (a) Cloud-contaminated image; (b) reference image; (c) ground truth; (dh) recovery results of MNSPI, WLR, STGAN, ST-Tensor, and FSTGAN, respectively.
Figure 10. Real-data experiments over Beijing using Landsat data (30 × 30 km). (a) Target image from 8 August 2016; (bd) reference images from 21 June 2016, 4 May 2016, and 18 April 2016; (eh) recovery results of ST-Tensor, FMTC, STGAN, and FSTGAN, respectively; (i,j) zoomed details of FMTC and FSTGAN corresponding to areas indicated by yellow rectangles; (k,l) zoomed details of FMTC and FSTGAN corresponding to areas indicated by green rectangles.
Figure 11. An example illustrating the rationality of the network (the feature extraction block, feature fusion block, and feature assimilation block are the basic blocks in the encoder; their details are shown in Figure 3. Layers 1–5 represent the feature maps extracted after each block; only part of them is shown for convenience. The input target image and reference images pass through five steps: steps 1 and 2 extract shallow and deep features for feature fusion; step 3 fuses the deep features to obtain feature maps with no missing areas; steps 4 and 5 leverage the uncontaminated information of the target image to assimilate the fused features, eliminating the abrupt areas in the feature maps).
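The five steps in the caption can also be read as a forward pass. The sketch below mirrors that data flow with plain Conv + ReLU placeholders and arbitrary channel sizes; the actual blocks in Figure 3 additionally contain residual connections and normalization.

```python
# Sketch of the three-step encoder data flow from Figure 11 (placeholder blocks).
import torch
import torch.nn as nn

def block(in_ch, out_ch, stride=1):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride, 1), nn.ReLU(inplace=True))

class ThreeStepEncoder(nn.Module):
    def __init__(self, bands=4, feat=32):
        super().__init__()
        self.extract1 = block(bands, feat)            # step 1: shallow features
        self.extract2 = block(feat, feat, stride=2)   # step 2: deep features
        self.fuse = block(4 * feat, feat)             # step 3: fuse target + 3 references
        self.assim1 = block(2 * feat, feat)           # step 4: assimilate with target cues
        self.assim2 = block(2 * feat, feat)           # step 5: refine the assimilation

    def forward(self, target, refs):
        feats = [self.extract2(self.extract1(im)) for im in [target] + refs]
        fused = self.fuse(torch.cat(feats, dim=1))    # fused maps with no missing areas
        tgt = feats[0]                                # uncontaminated target information
        out = self.assim1(torch.cat([fused, tgt], dim=1))
        out = self.assim2(torch.cat([out, tgt], dim=1))
        return out

enc = ThreeStepEncoder()
target = torch.rand(1, 4, 128, 128)
refs = [torch.rand(1, 4, 128, 128) for _ in range(3)]
print(enc(target, refs).shape)   # torch.Size([1, 32, 64, 64])
```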
Figure 12. The training loss and the validation MSE for Landsat data on three different model architectures (generator with no normalization, batch normalization, and switchable normalization).
Table 1. Hardware configuration and the software environment.
Hardware | RAM: 80 GB | CPU: AMD EPYC 7543 | GPU: RTX 3090
Software | Python 3.8.13 | CUDA 11.3 | PyTorch 1.8.0
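When reproducing the experiments, the local environment can be checked against Table 1 with standard Python/PyTorch attributes; the values in the comments are simply those reported above.

```python
# Optional environment check against Table 1.
import platform
import torch

print("Python :", platform.python_version())         # 3.8.13 in Table 1
print("PyTorch:", torch.__version__)                  # 1.8.0 in Table 1
print("CUDA   :", torch.version.cuda)                 # 11.3 in Table 1
if torch.cuda.is_available():
    print("GPU    :", torch.cuda.get_device_name(0))  # RTX 3090 in Table 1
```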
Table 2. Quantitative evaluation indexes of Cases 1 and 2.
Case 1: Rows 1–2 in Figure 5 (Landsat OLI data)
Dataset | Method | RMSE | SSIM | ERGAS | SAM | PSNR | CC | UIQI
Row 1 | AWTC | 0.555 | 0.701 | 4.077 | 21.822 | 25.457 | 0.519 | 0.486
Row 1 | ST-Tensor | 0.200 | 0.904 | 2.424 | 8.132 | 34.595 | 0.897 | 0.892
Row 1 | FMTC | 0.228 | 0.907 | 2.570 | 8.756 | 33.575 | 0.885 | 0.879
Row 1 | STGAN | 0.621 | 0.690 | 4.341 | 26.769 | 24.421 | 0.561 | 0.485
Row 1 | Ours | 0.109 | 0.960 | 2.306 | 6.606 | 40.509 | 0.941 | 0.934
Row 2 | AWTC | 0.265 | 0.864 | 3.681 | 17.121 | 34.283 | 0.695 | 0.668
Row 2 | ST-Tensor | 0.125 | 0.960 | 2.738 | 8.782 | 39.905 | 0.920 | 0.915
Row 2 | FMTC | 0.137 | 0.950 | 2.928 | 9.328 | 38.891 | 0.914 | 0.904
Row 2 | STGAN | 0.331 | 0.843 | 4.306 | 18.037 | 31.938 | 0.736 | 0.683
Row 2 | Ours | 0.081 | 0.978 | 2.192 | 6.294 | 43.667 | 0.960 | 0.959

Case 1: Rows 3–4 in Figure 5 (Sentinel MSI data)
Dataset | Method | RMSE | SSIM | ERGAS | SAM | PSNR | CC | UIQI
Row 3 | AWTC | 1.156 | 0.730 | 4.244 | 24.799 | 19.169 | 0.163 | 0.118
Row 3 | ST-Tensor | 0.126 | 0.973 | 1.488 | 3.135 | 37.979 | 0.925 | 0.921
Row 3 | FMTC | 0.145 | 0.949 | 1.572 | 3.429 | 36.783 | 0.909 | 0.907
Row 3 | STGAN | 0.418 | 0.903 | 2.585 | 8.919 | 27.836 | 0.499 | 0.492
Row 3 | Ours | 0.107 | 0.975 | 1.355 | 2.656 | 39.459 | 0.948 | 0.941
Row 4 | AWTC | 0.998 | 0.758 | 4.002 | 21.674 | 20.385 | 0.276 | 0.243
Row 4 | ST-Tensor | 0.131 | 0.942 | 1.495 | 3.144 | 37.721 | 0.960 | 0.960
Row 4 | FMTC | 0.157 | 0.929 | 1.642 | 3.603 | 36.110 | 0.947 | 0.945
Row 4 | STGAN | 0.319 | 0.901 | 2.308 | 6.846 | 30.091 | 0.813 | 0.810
Row 4 | Ours | 0.093 | 0.970 | 1.263 | 2.271 | 40.638 | 0.979 | 0.979

Case 2: Rows 5–6 in Figure 5 (Landsat OLI data)
Dataset | Method | RMSE | SSIM | ERGAS | SAM | PSNR | CC | UIQI
Row 5 | AWTC | 0.104 | 0.969 | 2.048 | 5.350 | 41.264 | 0.970 | 0.970
Row 5 | ST-Tensor | 0.094 | 0.984 | 1.822 | 4.239 | 43.587 | 0.974 | 0.973
Row 5 | FMTC | 0.094 | 0.979 | 1.968 | 4.922 | 42.020 | 0.978 | 0.977
Row 5 | STGAN | 0.124 | 0.968 | 2.506 | 7.552 | 38.408 | 0.957 | 0.955
Row 5 | Ours | 0.055 | 0.990 | 1.578 | 3.158 | 46.063 | 0.991 | 0.991
Row 6 | AWTC | 0.096 | 0.974 | 2.043 | 5.542 | 42.309 | 0.971 | 0.971
Row 6 | ST-Tensor | 0.090 | 0.987 | 2.059 | 5.621 | 42.279 | 0.969 | 0.969
Row 6 | FMTC | 0.077 | 0.983 | 1.804 | 4.307 | 44.611 | 0.983 | 0.982
Row 6 | STGAN | 0.105 | 0.976 | 2.325 | 6.948 | 40.179 | 0.949 | 0.947
Row 6 | Ours | 0.037 | 0.996 | 1.308 | 2.280 | 49.995 | 0.995 | 0.995

Case 2: Rows 7–8 in Figure 5 (Sentinel MSI data)
Dataset | Method | RMSE | SSIM | ERGAS | SAM | PSNR | CC | UIQI
Row 7 | AWTC | 0.119 | 0.974 | 1.441 | 2.898 | 38.646 | 0.949 | 0.948
Row 7 | ST-Tensor | 0.112 | 0.983 | 1.418 | 2.869 | 39.038 | 0.954 | 0.954
Row 7 | FMTC | 0.105 | 0.978 | 1.373 | 2.719 | 39.612 | 0.959 | 0.958
Row 7 | STGAN | 0.189 | 0.966 | 1.836 | 4.539 | 34.529 | 0.875 | 0.870
Row 7 | Ours | 0.075 | 0.989 | 1.144 | 1.888 | 44.664 | 0.978 | 0.978
Row 8 | AWTC | 0.091 | 0.977 | 1.289 | 2.369 | 41.004 | 0.931 | 0.929
Row 8 | ST-Tensor | 0.053 | 0.990 | 0.990 | 1.411 | 45.612 | 0.976 | 0.976
Row 8 | FMTC | 0.056 | 0.987 | 1.017 | 1.503 | 45.162 | 0.959 | 0.958
Row 8 | STGAN | 0.108 | 0.978 | 1.405 | 2.033 | 39.596 | 0.953 | 0.951
Row 8 | Ours | 0.041 | 0.994 | 0.890 | 1.068 | 48.001 | 0.986 | 0.986
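For reference, the per-image indexes above can be computed from the ground truth and the reconstructed result as in the NumPy sketch below. The formulas follow the cited definitions (ERGAS [58], SAM [59]); SSIM [57] and UIQI [60] are omitted for brevity. This is not the authors' evaluation code, and the peak value used for PSNR depends on how the data are scaled.

```python
# Hedged sketch of common quality indexes for a reconstructed region.
import numpy as np

def rmse(x, y):
    return float(np.sqrt(np.mean((x - y) ** 2)))

def psnr(x, y, peak=1.0):
    # Peak value is data-dependent (reflectance scaling, bit depth, etc.).
    return float(20 * np.log10(peak / rmse(x, y)))

def sam(x, y, eps=1e-12):
    # Mean spectral angle (degrees) between per-pixel spectra; x, y: (H, W, bands).
    dot = np.sum(x * y, axis=-1)
    norm = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1) + eps
    angles = np.arccos(np.clip(dot / norm, -1.0, 1.0))
    return float(np.degrees(angles).mean())

def cc(x, y):
    # Correlation coefficient over all pixels and bands.
    return float(np.corrcoef(x.ravel(), y.ravel())[0, 1])

def ergas(x, y, ratio=1.0):
    # ERGAS: 100 * ratio * sqrt(mean over bands of (RMSE_b / mean_b)^2).
    bands = x.shape[-1]
    terms = [(rmse(x[..., b], y[..., b]) / (np.mean(y[..., b]) + 1e-12)) ** 2
             for b in range(bands)]
    return float(100.0 * ratio * np.sqrt(np.mean(terms)))

truth = np.random.rand(64, 64, 4)
recon = truth + 0.01 * np.random.randn(64, 64, 4)
print(rmse(recon, truth), psnr(recon, truth), sam(recon, truth),
      cc(recon, truth), ergas(recon, truth))
```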
Table 3. Quantitative evaluation indexes of Case 3.
Case 3: Sentinel with small temporal differences (Figure 8)
Dataset | Method | RMSE | SSIM | ERGAS | SAM | PSNR | CC | UIQI
Figure 8 | MNSPI | 0.074 | 0.980 | 1.204 | 2.086 | 42.804 | 0.981 | 0.981
Figure 8 | WLR | 0.073 | 0.986 | 1.381 | 2.588 | 42.862 | 0.991 | 0.991
Figure 8 | STGAN | 0.119 | 0.968 | 1.543 | 2.976 | 38.634 | 0.965 | 0.962
Figure 8 | ST-Tensor | 0.085 | 0.979 | 1.297 | 2.404 | 41.562 | 0.975 | 0.975
Figure 8 | Ours | 0.048 | 0.991 | 0.969 | 1.350 | 46.607 | 0.992 | 0.992

Case 3: Sentinel with large temporal differences (Figure 9)
Dataset | Method | RMSE | SSIM | ERGAS | SAM | PSNR | CC | UIQI
Figure 9 | MNSPI | 0.153 | 0.964 | 1.949 | 5.275 | 36.505 | 0.944 | 0.943
Figure 9 | WLR | 0.159 | 0.960 | 1.983 | 5.481 | 36.172 | 0.940 | 0.940
Figure 9 | STGAN | 0.239 | 0.952 | 2.344 | 5.745 | 33.159 | 0.938 | 0.928
Figure 9 | ST-Tensor | 0.162 | 0.960 | 2.083 | 5.754 | 35.917 | 0.935 | 0.933
Figure 9 | Ours | 0.109 | 0.974 | 1.648 | 3.797 | 39.417 | 0.971 | 0.970
Table 4. Ablation results of loss function.
Feature Loss | Spectrum Loss | Vision Loss | Test Loss (MSE) ↓
0.009213
0.069922
0.010431
0.009114
0.008979
0.009496
0.008767
✓ indicates that the loss function contains this component.
Table 5. Ablation results of skip connections in each component.
Generator (Encoder | Decoder) | Discriminator | Training Loss ↓
0.041925
0.041048
0.016650
0.043720
0.043781
0.015877
✗ indicates a module without a skip connection.
Table 6. Ablation results of the scale factors in the discriminator.
Dataset | Scale Factors in the Discriminator (0.5× | 0.25×) | Training Loss ↓
Landsat 0.016792
0.016534
0.015877
Sentinel 0.020759
0.020710
0.020551
✓ indicates that the discriminator contains this scale.
Table 7. Computational complexity analysis of compared methods.
Dataset | AWTC | ST-Tensor | FMTC | STGAN | FSTGAN
Landsat | 772.84 s | 1228.51 s | 335.22 s | 115.835 M / 8.55 h / 0.53 s | 3.678 M / 5.4 h / 0.17 s
Sentinel | 390.45 s | 592.46 s | 190.13 s | 115.802 M / 35.7 h / 0.56 s | 3.673 M / 33.7 h / 0.11 s
Traditional methods (AWTC, ST-Tensor, FMTC) are reported as inference time; deep learning methods (STGAN, FSTGAN) are reported as parameters / training time / inference time.
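The deep learning columns are typically obtained by counting model parameters and timing a single forward pass, as in the hedged sketch below; the small network is only a placeholder standing in for STGAN/FSTGAN.

```python
# Sketch of parameter counting and single-image inference timing (placeholder model).
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(16, 64, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(64, 4, 3, padding=1))       # placeholder network
print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.3f} M")

x = torch.rand(1, 16, 512, 512)   # e.g., target plus three 4-band references, stacked
model.eval()
with torch.no_grad():
    model(x)                                      # warm-up pass
    start = time.perf_counter()
    model(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()                  # flush GPU kernels if a GPU is used
    print(f"Inference time: {time.perf_counter() - start:.3f} s")
```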
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
