1. Introduction
Satellite remote sensing images capture Earth’s surface information via sensors mounted on satellites orbiting in space. These images are pivotal for a broad spectrum of Earth observation and monitoring applications. However, the presence of clouds can significantly interfere with the imaging process. Specifically, thick cloud cover can obscure the Earth’s surface, making it challenging for satellites to gather accurate data. Addressing and mitigating the impact of cloud cover remains a formidable challenge.
In the context of thick cloud removal, synthetic aperture radar (SAR) has demonstrated significant potential to provide clear ground information, leading to its widespread adoption in this area. Various techniques utilizing SAR ground guidance data have been developed to enhance the efficiency of thick cloud removal operations. SAR data is particularly valuable because it provides high-resolution surface information independent of weather and lighting conditions. Bermudez et al. [
1] aim to convert SAR spectra into RGB spectra using a generative adversarial network (GAN) [
2]. Conversely, DSen2-CR [
3] suggests employing residual networks [
4] to merge SAR spectra with cloudy images, thereby directly integrating ground information into the data. Another innovative method [
5] uses the extensive interaction capabilities of transformers [
6] to extract SAR spectral information. GLF-CR [
7] introduces a novel approach by using SAR as a guide to merge contextual information between cloudy images. Additionally, UnCRtainTS [
8] employs uncertainty quantization to enhance the effectiveness of thick cloud removal tasks.
Recent advances in diffusion models [
9] have shown state-of-the-art performance across various computer vision domains, establishing a new paradigm. DiffCR [
10] achieves high-performance cloud removal for optical satellite images by combining conditional guided diffusion with deep convolutional networks. This method substantially enhances image generation quality while keeping both parameters and computational complexity low. EDiffSR [
11] combines the robust feature extraction capabilities of U-Net with the generative potential of diffusion, offering a promising new approach in the field of remote sensing image super-resolution. In this paper, we introduce a thick cloud removal model named SAR-DeCR, based on transformer [
6] and diffusion methods [
9]. Our model comprises three key modules: coarse cloud removal, SAR-Fusion, and cloud-free diffusion. Specifically, these modules are designed to (1) integrate SAR information into the cloudy image through the coarse cloud removal (CCR) module, removing all cloud cover and restoring accurate color information while preserving the basic image structure. This integration utilizes the Swin transformer [
12] for feature fusion and a spatial attention module (SAM) [
13] for identifying cloud locations. Compared to convolutional neural networks (CNNs), the Swin transformer exhibits superior capabilities in pixel reconstruction, particularly with regard to complex image details and long-range dependencies. Its hierarchical structure and shifted window mechanism enable more efficient feature extraction and contextual information capture, producing more realistic image reconstruction results. (2) The SAR-Fusion (SAR-F) module employs cloud attention to selectively incorporate SAR information into the coarse cloud removal output. Specifically, we apply SAR information only to regions affected by clouds while leaving the rest unchanged. (3) The cloud-free diffusion (CF-D) module focuses on image texture reconstruction, using stabilized pre-trained weights from an extensive dataset and duplicating the encoder and middle layers of the U-Net [
14] to create new channels. These duplicated weights are updateable and guide the direction of the image generation process. In a typical denoising diffusion probabilistic model (DDPM) [
9], each denoising step entails considerable computation and adjustment to achieve optimal image quality. ControlNet [
14] substantially alleviates this computational burden by using conditional information to delineate a clear generation pathway.
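To make this conditioning scheme concrete, the sketch below shows a minimal ControlNet-style module in PyTorch. It assumes a generic frozen encoder and a condition tensor matched to the noisy image shape; it is illustrative only, not the exact SAR-DeCR implementation.

```python
import copy
import torch
import torch.nn as nn

class ControlledDenoiser(nn.Module):
    """Sketch of ControlNet-style conditioning: a frozen pre-trained encoder is
    duplicated into a trainable copy whose features re-enter through a
    zero-initialized convolution, so training starts from the identity mapping."""

    def __init__(self, encoder: nn.Module, feat_channels: int):
        super().__init__()
        self.frozen_encoder = encoder
        for p in self.frozen_encoder.parameters():
            p.requires_grad_(False)                    # keep pre-trained weights fixed

        self.control_encoder = copy.deepcopy(encoder)  # trainable duplicate
        self.zero_conv = nn.Conv2d(feat_channels, feat_channels, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)          # zero init: no effect at step 0
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, x_noisy: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        # condition is assumed to have the same shape as x_noisy (e.g., the fused SAR/optical image)
        base_feat = self.frozen_encoder(x_noisy)                # frozen pathway
        ctrl_feat = self.control_encoder(x_noisy + condition)   # conditioned pathway
        return base_feat + self.zero_conv(ctrl_feat)            # guided features
```

Because the injection convolution starts at zero, the pre-trained denoiser’s behavior is unchanged at the beginning of training, and the conditional pathway is learned gradually.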
The final high-definition cloud-free images are produced by integrating the modules described above and processing thick-cloud images together with SAR data. A comprehensive comparison with current state-of-the-art (SOTA) thick cloud removal methods shows that our approach achieves the highest performance to date. Additionally, we conducted ablation experiments on the three stages of the network: coarse cloud removal, SAR-Fusion, and cloud-free diffusion. The results of these experiments, alongside various other tests and visualizations, confirm that our method surpasses other techniques in effectively removing thick clouds.
Below, we summarize the principal contributions of this work:
- 1. We introduce SAR-DeCR, an innovative network that incorporates diffusion models into the field of thick cloud removal.
- 2. We propose an attention module designed to facilitate pixel-space feature extraction. Unlike previous image reconstruction networks, this cloud attention module provides the network with precise location information about the clouds, enabling the Swin transformer to extract features more effectively. A binarized cloud map, derived from the cloud attention map, is used to infuse the rich ground information from SAR into the image (see the sketch after this list).
- 3. Our experiments demonstrate the superiority of our approach over other SAR-guided cloud removal methods. In addition, our proposed model is the first application of diffusion modeling in the realm of SAR-guided thick cloud removal, providing valuable insights for future research.
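As referenced in contribution 2, the following minimal sketch shows how a binarized cloud map can be used to inject SAR information only into cloud-covered regions. The threshold value and tensor shapes are illustrative assumptions, not the exact SAR-DeCR configuration.

```python
import torch

def fuse_sar_into_cloudy_regions(coarse_rgb: torch.Tensor,
                                 sar_feat: torch.Tensor,
                                 cloud_attention: torch.Tensor,
                                 threshold: float = 0.5) -> torch.Tensor:
    """Binarize a cloud attention map and inject SAR-derived information
    only where clouds are detected, leaving clear regions untouched.

    coarse_rgb:      (B, 3, H, W) output of the coarse cloud removal stage
    sar_feat:        (B, 3, H, W) SAR information projected to image space
    cloud_attention: (B, 1, H, W) attention map with values in [0, 1]
    """
    cloud_mask = (cloud_attention > threshold).float()   # 1 where cloud, 0 elsewhere
    return cloud_mask * sar_feat + (1.0 - cloud_mask) * coarse_rgb
```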
2. Related Works
Cloud Removal. Removing clouds from remote sensing images using only visible or single-image inputs is challenging because clouds affect pixel values in complex and variable ways. Traditional methods and unsupervised convolutional neural network (CNN) techniques address this issue by incorporating established priors that characterize cloud properties. Common priors include the linear prior [
15], low-frequency prior [
16,
17], dark channel prior (DCP) [
17,
18,
19,
20], signal-to-noise ratio prior [
21], and model prior [
22]. For instance, Li et al. [
17] developed a sphere model to mitigate noise in remote sensing images by enhancing the DCP. Building on the assumptions of DCP, Shen et al. [
19] introduced a spatio-spectral adaptive haze removal method that utilizes variables such as global non-uniform atmospheric light, bright pixel index, and image gradients.
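For reference, the dark channel prior underlying several of these methods is commonly stated as follows; this is the standard formulation from the haze-removal literature rather than a result specific to the works cited above.

```latex
% Dark channel of an image J over a local patch \Omega(x):
J^{\mathrm{dark}}(x) = \min_{c \in \{r,g,b\}} \Bigl( \min_{y \in \Omega(x)} J^{c}(y) \Bigr),
\qquad J^{\mathrm{dark}}(x) \rightarrow 0 \ \text{for haze-free outdoor regions.}
```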
In the past decade, unsupervised CNN techniques for cloud removal have been widely studied. These techniques work by implicitly processing the top-of-atmosphere (TOA) reflectance in observed remote sensing images [
22,
23,
24]. By employing machine learning on a single cloud-covered image, these methodologies have demonstrated empirical efficacy in haze removal. It is crucial to recognize, however, the distinct differences between thick clouds, thin clouds, and haze. While satellite sensors can still capture some ground information through reflected light in the presence of thin clouds or haze, thick clouds completely obscure the ground, resulting in a total loss of information. Consequently, the removal of thick clouds poses a more formidable challenge and has captured significant attention in recent cloud removal research. Recent advancements include the curvature-based cloud removal method proposed by Yu et al. [
25], which focuses on reconstructing ground information obscured by clouds. The LRRSSN framework [
26] combines model-driven and data-driven approaches, eliminating the need for paired images, auxiliary data, or cloud masks required by traditional deep learning methods. The DiffCR system [
10] utilizes a conditional guided diffusion model coupled with a deep convolutional network to achieve high-performance cloud removal for optical satellite images. To maintain a strong similarity between the input image and the generated output, DiffCR uses a decoupled encoder for extracting conditional image features. The HyA-GAN model [
27] incorporates channel and spatial attention mechanisms into a GAN. This integration enhances the network’s ability to focus on critical areas. As a result, it improves data recovery and cloud removal performance. GLTF-Net [
28] divides thick cloud removal into two phases, global multi-temporal feature fusion (GMFF) and local single-temporal information recovery (LSIR), recovering the information of thick-cloud regions in a single-temporal image by fusing multi-temporal global features. FSTGAN [
29] proposes a flexible spatio-temporal deep learning framework based on generative adversarial networks that uses reference images from any three temporal phases for thick cloud removal and employs a three-stage encoder structure, enhancing the model’s ability to adapt to reference images with large temporal differences.
When ground information is obscured, SAR images offer a reliable alternative. They are unaffected by cloud cover and precipitation, ensuring dependable data in various environmental conditions. The TCR network [
30] exploits shared characteristics between SAR and optical images to transform SAR data into optical-like images, effectively restoring information that was obscured by clouds. It further refines the de-clouded image by optimizing it against cloud-covered images. Former-CR [
5] utilizes SAR images together with cloud-obscured optical images to directly reconstruct cloud-free images and also designs a new loss function to enhance the overall structure and visual quality of the reconstructed images. MSGCA-Net [
31] proposes a multilayered SAR-guided contextual attention network, introducing an SAR-guided contextual attention (SGCA) module to fuse reliable global structural information from SAR images with local feature information from optical images.
Diffusion Processes. In recent years, significant advancements have been made in generative modeling through the development of diffusion models. Latent diffusion models (LDM) [
32] conduct diffusion steps within the latent image space [
33], thereby reducing computational costs. A diffusion model consists of two main components: the forward process and the reverse process. Specifically, the forward process gradually converts the data distribution into a latent variable distribution through a fixed Markov chain. In contrast, the reverse process seeks to transform the latent variable distribution back into the original data distribution, thereby restoring the original data and revealing the underlying distribution. However, generating high-quality samples requires many iterative denoising steps. DDIM [
34] accelerates the sampling process by implementing a non-Markovian diffusion mechanism. ControlNet [
14] enhances control by adding multiple auxiliary conditioned paths to pre-trained diffusion models. VQ-GAN [
33] combines the inductive biases of CNNs with the expressive capabilities of transformers to efficiently model and synthesize high-resolution images. SeqDMs [
35] combines information from auxiliary modalities, such as SAR, which is unaffected by clouds, with optical satellite images. It reverses the diffusion model process, integrating sequence information from both primary and auxiliary modalities over time. Diffusion Enhancement (DE) [
36] progressively restores image texture details using reference visual priors to improve inference accuracy. Additionally, a weight assignment (WA) network has been developed to dynamically adjust feature fusion weights, thus enhancing the performance of super-resolution image generation. DDPM-CR [
37] extracts multi-scale features using a denoising diffusion probabilistic model (DDPM) and combines them with an attention mechanism for cloud removal. It also designs a cloud-aware loss function that integrates high- and low-frequency information with cloud-region characteristics.
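For completeness, the forward and reverse processes referred to above take the standard DDPM form [9]; the notation below is the conventional one and is not specific to any of the cited cloud removal models.

```latex
% Forward (noising) process: a fixed Markov chain with variance schedule \beta_t
q(x_t \mid x_{t-1}) = \mathcal{N}\!\bigl(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\bigr)

% Reverse (denoising) process: learned Gaussian transitions with parameters \theta
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\bigl(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\bigr)
```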
Despite some diffusion-based methods demonstrating effectiveness, they are typically inefficient and require thousands of sampling steps. Thus, this study introduces a novel and rapid cloud removal framework that significantly reduces the number of sampling steps, employing a ControlNet to direct the generation process and thereby achieving both enhanced fidelity and faster generation.
5. Conclusions
We introduce a method for removing thick clouds from satellite remote-sensing imagery, termed SAR-DeCR. This approach is designed to restore high-quality images guided by SAR ground information, ensuring accurate retrieval of geographic information. Our method begins with coarse cloud removal (CCR), which integrates the thick-cloud image and the SAR image to eliminate the cloud obstructions that typically obscure ground information. In CCR, the Swin transformer serves as the feature extractor, and the spatial attention module provides the Swin transformer with the approximate location of the clouds. Subsequently, we introduce the SAR-Fusion (SAR-F) module to refine the initial output of the coarse cloud removal and to strengthen the influence of SAR data. SAR-F incorporates SAR data into cloud positions in an unsupervised manner, thereby augmenting the guidance provided by SAR information. Lastly, to capitalize on the powerful generative capabilities of large vision-text models, using the fused images as conditional supervision, we develop a high-fidelity information reconstruction module based on the diffusion model, called cloud-free diffusion (CF-D). CF-D is specifically designed to preserve accurate SAR ground information and to complement the remaining high-frequency information. Our experiments clearly demonstrate the superiority of our approach over other SOTA SAR-guided methods. Notably, our proposed model is the first application of diffusion modeling in the realm of SAR-guided thick cloud removal, providing valuable insights for future research.