Denoising Diffusion Probabilistic Feature-Based Network for Cloud Removal in Sentinel-2 Imagery

Cloud contamination is a common issue that severely reduces the quality of optical satellite images in remote sensing fields. With the rapid development of deep learning technology, cloud contamination is expected to be addressed. In this paper, we propose Denoising Diffusion Probabilistic Model-Cloud Removal (DDPM-CR), a novel cloud removal network that can effectively remove both thin and thick clouds in optical image scenes. Our network leverages the denoising diffusion probabilistic model (DDPM) architecture to integrate both clouded optical and auxiliary SAR images as input to extract DDPM features, providing significant information for missing information retrieval. Additionally, we propose a cloud removal head adopting the DDPM features with an attention mechanism at multiple scales to remove clouds. To achieve better network performance, we propose a cloud-oriented loss that considers both high- and low-frequency image information as well as cloud regions in the training procedure. Our ablation and comparative experiments demonstrate that the DDPM-CR network outperforms other methods under various cloud conditions, achieving better visual effects and accuracy metrics (MAE = 0.0229, RMSE = 0.0268, PSNR = 31.7712, and SSIM = 0.9033). These results suggest that the DDPM-CR network is a promising solution for retrieving missing information in either thin or thick cloud-covered regions, especially when using auxiliary information such as SAR data.


Introduction
Optical remote sensing images provide high spatial resolution and easily interpretable representations, making them critical data sources in various fields such as vegetation monitoring [1], land use mapping [2], and ecological change detection [3]. However, clouds and cloud shadows cover 55% of the Earth's surface, affecting image quality and hindering the recording of spectral information of ground objects [4]. The retrieval of missing spectral information in optical remote sensing images is essential to maintain data integrity and ensure the success of subsequent applications.
In recent decades, several methods have been proposed for cloud removal in optical imagery, which can be categorized into spatial-based, multispectral, and multitemporal methods [5]. Spatial-based methods recover missing spectral information by using other parts of the image data, assuming that cloud-covered and cloud-free regions have identical statistical distributions. Interpolation methods [6,7], propagated diffusion methods [8], variation-based methods [9], and exemplar-based methods [10] have been developed to remove clouds in optical images. Spatial-based methods do not rely on additional information but on cloud-free parts of the same image. Although superior results can be achieved when a small percentage of the image is contaminated by clouds, a large percentage of thick cloud coverage can hinder these methods. Therefore, auxiliary data, such as SAR images [11], have been introduced to support cloud removal. The main contributions of this work are as follows:
1. The proposed DDPM-CR network effectively retrieves missing information based on multilevel deep features, allowing for the reuse of cloud-contaminated images.
2. The cloud-oriented loss function, which integrates the advantages of various loss functions, improves the performance and efficiency of network training.
This paper is structured as follows. Section 2 describes the structure of the proposed DDPM-CR network, the definition of the cloud-oriented loss, the data processing, and the training dataset and training details of this work. The ablation analyses and comparison experiments with baseline methods are presented in Section 3. Section 4 discusses the achieved results. Finally, Section 5 concludes this work.

Figure 1. The DDPM diagram. The upper row depicts the forward diffusion process; the lower row presents the reverse inference process. In the reverse inference process, green bars represent input/output layers, blue bars depict ResNet blocks, and pink bars depict attention layers, which are described in detail later.
Remote Sens. 2023, 15, 2217

The DDPM forward procedure involves a Markovian process that adds Gaussian noise to remote sensing images. In the context of the cloud removal task, the cloud-contaminated image serves as x_0 and provides the initial image distribution. To ensure adequate extraction of features for the retrieval of information from cloud-contaminated images, the DDPM network employs five input bands: three multispectral bands (red, green, and blue) and two SAR bands (VV and VH). Using the reparameterization technique depicted in Figure 1 as ①, the noise-added x_t can be obtained for any time t ∈ (0, T) based on given hyperparameters. Here, β_t controls the noise variance and β̄_t = ∏_{i=1}^{T} β_i, both distributed on the interval (0, 1). After T iterations, as illustrated in ②, noise-added results that obey an isotropic Gaussian distribution are generated.

The reverse inference process is also a Markov chain that aims to reconstruct the original images from the noise-added results after T iterations, as illustrated in ③ and ④ of Figure 1. This denoising procedure requires a deep network f_θ to model the parameters during the reverse inference. The parameters μ_θ(x_t, t) and σ²_θ(x_t, t) in ③ are the mean and variance acquired through network training. To this end, a U-Net-based [38] network is modified for reverse inference, as shown in Figure 2.

The main architecture for reverse inference is based on the widely used U-Net network, as shown in Figure 2. To meet the requirement of image recovery from noise and to provide precise deep features for the subsequent cloud removal procedure, the main modified part is the ResNet block. A noise affine layer is designed to linearly add noise to image features. Additionally, a swish layer is used, as it has been shown to outperform the ReLU layer in deeper architectures [39]. To concentrate the network on deep features at different dimensions as well as on various salient image regions, spatial and channel attention mechanisms are introduced for their efficiency and robustness, without an obvious increase in computational cost [40]. To avoid affecting the inference performance, the attention layers are placed in the bottleneck of the architecture. Finally, the original concatenated image can be reconstructed, and the objective function can be defined in a simplified form according to the work proposed by [37], where f_θ represents the above denoising model and x̃ is the noise-added image after the forward procedure.
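As an illustration of the forward process described above, the closed-form reparameterization can be sketched in PyTorch. The linear β schedule and the standard Ho et al. parameterization (the cumulative product of 1 − β_t) are assumptions for the sketch, not the paper's exact hyperparameters:

```python
import torch

def forward_diffusion(x0, t, betas):
    """Sample a noise-added x_t from a clean batch x0 at timesteps t.

    Uses the standard DDPM closed form: with alpha_t = 1 - beta_t and
    alpha_bar_t their cumulative product, x_t can be drawn in one step
    instead of iterating t noising steps.
    """
    alphas = 1.0 - betas                      # (T,)
    alpha_bar = torch.cumprod(alphas, dim=0)  # (T,)
    a = alpha_bar[t].view(-1, 1, 1, 1)        # broadcast over (B, C, H, W)
    noise = torch.randn_like(x0)
    xt = torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * noise
    return xt, noise

# Five input bands as in DDPM-CR: R, G, B plus SAR VV and VH,
# normalized to [-1, 1].
x0 = torch.rand(2, 5, 128, 128) * 2 - 1
betas = torch.linspace(1e-4, 0.02, 1000)      # a common linear schedule
xt, eps = forward_diffusion(x0, torch.tensor([10, 900]), betas)
```

At large t, alpha_bar approaches 0 and x_t becomes dominated by the Gaussian noise term, matching the isotropic-Gaussian endpoint described in the text.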

Cloud Removal Head
In this paper, we did not employ the DDPM reverse inference workflow to directly remove clouds, as retrieving missing spectral information for remote sensing images requires a more sophisticated architecture. Therefore, utilizing a well-trained DDPM network, we extracted diverse deep features to provide abundant information, which enabled us to achieve better cloud removal results via the subsequent cloud removal head. The extracted DDPM features are illustrated in Figure 3.

Figure 3. Deep features extracted from the pretrained DDPM network. The jet color map is used to display the feature representations: intermediate values are depicted in yellow and green, while high and low values are shown in red and blue, respectively. With different intermediate added noise, the DDPM focuses on various parts of the ground objects according to different activation values, which benefits the following cloud removal task.
The figure above depicts the deep features extracted from the pretrained DDPM network. The figure is divided into 3 rows and 5 columns: each row represents a different noise level added to the input images, and each column presents deep features at a particular scale. These noise levels were selected through trial and error to enhance the network's robustness to diverse imaging conditions. The extracted features at different scales, namely (h, w), (h/2, w/2), (h/4, w/4), (h/8, w/8), and (h/16, w/16), represent different characteristics of the image, from shallow edges and textures to global abstract representations, so that both large-scale and small-scale structures are well described. The multiscale feature representations generated by combining these features better capture the patterns in remote sensing images. Based on these deep features, we propose a cloud removal head to perform the cloud removal task, as shown in Figure 4.
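A minimal PyTorch sketch of one such multiscale fusion block (convolution + ReLU, attention, upsampling, with a final tanh projection) is given below. The channel sizes, the lightweight CBAM-style attention, and the `CRBlock` name are illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class CBAMLite(nn.Module):
    """Minimal channel + spatial attention, in the spirit of CBAM [40]."""
    def __init__(self, ch):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(
            nn.Conv2d(ch, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel(x)   # reweight channels
        return x * self.spatial(x)  # reweight spatial locations

class CRBlock(nn.Module):
    """One cloud-removal-head block: fuse a DDPM feature map, attend, upsample."""
    def __init__(self, in_ch, feat_ch, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch + feat_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True))
        self.attn = CBAMLite(out_ch)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)

    def forward(self, x, ddpm_feat):
        x = self.fuse(torch.cat([x, ddpm_feat], dim=1))
        return self.up(self.attn(x))

# One block at the coarsest (h/16, w/16) scale, with hypothetical channels.
block = CRBlock(in_ch=64, feat_ch=64, out_ch=32)
y = block(torch.randn(1, 64, 8, 8), torch.randn(1, 64, 8, 8))
# Final projection: tanh compresses outputs to (-1, 1), matching the
# normalized input range.
to_rgb = nn.Sequential(nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())
out = to_rgb(y)
```

Stacking one such block per extracted DDPM feature scale, from coarse to fine, reproduces the step-by-step upsampling structure the text describes.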
The figure above depicts the cloud removal inference procedure, where deep features extracted from the DDPM network at different noise levels are used as inputs to provide effective information describing the ground objects at various scales. The network is composed of several blocks, with the number of blocks corresponding to the number of extracted DDPM features. Each block consists of convolutional layers, ReLU layers, attention layers (the same as in the DDPM network), and upsampling layers used to recover the original size of the output image step by step. At the top of the network, a tanh layer compresses the output values to (−1, 1) to match the range of the normalized input images. Additionally, at each step, the network integrates all DDPM features with the processed features to retain refined image details and then passes them to attention layers to focus on recovering missing information in cloud-contaminated regions.

Cloud-Oriented Loss

The loss function used for training the DDPM network is described in Equation (1). However, the cloud removal head is the primary component responsible for removing clouds, and it functions similarly to image inpainting. Mean absolute error (MAE) loss is commonly employed in image inpainting networks; however, the cloud-contaminated pixels in remote sensing images have a high dynamic range and significant deviation, so a single MAE loss function is inadequate. To overcome this challenge, a cloud-oriented loss is proposed in this paper that incorporates three sub-loss functions, as follows:

The cloud-oriented loss, denoted as L_COL, is a combination of three sub-loss functions, namely, the modified MAE loss (L_mMAE), the perceptual loss (L_percep), and the attention loss (L_attention), weighted by λ_a, λ_b, and λ_c, respectively. The modified MAE loss accounts for both clouds and cloud shadows, and its definition depends on the validation accuracy metrics used, as depicted in the following equation, where G and P represent the target cloudless image patch and the cloud removal result obtained by the cloud removal head, respectively; ‖·‖₁ denotes the L1 (MAE) loss, and ⊙ represents the Hadamard product. The cloud and cloud shadow mask M is obtained from the cloud detection method proposed by Zhai et al. [41]. The weight between the cloud-free and cloud-contaminated regions is denoted by λ_M, which changes during the training procedure based on the validation accuracy metrics. When the accuracy is low, λ_M is set to a high value to enable the network to learn the overall characteristics of remote sensing images. As the network training progresses and the accuracy improves, λ_M is gradually decreased to focus the network on recovering missing information.
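A sketch of how such a mask-weighted L1 term could be implemented is shown below. The exact weighting form and the λ_M schedule are not fully specified in the text, so this split into cloud-free and cloud-contaminated terms is an assumption:

```python
import torch

def modified_mae(pred, target, cloud_mask, lam_m):
    """Mask-weighted L1 (mMAE) sketch.

    cloud_mask is 1 inside clouds/shadows and 0 elsewhere; lam_m down-weights
    the cloud-free term as validation accuracy improves (the schedule driving
    lam_m is hypothetical here).
    """
    m = cloud_mask.expand_as(pred)  # broadcast single-channel mask over bands
    l1 = (pred - target).abs()
    cloudy = (l1 * m).sum() / m.sum().clamp(min=1)
    clear = (l1 * (1 - m)).sum() / (1 - m).sum().clamp(min=1)
    return lam_m * clear + cloudy

# Toy check: top half of the patch is cloud-covered.
mask = torch.zeros(1, 1, 4, 4)
mask[..., :2, :] = 1
loss_zero = modified_mae(torch.zeros(1, 3, 4, 4), torch.zeros(1, 3, 4, 4),
                         mask, lam_m=0.5)
loss_one = modified_mae(torch.ones(1, 3, 4, 4), torch.zeros(1, 3, 4, 4),
                        mask, lam_m=0.5)
```

Normalizing each term by its own pixel count keeps the balance between regions independent of how much of the patch is clouded.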
The modified MAE loss function aims to reconstruct the color and texture information of the image patches; however, high-frequency information may be lost during training, leading to oversmoothed results. To address this issue, the cloud-oriented loss includes a perceptual loss term (L_percep), which leverages pretrained CNN networks to measure the difference between the generated image patch and its corresponding cloudless patch, enhancing the details of the cloud removal results [42]. Additionally, an attention loss is incorporated into the cloud-oriented loss, as defined by the following equation.
The attention loss plays a crucial role in the cloud-oriented loss: it integrates the features obtained from each attention layer in Figure 4 into a spatial attention matrix, which highlights the cloud and cloud mask regions at various scales as highly salient values, improving the overall performance of the cloud removal task. Here, ‖·‖₂² denotes the squared L2 loss, which drives the network to focus on the cloud removal task. After extensive experimentation, λ_a, λ_b, and λ_c are chosen as 0.8, 0.15, and 0.05, respectively, to balance the different remote sensing image representations.
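Under the definitions above, the attention loss and the final weighted combination might be sketched as follows. Pulling each attention map toward the resized cloud mask with a squared L2 penalty is one plausible reading of the description, not the authors' verified formulation:

```python
import torch
import torch.nn.functional as F

def attention_loss(attn_maps, cloud_mask):
    """Average squared-L2 distance between each attention map and the
    cloud mask resized to that map's scale (a sketch of L_attention)."""
    loss = 0.0
    for a in attn_maps:
        m = F.interpolate(cloud_mask, size=a.shape[-2:], mode="nearest")
        loss = loss + ((a - m) ** 2).mean()
    return loss / len(attn_maps)

def cloud_oriented_loss(l_mmae, l_percep, l_attn, lam=(0.8, 0.15, 0.05)):
    """Weighted sum of the three sub-losses, with the weights from the text."""
    return lam[0] * l_mmae + lam[1] * l_percep + lam[2] * l_attn

# Toy check: attention maps that exactly match the resized mask give zero loss.
mask = torch.rand(1, 1, 32, 32).round()
maps = [F.interpolate(mask, size=(s, s), mode="nearest") for s in (32, 16, 8)]
l_att = attention_loss(maps, mask)
total = cloud_oriented_loss(1.0, 1.0, 1.0)
```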

Experimental Data and Training Details
To train the DDPM-CR network, a large number of image patches are required, particularly for complex remote sensing image representations. In this study, we utilize Sentinel-1 and Sentinel-2 images for the cloud removal task due to their spatial-temporal resolution and accessibility [43,44]. The Sentinel-2 multispectral data capture spectral signals ranging from 443 nm to 2190 nm in 13 bands at different spatial resolutions: the visible and NIR bands have a spatial resolution of 10 m, while the other bands have spatial resolutions of 20 m and 60 m. The Sentinel-1 satellite, in turn, can acquire ground information under all weather conditions with several imaging modes; the most widely used is the interferometric wide swath (IW) mode, which captures VV- and VH-polarized image data.

To train the DDPM-CR network effectively under various cloud conditions, we use the open-source SEN12MS-CR dataset [45], which covers 175 globally distributed regions across the 4 seasons of 2018. The dataset comprises 122,218 image triplets (SAR, cloudy, and target cloudless images) with a size of 256 × 256. In this paper, we define thick cloud regions as image regions where over 60% of the spectral signal is completely obstructed by clouds, and thin cloud regions otherwise. We use the partition criteria in Table 1 for network training, validation, and testing.

In this work, the multimodal data sources, Sentinel-2 multispectral images and Sentinel-1 SAR images, were preprocessed to address their significant differences in value ranges and to eliminate the impact of anomalous and background pixels on the results. Anomalous value filtering and min-max normalization were utilized: the filtering procedure eliminated the effect of outlier pixel values on normalization, and the minimum and maximum values of the multispectral and SAR images were calculated by iterating through the entire dataset.
After preprocessing, the value range of each image data item was transformed to [−1, 1] to avoid the influence of different value dimensions. Additionally, random image cropping was employed during network training to randomly clip the original image patches into smaller-sized subsets. The generation of image patches at different locations in each iteration added robustness to the network. Furthermore, smaller-sized image patches resulted in fewer DDPM feature parameters, reducing the burden on GPU memory.
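A minimal sketch of this preprocessing pipeline is given below. The percentile-based outlier filter is an assumption, since the exact anomalous-value rule is not given in the text:

```python
import numpy as np

def clip_and_normalize(img, lo_pct=1, hi_pct=99):
    """Filter anomalous values by percentile clipping (hypothetical bounds),
    then min-max normalize to [-1, 1] as described in the text."""
    lo, hi = np.percentile(img, [lo_pct, hi_pct])
    img = np.clip(img, lo, hi)
    return 2.0 * (img - lo) / max(hi - lo, 1e-12) - 1.0

def random_crop(img, size=128, rng=None):
    """Randomly clip a (C, H, W) array to a size x size patch; a new
    location each call adds robustness during training."""
    if rng is None:
        rng = np.random.default_rng()
    _, h, w = img.shape
    i = rng.integers(0, h - size + 1)
    j = rng.integers(0, w - size + 1)
    return img[:, i:i + size, j:j + size]

rng = np.random.default_rng(0)
x = rng.random((5, 256, 256)) * 1e4   # 5-band patch with a large value range
y = clip_and_normalize(x)
patch = random_crop(y, size=128, rng=rng)
```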
The training procedure for the DDPM and the cloud removal head network is a two-step process. To ensure uniformity, all training image patches are randomly cropped to 128 × 128. For the DDPM network, a learning rate of 1 × 10⁻⁴ and an Adam optimizer with a cosine warmup schedule are used, and the weights are initialized with a Kaiming distribution to maintain the heterogeneity and stability of the weight parameters. The cloud removal head network, in contrast, initializes its weights from a normal distribution with µ = 0 and σ² = 0.02. It likewise uses the Adam optimizer, with β₁ = 0.5 and β₂ = 0.999, and an initial learning rate of 2 × 10⁻⁴ for the first 50 epochs, followed by a multistep decay with a decay rate of 0.92 for better convergence. Both networks are trained for a total of 200 epochs with a batch size of 1 due to GPU memory limitations.
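The cloud removal head's optimizer and schedule can be reproduced roughly as follows. Interpreting "multistep decay with a decay rate of 0.92" as a per-epoch 0.92 multiplier after epoch 50 is an assumption, as is the stand-in model:

```python
import torch

model = torch.nn.Conv2d(5, 3, 3, padding=1)   # stand-in for the CR head
opt = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.5, 0.999))
# Constant 2e-4 for the first 50 epochs, then multiply by 0.92 at every
# subsequent epoch boundary.
sched = torch.optim.lr_scheduler.MultiStepLR(
    opt, milestones=list(range(50, 200)), gamma=0.92)
```

Calling `sched.step()` once per epoch leaves the learning rate at 2 × 10⁻⁴ through epoch 49 and scales it by 0.92 from epoch 50 onward.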
The implementation platform is a workstation equipped with an Intel Core i9-10900k CPU @ 3.70 GHz, 32 GB of DDR4 2133 MHz RAM, and an NVIDIA GeForce RTX 3060 with 12 GB RAM. The operating system is Ubuntu 18.04. All networks were implemented with PyTorch 1.10.0 using Python 3.6. Please refer to the Supplementary Materials section for the implementation of DDPM-CR.
To quantitatively evaluate the performance of the cloud removal results, four metrics are used for analysis: MAE, RMSE, PSNR, and SSIM. Their definitions are listed as follows.
(1) Mean absolute error (MAE), where MAE is defined as the mean deviation between the inferred result and the target image. H and W denote the height and width, respectively; (i, j) represents a pixel location in the image patch, and P(i, j) and G(i, j) are the pixel values of the inferred and target images, respectively.
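The equation itself did not survive extraction; the standard definition, consistent with the symbols above, is:

```latex
\mathrm{MAE} = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}\bigl|P(i,j) - G(i,j)\bigr|
```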
(2) Root-mean-square error (RMSE), which is also a widely used pixel-level metric for the similarity between inferred and target images, computed as the square root of the mean squared residual error.
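The missing equation, in its standard form with the same symbols:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}\bigl(P(i,j) - G(i,j)\bigr)^{2}}
```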
(3) Peak signal-to-noise ratio (PSNR) PSNR is an objective assessment metric for image evaluation, and it reflects the image reconstruction quality. Within the equation definition, MSE is defined as the mean square error between inferred and target images. Generally, a PSNR value lower than 20 indicates poor reconstruction quality, and a PSNR value over 30 shows excellent reconstruction performance. PSNR is often used with the SSIM metric.
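The standard PSNR definition, consistent with the MSE described above, is the following, where L denotes the peak signal value:

```latex
\mathrm{PSNR} = 10\log_{10}\!\left(\frac{L^{2}}{\mathrm{MSE}}\right),
\qquad
\mathrm{MSE} = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}\bigl(P(i,j) - G(i,j)\bigr)^{2}
```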
(4) Structural similarity (SSIM). The SSIM metric is commonly used to measure the similarity between inferred and target images. It combines structure (s), luminance (l), and contrast (c) terms, as defined below.
where µ_P, µ_G, σ_P, σ_G, and σ_PG are the means, standard deviations, and covariance of the inferred and target images, respectively. C₁, C₂, and C₃ are constants introduced to avoid division by zero: C₁ = (K₁L)², C₂ = (K₂L)², and C₃ = C₂/2, with K₁ = 0.01 and K₂ = 0.03. L is determined by the grayscale range; for 16-bit Sentinel images, L = 65,535. To increase the execution efficiency, a Gaussian window is used to calculate the means, standard deviations, and covariance. The final SSIM definition is therefore listed as follows.
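With C₃ = C₂/2 folded in, the widely used simplified form, consistent with the constants above, is:

```latex
\mathrm{SSIM}(P, G) =
\frac{\bigl(2\mu_{P}\mu_{G} + C_{1}\bigr)\bigl(2\sigma_{PG} + C_{2}\bigr)}
     {\bigl(\mu_{P}^{2} + \mu_{G}^{2} + C_{1}\bigr)\bigl(\sigma_{P}^{2} + \sigma_{G}^{2} + C_{2}\bigr)}
```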

Ablation Study on SAR-Multispectral Multimodal Input Data
Most cloud removal methods use only optical images as data sources. However, optical spectral signals can hardly penetrate clouds, which makes cloud removal challenging: thick clouds obstruct the ground object spectral information recorded by the satellite sensors, so such methods must rely solely on the image representations neighboring the clouds to recover the missing information. By incorporating SAR images as complementary data, the missing information in thick cloud-covered regions is more likely to be recovered. To validate the impact of multimodal input data on cloud removal, an identical DDPM-CR network is trained without SAR images. The results are presented in Figure 5.
The effectiveness of SAR data is demonstrated in the above figure, which includes examples of city areas and farming land. City areas have complex textures, making it difficult to retrieve missing information, while farming land is more homogeneous in image representation but is affected by the seasonal aspect of plants. Both areas are covered with thick clouds and cloud shadows, and information on most image parts is missing except for the upper-left corner. The DDPM-CR network, with auxiliary SAR data, outperforms the network trained without SAR data in both areas. In the city area, (g) shows that missing information in cloud-covered regions is well recovered, whereas without auxiliary information, missing information cannot be well reconstructed, and the reconstruction parts have traces of clouds, as shown in (h). For farming land, the network without auxiliary data can only infer missing information using neighboring pixels, leading to failure in retrieving both cloud-covered and cloud-free regions, as shown in (j). Evaluation metrics of the cloud removal results are shown in (k) and (l). To reduce the difference in measurement units, MAE, RMSE, and SSIM metrics share a common axis, and PSNR uses an individual axis. In addition, MAE and RMSE are in small values, and the bars are much shorter than those of SSIM. From the two subfigures, SAR data contribute to better performance, with each accuracy metric higher than that of the non-SAR data network. The MAE and RMSE metrics are relatively close for specific regions, and significant improvements appear in the PSNR and SSIM of the farming land region, where the inferior cloud removal result without SAR data complement enlarges the gap between the result with SAR data. To further assess the effectiveness of SAR data, an evaluation table is generated with the test dataset of SEN12MS-CR, as listed in Table 2. 
Table 2 demonstrates that the DDPM-CR network trained with and without SAR data yields different results. Figure 5 also shows that the DDPM-CR network trained with SAR data outperforms the network without SAR data, as evidenced by the improved accuracy metrics. By incorporating SAR data as auxiliary data, the MAE and RMSE metrics decrease by 30.9% and 32.1%, respectively, while the PSNR and SSIM metrics increase by 12.8% and 21.9%, respectively. The SAR data are crucial in providing additional information to assist the network in retrieving the missing information. The DDPM-CR network cannot create cloud removal results without a basis, and clouds and cloud shadows have too many types to learn their image mapping onto actual ground objects, which can lead to the collapse of cloud removal results, such as the farming land result in Figure 5. By fully utilizing the spatial information of SAR images, which are unaffected by clouds and cloud shadows, the network can exploit more auxiliary information to retrieve the missing information.
The DDPM architecture integrates the advantages of both SAR and multispectral images effectively, and the cloud removal head assembles these deep features at various noise levels in a cascading manner to attain fine cloud removal results.

The Evaluation of Loss Functions on Cloud Removal
The performance of a deep network is greatly influenced by its loss function. In this study, we propose a cloud-oriented loss function to guide the training of the network, consisting of a modified mean absolute error (mMAE) loss, a perceptual loss, and an attention loss, all tailored for cloud removal. To evaluate its effectiveness, we compare the prediction results of the DDPM-CR network trained with different loss functions, namely, the modified MAE loss only, modified MAE with perceptual loss, modified MAE with attention loss, and the cloud-oriented loss. Before training, we determine the weights of the modified MAE loss in each comparison loss function to achieve optimal accuracy, with SSIM and PSNR as evaluation metrics, as shown in Figure 6.

The figure above illustrates the impact of different weight configurations of the loss functions on the SSIM and PSNR metrics of DDPM-CR. Different weight values lead to significant differences in the results, with both SSIM and PSNR rising and then falling as the weights increase in both (a) and (b). The peak values indicate the local optimal accuracy, implying that alternative weight values could be selected. In (a), the mMAE + perceptual loss function is depicted; the peak SSIM and PSNR values occur at weight values of 0.3 and 0.4, respectively. However, the weight value of 0.3 produces more artifacts in the cloud removal results; thus, 0.4 is selected as the weight of mMAE, and the weight of the perceptual loss is set to 0.6. In (b), the mMAE + attention loss function is shown, and the peak values of both SSIM and PSNR correspond to a weight value of 0.2. Therefore, the weight of mMAE is set to 0.2, and that of the attention loss to 0.8. The DDPM-CR network is trained with the chosen weights of the two loss functions, and a visual comparison of the cloud removal results is presented in Figure 7.
Subfigures (a1)-(e1) in Figure 7 show image patches heavily covered with thick clouds, which allows evaluating the influence of the loss functions on the network performance. Subfigures (a2)-(e2) depict the results of the DDPM-CR network trained with only the mMAE loss, which considers the image representations of cloud-free and cloud-contaminated regions. Although the missing information is recovered in all image scenes, the results lack detail, and some artifact pixels are visible, especially in regions with regular textures, such as the city region (d2).
As for the combined loss functions, subfigures (a3)-(e3), from the DDPM-CR network trained with mMAE + perceptual loss, show better visual effects with more delicate spatial information. This is due to the perceptual loss, which greatly improves the high-frequency information in the reconstruction results with the aid of features extracted by auxiliary networks, as evident in the result of (d3). However, in some homogeneous regions, such as farming lands, excessive enhancement artifacts exist. Subfigures (a4)-(e4) show the results of mMAE + attention loss, which performs better than mMAE loss alone and is similar to mMAE + perceptual loss. The attention loss applies features from the attention blocks to construct attention masks, which, together with the cloud masks, keep the training procedure under good guidance, leading to a good recovery of the missing information in both cloud-covered and cloud-free regions. Unlike the mMAE loss mechanism, the attention loss urges the network to focus more on information retrieval in cloud-covered regions. The reconstruction results have a more natural visual representation compared with the other two loss functions, reflecting the complementarity of mMAE and attention loss.
Subfigures (a5)-(e5) show the results of the cloud-oriented loss function, which integrates the advantages of the other three loss functions by considering the spectral information of both cloud-covered and cloud-free regions, as well as their detailed information. Its results are the most consistent with the target cloudless image patches in subfigures (a6)-(e6). Table 3 is also presented below to illustrate the superiority of the cloud-oriented loss function compared with the other loss functions.

Table 3. Quantitative evaluation of the loss functions on the performance of the DDPM-CR network. The metrics are mean values over the test dataset from SEN12MS-CR. The rows list, in order, the results of the DDPM-CR network trained with the mMAE loss alone, mMAE + perceptual loss, mMAE + attention loss, and the cloud-oriented loss.

As listed in the above table, since the same SEN12MS-CR test dataset is used, the metric values of the cloud-oriented loss are the same as those in Table 2. Specifically, the mMAE loss alone achieves the worst results among the loss functions, while the cloud-oriented loss achieves the best MAE. For RMSE, mMAE + attention loss achieves the best value, with the cloud-oriented loss in second place, indicating that the attention loss is a stable improvement for the cloud removal task. PSNR and SSIM provide a qualitative assessment of the reconstructed images, and the cloud-oriented loss significantly outperforms the other loss functions on both. These results demonstrate that the cloud-oriented loss effectively integrates the advantages of the other loss functions and yields more comprehensive results.
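The composition of the cloud-oriented loss described above can be sketched as follows. The weights `lam_a`, `lam_b`, `lam_c` and the exact form of the mMAE mixing are illustrative assumptions reconstructed from the text, not the paper's verbatim implementation.

```python
import numpy as np

def mmae_loss(pred, target, cloud_mask, lam_m=0.8):
    """Sketched mMAE: MAE computed separately over cloud-free and
    cloud-covered pixels, mixed by the weight lam_m (assumed form:
    larger lam_m emphasizes cloud-free regions)."""
    err = np.abs(pred - target)
    cloud = cloud_mask.astype(bool)
    mae_free = err[~cloud].mean() if (~cloud).any() else 0.0
    mae_cloud = err[cloud].mean() if cloud.any() else 0.0
    return lam_m * mae_free + (1.0 - lam_m) * mae_cloud

def cloud_oriented_loss(pred, target, cloud_mask,
                        perceptual_term, attention_term,
                        lam_a=1.0, lam_b=0.1, lam_c=0.1):
    """Cloud-oriented loss as a weighted sum of the three sub-losses.
    lam_a/lam_b/lam_c are hypothetical placeholder weights; the paper
    selects them by grid search (Table 5)."""
    return (lam_a * mmae_loss(pred, target, cloud_mask)
            + lam_b * perceptual_term
            + lam_c * attention_term)
```

The perceptual and attention terms are passed in as precomputed scalars here, since their internals depend on the auxiliary networks and attention blocks of the full architecture.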
Together with the analysis in Sections 3.1 and 3.2, the choice of appropriate loss functions, in addition to elaborately designed networks and auxiliary data such as SAR images, is crucial in improving the validation accuracy of the DDPM-CR network. Finally, the complete DDPM-CR network is compared with some baseline models.

Comparative Experiments with Baseline Models
The above analyses are crucial in evaluating the effectiveness of the DDPM-CR network on the SEN12MS-CR dataset. However, to ensure the practical applicability of the network, it is essential to evaluate its performance on additional study areas. In this study, we selected two areas, Wuhan and Erhai, as shown in Figure 8, for further evaluation of the DDPM-CR network. Erhai is located in Dali Bai Autonomous Prefecture, Yunnan Province, China, and is the second-largest freshwater lake in Yunnan [46]. Erhai is long and narrow in shape, its banks are steep, and the annual water temperature ranges from 12 to 21 °C. Due to its unique geographical characteristics, Erhai is frequently covered by clouds that severely affect crop-monitoring tasks. Wuhan lies in the middle and lower reaches of the Yangtze River and is a central city for transportation, industry, and science and education. Rapid urban expansion requires high-quality remote sensing images to assist in regulation.
However, the Yangtze River and abundant rainfall cause this area to be regularly covered by a high percentage of clouds. Regarding the image data, the SAR, cloudy, and cloudless images should be collected on adjacent dates to ensure that target cloudless images are available. For the Erhai area, the SAR and cloudy images were acquired on 13 August 2020 and the cloudless images on 27 August 2020; the corresponding images of Wuhan were acquired on 7 August 2022 and 17 August 2022. The images of the two study areas were split into 128 × 128 patches, with thin cloud and thick cloud regions kept separate, to build a dataset for evaluating the comparison methods.
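The patch preparation step can be sketched as below. Non-overlapping tiling is an assumption; the paper does not state whether incomplete border patches are padded or discarded, so this sketch simply discards them.

```python
import numpy as np

def split_into_patches(image, size=128):
    """Tile an H x W x C array into non-overlapping size x size patches.
    Incomplete border patches are discarded (a simplifying assumption)."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            patches.append(image[y:y + size, x:x + size])
    return patches
```

In practice the same tiling would be applied consistently to the SAR, cloudy, and cloudless images so that the patches stay co-registered.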
To evaluate the effectiveness of the DDPM-CR network, we compare it against a traditional method, RSdehaze [47], and four deep learning networks: DSen2-CR [24], pix2pix [48], GLF-CR [23], and SPA-CycleGAN [30]. The RSdehaze method is used to highlight the importance of auxiliary SAR images in cloud removal. GLF-CR uses a two-stream network that differs from DDPM-CR, while the other deep learning methods are similar in many ways to DDPM-CR. DSen2-CR is a ResNet-based network that takes cloud masks into account, while pix2pix and SPA-CycleGAN are generative models like DDPM-CR. SPA-CycleGAN also incorporates an attention mechanism and a corresponding attention loss. To ensure a fair comparison, all deep learning methods use SAR images as complementary data and are trained on the SEN12MS-CR dataset. The cloud removal results of the comparison methods on the two study areas are presented in Figures 9-12.

Under thin clouds, the spectral signal can penetrate the cloud layer to a certain extent, allowing it to be captured by satellite sensors. This phenomenon is similar to fog and can be partially corrected through radiometric correction during remote sensing preprocessing. In Figure 9, the comparison methods achieved good results in the Wuhan study area, with thin clouds being completely removed. However, clouds and cloud shadows have a greater impact on the RSdehaze algorithm, leading to reconstruction failure in cloud-covered regions. This highlights the importance of SAR images in cloud removal tasks.
The pix2pix network also faced difficulties in synthesizing multiple information types to generate the target cloudless images, which may explain its lower performance. In contrast, the DSen2-CR network achieved better visual results than the GLF-CR network, displaying more detailed building textures. In Figure 9a, SPA-CycleGAN performs well and generates results that are more consistent with the target image than DDPM-CR, while DDPM-CR achieves better results in the other regions.

Figure 12. Thick cloud removal results of the comparison methods in the Erhai study area. Image patches of thick clouds and targets are shown in the first and second rows. In the third row, columns (a-f) demonstrate the cloud removal results of each method. The color composition of the images is based on natural colors (R: band4, G: band3, and B: band2).
The first two rows in Figure 10 demonstrate that thick clouds completely obstruct the transmission of the spectral signal in the Wuhan study area, resulting in denser clouds and shadows compared with the thin cloud conditions. The RSdehaze method exhibits severe deterioration in its results, with color errors, a failure to recover detailed building information, and even cloud-like artifacts, particularly in region (b). The deep learning-based methods also produce degraded results, with a blurred appearance in the recovered image information within thick cloud regions.
Specifically, DSen2-CR and pix2pix achieve similar results with inferior visual effects, while GLF-CR outperforms the other comparison methods in regions (b) and (c), with finer color and detail owing to its two-stream architecture. The SPA-CycleGAN network demonstrates good results in regions (d) and (f), owing to the attention mechanism in its architecture, which plays a significant role in cloud locating and missing-information reconstruction. Overall, however, the DDPM-CR network outperforms the other methods, as it retrieves the majority of the spectral information under thick clouds and cloud shadows and produces the least vagueness in its cloud removal results.
In Figure 11, for the Erhai study area, the diversity of ground objects poses more challenges to the comparison methods than in Wuhan, as both natural and man-made objects are present. Additionally, cumulus clouds are widespread in the area, further hindering spectral information transmission compared with Wuhan. The denser clouds in Erhai result in more distorted cloud removal results and colors inconsistent with the target cloudless images when using the RSdehaze method. The deep learning-based methods, aided by auxiliary SAR images, can retrieve missing data in clouded areas and achieve good reconstruction results in cloud-free regions.
However, denser clouds and a diverse range of ground objects introduce additional complexity to the cloud removal task, and all the deep learning-based methods have limitations in accurately reconstructing cloud-contaminated regions. The pix2pix network exhibits poor generalization, with mosaic artifacts in cloud-contaminated regions. GLF-CR retrieves more image information in cloud regions than DSen2-CR, particularly in region (a), while DSen2-CR shows limited performance in this study area. The SPA-CycleGAN and DDPM-CR networks, which share similar principles, achieve comparable results, with DDPM-CR producing finer details in regions (a) and (c).
Figure 12 presents the thick cloud removal results of the comparison methods in the Erhai study area. In the first row of the figure, thick clouds occupy most of the scene, making it difficult to distinguish the image content. The RSdehaze method fails to produce reasonable results, and even the DSen2-CR network shows vague representations, although no residual thick clouds can be identified. The results obtained by the pix2pix network display numerous mosaic artifacts with a blurred appearance, and GLF-CR generates mosaic-like artifacts in (d) and (e). Although SPA-CycleGAN retrieves some of the missing information caused by thick cloud coverage, its results are still vague, similar to those of the GLF-CR network. In contrast, the proposed DDPM-CR network outperforms the other methods, producing cloud removal results with a better visual appearance, such as those shown in (d). The other methods fail to reconstruct the detailed information of the town region in the scene, while the DDPM-CR network recovers this information with less distortion.
Across the results of the comparison methods for various cloud thicknesses, the deep learning-based methods outperform the traditional RSdehaze method for two reasons. First, the RSdehaze method is designed specifically for fog and thin cloud removal, and its performance degrades severely under thick clouds. Second, it uses only the optical bands, which makes information recovery under thick clouds ineffective. Each network has its own area of expertise, and the above results indicate that the pix2pix network cannot be used directly for cloud removal tasks without modification. The two-stream architecture of GLF-CR does not show a significant improvement over one-stream architectures and performs similarly to DSen2-CR. SPA-CycleGAN is similar to the DDPM-CR network in both architecture and loss function definition, which contributes to its better cloud removal results. The proposed DDPM-CR network builds on the successful experience of previous works to improve its performance on the cloud removal task, and its inference results are less affected by cloud thickness than those of the other methods. Table 4 lists an evaluation of the comparison methods based on image patches from the Wuhan and Erhai study areas.
The table presented above shows the average accuracy values of the comparison methods under thin and thick cloud conditions in various study areas, with bold numbers highlighting the best accuracy metrics. Generally, the metric values of the Wuhan study area are higher than those of the Erhai study area due to the former's uniform types of ground objects. Conversely, the Erhai study area has more diverse ground objects and denser clouds, which affect both visual comparisons and evaluation metrics. Under thin cloud conditions in the Wuhan study area, all methods achieve good results, with deep learning-based methods having similar metric values as shown in Figure 9. The DDPM-CR network achieves the best MAE, PSNR, and SSIM metric values, while the DSen2-CR network has the best RMSE metric value. In the Erhai study area, the metric values of all methods decrease, and the SPA-CycleGAN network achieves the best MAE and RMSE metric values, while the DDPM-CR network performs the best in PSNR and SSIM metrics. For results under thick cloud conditions, the performance of all comparison methods decreases significantly, with nearly all accuracy metrics being affected. Notably, the DDPM-CR network achieves the best accuracy metrics, indicating its superior performance on various study areas with diverse cloud conditions. This can be attributed to the multilevel deep features from the DDPM architecture, which provide crucial information for the cloud removal head and further improve network performance.
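For reference, the per-patch accuracy metrics reported throughout can be computed with the standard definitions below, assuming images scaled to [0, 1]. SSIM requires a windowed computation and is typically taken from a library such as scikit-image, so it is omitted here.

```python
import numpy as np

def mae(pred, target):
    """Mean absolute error."""
    return float(np.mean(np.abs(pred - target)))

def rmse(pred, target):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, data_range]."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))
```

MAE and RMSE measure average spectral fidelity, while PSNR (and SSIM) better reflect the perceptual quality differences visible in Figures 9-12.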

Discussion
In this study, we selected the weight parameters for the mMAE, perceptual, and attention loss functions through a combination of grid search and visual comparisons. The top five parameter sets, ranked by PSNR, are presented in Table 5.
Based on the results in the above table, it is evident that λa, the weight assigned to the MAE loss function, remains dominant compared with the other two sub-loss functions in the cloud-oriented loss. This highlights the fundamental importance of the MAE loss function in the cloud removal task, while λb and λc play a subordinate role in achieving superior reconstruction results. Notably, λc is assigned a relatively higher weight, particularly in the last row of the table.

Within the cloud-oriented loss function, we also propose an mMAE loss function to aid the training of the network, which incorporates a dynamic adjustment strategy for the weight λm to balance the retrieval of missing information in clouded regions against the pixel mapping of cloudless regions. This loss function is more flexible than the attention loss. λm can be adjusted using two methods. The first method applies a linear adjustment with the number of epochs after the initial setting of λm = 0.8. During the first 50 epochs, the DDPM-CR network focuses on learning the global information of the training dataset with the attention loss function to stabilize training. Subsequently, λm decreases at a decay rate of 0.002 per epoch until training is complete. This procedure allows the network to gradually increase the proportion of information recovery in cloud-covered regions. The second method adjusts λm based on the validation SSIM metric. During the first 50 epochs, λm is likewise initialized to 0.8. For the next 150 epochs, if the best validation SSIM is lower than 0.7, λm remains unchanged so that the network continues learning the global representations of the whole dataset. If the validation SSIM is higher than 0.7 but lower than 0.85, λm is set to 0.6 to prioritize information recovery in cloud-covered regions. When the validation SSIM is higher than 0.85, λm is finally set to 0.5 to refine the image information of the cloud-covered regions. The difference between the two strategies is thus that λm decreases based on either the epoch number or the validation SSIM metric. We depict the training curves of validation PSNR and SSIM in Figure 13 to evaluate the performance of each strategy.

For the above experiments, the practical training epoch number is set to 200, and an additional 800 epochs are used to determine the optimal adjustment strategy for λm. As shown in subfigures (a) and (b), the network converges around epoch 200, but curve oscillation appears in the subsequent epochs, particularly for the red line, which experiences severe fluctuations. However, the curves also indicate that adjusting λm with the validation SSIM outperforms the strategy based on epoch numbers; in most training epochs, the blue line is above the red line for both the SSIM and PSNR metrics.
This can be explained by the fact that linearly decreasing λm with the epoch number may result in underfitting due to the frequently changing λm weight, thereby compromising training effectiveness. In contrast, the λm adjustment strategy based on the validation SSIM ensures better network convergence and a more stable training procedure, and the λm alteration strictly follows the DDPM-CR network's accuracy metric, resulting in better accuracy overall. Furthermore, λm controls the information retrieval of cloud-free and cloud-covered regions according to the mMAE loss function definition. To visually evaluate the effectiveness of the two λm adjustment strategies, we compare results on the same image patch, containing both cloud-free and cloud-covered regions, from the SEN12MS-CR dataset, as shown in Figure 14. Three typical regions are selected for a closer look: the left-most region is completely covered by clouds, the middle region is half covered by clouds, and the right-most region is cloudless. The cloud removal results achieved by λm decreasing with the validation SSIM and with the epoch number are presented in (d) and (e), respectively. For the region completely covered by clouds, (d) retrieves more abundant image details than (e), and the same holds for the region half covered by clouds. In the cloudless region, (d) also outperforms (e), with not only more completely retrieved ground objects but also finer image details. These observations suggest that the λm adjustment strategy based on the validation SSIM is more reasonable and that the validation SSIM provides a significant reference for the adjustment of the mMAE loss function.
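The two λm adjustment strategies described above can be sketched as simple schedule functions. The thresholds and rates come from the text; the floor of 0.5 on the epoch-based decay is an assumption for runs longer than 200 epochs.

```python
def lambda_m_by_epoch(epoch, warmup=50, init=0.8, decay=0.002):
    """Epoch-based strategy: hold lambda_m at its initial value for the
    warmup epochs, then decay it linearly by 0.002 per epoch. The floor
    of 0.5 is an assumption for runs longer than 200 epochs."""
    if epoch < warmup:
        return init
    return max(0.5, init - decay * (epoch - warmup))

def lambda_m_by_ssim(epoch, best_val_ssim, warmup=50, init=0.8):
    """SSIM-based strategy: keep lambda_m fixed until the best validation
    SSIM clears the thresholds given in the text (0.7 and 0.85)."""
    if epoch < warmup or best_val_ssim < 0.7:
        return init
    if best_val_ssim < 0.85:
        return 0.6
    return 0.5
```

Note that the SSIM-based schedule only changes λm when the model has measurably improved, which matches the more stable training curves observed in Figure 13.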
Most of the works discussed in Section 1 have used fixed loss functions such as MSE [20,21,26], MAE [23,24], and GAN-based [22,27-33] losses, along with their variations, to achieve satisfactory results. However, these loss functions may have limited generalization performance in complex image scenarios. In contrast, the proposed cloud-oriented loss integrates various loss functions to better suit the cloud removal task. By incorporating the λm adjustment strategy based on the validation SSIM into the mMAE loss, the proposed approach achieves optimal cloud removal results, demonstrating the effectiveness of the cloud-oriented loss under various cloud-coverage conditions, as illustrated in Figure 14.
The DDPM-CR model achieves high-quality cloud removal results for several reasons. Classic methods such as RSdehaze are limited by the image sensors and radiative transfer assumptions, making them ineffective under varying conditions. Deep learning-based methods require elaborate architectures to achieve high-quality results, especially GAN-based networks, among which SPA-CycleGAN outperforms pix2pix. The DDPM-CR architecture, with its abundant DDPM features and the proposed cloud removal head, can utilize more representative information for the cloud removal task than the DSen2-CR and GLF-CR architectures. Additionally, the cloud-oriented loss function has a positive effect on network training, allowing the network to adapt to complex cloud conditions. These factors contribute to the superior performance of DDPM-CR over the other methods.

Conclusions
Optical remote sensing images are often obstructed by clouds, which can lead to significant resource waste. This study proposes a novel network for cloud removal, trained on the SEN12MS-CR dataset, that has several advantages. First, it utilizes auxiliary SAR data and multilevel features from the DDPM architecture, providing abundant information for cloud removal and fine retrieval details with satisfactory validation accuracy. Second, the cloud removal head with the attention mechanism recovers missing information well at various scales. The proposed cloud-oriented loss balances the information recovery of cloud-covered and cloud-free regions, and the results of the DDPM-CR network trained with various loss functions demonstrate its effectiveness (MAE = 0.0301, RMSE = 0.0327, PSNR = 29.7334, and SSIM = 0.8943). To validate the generalization of DDPM-CR, the Wuhan study area, dominated by man-made ground objects, and the Erhai study area, with a complex ground object distribution, were used. Several methods were tested in these study areas under both thin and thick cloud conditions. The results show that DDPM-CR outperforms the traditional method and the other deep learning-based comparison methods both quantitatively and qualitatively, and it learns a complex mapping between Sentinel-2 optical images and Sentinel-1 SAR images with fewer artifacts and less distortion (MAE = 0.0229, RMSE = 0.0268, PSNR = 31.7712, and SSIM = 0.9033). Under various cloud conditions, the DDPM-CR network performs well on both regular image patterns, such as farmland and forests, and complex image patterns, such as urban areas. However, we also find blurred results within thick cloud-covered regions, so the DDPM-CR network still has room for improvement. Therefore, future studies will focus on optimizing the DDPM-CR network and adding a super-resolution architecture for the auxiliary SAR images to improve its accuracy.

Conflicts of Interest:
The authors declare no conflict of interest.