Wavelet Integrated Convolutional Neural Network for Thin Cloud Removal in Remote Sensing Images

Abstract: Cloud occlusion phenomena are widespread in optical remote sensing (RS) images, leading to information loss and image degradation and causing difficulties in subsequent applications such as land surface classification, object detection, and land change monitoring. Therefore, thin cloud removal is a key preprocessing procedure for optical RS images, and has great practical value. Recent deep learning-based thin cloud removal methods have achieved excellent results. However, these methods have a common problem in that they cannot obtain large receptive fields while preserving image detail. In this paper, we propose a novel wavelet-integrated convolutional neural network for thin cloud removal (WaveCNN-CR) in RS images that can obtain larger receptive fields without any information loss. WaveCNN-CR generates cloud-free images in an end-to-end manner based on an encoder–decoder-like architecture. In the encoding stage, WaveCNN-CR first extracts multi-scale and multi-frequency components via wavelet transform, then further performs feature extraction for each high-frequency component at different scales by multiple enhanced feature extraction modules (EFEM) separately. In the decoding stage, WaveCNN-CR recursively concatenates the processed low-frequency and high-frequency components at each scale, feeds them into EFEMs for feature extraction, then reconstructs the high-resolution low-frequency component by inverse wavelet transform. In addition, the designed EFEM consisting of an attentive residual block (ARB) and gated residual block (GRB) is used to emphasize the more informative features. ARB and GRB enhance features from the perspective of global and local context, respectively. Extensive experiments on the T-CLOUD, RICE1, and WHUS2-CR datasets demonstrate that our WaveCNN-CR significantly outperforms existing state-of-the-art methods.


Introduction
With the rapid development of optical satellite sensor technology, remote sensing (RS) images with high spatial, spectral, and temporal resolution have become increasingly accessible. RS images play a crucial role in modern Earth observation and are widely used in various applications, including land surface classification [1,2], object detection [3,4], land change monitoring [5,6], and military command [7]. However, the global annual mean cloud cover is as high as 67% [8,9], and RS images are invariably contaminated by clouds, greatly degrading their quality and causing serious adverse effects in subsequent applications. Thus, it is valuable to remove clouds from RS images while retaining the land surface information in order to improve their quality and availability.
The semitransparency property of thin clouds makes it possible to recover cloud-free images from a single cloudy RS image. Within the last decade a large number of thin cloud removal methods have been proposed, which can be briefly classified into two main categories: traditional image processing-based methods, and deep-learning (DL)-based methods. In previous studies, traditional image processing-based methods have been widely developed thanks to their ease of interpretation and implementation. Shen et al. [10] proposed a high-fidelity thin cloud removal method based on locally adaptive homomorphic filtering (HF). Pan et al. [11] designed a deformed imaging model according to the statistical properties of RS images and then combined it with the dark channel prior (DCP) to remove thin clouds. Li et al. [12] developed a two-stage thin cloud removal method that first utilized HF to improve the distribution of thin clouds, then employed a sphere-model improved DCP to obtain cloud-free images. Makarau et al. [13,14] removed clouds using a local search for dark objects to calculate a thin cloud thickness map for each band in multispectral RS images. These methods rely on assumed physical models or statistical priors, resulting in poor performance when prior assumptions are inconsistent with the actual RS images.
Image decomposition and transformation are traditional image processing methods that have been applied to thin cloud removal. He et al. [15] first extracted the thin cloud component by low-rank matrix decomposition and automatic thresholding, then subtracted it from the original cloudy images to obtain cloud-free images. Hu et al. [16] first applied a multidirectional dual tree complex wavelet transform to decompose cloudy images into sub-bands, then used a domain adaptation transfer least-squares support vector regression model to remove thin clouds by enhancing the high-frequency sub-bands and replacing the low-frequency sub-bands. Furthermore, independent component analysis [17,18] and principal component transform [19] have been used for thin cloud removal in RS images. Methods of this kind do not consider the imaging model of cloud distortion at all, and cannot obtain satisfactory results for complex scenes with nonuniform clouds.
Other traditional methods that rely on spectral analysis have been proposed for multispectral RS images. Hong and Zhang [20] improved and extended the haze optimized transform method to execute thin cloud removal. Lv et al. [21] proposed a thin cloud removal method based on radiative transfer models and empirical assumptions between multiple visible bands and one near infrared band, which they further simplified to an empirical relationship between two visible bands in [22]. Xu et al. [23] and Zhou and Wang [24] adopted the cirrus band as auxiliary data to remove thin clouds by calculating the linear regression coefficients between the visible/infrared bands and the cirrus band. However, these spectral-based methods do not make full use of the spatial correlation in cloudy images, and usually fail to work when only a few bands are available.
In recent years, DL technology has made impressive achievements in various computer vision tasks, such as image classification [25,26], object detection [27,28], semantic segmentation [29,30], and image translation [31,32], thanks to its strong abilities in nonlinear fitting and deep feature mining through supervised learning. Previous researchers have applied DL approaches to thin cloud removal in RS images. Li et al. [33] proposed an end-to-end deep residual symmetrical concatenation network (RSC-Net) for thin cloud removal. Wen et al. [34] designed a residual channel attention network (RCA-Net) to remove clouds by integrating residual learning (RL) and channel attention mechanisms. Li et al. [35] designed a convolutional neural network (CNN) with two input/output branches for thin cloud removal in Sentinel-2A images by taking the short-wave infrared and vegetation red edge bands as auxiliary inputs in addition to the visible/near infrared bands. Zhou et al. [36] proposed a lightweight and near-real-time thin cloud removal method using a multi-scale attention residual network (MSAR-DefogNet). Ding et al. [37] applied conditional variational auto-encoders with uncertainty analysis to generate multiple reasonable cloud-free images for each cloudy image.
Furthermore, many generative adversarial network (GAN)-based methods [38,39] have been proposed to remove thin clouds. Enomoto et al. [40] and Zhang et al. [41] directly applied conditional GAN (cGAN) [42] to accomplish thin cloud removal in RS images. Wen et al. [43] presented a GAN based on YUV color space and implemented thin cloud removal by learning the luminance and chroma components independently. Zhang et al. [44] proposed an improved GAN to recover cloud-free images by adding color consistency constraints to the loss function. In [45-48], the authors integrated various attentional mechanisms into GANs to enhance the feature representation ability of the models, thereby generating cloud-free images with higher quality.
Other studies have removed thin clouds by combining CNN/GAN and imaging models. Zi et al. [49] proposed a two-stage approach using two CNNs, one for estimating the reference thin cloud thickness map and the other for estimating the thickness coefficients. Yu et al. [50,51] developed a multiscale distortion-aware cloud removal network (MCRN) by incorporating the physical model of cloud distortion into feature extraction. Subsequently, hybrid model-based and GAN-based approaches [52,53] have been used for weakly supervised thin cloud removal to reduce the dependence on paired training data.
However, the above-mentioned CNN-based and GAN-based thin cloud removal methods suffer from a number of shortcomings. From the perspective of network architecture, models with downsampling and upsampling layers easily corrupt image details, while methods without downsampling and upsampling layers perform poorly on nonuniform thin cloud removal due to their limited receptive fields. In addition, existing methods perform thin cloud removal in the spatial domain, ignoring the distinct frequency information.
Considering that wavelet transform [54] is able to decompose an image into quarter-sized components of different frequencies without any information loss, in this paper we propose a wavelet-integrated CNN for thin cloud removal (WaveCNN-CR) in RS images, which can enlarge the receptive field while preserving image details. WaveCNN-CR applies wavelet transform to extract multi-scale and multi-frequency features, then inverse wavelet transform is used to reconstruct the high-resolution output. In addition, we design a global-local enhanced feature extraction module (EFEM) in WaveCNN-CR that integrates the attention and gating mechanisms, thereby emphasizing the more informative features. The main contributions of this paper are as follows:

1. We propose a novel wavelet-integrated CNN for thin cloud removal in RS images, which we call WaveCNN-CR. WaveCNN-CR can obtain multi-scale and multi-frequency features as well as larger receptive fields without any information loss. In addition, it can generate cloud-free results with more accurate details by directly processing the high-frequency features.

2. We design a novel EFEM consisting of an attentive residual block (ARB) and gated residual block (GRB) in WaveCNN-CR, enabling stronger feature representation ability. ARB enhances features by capturing long-range interactive global information based on an attention mechanism, while GRB enhances features by exploiting local information based on a gating mechanism.

3. We conduct extensive experiments on three public datasets, T-CLOUD, RICE1, and WHUS2-CR, which respectively include Landsat 8, Google Earth, and Sentinel-2A images. Compared with existing thin cloud removal methods, WaveCNN-CR achieves state-of-the-art (SOTA) results both qualitatively and quantitatively.
The remainder of this paper is organized as follows. Section 2 briefly introduces related works. Section 3 presents details of the proposed thin cloud removal method. Our experimental results and analysis are described and discussed in Section 4. Finally, our conclusions are provided in Section 5.

Related Works
Below, we provide a brief analysis of the network architecture of existing DL-based thin cloud removal methods in Section 2.1. In addition, we introduce the application of wavelets to DL-based computer vision tasks in Section 2.2.

Network Architecture of Existing DL-Based Methods
Recently, DL-based thin cloud removal methods have achieved impressive results [34,36,47,50]. The major difference between these end-to-end methods lies in their network architectures. There are generally two main structures: plane encoder-decoder structures [33,34,36,43,45,47] and hourglass-shaped encoder-decoder structures [35,38-41,44,48,50,51]. The former retains feature maps with the same spatial dimensions as the input image in both the encoder and decoder without any downsampling or upsampling operations (see Figure 1a), which can preserve image details without information loss. However, it has limited receptive fields and lacks the long-range dependencies of image and context, which is not conducive to the removal of nonuniform thin clouds [55]. The latter structure gradually reduces the size of the feature maps via downsampling operations in the encoder, then increases the size of the feature maps via upsampling operations in the decoder (see Figure 1b), which can obtain larger receptive fields and multi-scale features. Nevertheless, the downsampling operation (strided convolution/pooling) damages image details and causes loss of detail information; furthermore, existing upsampling operations (deconvolution/interpolation) cannot accurately recover the original data, which is not conducive to the restoration of image detail [56].

An effective thin cloud removal method needs to remove thin clouds from the whole image while avoiding corruption of image details. This requires a thin cloud removal model with both large receptive fields and no loss of detail information. Existing methods fail to balance the tradeoff between large receptive fields and preservation of image detail. To address this problem, our proposed WaveCNN-CR employs wavelet transform instead of conventional downsampling operations to enlarge the receptive field without any information loss, then uses inverse wavelet transform to reconstruct the high-resolution feature maps. In addition, direct processing of the high-frequency features obtained by the wavelet transform facilitates the recovery of image detail.
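The receptive-field argument above can be made concrete with the standard recurrence for stacked convolutional layers. The following is a minimal sketch only; the two layer stacks (`plane`, `hourglass`) are hypothetical examples chosen for illustration and are not the architectures of any method cited here.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of (kernel_size, stride) layers.

    Standard recurrence: the field grows by (kernel_size - 1) * jump at each
    layer, and the jump (input pixels per output step) is multiplied by stride.
    """
    rf, jump = 1, 1
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Hypothetical "plane" stack: eight 3x3 convolutions, no downsampling.
plane = [(3, 1)] * 8

# Hypothetical "hourglass" encoder: a 3x3 convolution followed by a 2x2,
# stride-2 downsampling step, repeated four times. A one-level Haar DWT has
# the same 2x2 / stride-2 footprint.
hourglass = [(3, 1), (2, 2)] * 4

print(receptive_field(plane))      # 17
print(receptive_field(hourglass))  # 46
```

With the same number of 3 × 3 convolutions or fewer, the downsampled stack covers a much larger input region, which is the property the hourglass-shaped structure and the wavelet-based design both exploit.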

Wavelet Transform in DL-Based Computer Vision
Wavelet transform [54] decomposes a signal into different frequency components, and is invertible and information-lossless. Researchers have integrated wavelet transform into CNNs to enhance performance in various computer vision tasks. For example, Huang et al. [57] proposed a wavelet-based CNN to recover the missing details in the wavelet domain for multi-scale face super-resolution. Liu et al. [58] utilized multilevel wavelet transform to enlarge the receptive field without information loss for image restoration. Li et al. [56] designed WaveCNets by replacing conventional downsampling operations with discrete wavelet transform (DWT) to improve the classification accuracy and noise-robustness of CNNs for image classification. For the stripe noise removal task, TSWEU [59] utilized wavelet transform to extract the intrinsically directional feature in the stripe and multi-scale image features; SNRWDNN [60] used quarter-sized wavelet sub-bands as inputs to simultaneously improve the computational efficiency and destriping performance. Chen et al. [61] embedded the dual-tree complex wavelet transform into a CNN for better retrieval of snow information in the single image desnowing task. Wave-GAN [62] incorporated wavelet transform and GAN to improve synthesis quality from the frequency domain perspective for few-shot image generation.
Unlike most of these approaches, which generally replace downsampling operations with wavelet transforms, then directly concatenate the low-frequency and high-frequency components and feed them into the convolution layer for feature extraction, our proposed WaveCNN-CR adopts multi-level wavelet transform to decompose the input features into multi-scale frequency components and performs feature extraction for each frequency component separately in the encoding stage. Then, the processed low-frequency and high-frequency components are combined and gradually restored to their original resolution by inverse DWT (IDWT) in the decoding stage.

Method
In this paper, we propose a thin cloud removal method for RS images using a wavelet-integrated CNN, WaveCNN-CR. First, we present the overall framework of WaveCNN-CR in Section 3.1. Then, in Section 3.2 we describe the hierarchical wavelet transform in WaveCNN-CR. Next, we elaborate the architecture of ARB and GRB in detail in Sections 3.3 and 3.4, respectively. Finally, we introduce the loss function of WaveCNN-CR in Section 3.5.

Overall Framework
The framework of the proposed WaveCNN-CR is shown in Figure 2. Considering a cloudy RGB image $I \in \mathbb{R}^{H \times W \times 3}$ with spatial dimensions $H \times W$, WaveCNN-CR first employs a 3 × 3 convolution operation to obtain low-level features $F_0 \in \mathbb{R}^{H \times W \times C}$, where $C$ is the number of channels. Then, the hierarchical wavelet transform is applied to decompose the shallow features $F_0$ into four levels of high-frequency components, i.e., $HF_1$, $HF_2$, $HF_3$, and $HF_4$ (see Section 3.2).

Hierarchical Wavelet Transform
Wavelet transform provides information on both frequency and spatiality without any information loss, which is crucial for accurate thin cloud removal and image detail preservation. WaveCNN-CR adopts a simple yet effective wavelet transform, namely, Haar wavelet [63]. Haar wavelet contains two operations (i.e., DWT and IDWT) and four wavelet filters, i.e., a low-pass filter $f_{LL}$ and high-pass filters $f_{LH}$, $f_{HL}$, and $f_{HH}$. The low-pass filter focuses on low-frequency image structure information. In contrast, the high-pass filters capture high-frequency image detail and texture information.
First, we extract multi-scale and multi-frequency wavelet features by four-level DWT and recursively invert the processed multi-scale features to reconstruct an initial resolution output by IDWT, as shown in Figure 2. Specifically, the shallow features $F_0$ are decomposed into a quarter-sized low-frequency component $LL_1$ and high-frequency components $LH_1$, $HL_1$, and $HH_1$ via DWT in the first level, which can be formulated as

$$LL_1 = f_{LL} \circledast F_0, \quad LH_1 = f_{LH} \circledast F_0, \quad HL_1 = f_{HL} \circledast F_0, \quad HH_1 = f_{HH} \circledast F_0,$$

where $\circledast$ represents the convolution operation (applied with a stride of 2, which makes each component quarter-sized). Then, the decomposition continues iteratively on $LL_{i-1}$ to produce $LL_i$, $LH_i$, $HL_i$, and $HH_i$ ($i = 2, 3, 4$). Hence, we obtain a total of one low-frequency component and twelve multi-scale high-frequency components. We take $LL_4$ as the low-frequency features $LF_4$ and concatenate $LH_i$, $HL_i$, and $HH_i$ in the channel dimension as the $i$th level high-frequency features $HF_i$. In the decoding stage, we iteratively concatenate $LF_i$ and $HF_i$, feed them into the EFEM for feature extraction, then apply IDWT to reconstruct $LF_{i-1}$ ($i = 4, 3, 2, 1$).
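The four-level decomposition and its lossless inversion can be sketched on a single-channel map in plain Python. This is an illustrative sketch, not the authors' implementation: the sign convention and the 1/2 normalization of the Haar filters are assumptions (conventions differ between libraries), the names `haar_dwt` and `haar_idwt` are hypothetical, and no EFEM processing is applied between decomposition and reconstruction.

```python
def haar_dwt(x):
    """One level of 2D Haar DWT on a 2D list with even height and width.

    Returns (LL, LH, HL, HH), each with half the height and width of x.
    Filter signs and the 1/2 normalization are an assumed convention.
    """
    h, w = len(x) // 2, len(x[0]) // 2
    ll = [[0.0] * w for _ in range(h)]
    lh = [[0.0] * w for _ in range(h)]
    hl = [[0.0] * w for _ in range(h)]
    hh = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # The 2x2 block covered by the filters at stride 2.
            a, b = x[2 * i][2 * j], x[2 * i][2 * j + 1]
            c, d = x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2
            lh[i][j] = (-a - b + c + d) / 2
            hl[i][j] = (-a + b - c + d) / 2
            hh[i][j] = (a - b - c + d) / 2
    return ll, lh, hl, hh


def haar_idwt(ll, lh, hl, hh):
    """Inverse of haar_dwt: rebuild the double-resolution map."""
    h, w = len(ll), len(ll[0])
    x = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            p, q, r, s = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            x[2 * i][2 * j] = (p - q - r + s) / 2
            x[2 * i][2 * j + 1] = (p - q + r - s) / 2
            x[2 * i + 1][2 * j] = (p + q - r - s) / 2
            x[2 * i + 1][2 * j + 1] = (p + q + r + s) / 2
    return x


# Four-level decomposition of an arbitrary 16 x 16 single-channel map.
f0 = [[float((3 * i + 5 * j) % 7) for j in range(16)] for i in range(16)]
low, high = f0, []
for level in range(4):
    low, lh, hl, hh = haar_dwt(low)
    high.append((lh, hl, hh))  # the three high-frequency sub-bands of this level

# Decoding order: start from the coarsest level and invert level by level.
rec = low
for lh, hl, hh in reversed(high):
    rec = haar_idwt(rec, lh, hl, hh)

max_err = max(abs(rec[i][j] - f0[i][j]) for i in range(16) for j in range(16))
print(len(low), len(low[0]), sum(len(bands) for bands in high), max_err)
```

The reconstruction error stays at floating-point level, which illustrates why replacing strided convolution or pooling with DWT does not discard information: the detail removed from the low-frequency path is kept in the high-frequency sub-bands.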

Attentive Residual Block
Attention mechanisms are widely used in various computer vision tasks, such as image classification, object detection, image denoising, and thin cloud removal, and can effectively improve the learning ability of CNNs. Attention enhances feature representation by recalibrating the feature maps to emphasize useful features and suppress useless features. In addition, RL can directly transfer features from shallow layers to deeper layers through skip connections. In particular, for the thin cloud removal task RL can avoid corruption of clear ground information. Meanwhile, RL allows CNNs with greater depth to be trained more easily. Inspired by this, we combined an attention mechanism with RL in our proposed attentive residual block for enhanced feature extraction.
The architecture of our proposed ARB is shown in Figure 3b, and its mathematical formula can be expressed as

$$F_{out} = F_{in} + \mathrm{Att}(W_{3\times3}(F_{in})),$$

where $F_{in}$ and $F_{out}$ are the input and output feature maps of ARB, respectively, $\mathrm{Att}(\cdot)$ represents the attention block, $W_{3\times3}$ denotes the 3 × 3 convolution, and the convolution kernel $\omega$ is the parameter of the network. First, $\omega$ is assigned initial values by random initialization and then gradually optimized by backpropagation according to the loss function in the training stage. ARB first employs a convolutional layer for feature extraction, then aggregates global contextual information for feature enhancement through the attention block.

In this paper, we utilize the coordinate attention block (CAB) [64], which can obtain channel attention and global spatial attention simultaneously by integrating the horizontal attention and vertical attention. CAB performs better than the classical SE channel attention block [65] and CBAM [66] because SE contains only channel attention, while CBAM calculates channel attention and local spatial attention separately.

Figure 3d presents the architecture of CAB. With an input tensor $F_{in} \in \mathbb{R}^{h \times w \times c}$, two one-dimensional global average pooling operations are first used to aggregate the input features along the horizontal and vertical directions, respectively. The resulting two direction-aware feature maps $F^h \in \mathbb{R}^{h \times 1 \times c}$ and $F^w \in \mathbb{R}^{1 \times w \times c}$ can then be formulated as

$$F^h = \mathrm{HGAP}(F_{in}), \qquad F^w = \mathrm{VGAP}(F_{in}),$$

where HGAP and VGAP refer to horizontal global average pooling and vertical global average pooling, respectively. Then, $F^h$ and $F^w$ are concatenated and encoded by a 1 × 1 convolutional layer and a nonlinear activation layer, which can be written as

$$F_{enc} = \phi(W_{1\times1}([F^h, F^w])),$$

where $[\cdot, \cdot]$ represents the concatenation along the spatial dimension, $W_{1\times1}$ denotes the 1 × 1 convolution, $\phi$ is the non-linear activation function ReLU6 [67], and $F_{enc} \in \mathbb{R}^{1 \times (h+w) \times c/r}$ are the output encoded feature maps. Here, $r$ is the channel reduction ratio. Then, $F_{enc}$ are split along the spatial dimension into two separate feature maps, $F^h_{enc} \in \mathbb{R}^{h \times 1 \times c/r}$ and $F^w_{enc} \in \mathbb{R}^{1 \times w \times c/r}$. An additional two 1 × 1 convolution operations are used to convert $F^h_{enc}$ and $F^w_{enc}$ into tensors with the same number of channels as $F_{in}$, respectively, and the sigmoid function is then used for normalization, obtaining

$$g^h = \sigma(W^h_{1\times1}(F^h_{enc})), \qquad g^w = \sigma(W^w_{1\times1}(F^w_{enc})),$$

where $\sigma$ is the sigmoid function and $g^h$ and $g^w$ are the horizontal and vertical attention weights, respectively. Finally, $g^h$ and $g^w$ are combined to rescale the input features $F_{in}$, and the output of CAB can be written as

$$F_{out} = F_{in} \odot (g^h \otimes g^w),$$

where $\odot$ and $\otimes$ denote elementwise multiplication and matrix multiplication, respectively.
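The data flow of the coordinate attention described above (directional pooling, sigmoid normalization, and rescaling by the product of the two weight vectors) can be sketched for a single channel as follows. This is a deliberately reduced illustration under stated assumptions: the learned 1 × 1 convolutions, the channel reduction by $r$, and ReLU6 are all replaced by the identity, so only the pooling and rescaling steps are shown; the function name `coordinate_attention_1ch` is hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def coordinate_attention_1ch(f):
    """Rescale an h x w single-channel map with direction-aware weights.

    Assumption: the learned 1x1 convolutions and ReLU6 of the real block are
    omitted (treated as identity), so this is not a trainable module.
    """
    h, w = len(f), len(f[0])
    # Horizontal pooling: average over the width, one value per row (h x 1).
    f_h = [sum(row) / w for row in f]
    # Vertical pooling: average over the height, one value per column (1 x w).
    f_w = [sum(f[i][j] for i in range(h)) / h for j in range(w)]
    g_h = [sigmoid(v) for v in f_h]
    g_w = [sigmoid(v) for v in f_w]
    # (h x 1) times (1 x w) gives an h x w weight map, applied elementwise.
    return [[f[i][j] * g_h[i] * g_w[j] for j in range(w)] for i in range(h)]


feature = [[2.0, -2.0], [2.0, -2.0]]
print(coordinate_attention_1ch(feature))
```

Because each output position is weighted by statistics of its entire row and its entire column, every pixel is influenced by context far outside a 3 × 3 neighbourhood, which is the sense in which ARB is described as exploiting global context.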

Gated Residual Block
After ARB obtains the enhanced features using the global context information, we further apply the gating mechanism to control the flow of features based on the local context information. The gating mechanism can be modeled as the element-wise multiplication of two parallel paths of 3 × 3 convolutional layers, one of which is followed by a nonlinear activation layer. The architecture of our proposed GRB is illustrated in Figure 3c. With an input tensor $F_{in} \in \mathbb{R}^{h \times w \times c}$, GRB can be formulated as

$$F_{out} = F_{in} + W_{1\times1}\big(\varphi(W^{1}_{3\times3}(\psi(F_{in}))) \odot W^{2}_{3\times3}(\psi(F_{in}))\big),$$

$$\psi(F^l_{in}) = g^l \, \frac{F^l_{in} - \mu^l}{\sqrt{(\sigma^l)^2 + \epsilon}} + b^l,$$

where $\psi$ and $\varphi$ are the layer normalization [68] and GELU nonlinearity [69], respectively, $W^{1}_{3\times3}$ and $W^{2}_{3\times3}$ denote the two parallel 3 × 3 convolutions, $F^l_{in}$ denotes the $l$-th channel of the input tensor, $\mu^l$ and $(\sigma^l)^2$ are the mean and variance of $F^l_{in}$, respectively, $\epsilon$ is a small constant that prevents the denominator from being zero, and $g^l$ and $b^l$ are two learnable parameters. Here, it is worth noting that we first use two 3 × 3 convolutions to expand the channels of the layer normalized features by a factor of two in order to exploit richer local features, then finally reduce the channels back to the original input dimension by a 1 × 1 convolution. Overall, GRB allows us to choose which part of the features should be propagated to the next layer of the network. Specific to the thin cloud removal task, thanks to global residual learning this means allowing information relating to clouds to pass forward while blocking information on cloud-free regions, resulting in better thin cloud removal performance and better fidelity in cloud-free regions.
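The normalization and gating steps of GRB can be illustrated on a flattened single channel. The sketch below is an assumption-laden reduction, not the block itself: the 3 × 3 and 1 × 1 convolutions are replaced by hypothetical scalar weights (`w_gate`, `w_value`, `w_out`), the channel expansion by a factor of two is omitted, and GELU is computed in its exact form via the error function.

```python
import math


def gelu(v):
    """Exact GELU: v times the standard normal CDF evaluated at v."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def layer_norm(values, gain=1.0, bias=0.0, eps=1e-6):
    """Normalize one channel to zero mean and unit variance, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [gain * (v - mean) / math.sqrt(var + eps) + bias for v in values]


def gated_residual(values, w_gate=1.0, w_value=1.0, w_out=1.0):
    """Residual gating on one flattened channel.

    Hypothetical scalar weights stand in for the convolutions of the real
    block; the activated path acts as a gate on the linear path.
    """
    normed = layer_norm(values)
    gated = [gelu(w_gate * v) * (w_value * v) for v in normed]
    return [x + w_out * g for x, g in zip(values, gated)]


print(gated_residual([1.0, 2.0, 3.0, 4.0]))
```

Where the gate path is strongly negative, GELU drives the product toward zero and the block passes the input through almost unchanged via the residual connection; where it is positive, the normalized feature is amplified.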

Loss Function
The $L_1$ norm and mean squared error (MSE) are the most commonly used loss functions in supervised image-to-image translation tasks. However, the minimization of MSE suppresses high-frequency detail information, causing the phenomenon of regression to the mean and resulting in blurred and oversmoothed results [70,71]. Therefore, in this paper we employ $L_1$ loss to optimize WaveCNN-CR. The loss function can be expressed as

$$\mathcal{L}(\omega) = \frac{1}{N} \sum_{i=1}^{N} \left\| f_\omega(I_i) - GT_i \right\|_1,$$

where $I_i$ and $GT_i$ are the $i$th thin cloud image and corresponding ground truth (cloud-free reference image) in the training set, respectively, $N$ is the number of training samples, $\| \cdot \|_1$ represents the $L_1$ norm, $f_\omega$ denotes our WaveCNN-CR, and $\omega$ represents the parameters of WaveCNN-CR. Here, we aim to minimize $\mathcal{L}(\omega)$ in order to obtain the optimal parameters $\omega^*$.
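The regression-to-the-mean effect mentioned above can be seen in a toy setting where a single predicted value must explain several conflicting targets. The sketch below is illustrative only; the target values are made up, both losses are averaged over elements (a normalization chosen here for simplicity), and the minimizers are found by a coarse grid search rather than by training a network.

```python
def l1_loss(pred, target):
    """Mean absolute error over all elements."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mse_loss(pred, target):
    """Mean squared error over all elements."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Three equally plausible target intensities for the same pixel (made-up data).
targets = [0.0, 0.0, 1.0]
candidates = [k / 100 for k in range(101)]

# Best constant prediction under each loss.
best_l1 = min(candidates, key=lambda c: l1_loss([c] * 3, targets))
best_mse = min(candidates, key=lambda c: mse_loss([c] * 3, targets))
print(best_l1, best_mse)
```

The MSE-optimal prediction sits at the mean of the targets, an intermediate value that matches none of them, whereas the $L_1$-optimal prediction sits at the median and coincides with an actual target. Averaged over many pixels, the former behaviour corresponds to the blurred and oversmoothed outputs described above.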

Results and Discussion
In this part, we first describe the experimental settings, including the datasets, evaluation metrics, and implementation details, in Section 4.1. Next, the ablation study on the T-CLOUD dataset is presented and discussed in Section 4.2. Finally, we conduct comparative experiments with other SOTA methods in Section 4.3.

Datasets
In our experiments, we evaluated our method on three public datasets: T-CLOUD [37], RICE [72], and WHUS2-CR [35]. Table 1 summarizes the similarities and differences of these three datasets.
(2) RICE dataset: RICE contains two subsets: thin cloud-contaminated RICE1 and thick cloud-contaminated RICE2. The former consists of 500 pairs of cloudy images and their cloud-free counterparts, all with a size of 512 × 512, while the latter has 450 triplets of images, each triplet containing a reference image without clouds, a thick cloud-covered image, and the mask of the clouds. We chose RICE1 for our thin cloud removal experiments. In RICE1, all images are collected from Google Earth by setting whether or not to exhibit the cloud layer. We randomly selected 400 pairs for training and the remaining 100 pairs for testing.
(3) WHUS2-CR dataset: In the WHUS2-CR dataset, cloudy and corresponding cloud-free images are captured by the Sentinel-2A satellite, which has a multispectral imager for ground exploration. To reduce the difference between cloudy and cloud-free images as much as possible, the time lag between the acquisition dates of cloudy and corresponding cloud-free images is set to ten days, which is the revisit time of the Sentinel-2A satellite. In WHUS2-CR, we randomly cropped 5000 image patches with a size of 256 × 256 pixels from the original high-resolution image pairs. In our experiments, 4000 pairs were used for training and 1000 pairs for testing.

Evaluation Metrics
To quantitatively evaluate the performance of thin cloud removal methods, we adopted the widely used peak signal-to-noise ratio (PSNR) [73], structural similarity (SSIM) [74], and CIEDE2000 [75] as full-reference metrics.
Specifically, PSNR calculates the ratio of the maximum pixel value against the pixelwise evaluation error, which can be formulated as

$$\mathrm{PSNR} = 10 \log_{10} \frac{(2^B - 1)^2}{\mathrm{MSE}}, \qquad \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (X_i - Y_i)^2,$$

where MSE is the mean squared error between the thin cloud removal result $X$ and the ground-truth image $Y$, $N$ is the number of pixels in the image, and $B$ denotes the bit depth of the image, which generally takes a value of 8, that is, $2^B - 1 = 255$. A larger PSNR indicates a better thin cloud removal result. SSIM evaluates the similarity between two images in terms of luminance, contrast, and structure:

$$\mathrm{SSIM} = \frac{2\mu_X \mu_Y + c_1}{\mu_X^2 + \mu_Y^2 + c_1} \cdot \frac{2\sigma_X \sigma_Y + c_2}{\sigma_X^2 + \sigma_Y^2 + c_2} \cdot \frac{\sigma_{XY} + c_3}{\sigma_X \sigma_Y + c_3},$$

where $\mu_X$ and $\mu_Y$ are the mean values of $X$ and $Y$, respectively, $\sigma_X^2$ and $\sigma_Y^2$ are the variances of $X$ and $Y$, respectively, $\sigma_{XY}$ is the covariance of $X$ and $Y$, and $c_1$, $c_2$, and $c_3$ are small constants that prevent the denominator terms from being zero. The value of SSIM ranges from 0 to 1, with larger values indicating a better thin cloud removal effect. CIEDE2000 measures the color difference between two images, and is consistent with subjective human visual perception. CIEDE2000 can be defined as

$$\mathrm{CIEDE2000} = \sqrt{\left(\frac{\Delta L'}{k_L S_L}\right)^2 + \left(\frac{\Delta C'}{k_C S_C}\right)^2 + \left(\frac{\Delta H'}{k_H S_H}\right)^2 + R_T \, \frac{\Delta C'}{k_C S_C} \, \frac{\Delta H'}{k_H S_H}},$$

where $\Delta L'$, $\Delta C'$, and $\Delta H'$ are the CIELAB lightness, chroma, and hue differences between $X$ and $Y$, respectively; $k_L$, $k_C$, and $k_H$ are the parametric factors; and the weighting factors $S_L$, $S_C$, and $S_H$ and the interactive term $R_T$ are calculated from $\Delta L'$, $\Delta C'$, and $\Delta H'$, respectively. For detailed calculations, refer to [76]. A smaller value of CIEDE2000 indicates better color preservation.
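As a worked illustration of the first two metrics, the sketch below evaluates PSNR and a single-window (global) SSIM on flattened pixel lists. Several choices here are assumptions rather than details taken from this paper: the constants $c_1 = (0.01 \cdot 255)^2$, $c_2 = (0.03 \cdot 255)^2$, and $c_3 = c_2 / 2$ follow the common convention, and SSIM is computed once over the whole image, whereas common implementations average it over local windows. The pixel values are made up.

```python
import math


def psnr(x, y, bit_depth=8):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    if mse == 0:
        return math.inf
    peak = 2 ** bit_depth - 1
    return 10.0 * math.log10(peak ** 2 / mse)


def global_ssim(x, y, bit_depth=8):
    """Three-term SSIM (luminance, contrast, structure) over the whole image."""
    peak = 2 ** bit_depth - 1
    # Conventional stabilizing constants (an assumption, see lead-in).
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    c3 = c2 / 2
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov + c3) / (sd_x * sd_y + c3)
    return luminance * contrast * structure


result = [10.0, 20.0, 30.0, 40.0]      # made-up restored pixels
reference = [12.0, 18.0, 33.0, 37.0]   # made-up ground-truth pixels
print(psnr(result, reference), global_ssim(result, reference))
```

In this example the MSE is 6.5, so the PSNR is $10 \log_{10}(65025 / 6.5)$, roughly 40 dB. CIEDE2000 is omitted because its weighting terms require the full procedure in [76].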

Implementation Details
The proposed WaveCNN-CR was implemented in PyTorch and trained on an Intel Gold 6252 CPU and an NVIDIA A100 GPU. The number of channels in the first convolution layer was set to C = 48, and the channel reduction ratio in CAB was set to r = 4. We trained WaveCNN-CR with the Adam [77] optimizer (β1 = 0.9, β2 = 0.999). The batch size and number of training epochs were set to 1 and 300, respectively. The learning rate was set to 0.0003 for the first 100 epochs, then gradually reduced to 0 over the next 200 epochs using the cosine annealing strategy [78]. In addition, we used horizontal and vertical flipping for data augmentation.
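The learning-rate schedule described above can be written as a small function of the epoch index. This is a sketch under assumptions: the per-epoch (rather than per-iteration) update and the exact epoch indexing are not stated in the text, and the function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(epoch, base_lr=3e-4, constant_epochs=100, decay_epochs=200):
    """Constant rate for the first phase, then cosine annealing down to zero.

    Assumes epochs are counted from 0 and the rate is updated once per epoch.
    """
    if epoch < constant_epochs:
        return base_lr
    t = min(epoch - constant_epochs, decay_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t / decay_epochs))


for epoch in (0, 100, 200, 300):
    print(epoch, learning_rate(epoch))
```

The rate stays at 0.0003 through epoch 100, passes through half that value midway through the decay phase, and reaches 0 at epoch 300.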

Ablation Study
To verify the effectiveness of the proposed WaveCNN-CR, we conducted extensive ablation experiments to analyze the overall architecture of WaveCNN-CR and the structure of EFEM, ARB, and GRB. The T-CLOUD dataset was employed for training and testing. For fast comparisons, the number of training epochs in all ablation experiments was set to 150.

Analysis of Overall Architecture
To demonstrate the effectiveness of wavelet transform in WaveCNN-CR, we compared it with three variant models without wavelet transform. One of the variants was designed with the plane structure (denoted as Plane) and the other two variants adopted the hourglass-shaped structure, one utilizing convolution and deconvolution with stride 2 as the respective downsampling and upsampling operations (denoted as Hourglass1) and the other using average pooling as the downsampling operation and bilinear interpolation as the upsampling operation (denoted as Hourglass2). In Hourglass2, we employed 1 × 1 convolution before downsampling and upsampling to ensure that the number of channels in its feature map was consistent with that in WaveCNN-CR. The qualitative comparison results are shown in Figure 4. Plane was limited by its small receptive fields, resulting in unsatisfactory results on nonuniform thin clouds (see the red box area). Hourglass2 performed better than Hourglass1, effectively removing the nonuniform thin clouds, though there were blurry detail textures in its results. In contrast, our proposed WaveCNN-CR benefited from the wavelet transform without information loss, effectively removing the nonuniform thin clouds while accurately recovering the detailed texture of the image. Table 2 presents the quantitative results. It can be seen that compared with Hourglass2, Plane performed poorly in terms of PSNR and CIEDE2000, while performing better on the SSIM metric. This is because there were no downsampling/upsampling operations in Plane, thereby protecting the detailed texture of the image. Our proposed WaveCNN-CR integrates wavelet transform into the CNN, achieving the best results on all three evaluation metrics.

Effectiveness of EFEM
In the proposed WaveCNN-CR, EFEM consists of an ARB followed by a GRB. To verify the effectiveness of EFEM, we compared it with three variants: (1) two ARBs (denoted ARB_ARB), (2) two GRBs (denoted GRB_GRB), and (3) one GRB followed by one ARB (denoted GRB_ARB). As shown in Table 3, the results of the combination of ARB and GRB were better than those of two ARBs or two GRBs alone, indicating that global ARB and local GRB are complementary. The proposed EFEM, composed of ARB and GRB in sequence, achieved the best results, which also shows that this global-local enhancement strategy can obtain higher performance gains.

Analysis of ARB
To verify the effectiveness of the ARB, we compared it with variant modules with different structures. In Table 4, CB denotes a regular convolutional block without an attention mechanism or residual connection, while AB and RB represent an attentive block with attention mechanism and residual block with residual connection, respectively. In addition, ARB_SE and ARB_CBAM represent ARBs with SE and CBAM attention modules, respectively. From the quantitative comparison results, it can be seen that, compared with CB, RB obtained better results, whereas AB achieved higher PSNR gains but performed poorly in terms of SSIM and CIEDE2000. The latter three ARBs with different attention mechanisms were significantly better than the first three, illustrating the effectiveness of combining the attention mechanism and RL. Our ARB using CAB achieved the best results, with 31.01 dB in PSNR, 0.8813 in SSIM, and 3.4262 in CIEDE2000.

Analysis of GRB
We conducted experiments to verify the effectiveness of GRB. As shown in Table 5, CB represents the convolutional block without a gating mechanism or residual connection, while GB and RB denote the gated block with gating mechanism and residual block with residual connection, respectively. GB performed the worst, indicating that the gating mechanism plays a negative role when there is no residual connection. Compared with RB, our GRB with the gating mechanism improved performance by 1.33 dB in PSNR, 0.0187 in SSIM, and 0.4843 in CIEDE2000.

In this section, we present the experimental results on the T-CLOUD, RICE1, and WHUS2-CR datasets used to evaluate our proposed WaveCNN-CR. Quantitative and qualitative comparisons were conducted against several SOTA methods, including four CNN-based methods (RSC-Net [33], MCRN [50], MSAR-DefogNet [36], and RCA-Net [34]) and five GAN-based methods (SpA-GAN [45], UNet-GAN [38], MS-GAN [39], Color-GAN [44], and AMGAN-CR [47]).
The quantitative results are presented in Tables 6-8. The five attention-based methods (MSAR-DefogNet, RCA-Net, SpA-GAN, AMGAN-CR, and WaveCNN-CR) significantly outperformed the remaining five methods without an attention mechanism, supporting the effectiveness of the attention mechanism. Our proposed WaveCNN-CR achieved remarkable performance gains over existing methods on all three datasets. Compared with MSAR-DefogNet, the best-performing recent method, WaveCNN-CR achieved improvements of 2.37 dB, 2.16 dB, and 0.40 dB in PSNR and 0.0406, 0.0116, and 0.0150 in SSIM on the T-CLOUD, RICE1, and WHUS2-CR datasets, respectively. For the color difference indicator, CIEDE2000, the quantitative results consistently showed that WaveCNN-CR achieved the best performance, demonstrating that WaveCNN-CR has great potential to improve thin cloud removal performance.
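To help read the PSNR gains quoted above, the following is a minimal sketch of the standard PSNR definition in plain Python. It is not the evaluation code used in the experiments: the function names `psnr` and `mse_ratio_for_gain` are illustrative, images are represented as nested lists, and an 8-bit peak value of 255 is assumed.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are nested lists (rows of pixel values). The 8-bit peak value
    of 255 is an assumption made for this sketch.
    """
    ref_flat = [p for row in reference for p in row]
    out_flat = [p for row in restored for p in row]
    if len(ref_flat) != len(out_flat):
        raise ValueError("images must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((r - o) ** 2 for r, o in zip(ref_flat, out_flat)) / len(ref_flat)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)


if __name__ == "__main__":
    print(psnr([[100, 100]], [[110, 90]]))
    print(mse_ratio_for_gain(2.37))
```

Because PSNR is logarithmic in the mean squared error, a fixed gain in dB corresponds to a fixed multiplicative reduction in MSE. Under this definition, the 2.37 dB gain reported on T-CLOUD corresponds to an MSE roughly 1.7 times smaller.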
Furthermore, we compared the parameters, computational cost, and test time of different methods on the T-CLOUD dataset, with the results shown in Table 10. RSC-Net, UNet-GAN, MS-GAN, and Color-GAN had relatively low computational costs and test times; however, their thin cloud removal performance was relatively poor. In contrast, MCRN, RCA-Net, SpA-GAN, and AMGAN-CR had higher computational and time costs, and their thin cloud removal results were better than those of the previous four methods. MSAR-DefogNet achieved a good balance between parameters, computations, time cost, and the effectiveness of cloud removal. Overall, our WaveCNN-CR had the highest number of parameters and the second-highest cost in terms of computation and time. Compared with MSAR-DefogNet, our WaveCNN-CR made sacrifices in terms of memory usage and time consumption, but showed greatly improved effectiveness in thin cloud removal.

[33], MCRN [50], MSAR-DefogNet [36], RCA-Net [34], SpA-GAN [45], UNet-GAN [38], MS-GAN [39], Color-GAN [44], AMGAN-CR [47], and our proposed WaveCNN-CR, respectively; (l) reference cloud-free image.

Conclusions
In this paper, we proposed a novel thin cloud removal method for RS images, called WaveCNN-CR, that integrates the wavelet transform into a CNN. Benefiting from the lossless decomposition of the wavelet transform, WaveCNN-CR is able to obtain large receptive fields while preserving image details, which is an advantage over existing thin cloud removal methods. Specifically, in the encoding stage, WaveCNN-CR adopts a hierarchical DWT to decompose the input features into multi-scale and multi-frequency components, then performs feature extraction for each high-frequency component at different scales using multiple EFEMs. In the decoding stage, the processed low-frequency and high-frequency components are recursively combined to reconstruct the high-resolution output via IDWT. Furthermore, we designed a novel EFEM to integrate global and local information to improve the feature representation ability of WaveCNN-CR. This EFEM is composed of both ARB and GRB; ARB enhances features through the global contextual information captured by the attention mechanism, while GRB enhances features through the local contextual information exploited by the gating mechanism. We conducted comparative experiments on three publicly available datasets, T-CLOUD, RICE1, and WHUS2-CR, which include Landsat 8, Google Earth, and Sentinel-2A images, respectively. Both the qualitative and quantitative results demonstrated that WaveCNN-CR significantly outperforms other SOTA methods in terms of thin cloud removal and image detail restoration.
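The lossless property of the wavelet decomposition mentioned above can be illustrated with a single-level 2D transform. The sketch below is not the WaveCNN-CR implementation: it assumes the Haar basis for simplicity, operates on a single-channel image stored as nested lists, and uses the hypothetical helper names `haar_dwt2` and `haar_idwt2`. The sub-band labels follow one common convention and may differ from those used elsewhere.

```python
def haar_dwt2(image):
    """Single-level 2D Haar DWT of an image with even height and width.

    Returns four half-resolution sub-bands: the low-frequency
    approximation (ll) and three high-frequency detail bands.
    """
    h, w = len(image), len(image[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, h, 2):
        row_ll, row_lh, row_hl, row_hh = [], [], [], []
        for j in range(0, w, 2):
            # The four pixels of one 2x2 block.
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            row_ll.append((a + b + c + d) / 2)  # local average (scaled)
            row_lh.append((a - b + c - d) / 2)  # left-right difference
            row_hl.append((a + b - c - d) / 2)  # top-bottom difference
            row_hh.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(row_ll)
        lh.append(row_lh)
        hl.append(row_hl)
        hh.append(row_hh)
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the full-resolution image."""
    h, w = len(ll), len(ll[0])
    image = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, x, y, z = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            image[2 * i][2 * j] = (s + x + y + z) / 2
            image[2 * i][2 * j + 1] = (s - x + y - z) / 2
            image[2 * i + 1][2 * j] = (s + x - y - z) / 2
            image[2 * i + 1][2 * j + 1] = (s - x - y + z) / 2
    return image


if __name__ == "__main__":
    img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    bands = haar_dwt2(img)
    print(haar_idwt2(*bands) == img)
```

Each sub-band has half the spatial resolution of the input, so a convolution applied to it covers twice the spatial extent in the original image, while the four sub-bands together retain everything needed to recover the input exactly. This contrasts with pooling or strided convolution, where the discarded information cannot be recovered.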
In future work, we intend to apply WaveCNN-CR to multispectral and multitemporal RS images, making full use of spatial, spectral, and temporal information to remove clouds. Additionally, WaveCNN-CR could be applied to other image restoration tasks such as denoising, deblurring, and deraining. Considering that collecting large datasets of paired images is time-consuming, WaveCNN-CR could be combined with transfer learning on a small dataset or combined with GANs in a weakly supervised way to remove thin clouds from RS images.

Figure 1. The two types of network structures used in existing DL-based methods: (a) plane encoder-decoder structure and (b) hourglass-shaped encoder-decoder structure.

Figure 2. The overall framework of the proposed WaveCNN-CR.

Figure 3. Detailed architecture of the modules in WaveCNN-CR: (a) enhanced feature extraction module, (b) attentive residual block, (c) gated residual block, and (d) coordinate attention block.

(1) T-CLOUD dataset: The data in T-CLOUD are from Landsat 8 RGB images. The dataset contains 2939 pairs of cloudy images and their clear counterparts, separated by one satellite revisit period (16 days). First, the original optical RS image pairs are captured by the same satellite sensor at different times. Then, the image sub-regions that have similar lighting conditions in the corresponding cloudy and cloud-free images are selected to form the training and testing data. Finally, the paired cloudy and cloud-free images are obtained by cropping at the corresponding positions. All images are cropped to a size of 256 × 256 pixels. The data are split with a ratio of 8:2, with 2351 images in the training set and 588 images in the test set.
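The 8:2 split described above can be reproduced arithmetically with a short sketch. The helper name `split_pairs`, the random shuffling, and the fixed seed are assumptions for illustration; the source does not state how the image pairs were assigned to each subset.

```python
import random


def split_pairs(num_pairs, train_fraction=0.8, seed=0):
    """Split pair indices into training and test subsets.

    Shuffling with a fixed seed is an assumption of this sketch; only
    the resulting subset sizes are taken from the dataset description.
    """
    indices = list(range(num_pairs))
    random.Random(seed).shuffle(indices)
    # Round the training count down; the remainder goes to the test set.
    num_train = int(num_pairs * train_fraction)
    return indices[:num_train], indices[num_train:]


if __name__ == "__main__":
    train_idx, test_idx = split_pairs(2939)
    print(len(train_idx), len(test_idx))
```

With 2939 pairs and a training fraction of 0.8, rounding the training count down gives 2351 training pairs and 588 test pairs, matching the sizes stated for T-CLOUD.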

Figure 6 shows the visual results for an image from the uniform RICE1 dataset that is heavily contaminated by thin cloud. The results indicate that RSC-Net, SpA-GAN, UNet-GAN, and Color-GAN left many residual clouds. The remaining five methods, MCRN, MSAR-DefogNet, RCA-Net, MS-GAN, and AMGAN-CR, all obtained cloud-free results, although with varying degrees of color deviation compared to the reference image. The restored image obtained with the proposed WaveCNN-CR was closer in appearance to the reference image, with no color distortion, which is consistent with the quantitative results. Furthermore, a thin cloud removal example for an image from the WHUS2-CR dataset that is moderately contaminated by thin cloud is shown in Figure 7. While all comparison methods suffered from varying degrees of color distortion, the visual quality of the restoration results demonstrates the superiority of WaveCNN-CR.

Table 1. Properties of the T-CLOUD, RICE1, and WHUS2-CR datasets used in the experiments.

Table 2. Ablation analysis of the overall architecture of WaveCNN-CR. Bold and underlined text indicate the best and second-best performance, respectively. The ↑ symbol indicates that larger values are better, while ↓ indicates that smaller values are better.

Table 3. Ablation analysis of the structure of EFEM. Bold and underlined text indicate the best and second-best performance, respectively. The ↑ symbol indicates that larger values are better, while ↓ indicates that smaller values are better.

Table 4. Ablation analysis of the structure of ARB. Bold and underlined text indicate the best and second-best performance, respectively. The ↑ symbol indicates that larger values are better, while ↓ indicates that smaller values are better.

Table 5. Ablation analysis of the structure of GRB. Bold and underlined text indicate the best and second-best performance, respectively. The ↑ symbol indicates that larger values are better, while ↓ indicates that smaller values are better.

Table 6. Quantitative evaluations on the T-CLOUD dataset. Bold and underlined text indicate the best and second-best performance, respectively. The ↑ symbol indicates that larger values are better, while ↓ indicates that smaller values are better.

Table 7. Quantitative evaluations on the RICE1 dataset. Bold and underlined text indicate the best and second-best performance, respectively. The ↑ symbol indicates that larger values are better, while ↓ indicates that smaller values are better.

Table 8. Quantitative evaluations on the WHUS2-CR dataset. Bold and underlined text indicate the best and second-best performance, respectively. The ↑ symbol indicates that larger values are better, while ↓ indicates that smaller values are better.

Table 9. Statistical results of the average pixel values of the input cloudy images, reference images, and results of different methods on the three test datasets.

Table 10. Parameters, computational cost, and test time of different methods on the T-CLOUD dataset.