Contourlet-CNN for SAR Image Despeckling

: A multiscale and multidirectional network named the Contourlet convolutional neural network (CCNN) is proposed for synthetic aperture radar (SAR) image despeckling. SAR image resolution is not higher than that of optical images. If the network depth is increased blindly, the SAR image detail information ﬂow will become quite weak, resulting in severe vanishing/exploding gradients. In this paper, a multiscale and multidirectional convolutional neural network is constructed, in which a single-stream structure of convolutional layers is replaced with a multiple-stream structure to extract image features with multidirectional and multiscale properties, thus signiﬁcantly improving the despeckling performance. With the help of the Contourlet, the CCNN is designed with multiple independent subnetworks to respectively capture abstract features of an image in a certain frequency and direction band. The CCNN can increase the number of convolutional layers by increasing the number of subnetworks, which makes the CCNN not only have enough convolutional layers to capture the SAR image features, but also overcome the problem of vanishing/exploding gradients caused by deepening the networks. Extensive quantitative and qualitative evaluations of synthetic and real SAR images show the superiority of our proposed method over the state-of-the-art speckle reduction method.


Introduction
Synthetic aperture radar (SAR) images are images of the Earth's surface obtained by the observation tool (SAR systems) under any weather condition. However, SAR images are inevitably obscured by speckle noise due to their coherent imaging mechanism, which makes it extremely difficult for computer vision systems to automatically interpret SAR data. Removing speckle is an essential step before applying SAR images to various tasks [1].
Conventional methods remove speckle, either in the spatial domain, such as Lee [2], Kuan [3] and Frost [4], which operate on the pixels by sliding a window over the entire image, or in the frequency domain, where some transforms that can sparsely represent images, such as wavelets [5] and contourlets [6], are employed to reduce speckle by thresholding the small coefficients of the frequency domain. The frequency domain methods improve the performance of speckle reduction in the following two aspects: representing SAR image features more sparsely and more accurately distinguishing the transformation coefficients of the image content from the speckle. The former analyzes geometrical structures in SAR images from the multiscale and multidirection of the transform and represents many detailed features with fewer high-magnitude transform coefficients [7,8]. The goal of the latter is to optimize the threshold determination strategy to identify the coefficients representing the image content from all the transform coefficients as accurately as possible [9].
In the past decade, non-local self-similarity (NSS)-based methods [10][11][12][13] for speckle reduction have received wide attention due to their ability to eliminate speckle while sacrificing fewer image details. Deledalle et al. [10] proposed the first nonlocal patch-based despeckling method, where each despeckled pixel is a weighted average of pixels centered at some blocks that are similar to the block centered at the current pixel. Two perfect variants of this kind of method are NL-SAR [12] and SAR-BM3D [13], which can obtain a desired result for the SAR images with regular and repetitive textures. However, it is difficult for this kind of method to create an optimal balance between preserving SAR image features and removing artifacts because the NSS models are sensitive to the spatial features of SAR images.
Inspired by the success of deep convolutional neural networks (CNNs) in the field of optical image denoising, Chierchia et al. [14] first proposed a residual CNN for subtracting speckle noise from SAR images. Considering that speckle noise is assumed to be multiplicative, the method first transforms multiplicative speckle into an additive form and then regards the additive speckle as the residual of the network. The ID-CNN [15] can directly train SAR images without requiring a homomorphic transformation. In this method, the SAR image subtracts the learned speckle through skip connections, thus yielding a clean SAR image. SAR-DRN [16] employs dilated convolutions and a combination structure of skip connections with a residual learning. Similar to [15], SAR-DRN is also trained in an end-to-end way. Yue et al. [17] exploited a deep neural network architecture to extract image features and reconstruct the probability density function (PDF) of the radar cross section (RCS) that is obscured by speckle noise. Lattari et al. [18] proposed a deep encoder-decoder CNN based on the architecture of the U-Net to capture speckle statistical features. Cozzolino et al. [19] proposed a nonlocal despeckling method for SAR images, in which the weight of the target pixel is estimated using a convolutional neural network.
From the above deep-learning-based despeckling methods, it can be seen that these methods all employ CNNs with a single-stream structure for training, either outputting a clean image in an end-to-end fashion or learning the underlying noise model. However, it is very difficult for models that adopt a single-stream structure to capture the multidirectional features of images, which will result in the loss of many detailed edge and textural features in the process of removing speckle noise.
In this paper, we propose a multiscale and multidirectional CNN (CCNN) model to capture image features and to achieve better performance in suppressing speckle noise. The CCNN consists of multiple independent subnetworks (shown in Figure 1), each of which adopts a network structure and a loss function according to the characteristic of the subband. Each subnetwork captures feature details and removes speckle noise from a specific direction and a specific scale. When the loss function of each subnetwork reaches the optimal value, the despeckled SAR image is obtained in a coarse-to-fine manner through the in-verse Contourlet transform.
reduction have received wide attention due to their ability to eliminate speckle while sac rificing fewer image details. Deledalle et al. [10] proposed the first nonlocal patch-based despeckling method, where each despeckled pixel is a weighted average of pixels cen tered at some blocks that are similar to the block centered at the current pixel. Two perfec variants of this kind of method are NL-SAR [12] and SAR-BM3D [13], which can obtain a desired result for the SAR images with regular and repetitive textures. However, it is dif ficult for this kind of method to create an optimal balance between preserving SAR image features and removing artifacts because the NSS models are sensitive to the spatial fea tures of SAR images.
Inspired by the success of deep convolutional neural networks (CNNs) in the field o optical image denoising, Chierchia et al. [14] first proposed a residual CNN for subtracting speckle noise from SAR images. Considering that speckle noise is assumed to be multipli cative, the method first transforms multiplicative speckle into an additive form and then regards the additive speckle as the residual of the network. The ID-CNN [15] can directly train SAR images without requiring a homomorphic transformation. In this method, the SAR image subtracts the learned speckle through skip connections, thus yielding a clean SAR image. SAR-DRN [16] employs dilated convolutions and a combination structure o skip connections with a residual learning. Similar to [15], SAR-DRN is also trained in an end-to-end way. Yue et al. [17] exploited a deep neural network architecture to extrac image features and reconstruct the probability density function (PDF) of the radar cross section (RCS) that is obscured by speckle noise. Lattari et al. [18] proposed a deep encoder decoder CNN based on the architecture of the U-Net to capture speckle statistical features Cozzolino et al. [19] proposed a nonlocal despeckling method for SAR images, in which the weight of the target pixel is estimated using a convolutional neural network.
From the above deep-learning-based despeckling methods, it can be seen that these methods all employ CNNs with a single-stream structure for training, either outputting a clean image in an end-to-end fashion or learning the underlying noise model. However it is very difficult for models that adopt a single-stream structure to capture the multidi rectional features of images, which will result in the loss of many detailed edge and tex tural features in the process of removing speckle noise.
In this paper, we propose a multiscale and multidirectional CNN (CCNN) model to capture image features and to achieve better performance in suppressing speckle noise The CCNN consists of multiple independent subnetworks (shown in Figure 1), each o which adopts a network structure and a loss function according to the characteristic of the subband. Each subnetwork captures feature details and removes speckle noise from a specific direction and a specific scale. When the loss function of each subnetwork reaches the optimal value, the despeckled SAR image is obtained in a coarse-to-fine manner through the in-verse Contourlet transform.  Benefiting from the multiscale and multidirectional decomposition of the Contourlet transform, the CCNN can capture the detailed features in multiscales and multidirections, thus preserving image resolution while suppressing speckle noise. Compared with stateof-the-art despeckling methods, the CCNN can preserve relatively more detailed features while suppressing speckle noise.
In summary, the main contributions of the proposed method are fourfold: (1) the CCNN uses the Contourlet transform to divide a problem of SAR image despeckling into multiple subproblems and then suppresses speckle noises of the Contourlet sub-bands using multiple independent subnetworks, which means that our proposed CCNN can provide sufficient convolutional layers for capturing the image features, but overcome the problem of vanishing/exploding gradients; (2) each subnetwork has its structure and loss function, which ensures that each sub-band is the most similar to the corresponding sub-bands of the clean SAR image; (3) the features in each sub-band are concentrated in the horizontal or vertical or diagonal direction, etc., which reduces the requirement of a convolutional neural network, that is, each sub-band does not require too many convolution layers or a complex network to capture SAR image features and suppress speckle noise; (4) the subnetwork used to train each sub-band is independent and can run in parallel, thus shortening the training time.
This paper is organized as follows: Section 2 briefly introduces the related work on CNN-based despeckling and the Contourlet analysis. The architecture of the proposed CCNN and the adopted loss functions are described in detail in Section 3. We describe and discuss experimental results for synthetic and real-world SAR data in Section 4. Finally, Section 5 concludes and outlines future work.

Related Work
Many schemes for improving the visual quality of SAR image despeckling have been developed. Here, we focus on the schemes that are relative to the proposed CCNN.
With the success of AlexNet [20] in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2012), CNNs have attracted great attention in the field of SAR image despeckling. In the last two years, a surge in SAR image despeckling methods based on CNNs has been presented, which adopted the architecture of U-Net [21], ResNet [22], or DenseNet [23]. These CNN architectures are designed with many convolutional layers to make them deep enough to capture abundant enough features for images. To alleviate the problem of vanishing/exploding gradients caused by deepening the networks, the common characteristic of these networks is that not only neighboring layers, but also any other two layers, are linked by using skip connections. U-Net uses skip connections to concatenate feature mapping from the first convolutional layer to the last, second to second to last, etc. In ResNet, the input of convolutional blocks (containing multiple convolutional layers) is added to their output via skip connections. In DenseNet, all preceding convolutional layers were connected to their subsequent layers, which can overcome the shortcomings of ResNet, such as some layers being selectively discarded, or information being blocked. The basic idea of such a design is that only the deep network with a greater number of convolution layers can capture as many features as possible, helping to upgrade the visual quality of SAR image despeckling.
However, SAR image resolution is inferior to that of optical images. If capturing the detailed features of SAR images by only increasing the network depth, network training will become quite difficult and fail to capture some feature details. Multiscale and multidirection are natural attributes of an image. Analyzing the characteristics of multiscale and multidirectional images can reveal more essences of SAR images [24]. The Contourlet transform plays an important role in the multiscale and multidirectional analysis of images and can produce large magnitude coefficients for image details in a certain scale and direction, which can effectively analyze and sparsely represent the edges and textures of SAR images. This helps us to improve the despeckling performance of the CNN-based method.
The Contourlet transform [6] is an efficient multiresolution image representation, which has properties such as multiresolution, multiscale, multidirectionality, and anisotropy. The original image can be decomposed into lossless multiscale sub-bands in different directions and different frequency bands through Contourlet transform. In Contourlet transform, the Laplacian pyramid (LP) [6] is first used to decompose an image into a bandpass subband and a lowpass sub-band. The bandpass sub-bands from the LP are then fed into a directional filter bank (DFB) [6] to decompose into bandpass sub-bands with specific directional features. The process can be iterated on the lowpass sub-band, generating multiple sub-bands with different scales and different directions. These sub-bands use a few high-magnitude coefficients to represent image features along specific directions, while noise is generally represented by smaller coefficients [25]. Through Contourlet decomposition, we can divide the problem of requiring a deep and complicated CNN to handle SAR image despeckling into multiple small problems, each of which, using a simple CNN, trains the features of a Contourlet sub-band with a specific scale and direction. The features in the Contourlet sub-bands are concentrated in a specific direction, and their feature coefficients are large; the subnetwork that has a relatively simple structure and fewer convolutional layers will be sufficient to fully learn the feature details of each sub-band and effectively suppress speckle noise.

Proposed Method
In this section, we first introduce the architecture of the CCNN and describe its subnetworks. Second, we analyze the loss function of each subnetwork. Finally, the CCNN training procedure is described.

Network Structure
The CCNN is composed of multiple independent subnetworks, each of which captures feature details from a specific direction and a specific scale and removes SAR image speckle noise. Each subnetwork has a structure and objective function for learning the corresponding frequency sub-band independently. When the objective function of the subnetwork reaches the optimal value, it means that the sub-band of the contaminated SAR image has been trained to be very close to that of the clean image and, at this time, the parameters of the subnetwork have been adjusted to the optimal value. Now, the inverse Contourlet transform can be applied to all the well-trained sub-bands, resulting in a clean SAR image. As shown in Figure 1, the CCNN does not seek to improve the performance of capturing image features and eliminating speckles by deepening the network. The CCNN attempts to decompose images into several sub-bands or multiple subproblems using the Contourlet transform and then suppress speckle noise using multiple multidirectional and multiscale subnetworks. Thus, not only is the performance-suppressing speckle noise of the network guaranteed, but as many SAR image feature details as possible are preserved, thus balancing the despeckling performance against the lost feature details. Figure 1 shows the overall structure of the CCNN based on 2-level Contourlet decomposition. We use the ResNet [22] structure to design the eight subnetworks (shown in Figure 2a) for training the directional sub-bands in the first level of the Contourlet transform. This is based on the fact that these sub-bands contain most of the speckle noise of SAR images. We use the subnetworks with residual learning to directly estimate speckle noise and then obtain the sub-band of the estimated image via skip connections. The subnetworks consist of six convolutional layers. The subnetworks that are responsible for training the directional sub-bands in the second level of the Contourlet are designed with the UNet structure (shown in Figure 2b, UNet1, UNet2, UNet3 and UNet4). This is based on the fact that UNet [21] can perform well in image segmentation and capture the features of the refined edges of an image. These subnetworks have five convolutional layers, among which three are mixed convolution layers. The mixed convolution layer is obtained from the dilated convolution and the standard convolution, which can not only enlarge the receptive field of suppressing speckle noise, but also fully capture the small Remote Sens. 2021, 13, 764 5 of 20 features of the image, thus preserving as many of the detailed features as possible. S-CNN, shown in Figure 2c, which is responsible for training the coarse sub-band of the Contourlet, is composed of four convolution layers, among which are three mixed convolution layers. In addition, we add skip connections between the input and the output to ensure that the overall features of the image can be passed into the output. More details about each subnetwork setting are shown in Figure 2.
Remote Sens. 2021, 13, x FOR PEER REVIEW 5 of 20 from the dilated convolution and the standard convolution, which can not only enlarge the receptive field of suppressing speckle noise, but also fully capture the small features of the image, thus preserving as many of the detailed features as possible. S-CNN, shown in Figure 2c, which is responsible for training the coarse sub-band of the Contourlet, is composed of four convolution layers, among which are three mixed convolution layers.
In addition, we add skip connections between the input and the output to ensure that the overall features of the image can be passed into the output. More details about each subnetwork setting are shown in Figure 2.

Loss Functions
Loss functions is key to ensuring the performance of the CCNN. We explored different loss functions and their combinations to effectively learn various tasks [26][27][28]. The per-pixel Euclidean loss was considered an effective quantitative assessment. However, we observed that it may cause the loss of image details and result in artifacts on the final estimated image if minimizing pixelwise error depends only on the per-pixel loss. Additionally, the loss of the image details can be evaluated by capturing the difference in the detailed features [29]. Considering this, we combine the total variation (TV) regularization term with the Euclidean loss (EL) as the final loss function to judge the visual quality of each sub-band (the output of each subnetwork). The Euclidean loss is defined as The TV regularization term can be calculated as follows Finally, the loss function can be represented as

Loss Functions
Loss functions is key to ensuring the performance of the CCNN. We explored different loss functions and their combinations to effectively learn various tasks [26][27][28]. The per-pixel Euclidean loss was considered an effective quantitative assessment. However, we observed that it may cause the loss of image details and result in artifacts on the final estimated image if minimizing pixelwise error depends only on the per-pixel loss. Additionally, the loss of the image details can be evaluated by capturing the difference in the detailed features [29]. Considering this, we combine the total variation (TV) regularization term with the Euclidean loss (EL) as the final loss function to judge the visual quality of each sub-band (the output of each subnetwork). The Euclidean loss is defined as The TV regularization term can be calculated as follows Remote Sens. 2021, 13, 764 6 of 20 Finally, the loss function can be represented as where x(i, j) and y(i, j) represent the Contourlet coefficient values of the estimated SAR image and the corresponding clean image, respectively. C, W, and H represent the channel, width, and height of the input subband pair, respectively. The loss function use of the L cs metric can measure the difference between the sub-bands of the estimated image and that of the clean image, resulting in a relatively good peak signal to noise ratio (PSNR) for images. However, the restored image with a higher PSNR does not mean that the image details are well preserved. To create a better balance between preserving the detail features and suppressing speckle noise, we introduce a weight of λ into the loss function L cs . Because the coarse sub-band hardly contains speckle noise energy, we set the weight λ of the loss function L cs of the low-frequency sub-band to 1. For the loss function, L cs , of the directional sub-bands, λ is obtained as follows where ave represents the average value of the Contourlet coefficients of each directional subband. It can be calculated as Note that the ave of each directional subband is different and is closely related to the noise coefficients and the feature coefficients of each subband. Therefore, the loss function of each subband is also different. The image feature energy far surpasses the noise energy in images and noise has no directivity, meaning that the Contourlet coefficients of the image features are relatively larger, while the noise coefficients are relatively smaller. In computing the loss function of each subband, all coefficients larger than the ave threshold are considered to represent the features of the image details and are assigned a larger weight to enhance these features. All coefficients lower than the ave threshold are considered noise information and are assigned a smaller weight to suppress the noise.

Training Set and Parameter Setting
We train two CCNN models: one using the synthetic SAR image dataset and one using the real SAR image dataset. The synthetic SAR image dataset was built using 2500 optical images by injecting single-look speckle, and were collected from ISLVRC 2012 [30] and the Waterloo Exploration Database [31]. The real SAR image dataset is from 800 images from the UC Merced Dataset [32], the UCID [33], and scraped Google Maps images [34]. For the real SAR image dataset, the challenge is to create corresponding pairs of speckled and target images. Now that it is not possible to physically obtain a thoroughly despeckled SAR image, we adopt an alternative strategy to obtain speckle-free references as target images. We obtain the speckle-free targets by averaging multilooked images (25 dates) and keeping only the regions that do not significantly change over time. Specifically, we pick out those objects that have continuous single-look images from the UC Merced Dataset and the UCID Dataset. The first image of each object was used as the speckle SAR image, and the next series of 25 images were used to calculate the speckle-free data. We discarded the regions with significant temporal changes and averaged the same regions of 25 images, yielding the "clean" data. Of course, such a reference is far from a "clean" image, not only for the presence of temporal changes, but also for the limited number of SAR images, resulting in imperfect rejection of speckle. We use this real SAR image dataset to fine-tune the CCNN, which has been well trained via the synthetic SAR image dataset. All images are resized to 256×256 and the training data are augmented by random combinations, such as vertical and horizontal flipping, 90 • and 270 • rotation, changes in image contrast, etc. The test images are from the PatternNet dataset [35] and BSD500. The entire network is implemented in the TensorFlow framework and is updated using the Adam optimizer [36]. The activation function used is ReLU [37]. The convolution kernels for feature extraction are 3×3, and the kernel size for feature aggregation is 1×1. The final convolution layer is 3 (corresponding to a color SAR image) or 1 (referring to a grayscale SAR image). The learning rate is initially set to 0.0002 for all the subnetworks and it decreases by one-quarter after every 25 epochs. It takes approximately 9 h to train the CCNN with an NVIDIA RTX 2080Ti in the case that the 13 subnetworks (both grayscale SAR images and color SAR images via Contourlet decomposition of two levels) can work in parallel.

Network Training
Both Contourlet transform and CNN are suitable for dealing with additive noise. Therefore, we need to use the homomorphic transform to change the multiplicative speckle noise that contaminates the SAR images into additive noise before the Contourlet transform is applied to the SAR images. At the output of the CCNN, after the inverse Contourlet transform is used to composite the estimated images, the exponential function is needed to map the estimated image back to the multiplicative domain. In the CCNN, we decompose the speckled images through the Contourlet two times and obtain the thirteen Contourlet sub-bands, which concentrate the image features of the vertical, horizontal, diagonal directions, etc., and highlight these detailed features. Meanwhile, the subnetworks that are responsible for training the directional sub-bands of the first level of the Contourlet are designed with the Resnet structure to directly train the speckle noise rather than to train the estimated images. Therefore, these subnetworks can achieve the best despeckling performance with relatively shorter run time, creating a better balance between the performance of suppressing speckle and running efficiency. Each sub-band of the speckled images, incorporated with the corresponding Contourlet sub-bands of the clean images, is input into its network for training. The loss function of each subnetwork is calculated using Equation (3), learning the function mapping from the noisy sub-band to the clean sub-band. When the loss function of each subnetwork reaches the minimum, the feature mappings of all the subnetworks are obtained. Subsequently, the CCNN is used to remove the speckle from a tested SAR image.

Results
We carried out experiments on both synthetic and real SAR images. The motivation for designing the CCNN was to improve the performance of suppressing convolutional neural network speckle while shortening the training time. To this end, multiple experiments were conducted to evaluate the performance of the CCNN. First, we investigated the advantages of each independently learned Contourlet sub-band. Second, we examined the impacts of the different structures of the subnetworks on the despeckling quality of the SAR images. Third, we investigated the effect of different subnetworks with different loss functions. These experiments belong to our ablation studies, displaying the impact of each basic component on the performance of the CCNN. Finally, we selected four representative despeckling methods as the comparison baselines to compare and comprehensively analyze the performance of the proposed CCNN method: three CNN-based methods (SAR-CNN [14], ID-CNN [15], and SAR-UNet [18]), and the representative traditional method (SAR-BM3D [13]).
We used the PSNR to evaluate the despeckling results of the synthetic SAR images. In addition, the structural similarity (SSIM) [38] was used to evaluate the ability to retain the image details. Sheikh et al. [39] discussed many quantitative metrics for the estimated images and stated that the SSIM index is good at assessing the fidelity of detailed features. The higher the values, the more similar the local feature of the image will be, and the better the fidelity of the image. For SAR images from real-world scenarios, speckle suppression ability is measured objectively using no-reference image quality evaluation such as the natural image quality evaluator (NIQE) [40] and the equivalent number of looks (ENL) [41], which are calculated on homogeneous regions of the despeckled SAR images. The former can objectively measure the ability to suppress speckle noise in the absence of clean reference images, and the latter is sensitive to the content sharpness, texture diversity, and detail contrast of the images. Therefore, they are suitable for evaluating the speckle suppressing results of SAR images from real-world scenarios. The larger the NIQE and the larger the ENL, the better the preserved image details will be, and the speckle results will be suppressed.

Subband Training
In this section, we aim to verify that each Contourlet sub-band being trained independently not only shortens the training time but is also helpful for enhancing the denoising quality. One of the advantages of the CCNN is that it can be divided into several subnetworks. These subnetworks can learn the feature mapping of each Contourlet subband on multiple computers in parallel, and the well-trained sub-band can then be integrated to obtain a clean image through the inverse Contourlet. The CCNN that is trained in this way is denoted as CCNN-1. The CCNN learns its feature mapping on a single computer. Here, we compare the CCNN with the SAR-BM3D [13], SAR-CNN [14], ID-CNN [15], and SAR-UNet [18], despeckling benchmark methods. The results of these methods, shown in Table 1, come from their paper or cite their paper.  Table 1 displays the averages of PSNRs/SSIMs/NIQEs/ENLs and the GPU runtime when the CCNN and the four comparison methods handle the 100 images from the BSD 500 (added with the speckle noise of σ = 0.08) and the PatternNet dataset. The CCNN-1 and the CCNN both achieved the best metric values, with a relatively shorter execution time when compared with state-of-the-art image despeckling methods. It can also be seen that the execution time and training time of the CCNN were slightly greater than those of CCNN-1. This is because the subnetworks of CCNN-1 can be independently trained on multiple computers in parallel. The subnetworks of the CCNN are confined to running on a single computer. Due to the multiscale and multidirectional decomposition of Contourlets, the sub-bands of the Contourlet can not only intensively represent common characteristics in a certain direction, but also use larger coefficients than those of the noise to represent these features (for example, the coefficients of the horizontal directional edges in the LH sub-bands are larger). Each sub-band does not lose any detailed features because it does Remote Sens. 2021, 13, 764 9 of 20 not require clipping. Therefore, each subnetwork does not require a deep CNN with many convolution layers to capture the characteristics of these sub-bands; moreover, each subnetwork has a loss function, which can control and adjust the parameters of the subnetwork, to ensure that each sub-band is most similar to that of the clean SAR image. These measures ensure that the proposed CCNN can achieve better PSNR/SSIM/NIQE/ENL metrics than those of other comparison methods at a relatively faster speed, even if the CCNN works on a single computer.

The Structure of the Subnetwork
Here, we verify that the mechanism, i.e., each subnetwork that adopts a training structure to suit the characteristic of each sub-band, is helpful for improving the performance of the CCNN. In the CCNN, the structure of the subnetwork that is responsible for training each Contourlet sub-band can be different. We explore a CNN architecture that is fit for them according to each of their Contourlet sub-band attributes. The common feature of these subnetworks is that they all have few convolutional layers, which means that our proposed CCNN can have enough convolutional layers to capture the image features without causing the problem of vanishing/exploding gradients. The directional sub-bands in the low level contain most of the speckle noise of an SAR image and the size is larger. We adopt the ResNet structure to capture the speckle noise features from the sub-bands and learn the speckle noise feature mappings. Although these sub-bands are large, the subnetworks train the speckle noise features rather than the image content by using residual learning, which is not only easy to train but also reduces the learned parameters. This is based on the fact that there are far fewer contents to be learned from the noise than from an image. For the directional sub-bands in the highest level, the sub-bands contain more detailed features and less noise, and these detailed features may be very thin, which requires that the fine features extracted from the previous convolution layers can be passed to the subsequent convolution layers. Therefore, the subnetworks that are in charge of the directional sub-bands at the second level adopt the U-Net structure. Symmetric connections are added between the convolution layers and their corresponding layers, ensuring that the overall features captured by the previous convolution layers are passed into the subsequent feature maps. Furthermore, we also utilize dilate convolutions to enlarge the receptive field of the convolutional layers and to capture the thin features. For the coarse sub-bands in the highest level, the sub-bands contain much more overall image information, fewer detailed features and the least noise. The S-CNN that contains fewer mixed convolutional layers and has a skip connection will be competent enough to capture and learn the overall features of the coarse sub-bands. Figure 3 shows the positive impact of the subnetwork structures on the performance of the CCNN. The CCNN-2 represents a variant of the CCNN in which all the subnetworks are designed with the ResNet structure (i.e., the structure shown in Figure 2a) and each subnetwork has eight convolutional layers. We consider SAR-UNet [18] as a comparison baseline because it uses the same structure to extract detailed features at different scales. The comparison of the proposed CCNN method, the CCNN-2 method and the SAR-UNet method in terms of PSNRs/SSIMs/NIQEs/ENLs is shown in Figure 3a,d. These data are from the average of the 100 synthetic SAR images that included speckle noise with the variance σ = 0.06, 0.08, . . . , 0.2. Figure 3 illustrates that the performance of the CCNN is significantly superior to that of SAR-UNet. Additionally, the performance of CCNN-2 slightly surpasses that of the SAR-UNet method. This indicates that the strategy that uses the different CNN structures to train the different sub-bands can significantly improve the performance of suppressing speckles of the CCNN.

Loss Function
In the CCNN, the subnetwork used to independently train each Contourlet sub-band has not only its own structure but also its own loss function. As is known, the image contents and noise can be partially separated through multidirectional and multiscale Contourlet decomposition. The correlation of the image information is large, and the correlation of the noise information is small. Therefore, the amplitudes of most coefficients of the image details are higher than those of most of the noise coefficients, even if the noise intensity is high. Therefore, in the directional sub-bands, the high-magnitude coefficients convey most of the image detail energy and most of the low-magnitude coefficients are due to noise. If the high-magnitude coefficients can play a leading role in the loss function, it is conducive to the learning and preservation of image details and suppressing noise. Considering this point, we introduce a weight factor λ into the loss function of the directional sub-bands. The weight factor λ is set to two different values according to the coefficient values of the directional sub-band. Only coefficients greater than the average of the coefficients of each directional sub-band are assigned a higher λ value, and the rest are assigned a lower λ. However, the averages of the coefficients of each directional sub-band are bound to be different, which means that the objective function of each subnetwork has its own standard. This mechanism is helpful for suppressing speckle noise while preserv-

Loss Function
In the CCNN, the subnetwork used to independently train each Contourlet sub-band has not only its own structure but also its own loss function. As is known, the image contents and noise can be partially separated through multidirectional and multiscale Contourlet decomposition. The correlation of the image information is large, and the correlation of the noise information is small. Therefore, the amplitudes of most coefficients of the image details are higher than those of most of the noise coefficients, even if the noise intensity is high. Therefore, in the directional sub-bands, the high-magnitude coefficients convey most of the image detail energy and most of the low-magnitude coefficients are due to noise. If the high-magnitude coefficients can play a leading role in the loss function, it is conducive to the learning and preservation of image details and suppressing noise. Considering this point, we introduce a weight factor λ into the loss function of the directional sub-bands. The weight factor λ is set to two different values according to the coefficient values of the directional sub-band. Only coefficients greater than the average of the coefficients of each directional sub-band are assigned a higher λ value, and the rest are assigned a lower λ. However, the averages of the coefficients of each directional sub-band are bound to be different, which means that the objective function of each subnetwork has its own standard. This mechanism is helpful for suppressing speckle noise while preserving the feature details. To verify the effectiveness of this scheme, we conducted the following experiments: each subnetwork uses the same loss function, i.e., weight factor λ of Equation (1) is set to 1, to learn its feature mapping, and this CCNN is denoted as CCNN-3. Here, we consider the ID-CNN [15] as a comparison baseline because its weight factor λ of the loss function is also a constant. Figure 4 shows the probability distributions of the PSNR, SSIM, NIQE, and ENL gains of these 200 SAR images (100 synthetic and 100 real images). These gain values were obtained from the CCNN and CCNN-3 relative to the ID-CNN baseline. The PSNR/SSIM/NIQE/ENL gains shown in Figure 4 illustrate that the performance of the CCNN exceeds those of the ID-CNN by a large margin and that CCNN-3 is slightly superior to the ID-CNN, which demonstrates that using different loss functions to supervise learning of each sub-band is helpful for improving the despeckling performance of the CCNN.
ote Sens. 2021, 13, x FOR PEER REVIEW 11 o Figure 4 shows the probability distributions of the PSNR, SSIM, NIQE, and EN gains of these 200 SAR images (100 synthetic and 100 real images). These gain values w obtained from the CCNN and CCNN-3 relative to the ID-CNN baseline. T PSNR/SSIM/NIQE/ENL gains shown in Figure 4 illustrate that the performance of t CCNN exceeds those of the ID-CNN by a large margin and that CCNN-3 is slightly sup rior to the ID-CNN, which demonstrates that using different loss functions to superv learning of each sub-band is helpful for improving the despeckling performance of t CCNN.

The CCNN Performance
To comprehensively verify the CCNN performance, we investigate the despeckli results of the CCNN and the four comparison methods on synthetic SAR images and r SAR images. In particular, we also investigate the performance of the CCNN+, a varia of the CCNN, in which the number of convolutional layers increases to 10 or the numb of the mixed convolution layers increases to five in each subnetwork. The regions of terest (ROIs) in each despeckled image are enlarged by two times using bicubic interp

The CCNN Performance
To comprehensively verify the CCNN performance, we investigate the despeckling results of the CCNN and the four comparison methods on synthetic SAR images and real SAR images. In particular, we also investigate the performance of the CCNN+, a variant of the CCNN, in which the number of convolutional layers increases to 10 or the number of the mixed convolution layers increases to five in each subnetwork. The regions of interest (ROIs) in each despeckled image are enlarged by two times using bicubic interpolation and are shown in the corners to highlight the details obtained from different methods.

Results on Synthetic SAR Images
The 100 test images were randomly selected from the BSD500 and speckle noise was added with variance σ = 0.06, 0.08, . . . , and 0.2. Among them, the representatives of grayscale and color images and their corresponding single-look speckle synthetic images are shown in Figure 5. The two assessment values in terms of PSNR and SSIM (averaging values of 100 synthetic images obtained from these methods are presented in Table 2. The visual quality of the despeckled images is shown in Figures 6 and 7.      The assessment values from the methods shown in Table 2 are high, illustrating that these methods all yield good speckle-suppressing results. High PSNR values indicate that the despeckled image is closest to the original clean image; high SSIM values demonstrate that the methods can retain the edge and texture details while suppressing speckle noise. Overall, the assessment values obtained from the CNN-based methods (SAR-CNN,ID-CNN,SAR-UNet and CCNN) are superior to those of the traditional method (SAR-BM3D). This is because the traditional despeckling method distinguishes the image energy from the speckle noise based on image similarity and sparse representation, while the CNN-based methods can extract useful end-to-end data by integrating multiple convolutional layers and/or convolutional blocks, thus having a powerful learning capability and providing accurate predictive features for the estimated images. Figures 6 and 7 indicate that the visual quality of images obtained from the traditional despeckling methods is indeed inferior to the image quality obtained from the CNN-based methods. There are significant block artifacts around the edges of images obtained from the SAR-BM3D method. With the help of residual learning, the SAR-CNN and the ID-CNN can retain the edge and texture details well while removing noise and they generate slight artifacts only at the sharp edges. In contrast, the visual results of the ID-CNN are surperior to those of the SAR-CNN, perhaps benefiting from the ID-CNN using the improved loss function. Due to introducing downscaling and upscaling convolutional layers, the numerical results of the despeckled images from SAR-UNet are relatively better; most of the details are preserved, although the details in the textural regions are lost. The CCNN learns the deep characteristics of the images in multiple scales and multiple directions by using the Contourlet transform to build a multiple-stream structure, thus estimating fully detailed image information. The obtained numerical results from the CCNN are desired and have not only the highest PSNR mean values, but also relatively high SSIMs. Additionally, the visual quality obtained from the CCNN is also outstanding, with few slight artifacts appearing only at some edges. Moreover, when the number of convolutional layers of the CCNN increases, namely, CCNN+, the obtained metrics are better in terms of the PSNRs/SSIMs. This indicates that there is still room for improvement in the structure and the performance of the CCNN.

Results on Real SAR Images
The experiments on the synthetic images cannot adequately validate the effectiveness of the proposed method since the synthetic images in the simulation experiments are not acquired in a real degradation way. Here, we repeat the experiments on real SAR images to illustrate the feasibility and robustness of the proposed CCNN in some real-world

Results on Real SAR Images
The experiments on the synthetic images cannot adequately validate the effectiveness of the proposed method since the synthetic images in the simulation experiments are not acquired in a real degradation way. Here, we repeat the experiments on real SAR images to illustrate the feasibility and robustness of the proposed CCNN in some real-world scenarios.
We selected eight SAR images that have multilooked images from the PatternNet dataset to test the CCNN. Two of these SAR images and their corresponding multilook references are shown in Figure 8. The red boxes on the multilooked images represent the regions used to compute the ENL. In this experiment, temporal multilooking was used as the reference image for training. This reference is quite different from the ideal "clean" image, which implies that the speckle noise cannot be completely suppressed. Therefore, we focus on the comparisons of the quantitative indicators NIQE and ENL.  The quantitative evaluation of the baseline methods and our CCNN are shown in Table 3. The metric NIQE can measure the quality of the despeckled images and is expressed as a simple distance between the despeckled images and the model that is constructed via statistical features collected from many real SAR images. The larger the metric NIQE is, the better the image detail preservation will be. The index ENL indicates the smooth degree of the despeckled image by calculating the mean and standard deviation of a homogeneous region. The larger the ENL is, the smoother the region will be and the better the speckle noise suppression. In Table 3, the ENL of the SAR-CNN method is the lowest while the NIQE of the SAR-BM3D network is the lowest, which indicates that the former cannot effectively suppress speckle noise and the latter cannot preserve fine details in the textural region. Overall, the metrics obtained from the CCNN on the eight SAR images (#1, #2, . . . , #8) are the most satisfactory and have the highest ENL and NIQE, which indicates that the CCNN can effectively suppress speckle noise without significantly impairing the image resolution. Figures 9 and 10 show the visual qualities of the despeckled images obtained from the baseline methods and the proposed CCNN. In contrast, we observe that the SAR-BM3D, SAR-CNN, ID-CNN, and SAR-UNet lose details to some extent, blur some texture details and edges, or have a large level of residual speckle noise, while our proposed CCNN and the CCNN+ preserve most details well, even for thin lines and complex textures. Additionally, very few speckle noises are left and very few artifacts are introduced.

Discussion
In summary, the proposed CCNN provides very satisfying results considering despeckling performance and detail preservation. For real-world SAR images, we observe the same results with the synthetic SAR images as before, only less pronounced. This probably has to do with the imperfect reference images used for training the CCNN. In fact, a 25-look image (obtained by averaging a series of 25 SAR images) is not a true clean SAR image, but only an approximation of it, based on temporal multilooking. The regions characterized by a different average intensity from the rest of the image probably correspond to regions where the despeckled image approaches the reference image but not the original noisy image. Thus, the CCNN behaves as instructed to do based on bad examples. With this premise, the reference multilooking images can only provide bad results. If our conjecture is right, the CCNN performance will be further improved when better reference data are available.
The CCNN can be extended to higher level of Contourlet decomposition. Nevertheless, higher level inevitably results in deeper network and heavier computational burden. Thus, a suitable level is required to balance efficiency and performance. We examined the PSNR and runtime results of CCNNs with the levels of 1 to 3. We observed that CCNN with 2-level architecture performs much better than CCNN with 1-level, while CCNN with 3-level architecture only performs negligibly better than CCNN with 2-level in terms of the PSNR metric. Moreover, the speed of CCNN with 2level is also moderate compared with many more decomposition levels. Taking both efficiency and performance gain into account, we choose CCNN with 2-level as the default setting.

Discussion
In summary, the proposed CCNN provides very satisfying results considering despeckling performance and detail preservation. For real-world SAR images, we observe the same results with the synthetic SAR images as before, only less pronounced. This probably has to do with the imperfect reference images used for training the CCNN. In fact, a 25-look image (obtained by averaging a series of 25 SAR images) is not a true clean SAR image, but only an approximation of it, based on temporal multilooking. The regions characterized by a different average intensity from the rest of the image probably correspond to regions where the despeckled image approaches the reference image but not the original noisy image. Thus, the CCNN behaves as instructed to do based on bad examples. With this premise, the reference multilooking images can only provide bad results. If our conjecture is right, the CCNN performance will be further improved when better reference data are available.
The CCNN can be extended to higher level of Contourlet decomposition. Nevertheless, higher level inevitably results in deeper network and heavier computational burden. Thus, a suitable level is required to balance efficiency and performance. We examined the PSNR and runtime results of CCNNs with the levels of 1 to 3. We observed that CCNN with 2-level architecture performs much better than CCNN with 1-level, while CCNN with 3-level architecture only performs negligibly better than CCNN with 2-level in terms of the PSNR metric. Moreover, the speed of CCNN with 2-level is also moderate compared with many more decomposition levels. Taking both efficiency and performance gain into account, we choose CCNN with 2-level as the default setting.

Conclusions
In this paper, we proposed a convolutional neural network for suppressing speckle noise (CCNN) that captures feature details and suppresses speckle noise from multiple scales and multiple directions. The CCNN uses the Contourlet transform to decompose SAR image despeckling into multiple subproblems and then suppress speckle noise using multiple multidirectional and multiscale subnetworks. Each subnetwork is in charge of learning the feature mapping and suppressing the speckle noise coefficients of each Contourlet subband under the supervision of its loss function. Considering the attribute of each subband, we design a different structure and a different loss function for each subnetwork. The subnetworks do not require too many convolutional layers and a complex network structure to capture the features or speckle noise in a specific scale and specific direction of an image, which means that our proposed CCNN not only provides sufficient convolutional layers to capture the image features but also avoids the problem of vanishing/exploding gradients. The ability to preserve feature details of the CCNN can be significantly improved by capturing the detailed texture and clear structure at multiple scales and in multiple directions. In addition, these subnetworks can be executed in parallel on different computers, resulting in shortening the training time of the network. In addition, these subnetworks can be executed in parallel on different computers, resulting in shortening the training time of the network. The results for the synthetic and real SAR images show that the CCNN achieves the best performance with the lowest runtime when compared with the state-of-the-art methods for suppressing speckle, resulting in a good tradeoff between efficiency and performance.
In the future, we will explore a suitable generative model by exploiting the techniques [42,43] to generate training samples more similar to the data distribution of real SAR images.