1. Introduction
Remote sensing big data have considerably advanced Earth science research and produced an enormous volume of Earth observation data. As one of the main sources of remote sensing big data, synthetic aperture radar (SAR) can acquire Earth surface images all day and in all weather. Hence, it has played a crucial role in remote sensing big data applications, including wetland monitoring [1,2], forest assessment [3,4], snowmelt monitoring [5], flood inundation mapping [6], and ship classification and detection [7,8,9,10]. However, it is not easy to extract analysis results from SAR observation big data. SAR images are inherently corrupted by speckle noise, which is caused by the constructive and destructive interference of back-scattered microwave signals [11]. Speckle noise degrades image quality and makes SAR images challenging to interpret, both visually and automatically. Hence, suppressing speckle noise (i.e., despeckling) is an indispensable task in SAR image preprocessing.
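For context, the speckle in an intensity SAR image is commonly described by the following multiplicative noise model (a standard formulation stated here for reference; the notation is ours, not taken from this section):

$$Y = X \cdot N,$$

where $Y$ is the observed intensity, $X$ is the underlying speckle-free reflectivity, and $N$ is the speckle. For an $L$-look intensity image, $N$ is conventionally modeled as Gamma distributed with unit mean, $\mathbb{E}[N]=1$, and variance $1/L$:

$$p_{N}(n)=\frac{L^{L}\,n^{L-1}}{\Gamma(L)}\,e^{-Ln},\quad n\geq 0.$$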
To mitigate speckle noise in SAR images, a large number of traditional methods have been proposed, including filter-based methods (e.g., Lee [12] and Frost [13]), variational methods (e.g., the AA model [14] and the SO model [15]), and nonlocal methods (e.g., the probabilistic patch-based algorithm (PPB) [16] and the SAR block-matching 3D algorithm (SAR-BM3D) [17]). Most of these methods face several significant problems: (1) they usually require a proper selection of parameter settings, which largely depends on subjective experience; (2) their performance is, to a certain extent, scene-dependent: speckle noise is smoothly removed in homogeneous regions (e.g., agricultural fields), whereas detailed information (e.g., edges and textures) is lost in heterogeneous regions (e.g., around strong scatterers); (3) artefacts sometimes appear in flat areas, such as ringing near edges and isolated patterns [18]. A detailed analysis of these traditional methods can be found in the review in [18].
With the fast advancement of deep learning, convolutional neural networks (CNNs) have demonstrated superior performance in computer vision tasks such as image reconstruction [19], semantic segmentation [20], super-resolution [21], object detection [22], image classification [23], and identification [24]. Benefiting from their powerful feature extraction capability, CNNs have also been employed for image denoising [25]. Generally, CNN-based image denoising methods adopt supervised learning, which requires a large number of “Noisy2Clean” image pairs (i.e., noisy inputs and the corresponding clean targets). By minimizing a distance metric between the network outputs and the clean targets, CNN models are trained to produce denoised images. However, when applying supervised denoising methods to SAR image despeckling, a key problem arises: clean SAR images are hard to acquire in real-world conditions. Thus, there are not sufficient clean SAR images to serve as targets for training a despeckling network. In the literature, two main strategies address this problem: using multitemporal SAR data [26,27] and synthetic speckled data [28,29,30,31,32,33,34], which are introduced as follows:
(1) Multitemporal SAR data: Chierchia et al. [26] trained a 17-layer SAR despeckling CNN (SAR-CNN) using multitemporal data from the same scene, where approximately clean targets were obtained from multitemporal (multilook) SAR images. Similarly, Cozzolino et al. [27] selected region images without significant temporal changes (25 dates): the speckled input was the first image of each region, and the clean target was obtained from the subsequent series of 25 images. However, due to changes across the time sequence and scene-registration errors, treating the multilook images as clean targets is not entirely reliable, leading to suboptimal rejection of speckle noise;
(2) Synthetic speckled data: A more common strategy for deep-learning-based SAR image despeckling is to construct synthetic speckled data for training. These methods all adopt supervised learning with “Speckled2Clean” image pairs: speckle-free single-channel grey optical images (e.g., from the UC Merced land-use dataset [35]) are employed as clean targets, and speckled inputs are generated by corrupting the corresponding grey optical images with simulated speckle noise.
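As an illustration, the following minimal NumPy sketch shows how such “Speckled2Clean” training pairs are typically generated under the multiplicative Gamma speckle model; the number of looks and the example image are hypothetical placeholders, not details taken from the cited works.

```python
import numpy as np

def synthesize_speckled_pair(clean, looks=1, rng=None):
    """Generate a (speckled, clean) training pair from a grey optical image.

    clean : 2D array in [0, 1], used as the speckle-free target.
    looks : equivalent number of looks L; speckle ~ Gamma(L, 1/L), unit mean.
    """
    rng = rng or np.random.default_rng()
    speckle = rng.gamma(shape=looks, scale=1.0 / looks, size=clean.shape)
    speckled = clean * speckle  # multiplicative noise model Y = X * N
    return speckled.astype(np.float32), clean.astype(np.float32)

# Example: corrupt a synthetic "clean" image with single-look speckle.
clean = np.random.rand(64, 64)
speckled, target = synthesize_speckled_pair(clean, looks=1)
```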
Specifically, the image despeckling CNN (ID-CNN) [28] employs eight convolutional layers with rectified linear unit (ReLU) activations and batch normalization layers. In addition, a division residual layer with a skip connection is used to output despeckled images, where the speckled inputs are divided by the estimated speckle noise. The SAR dilated residual network (SAR-DRN) [29] implements a seven-layer lightweight network with dilated convolutions, enlarging the receptive field while keeping the filter size fixed. Unlike ID-CNN, SAR-DRN employs residual learning: the network outputs the estimated speckle noise of the speckled inputs rather than despeckled images, and the loss is calculated between this estimate and the residual speckle noise, which is obtained by subtracting the clean targets from the corresponding speckled inputs. The multiscale recurrent network (MSR-Net) [30] consists of multiscale cascaded subnetworks, each comprising an encoder, a decoder, and a convolutional long short-term memory unit; multiscale recurrence and weight-sharing strategies are adopted to increase network capacity. The hybrid dilated residual attention network (HDRANet) [31], like SAR-DRN, utilizes dilated convolutions to enlarge the receptive field, but differs in adopting hybrid dilated convolution, which avoids gridding artefacts; an attention module with a residual architecture is also introduced to improve performance. The spatial and transform domain CNN (STD-CNN) [32] fuses spatial and wavelet-based transform-domain features to despeckle SAR images with rich details and a global topology structure. A multiconnection network incorporating wavelet features (MCN-WF) [33] likewise leverages the wavelet transform: using wavelet features, the loss function was redesigned to train a multiconnection network based on dense connections. Considering the distribution characteristics of SAR speckle noise, the SAR recursive deep CNN prior (SAR-RDCP) [34] combines the strong mathematical basis of traditional variational models with the nonlinear end-to-end mapping ability of deep CNNs. The whole SAR-RDCP model consists of two subblocks, a data-fitting block and a predenoising residual channel attention block, which are jointly optimized with a novel despeckling gain loss to achieve the overall network training.
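To make the difference between the two output conventions concrete, the sketch below contrasts ID-CNN-style division-residual output with SAR-DRN-style residual learning. The backbone `net` is a placeholder, and the loss formulations are simplified paraphrases of the descriptions above, not the authors' released code.

```python
import torch
import torch.nn as nn

net = nn.Sequential(  # placeholder backbone standing in for ID-CNN/SAR-DRN
    nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1, 3, padding=1),
)
mse = nn.MSELoss()

clean = torch.rand(8, 1, 64, 64)                        # clean optical targets
speckle = torch.distributions.Gamma(1.0, 1.0).sample(clean.shape)
speckled = clean * speckle                              # multiplicative speckle

# ID-CNN-style division residual: the network estimates the speckle, and the
# despeckled image is the speckled input divided by that estimate (a small
# epsilon guards the division).
despeckled = speckled / (net(speckled) + 1e-6)
loss_idcnn = mse(despeckled, clean)

# SAR-DRN-style residual learning: the network estimates the residual speckle
# noise, defined as the speckled input minus the clean target.
residual = speckled - clean
loss_sardrn = mse(net(speckled), residual)
```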
The advantage of using synthetic speckled data is that a large number of “Speckled2Clean” image pairs can be generated. This allows deep CNN models to be trained stably without overfitting and to learn the complex nonlinear mapping between speckled inputs and clean targets. However, due to the difference in imaging mechanisms, optical images differ from SAR images in grey-level distribution, spatial correlation, dynamics, power spectral density, etc. [18]. Many unique characteristics of SAR images (e.g., scattering phenomena) are therefore neglected in the training process. An illustration of the differences between optical and SAR images, drawn from the Sen1-2 dataset [36], is presented in Figure 1. Hence, training the network on synthetic speckled data is not ideal for despeckling SAR images in practical situations. This exposes a domain gap problem: despeckling networks perform well on data from a domain similar to the training data (i.e., synthetic speckled images) but poorly on the testing data (i.e., real SAR images).
To address the requirement of “Noisy2Clean” image pairs in supervised CNN-based denoising, Lehtinen et al. [37] proposed a novel training strategy named Noise2Noise. It demonstrated that denoised images can be generated by networks trained on “Noisy2Noisy” image pairs, consisting of a noisy input and a noisy target that share the same underlying clean ground truth but are corrupted by independent, identically distributed noise. The basic idea is that the mean squared error (MSE) loss is minimized by the expected value of the targets. Hence, the Noise2Noise strategy is suitable for noisy images whose expected value equals that of the underlying clean ground truth, for example, images corrupted by additive white Gaussian noise (AWGN). Inspired by this, Ma et al. [38] proposed a noisy-reference-based SAR deep learning (NR-SAR-DL) filter, which trains the despeckling network on multitemporal SAR images acquired from the same scene by the same sensor (called “Speckled2Speckled” image pairs). NR-SAR-DL achieves outstanding despeckling performance on real multitemporal SAR data, especially in preserving point targets and radiometry. However, although NR-SAR-DL integrates temporal stationarity information into its loss function, its effectiveness is still affected by training errors caused by temporal variations.
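The underlying argument can be restated compactly (a standard derivation consistent with the description above, not quoted from [37]): for a fixed input, the MSE-optimal point estimate $z$ is the expected value of the targets,

$$\arg\min_{z}\;\mathbb{E}_{y}\!\left[(z-y)^{2}\right]=\mathbb{E}[y],$$

so replacing a clean target $x$ with a noisy target $y$ leaves the minimizer unchanged whenever $\mathbb{E}[y \mid x]=x$, which holds for zero-mean additive noise such as AWGN.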
When neither “Speckled2Clean” nor “Speckled2Speckled” image pairs are available, training a CNN-based despeckling network becomes challenging. Recently, Quan et al. [39] proposed a dropout-based scheme named Self2Self for image denoising with only single noisy images. In the Self2Self strategy, a denoising CNN with dropout is trained on Bernoulli-sampled instances of the noisy images. For images corrupted by AWGN, the denoising performance of Self2Self is comparable to that of “Noisy2Clean” training, which opens the possibility of training a deep despeckling network using only real speckled SAR images.
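A minimal sketch of one Self2Self-style training step is given below, assuming p denotes the probability of dropping a pixel (matching the convention used in Section 4); the network, optimizer, and masked loss are simplified illustrations of the scheme in [39], not its exact implementation.

```python
import torch
import torch.nn as nn

net = nn.Sequential(  # placeholder dropout CNN standing in for the real network
    nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.Dropout2d(0.3),
    nn.Conv2d(64, 1, 3, padding=1),
)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)

noisy = torch.rand(1, 1, 64, 64)  # a single noisy observation
p = 0.3                           # probability that a pixel is dropped

# Bernoulli-sampled instance: keep each pixel with probability 1 - p.
keep = torch.bernoulli(torch.full_like(noisy, 1.0 - p))
masked_input = noisy * keep

# Predict the full image from the masked input, but supervise only on the
# dropped pixels, so the network cannot simply learn the identity mapping.
pred = net(masked_input)
loss = (((pred - noisy) ** 2) * (1.0 - keep)).sum() / (1.0 - keep).sum().clamp(min=1.0)
loss.backward()
optimizer.step()
```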
In this paper, we aim to solve the following problem: training a deep despeckling network requires clean SAR ground-truth images, which are difficult to obtain in real-world conditions. By solving this problem, the deep despeckling network can be trained on real SAR images instead of synthetic speckled images. To this end, inspired by Self2Self, we propose an advanced SAR image despeckling method based on Bernoulli-sampling-based self-supervised deep learning, namely SSD-SAR-BS. Our main contributions are summarized as follows:
To address the problem that no clean SAR images can be employed as targets to train the deep despeckling network, we propose a Bernoulli-sampling-based self-supervised despeckling training strategy that utilizes the known speckle noise model and real speckled SAR images. Its feasibility is proven mathematically by combining the characteristics of speckle noise in SAR images with the mean squared error loss function;
A multiscale despeckling network (MSDNet) was designed based on the traditional UNet, where shallow and deep features are fused to recover despeckled SAR images, and dense residual blocks are introduced to enhance the feature extraction ability. In addition, a dropout-based ensemble is proposed for the testing process to avoid the pixel loss caused by Bernoulli sampling and to boost the despeckling performance;
We conducted qualitative and quantitative comparison experiments on synthetic speckled and real SAR image data. The results show that our proposed method significantly suppresses speckle noise while preserving image features more reliably than state-of-the-art despeckling methods.
The rest of this paper is organized as follows. Section 2 introduces our proposed method in detail. Section 3 describes the compared methods, experimental settings, and experimental results on synthetic speckled and real SAR image data. Section 4 discusses the impacts of several components of our proposed method. Section 5 summarizes the paper.
4. Discussion
As mentioned earlier, the Bernoulli sampling probability p was set to 0.3, and the dropout-based ensemble was used to boost the performance of our proposed SSD-SAR-BS. To confirm the effectiveness of these settings, we provide a set of comparative experiments. For the Bernoulli sampling probability p, only 0.1, 0.3, and 0.5 were compared, because when p was too large, most pixels of the input speckled images were lost and the despeckling performance fell rapidly. For the objective evaluation, we present the despeckling performance on 100 synthetic speckled images from the AID dataset in terms of the average PSNR and SSIM values, under different numbers of averaged predictions K in the dropout-based ensemble.
As shown in Figure 15, as K increases, the average PSNR and SSIM values increase significantly, especially the average PSNR values for K from zero to twenty and the average SSIM values for K from zero to forty. This trend was similar for the different Bernoulli sampling probabilities, i.e., p = 0.1, p = 0.3, and p = 0.5. Moreover, the magnified curves in the right column of Figure 15 show that the average PSNR and SSIM values for p = 0.3 were larger than those for p = 0.1 and p = 0.5. This is because when p = 0.1, the randomness of the Bernoulli sampling was not strong enough, whereas when p = 0.5, too few pixels were preserved to construct the despeckled images. Hence, 0.3 was the superior value of the Bernoulli sampling probability p, and the dropout-based ensemble effectively boosted the despeckling performance of our proposed SSD-SAR-BS.
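For concreteness, the test-time procedure discussed here can be sketched as follows, with `net` a trained despeckling CNN kept in training mode so that its dropout stays active; the function name and averaging details are illustrative assumptions, not the released implementation.

```python
import torch

@torch.no_grad()
def dropout_ensemble(net, speckled, K=40):
    """Average K stochastic forward passes to form the final despeckled image."""
    net.train()  # keep dropout active at test time for stochastic predictions
    acc = torch.zeros_like(speckled)
    for _ in range(K):
        acc += net(speckled)
    return acc / K
```

Larger K trades runtime for a lower-variance ensemble average, which is exactly the PSNR/SSIM-versus-K behaviour observed in Figure 15.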
We also list the inference runtimes of the compared methods and our proposed method in Table 4. All methods were run in the same system environment described in Section 3.1.2. The dropout-based ensemble in our proposed SSD-SAR-BS enables the model to operate in two modes: (1) an accurate model (e.g., K = 100), which is more expensive because of the larger number of averaged predictions, but provides superior speckle noise suppression and detail preservation; and (2) a fast model (e.g., K = 40), which improves the inference efficiency (reduces the test runtime) for more time-critical SAR image despeckling tasks by reducing K. From Table 4, we can see that the runtime decreases significantly as K is reduced. Specifically, when K = 40, the runtimes for images with 64 × 64 and 128 × 128 pixels were reduced to about 0.25 and 0.75 s, respectively, making the fast model quicker than the traditional methods (i.e., PPB and SAR-BM3D) and comparable to the other deep-learning-based methods. In other words, although much time is needed to train the deep neural network, the testing process is speedy, which is another advantage of our proposed SSD-SAR-BS.