An Advanced SAR Image Despeckling Method by Bernoulli-Sampling-Based Self-Supervised Deep Learning

Abstract: As one of the main sources of remote sensing big data, synthetic aperture radar (SAR) can provide all-day and all-weather Earth image acquisition. However, speckle noise in SAR images brings a notable limitation for its big data applications, including image analysis and interpretation. Deep learning has been demonstrated as an advanced method and technology for SAR image despeckling. Most existing deep-learning-based methods adopt supervised learning and use synthetic speckled images to train the despeckling networks. This is because they need clean images as the references, and it is hard to obtain purely clean SAR images in real-world conditions. However, significant differences between synthetic speckled and real SAR images cause the domain gap problem. In other words, they cannot show superior performance for despeckling real SAR images as they do for synthetic speckled images. Inspired by recent studies on self-supervised denoising, we propose an advanced SAR image despeckling method by virtue of Bernoulli-sampling-based self-supervised deep learning, called SSD-SAR-BS. By only using real speckled SAR images, Bernoulli-sampled speckled image pairs (input–target) were obtained as the training data. Then, a multiscale despeckling network was trained on these image pairs. In addition, a dropout-based ensemble was introduced to boost the network performance. Extensive experimental results demonstrated that our proposed method outperforms the state-of-the-art for speckle noise suppression on both synthetic speckled and real SAR datasets (i.e., Sentinel-1 and TerraSAR-X).


Introduction
Remote sensing big data have considerably advanced Earth science research, producing a significant amount of Earth observation data. As one of the main sources of remote sensing big data, synthetic aperture radar (SAR) provides the capability of acquiring all-day and all-weather Earth ground images. Hence, it has played a crucial role in remote sensing big data applications, including wetland monitoring [1,2], forest assessment [3,4], snowmelt monitoring [5], flood inundation mapping [6], and ship classification and detection [7–10]. However, it is not easy to extract analysis results from SAR observation big data. SAR images are inherently corrupted by speckle noise, caused by the constructive or destructive interference of back-scattered microwave signals [11]. The existence of speckle noise degrades image quality and makes it challenging to interpret SAR images both visually and automatically. Hence, suppressing speckle noise (i.e., despeckling) is an indispensable task in SAR image preprocessing.
To mitigate the speckle noise in SAR images, a large number of traditional methods have been proposed, including filter-based (e.g., Lee [12] and Frost [13]), variational-based (e.g., the AA model [14] and the SO model [15]), and nonlocal-based (e.g., PPB [16] and SAR-BM3D [17]) methods. More recently, deep convolutional neural network (CNN) models have been applied to despeckling. Similar to the SAR-DRN, HDRANet adopts hybrid dilated convolution, which avoids gridding artefacts; an attention module with a residual architecture was also introduced to improve network performance. In particular, the spatial and transform domain CNN (STD-CNN) [32] fuses spatial and wavelet-based transform domain features for SAR image despeckling with rich details and a global topology structure. A multiconnection network incorporating wavelet features (MCN-WF) [33] also exploits the wavelet transform: using wavelet features, the loss function was redesigned to train a multiconnection network based on dense connections. Considering the distribution characteristics of SAR speckle noise, the SAR recursive deep CNN prior (SAR-RDCP) [34] combines the strong mathematical basis of traditional variational models with the nonlinear end-to-end mapping ability of deep CNN models. The whole SAR-RDCP model consists of two subblocks: a data-fitting block and a predenoising residual channel attention block. By introducing a novel despeckling gain loss, the two subblocks are jointly optimized to achieve the overall network training.
The advantage of using synthetic speckled data is that a large number of "Speckled2Clean" image pairs can be generated. This allows deep CNN models to be trained stably without overfitting and to learn the complex nonlinear mapping between speckled inputs and clean targets. However, due to the difference in imaging mechanisms, optical images differ from SAR images in terms of grey-level distribution, spatial correlation, dynamics, power spectral density, etc. [18]. Many unique characteristics (e.g., scattering phenomena) of SAR images are thus neglected in the training process. An illustration of the differences between optical and SAR images is presented in Figure 1, provided by the Sen1-2 dataset [36]. Hence, training the network on synthetic speckled data is not ideal for despeckling SAR images in practical situations. This exposes a domain gap problem: the despeckling networks perform well on data from a domain similar to the training data (i.e., synthetic speckled images), but poorly on real SAR images. To address the problem that supervised-learning-based CNN methods require "Noisy2Clean" image pairs to train denoising networks, Lehtinen et al. [37] proposed a novel training strategy named Noise2Noise. It demonstrated that denoised images can be generated by networks trained on "Noisy2Noisy" image pairs, consisting of noisy inputs and noisy targets that share the same underlying clean ground truth and are corrupted by independent, identically distributed noise. Its basic idea is that minimizing the mean squared error (MSE) loss drives the network output towards the expected value of the targets. Hence, the Noise2Noise strategy is suitable for noisy images whose expected value equals that of the underlying clean ground truth, for example, images corrupted by additive white Gaussian noise (AWGN). Inspired by this, Ma et al. [38] proposed a noisy reference-based SAR deep learning (NR-SAR-DL) filter, which uses multitemporal SAR images acquired from the same scene by the same sensor (called "Speckled2Speckled" image pairs) to train the despeckling network. NR-SAR-DL has outstanding despeckling performance on real multitemporal SAR data, especially in preserving point targets and radiometric properties. However, although NR-SAR-DL integrates temporal stationarity information into the loss function, its effectiveness is still affected by training errors caused by temporal variations.
When neither "Speckled2Clean" nor "Speckled2Speckled" image pairs are available, training a CNN-based despeckling network becomes challenging. Recently, Quan et al. [39] proposed a dropout-based scheme (named Self2Self) for image denoising using only single noisy images. In the Self2Self strategy, a denoising CNN with dropout is trained on Bernoulli-sampled instances of the noisy images. For images corrupted by AWGN, the denoising performance of Self2Self is comparable to that of "Noisy2Clean" training, which opens the possibility of using only real speckled SAR images to train a deep despeckling network.
In this paper, we aim to solve the following problem: training a deep despeckling network requires clean SAR ground truth images, which are difficult to obtain in real-world conditions. By solving this problem, the deep despeckling network can be trained on real SAR images instead of synthetic speckled images. To this end, inspired by Self2Self, we propose an advanced SAR image despeckling method by virtue of Bernoulli-sampling-based self-supervised deep learning, namely SSD-SAR-BS. Our main contributions are summarized as follows:
• To address the problem that no clean SAR images can be employed as targets to train the deep despeckling network, we propose a Bernoulli-sampling-based self-supervised despeckling training strategy, utilizing the known speckle noise model and real speckled SAR images. Its feasibility is proven with mathematical justification, combining the statistical characteristics of speckle noise in SAR images and the mean squared error loss function;
• A multiscale despeckling network (MSDNet) was designed based on the traditional UNet, where shallow and deep features are fused to recover despeckled SAR images. Dense residual blocks are introduced to enhance the feature extraction ability. In addition, a dropout-based ensemble in the testing process is proposed to avoid the pixel loss problem caused by Bernoulli sampling and to boost the despeckling performance;
• We conducted qualitative and quantitative comparison experiments on synthetic speckled and real SAR image data. The results showed that our proposed method significantly suppresses speckle noise while reliably preserving image features, compared to state-of-the-art despeckling methods.
The rest of this paper is organized as follows. Section 2 introduces our proposed method in detail. Section 3 describes the compared methods, experimental settings, and experimental results on synthetic speckled and real SAR image data. Section 4 discusses the impacts of the several components of our proposed method. Section 5 summarizes the paper.

The Proposed Method
In this section, firstly, we describe the basic idea of our proposed SSD-SAR-BS, where only the speckle noise model and speckled SAR images are needed. Then, we introduce the MSDNet to achieve despeckling and utilize dense residual blocks to enhance the network performance. Lastly, we propose the dropout-based ensemble for testing. The overall flowchart of our proposed SSD-SAR-BS is presented in Figure 2.

Basic Idea of Our Proposed SSD-SAR-BS
We use Y and X to denote the observed speckled intensity SAR image and the corresponding underlying speckle-free SAR image, respectively. The relationship between Y and X can be characterized by a well-known multiplicative model [11]:

$Y = X \odot N$,  (1)

where $\odot$ denotes the Hadamard product (i.e., elementwise product) of two matrices. N denotes the speckle noise and is considered to follow the independent and identically distributed (i.i.d.) Gamma distribution with unit mean. The probability density function $P_r(\cdot)$ of N can be defined as [11]:

$P_r(N) = \frac{L^L N^{L-1} e^{-LN}}{\Gamma(L)}, \quad N \geq 0$,  (2)

where $\Gamma(\cdot)$ denotes the Gamma function and L is the number of looks. Our objective was to train a deep despeckling network using only Y as the training data to recover X; that is, X is invisible during network training. As previously mentioned, it is feasible to train a denoising network using Noise2Noise image pairs, which contain the same underlying clean targets. Therefore, a natural idea is: if we can sample two images from a given speckled image, we can use one of them as the input and the other as the target. To do so, we propose a self-supervised SAR image despeckling method based on Bernoulli sampling.
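For concreteness, the following is a minimal sketch of how L-look speckled intensity images obeying (1) and (2) can be synthesized; the helper name add_speckle and the use of NumPy are our own illustrative choices, not part of the original pipeline.

```python
import numpy as np

def add_speckle(clean: np.ndarray, looks: int, rng=None) -> np.ndarray:
    """Multiply a clean intensity image by i.i.d. Gamma speckle.

    A Gamma distribution with shape L and scale 1/L has unit mean and
    variance 1/L, matching the L-look speckle model in (1)-(2).
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.gamma(shape=looks, scale=1.0 / looks, size=clean.shape)
    return clean * noise  # Hadamard product: Y = X ⊙ N

# Example: single-look (L = 1) speckle on a constant 64 x 64 patch.
speckled = add_speckle(np.full((64, 64), 100.0), looks=1)
```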
Firstly, for each speckled image, only a part of the pixels is used as the input, and the remaining pixels are used as the target. To do so, we generate two matrices as multiplication operators, whose sizes are the same as that of the original speckled image. The speckled input-target image pairs are obtained by multiplying the original fully speckled image by the two matrices, respectively. For this reason, we employed the Bernoulli distribution as the sampling method to generate matrices with one or zero values. Unlike the Bernoulli distribution, other distributions (e.g., the Gaussian distribution) sample continuous random values, so a set of transformation operations would need to be performed first to obtain binary masks. Hence, the Bernoulli distribution is the simpler and more direct sampling method for this work. Specifically, for each speckled image with a size of W × H, we used two Bernoulli-sampled matrices (i.e., $\hat{B}, \tilde{B} \in \{0, 1\}^{W \times H}$), which can be written as:

$\hat{B}_{w,h} \sim \mathrm{Bernoulli}(p)$,  (3)

$\tilde{B}_{w,h} = 1 - \hat{B}_{w,h}$,  (4)

where $\hat{B}_{w,h}$ and $\tilde{B}_{w,h}$ denote the pixel values in $\hat{B}$ and $\tilde{B}$, respectively, and p ∈ (0, 1) denotes the probability of the Bernoulli distribution. Then, the corresponding Bernoulli-sampled speckled image pair $(\hat{Y}, \tilde{Y})$ can be generated:

$\hat{Y} = \hat{B} \odot Y, \quad \tilde{Y} = \tilde{B} \odot Y$,  (5)

where $\hat{Y}$ is the input of the despeckling network and $\tilde{Y}$ is the corresponding target. An illustration of the Bernoulli-sampled speckled image pairs is presented in Figure 3. With the obtained Bernoulli-sampled speckled image pairs, we can train the deep-learning-based despeckling network $f_\theta(\cdot)$ described in Section 2.2, where θ denotes the parameters (i.e., weights and biases) of $f_\theta(\cdot)$. Specifically, $\hat{Y}$ is used as the input and $\tilde{Y}$ as the target. The training process of $f_\theta(\cdot)$ finds the optimized parameters θ that minimize the MSE loss function $L_{MSE}$ between the output-target image pairs $(f_\theta(\hat{Y}), \tilde{Y})$. Here, to make $L_{MSE}$ be measured only on those pixels masked by the Bernoulli sampling, the output-target image pairs employed to calculate $L_{MSE}$ are rewritten as $(f_\theta(\hat{Y}) \odot \tilde{B}, \tilde{Y} \odot \tilde{B})$. Assume that there is an available training dataset containing a large number of image samples. The training process of $f_\theta(\cdot)$ can then be formulated as:

$\theta^* = \arg\min_\theta \sum_{m=1}^{M} \| f_\theta(\hat{Y}^{(m)}) \odot \tilde{B}^{(m)} - \tilde{Y}^{(m)} \odot \tilde{B}^{(m)} \|_2^2$,  (6)

where M is the number of image samples in the training dataset, and $\hat{Y}^{(m)}$, $\tilde{Y}^{(m)}$, and $\tilde{B}^{(m)}$ denote the m-th Bernoulli-sampled input, target, and mask, respectively. Once the training process is completed, we can use the well-trained network to obtain despeckled results.
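The sampling of (3)-(5) and the masked loss of (6) take only a few lines in PyTorch (the framework used in this work); the helper names and the normalization of the loss by the number of masked pixels are our own illustrative choices.

```python
import torch

def bernoulli_pair(y: torch.Tensor, p: float = 0.3):
    """Split a speckled image y into an input-target pair, as in (3)-(5)."""
    b_hat = torch.bernoulli(torch.full_like(y, p))  # B-hat, Eq. (3)
    b_tilde = 1.0 - b_hat                           # B-tilde, Eq. (4)
    return b_hat * y, b_tilde * y, b_tilde          # Y-hat, Y-tilde, loss mask

def masked_mse(output: torch.Tensor, target: torch.Tensor,
               b_tilde: torch.Tensor) -> torch.Tensor:
    """MSE of Equation (6), measured only on pixels selected by B-tilde."""
    diff = (output * b_tilde - target) ** 2
    return diff.sum() / b_tilde.sum().clamp(min=1.0)
```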
Next, we explain why this is feasible. It is well known that the MSE loss function is convex; then, to solve (6), we can derive:

$\theta^* = \arg\min_\theta \mathbb{E}_{\hat{Y},\tilde{Y}} \left[ \| f_\theta(\hat{Y}) \odot \tilde{B} - \tilde{Y} \odot \tilde{B} \|_2^2 \right]$,  (7)

where $\mathbb{E}$ denotes the expectation operator. By combining (3)-(5), (7) can be rewritten as:

$\theta^* = \arg\min_\theta \mathbb{E}_{Y} \mathbb{E}_{\hat{B}} \left[ \| f_\theta(\hat{B} \odot Y) \odot \tilde{B} - \tilde{B} \odot Y \|_2^2 \right]$.  (8)

As defined in (1) and (2), the distribution of N has unit mean. Hence, the expectation of Y is the same as that of X, which leads to:

$\mathbb{E}[Y] = \mathbb{E}[X \odot N] = X \odot \mathbb{E}[N] = X$.  (9)

According to (8) and (9), since the convex MSE loss is minimized when the network output equals the expectation of its target, we have:

$f_{\theta^*}(\hat{Y}) \odot \tilde{B} = \mathbb{E}[\tilde{Y} \odot \tilde{B}] = X \odot \tilde{B}$.  (10)

Furthermore, we can further approximately simplify (10) to:

$f_{\theta^*}(\hat{Y}) \approx X$.  (11)

This means that when $L_{MSE}$ obtains the smallest value (i.e., the despeckling network parameters θ obtain the optimized values), we can obtain despeckled results by using the well-trained network. This is particularly true when a large dataset is employed, in other words, M → +∞.

Main Network Architecture
We designed a multiscale despeckling network (MSDNet) as $f_\theta(\cdot)$ based on the traditional UNet, which adopts a symmetric encoder-decoder structure. This structure can obtain deep features at different scales. At the same time, the downsampling operations in the encoder part make the network more lightweight. Our proposed MSDNet consists of a preprocessing block (PB), three encoder blocks (EB), three decoder blocks (DB), and an output block (OB). The architecture and the detailed configuration are presented in Figure 4 and Table 1, respectively.
To extract deep semantic features of speckled images at different scales, the input SAR images are fed to the PB followed by three EBs. Each EB is made up of a downsampling subblock and a dense residual subblock (DRB), described in Section 2.2.2. Downsampling can enlarge the receptive field [40] and augment contextual information extraction, effectively facilitating the recovery of SAR images. Furthermore, memory usage and calculation can also be reduced by using downsampling. Here, different from the traditional UNet, a 3 × 3 strided convolution with stride 2 and padding 1 was adopted to implement downsampling with learnable parameters. Unlike pooling operations (e.g., max pooling) in the traditional UNet, the strided convolution achieves downsampling utilizing all pixels in the sliding window rather than only the pixel with the maximum value. Hence, replacing max pooling in the traditional UNet with strided convolutions in our network architecture can enhance interfeature dependencies and improve the network's expressive ability [41]. With the features at four different scales obtained by the aforementioned PB and three EBs, three DBs and an OB are responsible for gradually recovering despeckled SAR images. Each DB consists of an upsampling subblock and a DRB with dropout, described in Section 2.2.2. Here, a transposed convolution (TConv) with kernel size 2 and stride 2 in the upsampling subblock is used to enlarge the feature scale. To fuse the shallow and deep features, three skip connections are introduced between the PB and OB, EB-1 and DB-1, and EB-2 and DB-2, respectively. This is done by channelwise concatenation, which helps to reuse features and exploit the network's potential. The shallow features come from the PB (W × H), EB-1 (W/2 × H/2), and EB-2 (W/4 × H/4), and are passed to the OB, DB-1, and DB-2 by concatenation. The deep features come from EB-3 (W/8 × H/8). Finally, the OB is employed to convert the concatenated features into despeckled results.
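A rough PyTorch sketch of this topology is given below; the channel widths, the plain convolutional blocks standing in for the DRBs of Section 2.2.2, and the dropout rate are our assumptions, with Table 1 holding the exact configuration.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, dropout=False):
    # Stand-in for the DRB of Section 2.2.2: two 3x3 convolutions with
    # PReLU; in the DBs, a dropout layer precedes each convolution.
    layers = []
    for ci, co in [(in_ch, out_ch), (out_ch, out_ch)]:
        if dropout:
            layers.append(nn.Dropout2d(0.3))
        layers += [nn.Conv2d(ci, co, 3, stride=1, padding=1), nn.PReLU()]
    return nn.Sequential(*layers)

class MSDNetSketch(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.pb = conv_block(1, ch)  # preprocessing block, W x H
        self.down = nn.ModuleList(
            [nn.Conv2d(ch, ch, 3, stride=2, padding=1) for _ in range(3)])
        self.eb = nn.ModuleList([conv_block(ch, ch) for _ in range(3)])
        self.up = nn.ModuleList(
            [nn.ConvTranspose2d(ch, ch, 2, stride=2) for _ in range(3)])
        self.db = nn.ModuleList(
            [conv_block(2 * ch, ch, dropout=True) for _ in range(3)])
        self.ob = nn.Conv2d(ch, 1, 3, stride=1, padding=1)  # output block

    def forward(self, x):
        f0 = self.pb(x)                                      # W x H
        f1 = self.eb[0](self.down[0](f0))                    # W/2 x H/2
        f2 = self.eb[1](self.down[1](f1))                    # W/4 x H/4
        f3 = self.eb[2](self.down[2](f2))                    # W/8 x H/8 (deep)
        d2 = self.db[0](torch.cat([self.up[0](f3), f2], 1))  # skip from EB-2
        d1 = self.db[1](torch.cat([self.up[1](d2), f1], 1))  # skip from EB-1
        d0 = self.db[2](torch.cat([self.up[2](d1), f0], 1))  # skip from PB
        return self.ob(d0)

# Usage: MSDNetSketch()(torch.randn(1, 1, 64, 64)) has shape (1, 1, 64, 64).
```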

Table 1. The detailed configuration of the MSDNet (columns: main parts, subparts, and configurations).
Here, we provide an explanation of the parameter selection in Table 1. The kernel size of all convolutions (except for TConv) was set to 3 × 3, which provides a good tradeoff between network performance and memory footprint. Then, for the common convolutions, to keep the sizes of the input and output feature maps consistent, the stride (s) and the padding (p) were set to 1 and 1, respectively. For the strided convolutions, to achieve 2× downsampling, in other words, to halve the size of the feature map, the stride (s) and the padding (p) were set to 2 and 1, respectively. In particular, to achieve 2× upsampling, in other words, to double the size of the feature map, the kernel size (k) and the stride (s) of TConv were set to 2 and 2, respectively. Furthermore, regarding the output channels of each convolution, more feature map channels generally enhance the expressive ability of deep neural networks. However, due to the GPU memory limitation, in this work, the number of output channels was set to a multiple of 64, such as 64 and 128.
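These choices are consistent with the standard output-size formula o = ⌊(i + 2p − k)/s⌋ + 1 for convolutions; a quick illustrative check:

```python
def conv_out(i: int, k: int, s: int, p: int) -> int:
    """Convolution output size: floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1

assert conv_out(64, k=3, s=1, p=1) == 64  # common conv: size preserved
assert conv_out(64, k=3, s=2, p=1) == 32  # strided conv: 2x downsampling
# Transposed conv: o = (i - 1) * s - 2p + k, so (32 - 1) * 2 - 0 + 2 = 64.
```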

Dense Residual Block
To enhance feature extraction, we introduced the DRB in the subblocks of the MSDNet, in other words, the PB, EBs, and DBs. Different from the subblocks of the traditional UNet, the DRB combines the dense connection [42] and the residual skip connection [43], as presented in Figure 5. The DRB consists of three convolutions (i.e., $W_{d,1}$, $W_{d,2}$, and $W_{d,r}$) along with parametric rectified linear units (PReLU) [44] (i.e., $\sigma_{d,1}$, $\sigma_{d,2}$, and $\sigma_{d,r}$). Let the input of the DRB be $F_{d-1}$. The output of the first convolution along with PReLU is expressed as:

$F_{d,1} = \sigma_{d,1}(W_{d,1} * F_{d-1})$.  (12)

Then, $F_{d-1}$ and $F_{d,1}$ are concatenated to feed the second convolution by the dense connection, which can be written as:

$F_{d,2} = \sigma_{d,2}(W_{d,2} * [F_{d-1}, F_{d,1}])$,  (13)

where $[\cdot,\cdot]$ denotes channelwise concatenation; the densely concatenated features are then fused by the third convolution to produce $F_{d,r} = \sigma_{d,r}(W_{d,r} * [F_{d-1}, F_{d,1}, F_{d,2}])$. Finally, a residual skip connection is employed to obtain the final output of the DRB. Specifically, $F_{d-1}$ is used as the residual to be added to $F_{d,r}$, which can be written as:

$F_d = F_{d-1} + F_{d,r}$,  (14)

where $F_d$ is the final output of the DRB.
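A direct PyTorch rendering of (12)-(14) could look as follows; the channel bookkeeping for the concatenated features is our assumption, since Table 1 fixes the actual widths.

```python
import torch
import torch.nn as nn

class DRB(nn.Module):
    """Dense residual block following Equations (12)-(14)."""
    def __init__(self, ch: int):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, stride=1, padding=1)       # W_{d,1}
        self.conv2 = nn.Conv2d(2 * ch, ch, 3, stride=1, padding=1)   # W_{d,2}
        self.conv_r = nn.Conv2d(3 * ch, ch, 3, stride=1, padding=1)  # W_{d,r}
        self.act1, self.act2, self.act_r = nn.PReLU(), nn.PReLU(), nn.PReLU()

    def forward(self, f_prev: torch.Tensor) -> torch.Tensor:
        f1 = self.act1(self.conv1(f_prev))                           # Eq. (12)
        f2 = self.act2(self.conv2(torch.cat([f_prev, f1], dim=1)))   # Eq. (13)
        f_r = self.act_r(self.conv_r(torch.cat([f_prev, f1, f2], dim=1)))
        return f_prev + f_r                                          # Eq. (14)
```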

Dropout-Based Ensemble for Testing
Once the SSD-SAR-BS training process has been completed, we can employ the well-trained MSDNet to achieve despeckling for a given speckled SAR image Y. As described in Section 2.1, the input of the MSDNet is a Bernoulli-sampled speckled SAR image $\hat{Y}$ rather than the original full one Y. This may lead to a problem in the testing process: some pixels of the recovered despeckled SAR image may be lost. However, this problem can be solved by using the dropout-based [39,45] ensemble, as presented in Figure 2. Firstly, we generate a set of Bernoulli-sampled images (i.e., $\hat{Y}^{(1)}, \hat{Y}^{(2)}, \cdots, \hat{Y}^{(K)}$) for the same speckled SAR image Y, according to (3)-(5). Here, the lost pixels of the Bernoulli-sampled images may differ in every sampling. We introduced a dropout layer before each convolution of each DB, as presented in Table 1, so some units of each convolution are randomly ignored in each forward pass. Due to the independent randomness of dropout, we can view the well-trained MSDNet as a set of despeckling networks (i.e., $f^{(1)}_\theta(\cdot), \cdots, f^{(K)}_\theta(\cdot)$). Then, these Bernoulli-sampled speckled SAR images are fed to the corresponding despeckling networks to generate the corresponding output results (i.e., $f^{(1)}_\theta(\hat{Y}^{(1)}), \cdots, f^{(K)}_\theta(\hat{Y}^{(K)})$). Finally, the predicted despeckled SAR image $\bar{X}$ is obtained by averaging all output results, which can be described as:

$\bar{X}_{w,h} = \frac{1}{K} \sum_{k=1}^{K} f^{(k)}_\theta(\hat{Y}^{(k)})_{w,h}$,  (15)

where K is the number of average times (i.e., the number of generated Bernoulli-sampled images) in the dropout-based ensemble, W and H denote the image size, and $\bar{X}_{w,h}$ and $f^{(k)}_\theta(\hat{Y}^{(k)})_{w,h}$ denote the pixel values in $\bar{X}$ and $f^{(k)}_\theta(\hat{Y}^{(k)})$, respectively. Because the Bernoulli sampling and the dropout layers both have independent randomness, all pixels can be considered when the number of average times is sufficiently large. Hence, the problem of pixels being lost is solved, and the despeckling performance is effectively boosted.
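A minimal sketch of this test-time procedure, assuming the trained network keeps its dropout layers (the function name and the use of model.train() to keep dropout active at inference are our illustrative choices):

```python
import torch

@torch.no_grad()
def ensemble_despeckle(model: torch.nn.Module, y: torch.Tensor,
                       p: float = 0.3, k: int = 100) -> torch.Tensor:
    """Dropout-based ensemble of Equation (15)."""
    model.train()  # keep dropout stochastic during inference
    acc = torch.zeros_like(y)
    for _ in range(k):
        b_hat = torch.bernoulli(torch.full_like(y, p))  # fresh mask per pass
        acc += model(b_hat * y)
    return acc / k  # average the K despeckled estimates
```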

Experimental Results and Analysis
In this section, to demonstrate the superiority of our proposed SSD-SAR-BS, we conducted quantitative and visual comparison experiments, where several state-of-the-art despeckling methods were used for comparison. Both synthetic speckled and real SAR data were employed for the analysis.

Compared Methods
The following state-of-the-art despeckling methods were compared to our proposed SSD-SAR-BS: PPB [16], SAR-BM3D [17], ID-CNN [28], SAR-DRN [29], and SAR-RDCP [34]. The first two are traditional despeckling methods, and the last three are deep-learning-based despeckling methods. Specifically, PPB is based on patch matching, and SAR-BM3D is based on 3D patch matching and wavelet domain filtering. ID-CNN, SAR-DRN, and SAR-RDCP all adopt a deep CNN model to learn the mapping relationships between speckled inputs and clean targets; SAR-RDCP additionally combines the traditional variational model with the deep CNN model. It is worth noting that ID-CNN, SAR-DRN, and SAR-RDCP can only train their despeckling networks on synthetic speckled data rather than real SAR data, because they require clean targets for supervised learning. For all the compared methods, the algorithm parameters were set as suggested in their corresponding papers. Furthermore, to make a fair comparison, SAR-CNN [26] and some recent works (e.g., NR-SAR-DL [38]) employing multitemporal data were not used, because our proposed SSD-SAR-BS requires only single speckled SAR images; moreover, their training datasets are not publicly available.

Experimental Settings
The Bernoulli sampling probability p in (3) was fixed at 0.3. The Adam algorithm [46] was employed as the gradient descent optimizer to update the network weights and biases, with momentum β1 = 0.9, β2 = 0.999, and ε = 10^−8. The network was trained for 50 epochs with a batch size of 64 at an initial learning rate of 10^−4. After the first 25 epochs, the learning rate was multiplied by a descending factor of 0.1. The number of average times K was set to 100 to output the predicted despeckled image in the testing process. Our proposed method was implemented in the PyTorch framework [47] and run on an Intel i9-10900K CPU and an NVIDIA GeForce RTX 3090 GPU.
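These optimization settings map directly onto PyTorch; a sketch with a stand-in model in place of the MSDNet:

```python
import torch

model = torch.nn.Conv2d(1, 1, 3, padding=1)  # stand-in for the MSDNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-8)
# Multiply the learning rate by 0.1 after the first 25 of the 50 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[25],
                                                 gamma=0.1)
for epoch in range(50):
    # ... iterate over batches of 64 Bernoulli-sampled pairs here,
    # calling optimizer.step() per batch ...
    scheduler.step()
```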

Despeckling Experiments on Synthetic Speckled Data
For a fair comparison, in the synthetic speckled data despeckling experiments, all deep-learning-based despeckling methods shared the same training dataset, which was built using optical remote sensing images from the UC Merced land-use dataset [35]. From this dataset, a total of 209,990 image patches with a size of 64 × 64 were extracted as the speckle-free target images. The corresponding speckled input images were generated according to (1) and (2), where the number of looks was randomly selected from {1, 2, 4, 8}. It is worth mentioning that our proposed SSD-SAR-BS never saw the speckle-free target images; only the speckled input images were needed for training. In contrast, the other deep-learning-based methods (i.e., ID-CNN, SAR-DRN, and SAR-RDCP) all saw the speckle-free target images in the training process. Furthermore, for testing, four optical remote sensing images from another dataset (i.e., the aerial image dataset (AID) [48]) were selected: Airport, Beach, Parking, and School. The corresponding single-look speckled input images were also generated according to (1) and (2).
With the speckle-free target images, two classic fully referenced metrics (i.e., the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [49]) were employed for the evaluation. The PSNR is defined as:

$\mathrm{PSNR} = 10 \log_{10} \left( \frac{\mathrm{MAX}^2}{\frac{1}{WH} \sum_{w=1}^{W} \sum_{h=1}^{H} (\bar{X}_{w,h} - X_{w,h})^2} \right)$,  (16)

where $\bar{X}$ and X denote the despeckled image and the corresponding speckle-free image with a size of W × H, respectively; $\bar{X}_{w,h}$ and $X_{w,h}$ are the pixel values in $\bar{X}$ and X, respectively; and MAX is the peak pixel value. A larger PSNR value indicates lower distortion of the despeckled images. The SSIM is calculated as:

$\mathrm{SSIM} = \frac{(2\mu_{\bar{X}}\mu_X + C_1)(2\sigma_{\bar{X}X} + C_2)}{(\mu_{\bar{X}}^2 + \mu_X^2 + C_1)(\sigma_{\bar{X}}^2 + \sigma_X^2 + C_2)}$,  (17)

where $\mu_{\bar{X}}$ and $\mu_X$ are the mean values of $\bar{X}$ and X, respectively; $\sigma_{\bar{X}}$ and $\sigma_X$ are the standard deviation values of $\bar{X}$ and X, respectively; $\sigma_{\bar{X}X}$ represents the covariance between $\bar{X}$ and X; and $C_1$ and $C_2$ are two constants added to avoid instability when $\mu_{\bar{X}}^2 + \mu_X^2$ or $\sigma_{\bar{X}}^2 + \sigma_X^2$ is close to zero. A larger SSIM value means better structural feature preservation. In addition to the two fully referenced metrics (i.e., PSNR and SSIM), a nonreferenced metric (i.e., the equivalent number of looks (ENL) [17]) was employed to evaluate the speckle suppression performance, expressed as:

$\mathrm{ENL} = \frac{\mu_{\bar{X}_{HR}}^2}{\sigma_{\bar{X}_{HR}}^2}$,  (18)

where $\bar{X}_{HR}$ represents a homogeneous region patch of $\bar{X}$, and $\mu_{\bar{X}_{HR}}$ and $\sigma_{\bar{X}_{HR}}$ are its mean and standard deviation values, respectively. The larger the ENL value, the better the speckle noise reduction.

Table 2 lists the quantitative evaluation results, with the best and second-best performance marked in bold and underlined. Among the traditional despeckling methods, SAR-BM3D had the better overall performance in the quantitative evaluation. Furthermore, the deep-learning-based despeckling methods (especially SAR-DRN and SAR-RDCP) showed a noticeable improvement over the traditional ones. Compared to SAR-RDCP, SAR-DRN had larger PSNR values for Beach, Parking, and School. On the contrary, the SSIM values of SAR-RDCP for Airport, Parking, and School were larger than those of SAR-DRN. Generally speaking, it is difficult to maintain noise smoothing (i.e., PSNR and ENL) and feature preservation (i.e., SSIM) at the same time. However, our proposed SSD-SAR-BS gained about 0.07-0.62 dB in terms of the PSNR compared to SAR-DRN and, at the same time, about 0.02-0.04 in terms of the SSIM compared to SAR-RDCP. Combining these with its larger ENL values, our proposed SSD-SAR-BS achieved the best overall quantitative results in terms of the PSNR, SSIM, and ENL. This means that the results of our proposed SSD-SAR-BS were closer to the original speckle-free target images, with better structural feature preservation and better speckle noise suppression.

Beyond the quantitative evaluation, visual assessment is also necessary for a comprehensive analysis of despeckling performance. We present the one-look speckled input images, the despeckling results obtained by the compared methods and our proposed SSD-SAR-BS, and the original speckle-free target images in Figures 6-9. To give detailed contrasting results, we also provide the corresponding magnified results in Figure 10. The PPB removed most speckle noise, while its results showed significant oversmoothing. In other words, the detailed texture features in the results of PPB were lost along with the speckle noise. SAR-BM3D showed an acceptable tradeoff between speckle noise suppression and detailed feature preservation. However, SAR-BM3D showed the blocking phenomenon.
The blocking artefacts made its results look mottled and unnatural, which can be easily found in Figure 7c.
As shown in the quantitative evaluation, the deep-learning-based methods obtained clearer results than the traditional ones, especially SAR-DRN and SAR-RDCP. However, compared to our proposed SSD-SAR-BS, their performance was still not good enough. This can be explained from two aspects: (1) Edge preservation, as marked in the red circles of Figure 10(1,3,4): the linear edge features were lost or intermittent in the magnified results of SAR-DRN and SAR-RDCP, while they could still be found or were kept intact in the results of our proposed SSD-SAR-BS. (2) The dense small point area in the magnified results of Beach (i.e., Figure 10(2)): dense small points were incorrectly transformed into lines or blocks in the magnified results of SAR-DRN and SAR-RDCP. This may be due to the visual similarity between dense small points and speckle noise; SAR-DRN and SAR-RDCP cannot accurately separate dense point targets from speckle noise, so to remove the speckle noise, dense small points were also inadvertently removed in their despeckling results. Although still not perfect compared to the speckle-free target (i.e., Figure 10(2-h)), the dense small points were retained in their original visual form in the result of our proposed SSD-SAR-BS (i.e., Figure 10(2-g)); in other words, they remained points rather than lines or blocks.

Despeckling Experiments on Real-World SAR Data
In this section, the despeckling performance of our proposed SSD-SAR-BS is verified using real SAR images:
(1) Sentinel-1 [50]: This is a low-resolution (about 10 m) SAR system with the C-band, provided by the Copernicus data hub of the European Space Agency (ESA), whose data can be downloaded from https://scihub.copernicus.eu/ (accessed on 23 October 2020). The downloaded data were single-look complex with the interferometric wide swath (IWS) mode. In the Sentinel-1 despeckling experiment, Table A1 gives the file name list of the downloaded Sentinel-1 data. A total of 221,436 one-look (i.e., L = 1) image patches with a size of 64 × 64 pixels were used to train our proposed SSD-SAR-BS. In addition, we selected two one-look images (denoted as Sentinel-1 #1 and Sentinel-1 #2) with a size of 1024 × 1024 pixels, which were not included in the training data, for the independent test. The test images contain many representative SAR features, such as homogeneous regions, detailed textures, point targets, and strong edges;
(2) TerraSAR-X [51]: This is a high-resolution (about 3 m) SAR system with the X-band, provided by the TerraSAR-X ESA archive collection, whose data can be downloaded from https://tpm-ds.eo.esa.int/oads/access/collection/TerraSAR-X (accessed on 22 August 2020). The imaging mode of the downloaded data is StripMap (SM). In the TerraSAR-X despeckling experiment, Table A2 gives the file name list of the downloaded TerraSAR-X data. A total of 222,753 one-look (i.e., L = 1) image patches with a size of 64 × 64 pixels were used to train our proposed SSD-SAR-BS. In addition, we selected two one-look images (denoted as TerraSAR-X #1 and TerraSAR-X #2) with a size of 1024 × 1024 pixels; similarly, they were not included in the training data. Furthermore, the test images contain many complicated features to examine the despeckling performance, such as homogeneous regions, dense lines, and strong edges.
To make the despeckled results smoother, we added a total variation (TV) regularization term $L_{TV}$ to the loss function:

$L = L_{MSE} + \lambda_{TV} L_{TV}$,  (19)

where $\lambda_{TV}$ is the tradeoff weight for the TV regularization. $L_{TV}$ minimizes the absolute differences between neighbouring pixel values. To avoid detail loss, $\lambda_{TV}$ was set far less than 1, specifically $\lambda_{TV} = 0.0001$.

Because speckle-free (clean) SAR images cannot be used as references, the PSNR and SSIM were no longer applicable for the quantitative evaluation on real SAR data. In addition to the above ENL index, we employed two further nonreference indexes: the coefficient of variation (Cx) [52] and the mean of ratio (MoR) [17], calculated using a homogeneous region patch of the despeckled results. Cx estimates the texture preservation performance and is given as:

$C_x = \frac{\sigma_{\bar{X}_{HR}}}{\mu_{\bar{X}_{HR}}}$,  (20)

where a lower Cx value represents better texture preservation. The MoR measures the radiometric preservation of the despeckled results and is defined as:

$\mathrm{MoR} = \mu_{(Y_{HR} \oslash \bar{X}_{HR})}$,  (21)

where $\oslash$ denotes elementwise division, and $Y_{HR}$ and $\bar{X}_{HR}$ are the corresponding homogeneous region patches of the speckled image and the despeckled result, respectively. For ideal radiometric preservation, the MoR value should be 1.

Table 3 provides the quantitative comparison, and Figures 11-14 present the corresponding magnified despeckling results. From Table 3, we can see that the PPB had a strong speckle noise reduction ability in homogeneous regions. However, it lost most of the detailed features of the heterogeneous regions, as presented in Figures 11-14b. The speckle noise removal ability of SAR-BM3D was not as strong as that of the PPB, but it surpassed the PPB by a large margin in terms of detailed feature preservation. Similar to the synthetic speckled dataset experiment, a critical problem of SAR-BM3D is that its results presented the blocking phenomenon, which can be observed in Figures 11-13c. Such mottled blocking artefacts should not appear in ideal despeckled results.
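A sketch of the TV-regularized training loss in (19), reusing the masked MSE of (6); the anisotropic TV formulation (mean absolute difference of vertical and horizontal neighbours) is our assumption:

```python
import torch

def tv_loss(x: torch.Tensor) -> torch.Tensor:
    """Anisotropic total variation: absolute differences between
    vertically and horizontally neighbouring pixels."""
    dv = (x[..., 1:, :] - x[..., :-1, :]).abs().mean()
    dh = (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
    return dv + dh

def total_loss(output, target, b_tilde, lam_tv: float = 1e-4):
    """L = L_MSE + lambda_TV * L_TV, with the masked MSE of Equation (6)."""
    mse = ((output * b_tilde - target) ** 2).sum() / b_tilde.sum().clamp(min=1.0)
    return mse + lam_tv * tv_loss(output)
```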
The performance of ID-CNN, SAR-DRN, and SAR-RDCP on real SAR images was not as good as on synthetic speckled images. This can be confirmed in terms of speckle noise reduction in the homogeneous regions (see Table 3). Furthermore, from Figures 11-14d-f, we can see that there were some chaotic line-like artefacts in their despeckled results. These phenomena reflect the domain gap problem. These methods adopt supervised learning, so their despeckling networks can only be trained using synthetic speckled data from optical images. However, due to the differences in imaging mechanisms, SAR images have many characteristics that are not present in optical images, including scattering characteristics. When the feature extraction capability learned from synthetic speckled images is applied to a different data domain (i.e., real SAR images), these methods show nonoptimal despeckling performance and generate unnatural artefacts in their despeckled results. In contrast, our proposed SSD-SAR-BS learned the despeckling ability from real SAR images, thereby fundamentally avoiding the domain gap problem. Specifically, it showed good speckle noise suppression, texture preservation, and radiometric preservation in homogeneous regions, according to its larger ENL values, lower Cx values, and MoR values closer to one. This can also be verified in Figure 14g (the regions marked by red circles). Moreover, our proposed SSD-SAR-BS avoided generating artefacts while preserving the original features. Specifically, from Figures 12d-g, we can observe that our proposed SSD-SAR-BS provided smooth homogeneous regions, whereas chaotic line-like artefacts are clearly visible in the results of ID-CNN, SAR-DRN, and SAR-RDCP. Meanwhile, the dense small points in the red circles can also be clearly found in the result of our proposed SSD-SAR-BS.

Discussion
As mentioned earlier, the Bernoulli sampling probability p was set to 0.3, and the dropout-based ensemble was used to boost the performance of our proposed SSD-SAR-BS. To confirm the effectiveness of these settings, we provide a set of comparative experiments. For the Bernoulli sampling probability p, only 0.1, 0.3, and 0.5 were compared, because when p was too large, most pixels of the input speckled images were lost and the despeckling performance fell rapidly. For the objective evaluation, we present the despeckling performance on 100 synthetic speckled images from the AID dataset in terms of the average PSNR and SSIM values, under different numbers of average times K in the dropout-based ensemble.
As shown in Figure 15, as K increased, the average PSNR and SSIM values increased significantly, especially the average PSNR values for K from zero to twenty and the average SSIM values for K from zero to forty. This trend was similar for the different Bernoulli sampling probabilities, in other words, p = 0.1, p = 0.3, and p = 0.5. Moreover, from the magnified curves shown in the right column of Figure 15, the average PSNR and SSIM values for p = 0.3 were larger than those for p = 0.1 and p = 0.5. This was because when p = 0.1, the randomness of the Bernoulli sampling was not strong enough, while when p = 0.5, the preserved pixels were not enough to reconstruct the despeckled images. Hence, 0.3 was the superior value of the Bernoulli sampling probability p, and the dropout-based ensemble could effectively boost the despeckling performance of our proposed SSD-SAR-BS.
We also list the inference runtimes of the compared methods and our proposed method in Table 4. All methods were implemented in the same system environment described in Section 3.1.2. The dropout-based ensemble in our proposed SSD-SAR-BS enables the model to operate in two modes: (1) an accurate model (e.g., K = 100), which is more expensive because it uses more averaging passes, providing superior speckle noise suppression and detail preservation; and (2) a fast model (e.g., K = 40), which, for more time-critical SAR image despeckling tasks, improves the inference efficiency (reduces the test runtime) by reducing the number of average times K. From Table 4, we can see that as the number of average times K is reduced, the runtime decreases significantly. Specifically, when K = 40, the runtimes for images with 64 × 64 and 128 × 128 pixels were reduced to about 0.25 and 0.75 s, respectively. The time consumption of this fast model was superior to those of the traditional methods (i.e., PPB and SAR-BM3D), similar to the other deep-learning-based methods. In other words, although much time is needed to train the deep neural network, the testing process is speedy in deep-learning-based methods, which is another advantage of our proposed SSD-SAR-BS.

Conclusions
In this paper, we proposed a novel method for SAR image despeckling based on self-supervised deep learning and Bernoulli sampling, called SSD-SAR-BS. Our proposed method does not need clean reference images to train the deep despeckling network. Hence, the network can be trained directly on real speckled SAR images. This overcomes the domain gap problem of most existing deep-learning-based SAR image despeckling methods, which adopt supervised learning and use synthetic speckled images rather than real SAR images as the training data. Qualitative and quantitative comparisons on synthetic speckled and real SAR images verified the superior performance of our proposed method compared to the state-of-the-art methods. Our proposed method can suppress most speckle noise and avoid generating artefacts, including the blocking artefacts caused by SAR-BM3D and the chaotic line-like artefacts caused by supervised deep-learning-based methods trained on synthetic speckled images. In the future, we will consider the combination of transfer learning and multitemporal SAR data (generating approximate clean labels) for SAR despeckling. Furthermore, we plan to explore the despeckling effect in some practical applications using SAR images, such as forest fire burn detection.

Table A2. File name list of the downloaded TerraSAR-X data.