SAR Image Despeckling by Deep Neural Networks: from a pre-trained model to an end-to-end training strategy

Speckle reduction is a longstanding topic in synthetic aperture radar (SAR) imaging. Many different schemes have been proposed for the restoration of intensity SAR images. Among the different possible approaches, methods based on convolutional neural networks (CNNs) have recently been shown to reach state-of-the-art performance for SAR image restoration. CNN training requires good training data: many pairs of speckle-free / speckle-corrupted images. This is an issue in SAR applications, given the inherent scarcity of speckle-free images. To handle this problem, this paper analyzes different strategies one can adopt, depending on the speckle removal task one wishes to perform and the availability of multitemporal stacks of SAR data. The first strategy applies a CNN model, trained to remove additive white Gaussian noise from natural images, to a recently proposed SAR speckle removal framework: MuLoG (MUlti-channel LOgarithm with Gaussian denoising). No training on SAR images is performed: the network is readily applied to speckle reduction tasks. The second strategy considers a novel approach to construct a reliable dataset of speckle-free SAR images necessary to train a CNN model. Finally, a hybrid approach is also analyzed: the CNN used to remove additive white Gaussian noise is trained on speckle-free SAR images. The proposed methods are compared to other state-of-the-art speckle removal filters, to evaluate the quality of denoising and to discuss the pros and cons of the different strategies. Along with the paper, we make available the weights of the trained network to allow its usage by other researchers.


Introduction
Synthetic Aperture Radar (SAR) provides high-resolution, day-and-night and weather-independent images. SAR technology is widely used for Earth remote sensing. With advanced techniques such as polarimetry, interferometry and differential interferometry, SAR images have numerous applications ranging from environmental system monitoring, sustainable urban development and disaster detection up to planetary exploration [1]. Due to the coherent illumination of the scene by the SAR system, SAR images suffer from strong fluctuations: the speckle phenomenon. The presence of speckle in an image reduces the ability of a human observer to resolve fine details within the image [2] and impacts automatic image analysis.
arXiv:2006.15559v3 [cs.CV] 2 Jul 2020
It is well-known that the speckle phenomenon is caused by the presence of many elemental scatterers within a resolution cell, each back-scattering an echo with a different phase shift. The coherent summation of all these echoes produces strong fluctuations of the resulting intensity from one cell to the next [3]. Speckle analysis and reduction is a longstanding topic in SAR imagery. Many works have been devoted to this task, from single polarization data to fully polarimetric SAR images (see the reviews [4,5]). Among the different possible strategies, we classify some commonly used despeckling algorithms into the following three categories.
(i) Selection based methods. The simplest way to reduce SAR speckle is to average neighboring pixels within a fixed window. This technique is called spatial multilooking. To reduce the degradation of the spatial resolution when applied to edges, lines or point-like scatterers, both local [6][7][8] and non-local approaches [9][10][11][12] have been proposed.
(ii) Variational methods. These methods formulate the restoration problem as an optimization problem, to find the underlying image which best explains the observed speckle-corrupted image. The objective function to be minimized is typically composed of two terms: a data-fitting term (that uses a Gamma [13], Rayleigh [14] or Fisher-Tippett distribution [15] to model the distribution of SAR images) and a regularization term (the total variation TV has been widely used in the literature [15][16][17][18][19]).
(iii) Transform based methods. Many works have considered the application of a wavelet transform for despeckling [20][21][22][23], but the estimation of the parameters of the signal and noise statistics is a difficult task. To apply in a more straightforward fashion the state-of-the-art methods designed for additive-noise removal, a logarithmic transformation (a.k.a. a homomorphic transform) is often applied to convert the multiplicative behavior of speckle into an additive component. Xie et al. [24] have derived the statistical distribution of log-transformed speckle noise. Many despeckling methods based on a homomorphic transform have been proposed [25][26][27][28][29]. Special care must be taken because, after a log-transform, the speckle is stationary but not Gaussian and not even centered. At least a debiasing step must be included to account for the expectation of log-transformed speckle.
Deep learning allows computational models formed of multiple processing layers to learn representations of data that include several levels of abstraction [30]. In recent years, these methods have dramatically improved the state-of-the-art in many computer vision fields, such as object detection [31], face recognition [32], and low-level image processing tasks [33]. Among them, convolutional neural networks (CNNs) with very deep architectures [34], displaying a large capacity and flexibility to represent image characteristics, are well-suited for image restoration. Dong et al. [35] proposed an end-to-end CNN mapping between low- and high-resolution images for single image super-resolution, which demonstrated state-of-the-art restoration quality and achieved fast speed for practical on-line usage. CNNs have also been applied to the denoising of natural images, but mostly in the context of additive white Gaussian noise (AWGN). Zhang et al. proposed a feed-forward denoising convolutional neural network (DnCNN) to embrace the progress in very deep architectures. Advanced regularization and learning methods, including the Rectified Linear Unit (ReLU), batch normalization and residual learning [36], are adopted, dramatically improving the denoising performance. Research is also ongoing on non-AWGN denoising tasks, such as salt-and-pepper noise [37], multiplicative noise [38], and blind inpainting [39]. For SAR image despeckling, CNNs were first used to learn an implicit model in [40]. More and more SAR despeckling methods are now based on deep architectures [41][42][43].
In this paper, we try to shed light on the advantages and disadvantages of making a significant effort to create training sets and learn SAR-specific CNNs, rather than readily applying to SAR despeckling generic networks pre-trained for AWGN removal on natural images. We believe that considering this question for intensity images is enlightening for the more difficult case of multi-channel SAR despeckling that arises in SAR polarimetry, SAR interferometry or SAR tomography. In order to discuss this matter, we consider two different SAR despeckling frameworks based on CNNs. The first one consists of applying a CNN pre-trained on AWGN removal from natural images. Many CNNs have been proposed for AWGN suppression. The extension to SAR imagery requires accounting for the statistics of speckle noise. This can be performed by an iterative scheme recently introduced for speckle reduction: the MuLoG algorithm [44], which is based on the plug-and-play ADMM strategy [45]. In the second approach, a network is trained specifically on SAR images. To that end, we describe a new procedure to generate a high-quality training set. We consider a network architecture similar to that used in the work of Chierchia et al. [40] and discuss the influence of the number of layers and of the loss function on the despeckling performance of the CNN, trained and tested on our datasets. The performance of despeckling methods based on deep neural networks depends not only on the network architecture but also on the training set and the optimization of the network. Published results are therefore reproducible only if the network architecture and weights are released together with the research paper. To facilitate comparisons of future despeckling methods with our work, we provide an open-source code that includes the network weights for Sentinel-1 image despeckling (see section 5).
We have also experimented with a hybrid approach, in the sense that the generated SAR dataset has been used to train a CNN for AWGN removal on images whose content is the same as in the task that it will perform.
The remainder of the paper is organized as follows. Section 2 provides a detailed survey of related image denoising works using CNNs. Section 3 first introduces the statistics of SAR data, and then presents the two different SAR despeckling strategies considered in the paper, plus the hybrid approach. In Section 4, extensive experiments are conducted to evaluate restoration performance. Finally, our concluding remarks are given in Section 5 and Section 6.

Related works
The goal of image denoising is to recover a clean image x from a noisy observation y which follows a specific image degradation model. In this paper, we discuss two common degradation models: additive noise (y = x + n) and multiplicative noise (y = x × n), where n is a random component referred to as "the noise"¹.

Additive Gaussian noise reduction by deep learning
In the additive model, one usual assumption is that the noise component n corresponds to an additive white Gaussian noise (AWGN). To perform the estimation of x, a maximum a posteriori (MAP) estimation is often considered, where the goal is to maximize the posterior probability:

$$\hat{x} = \arg\max_{x} \, p(x|y) = \arg\max_{x} \, p(y|x)\, p(x), \tag{1}$$

with p(y|x) the likelihood defining the noise model, and p(x) the prior distribution corresponding to the statistical model of clean images. Under an additive white Gaussian noise assumption, the term − log p(y|x) takes the form of a sum of squares: $-\log p(y|x) = \frac{1}{2\sigma^2}\|y - x\|_2^2$ (up to an additive constant), with σ the standard deviation of the noise. The MAP estimator then corresponds to:

$$\hat{x} = \arg\min_{x} \, \tfrac{1}{2}\|y - x\|_2^2 + \sigma^2 g(x) = \operatorname{prox}_{\sigma^2 g}(y), \tag{2}$$

where g(x) = − log p(x) and the notation prox_g defines the proximal operator associated to function g [46,47]. Classical prior models g include the total variation, sparse analysis and sparse synthesis priors [48]. While earlier models were handcrafted (ℓ1 norm of the wavelet coefficients for a specifically-chosen transform, total variation), more recent models are learned from sets of natural images (field of experts [49], higher-order Markov random fields [50]). Beyond MAP estimators, several patch-based methods were designed to estimate the clean image x, such as BM3D [51], LSSC [52] or WNNM [53]. These methods exploit the self-similarity observed in most natural images (repetition of similar structures/textures at the scale of a patch of size typically 8 × 8 within extended neighborhoods).

¹ The terminology "noise" to describe fluctuations due to speckle is sometimes considered misleading in the context of SAR imaging, given that a pair of images acquired under an interferometric configuration have correlated speckle components, which makes it possible to extract meaningful information from the interferometric phase; we will, however, stick to the terminology common in image processing by referring to the speckle as a noise term throughout the paper, since we focus on the restoration of intensity-only images.
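To make the proximal operator concrete, the sketch below treats one handcrafted prior for which it has a closed form: g(x) = λ‖x‖₁ applied directly to the signal. Writing the MAP problem as the minimization of ½‖y − x‖² + σ² g(x), the solution is a soft-thresholding of y with threshold σ²λ. The weight `lam` and the toy signal are assumptions chosen for illustration; this is not one of the priors evaluated in the paper.

```python
import numpy as np

def prox_l1(y, threshold):
    """Proximal operator of threshold * ||x||_1, i.e. soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - threshold, 0.0)

def map_denoise_l1(y, sigma, lam):
    """MAP estimate under AWGN of std sigma with prior g(x) = lam * ||x||_1.

    Minimizes 0.5 * ||y - x||^2 + sigma^2 * lam * ||x||_1 (hypothetical prior weight lam).
    """
    return prox_l1(y, sigma**2 * lam)

# Toy signal: small entries are set to zero, large ones are shrunk toward zero.
y = np.array([-2.0, -0.3, 0.0, 0.4, 1.5])
x_hat = map_denoise_l1(y, sigma=1.0, lam=0.5)
print(x_hat)
```

A learned denoiser plays the same role as `prox_l1` here, except that the prior g is never written down explicitly.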
In order to learn richer models of natural images, deep neural networks have been considered for denoising purposes. After the training step, the estimation step is very fast, especially on graphics processing units (GPUs). The first network architectures designed for denoising were trained to learn a mapping from noisy images y to clean ones x by making use of a CNN with feature maps [54], a multi-layer perceptron (MLP) [55], a stacked denoising autoencoder (SDA) network [56], or a stacked sparse denoising auto-encoder architecture [39], among others.
More recent networks are trained to estimate the noise component n, i.e., to output the residual y − x. Such a strategy, termed "residual learning" [36], has been shown to be more efficient when the neural network contains many layers (i.e., for deep networks). In [57], this residual learning formulation for model learning is adopted and 17 layers are used with the ℓ2 loss function. Wang et al. [58] replaced the convolution layer by dilated convolution. In [59] and [60], an exponential linear unit or a soft shrinkage function is used as activation function (i.e., the non-linear step that follows the convolution), instead of the Rectified Linear Unit (ReLU). Liu et al. [61] discuss the relation between the width and the depth of the network. They introduce wide inference networks with just 5 layers. In [62], a wavelet transform is introduced into deep residual learning.
All these variations on a common structure are motivated by finding a trade-off between the network expressivity (in particular, by increasing the size of the receptive field) and the generalization potential (i.e., fighting against the over-fitting phenomenon that arises when the number of network parameters increases).
Most networks generate an estimate x̂ of the clean image from a noisy image y by simple traversal of the feedforward network (and possibly subtracting the residuals from the noisy image, in case of residual learning): the proximal operator prox_g is learned instead of the prior distribution p(x), and no explicit minimization is performed when estimating a denoised image. The plug-and-play ADMM strategy [45] provides a means to apply the implicit modeling of the prior distribution p(x), encoded within the network in the form of the proximal operator prox_g, in order to address more general image restoration problems than mere AWGN removal.

Speckle reduction by deep learning
In past years, most SAR image denoising approaches have been based on detailed statistical models of signal and speckle. To avoid the problem of modeling the statistical distribution of speckle-free SAR images, several authors recently resorted to machine learning approaches implemented through CNNs.
The first paper that investigates the problem of SAR image despeckling through CNNs is [40]. Following the paradigm proposed in [57], the SAR-CNN implemented by Chierchia et al. is trained in a residual fashion, where a homomorphic approach [44] is used in order to stabilize the variance of speckle noise. The network comprises 17 convolutional layers with Batch Normalization [63] and Rectified Linear Unit (ReLU) [34] activation functions. Logarithm and hyperbolic cosine are combined in a smoothed ℓ1 loss function. The Image Despeckling Convolutional Neural Network (ID-CNN) proposed by Wang et al. [64] comprises only 8 layers and is applied directly on the input image without a log-transform. The novelty lies in the formulation of the loss function as a combination of an ℓ2 loss and of Total Variation (TV), preventing the appearance of artifacts while preserving important details such as edges. In [42], the proposed SAR-DRN makes use of dilated convolutions and skip connections to increase the receptive field without increasing the complexity of the network, while maintaining the advantages of 3×3 filters. Wang et al. [41] tackle the image despeckling problem by resorting to generative adversarial networks. In the proposed ID-GAN, the generator is trained to directly estimate the clean image from a noisy observation, while the discriminator serves to distinguish the despeckled image synthesized by the generator from the corresponding ground truth image. To exploit the abundance of stacks of multitemporal data and avoid the problem of creating a clean reference, in [43] a CNN is used as an auto-encoder through a formulation that does not use an explicit expression of the noise model, thus allowing generalization.

Statistics of SAR images
After SAR focusing, the SAR image is formed by the collection of the complex amplitudes back-scattered by each resolution cell. The squared modulus of this complex amplitude (the intensity image) is informative of the total reflectivity of the scatterers in each resolution cell. Because of the interference between echoes produced by elementary scatterers, the intensity fluctuates (speckle phenomenon). These fluctuations depend on the 3-D spatial configuration of the scatterers with respect to the SAR system and on the nature of the scatterers. They are generally modeled by Goodman's stochastic model (fully developed speckle): the measured intensity y is related to the reflectivity x by the multiplicative model y = x × n, where the noise component n ∈ R+ is a random variable that follows a gamma distribution [3]:

$$p(n) = \frac{L^L}{\Gamma(L)}\, n^{L-1} \exp(-L n), \tag{3}$$

with L ≥ 1 representing the number of looks, and Γ(·) the gamma function. It follows from E[n] = 1 and Var[n] = 1/L that E[y] = x (averaging the intensity leads to an unbiased estimator of the reflectivity in stationary areas) and Var[y] = x²/L (the noise is signal-dependent in the sense that the variance of the measured intensity increases like the square of the reflectivity). Considering the log of the intensity ỹ = log y instead of the intensity y transforms the noise into an additive component: ỹ = x̃ + ñ, with x̃ = log x, where ñ = log n ∈ R follows a Fisher-Tippett distribution [24]:

$$p(\tilde{n}) = \frac{L^L}{\Gamma(L)}\, e^{L \tilde{n}} \exp(-L e^{\tilde{n}}). \tag{4}$$

The mean E[ñ] = ψ(L) − log L, with ψ the digamma function, is non-zero: log-transformed speckle is not centered. Its variance is Var[ñ] = ψ(1, L), with ψ(1, ·) the polygamma function of order 1.
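The moments quoted for Goodman's model can be checked numerically. The sketch below draws gamma-distributed speckle and compares empirical moments of n and of log n with their theoretical values. It is an illustration rather than code from the paper; the helper names are hypothetical, and the digamma and trigamma functions are evaluated with closed forms that hold only for integer L.

```python
import numpy as np

def digamma_int(L):
    """psi(L) for a positive integer L: -gamma + sum_{k=1}^{L-1} 1/k."""
    return -np.euler_gamma + sum(1.0 / k for k in range(1, L))

def trigamma_int(L):
    """psi(1, L) for a positive integer L: pi^2/6 - sum_{k=1}^{L-1} 1/k^2."""
    return np.pi**2 / 6 - sum(1.0 / k**2 for k in range(1, L))

def simulate_speckle(shape, L, rng):
    """Draw gamma-distributed speckle with E[n] = 1 and Var[n] = 1/L."""
    return rng.gamma(shape=L, scale=1.0 / L, size=shape)

rng = np.random.default_rng(0)
L = 1  # single-look speckle
n = simulate_speckle((512, 512), L, rng)
log_n = np.log(n)

print("E[n]        empirical:", n.mean(), " theory:", 1.0)
print("Var[n]      empirical:", n.var(), " theory:", 1.0 / L)
print("E[log n]    empirical:", log_n.mean(), " theory:", digamma_int(L) - np.log(L))
print("Var[log n]  empirical:", log_n.var(), " theory:", trigamma_int(L))
```

The non-zero mean of log n is the bias that a homomorphic filter has to correct after denoising.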

Despeckling using pre-trained CNN models
As discussed in section 2, several approaches have been recently proposed in the literature to apply CNNs to AWGN suppression. The residual learning method DnCNN introduced in [57] is a reference method for which models pre-trained on natural images at various signal-to-noise ratios are available². We consider two different ways to apply a pre-trained CNN to speckle noise reduction: a homomorphic filter that processes log-transformed intensities with DnCNN, and the embedding of DnCNN within the iterative scheme MuLoG [44]. Note that our approach is general and pre-trained CNNs other than DnCNN could readily be applied.
Figure 1 illustrates the architecture used by the network DnCNN proposed by Zhang et al. [57]. The network is a modified VGG network and is made of 17 fully convolutional layers with no pooling. There are three types of layers: (i) Conv+ReLU: for the first layer, 64 filters of size 3×3 are used to generate 64 feature maps, and rectified linear units are then utilized for nonlinearity; (ii) Conv+BN+ReLU: for layers 2∼16, 64 filters of size 3×3×64 are used, and batch normalization is added between convolution and ReLU; (iii) Conv: for the last layer, a filter of size 3×3×64 is used to reconstruct the output. The loss function that is minimized during the training step is the ℓ2 loss (i.e., the sum of squared errors, averaged over the whole training set).
To train the DnCNN, 400 natural images of size 180×180 pixels with gray levels in the range [0, 1] were used. Patches of size 40 × 40 pixels were extracted at random locations from these images. Different networks were trained for simulated additive white Gaussian noise levels equal to σ = 10/255, 15/255, ..., 75/255 (14 different networks, each corresponding to a given noise standard deviation).

Homomorphic filtering with a pre-trained CNN
The simplest approach to applying a pre-trained CNN acting as a Gaussian denoiser is the homomorphic filtering depicted at the top of figure 2, and hereafter denoted homom.-CNN. This approach consists of approximating the noise term n in log-transformed data as an additive white Gaussian noise with non-zero mean. As recalled in section 3.1, log-transformed speckle is not Gaussian but follows a Fisher-Tippett distribution under Goodman's fully developed speckle model. Hence, the homomorphic approach is built on a rather coarse statistical approximation. Under this approximation, the log-transformed SAR image can be restored by first applying the pre-trained CNN, then correcting for the bias ψ(L) − log L due to the non-centered noise component.
Figure 2. Illustration of the three speckle reduction approaches described in this paper: the first two apply a CNN trained to remove AWGN from natural images, the last approach consists of training a CNN specifically for the suppression of speckle in (log-transformed) SAR images.
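The homomorphic pipeline (log-transform, Gaussian denoising, bias correction, exponential) can be sketched as follows. The pre-trained CNN is replaced here by a simple periodic box filter, `box_denoiser`, which is purely a hypothetical stand-in so that the example runs without network weights; the digamma helper is valid for integer L only. Under these assumptions, the sketch only illustrates where the debiasing step enters.

```python
import numpy as np

def digamma_int(L):
    """psi(L) for a positive integer L: -gamma + sum_{k=1}^{L-1} 1/k."""
    return -np.euler_gamma + sum(1.0 / k for k in range(1, L))

def box_denoiser(img, radius=2):
    """Hypothetical stand-in for the pre-trained CNN: periodic box filter."""
    out = np.zeros_like(img)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            out += np.roll(img, shift=(dy, dx), axis=(0, 1))
    return out / (2 * radius + 1) ** 2

def homomorphic_despeckle(intensity, L, denoiser):
    """Denoise log-intensities, then remove the mean of log-transformed speckle."""
    log_y = np.log(intensity)
    log_x_biased = denoiser(log_y)          # treats speckle as zero-mean Gaussian noise
    bias = digamma_int(L) - np.log(L)       # E[log n], non-zero for speckle
    return np.exp(log_x_biased - bias)

# Constant scene of reflectivity 1 corrupted by single-look speckle.
rng = np.random.default_rng(1)
L = 1
intensity = rng.gamma(shape=L, scale=1.0 / L, size=(256, 256))
restored = homomorphic_despeckle(intensity, L, box_denoiser)
print("mean log-reflectivity without debiasing:", np.log(intensity).mean())
print("mean log-reflectivity after debiasing:  ", np.log(restored).mean())
```

On this constant scene the true log-reflectivity is 0, so the second printed value should be close to zero while the first reflects the bias of the raw log-intensities.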
In order to successfully apply a pre-trained CNN model, it is crucial to properly normalize the data so that the range of input data matches the range of data used during the training step (neural networks are highly non-linear). We mapped the range [q_m, q_M] by an affine transform to the [0, 1] range, with q_m and q_M corresponding to the 0.3% and 99.7% quantiles of the log-transformed intensities. After this normalization, the standard deviation of the log-transformed noise is σ = √ψ(1, L) / (q_M − q_m), where ψ(1, L) is the variance of log-transformed speckle. This value can be used to select the network trained for the closest noise standard deviation σ_train that is less than or equal to σ. The normalized image is then multiplied by σ_train/σ so that the noise standard deviation exactly matches that of the images in the training set of the network.
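The normalization and network-selection rule can be sketched as below. The list of training noise levels follows the 10/255 to 75/255 grid of the pre-trained DnCNN models; the function name, the integer-L trigamma helper, and the fallback used when σ is below the smallest trained level are assumptions made for this illustration. The noise standard deviation is taken as the square root of the variance ψ(1, L).

```python
import numpy as np

# Noise levels of the 14 pre-trained networks: 10/255, 15/255, ..., 75/255.
TRAINED_SIGMAS = np.arange(10, 80, 5) / 255.0

def trigamma_int(L):
    """psi(1, L) for a positive integer L: pi^2/6 - sum_{k=1}^{L-1} 1/k^2."""
    return np.pi**2 / 6 - sum(1.0 / k**2 for k in range(1, L))

def normalize_and_select(log_y, L):
    """Map log-intensities to about [0, 1] and pick the network to apply."""
    q_m, q_M = np.quantile(log_y, [0.003, 0.997])
    normalized = (log_y - q_m) / (q_M - q_m)
    sigma = np.sqrt(trigamma_int(L)) / (q_M - q_m)   # noise std after normalization
    candidates = TRAINED_SIGMAS[TRAINED_SIGMAS <= sigma]
    # Assumed fallback: use the weakest network if sigma is below every trained level.
    sigma_train = candidates.max() if candidates.size else TRAINED_SIGMAS.min()
    rescaled = normalized * (sigma_train / sigma)    # noise std now equals sigma_train
    return rescaled, sigma_train, sigma

rng = np.random.default_rng(0)
log_y = np.log(rng.gamma(shape=1, scale=1.0, size=(512, 512)))  # 1-look, constant scene
rescaled, sigma_train, sigma = normalize_and_select(log_y, L=1)
print("sigma * 255 =", sigma * 255, "-> network trained at", sigma_train * 255, "/ 255")
```

After denoising, the inverse scaling and affine map would be applied to return to log-intensities; that step is omitted here for brevity.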

Iterative filtering with MuLoG and a pre-trained model
The MuLoG framework [44] accounts for the Fisher-Tippett distribution of log-transformed speckle with an iterative scheme that alternates the application of a Gaussian denoiser (namely, a proximal operator) and of a non-linear correction. Figure 2, second row, illustrates that, by embedding a CNN trained as a Gaussian denoiser within an iterative scheme, a Fisher-Tippett denoiser is obtained. Throughout the iterations, the parameter of the Gaussian denoiser evolves, which requires applying the network selection and image normalization strategy described in the previous paragraph.
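The alternation between a Gaussian denoiser and a non-linear correction can be illustrated with a generic plug-and-play ADMM loop for single-channel intensities. This is a simplified sketch in the spirit of MuLoG, not the authors' implementation: the penalty `beta` is kept fixed (whereas the text notes that the denoiser parameter evolves), the iteration counts are arbitrary, and `binomial_denoiser` is a hypothetical stand-in for the CNN that ignores the noise level it is supposed to remove. The data term is the Fisher-Tippett negative log-likelihood, L(x + exp(ỹ − x)) up to a constant, minimized pixel-wise with Newton steps.

```python
import numpy as np

def binomial_denoiser(img, passes=3):
    """Hypothetical stand-in for the CNN: repeated periodic [1, 2, 1]/4 smoothing."""
    out = img
    for _ in range(passes):
        for axis in (0, 1):
            out = (np.roll(out, 1, axis) + 2.0 * out + np.roll(out, -1, axis)) / 4.0
    return out

def data_step(log_y, v, L, beta, n_newton=10):
    """Pixel-wise minimizer of L*(x + exp(log_y - x)) + beta/2 * (x - v)^2."""
    x = v.copy()
    for _ in range(n_newton):
        e = np.exp(log_y - x)
        grad = L * (1.0 - e) + beta * (x - v)
        hess = L * e + beta                  # strictly positive: convex problem
        x = x - grad / hess
    return x

def pnp_admm_despeckle(intensity, L, denoiser, beta=1.0, n_iter=50):
    """Plug-and-play ADMM on log-intensities with a Fisher-Tippett data term."""
    log_y = np.log(intensity)
    z = log_y.copy()
    u = np.zeros_like(log_y)
    for _ in range(n_iter):
        x = data_step(log_y, z - u, L, beta)  # non-linear correction
        z = denoiser(x + u)                   # Gaussian denoiser as proximal operator
        u = u + x - z                         # scaled dual update
    return np.exp(z)

rng = np.random.default_rng(0)
intensity = rng.gamma(shape=1, scale=1.0, size=(128, 128))  # constant scene, 1 look
restored = pnp_admm_despeckle(intensity, L=1, denoiser=binomial_denoiser)
print("mean of log input: ", np.log(intensity).mean())
print("mean of log output:", np.log(restored).mean())
```

Unlike the homomorphic filter, no explicit bias term is subtracted here: the Fisher-Tippett likelihood in `data_step` is what pulls the estimate away from the biased mean of the log-intensities.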

Despeckling with a CNN specifically trained on SAR images
The architecture that we consider for our CNN trained on SAR images (SAR-CNN) is based on the work of Chierchia et al. [40], who in turn were inspired by the DnCNN of Zhang et al. [57] that we described in section 3.2.1. In this section, we describe in detail a procedure for producing high-quality ground-truth images for the training step of the network.

Training-set generation
Deep learning models need a lot of data to generalize well. This is an issue in deep learning-based SAR image despeckling techniques, due to the lack of truly speckle-free SAR images. The reference image has to be therefore created through an ad-hoc procedure.
Speckle noise can be strongly reduced by multi-temporal multilooking (i.e., averaging the intensity of images acquired at different dates, assuming that no changes occurred). We therefore consider multi-temporal stacks. Due to the coherence of some regions, some speckle fluctuations remain after this temporal multilooking procedure. We further improve the images by applying a MuLoG+BM3D denoising step [44] with an equivalent number of looks estimated from selected homogeneous regions. The images obtained are then considered speckle-free and serve as a ground truth. Synthetic speckle noise is simulated based on the statistical models described in section 3.1 in order to produce the noisy / clean image pairs necessary for the training of the network. Although Goodman's fully developed model is not verified everywhere in the images (the Rician distribution [66][67][68] could instead be used for strong scatterers) and assumes spatially uncorrelated speckle, it has been the founding model of most SAR despeckling methods developed over the last four decades [16]. Thus, we find it relevant to simulate 1-look speckle noise based on Goodman's model. Its limitations are discussed in section 4.3. A ground truth image is depicted in Figure 3, where we show the progress from the 1-look SAR image to the clean reference used in our training step. The temporal average of large temporal stacks of finely registered SAR images leads to images with limited speckle fluctuations and, after denoising, these images retain the characteristics of SAR images (bright points, sharp edges, textures) with almost no residual fluctuations due to speckle. Since the denoising operation is applied on an image already temporally multilooked, only small fluctuations have to be suppressed by the denoising step (i.e., this denoising step helps but is not crucial). This method thus provides a realistic way to create speckle-free SAR images. A description of the training set is given in Table 1.
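The two main ingredients of this procedure, temporal multilooking and synthetic 1-look speckle simulation, can be sketched as follows. The stack is simulated here (a real pipeline would read finely registered Sentinel-1 acquisitions), the MuLoG+BM3D refinement step is omitted, and the function names are hypothetical.

```python
import numpy as np

def temporal_multilook(stack):
    """Average intensities over the time axis (axis 0) of a registered stack."""
    return stack.mean(axis=0)

def make_training_pair(reference, L, rng):
    """Corrupt a speckle-free intensity image with simulated L-look speckle.

    Returns the (noisy, clean) pair in the log domain, as fed to the network.
    """
    speckle = rng.gamma(shape=L, scale=1.0 / L, size=reference.shape)
    noisy = reference * speckle            # Goodman's multiplicative model
    return np.log(noisy), np.log(reference)

# Simulated stack: 50 dates of a two-region scene, independent 1-look speckle.
rng = np.random.default_rng(0)
reflectivity = np.ones((64, 64))
reflectivity[:, 32:] = 4.0
stack = reflectivity * rng.gamma(shape=1, scale=1.0, size=(50, 64, 64))

reference = temporal_multilook(stack)      # residual fluctuations of about 1/sqrt(50)
log_noisy, log_clean = make_training_pair(reference, L=1, rng=rng)

left = reference[:, :32]
print("relative std in a homogeneous region:", left.std() / left.mean())
```

The remaining relative fluctuation of the averaged image is what the additional denoising step is meant to suppress before the image is used as a reference.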

Network architecture and loss function
The network is easier to train on log-transformed data since the noise is then additive and white (and coarsely Gaussian distributed). Compared to the suppression of AWGN, the network needs to learn how to separate log-transformed SAR reflectivities from log-transformed speckle, distributed according to the Fisher-Tippett distribution given in equation (4). In the following, minor changes to the network architecture are discussed. It is worth pointing out that, given the impossibility of reproducing the work proposed in [40] (the weights are not available and the datasets are not the same), we by no means intend to compare our results to those of Chierchia et al. Instead, we have always followed our training strategy and drawn conclusions based on visual inspection of our testing set.
We found experimentally that increasing the depth of the network compared to the depth used by Zhang et al. and Chierchia et al. was improving the performance on the testing set. We used 19 layers, each layer involving spatial convolutions with 3 × 3 kernels (see figure 4). The receptive field of our network then corresponds to a patch of size 39 × 39.
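The receptive field quoted above follows from simple arithmetic: each 3 × 3 convolution with stride 1 and no dilation extends the receptive field by 2 pixels. A minimal check, assuming exactly that layer configuration (the function name is hypothetical):

```python
def receptive_field(num_layers, kernel_size=3):
    """Side length of the receptive field of stacked stride-1, undilated convolutions."""
    return num_layers * (kernel_size - 1) + 1

print(receptive_field(17))  # 17-layer network (depth of DnCNN): 35
print(receptive_field(19))  # 19-layer network used here: 39
```

Going from 17 to 19 layers therefore widens the context seen by each output pixel from 35 × 35 to 39 × 39, which is close to the 40 × 40 training patch size.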
We considered several loss functions to train the network. We found the ℓ1 loss function to be preferable to the smoothed ℓ1 loss function of Chierchia et al. and to the ℓ2 loss function. This matches other studies that have shown a reduction of artifacts and an improvement of the convergence when using the ℓ1 loss [69]. We therefore used the following loss function:

$$\ell_1 = \sum_{i=1}^{N} \left\| f_{\mathrm{CNN}}(\tilde{\mathbf{y}}_i) - (\psi(L) - \log(L)) \cdot \mathbf{1} - \tilde{\mathbf{x}}_i \right\|_1, \tag{5}$$

where the sum is carried over all N images from (a batch sampled from) the training set, f_CNN(·) represents the action of the CNN on some log-transformed input data, and the boldface is used to denote images (the i-th noisy image ỹ_i and the corresponding speckle-free reference x̃_i, both log-transformed). The term (ψ(L) − log(L)) · 1 is a constant image that corresponds to the bias correction. Its role is to center the Fisher-Tippett distribution. By making this term explicit, it is easier to perform transfer learning, i.e., to re-train a network for a different number of looks L by warm-starting the optimization from the values obtained for the previous number of looks.

Network training
The training set is formed by 7 Sentinel-1 speckle-free images produced by filtering 7 different multi-temporal stacks of size between 1024 × 1536 and 1024 × 8192 pixels. The images are selected so as to cover urban areas, forests, a coast with some water surfaces, fields and mountainous areas. Patches of size 40 × 40 are extracted from these images, with a stride of 10 pixels between patches. Mini-batches of 128 patches are used. In order to improve the network generalization capability, standard data augmentation techniques are used: vertical and horizontal flipping and rotations by ±90° and 180° are applied to the patches. 11968 batches of 128 patches are processed during an epoch. A total of 50 epochs were used with the ADAM stochastic gradient optimization method, with an initial learning rate of 0.001. The convergence of the learning and the absence of over-fitting were checked by monitoring the decrease of the loss function throughout the epochs as well as the performance over the test set.
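Patch extraction with a stride of 10 pixels and the flip and rotation augmentations can be sketched as below. The text does not state whether each transform is applied to every patch or drawn at random; the random choice in `augment` is an assumption made for this illustration, as are the function names and the toy image size.

```python
import numpy as np

def extract_patches(image, patch_size=40, stride=10):
    """Extract all patch_size x patch_size patches on a regular grid."""
    h, w = image.shape
    patches = [image[i:i + patch_size, j:j + patch_size]
               for i in range(0, h - patch_size + 1, stride)
               for j in range(0, w - patch_size + 1, stride)]
    return np.stack(patches)

def augment(patch, rng):
    """Apply one transform drawn at random: identity, a flip, or a rotation."""
    ops = [
        lambda p: p,               # identity
        np.flipud,                 # vertical flip
        np.fliplr,                 # horizontal flip
        lambda p: np.rot90(p, 1),  # +90 degrees
        lambda p: np.rot90(p, -1), # -90 degrees
        lambda p: np.rot90(p, 2),  # 180 degrees
    ]
    return ops[rng.integers(len(ops))](patch)

rng = np.random.default_rng(0)
image = rng.random((100, 120))      # toy image standing in for a speckle-free reference
patches = extract_patches(image)
print(patches.shape)                # (63, 40, 40): 7 row offsets times 9 column offsets
augmented = augment(patches[0], rng)
```

Because the patches are square, every transform preserves the patch shape, so augmented patches can be batched together directly.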

Hybrid approach: MuLoG + trained CNN
A hybrid approach is also considered. In this method, the dataset that we have constructed is used to retrain the CNN described in section 3.2.1 to remove Gaussian noise from SAR images. Accounting for the Fisher-Tippett distribution is then made possible by embedding this network within the MuLoG framework. By doing so, we aim at investigating the influence of the content of the training images on the restoration performance.

Experimental results
We compare in the following paragraphs the performance of the different strategies for CNN-based despeckling both on images with simulated speckle and on single-look Sentinel-1 images. We first illustrate the impact of the loss function and of the number of layers on the performance of SAR-CNN.

Influence of the loss function and of the network depth
We illustrate in figure 5 the impact of the loss function (ℓ1 versus the smoothed ℓ1 used by Chierchia et al. [40]) and of the network depth in terms of despeckling artifacts on an image with simulated speckle. The image corresponds to one of the speckle-free images of our testing set. We magnify 6 regions in order to illustrate cases where the initial CNN architecture fails to recover some structures (regions 1, 2, 3 and 6) that are present in the ground truth and at least partially recovered by applying the proposed modifications to the CNN, and cases where some spurious structures appear (regions 4 and 5) with the first CNN architecture employed, but are not created with the modifications of the CNN depth and loss function that we considered.
All three architectures are trained for 50 epochs on the high-quality database of SAR images we have created, and conclusions are drawn after qualitative evaluation of our testing images. Indeed, as already stated, a comparison with the work of Chierchia et al. is not possible.

Quantitative comparisons on images with simulated speckle
We use two common image quality criteria to evaluate the quality of the despeckling obtained with different methods: the Peak-signal-to-noise ratio (PSNR), related to the mean squared error, which is relevant in terms of evaluation of the estimated reflectivities (bias and variance of the estimator), and the structural similarity (SSIM) which better captures the perceived image quality. For each speckle-free image from the testing set, we generate several versions synthetically corrupted by single-look speckle, in order to report both the PSNR and SSIM mean values, and their standard deviations over different noise realizations.
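The evaluation protocol (several noise realizations per reference image, mean and standard deviation of the score) can be sketched for the PSNR as follows. The choice of computing the score on amplitude images with the peak set to the maximum of the reference is an assumption of this sketch, as is the toy reference image; no despeckling is applied, so the printed value is the PSNR of the noisy input itself.

```python
import numpy as np

def psnr(reference, estimate, peak):
    """Peak signal-to-noise ratio in dB for a given peak value."""
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(peak**2 / mse)

rng = np.random.default_rng(0)
reflectivity = np.ones((64, 64))
reflectivity[:, 32:] = 4.0
amplitude_ref = np.sqrt(reflectivity)

scores = []
for _ in range(20):                          # 20 independent single-look realizations
    noisy = reflectivity * rng.gamma(shape=1, scale=1.0, size=reflectivity.shape)
    scores.append(psnr(amplitude_ref, np.sqrt(noisy), peak=amplitude_ref.max()))

print(f"PSNR of the noisy input: {np.mean(scores):.2f} +/- {np.std(scores):.2f} dB")
```

Replacing `np.sqrt(noisy)` by the amplitude of a despeckled image gives the score of a given method under the same protocol.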
Seven different images are used in our testing set. We report the performance of the approaches proposed in this paper as well as the performance of SAR-BM3D [25], NL-SAR [70] and MuLoG+BM3D in Table 2 and Table 3.
Table 3. Comparison of denoising quality evaluated in terms of SSIM on amplitude images. For each ground truth image, 20 noisy instances are generated. 1σ confidence intervals are given. Per-method averages are given at the bottom.
To qualitatively evaluate the quality of denoising, we display in Figure 6 the results obtained by the different methods on image "Marais 1". To better analyze the results, the residual intensity images (i.e., the ratio between the noisy and the restored images) are displayed below the restored images. Almost no structure can be identified by visual analysis of the residual images, which means that the compared methods preserve very well the geometrical content of the original images (limited over-smoothing).
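The residual (ratio) image used in this analysis is straightforward to compute. The sketch below contrasts an ideal restoration, whose ratio image is pure speckle with mean 1 and variance 1/L, with an over-smoothed one, whose ratio image retains a trace of the edge. The toy scene and the global-mean "restoration" are assumptions chosen only to make the effect visible.

```python
import numpy as np

def ratio_image(noisy, restored):
    """Residual intensity image: ratio between noisy and restored intensities."""
    return noisy / restored

rng = np.random.default_rng(0)
L = 1
reflectivity = np.ones((128, 128))
reflectivity[:, 64:] = 4.0                       # vertical edge in the scene
noisy = reflectivity * rng.gamma(shape=L, scale=1.0 / L, size=reflectivity.shape)

ideal = ratio_image(noisy, reflectivity)         # perfect restoration
oversmoothed = ratio_image(noisy, np.full_like(reflectivity, reflectivity.mean()))

print("ideal ratio: mean", ideal.mean(), "variance", ideal.var())
print("over-smoothed ratio, mean left / right of the edge:",
      oversmoothed[:, :64].mean(), oversmoothed[:, 64:].mean())
```

In the over-smoothed case the two halves of the ratio image have clearly different means, which is the kind of visible structure that indicates lost geometrical content.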
It can be observed both in the quantitative results reported in the tables and in the qualitative analysis that the CNN methods (the two versions of MuLoG+CNN and SAR-CNN) perform better than MuLoG+BM3D, which we use as the reference algorithm for SAR despeckling before the introduction of CNN methods. SAR-CNN removes speckle from the images while preserving details such as edges, at the cost of introducing small but noticeable artifacts in homogeneous areas. In contrast, MuLoG+BM3D and MuLoG+CNN generate blurry edges, over-smoothing some areas where the details are lost, even when the CNN is pre-trained on SAR images. This can be observed by comparing the denoised images of Figure 6, where SAR-CNN better preserves the details of the urban area at the bottom of the image compared to the three other denoising methods. The quality of the details can be attributed to the richness of information captured by the network when learning on many SAR image patches.

Despeckling of real single-look SAR images: how to handle correlations
In this section, the denoising performance of SAR-CNN and MuLoG+CNN is evaluated on real single-look SAR images acquired by the Sentinel-1 mission. To test our deep learning-based denoiser, we focused on some of the areas of the images analyzed in the above tables, picking one of the multitemporal instances used to generate the ground truth images for the training of SAR-CNN and making sure that these areas do not belong to the training set.
Unlike synthetically generated noisy SAR images, in real acquisitions, pixels are spatially correlated. SAR images undergo an apodization (and over-sampling) process [71][72][73] aimed at reducing the sidelobes of strong targets by introducing some spectral weighting. Thus, we subsample real SAR acquisitions by a factor of 2, as proposed in [74]. As already observed in the case of synthetic SAR data, the images restored with the proposed methods show significant improvements in denoising performance over the reference despeckling algorithm MuLoG+BM3D, with SAR-CNN providing the best visual result. Even when compared to SAR-BM3D and NL-SAR, which do not require the image to undergo a downsampling step, our results are visually more pleasing, with a better preservation of fine structures. NL-SAR, indeed, performs best in polarimetric and interferometric configurations.
Figure 7 shows the restoration results and the residual images (obtained by forming the ratio noisy/denoised) for the different methods. In contrast to the simulated case, some structures can be visually identified in the residual images. These structures correspond to thin roads. The downsampling operation necessary to remove the speckle correlations makes the preservation of these structures very difficult for all methods. SAR-CNN seems to be the most effective at preserving those structures. Visual analysis of the restored images also seems to indicate fewer artifacts with SAR-CNN.
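The effect of the factor-2 subsampling on speckle correlation can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not the procedure of [74]: correlated speckle is emulated by over-sampling a white complex Gaussian field by a factor of 2 through zero-padding of its spectrum (no apodization window is applied), and the subsampling is a plain decimation that keeps every second pixel. All function names are hypothetical.

```python
import numpy as np

def oversample_2x(field):
    """Double the sampling rate in both axes by zero-padding the centered spectrum."""
    rows, cols = field.shape  # both assumed even
    spectrum = np.fft.fftshift(np.fft.fft2(field))
    padded = np.zeros((2 * rows, 2 * cols), dtype=complex)
    padded[rows // 2:rows // 2 + rows, cols // 2:cols // 2 + cols] = spectrum
    # The factor 4 compensates for the larger inverse-FFT normalization.
    return 4.0 * np.fft.ifft2(np.fft.ifftshift(padded))

def lag1_correlation(image):
    """Correlation coefficient between horizontally adjacent pixels."""
    left = image[:, :-1].ravel()
    right = image[:, 1:].ravel()
    return np.corrcoef(left, right)[0, 1]

rng = np.random.default_rng(0)

# White complex circular Gaussian field: uncorrelated single-look speckle.
white = rng.standard_normal((128, 128)) + 1j * rng.standard_normal((128, 128))

# Over-sampling introduces correlation between neighboring pixels.
correlated_intensity = np.abs(oversample_2x(white)) ** 2

# Subsampling by a factor of 2 keeps every second pixel in each direction.
decimated_intensity = correlated_intensity[::2, ::2]

corr_before = lag1_correlation(correlated_intensity)
corr_after = lag1_correlation(decimated_intensity)
print(f"lag-1 correlation before: {corr_before:.3f}, after: {corr_after:.3f}")
```

In this idealized setting the decimation recovers uncorrelated speckle exactly, at the price of halving the resolution, which is consistent with the loss of thin structures noted above.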
Since the ground-truth reflectivity is not available, we measure the performance of the proposed method by estimating the Equivalent Number of Looks (ENL) on manually selected homogeneous areas. The homogeneous regions chosen for the ENL estimation are shown with red boxes, and the ENL values are given in Table 4.
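As an illustration of this measure, the following sketch assumes the usual intensity-domain definition, ENL = mean² / variance, computed over a homogeneous region; the text does not state the exact formula used for Table 4, so this definition is an assumption. The function name and the simulated patches (constant reflectivity with gamma-distributed speckle) are hypothetical.

```python
import numpy as np

def equivalent_number_of_looks(intensity_patch):
    """ENL of a homogeneous intensity region: squared mean over variance."""
    mean = intensity_patch.mean()
    variance = intensity_patch.var()
    return mean ** 2 / variance

rng = np.random.default_rng(0)

# Homogeneous region with single-look speckle (exponential intensity).
single_look = 100.0 * rng.exponential(1.0, size=(200, 200))

# Same region with 10-look speckle (gamma-distributed, unit mean).
ten_look = 100.0 * rng.gamma(shape=10.0, scale=1.0 / 10.0, size=(200, 200))

enl_1 = equivalent_number_of_looks(single_look)
enl_10 = equivalent_number_of_looks(ten_look)
print(f"ENL single-look: {enl_1:.2f}, ENL 10-look: {enl_10:.2f}")
```

A higher ENL after despeckling indicates stronger smoothing of homogeneous areas; it says nothing about the preservation of edges or point targets, which is why it is complemented by visual inspection of the residual images.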
We then extended our analysis to a TerraSAR-X acquisition. In this case, we want to assess the generalization capabilities of the trained SAR-CNN on images from a different sensor and with a different spatial resolution. As can be seen in Figure 8, MuLoG+CNN appears to provide the best visual results. The estimated ENL values indicate that all despeckling methods are very effective in homogeneous regions. MuLoG+CNN seems to produce an image with a slightly better perceived resolution, which may indicate a better generalization property compared to SAR-CNN. The analysis of the residual images, as in Figure 7, indicates that thin linear structures are attenuated in the restored images due to the downsampling step.

Discussion
A CNN trained to suppress additive white Gaussian noise encodes a very generic model of natural images in the form of a proximal operator related to an implicit prior. Like other models of natural images that were successfully applied to the problem of speckle reduction in SAR imaging (wavelets, total variation), such a model is relevant to SAR imagery because it captures structures (points, edges, corners) and textures. Yet, the specificities of SAR images make it beneficial to train a model specifically on SAR images. This is done naturally by patch-based methods that use the content of the image itself as a model (repeating patches). In this paper, we have shown the superiority of a CNN model trained on SAR images, provided that a high-quality training set is built, enough layers are used to capture large-scale structures and an adequate loss function is selected. Moreover, once trained, SAR-CNN exhibits the best runtime performance when using a GPU (see Table 5). When considering the real-life case of partially correlated speckle or images from different sensors, plugging a network trained on natural images into a SAR-adapted framework like MuLoG offers better generalization properties.
Table 5. Time for despeckling a 500 × 500 clip. Experiments were carried out with an Intel Xeon CPU at 3.40 GHz and an Nvidia K80 GPU. For NL-SAR, the radius of the smallest/largest search window is set to 1/20 and the half-width of the smallest/largest patches to 0/10.
The extension to multi-channel SAR images represents a real challenge. Speckle reduction in multi-channel images requires modeling the correlations between channels (the interferometric and polarimetric information). In order to learn those correlations directly from the data, a dataset that contains speckle-free images covering the whole diversity of polarimetric responses, interferometric phase differences and the whole range of coherences for typical geometrical structures (points, lines, corners, homogeneous regions, textured regions) must be formed. Needless to say, this is far more challenging than collecting single-channel SAR images to cover only the diversity of geometric structures. Failure to correctly include all cases in the training set implies that the network, instead of performing a high-dimensional interpolation, performs a high-dimensional extrapolation, which puts the user at high risk of experiencing large prediction errors.
This difficulty justifies the relevance of using pre-trained networks within the MuLoG framework, which has been designed to apply single-channel restoration methods to multi-channel SAR images. A summary of the advantages and drawbacks of the proposed methods is reported in Table 6.

MuLoG + CNN
Pros
• No specific training is needed (improvement is small when the CNN is trained on SAR images)
• Straightforward adaptation to multiple looks
• Straightforward generalization to polarimetric and/or interferometric SAR images
• Straightforward adaptation to different SAR sensors
Cons
• High runtime due to its iterative procedure

SAR-CNN
Pros
• Provides the highest performance
• Fastest runtime once the network is trained
• Possible adaptation to multiple looks (requires re-training)
• Possible adaptation to different SAR sensors (requires re-training)
Cons
• Requires training: formation of a dataset of speckle-free SAR images
• Generalization to multi-channel SAR images (polarimetric and/or interferometric) raises dimensionality issues: a very large training set is needed to sample the diversity of radar images

Table 6. Advantages and disadvantages of the CNN-based despeckling strategies considered in this paper.
To make the presented SAR-CNN available for testing and comparison, we release an open-source code of the network trained on our dataset 3 . Indeed, replicating the results of a published work is not an easy task and may represent up to months of work. Therefore, by sharing our code, we hope to help other researchers and users of SAR images to easily apply our CNN-based denoiser to single-look Sentinel-1 images, and possibly to compare its restoration performance with that of their own methods.

Conclusions
With the new generation of sensors orbiting the Earth, access to long time-series of SAR images is improving. Given the increasing interest in the use of deep learning algorithms for SAR despeckling, this paper describes a procedure to generate ground truth images that can be applied in a systematic way to produce large training sets formed by pairs of high-quality speckle-free images and simulated speckled images.
In future work, it would be interesting to train the networks on actual single-look SAR images in order to account for the spatial correlations of the speckle. This, however, would require a method to produce a high-quality ground-truth image for each single-look observation. Restoration methods that exploit long time-series of SAR images [75] may pave the way to producing such training sets.