Deep Multi-Scale Recurrent Network for Synthetic Aperture Radar Images Despeckling

Owing to the existence of speckle, many standard optical image processing methods, such as classification, segmentation, and registration, perform poorly on synthetic aperture radar (SAR) images. In this work, an end-to-end deep multi-scale recurrent network (MSR-net) for SAR image despeckling is proposed. Multi-scale recurrent and weight-sharing strategies are introduced to increase network capacity without multiplying the number of weight parameters. A convolutional long short-term memory (ConvLSTM) unit is embedded to capture useful information and help with despeckling across scales. Meanwhile, a sub-pixel unit is utilized to improve network efficiency. In addition, two criteria, the edge feature keep ratio (EFKR) and the feature point keep ratio (FPKR), are proposed to evaluate despeckling performance for SAR images; they assess the ability of a despeckling algorithm to retain edge and feature information more effectively. Experimental results show that our proposed network can remove speckle noise while preserving the edge and texture information of images at low computational cost, especially in low signal-to-noise-ratio scenarios. In peak signal-to-noise ratio (PSNR), MSR-net outperforms the traditional despeckling method SAR-BM3D (block-matching and 3D filtering) by more than 2 dB on simulated images. Furthermore, the adaptability of optical image processing methods to real SAR images can be enhanced after despeckling.


Introduction
Synthetic aperture radar (SAR), owing to its all-weather, day-and-night operation, has been widely applied in microwave remote sensing, including topographic mapping, military target reconnaissance, and natural disaster monitoring [1,2]. SAR imaging achieves high range resolution by exploiting the pulse compression technique and high azimuth resolution by using the motion of the radar platform to form a virtual synthetic aperture along the track [3,4]. However, speckle noise exists in the imaging results due to the coherent imaging mechanism of SAR, which degrades image quality and readability. Meanwhile, the existence of speckle limits the effectiveness of common optical image processing methods applied to SAR images [5]. It thus hinders further understanding and interpretation of SAR images, increases the difficulty of extracting roads, farmlands, and buildings, raises the complexity of spatial feature extraction in image registration, and reduces the accuracy of detecting and classifying objects such as vehicles and ships [6]. Speckle suppression is, therefore, an important task in SAR image post-processing.
To improve the quality of SAR images, various speckle suppression methods have been proposed, including multi-look processing technologies during imaging and image filtering methods after imaging [1,7]. Multi-look processing divides the whole effective synthetic aperture length into several sub-apertures and averages the resulting independent looks to suppress speckle. More recently, deep-learning-based despeckling methods have been developed; some do not require clean reference images and can work in an unsupervised way when trained with real SAR images. Bai et al. [26] added a fractional total variation loss to the loss function to remove the obvious noise while maintaining texture details. The authors of [27] proposed a CNN framework based on dilated convolutions called SAR-DRN. This network enlarges the receptive field by dilated convolutions and further improves the network by exploiting skip connections and a residual learning strategy. State-of-the-art results are achieved in both quantitative and visual assessments.
In this study, we design an end-to-end multi-scale recurrent network for SAR image despeckling. Unlike [9,25,26,27], which use a CNN only to estimate the speckle distribution characteristics and then remove the speckle with an additional division or subtraction operation, we use the network to learn the distribution characteristics of the speckle noise while automatically performing speckle suppression to output clean images. The proposed network is based on the encoder-decoder architecture. To improve operational efficiency, in the decoder part we use a sub-pixel unit instead of a deconvolutional layer to up-sample the feature maps. Besides, this paper applies a multi-scale recurrent strategy, which feeds resized images of different scales to the network, with all scales sharing the same network weight parameters. Thus, network performance can be improved without increasing the number of parameters, and an output friendly to optical image processing algorithms can be obtained. Also, a convolutional LSTM unit is used to transmit information across scales. Although our network, like the networks that output a noise estimate, is fully convolutional, MSR-net contains pooling layers that reduce the dimension of the feature maps and thus reduce the amount of computation to a great extent. Lastly, we propose two evaluation criteria based on image processing methods.
The paper consists of six sections. In Section 2, we analyze the speckle of SAR images and briefly introduce CNNs and the convolutional LSTM. After presenting the framework of our proposed MSR-net in Section 3, the results and discussion of the experiments are given in Sections 4 and 5. The last section summarizes this paper.

Speckle Model of SAR Images
The multiplicative model is usually used to describe speckle noise [28], and its formula is defined as:

I = p_s · n, (1)

where I is the image intensity, p_s is a constant that denotes the average scattering coefficient of the objects or ground, and n denotes the speckle, which is statistically independent of p_s. For a homogeneous SAR image, the single-look intensity I obeys a negative exponential distribution [29], and its probability density function (PDF) is defined as:

p(I) = (1/p_s) exp(−I/p_s), I ≥ 0. (2)

Multi-look processing methods are usually used to improve the quality of SAR images by diminishing the speckle noise. If the Doppler bandwidth is divided into L sub-bandwidths during imaging, and I_i is the single-look intensity image corresponding to each sub-bandwidth, the result of multi-look processing is:

I = (1/L) Σ_{i=1}^{L} I_i, (3)

where L is the number of looks. If I_i obeys the exponential distribution in Equation (2), then after multi-look averaging the L-look intensity image follows a Gamma distribution [1], and the PDF is:

p(I) = (L^L I^{L−1}) / (Γ(L) p_s^L) · exp(−L·I/p_s), I ≥ 0, (4)

where Γ(L) denotes the Gamma function. The PDF of the L-look speckle n can be obtained by applying the product model of Equation (1) to Equation (4):

p(n) = (L^L n^{L−1}) / Γ(L) · exp(−L·n), n ≥ 0. (5)
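As a concrete illustration of the product model above, the following minimal NumPy sketch (the function and variable names are ours, not part of the original pipeline) draws L-look speckle from a Gamma distribution with unit mean and variance 1/L and multiplies it onto a clean intensity image:

```python
import numpy as np

def add_speckle(clean, L, seed=None):
    """Multiply a clean intensity image by simulated L-look speckle.

    The L-look speckle n is drawn from a Gamma distribution with shape L
    and scale 1/L, so it has mean 1 and variance 1/L, consistent with
    the product model I = p_s * n.
    """
    rng = np.random.default_rng(seed)
    n = rng.gamma(shape=L, scale=1.0 / L, size=clean.shape)
    return clean * n
```

Because the speckle mean stays at 1 while its variance shrinks as 1/L, averaging more looks smooths the image, matching the multi-look analysis above.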

Convolutional Long Short-Term Memory
Convolutional neural networks (CNNs) have powerful capabilities for extracting spatial features and can automatically extract universal features through back-propagation algorithms driven by the dataset [30,31]. However, they cannot be used to process sequence signals directly, because their inputs are independent of each other and information flows strictly in one direction from layer to layer. To solve this problem, we introduce convolutional long short-term memory (ConvLSTM) [32] into the network; its inner structure is shown in Figure 1. As a special kind of RNN, the long short-term memory (LSTM) network has an internal hidden memory that allows the model to store information about its past computations and makes it capable of learning long-term dependencies [33]. Different from the standard LSTM, all feature variables of ConvLSTM, including the input X_t, cell state C_t, the output of the forget gate F_t, input gate i_t, and output gate O_t, are three-dimensional tensors whose last two dimensions are the spatial dimensions (width and height). The key equations of ConvLSTM are defined as:

F_t = σ(W_f ⊛ [H_{t−1}, X_t] + b_f),
i_t = σ(W_i ⊛ [H_{t−1}, X_t] + b_i),
C̃_t = tanh(W_c ⊛ [H_{t−1}, X_t] + b_c),
C_t = F_t * C_{t−1} + i_t * C̃_t,
O_t = σ(W_o ⊛ [H_{t−1}, X_t] + b_o),
H_t = O_t * tanh(C_t), (6)

where "⊛" denotes convolution, and "*" and "σ" denote the Hadamard product and the logistic sigmoid function, respectively. F_t controls which state information from the last step is discarded, and i_t is in charge of the current state update C̃_t. W_f, W_i, W_c, and W_o represent the weights of each neural unit, with b_f, b_i, b_c, and b_o denoting the corresponding offsets.
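The gate equations above can be sketched for a single-channel feature map as follows; the concatenated-input parameterization, kernel shapes, and naive NumPy convolution are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(x, k):
    """Naive 'same'-padded cross-correlation (deep-learning convolution)."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def convlstm_step(X, H_prev, C_prev, W, b):
    """One ConvLSTM step on single-channel (H, W) maps.

    W maps each gate name ('f', 'i', 'c', 'o') to a pair of kernels,
    one applied to the input X and one to the hidden state H_prev.
    """
    def gate(name):
        Wx, Wh = W[name]
        return conv2d_same(X, Wx) + conv2d_same(H_prev, Wh) + b[name]

    F = sigmoid(gate("f"))            # forget gate
    I = sigmoid(gate("i"))            # input gate
    C_tilde = np.tanh(gate("c"))      # candidate cell state
    C = F * C_prev + I * C_tilde      # "*" is the Hadamard product
    O = sigmoid(gate("o"))            # output gate
    H = O * np.tanh(C)                # new hidden state
    return H, C
```

Since O lies in (0, 1) and tanh(C) in (−1, 1), the hidden state stays bounded, which keeps the recurrent information flow across scales numerically stable.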

Proposed Method
An end-to-end network MSR-net for SAR image despeckling is proposed in this paper. Rather than using additional division operation [25,26] or subtraction operation [9,27], our network can automatically perform despeckling and generate a clean image. In this section, we first introduce the multi-scale recurrent architecture and then describe specific details through a single-scale model.

Architecture
MSR-net is built from cascaded subnetworks, and each subnetwork contains three parts: an encoder, a decoder, and a ConvLSTM unit, as illustrated in Figure 2. Different levels of subnetworks correspond to different scales of inputs and outputs. The next-scale speckled image and the output of the current subnetwork are combined as the input of the next-level subnetwork. In addition, a ConvLSTM unit with a single input and two outputs is embedded between the encoder and decoder. Specifically, one output is connected to the decoder, and the other output, which represents the hidden state, is connected to the ConvLSTM unit of the next subnetwork. Different from a general cascaded network like [34], which uses three stages of independent subnetworks, all the state features in MSR-net flow across scales and share the same training parameters. Owing to the multi-scale recurrent and parameter-sharing strategies, the number of parameters that need to be trained in MSR-net is only one third of that in [34].
For each subnetwork, the output F^i of the encoder Net_en, which takes the speckled image and the despeckled result up-sampled from the previous scale as input, can be defined as:

F^i = Net_en(I^i_in, up(I^{i+1}_o); Θ_en), (7)

where I^i_in is the input image with speckle noise and Θ_en is the set of weight parameters of Net_en. i = 0, 1, 2, ... is the scale index; the larger i is, the lower the resolution. i = 0 represents the original resolution and i = 1 indicates one down-sampling. I^{i+1}_o is the output of the previous, coarser scale. up(·) is the operator that adapts features or images from the (i + 1)-th to the i-th scale, implemented by bilinear interpolation.
To exploit the information contained in the feature maps of different scales, a convolutional LSTM module is embedded between the encoder and the decoder. The ConvLSTM can be defined as:

(g^i, h^i) = ConvLSTM(F^i, h^{i+1}; Θ_LSTM), (8)

where Θ_LSTM is the set of parameters in the ConvLSTM, h^i is the hidden state passed to the next scale, h^{i+1} is the hidden state from the previous (coarser) scale, and g^i is the output of the current scale. Finally, with Θ_de denoting the parameters of the decoder, the output can be defined as:

I^i_o = Net_de(g^i; Θ_de). (9)
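Under the definitions above, one coarse-to-fine forward pass can be sketched as follows; the encoder, ConvLSTM, and decoder are passed in as callables with shared weights, and a nearest-neighbour up-sampler (with a fixed factor of 2 between adjacent levels, an assumption of this sketch) stands in for the bilinear up(·) operator:

```python
import numpy as np

def upsample2(x):
    """Nearest-neighbour stand-in for the paper's bilinear up(.) operator."""
    return np.kron(x, np.ones((2, 2)))

def msr_forward(pyramid, encoder, convlstm, decoder):
    """Coarse-to-fine forward pass over a list of speckled images.

    pyramid[0] is the full-resolution image, pyramid[-1] the coarsest
    scale. The same encoder/convlstm/decoder callables (shared weights)
    are applied at every scale; the hidden state h and the previous
    output flow upward across scales.
    """
    h = None
    out = np.zeros_like(pyramid[-1])     # no output below the coarsest scale
    for img in reversed(pyramid):        # start at the coarsest scale
        prev = out if out.shape == img.shape else upsample2(out)
        feat = encoder(img, prev)        # F^i = Net_en(I^i_in, up(I^{i+1}_o))
        feat, h = convlstm(feat, h)      # (g^i, h^i) = ConvLSTM(F^i, h^{i+1})
        out = decoder(feat)              # I^i_o = Net_de(g^i)
    return out
```

With identity-like stubs for the three sub-networks, a three-level pyramid yields a full-resolution output, showing how the same weights are reused at every scale.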

Single Scale Network
Details of the MSR-net are introduced through the single-scale model in this section. As shown in Figure 3, the single-scale model consists of two parts: an encoder and a decoder. The encoder includes three building blocks: a convolutional layer, a pooling layer, and Res blocks. The convolution unit performs a convolution operation and non-linear activation. Increasing the number of convolutional layers can enhance the feature extraction ability [35,36]. Multiple Res blocks are added after the convolutional layer when designing the network. Unlike the convolution unit, the skip connection proposed by He et al. [37] is built into this block, which can effectively avoid gradient explosion or vanishing gradients and increase the training speed.
In the despeckling networks designed in [6,25,27], the input and output of each convolutional layer keep the same size, which increases the amount of computation to a certain extent. We reduce the amount of calculation by decreasing the dimension of the feature maps, i.e., by adopting a pooling layer. We choose the max pooling operation with a 2 × 2 pooling kernel in this layer. It should be noted that the pooling layer can also be replaced by strided convolutions [38].
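A 2 × 2 max pooling step, which keeps the largest activation in each non-overlapping 2 × 2 window and thereby quarters the data volume, can be sketched in NumPy as:

```python
import numpy as np

def max_pool_2x2(x):
    """2 x 2 max pooling with stride 2 on an (H, W) map (H and W even).

    The output has one quarter of the elements, so subsequent
    convolutions process only 25% of the data.
    """
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
```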
The decoder consists of convolutional layers and sub-pixel units. The width and height of the feature map input to the decoder are only 1/4 of those of the original image after down-sampling twice through the pooling layers. Therefore, an up-sampling operation is required to make the output image of the network the same size as the input. However, up-sampling operations such as the transposed convolution used in [39,40] require a large amount of computation and cause unwanted checkerboard artifacts [41,42]. A typical checkerboard pattern of artifacts is shown in Figure 4. To reduce the network runtime and avoid checkerboard artifacts, the sub-pixel convolution described in Section 3.3 is used to implement the up-sampling operation.

Sub-Pixel Convolution
Sub-pixel convolution, also called pixel shuffle, is an upscaling method first proposed in [43] for image super-resolution tasks. Different from the up-sampling methods commonly used in deep learning, such as transposed convolution and fractionally strided convolution, sub-pixel convolution adopts a channel-to-space approach that achieves spatial up-scaling by rearranging the pixels in multiple channels of the feature map, as illustrated in Figure 5. For a sub-pixel unit with r-times up-sampling, its output image is defined as I_up ∈ R^{W×H×c}, in which W, H, and c denote the width, height, and number of channels of I_up. The sub-pixel convolution operation is defined as:

I_up(x, y, c) = F(⌊x/r⌋, ⌊y/r⌋, c·r² + r·mod(x, r) + mod(y, r)), (10)

where I_up(x, y, c) is the value of the pixel at position (x, y) in the c-th channel, F is the input of the sub-pixel unit with F ∈ R^{W/r×H/r×c·r²}, and ⌊·⌋ represents the floor function, which takes a real number as input and outputs the greatest integer less than or equal to it [43]. After the sub-pixel convolution operation, the elements of F are rearranged into the output I_up by increasing the horizontal and vertical dimensions and decreasing the channel dimension. For example, when a 64 × 64 × 4 feature map is passed through the sub-pixel unit, an output of shape 128 × 128 × 1 is obtained.
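The channel-to-space rearrangement can be sketched in NumPy as follows; the channel-index convention used here is one of several equivalent orderings:

```python
import numpy as np

def pixel_shuffle(F, r):
    """Rearrange an (H/r, W/r, c*r^2) tensor into an (H, W, c) tensor."""
    h, w, cr2 = F.shape
    c = cr2 // (r * r)
    # Split the channel axis into (c, r, r) blocks, then interleave the
    # two r-axes with the spatial axes (channel-to-space shuffle).
    out = F.reshape(h, w, c, r, r).transpose(0, 3, 1, 4, 2)
    return out.reshape(h * r, w * r, c)
```

Under this convention, each group of r² channels at one coarse location fills one r × r block of the up-sampled output, so no arithmetic beyond a memory rearrangement is needed.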

Proposed Evaluation Criterion
In this paper, the peak signal-to-noise ratio (PSNR) [44], structural similarity (SSIM) [45], equivalent number of looks (ENL) [46], and two newly proposed evaluation criteria, the edge feature keep ratio (EFKR) and the feature point keep ratio (FPKR), are used to evaluate the performance of despeckling methods.
PSNR is the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation, which has been widely used in quality assessment of reconstructed images. SSIM is a metric of image similarity. ENL can describe the smoothness of regions, and no reference image is needed for its calculation, so it can be used to evaluate the performance of despeckling methods for real SAR images.

Edge Feature Keep Ratio and Feature Point Keep Ratio
PSNR and SSIM can effectively evaluate the overall performance of despeckling methods: PSNR measures the noise level or image distortion, SSIM measures the similarity between two images, and ENL measures the degree of region smoothing. They are not, however, capable of directly evaluating the retention of edges and typical features in despeckling tasks. In this section, we propose two evaluation criteria that compensate for these deficiencies: the edge feature keep ratio (EFKR) and the feature point keep ratio (FPKR).
(a) EFKR: from the edge detection results shown in Figure 6, we have the following observations: (1) the edge outline of the speckled image is blurred, and there are discrete points in the image; (2) the edge outline is clear after despeckling and there are no discrete points, which agrees with the edge detection result of the clean image. Enlightened by this phenomenon, we design EFKR, a quantitative evaluation criterion for edge retention based on counting edge pixels. The computation steps are as follows:
1. Perform edge detection on the clean and test images using an edge detection algorithm such as Sobel [47], Canny [48], Prewitt [49], or Roberts [50].
2. Count the edge pixels shared by the two edge maps and the edge pixels of the clean image; the ratio of these two factors is the edge feature keep ratio, which is defined as:

EFKR = sum(edge(X_test) & edge(X_clean)) / sum(edge(X_clean)), (11)

where & and sum denote the bit-wise conjunction and summation operations, edge(·) represents edge detection, and X_clean and X_test denote the clean image and the image under evaluation.
(b) FPKR: for real SAR images, ENL can only evaluate the smoothness level, not the retention of typical features such as edges and corners. SIFT [51] can find feature points at different scales and compute a descriptor for each feature. The key points found by SIFT are usually corner points, edge points, bright spots in dark areas, and dark spots in bright areas; these points are robust to illumination, affine transformations, and other transformations. A registration method based on SIFT first uses SIFT to obtain the feature points and descriptors of the image to be registered and of the reference image, then matches the feature points according to their descriptors to obtain one-to-one corresponding feature point pairs. Finally, the transformation parameters are calculated and the image registration is carried out.
For SAR images, the feature points and descriptors extracted at the bright spots of speckle are redundant, which reduces the efficiency and accuracy of the subsequent search for matching points. Based on this phenomenon, we design an evaluation criterion, FPKR, targeting key feature points. We first apply an affine transformation to the evaluation image, then use the SIFT algorithm to find the feature points in the two images before and after the transformation, and finally match the feature points. The better the despeckling performs, the more typical features are preserved. The more prominent the feature descriptors obtained by SIFT, the greater the difference between the descriptors of different features, so more effective feature point pairs can be found efficiently. FPKR is defined as:

FPKR = N_match(X, X_t) / min(N(X), N(X_t)), (12)

where N(X) and N(X_t) are the numbers of key points found by SIFT before and after the transformation, and N_match(X, X_t) denotes the number of matched point pairs used for calculating the transformation parameters.
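Given binary edge maps and SIFT key-point counts, the two criteria can be sketched as follows; the normalisation of FPKR by the smaller key-point count is an assumption of this sketch, not spelled out in the text:

```python
import numpy as np

def efkr(edge_clean, edge_test):
    """Edge feature keep ratio from two binary edge maps.

    edge_clean / edge_test: boolean arrays from an edge detector
    (Sobel, Canny, ...) applied to the clean and evaluated images.
    """
    # Shared edge pixels over the clean image's edge pixels.
    return np.sum(edge_test & edge_clean) / np.sum(edge_clean)

def fpkr(n_ref, n_trans, n_match):
    """Feature point keep ratio from SIFT key-point counts.

    n_ref / n_trans: key points found before / after the affine
    transform; n_match: matched pairs used to estimate the transform.
    Normalising by the smaller count is an assumption of this sketch.
    """
    return n_match / min(n_ref, n_trans)
```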

Dataset
Because it is hard to collect real SAR images without speckle noise, we train the networks using synthetic noisy/clean image pairs. The public UC Merced Land Use Dataset (http://weegee.vision.ucmerced.edu/datasets/landuse.html) is chosen to provide the original clean images for training. The dataset contains 21 scene classes with 100 optical remote sensing images per class. Each image has a size of 256 × 256 pixels at a pixel resolution of 1 foot [52]. Following [27], we randomly select 400 images from the dataset as the training set and use the remaining images for testing. Some training samples are shown in Figure 7. Finally, after grayscale preprocessing, the speckled images are generated using Equation (1), as in [25,53]. The noise levels (L = 2, 4, 8, 12) correspond to the number of looks in SAR, and the code for adding speckle noise is available on GitHub (https://github.com/rcouturier/ImageDenoisingwithDeepEncoderDecoder/tree/master/data_denoise).

Experimental Settings
All the networks are trained with stochastic gradient descent (SGD) with a mini-batch size of 32. All weights are initialized by a modified scheme of Xavier initialization [54] proposed by He et al. Meanwhile, we use the Adam optimizer [55] with tuned hyper-parameters to accelerate training. The hyper-parameters are kept the same across all layers and all networks. Experiments are implemented on TensorFlow platform with Intel i7-8700 CPU and an NVIDIA GTX-1080(8G) GPU.
The details of the model are specified here. The number of kernels in each unit is shown in Figure 3. The kernel sizes of the first and last convolutional layers are 5 × 5, while all others are 3 × 3. Rectified linear units (ReLU) are used as the activation function for all layers except the last convolutional layer before the sub-pixel unit. The L_1 loss is chosen to train the network, which is defined as:

L_1(Θ) = (1/(W·H)) Σ_{x,y} |ϕ(X(x, y); Θ) − C(x, y)|, (13)

where Θ is the set of filter parameters that are updated during training, C, X, and ϕ(·) denote the target image without noise, the input image with speckle noise, and the output after despeckling, respectively, and W and H are the image width and height.
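A per-image version of this loss can be sketched in NumPy as follows; a training framework would apply it batch-wise and back-propagate through it:

```python
import numpy as np

def l1_loss(output, target):
    """Mean absolute error between the despeckled output and the clean target."""
    return np.mean(np.abs(output - target))
```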

Experimental Results
The test results of our proposed network are presented in this section. To verify the proposed method, we compare the performance of our MSR-net with three other despeckling methods: SAR-BM3D [22], ID-CNN [25], and the Residual Encoder-Decoder network (RED-Net) [53]. The first is a traditional non-local algorithm based on wavelet shrinkage, and the latter two are based on deep convolutional neural networks.

Results on Synthetic Images
Three classes of synthetic images (building, freeway, and airplane) are chosen as the test set to evaluate the noise reduction ability of each method. Some of the processing results of the different algorithms under different noise levels are shown in Figures 8 and 9. From the figures, we can observe that the CNN-based methods, including our MSR-net, preserve more details, such as texture features, than SAR-BM3D after despeckling. When the noise is strong, the SAR-BM3D algorithm blurs the edges of objects.
ID-CNN performs well on image despeckling; however, after filtering by the network, salt-and-pepper noise appears in the image, which then needs to be processed by non-linear filters such as median or pseudo-median filtering. As the noise intensity increases, the salt-and-pepper noise gradually increases.
MSR-net shows excellent retention of spatial geometric features such as textures, lines, and feature points. Compared with the other three algorithms, MSR-net produces higher smoothness in smooth areas as well as a smaller loss of sharpness in edges and details, especially under strong speckle noise. Also, more detail information is lost when the speckle noise is strong, and more local detail is preserved in the output images when the speckle noise is weak.
When the level of noise added to the test set is small, all the CNN-based approaches obtain state-of-the-art results, so it is difficult to judge the merits of these algorithms by visual assessment alone. Quantitative evaluation metrics such as PSNR and SSIM are necessary in these circumstances. The PSNR, SSIM, and EFKR values of the above methods are listed in Tables 1-4, respectively. The bold number represents the optimal value in each row, while the underlined number denotes the suboptimal value. We also test MSR-net with only one scale, called the single-scale network (SS-net), during the experiments.
Consistent with the results shown in Figures 8 and 9, our method has much better speckle-reduction ability than the non-learned approach SAR-BM3D at all noise levels, and the advantage of MSR-net grows as the noise level increases, as the airplane images illustrate. In addition, we can see that our network does not always achieve the best test results; we consider that this may be related to the feature distribution of the images. Although MSR-net only obtains sub-optimal results for certain classes of images, the difference from the best result is small, while for other classes the advantage of MSR-net is considerable. For example, we can observe from Table 1 that the EFKR of RED-Net only outperforms MSR-net by about 0.006 for freeway images, while MSR-net outperforms RED-Net by about 0.0191 and 0.0194 for building and airplane images. When the noise level is L = 12, our network only obtains four best test values, which suggests that the advantage of the multi-scale network becomes smaller as the noise weakens, as shown in Table 4.
MSR-net with a single scale (SS-net) also has very good speckle-reduction ability. When the noise intensity is weak, its performance is even better than that of the multi-scale version. For example, when L = 12, the PSNR/SSIM/EFKR values for the freeway images are 28.787 dB/0.776/0.6247 and 28.893 dB/0.778/0.6197 for SS-net and MSR-net, respectively. Finally, comparison shows that the edge detection results of the images are significantly improved after despeckling.

Results on Real SAR Images
To further verify the speckle-reduction ability of our network on real SAR images, two SAR scenes are selected, as shown in Figures 10a and 11a; both are imaging results of the spaceborne SAR RADARSAT-2.
Comparing the subgraphs in Figures 10 and 11 shows that MSR-net generates the visually best output among all the results, retaining edge sharpness as well as detailed structural information while removing the speckle noise. After filtering by SAR-BM3D, the loss of edge sharpness in the original SAR image is obvious, and most of the lines and texture features are blurred. ID-CNN and RED-Net generate smooth results in homogeneous regions while maintaining textural features in the image. However, from the red boxes, we can observe that although they retain some texture features, they do not do so as well as MSR-net. Although SS-net performs well in despeckling real SAR images, it is still worse than the multi-scale MSR-net.
The ENL results are shown in Table 5. We can observe from the table that MSR-net has an outstanding performance for real SAR image despeckling: for the four evaluation regions, the three highest ENL scores and one second-highest score are obtained by our MSR-net. The bold number represents the optimal value in each row.
To obtain FPKR results for real SAR data with the different methods, we first apply the same affine transformation to each image, as shown in Figure 12. SIFT is then applied to search for the feature points and calculate their descriptors. Key points are finally matched by minimizing the Euclidean distance between their SIFT descriptors; generally, the ratio between distances is used [56] to obtain high matching accuracy. In the experiments, we select three ratios, and the FPKR results are shown in Tables 6 and 7. Comparing the FPKR of each image, MSR-net performs better than SAR-BM3D. MSR-net also shows advantages over the other neural-network-based algorithms; specifically, it achieves the best testing results in five out of six sets of experiments. This also indicates that pre-processing SAR images with MSR-net can effectively enhance the applicability of the SIFT algorithm to SAR images and improve its performance and efficiency.
The value inside brackets is the number of feature points. The bold number represents the optimal value in each column.

Runtime Comparisons
To evaluate algorithm efficiency, we measure the runtime of each algorithm in a CPU implementation. The runtimes of the different methods on images of different sizes are listed in Table 8. We can see that the proposed denoiser is very competitive, although its structure is relatively complex. This good compromise between speed and performance can be attributed to two factors. First, two pooling layers that reduce the spatial dimensionality are embedded in MSR-net; each pooling layer with a 2 × 2 kernel reduces the amount of data to be processed by the subsequent convolution operations to 25% of its previous amount. Second, in contrast to the transposed convolution, which increases the resolution of feature maps through padding and costly convolution operations, the sub-pixel unit, which up-samples feature maps by a periodic shuffling of pixel values, is adopted in our network.

Choice of Scales
To select a proper number of scales, the three classes of test images (building, freeway, and airplane) are analyzed. Figure 13 shows the testing results of networks with different numbers of scales. In the single-scale network, the recurrent ConvLSTM module is replaced by a convolutional layer to keep the number of convolutional layers the same.
We can observe that as the number of scales increases, the values of all three evaluation metrics improve, suggesting that exploiting multi-scale information helps improve network performance. However, the improvement is small when the number of scales is greater than three. We therefore choose s = 3 in our network to balance network performance and complexity.

Loss Function
The influence of the loss function on network performance is also discussed in this paper. Instead of the L_1 loss, the L_2 loss function is used to train our MSR-net. The L_2 norm loss, also called the Euclidean loss, is the most commonly used loss function in despeckling tasks. It is defined as:

L_2(Θ) = (1/(W·H)) Σ_{x,y} ‖ϕ(X(x, y); Θ) − C(x, y)‖²_2, (14)

where Θ is the set of filter parameters that are updated during training, C is the ground-truth image without noise, X is the input image with speckle noise, and ϕ(·) is the output after despeckling. The purpose of training the network is to minimize this cost; a smaller loss value indicates a smaller error between the network output and its corresponding ground truth. As shown in Figure 14, the network trained with the L_2 loss obtains a higher PSNR only for the building images, while the network trained with the L_1 loss obtains slightly higher PSNR and SSIM for the other images. For EFKR, the advantage of the L_1 loss over the L_2 loss is significant. Generally speaking, the L_1 loss is more suitable for the SAR despeckling task.
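For comparison with the L_1 loss, a per-image L_2 (Euclidean) loss can be sketched as:

```python
import numpy as np

def l2_loss(output, target):
    """Mean squared error between the despeckled output and the clean target."""
    return np.mean((output - target) ** 2)
```

A single large edge error is penalized quadratically by the L_2 loss but only linearly by the L_1 loss, which is one intuition for why training with L_1 tends to preserve edges better.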

Conclusions
In summary, different from existing despeckling networks, the MSR-net proposed in this paper adopts a coarse-to-fine structure and a convolutional long short-term memory unit to obtain high-quality despeckled SAR images. During this research, we found that the weight-sharing strategy for the convolutional kernels reduces network parameters and training complexity, and that the sub-pixel unit used in this work reduces up-sampling complexity, improves network efficiency, and shortens the runtime of the network compared with the transposed convolutional layer. Meanwhile, the newly designed evaluation metrics EFKR and FPKR are introduced to evaluate the compatibility of despeckling algorithms with optical image processing algorithms. Experimental results show that our MSR-net has excellent despeckling ability and achieves state-of-the-art results for both simulated and real SAR images at low computational cost, especially in low signal-to-noise-ratio cases. The adaptability of optical image processing algorithms to SAR images can be enhanced after despeckling with our network.