Coupling Denoising to Detection for SAR Imagery

Featured Application: The proposed object detection framework aims to improve detection performance for noisy SAR images and is applicable to general object detection in SAR imagery, such as recognition of militarily important targets (e.g., ships and aircraft) or monitoring of abnormal civilian events. Abstract: Detecting objects in synthetic aperture radar (SAR) imagery has received much attention in recent years since SAR can operate in all-weather and day-and-night conditions. With the rapid development of convolutional neural networks (CNNs), many methodologies have been proposed for SAR object detection. Despite this progress, existing detection networks are still limited by the inherently noisy characteristics of SAR imagery; hence, a separate preprocessing step such as denoising (despeckling) is usually required before feeding SAR images to deep learning models. However, an inappropriate denoising technique may cause loss of detailed information, and even a proper denoising method does not always guarantee performance improvement. In this paper, we therefore propose a novel object detection framework that incorporates an unsupervised denoising network into a traditional two-stage detection network and leverages a strategy for fusing region proposals extracted from both the raw SAR image and the synthetically denoised SAR image. Extensive experiments validate the effectiveness of our framework on our own object detection datasets constructed with remote sensing images from the TerraSAR-X and COSMO-SkyMed satellites. The proposed framework outperforms models trained with only noisy SAR images or only denoised SAR images across multiple backbone networks.


Introduction
Synthetic Aperture Radar (SAR) is a type of radar system used to reconstruct 2D or 3D terrain and objects on the ground (or over oceans). The SAR system synthesizes a long virtual aperture through the coherent combination of signals received from objects. The radar transmits pulses of microwave radiation, and the synthesized aperture narrows the effective beam width in the azimuth direction, thus achieving high resolution. By combining the return signals collected by the on-board radar antenna, SAR overcomes the main limitation of traditional systems, namely that azimuth resolution is determined by the physical antenna size. Optical and infrared sensors are passive since they detect objects by reflected light and emitted signals, respectively, while radars actively transmit and receive radar waves, operating in all-weather and day-and-night conditions.
Thanks to these characteristics, available under all-weather conditions and during night-time, SAR images are widely applied to military reconnaissance, as many military operations take place at night or in poor weather. Applications include information and electronic warfare, target recognition of aircraft that maneuver irregularly, battlefield situational awareness, and the development of aircraft that are hard for an adversary to track with radar. In addition, it is also necessary to study object detection using radar imagery for civilian applications (e.g., resource exploration, environmental monitoring, etc.).
SAR images are formed from a coherent sum of backscattered signal components at the boundaries of different media after pulsed transmissions of microwave radiation, enabling observation of the interior of targets otherwise invisible to the naked eye. However, when the emitted pulses are reflected from the boundary of a target with an uneven surface, scattering and interference waves are created. These wave signals directly affect how SAR images the structure of the target, acting as noise components. The resulting noise is often called speckle noise; it obscures the original image information and produces a speckle-corrupted SAR image, as shown in Figure 1. The scattering characterization of the target worsens with changes in radial properties and orbital surfaces, leading to degradation of recognition performance. It is worth noting that a number of published studies have addressed denoising (or despeckling) SAR images [20][21][22][23][24][25]. Many previous works first perform despeckling on SAR images as a preprocessing step and then utilize the images for downstream deep learning tasks, e.g., classification [26,27] and detection [28][29][30]. Processing the large volume of SAR images separately results in high time consumption and low efficiency. Although various despeckling methods have been proposed, such as the Lee filter [22], Kuan filter [23], Frost filter [24], and Probabilistic Patch-Based (PPB) filter [25], choosing an improper despeckling methodology without carefully considering the dataset characteristics may degrade performance due to information loss from the raw SAR images. Meanwhile, to further improve the visual quality of SAR images, there are other preprocessing methods such as contrast enhancement.
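To make the behavior of such filters concrete, below is a minimal sketch of the classic Lee filter [22], assuming the speckle coefficient of variation cu is known (for L-look intensity imagery, cu is roughly 1/sqrt(L)); this is an illustrative simplification, not the exact variant used in the cited works. In homogeneous regions the adaptive gain falls toward zero and the pixel is replaced by the local mean, while near strong scatterers the gain approaches one and detail is preserved, which is precisely the trade-off between smoothing and information loss discussed above.

```python
import numpy as np

def lee_filter(img, win=5, cu=0.25):
    """Classic Lee despeckling filter (illustrative sketch).

    cu is the speckle coefficient of variation, assumed known
    (about 1/sqrt(L) for L-look intensity SAR)."""
    pad = win // 2
    padded = np.pad(img.astype(np.float64), pad, mode="reflect")
    out = np.empty_like(img, dtype=np.float64)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            w = padded[i:i + win, j:j + win]
            mean, var = w.mean(), w.var()
            ci2 = var / (mean * mean + 1e-12)                # local squared CV
            k = max(0.0, 1.0 - (cu * cu) / (ci2 + 1e-12))    # adaptive gain in [0, 1]
            out[i, j] = mean + k * (img[i, j] - mean)        # blend pixel with local mean
    return out
```

On a homogeneous noisy region, the output variance drops sharply because the gain k stays near zero and each pixel collapses to its local mean.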
Given that most SAR images are grayscale, various processing methods can also be considered, for example, fuzzy-based gray-level image contrast enhancement [31] or fuzzy-based image processing algorithms [32].
To overcome this issue and directly promote object detection performance, it is significant and necessary to develop an object detection framework that incorporates a deep denoiser, replacing the separate denoising preprocessing step, into a classical object detection network. The motivation shares a similar spirit with the recent classification work by Wang et al. [33], who learn a noise matrix from an input noisy image and use it to synthesize a despeckled image that is then fed into a subsequent classification network. To the best of our knowledge, we are the first to connect a denoising network to an object detection network. Rather than simply ending with the coupling structure as in Wang et al. [33], we additionally introduce a region proposal fusion approach that merges the sets of Regions of Interest (RoIs) from both the noisy and the denoised image.
We propose a novel object detection framework whose core idea comprises two parts: (1) connecting an unsupervised denoising network to an object detection network to dynamically extract a denoised SAR image from a given noisy SAR image, and (2) forwarding the image pair (the given real SAR image and the synthetically denoised SAR image) to an object detection network and fusing the region proposals from the two images to complementarily integrate regional information.
Here, fusing region proposals refers to merging the two sets of RoIs yielded by a shared region proposal network within the object detection network. This is inspired by the observation that utilizing only the real SAR image may produce false positives due to the inherent speckle noise of the image, while, on the contrary, depending only on the denoised SAR image may cause missing targets, because inadequate denoising leads to loss of fine information from the raw data.
The rest of this paper is organized as follows. Section 2 mainly consists of two parts: the first introduces our datasets constructed with SAR images from the TerraSAR-X and COSMO-SkyMed satellites, and the second describes the detailed design of our proposed object detection framework, i.e., how to incorporate an unsupervised denoising network into an object detection network and fuse the region proposals within the detection network. Section 3 reports comparative experimental results for the proposed object detection network on our own datasets. To validate the effectiveness of our approach, we carry out multiple experiments: (1) we experimentally demonstrate that our coupling structure between the denoising and detection networks strengthens detection performance; (2) we further verify the proposed region proposal fusing strategy, in terms of the input data for the detection network and the fusing method, through ablation studies; and (3) we perform comparative experiments on the choice of the feature map extracted from either the real or the synthetic SAR image, where the feature map refers to the output of the CNN backbone in the detection network. Section 4 presents a discussion of the experimental results together with an additional time complexity analysis. Finally, Section 5 concludes the paper with final remarks.

Materials and Methods
In this section, we describe SAR remote sensing datasets that we constructed and the proposed object detection framework which fuses region proposals utilizing denoised SAR image. The remote sensing datasets include not only SAR imagery but also corresponding labeled objects. We develop our object detection framework with the datasets and detail the proposed framework in the rest of this section.

Description
We constructed our datasets with 60 TerraSAR-X images from the German Aerospace Center [34] and 55 COSMO-SkyMed images from the Italian Space Agency [35], mainly covering harbor- and airport-peripheral areas. The TerraSAR-X images have resolutions from 0.6 m to 1 m and sizes ranging from about 6 k × 2 k to 11 k × 6 k pixels (sorted by area). The COSMO-SkyMed images have a resolution of 1 m and sizes ranging from about 13 k × 14 k to 20 k × 14 k pixels (sorted by area). Each remote sensing image is labeled by experts in aerial image interpretation with multiple categories: airplane (A), etcetera (E), and ship (S). The ship/airplane classes contain a variety of civil and military ships/airplanes, while the etcetera class includes support vehicles, air defense weapons, and air defense vehicles. Some example ship/airplane objects are shown in Figures 2 and 3 for TerraSAR-X and COSMO-SkyMed imagery, respectively. Our labeled objects include a total of 15.7 k instances over 3 categories: 3.7 k instances for the A class, 0.2 k for the E class, and 11.8 k for the S class, which implies that our datasets are quite imbalanced between the categories and relatively skewed toward the S class. The class distribution by satellite imagery type is plotted in Figure 4. Furthermore, target objects in our dataset exist at a variety of scales due to our multiresolution images and the variety of shapes, especially for ship objects. We measure the bounding box size of objects as w_bbox × h_bbox and present the frequency of boxes by size as a histogram in Figure 5, where w_bbox and h_bbox are the width and height of the bounding box, respectively. Table 1 summarizes detailed comparisons between our constructed dataset and other publicly available SAR detection datasets, i.e., AIR-SARShip-1.0 [16], SSDD [18], SAR-Ship-Dataset [17], and HRSID [19].
SAR-Ship-Dataset contains the largest number of instances, followed by our own dataset. The primary differentiators of our dataset compared with the others are (1) class diversity, covering ships, aircraft, and the etcetera class, and (2) the number of scene areas. We obtained the SAR images from a variety of harbor- and airport-peripheral areas around the world and annotated objects of different shapes.

Proposed Methodology
Given the inherent speckle noise of SAR, researchers have previously performed a preprocessing step such as despeckling before training an object detection model. However, such prior preprocessing, being independent of object detection performance, may not only be inefficient but also weaken detection performance, because unintentionally improper denoising induces loss of detailed information. Therefore, we integrate a denoising network with a two-stage detection network so that the denoising network can directly receive feedback from the detection network, as illustrated in Figure 6.
We choose a blind-spot neural network [36] based self-supervised scheme as the unsupervised denoising model and adopt Gamma noise modeling, as in Speckle2Void [37], fitted to SAR speckle; however, our framework is not limited to this model structure. We train the unsupervised denoising model as a generator G that maps a real (noisy) SAR image I_real to a synthetic (denoised) SAR image G(I_real). The core idea of our model is to infer a synthetic denoised SAR image from the input SAR image and merge the two sets of extracted RoIs to improve detection performance. Without the help of related materials such as a corresponding denoised image for an input SAR image, we can autonomously simulate the denoised image and fuse the inferred information such as RoIs. The entire model enables effective end-to-end learning.
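The Gamma noise model assumed here treats speckle as multiplicative: I_noisy = I_clean · n with n ~ Gamma(L, 1/L) for L-look imagery, so E[n] = 1 and the mean intensity is preserved. A minimal NumPy sketch of this noise model (the function name is hypothetical, for illustration only):

```python
import numpy as np

def add_speckle(intensity, looks=1, seed=None):
    """Multiplicative Gamma speckle model: I_noisy = I_clean * n,
    where n ~ Gamma(shape=L, scale=1/L), so E[n] = 1 and Var[n] = 1/L."""
    rng = np.random.default_rng(seed)
    n = rng.gamma(shape=looks, scale=1.0 / looks, size=intensity.shape)
    return intensity * n
```

Increasing the number of looks L reduces the speckle variance (Var[n] = 1/L), which mirrors the multi-looking procedure used when producing the MGD images described later.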
The unsupervised denoising network G in our model first takes a real (noisy) SAR image I_real as input and outputs a synthetic (denoised) SAR image G(I_real). The formed (real, synthetic) image pair (I_real, G(I_real)) is then fed into a shared region proposal network, which outputs two corresponding feature maps and two sets of RoIs. The two sets of RoIs, B_real and B_synth, are merged, and redundant bounding boxes are subsequently removed by a non-maximum suppression (NMS) procedure, i.e., B_final = NMS(B_real ∪ B_synth), where B_final is the resulting set of fused bounding boxes. For each RoI in B_final on the feature map from the real SAR image, the RoI feature vector is then forwarded to obtain the classification and regression results, as in a traditional two-stage detection network.
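The fusion step B_final = NMS(B_real ∪ B_synth) can be sketched as follows, using a simplified greedy IoU-based NMS over [x1, y1, x2, y2] boxes (a standalone NumPy illustration; in the actual framework this operates on RPN outputs inside the detector):

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.7):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]        # drop boxes overlapping the kept one
    return keep

def fuse_proposals(b_real, s_real, b_synth, s_synth, iou_thr=0.7):
    """B_final = NMS(B_real ∪ B_synth): concatenate, then suppress duplicates."""
    boxes = np.concatenate([b_real, b_synth])
    scores = np.concatenate([s_real, s_synth])
    keep = nms(boxes, scores, iou_thr)
    return boxes[keep], scores[keep]
```

A proposal found in both the real and synthetic image collapses to its higher-scored copy, while a proposal found in only one image survives, which is exactly the complementary behavior the fusion is designed for.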
Usually, only a single SAR image, either real or denoised (preprocessed), is employed for training an object detection network, as shown in Figure 7. Since real SAR images are inherently corrupted by speckle noise without any preprocessing, relying solely on the real SAR image for training may cause false alarms among the region proposals. On the other hand, utilizing denoised SAR images alone may be prone to missing targets because of detailed information loss. We therefore devise a novel denoising-based object detection network to make full use of the complementary advantages of the real and denoised SAR images. To combine the information extracted from both real and synthetic SAR images, we fuse region proposals by merging the two sets of RoIs yielded by a region proposal network. Considering that there exist qualitative differences between the two sets of RoIs derived from the real and synthetic SAR images, the real and synthetic SAR images are separately processed by the region proposal network. After fusing the region proposals, we take the feature map from the real SAR image to preserve the global context information of the raw input SAR image.
The proposed architecture is trained end-to-end with a multi-task loss that mainly consists of (1) an unsupervised denoising loss, (2) a region proposal loss, and (3) an RoI loss for classification and bounding-box regression. In particular, the region proposal network is trained on both the real and the synthetic SAR image, and thus two distinct region proposal losses are defined. The final loss function is a weighted summation of all losses:

L_total = λ_1 L_denoise + λ_2 (L_rpn^real + L_rpn^synth) + λ_3 L_roi,

where L_denoise is the unsupervised denoising loss, L_rpn^real and L_rpn^synth are the region proposal losses for the real and synthetic SAR images, L_roi is the RoI classification and regression loss, and λ_1, λ_2, λ_3 are balancing weights.
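Since the final objective is described as a weighted summation of the three loss groups, the combination can be sketched as below; the weight names and their default value of 1.0 are assumptions for illustration, as the text does not report the actual balancing weights.

```python
def total_loss(l_denoise, l_rpn_real, l_rpn_synth, l_roi,
               lam_denoise=1.0, lam_rpn=1.0, lam_roi=1.0):
    """Weighted multi-task objective: unsupervised denoising loss,
    two region proposal losses (real and synthetic image), and RoI loss.
    With framework tensors as inputs, this sum stays differentiable,
    so the denoiser receives gradient feedback from the detector."""
    return (lam_denoise * l_denoise
            + lam_rpn * (l_rpn_real + l_rpn_synth)
            + lam_roi * l_roi)
```

Because all terms are summed into one scalar, a single backward pass propagates detection gradients into the denoising network, which is what enables the end-to-end coupling.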

Results
We first present the description of our experimental dataset settings in Section 3.1. Section 3.2 presents the details of our model architecture and the hyperparameter settings. Based on this implementation, we conduct extensive experiments to validate the contributions of the proposed model and Sections 3.3 and 3.4 contain the experimental results. Section 3.5 provides comprehensive ablation studies.

Dataset Settings
We acquired 60 TerraSAR-X raw scenes from the German Aerospace Center [34] and 55 COSMO-SkyMed raw scenes from the Italian Space Agency [35]. The raw scenes go through multiple stages, such as preprocessing, Doppler centroid estimation (DCE), and focusing, to obtain single look slant range complex (SSC) images. The SSC images are then converted to multi-look ground range detected (MGD) images by multi-looking procedures. From the MGD images, we create patches of size 800 × 800 via a sliding-window operation, with each patch containing at least one target object belonging to the airplane (A), etcetera (E), or ship (S) category. Finally, we randomly split the patches into 80% for training and 20% for testing.
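The patch-extraction step can be sketched as follows. The stride, the corner-coordinate box format, and the "box fully inside the window" criterion are assumptions made for illustration; the text specifies only the 800 × 800 patch size and that each kept patch contains at least one object.

```python
import numpy as np

def make_patches(image, boxes, patch=800, stride=600):
    """Slide a patch x patch window over the scene and keep windows
    that fully contain at least one object.
    boxes: iterable of (x1, y1, x2, y2) in full-scene coordinates (assumed format).
    Returns a list of (y0, x0, patch_array, boxes shifted to patch coords)."""
    h, w = image.shape[:2]
    out = []
    for y0 in range(0, max(h - patch, 0) + 1, stride):
        for x0 in range(0, max(w - patch, 0) + 1, stride):
            inside = [(x1 - x0, y1 - y0, x2 - x0, y2 - y0)
                      for x1, y1, x2, y2 in boxes
                      if x1 >= x0 and y1 >= y0
                      and x2 <= x0 + patch and y2 <= y0 + patch]
            if inside:  # keep only patches with at least one target
                out.append((y0, x0, image[y0:y0 + patch, x0:x0 + patch], inside))
    return out
```

An overlapping stride (here 600 for 800-pixel patches) is a common choice so that objects near window borders appear fully inside at least one patch.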

Implementation Details
We implemented our unsupervised denoising model following self-Poisson Gaussian [38], but adopted Gamma noise modeling as in Speckle2Void [37] to characterize the SAR speckle. Our detection framework implementation was based on the MMDetection toolbox [39], which is developed in PyTorch [40]. The stochastic gradient descent (SGD) optimizer [41,42] was used for optimization. We trained for a total of 24 epochs with an initial learning rate of 0.0025, momentum of 0.9, and weight decay of 0.0001. We experimented with ResNet-50-FPN and ResNet-101-FPN backbones [43,44]. All evaluations were carried out on a TITAN Xp GPU with 12 GB of memory. Figure 8 shows paired examples of real SAR images and the corresponding synthetically denoised SAR images, where the denoised images are intermediate results in our model. After the denoising stage, the general speckle noise is drastically reduced; however, there inevitably exists a trade-off between the noise level and image clarity. In particular, many buoys that typically look like actual ships appear in the first example of Figure 8; in the denoised SAR image, the brightness of the buoys fades relatively, and the visual difference from the surrounding ships becomes clear. In addition, scattering waves around target objects, one of the factors hindering accurate localization, are blurred after denoising. The denoising within our network confirms these positive effects.
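For reference, the reported optimizer and schedule settings correspond to a config fragment like the following, written in MMDetection 2.x config style. The learning-rate decay steps at epochs 16 and 22 follow the common schedule for 24-epoch training and are an assumption; the text reports only the learning rate, momentum, weight decay, and epoch count.

```python
# Sketch of the optimizer/schedule portion of an MMDetection 2.x config,
# matching the hyperparameters reported in the text.
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
# Decay steps are assumed, not reported in the text.
lr_config = dict(policy='step', step=[16, 22])
runner = dict(type='EpochBasedRunner', max_epochs=24)
```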

Qualitative Evaluation
Some image triples of ground truth, baseline detection, and our detection visualizations are presented in Figure 9. We train the baseline detection model with non-preprocessed, raw noisy SAR images. For a fair comparison, both the baseline and our detection model adopt Faster R-CNN with a ResNet-101-FPN [43,44] backbone architecture. The detection results show that our model localizes most objects accurately with higher confidence scores and produces fewer false alarms than the baseline detection model on the given patch images. Although the progress made by our detection models is encouraging, there is still room for further improvement, given the few remaining false alarms and missing targets.

Quantitative Evaluation
To quantitatively evaluate detection performance, we calculate the mean average precision (mAP). The mAP metric is widely used as a standard measure of object detection performance and is computed as the average of AP over all categories. Here, AP is the average precision over the interval from recall = 0 to recall = 1. Precision is the fraction of detections that are true positives, while recall is the fraction of positives that are correctly identified. Hence, the higher the mAP, the better the performance.
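As a concrete reference, AP for one class can be computed from ranked detections as the area under the precision-recall curve. The sketch below uses all-point interpolation and assumes each detection has already been matched against ground truth (the IoU matching step is omitted for brevity):

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP for one class: area under the precision-recall curve.
    scores: detector confidences; is_tp: 1 if the detection matched a
    ground-truth box, else 0; num_gt: number of ground-truth objects."""
    order = np.argsort(scores)[::-1]          # rank detections by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)
    # integrate p(r) dr with precision made monotonically decreasing
    mrec = np.concatenate([[0.0], recall, [1.0]])
    mpre = np.concatenate([[0.0], precision, [0.0]])
    for i in range(mpre.size - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```

mAP is then the mean of this value over the A, E, and S classes.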
As shown in Table 2, we compare the proposed network with the traditional two-stage detection model under two different backbones, ResNet-50-FPN and ResNet-101-FPN [43,44]. By varying the despeckling approach, we set several baseline models following previous work: (1) inputting non-preprocessed real SAR images, and (2) feeding denoised SAR images into the traditional two-stage detection model after denoising via representative techniques, namely the Lee filter [22] or the PPB filter [25]. We observe that the despeckling effect of the Lee filter is milder than that of the PPB filter. The PPB filter reduces more speckle noise, but much detailed information is visually concealed. This is consistent with our experimental results, in which the baseline model with the PPB filter performs slightly worse than the baseline model with the Lee filter. On the other hand, our detection network provides significant performance gains under all backbone architectures. Based on the test results, we attribute this to the suppression of many false positive detections caused by the speckle noise in real SAR images.

Figure 9. Image triples in which the left image is the ground truth, the middle image is from a baseline model (a traditional two-stage detection model with real SAR images), and the right image is from our model. Ground truth and predicted bounding boxes are plotted in blue for the A class, yellow for the E class, and pink for the S class. The numbers on the bounding boxes in the middle and right images denote the confidence score for each corresponding category. We visualize all detected bounding boxes after NMS, thresholding detector confidence at 0.05.

Ablation Study
We conduct an ablation study to structurally verify the proposed region proposal fusing strategy. We first consider the case without any fusing after denoising the input noisy SAR image, which corresponds to the first experiment in Table 3. By comparing against inputting only the denoised SAR image to the detection network, we can identify whether using the real SAR image as another input to the detection network is important. This case shows the poorest detection performance and justifies the importance of fusing information from the raw noisy SAR images. Secondly, for the choice of feature map after fusing, we experiment with the feature map from the denoised SAR image versus the feature map from the real SAR image. As a result, keeping the feature map from the real SAR image, as proposed, is found to be considerably better.

Discussion
Our proposed detection framework clearly achieves better performance by combining a denoising network with an existing detection network; however, the additional parameters and more complex structure demand larger memory for model storage and higher computing cost. We report average inference times (measured in seconds per patch image on a TITAN Xp GPU) for the time complexity analysis, as presented in Table 4. Compared with an existing two-stage object detection network such as Faster R-CNN [45] in the first row of Table 4, our detection framework additionally requires denoising time and time for fusing region proposals during inference. The denoising time makes up a large portion of the added running time, so the most promising way to reduce the average inference time would be to adopt a relatively light denoising network.

Table 4. Comparison of running times for the time complexity analysis. We evaluated the running times on a patch image sized 800 × 800 with a TITAN Xp GPU.

Conclusions
In this study, we developed a novel object detection framework in which an unsupervised denoising network is combined with a two-stage detection network, and two sets of region proposals, extracted from a real noisy SAR image and a synthetically denoised SAR image, are complementarily merged. The coupling of the denoising and detection networks is intended to replace a cumbersome denoising preprocessing step, while, at the same time, the integrated denoising network performs denoising in support of the subsequent object detection. To remedy the potential risk of fine information loss after denoising, we keep the raw information from the input SAR image within the detection network while utilizing only the set of region proposals inferred from the synthetically denoised SAR image. Extensive qualitative and quantitative experiments on our own datasets, constructed from TerraSAR-X and COSMO-SkyMed satellite images, show that the proposed framework's adaptive denoising directly improves detection performance, yielding significant improvements over several detection baselines.