Ship Classification and Detection Based on CNN Using GF-3 SAR Images

Abstract: Ocean surveillance via high-resolution Synthetic Aperture Radar (SAR) imagery has been a hot topic because SAR can work in all-day and all-weather conditions. The launch of the Chinese Gaofen-3 (GF-3) satellite has provided a large number of SAR images, making marine target monitoring possible. However, it is difficult for traditional methods to extract effective features to classify and detect different types of marine targets in SAR images. This paper proposes a convolutional neural network (CNN) model for marine target classification at patch level and an overall scheme for marine target detection in large-scale SAR images. First, eight types of marine targets in GF-3 SAR images are labelled based on feature analysis, building the datasets for further experiments. For the classification task at patch level, a novel CNN model with six convolutional layers, three pooling layers, and two fully connected layers is designed. For the detection task, a Single Shot Multi-box Detector with a multi-resolution input (MR-SSD) is developed, which can extract more features from versions of the image at different resolutions. To detect different targets in large-scale SAR images, a whole workflow including sea-land segmentation, cropping with overlapping, detection with the MR-SSD model, coordinate mapping, and predicted box consolidation is developed. Experiments based on the GF-3 dataset demonstrate the merits of the proposed methods for marine target classification and detection.


Introduction
With continuous development of Synthetic Aperture Radar (SAR) technology, an increasing number of very high resolution (VHR) SAR images have been obtained, providing a new way to strengthen marine monitoring. Different from optical sensors, SAR is capable of working in all-day and all-weather conditions, and it is receiving more and more attention. However, it is very time-consuming to interpret SAR images manually because of speckle noise, false targets, etc. With increasing demand for ocean surveillance in shipping and military sectors, marine target classification and detection has been an important research area in remote sensing with great application prospects. In this work, we will focus on marine target classification on patch level and marine target detection in large-scale SAR images.
Earlier studies on marine target classification were carried out on simulated SAR images due to a lack of real image samples [1]. In recent years, with the deployment of several spaceborne SAR satellites such as TerraSAR-X, RadarSat-2, and GF-3, a wide variety of SAR images with different
In recent years, region-based CNN networks such as Faster-RCNN [29], YOLO [30], and SSD [31], which not only generate the coordinates but also predict the labels of the targets, have shown great success on the PASCAL VOC dataset [32]. Faster-RCNN uses a deep convolutional network to extract features and then proposes candidates with different sizes using the Region Proposal Network (RPN) on the last feature map. The candidate regions are normalized through an RoI Pooling layer before they are fed into fully connected layers for classification and coordinate regression. This algorithm can detect objects accurately but cannot achieve real-time detection. YOLO processes images faster but with lower accuracy than Faster-RCNN. An end-to-end model called SSD was proposed in Reference [31], which can detect targets in real time with high accuracy. It generates region proposals on several feature maps of different scales, whereas Faster-RCNN proposes region candidates with different sizes only on the last feature map provided by the deep convolutional network.
CNN-based methods have been used for target detection in SAR images, e.g., ship detection [24] and land target detection [33], and have shown better performance than the traditional methods. One method splits the images into small patches and then uses a pre-trained CNN model to classify the patches, after which the classification results are mapped onto the original images [34]. However, this method has a low target location precision because it does not take the edges of the targets into consideration. Some other works apply region-based CNN networks to detect ships in SAR images. The study in Reference [25] adopts the structure of Faster-RCNN and fuses the deep semantic and shallow high-resolution features in both the RPN and Region of Interest (RoI) layers, improving the detection performance for small-sized ships. Kang et al. used Faster-RCNN to carry out the detection task and employed the CFAR method to pick up small targets [35]. While these modifications to Faster-RCNN could help detect small ships, they introduce false alarms into the detection results. Furthermore, the researchers in Reference [36] applied SSD algorithms to ship detection in SAR images. Apart from comparing the performance of different SSD models, they introduce almost no changes to the SSD structure to improve the performance in terms of marine target detection. To sum up, the previous studies demonstrate that CNN-based methods can detect marine targets more accurately than the CFAR methods and feature-based methods. However, all the detection methods introduced above focus only on ship detection in SAR images and are unable to detect other marine targets while classifying the targets at the same time. Moreover, the existing CNN-based methods generate some false alarms and miss targets due to the complex sea background. To solve these problems, a multi-resolution SSD model is proposed in this work to detect marine targets with classification.
Overall, this paper builds a novel CNN structure to recognize eight types of marine targets at patch level and proposes an end-to-end algorithm using a modified SSD to realize marine target detection with classification in large-scale SAR images. The main contributions of the work are as follows:

• The Marine Target Classification Dataset (MTCD), which includes eight types of marine targets, and the Marine Target Detection Dataset (MTDD), which contains six kinds of targets, are built on GF-3 SAR images, providing a benchmark for future study. The features of the various targets are analyzed based on their scattering characteristics to generate the ground truths.
• A novel CNN structure with six convolutional layers and three max pooling layers is developed to classify different marine targets in SAR images, whose performance is superior to the existing methods.
• A modified SSD with multi-resolution input is proposed to detect different targets. To the best of our knowledge, this is the first study on the detection of different types of marine targets instead of only ships in SAR images. Then, the framework for detecting marine objects in large-scale SAR images is introduced.
The remainder of this paper is organized as follows: In Section 2, feature analysis for different targets and the proposed methods are introduced. Experimental results for target classification and detection in comparison with different existing methods are provided in Section 3. Section 4 discusses the results of the proposed methods. Finally, Section 5 concludes this paper.

Preprocessing of GF-3 Images
The original GF-3 images used in this paper are large-scale single look complex (SLC) images containing many targets, so they cannot be used directly. In this subsection, a preprocessing method is proposed to extract the target patches automatically and efficiently.
Firstly, the SLC images are transformed into amplitude images using the following formula:

$$O(i, j) = \sqrt{\mathrm{Re}[S(i, j)]^2 + \mathrm{Im}[S(i, j)]^2} \tag{1}$$

where S is the SLC image and O represents the amplitude image. Then, a non-linear normalization is applied to the images using Equation (2), where T is a constant and O(i, j) is the value of the normalized image at (i, j). The constant T works as a threshold and its value depends on the image. Usually we set $T = 10\bar{O}$, where $\bar{O}$ represents the average value of the image.
As it is time-consuming to select the target patches manually, an image segmentation method is proposed here to extract the target patches from the large-scale SAR images. The Otsu method is an effective algorithm for image segmentation, which searches for a threshold that minimizes the intra-class variance [37].
The proposed method uses the Otsu method to binarize the SAR images, after which the target candidates form isolated points in the binary images because their pixel values are higher than the Otsu threshold, while the pixel values of sea clutter are lower than the threshold. Then, the algorithm searches for the isolated points and extracts their coordinates. Finally, fixed-size slices are collected according to the coordinates.
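The candidate-extraction steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`amplitude`, `otsu_threshold`, `candidate_centers`), the 256-bin histogram, and the use of 4-connected flood fill to group bright pixels are all hypothetical choices, and the non-linear normalization of Equation (2) is omitted.

```python
def amplitude(slc):
    """Amplitude image |S| from a complex SLC image given as a list of rows."""
    return [[abs(s) for s in row] for row in slc]


def otsu_threshold(image, bins=256):
    """Threshold maximizing the between-class variance, which is
    equivalent to minimizing the intra-class variance."""
    pixels = [v for row in image for v in row]
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return lo
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in pixels:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    w0 = sum0 = 0
    best_var, best_bin = -1.0, 0
    for i, h in enumerate(hist):
        w0 += h
        sum0 += i * h
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_bin = var_between, i
    # Upper edge of the best bin separates the two classes
    return lo + (best_bin + 1) * width


def candidate_centers(image, threshold):
    """Centers (row, col) of 4-connected groups of pixels >= threshold."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    centers = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] < threshold:
                continue
            stack, blob = [(r, c)], []
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                blob.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and image[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            centers.append((sum(p[0] for p in blob) // len(blob),
                            sum(p[1] for p in blob) // len(blob)))
    return centers
```

A fixed-size slice (for example 128 × 128 pixels) would then be cut around each returned center; for full-size images an array library would be the practical choice.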

Feature Analysis
In this paper, eight types of maritime targets are selected and studied: boat, cargo ship, container ship, tanker ship, cage, iron tower, platform, and windmill. Due to the lack of ground truths for the targets in SAR images, the scattering characteristics of each kind of target are analyzed to obtain the label for each target. Figure 1 presents the eight types of targets in both optical and SAR images. To the best of our knowledge, this is the first such explicit analysis of eight types of marine targets.
Boat: Boats have the simplest and smallest structures among the eight targets. As for boats, the hull edges and the engines at the tail generate strong backscattering, which leads to a closed ellipse in SAR images.
Cargo: Due to the existence of several warehouses, there is a strong secondary reflection on the walls of each warehouse, which is where the rectangular shapes come from in the SAR images.
Container ship: Container ships possess the largest hull. When the ships are fully loaded with containers, the container exteriors would produce strong secondary reflections, resulting in the effect of a washboard in the radar images. In addition, the strong reflections at the tail of the ship come from the complex structure of the ship tower.
Oil tanker: In order to transport oil, an oil pipeline is installed in the middle of the tanker. This causes a bright line in the middle of the tanker in the SAR images. Besides, the closed ellipse in the SAR image comes from the hull edges of the ship.
Cage: Cages used in marine aquaculture are concentrated in square grids in offshore areas. The edges of cages can provide strong backscattering, which forms dotted rectangular distribution in the radar image.

[Figure 1 panel labels: Boat, Cargo, Container, Tanker, Cage, Iron Tower, Platform, Windmill]

Iron tower: When the incident angle is small, the complex structures lead to a strong scattering point. However, when the incident angle is large, the tower target appears as a cone-shaped structure in the radar image. In addition, the transmission lines on the tower produce relatively weak reflection, which appears as a dim strip in the image.
Platform: Many offshore countries have built drilling platforms to exploit oil and gas. Usually, they contain support structures, pipelines, and additional combustion towers. The pipelines and combustion towers of the platform result in bright lines, while the support structures produce massive bright spots in SAR images.
Windmill: The strong scatterings of turbines in windmills result in a bright spot. In addition, as the fans rotate, they also produce a bright line that gradually fades toward both ends.

Marine Target Classification Model Based on CNN
While earlier studies have proposed CNN models with different structures to classify marine targets, they employ overly simple structures or inappropriate layer arrangements when dealing with small datasets, making it hard to extract distinctive features for marine targets.
To solve these problems, we propose a CNN structure with six convolutional layers (Conv.1-Conv.6), three max-pooling layers (Pooling1-Pooling3), and two fully connected layers, which is shown in Figure 2. A pyramid structure is adopted: as the CNN goes deeper, the outputs of each layer are down-sampled by the pooling layers while the number of feature-map channels increases. This kind of structure can extract both low-level and high-level features.
Figure 2. The structure of the proposed classification model.
As the length and width of the marine targets in this study are smaller than 100 pixels, the size of the input patches is set to 128 × 128 pixels to accommodate the objects. At the beginning, 32 convolutional kernels of size 6 × 6 work on the input images to extract features, after which the outputs are down-sampled by max-pooling kernels with a size of 3 × 3. Then, the second convolutional layer filters the outputs of the first pooling layer with 128 kernels of size 5 × 5, and its outputs are down-sampled by the second pooling layer to shrink the feature maps. Next, the network goes deeper with four convolutional layers, each employing 128 convolutional kernels of size 3 × 3, to generate high-level features, which are passed to the third max-pooling layer. Finally, two fully connected layers (FC1 with 1024 output neurons and FC2 with 8 output neurons) take the outputs of the third pooling layer as input and pass the resulting vector to the softmax function to predict the labels of the targets. The strides of all the convolutional layers and all the pooling layers are set to one and two, respectively. Furthermore, Rectified Linear Units (ReLU) are used for every convolutional layer and fully connected layer to prevent vanishing or exploding gradients.
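To make the layer arrangement concrete, the sketch below traces the feature-map size through the stack described above. The text does not state the padding or the rounding rule of the pooling layers, so two assumptions are made here and should be read as such: convolutions use no padding, and pooling rounds up in the way Caffe's pooling layer does. Under different padding the intermediate sizes would differ. The names `LAYERS`, `conv_out`, `pool_out`, and `trace_shapes` are hypothetical.

```python
import math


def conv_out(size, kernel, stride=1, pad=0):
    """Output size of a convolution (floor rounding)."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, kernel, stride=2, pad=0):
    """Output size of a pooling layer with ceil rounding (assumed, Caffe-style)."""
    return math.ceil((size + 2 * pad - kernel) / stride) + 1


# (name, kind, kernel size, output channels) as listed in the text
LAYERS = [
    ("conv1", "conv", 6, 32),
    ("pool1", "pool", 3, None),
    ("conv2", "conv", 5, 128),
    ("pool2", "pool", 3, None),
    ("conv3", "conv", 3, 128),
    ("conv4", "conv", 3, 128),
    ("conv5", "conv", 3, 128),
    ("conv6", "conv", 3, 128),
    ("pool3", "pool", 3, None),
]


def trace_shapes(input_size=128):
    """Return (layer name, spatial size, channels) after each layer."""
    size, channels, shapes = input_size, 1, []
    for name, kind, kernel, out_channels in LAYERS:
        if kind == "conv":
            size, channels = conv_out(size, kernel), out_channels
        else:
            size = pool_out(size, kernel)
        shapes.append((name, size, channels))
    return shapes


for name, size, channels in trace_shapes():
    print(f"{name}: {size} x {size} x {channels}")
```

Under these assumptions the third pooling layer outputs a 10 × 10 × 128 volume, which would be flattened before FC1 (1024 outputs) and FC2 (8 outputs).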
The training objective is to minimize the cross-entropy loss function through forward propagation and error backpropagation. In the loss function, m represents the total number of training examples, and $y^{(i)}$ and $x^{(i)}$ refer to the true label and predicted label of the ith example, respectively. w denotes the trainable parameters, and a regularization term $\lambda \lVert w \rVert^2$ is added to the loss function to prevent overfitting, where λ is the regularization factor.

A Modified SSD Network for Marine Target Detection
Figure 3 shows the proposed Multi-Resolution Single-Shot Multibox Detector (MR-SSD), which has three parts: the first part is multi-resolution image generation, the second part is a standard CNN architecture used for image classification, and the last part is the auxiliary structure containing multi-scale feature maps, convolutional predictors, and default boxes with different aspect ratios. The MR-SSD is capable of extracting features from images of different resolutions at the same time, which helps to increase the detection precision.
The input images of the traditional SSD have three channels (R, G, and B), while the original SAR images have only one channel. Previous practice usually puts the same image into all three channels, which causes redundant computation and ignores the effect of different resolution versions. In this part, a multi-resolution input procedure is designed that places images with different resolutions in different channels to extract more features than the traditional SSD. The size of the input images of the MR-SSD is set to 300 × 300. In the multi-resolution generation part, the images are transformed to the frequency domain using the 2-D Fourier transform, and a low-pass filter is then used to lower the ground resolution while keeping the image size fixed. The filter is defined in terms of $B_a$ and $B_r$, the bandwidths of the M × N image in the azimuth direction and range direction, respectively, and a factor λ that determines the cutoff bandwidth of the filtered images and takes values between 0 and 1. After that, the filtered images in the frequency domain are transformed back to the time domain via the inverse Fourier transform. As a result, the image ground resolution is reduced because of the linear relationship between ground resolution and SAR image bandwidth. For the proposed MR-SSD, M and N are set to 500. Filters with λ = 0.5 and λ = 0.25 are used to reduce the image resolution, and the resulting images are fed to the G channel and B channel, respectively.
The second part of the MR-SSD is a standard CNN architecture, i.e., VGG-16 [38], including five groups of convolutional layers combined with ReLU and pooling layers. Different from VGG-16, the last two fully connected layers are replaced with two convolutional layers to extract features.
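The multi-resolution input generation can be sketched with NumPy as below. This is an illustration under assumptions rather than the authors' filter: the cutoff is expressed as a fraction λ of the full normalized frequency support in each direction (standing in for the azimuth and range bandwidths), the filter is an ideal rectangular low-pass mask, and the unfiltered image is placed in the first (R) channel. The function names are hypothetical.

```python
import numpy as np


def low_pass(image, lam):
    """Ideal low-pass filter keeping the fraction `lam` (0 < lam <= 1)
    of the spectrum support in each direction; image size is unchanged."""
    rows, cols = image.shape
    # Normalized frequencies lie in [-0.5, 0.5); keep |f| <= lam / 2
    fy = np.abs(np.fft.fftfreq(rows))[:, None]
    fx = np.abs(np.fft.fftfreq(cols))[None, :]
    mask = (fy <= lam / 2) & (fx <= lam / 2)
    filtered = np.fft.ifft2(np.fft.fft2(image) * mask)
    return np.abs(filtered)


def multi_resolution_input(image):
    """Stack the original and two reduced-resolution versions as 3 channels.

    Channel order (assumed): original, lam = 0.5, lam = 0.25.
    """
    return np.stack([image, low_pass(image, 0.5), low_pass(image, 0.25)],
                    axis=-1)
```

For example, `multi_resolution_input(patch)` on a 500 × 500 amplitude patch returns a 500 × 500 × 3 array that can then be resized to the 300 × 300 network input.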
The extra feature layers allow detection at multiple scales. In this part, we adopt the corresponding parameters used in SSD [31], which have proved effective in object detection challenges. The extra feature layers generate default boxes with different aspect ratios on each feature map cell, and convolutional filters are then applied to obtain the class scores and offsets of the default boxes. Suppose there are r feature maps in the MR-SSD; the scale of the default boxes on the different feature maps is defined as follows:

$$s_k = s_{\min} + \frac{s_{\max} - s_{\min}}{r - 1}(k - 1), \quad k \in [1, r] \tag{5}$$

where $s_{\min} = 0.2$, $s_{\max} = 0.9$, and $s_k$ is the scale of the kth feature map. The aspect ratios for the default boxes are denoted as $a_r \in \{1, 2, 3, 1/2, 1/3\}$. Then, the width ($w_k^a$) and height ($h_k^a$) can be calculated by:

$$w_k^a = s_k \sqrt{a_r}, \qquad h_k^a = s_k / \sqrt{a_r} \tag{6}$$

For $a_r = 1$, an additional default box with the scale $s'_k = \sqrt{s_k s_{k+1}}$ is added. As a result, 6 default boxes are generated on each feature map cell, and an m × n feature map yields 6 × m × n × (c + 4) outputs, in which c is the number of class categories and 4 corresponds to the four offsets.
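The default-box geometry adopted from SSD [31] can be written out as a short sketch: scales are spaced linearly between s_min and s_max over the r feature maps, each aspect ratio a gives a box of width s·√a and height s/√a, and one extra square box of scale √(s_k·s_{k+1}) is added for aspect ratio 1. The value r = 6 used in the usage note is an assumption for illustration, and the function names are hypothetical.

```python
import math


def default_box_scales(r, s_min=0.2, s_max=0.9):
    """Linearly spaced scales s_1..s_r for r feature maps (r >= 2)."""
    return [s_min + (s_max - s_min) * (k - 1) / (r - 1) for k in range(1, r + 1)]


def default_box_sizes(s_k, s_next, aspect_ratios=(1, 2, 3, 1 / 2, 1 / 3)):
    """(width, height) of the default boxes at one feature-map cell,
    relative to the input image size."""
    sizes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
    extra = math.sqrt(s_k * s_next)  # additional square box for aspect ratio 1
    sizes.append((extra, extra))
    return sizes
```

With `scales = default_box_scales(6)`, calling `default_box_sizes(scales[0], scales[1])` returns six (width, height) pairs for the first feature map, the first of which is the square box of side 0.2.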
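In total, there are 8732 default boxes per class, and non-maximum suppression (NMS) is used to improve the performance of MR-SSD. When the MR-SSD is trained, it is necessary to determine which default boxes correspond to a ground truth box. For every ground truth box, the default boxes whose overlap ratio with it is higher than a threshold (0.5) are selected as matches. We minimize the same loss function as SSD, which is written in Equation (7):

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right) \tag{7}$$

where N is the number of matched default boxes, $L_{conf}(x, c)$ and $L_{loc}(x, l, g)$ are the confidence loss and the localization loss, respectively, and α is the weight of the localization term, which is set to 1 in SSD [31]. The confidence loss is the softmax loss over the multiple class confidences:

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right), \qquad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})} \tag{8}$$

where $x_{ij}^{p}$ is an indicator for matching the ith default box to the jth ground truth box of category p. If the two boxes are matched, the indicator is set to 1; otherwise it is set to 0. $c_i^{p}$ represents the confidence of the ith default box for category p. The localization loss is the Smooth L1 loss between the parameters of the predicted box (l) and the ground truth box (g), defined as:

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(l_i^{m} - \hat{g}_j^{m}\right) \tag{9}$$

The offsets for the center (cx, cy), width (w), and height (h) of the default box (d) are regressed by the following formulas:

$$\hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}}, \quad \hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}}, \quad \hat{g}_j^{w} = \log\frac{g_j^{w}}{d_i^{w}}, \quad \hat{g}_j^{h} = \log\frac{g_j^{h}}{d_i^{h}} \tag{10}$$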
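The matching rule and the localization term can be illustrated with a small sketch that follows the SSD formulation [31]: a default box is matched to a ground truth box when their overlap (intersection over union) exceeds 0.5, the ground truth is encoded as offsets relative to the default box, and the Smooth L1 function is applied to the difference between predicted and encoded offsets. The function names and the box tuple layouts are assumptions made for this example.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def encode_offsets(g, d):
    """Encode ground truth g relative to default box d; both are (cx, cy, w, h)."""
    return ((g[0] - d[0]) / d[2],
            (g[1] - d[1]) / d[3],
            math.log(g[2] / d[2]),
            math.log(g[3] / d[3]))


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5


def localization_loss(pred, g, d):
    """Smooth L1 loss for one matched default box.

    pred: predicted offsets (cx, cy, w, h); g, d: boxes as (cx, cy, w, h).
    """
    target = encode_offsets(g, d)
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


def is_match(default_box, truth_box, threshold=0.5):
    """Matching rule: overlap higher than the threshold (corner-format boxes)."""
    return iou(default_box, truth_box) > threshold
```

In a full training loop these per-box terms would be summed over all matched boxes and divided by their number N, together with the confidence loss.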

The Whole Workflow for Marine Target Detection in Large-scale SAR Images
Target detection in large-scale SAR images (larger than 10,000 × 10,000 pixels) is difficult because some images cover buildings and islands, which lead to false alarms. Moreover, CNN-based methods can only detect targets at patch level because of the fixed input size. If the large-scale images are resized to patch size for target detection, many detail features are lost, making it hard to detect small targets. To solve these problems, this paper proposes a whole workflow consisting of sea-land segmentation, cropping with overlapping, detection with the pre-trained MR-SSD, coordinate mapping, and predicted box consolidation for marine target detection in large-scale SAR images. It is able to rule out false alarms on land, reduce overlapping predicted boxes, and generate accurate coordinates for each marine target. Figure 4 illustrates the whole procedure in detail.
The whole workflow is divided into two processes: a training process and a detection process. In the training process, the patches containing marine targets are extracted from SAR images to build the training set, which is then used to train the MR-SSD model.
In the detection process for large-scale SAR images, the level-set method [39], which has proved effective in image segmentation, is used to remove the land parts and thereby reduce false alarms on land. Because of the high computational complexity of the level-set method, the images are first down-sampled, and the level-set method is then applied to generate the land masks, which are later resized to the original scale by interpolation. After that, the land mask removes all the land objects.
The large-scale images cannot be fed to the MR-SSD model directly, because the model resizes its input to 300 × 300, which makes a large number of small targets hard to detect. To solve this problem, the large-scale images are cropped into overlapping small patches, and the patches are then sent to MR-SSD. The purpose of the overlap is to keep each target intact in at least one patch. Given a large-scale SAR image of size Lw × Lh, the total number of patches is m × n, which is determined by Lw, Lh, the patch width Pw and height Ph, and Overlap, the overlap distance between adjacent patches, which can be adjusted according to the image ground resolution. Then, the patches are resized to 300 × 300 to meet the input requirements of MR-SSD. The pre-trained MR-SSD model extracts deep features of the objects to generate the target labels and coordinates for each patch.
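A minimal sketch of cropping with overlap is given below. The patch-count formula used here, a ceiling of (L − Overlap) / (P − Overlap) in each direction, is an assumption chosen so that the patches cover the whole image; it is not necessarily the exact expression used by the authors. The last patch in each direction is shifted back so that it stays inside the image. The function names are hypothetical, and the defaults (500-pixel patches, 200-pixel overlap) are taken from the case study.

```python
import math


def patch_grid(lw, lh, pw=500, ph=500, overlap=200):
    """Number of patches (m, n) along width and height (assumed formula)."""
    m = max(1, math.ceil((lw - overlap) / (pw - overlap)))
    n = max(1, math.ceil((lh - overlap) / (ph - overlap)))
    return m, n


def patch_origins(lw, lh, pw=500, ph=500, overlap=200):
    """Top-left (x, y) corner of every patch, row by row."""
    m, n = patch_grid(lw, lh, pw, ph, overlap)
    origins = []
    for j in range(n):
        for i in range(m):
            # Clamp so the final patch in each direction ends at the image border
            x0 = min(i * (pw - overlap), max(lw - pw, 0))
            y0 = min(j * (ph - overlap), max(lh - ph, 0))
            origins.append((x0, y0))
    return origins
```

For a 12,000 × 14,000 image this sketch gives a 40 × 46 grid, i.e., 1840 patches, when the first dimension is taken as the width.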
With the preliminary detection results, the coordinates on the small patches are projected onto the large-scale images to obtain the final detection results. For the patch whose index in width is i and whose index in height is j, the coordinates of its kth target can be written as $(x_{i,j}^{(k)}, y_{i,j}^{(k)})$. These coordinates are mapped to $X^{(l)}$ and $Y^{(l)}$, the coordinates of the lth target in the two directions of the large-scale SAR image.
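The coordinate mapping can be sketched as follows, assuming that patch indices are 0-based, that patches are laid out on a regular grid with step (patch size − overlap), and that box coordinates are reported in the 300 × 300 resized patch and must be scaled back to the original patch size before the patch origin is added. These conventions and the function name `map_to_large_image` are assumptions for illustration.

```python
def map_to_large_image(box, i, j, pw=500, ph=500, overlap=200, net_size=300):
    """Map a box from a resized patch to large-image pixel coordinates.

    box: (xmin, ymin, xmax, ymax) in the net_size x net_size patch.
    i, j: 0-based patch indices along width and height.
    """
    x0 = i * (pw - overlap)  # patch origin in the large image
    y0 = j * (ph - overlap)
    xmin, ymin, xmax, ymax = box
    return (x0 + xmin * pw / net_size,
            y0 + ymin * ph / net_size,
            x0 + xmax * pw / net_size,
            y0 + ymax * ph / net_size)
```

For instance, a box (30, 60, 150, 300) detected in the patch with i = 2 and j = 1 maps to (650, 400, 850, 800) in the large image under these assumptions.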
However, cropping the SAR images can split a target into two or more pieces, leading to fragmentary predicted boxes, and the overlapping operation can produce overlapped predicted boxes, as can be seen in Figure 5a. To solve these problems, we consolidate the overlapping and fragmentary predicted boxes by searching the box coordinates for the group of coordinates that forms the largest box. The consolidated box is then taken as the final predicted box, as shown in Figure 5b.
Figure 5. Procedure for consolidating the predicted boxes. (a) predicted boxes before consolidation; and (b) predicted box after consolidation.
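One possible reading of the consolidation step is sketched below: predicted boxes of the same class that overlap or touch are repeatedly merged into their enclosing box until no further merge is possible. The grouping criterion (same label plus rectangle intersection) and the names `overlaps` and `consolidate` are assumptions; the text does not specify these details.

```python
def overlaps(a, b):
    """True if two (xmin, ymin, xmax, ymax) boxes intersect or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def consolidate(boxes):
    """Merge overlapping same-label boxes into their enclosing box.

    boxes: list of (label, xmin, ymin, xmax, ymax) in large-image coordinates.
    """
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                a, b = merged[i], merged[j]
                if a[0] == b[0] and overlaps(a[1:], b[1:]):
                    # Replace the pair by the largest box spanning both
                    merged[i] = (a[0], min(a[1], b[1]), min(a[2], b[2]),
                                 max(a[3], b[3]), max(a[4], b[4]))
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged
```

A caveat of this simple rule is that two distinct, closely moored targets of the same class would also be merged, so a real pipeline would likely need an additional criterion.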

Materials
In this paper, a total of 111 VHR spaceborne SAR images generated by the Chinese GF-3 satellite are used. The satellite carries a C-band radar sensor working in 12 imaging modes with a wide variety of ground resolutions. To perform target classification and detection, two datasets, the Marine Target Classification Dataset (MTCD) and the Marine Target Detection Dataset (MTDD), are built using the preprocessing method given in Section 2.1. In the following, we first present the details of the 111 large-scale SAR SLC images and then describe the compositions of the MTCD and MTDD.

GF-3 SLC Dataset
A total of 111 GF-3 SAR images covering the offshore areas of Eastern Asia, Western Asia, Western Europe, and Northern Africa are selected. All of them are C-band images acquired from December 2016 to May 2018. The images cover four polarization modes (51% HH, 27% HV, 9% VH, and 13% VV), with ground resolutions from 0.5 m to 5 m. Table 1 shows the details of the SAR images used in this paper.

Marine Target Classification Dataset (MTCD)
The MTCD is built on the patches captured from the GF-3 SAR SLC dataset and consists of eight types of maritime targets: boat, cargo, container ship, tanker, tower, platform, cage, and windmill. Each target chip includes only one type of target, and the ground truth is acquired by the feature analysis introduced in Section 2.2. The MTCD contains 2522 training samples and 688 testing samples, whose size is fixed to 128 × 128 pixels. Table 2 lists the numbers of patches per class available for training and testing. In our experiments, the training patches are flipped upside down for data augmentation, which means that a total of 5044 patches are used as the training set.

Marine Target Detection Dataset (MTDD)
The MTDD dataset is built following the PASCAL VOC format [32], containing the slices with corresponding xml files that provide the label as well as the location of each target. In this task, six types of targets, i.e., cargo, container ship, tower, platform, tanker, and windmill, are studied because they are more common and valuable than the other targets such as boat and cage. The slices, which may contain more than one target, are set to 500 × 500 pixels. Table 3 presents the composition of the MTDD, which includes 1727 patches in total.

Classification Results of MT-CNN
The experiments are performed with the Caffe [40] framework on an Ubuntu 16.04 system, using an NVIDIA GeForce GTX 1060 with Max-Q Design GPU for acceleration. The training and testing batch sizes are set to 48 and 24, respectively. The network is trained for 60,000 iterations using stochastic gradient descent (SGD) with an initial learning rate of 0.002 and a momentum of 0.9. The training process takes 1451.22 s, while the testing process takes 153.48 s. Table 4 gives the confusion matrix of the classification results on the test dataset consisting of eight marine classes. Each row in the table denotes the actual target class, while each column represents the class predicted by MT-CNN. The overall accuracy (OA) reaches 95.20%. Owing to the distinctive features of cage and tower, their accuracies reach 100%. However, tanker has the lowest accuracy (88.16%) among the eight classes. In low-resolution images, the bright lines caused by the pipelines in tankers merge with the reflections of the hulls, making it hard to discriminate the tankers from other kinds of targets. Interestingly, two cargos are predicted as tankers, while six tankers are predicted as cargos, implying that the two classes share similar features.

Effectiveness of MT-CNN
This subsection compares the proposed MT-CNN with previous methods, including CNN-based methods and traditional machine learning methods such as SVM and KNN. For the CNN-based methods, three typical CNN networks, i.e., CNN-CB [19], ConvNet [13], and CNN-ML [18], are selected. The CNN-CB is simply constructed with two convolutional layers, two max-pooling layers, and fully connected layers, while the ConvNet is unique for its lack of fully connected layers. Furthermore, the CNN-ML, which uses a multi-look input, has proved to be effective. In addition, we compute the Gist feature of each slice following the procedure in Reference [41] and then train the SVM (RBF kernel, gamma = 0.5, C = 50) and KNN to classify them. In this subsection, the KNN algorithm employs KD trees, and the number of neighbors and the leaf size are set to five and 30, respectively. Table 5 lists the classification accuracies of the different methods for the eight categories. The proposed method outperforms the other methods in every category except platform, with the average accuracy reaching 95.20%. CNN-CB and ConvNet classify the targets with overall accuracies of only 80.96% and 82.27%, respectively, because they have too few convolutional layers and convolutional kernels to extract high-level features. While the CNN-ML classifies platform more accurately than the proposed MT-CNN does, its performance on the other categories is poorer than that of MT-CNN. For the traditional machine learning methods such as SVM and KNN, the accuracy varies a lot across classes. They are good at classifying targets with distinct characteristics, i.e., boat and windmill, while their performance on the other classes is much poorer. Overall, the performance of the proposed method is superior to that of the other methods, and its predicted results are more reliable.

Effectiveness of MR-SSD
In this section, the detection experiments are carried out with the Caffe [40] framework on an Ubuntu 16.04 system. MR-SSD is trained on the MTDD training set, which includes six types of marine targets, i.e., cargo, container ship, tanker, tower, platform, and windmill. The MR-SSD network is trained with a learning rate of 0.0001 and a weight decay parameter of 0.005 for 160,000 iterations. After that, the trained model is used to detect the marine targets in the testing set. In addition, the confidence threshold is set to 0.5.
The detection results of MR-SSD on testing samples with different backgrounds are shown in Figure 6. In Figure 6a-g, the objects surrounded by sea clutter are detected with accurate coordinates. In Figure 6h, the proposed model recognizes all three tankers despite the distractions from the small ships and ambiguities. Moreover, the defocused container ship in Figure 6i is detected, which demonstrates the robustness of the proposed method. In Figure 6i,j, which cover offshore areas, the trained model is capable of extracting the target coordinates and predicting the labels precisely. As in the PASCAL VOC challenges, this paper uses the average precision (AP), which is the average of the maximum precisions at different recall values, to assess the performance. Recall, precision, and F1 score are defined as follows:

$$\mathrm{Recall} = \frac{T_d}{T_g} \tag{15}$$

$$\mathrm{Precision} = \frac{T_d}{T_d + T_f} \tag{16}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{17}$$

where $T_d$ denotes the number of correctly detected targets, $T_g$ represents the number of ground truths, and $T_f$ indicates the number of false alarms. F1 is the harmonic mean of precision and recall. Besides, the mean Average Precision (mAP) is used to assess the model's ability to detect all types of targets.
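The three scores can be computed directly from the counts named in the text, taking recall as the detected fraction of the ground truths, precision as the fraction of detections that are correct, and F1 as their harmonic mean. The sketch below uses these standard definitions; the function name and the zero-division handling are assumptions, and the numbers in the usage note are made up for illustration and are not results from the paper.

```python
def detection_scores(t_d, t_g, t_f):
    """Recall, precision, and F1 from detection counts.

    t_d: correctly detected targets, t_g: ground truths, t_f: false alarms.
    """
    recall = t_d / t_g if t_g else 0.0
    precision = t_d / (t_d + t_f) if (t_d + t_f) else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1
```

For example, with 90 correct detections, 100 ground truths, and 10 false alarms (hypothetical values), recall and precision are both 0.9, and so is F1.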
To prove the advantages of the proposed MR-SSD model, existing algorithms (i.e., Faster-RCNN [29] and SSD [31]) are selected for contrast experiments. Table 6 compares the AP and mAP of different methods. It can be seen that the proposed method achieves 87.38% mAP, which is 5.29% and 1.76% higher than Faster-RCNN and SSD, respectively.
It is evident that MR-SSD has the best AP for every individual category on MTDD. The proposed MR-SSD improves the accuracy for tower significantly, surpassing SSD by 5.52% AP, while the improvements for the other classes are slight, at less than 2% AP. Though Faster-RCNN can detect cargo, platform, and tanker with more than 85% AP, its performance on container, tower, and windmill is much worse than that of MR-SSD. The experiments demonstrate that the proposed method can extract more features and detect targets more precisely than the traditional SSD.

Detection Results of Large-Scale SAR Images - Case Study
A single large-scale SAR image can hardly contain all kinds of marine targets because different targets are found in different locations, e.g., windmills are mainly located in the open sea while a large number of cargos stay in offshore areas. To demonstrate the performance of the proposed method, some types of targets (windmills, platforms, and towers) are transplanted into a large-scale SAR image (12,000 × 14,000 pixels) covering Weihai City, Shandong Province, China. The imaging mode is Ultra Fine Strip (UFS), the polarization mode is HH, and the ground resolution is 1.7 m. The image and the numbers of the ground truths are given in Figure 7a and Table 7, respectively.

Sea-Land Segmentation
In this subsection, the level-set method [39] is employed for sea-land segmentation. The down-sampling rate is set to 10 to accelerate computing, and the segmentation contour is iterated 10 times. It takes 43.21 s to generate the land mask, and the segmentation results are shown in Figure 7.
It is evident that the land mask wipes out all the land areas precisely, while all of the marine targets remain in the images, which contributes to reducing false alarms and increasing detection precision.

Detection Results of the Whole Workflow
After the land is removed, the image is cropped into 500 × 500 sub-images with an overlap of 200 pixels. The 500 × 500 sub-images are then resized to 300 × 300 to match the input size of MR-SSD and passed through the detector. After that, the predicted coordinates in the sub-images are mapped back onto the large-scale image. Figure 8 shows the detection results of the proposed method. It can be seen that most of the six types of targets are detected with accurate coordinates. In the large-scale image, a windmill, a tanker, and a platform are missed. The missed windmill and tanker have weak intensity, while the missed platform is overlapped by another platform, which diminishes the performance of MR-SSD. Besides, two tankers are misrecognized as container ships because they share similar features: large hulls and multiple components leading to strong reflections. In practice, as the land mask can hardly rule out small reefs near the coastline, some small reefs are passed to the MR-SSD. As a result, five reefs are recognized as cargos.
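The cropping and coordinate-mapping steps can be illustrated with a minimal sketch. The 500-pixel tile, 200-pixel overlap, and 300-pixel network input come from the text; the function names, the handling of the last tile at the image border, and the (xmin, ymin, xmax, ymax) box layout are assumptions made for illustration, not the authors' implementation.

```python
def tile_origins(length, tile=500, overlap=200):
    """Start offsets of tiles along one axis so that the whole axis is covered."""
    stride = tile - overlap
    if length <= tile:
        return [0]
    origins = list(range(0, length - tile + 1, stride))
    # Assumption: if the regular grid stops short of the border,
    # add one final tile that sits flush with the border.
    if origins[-1] + tile < length:
        origins.append(length - tile)
    return origins


def map_box_to_image(box, origin, tile=500, net_input=300):
    """Map a box from resized-tile pixels back to full-image pixels.

    box    -- (xmin, ymin, xmax, ymax) in the net_input x net_input tile
    origin -- (x, y) of the tile's top-left corner in the full image
    """
    scale = tile / net_input  # undo the 500 -> 300 resize
    ox, oy = origin
    xmin, ymin, xmax, ymax = box
    return (ox + xmin * scale, oy + ymin * scale,
            ox + xmax * scale, oy + ymax * scale)


if __name__ == "__main__":
    xs = tile_origins(12000)
    ys = tile_origins(14000)
    print(len(xs) * len(ys), "tiles before any land-only tiles are skipped")
    print(map_box_to_image((30, 60, 90, 120), origin=(300, 600)))
```

Because neighbouring tiles overlap, the same target can be reported more than once after mapping, so a consolidation step on the mapped boxes is still needed afterwards.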
Figure 8. Detection results of a large-scale SAR image. The red circles and pink circles denote the false alarms and missed targets, respectively. Cargo, container ship, and tanker are labelled by yellow, green, and blue rectangles, respectively. Yellow, green, and blue ellipses indicate windmills, iron towers, and platforms, respectively.
Faster-RCNN [29] and SSD [31] models are also employed in the overall scheme to demonstrate the advantage of the proposed method. The detection results are recorded in Table 8. The recall, precision, and F1 score are calculated according to Equations (15)-(17). It can be seen that MR-SSD obtains the highest recall, precision, and F1 score among the three methods. Compared with SSD, which generates 22 false alarms, the proposed method produces only 8. Though Faster-RCNN produces the same number of false alarms as the proposed method, it detects fewer targets correctly, which leads to a lower F1 score. In summary, the proposed method outperforms the other methods in detecting different marine targets in large-scale SAR images.
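As a reminder of how the three scores relate to the counts of correct detections, false alarms, and missed targets, the following sketch uses the standard definitions of precision, recall, and F1. It assumes that Equations (15)-(17) follow these standard forms; the function name and the example counts are hypothetical and are not taken from Table 8.

```python
def detection_scores(true_positives, false_alarms, missed):
    """Return (precision, recall, f1) from raw detection counts."""
    detections = true_positives + false_alarms   # everything the detector reported
    ground_truths = true_positives + missed      # everything that was really there
    precision = true_positives / detections if detections else 0.0
    recall = true_positives / ground_truths if ground_truths else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(detection_scores(true_positives=90, false_alarms=10, missed=10))
```

With the number of false alarms held equal, a method that detects fewer targets correctly loses both recall and precision, which is why its F1 score is lower.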

Discussion
The experiments above demonstrate the merits of the proposed methods. In this section, we discuss the impact of several parameters on the performance of the proposed methods and analyze the characteristics of false alarms and missed targets, which should help to improve performance in the near future.

Performance of MT-CNN Trained with Different Data Augmentation Methods
In order to analyze the impact of data augmentation on the performance of MT-CNN, we use four datasets: a training set without flipping (TS1), a training set with up-to-down flipping (TS2), a training set with left-to-right flipping (TS3), and a training set with both up-to-down and left-to-right flipping (TS4). The experiments are carried out under the same conditions. Table 9 shows the classification results of MT-CNN trained with the different augmentation methods. It can be seen that flipping helps to improve the model's performance. However, adding more flips hardly improves performance further, because the additional flips do not provide the new information that the model needs.
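The four training sets can be sketched as follows, with each patch represented as a list of pixel rows. The helper names are hypothetical, and whether TS4 also contains patches flipped in both directions at once is not stated in the text; this sketch assumes it contains the originals plus each single flip.

```python
def flip_up_down(patch):
    """Reverse the row order (up-to-down flip)."""
    return [list(row) for row in reversed(patch)]


def flip_left_right(patch):
    """Reverse every row (left-to-right flip)."""
    return [list(reversed(row)) for row in patch]


def build_training_set(patches, up_down=False, left_right=False):
    """Original patches plus the requested flipped copies."""
    augmented = [[list(row) for row in p] for p in patches]
    if up_down:
        augmented += [flip_up_down(p) for p in patches]
    if left_right:
        augmented += [flip_left_right(p) for p in patches]
    return augmented


if __name__ == "__main__":
    patches = [[[1, 2], [3, 4]]]                                     # one toy 2 x 2 patch
    ts1 = build_training_set(patches)                                # no flipping
    ts2 = build_training_set(patches, up_down=True)                  # up-to-down
    ts3 = build_training_set(patches, left_right=True)               # left-to-right
    ts4 = build_training_set(patches, up_down=True, left_right=True) # both
    print(len(ts1), len(ts2), len(ts3), len(ts4))
```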

Comparison of Performance of Different CNN Structures
In this subsection, we build four CNN models with different layer arrangements and perform classification experiments on MTCD to demonstrate the merits of MT-CNN. The structures of the CNN models are shown in Table 10, and the parameters of the layers are the same as those of the corresponding layers in MT-CNN. Table 11 shows the classification accuracies of the different CNN models. While the accuracy of MT-CNN for platform is lower than that of CNN-A, MT-CNN obtains higher accuracies than the four CNN models in the other categories, and its overall accuracy reaches 95.20%. In addition, the overall accuracies of the five models are all over 90%, so adding or removing network layers has only a slight impact on performance.

Class Imbalance Effect
In MTCD, there are a few big classes (i.e., cargo and boat) and small classes (i.e., container ship and platform). In order to discuss the effect of class imbalance on the performance of MT-CNN, we use two balancing methods to build two balanced datasets: BAL1 and BAL2. As the smallest class (container ship) in MTCD has 200 slices, we reduce the number of slices in the other classes to 200 to form BAL1. BAL2 augments the small classes in MTCD by left-to-right flipping so that each class contains 400 slices. In the experiment, all of the slices in the three datasets are flipped up-to-down for data augmentation. Table 12 compares the classification accuracies of MT-CNN trained on the three datasets. For cargo, which accounts for the largest proportion of MTCD, the accuracy drops when the dataset is balanced. One possible reason is that MT-CNN tends to extract specific features of other categories as the proportion of cargo in the dataset declines. However, platform shows the opposite trend, with the accuracy rising by 2% and 4% in BAL1 and BAL2, respectively. This is because its proportions in BAL1 and BAL2 are higher than that in MTCD. Among the other classes, there is no significant imbalance effect because MTCD is not a seriously imbalanced dataset.
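The two balancing strategies can be sketched as below. The text does not say how slices are selected when a class is reduced, so random sampling with a fixed seed is an assumption, as are the function names and the dictionary layout (class label mapped to a list of slices).

```python
import random


def flip_left_right(patch):
    """Reverse every row of a patch given as a list of pixel rows."""
    return [list(reversed(row)) for row in patch]


def undersample(dataset, per_class, seed=0):
    """BAL1-style balancing: keep at most per_class slices in each class."""
    rng = random.Random(seed)
    return {
        label: rng.sample(slices, per_class) if len(slices) > per_class else list(slices)
        for label, slices in dataset.items()
    }


def balance_with_flips(dataset, per_class, seed=0):
    """BAL2-style balancing: grow small classes by flipping, then cap at per_class."""
    rng = random.Random(seed)
    balanced = {}
    for label, slices in dataset.items():
        pool = list(slices)
        if len(pool) < per_class:
            pool += [flip_left_right(s) for s in slices]  # left-to-right copies
        if len(pool) > per_class:
            pool = rng.sample(pool, per_class)            # assumption: random cap
        balanced[label] = pool
    return balanced


if __name__ == "__main__":
    toy = {
        "cargo": [[[i, i + 1]] for i in range(5)],        # a "big" toy class
        "container ship": [[[i, i + 1]] for i in range(2)],  # a "small" toy class
    }
    print({k: len(v) for k, v in undersample(toy, per_class=2).items()})
    print({k: len(v) for k, v in balance_with_flips(toy, per_class=4).items()})
```

In the setting described above, `per_class` would be 200 for BAL1 and 400 for BAL2.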

Performance of MT-CNN against Ground Resolution Variance
To evaluate how resolution variance affects the proposed MT-CNN, further experiments using images at different resolutions are conducted. A test dataset including eight types of target slices at 1.7 m ground resolution is built. Then, low-pass filters are used to lower the image resolution, generating target slices in eight resolution versions. Table 13 shows the composition of the test dataset. In the experiment, the proposed network is fed with target slices at the different resolutions. Figure 9 illustrates the robustness of the proposed MT-CNN against ground resolution. We can see that there is a slight increase in average accuracy from 1.7 m to 3.4 m, after which it decreases gradually from 97% at 3.4 m to 85% at 13.6 m. The accuracies for windmill and cargo remain at 100% as the ground resolution varies from 1.7 m to 13.6 m. Tanker, container ship, and cage are more sensitive to resolution variance than the other kinds of targets. The accuracy for tanker declines dramatically from 95% at 1.7 m to 57% at 6.8 m and then remains stable from 6.8 m to 13.6 m. One possible reason is that auxiliary structures such as pipelines and cranes on tankers are blurred in low-resolution images, so tankers lose their distinctive features. Additionally, the accuracy for cage drops rapidly from 6.8 m to 11.9 m because cages share some rectangle-like shapes with platforms, and many cages are misclassified as platforms.
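The idea of generating lower-resolution versions of a slice with a low-pass filter can be sketched as follows. The type of filter used by the authors is not given here, so a simple k × k mean (box) filter is swapped in as a stand-in. The assumption that the eight versions are spaced at multiples of 1.7 m is inferred from the resolutions quoted in the text (1.7 m, 3.4 m, 6.8 m, 11.9 m, 13.6 m). All names are hypothetical.

```python
BASE_RESOLUTION_M = 1.7  # ground resolution of the original slices, from the text


def box_low_pass(image, k):
    """Apply a k x k mean filter with clamped borders; output keeps the input size."""
    height, width = len(image), len(image[0])
    half = k // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy = min(max(y + dy - half, 0), height - 1)
                    xx = min(max(x + dx - half, 0), width - 1)
                    total += image[yy][xx]
            out[y][x] = total / (k * k)
    return out


def resolution_versions(image, factors=range(1, 9)):
    """Map an approximate ground resolution in metres to a blurred copy of the slice."""
    return {round(BASE_RESOLUTION_M * k, 1): box_low_pass(image, k) for k in factors}


if __name__ == "__main__":
    slice_ = [[0.0] * 5 for _ in range(5)]
    slice_[2][2] = 9.0  # a single bright scatterer
    versions = resolution_versions(slice_)
    print(sorted(versions))
    print(versions[5.1][2][2])  # the peak is spread over a 3 x 3 neighbourhood
```

The bright point is progressively smeared as the filter widens, which mirrors the explanation above that fine structures on tankers lose their distinctiveness at coarser resolutions.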

Comparison of Performance of MR-SSD with Different Low-Pass Filters
In this subsection, we adopt different values of λ for the G channel and the B channel to analyze its influence on the performance of MR-SSD. All the MR-SSD models are trained on the Caffe framework, and the experimental parameters are the same as those in Section 3.4. Table 14 shows the performance of MR-SSD with different low-pass filters. It can be seen that the mAP of MR-SSD changes only slightly as λ varies, and it reaches its highest value (87.38%) when λ is set to 0.5 and 0.25 for the G channel and the B channel, respectively.

Influence of Different Patch Sizes in the Proposed Workflow
In the proposed workflow, large-scale SAR images are cropped into slices, which are then sent to the pre-trained MR-SSD; the impact of the slice size is discussed in this subsection. We carry out experiments using the large-scale SAR image from Section 3.5, and the performance of the proposed method for different patch sizes is compared in Table 15. It can be noticed that the computational time drops dramatically as the patch size increases, because the patch size determines the total number of patches. Recall is relatively high for patch sizes up to 700 × 700, but it drops dramatically from 89.84% at 700 × 700 to 64.06% at 900 × 900, because resizing larger patches to the network input size removes many image details, leading to many missed targets. In practice, the cropping size should be chosen carefully to balance computational cost and F1 score.
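The trade-off can be made concrete with a small calculation of how many patches a 12,000 × 14,000 image yields and how coarse each patch becomes after being resized to the 300 × 300 network input. This is a rough sketch: it treats the 1.7 m ground resolution as the pixel spacing and assumes the 200-pixel overlap is kept for every patch size, neither of which is stated in the text, and it ignores patches skipped because they contain only land.

```python
import math

BASE_RESOLUTION_M = 1.7  # ground resolution of the UFS image, from the text
NET_INPUT = 300          # MR-SSD input size, from the text
OVERLAP = 200            # assumption: same overlap for every patch size


def effective_resolution(patch_size):
    """Approximate metres per pixel after resizing a patch to the network input."""
    return BASE_RESOLUTION_M * patch_size / NET_INPUT


def tiles_per_axis(length, patch_size, overlap=OVERLAP):
    """Number of overlapping patches needed to cover one image axis."""
    stride = patch_size - overlap
    return max(1, math.ceil((length - overlap) / stride))


def patch_count(width, height, patch_size):
    """Total number of patches needed to cover the whole image."""
    return tiles_per_axis(width, patch_size) * tiles_per_axis(height, patch_size)


if __name__ == "__main__":
    for size in (500, 700, 900):
        print(size, patch_count(12000, 14000, size),
              round(effective_resolution(size), 2))
```

Under these assumptions, a 900 × 900 patch is resampled to about 5.1 m per pixel, compared with about 2.8 m for a 500 × 500 patch, while the number of patches to process falls by more than a factor of five.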

False Alarms and Missing Targets in the Large-Scale Images
Some typical patches containing false alarms are displayed in the blue box, while missed targets are shown in the red box in Figure 10. It can be seen that small reefs are easily recognized as cargos because they are brighter than the sea clutter and share similar visual features with cargos. For images without geocoding, it is difficult to remove all the reefs precisely. Interestingly, some dams are classified as cargo or container ship. One possible reason is that dams produce bright lines similar to those produced by warehouses or containers. Additionally, a cargo is recognized as a platform in Figure 10d because it has a rectangular contour with high intensity, which visually resembles a platform. As for the platform in Figure 10f, the blurring in the image looks like burning towers on a platform, which is the main reason for the misclassification. Besides, some coastlines are classified as cargos because of their bright lines in SAR images.
As for the missed targets, some of them are small or have weak intensity, which produces little response in the network, so they remain undetected. The strong noise and motion blurring in Figure 10k,m also exert adverse effects on target detection.

Conclusions
With the labeled SAR images provided by the GF-3 satellite, this paper proposes a convolutional neural network (MT-CNN) to classify marine targets at patch level and an overall scheme to detect different marine targets in large-scale SAR images. The proposed MT-CNN, with six convolutional layers and three pooling layers, is capable of extracting features at different levels and achieves higher classification accuracy than existing CNN models. As for the marine target detection task in large-scale SAR images, the proposed MR-SSD with a three-resolution input is able to learn features from different resolution versions. The proposed framework, containing sea-land segmentation, cropping with overlapping, detection with the MR-SSD model, and coordinates mapping, shows its superiority over other methods by improving detection accuracy and reducing false alarms. Besides, this is the first experiment of its kind carried out on such a variety of marine target types in SAR images. This paper presents the preliminary results of the proposed methods. Looking ahead, future work can focus on eliminating false alarms in SAR images by image processing methods.