D-ATR for SAR Images Based on Deep Neural Networks

Abstract: Automatic target recognition (ATR) can obtain important information for target surveillance from Synthetic Aperture Radar (SAR) images. Thus, a direct automatic target recognition (D-ATR) method, based on a deep neural network (DNN), is proposed in this paper. To recognize targets in large-scene SAR images, the traditional methods of SAR ATR comprise four major steps: detection, discrimination, feature extraction, and classification. However, the recognition performance is sensitive to each step, as the processing result from each step will affect the following step. Meanwhile, these processes are independent, which means that there is still room for processing speed improvement. The proposed D-ATR method can integrate these steps as a whole system and directly recognize targets in large-scene SAR images, by encapsulating all of the computation in a single deep convolutional neural network (DCNN). Before the DCNN, a fast sliding method is proposed to partition the large image into sub-images, to avoid information loss when resizing the input images, and to avoid the target being divided into several parts. After the DCNN, non-maximum suppression between sub-images (NMSS) is performed on the results of the sub-images, to obtain an accurate result for the large-scene SAR image. Experiments on the MSTAR dataset and large-scene SAR images (with resolution 1478 × 1784) show that the proposed method can obtain a high accuracy and fast processing speed, and outperforms other methods, such as CFAR+SVM, Region-based CNN, and YOLOv2.


Introduction
Synthetic aperture radar (SAR) can operate day and night and in all weather conditions to provide high-resolution images, and so it plays a significant role in surveillance and battlefield reconnaissance [1,2]. Automatic target recognition (ATR) is the process of automatic target acquisition and classification: it recognizes targets or other objects based on data obtained from sensors, and it has good application prospects in both military and civilian areas [3]. The process of SAR ATR can be summarized as finding regions of interest (ROIs) in the observed SAR image and classifying the category of each ROI (e.g., T72 or BTR70) [4]. Some earlier methods of SAR ATR can be found in [5][6][7][8][9].
Traditional SAR ATR techniques mainly include four steps: detection, discrimination, feature extraction, and target recognition/classification [10]. For target detection, potential ROIs are extracted from the input SAR image according to the local brightness or the shape of targets; CFAR [11] is a classical algorithm which examines every pixel of a SAR image to detect targets against a background of noise, clutter, and interference. In the discrimination phase, the ROIs obtained from the detection step are processed to remove false alarms, with the purpose of reducing classification cost. The feature extractor is specific to particular tasks in the interpretation of SAR images, and reduces the dimension of the feature space used to interpret the SAR imagery. Some researchers use a feature-based approach to deal with the problem of SAR ATR [12,13]. After detection and discrimination, the remaining ROIs are input into the recognition/classification stage to obtain the type of target (e.g., armored personnel carrier, howitzer, or tank). There are mainly two traditional approaches. The most common one is based on template matching. The second is based on classifier models, such as support vector machines (SVM) [5] and adaptive boosting [14]. However, traditional SAR ATR methods depend heavily on handcrafted features and have a large computational burden or poor generalization performance [15]. The accuracy will also decrease significantly if any stage of the SAR ATR is not well designed or not suitable for the current operating conditions [16].
Recently, deep learning (DL) algorithms have been significantly developed. Girshick proposed regions with CNN features (R-CNN) [17] in 2014, and object detection based on deep learning began to come into favor. Subsequently, many improved algorithms based on R-CNN have been proposed, such as Fast R-CNN [18] and Faster R-CNN [19], which have achieved high accuracies in recognizing targets in optical images. However, these methods have been too computationally intensive for embedded systems and, even with high-end hardware, too slow for real-time applications.
For the sake of speeding up computation, some researchers proposed methods based on a single network, which predicts bounding boxes directly without region proposals. Redmon proposed You Only Look Once (YOLO) [20], a regression-based method which directly recognizes different kinds of objects with different sizes in optical images and gives confidence ratios. However, it had a problem with inaccurate positioning. Liu proposed a Single Shot MultiBox Detector (SSD) [21], which showed a compromise of accuracy and speed in the field of optical object detection.
Inspired by the successful application of deep learning methods in optical areas, some researchers introduced DL methods for dealing with problems in the processing of SAR images. Refs. [22,23] effectively extracted a high-level feature representation for SAR images by using a Deep Convolutional Neural Network (DCNN), which learned high-level features automatically rather than requiring handcrafted features. Ref. [24] proposed an efficient feature extraction and classification algorithm, based on a visual saliency model. Ref. [25] proposed a target detection and discrimination method, based on a visual attention model. Experimental results on synthetic images and the miniSAR image data set demonstrated that this method could detect and discriminate targets from complex background clutter with high accuracy and fast speed for high-resolution SAR images, which provides an effective way to overcome the drawbacks in target detection and discrimination in SAR images with large, complex scenes. Ref. [15] used a CNN to recognize SAR targets, and achieved a classification performance competitive with existing methods considered to be state-of-the-art [22]. These works showed that deep learning methods can be used in every process of SAR ATR. However, each of these methods focuses on only one of the four steps of SAR ATR.
To date, Wang [26] has used Faster R-CNN, which integrates detection and recognition in the field of optical target detection, to realize the same integration in the field of SAR ATR, and obtained a system capable of dealing with large-scene SAR images. Ref. [27] proposed a region-based convolutional neural network to address the problem of SAR target recognition in large-scene images. However, the processing time of these systems can be further decreased.
For the sake of integrating the traditional four steps of SAR ATR as a whole system, we were encouraged by the previous works in adopting deep learning methods for target detection in optical images to the field of SAR images. By encapsulating all computation in a single deep neural network, the integration of target detection and recognition of large scene SAR images can be realized.
The proposed D-ATR system can directly recognize targets from complex background clutter with a high accuracy and fast speed in large-scene SAR images. Transfer learning and data augmentation methods, such as horizontal flip and random crop, are used in this paper, at the stage where the available SAR images are limited for training. To meet the requirement of input size of the neural network, a method of fast sliding is used to cut the large-scene SAR images into sub-images with a suitable size for the input of the neural network, and to guarantee that every target exists completely in one of the sub-images. Finally, non-maximum suppression between sub-images (NMSS) is proposed to suppress the predicted boxes among the sub-images, for more accurate recognition performance.
The organization of this paper is as follows. Section 2 introduces the structure and components of the deep convolutional neural network. Section 3 presents several experiments that evaluate and compare the performance of the proposed method. Finally, Section 4 concludes the paper and discusses prospects for future work.

Structure of The D-ATR
The flowchart of the proposed D-ATR, which realizes the integration of target detection and recognition in large-scene SAR images, is shown in Figure 1. It contains three main parts. The first part is a fast sliding method for cutting the large-scene SAR image into sub-images with a suitable size. The second part is the network for feature extraction, target detection, and recognition. The third part is the proposed NMSS for retaining the best bounding box and obtaining the best recognition result.

Base Network
The most common obstacle when applying deep learning methods lies in the large amount of data required. So much training data is needed because a large number of parameters must be determined during the training process. For example, ImageNet is a large-scale labeled image dataset, organized according to the WordNet hierarchy, which contains about 22 thousand categories and 15 million images, strictly selected and labeled by human annotators. AlexNet [28] showed surprising performance on the object classification of 1000 categories in ImageNet in 2012. Subsequently, VGG-16-Net [29] and GoogLeNet [30] were proposed, with better recognition rates. The more recent Res-Net [31] achieved extraordinary performance when recognizing targets in ImageNet. Deep learning has made great progress in object recognition, and is also expected to solve the problems of SAR target recognition. However, compared to ImageNet, there are insufficient annotated SAR data, as it is expensive to capture SAR images and annotate them.
Transfer learning is a method in the area of machine learning. Given a source domain $D_S = \{X_S, f_S(X)\}$ with a learning task $T_S$, and a target domain $D_T = \{X_T, f_T(X)\}$ with a learning task $T_T$, transfer learning aims to help improve the learning of the target predictive function $f_T(\cdot)$ in $D_T$, using the knowledge in $D_S$ and $T_S$, where $D_S \neq D_T$ or $T_S \neq T_T$. The key point is to store the knowledge acquired in solving one problem and apply it to a different, but related, problem. For instance, the knowledge acquired by learning to recognize a dog may be useful when learning to recognize a cat. The method of transfer learning provides an effective way to train a large network with limited training data without overfitting.
Researchers have successfully applied the method of transfer learning to the field of SAR classification. Ref. [32] proposed a method based on transfer learning, which transferred the knowledge learned from sufficient unlabeled SAR scene images to labeled SAR target data. Ref. [33] introduced transfer learning into SAR image classification with a limited quantity of training data, and the parameters of a model trained on CIFAR10 were successfully applied to TerraSAR-X data.
In this paper, VGG-16-Net is selected as the base network, which is pre-trained on the ILSVRC CLS-LOC dataset [34].

Additional Feature Layers
Additional convolutional layers are added after the base network. As shown in Figure 2, there are five convolutional layers (from CONV1 to CONV5) in the structure of the additional feature layers; the products in the semicircle boxes indicate the size and number of the convolutional kernels (e.g., 3 × 3 × 1024 represents 1024 convolutional kernels of size 3 × 3). The numbers in parentheses, next to each arrow, describe the feature maps produced by the related convolutional layer (e.g., (1024, 19 × 19) indicates that CONV1 generates 1024 feature maps of size 19 × 19). The size of the feature maps from these convolutional layers decreases layer by layer, which generates multi-scale feature maps for detection.
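In the $l$th convolutional layer, $O_i^{(l-1)}(x, y)$ is the unit of the $i$th input feature map at position $(x, y)$; $k_{ji}^{(l)}(u, v)$ represents the convolution kernel which connects the $i$th input feature map to the $j$th output feature map; and $b_j^{(l)}$ is the trainable bias of the $j$th output feature map. The calculation of convolution is illustrated as follows:

$$O_j^{(l)}(x, y) = f\left(G_j^{(l)}(x, y)\right)$$

$$G_j^{(l)}(x, y) = \sum_{i=1}^{I} \sum_{u=0}^{K-1} \sum_{v=0}^{K-1} k_{ji}^{(l)}(u, v)\, O_i^{(l-1)}(Sx + u - P,\; Sy + v - P) + b_j^{(l)}$$

where $f(\cdot)$ is the nonlinear activation function, $G_j^{(l)}(x, y)$ represents the weighted sum of inputs to the $j$th output feature map at position $(x, y)$, $I$ is the number of input feature maps, $K \times K$ is the size of the convolution kernels, and $P$ and $S$ are the zero padding and convolution stride, respectively. Input units that fall outside the feature map are taken as zero.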
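The weighted sum described above can be sketched in plain Python. This is a minimal illustration for one output feature map, not the implementation used in the paper; the function name, the row/column indexing, and the way stride and zero padding enter the indices are assumptions made for the example.

```python
def conv2d_single_output(inputs, kernels, bias, stride=1, padding=0):
    """Weighted sum G_j(x, y) for one output feature map (illustrative sketch).

    inputs  : list of I input feature maps, each a list of rows (H x W)
    kernels : list of I kernels of size K x K, one per input feature map
    bias    : scalar bias b_j of this output feature map
    """
    height, width = len(inputs[0]), len(inputs[0][0])
    k = len(kernels[0])
    out_h = (height - k + 2 * padding) // stride + 1
    out_w = (width - k + 2 * padding) // stride + 1

    def pixel(fmap, row, col):
        # Zero padding: positions outside the feature map contribute 0.
        if 0 <= row < height and 0 <= col < width:
            return fmap[row][col]
        return 0.0

    output = []
    for y in range(out_h):
        out_row = []
        for x in range(out_w):
            total = bias
            for fmap, kernel in zip(inputs, kernels):
                for v in range(k):
                    for u in range(k):
                        total += kernel[v][u] * pixel(
                            fmap, y * stride + v - padding, x * stride + u - padding
                        )
            out_row.append(total)
        output.append(out_row)
    return output


def relu(value):
    """One common choice for the nonlinear activation f (an assumption here)."""
    return max(0.0, value)
```

Applying `relu` element-wise to the returned map gives the output feature map for that choice of activation.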
The size of the kernels of the convolutional layer is $K \times K$. If the $I$ input feature maps have size $W_1 \times H_1$, the $J$ output feature maps have size $W_2 \times H_2$, where $W_2$ and $H_2$ are computed as follows, respectively:

$$W_2 = \frac{W_1 - K + 2P}{S} + 1 \tag{5}$$

$$H_2 = \frac{H_1 - K + 2P}{S} + 1 \tag{6}$$

Some researchers introduced a novel visualization technique which gives insight into the function of intermediate feature layers and the operation of the classifier, and showed that a smaller stride (2 versus 4) and filter size (7 × 7 versus 11 × 11) resulted in more distinctive features and better performance [35]. In this paper, we follow these guidelines, so the hyperparameters, such as the convolution stride and filter size in each convolution layer, are as shown in Table 1, where the size of the feature maps is calculated by Equations (5) and (6).
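As a quick check of the output-size relation, the following sketch computes the spatial size of a convolutional layer's output from the input size, kernel size, stride, and zero padding. The helper name is hypothetical, and rounding down when the division is not exact is an assumption (it is the usual convention); the example values are illustrative and are not taken from Table 1.

```python
def conv_output_size(input_size, kernel_size, stride, padding):
    """Output width (or height) of a convolution: (W1 - K + 2P) / S + 1.

    Floor division is assumed for the case where the stride does not
    divide the padded extent exactly.
    """
    return (input_size - kernel_size + 2 * padding) // stride + 1


# A 3 x 3 kernel with stride 1 and padding 1 keeps the spatial size.
print(conv_output_size(19, 3, 1, 1))  # 19
# The same kernel with stride 2 roughly halves it.
print(conv_output_size(19, 3, 2, 1))  # 10
```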

Receptive Field
After determining the hyperparameters of the CNN network, the corresponding theoretical receptive fields in each layer are also determined. The receptive field of a neuron in one of the lower layers encompasses only a small area of the image, while the receptive field of a neuron in subsequent (higher) layers involves a combination of receptive fields from several (but not all) neurons in the layer before (i.e., a neuron in a higher layer "looks" at a larger portion of the image than a neuron in a lower layer does). In this way, each successive layer is capable of learning increasingly abstract features of the original image. Let $R_i$ $(i = 1, 2, \cdots, n)$ denote the receptive field of the $i$th layer, with $R_0 = 1$ for the input. The receptive field is calculated as follows:

$$R_i = R_{i-1} + (k_i - 1) \prod_{j=1}^{i-1} s_j$$

where $s_i$ and $k_i$ represent the convolution stride and the size of the convolution kernel in each layer, respectively, and $R_{i-1}$ and $R_i$ are the receptive fields of the $(i-1)$th and the $i$th convolutional layers. With a larger number of network layers, the size of the receptive field gradually increases, thus allowing simultaneous detection of targets of different sizes. The large receptive fields are mainly responsible for the detection of large targets, and the small receptive fields are responsible for the detection of small targets.
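The growth of the receptive field with depth can be illustrated with a short sketch. It assumes the standard forward recursion, in which each layer enlarges the receptive field by (kernel size − 1) times the product of the strides of all earlier layers; the function name and the example layer list are hypothetical and do not reproduce the network in Table 1.

```python
def receptive_fields(layers):
    """Theoretical receptive field after each layer.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    """
    fields = []
    field = 1  # a single input pixel
    jump = 1   # product of the strides of all layers seen so far
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
        fields.append(field)
    return fields


# Two 3x3 convolutions, a 2x2 stride-2 pooling layer, then another 3x3 convolution.
print(receptive_fields([(3, 1), (3, 1), (2, 2), (3, 1)]))  # [3, 5, 6, 10]
```

The last layer in the example grows the field by 4 rather than 2 because it operates behind a stride-2 layer, which is why deeper feature maps suit larger targets.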

Detector and Classifier
As shown in Figure 3, the detector and classifier are mainly comprised of three parts. The first part generates the default bounding boxes, the second part performs positioning or localization, and the third part is responsible for generating the confidence of the category. In detail, the m × n feature maps obtained from the additional convolutional layers and the base network are convolved with two different 3 × 3 convolutional kernels, one producing a score or confidence for a category, and the other generating a shape offset relative to the default box coordinates. For a feature map of size m × n, there are m * n feature map cells in total. The number of default bounding boxes of each cell and the number of object categories to be detected and recognized are denoted by K and C, respectively. Each cell requires a total of K * (C + 4) predictions, and so all cells need a total of K * (C + 4) * m * n predictions.
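The prediction count can be checked with a few lines of Python. The function name and the example values (a 19 × 19 feature map, 6 default boxes per cell, 3 categories) are illustrative assumptions, not the configuration reported in the paper.

```python
def predictions_per_feature_map(m, n, boxes_per_cell, num_classes):
    """Total predictions for an m x n feature map: K * (C + 4) * m * n.

    Each default box needs C class scores plus 4 box offsets.
    """
    return boxes_per_cell * (num_classes + 4) * m * n


print(predictions_per_feature_map(19, 19, 6, 3))  # 15162
```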

Overall Training Process
The training process includes the following five steps:
Step 1: Obtain the basic features of the input image by forward propagation;
Step 2: Extract multi-scale feature maps and select candidate regions with different scales and different aspect ratios in these feature maps;
Step 3: Calculate the coordinate position offset and category score of each candidate region;
Step 4: Calculate the final region, according to the offsets and coordinate position of the candidate region, then calculate the loss of each candidate region according to its category score and accumulate the final loss function; and
Step 5: Update the weights of each layer from the final loss by the back-propagation algorithm.
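The center (cx, cy), width (w), and height (h) of the default bounding box are regressed to offsets. The overall loss function is similar to [19], as shown in Equation (7), which contains two parts: the localization loss (loc) and the confidence loss (conf).

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right) \tag{7}$$

where $N$ is the number of matched default boxes and $\alpha$ weights the two terms. The localization loss is

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(l_i^m - \hat{g}_j^m\right)$$

and the confidence loss is

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right), \qquad \hat{c}_i^{p} = \frac{\exp\left(c_i^{p}\right)}{\sum_{p} \exp\left(c_i^{p}\right)}$$

where $x_{ij}^{k} \in \{0, 1\}$ indicates whether the $i$th default box is matched to the $j$th ground truth box of category $k$, $l$ is the predicted box, and the regression targets are

$$\hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}}, \quad \hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}}, \quad \hat{g}_j^{w} = \log\frac{g_j^{w}}{d_i^{w}}, \quad \hat{g}_j^{h} = \log\frac{g_j^{h}}{d_i^{h}}$$

Here $(g_j^{cx}, g_j^{cy}, g_j^{w}, g_j^{h})$ and $(d_i^{cx}, d_i^{cy}, d_i^{w}, d_i^{h})$ represent the ground truth box and the default bounding box, respectively. During the training process, the default bounding boxes are matched to the ground truth boxes. For example, if two default boxes have been matched to a T72 and a BMP2, they are treated as positives, while the rest are treated as negatives.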
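To make the regression targets concrete, the sketch below encodes a ground truth box relative to a default box and evaluates a smooth-L1 localization term for one matched pair. It assumes the center/size offset encoding and smooth-L1 penalty used by Faster R-CNN and SSD style detectors; the function names and the box values are hypothetical.

```python
import math


def encode_box(ground_truth, default_box):
    """Offsets of a ground truth box relative to a default box.

    Both boxes are (cx, cy, w, h). Centers are normalized by the default
    box size; width and height are encoded as log ratios.
    """
    g_cx, g_cy, g_w, g_h = ground_truth
    d_cx, d_cy, d_w, d_h = default_box
    return (
        (g_cx - d_cx) / d_w,
        (g_cy - d_cy) / d_h,
        math.log(g_w / d_w),
        math.log(g_h / d_h),
    )


def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    x = abs(x)
    return 0.5 * x * x if x < 1.0 else x - 0.5


def localization_loss(predicted_offsets, ground_truth, default_box):
    """Smooth-L1 loss summed over cx, cy, w, h for one matched default box."""
    targets = encode_box(ground_truth, default_box)
    return sum(smooth_l1(p - t) for p, t in zip(predicted_offsets, targets))


targets = encode_box((60, 60, 40, 20), (50, 50, 40, 20))
print(targets)  # (0.25, 0.5, 0.0, 0.0)
print(localization_loss((0.0, 0.0, 0.0, 0.0), (60, 60, 40, 20), (50, 50, 40, 20)))  # 0.15625
```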

Before and After Operation for DCNN
Before DCNN, a fast sliding method is proposed to partition the large image into sub-images to keep the information integrity. After DCNN, NMSS is performed to eliminate false alarms.

Fast Sliding
In conventional cases, images are resized so that all images have the same size. Taking AlexNet as an example, during the CNN training and testing stages, all images are resized to the same size of 227 × 227 before being fed into the network for feature extraction and classification. Generally, a large-scene SAR image is several times larger than the input size of the CNN. When a large-scene SAR image is resized, it may suffer from substantial information loss and object distortion, which may compromise image matching between query and database images. This problem is significant for target recognition, as the object of interest may take up only a small region of the image, whereas its details can be observed more clearly at the original, larger size. In the recognition stage, keeping the aspect ratio of an image also helps to preserve the shape of the objects/scene, thus making the classification more accurate [36].
In order to avoid the situation discussed above, it is necessary to partition the large-scene images into sub-images with a suitable size. During the cutting process, a target in the scene is likely to be divided into several parts, which will lead to a poor recognition result. Therefore, it is important to design a strategy that cuts the image into sub-images matching the input size of the convolutional network while ensuring that every target exists completely in one sub-image.
In this paper, a fast sliding method is proposed to partition the large scene into sub-images of a suitable size. A rectangular window of fixed size slides over the original large-scene SAR image with a fixed step, such that each slice overlaps the previous slice by a certain area [37]. Assume that the largest bounding box of the target is w_t × h_t, the size of the sliding window is w_s × h_s, and the size of the large-scene SAR image is w_o × h_o. Figure 4 shows the process of fast sliding, where λ_h and λ_v denote the horizontal and vertical sliding steps, respectively. If the sliding window would exceed the boundary of the image, it is shifted until the right side of the sliding window coincides with the right side of the image. As shown in Figure 5, the SAR image is divided into four parts by the method. Different colors of the rectangular boxes indicate different positions of the sliding window: the target is split in the green box, whereas the purple box contains the target completely. To ensure that every target in the large-scene image exists completely in at least one sub-image, the relationship between the parameters is as shown in Equation (12). In this paper, w_s = h_s = 256, λ_h = λ_v = 128, and the overlap is set to 0.5.
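The window placement can be sketched as follows. This is an illustrative reading of the fast sliding procedure, not the authors' code: it uses the 256 × 256 sub-image size reported for the experiments with a step of 128, clamps the last window to the image border, and assumes the overlap between neighboring windows (window size minus step) is at least the largest target extent, which is what lets every target fall completely inside some window. The function names and the width × height ordering are assumptions.

```python
def window_starts(length, window, step):
    """Start offsets of a sliding window along one image axis.

    The final window is shifted back so that it ends exactly at the border.
    """
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, step))
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts


def fast_sliding(image_width, image_height, window=256, step=128):
    """(left, top) corners of all sub-images covering the large scene."""
    return [
        (left, top)
        for top in window_starts(image_height, window, step)
        for left in window_starts(image_width, window, step)
    ]


corners = fast_sliding(1478, 1784)
print(len(corners))   # 143 sub-images
print(corners[-1])    # (1222, 1528): the last window touches both borders
```

With a step of 128 and a window of 256, neighboring windows overlap by 128 pixels, the size of an MSTAR target chip.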

NMSS
By the method of fast sliding, the large-scene SAR image was divided into sub-images with a suitable size for the input of the network, which will then be sent into the network, sequentially, to detect and recognize targets. When the recognition result and confidence of the targets and position of bounding boxes are generated by bounding box regression and classification, the primary task in this stage is to analyze the results on the sub-images and select the appropriate results to display on the original large-scene SAR images.
The task of object detection is to map an image to a set of boxes, one box for each object of interest in the image, with each box surrounding an object. This means that the detector ought to return only one detection result per object. Non-maximum suppression (NMS) is a post-processing algorithm responsible for merging all detections that belong to the same object and removing redundant detections [38]. For every sub-image, NMS is adopted to discard bounding boxes that significantly overlap each other. However, different sub-images may contain the same target, because the fast sliding method makes some of them overlap. If the detection results of the sub-images are displayed directly on the original image, there may be multiple, confusing detection results for some targets.
To solve this problem, non-maximum suppression between sub-images (NMSS) is proposed in this paper. The specific process of this method is as follows:
Step 1: Coordinate transformation: map the coordinates of the sub-images to the original image;
Step 2: Retain the bounding box with the highest category confidence for the current target;
Step 3: Retain the bounding boxes which are independent in the image;
Step 4: Calculate the intersection over union (IoU) between each of the remaining boxes and the box from Step 2, and delete the bounding boxes whose IoU exceeds the set threshold;
Step 5: Choose the box with the highest category confidence from the unprocessed boxes and repeat Steps 1 and 2; and
Step 6: Repeat the previous four steps until the N bounding boxes with the highest category confidence of the targets are found.
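The steps above can be sketched as coordinate mapping followed by greedy suppression over the whole scene. This is a simplified interpretation rather than the authors' implementation: it assumes class-agnostic suppression, an IoU threshold of 0.5, and a hypothetical detection record with `box`, `offset`, `label`, and `score` fields.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def nmss(detections, iou_threshold=0.5):
    """Suppress duplicate detections of one target across sub-images.

    detections: list of dicts with
      'box'    : (x1, y1, x2, y2) in sub-image coordinates
      'offset' : (left, top) of the sub-image in the large scene
      'label'  : predicted category
      'score'  : category confidence
    """
    # Map every box from sub-image coordinates to the original image.
    mapped = []
    for det in detections:
        x1, y1, x2, y2 = det["box"]
        left, top = det["offset"]
        mapped.append({
            "box": (x1 + left, y1 + top, x2 + left, y2 + top),
            "label": det["label"],
            "score": det["score"],
        })

    # Visit boxes from highest to lowest confidence; keep a box only if it
    # does not overlap an already kept box by more than the threshold.
    mapped.sort(key=lambda d: d["score"], reverse=True)
    kept = []
    for candidate in mapped:
        if all(iou(candidate["box"], k["box"]) <= iou_threshold for k in kept):
            kept.append(candidate)
    return kept
```

Boxes that overlap no other box (the "independent" boxes of Step 3) pass the check automatically, so they need no special handling in this form.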

Dataset Generation
In this paper, the training dataset and test dataset are generated from the MSTAR dataset, provided by the Air Force Research Laboratory and the Defense Advanced Research Projects Agency (AFRL/DARPA) [4]. The dataset serves as a standard data set for the research of SAR ATR. The sensor that collected the dataset is a spotlight SAR, with a high resolution of 0.3 m × 0.3 m in range and azimuth. There are thousands of publicly released SAR images, including ten categories of ground military vehicles (armored personnel carrier: BMP2, BRDM2, BTR60, and BTR70; tank: T62 and T72; rocket launcher: 2S1; air defense unit: ZSU234; truck: ZIL131; and bulldozer: D7). Examples of SAR images of the ten types of targets at similar aspect angles and their corresponding optical images are depicted in Figure 6. The serial number, depression angle, and number of images available for training and testing are listed in Table 2. Images for training are acquired at a 17° depression angle, and images for testing are captured at 15°. In this paper, for the three-type target detection and recognition problem, three categories (armored personnel carrier: BMP2 and BTR70, and tank: T72) are adopted to train and test our method. For the ten-type target detection and recognition problem, all types of targets in Table 2 are used for generating the training and testing dataset. As acquiring large-scene SAR images that include ground vehicle targets is expensive, it is essential to use the large scenes and target images provided in the MSTAR dataset to generate large-scene SAR images containing targets for research. The MSTAR dataset provides thousands of scene images without targets. Therefore, we embed many targets from the 128 × 128 image chips into the large-scene image. This operation is reasonable, because both the targets and the scene image are captured by the same spotlight SAR with the same resolution of 0.3 m × 0.3 m.
In this paper, several large-scene SAR images were made for our experiments. In Figure 7, a composite SAR image with a large scene with 15 targets randomly distributed on it is shown, and the target category and its corresponding number is shown in Table 3.
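A composite scene of this kind could be assembled as in the sketch below. It assumes the chips are pasted directly over the background at chosen positions, which may differ from how the authors blended targets into the scene; the function name, the list-of-rows image representation, and the tiny example arrays are hypothetical.

```python
def embed_chips(scene, chips):
    """Paste target chips into a copy of a scene and record their boxes.

    scene : list of rows of pixel values (the target-free large scene)
    chips : list of (label, chip, top, left), where chip is a list of rows
    Returns the composite image and one ground truth box per chip,
    as (x1, y1, x2, y2) in scene coordinates.
    """
    composite = [row[:] for row in scene]  # leave the input scene unchanged
    annotations = []
    for label, chip, top, left in chips:
        chip_h, chip_w = len(chip), len(chip[0])
        if (top < 0 or left < 0
                or top + chip_h > len(composite)
                or left + chip_w > len(composite[0])):
            raise ValueError("chip does not fit inside the scene")
        for r in range(chip_h):
            composite[top + r][left:left + chip_w] = chip[r]
        annotations.append(
            {"label": label, "box": (left, top, left + chip_w, top + chip_h)}
        )
    return composite, annotations


# Toy example: a 2 x 2 "chip" pasted into a 4 x 4 "scene".
scene = [[0] * 4 for _ in range(4)]
composite, boxes = embed_chips(scene, [("T72", [[1, 1], [1, 1]], 1, 2)])
print(boxes)  # [{'label': 'T72', 'box': (2, 1, 4, 3)}]
```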

Accuracy of Detection and Recognition
To evaluate the performance of D-ATR, which integrates the traditional four steps of SAR ATR as a whole system, we applied the method to the three-type and the ten-type target detection and recognition problems, respectively. In the three-type target problem, the three types of targets were BMP2, BTR70, and T72, and the numbers of images available for training and testing are listed in Table 2. As the number of available images for training is limited, some data augmentation methods, such as horizontal flip and random crop, are used in this paper. As the size of the test samples was 128 × 128, which was suitable to input into the network directly, fast sliding and NMSS were not used in this part of the experiment. Figure 8 shows a detection and recognition result for the three types of targets. Every target in each chip is surrounded by a box labeled with its category and a high confidence. The confusion matrix for the three-type task is shown in Table 4, and the confusion matrix for the ten-type task is shown in Table 5. The true target types are listed on the left and the predicted target types are shown on top.
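With rows as true types and columns as predicted types, per-class and overall accuracy follow directly from a confusion matrix, as in this sketch. The function names are hypothetical, and the counts in the example are made up purely for illustration; they are not the values of Table 4 or Table 5.

```python
def per_class_accuracy(confusion, labels):
    """Fraction of each true class (row) that was predicted correctly."""
    return {
        label: row[i] / sum(row)
        for i, (label, row) in enumerate(zip(labels, confusion))
    }


def overall_accuracy(confusion):
    """Correct predictions (the diagonal) divided by all samples."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Illustrative counts only (rows: true type, columns: predicted type).
labels = ["BMP2", "BTR70", "T72"]
confusion = [
    [9, 0, 1],
    [0, 10, 0],
    [1, 0, 9],
]
print(per_class_accuracy(confusion, labels))  # {'BMP2': 0.9, 'BTR70': 1.0, 'T72': 0.9}
```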

Performance on Large Scene SAR Images
To detect and recognize targets in a large-scene SAR image, a SAR ATR system first has to detect potential targets and isolate their regions from a complex background, such as river, sea surface, and forest. The isolated image chips are then fed to a classifier, which ultimately declares the recognized target type. To handle such a case, [22] used a two-stage network, with the first stage performing binary classification (i.e., detection) and the second performing recognition.
Most methods for SAR interpretation use a segmentation strategy to deal with large-scene SAR images. However, many detection and recognition methods are sensitive to the segmentation results, which easily leads to a worse result.
As for our method, every target in the scene only needs to be guaranteed to appear completely in at least one sub-image. In this paper, the D-ATR system is proposed to realize SAR ATR for large-scene SAR images, integrating the traditional four steps (i.e., detection, discrimination, feature extraction, and classification) as a whole system.
In this part, several large-scene SAR images were simulated from the publicly available MSTAR dataset to show the feasibility and performance of the D-ATR system. As shown in Figure 9, there are 21 targets randomly distributed in the scene. The ten types of targets are surrounded with boxes in different colors; for example, a target surrounded by a yellow box is a ZIL131. This 1478 × 1784 image is first cut into a series of 256 × 256 sub-images, and these sub-images are then input to the network sequentially, in order to detect and recognize the potential targets.
The result without NMSS is shown in Figure 9a. To suppress the redundant bounding boxes and retain the box with the highest prediction confidence for every target, NMSS is used in this part, and the result is shown in Figure 9b.

Comparison Experiments
To verify the feasibility and efficiency of the proposed method, several comparison experiments were conducted. As shown in Table 6, four different methods (i.e., CFAR+SVM [39], Region-based CNN [27], YOLOv2 [40], and D-ATR) were performed on one simple and one complex large-scene SAR image, respectively.
There are six parameters listed in Table 6: number of targets (No.Target) in the SAR image, number of correctly detected targets (No.Det), the proportion of targets that are correctly detected in all targets (Det Rate), number of correctly recognized targets (No. Rec), the proportion of targets that are correctly recognized in all detected targets (Rec Rate), and time consumption.

Table 6. Comparison of different methods for Figure 9 with a simple scene and Figure 7 with a complex scene.
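The two rates defined above can be computed as in the following sketch. The function name is hypothetical and the counts in the example are illustrative only; they are not entries of Table 6.

```python
def detection_and_recognition_rates(num_targets, num_detected, num_recognized):
    """Det Rate = detected / all targets; Rec Rate = recognized / detected."""
    det_rate = num_detected / num_targets
    rec_rate = num_recognized / num_detected if num_detected else 0.0
    return det_rate, rec_rate


# Illustrative counts: 20 targets, 18 detected, 9 of those recognized correctly.
print(detection_and_recognition_rates(20, 18, 9))  # (0.9, 0.5)
```

Note that the recognition rate is relative to the detected targets, so a method can report a high Rec Rate while still missing targets at the detection stage.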

Analysis on Detection and Recognition Accuracy
As shown in Table 4, the detection and recognition results on the three-type task are encouraging. However, the results for BMP2 and T72 were slightly worse than for BTR70, and several slices of BMP2 and T72 were mistaken for each other. The reason may be that the two kinds of targets have a similar turret and gun barrel, which makes them easily confused.
As shown in Table 5, the average accuracy of the detection and recognition results on the ten-type task is 96.5%. The accuracy of most types is higher than 94%.
Interpreting all 1365 SAR image chips of size 128 × 128 took 13 s in total, which is faster than the method that Wang [26] proposed to realize the integration of detection and recognition in the field of SAR ATR.

Analysis on Performance of Large-Scene SAR Images
From Figure 9a, it can be seen that all of the 21 targets are surrounded by several boxes. The reason for this situation is that the method of fast sliding makes it possible for each target to appear on multiple sub-images.
From Figure 9b, it can be seen that the results on the SAR image with a relatively simple scene are very good: all targets are correctly recognized and each of the 21 targets is covered by only one rectangular box, which means that the rest of the predicted boxes, with lower category confidence, were deleted.
Comparing Figures 9a,b, it can be seen that the proposed NMSS can solve the multiple confused detection problem effectively.
Finally, to show the performance of the D-ATR system on SAR images with more complex scenes, we test our model on Figure 7, in which there are more trees and bushes and 15 targets randomly embedded in the scene. The category and the corresponding number of each target is shown in Figure 7.
As shown in Figures 10a,b, each of the 15 targets is surrounded by a bounding box with high prediction confidence, which illustrates that the proposed D-ATR system, integrating target detection and recognition, performs well. No trees or bushes are interpreted as targets, which demonstrates the effectiveness of the features extracted by the DCNN. There is only one bounding box on each target in this large-scene SAR image, even though many of these targets appeared in more than two sub-images. This shows that NMSS is a useful strategy for dealing with prediction boxes among adjacent sub-images.

Analysis on Comparison Experiments
As shown in Table 6, the detection and recognition rates of D-ATR for targets in the large-scene SAR images were 100%, with a relatively low time consumption.
The proposed method outperforms CFAR+SVM, not only in the accuracy of detection and recognition, but also in time consumption. Compared with Region-based CNN, the proposed method had the same accuracy when detecting and recognizing targets in large-scene SAR images. However, the proposed D-ATR consumed 17.3 s and 16.5 s less than Region-based CNN on large-scene SAR images with simple scenes and complex scenes, respectively.
Additionally, it can be seen from Table 6 that the time consumption of YOLOv2 was 1.5-1.6 s, which was the lowest. However, some targets were missed by YOLOv2, with a detection rate less than 73.33%. A possible reason is that YOLOv2 divides the image into a 7 × 7 or 13 × 13 grid of cells, which may cause a target to be partitioned into several parts. These parts cannot be recognized as a target, which causes targets to be missed.
In summary, the proposed D-ATR can detect and recognize all targets in a large-scene SAR image accurately, and performed better than the other methods listed in the table.

Conclusions
The traditional SAR ATR mainly includes four steps: detection, discrimination, feature extraction, and target recognition/classification. However, these processes are independent, and the processing result from each step will affect the following one. Inspired by the recent success of deep learning methods in optical image processing, these problems in SAR ATR can be solved by encapsulating all computation into a single deep convolutional neural network. When a large-scene SAR image is directly input into the network, it may suffer from substantial information loss and object distortion due to resizing. The fast sliding method is proposed to cut a large image into a series of sub-images, which guarantees that every target is contained completely in one of the sub-images. NMSS is proposed to retain the best bounding box with the highest confidence for every target. Experimental results on simulated large-scene SAR images (with size 1478 × 1784) show that the recognition rate can reach 100%, with a time consumption of less than 7 s.