Broad Area Target Search System for Ship Detection via Deep Convolutional Neural Network

: High-resolution optical remote sensing data can be utilized to investigate the human behavior and the activities of artiﬁcial targets, for example ship detection on the sea. Recently, the deep convolutional neural network (DCNN) in the ﬁeld of deep learning is widely used in image processing, especially in target detection tasks. Therefore, a complete processing system called the broad area target search (BATS) is proposed based on DCNN in this paper, which contains data import, processing and storage steps. In this system, aiming at the problem of onshore false alarms, a method named as Mask-Faster R-CNN is proposed to di ﬀ erentiate the target and non-target areas by introducing a semantic segmentation sub network into the Faster R-CNN. In addition, we propose a DCNN framework named as Saliency-Faster R-CNN to deal with the problem of multi-scale ships detection, which solves the problem of missing detection caused by the inconsistency between large-scale targets and training samples. Based on these DCNN-based methods, the BATS system is tested to verify that our system can integrate di ﬀ erent ship detection methods to e ﬀ ectively solve the problems that existed in the ship detection task. Furthermore, our system provides an interface for users, as a data-driven learning, to optimize the DCNN-based methods.


Introduction
With the development of society and technology, measuring and monitoring human activity in the ocean area is becoming a topic of significance and increasing interest. In this topic, ship objects play an important role in both areas of the military and civilian, such as the maritime safety, marine traffic, border control, fisheries management, marine transport, etc. Meantime, the position and behavior information of the ship objects is the cornerstone of the marine domain awareness (MDA) [1], which has been defined as the effective understanding of any activity associated with the maritime domain. Thus, a better performance on ship objects detection can greatly promote the harmonious development of the human and ocean.
The purpose of the object detection is to find out the targets with more attention in the human's vision, then determine its location and category, which is one of the core issues in the field of computer vision (CV). Due to the similarity of the object features like background, texture, shape, etc., it is achievable to finish this task. However, it is still a challenging task because of the differences between target individuals.
Living in the age of rapid development of deep learning (DL), it will be able to design and realize an efficient object detection architecture for ship targets on the sea. The nature of DL is to extract and analyze the intrinsic law and feature representation method of the data, which is similar as the human learning process, by using the computers' powerful ability. Actually, the improvement of DL has made a revolutionary progress in the technology of the CV and machine learning (ML) [2], and promotes the The architecture like SSD and YOLO are collectively called the one-stage network because of its single neural network architecture. Additionally for this reason, one-stage networks are more efficient but slightly lower in accuracy compared to the two-stage networks. Fortunately, it brings more alternative in different detection tasks due to these differences.
In fact, the ships detection in the high-resolution optical remote sensing image is one of the representative applications based on DCNN, in which the ship objects present the characteristics of multi-scale, multiple directions, multiple shapes and a complex background environment. Thus, taking these characters into account, the ship detection task is commonly divided into the following several important steps [1]: Removal of Environment Effects, Sea-Land Separation, Ship Candidate Detection, and False Alarm Suppression. In addition, it is necessary to propose some effective measures to the specific problem at each step.
The first step: Removal of Environment Effects. The presence of environmental factors in optical images is an undesirable, but generally unavoidable fact. Some main factors significantly influence the ship detection accuracy such as clouds, waves and sunlight reflection. In recent years, some effective methods have been proposed to solve these environment effect problems. In 2013, a water/cloud clutter subspace is estimated and a continuum fusion derived anomaly detection algorithm is proposed by Daniel et al. [21] to remove the clouds. In 2014, Kanjir et al. [22] proposed a method based on histogram to remove the clouds, which takes advantage of the character of high value of the clouds in spectral bands. Further, Buck et al. [23], used a Fourier transform algorithm to remove the clouds. While the effects of waves and sunlight reflection often result in false alarms, it will be solved in the fourth step rather than in the first step.
The second step: Sea-Land Separation. An accurate sea-land separation is not only necessary for an accurate detection of ships in harbor areas, but also important because DCNN-based methods may produce many false alarms when applied in the land scene. Sea-Land Separation falls into two groups: By introducing extra data such as the coastline data or by generating the segmentation sign by itself. In 2011, Lavalle et al. [24], imported a GIS data to describe the line that separates a land surface from the ocean. In 2015, Jin and zhang [25] introduced a shapefile to describe the coastline data. However, these methods based on extra information are undesirable on accuracy and it is hard to update the data timely. In another group, many researchers used histogram data to discriminate the sea and land such as in the works of Li et al. [26] and Xu et al. [27], which take the differences of sea and land in the distribution of gray value into account. Though these methods have a desirable efficiency, its accuracy is unsatisfied. To solve this problem, Besbinar and Alatan [28] additionally used digital terrain elevation data to generate a precise sea-land mask by utilizing the zero values. Additionally, Burgess [29] used a heuristic approach for land masking based on observations of the relationships between the values in the two input images for sea and land. In addition to the above methods, there are still many methods to have a good effect on sea-land separation and reduce the influence of onshore false alarms.
The third step: Ship Candidate Detection. After the removal of environment effects and sea-land separation, an appropriate ship candidate detection algorithm will be applied, and it is the most essential part in the procedure of ship detection. In addition, the purpose of this step is to find out all regions containing the targets, of course, this needs to consider about the characteristics of different scenes. For the multi-scale ship, many algorithms often combine the RPN structure introduced in the Faster R-CNN with the special architecture that fuses the low-level and high-level features, such as FPN, DFPN [30] and SNIP, which have been mentioned above. Further, for multiple orientations of the ships, Jiang et al. [31] and Yang et al. [32] both proposed a method, training data with rotate bounding boxes, to obtain a real orientation of the target in the inference process. Except for these methods, there are many other works aiming at the specified problems, Zhang [33] combined the convolutional neural network and manual ship features to improve the accuracy, and Bodla et al. [34] proposed the Soft-NMS to relieve the problem of overlapping box suppression and improve the accuracy of the ship in dense scenes. The fourth step: False Alarm Suppression. The phenomenon of the false alarm is the result from the misjudgment of interferes during the detection process, and the false alarm usually exists in the background and has some similar characteristics with the ship objects. In fact, false alarms in land can be removed by using sea-land segmentation, and for false alarms in sea caused by waves and sunlight reflection, Yang et al. [35] proposed a structure named the MDA-Net to describe the saliency of the ship to suppress non-object pixels, and Zhang et al. [33] imported manual ship features to overcome the false alarm. Although some measures can suppress the false alarms moderately, it is still a main challenging problem in ship detection.
Therefore, facing the realization process of the ship detection task, a complete detection framework has been established by combining the existing defogging algorithm and two DCNN-based algorithms proposed in this paper, which are aiming at the problems of the multi-scale ship detection and onshore false alarm suppression. At the first step, an improved network based on the Faster R-CNN with the function of scene mask estimation is proposed to achieve the sea-land segmentation and object detection. In this method, the network is trained by special data that includes the mask information of the target scene, which differentiates the target area (sea) and non-target area (land). By using this network, we can effectively remove the land area and reduce a lot of false alarms onshore during detection. In another step, a saliency-based Faster R-CNN is proposed to deal with the problem of multi-scale ship objects in high-resolution remote sensing image. In this method, a saliency estimation network is used to extract the salient region which presents the ship objects, then an image pyramid is used to compress the images containing the large-scale ship objects, in order to make the size of ships in the real-time image more similar with those in the training dataset, and finally, it will improve the accuracy of the ship detection. At last, one feasible processing chain is formed, including the image pre-processing, image database, results display and manual review modules, and the entire processing system is called the broad area target search (BATS).
The remainder of this paper is organized as follows. The proposed BATS structure and its details are described in Section 2. Two DCNN-based methods proposed in this work are presented in Section 3. Section 4 will explain the mechanism of the interaction part between the user and system, which includes input, display and review modules. Experimental results on a lot of representative high-resolution remote sensing images are presented in Section 5 followed by the conclusions drawn in Section 6.

The Framework of Broad Area Target Search System
Different from the existing researches about the ship detection task, we not only focus on one specified algorithmic problem, but comprehensively consider various aspects existing in the task of the broad area ship detection, and then a processing chain of high-resolution data is established, named the broad area target search (BATS). In this framework, it includes image pre-processing, saliency estimation, scene mask estimation, ship detection, results presentation and manual review. The framework is shown in Figure 1, where a stack (or cube) indicates that the module contains several effective sub-steps, for example, the object detection step of BATS has some different implementations including the Saliency-Faster R-CNN, Faster R-CNN, FPN, etc.
Before elaborating on the details of each module in this framework, some explanations of the detection process will be given. Firstly, a remote sensing image with the resolution better than 1 m is called the high-resolution image in our method. Based on this resolution, we can get more information about the shape and context details of the ship, and these high-resolution images reflect multi-scale features of targets. It is also a main form of remote sensing data for the ship objects detection and a prerequisite for fine-grained identification. In BATS, each step operates on the high-resolution remote sensing images. Secondly, the step of the algorithms (D, E, F) in the flow chart can be executed independently to solve the specified problems during ship detection. Meanwhile, different methods can be selected according to characteristics of input data and the type of tasks. For example, step D and F can be used for the broad scene under the clouds without E. Finally, the data of the algorithms and system are all derived from a shared image database which is generated after data preprocessing. This database supports the data import, data and results storage for each step. Now, we focus on the meaning of each step and the corresponding implementation process. The first is the input of data (Step A of Figure 1). In BATS, the high-resolution remote sensing data is obtained from Google Earth or uploaded from users. This is followed by the preprocessing of the raw data (Step B of Figure 1). For the high-resolution and wide coverage images, it is necessary to balance the size of the input image with the GPU performance. The most common processing method is to cut the input image into fixed-size patches (such as 1024×1024) for prediction. In order to ensure the integrity of the ship targets, we use an overlap cutting and overlap rate can be controlled. The data augmentation process includes some image operations such as multi-angle rotation, color contrast control and addition of random noise to increase the diversity of samples. The results of the preprocessing of input data are shown in Figure 2. For the structure management of the sample, the data is stored into the database after augmentation, which is convenient for the subsequent module.

After
Step B, the database is utilized to store the real and public datasets separately for multiple utilization (Step C in Figure 1). In this database, the information of the training set and test set are stored in different tables. The information table of the training dataset contains the storage path, image size, file name, position in the original input, etc. The information table for the test dataset contains the storage path and processing results after testing.
In the following, the ship detection is illustrated from three aspects: Imaging factor, environmental factor and the ship target itself. First, for input scenes, the image quality should be further analyzed. In the task of the ship objects detection, the cloud occlusion has a huge impact on the target imaging. Severe cloud occlusion will directly lead to the blurring and disappearance of ship targets in this scene, as shown in Figure 3. Therefore, the effective measures should be taken to mitigate the cloud occlusion problem in such a scene (Step D of Figure 1). The usual algorithms for clouds removal mainly have two types: Physical model defogging and image enhancement for defogging. Among these methods, the typical physical model algorithm is the defogging in the dark channel based on the dark channel prior [36], and the most common traditional image enhancement is the contrast limited adaptive histogram equalization [37], in which by calculating the histogram of multiple local regions of the image to redistribute the brightness and contrast to realize defogging. With these effective defogging methods, the DCNN inside BATS can effectively classify the foggy scene from the fog-free scene, and then uses the classic algorithm to process the foggy images. After defogging, the environment factors of the ship target are considered in Step E. In the task of ship detection, there are many ships mooring in ports or sailing in near-land areas. However, the complex terrestrial environment may easily mislead the feature extraction of DL algorithms and result in a lot of onshore false alarms. Therefore, there are some papers for ship detection that use sea-land separation to remove land false alarms. In this paper, we propose a novel false alarms suppression method based on DCNN, which use the powerful ability of feature extraction to segment the sea and land. The specific structure will be explained in Section 3.2.
The factors of the ship target itself should be focused to improve the ability of the ship targets detection (Step F of Figure 1). The size of the ship targets in the high-resolution remote sensing images varies greatly, while the size of the targets in the commonly used training dataset, such as DOTA [38], NWPU VHR-10 [39] and UCAS-AOD [40], are mostly similar and large-scale targets are rare, which results in a poor performance for large-scale ships detection. So far, some methods are proposed for concerning the problem of targets size, most of which use the combination of multi-layer feature layers [15,18,30] to retain the feature of different scale targets, and to alleviate the difficulty of the multi-scale targets detection. There is also a study that compresses the target in the image to a fixed size to enhance detection performance [17]. In our processing framework, a multi-scale ship detection method based on the saliency feature is proposed. This method adaptively compresses scenes containing large-scale ship targets, which improves the detection performance of remote sensing images in reality. The specific structure is explained in Section 3.3.
After ship detection, the result patches are aligned to the grid of the original image and the coordinates of the predicted boxes are converted into the original image coordinates. This step is known as data post-processing (Step G of Figure 1).
The post-processed data is stored in the database for the call of the display module. In the display module of BATS, the pixel coordinates of the predicted results are converted into geographic coordinates and the results are displayed in Google Map. In addition, the display module also provides an effective feedback mechanism, and that is the manual review for the ship detection algorithm (Step I of Figure 1). Considering that the DCNN-based methods cannot guarantee the detection result to be completely correct, the wrong result is selected manually and retrained to make up for the shortage of training samples, and then achieving the purpose of the iterative optimization of the algorithm. The structure and implementation of the display module will be explained in Section 4.3.

Ship Objects Detection Methods Based on DCNN
This chapter highlights the introduction of the DCNN-based ship detection algorithms proposed and applied in the BATS system which focuses on the onshore false alarm suppression and multi-scale ship object detection, respectively, in high-resolution remote sensing image. The first part would introduce the basic object detection network, which is used in both methods proposed in this paper. The second part would initially introduce the algorithm named as the Mask-Faster R-CNN focused on the problem that the near shore ship objects detection is disturbed by the onshore false alarms. This algorithm is combined with the scene mask extraction network that suppresses land information in feature maps to avoid the occurrence of onshore false alarms. Then, the algorithm concerning the problem of the multi-scale ship detection is demonstrated in the second part. In this method, a saliency estimation network is used to differentiate the scene containing large-scale objects from others, and then an image pyramid is used to compress these images so that the scale of the ships in real images is similar to the samples in the training dataset, and this improvement might have a good performance. Thus, this method is called the Saliency-Faster R-CNN.

Object Detection Network from Faster R-CNN
The object detection network adopts the classic two-stage detection process, which originated from the Faster R-CNN [14] by using ResNet-101 [41] as the feature extraction network (FEN) to implement the feature extraction of high-resolution remote sensing images. Then, the region proposal network (RPN) is used to conduct the candidate region proposal to obtain the high-confidence region proposals and corresponding coordinates on the feature map obtained by FEN. At last, it puts the region proposal and corresponding feature map through ROI pooling operation [14] to obtain the feature map block in the fixed size, then the classification and regression of the region proposal calculated by RPN is conducted to obtain a more accurate object class and coordinates and thus this process is known as the classification and regression network (CRN). The structure of the object detection network is shown in Figure 4. The ResNet-101 network is made up of 99 convolutional layers, which consist of four ResNet blocks [41] and two fully-connected layers. Due to its special block structure, it could extract the deep feature in the image without the problem of the vanishing gradient, which makes it extremely appropriate for feature extraction of ship objects in the complicated environment. In order to import this network into the object detection network, we remove the last two fully-connected layers and divide it into two parts: The first part is used as the FEN, which include the first three ResNet blocks and the second part (the fourth ResNet block) is used after the operation of the ROI pooling. The configuration of FEN is shown in part one of Table 1 and the configuration of another block in part two. Table 1. Configuration of the Resnet-101 in object detection network.

Block
Output Size Layers Layer parameter

Part2 (used after ROI Pooling)
Average pooling 2×2, stride = 1 In the object detection network, the most important structure is the RPN, which is used to propose regions that have a high-confidence and are closer to the ground-truth of targets by using some fixed-size anchors [14]. However, this process is controlled by the training dataset and the loss of classification and regression, which means the region proposal mechanism of RPN relies on the samples in the dataset. Therefore, it is hard to differentiate the objects with similar characteristics of targets in the non-target area. Furthermore, assuming that there are many small-scale objects in our training dataset, RPN is restrained so that it only responds to the small-scale targets while the large-scale targets cannot be fully covered by the insufficient anchors. Theoretically, the anchors and the subsequent regression process in the RPN prefers to fit the small-scale objects, and thus the large-scale targets are ignored via the trained RPN. The mechanism of RPN is shown in Figure 5. After RPN, a ROI pooling operation is used to resize all region proposals with 14×14 (same with the input of the fourth block in ResNet-101). Then, the feature in each proposal is used to finish the classification and regression to get the target class and coordinates.

Mask-Faster R-CNN for Suppression of Onshore False Alarms
In the ship detection process, because of the complexity of the land surface, there are amounts of objects that have a similar feature with the ships, which results in the fact that these objects might be regarded as candidate objects and influences the accuracy of the DL algorithm. Therefore, in terms of the problem of onshore false alarms, we put forward a ship detection method based on the estimated scene mask, known as the Mask-Faster R-CNN. This method could assist the object detection network by knowing target and non-target areas. Then, the region proposals are obtained only in the target area by using RPN and classified by using the softmax classifier to eventually implement the detection of the ship objects. As a mechanism to accurately judge the region proposals, the estimated scene mask could be used to suppress false alarms in the non-target area, such as harbor area and island, which reduces the redundancy of the region proposals, and eventually increases the accuracy of detection. The flow chart of the training process is shown in Figure 6. The Mask-Faster R-CNN consists of four sub-networks: The FEN based on Resnet-101, its parameters is presented in Table 1; a branch of the scene mask extraction network (SMEN) is inserted to estimate the scene mask by using the deconvolution operation; the RPN and CRN are used to decide the class and coordinates of the targets. Additionally in this method, the key point of suppressing the onshore false alarms is to eliminate the information of the non-target area (i.e., land) which is realized by SMEN.

Scene Mask Extraction Network
In this method, the role of the scene mask extraction network is a pixel-level classification between the target area (sea) and non-target area (land) to eliminate the disturbance of the non-target area in the forward propagation stage, which is the core idea in the onshore false alarm suppression. Since the boundaries of shores, especially in the harbor area, usually appear to be irregular and have a variety of shapes, it is necessary to classify each pixel in the input image in order to achieve accurate segmentation results. Deconvolution is an up-sample mechanism, which could operate the convolution process and restore the original size of the image at the mean time. By using the combination of the convolution, down-sample and deconvolution operations [42], we could implement the classification on each pixel (i.e., semantic segmentation in the context of DL). Therefore, the function of SMEN is a separation of the sea and land. At first, the scene label is introduced as a binary mask between the target and non-target areas in our dataset. Next, we add four deconvolution layers after FEN, and then the intermediate feature map goes through a softmax layer, which calculates the result of the non-target area (i.e., land scene). Additionally, the configuration of SMEN is shown in Table 2. Considering the real circumstance, for example the harbor area, parked ships are common near the shore, a low accuracy in boundary segmentation would result in the problem that the sea area may be classified as the land by mistake, which makes the detection stage abandon the ship targets covered by this inaccuracy land mask. In order to effectively extract the scene mask and avoid disturbing the parked ship objects, it is necessary to further optimize the accuracy in the segmentation. Therefore, the low-level feature map of SMEN is imported into the deconvolution layer by using the concatenate operation to optimize the accuracy of the estimated scene mask. To elaborate, the specific method is to concatenate the outcome feature map of each block in FEN with the same-size deconvolution layer, and then put the concatenated feature map through a convolutional layer in order to implement a further feature extraction. According to the operations above, a more accurate boundary information and more abstract semantic information are presented in the high-level feature map.

Training Process
In the training process, a multi-task loss function is established to restrain the training. The loss function is calculated by forward propagating in each iteration, and the parameters in all sub-networks are updated by using the gradient descent method [5]. When the minimum of the loss function is obtained, eventually the network is convergent.
The loss function of SMEN, denoted by L mask , could be represented as the cross entropy of the scene mask obtained in the network and ground truth label for each training image, and the expression is shown by Formula (1).
where R × C is the size of the scene mask, M (i,j) is the prediction probability of the scene category of the pixel (i, j) in the image, and T (i,j) , is the ground truth of the pixel (i, j) with the value of zero and one, representing the scenes of the land and sea, respectively, and the size of the image is R × C.
The loss function of classification in the object detection network could be represented as the cross entropy of the classification outcome and its ground truth, which is denoted by L cls , as shown in Formula (2): where O is the predicted probability being a target of box p, C is ground truth of the target, and it equals one when the label is positive, and is zero when the label is negative. N cls is the number of boxes. The loss function of the bounding box regression could be represented as the smooth L1 , which is the entropy loss between the region proposals and the ground truth box closest to it, denoted by L reg , as shown in Formula (3) and Formula (4).
where N reg is the number of box locations, β is utilized to balance the process of classification and regression. reg (p) = reg (p) x , reg  (5).
where λ 1 and µ are hyper-parameters to balance the loss of classification, regression and mask estimation.

Inference Process
There is a considerable difference between the inference and training process. The scene mask estimation and the object detection are driven by the above training, however, without any information interaction, while in the inference process, the outcome of the scene mask extraction network needs to exert an impact on the object detection network. Initially, the input image goes through the FEN and SMEN to obtain the information that represents the scene mask, which is a binary matrix with the value of one for the target area and zero for the non-target area.
where S is the estimated scene mask with the value of zero or one obtained from SMEN, and the Ω non−target is a set of pixels belonging to the non-target area (i.e., land).
Then, after compressing this scene mask into the size of the final feature map, i.e., the output of FEN with c channels, it will be multiplied with each channel of the feature map obtained from FEN, which would set the value of the non-target area in the feature map to zero, eliminating the effect of this area in the subsequent RPN process. In addition, this process can be formulated as: where represents the Hadamard product, and F p is a certain channel of the feature map obtained from FEN. Then V p is employed in the RPN to effectively reduce the false alarms in the non-target area, and the inference framework is shown in Figure 7.

Saliency-Faster R-CNN for Multi-Scale Ship Detection
It is usually difficult to get rid of the influence from the training dataset pattern in the object detection task by using the DL algorithm, and unfortunately, the same problem more obviously exists in the ship object detection in a high-resolution image. Most of the available training samples of the ship reflect the small-scale targets, which leads to the fact that the DL algorithm could not correctly learn the feature of the multi-scale ships, especially some large ships. In other words, the trained DCNN may be limited by the small-scale ships in the training dataset. Therefore, it is necessary to adopt an effective scale adjustment mechanism to implement the accurate multi-scale ship detection. On account of the difference between the training dataset and the real scene with the large ship, an object detection method based on the saliency estimation network, known as the Saliency-Faster R-CNN, is proposed in this paper which combines the object detection network with the saliency estimation network. By making use of the saliency estimation network to describe the object scale of remote sensing images, it divides inputs into the scene with the large-scale object and the scene with the small-scale object. Based on this division, it constructs an image pyramid to compress the images and adjust the input close to the sample scale and increases the detection ability of the object detection network for multi-scale ships (large-scale object, especially). The overall framework of the network is shown in Figure 8, where the object detection network has been described in the Mask-Faster R-CNN.

Saliency Estimation Network
In the Saliency-Faster R-CNN, the purpose of the saliency object network is to extract the saliency feature of the object and further distinguish the scenes containing large-scale or small-scale objects. Since the current DL algorithm is considerably dependent on the pattern of the samples, it is difficult to effectively implement the large-scale object detection task in some real scenes by using the small-scale object dataset. Therefore, we adopt the traditional digital image processing method to construct the saliency estimation network, which could extract the saliency feature of objects.
In this saliency estimation network, we first process the high-resolution remote sensing image and obtain the saliency feature of ships by using the saliency estimation algorithm. Based on this, the gray values in the saliency map obtained after the saliency estimation are analyzed by using the histogram method, and then estimating the saliency degree of the ship scale. Second, combined with the statistical results, we construct a mathematical model which could distinguish the large-scale ship and small-scale ship scenes. The framework of the saliency estimation is shown as the input in Figure 8.
It is worth noting that there are many saliency estimation algorithms, and thus we studied the ability of different saliency estimation algorithms for high-resolution images. Comparing and analyzing the performance in real high-resolution images of several classic algorithms which can be found in [43], including Saliency Intensity Model (SIM) [44], region Covariance (COV) [45], Fast and Efficient Saliency detection (FES) [46], Segmenting object detection (SEG) [47], Spectral Residual approach (SR) [48], Saliency Using Natural statistics [49], Spatially Weighted Dissimilarity (SWD) [50], and the SIM is assured as the algorithm adopted in saliency estimation network. The saliency map outcome and histograms analyzing the outcome are shown in Figure 9.
According to Figure 9, we could infer that the results of the SIM algorithm are able to appropriately present the energy distribution of objects in the high-resolution remote sensing image, and could effectively distinguish the scenes whether containing the large-scale ships or not. The describing factor of the target scale is shown in Formula (8).
where X represents the statistical feature information, and L u , L l represent the upper and lower bounds of left peak in the histogram, R u and R l are also boundaries of the gray interval for the right peak, and T is the threshold. When f scale = 1, the input image is interpreted as a scene with large-scale ships. It is found in the experiment that the area with the low gray value in the saliency map is related to the background information, while the area with a high value is related to the object information. It is worthy to note that this series of parameters are estimated in the ship detection process, so it should be adjusted dealing with the size of the input image in different circumstances.   [44], region Covariance (COV) [45], Fast and Efficient Saliency detection (FES) [46], Segmenting object detection (SEG) [47], Spectral Residual approach (SR) [48], Saliency Using Natural statistics [49], Spatially Weighted Dissimilarity (SWD) [50]; (b) the corresponding histogram of the saliency feature map.

Training and Inference
In the training process, the saliency estimation network is operated by the traditional processing method, it is not necessary to train on a specific dataset. Therefore, there is no need to participate in the training process. Thus, the total loss function is made of the weighted sum of the bounding box loss function and classification loss function, and the expression is shown in Formula (9). Loss = L cls + λ 2 L reg (9) where L cls and L reg are elaborated in formula (2) and (3), and λ 2 is a hyper-parameter to balance the loss of classification and regression.
In the inference process, based on the saliency map obtained by the saliency estimation, the scenes including large-scale ships or small-scale ships are distinguished. Combining with the mechanism of the image pyramid, it achieves the compression of images including large-scale ships, and implements the scale consistency of ships between real scenes and training samples. Then, it imports the compressed images into the object detection network to conduct the ship objects detection.

The Construction of the BATS System
This part will mainly introduce the structure, function and display interface of the BATS system, which belong to Step A, Step H and Step I of Figure 1, respectively. Each interface in the front end is inseparable from the architecture of the system function. It is an essential tool for the human-computer interaction and an effective measure to enhance performance. Aiming to detect ships over a broad area automatically, by using high-resolution remote sensing images collected from Google Earth or uploaded by users, a visual and interactive system, i.e., BATS, is developed based on the WEB technology and DL model.
The BATS has two work components: The front end and back end. The front end is developed based on the Vue.js framework [51] and combined with Google Map API [52] for the data input and data display. The back end is developed based on Django [53] and uses SQLite [54] as the database. The service of the back end includes the database interaction, image preprocessing and calling DL models. Figure 10 shows the architecture of the BATS system.

Broad Area Search Module
The broad area search module is the main function in BATS for ship detection in the wide coverage and high-resolution remote sensing images. In this module, we use the wide areas in Google Earth as an instance to display the detection results (Step A and Step H in Figure 1). The interface of the broad area search module is shown in Figure 11. As illustrated in Figure 11, on the left side of this interface, there is a window displaying high-resolution remote sensing image with wide coverage, and this broad area has many regular areas of interest (AOIs), in which the user can scan and select one or more AOIs to search the ships. After clicking the search button, the red markers of the targets are shown in this window and the thumbnails of the predicted results of the cropped images are displayed in the right side of the interface. The user can click on the thumbnail to view the image details.
The realization process of this module is as follows: (1) A broad region on Google Map is divided into several large regular AOIs which correspond to some remote sensing images. (2) Users select the AOIs to search ship targets. (3) The system begins pre-processing (format conversion, image segment) the corresponding images, and calls the DL model for ship detection, and records the position of the marked targets. (4) The markers of the targets and thumbnails of the predicted results will be displayed on the page. The data flow of the broad area search module is shown in Figure 12

Target Detection Module
The target detection module is mainly to meet the inference requirements of individual images (Step A and Step H in Figure 1). The interface of the target detection module is shown in Figure 13. There is the upload file function located on the left side of this page, in which the user can drag a single image or folder into the box to upload. The results of the inference will be displayed by the carousel on the right side of the page, and the user can switch the current displayed images. In the target detection module, we first determine whether the type and size of the uploaded images are correct and call the pre-processing procedure and DL models for the ship target detection. Finally, the results of detection are displayed in the web page. The data flow in the target detection module is same as Figure 12

Manual Review Module
The purpose of the manual review module is to artificially inspect the results after inference, which is relative to step I in the architecture of BATS. In fact, it is necessary to conduct a manual review because the problems inside the network are directly reflected on these results, and this could give us a lot of chances to develop effective improvements.
The interface of the manual review module is shown in Figure 14. There is a presentation area on the left side of this page, in which the user can preview the result image and then vote for problems contained in this image shown on the right side. After that, the result with an error will be stored into the database for the process of subsequent relabeling and training. During the steps of the manual review, the data flow is shown in Figure 15.

Results
This section is devoted to present the results of the different DCNN-based algorithms on their aspects of concern and the results of the entire system in the real high-resolution remote sensing image. During the experiment, the validity is carried out by using the same infrastructure and evaluation indicators, then, we analyze the result and compare it to baseline method, i.e., the Faster R-CNN, to validate the effectiveness of our methods. At last, some additional discussions about the system application environment and scene are presented.

Introduction of Dataset
In this paper, DOTA [38] is utilized as the training dataset, and the resolution of most images in this dataset is better than 1 m. In DOTA, there are 434 scenes of different sizes containing multi-scale ships, it is fortunate that this distribution, i.e., small-scale ships occupy the majority of the dataset and most of them are close to the harbor in these scenes as shown in Figure 16, accords with the problems that have been solved in this paper. Eventually, 400 images are obtained from the dataset by manual selection (excluding the relative low-resolution images) and labeled with mask information in order to be able to train in both proposed DCNN algorithms. The background information and target categories of these images are complex, which include the different lighting conditions and cloud occlusions, and the dataset contains a variety of ship categories such as the cargo, cruise, carriers, etc. It ensures the diversity of the samples. The testing dataset consists of some images randomly picked from DOTA, and some remote sensing images collected from Google Earth are chosen as the testing images, in which the resolution is 0.3 m. Therefore, the images with labeled targets in our dataset are shown in Figure 17, in which the green box represents the ground truth of the ship target, and the white part represents the land (i.e., non-target) area while the black one is the sea in the mask map of the ground truth. In order to keep the consistency of the input in networks and the effective training process under one GPU, the images in the dataset are cut into the fixed size of 1024×1024 with a 20% overlap rate, and thus more than 1600 images are generated in the pre-processing step. Then, the images from DOTA are split into the training data and testing data with the rate of 8:2. Eventually, we form the testing dataset by merging about 310 images from the DOTA testing data and 152 images collected from Google Earth after pre-processing. Further, the augmentation operations have been operated to improve the generalization ability of networks.
All models are built on a Tensorflow DL framework and calculated on the GeForce GTX 1080 Ti GPU. As a result of the same structure of RPN in both proposed networks, we set its parameters to be the fixed value that the scale is in (4,8,16,32,64) and the aspect ratio is in (0.5, 1, 2). Additionally, all models are trained with the initial weight value based on ResNet-101 and overall training steps are 80,000.

Results of Multi-Scale Ship Detection
A set of comparison experiments has been conducted in this part, which compares the Saliency-Faster R-CNN proposed in this paper to the baseline method (Faster R-CNN) to validate the accuracy and superiority of our method. In fact, SVM [55], as a classic machine learning method, can also be combined with DCNN for classification, however, relevant studies have proved that the classification accuracy of SVM is slightly less than the softmax used in the Fast R-CNN [13], Faster R-CNN [14] and Bengio's research [56], in addition, the method based on SVM is not an end-to-end system, that is the reason for choosing Faster R-CNN as the benchmark method in our paper. In these experiments, we select the images containing the large-scale ship as an independent dataset to highlight the advantages of our method, and during the testing process, we adopt L l = 30, L u = 100, R l = 130, R u = 170 and T = 2000 to calculate the describing factor of the target scale (see Formula (8)). The value of the average precision (AP) is used to indicate the performance of detection results, which is shown in Table 3. The value of AP L represents the AP index on the large-scale ship dataset. The NCR (not compressing rate) represents the rate of failure to differentiate the ship size, i.e., not properly compressing the input image containing the large-scale ships, during the process of saliency estimation.  It should be noted that the AP of the ship by using the Faster R-CNN is only 0.504 in the literature of the first proposed DOTA [38], but our baseline method here has a value of 0.606, the reason for the improvement of accuracy is not only the effectiveness of our processing but the single category detection in our experiments. In Table 3, the AP L of the Saliency-Faster R-CNN is 0.727, which is much better than AP L of the Faster R-CNN, and it is more obvious than the improvement in AP. Actually, this is a reasonable result because the number of large-scale ships in the dataset is much less than the others, it is hard to promote a lot on AP. Therefore, on the basis of maintaining the detection accuracy of small-scale ships, our method also has a significant improvement on the detection of large-scale ships, which is an important aspect in the ship detection task by using the high-resolution remote sensing image.  Table 3, the NCR of the Saliency-Faster R-CNN is only 0.105, which fully demonstrates the effectiveness of the saliency estimation. Some testing images and their saliency maps are shown in Figure 18. The intuitive detection results of the baseline and our method are shown in Figure 19, and it is obvious that the large-scale ships can be accurately detected after the compressing process in our method.

Results of Onshore False Alarm Suppression
In order to validate the accuracy of the Mask-Faster R-CNN, the experiment between the Faster R-CNN and our method has been realized. AP, utilized in the above experiment, and two extra indicators known as the false rate and mean intersection over union (mIOU) are introduced to evaluate the detection results. The assessment is presented in Table 4, where the false rate indicates the false alarm rate in the detection results, and the mIOU demonstrates the performance of the scene mask estimation in the Mask-Faster R-CNN. In Table 4, the AP of the Mask-Faster R-CNN is 0.628, which is 2.2% higher than that of the Faster R-CNN, and at the same time, the value of the false rate has a huge reduction from 0.686 to 0.397. Obviously, when the land (non-target area) information is reduced during the detection process, some improvements on the false alarm and accuracy are obtained. This also demonstrates that our method can effectively suppress the onshore false alarms and promote the nearshore ship detection. Furthermore, we get a value of 0.877 on mIOU, which means the scene mask predicted in the inference process is very similar to the ground truth, and it can prove that this network is able to differentiate the target area (sea) and non-target area (land). Some inference results of the Faster R-CNN and Mask-Faster R-CNN are shown in Figure 20, in which the yellow region represents the predicted scene mask, the red boxes represent the ship detection results, and the blue boxes represent the existing false alarms. In these results, the Mask-Faster R-CNN effectively reduces the number of onshore false alarms.

Discussions
The above experiments are designed to analyze the accuracy of two DCNN models independently. Then in this part, an analysis on the effectiveness of the system and cascading model (connecting the Saliency-Faster R-CNN with Mask-Faster R-CNN) is presented. First, the results of the cascading model are compared with each DCNN model separately to validate the accuracy and effectiveness of the cascading model. The evaluation results of the different models are shown in Table 5, where time represents the average inference time on each image with 1024 × 1024.
According to Table 5, the AP of the cascading mode (the fourth row) is 0.629, which has a little improvement compared with the Mask-Faster R-CNN, and it indicates that the cascading model really fuses the advantages of two DCNN models proposed in this paper. However, the consumption time of both the DCNN model is larger than that of the Faster R-CNN, which only has 0.173 sec/image, while the time of the Saliency-Faster R-CNN reaches 2.181 sec/image because of the saliency estimation in this network, so that the time of the cascading model even reaches 2.233 sec/image. Objectively speaking, this is unbearable in real-time detection tasks of massive data, but it still can be used as a time-accuracy balance choice because of its improvement on the large-scale ship detection.
Then, an evaluation of the efficiency on the entire system is conducted by using some high-resolution remote sensing images not in the testing data set. The environment of evaluation consists of one GPU, a communication network with 1000 Mbps, one CPU with quad core and eight threads. The results are shown in Table 6. In Table 6, the upload represents the time consumption when uploading the images by the site designed in this paper, the pre-processing represents the time consumed in pre-processing, which includes images cutting with the fixed size and its position record in the original image, the post-processing represents the time consumed by dealing with the results after the detection process, which includes the coordinates transformation of boxes in the original images, the removal of repeated detection results by using the NMS operation in the overlapping area, finally the Num of Images represents the total number of images by using the overlapping cutting. According to statistics in Table 6, the times of processing and transferring are very short compared to the detection time, it will not influence the efficiency of the entire system and can be accepted.
In addition, in order to demonstrate the superiority of the cascading model and each individual model on different conditions, the curve of the precision-recall (PR) for each DCNN model is drawn in Figure 21.  Figure 21, in the same value of the recall, the precision from high to low is the cascading model, Mask-Faster R-CNN, Saliency-Faster R-CNN and Faster R-CNN. The result indicates that not only our two individual models can improve the accuracy of ship detection on their respective different concerns but the cascading model has a comprehensive improvement because of its fusion of two DCNN models.
In addition to the harbor scenes, we have encountered some scenes which contain icebergs located in the Arctic, and it is frequent to be some false alarms near or on iceberg during ship detection. Fortunately, our method still has a good performance on these scenes. Some results are shown in Figure 22, and the Faster R-CNN is deteriorated because of the false alarms, however, the Mask-Faster R-CNN avoids such a problem by using the estimated scene mask. In order to prove the applicability ability of BATS, we uploaded some high-resolution panchromatic remote sensing images via the target detection module, which are obtained from Gaofen-2 (GF-2) and the resolution of images is 0.8 m, and the results are shown in Figure 23. It is obvious that our method still has a good performance compared with the Faster R-CNN, even the type of input data is different from our training set. The generalization performance of the proposed system is verified to some extent.  Finally, one broad-area high-resolution remote sensing image is used to demonstrate the effectiveness and accuracy of our BATS for the complex scene. The size of the input image is 10656 × 7130, which is collected from Google Earth and it is one part of the Norfolk harbor, and the resolution of image is 0.3 m, which means it can nearly present a Nimitz-class aircraft carrier. The detection result is shown in Figure 24.  It can be seen that in (a) of Figure 24 almost each ship can be exactly detected in this complex scene, and some huge ships, for example the Nimitz-class aircraft carrier close to the harbor, which is shown in (e), can be marked through the Saliency-Faster R-CNN. Meanwhile, a dense ships scene is located on the right side of the image, and its scaled up image is shown in (b), and each small-scale ship has been detected, and it is amazing that some onshore alarms are suppressed, which means our Mask-Faster R-CNN plays its role in the system. At the same time, (d) and (f) obtained from the Faster R-CNN are the contrast of (c) and (e), respectively, which indicates that our method suppresses the onshore false alarms and detect the large-scale ship. Therefore, the BATS system has a good detection performance on the broad area.

Conclusions
In order to implement the DCNN-based ship detection in the broad area by using high-resolution remote sensing images, a ship detection system consisting of data pre-processing, core algorithms based on DCNN, results display module and manual review module is proposed in this paper, known as the broad area target search (BATS) system. The ship detection algorithms in this system are the key step in the ship detection task. For the problem of object scale differences between the training dataset and real scenes, a method known as the Saliency-Faster R-CNN combining the saliency estimation network with the object detection network is proposed in this paper, which uses the saliency feature to describe the scale of the ship inside the image, then the image pyramid is utilized to compress the input image that includes the large-scale objects so that the ship scale in the real high-resolution image is close to the scale of samples in the training dataset, and eventually it improves the detection accuracy. Additionally for the problem of onshore false alarms, a method named as Mask-Faster R-CNN is introduced, which inserts a branch of the scene mask estimation network into the object detection network to learn how to discriminate the target area (sea) from the non-target area (land), and then cover the land area with a zero response during the RPN process. This method can effectively reduce the appearance of onshore false alarms. Furthermore, this system also incorporates the capabilities of the data structure management of database technology to realize the results storage and display of each step inside the system, and meantime, it also provides users an interface to review the results and iteratively optimize the core DCNN-based algorithms. The processing chain based on the database unit constitutes an important foundation of the data interaction between the BATS system and users.
Through the independent experiment of each DCNN algorithm, the validity of each part has been verified. Firstly, we validate the Saliency-Faster R-CNN and obtain its superiority compared to the Faster R-CNN on the large-scale ship detection. Then, the Mask-Faster R-CNN is verified to have its greatly improvement compared to the Faster R-CNN on the suppression of onshore false alarms. In the last part, two DCNN-based algorithms are combined to validate the accuracy and efficiency of the entire system. Actually, the results are really satisfied either on the independent experiment or on the cascade experiment. To summarize, the framework proposed in our research provides a series of effective solutions for ship detection in the high-resolution remote sensing image.
At last, taking the data-driven and complexity of the ship detection task into account, the scalability of the BATS system is increased by using the database and the way of algorithm cascading so that it inserts more DCNN-based algorithms and constantly improves the performance of ship detection. In future work, some targets of interest in high-resolution remote sensing images, such as the airplane, vehicle, storage tank etc., will be concerned in the BATS system. The DCNN models proposed in this paper need to be continuously optimized according to the diversity of targets and scenes in the detection task. Furthermore, considering the all-time and all-weather capability of the synthetic aperture radar (SAR) images, it can be used as a powerful complement to the optical remote sensing images to achieve an accurate target detection under the condition of the heavy cloud and fog and inadequate optical data.