Ship Detection in Optical Remote Sensing Images Based on Saliency and a Rotation-Invariant Descriptor

Major challenges for automatic ship detection in optical remote sensing (ORS) images include clutter from clouds, waves, islands, and wakes, as well as the high variability of the targets. This paper presents a practical ship detection scheme to address these issues. The scheme contains two main coarse-to-fine stages: prescreening and discrimination. In the prescreening stage, we construct a novel visual saliency detection method based on the difference in statistical characteristics between highly non-uniform regions, which indicate regions of interest (ROIs), and homogeneous backgrounds. The resulting saliency map serves as a guide for locating candidate regions. In this way, not only can the targets be precisely detected, but false alarms are also significantly reduced. In the discrimination stage, to obtain a better representation of the target, both shape and texture features characterizing the ship target are extracted and concatenated into a feature vector for subsequent classification. Moreover, the combined feature is invariant to rotation. Finally, a trainable Gaussian support vector machine (SVM) classifier is applied to separate real ships from the ship candidates. We demonstrate the superior performance of the proposed hierarchical detection method through detailed comparisons with existing methods.


Introduction
Sea target detection has been a long-standing topic in the field of remote sensing image processing for several decades due to its wide range of applications, such as fishery management, vessel traffic services, monitoring of illegal oil spills, naval warfare, and maritime activities. From the perspective of data sources, ship detection can be roughly classified into three domains: synthetic aperture radar (SAR) images, infrared (IR) images, and visible remote sensing (VRS) images. Because SAR can image day and night regardless of weather conditions, SAR-based methods have expanded greatly and achieved impressive performance. However, the invisibility of small and wooden boats in SAR images may result in detection failures. Besides, the lack of color and texture features makes SAR imagery unsuitable for recognizing ship targets. IR images are employed to enhance the visual effect in weak light conditions, but they also have some drawbacks, such as poor signal-to-noise ratio, insufficient structure information, and varied gray levels [1]. Compared with SAR and IR images, the VRS images investigated in this paper are more intuitive and capture more details and complex structures of an observed scene, which can be further used in target recognition. However, the above-mentioned properties of VRS images complicate the background and pose three main challenges to ship detection: [...] of Fourier Transform) followed by a homogeneous filter to extract the candidate regions; and Xu [16] constructed a combined saliency model with self-adaptive weights to prescreen the ship candidates. The frequency-domain saliency models mentioned above perform better (especially in highly cluttered backgrounds) than the spatial-domain saliency methods [10,11]. However, they also have some drawbacks, such as the low resolution of the saliency map, low target integrity, and blurring of the target boundary.
The frequency-tuned saliency detection method [17], which can obtain full-resolution saliency maps and well-defined object boundaries, was applied by Wang [18] to extract the target regions. Although these models have a low missed detection rate, they still suffer from the interference mentioned earlier and produce a high number of false alarms. Therefore, it remains vital to study fast and efficient ship detection methods that can make the targets pop out and suppress the distractors in complex and uncertain situations.
To solve these problems in ship detection, the first requirement is an efficient predetection model that accelerates the prescreening process and decreases the false alarms. Furthermore, a robust feature set is required to discriminate ships from non-ship targets. To meet these two requirements, a practical ship detection scheme is presented in this paper. The workflow of our detection algorithm is given in Figure 1.
Figure 1. Diagram of the proposed hierarchical target detection scheme. The prescreening stage consists of saliency detection and binary segmentation. The discrimination stage includes feature extraction (radial gradient descriptor [19] and sigma set descriptor [20]) and classification using a Gaussian support vector machine (SVM) [21]. ORS: optical remote sensing.
The scheme contains two main coarse-to-fine stages: prescreening and discrimination. Based on the difference in region characteristics between regions of interest (ROIs) and the natural background, a novel and practical ship candidate detection scheme based on region variance is proposed for the prescreening stage. First, our model decomposes an input image into non-overlapping square blocks and estimates the variances of simple features within them. Second, the information entropy is introduced to adaptively tune the relative weights of the saliency maps estimated from the variances of the different features. Finally, inter-scale fusion is performed to increase the contrast between the salient and non-salient patches. In this way, not only can the targets be precisely detected, but the false alarms are also significantly reduced. After the saliency map is obtained, binary segmentation is applied to extract the candidate regions. In the discrimination stage, taking advantage of the symmetrical shape of ships, the radial gradient histogram [19] is applied to guarantee rotation invariance. Additionally, the region covariance descriptor [20], which is robust to large rotations and illumination changes, is utilized to describe the texture features of the targets. Both shape and texture features are extracted and concatenated into a feature vector for subsequent classification, and then a trainable Gaussian support vector machine (SVM) classifier is applied to further remove the false alarms and retain the real ship targets. Compared with previous works [2,15], our detection model achieves better performance in terms of both detection accuracy and running time. As a result, it is potentially of great benefit in the complex task of ship detection.
The rest of this paper is organized as follows. Section 2 introduces the framework of the visual saliency detection. Section 3 describes the discrimination stage, including the combined rotation-invariant descriptor as well as the Gaussian SVM. Experimental results are provided in Section 4. We then briefly conclude on the method, performance, and future work in Section 5.

Remote Sens. 2018, 10, 400

Ship Candidate Extraction Based on Saliency
In the prescreening stage, the saliency value of each region is determined by quantifying its variance. First, the attended regions that highlight potential objects are detected by a fast and efficient saliency detection method. Second, the ship candidates are extracted from the segmented binary image.

The Proposed Saliency Model
The ship targets in a VRS image of the sea are more salient than the background because the pixels of the targets vary widely while those of the background are very similar. Consequently, if we extract different low-level features from the images, the feature set of the ships will be very distinct from that of the sea background. It can also be concluded that a patch containing part of a target carries more complex information than a patch containing only homogeneous background. To describe this distinction, statistical characteristics have been investigated and proved to be powerful descriptors in remote sensing image processing. For instance, the variance weighted information entropy (WIE) was applied to detect targets in both infrared and SAR images and achieved impressive performance [22,23]. In this paper, we primarily test the region variance of optical remote sensing images in uniform areas, in areas including false alarms, and in partial areas of ship targets.
To illustrate the general idea, consider patches A, B, and C and their corresponding gray level distributions, shown in Figure 2a,b. It can be observed that the gray level distributions of patch A (red line) and patch B (green line) are very different from that of patch C (blue line). Due to the wide gray level distribution of patch C, the region variance values of highly non-uniform areas (patch C), which indicate ROIs, are usually greater than those of the homogeneous backgrounds (patches A and B). In other words, region variance, as a basic regional statistical characteristic, can measure the complexity of a given patch to some extent. A similar conclusion can be found in another model [24]. Therefore, it is reasonable to connect the saliency of a region with its variance. We constructed a novel saliency model based on this fact. As shown in Figure 2c, when our model is applied, the ship targets pop out from the background. We also compare our results with the seminal model of Itti in Figure 2d. The Itti model computes intensity, color, and orientation maps for a given input image based on a center-surround operation. The resulting feature maps are combined into the saliency map using a winner-takes-all network and an inhibition-of-return mechanism. As shown in the second row of Figure 2, the proposed model is more effective than the Itti model in suppressing the background interference.
Next, we explain the proposed saliency model in detail. Figure 3 shows its schematic diagram. There are four main steps. First, we extract pixel amplitude and amplitude derivative features. Second, the rarity values for each scale are estimated based on the region variance of the different feature maps. Afterwards, a selection algorithm and intra-scale fusion are applied based on the information entropy. Finally, we obtain the final saliency map by performing the multi-layer cellular automata [25]. A detailed description of the proposed saliency model is provided hereinafter.

Given an H × W image I, it is observed that the targets of interest usually have great intensity fluctuations and obvious edges [26,27], so we extract the pixel intensity and intensity derivatives to define the four-dimensional feature vector $f_k$ for the kth pixel in I, where $I_k$ is the intensity of the kth pixel and the image derivatives are calculated through a derivative filter.

To obtain maximum feature decorrelation, we transform the feature map into four linearly uncorrelated maps by performing principal component analysis (PCA) decomposition [28]. The resulting four feature maps after the PCA transformation are denoted as $f^{map}_j$, $j = 1, \ldots, 4$, and shown in the second column of Figure 3. The resulting feature vector of the kth pixel is redefined as $f_k = [f_k^1, f_k^2, f_k^3, f_k^4]$.

Non-overlapping patches with the size of n × n pixels are drawn from each feature map. The ith patch of feature map j is denoted as $p_i^j$, and its region variance $\mathrm{var}_i^j$ is the variance of the feature points in $p_i^j$ about their mean value $\bar{f}_i^j$. Then, the rarity value of the patch is defined as $r_i^j = \mathrm{var}_i^j / Z$, where Z is a normalization factor equal to $\max_{i \in f^{map}_j} \mathrm{var}_i^j$. We obtain a set of four maps called rarity maps, as shown in Figure 3.

To integrate this information, a selection algorithm is applied to the rarity maps. The first step is to compute, for each rarity map, an efficiency coefficient $EC_j$, which is estimated from the information entropy. We obtain the entropy value by treating the rarity map as a probability map. Image entropy reflects the degree of difference in the gray values of pixels. According to the definition of entropy, the stronger the discriminative ability of the rarity map, the smaller its entropy. $EC_j$ is therefore defined from the information entropy $H_j$ of rarity map j, which is computed from $P_i$, the probability of gray level i in the image, so that a smaller $H_j$ gives a greater $EC_j$. The greater $EC_j$ is, the more efficient the rarity map. We sort the rarity maps by their efficiency coefficients: $r_1$ is the most efficient map, and $r_4$ is the least efficient one. Finally, we eliminate $r_4$, and the fused map is the sum of the remaining maps weighted by their $EC_j$. Note that the patch size n × n specifies the resolution of the saliency map and affects the performance of the algorithm.
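The single-scale steps described above (PCA decorrelation, patch variance, rarity normalization, entropy-based selection, and weighted fusion) can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the function names are hypothetical, the caller supplies the four-channel feature stack, and the efficiency coefficient is assumed to be `1 / (1 + H)` because the exact expression is not recoverable from the text.

```python
import numpy as np

def pca_decorrelate(features):
    """Project an (H, W, d) feature stack onto its principal axes."""
    h, w, d = features.shape
    flat = features.reshape(-1, d).astype(float)
    flat -= flat.mean(axis=0)
    # Eigenvectors of the d x d covariance give the decorrelating rotation.
    _, vecs = np.linalg.eigh(np.cov(flat, rowvar=False))
    return (flat @ vecs).reshape(h, w, d)

def rarity_map(feature_map, n):
    """Variance of each non-overlapping n x n patch, divided by the maximum."""
    h, w = feature_map.shape
    hb, wb = h // n, w // n
    blocks = feature_map[:hb * n, :wb * n].reshape(hb, n, wb, n)
    var = blocks.var(axis=(1, 3))
    z = var.max()
    return var / z if z > 0 else var

def efficiency(rarity, bins=256):
    """ASSUMED form of the efficiency coefficient: lower entropy, higher EC."""
    hist, _ = np.histogram(rarity, bins=bins, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    entropy = -(p * np.log2(p)).sum()
    return 1.0 / (1.0 + entropy)

def single_scale_saliency(features, n):
    """Fuse the most efficient rarity maps at patch size n."""
    maps = pca_decorrelate(features)
    rarities = [rarity_map(maps[..., j], n) for j in range(maps.shape[-1])]
    ecs = np.array([efficiency(r) for r in rarities])
    keep = np.argsort(ecs)[::-1][:-1]  # drop the least efficient map
    fused = sum(ecs[j] * rarities[j] for j in keep)
    peak = fused.max()
    return fused / peak if peak > 0 else fused
```

The output has one value per patch, so an image of size H × W yields a map of size (H // n) × (W // n) that must be resized before multi-scale fusion.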
There are different outputs at different scales. The saliency map at small scales (small patch size n × n) tends to favor the boundaries rather than the entire body of a big ship target. In other words, it only focuses on the edges of targets and may introduce inner holes into the detection results. Conversely, the boundary of a small target is blurry in the saliency map at large scales. Furthermore, if the distance between ship targets is too small, the ship candidates will be detected as a whole and the number of ships cannot be distinguished. The situation becomes complex when targets of different sizes occur in the VRS images. To overcome this issue, we obtain multi-scale saliency maps by changing the patch size n × n and perform inter-scale fusion to produce a better saliency map. In this step, the multi-layer cellular automata [25] is introduced to integrate the multi-scale saliency maps and improve the contrast between salient and non-salient patches. Pixels that have the same coordinates in different saliency maps are neighbors in the multi-layer cellular automata. It can enhance saliency consistency among similar regions by exploiting the intrinsic relationship in the neighborhood. Consider the scales $N = \{n_1, n_2, \cdots, n_M\}$. The saliency map at each scale is resized to the size of the original image, giving $\{S_1, S_2, \cdots, S_M\}$, which are then fused by the multi-layer cellular automata [25]. Here $S_m^t = [S_{m1}^t, \cdots, S_{mP}^t]^T$ denotes the saliency values of all pixels on the m-th map at time t, and P is the total number of pixels. The all-ones vector $\mathbf{1} = [1, 1, \cdots, 1]^T$ has length P, and $\gamma_i$ denotes the threshold of the i-th saliency map generated by the Otsu method [29]. We empirically set $\ln\left(\frac{\lambda}{1-\lambda}\right) = 0.5$ based on the analysis of [25]. After T time steps, the final saliency map $S^T$ is obtained from the updated maps. In the inter-scale fusion step, the number of time steps T is determined by the convergence time; we set T = 10.
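The inter-scale fusion can be sketched as below, assuming the update follows the log-odds form commonly given for the multi-layer cellular automata of [25]: each map's logit is shifted by 0.5 for every other map that votes the pixel above its Otsu threshold, and by -0.5 for every map that votes it below. Recomputing the thresholds at every step and averaging the maps at the end are assumptions of this sketch, and `otsu_threshold` and `mca_fuse` are hypothetical names.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Otsu threshold for values in [0, 1], returned as a bin upper edge."""
    hist, edges = np.histogram(values, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(p)              # probability of the lower class
    m = np.cumsum(p * centers)     # cumulative mean of the lower class
    num = (m[-1] * w0 - m) ** 2
    denom = w0 * (1.0 - w0)
    between = np.zeros_like(denom)
    valid = denom > 1e-12
    between[valid] = num[valid] / denom[valid]
    return edges[np.argmax(between) + 1]

def mca_fuse(maps, log_odds_step=0.5, steps=10, eps=1e-6):
    """Fuse same-sized saliency maps (values in [0, 1]) over `steps` updates."""
    s = [np.clip(np.asarray(m, dtype=float), eps, 1.0 - eps) for m in maps]
    for _ in range(steps):
        # Each map votes +1 (salient) or -1 (background) at every pixel.
        votes = [np.sign(m - otsu_threshold(m)) for m in s]
        total = sum(votes)
        updated = []
        for m, v in zip(s, votes):
            # A map is influenced by the other maps only, not by itself.
            logit = np.log(m / (1.0 - m)) + log_odds_step * (total - v)
            updated.append(np.clip(1.0 / (1.0 + np.exp(-logit)), eps, 1.0 - eps))
        s = updated
    return sum(s) / len(s)         # ASSUMED: final map is the mean of all maps
```

With `log_odds_step=0.5` and `steps=10`, the defaults mirror the settings quoted in the text; a region supported by only one scale is pushed down by the others, while a region supported by all scales is pushed toward 1.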
We still need to investigate the appropriate set of scale parameters. Input images with increasing target sizes, from top to bottom, are shown in the first column of Figure 4. For small ship targets, edges are blurred at a large scale, in accordance with the aforementioned discussion. When the patch size is 4 × 4, the middle areas of bigger ships have low saliency values and only the edges are preserved. When the patch size goes up to 8 × 8 or 16 × 16, the performance improves. To sum up, scale parameters that are too large or too small can cause poor performance.
After several experiments, the scale parameter is fixed as N = {4, 8, 16} for better performance, and thus the number of scales is M = 3. The single-scale saliency model can then be easily extended to operate on multiple scales. By performing multi-scale saliency detection and selecting the appropriate set of scale parameters, our model is insensitive to the variation in target size. As shown in Figures 3 and 4, the output saliency map is now unique and the ship targets can be detected accurately even in a highly cluttered background.

Target Candidates Extraction
The final saliency map needs to be segmented to extract the candidate regions. In this step, we use the optimal threshold generated by the Otsu algorithm [29] to acquire the binary map. The optimal threshold is determined from the histogram and is selected automatically. The pixels with saliency values larger than the obtained threshold are defined as targets, while the rest of the pixels in the image are treated as background. Then, the smallest rectangle containing each connected region is defined as a ship candidate. There are two types of test images with complex backgrounds, as shown in Figure 5. One set of images is covered by clouds, and the other is disturbed by islands. The first column presents the test VRS images, the second column presents their corresponding saliency maps, and the binary maps and prescreening results are shown in the third and fourth columns, respectively. As shown in Figure 5d, after saliency detection, segmentation, masking, and extraction, the ship candidates are cut from the input image according to the location of each detected region in the binary image and marked with red boxes.
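The candidate extraction step can be sketched as follows: the saliency map is binarized with a given threshold (for example, one produced by the Otsu method), and the smallest axis-aligned rectangle around each connected region is returned. The function name, the 8-connectivity, and the `min_area` filter are assumptions made for illustration.

```python
from collections import deque

import numpy as np

def extract_candidates(saliency, threshold, min_area=1):
    """Return (row_min, col_min, row_max, col_max) for each connected region."""
    binary = np.asarray(saliency) > threshold
    h, w = binary.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for r0, c0 in zip(*np.nonzero(binary)):
        if seen[r0, c0]:
            continue
        # Breadth-first flood fill over the 8-connected neighborhood.
        seen[r0, c0] = True
        queue = deque([(int(r0), int(c0))])
        rmin = rmax = int(r0)
        cmin = cmax = int(c0)
        area = 0
        while queue:
            r, c = queue.popleft()
            area += 1
            rmin, rmax = min(rmin, r), max(rmax, r)
            cmin, cmax = min(cmin, c), max(cmax, c)
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < h and 0 <= cc < w
                            and binary[rr, cc] and not seen[rr, cc]):
                        seen[rr, cc] = True
                        queue.append((rr, cc))
        if area >= min_area:
            boxes.append((rmin, cmin, rmax, cmax))
    return boxes
```

Each box can then be used to crop a ship candidate chip from the input image for the discrimination stage.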

Ship Discrimination
The attended regions acquired by the visual saliency model could correspond to either ship objects in the image or false alarms. The discrimination process is performed to further remove pseudo-targets and confirm whether the candidates are real ship targets. Therefore, a two-step solution is adopted to identify real ships, namely feature extraction and machine learning. Feature extraction is conducive to the subsequent classification. An effective and robust descriptor characterizing the ship target is the key to the final discrimination. Because the arbitrary direction of the ship candidates makes target detection difficult, a robust descriptor that allows the ship to be recognized regardless of its direction is critically needed. In our approach, the rotation-invariant features describing the shape and the texture information of the targets are extracted and concatenated into a feature vector for subsequent classification. Finally, a trainable Gaussian support vector machine (SVM) classifier is applied to further remove the false alarms and retain the real ship targets. The discrimination stage is described briefly as follows.

Rotation-Invariant Global Gradient Descriptor
Taking advantage of the symmetrical shape of ships, the histogram of oriented gradients (HOG) descriptor [30] has been introduced to distinguish between ships and non-ship targets [15,16]. Note that the HOG descriptor usually samples cells on grids to describe objects; thus, it is not rotation-invariant and cannot be applied directly to describe targets, because the direction of the ship in the chips is arbitrary. To make up for this deficiency, Qi [14] performed the PCA transform to obtain the direction of the main axis and rotated the ship candidates to the vertical direction before extracting the HOG feature from them. In a similar manner, Xu [15] performed a segmentation algorithm and the Radon transform to estimate the heading of the ship target. Considering that the estimation of the principal axis direction is time-consuming and not always accurate enough, we introduce the radial gradient transform (RGT) [19], which guarantees rotation invariance without estimating an orientation. Moreover, the RGT descriptor, which was initially developed for real-time tracking, is faster than other rotation-invariant descriptors [31,32].
The specific process of the RGT is shown in Figure 6. Two orthogonal basis vectors, r and t, denote the radial and tangential directions at a point p, and point c is the center of the chip. By projecting onto r and t, the gradient g is reformulated as $(g^T r)\,r + (g^T t)\,t$. The rotation matrix for an angle θ is denoted as $R_\theta$. If we rotate the patch about its center by the angle θ, the new local coordinate system and gradient are expressed as $r' = R_\theta r$, $t' = R_\theta t$, and $g' = R_\theta g$, and the radial gradient after the rotation can be expressed as $(g'^T r', g'^T t')$. It is easy to verify that the coordinates of the gradient in the local frame are invariant to the rotation: $(g'^T r', g'^T t') = ((R_\theta g)^T R_\theta r, (R_\theta g)^T R_\theta t) = (g^T r, g^T t)$, since $R_\theta^T R_\theta$ is the identity. Then, the radial gradient direction is the angle of the vector $(g^T r, g^T t)$, and the magnitude is its Euclidean norm.

Figure 6. Illustration of radial gradient transform. For the given radial coordinate system (r, t), when the chip is rotated by θ, the projections of the gradient in (r, t) remain the same.
After obtaining the magnitudes and corresponding radial gradient orientations of the ship candidates, the gradient orientations are divided into eight bins over 0-360°. Each bin spans 45°, so we obtain an eight-dimensional histogram from the gradient image by performing the radial gradient transform. As shown in Figure 7, the gradient histogram of targets is basically unchanged even if the ship is rotated by various angles. For a real ship target, bins 4 and 5 of the histogram have higher values than the other bins. Theoretically, the target chips share a similar distribution of gradient histograms, which is also illustrated in Figure 7. The obtained global gradient descriptor is robust to variations in size and rotation, and reliably captures the shape information of targets. Finally, the magnitudes in bins 1-8 form the eight-dimensional feature vector $f_1$.

Figure 7. Radial gradient histogram statistics of the ship candidates. The x-coordinate denotes the eight orientation bins and the y-coordinate denotes the radial gradient statistic information in the gradient histogram. Ships with different orientations, sizes and textures share similar histogram distribution which is different from that of false alarms.
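The radial gradient histogram can be sketched as below. The sketch projects the image gradient onto the radial and tangential unit vectors at every pixel, bins the resulting direction into eight 45° bins, and accumulates the gradient magnitude per bin. The function name, the use of central differences for the gradient, and the normalization of the histogram to unit sum are assumptions of this illustration.

```python
import numpy as np

def radial_gradient_histogram(chip, bins=8):
    """Magnitude-weighted histogram of gradient directions in the (r, t) frame."""
    chip = np.asarray(chip, dtype=float)
    h, w = chip.shape
    gy, gx = np.gradient(chip)                 # derivatives along rows, columns
    ys, xs = np.mgrid[0:h, 0:w]
    ry = ys - (h - 1) / 2.0                    # offset from the chip center c
    rx = xs - (w - 1) / 2.0
    norm = np.hypot(rx, ry)
    norm[norm == 0] = 1.0                      # the center pixel has no direction
    rx, ry = rx / norm, ry / norm              # radial unit vector r
    tx, ty = -ry, rx                           # tangential unit vector t
    g_r = gx * rx + gy * ry                    # projection g^T r
    g_t = gx * tx + gy * ty                    # projection g^T t
    angle = np.degrees(np.arctan2(g_t, g_r)) % 360.0
    magnitude = np.hypot(g_r, g_t)
    index = np.minimum((angle / (360.0 / bins)).astype(int), bins - 1)
    hist = np.bincount(index.ravel(), weights=magnitude.ravel(), minlength=bins)
    total = hist.sum()
    return hist / total if total > 0 else hist  # ASSUMED: unit-sum normalization
```

For a square chip, `radial_gradient_histogram(chip)` and `radial_gradient_histogram(np.rot90(chip))` agree up to floating-point error, which illustrates the invariance argument; for arbitrary angles the agreement is only approximate because of resampling.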

Region Covariance Descriptor
Second, a region covariance descriptor is applied to describe the texture features of the ship candidates. The covariance matrix [33], which was initially proposed for texture classification and object detection, is introduced to characterize the ship targets. The region covariance descriptor is reviewed hereinafter. For a given image patch P, the W × H × d dimensional feature image extracted from P is denoted as F and is produced by a feature function φ, which may include color, intensity, orientation, filter responses, spatial attributes, etc. Then, the image patch P is represented by the d × d covariance matrix $C_P$ of the feature points, computed from the n d-dimensional feature points inside P and their mean u. We use simple features, namely intensity, color, and the norm of the first- and second-order derivatives of the intensity, to define the d-dimensional (d = 7) pixel-level feature vector f(x, y), with L, a, and b denoting the color of the pixel in Lab color space. The derivatives are calculated through the filters $[-1\ 0\ 1]^T$ and $[-1\ 2\ -1]^T$, and (x, y) denotes the location information. Hence, the covariance matrix $C_P$ is computed as a 7 × 7 matrix. It has several advantages:
• It provides nonlinear integration of different features through modeling their correlations.
• Due to the low-dimensional representations of the patches, it captures local structures better than linear filters.
• It is insensitive to large rotations and illumination changes.
To use $C_P$ as the ship descriptor, the matrix needs to be mapped to a vector. Note that covariance matrices do not lie in a Euclidean space, so a d × d covariance matrix cannot simply be flattened into a vector. To remedy this issue, Hong [20] proposed the sigma set descriptor, which maps covariance matrices to a Euclidean vector space by using the Cholesky decomposition. After performing the Cholesky decomposition $C_P = L L^T$, we obtain the lower triangular matrix L. The nonzero elements of L are then arranged into a vector of length $(d^2 + d)/2$, denoted as $f_2 = [L_1, L_2, \ldots, L_{28}]$. Finally, $f_1$ and $f_2$ are concatenated into a feature vector for classification. For the sake of simplicity, we redefine the combined features as $f = [f_1, f_2, \ldots, f_{36}]$.
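The mapping from feature points to the vector $f_2$ can be sketched as follows, assuming the per-pixel features have already been stacked into an n × d array. The function name, the row-major ordering of the lower-triangular entries, and the small ridge term that keeps the covariance positive definite are assumptions of this sketch.

```python
import numpy as np

def sigma_set_vector(points, ridge=1e-8):
    """Vectorize the Cholesky factor of the covariance of (n, d) feature points."""
    points = np.asarray(points, dtype=float)
    d = points.shape[1]
    cov = np.cov(points, rowvar=False)      # d x d sample covariance
    cov = cov + ridge * np.eye(d)           # ASSUMED regularizer for stability
    low = np.linalg.cholesky(cov)           # cov = low @ low.T, low is lower triangular
    return low[np.tril_indices(d)]          # (d^2 + d) / 2 entries, row by row

# Example: 7 features per pixel give a 28-dimensional texture vector.
rng = np.random.default_rng(0)
texture_vector = sigma_set_vector(rng.standard_normal((400, 7)))
```

Concatenating an eight-bin shape histogram with this vector, for example via `np.concatenate`, gives the 36-dimensional combined feature.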

Gaussian SVM
The main aim of the classification is to discriminate the real ship targets from the ship candidates based on the obtained features f. The support vector machine (SVM) [21] can non-linearly map the input vector into a very high-dimensional feature space. More importantly, the solution of the SVM is globally optimal. Due to its high performance in many pattern recognition applications, the SVM is adopted in the discrimination stage. We are given a training set of m observations $\{(x_i, y_i)\}_{i=1}^{m}$, with $x_i$ denoting the feature vector of the ith observation and $y_i \in \{-1, 1\}$ its label, where −1 and 1 denote non-ship and ship targets, respectively. For non-linear classification problems, to construct a separating hyperplane in the feature space, the d-dimensional feature vector x is first transformed into a D-dimensional feature vector by a function φ. The class is then given by the sign of $w^T \phi(x) + b$, where w and b are to-be-learned parameters, and learning them is posed as a constrained optimization problem over w and b. The Lagrangian is computed to solve this convex quadratic programming problem, and the corresponding dual problem is expressed in terms of the Lagrange multipliers $\alpha_i$. Instead of explicitly computing $\phi(x_i)^T \phi(x_j)$, the kernel trick is applied, and the SVM model for function estimation is written with a kernel function κ(·, ·) in place of this inner product. The kernel mapping technique plays an important role in classification performance. One can incorporate prior knowledge of the problem at hand by constructing special kernel functions [21]. In our experiment, SVMs with linear, quadratic, cubic, and Gaussian kernels are tested. Finally, the Gaussian SVM is adopted to classify the ships and non-ship targets. The Gaussian kernel function decays exponentially with the squared Euclidean distance between two feature vectors. More details can be found in Section 4.3.
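The kernelized decision rule can be sketched as below. The sketch assumes the common parameterization $\kappa(x, x') = \exp(-\lVert x - x' \rVert^2 / (2\sigma^2))$ for the Gaussian kernel and the standard dual form $\mathrm{sign}(\sum_i \alpha_i y_i \kappa(x_i, x) + b)$ for the decision function; the bandwidth `sigma`, the function names, and the hand-picked multipliers in the example are illustrative assumptions, not trained values.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Pairwise Gaussian kernel between the rows of a and the rows of b."""
    a = np.atleast_2d(np.asarray(a, dtype=float))
    b = np.atleast_2d(np.asarray(b, dtype=float))
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, alphas, labels, bias, sigma=1.0):
    """Return +1 (ship) or -1 (non-ship) for each row of x."""
    k = gaussian_kernel(x, support_vectors, sigma)
    weights = np.asarray(alphas, dtype=float) * np.asarray(labels, dtype=float)
    scores = k @ weights + bias
    return np.where(scores >= 0, 1, -1)

# Toy example with two support vectors (multipliers chosen by hand).
support = np.array([[0.0, 0.0], [2.0, 2.0]])
prediction = svm_decision([[1.9, 2.0]], support, alphas=[1.0, 1.0],
                          labels=[-1, 1], bias=0.0)
```

In practice the multipliers and bias come from solving the dual problem on the training set, which is usually delegated to an SVM library.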

Experimental Results and Discussion
We conduct our experiments using a PC equipped with a 3 GHz CPU and 4 GB of memory. First, we compare the proposed saliency model both qualitatively and quantitatively with four state-of-the-art methods in different complex backgrounds (e.g., luminance fluctuation, cloud cover, fog, sea clutter, and island interference). We employ the receiver operating characteristic (ROC) area under the curve (AUC) metric to evaluate the candidate location prediction quantitatively. Second, the classification accuracy is adopted to measure the performance of SVMs with different kernel functions. We also compare our combined rotation-invariant feature with the S-HOG feature (ship histogram of oriented gradient) and with the single features $f_1$ and $f_2$. Finally, the overall detection performance is compared to further demonstrate the effectiveness and robustness of the proposed scheme.
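A pixel-level ROC AUC of a saliency map against a binary ground-truth map can be sketched as follows. This is one common way to compute the metric, assumed here for illustration: the saliency map is thresholded at evenly spaced levels, and the area under the resulting (false positive rate, true positive rate) curve is integrated with the trapezoidal rule. The function name and the number of thresholds are hypothetical.

```python
import numpy as np

def roc_auc(saliency, ground_truth, num_thresholds=256):
    """Area under the ROC curve; ground_truth must contain both classes."""
    s = np.asarray(saliency, dtype=float).ravel()
    gt = np.asarray(ground_truth).ravel().astype(bool)
    # Sweep thresholds from the highest to the lowest saliency value.
    thresholds = np.linspace(s.max(), s.min(), num_thresholds)
    tpr = np.array([(s[gt] >= t).mean() for t in thresholds])
    fpr = np.array([(s[~gt] >= t).mean() for t in thresholds])
    # Start the curve at the origin (nothing is declared salient).
    tpr = np.concatenate([[0.0], tpr])
    fpr = np.concatenate([[0.0], fpr])
    # Trapezoidal integration of TPR over FPR.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

A map that ranks every target pixel above every background pixel scores 1, and a map that ranks them in the opposite order scores 0.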

Data Set
All VRS images were collected from Google Earth and were captured under different weather conditions and from various viewpoints. The dataset contains 338 ship targets in a total of 162 images of size 512 × 512 pixels; the corresponding binary maps were manually labeled. The resolution of these images is about 1 m. Sample images of the dataset are shown in the left column of Figure 8.

Figure 8. Examples of saliency maps compared with four state-of-the-art methods. From left to right: input image, Itti [9], GBVS (Graph-Based Visual Saliency) [10], SR (Spectral Residual) [13], COV (Covariance saliency) [23], and our method. Our method outperforms the other typical methods visually.

Figure 8 presents the results of our saliency approach and of other typical models, including the Itti [9], GBVS [10], SR [13], and COV [23] methods, on sample images from our dataset. These images can be divided into several types according to their complex backgrounds, such as thin and thick cloud cover, island interference, fog, and sea clutter. The complicated backgrounds make every target detection task unique and challenging. Although it is difficult for any of these methods to extract the salient regions in remote sensing images exactly, our saliency model tends to be less distracted by cluttered backgrounds than the other methods.

Comparison to the State-of-the-Art Saliency Model
As shown in Figure 8, our proposed saliency model visually achieves the best results of all the saliency models in terms of both the accuracy and the integrity of object detection, and has the following advantages:

• Our model can distinguish different ship targets even when they are very close to each other.
• It can identify both large and small ships and highlight the entire ship target regions.
• It can suppress the interference from complex backgrounds such as cloud, fog, and sea clutter.

Note that the background suppression abilities of the Itti and GBVS models are weak, especially under cloud cover. Although the detection results of the SR model are finer, this model is sensitive to the input image pixels. The COV model is effective at suppressing complex backgrounds, but it is time-consuming and produces more false alarms than our model. Overall, the proposed saliency model is superior to the other typical models: it obtains more accurate shapes and highlights the whole target regions.
In addition to visual comparisons of saliency maps, we employ the ROC-AUC metric to quantitatively evaluate the performance of the proposed method. Under this metric, pixels whose saliency values exceed a threshold are treated as targets, while the remaining pixels in the image are treated as background. The manually labeled binary maps are used as ground truth. An ROC curve is drawn by varying the threshold, with the true positive rate (TPR) and the false positive rate (FPR) plotted on the Y axis and X axis, respectively. The TPR and FPR are expressed as

$$\mathrm{TPR} = \frac{tp}{tp + fn} \tag{23}$$

$$\mathrm{FPR} = \frac{fp}{fp + tn} \tag{24}$$

where tp is the number of true positives, fp is the number of false positives, tn is the number of true negatives, and fn is the number of false negatives. The performance in terms of the ROC-AUC metric is measured and the results are shown in Figure 9a,b, respectively. The closer an ROC curve lies to the upper-left corner of the graph, the better the performance. It can be observed that the proposed saliency model has the highest ROC-AUC performance and outperforms all the other methods under consideration.
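The threshold sweep and the area under the resulting curve can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used for Figure 9: the function names and the six-pixel toy data are hypothetical, the standard definitions TPR = tp/(tp + fn) and FPR = fp/(fp + tn) are assumed, both classes are assumed to be present in the ground truth, and the area is approximated with the trapezoidal rule.

```python
def roc_points(saliency, truth):
    """Sweep a threshold over the saliency values and return (FPR, TPR) points.

    saliency: list of saliency values, one per pixel.
    truth: list of 0/1 ground-truth labels (1 = target pixel).
    Assumes truth contains at least one 0 and at least one 1.
    """
    positives = sum(truth)
    negatives = len(truth) - positives
    points = [(0.0, 0.0)]  # threshold above every value: nothing is detected
    # Using ">=" at each observed value is equivalent to a strict ">" with
    # the threshold placed just below that value.
    for thr in sorted(set(saliency), reverse=True):
        tp = sum(1 for s, t in zip(saliency, truth) if s >= thr and t == 1)
        fp = sum(1 for s, t in zip(saliency, truth) if s >= thr and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical toy example: six pixels, three of which belong to a target.
saliency = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
truth = [1, 1, 0, 1, 0, 0]
curve = roc_points(saliency, truth)
area = auc(curve)
```

A saliency map that ranks every target pixel above every background pixel gives an area of 1, while each background pixel ranked above a target pixel lowers the area.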
We also evaluate the speed of the proposed saliency model relative to the other methods mentioned above. Table 1 compares the average time taken by each method. Note that SR, COV, and the proposed method are programmed in Matlab, while the Itti and GBVS implementations are Matlab codes that call C++ functions to reduce the running time. Nevertheless, the comparison gives a relative overview of the run-time performance of the considered methods. It can be observed that the COV model suffers from a long running time despite its good performance. Owing to mixed-language programming, the costs of running Itti and GBVS are relatively low. The SR model has the shortest running time because of its small computational effort. The time complexity of our method is lower than that of the other spatial saliency models.

Discrimination Results
A total of 543 ship candidates are obtained by the candidate extraction mechanism. The size of a ship candidate typically ranges from 11 × 18 to 100 × 92. The candidates are manually classified into 325 ship chips and 218 non-ship chips, which are used to verify the performance of the hierarchical feature extraction as well as the classification approach. Some examples of the extracted ship candidates are shown in Figure 10; groups A and B show samples of targets and false alarms, respectively. We randomly select two-thirds of the ship chips and of the non-ship chips as the training set. The test set consists of the remaining chips.
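The per-class two-thirds split can be sketched as follows. This is only an illustration under stated assumptions: the function name, the fixed random seed, and the rounding of the per-class training count to the nearest integer are hypothetical choices, since the text does not specify how fractional counts are handled.

```python
import random


def stratified_split(labels, train_fraction=2 / 3, seed=0):
    """Randomly split sample indices into train/test sets, class by class.

    Rounding the per-class training count to the nearest integer is an
    assumption; the source does not state the rounding rule.
    """
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_fraction)
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return train, test


# 325 ship chips (label 1) and 218 non-ship chips (label -1), as in the text.
labels = [1] * 325 + [-1] * 218
train_idx, test_idx = stratified_split(labels)
```

Splitting each class separately keeps the ship to non-ship ratio of the training set close to that of the full set of candidates.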

To validate the effectiveness of our combined feature, the S-HOG (ship histogram of oriented gradient) feature, the radial gradient feature, the sigma set feature, and the combined feature are each paired with the classification learner to perform the discrimination.

The SVM can handle small-sample, nonlinear classification problems and has good generalization performance, which makes it suitable for the extracted data. The selection of the kernel function is a pivotal factor in classification accuracy. Accordingly, the four feature sets mentioned above are compared using SVMs with various kernel functions, namely the linear, quadratic, cubic, and Gaussian kernels. The classification accuracy of each method is calculated as

$$\text{Accuracy} = \frac{\text{Number of correctly classified samples}}{\text{Number of tested samples}} \times 100\% \tag{25}$$

The parameter of the SVM-based classifier is determined by 5-fold cross-validation. The classification accuracy of each feature is shown in Figure 11.

As can be seen from the comparison in Figure 11, for any given kernel function, the classification accuracy based on the combined feature set is higher than that based on any single feature set. Note that the S-HOG feature has the worst accuracy; this may be related to the low accuracy of estimating the principal axis direction.
In addition, for the combined feature f, the SVM with the Gaussian kernel achieves an accuracy of 96.1%, the highest among the tested kernels. Therefore, the combined feature set and the Gaussian SVM are adopted in the following experiments.
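The accuracy measure and the 5-fold cross-validation used for parameter selection can be sketched as below. This is a minimal illustration, not the experimental code: the function names are hypothetical, the folds are formed by simple interleaving rather than random assignment, and `fit_and_predict` stands for any user-supplied routine that trains a classifier with a given parameter on the training indices and returns predicted labels for the validation indices.

```python
def accuracy(predicted, actual):
    """Percentage of samples whose predicted label equals the true label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of k interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, val


def select_parameter(candidates, labels, fit_and_predict, k=5):
    """Return the candidate with the highest mean validation accuracy.

    fit_and_predict(param, train_idx, val_idx) is a hypothetical callable
    that returns one predicted label per validation index.
    """
    best_param, best_score = None, -1.0
    for param in candidates:
        scores = []
        for train, val in k_fold_indices(len(labels), k):
            predicted = fit_and_predict(param, train, val)
            scores.append(accuracy(predicted, [labels[i] for i in val]))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_param, best_score = param, mean_score
    return best_param, best_score
```

In practice the samples would be shuffled before the folds are formed, so that each fold contains a representative mix of ship and non-ship chips.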

Comparison of Overall Detection Performances
Finally, we compare our overall detection method with two typical methods. The evaluation criteria are defined as

$$\text{Accuracy} = \frac{\text{Number of correctly detected ships}}{\text{Number of real ships}} \times 100\% \tag{26}$$

$$\text{False ratio} = \frac{\text{Number of detected false alarms}}{\text{Number of detected candidates}} \times 100\% \tag{27}$$

The detection results are listed in Table 2. As can be seen from Table 2, our detection model obtains higher accuracy and a lower false ratio than the other two methods. Note that method [2] has the worst performance, because it generates the candidate regions by image segmentation and uses a simple shape feature to distinguish between ships and non-ship targets. In contrast, our model and method [15] extract the ship candidates using a visual attention mechanism, which yields few false alarms and a low miss rate. Besides, the improved HOG feature can describe target shape information efficiently. In addition, compared to method [15], we extract not only shape features but also texture features. This helps to remove false alarms whose shapes are similar to those of the targets. From the analysis above, it can be concluded that our model is effective at eliminating false alarms while preserving the real targets, and that it outperforms the other ship detection models considered.
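The two evaluation criteria translate directly into code. The sketch below is illustrative only: the function name is hypothetical and the counts in the example are invented for demonstration, not taken from Table 2.

```python
def detection_metrics(n_correct_ships, n_real_ships, n_false_alarms, n_candidates):
    """Return (accuracy, false ratio) in percent.

    accuracy    = correctly detected ships / real ships
    false ratio = detected false alarms / detected candidates
    """
    accuracy = 100.0 * n_correct_ships / n_real_ships
    false_ratio = 100.0 * n_false_alarms / n_candidates
    return accuracy, false_ratio


# Hypothetical counts: 9 of 10 real ships found, 1 false alarm among 10 detections.
acc, false_ratio = detection_metrics(9, 10, 1, 10)
```

Note that the two criteria use different denominators: accuracy is normalized by the number of real ships in the scene, whereas the false ratio is normalized by the number of detections reported.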
With regard to the time consumption of the overall detection algorithm, our saliency detection model greatly decreases the detection time compared with a sliding-window algorithm. The saliency detection takes 0.6 s, and the discrimination stage takes 0.2 s on average, which basically meets the needs of near-real-time tasks.

Conclusions
In this paper, we have proposed a hierarchical model to tackle the problem of ship detection against complex and changing backgrounds in optical remote sensing data. The scheme consists of prescreening and discrimination stages. First, a fast and efficient multi-scale saliency model based on region statistical characteristics is applied to locate candidate regions. Through saliency detection, our model effectively reduces both missed detections and false detections. Second, from a given ship candidate, we extract the combined rotation-invariant feature, which offers a more powerful descriptor for capturing the shape and texture information of the object. Finally, a trainable Gaussian SVM is employed as the discriminator. Our overall detection model achieves an accuracy of 94% and a false ratio of 4%, outperforming the other typical ship detection models considered. Experiments on optical remote sensing data have demonstrated the effectiveness and robustness of the proposed model.
Our future work will focus on two aspects. First, we will build a large dataset including thousands of optical remote sensing images to ensure that the classifier has sufficient input data, which can further improve the object detection performance. Second, more effective features will be explored and feature selection will be considered. Moreover, better discriminators will be investigated.