Robust Ship Detection in Infrared Images through Multiscale Feature Extraction and Lightweight CNN

Ship detection technology for remote sensing images is not yet mature, and detection results fall well short of practical requirements, mainly because existing methods offer inadequate support for multi-scene, multi-resolution and multi-type target ships. To overcome these challenges, a ship detection method based on multiscale feature extraction and a lightweight CNN is proposed. Firstly, a candidate-region extraction method based on a multiscale model accurately covers the potential targets under different backgrounds. Secondly, a multiple-feature fusion method is employed for ship classification, in which Fourier global spectrum features are applied to discriminate between targets and simple interference, and the targets in complex interference scenarios are further distinguished by a lightweight CNN. Thirdly, a cascade classifier training algorithm and an improved non-maximum suppression method are used to minimise the classification error rate and maximise generalisation, which achieves final-target confirmation. Experimental results validate our method, showing that it significantly outperforms the available alternatives, reducing the model size by up to 2.17 times while improving detection performance by up to 5.5% in multi-interference scenarios. Furthermore, robustness was verified by three indicators: the F-measure score and true–false-positive rate increase by up to 5.8% and 4.7% respectively, while the mean error rate decreases by up to 38.2%.


Introduction
Ship detection from infrared remote sensing images plays an important but challenging role in remote surveillance and military reconnaissance [1,2]. Because infrared remote sensing images cover a large area and the targets occupy only a small proportion of each image, both the accuracy and the real-time performance of target-detection algorithms are seriously affected. For example, a remote sensing image obtained by one satellite contains 30,000 × 30,000 pixels, whereas a ship occupies about 10 × 10 pixels; the target area therefore accounts for only about one part in ten million of the image, which results in a serious lack of target texture detail, especially when the image resolution is low. Although several infrared-image ship detection technologies have emerged, it is difficult for these methods to address the following challenges simultaneously. (1) It is difficult to extract universal features for various types of target ships in low-resolution or low-contrast images. Such images lose texture detail, which makes the available features of the target hard to extract. Coupled with interference in the infrared image from clouds, reefs and so on, the features of the middle and lower layers are easily ignored, and false positives and false negatives are easily generated. (2) It is hard to maintain the balance between algorithm performance and algorithm complexity. Generally speaking, algorithms with high detection performance have high complexity due to their calculation modes. When such algorithms need to be ported to embedded chips with limited resources, part of the algorithm's performance is often sacrificed. (3) Most of the existing detection methods compromise on real-time computing.
The majority of detection algorithms have requirements that are difficult to match with the resources available in real-time space reconnaissance applications. This is especially true when deep neural networks are introduced for target classification: they not only require a large amount of training data to generalise effectively but also use more computational power than other methods.
Infrared images have a strong spatial correlation, contain more homogeneous regions, and have weak texture features, so the mean gray value is relatively stable [3]. However, when detecting a target ship on the sea surface, there are still several forms of interference. First, because the response characteristics of the pixels in the infrared imaging equipment are not completely consistent, different detection units produce different outputs under the same radiation input. This produces bright and dark stripe noise in the infrared image and a low signal-to-noise ratio, which seriously affects the performance of the target-detection algorithm. Secondly, remote sensing satellites are easily affected by clouds, sea clutter and weather during imaging, resulting in a complex detection background and reduced contrast between the target and background, which readily produces a large number of false alarms. Thirdly, it is difficult to select appropriate features to separate the target and background because of the variety of target types and sizes and the unequal representation of grayscale features. Therefore, at present, there is no detection algorithm that is applicable to all possible scenarios; existing methods can only minimise interference and ensure detection efficiency in specific detection scenarios.
With the improvement of artificial intelligence technology, deep learning can adaptively and automatically learn features in data by constructing a deep neural network, which makes up for the deficiency of manually designed features to a certain extent [4][5][6]. Object detection technology based on deep learning can generally be divided into one-stage and two-stage detection methods. Two-stage target-detection algorithms generally consist of candidate-region extraction and target confirmation: they first extract all the candidate regions for a subsequent classifier and then use powerful features with statistical classifiers to discriminate ships from false alarms, which has great advantages in maintaining detection accuracy. One-stage target-detection algorithms omit the step of candidate-region extraction and directly obtain the target category and position from the image. Compared with two-stage algorithms, they have a large speed advantage, but their detection accuracy is lower due to their coarser detection strategy. When deep learning algorithms are deployed on embedded platforms or other platforms, it is a great challenge to balance the accuracy, speed and memory resources needed for target detection.
The above analysis shows that complex backgrounds and diverse interference factors seriously affect the extraction and classification of the effective features of the targets. In addition, during deployment and application, it is difficult to balance detection performance, computational complexity and real-time performance. To solve these problems, this paper makes the following contributions. (1) Candidate-region extraction: our method combines the cascade rejection mechanism with multiple features through a linear cascade classifier, which orders candidates from simple to complex, uses relatively simple features to exclude a large number of simple false alarms, such as seawater and clouds, and uses more sophisticated features to extract the final candidate regions. (2) Multiple-feature fusion classification: a false-alarm elimination method based on Fourier global spectral features and a lightweight CNN is proposed. In this method, the global Fourier transform is applied to each candidate area to obtain the corresponding feature description and achieve a rough classification of candidate regions; local features are then extracted by the lightweight CNN model to further eliminate false alarms. (3) Classifier training and target confirmation: a classifier training algorithm is proposed to minimise the classification error rate and maximise generalisation, and the improved NMS algorithm is used to merge real ships and achieve an accurate output.
The remainder of this paper is organized as follows. Section 2 describes the related work. Section 3 introduces the methodology and details the implementation and optimization of the core modules of the algorithm. Section 4 describes the extensive experiments, and Section 5 presents our conclusions and recommendations for future work.

Related Work
In recent years, many infrared ship detection algorithms have been proposed. Ship detection algorithms are generally divided into traditional methods and deep learning methods. The existing traditional ship detection algorithms can be divided into four categories for different scenarios: algorithms based on wake extraction, algorithms based on template matching, algorithms based on feature statistics, and methods based on classification learning. Although some scholars have put forward novel research ideas, their core ideas are inseparable from the above types of target detection. Traditional target-feature extraction methods are mainly based on grey and texture features, with a pre-trained classifier employed for classification. For example, some researchers propose a saliency strategy, a feature descriptor [7,8], and a local comparison method [9] to detect small infrared targets, but these methods are very sensitive to noise, which usually generates a high false-alarm rate. Most infrared small-target-detection methods based on saliency have high computational complexity and are difficult to optimise using parallelism. Therefore, ship detection algorithms based on the weighted local difference measurement [10] and a weighted voting mechanism [11] have been proposed. Moreover, references [12,13] introduce the multiscale local uniformity and greyscale difference weighting strategy to detect small infrared targets. In references [14,15], multi-frame images and sensor data in the infrared image sequence were analysed for ship detection. Based on extreme value theory, the edge detection [16,17] and cascading characteristics methods [18] are adopted to identify the objects of interest and suppress background clutter. The above traditional detection methods are often based on low-level, hand-crafted features.
It is a great challenge to achieve high detection accuracy in complex scenes, such as those with cloud interference and low contrast, and improvement is needed.
With the improvement in artificial intelligence technology, deep learning has attracted an increasing amount of attention. Target-detection techniques based on deep learning include anchor-based and anchor-free techniques. Firstly, anchor-based technology includes one-stage and two-stage detection. One-stage detection techniques include the single-shot detector (SSD) [19], the deconvolutional SSD (DSSD) [20], RetinaNet [21], RefineDet [22], You Only Look Once Version 3 (YOLOv3) [23], etc. Two-stage detection techniques include Faster region-based convolutional neural networks (Faster R-CNN) [24], region-based fully convolutional networks (R-FCNs) [25], the Feature Pyramid Network (FPN) [26], Cascade R-CNN [27], Scale Normalization for Image Pyramids (SNIP) [28], etc. Generally, two-stage target detection is more accurate than one-stage target detection, but its processing speed is slower. Secondly, anchor-free technology includes key-point-based and segmentation-based methods. Key-point-based technologies include CornerNet [29], CenterNet [30], CornerNet-Lite [31], etc., and segmentation-based technologies include a feature-selective anchor-free (FSAF) module [32], a fully convolutional one-stage (FCOS) object detector [33], FoveaBox [34], etc. These deep learning methods achieve good detection accuracy in natural image target detection, but they have great limitations for on-satellite processing of remote sensing images, where resources are limited. First, the compression method of deep neural networks has higher performance requirements, especially for large networks. Second, the algorithm has difficulty meeting performance expectations. It is difficult to design a state-of-the-art machine for the data-flow scheduling of different layers, and there will be considerable redundancy in logical resources. In addition, there is the data dependence problem.
Compared with traditional methods, deep learning relies more on large-scale training data, and it needs a large amount of data to learn the underlying patterns. When the target features and false-alarm features in the detected images are relatively uniform, this data dependence problem is not obvious; when they are significantly different, however, the scale of training data needs to be considerably increased, which is a great practical challenge.
In some relatively simple conditions, the methods mentioned above can achieve good detection results. However, their detection performance is affected in the following three situations: (a) low contrast between ships and background; (b) complicated sea conditions; and (c) false-alarm interference. In addition, these algorithms also give rise to different levels of missed detection when multiple vessels are docked together. Therefore, there is still much room for improvement in ship detection.

Candidate-Region Extraction
To solve the problem of complete extraction of the ship region, the suspected region of the target ship is located step-by-step based on the idea of coarse-to-fine detection, as shown in Figure 1. The candidate-region extraction is designed in three parts. Firstly, using a multi-scale model, hierarchical images are constructed, which serve as the original image data for subsequent processing. Secondly, regional gradient-feature reconstruction is undertaken: the region of interest (ROI) is extracted by constructing the Sobel operator and gradient template, and the target region is preliminarily determined. Thirdly, in the vector binarization of flow convolution, target information is extracted from the combination of multiple features by traversing the image data using flow convolution, and the more sophisticated features are used for further identification. The target types to be detected differ with image resolution: in 5-m-resolution images, the target ships are generally large military, passenger and cargo ships, but in high-resolution images, fishing ships and other small ships must also be detected. Ships in the same remote sensing image have different sizes, and the same ship has different scales in images with different resolutions. Therefore, existing methods have difficulty adapting to the large differences in the apparent features of ships caused by scale diversity.
In order to find target ships of different sizes in remote sensing images, we constructed a standard image model through repeated smoothing and sub-sampling, and then generated a multi-scale image model through reasonable scaling. The definition of each layer is shown in Equation (1).
where $\mathrm{Lay}_i$ represents the generated image model at scale $i$, $\mathrm{Lay}_{ori}$ represents the standard image model, $\otimes$ represents the bilinear interpolation operation, $\theta_i$ represents the scaling factor, and $i$ ranges from 1 to $n$. In the candidate-region extraction stage, one of the more important parameters is the filter size ($k \times k$), which is the sliding cell used for detection. For a small target, the filter size can be set so that it completely covers the target. For larger ships, if the window cannot cover the whole target, we down-sample the image according to the actual testing requirements until the scaled target size fits within the coverage area. Therefore, the selection of the parameter $k$ is very important. If $k$ is too small, the target ship cannot be covered; if it is too large, false alarms will appear in the covering box. By analysing the sizes of the target ships in the dataset, we find that values of $k$ between 10 and 15, combined with the multi-scale model, detect essentially all ships in the dataset; this size range was found through extensive testing to handle different target sizes and characteristics. In this paper, we define $k$ as 11, that is, the filter size is 11 × 11 pixels. The parameter $\theta$ is the reduction factor for image scaling. For easy understanding, $\theta$ can equivalently be regarded as the enlargement factor of the sliding 11 × 11 window. In the multi-scale model stage, there are two important parameters: the image size img × img and the factor $\theta$. Since the size of the sliding window is 11 × 11, the parameters should be set so that the window fully covers the target when the maximum reduction factor is used, where the maximum reduction factor is the product of $\theta_1$, $\theta_2$ and $\theta_3$.
Through a large amount of training and testing, we find that too small a scale results in incomplete coverage, while too large a scale results in the loss of target features. Better results are obtained on the dataset when the maximum coverage size, that is, the 11-pixel window multiplied by the maximum reduction factor, is close to 35 pixels. Further tests of the per-layer factors lead to the values 2, 1.25 and 1.25, whose product gives a maximum reduction factor of 3.125. Under this set of parameters, our algorithm achieves a better detection effect. This specific combination is not unique: only an approximately optimal combination is needed, and the values of $\theta_1$, $\theta_2$ and $\theta_3$ can be exchanged freely. Therefore, with $\theta_1 = 2$, $\theta_2 = 1.25$ and $\theta_3 = 1.25$, we design a three-layer scaling model that fully covers ships of different sizes; the maximum coverage size reaches 34 pixels ($11 \times 3.125 \approx 34$), which meets the test requirements of the ship samples in the dataset. At the same time, because the unscaled standard image is tested alongside the scaled images used for large-object detection, small targets are safeguarded: a target covering far fewer than 11 × 11 pixels in the standard image can still be detected there even if it is no longer detectable after scaling, which avoids missed detections.
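The relationship between the per-layer factors and the covered target size can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that the reduction factors compound from layer to layer (the text states that the maximum reduction factor is their product), and the function and variable names are hypothetical.

```python
# Sketch: effective coverage of the 11 x 11 window at each layer of the model.
WINDOW = 11                     # filter size k from the text
THETAS = (2.0, 1.25, 1.25)      # per-layer reduction factors from the text


def coverage_per_layer(window, thetas):
    """Return (cumulative_factor, covered_size_in_original_pixels) per layer."""
    layers = [(1.0, float(window))]   # layer 0: the unscaled standard image
    factor = 1.0
    for theta in thetas:
        factor *= theta               # assumption: reductions compound
        layers.append((factor, window * factor))
    return layers


layers = coverage_per_layer(WINDOW, THETAS)
for index, (factor, size) in enumerate(layers):
    print(f"layer {index}: reduction {factor:.3f}, covers about {size:.2f} px")
```

Under these assumptions the deepest layer has a cumulative factor of 3.125 and covers about 34 pixels, which matches the maximum coverage size quoted above.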
Since the sizes of images obtained in the detection stage are different, to facilitate subsequent processing, the first stage of this paper is to carry out image scaling. For vector images, image scaling does not cause distortion, blur or other problems, but the infrared remote sensing image is similar to a bitmap, and thus, it is necessary to select an appropriate image scaling algorithm. Bilinear interpolation is the interpolation of image pixels. Even under the condition that the original image is not smooth, bilinear interpolation will produce a smooth output, and as the infrared remote sensing image resolution is low, the target ships in the images require contour smoothing if they are small. In this case, the bilinear interpolation algorithm has a high-quality effect; the algorithm complexity is lower, and the time efficiency is better.
For a point $P = (x, y)$ on the line between $Q_1 = (x_0, y_0)$ and $Q_2 = (x_1, y_1)$, the $y$ coordinate is calculated as shown in Equation (2):

$$y = y_0 + (x - x_0)\,\frac{y_1 - y_0}{x_1 - x_0} \tag{2}$$

From the perspective of the weighted average, the weight decreases with the distance between the known point and the unknown point; that is, the closer the known point is to the unknown point, the greater its weight in the result. Therefore, according to the normalised distances between the unknown point and the two known points along the X-axis, the two weights are $\frac{x - x_0}{x_1 - x_0}$ (applied to $y_1$) and $\frac{x_1 - x}{x_1 - x_0}$ (applied to $y_0$). This gives the expression for $y$ shown in Equation (3):

$$y = y_0\,\frac{x_1 - x}{x_1 - x_0} + y_1\,\frac{x - x_0}{x_1 - x_0} \tag{3}$$

Bilinear interpolation is the extension of linear interpolation to the plane, and its core idea is to perform linear interpolation in each of the two dimensions in turn. Given the coordinates of four points $Q_{11} = (x_1, y_1)$, $Q_{21} = (x_2, y_1)$, $Q_{12} = (x_1, y_2)$ and $Q_{22} = (x_2, y_2)$, and given the value of function $f$ at these four points, the bilinear interpolation algorithm can be used to obtain the value of $f$ at point $P = (x, y)$. The calculation process is shown in Equations (4)-(6):

$$f(R_1) = \frac{x_2 - x}{x_2 - x_1}\,f(Q_{11}) + \frac{x - x_1}{x_2 - x_1}\,f(Q_{21}) \tag{4}$$

$$f(R_2) = \frac{x_2 - x}{x_2 - x_1}\,f(Q_{12}) + \frac{x - x_1}{x_2 - x_1}\,f(Q_{22}) \tag{5}$$

$$f(P) = \frac{y_2 - y}{y_2 - y_1}\,f(R_1) + \frac{y - y_1}{y_2 - y_1}\,f(R_2) \tag{6}$$

where $R_1 = (x, y_1)$ and $R_2 = (x, y_2)$ are the intermediate points obtained by interpolating along the x-direction. In this paper, a bilinear interpolation algorithm is used to expand and shrink the image size of the module to be detected, and images of different sizes are obtained as the input of the subsequent detection module. Through this scaling process, the subsequent detection algorithm has better adaptability to target ships with different characteristics. A summary of the multiscale model's operation is shown in Algorithm 1.
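As an illustration of the interpolation step, the following minimal sketch samples an image at a real-valued position by interpolating along x on two neighbouring rows and then along y between the two results, and uses that to shrink an image by a reduction factor. It is not the authors' implementation: the image is assumed to be a list of rows of grey values, the grid spacing is taken as 1, and the names `bilinear_sample` and `resize_bilinear` are hypothetical.

```python
def bilinear_sample(image, x, y):
    """Interpolate image (a list of rows) at real-valued column x and row y."""
    height, width = len(image), len(image[0])
    # Clamp so that the four neighbours stay inside the image.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x1, y1 = int(x), int(y)
    x2, y2 = min(x1 + 1, width - 1), min(y1 + 1, height - 1)
    tx, ty = x - x1, y - y1      # normalised distances (grid spacing is 1)
    # Interpolate along x on the two rows, then along y between the results.
    r1 = (1 - tx) * image[y1][x1] + tx * image[y1][x2]
    r2 = (1 - tx) * image[y2][x1] + tx * image[y2][x2]
    return (1 - ty) * r1 + ty * r2


def resize_bilinear(image, theta):
    """Shrink image by the reduction factor theta (theta > 1 shrinks)."""
    height, width = len(image), len(image[0])
    new_h = max(1, round(height / theta))
    new_w = max(1, round(width / theta))
    return [
        [bilinear_sample(image, i * theta, j * theta) for i in range(new_w)]
        for j in range(new_h)
    ]
```

For example, `bilinear_sample([[0, 10], [20, 30]], 0.5, 0.5)` averages the four corner values equally. The mapping from output to input coordinates (`i * theta`) is one simple convention among several; a production resizer might align pixel centres instead.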

Regional Gradient-Feature Reconstruction
The target in the image usually has a well-defined contour, while most backgrounds do not. According to the difference between the target and the background, the background and target can be distinguished by gradient features.
To quantify the possibility of targets being contained in a region, the region size is adjusted to a fixed size, and then the gradient of the whole region is calculated as the feature vector. For image I, I (i, j) represents the grey value at position (i, j) in the image.
The gradient $I_x$ of image $I$ along the x-direction is the first derivative of the image in the x-direction, and the gradient $I_y$ along the y-direction is the first derivative of the image in the y-direction, as defined in Equations (7) and (8) with step size $h$. In the actual algorithm design process, the value of $h$ is generally 1, and the Sobel operator is applied to quickly extract image gradient features. The Sobel operator is a discrete difference operator that approximates the gradient of the image brightness function. Applying this operator at any point in the image produces the corresponding gradient vector. The Sobel operator is based on the gray-weighted difference between the upper, lower, left and right adjacent points of a pixel. To obtain the overall gradient of a pixel in the x and y directions, it is only necessary to add the gradient results calculated by the Sobel operator in each direction. The Sobel operator is used to calculate the gradient characteristics of the image, as shown in Equations (9) and (10):

$$I_x = S_x \otimes \mathrm{Img} \tag{9}$$

$$I_y = S_y \otimes \mathrm{Img} \tag{10}$$

where Img represents the input image, $S_x$ and $S_y$ represent the Sobel operators, and $\otimes$ represents the gradient operation. The gradient amplitude $M(i, j)$ of image Img at $(i, j)$ is obtained from $I_x$ and $I_y$, as shown in Equation (11). Figure 2 shows a schematic of the image after scaling. The target ship can be better represented by selecting an appropriate scaling factor.
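A minimal sketch of Sobel-based gradient extraction is given below for illustration. It uses the standard 3 × 3 Sobel kernels and, as an assumption, combines the two directional responses with the Euclidean norm; the exact form of the amplitude used in the paper may differ (for example, a sum of absolute values). Border pixels are simply left at zero, and all names are hypothetical.

```python
import math

# Standard 3 x 3 Sobel kernels for the x and y directions.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def apply_kernel(image, kernel, row, col):
    """Weighted sum of the 3 x 3 neighbourhood centred on (row, col)."""
    total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            total += kernel[dr + 1][dc + 1] * image[row + dr][col + dc]
    return total


def gradient_magnitude(image):
    """Sobel gradient magnitude for interior pixels; borders stay 0."""
    height, width = len(image), len(image[0])
    magnitude = [[0.0] * width for _ in range(height)]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            gx = apply_kernel(image, SOBEL_X, r, c)
            gy = apply_kernel(image, SOBEL_Y, r, c)
            # Assumption: Euclidean norm of the two directional gradients.
            magnitude[r][c] = math.hypot(gx, gy)
    return magnitude
```

On a flat region both responses are zero, while a sharp vertical edge produces a large x-response, which is the property that lets the gradient feature separate well-contoured targets from smooth background.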

Vector Binarization of Flow Convolution
Linear convolution is essentially the computation of the inner product of two vectors. By representing a vector as the weighted sum of multiple binary vectors consisting of −1 and 1, the inner product can be computed quickly using simple bit operations. The convolution operation is an important step in the realisation of whole-target detection. When hardware is employed to realise it, a scheme is needed that computes the convolution accurately, does not occupy too much space, and is fast enough to complete the convolution of a target region within one pixel clock. During the convolution operation, it is necessary to know the values of the pixels around the current pixel. However, during image processing the data arrive as a pixel stream rather than as a whole image, and thus it is necessary to cache the surrounding pixels that are needed. Algorithm 2 shows a summary of the vector binarisation operation.

Algorithm 2 Vector binarisation operation
Input: vector $w$ to be approximated; number of binary vectors $N_w$
Output: $\{\partial_j, \beta_j\}_{j=1}^{N_w}$, the $N_w$ binary vectors and corresponding weights
1. Do the following steps:
2. Initialise the residual $\varepsilon = w$
3. Update $\partial_j$ and $\beta_j$ for $j = 1, \ldots, N_w$

The approximate representation of $w$ obtained through the vector binarisation approximation algorithm is $w \approx \sum_{j=1}^{N_w} \beta_j \partial_j$. Assuming that the region size is 8 × 8, the gradient feature is a 64-dimensional feature vector, $x \in \mathbb{R}^{64}$. The model parameters of the linear support-vector-machine (SVM) classifier trained on gradient features have the same dimension as the gradient feature; they therefore also form a 64-dimensional vector, defined as $w \in \mathbb{R}^{64}$. In the above summary of Algorithm 2, $\partial_j$ represents the binary basis vector, $\partial_j \in \{-1, 1\}^{64}$. After the binary approximation algorithm, there are a total of $N_w$ binary basis vectors, and $\beta_j$ represents the weight coefficient corresponding to the basis vector $\partial_j$.
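The update rule of step 3 is not reproduced in the text above. The sketch below therefore assumes the common greedy residual scheme for this kind of binary approximation: each basis vector is the sign of the current residual, its weight is the least-squares projection of the residual onto that basis, and the residual is reduced accordingly. The function names are hypothetical and the rule is an assumption, not a statement of the authors' exact procedure.

```python
def binarise_vector(w, n_w):
    """Greedy approximation w ~ sum_j beta_j * a_j with a_j in {-1, +1}^d."""
    residual = list(w)                    # step 2: the residual starts as w
    bases, weights = [], []
    for _ in range(n_w):
        # Assumed update: basis = sign of residual (zero maps to +1).
        a = [1 if value >= 0 else -1 for value in residual]
        # Least-squares weight <a, residual> / ||a||^2, where ||a||^2 = d.
        beta = sum(ai * ri for ai, ri in zip(a, residual)) / len(a)
        residual = [ri - beta * ai for ri, ai in zip(residual, a)]
        bases.append(a)
        weights.append(beta)
    return bases, weights


def reconstruct(bases, weights):
    """Rebuild the approximation sum_j beta_j * a_j."""
    dim = len(bases[0])
    return [sum(b * a[k] for a, b in zip(bases, weights)) for k in range(dim)]
```

For instance, approximating `[3.0, -1.0, 2.0, -2.0]` with two binary vectors gives weights 2.0 and 0.5. Once the weights and bases are fixed, an inner product with $w$ reduces to a few inner products with ±1 vectors, which is what makes bit-level evaluation possible.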
The pixels around the target pixel are stored in registers according to their addresses. During the algorithm's operation, values flow from left to right and top to bottom through the image as required. The next step is to convolve the pixel values stored in the registers with the corresponding convolution-kernel weights. The ping-pong operation is a commonly used data-flow control technique: an input-data selection unit assigns the input data stream to different data buffers at equal time intervals, and an output-data selection unit, switching in step with the input unit on each beat, sends the buffered data to the processing module continuously for calculation. Because the input and output data flows are continuous, the technique is well suited to pipelined processing, completing seamless buffering and processing of data while greatly saving buffer space. In this paper, the number of convolution kernels is constant, a new pixel can be input in each pixel-clock cycle, and the convolution result is output after a delay of several clock cycles. If the ping-pong operation is not used, the data preprocessing module becomes the bottleneck that limits the system's data throughput. By optimising the design of the ping-pong operation and the cache, the data throughput of the system is improved at the cost of a longer data buffer delay. This process can be defined as follows: where $T_{xn}$ represents the pixel matrix processed according to the regional gradient, and $C_{xn}$ represents the convolution template; the templates in this article are 11 × 11 pixels, and there are six template types in different directions, as shown in Figure 3. Max represents the maximum pixel matrix obtained after convolving the image matrix with the several templates.
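The need to cache surrounding pixels when data arrive as a stream can be illustrated in software with a simple line buffer. The sketch below is a simplified model, not the hardware ping-pong design described above: it keeps only the k most recent rows and emits every k × k window once a row is complete, whereas real hardware would emit one window per pixel clock. The function name and the window size default are hypothetical.

```python
from collections import deque


def stream_windows(pixels, width, k=3):
    """Yield (top_row, left_col, window) for each k x k window of a raster-order pixel stream."""
    lines = deque(maxlen=k)     # line buffer: only the k most recent rows
    current = []
    row = 0
    for value in pixels:
        current.append(value)
        if len(current) == width:           # one full row has arrived
            lines.append(current)
            current = []
            if len(lines) == k:             # enough rows are cached
                for col in range(width - k + 1):
                    window = [line[col:col + k] for line in lines]
                    yield row - k + 1, col, window
            row += 1
```

Only k rows are ever held in memory, regardless of the image height, which mirrors the motivation for buffering a few lines rather than the whole frame.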
The value of each pixel of Conv is compared with the detection threshold $\xi$. A value greater than $\xi$ is regarded as a suspected target ship, and the coordinate information of the target is output; otherwise, the pixel is regarded as a non-target point. To choose an appropriate value of $\xi$, the recall rates for different choices of $\xi$ were counted, as shown in Table 1. A value of $\xi = 0.6$ is a turning point in the recall rate. When $\xi$ is set larger than 0.6, the recall rate quickly decreases. On the other hand, when $\xi$ is set smaller than 0.6, the subsequent processes slow down and additional false positives are introduced. We therefore choose $\xi = 0.6$ in our method, a value supported by a large number of tests: it keeps the recall rate of the algorithm within an appropriate range while maximising real-time processing performance. The target information output by the threshold model is regarded as the input for region classification; it mainly includes an x-coordinate, a y-coordinate, a width and a height.
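The thresholding step can be sketched as follows. The example assumes that each template yields a response map normalised to the range [0, 1], that the maps are combined by an element-wise maximum, and that each detection is reported as a window-sized box anchored at the responding pixel; these conventions and the function names are illustrative assumptions rather than details given in the text.

```python
def max_response(responses):
    """Element-wise maximum over the per-template response maps."""
    return [[max(values) for values in zip(*rows)] for rows in zip(*responses)]


def extract_candidates(response, xi=0.6, window=11):
    """Return (x, y, width, height) for every response strictly above xi."""
    candidates = []
    for row_index, row in enumerate(response):
        for col_index, value in enumerate(row):
            if value > xi:
                candidates.append((col_index, row_index, window, window))
    return candidates
```

Lowering `xi` returns more candidates, which raises recall but also increases the work for the later classification stages, consistent with the trade-off described above.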

Multiple-Feature Fusion Classification
To solve the problem of complete extraction of the ship region, the suspected region of the target ship is located step-by-step based on the idea of coarse-to-fine detection, as is shown in Figure 4.
The multiple-feature classification process is designed in two parts. The first is Fourier global spectral feature extraction. Global spectral features based on the Fourier transform are extracted from positive and negative samples to train the classifier. The magnitude of the gradient generated by the Fourier global spectral feature represents the difference between a point in the image and its neighbourhood, which helps to distinguish the target ship from the ocean background and initially exclude false alarms. The second is local feature classification through a lightweight CNN. Through a full analysis of target ships in infrared remote sensing images, features that effectively distinguish target ships from typical false alarms (clouds, tracks, etc.) are selected. The optimal feature subset of the target can then be constructed to quickly and accurately eliminate false alarms, improving the accuracy and universality of ship detection.

Fourier Global Spectral-Feature Extraction
Frequency-domain analysis offers a different view of a signal and can make certain information more visible. For example, when the magnitude of a Fourier transform [35,36] is plotted, visible spikes reveal hidden frequencies in a mixed signal and help to distinguish between several signals. Specifically, this stage performs a Fourier transform on the candidate slices obtained in the previous stage. For each candidate target image, the spatial domain is converted to the frequency domain. In other words, a two-dimensional Fourier transform is applied to each candidate region to obtain the spectral domain, which is the distribution of the image gradient. The magnitude of the gradient represents the strength of the difference between a point in the image and its neighbouring points, which is used to better distinguish the target ship from the ocean background. During training, spectral features are extracted by Fourier transform from positive sample sets in six different directions (ship pointing up, down, left, right, tilted 45 degrees left, and tilted 45 degrees right). The six templates are shown in Figure 5.
More specifically, features are extracted from six parts of each square sample slice: the first part is the whole area of the square data block; the second is the upper part (width 32, height 8); the third is the left part (width 8, height 32); the fourth is the lower part (width 32, height 8); the fifth is the right part (width 8, height 32); and the sixth is the middle part (width 16, height 16). These extracted features are shown in Figure 6. In other words, the spectral features of each sample slice are extracted from one global region, four background regions, and one central region to which the target belongs. The two-dimensional Fourier transform is defined as follows:
$$F(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, e^{-j 2\pi \left( \frac{ux}{M} + \frac{vy}{N} \right)} \qquad (13)$$
where $u$ and $v$ are frequency variables, $u$ ranges from 0 to $M - 1$, and $v$ ranges from 0 to $N - 1$; $f(x, y)$ represents the image function, and $M$ and $N$ are the lengths of the sequence $f(x, y)$ along its two dimensions. The feature descriptions of the entire global slice $I_1$, the centre region $I_2$ containing the target ship, and the background region $I_3$ outside the centre are $f_1$, $f_2$, and $f_3$. The calculation process refers to Equation (14) as follows:
$$f_i = A(I_i) \qquad (14)$$
where $i$ ranges from 0 to 5 and $A$ is the Fourier transform operation. The frequency-domain descriptor $F$ is obtained by fusing the global and local features of the image, $F = [f_0; f_1; f_2; f_3; f_4; f_5]$. The same procedure is applied to the negative sample set.
The spectral feature vectors obtained by the Fourier transform of the positive and negative sample sets in the steps above are sent to a classifier. In the experiments, Fourier global features are applied to roughly eliminate false alarms and extract several candidate areas. For ease of understanding, the six-part Fourier transform procedure is shown in Algorithm 3.
The global Fourier transform is applied to each candidate box to obtain the corresponding feature description and achieve a rough classification of the candidate regions; local features are then extracted by the lightweight CNN model, described in the following section, to further eliminate false alarms.
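The six-region spectral descriptor can be illustrated with a minimal standard-library sketch. It assumes a 32 × 32 candidate slice (implied by the region sizes quoted above), places the 16 × 16 middle region at the centre, and uses the magnitude of the spectrum as the feature; these three choices, and the names `dft2`, `crop`, `REGIONS` and `fourier_descriptor`, are illustrative assumptions rather than details confirmed by the paper.

```python
import cmath

def dft2(block):
    """Naive 2-D DFT of a list-of-lists block (rows x cols)."""
    rows, cols = len(block), len(block[0])

    def dft1(seq):
        n = len(seq)
        return [sum(seq[x] * cmath.exp(-2j * cmath.pi * u * x / n) for x in range(n))
                for u in range(n)]

    # The 2-D DFT is separable: transform every row, then every column.
    row_pass = [dft1(r) for r in block]
    col_pass = [dft1([row_pass[y][u] for y in range(rows)]) for u in range(cols)]
    # col_pass is indexed [u][v]; return it in the input's [row][col] layout.
    return [[col_pass[u][v] for u in range(cols)] for v in range(rows)]

def crop(img, top, left, height, width):
    """Cut a height x width sub-block whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]

# (top, left, height, width) of the six regions in a 32 x 32 slice.
REGIONS = [
    (0, 0, 32, 32),   # whole slice
    (0, 0, 8, 32),    # upper strip
    (0, 0, 32, 8),    # left strip
    (24, 0, 8, 32),   # lower strip
    (0, 24, 32, 8),   # right strip
    (8, 8, 16, 16),   # middle block (assumed to be centred)
]

def fourier_descriptor(slice32):
    """Concatenate the magnitude spectra of the six regions into one vector."""
    feats = []
    for top, left, h, w in REGIONS:
        spectrum = dft2(crop(slice32, top, left, h, w))
        feats.extend(abs(c) for row in spectrum for c in row)
    return feats
```

For a 32 × 32 slice this yields a vector of 1024 + 4 × 256 + 256 = 2304 values, which would then be passed to the classifier. In practice a fast Fourier transform would replace the naive `dft2`.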

Local Feature Classification through Lightweight CNN
Generally, when deep learning methods are used to extract features for classification, deeper networks and larger datasets are needed to achieve higher classification performance. However, the larger the model, the more parameters it has and the more computing resources it consumes, which makes it difficult to meet the limited resource budget of an onboard processor. Therefore, to balance accuracy, speed and memory, the global features of positive and negative samples are first extracted with the Fourier spectrum features described in the previous part to roughly eliminate false alarms among the candidate regions of the target ship. A lightweight classification network is then designed to accurately identify the local features of the ship. Fourier global features are combined with local features extracted by a lightweight network in this section for two reasons. (a) In general, the fully connected layers in deep networks learn global patterns from the feature space and the convolutional layers learn local patterns, while lightweight networks can only learn simple, low-level features. (b) The receptive field grows gradually as the number of network layers increases. If the network is shallow, it has fewer convolution kernels, and its receptive fields cannot cover the whole image. Therefore, the Fourier global spectral features effectively reduce the number of invalid candidate regions produced by the previous stage, which reduces the classification workload of the lightweight network.
The target ship in the remote sensing image is small, approximately 10 to 50 pixels. Given this small scale, model compression will compensate for the loss of accuracy. Therefore, a lightweight network with fewer layers is designed in this section. The lightweight classification network consists of four convolutional layers and two fully connected layers. The network structure is shown in Figure 7. The first three convolutional layers use 3 × 3 convolution kernels, each followed by a max-pooling layer for downsampling. The last convolutional layer uses a 1 × 1 convolution kernel. Deep convolution and 1 × 1 convolution can reduce the computational complexity of the model with a small precision loss. The convolutional layers use rectified linear unit (ReLU) activation functions, which make the network less prone to vanishing gradients than sigmoid or tanh activation functions. In addition, different data enhancement strategies are adopted in the training process. The last fully connected layer and the softmax layer address the classification problem based on the features extracted by the preceding convolutional layers. During the experiment, cross-entropy was employed to define the loss function as follows:
$$\mathrm{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k} y_{ik} \log p_{ik} \qquad (15)$$
where $N$ is the batch size, which is 128 in this experiment; $i$ indexes the candidate slices; $y_{ik}$ indicates whether the label of candidate slice $i$ belongs to category $k$; and $p_{ik}$ is the predicted probability that candidate slice $i$ belongs to category $k$. To facilitate understanding, the extraction and calculation of feature vectors at each layer are briefly shown in Algorithm 4.
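As a small worked illustration of the cross-entropy loss used for training, the sketch below averages the negative log-probability of the true class over a batch. It assumes one-hot labels given as class indices and a softmax output layer; the function names are illustrative and not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(batch_logits, batch_labels):
    """Mean cross-entropy over a batch; each label is the index of the true class."""
    loss = 0.0
    for logits, k in zip(batch_logits, batch_labels):
        p = softmax(logits)
        # With one-hot labels, y_ik is 1 only for the true class k,
        # so the inner sum over classes reduces to log p_ik.
        loss -= math.log(p[k])
    return loss / len(batch_logits)
```

For a two-class problem (ship versus false alarm), an undecided prediction with equal scores gives a loss of ln 2, and the loss shrinks towards zero as the score of the true class dominates.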

This section proposes a lightweight convolutional network that operates on target candidate regions under the resource constraints of real scenes. Building on the multiscale candidate-region extraction model, the network classifies each candidate region as a real target or a false alarm, maps the index of each candidate region classified as a target back to the corresponding connected region in the original image, and locates the target using the bounding rectangle of the largest connected region. In this way, the bounding-box regression step of deep-learning-based object detection is avoided and the amount of computation is reduced, so the computational complexity of the model is lowered with only a small loss of precision.

Classifier Training and Target Confirmation
As a powerful classification method that can minimise the classification error rate and maximise generalisation, the support vector machine (SVM) works as follows [37]: two classes of samples that are linearly inseparable in the input space are mapped to a high-dimensional feature space by a kernel function, and a linearly constrained quadratic programme is solved in that space to obtain the maximum-margin classification hyperplane that linearly separates the samples.
For the dataset $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, $y_i \in \{-1, +1\}$, and the hyperplane $(w, b)$, the geometric margin of the sample point $(x_i, y_i)$ is defined as $\gamma_i = y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|} \right)$, and the minimum geometric margin of the hyperplane $(w, b)$ with respect to the training dataset $T$ is $\gamma = \min_{i=1,2,\ldots,m} \gamma_i$. Finding the separating hyperplane with the maximum geometric margin means maximising this minimum geometric margin, which can be expressed as the constrained optimisation problem
$$\max_{w,b}\ \gamma \quad \text{s.t.} \quad y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|} \right) \ge \gamma, \quad i = 1, 2, \ldots, m \qquad (16)$$
Considering the relationship between the geometric margin and the functional margin, the linearly separable support vector machine can be transformed into the following optimisation problem:
$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) - 1 \ge 0, \quad i = 1, 2, \ldots, m \qquad (17)$$
For linearly inseparable data, a soft margin can be introduced to allow some samples to violate the constraints; the optimisation objective function is then defined as follows:
$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, m \qquad (18)$$
where $\xi_i$ are slack variables and $C$ represents the penalty applied to misclassified samples; the whole problem can still be solved with the Lagrange multiplier method. It can be seen from the objective function that an SVM ultimately achieves a compromise between the maximum classification margin and the minimum classification error, and that its penalty is the same for positive and negative classification errors. In the process of extracting potential ship regions, the classifier must ensure a high recall rate to avoid missed detections as far as possible. The risk of classifying a ship as background therefore differs from the risk of classifying background as a target, and the two kinds of misclassification cannot be treated equally. Accordingly, this paper modifies the optimisation objective function and constructs a risk-unbalanced SVM classifier that overcomes this shortcoming of the traditional SVM classifier. The risk-unbalanced SVM classifier is applied to the task of extracting potential ship regions.
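A soft-margin objective with class-dependent penalties can be sketched as follows. This is a hedged illustration of the general cost-sensitive idea, not the authors' training code: the names `c_pos` and `c_neg` (penalties on violated positive and negative samples) and the toy data are assumptions, and the function only evaluates the objective for a given hyperplane rather than solving the optimisation problem.

```python
def weighted_svm_objective(w, b, samples, c_pos, c_neg):
    """Evaluate 0.5*||w||^2 plus class-weighted hinge penalties.

    samples is a list of (x, y) pairs with x a feature list and y in {-1, +1}.
    """
    reg = 0.5 * sum(wi * wi for wi in w)
    penalty = 0.0
    for x, y in samples:
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        slack = max(0.0, 1.0 - margin)  # smallest slack satisfying the constraint
        penalty += (c_pos if y == +1 else c_neg) * slack
    return reg + penalty

# Toy data: one ship inside the margin, one clean background, one misclassified background.
toy = [([0.5, 0.0], +1), ([-2.0, 0.0], -1), ([0.5, 0.0], -1)]
high_recall = weighted_svm_objective([1.0, 0.0], 0.0, toy, c_pos=10.0, c_neg=1.0)
```

Setting `c_pos` larger than `c_neg` makes a margin violation by a ship sample more expensive than one by a background sample, which pushes the trained hyperplane towards higher recall.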
The optimisation objective function of the SVM with unbalanced risk, with slack variables $\xi_i$, is as follows:
$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C_p \sum_{i:\, y_i = +1} \xi_i + C_n \sum_{i:\, y_i = -1} \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \qquad (19)$$
where $C_p$ and $C_n$ are the risk of positive samples and the risk of negative samples, respectively, and generally $C_p > C_n$. The flow of the cascade classifier training algorithm is shown in Algorithm 5.
For the same target, the classification algorithm may identify several bounding boxes. Because one target then corresponds to multiple bounding boxes, redundant windows must be filtered out so that only the optimal bounding box is retained. Non-maximum suppression searches for the local maximum among all bounding boxes: within a given neighbourhood it retains the window with the highest score and suppresses windows with lower scores, filtering out part of the bounding boxes to improve the final detection accuracy.
Non-maximum suppression is an iterative traverse-and-eliminate process; the specific algorithm is given in Algorithm 6.
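For reference, the traditional greedy form of non-maximum suppression can be sketched in a few lines. This is the standard baseline only, not the improved variant proposed in the paper; the box format `(x1, y1, x2, y2)`, the default threshold of 0.5 and the function names are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, nt=0.5):
    """Greedy NMS: return indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        m = order.pop(0)  # box with the highest remaining score
        keep.append(m)
        # Drop every remaining box that overlaps the selected box by more than nt.
        order = [i for i in order if iou(boxes[m], boxes[i]) <= nt]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, [0.9, 0.8, 0.7])
```

Here the second box overlaps the first with an IoU of about 0.68 and is suppressed, so `kept` is `[0, 2]`. The same hard deletion is what removes a genuine neighbouring ship when two vessels lie side by side.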
The traditional non-maximum suppression (NMS) algorithm takes a series of candidate detection boxes $B$ and the corresponding probability values $S$. First, the probability values are sorted and the candidate box $M$ with the maximum probability is selected; $M$ is added to the final detection result set $D$ and deleted from $B$. Candidate boxes in $B$ whose overlap with $M$ is greater than the threshold $N_t$ are then deleted. The main problem of this algorithm is that adjacent detection boxes are forcibly deleted. If ships are clustered, their detection boxes overlap, so ships close to each other go undetected and the detection accuracy of the algorithm is reduced.
... from the image data collected by a satellite, which has important experimental reference value. Ships in the dataset have been marked with rectangular boxes. In different scenarios, target ships have different sizes, directions and interference. To verify the anti-interference ability of the algorithm, the dataset should include samples from various scenarios, such as cloud interference, trail interference, reef interference and sea-clutter interference samples, to make the dataset persuasive.
The images employed in the experiment include ocean scenes and nearshore scenes, with both calm, undisturbed sea backgrounds and complex backgrounds such as clouds and reefs. Ship lengths vary and ship azimuth angles are arbitrary, which makes the data suitable for comprehensive testing and comparison of algorithm performance. There are 214 images in the test set, containing 1270 ships. This paper enlarges the dataset using the following methods: random clipping, where a fixed-size image block is cropped at random from the original image and the ships in the block are marked with rectangular boxes; mirror flipping, where horizontal and vertical flips are used to construct new samples; rotation, where the whole image is rotated in a specific direction around a fixed point; and contrast enhancement, where the image's grey values are changed to improve its visual quality. We divide the images into 8456 sub-images and organise these sub-images into five datasets: dataset 1 (non-interference), which includes 4129 sub-images containing 308 target ships; dataset 2 (cloud interference), which includes 2476 sub-images containing 826 target ships; dataset 3 (trail interference), which includes 1851 sub-images containing 421 target ships; dataset 4 (reef interference), which includes 1168 sub-images containing 85 target ships; and dataset 5 (sea-clutter interference), which includes 2788 sub-images containing 428 target ships. These datasets are classified according to interference type, and several types of interference can occur in the same scene. However, the total number of target ships is still 1270 once target ships repeated across the sub-datasets are excluded.
To evaluate the stability of the dataset and its suitability for small-sample detection, we tested it through cross-validation. The procedure is as follows: the sample set is divided into 10 parts; K parts are randomly selected for training, and the remaining parts are used for testing. To ensure reliability, each experiment was repeated three times, and the average of the three results was taken as the accuracy for that setting. In this way, the test accuracy for different proportions of training samples is obtained, which allows the accuracy of the algorithm to be evaluated more reliably. The accuracy is stable across training proportions, which shows that the proposed method represents targets well and can effectively distinguish the target from the background. In particular, when the size of the training set exceeds 50% of the total number of samples, the test accuracy is basically stable at over 98%.
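The splitting procedure can be sketched as below. This is a minimal sketch, assuming the samples are reshuffled for every repetition; `evaluate` is a hypothetical callback that trains on the first argument and returns the test accuracy on the second, and the function name and `seed` parameter are likewise illustrative.

```python
import random

def repeated_split_accuracy(samples, k, evaluate, parts=10, repeats=3, seed=0):
    """Average test accuracy when k of `parts` folds are used for training."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        # Deal the shuffled samples into `parts` folds of near-equal size.
        folds = [shuffled[i::parts] for i in range(parts)]
        train_ids = set(rng.sample(range(parts), k))
        train = [s for i in range(parts) if i in train_ids for s in folds[i]]
        test = [s for i in range(parts) if i not in train_ids for s in folds[i]]
        accuracies.append(evaluate(train, test))
    return sum(accuracies) / repeats
```

Calling the function for K = 1, ..., 9 gives the accuracy as a function of the training proportion, which is the curve the stability claim above refers to.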

Applicable Platform
As the volume of remote sensing image data grows geometrically and intelligent processing algorithms become more complex, it is increasingly difficult to process remote sensing images in real time on a satellite platform with strictly limited resources. For large-scale remote sensing images, spaceborne processing places high demands on algorithm running time, storage space and detection performance. At present, spaceborne infrared ship detection systems built cooperatively on DSPs and FPGAs have become mainstream, but they face two challenges: (a) the complex space environment and the hardware platform impose limits on volume, weight, power consumption, transmission bandwidth and other aspects; and (b) satellite processors are slow to be updated because of equipment reliability, stability and cost requirements, so their performance is generally lower than that of mainstream processors.
The test and verification platform for the target-detection algorithm proposed in this paper is a Tesla K40m GPU platform equipped with 64 GB of memory, running Ubuntu 16.04 and MATLAB 2016. This test environment is mainly used to train the models and verify the performance of the modules. Because a large number of templates need to be trained, this environment is needed to ensure the accuracy of the training results. According to the actual demands of infrared target detection, the algorithm will eventually be deployed on an FPGA platform. For example, when the xc7k410t from Xilinx is selected as the core processing module of the image processing unit, it is difficult to meet the deployment requirements of deep learning methods, mainly because the compression methods of deep neural networks have high resource requirements, especially for large networks. Moreover, it is difficult to design a state machine that schedules the data flow for different layers, and there will be considerable redundancy in logic resources. Therefore, although deep learning methods achieve good detection accuracy in natural-image target detection, they are severely limited when images must be processed with limited resources. This is the advantage of the lighter network model designed in this paper, which reduces resource use while maintaining the detection rate.
The storage space of the network model designed in this paper is less than 25 MB, which meets the requirement of limited space-borne resources and provides a feasible scheme for real-time satellite image processing. Considering the low power consumption of FPGAs, the algorithm proposed in this paper provides a feasible solution for deploying deep learning networks on satellite-borne FPGAs with guaranteed accuracy.

Detection Performance Verification
In this paper, the main indicators used for algorithm performance verification are the recall rate $R$, the accuracy $P$ and the error rate $E$, where the error rate is the sum of the missed-detection rate $L$ and the false-detection rate $F$ and effectively reflects the robustness of the algorithm. The indicators are calculated according to Equations (20)-(24):
$$R = \frac{D_c}{T_s} \qquad (20)$$
$$P = \frac{D_c}{D_s} \qquad (21)$$
$$L = \frac{D_l}{T_s} \qquad (22)$$
$$F = \frac{D_f}{D_s} \qquad (23)$$
$$E = L + F \qquad (24)$$
where $D_c$ represents the number of correctly detected target ships, $D_f$ represents the number of falsely detected target ships, $D_l$ represents the number of missed target ships, $D_s$ represents the number of targets detected by a given method, and $T_s$ represents the total number of real targets in the dataset, which is 1270. The proposed method is compared qualitatively and quantitatively with several other representative target-detection methods on the dataset. The algorithm performance was verified on datasets under different interference scenarios, as shown in Table 2.
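The indicators can be computed from three counts, as in the sketch below. It assumes that recall and missed-detection rate are taken over the real targets ($T_s = D_c + D_l$) and that accuracy and false-detection rate are taken over the detections ($D_s = D_c + D_f$); this reading reproduces the reported relationship between recall, accuracy and error rate (for example, 98.2% and 96.8% give an error rate of 5%). The counts in the example are illustrative only, not results from the paper.

```python
def detection_metrics(d_c, d_f, d_l):
    """Recall, precision, miss rate, false-detection rate and total error rate.

    d_c: correctly detected ships, d_f: false detections, d_l: missed ships.
    """
    t_s = d_c + d_l  # total real targets (assumed decomposition)
    d_s = d_c + d_f  # total detections reported by the method (assumed decomposition)
    miss_rate = d_l / t_s
    false_rate = d_f / d_s
    return {
        "recall": d_c / t_s,
        "precision": d_c / d_s,
        "miss_rate": miss_rate,
        "false_rate": false_rate,
        "error": miss_rate + false_rate,
    }

# Illustrative counts chosen so that the real targets sum to 1270.
m = detection_metrics(d_c=1200, d_f=50, d_l=70)
```

Under these definitions the error rate always equals (1 − recall) + (1 − precision), so it penalises missed ships and false alarms in a single number.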
The test results show that the proposed algorithm achieves its best detection performance in the non-interference scenario, with a recall rate of 98.2%, an accuracy of 96.8% and an error rate of only 5%. Among the test scenarios, island interference clearly has the greatest impact on the algorithm, mainly because islands and ships are highly similar and difficult to tell apart. Even in this case, the recall rate is 92.9%, the accuracy is 90.9% and the error rate is 16.2%, which still reaches the expected performance and shows that the algorithm has a strong anti-interference ability.
After multi-scene verification of our algorithm, we compare it with six target-detection algorithms: SVDNet [39], Faster R-CNN [40], SPP-PCANet [41], RB [42], MRA [43], and DF [44]. SVDNet is designed based on the recently popular convolutional neural networks and the singular value decomposition algorithm, and provides a simple but efficient way to adaptively learn features from remote sensing images. Faster R-CNN proposes a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. SPP-PCANet proposes coarse-to-fine ship detection strategies based on anomaly detection and spatial pyramid pooling. RB proposes a nearly closed-form ship rotated-bounding-box space for ship detection and designs a method to generate a small number of high-potential candidates based on this space. MRA densely divides a test infrared image into a set of image patches and estimates the radiation anomaly of each patch with a Gaussian mixture model; target candidates are obtained from anomalous patches and then checked by a more discriminative criterion to obtain the final detection result. DF consists of a simple region proposal network and a deep forest ensemble: the region proposal network, trained over gradient features, robustly generates a small number of candidates that precisely cover target ships in various backgrounds, and the deep forest ensemble adaptively learns features from remote sensing data and efficiently discriminates real ships from region proposals. Based on a comparative analysis of these methods in terms of recall, precision, running time, etc., we summarise the experimental results in three parts.
Compared with traditional detection methods such as RB and MRA, our method has a greatly improved detection performance, and the processing unit is smoother and more efficient. Compared with deep learning methods such as SVDNet, Faster R-CNN, SPP-PCANet and DF, our method is lighter while maintaining detection accuracy, especially compared with the lightweight networks SVDNet and SPP-PCANet, and shows substantial gains in processing speed. In addition, some of the above algorithms, such as SVDNet and MRA, are also space-oriented; our method is nevertheless better in terms of detection performance and processing speed, although MRA may use fewer resources because it does not contain a network. The performance comparison results are shown in Table 3. In brief, our approach is more effective at ship detection than the other methods and takes less time to process an image.
The detection results of different methods in different scenarios are shown in Figure 8. The first row shows a thin-cloud interference scene with four real target ships; the second row shows cloud interference and sea-clutter interference scenes with three real target ships; the third row shows reef interference and trail interference scenes with one real target ship; and the fourth row shows a ship-intensive scene with eleven real target ships. In Figure 8, the yellow boxes represent the real targets in the original image, and the red boxes represent the detection results of the different methods. Even though some false alarms are generated by the various kinds of interference, our method achieves impressive detection performance on different sea surfaces, with fewer misses and errors than the other algorithms. This shows that the proposed algorithm is highly stable across different scenarios.

Robustness Verification
The proposed algorithm is robust and effective. Specifically, robustness is reflected in the following three points. Firstly, the model has high accuracy or effectiveness. Secondly, small deviations from the model assumptions should have only a small impact on the algorithm's performance. Thirdly, large deviations from the model assumptions should not have a "catastrophic" impact on the algorithm's performance. Accordingly, we verify the first point with the F-measure, the second with the true-false-positives rate, and the third with the mean error rate.
• F-measure score The precision and recall rates are sometimes contradictory, so they need to be considered together. The F-measure is a weighted average of the precision rate and the recall rate; with α = 1 it equals the square of their geometric mean divided by their arithmetic mean, i.e., their harmonic mean. When the F-measure value is large, there are more true positives and fewer false positives. Therefore, we can verify the performance of the algorithm through the F-measure score, which corresponds to the first of the three robustness points above. The F-measure is calculated as follows:
$$F = \frac{(1 + \alpha^2)\, P R}{\alpha^2 P + R} \qquad (25)$$
where $P$ represents the algorithm accuracy, $R$ represents the algorithm recall rate, and $\alpha$ is a weighting parameter that is usually set to 1. The F-score comparison results of the different algorithms are shown in Table 4, where $F_{none}$ represents the F-score without any interference, $F_{cloud}$ the F-score under cloud interference, $F_{trail}$ the F-score under trail interference, $F_{reef}$ the F-score under reef interference, $F_{clutter}$ the F-score under clutter interference, and $F_{total}$ the total F-score of detection. The experimental results in Table 4 show that our algorithm achieves a better F-score in the different scenes, especially in the scene without interference, where the F-score of the proposed method reaches 0.975. Moreover, the full-scenario comparison shows that our method also achieves better overall results, which demonstrates the effectiveness and stability of the algorithm in different scenes.
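As a quick numerical check, the sketch below evaluates the F-measure in its standard weighted form, which is assumed here to be the form used in the paper; with α = 1 it reduces to 2PR / (P + R). The inputs are the accuracy and recall reported for the non-interference scenario.

```python
def f_measure(precision, recall, alpha=1.0):
    """Weighted F-measure; alpha = 1 gives the harmonic mean of precision and recall."""
    a2 = alpha * alpha
    return (1 + a2) * precision * recall / (a2 * precision + recall)

# Accuracy 96.8% and recall 98.2% are the non-interference figures reported for the method.
f_none = f_measure(0.968, 0.982)
```

Rounded to three decimals, `f_none` is 0.975, which agrees with the F-score quoted for the scene without interference.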

• True-False-positives graph After region proposal, the algorithm preliminarily obtains the potential target regions, removes some background interference and negative targets, and obtains the suspected positive targets. Then, real target ships, i.e., true positive targets, are screened out through region classification, and false alarms among the suspected positive targets, i.e., false-positive targets, are eliminated. This part quantifies the performance of the algorithm under different interference scenes by using the true-false-positives graph. Because different scenes have different effects on the algorithm, the curve can intuitively reflect the adaptability of the algorithm to each scene, that is, whether there will be obvious differences in the algorithm's performance when the interference changes. Therefore, this part of the test corresponds to the second of the three function-points of the above robustness verification. The true positives rate is calculated as follows:

$$T_{rate} = \frac{T_P}{T_P + F_P} \qquad (26)$$

where $T_{rate}$ means the true positives rate, $T_P$ means the number of true positive samples, and $F_P$ means the number of false-positive samples.
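The true positives rate of Equation (26) can be sketched in Python as follows. The per-scene counts are hypothetical placeholders, not data from the experiments, and the names `true_positives_rate`, `scene_counts` and `spread` are introduced only for illustration.

```python
def true_positives_rate(true_positives: int, false_positives: int) -> float:
    """T_rate = T_P / (T_P + F_P), following Equation (26)."""
    total = true_positives + false_positives
    if total == 0:
        # No suspected positive targets were produced for this scene.
        return 0.0
    return true_positives / total


# Hypothetical (T_P, F_P) counts for each interference scene.
scene_counts = {
    "cloud": (90, 10),
    "trail": (85, 15),
    "reef": (80, 20),
    "clutter": (88, 12),
}
scene_rates = {
    scene: true_positives_rate(tp, fp) for scene, (tp, fp) in scene_counts.items()
}

# Gap between the best and worst scene: a small gap suggests that the
# performance changes little when the interference changes.
spread = max(scene_rates.values()) - min(scene_rates.values())
```

Comparing the rates scene by scene, rather than only in aggregate, is what makes this measure suitable for the second function-point.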
In a supplemental test, we compare the correlations between the false-positive and true positive rates of different algorithms under different interferences, as shown in Figure 9.
The four scenarios are a cloud scenario, a trail scenario, a reef scenario and a sea-clutter scenario. The linear comparison of the data in the figure shows that our method quickly reaches the expected detection rate when generating negative samples, which indicates that the method has good robustness and anti-interference ability. The true-false-positive quantitative data under all scenarios are shown in Table 5, where $F_P$ means false-positives and $T_{rate}$ means the true positives rate.

• Mean error rate The robustness of the algorithm should be compared regarding not only the detection performance but also the avoidance of false and missed detections. To better demonstrate the robustness of the algorithm, we compare the mean error rate obtained with different image quantities, as shown in Table 6. The third function-point of robustness is to verify the impact of large deviations on the algorithm. Since the average value is considered to be a positive correlation factor of an algorithm, the mean error rate is more meaningful than any robust measure. Therefore, this part counts the average error rate over the overall operation cycle of each algorithm, and the results reflect the stability of the algorithm during operation. The lower the mean error rate, the better the compatibility of the algorithm and the stronger its anti-interference ability.
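The error-rate formula is not spelled out in this section, so the following Python sketch rests on an assumed definition: for each test run, the error rate is the share of false and missed detections among all detection outcomes, and the mean error rate is the average over runs with different image quantities. The function names and the counts are hypothetical.

```python
from statistics import mean


def error_rate(true_positives: int, false_positives: int, missed: int) -> float:
    """Assumed definition: (false + missed detections) / all outcomes."""
    total = true_positives + false_positives + missed
    if total == 0:
        return 0.0
    return (false_positives + missed) / total


def mean_error_rate(runs: list[tuple[int, int, int]]) -> float:
    """Average the per-run error rates over all test runs."""
    return mean([error_rate(tp, fp, fn) for tp, fp, fn in runs])


# Hypothetical (true positives, false positives, missed) counts for three
# runs with increasing image quantities; not data from the paper.
runs = [(45, 3, 2), (92, 5, 3), (180, 12, 8)]
overall = mean_error_rate(runs)
```

Averaging per-run rates, instead of pooling all counts, gives each image quantity equal weight, which matches the idea of checking stability across the whole operation cycle.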

Conclusions and Future Work
This paper presents an infrared ship detection method based on the combination of traditional feature recognition and lightweight CNN classification. Effective ship candidate regions are extracted by a multiscale feature extraction model, and the global features extracted by the Fourier transform are combined with the local features extracted by a lightweight CNN to eliminate false alarms and confirm the target ship. Compared with the existing methods, the proposed method is more efficient and robust for target detection in complex scenes. Our future work will focus on memory storage and explore hard negative mining strategies to improve the generalisation performance.