1. Introduction
The intelligent detection and recognition of ships is very important for maritime security and civil management. Ship detection has a wide range of applications, including dynamic harbor surveillance, traffic monitoring, fishery management, sea pollution monitoring, territorial defense, naval warfare, etc. [
1]. In recent years, satellite and aerial remote sensing technology has developed rapidly, and optical remote sensing images can provide detailed information with extremely high resolution [
2]. Therefore, ship detection has become a hot topic in the field of optical remote sensing. Because ships and the sea surface differ greatly in material, ship targets are relatively easy to detect in synthetic aperture radar (SAR) images. SAR also works under all weather and climatic conditions, so ship detection has mostly been performed on SAR images. Compared with SAR images, however, the information provided by optical remote sensing images is more intuitive and easier for humans to understand [
3]. In addition, numerous satellites and unmanned aerial vehicles (UAVs) have made it possible to obtain massive numbers of high-resolution optical remote sensing images over the sea, so more detailed information is available for detecting ships in optical remote sensing images. Ship detection plays an important part in marine target monitoring. However, this work faces the following three challenges due to complicated backgrounds:
(1) Complex backgrounds such as sea clutter, waves, shadows, clouds, mist, and water vapor may degrade image quality; under such interference, ships are sometimes even hard to see.
(2) There are many distractors with a similar color, texture, or shape to ships, such as docks, clouds, and islands, leading to high false alarm rates in ship detection.
(3) The appearance of ships in optical images changes with the imaging sensor parameters, illumination, imaging perspective, spatial resolution, integration time, and so on; it is therefore hard to build a robust model for ship detection.
According to the complexity of the sea background and the locations of ships, existing ship detection algorithms can be divided into offshore ship detection and inshore ship detection. Inshore ships are difficult to detect accurately due to the various interferences in harbor scenes, while offshore ships are not easy to detect because of the influence of clouds, wake clutter, and islands on the sea. A variety of object detection methods have been developed over the last few decades.
Traditional ship detection methods in remote sensing images mainly focus on the mining of unique characteristics of ships [
4]. Because it is difficult to find targets directly in an image, some object detection methods first locate regions of interest (ROIs) in the image based on a visual saliency model. Itti and Koch [
5] introduced the concept of a saliency map; the Itti visual saliency model is a visual attention model based on the visual nervous system of early primates. It first uses Gaussian sampling to construct Gaussian pyramids of the color, brightness, and orientation channels of an image. The Itti method requires no training, and the saliency map can be computed by purely mathematical methods. Harel et al. [
6] introduced Graph-Based Visual Saliency (GBVS), in which random walk theory is applied to visual saliency detection. GBVS establishes a Markov chain on a graph; the equilibrium state of the chain reflects the time a random walker spends at each node, and nodes that differ from their surroundings naturally accumulate mass. This accumulation reflects visual salience. Some methods perform saliency detection directly in the spatial domain, while others perform it through frequency-domain analysis. Hou and Zhang [
7] proposed the SR algorithm, in which the spectral residual is used to obtain visual saliency through analysis in the frequency domain. Achanta et al. [
8] proposed the frequency-tuned (FT) algorithm for salient object detection. It uses a Difference of Gaussians (DoG) operator as a bandpass filter to address the problem that object edges and noise appear simultaneously in the high-frequency part of the spectrum. The highlighted areas in a saliency map contain both targets and false alarms; to eliminate the false alarms, feature vectors must be designed that can distinguish targets from interference. Sun Li et al. [
9] proposed a ship detection method via ship head classification and body boundary determination, using the trapezoid shape of the ship head to distinguish ships. Xu et al. [
10] proposed a method to describe the gradient direction feature of the target. To acquire rotation-invariant features, the ship must be rotated to the vertical direction through Radon transform [
11]. The gradient direction histogram of a ship is very different from those of islands, clouds, etc. However, the detection accuracy of this method depends on how accurately the ship's axis is rotated. Qi et al. [
12] proposed the S-HOG descriptor, a histogram of gradient directions that characterizes the gradient symmetry of a ship's two sides, and applied it to automatic ship detection. In general, these methods are based on hand-designed features, so they cannot adapt well to the complex and changeable scenes in remote sensing images.
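As an illustration, the spectral residual idea described above can be sketched in a few lines. The following NumPy version is a minimal sketch, not the cited authors' implementation; the function name and the simple 3 × 3 averaging of the log spectrum are our simplifications:

```python
import numpy as np

def spectral_residual_saliency(image):
    """Spectral-residual saliency sketch for a 2-D grayscale array."""
    f = np.fft.fft2(image)
    log_amplitude = np.log(np.abs(f) + 1e-8)
    phase = np.angle(f)
    # Local average of the log spectrum (3x3 box filter via shifted sums).
    avg = sum(np.roll(np.roll(log_amplitude, dy, axis=0), dx, axis=1)
              for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0
    residual = log_amplitude - avg  # the "spectral residual"
    # Back to the spatial domain; squared magnitude gives the saliency map.
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    # Normalize to [0, 1] for display/thresholding.
    return (saliency - saliency.min()) / (np.ptp(saliency) + 1e-8)
```

The map highlights regions whose log spectrum deviates from its local average, which is why isolated targets on a smooth sea background tend to stand out.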
In recent years, with the continuous growth of hardware computing power, deep learning algorithms have developed rapidly and been applied in the field of object detection. Deep learning-based methods are widely used to detect common objects in daily life and achieve extremely high performance. Deep learning detectors can be divided into two-stage and one-stage algorithms. In two-stage methods, region proposals that may contain objects are produced first; the candidate regions are then classified in the second stage, and the exact object locations are determined by regression to obtain the final detection results. The Faster Region-based Convolutional Neural Network (Faster R-CNN) [
13] is a two-stage detector: the whole image is fed to the network, and its feature maps are obtained by a convolutional neural network. Candidate regions in the original image space are then generated by a region proposal network, and a pooling operation produces a fixed-dimensional feature representation for each region. Finally, the fully connected layers classify each region with a softmax function and regress its bounding box. The Single Shot MultiBox Detector (SSD) algorithm [
14] uses an end-to-end trained network to predict object classes and regress bounding-box positions from multi-scale feature maps, which are produced by the hierarchical down-sampling structure of the deep network. The SSD framework is built on the 16-layer Visual Geometry Group network (VGG16) [
15], with convolutional layers replacing the fully connected layers of VGG16. Combined with data augmentation strategies, the performance of SSD is close to that of Faster R-CNN. You Only Look Once (YOLO) [
16] has no candidate box extraction stage and thus belongs to the one-stage detection algorithms. The neural network directly predicts the coordinates of the bounding boxes along with the category and confidence of each object. It treats object detection as a regression problem and outputs detection results from a single end-to-end network. YOLO has been continuously improved through successive versions: YOLOv1, YOLOv2 [
17], and YOLOv3 [
18]. The structure of YOLOv1 shows that it uses a fully connected layer to predict bounding-box positions directly; this cannot adapt to objects of different scales at the same time, and spatial information is lost, reducing accuracy. YOLOv2 introduces anchor boxes into the network and samples the convolutional feature map with a sliding window, so spatial information is used well. It also adopts logistic regression instead of the softmax layer, turning single-label classification into multi-label classification. YOLOv3 proposed Darknet-53, a new architecture for feature extraction; as the backbone has more layers, detection accuracy is further improved. Zhang [
19] proposed a real-time detection framework based on tiny-YOLOv3; the model applies K-means clustering to the training set to determine optimal anchor boxes. The method achieves a good trade-off between accuracy and speed, making the network more suitable for real-time applications. To improve feature extraction, the authors added more convolutional layers to the base network and introduced 1 × 1 convolutional kernels to reduce the feature dimensionality. Owing to its tiny structure, the model needs little memory, which lowers the hardware requirements. The YOLO series is currently the fastest, but it shows large errors when detecting small objects. Object detection with deep learning still faces many difficulties, especially for small targets. Lim [
20] improved small-target detection by using context and attention: context information is extracted from the pixels surrounding small objects using more abstract features from higher layers, and an attention mechanism in the early layers focuses on small objects. Their FA-SSD model outperformed SSD on small-target detection. These models have only proven themselves on natural-scene object detection using the PASCAL VOC dataset [
21] or COCO dataset [
22], however, and are not directly applicable to remote sensing scenes. Rabbi [
23] applied a novel edge-enhanced super-resolution generative adversarial network (EESRGAN) to improve the quality of remote sensing images. The architecture takes low-resolution satellite imagery as input and outputs object detection results; the detector loss is backpropagated into the EESRGAN to improve detection performance on small objects. The method relies on diverse datasets and on techniques for creating realistic low-resolution images. Ship detection in remote sensing images is very different from natural scenes. The signal-to-noise ratio (SNR) is relatively low, which leads to poor image quality. Because the images are shot from a long distance, ships appear quite small, so the available target features are very limited. Ships come in different types and multiple scales, and as the azimuth and pitch of the imaging platform vary, the perspective of a ship changes greatly. All these factors make ship detection in remote sensing images difficult. Some researchers have applied deep learning to ship detection and recognition; at first, deep learning only replaced parts of the detection pipeline, such as feature extraction. Shao [
24] proposed a saliency-aware convolution neural network for ship detection based on the YOLOv2, introducing the CNN framework to obtain a saliency map. The ship’s class and localization are refined using saliency detection. When the confidence of the bounding box is low, the model uses salient features to obtain more precise positions of ships. Nie [
25] proposed a ship detection and segmentation method based on Mask R-CNN, using channel-wise attention to adjust the weight of each channel and a spatial attention mechanism to adjust the weight of each pixel. This makes the feature maps describe the target's features better. Li [
26] proposed a deep learning method for detecting ship targets in remote sensing images; it matches prior multi-scale rotated bounding boxes to the ground-truth boxes to obtain positive samples, which are used to train the model. The algorithm is robust for detecting ships under complex conditions such as wave clutter backgrounds, targets in close proximity, ships close to the shore, and multi-scale variation. An [
27] used a deep convolutional neural network to detect ships in Gaofen-3 SAR images. The method was based on sea clutter distribution analysis, using truncated statistics as a preprocessing scheme and an iterative censoring scheme to boost detector performance. However, all of these methods use independent feature maps for ship detection, which reduces model efficiency, and with such limited features they cannot avoid false alarms when detecting small ships under complex circumstances.
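The anchor-determination step mentioned earlier for tiny-YOLOv3 can be sketched as K-means clustering of ground-truth box sizes under a 1 − IoU distance. The version below is only illustrative (our own function names; a deterministic area-quantile initialization rather than random seeding), not code from the cited work:

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (w, h) pairs, assuming all boxes share a common corner."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100):
    """Cluster ground-truth (w, h) pairs with a 1 - IoU distance."""
    # Initialize anchors at area quantiles of the data (deterministic).
    order = np.argsort(boxes[:, 0] * boxes[:, 1])
    anchors = boxes[order[np.linspace(0, len(boxes) - 1, k).astype(int)]].astype(float)
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)  # nearest = highest IoU
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else anchors[i] for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]  # sorted by area
```

Anchors chosen this way match the size statistics of the training set, which is what makes the anchor-based regression easier to learn.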
Features and their selection are critical for object detection. There are many common features in this field, such as optical flow features and some physical attributes. Zhang [
28] introduced physics-inspired methods in crowd video analysis, including fluid dynamics, interaction force, and complex crowd motion systems. Physics-based methods can be used to represent and analyze crowd behavior, finding application in crowd video surveillance. Zhang [
29] used entropy to describe the state of a system, as entropy is well suited to measuring a system's degree of disorder. Optical flow features were used to obtain the motion information of a crowd, and based on them, the pedestrian moving region could be obtained using a flow field visualization method. Song [
30] detected isolated ships by using the shape of connected components. Clustered ships were detected by using a mixture of multi-scale Deformable Part Models and HOG features. These features were effective for ship detection. The method performed well in detecting ships gathered together and staying alongside the dock. Wang [
31] used a CNN-based classifier to separate false alarms from ship objects. A constant false-alarm rate (CFAR) detector served as the object detector, as it is simple and fast for SAR images. To exploit both the features of centered objects and the surrounding background noise, a new pooling structure called max-mean pooling was proposed to extract effective features in the CNN flow. Applying multiple features can thus increase the accuracy of ship detection. We use convolutional features for ship detection: they are more abstract than optical flow, shape, texture, and color features and can express deeper target characteristics at the semantic level. Fusing convolutional features from different layers is particularly helpful for identifying multi-scale targets.
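The fusion idea can be illustrated schematically: a deep, low-resolution feature map is upsampled and concatenated channel-wise with a shallower, high-resolution one. This is only a minimal NumPy sketch with hypothetical shapes and nearest-neighbor upsampling, not the exact CFF-SDN fusion network:

```python
import numpy as np

def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

def fuse_features(shallow, deep):
    """Fuse a shallow (C1, 2H, 2W) map with a deep (C2, H, W) map.

    The deep map is upsampled to the shallow map's resolution and the two
    are concatenated along the channel axis, so the fused map carries both
    fine-grained detail and semantic context.
    """
    up = upsample2x(deep)
    assert up.shape[1:] == shallow.shape[1:], "spatial sizes must match after upsampling"
    return np.concatenate([shallow, up], axis=0)
```

In a real network the upsampling and concatenation (or element-wise addition) are learned layers, but the shape bookkeeping is the same.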
In this paper, to cope with these problems in ship detection, a novel ship detection model based on convolutional feature fusion (CFF-SDN) for remote sensing images is proposed. The object detection framework consists of a feature extraction network, a feature fusion network, and classification and regression; the detection result contains the classification and localization of ships. The flow chart of the proposed CFF-SDN model is shown in
Figure 1.
Our method is different from other methods proposed in the literature. The main contributions of our work can be summarized as follows:
A dataset for ship detection in remote-sensing images (DSDR) is created. Deep learning methods need a large amount of training data for their complicated training processes, so a ship dataset is badly needed. DSDR contains rich satellite and aerial remote sensing imagery, making it an important resource for supervised learning algorithms.
We introduce data augmentation to supplement the lack of ship samples in military applications; preventing the model from overfitting in this way increases the detection accuracy for ship targets. We adopt an affine transformation method to change the perspectives of ships, thereby increasing the accuracy of ship detection in aerial images.
A dark channel prior is adopted to perform atmospheric correction of sea scenes. Using the dark channel prior, we remove the influence of absorption and scattering by water vapor and particles in the atmosphere. Image quality is greatly improved by atmospheric correction, which in turn benefits the accuracy of target detection in remote sensing images.
A feature fusion network is used to combine different levels of convolutional features, which makes better use of the fine-grained and semantic features of the target and achieves multi-scale detection of ships. Meanwhile, the feature fusion and anchor design help improve the performance of small-target detection.
Soft non-maximum suppression (soft NMS) is used to assign lower scores to redundant prediction boxes, thereby reducing the missed detection rate and improving the recall rate for densely arranged ships. The detection accuracy is improved compared to traditional NMS.
Our proposed approach achieves better detection accuracy and inference speed for ship detection in optical remote sensing images than previous works. The CFF-SDN model is very robust under disturbances such as fog, islands, clouds, sea waves, etc.
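The dark-channel atmospheric correction listed above can be sketched in the spirit of He et al.'s dark channel prior. This is a simplified illustration only (small patch size, no guided-filter refinement of the transmission map, illustrative parameter values), not the exact correction used in our pipeline:

```python
import numpy as np

def dark_channel(image, patch=7):
    """Per-pixel dark channel: local minimum over a patch and over color channels."""
    mins = image.min(axis=2)
    pad = patch // 2
    padded = np.pad(mins, pad, mode='edge')
    # Sliding-window minimum (simple, O(H*W*patch^2) version for clarity).
    windows = np.lib.stride_tricks.sliding_window_view(padded, (patch, patch))
    return windows.min(axis=(2, 3))

def dehaze(image, omega=0.95, t0=0.1):
    """Minimal dark-channel dehazing sketch for an (H, W, 3) image in [0, 1]."""
    dark = dark_channel(image)
    # Atmospheric light: mean color of the brightest ~0.1% dark-channel pixels.
    flat = dark.ravel()
    idx = np.argsort(flat)[-max(1, flat.size // 1000):]
    A = image.reshape(-1, 3)[idx].mean(axis=0)
    # Transmission estimate and scene radiance recovery.
    t = 1.0 - omega * dark_channel(image / A)
    t = np.maximum(t, t0)
    return (image - A) / t[..., None] + A
```

On haze-free sea patches the dark channel is near zero, so the correction mainly affects pixels veiled by water vapor and scattering.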
The rest of this paper is organized as follows: we state the framework of our ship detection model based on convolutional feature fusion in
Section 2, and the experimental results based on DSDR dataset are presented in
Section 3. In
Section 4, we discuss the advantages of the model and the measures taken to suppress false alarms. Finally, the conclusions are provided in
Section 5.
4. Discussion
Through comprehensive analysis and comparison with other models, our proposed CFF-SDN model was shown to be effective for ship detection in optical remote sensing images. The multi-layer convolutional feature fusion method proposed here enhances both fine-grained and semantic information, and the experiments show that the model performs well in terms of both detection accuracy and speed.
We proposed the CFF-SDN model, which can fuse fine-grained information from shallow layers and semantic information from deep layers. This network architecture is very beneficial for the detection of small objects like ships in remote sensing images. Due to the use of fused feature maps for regression and classification, the CFF-SDN model has good adaptability to the multi-scale changes of ships.
Table 3 shows that the CFF-SDN model can achieve better performance than other object detectors.
Various data augmentation strategies are important measures for improving detection accuracy. Innovatively, affine transformation was used to change the perspective of satellite remote sensing images. As shown in
Figure 5, a satellite image after affine transformation is very similar to aerial remote sensing images taken from different perspectives. Using rich satellite imagery to improve detection accuracy on aerial remote sensing images thus plays an important role in improving overall detection accuracy.
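The affine perspective augmentation discussed here can be illustrated with a minimal inverse-mapping warp. A real pipeline would use a library routine such as OpenCV's warpAffine with interpolation; this pure-NumPy nearest-neighbor sketch only shows the idea:

```python
import numpy as np

def affine_warp(image, matrix):
    """Warp an image with a 2x3 affine matrix via inverse nearest-neighbor mapping.

    Convention: [x', y']^T = A [x, y]^T + t, with A = matrix[:, :2], t = matrix[:, 2].
    """
    h, w = image.shape[:2]
    A, t = matrix[:, :2], matrix[:, 2]
    inv_A = np.linalg.inv(A)
    ys, xs = np.mgrid[0:h, 0:w]
    # Map each output pixel back to its source location.
    src = np.einsum('ij,jhw->ihw', inv_A, np.stack([xs - t[0], ys - t[1]]))
    sx = np.round(src[0]).astype(int)
    sy = np.round(src[1]).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(image)
    out[valid] = image[sy[valid], sx[valid]]
    return out
```

Shear and rotation matrices applied this way mimic viewpoint changes; the ground-truth boxes are transformed with the same matrix.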
As ships are often densely arranged on the sea, as shown in
Figure 10, unlike traditional non-maximum suppression, we use soft NMS to suppress redundant prediction boxes. This increases the probability that closely arranged ships are detected, effectively improving the recall rate of the model and reducing missed detections.
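The Gaussian variant of soft NMS can be sketched as follows: instead of deleting a box that overlaps a higher-scoring one, its score is decayed by a Gaussian of the overlap. The sigma and score-threshold values below are illustrative defaults, not our tuned settings:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian soft-NMS: decay overlapping scores instead of discarding boxes."""
    boxes, scores = boxes.astype(float).copy(), scores.astype(float).copy()
    keep = []
    idx = np.arange(len(scores))
    while idx.size:
        best = idx[np.argmax(scores[idx])]
        keep.append(best)
        idx = idx[idx != best]
        ov = iou(boxes[best], boxes[idx])
        scores[idx] *= np.exp(-(ov ** 2) / sigma)   # Gaussian score decay
        idx = idx[scores[idx] > score_thresh]       # drop only near-zero scores
    return keep, scores
```

Because a neighboring ship's box is only down-weighted rather than removed, densely berthed ships survive suppression, which is exactly what raises the recall rate.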
Since our model adopts a model pruning strategy, the CFF-SDN model has a lower computational complexity. As shown in
Table 4, our proposed model has a faster detection speed than the other compared models and is thus better suited for migration to embedded platforms for real-time ship detection in engineering applications.
By comparing the many groups of experiments, it is verified that the CFF-SDN ship detection model can achieve high performance on detection accuracy, as shown in the precision–recall curves in
Figure 17. However, ships sometimes sail in complex scenes, and the shapes and textures of interfering objects (such as islands, clouds) can change considerably. Sometimes the shape, color, and texture of clouds or islands are very similar to those of ships. These disturbances can cause false alarms in the detector, as shown in
Figure 20.
Although CFF-SDN fully reuses feature information by fusing features from different layers, it is still not enough to eliminate all false alarms.
Both the training set and the test set contain harbor images, and the ship detection in these images is interfered with by the land. The ship detection results of harbor images containing land are shown in
Figure 21. The CFF-SDN model can detect ships in the harbor. Although the model does not appear to be overfitting, the detection effect in the harbor images is not as good as that in the ocean images. The ships near the shore in
Figure 21a–c are well detected. Three ships were detected in
Figure 21d, but one ship docked at the shore was missed. There is considerable interference when detecting ships in a harbor, and the detection performance is lower than for ships on the open sea; the mAP decreases significantly when the trained model is applied to harbor images. Enhancing the robustness of ship detection algorithms in harbors is an important topic for future research, and we need to collect more harbor images to support quantitative analysis of ship detection in harbors.
The interferences of ship detection on different datasets are quite different. We collected several different datasets, including vehicle detection in aerial imagery (VEDAI) dataset [
40], dataset for object detection in aerial images (DOTA) [
41], and high-resolution remote sensing detection (HRRSD) dataset [
42]. These datasets contain various types of targets, such as airplanes, tractors, ships, and trucks. The ship images extracted from these datasets were processed by the CFF-SDN model, although the number of ships in them is not as high as in our DSDR dataset. The ship detection results of the CFF-SDN model on these other datasets are shown in
Figure 22.
It can be seen from
Figure 22 that various types of ships in these datasets were detected, and no interfering objects such as harbor facilities were mistakenly detected as ships. The detection results on different datasets show that our model is very robust. The detection result on the DOTA dataset in the first row of
Figure 22 shows that the localization of the ship in the upper left corner is not accurate enough. In the future, the localization accuracy of the CFF-SDN model on other datasets needs to be improved.
Increasing the number of learned categories is a promising solution to this problem: common disturbances such as clouds and islands are treated as separate categories, so that in addition to learning the target characteristics of ships, the model also learns the characteristics of the common interferers that cause false alarms and can thus distinguish ships from interference. Fusing visible and infrared image information may be another way to enhance the recognition capability of the detector, comprehensively using the interference suppression effects of different spectral bands to better distinguish ships from false alarms; however, this depends on linked visible and infrared sensors that can capture both images of the same scene.