Improved YOLOv3 Based on Attention Mechanism for Fast and Accurate Ship Detection in Optical Remote Sensing Images

: Ship detection is an important but challenging task in the ﬁeld of computer vision, partially due to the minuscule ship objects in optical remote sensing images and the interference of clouds occlusion and strong waves. Most of the current ship detection methods focus on boosting detection accuracy while they may ignore the detection speed. However, it is also indispensable to increase ship detection speed because it can provide timely ocean rescue and maritime surveillance. To solve the above problems, we propose an improved YOLOv3 (ImYOLOv3) based on attention mechanism, aiming to achieve the best trade-off between detection accuracy and speed. First, to realize high-efﬁciency ship detection, we adopt the off-the-shelf YOLOv3 as our basic detection framework due to its fast speed. Second, to boost the performance of original YOLOv3 for small ships, we design a novel and lightweight dilated attention module (DAM) to extract discriminative features for ship targets, which can be easily embedded into the basic YOLOv3. The integrated attention mechanism can help our model learn to suppress irrelevant regions while highlighting salient features useful for ship detection task. Furthermore, we introduce a multi-class ship dataset (MSD) and explicitly set supervised subclass according to the scales and moving states of ships. Extensive experiments verify the effectiveness and robustness of ImYOLOv3, and show that our method can accurately detect ships with different scales in different backgrounds, while at a real-time speed.


Introduction
With the rapid advancement of space remote sensing technology, high-resolution and large-scale remote sensing images acquired from spaceborne and airborne sensors are constantly enriched and facilitate a wide range of applications, such as natural disaster assessment [1], urban planning [2], traffic management [3,4], and environment monitoring [5].In these applications, automatic ship detection in remote sensing images has attracted increasing interests due to its important value in both civil and military fields, such as national defense construction, maritime security, harbor surveillance, and fishery management.For decades, many studies [6][7][8][9][10] in this field have prioritized synthetic aperture radar (SAR) images since they are little affected by weather and time.However, the noisy response and low resolution of SAR images [11] may limit their utilizations.In recent years, attributing to the improvement of high-resolution technology, some researchers [12][13][14][15][16] have paid more attention to using optical remote sensing images because they could provide more details and spatial contents for detecting ships than SAR images.
Extensive studies have been carried out about ship detection in optical remote sensing images in the past decades.Many traditional methods take a hierarchical paradigm via extracting handcrafted features, such as shape [12,17], texture [12,15], local binary patterns (LBP) [18], and histogram of oriented gradients (HOG) [13], followed by certain classification operations, e.g., support vector machine (SVM) [19,20], extreme learning machine (ELM) [14], and Adaboost [21].Most of the traditional algorithms make great success for ideal-quality images.However, they may encounter bottlenecks when ships are under complex weather conditions.Furthermore, the establishment of handcrafted features excessively relies on expert experience, so its generalization ability is weak.In this paper, we concentrate on ships at sea since the land can be removed by prior geographic information.In addition, we focus on detecting ships in panchromatic band of optical remote sensing images since it has higher image resolution and more detailed visual content than other bands.
Many recent works have exploited state-of-the-art two-stage detectors for ship detection in remote sensing field.For instance, Yao et al. [32] utilized region proposal network (RPN) [35] to discriminate ship targets and regress the detection bounding boxes, in which the anchors were designed by intrinsic shape of ship targets.Yang et al. [30] presented dense feature pyramid network (DFPN) by introducing rotation anchor strategy and multi-scale region of interest (RoI) align into the Faster R-CNN baseline.Due to the massive amount of generated ROIs, the computation is time-consuming and the model efficiency may be reduced.Chen et al. [16] also designed a hierarchical detection process which applied discrete wavelet transform (DWT) to extract ship candidate regions and proposed deep residual dense network (DRDN) to improve the ship detection accuracy.The aforementioned two-stage detection methods mainly focus on improving the ship detection accuracy; however, they may ignore the detection speed.As a matter of fact, in some practical applications such as real-time ocean observation and timely ship rescue, increasing ship detection speed is as important as boosting the detection accuracy.
To achieve real-time ship detection in remote sensing images, the methods based on one-stage detectors have been gradually explored.For instance, Tang et al. [41] proposed HSV-YOLO which applied the difference of HSV color space among remote sensing images to extract the RoIs and sent them into the YOLOv3 [39] network.This approach is similar to saliency-based approaches [13,42].However, the HSV-YOLO is not end-to-end due to the complex preprocessing based on the HSV difference.Van Etten [43] extended the YOLOv2 [38] network and implemented an end-to-end object detection framework optimized for overhead imagery: You Only Look Twice (YOLT).It realized rapid multiscale object detection in satellite images.Nina et al. [44] compared YOLOv3 [39] and YOLT [43] for ship detection, without modification of the original architecture.Although these YOLO series could be seen as first choice for real-time applications in computer vision, they provided much lower detection accuracy when compared with two-stage detectors, notably in detecting small ships.In addition, some common challenges still arise for both two-stage and one-stage ship detection methods.First, due to the overhead perspective and imaging characteristics of optical remote sensing images, they usually suffer from the complex weather condition such as clouds, mists, and waves [45].Second, the variant appearances and sizes of ships make it difficult for existing methods to accurately locate and detect targets, especially small ships.
To solve the aforementioned problems, especially for small ships detection and the interference of complex backgrounds, we propose a novel one-stage ship detection method named improved YOLOv3 (ImYOLOv3), intending to achieve the best trade-off between detection accuracy and speed.The ImYOLOv3 combines attention mechanism and dilated convolution for fast and accurate ship detection in optical remote sensing images.Due to the speed, accuracy, and flexibility of original YOLOv3 [39], we adopt it as the inspiration of our ship detection framework.We argue that the YOLO series or other one-stage detectors remain appealing for real-time applications in remote sensing thanks to their high efficiency.
To address the problems that YOLOv3 struggles with small objects and may produce more false alarms in complex backgrounds (such as clouds occlusion and strong waves), we integrate the attention mechanism into the detection network to highlight the difference between the ships and background.More specifically, we introduce a lightweight dilated attention module (DAM) which uses dilation convolution to enlarge the receptive fields and integrates both channel and spatial attention modules to extract salient features.To overcome the multi-scale characteristics of ship targets, we construct a multi-level feature pyramid and use the attention modules to obtain the salient features of different levels.At the end of the framework, we fuse the feature maps of different levels to further improve the detection accuracy and assign them to predict different scales of ship targets.In addition, we explicitly provide finer ship category information based on their sizes and moving states.Using these novel techniques, the proposed method is more suitable for ship detection task than original YOLOv3.
The main contributions of our paper are as follows: (1) In view of the complex backgrounds encountered in optical remote sensing images, an end-to-end network structure named ImYOLOv3 is proposed for fast and accurate ship detection.We integrate the attention mechanism into the network to obtain discriminative feature maps at different levels and fuse corresponding multi-scale features, which ensures the effectiveness of detecting multi-scale ships in complex backgrounds.(2) We design a novel and lightweight DAM which consists of three crucial cores: dilated block, channel attention sub-module, and spatial attention sub-module.The DAM can help our model to enlarge the receptive fields and highlight the difference between the ships and backgrounds, which overcomes the difficulty in detecting small ships.(3) The proposed network is based on a one-stage object detection algorithm and can achieve high ship detection accuracy while maintaining a fast speed.Consequently, it can support real-time ship detection.(4) We validate the proposed network on a challenging multi-class ship dataset (MSD) with huge scale variation.Additionally, unlike other ship detection datasets, our MSD includes four supervised categories, namely big ship, middle ship, small ship and moving ship, to investigate the detection effect of ships with different scales.
The reminder of this paper is organized as follows.The related work about CNN-based ship detection algorithms, attention mechanism, dilated convolution and classification strategy are described in Section 2. The framework and details of our proposed method are introduced in Section 3. The dataset setup and implementation details are shown in Section 4.Then, in Section 5, we present experiments conducted on MSD with the proposed ship detection framework.Finally, Section 6 concludes this paper.

Related Work 2.1. CNN-Based Ship Detection Methods
Researchers in the remote sensing community have made a lot of attempts to exploit CNN-based ship detection frameworks [26][27][28][29][30][31][32]43,45,46].For instance, Zou et al. [45] presented SVDNet, which leveraged CNN to extract features and adopted feature pooling operation and linear SVM to determine the positions of ships.To further enhance the detection accuracy, Li et al. [27] introduced a hierarchical selective filtering (HSF) layer into Faster R-CNN [35] to generate features for multiscale ships and used an end-to-end training strategy by defining a joint loss.More recently, increasing attention has been paid to the time efficiency of ship detection algorithms.Some researchers try to transfer regressionbased detection methods developed for natural images to remote sensing images.Liu et al. [31] designed an arbitrary-oriented ship detection framework based on YOLOv2 [38], which achieved robust and real-time detection in complex scenes.However, they may struggle with ships of various scales due to a single-scale prediction.Chang et al. [46] also constructed a YOLOv2-based end-to-end training convolutional neural network to detect ships.In addition, they introduced a compact YOLOv2-reduced which had fewer layers for faster ship detection.Despite the improvement of detection speed, the YOLO-based methods suffer from a drop of the localization accuracy compared with two-stage detectors, especially for small ships.To achieve better trade-off between speed and accuracy, we propose DAM and incorporate it into YOLOv3 to improve the detection performance while maintaining its fast speed.

Attention Mechanism
Attention is an important tool in human perception [47,48] to bias the allocation of available processing resources towards the most salient part of input signals.Owing to its benefits, attention mechanism has been widely used in a lot of applications, from natural language processing [49] to image classification [50][51][52][53][54], image captioning [55] and image question answering [56].In computer vision, Wang et al. [50] built "Residual Attention Network" by stacking attention modules with encoder-decoder style, which could quickly collect global information of the whole image and gradually refine the feature maps.Hu et al. [51] proposed a compact "Squeeze-and-Excitation" (SE) block to recalibrate the inter-channel attention.Closer to our work, Lin et al. [57] inserted SE module [51] into Faster R-CNN and designed rank operation to improve performance in SAR ship detection task.Beyond the channel attention, spatial attention is also explored in [52,53] to emphasize meaningful features along the spatial axis.However, Lin et al. [57] ignored the importance of spatial attention to emphasize salient objects.In this paper, we sequentially apply channel and spatial attention modules in our DAM, so that each module can learn what and where to attend in the channel and spatial axes respectively.Different from the attention modules in [52,53], we add dilated blocks in our DAM and leverage dilated convolutions [58] to increase receptive fields and adaptively adjust the receptive field sizes for objects of different scales, aiming to address the problem arising from scale variation and small ship instances.

Dilated Convolution
Dilated convolution, also named atrous convolution [59,60], was first introduced in the semantic segmentation task to incorporate large-scale context information [58,61].It enlarges the convolutional kernel sizes with original weights by performing convolution at sparsely sampled locations, which increases the receptive field size without additional parameter cost.Dilated convolution has also been widely used in the field of object detection.To maintain the spatial resolution, DetNet [62] employed dilated convolution to design a specific detection backbone network and enlarge the receptive field.TridentNet [63] constructed a parallel multi-branch architecture with different receptive fields by using dilated convolution and generated scale-specific feature maps with a uniform representation power.In our ship detection framework, we use dilated convolution in our multi-branch architecture with different dilation rates to adapt the receptive fields for ships of different scales.

Classification Strategy
The main aim of ship detection is to locate and recognize true ships in remote sensing images.Many previous studies [15,30,45] in this field consider ship detection as a binary classification problem: ship and non-ship.However, as observed in [12], there usually exist large differences between the objects of interest within the same class.Simply using binary classification may have some detrimental impact on the detection performance.To address this problem, Qi et al. [13] divided ships into big ships and small ships according to their sizes.Li et al. [27] collected a ship detection dataset that included two categories, namely moving ship and static ship.Similarly, Shao et al. [64,65] also adopted multi-class classification for ship detection in visible light video images.They built a large-scale ship dataset, called SeaShips, which covered six ship types (ore carrier, bulk cargo carrier, general cargo ship, container ship, fishing boat, and passenger ship).Compared with video images, it is difficult to tell the exact categories of ships in remote sensing images due to the overhead perspective and long shooting distance.Consequently, we intuitively classify ship objects into four classes based on ship sizes and their moving states.The purpose is to alleviate the problem of scale variation and evaluate the detection accuracy for ships with different sizes.

Proposed Method
In this section, we elaborate the architecture of proposed ImYOLOv3 for fast and accurate ship detection in optical remote sensing images.

Network Architecture
As illustrated in Figure 1, our ImYOLOv3 mainly consists of three components: feature extraction module (FEM), dilated attention module (DAM), and class and bounding box prediction module (CBM).Firstly, the input images are resized to an appropriate size (taking 608 as an example).We employ Darknet-53 [39] as the backbone of FEM and propose DAM to extract salient features at different branches.Then, we construct a three-level feature pyramid sharing a similar concept with FPN [66].Specifically, we upsample the salient feature maps of low resolution by a factor of 2 and merge them with previous feature maps of the same spatial size via concatenation.The feature fusion allows us to obtain more meaningful semantic information from the up-sampled feature maps and finer-grained information from the earlier feature maps.Our ImYOLOv3 predicts bounding boxes at three different scales.Behind the base FEM, we add convolutional sets to make further feature extraction based on these fused feature maps.The three branches have the same DAM structure but different receptive fields with the help of dilated convolutions [58,63].Finally, we add a few more convolutional layers behind DAM of each branch and the last 1 × 1 convolutional layer in CBM is used to predict a 3D tensor encoding parameters of bounding box, object, and class predictions.In the multi-class ship detection experiments on MSD, we predict three boxes at each scale.Hence, the size of the predicted tensor is N × M × [3 × (4 + 1 + 4)] for four bounding box offsets, one object confidence, and four class probabilities, where N and M denote the height and width of the tensor, respectively.The final detection results are obtained after filtering the predicted boxes via non-maximum suppression (NMS).

Dilated Attention Module
As investigated in [63], the detection performance on objects of different scales is impacted by the receptive field of a network.Larger receptive field can capture a wider range of context information, which is beneficial for distinguishing small ship targets from complex backgrounds.In this section, we propose DAM which uses dilation convolution to enlarge the receptive fields and integrates both channel and spatial attention modules to extract salient feature maps, aiming at enhancing regions of ship targets and suppressing the interference of clouds and waves.As depicted in Figure 2, the DAM is mainly composed of three parts: dilated block, channel attention sub-module, and spatial attention sub-module.The detailed structure of three sub-modules is illustrated in Figure 3.The dilated block shares similar style with the bottleneck [23] which consists of three convolutions with kernel size 1 × 1, 3 × 3, and 1 × 1.The difference is that we set different dilation rates for the 3 × 3 convolutional layers.Dilated convolution with dilation rate d inserts (d − 1) zeros between consecutive filter values, enlarging the kernel size without increasing the number of parameters and computation costs.Specifically, a dilated 3 × 3 convolution could have the same receptive field as the standard convolution with kernel size of 3 + 2 × (d − 1).Between two convolutional layers are batch normalization (BN) layer and leaky rectified linear unit (ReLU) layer sequentially.Stacking dilated blocks allows us to modulate receptive fields of different branches and obtain an intermediate feature map X ∈ R C×H×W , where C is the channel, H is the height, and W is the width of feature maps.The subsequent two attention modules separately infer a 1D channel attention map M C ∈ R C×1×1 and a 2D spatial attention map M S ∈ R 1×H×W .The whole process can be summarized as:

Residual
where ⊗ denotes element-wise multiplication, X C ∈ R C×H×W represents the channel-wise refined feature maps, and X S ∈ R C×H×W represents the spatial refined feature maps.Both M C and M S are expanded to R C×H×W before multiplication.We compute the final refined feature map X O via residual connection along with the attention mechanism to alleviate the vanishing-gradient problem as follows: The following sections describe the detailed structure of each attention module.Channel Attention Sub-module.Motivated by the success of attention mechanism in general deep neural networks, we adopt the channel attention module [52] after the dilated blocks to adaptively recalibrate channel-wise feature responses by explicitly modeling interdependencies between channels.As shown in the top of Figure 3, the channel attention sub-module is fed with the intermediate feature map X ∈ R C×H×W and produces a channel vector M C ∈ R C×1×1 .We first squeeze X along the spatial dimension H × W by using both average pooling and max pooling operations to obtain two channel descriptors (R C×1×1 ).The average pooling is to learn the extent of the target objects effectively and the max pooling is to gather important features about distinctive objects, which are helpful for ship detection task.Both descriptors are then forwarded to a shared network that consists of two fully connected (FC) layers.To reduce model complexity, the activation size of the first FC layer is set to R C/r×1×1 , where r is the reduction ratio.After the shared network is applied to each channel descriptor, we merge the two output vectors via element-wise summation and employ a simple gating mechanism via sigmoid activation to obtain the final output M C .
Spatial Attention Sub-module.Different from the channel attention that focuses on what is important in a given input image, we utilize the spatial attention as a supplement to learn where to attend for ship detection task.The structure of spatial attention submodule is shown at the bottom of Figure 3.After obtaining the channel-wise refined feature map M C , we also apply both average pooling and max pooling operations along the channel dimension and concatenate them to generate an efficient feature descriptor, which proves valid in highlighting informative regions [67].Different from that in [52,53], we employ two 3 × 3 dilated convolutions to further enlarge the receptive fields and effectively leverage contextual information.Finally, the features are squeezed to R 1×H×W with 1 × 1 convolution, and the sigmoid function is used to obtain the final spatial attention map M S .

Dataset
Recently, large-scale datasets have played important roles in data-driven research.To this end, we collect seven panchromatic images from GF-1 satellite and six panchromatic images from GF-2 satellite for training and testing.These images contain many kinds of landscapes, such as the ocean, the harbor, and the island, which are taken under different light and weather conditions.First, we cut the large panchromatic images into 1000 × 1000 sized patches with 0.2× overlap and filter out images that do not contain ships.Then, these patches with ships are annotated by experts in aerial image interpretation via an open annotation tool LabelImg.Some example images are listed in Figure 4, which shows some typical scenes in optical remote sensing images, such as quiet sea, sea under cloud coverage, inshore land, moving ships with wakes, and sea with waves.
To facilitate multi-scale ship detection, we set four ship categories according to the length and moving state of targets, namely big ship (length more than 100 pixels), middle ship (length about 50-100 pixels), small ship (length about 10-50 pixels), and moving ship (ship with long wakes).Those extremely small ships (length less than 10 pixels) are excluded from the targets in our dataset.When labeling moving ships, the main parts of ships are inside the bounding boxes.In total, there are 1015 images in our multi-class ship dataset (MSD).We randomly choose 80% of the images for training and the remaining 20% for testing.An overview of MSD can be found in Table 1.To better show the scale variation of ship targets, we draw a scatter plot according to the width and height of the ship targets in MSD dataset, as shown in Figure 5.

Evaluation Protocol
To quantitatively evaluate the detection performance of the network, we follow the same protocol as used by PASCAL VOC [68], which is briefly described below.Given the intersection-over-union (IoU) threshold, the detector will decide whether the detected box belongs to the background or not.The recall and precision are calculated as follows: P = N TP N TP + N FP (5) where N TP , N FP , and N FN , respectively, stand for the number of accurately detected ship targets (true positive), the number of incorrectly detected targets (false positive), and the number of missing ship targets (false negative).For each category in our dataset, we can draw a precision-recall curve according to recall and precision values.The average precision (AP) means the area surrounded by the curve.
Each class i corresponds to an AP value AP i and mAP denotes their mean, which assesses the detection effect of the model.
where n denotes the number of ship categories.As for the speed, we adopt FPS (frame per second) as indictor which symbolizes the frames that model can detect in one second.

Implementation Details
In our experiment, the dilation rates of dilated blocks are set to 1, 2, and 3 in Branch-1, Branch-2, and Branch-3, respectively.The reduction ratio r in the channel attention summodule is set to 16 following Woo et al. [52].All experiments were executed on a PC with NVIDIA GeForce GTX 1080ti.The initial learning rate was set to 0.001.We used stochastic gradient descent (SGD), with a weight decay of 0.0001 and momentum of 0.9.Warm-up was introduced during the initial training stage to avoid gradient explosion.All models were trained for 21,000 iterations, and the learning rate was divided by 10 at 7000 and 14,000 iterations.The IoU threshold of NMS was set to 0.5.For evaluation, we used mAP and FPS as the metrics.

Ablation Experiments
Number of dilated blocks.We conduct an ablation study to explore how many dilated blocks are needed for ImYOLOv3.The results are shown in Figure 6 as follows.When the number of dilated blocks grows beyond 3, the performance of ImYOLOv3 becomes stable.It indicates the robustness of our method with regards to the number of dilated blocks when the disparity of receptive fields between branches is large enough.Consequently, we set three dilated blocks in our DAM.
Performance of each branch.Our ImYOLOv3 has a multi-branch architecture, where each branch is responsible for detecting ships with different scales.Each branch has a different receptive field with the help of dilated convolution.Here, we conduct an additional experiment to study the mechanism about how to adjust the receptive field adaptively.Table 2 shows the results of three single branches and our three-branch method.As expected, through our training, Branch-1 with smallest receptive field achieves good results on small ships, Branch-2 works well on ships with middle scale, and Branch-3 with largest receptive field is good at detecting big ships.As a result, our three-branch ImYOLOv3 inherits the merits from three single branches and achieves the best result.Performance of attention module.To verify the effectiveness of DAM, we conduct comparative experiments with other attention modules, and the results are listed in Table 3.We adopt YOLOv3 as the baseline and keep other hyper-parameters consistent in the experiments.It can be seen that adding the DAM into YOLOv3 brings 5.69% mAP increment (79.38% vs. 85.07%), which especially increases the AP for middle ship (80.14% vs. 86.69%),small ship (72.97% vs. 80.97%), and moving ship (70.67% vs. 77.60%).It implies that our DAM can extract discriminative and robust features for ship targets in remote sensing images, which effectively boosts the detection performance of small ships.When comparing the performance difference between YOLOv3 with SE block [51] and YOLOv3 with our DAM, the improvement of mAP (81.43% vs. 85.07%)indicates that adding spatial attention module is also very helpful for ship detection.In addition, we compare our DAM with the similar modules BAM [53] and CBAM [52] with both channel and spatial attention.The YOLOv3 with DAM achieves higher accuracy than YOLOv3 with BAM (85.07%vs. 82.55%)and YOLOv3 with CBAM (85.07%vs. 82.91%).The improvement is attributed to the benefits that our DAM enlarges receptive fields of the network through dilated convolutions and obtains richer feature representation.The above experiments demonstrate that our DAM is more suitable for the ship detection task than other off-the-shelf attention modules.In addition, as shown in Table 3, YOLOv3+DAM achieves larger AP increment for small ship (8.00%), moving ship (6.93%), and middle ship (6.55%) than for big ship (1.27%), compared with the baseline YOLOv3.Since the intention of our DAM is to boost the performance of original YOLOv3 for small objects, we further show some detection results compared with original YOLOv3 [39] to verify the validity of our model.Figure 7 shows the detection results of original YOLOv3 (the first row) and our ImYOLOv3 (the second row).The boxes in different colors represent different ship categories and the yellow circles denote missing ships.As is shown in the first column of Figure 7, YOLOv3 fails to detect some small ships and moving ships while our ImYOLOv3 can detect them successfully.Detecting ships under clouds coverage or with waves remains a challenging problem in remote sensing images.In the second and third columns of Figure 7, YOLOv3 misses the small ship under cloud coverage and the middle ship surrounded by waves.However, our ImYOLOv3 can effectively distinguish them from the complex background, which shows that our method is very robust in complex scenes.The proposed attention modules can greatly help to highlight the difference between the ships and background.The performance benefits not only from the proposed dilated attention module but also from the better feature representation for ship targets.

Comparison with the State-of-the-Art Methods
In this section, we compare our ImYOLOv3 with other state-of-the-art object detection methods, including single shot multi-box detector (SSD) [36], feature pyramid network (FPN) [66], and RetinaNet [40].The training settings follow the original references.Table 4 shows the comparative results, while Figure 8 shows the precision-recall curves of each ship category.As shown in Table 4, our ImYOLOv3 clearly outperforms SSD300 by 9.64% mAP (85.07%vs. 75.43%)and SSD512 by 7.10% mAP (85.07%vs. 77.97%).As for the detection speed, ImYOLOv3 is a little slower than SSD300 (28 FPS vs. 46 FPS), but it is faster than SSD512 (28 FPS vs. 19 FPS).Compared with two-stage model FPN, our ImYOLOv3 achieves 3.28% mAP increments, which improves big ship AP by 0.44%, middle ship AP by 3.4%, small ship AP by 2.93%, and moving ship AP by 6.34%, while being much faster (28 FPS vs. 6 FPS).Moreover, our model surpasses one-stage RetinaNet by 1.52% mAP, while being 2.8× faster.It should be noted that, because FPN and RetinaNet achieve better performance than original YOLOv3 (79.38% mAP), the gains of ImYOLOv3 can be only attributed to the effectiveness of the proposed DAM.To sum up, our ImYOLOv3 achieves a better trade-off between the detection accuracy and detection speed than the state-of-the-art methods.The contribution of ImYOLOv3 is not only a more powerful attention module but also a better understanding of what (channel-wise attention) and where (spatial attention) the ship object is in an input image.We believe that the efficiency and simplicity of our proposed method will benefit future research and ship detection applications.

Detection Performance in Different Image Backgrounds
To show the robustness of our model, we test it in some typical remote-sensing scenes and the detection results are shown in Figure 9.The different environmental conditions include quiet sea, sea with waves, sea under clouds coverage, and inshore land (from top to the bottom in Figure 9).The boxes in different colors represent different ship categories.As shown in the first row, we can see that our ImYOLOv3 performs well in quiet sea, which can successfully detect ships with different scales.The second row represents results for ship target detection in sea with waves.The third row shows results for ship detection under clouds coverage, which are difficult problems for ship detection in remote sensing images due to the interference of strong waves and clouds.However, our proposed method still effectively distinguishes ship targets even with the interference of waves and clouds and accurately locates ship targets with long wakes.The fourth row presents some detection results for inshore land, where some buildings or harbors may have similar shapes and sizes to ship targets.It is clear that our ImYOLOv3 accurately recognizes ship targets in such complicated scenes even though some ships are very small.In short, our method is very robust in various complex scenes, which benefits from the proposed dilated attention module and better feature representation for multi-scale ship detection.

Generalization Ability Testing
To further illustrate the generalization ability of the proposed method, we evaluate its performance on SeaShips dataset [64], which consists of 31,455 images and cover six common ship types (ore carrier, bulk cargo carrier, general cargo ship, container ship, fishing boat, and passenger ship).The similarity between SeaShips and MSD is that both datasets provide fine-grained ship category information to overcome the intra-class differences because different ship types vary greatly in shape and appearance.The difference between SeaShips and our MSD is that all of the images in SeaShips dataset are from real-world video segments, which are acquired by the monitoring cameras in a deployed coastline video surveillance system, while our MSD is based on optical remote sensing images.
We compare our ImYOLOv3 with other baseline detectors including Fast R-CNN [34], Faster R-CNN [35], SSD [36], and YOLOv3 [39].The results are shown in Table 5.For Fast R-CNN, we adopt VGG16 net [25] as the backbone network.For Faster R-CNN [35], we utilize ZFNet [69], VGG16 net [25], and ResNet (ResNet50 and ResNet101) [23] as the base network architectures.For SSD300 and SSD512 [36], we use VGG16 net as the backbone network.To ensure fair comparison, we keep the same training settings in each experiment.As shown in Table 5, Fast R-CNN performs worse than other detectors by a wide margin in terms of mAP and has a lower speed (3 FPS).Faster R-CNN achieves better detection accuracy when we use deeper networks, and the Faster R-CNN series all outperform SSD series.Surprisingly, YOLOv3 with Darknet-53 obtains better performance than Faster R-CNN with ResNet101 by 0.62% mAP (93.02% vs. 92.40%),while being 4.8× faster (29 FPS vs. 6 FPS).We conjecture that this is due to the benefits of multi-scale training strategy and because the ship targets in SeaShips dataset are relatively large objects and easy to be detected.It should be noted that the average precision (AP) for six ship types increases when we add DAM into YOLOv3, which improves ore carrier AP by 0.79%, bulk cargo carrier AP by 0.61%, general cargo ship AP by 0.15%, container ship AP by 0.60%, fishing boat AP by 0.97%, and passenger ship AP by 0.92%.Moreover, our ImYOLOv3 achieves a real-time speed (28 FPS).Considering the same experimental setup, the improvements can only be attributed to the better feature extraction of our DAM.The above experiments demonstrate that our ImYOLOv3 has great generalization ability and detection performance in both remote sensing images and natural images.Furthermore, our ImYOLOv3 achieves a better trade-off between the detection accuracy and speed than other detection methods.

Conclusions
In this paper, we propose a novel one-stage ship detection algorithm based on attention mechanism called ImYOLOv3, aiming to support further research in optical remote sensing ship detection field.First, to realize fast ship detection, we adopt the off-the-shelf YOLOv3 as the basic detection framework due to its fast speed and utilize Darknet-53 for feature extraction.Then, to alleviate the problem of multi-scale ship detection, we construct a threelevel feature pyramid sharing a similar concept with FPN, which combines meaningful semantic information from the up-sampled feature maps and fine-grained information from the earlier feature maps.Finally, to achieve accurate ship detection, we design DAM which can be integrated into the backbone framework to highlight the difference between ship targets and background.Our DAM can help to extract more discriminative features for ship targets and overcome the interference of clouds, mists, and waves.More specifically, the channel-wise attention is combined with spatial attention to learn what and where to focus or suppress in an input image and adaptively refine the intermediate features.
With the assistance of dilated convolutions, we enlarge the receptive fields of our model to further improve the ship detection accuracy.In addition, we build a multi-class ship dataset to facilitate multi-scale ship detection.Extensive experimental results show that our ImYOLOv3 has promising potentials on detecting ships with different scales and different moving states in complex backgrounds, while achieving a fast speed.We believe that ImYOLOv3 could benefit current ship detection and other vision tasks.We would like to explore it in the future work.

Figure 1 .
Figure 1.Architecture of ImYOLOv3 for ship detection.

Figure 3 .
Figure 3. Diagram of each sub-module: dilated block, channel attention sub-module, and spatial attention sub-module.

Figure 4 .
Figure 4. Some example images in our MSD dataset.

Figure 5 .
Figure 5.The sizes of ship targets in our dataset.As illustrated, ships have different scales and there exists large scale variations in MSD.

Figure 6 .Table 2 .
Figure 6.Ship detection results on the MSD dataset using different number of dilated blocks in DAM.Table 2. Ship detection results of each branch in ImYOLOv3 evaluated on the MSD dataset.The dilation rates of three branches in ImYOLOv3 are set as 1, 2, and 3.The entries with the best APs for each object category are boldfaced and the suboptimal entries are underlined.

Figure 7 .
Figure 7. Detection results of YOLOv3 (the first row) and our ImYOLOv3 (the second row).Yellow circles denote missing ships.

Figure 8 .
Figure 8. Precision-recall curves of different detection methods on four ship types.

Figure 9 .
Figure 9. Ship detection results of our ImYOLOv3 in some typical scenes.

Table 1 .
Overview of our MSD.

Table 3 .
The performance of attention modules on ship detection performance.The entries with the best APs for each object category are boldfaced.

Table 4 .
Detection average precision (%) and speed of different methods on MSD.The entries with the best APs for each object category are boldfaced.

Table 5 .
Detection average precision (%) and speed of different methods on SeaShips.c1-c6 represent ore carrier, bulk cargo carrier, general cargo ship, container ship, fishing boat, and passenger ship, respectively.The entries with the best APs for each object category are boldfaced.