CAA-YOLO: Combined-Attention-Augmented YOLO for Infrared Ocean Ships Detection

Infrared ocean ships detection still faces great challenges due to the low signal-to-noise ratio and low spatial resolution resulting in a severe lack of texture details for small infrared targets, as well as the distribution of the extremely multiscale ships. In this paper, we propose a CAA-YOLO to alleviate the problems. In this study, to highlight and preserve features of small targets, we apply a high-resolution feature layer (P2) to better use shallow details and the location information. In order to suppress the shallow noise of the P2 layer and further enhance the feature extraction capability, we introduce a TA module into the backbone. Moreover, we design a new feature fusion method to capture the long-range contextual information of small targets and propose a combined attention mechanism to enhance the ability of the feature fusion while suppressing the noise interference caused by the shallow feature layers. We conduct a detailed study of the algorithm based on a marine infrared dataset to verify the effectiveness of our algorithm, in which the AP and AR of small targets increase by 5.63% and 9.01%, respectively, and the mAP increases by 3.4% compared to that of YOLOv5.


Introduction
Scene understanding based on machine vision is a key technology in the field of marine transportation, and object detection has received extensive attention as an important part of it. Infrared imaging has been widely used in marine ship detection because of its outstanding characteristics of strong anti-interference ability and all-weather operation [1]. Different from other detection tasks, the ocean scene has its unique characteristics: (1) the scenes including multiscale targets have a broader field of view; (2) the size of targets is extremely distributed. At the same time, due to the low signal-to-noise ratio and poor spatial resolution of infrared images, the visual effect is fuzzy and the target is seriously lacking in texture details, especially for small objects, which bring great challenges to the detection of marine targets based on infrared scenes.
Object detection methods based on deep learning have been shown to greatly improve the performance of object detection. There exist a sea of detection frameworks such as twostage algorithms (R-CNN [2], fast R-CNN [3], faster R-CNN [4]) and one-stage algorithms (YOLO [5][6][7], SSD [8]). However, it has always been a difficult point of research to improve the detection of small targets while ensuring the detection of large and medium targets. To solve this problem, a series of methods such as multiscale feature fusion strategies and context learning have been proposed. Nayan et al. [9] proposed to use upsampling and skip connections to extract the multiscale features of different network depths during the training process. Deng et al. [10] proposed an extended feature pyramid network, which uses an additional high-resolution pyramid level specifically for small object detection. Tan et al. [11] proposed a weighted bidirectional feature network named BiFPN for adaptively selecting the importance of different features in the fusion process. Lim et al. [12] proposed a method using context to connect multiscale features, supplemented by an attention mechanism to focus on the target in the image. Zhang [13] et al. proposed to use a feature fusion network to comprehend different levels of convolutional features and better utilize the fine-grained features and semantic features of targets.
In the infrared-based ocean scene, the difficulty of object detection is further aggravated by the huge change of ship scale and the weak texture features of small targets. Therefore, directly introducing an extra high-resolution pyramid level brings more noise in the bottom of the feature fusion stage. The use of a traditional backbone network could easily lead to feature degradation or even disappearance of extremely small targets in the process of downsampling and the context feature fusion will be extracted deficiently without guidance.
Based on the characteristics of infrared ocean ship scenes, we propose a CAA-YOLO infrared ship detection algorithm based on one-stage detector YOLOv5. As a state-of-the-art detector, YOLOv5 has the advantages of fast convergence, high accuracy and lightweight model. It also has strong real-time image detection capabilities and low hardware computing requirements, which means it is easy to be transplanted to mobile devices for application in marine traffic scenarios. In the design of the CAA-YOLO network, to better detect small and weak objects, we add the high-resolution feature layer P2 to obtain more shallow details and location information. In order to suppress shallow noise caused by the P2 layer and enhance the feature extraction ability, we introduce the attention module TA into the backbone. In the feature fusion stage, we construct a new feature connection mode to offer more contextual information for the shallow layer. Moreover, we design the combined attention mechanism to achieve adaptive feature fusion, which not only enhances the semantic information of targets but also suppresses the noise interference caused by the P2 layer. We use infrared ship dataset to evaluate the performance of our method. Compared with the baseline network YOLOv5, CAA-YOLO improves the detection performance of small targets in extreme multiscale scenes. Especially for scenes with small targets and intensive and mutual occlusion, in which pixel areas are less than 32 × 32, its AP and AR increase by 5.63% and 9.01%, respectively. In addition, the mAP of all ships increases by 3.4%. The main contributions of the research can be summarized as follows: • Aiming at the problems existing in infrared ocean ship scenes, this paper proposes a CAA-YOLO for infrared ocean ship detection based on YOLOv5. By introducing an attention module in the stage of feature extraction and feature fusion to utilize more shallow information, we improve the detection for small and weak targets. Compared with some state-of-the-art algorithms, the proposed method achieves better detection results. • To reserve more shallow details and location information, we add the high-resolution feature layer P2, which improves the detection accuracy of small objects. • To suppress the background noise and allow the network to independently distinguish the correlation and effectiveness between different feature mapping channels, we introduce a TA module into the backbone network. • To capture the long-range contextual information of small objects, we design a novel feature fusion method and use a combined attention mechanism to enhance the ability of feature fusion and suppress the noise interference brought by shallow feature layers.

Related Work
Object detection technology has made remarkable progress due to the development of deep learning technology in recent years. Currently, popular object detection algorithms can be divided into two categories: two-stage algorithms represented by RCNN [2] and faster RCNN [4], and one-stage algorithms represented by YOLO [5] and SSD [8]. They mainly contain input, backbone, neck and head modules as shown in Figure 1. Small targets have fewer pixels than conventional objects, so it is difficult to extract better features in multiscale object detection algorithms. Moreover, with the increasing depth of the convolutional neural network, the details and location features of small targets are gradually lost. In this section, firstly, we briefly review the major works of target detection when adding a highresolution feature layer, enhancing the feature extraction ability of the backbone network and multiscale feature fusion. Secondly, we briefly introduce the development status of an infrared small target detection. With higher resolution and more detailed information for localization, low-level features maps are particularly important for small target detection. Adding the shallow characteristics of the high-resolution layer has become one of the methods for small target detection. That is, the shallow feature map in the backbone module of Figure 1b is integrated into the neck module of Figure 1c. Kim et al. [14] proposed the structure of ECAP-YOLO to improve the detection performance of small targets in aerial photography scenes. Shao et al. [15] proposed an adaptive spatial feature fusion network with a high-resolution detection layer to enhance the effect of ship detection in night remote sensing scenes.
As a shared structure of different neural networks, the backbone is the main element of the model, which determines the basic performance. To extract the feature information more effectively, an attention mechanism is widely used in convolution neural networks, which is usually introduced into the modules of the backbone as shown in Figure 1b. Bi et al. [16] embedded a visual attention enhancement network into the backbone network DSOD to extract visual features, which improved the performance of ship detection. Cui et al. [17] proposed a spatial mixing group enhancement (SSE) attention module into the backbone network to suppress some noise while extracting stronger semantic features to reduce false positives caused by inshore and inland interference. Chen et al. [18] designed a novel and lightweight extended attention module (DAM) to extract the discriminant features of ship targets. The integrated attention mechanism suppressed irrelevant regions and highlighted salient features that were useful for ship detection tasks.
One of the main difficulties of object detection is the effectiveness of feature representation with multiscale information processing in the feature fusion stage shown in Figure 1c. Some researchers tried to improve the multiscale feature expression ability. Dewi et al. [19] applied SPP to collect the local region features at different scales in the same CNN layer to learn multiscale object features more comprehensively. Liu et al. [20] proposed a novel RF block (RFB) module, which took the relationship between the size and eccentricity of RFs into account, to enhance the feature discriminability and robustness. Some researchers studied the connection mode of the feature map for a better feature fusion. Liu et al. [21] proposed the structure of FPN, introduced bottom-up and top-down network structure and fused the features of adjacent layers to achieve feature enhancement. Liu et al. [22] proposed PANet, which added an additional bottom-up path aggregation on the basis of FPN.
Zhou et al. [23] proposed a scale transfer module to take advantage of cross-scale features. YOLO-V4-lightship [24] has greatly reduced the number of convolutional layers in CSP-Darknet53 and used a bottom-up information fusion to improve the precise positioning of ship detection. In addition, an adaptive feature fusion method was also studied to improve the expressive ability of multiscale features. Tan et al. [11] proposed a simple and efficient bidirectional feature pyramid network named BiFPN, which introduced learnable weights to learn the importance of different input features. Hu et al. proposed PAG-YOLO [25] with attention mechanisms in spatial and channel dimensions to adaptively assign the importance of features at different scales.
In addition to the above methods, some other improvement strategies have been put forward in recent years. Researchers have successively proposed loss functions such as GIoU-Loss [26], CIoU-Loss, DIoU-Loss [27] based on IoU to solve the target regression problem. In order to solve the problem of target imbalance, Kisantal [28] et al. proposed a replication enhancement method to increase the number of training samples for small targets; Chen [29] et al. proposed an adaptive resampling strategy by considering target context information to solve the target background mismatch caused by direct copy and paste. In terms of training deep neural networks, the Harris hawks optimization (HHO) [30] algorithm was proposed to tune the hyperparameters of a CNN, which attained 100% accuracy for hand gesture classification. Ref. [31] proposed a simple warm restart technique for stochastic gradient descent to improve its anytime performance.
With the rapid development of deep learning object detection research, some related works based on deep learning have also appeared in the field of infrared small target detection, as is shown in Table 1. In terms of optimizing the backbone network structure, Lin et al. [32] proposed a seven-layer end-to-end convolutional neural network, in which the network structure did not perform image downsampling operations to ensure the accuracy of target localization. M. Li et al. [33] proposed SE-YOLO, a real-time pedestrian object detection algorithm for small objects in infrared images, which improves the feature modeling ability of the network by introducing an SE block [34] into YOLOv3, which improved the feature expression ability of the network combined with the SE block. Li et al. [35] developed a detector, YOLO-ACN, by introducing an attention module and a depth-wise separable convolution. Sun et al. [36] proposed I-YOLO, which modified the backbone with EfficientNet and added a preposition network, DRUNet, to reduce the noise of infrared images. Dai et al. [37] put forward a novel object detection approach, termed TIRNet, where the residual branch was introduced to get robust and discriminating features for accurate box regression and classification. Du et al. [38] proposed FA-YOLO with a CBAM module in the backbone to enhance the performance of infrared occlusion object detection under a confusing background. In terms of improving feature fusion strategy, Dai et al. [39] proposed the asymmetric contextual modulation (ACM), which explored the fusion method between deep and shallow features. Inspired by the improved networks of UNet [40][41][42][43], a dense nested attention network (DNANet) [44] was proposed to resolve the detection of infrared small targets with dense nested connection and a CSAM attention mechanism. Cao et al. [45] presented ThermalDet, which included a dual-pass fusion block (DFB) and a channel-wise enhancement module (CEM) to enhance the performance in thermal imagery. In terms of adding a high-resolution small target detection layer, Li et al. [1] proposed a region-free object detector named YOLO-FIRI for infrared (IR) images by compressing channels and optimizing parameters. To maximize the use of shallow features, the cross-stage-partial-connections (CSP) module in the shallow layer was changed and an improved attention module was introduced into the residual blocks. Table 1. Some related works based on deep learning in the field of infrared small target detection.

Backbone
Lin et al. [32] A 7-layer CNN was designed to automatically extract small target features and suppress clutters in an end-to-end manner.

SE-YOLO [33]
An SE block was introduced into YOLOv3 to achieve higher accuracy and lower false alarm rate in small pedestrian detection task.
YOLO-ACN [35] An attention mechanism was introduced in the channel and spatial dimensions in each residual block of YOLOv3 to focus on small targets.

I-YOLO [36]
A dilated-residual U-Net was also introduced to reduce the noise of infrared road images.

FA-YOLO [38]
A dilated CBAM module was added to the CSPDarknet53 in the YOLOv4 backbone.
TIRNet [37] A residual branch was added when training to force the network to learn robust and discriminating features.
Fusion strategy ACM-U-Net [39] An asymmetric contextual modulation module was proposed for detecting infrared small targets based on FPN and U-Net [46].
DNANet [44] A DNIM module was designed to achieve progressive interaction among high-level and low-level features on a infrared small target dataset.
ThermalDet [45] A DFB block and CEM module were designed to directly fuse features from all different levels based on RefineDet.
High-resolution detection layer YOLO-FIRI [1] Multiscale detection was added to improve small object detection accuracy.
These methods achieved a better performance for object detection in different fields, such as pedestrian detection and autonomous driving, but they cannot be directly applied to the detection of infrared ocean ships due to the severe distribution of target sizes in the ocean scene, the weak characteristics of small targets and the interference of noise. Therefore, this paper studied the state-of-the art detector YOLOv5; we applied a high-resolution feature layer so that shallow details and location information can be better used. To suppress shallow noise, we introduced the TA module into the backbone. Besides, we utilized a new feature fusion method to capture long-range contextual information for small targets and designed a combined attention mechanism to enhance the ability of feature extraction and feature fusion while suppressing the noise interference caused by the shallow feature layers.

Methodology
Aiming at the severe distribution of ocean ship size and weak small target features in infrared scenes, we propose a combined-attention-augmented YOLO (CAA-YOLO) based on YOLOv5. In this section, we describe the overall framework of the proposed method and three specific improvement measures in detail. A shallow feature layer P2 is added to the structure of FPN, then we build the backbone enhancement module TA and feature fusion module CAA based on the P2 layer with multiple attention mechanisms. Figure 2 shows the overall framework of the proposed method. YOLOv5 consists of four main modules. The first is the input module, in which data are preprocessed for given input images. YOLOv5 uses mosaic data enhancement, automatic calculation of anchor boxes and image scaling to process input images. The second is the backbone module, where CSPDarknet53 is used as the backbone network to perform feature extraction on images. The third is the neck module, the path aggregation network [22] (PANet), a cyclic pyramid structure composed of convolution operations, upsampling operations and CSP2_X, which performs feature fusion operations on multiscale feature maps from the backbone network. The fourth is the head module, which mainly performs the operations of target regression and localization. CAA-YOLO is mainly improved for the backbone and neck modules: (1) To make full use of the location and details information of small targets provided by the underlying feature layer, the feature map P2 extracted by the backbone module is fused to the neck module to make the network adaptable to infrared ship targets with extreme scale changes.

Conv
(2) To extract more effective feature information, we introduce the triplet attention [47] mechanism into the residual block of the backbone to help the model better extract features of interesting targets and suppress noise interference. (3) To prevent the noise brought by the feature map P2 from affecting the entire feature fusion stage and provide more contextual information for small targets, the feature connection method of the neck module is modified. Besides, we use a hybrid attention module to guide the feature fusion process to make the feature fusion more effective.

High-Resolution Feature Layer P2
As the pixel proportion of small targets is small, the feature information is generally reduced after several downsampling layers in the process of feature extraction by the convolutional neural network. For example, when the step size stride is 16, a target region with 32 × 32 pixels is only 2 × 2 pixels in the feature map, and the effective regions for detecting small objects cannot be discerned. In addition, with the deepening of the network layers, the feature information and location information of small targets are gradually lost, which is not conducive to the localization of the target [48,49]. In the convolutional neural network, the high-level feature map has a large receptive field and rich semantic features, but it loses a lot of spatial features. On the contrary, the shallow feature map has a small receptive field, but it has high spatial resolution and accurate target locations, which is suitable for small object detection [50].
The backbone network is a convolutional neural network that extracts feature maps of different sizes from input images through multiple convolution layers and pooling layers. YOLOv5 uses 3 scales of output feature maps to detect objects of different sizes and 8-time downsampled feature maps to detect small objects. However, there are a large number of small targets in the datasets of the infrared ocean, and the area distribution is shown in Table 2. For tiny targets with less than 10 × 10 pixels of the size, the target features compressed by 8-time downsampling are extremely weak and may even disappear completely, so we added a high-resolution feature map with a 4-time downsampling to detect tiny targets, its structure is shown in Figure 3. First, the feature map P2 extracted by the backbone network is fused with feature maps of other scales through FPN [21] and PAN [22] structures. Then, a D2 detection head dedicated to tidying target detection is constructed using the fused F2 features, so that the network can utilize more location information and feature information of small objects provided by the high-resolution feature layer P2. Mathematically, the D2 layer can be obtained by Formula (1): C3(I) represents the feature map I as the input data of the C3 module, Conv(I)) represents the convolution operation on the input I, Upsampling(I) represents the double upsampling operation on input I, Concat(I1, I2) indicates that the input features I1 and I2 are stitched according to the channel.

Enhanced Backbone
Ocean targets based on infrared scenes contain a lot of noise. If the shallow feature map is directly introduced into the neck network, a large amount of underlying noise will also be diffused into the entire feature fusion module, which will directly affect the feature fusion results of other detection layers. As a result, suppressing noise in the backbone network is of great significance for subsequent feature fusion. In addition, the feature extraction capability of the backbone network will also directly affect the final detection effect. These reasons make small target detection more dependent on efficient feature extraction networks. To suppress the background noise in the image, enhance the effective feature information of the target and allow the network to independently distinguish the correlation and effectiveness between different feature map channels to improve the detection effect, the triplet attention (TA) [47] module was introduced in this paper.
Spatial attention tells us where in the channel we should focus and channel attention tells us what channel we should focus on. Previously proposed attention mechanisms calculated channel attention and spatial attention separately; D. Misra [47] proposed the concept of cross-dimensional interaction, which emphasized the interaction between the spatial and channel dimensions of the input tensor. Triplet attention establishes interdimension dependencies through rotation and residual transformation, then re-encodes interchannel and spatial feature information and calculates the attention weight through the interdependence between the feature information dimensions, so that the network can provide a richer feature representation. Triplet attention consists of three parallel branches, two of which are used to capture the interaction between channel dimension and spatial dimension, and the third branch is used to calculate spatial attention; its structure is shown in Figure 4. Given an input feature map x ∈ R C×H×W , branch (1) is used to compute the attention weights of channel dimension C and spatial dimension H . Firstly, the input x is rotated 90 • anticlockwise along the W axis and then performed a Z-pool operation. Secondly, a convolution layer with a kernel size of 7 × 7 and a batch normalization layer are used. Thirdly, it adopts a sigmoid activation layer to generate the attention weights. Finally, the attention weights retain the same input shape as x by rotation. Similarly, branch (2) computes the attention weight of channel dimension C and spatial dimension W like branch (1). For branch (3), the channels of x are reduced to two by the Z-pool layer, and then the space attention weights are computed by operations similar to the other two branches. In the end, the refined feature maps generated by the three branches are aggregated by simple averaging. The Z-pool layer here is responsible for reducing the zeroth dimension of the input feature map to two, by concatenating the average-pooled and max-pooled features across that dimension. Mathematically, it can be represented by Formula (2): YOLOv5 uses CSPDarknet53 as the backbone network for multiscale feature extraction, which mainly consists of a convolution block with a convolution kernel of 6 × 6 and a stride of 2, three groups of Conv+C3 combined operations and a spatial pyramid pooling SPPF module, whose structure is shown in Figure 5a. The C3 module with residual block divides the feature map of the base layer into two parts to make the gradient flow propagate in different network paths by separating the gradient flow, and then the two parts are spliced through the cross-stage hierarchy; the structure is shown in Figure 5b. We introduced the TA attention mechanism into each C3 module in the backbone network by building a new Bottleneck_TA structure based on the original Bottleneck, as shown in Figure 5d. The TA is embedded in the backbone network, which can capture more effective feature information through spatial attention and channel attention cross-channel interaction, weakening the underlying noise interference brought by the P2 layer to a certain extent, which is beneficial to ship detection in infrared scenarios.
(3) Figure 4. Illustration of the triplet attention, which has three branches. The first branch (1)

Feature Fusion
Four layers of feature maps are generated in the backbone network. Through these feature maps of different sizes, the neck network fuses feature maps of different levels to obtain more contextual information and reduce information loss. In the fusion process of YOLOv5, the feature pyramid structure of FPN and PAN is used. The FPN [21] structure transfers powerful semantic features from the top feature map to the lower one, as shown in Figure 6a. At the same time, the PAN [22] structure transfers strong localization features from the lower feature map to the higher ones, as shown in Figure 6b. To prevent the degradation of features between the same layers, BiFPN [11] introduces cross-layer connections to fuse more features without adding too much cost. Its structure is shown in Figure 6c.
Due to the weak features of infrared targets, especially small targets, the features are prone to degradation in the process of a continuous deepening of the convolutional network. Therefore, inspired by BiFPN, we introduced cross-layer connections between the same feature layers in YOLOv5, aiming to provide more features for the feature fusion stage. Furthermore, since the small target features in infrared ocean scenes are extremely weak, contextual information becomes extremely important for their detection process. Therefore, we added the paths from the feature layers P4 and P5 of the backbone network to the F2 and F3 layers based on YOLOv5, aiming to provide more contextual information for small targets. The structure is shown in Figure 6d. The idea was as follows: On the one hand, because the target features smaller than 32 × 32 pixels of size have been highly compressed and even disappeared in the 32-time downsampling of P5, we did not introduce it to the F2 layer. On the other hand, the feature map P3, which was 8-time downsampled, was fused with F3 through horizontal connections and F3 was fused with F2 from top to bottom through the FPN network, so we also did not introduce the feature map P3 into the F2 layer. In the same way, we did not introduce the feature information of the P4 layer into the F3 layer.
Because the shallow feature map P2 is directly fused with other multiscale feature layers, more noise will interfere with the feature fusion results. In addition, the feature fusion strategy used by YOLOv5 is to concatenate in terms of channel dimension. Concatenating features based on the channel dimension cannot reflect the correlation and importance of features between different channels. Without additional guidance, it is difficult for the underlying fusion module to accurately capture the key features of the target due to the interference of noise. Therefore, we needed to strengthen the feature fusion capabilities of the P2 and P3 layers for small target detection and suppress a large amount of noise interference at the bottom. The attention mechanism can flexibly capture the global and local relations in the input image, which focus on looking for important information related to the application task. We designed a hybrid attention module CAA, as shown in Figure 7. By improving the ability of the global context modeling and local feature extraction, the low-level small target detection layer can obtain more effective feature information. The CAA module mainly includes the information fusion of three feature maps, namely the high-level feature information P h from the backbone network, the same-layer feature information P s of the backbone network and the top-down feature information F td in the FPN network.
For the high-level feature P h to fully utilize its contextual information, we adopted the attention module in GCNet [51] to capture long-range dependencies, whose structure is shown in Figure 7a. GCNet fully exploits the advantages and disadvantages of Non-Local [52] and SE [34] modules, absorbs the strong global context modeling ability of Non-Local Network (NLNet) and the low computational cost of SENet, and designs a more effective global context module to capture remote dependencies. Given an input tensor P h ∈ R C h ×H h ×W h , first, it adjusts its channel to C × H × Wthrough a standard convolution operation with the BN layer and a 4-time downsampling layer and then passes it through the GC attention to capture more feature information. For the low-level feature P s to make full use of the rich location information of the shallow feature map, the CBAM [53] attention module was used to selectively aggregate the features of each location by weighting the key location features. CBAM generates the attention map of the convolutional network feature map from two aspects of channel and space and then multiplies the attention map with the input feature map for feature adaptive learning. The network structure is shown in Figure 7b. The channel attention mechanism enables the network model to effectively pay attention to important channels while ignoring or even suppressing negative channels. The introduction of the spatial attention mechanism can effectively focus on the ship target and suppress other nonimportant information in the image, so as to further improve the detection accuracy.
After the high-level feature P h and the low-level feature P s were processed by the attention module, respectively, the feature fusion was performed by pixel-by-pixel addition. Then, it was combined with the top-down feature information F td according to the dimension information to get the final feature fusion result. Through this module, important features can be enhanced and unimportant features can be weakened, so that the extracted features are more directional. Mathematically, it can be represented by Equation (3): P s is the underlying feature map in the backbone network, P h is the high-level feature map in the backbone network, F td is the feature map in the top-down path, Concat [,] represents features that are spliced by channel, f sums represent features that are pixel-bypixel additively fused and f CBAM and f GC represent the feature enhancement operation before the feature layer fusion.

Experiments
In this section, we first introduce the experimental dataset, implementation details and related evaluation metrics. Furthermore, extensive experiments are also conducted to demonstrate the effectiveness and robustness of the proposed method.

Data Set
This paper used the infrared ship images provided by the organizers of the 2021 Ocean Target IntelliSense International Challenge as the data set, which contains a total of 9402 infrared images, including seven types of detection targets: warship, liner, container ship, bulk carrier, sailboat, canoe and fishing boat. We performed a preliminary manual screening of the dataset, removed some problematic images and readjusted the labels of some images to ensure the validity of the dataset.
The distribution of each category of targets in the infrared ship data set is shown in Figure 8. The number of images and targets of fishing boats is approximately 1:5, which is more densely distributed than the other ships. After normalizing the size of the dataset to 640 × 640 pixels, the distribution of size and aspect ratio within the dataset are shown in Table 3 and Table 4, respectively. Table 3 shows the extreme difference in the size distribution of the datasets, with the smallest object being only 4 pixels and the largest reaching 408,960 pixels. Table 4 shows the variation between the target aspect ratios within the class. In total, 42% of the targets in the infrared ship dataset are small targets, whose size is less than 32 × 32 pixels, and are mainly distributed in fishing boat and canoe targets. Because small targets have few features and even present a dense distribution state, these problems further aggravate the difficulty of infrared small target detection.    The experiment environment was performed on a PC with Intel core i9-10900KF CPU, GeForce RTX 3090 (24 GB storage), CUDA 11.1 and the operating system was Ubuntu 18.04 LTS. The deep learning framework was Pytorch 1.9. Considering the equipment performance, the batch size was 24, and a total of 150 training epochs were selected with an initial learning rate of 0.01. The optimizer used stochastic gradient descent (SGD) and the cosine learning rate decay strategy to train the network. In our experiment, the images were resized to 640 × 640 uniformly. The dataset was split into approximately 80% training and 20% validation sets.
We used the mosaic data augmentation method provided in Yolov5 to increase the number of small objects and increase the speed of training by stitching four images together, as shown in Figure 9.

Evaluation Metrics
To evaluate the performance of the proposed algorithm, the classic detection metrics average precision (AP) , average recall and mean average precision (mAP) were used. We calculated the AP and AR in two settings: AP@0.5/AR@0.5, AP@0.75/MR@0.75, APs/ARs, APm/ARm and APl/ARl. AP@0.5/AR@0.5 means the value of AP or AR when IoU = 0.5, AP@0.75/AR@0.75 means the value of AP or AR when IoU = 0.75, APs/ARs means the average value of AP or AR when the size of the target is smaller than 32 × 32 pixels, APm/ARm means the average value of AP or AR when the size of target is between 32 × 32 and 96 × 96 pixels, APl/ARl means the average value of AP or AR when the size of target is bigger than 96 × 96 pixels. Precision (P) and recall (R) were calculated by Formulas (4) and (5), where # denotes the number, TP denotes the situation where the prediction and label are both ships, FP denotes the situation where the prediction is a ship but the label is the background, FN denotes the situation where the prediction is the background but the label is a ship. The average precision (AP) was calculated by Formula (6). AP = 1 0 P(r)dr (6) where P denotes the precision, and r denotes the recall.

Results
This section first introduces the experimental dataset, implementation details and related evaluation metrics. Furthermore, extensive experiments were also conducted to demonstrate the effectiveness and robustness of the proposed method.
(1) Influence of the P2 layer Because of a large number of small targets in the marine ship dataset, we aimed to use the feature information of small targets provided by the shallow layer P2 for classification and localization. We extracted the feature map downsampled four times from the backbone network as the P2 layer. It first performed feature fusion with the FPN, which transmitted high-level feature information from top to bottom, and then transferred the fused low-level features to the high-level detection layer through the PAN network. We set three anchor boxes of different scales on the high-resolution feature layer and obtained 12 sets of anchor boxes for the detection target by using the K-means algorithm, as shown in Table 5.  As shown in Figure 10, YOLOv5 missed the detection of targets with area smaller than 20 × 16 pixels, while adding the P2 layer could detect all small targets. Therefore, more low-level detailed information and positioning information can be extracted by adding the high-resolution feature layer P2, which is helpful to alleviate the phenomenon of missed detection of targets, especially for small targets. As shown in Figure 11a-c, our algorithm was able to detect small objects with extremely small size and weak features located on the sea line. In terms of the objects with indistinct features due to motion blur, such as Figure 11b,c, adding the P2 layer can achieve more accurate detection with more low-level details. In Figure 11d, ships occluding each other might compress the features due to downsampling, which often led to missed detection. However, adding the P2 layer also brought some noise interference. For example, Figure 11f showed error detection results. In Table 6, we find that the AP1 after adding P2 is lower than that of YOLOv5; the reason may be that the increase of the P2 layer at the bottom brings more noise to interfere with the result of the feature fusion at the top layers.   In conclusion, the addition of the P2 layer extracted more details of the targets, alleviated the missed detection of small targets and improved the detection rate of densely distributed targets. However, the detection of large targets was affected because the PAN network transmitted more noise introduced by P2 to the higher levels, so we had to suppress the shallow noise.
(2) Influence of the backbone By adding the P2 detection layer, we found that using the shallow feature could detect targets with weak features and small sizes, but it also brought more noise information to other detection layers. By introducing the TA attention network into the residual module of C3 in the backbone, we found that the noise increase could be alleviated to a certain extent, so that effective features could be assigned more weights to continue to propagate to the lower layers of the network and the propagation of invalid features could be suppressed. In Table 7, we find that the APl is improved relative to the model with the addition of P2 layers, which proves that the TA module can make the network pay more attention to the target area, thus reducing the influence of background on the result. (3) Influence of the feature fusion The high-level feature layers P4 and P5 were used as context layers, respectively, and provided more contextual information for small objects through the proposed new feature fusion method. We applied an attention CAA module on the lower two layers to guide the fusion process of features and suppress noisy information. Figure 12a shows that the CAA module can detect more small targets. In Figure 12b-e, the addition of the P2 layer introduces more noise, which leads to the wrong detection of the target. The CAA module improves the detection. The group of images (Figure 12f) in the densely distributed scene reflect a better detection effectiveness of the CAA module. It can be seen from Table 5 and Table 7 that the average detection accuracy and recall rate of small targets improved, which were 3.39% and 6.67% higher than YOLOv5, and 2.41% and 5.01% higher than when only adding P2 layer. At the same time, the detection accuracy of large targets was also increased by 1.87% compared with the direct increase of the P2 layer, which solved the problem that the detection effectiveness of large targets decreased due to the addition of the P2 layer.
In conclusion, more context information was helpful for the target detection. The CAA fusion strategy improved the detection effectiveness of targets and suppressed the influence of the noise, which reduced false detection. Moreover, it also achieved a better detection effectiveness in the densely distributed scene and further improved the detection effectiveness of small targets, especially for targets below 10 × 10 pixels.

(4) Overall Detection Results
The algorithm proposed in this paper improved the detection effectiveness of small targets by constructing a high-resolution feature layer P2 and solved the noise interference caused by adding the P2 layer by means of a combined attention mechanism. By verifying the effectiveness of our algorithm in the infrared ocean dataset, it was found that CAA-YOLO could achieve more prominent detection results than YOLOv5 when facing extremely small targets. In addition, CAA-YOLO could also improve object detection in dense and occluded scenes. As shown in Figure 13a, our algorithm can not only detect extremely small targets on the water antenna, but also locate small targets in dense scenes. In Figure 13b, for targets with weaker features, our algorithm can also successfully detect targets, and the confidence of targets is improved by a combination of multiple strategies. In Figure 13c, our algorithm is able to detect more objects for complex and dense scenes. In Figure 13d, we are able to detect more small objects in the dense small object scene.
Compared with the YOLOv5 algorithm, the CAA-YOLO algorithm improved the detection effectiveness of each category, especially for smaller fishing boats and canoes, as shown in Figure 14. In general, our algorithm achieved better accuracy and recall rate, as shown in Figure 15. Although the average recall rate and average detection accuracy of the model improved less, the detection accuracy and recall rate of small targets were greatly improved, by 5.63% and 9.01%, respectively, as shown in Table 5 and Table 7. Compared with YOLOv5, our model had a 3.4% increase of mAP, a 5.81% increase in the number of parameters, and an increase in FLOPs from 108 to 131.9, as shown in Table 8.  As a result, our method could extract more target details without increasing computation too much, which is very important for occlusion scene and small target detection. In terms of time complexity, our GFLOPs increased by 23.9 and the increase in time complexity was mainly caused by the addition of the P2 layer. The CAA module brought only a small increase in floating point operations. In addition, our method hardly increased the space complexity of the model. For targets with drastic scale changes, CCA-YOLO could improve the recall rate and detection accuracy of small targets without reducing the detection effect of large targets.

(5) Comparison to State-of-the-Art Approaches
To evaluate the effectiveness and timeliness of the proposed CAA-YOLO algorithm, we compared several state-of-the-art methods with our model, including faster R-CNN [4], SSD [8], RetinaNet [54] and EfficientDet [11]. For a fair comparison, all the compared methods adopted the same training and test sets and kept the same epoch of training. We tested the speed of image processing on a PC with Intel Xeon(R) CPU E5-2683 v3 CPU and GeForce GTX 1080 (12 GB storage). Table 9 summarizes the detection results when IoU = 0.5 and confidence = 0.05 in terms of model volume, the mean average precision (mAP) and speed. mAP coco and mAP voc represent the results of detection using COCO and Pascal VOC evaluation methods, respectively. FPS (frame per second) is the detection speed, which is the number of images that the algorithm can detect per second. It can be seen that the proposed CAA-YOLO exhibits the best performance with the highest mAP values, which demonstrates its effectiveness compared with other models. Other algorithms show lower values in mAP coco , but better performance in mAP voc , indicating that their target positioning ability is weaker than that of YOLO and CAA-YOLO. When IoU is greater than 0.5, other advanced algorithms gradually show poor performance. Faster R-CNN can achieve better mAP voc compared with other compared methods, but its processing speed is slow because it is a two-stage algorithm. The SSD algorithm has the highest processing speed, but its detection performance is poor, especially for small targets. To further show the superior of CAA-YOLO visually, we present the detection results of several typical scenes in Figure 16. In Figure 16a, all algorithms can identify large targets correctly, but only faster R-CNN and CAA-YOLO algorithms can detect small targets with blurry visual effects on the sea surface. In Figure 16b, the target located at the sea line is small in size and weak in feature, the two-stage algorithm faster R-CNN can detect them better than the one-stage SSD, RetinaNet and EfficientDet, but it still misses targets, while CAA-YOLO can extract more target features to achieve a more accurate detection and alleviate missed detection. In Figure 16c, the ships near the shore are densely distributed and occluded from each other, and the detection effectiveness of faster R-CNN on canoes in complex backgrounds is poor. RetinaNet and EfficientDet have weak postprocessing capabilities for target occlusion scenes and fail to filter a large number of overlapping target boxes. CAA-YOLO can not only detect canoes in complex backgrounds, but also better detect densely distributed overlapping targets. In Figure 16d, other algorithms lead to error detection and missed detection of small targets, while CAA-YOLO can correctly detect more small targets. In these four examples, CAA-YOLO performs best, and the other four algorithms have different degrees of missed detection, false alarm and incomplete detection.
In conclusion, compared with several state-of-the-art methods, our proposed method obtained better target positioning ability and achieved real-time target detection performance. Because small targets have higher requirements for localization, CAA-YOLO showed better results in the detection of small targets.

Conclusions
In this paper, we proposed the CAA-YOLO network for infrared ocean ships detection to alleviate the problem of extremely widely multiscale distribution in infrared scenes. The network improved the detection effectiveness of small targets by adding a highresolution detection layer and increased the feature extraction capability of the backbone by introducing an attention module. Furthermore, the combined attention mechanism used in the feature fusion stage could achieve a more effective feature fusion while suppressing infrared noise interference. We validated the effectiveness of the method using the infrared ship dataset and experiments illustrated substantial improvements in the infrared small target detection, in comparison with other state-of-the-art methods. The performance of our proposed model could be attributed to the combination of learned shallower features and attention features, which allowed our model to detect more infrared small targets based on their low resolution and weak features, and improved the target recall rate in dense and occluded scenes to a certain extent.
On the one hand, for dense and occluded objects, our algorithm could improve its recall rate, but there were still missed targets, so we will focus on solving the detection of dense and occluded targets in future work, for example, using better postprocessing mechanisms. On the other hand, compared with SSD and YOLOv5, we achieved better detection results, but the speed of detection was not the fastest, so we will further study how to lighten our method and improve its real-time detection speed. For example, depthwise separable convolution and lighter backbones can be used to replace the backbone of our method.
Funding: This research was funded by National Science and Technology Foundation Strengthening Plan (2021-JCJQ-JJ-1020).

Institutional Review Board Statement:
The study did not require ethical approval; it did not involve humans or animals.

Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.