Swin-Transformer-Enabled YOLOv5 with Attention Mechanism for Small Object Detection on Satellite Images

Abstract: Object detection has made tremendous progress in natural images over the last decade. However, the results are hardly satisfactory when natural image object detection algorithms are directly applied to satellite images. This is due to the intrinsic differences in the scale and orientation of objects caused by the bird's-eye perspective of satellite photographs. Moreover, the background of satellite images is complex and the object area is small; as a result, small objects tend to be missed due to the difficulty of feature extraction. Overlap and occlusion among dense objects also degrade the detection performance. Although the self-attention mechanism has been introduced to detect small objects, its computational complexity increases with the image resolution. We modified the general one-stage detector YOLOv5 to adapt it to satellite images and resolve the above problems. First, new feature fusion layers and a prediction head are added from the shallow layer for small object detection for the first time, because this maximally preserves the feature information. Second, the original convolutional prediction heads are replaced with Swin Transformer Prediction Heads (SPHs) for the first time. The SPH is an advanced self-attention mechanism whose shifted-window design reduces the computational complexity to linear. Finally, Normalization-based Attention Modules (NAMs) are integrated into YOLOv5 to improve attention performance in a normalized way. The improved YOLOv5 is termed SPH-YOLOv5. It is evaluated on the NWPU-VHR10 dataset and the DOTA dataset, which are widely used for satellite image object detection evaluations. Compared with the baseline YOLOv5, SPH-YOLOv5 improves the mean Average Precision (mAP) by 0.071 on the DOTA dataset.


Introduction
Earth observation satellites routinely acquire high-resolution photographs of the Earth's surface. However, existing interpretation algorithms face a significant challenge in processing this large volume of satellite images. Object detection is one of the most fundamental problems in computer vision; it aims to find predefined instances reliably and efficiently in images. It has wide applications in disaster monitoring, precision agriculture, urban traffic management, etc. [1][2][3].
Recently, data-driven deep learning methods have promoted significant progress in segmentation and object detection tasks [4][5][6]. The detection precision is affected by the small object sizes, complex backgrounds, and dense distributions found in satellite images. The main contributions of this work are summarized as follows:

1. New feature fusion layers and a prediction head are added from the shallow layer in YOLOv5 for the first time to detect small objects, because this maximally preserves the feature information.

2. The original convolutional prediction heads in YOLOv5 are replaced with Swin Transformer Prediction Heads (SPHs) for the first time to reduce the computational complexity.

3. Normalization-based Attention Modules (NAMs) are integrated into YOLOv5, adding a sparsity penalty to the attention module to improve performance.

4. Our proposed SPH-YOLOv5 achieves a mean Average Precision (mAP) of 0.716 on the DOTA dataset with complex objects and 0.98 on the NWPU-VHR10 dataset with relatively simple objects, the best accuracy among the existing models.

Object Detection
Feature extraction is crucial for object detection because it transforms raw data into high-level feature representations. However, traditional model-driven approaches, such as the Histogram of Oriented Gradients (HOG) [21] and the Scale Invariant Feature Transform (SIFT) [22], usually require considerable time and manual effort to deal with large datasets. In contrast, data-driven deep learning methods can automatically extract robust feature representations from the raw data, outperforming traditional extraction approaches. Furthermore, they relieve the heavy load of traditional feature modeling and engineering. Object detection frameworks based on deep learning are broadly divided into two types: two-stage detection frameworks and one-stage detection frameworks [23].
The two-stage methods use a selective search algorithm to extract candidate regions in the first stage. They then use a CNN to extract features from the candidate regions and finally apply a classifier in the second stage. Region-CNN (R-CNN) [24] is a typical two-stage object detection algorithm. However, R-CNN requires a fixed input image size. To avoid this drawback, the Spatial Pyramid Pooling Network (SPP-Net) [25] introduces spatial pyramid pooling, which can extract features from arbitrary regions, allowing the network to detect objects in input images of various sizes and significantly reducing the computational effort. Fast R-CNN [26] incorporates the strengths of SPP-Net and avoids the considerable overlap of region proposals in R-CNN. It proposes a feature extraction network shared by all candidate regions, which improves detection speed and accuracy. Unlike Fast R-CNN, which generates region proposals with selective search, Faster R-CNN [27] designs a Region Proposal Network (RPN) and proposes an anchor-box mechanism for region proposal generation, which brings a considerable boost. However, none of the above methods are end-to-end frameworks, so their detection speed is limited.
In contrast, one-stage methods directly predict bounding boxes and class probabilities end-to-end. They are faster and less computationally expensive than two-stage models and are capable of real-time detection. For example, the Single Shot MultiBox Detector (SSD) [6] uses a fully convolutional network (FCN) for feature extraction and detects small and large objects from shallow and high-level feature maps, respectively. However, one-stage object detectors suffer from a category imbalance problem; thus, their detection accuracy is lower than that of two-stage detectors. RetinaNet [5] utilizes a focal loss to address the category imbalance problem, which guarantees detection speed while outperforming all two-stage detection algorithms. YOLO [13] is another representative one-stage object detection algorithm; it pursues extreme speed, which results in a low recall rate. YOLOv2 [14] replaces GoogleNet in YOLO with Darknet-19 as the feature extraction network to improve detection performance; Darknet-19 has fewer convolutional layers and more efficient performance. Meanwhile, YOLOv2 also introduces the prior anchor boxes from the RPN to improve the recall rate. YOLOv3 [15] upgrades the feature extraction network from Darknet-19 to Darknet-53 with a multiscale framework and adds residual connections from ResNet [28]. It uses feature maps at three different scales for object detection to improve the detection accuracy for small objects. YOLOv4 [16] designs a Cross Stage Partial (CSP) structure based on Darknet-53 to form a backbone network that further reduces the computational effort and enhances gradient performance. Furthermore, it introduces the CIoU loss [29] and the Mish activation function [30] to further improve detection accuracy. In addition, the powerful Scaled-YOLOv4 [31] offers a range of linearly scaled object detection models for engineering applications. As the latest and strongest generation of the YOLO series, the YOLOv5 model inherits all the above advantages. On the MS COCO Val 2017 dataset, YOLOv5 reports the highest mAP of 55.4%, with an inference time of 19.4 ms per image, and is currently rated among the top state-of-the-art object detectors. Therefore, in this paper, we used YOLOv5 as the benchmark framework for object detection on optical satellite images.

Data Augmentation
Datasets are crucial for deep learning. However, creating abundant satellite datasets is expensive and impractical. Data augmentation is a popular and effective way to improve detection performance. Multiple data augmentation strategies can expand the training data and enrich its diversity, thus enhancing the robustness and generalization ability of the detection model. Early data augmentation methods utilized distortion, rotation, and scaling to improve image classification accuracy [32]. Subsequently, geometric transformation methods were developed, including random scaling, cropping, panning, and clipping. In addition, photometric transformations are also widely used, for example, changing the hue, saturation, and value of the training data to expand the dataset. There are also more specialized multi-image fusion methods. Mixup [33] expands the dataset by randomly weighting and mixing images from different categories in the training dataset. Cutout [34] randomly erases parts of the sample image and fills them with zero pixels. CutMix [35] improves Cutout by filling the erased part not with zero pixels but with pixel values from other training images. Mosaic combines the advantages of the above methods. It was first proposed in YOLOv4, and the main idea is to randomly crop four images and stitch them into one image as training data, which greatly enriches the background of the images. Moreover, because batch normalization computes its statistics over the four stitched images on each layer, the need for a large mini-batch size is reduced. In YOLOv5, the combination of Mixup and Mosaic effectively expands the satellite dataset and dramatically improves object detection performance, especially for small objects. Therefore, our benchmark network combined Mixup, Mosaic, and traditional data augmentation strategies.
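As a concrete illustration of the Mixup idea described above, the following is a minimal, classification-style PyTorch sketch; the function name, tensor shapes, and the default alpha are illustrative assumptions, and for detection the image blend is formed the same way while the box annotations of the two source images are simply concatenated.

```python
import torch

def mixup(images, labels, alpha=0.2):
    """Minimal Mixup sketch: blend randomly paired training samples.

    images: (B, C, H, W) float tensor; labels: (B, num_classes) one-hot float tensor.
    The blending weight lam is drawn from a Beta(alpha, alpha) distribution,
    as in the original Mixup formulation.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))             # random pairing within the batch
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_images, mixed_labels
```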

The Attention Mechanism
The attention mechanism originated from studies of selective human attention to information. It allows neural networks to adapt their perception in computer vision tasks, specifically by making the model pay more attention to the essential parts of the input and thus extract key features. The attention mechanism has taken different forms in practice. The Recurrent Attention Model (RAM) [36] pioneered the combination of an attention mechanism with a deep neural network; early attention mechanisms were commonly built on Recurrent Neural Networks (RNNs). To implement spatial attention in CNNs, the Spatial Transformer Network (STN) [37] was proposed to automatically select the features of the region of interest and perform a spatial transformation of data with various deformations. Different from spatial attention, the Squeeze-and-Excitation Network (SENet) [38] demonstrates a unique channel attention network that adaptively predicts potentially essential features. The Convolutional Block Attention Module (CBAM) [39] combines channel attention and spatial attention and is one of the most widely used lightweight attention modules for capturing both spatial and channel features. The latest phase of research on attention mechanisms comes from self-attention, which originates from natural language processing. The non-local network [40] introduced self-attention into computer vision tasks and made excellent progress in object detection. Recently, without any convolutional or recurrent operations, various pure self-attention deep networks (named vision transformers) [41,42] have emerged, showing great potential for attention-based models with overall improvements in detection speed, detection performance, and generalization capability [43]. For example, TPH-YOLOv5 with a transformer prediction head was developed for UAV images [44]. However, the transformer encounters obstacles in processing high-resolution images because the computational complexity of self-attention is quadratic in the image size. Recently, the Swin transformer proposed a hierarchical transformer with shifted windows, significantly improving computational efficiency and detection performance [20].
Satellite images include both sparsely and densely distributed objects. Complex objects place higher demands on the feature extraction network. Generally, a CNN extracts features with translational invariance, while the transformer does not have such a visual prior. The CNN is good at extracting local information, while the Swin transformer is good at global modeling and has a stronger anti-interference ability. Due to these characteristics, we expect to enhance feature extraction by combining the CNN and the Swin transformer, to explore the representation potential of attention mechanisms, and to advance object detection in satellite images.

Review of YOLOv5
The framework of YOLOv5 consists of three main parts: the backbone, the neck, and the prediction head. The backbone extracts feature information from input images, and the neck combines the gathered feature information and creates feature maps at three different scales. The prediction head detects objects based on these feature maps. YOLOv5 employs the CSPDarknet53 framework with an SPP layer as the backbone, PANet as the neck, and the YOLO detection head. YOLOv5 can calculate the best anchor-box values by applying a clustering algorithm to each training dataset. In addition, YOLOv5 has tried various combinations of activation functions, such as sigmoid, leaky-ReLU, and SiLU [45]. There are five derived models of YOLOv5: YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. They have the same architecture but different widths and depths. The smaller models are faster and are usually designed for mobile deployment; the larger models are more computationally intensive but perform better.
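For reference, these five variants can be obtained directly from the upstream Ultralytics repository through torch.hub; the sketch below shows this standard usage (it assumes network access and a hypothetical image path, and is not part of the SPH-YOLOv5 modifications described later).

```python
import torch

# Load YOLOv5 variants from the upstream Ultralytics repository via torch.hub.
# The entry-point names ('yolov5n' ... 'yolov5x') follow the repository's hubconf.
model_s = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)  # small, fast
model_x = torch.hub.load('ultralytics/yolov5', 'yolov5x', pretrained=True)  # large, accurate

# Run inference on a local image (hypothetical path); the results object holds
# bounding boxes, confidence scores, and class indices.
results = model_s('satellite_tile.jpg')
results.print()
```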

SPH-YOLOv5
The architecture of our proposed SPH-YOLOv5 for object detection on optical satellite images is depicted in Figure 1.

Proposed Prediction Head for Small Objects
The SPH-YOLOv5 backbone network contains successive down-sampling convolutional layers, so the feature map size decreases as the network deepens during feature extraction. The small size of the feature map hinders the detection of small objects in the image. Unfortunately, satellite images contain a large number of small objects. To enhance the feature fusion for small objects, we added an extra prediction head and residual connections from the shallower layers of the backbone network. They bring the low-level, high-resolution feature information into the feature fusion layer, making the added prediction head more sensitive to small objects. Finally, the four prediction heads at different feature scales are more adaptable to the dramatic scale variations of objects in satellite images.
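The following is a minimal, illustrative PyTorch sketch of this idea, assuming a stride-4 backbone feature p2 and a stride-8 neck feature p3; the module name, channel counts, and layer choices are simplifications rather than the exact SPH-YOLOv5 configuration.

```python
import torch
import torch.nn as nn

class ExtraP2Head(nn.Module):
    """Sketch of the extra small-object path: fuse a shallow, high-resolution
    backbone feature (stride 4) with an up-sampled deeper neck feature (stride 8)
    and predict from the fused stride-4 map."""

    def __init__(self, c2, c3, num_outputs):
        super().__init__()
        self.reduce = nn.Conv2d(c3, c2, kernel_size=1)           # align channels before fusion
        self.up = nn.Upsample(scale_factor=2, mode='nearest')    # stride 8 -> stride 4
        self.fuse = nn.Conv2d(2 * c2, c2, kernel_size=3, padding=1)
        self.pred = nn.Conv2d(c2, num_outputs, kernel_size=1)    # e.g. anchors * (5 + classes)

    def forward(self, p2, p3):
        x = torch.cat([p2, self.up(self.reduce(p3))], dim=1)     # shallow/deep feature fusion
        return self.pred(self.fuse(x))                           # stride-4 predictions for small objects
```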

Normalization-Based Attention Module (NAM)
The NAM is a variant of the CBAM module [44,46] and was originally proposed for image classification. As shown in Figure 2, it consists of redesigned channel and spatial attention submodules applied in sequence. They reweight attention by measuring the variance of the training weights in both the channel and spatial dimensions. The key scaling factor comes from batch normalization (BN):

BN(x) = γ · (x − μ_b) / √(σ_b² + ε) + β, (1)

where γ and β represent the trainable scale and shift parameters, respectively; μ_b and σ_b are the mean and standard deviation in each batch b, respectively; and ε is a small constant for numerical stability. The scaling factor γ measures the variance in BN. A larger variance means more variation and richer information. For the channel attention module, more attention can be given to important channels based on the γ-normalized correlation weights W_γ, while less informative weights are suppressed. Suppose F_1 ∈ R^{H×W×C} is the input feature map, where H, W, and C represent the height, width, and number of channels, respectively. The output M_c of the channel attention can be expressed as

M_c = sigmoid(W_γ(BN(F_1))). (2)

In the manner of the channel attention module, the spatial attention module applies BN to the pixels in the spatial dimension, which is the so-called pixel normalization (PN). It focuses on the more informative pixels according to the scaling factor λ and the associated normalized weights W_λ. Similarly, with F_2 ∈ R^{H×W×C} as the input feature map, the output M_s of the spatial attention module is

M_s = sigmoid(W_λ(PN(F_2))). (3)

To suppress the less important weights, NAM adds a regularization term to the loss function:

Loss = Σ_{(x,y)} l(f(x, W), y) + p Σ g(γ) + p Σ g(λ), (4)

where l(·) and g(·) represent the loss function and the l_1 norm penalty function, respectively; x and y are the input and output, respectively; W denotes the network weights; and p is the equilibrium penalty factor.
In satellite images, the comprehensive imaging coverage introduces complex background interference. It has been demonstrated that CBAM in TPH-YOLOv5 for UAV images can inform the network of what to focus on and where to focus using spatial and cross-channel feature connections. NAM improves on CBAM through weight normalization, yielding a cleaner module and higher computational efficiency. The neck is the crucial link between the top and bottom of the object detection framework: it reprocesses and rationalizes the important features extracted from the backbone to facilitate the object prediction of the head in the next step. Therefore, we inserted a NAM module after each concatenation operation in YOLOv5's neck, before the following feature layer, to help refine the fused feature information. In this study, we added five NAM modules to the neck of SPH-YOLOv5, aiming to refine the channel and spatial information of the feature fusion layers. This helps the model focus on the key information of objects in complex environments. Furthermore, the increased computational cost is negligible because of the lightweight design.
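A minimal sketch of the channel branch of Equation (2), following the publicly described NAM scheme, is given below: the BN scale factors γ are normalized into channel weights W_γ and the sigmoid-gated result reweights the input feature map, while the spatial branch applies the same idea to pixels via λ. The class and variable names are illustrative, not the exact SPH-YOLOv5 implementation.

```python
import torch
import torch.nn as nn

class NAMChannelAttention(nn.Module):
    """Channel branch of NAM: M_c = sigmoid(W_gamma(BN(F1))), applied to the input.

    Channels with a larger BN scale factor gamma carry richer information and
    receive larger normalized weights w_gamma = gamma_i / sum_j gamma_j.
    """

    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels, affine=True)

    def forward(self, x):                                  # x: (B, C, H, W)
        residual = x
        x = self.bn(x)
        gamma = self.bn.weight.abs()
        w_gamma = gamma / gamma.sum()                      # normalized channel importance
        x = x * w_gamma.view(1, -1, 1, 1)                  # reweight channels
        return torch.sigmoid(x) * residual                 # gated attention applied to the input
```

In SPH-YOLOv5, one such module would follow each concatenation in the neck, as described above.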

Swin Transformer Encoder Block
Inspired by the vision transformer, researchers have developed superior networks by combining CNNs with transformers. The transformer has a stronger ability to capture global information than the CNN and performs better for the dense and occluded objects in satellite datasets. We took advantage of both by fusing them in the network, adding four Swin transformer encoder blocks to the prediction heads.
Given a feature map X ∈ R^{H×W×C}, after linear projection and reshape operations, the feature map becomes Q, K, V ∈ R^{N×C} to feed self-attention, where N = H × W. The output of self-attention is expressed as

A = softmax(QK^T / √C),  Z = AV,

where A ∈ R^{N×N} is the attention matrix representing the relationship between every element of the feature map and all other elements, and the output Z aggregates global information. In practice, the transformer performs this computation in parallel: the inputs are processed separately in multiple independent subspaces and then integrated, taking their correlations into account. This is called Multi-head Self-Attention (MSA) and is the core of the transformer. However, we found in our experiments that the transformer consumes enormous computational resources when processing high-resolution satellite images, because the computational complexity of MSA is quadratic in the image size. To improve the computational efficiency of self-attention, we introduced Swin transformer encoder blocks into the prediction heads of our SPH-YOLOv5; their structure is shown in Figure 3. Each Swin transformer encoder contains two sub-layers. The first sub-layer is Window Multi-head Self-Attention (W-MSA). It divides the feature map into separate windows in a non-overlapping manner, and self-attention is then computed within these local windows. For a feature map X ∈ R^{H×W×C} with a local window of size m × m, the computational complexities Ω are

Ω(MSA) = 4HWC² + 2(HW)²C,
Ω(W-MSA) = 4HWC² + 2m²HWC.

The computational complexity is significantly reduced since the window size is much smaller than the image size. The second sub-layer (MLP) is a fully connected layer. A residual connection between W-MSA and MLP is added to counteract gradient disappearance and degradation of the weight matrix.
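The following small sketch makes the window mechanism and the complexity comparison concrete: window_partition shows the non-overlapping split used by W-MSA, and the two cost functions evaluate the expressions above. The helper names and example sizes are illustrative, not SPH-YOLOv5 code.

```python
import torch

def window_partition(x, m):
    """Split a (B, H, W, C) feature map into non-overlapping m x m windows.

    Self-attention is then computed inside each window, so every attention
    matrix is only (m*m) x (m*m) instead of (H*W) x (H*W).
    """
    B, H, W, C = x.shape
    x = x.view(B, H // m, m, W // m, m, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, m * m, C)

def msa_cost(h, w, c):
    """Global MSA: 4*H*W*C^2 + 2*(H*W)^2*C, quadratic in the number of pixels."""
    return 4 * h * w * c ** 2 + 2 * (h * w) ** 2 * c

def wmsa_cost(h, w, c, m):
    """Window MSA: 4*H*W*C^2 + 2*m^2*H*W*C, linear in the number of pixels for fixed m."""
    return 4 * h * w * c ** 2 + 2 * m ** 2 * h * w * c

# Example: a 160 x 160 feature map with 256 channels and 8 x 8 windows.
windows = window_partition(torch.randn(1, 160, 160, 256), 8)   # -> (400, 64, 256)
print(msa_cost(160, 160, 256) / wmsa_cost(160, 160, 256, 8))   # roughly 45x fewer operations
```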


Comparison with TPH-YOLOv5
From the above description, the differences between our proposed SPH-YOLOv5 and TPH-YOLOv5 [44] are easy to identify. First, TPH-YOLOv5 focuses on UAV images, while our SPH-YOLOv5 targets satellite images, whose objects are smaller and carry weaker information than those in UAV images. Second, TPH-YOLOv5 only adds a transformer prediction head. In contrast, we added not only a Swin transformer prediction head but also another C3 module in the backbone network: the backbone of TPH-YOLOv5 has only three C3 modules, whereas ours has four to better retain the features of small objects. Third, TPH-YOLOv5 uses a general transformer that is not well suited to processing higher-resolution satellite images. In contrast, our SPH-YOLOv5 introduces the Swin transformer to compute self-attention with shifted windows, achieving higher computational efficiency and better performance on higher-resolution satellite images. Finally, we upgraded the CBAM in TPH-YOLOv5 to NAM to reduce the fully connected layers and improve the attention module in a normalized way. As a result, our proposed SPH-YOLOv5 achieves competitive performance on public high-resolution satellite image datasets, as demonstrated in the following section.

Experiments
To validate the effectiveness of our proposed model, we conducted experiments on two widely used satellite image datasets, NWPU-VHR10 [47] and DOTA [48].

Datasets and Evaluation Metrics
The NWPU-VHR10 dataset contains 800 high-resolution RGB satellite images, including 715 images with a spatial resolution of 2 m and 85 images with a spatial resolution of 8 cm. The image sizes are close to 1000 × 1000 pixels. The images were cropped from Google Earth and the Vaihingen dataset and then manually annotated by experts. The dataset is divided into ten categories (aircraft, ships, storage tanks, baseball fields, tennis courts, basketball courts, ground track fields, harbors, bridges, and vehicles) and includes background images without objects.
As shown in Table 1, objects between 10 and 50 pixels in size are called small objects, objects between 50 and 300 pixels are called medium objects, and objects larger than 300 pixels are called large objects. The DOTA dataset therefore contains more small objects than the NWPU-VHR10 dataset, and the corresponding detection is more challenging. For the NWPU-VHR10 dataset, we kept the original image size as the network input; 75% of the images were randomly selected as the training set and 25% as the validation set. For the DOTA dataset, we cropped the larger images into overlapping patches and upscaled the smaller images, generating approximately 15,000 images with a size of 1024 × 1024. Similarly, 75% were randomly selected for training and 25% were left for validation.

We chose the frequently used Precision (P), Recall (R), Average Precision (AP), and mean Average Precision (mAP) as evaluation metrics. P and R are defined with the True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) counts as

P = TP / (TP + FP),  R = TP / (TP + FN).

After creating the precision-recall curve (P-R curve) from the precision and recall, AP can be calculated as the area under the P-R curve, and mAP is the average of the APs over the N categories:

AP = ∫₀¹ P(R) dR,  mAP = (1/N) Σ_{i=1}^{N} AP_i.
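As a concrete illustration of these metrics, the short sketch below computes P, R, a simple area-under-curve AP, and mAP; the function names are illustrative, and standard benchmarks additionally interpolate the precision envelope before integrating.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP) and R = TP / (TP + FN), as defined above."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """AP as the area under the P-R curve (simple numerical integration)."""
    order = np.argsort(recalls)
    r = np.concatenate(([0.0], np.asarray(recalls, dtype=float)[order]))
    p = np.concatenate(([1.0], np.asarray(precisions, dtype=float)[order]))
    return float(np.trapz(p, r))

def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-category APs."""
    return float(np.mean(ap_per_class))

# Toy example with three P-R points and two hypothetical categories.
ap = average_precision([0.2, 0.5, 0.8], [0.9, 0.8, 0.6])
print(ap, mean_average_precision([ap, 0.7]))
```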

Implementation Details
Our proposed SPH-YOLOv5 was implemented in the PyTorch framework, and all variant models were trained and tested on an NVIDIA RTX3090 GPU with 24 GB of memory. In the training phase, we used the model of YOLOv5's backbone pre-trained on the COCO dataset [9] for transfer learning, saving considerable training time. We used the SGD optimizer with an initial learning rate of 0.01. The NWPU-VHR10 dataset was trained for 50 epochs, while the DOTA dataset was trained for 150 epochs. The first three epochs used warm-up, a common learning rate optimization strategy: the learning rate was decayed from approximately 0.1 to 0.01 over these epochs, and a learning rate of 0.01 was used thereafter. Warm-up helps deeper models remain stable in the early stages of training. It should be noted that all images are automatically enlarged to 1280 × 1280 by SPH-YOLOv5 to facilitate the detection of small objects. Thanks to the GPU memory saved by the Swin transformer encoder, the batch size can be set to 16 or more; in contrast, with the ordinary transformer encoder, the batch size can only be set to 2.
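A minimal sketch of this SGD warm-up schedule is shown below; the placeholder module stands in for SPH-YOLOv5 and the momentum value is assumed (YOLOv5's default), while the learning rates and epoch counts are the ones stated above.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, kernel_size=3)        # placeholder standing in for SPH-YOLOv5
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937)

warmup_epochs, warmup_from, base_lr, epochs = 3, 0.1, 0.01, 50   # 150 epochs for DOTA

for epoch in range(epochs):
    if epoch < warmup_epochs:
        # decay from roughly 0.1 to the working rate of 0.01 over the first three epochs
        lr = warmup_from + (base_lr - warmup_from) * (epoch + 1) / warmup_epochs
    else:
        lr = base_lr
    for group in optimizer.param_groups:
        group['lr'] = lr
    # ... run one training epoch on the 1280 x 1280 inputs with batch size 16 ...
```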
The implementation of TPH-YOLOv5 differs from that of SPH-YOLOv5. We trained the model with warm-up for the first 2 epochs and used the Adam optimizer with an initial learning rate of 3 × 10⁻⁴ and a cosine learning rate schedule. The learning rate of the last epoch decays to 0.12 of the initial learning rate. The batch size is only 2.

Experimental Results
We evaluated SPH-YOLOv5 on the NWPU-VHR10 and DOTA test datasets and compared the results with other representative models; the results are shown in Tables 2 and 3. On the NWPU-VHR10 dataset, our method obtained the highest mAP. On the DOTA dataset, our method achieves an mAP of 0.716, which is 0.071 higher than that of YOLOv5, proving its effectiveness for small object detection on satellite images. These results indicate that our model maintains medium-object detection performance while improving small-object detection capability.

We first drew the P-R curve for each category, as shown in Figure 4; the AP, the integrated area under the curve, is given in the legend, and a larger AP indicates better detection performance. Generally, the Intersection over Union (IoU) threshold and the confidence threshold are two essential settings for deep learning detectors. We therefore calculated the confusion matrix of the SPH-YOLOv5 results on the DOTA test set with an IoU threshold of 0.5 and a confidence threshold of 0.25. As shown in Figure 5, the confusion matrix visualizes the classification of each category: each row represents the predicted categories, each column represents the actual categories, and the values on the diagonal represent the proportion of correctly classified samples in each category. The high FN for the container crane category shows that most of these objects are missed, and the corresponding AP is also very low. This is mainly because the container crane category has far fewer training samples than the other categories; the lack of training samples limits feature extraction and results in a high FN. In addition, the FP is high for the small vehicle category, producing false alarms. Although there are sufficient training samples for small vehicles, they are tiny objects that are very hard to detect in dense scenes where objects occlude one another.
We show some representative detection results of SPH-YOLOv5 on the DOTA dataset in Figure 6. SPH-YOLOv5 handles several small- and medium-sized object classes well, such as planes, small vehicles, and ships, demonstrating the value of contextual knowledge in providing further assistance, even though these objects are frequently clumped together and difficult to differentiate. Furthermore, SPH-YOLOv5 shows superior performance for object categories with significant scale variation, such as tennis courts, soccer ball fields, and harbors, since it can simultaneously extract the detailed low-level characteristics needed for localization and the high-level semantics needed for identification. We also tested the inference speed of several comparison algorithms on satellite images on our device, as reported in Table 4.
SPH-YOLOv5 has a speed advantage over TPH-YOLOv5, confirming the computational advantage of the Swin transformer and NAM. However, it is still slower than YOLOv5, probably because of the redundancy introduced by the many modules we added to improve accuracy. Streamlining the model to improve its speed is an essential direction of future work.

Ablation Experiments
First, we investigated the fusion of the various modules into the YOLOv5 framework, as shown in Table 5. It is interesting to find that embedding the additional feature fusion layers and the added detection head (termed P2) improves the mAP by about 0.03, as expected. In contrast, the fusion of the attention modules, such as the transformer, the Swin transformer, or NAM, introduces a relatively moderate improvement in mAP (about 0.01-0.02). This is mainly because the fusion of P2 changes the whole architecture of YOLOv5, making it more significant than the other, local modules. Although the introduction of self-attention modules improves the detection performance for satellite objects, the Swin transformer additionally reduces the computational complexity and allows a larger batch size. Second, we analyzed the NAM module alone. The NAM can be configured in four ways: using channel attention only, using spatial attention only, using both attention modules simultaneously, and using the normalization-based method. As shown in Table 6, the normalization-based configuration is better than the others. Spatial attention is more effective than channel attention, mainly because the spatial information is richer than the channel information in satellite RGB images; therefore, spatial attention captures more delicate features and plays a vital role in detection.

Hyperparameter Exploration
The detection performance of SPH-YOLOv5 is also affected by the hyperparameter settings, such as the scaled image size, batch size, and optimizer. To obtain the optimal hyperparameters for the SPH-YOLOv5 model, we manually adjusted these parameters and observed the performance on the DOTA dataset. The experimental results are shown in Table 7. The optimal performance is obtained with the SGD optimizer, a batch size of 16, and a scaled image size of 1280 × 1280. We observed that the Adam optimizer converges faster, although its detection accuracy is not as good as that of SGD. It is well known that the batch size is generally limited by the GPU memory. We found that the maximum batch size is 2 when using the transformer prediction head, whereas it is 32 when using the Swin transformer prediction head. This indicates that the Swin transformer effectively reduces the computational complexity and saves GPU memory. Furthermore, the scaled image size refers to the input size that conforms to the YOLOv5 network limits and anchor-box settings. We compared the impact of different scaled sizes while keeping the original image size constant. It should be noted that the default scaled image size of YOLOv5 is 640, which suits the smaller sizes of natural images. However, satellite images are intrinsically larger; therefore, the larger scaled size of 1280 retains more detailed feature information about small objects and improves detection accuracy.

Conclusions
This paper upgraded the state-of-the-art YOLOv5 model for natural images into the SPH-YOLOv5 model adapted to satellite images. We improved the YOLOv5 network structure according to the characteristics of satellite images. Novel feature fusion layers and Swin Transformer Prediction Heads (SPHs) were added to YOLOv5. The shallow features acquired from the backbone network were brought into the feature fusion layer, effectively reducing the loss of feature information for small objects and improving detection performance on satellite images. In addition, multiple NAM attention modules were introduced to focus specifically on objects in complex scenes; both spatial attention and channel attention were used to find attention regions in dense scenes. The use of the Swin transformer effectively improves the detection performance and overcomes the computational complexity of the transformer. The proposed SPH-YOLOv5 was tested on the widely used NWPU-VHR10 dataset and the DOTA dataset, and the resulting mAPs reached 0.980 and 0.716, respectively, which are better than the performance of the other models. The effectiveness of the proposed SPH-YOLOv5 for object detection on satellite images was thus fully demonstrated. Multispectral images have the advantage of rich spectral information compared with RGB images, which could theoretically improve detection. However, there are few large-scale multispectral remote sensing datasets with detailed annotation, so we cannot confirm that our proposed model is valid for multispectral data. Moreover, the proposed method does not fully utilize spectral information; the subsequent fusion of RGB images with multispectral data could be considered to further improve detection.