A Convolution with Transformer Attention Module Integrating Local and Global Features for Object Detection in Remote Sensing Based on YOLOv8n

Abstract: Object detection in remote sensing scenarios plays an indispensable role in civilian, commercial, and military areas, leveraging the power of convolutional neural networks (CNNs). Remote sensing images, captured by aircraft and satellites, exhibit unique characteristics including complicated backgrounds, limited features, distinct density, and varied scales. Contextual and comprehensive information in an image helps a detector localize and classify targets precisely, which is extremely valuable for object detection in remote sensing scenarios. However, CNNs, restricted by the nature of the convolution operation, possess local receptive fields and scarce contextual information, even in large models. To address this limitation and improve detection performance by extracting global contextual information, we propose a novel plug-and-play attention module, named Convolution with Transformer Attention Module (CTAM). CTAM is composed of a convolutional bottleneck block and a simplified Transformer layer, which together integrate local features and position information with long-range dependency. YOLOv8n, a fast and accurate variant of the YOLO series, is selected as the baseline. To demonstrate the effectiveness and efficiency of CTAM, we incorporated CTAM into YOLOv8n and conducted extensive experiments on the DIOR dataset. YOLOv8n-CTAM achieves 54.2 mAP@50-95, surpassing YOLOv8n (51.4) by a large margin. Notably, it outperforms the baseline by 2.7 mAP@70 and 4.4 mAP@90, showing a larger advantage under stricter IoU thresholds. Furthermore, experiments conducted on the TGRS-HRRSD dataset validate the generalization ability of CTAM.


Introduction
With the advancement of remote sensing technologies, images captured by various aircraft and satellites are available in enormous quantities and at high spatial resolution. These images contain information crucial for a wide range of applications, such as land planning, forest protection, traffic monitoring, disaster detection, and personnel rescue. Object detection plays a fundamental role in remote sensing image processing. It extracts valuable information from images by localizing and classifying regions of interest. However, traditional object detection algorithms such as Histogram of Oriented Gradients (HOG) [1] and Scale-Invariant Feature Transform (SIFT) [2] rely on handcrafted features tailored to specific scenes, resulting in inferior efficiency, accuracy, and generalization.
In recent years, CNNs have rapidly revolutionized various fields in computer vision (CV), such as image classification, object detection, instance segmentation, and pose estimation. Object detection, as one of the primary tasks, is an indispensable component in industrial inspection, security surveillance, and autonomous driving. Since the success of [...]
To demonstrate the effectiveness and efficiency of CTAM, we selected YOLOv8n as the baseline, which achieves better performance while maintaining fast detection speed. We improved YOLOv8n with CTAM and conducted extensive experiments on the DIOR dataset [31]. YOLOv8n-CTAM surpasses the baseline by 2.8 mAP@50-95, with only a slight increase in detection time (0.2 ms). Notably, YOLOv8n-CTAM shows a larger advantage under stricter IoU thresholds, such as mAP@70 and mAP@90, indicating that CTAM makes the model focus on the central regions of targets and enhances localization capacity by integrating local features with global information. Compared with state-of-the-art detectors, it achieves cutting-edge performance while maintaining extremely fast speed. The results obtained on the TGRS-HRRSD dataset [32] further demonstrate the generalization ability of CTAM.
The main contributions of this paper are as follows: (1) We construct CTAM, a novel plug-and-play attention module, which addresses the limitations of both CNNs and Transformer. CTAM facilitates the integration of local features and global contextual information and significantly enhances YOLOv8n's localization capacity. (2) In contrast to the original Transformer applied in CV, we design a simplified Transformer structure by eliminating operations that are common elsewhere yet unnecessary for remote sensing scenarios, resulting in superior performance. (3) We conducted extensive experiments on the DIOR and TGRS-HRRSD datasets, explicitly demonstrating the positive impact of CTAM. It improves localization capacity and exhibits noteworthy effectiveness, efficiency, and generalization ability.
The remainder of this paper is organized as follows: In Section 2, we provide an overview of related works concerning Transformer and the YOLO series. Section 3 offers a detailed description of CTAM and the improved model. Section 4 presents the datasets used in our study and the experiments and analysis of CTAM. Finally, the conclusion is drawn in Section 5.

The YOLO Series
The primary goal of the YOLO series is to make object detection better, faster, and more scalable. YOLOv1, as the pioneering model, treats object detection as a regression problem. It divides an image into a grid, and each grid cell is responsible for predicting bounding boxes and class probabilities. With improvements including a multi-scale training method, anchor boxes, and a new backbone, YOLOv2 achieves a tradeoff between accuracy and speed. YOLOv3 designs a stronger backbone called Darknet-53, implements multi-scale prediction, and incorporates various data augmentation techniques. YOLOv4 further optimizes the backbone and establishes CSPDarknet-53 based on CSPNet. Additionally, it introduces many methods to enhance detection performance without increasing the inference cost. To accommodate various detection tasks and ease deployment, YOLOv5 designs five models with different computational costs. To mitigate overfitting, it proposes Mosaic, a novel data augmentation method that feeds a mixed image composed of several images into the model during training. YOLOX [33] switches YOLOv3 to an anchor-free manner with a leading label assignment strategy and constructs a decoupled head to suppress the conflict between regression and classification. For diverse platforms and applications, YOLOv6, an anchor-free detector like FCOS, constructs a fundamental CSPStackRep block for the backbone and adopts Task Alignment Learning [34] as the label assignment method. To enhance the capacity for learning and converging, YOLOv7 proposes Extended-ELAN to regulate the gradient paths. As the latest version in the YOLO series, YOLOv8 creates an anchor-free detector based on 'C2F' with more gradient paths. It constructs a decoupled head with Distribution Focal Loss [35]. It supports various CV tasks, including classification, detection, segmentation, pose estimation, and tracking.
Due to its accuracy, flexibility, and efficiency, the YOLO series has extensive applications in remote sensing scenarios. For instance, aiming at identifying high-density targets in UAV aerial images, MS-YOLOv7 [36] combines the original model with the Swin Transformer unit and proposes a new pyramidal pooling module. To address the dense occlusion in low-resolution images from the TinyPerson dataset, TOD-YOLOv7 [37] appends a tiny object detection layer and designs a recursive gated convolution module. UAV-YOLOv8 [38] utilizes Wise-IoU as the box regression loss to enhance localization ability in the UAV scenario. For UAV object detection, [39] modifies YOLOv8 with Bi-PAN-FPN and improves 'C2F' with GhostblockV2. In [40], YOLOv7-TS devises a Feature Map Extraction Module to reduce information loss.
Among the YOLOv8 variants, YOLOv8n stands out for its fast speed, strong performance, and flexible deployment, so we select it as the baseline for real-time object detection in diverse remote sensing scenarios. With the goal of enhancing global contextual information for CNN-based detectors, we propose CTAM to integrate long-range dependency with local features, and YOLOv8n-CTAM achieves superior performance on the DIOR and TGRS-HRRSD datasets among various detectors.

Transformer
Transformer [41] was initially conceived to model long-range dependency and introduce parallel computation for natural language processing (NLP). The remarkable breakthrough achieved by Transformer in NLP has motivated many researchers to explore its potential applications in CV.
Transformer-based backbones have made enormous progress in fundamental CV tasks, such as image classification, object detection, and semantic segmentation. Vision Transformer (ViT) [42], a pioneering vanilla Transformer model, achieves competitive performance in image classification compared with state-of-the-art CNNs. ViT converts 2D images into sequence data by flattening and mapping image patches, which are fed into standard Transformer encoders with position embeddings. It finally employs a multi-layer perceptron (MLP) head for category classification. In contrast, DeiT [43] introduces a token-based knowledge distillation method, aiming at reducing the reliance on large amounts of data while achieving better performance on ImageNet-1K. Swin Transformer [44] adopts a hierarchical structure similar to CNN-based models. To restrict the computational complexity posed by high-resolution images, it computes self-attention in non-overlapping local windows instead of globally. Additionally, it allows cross-window communication via shifted windows. CSWin Transformer [45] designs horizontal and vertical stripes to calculate self-attention in parallel. In comparison to Swin Transformer, it expands the local receptive field while constraining the computational complexity. BiFormer [46] proposes a novel sparse self-attention called Bi-Level Routing Attention, which filters irrelevant key-value pairs and applies self-attention to the remaining pairs to alleviate the heavy computational burden and high memory usage of Transformer.
Treating object detection as a set prediction problem, DETR [47] utilizes a CNN-based backbone to extract features and adopts a Transformer encoder-decoder and a feed-forward network to obtain detection results. It avoids the heavy computational cost produced by Transformer-based backbones while capturing global contextual information. Building on the success of DETR, various DETR variants have been proposed, such as Deformable DETR [48], Conditional DETR [49], and Lite DETR [50].
Compared with natural scenes, the long-range dependency captured by Transformer is more significant for object detection in remote sensing. TPH-YOLOv5 [51] designs a Transformer prediction head instead of the original head and incorporates an additional scale for tiny objects. Moreover, it adopts CBAM to identify dense objects. As a result, TPH-YOLOv5 achieves superior performance on the VisDrone dataset. Lu et al. [52] select CSWin Transformer as the backbone and construct a hybrid patch embedding module and a slicing-based inference method for UAV image object detection. Based on Swin Transformer, Xu et al. [53] designed a local-perception backbone to improve small object detection.
In this paper, we assert that the original Transformer retains substantial potential in remote sensing scenarios. Consequently, we devise a simplified Transformer and incorporate it into CTAM to integrate global contextual information with local features. The feasibility of the simplified Transformer is demonstrated in Section 4.5.

CTAM
Owing to the top-down view, long capturing distance, and complex interference, remote sensing images exhibit characteristics that differ from natural images, including complicated backgrounds, limited features, distinct density, and varied scales. Global contextual and comprehensive information can help detectors recognize targets, which is exceptionally valuable for object detection in remote sensing scenarios. Nevertheless, typical CNN-based detectors, restricted by the nature of the convolution operation, severely lack global interaction. To address this pivotal problem, we construct a novel plug-and-play module named CTAM, aiming at integrating global contextual information with local features. It is composed of two primary components: a simplified Transformer layer, responsible for capturing long-range dependency, and a convolutional bottleneck block, responsible for extracting local features and providing inductive biases for the other component.
As depicted in Figure 2, the simplified Transformer layer contains two reshape operations and a simplified Transformer variant. Although many models in CV, such as ViT, apply a standard Transformer to various tasks, we assume that it may not be the optimal form for object detection in remote sensing. The experiments described in Section 4 indicate that, at least for CTAM, the original Transformer is not a suitable form. For this paper, the simplified Transformer removes LayerNorm [54], Dropout, and GeLU [55] and utilizes single-head attention to compute self-attention. It can be broadly divided into two parts: multi-head attention (MHA), used here with a single head, and a multi-layer perceptron (MLP). To process 2D feature maps, we flatten the map $X \in \mathbb{R}^{H \times W \times C}$ into $X_t \in \mathbb{R}^{(H \cdot W) \times C}$ to serve as the input for the simplified Transformer, where (H, W, C) represents the height, width, and channel count of the feature map. The matrices of Query, Key, and Value are computed as $Q, K, V = \mathrm{Split}(\mathrm{Linear}(X_t))$, where 'Linear' refers to a fully connected layer and 'Split' is an operation that segments a matrix into chunks along the channel dimension. Q, K, and V maintain the same sizes as $X_t$.
Then, self-attention is computed from Q, K, and V. The final output of MHA with a residual connection is denoted as $X_{sha}$. MLP is composed of two fully connected layers without GeLU in the first layer, and $X_{mlp}$ represents the output of MLP.
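The simplified Transformer layer described above can be sketched in pure Python as follows. The exact formulas are not preserved in this text, so the scaling by sqrt(C), the output projection after attention, the residual connection around the MLP, and all parameter names (`w_qkv`, `w_out`, `w_fc1`, `w_fc2`, and their biases) are assumptions borrowed from the standard Transformer rather than the authors' implementation.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def linear(x, weight, bias):
    """Fully connected layer: x @ weight + bias, bias shared by all rows."""
    return [[v + b for v, b in zip(row, bias)] for row in matmul(x, weight)]

def softmax_rows(m):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def simplified_transformer(x_t, p):
    """x_t: (H*W) x C token matrix; p: dict of hypothetical weights and biases."""
    c = len(x_t[0])
    qkv = linear(x_t, p["w_qkv"], p["b_qkv"])          # (H*W) x 3C
    q = [row[:c] for row in qkv]                        # 'Split' along channels
    k = [row[c:2 * c] for row in qkv]
    v = [row[2 * c:] for row in qkv]
    k_t = [list(col) for col in zip(*k)]
    # Single-head attention; the 1/sqrt(C) scaling is an assumption.
    scores = [[s / math.sqrt(c) for s in row] for row in matmul(q, k_t)]
    attended = matmul(softmax_rows(scores), v)
    # Residual connection, no LayerNorm and no Dropout.
    x_sha = add(x_t, linear(attended, p["w_out"], p["b_out"]))
    # Two fully connected layers with no GeLU in between.
    hidden = linear(x_sha, p["w_fc1"], p["b_fc1"])
    x_mlp = linear(hidden, p["w_fc2"], p["b_fc2"])
    return add(x_sha, x_mlp)                            # residual assumed here too

def zero_params(c, hidden):
    """All-zero parameters, useful only for checking shapes and residual paths."""
    zeros = lambda rows, cols: [[0.0] * cols for _ in range(rows)]
    return {
        "w_qkv": zeros(c, 3 * c), "b_qkv": [0.0] * (3 * c),
        "w_out": zeros(c, c), "b_out": [0.0] * c,
        "w_fc1": zeros(c, hidden), "b_fc1": [0.0] * hidden,
        "w_fc2": zeros(hidden, c), "b_fc2": [0.0] * c,
    }

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 tokens, C = 2
output = simplified_transformer(tokens, zero_params(2, 4))
```

With all-zero parameters the attention and MLP branches contribute nothing, so the residual paths return the input tokens unchanged, which is a quick way to check that the shapes line up.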
In summary, the output of the simplified Transformer layer is denoted as $X_{tran}$.
Regarding the convolutional bottleneck block, it contains a stack of 'CBSs' (Conv-BN-SiLU), each composed of one convolutional layer, Batch Normalization [56], and SiLU [57], as shown in Figure 2. The first and final 'CBSs' are used for channel compression and expansion. Those in between are employed for feature extraction and fusion. To balance the performance and computational cost of CTAM, we introduce the hyperparameter n to control the number of middle 'CBSs'. The result of the convolutional bottleneck block is denoted as $X_{conv}$.
Inspired by NAB, which uses traditional attention mechanisms to regulate the feature map after convolutional layers instead of the original map, we employ element-wise multiplication to fuse the features generated by the convolutional bottleneck block and the simplified Transformer layer. In this manner, we can acquire the features of each grid, guided by both global contextual and local information. CTAM is a complementary integration that introduces local concentration for Transformer and long-range dependency for the CNN. With the improvement of CTAM, YOLOv8n exhibits stronger localization capacity and better performance, as detailed in Section 4. Ultimately, the whole CTAM can be formally expressed as $X_{ctam} = X_{conv} \otimes X_{tran}$, where $X_{ctam}$ and '⊗' refer to the output of CTAM and element-wise multiplication, respectively.
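The three quantities named in this subsection can be written compactly as below. This is a hedged reconstruction from the prose, not the authors' typeset equations: the residual sum inside 'Reshape' and the notation $\mathrm{CBS}^{n}$ for n stacked middle blocks are assumptions, while the element-wise product follows directly from the description of the fusion.

```latex
\begin{aligned}
X_{tran} &= \mathrm{Reshape}\left(X_{sha} + X_{mlp}\right), \\
X_{conv} &= \mathrm{CBS}\left(\mathrm{CBS}^{n}\left(\mathrm{CBS}(X)\right)\right), \\
X_{ctam} &= X_{conv} \otimes X_{tran}.
\end{aligned}
```

Here 'Reshape' maps the $(H \cdot W) \times C$ token matrix back to an $H \times W \times C$ feature map so that both branches have the same shape before the element-wise product.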

YOLOv8n-CTAM
YOLOv8, as one of the most cutting-edge models in the YOLO series, further boosts performance and flexibility across various tasks and applications. Like other common detectors, YOLOv8 can be divided into three components: backbone, path aggregation structure [58], and detection head. Figure 3 depicts the architecture of YOLOv8n-CTAM.
A preprocessed remote sensing image with the resolution of (800, 800, 3) serves as the input image and is fed into the backbone. Note that YOLOv8 proposes 'C2F' as the basic unit instead of 'C3' in YOLOv5, featuring more gradient paths. Through a sequence of 'Stage' blocks, we obtain three feature maps with 8×, 16×, and 32× downsampling rates, respectively. Subsequently, these maps are sent into the path aggregation structure, composed of top-down and bottom-up paths. This structure aims to enhance localization information for the coarse maps and contextual information for the fine-grained maps. Finally, the detection head utilizes the augmented feature maps to predict the category and bounding box for each grid. To mitigate the conflict between classification and regression, YOLOv8 designs a decoupled head and adopts the general distribution to model bounding box representation.
YOLOv8 provides several variants with different widths and depths for various applications. YOLOv8n achieves the fastest detection speed and the smallest memory usage by decreasing its width and depth. Therefore, we select YOLOv8n as the baseline to satisfy the requirement of real-time detection. CTAM is inserted between the path aggregation structure and the detection head to integrate global contextual information with local features for object detection in remote sensing scenarios. The visualization results in Section 4 demonstrate the effectiveness of CTAM.
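As a rough illustration of what placing CTAM on the three output scales implies, the sketch below computes the side length of each feature map for an 800 × 800 input and the number of tokens the flattened map would contain. The function name `feature_map_sizes` is hypothetical, and the quadratic count of attention entries assumes full self-attention over all tokens of a map, as the CTAM description suggests.

```python
def feature_map_sizes(image_size=800, strides=(8, 16, 32)):
    """Return the side length of the square feature map at each stride."""
    return {stride: image_size // stride for stride in strides}

sizes = feature_map_sizes()
for stride, side in sizes.items():
    tokens = side * side              # H*W rows of the flattened map X_t
    attention_entries = tokens ** 2   # size of the (H*W) x (H*W) attention matrix
    print(f"stride {stride}: {side}x{side} map, {tokens} tokens, "
          f"{attention_entries} attention entries")
```

The stride-8 map dominates the cost: it has 16 times as many tokens as the stride-32 map and therefore 256 times as many attention entries.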

Experimental Environment and Settings
All experiments were carried out on a Linux operating system (Ubuntu 20.04) with an Intel(R) Core(TM) i9-10940X CPU and two NVIDIA RTX 3090 GPUs for distributed training. The deep learning framework was PyTorch 1.13 with Python 3.9.16, CUDA 11.7, and Torchvision 0.14.1.
Hyperparameter settings play a significant role in the training process and greatly impact the final detection accuracy. To ensure a fair comparison, each model in this paper adopted the same hyperparameters outlined in Table 1. 'Image size' is the resolution of input images, which constrains the sizes of targets and the computational cost. 'Epoch' represents the number of complete passes over the training data. An appropriate number of epochs lets a model achieve excellent performance while saving computational resources. 'Learning rate', 'Momentum', and 'Weight decay' regulate the convergence rate and training stability. 'Mosaic' is a valuable measure for alleviating overfitting. In addition, 'n (#CBS)', utilized to control local feature extraction, is introduced into the convolutional bottleneck block of CTAM. According to the experiments in Section 4.5, YOLOv8n is improved with CTAM (n = 2) to integrate long-range dependency with local features.

Evaluation Metrics
To evaluate the effectiveness and efficiency of CTAM, we adopt common metrics in object detection, including precision, recall, average precision (AP), mean average precision (mAP), model parameters, FLOPs, and detection time. Precision denotes the proportion of true positive samples among all predicted positive samples, and recall measures the proportion of true positive samples among all ground-truth positive samples. The AP value for each category is obtained by calculating the area under the precision-recall curve, and mAP denotes the mean of AP values across all categories. The AP and mAP can be expressed as $AP = \int_{0}^{1} P(R)\,dR$ and $mAP = \frac{1}{nc}\sum_{i=1}^{nc} AP_i$, where 'P(R)' denotes the precision-recall curve, 'nc' represents the number of categories, and $AP_i$ is the AP of the i-th category.
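The definitions above can be made concrete with a small sketch. It assumes the detections of one category are already sorted by descending confidence and flagged as true or false positives, and it uses all-point interpolation to integrate the precision-recall curve; the interpolation scheme and the function names are assumptions, since the text does not state how the area is computed.

```python
def average_precision(is_true_positive, num_ground_truth):
    """AP for one category from TP/FP flags sorted by descending confidence."""
    tp = fp = 0
    recalls, precisions = [], []
    for flag in is_true_positive:
        if flag:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing from right to left (all-point interpolation).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Integrate P(R) as a sum of rectangles over the recall increments.
    ap, previous_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap

def mean_average_precision(ap_per_category):
    """mAP is the plain mean of the per-category AP values."""
    return sum(ap_per_category) / len(ap_per_category)

# Toy example: 4 detections, 4 ground-truth objects, the third is a false positive.
ap_a = average_precision([True, True, False, True], num_ground_truth=4)
# A second toy category in which both detections are correct.
ap_b = average_precision([True, True], num_ground_truth=2)
toy_map = mean_average_precision([ap_a, ap_b])
```

In the first toy category one ground-truth object is never detected, so recall stops at 0.75 and the AP stays below 1 even though most detections are correct.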
To evaluate the performance of the detector more comprehensively and accurately, we utilized different IoU thresholds to acquire corresponding mAP values. A higher threshold signifies a more rigorous criterion for the overlaps between bounding boxes and ground truth boxes. Specifically, mAP@50 represents an mAP value computed with an IoU threshold of 0.5. mAP@50-95 is the average of the mAP values under the IoU thresholds between 0.5 and 0.95, with a step of 0.05. To explicitly verify the localization capacity of CTAM, mAP@50, mAP@70, mAP@90, and mAP@50-95 were adopted as the evaluation criteria in the following experiments.
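A short sketch of how the IoU threshold acts as a localization criterion is given below. The box format (x1, y1, x2, y2) and the helper names are assumptions made for illustration; the ten thresholds follow the 0.5 to 0.95 range with a step of 0.05 stated above.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# The ten thresholds averaged by mAP@50-95: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def map_50_95(map_at_threshold):
    """Average per-threshold mAP values; expects a dict keyed by threshold."""
    return sum(map_at_threshold[t] for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)

# A 10x10 prediction shifted by one pixel against a 10x10 ground-truth box.
overlap = iou((0, 0, 10, 10), (1, 0, 11, 10))
accepted_at = [t for t in IOU_THRESHOLDS if overlap >= t]
```

The one-pixel shift gives an IoU of 90/110, so the prediction counts as correct at the 0.5 and 0.7 thresholds but not at 0.9, which is why mAP@90 is sensitive to small localization errors.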

Datasets
DIOR is a large-scale, diverse, and publicly available remote sensing dataset containing 23,463 images and 192,472 instances. It is divided into three subsets: a training set (5862 images), a validation set (5863 images), and a test set (11,738 images). It has 20 categories: airplane, airport, baseball field, basketball court, bridge, chimney, dam, expressway service area, expressway toll station, harbor, golf course, ground track field, overpass, ship, stadium, storage tank, tennis court, train station, vehicle, and windmill. Each image in DIOR is standardized to the resolution of (800, 800, 3). The sizes of bounding boxes range from 2 to 764 pixels, posing a considerable challenge for object detection on the DIOR dataset. Each object in DIOR is annotated with a horizontal bounding box. In comparison to VEDAI, HRSC2016, and COWC, DIOR has more images and instances, which is beneficial for the robustness and generalization of detectors. For this paper, we conducted extensive experiments on the DIOR dataset to demonstrate the efficiency and effectiveness of CTAM.
To validate the generalization ability of CTAM, experiments were conducted on the TGRS-HRRSD dataset, another large-scale remote sensing dataset. This dataset contains 21,761 images categorized into 13 classes, and the mean scale per class ranges from 42 to 277 pixels. Furthermore, it carefully balances the number of instances in each category. The comprehensive results are detailed in the next section.

Experiments on the DIOR Dataset
To verify the efficiency and effectiveness of CTAM, we initially trained YOLOv8n on the training and validation sets of the DIOR dataset for 100 epochs and evaluated its performance on the test set. For a fair comparison, we improved YOLOv8n with CTAM (n = 2) using the same settings and strategies. The overall experimental results are documented in Table 2. YOLOv8n-CTAM achieves 84.6 precision, 68.5 recall, and 54.2 mAP@50-95. It outperforms the baseline by a large margin, indicating the effectiveness of CTAM. 'Time' represents the total time, including preprocessing, inference, and postprocessing time on an NVIDIA RTX 3090 with a batch size of 16. Because global contextual information is computed on the feature maps with 8×, 16×, and 32× strides, YOLOv8n-CTAM shows a slight growth in 'FLOPs', 'Param', and 'Time'. It remains an extremely lightweight detector meeting the real-time requirement. Furthermore, with an increasing IoU threshold, YOLOv8n-CTAM achieves progressively larger gains, surpassing the baseline by 1.8 mAP@50, 2.7 mAP@70, and 4.4 mAP@90. These results provide substantial evidence that CTAM can enhance localization capacity and detection performance by introducing global contextual information. Table 3 presents the performance of the baseline and YOLOv8n-CTAM across each category. In almost all classes, YOLOv8n-CTAM displays higher precision, recall, and mAP@50-95 compared with the baseline, especially in golf course detection, where it exceeds the baseline by 5.0 precision, 4.4 recall, and 11.4 mAP@50-95. These results confirm that CTAM is beneficial for multi-scale target detection in remote sensing scenarios. Furthermore, we analyze the training processes of both detectors, as shown in Figure 4. YOLOv8n-CTAM exhibits a faster convergence rate in both regression and classification loss. Notably, the loss curves of both models show a rapid decline in the last 10 epochs, indicating that disabling Mosaic in the last 10 epochs can lead to an enhancement in the final performance.
In comparison with the state-of-the-art detectors, YOLOv8n-CTAM achieves the most cutting-edge performance, as listed in Table 4. Specifically, the improved detector outperforms the well-known detectors designed for natural scenes by a large margin. In the field of remote sensing, it also surpasses SCRDet++ with ResNet-101 and CANet by 1.4 and 2.2 mAP@50, respectively. Above all, YOLOv8n-CTAM is a lightweight detector running at 435 frames per second (FPS) on a single NVIDIA RTX 3090. YOLOv8n-CTAM demonstrates considerable potential for various applications and deployments in diverse remote sensing scenarios.
Some detection results obtained by YOLOv8n-CTAM are displayed in Figure 5. YOLOv8n-CTAM successfully overcomes the challenges posed by remote sensing images, including complicated backgrounds, limited features, distinct density, and varied scales. It achieves strong performance across various scenes and multi-scale categories. Although it may miss some targets or yield incorrect results in extremely hard situations, the acceptable detection accuracy with the remarkably low computational burden renders YOLOv8n-CTAM flexible and robust for deployment on real-time hardware platforms. In conclusion, the integration of global and local information within CTAM can compensate for inherent drawbacks in CNN and Transformer, leading to excellent localization capacity and detection accuracy in remote sensing images.
Extensive experiments for the optimal Transformer encoder structure were conducted on the DIOR dataset, as documented in Table 5. At first, 'Initial' has the worst mAP@50-95 among all Transformer variants. The absence of LayerNorm in 'A' results in an improvement of 0.5 mAP@50-95 compared with 'Initial', indicating that LayerNorm, widely applied in NLP, may hinder detection performance in remote sensing scenarios. Subsequently, the variants with different numbers of heads, specifically two and four, exhibit identical performance to the variant with '#Heads' = 1. Hence, we removed this hyperparameter and viewed it as a constant. Similarly, 'Dropout' has a negative impact on the detection accuracy, so it was set to 0. Finally, we eliminated the activation function 'GeLU' from MLP and constructed the simplified Transformer layer for CTAM. Moreover, the investigation of the influence of biases in MHA and MLP illustrates that the simplified Transformer encoder with both biases achieves the best performance on the DIOR dataset.
In the convolutional bottleneck block of CTAM, the number of 'CBSs' serves as a hyperparameter introduced to regulate the extraction of local features and restrict computational complexity, as depicted in Figure 2. We varied the value of 'n' over {0, 1, 2, 3}, and the corresponding experimental results are listed in Table 6. YOLOv8n with CTAM (n = 2) achieves the best precision, recall, mAP@50, and mAP@50-95, illustrating that it adequately extracts and fuses local features while adding a negligible computational burden. Consequently, CTAM (n = 2) is adopted as the default module for YOLOv8n due to its optimal performance.
Traditional attention modules such as SE [61], CBAM [62], and ECA [63] have widespread applications in CV. To enhance features and suppress noise, these modules utilize the information extracted from the feature map to recalibrate the same map. However, like NAB, we claim this way is inflexible and harmful for feature extraction. In contrast, CTAM utilizes the global information generated by the simplified Transformer layer to integrate with the local features of the convolutional bottleneck block. For a fair comparison, we replaced CTAM with SE, CBAM, and ECA in the same positions, and the corresponding experimental results are displayed in Table 7. Compared with the original model, SE, CBAM, and ECA bring almost no improvement in performance, whereas CTAM brings an obvious improvement.

Visualization
To further understand the influence of employing CTAM between the path aggregation structure and the detection head in YOLOv8n, we visualize the feature maps before and after CTAM, as depicted in Figure 6. The raw image contains two small-size vehicles and two medium-size airplanes. The detection results generated by YOLOv8n-CTAM exhibit highly accurate bounding boxes and reliable probabilities, demonstrating the effectiveness of CTAM. YOLOv8n is structured with three branches for multi-scale prediction, where the feature maps with 8×, 16×, and 32× downsampling rates are responsible for small-size, medium-size, and large-size targets, respectively. In the small-scale branch, the feature map after CTAM exhibits higher and more centralized attention towards the two vehicles than the map before CTAM. Meanwhile, in the medium-scale branch, the feature map before CTAM focuses on multiple parts surrounding the two airplanes, while the attention of the map after CTAM converges on the centers of the airplanes. Since the responses for these targets mainly concentrate on the first two feature maps, the visualization and discussion of the large-scale branch are omitted.
This visualization provides valuable insights into how CTAM influences feature extraction and fusion in YOLOv8n. The detailed comparison illustrates that CTAM enables YOLOv8n to focus on the central regions of targets and generate extremely accurate bounding boxes by integrating local features with global contextual information. This visualization is consistent with the conclusion, drawn from the mAP values at different IoU thresholds, that CTAM can significantly improve the localization capacity.
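The link between localization quality and stricter IoU thresholds can be sketched as follows. mAP@50-95 averages over the ten IoU thresholds 0.50, 0.55, ..., 0.95, so a box counts as a correct detection at more thresholds the better it overlaps the ground truth. The boxes and helper names below are hypothetical and serve only to illustrate the arithmetic.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# The ten thresholds averaged by mAP@50-95.
THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

def thresholds_passed(pred, truth):
    """List the IoU thresholds at which pred still matches truth."""
    iou = box_iou(pred, truth)
    return [t for t in THRESHOLDS if iou >= t]

truth = (0.0, 0.0, 100.0, 100.0)
loose = (10.0, 10.0, 110.0, 110.0)  # hypothetical prediction shifted by 10 px
tight = (2.0, 2.0, 102.0, 102.0)    # hypothetical prediction shifted by 2 px

print(len(thresholds_passed(loose, truth)), len(thresholds_passed(tight, truth)))
```

Both predictions count as correct at IoU 0.50, but only the tighter box survives the thresholds from 0.70 upward, which is why gains at mAP@70 and mAP@90 point to better localization rather than better classification.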

Experiments on the TGRS-HRRSD Dataset
To validate the generalization ability of CTAM, we also conducted experiments on the TGRS-HRRSD dataset, a multi-scale remote sensing dataset containing 55,740 instances and 13 categories. For a fair comparison, we adopted the same hyperparameters and strategies used for the DIOR dataset and trained the detectors on the train-validate set. As listed in Table 8, YOLOv8n-CTAM outperforms the baseline by 0.9 mAP@50 and 2.1 mAP@50-95. Compared with typical detectors, YOLOv8n-CTAM achieves superior performance on the TGRS-HRRSD dataset while maintaining a rapid detection time. Compared with lightweight models, YOLOv8n-CTAM exceeds YOLOv4-Tiny and YOLOv4-Tiny-NAB by a large margin. Hence, these experiments indicate that CTAM is not limited to a specific dataset and exhibits excellent generalization ability in various remote sensing scenarios.

Conclusions
Remote sensing images have complicated backgrounds, limited features, distinct densities, and varied scales, rendering global contextual information extremely significant and valuable for object detection. However, CNN-based detectors, limited by local receptive fields, have difficulty capturing long-range dependency, resulting in inferior performance. To eliminate this inherent deficiency, we make the following contributions in this paper:
Despite the remarkable performance and efficiency displayed by CTAM, it also brings unacceptable computational complexity and memory usage due to the defect of self-attention. In the future, we will optimize the computation of self-attention in CTAM and further explore the feasibility and flexibility of designing a backbone based on CTAM for object detection in remote sensing.

Figure 1. Various targets in remote sensing images.

(2) In contrast to the original Transformer applied in CV, we design a simplified Transformer structure by eliminating universal yet unnecessary operations for remote sensing scenarios, resulting in superior performance.
(3) We conducted extensive experiments on the DIOR and TGRS-HRRSD datasets, explicitly demonstrating the positive impact of CTAM. It improves localization capacity and exhibits noteworthy effectiveness, efficiency, and generalization ability.

Figure 2. The structure of CTAM. Width, height, and channel are denoted as w, h, and c, respectively. 'w/o' is an abbreviation for 'with or without'. 'Norm', 'Drop', and 'Activation' represent LayerNorm, Dropout, and GeLU, respectively. 'Bias' is the bias in the linear layer.

Figure 5. Detection examples of YOLOv8n-CTAM on the DIOR dataset.

Ablation Study
The Simplified Transformer in CTAM
Transformer is a powerful structure that can acquire long-range dependency by calculating scaled dot-product attention among all positions. To address the critical limitation that CNNs lack global contextual information, we incorporate Transformer into CTAM to integrate local features with contextual and comprehensive information. Although the standard Transformer derived from NLP has been widely applied in CV, we hypothesize that sequences and images differ in essential ways, so the standard Transformer can be optimized to achieve better performance for object detection in remote sensing scenarios. In this paper, we analyze the Transformer structure in detail and construct a simplified Transformer layer in CTAM suited to the task of remote sensing object detection.
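For readers unfamiliar with the operation, the following is a minimal sketch of single-head scaled dot-product attention in plain Python, softmax(QK^T / sqrt(d)) V. It omits the learned query, key, and value projections, multiple heads, and every other part of the Transformer layer, and the toy inputs are assumptions for illustration only.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """queries, keys: [n][d]; values: [n][d_v].

    Returns ([n][d_v] outputs, [n][n] attention weights).
    """
    d = len(keys[0])
    weights = []
    for q in queries:
        # Score this position against every position, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    # Each output is a weighted sum of all value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights

# Three positions with 2-dimensional toy features (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Every row of `weights` spans all positions, so each output mixes information from the entire input. This is the long-range dependency that convolution with a local receptive field does not provide directly.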

Figure 6. The visualization of YOLOv8n-CTAM. (a1,a2) display the raw image and detection results. (b1,b2) represent the feature maps before and after CTAM in the small-scale branch. Similarly, (c1,c2) denote the feature maps in the medium-scale branch. The visualization in the large-scale branch is omitted because both feature maps have scarce responses for the targets.

(1) We construct a novel plug-and-play attention module called CTAM, composed of a convolutional bottleneck block and a simplified Transformer layer. It can integrate local features with global contextual information through the interaction between the two components.
(2) We design a simplified Transformer in CTAM that is unlike the standard Transformer encoder widely applied in CV, and we demonstrate its validity in various remote sensing scenarios.
(3) For real-time object detection in remote sensing, we adopt YOLOv8n as the baseline and introduce CTAM to build YOLOv8n-CTAM. Extensive experiments demonstrate that YOLOv8n-CTAM achieves cutting-edge performance and generalization ability while maintaining an extremely rapid inference speed.
(4) The visualization of CTAM explicitly explains why CTAM can enhance localization capacity and improve detection accuracy by incorporating global information into local features.

Table 2. The experimental results of YOLOv8n and YOLOv8n-CTAM on the DIOR dataset for all categories.

Table 3. Comparison of YOLOv8n and YOLOv8n-CTAM on the DIOR dataset across each category.

Table 4. The experimental results of different detectors on the DIOR dataset. '--' denotes difficult-to-obtain data.

Table 5. Ablation study for the Transformer structures. 'Initial' and 'Simplified' denote the initial and final Transformer structures, respectively. 'A'-'G' are the Transformer variants with different hyperparameters. The number of heads in self-attention is represented as '# Heads'.

Table 6. Ablation study for the number of 'CBSs' within the convolutional bottleneck block.

Table 7. Ablation study for traditional attention modules.
CTAM only with the convolutional bottleneck block and the simplified Transformer layer, respectively. Their results in Table 7 demonstrate that the integration of local features and global attention is indispensable and significant.

Table 8. The experimental results of different detectors on the TGRS-HRRSD dataset.