YOLO-DCTI: Small Object Detection in Remote Sensing Base on Contextual Transformer Enhancement

: Object detection for remote sensing is a fundamental task in image processing of remote sensing; as one of the core components, small or tiny object detection plays an important role. Despite the considerable advancements achieved in small object detection with the integration of CNN and transformer networks, there remains untapped potential for enhancing the extraction and utilization of information associated with small objects. Particularly within transformer structures, this potential arises from the disregard of the complex and the intertwined interplay between spatial context information and channel information during the global modeling of pixel-level information within small objects. As a result, valuable information is prone to being obfuscated and annihilated. To mitigate this limitation, we propose an innovative framework, YOLO-DCTI, that capitalizes on the Contextual Transformer (CoT) framework for the detection of small or tiny objects. Speciﬁcally, within CoT, we seamlessly incorporate global residuals and local fusion mechanisms throughout the entire input-to-output pipeline. This integration facilitates a profound investigation into the network’s intrinsic representations at deeper levels and fosters the fusion of spatial contextual attributes with channel characteristics. Moreover, we propose an improved decoupled contextual transformer detection head structure, denoted as DCTI, to effectively resolve the feature conﬂicts that ensue from the concurrent classiﬁcation and regression tasks. The experimental results on the Dota, VISDrone, and NWPU VHR-10 datasets show that, on the powerful real-time detection network YOLOv7, the speed and accuracy of tiny targets are better balanced.


Introduction
Remote sensing object detection is a prominent and consequential application within the realm of remote sensing image processing [1].It aims to accurately identify and locate specific target instances within an image.Within this domain, remote sensing small object detection holds particular importance as it focuses on detecting objects in remote sensing images that occupy a very small area or consist of only a few pixels.Detecting small objects is considerably more challenging than detecting larger objects, resulting in lower accuracy rates [2].In recent years, small object detection based on convolutional neural networks (CNNs) has rapidly developed with the rapid growth of deep learning [3].Small object detection often faces challenges such as limited information on small objects, scarcity of positive samples, and imbalanced classification.To tackle this challenge, researchers and experts have put forth diverse deep neural network methodologies, encompassing CNNs, GANs, RNNs, and transformers, to tackle the detection of small objects, encompassing remote sensing small objects.To improve the detection of small objects, Liu W et al., proposed the YOLOV5-Tassel network, which introduced the SimAM module in front of each detection head to extract the features of interest [4].Li J. et al., suggested using GAN models to generate high-resolution images of small objects, narrowing the gap between small and large objects, and improving the detection capability of tiny objects [5].Xu W et al. integrated contextual information into the Swin Transformer and designed an advanced framework called the foreground-enhanced attention Swin Transformer (FEA-Swin) [6].Although the accuracy of detecting small objects has improved, the speed has been somewhat compromised.Zhu X. et al., proposed the YOLOv5-THP model, which is based on YOLOv5 and adds a transformer model with an attention mechanism to the detection head [7].While this enhances the network's performance in detecting small objects, it also brings a significant computing burden.
In the field of remote sensing, detecting small objects remains challenging due to large image scales, complex and varied backgrounds, and unique shooting perspectives.Cheng et al. propose a model training regularization method that enhances the performance of detection of small or tiny objects in remote sensing by exploiting and incorporating global contextual cues and image-level contextual information [8].Liu J. et al., added a dilated convolution module to the FPN and designed a relationship connection attention module to automatically select and refine features, combining global and local attention to achieve the detection task of small objects in remote sensing [9].Cheng et al., proposed an end-to-end cross-scale feature fusion (CSFF) framework based on the feature pyramid network (FPN), which inserted squeeze-and-excitation (SE) modules at the top layer to achieve better detection of tiny objects in optical remote sensing images [10].Dong et al., proposed a CNN method based on balanced multi-scale fusion (BMF-CNN), which fused high-and low-level semantic information to improve the detection performance of tiny objects in remote sensing [11].Liang X. et al., proposed a single-shot detector (FS-SSD) based on feature fusion and scaling to better adapt to the detection of tiny or small objects in remote sensing.FS-SSD added a scaling branch in the deconvolution module and used two feature pyramids generated by the deconvolution module and feature fusion module together for prediction, improving the accuracy of object detection [12].Xu et al., designed a transformer-guided multi-interaction network (TransMIN) using local-global feature interaction (LGFI) and cross-view feature interaction (CVFI) modules to enhance the performance of small object detection in remote sensing.However, this improvement unavoidably introduces a computational burden [13].Li et al., proposed a transformer that aggregates multi-scale global spatial positions to enhance small object detection performance but it also comes with a computational burden [14].To reduce the computational cost of the transformer, Xu et al., improved the lightweight Swin transformer and designed a Local Perception Swin transformer (LPSW) backbone network to enhance small-scale detection accuracy [15].Gong et al., designed an SPH-YOLOv5 model based on Swin Transformer Prediction Heads (SPHs) to balance the accuracy and speed of small object detection in remote sensing [16].Although many experts and scholars are studying the balance between detection accuracy and inference speed, achieving an elegant balance remains a challenging problem [17][18][19][20][21].
Considerable advancements have been achieved in the utilization of transformers [6,7,[13][14][15][16]] for small object detection within the remote sensing domain.The exceptional performance of the Contextual Transformer (CoT) [22] in harnessing spatial contextual information, thereby offering a fresh outlook on transformer design, merits significant attention.In the domain of remote sensing, small target pixels are characterized by a scarcity of spatial information but a profusion of channel-based data.Consequently, the amalgamation and modeling of spatial and channel information assume paramount importance.Furthermore, transformers impose notable demands on computational resources and network capacity, presenting a challenge in striking an optimal balance between detection accuracy and processing speed for small object detection in the remote sensing discipline.Meanwhile, Bar M et al. demonstrated that the background is critical for human recognition of objects [18].Empirical research in computer vision has also shown that both traditional methods [19] and deep learning-based methods [12] can enhance algorithm performance by properly modeling spatial context.Moreover, He K. et al., have proven that residual structures are advantageous for improving network performance [17,20].Finally, we note that the classification and regression tasks of object detection focus on the salient features and boundary features of the target, respectively [23].Therefore, a decoupled detection head incorporating residual structure as well as channel and spatial context knowledge should have a positive impact on the detection of small or tiny objects.
We propose a new detection framework, YOLO-DCTI, for detecting small or tiny objects in remote sensing images.By introducing a global residual structure and a local fusion structure into the contextual transformer (CoT), and designing an improved decoupled contextual transformer detection head structure (DCTI) based on CoT, we have achieved improved detection performance for small or tiny objects on the powerful single-stage benchmark network YOLOv7.The main contributions of this paper can be summarized as follows: 1.
We have developed the CoT-I module, an extension of the original CoT framework, which integrates global residual structures and local fusion modules.This integration facilitates the extraction of spatial context background features and the fusion of channel features, thereby enabling the network to learn deeper-level characteristics.
In comparison to the conventional CoT approach, the inclusion of global residual structures empowers the network to capture more profound features, while the incorporation of local fusion structures seamlessly combines background context features with channel features.

2.
We introduce an efficient decoupled detection head structure DCTI, leveraging the CoT-I framework, to mitigate the limited exploration and utilization of salient region features and boundary-adjoining features arising from the interdependence of classification and regression tasks within most object detection heads.This decoupled design allows the classification task to emphasize salient region features, while the regression task focuses on boundary-surrounding features.Concurrently, CoT-I effectively exploits and harnesses the feature relationships between spatial context and channels, facilitating the detection of small objects in remote sensing and yielding substantial improvements in detection accuracy.

3.
Despite the escalation in model parameters and the consequential inference latency resulting from the adoption of our proposed DCTI structure, the integration of global residual connections and local fusion strategies yields a notable enhancement in inference accuracy without incurring any detrimental impact on the inference speed.Comparative evaluation against the baseline YOLO v7 model showcases a substantial improvement in the inference accuracy specifically for diminutive targets, mAP@0.5:0.95surging from 61.8 to 65.2.Additionally, our model achieves a reduction of 0.3ms in the inference speed per image with dimensions of 640 × 640.

Related Work 2.1. Transformer Framework for Object Detection
The transformer structure, based on self-attention, first appeared in NLP tasks.Compared to modern convolutional neural networks (CNN) [24], the Vision Transformer has made impressive progress in the field of computer vision.After Dosovitskiy A et al. successfully introduced transformers into computer vision [25], many scholars turned to transformers [26][27][28].In object detection, DETR [29] and Pix2seq [30] are the earliest transformer detectors that define two different object detection paradigms.However, transformers have many parameters, require high computing power and hardware, and are not easily applicable.To apply transformers on mobile devices, Mehta S et al. proposed a lightweight MobileVIT series [31][32][33], which achieved a good balance between accuracy and real-time performance, and has been widely used in risk detection [34], medicine [35], and other fields.A major advantage of transformers is that they can use the attention mechanism to model the global dependence of input data, obtain longer-term global information, and ignore the connection between local contexts.To address this problem, Li Y. et al., proposed a lightweight CoT [22] self-attention module to capture contextual background information on 2D feature maps.It can extract information between local contexts while capturing global dependencies for more adequate information exchange.In this paper, we use CoT to exploit the global characteristics of spatial context and channels.Based on the original structure, we added the global residual and local fusion structures to further tap and utilize the characteristics of space and channels.

YOLO Framework for Object Detection
In 2015, YOLO [36] introduced a one-stage object detection method that combined candidate frame extraction, CNN feature learning, and NMS optimization to simplify the network structure.The detection speed was nearly 10 times faster than R-CNN, making realtime object detection possible with the computing power available at that time.However, it was not suitable for detecting small objects.YOLOv2 [37] added optimization strategies such as batch normalization and a dimensional clustering anchor box based on v1 to improve the accuracy of object regression and positioning.YOLOv3 [38] added the residual structure and FPN structure based on v2 to further improve the detection performance of small objects.The network framework structure after YOLOv3 can be roughly divided into three parts, backbone, neck, and head.Subsequent versions have optimized internal details to varying degrees.For example, YOLOv4 [39], based on v3, further optimized the backbone network and activation function, and used Mosaic data enhancement to improve the robustness and reliability of the network.YOLOv5 [40] added the focus structure based on v4 and accelerated the training speed by slicing.YOLOv6 [41] introduced RepVGG in the backbone, proposed a more efficient EfficientRep block, and simplified the design of the decoupling detection head to improve the detection efficiency.YOLOv7 [42] adopted the E-ELAN structure in the neck part, which reduces the inference speed, and used the auxiliary head training method.At present, YOLOv7 is one of the more advanced object detection networks due to its real-time characteristics.It is widely used in fields with high time requirements such as industrial equipment inspection [43], sea rescue [44], and aquaculture [45].Therefore, we use YOLOv7, one of the powerful benchmarks, as the benchmark model.

Detection Head Framework for Object Detection
In the object detection task, there are two tasks: classification and regression, which respectively output the classification and bounding box position of the object.Song G. et al., pointed out that the focus of the classification and regression tasks is different [23].Specifically, classification pays more attention to the texture content of the object, while regression pays more attention to the edge information of the object.Wu Y et al. suggested that it may be better to divide classification and regression tasks into FC-head and Conv-head [46].In the single-stage model, YOLOX [47] adopts the decoupling head structure that separates the classification and regression branches and adds two additional 3 × 3 convolutional layers.This improves detection accuracy at the cost of inference speed.Building upon this approach, YOLOv6 takes into account the balance between the representation ability of related operators and the hardware computing overhead and adopts the Hybrid Channels strategy to redesign a more efficient decoupling head structure that reduces the cost while maintaining accuracy.They also mitigate the additional latency overhead of 3 × 3 convolutions in the decoupled detection head.Feng C. et al., use feature extractors to learn task interaction features from multiple convolutional layers to enhance the interaction between classification and localization [48].They also pointed out that the interaction characteristics of different tasks may vary due to the classification and localization goals.To resolve the feature conflict introduced between the two tasks, they designed a layer attention mechanism that focuses on different types of features such as different layers and receptive fields.This mechanism helps to resolve a certain degree of feature conflict between the two tasks.

Proposed Method
We present an improved decoupled contextual transformer (YOLO-DCTI) for the detection of tiny or small objects in the domain of remote sensing.Our proposed framework is built upon the foundation of YOLOv7.The comprehensive architecture of our framework is depicted in Figure 1.Our contributions begin with the feature X ∈ R H×W×C obtained after the backbone, FPN, and PAN stages.In this section, we first give a brief overview of the widely adopted Contextual Transformer (CoT) framework in object detection.Subsequently, we introduce an enhanced variant named CoT-I, which incorporates a global residual structure and a local fusion structure into the CoT module.The global residual mechanism integrates input information with self-attention features, while the local fusion mechanism combines spatial contextual information with channel-based information.Ultimately, we integrate the CoT-I module into a decoupled detection head named DCTI, enabling the establishment of global interdependencies between the classification and regression tasks through the utilization of self-attention mechanisms.This integration facilitates the comprehensive exploration and exploitation of a wider spectrum of channel features and spatial contextual features.

Contextual Transformer
In this section, we present the formulation of the Contextual Transformer (CoT) framework, illustrated in Figure 2 (left).The input X, obtained from the Feature Pyramid Network (FPN), undergoes three transformation matrices: W q , W k , W v .These matrices yield K = XW k , Q = XW q , V = XW v .Specifically, W q is an identity matrix, W k represents spatial convolution using a k × k kernel, and W v signifies spatial convolution with a 1 × 1 kernel.The output Y is mathematically expressed as: In the above equation, the function f c () denotes concatenation along the channel dimension C, while W δ and W θ correspond to 1 × 1 convolutions in the spatial domain.The symbol × represents matrix multiplication.For brevity, we omit the transformation of the channel dimension C and the batch normalization (BN) operation during the 1 × 1 convolutions.
It is evident that the learned key matrix K captures significant information from neighboring pixels within the spatial domain, incorporating essential static spatial context information.Subsequently, Q and K are concatenated along the channel dimension, followed by the application of two 1 × 1 convolutions: W δ with an activation function and W θ without an activation function.Matrix multiplication is then performed with V, resulting in a matrix T enriched with dynamic contextual information, which can be used as follows: This resultant matrix T is subsequently fused with the static contextual information K to derive the final output Y. Notably, Y incorporates both dynamic contextual information and static contextual information.CoT demonstrates exceptional performance in leveraging contextual information; however, it partially overlooks the joint contribution of channel information and contextual information.

Improved Contextual Transformer (CoT-I)
First of all, the ResNet [20] and Densenet [17] models prove the effectiveness of the residual structure in the model, which helps optimize the deeper layer of the architecture.We add a residual structure from the input to the output, called the global residual structure.
Secondly, we note that the intermediate matrix T captures relatively persistent global dependencies along the channel dimension, whereas the static contextual information K contains abundant spatial contextual information.The interrelation between them is more closely intertwined compared to the dynamic contextual information.Hence, we integrate the intermediate matrix T with the static contextual information K to achieve a more comprehensive representation of spatial and channel features.This fusion can be mathematically expressed as follows: where + represents through attention mechanism to complete the fusion of static K and intermediate vector T.
The global residual mechanism plays a pivotal role in facilitating the fusion of input information and self-attention features.By incorporating a global residual structure, the method adeptly amalgamates and consolidates information from diverse hierarchical levels of input, thereby engendering a more comprehensive and all-encompassing feature representation.This ensemble effectively captures long-range dependencies and augments the overall discriminative capacity of the model.
The local fusion mechanism concentrates on integrating spatial context information with channel-based information.Through the integration of these two distinct information sources, the method proficiently models the intricate interplay between neighboring pixels and harnesses the wealth of channel-based information inherent in remote sensing data.This fusion framework empowers the model to capture intricate details and contextual cues with greater precision, thereby engendering superior performance in the detection of small objects.

Improved Decoupled Contextual Transformer Detection Head (DCTI)
Our DCTI structure starts with features X ∈ R H×W×C obtained after the backbone, FPN, and PAN stages.X has three different dimensions of features; (H, W, and C) are (20,20,128), (40,40,256), and (80, 80, 512), respectively.For the purpose of enhancing the lucidity of the presentation, we solely exemplify the variations in channel C, as depicted in Figure 3.The coupled detection head depicted (left) leverages 1 × 1 convolutions and the RepConv module to enhance feature information along the channel dimension.Despite its simplicity, this approach is remarkably effective.However, in the context of object detection tasks, it becomes evident that the classification task primarily emphasizes salient feature information, while the regression task is more concerned with capturing boundary feature information related to the targets.Employing a shared Repconv module for both tasks inevitably introduces conflicts between them.Consequently, we propose the DCTI, building upon the CoT-I framework, as illustrated (right).Initially, CoT-I assimilates the feature information derived from the feature pyramid and computes an intermediate variable T, which encompasses abundant spatial information, employing Equation (1).Subsequently, a local fusion strategy is employed to merge the spatial and channel information, yielding the dynamic feature K1 through the utilization of Equation (3).Finally, employing a global residual strategy, the input from the feature pyramid is combined with the dynamic feature to produce the output Y.The decoupled detection head further individually models Y to obtain category information and bounding box information.

Results
To evaluate the performance of the YOLOv7-DCTI algorithm for remote sensing small object detection, training, and testing were conducted on the Dota-small dataset [49].Furthermore, to assess the algorithm's overall performance, this experiment included training and testing on the VISDrone dataset [50] and NWPU VHR-10 datasets [51].A comparison was made between seven different networks, namely Faster RCNN, SSD, YOLO v5s, YOLO v5l, YOLO v5m, YOLO v7-tiny, and YOLOv7, using the aforementioned three datasets.To ensure fairness among the YOLO series, a batch size of 16 was utilized during the training process and pre-trained weights were not employed.The data augmentation strategy [52,53] remained consistent with other training conditions.During testing, the NMS [54] threshold was uniformly set to 0.65 and the batch size was uniformly set to 32.

Datasets 4.1.1. Dota-Small
In recent years, several remote sensing datasets have been developed.This paper focuses on the extraction of small or tiny objects from the DoTAv1.0dataset, which consists of 2000 aerial images of 2000 cities and over 190,000 fully labeled objects, each of which comprises eight positional parameters (x1, y1, x2, y2, x3, y3, x4, and y4).In this study, we have selected a dataset that includes five categories of small objects, namely small vehicles, large vehicles, planes, storage tanks, and ships.However, due to the large image size in the DoTAv1.0dataset, direct training is not feasible.Therefore, we have cropped the images to a size of 1024 × 1024, resulting in a total of 8624 images.These images were subsequently divided into three sets according to the train:val:test ratio, with a split of 8:2:2.Among these, 5176 images were used for training, while 1724 images were allocated to validation and testing.The five types of objects included in the dataset are small vehicle, large vehicle, plane, storage tank, and ship, as illustrated in Figure 4. We have set three sets of anchors with the following dimensions: (10,10, 13,22, 24,12), (23,24,28,37,45,25), and (49,50, 91,88, 186,188).

VisDrone
VisDrone is a widely recognized and highly demanding aerial photography dataset that is extensively used in UAV (Unmanned Aerial Vehicle) applications.It features a meticulous manual annotation process that has accurately labeled and classified 342,391 objects into 10 distinct categories.However, the official evaluation portal for the test-challenge set is unavailable, so we have utilized the test-dev set for evaluating our proposed method.Figures 5-7 showcases a selection of unprocessed images extracted from the VisDrone dataset.In our experiments, we have employed three sets of anchor dimensions: (3,4, 4,8, 8,7), (7,14,14,9,13,20), and (25,13,27,27,51,40).

NWPU VHR-10
To assess the generalization capability of our proposed method, we conducted experiments on the NWPU VHR-10 dataset.This dataset is specifically designed for geospatial object detection and comprises ten different object categories, namely airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, and vehicle.In our study, we randomly divided the dataset into three sets: 60% of the images were allocated to the training set, 20% to the validation set, and the remaining 20% to the testing set. Figure 8 showcases a selection of unprocessed images from the NWPU VHR-10 dataset.For our experiments, we employed three sets of anchor dimensions: (24,23,30,30,32,46), (47,32, 52,51, 74,61), and (88,96, 205,104, 150,194).

Experimental Environment and Settings
The experiment was conducted using a 64-bit Windows 10 operating system.The GPU utilized was NVIDIA GeForce RTX3090 and the deep learning framework employed was Torch v1.10.0.To evaluate the performance of object detection methods, this paper adopts the common indicators of object detection.this paper employs several metrics, including accuracy, recall, average precision, mean average precision, and average inference time per image (ms).The accuracy rate measures the proportion of correctly predicted samples out of the total tested samples.The recall rate indicates the proportion of positive samples that are accurately predicted.AP is calculated as the area under the precision-recall curve.mAP represents the average of AP values across all categories.Specifically, mAP@0.5:0.95refers to the average mAP value computed at ten IoU thresholds (0.50, 0.55, . . ., 0.95).On the other hand, mAP@0.5 denotes the mAP value computed at an IOU threshold greater than 0.5.

Experimental Results and Analysis
Experiments were conducted on three publicly available datasets, namely VISDrone, NWPU VHR-10, and Dota-small, to assess the efficacy of the proposed method.The experimental evaluation was carried out in four distinct stages: (1) Validation of the method's feasibility on the Dota-small dataset, including a comparative analysis against other object detection techniques to showcase its effectiveness.(2) Examination of the method's generalization capability using the VISDrone dataset.(3) Further validation of the method's performance on the NWPU VHR-10 dataset.(4) Evaluation of the performance of our model, encompassing inference speed, model parameters, and detection accuracy, and conducting a comparative analysis with existing models.(5) Execution of ablation experiments to scrutinize the effectiveness of each step within the proposed method and assess the optimal parameter configuration.

Experiments on Dota-Small Dataset
To validate the proposed model, we trained it for 200 epochs on the Dota-small dataset.To ensure fairness, we used a batch size of 24 during the training process and did not use pre-trained weights.We kept the data augmentation strategy [52,53] and other training conditions consistent.During testing, the NMS [54] threshold was uniformly set to 0.65 and the batch size was uniformly set to 32.In our experimental analysis, we conducted a comparative evaluation of the YOLOv7-DCTI algorithm with mainstream object detection algorithms using the Dota-small dataset generated specifically for this study.The outcomes of these experiments are documented in Table 1, encompassing five distinct categories: small vehicles, large vehicles, planes, storage tanks, and ships.The AP@0.5 values provided in the table indicate the average recognition accuracy achieved by each algorithm for individual categories.Additionally, the columns denoted as mAP@0.5 and mAP@0.5:0.95represent the average recognition accuracy across all categories.
The Dota-small dataset contains predominantly small or tiny objects with limited information and is characterized by complex and variable image backgrounds.Distinguishing these objects from the background presents significant challenges, as some objects may be partially occluded, further complicating detection.In comparison to other mainstream object detection algorithms, the enhanced network proposed in this study demonstrates a notable accuracy advantage in detecting small or tiny objects.
Our proposed method achieves the highest accuracy rate of 65.2% for small or tiny objects in the Dota-small dataset, surpassing YOLOv7 by 3.4%.In comparison to these four object detection algorithms, Faster R-CNN, SSD, YOLOv5l, and YOLOv7-tiny, the mean average precision (mAP) at the intersection over union (IoU) threshold range of 0.5 to 0.95 showed improvements of 19.3%, 45.2%, 4.5%, and 12.1%, respectively.
Although the detection speed of YOLOv7 and YOLOv5 is similar to that of our proposed method, their mAP scores are lower.In scenarios where the differentiation among the YOLO series detection heads is minimal, the proposed method achieves higher mAP at the IoU range of 0.5 to 0.95, while maintaining similar detection speeds.This demonstrates that the proposed method effectively compensates for the differences in detection heads and offers greater advantages.
The introduction of a Contextual Transformer (CoT) in the decoupled head unavoidably leads to a slight sacrifice in reasoning speed.However, the global residual structure and local fusion structure do not experience any decrease in speed.As a result, the detection speed of the proposed structure remains largely unaffected even with increasing complexity, achieving a favorable balance between inference speed and detection accuracy.As illustrated in Figure 7, YOLO-DCTI demonstrates superior performance in accurately detecting objects with unclear features or small sizes, even in complex backgrounds.It exhibits no omissions or false detections, unlike YOLOv7, which is prone to such errors.The proposed YOLO-DCTI algorithm in this paper excels at identifying small or tiny objects in challenging scenarios, yielding relatively high prediction probabilities.In contrast, YOLOv7 may struggle to accurately recognize small or tiny objects, resulting in lower recognition probabilities compared to YOLO-DCTI.
Figure 8 depicts the detection outcomes obtained through the utilization of the YOLO-DCTI methodology.On the whole, the performance is laudable; nonetheless, certain instances of missed detections persist in densely populated target scenes exhibiting analogous configurations.This phenomenon can be ascribed to the intricacy involved in discerning like attributes within such highly congested environments.As a result, this specific issue accentuates the imperativeness of conducting more extensive and meticulous investigations in future research endeavors.

Experiments on VisDrone Dataset
Table 2 presents the experimental results, demonstrating the strong performance of the proposed method on the VisDrone dataset.YOLO-DCTI achieves a noteworthy improvement of 0.2% in mAP@0.5:0.95 compared to the original method while maintaining a comparable detection speed.Notably, YOLOv5x achieves a mAP@0.5 of 48.1% on this dataset, while YOLOv7 achieves a mAP@0.5 of 49.2%.However, it is important to note that both networks employ coupled detection heads, which are unable to effectively address the inherent discrepancy between classification and regression tasks, resulting in slightly lower detection accuracy compared to our proposed method.As shown in Figure 9, the effectiveness of our method is reflected in its effective detection of small and dense objects (e.g., people and cars).and assign appropriate attention to them.Overall, the experimental results highlight the robust performance of the proposed method in object detection tasks, thus indicating its potential in various real-world applications.

Comparison Experiment on Inference Speed and Model Parameters
To elucidate the equilibrium achieved by our YOLO-DCTI model concerning inference speed and detection performance, we conducted a comprehensive comparative analysis of various models using the Dota-small dataset, as presented in Table 4.This analysis incorporates essential metrics such as mAP@0.5:0.95,inference speed, and model parameters.The evaluation employed test images with a resolution of 640 × 640 and the inference speed was quantified in milliseconds (ms).Our model showcased a total of 37.67 M parameters, closely akin to the 39.46 M parameters of Faster RCNN.However, our model gained a notable advantage due to the absence of candidate box generation operations, leading to commendable inference speed.While our inference speed aligns with YOLOv5l and YOLOv7, the distinctive structural design of DCTI fortifies its capacity to effectively capture features of small targets.Although our approach may not outperform others in terms of model parameters and inference speed, it successfully achieves a favorable equilibrium between inference speed and detection accuracy.The Dota-small dataset was used to conduct an ablation experiment aiming to investigate the impact of different structures on the final detection results.The obtained test results are presented in Table 5.After incorporating CoT in the decoupled head, the mAP@0.5:0.95value increased by 0.4%.CoT helps in identifying small or tiny objects by exploiting spatial context and global channel information.The addition of global residual structure and local fusion to CoT led to an improvement in mAP@0.5:0.95 by 1.0%.CoT-I further fuses spatial context and channel features, enabling the network to learn more about small object information, thereby enhancing detection performance.After incorporating the CoT-I structure in YOLOV7, mAP@0.5:0.95increased by 1.5%, providing further evidence that CoT-I can enhance detection accuracy.
We conducted a comparative analysis of the model's performance regarding speed and parameters, as presented in Table 6.The results demonstrate that the incorporation of decoupled heads and CoT introduced a speed latency of 0.1 ms and 0.2 ms, respectively, in comparison to the baseline, along with an augmentation of 0.4 M and 2.86 M parameters.However, it is noteworthy that the inclusion of Global-Residual and Local-Fusion did not impose any discernible burden on the inference speed and parameter requirements.We analyzed the kernel sizes utilized in CoT-I and present our findings in Table 7.In Equation ( 1), the kernel size, denoted as W k , is specifically referred to as k = 3, 5, and 7. Our analysis reveals that the model achieves the highest inference speed when employing a kernel size of k = 3.On the other hand, adopting a kernel size of k = 5 results in improved detection accuracy, albeit with a certain trade-off in terms of inference speed.Notably, when utilizing a kernel size of k = 7, both the model's detection accuracy and inference speed significantly decrease.These observations suggest that expanding the perception range does not necessarily lead to performance enhancement.

Conclusions
This research proposes the framework YOLO-DCTI for remote sensing small or tiny object detection based on YOLOv7 and an improved Context Transformer (CoT-I), which improves detection accuracy by mining and utilizing upper and lower spatial features and channel features.Specifically, we designed an efficient decoupled detection head structure DCTI by introducing CoT and embedding it into YOLOv7 to obtain long-term dependent features on channel and context spatial features.Furthermore, we introduce an innovative CoT variant, CoT-I, which incorporates a global residual structure and a local fusion structure.The global residual structure plays a critical role in merging and integrating information from various input levels, thereby yielding a more comprehensive and holistic feature representation.Similarly, the local fusion structure assumes a vital role in modeling the intricate interactions among neighboring pixels while leveraging the abundant channel-based information prevalent in remote sensing data.Extensive experiments demonstrate that our method can improve detection accuracy by mining and utilizing more features.Although the improvement in detection accuracy is accompanied by a slight loss of detection speed, the balance between detection accuracy and speed is crucial for remote sensing object detection, provided that the speed meets the application requirements.Despite the advancements in detection accuracy demonstrated by our model, occasional instances of missed detections persist in densely populated scenarios featuring small targets with similar features.In light of this, we aim to undertake more comprehensive research in our future endeavors to delve deeper into this matter.

Figure 1 .
Figure 1.The overall framework of our YOLO-DCTI; DCTI consists of CoT-I and Decoupled-Head.The FPN features are input to the CoT-I module for comprehensive modeling of global contextual information and spatial relationships.Subsequently, a dual-branch architecture is employed to effectively extract and discriminate both category-specific and localization-specific information.

Figure 2 .
Figure 2. The detailed structures of (left) CoT and (right) our CoT-I block.

Figure 3 .
Figure 3.Comparison between (left) YOLOv7 head and (right) ours.Moreover, we observe the exceptional capability of CoT-I in capturing global information for effective modeling.In comparison to 1 × 1 convolutions, CoT-I exhibits superior feature exploration in both the classification and regression tasks.Consequently, we propose the DCTI, building upon the CoT-I framework, as illustrated (right).Initially, CoT-I assimilates the feature information derived from the feature pyramid and computes an intermediate variable T, which encompasses abundant spatial information, employing Equation (1).Subsequently, a local fusion strategy is employed to merge the spatial and channel information, yielding the dynamic feature K1 through the utilization of Equation (3).Finally, employing a global residual strategy, the input from the feature pyramid is combined with the dynamic feature to produce the output Y.The decoupled detection head further individually models Y to obtain category information and bounding box information.

Figure 7 .
Figure 7. Recognition results in crowded environments of Dota-small dataset.(a) Recognition results of the YOLOv7 network.(b) Recognition results of the YOLO-DCTI network.

Figure 8 .
Figure 8.Some examples of detection results on the Dota-small dataset using YOLO-DCTI.

Figure 11 .
Figure 11.A selection of class activation maps exported using grad-cam on NWPU VHR-10 datasets.

Table 1 .
Comparison of detection accuracy of different object detection algorithms on Dota-small dataset.

Table 4 .
Inference speed and parameter comparison.

Table 5 .
Comparison of detection performance of different categories.

Table 6 .
Inference speed and parameter comparison.

Table 7 .
Comparison of improved contextual transformer parameters.