1. Introduction
Object detection, one of the cornerstones of computer vision, has progressed markedly over the past decade thanks to rapid advances in computing technology. Small-object detection, an especially challenging yet indispensable sub-task, underpins critical applications such as aerial surveillance, autonomous driving, and medical image analysis.
Current small-object detection approaches are consistent with mainstream generic object-detection frameworks and can be broadly classified into two categories. The first category is two-stage detectors, represented by [
1,
2,
3], which decompose the object-detection task into two steps: region proposals are generated in the first step, and these proposals are classified and regressed in the second. Although two-stage detectors achieve superior accuracy, their architectural complexity incurs higher computational latency. In contrast, one-stage detectors, represented by [
4,
5,
6,
7,
8,
9,
10,
11], regress object locations and categories directly from convolutional features of the input image. This strategy offers a clear advantage in detection speed, although it typically yields lower accuracy than two-stage detectors. Historically, the vast majority of detectors have relied on convolutional neural networks (CNNs). Since Transformers made a major impact in the field of NLP [
12], researchers have sought to adapt them to computer vision, yielding remarkable results, as demonstrated in [
13,
14]. DETR [
15] employed a combination of CNNs and Transformers to develop an end-to-end object-detection framework, which was instrumental in integrating Transformers into object-detection. Subsequently, a series of excellent Transformer-based object detectors emerged, such as Deformable DETR [
16], Swin Transformer [
17], YOLOS [
18], DL-YOLOX [
19], and Relation DETR [
20]. Building on these methods, researchers have proposed numerous outstanding small-object detection techniques, including MARE [
21], DotD [
22], CZ Det [
23], and YOLOM [
24]. Yet small-object detection is still hindered by the low resolution, frequent occlusion, and weak discriminative cues of small objects.
To alleviate these challenges, small-object detection techniques address issues related to feature extraction and edge information by incorporating two types of cues: multiscale representations [
25,
26,
27] and contextual information [
28,
29,
30]. In mainstream small-object detectors, the backbone generates feature maps at different scales. The shallow feature map preserves fine-grained details, facilitating object classification, while the deep feature map contains more comprehensive semantic information that aids object localization [
31]. The feature pyramid network (FPN) [
32] is a pioneering technique that addresses the challenge of multiscale representation by integrating detailed information from shallow features with semantic information from deep features. FPN introduces a bottom-up, top-down structure in which deep features are up-sampled and combined element by element with shallow features [
33]. Such fusion boosts feature quality: the top-down pathway merges deep semantics with shallow details, thereby improving small-object detection accuracy. However, excessive feature fusion can erase fine details and thus degrade overall detection performance.
Another critical aspect involves leveraging contextual information to exploit relationships between small objects and their surrounding environments. By considering the context surrounding the objects, these methods can use additional cues from other objects or the background to confirm the positioning and categorization of small objects. Small objects carry limited information and often lack distinctive appearance cues to separate them from the background; contextual information is therefore indispensable for reliably detecting them. By incorporating the surrounding context, some works have improved feature representations and overall accuracy [
34]. For example, FA-SSD [
35] pioneered the use of contextual cues and attention mechanisms to enhance small-object detection accuracy. Container [
36] exploits long-range interactions, similar to those provided by Transformers, while retaining the inductive bias of local convolution operations, resulting in the faster convergence speeds typically observed in CNNs. Integrating attention into shallow layers further guides the network to focus on small objects while capturing contextual cues at that resolution. In summary, contextual cues deepen the semantic understanding of small objects. MA-FPN [
37] introduces a pixel-wise attention module: multiscale convolutions first produce a feature map that matches the input size, and channel attention then weights this map to yield fine-grained pixel-level attention. This approach captures appearance features, such as texture, shape, and color, that are crucial to accurately locate small objects. Leveraging these contextual cues enables the model to separate objects from cluttered backgrounds, thereby improving both localization and classification accuracy. However, incorporating an excessive amount of contextual information may deteriorate the model’s generalization ability, reduce its robustness against background noise, and ultimately lower detection accuracy.
Although these methods have made significant progress in small-object detection, several challenges remain unresolved. Specifically, small-object detectors still suffer from sparse discriminative features, imbalanced object distributions in benchmark datasets, and information loss caused by repeated downsampling, all of which hinder further progress. Several specialized small-object detectors have also been designed for autonomous-driving scenarios.
Inspired by seminal works [
38,
39,
40], we propose a novel ECAN-Detector, an efficient context-aggregation method based on YOLOv7 [
7] for small-object detection. First, to enhance the expressive power of shallow features, we introduce an extra shallow detection layer (P2) into the backbone, making the network more suitable for small-object detection scenarios and reducing the loss of effective features related to small objects. Meanwhile, leveraging the power of the Transformer in semantic feature extraction and long-range feature capture, we incorporate a dynamic scaled transformer (DST) into both the backbone and the neck, enabling the model to focus on salient regions globally and enhancing its spatial perception. In addition, to further fuse shallow detailed information with deep semantic information, a context-augmentation module (CAM; throughout this paper, CAM denotes the context-augmentation module rather than a class activation map) is combined with the extra detection layer to integrate features from various receptive fields, achieving an optimal balance between global and local information and effectively combining components at multiple scales. Finally, to reduce the number of parameters while introducing richer gradient flow information, a faster implementation of two reparametrized convolutions (C2f_Rep) is adopted to replace the original convolution in the head. This redesign lowers the parameter budget and boosts small-object accuracy, markedly reducing missed and false detections in long-range autonomous-driving scenes. The overall structure of ECAN-Detector is shown in
Figure 1.
Our contributions can be summarized as follows:
We devise a novel, efficient context aggregation network that markedly improves detection precision on small objects.
We introduce an additional small-object detection layer to improve the extraction of detailed features from small objects, while the context-augmentation-based DST and CAM modules further enhance performance.
To reduce the number of parameters and introduce richer gradient flow information, we propose replacing RepConv in the head with C2f_Rep to maintain a lightweight structure.
Extensive experiments conclusively demonstrate that our proposed method surpasses both the baseline and existing state-of-the-art (SOTA) detectors, underscoring its effectiveness in addressing small-object detection challenges.
The rest of the paper is organized as follows:
Section 2 reviews previous related work;
Section 3 describes our proposed detector, ECAN-Detector, and its modules;
Section 4 presents comparative experiments on VisDrone2021-DET [
41] and validates the effectiveness of the method on DOTA [
42] and COCO [
43]; finally,
Section 5 concludes the paper and discusses its limitations.
3. Our Methods
This section presents a detailed overview of the Efficient Context-Aggregation Network (ECAN). First, we introduce the baseline and the specific implementation details of our proposed method. Then, we outline ECAN-Detector and elaborate on its three core modules. Next, we introduce C2f_Rep in the head to further enhance the model’s detection capability in autonomous-driving scenarios. Finally, we propose an improved loss function that more effectively supervises small objects.
3.1. Preliminaries
YOLOv7 is an anchor-based one-stage detector composed of three main components: the backbone, neck, and head. The backbone primarily integrates ELAN [
56] and a redesigned max-pooling layer, while the neck adopts PAFPN [
57] enhanced with ELAN-W for multiscale feature extraction. In the head, RepConv [
58] is utilized to fuse multiscale feature maps, followed by IDetect, which predicts objects of different sizes separately. The loss function of YOLOv7 consists of three components: coordinate regression loss, object confidence loss, and classification loss. Both the object confidence loss and the classification loss use binary cross-entropy (BCEWithLogitsLoss), while the coordinate loss uses CIoU loss. In our setting, the YOLOv7 baseline achieves only 24.8% AP, indicating suboptimal precision.
To tackle the aforementioned challenges, we design ECAN-Detector, a YOLOv7-based architecture depicted in
Figure 1. In our detector, a shallow feature layer is first employed to enrich fine-grained features for small objects. Simultaneously, we introduce a Transformer-based DST module to further improve the network’s capacity for shallow feature extraction. A context-augmentation module (CAM) is inserted into the neck to exploit high-resolution features and capture detailed context. Moreover, in the head, the output features from the neck are processed using C2f_Rep, which replaces the original heavily parameterized convolution to extract more expressive features. This not only improves the accuracy of small-object detection but also reduces the number of parameters and computational complexity.
Through these optimizations, our proposed ECAN-Detector effectively addresses the key challenges of small-object detection, significantly enhancing detection accuracy and recall.
3.2. Our ECAN-Detector
In this subsection, we provide a detailed explanation of the implementation of the three key components of the ECAN-Detector. Specifically, P2 is employed for shallow feature extraction, while DST and CAM are designed to enhance contextual information and fuse multiscale features. These improvements make the network more effective in detecting small objects.
3.2.1. Shallow Detection Layer
In YOLOv7, the backbone primarily generates feature layers P3, P4, and P5 at 1/8, 1/16, and 1/32 of the input resolution, respectively, to detect small, medium, and large objects. However, in long-distance autonomous-driving scenarios, small objects are often occluded or lost due to downsampling and the limitations of anchor-based designs, resulting in reduced detection accuracy. To address this issue, we introduce an additional shallow feature layer, P2, at 1/4 of the input resolution. Instead of modifying existing layers, such as P3, we explicitly generate P2 from earlier backbone features to preserve fine-grained spatial details. Altering deeper layers would compromise high-level semantic information or require excessive up-sampling, both of which are detrimental to small-object localization. The P2 layer captures higher-resolution information that is critical for detecting objects smaller than 32 × 32 pixels, targets that are otherwise poorly represented at coarser scales. Furthermore, we equip P2 with DST, CAM, and ELAN-W, boosting feature richness and robustness in complex traffic scenes.
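As a concrete illustration, the following minimal PyTorch sketch shows one way a stride-4 P2 map can be formed by fusing an early backbone feature with an upsampled P3 feature; the module name, channel widths, and fusion choice are assumptions made for illustration, not the exact ECAN-Detector implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class P2Branch(nn.Module):
    """Fuse an early (stride-4) backbone feature with an upsampled stride-8 feature."""
    def __init__(self, c2_channels: int, p3_channels: int, out_channels: int = 64):
        super().__init__()
        self.lateral = nn.Conv2d(c2_channels, out_channels, kernel_size=1)    # project C2
        self.reduce_p3 = nn.Conv2d(p3_channels, out_channels, kernel_size=1)  # project P3
        self.smooth = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, c2: torch.Tensor, p3: torch.Tensor) -> torch.Tensor:
        # Upsample the stride-8 P3 map to stride 4 and fuse it with the C2 lateral features.
        up = F.interpolate(self.reduce_p3(p3), scale_factor=2, mode="nearest")
        return self.smooth(self.lateral(c2) + up)

# A 640x640 input gives C2 at 160x160 (stride 4) and P3 at 80x80 (stride 8).
p2 = P2Branch(c2_channels=128, p3_channels=256)(torch.randn(1, 128, 160, 160),
                                                torch.randn(1, 256, 80, 80))
print(p2.shape)  # torch.Size([1, 64, 160, 160])
```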
3.2.2. Dynamic Scaled Transformer
YOLOv7’s backbone consists of a series of traditional convolutions with a limited receptive field in shallow feature maps, which restricts its ability to capture global features. To capture richer context, we insert a dynamic scaled transformer (DST) into the shallow extraction stage based on [
39]. By integrating DST into the network, we harness the Transformer’s powerful global modeling capabilities.
The Transformer architecture [
12] relies on a self-attention mechanism for parallel computation and operates as a non-recurrent network primarily composed of two components: an encoder and a decoder. The encoder structure, illustrated in
Figure 2a, processes the embedded input using multi-head attention to capture feature relationships, followed by a feedforward network for further feature extraction. In addition, residual connections and normalization layers ensure stability and efficient training.
DST is an encoder based on the Transformer architecture, with its specific implementation details illustrated in
Figure 2b. It primarily consists of two components: scaled-cosine attention and a feedforward network (FFN). Layer normalization and dropout further stabilize training and reduce overfitting.
Traditional multi-head attention computes the similarity score by taking the dot product of the query and key vectors. However, as the vector dimension increases, the computational complexity grows, potentially leading to exploding or vanishing gradients. To address this, we adopt scaled-cosine attention and introduce a scaling factor to stabilize the attention-weight calculation. This approach produces more balanced attention weights, better captures the relative relationships between feature vectors, and more effectively extracts contextual information around the target in complex backgrounds.
Additionally, cosine similarity is focused solely on directional similarity between vectors and is not affected by amplitude variations. This property improves the model’s robustness when handling features at different scales.
The input features $X$ are first passed through an embedding layer, which flattens them into a sequence tensor and feeds them into the encoder. The encoder uses scaled-cosine attention to compute the cosine similarity between pixel pairs. The formulation is shown in Equation (1):
$$\mathrm{Sim}(q_i, k_j) = \frac{\cos(q_i, k_j)}{\tau} + B_{i,j}$$
where $q_i$ denotes the feature representation of the query of input pixel $i$, and $k_j$ denotes the feature representation of the key of input pixel $j$; $q_i$ and $k_j$ are feature vectors taken at the corresponding positions of the input feature map $X$. The learnable temperature parameter $\tau$ is constrained to be larger than 0.01. After this process, the input features $X$ are transformed into a new feature representation $X'$, which contains more contextual information and helps improve model performance. The relative position bias $B_{i,j}$ is a learnable bias, usually generated through Equation (2), where $f$ is a function representing the relative position between the query pixel $i$ and the key pixel $j$.
In DST, the positional feedforward network consists of two fully connected layers that enhance feature representation through nonlinear transformations while incorporating positional information for each element in the sequence. However, as the number of network layers increases, the number of parameters grows significantly. To address this, instead of adopting the pre-norm approach used in ViT [
59], this paper employs a post-norm process, where normalization operations occur after the attention mechanism. This approach mitigates the accumulation of parameters as the network deepens, alleviates the gradient vanishing problem, and improves the model’s generalization ability.
Dropout further regularizes DST, lowering model complexity and preventing overfitting to improve learning and generalization. Additionally, integrating residual connections into both modules strengthens the network’s ability to selectively learn across different layers. These connections enrich feature representations and facilitate gradient propagation. As a result, deeper features propagate more effectively to the shallow network, ultimately improving the training process for deeper layers. The process is represented by Equation (
3):
$$Y = X' + \mathrm{FFN}(X'), \qquad \mathrm{FFN}(X') = W_2\,\sigma(W_1 X' + b_1) + b_2$$
where $W_1$ and $W_2$ represent the weight matrices of the two fully connected layers, $b_1$ and $b_2$ denote their bias terms, and $\sigma(\cdot)$ is the nonlinear activation. Finally, through the residual connection, the features that have incorporated scaled-cosine attention are added to the FFN output, resulting in the output features $Y$. The general process of DST is shown in Algorithm 1.
Algorithm 1 Dynamic scaled transformer (DST)

Input: X: the input feature map; L: the number of feature pyramid levels; τ: the learnable scaling factor for cosine similarity; W1, W2: the weight matrices of the fully connected layers; b1, b2: the bias terms of the fully connected layers; f: the function used to compute the relative position bias.
Output: Y: the final feature representation after DST processing.

 1: procedure DST(X, L, τ, W1, W2, b1, b2, f)
 2:     X′ ← 0                                   ▹ Initialize the transformed feature map
 3:     S ← Flatten(X)                           ▹ Flatten the input feature map into a sequence tensor
 4:     Q ← LinearQ(S)                           ▹ Compute queries
 5:     K ← LinearK(S)                           ▹ Compute keys
 6:     V ← LinearV(S)                           ▹ Compute values
 7:     for all pixels i in the feature map do
 8:         for all pixels j in the feature map do
 9:             B_ij ← f(i, j)                   ▹ Relative position bias
10:             A_ij ← cos(Q_i, K_j)/τ + B_ij    ▹ Scaled-cosine attention score
11:         end for
12:     end for
13:     A ← Softmax(A)                           ▹ Normalize the attention weights row-wise
14:     X′ ← LayerNorm(A · V + S)                ▹ Attention output with post-norm residual connection
15:     Y ← LayerNorm(X′ + W2 σ(W1 X′ + b1) + b2)  ▹ FFN with post-norm residual connection
16:     return Y
17: end procedure
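To make the procedure above concrete, the following is a minimal PyTorch sketch of a DST-style encoder block operating on a flattened (B, N, C) token sequence; the head count, FFN expansion ratio, dropout rate, and handling of the relative position bias are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledCosineAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learnable temperature tau, clamped to stay above 0.01 as stated in the text.
        self.tau = nn.Parameter(torch.full((num_heads, 1, 1), 0.1))

    def forward(self, x, rel_bias=None):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each: (B, heads, N, head_dim)
        # Cosine similarity between queries and keys, scaled by the temperature tau.
        attn = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
        attn = attn / self.tau.clamp(min=0.01)
        if rel_bias is not None:                      # optional relative position bias B_ij
            attn = attn + rel_bias
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

class DSTBlock(nn.Module):
    """Scaled-cosine attention followed by an FFN, with post-norm residual connections."""
    def __init__(self, dim: int, num_heads: int = 4, ffn_ratio: int = 4, drop: float = 0.1):
        super().__init__()
        self.attn = ScaledCosineAttention(dim, num_heads)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * ffn_ratio), nn.GELU(),
                                 nn.Dropout(drop), nn.Linear(dim * ffn_ratio, dim))
        self.norm2 = nn.LayerNorm(dim)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.norm1(x + self.drop(self.attn(x)))   # post-norm, as described above
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x

tokens = torch.randn(1, 20 * 20, 64)                  # a flattened 20x20 feature map for illustration
print(DSTBlock(dim=64)(tokens).shape)                 # torch.Size([1, 400, 64])
```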
3.2.3. Context-Augmentation Module
As discussed in the previous section, shallow feature maps contain rich, fine-grained information about small objects, but the repeated downsampling operations in CNNs significantly limit how much of this information is preserved. Therefore, incorporating contextual information around small objects within shallow feature maps enhances both their detection and their feature representation.
To improve the identification of dense, small objects, we introduce a context-augmentation module (CAM) in the neck, which provides more detailed contextual information. Simultaneously, CAM integrates a rapid normalized fusion operation, effectively combining deep semantic information with shallow feature representations without significantly increasing the parameter count. The overall process is formulated in Equation (
4).
As shown in
Figure 1, in the P2 feature layer, contextual information around small objects is extracted using CAM [
60]. CAM employs three dilated convolutions with distinct dilation rates to capture contextual information from various receptive fields; the three branches share the same kernel size and use dilation rates of 1, 3, and 5, respectively. Their outputs are then combined through a fusion operation.
In implementation, fusion operations typically fall into three categories: weighted fusion, adaptive fusion, and concatenation fusion. Experimental results indicate that traditional feature fusion methods may not effectively handle multiscale features, particularly when detecting small objects in complex backgrounds. Therefore, in this work, CAM introduces a specialized multiscale feature fusion operation that adaptively adjusts the fusion weights based on scene-specific characteristics, enhancing feature representation accuracy. The detailed implementation is illustrated in
Figure 3.
In the main network, we replace the original concatenation operation after CAM with fast normalized fusion to better integrate multiscale contextual information. The formula is shown in Equation (
5):
$$O = \sum_{i} \frac{w_i}{\epsilon + \sum_{j} w_j}\, X_i$$
First, the input features $X_i$ are normalized to accelerate model convergence. Then, a weighted average is used to fuse adjacent deep and shallow feature maps, boosting the capacity to convey multiscale features; accelerating feature propagation in this way improves convergence and strengthens the ability of the shallow network to detect small objects.
Here, $w_i$ and $w_j$ are learnable parameters obtained through training, representing the weights of the input feature maps of layers $i$ and $j$, respectively; $X_i$ denotes the input features of layer $i$; $\epsilon$ is a small positive constant that keeps the divisor non-zero and thus ensures numerical stability; and $O$ denotes the output features.
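For illustration, a minimal PyTorch sketch of a CAM-style block with fast normalized fusion is given below; the 3 × 3 kernel size, channel layout, and ReLU-based weight normalization are assumptions made for this sketch, not the authors' exact code.

```python
import torch
import torch.nn as nn

class ContextAugmentation(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Three parallel dilated convolutions with rates 1, 3, and 5.
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in (1, 3, 5)
        ])
        # One learnable, non-negative weight per branch for fast normalized fusion.
        self.weights = nn.Parameter(torch.ones(3))
        self.eps = 1e-4

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches]
        w = torch.relu(self.weights)        # keep fusion weights non-negative
        w = w / (w.sum() + self.eps)        # O = sum_i w_i * X_i / (eps + sum_j w_j)
        return sum(wi * fi for wi, fi in zip(w, feats))

out = ContextAugmentation(64)(torch.randn(1, 64, 160, 160))
print(out.shape)  # torch.Size([1, 64, 160, 160])
```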
3.3. C2f_Rep
In YOLOv7, RepConv primarily reduces the number of FLOPs and model parameters. During training, it enhances performance by adding parallel 1 × 1 convolution branches and identity-mapping branches to each 3 × 3 convolution, thereby improving the model’s efficiency and flexibility. Inspired by the work in [
8], we propose C2f_Rep, which consists mainly of CBS (Conv–BN–SiLU) blocks and BottleNeck_Rep. CBS is responsible for the initial feature fusion, while BottleNeck_Rep further processes the features using depthwise separable and pointwise convolutions for efficient feature extraction. The architecture of C2f_Rep is shown in Figure 4a. Compared with the original RepConv, C2f_Rep’s multibranch structure and enhanced gradient flow significantly improve small-object detection performance, while keeping the model lightweight by preventing a substantial increase in parameters.
In our design, the convolution output of YOLOv7’s Conv–BN–SiLU (CBS) block is split, passed through n BottleNeck_Rep branches, and then concatenated with the original features. A 1 × 1 convolution is subsequently applied to adjust the channel count and integrate the information. Unlike the original structure, which uses a single convolution, the C2f_Rep structure introduces parallel BottleNeck_Rep branches to enhance feature diversity, as shown in Figure 4a. Each BottleNeck_Rep, depicted in Figure 4b, employs two stacked reparametrized convolutions with residual connections, improving accuracy without increasing FLOPs or parameters.
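The following minimal PyTorch sketch illustrates the C2f_Rep idea described above; RepConvBlock is a simplified training-time stand-in for a full RepConv (branch fusion at inference is omitted), and the channel split and branch count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RepConvBlock(nn.Module):
    """Training-time RepConv surrogate: parallel 3x3 + 1x1 branches plus identity."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv3(x) + self.conv1(x) + x))

class BottleneckRep(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(RepConvBlock(channels), RepConvBlock(channels))

    def forward(self, x):
        return x + self.block(x)                         # residual connection

class C2fRep(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, n: int = 2):
        super().__init__()
        self.hidden = out_ch // 2
        self.cv1 = nn.Conv2d(in_ch, 2 * self.hidden, 1)            # CBS-style entry conv
        self.blocks = nn.ModuleList(BottleneckRep(self.hidden) for _ in range(n))
        self.cv2 = nn.Conv2d((2 + n) * self.hidden, out_ch, 1)     # 1x1 fusion conv

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))                      # split into two parts
        for block in self.blocks:
            y.append(block(y[-1]))                                 # chain BottleNeck_Rep outputs
        return self.cv2(torch.cat(y, dim=1))                       # concat + channel adjustment

print(C2fRep(128, 128)(torch.randn(1, 128, 80, 80)).shape)  # torch.Size([1, 128, 80, 80])
```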
3.4. Loss Function
To address the challenges of assigning positive anchors in small-object detection, we adopt an enhanced variant of the Optimal Transport Assignment (OTA) strategy [
61]. Unlike traditional fixed IoU threshold methods, OTA dynamically selects foreground anchors based on a unified cost matrix that incorporates both classification confidence and localization distance. This approach is particularly effective when target objects are sparsely distributed, as is common in small-object scenarios, where conventional assignment strategies often lead to suboptimal matches and imbalance. Using the principles of optimal transport, OTA provides a more robust and adaptive label assignment, improving convergence stability and overall detection accuracy.
The overall loss comprises three terms: CIoU-based box regression, BCE objectness, and BCE classification. Compared to YOLOv7’s default configuration, we introduce two key enhancements specifically tailored for dense small-object detection: first, we replace GIoU with Complete IoU (CIoU), which better penalizes localization errors in both aspect ratio and center alignment, especially beneficial for tightly clustered small targets; and second, we integrate OTA for dynamic label assignment based on spatial constraints and feature similarity.
Furthermore, we incorporate anchor scale-aware loss weighting and apply gain-adjusted anchor matching to normalize loss contributions across object sizes. This prevents large objects from dominating the optimization process and ensures balanced gradient updates throughout training. These combined modifications result in a more stable and effective loss design for detecting small and densely distributed targets.
The overall objective can be written as
$$\mathcal{L} = \lambda_{box}\,\mathcal{L}_{CIoU}(\hat{b}, b) + \lambda_{obj}\,\mathcal{L}_{BCE}(\hat{c}, c) + \lambda_{cls}\,\mathcal{L}_{BCE}(\hat{p}, p).$$
In this formulation, $N$ denotes the number of positive samples over which the loss is computed, dynamically assigned by the OTA matcher. $\hat{b}$ and $b$ represent the predicted and ground-truth bounding boxes, respectively; $\hat{c}$ and $c$ are the predicted objectness confidence and its binary ground-truth label (1 for foreground, 0 for background); and $\hat{p}$ and $p$ refer to the predicted class probability distribution and the one-hot encoded ground-truth class vector. The weights $\lambda_{box}$, $\lambda_{obj}$, and $\lambda_{cls}$ control the relative contribution of each loss term. Following YOLOv7, we set them to 0.05, 1.0, and 0.5, respectively, unless otherwise specified.
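A hedged sketch of this composite loss with the weights quoted above is shown below; torchvision's complete_box_iou_loss stands in for the CIoU term, and the OTA-based positive assignment is assumed to have been performed upstream.

```python
import torch.nn.functional as F
from torchvision.ops import complete_box_iou_loss

def detection_loss(pred_boxes, gt_boxes, pred_obj, gt_obj, pred_cls, gt_cls,
                   w_box=0.05, w_obj=1.0, w_cls=0.5):
    # Box regression on OTA-assigned positive samples (boxes in xyxy format).
    box_loss = complete_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    # Objectness and classification terms, both BCE-with-logits as in YOLOv7.
    obj_loss = F.binary_cross_entropy_with_logits(pred_obj, gt_obj)
    cls_loss = F.binary_cross_entropy_with_logits(pred_cls, gt_cls)
    return w_box * box_loss + w_obj * obj_loss + w_cls * cls_loss
```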
4. Experiments
4.1. Datasets
This study evaluates the effectiveness of our proposed approach using two prominent autonomous-driving datasets and a widely recognized benchmark in the field of object detection: VisDrone2021-DET [
41] (
http://aiskyeye.com) (accessed on 19 May 2025), DOTA-v1.0 [
40] (
https://captain-whu.github.io/DOTA/dataset.html) (accessed on 19 May 2025), and the MS COCO 2017 [
43] (
https://cocodataset.org) (accessed on 19 May 2025). The VisDrone2021-DET dataset [
41] encompasses diverse real-world scenarios captured in 14 cities in China, reflecting various weather conditions, lighting environments, and urban and rural landscapes. It consists of 6471 images for training, 548 for validation, and 3190 for testing.
The DOTA dataset [
40] is a large-scale benchmark widely adopted in computer vision and remote-sensing tasks, focusing on object detection in aerial images. It contains aerial photographs captured by various sensors and platforms, with image resolutions ranging from roughly 800 × 800 to 4000 × 4000 pixels. In this study, we utilize the DOTA-v1.0 version, which comprises 2806 images spanning different scenes, such as cities, villages, and highways. The dataset features 15 categories of common remote-sensing objects, including airplanes, ships, and vehicles, totaling 188,282 annotated instances. The dataset is partitioned into training, validation, and test sets with a ratio of 1/2, 1/6, and 1/3, respectively. To accommodate the high-resolution images, all original images are resized to 640 × 640 pixels, consistent with our training configuration.
To further assess the generalization capability of our proposed method, we conduct experiments on the MS COCO 2017 dataset [
43], a widely recognized benchmark in the field of object detection. The dataset comprises 118,287 images for training, 5000 images for validation, and 40,670 images for testing (test-dev). It features 80 object categories spanning a wide range of everyday scenes, including people, animals, vehicles, and household objects. The images in COCO present significant challenges due to the high diversity in object scale, occlusion, and context. Each image contains an average of 7.7 object instances, often with overlapping objects and cluttered backgrounds, which makes detection tasks more complex. The dataset provides detailed instance-level annotations, including bounding boxes, object categories, and segmentation masks. During training and evaluation, all images are resized to 640 × 640 pixels to ensure consistency. This dataset serves as a robust benchmark for evaluating object-detection performance under real-world conditions.
4.2. Evaluation Metrics
To assess the effectiveness of our method, we follow the evaluation protocol of the MS COCO dataset, which measures accuracy using three metrics: AP, AP50, and AP75. Additionally, to demonstrate the practicality of our approach for detecting small objects, we report APs, APm, and APl. Specifically, APs denotes the average precision for small objects (those with an area of at most 32 × 32 pixels), while APm and APl denote the average precision for medium objects (between 32 × 32 and 96 × 96 pixels) and large objects (larger than 96 × 96 pixels), respectively.
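For reference, the standard COCO-style evaluation that produces these metrics can be run with pycocotools (used in our pipeline, Section 4.3); the file paths below are placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")      # ground-truth annotations (placeholder path)
coco_dt = coco_gt.loadRes("ecan_detector_results.json")   # detections in COCO JSON format (placeholder path)

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP, AP50, AP75, APs, APm, and APl for the area ranges above
```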
4.3. Implementation Details
All models are trained on a single NVIDIA RTX 3090 GPU (24 GB) using PyTorch 1.12.0 with CUDA 11.6 and Python 3.8. Our training and inference pipelines are implemented in PyTorch, and we leverage libraries such as pycocotools for evaluation, albumentations for data augmentation, and OpenCV for image preprocessing.
We adopt the AdamW optimizer. The learning rate follows a cosine annealing decay schedule from its initial value, and we apply weight decay together with an exponential moving average (EMA) with a momentum of 0.9999. All models are trained for 100 epochs on the VisDrone, COCO, and DOTA datasets, using a batch size of 8 images per GPU and an input resolution of 640 × 640. Additionally, mixed-precision training is enabled through AMP to accelerate computation and reduce memory usage.
We train the ECAN-Detector from scratch without using ImageNet pretrained weights to validate its architectural effectiveness. Our early tests showed that pretraining slightly reduced accuracy in small-object detection scenarios.
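A minimal sketch of this optimizer and schedule setup is given below; the learning-rate and weight-decay values are placeholders, since the exact numbers are not reproduced here, and the model object merely stands in for ECAN-Detector.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Conv2d(3, 16, 3)                      # stand-in for ECAN-Detector
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=5e-4)   # placeholder values
scheduler = CosineAnnealingLR(optimizer, T_max=100)    # 100-epoch cosine decay
scaler = torch.cuda.amp.GradScaler()                   # mixed-precision training (AMP)
ema = torch.optim.swa_utils.AveragedModel(
    model, avg_fn=lambda avg, new, n: 0.9999 * avg + (1 - 0.9999) * new)  # simple EMA

for epoch in range(100):
    # ... one optimization pass over the training set would go here ...
    scheduler.step()
    ema.update_parameters(model)
```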
With these settings in place, we now analyze the contribution of each component.
4.4. Ablation Experiments
The experimental evaluation in this paper consists of several key components. First, we conduct a comprehensive ablation study on the VisDrone2021-DET dataset to assess the performance contribution of each module within the ECAN-Detector architecture. We then provide an in-depth analysis of the experimental results, including the overall detection performance and category-wise comparisons against baseline models. In addition, we investigate how fusion strategies and dilation rates in CAM affect detection accuracy. Subsequently, we evaluate the robustness and generalization capability of the proposed method on the DOTA dataset. To further emphasize the architectural novelty of ECAN-Detector, we conduct a comparative study on the COCO dataset against recent YOLO variants and Transformer-based detectors. Finally, we assess the practical deployability of the ECAN-Detector through real-world inference benchmarking on edge devices.
4.4.1. Performance Comparison of Each Sub-Module
To verify the effectiveness of our proposed method, we conduct ablation experiments on each sub-module of ECAN-Detector and analyze their impact on overall performance, as shown in
Table 1. First, adding the P2 layer leads to a notable improvement in detection performance, with +1.8% AP overall and +2.0% APs compared with the baseline. These results demonstrate that high-resolution features in the early stage significantly enhance the model’s ability to detect small and densely packed objects in aerial scenes. Next, we evaluate the effect of DST. Based on the self-attention mechanism, DST excels at capturing contextual information around objects, enabling better adaptation to objects of different sizes; compared with C2f_Rep, DST further boosts overall accuracy by 3.1%. We also assess the impact of CAM, which adds more detailed contextual information and performs multiscale feature fusion, yielding gains of 0.3% and 0.8% on two of the reported metrics. Furthermore, we evaluate the performance of integrating P2 and DST simultaneously: compared with adding P2 alone, AP improves by a further 0.6%. Finally, relative to the baseline, ECAN-Detector demonstrates improvements of 3.1% in AP and 3.5% in APs.
We note that applying CAM in isolation slightly reduces AP (24.8% → 24.7%) in
Table 1. This is attributed to CAM’s reliance on global context aggregation without concurrent spatial refinement. Specifically, multi-dilation fusion can introduce over-smoothing effects in clustered small-object regions, weakening boundary localization. Moreover, CAM was originally designed to enhance early high-resolution features, particularly in conjunction with the P2 shallow layer. If P2 is absent or excluded from the fusion process, CAM lacks sufficient high-frequency structural cues, making it less effective in recovering fine-grained object details. This further impairs detection performance, especially in scenarios involving small, low-contrast targets.
4.4.2. Efficiency Analysis of ECAN-Detector
In
Table 2, we systematically evaluate the individual and combined effects of P2, C2f_Rep, and CAM on model performance. First, we analyze the contribution of the P2 layer: compared with the baseline, P2 maintains nearly identical FLOPs and parameter counts while delivering a noticeable improvement in APs. Next, we assess C2f_Rep, an improved variant of C2f, which slightly increases FLOPs and parameter count but significantly enhances detection performance for large objects. Subsequently, we evaluate the impact of CAM. By aggregating shallow contextual information, CAM substantially improves overall performance but introduces a considerable increase in FLOPs. We further examine the combined effect of P2 integration: when CAM is combined with P2, both AP and APs improve by 2.4% over the baseline, with only a modest rise in FLOPs and parameters. Overall, integrating all three components (P2, C2f_Rep, and CAM) yields gains of 3.1% in AP and 3.5% in APs compared with the baseline.
4.4.3. Performance Comparison of C2f_Rep and C2f
In
Table 3, we evaluate the individual and combined effects of the C2f, C2f_Rep, and ECAN modules on model performance. First, we examine the impact of C2f: compared with the baseline, C2f achieves a modest accuracy improvement while maintaining nearly identical FLOPs and parameter counts. Next, we analyze C2f_Rep, an enhanced version of C2f. Although it introduces a slight increase in FLOPs and parameters, it improves detection performance for large objects.
Subsequently, we assess the contribution of the ECAN module. By incorporating shallow contextual information, ECAN yields a substantial boost in overall performance, albeit with a noticeable increase in computational cost. We then measure the gains achieved by integrating both C2f and ECAN: this combination improves AP by 3.0% and APs by 3.4% over the baseline, with only a marginal increase in FLOPs and parameters.
Furthermore, when comparing C2f_Rep with C2f, we observe a 0.1% gain in both AP and APs while maintaining the same FLOPs and parameter count. Overall, the integration of C2f_Rep and ECAN achieves a 3.1% increase in AP and a 3.5% increase in APs compared with the baseline model.
4.4.4. Transformer Encoder vs. DST Comparison
Although DST inherits the core concept of self-attention from standard Transformer encoders, it is architecturally distinct and optimized for dense small-object detection. First, DST replaces standard dot-product attention with a scaled-cosine attention (SCA) mechanism, which enhances stability and angular sensitivity in low-resolution feature maps. Second, unlike fixed-size token windows or global attention, DST adaptively scales the receptive field across spatial hierarchies, allowing more flexible multiscale aggregation. To empirically justify this design, we compare DST with a standard Transformer encoder under identical backbone and training conditions (
Table 4). Our DST achieves higher AP (25.4% vs. 25.1%) and APs (16.8% vs. 16.3%), while introducing only a marginal increase in parameters (+6.2 M) and FLOPs (+17.1 G). These results validate that DST is both structurally and functionally distinct from typical Transformer encoders and is more effective for small-object dense scenes such as VisDrone.
4.4.5. Effects of Individual Classes
Single-class experiments allow fine-grained analysis of category-specific performance, revealing both strengths and weaknesses. They support capability assessment, issue diagnosis, targeted optimization, and better interpretability. The performance of individual classes on VisDrone2021-DET for the baseline and after adding each module independently is displayed in
Table 5. The results indicate notable per-class gains achieved by our method. ECAN shows the most remarkable enhancement in overall performance, achieving good results for the smaller classes, which improve by 4%, 4%, 1.7%, and 1.6% for pedestrian, car, truck, and tricycle, respectively. After integrating all modules, overall performance improves markedly across every class relative to the baseline. Notably, the ECAN module boosts the Vehicle class by a remarkable 10.6 percentage points. Moreover, the full method achieves the best detection performance in all six categories.
4.4.6. Different Fusion Methods
To evaluate fusion strategies within CAM,
Table 6 compares weighted, adaptive, and concatenation fusion. The experimental results clearly show that weighted fusion significantly improves all evaluation metrics. Adaptive fusion enhances accuracy for regular-sized objects, with minimal improvement in small-object detection. Concatenation fusion boosts the accuracy of large objects, but provides only limited enhancement for small objects. Since our method is primarily designed for small-object detection in complex traffic scenes, where small objects make up a large portion, the weighted fusion method proves to be more suitable for these scenarios.
4.4.7. CAM Dilation Rates
We conducted ablation experiments to investigate the impact of different dilation rate configurations in the CAM module. Specifically, we compared four settings: (2,4,6), (2,2,3), (5,2,1), and (1,3,5). As summarized in
Table 7, the (1,3,5) configuration consistently achieves the best performance, yielding the highest AP of 27.9% and APs of 19.9%. These results empirically validate our selection of (1,3,5) as the optimal dilation setting for CAM in the final ECAN-Detector model.
4.4.8. Robustness on DOTA
To further validate the robustness of ECAN-Detector, we evaluated it on the DOTA dataset [
42]. As shown in
Table 8, our ECAN-equipped method significantly improves overall accuracy compared with the baseline, with particularly notable increases of 0.8% in AP and 0.6% in APs, further demonstrating the robustness of our approach. Additionally, our method provides a distinct enhancement for medium and large objects, as confirmed by the experimental results.
4.4.9. Compared to YOLO Variants and Transformers
ECAN-Detector distinguishes itself from existing YOLO variants (e.g., YOLOv8 [
8], YOLOv9 [
9]) and Transformer-based detectors (e.g., DETR [
15], Swin Transformer [
17]) through both architectural innovations and its focus on small-object detection. While YOLOv8 and YOLOv9 mainly enhance backbone and head components via compound scaling and decoupled design, ECAN-Detector introduces two dedicated modules: the dynamic scaled transformer (DST) and the context-augmentation module (CAM).
DST employs a lightweight scaled-cosine attention mechanism to dynamically adjust receptive fields across spatial hierarchies, effectively capturing dispersed and context-dependent small targets. CAM supplements this by aggregating fine-grained context through multibranch dilated convolutions and normalized attention, helping the network recover localized details lost in early downsampling. Together, DST and CAM form a synergistic enhancement tailored to small-object challenges such as underdetection, low resolution, and occlusion.
In contrast, Transformer-based models such as DETR and Swin Transformer face key limitations in small-object detection, including excessive computational complexity and inadequate multiscale representation. For example, DETR requires over 500 training epochs and lacks explicit multiscale fusion, while Swin’s fixed window attention restricts contextual adaptability. ECAN-Detector maintains YOLO-style efficiency (one-stage pipeline, 100 epochs) while integrating attention-based context modeling.
As shown in
Table 9, ECAN-Detector achieves an APs of 37.8 on the COCO2017 validation set, surpassing YOLOv9-C and YOLOv8 by +1.6 and +1.1 points, respectively, while maintaining a favorable trade-off between accuracy and computational cost. Compared with Swin-S (838 GFLOPs) and DETR (28 FPS), ECAN runs at 86.2 FPS with only 42.9 GFLOPs, demonstrating its suitability for real-time small-object detection scenarios.
4.4.10. Edge Deployment Insights
We evaluate ECAN-Detector’s practical feasibility by benchmarking its performance under INT8 quantization with TensorRT on three representative NVIDIA devices (NVIDIA Corporation, Santa Clara, CA, USA): the RTX 3090, Jetson AGX Orin, and Jetson Xavier NX.
As shown in
Table 10, the model runs at over 25 FPS on Orin (INT8), meeting real-time constraints for many embedded vision systems.
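As a sketch of the export path used for such edge benchmarking, the model can first be exported to ONNX and then compiled offline into an INT8 TensorRT engine (e.g., with trtexec); the model object and file names below are placeholders, and a calibration dataset would be required for meaningful INT8 accuracy.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)                 # stand-in for the trained ECAN-Detector
model.eval()
dummy = torch.randn(1, 3, 640, 640)               # 640x640 input, matching training
torch.onnx.export(model, dummy, "ecan_detector.onnx",
                  input_names=["images"], output_names=["preds"], opset_version=13)

# The ONNX graph can then be compiled into an INT8 engine on the target device, e.g.:
#   trtexec --onnx=ecan_detector.onnx --int8 --saveEngine=ecan_detector_int8.engine
```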
4.5. Visualization of Results
To clearly illustrate the superiority of our proposed method, we compare its detection results with those of TPH-YOLOv5, YOLOv7, and YOLOv8, as presented in
Figure 5. In the street scene depicted in the first row, YOLOv7 shows the poorest detection performance: it incorrectly classifies the traffic light on the left as a pedestrian and fails to detect the car on the right. TPH-YOLOv5 and YOLOv8 also miss the tricycle in the middle. Our proposed method achieves the most accurate detection in these regions; the only exception is a pedestrian on the left that is missed due to occlusion by a tree.
In the second-row street scene near the mall, the other three models produce multiple false positives on the left side, misclassifying barriers and traffic lights as pedestrians. Our method outperforms them in the left area but still misses some pedestrians and vehicles in the densely packed area on the right.
In the night scene of the final row, YOLOv7 yields the poorest results, missing the tricycle on the zebra crossing and performing inadequately in the densely populated right region. In contrast, ECAN-Detector achieves the best detection performance among all detectors. Although most objects are correctly detected, several pedestrians and vehicles remain unrecognized due to low illumination in the nighttime scene.
Overall, our proposed method surpasses the other detectors in densely populated small-object scenarios and maintains robust performance under low-light conditions, significantly improving the baseline’s detection accuracy.
4.6. Discussion
Previous sections have already demonstrated the effectiveness of our method in both design and empirical results. In this section, we will explore the underlying reasons for its success, focusing on the additional modules and key factors that contribute to its performance. ECAN is specifically designed to strengthen the shallow network’s capability to accurately detect and represent small objects. As shown in
Table 1, the introduction of the P2 layer provides high-resolution feature representations, improving the capture of detailed information about small objects. Meanwhile, DST and CAM further strengthen the shallow contextual information. DST incorporates a scaling mechanism based on cosine attention to stabilize weight calculations, while CAM utilizes a scaling factor with three dilated convolutions to capture contextual information from different receptive fields.
Furthermore, the feature fusion operation effectively integrates both deep and shallow features, enhancing the network’s capacity to represent multiscale features. Our method excels at detecting small objects in crowded environments and performs well under challenging lighting conditions, yielding improvements over the baseline of 0.8% in AP and 0.6% in APs. As shown in Table 6, among the fusion methods, the weighted feature fusion operation outperforms the other approaches, achieving the best results.
To further enhance the performance of small-object detection, we introduce C2f_Rep in the head, as shown in
Table 2. C2f_Rep effectively incorporates gradient flow information while remaining lightweight, without significantly increasing the number of parameters or computations. Additionally, in
Figure 6, we compare the heatmap visualizations of the baseline model, the C2f addition, and our proposed C2f_Rep. In the zebra crossing scene (second row), the heatmap with C2f_Rep shows fewer false detections of objects reflected in the glass compared to the baseline and C2f heatmaps. It also performs better in detecting the densely populated area in the upper right corner. Overall, the heatmap generated by C2f_Rep outperforms the others, effectively reducing false positives and false negatives.
4.7. Comparison with SOTA Detectors
To further validate the superior performance of our method in dense small-object detection scenarios, we compared and analyzed our approach with other state-of-the-art (SOTA) detectors using the VisDrone2021-DET dataset. As shown in
Table 11 and
Table 12, our method achieves a significant improvement in detection precision, with an increase of 3.1% in AP and 3.5% in
APs. It outperforms Faster R-CNN [
1] and Cascade R-CNN [
3], achieving a 40.1% higher FPS and a 5.5% higher AP, all while maintaining similar computational complexity. Compared to DL-YOLOX [
62], our method achieves higher AP (+1.3%), AP50 (+4.5%), and APs (+11.3%), although it incurs a significantly higher computational cost and parameter volume. Furthermore, our method surpasses recent YOLO variants, including YOLOv5 [
5], YOLOv6 [
6], YOLOv8 [
8] and YOLOv9 [
9], in accuracy while maintaining comparable speed. In small-object detection, our method outperforms TPH-YOLOv5 [
38] in all metrics, demonstrating its effectiveness. However, computational efficiency remains an area for future improvement. Specifically, ECAN-Detector introduces a 2.3× increase in GFLOPs (105.3 to 246.3) and a 15.3% rise in parameters (37.2 M to 42.9 M) relative to the YOLOv7 baseline, which lowers real-time throughput from 128 FPS to 54 FPS. These overheads stem mainly from the additional P2 branch, DST blocks, and CAM fusion. Nonetheless, the accuracy gains (+3.1 AP, +3.5 APs) yield a markedly better accuracy-to-cost ratio than heavyweight two-stage models (e.g., Cascade R-CNN, 239.4 GFLOPs/69.1 FPS) while still running above 50 FPS on an RTX 3090, making the model viable for near real-time perception on edge GPUs (e.g., Jetson AGX Orin) after TensorRT sparsity pruning.
Regarding the observed performance gap between the overall AP (27.9%) and the small-object APs (19.9%) on the VisDrone2021-DET dataset, this discrepancy arises from the challenging nature of the dataset. Over 72% of annotated targets are smaller than 32 × 32 pixels, and the images often feature dense object distributions, severe occlusions, motion blur, and complex aerial perspectives. These factors make small objects especially difficult to detect and locate with high precision.
While our CAM and DST modules improve detection under such constraints, the extremely low resolution of small objects, often just a few pixels, limits the effectiveness of standard convolution and attention layers. CAM enhances contextual aggregation, improving recall for partially visible targets, while DST provides cross-scale feature interaction that helps recover structure. However, in cases where small objects are highly crowded or ambiguous, the fused context can also lead to over-smoothing. We mitigate this by jointly optimizing the loss weights and incorporating both modules, leading to a +3.5 APs gain over the YOLOv7 baseline.
5. Conclusions
In intelligent transportation scenarios, detecting small objects remains a major challenge due to their low resolution and limited feature representation. To tackle this, we present ECAN-Detector, an efficient context-aggregation network tailored for small-object detection in complex traffic scenes. Our design incorporates a high-resolution shallow detection layer, a dynamic scaled transformer (DST) for enhanced spatial awareness, and a context-augmentation module (CAM) to enrich contextual semantics. Additionally, we adopt a lightweight C2f_Rep module in place of RepConv in the detection head, achieving higher accuracy for small objects with minimal computational overhead. Experimental results on VisDrone2021-DET show a 3.1% AP gain over the baseline, with consistent improvements validated in the DOTA and COCO datasets. While the reported AP increase may appear incremental, it leads to substantial practical benefits in autonomous driving, especially in reliably detecting distant or occluded objects like pedestrians and road hazards under adverse conditions, ultimately enhancing perception robustness and system safety.
Limitations and Future Work: Despite its promising results, ECAN-Detector introduces a higher computational cost compared to the baseline, leading to slower inference. Enhancing real-time performance remains an open challenge. In future work, we plan to: (1) design loss functions specialized for dense small-object scenarios; (2) refine the backbone to better capture fine-grained details; and (3) improve anchor box allocation to reduce missed detections. We also aim to incorporate neural architecture search and channel sparsification to further optimize the model for edge deployment. These improvements will facilitate more efficient and deployable small-object detection systems.