1. Introduction
With the rapid proliferation of consumer-grade drones, the visual detection of unauthorized Unmanned Aerial Vehicles (UAVs) has become a paramount priority for low-altitude airspace security [1]. Unlike general object detection scenarios [2], anti-UAV systems operate in a highly dynamic ground-to-air visual environment. This introduces a unique “small target, vast background” dilemma in which intruder drones often appear as minute clusters of pixels against complex sky domains, fluctuating clouds, or cluttered urban skylines [3]. Furthermore, distinguishing mechanical drones from biological distractors (e.g., flying birds) poses a severe challenge due to their similar scales and appearance. Compounding these difficulties, edge monitoring platforms, which are typically deployed on battery-powered embedded systems, impose rigorous constraints on computational complexity while demanding both high detection precision and low inference latency.
To meet these demands, the community has witnessed the rapid evolution of efficient detectors, exemplified by the YOLO series [4,5,6,7]. From the early DarkNet [5] to the recent CSPNet [8] and ELAN [9], these architectures have continuously optimized the speed–accuracy curve. Although these general-purpose detectors achieve impressive performance on natural scene benchmarks, they falter when applied to the anti-UAV domain. Existing solutions often prioritize macro-architectural modifications in the neck to aggregate multi-scale features [10,11]. However, they overlook a critical deficiency inherent in their basic building blocks, namely, the phenomenon of spectral homogenization. Redundancy in standard convolutional blocks causes the network to wash out the faint high-frequency signatures of distant drones while it is overwhelmed by dominant background noise such as cloud edges or tree branches. Although multi-branch structures such as Res2Net [12] attempt to enrich feature diversity, they lack specific mechanisms to filter out such environmental contaminants, leading to high false-alarm rates in complex aerial scenes.
In this paper, we fundamentally rethink the design of real-time detectors for the specific task of anti-UAV surveillance. We argue that an effective drone detector must not only be efficient but also granularity-aware and context-sensitive. We identify spectral homogenization as the primary bottleneck, defined as the phenomenon where high-frequency spatial details essential for defining small objects are disproportionately attenuated relative to low-frequency background structures during standard convolutional downsampling. To resolve this, we propose Eagle-YOLO. First, to break the bottleneck of feature homogenization, we redesign the basic building unit into the Hierarchical Granularity Block (HG-Block). By establishing a progressive semantic flow, the HG-Block functions as a “detail anchor” to preserve the pixel-level signatures of tiny targets while accumulating semantics for structure discrimination. Second, to distinguish drones from chaotic environmental distractors, we introduce the Cross-Stage Context Modulation (CSCM) mechanism. This module maintains a global semantic prior to dynamically recalibrate local features, suppressing spurious activations from birds or clouds. Finally, challenging the dogma of homogeneous convolutions, we formulate the Scale-Adaptive Heterogeneous Convolution (SAHC) strategy. This protocol orchestrates kernel sizes according to feature levels, efficiently aligning the Effective Receptive Field (ERF) with the extreme scale distribution inherent in drone detection tasks.
1.1. Contributions
The main contributions of this paper are summarized as follows:
We identify spectral homogenization as the primary bottleneck in applying general real-time detectors to UAV imagery. To resolve this, we propose Eagle-YOLO, a novel framework that dynamically orchestrates multi-granular features, thereby establishing a new paradigm for balancing high-precision aerial detection with the strict latency constraints of edge devices.
We introduce three synergistic components that mimic a raptor’s visual system: the HG-Block for extracting progressive fine-grained details, the CSCM for suppressing background clutter via global priors, and the SAHC strategy for aligning receptive fields with extreme scale variations.
Comprehensive evaluations on authoritative datasets, including DUT Anti-UAV and Anti-UAV, demonstrate the superiority of our method. Eagle-YOLO achieves a remarkable tradeoff between speed and accuracy. For instance, Eagle-YOLOv9-T achieves 79.64% on the DUT Anti-UAV dataset, outperforming the YOLOv9-T baseline by a significant margin of 5.52% while requiring fewer parameters, thereby validating its practicality for real-world UAV applications.
1.2. Organization
The remainder of this paper is organized as follows:
Section 2 reviews the existing literature on real-time object detection and small object detection algorithms tailored for UAVs;
Section 3 articulates the proposed Eagle-YOLO framework, detailing the designs of the HG-Block, CSCM mechanism, and SAHC strategy;
Section 4 presents the experimental setup, implementation details, and a comprehensive analysis of the results, including ablation studies and comparisons with state-of-the-art (SOTA) methods on the DUT Anti-UAV and Anti-UAV datasets; finally,
Section 5 concludes the paper and discusses potential avenues for future research.
3. Methodology
Robust visual detection of unauthorized UAVs remains a paramount challenge in low-altitude airspace security. The task is plagued by the dilemma of small targets against a vast background, in which drones often appear as minute clusters of pixels against complex sky domains or cluttered urban skylines [10,11]. In this section, we articulate the methodology of Eagle-YOLO, a real-time detector engineered to address these challenges. Unlike previous methods that rely on static building blocks, we propose a dynamic feature encoding framework centered on three novel components. First, to capture the diverse spatial details of airborne targets, we design the HG-Block. This structure replaces conventional bottlenecks with a multi-branch topology that enriches inter-channel feature diversity, ensuring robust encoding for both distant point-like drones and proximate targets with distinct mechanical structures. Second, to counteract severe environmental distractors, we introduce the CSCM mechanism. This module leverages a global context query to bridge different network stages, suppressing harmful background redundancy while guiding the focus toward salient airborne threats. Finally, we formulate the SAHC strategy. By strategically orchestrating convolutions with varying kernel sizes, SAHC aligns the ERF with the inherent scale distribution of drone targets, achieving a favorable tradeoff between localization precision and inference speed.
3.1. Overall Architecture of Eagle-YOLO
The architectural framework of Eagle-YOLO is designed to balance the strict computational constraints of edge monitoring devices with the rigorous demands for anti-UAV precision. We select RTMDet [61] as our baseline due to its efficient architecture, which features a CSPDarkNet backbone and CSP-PAFPN neck enhanced by large-kernel ($5 \times 5$) depth-wise convolutions. While RTMDet effectively expands the receptive field for general object detection, its uniform kernel size and standard bottleneck design struggle to handle the “small target, vast background” dilemma inherent in UAV imagery.
To bridge this gap, Eagle-YOLO diverges from the standard RTMDet architecture in three critical aspects. First, we replace the homogeneous large-kernel blocks with the HG-Block, establishing a multi-branch topology that acts as a detail anchor for tiny objects while aggregating context for larger ones. Second, distinct from the local processing in RTMDet, we introduce the CSCM mechanism to leverage global semantic priors for suppressing environmental distractors. Finally, instead of the fixed $5 \times 5$ kernels used in RTMDet, we implement the SAHC strategy (detailed in
Section 3.4). This strategy dynamically orchestrates kernel sizes (ranging from
to
) across stages, ensuring that fine-grained signatures of distant drones are preserved strictly by smaller kernels while broad semantic contexts of proximate targets are captured by larger kernels. To consolidate global context, a Spatial Pyramid Pooling (SPP) module [62] is retained at the end of the backbone.
For the neck, we deploy a Path Aggregation Feature Pyramid Network (PAFPN) [10,11] to effectively fuse multi-level features. Crucially, the standard bottlenecks in the neck are upgraded to our novel HG-Blocks (detailed in Section 3.2), with the SAHC strategy consistently applied across both the neck and the detection head to maximize scale adaptability. Furthermore, to optimize inference latency for real-time monitoring, we adjust the channel depth of the backbone features and introduce three scalable variants: Eagle-YOLO-T, S, and M.
3.2. Granularity-Aware Feature Modeling
Visual perception of UAVs faces a fundamental contradiction: the feature extractor must safeguard high-frequency transients for detecting distant point-like drones while simultaneously abstracting low-frequency semantics to differentiate proximate drones from avian distractors. Standard blocks such as CSP [8] apply uniform filtering across all channels. This often results in feature erosion, where the faint signatures of micro-UAVs are overwhelmed by the high-amplitude gradients of the sky background. To mitigate this, we propose the HG-Block, which operates as a progressive multi-spectral distiller.
The HG-Block diverges from the holistic processing paradigm of previous YOLO architectures by adopting a split-and-aggregate strategy. As illustrated in Figure 1, the input feature tensor $X \in \mathbb{R}^{C \times H \times W}$ is projected and partitioned along the channel dimension into $M$ distinct granularity fragments, denoted as $\{x_1, x_2, \dots, x_M\}$, where each fragment $x_i \in \mathbb{R}^{(C/M) \times H \times W}$.
The mechanism utilizes a Cascaded Context Injection (CCI) pathway to enforce granular interaction. Instead of standard residual addition between layers, this pathway establishes a recursive dependency between channel fragments within the same layer. Specifically, the refined output of the preceding branch $y_{i-1}$ is injected as a semantic prior into the current fragment $x_i$. This forces the subsequent branches to encode the discrepancy between local texture and global structure. The recursive aggregation is mathematically defined as

$$
y_i =
\begin{cases}
x_i, & i = 1, \\
\mathcal{F}_i(x_i + y_{i-1}), & 1 < i \le M,
\end{cases}
$$

where $\mathcal{F}_i(\cdot)$ represents the transformation function of the Inverted Bottleneck Module and + denotes pixel-wise summation. In this hierarchy, the initial branch $y_1$ serves as a high-frequency preserver, bypassing heavy convolution to strictly retain the pixel-level details essential for micro-target localization. Conversely, deeper branches where $i > 1$ function as context aggregators, progressively widening the effective receptive field. The final output is reconstructed via concatenation: $Y = \mathrm{Concat}(y_1, y_2, \dots, y_M)$. By implementing $\mathcal{F}_i$ with depthwise separable convolutions, the HG-Block achieves granular feature isolation with minimal computational overhead, making it well suited to edge-based anti-drone deployment.
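As a minimal illustration of the split, cascade, and concatenate pattern described above, the following pure-Python sketch operates on a list of flattened channel maps. It is not the authors' implementation: the names `hg_block`, `add_fragments`, and `halve` are hypothetical, and `halve` is only a stand-in for the learned Inverted Bottleneck transform.

```python
from typing import Callable, List

Channel = List[float]      # one flattened feature map (H*W values)
Fragment = List[Channel]   # a group of channels


def add_fragments(a: Fragment, b: Fragment) -> Fragment:
    """Pixel-wise summation of two fragments with identical shape."""
    return [[p + q for p, q in zip(ca, cb)] for ca, cb in zip(a, b)]


def hg_block(channels: List[Channel], num_fragments: int,
             transform: Callable[[Fragment], Fragment]) -> List[Channel]:
    """Split channels into fragments, cascade them, and concatenate."""
    if len(channels) % num_fragments != 0:
        raise ValueError("channel count must be divisible by num_fragments")
    width = len(channels) // num_fragments
    fragments = [channels[i * width:(i + 1) * width]
                 for i in range(num_fragments)]

    # First branch: passed through untouched to keep pixel-level detail.
    outputs = [fragments[0]]
    for x_i in fragments[1:]:
        # Inject the previous branch output, then apply the transform.
        outputs.append(transform(add_fragments(x_i, outputs[-1])))

    # Concatenate all branch outputs along the channel dimension.
    return [channel for fragment in outputs for channel in fragment]


def halve(fragment: Fragment) -> Fragment:
    """Hypothetical stand-in for the learned transform (assumption)."""
    return [[0.5 * v for v in channel] for channel in fragment]


# Toy input: C = 4 channels, each with H*W = 2 values.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
result = hg_block(features, num_fragments=4, transform=halve)
```

In this toy run the first channel passes through unchanged, while each later channel mixes its own fragment with the output of the branch before it, which is the recursive dependency the CCI pathway is meant to enforce.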
It is pertinent to distinguish the proposed HG-Block from the established Res2Net architecture. While both designs utilize a split-and-process strategy to enhance multi-scale representation, their core mechanisms and objectives differ fundamentally. Res2Net focuses on expanding the receptive field by hierarchically adding feature maps from previous splits. In contrast, our HG-Block employs a Cascaded Context Injection pathway designed for granularity preservation. Instead of simple addition, we enforce a recursive dependency where high-frequency residuals from shallow branches are injected into deeper semantic branches. This ensures that the fine-grained texture details of small drones are not lost during semantic abstraction but are instead carried forward as a detail anchor.
3.3. Global Context Aggregation via Query Learning
Despite the granular precision of the HG-Block, its receptive field remains topologically confined. In the domain of counter-UAV operations, this spatial isolation introduces susceptibility to high-frequency environmental noise where the detector may conflate the flapping of avian wings or the edges of cumulus clouds with the mechanical signatures of drones. To rectify this, we engineer the CSCM. This module functions not merely as an attention layer but as a global semantic gatekeeper integrated into the neck of the network.
The underlying mechanism assumes the existence of learnable environmental prototypes that dictate feature importance based on the macroscopic scene composition. We define these latent prototypes as a parameter matrix $Q \in \mathbb{R}^{C \times d}$, where $C$ represents the number of feature channels and $d$ denotes the embedding dimension. The number of rows $C$ is strictly aligned with the channel width of the input feature map to enable element-wise modulation. The embedding dimension $d$ is kept small to create a compact bottleneck. This design choice is critical as it forces the network to learn robust low-dimensional semantic abstractions rather than overfitting to specific pixel noise in the training dataset. The matrix is initialized randomly and updated end-to-end via standard backpropagation, effectively encoding the global statistics of the training dataset into a persistent memory.
For an incoming feature tensor $X \in \mathbb{R}^{C \times H \times W}$ produced by the HG-Block, we first compress its spatial redundancy into a compact scene descriptor $s$ via global average pooling followed by a linear projection, $s = W(\mathrm{GAP}(X))$. The CSCM then evaluates the compatibility between the current scene descriptor $s$ and the global prototypes $Q \in \mathbb{R}^{C \times d}$. This interaction generates a channel-wise gating vector that dynamically suppresses background-dominant channels towards zero while keeping the gain of target-relevant channels close to one. The modulation process is mathematically formalized as follows:

$$
\tilde{X} = X \otimes \sigma(Q\,s), \qquad s = W(\mathrm{GAP}(X)),
$$

where $W$ represents a linear transformation layer mapping the channel dimension from $C$ to the embedding dimension $d$, $\sigma$ denotes the Sigmoid activation function, and ⊗ indicates channel-wise broadcasting multiplication. Through this global-to-local interaction, the network learns to implicitly categorize the aerial background style and adjust the feature response gain accordingly. This ensures robust target lock-on even in clutter-rich airspaces without requiring explicit supervision for background types.
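The gating steps described above (global average pooling, linear projection, comparison with the prototypes, Sigmoid, channel-wise multiplication) can be sketched in pure Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names, the toy sizes (two channels, embedding dimension one), and all weight values are hypothetical, and biases are omitted.

```python
import math
from typing import List

Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def cscm_gate(feature: Matrix, projection: Matrix,
              prototypes: Matrix) -> List[float]:
    """Compute one gate value per channel.

    feature:    C channels, each a flattened H*W map
    projection: d x C weights of the linear layer (assumed bias-free)
    prototypes: C x d learnable prototype matrix
    """
    pooled = [sum(ch) / len(ch) for ch in feature]   # global average pooling
    scene = matvec(projection, pooled)               # scene descriptor, size d
    logits = matvec(prototypes, scene)               # one logit per channel
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def cscm_modulate(feature: Matrix, projection: Matrix,
                  prototypes: Matrix) -> Matrix:
    """Scale every channel by its gate (broadcast over spatial positions)."""
    gate = cscm_gate(feature, projection, prototypes)
    return [[g * v for v in ch] for g, ch in zip(gate, feature)]


# Toy example with hypothetical values: C = 2 channels, d = 1.
feature = [[1.0, 3.0], [2.0, 2.0]]
projection = [[0.5, 0.5]]
prototypes = [[2.0], [-2.0]]

gate = cscm_gate(feature, projection, prototypes)
modulated = cscm_modulate(feature, projection, prototypes)
```

With these toy weights the first channel is kept almost intact and the second is pushed towards zero, which mirrors the intended suppression of background-dominant channels.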
3.4. Scale-Adaptive Heterogeneous Kernel Selection
In addition to the micro-design of building blocks, we optimize the macro-architectural choice of convolution kernels. A prevailing limitation in real-time detectors is the exclusive use of homogeneous convolutions. While computationally uniform, this one-size-fits-all approach is suboptimal for drone detection, which demands a variable visual span.
In a feature pyramid, shallow stages process high-resolution maps rich in fine-grained semantics which are vital for point-like distant drones. Conversely, deep stages handle low-resolution maps. Applying uniform small kernels limits the ERF in deep layers and potentially fragments the semantic integrity of nearby large drones. Conversely, indiscriminately applying large kernels in shallow layers introduces contaminative noise from the sky or ground boundary.
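To make the effect of kernel size on the visual span concrete, the sketch below computes the theoretical receptive field of a stack of convolutions using the standard recurrence (each layer adds its kernel extent minus one, scaled by the accumulated stride). The theoretical value is only an upper bound on the ERF, and the layer stacks shown are hypothetical examples with one stride-2 convolution per stage, not the configuration of Eagle-YOLO.

```python
from typing import Iterable, Tuple


def theoretical_receptive_field(layers: Iterable[Tuple[int, int]]) -> int:
    """Return the receptive field (in input pixels) of stacked convolutions.

    Each layer is a (kernel_size, stride) pair, listed from input to output.
    """
    field = 1   # a single input pixel sees itself
    jump = 1    # spacing, in input pixels, between adjacent output positions
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


# Hypothetical four-stage stacks (assumed values, for illustration only).
uniform_small = [(3, 2), (3, 2), (3, 2), (3, 2)]
heterogeneous = [(3, 2), (3, 2), (5, 2), (7, 2)]

uniform_rf = theoretical_receptive_field(uniform_small)        # 31 pixels
heterogeneous_rf = theoretical_receptive_field(heterogeneous)  # 71 pixels
```

Under these assumptions, enlarging only the two deepest kernels more than doubles the theoretical span while the shallow stages keep their compact footprint.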
To resolve this, we introduce the SAHC strategy. Instead of relying on runtime kernel prediction, we define a deterministic layer-wise heterogeneity where the kernel size $k$ is strictly coupled with the feature stage $s$. We formulate a stage-wise kernel function, denoted as $k(s)$, to enforce a structural hierarchy.
Shallow Stages (): We utilize compact kernels to act as a foveal focus, strictly preserving local pixels for tiny object regression.
Deep Stages (): We progressively expand kernels. These act as peripheral scanners that dramatically enlarge the ERF to capture the holistic structure of proximate targets and distinguish them from similar-sized birds based on structural context.
The specific architectural configuration of this strategy is detailed in
Table 1. SAHC constitutes a fixed architectural prior, ensuring deterministic inference speed. To mitigate the computational overhead typically associated with the large kernels shown in the table (e.g.,
), we employ two strategic optimizations. First, all large-kernel operations are implemented as depth-wise separable convolutions, which significantly reduces the parameter count and floating-point operations compared to standard convolutions. Second, the kernel size expansion is inversely coupled with the feature map resolution. As observed in
Table 1, the largest
kernels are exclusively applied to Stage-4, where the spatial resolution is downsampled to 1/32 of the input. Consequently, despite the increased kernel footprint, the actual number of multiplication operations remains manageable. This design ensures that the expanded receptive field achieves a robust tradeoff between structural perception and real-time inference speed.
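The two cost arguments above can be checked with simple arithmetic. The sketch below counts multiply-accumulate operations for a standard convolution and for a depthwise separable one at the Stage-4 resolution. All concrete numbers are assumptions made for illustration: the stage-to-kernel table is a placeholder rather than the contents of Table 1, and the 640-pixel input, 256 channels, and the mapping of stage $s$ to a resolution of $1/2^{s+1}$ are hypothetical choices consistent with Stage-4 running at 1/32.

```python
# Hypothetical stage -> kernel-size table. The values of Table 1 are not
# reproduced here, so these numbers are placeholders only.
HYPOTHETICAL_KERNELS = {1: 3, 2: 3, 3: 5, 4: 7}


def kernel_for_stage(stage: int) -> int:
    """Fixed lookup: no runtime kernel prediction is involved."""
    return HYPOTHETICAL_KERNELS[stage]


def stage_resolution(input_size: int, stage: int) -> int:
    """Assumes stage s runs at 1 / 2**(s + 1) of the input (Stage-4: 1/32)."""
    return input_size // 2 ** (stage + 1)


def standard_macs(size: int, c_in: int, c_out: int, k: int) -> int:
    """Multiply-accumulates of a standard k x k convolution (stride 1)."""
    return size * size * c_in * c_out * k * k


def depthwise_separable_macs(size: int, c_in: int, c_out: int, k: int) -> int:
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = size * size * c_in * k * k
    pointwise = size * size * c_in * c_out
    return depthwise + pointwise


size = stage_resolution(640, 4)                       # 20 x 20 feature map
k = kernel_for_stage(4)                               # placeholder kernel size
separable = depthwise_separable_macs(size, 256, 256, k)
standard = standard_macs(size, 256, 256, k)
ratio = standard / separable                          # roughly 41x in this setup
```

In this hypothetical setting the separable form needs about 31 million multiply-accumulates against roughly 1.28 billion for the standard convolution, and applying the same depthwise kernel at a 1/8-resolution map would cost 16 times more than at 1/32, which is the reasoning behind confining the largest kernels to the deepest stage.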
5. Conclusions
In this paper, we have presented Eagle-YOLO, a novel real-time object detector specifically engineered to address the “small target, vast background” dilemma in UAV imagery. By rethinking the fundamental building blocks of convolutional networks, we identify feature homogenization as a critical bottleneck and propose a synergistic solution. Our introduced HG-Block successfully preserves fine-grained details for small targets through progressive semantic distillation, boosting the detection of small objects ($AP_S$) by 2.2% over the baseline. Complementing this, the CSCM mechanism effectively suppresses environmental distractors by integrating global contextual priors, while the SAHC strategy ensures optimal receptive field alignment across varying object scales, leading to a substantial 2.69% increase in large object detection ($AP_L$). Empirical results on the DUT Anti-UAV dataset confirm that Eagle-YOLO achieves SOTA performance, with our Eagle-YOLOv8-M model reaching 90.26%, significantly outperforming robust baselines in both accuracy and inference speed. These findings validate Eagle-YOLO as a highly effective and practical solution for real-world resource-constrained aerial surveillance tasks. Future work will explore extending this framework to multi-modal data and further optimizing it for more diverse aerial scenarios.