1. Introduction
Unmanned Aerial Vehicles (UAVs), owing to their cost-effectiveness and high efficiency, have found widespread applications in numerous critical domains, including military reconnaissance, security surveillance, and precision agriculture [1,2]. However, the unauthorized or excessive operation of UAVs has concurrently raised significant security and privacy concerns [3], posing a substantial threat to low-altitude airspace security [4]. Consequently, the detection of airborne UAVs is a critical component in safeguarding low-altitude airspace and mitigating potential threats.
Presently, anti-UAV object detection methodologies predominantly draw upon advancements in the general object detection field [4], broadly categorized into traditional computer vision techniques and deep learning-based approaches. Early traditional methods, such as those based on background subtraction or employing handcrafted features like Haar [5] and HOG [6], could achieve preliminary UAV detection in scenarios with minimal background interference. However, these methods rely on empirically designed features with limited representative power, struggling to adapt to real-world scenarios involving complex backgrounds, varying illumination conditions, or rapid UAV movements, thereby resulting in suboptimal detection accuracy and robustness.
To support the development and fair comparison of anti-UAV detection methods, several public benchmark datasets have been released in recent years. Among them, DUT-Anti-UAV [7] is a representative large-scale dataset featuring a high proportion of small UAV targets captured in complex outdoor environments, while Det-Fly [8] provides diverse UAV instances under varying viewpoints and background conditions. These benchmarks offer standardized evaluation settings and serve as important testbeds for validating detection performance in realistic scenarios. Nevertheless, UAV detection in such real-world environments remains highly challenging due to extreme scale variation of small targets and severe interference from cluttered backgrounds (e.g., urban buildings, dense vegetation, and complex sky scenes). As illustrated in Figure 1, small UAVs often occupy only a few pixels and share similar visual characteristics with surrounding background structures, which makes accurate detection particularly difficult under complex outdoor conditions.
With the advent of deep learning, Convolutional Neural Network (CNN)-based object detectors, particularly single-stage algorithms represented by the YOLO series (e.g., YOLOv8 [9], YOLOv10 [10], YOLOv12 [11]), have demonstrated significant potential and progressively become mainstream in anti-UAV detection tasks. Given that UAV targets exhibit variable scales and are often subject to interference from complex backgrounds, researchers have introduced numerous modifications to these general frameworks from various perspectives. On the one hand, to address the scale variations of UAV targets, some works have focused on fusing features from different scales to enhance the visual representation of UAVs [12], thereby improving detection capabilities for UAVs of varying sizes. To achieve this, researchers have designed efficient multi-scale feature fusion modules, such as FPN [13], PANet [14], and BiFPN [15]. Zhai et al. [16] incorporated multi-scale prediction layers into YOLOv3, utilizing the last four feature maps for multi-scale prediction, which enhanced the detection performance for small targets. Jiang et al. [17] designed a Reparameterized Cross-stage Efficient Layer Aggregation Network (RCELAN) and a Bidirectional Feature Pyramid Network (BiFPN) to bolster multi-scale feature processing capabilities, thereby improving detection accuracy. Huang et al. [18] improved upon the YOLOv8 model by employing Deformable Convolutional Networks v2 (DCNv2) to adaptively handle local details and scale variations of UAVs. Zhang et al. [19] significantly enhanced the detection capability of YOLOv10 for extremely small UAVs by introducing bidirectional scale attention to fuse information across different scales. Gao et al. [20] improved YOLOv11 by incorporating Haar wavelet downsampling and adding a dedicated small object detection head, markedly boosting detection accuracy for small-sized UAVs.
In addition to CNN-based detectors, Transformer-based approaches have recently attracted increasing attention in UAV detection due to their strong capability in modeling global context and long-range dependencies. Several studies have introduced Vision Transformer or DETR-style frameworks into UAV detection and tracking, demonstrating improved robustness under occlusion, cluttered backgrounds, and target re-appearance. However, most existing Transformer-based methods primarily perform non-local modeling within a single feature scale, which limits their ability to effectively capture cross-scale semantic relationships that are crucial for detecting small UAVs with significant scale variations.
On the other hand, to tackle complex background interference, some studies have focused on enhancing the features to extract more useful contextual information [21], thereby addressing interference and improving the overall scene understanding for UAVs. Various attention mechanisms have been integrated into anti-UAV detection models, such as CBAM [22] and SimAM [23]. Huang et al. [24] proposed a Dual Semantic Feature Extraction mechanism (DS-RPN) to generate target candidate boxes, eliminating complex background interference through extracted semantic features. Zhao et al. [25] integrated Transformer encoders, global attention, and coordinate attention into the C3 module of YOLOv5, improving the detection accuracy and robustness for UAV targets in complex environments. Ma et al. [26] proposed the LA-YOLO network based on YOLOv5, which integrates the SimAM attention mechanism and a normalized Wasserstein distance fusion block, enhancing detection accuracy in low-altitude backgrounds. Bo et al. [27] adjusted prior box sizes and introduced a Get-and-Send module into YOLOv7, improving object detection accuracy in complex backgrounds. Shi et al. [28] introduced graph attention to retain more foreground information while reducing the impact of the background.
Despite recent advancements, the detection accuracy of existing methods often deteriorates in challenging real-world environments like urban districts and parks. Two primary factors contribute to this degradation: the inherently small scale of UAV targets and severe interference from cluttered backgrounds [29], including buildings and vegetation. We posit that the primary reasons for this issue are twofold: (1) Insufficient cross-scale information interaction: UAVs present diverse scale variations at different distances. Existing multi-scale feature fusion methods often rely on simple addition or concatenation, failing to adequately explore and leverage the deep correlations and complementarities between feature maps at different scales. For example, widely adopted architectures such as FPN [13], PANet [14], and BiFPN [15] perform cross-scale feature fusion mainly through element-wise addition or channel concatenation along top-down and/or bottom-up pathways. In particular, they lack effective modeling of cross-scale global dependencies, i.e., explicit interactions among feature representations at different spatial resolutions (a capability explicitly modeled by the proposed CNFI module). This limits the model’s ability to distinguish targets of varying scales and particularly affects detection robustness for small objects. (2) Inadequate capability to counteract complex environmental interference: Complex backgrounds (e.g., clouds in the sky, building contours, tree branches, moving vegetation) often contain textures or shapes similar to UAVs, causing severe interference to detection. Existing models, even when incorporating CNNs or Transformers to enhance contextual modeling, often fail to specifically distinguish intricate details within the image. Consequently, they struggle to effectively suppress background noise and accurately differentiate targets from distractors.
To address the aforementioned challenges, this paper introduces an anti-UAV object detection method founded on non-local feature learning. Non-local feature learning serves as the fundamental principle underlying the design of the proposed Cross-Scale Non-local Feature Interaction (CNFI) and Non-local Feature Enhancement (NFE) modules, which are implemented within an end-to-end object detection framework. The core idea of this method is to overcome the limitations of traditional approaches in modeling cross-scale global dependencies and effectively enhancing target feature representations. Specifically, we design two key modules: (1) Cross-Scale Non-local Feature Interaction (CNFI) module: Addressing the issue that existing non-local attention mechanisms are typically confined to intra-scale interactions, the CNFI module is designed to explicitly model long-range dependencies between feature maps at disparate scales. By computing correlations between pairs of feature pixels across scales, CNFI can integrate global contextual information, effectively suppress interference from background regions at different scales, and enhance the perception of small targets. (2) Non-local Feature Enhancement (NFE) module: To further improve the discriminability of UAV features amidst complex background interference, the NFE module fuses global contextual information, acquired via non-local attention, with low-level structural cues such as image gradients. This combination enables the model to focus not only on semantic information but also on the boundary and shape details of the target. Consequently, in scenarios with complex backgrounds and weak target features, it reinforces the feature representation of UAVs, thereby enhancing detection precision.
The major contributions of this study are summarized as follows:
We propose a Cross-scale Non-local Feature Interaction (CNFI) module to facilitate deep interactions among multi-scale features, thereby enhancing the model’s adaptability to significant target scale variations.
We design a Non-local Feature Enhancement (NFE) module that integrates global semantic context with gradient-based structural cues, effectively improving robustness against complex background interference.
Extensive experiments on the DUT-Anti-UAV and Det-Fly datasets demonstrate that the proposed method consistently outperforms state-of-the-art anti-UAV detection approaches.
The remainder of this paper is organized as follows: Section 2 reviews related work. Section 3 details the overall architecture of the proposed model. Section 4 describes the experimental setup, evaluation metrics, and results. Section 5 discusses the limitations of our method and outlines future directions. Finally, Section 6 concludes the paper.
2. Literature Review
UAV detection has become an essential research topic in recent years due to the rapid proliferation of drones in civil, industrial, and military environments. Early anti-UAV detection methods predominantly relied on handcrafted visual features such as Haar [5] and HOG [6], which suffer from limited robustness under illumination variation, cluttered backgrounds, and rapid UAV motion. Traditional background subtraction and motion-based approaches also struggle in complex scenes due to handcrafted assumptions that fail under dynamic environments.
With the advancement of deep learning, CNN-based object detectors have become the dominant choice for UAV detection. A number of YOLO-series networks, including YOLOv3, YOLOv11, and YOLOv12 [11,20,30], have significantly improved detection accuracy and efficiency. However, these models often fail to capture fine-grained details of small UAV targets, especially when targets occupy less than 1% of the image area. To address scale variation, multi-scale feature fusion structures such as FPN [13], PANet [14], BiFPN [15], and their numerous variants have been widely adopted. Several UAV-specific studies further extend these architectures. For example, Zhai et al. [16] extended multi-layer prediction for better small-target detection, while Jiang et al. [17] introduced a reparameterized feature aggregation design to enhance multi-scale robustness. Zhang et al. [19] improved YOLOv10 with bidirectional scale attention to capture enhanced multi-resolution dependencies.
In parallel, local attention mechanisms have shown notable progress in enhancing feature representation quality. Lightweight attention modules such as CBAM [22] and SimAM [23] have been widely integrated into UAV detectors to suppress background noise and highlight discriminative regions. Ma et al. [26] fused SimAM and Wasserstein normalization for low-altitude background suppression, while Bo et al. [27] improved YOLOv7 by refining spatial focus. Additionally, graph attention and context-aware modules [21,28] have been explored to enhance foreground feature retention.
Recent advances in Transformers [31,32] have introduced powerful global context modeling capabilities. Yu et al. [33] build their anti-UAV tracker around a Transformer-based local tracker and augment it with Transformer-style global re-detection and background-aware alignment modules, enabling attention to maintain target–context associations under occlusion, drift, and re-appearance. Zhu et al. [34] adapt the DETR Transformer paradigm for infrared small-UAV-swarm detection by improving how queries attend to small targets, via enhanced cross-level feature aggregation and geometry-aware attention, so that global self-attention better separates dense, low-contrast instances. However, these methods only compute non-local attention within a single scale, which restricts their ability to capture semantic relationships across different scales. Considering that UAV features vary drastically across scales (lower-resolution maps contain global semantic cues, while higher-resolution maps retain boundary details), pure intra-scale non-local modeling is insufficient for robust UAV detection.
To address challenges posed by complex backgrounds, several studies incorporate gradient or structure priors. Gradient-based enhancement methods [35,36] and edge-aware convolution modules [37,38] boost contour clarity and improve localization accuracy. Multi-branch convolution designs have also been proposed to enrich multi-receptive-field representation [39,40], although they lack the ability to capture global context. Recent works integrate global and local structural cues via hybrid attention [23], but few studies explicitly couple gradient features with non-local modeling.
Anti-UAV detection also faces challenges in dynamic outdoor environments. Small UAVs often appear in highly cluttered scenes such as foliage, urban areas, and sky regions with varying illumination. Studies have reported that complex backgrounds contribute heavily to false positives, while small-scale targets lead to significant false negatives [29]. Video-based tracking approaches can alleviate some of these issues, yet single-frame detectors remain preferable in many real-time surveillance applications due to lower complexity requirements.
In summary, existing research provides valuable progress in multi-scale fusion, attention-based enhancement, and contextual modeling. However, two critical limitations remain: (1) multi-scale fusion largely relies on local operations and lacks explicit cross-scale non-local interaction; (2) non-local attention methods seldom incorporate gradient cues, limiting boundary-awareness. These deficiencies directly motivate the design of our CNFI and NFE modules, which address the limitations of current methods by reinforcing cross-scale semantic relationships and enhancing gradient-aware structural cues, respectively.
3. Method
Currently, anti-UAV object detection faces two primary challenges: first, insufficient cross-scale information interaction, and second, inadequate robustness against complex environmental interference. To address these challenges, this paper proposes: (1) a Cross-scale Non-local Feature Interaction (CNFI) module to resolve the issue of inadequate cross-scale information interaction, and (2) a Non-local Feature Enhancement (NFE) module to bolster target feature representation and suppress complex background interference. These modules are integrated into an end-to-end detection system, in which non-local feature learning is employed as the core mechanism to model cross-scale dependencies and enhance feature representations.
The overall architecture of the network is illustrated in Figure 2a. Given an input image, the network initially extracts foundational features through the initial convolutional layers of the backbone. Subsequently, these foundational features are progressively enhanced by multi-stage NFE modules to reinforce the representation of target boundaries and shapes. During this process, features from different levels are combined via upsampling, downsampling, and concatenation operations to form a multi-scale feature pyramid. Next, the CNFI module facilitates deep interaction among features from different scales, thereby augmenting the detection capability for small targets. Finally, the enhanced multi-scale features are fed into three detection branches to generate detection results at different scales.
Specifically, the network first employs a backbone similar to that of YOLOv12 [11] for multi-scale feature extraction. Unlike YOLOv12 [11], this paper replaces the original C3f modules with the designed NFE modules to enhance feature representation capabilities and counteract interference from complex backgrounds. Subsequently, the extracted multi-scale features are processed by the CNFI module for fusion and interaction, enabling comprehensive learning of cross-scale information to adaptively address multi-scale targets. Following CNFI processing, these multi-scale features are progressively upsampled and fused with the multi-scale features extracted by the backbone, and then further enhanced by NFE modules at different scales. Finally, detection is performed using a detection head consistent with that of YOLOv12 to obtain the final detection results.
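To make the stage ordering concrete, the following minimal PyTorch-style sketch traces the data flow described above. All callables (`backbone`, `cnfi`, `fuse`, `nfe_blocks`, `heads`) are placeholders to be supplied by the reader, not the paper's implementation; only the ordering of stages follows the text.

```python
from typing import Callable, List, Sequence
import torch


def pipeline_sketch(image: torch.Tensor,
                    backbone: Callable,
                    cnfi: Callable,
                    fuse: Callable,
                    nfe_blocks: Sequence[Callable],
                    heads: Sequence[Callable]) -> List[torch.Tensor]:
    """Stage ordering only; every callable is a user-supplied placeholder."""
    p3, p4, p5 = backbone(image)               # multi-scale backbone features (NFE blocks inside)
    p3, p4, p5 = cnfi([p3, p4, p5])            # cross-scale non-local interaction (Section 3.1)
    feats = fuse([p3, p4, p5])                 # progressive upsampling + fusion with backbone features
    feats = [nfe(f) for nfe, f in zip(nfe_blocks, feats)]   # per-scale NFE refinement (Section 3.2)
    return [head(f) for head, f in zip(heads, feats)]        # three detection branches
```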
3.1. Cross-Scale Non-Local Feature Interaction (CNFI)
Traditional feature fusion methods, such as FPN [13], typically employ simple addition or concatenation operations. Such approaches often fail to adequately capture the complex correlations between features at different scales. Existing non-local attention mechanisms are also primarily confined to feature interactions within a single scale [32,41,42,43], making it difficult to establish global dependencies across scales. UAV targets exhibit variable sizes in images, and features from different scales contain complementary information [43]; for instance, small-scale feature maps contain more global semantic information, while large-scale feature maps possess richer spatial details. Evidently, effective integration of this cross-scale information is crucial for enhancing small object detection capabilities, particularly in anti-UAV scenarios. Therefore, this paper proposes the CNFI module, which facilitates interaction among multi-scale features and constructs a cross-scale attention computation mechanism to efficiently integrate cross-scale information.
In the CNFI module, multi-scale features are encoded as scale-specific queries $Q_i$, while a shared Key–Value memory is constructed by concatenating all encoded features, i.e., $K = V = \mathrm{Concat}(Q_1, Q_2, Q_3)$. Each query attends to the same K–V bank, enabling explicit cross-scale non-local interaction. Compared with pairwise cross-scale attention, this shared KV design provides a unified global context and reduces computational overhead, which is particularly beneficial for small-UAV detection under extreme scale variation and cluttered backgrounds.
The detailed structure of the CNFI module is depicted in Figure 2b. The module receives multiple feature maps ($F_1$, $F_2$, $F_3$) from different feature levels as input. Subsequently, each input feature map is processed by an encoder (EnC = Conv + DP) to generate query features $Q_1$, $Q_2$, $Q_3$:
$$Q_i = \mathrm{EnC}_i(F_i) = \mathrm{DP}_i(\mathrm{Conv}(F_i)), \quad i = 1, 2, 3,$$
where $\mathrm{EnC}_i$ denotes the feature encoding operation for the $i$-th scale, implemented via a Convolution (Conv) layer followed by a Downsampling (DP) operation. To enable interaction across scales, the resolutions of these features are unified, typically by down-sampling all features to match the resolution of the smallest feature map, thereby reducing computational complexity. The encoding for different scales employs distinct downsampling factors.
After encoding the input features at different scales, they are concatenated along the channel dimension to form a shared key K and value V representation across scales. This design allows the key and value to jointly encode complementary information from multiple scales, while the query preserves scale-specific characteristics. As a result, the query from each individual scale computes attention with the same K–V pair, enabling explicit cross-scale information interaction. In this way, the Cross-scale Non-local Attention module constructs a correlation matrix that captures dependencies among features from different scales.
For the $i$-th scale, the attention weights are computed as:
$$A_i = \mathrm{Softmax}\!\left(\frac{Q_i K^{\top}}{\sqrt{d}}\right),$$
where $d$ denotes the feature dimensionality used for normalization. This operation measures the similarity between the scale-specific query $Q_i$ and the shared key $K$, producing attention weights that reflect the relative importance of features from different scales.
Based on the obtained attention weights, cross-scale contextual information is aggregated as:
$$Z_i = A_i V.$$
Finally, the aggregated feature $Z_i$ is transformed back to the original scale through a decoder and combined with the input feature $F_i$ via a residual connection to produce the output feature:
$$F_i' = F_i + \mathrm{DeC}_i(Z_i),$$
where $\mathrm{DeC}_i$ denotes the decoding operation for the $i$-th scale, implemented by a convolution layer followed by an upsampling operation. A more intuitive description of the overall procedure is provided in Algorithm 1.
Algorithm 1 Cross-scale Non-local Feature Interaction (CNFI)
1: for each scale $i$ do
2:  $Q_i \leftarrow \mathrm{EnC}_i(F_i)$   ▷ Conv + downsampling to the smallest resolution
3: end for
4: $K \leftarrow \mathrm{Concat}(Q_1, Q_2, Q_3)$
5: $V \leftarrow \mathrm{Concat}(Q_1, Q_2, Q_3)$
6: for each scale $i$ do
7:  $A_i \leftarrow \mathrm{Softmax}\big(Q_i K^{\top} / \sqrt{d}\big)$
8: end for
9: for each scale $i$ do
10:  $Z_i \leftarrow A_i V$
11: end for
12: for each scale $i$ do
13:  $F_i' \leftarrow F_i + \mathrm{DeC}_i(Z_i)$   ▷ Conv + upsampling back to the input resolution
14: end for
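For illustration, a minimal PyTorch sketch of Algorithm 1 is given below. The channel sizes, the shared embedding dimension d, the use of adaptive average pooling for downsampling, nearest-neighbor interpolation for upsampling, and the stacking of the encoded scales along the token dimension to form the shared key–value bank are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CNFISketch(nn.Module):
    """Sketch of Cross-scale Non-local Feature Interaction (Algorithm 1).

    Assumptions (not taken from the paper): three input scales, a shared
    embedding dimension d, adaptive average pooling as the downsampling step,
    and nearest-neighbor interpolation as the upsampling step.
    """

    def __init__(self, in_channels=(128, 256, 512), d=128):
        super().__init__()
        self.d = d
        # EnC_i: 1x1 conv to the shared embedding dimension (spatial downsampling
        # to the smallest resolution is done adaptively in forward()).
        self.enc = nn.ModuleList(nn.Conv2d(c, d, 1) for c in in_channels)
        # DeC_i: 1x1 conv back to the original channel count (upsampling in forward()).
        self.dec = nn.ModuleList(nn.Conv2d(d, c, 1) for c in in_channels)

    def forward(self, feats):
        # feats: list of [B, C_i, H_i, W_i] tensors, ordered from largest to smallest.
        b = feats[0].shape[0]
        h_s, w_s = feats[-1].shape[-2:]        # smallest spatial resolution
        qs = []
        for f, enc in zip(feats, self.enc):
            q = F.adaptive_avg_pool2d(enc(f), (h_s, w_s))    # [B, d, h_s, w_s]
            qs.append(q.flatten(2).transpose(1, 2))           # [B, N, d], N = h_s * w_s

        # Shared key/value bank built from all encoded scales (stacked over tokens here).
        kv = torch.cat(qs, dim=1)                              # [B, S*N, d]

        outs = []
        for f, q, dec in zip(feats, qs, self.dec):
            attn = torch.softmax(q @ kv.transpose(1, 2) / self.d ** 0.5, dim=-1)
            z = attn @ kv                                       # [B, N, d]
            z = z.transpose(1, 2).reshape(b, self.d, h_s, w_s)
            z = F.interpolate(dec(z), size=f.shape[-2:], mode="nearest")
            outs.append(f + z)                                  # residual connection (step 13)
        return outs
```

Applied to three feature maps of decreasing resolution, the sketch returns three tensors with their original shapes, each enriched with cross-scale context through the residual connections.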
Complexity analysis of Algorithm 1: Let $\{F_i\}_{i=1}^{S}$ denote the multi-scale input features, where $F_i \in \mathbb{R}^{C_i \times H_i \times W_i}$. In CNFI, all encoded features are spatially unified to the smallest resolution $H_s \times W_s$ via downsampling, and we denote the number of spatial tokens by $N = H_s W_s$. After encoding, each query feature is reshaped as $Q_i \in \mathbb{R}^{N \times d}$, where $d$ is the embedding dimension. The shared key/value are constructed by channel concatenation across scales, leading to a joint representation over all $S$ scales (here $S = 3$). Then we discuss the time complexity and memory complexity of Algorithm 1 as follows:
(1) Time complexity: The dominant computation in Algorithm 1 is the cross-scale attention and aggregation. For each scale $i$, the matrix multiplication $Q_i K^{\top}$ costs $O(N^2 d)$, and the aggregation $A_i V$ costs $O(N^2 d)$. Thus, the overall attention-related cost is $O(S N^2 d)$, which becomes $O(N^2 d)$ for $S = 3$. The encoder/decoder (Conv + Down/Up sampling) contribute lower-order terms that are linear in $N$. Specifically, a convolution mapping $C_i$ input channels to $d$ output channels costs $O(N C_i d)$ at the unified resolution, and the total convolutional overhead across scales is $O\big(N d \sum_i C_i\big)$.
(2) Memory complexity: The main memory consumption comes from storing attention maps $A_i$, which requires $O(N^2)$ memory per scale. Therefore, the total memory complexity for $S$ scales is $O(S N^2)$, i.e., $O(N^2)$ when $S = 3$. Importantly, by downsampling all scales to the smallest feature resolution, CNFI keeps $N$ relatively small, making the additional overhead practically manageable.
3.2. Non-Local Feature Enhancement (NFE)
In complex backgrounds, UAVs often manifest as small targets with weak features, which can be easily overwhelmed by background interference possessing similar textures [37], a common challenge in anti-UAV detection. Traditional CNN-based detection methods primarily rely on convolutional operations to extract local features [38,44,45], making it difficult to establish relationships between distant pixels. Conversely, Transformer-based detection methods, while capable of capturing global dependencies via non-local attention [31], often overlook crucial detail information such as target boundaries and shapes. To simultaneously enhance the perception of both global context and local target details, the NFE module integrates non-local attention with gradient enhancement to jointly model global semantic context and local structural cues. Non-local attention captures long-range dependencies to enhance semantic discrimination between UAV targets and complex backgrounds, while gradient enhancement emphasizes contour-related boundary information. This complementary design improves boundary awareness and localization accuracy, particularly for small UAV targets in cluttered scenes.
The NFE module comprises two core sub-modules: Non-local Gradient Attention (NGA) and Gradient Enhancement Conv Block (GEC). These two modules are detailed below.
3.2.1. Non-Local Gradient Attention (NGA)
The NGA sub-module focuses on enhancing the global gradient information of the image, which is a critical low-level cue for object recognition and localization. Gradient information inherently captures edge and structural details, which are vital for distinguishing targets [35,36], particularly small or weakly textured ones like UAVs, from complex backgrounds. Traditional convolutional operations have limited receptive fields and may struggle to effectively model long-range dependencies between these crucial structural elements. The NGA module addresses this by integrating gradient-derived features with a non-local gradient attention mechanism. The motivation for explicitly using gradient information is to provide the attention mechanism with robust structural cues. This allows the module to capture global contextual relationships between important edge and texture details, enabling it to selectively enhance gradient features that are globally consistent with the structure of the target while suppressing irrelevant background clutter. This approach is particularly beneficial in scenarios where local information is ambiguous, as the global context provided by the non-local attention on gradients can help disambiguate and reinforce true target features.
The specific details of NGA are illustrated in Figure 2d. An input feature map ($X$) is first rearranged through a regrouping operation (R) to obtain the query $Q$. To capture edge and structural details, gradient information from the feature map is extracted using a regrouping and gradient extraction operation (RG) to generate the key $K$ and value $V$. This can be expressed as:
$$Q = R(X), \quad K = \mathrm{RG}(X), \quad V = \mathrm{RG}(X),$$
where $X$ represents the input feature, $R(\cdot)$ denotes an operation that directly rearranges the features, and $\mathrm{RG}(\cdot)$ represents an operation that extracts gradients (e.g., from the 8-neighborhood) and then rearranges them.
Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, for each channel $c$ and spatial location $(h, w)$, we define the 8-neighborhood offset set
$$\Delta = \{(-1,-1), (-1,0), (-1,1), (0,-1), (0,1), (1,-1), (1,0), (1,1)\}.$$
The neighbor-to-center residual in direction $\delta = (\delta_h, \delta_w) \in \Delta$ is computed by
$$G_c^{(\delta)}(h, w) = X_c(h + \delta_h, w + \delta_w) - X_c(h, w),$$
and the RG response is obtained by concatenating these residuals along the channel dimension:
$$\mathrm{RG}(X) = \mathrm{Concat}\big(\{G^{(\delta)}\}_{\delta \in \Delta}\big),$$
where $\mathrm{Concat}(\cdot)$ denotes the channel concatenation operation.
Subsequently, non-local attention is computed as:
$$A = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right).$$
The value matrix $V$ is then weighted by the attention weights, and the resulting weighted features are added to the original features to produce the output feature:
$$Y = X + A V.$$
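A minimal PyTorch sketch of NGA is shown below. Replicate padding at the borders for the 8-neighborhood residuals, simple 1×1 projections standing in for the regrouping operations R and RG, and the projection width d are all illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# 8-neighborhood offsets used by the RG operation.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def rg(x):
    """Neighbor-to-center residuals, concatenated along the channel dimension.

    x: [B, C, H, W] -> [B, 8*C, H, W] (replicate padding at the borders is assumed).
    """
    h, w = x.shape[-2:]
    x_pad = F.pad(x, (1, 1, 1, 1), mode="replicate")
    grads = []
    for dh, dw in OFFSETS:
        shifted = x_pad[:, :, 1 + dh:1 + dh + h, 1 + dw:1 + dw + w]
        grads.append(shifted - x)
    return torch.cat(grads, dim=1)


class NGASketch(nn.Module):
    """Non-local Gradient Attention sketch: query from the raw features,
    key/value from the gradient responses; 1x1 projections stand in for the
    regrouping steps and the projection width d is illustrative."""

    def __init__(self, channels, d=64):
        super().__init__()
        self.d = d
        self.to_q = nn.Conv2d(channels, d, 1)
        self.to_k = nn.Conv2d(8 * channels, d, 1)
        self.to_v = nn.Conv2d(8 * channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        g = rg(x)
        q = self.to_q(x).flatten(2).transpose(1, 2)   # [B, HW, d]
        k = self.to_k(g).flatten(2)                    # [B, d, HW]
        v = self.to_v(g).flatten(2).transpose(1, 2)    # [B, HW, C]
        attn = torch.softmax(q @ k / self.d ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + out                                  # residual connection
```

Note that this sketch forms the full HW × HW attention map explicitly; on large feature maps the actual module would need regrouping or downsampling to keep this affordable.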
3.2.2. Gradient Enhancement Conv Block (GEC)
The design of GEC is fundamentally motivated by the imperative to efficiently capture multi-scale contextual information, a critical requirement for robust anti-UAV detection where targets exhibit significant scale variations and are often embedded in complex visual environments [46]. Drawing inspiration from multi-branch architectures [39,47], GEC employs parallel convolutional pathways operating on distinct channel groups [40]. Each pathway is configured to specialize in extracting features at different receptive fields, allowing the module to concurrently analyze both fine-grained details crucial for small or distant UAVs and broader contextual cues vital for discriminating targets from background clutter. These diverse, scale-specific features are then aggregated and refined through a residual connection, leading to a more comprehensive and discriminative representation. This strategy enables the GEC to learn a rich set of features tailored for varying target appearances, thereby enhancing the model’s ability to discern UAVs under challenging conditions with improved efficiency.
Details of GEC are shown in Figure 2c. GEC first divides the input feature into 4 channel groups, each comprising 1/4 of the total channels:
$$X = [X_1, X_2, X_3, X_4].$$
Then, each group is processed by a parallel convolution branch with a distinct kernel size to capture multi-scale receptive fields. Specifically, the four branches adopt convolution kernels of four different sizes $k_1 \times k_1$, $k_2 \times k_2$, $k_3 \times k_3$, and $k_4 \times k_4$, respectively:
$$Y_i = \mathrm{Conv}_{k_i \times k_i}(X_i), \quad i = 1, 2, 3, 4.$$
Finally, the features processed by all branches are fused via a channel concatenation operation and then added to the original input:
$$Y = X + \mathrm{Concat}(Y_1, Y_2, Y_3, Y_4).$$
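A minimal PyTorch sketch of GEC is given below; the kernel sizes (1, 3, 5, 7) are illustrative assumptions since the excerpt does not state the values used, and padding is chosen to preserve spatial resolution.

```python
import torch
import torch.nn as nn


class GECSketch(nn.Module):
    """Gradient Enhancement Conv Block sketch: split the channels into 4 groups,
    apply one convolution branch per group with a branch-specific kernel size,
    concatenate, and add a residual connection.

    The kernel sizes (1, 3, 5, 7) are illustrative assumptions; the excerpt
    does not specify the values used in the paper.
    """

    def __init__(self, channels, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        assert channels % 4 == 0, "channel count must be divisible by 4"
        g = channels // 4
        self.branches = nn.ModuleList(
            nn.Conv2d(g, g, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        groups = torch.chunk(x, 4, dim=1)                        # four channel groups
        outs = [conv(grp) for conv, grp in zip(self.branches, groups)]
        return x + torch.cat(outs, dim=1)                         # fuse + residual
```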
Through the aforementioned design, the NFE module leverages non-local attention mechanisms to capture global contextual information while simultaneously preserving target boundary and shape details via gradient enhancement. Consequently, it effectively enhances the discriminability of target features in scenarios with complex backgrounds and weak target characteristics, thereby improving detection precision for the anti-UAV detection task.
4. Experiments and Analysis
In this section, experiments were conducted on publicly available datasets to validate the effectiveness of the proposed method. The experimental process encompassed several aspects: first, the experimental datasets and implementation details are described; subsequently, the results of ablation studies are thoroughly analyzed and interpreted; finally, the detection performance of the proposed method is further validated through comparisons with baseline models and other YOLO-series models.
4.1. Datasets and Implementation Details
All experiments were conducted on a Linux server equipped with four NVIDIA GeForce RTX 4090 GPUs. The proposed method was implemented in PyTorch (v1.8.0) with CUDA (v11.8), using Python (v3.10.0) under Ubuntu (v18.04). We conducted experiments on the DUT-Anti-UAV and Det-Fly datasets. Following common practice, all images were resized to a fixed input resolution during both training and inference. During training, we used the Adam optimizer with a batch size of 24 for 300 epochs and an initial learning rate of 0.01. The learning rate was scheduled using cosine decay with a warm-up of 15 epochs. Unless otherwise stated, all hyper-parameters were kept identical across ablation settings to ensure a fair evaluation. For efficiency analysis, we additionally report the number of parameters, FLOPs (at the same input resolution), and inference speed in FPS under the same hardware settings.
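For reference, a minimal sketch of the stated optimization schedule (Adam, initial learning rate 0.01, 300 epochs, cosine decay with a 15-epoch warm-up) is shown below; the linear warm-up shape, the decay endpoint, and the placeholder model are assumptions rather than details reported in the paper.

```python
import math
import torch
import torch.nn as nn

EPOCHS, WARMUP_EPOCHS, BASE_LR = 300, 15, 0.01

model = nn.Conv2d(3, 16, 3)  # placeholder; the actual detector would be used here
optimizer = torch.optim.Adam(model.parameters(), lr=BASE_LR)

def lr_lambda(epoch):
    """Linear warm-up over the first 15 epochs, then cosine decay."""
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```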
The experiments were conducted on the publicly available DUT-Anti-UAV dataset, which comprises 5243 training images, 2245 validation images, and 2621 test images. The most significant characteristic of this dataset is its high proportion of small targets. Specifically, within this dataset, the average target area ratio is 0.013, the minimum target area ratio is 1.9 × 10⁻⁶, and the largest target occupies 0.7 of the entire image. The proportion of target sizes relative to the entire image is generally less than 0.05. Figure 3 illustrates several typical scenarios from the DUT-Anti-UAV dataset.
In this paper, the performance of the anti-UAV detection model is evaluated using common evaluation metrics. During evaluation, Recall (R) represents the percentage of correctly identified targets, while Precision (P) measures the accuracy of the model in identifying targets. Mean Average Precision (mAP) is employed to comprehensively assess its object detection accuracy. These metrics are computed as follows:
$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i,$$
where $TP$, $FP$, $FN$, and $AP$ represent True Positives, False Positives, False Negatives, and Average Precision, respectively.
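These definitions translate directly into code; the following minimal sketch mirrors the formulas above (the per-class AP values are assumed to be computed beforehand, e.g., by integrating the precision–recall curve).

```python
def precision(tp, fp):
    """P = TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """R = TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class Average Precision values."""
    return sum(ap_per_class) / len(ap_per_class) if ap_per_class else 0.0
```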
4.2. Ablation Study
To validate the rationale behind the proposed model, a series of ablation studies were conducted on the DUT-Anti-UAV dataset. Detailed experimental results are presented in Table 1. Using the YOLOv12n [11] model as the baseline, the experiments evaluated the effectiveness of the CNFI and NFE modules. When the CNFI module was introduced independently, the detection performance showed a slight improvement: Precision, Recall, and mAP50 increased by 0.30%, 0.33%, and 0.71%, respectively. The independent introduction of NFE led to a significant enhancement in model performance, with experimental results indicating improvements in Precision, Recall, and mAP50 by 0.51%, 0.66%, and 0.63%, respectively. When both CNFI and NFE were incorporated, the model achieved optimal performance, with improvements over the baseline of 0.93% in Precision, 1.09% in Recall, and 2.12% in mAP50. The 2.12% mAP50 improvement is mainly attributed to our CNFI/NFE modules, which provide explicit cross-scale non-local interaction and structure-aware enhancement. Unlike conventional FPN/PAN/BiFPN-style add/concat fusion or scale-wise gating that remains largely local, our design enables direct long-range cross-scale dependency modeling (via a shared multi-scale key–value) and boundary-sensitive discrimination, demonstrating its practical relevance for anti-UAV detection.
Although the numerical improvement of 2.12% may appear moderate, it is practically meaningful in real-world anti-UAV surveillance scenarios. UAV targets are typically extremely small, exhibit significant scale variations across distances, and are often embedded in cluttered backgrounds such as vegetation, building contours, and clouds. Under such high-risk and low-tolerance conditions, a consistent gain in mAP@0.5 indicates a more stable precision–localization trade-off across confidence thresholds, enabling the detector to capture more challenging UAV instances while maintaining reliable bounding-box quality.
To further validate the effectiveness of CNFI, a comparison of target detection results at different scales was conducted. As shown in Figure 4, without CNFI, the model could primarily detect large-scale targets, exhibiting lower accuracy for small targets. However, with the introduction of CNFI, the model became capable of detecting targets across various scales. These comparative results indicate that the proposed CNFI, by facilitating thorough interaction of multi-scale features, can effectively capture targets of different sizes and demonstrates good robustness in small target scenarios.
To verify the efficacy of NFE, detection results before and after its introduction in complex scenarios were compared. Figure 5 illustrates that in complex scenes, due to the indistinct details of UAV targets, the baseline model’s detection accuracy was suboptimal. Conversely, after incorporating NFE, the model’s detection accuracy improved significantly. This demonstrates that the proposed NFE can enhance target detection performance in complex scenarios.
Moreover, to further investigate the effectiveness of our proposed CNFI and NFE modules, we conduct comprehensive ablation studies on the Det-Fly [48] dataset using an incremental validation approach. The quantitative results are summarized in Table 2.
Using YOLOv12n [11] as the baseline without any proposed modules, we first evaluate the individual contribution of the CNFI module in model-3. Interestingly, while the Precision improves from 84.21% to 86.00%, the Recall and mAP0.5 show slight decreases from 50.24% to 50.01% and 55.65% to 55.52%, respectively. This phenomenon suggests that CNFI alone enhances detection confidence, but may require complementary mechanisms to maintain recall performance.
The true potential of CNFI is revealed when combined with NFE in our final model. As shown in Figure 6, the standalone CNFI in model-3 demonstrates improved capability in detecting targets across multiple scales compared to the baseline, particularly for small UAVs that occupy less than 5% of the image area. The synergistic effect between CNFI and NFE is particularly evident in complex scenarios. As illustrated in the second row of Figure 6, while model-3 with CNFI shows better scale adaptation, the complete model excels in challenging environments with complex backgrounds. The NFE module significantly enhances feature representation in these complex settings, working complementarily with CNFI’s multi-scale fusion capability to achieve robust detection performance across all environmental conditions in the Det-Fly dataset.
These comprehensive results demonstrate that while CNFI provides the foundation for multi-scale target detection, its full potential is unlocked when integrated with NFE, creating a powerful framework that effectively addresses both scale variation and environmental complexity in aerial UAV detection tasks.
To further validate the effectiveness of the proposed NFE design and to explicitly analyze the contribution of its core components, we conduct an ablation study by constructing dedicated control variants on the Det-Fly dataset. Specifically, we design a variant that incorporates only the Non-local Gradient Attention module (NGA) without the Gradient Enhancement Conv block (model-4), where model-3 is regarded as the baseline.
As reported in Table 3, introducing NGA alone significantly improves detection performance over the baseline, indicating the importance of non-local semantic aggregation for suppressing background interference. More importantly, when gradient-based enhancement (GEC) is further fused with NGA, the full NFE module achieves consistent and substantial gains across all metrics, including Precision, mAP@0.5, and mAP@0.5–0.9. In particular, the joint design markedly improves mAP@0.5 from 64.62 to 72.12, demonstrating that combining gradient-aware structural cues with non-local semantic context is more effective than using either component in isolation. These results confirm that the performance improvement of NFE stems from the synergistic fusion of gradient and non-local features, rather than simply adding extra modules, thereby validating the core design motivation of NFE.
4.3. Comparative Experiments
To validate the performance of the proposed model, comparative experiments were conducted against YOLOv7-tiny [49], YOLOv8n [9], EDTC [50], Complex-GS [27], DMA-YOLO [51], and YOLOv12n [11] on the DUT-Anti-UAV dataset. The results in Table 4 indicate that the proposed model exhibits a significant advantage in terms of object detection performance. Specifically, the proposed model achieved the best results across four key metrics: Precision, Recall, mAP50, and mAP50-95. Notably, the proposed model demonstrated a remarkable performance advantage on the mAP50 metric, surpassing the second-best model by 2.12%. As for computational efficiency, our method achieves a moderate level of runtime performance, with its FPS, parameter count, and FLOPs ranking in the mid-range among the compared approaches.
To more comprehensively demonstrate the effectiveness of our model in anti-UAV detection, Figure 7 presents comparative results with other models across several different scenarios. In backgrounds such as steel towers and forests, our model achieves higher confidence scores, whereas in backgrounds with interference, such as residential buildings, most other models failed to detect the UAV targets. This indicates that our model performs better in anti-UAV detection under complex background conditions.
Besides the DUT-Anti-UAV dataset, we further conduct comparisons on the Det-Fly dataset to validate the generalizability of our method under more challenging conditions. As summarized in Table 5 and Figure 8, our approach achieves the best overall detection accuracy, yielding the highest Precision, Recall, and mAP among all compared methods, and remains robust under extreme conditions such as ultra-tiny targets. In particular, the consistent improvement over strong baselines (e.g., YOLOv12n) indicates that the proposed design is effective at both suppressing background-induced false alarms and recovering weak true positives that are easily missed when UAV targets occupy only a few pixels.
From the experimental results above, it is evident that the proposed model can better fuse feature information of targets at different scales and simultaneously enhance target clarity through feature augmentation, thereby significantly improving object detection accuracy. These advantages enable it to outperform the compared methods on small UAV target detection tasks.
6. Conclusions
This paper addresses the challenge of anti-UAV detection under extreme target scale variation and complex background interference. To this end, we propose an anti-UAV detection framework based on non-local feature learning, which integrates two key components: a Cross-scale Non-local Feature Interaction (CNFI) module and a Non-local Feature Enhancement (NFE) module.
The CNFI module explicitly establishes long-range dependencies across multi-scale feature maps via cross-scale non-local attention, enabling more effective global information interaction and improving the detection of small UAV targets. Meanwhile, the NFE module enhances feature discriminability in cluttered environments by fusing global non-local context with gradient-based structural cues, thereby reinforcing boundary and shape information.
Experimental results on the DUT-Anti-UAV and Det-Fly benchmark datasets demonstrate the effectiveness of the proposed method. Compared with the YOLOv12n baseline, our approach achieves improvements of 0.93% in Precision, 1.09% in Recall, and 2.12% in mAP@0.5 on DUT-Anti-UAV, and gains of +4.21% Precision, +15.87% Recall, and +16.47% mAP@0.5 on the Det-Fly dataset, confirming its robustness and generalization capability in diverse scenarios.
Despite these gains, the cross-scale non-local attention introduces additional computational overhead, and the current evaluation is limited to static image datasets. Future work will focus on lightweight attention designs to improve efficiency and extend the proposed framework to video-based anti-UAV detection and edge-device deployment.