Article

SMG-UAV: Sparse Mutual Guided RGB–Event Fusion for Robust UAV Detection in Challenging Dynamic Environments

United Laboratories of TT&C and Communication, Jiuquan Satellite Launch Center, Lanzhou 732750, China
*
Author to whom correspondence should be addressed.
Drones 2026, 10(7), 486; https://doi.org/10.3390/drones10070486 (registering DOI)
Submission received: 8 May 2026 / Revised: 16 June 2026 / Accepted: 22 June 2026 / Published: 25 June 2026

Highlights

What are the main findings?
  • A sparse mutual guided RGB–event fusion network, SMG-UAV, is proposed with a modality-specific hybrid CSP backbone, SMG-Bridge, and SGP-Neck, enabling bidirectional reliability-aware compensation between dense RGB appearance cues and sparse event-driven motion cues for robust small-UAV detection in challenging dynamic environments.
  • SMG-UAV achieves state-of-the-art performance on FRED and NeRDD, while consistently obtaining the best results on small-target, motion-blur, extreme-illumination, background-embedded, and bird distractor subsets.
What are the implications of the main findings?
  • This work demonstrates that event cameras and reliability-aware RGB–event fusion can substantially improve practical low-altitude anti-UAV surveillance when RGB imagery is degraded by illumination variation, motion blur, cluttered backgrounds, or tiny target scale.
  • The proposed method provides an effective solution for multimodal small-object detection by combining sparse cross-modal recovery, unreliable response suppression, and selective multiscale enhancement, supporting applications such as airport protection, perimeter defense, critical-infrastructure monitoring, and public-event security.

Abstract

Robust unmanned aerial vehicle (UAV) detection in real low-altitude anti-UAV scenarios remains challenging due to motion blur, extreme illumination, cluttered backgrounds, and tiny target sizes. Most existing UAV detectors rely on RGB imagery, but their performance often degrades severely under these adverse conditions. Event cameras, as a neuromorphic sensing modality, capture motion-sensitive responses with high temporal resolution and thus provide complementary cues for robust UAV detection. However, existing RGB–event fusion detectors usually employ homogeneous feature extraction and generic fusion mechanisms, which are insufficient to handle heterogeneous modality degradation and exploit reliable cross-modal cues. To address this limitation, we propose SMG-UAV, a sparse mutual guided RGB–event fusion network for robust small-UAV detection. The proposed method integrates a hybrid dual-branch backbone for modality-specific representation learning, a Sparse Mutual Guided Bridge for bidirectional sparse cross-modal refinement, and a Selective Gated Pyramid Neck for multiscale enhancement of weak UAV responses. Experiments on the Florence RGB-Event Drone Dataset (FRED) and the Neuromorphic-RGB Drone Detection Dataset (NeRDD) demonstrate that SMG-UAV achieves state-of-the-art performance, outperforming the strongest competing method by an average of 5.2 points in AP50, while delivering stronger robustness under multiple challenging anti-UAV conditions.

1. Introduction

The rapid development of unmanned aerial vehicles (UAVs) has brought significant benefits to aerial sensing, logistics, emergency response, infrastructure inspection, and environmental monitoring [1,2,3]. However, unauthorized or malicious UAVs may also pose potential threats to low-altitude airspace safety, airport operation, public-event security, critical infrastructure protection, and personal privacy [4,5,6]. Therefore, accurate and robust detection of UAVs has become a fundamental requirement for anti-UAV surveillance systems [7,8,9]. Reliable UAV detection not only provides the spatial localization of suspicious aerial targets, but also serves as the front-end basis for subsequent identification, tracking, risk assessment, and interception decisions.
Although deep learning-based UAV detection methods have achieved notable progress, most existing approaches still rely primarily on RGB visible-light images as the main sensing input. This RGB-centric detection paradigm benefits from dense texture, color, and semantic information, and has shown strong performance under well-illuminated and relatively clean imaging conditions. However, its robustness remains insufficient for real-world low-altitude anti-UAV surveillance. Small UAVs observed at long ranges often occupy only a few pixels, producing weak appearance cues and ambiguous object boundaries. Meanwhile, rapid UAV motion, camera ego-motion, low illumination, overexposure, and illumination variation can degrade RGB frames and cause motion blur, unstable target appearance, or partial loss of visual details. In addition, complex low-altitude backgrounds, such as clouds, tree branches, buildings, and power lines, often reduce the visual saliency of small UAVs, making their weak RGB appearance cues easily overwhelmed by surrounding background structures. As a result, conventional RGB-based detectors may struggle to reliably localize small UAVs in complex and dynamic environments.
Event cameras provide a promising complementary sensing modality for this task. Different from conventional RGB cameras that record dense intensity frames at a fixed frame rate, event cameras asynchronously respond to brightness changes with microsecond-level latency, high dynamic range, and sparse motion-sensitive outputs [10]. As a result, event streams can preserve sharp motion contours under fast motion and extreme illumination, while RGB frames provide dense appearance, texture, color, and semantic context [11,12]. Figure 1 shows representative RGB frames and event data under different challenging conditions. The complementary characteristics of RGB and event data make RGB–event sensor fusion an attractive solution for robust object detection in challenging dynamic environments [12,13,14,15]. Accordingly, RGB–event fusion has recently become a promising research direction in anti-UAV perception, as it offers the potential to improve detection robustness and localization reliability under complex low-altitude conditions [16,17,18].
Despite its potential, current RGB–event detection frameworks still have several limitations when applied to UAV detection in low-altitude anti-UAV scenarios. One important issue is the treatment of modality heterogeneity. Existing methods often use similar backbone networks for RGB frames and event representations, which simplifies the architecture but ignores the different sensing principles of the two modalities [12,13,14]. RGB images provide dense texture, color, and semantic context, whereas event streams respond sparsely to temporal brightness changes and contain motion-sensitive boundary information [10,11]. A homogeneous backbone design may therefore fail to fully exploit the sparse and dynamic nature of event data.
The fusion stage also remains insufficiently explored. Most existing methods combine RGB and event features through concatenation, summation, channel weighting, or generic attention [12,13,14,15]. While these operations are effective for feature aggregation, they provide limited insight into how complementary information is recovered under modality-specific degradation. In real anti-UAV scenarios, the reliability of RGB and event data changes with illumination, target motion, background dynamics, and imaging distance [17,18]. RGB features may be corrupted by blur or exposure changes, whereas event responses may be sparse for weak motion or contaminated by background activity. This suggests that RGB–event fusion should explicitly consider cross-modal compensation and reliability-aware selection rather than only feature-level aggregation.
Small UAV targets further increase the difficulty of feature representation. In long-range surveillance, they often appear as tiny, low-contrast objects with limited texture, and their responses can be easily weakened during hierarchical feature extraction [6,19,20]. Although feature pyramid networks are widely used for multiscale detection [21], generic pyramid aggregation does not necessarily strengthen small and weak UAV cues. High-level features are semantically informative but spatially coarse, while low-level features preserve location details but are sensitive to clutter. A dedicated multiscale enhancement design is therefore necessary to propagate useful semantic and localization information across scales and improve the robustness of small UAV localization.
To address the above challenges, we propose SMG-UAV, an RGB–event fusion framework for robust small-UAV detection in challenging low-altitude environments. SMG-UAV follows a modality-aware and task-oriented design. For feature extraction, an appearance-oriented CSPDarknet [22] is adopted for RGB frames, while an improved Spiking CSPDarknet is developed to efficiently capture sparse event-driven motion features from event voxels. To achieve reliable cross-modal compensation under asymmetric modality degradation, we design the Sparse Mutual Guided Bridge (SMG-Bridge), which performs sparse mutual guidance and reliability-aware feature interaction between RGB and event features. This module encourages reliable cues from one modality to complement degraded or ambiguous responses in the other. To enhance weak small-UAV responses during multiscale feature aggregation, we further construct the Selective Gated Pyramid Neck (SGP-Neck), which progressively propagates semantic and spatial cues across scales to preserve localization-sensitive information and strengthen small-target representations.
The main contributions of this work are summarized as follows:
  • We propose SMG-UAV, a dedicated RGB–event small UAV detection framework for low-altitude anti-UAV surveillance. The framework exploits the complementary sensing properties of RGB frames and event streams to improve robustness under motion blur, illumination degradation, complex backgrounds, and weak small-target conditions.
  • We develop three task-oriented modules for robust RGB–event UAV detection: an improved Spiking CSPDarknet for efficient sparse event feature extraction, SMG-Bridge for sparse mutual guided and reliability-aware cross-modal interaction, and SGP-Neck for progressive multiscale enhancement of weak small-UAV responses.
  • Extensive experiments on RGB–event UAV detection benchmarks demonstrate the effectiveness of the proposed method in terms of detection accuracy and robustness compared with RGB-only, event-only, and existing RGB–event fusion detectors.

2. Related Work

2.1. Vision-Based UAV Detection

Vision-based UAV detection has attracted increasing attention in low-altitude surveillance, airport protection, and anti-UAV security systems. Early methods mainly relied on handcrafted appearance features, background modeling, motion cues, or trajectory filtering to distinguish UAV targets from sky or ground backgrounds [19,20,23]. With the development of deep learning, CNN-based and transformer-based object detectors have become the dominant solutions for UAV detection. One-stage detectors, represented by the YOLO family [24], are widely used due to their favorable balance between accuracy and inference speed, while two-stage and DETR-style detectors provide stronger feature modeling and localization capability in complex scenes [25,26].
However, UAV detection is still more difficult than generic object detection. UAVs observed from long distances usually occupy only a small number of pixels and show weak texture, low contrast, and ambiguous boundaries. To address these issues, recent studies have introduced small target feature extraction, multiscale attention, high-resolution detection heads, context-enhanced feature pyramids, and refined localization modules [27,28,29,30]. These methods improve scale adaptation and localization to some extent, but most of them still rely on RGB visible-light images. When RGB frames are affected by motion blur, low illumination, overexposure, or complex low-altitude backgrounds, the appearance cues of small UAVs can be severely degraded. This limitation motivates the use of complementary sensing modalities to improve detection robustness under challenging dynamic conditions.

2.2. RGB–Event Fusion Object Detection

Event cameras provide an alternative sensing paradigm for dynamic visual perception. Unlike frame-based cameras that record dense images at a fixed frame rate, event cameras asynchronously respond to pixel-level brightness changes and generate sparse event streams with high temporal resolution, high dynamic range, and low latency. These properties make event cameras suitable for challenging scenarios involving fast motion, illumination variation, and high-speed targets [10]. Since raw events are asynchronous and sparse, they are usually transformed into event frames, time surfaces, voxel grids, or other spatio-temporal representations for deep network processing [11,31].
RGB–event object detection aims to combine dense appearance cues from RGB frames with motion-sensitive information from event streams. Existing methods generally adopt early fusion, late fusion, or intermediate fusion strategies. Early fusion directly stacks RGB and event representations at the input level, late fusion combines predictions from independent branches, and intermediate fusion performs feature-level interaction inside the network [32,33,34,35,36,37,38]. Recent studies have explored multi-level feature fusion, gated interaction, temporal aggregation, cross-modal attention, and transformer-based fusion for RGB–event detection [12,13,14,15]. These methods demonstrate the effectiveness of event cameras in improving perception robustness under adverse illumination and dynamic conditions.
However, existing RGB–event detection frameworks still have several limitations when applied to small UAV detection. Many methods convert events into frame-like representations and process them with ordinary CNN or transformer backbones, which may weaken the sparse and temporal characteristics of event data. In addition, most fusion modules are mainly designed as generic feature aggregation operations, with limited consideration of modality-specific degradation and reliability variation. This limitation becomes more critical in low-altitude anti-UAV scenarios, where RGB cues may be degraded by motion blur, low illumination, or overexposure, while event responses may become sparse under weak UAV motion or noisy under strong background activity. Therefore, robust RGB–event UAV detection requires both event-oriented feature extraction and reliability-aware cross-modal interaction.

2.3. Cross-Modal Fusion and Sparse Representation

Cross-modal fusion is a key issue in RGB–event detection, because the two modalities provide complementary but heterogeneous visual cues. RGB frames contain dense texture, color, and semantic information, whereas event streams mainly encode sparse brightness changes and motion-sensitive structures. Existing RGB–event detectors have introduced multi-level feature fusion, gated interaction, asynchronous attention, and transformer-based modeling to integrate the two modalities [13,14,15]. These designs improve the utilization of complementary RGB and event information, especially under adverse illumination or fast-motion conditions.
However, most fusion modules still perform feature aggregation in a largely data-driven manner. Simple operations such as concatenation, summation, and channel weighting are easy to implement, but they provide limited control over modality-specific noise and unreliable responses. Attention-based and transformer-based modules can model long-range or cross-modal dependencies, but they do not explicitly describe how degraded information in one modality is compensated by reliable cues from the other. This limitation becomes more critical in anti-UAV scenarios, where the reliability of RGB and event data varies with illumination, target motion, imaging distance, and background dynamics. Therefore, robust RGB–event fusion requires not only cross-modal interaction, but also reliability-aware selection and degradation compensation.
Sparse representation provides a useful perspective for designing more interpretable fusion mechanisms. It assumes that signals can be represented by a small number of informative bases or atoms, and has been widely used in image restoration, multifocus image fusion, multimodal image fusion, and multi-sensor image fusion [39,40,41,42]. Compared with unconstrained feature mixing, sparse representation encourages compact and informative responses, which is consistent with the sparse and noise-sensitive characteristics of event data.
Deep unfolding further bridges model-based optimization and learnable neural networks. By mapping iterative optimization procedures into network stages, unfolding-based methods retain the interpretability of optimization models while benefiting from data-driven parameter learning [43,44,45]. The iterative shrinkage-thresholding algorithm (ISTA), for example, alternates between reconstruction-oriented updates and sparsity-inducing shrinkage operations. Such a structure is suitable for modeling sparse feature recovery and suppressing redundant responses. Nevertheless, sparse and unfolding-inspired mechanisms have not been sufficiently explored for RGB–event UAV detection, particularly for modeling mutual compensation between degraded RGB appearance cues and event-driven motion structures. This motivates the proposed SMG-Bridge, which performs sparse mutual guidance and reliability-aware feature interaction for robust small UAV detection.
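As a point of reference for the ISTA structure described above, the following is a minimal NumPy sketch of the classical algorithm for the LASSO objective $\tfrac{1}{2}\|y - Dz\|_2^2 + \lambda\|z\|_1$. It is the textbook procedure, not the SMG-Bridge itself; the names `soft_threshold`, `ista`, `D`, `y`, and `lam` are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def soft_threshold(x, tau):
    """Soft shrinkage: reduce every magnitude by tau and zero out anything smaller."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def ista(D, y, lam=0.1, n_iter=200):
    """Textbook ISTA for min_z 0.5 * ||y - D z||^2 + lam * ||z||_1.

    D is a (m, k) dictionary and y a length-m observation (both hypothetical here).
    """
    # Step size 1/L, where L is the Lipschitz constant of the data-term gradient.
    L = np.linalg.norm(D, 2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        residual = y - D @ z                      # reconstruction residual
        z = soft_threshold(z + (D.T @ residual) / L, lam / L)  # gradient step + shrinkage
    return z
```

In an unfolded network, each loop iteration would become one stage whose step size, dictionary, and threshold are learned from data rather than fixed.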

2.4. Comparative Summary and Research Gap

RGB-based UAV detectors benefit from mature image-based architectures and strong appearance modeling, and are commonly evaluated by AP, mAP, AP50, or AP75. However, they are vulnerable to motion blur, extreme illumination, background clutter, and tiny-target degradation. Moreover, many studies do not fully analyze recall, false alarms, inference speed, hardware requirements, or deployment constraints.
Event-based detectors provide high-temporal-resolution motion cues and stronger robustness to illumination changes. However, event-only representations may be disturbed by background-triggered events or become weak when the relative motion between the UAV and the sensor is limited.
RGB–event fusion methods aim to combine RGB semantic cues with event-based motion responses. Most existing strategies improve multimodal representation, but they still rely mainly on direct concatenation or feature reweighting and do not explicitly model asymmetric modality degradation. In practical anti-UAV scenes, RGB features may be unreliable under visual degradation, while event features may be noisy due to camera jitter, moving backgrounds, or non-target motion.
It should also be noted that existing UAV detection, event-based detection, and RGB–event fusion studies are often evaluated on different datasets, object categories, input resolutions, event representations, and hardware platforms. Therefore, directly comparing reported AP, recall, F1-score, FPS, or latency values across different papers may be unfair or misleading. For this reason, we provide direct quantitative comparisons under a unified experimental protocol in the Experiments section, and further report practical metrics including precision, recall, F1-score, false positives per frame, miss rate, inference latency, and computational complexity.
The research gap addressed in this paper is how to selectively exploit reliable cross-modal cues while suppressing unreliable modality-specific responses for robust small-UAV detection. To this end, SMG-UAV introduces a Sparse Mutual Guided Bridge that performs bidirectional sparse cross-modal refinement in a compact latent space, rather than simple feature aggregation. The proposed adaptive sparse updating allows reliable cues from one modality to guide the other modality and suppresses weak, noisy, or redundant responses. In addition, the Selective Gated Pyramid Neck enhances weak multiscale UAV responses, which is important for tiny and distant UAV targets in low-altitude surveillance.

3. Method

3.1. Problem Formulation and Overall Architecture

Given an RGB frame $R_t \in \mathbb{R}^{3 \times H \times W}$ captured at time $t$ and the corresponding event observations within a temporal window $\Delta t$, the goal of RGB–event small-UAV detection is to localize UAV targets by jointly exploiting appearance information from RGB imagery and motion-sensitive responses from event data.
The event set collected within the temporal window $\Delta t$ is defined as
$$E_t = \{\, e_k \mid t_k \in [t - \Delta t,\, t] \,\}_{k=1}^{n},$$
where $t$ is the timestamp of the RGB frame $R_t$, $\Delta t$ is the event accumulation window, and $n$ is the number of events collected in this window. In our implementation, $\Delta t$ is set according to the RGB frame interval so that the event observations are temporally paired with the corresponding RGB frame.
Each raw event in $E_t$ is represented as
$$e_k = (x_k, y_k, t_k, p_k), \quad k = 1, \dots, n,$$
where $(x_k, y_k)$ denotes the pixel coordinate of the $k$-th event, with $x_k \in \{0, \dots, W-1\}$ and $y_k \in \{0, \dots, H-1\}$, $t_k$ is the timestamp, and $p_k \in \{-1, +1\}$ is the polarity of the brightness change. Specifically, $p_k = +1$ and $p_k = -1$ indicate positive and negative log-intensity changes, respectively. Zero is not used as the polarity value of a raw event. Zero values may only appear in the accumulated event representation, such as event frames or event voxels, at spatial–temporal locations where no event is observed.
The event set is converted into an event voxel representation:
$$V_t \in \mathbb{R}^{2T \times H \times W},$$
where $H$ and $W$ denote the height and width of the input representation, respectively. $T$ is the number of temporal bins, and the factor 2 corresponds to the positive and negative polarity channels.
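To make the voxel construction concrete, the sketch below accumulates raw events into a $2T \times H \times W$ count tensor. The text fixes only the output shape and the polarity split; the uniform temporal binning, the count-based accumulation, the channel ordering (positive bins first, then negative bins), and the function name `events_to_voxel` are assumptions made for illustration.

```python
import numpy as np


def events_to_voxel(x, y, t, p, t_end, dt, T, H, W):
    """Accumulate events (x, y, t, p) from [t_end - dt, t_end] into a (2T, H, W) voxel.

    x, y are integer pixel coordinates, t are timestamps, p is polarity in {-1, +1}.
    """
    x, y, t, p = (np.asarray(a) for a in (x, y, t, p))
    voxel = np.zeros((2 * T, H, W), dtype=np.float32)
    t_start = t_end - dt
    # Keep only events inside the accumulation window paired with the RGB frame.
    keep = (t >= t_start) & (t <= t_end)
    x, y, t, p = x[keep], y[keep], t[keep], p[keep]
    # Assumed uniform temporal bins; an event exactly at t_end goes to the last bin.
    bins = np.minimum(((t - t_start) / dt * T).astype(np.int64), T - 1)
    # Assumed channel layout: [0, T) positive polarity, [T, 2T) negative polarity.
    channel = np.where(p > 0, bins, bins + T)
    # np.add.at accumulates correctly when several events hit the same cell.
    np.add.at(voxel, (channel, y, x), 1.0)
    return voxel
```

Cells that receive no event stay at zero, which matches the remark that zero appears only in the accumulated representation and never as a raw polarity value.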
Given the paired RGB frame and event voxel, the detector predicts a set of UAV instances:
$$Y_t = F(R_t, V_t; \Theta) = \{(b_i, c_i, s_i)\}_{i=1}^{N_t},$$
where $F(\cdot\,; \Theta)$ denotes the proposed SMG-UAV detector parameterized by $\Theta$, $R_t$ is the RGB frame, and $V_t$ is the paired event voxel representation. $N_t$ denotes the number of detected UAV instances in frame $t$. For the $i$-th detection, $b_i = (x_i, y_i, w_i, h_i)$ is the predicted bounding box, $c_i$ is the predicted class label, and $s_i$ is the final detection confidence score after post-processing.
The overall architecture of the proposed SMG-UAV is illustrated in Figure 2. The framework consists of a modality-specific hybrid CSP backbone, three SMG-Bridge modules, an SGP-Neck, and a YOLO-style detection head. The RGB branch extracts dense appearance and semantic features, while the event branch first converts asynchronous events into a voxel representation and then uses Spiking Conv blocks with LIF activations to encode sparse temporal motion cues. The multilevel RGB and event features are connected by SMG-Bridge modules at stages 3–5. After sparse mutual guidance, the fused features $\{F_f^3, F_f^4, F_f^5\}$ are passed to the SGP-Neck, which outputs pyramid features $\{F_P^3, F_P^4, F_P^5\}$ for multiscale detection. Finally, the detection head predicts UAV bounding boxes, confidence scores, and class labels from these pyramid features.
For the fusion stages, the overall process can be summarized as
$$F_f^l = \mathrm{SMG\text{-}Bridge}(F_R^l, F_E^l), \quad l \in \{3, 4, 5\},$$
$$\{F_P^3, F_P^4, F_P^5\} = \mathrm{SGP\text{-}Neck}(F_f^3, F_f^4, F_f^5),$$
where $\mathrm{SMG\text{-}Bridge}(\cdot)$ denotes the proposed Sparse Mutual Guided Bridge, which is used to perform bidirectional sparse cross-modal refinement between RGB and event features; $\mathrm{SGP\text{-}Neck}(\cdot)$ denotes the proposed Selective Gated Pyramid Neck, which takes the fused multiscale features as input and produces enhanced pyramid features for the subsequent detection head; $F_R^l$ and $F_E^l$ denote the RGB and event features at the $l$-th stage, respectively; $F_f^l$ denotes the fused feature produced by SMG-Bridge; and $\{F_P^3, F_P^4, F_P^5\}$ denote the enhanced pyramid features used for final prediction.
In this way, SMG-UAV forms a unified RGB–event detection framework that integrates modality-specific representation learning, sparse cross-modal fusion, and multiscale feature enhancement for robust small-UAV detection in challenging environments.

3.2. Modality-Specific Hybrid CSP Backbone

RGB frames and event streams are generated by fundamentally different sensing mechanisms, and therefore may not be optimally encoded by an identical feature extraction strategy. RGB imagery provides dense appearance information, including texture, color, and semantic context, whereas event data mainly capture sparse brightness variations with high temporal sensitivity and motion-aware structural responses. To better match these modality-specific characteristics, SMG-UAV adopts a hybrid CSP-based backbone, where the RGB branch is implemented with an appearance-oriented CSPDarknet and the event branch is constructed as an event-oriented Spiking CSPDarknet.
Given an RGB frame $R_t$, the RGB branch extracts hierarchical appearance representations through a standard CSPDarknet encoder:
$$\{F_R^1, F_R^2, F_R^3, F_R^4, F_R^5\} = \Phi_R(R_t),$$
where $F_R^l \in \mathbb{R}^{C_l \times H_l \times W_l}$ denotes the RGB feature at stage $l$, and $\Phi_R(\cdot)$ denotes the RGB backbone for extracting hierarchical appearance and semantic features from the input RGB frame $R_t$. Among them, the last three stages $\{F_R^3, F_R^4, F_R^5\}$ are used for subsequent cross-modal fusion, corresponding to spatial resolutions of approximately $1/8$, $1/16$, and $1/32$ of the input size, respectively. CSPDarknet is adopted in the RGB branch because cross-stage partial backbones have been widely validated in recent detection studies, including UAV and small-object detection tasks, showing reliable hierarchical feature extraction capability with favorable efficiency [46,47,48].
For the event modality, directly applying a conventional frame-based backbone to event voxels is suboptimal, since it may weaken the sparse and temporally sensitive nature of event signals. To address this issue, we construct the event encoder by adapting CSPDarknet into a spiking counterpart, termed Spiking CSPDarknet, in which the nonlinear activations in the event branch are replaced by spike-based neuronal dynamics. The event voxel $V_t$ is first discretized into multiple temporal bins, which are processed as sequential inputs to the spiking encoder. The resulting spiking responses are then temporally aggregated to obtain stage-wise event features $F_E^l$. The event voxel $V_t$ is encoded as
$$\{F_E^1, F_E^2, F_E^3, F_E^4, F_E^5\} = \Phi_E(V_t),$$
where $F_E^l \in \mathbb{R}^{C_l \times H_l \times W_l}$ denotes the event feature at stage $l$, and $\Phi_E(\cdot)$ denotes the event backbone for extracting hierarchical motion-sensitive features from the event voxel representation $V_t$. Similar to the RGB branch, the last three stages $\{F_E^3, F_E^4, F_E^5\}$ are used for multiscale cross-modal interaction. Maintaining the same stage hierarchy in both branches facilitates feature pairing and fusion at aligned spatial scales.
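As a quick sanity check on the aligned scales, the sketch below computes the spatial size and token count of the three fusion stages from the input resolution, using the approximate $1/8$, $1/16$, and $1/32$ ratios stated above. The helper name `stage_shapes` and the ceiling rounding for non-divisible inputs are assumptions; the actual rounding depends on the padding used by the backbone, which is not specified here.

```python
import math


def stage_shapes(height, width, strides=(8, 16, 32)):
    """Return {stage: (H_l, W_l, H_l * W_l)} for fusion stages 3, 4, and 5."""
    shapes = {}
    for stage, stride in zip((3, 4, 5), strides):
        # Ceiling division is an assumed rounding rule for inputs not divisible by the stride.
        h = math.ceil(height / stride)
        w = math.ceil(width / stride)
        shapes[stage] = (h, w, h * w)
    return shapes
```

For a hypothetical 640 × 640 input, this gives 80 × 80, 40 × 40, and 20 × 20 feature maps, so both branches must produce these sizes for stage-wise pairing to work.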
In the spiking event branch, we model the feature response at temporal step τ using the leaky integrate-and-fire (LIF) neuron model, following the LIF-neuron-based spiking modeling in [49,50]. Here, each spatial channel unit in the event feature map is treated as a spiking neuron, and its membrane potential represents the temporally accumulated activation state driven by event-based inputs. This design is suitable for event data because event responses are sparse and temporally asynchronous, while LIF neurons can naturally integrate temporal motion cues and generate sparse spike activations. The membrane potential is updated as
$$U_\tau = \lambda\, U_{\tau-1}\,(1 - S_{\tau-1}) + X_\tau,$$
where $U_\tau$ denotes the membrane potential, $X_\tau$ is the input current, $\lambda$ is the decay factor, and $S_{\tau-1}$ denotes the spike state at the previous step. The term $(1 - S_{\tau-1})$ implements the reset-after-spike mechanism, which resets the accumulated membrane potential to zero after a neuron fires. The spike output is generated by
$$S_\tau = H(U_\tau - \theta),$$
where $\theta$ is the firing threshold and $H(\cdot)$ denotes the Heaviside step function. In our implementation, the firing threshold $\theta$ is fixed to 1.0 for all LIF neurons and is not learned during training. This fixed-threshold setting avoids introducing additional neuron-specific parameters and makes the spiking event branch easier to reproduce. Since the Heaviside step function $H(\cdot)$ is non-differentiable, direct backpropagation through the spike function in Equation (10) is not feasible. Therefore, we adopt surrogate gradient learning to train the spiking event branch. During the forward pass, the hard threshold function $H(\cdot)$ is used to generate binary spike outputs. During the backward pass, its derivative is approximated by a differentiable surrogate function. Specifically, we use the sigmoid surrogate function
$$\sigma_\alpha(x) = \frac{1}{1 + \exp(-\alpha x)},$$
and approximate the gradient of the spike function as
$$\frac{\partial S_\tau}{\partial U_\tau} \approx \alpha\, \sigma_\alpha(U_\tau - \theta)\,\bigl(1 - \sigma_\alpha(U_\tau - \theta)\bigr),$$
where $\alpha$ controls the smoothness of the surrogate gradient. In this way, gradients can be propagated through the non-differentiable spike function, while the forward computation still preserves binary spike activations. With this design, the event branch preserves the hierarchical representation capability of CSPDarknet while better matching the sparse activation pattern and temporal response characteristics of event data.
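The LIF update and the sigmoid surrogate gradient can be sketched in NumPy as follows. Only the threshold $\theta = 1.0$ comes from the text; the decay `decay=0.5`, the smoothness `alpha=4.0`, the convention $H(0) = 1$, and the function names are illustrative assumptions. A real training setup would wrap the two functions in a custom autograd operation, which is omitted here.

```python
import numpy as np


def lif_forward(inputs, decay=0.5, theta=1.0):
    """Run LIF neurons over inputs of shape (T, ...); return spikes and membrane potentials."""
    inputs = np.asarray(inputs, dtype=np.float64)
    u = np.zeros_like(inputs[0])   # membrane potential U
    s = np.zeros_like(inputs[0])   # spike state S from the previous step
    spikes, potentials = [], []
    for x in inputs:
        # Leak the old potential, zero it if the neuron just fired, then integrate the input.
        u = decay * u * (1.0 - s) + x
        # Hard threshold in the forward pass (assumed convention: fire when U >= theta).
        s = (u >= theta).astype(u.dtype)
        spikes.append(s)
        potentials.append(u)
    return np.stack(spikes), np.stack(potentials)


def sigmoid_surrogate_grad(u, theta=1.0, alpha=4.0):
    """Backward-pass approximation of dS/dU: alpha * sigma * (1 - sigma) at (U - theta)."""
    sig = 1.0 / (1.0 + np.exp(-alpha * (u - theta)))
    return alpha * sig * (1.0 - sig)
```

With a constant input of 0.6 and these assumed settings, the neuron needs three steps to cross the threshold, fires once, and restarts from the raw input on the next step, which illustrates the reset term. The surrogate gradient peaks at $U_\tau = \theta$ and decays on both sides, so neurons near the threshold receive the largest updates.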
The modality-specific backbone therefore produces paired multiscale representations $\{F_R^3, F_R^4, F_R^5\}$ and $\{F_E^3, F_E^4, F_E^5\}$ for the subsequent SMG-Bridge, where cross-modal interaction is performed at aligned semantic scales.

3.3. Sparse Mutual Guided Bridge

After modality-specific feature extraction, RGB and event features provide complementary yet unevenly reliable cues. RGB features preserve dense appearance and semantic context, but they are easily degraded by motion blur, underexposure, overexposure, and cluttered backgrounds. Event features are more sensitive to motion boundaries and temporal structure, yet they may also contain background-triggered responses or become weak when UAV motion is subtle. Therefore, directly concatenating the two modalities may inject unreliable responses into the fused representation. To address this issue, we propose the Sparse Mutual Guided Bridge, which performs bidirectional cross-modal refinement in a compact sparse latent space and progressively propagates sparse guidance across feature levels.
The design of SMG-Bridge is inspired by sparse-coding and ISTA-style feature adaptation, where a high-dimensional feature can be represented by a compact sparse latent code and refined through residual correction [43,44,45]. Unlike direct feature concatenation or ordinary gated fusion, SMG-Bridge does not assume that RGB and event features are always equally reliable or directly interchangeable. Instead, we use sparse latent codes as compact guidance carriers. At each bridge stage, the current RGB or event feature is regarded as an observation anchor, while the propagated sparse code from the other modality provides cross-modal guidance for residual sparse updating.
In SMG-Bridge, reliability is not estimated by an additional manually designed confidence score. Instead, it is implicitly modeled through two learnable sparse updating mechanisms: the cross-modal guidance gate and the adaptive sparse threshold. The reconstruction residual measures the mismatch between the current feature and the feature reconstructed from its sparse code. The cross-modal guidance gate controls how the propagated sparse code from one modality modulates the direction and magnitude of the residual correction of the other modality, while the adaptive sparse threshold suppresses weak, noisy, or redundant latent responses through soft shrinkage. Therefore, reliable cross-modal cues are selectively propagated, whereas unreliable modality-specific responses caused by visual degradation, background-triggered events, or clutter are suppressed.
This mechanism differs from ordinary gated fusion. Conventional gated fusion usually concatenates dense RGB and event features or reweights them using scalar, channel-wise, or spatial attention weights. In contrast, SMG-Bridge first represents each modality in a compact sparse latent space, performs cross-modal residual correction under the guidance of the opposite modality, applies adaptive sparse shrinkage, and then reconstructs the enhanced modality-specific features for final fusion. Thus, SMG-Bridge is closer to an unfolding-inspired sparse refinement process than to direct feature aggregation.
Let $\{F_R^2, F_R^3, F_R^4, F_R^5\}$ and $\{F_E^2, F_E^3, F_E^4, F_E^5\}$ denote the RGB and event features extracted from the two backbone branches. For notational simplicity, each feature map at stage $l$ is flattened along the spatial dimension and represented as
$$F_R^l,\; F_E^l \in \mathbb{R}^{B \times N_l \times C_l},$$
where $B$ is the batch size, $N_l = H_l W_l$ is the number of spatial tokens, and $C_l$ is the channel dimension at stage $l$. Instead of independently re-initializing sparse representations at every fusion level, SMG-Bridge initializes the sparse latent codes only once from the stage-2 features and then progressively propagates them through stages 3–5:
$$I^2 = P_0^I(F_R^2), \quad E^2 = P_0^E(F_E^2),$$
where $P_0^I(\cdot)$ and $P_0^E(\cdot)$ are lightweight initialization projections.
The propagated sparse latent codes are denoted as
$$I^l,\; E^l \in \mathbb{R}^{B \times N_l \times d_l},$$
where $N_l$ denotes the number of spatial tokens and $d_l$ is the latent dimension at stage $l$.
Before the propagated sparse codes are used at the next bridge, they are aligned to the current feature stage:
$$\hat{I}^{\,l-1 \to l} = A_I^l(I^{l-1}), \quad \hat{E}^{\,l-1 \to l} = A_E^l(E^{l-1}), \quad \hat{I}^{\,l-1 \to l},\; \hat{E}^{\,l-1 \to l} \in \mathbb{R}^{B \times N_l \times d_l},$$
where $A_I^l(\cdot)$ and $A_E^l(\cdot)$ denote lightweight cross-stage alignment operations that match both the token number and latent dimension of the current stage.
Specifically, the sparse token sequences propagated from stage $l-1$ are first rearranged according to their original two-dimensional spatial layouts. The resulting feature maps are resized to the spatial resolution of stage $l$ and are subsequently flattened back into token sequences, so that their token number changes from $N_{l-1}$ to $N_l$. A modality-specific linear projection is then applied along the last dimension to transform the latent dimension from $d_{l-1}$ to $d_l$. The RGB and event alignment operations follow the same processing structure but use independent learnable parameters, allowing modality-specific sparse information to be preserved during cross-stage propagation. After these operations, the propagated sparse codes have the same token number and latent dimension as the sparse representations generated at the current bridge and can therefore participate directly in the subsequent guidance and reconstruction operations.
For simplicity, we omit the superscript of the aligned sparse codes in the following equations and still denote them as $I^{l-1}$ and $E^{l-1}$. As illustrated in Figure 3, each bridge consists of three steps. Figure 3a corresponds to event-guided RGB sparse updating, where the propagated event sparse code acts as a latent guidance carrier to correct the current RGB feature. Figure 3b corresponds to RGB-guided event sparse updating, where the propagated RGB sparse code provides semantic guidance to refine the current event feature. Figure 3c corresponds to final feature fusion, where the enhanced RGB and event features are adaptively reweighted and further refined to generate the fused output.

3.3.1. Event-Guided RGB Sparse Updating

At the $l$-th bridge, the propagated event sparse code $E^{l-1}$ is first transformed into a latent guidance gate:
$$\Gamma_{E \to I}^{l} = \tanh\!\left(\mathrm{Gate}_{E \to I}^{l}(E^{l-1})\right), \quad \Gamma_{E \to I}^{l} \in \mathbb{R}^{B \times N_l \times d_l},$$
where $\mathrm{Gate}_{E \to I}^{l}(\cdot)$ is implemented as a token-wise single linear layer applied in the low-dimensional sparse latent space:
$$\mathrm{Gate}_{E \to I}^{l}(E^{l-1}) = E^{l-1} W_{E \to I}^{l} + b_{E \to I}^{l},$$
with $W_{E \to I}^{l} \in \mathbb{R}^{d_l \times d_l}$ and $b_{E \to I}^{l} \in \mathbb{R}^{d_l}$. The $\tanh(\cdot)$ function constrains the gate to $[-1, 1]$ so that the event cue can either suppress or enhance the current RGB sparse update. Since the gate operates only along the latent dimension $d_l$ and does not involve spatial convolution or self-attention, it introduces only $d_l^2 + d_l$ parameters and approximately $B N_l d_l^2$ multiply-add operations for one directional gate. In our implementation, the latent dimension is set to $d_l = 16$, so each directional gate contains only 272 parameters.
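The parameter and multiply-add counts quoted above follow directly from the shape of a single token-wise linear layer. The following minimal sketch reproduces the arithmetic; the function name `gate_cost` and the token count used in the example (a 160 × 90 grid) are illustrative assumptions and are not taken from the paper.

```python
def gate_cost(d_latent: int, batch: int, tokens: int) -> tuple[int, int]:
    """Return (parameters, multiply-adds) of one token-wise linear gate."""
    # A d x d weight matrix plus a d-dimensional bias.
    params = d_latent * d_latent + d_latent
    # Each of the batch * tokens latent vectors is multiplied by the d x d matrix.
    macs = batch * tokens * d_latent * d_latent
    return params, macs


# d_l = 16 as stated in the text; batch and token count are hypothetical.
params, macs = gate_cost(d_latent=16, batch=1, tokens=160 * 90)
print(params)  # 272
print(macs)    # 3686400
```

The parameter count is independent of the spatial resolution, whereas the multiply-add count grows linearly with the number of tokens.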
The RGB sparse code is then refined through an event-guided residual correction:
$$\tilde{I}^{l} = I^{l-1} + \left(1 + \Gamma_{E \to I}^{l}\right) \odot P_l^I\!\left(F_R^l - D_l^I(I^{l-1})\right),$$
where $D_l^I(\cdot)$ maps the RGB sparse code back to the RGB feature space, $P_l^I(\cdot)$ projects the reconstruction residual into the sparse latent space, and $\odot$ denotes element-wise multiplication. The residual term $F_R^l - D_l^I(I^{l-1})$ measures the mismatch between the current RGB feature and its latent reconstruction, while the event-guided gate adaptively modulates the correction direction and magnitude according to the propagated event response.
To suppress noisy or redundant activations, a soft-shrinkage operator is further applied:
$$I^{l} = \mathcal{S}\!\left(\tilde{I}^{l}, \lambda_I^l\right),$$
where $\mathcal{S}(\cdot)$ performs sparsity-inducing shrinkage on the updated RGB sparse code. The purpose of this operator is to remove weak latent responses whose magnitudes are smaller than the adaptive threshold, while preserving stronger responses that are more likely to correspond to reliable UAV-related cues.
The adaptive threshold is predicted from the current RGB feature:
$$\lambda_I^l = \mathrm{Softplus}\!\left(\mathrm{MLP}_I^l(F_R^l)\right), \quad \lambda_I^l \in \mathbb{R}^{B \times N_l \times d_l}.$$
Here, $\mathrm{MLP}_I^l(\cdot)$ is applied token-wise to the current RGB feature and maps the feature dimension from $C_l$ to the sparse latent dimension $d_l$. Therefore, $\lambda_I^l$ is an adaptive threshold predicted for each sample, each spatial token, and each latent channel, rather than a global scalar threshold. This token-channel-wise threshold allows the sparse updating process to adapt to local image content and modality degradation.
The soft-shrinkage operator is defined as
$$\mathcal{S}(x, \lambda) = \mathrm{sign}(x) \odot \max\!\left(|x| - \lambda,\, 0\right),$$
where $\odot$ denotes element-wise multiplication. If the magnitude of a latent response is smaller than the corresponding threshold, it is suppressed to zero; otherwise, its magnitude is reduced by the threshold. In this way, soft shrinkage encourages sparse latent representations and suppresses weak, noisy, or redundant activations introduced during cross-modal residual updating. $\mathrm{Softplus}(\cdot)$ is used to ensure that the predicted threshold is non-negative, which is required for valid shrinkage-based sparse regularization. Compared with a hard clipping operation, Softplus is smooth and differentiable, making the adaptive threshold prediction stable during end-to-end training.
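The behaviour of soft shrinkage with a Softplus-predicted threshold can be checked on scalar values. The sketch below is a plain-Python illustration rather than the authors' implementation; `raw_threshold` stands in for one output of the threshold MLP and is a hypothetical value.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); the result is always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def soft_shrink(x: float, lam: float) -> float:
    """S(x, lam) = sign(x) * max(|x| - lam, 0)."""
    return math.copysign(max(abs(x) - lam, 0.0), x)


raw_threshold = 0.0              # hypothetical MLP output for one token-channel
lam = softplus(raw_threshold)    # log(2), roughly 0.693
codes = [0.3, -0.5, 1.5, -2.0]   # hypothetical latent responses
shrunk = [soft_shrink(c, lam) for c in codes]
```

With this threshold, the two responses whose magnitude is below `lam` are set to zero, while the two larger responses keep their sign and lose `lam` in magnitude.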
After sparse updating, the enhanced RGB feature is reconstructed in a residual form, written here as an in-place update:
$$F_R^l \leftarrow F_R^l + D_l^I(I^l),$$
where $D_l^I(\cdot)$ projects the refined sparse code back to the RGB feature space. This reconstruction preserves reliable RGB appearance structures while selectively emphasizing UAV-related cues recovered under event guidance.
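The complete event-guided RGB update (gate, gated residual correction, soft shrinkage, residual reconstruction) can be traced on a single token. The sketch below is a toy illustration under stated assumptions: the reconstruction and projection operators are taken to be plain linear maps, the threshold is passed in directly instead of being predicted by an MLP, and all names and numerical values are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply a matrix stored as a list of rows by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def soft_shrink(x, lam):
    """S(x, lam) = sign(x) * max(|x| - lam, 0)."""
    return math.copysign(max(abs(x) - lam, 0.0), x)


def guided_sparse_update(feat, code_prev, guide_prev, D, P, W_gate, b_gate, lam):
    """One guided sparse update for a single token; D and P are assumed linear."""
    # Gate in [-1, 1], computed from the other modality's propagated code.
    gate = [math.tanh(z + b) for z, b in zip(matvec(W_gate, guide_prev), b_gate)]
    # Residual between the observed feature and its latent reconstruction.
    residual = [f - r for f, r in zip(feat, matvec(D, code_prev))]
    # Project the residual into the latent space and scale it by (1 + gate).
    step = matvec(P, residual)
    code_tilde = [c + (1.0 + g) * s for c, g, s in zip(code_prev, gate, step)]
    # Soft shrinkage removes responses below the per-channel threshold.
    code = [soft_shrink(c, t) for c, t in zip(code_tilde, lam)]
    # Residual reconstruction of the enhanced feature.
    enhanced = [f + r for f, r in zip(feat, matvec(D, code))]
    return code, enhanced


# Toy setting: feature and latent dimension both 2, identity D and P.
eye = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
feat, code_prev, guide_prev = [1.0, 0.1], [0.0, 0.0], [0.0, 0.0]
lam = [0.2, 0.2]

# Neutral gate (tanh(0) = 0): the strong channel survives, the weak one is zeroed.
code, enhanced = guided_sparse_update(
    feat, code_prev, guide_prev, eye, eye, zeros, [0.0, 0.0], lam)

# Strongly negative gate (tanh(-10) is close to -1): the correction is suppressed.
code_sup, _ = guided_sparse_update(
    feat, code_prev, guide_prev, eye, eye, zeros, [-10.0, -10.0], lam)
```

In the neutral case the refined code is approximately `[0.8, 0.0]`; in the suppressed case the factor `1 + gate` is close to zero, so the correction falls below the threshold and the code stays at zero. The same function applies to the RGB-guided event update by swapping the roles of the two modalities.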

3.3.2. RGB-Guided Event Sparse Updating

Symmetrically, the propagated RGB sparse code $I^{l-1}$ is used to guide event sparse updating.
Since the event branch is driven by a voxelized event representation, the intermediate event responses retain a temporal-step layout. To make the operation in Figure 3b explicit, we use $\mathcal{M}_T(\cdot)$ to denote temporal aggregation and $\mathcal{B}_T(\cdot)$ to denote temporal broadcast. The temporal aggregation operator $\mathcal{M}_T(\cdot)$ maps the event response to the stage-level feature used for sparse updating:
$$\bar{F}_E^l = \mathcal{M}_T(F_E^l), \quad \bar{F}_E^l \in \mathbb{R}^{B \times N_l \times C_l},$$
where $\bar{F}_E^l$ denotes the temporally aggregated event feature. If the event feature has already been aggregated before entering the bridge, $\mathcal{M}_T(\cdot)$ reduces to an identity operation.
The RGB-guided latent gate is defined as
$$\Gamma_{I \to E}^{l} = \tanh\!\left(\mathrm{Gate}_{I \to E}^{l}(I^{l-1})\right), \quad \Gamma_{I \to E}^{l} \in \mathbb{R}^{B \times N_l \times d_l},$$
where $\mathrm{Gate}_{I \to E}^{l}(\cdot)$ is implemented in the same way as $\mathrm{Gate}_{E \to I}^{l}(\cdot)$, namely a token-wise single linear layer followed by $\tanh(\cdot)$. Therefore, the two directional gates in each bridge introduce only $2(d_l^2 + d_l)$ parameters. With $d_l = 16$, the bidirectional gates contain only 544 parameters in total for each bridge.
The event sparse code is updated as
$$\tilde{E}^{l} = E^{l-1} + \left(1 + \Gamma_{I \to E}^{l}\right) \odot P_l^E\!\left(\bar{F}_E^l - D_l^E(E^{l-1})\right),$$
where $D_l^E(\cdot)$ and $P_l^E(\cdot)$ are the reconstruction and projection operators for the event branch, respectively. Here, the residual term $\bar{F}_E^l - D_l^E(E^{l-1})$ measures the mismatch between the current temporally aggregated event feature and the feature reconstructed from the propagated event sparse code. The RGB-guided gate modulates this residual correction in the sparse latent space, so that RGB-derived semantic cues can suppress background-triggered event responses while retaining motion-sensitive UAV responses.
The refined event sparse code is then regularized by
$$E^{l} = \mathcal{S}\!\left(\tilde{E}^{l}, \lambda_E^l\right),$$
where the adaptive threshold is predicted as
$$\lambda_E^l = \mathrm{Softplus}\!\left(\mathrm{MLP}_E^l(\bar{F}_E^l)\right), \quad \lambda_E^l \in \mathbb{R}^{B \times N_l \times d_l}.$$
The event-side threshold $\lambda_E^l$ is predicted in a token-channel-wise manner from the temporally aggregated event feature $\bar{F}_E^l$. It is used to suppress weak or noisy event responses in the sparse latent space.
After sparse refinement, the enhanced event feature is reconstructed in a residual form, written here as an in-place update:
$$\bar{F}_E^l \leftarrow \bar{F}_E^l + D_l^E(E^l),$$
where $D_l^E(\cdot)$ maps the refined sparse code back to the event feature space. The temporal broadcast operator $\mathcal{B}_T(\cdot)$ then restores this correction to the temporal layout of the event branch:
$$F_E^l \leftarrow F_E^l + \mathcal{B}_T\!\left(\Delta \bar{F}_E^l\right),$$
where $\Delta \bar{F}_E^l$ denotes the reconstructed correction $D_l^E(E^l)$, and $\mathcal{B}_T(\cdot)$ broadcasts it to match the temporal-bin or temporal-step structure of $F_E^l$. If the event feature has no explicit temporal dimension at this stage, $\mathcal{B}_T(\cdot)$ also reduces to an identity operation. In this way, RGB semantics provide stable appearance priors to filter unreliable event responses, while the event branch still preserves its motion-sensitive temporal characteristics.

3.3.3. Fusion of Enhanced Features

After bidirectional sparse mutual updating at level $l$, the bridge obtains the enhanced RGB feature $F_R^l$ and the enhanced event feature $F_E^l$. Instead of directly concatenating the two features, we use a lightweight dynamic fusion gate to predict modality-aware fusion weights:
$$[\gamma_R^l, \gamma_E^l] = \mathrm{Softmax}_m\!\left(\mathrm{MLP}_\gamma^l\!\left([F_R^l, F_E^l]\right)\right),$$
where $[\cdot, \cdot]$ denotes channel-wise concatenation, $\mathrm{MLP}_\gamma^l(\cdot)$ is a lightweight token-wise MLP, and $\mathrm{Softmax}_m(\cdot)$ is applied along the modality dimension. The predicted coefficients satisfy
$$\gamma_R^l,\; \gamma_E^l \in \mathbb{R}^{B \times N_l \times C_l}, \quad \gamma_R^l + \gamma_E^l = 1.$$
Therefore, $\gamma_R^l$ and $\gamma_E^l$ are dynamically predicted token-channel-wise fusion weights. They are not global scalar weights, static channel-wise parameters, or standalone spatial attention maps. Instead, they adaptively modulate the RGB and event features for each spatial token and channel according to the local cross-modal feature responses.
The two enhanced features are then aggregated as
$$F_{RE}^l = \gamma_R^l \odot F_R^l + \gamma_E^l \odot F_E^l,$$
where $\odot$ denotes element-wise multiplication. This dynamic reweighting allows the bridge to emphasize the more reliable modality at each spatial-token and channel position, while suppressing less reliable responses caused by RGB degradation or event noise.
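Because the softmax is taken over only two modalities, each fused value is a convex combination of the RGB and event responses at that token and channel. The sketch below illustrates this for a single position; the function name and the logits, which stand in for the outputs of the fusion MLP, are hypothetical.

```python
import math


def fuse_token_channel(f_rgb, f_evt, logit_rgb, logit_evt):
    """Two-way softmax over the modality axis, then weighted aggregation."""
    m = max(logit_rgb, logit_evt)          # subtract the max for stability
    w_rgb = math.exp(logit_rgb - m)
    w_evt = math.exp(logit_evt - m)
    total = w_rgb + w_evt
    gamma_rgb, gamma_evt = w_rgb / total, w_evt / total
    fused = gamma_rgb * f_rgb + gamma_evt * f_evt
    return gamma_rgb, gamma_evt, fused


# Hypothetical case: a degraded RGB response and a strong event response.
gamma_rgb, gamma_evt, fused = fuse_token_channel(
    f_rgb=0.2, f_evt=1.0, logit_rgb=-2.0, logit_evt=2.0)
```

Here the event weight equals the sigmoid of the logit difference (about 0.98), so the fused value lies close to the event response.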
The aggregated feature is further refined by a lightweight convolutional block:
$$F_f^l = \mathrm{SiLU}\!\left(\mathrm{BN}\!\left(\mathrm{Conv}_{3 \times 3}\!\left(F_{RE}^l\right)\right)\right).$$
The resulting fused features $\{F_f^3, F_f^4, F_f^5\}$ are then fed into SGP-Neck for subsequent multiscale small-UAV enhancement.

3.4. Selective Gated Pyramid Neck

Although SMG-Bridge provides fused RGB–event features at multiple scales, reliable small-UAV detection still requires effective cross-scale aggregation. Low-level features retain fine localization cues but are sensitive to background clutter, whereas high-level features provide stronger semantics but may lose tiny-target details. To address this issue, we introduce the Selective Gated Pyramid Neck, which performs local response enhancement, top-down semantic injection, and bottom-up detail feedback in a gated manner, as shown in Figure 4.
Given the fused features $\{F_f^3, F_f^4, F_f^5\}$ from SMG-Bridge, the features from different stages are first projected to a unified pyramid dimension:
$$X_l = \phi_l(F_f^l), \quad X_l \in \mathbb{R}^{B \times C_p \times H_l \times W_l}, \quad l \in \{3, 4, 5\},$$
where $\phi_l(\cdot)$ denotes a stage-specific $1 \times 1$ convolution followed by batch normalization and SiLU activation. This operation aligns the channel dimension of the multiscale features to the unified pyramid channel number $C_p$, ensuring dimensional consistency in the subsequent addition and concatenation operations.
Each channel-aligned feature is then enhanced by the Local Response Perception Module (LRPM):
$$R_l = \sigma\!\left(\mathrm{Conv}_{1 \times 1}\!\left(\mathrm{DWConv}_{3 \times 3}(X_l)\right)\right),$$
$$\tilde{X}_l = X_l + R_l \odot X_l,$$
where $\mathrm{DWConv}_{3 \times 3}(\cdot)$ captures local spatial context, $\sigma(\cdot)$ denotes the sigmoid function, and $\tilde{X}_l$ denotes the locally enhanced feature used in the subsequent pyramid aggregation.
Next, top-down semantic gating injects high-level context into finer-resolution features. We first define the top-down upsampling operation as
$$\mathrm{Up}_l(T_{l+1}) = \psi_l^u\!\left(\mathrm{Interp}(T_{l+1};\, H_l, W_l)\right),$$
where $\mathrm{Interp}(\cdot\,; H_l, W_l)$ denotes bilinear interpolation to the spatial resolution $(H_l, W_l)$, and $\psi_l^u(\cdot)$ is a $1 \times 1$ convolution used for channel alignment. Thus, $\mathrm{Up}_l(T_{l+1}) \in \mathbb{R}^{B \times C_p \times H_l \times W_l}$.
The top-down pathway is then formulated as
$$T_5 = \tilde{X}_5,$$
$$G_l^s = \sigma\!\left(\mathrm{Conv}_{1 \times 1}^{s,l}\!\left([\tilde{X}_l,\, \mathrm{Up}_l(T_{l+1})]\right)\right), \quad l \in \{4, 3\},$$
$$T_l = \tilde{X}_l + G_l^s \odot \mathrm{Up}_l(T_{l+1}), \quad l \in \{4, 3\}.$$
Here, $[\cdot, \cdot]$ denotes channel-wise concatenation. Since both $\tilde{X}_l$ and $\mathrm{Up}_l(T_{l+1})$ have the same spatial resolution and channel dimension, their concatenation has the shape $\mathbb{R}^{B \times 2C_p \times H_l \times W_l}$. The $1 \times 1$ convolution in Equation (40) maps the concatenated feature back to $C_p$ channels, and the sigmoid function produces $G_l^s \in \mathbb{R}^{B \times C_p \times H_l \times W_l}$. Therefore, $G_l^s$ is a spatial-channel-wise gate that adaptively controls the contribution of high-level semantic information at each spatial location and channel.
To further preserve localization-sensitive details, a bottom-up detail-gating path is adopted. The downsampling operation is defined as
$$\mathrm{Down}_l(F_P^{l-1}) = \psi_l^d(F_P^{l-1}),$$
where $\psi_l^d(\cdot)$ is implemented as a $3 \times 3$ convolution with stride 2 and padding 1, followed by batch normalization and SiLU activation. It reduces the spatial resolution from $(H_{l-1}, W_{l-1})$ to $(H_l, W_l)$ and outputs $C_p$ channels, such that $\mathrm{Down}_l(F_P^{l-1}) \in \mathbb{R}^{B \times C_p \times H_l \times W_l}$.
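The spatial size produced by a 3 × 3 convolution with stride 2 and padding 1 follows the standard convolution output formula, floor((size + 2·padding − kernel) / stride) + 1. The sketch below applies it twice to a stage-3 grid; the stage-3 resolution of 90 × 160 (a 720 × 1280 input at stride 8) is an assumption made for illustration and is not stated in the text.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


h3, w3 = 90, 160                      # hypothetical stage-3 resolution
h4, w4 = conv_out(h3), conv_out(w3)   # 45, 80
h5, w5 = conv_out(h4), conv_out(w4)   # 23, 40
```

Odd sizes are rounded up by this configuration (45 becomes 23), so the element-wise addition in the bottom-up pathway requires the backbone feature at the same level to share that resolution.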
The bottom-up pathway is formulated as
$$F_P^3 = T_3,$$
$$G_l^d = \sigma\!\left(\mathrm{Conv}_{1 \times 1}^{d,l}\!\left([T_l,\, \mathrm{Down}_l(F_P^{l-1})]\right)\right), \quad l \in \{4, 5\},$$
$$F_P^l = T_l + G_l^d \odot \mathrm{Down}_l(F_P^{l-1}), \quad l \in \{4, 5\}.$$
Similarly, $G_l^d \in \mathbb{R}^{B \times C_p \times H_l \times W_l}$ is a spatial-channel-wise gate predicted from the concatenation of $T_l$ and the downsampled lower-level pyramid feature. It adaptively controls how much fine-grained, localization-sensitive information is propagated from the lower-level, higher-resolution feature path to the current pyramid level. In this way, SGP-Neck performs dimensionally aligned top-down semantic enhancement and bottom-up detail refinement through explicit upsampling, downsampling, and spatial-channel-wise gating operations.
Finally, the enhanced pyramid features $\{F_P^3, F_P^4, F_P^5\}$ are fed into the YOLO-style detection head for multiscale UAV prediction.

4. Experiments and Results

4.1. Datasets and Implementation Details

We evaluate the proposed SMG-UAV on two RGB–event drone detection datasets, namely FRED and NeRDD. Both datasets provide synchronized RGB frames and event streams, making them suitable for evaluating the effectiveness of RGB–event fusion in UAV detection.
FRED. The Florence RGB–event Drone dataset (FRED) is a multimodal drone perception dataset designed for drone detection, tracking, and trajectory forecasting [17]. It provides spatio-temporally synchronized RGB video and event streams with dense UAV trajectory annotations. The dataset contains more than 7 h of annotated drone recordings, covers five different drone models, and includes challenging conditions such as rain, adverse illumination, distractors, and diverse motion patterns. In our experiments, FRED is used as the primary benchmark to evaluate the detection performance and robustness of the proposed method in realistic RGB–event anti-UAV scenarios.
NeRDD. NeRDD is a Neuromorphic-RGB Drone Detection dataset specifically collected for Event-RGB drone detection [18]. It contains more than 3.5 h of spatio-temporally synchronized RGB–event drone recordings, corresponding to approximately 7 h of multimodal footage. The dataset is divided into 115 videos, and both modalities are provided at HD resolution of 1280 × 720 with 30 FPS. Since NeRDD contains synchronized RGB and event data with drone annotations, it is used as an additional benchmark to evaluate the performance consistency of SMG-UAV across different UAV RGB–event benchmarks.
Evaluation metrics. Following common object detection protocols, we report AP$_{50}$, AP$_{75}$, and AP$_{50:95}$. AP$_{50}$ and AP$_{75}$ denote the average precision at IoU thresholds of 0.50 and 0.75, respectively. AP$_{50:95}$ denotes the mean average precision averaged over IoU thresholds from 0.50 to 0.95 with a step size of 0.05. These metrics jointly evaluate coarse localization accuracy, strict localization quality, and overall detection performance.
Implementation details. All experiments are conducted on a high-performance computational server equipped with an Intel Xeon Platinum 8470Q CPU and an NVIDIA RTX PRO 6000 Blackwell GPU. The software environment is built on Ubuntu 22.04, Python 3.12, PyTorch 2.8.0, and CUDA 12.8. The spiking components in the proposed Spiking CSPDarknet are implemented using SpikingJelly 0.14 [51].
All models are trained from scratch for up to 200 epochs. To preserve the fine-grained appearance and weak structural cues of UAV targets, the input resolution is fixed at 1280 × 720 during both training and evaluation. For a fair and controlled comparison, all methods use the same sequence-level dataset split, input resolution, training budget, spatial preprocessing, optimization protocol, and evaluation metrics.
The compared methods are reproduced using their official implementations and released model configurations whenever available. The official network architecture and method-specific components are retained, while only the adaptations required for FRED and NeRDD, including the dataset path, annotation format, number of target classes, input resolution, and input channel configuration, are introduced. No additional method-specific hyperparameter search or preferential tuning is performed for any baseline.
Because different event-based methods are originally designed for different event representations, their event encodings are not forcibly unified. For all event-only and RGB–event methods, the raw event stream is first temporally aligned with the corresponding RGB frame interval and is then converted into the event representation required by the respective method. Specifically, SMG-UAV converts the asynchronous event stream into an event voxel grid with T = 3 temporal bins and two polarity channels, resulting in a six-channel event representation. For the compared event-only and RGB–event methods, we retain the native event encoding adopted by their official implementations, such as accumulated event frames or voxel grids, together with the corresponding temporal-bin and polarity settings. This protocol preserves the original design of each method while ensuring that all methods use the same synchronized RGB–event samples and evaluation protocol.
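The six-channel event representation described above (three temporal bins times two polarities) can be sketched as a simple counting procedure. The text does not specify the binning rule or the channel ordering, so the hard time binning and the `bin * 2 + polarity` layout below are assumptions, as are the function name and the toy events.

```python
def events_to_voxel(events, t_start, t_end, height, width, bins=3):
    """Accumulate (x, y, t, p) events, p in {0, 1}, into a [bins*2][H][W] grid."""
    grid = [[[0.0] * width for _ in range(height)] for _ in range(bins * 2)]
    span = t_end - t_start
    for x, y, t, p in events:
        if not (t_start <= t < t_end):
            continue                      # keep only events inside the frame interval
        b = min(int((t - t_start) / span * bins), bins - 1)
        grid[b * 2 + p][y][x] += 1.0      # assumed channel layout: bin-major, then polarity
    return grid


# Toy events within one frame interval at 30 FPS (timestamps in seconds).
events = [(0, 0, 0.000, 1), (1, 0, 0.012, 0), (1, 1, 0.030, 1), (0, 0, 0.040, 1)]
grid = events_to_voxel(events, t_start=0.0, t_end=1.0 / 30.0, height=2, width=2)
```

The last event falls outside the interval and is dropped; the remaining three land in channels 1, 2, and 5, respectively.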
For the common optimization protocol, stochastic gradient descent is used with a momentum of 0.937 and a weight decay of 0.0005. The learning-rate schedule consists of linear warm-up during the first three epochs, followed by cosine annealing. The initial learning rate is set to 0.01 and decays to 0.0001. An early-stopping strategy with a patience of 30 epochs is applied, and the best-performing model weights on the validation set are used for final evaluation. Therefore, the compared methods share the same optimization settings and training budget, while their official architectures and native event-representation designs are retained.
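The learning-rate schedule above can be written as a function of the epoch index. The sketch below uses the stated values (3 warm-up epochs, 200 epochs, 0.01 decaying to 0.0001); the warm-up starting point and the per-epoch rather than per-iteration granularity are assumptions, since the text does not specify them.

```python
import math


def learning_rate(epoch, total_epochs=200, warmup_epochs=3,
                  lr_init=0.01, lr_final=0.0001):
    """Linear warm-up followed by cosine annealing, evaluated once per epoch."""
    if epoch < warmup_epochs:
        # Assumed warm-up: ramp linearly so the last warm-up epoch reaches lr_init.
        return lr_init * (epoch + 1) / warmup_epochs
    # Cosine annealing from lr_init down to lr_final at the final epoch.
    progress = (epoch - warmup_epochs) / (total_epochs - 1 - warmup_epochs)
    return lr_final + 0.5 * (lr_init - lr_final) * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(e) for e in range(200)]
```

With early stopping at a patience of 30 epochs, training may end before the schedule reaches its final value.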

4.2. Comparison with State-of-the-Art Methods

We compare SMG-UAV with state-of-the-art methods on FRED and NeRDD. The compared methods include representative RGB-only detectors, event-only detectors, and RGB–event fusion detectors. Specifically, the RGB-only detectors include YOLOv12 [52], MambaYOLO [53], and RT-DETR [54]; the event-only detectors include RVT [55], SAST [56], and SMamba [57]; and the RGB–event fusion detectors include FPN-Fusion [12], RENet [13], SODFormer [14], and EOLO [15].
For a fair comparison, all methods are trained and evaluated under the same dataset split, input resolution, and evaluation protocol whenever applicable, while retaining the original event encoding form of each compared method. To reduce the influence of random initialization and training stochasticity, each method is trained three times with different random seeds. The quantitative results are reported in Table 1 as mean ± standard deviation.
As shown in Table 1, RGB-only detectors show limited performance on both FRED and NeRDD. Although RGB frames provide rich texture, color, and appearance information, these cues become unreliable when UAV targets are small, distant, blurred, weakly contrasted, or affected by illumination degradation. For example, the best RGB-only AP$_{50:95}$ reaches only 14.3 ± 0.5 on FRED and 17.7 ± 0.5 on NeRDD, indicating that appearance-based detection alone is insufficient for robust anti-UAV perception in challenging dynamic scenes.
Event-only detectors achieve substantially better results than RGB-only detectors on both datasets. This demonstrates the importance of high-temporal-resolution motion cues for UAV detection. Event streams can provide more discriminative motion-sensitive responses, especially under fast motion, motion blur, and illumination variation. Among the event-only methods, SMamba obtains the best AP$_{50:95}$ of 47.7 ± 0.6 on FRED and 33.9 ± 0.5 on NeRDD. However, event-only detection still has limitations. Event responses may become sparse when the UAV moves slowly or appears at a long distance, and irrelevant background motion may introduce noisy activations.
RGB–event fusion methods generally improve over RGB-only methods, confirming the benefit of multimodal sensing for drone detection. However, the results also show that not all fusion strategies consistently outperform the strongest event-only baseline. For instance, some fusion methods achieve lower mean AP$_{50:95}$ than SMamba on FRED or only marginal gains on NeRDD. This suggests that directly combining RGB and event features is not sufficient. When one modality is degraded or when the two modalities contain inconsistent responses, conventional fusion strategies may introduce unreliable information rather than suppress it.
In contrast, SMG-UAV achieves the best mean performance across all metrics on both datasets. On FRED, SMG-UAV reaches 89.3 ± 0.2 AP$_{50}$, 37.2 ± 0.4 AP$_{75}$, and 51.6 ± 0.3 AP$_{50:95}$, outperforming the strongest competing method by 6.8, 1.3, and 3.4 points in terms of mean performance, respectively. On NeRDD, SMG-UAV obtains 88.7 ± 0.2 AP$_{50}$, 30.1 ± 0.3 AP$_{75}$, and 49.5 ± 0.4 AP$_{50:95}$, exceeding the best competing method by 3.6, 2.8, and 8.3 points in terms of mean performance, respectively. Although the AP$_{75}$ improvement on FRED is smaller than the improvements in AP$_{50}$ and AP$_{50:95}$, the standard deviation of SMG-UAV remains low, indicating that the gain is stable across different random seeds. These results suggest that the proposed method exhibits both strong detection performance and stable training behavior across two UAV RGB–event benchmarks.
Although AP-based metrics provide a standard and comprehensive evaluation of detection accuracy, they do not fully reflect the operational reliability required by practical anti-UAV surveillance. In such scenarios, missed targets may lead to delayed warning, while frequent false alarms may increase the burden of downstream tracking, verification, and response modules. Therefore, in addition to AP$_{50}$, AP$_{75}$, and AP$_{50:95}$, we further report precision, recall, F1-score, false positives per frame (FP/frame), and miss rate on the FRED dataset in Table 2. We select FRED for this additional reliability-oriented evaluation because it provides a more comprehensive RGB–event anti-UAV benchmark with diverse UAV motion patterns, target scales, illumination conditions, background clutter, and challenging surveillance scenarios.
For a fair comparison, these additional metrics are computed from the post-NMS detection results under the same confidence threshold and IoU matching criterion for all compared methods. A predicted bounding box is counted as a true positive if its confidence score is higher than the predefined confidence threshold and its IoU with an unmatched ground-truth box is larger than 0.5. Predictions that do not match any ground-truth box are counted as false positives. Ground-truth boxes that are not matched by any valid prediction are treated as missed targets. In our evaluation, the confidence threshold is set to 0.3 for all methods, and the IoU threshold for matching is set to 0.5.
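The matching rule described above can be sketched for a single frame as follows. The text fixes the confidence threshold (0.3) and the IoU threshold (0.5) but does not state the order in which predictions are matched; processing them in descending confidence and assigning each to its best unmatched ground-truth box is an assumption, as are the function names and the toy boxes.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_frame(preds, gts, conf_thr=0.3, iou_thr=0.5):
    """preds: list of (box, score); gts: list of boxes. Returns (tp, fp, missed)."""
    kept = sorted((p for p in preds if p[1] > conf_thr),
                  key=lambda p: p[1], reverse=True)
    matched = set()
    tp = fp = 0
    for box, _score in kept:
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(gts):
            if j in matched:
                continue                  # each ground-truth box is used at most once
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1
    return tp, fp, len(gts) - len(matched)


gts = [(10, 10, 30, 30)]
preds = [((11, 11, 31, 31), 0.9),        # overlaps the target: true positive
         ((100, 100, 120, 120), 0.6),    # no overlap: false positive
         ((10, 10, 30, 30), 0.2)]        # below the confidence threshold: ignored
tp, fp, missed = match_frame(preds, gts)
```

Summing `tp`, `fp`, and `missed` over all evaluated frames gives the counts needed for the reliability metrics.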
The additional metrics are computed as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{N_{\mathrm{gt}}},$$
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{Miss\ Rate} = 1 - \mathrm{Recall},$$
$$\mathrm{FP/frame} = \frac{FP}{N_{\mathrm{frames}}},$$
where $TP$ and $FP$ denote true positives and false positives, respectively, $N_{\mathrm{gt}}$ denotes the number of ground-truth UAV instances, and $N_{\mathrm{frames}}$ denotes the number of evaluated frames. For anti-UAV surveillance, recall and FP/frame are particularly important because they directly reflect missed-warning risk and false-alarm burden.
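These definitions translate directly into a short helper. The sketch below is illustrative; the function name and the counts in the example are hypothetical and do not correspond to any result reported in the tables.

```python
def reliability_metrics(tp: int, fp: int, n_gt: int, n_frames: int) -> dict:
    """Precision, recall, F1, miss rate, and false positives per frame."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / n_gt if n_gt > 0 else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom > 0 else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "miss_rate": 1.0 - recall,
        "fp_per_frame": fp / n_frames if n_frames > 0 else 0.0,
    }


# Hypothetical counts accumulated over an evaluation set.
metrics = reliability_metrics(tp=80, fp=20, n_gt=100, n_frames=50)
```

For these counts, precision, recall, and F1 are all 0.8, the miss rate is 0.2, and the detector produces 0.4 false positives per frame.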
As shown in Table 2, the proposed SMG-UAV not only achieves the highest AP-based accuracy, but also provides a better balance between recall and false alarms. Compared with RGB-only and event-only detectors, SMG-UAV achieves higher recall, lower FP/frame, and a lower miss rate. This is important for practical low-altitude surveillance, where a detector is often used as the front-end of subsequent tracking, verification, and warning modules.
Although the overall comparison provides a general evaluation of detection performance, it cannot fully reveal the reliability of anti-UAV detectors under specific adverse conditions. In practical low-altitude surveillance scenarios, UAV targets may be extremely small, blurred by fast motion, degraded by abnormal illumination, submerged in complex backgrounds, or confused with bird-like moving distractors. Therefore, we further conduct a challenge-oriented robustness evaluation on the FRED dataset.
Specifically, we manually curate five UAV-specific challenging subsets from the FRED test set. The original bounding-box annotations are kept unchanged, and only frame-level difficulty attributes are additionally assigned for evaluation. Each selected RGB frame is paired with the event stream accumulated within the corresponding RGB frame interval. The five subsets include small target, motion blur, extreme illumination, background-embedded target, and bird-like distractor. The small-target subset contains UAV instances whose bounding-box size is smaller than 16 × 16 pixels. The extreme-illumination subset includes both overexposed and underexposed scenes. The background-embedded subset refers to cases where the UAV is visually similar to the surrounding background or partially obscured by cluttered structures. The bird distractor subset contains scenes with target-like moving objects that may cause false alarms in anti-UAV detection. The statistics of these subsets are summarized in Table 3, and representative examples are shown in Figure 5. These subsets are used only for evaluation and are not involved in the training process. The five subsets are not mutually exclusive, and a frame may be assigned to multiple difficulty attributes if multiple challenges coexist.
For the small-target subset, candidate samples are first selected from the original annotations by identifying frames that contain at least one UAV bounding box with both width and height smaller than 16 pixels. The candidate samples are then manually re-examined to exclude cases with inaccurate, misaligned, or otherwise unreliable original annotations. For the other four difficulty attributes, namely motion blur, extreme illumination, background-embedded, and bird distractor, frame-level labels are manually assigned according to the predefined visual qualification criteria described in Table 3.
To reduce subjectivity during subset construction, two annotators independently assign the five frame-level difficulty attributes. The annotators examine the RGB frames and refer to the original UAV bounding-box annotations. The corresponding event representations and temporally adjacent frames are additionally inspected when necessary, particularly when determining motion blur, background interference, or bird-like distractors. Cases in which the two annotators produce inconsistent assignments are independently reviewed and adjudicated by a third experienced annotator.
We evaluate representative methods from different input modalities on these five challenging subsets and report AP$_{50}$ as the robustness metric. AP$_{50}$ is adopted because this analysis focuses on whether UAV targets can be reliably detected under severe degradation, especially when the targets are small, weak, or visually ambiguous. The results are shown in Table 4.
As shown in Table 4, SMG-UAV achieves the best AP$_{50}$ on all five challenging subsets, demonstrating its robustness under diverse anti-UAV failure factors. On the small-target subset, SMG-UAV obtains 31.7 AP$_{50}$, outperforming the strongest baseline SMamba by 4.8 points. In contrast, RGB-only detectors only achieve 11.4 and 12.8 AP$_{50}$, showing that appearance information alone is insufficient when UAVs occupy only a few pixels.
For motion blur, SMG-UAV reaches 36.9 AP$_{50}$, exceeding SMamba by 4.2 points and SODFormer by 5.4 points. Event-only detection performs clearly better than RGB-only detection in this subset because event streams preserve motion-sensitive brightness changes. However, the further gain of SMG-UAV shows that event cues are more effective when they are used to guide RGB feature reconstruction rather than being used independently.
Under extreme illumination, the advantage of SMG-UAV becomes more evident. RGB-only methods drop to 5.7 and 4.9 AP$_{50}$ because UAV appearance cues are severely degraded by overexposure or underexposure. Event-only and fusion-based methods are more robust, but SMG-UAV still improves the best competing result from 41.2 to 47.6 AP$_{50}$. This suggests that the proposed sparse mutual guidance can better exploit illumination-insensitive event responses while suppressing unreliable RGB features.
For the background-embedded subset, SMG-UAV achieves 45.6 AP$_{50}$, surpassing the strongest baseline by 6.4 points. This subset is particularly challenging because the UAV is visually similar to the surrounding structures or partially submerged by cluttered backgrounds. The result indicates that SMG-UAV can recover more discriminative target representations by jointly using motion-sensitive event cues and RGB semantic constraints.
The bird distractor subset further evaluates the false-alarm resistance of different methods. Although RGB-only MambaYOLO obtains a relatively high AP$_{50}$ of 36.9, conventional event-only and fusion-based methods remain vulnerable to target-like distractors and background motion. SMG-UAV achieves 46.2 AP$_{50}$, improving the best competing result by 9.3 points. This result suggests that the bidirectional guidance in SMG-Bridge helps suppress irrelevant event responses and improves discrimination between real UAV targets and target-like distractors.
Overall, the challenge-oriented evaluation demonstrates that the proposed method is not only superior in overall benchmark performance, but also more reliable under key anti-UAV degradation factors. The consistent gains across small target, motion blur, extreme illumination, background embedding, and distractor interference provide stronger evidence for the effectiveness of sparse mutual guidance in practical low-altitude UAV detection.

4.3. Computational Complexity and Inference Efficiency

Since practical anti-UAV systems require both high detection accuracy and real-time response, we further evaluate the computational complexity and inference efficiency of different methods. The evaluated metrics include the number of parameters, GFLOPs, FPS, and inference latency. All methods are evaluated under the same input resolution and batch size of 1. The server-side FPS is measured on the same hardware platform used for training and evaluation, so that the reported speed is consistent with the experimental environment used for the accuracy comparison. The reported latency denotes the average network forward inference time per sample, excluding disk I/O, data loading and event voxel preprocessing. For RGB–event methods, both RGB and event inputs are included in the network inference pipeline. The latency is derived from the FPS as $T_{\text{latency}} = 1000 / \text{FPS}$ ms.
As shown in Table 5, some event-only models, such as RVT and SAST, achieve faster inference and lower parameter counts than SMG-UAV because of their single-modality design. In contrast, SMG-UAV introduces a dual-branch RGB–event architecture with spiking event processing, sparse mutual guidance, and pyramid feature enhancement. Although this adds some computational overhead relative to the fastest event-only models, SMG-UAV remains moderate in size, with 18.7 M parameters and 66.2 GFLOPs, and runs at 132 FPS with an average network latency of 7.58 ms. More importantly, it consistently achieves the highest detection accuracy across the AP 50, AP 75, and AP 50 : 95 metrics. Therefore, SMG-UAV provides a favorable balance between computational efficiency and detection reliability, making it well suited to practical low-altitude anti-UAV surveillance scenarios where both accuracy and real-time response are critical.
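The FPS-to-latency conversion used in this subsection can be checked with a few lines of Python. This is only a sketch of the stated relation, latency in milliseconds equals 1000 divided by FPS; the helper name `latency_ms` is ours, and the relation assumes strictly sequential processing at batch size 1 (no pipelining or batching).

```python
def latency_ms(fps: float) -> float:
    """Average per-sample latency in milliseconds implied by a throughput in FPS."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# Throughputs quoted in the text: 132 FPS (server GPU) and 40 FPS (Jetson Orin Nano).
for name, fps in [("server GPU", 132.0), ("Jetson Orin Nano", 40.0)]:
    print(f"{name}: {latency_ms(fps):.2f} ms")
```

Both values agree with the latencies reported for SMG-UAV (7.58 ms and 25.00 ms).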
To further evaluate embedded deployment potential, we also deploy the compared models on an NVIDIA Jetson Orin Nano development board (NVIDIA, Santa Clara, CA, USA). The Jetson Orin Nano is a widely used edge-AI development platform for embedded vision, robotics, autonomous systems, intelligent surveillance, and industrial inspection. Its compact form factor and limited computational resources make it suitable for evaluating the efficiency and practical deployment potential of the proposed method on resource-constrained edge-computing platforms. The embedded-platform FPS is measured with batch size 1 after model warm-up, excluding disk I/O, data loading and event voxel preprocessing. The results are reported in Table 6.
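The measurement protocol described above (batch size 1, warm-up first, only the forward call timed) can be sketched as follows. This is not the benchmarking script used for Tables 5 and 6; `measure_fps` and `dummy_forward` are hypothetical names, and the dummy workload merely stands in for a network forward pass. On a GPU, asynchronous kernel launches mean the device would also have to be synchronized before each clock read, which this CPU-only sketch omits.

```python
import time
from typing import Callable


def measure_fps(forward: Callable[[], object], warmup: int = 20, iters: int = 100) -> float:
    """Return calls per second for `forward`, ignoring the first `warmup` calls."""
    for _ in range(warmup):
        forward()  # warm-up calls are executed but not timed
    start = time.perf_counter()
    for _ in range(iters):
        forward()
    elapsed = time.perf_counter() - start
    return iters / elapsed


def dummy_forward() -> int:
    """Placeholder workload standing in for one network forward pass."""
    return sum(i * i for i in range(1000))


fps = measure_fps(dummy_forward, warmup=5, iters=50)
print(f"{fps:.1f} FPS, {1000.0 / fps:.3f} ms per call")
```

Data loading and event voxel construction would sit outside the timed loop, which is why the resulting figure reflects network-forward cost only.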
It should be noted that the reported latency measures only network forward inference. The complete end-to-end latency of a deployed RGB–event anti-UAV system may additionally include event stream buffering, event voxel construction, RGB–event synchronization, image resizing, and communication overhead. These factors are implementation- and hardware-dependent.
With these additional factors excluded, SMG-UAV achieves a forward inference speed of 40 FPS on the NVIDIA Jetson Orin Nano development board, showing that the proposed method can provide a favorable balance between detection accuracy and real-time performance on edge-computing platforms. This indicates its practical potential for low-altitude anti-UAV surveillance in resource-constrained embedded scenarios.

4.4. Qualitative Comparison and Heatmap Analysis

The quantitative results in Table 1 demonstrate the overall accuracy and challenge-specific robustness of SMG-UAV. However, numerical metrics alone cannot fully reveal how different detectors behave in practical anti-UAV scenarios. Therefore, we further provide qualitative detection comparisons and response heatmap visualizations to analyze the localization behavior and feature focus of different methods.
Figure 6 shows representative detection results under challenging scenes, including extreme illumination, background clutter, weak small targets, and low-contrast environments. For each scene, we compare representative RGB-only, event-only, and RGB–event fusion detectors with the proposed SMG-UAV. The confidence value displayed beside each bounding box in Figure 6 denotes the final detection confidence of the prediction retained after non-maximum suppression (NMS). For a fair comparison, all compared methods use the same confidence threshold of 0.30 and the same NMS IoU threshold of 0.50.
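The shared post-processing settings (confidence threshold 0.30, NMS IoU threshold 0.50) can be illustrated with a minimal greedy NMS sketch. The function names, the corner box format `(x1, y1, x2, y2)`, and the choice of `>=` for the confidence test and `<=` for the overlap test are our assumptions; the detectors compared in Figure 6 may use a different NMS variant.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_and_nms(boxes: Sequence[Box], scores: Sequence[float],
                   conf_thr: float = 0.30, iou_thr: float = 0.50) -> List[int]:
    """Return indices kept after confidence filtering and greedy NMS."""
    order = sorted((i for i, s in enumerate(scores) if s >= conf_thr),
                   key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 20, 20), (11, 11, 21, 21), (50, 50, 60, 60), (80, 80, 90, 90)]
scores = [0.9, 0.8, 0.6, 0.2]
print(filter_and_nms(boxes, scores))  # [0, 2]
```

In this toy example the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the last box falls below the confidence threshold.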
The visual results indicate that RGB-only methods are sensitive to severe illumination degradation and weak target appearance. When the UAV is overexposed, underexposed, or visually similar to the background, RGB-only detectors may miss the target or produce unstable confidence scores. Event-only methods can capture motion-sensitive responses and are less affected by static appearance degradation, but their predictions may become fragmented when event responses are sparse or when background motion introduces noise. Existing RGB–event fusion methods improve detection results by combining appearance and motion cues, but they may still suffer from localization drift or false responses when one modality is unreliable.
In contrast, SMG-UAV provides more stable detection results across different challenging cases. It produces tighter bounding boxes and more reliable confidence scores for distant or weak UAV targets. This advantage is especially clear in extreme-illumination and background-embedded scenes, where UAV appearance is severely degraded and the target region is difficult to distinguish from the surrounding background. These qualitative observations are consistent with the robustness analysis in Table 4.
To further explain the behavior of different detectors, Figure 7 presents Grad-CAM visualizations generated using a standard Grad-CAM implementation. For each model, the UAV detection score is used as the target output, and the feature layer immediately before the detection head is selected as the target layer. The generated heatmaps are resized to the input image resolution and independently normalized to the range [0, 1] using min–max normalization. The same color map is applied to all compared methods, where warmer colors indicate regions with greater relative contribution to the corresponding UAV prediction and cooler colors indicate lower relative contribution. Since each heatmap is independently normalized, the visualization is used to compare the spatial localization and concentration of model attention rather than the absolute response magnitude across different models.
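The effect of independent min–max normalization can be seen in a small sketch. The function name `minmax_normalize`, the handling of constant maps, and the toy values are our assumptions; plain Python lists are used only to keep the example dependency-free.

```python
from typing import List

Heatmap = List[List[float]]


def minmax_normalize(heatmap: Heatmap, eps: float = 1e-8) -> Heatmap:
    """Rescale one heatmap to [0, 1] using its own minimum and maximum."""
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span < eps:  # constant map: no spatial structure to display
        return [[0.0 for _ in row] for row in heatmap]
    return [[(v - lo) / span for v in row] for row in heatmap]


# Two toy activation maps with the same spatial pattern but a 10x magnitude gap.
cam_a = [[0.0, 2.0], [4.0, 8.0]]
cam_b = [[0.0, 0.2], [0.4, 0.8]]
print(minmax_normalize(cam_a))
print(minmax_normalize(cam_b))
```

Both maps normalize to the same pattern, which is why the heatmaps in Figure 7 support comparisons of where attention concentrates but not of absolute response strength across models.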
As shown in Figure 7, several existing RGB–event fusion methods produce diffuse or spatially shifted attention regions, with activations extending to irrelevant background structures. In contrast, SMG-UAV produces more compact and accurately localized attention around the true UAV regions while suppressing spurious background responses. This suggests that the proposed method learns more discriminative and target-aware feature representations under challenging anti-UAV conditions.
The qualitative detection results and heatmap visualizations provide intuitive evidence for the effectiveness of SMG-UAV. The proposed method not only improves quantitative detection performance, but also exhibits more reliable localization behavior and more discriminative internal responses in challenging anti-UAV environments.

4.5. Ablation Studies

To verify the effectiveness of the proposed components, we conduct a series of ablation studies on the FRED dataset. All variants are trained and evaluated under the same settings as the full model. Unless otherwise specified, only the analyzed component is changed, while the remaining architecture and training protocol are kept unchanged. AP 50 , AP 75 , and AP 50 : 95 are reported to evaluate detection accuracy at different localization thresholds.

4.5.1. Effectiveness of Main Components

We first evaluate the contribution of the main components in SMG-UAV through a progressive ablation study on the FRED dataset. The baseline is a basic dual-stream RGB–event detector, in which both branches use conventional CSPDarknet backbones, the event input is represented as a three-channel accumulated event frame, and cross-modal fusion is implemented by direct concatenation. Based on this baseline, we progressively introduce the event voxel representation, the Spiking CSPDarknet event encoder, SMG-Bridge, and the SGP-Neck. In addition to AP 50 , AP 75 , and AP 50 : 95 , Table 7 reports the cumulative parameter count and GFLOPs of each configuration to quantify the computational overhead introduced by the corresponding components.
As shown in Table 7, the baseline achieves 73.4 AP 50 , 30.2 AP 75 , and 38.7 AP 50 : 95 , with 16.0 M parameters and 54.1 GFLOPs. Replacing the accumulated event frame with the event voxel representation improves the performance to 77.9 AP 50 , 32.5 AP 75 , and 44.1 AP 50 : 95 , corresponding to improvements of 4.5, 2.3, and 5.4 points, respectively. This demonstrates that preserving the temporal distribution and polarity information of events within the RGB frame interval provides more informative motion cues than a temporally collapsed event frame.
The event voxel representation changes the event input from the three-channel accumulated event frame used in the baseline to a voxel tensor. The increased input channel dimension slightly enlarges the input layer of the event branch and introduces additional network-forward computation. Consequently, the parameter count increases from 16.0 M to 16.2 M, while the computational cost increases from 54.1 to 56.1 GFLOPs, corresponding to additional costs of 0.2 M parameters and 2.0 GFLOPs.
Further introducing the Spiking CSPDarknet event encoder raises the performance to 80.2 AP 50 , 35.3 AP 75 , and 44.5 AP 50 : 95 . This corresponds to improvements of 2.3, 2.8, and 0.4 points, respectively. The Spiking CSPDarknet encoder increases the cumulative model size from 16.2 M to 16.8 M parameters and the computational cost from 56.1 to 58.9 GFLOPs, introducing 0.6 M additional parameters and 2.8 GFLOPs. These results suggest that the spiking event branch better matches the sparse and motion-triggered characteristics of event data and improves hierarchical event feature extraction with a relatively limited computational overhead.
The largest accuracy gain is obtained by replacing direct concatenation with SMG-Bridge. This modification increases the performance to 87.6 AP 50 , 36.7 AP 75 , and 48.9 AP 50 : 95 , bringing gains of 7.4, 1.4, and 4.4 points over the preceding configuration, respectively. Meanwhile, SMG-Bridge introduces 1.0 M additional parameters and 4.5 GFLOPs, increasing the cumulative complexity from 16.8 M parameters and 58.9 GFLOPs to 17.8 M parameters and 63.4 GFLOPs. The substantial accuracy improvement relative to this moderate computational increase demonstrates that reliability-aware sparse mutual guidance is more effective than direct feature aggregation for exploiting complementary RGB–event information.
Finally, adding SGP-Neck further improves the results to 89.3 AP 50 , 37.2 AP 75 , and 51.6 AP 50 : 95 . Compared with the SMG-Bridge configuration, SGP-Neck provides additional gains of 1.7, 0.5, and 2.7 points, respectively. It introduces 0.9 M additional parameters and 2.8 GFLOPs, resulting in a full-model complexity of 18.7 M parameters and 66.2 GFLOPs. This result shows that lightweight gated multiscale enhancement remains beneficial after cross-modal fusion, particularly for small and weak UAV targets whose responses may be attenuated during hierarchical feature propagation.
Overall, the complete SMG-UAV improves the baseline by 15.9 AP 50 , 7.0 AP 75 , and 12.9 AP 50 : 95 , while introducing 2.7 M additional parameters and 12.1 GFLOPs. Among the proposed components, SMG-Bridge provides the largest accuracy improvement, whereas the event voxel representation, Spiking CSPDarknet encoder, and SGP-Neck introduce relatively limited additional complexity. These results demonstrate that the proposed components provide substantial detection gains with controlled computational overhead, resulting in a favorable accuracy–complexity trade-off.
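The stepwise gains and costs quoted above can be recomputed directly from the cumulative values in Table 7 as stated in the text. The snippet below is only a bookkeeping check; the tuple layout and the helper name `deltas` are ours.

```python
# (configuration, AP50, AP75, AP50:95, parameters in M, GFLOPs), as quoted in the text.
steps = [
    ("baseline", 73.4, 30.2, 38.7, 16.0, 54.1),
    ("+ event voxel", 77.9, 32.5, 44.1, 16.2, 56.1),
    ("+ Spiking CSPDarknet", 80.2, 35.3, 44.5, 16.8, 58.9),
    ("+ SMG-Bridge", 87.6, 36.7, 48.9, 17.8, 63.4),
    ("+ SGP-Neck", 89.3, 37.2, 51.6, 18.7, 66.2),
]


def deltas(prev: tuple, curr: tuple) -> tuple:
    """Per-metric difference between two configurations, rounded to one decimal."""
    return tuple(round(c - p, 1) for p, c in zip(prev[1:], curr[1:]))


increments = {curr[0]: deltas(prev, curr) for prev, curr in zip(steps, steps[1:])}
total = deltas(steps[0], steps[-1])

for name, d in increments.items():
    print(name, d)
print("full model vs. baseline", total)
```

The per-step increments sum to the overall improvement of 15.9 AP 50, 7.0 AP 75, and 12.9 AP 50 : 95 at a cost of 2.7 M parameters and 12.1 GFLOPs.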

4.5.2. Effect of Bidirectional Mutual Guidance

To further analyze the cross-modal interaction mechanism in SMG-Bridge, we compare different guidance directions. The direct fusion variant removes mutual guidance and aggregates RGB and event features directly. The event-to-image and image-to-event variants retain only one guidance direction, while the bidirectional variant corresponds to the full SMG-Bridge.
As shown in Table 8, direct fusion achieves 82.2 AP 50 , 35.7 AP 75 , and 45.1 AP 50 : 95 . Both one-way guidance variants improve over this baseline. Event-to-image guidance increases the performance to 86.9 AP 50 , 36.7 AP 75 , and 49.8 AP 50 : 95 , indicating that event cues provide effective motion-sensitive structure for enhancing degraded RGB representations. Image-to-event guidance also improves the results to 85.1 AP 50 , 36.1 AP 75 , and 48.4 AP 50 : 95 , showing that RGB appearance semantics help suppress noisy or irrelevant event responses.
The complete bidirectional guidance achieves the best performance, reaching 89.3 AP 50 , 37.2 AP 75 , and 51.6 AP 50 : 95 . Compared with event-to-image only, it further improves AP 50 and AP 50 : 95 by 2.4 and 1.8 points, respectively. These results show that the two guidance directions are complementary rather than redundant. By jointly exploiting event-guided RGB recovery and RGB-guided event refinement, SMG-Bridge produces more reliable cross-modal representations for robust UAV detection.

4.5.3. Effect of Sparse Thresholding and Guidance Gate

We further analyze two internal designs of SMG-Bridge: the adaptive sparse threshold and the dynamic guidance gate. As shown in Table 9, removing the guidance gate reduces AP 50 : 95 from 51.6 to 45.8, indicating that ungated cross-modal propagation introduces more unreliable responses. Replacing the dynamic gate with a static scalar also degrades the performance to 47.9 AP 50 : 95 , showing that spatially adaptive guidance is more effective than global weighting. Removing the adaptive sparse threshold lowers AP 50 : 95 to 49.4, suggesting that sparse thresholding contributes to suppressing noisy or unreliable responses during fusion. The full SMG-Bridge achieves the best results, demonstrating that both adaptive thresholding and dynamic guidance are important for reliable RGB–event fusion.
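To make the two ablated mechanisms concrete, the toy sketch below contrasts a spatially varying gate with a single static weight, using soft-thresholding as the sparsifying step. It is not the SMG-Bridge implementation: the update rule, the sigmoid gate, the fixed threshold `theta` (the adaptive variant would predict it from the features), and all values are assumptions chosen only for illustration.

```python
import math
from typing import List


def soft_threshold(value: float, theta: float) -> float:
    """Shrink `value` toward zero by `theta`; responses with |value| <= theta become 0."""
    return math.copysign(max(abs(value) - theta, 0.0), value)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def guided_update(target: List[float], guide: List[float],
                  gate_logits: List[float], theta: float) -> List[float]:
    """Add thresholded guidance to `target`, weighted per position by a gate in (0, 1)."""
    return [t + sigmoid(g) * soft_threshold(s, theta)
            for t, s, g in zip(target, guide, gate_logits)]


target = [0.2, 0.0, 0.1]
guide = [0.9, 0.05, -0.6]  # the middle entry is a weak, likely noisy response
# Dynamic gate: a different logit per position. Static gate: one shared weight (0.5).
dynamic = guided_update(target, guide, gate_logits=[4.0, -4.0, 0.0], theta=0.1)
static = guided_update(target, guide, gate_logits=[0.0, 0.0, 0.0], theta=0.1)
print(dynamic)
print(static)
```

In this toy setting the threshold removes the weak middle response before it is propagated, and the per-position gate lets strong guidance pass almost unchanged where a single global weight would attenuate every position equally.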

4.5.4. Effect of Event Voxel Configuration

We further investigate the influence of two key parameters in the event voxel representation: the number of temporal bins T and the event accumulation window Δt. To isolate the effect of each factor, only one parameter is varied at a time, while the remaining network architecture, training settings, input resolution, dataset split, and evaluation protocol are kept unchanged. When evaluating the number of temporal bins, the event accumulation window is fixed at 33.3 ms, corresponding to one RGB frame interval on FRED. When evaluating the accumulation window, the number of temporal bins is fixed at the default value of T = 3.
The positive and negative event polarities are accumulated separately in each temporal bin. Therefore, an event voxel with T temporal bins is represented as $X_E \in \mathbb{R}^{2T \times H \times W}$, where the factor of two corresponds to the two event polarities. The default configuration uses T = 3, resulting in a six-channel event representation. Changing T modifies the input channel dimension of the event branch and therefore slightly affects the parameter count and network-forward GFLOPs.
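A minimal sketch of such a polarity-separated voxel is given below. It assumes hard assignment of each event to one of T equal-width temporal bins, per-pixel event counts, and an interleaved channel order (positive then negative for each bin); the paper's exact binning rule and channel layout are not restated here, and some pipelines instead interpolate events between neighboring bins. Nested lists are used only to avoid external dependencies.

```python
from typing import Iterable, List, Tuple

Event = Tuple[int, int, float, int]  # (x, y, timestamp in ms, polarity +1 or -1)


def events_to_voxel(events: Iterable[Event], t_start: float, window_ms: float,
                    num_bins: int, height: int, width: int) -> List[List[List[float]]]:
    """Accumulate events into a (2 * num_bins) x height x width count tensor."""
    voxel = [[[0.0] * width for _ in range(height)] for _ in range(2 * num_bins)]
    for x, y, t, polarity in events:
        rel = t - t_start
        if not 0.0 <= rel < window_ms:
            continue  # outside the accumulation window
        if not (0 <= x < width and 0 <= y < height):
            continue  # outside the sensor array
        bin_idx = min(int(rel / window_ms * num_bins), num_bins - 1)
        channel = 2 * bin_idx + (0 if polarity > 0 else 1)  # assumed channel order
        voxel[channel][y][x] += 1.0
    return voxel


events = [(1, 1, 0.5, +1), (1, 1, 12.0, -1), (2, 3, 30.0, +1), (2, 3, 40.0, +1)]
voxel = events_to_voxel(events, t_start=0.0, window_ms=33.3, num_bins=3, height=4, width=4)
print(len(voxel), len(voxel[0]), len(voxel[0][0]))  # 6 4 4
```

With T = 3 the result has six channels, matching the default configuration; the fourth event falls outside the 33.3 ms window and is discarded.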
As shown in Table 10, using too few temporal bins excessively compresses the temporal distribution of the event stream and limits the representation of short-term UAV motion. Increasing T preserves finer temporal information and initially improves detection performance. However, an excessively large T distributes the available events over more temporal intervals, making the response within each bin increasingly sparse. It also increases the number of input channels and slightly raises the parameter count and network-forward computation. The configuration with T = 3 provides the best overall balance among temporal resolution, event density, detection accuracy, and computational complexity. Therefore, T = 3 is adopted as the default setting in SMG-UAV.
Table 11 analyzes the influence of the event accumulation duration. A short accumulation window contains relatively few events and may provide insufficient motion evidence for distant, slowly moving, or weak UAV targets. Increasing the window produces denser event observations and can improve the stability of event feature extraction. Nevertheless, an excessively long window may accumulate irrelevant background events, generate longer motion trails, and weaken the temporal alignment between the event representation and the current RGB frame. The accumulation window of 33.3 ms achieves the best overall performance and is used in the final model.

4.5.5. Contribution of SMG-Bridge at Different Feature Levels

We further investigate the contribution of SMG-Bridge at different feature levels of the dual-branch backbone. SMG-Bridge can be applied to the L3, L4, and L5 stages, which correspond to high-, medium-, and low-resolution feature representations, respectively. To isolate the contribution of bridge placement, all remaining components, including the event voxel representation, Spiking CSPDarknet encoder, SGP-Neck, detection head, and training protocol, are kept unchanged. When SMG-Bridge is disabled at a feature level, the RGB and event features at that level are directly concatenated along the channel dimension.
As shown in Table 12, the L3 bridge operates on higher-resolution features and primarily improves the preservation and cross-modal enhancement of fine spatial details, which are particularly important for small and distant UAV targets. The L4 and L5 bridges operate on progressively deeper features and contribute stronger semantic and contextual interaction between the RGB and event modalities.
Combining SMG-Bridge at multiple levels produces further improvements over using any individual level. In particular, the complete L3 + L4 + L5 configuration achieves the best performance, reaching 89.3 AP 50 , 37.2 AP 75 , and 51.6 AP 50 : 95 . These results indicate that high-resolution spatial guidance and deeper semantic guidance are complementary rather than redundant, supporting the multilevel deployment of SMG-Bridge in the final architecture.

4.6. Cross-Dataset Transfer Evaluation

To further examine the transferability of SMG-UAV across different anti-UAV data distributions, we conduct bidirectional cross-dataset experiments between FRED and NeRDD. Two transfer settings are considered: training on FRED and directly testing on NeRDD, denoted as FRED → NeRDD, and training on NeRDD and directly testing on FRED, denoted as NeRDD → FRED. The corresponding FRED → FRED and NeRDD → NeRDD results are also reported as in-domain references.
It should be noted that FRED was developed by extending NeRDD with additional sequences and more diverse target scales, backgrounds, illumination conditions, and motion patterns. Therefore, the two datasets are related rather than completely independent, and the two transfer directions have different levels of difficulty. FRED → NeRDD represents transfer from a broader data distribution to a related and relatively narrower distribution, whereas NeRDD → FRED requires the model to generalize from the more limited NeRDD distribution to the more diverse scenarios contained in FRED. Accordingly, the NeRDD → FRED direction provides a more challenging assessment of cross-dataset transferability.
For each transfer direction, SMG-UAV is trained from scratch using only the training split of the source dataset. The best-performing checkpoint is selected exclusively according to the validation split of the same source dataset and is then directly evaluated on the test split of the target dataset. No target-domain fine-tuning, parameter adaptation, additional training, hyperparameter search, or confidence-threshold adjustment is performed. Target-domain annotations are used only for the final calculation of the evaluation metrics. The confidence threshold, non-maximum suppression settings, and evaluation implementation remain unchanged across the in-domain and cross-dataset experiments.
As shown in Table 13, the model trained on FRED and directly evaluated on NeRDD achieves 87.5 AP 50 , 29.8 AP 75 , and 48.7 AP 50 : 95 . Compared with the model trained and evaluated on NeRDD, the transferred model shows only limited reductions of 1.2, 0.3, and 0.8 points in AP 50 , AP 75 , and AP 50 : 95 , respectively. This relatively small performance gap is consistent with the relationship between the two datasets. FRED extends the data distribution represented by NeRDD with additional sequences and more diverse target scales, backgrounds, illumination conditions, and motion patterns. Training on FRED therefore enables the model to learn from a broader but related distribution, and the resulting representations remain effective when transferred to NeRDD.
In the reverse direction, the model trained on NeRDD and directly evaluated on FRED obtains 66.2 AP 50 , 25.4 AP 75 , and 33.9 AP 50 : 95 . Compared with the FRED in-domain results, this corresponds to reductions of 23.1, 11.8, and 17.7 points, respectively. The substantially larger degradation indicates that the comparatively narrower NeRDD training distribution does not sufficiently cover the additional target appearances, scales, scene structures, illumination variations, and motion conditions contained in FRED. Consequently, NeRDD → FRED represents the more challenging transfer direction and provides a stricter assessment of the ability of SMG-UAV to generalize from a limited source distribution to broader anti-UAV scenarios.
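The asymmetry between the two transfer directions can be tabulated from the numbers quoted above. In the sketch below, the NeRDD in-domain reference is back-computed from the reported FRED → NeRDD scores and reductions, so it is a derived quantity that should be read against Table 13; the variable names are ours.

```python
# (AP50, AP75, AP50:95) values quoted in the text.
fred_in_domain = (89.3, 37.2, 51.6)
fred_to_nerdd = (87.5, 29.8, 48.7)
nerdd_to_fred = (66.2, 25.4, 33.9)
reported_drop_on_nerdd = (1.2, 0.3, 0.8)

# Derived: the NeRDD in-domain reference implied by the reported reductions.
nerdd_in_domain = tuple(round(score + drop, 1)
                        for score, drop in zip(fred_to_nerdd, reported_drop_on_nerdd))

# Absolute drop when a NeRDD-trained model is evaluated on FRED.
drop_on_fred = tuple(round(ref - score, 1)
                     for ref, score in zip(fred_in_domain, nerdd_to_fred))

# Relative AP50 loss in each direction, in percent of the in-domain reference.
rel_drop_nerdd = 100.0 * reported_drop_on_nerdd[0] / nerdd_in_domain[0]
rel_drop_fred = 100.0 * drop_on_fred[0] / fred_in_domain[0]

print("implied NeRDD in-domain:", nerdd_in_domain)
print("NeRDD -> FRED drop:", drop_on_fred)
print(f"relative AP50 loss: {rel_drop_nerdd:.1f}% vs. {rel_drop_fred:.1f}%")
```

The recomputed NeRDD → FRED reductions match the 23.1, 11.8, and 17.7 points stated in the text.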
Overall, the bidirectional experiments reveal a clear asymmetric transfer pattern. The FRED → NeRDD setting retains performance close to the NeRDD in-domain reference, whereas the NeRDD → FRED setting exhibits a considerable decrease under the broader target distribution. Because FRED was developed by extending NeRDD, the favorable FRED → NeRDD result should be interpreted as transfer between closely related datasets rather than generalization to a completely independent domain. Nevertheless, the results demonstrate that the complementary RGB appearance cues and event-based motion representations learned by SMG-UAV exhibit meaningful transferability across related anti-UAV data distributions. The degradation in the more difficult NeRDD → FRED direction also indicates that robust generalization from a restricted training distribution to more diverse real-world environments remains an important challenge.

5. Discussions

5.1. Real-Time Deployment and Sensing Limitations

In practical anti-UAV surveillance, the required processing speed is jointly affected by UAV velocity, target distance, camera field of view, and imaging frame rate. Faster or closer UAVs produce larger image-plane displacements within the same time interval and therefore require lower detection latency. In the current setup, RGB frames are acquired at approximately 30 FPS, corresponding to an interval of about 33.3 ms, and the events accumulated within each RGB interval are converted into a synchronized voxel representation. Although the event camera records asynchronous brightness changes at a much finer temporal resolution, the current detector produces one synchronized representation and detection result for each RGB frame interval. SMG-UAV achieves 132 FPS with a network-forward latency of 7.58 ms on the server GPU and 40 FPS with a latency of 25.00 ms on the Jetson Orin Nano. Both latency values are shorter than one RGB frame interval, indicating that the detector can support frame-rate-level real-time processing without introducing an inference backlog. However, one RGB frame interval is used only as a practical reference under the current setup, because the acceptable detection delay in an operational system also depends on UAV speed, observation distance, surveillance coverage, and the required response time.
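The frame-interval budget and the dependence of image-plane displacement on speed and distance can be sketched as follows. The displacement formula is the standard pinhole approximation for motion parallel to the image plane; the UAV speed, distance, and focal length in the example are hypothetical values chosen for illustration and are not taken from FRED or NeRDD.

```python
def frame_interval_ms(frame_rate_hz: float) -> float:
    """Time between consecutive frames in milliseconds."""
    return 1000.0 / frame_rate_hz


def headroom_ms(latency_ms: float, frame_rate_hz: float) -> float:
    """Time left in one frame interval after network inference."""
    return frame_interval_ms(frame_rate_hz) - latency_ms


def image_displacement_px(lateral_speed_mps: float, distance_m: float,
                          focal_length_px: float, delay_ms: float) -> float:
    """Pinhole estimate of target shift on the image plane during `delay_ms`."""
    return focal_length_px * lateral_speed_mps * (delay_ms / 1000.0) / distance_m


# Latencies reported in the text against a 30 FPS RGB stream.
for name, latency in [("server GPU", 7.58), ("Jetson Orin Nano", 25.00)]:
    print(f"{name}: {headroom_ms(latency, 30.0):.2f} ms of headroom per frame")

# Hypothetical scenario: 15 m/s lateral speed, 100 m away, 1000 px focal length.
print(image_displacement_px(15.0, 100.0, 1000.0, 25.0), "px during 25 ms")
```

Halving the distance or doubling the speed doubles the displacement for the same delay, which is why a single frame interval is only a reference point rather than a universal latency requirement.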
Nevertheless, the reported latency covers network inference only. The complete alert delay also includes RGB acquisition, event accumulation, RGB–event synchronization, voxel construction, image preprocessing, non-maximum suppression, communication, and alert generation. Moreover, the sensing configuration imposes several practical limitations. A wide field of view increases surveillance coverage but reduces the number of pixels occupied by distant UAVs, whereas a narrow field of view improves target resolution but increases the risk of losing fast-moving targets. The effective detection range is also constrained by sensor resolution, lens focal length, target size, atmospheric visibility, and target–background contrast.
The current datasets do not provide complete calibrated target-to-camera distances or physical UAV dimensions for all sequences. Therefore, a fixed maximum detection distance cannot be reliably determined from the available data. Instead, practical detectability is evaluated in the image domain. In particular, the manually curated small-target subset contains UAV instances whose annotated bounding-box width and height are both smaller than 16 pixels, providing an image space assessment of the method under extremely small-target conditions and serving as a proxy for long-range detection difficulty. The physical distance corresponding to this image space size varies with the UAV dimensions, camera resolution, lens focal length, field of view, and atmospheric visibility.
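The size criterion of the small-target subset can be written as a simple predicate. The sketch covers only the stated geometric condition (annotated width and height both below 16 pixels); the subset itself is manually curated, so this filter should be read as a necessary condition under an assumed corner box format, not as the full selection procedure.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed annotation format


def is_small_target(box: Box, max_side_px: float = 16.0) -> bool:
    """True when both the box width and the box height are below `max_side_px`."""
    x1, y1, x2, y2 = box
    return (x2 - x1) < max_side_px and (y2 - y1) < max_side_px


def select_small_targets(boxes: List[Box]) -> List[Box]:
    """Keep only the annotations that satisfy the small-target size criterion."""
    return [b for b in boxes if is_small_target(b)]


annotations = [(100, 100, 112, 109), (40, 40, 70, 52), (5, 5, 21, 12)]
print(select_small_targets(annotations))  # [(100, 100, 112, 109)]
```

The second box is too wide, and the third has a width of exactly 16 pixels, so neither satisfies the strict inequality.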
Synchronization or geometric calibration errors may cause cross-modal feature misalignment, particularly for small and rapidly moving UAVs. In addition, hovering, slowly moving, or radially moving targets may produce weak event responses, while camera vibration, moving vegetation, clouds, and illumination changes may generate substantial background event activity.
Future work will focus on asynchronous or adaptive-window inference, lightweight end-to-end processing, improved RGB–event synchronization and calibration, and adaptive suppression of background events to reduce practical alert latency and improve robustness under weak-motion and long-range surveillance conditions.

5.2. Failure Case Analysis

To further analyze the limitations of the proposed method in safety-related anti-UAV scenarios, Figure 8 presents several representative failure cases of SMG-UAV, including both false negatives and false positives. In the first two examples, the detector fails to detect the UAV.
In Figure 8a, the UAV is extremely small and appears under weak relative motion, while the RGB image is blurred and the event stream contains only sparse responses. As a result, neither modality provides sufficiently reliable evidence for detection. In Figure 8b, the scene is captured under nighttime conditions, and the UAV is nearly invisible in the RGB image. Meanwhile, the UAV is hovering or moving only weakly, so the event modality also produces limited motion cues, which leads to a missed detection.
In the last two examples, the detector produces false alarms caused by bird distractors. As shown in Figure 8c,d, the bird targets are small and visually ambiguous in the RGB images, while the event modality mainly captures motion responses without explicit semantic discrimination. Therefore, bird motion patterns may generate event responses similar to those of UAVs, leading to incorrect detections. These cases indicate that although SMG-UAV improves robustness under challenging conditions, failure cases may still occur when both appearance cues and motion cues are weak, ambiguous, or inconsistent. Similar false alarms may also be caused by insects, distant aircraft, wires, cloud boundaries, and moving background structures, because their small-scale appearance or event responses may resemble those of UAVs under particular viewing conditions.
Overall, the failure cases suggest several directions for future improvement, including stronger modeling of weak-motion targets, better suppression of bird-like distractors, and tighter integration of temporal verification and semantic discrimination for safety-critical anti-UAV deployment.

6. Conclusions

This paper presented SMG-UAV, a sparse mutual guided RGB–event fusion network for robust UAV detection in challenging low-altitude anti-UAV surveillance scenarios. Unlike conventional RGB–event detectors that mainly rely on direct feature aggregation, the proposed framework explicitly models complementary cross-modal recovery under asymmetric degradation. Specifically, an appearance-oriented RGB branch and a Spiking CSPDarknet event branch encode dense appearance cues and sparse motion-sensitive responses, respectively. On top of these branches, SMG-Bridge performs bidirectional sparse mutual guidance to recover degraded target structures and suppress unreliable cross-modal interference, while SGP-Neck further enhances weak UAV responses through selective multiscale feature aggregation.
Extensive experiments on FRED and NeRDD demonstrated that SMG-UAV consistently outperforms representative RGB-only, event-only, and RGB–event fusion methods. In addition to the overall benchmark comparison, challenge-oriented robustness evaluation on manually curated FRED subsets further showed clear advantages under small-target, motion-blur, extreme-illumination, background-embedded, and bird distractor conditions. Ablation studies confirmed the effectiveness of the event voxel representation, Spiking CSPDarknet, bidirectional mutual guidance, adaptive sparse thresholding, dynamic reliability gating, and multiscale enhancement. Qualitative visualizations and failure-case analysis further demonstrated the advantages and remaining challenges of the proposed method in complex anti-UAV scenes.
Several limitations should nevertheless be acknowledged. First, SMG-UAV depends on spatially and temporally synchronized RGB–event sensing, and synchronization or calibration errors may reduce the reliability of cross-modal interaction, particularly for small and rapidly moving targets. Second, hovering, slowly moving, or radially moving UAVs may generate weak event responses, limiting the complementary motion information available to the detector. Third, although the method has been evaluated on FRED and NeRDD, its generalization to substantially different environments, sensor configurations, camera resolutions, fields of view, and acquisition systems remains to be further verified. Finally, the current embedded evaluation is limited to the Jetson Orin Nano and reports network-forward inference performance rather than complete end-to-end system latency, which additionally includes sensor acquisition, event accumulation, synchronization, voxel construction, preprocessing, post-processing, communication, and alert generation.
Future work will therefore focus on broader cross-environment and cross-sensor evaluation, improved synchronization and calibration robustness, adaptive modeling of weak-motion event responses, and end-to-end optimization on additional embedded platforms. We will also investigate asynchronous or adaptive-window inference and temporal distractor verification to further improve practical reliability and reduce alert latency in real anti-UAV deployment.

Author Contributions

Conceptualization, R.Z. and J.H.; methodology, R.Z.; software, R.Z.; validation, Y.S.; formal analysis, X.D.; investigation, K.Z.; resources, J.D.; data curation, J.D.; writing—original draft preparation, R.Z.; writing—review and editing, J.H.; visualization, K.Z.; supervision, J.H. and Y.S.; project administration, J.D.; funding acquisition, X.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
| Abbreviation | Definition |
| --- | --- |
| UAV | Unmanned Aerial Vehicle |
| RGB | Red–Green–Blue |
| FRED | Florence RGB-Event Drone Dataset |
| NeRDD | Neuromorphic-RGB Drone Detection Dataset |
| SMG | Sparse Mutual Guided |
| SMG-Bridge | Sparse Mutual Guided Bridge |
| SGP-Neck | Selective Gated Pyramid Neck |
| CSP | Cross Stage Partial |
| ISTA | Iterative Shrinkage-Thresholding Algorithm |
| AP | Average Precision |
| AP$_{50}$ | Average Precision at IoU Threshold 0.50 |
| AP$_{75}$ | Average Precision at IoU Threshold 0.75 |
| AP$_{50:95}$ | Average Precision Averaged over IoU Thresholds from 0.50 to 0.95 |

References

  1. Rahman, M.H.; Sejan, M.A.S.; Aziz, M.A.; Tabassum, R.; Baik, J.-I.; Song, H.-K. A Comprehensive Survey of Unmanned Aerial Vehicles Detection and Classification Using Machine Learning Approach: Challenges, Solutions, and Future Directions. Remote Sens. 2024, 16, 879. [Google Scholar] [CrossRef]
  2. Feroz, S.; Abu Dabous, S. UAV-Based Remote Sensing Applications for Bridge Condition Assessment. Remote Sens. 2021, 13, 1809. [Google Scholar] [CrossRef]
  3. Guan, S.; Zhu, Z.; Wang, G. A Review on UAV-Based Remote Sensing Technologies for Construction and Civil Applications. Drones 2022, 6, 117. [Google Scholar] [CrossRef]
  4. Abro, G.E.M.; Zulkifli, S.A.B.M.; Masood, R.J.; Asirvadam, V.S.; Laouiti, A. Comprehensive Review of UAV Detection, Security, and Communication Advancements to Prevent Threats. Drones 2022, 6, 284. [Google Scholar] [CrossRef]
  5. Seidaliyeva, U.; Ilipbayeva, L.; Taissariyeva, K.; Smailov, N.; Matson, E.T. Advances and Challenges in Drone Detection and Classification Techniques: A State-of-the-Art Review. Sensors 2024, 24, 125. [Google Scholar]
  6. Wang, B.; Li, Q.; Mao, Q.; Wang, J.; Chen, C.L.P.; Shangguan, A.; Zhang, H. A Survey on Vision-Based Anti-Unmanned Aerial Vehicles Methods. Drones 2024, 8, 518. [Google Scholar] [CrossRef]
  7. Chiper, F.-L.; Martian, A.; Vladeanu, C.; Marghescu, I.; Craciunescu, R.; Fratu, O. Drone Detection and Defense Systems: Survey and a Software-Defined Radio-Based Solution. Sensors 2022, 22, 1453. [Google Scholar] [CrossRef] [PubMed]
  8. Liu, Z.; An, P.; Yang, Y.; Qiu, S.; Liu, Q.; Xu, X. Vision-Based Drone Detection in Complex Environments: A Survey. Drones 2024, 8, 643. [Google Scholar] [CrossRef]
  9. Yasmeen, A.; Daescu, O. Recent Research Progress on Ground-to-Air Vision-Based Anti-UAV Detection and Tracking Methodologies: A Review. Drones 2025, 9, 58. [Google Scholar] [CrossRef]
  10. Gallego, G.; Delbrück, T.; Orchard, G.; Bartolozzi, C.; Taba, B.; Censi, A.; Leutenegger, S.; Davison, A.J.; Conradt, J.; Daniilidis, K.; et al. Event-Based Vision: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 154–180. [Google Scholar] [CrossRef] [PubMed]
  11. Gehrig, D.; Loquercio, A.; Derpanis, K.G.; Scaramuzza, D. End-to-End Learning of Representations for Asynchronous Event-Based Data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5633–5643. [Google Scholar]
  12. Tomy, A.; Paigwar, A.; Mann, K.S.; Renzaglia, A.; Laugier, C. Fusing Event-Based and RGB Camera for Robust Object Detection in Adverse Conditions. In Proceedings of the IEEE International Conference on Robotics and Automation, Philadelphia, PA, USA, 23–27 May 2022; pp. 933–939. [Google Scholar]
  13. Zhou, Z.; Wu, Z.; Boutteau, R.; Yang, F.; Demonceaux, C.; Ginhac, D. RGB-event Fusion for Moving Object Detection in Autonomous Driving. In Proceedings of the IEEE International Conference on Robotics and Automation, London, UK, 29 May–2 June 2023; pp. 7808–7815. [Google Scholar]
  14. Li, D.; Tian, Y.; Li, J. SODFormer: Streaming Object Detection with Transformer Using Events and Frames. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 14020–14037. [Google Scholar] [CrossRef] [PubMed]
  15. Cao, J.; Zheng, X.; Lyu, Y.; Wang, J.; Xu, R.; Wang, L. Chasing Day and Night: Towards Robust and Efficient All-Day Object Detection Guided by an Event Camera. In Proceedings of the IEEE International Conference on Robotics and Automation, Yokohama, Japan, 13–17 May 2024; pp. 9026–9032. [Google Scholar]
  16. Magrini, G.; Berlincioni, L.; Becattini, F.; Cultrera, L.; Pala, P. Drone Detection with Event Cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Honolulu, HI, USA, 19–20 October 2025; pp. 4762–4773. [Google Scholar]
  17. Magrini, G.; Marini, N.; Becattini, F.; Berlincioni, L.; Biondi, N.; Pala, P.; Del Bimbo, A. FRED: The Florence RGB-event Drone Dataset. In Proceedings of the 33rd ACM International Conference on Multimedia Workshops, Dublin, Ireland, 27–31 October 2025; pp. 13170–13176. [Google Scholar]
  18. Magrini, G.; Becattini, F.; Pala, P.; Del Bimbo, A.; Porta, A. Neuromorphic Drone Detection: An Event-RGB Multimodal Approach. In Computer Vision–ECCV 2024 Workshops; Springer: Cham, Switzerland, 2025; pp. 259–275. [Google Scholar]
  19. Gökçe, F.; Üçoluk, G.; Şahin, E.; Kalkan, S. Vision-Based Detection and Distance Estimation of Micro Unmanned Aerial Vehicles. Sensors 2015, 15, 23805–23846. [Google Scholar] [CrossRef] [PubMed]
  20. Rozantsev, A.; Lepetit, V.; Fua, P. Flying Objects Detection from a Single Moving Camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4128–4136. [Google Scholar]
  21. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  22. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  23. Unlu, E.; Zenou, E.; Rivière, N. Generic Fourier Descriptors for Autonomous UAV Detection. In Proceedings of the 7th International Conference on Pattern Recognition Applications and Methods, Funchal, Portugal, 16–18 January 2018; pp. 550–554. [Google Scholar]
  24. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  25. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  26. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  27. Cheng, Q.; Li, X.; Zhu, B.; Shi, Y.; Xie, B. Drone Detection Method Based on MobileViT and CA-PANet. Electronics 2023, 12, 223. [Google Scholar] [CrossRef]
  28. Zhai, X.; Huang, Z.; Li, T.; Liu, H.; Wang, S. YOLO-Drone: An Optimized YOLOv8 Network for Tiny UAV Object Detection. Electronics 2023, 12, 3664. [Google Scholar] [CrossRef]
  29. Muhamad Zamri, F.N.; Gunawan, T.S.; Yusoff, S.H.; Alzahrani, A.A.; Bramantoro, A.; Kartiwi, M. Enhanced Small Drone Detection Using Optimized YOLOv8 with Attention Mechanisms. IEEE Access 2024, 12, 90629–90643. [Google Scholar] [CrossRef]
  30. Zheng, C.; Liu, L.; Fu, Q.; Yang, Q.; Zhang, D.; Yang, H. YOLO-DD: A Lightweight Framework for UAV Detection in Complex Environments via Boundary-Aware Fusion. EURASIP J. Adv. Signal Process. 2025, 2025, 44. [Google Scholar] [CrossRef]
  31. Lagorce, X.; Orchard, G.; Galluppi, F.; Shi, B.E.; Benosman, R.B. HOTS: A Hierarchy of Event-Based Time-Surfaces for Pattern Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1346–1359. [Google Scholar] [CrossRef] [PubMed]
  32. Li, J.; Dong, S.; Yu, Z.; Tian, Y.; Huang, T. Event-Based Vision Enhanced: A Joint Detection Framework in Autonomous Driving. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 1396–1401. [Google Scholar]
  33. Cao, H.; Chen, G.; Xia, J.; Zhuang, G.; Knoll, A. Fusion-Based Feature Attention Gate Component for Vehicle Detection Based on Event Camera. IEEE Sens. J. 2021, 21, 24540–24548. [Google Scholar] [CrossRef]
  34. Cao, H.; Zhang, Z.; Xia, Y.; Li, X.; Xia, J.; Chen, G.; Knoll, A. Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 161–177. [Google Scholar]
  35. Liu, M.; Qi, N.; Shi, Y.; Yin, B. An Attention Fusion Network for Event-Based Vehicle Object Detection. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 3363–3367. [Google Scholar]
  36. Liu, Z.; Yang, N.; Wang, Y.; Li, Y.; Zhao, X.; Wang, F.-Y. Enhancing Traffic Object Detection in Variable Illumination with RGB-Event Fusion. IEEE Trans. Intell. Transp. Syst. 2024, 25, 20335–20350. [Google Scholar] [CrossRef]
  37. Liu, Z.; Sun, Y.; Wang, Y.; Yang, N.; Li, S.E.; Zhao, X. Beyond Conventional Vision: RGB-Event Fusion for Robust Object Detection in Dynamic Traffic Scenarios. Commun. Transp. Res. 2025, 5, 100202. [Google Scholar] [CrossRef]
  38. Jiang, Z.; Xia, P.; Huang, K.; Stechele, W.; Chen, G.; Bing, Z.; Knoll, A. Mixed Frame-/Event-Driven Fast Pedestrian Detection. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 8332–8338. [Google Scholar]
  39. Olshausen, B.A.; Field, D.J. Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images. Nature 1996, 381, 607–609. [Google Scholar] [CrossRef] [PubMed]
  40. Yang, B.; Li, S. Multifocus Image Fusion and Restoration with Sparse Representation. IEEE Trans. Instrum. Meas. 2010, 59, 884–892. [Google Scholar] [CrossRef]
  41. Liu, Y.; Liu, S.; Wang, Z. A General Framework for Image Fusion Based on Multi-Scale Transform and Sparse Representation. Inf. Fusion 2015, 24, 147–164. [Google Scholar] [CrossRef]
  42. Zhang, Q.; Liu, Y.; Blum, R.S.; Han, J.; Tao, D. Sparse Representation Based Multi-Sensor Image Fusion for Multi-Focus and Multi-Modality Images: A Review. Inf. Fusion 2018, 40, 57–75. [Google Scholar] [CrossRef]
  43. Gregor, K.; LeCun, Y. Learning Fast Approximations of Sparse Coding. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010; pp. 399–406. [Google Scholar]
  44. Zhang, J.; Ghanem, B. ISTA-Net: Interpretable Optimization-Inspired Deep Network for Image Compressive Sensing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1828–1837. [Google Scholar]
  45. Zhao, Z.; Zhang, J.; Bai, H.; Wang, Y.; Cui, Y.; Deng, L.; Sun, K.; Zhang, C.; Liu, J.; Xu, S. Deep Convolutional Sparse Coding Networks for Interpretable Image Fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Vancouver, BC, Canada, 18–22 June 2023; pp. 2369–2377. [Google Scholar]
  46. Fang, A.; Feng, S.; Liang, B.; Jiang, J. Real-Time Detection of Unauthorized Unmanned Aerial Vehicles Using SEB-YOLOv8s. Sensors 2024, 24, 3915. [Google Scholar] [CrossRef] [PubMed]
  47. Zhang, J.; Zhang, Y.; Shi, Z.; Zhang, Y.; Gao, R. Unmanned Aerial Vehicle Object Detection Based on Information-Preserving and Fine-Grained Feature Aggregation. Remote Sens. 2024, 16, 2590. [Google Scholar] [CrossRef]
  48. Wu, D.; Li, J.; Yang, W. STD-YOLOv8: A Lightweight Small Target Detection Algorithm for UAV Perspectives. Electron. Res. Arch. 2024, 32, 4563–4580. [Google Scholar] [CrossRef]
  49. Wei, Y.; Yao, F.; Kang, Y. Design and Application of Spiking Neural Networks Based on LIF Neurons. In Proceedings of the 2024 3rd International Conference on Data Analytics, Computing and Artificial Intelligence (ICDACAI), Zakopane, Poland, 18–20 October 2024; pp. 10–15. [Google Scholar]
  50. Zhou, C.; Zhang, H.; Yu, L.; Ye, Y.; Zhou, Z.; Huang, L.; Ma, Z.; Fan, X.; Zhou, H.; Tian, Y. Direct Training High-Performance Deep Spiking Neural Networks: A Review of Theories and Methods. Front. Neurosci. 2024, 18, 1383844. [Google Scholar] [CrossRef] [PubMed]
  51. Fang, W.; Chen, Y.; Ding, J.; Yu, Z.; Masquelier, T.; Chen, D.; Huang, L.; Zhou, H.; Li, G.; Tian, Y. SpikingJelly: An Open-Source Machine Learning Infrastructure Platform for Spike-Based Intelligence. Sci. Adv. 2023, 9, eadi1480. [Google Scholar] [CrossRef] [PubMed]
  52. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  53. Wang, Z.; Li, C.; Xu, H.; Zhu, X.; Li, H. Mamba YOLO: A Simple Baseline for Object Detection with State Space Model. Proc. AAAI Conf. Artif. Intell. 2025, 39, 8205–8213. [Google Scholar] [CrossRef]
  54. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974. [Google Scholar]
  55. Gehrig, M.; Scaramuzza, D. Recurrent Vision Transformers for Object Detection with Event Cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 13884–13893. [Google Scholar]
  56. Peng, Y.; Li, H.; Zhang, Y.; Sun, X.; Wu, F. Scene Adaptive Sparse Transformer for Event-Based Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16794–16804. [Google Scholar]
  57. Yang, N.; Wang, Y.; Liu, Z.; Li, M.; An, Y.; Zhao, X. SMamba: Sparse Mamba for Event-Based Object Detection. Proc. AAAI Conf. Artif. Intell. 2025, 39, 9229–9237. [Google Scholar] [CrossRef]
Figure 1. Representative RGB–event examples under challenging anti-UAV scenarios, including underexposure, overexposure, visual camouflage, and tiny-scale targets. The first row shows RGB frames, where UAV targets are difficult to localize due to severe visual degradation or small target size. The second row visualizes the event stream in the spatio-temporal space, where event responses preserve motion trajectories around moving UAVs. The third row shows the accumulated event frames with enlarged target regions, further illustrating the local event responses around UAV targets. Red and blue points denote positive and negative events, respectively. Gray points denote background event responses caused by irrelevant brightness changes, such as camera jitter, moving vegetation, illumination fluctuation, clouds, or other non-target motion. Green boxes indicate the UAV target regions.
Figure 2. Overall architecture of the proposed SMG-UAV. The RGB frame is encoded by an appearance-oriented CSPDarknet branch, while the asynchronous event stream is first converted into an event voxel representation with temporal bins and polarity channels. The event voxel is then processed by a Spiking CSPDarknet branch, where Spiking Conv denotes convolutional layers equipped with LIF-based spike activations. At stages 3–5, SMG-Bridge performs bidirectional sparse mutual guidance between RGB and event features in a compact latent space. The fused features $\{F_f^3, F_f^4, F_f^5\}$ are further enhanced by SGP-Neck to generate pyramid features $\{F_P^3, F_P^4, F_P^5\}$, which are explicitly fed into the detection head. The detection head predicts UAV bounding boxes, confidence scores, and class labels. Different from ordinary gated fusion, SMG-Bridge performs reliability-aware sparse updating through cross-modal guidance gates and adaptive sparse shrinkage, rather than simply concatenating or reweighting multimodal features.
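The Figure 2 caption mentions an event voxel representation with temporal bins and polarity channels. A minimal sketch of that idea follows. It is not the authors' construction: the event tuple layout `(x, y, t, polarity)`, the `voxel[polarity][bin][y][x]` ordering, the uniform binning over the window, and the plain event-count accumulation are all assumptions made for illustration.

```python
def events_to_voxel(events, height, width, num_bins):
    """Accumulate events into a dense count grid voxel[polarity][bin][y][x].

    events: iterable of (x, y, t, polarity), polarity in {0, 1} (assumed layout).
    """
    events = list(events)
    # Two polarity channels, each holding num_bins temporal slices of H x W counts.
    voxel = [[[[0] * width for _ in range(height)]
              for _ in range(num_bins)]
             for _ in range(2)]
    if not events:
        return voxel

    t_start = min(e[2] for e in events)
    t_end = max(e[2] for e in events)
    span = max(t_end - t_start, 1e-9)  # avoid division by zero for a single timestamp

    for x, y, t, polarity in events:
        # Map the timestamp to a bin index; the last timestamp falls in the last bin.
        b = min(int((t - t_start) / span * num_bins), num_bins - 1)
        voxel[polarity][b][y][x] += 1
    return voxel


# Hypothetical events: one positive event early, two negative events late on one pixel.
demo_events = [(0, 0, 0.0, 1), (1, 0, 0.5, 0), (1, 0, 1.0, 0)]
demo_voxel = events_to_voxel(demo_events, height=2, width=2, num_bins=2)
```

Keeping polarity and time as separate axes preserves the sign and ordering of brightness changes that a single accumulated event frame would merge.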
Figure 3. Structure of the proposed Sparse Mutual Guided Bridge (SMG-Bridge). The sparse latent codes are initialized only once from the stage-2 RGB and event features, and then progressively propagated across the bridges at stages 3, 4, and 5 for cross-scale mutual guidance. At each bridge, the current-scale features serve as observation anchors, while the propagated sparse codes act as latent guidance carriers for bidirectional refinement. (a) Event-guided RGB sparse updating, where the propagated event sparse code guides the correction and reconstruction of the current RGB feature. (b) RGB-guided event sparse updating, where the propagated RGB sparse code refines the current event feature; $M_T$ and $B_T$ denote temporal aggregation and temporal broadcast operators, respectively, used to handle the temporal-channel structure inherited from the voxelized event representation. (c) Fusion of the enhanced RGB and event features, where the two refined features are adaptively reweighted and further refined by a lightweight Conv-BN-SiLU block to generate the fused output.
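The sparse updating in panels (a) and (b) belongs to the shrinkage-thresholding family (ISTA is listed among the abbreviations). As background only, the sketch below shows one textbook ISTA step with a fixed threshold. It does not reproduce SMG-Bridge: the dictionary `D`, the step size, the fixed `theta`, and the absence of any cross-modal guidance gate or adaptive threshold are simplifying assumptions.

```python
def soft_threshold(values, theta):
    """Element-wise shrinkage: sign(v) * max(|v| - theta, 0)."""
    shrunk = []
    for v in values:
        magnitude = max(abs(v) - theta, 0.0)
        shrunk.append(magnitude if v >= 0 else -magnitude)
    return shrunk


def ista_step(z, x, D, step, theta):
    """One ISTA update for min_z 0.5 * ||x - D z||^2 + lam * ||z||_1.

    D is a list of rows (len(x) rows, len(z) columns); theta = step * lam.
    """
    n_rows, n_cols = len(x), len(z)
    # Residual of the current reconstruction: D z - x.
    residual = [sum(D[i][j] * z[j] for j in range(n_cols)) - x[i]
                for i in range(n_rows)]
    # Gradient of the quadratic term: D^T (D z - x).
    grad = [sum(D[i][j] * residual[i] for i in range(n_rows))
            for j in range(n_cols)]
    # Gradient step followed by shrinkage, which zeroes small coefficients.
    return soft_threshold([z[j] - step * grad[j] for j in range(n_cols)], theta)


# Hypothetical toy problem with an identity dictionary.
identity = [[1.0, 0.0], [0.0, 1.0]]
code = ista_step(z=[0.0, 0.0], x=[2.0, 0.1], D=identity, step=1.0, theta=0.5)
```

In this toy case the weak coefficient is suppressed to zero while the strong one is kept and shrunk, which is the behavior sparse coding relies on to discard unreliable responses.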
Figure 4. Architecture of the proposed SGP-Neck. The fused features from SMG-Bridge are first channel-aligned and enhanced by LRPM. Semantic gates inject high-level context in a top-down path, while detail gates feed localization-sensitive cues back in a bottom-up path. The resulting pyramid features $\{F_P^3, F_P^4, F_P^5\}$ are used for multiscale UAV prediction.
Figure 5. Representative scenes of the five manually curated challenging subsets from FRED for anti-UAV robustness analysis. From left to right, the columns show small-target, motion-blur, extreme-illumination, background-embedded, and bird-distractor cases. Red boxes indicate UAV targets, and yellow boxes indicate target-like distractors. The enlarged regions provide a closer view of the UAV or distractor, illustrating the key challenges of small scale, appearance degradation, low contrast, background clutter, and semantic ambiguity.
Figure 6. Qualitative comparison of detection results under representative challenging anti-UAV scenarios. The examples include extreme illumination, background clutter, weak small targets, and low-contrast scenes. The rightmost column shows the ground-truth annotations. Green boxes denote detected UAV targets, and the numbers beside them are the final detection confidence scores after NMS. Compared with RGB-only, event-only, and conventional RGB–event fusion methods, SMG-UAV provides more accurate localization, more stable confidence, and fewer missed detections under challenging conditions. All methods use the same confidence and NMS thresholds.
Figure 7. Comparison of Grad-CAM visualizations produced by different RGB–event detection methods in representative challenging scenes. Warmer colors indicate regions with greater relative contribution to the corresponding UAV prediction. Compared with competing methods, SMG-UAV produces more compact attention around the true UAV regions and fewer spurious background responses. Green boxes denote the ground-truth bounding boxes of UAV targets.
Figure 8. Representative failure cases of SMG-UAV in challenging anti-UAV scenarios. Green boxes denote the UAV bounding boxes predicted by the detection method. (a) False negative under weak relative motion and blurred RGB appearance, where the UAV is small and the event stream provides only sparse responses. (b) False negative in a nighttime hovering scenario, where the RGB appearance is almost invisible and the event modality contains insufficient motion evidence. (c,d) False positives caused by bird distractors. In these cases, the objects are small and visually ambiguous in RGB images, while the event modality captures strong motion responses but lacks semantic discrimination, causing bird targets to be incorrectly detected as UAVs. The enlarged regions highlight the target or distractor details.
Table 1. Comparison of different methods on FRED and NeRDD over three independent training runs. Results are reported as mean ± standard deviation. The best results are highlighted in bold.
| Input Modality | Method | FRED AP$_{50}$ | FRED AP$_{75}$ | FRED AP$_{50:95}$ | NeRDD AP$_{50}$ | NeRDD AP$_{75}$ | NeRDD AP$_{50:95}$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RGB | YOLOv12 | 35.2 ± 0.7 | 11.4 ± 0.5 | 13.5 ± 0.6 | 35.8 ± 0.6 | 13.5 ± 0.4 | 16.4 ± 0.5 |
| RGB | MambaYOLO | 36.6 ± 0.5 | 12.2 ± 0.4 | 14.3 ± 0.5 | 39.5 ± 0.6 | 14.0 ± 0.4 | 17.7 ± 0.5 |
| RGB | RT-DETR | 34.0 ± 0.6 | 10.7 ± 0.5 | 11.9 ± 0.4 | 32.8 ± 0.7 | 11.2 ± 0.4 | 15.6 ± 0.5 |
| Event | RVT | 79.3 ± 0.3 | 33.2 ± 0.5 | 46.7 ± 0.4 | 81.5 ± 0.4 | 24.8 ± 0.6 | 33.5 ± 0.4 |
| Event | SAST | 77.9 ± 0.5 | 32.2 ± 0.4 | 45.0 ± 0.6 | 80.2 ± 0.5 | 23.8 ± 0.5 | 32.2 ± 0.3 |
| Event | SMamba | 81.7 ± 0.4 | 35.8 ± 0.3 | 47.7 ± 0.6 | 82.8 ± 0.3 | 25.1 ± 0.3 | 33.9 ± 0.5 |
| RGB + Event | FPN-Fusion | 80.4 ± 0.5 | 34.3 ± 0.3 | 43.1 ± 0.4 | 82.5 ± 0.6 | 26.4 ± 0.2 | 33.6 ± 0.4 |
| RGB + Event | EOLO | 78.7 ± 0.5 | 31.7 ± 0.3 | 39.3 ± 0.4 | 85.1 ± 0.4 | 27.3 ± 0.3 | 41.2 ± 0.2 |
| RGB + Event | RENet | 82.0 ± 0.5 | 35.6 ± 0.5 | 47.0 ± 0.4 | 82.8 ± 0.4 | 26.6 ± 0.6 | 39.6 ± 0.5 |
| RGB + Event | SODFormer | 82.4 ± 0.1 | 35.9 ± 0.4 | 48.1 ± 0.3 | 83.9 ± 0.2 | 26.8 ± 0.3 | 38.7 ± 0.2 |
| RGB + Event | SMG-UAV | **89.3 ± 0.2** | **37.2 ± 0.4** | **51.6 ± 0.3** | **88.7 ± 0.2** | **30.1 ± 0.3** | **49.5 ± 0.4** |
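For readers comparing the AP columns of Table 1, the usual definitions are sketched below. The symbols $B_p$ (predicted box), $B_g$ (ground-truth box), and $\tau$ (IoU threshold) are introduced here for illustration, and the 0.05 threshold step is the common COCO-style convention, assumed because the table does not state the step explicitly.

```latex
\mathrm{IoU}(B_p, B_g) = \frac{|B_p \cap B_g|}{|B_p \cup B_g|},
\qquad
\mathrm{AP}_{50:95} = \frac{1}{10} \sum_{\tau \in \{0.50,\, 0.55,\, \ldots,\, 0.95\}} \mathrm{AP}_{\tau}
```

Under this reading, AP$_{50}$ and AP$_{75}$ are the single-threshold terms at $\tau = 0.50$ and $\tau = 0.75$, so AP$_{50:95}$ rewards tight localization more than AP$_{50}$ does.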
Table 2. Additional anti-UAV-oriented evaluation metrics on the FRED dataset. Precision, recall, F1-score, and miss rate are reported in %. FP/frame denotes the average number of false positives per frame. All metrics are computed under the same confidence threshold and IoU matching criterion. The best results are highlighted in bold.
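| Input | Method | Precision | Recall | F1 | FP/Frame | Miss Rate |
| --- | --- | --- | --- | --- | --- | --- |
| RGB | YOLOv12 | 43.8 | 34.9 | 38.9 | 0.224 | 65.1 |
| RGB | MambaYOLO | 45.1 | 36.8 | 40.5 | 0.224 | 63.2 |
| RGB | RT-DETR | 42.7 | 32.4 | 36.9 | 0.217 | 67.6 |
| Event | RVT | 83.6 | 78.7 | 81.1 | 0.077 | 21.3 |
| Event | SAST | 82.1 | 76.9 | 79.4 | 0.084 | 23.1 |
| Event | SMamba | 84.7 | 80.9 | 82.8 | 0.073 | 19.1 |
| RGB + Event | FPN-Fusion | 82.8 | 79.4 | 81.1 | 0.082 | 20.6 |
| RGB + Event | EOLO | 80.9 | 76.5 | 78.6 | 0.090 | 23.5 |
| RGB + Event | RENet | 83.9 | 80.1 | 82.0 | 0.077 | 19.9 |
| RGB + Event | SODFormer | 84.5 | 81.7 | 83.1 | 0.075 | 18.3 |
| RGB + Event | SMG-UAV | **91.2** | **88.6** | **89.9** | **0.043** | **11.4** |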
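The Table 2 values are numerically consistent with the standard count-based definitions, in which F1 is the harmonic mean of precision and recall and the miss rate is the complement of recall. The sketch below assumes those definitions. The function names and the example counts are hypothetical, and the IoU matching that would produce the true-positive, false-positive, and false-negative counts is not shown.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0


def detection_metrics(tp, fp, fn, num_frames):
    """Percent-valued detection metrics from matched-detection counts."""
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
        "miss_rate": 100.0 - recall,        # share of ground-truth UAVs not detected
        "fp_per_frame": fp / num_frames,    # average false alarms per frame
    }


# Hypothetical counts, for illustration only.
example = detection_metrics(tp=90, fp=10, fn=10, num_frames=100)

# Consistency check against the SMG-UAV row of Table 2 (precision 91.2, recall 88.6).
smg_uav_f1 = round(f1_score(91.2, 88.6), 1)
```

For an alerting system, FP/frame and miss rate are the two quantities that most directly trade off against each other when the confidence threshold is moved.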
Table 3. Statistics and qualification criteria of the manually curated challenging subsets from the FRED test set. Each RGB frame is paired with the event stream accumulated within the corresponding RGB frame interval. Since the five subsets are not mutually exclusive, the same frame or UAV instance may appear in more than one subset.
| Subset | Qualification Criterion | Number of Frames | Number of UAV Instances |
| --- | --- | --- | --- |
| Small Target | At least one UAV has both an annotated bounding-box width and height smaller than 16 pixels. Candidate frames are manually reviewed to remove inaccurate annotations. | 7598 | 8944 |
| Motion Blur | The UAV exhibits evident directional smearing, elongation, boundary diffusion, or contour degradation caused by relative motion or exposure. | 6392 | 7210 |
| Extreme Illumination | Severe underexposure, overexposure, glare, or illumination variation causes substantial loss of UAV appearance details or local contrast. | 7145 | 7867 |
| Background-Embedded | The UAV is visually similar to the surrounding background or partially obscured by cluttered structures, making foreground–background separation difficult. | 6724 | 7238 |
| Bird Distractor | At least one annotated UAV and one or more non-UAV bird-like moving objects appear in the same frame and may cause confusion or false alarms. | 4783 | 5941 |
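Of the five criteria in Table 3, only the Small Target rule is purely geometric, so it can be pre-screened automatically before manual review. A minimal sketch follows. The `(x_min, y_min, x_max, y_max)` pixel box format and the function name are assumptions, and the manual review step described in the table is not modeled.

```python
def is_small_target_frame(uav_boxes, max_side=16):
    """Return True if any UAV box has width AND height below max_side pixels.

    uav_boxes: list of (x_min, y_min, x_max, y_max) in pixels (assumed format).
    """
    return any(
        (x_max - x_min) < max_side and (y_max - y_min) < max_side
        for x_min, y_min, x_max, y_max in uav_boxes
    )


# Hypothetical annotations: a 10x12 box qualifies, a 10x20 box does not.
candidate = is_small_target_frame([(100, 50, 110, 62)])
rejected = is_small_target_frame([(100, 50, 110, 70)])
```

The comparison is strict, so a box of exactly 16 pixels on a side does not qualify, matching the "smaller than 16 pixels" wording.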
Table 4. Robustness comparison on the manually curated challenging subsets from FRED. AP 50 is reported for each subset, and the best results are highlighted in bold.
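| Method | Small Target | Motion Blur | Extreme Illumination | Background-Embedded | Bird Distractor |
| --- | --- | --- | --- | --- | --- |
| YOLOv12 | 11.4 | 10.6 | 5.7 | 4.2 | 35.3 |
| MambaYOLO | 12.8 | 10.1 | 4.9 | 3.9 | 36.9 |
| SMamba | 26.9 | 32.7 | 36.7 | 39.2 | 33.8 |
| RENet | 24.6 | 29.1 | 38.9 | 33.6 | 34.1 |
| SODFormer | 26.3 | 31.5 | 41.2 | 38.1 | 35.7 |
| SMG-UAV | **31.7** | **36.9** | **47.6** | **45.6** | **46.2** |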
Table 5. Computational complexity and server-side inference efficiency comparison. Params and GFLOPs denote the model size and theoretical computational cost, respectively. FPS is evaluated under the same input resolution, batch size of 1, and server hardware platform used for training and evaluation.
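| Input | Method | Params (M) | GFLOPs | FPS | Latency (ms) |
| --- | --- | --- | --- | --- | --- |
| RGB | YOLOv12 | 19.6 | 123.6 | 113 | 8.85 |
| RGB | MambaYOLO | 19.1 | 116.5 | 119 | 8.40 |
| RGB | RT-DETR | 20.1 | 144.3 | 107 | 9.35 |
| Event | RVT | 18.5 | 31.2 | 167 | 5.99 |
| Event | SAST | 19.1 | 37.1 | 146 | 6.84 |
| Event | SMamba | 23.7 | 72.8 | 98 | 10.20 |
| RGB + Event | FPN-Fusion | 65.6 | 283.7 | 65 | 15.38 |
| RGB + Event | EOLO | 46.2 | 100.9 | 92 | 10.87 |
| RGB + Event | RENet | 37.7 | 102.7 | 93 | 10.75 |
| RGB + Event | SODFormer | 86.5 | 279.7 | 68 | 14.71 |
| RGB + Event | SMG-UAV | 18.7 | 66.2 | 132 | 7.58 |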
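The FPS and Latency columns of Table 5 appear to be related by latency (ms) ≈ 1000 / FPS at batch size 1. This relation is inferred from the reported numbers rather than stated in the caption. The short check below (the helper name `latency_ms` is illustrative) reproduces three of the reported rows to within rounding.

```python
def latency_ms(fps: float) -> float:
    """Per-frame latency in milliseconds implied by a throughput in frames/s.

    Assumes batch size 1 and no pipelining, so latency is the reciprocal of FPS.
    """
    return 1000.0 / fps


# (FPS, reported latency in ms) pairs copied from Table 5.
reported = {
    "YOLOv12": (113, 8.85),
    "RVT": (167, 5.99),
    "SMG-UAV": (132, 7.58),
}

for name, (fps, latency) in reported.items():
    # Each reported latency agrees with 1000 / FPS to within 0.01 ms.
    assert abs(latency_ms(fps) - latency) < 0.01, name
```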
Table 6. Embedded-platform inference efficiency on the NVIDIA Jetson Orin Nano development board (NVIDIA, Santa Clara, CA, USA). FPS is measured with batch size 1 after model warm-up.
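| Input | Method | Params (M) | GFLOPs | FPS | Latency (ms) |
| --- | --- | --- | --- | --- | --- |
| RGB | YOLOv12 | 19.6 | 123.6 | 36 | 27.78 |
| RGB | MambaYOLO | 19.1 | 116.5 | 38 | 26.32 |
| RGB | RT-DETR | 20.1 | 144.3 | 31 | 32.26 |
| Event | RVT | 18.5 | 31.2 | 47 | 21.28 |
| Event | SAST | 19.1 | 37.1 | 42 | 23.81 |
| Event | SMamba | 23.7 | 72.8 | 29 | 34.48 |
| RGB + Event | FPN-Fusion | 65.6 | 283.7 | 20 | 50.00 |
| RGB + Event | EOLO | 46.2 | 100.9 | 28 | 35.71 |
| RGB + Event | RENet | 37.7 | 102.7 | 30 | 33.33 |
| RGB + Event | SODFormer | 86.5 | 279.7 | 22 | 45.45 |
| RGB + Event | SMG-UAV | 18.7 | 66.2 | 40 | 25.00 |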
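The caption of Table 6 specifies batch size 1 and timing after model warm-up. A minimal sketch of such a measurement loop is given below, assuming a hypothetical `infer` callable that runs one forward pass on one sample; the warm-up and iteration counts are placeholder assumptions, not the values used by the authors.

```python
import time


def measure_fps(infer, sample, warmup=10, iters=50):
    """Time `infer(sample)` at batch size 1 and return (fps, latency_ms).

    `infer` is any callable performing a single forward pass (hypothetical).
    On a GPU, asynchronous work should be synchronized before each clock read
    (for example with torch.cuda.synchronize() in PyTorch); that step is
    omitted here to keep the sketch dependency-free.
    """
    # Warm-up passes are excluded from timing (caches, lazy initialization).
    for _ in range(warmup):
        infer(sample)

    start = time.perf_counter()
    for _ in range(iters):
        infer(sample)
    elapsed = time.perf_counter() - start

    fps = iters / elapsed
    latency_ms = 1000.0 * elapsed / iters
    return fps, latency_ms


# Usage with a stand-in workload in place of a real detector.
fps, latency = measure_fps(lambda x: sum(range(1000)), None, warmup=2, iters=5)
print(f"{fps:.1f} FPS, {latency:.3f} ms per frame")
```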
Table 7. Ablation study of the main components on FRED.
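| Configuration | AP50 | AP75 | AP50:95 | Params (M) | GFLOPs |
| --- | --- | --- | --- | --- | --- |
| Baseline | 73.4 | 30.2 | 38.7 | 16.0 | 54.1 |
| +Event voxel representation | 77.9 | 32.5 | 44.1 | 16.2 | 56.1 |
| +Spiking CSPDarknet encoder | 80.2 | 35.3 | 44.5 | 16.8 | 58.9 |
| +SMG-Bridge | 87.6 | 36.7 | 48.9 | 17.8 | 63.4 |
| +SGP-Neck | 89.3 | 37.2 | 51.6 | 18.7 | 66.2 |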
Table 8. Ablation study of bidirectional mutual guidance in SMG-Bridge. The best results are highlighted in bold.
| Configuration | AP50 | AP75 | AP50:95 |
| --- | --- | --- | --- |
| Direct fusion | 82.2 | 35.7 | 45.1 |
| Event-to-image only | 86.9 | 36.7 | 49.8 |
| Image-to-event only | 85.1 | 36.1 | 48.4 |
| Bidirectional guidance | **89.3** | **37.2** | **51.6** |
Table 9. Ablation study of sparse thresholding and guidance gate in SMG-Bridge. The best results are highlighted in bold.
| Variant | AP50 | AP75 | AP50:95 |
| --- | --- | --- | --- |
| Without guidance gate | 84.3 | 36.2 | 45.8 |
| Static scalar gate | 86.1 | 36.7 | 47.9 |
| Without adaptive threshold | 88.0 | 36.9 | 49.4 |
| Full SMG-Bridge | **89.3** | **37.2** | **51.6** |
Table 10. Effect of the number of temporal bins T in the event voxel representation on FRED. The event accumulation window is fixed at 33.3 ms. The input channel number is 2T, corresponding to T temporal bins and two event polarities. The selected default configuration is highlighted in bold.
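| Temporal Bins T | Input Channels | Params (M) | GFLOPs | AP50 | AP75 | AP50:95 |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2 | 18.1 | 64.3 | 84.6 | 32.5 | 48.1 |
| 2 | 4 | 18.4 | 65.1 | 86.1 | 36.7 | 47.9 |
| **3** | **6** | **18.7** | **66.2** | **89.3** | **37.2** | **51.6** |
| 4 | 8 | 19.1 | 67.2 | 88.9 | 37.1 | 50.9 |
| 5 | 10 | 19.3 | 68.6 | 88.8 | 36.9 | 50.7 |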
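The 2T input channels in Table 10 arise from splitting the accumulation window into T temporal bins and keeping the two event polarities separate. The sketch below illustrates one common way to build such a tensor by counting events per bin and polarity. The count-based accumulation, the channel ordering (`2 * bin + polarity`), and the function name are assumptions made for illustration; the exact voxelization used by SMG-UAV may differ.

```python
def events_to_voxel(events, t0, window, num_bins, height, width):
    """Accumulate events into a [2 * num_bins][height][width] count grid.

    events: iterable of (t, x, y, p) with integer pixel coordinates and
            polarity p in {0, 1}; t uses the same unit as t0 and window.
    Channel layout (assumed): index 2 * bin + polarity.
    """
    voxel = [[[0] * width for _ in range(height)] for _ in range(2 * num_bins)]
    for t, x, y, p in events:
        # Ignore events outside the accumulation window [t0, t0 + window).
        if not (t0 <= t < t0 + window):
            continue
        # Map the timestamp to a temporal bin index in [0, num_bins - 1].
        b = min(int((t - t0) / window * num_bins), num_bins - 1)
        voxel[2 * b + p][y][x] += 1
    return voxel


# Default setting from Table 10: T = 3 bins over a 33.3 ms window -> 6 channels.
events = [(0.0, 1, 1, 1), (12.0, 0, 0, 0), (30.0, 2, 1, 1), (40.0, 0, 0, 1)]
voxel = events_to_voxel(events, t0=0.0, window=33.3, num_bins=3, height=2, width=3)
print(len(voxel))  # number of input channels, 2 * T
```

In this toy input the event at 40.0 ms falls outside the window and is dropped, while the other three land in bins 0, 1, and 2 respectively.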
Table 11. Effect of different event accumulation windows on FRED. Since the input channel dimension remains fixed, the parameter count and network-forward GFLOPs are identical for all configurations. The selected window is highlighted in bold.
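| Window Δt | Approx. RGB Interval | AP50 | AP75 | AP50:95 |
| --- | --- | --- | --- | --- |
| 16.7 ms | 0.5× | 84.6 | 32.5 | 48.1 |
| **33.3 ms** | **1.0×** | **89.3** | **37.2** | **51.6** |
| 50.0 ms | 1.5× | 89.2 | 36.8 | 50.9 |
| 66.7 ms | 2.0× | 88.7 | 36.5 | 50.2 |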
Table 12. Ablation study of SMG-Bridge placement at different feature levels on FRED. L3, L4, and L5 denote the high-, medium-, and low-resolution feature stages, respectively. The best results are highlighted in bold.
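| SMG-Bridge Placement | AP50 | AP75 | AP50:95 |
| --- | --- | --- | --- |
| L3 only | 76.1 | 30.2 | 40.1 |
| L4 only | 74.8 | 29.8 | 39.6 |
| L5 only | 72.3 | 28.6 | 38.2 |
| L3 + L4 | 82.1 | 33.6 | 45.4 |
| L4 + L5 | 81.9 | 32.2 | 44.8 |
| L3 + L4 + L5 | **89.3** | **37.2** | **51.6** |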
Table 13. Cross-dataset transfer performance of SMG-UAV between FRED and NeRDD. The original dataset splits are retained, and no target-domain fine-tuning, parameter adaptation, hyperparameter search, or threshold adjustment is performed.
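| Training Dataset | Testing Dataset | AP50 | AP75 | AP50:95 |
| --- | --- | --- | --- | --- |
| FRED | FRED | 89.3 | 37.2 | 51.6 |
| FRED | NeRDD | 87.5 | 29.8 | 48.7 |
| NeRDD | NeRDD | 88.7 | 30.1 | 49.5 |
| NeRDD | FRED | 66.2 | 25.4 | 33.9 |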
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Zhang, R.; Hou, J.; Shi, Y.; Dai, X.; Zhang, K.; Diao, J. SMG-UAV: Sparse Mutual Guided RGB–Event Fusion for Robust UAV Detection in Challenging Dynamic Environments. Drones 2026, 10, 486. https://doi.org/10.3390/drones10070486