4.1. Datasets and Implementation Details
We evaluate the proposed SMG-UAV on two RGB–event drone detection datasets, namely FRED and NeRDD. Both datasets provide synchronized RGB frames and event streams, making them suitable for evaluating the effectiveness of RGB–event fusion in UAV detection.
FRED. The Florence RGB–event Drone dataset (FRED) is a multimodal drone perception dataset designed for drone detection, tracking, and trajectory forecasting [17]. It provides spatio-temporally synchronized RGB video and event streams with dense UAV trajectory annotations. The dataset contains more than 7 h of annotated drone recordings, covers five different drone models, and includes challenging conditions such as rain, adverse illumination, distractors, and diverse motion patterns. In our experiments, FRED is used as the primary benchmark to evaluate the detection performance and robustness of the proposed method in realistic RGB–event anti-UAV scenarios.
NeRDD. NeRDD is a Neuromorphic-RGB Drone Detection dataset collected specifically for RGB–event drone detection [18]. It contains more than 3.5 h of spatio-temporally synchronized RGB–event drone recordings, corresponding to approximately 7 h of multimodal footage. The dataset is divided into 115 videos, and both modalities are provided at HD resolution and 30 FPS. Since NeRDD contains synchronized RGB and event data with drone annotations, it is used as an additional benchmark to evaluate whether the performance of SMG-UAV is consistent across RGB–event UAV benchmarks.
Evaluation metrics. Following common object detection protocols, we report $AP_{50}$, $AP_{75}$, and $mAP$. $AP_{50}$ and $AP_{75}$ denote the average precision at IoU thresholds of 0.50 and 0.75, respectively. $mAP$ denotes the mean average precision averaged over IoU thresholds from 0.50 to 0.95 with a step size of 0.05. These metrics jointly evaluate coarse localization accuracy, strict localization quality, and overall detection performance.
Implementation details. All experiments are conducted on a server equipped with an Intel Xeon Platinum 8470Q CPU and an NVIDIA RTX PRO 6000 Blackwell GPU. The software environment is built on Ubuntu 22.04, Python 3.12, PyTorch 2.8.0, and CUDA 12.8. The spiking components in the proposed Spiking CSPDarknet are implemented using SpikingJelly 0.14 [51].
All models are trained from scratch for up to 200 epochs. To preserve the fine-grained appearance and weak structural cues of UAV targets, a single fixed input resolution is used for both training and evaluation. For a fair and controlled comparison, all methods use the same sequence-level dataset split, input resolution, training budget, spatial preprocessing, optimization protocol, and evaluation metrics.
The compared methods are reproduced using their official implementations and released model configurations whenever available. The official network architecture and method-specific components are retained, while only the adaptations required for FRED and NeRDD, including the dataset path, annotation format, number of target classes, input resolution, and input channel configuration, are introduced. No additional method-specific hyperparameter search or preferential tuning is performed for any baseline.
Because different event-based methods are originally designed for different event representations, their event encodings are not forcibly unified. For all event-only and RGB–event methods, the raw event stream is first temporally aligned with the corresponding RGB frame interval and is then converted into the event representation required by the respective method. Specifically, SMG-UAV converts the asynchronous event stream into an event voxel grid with $T = 3$ temporal bins and two polarity channels, resulting in a six-channel event representation. For the compared event-only and RGB–event methods, we retain the native event encoding adopted by their official implementations, such as accumulated event frames or voxel grids, together with the corresponding temporal-bin and polarity settings. This protocol preserves the original design of each method while ensuring that all methods use the same synchronized RGB–event samples and evaluation protocol.
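The voxelization step described above can be illustrated with a minimal, dependency-free sketch. It assumes simple per-bin event counting and a channel layout of `2 * bin + polarity`; the function name `events_to_voxel`, the tuple layout of the events, and the toy 33 ms interval are hypothetical choices for illustration, and the actual encoding used in SMG-UAV may differ (for example, in normalization or temporal interpolation).

```python
def events_to_voxel(events, t_start, t_end, height, width, num_bins=3):
    """Accumulate events into a [2 * num_bins][height][width] count grid.

    events: iterable of (t, x, y, polarity), polarity 1 (positive) or 0 (negative).
    Channel layout (an assumption of this sketch): channel = 2 * bin + polarity.
    """
    voxel = [[[0.0] * width for _ in range(height)] for _ in range(2 * num_bins)]
    duration = t_end - t_start
    for t, x, y, polarity in events:
        # Keep only events inside the RGB frame interval [t_start, t_end).
        if not (t_start <= t < t_end):
            continue
        # Map the timestamp to a temporal bin index in [0, num_bins - 1].
        bin_idx = min(int((t - t_start) / duration * num_bins), num_bins - 1)
        voxel[2 * bin_idx + polarity][y][x] += 1.0
    return voxel


# Toy example: four events inside an illustrative 33 ms interval on a 4 x 5 sensor.
events = [(0.000, 1, 0, 1), (0.010, 1, 0, 1), (0.020, 2, 1, 0), (0.032, 3, 2, 1)]
voxel = events_to_voxel(events, t_start=0.0, t_end=0.033, height=4, width=5)
print(len(voxel), len(voxel[0]), len(voxel[0][0]))  # channels, height, width
```

With three temporal bins and two polarities, the first dimension has six channels, matching the six-channel representation described in the text. In practice an array library would replace the nested lists.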
For the common optimization protocol, stochastic gradient descent is used with a momentum of 0.937 and a weight decay of 0.0005. The learning-rate schedule consists of linear warm-up during the first three epochs, followed by cosine annealing. The initial learning rate is set to 0.01 and decays to 0.0001. An early-stopping strategy with a patience of 30 epochs is applied, and the model weights that perform best on the validation set are used for final evaluation. The compared methods therefore share the same optimization settings and training budget, while their official architectures and native event-representation designs are retained.
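As a rough illustration of the schedule described above, the following sketch computes a per-epoch learning rate with three warm-up epochs followed by cosine annealing from 0.01 to 0.0001. The per-epoch granularity and the warm-up starting value (one third of the initial rate) are assumptions of this sketch; the text does not specify them, and `learning_rate` is a hypothetical helper name.

```python
import math


def learning_rate(epoch, total_epochs=200, warmup_epochs=3,
                  lr_init=0.01, lr_final=0.0001):
    """Per-epoch learning rate: linear warm-up, then cosine annealing."""
    if epoch < warmup_epochs:
        # Linear ramp that reaches lr_init at the last warm-up epoch (assumed shape).
        return lr_init * (epoch + 1) / warmup_epochs
    # Cosine decay from lr_init to lr_final over the remaining epochs.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs - 1)
    return lr_final + 0.5 * (lr_init - lr_final) * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(e) for e in range(200)]
print(schedule[0], schedule[2], schedule[3], schedule[-1])
```

The rate rises during the first three epochs, peaks at the initial value, and then decreases monotonically to the final value at the last epoch.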
4.2. Comparison with State-of-the-Art Methods
We compare SMG-UAV with state-of-the-art methods on FRED and NeRDD. The compared methods include representative RGB-only detectors, event-only detectors, and RGB–event fusion detectors. Specifically, the RGB-only detectors include YOLOv12 [52], MambaYOLO [53], and RT-DETR [54]; the event-only detectors include RVT [55], SAST [56], and SMamba [57]; and the RGB–event fusion detectors include FPN-Fusion [12], RENet [13], SODFormer [14], and EOLO [15].
For a fair comparison, all methods are trained and evaluated under the same dataset split, input resolution, and evaluation protocol whenever applicable, while retaining the original event encoding form of each compared method. To reduce the influence of random initialization and training stochasticity, each method is trained three times with different random seeds. The quantitative results are reported in Table 1 as mean ± standard deviation.
As shown in Table 1, RGB-only detectors show limited performance on both FRED and NeRDD. Although RGB frames provide rich texture, color, and appearance information, these cues become unreliable when UAV targets are small, distant, blurred, weakly contrasted, or affected by illumination degradation. Even the best RGB-only result in Table 1 remains low on both FRED and NeRDD, indicating that appearance-based detection alone is insufficient for robust anti-UAV perception in challenging dynamic scenes.
Event-only detectors achieve substantially better results than RGB-only detectors on both datasets, which demonstrates the importance of high-temporal-resolution motion cues for UAV detection. Event streams provide more discriminative motion-sensitive responses, especially under fast motion, motion blur, and illumination variation. Among the event-only methods, SMamba performs best on both FRED and NeRDD (Table 1). However, event-only detection still has limitations: event responses may become sparse when the UAV moves slowly or appears at a long distance, and irrelevant background motion may introduce noisy activations.
RGB–event fusion methods generally improve over RGB-only methods, confirming the benefit of multimodal sensing for drone detection. However, the results also show that not all fusion strategies consistently outperform the strongest event-only baseline. For instance, some fusion methods achieve lower mean accuracy than SMamba on FRED or only marginal gains on NeRDD. This suggests that directly combining RGB and event features is not sufficient. When one modality is degraded or when the two modalities contain inconsistent responses, conventional fusion strategies may introduce unreliable information rather than suppress it.
In contrast, SMG-UAV achieves the best mean performance across all metrics on both datasets (Table 1). On FRED, SMG-UAV outperforms the strongest competing method by 6.8, 1.3, and 3.4 points in mean $AP_{50}$, $AP_{75}$, and $mAP$, respectively. On NeRDD, it exceeds the best competing method by 3.6, 2.8, and 8.3 points, respectively. Although the $AP_{75}$ improvement on FRED is smaller than the improvements in $AP_{50}$ and $mAP$, the standard deviation of SMG-UAV remains low, indicating that the gain is stable across different random seeds. These results suggest that the proposed method exhibits both strong detection performance and stable training behavior across two UAV RGB–event benchmarks.
Although AP-based metrics provide a standard and comprehensive evaluation of detection accuracy, they do not fully reflect the operational reliability required by practical anti-UAV surveillance. In such scenarios, missed targets may lead to delayed warning, while frequent false alarms may increase the burden of downstream tracking, verification, and response modules. Therefore, in addition to $AP_{50}$, $AP_{75}$, and $mAP$, we further report precision, recall, F1-score, false positives per frame (FP/frame), and miss rate on the FRED dataset in Table 2. We select FRED for this additional reliability-oriented evaluation because it provides a more comprehensive RGB–event anti-UAV benchmark with diverse UAV motion patterns, target scales, illumination conditions, background clutter, and challenging surveillance scenarios.
For a fair comparison, these additional metrics are computed from the post-NMS detection results under the same confidence threshold and IoU matching criterion for all compared methods. A predicted bounding box is counted as a true positive if its confidence score is higher than the predefined confidence threshold and its IoU with an unmatched ground-truth box is larger than 0.5. Predictions that do not match any ground-truth box are counted as false positives. Ground-truth boxes that are not matched by any valid prediction are treated as missed targets. In our evaluation, the confidence threshold is set to 0.3 for all methods, and the IoU threshold for matching is set to 0.5.
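The additional metrics are computed as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{N_{gt}}, \qquad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
$$\mathrm{FP/frame} = \frac{FP}{N_{f}}, \qquad \mathrm{Miss\ rate} = \frac{N_{gt} - TP}{N_{gt}} = 1 - \mathrm{Recall},$$
where $TP$ and $FP$ denote true positives and false positives, respectively, $N_{gt}$ denotes the number of ground-truth UAV instances, and $N_{f}$ denotes the number of evaluated frames. For anti-UAV surveillance, recall and FP/frame are particularly important because they directly reflect missed-warning risk and false-alarm burden.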
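The matching rule and reliability metrics described above can be sketched as follows. This is a minimal illustration, not the evaluation code used in the paper: boxes are assumed to be `(x1, y1, x2, y2)` tuples, predictions are matched greedily in descending confidence order, and the miss rate is taken as one minus recall. The helper names `iou` and `reliability_metrics` are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def reliability_metrics(frames, conf_thr=0.3, iou_thr=0.5):
    """frames: list of (predictions, ground_truths), one entry per frame.

    predictions: list of (box, score) after NMS; ground_truths: list of boxes.
    """
    tp = fp = n_gt = 0
    for preds, gts in frames:
        n_gt += len(gts)
        matched = set()
        # Greedy matching in descending confidence order (assumption of this sketch).
        kept = sorted((p for p in preds if p[1] > conf_thr), key=lambda p: -p[1])
        for box, _score in kept:
            best_j, best_iou = -1, iou_thr
            for j, gt in enumerate(gts):
                overlap = iou(box, gt)
                # Only unmatched ground-truth boxes with IoU above the threshold count.
                if j not in matched and overlap > best_iou:
                    best_j, best_iou = j, overlap
            if best_j >= 0:
                matched.add(best_j)
                tp += 1
            else:
                fp += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / n_gt if n_gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "fp_per_frame": fp / len(frames) if frames else 0.0,
            "miss_rate": 1.0 - recall}


# Toy example: one hit and one false alarm in frame 1, one missed target in frame 2.
frames = [
    ([((10, 10, 20, 20), 0.9), ((50, 50, 60, 60), 0.8)], [(11, 11, 21, 21)]),
    ([((0, 0, 10, 10), 0.2)], [(0, 0, 10, 10)]),
]
metrics = reliability_metrics(frames)
print(metrics)
```

In the toy example, the low-confidence prediction in the second frame is discarded by the 0.3 threshold, so its ground-truth box counts as a missed target even though the box overlaps perfectly.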
As shown in Table 2, the proposed SMG-UAV not only achieves the highest AP-based accuracy, but also provides a better balance between recall and false alarms. Compared with RGB-only and event-only detectors, SMG-UAV achieves higher recall, lower FP/frame, and a lower miss rate. This is important for practical low-altitude surveillance, where a detector is often used as the front-end of subsequent tracking, verification, and warning modules.
Although the overall comparison provides a general evaluation of detection performance, it cannot fully reveal the reliability of anti-UAV detectors under specific adverse conditions. In practical low-altitude surveillance scenarios, UAV targets may be extremely small, blurred by fast motion, degraded by abnormal illumination, submerged in complex backgrounds, or confused with bird-like moving distractors. Therefore, we further conduct a challenge-oriented robustness evaluation on the FRED dataset.
Specifically, we manually curate five UAV-specific challenging subsets from the FRED test set. The original bounding-box annotations are kept unchanged, and only frame-level difficulty attributes are additionally assigned for evaluation. Each selected RGB frame is paired with the event stream accumulated within the corresponding RGB frame interval. The five subsets are small target, motion blur, extreme illumination, background-embedded target, and bird-like distractor. The small-target subset contains UAV instances whose bounding-box size is smaller than $16 \times 16$ pixels. The extreme-illumination subset includes both overexposed and underexposed scenes. The background-embedded subset refers to cases where the UAV is visually similar to the surrounding background or partially obscured by cluttered structures. The bird-like distractor subset contains scenes with target-like moving objects that may cause false alarms in anti-UAV detection. The statistics of these subsets are summarized in Table 3, and representative examples are shown in Figure 5. These subsets are used only for evaluation and are not involved in the training process. The five subsets are not mutually exclusive, and a frame may be assigned multiple difficulty attributes if multiple challenges coexist.
For the small-target subset, candidate samples are first selected from the original annotations by identifying frames that contain at least one UAV bounding box with both width and height smaller than 16 pixels. The candidate samples are then manually re-examined to exclude cases with inaccurate, misaligned, or otherwise unreliable original annotations. For the other four difficulty attributes, namely motion blur, extreme illumination, background-embedded, and bird distractor, frame-level labels are manually assigned according to the predefined visual qualification criteria described in Table 3.
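The automatic pre-selection of small-target candidates can be sketched as a simple filter over the annotations. The annotation layout (a dictionary from frame identifier to a list of `(x, y, width, height)` boxes) and the name `small_target_candidates` are hypothetical; the subsequent manual re-examination is not modeled here.

```python
def small_target_candidates(annotations, max_side=16):
    """Return frame ids with at least one box whose width AND height are below max_side.

    annotations: dict mapping frame id -> list of (x, y, width, height) boxes in pixels.
    """
    return [frame_id for frame_id, boxes in annotations.items()
            if any(w < max_side and h < max_side for _x, _y, w, h in boxes)]


annotations = {
    "frame_0001": [(100, 80, 12, 9)],                    # both sides < 16: candidate
    "frame_0002": [(40, 60, 15, 20)],                    # height >= 16: not a candidate
    "frame_0003": [(5, 5, 64, 48), (300, 200, 10, 15)],  # second box qualifies
}
candidates = small_target_candidates(annotations)
print(candidates)
```

A frame qualifies as soon as one of its boxes satisfies the size criterion, which matches the "at least one UAV bounding box" wording above.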
To reduce subjectivity during subset construction, two annotators independently assign the five frame-level difficulty attributes. The annotators examine the RGB frames and refer to the original UAV bounding-box annotations. The corresponding event representations and temporally adjacent frames are additionally inspected when necessary, particularly when determining motion blur, background interference, or bird-like distractors. Cases in which the two annotators produce inconsistent assignments are independently reviewed and adjudicated by a third experienced annotator.
We evaluate representative methods from different input modalities on these five challenging subsets and report $AP_{50}$ as the robustness metric. $AP_{50}$ is adopted because this analysis focuses on whether UAV targets can be reliably detected under severe degradation, especially when the targets are small, weak, or visually ambiguous. The results are shown in Table 4.
As shown in Table 4, SMG-UAV achieves the best result on all five challenging subsets, demonstrating its robustness under diverse anti-UAV failure factors. On the small-target subset, SMG-UAV obtains 31.7, outperforming the strongest baseline SMamba by 4.8 points. In contrast, the RGB-only detectors achieve only 11.4 and 12.8, showing that appearance information alone is insufficient when UAVs occupy only a few pixels.
For motion blur, SMG-UAV reaches 36.9, exceeding SMamba by 4.2 points and SODFormer by 5.4 points. Event-only detection performs clearly better than RGB-only detection in this subset because event streams preserve motion-sensitive brightness changes. However, the further gain of SMG-UAV shows that event cues are more effective when they are used to guide RGB feature reconstruction rather than being used independently.
Under extreme illumination, the advantage of SMG-UAV becomes more evident. RGB-only methods drop to 5.7 and 4.9 because UAV appearance cues are severely degraded by overexposure or underexposure. Event-only and fusion-based methods are more robust, but SMG-UAV still improves the best competing result from 41.2 to 47.6. This suggests that the proposed sparse mutual guidance can better exploit illumination-insensitive event responses while suppressing unreliable RGB features.
For the background-embedded subset, SMG-UAV achieves 45.6, surpassing the strongest baseline by 6.4 points. This subset is particularly challenging because the UAV is visually similar to the surrounding structures or partially submerged in cluttered backgrounds. The result indicates that SMG-UAV can recover more discriminative target representations by jointly using motion-sensitive event cues and RGB semantic constraints.
The bird distractor subset further evaluates the false-alarm resistance of different methods. Although RGB-only MambaYOLO obtains a relatively high score of 36.9, conventional event-only and fusion-based methods remain vulnerable to target-like distractors and background motion. SMG-UAV achieves 46.2, improving the best competing result by 9.3 points. This result suggests that the bidirectional guidance in SMG-Bridge helps suppress irrelevant event responses and improves discrimination between real UAV targets and target-like distractors.
Overall, the challenge-oriented evaluation demonstrates that the proposed method is not only superior in overall benchmark performance, but also more reliable under key anti-UAV degradation factors. The consistent gains across small target, motion blur, extreme illumination, background embedding, and distractor interference provide stronger evidence for the effectiveness of sparse mutual guidance in practical low-altitude UAV detection.
4.3. Computational Complexity and Inference Efficiency
Since practical anti-UAV systems require both high detection accuracy and real-time response, we further evaluate the computational complexity and inference efficiency of different methods. The evaluated metrics include the number of parameters, GFLOPs, FPS, and inference latency. All methods are evaluated under the same input resolution and a batch size of 1. The server-side FPS is measured on the same hardware platform used for training and evaluation, so that the reported speed is consistent with the experimental environment used for the accuracy comparison. The reported latency denotes the average network forward inference time per sample, excluding disk I/O, data loading, and event voxel preprocessing. For RGB–event methods, both RGB and event inputs are included in the network inference pipeline. The latency is derived from the FPS as $1000 / \mathrm{FPS}$ ms.
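The measurement protocol above (batch size 1, warm-up runs excluded, latency derived from FPS) can be sketched with a generic timing harness. The names `measure_fps` and `latency_ms_from_fps`, the warm-up and iteration counts, and the stand-in workload are hypothetical. For a GPU model, the device would additionally need to be synchronized before each clock read, which this CPU-only sketch omits.

```python
import time


def latency_ms_from_fps(fps):
    """Average per-sample latency in milliseconds implied by a throughput in FPS."""
    return 1000.0 / fps


def measure_fps(forward, sample, warmup=10, iters=100):
    """Time a forward callable on one sample and return (fps, latency_ms)."""
    for _ in range(warmup):
        forward(sample)  # warm-up runs are excluded from the timing
    start = time.perf_counter()
    for _ in range(iters):
        forward(sample)
    elapsed = time.perf_counter() - start
    fps = iters / elapsed
    return fps, latency_ms_from_fps(fps)


# Consistency check against the server-side figures quoted in the text.
print(round(latency_ms_from_fps(132), 2))  # 7.58

# Stand-in workload; a real measurement would pass the network's forward function.
fps, latency_ms = measure_fps(lambda s: sum(v * v for v in s), list(range(1000)))
print(fps > 0, latency_ms > 0)
```

The first printed value reproduces the 7.58 ms latency reported for 132 FPS, and the same conversion gives 25 ms for 40 FPS.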
As shown in Table 5, some event-only models, such as RVT and SAST, achieve faster inference and lower parameter counts than SMG-UAV because of their single-modality design. In contrast, SMG-UAV introduces a dual-branch RGB–event architecture with spiking event processing, sparse mutual guidance, and pyramid feature enhancement. Although this adds some computational overhead compared with the fastest event-only models, SMG-UAV remains moderate in size, with 18.7 M parameters and 66.2 GFLOPs, and runs at 132 FPS with an average network latency of 7.58 ms. More importantly, it consistently achieves the highest detection accuracy across the $AP_{50}$, $AP_{75}$, and $mAP$ metrics. Therefore, SMG-UAV provides a favorable balance between computational efficiency and detection reliability, making it more suitable for practical low-altitude anti-UAV surveillance scenarios where both accuracy and real-time response are critical.
To further evaluate embedded deployment potential, we also deploy the compared models on an NVIDIA Jetson Orin Nano development board (NVIDIA, Santa Clara, CA, USA). The Jetson Orin Nano is a widely used edge-AI development platform for embedded vision, robotics, autonomous systems, intelligent surveillance, and industrial inspection. Its compact form factor and limited computational resources make it suitable for evaluating the efficiency and practical deployment potential of the proposed method on resource-constrained edge-computing platforms. The embedded-platform FPS is measured with batch size 1 after model warm-up, excluding disk I/O, data loading, and event voxel preprocessing. The results are reported in Table 6.
It should be noted that the reported latency measures only network forward inference. The complete end-to-end latency of a deployed RGB–event anti-UAV system may additionally include event stream buffering, event voxel construction, RGB–event synchronization, image resizing, and communication overhead. These factors are implementation- and hardware-dependent.
Despite these additional factors, SMG-UAV achieves a forward inference speed of 40 FPS on the NVIDIA Jetson Orin Nano development board, demonstrating that the proposed method can provide a favorable balance between detection accuracy and real-time performance on edge-computing platforms. This indicates its practical potential for low-altitude anti-UAV surveillance in resource-constrained embedded scenarios.
4.4. Qualitative Comparison and Heatmap Analysis
The quantitative results in Table 1 demonstrate the overall accuracy and challenge-specific robustness of SMG-UAV. However, numerical metrics alone cannot fully reveal how different detectors behave in practical anti-UAV scenarios. Therefore, we further provide qualitative detection comparisons and response heatmap visualizations to analyze the localization behavior and feature focus of different methods.
Figure 6 shows representative detection results under challenging scenes, including extreme illumination, background clutter, weak small targets, and low-contrast environments. For each scene, we compare representative RGB-only, event-only, and RGB–event fusion detectors with the proposed SMG-UAV. The confidence value displayed beside each bounding box in Figure 6 denotes the final detection confidence of the prediction retained after non-maximum suppression (NMS). For a fair comparison, all compared methods use the same confidence threshold and the same NMS IoU threshold.
The visual results indicate that RGB-only methods are sensitive to severe illumination degradation and weak target appearance. When the UAV is overexposed, underexposed, or visually similar to the background, RGB-only detectors may miss the target or produce unstable confidence scores. Event-only methods can capture motion-sensitive responses and are less affected by static appearance degradation, but their predictions may become fragmented when event responses are sparse or when background motion introduces noise. Existing RGB–event fusion methods improve detection results by combining appearance and motion cues, but they may still suffer from localization drift or false responses when one modality is unreliable.
In contrast, SMG-UAV provides more stable detection results across different challenging cases. It produces tighter bounding boxes and more reliable confidence scores for distant or weak UAV targets. This advantage is especially clear in extreme-illumination and background-embedded scenes, where UAV appearance is severely degraded and the target region is difficult to distinguish from the surrounding background. These qualitative observations are consistent with the robustness analysis in Table 4.
To further explain the behavior of different detectors, Figure 7 presents Grad-CAM visualizations generated using a standard Grad-CAM implementation. For each model, the UAV detection score is used as the target output, and the feature layer immediately before the detection head is selected as the target layer. The generated heatmaps are resized to the input image resolution and independently normalized to the range $[0, 1]$ using min–max normalization. The same color map is applied to all compared methods, where warmer colors indicate regions with greater relative contribution to the corresponding UAV prediction and cooler colors indicate lower relative contribution. Since each heatmap is independently normalized, the visualization is used to compare the spatial localization and concentration of model attention rather than the absolute response magnitude across different models.
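The per-heatmap min–max normalization mentioned above amounts to the following small computation. The sketch operates on a heatmap stored as a list of rows to stay dependency-free; the function name `min_max_normalize` and the handling of a constant map (returning zeros) are assumptions of this illustration, not details given in the text.

```python
def min_max_normalize(heatmap, eps=1e-12):
    """Rescale a 2-D heatmap (list of rows) to [0, 1], independently of other maps."""
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span < eps:
        # A constant map carries no spatial contrast; return zeros (sketch choice).
        return [[0.0 for _ in row] for row in heatmap]
    return [[(v - lo) / span for v in row] for row in heatmap]


cam = [[0.2, 0.4], [1.0, 0.6]]
normalized = min_max_normalize(cam)
print(normalized)
```

Because the minimum and maximum are taken per heatmap, two models with very different absolute activation levels both map to the full $[0, 1]$ range, which is why the visualization supports comparison of spatial concentration only.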
As shown in Figure 7, several existing RGB–event fusion methods produce diffuse or spatially shifted attention regions, with activations extending to irrelevant background structures. In contrast, SMG-UAV produces more compact and accurately localized attention around the true UAV regions while suppressing spurious background responses. This suggests that the proposed method learns more discriminative and target-aware feature representations under challenging anti-UAV conditions.
The qualitative detection results and heatmap visualizations provide intuitive evidence for the effectiveness of SMG-UAV. The proposed method not only improves quantitative detection performance, but also exhibits more reliable localization behavior and more discriminative internal responses in challenging anti-UAV environments.
4.5. Ablation Studies
To verify the effectiveness of the proposed components, we conduct a series of ablation studies on the FRED dataset. All variants are trained and evaluated under the same settings as the full model. Unless otherwise specified, only the analyzed component is changed, while the remaining architecture and training protocol are kept unchanged. $AP_{50}$, $AP_{75}$, and $mAP$ are reported to evaluate detection accuracy at different localization thresholds.
4.5.1. Effectiveness of Main Components
We first evaluate the contribution of the main components in SMG-UAV through a progressive ablation study on the FRED dataset. The baseline is a basic dual-stream RGB–event detector, in which both branches use conventional CSPDarknet backbones, the event input is represented as a three-channel accumulated event frame, and cross-modal fusion is implemented by direct concatenation. Based on this baseline, we progressively introduce the event voxel representation, the Spiking CSPDarknet event encoder, SMG-Bridge, and the SGP-Neck. In addition to $AP_{50}$, $AP_{75}$, and $mAP$, Table 7 reports the cumulative parameter count and GFLOPs of each configuration to quantify the computational overhead introduced by the corresponding components.
As shown in Table 7, the baseline achieves 73.4 $AP_{50}$, 30.2 $AP_{75}$, and 38.7 $mAP$, with 16.0 M parameters and 54.1 GFLOPs. Replacing the accumulated event frame with the event voxel representation improves the performance to 77.9 $AP_{50}$, 32.5 $AP_{75}$, and 44.1 $mAP$, corresponding to improvements of 4.5, 2.3, and 5.4 points, respectively. This demonstrates that preserving the temporal distribution and polarity information of events within the RGB frame interval provides more informative motion cues than a temporally collapsed event frame.
The event voxel representation changes the event input from the three-channel accumulated event frame used in the baseline to a six-channel voxel tensor. The larger input channel dimension slightly enlarges the input layer of the event branch and introduces additional network-forward computation. Consequently, the parameter count increases from 16.0 M to 16.2 M, while the computational cost increases from 54.1 to 56.1 GFLOPs, corresponding to additional costs of 0.2 M parameters and 2.0 GFLOPs.
Further introducing the Spiking CSPDarknet event encoder raises the performance to 80.2 $AP_{50}$, 35.3 $AP_{75}$, and 44.5 $mAP$, corresponding to improvements of 2.3, 2.8, and 0.4 points, respectively. The Spiking CSPDarknet encoder increases the cumulative model size from 16.2 M to 16.8 M parameters and the computational cost from 56.1 to 58.9 GFLOPs, introducing 0.6 M additional parameters and 2.8 GFLOPs. These results suggest that the spiking event branch better matches the sparse and motion-triggered characteristics of event data and improves hierarchical event feature extraction with a relatively limited computational overhead.
The largest accuracy gain is obtained by replacing direct concatenation with SMG-Bridge. This modification increases the performance to 87.6 $AP_{50}$, 36.7 $AP_{75}$, and 48.9 $mAP$, bringing gains of 7.4, 1.4, and 4.4 points over the preceding configuration, respectively. Meanwhile, SMG-Bridge introduces 1.0 M additional parameters and 4.5 GFLOPs, increasing the cumulative complexity from 16.8 M parameters and 58.9 GFLOPs to 17.8 M parameters and 63.4 GFLOPs. The substantial accuracy improvement relative to this moderate computational increase demonstrates that reliability-aware sparse mutual guidance is more effective than direct feature aggregation for exploiting complementary RGB–event information.
Finally, adding SGP-Neck further improves the results to 89.3 $AP_{50}$, 37.2 $AP_{75}$, and 51.6 $mAP$. Compared with the SMG-Bridge configuration, SGP-Neck provides additional gains of 1.7, 0.5, and 2.7 points, respectively. It introduces 0.9 M additional parameters and 2.8 GFLOPs, resulting in a full-model complexity of 18.7 M parameters and 66.2 GFLOPs. This result shows that lightweight gated multiscale enhancement remains beneficial after cross-modal fusion, particularly for small and weak UAV targets whose responses may be attenuated during hierarchical feature propagation.
Overall, the complete SMG-UAV improves the baseline by 15.9 $AP_{50}$, 7.0 $AP_{75}$, and 12.9 $mAP$, while introducing 2.7 M additional parameters and 12.1 GFLOPs. Among the proposed components, SMG-Bridge provides the largest accuracy improvement, whereas the event voxel representation, Spiking CSPDarknet encoder, and SGP-Neck introduce relatively limited additional complexity. These results demonstrate that the proposed components provide substantial detection gains with controlled computational overhead, resulting in a favorable accuracy–complexity trade-off.
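The step-wise and cumulative figures quoted in this subsection can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text; the three accuracy columns are kept in the order in which the text reports them, and the variable names are hypothetical.

```python
# (configuration, accuracy 1, accuracy 2, accuracy 3, parameters in M, GFLOPs),
# with the three accuracy values in the order reported in the text.
configs = [
    ("Baseline",             73.4, 30.2, 38.7, 16.0, 54.1),
    ("+ Event voxel",        77.9, 32.5, 44.1, 16.2, 56.1),
    ("+ Spiking CSPDarknet", 80.2, 35.3, 44.5, 16.8, 58.9),
    ("+ SMG-Bridge",         87.6, 36.7, 48.9, 17.8, 63.4),
    ("+ SGP-Neck (full)",    89.3, 37.2, 51.6, 18.7, 66.2),
]

# Gain of each configuration over the preceding one, rounded to one decimal.
steps = [
    (cur[0],) + tuple(round(c - p, 1) for p, c in zip(prev[1:], cur[1:]))
    for prev, cur in zip(configs, configs[1:])
]
# Total change from the baseline to the full model.
total = tuple(round(last - first, 1)
              for first, last in zip(configs[0][1:], configs[-1][1:]))

for row in steps:
    print(row)
print("total:", total)
```

The per-step differences sum to the overall change from the baseline to the full model, so the reported totals are consistent with the individual steps.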
4.5.2. Effect of Bidirectional Mutual Guidance
To further analyze the cross-modal interaction mechanism in SMG-Bridge, we compare different guidance directions. The direct fusion variant removes mutual guidance and aggregates RGB and event features directly. The event-to-image and image-to-event variants retain only one guidance direction, while the bidirectional variant corresponds to the full SMG-Bridge.
As shown in Table 8, direct fusion achieves 82.2 $AP_{50}$, 35.7 $AP_{75}$, and 45.1 $mAP$. Both one-way guidance variants improve over this baseline. Event-to-image guidance increases the performance to 86.9 $AP_{50}$, 36.7 $AP_{75}$, and 49.8 $mAP$, indicating that event cues provide effective motion-sensitive structure for enhancing degraded RGB representations. Image-to-event guidance also improves the results to 85.1 $AP_{50}$, 36.1 $AP_{75}$, and 48.4 $mAP$, showing that RGB appearance semantics help suppress noisy or irrelevant event responses.
The complete bidirectional guidance achieves the best performance, reaching 89.3 $AP_{50}$, 37.2 $AP_{75}$, and 51.6 $mAP$. Compared with event-to-image guidance only, it further improves $AP_{50}$ and $mAP$ by 2.4 and 1.8 points, respectively. These results show that the two guidance directions are complementary rather than redundant. By jointly exploiting event-guided RGB recovery and RGB-guided event refinement, SMG-Bridge produces more reliable cross-modal representations for robust UAV detection.
4.5.3. Effect of Sparse Thresholding and Guidance Gate
We further analyze two internal designs of SMG-Bridge: the adaptive sparse threshold and the dynamic guidance gate. As shown in Table 9, removing the guidance gate reduces $mAP$ from 51.6 to 45.8, indicating that ungated cross-modal propagation introduces more unreliable responses. Replacing the dynamic gate with a static scalar also degrades the performance to 47.9 $mAP$, showing that spatially adaptive guidance is more effective than global weighting. Removing the adaptive sparse threshold lowers $mAP$ to 49.4, suggesting that sparse thresholding contributes to suppressing noisy or unreliable responses during fusion. The full SMG-Bridge achieves the best results, demonstrating that both adaptive thresholding and dynamic guidance are important for reliable RGB–event fusion.
4.5.4. Effect of Event Voxel Configuration
We further investigate the influence of two key parameters in the event voxel representation: the number of temporal bins $T$ and the event accumulation window. To isolate the effect of each factor, only one parameter is varied at a time, while the remaining network architecture, training settings, input resolution, dataset split, and evaluation protocol are kept unchanged. When evaluating the number of temporal bins, the event accumulation window is fixed to one RGB frame interval on FRED. When evaluating the accumulation window, the number of temporal bins is fixed at the default value of $T = 3$.
The positive and negative event polarities are accumulated separately in each temporal bin. Therefore, an event voxel with $T$ temporal bins is represented as a $2T \times H \times W$ tensor, where the factor of two corresponds to the two event polarities and $H \times W$ is the spatial resolution. The default configuration uses $T = 3$, resulting in a six-channel event representation. Changing $T$ modifies the input channel dimension of the event branch and therefore slightly affects the parameter count and network-forward GFLOPs.
As shown in Table 10, using too few temporal bins excessively compresses the temporal distribution of the event stream and limits the representation of short-term UAV motion. Increasing $T$ preserves finer temporal information and initially improves detection performance. However, an excessively large $T$ distributes the available events over more temporal intervals, making the response within each bin increasingly sparse. It also increases the number of input channels and slightly raises the parameter count and network-forward computation. The configuration with $T = 3$ provides the best overall balance among temporal resolution, event density, detection accuracy, and computational complexity. Therefore, $T = 3$ is adopted as the default setting in SMG-UAV.
Table 11 analyzes the influence of the event accumulation duration. A short accumulation window contains relatively few events and may provide insufficient motion evidence for distant, slowly moving, or weak UAV targets. Increasing the window produces denser event observations and can improve the stability of event feature extraction. Nevertheless, an excessively long window may accumulate irrelevant background events, generate longer motion trails, and weaken the temporal alignment between the event representation and the current RGB frame. The best-performing accumulation window in Table 11 is used in the final model.
4.5.5. Contribution of SMG-Bridge at Different Feature Levels
We further investigate the contribution of SMG-Bridge at different feature levels of the dual-branch backbone. SMG-Bridge can be applied to the L3, L4, and L5 stages, which correspond to high-, medium-, and low-resolution feature representations, respectively. To isolate the contribution of bridge placement, all remaining components, including the event voxel representation, Spiking CSPDarknet encoder, SGP-Neck, detection head, and training protocol, are kept unchanged. When SMG-Bridge is disabled at a feature level, the RGB and event features at that level are directly concatenated along the channel dimension.
As shown in Table 12, the L3 bridge operates on higher-resolution features and primarily improves the preservation and cross-modal enhancement of fine spatial details, which are particularly important for small and distant UAV targets. The L4 and L5 bridges operate on progressively deeper features and contribute stronger semantic and contextual interaction between the RGB and event modalities.
Combining SMG-Bridge at multiple levels produces further improvements over using any individual level. In particular, the complete L3 + L4 + L5 configuration achieves the best performance, reaching 89.3 $AP_{50}$, 37.2 $AP_{75}$, and 51.6 $mAP$. These results indicate that high-resolution spatial guidance and deeper semantic guidance are complementary rather than redundant, supporting the multilevel deployment of SMG-Bridge in the final architecture.
4.6. Cross-Dataset Transfer Evaluation
To further examine the transferability of SMG-UAV across different anti-UAV data distributions, we conduct bidirectional cross-dataset experiments between FRED and NeRDD. Two transfer settings are considered: training on FRED and directly testing on NeRDD, denoted as FRED → NeRDD, and training on NeRDD and directly testing on FRED, denoted as NeRDD → FRED. The corresponding FRED → FRED and NeRDD → NeRDD results are also reported as in-domain references.
It should be noted that FRED was developed by extending NeRDD with additional sequences and more diverse target scales, backgrounds, illumination conditions, and motion patterns. Therefore, the two datasets are related rather than completely independent, and the two transfer directions have different levels of difficulty. FRED → NeRDD represents transfer from a broader data distribution to a related and relatively narrower distribution, whereas NeRDD → FRED requires the model to generalize from the more limited NeRDD distribution to the more diverse scenarios contained in FRED. Accordingly, the NeRDD → FRED direction provides a more challenging assessment of cross-dataset transferability.
For each transfer direction, SMG-UAV is trained from scratch using only the training split of the source dataset. The best-performing checkpoint is selected exclusively according to the validation split of the same source dataset and is then directly evaluated on the test split of the target dataset. No target-domain fine-tuning, parameter adaptation, additional training, hyperparameter search, or confidence-threshold adjustment is performed. Target-domain annotations are used only for the final calculation of the evaluation metrics. The confidence threshold, non-maximum suppression settings, and evaluation implementation remain unchanged across the in-domain and cross-dataset experiments.
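The selection rule in this protocol can be summarized in a few lines. The sketch below is only an illustration under stated assumptions: the checkpoint names, the validation scores, the `conf_threshold` and `nms_iou` values, and the `fake_evaluate` stub are placeholders invented for the example and do not correspond to the actual training runs or settings.

```python
def run_transfer(source_val_scores, evaluate, settings):
    """Pick the checkpoint by source validation score, then evaluate it
    once on the target test split with fixed inference settings."""
    best = max(source_val_scores, key=source_val_scores.get)
    return best, evaluate(best, split="target_test", **settings)


# Stub evaluator that records how it was called (placeholder, no real model).
calls = []


def fake_evaluate(checkpoint, split, conf_threshold, nms_iou):
    calls.append((checkpoint, split, conf_threshold, nms_iou))
    return {"checkpoint": checkpoint, "split": split}


# Placeholder inference settings, shared by in-domain and transfer runs.
SETTINGS = {"conf_threshold": 0.25, "nms_iou": 0.5}
# Placeholder source-validation scores for three hypothetical checkpoints.
source_val = {"epoch_10": 0.41, "epoch_20": 0.47, "epoch_30": 0.45}

best, report = run_transfer(source_val, fake_evaluate, SETTINGS)
print(best, report)
```

The point of this structure is that no target-domain quantity enters the selection step: the target test split is touched exactly once, after the checkpoint and the inference settings are already fixed.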
As shown in Table 13, the model trained on FRED and directly evaluated on NeRDD achieves 87.5 $AP_{50}$, 29.8 $AP_{75}$, and 48.7 $mAP$. Compared with the model trained and evaluated on NeRDD, the transferred model shows only limited reductions of 1.2, 0.3, and 0.8 points in $AP_{50}$, $AP_{75}$, and $mAP$, respectively. This small gap is consistent with the relationship between the two datasets: because FRED extends the distribution represented by NeRDD, training on FRED exposes the model to a broader but related distribution, and the resulting representations remain effective when transferred to NeRDD.
In the reverse direction, the model trained on NeRDD and directly evaluated on FRED obtains 66.2 $AP_{50}$, 25.4 $AP_{75}$, and 33.9 $mAP$. Compared with the FRED in-domain results of 89.3, 37.2, and 51.6, this corresponds to reductions of 23.1, 11.8, and 17.7 points, respectively. The substantially larger degradation indicates that the comparatively narrower NeRDD training distribution does not sufficiently cover the additional target appearances, scales, scene structures, illumination variations, and motion conditions contained in FRED. Consequently, NeRDD → FRED is the more challenging transfer direction and provides a stricter test of the ability of SMG-UAV to generalize from a limited source distribution to broader anti-UAV scenarios.
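The gaps quoted for the two transfer directions can be checked with simple arithmetic. In the sketch below, the FRED in-domain values are those reported for the full model in Section 4.5.5, and the metric order follows Section 4.1 (AP at IoU 0.50, AP at IoU 0.75, then mAP). The NeRDD in-domain values are not quoted in this passage; they are back-computed here from the stated reductions and should be read as derived. The relative drops are an added convenience and are not reported in the paper.

```python
METRICS = ("AP50", "AP75", "mAP")

fred_in_domain = (89.3, 37.2, 51.6)      # FRED -> FRED (Section 4.5.5)
nerdd_to_fred = (66.2, 25.4, 33.9)       # NeRDD -> FRED
fred_to_nerdd = (87.5, 29.8, 48.7)       # FRED -> NeRDD
fred_to_nerdd_drop = (1.2, 0.3, 0.8)     # quoted reductions vs. NeRDD -> NeRDD

# Absolute reductions for NeRDD -> FRED, in points.
nerdd_to_fred_drop = [round(a - b, 1)
                      for a, b in zip(fred_in_domain, nerdd_to_fred)]

# NeRDD in-domain reference, derived from the quoted reductions.
nerdd_in_domain = [round(a + d, 1)
                   for a, d in zip(fred_to_nerdd, fred_to_nerdd_drop)]

# Relative drop of NeRDD -> FRED with respect to the FRED in-domain result.
relative_drop_pct = [round(100 * d / a, 1)
                     for d, a in zip(nerdd_to_fred_drop, fred_in_domain)]

for name, d, r in zip(METRICS, nerdd_to_fred_drop, relative_drop_pct):
    print(f"{name}: -{d} points ({r}% relative)")
```

The computed absolute reductions reproduce the 23.1, 11.8, and 17.7 points stated in the text, which confirms that the quoted gaps are measured against the full-model FRED results.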
Overall, the bidirectional experiments reveal a clear asymmetric transfer pattern. The FRED → NeRDD setting retains performance close to the NeRDD in-domain reference, whereas the NeRDD → FRED setting exhibits a considerable decrease under the broader target distribution. Because FRED was developed by extending NeRDD, the favorable FRED → NeRDD result should be interpreted as transfer between closely related datasets rather than generalization to a completely independent domain. Nevertheless, the results demonstrate that the complementary RGB appearance cues and event-based motion representations learned by SMG-UAV exhibit meaningful transferability across related anti-UAV data distributions. The degradation in the more difficult NeRDD → FRED direction also indicates that robust generalization from a restricted training distribution to more diverse real-world environments remains an important challenge.