Article

Sparse-Gated RGB-Event Fusion for Small Object Detection in the Wild

College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(17), 3112; https://doi.org/10.3390/rs17173112
Submission received: 5 August 2025 / Revised: 30 August 2025 / Accepted: 4 September 2025 / Published: 6 September 2025

Abstract

Detecting small moving objects under challenging lighting conditions, such as overexposure and underexposure, remains a critical challenge in computer vision applications including surveillance, autonomous driving, and anti-UAV systems. Traditional RGB-based detectors often suffer from degraded object visibility under highly dynamic illumination, leading to suboptimal performance. To address these limitations, we propose a novel RGB-Event fusion framework that leverages the complementary strengths of the RGB and event modalities for enhanced small object detection. Specifically, we introduce a Temporal Multi-Scale Attention Fusion (TMAF) module to encode motion cues from event streams at multiple temporal scales, thereby enhancing the saliency of small object features. Furthermore, we design a Sparse Noisy Gated Attention Fusion (SNGAF) module, inspired by the mixture-of-experts paradigm, which employs a sparse gating mechanism to adaptively combine multiple fusion experts based on input characteristics, enabling flexible and robust RGB-Event feature integration. Additionally, we present RGBE-UAV, a new RGB-Event dataset tailored for small moving object detection under diverse exposure conditions. Extensive experiments on our RGBE-UAV and the public DSEC-MOD datasets demonstrate that our method outperforms existing state-of-the-art RGB-Event fusion approaches, validating its effectiveness and generalization under complex lighting conditions.

1. Introduction

Detecting small moving objects is a critical yet challenging task in various computer vision applications [1,2], such as surveillance, autonomous driving, and anti-Unmanned Aerial Vehicle (anti-UAV) systems. The difficulty is further exacerbated under adverse lighting conditions—such as overexposure or underexposure—where conventional Red–Green–Blue (RGB) cameras often struggle to capture clear and consistent object appearances [3,4]. In such scenarios, small moving objects may be obscured by motion blur, extreme illumination, or high dynamic range, leading to degraded detection performance [5,6]. While RGB cameras excel at capturing rich texture and color information under stable lighting, their limitations in handling fast motion and extreme lighting highlight the need for complementary sensing modalities. Event cameras, also known as dynamic vision sensors (DVSs), offer a promising solution to these challenges. Unlike conventional frame-based cameras, event cameras asynchronously detect pixel-level brightness changes, providing high temporal resolution, high dynamic range, and low data redundancy [7]. These properties make event cameras particularly effective for capturing fast-moving objects [8] and operating in challenging lighting environments [9,10]. However, effectively leveraging event data for object detection requires specialized processing techniques to convert the asynchronous event stream into a structured representation [7,11,12] suitable for feature learning. Existing event processing methods—such as event frames [11,13,14], time surfaces [12,15], and voxel grids [16,17,18,19]—each exhibit trade-offs in computational efficiency, temporal resolution, and noise robustness, especially in the context of small object detection.
While RGB and event cameras each have unique strengths, their complementary nature suggests that fusing information from both modalities could significantly enhance detection performance, especially for small moving objects under adverse lighting conditions. Recent efforts [9,10,20] in RGB-Event fusion have shown promise for tasks such as object detection in autonomous driving and all-day surveillance. However, most existing RGB-Event fusion methods rely on fixed, predefined strategies that struggle to generalize across diverse scenes or adapt to varying feature distributions in multimodal input data. A typical approach, as illustrated in Figure 1a, treats RGB as the dominant modality with event data serving as auxiliary or guiding information [21,22,23]. While this asymmetric fusion strategy is straightforward and can enhance RGB-based performance by incorporating event data, it fails to balance the importance of both modalities. In scenarios where event data may carry more critical information—such as under extreme lighting or fast motion—this approach underutilizes the complementary strengths of the two modalities. Alternatively, as shown in Figure 1b, some methods employ a single, symmetric fusion module [9,10,20] to integrate RGB and event features, which is an approach that is also common in other multimodal domains such as RGB-Depth [4] or RGB-Thermal [24,25] fusion. While this symmetric fusion encourages a more balanced feature integration, it typically employs a fixed fusion pattern learned during training, inherently limiting its adaptability to diverse multimodal inputs. Consequently, such methods lack the flexibility and robustness required in complex or dynamic environments. Furthermore, the absence of publicly available datasets specifically designed for small moving object detection with RGB-Event modalities continues to impede progress in this domain.
To address these challenges, we propose a novel framework for small moving object detection that leverages the complementary strengths of both RGB and event cameras through adaptive fusion mechanisms. Our approach introduces two key innovations: (1) a Temporal Multi-Scale Attention Fusion (TMAF) module for processing event streams, which captures motion information across multiple temporal scales to enhance the saliency of small moving objects while suppressing noise; and (2) a Sparse Noisy Gated Attention Fusion (SNGAF) module for fusing RGB and event features, which dynamically selects and combines specialized Cross-Modal Attention Fusion Expert (CMAFE) modules based on the input data characteristics, enabling flexible and adaptive multimodal integration, as illustrated in Figure 1c. In addition, we introduce RGBE-UAV, a new dataset specifically designed for small moving object detection under challenging lighting conditions. Collected using a DAVIS346 event camera, RGBE-UAV provides synchronized RGB frames and event streams across diverse environments and exposure settings, filling a critical gap in existing RGB-Event benchmarks.
Our main contributions are summarized as follows:
  • We propose the TMAF module, which effectively fuses event streams at multiple temporal scales to enhance the feature saliency of small moving objects while mitigating noise interference;
  • We introduce the SNGAF module, a sparse-gated fusion mechanism inspired by mixture-of-experts [26] (MoE) models, which adaptively integrates RGB and event features to accommodate dynamic input characteristics;
  • We present RGBE-UAV, the first publicly available dataset tailored for small moving object detection using RGB-Event modalities, featuring a wide range of lighting conditions and environments;
  • We achieve state-of-the-art (SOTA) detection performance on both the RGBE-UAV and DSEC-MOD [9] datasets, validating the effectiveness of our approach through extensive quantitative and qualitative evaluations.

2. Related Work

2.1. Event Processing

Event-based cameras are bio-inspired imaging sensors [7] that differ fundamentally from conventional frame-based cameras. Instead of capturing images at fixed intervals, they asynchronously record pixel-level brightness changes, enabling high temporal resolution, high dynamic range, and low data redundancy. Owing to these advantages, event cameras have drawn significant attention from the computer vision community in recent years, particularly for applications involving fast motion [8] or challenging lighting conditions [10]. An event is triggered at a pixel location when the logarithmic brightness change exceeds a predefined threshold $\Delta C$. Each event is represented as $e_k = (x_k, y_k, t_k, p_k)$, where $(x_k, y_k)$ denotes the pixel coordinates, $t_k$ is the timestamp, and $p_k \in \{+1, -1\}$ indicates the polarity (positive for brightness increase, negative for decrease).
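To make the event model above concrete, the following minimal Python sketch shows how an event tuple is formed when the log-brightness change at a pixel exceeds the contrast threshold $\Delta C$; the `Event` class, the `maybe_trigger` helper, and the example threshold value are our own illustrative assumptions, not the sensor's firmware.

```python
from dataclasses import dataclass

@dataclass
class Event:
    x: int      # pixel column
    y: int      # pixel row
    t: float    # timestamp (microseconds for a DVS such as the DAVIS346)
    p: int      # polarity: +1 for brightness increase, -1 for decrease

def maybe_trigger(log_prev: float, log_now: float,
                  x: int, y: int, t: float, delta_c: float = 0.2):
    """Return an Event if the log-brightness change exceeds the threshold, else None."""
    diff = log_now - log_prev
    if abs(diff) >= delta_c:
        return Event(x=x, y=y, t=t, p=1 if diff > 0 else -1)
    return None
```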
The raw event stream captured by event cameras must be processed before being fed into conventional convolutional neural networks (CNN) for feature learning. The choice of event representation plays a crucial role in fully exploiting the advantages of event cameras [7]. Currently, several mainstream methods have been proposed to convert event streams into structured formats, including event frame-based [11,13,14], time surface-based [12,15], and voxel grid-based representations [16,18,19]. Event frame-based methods aggregate events over fixed temporal windows to form image-like frames, enabling compatibility with standard CNN architectures. While straightforward, this approach’s reliance on predefined intervals limits its adaptability across diverse scenes. Time surface-based methods retain only the most recent event timestamp per pixel, preserving temporal dynamics but suffering from significant information loss and noise sensitivity, rendering them suboptimal for small object detection. Voxel grid-based methods partition the event stream into temporal bins to construct spatiotemporal representations, balancing temporal resolution and spatial structure. While effective, their performance depends heavily on discretization granularity, and their computational overhead—due to rich spatiotemporal detail—poses challenges for real-time applications, such as detecting small moving objects. More recently, Transformer-based methods [27,28,29] have been proposed for event processing, where event streams are tokenized across spatial and temporal dimensions and modeled with self-attention to capture long-range dependencies. While effective in some vision tasks, these methods are computationally expensive and less suitable for small object detection, as small objects activate only sparse events and require the efficient modeling of local spatiotemporal cues. In this study, to balance computational efficiency with the preservation of critical spatiotemporal information, we propose a novel event processing method for small moving object detection. While our approach is also built upon frame-based methods, it leverages event streams at multiple temporal scales. Through the TMAF module, we effectively exploit the spatiotemporal information within these multi-scale event streams to enhance the feature saliency of small moving objects. This enables a simple yet efficient utilization of the spatiotemporal information inherent in event streams.
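As a point of reference for the frame-based representation this work builds on, the snippet below is a hedged sketch of accumulating an event stream over a temporal window into a two-channel polarity count image; the array layout and function name are illustrative assumptions, not the authors' code.

```python
import numpy as np

def events_to_frame(events: np.ndarray, t0: float, dt: float,
                    height: int, width: int) -> np.ndarray:
    """Accumulate events with timestamps in [t0, t0 + dt) into a (2, H, W) count image.

    events: (N, 4) array of (x, y, t, p) rows with polarity p in {+1, -1}.
    """
    frame = np.zeros((2, height, width), dtype=np.float32)
    mask = (events[:, 2] >= t0) & (events[:, 2] < t0 + dt)
    for x, y, _, p in events[mask]:
        channel = 0 if p > 0 else 1          # separate positive and negative polarities
        frame[channel, int(y), int(x)] += 1  # per-pixel event count
    return frame
```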

2.2. RGB-Event Fusion for Object Detection

RGB cameras excel at capturing fine-grained texture and color details for slow-moving objects, whereas event cameras, with their resistance to motion blur and high dynamic range, are well suited for fast motion and adverse lighting conditions. These complementary strengths have driven the integration of RGB and event modalities for tasks such as deblurring [30,31], semantic segmentation [32,33], and depth estimation [34]. Recently, some researchers have combined the information from both RGB and event modalities for object detection under challenging weather or lighting conditions. For instance, Tomy et al. [20] introduce a Feature Pyramid Network (FPN)-based fusion network that simply integrates RGB and event features for object detection in autonomous driving scenarios under diverse weather conditions, while Zhou et al. [9] propose a multi-scale, three-branch fusion network that models modality-specific representations and shared representations of RGB and event data, enabling robust moving object detection in autonomous driving scenarios under complex lighting conditions. This approach preserves modality-specific features via individual branches during the fusion process, reducing the loss of target information. Cao et al. [10] present a symmetric RGB-Event fusion module that enables balanced and adaptive feature fusion for all-day object detection without relying on any specific modality. Nevertheless, most existing RGB-Event fusion methods rely on a single, fixed fusion strategy learned during training, which limits their adaptability to input data with diverse or shifting feature distributions [35,36]. These methods typically perform fusion along a single characteristic dimension, failing to fully exploit the rich, multi-dimensional complementary information inherent in RGB and event modalities. Consequently, their generalization and robustness degrade in complex or dynamic environments. To address these limitations, we propose a sparse-gated RGB-Event fusion framework inspired by mixture-of-experts models [26]. Instead of using a single fusion module, our approach learns a set of diverse fusion experts and employs a sparse gating network to dynamically select and combine the most suitable fusion patterns based on the characteristics of the input data. This enables adaptive and fine-grained multimodal integration, allowing the model to better leverage complementary cues across modalities and enhancing detection performance for small moving objects under challenging lighting conditions.

2.3. RGB-Event Object Detection Dataset

In contrast to RGB cameras, which benefit from a wide range of large-scale datasets [37,38] for object detection and other computer vision tasks, publicly available RGB-Event multimodal datasets for object detection remain limited. Moreover, to the best of our knowledge, there is currently no dedicated RGB-Event dataset specifically designed for small moving object detection. Widely adopted RGB-Event datasets include DSEC [39] and MVSEC [40] along with several task-specific derivatives constructed from them. For instance, MVSEC-NIGHT [41] is derived from MVSEC by generating nighttime vehicle annotations with the help of a pre-trained YOLOv3 model. Similarly, DSEC-MOD [9] creates bounding boxes for moving objects from DSEC using a pre-trained YOLOv5. Both MVSEC-NIGHT and DSEC-MOD are designed for moving object detection in autonomous driving scenarios. The VisEvent [42] dataset is a large-scale RGB-Event benchmark for object tracking, featuring a wide variety of object categories. However, it contains relatively few small moving objects, as most targets belong to general object classes. Due to its low resolution ( 346 × 260 ), the authors later introduced EventVOT [43], which is a higher-resolution ( 1280 × 720 ) RGB-Event tracking dataset. More recently, a dataset named NeRDD [44] was introduced for drone detection in anti-UAV scenarios. However, in NeRDD, drones are captured at close range, resulting in large object sizes. This setup deviates from real-world anti-UAV applications, where drones are typically small and distant, and early detection is crucial. While these efforts have advanced RGB-Event dataset development for object detection, to the best of our knowledge, there is still no publicly available dataset specifically designed for small moving object detection using RGB-Event modalities. To fill this gap, we introduce a new RGB-Event dataset tailored for small moving object detection under varying exposure and lighting conditions. Our dataset aims to promote further research on RGB-Event fusion for small moving object detection by leveraging the complementary strengths of both modalities to enhance detection performance in complex lighting environments.

3. Method

3.1. Network Overview

Figure 2 illustrates the overall architecture of our proposed network, which consists of four key components. The first component is the TMAF module, which is designed to fuse event streams across multiple temporal scales and enhance object saliency in the spatial feature space. The second component is a pair of dual-stream feature extractors, which perform multi-scale feature extraction independently for the RGB and event modalities. The third component is the SNGAF module, which selectively and adaptively fuses multi-dimensional features from both the RGB and event modalities. Lastly, a detection head with an FPN is employed to generate the final bounding boxes for detecting small moving objects.
Specifically, the raw event stream is first processed by our proposed TMAF module to extract motion cues along the temporal dimension, thereby enhancing the spatial saliency of small moving object features. We then employ two symmetric, slightly modified ResNet-50 networks as dual-stream feature extractors to independently extract hierarchical, multi-scale semantic features from the RGB and event modalities. To mitigate the loss of small object features during deep feature extraction, we retain the lowest-level feature layers in the ResNet-50 backbone. Inspired by the MoE paradigm [26,35,36,45], we propose the SNGAF module to enable adaptive cross-modal fusion between RGB and event features. At each feature scale, RGB and event features are fed into a gating network, which dynamically selects a sparse subset of fusion experts based on the statistical characteristics of both modalities. These selected experts are then used to integrate the input features in a data-dependent manner. This selective fusion mechanism enables adaptive and fine-grained multimodal feature integration, allowing the model to more effectively exploit the rich, complementary information inherent in RGB and event modalities. As a result, the proposed method offers greater flexibility in feature fusion and significantly enhances the model’s ability to detect small moving objects under challenging lighting conditions. Finally, the fused features are passed to a detection head equipped with an FPN, following the design of RetinaNet [46], to generate the final detection results.
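The data flow described above can be summarized by the following PyTorch-style sketch; every class and argument name here is a hypothetical placeholder used only to illustrate the four-stage pipeline of Figure 2, not the authors' released implementation.

```python
import torch.nn as nn

class RGBEventDetector(nn.Module):
    """Illustrative wiring of the four components; every sub-module is a placeholder."""

    def __init__(self, tmaf, rgb_backbone, event_backbone, sngaf_modules, head):
        super().__init__()
        self.tmaf = tmaf                              # temporal multi-scale event fusion
        self.rgb_backbone = rgb_backbone              # modified ResNet-50 (RGB stream)
        self.event_backbone = event_backbone          # modified ResNet-50 (event stream)
        self.sngaf = nn.ModuleList(sngaf_modules)     # one SNGAF module per feature scale
        self.head = head                              # FPN + RetinaNet-style detection head

    def forward(self, rgb, event_frames):
        e = self.tmaf(*event_frames)                  # fuse the three temporal scales
        rgb_feats = self.rgb_backbone(rgb)            # list of multi-scale RGB features
        evt_feats = self.event_backbone(e)            # list of multi-scale event features
        fused = [g(fr, fe) for g, fr, fe in zip(self.sngaf, rgb_feats, evt_feats)]
        return self.head(fused)                       # bounding boxes and scores
```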

3.2. Temporal Multi-Scale Attention Fusion Module for Event Processing

In an ideal scenario, when the event camera remains stationary, only the small moving objects generate sparse event streams. However, in real-world situations, the event stream generated by the event camera not only contains events from moving objects but also includes noise events, such as those caused by illumination disturbances under complex lighting conditions, events from background distractors, and sensor internal thermal noise. These noise events complicate the effective exploitation of event stream data. As observed in Figure 2, the small moving objects exhibit continuous and correlated motion within the event stream. The number of events accumulated by the small moving object in space varies across different temporal scales. Generally, selecting a larger temporal scale accumulates more object events, making the object features more prominent in the spatial domain. However, as the temporal scale increases, more noise events are accumulated, potentially interfering with the recognition of object events. Furthermore, events accumulated over a longer temporal scale become much more blurred, which hinders accurate object localization. Conversely, if the temporal scale is too small, the accumulated events may be insufficient to effectively detect the object, especially when the object moves slowly. Therefore, to accommodate various application scenarios, it is essential to utilize event streams from multiple temporal scales. Motivated by this observation, we propose the TMAF module, which models motion information from multi-scale event streams in the temporal domain. This module enhances the saliency of small moving objects in the spatial domain while suppressing noise interference as much as possible. Unlike conventional methods that use event frames with fixed temporal intervals, our approach effectively balances the enhancement of small moving object features and noise suppression, offering better adaptability across different scenarios.
As illustrated in Figure 2, we take event streams at three temporal scales ($\Delta T_1$, $\Delta T_2$, and $\Delta T_3$) as input and convert them into corresponding event frames. To enhance the representation of small moving objects across these temporal scales, we apply convolutions with different kernel sizes to each event frame, generating three sets of event feature maps. The kernel sizes are selected based on the event density at each temporal scale to capture motion information at varying granularities. These feature maps are then concatenated along the channel dimension, and a channel attention mechanism is applied to adaptively fuse the multi-scale temporal features. Formally, this process is defined as
$E_{cat} = \left[\, f_{1\times 1}(E_1),\; f_{3\times 3}(E_2),\; f_{5\times 5}(E_3) \,\right],$
$Att_c = \sigma\!\left( \mathrm{MLP}\!\left(\mathrm{GMP}_s(E_{cat})\right) + \mathrm{MLP}\!\left(\mathrm{GAP}_s(E_{cat})\right) \right),$
where $E_1$, $E_2$, and $E_3$ denote the event frames corresponding to the temporal multi-scale event streams, $f_{n\times n}$ represents the convolution operation with a kernel size of $n$, $[\cdot\,,\cdot]$ denotes the concatenation operation along the channel dimension, and $E_{cat}$ is the concatenated feature map along the channel dimension. $\mathrm{GMP}_s$ and $\mathrm{GAP}_s$ represent global max pooling and global average pooling operations along the spatial dimension, respectively. The pooled vectors are passed through a shared multilayer perceptron (MLP), summed element-wise, and activated by the sigmoid function $\sigma$ to obtain the final channel attention map $Att_c$. This channel attention $Att_c$ represents the contribution of the event feature maps at different temporal scales to the saliency of the moving object features. Finally, the channel attention map $Att_c$ is applied to the concatenated event features $E_{cat}$ via element-wise multiplication to obtain the weighted temporal multi-scale event features. A subsequent convolution operation is then applied to fuse the information along the channel dimension, resulting in the final fused event representation $E$:
$E = f_{3\times 3}\!\left( Att_c \odot E_{cat} \right),$
where ⊙ denotes element-wise multiplication. This convolution operation both integrates the attention-weighted temporal multi-scale event features along the temporal dimension and ensures that the generated event representation matches the input format of typical backbone networks, facilitating subsequent spatial multi-scale semantic feature extraction.
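A hedged PyTorch sketch of the TMAF computation described by the equations above is given below; the channel counts, the single-channel event-frame input, and the MLP reduction ratio are assumptions made for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TMAF(nn.Module):
    def __init__(self, out_channels: int = 3, mid: int = 16):
        super().__init__()
        # per-scale convolutions with increasing kernel sizes (E_cat equation)
        self.conv1 = nn.Conv2d(1, mid, kernel_size=1)
        self.conv3 = nn.Conv2d(1, mid, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(1, mid, kernel_size=5, padding=2)
        # shared MLP for channel attention (Att_c equation)
        self.mlp = nn.Sequential(
            nn.Linear(3 * mid, 3 * mid // 4),
            nn.ReLU(inplace=True),
            nn.Linear(3 * mid // 4, 3 * mid),
        )
        # final 3x3 fusion convolution producing a backbone-compatible representation E
        self.fuse = nn.Conv2d(3 * mid, out_channels, kernel_size=3, padding=1)

    def forward(self, e1, e2, e3):
        # e1, e2, e3: event frames at temporal scales dT1 < dT2 < dT3, shape (B, 1, H, W)
        cat = torch.cat([self.conv1(e1), self.conv3(e2), self.conv5(e3)], dim=1)
        gmp = torch.amax(cat, dim=(2, 3))          # global max pooling over space
        gap = torch.mean(cat, dim=(2, 3))          # global average pooling over space
        att = torch.sigmoid(self.mlp(gmp) + self.mlp(gap)).unsqueeze(-1).unsqueeze(-1)
        return self.fuse(att * cat)                # channel-weighted fusion of the scales
```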

3.3. Sparse Noisy Gated Attention Fusion Module for RGB-Event Fusion

To enable the model to adapt to the diverse feature distributions of multimodal input data, we propose a more flexible and adaptive RGB-Event multimodal fusion approach. Inspired by the paradigm of MoE models, which dynamically select the most suitable experts based on the characteristics of the input data, we design the SNGAF module. This module integrates multiple fusion strategies to achieve discriminative and flexible multi-dimensional feature fusion, enabling an adaptive combination of RGB and event modalities. This design improves generalization and robustness across varied multimodal data distributions. As illustrated in Figure 2, each SNGAF module is composed of a Gating Network, which adaptively controls the selection of experts, and a set of four CMAFE modules, which perform fine-grained cross-modal feature interaction and fusion.
CMAFE Module. For simplicity, we take the RGB-Event fusion at a specific semantic level within the hierarchical multi-scale features as an example. Formally, let $R$ and $E$ denote the RGB and event inputs to their respective backbone networks. The features extracted by the individual backbones are represented as $F_r \in \mathbb{R}^{C \times pH \times pW}$ and $F_e \in \mathbb{R}^{C \times pH \times pW}$, where $C$ denotes the number of channels, and $pH$ and $pW$ are the spatial dimensions of the feature maps. Each CMAFE module performs a complementary fusion of RGB and event features via symmetric intra-modal attention enhancement and cross-modal attention interaction. Specifically, the CMAFE module first computes the spatial attention maps for the RGB and event features. For either RGB or event features, the feature map is first processed by global max pooling and global average pooling along the channel axis. The resulting pooled features are then concatenated along the channel dimension. A $3 \times 3$ convolution is subsequently applied to fuse the max-pooled and average-pooled features, and a sigmoid activation is used to generate the corresponding spatial attention map. This process can be formulated as
$F_r = E_r(R), \qquad F_e = E_e(E),$
$Att_r = \sigma\!\left( f_r^{3\times 3}\!\left( \left[\, \mathrm{GMP}_c(F_r),\; \mathrm{GAP}_c(F_r) \,\right] \right) \right),$
$Att_e = \sigma\!\left( f_e^{3\times 3}\!\left( \left[\, \mathrm{GMP}_c(F_e),\; \mathrm{GAP}_c(F_e) \,\right] \right) \right),$
where $\mathrm{GMP}_c$ and $\mathrm{GAP}_c$ denote the global max pooling and global average pooling along the channel axis, respectively; $f_r^{3\times 3}$ and $f_e^{3\times 3}$ represent $3 \times 3$ convolutional operations for RGB features and event features, respectively. $Att_r$ and $Att_e$ denote the spatial attention maps associated with the RGB features and event features, respectively. Next, the obtained spatial attention maps for the RGB and event features are used to perform intra-modal enhancement and cross-modal interaction. The intra-modal enhancement aims to preserve and strengthen modality-specific feature representations within each individual modality, while the cross-modal interaction facilitates the complementary fusion of information between the two modalities. The process of applying intra-modal enhancement and cross-modal interaction to the RGB and event features is formulated as follows:
$F_{att}^{r} = F_r \odot Att_e + F_r \odot Att_r,$
$F_{att}^{e} = F_e \odot Att_r + F_e \odot Att_e,$
$\Phi(F_r, F_e) = \left[\, F_{att}^{r},\; F_{att}^{e} \,\right],$
where $\Phi(F_r, F_e)$ represents the fusion result of the RGB-Event dual-modal features by a single CMAFE module.
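The following PyTorch sketch mirrors the CMAFE equations above under stated assumptions (element-wise multiplication between features and spatial attention maps, channel-wise pooling via `torch.amax`/`torch.mean`); it is an illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CMAFE(nn.Module):
    """One Cross-Modal Attention Fusion Expert (sketch)."""

    def __init__(self):
        super().__init__()
        self.conv_r = nn.Conv2d(2, 1, kernel_size=3, padding=1)  # f_r^{3x3}
        self.conv_e = nn.Conv2d(2, 1, kernel_size=3, padding=1)  # f_e^{3x3}

    @staticmethod
    def _spatial_attention(feat, conv):
        gmp = torch.amax(feat, dim=1, keepdim=True)   # channel-wise global max pooling
        gap = torch.mean(feat, dim=1, keepdim=True)   # channel-wise global average pooling
        return torch.sigmoid(conv(torch.cat([gmp, gap], dim=1)))

    def forward(self, f_r, f_e):
        att_r = self._spatial_attention(f_r, self.conv_r)
        att_e = self._spatial_attention(f_e, self.conv_e)
        f_att_r = f_r * att_e + f_r * att_r           # cross-modal + intra-modal (RGB side)
        f_att_e = f_e * att_r + f_e * att_e           # cross-modal + intra-modal (event side)
        return f_att_r, f_att_e
```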
Gating Network. The Gating Network adaptively selects a subset of the four CMAFEs (CMAFE 1, CMAFE 2, CMAFE 3, CMAFE 4) based on the intrinsic distribution characteristics and complementary information of the input RGB and event data. Following a common practice in MoE-based multimodal fusion [35,36], we select the top-2 experts with the highest gating weights out of the four candidates to perform adaptive feature fusion. This design effectively balances model performance and computational efficiency. The Gating Network first concatenates $F_r$ and $F_e$ to form a unified feature representation, enabling subsequent interaction between the RGB and event modalities. The concatenated features are then processed by global max pooling and global average pooling along the spatial dimension, each producing a feature map of size $1 \times 1$. These two pooled features are summed to obtain the RGB-Event interaction feature $F_g \in \mathbb{R}^{C \times 1 \times 1}$. This process can be formulated as
$F_g = \mathrm{GMP}_s\!\left( \left[\, F_r,\; F_e \,\right] \right) + \mathrm{GAP}_s\!\left( \left[\, F_r,\; F_e \,\right] \right).$
The interaction feature $F_g$ is then fed into the gating function, which generates a gating weight for each CMAFE module based on $F_g$. These gating weights are used to select the most suitable combination of fusion experts to adaptively fuse the input RGB and event features. The gating function is defined as follows:
$G_{topk}(x) = \mathrm{TopK}\!\left( x \cdot W_g + N(0,1) \cdot \mathrm{Softplus}\!\left( x \cdot W_{noise} \right) \right),$
$G(x) = \mathrm{Softmax}\!\left( G_{topk}(x) \right),$
where $W_g$ denotes the learnable weights that transform the input to the gating function, and the transformed output represents the gating weights assigned to each CMAFE module. $N(0,1)$ denotes standard Gaussian noise with zero mean. The term $\mathrm{Softplus}(x \cdot W_{noise})$, where $W_{noise}$ is also a learnable parameter, dynamically adjusts the standard deviation of the Gaussian noise based on the input, thereby controlling the noise strength. Injecting Gaussian noise into the gating weights facilitates smoother and more diverse expert selection, encouraging a broader set of fusion experts to be activated. This design promotes better load balancing and enhances the model's exploratory behavior and robustness. The $\mathrm{TopK}(\cdot)$ operation retains the top $K$ (where $K = 2$) gating weights and sets the remaining gating weights to $-\infty$. After applying the $\mathrm{Softmax}(\cdot)$, these remaining weights become zero, which indicates that the corresponding CMAFE modules are not activated, i.e., their gates remain closed. $G_{topk}(x)$ represents the gating weights of the top $K$ CMAFE modules selected by the gating function. These weights are then passed through the Softmax function, transforming them into a standard probability distribution, which is subsequently used to perform a weighted sum of the outputs from the selected CMAFE modules:
$F_{fusion} = \sum_{i=1}^{N} G(F_g)_i \cdot \Phi_i(F_r, F_e),$
where $N$ denotes the total number of CMAFE modules, $G(F_g)_i$ represents the probability weight for the $i$-th CMAFE module output by the gating network, and $\Phi_i(F_r, F_e)$ denotes the output of the $i$-th CMAFE module after fusing the RGB and event features. The final fused result $F_{fusion}$ is then passed to the detection head with FPN for generating the final detection outputs.
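To illustrate the sparse noisy top-$K$ gating described above, here is a hedged PyTorch sketch that reuses the `CMAFE` class from the previous snippet; for clarity it evaluates all experts and zero-weights the unselected ones, and the way the two expert output tensors are merged (channel concatenation) is an assumption rather than the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SNGAF(nn.Module):
    def __init__(self, channels: int, num_experts: int = 4, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(CMAFE() for _ in range(num_experts))
        self.w_gate = nn.Linear(2 * channels, num_experts, bias=False)   # W_g
        self.w_noise = nn.Linear(2 * channels, num_experts, bias=False)  # W_noise
        self.k = k

    def forward(self, f_r, f_e):
        cat = torch.cat([f_r, f_e], dim=1)                               # [F_r, F_e]
        f_g = torch.amax(cat, dim=(2, 3)) + torch.mean(cat, dim=(2, 3))  # GMP_s + GAP_s
        logits = self.w_gate(f_g)
        if self.training:                                  # noise injected only during training
            logits = logits + torch.randn_like(logits) * F.softplus(self.w_noise(f_g))
        top_vals, top_idx = logits.topk(self.k, dim=1)     # keep the top-K gating weights
        sparse = torch.full_like(logits, float('-inf')).scatter(1, top_idx, top_vals)
        gates = F.softmax(sparse, dim=1)                   # unselected experts get weight 0
        fused = torch.zeros_like(cat)
        for i, expert in enumerate(self.experts):
            out_r, out_e = expert(f_r, f_e)
            fused = fused + gates[:, i, None, None, None] * torch.cat([out_r, out_e], dim=1)
        return fused
```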

4. Experiments

4.1. Dataset

Existing RGB-Event datasets for moving object detection are either designed for general-purpose targets or lack diverse exposure conditions. To address these limitations, we introduce RGBE-UAV, a new dataset specifically tailored for small moving object detection under challenging lighting conditions. The dataset is collected using a DAVIS346 event camera, which synchronously captures RGB frames and event streams at a resolution of $346 \times 260$. Both modalities are co-registered and spatially aligned at the sensor level through dedicated internal circuitry.
RGB images are captured at a fixed frame rate of 25 Hz, while the event stream is generated asynchronously with microsecond-level resolution due to the event camera’s high temporal sensitivity. A variety of UAVs with different sizes and types are employed as small moving objects, and each UAV remains in continuous motion within a sequence, introducing diverse variations in object appearance. The RGBE-UAV dataset covers three exposure scenarios—overexposure, normal exposure, and underexposure. For each exposure scenario, multiple video sequences were initially captured as candidates. Based on the quality of the captured images and the complementarity between the RGB and event streams, a subset of these sequences was selected for training and evaluation. In the normal exposure scenario, five sequences were chosen with three assigned to the training set and two to the testing set. For the overexposure and underexposure scenarios, four sequences were selected for each with three used for training and one for testing. This partitioning ensures that both the training and testing sets contain data from all three exposure conditions. Across all exposure scenarios, the ratio of training to testing frames is approximately 3:1, and the event streams were partitioned correspondingly. In total, the dataset consists of 13 video sequences, including 9 for training and 4 for testing, with 12,562 annotated objects, the majority of which are smaller than 10 × 10 pixels.
The dataset spans a wide range of lighting conditions—including normal, overexposed, and underexposed settings—across diverse environments such as urban areas, forests, rivers, and mountainous terrain. Although the diversity and complexity of illumination and scene settings impose additional challenges for small object detection, these factors more faithfully reflect the characteristics of real-world UAV detection scenarios. Figure 3 provides an overview of the dataset, including statistical analyses and representative examples. Figure 3a presents a comparative analysis of object area distributions between our RGBE-UAV dataset and established RGB-Event benchmarks (DSEC-MOD and NeRDD). The histogram shows that object areas in our dataset are substantially smaller than those in other datasets, highlighting the significantly increased difficulty of detecting small moving objects. Figure 3b quantifies the distribution of exposure conditions within RGBE-UAV, demonstrating a balanced representation across normal, overexposed, and underexposed categories. Finally, Figure 3c illustrates representative RGB-Event frame pairs under different exposure conditions. Specifically, the left column presents RGB images captured with overexposure, normal exposure, and underexposure, while the right column displays the corresponding event frames aligned with these RGB images, highlighting both the lighting diversity and the complementary nature of the two modalities.
Furthermore, we provide a comparative summary of our RGBE-UAV dataset against several existing RGB-Event object detection datasets in terms of average object scale, multi-scene coverage, multi-target capability, and exposure diversity, as shown in Table 1. From the results, it can be observed that RGBE-UAV contains the smallest average object scale among all datasets and offers relatively comprehensive coverage across multi-scene, multi-target, and multi-exposure conditions.

4.2. Experiment Settings

Implementation Details: We employ ResNet-50 as the backbone network for multi-scale feature extraction. To better preserve small object features that may be lost due to aggressive downsampling, we extract features from the first three residual blocks, deviating from the standard practice of starting from the second block. This strategy preserves crucial low-level details captured in the early stages of the backbone. The TMAF module takes event streams at three temporal scales ($\Delta T_1 = 15$ ms, $\Delta T_2 = 30$ ms, and $\Delta T_3 = 40$ ms) as input. We select small, medium, and large scales to ensure that small objects moving at different speeds can be clearly captured within appropriate temporal windows. The rationale and detailed selection process are described in Section 4.4.1. Simple data augmentation techniques, including horizontal and vertical flipping, are applied during training. We train the model using the Adam optimizer with an initial learning rate of $1 \times 10^{-4}$, which is dynamically adjusted via the ReduceLROnPlateau scheduler. The model is trained for 50 epochs with a batch size of 8. All experiments are conducted on an NVIDIA GeForce RTX 4090 GPU using the PyTorch 2.1.2 framework. Following [46], we adopt focal loss for classification and smooth L1 loss for bounding box regression. Given the sparsity and small scale of moving objects in our dataset, focal loss effectively mitigates class imbalance and enhances robustness against hard examples.
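For concreteness, the snippet below sketches the reported training setup (Adam with an initial learning rate of 1e-4, ReduceLROnPlateau, 50 epochs, batch size 8, focal loss plus smooth L1 loss); the model and dataloader are placeholders, and the scheduler's factor/patience values and the equal loss weighting are assumptions not specified in the paper.

```python
import torch.nn.functional as F
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torchvision.ops import sigmoid_focal_loss

model = ...          # hypothetical RGB-Event detector returning per-anchor logits and targets
train_loader = ...   # hypothetical RGBE-UAV dataloader with batch size 8

optimizer = Adam(model.parameters(), lr=1e-4)
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=5)  # assumed values

for epoch in range(50):
    epoch_loss = 0.0
    for rgb, event_frames, targets in train_loader:
        cls_logits, box_preds, cls_targets, box_targets = model(rgb, event_frames, targets)
        cls_loss = sigmoid_focal_loss(cls_logits, cls_targets, reduction="mean")
        box_loss = F.smooth_l1_loss(box_preds, box_targets)
        loss = cls_loss + box_loss                  # equal weighting assumed
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    scheduler.step(epoch_loss)                      # reduce the LR when the loss plateaus
```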
Evaluation Metrics: We evaluate performance using the standard mean average precision (mAP) metric. Specifically, mAP50 and mAP75 denote the mean average precision calculated at Intersection-over-Union (IoU) thresholds of 0.50 and 0.75, respectively. In addition, we also report the number of parameters (Params) and floating-point operations (FLOPs) to assess the model's efficiency.
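As a reminder of how these thresholds are applied, the small helper below computes the IoU between two axis-aligned boxes in [x1, y1, x2, y2] format; a prediction matched to a ground-truth box counts toward mAP50 when IoU >= 0.50 and toward mAP75 when IoU >= 0.75 (the function name is ours).

```python
def iou(box_a, box_b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# Example: IoU of about 0.68, a true positive at the 0.50 threshold but not at 0.75.
print(iou([10, 10, 20, 20], [11, 11, 21, 21]))
```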

4.3. Main Results

4.3.1. Quantitative Comparison

To isolate and fairly evaluate the effectiveness of our proposed RGB-Event fusion method, we replace our fusion module with several recent SOTA RGB-Event fusion modules [9,10,20] while keeping all other components fixed, including event stream processing, backbone network, object detector, and training configurations, so that the fusion module is the only variable. Additionally, we conduct experiments using several classical SOTA object detection methods [46,47,48,49] on RGB-only and Event-only modalities to assess detection performance on each modality individually. For the Event-only modality, we employ a traditional fixed temporal scale accumulation method to convert events into frames, which are then treated as standard images for these experiments.
As shown in Table 2, under challenging lighting conditions, RGB-Event fusion methods generally outperform unimodal approaches based solely on RGB or event data in detecting small moving objects. This performance gain is primarily attributed to the event modality’s robustness in encoding motion cues, which effectively complements the RGB modality when appearance features are degraded. Furthermore, our method achieves SOTA detection performance among existing RGB-Event fusion approaches, demonstrating the effectiveness of the proposed sparse-gated fusion mechanism. Moreover, our method maintains a relatively lower parameter count and computational complexity compared to RENet and EOLO. This reduction may be attributed to the sparse-gated fusion mechanism, which adaptively combines lightweight expert modules instead of relying on a single, computationally heavy fusion network. Notably, the results in Table 2 reveal that the discrepancy between mAP50 and mAP75 is generally larger for RGB-Event fusion methods than for RGB-only methods. We posit that this phenomenon may stem from the event modality’s relatively lower precision in capturing object morphology, whereas the RGB modality excels at preserving target shape. Specifically, the events accumulated from object motion in the spatial domain may not precisely align with the object’s geometric contours, thereby limiting the precision of bounding box regression. Nevertheless, the event modality still provides effective localization gains for small moving objects. To evaluate the adaptability of our method to multi-scale object detection, we conduct additional experiments on the DSEC-MOD dataset. As shown in Table 2, our approach maintains leading performance on this RGB-Event dataset encompassing objects of diverse scales, demonstrating robust generalization capability across multi-scale moving object detection scenarios.
In addition, to more concretely evaluate the performance of our RGB-Event fusion method under different exposure conditions, we further report separate results for overexposure, normal exposure, and underexposure scenarios, as summarized in Table 3. Our method consistently outperforms other RGB-Event fusion approaches across all three scenarios, achieving the best detection performance in each case. Moreover, Table 3 shows that all methods perform worst under overexposure, followed by underexposure, while achieving the highest accuracy under normal exposure. This trend highlights the inherent challenges of detecting small objects in complex exposure conditions, particularly in overexposed and underexposed environments.

4.3.2. Qualitative Comparison

Figure 4 presents a qualitative comparison between our proposed RGB-Event fusion method and other SOTA RGB-Event fusion approaches. We visualize the detection results of different fusion methods for small moving objects under various challenging lighting scenarios, demonstrating the superiority of our proposed approach. In the visualizations, ground-truth bounding boxes are denoted by green rectangles, while predicted detection boxes are indicated by red rectangles annotated with the text “target”. Regions containing small moving objects are highlighted with blue dashed arrows pointing to the corresponding zoomed-in insets. Notably, the zoomed-in regions in Figure 4 reveal that under challenging lighting conditions such as overexposure or underexposure, existing methods produce a higher number of both false negatives (FN) and false positives (FP). In contrast, our proposed method consistently yields more accurate and robust detections in these challenging scenarios.
In addition, we provide a qualitative visualization of the input and output of the proposed TMAF module under different exposure conditions. As illustrated in Figure 5, the first three columns correspond to the event frames accumulated at three temporal scales ($\Delta T_1$, $\Delta T_2$, and $\Delta T_3$), while the last column shows the feature maps generated by the TMAF module after multi-scale temporal fusion. From top to bottom, the rows correspond to overexposure, underexposure, and normal exposure scenarios, respectively. It can be clearly observed that the fused representations produced by TMAF highlight the presence of small moving objects more prominently compared to the raw event frames. In particular, the module effectively aggregates complementary motion cues from different temporal scales while suppressing noise, thereby enhancing the saliency of small objects across diverse lighting conditions. This qualitative evidence further demonstrates the effectiveness of our TMAF module in learning discriminative features for small object detection.

4.4. Ablation Studies

4.4.1. Temporal Scale Selection Analysis

We conducted an ablation study to analyze the selection of the three temporal scales ($\Delta T_1$, $\Delta T_2$, $\Delta T_3$) for the event streams input to the TMAF module, aiming to identify the optimal combination. As described in Section 3.2, objects moving at different speeds accumulate varying numbers of events within a fixed temporal scale. Fast-moving small objects can accumulate sufficient events for detection within a short temporal scale. In contrast, slow-moving small objects require a longer temporal scale to gather adequate events for reliable detection; however, under such extended scales, fast-moving objects may produce blurred event accumulations and increased noise, leading to inaccurate localization. To address this, we select a small, a medium, and a large temporal scale, and adaptively fuse the event streams from these scales to better accommodate small objects across varying motion speeds. From an engineering perspective, employing more densely sampled scales offers limited performance gains while increasing computational overhead; thus, using three distinct scales represents a balanced choice. Given the impracticality of exhaustively evaluating all possible combinations, we select representative temporal scale sets ($\Delta T_1$, $\Delta T_2$, $\Delta T_3$) to investigate their impact on small object detection performance on the RGBE-UAV dataset, thereby determining the most suitable configuration for this dataset. The comparative results for different temporal scale combinations are presented in Table 4. Based on these results, the combination of $\Delta T_1 = 15$ ms, $\Delta T_2 = 30$ ms, and $\Delta T_3 = 40$ ms yields the best performance, making it the optimal temporal scale combination for the RGBE-UAV dataset.

4.4.2. Module Contribution Analysis

We further conducted ablation studies on the RGBE-UAV dataset to assess the individual contribution of each proposed module. Specifically, we established a baseline model that shares a similar architecture with our full model, differing only in the event processing and RGB-Event fusion strategies. In the baseline, we adopted a widely used event processing method that accumulates events into frames using a predefined fixed temporal window. For the fusion component, we employed a simple concatenation followed by convolution for feature fusion. Based on this baseline, we incrementally integrated our proposed modules and conducted corresponding training and evaluation to analyze their effectiveness. We also examined the effect of removing Gaussian noise injection from the SNGAF module. The ablation study results of the proposed modules are presented in Table 5. From the results, it is evident that each of the proposed modules contributes positively and yields performance improvements. By incorporating our proposed TMAF module for processing the input event stream, replacing the traditional predefined fixed temporal scale event representation method, the baseline model achieved significant gains in detection performance. This indicates that the TMAF module effectively leverages the motion information in the temporal multi-scale event stream, enhancing the saliency of small moving object features. Furthermore, by replacing the standard concatenation-convolution fusion with our proposed SNGAF module for RGB-Event fusion, we observed a substantial performance boost, demonstrating the effectiveness of our method in fully exploiting both modality-specific and complementary information from RGB and event data. In addition, the full SNGAF module outperforms its variant without noise injection, which further validates that injecting Gaussian noise facilitates better load balancing and contributes to improved robustness.

4.4.3. Expert Configuration Impact

Additionally, we conducted ablation studies to investigate the impact of using a single CMAFE module versus two fixed parallel CMAFE modules on the overall performance of the RGB-Event fusion model. The results of these ablation experiments are presented in Table 6. In this experiment, all settings remained identical except for the design of the RGB-Event fusion module. Specifically, 1CMAFE refers to the configuration where only a single CMAFE module is used for fusing the RGB and event features, while 2CMAFE refers to the use of two fixed, parallel CMAFE modules for fusion. Consistent with other fusion methods [35,36,45] inspired by MoE models, we do not explore configurations involving more experts, as such setups generally lead to increased complexity without consistent performance gains. As shown in the ablation results, our sparse-gated fusion scheme continues to exhibit the best performance, demonstrating the effectiveness of our proposed design.

4.4.4. Modality Complementarity Validation

Finally, we conducted ablation studies to investigate whether dual-modality input leads to performance improvements over single-modality input in our model. As shown in Table 7, when using a single modality, the input from the other modality is replaced with zero tensors, while all other experimental settings are kept identical. The results demonstrate that integrating both RGB and event inputs consistently leads to superior performance compared to using either modality alone. This highlights the complementary nature of RGB and event modalities in detecting small moving objects under challenging lighting conditions, and it further validates the effectiveness of our model in fully leveraging dual-modality information for robust small object detection.

5. Discussion

Although our RGB-Event fusion framework improves small object detection under challenging lighting conditions, it still faces limitations in extremely difficult cases, as illustrated in Figure 6. In scenarios where small objects are heavily blended with the background and remain static or move very slowly, event sensors provide little signal, and RGB images may also fail due to weak contrast. Similarly, under extremely low-light conditions, both modalities struggle to capture reliable cues, leading to detection failure.
A promising direction is to incorporate additional sensing modalities, such as thermal infrared or hyperspectral imaging, which can provide complementary information when both RGB and event cues are insufficient. For instance, thermal cameras capture the target’s own infrared radiation, making them insensitive to visible-light illumination and effective in distinguishing targets such as drones from background clutter. Likewise, hyperspectral cameras can exploit differences in spectral reflectance between the target and background, which is particularly useful in highly cluttered scenes. Therefore, integrating these additional modalities could enable more robust small object detection in complex environments, representing a highly promising research direction.

6. Conclusions

In this paper, we proposed a novel RGB-Event fusion framework for small moving object detection under challenging lighting conditions. Unlike prior methods that rely on event representations at a single fixed temporal scale, we introduced the TMAF module to exploit event streams at multiple temporal resolutions. This allows for the effective capture of spatiotemporal cues and enhances the spatial saliency of small moving object features. Furthermore, we proposed the SNGAF module that enables the selective and adaptive integration of RGB and event modalities. By dynamically selecting fusion experts based on the properties of input data, our model can flexibly adapt to diverse multimodal feature distributions, thereby improving robustness in complex lighting scenarios. To address the lack of suitable datasets for small moving object detection in RGB-Event settings, we developed a new public dataset (namely RGBE-UAV) tailored for this task. Extensive experiments on both RGBE-UAV and existing benchmarks demonstrate that our method achieves superior performance compared to state-of-the-art approaches. Ablation studies further validate the effectiveness of each proposed component. Finally, we discussed the potential limitations of our proposed RGB-Event fusion detection framework and outlined promising directions for future research.

Author Contributions

Conceptualization, Y.S. and M.L.; methodology, Y.S.; software, Y.S.; validation, Y.S. and N.C.; formal analysis, Y.S., N.C. and S.H.; resources, M.L. and W.A.; data curation, Y.S., N.C., S.H. and Y.L.; writing—original draft preparation, Y.S.; writing—review and editing, Y.S., N.C. and Y.L.; visualization, Y.S.; supervision, N.C., M.L. and W.A.; project administration, M.L. and W.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors are grateful to the anonymous reviewers and the editor for their thoughtful and constructive suggestions, which have greatly improved this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RGB: Red–Green–Blue
UAV: Unmanned Aerial Vehicle
DVS: Dynamic Vision Sensor
DSEC: A Stereo Event Camera Dataset for Driving Scenarios
MOD: Moving Object Detection
CNN: Convolutional Neural Network
FPN: Feature Pyramid Network
MVSEC: Multi-Vehicle Stereo Event Camera dataset
MoE: Mixture-of-Experts
TMAF: Temporal Multi-Scale Attention Fusion
SNGAF: Sparse Noisy Gated Attention Fusion
MLP: Multilayer Perceptron
SOTA: State of the Art
mAP: Mean Average Precision
IoU: Intersection over Union
Params: Number of Parameters
FLOPs: Number of Floating-Point Operations
FP: False Positive
TP: True Positive
FN: False Negative

References

  1. Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards large-scale small object detection: Survey and benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488. [Google Scholar] [CrossRef]
  2. Wang, X.; Wang, A.; Yi, J.; Song, Y.; Chehri, A. Small object detection based on deep learning for remote sensing: A comprehensive review. Remote Sens. 2023, 15, 3265. [Google Scholar] [CrossRef]
  3. Rashed, H.; Ramzy, M.; Vaquero, V.; El Sallab, A.; Sistu, G.; Yogamani, S. Fusemodnet: Real-time camera and lidar based moving object detection for robust low-light autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar] [CrossRef]
  4. Wu, Z.; Gobichettipalayam, S.; Tamadazte, B.; Allibert, G.; Paudel, D.P.; Demonceaux, C. Robust rgb-d fusion for saliency detection. In Proceedings of the 2022 International Conference on 3D Vision (3DV), Prague, Czech Republic, 12–16 September 2022; pp. 403–413. [Google Scholar] [CrossRef]
  5. Zhen, W.; Scherer, S. Estimating the localizability in tunnel-like environments using LiDAR and UWB. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 4903–4908. [Google Scholar] [CrossRef]
  6. Chen, N.; Xiao, C.; Dai, Y.; He, S.; Li, M.; An, W. Event-based Tiny Object Detection: A Benchmark Dataset and Baseline. arXiv 2025, arXiv:2506.23575. [Google Scholar] [CrossRef]
  7. Gallego, G.; Delbrück, T.; Orchard, G.; Bartolozzi, C.; Taba, B.; Censi, A.; Leutenegger, S.; Davison, A.J.; Conradt, J.; Daniilidis, K.; et al. Event-Based Vision: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 154–180. [Google Scholar] [CrossRef] [PubMed]
  8. Rebecq, H.; Ranftl, R.; Koltun, V.; Scaramuzza, D. High speed and high dynamic range video with an event camera. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1964–1980. [Google Scholar] [CrossRef] [PubMed]
  9. Zhou, Z.; Wu, Z.; Boutteau, R.; Yang, F.; Demonceaux, C.; Ginhac, D. RGB-Event Fusion for Moving Object Detection in Autonomous Driving. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 7808–7815. [Google Scholar] [CrossRef]
10. Cao, J.; Zheng, X.; Lyu, Y.; Wang, J.; Xu, R.; Wang, L. Chasing Day and Night: Towards Robust and Efficient All-Day Object Detection Guided by an Event Camera. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 9026–9032.
11. Mondal, A.; Giraldo, J.H.; Bouwmans, T.; Chowdhury, A.S. Moving object detection for event-based vision using graph spectral clustering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 876–884.
12. Sironi, A.; Brambilla, M.; Bourdis, N.; Lagorce, X.; Benosman, R. HATS: Histograms of averaged time surfaces for robust event-based object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1731–1740.
13. Zhu, A.Z.; Yuan, L.; Chaney, K.; Daniilidis, K. EV-FlowNet: Self-supervised optical flow estimation for event-based cameras. arXiv 2018, arXiv:1802.06898.
14. Gehrig, D.; Rebecq, H.; Gallego, G.; Scaramuzza, D. EKLT: Asynchronous photometric feature tracking using events and frames. Int. J. Comput. Vis. 2020, 128, 601–618.
15. Manderscheid, J.; Sironi, A.; Bourdis, N.; Migliore, D.; Lepetit, V. Speed invariant time surface for learning to detect corner points with event-based cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 10245–10254.
16. Bardow, P.; Davison, A.J.; Leutenegger, S. Simultaneous optical flow and intensity estimation from an event camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 884–892.
17. Zhu, A.Z.; Yuan, L.; Chaney, K.; Daniilidis, K. Unsupervised Event-Based Optical Flow Using Motion Compensation. In Proceedings of the Computer Vision—ECCV 2018 Workshops, Munich, Germany, 8–14 September 2018; Leal-Taixé, L., Roth, S., Eds.; Springer: Cham, Switzerland, 2019; pp. 711–714.
18. Rebecq, H.; Ranftl, R.; Koltun, V.; Scaramuzza, D. Events-To-Video: Bringing Modern Computer Vision to Event Cameras. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3852–3861.
19. Zhu, A.Z.; Yuan, L.; Chaney, K.; Daniilidis, K. Unsupervised event-based learning of optical flow, depth, and egomotion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 989–997.
20. Tomy, A.; Paigwar, A.; Mann, K.S.; Renzaglia, A.; Laugier, C. Fusing event-based and RGB camera for robust object detection in adverse conditions. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 933–939.
21. Sun, L.; Sakaridis, C.; Liang, J.; Jiang, Q.; Yang, K.; Sun, P.; Ye, Y.; Wang, K.; Gool, L.V. Event-based fusion for motion deblurring with cross-modal attention. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 412–428.
22. Tulyakov, S.; Gehrig, D.; Georgoulis, S.; Erbach, J.; Gehrig, M.; Li, Y.; Scaramuzza, D. Time Lens: Event-based video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16155–16164.
23. Lin, G.; Han, J.; Cao, M.; Zhong, Z.; Zheng, Y. Event-guided frame interpolation and dynamic range expansion of single rolling shutter image. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 3078–3088.
24. Zhou, W.; Guo, Q.; Lei, J.; Yu, L.; Hwang, J.N. ECFFNet: Effective and consistent feature fusion network for RGB-T salient object detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 1224–1235.
25. Gao, W.; Liao, G.; Ma, S.; Li, G.; Liang, Y.; Lin, W. Unified information fusion network for multi-modal RGB-D and RGB-T salient object detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 2091–2106.
26. Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv 2017, arXiv:1701.06538.
27. Jiang, B.; Li, Z.; Asif, M.S.; Cao, X.; Ma, Z. Token-Based Spatiotemporal Representation of the Events. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 5240–5244.
28. Xie, B.; Deng, Y.; Shao, Z.; Xu, Q.; Li, Y. Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 13427–13440.
29. Peng, Y.; Zhang, Y.; Xiong, Z.; Sun, X.; Wu, F. GET: Group Event Transformer for Event-Based Vision. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 6015–6025.
30. Xu, F.; Yu, L.; Wang, B.; Yang, W.; Xia, G.S.; Jia, X.; Qiao, Z.; Liu, J. Motion deblurring with real events. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2583–2592.
31. Zhou, C.; Teng, M.; Han, J.; Liang, J.; Xu, C.; Cao, G.; Shi, B. Deblurring low-light images with events. Int. J. Comput. Vis. 2023, 131, 1284–1298.
32. Yao, B.; Deng, Y.; Liu, Y.; Chen, H.; Li, Y.; Yang, Z. SAM-Event-Adapter: Adapting Segment Anything Model for event-RGB semantic segmentation. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 9093–9100.
33. Kachole, S.; Huang, X.; Naeini, F.B.; Muthusamy, R.; Makris, D.; Zweiri, Y. Bimodal SegNet: Fused instance segmentation using events and RGB frames. Pattern Recognit. 2024, 149, 110215.
34. Devulapally, A.; Khan, M.F.F.; Advani, S.; Narayanan, V. Multi-modal fusion of event and RGB for monocular depth estimation using a unified transformer-based architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2081–2089.
35. Zhu, P.; Sun, Y.; Cao, B.; Hu, Q. Task-customized mixture of adapters for general image fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 7099–7108.
36. Cao, B.; Sun, Y.; Zhu, P.; Hu, Q. Multi-modal gated mixture of local-to-global experts for dynamic image fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 23555–23564.
37. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
38. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes (VOC) challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
39. Gehrig, M.; Aarents, W.; Gehrig, D.; Scaramuzza, D. DSEC: A Stereo Event Camera Dataset for Driving Scenarios. IEEE Robot. Autom. Lett. 2021, 6, 4947–4954.
40. Zhu, A.Z.; Thakur, D.; Özaslan, T.; Pfrommer, B.; Kumar, V.; Daniilidis, K. The Multivehicle Stereo Event Camera Dataset: An Event Camera Dataset for 3D Perception. IEEE Robot. Autom. Lett. 2018, 3, 2032–2039.
41. Hu, Y.; Liu, S.C.; Delbruck, T. v2e: From Video Frames to Realistic DVS Events. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 19–25 June 2021; pp. 1312–1321.
42. Wang, X.; Li, J.; Zhu, L.; Zhang, Z.; Chen, Z.; Li, X.; Wang, Y.; Tian, Y.; Wu, F. VisEvent: Reliable object tracking via collaboration of frame and event flows. IEEE Trans. Cybern. 2023, 54, 1997–2010.
43. Wang, X.; Wang, S.; Tang, C.; Zhu, L.; Jiang, B.; Tian, Y.; Tang, J. Event stream-based visual object tracking: A high-resolution benchmark dataset and a novel baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19248–19257.
44. Magrini, G.; Becattini, F.; Pala, P.; Del Bimbo, A.; Porta, A. Neuromorphic Drone Detection: An Event-RGB Multimodal Approach. In Proceedings of the Computer Vision—ECCV 2024 Workshops; Del Bue, A., Canton, C., Pont-Tuset, J., Tommasi, T., Eds.; Springer: Cham, Switzerland, 2025; pp. 259–275.
45. Li, Y.; Li, X.; Li, Y.; Zhang, Y.; Dai, Y.; Hou, Q.; Cheng, M.M.; Yang, J. SM3Det: A unified model for multi-modal remote sensing object detection. arXiv 2024, arXiv:2412.20665.
46. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
47. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636.
48. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021.
49. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
Figure 1. Comparison of RGB-Event fusion strategies.
Figure 2. The architecture of the proposed network. (a) The overall framework consists of a TMAF module for event processing, dual-stream backbones for feature extraction, SNGAF modules for cross-modal feature fusion, and a detection head with FPN for small object detection. (b) The TMAF module aggregates event streams at multiple temporal scales and enhances the feature saliency of small moving objects. (c) The SNGAF module, composed of a Gating Network and multiple CMAFE modules, adaptively fuses RGB and event features. In this subfigure, the black arrows denote the flow of RGB and event feature maps, the solid green arrows indicate how the gating weights generated by the Gating Network control the opening and closing of the gates, and the dashed green arrows represent the weighted summation of the selected CMAFE outputs according to their gating weights. (d) Each CMAFE module performs symmetric intra-modal enhancement and cross-modal attention interaction to exploit both modality-specific and complementary information.
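To make the sparse gating described in (c) concrete, the following is a minimal PyTorch sketch of noisy top-k gating over a pool of fusion experts, in the spirit of the sparsely-gated mixture-of-experts layer [26]. The class names (SparseGatedFusion, SimpleFusionExpert), the pooled-descriptor gating input, and the concatenation-based placeholder experts are illustrative assumptions only; they are not the authors' CMAFE or Gating Network implementation.

```python
# Sketch of sparse noisy top-k gating over fusion experts (cf. [26]).
# SimpleFusionExpert is a stand-in for the CMAFE modules of Figure 2d.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleFusionExpert(nn.Module):
    """Placeholder expert: concatenate RGB and event features, mix with a 1x1 conv."""
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat, evt_feat):
        return self.mix(torch.cat([rgb_feat, evt_feat], dim=1))


class SparseGatedFusion(nn.Module):
    """Select and blend the top-k experts per sample using noisy gating."""
    def __init__(self, channels: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [SimpleFusionExpert(channels) for _ in range(num_experts)]
        )
        self.w_gate = nn.Linear(2 * channels, num_experts)   # clean gating logits
        self.w_noise = nn.Linear(2 * channels, num_experts)  # learned noise scale
        self.top_k = top_k

    def forward(self, rgb_feat, evt_feat):
        # Global descriptor of both modalities drives the gating decision.
        desc = torch.cat([rgb_feat.mean(dim=(2, 3)), evt_feat.mean(dim=(2, 3))], dim=1)
        logits = self.w_gate(desc)
        if self.training:  # noise promotes balanced expert usage during training
            logits = logits + torch.randn_like(logits) * F.softplus(self.w_noise(desc))
        # Keep only the k largest logits; all other experts receive zero weight.
        top_vals, top_idx = logits.topk(self.top_k, dim=1)
        weights = torch.zeros_like(logits).scatter(1, top_idx, F.softmax(top_vals, dim=1))
        # Weighted sum of the selected experts' fused feature maps.
        fused = 0.0
        for e, expert in enumerate(self.experts):
            w = weights[:, e]
            if torch.any(w > 0):
                fused = fused + w.view(-1, 1, 1, 1) * expert(rgb_feat, evt_feat)
        return fused


if __name__ == "__main__":
    fuser = SparseGatedFusion(channels=64, num_experts=4, top_k=2)
    rgb = torch.randn(2, 64, 80, 80)
    evt = torch.randn(2, 64, 80, 80)
    print(fuser(rgb, evt).shape)  # torch.Size([2, 64, 80, 80])
```

In this style of gating, the training-time noise (scaled by a learned softplus term) randomizes which experts cross the top-k threshold and thereby discourages the gate from collapsing onto a single expert; at inference the noise is dropped and only the selected experts are evaluated.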
Figure 3. Overview of the RGBE-UAV dataset. (a) Comparison of object area distributions across RGB-Event datasets. (b) Proportions of exposure types in the RGBE-UAV dataset. (c) Examples of RGB-Event frame pairs under varying exposure conditions (from top to bottom: overexposure, normal exposure, underexposure).
Figure 4. Qualitative comparison on our RGBE-UAV dataset. Green bounding boxes denote ground truth, while red bounding boxes indicate the model predictions.
Figure 5. Qualitative visualization of the input and output of the TMAF module. From left to right: event frames accumulated at three temporal scales (ΔT1, ΔT2, and ΔT3), and the fused feature map produced by the TMAF module. From top to bottom: overexposure, underexposure, and normal exposure scenarios.
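As a complement to the visualization above, the snippet below sketches how an asynchronous event stream can be binned into frames over several accumulation windows, which is the kind of multi-temporal-scale input that TMAF consumes. The (t, x, y, polarity) event layout, the function names, and the 640 × 480 resolution are assumptions made for illustration; the 15/30/40 ms windows follow the best-performing combination reported in Table 4.

```python
# Sketch of multi-temporal-scale event-frame accumulation (illustrative only).
import numpy as np


def accumulate_event_frame(events, t_ref, delta_t, height, width):
    """Accumulate events from [t_ref - delta_t, t_ref] into a signed 2D frame.

    events: (N, 4) array with columns (t in seconds, x, y, polarity in {-1, +1}).
    """
    t = events[:, 0]
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    p = events[:, 3]
    mask = (t >= t_ref - delta_t) & (t <= t_ref)
    frame = np.zeros((height, width), dtype=np.float32)
    # Each event adds its polarity at its pixel location (unbuffered in-place add).
    np.add.at(frame, (y[mask], x[mask]), p[mask])
    return frame


def multi_scale_event_frames(events, t_ref, scales_ms=(15.0, 30.0, 40.0),
                             height=480, width=640):
    """Stack event frames accumulated over several temporal windows, channels first."""
    return np.stack(
        [accumulate_event_frame(events, t_ref, dt * 1e-3, height, width)
         for dt in scales_ms],
        axis=0,
    )


if __name__ == "__main__":
    # Synthetic example: 10,000 random events spread over 50 ms.
    rng = np.random.default_rng(0)
    ev = np.column_stack([
        rng.uniform(0.0, 0.05, 10_000),   # timestamps (s)
        rng.integers(0, 640, 10_000),     # x coordinates
        rng.integers(0, 480, 10_000),     # y coordinates
        rng.choice([-1.0, 1.0], 10_000),  # polarities
    ])
    frames = multi_scale_event_frames(ev, t_ref=0.05)
    print(frames.shape)  # (3, 480, 640)
```

In general, longer accumulation windows gather more motion evidence for slowly moving objects but smear fast ones, which is why combining several window lengths (as in Table 4) tends to be preferable to a single fixed ΔT.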
Figure 6. Examples of detection failures. (a) Scenario where the object is heavily blended with a complex background and exhibits slow motion. (b) Scenario under extremely low-light conditions.
Table 1. A comparative summary of existing RGB-Event object detection datasets. MS: multi-scene, MT: multi-target, OE: overexposure, NE: normal exposure, UE: underexposure.
Dataset | AVG Object Scale | MS / MT / OE / NE / UE | Year
EventVOT [43] | 129 × 100 pixels | ×× | 2024
VisEvent [42] | 84 × 66 pixels | ×× | 2023
NeRDD [44] | 55 × 31 pixels | ×× | 2024
DSEC-MOD [9] | 32 × 34 pixels | ×× | 2023
RGBE-UAV | 7.0 × 5.1 pixels | ✓✓✓✓✓ | 2025
Table 2. Quantitative comparison on the RGBE-UAV and DSEC-MOD datasets. Bold and underlined values indicate the best and second-best performance, respectively.
Modality | Method | Publication | Params | FLOPs | RGBE-UAV mAP50 | RGBE-UAV mAP75 | DSEC-MOD mAP50 | DSEC-MOD mAP75
RGB-only | RetinaNet [46] | ICCV 2017 | 36.33 M | 61.37 G | 0.7304 | 0.1622 | 0.3074 | 0.1964
RGB-only | FCOS [47] | ICCV 2019 | 32.29 M | 61.62 G | 0.7019 | 0.1764 | 0.2943 | 0.2011
RGB-only | Deformable DETR [48] | ICLR 2021 | 40.80 M | 60.20 G | 0.7865 | 0.1824 | 0.3901 | 0.2259
RGB-only | YOLOv10 [49] | NeurIPS 2024 | 16.58 M | 64.50 G | 0.8680 | 0.2240 | 0.4550 | 0.2759
Event-only | RetinaNet [46] | ICCV 2017 | 36.33 M | 61.37 G | 0.3903 | 0.0311 | 0.3366 | 0.1358
Event-only | FCOS [47] | ICCV 2019 | 32.29 M | 61.62 G | 0.3859 | 0.0204 | 0.3171 | 0.1336
Event-only | Deformable DETR [48] | ICLR 2021 | 40.80 M | 60.20 G | 0.4323 | 0.0324 | 0.3521 | 0.1433
Event-only | YOLOv10 [49] | NeurIPS 2024 | 16.58 M | 64.50 G | 0.5780 | 0.0435 | 0.3602 | 0.1682
RGB-Event Fusion | Early-Fusion | – | 33.52 M | 167.20 G | 0.8974 | 0.1864 | 0.4718 | 0.2632
RGB-Event Fusion | FPN-Fusion [20] | ICRA 2022 | 59.85 M | 198.40 G | 0.9186 | 0.2583 | 0.5729 | 0.3636
RGB-Event Fusion | RENet [9] | ICRA 2023 | 87.74 M | 273.98 G | 0.9456 | 0.2795 | 0.5890 | 0.3977
RGB-Event Fusion | EOLO [10] | ICRA 2024 | 106.93 M | 327.79 G | 0.9333 | 0.3046 | 0.6681 | 0.4352
RGB-Event Fusion | Ours | – | 60.77 M | 198.41 G | 0.9557 | 0.3159 | 0.6810 | 0.4531
Table 3. Performance comparison of RGB-Event fusion methods under different exposure conditions. Over: overexposure, Normal: normal exposure, Under: underexposure.
Method | Over mAP50 | Over mAP75 | Normal mAP50 | Normal mAP75 | Under mAP50 | Under mAP75
Early-Fusion | 0.8294 | 0.1617 | 0.9585 | 0.2229 | 0.8438 | 0.2160
FPN-Fusion [20] | 0.8458 | 0.1929 | 0.9687 | 0.3070 | 0.9067 | 0.2720
RENet [9] | 0.8534 | 0.1855 | 0.9737 | 0.2722 | 0.9372 | 0.2679
EOLO [10] | 0.8407 | 0.1941 | 0.9848 | 0.3203 | 0.9370 | 0.2772
Ours | 0.8601 | 0.2209 | 0.9884 | 0.3395 | 0.9419 | 0.3046
Table 4. Ablation study on different temporal scale combinations for the TMAF module on the RGBE-UAV dataset.
Combination | ΔT1 (ms) | ΔT2 (ms) | ΔT3 (ms) | mAP50 | mAP75
#1 | 5 | 10 | 20 | 0.9285 | 0.2853
#2 | 10 | 20 | 30 | 0.9391 | 0.2944
#3 | 15 | 25 | 35 | 0.9490 | 0.3089
#4 | 20 | 30 | 40 | 0.9525 | 0.3126
#5 | 30 | 35 | 40 | 0.9453 | 0.3094
#6 | 15 | 30 | 40 | 0.9557 | 0.3159
Table 5. Ablation study results of the proposed modules.
Baseline | TMAF | SNGAF w/o Noise | SNGAF | mAP50 | mAP75
0.8768 | 0.2453
0.9186 | 0.2583
0.9312 | 0.2895
0.9393 | 0.2982
0.9475 | 0.3110
0.9557 | 0.3159
Table 6. Ablation study results on the sparse-gated fusion scheme.
1 CMAFE | 2 CMAFE | SNGAF | mAP50 | mAP75
✓ | – | – | 0.9397 | 0.2742
– | ✓ | – | 0.9282 | 0.3131
– | – | ✓ | 0.9557 | 0.3159
Table 7. Ablation study results of dual-modality RGB-Event input.
RGB Modality | Event Modality | mAP50 | mAP75
✓ | – | 0.9074 | 0.2948
– | ✓ | 0.4519 | 0.0243
✓ | ✓ | 0.9557 | 0.3159
