Article

Efficient Spiking Neural Network for RGB–Event Fusion-Based Object Detection

1 College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China
2 Defense Innovation Institute, Academy of Military Sciences, Beijing 100000, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2025, 14(6), 1105; https://doi.org/10.3390/electronics14061105
Submission received: 11 February 2025 / Revised: 4 March 2025 / Accepted: 10 March 2025 / Published: 11 March 2025

Abstract

Robust object detection in challenging scenarios remains a critical challenge for autonomous driving systems. Inspired by human visual perception, integrating the complementary modalities of RGB frames and event streams presents a promising approach to achieving robust object detection. However, existing multimodal object detectors achieve superior performance at the cost of significant computational power consumption. To address this challenge, we propose a novel spiking RGB–event fusion-based detection network (SFDNet), a fully spiking object detector capable of achieving both low-power and high-performance object detection. Specifically, we first introduce the Leaky Integrate-and-Multi-Fire (LIMF) neuron model, which combines soft and hard reset mechanisms to enhance feature representation in SNNs. We then develop a multi-scale hierarchical spiking residual attention network and a lightweight spiking aggregation module for efficient dual-modality feature extraction and fusion. Experimental results on two public multimodal object detection datasets demonstrate that our SFDNet achieves state-of-the-art performance with remarkably low power consumption. The superior performance in challenging scenarios, such as motion blur and low-light conditions, highlights the robustness and effectiveness of SFDNet, significantly advancing the applicability of SNNs for real-world object detection tasks.

1. Introduction

Robust object detection in high-speed and high-dynamic scenarios remains a critical challenge for autonomous driving systems. Conventional frame-based RGB cameras operate at a fixed frame rate and suffer from issues such as motion blur and sensitivity to lighting conditions [1,2,3,4]. Recently, a bio-inspired neuromorphic camera, called the event camera, has emerged as a promising alternative [5,6,7,8]. Event cameras offer advantages such as high temporal resolution and high dynamic range, making them well suited to address the limitations of conventional cameras in high-speed and challenging lighting conditions [9,10,11]. However, event cameras struggle to capture object texture and color, and generate sparse event streams in relatively static scenes, leading to poor performance in slow or static scenarios. Therefore, event cameras and frame-based RGB cameras are complementary, and combining the advantages of RGB frames and events presents a promising approach for achieving high-performance and robust object detection [12,13,14,15,16,17,18].
A major issue is that current object detectors achieve high performance at the cost of significant computational power consumption. On one hand, the rapid development of vision transformers [19] has catalyzed the emergence of many transformer-based detectors [14,20,21,22], such as DETR [20] and SODFormer [14]. Additionally, CAFR [16] introduces a self-attention-based cross-modal feature fusion module, which significantly improves detection accuracy and robustness. However, the quadratic computational complexity of self-attention results in these detectors requiring substantial computational resources and power consumption. On the other hand, many existing object detectors [11,18,23,24] leverage temporal cues from event streams and adjacent frames to address challenges such as occlusion and shape variation. For example, RVT [11] and FAOD [18] utilize multi-layer LSTM to model temporal information, thereby improving object detection performance. However, these approaches require multiple input frames, leading to an increase in computation and power consumption proportional to the number of input frames. There is still a lack of a low-power, high-performance robust object detection method.
Spiking neural networks (SNNs) offer a promising solution due to their biologically plausible, event-driven computation and inherent energy efficiency. Current SNN-based object detectors remain relatively scarce and face significant performance bottlenecks, particularly in large-scale multimodal datasets for autonomous driving. Existing SNN-based detection algorithms [25,26,27,28,29,30], such as EMS-YOLO [25] and Spiking-YOLO [26], remain constrained to single-modal object detection. This limitation can be attributed to the binary information representation of current spiking neurons, which leads to substantial information loss. Regardless of the importance of the input at a given timestep, the output is restricted to a single spike, limiting the scalability of SNNs for complex computer vision tasks.
To address the challenges mentioned above, we propose a novel spiking RGB–event fusion-based detection network (SFDNet), which is a fully spiking object detector that integrates the complementary advantages of event streams and RGB frames to achieve low-power and high-performance object detection. Specifically, we first introduce a novel Leaky Integrate-and-Multi-Fire (LIMF) neuron model, which incorporates both soft reset and hard reset mechanisms to reduce information loss and enhance feature representation in SNNs. Then, we design a multi-scale hierarchical spiking residual attention network for dual-pathway feature extraction. Finally, a lightweight spiking aggregation module is designed to effectively integrate dual-modality information with minimal parameters and computational costs. Experimental results on two public multimodal object detection datasets demonstrate that our SFDNet achieves state-of-the-art detection performance with remarkably low power consumption. We also validate the robustness and effectiveness of SFDNet in challenging scenarios such as motion blur and low-light conditions, advancing the practicality of SNNs for real-world applications.
In summary, the main contributions of this work are as follows:
  • We propose a novel spiking RGB–event fusion-based detection network, termed SFDNet, which is a fully spiking object detector that achieves robust object detection with remarkably low power consumption;
  • We introduce the Leaky Integrate-and-Multi-Fire (LIMF) neuron model, which combines soft and hard reset mechanisms to enhance feature representation in SNNs;
  • We develop a multi-scale hierarchical spiking residual attention network and a lightweight spiking aggregation module for efficient and effective extraction and fusion of features from both events and RGB frames;
  • Extensive experimental results on two public multimodal object detection datasets demonstrate that our SFDNet outperforms state-of-the-art object detectors with significantly lower power consumption.

2. Related Works

2.1. RGB–Event Fusion for Object Detection

Multimodal fusion is a common practice in computer vision tasks [31,32,33,34,35,36,37]. Inspired by the complementary nature of RGB frames and event streams, recent object detectors [12,13,14,15,16,17,18] have explored integrating these two modalities. Early approaches [12,13] primarily focused on late fusion, where detection results from both modalities are merged. For instance, a confidence map fusion method has been proposed to combine detection outputs from both events and frames [13]. Additionally, Dempster–Shafer theory has been utilized to fuse detection results based on events and frames [12]. More recent studies [14,15,16,17,18] have shifted toward intermediate feature fusion to achieve robust detection. For example, simple concatenation of event and frame features has been demonstrated to improve detection performance under adverse conditions. SODFormer [14] introduced a transformer-based object detector that employs an attention-based fusion module to integrate event and RGB features effectively. Similarly, FAOD [18] and RENet [17] designed cross-attention mechanisms to fuse multimodal features at intermediate layers. However, these methods achieve high performance at the cost of significant computational resources and power consumption. In contrast to previous studies, we propose a fully spiking object detector that employs a lightweight spiking aggregation module, aiming to achieve robust object detection with remarkably low power consumption.

2.2. Spiking Neural Networks for Object Detection

SNNs demonstrate several inherent advantages over artificial neural networks (ANNs), particularly in terms of energy efficiency and event-driven computation. According to training strategies, deep SNNs for object detection can be categorized into ANN-to-SNN conversion methods [26,38,39,40] and directly trained SNNs [25,27,28,29,30]. Most early works [26,38,40] leveraged ANN-to-SNN conversion to construct fully spiking object detection networks. For instance, Spiking-YOLO [26] introduced channel-wise normalization and signed neurons to mitigate performance degradation in deep SNNs, achieving competitive detection performance. However, conversion-based methods typically require hundreds of timesteps to match the performance of ANN counterparts, leading to increased power consumption and prolonged inference time. Recent studies have explored directly training SNNs using surrogate gradients to overcome these limitations. Several hybrid ANN-SNN architectures [29,30] have been developed, where SNN backbones and ANN-based detection heads are directly trained together to achieve high accuracy in object detection tasks. However, these ANN-based detection heads still introduce significant computational overhead. End-to-end directly trained fully spiking SNNs [25,27,28] represent a promising direction for achieving low-power and high-performance object detection. For example, EMS-YOLO [25] proposed a fully spiking SNN-based object detector directly trained with LIF neurons over four timesteps, demonstrating the feasibility of efficient SNN-based detection. However, due to the simplicity of LIF neurons, LIF-based SNNs struggle to achieve high performance in complex multimodal object detection tasks. Thus, we design a novel Leaky Integrate-and-Multi-Fire (LIMF) neuron model, which combines soft and hard reset mechanisms to improve the representation capability of SNNs, enabling them to better handle complex multimodal object detection tasks.

3. Methods

3.1. Network Input

This work aims to achieve low-power and high-performance object detection through a fully spiking neural network that processes both RGB images and event streams. The input consists of two complementary modalities: fixed-frame-rate RGB images I and continuous event streams E(x, y, t, p), where (x, y) represents the pixel coordinates, t denotes the timestamp, and the polarity p ∈ {−1, 1} indicates whether the brightness has decreased or increased. To be compatible with deep learning methods, asynchronous event streams are commonly converted into image-like representations [9,41,42,43,44]. In this work, we adopt our proposed event temporal image [9] to map the event stream within an accumulation time window Δt into a three-dimensional tensor. Unlike commonly used event representations such as event histograms [42] and event images [41], the event temporal image preserves both the temporal information and the polarity of the events. Specifically, the event stream is first partitioned along the time axis into a set of B bins, where the polarities of events in each bin are accumulated and mapped to the range [0, 255] using an activation function. The transformation process is formulated as follows:
t_i^* = \frac{B \cdot (t_i - t_1)}{t_{N_e} - t_1}, \qquad (1)
S(x, y, t) = \sigma\left( \sum_i p_i \, \delta(x - x_i) \, \delta(y - y_i) \, \delta(t - t_i^*) \right), \qquad (2)
\sigma(a) = \frac{255}{1 + e^{-a/2}}, \qquad (3)
where σ(·) and δ(·) represent the activation function and the Dirac delta function, respectively, and N_e denotes the number of events. The generated event temporal image S ∈ ℝ^{B×H×W} and the corresponding RGB frame are fed together into SFDNet.
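For concreteness, a minimal sketch of this conversion in PyTorch is given below; the tensor layout, the bin assignment, and the sigmoid scaling follow the reconstruction of Equations (1)–(3) above and are illustrative assumptions rather than the authors' released implementation.

```python
import torch

def event_temporal_image(x, y, t, p, B, H, W):
    """Map an event stream (x_i, y_i, t_i, p_i) within a time window into a
    B x H x W tensor, following Eqs. (1)-(3). Illustrative sketch only."""
    # Eq. (1): rescale timestamps into [0, B) and assign each event to a bin.
    t_star = B * (t - t[0]) / (t[-1] - t[0] + 1e-9)
    bins = t_star.clamp(max=B - 1).long()

    # Eq. (2): accumulate signed polarities per (bin, y, x) cell.
    S = torch.zeros(B, H, W)
    flat_idx = bins * H * W + y.long() * W + x.long()
    S.view(-1).index_add_(0, flat_idx, p.float())

    # Eq. (3): squash the accumulated polarities into [0, 255] with a sigmoid
    # (the a/2 scaling is an assumption based on the reconstructed equation).
    return 255.0 / (1.0 + torch.exp(-S / 2.0))

# Toy usage: 1000 random events on a 260 x 346 sensor, B = 3 temporal bins.
N = 1000
x = torch.randint(0, 346, (N,)); y = torch.randint(0, 260, (N,))
t = torch.sort(torch.rand(N)).values
p = torch.randint(0, 2, (N,)).float() * 2 - 1   # polarities in {-1, +1}
S = event_temporal_image(x, y, t, p, B=3, H=260, W=346)
```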

3.2. Network Architecture

As shown in Figure 1, the overall network architecture consists of four main components: input encoding, dual-pathway feature extraction, the spiking aggregation module, and the spiking detection head. Firstly, direct encoding is employed to encode both RGB images and event temporal images into binary spike trains. The first convolutional layer with spiking neurons serves as the input encoding layer, which accumulates local pixel values and outputs spike trains. Then, we employ spiking dual-pathway backbones with identical structures but independent parameters to extract hierarchical feature maps from each modality. Additionally, a spiking aggregation module is designed to facilitate effective cross-modal fusion. Finally, the fused multimodal features are fed into the spiking detection head for classification and localization.
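The schematic sketch below shows how these four components compose in a forward pass; the module names are hypothetical placeholders for the blocks described in the following subsections, not the authors' actual class names.

```python
import torch.nn as nn

class SFDNetSketch(nn.Module):
    """Schematic composition of SFDNet (hypothetical module names)."""
    def __init__(self, rgb_backbone, event_backbone, aggregation, fpn, head):
        super().__init__()
        self.rgb_backbone = rgb_backbone      # spiking MHSANet pathway for RGB frames
        self.event_backbone = event_backbone  # identical structure, independent weights
        self.aggregation = aggregation        # lightweight spiking aggregation module
        self.fpn = fpn                        # feature pyramid network
        self.head = head                      # spiking YOLOX-style detection head

    def forward(self, rgb, event_img):
        # Dual-pathway multi-scale feature extraction (lists of feature maps).
        rgb_feats = self.rgb_backbone(rgb)
        evt_feats = self.event_backbone(event_img)
        # Level-wise cross-modal fusion, then coarse-to-fine aggregation and detection.
        fused = [self.aggregation(r, e) for r, e in zip(rgb_feats, evt_feats)]
        return self.head(self.fpn(fused))
```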

3.2.1. Dual-Pathway Feature Extraction

To enhance the performance of SNNs, it is essential to deepen their architecture while addressing the degradation problem that is commonly encountered in deep SNNs. Previous studies [45,46] have demonstrated that residual blocks are effective in preventing the vanishing gradient issue. However, directly applying standard residual blocks does not resolve the degradation problem in SNNs. To address this issue, we analyze the backpropagated gradients in the residual blocks.
Let r_l denote the input to the l-th Residual Unit, F(·) the residual function, h(·) the shortcut function, W_l the weights of the l-th Residual Unit, and f(·) the activation function. The forward propagation of the Residual Unit can be formulated as follows:
r_{l+1} = f\left( h(r_l) + F(r_l, W_l) \right). \qquad (4)
The gradient during backpropagation is computed as
\frac{\partial E}{\partial r_l} = \frac{\partial E}{\partial r_{l+1}} \frac{\partial r_{l+1}}{\partial r_l} = \frac{\partial E}{\partial r_{l+1}} \cdot f' \cdot \left( \frac{\partial h(r_l)}{\partial r_l} + \frac{\partial F(r_l, W_l)}{\partial r_l} \right), \qquad (5)
where E represents the loss function. Recursively, we have
\frac{\partial E}{\partial r_l} = \frac{\partial E}{\partial r_L} \frac{\partial r_L}{\partial r_l} = \frac{\partial E}{\partial r_L} \cdot (f')^{L-l} \cdot \prod_{i=l}^{L-1} \left( \frac{\partial h(r_i)}{\partial r_i} + \frac{\partial F(r_i, W_i)}{\partial r_i} \right). \qquad (6)
As shown in Equation (6), the term (f')^{L-l} directly affects the stability of gradient backpropagation. Since deep SNNs can contain dozens of residual blocks, vanishing or exploding gradients can be avoided only if f' remains consistently equal to 1. Here, the function f represents the spiking neurons in the spiking residual blocks, so (f')^{L-l} = (Θ')^{L-l}. However, the surrogate gradient Θ' ≤ 1, which implies that deep SNNs with standard residual blocks may suffer from vanishing gradient issues.
To address the issue of vanishing gradients in deep SNNs, we set both the functions f and h as identity mappings. The Residual Unit can be formulated as
r_{l+1} = r_l + F(r_l, W_l). \qquad (7)
Recursively, we have
r_L = r_l + \sum_{i=l}^{L-1} F(r_i, W_i). \qquad (8)
From the chain rule of backpropagation, we have
\frac{\partial E}{\partial r_l} = \frac{\partial E}{\partial r_L} \frac{\partial r_L}{\partial r_l} = \frac{\partial E}{\partial r_L} \left( 1 + \frac{\partial}{\partial r_l} \sum_{i=l}^{L-1} F(r_i, W_i) \right). \qquad (9)
It can be observed that even if the gradient \frac{\partial}{\partial r_l} \sum_{i=l}^{L-1} F(r_i, W_i) within the Residual Unit becomes arbitrarily small, the overall gradient does not vanish. In this work, the bottleneck block is selected as the residual block due to its ability to effectively balance performance and parameter efficiency [47]. To achieve identity mappings, the spiking neurons need to be placed before the convolutional layers as pre-activations (Figure 1).
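A minimal sketch of such a pre-activation spiking bottleneck block is given below; the single-step Heaviside activation is a stand-in for the actual neuron model, and the layer widths and batch normalization are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SpikeAct(nn.Module):
    """Stand-in single-step spiking activation (Heaviside on the input); a real
    implementation would keep membrane state and use a surrogate gradient."""
    def forward(self, x):
        return (x > 0).float()

class PreActSpikingBottleneck(nn.Module):
    """Pre-activation bottleneck: spike -> conv on the residual branch, identity
    shortcut, so both f and h in Eq. (7) are identity mappings."""
    def __init__(self, channels, mid_channels):
        super().__init__()
        self.branch = nn.Sequential(
            SpikeAct(), nn.Conv2d(channels, mid_channels, 1, bias=False),
            nn.BatchNorm2d(mid_channels),
            SpikeAct(), nn.Conv2d(mid_channels, mid_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels),
            SpikeAct(), nn.Conv2d(mid_channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, r):
        # r_{l+1} = r_l + F(r_l, W_l): the shortcut is left untouched.
        return r + self.branch(r)

x = torch.randn(2, 64, 32, 32)
y = PreActSpikingBottleneck(64, 16)(x)   # same shape as x
```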
Based on the pre-activation bottleneck block, we design the Multi-scale Hierarchical Spiking Residual Attention Network (MHSANet) for effective feature extraction. Specifically, we stack multiple spiking convolutional layers within a residual block to enlarge the receptive field and naturally extract multi-scale features. As illustrated in Figure 1, the feature maps X are divided into s groups along the channel dimension, denoted as {X_1, X_2, …, X_s}. These groups are then processed by the hierarchical spiking convolutional layers, which can be formulated as follows:
Y_i = \begin{cases} X_i, & i = 1, \\ \mathrm{Conv}_i(X_i), & i = 2, \\ \mathrm{Conv}_i(X_i + Y_{i-1}), & 2 < i \le s. \end{cases} \qquad (10)
Subsequently, the outputs are concatenated and passed through a 1 × 1 convolutional layer, which integrates features across the groups and fuses multi-scale information. Inspired by prior work [46], an attention layer incorporating both channel and spatial attention mechanisms is employed to refine the feature map. To balance model performance and computational cost, we construct MHSANet-50 (26w × 4s) based on ResNet-50 as the backbone, where w represents the filter width and s denotes the number of scales.
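A minimal sketch of the hierarchical grouping in Equation (10) is shown below; the group count s, the channel sizes, and the SpikeAct stand-in (see the previous sketch) are illustrative assumptions, and the channel/spatial attention layer is omitted for brevity.

```python
import torch
import torch.nn as nn

class SpikeAct(nn.Module):
    """Stand-in spiking activation (see the previous sketch)."""
    def forward(self, x):
        return (x > 0).float()

class HierarchicalSpikingConv(nn.Module):
    """Eq. (10): split the channels into s groups; group 1 passes through, and each
    later group is convolved after adding the previous group's output."""
    def __init__(self, channels, s=4):
        super().__init__()
        assert channels % s == 0
        self.s, w = s, channels // s
        self.convs = nn.ModuleList(
            nn.Sequential(SpikeAct(), nn.Conv2d(w, w, 3, padding=1, bias=False))
            for _ in range(s - 1)
        )
        self.fuse = nn.Sequential(SpikeAct(), nn.Conv2d(channels, channels, 1, bias=False))

    def forward(self, x):
        groups = torch.chunk(x, self.s, dim=1)
        outs = [groups[0]]                       # Y_1 = X_1
        y = self.convs[0](groups[1])             # Y_2 = Conv_2(X_2)
        outs.append(y)
        for i in range(2, self.s):               # Y_i = Conv_i(X_i + Y_{i-1})
            y = self.convs[i - 1](groups[i] + y)
            outs.append(y)
        # Concatenate the groups and fuse multi-scale information with a 1x1 conv.
        return self.fuse(torch.cat(outs, dim=1))

y = HierarchicalSpikingConv(64, s=4)(torch.randn(2, 64, 32, 32))
```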
Considering the significant differences between RGB and event modalities, dual-pathway spiking backbones with the same architectures and independent weights are employed for multi-scale feature extraction.

3.2.2. Spiking Aggregation Module

The RGB pathway provides the texture and color information of objects, while the event pathway plays an auxiliary role, offering motion information and addressing issues such as motion blur in high-speed and high-dynamic-range scenarios. Prior methods [14,31] typically employ attention-based modules to fuse the information from the two modalities. However, these attention-based fusion modules often introduce substantial computational overhead. Given the high-dimensional nature of the multi-scale feature maps extracted by the dual-pathway backbones, we introduce a lightweight spiking aggregation module to merge features from the RGB and event pathways. Specifically, a pre-activation spiking layer followed by a shared-weight 1 × 1 convolutional layer halves the number of channels of the features from each pathway. The two reduced feature maps are then concatenated and added to the original features from the RGB pathway. The fused feature maps at multiple levels are subsequently fed into the Feature Pyramid Network (FPN) [48] for coarse-to-fine information fusion.
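A minimal sketch of this aggregation step is given below, assuming a channel dimension C for both pathway features; the module name and the SpikeAct stand-in are illustrative. The shared 1 × 1 convolution halves the channels of each modality, the two reduced maps are concatenated, and the result is added back to the RGB features.

```python
import torch
import torch.nn as nn

class SpikeAct(nn.Module):
    """Stand-in spiking activation (see earlier sketches)."""
    def forward(self, x):
        return (x > 0).float()

class SpikingAggregation(nn.Module):
    """Lightweight fusion: spike -> shared 1x1 conv (C -> C/2) on each pathway,
    concatenate the two reduced maps, and add the result to the RGB features."""
    def __init__(self, channels):
        super().__init__()
        self.spike = SpikeAct()
        self.reduce = nn.Conv2d(channels, channels // 2, 1, bias=False)  # shared weights

    def forward(self, rgb_feat, evt_feat):
        r = self.reduce(self.spike(rgb_feat))
        e = self.reduce(self.spike(evt_feat))
        return rgb_feat + torch.cat([r, e], dim=1)

fused = SpikingAggregation(256)(torch.randn(1, 256, 20, 20), torch.randn(1, 256, 20, 20))
```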

3.2.3. Spiking Detection Head

For classification and localization, we adopt the YOLOX-L [49] detection head due to its optimal balance between computational efficiency and detection accuracy. Compared to other single-stage detectors [50,51], YOLOX eliminates anchor-based design through its advanced anchor-free paradigm, which simplifies training pipelines while maintaining competitive performance. The decoupled detection head further enhances performance by separately optimizing classification and regression tasks. Specifically, the detection head consists of a 1 × 1 convolutional layer followed by two parallel convolutional branches. Each branch comprises two 3 × 3 convolutional layers. To achieve a fully spiking detection network, we replace all activation neurons with pre-activation spiking neurons. The detection loss can be formulated as follows:
L = \lambda \cdot L_{reg} + L_{obj} + L_{cls}, \qquad (11)
where L_{reg}, L_{obj}, and L_{cls} denote the regression loss, objectness loss, and classification loss [9], respectively, and λ is a balancing coefficient that adjusts the weight of the regression loss. Consistent with previous works [9,49,50], we set λ = 5.
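To make the head structure concrete, the sketch below mirrors the decoupled layout described above (a 1 × 1 stem followed by two parallel branches of two 3 × 3 convolutions, with pre-activation spiking layers) together with the loss weighting of Equation (11); the channel sizes, prediction layers, and SpikeAct stand-in are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SpikeAct(nn.Module):
    """Stand-in spiking activation (see earlier sketches)."""
    def forward(self, x):
        return (x > 0).float()

def spiking_convs(channels, n=2):
    """n stacked pre-activation spiking 3x3 convolutions."""
    layers = []
    for _ in range(n):
        layers += [SpikeAct(), nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                   nn.BatchNorm2d(channels)]
    return nn.Sequential(*layers)

class SpikingDecoupledHead(nn.Module):
    """YOLOX-style decoupled head with spiking pre-activations: a 1x1 stem followed
    by parallel classification and regression/objectness branches (sketch)."""
    def __init__(self, in_channels, hidden, num_classes):
        super().__init__()
        self.stem = nn.Sequential(SpikeAct(), nn.Conv2d(in_channels, hidden, 1, bias=False),
                                  nn.BatchNorm2d(hidden))
        self.cls_branch = spiking_convs(hidden)
        self.reg_branch = spiking_convs(hidden)
        self.cls_pred = nn.Conv2d(hidden, num_classes, 1)   # class scores
        self.reg_pred = nn.Conv2d(hidden, 4, 1)             # box offsets
        self.obj_pred = nn.Conv2d(hidden, 1, 1)             # objectness

    def forward(self, x):
        x = self.stem(x)
        c, r = self.cls_branch(x), self.reg_branch(x)
        return self.cls_pred(c), self.reg_pred(r), self.obj_pred(r)

def detection_loss(loss_reg, loss_obj, loss_cls, lam=5.0):
    """Eq. (11): L = lam * L_reg + L_obj + L_cls, with lam = 5."""
    return lam * loss_reg + loss_obj + loss_cls

cls, box, obj = SpikingDecoupledHead(256, 128, 3)(torch.randn(1, 256, 20, 20))
```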

3.3. Spiking Neuron Model

The Leaky Integrate-and-Fire (LIF) [52,53] neuron is the most commonly used spiking neuron model for SNNs. The layer-wise LIF model (Figure 2a) can be uniformly described as follows:
X^{t,n} = W^{n} S^{t,n-1},
U^{t,n} = H^{t-1,n} + X^{t,n},
S^{t,n} = \Theta\left( U^{t,n} - V_{th} \right),
H^{t,n} = V_{reset} \cdot S^{t,n} + \beta \, U^{t,n} \left( 1 - S^{t,n} \right), \qquad (12)
where t and n denote the timestep and layer index, respectively. W^{n} denotes the weight matrix, which can be either fully connected or convolutional, and Θ(·) is the Heaviside step function. U^{t,n}, S^{t,n}, H^{t,n}, and X^{t,n} represent the membrane potential, spiking output, hidden state, and spatial feature tensors, respectively.
The LIF neuron model is widely used in SNN-based object detectors [25,26] due to its simplicity and computational efficiency. However, LIF neurons exhibit a significant limitation in their ability to accurately encode supra-threshold stimuli. In the vanilla LIF model, when the membrane potential exceeds a predefined threshold V_{th}, the neuron fires a single spike, followed by a reset of the membrane potential to V_{reset}, regardless of the stimulus strength. As illustrated in Figure 2c, this simplified reset mechanism severely restricts the representation of strong stimuli, as only one spike is generated at a given timestep, even in the presence of a strong stimulus.
To address this limitation, we propose the Leaky Integrate-and-Multi-Fire (LIMF) neuron model. The LIMF neuron introduces both a soft reset and a hard reset mechanism, which allows for the generation of multiple spikes in response to strong stimuli. Specifically, when the membrane potential U exceeds the threshold V th , the neuron fires a spike and undergoes a soft reset. If the membrane potential remains above the threshold, the neuron continues to fire additional spikes, repeating the soft reset process. Once the membrane potential falls below the threshold, a hard reset occurs if the neuron has fired previously; otherwise, the membrane potential decays by a factor β . As shown in Figure 2d, under a strong stimulus, the LIMF neuron generates multiple spikes, undergoing soft resets until the potential drops below the threshold, at which point the hard reset occurs.
The primary advantage of the LIMF neuron over the LIF neuron is its ability to capture more complex spatiotemporal dynamics by generating multiple spikes in response to strong stimuli. This improves the spatiotemporal representation capability of the neuron, making it more suitable for complex multimodal object detection tasks.
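A single-timestep sketch of the LIMF update is given below, contrasting with the LIF dynamics in Equation (12); the threshold, reset value, decay factor, subtraction-based soft reset, and the cap on spikes per timestep are illustrative assumptions, and a trainable version would additionally require a surrogate gradient.

```python
import torch

def limf_step(x, h, v_th=1.0, v_reset=0.0, beta=0.5, max_spikes=4):
    """One timestep of the Leaky Integrate-and-Multi-Fire neuron (sketch).
    x: input current, h: hidden membrane state from the previous timestep.
    Returns (spike_count, new_hidden_state)."""
    u = h + x                                   # integrate
    spikes = torch.zeros_like(u)
    for _ in range(max_spikes):                 # multi-fire with soft resets
        fire = (u >= v_th).float()
        if fire.sum() == 0:
            break
        spikes = spikes + fire
        u = u - fire * v_th                     # soft reset: subtract the threshold
    fired = (spikes > 0).float()
    # Hard reset where the neuron fired at least once; otherwise leaky decay by beta.
    h_new = fired * v_reset + (1 - fired) * beta * u
    return spikes, h_new

s, h = limf_step(torch.tensor([0.3, 1.2, 2.7]), torch.zeros(3))
# -> 0, 1, and 2 spikes for the weak, moderate, and strong inputs, respectively
```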

4. Results

4.1. Experiment Settings

Our experiments are conducted on two benchmark datasets: PKU-DAVIS-SOD [14] and DSEC-Detection [54]. The PKU-DAVIS-SOD dataset was collected using the DAVIS346 event camera, which simultaneously outputs events with high temporal resolution and RGB frames at a resolution of 346 × 260 pixels and 25 FPS. It includes 1080.1 k manually annotated labels for three object categories (cars, pedestrians, and two-wheelers). Specifically, 671.3 k labels are used for the training set, 194.7 k for the validation set, and 214.1 k for the testing set. Furthermore, the dataset covers three environmental conditions: normal illumination, motion blur, and low-light scenarios.
The DSEC-Detection dataset is a high-resolution, large-scale object detection dataset that provides aligned event and RGB data at a resolution of 640 × 480 pixels. Data acquisition was conducted through a synchronized dual-sensor configuration comprising a Prophesee Gen3.1M event camera and a FLIR BlackFly S USB3 RGB camera, with time synchronization and spatial calibration applied to ensure precise alignment between the event streams and RGB frames. The dataset captures diverse urban driving scenarios under varying illumination conditions, featuring automatically generated annotations for eight object categories through a multi-object tracker, with subsequent manual quality verification. The dataset comprises 60 sequences, with 41 sequences for training, 6 sequences for validation, and 13 sequences for testing.
To train SNNs directly, we apply the standard Backpropagation Through Time (BPTT) algorithm [55]. The models are trained for 15 epochs on two NVIDIA RTX 4090 GPUs, using the SGD optimizer and a cosine learning rate schedule with an initial learning rate of 1 × 10⁻³. Continuous event streams are segmented according to the timestamps of the corresponding RGB frames and are subsequently mapped to image-like event representations. Additionally, we incorporate data augmentation strategies during training, including random horizontal flipping and multi-scale training. To comprehensively evaluate the detection performance and computational efficiency of different approaches, we employ the COCO mean average precision (mAP) [56] and the theoretical power consumption (mJ). The mAP metric evaluates localization and classification accuracy across multiple Intersection-over-Union (IoU) thresholds. The theoretical power consumption is estimated from the number of multiply-accumulate (MAC) and accumulate (AC) operations required by the models.
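For reference, a common way to estimate such theoretical energy in the SNN literature is sketched below; the per-operation energies (4.6 pJ per MAC and 0.9 pJ per AC, typical 32-bit floating-point figures for 45 nm CMOS) are assumed defaults, since the paper does not state the values it uses.

```python
def theoretical_energy_mj(num_mac, num_ac, e_mac_pj=4.6, e_ac_pj=0.9):
    """Estimate energy in millijoules from MAC/AC operation counts.
    Per-operation energies default to values commonly assumed in the SNN
    literature (32-bit FP, 45 nm CMOS); the paper does not specify its values."""
    return (num_mac * e_mac_pj + num_ac * e_ac_pj) * 1e-9  # pJ -> mJ

# Toy usage: 1e9 dense MACs (e.g., the encoding layer) and 5e9 sparse ACs.
print(theoretical_energy_mj(1e9, 5e9))   # ~9.1 mJ
```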

4.2. Performance Evaluation in Various Scenarios

To comprehensively evaluate the effectiveness and robustness of the models across different scenarios, we compare the detection performance under three distinct scenarios in the PKU-DAVIS-SOD dataset. For single-modality baselines, we use a single-branch model with the same architecture but without the spiking aggregation module to process either RGB frames or events independently.
Normal Scenarios: The performance evaluations of single-modality RGB frames, events, and RGB–event fusion under normal scenarios are presented in Table 1. It can be observed that the single-modality baseline utilizing RGB frames achieves significantly better performance compared to using events in normal scenarios. This performance gap arises because event cameras generate sparse events only in regions where brightness changes occur, lacking the texture and color information of objects, especially in static or low-relative-speed driving scenarios. In contrast, RGB cameras provide fine-grained texture and color details under normal scenarios. Therefore, in normal scenarios, event streams can serve as supplementary information to RGB frames to achieve high-performance object detection. Notably, our SFDNet achieves a performance improvement of 0.8% mAP compared to the single-modality baseline using RGB frames.
Motion Blur Scenarios: As illustrated in Table 1, the baseline using RGB frames experiences a more significant performance degradation compared to the baseline using events. Notably, the baseline using events even outperforms the baseline using RGB frames in the AP50 for “Two-wheeler”. This is due to the motion blur effect in RGB frames, especially for objects with high relative speeds. In contrast, event cameras can capture high-speed moving objects due to their asynchronous and low-latency characteristics. Our SFDNet improves detection performance by 1.2% mAP over the baseline using RGB frames by incorporating event streams.
Low-Light Scenarios: The performance comparison between normal and low-light scenarios in Table 1 reveals that the baseline using RGB frames suffers a significant decline of 18.7% mAP, while the baseline using events only decreases by 6.1% mAP. Notably, the AP50 for “Two-wheeler” and “Car” in the baseline using events even slightly improves compared to the normal scenario. This is because event cameras excel in extreme lighting conditions due to their high dynamic range. Consequently, our SFDNet achieves a significant performance improvement over the baseline using RGB frames (44.3% vs. 37.9% mAP50).
Overall, our SFDNet demonstrates comprehensive performance improvements over single-modality baselines across all three scenarios, proving that RGB–event fusion enhances both robustness and accuracy in object detection under diverse conditions. Specifically, SFDNet achieves average improvements of 18.5% and 2.4% mAP50 compared to the baselines using events and RGB frames, respectively. As shown in Figure 3, representative visual results across the three scenarios further highlight the effectiveness of our SFDNet.

4.3. Comparison with State-of-the-Art Methods

In this section, we present a comprehensive evaluation of SFDNet against state-of-the-art methods on the PKU-DAVIS-SOD and DSEC-Detection datasets. As summarized in Table 2, we compare SFDNet with current event-based object detectors, RGB-based object detectors, and three fusion-based object detectors (SODFormer [14], ReNet [17], and FAOD [18]). We also employ the dual-pathway spiking backbone, YOLOX head, additive fusion operation, and LIF neurons to construct SpikeYOLOX, which serves as the SNN baseline for processing RGB–event fusion data. As depicted in Table 2, SFDNet significantly outperforms single-modality methods by a large margin. For example, our SFDNet achieves improvements of 3.4% and 12.1% mAP over the RGB-based RVT [11] on the PKU-DAVIS-SOD and DSEC-Detection datasets, respectively, highlighting the importance of multimodal fusion for achieving high-performance and robust object detection. Furthermore, when compared to transformer-based detectors, such as DETR [20] using RGB frames, SFDNet improves the mAP from 27.5% to 31.3% on the PKU-DAVIS-SOD dataset, while reducing power consumption by a factor of 115.4. This demonstrates the significant advantages of our approach in terms of both power efficiency and performance.
Compared to state-of-the-art CNN- and transformer-based models, our SFDNet also achieves higher detection performance with significantly lower power consumption. For instance, SFDNet outperforms SODFormer by 10.6% mAP on PKU-DAVIS-SOD, while employing fewer parameters and consuming 32.3× less power. SFDNet also achieves an impressive 8.8% mAP improvement over FAOD with 30.4× lower power consumption on DSEC-Detection. The computational overhead of the spatiotemporal transformer architecture in SODFormer and the multi-layer LSTM modules in FAOD results in substantial power and resource costs. However, SFDNet achieves high-performance computation with low power consumption. Compared to vanilla SNNs such as SpikeYOLOX, our SFDNet achieves a significant performance improvement with only a slight increase in power consumption. Overall, our SFDNet strikes an effective balance between accuracy and efficiency, making it highly promising for deployment on resource-constrained edge computing platforms.
As illustrated in Figure 4, we further present representative visual comparison results on DSEC-Detection, comparing SFDNet with FAOD and SpikeYOLOX. The results indicate that our SFDNet surpasses the state-of-the-art detectors, suggesting that SFDNet can achieve high-performance and robust object detection in complex and challenging scenarios, combining the advantages of both RGB frames and events.
To investigate the power consumption of different model components, we conduct a detailed analysis of the power consumption across various modules (backbone, detection head, and spiking aggregation module) of SFDNet on the PKU-DAVIS-SOD dataset. The power consumption for each module is as follows: backbone (5.8 mJ), detection head (2.9 mJ), and spiking aggregation module (0.2 mJ). The results show that the backbone’s feature extraction accounts for the largest portion of the computational power, while the spiking aggregation module consumes only 0.2 mJ, constituting approximately 2.2% of the total model power consumption. These findings not only demonstrate that the lightweight spiking aggregation module achieves efficient multimodal fusion with minimal computational cost, but also highlight the potential for further optimization of the backbone to enhance computational efficiency and achieve greater energy savings in future work.

4.4. Ablation Studies

In this section, we conduct an ablation study to investigate the impact of spiking neurons, the spiking aggregation module, and event representation on model performance. First, we analyze the influence of spiking neurons and the spiking aggregation module on the final detection performance. We use a single-branch SpikeYOLOX model with vanilla LIF neurons that process RGB frames as the baseline. As shown in Table 3, both LIMF neurons and the spiking aggregation module consistently improve the model’s performance on the PKU-DAVIS-SOD dataset, with only a minor increase in power consumption. Specifically, replacing LIF neurons with LIMF neurons results in an 8.6% improvement in mAP50, incurring a slight increase in power consumption (2.4 mJ). Furthermore, incorporating the lightweight spiking aggregation module into the LIMF-based SNNs results in an additional 2.4% improvement in mAP50, with only a marginal increase of 3.2 mJ in power consumption. Overall, our SFDNet achieves high-performance object detection by leveraging LIMF neurons and the spiking aggregation module.
To verify the generalization capability of the model, we evaluate the impact of different event representation methods on model performance on the PKU-DAVIS-SOD dataset, including event images, event histograms, and event temporal images. It is worth noting that the different event representation methods exhibit similar computational complexity and do not involve any learnable parameters. As a result, they do not significantly affect computational efficiency or model complexity. As demonstrated in Table 4, the model’s performance exhibits only slight variations across different event representations, indicating that SFDNet provides an effective and general framework for fusing RGB frames and event streams for object detection. Notably, event temporal images achieve superior detection performance compared to the event histograms and event images, with improvements of 1.4% and 0.8% mAP50, respectively. This suggests that an appropriate event representation method can contribute to achieving high-performance object detection in our SFDNet.
To analyze the influence of multi-scale feature extraction, we conduct an experiment comparing SFDNet with and without this module on the PKU-DAVIS-SOD dataset. Specifically, we replace multi-scale feature extraction with single-scale feature extraction, where the input is directly processed through a wider 3 × 3 convolutional layer without splitting. As shown in Table 5, the multi-scale design improves detection accuracy by 0.6% mAP while maintaining parameter efficiency.
To demonstrate the effectiveness of the fusion strategy, we compare our fusion approach with the early-fusion method [15], where the multimodal inputs are concatenated before being fed into the network. As shown in Figure 5, the results clearly indicate that our proposed method significantly outperforms the early-fusion method by a large margin on the PKU-DAVIS-SOD dataset. For instance, our fusion approach achieves a 4.0% improvement in mAP over the early-fusion method. The experimental results are consistent with prior work [15], which show that early-fusion models are more susceptible to perturbations in one modality, negatively affecting their overall performance.
Moreover, we conduct an ablation study to analyze the impact of different fusion modules on the PKU-DAVIS-SOD dataset. Specifically, we compare our proposed spiking aggregation module against the concatenation method [15] and the transformer-based fusion module [31]. As presented in Table 6, the experimental results show that our spiking aggregation module outperforms the typical fusion modules. More precisely, the spiking aggregation module yields notable improvements of 2.5% and 1.8% in mAP over the concatenation method and transformer-based fusion module, respectively.

4.5. Discussion

In this section, we discuss both the generality and limitations of SFDNet. In our experiments, we tested SFDNet on two diverse datasets: PKU-DAVIS-SOD and DSEC-Detection, which were collected using different event camera sensors. Specifically, PKU-DAVIS-SOD was recorded with the DAVIS346 event camera, while DSEC-Detection utilized a combination of the Prophesee Gen3.1M event camera and the FLIR BlackFly S USB3 RGB camera. Both datasets cover a wide range of environmental conditions, ensuring that both the training and test sets include diverse sequences. The results show that SFDNet performs well across various sensor types and environmental conditions, indicating its strong potential for real-world deployment.
Although SFDNet achieves superior detection performance across a variety of scenes, there are still several limitations and challenges that need to be addressed in future work. As illustrated in Figure 6, representative failure cases from the DSEC-Detection dataset highlight challenges in accurately detecting small or occluded objects. Moreover, the failure cases indicate that SFDNet struggles to perform robust detection for objects from categories with fewer samples, such as bicycles. This issue may be due to class imbalance present in the datasets. To mitigate the impact of class imbalance, we plan to explore additional data augmentation techniques and loss functions, such as Focal Loss [62].

5. Conclusions

In this paper, we propose a novel fully spiking object detection framework, termed SFDNet, which achieves low-power and robust object detection by effectively integrating event streams and RGB frames. To mitigate information loss and enhance feature representation in SNNs, we introduce the LIMF neuron model, which incorporates both soft and hard reset mechanisms. Furthermore, we develop a multi-scale hierarchical spiking residual attention network for efficient and effective dual-modality feature extraction. Finally, we design a lightweight spiking aggregation module that integrates RGB and event features while minimizing computational cost and power consumption. Extensive experimental results demonstrate that SFDNet achieves state-of-the-art detection performance with remarkably low power consumption while exhibiting robust detection capabilities in challenging scenarios such as motion blur and low-light conditions. We believe that this study significantly advances the applicability of SNNs for real-world object detection tasks, paving the way for energy-efficient and high-performance solutions in autonomous driving.

Author Contributions

Conceptualization, L.F. and H.S.; methodology, L.F. and J.Y.; validation, L.F., X.L., J.Z. and J.Y.; formal analysis, X.L.; writing—original draft preparation, L.F.; writing—review and editing, L.W., H.S. and J.Z.; visualization, J.Y.; supervision, H.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sun, K.; Wu, W.; Liu, T.; Yang, S.; Wang, Q.; Zhou, Q.; Ye, Z.; Qian, C. FAB: A robust facial landmark detection framework for motion-blurred videos. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5462–5471. [Google Scholar]
  2. Sayed, M.; Brostow, G. Improved handling of motion blur in online object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1706–1716. [Google Scholar]
  3. Hu, Y.; Delbruck, T.; Liu, S.C. Learning to exploit multiple vision modalities by using grafted networks. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 85–101. [Google Scholar]
  4. Zhang, J.; Newman, J.P.; Wang, X.; Thakur, C.S.; Rattray, J.; Etienne-Cummings, R.; Wilson, M.A. A closed-loop, all-electronic pixel-wise adaptive imaging system for high dynamic range videography. IEEE Trans. Circuits Syst. I Regul. Pap. 2020, 67, 1803–1814. [Google Scholar] [CrossRef] [PubMed]
  5. Brandli, C.; Berner, R.; Yang, M.; Liu, S.C.; Delbruck, T. A 240 × 180 130 db 3 μs latency global shutter spatiotemporal vision sensor. IEEE J. Solid-State Circuits 2014, 49, 2333–2341. [Google Scholar] [CrossRef]
  6. Moeys, D.P.; Corradi, F.; Li, C.; Bamford, S.A.; Longinotti, L.; Voigt, F.F.; Berry, S.; Taverni, G.; Helmchen, F.; Delbruck, T. A sensitive dynamic and active pixel vision sensor for color or neural imaging applications. IEEE Trans. Biomed. Circuits Syst. 2017, 12, 123–136. [Google Scholar] [CrossRef] [PubMed]
  7. Lichtsteiner, P.; Posch, C.; Delbruck, T. A 128 ×128 120 dB 15μs latency asynchronous temporal contrast vision sensor. IEEE J. Solid-State Circuits 2008, 43, 566–576. [Google Scholar] [CrossRef]
  8. Finateu, T.; Niwa, A.; Matolin, D.; Tsuchimoto, K.; Mascheroni, A.; Reynaud, E.; Mostafalu, P.; Brady, F.; Chotard, L.; LeGoff, F.; et al. A 1280× 720 back-illuminated stacked temporal contrast event-based vision sensor with 4.86 μm pixels, 1.066 GEPS readout, programmable event-rate controller and compressive data-formatting pipeline. In Proceedings of the 2020 IEEE International Solid-State Circuits Conference-(ISSCC), San Francisco, CA, USA, 16–20 February 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 112–114. [Google Scholar]
  9. Fan, L.; Li, Y.; Shen, H.; Li, J.; Hu, D. From Dense to Sparse: Low-Latency and Speed-Robust Event-Based Object Detection. IEEE Trans. Intell. Veh. 2024. [Google Scholar] [CrossRef]
  10. Gallego, G.; Delbrück, T.; Orchard, G.; Bartolozzi, C.; Taba, B.; Censi, A.; Leutenegger, S.; Davison, A.J.; Conradt, J.; Daniilidis, K.; et al. Event-based vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 154–180. [Google Scholar] [CrossRef]
  11. Gehrig, M.; Scaramuzza, D. Recurrent vision transformers for object detection with event cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13884–13893. [Google Scholar]
  12. Li, J.; Dong, S.; Yu, Z.; Tian, Y.; Huang, T. Event-Based Vision Enhanced: A Joint Detection Framework in Autonomous Driving. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo, Shanghai, China, 8–12 July 2019; pp. 1396–1401. [Google Scholar] [CrossRef]
  13. Jiang, Z.; Xia, P.; Huang, K.; Stechele, W.; Chen, G.; Bing, Z.; Knoll, A. Mixed frame-/event-driven fast pedestrian detection. In Proceedings of the International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 8332–8338. [Google Scholar]
  14. Li, D.; Li, J.; Tian, Y. SODFormer: Streaming Object Detection with Transformer Using Events and Frames. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 14020–14037. [Google Scholar] [CrossRef]
  15. Tomy, A.; Paigwar, A.; Mann, K.S.; Renzaglia, A.; Laugier, C. Fusing event-based and rgb camera for robust object detection in adverse conditions. In Proceedings of the International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 933–939. [Google Scholar]
  16. Cao, H.; Zhang, Z.; Xia, Y.; Li, X.; Xia, J.; Chen, G.; Knoll, A. Embracing events and frames with hierarchical feature refinement network for object detection. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 161–177. [Google Scholar]
  17. Zhou, Z.; Wu, Z.; Boutteau, R.; Yang, F.; Demonceaux, C.; Ginhac, D. RGB-event fusion for moving object detection in autonomous driving. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 7808–7815. [Google Scholar]
  18. Zhang, H.; Wang, X.; Xu, C.; Wang, X.; Xu, F.; Yu, H.; Yu, L.; Yang, W. Frequency-Adaptive Low-Latency Object Detection Using Events and Frames. arXiv 2024, arXiv:2412.04149. [Google Scholar]
  19. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef]
  20. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  21. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020; pp. 1–16. [Google Scholar]
  22. Wang, T.; Yuan, L.; Chen, Y.; Feng, J.; Yan, S. PnP-DETR: Towards efficient visual analysis with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 4661–4670. [Google Scholar]
  23. Perot, E.; De Tournemire, P.; Nitti, D.; Masci, J.; Sironi, A. Learning to detect objects with a 1 megapixel event camera. Adv. Neural Inf. Process. Syst. 2020, 33, 16639–16652. [Google Scholar]
  24. Lu, Y.; Lu, C.; Tang, C.K. Online video object detection using association LSTM. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2344–2352. [Google Scholar]
  25. Su, Q.; Chou, Y.; Hu, Y.; Li, J.; Mei, S.; Zhang, Z.; Li, G. Deep directly-trained spiking neural networks for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6555–6565. [Google Scholar]
  26. Kim, S.; Park, S.; Na, B.; Yoon, S. Spiking-YOLO: Spiking neural network for energy-efficient object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11270–11277. [Google Scholar]
  27. Luo, X.; Yao, M.; Chou, Y.; Xu, B.; Li, G. Integer-valued training and spike-driven inference spiking neural network for high-performance and energy-efficient object detection. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 October 2025; pp. 253–272. [Google Scholar]
  28. Yao, M.; Hu, J.; Hu, T.; Xu, Y.; Zhou, Z.; Tian, Y.; XU, B.; Li, G. Spike-driven Transformer V2: Meta Spiking Neural Network Architecture Inspiring the Design of Next-generation Neuromorphic Chips. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  29. Zhang, H.; Li, Y.; He, B.; Fan, X.; Wang, Y.; Zhang, Y. Direct training high-performance spiking neural networks for object recognition and detection. Front. Neurosci. 2023, 17, 1229951. [Google Scholar] [CrossRef] [PubMed]
  30. Lien, H.H.; Chang, T.S. Sparse compressed spiking neural network accelerator for object detection. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 69, 2060–2069. [Google Scholar] [CrossRef]
  31. Chitta, K.; Prakash, A.; Jaeger, B.; Yu, Z.; Renz, K.; Geiger, A. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 12878–12895. [Google Scholar] [CrossRef] [PubMed]
  32. Tulyakov, S.; Gehrig, D.; Georgoulis, S.; Erbach, J.; Gehrig, M.; Li, Y.; Scaramuzza, D. Time lens: Event-based video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16155–16164. [Google Scholar]
  33. Duan, P.; Wang, Z.; Shi, B.; Cossairt, O.; Huang, T.; Katsaggelos, A. Guided event filtering: Synergy between intensity images and neuromorphic events for high performance imaging. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 8261–8275. [Google Scholar] [CrossRef]
  34. Zhang, J.; Yang, X.; Fu, Y.; Wei, X.; Yin, B.; Dong, B. Object tracking by jointly exploiting frame and event domain. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 13043–13052. [Google Scholar]
  35. Gehrig, D.; Rüegg, M.; Gehrig, M.; Hidalgo-Carrió, J.; Scaramuzza, D. Combining events and frames using recurrent asynchronous multimodal networks for monocular depth prediction. IEEE Robot. Automat. Lett. 2021, 6, 2822–2829. [Google Scholar] [CrossRef]
  36. Zuo, Y.F.; Yang, J.; Chen, J.; Wang, X.; Wang, Y.; Kneip, L. DEVO: Depth-event camera visual odometry in challenging conditions. In Proceedings of the International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 2179–2185. [Google Scholar]
  37. Gao, L.; Liang, Y.; Yang, J.; Wu, S.; Wang, C.; Chen, J.; Kneip, L. VECtor: A Versatile event-centric benchmark for multi-sensor SLAM. IEEE Robot. Automat. Lett. 2022, 7, 8217–8224. [Google Scholar] [CrossRef]
  38. Kim, S.; Park, S.; Na, B.; Kim, J.; Yoon, S. Towards fast and accurate object detection in bio-inspired spiking neural networks through Bayesian optimization. IEEE Access 2020, 9, 2633–2643. [Google Scholar] [CrossRef]
  39. Yuan, M.; Zhang, C.; Wang, Z.; Liu, H.; Pan, G.; Tang, H. Trainable Spiking-YOLO for low-latency and high-performance object detection. Neural Netw. 2024, 172, 106092. [Google Scholar] [CrossRef]
  40. Li, Y.; He, X.; Dong, Y.; Kong, Q.; Zeng, Y. Spike calibration: Fast and accurate conversion of spiking neural network for object detection and segmentation. arXiv 2022, arXiv:2207.02702. [Google Scholar]
  41. Chen, N.F. Pseudo-labels for supervised learning on dynamic vision sensor data, applied to object detection under ego-motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 644–653. [Google Scholar]
  42. Maqueda, A.I.; Loquercio, A.; Gallego, G.; García, N.; Scaramuzza, D. Event-based vision meets deep learning on steering prediction for self-driving cars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 5419–5427. [Google Scholar]
  43. Sironi, A.; Brambilla, M.; Bourdis, N.; Lagorce, X.; Benosman, R. HATS: Histograms of averaged time surfaces for robust event-based object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 1731–1740. [Google Scholar]
  44. Zhu, A.Z.; Yuan, L.; Chaney, K.; Daniilidis, K. Unsupervised event-based learning of optical flow, depth, and egomotion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 989–997. [Google Scholar]
  45. Hu, Y.; Deng, L.; Wu, Y.; Yao, M.; Li, G. Advancing spiking neural networks toward deep residual learning. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 2353–2367. [Google Scholar] [CrossRef]
  46. Yao, M.; Zhao, G.; Zhang, H.; Hu, Y.; Deng, L.; Tian, Y.; Xu, B.; Li, G. Attention spiking neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9393–9410. [Google Scholar] [CrossRef] [PubMed]
  47. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  48. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  49. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  50. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  51. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  52. Maass, W. Networks of spiking neurons: The third generation of neural network models. Neural Netw. 1997, 10, 1659–1671. [Google Scholar] [CrossRef]
  53. Gerstner, W.; Kistler, W.M.; Naud, R.; Paninski, L. Neuronal Dynamics: From Single Neurons to Networks and Models of Cognition; Cambridge University Press: Cambridge, UK, 2014. [Google Scholar]
  54. Gehrig, D.; Scaramuzza, D. Low-latency automotive vision with event cameras. Nature 2024, 629, 1034–1040. [Google Scholar] [CrossRef]
  55. Yin, B.; Corradi, F.; Bohté, S.M. Accurate and efficient time-domain classification with adaptive spiking recurrent neural networks. Nat. Mach. Intell. 2021, 3, 905–913. [Google Scholar] [CrossRef]
  56. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 740–755. [Google Scholar]
  57. Li, J.; Li, J.; Zhu, L.; Xiang, X.; Huang, T.; Tian, Y. Asynchronous spatio-temporal memory network for continuous event-based object detection. IEEE Trans. Image Process 2022, 31, 2975–2987. [Google Scholar] [CrossRef]
  58. Liu, B.; Xu, C.; Yang, W.; Yu, H.; Yu, L. Motion robust high-speed light-weighted object detection with event camera. IEEE Trans. Instru. Meas. 2023, 72, 1–13. [Google Scholar] [CrossRef]
  59. Zubić, N.; Gehrig, M.; Scaramuzza, D. State Space Models for Event Cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024. [Google Scholar]
  60. Tu, Z.; Talebi, H.; Zhang, H.; Yang, F.; Milanfar, P.; Bovik, A.; Li, Y. Maxvit: Multi-axis vision transformer. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 459–479. [Google Scholar]
  61. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  62. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
Figure 1. The pipeline of the proposed spiking RGB–event fusion-based detection network (SFDNet). The event stream is first transformed into event representations. Subsequently, we design dual-pathway spiking backbones for feature extraction. A lightweight spiking aggregation module is then employed to generate fused features, which are fed into the spiking detection head.
Figure 2. Comparison of the LIF neuron and our Leaky Integrate-and-Multi-Fire (LIMF) neuron. (a) The layer-wise LIF model. (b) The layer-wise LIMF model. (c) The response of the LIF neuron to supra-threshold stimuli of varying intensities. (d) The response of the LIMF neuron to supra-threshold stimuli of varying intensities.
Figure 3. Representative visualization results of our SFDNet compared to single-modality baselines using either events or RGB frames. The results are evaluated across three scenarios on the PKU-DAVIS-SOD dataset: normal illumination, motion blur, and low-light conditions.
Figure 4. Representative examples of different object detection results on the DSEC-Detection dataset. FP and FN refer to false positives and false negatives, respectively.
Figure 5. Performance comparison between our fusion approach and early-fusion method on the PKU-DAVIS-SOD dataset.
Figure 6. Representative failure cases of our SFDNet on the DSEC-Detection dataset. FP and FN refer to false positives and false negatives, respectively.
Table 1. Performance evaluation of our SFDNet and single-modality baselines in various scenarios on the PKU-DAVIS-SOD dataset. The single-modality baselines process the RGB frames or events without the spiking aggregation module.
| Scenario | Modality | AP50 (Car) | AP50 (Pedestrian) | AP50 (Two-Wheeler) | mAP | mAP50 |
|---|---|---|---|---|---|---|
| Normal | Events | 54.4 | 29.4 | 52.8 | 21.1 | 45.5 |
| Normal | Frames | 82.8 | 48.3 | 67.3 | 35.0 | 66.1 |
| Normal | Frames + Events | 83.0 | 49.5 | 69.9 | 35.8 | 67.5 |
| Motion blur | Events | 44.5 | 13.6 | 49.0 | 16.3 | 35.7 |
| Motion blur | Frames | 69.4 | 35.3 | 46.9 | 25.8 | 50.5 |
| Motion blur | Frames + Events | 71.3 | 34.6 | 52.6 | 27.0 | 52.8 |
| Low-light | Events | 54.6 | 1.8 | 53.2 | 15.0 | 36.5 |
| Low-light | Frames | 59.8 | 12.4 | 41.5 | 16.3 | 37.9 |
| Low-light | Frames + Events | 62.3 | 12.4 | 58.2 | 19.6 | 44.3 |
| All | Events | 52.6 | 22.2 | 51.7 | 19.1 | 42.2 |
| All | Frames | 78.9 | 40.8 | 55.3 | 30.1 | 58.3 |
| All | Frames + Events | 79.6 | 42.0 | 60.4 | 31.3 | 60.7 |
Table 2. Comparison with state-of-the-art methods on the PKU-DAVIS-SOD and DSEC-Detection datasets.
| Modality | Method | Backbone | mAP (PKU-DAVIS-SOD [14]) | mAP50 (PKU-DAVIS-SOD) | Power (mJ, PKU-DAVIS-SOD) | mAP (DSEC-Detection [54]) | AP50 (DSEC-Detection) | Power (mJ, DSEC-Detection) | Params (M) |
|---|---|---|---|---|---|---|---|---|---|
| Events | ASTMNet [57] | CNN + RNN | - | 29.1 | - | - | - | - | >100 |
| Events | AED [58] | CNN | 23.0 | 45.7 | - | 27.1 | 43.2 | - | 15.5 |
| Events | VIT-S5 [59] | Transformer + SSM | 23.2 | 46.6 | - | 23.8 | 38.7 | - | 18.2 |
| Events | RVT [11] | Transformer + RNN | 25.6 | 50.3 | - | 27.7 | 44.2 | - | 18.5 |
| Frame | YoloX [49] | CNN | 27.4 | 50.9 | - | 38.5 | 57.8 | - | 16.5 |
| Frame | MaxVit [60] | Transformer | 26.8 | 50.5 | - | 32.8 | 51.0 | - | 15.7 |
| Frame | Swins [61] | Transformer | 27.7 | 52.3 | - | 34.1 | 52.0 | - | 15.8 |
| Frame | VIT-S5 [59] | Transformer + SSM | 28.2 | 52.2 | - | 33.2 | 49.6 | - | 18.1 |
| Frame | RVT [11] | Transformer + RNN | 27.9 | 53.0 | - | 39.2 | 61.0 | - | 18.5 |
| Frame | DETR [20] | Transformer | 27.5 | 56.2 | 1027.3 | 44.8 | 68.2 | 1028.1 | 41.3 |
| Fusion | SODFormer [14] | Transformer | 20.7 | 50.4 | 287.5 | - | - | - | 82.5 |
| Fusion | ReNet [17] | CNN | 28.8 | 54.9 | - | 31.6 | 49.0 | - | 59.8 |
| Fusion | FAOD [18] | CNN + RNN | 30.5 | 57.5 | 792.5 | 42.5 | 63.5 | 960.7 | 20.3 |
| Fusion | SpikeYOLOX [49] | SNN | 24.0 | 51.4 | 5.1 | 34.8 | 58.8 | 17.4 | 57.4 |
| Fusion | SFDNet (ours) | SNN | 31.3 | 60.7 | 8.9 | 51.3 | 73.3 | 31.6 | 58.1 |
Table 3. The impacts of spiking neurons and the spiking aggregation (SA) module on the overall model performance.
| LIF | LIMF | SA Module | mAP | mAP50 | Power (mJ) | Params (M) |
|---|---|---|---|---|---|---|
| ✓ | | | 23.8 | 49.7 | 3.3 | 42.2 |
| ✓ | | ✓ | 24.6 | 51.6 | 5.2 | 58.1 |
| | ✓ | | 30.1 | 58.3 | 5.7 | 42.2 |
| | ✓ | ✓ | 31.3 | 60.7 | 8.9 | 58.1 |
Table 4. Comparison of our SFDNet with various event representations on the PKU-DAVIS-SOD dataset.
| Method | mAP | mAP50 | mAP75 |
|---|---|---|---|
| Histogram [42] | 30.8 | 59.3 | 27.6 |
| Event images [41] | 30.6 | 59.9 | 27.0 |
| Event temporal images [9] | 31.3 | 60.7 | 27.9 |
Table 5. The influence of multi-scale feature extraction in our SFDNet on the PKU-DAVIS-SOD dataset.
| Method | mAP | mAP50 | mAP75 | Params (M) |
|---|---|---|---|---|
| SFDNet (without multi-scale) | 30.7 | 59.7 | 27.3 | 63.7 |
| SFDNet (with multi-scale) | 31.3 (+0.6) | 60.7 (+1.0) | 27.9 (+0.6) | 58.1 |
Table 6. Comparison of different fusion modules on the PKU-DAVIS-SOD dataset.
| Method | mAP | mAP50 | mAP75 | Params (M) |
|---|---|---|---|---|
| Concatenation [15] | 28.8 | 58.3 | 24.8 | 58.1 |
| Fusion transformer [31] | 29.5 | 58.0 | 25.9 | 68.4 |
| Spiking aggregation module (ours) | 31.3 | 60.7 | 27.9 | 58.1 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Fan, L.; Yang, J.; Wang, L.; Zhang, J.; Lian, X.; Shen, H. Efficient Spiking Neural Network for RGB–Event Fusion-Based Object Detection. Electronics 2025, 14, 1105. https://doi.org/10.3390/electronics14061105


