1. Introduction
In recent years, Spiking Neural Networks (SNNs) have attracted significant attention for providing energy-efficient, real-time solutions in event-driven perception tasks such as object tracking and recognition [1,2]. Beyond algorithmic accuracy, sensor-oriented systems emphasise low-latency inference, bounded memory footprint, predictable compute under varying event rates, and energy proportionality on neuromorphic and embedded platforms [3]. These properties make spike-based computation well suited to high-speed imaging under high dynamic range conditions, in which frame-based pipelines often suffer from motion blur or saturation [4,5]. This perspective motivates architectures co-designed with the sensing pipeline and evaluated with sensor-relevant indicators (latency, event throughput, spike rate, and an energy proxy), positioning SNNs as a compelling alternative for edge-deployed tracking and recognition. The objective of this study is to identify the most effective methods for performing inference at or near the sensor while adhering to stringent energy and latency constraints. Exploiting event sparsity, microsecond timestamps and polarity encoding ensures that computation is performed only when informative changes occur, reducing redundant operations and power consumption.
Despite the advantages of Spiking Neural Networks, several challenges remain in optimising their training and handling the spatio-temporal complexity of event-driven data. A significant challenge is the training of SNNs, which is hindered by the non-differentiability of the spike activation function [6,7]. Early research attempted to apply traditional backpropagation (BP) algorithms, but discontinuities in the activation function made gradient computation infeasible [8]. To address this issue, surrogate gradient methods were introduced to approximate the gradients, enabling the application of BP to SNNs [9,10]. These methods provide continuous approximations to the spike function, overcoming the gradient issue and facilitating more effective training. Training on asynchronous event streams remains non-trivial, yet surrogate gradients, modules that exploit sparsity and a loss that weights events make optimisation tractable. The efficiency claim refers to inference, where sparse spikes reduce active operations and memory access.
The potential of SNNs is particularly evident when paired with Dynamic Vision Sensors (DVS)—neuromorphic image sensors that capture asynchronous events corresponding to scene changes with microsecond-level latency—thus ensuring high temporal resolution and low latency [11,12,13,14]. In a typical representation, each event is denoted as $e = (x, y, t, p)$, where $(x, y)$ is the pixel location, $t$ is the timestamp, and $p$ is the polarity. An event is an atomic brightness change that occurs at one pixel and one time; recognition and tracking aggregate many events from many pixels over time, rather than following a single pixel.
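For concreteness, the following minimal sketch stores this tuple as a structured NumPy array; the field names, dtypes and example values are illustrative and not part of the framework specification.

```python
import numpy as np

# Illustrative structured dtype for the event tuple e = (x, y, t, p).
event_dtype = np.dtype([
    ("x", np.uint16),   # pixel column
    ("y", np.uint16),   # pixel row
    ("t", np.int64),    # timestamp in microseconds
    ("p", np.int8),     # polarity: +1 (ON, brightness increase) or -1 (OFF, decrease)
])

# A toy stream of three events, ordered by timestamp.
events = np.array([(12, 40, 1_000, 1),
                   (13, 40, 1_250, 1),
                   (80, 22, 1_400, -1)], dtype=event_dtype)

print(events["t"])  # -> [1000 1250 1400]
```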
Figure 1 summarises a DVS-driven neuromorphic sensing pipeline and a pure spiking neural network framework (DTEASN), covering hardware, event generation, time-window slicing, spiking inference, and object-level outputs. However, integrating event cameras with SNNs is challenging due to the sparse and irregular nature of the event data, which makes it difficult for traditional SNN models to process these events efficiently in real time [15,16,17]. To address this challenge, recent research has investigated rate coding and transformer-based attention mechanisms to manage sparsity and enhance feature extraction from event streams [18,19,20]. While these approaches have improved accuracy by focusing on significant events and optimising the encoding process [21,22,23], the full potential of SNNs in real-time applications remains unrealised, particularly when sensor characteristics (asynchrony, sparsity, and polarity) are not explicitly taken into account.
Conventional pipelines built on reconstructed frames, optical-flow trackers or recurrent models for dense video require substantial computation and memory traffic because sampling is synchronous and background pixels are redundant. The DVS-aligned spiking pipeline instead exploits sparse event streams and binary spike operations at inference, which lowers computational activity and improves the energy and latency profile on edge platforms.
Dynamic Vision Sensors are used in a variety of time-critical contexts, including high-speed robotics, HDR automotive perception under flicker and low-light conditions, gesture-driven human–computer interaction, UAV navigation, and industrial inspection. These sensors offer temporal resolution on the order of microseconds, a wide dynamic range, and sparse outputs free of motion blur. A gap remains on the modelling side, however: inference must recover spatio-temporal structure efficiently on constrained edge hardware. We therefore propose a sensor-aligned, pure spiking pipeline that reduces frame and tensor overhead while preserving low latency and favourable energy behaviour, and we report and discuss its effects on latency, event throughput, spike activity and an energy proxy [12]. In this context, high dynamic range describes the imaging conditions rather than the scene. Sensor and data properties used in this study, including spatial resolution, polarity channels and temporal windowing, are summarised in Section 3.1, and full reproducibility details are provided in Appendix A.
Spiking neural networks are of increasing importance, both as standalone models and as the computational core for event camera perception. Recent neuromorphic platforms, including programmable digital chips such as TrueNorth, Loihi 2 and SpiNNaker 2 [24,25,26], as well as CMOS memristive arrays and superconducting Josephson junction devices, demonstrate practical routes to energy and area efficiency [27,28]. This study is software-only. We implement a sensor-aligned, purely spiking recognition backbone with lightweight tracking in order to meet the edge constraints of event-driven updates, sparse activity and limited memory. Alongside task accuracy, the Results section reports latency, event throughput, spike activity and an energy proxy—resource indicators aligned with event-sparse inference—to substantiate the efficiency claim.
Event-camera tracking in real time encounters abrupt target and ego motion that produce sparse yet bursty events, large and rapid scale or aspect changes, partial or full occlusions, and strong distractors in high-dynamic-range backgrounds; event rates vary over time, while memory, throughput, and energy on embedded platforms remain limited, so inference must remain stable at low latency. Let the DVS stream be asynchronous events $\{e_i = (x_i, y_i, t_i, p_i)\}$ with an initial target state $\mathbf{s}_0$ at $t_0$. At time $t_k$, the tracker estimates the state $\hat{\mathbf{s}}_k$ from events in a sliding window $\mathcal{W}_k = \{\, e_i : t_k - \Delta T < t_i \le t_k \,\}$ according to $\hat{\mathbf{s}}_k = f(\mathcal{W}_k, \hat{\mathbf{s}}_{k-1})$. The objective is to minimise localisation error under a bounded latency budget. The estimate uses a constant-velocity prior for stability and measurements derived from the clustering and event-driven attention features described in Figure 2 and Section 3, matching the notation already introduced in the method.
To address the challenges of real-time object tracking, this paper introduces the Dynamic Tracking with Event Attention Spiking Network (DTEASN). As a pure SNN framework, DTEASN eliminates the computational overhead of traditional CNN operations, thereby reducing GPU dependency [29,30], and aligns the processing with DVS signal characteristics via time-window slicing and polarity-aware operations. DTEASN incorporates a multi-scale, event-driven attention mechanism designed to focus on the most salient portions of the event stream, thereby enhancing feature extraction. Additionally, a spatio-temporal event convolver is introduced to optimise processing by capturing both spatial and temporal features. Another significant innovation is the Event-Weighted Spiking Loss (EW-SLoss), a loss function that enhances tracking accuracy by differentiating between relevant and irrelevant events and improving robustness to sensor noise. The framework also includes a lightweight event tracking mechanism to reduce computational burden and a custom synaptic connection rule to optimise information flow. We evaluate DTEASN on event-based recognition and tracking benchmarks, and report sensor-relevant system indicators (latency, event throughput (events/s), spike rate (spikes/s), memory footprint, and a simple energy proxy), showing superior performance in terms of accuracy, energy consumption, and computational efficiency compared with conventional methods. Classification uses the category sets defined by the benchmark datasets throughout the study, and species-level labels are not considered.
The primary contributions are summarised below:
- (1) Innovative attention and convolution structures: Sensor-aware spatio-temporal multi-scale attention and event convolution that are aligned with DVS event statistics to enhance feature extraction from raw event streams.
- (2) Event-Weighted Spiking Loss (EW-SLoss): A loss function that improves accuracy by prioritising informative events and enhancing robustness to sensor noise for event-driven SNNs.
- (3) Lightweight event tracking mechanism: An efficient tracking module that reduces resource usage while maintaining real-time performance for embedded sensing.
- (4) Pure SNN architecture: A CNN-free spiking architecture that reduces GPU dependency and better matches asynchronous DVS inputs.
3. Method
3.1. Sensor-Aligned Framework for DVS Event Streams
The input to the framework is a sequence of events produced by a Dynamic Vision Sensor. Each event carries pixel coordinates, a timestamp with microsecond resolution, and a polarity indicator. The stream is segmented into fixed temporal windows of 20 milliseconds with a stride of 10 milliseconds. Within each window, timestamps are normalised, and two polarity channels are maintained for ON and OFF events. A causal voxel grid is constructed over space and time to aggregate events while preserving temporal order. Spurious isolated events are suppressed, and the total event count per window is capped to exclude outliers. These settings are kept identical across all datasets used in the study, consistent with the examples depicted in Figure 2.
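A minimal sketch of this preprocessing is given below, assuming events are stored in the structured array format sketched earlier; the DAVIS346-sized resolution, number of temporal bins and event cap are illustrative values, and the isolated-event suppression step is omitted for brevity.

```python
import numpy as np

def slice_and_voxelise(events, sensor_hw=(260, 346), win_us=20_000,
                       stride_us=10_000, t_bins=5, max_events=50_000):
    """Slide a 20 ms window with a 10 ms stride over the stream and build a
    (2, t_bins, H, W) polarity-split voxel grid per window. The bin count and
    the per-window event cap are illustrative choices."""
    H, W = sensor_hw
    t0, t1 = events["t"].min(), events["t"].max()
    grids, start = [], t0
    while start + win_us <= t1:
        m = (events["t"] >= start) & (events["t"] < start + win_us)
        win = events[m]
        if len(win) > max_events:                       # cap outlier windows
            win = win[np.random.choice(len(win), max_events, replace=False)]
        grid = np.zeros((2, t_bins, H, W), dtype=np.float32)
        if len(win) > 0:
            tn = (win["t"] - start) / win_us            # normalise timestamps to [0, 1)
            tb = np.minimum((tn * t_bins).astype(int), t_bins - 1)
            pc = (win["p"] > 0).astype(int)             # 1 = ON channel, 0 = OFF channel
            np.add.at(grid, (pc, tb, win["y"].astype(int), win["x"].astype(int)), 1.0)
        grids.append(grid)
        start += stride_us
    return grids
```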
DVS event streams are processed end to end, with time-window slicing and polarity-aware operations preserving the asynchronous timing and sparsity characteristics inherent to DVS signals.
The overall framework depicted in Figure 2 integrates several novel modules for event-driven object tracking and learning. Initially, the event stream is processed by sampling the incoming events within a 20-millisecond time window, with a 10-millisecond sliding step. A validation sweep over window and step sizes is summarised in Table 1. These events are then mapped into a 3D spatiotemporal grid [47], which captures both spatial and temporal information, forming the foundation for accurate tracking and feature extraction. For each window $\mathcal{W}_k$ we process all events across the sensor that fall into the window. In other words, the slice is the set of events from all active pixels in that interval, which forms the basis for tracking and recognition.
At inference, each event window is converted into a voxel representation and then passed through the attention module and the spatio-temporal event convolver. The resulting features are temporally aggregated and forwarded to two heads. The classification head produces class logits corresponding to the dataset-level categories, and the predicted label is the class with the highest logit. The tracking head generates the target trajectory, which is stabilised by the Kalman block as outlined in the framework. This procedure is applied to all datasets used in the experiments.
The Shape Detection and Event Supplement block connects raw event clusters with the tracking and readout heads. Shape detection aggregates neighbouring ON and OFF events within each time window and then estimates a coarse contour and a centroid, which give the measurement spatial support and stabilise the Kalman update. The event supplement interpolates short gaps when the inter-event interval in a slice becomes large or when occlusion disrupts clusters. Interpolation is constrained by a small spatial radius and a short confirmation window derived from the bounding-box diagonal and the constant-velocity prior. The additional points preserve the asynchronous nature of the stream while maintaining polarity consistency, and their contribution is down-weighted by the attention module so that genuine measurements prevail. In practice, the block reduces identity switches under rapid scale or aspect changes and lowers the variance of the readout with negligible computational cost.
A key component of the framework is the Kalman Tracking Block, which predicts the object’s state, including position and velocity components $(x, y, v_x, v_y)$ [48]. The predicted state is then refined using a measurement update, ensuring that the object’s trajectory is accurately estimated over time. Concurrently, the Advanced Tracking Cluster employs a radius-based clustering technique to group the events based on proximity, utilising a convergence threshold. This method enables efficient object identification by calculating the weighted centre and performing a radius search for neighbouring events, ensuring robust tracking in noisy environments.
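The radius search and weighted-centre computation can be sketched as follows; the radius, tolerance and iteration cap are illustrative defaults rather than the paper's settings.

```python
import numpy as np

def radius_cluster_centre(xy, weights=None, radius=8.0, tol=0.5, max_iter=20):
    """Radius-based clustering sketch: starting from the weighted mean of all
    events in the slice, repeatedly restrict to events within `radius` pixels
    of the current centre and recompute the weighted centre, stopping once the
    shift falls below the convergence threshold `tol`."""
    xy = np.asarray(xy, dtype=float)                    # shape (N, 2)
    w = np.ones(len(xy)) if weights is None else np.asarray(weights, dtype=float)
    centre = np.average(xy, axis=0, weights=w)
    for _ in range(max_iter):
        d = np.linalg.norm(xy - centre, axis=1)
        inside = d <= radius                            # radius search around the centre
        if not inside.any():
            break
        new_centre = np.average(xy[inside], axis=0, weights=w[inside])
        shift = np.linalg.norm(new_centre - centre)
        centre = new_centre
        if shift < tol:                                 # convergence threshold reached
            break
    return centre
```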
The Event-Driven Multi-Scale Attention mechanism processes the event stream at multiple scales, capturing critical spatial features at both fine and coarse levels. It enhances feature extraction by focusing on the most relevant regions of the spatiotemporal grid. Concurrently, the Spatio-Temporal Event Convolver applies convolution kernels to the event data to extract essential features, including edges, motion, and centres. The resulting feature maps undergo response computation and polarity-aware enhancement, ensuring that the most significant events are given higher priority in the tracking process.
The extracted features are then processed by the hierarchical spiking network, which refines the tracking results across multiple layers and captures both fine-grained details and broader object dynamics. A distinguishing element of the framework is the Event Weighted Spiking Loss, which assigns higher importance to events with greater spatiotemporal relevance. Here, spatiotemporal relevance denotes the evidential strength of an event measured by three factors: closeness in time to neighbouring events within the current slice, coherence in space with the local motion pattern estimated from the discrete velocity and acceleration described in Equations (4) and (5), and consistency of polarity with the dominant local contrast. These factors are combined and normalised to produce an attention weight as formalised in Equations (9)–(12), and this weight both scales the synaptic input during inference and multiplies the per-event error in the loss defined in Equation (15). The Event Weighted Readout then aggregates the weighted responses with class prototypes to obtain the final classification, yielding a lightweight design that supports real-time event tracking with modest computational resources. As an instrumentation note, event data were acquired with a DAVIS346 event+frame camera (iniVation AG, Zurich, Switzerland), and experiments used the vendor DV-Platform where applicable.
3.2. Event-Driven Object Tracking
The tracking method processes an asynchronous event stream $\mathcal{E} = \{e_i\}_{i=1}^{N}$ with $e_i = (x_i, y_i, t_i, p_i)$, where $t_i$ is the timestamp, $(x_i, y_i)$ the pixel location, and $p_i$ the polarity. A single event does not track a pixel; object-level trajectories are constructed by associating time-ordered events across pixels according to the rules in Section 3.2.
To enhance the robustness of the tracking system, data augmentation is applied to the event stream [44], which includes translation and rotation transformations. The translation modifies the coordinates $x$ and $y$ by random values, expressed as $x' = x + \Delta x$ and $y' = y + \Delta y$, where $\Delta x$ and $\Delta y$ are random translation values. The rotation transformation is applied using a standard 2D rotation matrix to adjust the coordinates:

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}, \tag{1}$$

where $\theta$ is a randomly selected rotation angle.
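A sketch of this augmentation is shown below; the shift and angle ranges are illustrative, and rotating about the event cloud's centroid (rather than the image origin) is an assumption made here to keep augmented coordinates near the sensor area.

```python
import numpy as np

def augment_events(x, y, max_shift=10.0, max_angle_deg=15.0, rng=None):
    """Random translation (x' = x + dx, y' = y + dy) followed by a 2D rotation
    as in Equation (1). Ranges and the centroid pivot are illustrative."""
    rng = np.random.default_rng() if rng is None else rng
    dx, dy = rng.uniform(-max_shift, max_shift, size=2)
    theta = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    xs, ys = np.asarray(x, float) + dx, np.asarray(y, float) + dy
    cx, cy = xs.mean(), ys.mean()                 # rotate about the event-cloud centroid
    c, s = np.cos(theta), np.sin(theta)
    xr = cx + c * (xs - cx) - s * (ys - cy)
    yr = cy + s * (xs - cx) + c * (ys - cy)
    return xr, yr
```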
The object length is estimated by calculating the maximum Euclidean distance between any two points in the event trajectory:

$$L_{\max} = \max_{i, j} \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}, \tag{2}$$

where $(x_i, y_i)$ and $(x_j, y_j)$ are two distinct points in the event stream. For larger datasets, the bounding box diagonal is computed as:

$$D = \sqrt{(x_{\max} - x_{\min})^2 + (y_{\max} - y_{\min})^2}, \tag{3}$$

where $(x_{\min}, y_{\min})$ and $(x_{\max}, y_{\max})$ are the minimum and maximum coordinates of the object’s bounding box.
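Equations (2) and (3) translate directly into code; in the sketch below, subsampling large slices before the quadratic pairwise search is an illustrative efficiency choice, not part of the original formulation.

```python
import numpy as np

def trajectory_length(xy, max_points=2000, rng=None):
    """Maximum pairwise Euclidean distance, Equation (2). Large slices are
    subsampled to keep the O(N^2) search manageable (illustrative choice)."""
    xy = np.asarray(xy, dtype=float)
    rng = np.random.default_rng() if rng is None else rng
    if len(xy) > max_points:
        xy = xy[rng.choice(len(xy), max_points, replace=False)]
    diff = xy[:, None, :] - xy[None, :, :]
    return float(np.sqrt((diff ** 2).sum(-1)).max())

def bbox_diagonal(xy):
    """Bounding-box diagonal, Equation (3)."""
    xy = np.asarray(xy, dtype=float)
    span = xy.max(axis=0) - xy.min(axis=0)
    return float(np.hypot(span[0], span[1]))
```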
The quantities introduced earlier are used to scale and stabilise the kinematic estimates. Let the radius vector $\mathbf{r}_i = (x_i, y_i)$ denote the spatial position of event $e_i$. The maximum trajectory length $L_{\max}$ in Equation (2) and the bounding-box diagonal $D$ in Equation (3) provide two scene-dependent scales. We normalise displacements and velocities by the diagonal so that all kinematic terms are measured per unit scene scale, and we cap neighbourhood searches and interpolation radii by a fraction of $D$ to avoid drift on long tracks. Accordingly, Equations (4) and (5) are used in a discrete form consistent with the event scheme:

$$\mathbf{v}_i = \frac{\mathbf{r}_{i+1} - \mathbf{r}_i}{t_{i+1} - t_i}, \tag{4}$$

$$\mathbf{a}_i = \frac{\mathbf{v}_{i+1} - \mathbf{v}_i}{t_{i+1} - t_i}, \tag{5}$$

and their normalised counterparts are $\tilde{\mathbf{v}}_i = \mathbf{v}_i / D$ and $\tilde{\mathbf{a}}_i = \mathbf{a}_i / D$. In practice we set the spatial search radius for supplementation to $r_s = \kappa D$ with a fixed $\kappa$. With these definitions, Equations (4) and (5) describe the event trajectory as a time-ordered sequence $\{(\mathbf{r}_i, \mathbf{v}_i, \mathbf{a}_i)\}$, while Equations (2) and (3) supply the reference scales that determine the magnitude and the admissible neighbourhood for interpolation and filtering. Interpolation declares a gap when the inter-event interval exceeds three times the median per track, uses a spatial search radius of half the box diagonal, and a two-millisecond confirmation window, which limits corrections in size and prevents drift accumulation.
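The gap rule can be sketched as follows, assuming a per-track array of event timestamps; the function flags gaps and returns the admissible search radius and confirmation window stated above.

```python
import numpy as np

def find_gaps(track_t_us, box_diag, confirm_us=2_000):
    """Flag indices i where t[i+1] - t[i] exceeds three times the per-track
    median interval, and return the spatial search radius (half the box
    diagonal) and the confirmation window used to accept supplementary events."""
    dt = np.diff(np.asarray(track_t_us, dtype=np.int64))
    if dt.size == 0:
        return np.array([], dtype=int), 0.5 * box_diag, confirm_us
    gap_idx = np.nonzero(dt > 3.0 * np.median(dt))[0]   # gap between events i and i+1
    return gap_idx, 0.5 * box_diag, confirm_us
```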
The Kalman filter is applied to predict the object’s future position using the following prediction step:
$$\bar{\mathbf{s}}_{k} = \mathbf{A}\,\hat{\mathbf{s}}_{k-1} + \mathbf{B}\,\mathbf{u}_{k}, \tag{6}$$

Here $\bar{\mathbf{s}}_{k}$ denotes the a priori state estimate at time $t_k$ given measurements up to $t_{k-1}$, that is, the prediction step. The subscript $k$ is the time index and the bar indicates prediction from the previous step. The state stacks image-plane position and velocity as $\mathbf{s} = (x, y, v_x, v_y)^{\top}$ and is propagated with a constant-velocity model with sampling interval $\Delta t$. The term $\mathbf{u}_k$ denotes a known exogenous control input such as commanded motion or IMU-derived acceleration expressed in the same kinematic units; when no control is available, $\mathbf{u}_k = \mathbf{0}$ and the prediction reduces to constant-velocity propagation with process noise.
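A minimal sketch of this prediction step is given below; the control matrix (mapping an acceleration-like input) and the process-noise scaling are standard constant-velocity choices and are illustrative rather than the paper's exact settings.

```python
import numpy as np

def kalman_predict(s_hat, P, dt, u=None, q=1e-2):
    """Constant-velocity prediction step, Equation (6): the state
    s = (x, y, vx, vy) is propagated by A; an optional control u enters
    through B. The noise scale q is an illustrative default."""
    A = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=float)
    B = np.array([[0.5 * dt**2, 0],
                  [0, 0.5 * dt**2],
                  [dt, 0],
                  [0, dt]], dtype=float)          # maps an acceleration-like control input
    Q = q * np.eye(4)                             # process-noise covariance
    u = np.zeros(2) if u is None else np.asarray(u, dtype=float)
    s_bar = A @ s_hat + B @ u                     # a priori state estimate
    P_bar = A @ P @ A.T + Q                       # a priori covariance
    return s_bar, P_bar
```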
Finally, the bounding box around the object is computed using:
$$x_{\min} = \min_i x_i, \quad y_{\min} = \min_i y_i, \quad x_{\max} = \max_i x_i, \quad y_{\max} = \max_i y_i, \tag{7}$$

where $(x_{\min}, y_{\min})$ and $(x_{\max}, y_{\max})$ are the minimum and maximum coordinates of the object’s bounding box. These bounding box coordinates are used to remap the events to the full-resolution image:

$$x_{\mathrm{img}} = \hat{x}\,W, \qquad y_{\mathrm{img}} = \hat{y}\,H, \tag{8}$$

Here $W$ denotes the image width and is unrelated to the trainable weights used later; learnable parameters are written in lowercase as $w$. $W$ and $H$ are the width and height of the full-resolution frame, and $\hat{x}$ and $\hat{y}$ are the normalised coordinates.
3.3. Training Framework for Event-Based Learning
The proposed training framework for event-based learning integrates several innovative modules that optimise the processing of event-driven data within a pure Spiking Neural Network (SNN) architecture. The primary innovation is the multi-scale event attention mechanism, which assigns importance to events based on their temporal and spatial relevance. For each event $e_i = (x_i, y_i, t_i, p_i)$, where $t_i$ is the timestamp, $x_i$ and $y_i$ are the spatial coordinates, and $p_i$ is the polarity, the temporal weight is calculated using the time difference between the event and its neighbours:

$$w_t(i, j) = \exp\!\left(-\frac{|\Delta t_{ij}|}{\tau_t}\right), \tag{9}$$

where $\Delta t_{ij} = t_i - t_j$ is the time difference between events $e_i$ and $e_j$, and $\tau_t$ is a temporal scaling factor. The spatial weight is determined by the Euclidean distance between the spatial coordinates of the events:

$$w_s(i, j) = \exp\!\left(-\frac{d_{ij}}{\sigma_s}\right), \tag{10}$$

where $d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$ is the spatial distance, and $\sigma_s$ is the spatial scaling factor. The parameters $\tau_t$ and $\sigma_s$ set the temporal and spatial support of the attention and act as bandwidths for the respective kernels. Their values are selected on the training split by a small grid centred on sensor statistics, sweeping $\tau_t$ in milliseconds and $\sigma_s$ in pixels, with fixed values of $\tau_t$ and $\sigma_s$ adopted unless stated otherwise. On the validation split, one-factor sweeps around the selected scales showed stable optima. Performance varied by at most 0.3 percentage points for $\tau_t$ and 0.4 percentage points for $\sigma_s$ within the explored ranges. The polarity weight is based on whether the polarities of the two events match, expressed as:

$$w_p(i, j) = \begin{cases} 1, & p_i = p_j, \\ \gamma, & p_i \neq p_j, \end{cases} \tag{11}$$

Here $p_i$ and $p_j$ denote event polarities. The parameter $\gamma$ is an attenuation factor for pairs of opposite polarity with $0 \le \gamma < 1$ and is kept fixed within each experiment; the same setting is applied across all methods. This two-case definition makes the exclusivity explicit and removes the ambiguity that would arise from adding mutually incompatible conditions. The final attention weight is a weighted sum of these components:

$$w_{ij} = \alpha\, w_t(i, j) + \beta\, w_s(i, j) + \lambda\, w_p(i, j), \tag{12}$$

The coefficients $\alpha$, $\beta$ and $\lambda$ weight the temporal, spatial and polarity terms in the composite attention. They are constrained to be non-negative and to sum to one and are chosen on the training split by a convex grid search; the equal setting $\alpha = \beta = \lambda = 1/3$ is used when multiple settings perform similarly. Increasing $\alpha$ favours short, burst-like transients, increasing $\beta$ favours coherent spatial structure, and increasing $\lambda$ favours polarity consistency.
The attention weight determines the importance of each event in the network, guiding the model to focus on the most relevant events.
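The attention computation of Equations (9)–(12) can be sketched for a single event pair as follows; the exponential kernel form matches the reconstruction above, and the default bandwidths, attenuation factor and equal mixing coefficients are illustrative values.

```python
import numpy as np

def attention_weight(e_i, e_j, tau_t=5_000.0, sigma_s=6.0, gamma=0.5,
                     alpha=1/3, beta=1/3, lam=1/3):
    """Composite event-pair attention following Equations (9)-(12): temporal,
    spatial and polarity terms combined with non-negative coefficients that
    sum to one. Kernel forms and default values are illustrative."""
    (x_i, y_i, t_i, p_i), (x_j, y_j, t_j, p_j) = e_i, e_j
    w_t = np.exp(-abs(t_i - t_j) / tau_t)                     # Eq. (9), temporal closeness
    w_s = np.exp(-np.hypot(x_i - x_j, y_i - y_j) / sigma_s)   # Eq. (10), spatial closeness
    w_p = 1.0 if p_i == p_j else gamma                        # Eq. (11), polarity match
    return alpha * w_t + beta * w_s + lam * w_p               # Eq. (12), composite weight
```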
In addition, the spatio-temporal event convolution module is employed to extract key features from the event stream. This module applies convolution filters to the event data, allowing the network to capture spatial and temporal features such as edges and motion. The convolution operation is expressed as:
$$F(x, y, t) = (E * K)(x, y, t) = \sum_{\Delta x, \Delta y, \Delta t} E(x - \Delta x,\, y - \Delta y,\, t - \Delta t)\, K(\Delta x, \Delta y, \Delta t), \tag{13}$$

where $E$ represents the event stream, and $K$ denotes the convolution kernels. These kernels extract features from the event stream, which are then processed in the network.
Kernels adapt to DVS polarity and sparsity. ON and OFF branches share a 3 × 3 spatial kernel and a short causal temporal kernel. Each branch is softly gated by the local ON to OFF ratio, which amplifies the dominant polarity and suppresses the other without duplicating spatial parameters. Convolution is evaluated only on active voxels and their small neighbourhoods. Spatial dilation and temporal integration length are conditioned on local occupancy: sparse regions use larger dilation and longer integration, while dense regions use tighter dilation and shorter integration. Outputs from both branches are merged and normalised by the local event count to remove rate bias.
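A simplified sketch of the polarity gating is shown below, assuming polarity-split voxel inputs and an odd-sized shared kernel; it keeps the shared spatial kernel, the soft ON-to-OFF gate and the event-count normalisation, while the short causal temporal kernel, sparse evaluation on active voxels and occupancy-conditioned dilation are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def polarity_gated_conv(voxels, kernel):
    """Polarity-aware convolution sketch.

    voxels: (B, 2, H, W) event counts, channel 0 = ON, channel 1 = OFF.
    kernel: (1, 1, k, k) shared spatial kernel with odd k.
    """
    on, off = voxels[:, 0:1], voxels[:, 1:2]
    pad = kernel.shape[-1] // 2
    box = torch.ones_like(kernel)                       # box filter for local event counts
    on_cnt = F.conv2d(on, box, padding=pad)
    off_cnt = F.conv2d(off, box, padding=pad)
    gate = (on_cnt + 1e-6) / (on_cnt + off_cnt + 2e-6)  # soft ON-to-OFF gate in (0, 1)
    resp = (gate * F.conv2d(on, kernel, padding=pad)
            + (1.0 - gate) * F.conv2d(off, kernel, padding=pad))
    return resp / (on_cnt + off_cnt + 1e-6)             # remove local event-rate bias
```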
The Leaky Integrate-and-Fire (LIF) neuron model processes the events within the SNN. The membrane potential $V(t)$ is updated according to the equation:

$$\tau_m \frac{dV(t)}{dt} = -\bigl(V(t) - V_{\mathrm{rest}}\bigr) + R_m I_{\mathrm{syn}}(t), \tag{14}$$

Membrane potential follows leaky integrate-and-fire dynamics with time constant $\tau_m$ and leak reversal $V_{\mathrm{rest}}$. The synaptic drive $I_{\mathrm{syn}}(t)$ is a current that is converted to an equivalent voltage by the membrane resistance $R_m$, so all terms are expressed in volts.
Each event carries a composite attention weight according to Equation (12). During inference, the synaptic drive to a neuron is the weighted sum of its afferent event kernels, with each event contribution multiplied by its attention weight. Informative events deliver larger membrane increments and are more likely to reach the threshold. In contrast, events that are isolated in time or space, or that have inconsistent polarity, receive small weights and are effectively suppressed.
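The discrete membrane update with attention-weighted drive can be sketched as follows; the explicit-Euler discretisation and the parameter defaults are illustrative choices.

```python
import numpy as np

def lif_step(v, i_syn, dt=1e-3, tau_m=20e-3, v_rest=0.0, r_m=1.0,
             v_th=1.0, v_reset=0.0):
    """One explicit-Euler step of Equation (14):
    tau_m * dV/dt = -(V - V_rest) + R_m * I_syn, with threshold and reset."""
    v = v + (dt / tau_m) * (-(v - v_rest) + r_m * i_syn)
    spikes = v >= v_th                    # emit a spike where the threshold is crossed
    v = np.where(spikes, v_reset, v)      # reset spiking neurons
    return v, spikes

def weighted_drive(kernel_responses, attention_weights):
    """Synaptic drive as the attention-weighted sum of afferent event-kernel
    responses, so informative events deliver larger membrane increments."""
    return float(np.dot(np.asarray(attention_weights), np.asarray(kernel_responses)))
```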
The training framework uses the Event-Weighted Spiking Loss (EW-SLoss) function to guide the weight update process. This custom loss function adjusts the weights based on the attention-weighted events, ensuring that the model focuses on the most relevant data during training. During learning the per event error is multiplied by the same attention weight so that gradients are concentrated on coherent structures and noisy updates are attenuated. The weight update rule is given by:
$$\Delta w = -\,\eta\, \frac{\partial \mathcal{L}_{\mathrm{EW}}}{\partial w}, \tag{15}$$

Here $w$ denotes any trainable parameter in the spiking recognition backbone and in the event-weighted readout; $\mathcal{L}_{\mathrm{EW}}$ is the event-weighted spiking loss and $\eta$ is the learning rate. Gradients are computed with a surrogate-gradient approximation to the spike nonlinearity, and parameters are updated as $w \leftarrow w + \Delta w$.
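A minimal sketch of this training step is given below; the fast-sigmoid surrogate derivative and the squared-error form of the per-event term are common choices used here for illustration and are not claimed to be the paper's exact definitions.

```python
import torch

SLOPE = 10.0  # surrogate sharpness (illustrative)

class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass; fast-sigmoid surrogate derivative
    in the backward pass (a common surrogate choice)."""
    @staticmethod
    def forward(ctx, v_minus_th):
        ctx.save_for_backward(v_minus_th)
        return (v_minus_th > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (v_minus_th,) = ctx.saved_tensors
        return grad_output / (1.0 + SLOPE * v_minus_th.abs()) ** 2

def ew_sloss(per_event_error, attention_weights):
    """Event-weighted loss sketch: per-event errors are scaled by their
    composite attention weights (Equation (12)) before summation, so gradients
    concentrate on coherent, informative events."""
    w = attention_weights / (attention_weights.sum() + 1e-8)
    return (w * per_event_error.pow(2)).sum()

# Parameters are then updated as w <- w - eta * dL/dw, e.g. with torch.optim.SGD.
```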