1. Introduction
The development of autonomous vehicles has gradually evolved from modular pipelines to end-to-end approaches, primarily to address the high costs associated with handling corner cases in rule-based modular methods. In modular pipelines [
1], standalone components for localization, perception, planning, and control work collaboratively to generate feasible actions for autonomous vehicles. This approach emphasizes the integration of various modules to ensure reliability and performance. However, end-to-end approaches [
2], which aim to directly learn a mapping function between sensory inputs and vehicle actuators using neural networks, eliminate the need for explicit modular segmentation. This paradigm, often categorized under imitation learning, relies on collecting expert datasets and training neural networks to mimic expert behavior for longitudinal or lateral control.
Within end-to-end frameworks, frame-based RGB cameras have been a predominant input modality, demonstrating impressive results in learning the mapping from visual inputs to actions. These systems excel in structured environments but face challenges in capturing reliable and comprehensive scene information in unstructured or dynamic scenarios. The primary limitation of frame-based cameras lies in their inability to capture fast-moving objects or sudden scene changes with high fidelity. Because they capture frames at fixed intervals, there are temporal gaps between frames, and fast motion within an exposure can cause motion blur, particularly in high-speed environments. This results in a loss of temporal information and can hinder the vehicle’s ability to respond to quick or unpredictable changes in the environment, such as a pedestrian suddenly crossing the road or a rapidly changing traffic light.
In recent years, event-based sensors have emerged as a promising alternative for autonomous driving. Unlike traditional frame-based cameras, event cameras capture per-pixel brightness changes as asynchronous events, offering higher temporal resolution, wider dynamic range, and reduced susceptibility to motion blur. This mechanism is more analogous to the human eye’s triggering process, which responds to dynamic changes in the scene rather than capturing static frames at fixed intervals. These characteristics make event cameras well-suited for dynamic and challenging driving environments, providing low-latency and low-bandwidth advantages. Specifically, event cameras achieve microsecond-level latency, with pixel-level events being transmitted as soon as changes are detected. On the lab bench, latency is around 10 μs, and in real-world conditions, it can be as low as sub-millisecond [
3]. This rapid response enables event cameras to capture fast-moving objects with minimal delay, making them ideal for real-time applications in autonomous driving.
As for static information, event-based cameras can still capture it by integrating mechanisms that track background changes over time. While event cameras focus on detecting dynamic events, they can also record static elements through continuous monitoring of the scene. Some designs, such as the Asynchronous Time-Based Image Sensor (ATIS) [
4], and Dynamic and Active Pixel Vision Sensor (DAVIS) [
5], incorporate additional subpixels to measure absolute intensity, allowing the camera to capture static brightness information alongside dynamic events. This enables event-based cameras to provide a more comprehensive view of the environment, processing both dynamic and static elements efficiently.
Humans achieve robust movements and accomplish tasks by perceiving geometric features and dynamic information in their environment through visual observation and interactive evaluation. Emulating this human perception-control mechanism provides a viable framework for designing end-to-end algorithms. However, most current end-to-end models that map event camera input to control actions are built on standard convolutional neural networks (CNNs), such as ResNet [
6]. Lacking a biologically grounded neural basis, these networks struggle to effectively capture the complex perception-control processes required for autonomous driving, making it difficult to achieve high levels of generalizability and robustness. Moreover, these networks often contain a large number of parameters, leading to redundant computations and repeated processing of unnecessary events, which reduces their efficiency and practicality.
Inspired by Neural Circuit Policies (NCPs) [
7], designing network models from a brain-inspired perspective could be a promising direction. NCPs draw inspiration from the neural computations observed in biological brains, incorporating increased computational capabilities per neuron to create sparse, interpretable networks. These models have demonstrated exceptional performance in tasks like lane-keeping, where small networks with only a few neurons have outperformed state-of-the-art methods in directly learning to steer a vehicle from high-dimensional visual RGB inputs.
We believe that brain-inspired networks are a natural choice for constructing end-to-end models with event camera input. On the one hand, event cameras, designed to mimic the way biological eyes perceive changes in brightness, naturally align with the computational principles of brain-inspired networks like NCPs. On the other hand, brain-inspired networks have been shown to capture more effective causal relationships with fewer neurons, which aligns well with the current needs for processing event camera data. To address this gap, our work investigates the capability of brain-inspired networks to learn end-to-end steering predictions using event-based inputs in autonomous driving scenarios.
To achieve this goal, we propose LiS-Net, a novel framework that combines the strengths of event cameras and brain-inspired neural networks. By leveraging the unique properties of event-based data, LiS-Net can learn robust, efficient, and interpretable control policies for end-to-end steering prediction. By combining bio-inspired sensors with brain-inspired computation, LiS-Net demonstrates the potential to pave the way for robust, scalable, and efficient autonomous driving solutions.
The architecture of LiS-Net introduces a novel integration of event camera data with brain-inspired neural networks to predict steering angles in an end-to-end manner. This framework incorporates four key components: tensorization, a convolutional feature extractor, a brain-inspired liquid neural network, and a smooth adjustment module. The process begins with the preprocessing of raw event data, which are converted into a structured tensor format, suitable for downstream processing. The preprocessed tensors are then passed into a convolutional feature extractor that learns spatial representations from the input data. These features are subsequently fed into the brain-inspired liquid neural network, which is designed to capture the temporal dependencies and dynamic patterns inherent in event-based data. The final step involves a smooth adjustment module that refines the predictions, ensuring both accuracy and temporal smoothness in the steering angle output. To thoroughly assess the robustness and generalizability of LiS-Net, we conducted experiments using the Carla simulation dataset EventScape for training and testing, and we further validated the trained model on a self-collected dataset that includes multiple sensors, such as event cameras, frame-based RGB cameras, GPS, and Controller Area Network (CAN) data. The deployment of the model on real-world data not only extends its validation beyond simulated environments but also provides a comprehensive evaluation of its performance in diverse conditions.
The main contributions of this work are as follows:
Integration of brain-inspired sensor inputs with brain-inspired neural networks: To the best of our knowledge, we are the first to combine event camera inputs with brain-inspired Neural Circuit Policies (NCPs) for end-to-end steering prediction, setting a new baseline and highlighting the potential of bio-inspired sensors and computation.
Continuous Dynamic Modeling and Performance Analysis of LiS-Net: Inspired by the continuous state transitions of biological neurons, LiS-Net preprocesses event-based data and uses a differentiable approach to model neurons with continuous dynamics. This enables more accurate temporal information capture and better environmental modeling. By continuously modeling neuron transitions, LiS-Net ensures temporal accuracy and smoothness in end-to-end steering prediction. Trained and tested on simulated datasets, LiS-Net outperforms traditional networks in RMSE and MAE. Validation on real-world data shows LiS-Net’s efficiency with fewer neurons and relatively fewer FLOPs, demonstrating generalization for autonomous driving.
Dataset contribution: This work also contributes to the research community by providing a self-collected dataset that includes synchronized recordings from multiple sensors, including an RGB camera, event camera, and Controller Area Network (CAN) data. This dataset enables the evaluation of model generalization to different real-world driving environments and provides valuable resources for future research.
3. Methodology
This section provides a detailed discussion of our proposed framework for learning end-to-end steering angles from event-based data, as illustrated in
Figure 2. The framework consists of several key components, and these components are designed to capture long-range dependencies between the encoded features and enable accurate steering angle predictions.
The first step in the framework is tensorization, converting the event streams into tensor form, followed by the use of a convolutional feature extractor to extract meaningful spatial feature representations. Next, the brain-inspired liquid neural network processes the sequential dynamics of the data, focusing on time-series modeling to extract causal relationships. Finally, by incorporating loss design, the framework performs smooth adjustments to produce more stable and refined steering control outputs.
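As a rough illustration of this pipeline, the PyTorch-style sketch below shows how a sequence of tensorized event frames could flow through a per-frame convolutional extractor and a recurrent brain-inspired core to yield one steering angle per time step. Module names, layer sizes, and the feature dimension are illustrative assumptions rather than the exact LiS-Net implementation; the recurrent core (CfC neurons wired as an NCP) is described in Section 3.2.

```python
# Minimal sketch of the LiS-Net data flow (assumed names and sizes).
import torch
import torch.nn as nn

class ConvHead(nn.Module):
    """Per-frame convolutional feature extractor applied to tensorized events."""
    def __init__(self, in_ch: int, feat_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(48, feat_dim),
        )

    def forward(self, x):                      # x: (B*T, C, H, W)
        return self.net(x)

class LiSNetSketch(nn.Module):
    def __init__(self, in_ch: int, feat_dim: int, rnn: nn.Module):
        super().__init__()
        self.head = ConvHead(in_ch, feat_dim)
        # rnn: any recurrent core returning (per-step outputs, final state),
        # e.g., CfC neurons wired as an NCP with a single motor output.
        self.rnn = rnn

    def forward(self, events):                 # events: (B, T, C, H, W)
        B, T = events.shape[:2]
        feats = self.head(events.flatten(0, 1)).view(B, T, -1)
        steering, _ = self.rnn(feats)          # (B, T, 1): one angle per step
        return steering.squeeze(-1)
```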
To facilitate understanding of the proposed model and its derivations, we summarize the key symbols and notations used throughout the paper in
Table 1.
3.1. Data Distribution
The primary objective of this study is to investigate the robustness of brain-inspired networks in processing event-based sensor inputs, with a specific focus on predicting the steering angle for autonomous vehicles. Event-based cameras are particularly appealing for real-time autonomous driving applications due to their high temporal resolution and low latency. However, the availability of real-world autonomous driving datasets incorporating event-based cameras remains limited. To address this challenge, as depicted in
Figure 3, the study utilizes EventScape [
48], a simulated dataset generated using the CARLA simulator, which provides event camera data for training, validation, and testing purposes. It is important to highlight that the EventScape dataset does not account for variations in weather conditions, with all data collected under clear weather scenarios. Moreover, we qualitatively assess the feasibility and generalization capability of our model using a real-world dataset collected under realistic driving conditions.
As shown in
Figure 4, the real-world dataset in this study consists of data collected from a forward-facing event camera mounted on the roof rack of a vehicle. The vehicle was driven manually by a professional driver, ensuring precise control during data collection. During the data acquisition process, both the environmental data captured by the event camera and the corresponding control signals (e.g., steering angles and vehicle speed) were recorded concurrently. The control signals, obtained from the vehicle’s control system, serve as the ground truth for evaluating the accuracy of the model’s predictions. Similar to the EventScape dataset, the real-world data used for testing was also recorded under clear weather conditions, ensuring consistency with the simulated training data. This approach allows for a thorough assessment of the model’s robustness and its ability to generalize from simulated environments to real-world scenarios.
Both the simulated data (EventScape) and real-world data were collected in urban environments, containing typical urban features such as vehicles, pedestrians, and roads. However, the real-world data were specifically captured in a parking lot, ensuring controlled conditions during data collection. To ensure the quality and consistency of the data, the vehicle followed a predefined route, and the data were collected over multiple passes. Additionally, to verify the accuracy and reliability of the collected data, we cross-referenced the event camera data with visual data captured by an RGB camera at corresponding time points. The steering angles at these time points were also manually confirmed to ensure the authenticity and reliability of the data.
3.2. Brain-Inspired Liquid Neural Network
Our approach leverages a brain-inspired liquid neural network to process features extracted from the convolutional layer for autonomous vehicle steering angle prediction. A brain-inspired liquid neural network can be divided into two components: the neuron model and the hierarchical network topology. The neuron model defines the internal structure and mathematical principles for modeling and solving individual neurons, and the hierarchical network topology determines how neurons are organized and connected, including the layering structure and specialized connection mechanisms between different layers. The design of these two components aims to achieve a more sparse and efficient model.
3.2.1. Neuron Model
As is defined by [
46], the Liquid Time Constant (LTC) network models the behavior of neurons and synapses as a continuous-time differentiable dynamical system. The dynamics of each neuron
$i$ in the LTC network are governed by the following Ordinary Differential Equation (ODE):

$\frac{\mathrm{d}x_i(t)}{\mathrm{d}t} = -\frac{x_i(t)}{\tau_i} + \sum_{j} S_{ij}(t), \quad (1)$

where $x_i(t)$ represents the hidden state of neuron $i$ at time $t$, $\tau_i$ represents the time constant of neuron $i$, controlling the rate of state decay, and $S_{ij}(t)$ represents the synaptic input from neuron $j$ to neuron $i$, given by a nonlinear synaptic function.
To provide a more specific formulation, consider a set of LTC neurons, each with state dynamics $v_i(t)$, connected through an input synapse to a neuron $j$:

$\frac{\mathrm{d}v_i(t)}{\mathrm{d}t} = \frac{g_{l_i}}{C_{m_i}}\left(v_{\mathrm{leak}_i} - v_i(t)\right) + \frac{w_{ij}}{C_{m_i}}\,\sigma\!\left(v_j(t)\right)\left(E_{ij} - v_i(t)\right), \quad (2)$

where $\tau_i = C_{m_i}/g_{l_i}$ represents the time constant of neuron $i$, in which $g_{l_i}$ is the leakage conductance and $C_{m_i}$ is the membrane capacitance; $w_{ij}$ represents the weight of the synapse between neurons $i$ and $j$; $\sigma(\cdot)$ represents the nonlinear activation function; $v_{\mathrm{leak}_i}$ represents the resting potential of neuron $i$; and $E_{ij}$ represents the reversal synaptic potential defining the polarity of the synapse.
The overall coupling sensitivity (time constant) of the LTC neuron is defined as follows:

$\tau_i^{\mathrm{sys}} = \frac{C_{m_i}}{g_{l_i} + w_{ij}\,\sigma\!\left(v_j(t)\right)}. \quad (3)$
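To make these continuous dynamics concrete, the sketch below performs explicit-Euler updates of the neuron model in Equation (2); the parameter values are illustrative assumptions. It also hints at why ODE-based neurons are computationally demanding, which motivates the closed-form formulation introduced next.

```python
import numpy as np

def ltc_euler_step(v_i, v_j, dt, g_l=0.5, C_m=1.0, w_ij=1.2,
                   E_ij=1.0, v_leak=0.0):
    """One explicit-Euler update of Equation (2) for a single LTC neuron i
    driven by a presynaptic neuron j (illustrative parameter values)."""
    sigma = 1.0 / (1.0 + np.exp(-v_j))              # nonlinear synaptic activation
    dv = (g_l * (v_leak - v_i) + w_ij * sigma * (E_ij - v_i)) / C_m
    return v_i + dt * dv

# A numerical solver must take many small steps per sample, which is what
# makes ODE-based neurons expensive and motivates the closed-form CfC model.
v = 0.0
for _ in range(100):
    v = ltc_euler_step(v, v_j=0.8, dt=0.01)
```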
Although the Liquid Time Constant (LTC) model effectively simulates the mechanisms of a neuron, it suffers from the excessive computational time required to solve an ordinary differential equation (ODE). To address this issue, closed-form continuous-time (CfC) models [
49] are derived from the scalar closed-form solution of the Liquid Time Constant (LTC) system. The hidden state of an LTC network is determined by the solution of the following initial value problem (IVP):

$\frac{\mathrm{d}x(t)}{\mathrm{d}t} = -\left[\frac{1}{\tau} + f\!\left(x(t), I(t); \theta\right)\right] \odot x(t) + f\!\left(x(t), I(t); \theta\right) \odot A, \quad (4)$

where $x(t) \in \mathbb{R}^{D}$ represents the hidden state of the LTC layer with $D$ cells at time $t$, $I(t) \in \mathbb{R}^{m}$ represents the exogenous input with $m$ features, $\tau \in \mathbb{R}^{D}$ represents the time-constant parameter vector, $A \in \mathbb{R}^{D}$ represents the bias vector, $f$ represents a neural network parameterized by $\theta$, and $\odot$ represents the Hadamard (element-wise) product.
The dependence of $f$ on $x(t)$ allows for the presence of recurrent connections in the system. CfCs address continuous-time dynamics explicitly while maintaining trainability and integration into modern neural network frameworks. The closed-form solution provides a theoretical foundation for solving scalar ordinary differential equation (ODE) systems and ensures well-behaved gradients during optimization.
Given an LTC system defined by an initial value problem (IVP) receiving a single-dimensional time-series input $I(t)$ with no self-connections, the approximate closed-form solution is as follows:

$x(t) \approx \left(x(0) - A\right)\, e^{-\left[w_{\tau} + f\left(I(t);\,\theta\right)\right]\, t}\, f\!\left(-I(t);\,\theta\right) + A. \quad (5)$

Here, $x(0)$ represents the initial state of the system, $A$ represents a bias-like parameter, $w_{\tau}$ represents the weighted time constant of the neuron, and $f(\cdot\,;\theta)$ represents the input-dependent nonlinearity parameterized by $\theta$.
Since the CfC model is inherently a time-sequence model, one can perform predictions based on an entire sequence of observations. As shown in
Table 2, when comparing CfC with other state-of-the-art time-sequence models in terms of time complexity, for a given sequence of length n and a neural network with k hidden units (p being the order of the ODE solver), we observe that the complexity of the ODE-based networks and transformer is at least an order of magnitude higher than that of discrete RNNs and CfCs in both sequence prediction and auto-regressive modeling (time-step prediction) frameworks [
49]. This indicates that using CfC as our neuron model can significantly enhance the computational efficiency of the overall network.
Considering both computational efficiency and accuracy, we adopt the approximate closed-form solution, as shown in Equation (5), as the neuron model in our brain-inspired liquid neural network.
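The following is a minimal sketch of a single closed-form update following Equation (5), with a small learned network standing in for $f$; the module name, layer sizes, and the use of the absolute value of $w_{\tau}$ to keep the time constant positive are our assumptions rather than the exact CfC implementation.

```python
import torch
import torch.nn as nn

class ClosedFormCellSketch(nn.Module):
    """Sketch of the approximate closed-form LTC update of Equation (5):
    x(t) ~ (x(0) - A) * exp(-(w_tau + f(I)) * t) * f(-I) + A."""
    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        # f: input-dependent nonlinearity parameterized by theta
        # (sigmoid keeps its output positive, so the decay stays bounded).
        self.f = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Tanh(),
                               nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid())
        self.w_tau = nn.Parameter(torch.zeros(hidden_dim))  # weighted time constant
        self.A = nn.Parameter(torch.zeros(hidden_dim))      # bias-like parameter

    def forward(self, x0, I, t):
        # x0: (B, hidden_dim) previous state, I: (B, in_dim) input, t: (B, 1) elapsed time
        decay = torch.exp(-(torch.abs(self.w_tau) + self.f(I)) * t)
        return (x0 - self.A) * decay * self.f(-I) + self.A
```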
3.2.2. Hierarchical Network Topology
Neural Circuit Policies (NCPs) [
7] are inspired by the wiring diagram of the
C. elegans nematode, which features a four-layer hierarchical network topology: sensory neurons, interneurons, command neurons, and motor neurons. As illustrated in the right block of
Figure 2, this structure allows for sparse, efficient, and distributed control with hierarchical temporal dynamics. NCPs leverage the specific connectivity patterns found in
C. elegans, including feedforward connections (from sensory to intermediate neurons and command to motor neurons) and recurrent connections (within interneurons and command neurons). This topology results in compact networks with around 90% sparsity.
3.2.3. Module Implementation
In the final LiS-Net, we adopt the CfC neuron model and wire it according to the NCP topology, implementing the module as a brain-inspired liquid neural network. The module thus forms a recurrent neural network built on brain-like ODE dynamics: the sequence of environmental features is fed step by step into the CfC model, while the NCP wiring sparsifies the network, improving its efficiency and producing a sequence of control actions.
In our hierarchical network topology, the final layer, referred to as the motor layer, includes one neuron that outputs the final prediction for the steering angle. This neuron generates the steering angle prediction based on the processed sequential dynamics, effectively linking the network’s decision-making process to the control action of the vehicle. The detailed network topology parameters are as follows:
inter_neurons: 22 (Number of interneurons in the network)
command_neurons: 12 (Number of command neurons used for decision-making)
motor_neurons: 1 (Number of motor neurons, responsible for outputting the final action)
For more detailed information on the layer structures of LiS-Net, please refer to
Table A7. For specific details regarding the convolutional layers in LiS-Net, see
Table A1.
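A wiring with these neuron counts can be built, for example, with the open-source ncps package that accompanies the NCP and CfC papers; the snippet below is a sketch under that assumption, and the fan-out/fan-in values are illustrative choices rather than the exact LiS-Net configuration.

```python
# Sketch: a CfC network wired with the NCP topology described above
# (22 interneurons, 12 command neurons, 1 motor neuron), assuming the
# open-source ncps package API.
from ncps.wirings import NCP
from ncps.torch import CfC

wiring = NCP(
    inter_neurons=22,               # interneuron layer
    command_neurons=12,             # command (decision-making) layer
    motor_neurons=1,                # single steering-angle output
    sensory_fanout=4,               # outgoing synapses per sensory neuron (assumed)
    inter_fanout=4,                 # outgoing synapses per interneuron (assumed)
    recurrent_command_synapses=6,   # recurrent synapses in the command layer (assumed)
    motor_fanin=6,                  # incoming synapses per motor neuron (assumed)
)
rnn = CfC(32, wiring)               # 32 = dimensionality of the conv features (assumed)
```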
3.3. Loss Module
The training process uses a custom loss function, which combines two components: an imitation loss and a smoothness loss. The total loss ensures accurate prediction of the steering angle while encouraging smooth transitions between consecutive predictions. The total loss is defined as:

$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{imit}} + w \cdot \mathcal{L}_{\mathrm{smooth}}, \quad (6)$

where $\mathcal{L}_{\mathrm{imit}}$ represents the imitation loss that measures the error between the predicted steering angle $\hat{y}_t$ and the expert steering angle $y_t$, $\mathcal{L}_{\mathrm{smooth}}$ represents the smoothness loss that encourages gradual changes in the steering angles over time, and $w$ represents a hyperparameter (smooth_weight) that controls the relative importance of the smoothness loss. The value of $w$ is determined by ablation experiments, which show that $w = 1$ typically yields the best performance based on our current experiments.
3.3.1. Imitation Loss
The imitation loss is implemented as the Mean Squared Error (MSE) between the predicted steering angle and the expert steering angle. It ensures that the model accurately mimics the expert’s behavior. The imitation loss is given as follows:

$\mathcal{L}_{\mathrm{imit}} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2, \quad (7)$

where $N$ is the total number of samples in the batch.
3.3.2. Smoothness Loss
The smoothness loss penalizes large changes in the steering angles between consecutive time steps to ensure gradual and smooth transitions. It is calculated as the mean squared difference between consecutive predictions along the time sequence:

$\mathcal{L}_{\mathrm{smooth}} = \frac{1}{T-1} \sum_{t=2}^{T} \left(\hat{y}_t - \hat{y}_{t-1}\right)^2, \quad (8)$

where $T$ is the total number of time steps in the sequence.
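Putting the three terms together, a minimal implementation of this loss could look as follows; the function name and the (batch, time) tensor layout are assumptions.

```python
import torch

def lis_net_loss(pred, target, smooth_weight=1.0):
    """Combined training loss: imitation MSE (Equation (7)) plus a smoothness
    penalty on consecutive predictions (Equation (8)), weighted by w as in
    Equation (6). pred, target: steering angles of shape (batch, time)."""
    imitation = torch.mean((pred - target) ** 2)
    smoothness = torch.mean((pred[:, 1:] - pred[:, :-1]) ** 2)
    return imitation + smooth_weight * smoothness
```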
4. Experiments and Results
4.1. Datasets
As introduced in
Section 3.1, our task involves training and testing the model on a simulated dataset and deploying it on real-world data to evaluate its performance. This section focuses on the data-related work.
4.1.1. Our Self-Collected Dataset
We used the experimental vehicle platform developed by Volkswagen (Volkswagen AG, Wolfsburg, Germany), equipped with multiple sensor modalities and surround sensors, including a 128-channel Ouster LiDAR sensor, RGB frame sensors, and Prophesee event cameras. The experiments were conducted in Shanghai, China. The proposed method used CAN bus data and event data in this study. We utilized only a single forward-facing event camera with a resolution of 1280 × 720 pixels. The vehicle was also equipped with a drive-by-wire system to collect real CAN bus data, which include torque, speed, throttle, brake, and steering angle. Both the sensor data and CAN data were timestamped during collection, and time synchronization between the two data streams was performed based on absolute time.
Since the event camera has an extremely high temporal resolution, allowing for a very high sampling frequency, we aligned the data to the CAN data’s sampling frequency, which was set to 50 Hz. The entire process of data collection, temporal and spatial synchronization, and data decoding was managed by dedicated built-in software, with specific details omitted here for brevity. In this experiment, to validate the performance of our model, we collected 2 min of data for verification. We will release the processed dataset, ready for deployment, to the research community.
4.1.2. Simulated Dataset
EventScape [
48] is a large-scale synthetic dataset designed for multimodal learning tasks, recorded using the CARLA simulator [
50]. It contains a total of 743 driving sequences with 171,000 labeled frames, equivalent to approximately 2 h of driving data. EventScape includes diverse automotive scenarios, featuring dynamic actors such as pedestrians and vehicles, making it suitable for developing pedestrian-aware perception algorithms.
The dataset provides a variety of modalities, including events, RGB frames, semantic labels, depth maps, and vehicle signals (e.g., position, orientation, velocity, steering angle, throttle, and brake). Events are generated at 500 Hz using the ESIM simulator, while frames, depth, and semantic labels are synchronized and provided at 25 Hz. Vehicle control data are recorded at 1000 Hz. Events are processed to include realistic features such as a refractory period (100 µs) to resemble real event camera behavior.
Since our model uses only event inputs, we established event–control pairs at 500 Hz by matching each steering angle to its temporally nearest event using timestamp-based nearest-neighbor interpolation. This approach preserves the high temporal resolution of event data while maintaining synchronization with control signals.
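A simple way to realize this pairing is a vectorized nearest-neighbor lookup over timestamps, sketched below; the array names are illustrative, and both timestamp arrays are assumed to be sorted and expressed in the same clock and units.

```python
import numpy as np

def match_controls_to_events(event_ts, ctrl_ts, ctrl_steering):
    """For each event-slice timestamp, pick the temporally nearest steering
    sample (illustrative names; timestamps sorted, same clock/units)."""
    idx = np.searchsorted(ctrl_ts, event_ts)          # insertion points
    idx = np.clip(idx, 1, len(ctrl_ts) - 1)
    left, right = ctrl_ts[idx - 1], ctrl_ts[idx]
    closer_left = (event_ts - left) < (right - event_ts)
    idx = idx - closer_left.astype(int)               # choose the closer neighbor
    return ctrl_steering[idx]
```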
EventScape is originally split into training (536 sequences, 71 GB), validation (103 sequences, 12 GB), and test sets (119 sequences, 14 GB). This corresponds approximately to a 70%/15%/15% split, which is commonly adopted in deep learning tasks based on this dataset [
48]. Training data are drawn from CARLA towns 1, 2, and 3, while validation and test data come from geographically distinct areas in town 5, ensuring no overlap. In our task, we also followed this split to conduct training, validation, and testing accordingly.
4.1.3. Preprocessing of Event Data
When working with event-based data, one of the key challenges is how to structure and represent the data in a format suitable for training neural networks, particularly in the context of sequential data. In this process, tensorization is a critical step. Specifically, histogram-based tensorization is a widely used method to efficiently convert event streams into structured data formats that can be processed by neural networks, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).
The histogram method involves discretizing event data into cells based on the spatial coordinates $(x, y)$ of the events, with each cell representing a pixel or region in the spatial grid. Events are then assigned to specific time bins according to their timestamp $t$. For each time bin, the number of events in each spatial cell is counted, with separate counts for each polarity (ON and OFF), resulting in two distinct channels in the final tensor. The histogram values represent the number of events in each spatial cell for a given time bin, categorized by their polarity. Let $H$ be a tensor of four dimensions (polarity, time bin, and the two spatial coordinates). For each event $e_k = (x_k, y_k, t_k, p_k)$, we update the histogram of the corresponding time bin accordingly:

$H\!\left(p_k,\, \lfloor t_k / \Delta t \rfloor,\, y_k,\, x_k\right) \leftarrow H\!\left(p_k,\, \lfloor t_k / \Delta t \rfloor,\, y_k,\, x_k\right) + 1, \quad (9)$

where $\Delta t$ is the time interval in microseconds.
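A minimal sketch of this accumulation for a single sample is shown below, assuming integer pixel coordinates and polarities encoded as {0, 1} (negative raw polarities are mapped to 0); names and the exact tensor layout are illustrative.

```python
import numpy as np

def events_to_histogram(x, y, t, p, num_bins, height, width, dt_us):
    """Accumulate events into a (2, num_bins, height, width) histogram:
    channel 0/1 = OFF/ON polarity, time bins of width dt_us microseconds."""
    hist = np.zeros((2, num_bins, height, width), dtype=np.float32)
    xi, yi = x.astype(int), y.astype(int)
    b = np.clip(((t - t.min()) // dt_us).astype(int), 0, num_bins - 1)
    pol = (p > 0).astype(int)                   # map polarities to {0, 1}
    np.add.at(hist, (pol, b, yi, xi), 1.0)      # scatter-add event counts
    return hist
```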
After tensorization, the simulated dataset can be directly cropped to the required size and fed into the network for training, as it typically has lower resolution and minimal noise. However, for our collected dataset, the event camera sensor features a higher pixel resolution, requiring additional preprocessing steps to adapt to the network model. As shown in Figure 5, the resolution is first downsampled from 1280 × 720 pixels to 640 × 360 pixels. Due to noise and an excessive number of event points during data collection, further filtering is applied to remove outliers. This process involves calculating the mean ($\mu$) and standard deviation ($\sigma$) of the data and defining thresholds based on a specified number of standard deviations (num_std).
The number of standard deviations (num_std) used in the Hampel filter was empirically selected based on the distribution of data in our collected dataset. In our experiments, we found that using num_std = 3 yielded optimal performance, effectively filtering outliers without removing valid data points. Specifically, values outside the range $[\mu - \text{num\_std} \cdot \sigma,\ \mu + \text{num\_std} \cdot \sigma]$ were considered outliers. These outliers were then clipped, resulting in a normalized dataset free of extreme values, ensuring improved data quality for further processing.
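The clipping step can be sketched as follows; the final rescaling to [0, 1] is our reading of the "normalized dataset" mentioned above and is an assumption.

```python
import numpy as np

def clip_outliers(hist, num_std=3.0):
    """Clip histogram values outside mean +/- num_std * std (num_std = 3 here),
    then rescale to [0, 1] for the network input (rescaling is an assumption)."""
    mu, sigma = hist.mean(), hist.std()
    lo, hi = mu - num_std * sigma, mu + num_std * sigma
    clipped = np.clip(hist, lo, hi)
    return (clipped - clipped.min()) / (clipped.max() - clipped.min() + 1e-8)
```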
It is important to note that the tensorization and preprocessing procedures were consistent across both our model and all baseline models. This uniformity ensures a fair and unbiased comparison of model performance.
4.2. Baseline
To explore the suitability of brain-inspired networks for event-based inputs, we conducted experiments to evaluate the impact of incorporating brain-inspired architectures. Specifically, we compared traditional networks with brain-inspired networks to analyze their performance in the context of event-based data processing for end-to-end steering prediction.
For traditional networks, we selected several traditional neural networks for comparison, including:
(1) CNN (Convolutional Neural Network): A classical convolutional architecture for spatial feature extraction. (2) LSTM (Long Short-Term Memory): A recurrent network capable of capturing temporal dependencies in sequential data. (3) CT-RNN (Continuous-Time Recurrent Neural Network): A model that incorporates ODE (ordinary differential equation) computations to handle irregular observations. CT-RNN achieves this by continuously evolving hidden states between observations, using neural ODEs or neural flow layers.
For brain-inspired networks, we selected (4) NCP (Neural Circuit Policies), a model originally designed for frame-based RGB inputs. To adapt it for event-based data, we first tensorized the event camera input and then fed it into the NCP model, using it as one of our baselines.
By comparing the performance of traditional networks (CNN, LSTM, CT-RNN) with the brain-inspired network, we aim to gain insights into the effectiveness of brain-inspired architectures for event-based end-to-end lateral control tasks. The quantitative and qualitative results of this comparison are provided in subsequent sections.
4.3. Training Details
We implemented the proposed LiS-Net network using PyTorch 2.0.0 with CUDA 12.2. The network was trained on a server equipped with two NVIDIA RTX A6000 (NVIDIA Corporation, Santa Clara, CA, USA) graphics cards. The batch size during training was set to 128. After tensorization, all event camera inputs were resized to the same dimensions (1, 66, 200) before being fed into the network.
The optimizer used for training was the Adam optimizer with no weight decay. The network was trained for 35 epochs.
Two evaluation metrics, the root mean square error (RMSE) and mean absolute error (MAE), were used to evaluate the effectiveness of the proposed network:

$\mathrm{RMSE} = \sqrt{\frac{1}{k}\sum_{i=1}^{k}\left(y_i - \hat{y}_i\right)^2}, \quad (10)$

$\mathrm{MAE} = \frac{1}{k}\sum_{i=1}^{k}\left|y_i - \hat{y}_i\right|. \quad (11)$

Here, $k$ represents the total number of predictions, $y_i$ is the ground truth value, and $\hat{y}_i$ is the predicted value. These metrics were used to quantitatively assess the performance of the network.
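For reference, a straightforward NumPy sketch of the two metrics (function names are our own) is:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error, Equation (10)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error, Equation (11)."""
    return float(np.mean(np.abs(y_true - y_pred)))
```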
4.4. Results
4.4.1. Model Efficiency
Quantitative and qualitative analyses were performed to evaluate both the predictive performance and computational efficiency of the proposed method, particularly on the CARLA-based EventScape dataset. As a first step, we examined the loss curves for all models during both the training and validation phases to visually assess their behavior. As illustrated in
Figure 6, the loss curves for all models are presented. To ensure a fair comparison, only the MSE loss is considered. From the figure, it can be seen that LiS-Net reaches the lowest loss once training stabilizes, suggesting efficient training and strong prediction performance.
In the context of steering angle prediction, two key metrics were used: Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). Higher RMSE values indicate the presence of large, infrequent deviations in steering angle predictions, potentially leading to instability in vehicle control. MAE reflects the average magnitude of prediction errors and provides insight into overall prediction accuracy. As shown in
Table 3, we report RMSE, MAE, the number of model neurons (as a proxy for parameter count), and floating-point operations (FLOPs), which offer a direct measure of computational complexity during inference.
To provide a more comprehensive perspective on model efficiency, we additionally calculate the FLOPs required by each network at different sequence lengths (Seq = 1 and Seq = 16). The results indicate that brain-inspired models such as NCP and LiS-Net exhibit lower FLOPs compared to conventional architectures like CNN and LSTM. Among them, NCP achieves the lowest computational cost, while LiS-Net remains highly competitive, achieving comparable efficiency with only a marginal increase in FLOPs. Notably, LiS-Net attains the best overall predictive accuracy, with the lowest RMSE and MAE scores across all evaluated models.
These results highlight the strength of LiS-Net in balancing accuracy and efficiency. By leveraging the compact structure of neural circuit policies and the dynamic temporal modeling capability of CfC neurons, LiS-Net demonstrates strong potential for real-time deployment in resource-constrained autonomous systems.
To further evaluate the performance of different models on event data, a qualitative comparison of steering angle prediction using various frameworks, including CNN, NCP, and LiS-Net, was performed. The results are illustrated in
Figure 7, where the left graph shows a comparison of predictions across the entire test dataset, the middle graph presents a zoomed-in segment for closer analysis, and the right graph illustrates the deployment results on our collected dataset.
As shown in
Figure 7, the brain-inspired networks outperformed the CNN in terms of fitting the amplitude of the steering angles more accurately. However, the predictions from the NCP model exhibited noticeable oscillations, which can affect stability. In contrast, LiS-Net, with its incorporation of a smooth adjustment mechanism, achieved significantly better overall smoothness in the predictions while maintaining a high degree of accuracy in amplitude fitting. This demonstrates the advantage of LiS-Net in achieving both precise and stable steering predictions, making it a robust choice for event-based steering tasks.
4.4.2. Model Generalization
When deploying the above models on an unfamiliar dataset, namely our collected dataset, it is evident from
Figure 7 that the accuracy and smoothness of the predictions were not as strong as the results observed during testing on the simulation dataset. However, comparing the models side by side, our proposed LiS-Net demonstrates superior performance, particularly in terms of maintaining smooth predictions. To quantify smoothness, we computed the total variation (TV) of the steering angle predictions. The TV was 105.42 for the CNN, 130.08 for NCP, and 94.93 for LiS-Net. Clearly, LiS-Net achieved the lowest total variation, indicating significantly smoother predictions compared to the other models. This highlights the robustness and generalizability of LiS-Net, making it better suited for real-world deployment and diverse scenarios.
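Here, total variation is computed as the sum of absolute differences between consecutive predicted angles; this unnormalized form is a standard definition and is assumed to match the reported values.

```python
import numpy as np

def total_variation(pred):
    """Total variation of a steering prediction sequence: sum of absolute
    differences between consecutive predictions (lower = smoother)."""
    return float(np.sum(np.abs(np.diff(pred))))
```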
In discussing
Figure 7c, we note the two downward spikes in the ground truth data (in blue). These spikes occur at intersections where both left and right turns are feasible. However, the overall route design specifies right turns at these locations, meaning the ground truth reflects right-turn trajectories. The observed downward spikes may result from a mismatch between the simulated training data, which likely contained more left-turn intersections, and the real-world route, which requires right turns. Consequently, the models predict left turns at these points, not due to model failure but due to the data distribution in the simulation. This suggests that the discrepancy is rooted in the nature of the training data rather than the model’s generalization capability.
4.5. Ablation Studies
To further investigate the efficacy of the proposed method, we conducted ablation studies focusing on the design of the loss function and the contribution of individual model components. As shown in
Figure 8, we designed two sets of ablation experiments to support our findings.
4.5.1. Ablation Study on the Impact of Hyperparameter w in the Loss Function
As previously mentioned, our loss function consists of two components: imitation loss and smoothness loss. The hyperparameter w determines the relative weight between these two terms, and choosing an appropriate value for w is crucial to achieving a balance between prediction accuracy and output smoothness.
This experiment visualizes the training and validation performance under different values of the weight $w$ assigned to the smoothness term. The results show that setting $w = 1$ achieves the best trade-off between the imitation loss and the smoothness loss, leading to optimal overall performance. This demonstrates the importance of carefully balancing accuracy and smoothness during training.
4.5.2. Ablation Study on Model Performance Across Different Modules
This analysis evaluates the contribution of different components within our LiS-Net architecture by comparing it against the baseline NCP model using three metrics: RMSE and MAE for accuracy, and Total Variation (TV) for smoothness. The experimental results confirm that integrating the CfC-based neuron model and the proposed smooth adjustment module significantly improves both accuracy and smoothness. These components are thus essential for achieving the overall performance gains observed in LiS-Net.
6. Conclusions
This paper explores the integration of event cameras with brain-inspired neural networks and proposes a novel architecture called LiS-Net for end-to-end steering angle prediction. By leveraging event cameras as input and combining them with brain-inspired neural networks, LiS-Net shows promising results when compared to traditional networks. Specifically, our approach reduces the number of neurons while achieving improved performance in terms of RMSE and MAE metrics. To evaluate its effectiveness, we trained and tested LiS-Net on a simulated dataset and further validated its performance on a real-world dataset collected using an experimental vehicle equipped with event cameras. The qualitative and quantitative evaluation indicates that LiS-Net consistently outperforms traditional methods, demonstrating reliable and accurate steering angle predictions. These findings suggest that combining bio-inspired sensors with brain-inspired computation could be a viable direction for more scalable and efficient autonomous driving solutions.
However, this work is limited to steering angle prediction, and further research is needed to extend the application of brain-inspired networks to longitudinal control tasks, such as velocity prediction or braking control. Additionally, our current study uses event cameras as the sole input modality. Event cameras offer significant advantages, such as high temporal resolution and robustness to dynamic lighting conditions, which are particularly useful for handling high-speed motion and fast-changing environments. However, event cameras also have certain limitations, such as the lack of detailed intensity information and lower spatial resolution compared to frame-based cameras.
Given these strengths and limitations, future research could explore the fusion of event cameras with frame-based cameras to capitalize on their complementary advantages. While event cameras excel at capturing fast-moving objects and providing detailed temporal information, frame-based cameras can offer rich spatial details and color information. By combining these modalities, future systems could achieve a more comprehensive understanding of the environment, leading to improved perception and decision-making capabilities for autonomous driving applications.