LiDAR-as-Camera for End-to-End Driving

The core task of any autonomous driving system is to transform sensory inputs into driving commands. In end-to-end driving, this is achieved via a neural network, with one or multiple cameras as the most commonly used input and low-level driving commands, e.g., steering angle, as output. However, simulation studies have shown that depth-sensing can make the end-to-end driving task easier. On a real car, combining depth and visual information can be challenging due to the difficulty of obtaining good spatial and temporal alignment of the sensors. To alleviate alignment problems, Ouster LiDARs can output surround-view LiDAR images with depth, intensity, and ambient radiation channels. These measurements originate from the same sensor, rendering them perfectly aligned in time and space. The main goal of our study is to investigate how useful such images are as inputs to a self-driving neural network. We demonstrate that such LiDAR images are sufficient for the real-car road-following task. Models using these images as input perform at least as well as camera-based models in the tested conditions. Moreover, LiDAR images are less sensitive to weather conditions and lead to better generalization. In a secondary research direction, we reveal that the temporal smoothness of off-policy prediction sequences correlates with the actual on-policy driving ability equally well as the commonly used mean absolute error.


I. INTRODUCTION
Fully end-to-end autonomous driving systems rely on a neural network to transform sensory inputs into raw driving commands, without clearly defined sub-modules [1], [2], [3]. Most commonly, driving is achieved based solely on camera images [4], [5], [6], [7], often with additional information about the desired route provided via one-hot encoded navigation commands or via route planner screen images [8], [9], [10]. This input combination is cheap and seems sufficient for human drivers to complete routes safely, making it an interesting object of study. However, in simulation a precise depth image can readily be generated and has been shown to be useful for driving models [11]. Even an approximate depth image predicted from an RGB image may improve results [11], [12].
In the real world, one can also attempt to predict depth images based on monocular camera images [13], [14], [15]. Alternatively, stereo cameras can be used, but they suffer from a limited range. A more reliable depth image can be obtained by projecting the LiDAR point cloud to an RGB camera image. However, merging this depth image with a camera image is not trivial. The two sensors are usually not located in the same place and may see the world from different angles, leading to different blind spots. Besides the blind spots, extrinsic calibration of the sensors allows matching the depth and color values, but needs a precise procedure to be set up and may need to be repeated regularly. Furthermore, the capture frequencies may differ and it is not easy to guarantee temporal synchronization. Finally, even if good calibration is achieved for training data, synchronization failures or calibration errors may occur during deployment.
Manuscript accepted to Fresh Perspectives on the Future of Autonomous Driving workshop, hosted at ICRA 2022.
All authors are with the Institute of Computer Science, University of Tartu, Estonia.
Code and instructions for getting access to data: https://github.com/UT-ADL/e2e-rally-estonia
To remove the need to use multiple sensors, Ouster LiDARs can generate a surround-view image containing perfectly aligned depth, intensity, and ambient radiation channels [16]. The intensity and ambient radiation channels can be seen as providing visual information, albeit from the infrared wavelength range. We, therefore, have a 3-channel input with temporally and spatially perfectly aligned visual and depth components.
This input is in image form and can be analyzed using any of the very successful approaches developed in computer vision. Network architectures for extracting information from images are more mature than network architectures for point clouds. We hence have a sensor providing rich information in a form that we know well how to analyze. This is the first experiment in our effort to validate the usefulness of these LiDAR images for increasingly complicated driving tasks such as highway driving and urban driving. Here, we restrict ourselves to the simpler task of road following, albeit in the complex settings of real-world rally tracks chosen to be challenging also for humans. Our main contributions are the following:
1) We compare the LiDAR-image-based driving with camera-based driving and show it yields beneficial robustness to light and weather conditions in this task.
2) We study the correlation between off-policy and on-policy performance metrics, which has not been done before in the real-car context.
3) We collect and publish a real-world dataset of more than 500 km of driving on challenging rally tracks, with LiDAR and camera sensors and centimeter-level accurate GNSS trajectory. The dataset covers a diverse set of weather conditions, including snowy winter.

A. Behavioral Cloning
Behavioral cloning takes a supervised learning approach to self-driving [17]. Based on information from a chosen set of sensors, the model is optimized to produce the same driving behavior as a human would. This behavior is usually described by the sequence of low-level commands given or the trajectory taken [1], [2], [3]. For model training, a dataset is collected consisting of sensor recordings during human driving, accompanied by the driving commands or the trajectory produced by the driver.
arXiv:2206.15170v1 [cs.AI] 30 Jun 2022
Such an imitation approach has worked well for simpler tasks such as lane following [4], [17], which seem not to require restrictive amounts of training data. However, dense traffic scenarios remain challenging for behavioral cloning [18]. In addition to Tesla and comma.ai, multiple companies report promising performance in real-world urban driving with neural-network-based solutions [10], [19], [20], but it is unclear to what extent these can be considered end-to-end. Though end-to-end models in other fields, e.g. speech recognition, have shown good generalization, replicating this success in self-driving is costly due to the massive amounts of data the cars produce. As further limitations, safety guarantees against rare situations and adversarial attacks are lacking [21], [22] and interpreting model decisions remains challenging [1].

B. Data Collection
In the period of May 2021 to October 2021, training recordings of human driving were collected from all non-urban WRC Rally Estonia tracks and a few similar routes. Driving was performed with a Lexus RX 450h fitted with a PACMod v3 drive-by-wire system provided by AutonomouStuff. The following sensors were recorded: NovAtel PwrPak7D-E2 GNSS device, Ouster OS1-128 LiDAR, three Sekonix SF3324 120-degree FOV cameras, and one Sekonix SF3325 60-degree FOV camera. All tracks were recorded in both directions at least once, amounting to more than 500 km of driving. The roads were mostly very low-traffic gravel roads, with shorter sections of two-lane paved roads. In January-February 2022 and in May 2022, further data collection was performed in snowy and early spring conditions. This data was only used for off-policy metric computation, not for training. The list of recordings used in this work is detailed in Appendix IV.
The driving recordings from spring, summer, autumn, and winter differ strongly in vegetation levels and light conditions. All driving was done in daylight, but in differing weather conditions including heavy rain. The dataset, including recordings from sensors not used in this work, will be made fully available with this publication.

C. Data Preparation
For neural network training, only recordings from the Ouster OS1-128 mid-range LiDAR and the Sekonix SF3324 RGB camera placed at the front center of the car were used. The list of recordings from May-October 2021 was divided into training (460 km of driving) and validation sets (80 km of driving). Recordings from the evaluation track, where the on-policy evaluation was later performed, were not part of the training set, unless stated otherwise, but were part of the validation set that was used for early stopping. The lists of recordings used for model training and validation are given in Appendix IV.
The surround-view LiDAR-image output of the Ouster OS1-128 LiDAR contains the channels of depth, intensity, and ambient radiation (Fig. 1). In this work, the depth channel is further pre-processed so that distances in the range from 0 to 50 meters are mapped linearly to the values 255 to 0, i.e., a depth resolution of approximately 20 cm (50 m over 255 levels). All distances beyond 50 meters are marked as 0. For both RGB and LiDAR-image inputs, the rows showing the hood of the car and all rows above the horizon were cropped away, and the columns were cropped to keep a 90-degree field of view at the center front. The camera image was also downscaled to make it match the LiDAR-image size. No further processing was done. This resulted in a 258x66x3 image as neural network input for both input types. The target labels correspond to the steering wheel angles as produced by human drivers.
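The depth mapping described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `encode_depth`, the use of NumPy, and rounding to the nearest integer are assumptions, and the input is assumed to be a depth array already expressed in meters.

```python
import numpy as np

MAX_RANGE_M = 50.0  # distances beyond this are marked as 0, as stated in the text


def encode_depth(depth_m):
    """Map distances in [0, 50] m linearly to pixel values 255..0 (hypothetical helper)."""
    depth_m = np.asarray(depth_m, dtype=np.float64)
    scaled = 255.0 * (1.0 - depth_m / MAX_RANGE_M)
    # Rounding to the nearest integer is an assumption; the text does not specify it.
    pixels = np.rint(np.clip(scaled, 0.0, 255.0))
    # Everything farther than the maximum range is set to 0.
    pixels[depth_m > MAX_RANGE_M] = 0
    return pixels.astype(np.uint8)


example = encode_depth([0.0, 25.0, 50.0, 80.0])
# One pixel level corresponds to 50 m / 255, i.e. roughly 19.6 cm.
resolution_cm = 100.0 * MAX_RANGE_M / 255.0
```

With this encoding, nearby objects are bright and distant ones dark, and a point at exactly 50 m becomes indistinguishable from a point beyond the range limit.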
No data diversification methods were employed. Firstly, we assume our dataset is large enough to learn the task without augmentations. Secondly, useful augmentation is not easy to perform: recent works have described that models often learn to detect the fact of augmentation itself, instead of learning a generalized policy [23], [24].
The dataset was not balanced in any way. The rally tracks are curvy and are not dominated by stereotypical drive-straight behaviour. We do not think there is any type of situation in our data that should be undersampled. Crossroads and interactions with other cars on the road are not excluded from the dataset, despite using the data only for learning road-following.

D. Architecture and Training Details
Here our intent is to use a relatively simple network architecture because the goal is to compare two input modalities rather than to achieve the best possible model. With a limited amount or variability of data, powerful models could overfit, masking the effect of the chosen inputs. Recent works have reported success on as little as 30 hours of training data [10], [18]. Our 500 km of data corresponds to only 15 hours of driving, placing us in danger of overfitting our models.
We use a slightly modified version of the classical PilotNet architecture from [4]. We add a batch normalization after  [25]. For similar reasons, we use LeakyReLU [26] as the activation function instead of ReLU. The architecture is summarized in Table I. The model outputs the steering angle as the lateral control command. The driving speed is not controlled by the network. We use mean absolute error (MAE) as the loss function, as this metric has been shown to correlate better than mean squared error with on-policy driving ability [27]. We use the Adam optimizer with weight decay [28] with default parameters in PyTorch [29]. We use early stopping if no improvement on the validation set was achieved in 10 consecutive epochs, with the maximum epoch count fixed to 100. The code with model definitions and training procedure will be made available on GitHub with the publication.
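The early-stopping rule described above (a patience of 10 epochs and at most 100 epochs) can be illustrated with a small, framework-free sketch. The function `simulate_early_stopping` is hypothetical and operates on a precomputed list of validation losses, so it only demonstrates the stopping logic, not the actual training loop used in this work.

```python
def simulate_early_stopping(val_losses, patience=10, max_epochs=100):
    """Return (best_epoch, last_epoch_run) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs bring no improvement,
    or once `max_epochs` epochs have been run. Epochs are counted from 0.
    """
    best_loss = float("inf")
    best_epoch = -1
    last_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if epoch >= max_epochs:
            break
        last_epoch = epoch
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, last_epoch
```

In practice, the model weights from `best_epoch` would be the ones kept for evaluation; whether the authors restore the best checkpoint in this way is an assumption.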
It has been reported that multiple training runs can result in clearly differing on-policy behaviors [18]. To minimize the potential effect of training instability, we train three versions of our main models and report the metrics for each.

E. Evaluation Metrics
The models were evaluated on- and off-policy. It is widely reported that off-policy metrics correlate poorly with actual driving ability [27]. However, they are cheap to compute before deploying the solution. If a better off-policy metric could be found, development could be sped up by selecting only the best models for real-life evaluation.
Off-policy metrics are computed using human-driven validation recordings originating from the same season as when the on-policy evaluation happened. We limit the set of recordings to the same season because we assume that off-policy metrics computed on summer data would have no information about driving ability in the winter and vice versa.
We report the mean absolute error (MAE) between human commands and model predictions as this metric has been reported as having favorable correlation with driving ability [27]. In addition, following [30], [31], we also compute the whiteness of the predicted command sequence:
$$W = \sqrt{\frac{1}{D}\sum_{i=1}^{D}\left(\frac{\delta P_i}{\delta t}\right)^2}$$
where δP_i is the change in predicted steering angle, D is the size of the dataset, and δt the temporal difference between decisions, with δt = 0.1 s for LiDAR and δt = 0.033 s for camera. Whiteness measures the mean smoothness of the sequence of commands generated and can be computed on- and off-policy. Here, W_off-policy refers to the whiteness of the sequence of commands generated by a model on human-driving recordings from the evaluation track, in the same season as when the on-policy testing took place. In contrast, W_on-policy refers to the whiteness of the commands generated during model-controlled driving during evaluation.
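As an illustration, whiteness can be computed from a sequence of predicted steering angles as sketched below. The sketch assumes that whiteness is the root mean square of the rate of change δP_i/δt of consecutive predictions; the function name `whiteness` and the choice to average over the number of consecutive differences (one fewer than the number of predictions) are assumptions made for this example.

```python
import math


def whiteness(predicted_angles, dt):
    """Root mean square of the steering-angle rate of change (assumed definition).

    predicted_angles: predicted steering angles in temporal order.
    dt: time between consecutive decisions in seconds (e.g. 0.1 for LiDAR).
    """
    if len(predicted_angles) < 2:
        raise ValueError("need at least two predictions to compute a difference")
    # delta P_i: change between consecutive predictions.
    deltas = [b - a for a, b in zip(predicted_angles, predicted_angles[1:])]
    mean_square = sum((d / dt) ** 2 for d in deltas) / len(deltas)
    return math.sqrt(mean_square)
```

A constant command sequence yields a whiteness of zero, while a sequence that alternates between two values at every step yields a large value, matching the intuition that jerky steering is penalized.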
We consider the smoothness of commands a promising metric because during on-policy testing we observed that jerkiness of driving, i.e. temporally uncorrelated commands, seems to predict an imminent intervention. A model not responding to consecutive, very similar frames in a consistent manner might reveal its inability to deal with the situation.
The number of interventions during a test route was counted and distance per intervention (DpI) computed as the main quality metric. The models were trained to perform route following and not to handle intersections. All interventions at intersections were removed from the count. For safety reasons, in the case of an oncoming car, the safety driver always took over the driving. Interventions due to traffic were also excluded from the intervention count.
As additional on-policy metrics, we measure the deviation of the model driving compared to a human trajectory on the same route. Locations were measured using a NovAtel PwrPak7 GNSS receiver combining the inertial navigation system (INS) data with real-time kinematic positioning (RTK), achieving centimeter-level precision. For each position in the model-driven trajectory, the offset is defined as the average distance to the two closest human trajectory points. The mean of this lateral difference along the route is reported as MAE_trajectory. We define failure rate as the proportion of time this lateral difference was above 1 meter.
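The two trajectory metrics defined above can be sketched as follows. This is an illustrative implementation under stated assumptions: positions are given as planar coordinates in meters (for example after projecting GNSS fixes to a local metric frame), the model trajectory is sampled at a uniform rate so that the proportion of samples approximates the proportion of time, and the function names are hypothetical.

```python
import numpy as np


def lateral_offsets(model_xy, human_xy):
    """Average distance from each model position to its two closest human points."""
    model_xy = np.asarray(model_xy, dtype=np.float64)
    human_xy = np.asarray(human_xy, dtype=np.float64)
    # Pairwise distances with shape (num_model_points, num_human_points).
    diffs = model_xy[:, None, :] - human_xy[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    two_closest = np.sort(dists, axis=1)[:, :2]
    return two_closest.mean(axis=1)


def trajectory_metrics(model_xy, human_xy, threshold_m=1.0):
    """Return (MAE_trajectory, failure rate) for a model-driven trajectory."""
    offsets = lateral_offsets(model_xy, human_xy)
    mae_trajectory = float(offsets.mean())
    failure_rate = float((offsets > threshold_m).mean())
    return mae_trajectory, failure_rate
```

The pairwise distance matrix grows with the product of the two trajectory lengths, so for long recordings a spatial index such as a k-d tree would be a more economical choice.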

F. On-policy Evaluation Procedure
On-policy evaluation was performed on a 4.3 km section of the SS20/23 Elva track in both driving directions (cf. Appendix II for a map of the route). The speed along the route was set to 80% of the speed a human used in the same location on the route, as extracted from a prior recording of human driving. Driving at 100% human speed was attempted, but was too dangerous to use with weaker models. To cover different weather conditions, the evaluation was intended to be completed in two parts, autumn and winter, but due to technical reasons, a third session in spring was needed.
• In the last week of November 2021: the weather conditions and vegetation levels were very similar to the most recent training data recorded at the end of October. Due to a missing parameter in the inference code, the RGB models were run on BGR input and the results had to be discarded. Hence, only LiDAR-based models were adequately tested in this session. Night driving was performed with dipped-beam headlights on. Results from these tests are marked with (Nov) in Table II.
• In the first week of February 2022: with snow coverage on the road. This constitutes a clearly out-of-distribution scenery for the camera models. Moreover, also for LiDAR models the surface shapes and reflectivity of snow piles differ from vegetation and constitute out-of-distribution conditions. LiDAR and camera images from summer, autumn and winter are given in Appendix I. From this trial, marked with (Jan), we report only driving performance with LiDAR, as the camera still operated in BGR mode.
• In the first week of May 2022: early spring, which constitutes a close-to-training-distribution condition. Camera models were evaluated with correct inference code. The location of the LiDAR on the car had changed before this trial compared to the training data. LiDAR-based models underperformed during this test, despite our efforts to adjust the inputs.
In short, camera models were tested with adequate inputs only in spring 2022, whereas LiDAR models only in autumn 2021 and winter 2022. The weather conditions were not identical and direct comparison of values should not be made.
The rally tracks are narrow and bordered by objects harmful to the car, hence the safety driver was at liberty to take over whenever they perceived danger. An intervention is hence defined as a situation where the safety driver perceived excessive threat to the car or the passengers. An intervention was triggered by the safety driver applying force to turn the steering wheel. If the model turned the steering wheel at the same moment and in the same direction as the safety driver, no force was applied and no intervention counted.

III. RESULTS
In this section, we present the driving ability of our models as measured by on-policy metrics. We also compute off-policy metrics, but only for the purpose of evaluating their correlations with on-policy performance.

A. Driving on an Unseen Track
The ultimate goal of end-to-end self-driving is to create models that can generalize to new roads without the use of high-definition maps. Hence, we first summarize the models' ability to generalize to the evaluation track from other similar roads. Three instances of LiDAR models were tested in autumn and three camera models in spring. Using multiple models allows the reader to grasp the stability of results. A larger number of repetitions was not used due to the complexity of real-world evaluation. The metrics for these six evaluations are given in the first section of Table II.
The results indicate that in in-distribution weather, but on a novel route, the performance of LiDAR-based models is similar to or better than that of camera models. The evaluations took place half a year apart, but conditions were suitable in both cases. Spring testing was done on a largely cloudy day, with only short periods of direct sunlight. The autumn test took place in cloudy and dim daylight with short periods of very light rain. These conditions should be sufficiently close to ideal for camera and LiDAR models respectively.

B. Overfitting Setting
We next asked to what extent the task was more difficult due to needing to generalize to a new route. We trained camera and LiDAR models that included, in their training set, one human driving recording in each direction from the evaluation track. As these models are exposed to the objects and types of turns on the evaluation track, we call this the "overfitting" (to the evaluation track) setting. The second section of Table II indicates that while the LiDAR model clearly benefitted from test-track recordings, the effect is weaker for the camera-based model. The overfitted LiDAR model drove without interventions, while the RGB model yielded similar performance to the non-overfit models.
We conclude that approximately 500 km of road following data in the original training set did not suffice for good generalization to similar but unseen roads. Data augmentation techniques could be applied or more data collected to increase generalization over this source of data variability.

C. Night Driving and Winter Driving
The third set of on-policy tests evaluates the models' ability to generalize to weather conditions very different from the training distribution. We have a priori no expectation of camera-based driving models generalizing to these conditions. The camera images during the night differ drastically from daylight driving, despite using headlights. Similarly, the color distribution and brightness of camera images in the winter with snow coverage is clearly out-of-distribution. These differences are easy to detect for the human eye.
In contrast, the extent to which these two novel conditions are out-of-distribution for LiDAR-based models is difficult to estimate with the naked eye. A priori, we can assume that depth and intensity channels should be affected only minimally by lack of sunlight in night driving, with ambient radiation somewhat affected. Snow coverage adds more smooth surfaces to the landscape, but the resulting depth image may remain in the proximity of the diversity of scenes contained in the training data. Ambient radiation and intensity images are likely out-of-distribution due to the different reflective properties of snow and vegetation, but the extent of this effect on LiDAR-based driving models was unknown before testing.
The results from these trials are marked with night and winter in Table II. The camera models immediately steered the car off the road, so there is no performance to report. While the experiments were done with flawed BGR input to the models, judging from the performance of BGR models in other experiments we do not expect the performance with RGB input to be much different (cf. Appendix V). LiDAR models trained with day-time datasets see only a minimal drop in performance when deployed at night. However, when deploying models trained on data from spring, summer, and autumn to snow-covered roads, LiDAR-based models also see a clear drop in performance. LiDAR models manage to maintain some of their driving ability, but drop from ≈ 4000 m to 226-425 meters per intervention. Qualitatively, we report that LiDAR models drove reasonably well in the forest where depth information

D. Informativeness of Individual LiDAR Channels
We also performed on-policy testing of models trained on individual LiDAR-image channels. This was done to obtain a better understanding of the usefulness of each of these channels. This evaluation was performed in in-distribution weather, in November 2021. As these experiments were performed on another day compared to the tables above, a 3-channel model was also re-evaluated to confirm the conditions were similar. The tested models were trained with no recordings from the evaluation track in the training set.
Examples of the images from these three channels in summer, autumn, and winter are given in Appendix II. On visual inspection, the intensity channel seems approximately as sharp and as informative as a gray-scale camera image, albeit capturing a different wavelength. The depth image is less spatially dense but clearly informative about sufficiently large obstacles. However, ambient radiation images depend strongly on sunlight being present and seem an unreliable source of information.
In Table II we observe that the model trained on the intensity channel alone can perform surprisingly well. However, neither the depth nor the ambient radiation channel contains sufficient information for safe driving. The depth-based model struggled to drive safely even in the forest, where trees could have provided depth cues of where to steer towards. These channels may nevertheless still contribute useful information to the 3-channel model.

E. Correlation Study Between On-and Off-Policy Metrics
In this work, we trained a total of 11 models (3+1 LiDAR, 3+1 camera, and individual LiDAR channels). We deployed these models in more and less suitable conditions, including accidentally deploying RGB-image models using a BGR video stream and deploying LiDAR models after the sensor location had been changed. Here, we wished to study if the on-policy performance of these test drives could have been predicted before deployment via off-policy metrics or at least during the drive via non-discrete on-policy metrics.
In the following we used metrics from 17 model deployments. The list of trials included and the associated metrics are given in Appendix IV. This list includes using an RGB-based model with a BGR camera stream and using LiDAR in a changed location, because the models were capable of driving despite the disadvantageous conditions. Each deployment was matched with an off-policy evaluation using recordings from the same track, similar season and similar sensor configuration (e.g. using BGR images). We computed the Pearson correlation [32] between DpI and the various other on- and off-policy metrics. The DpI for trials with no interventions was set to 10 km for the computations. The resulting correlation values are given in Table III.
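The correlation analysis described above can be sketched in a few lines. The example below is a minimal illustration, not the analysis code used in this work: the function names are hypothetical, the Pearson coefficient is computed directly from its textbook definition, and the only convention taken from the text is that trials without interventions receive a DpI of 10 km.

```python
import math

NO_INTERVENTION_DPI_M = 10_000.0  # DpI assigned to trials with zero interventions


def distance_per_intervention(distance_m, interventions):
    """Distance per intervention in meters, capped by convention when there were none."""
    if interventions == 0:
        return NO_INTERVENTION_DPI_M
    return distance_m / interventions


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)
```

A negative coefficient between DpI and an error-like metric such as MAE or whiteness is the desired outcome, since lower errors should accompany longer distances between interventions.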
Matching the perception of passengers, the whiteness of effective wheel angles during the drive shows a correlation with DpI, at r = −0.67. The whiteness of model outputs W_on-policy and the mean distance from a human trajectory show somewhat weaker correlation with DpI. The measures used here are averages over multiple kilometers and actual danger prediction should happen on a more precise scale. Evaluating whiteness as an online predictor of end-to-end model reliability is outside the scope of this work.
Among the off-policy metrics, W_off-policy correlates to a similar degree with intervention frequency as the MAE of steering angles. The difference between the two Pearson correlation coefficients is not significant as per a permutation test. Notice that these two metrics are very different in nature: one measures the quality and the other the temporal stability of predictions. When combining these two metrics via summation after standardization, an even higher correlation with DpI can be obtained, at r = −0.82. The improvement over MAE-only correlation is however not statistically significant (permutation test, mean effect size -0.05, p_val = 0.16).
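The combination of the two off-policy metrics via summation after standardization can be sketched as follows. The helper names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which variant was used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant sequence")
    return [(v - mu) / sigma for v in values]


def combined_metric(mae_values, whiteness_values):
    """Sum of standardized MAE and standardized off-policy whiteness, per model."""
    if len(mae_values) != len(whiteness_values):
        raise ValueError("both metrics must be given for the same deployments")
    z_mae = standardize(mae_values)
    z_white = standardize(whiteness_values)
    return [a + b for a, b in zip(z_mae, z_white)]
```

Standardizing first puts both metrics on a common scale, so that neither dominates the sum merely because of its units.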
To our knowledge, mean absolute error is a very commonly used off-policy metric for estimating model quality before deployment and for early stopping during model training. MAE has been shown to correlate with driving ability better than multiple other metrics [27]. Our result suggests that the whiteness of the command sequence generated on an appropriate validation set might serve as a complementary model-selection metric (cf. Discussion).

IV. DISCUSSION
In the present work we collected a high-quality dataset for the end-to-end road following task on challenging rural roads used for the World Rally Championship. This dataset contains driving on narrow and complicated routes during the four seasons of the temperate climate. The measurements of all sensors, including those not used here, across more than 500 km of driving are made publicly available.
On two separate sensory inputs of this dataset, LiDAR image and frontal camera, we trained models to control the steering of the car. The models were evaluated off-policy and on-policy with speed fixed to 80% of human speed. We show that LiDAR-image input as produced by the Ouster OS1-128 LiDAR firmware contains sufficient information for road following also on the complex and narrow rally tracks designed to be challenging for humans. The task is not trivial, as evidenced by the similarly trained RGB-image-based models achieving similar performance. Curiously, models using only the intensity channel of the LiDAR image, information often discarded in point cloud analysis, also performed competitively. Whether this information is sufficient or useful in more complex driving tasks such as highway driving may merit further study.
The benefits of LiDAR-based driving become apparent when needing to generalize to new conditions. LiDAR images are more similar across weather conditions (cf. Appendix I) and we hypothesize that this allows the entirety of the training data to be useful for driving in all conditions, including those not in the training data. Driving demonstrations from sunny summer days benefit LiDAR-based driving on a dark autumn night, as evidenced by our LiDAR models being able to drive in the night and to some extent even in the winter. In contrast, RGB-based models cannot generalize to night driving. It seems that for a simple RGB-based behavioral cloning approach, demonstrations of various traffic situations need to exist in a variety of visually different conditions, increasing the data need. A higher data efficiency of LiDAR-based models would be an interesting property at least for research institutions that cannot boast fleets of cars collecting massive amounts of data daily.
During night driving, LiDAR models can rely on the intensity and depth channels, which come from active sensing and are independent of external light sources. The depth channel is also independent of the reflectivity of the surfaces and yields in-distribution values also with snow coverage. While depth alone proved insufficient for safe end-to-end driving even in training conditions, it may still contribute reliable information to the 3-channel models. Furthermore, we assume that the importance of depth information becomes more apparent in highway and urban driving tasks, where distance to other traffic participants must be maintained.
LiDAR information is often used in its point cloud representation. Here, using the image representation allowed us to perform a fair comparison of LiDAR and RGB camera input modalities, as identical methods could be applied. We believe processing LiDAR data in image form can be useful in general, because computer vision is one of the most studied topics in deep learning and many established architectures exist for image processing. Certain architectures are empirically validated to perform various tasks in a reliable manner and methods exist for sensitivity analysis.
Evaluating autonomous driving systems is complicated because the ability to drive safely can only be measured by deploying the model. When exploring architectures to use, data sampling techniques, or other aspects of the training procedure, one would need to deploy the models to know which techniques work best. This is extremely costly in the real world. If a combination of off-policy metrics could be found that correlates reliably with actual driving ability when deployed, only the more promising models could be selected for testing and evident failures discarded. Here we showed that among sufficiently capable models, the whiteness, i.e. smoothness of generated commands on an appropriate validation set, predicted driving ability as well as the magnitude of errors. We hypothesize that non-smoothness reveals a model's uncertainty about the situation: the model reacts differently to very similar inputs. In future work, we propose to evaluate the correlations of other measures of epistemic uncertainty with on-policy performance. Using a more general uncertainty measure carries the benefit of being applicable to a wider range of output modalities, e.g. trajectories and cost-maps. However, these metrics capture only variance and not bias, and a trivial constant model would show perfect stability. Hence, such stability measures should be used in combination with other metrics (e.g. MAE).

V. ACKNOWLEDGEMENTS
This work was supported by the Estonian Research Council grant PRG1604 and collaboration project LLTAT21278 with Bolt Technologies.

As a critique of behavioral cloning, we experienced that the models were highly sensitive to shifts in inputs. First of all, we accidentally performed an experiment of feeding BGR images to models trained with RGB images. These models were able to drive with BGR input, but clearly worse than with clean RGB input. For a human, the scene would still be easily understandable after switching blue and red colors, especially with dim light and cloudy skies as during autumn testing. The comparison of RGB models' performance using BGR and RGB inputs can be seen in Table VIII, albeit not in similar conditions (Nov and May sessions, respectively).
Similarly, we performed an unplanned test with the LiDAR sensor moved from its original location in the center of the roof to the front of the roof. The performance of the LiDAR model suffered considerably despite our attempts to fix it by changing the cropped area from the surround-view LiDAR image. The comparison of LiDAR performance with original and shifted location can be seen in Table VIII, Nov and May sessions respectively.
More worrying is the fact that changing the location of the crop by only one pixel to the left or right resulted in perceivable biases in on-policy driving behavior. In the vertical direction, too, a difference of a few pixels in crop height clearly mattered. Indeed, logically, a crop taken more from the left should result in turning more towards the right, and in a sequential decision-making task this effect can accumulate over time. However, the fact that a difference of just one pixel can bias the model to position itself differently on the road and struggle with certain turns is worrying. We conclude that at least the end-to-end approach used here is extremely sensitive to changes in sensor data.

Fig. 1. Input modalities. Red box marks the area used as model input. Top: surround-view LiDAR image, with red: intensity, blue: depth and green: ambient. Bottom: 120-degree FOV camera.

Fig. 2. LiDAR and camera images in summer, autumn and winter (from top to bottom for LiDAR, left to right for camera). The area used for model inputs is marked with a red rectangle. In LiDAR images, the red channel corresponds to intensity, green to depth and blue to ambient radiation.

TABLE II. RESULTS OF ON-POLICY EVALUATIONS. EVALUATIONS INTERRUPTED DUE TO A HIGH FREQUENCY OF INTERVENTIONS ARE MARKED WITH *. HORIZONTAL LINES SEPARATE VALUES ILLUSTRATING THE DIFFERENT RESULTS SUBSECTIONS.

TABLE III. PEARSON CORRELATIONS OF THE MAIN DRIVING QUALITY METRIC DISTANCE PER INTERVENTION (DPI) WITH OTHER ON- AND OFF-POLICY METRICS. THE HIGHEST-CORRELATING METRICS OF BOTH TYPES ARE HIGHLIGHTED IN BOLD.

TABLE IV. TRAINING DATASET. THE RECORDING NAMES START WITH DATES IN YYYY-MM-DD FORMAT, FOLLOWED BY HOUR. CERTAIN RECORDINGS FAILED TO RECORD GNSS LOCATIONS, BUT RECORDED CAMERA AND LIDAR FEEDS AND STEERING, SO WE CAN STILL USE THEM FOR MODEL TRAINING.

TABLE V. VALIDATION DATASET USED FOR EARLY STOPPING. THE RECORDING NAMES START WITH DATES IN YYYY-MM-DD FORMAT, FOLLOWED BY HOUR. THE LAST TWO RECORDINGS ARE ALSO USED IN THE TRAINING SET OF MODELS OVERFITTED TO THE EVALUATION TRACK. SECTIONS OF THE LAST TWO RECORDINGS (SECTIONS CORRESPONDING TO THE ON-POLICY TESTING ROUTE) WERE USED FOR OBTAINING SEASONAL OFF-POLICY METRICS FOR MODELS TESTED IN THE AUTUMN.

TABLE VI. WINTER RECORDINGS USED FOR SEASONAL OFF-POLICY METRICS COMPUTATION FOR MODELS TESTED IN THE WINTER. ONLY A 4.3 KM SECTION FROM EACH RECORDING WAS USED, CORRESPONDING TO THE ON-POLICY TEST ROUTE. THE RECORDING NAMES START WITH DATES IN YYYY-MM-DD FORMAT, FOLLOWED BY HOUR.

TABLE VIII. ON-POLICY TEST SESSIONS, ON-POLICY METRICS RECORDED AND THE CORRESPONDING OFF-POLICY METRICS FROM THE SAME TRACK IN THE SAME SEASON. THESE VALUES SERVE AS THE BASIS FOR CORRELATION CALCULATIONS BETWEEN DISTANCE PER INTERVENTION (DPI) AND OTHER METRICS. DPI FOR TESTS WITH 0 INTERVENTIONS WAS SET TO 10 KM. COMBINED REFERS TO THE SUM OF MAE_steer AND W_off-policy, BOTH STANDARDIZED TO ZERO MEAN AND STANDARD