XDLL: Explained Deep Learning LiDAR-Based Localization and Mapping Method for Self-Driving Vehicles

Self-driving vehicles need a robust positioning system to continue the revolution in intelligent transportation. Global navigation satellite systems (GNSS) are most commonly used to accomplish this task because of their ability to accurately locate the vehicle in the environment. However, recent publications have revealed serious cases where GNSS fails to determine the position of the vehicle, for example, under a bridge, in a tunnel, or in dense forests. In this work, we propose an explained deep learning LiDAR-based (XDLL) framework that predicts the position of the vehicle by using only a few LiDAR points in the environment, which ensures the speed and accuracy required for interactions between vehicle components. The proposed framework extracts non-semantic features from LiDAR scans using a clustering algorithm. The identified clusters serve as input to our deep learning model, which relies on LSTM and GRU layers to store the trajectory points and convolutional layers to smooth the data. The model has been extensively tested with short- and long-term trajectories from two benchmark datasets, Kitti and NCLT, containing different environmental scenarios. Moreover, we investigated the obtained results by explaining the contribution of each cluster feature with several explainability methods, including Saliency, SmoothGrad, and VarGrad. The analysis showed that taking the mean of all the clusters as the input of the model is enough to obtain better accuracy than the first model, and it reduces the time consumption as well. The improved model obtains a mean absolute positioning error below one meter for all sequences in the short- and long-term trajectories.


Introduction
A self-driving vehicle requires close attention to the execution time of its components, since this may save many lives every day [1]. It must be ready to perform the right action in real time to prevent incidents. In order to achieve this level of control, the system should be built to provide rapid interaction between its components and to receive instant location information, so that it can overtake vehicles and plan paths quickly [2]. Localization of the vehicle is an essential step to ensure the functionality of the other components. The information provided can be used to identify the distances between the vehicle and the structures in the environment, which is crucial for avoiding collisions, finding the shortest distance to the destination, and so on [2]. Global navigation satellite systems (GNSS) are already installed in almost all autonomous vehicles to provide position coordinates by triangulation using at least three satellites. However, this technology faces several issues for real-time execution when the satellite signals are interrupted, such as under bridges, in tunnels, and around tall buildings [3,4]. Several researchers have therefore sought to localize the vehicle with on-board sensors, such as light detection and ranging (LiDAR), cameras, inertial measurement units (IMUs), or even radar sensors [2].
Our review of the state of the art in this area divides the related methods into two categories: direct methods and feature-based methods. Direct methods estimate the vehicle's position directly by calculating the distance between two positions and using dead reckoning (DR) to deduce the position. The IMU of an inertial navigation system (INS) is one of the sensors used in this approach [5]. An IMU provides information about the acceleration and attitude rate of the vehicle. Double integration of the acceleration measurement yields the position of the vehicle. However, double integration is prone to errors that accumulate as the process runs [5]. Several researchers have tried to correct these errors with a machine learning model, such as an input delay neural network (IDNN) [6], multi-layer feed forward neural network (MFNN) [7], recurrent neural network (RNN) [8], or long short-term memory (LSTM) model [9,10]. For more information about deep learning (DL) methods in self-driving vehicles, readers are referred to [11].
Wheel odometry is another tool for estimating the vehicle's position; it works by integrating the measured velocity to obtain the position [12]. However, as mentioned before, the integration process introduces an error that is amplified and accumulated each time the process is re-executed. According to [12], using wheel odometry is better than using an IMU accelerometer, since it requires fewer integration steps. The wheel odometry neural network (WhONet) [12] uses an RNN-based architecture to correct the wheel odometry errors, and was evaluated in extensive experiments over several GNSS outages.
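To make the accumulation argument concrete, the following minimal sketch compares the position drift caused by a constant accelerometer bias (integrated twice) with the drift caused by a constant velocity bias (integrated once). The bias values, sampling period, and duration are hypothetical numbers chosen for illustration; they are not taken from [5] or [12].

```python
import numpy as np

def integrate(signal, dt):
    """Rectangle-rule integration of a sampled signal."""
    return np.cumsum(signal) * dt

dt = 0.1            # sampling period in seconds (assumed)
n = 600             # 60 s of driving (assumed)
true_vel = np.full(n, 10.0)          # constant 10 m/s, so true acceleration is zero
true_pos = integrate(true_vel, dt)

acc_bias = 0.05     # hypothetical accelerometer bias in m/s^2
vel_bias = 0.05     # hypothetical wheel-speed bias in m/s

# IMU path: acceleration -> velocity -> position (two integrations).
imu_vel = 10.0 + integrate(np.zeros(n) + acc_bias, dt)
imu_pos = integrate(imu_vel, dt)

# Wheel-odometry path: velocity -> position (one integration).
odo_pos = integrate(true_vel + vel_bias, dt)

imu_err = np.abs(imu_pos - true_pos)   # grows roughly quadratically with time
odo_err = np.abs(odo_pos - true_pos)   # grows linearly with time
print(f"IMU drift after 60 s: {imu_err[-1]:.2f} m, odometry drift: {odo_err[-1]:.2f} m")
```

The two biases have different units, so the comparison only illustrates the growth rates: a bias that passes through two integrations grows with the square of the elapsed time, whereas a bias that passes through one integration grows linearly.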
Feature-based approaches for vehicle localization extract relevant features from data gathered by one or more sensors. We surveyed only LiDAR-based methods in this work, since camera images are easily affected by weather changes such as snow and rain [1,2]. Extracting features from LiDAR measurements has been the subject of several papers in the literature. LiDAR odometry and mapping (LOAM) [13] distinguishes edge and planar features to perform scan-to-scan and scan-to-map matching, which are techniques to track the vehicle's motion (translation and rotation) between consecutive scans. However, the process is time-consuming, as discussed in the article introducing the LOAM_Velodyne [14] method. To address this issue, the lightweight and ground-optimized LOAM (LeGO-LOAM) [15] and A-LOAM [16] approaches remove noisy and useless features to reduce the computational time. Other methods, such as LO-Net [17] and LO-Net-M [17], use end-to-end deep learning (DL) to improve the scan matching process. SGLO [18] extracts geometric line and plane features to improve the matching process, which has provided good results. However, its accuracy is strongly affected by the initialization. Methods such as those of Kummerle et al. [19], Weng et al. [20], Sefati et al. [21], and Schaefer et al. [22] take a probabilistic perspective to localize the vehicle. Based on detecting poles and walls in LiDAR scans, these methods use a particle filter algorithm to correct the IMU's accumulated errors. The method of Schaefer et al. [22] achieved excellent results on the Kitti dataset [23] and the University of Michigan North Campus Long-Term Vision and LiDAR Dataset (NCLT) [24]. However, the features extracted for this operation may not exist in some environments, especially those with little texture, such as desert roads. That is why the work of Charroud et al. [25] proposed using non-semantic features to perform the measurement-update step in the particle filter. An extension of this work was presented in [26], where the authors proposed a modified clustering particle filter that selects relevant particles to calculate the position by using sigma-point selection. Another extension of [25] is the article [27], which proposed selecting only the 10 best particles around the real position and regenerating the particles around them. This trick enables the particle filter to run fast while preserving the accuracy, as particles are generated close to the real position at each execution of the algorithm.
Our novel method uses non-semantic features as a pre-processing step to teach a machine-learning model the actual vehicle positions. The inputs are the extracted features, and the outputs are the positions. The model was extensively tested on short- and long-term trajectories using two benchmark datasets: Kitti [23] and NCLT [24]. In addition, the model is explained by studying the features that contribute positively or negatively to the model output. Based on the results of explaining the first model, another model was constructed and compared to evaluate the influence of the changes made. The article [28] discusses the concept of explainable artificial intelligence (XAI), which refers to the development of AI systems that are able to provide clear, interpretable explanations for their decisions and actions. The authors provide an overview of the various types of XAI approaches that have been proposed, including post hoc explanation methods, integrated explanation methods, and interactive explanation methods. They also explore the potential benefits and challenges of XAI, and discuss the importance of responsible AI development in ensuring the transparency, accountability, and trustworthiness of AI systems.
This paper makes the following contributions:
• As far as the authors are aware, this is the first method that uses only a few LiDAR cluster points to feed a deep-learning model for localization purposes.
• The paper presents, in Section 2, a robust mathematical formulation of the localization problem, which opens up opportunities to develop more solutions based on optimization and stochastic-differential-equation-based methods.
• Deep learning explanation methods were employed to find the most contributing cluster features in order to optimize the proposed model.
The sections of the paper are organized as follows. Section 2 provides an in-depth explanation of the problem and the architecture we propose to solve it. Section 3 presents the steps followed to create the proposed model. Section 4 presents the results of testing the model in the short- and long-term scenarios. The contributions of the features are also discussed, and a comparison between the two models with further implications is presented in this section. Section 5 provides some conclusions.

Problem Description and Proposed Architecture
As mentioned before, any autonomous system requires a localization and mapping method to facilitate the scheduling of other principal tasks, such as path planning and overtaking vehicles [1]. In particular, self-driving vehicles demand close attention to the execution times of their components, since they need instant interaction with the environment; the methods must therefore execute quickly without sacrificing too much accuracy. Speed and accuracy are both critical to a good driving experience. Most state-of-the-art methods divide the process of localization and mapping into two parts: feature extraction and localization prediction. Extracting relevant features is a principal task in creating a global map, and it eases the localization task by providing the environmental objects or shapes around the vehicle. Figure 1 presents a simple workflow architecture that consists of two major steps: feature extraction (A) and learning the positions (B). In this section, we give the theoretical intuition behind the approach followed to solve the localization and mapping problem for self-driving vehicles, which could be extended to other autonomous systems. A practical implementation of the method is described in the next section.
Our method localizes the vehicle based on LiDAR measurements only. We suppose that the LiDAR scans and the ground truth positions are in body-frame coordinates, and that at each timestamp each position has its corresponding scan. We use non-semantic features to distinguish some relevant prototype features. This technique can represent a wide range of objects inside the environment, in contrast to focusing on a single type of object (such as poles or edges), which is very advantageous for representing environments with little texture information. We chose to employ a clustering algorithm to extract clusters, which form our non-semantic features. We introduce the following notation. Let $P_t$ be a LiDAR scan at time $t$:

$$P_t = \{p_{t,i}\}_{i \in [1, D_t]}, \quad p_{t,i} \in \mathbb{R}^3,$$

where $D_t$ is the number of LiDAR points at time $t$, and let $s_t = [s_{t,i}, s_{t,j}]$ be the real position at time $t$ (the $z$ coordinate is assumed to be constant for the real position). We apply a clustering technique by fixing the number of centers to a certain value (discussed in the next section), so we have:

$$C_t = \{c_{t,i}\}_{i \in [1, D_t]},$$

where $D_t$ now denotes the number of clusters, $C_t$ is the set of cluster centers, and $c_{t,i}$ is the center of cluster $i$, with $i \in [1, D_t]$. For simplicity, below we consider that the number of clusters stays the same along the trajectory ($D = D_t$ for all $t$).
The main idea of this paper is to find $\{\alpha_{t,1}, \alpha_{t,2}, \ldots, \alpha_{t,D}\}$ and a function $f$ that solve the minimization problem (p), where $T$ is the last timestamp value. We try to estimate a function $f$ that aggregates and manipulates the cluster centers $C_t = \{c_{t,i}\}_{i \in [1,D]}$ in order to get an estimated position that is close to the real position $s_t$. The $w_{t,i}$ are weights that ensure the reliability of the projection of the clusters from the 3D to the 2D environment. The scalars $\{\alpha_{t,i}\}$ provide a percentage (probability) of the importance of the clusters $\{c_{t,i}\}$ after manipulation with the function $f$.
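One way to write problem (p) that is consistent with the definitions above is sketched below. The exact arrangement of $f$, $w_{t,i}$, and $\alpha_{t,i}$, the choice of norm, and the normalization constraint on the $\alpha_{t,i}$ are assumptions made for illustration; they are not a statement of the authors' formula.

```latex
% Hypothetical form of (p): weighted aggregation of transformed cluster centers.
(p):\quad
\min_{f,\ \{\alpha_{t,i}\}}\ \sum_{t=1}^{T}
\Big\| \, s_t - \sum_{i=1}^{D} \alpha_{t,i}\, f\big(w_{t,i}\, c_{t,i}\big) \Big\|
\quad \text{s.t.}\quad \alpha_{t,i} \ge 0,\ \ \sum_{i=1}^{D} \alpha_{t,i} = 1 .
```

Under this reading, $w_{t,i}$ maps each 3D center $c_{t,i}$ to the 2D plane of $s_t$, and the $\alpha_{t,i}$ act as per-cluster importance weights that sum to one at each timestamp.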
To solve the problem (p), we established a method based on a deep learning approach that takes the cluster features as the input and returns the estimated positions. The architecture was chosen manually by testing the capabilities and robustness of several time-series regression methods, such as LSTM, GRU, and RNN. We expect the proposed architecture in Figure 2 to be robust enough to learn and predict unseen data, i.e., to have the capacity for generalization. Note that several convolutions are applied to eliminate data outliers. The LSTM and GRU layers are the main layers that create the memory inside the proposed model. We assessed the capacity of this architecture by comparing it against several popular time-series methods. Table 1 presents the results of comparing the training and validation MAE (mean absolute error) of our method and the other architectures.

Both datasets provide the information needed to train and test the model; the LiDAR point cloud was converted to the reference system, and the ground truth of the positions was created based on GPS and/or a SLAM solution using LiDAR scan matching and high-accuracy RTK GPS. We carefully selected sequences that have different sizes and contain various environmental scenarios, such as hard brakes, roundabouts, town drives, residential road drives, dirt roads, traffic, and sharp cornering, in order to teach the model as many driving scenarios as possible. Table 2 lists the sequences used to train the model, which contain in total 50,758 LiDAR scans (data input) with their ground truth positions (target) and different velocity variations. Neither the official Kitti website nor the Kitti paper states the distance driven by the vehicle.

Data Processing and Parameters' Discussion
In this part, we process the data (input and target) to teach the model the vehicle's real positions (target) based on the features extracted from the LiDAR scan at each timestamp (input):

• Processing the target: Each dataset used here provides the ground truth of the vehicle positions. Along the trajectory, each position timestamp should coincide with a LiDAR scan timestamp or at least be close to it (see Figure 3).

• Processing the input data: In order to prepare meaningful inputs for the model, a feature-extraction process was applied to each LiDAR scan to reduce the huge number of points it contains. First, we performed a "cropping" step that cuts the scan to keep only the region of interest. We kept only the points that surround the vehicle and do not exceed the height of the vehicle. We manually selected the cropping limits that define these constraints (see Figure 4b); the limits were carefully chosen to ensure some overlap between scans. Next, a fuzzy k-means clustering algorithm was applied to extract features from each cropped scan. The clustering summarizes the LiDAR information into a few core groups that are useful for describing objects in the environment, e.g., trees, poles, and walls. Table 3 shows that fuzzy k-means clustering is fast and a good choice for this task, which was also shown in [25].
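The timestamp matching described in the first bullet can be sketched as a nearest-neighbor search over sorted timestamps. The function name `match_nearest` and the example timestamps are hypothetical, and the sketch assumes that the ground-truth timestamps are sorted and contain at least two entries.

```python
import numpy as np

def match_nearest(scan_times, pose_times):
    """For each scan timestamp, return the index of the closest pose timestamp.

    pose_times must be sorted in increasing order (assumption of this sketch).
    Also returns the absolute time gap so that poor matches can be filtered out.
    """
    scan_times = np.asarray(scan_times, dtype=float)
    pose_times = np.asarray(pose_times, dtype=float)
    # Index of the first pose timestamp >= scan timestamp, kept inside the array.
    right = np.clip(np.searchsorted(pose_times, scan_times), 1, len(pose_times) - 1)
    left = right - 1
    # Pick whichever neighbor is closer in time.
    choose_right = (pose_times[right] - scan_times) < (scan_times - pose_times[left])
    idx = np.where(choose_right, right, left)
    gap = np.abs(pose_times[idx] - scan_times)
    return idx, gap

pose_times = [0.0, 0.1, 0.2, 0.3]        # ground-truth timestamps (s), illustrative
scan_times = [0.01, 0.14, 0.26, 0.5]     # LiDAR scan timestamps (s), illustrative
idx, gap = match_nearest(scan_times, pose_times)
print(idx, gap)
```

Returning the gap alongside the index lets a caller discard scans whose closest ground-truth position is too far away in time.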
We decided to run the feature-extraction process with five cluster centers, since the number of frames is large and each LiDAR frame (scan) contains a massive number of points, e.g., millions of points. Consequently, the computation time would be huge with a larger number of clusters. We analyze the contribution of each cluster (or feature) to the model output in the next section.
After calculating the five cluster centers for each scan, we calculate the mean of those centers and add it to the input as the sixth feature. Figure 5 provides an illustration of the inputs of our model, and Figure 4 illustrates an example of this data processing step. Finding the best parameter configuration is a crucial step to ensure the best accuracy with the minimum time cost, which is much needed for machine-learning applications.
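The feature-extraction step described above can be sketched as follows: crop the scan, run a plain fuzzy c-means (fuzzy k-means) loop with five centers, and append the mean of the centers as a sixth feature. The crop limits (±10 m in x and y, z ≤ 0.5 m) are the ones listed with the paper's tables; the fuzzifier `m = 2`, the iteration count, the random initialization, and the function names are assumptions of this sketch rather than the authors' implementation.

```python
import numpy as np

def crop_scan(points, xy_limit=10.0, z_max=0.5):
    """Keep points with |x|, |y| <= xy_limit and z <= z_max (all in meters)."""
    mask = (
        (np.abs(points[:, 0]) <= xy_limit)
        & (np.abs(points[:, 1]) <= xy_limit)
        & (points[:, 2] <= z_max)
    )
    return points[mask]

def fuzzy_c_means(points, n_clusters=5, m=2.0, n_iter=50, seed=0):
    """Minimal fuzzy c-means; returns the cluster centers, shape (n_clusters, 3)."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(points), n_clusters))      # random initial memberships
    u /= u.sum(axis=1, keepdims=True)              # each row sums to 1
    for _ in range(n_iter):
        w = u ** m
        centers = (w.T @ points) / w.sum(axis=0)[:, None]
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)             # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)   # standard membership update
    return centers

def scan_features(points, n_clusters=5):
    """Five cluster centers plus their mean as a sixth feature, shape (6, 3)."""
    centers = fuzzy_c_means(crop_scan(points), n_clusters)
    return np.vstack([centers, centers.mean(axis=0)])

# Synthetic scan standing in for a real LiDAR frame.
rng = np.random.default_rng(1)
scan = rng.uniform(-20.0, 20.0, size=(2000, 3))
features = scan_features(scan)
print(features.shape)
```

Stacking the features of all frames yields an input of shape (T, 6, 3), where T is the number of frames, which matches the input layout described for Figure 5.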
Self-driving vehicles require a fast and accurate positioning method to work in real time. We used a random search algorithm that optimizes the hyperparameters by sampling random values for each of them from a predefined range [29].
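A random search of this kind can be sketched in a few lines. The search space, the parameter names, and the toy objective below are hypothetical stand-ins; in practice the objective would train the model with the sampled parameters and return its validation MAE.

```python
import random

def random_search(objective, space, n_trials=100, seed=0):
    """Sample n_trials random configurations and keep the one with the lowest score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space (not the ranges used in the paper).
space = {
    "units": [32, 64, 128, 256],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "kernel_size": [3, 5, 7],
}

def toy_objective(p):
    # Stand-in for "train the model and return the validation MAE".
    return (
        abs(p["units"] - 128) / 128
        + abs(p["learning_rate"] - 1e-3) * 100
        + abs(p["kernel_size"] - 5)
    )

best_params, best_score = random_search(toy_objective, space)
print(best_params, best_score)
```

Fixing the seed makes the search reproducible, which helps when comparing optimizers or architectures under the same trial budget.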

Methods to investigate the contributions of features:
Explaining a deep learning model implies, in many cases, knowing how the features contribute (positively or negatively) to the output of the model, which makes the model easier for a human user to understand. Recently, several improvements have been reported that provide more interpretation and transparency for so-called black-box models.
Several publications and open-source tools have been proposed to explain models and make them human-readable, which is very useful in applications and promotes the use of machine learning. In this study, we selected and applied some of these methods to explain the model. Note that all the methods cited here are adapted for regression problems, as in our case.
-Saliency: A saliency map is an image that marks the area on which people's eyes focus first. The purpose of a saliency map is to represent the degree of importance of a pixel to the human visual system [30]. In our case, it is the importance of the point features to the vehicle position. We identify the features that need to be changed least to affect the prediction result most.
Suppose that at time $t$ we have the cluster features $C_t = \{c_{t,i}\}_{i \in [1,D]}$ extracted from the scan. We need to identify which $c_{t,i} = [x_{t,i}, y_{t,i}, z_{t,i}]$ changed least but affected the model most. Let $f$ be a regression function (the model). Then, the importance value $\Phi_{t,i}$ at time $t$ for each cluster feature can be found by:

$$\Phi_{t,i} = \left\| \nabla_{c_{t,i}} f(c_{t,i}) \right\|_{\infty},$$

where $\nabla_{c_{t,i}} f(c_{t,i}) = \left( \frac{\partial f}{\partial x_{t,i}}, \frac{\partial f}{\partial y_{t,i}}, \frac{\partial f}{\partial z_{t,i}} \right)$ is the vector of the gradient with respect to each component of $c_{t,i}$. Calculating the infinity norm of $\nabla_{c_{t,i}} f(c_{t,i})$ provides only the importance value of cluster feature $i$ at time $t$. In order to identify which cluster contributes most to the model output, we need to find the index that corresponds to the maximum importance value:

$$j = \arg\max_{i \in [1,D]} \Phi_{t,i}.$$

Consequently, $c_{t,j}$ is the cluster that contributes most to the model output at time $t$. To calculate the importance values of the clusters for the whole trajectory, we calculate the mean of the importance value of each cluster over the trajectory:

$$\bar{\Phi}_i = \frac{1}{T} \sum_{t=1}^{T} \Phi_{t,i}, \quad i = 1, \ldots, D.$$

In order to know which cluster feature contributes most to the output along the trajectory, we calculate:

$$o = \arg\max_{i \in [1,D]} \bar{\Phi}_i.$$

Cluster $c_{t,o}$ is the feature that contributes most to the model output over the whole trajectory, i.e., for every $t \in [1, T]$. The procedure detailed above for obtaining the importance value is the same for all the other methods; only the way the importance value is calculated changes.
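The saliency procedure (infinity norm of the gradient per cluster, then the mean over the trajectory and the index of the largest value) can be sketched with NumPy. For simplicity the sketch assumes a scalar-valued model and approximates gradients by central finite differences; with a real network one would use automatic differentiation instead. The toy linear model and all function names are hypothetical.

```python
import numpy as np

def numerical_gradient(f, c, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at c (shape (D, 3))."""
    grad = np.zeros_like(c, dtype=float)
    for idx in np.ndindex(c.shape):
        step = np.zeros_like(c, dtype=float)
        step[idx] = eps
        grad[idx] = (f(c + step) - f(c - step)) / (2 * eps)
    return grad

def saliency_importance(f, clusters):
    """clusters: (T, D, 3). Returns phi with phi[t, i] = ||grad wrt c_{t,i}||_inf."""
    T, D, _ = clusters.shape
    phi = np.zeros((T, D))
    for t in range(T):
        grad = numerical_gradient(f, clusters[t])   # shape (D, 3)
        phi[t] = np.abs(grad).max(axis=1)           # infinity norm per cluster
    return phi

# Toy stand-in for the model: a linear map with known gradients.
W = np.array([[1.0, 0.0, 0.0],
              [0.0, -3.0, 0.0],
              [0.5, 0.5, 0.5]])
toy_model = lambda C: float(np.sum(W * C))

rng = np.random.default_rng(0)
clusters = rng.normal(size=(4, 3, 3))               # T = 4 frames, D = 3 clusters
phi = saliency_importance(toy_model, clusters)
mean_phi = phi.mean(axis=0)                          # importance over the trajectory
most_important = int(np.argmax(mean_phi))            # index of the top cluster
print(mean_phi, most_important)
```

Because the toy model is linear, the gradient with respect to each cluster equals the corresponding row of `W`, which makes the result easy to check by hand.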
-Integrated gradients (IG): This method accumulates the gradients of the model's prediction with respect to the input features along a straight-line path from a baseline to the input, and it requires no modification of the original deep neural network. IG can be applied to any differentiable model used for images, text, or structured data [31]:

$$\Phi_i(x) = (x_i - x_{0,i}) \int_{0}^{1} \frac{\partial f\big(x_0 + \lambda (x - x_0)\big)}{\partial x_i} \, d\lambda,$$

where $x_0$ is the baseline, generally set to zero. The same procedure as in the saliency method above is used to obtain the importance value and the contributions of the cluster features.
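A minimal numerical sketch of integrated gradients follows, using a midpoint Riemann sum along the straight path from the baseline to the input. The toy quadratic model, its analytic gradient, and the function names are assumptions for illustration; a real model would supply gradients through automatic differentiation.

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline=None, n_steps=50):
    """Approximate IG attributions with a midpoint Riemann sum.

    grad_f: function returning the gradient of the model output at a point.
    baseline: reference input x0; defaults to the zero vector.
    """
    x = np.asarray(x, dtype=float)
    x0 = np.zeros_like(x) if baseline is None else np.asarray(baseline, dtype=float)
    lambdas = (np.arange(n_steps) + 0.5) / n_steps          # midpoints in (0, 1)
    avg_grad = np.mean([grad_f(x0 + lam * (x - x0)) for lam in lambdas], axis=0)
    return (x - x0) * avg_grad

# Toy model f(x) = sum(a * x^2) with analytic gradient 2 * a * x.
a = np.array([1.0, 2.0, 0.5])
f = lambda x: float(np.sum(a * x ** 2))
grad_f = lambda x: 2.0 * a * x

x = np.array([1.0, -2.0, 3.0])
attributions = integrated_gradients(grad_f, x)
print(attributions, attributions.sum(), f(x) - f(np.zeros(3)))
```

The printed sum illustrates the completeness property of IG: the attributions add up to the difference between the model output at the input and at the baseline.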
-SmoothGrad: The gradient is averaged over several points corresponding to small perturbations around the point of interest. The smoothing effect of the averaging reduces the visual noise and thus improves interpretation [32]:

$$\Phi(x) = \mathbb{E}_{\delta \sim \mathcal{N}(0, \sigma^2)}\left[ \nabla_x f(x + \delta) \right] \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_x f(x + \delta_i).$$

The importance value and the contributions of the cluster features are obtained in the same way as in the saliency method. The idea is to introduce a random noise variable $\delta \sim \mathcal{N}(0, \sigma^2)$ in order to estimate the above expectation by sampling from the added noise (i.e., using a Monte Carlo estimator). $N$ is the number of samples of $\delta$.
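-VarGrad: Similar to the SmoothGrad method, but instead of the mean of the gradient, the variance of the gradient is calculated [33]:

$$\Phi(x) = \mathbb{V}_{\delta \sim \mathcal{N}(0, \sigma^2)}\left[ \nabla_x f(x + \delta) \right] \approx \frac{1}{N - 1} \sum_{i=1}^{N} \left( \nabla_x f(x + \delta_i) - \hat{\mu} \right)^2,$$

where $\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} \nabla_x f(x + \delta_i)$ is the empirical mean. All other details are the same as for the SmoothGrad method above.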
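SmoothGrad and VarGrad differ only in the statistic taken over the noisy gradients, which the following sketch makes explicit. The noise level, sample count, toy gradients, and function names are hypothetical; the variance uses the unbiased (N − 1) estimator, which is one common convention and an assumption here.

```python
import numpy as np

def noisy_gradients(grad_f, x, sigma=0.1, n_samples=50, seed=0):
    """Evaluate the gradient at n_samples Gaussian perturbations of x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    noise = rng.normal(0.0, sigma, size=(n_samples,) + x.shape)
    return np.stack([grad_f(x + d) for d in noise])

def smoothgrad(grad_f, x, **kwargs):
    """Mean of the noisy gradients (Monte Carlo estimate of the expectation)."""
    return noisy_gradients(grad_f, x, **kwargs).mean(axis=0)

def vargrad(grad_f, x, **kwargs):
    """Empirical variance of the noisy gradients (unbiased, ddof=1)."""
    return noisy_gradients(grad_f, x, **kwargs).var(axis=0, ddof=1)

x = np.array([1.0, -2.0, 3.0])

# Linear toy model: the gradient is constant, so noise has no effect.
w = np.array([0.5, -1.0, 2.0])
linear_grad = lambda x: w
sg_linear = smoothgrad(linear_grad, x)
vg_linear = vargrad(linear_grad, x)

# Quadratic toy model f(x) = sum(x^2): the gradient 2x varies with the noise.
quad_grad = lambda x: 2.0 * x
sg_quad = smoothgrad(quad_grad, x, sigma=0.1, n_samples=2000)
vg_quad = vargrad(quad_grad, x, sigma=0.1, n_samples=2000)
print(sg_linear, vg_linear, sg_quad, vg_quad)
```

For the linear model the two estimators return the gradient itself and zero, respectively; for the quadratic model the variance is non-zero because the gradient changes under perturbation, which is the signal VarGrad relies on.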

Methods' faithfulness
To measure the fidelity of these methods in explaining the model, the article [34] proposes several fidelity measures that indicate the extent to which one can trust the explanation of the results. We have chosen two fidelity metrics:
-Deletion: This metric evaluates the model's performance in making predictions while perturbing only the relevant features. A smaller deletion value represents a more accurate explanation (note that this metric was adapted for regression problems) [34].
-Insertion: This metric measures how well a saliency-map-based explanation can find the minimal set of elements needed for the prediction task. A higher insertion value represents a more accurate explanation (note that this metric was adapted for regression problems as well) [34].

Results and Discussion
The model was tuned, trained, and validated on a computer with 20 GB of RAM, 1 TB of storage, and an i7 processor. In order to test the performance of the model, we chose two types of trajectory scenarios, short-term and long-term. The short trajectories were chosen from the Kitti dataset [23], and the long trajectories were chosen from the NCLT dataset [24]. This section discusses the capacity of the model to predict unseen data in both scenarios (short-term and long-term trajectories). Then, we provide some explanations of the output results, which offer an opportunity to improve the accuracy of the model. Finally, we compare the performance of the new and the old models and investigate their accuracy.

How the Model Performs
According to Figure 6, the training loss decreased from 0.95 m to 0.75 m, which shows the importance of tuning the model. Meanwhile, the validation loss decreased from 0.915 m to 0.75 m, which indicates that the model did not overfit.
Table 5 reports the performance of the model in a series of tests on the Kitti dataset. We chose some short sequences to see how the model behaves in such situations, and found that our model performs well in predicting the positions on these sequences. The mean absolute error over all the sequences did not surpass 1 m, which means that the majority of the error values are less than 1 m; we registered a maximum value of up to

Table 6 describes another situation, where we extensively tested our model on long-term trajectories (more than 3 km) with different environmental scenarios and challenges. The model again registered a mean absolute error of less than 1 m between the predicted and the real positions. However, is there any chance to improve the accuracy? The answer is yes, if we explain the results, e.g., by discovering which features contribute positively or negatively to the output of the model.

Why This Output?
As mentioned in Section 3.4, four explainers were chosen to analyze the contributions of the features (including the x and y axes) to the resulting output: saliency, integrated gradients, SmoothGrad, and VarGrad. We also tried the KernelSHAP and LIME explainers, which are very popular methods for regression problems. However, they returned similar values for every feature, which does not provide much information about the contributions of the features. Figure 7 shows the mean impact of each feature (from the x and y axes) according to each explainer (left column) and the punctual impact of each feature (from the x and y axes) according to each explainer (right column). Our criterion for selecting features was the average impact value of the feature coordinates on the x and y axes. Based on that, saliency and integrated gradients (IG) nominated cluster 2 as the most interesting feature. However, SmoothGrad selected the mean of the clusters as the least important feature, indicating that it contributes negatively to the model. Additionally, VarGrad confirmed that the mean of the clusters has a major positive contribution to the model. Note that each explainer has its own philosophy for calculating the impact value. Thus, how can we decide which feature has the best contribution? We used two metrics of faithfulness, deletion and insertion, following [34]. According to Tables 7 and 8, the SmoothGrad and VarGrad results are the most trusted by the two criteria. SmoothGrad and VarGrad registered 89.61 and 105.72, respectively, in the deletion metric (lower is better), and 144.32 and 131.36, respectively, in the insertion metric (higher is better).
From Figure 7, we can extract more information from the punctual plots (right column), especially from Figure 7f,h. We remark that higher values of the mean of all the feature data contribute positively, in contrast to the lowest values, which contribute negatively.
According to the analysis above, the mean of all the clusters is the most important feature that contributes positively to reducing the loss error. We then exploited this finding to create another model with the same architecture as the first one, and compared the prediction results of both models to investigate the improvement. Figure 8

Conclusions
Localization and mapping in self-driving vehicles have been extensively treated with different approaches, including probabilistic, optimization, and other machine-learning methods. In this paper, we presented a novel deep-learning workflow for learning vehicle positions. The input features of the model were extracted from the LiDAR scans at each time point. The extraction process was based on a fuzzy k-means algorithm that extracts cluster features. The model architecture was based on a combination of LSTM and GRU layers and smoothing with 1D convolutions. We tuned the model to obtain the best hyperparameters, and we trained it with different training options, such as early stopping and reduction of the learning rate when the metrics stop improving. The model obtained very good short- and long-term results in different challenging environmental scenarios, such as weather changes and various trajectory scenarios. The model is able to keep the mean absolute positioning error below 1 m for all sequences in short- and long-term trajectories.
We also provided possible explanations for the model's results and examined the contribution of the clusters (features). We found that the mean of all the extracted clusters is the most important feature that contributes positively to the prediction result. We created a new explained model that takes only the mean of all clusters as input and ran the prediction again. According to the comparison of both models, the new, more explainable model improves the accuracy of vehicle positioning and reduces the time and computational resources required to train and use these models.

Figure 1. Representation of the workflow architecture. Feature extraction from LiDAR scans (blue dots) and training the DL model using 1D convolution layers, an LSTM layer, a GRU layer, and a fully connected network layer to estimate the real position (brown dot).

Figure 2. The architecture of the proposed model.

Figure 3. Example (sequence 0001 from the Kitti dataset): blue dots represent the real positions of the vehicle at each timestamp, expressed in Cartesian coordinates (x and y in meters).

Figure 5. Inputs of the proposed model. Each frame contains 6 cluster-based features (five of them are the centers of clusters extracted using the fuzzy k-means algorithm, and the sixth is the mean of the five cluster centers). Each cluster contains 3D points. T is the number of frames.
presents the curve of the loss values (training and validation). Training loss decreased from 13.49 to

Table 1. Comparing several models to choose the best architecture.

• Kitti Dataset: This dataset provides the opportunity to train the model in the different categories it contains, including city, residential, road, campus, and person. It also enables teaching the model in various challenging environments and investigating the performance of vehicles in short-term localization trajectories.

• NCLT Dataset: The North Campus Long-Term (NCLT) dataset was acquired with a Segway robot on one of the campuses of the University of Michigan, USA. It is a great option for training and testing the model's performance, since it contains long-term trajectory sequences with an average length of 5.5 km over a period of 15 months. The recordings include different times of day; different weather conditions; seasonal changes; indoor and outdoor environments; many dynamic objects, such as people and moving furniture; and two large, constantly changing construction projects. Although the trajectories vary considerably from session to session, there is considerable overlap.

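Table 2. The sequences used to train the model.

Cropping constraints: let $p_t = [x_t, y_t, z_t]$ be a LiDAR point from scan $t$; then

$$-10\ \mathrm{m} \le x_t \le 10\ \mathrm{m}, \qquad -10\ \mathrm{m} \le y_t \le 10\ \mathrm{m}, \qquad z_t \le 0.5\ \mathrm{m}.$$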

Table 3. Time consumption (mm:ss) to execute the feature extraction for different clustering methods tested on sequence 0001 from the Kitti dataset [23].

Table 4 depicts the results found after running 100 random trials. The best parameter combination achieved an MAE of 1.21 m in less than 30 s of training. The optimizer was chosen manually by testing Adam, Adamax, SGD, RMSprop, and Nadam. We found that Adamax gives the best results and converges to the minimum fastest.

• Training the model: The model was trained and validated on the datasets explained in Section 3.1, using the parameter search described before. First, 30 epochs and a batch size of 256 were used to train the model. Early stopping was triggered if the loss value increased in 5 successive epochs. We reduced the learning rate if the loss metric stopped improving for 3 epochs.

Table 4. Parameter tuning for the model.
132 is the step size used here.

Table 10. Predictions of the new explained model for the long-term trajectories (NCLT dataset).