A Machine Learning Approach to Pedestrian Detection for Autonomous Vehicles Using High-Definition 3D Range Data

This article describes an automated sensor-based system to detect pedestrians in an autonomous vehicle application. Although the vehicle is equipped with a broad set of sensors, the article focuses on the processing of the information generated by a Velodyne HDL-64E LIDAR sensor. The cloud of points generated by the sensor (more than 1 million points per revolution) is processed to detect pedestrians by selecting cubic shapes and applying machine vision and machine learning algorithms to the XY, XZ, and YZ projections of the points contained in each cube. The work presents an exhaustive analysis of the performance of three different machine learning algorithms: k-Nearest Neighbours (kNN), Naïve Bayes Classifier (NBC), and Support Vector Machine (SVM). These algorithms have been trained with 1931 samples. The final performance of the method, measured in a real traffic scenario containing 16 pedestrians and 469 samples of non-pedestrians, shows a sensitivity of 81.2%, an accuracy of 96.2% and a specificity of 96.8%.


Introduction
Autonomous driving is a highly disruptive feature for road transport, capable of influencing aspects as fundamental as road safety and mobility itself. Over the last ten years, technological developments have facilitated the gradual incorporation of various advanced driver assistance systems (ADAS), as well as the emergence of different prototypes capable of travelling in a reasonably autonomous manner.
An autonomous car is a vehicle that is capable of sensing its environment and navigating without human input [1]. Autonomous vehicles promise to radically change both personal and industrial transport. Traffic safety will be improved, since autonomous systems, in contrast to human drivers, have faster reaction times and do not suffer from fatigue. Thus, the number of traffic collisions, of deaths and injuries, and of damage to the environment will be reduced. New business models, such as mobility as a service, which aims to be cheaper than owning a car, would be available. Vehicles could be used more efficiently, reducing labour costs for transporting goods and persons (since no driver would be required) and allowing for more pleasant journeys. Finally, thanks to internet connection and vehicular networks, it would be possible to adopt a cooperative approach between vehicles, whereby vehicles and roadside units would share information in peer-to-peer networks. These networks would reduce traffic congestion in cities and roadways, improve safety, and even decrease the need for vehicle insurance. The work [1] reviews this technology.

Pedestrian Detection Using 3D LIDAR Information
The introduction of high-resolution 3D LIDAR sensors has demonstrated the feasibility of this technology to detect pedestrians under actual driving circumstances. One of the first attempts to detect pedestrians by means of 3D LIDAR is described in Premebida's work [18], performed using a multilayer LIDAR. The author uses a multiple classification method to detect pedestrians in a range of 30 m. He bases his method on the linearity and circularity of 2D (range) data for feature extraction; by extending this method to 3D data, the computation time increases substantially. Spinello [19] afterwards worked with the 3D point cloud and divided it into various 2D point clouds, obtained by cutting the 3D point cloud at different heights, which helped Spinello to determine whether point associations may or may not be part of a pedestrian. This method shows great sensitivity to the viewing distance and is not recommended for ADAS. Navarro-Serment [20] presents a method which extracts information from a point cloud obtained through a high-resolution LIDAR by dividing the cloud into three parts corresponding to the pedestrian's torso and legs. This process uses the variances of the 3D points of each part as discriminative features. An advantage that can be highlighted is the low computational load, but its drawback is the weak behaviour it exhibits when the viewing distance increases.
A 3D LIDAR developed by the company DENSO (horizontal field of view 40° with 0.1° resolution, vertical field of view 4° with 1° resolution) is described in [21]. The authors also describe a pedestrian detection application using the laser, based on the time-of-flight, intensity and width of the laser signal, and performing a tracking of the pedestrian candidates using an interacting multiple model filter to compute position, velocity and acceleration of the candidates. The authors focus their work on the detection of pedestrians whose trajectory may collide with the car.
More recently, Kidono [22] proposed a method also based on the analysis of a 3D point cloud obtained by means of a high-definition LIDAR. In contrast to the previous authors, Kidono does not subdivide the cloud, but rather makes clusters of points following criteria of height above the ground plane and proximity/sparsity among them, thereby segmenting the various objects in the scenario. The author then identifies which objects are pedestrians by extracting different features from each portion of segmented points. This method, according to its authors, offers a good recognition rate of pedestrians within a 50 m range.

Materials and Methods
This section briefly describes the CIC, and the algorithm developed to process the cloud of points generated by the 3D LIDAR in order to detect pedestrians around the vehicle. Figure 1 shows the main components of the autonomous car architecture and their relationship. Each of these parts is described in the following subsections.

CIC Autonomous Vehicle
Sensors 2017, 17, 18

Sensor System
The objective of a sensor system is to gather data from the surrounding environment of the AV and feed that data to the control system, where it is processed and fused to decide what the AV should do next. Since there is no perfect sensor, it is necessary to install multiple sensors in the AV to sense its environment. These sensors measure different physical magnitudes, which are normally selected to overlap with each other, providing the redundant information needed to correctly fuse and correlate the information. This is required to obtain better estimations of the environment, since fusing information gathered from different sensors helps reduce the number of plausible interpretations of the measured data in dynamic and unknown environments. Hence, the selection of sensor types and their placement are highly important. Figure 2 shows the areas around the CIC covered by each sensor mounted on it. Two kinds of sensors are used for measuring the environment: short-range sensors (up to 10 m) and long-range sensors. The installed short-range sensors include a 2D Sick laser range scanner and a time-of-flight camera. The long-range sensors are a 3D LIDAR scanner and a camera in the visible spectrum.


Short-range sensors. The car carries two TIM55x laser scanners (Sick, Waldkirch, Germany; see Figure 3b), one on the front of the vehicle and one on the back, parallel to the ground and at a height of 30 cm. Each laser covers 180° with an angular resolution of 1°, has a scanning range of 0.05 to 25 m with a systematic error of ±60 mm, and spins at 15 Hz. The lasers are connected to the central control system through an Ethernet connection. The vehicle also carries two SENTIS3D-M420 time-of-flight cameras (BlueTechnix, Wien, Austria), one on each side of the vehicle (see Figure 3a). Their purpose is to gather data from the closest surroundings, which are too close for the 3D LIDAR. The effective operating distance is limited to approximately 4 m. Each camera captures up to 160 images per second, with 160 × 120 pixels resolution, independently of the outside lighting conditions.
Long-range sensors. The vehicle is equipped with a camera with a 1.2 Mpixel Charge-Coupled Device (CCD) in the visible spectrum, a Prosilica GT1290 from Allied Vision (Stadtroda, Germany), with a Gigabit Ethernet connection. It can capture up to 33 images per second, in grayscale or colour (see Figure 3a). CIC also has a Velodyne HDL-64E, a high-definition 3D LIDAR scanner (see Figure 3a) with 64 laser sensors. The LIDAR spins from 5 to 20 Hz, covering the complete 360° horizontally (from 0.086° to 0.345° angular resolution) and 26.8° vertically (separation between laser sensors 0.4°). The accuracy of the distance measurement is less than 2 cm, and the maximum range is between 50 m (low-reflectivity surfaces, like pavement) and 120 m (high-reflectivity surfaces). The minimum distance is 0.9 m. The LIDAR produces more than 1 million points per second, reporting for each point its distance and the measured reflectivity. Depending on the revolution speed, it can capture from 66,000 to 266,000 points per revolution (Figure 3a).

Control System
The main control systems of the Renault Twizy have been automated in order to allow the vehicle to be autonomously controlled. The modified systems are the steering wheel (see Figure 4a), the brake pedal (see Figure 4b), and the accelerator pedal. For the steering wheel, we have installed a gear and a clutch in the steering axis, so that we can rotate it by using an electric motor. The brake pedal is actuated mechanically by a cam controlled by an electric motor, with a maximum rotation angle of 15°. The accelerator pedal is electronically simulated by the control system, which emulates the electronic analog signals generated by the real pedal. Both motors (steering wheel and brake pedal) are controlled by two controller drives through a CAN bus.


Processing System
The main element of the processing system is a Nuvo-1300S/DIO computer (Neousys, New Taipei City, Taiwan, see Figure 5c), a fan-less embedded controller featuring an Intel Core i7 processor and 4 GB of RAM. The processing system carries out the high-performance tasks associated with CIC navigation, such as: (1) execution of computer vision algorithms; (2) processing the cloud of points generated by the 3D LIDAR to detect pedestrians, other vehicles, curbstones and so on; (3) generation of new trajectories and execution of obstacle avoidance manoeuvres; (4) testing the correct operation of the rest of the systems; and (5) storing the data of the sensors on disk for later off-line analysis.


3D LIDAR Cloud of Points
The Velodyne HDL-64E sends the information (both distance to the point and reflectivity value) through UDP packets, configured to broadcast on port 2368. Each packet contains a data payload of 1206 bytes, comprising 12 blocks of 100 bytes of laser data, plus 6 bytes of calibration and other information pertaining to the sensor. Each block contains data from 32 lasers (three bytes per laser), plus information regarding the laser block number (two bytes) and the azimuthal angle (two bytes). The azimuthal angle is reported from 0 to 35,999 (tenths of a degree). Regarding the measurements reported by each laser sensor, two bytes encode distance to the nearest 0.2 cm, and the remaining byte reports intensity on a scale of 0-255. Six firings of each of the laser blocks take 139 µs, plus around 100 µs to transmit the entire 1248-byte UDP packet, which equals 12.48 bytes/µs. Data points can be transformed to a Cartesian coordinate system by applying the equations listed in (1), where R is the measured distance, α is the azimuthal angle, and ω is the vertical angle:

x = R cos(ω) sin(α), y = R cos(ω) cos(α), z = R sin(ω)    (1)

Distance measures are collected by the LIDAR and organized in a cloud of points covering the complete 360° around the vehicle. At its current rotation speed (10 Hz), the LIDAR produces more than 1 million points per revolution, with an angular resolution of 0.1728°. This laser data is barely pre-processed: only values below the minimum (raw distance values lower than 500) are deleted from the cloud of points before it is passed to the pedestrian detection algorithm, described in the following section.
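As an illustration, the polar-to-Cartesian conversion of Equation (1) can be sketched as follows. This is a minimal sketch, not the authors' implementation; the function name and argument conventions are our own.

```python
import math

def to_cartesian(distance_m, azimuth_deg, vertical_deg):
    """Convert one LIDAR return (range R, azimuthal angle alpha,
    vertical angle omega) to Cartesian coordinates, as in Equation (1)."""
    alpha = math.radians(azimuth_deg)    # azimuthal angle
    omega = math.radians(vertical_deg)   # vertical angle of the laser
    r_xy = distance_m * math.cos(omega)  # projection on the ground plane
    return (r_xy * math.sin(alpha),
            r_xy * math.cos(alpha),
            distance_m * math.sin(omega))

# A raw azimuth field of 9000 corresponds to 900.0 tenths of a degree = 90°
x, y, z = to_cartesian(10.0, 9000 / 100.0, 0.0)
```

A return at 10 m range, 90° azimuth and 0° vertical angle thus lands on the x-axis at (10, 0, 0).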

Pedestrian Detection Algorithm
The pedestrian detection algorithm described below revolves around the processing of clusters of points contained inside a cube of dimensions 100 × 100 × 200 cm (the size of a human being). The algorithm comprises the following five steps:

1. Select those cubes that contain a number of points within a given threshold. This eliminates objects represented by too few points, which could produce false positives, and also reduces the computing time.

2. Generate the XY, YZ and XZ axonometric projections of the points contained inside the cube.

3. Generate binary images of each projection, and then pre-process them.

4. Compute the feature vector of each pre-processed binary image.

5. Send the feature vector to a machine learning algorithm to decide whether the cube contains a pedestrian or not.

Figure 6 shows a screenshot of all the data gathered by the Velodyne HDL-64E LIDAR in one revolution. We have highlighted one person inside a red cube, so that the reader can better follow the steps described next.

Step 1: Cube Selection
Cube selection is the first step of the algorithm, and it is very simple. A cube is passed on to the following steps only if it contains between 150 and 4000 points; these thresholds have been empirically determined by analysing three thousand frames captured by the LIDAR.

Step 2: Normalization of the Sample
The coordinates of the points contained in each selected cube are normalized as shown in Equation (2), being 'n' the number of the sample, and 'ns' the number of points in the sample:

X_norm^(n) = { (x_i − min X^(n)) / (max X^(n) − min X^(n)), i = 1, ..., ns }    (2)

and analogously for Y_norm^(n) and Z_norm^(n). X^(n) = {x_1, ..., x_ns}, Y^(n) = {y_1, ..., y_ns} and Z^(n) = {z_1, ..., z_ns} are the sets of Cartesian coordinates of all points contained inside the sample. After normalization, the values of the new sets X_norm^(n), Y_norm^(n) and Z_norm^(n) are placed in the interval (0,1). Figure 7 shows the normalized axonometric projections of two pedestrian samples.
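The per-axis min-max normalization and the three axonometric projections can be sketched as follows. This is an illustrative reconstruction, not the authors' MATLAB code; the function names and the guard against degenerate axes are our own additions.

```python
import numpy as np

def normalize_sample(points):
    """Min-max normalize the points of one cubic sample to the (0, 1)
    interval, per axis, as in Equation (2)."""
    points = np.asarray(points, dtype=float)
    mins = points.min(axis=0)
    spans = points.max(axis=0) - mins
    spans[spans == 0] = 1.0            # guard against degenerate axes
    return (points - mins) / spans

def projections(norm_points):
    """Return the XY, XZ and YZ axonometric projections as 2-D point sets."""
    return {"XY": norm_points[:, [0, 1]],
            "XZ": norm_points[:, [0, 2]],
            "YZ": norm_points[:, [1, 2]]}

# Toy sample of three points inside a 1 x 1 x 2 m cube
sample = np.array([[0.0, 0.0, 0.0], [1.0, 0.5, 2.0], [0.5, 1.0, 1.0]])
norm = normalize_sample(sample)
```

Each projection simply drops one coordinate of the normalized points; the dropped axis determines which of the three views (top, front, side) is produced.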

Step 3: Generation and Pre-processing of Binary Images of Axonometric Projections
We now need to generate an image-like artefact from the normalized values obtained in step 2, so that we can afterwards apply computer vision algorithms to each of the images generated from the axonometric projections. We tackle two related problems: how to generate an image from the normalized values in a way that keeps the original proportions to avoid deformations, and in a way that eases further processing. We aim to generate binary images, where the background is set to black (value '0'), and the pixels that correspond to points of the cloud are set to white (value '1'). Since the cloud of points being processed is obtained from a cubic region of size 100 × 100 × 200 cm, the proportions of the scaling coefficients that we have applied to the normalized values obtained in step 2 are 1-1-2.

Figure 8. (a-f) binary images generated from the XY, XZ and YZ projections of two pedestrian samples; (g-l) results of the pre-processing of the binary images.

We need to further process the images in order to recover the shape of the real object that generated the cloud of points being analysed. For this, we apply several algorithms to remove small particles, reduce noise, smooth objects, and obtain a compact figure of the object to be recognized afterwards by the machine learning algorithm, as shown in Figure 8g-l. For this purpose, we apply two morphological operations and one filter operation in the following order: (1) an opening operation to achieve a compact set of pixels; (2) a filter to erase small particles; and (3) a closing operation to smooth edges.
The values we select for all operations are strongly related: if the images we generate are big, the processing algorithms have to be very aggressive to generate compact images; but if the images are small, we can lose information in the generation of the binary images. We empirically obtained the following values. The X and Y axes have been mapped to 50 pixels, while the Z axis has been mapped to 100 pixels. These rates avoid the appearance of holes in the images and reduce the pre-processing effort. We used the 'ceil' operation to round values, since the image's pixels start at value '1'. The opening operation was configured to use a radius of six pixels, the filter erases particles with areas of less than 200 pixels, and the closing operation was configured to use a radius of three pixels. Figure 8a-f shows the binary images obtained from the samples shown in Figure 7, while Figure 8g-l shows the results of the pre-processing stage with these values.
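The rasterization described above (50 pixels for the X and Y axes, 100 for the Z axis, with 'ceil' rounding) might be sketched as follows. This is an illustrative reconstruction under our own naming, not the authors' MATLAB code, and it omits the morphological pre-processing stage.

```python
import numpy as np

def binarize(projection, width=50, height=100):
    """Rasterize one normalized 2-D projection into a binary image:
    background pixels are 0, pixels hit by a cloud point are 1.
    'ceil' rounding mirrors 1-based pixel indices, shifted here to
    0-based numpy indices."""
    img = np.zeros((height, width), dtype=np.uint8)
    cols = np.clip(np.ceil(projection[:, 0] * width).astype(int) - 1,
                   0, width - 1)
    rows = np.clip(np.ceil(projection[:, 1] * height).astype(int) - 1,
                   0, height - 1)
    img[rows, cols] = 1
    return img

# Toy XZ projection (normalized values in (0, 1)); Z maps to 100 pixels
xz = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]])
img = binarize(xz)
```

For the XY projection the same routine would be called with height=50, preserving the 1-1-2 proportions of the 100 × 100 × 200 cm cube.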

Step 4: Compute Features Vector
In machine learning and computer vision, a feature vector is an n-dimensional vector of numerical features that represent an object [23]. The size of the vector and the features it stores depend on the objects being searched for and on the conditions under which the data was gathered. In order to enhance or highlight the raw data, transformations to spaces such as Fourier [24], Gabor [25] or Wavelet [26,27] are commonly used. In other cases, raw data is reduced to binary format (e.g., grayscale images to binary images) in order to find geometric features of the object [28], such as area, perimeter, compactness, eccentricity, etc., which allow the shape of the object to be characterized. First- and second-order statistical features [29,30] have also often been used as features in classifiers and machine learning systems [31]. In order to separate pedestrians from the rest of the information contained in the cloud of points, we compute a feature vector composed of fifty features (f_1, ..., f_50) per projection. The features can be divided into three groups. The first group measures shape characteristics. The second group comprises the seven Hu invariant moments [32,33]. The third group computes statistical information, specifically over the normalized distance and the reflectivity reported in the 3D LIDAR raw data. Table 1 shows a summary of the fifty features contained in the vector, according to the group they belong to. The shape elements of the feature vector are:

1. Area: number of pixels of the object.

2. Perimeter: specifies the distance around the boundary of the region.

3. Solidity: proportion of the pixels in the convex hull that are also in the region.

4. Equivalent diameter: specifies the diameter of a circle with the same area as the object (see Equation (3)):

   EqDiameter = sqrt(4 · Area / π)    (3)

5. Eccentricity: ratio of the distance between the foci of the ellipse and its major axis length. The value is between 0 and 1.

6. Length of major/minor axis: a scalar that specifies the length (in pixels) of the major/minor axis of the ellipse that has the same normalized second central moments as the region under consideration.
The Hu moments, invariant with respect to translation, scale and rotation, are computed as shown in Equations (4)-(10):

φ1 = η20 + η02    (4)
φ2 = (η20 − η02)² + 4η11²    (5)
φ3 = (η30 − 3η12)² + (3η21 − η03)²    (6)
φ4 = (η30 + η12)² + (η21 + η03)²    (7)
φ5 = (η30 − 3η12)(η30 + η12)[(η30 + η12)² − 3(η21 + η03)²] + (3η21 − η03)(η21 + η03)[3(η30 + η12)² − (η21 + η03)²]    (8)
φ6 = (η20 − η02)[(η30 + η12)² − (η21 + η03)²] + 4η11(η30 + η12)(η21 + η03)    (9)
φ7 = (3η21 − η03)(η30 + η12)[(η30 + η12)² − 3(η21 + η03)²] − (η30 − 3η12)(η21 + η03)[3(η30 + η12)² − (η21 + η03)²]    (10)

where η_pq is the normalized central moment of order (p + q), computed as shown in Equation (11):

η_pq = μ_pq / μ00^(1 + (p + q)/2)    (11)

The statistical elements of the feature vector are: mean, standard deviation, kurtosis, and skewness (see Equations (12)-(15)), where v_i stands for either the normalized distance ND_i or the reflectivity R_i of sample 'n':

μ = (1/ns) Σ_{i=1..ns} v_i    (12)
σ = sqrt((1/ns) Σ_{i=1..ns} (v_i − μ)²)    (13)
k = (1/ns) Σ_{i=1..ns} (v_i − μ)⁴ / σ⁴    (14)
sk = (1/ns) Σ_{i=1..ns} (v_i − μ)³ / σ³    (15)

These values are computed over the normalized distance and the reflectivity from the 3D LIDAR raw data, generating eight features. The normalized distance is computed as shown in Equation (2). μ_ND^(n), σ_ND^(n), k_ND^(n), sk_ND^(n) and μ_R^(n), σ_R^(n), k_R^(n), sk_R^(n) are the means, standard deviations, kurtosis, and skewness of the set of 'ns' laser measurements of normalized distance (ND_i) and reflectivity (R_i) of a sample 'n'.
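The statistical elements of the feature vector can be illustrated with the following sketch, which applies the moment definitions of Equations (12)-(15) to one set of measurements. The function name is our own; the same routine would be applied to both the normalized distance and the reflectivity of a sample.

```python
import numpy as np

def statistical_features(values):
    """Mean, standard deviation, kurtosis and skewness of one set of
    laser measurements (normalized distance or reflectivity), using
    population moments as in Equations (12)-(15)."""
    v = np.asarray(values, dtype=float)
    mu = v.mean()
    sigma = v.std()                       # population standard deviation
    centred = v - mu
    kurt = np.mean(centred ** 4) / sigma ** 4
    skew = np.mean(centred ** 3) / sigma ** 3
    return mu, sigma, kurt, skew

# Toy set of four normalized distance values
mu, sigma, kurt, skew = statistical_features([0.2, 0.4, 0.6, 0.8])
```

For this symmetric toy set the skewness is zero, as expected, and the four returned values would occupy four of the eight statistical slots of the feature vector.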

Step 5: Machine Learning Algorithm
We are in the Big Data era: the web comprises some sixty trillion pages [34]; one hour of video is uploaded to YouTube every second; new 3D LIDAR sensors supply four billion points per hour. Machine learning (ML) techniques offer solutions to automate big data analysis. A short definition of ML is that it is a set of methods that can automatically detect patterns in data; these uncovered patterns can then be used to make predictions, or to perform other kinds of decision making under uncertainty [35]. ML methods are usually classified into two types: supervised and unsupervised learning.
Supervised learning is based on a labelled training data set, which is used to extract a mathematical model of the data. Supervised methods use Euclidean distance, Bayes theorem, linear regression, Gaussian distributions and so on to create the models. Among the most well-known supervised methods are k-Nearest Neighbours, Naïve Bayes classifiers, Artificial Neural Networks, or Support Vector Machines [27].
Unsupervised learning does not use a labelled training dataset, because it does not try to learn anything in particular. Instead, it divides the dataset into homogeneous groups, which is called clustering. It is regularly used for data mining; for instance, it can detect patterns of fraudulent behaviour or support stock market analysis. K-means, mixture models and hierarchical clustering are common approaches to unsupervised learning [36].
In this work, we tested the performance of three ML algorithms (MLA) to detect pedestrians: k-Nearest Neighbours (kNN), Naïve Bayes Classifier (NBC), and Support Vector Machine (SVM). These algorithms were used to implement binary classifiers to detect two kinds of objects, pedestrian and non-pedestrian, and they were tested with different configuration parameters and kernel functions (see Table 2). kNN was tested with Euclidean and Mahalanobis distances with data normalization; NBC was tested with Gauss and Kernel Smoothing Functions (KSF) without data normalization; and SVM was tested with linear and quadratic kernels with data normalization.
The reasons to select these MLA are: (1) they are easy to implement and their results are easy to reproduce, since they can be found in most ML libraries and programming languages, such as MATLAB, R, Python or C++; (2) they are very popular, so there are many works that use them against which we can compare; (3) they are easy to configure and test; and (4) in our experience [27,37], they produce good results.
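As a minimal illustration of the simplest of the three classifiers, a kNN with Euclidean distance and majority vote might look as follows. This is a toy sketch on synthetic 2-D data, not the feature vectors or the MATLAB implementation used in this work.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    """Classify 'query' by majority vote among its k nearest training
    samples, using Euclidean distance (1 = pedestrian, 0 = not)."""
    d = np.linalg.norm(np.asarray(train_X, dtype=float)
                       - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(d)[:k]              # indices of k closest samples
    votes = np.asarray(train_y)[nearest]
    return int(votes.sum() * 2 > k)          # majority vote

# Two well-separated synthetic classes
X = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
     [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
y = [0, 0, 0, 1, 1, 1]
label = knn_predict(X, y, [0.95, 1.0], k=3)
```

In the real system the training points would be the fifty-feature vectors of Step 4 and, for the Mahalanobis variant, the distance computation would be replaced accordingly.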

Results and Discussion
Since the proposed algorithm for pedestrian detection comprises a MLA, and we have tested the performance of the kNN, NBC, and SVM algorithms, we divide this section into two parts: performance and selection of the MLA that will detect pedestrians in the cloud of points generated by the 3D LIDAR, and performance of the overall algorithm in a real scenario. For this purpose, we have defined three scenarios. The first and second scenarios have been used to extract a set of samples of pedestrian and non-pedestrian objects from the data generated by the 3D LIDAR. The first scenario corresponds to the Intelligent Vehicle and Computer Vision Laboratory of our research group (DSIE). In this scenario, samples have been taken of a pedestrian at different angles and distances, as well as of other elements such as walls, obstacles, vehicles, etc. (see Figure 9a). The second scenario is an underground car park, and it was used to capture pedestrians and multiple samples of cars, motorbikes, columns, and so on (see Figure 9b). The third scenario corresponds to a real traffic area around the Polytechnic University of Cartagena (see Figure 9c), and was used to test the complete system under real traffic conditions.


Performance of Machine Learning Algorithms
Before selecting the MLA for pedestrian detection, we first need a training set. To capture a sample of points representative of a pedestrian, among the more than 1 million points generated by the 3D LIDAR in every revolution (or frame), we developed a software tool in MATLAB. This tool, shown in Figure 6, loads previously stored laser data and allows users to select samples of points of various sizes. The default sample size is 100 × 100 × 200 centimetres. Users can also rotate the frame, filter points according to their distance to the origin of coordinates, and modify the size of the selection sample. Once a sample is selected, the software automatically stores the raw data of the sample in a raw feature vector, which includes the X, Y, Z coordinates, distance and reflectivity, and allows the user to state whether the sample contains a pedestrian or not (i.e., its label). With this MATLAB tool, we extracted 277 pedestrian and 1654 non-pedestrian samples from the first and second scenarios. Several evaluation metrics can be used to assess the performance of an MLA: values computed from the confusion matrix, the Receiver Operating Characteristic (ROC) curve, or the leave-one-out cross-validation (LOOCV) method. In this section we present the results obtained for all the MLAs considered. LOOCV and ROC curves have been computed under different configurations of the MLAs, as shown in Table 3. The table also shows the errors obtained after applying LOOCV and ROC curve analysis to the training set. These metrics allowed us to select the optimal MLA for the last step of the pedestrian detection algorithm described in this article. The confusion matrix presents the number of samples that were correctly classified by an MLA (true positives (TP) and true negatives (TN)) against those that were not (false positives (FP) and false negatives (FN)).
In the case of binary classifiers, such as the one we propose in this article to detect pedestrians, several values can be computed from the confusion matrix: accuracy (Equation (19), the proportion of correctly classified samples); precision (Equation (18), which assesses the predictive power of the algorithm); sensitivity and specificity (Equations (16) and (17), which assess the effectiveness of the algorithm in detecting a single class); and F-score (Equation (20), the harmonic mean of precision and sensitivity). These are normally the first metrics considered when evaluating an MLA, especially accuracy. However, these numbers alone cannot fully capture the performance of an MLA, in the same way that the mean does not capture all the characteristics of a data set.
The ROC curve characterises the performance of an MLA better. It is calculated by plotting the true positive rate (sensitivity, see Equation (16)) against the false positive rate (1 − specificity, see Equation (17)) at various threshold levels. The Area Under the Curve (AUC) is also an important value, since it measures the discrimination power of the classifier, that is, its ability to correctly classify the patterns submitted (pedestrian and non-pedestrian in our case) [38,39]. The AUC can also be interpreted as the probability that the classifier will assign a higher score to a randomly chosen positive example than to a randomly chosen negative example (see Figure 10).
LOOCV is a simple cross-validation method, which extracts one sample from the training set and uses the remaining samples to train the MLA. It then classifies the extracted sample and computes the classification error. This process is repeated for every sample, and the LOOCV error is computed as the mean of the individual classification errors, representing a measure of how well the model fits the supplied data. LOOCV is computationally expensive. Both the LOOCV and AUC analyses show that the best fit of the data model is produced by SVM with a quadratic kernel, and that the worst results are obtained with NBC with a Gaussian kernel.
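To make these definitions concrete, the confusion-matrix metrics of Equations (16)–(20) and the pairwise interpretation of the AUC can be sketched in a few lines of Python (the article's own tooling is MATLAB; this sketch and its function names are ours):

```python
import numpy as np

def confusion_metrics(tp, fp, tn, fn):
    """Metrics derived from a binary confusion matrix (Equations (16)-(20))."""
    sensitivity = tp / (tp + fn)                  # true-positive rate (recall)
    specificity = tn / (tn + fp)                  # true-negative rate
    precision   = tp / (tp + fp)                  # predictive power
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    f_score     = 2 * precision * sensitivity / (precision + sensitivity)
    return dict(sensitivity=sensitivity, specificity=specificity,
                precision=precision, accuracy=accuracy, f_score=f_score)

def auc_by_ranking(pos_scores, neg_scores):
    """AUC as the probability that a randomly chosen positive sample is
    scored higher than a randomly chosen negative one (ties count 1/2)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return ((pos > neg).sum() + 0.5 * (pos == neg).sum()) / pos.size / neg.size
```

For example, a classifier with TP = 13, FP = 15, TN = 454 and FN = 3 yields a sensitivity of 0.8125 and a specificity of about 0.968.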

Performance of the Complete Pedestrian Detection Algorithm
To determine the best MLA to be used in the pedestrian detection algorithm, we selected and tested in scenario 3 (see Figure 9c) the algorithms with the lowest LOOCV error, the highest AUC, and the highest F-score in Table 3: kNN with Euclidean distance, and SVM with linear and quadratic kernels. From scenario 3 we selected seven frames of thirty square metres, all of which were partitioned into samples of 100 × 100 × 200 cm. Table 4 shows the performance of the selected MLAs, obtained after a manual count of the TP, FP, TN and FN. The number of pedestrians in each frame was also determined manually (see the third column of Table 4).

Table 4. TP, FP, TN and FN computed for the seven frames of scenario 3.

                                      kNN (Euclidean)    SVM (linear)       SVM (quadratic)
Frame  Samples  Pedestrians           TP  FP  TN   FN    TP  FP  TN   FN    TP  FP  TN   FN
  1      100        1                  1   7   92   0     1   2   97   0     1  14   85   0
  2       53        1                  1   5   47   0     1   2   50   0     0   3   49   1
  3       47        1                  1   4   42   0     1   1   45   0     1  10   36   0
  4       58        3                  1   7   48   2     1   3   52   2     2   6   49   1
  5       79        4                  3   5   70   1     3   3   72   1     4   9   66   0
  6       45        4                  4   5   36   0     4   1   40   0     4   2   39   0
  7      103        2                  2   9   92   0     2   3   98   0     2   6   95   0
Sum      485       16                 13  42  427   3    13  15  454   3    14  50  419   2

Table 5 shows the final metric results of the tested MLAs. As shown in the table, the best compromise among the computed metrics corresponds to SVM with a linear kernel, which is the MLA we finally selected. The differences between the theoretical results of Table 3 (where SVM with a quadratic kernel shows the best results) and the results obtained in scenario 3 (shown in Table 5) are due to the "overfitting problem": an MLA overfits when it is capable of learning all of the samples provided in the training set, but is not able to generalise to new data. Given the results shown in Table 3, we expected SVM with a quadratic kernel to achieve 100% success when classifying new, unknown samples, but its performance decreases in all metrics, as shown in the last column of Table 5, being even lower than that of the other MLAs.
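The "Sum" row of Table 4 lets the headline metrics be re-derived directly; a quick Python check (a sketch of ours, with the classifier labels taken from the text):

```python
# (TP, FP, TN, FN) totals from the Sum row of Table 4.
totals = {
    "kNN (Euclidean)": (13, 42, 427, 3),
    "SVM (linear)":    (13, 15, 454, 3),
    "SVM (quadratic)": (14, 50, 419, 2),
}

scores = {}
for name, (tp, fp, tn, fn) in totals.items():
    sensitivity = tp / (tp + fn)                   # Equation (16)
    precision = tp / (tp + fp)                     # Equation (18)
    accuracy = (tp + tn) / (tp + fp + tn + fn)     # Equation (19)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    scores[name] = (sensitivity, accuracy, f_score)
```

The linear SVM comes out on top: sensitivity 13/16 = 81.25%, accuracy 467/485 ≈ 96.3% and specificity 454/469 ≈ 96.8%, consistent with the figures quoted in the abstract, and its F-score exceeds those of the other two classifiers.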
The overfitting problem can be caused by many factors, one of which is a low number of training samples. In our case, we tried different numbers of training samples (from 200 to 2000, in steps of 200), but the problem persisted. For this reason, we finally decided to use the linear SVM as the MLA for pedestrian detection. Figures 11 and 12 show the result of executing the pedestrian detection algorithm on frame six, where there are three pedestrians. In this frame, the algorithm evaluated forty-five samples and, as shown in the pictures, detected four pedestrians (one false positive).
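The gap between a perfect fit on the training data and the LOOCV estimate can be illustrated with a deliberately simple example: a 1-NN classifier (not the article's SVM) on toy 1-D data containing one mislabeled sample. The data and classifier here are our own illustration:

```python
import numpy as np

# Toy 1-D training set: class 1 for x > 5, except one mislabeled sample (x = 4.4).
X = np.array([1.0, 2.1, 3.3, 4.4, 6.0, 7.2, 8.5, 9.9])
y = np.array([0,   0,   0,   1,   1,   1,   1,   1])

def nn_predict(train_X, train_y, x):
    # 1-nearest-neighbour prediction in one dimension.
    return train_y[np.argmin(np.abs(train_X - x))]

# Resubstitution accuracy: a 1-NN memorises every training sample, so
# evaluating on the training set itself always yields 100% -- an
# optimistic, overfitted estimate.
train_acc = np.mean([nn_predict(X, y, x) == t for x, t in zip(X, y)])

# LOOCV: hold each sample out, fit on the rest, classify the held-out sample.
def loocv_error(X, y):
    mis = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        mis += nn_predict(X[mask], y[mask], X[i]) != y[i]
    return mis / len(X)
```

Here `train_acc` is 1.0, while the LOOCV error is 0.25 (the mislabeled sample and its neighbour are misclassified once held out): the memorising classifier looks perfect on its own training set but generalises worse, which is exactly the behaviour observed with the quadratic-kernel SVM.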
Sensors 2017, 17, 18
Table 6 shows a comparison of the proposed method with the five methods reviewed in Section 1.2 that use High Definition 3D LIDAR to carry out pedestrian detection. The comparison is made in terms of MLA implementation, metrics used (graphical and numerical values) and performance. We want to highlight the lack of homogeneity in the reporting of performance statistics in the works reviewed, and the absence of numerical results in many of them. This makes a thorough comparison of the results almost impossible. Nevertheless, we have estimated the AUC of the works [20,22], and we use the F-score metric (see Equation (20)) to compare against the works whose authors report ad-hoc mean values, since the F-score is defined as the harmonic mean of precision and sensitivity (or recall) and establishes the relationship between them. As shown in Table 6, the AUC of [20,22] is lower than that of the proposed method, and the performance means reported by [18,19,21] are lower than the F-score obtained by the proposed method.

Conclusions and Future Work
This article presents a machine learning approach to pedestrian detection for autonomous vehicles using high-definition 3D range data, since pedestrian detection has been identified as one of the most critical tasks for this kind of vehicle. The work is part of the development of an autonomous vehicle based on a Renault Twizy platform by the DSIE Research Group at UPCT. The most remarkable features of the proposed approach are:
• Unlike other works [18,20,22], a novel set of elements is used to construct the feature vector that the machine learning algorithm uses to discriminate pedestrians from other objects in their surroundings. Axonometric projections of point cloud samples are used to create binary images, which has allowed us to employ computer vision algorithms to compute part of the feature vector. The feature vector is composed of three kinds of features: shape features, invariant moments, and statistical features of the distances and reflectivity reported by the 3D LIDAR.
• An exhaustive analysis of the performance of three different machine learning algorithms has been carried out: k-Nearest Neighbours (kNN), Naïve Bayes classifier (NBC), and Support Vector Machine (SVM). Each algorithm was trained with a training set comprising 277 pedestrian and 1654 non-pedestrian samples, under different kernel functions: kNN with Euclidean and Mahalanobis distances, NBC with Gaussian and KSF functions, and SVM with linear and quadratic kernels. LOOCV and ROC analyses were used to select the best algorithm for pedestrian detection. The proposed algorithm has been tested in real traffic scenarios with 16 samples of pedestrians and 469 samples of non-pedestrians. The results obtained were used to validate the theoretical results of Table 3. An overfitting problem in the SVM with quadratic kernel was found; finally, SVM with a linear kernel was selected since it offers the best results.
• A comparison of the proposed method with five other works that also use High Definition 3D LIDAR to carry out pedestrian detection, in terms of the AUC and F-score metrics. We can conclude that the proposed method obtains better performance results in every case.
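The projection step summarised in the first bullet can be sketched as follows: the points of a selection box are flattened onto the XY, XZ and YZ planes as binary occupancy images, on which shape features and invariant moments can then be computed. This is a Python sketch of ours (the article's tooling is MATLAB); the 5 cm cell size is an assumption, while the 100 × 100 × 200 cm box is the sample size described above:

```python
import numpy as np

def binary_projections(points, cell=0.05, box=(1.0, 1.0, 2.0)):
    """Project the points of a selection box (metres, coordinates assumed
    non-negative and inside the box) onto the XY, XZ and YZ planes as
    binary occupancy images: a pixel is True if any point falls in it."""
    images = {}
    for name, (a, b) in {"XY": (0, 1), "XZ": (0, 2), "YZ": (1, 2)}.items():
        shape = (int(round(box[a] / cell)), int(round(box[b] / cell)))
        # Discretise the two projected coordinates into pixel indices.
        ia = np.clip((points[:, a] / cell).astype(int), 0, shape[0] - 1)
        ib = np.clip((points[:, b] / cell).astype(int), 0, shape[1] - 1)
        img = np.zeros(shape, dtype=bool)
        img[ia, ib] = True
        images[name] = img
    return images
```

Standard computer vision routines (contour/shape descriptors, Hu invariant moments) can then be applied to the three resulting images to build part of the feature vector.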
Pedestrian detection has traditionally been performed using machine vision and cameras, but these techniques are affected by changing lighting conditions. 3D LIDAR technology provides more accurate data (more than 1 million points per revolution), which can be successfully used to detect pedestrians under any lighting conditions. The high success rate and scalability of machine learning algorithms will enable the detection of different objects during vehicle navigation. The authors are working on increasing the database of point cloud samples, as well as on creating learning machines capable of detecting other important objects in the scene, such as bikes, cars, traffic signs and lights, etc. Sensor fusion is also very important, and therefore we are very interested in using the information from the cameras mounted on our car. Also, given that the algorithms are computationally expensive, the authors are developing parallelisable code to increase the speed of the algorithm.