1. Introduction
In recent years, pedestrian detection and tracking has become an essential task for many traffic-related applications, such as autonomous vehicles (AV), advanced driver assistance systems (ADAS), and traffic management. For AV and ADAS, reliable detection and tracking of pedestrians makes vehicles aware of potential dangers in their vicinity, thereby improving traffic safety. Such a system provides spatio-temporal information that vehicles can use to respond and plan their subsequent actions. For traffic management, precise detection and tracking of pedestrians can assist in optimizing traffic control and scheduling to achieve high safety and efficiency. For pedestrian detection and tracking, vision-based approaches are prevalent [
1,
2,
3]. These approaches recognize and track pedestrians in images and videos by extracting the texture, color, and contour features of the targets. However, such approaches have difficulty collecting accurate position information about humans because of their limited accuracy in depth estimation. Some researchers have addressed this problem using RGB-D cameras, which combine imagery with 2D range sensing to collect color information and dense point clouds simultaneously [
4,
5,
6]. However, RGB-D cameras usually have a narrow field of view, both horizontally and vertically, and a limited sensing range [
7]. As such, applications that incorporate LiDAR sensors for pedestrian detection and tracking have developed rapidly in recent years [
8,
9,
10]. Compared to cameras or RGB-D cameras, LiDAR is a direct 3D measurement technology that requires no image matching. Another significant advantage of LiDAR sensors is their ability to generate long-range, wide-angle point clouds. In addition, LiDAR point clouds are highly accurate and unaffected by lighting conditions [
11].
LiDAR-based pedestrian detection studies can be broadly classified into two approaches: model-free and model-based. Model-free methods have no restrictions on or assumptions about the shape and size of the objects to be detected. As such, they can detect pedestrians and other dynamic objects simultaneously. For example, [
12] outlined a system for long-term 3D mapping in which they compared an input point cloud to a global static map and then extracted dynamic objects based on a visibility assumption. Ref. [
13] segmented the dynamic objects on the basis of discrepancies between consecutive frames and classified them according to the geometric properties of their bounding boxes. Ref. [
14] detected the motions of objects sequentially using RANSAC and then proposed a Bayesian approach to segment the objects. Most model-free methods rely mainly on motion cues, so their performance in detecting pedestrians has never matched their performance on faster-moving objects, such as vehicles and bicyclists, since pedestrians typically move slowly [
14].
Model-based approaches are preferred when some information about the object to be detected is known and, therefore, can be modeled a priori. Currently, a large number of studies on pedestrian detection from LiDAR rely on machine learning strategies, which numerically represent pedestrians by hand-crafted features. Ref. [
15] proposed 11 features based on the properties of the clusters and PCA (principal component analysis) to describe human geometry. A classifier composed of two independent SVMs (support vector machines) was then used to classify pedestrians. The performance of the classifier was improved in [
16] by adding two new features: a slice feature for the cluster and a distribution pattern for the reflection intensity of the cluster. Their results showed that these two new features improved classification performance significantly, even though their dimensionality was relatively low. Ref. [
17] divided the point cloud into a grid and represented each cell by six features. A 3D detection window was then slid along all three dimensions, stacking the feature vectors of the cells within its bounds into a single vector, which was then classified by a pedestrian classifier. Ref. [
18] first segmented the point cloud and projected each candidate pedestrian cluster onto three principal planes; a corresponding binary image for each projection was then generated to extract the feature vectors. Finally, k-nearest neighbor, naive Bayes, and SVM classifiers were used to detect pedestrians based on these features.
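To make the idea of hand-crafted cluster descriptors concrete, the following minimal sketch computes a small feature vector for a candidate point cluster. The specific features shown (point count, bounding-box extents, and normalized PCA eigenvalues) are illustrative assumptions in the spirit of the PCA-based descriptors cited above, not the exact feature sets of any of these works.

```python
import numpy as np

def cluster_features(points):
    """Illustrative hand-crafted features for a candidate cluster:
    point count, axis-aligned bounding-box extents, and the normalized
    covariance eigenvalues, which capture the cluster's elongation
    (a standing pedestrian tends to have one dominant eigenvalue)."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    extents = points.max(axis=0) - points.min(axis=0)
    cov = np.cov(points.T)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]  # descending order
    eigvals = eigvals / eigvals.sum()                 # scale invariance
    return np.concatenate(([n], extents, eigvals))

# Example: a tall, thin synthetic cluster roughly shaped like a pedestrian.
rng = np.random.default_rng(0)
ped = rng.normal(scale=[0.2, 0.2, 0.5], size=(200, 3))
f = cluster_features(ped)
```

Such feature vectors would then be fed to a conventional classifier (e.g., an SVM) trained on labeled pedestrian and non-pedestrian clusters.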
Some model-based neural networks for 3D object detection have been developed in recent years in an end-to-end manner. These approaches do not rely on hand-crafted features and typically follow one of two pipelines: two-stage or one-stage object detection [
19,
20,
21,
22]. Although deep learning-based approaches provide state-of-the-art performance in many object detection tasks, this study did not adopt them for the following reasons. First, such methods typically require considerable fine-tuning with manual intervention, longer training time, and high-performance hardware [
23]. In addition, pedestrian detection is essentially a straightforward binary classification task rather than a complex object detection problem. Moreover, most 3D object detection neural networks are evaluated on the KITTI benchmark [
24], while the amount of data collected by Doppler LiDAR is much smaller than the KITTI data set. When training data are limited, deep learning strategies do not necessarily outperform traditional classification methods [
25].
Existing tracking approaches can be grouped into two categories based on their processing mode: offline tracking and online tracking [
26]. Offline tracking utilizes information both from past and future frames and attempts to find a globally optimal solution [
27,
28,
29], which can be formulated as a network flow graph and solved by min-cost flow algorithms [
30]. Offline tracking typically has a high computational cost, since it deals with observations from all frames and analyzes them jointly to estimate the final output. In contrast, online tracking handles the LiDAR sequence in a step-wise manner, considering only detections up to the current frame, which is usually efficient enough for real-time applications [
31]. Ref. [
32] proposed a pedestrian tracking method that was also able to improve the performance of pedestrian detection. A constant-velocity model was adopted to predict each pedestrian's location, and the global-nearest-neighbor algorithm was used to associate detected candidates with existing trajectories. Once a candidate was associated with an existing trajectory, it was classified as a pedestrian. Ref. [
30] proposed an online detection-based tracking method, using a Kalman filter to estimate the state of pedestrians, and the Hungarian algorithm to associate detections and tracks. Based on this work, [
33] estimated the covariance matrices of the Kalman filter from the statistics of the training data. They then used a greedy algorithm instead of the Hungarian algorithm to associate the objects and obtained better results.
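The predict–associate–update loop common to these online trackers can be sketched as follows, using a constant-velocity Kalman filter and a greedy nearest-neighbor association step. All matrices, noise parameters, and the gating threshold are illustrative assumptions, not the settings of the cited works.

```python
import numpy as np

# State: [x, y, vx, vy]; constant-velocity motion model over time step dt.
dt = 0.1
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)  # only position is observed
Q = 0.01 * np.eye(4)   # process noise (illustrative)
R = 0.05 * np.eye(2)   # measurement noise (illustrative)

def predict(x, P):
    """Kalman prediction under the constant-velocity model."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    """Standard Kalman update with position measurement z."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

def greedy_associate(preds, dets, gate=1.0):
    """Greedily match predicted positions to detections by increasing
    distance, skipping pairs beyond the gating threshold."""
    pairs, used = [], set()
    d = np.linalg.norm(preds[:, None, :] - dets[None, :, :], axis=2)
    for i, j in sorted(((i, j) for i in range(len(preds))
                        for j in range(len(dets))), key=lambda ij: d[ij]):
        if i not in {p[0] for p in pairs} and j not in used and d[i, j] < gate:
            pairs.append((i, j)); used.add(j)
    return pairs

# One cycle: a track moving at 1 m/s in x, one true detection, one clutter point.
x, P = np.array([0.0, 0.0, 1.0, 0.0]), np.eye(4)
x, P = predict(x, P)
dets = np.array([[0.12, 0.01], [5.0, 5.0]])
pairs = greedy_associate(np.array([x[:2]]), dets)
x, P = update(x, P, dets[pairs[0][1]])
```

The Hungarian algorithm used in [30] would replace `greedy_associate` with a globally optimal assignment over the same distance matrix.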
Currently, most LiDAR-based studies use point cloud data sets acquired by pulsed LiDAR sensors, which emit short but intense pulses of laser radiation to collect the spatial information of data points. However, when pedestrians are far from the sensor, fewer points are collected from them, which may cause pedestrians to be missed or misrecognized [
34,
35]. Doppler LiDAR, which provides not only spatial information but also the precise radial velocity of each data point, can help address this problem [
36,
37,
38]. For example, as a pedestrian moves away from the sensor, its point cloud becomes sparse, while its velocity remains largely unchanged. Unlike pulsed LiDAR, Doppler LiDAR emits a beam of coherent radiation toward a target while retaining a reference signal, also known as a local oscillator [
39]. Motion of the target along the beam direction changes the frequency of the returned light due to the Doppler shift: movement toward the LiDAR compresses the wave and increases its frequency, while movement away stretches the wave and reduces its frequency. Mixing the outgoing and incoming signals thus yields two beat frequencies, from which the range and radial velocity can be derived [
40]. Ref. [
41] proposed a model-free approach that achieved high performance in pedestrian detection using point clouds acquired by a Doppler LiDAR. They first detected and clustered all moving points to generate a set of dynamic point clusters. The dynamic objects were then completed from the detected dynamic clusters by region growing. In this way, most pedestrians could be detected successfully, except those with zero radial speed.
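The beat-frequency relations described above can be sketched numerically. Under one common FMCW (frequency-modulated continuous-wave) convention with a triangular frequency sweep, the up- and down-chirp beat frequencies combine a range term and a Doppler term with opposite signs; averaging and differencing them recovers range and radial velocity. The chirp slope, carrier wavelength, and sign convention below are illustrative assumptions and may differ from the sensor used in this study.

```python
C = 3.0e8             # speed of light, m/s
WAVELENGTH = 1.55e-6  # carrier wavelength, m (typical for coherent LiDAR)
SLOPE = 1.0e14        # chirp slope B/T, Hz/s (illustrative)

def beat_frequencies(range_m, radial_v):
    """Simulate up- and down-chirp beat frequencies for a target,
    assuming the up-chirp beat is f_range - f_doppler for an
    approaching target (sign conventions vary between sensors)."""
    f_range = 2 * range_m * SLOPE / C     # round-trip delay term
    f_doppler = 2 * radial_v / WAVELENGTH # Doppler shift term
    return f_range - f_doppler, f_range + f_doppler

def range_and_velocity(f_up, f_down):
    """Invert the two beat frequencies back to range and radial velocity."""
    f_range = (f_up + f_down) / 2
    f_doppler = (f_down - f_up) / 2
    return C * f_range / (2 * SLOPE), WAVELENGTH * f_doppler / 2

# A pedestrian 30 m away, approaching at 1.4 m/s.
f_up, f_down = beat_frequencies(30.0, 1.4)
r, v = range_and_velocity(f_up, f_down)
```

The key practical point for this paper is the second return value: the radial velocity is measured per point, independently of how many points the pedestrian reflects.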
This paper takes advantage of Doppler LiDAR to propose a new detection-based tracking method for detecting and tracking pedestrians in urban scenes. The contributions of this study are as follows. (1) A multiple-pedestrian separation process based on the mean shift algorithm is used to further segment candidate pedestrians. This process increases the true positive rate for candidates that come too close to other objects. (2) We use the speed information from the Doppler LiDAR to improve both detection and tracking performance. Specifically, for pedestrian detection, a classifier that incorporates speed information is robust for classifying pedestrians at any distance. In the tracking step, the speed information provides a more accurate prediction of each pedestrian's location, leading to better tracking performance.
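The mean-shift separation idea in contribution (1) can be illustrated with a minimal flat-kernel sketch on 2D ground-plane points: each point is shifted toward the mean of its neighbors until the cloud collapses into distinct modes, one per pedestrian. The bandwidth and merging tolerance below are hypothetical values for illustration, not the parameters used in this study.

```python
import numpy as np

def mean_shift_modes(points, bandwidth=0.6, iters=30, merge_tol=0.3):
    """Flat-kernel (blurring) mean shift: repeatedly move each point to
    the mean of its neighbors within `bandwidth`, then merge the
    converged points into distinct modes."""
    shifted = np.array(points, dtype=float)
    for _ in range(iters):
        for i, p in enumerate(shifted):
            neighbors = shifted[np.linalg.norm(shifted - p, axis=1) < bandwidth]
            shifted[i] = neighbors.mean(axis=0)
    modes = []
    for p in shifted:
        if not any(np.linalg.norm(p - m) < merge_tol for m in modes):
            modes.append(p)
    return np.array(modes)

# Two pedestrians standing about 2 m apart, merged into one candidate cluster.
rng = np.random.default_rng(1)
a = rng.normal([0.0, 0.0], 0.15, size=(60, 2))
b = rng.normal([2.0, 0.0], 0.15, size=(60, 2))
modes = mean_shift_modes(np.vstack([a, b]))
```

Each recovered mode then seeds one separated pedestrian candidate; points are assigned to the mode their trajectory converged to.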