1. Introduction
Autonomous navigation technology has gained significant attention in recent years, as it enables sensor-equipped unmanned ground vehicles (UGVs) to perform a wide range of tasks, including data collection, environmental modeling, and monitoring [1,2]. An essential component of an autonomous navigation system is the real-time detection and tracking of obstacles. By fully considering the position, velocity, and historical trajectory of obstacles, a UGV can plan its actions more effectively, thereby preventing unintended collisions [3,4]. In addition, the era of intelligent systems has brought growing requirements for the perceptual and interactive capabilities of unmanned system equipment [5]. Therefore, as the working settings of UGVs become increasingly complex, their ability to perceive dynamic objects is becoming ever more important.
In the field of autonomous navigation, cameras and LiDAR are the most commonly used sensors for object detection and tracking [6]. In the commonly used tracking-by-detection framework, tracking algorithms typically first utilize neural networks to generate detection boxes, which are then associated with historical trajectories [7]. Early algorithms focus on 2D detection [8] and 2D tracking tasks, which use images to generate 2D boxes and trajectories for objects. In recent years, 3D LiDAR-based perception and navigation technologies have gained significant attention. On the one hand, LiDAR can accurately measure the distance between an obstacle and a UGV. On the other hand, the point cloud data provided by LiDAR offers detailed geometric information about the environment, which is highly beneficial for multi-object detection and tracking. While LiDAR provides significant advantages, it also presents great challenges. Unlike image data, point cloud data is not spatially continuous, and its distribution characteristics vary across frames [9]. Therefore, convolutional neural networks, which are commonly used for image processing, struggle to directly extract features from point cloud data. Correspondingly, multi-object tracking algorithms using point clouds also face several challenges in industrial applications.
The first challenge lies in the low efficiency of traditional detection algorithms, primarily caused by the sparsity and unstructured nature of the point cloud. The irregular distribution of point cloud data compels most object detection algorithms to convert it into alternative representations to facilitate processing by convolutional neural networks. For instance, some algorithms transform 3D point clouds into 2D images, which are subsequently processed using convolutional networks [10,11]. However, this dimensionality reduction results in significant information loss. Consequently, these methods typically use multiple neural networks to extract features from various perspectives (such as bird’s eye view, frontal view, and images) [12]. The concurrent execution of multiple networks severely compromises computational efficiency. Another common approach involves representing the 3D space using uniformly distributed cuboids or voxels, followed by object detection via 3D convolutional networks, with each voxel serving as the basic unit [13,14,15]. This method also suffers from inefficiency due to the need to traverse a large number of empty voxels. Although some studies have proposed algorithms that operate directly on raw point clouds, their performance and efficiency on large-scale data remain inadequate [16,17]. As a result, these methods still fall short of meeting the real-time requirements of autonomous navigation, which often involves processing outdoor point clouds over extensive areas.
The second factor limiting the practical application of multi-object tracking algorithms is that the existing data association based on single-sensor input is not reliable enough. Weng et al. proposed the AB3DMOT algorithm [18], which is regarded as a benchmark method in the 3D MOT field. Their method uses Intersection over Union (IoU) to represent the similarity between detection boxes and prediction boxes, and then uses the Hungarian algorithm and Kalman filtering to complete trajectory updating. The tracking performance of this kind of algorithm is strongly correlated with the precision of the detected bounding boxes. However, the complexity of point cloud data makes accurately generating 3D detection boxes a challenging task in practical industrial applications. Moreover, in real-world environments, the rapid changes in object pose often hinder the ability of 3D IoU to accurately quantify the similarity between detection results and trajectories [19]. Some methods integrate multi-sensor data during the detection phase to enhance tracking performance by improving the accuracy of 3D boxes [20]. Other methods, such as the ByteTrack algorithm [21,22], employ a staged data association approach based on detection confidence scores. Nonetheless, these methods fail to address the problem at its core, since similarity is still computed based on 3D bounding box comparisons. Consequently, they often suffer from reliability problems in real-world scenarios.
Another critical yet frequently overlooked challenge is the heavy reliance of most existing tracking algorithms on high-performance GPUs and advanced sensors, which severely limits their deployment on cost-constrained UGV platforms. Mobile UGVs must handle multiple tasks simultaneously, including perception, localization, and planning, leaving limited computational resources and processing time for object tracking. Additionally, some UGVs are not equipped with high-line LiDAR, which can significantly degrade the performance of point cloud neural networks. Most existing studies focus on employing complex neural networks to improve algorithm performance on public datasets. The impressive results reported by these algorithms are primarily dependent on 64-line LiDAR sensors and high-performance GPUs. Shi et al. [23] conducted experiments on a desktop equipped with an i9 CPU and an RTX 3090 GPU. Ye et al. [13] conducted experiments on a 2080Ti GPU. The inference time of MV3D-Net for one image was around 0.36 s on a Titan X GPU [12]. On UGVs equipped with only a CPU and 32-line LiDAR, these algorithms often fall short in both accuracy and efficiency, and in some cases, may fail to operate altogether. In real-world applications, limitations in cost, size, and thermal management prevent the control system from enhancing algorithm performance by simply adding more computational hardware [24]. This dilemma causes many 3D MOT algorithms to struggle to meet the needs of autonomous navigation.
In this work, we propose a novel 3D object detection and tracking method designed for low-cost UGVs. To improve the algorithm’s efficiency, we refrain from using point cloud neural networks for 3D feature extraction. Instead, we first identify the region where the object is located, and then distinguish the object from noise points using point cloud clustering. To prevent the fragmentation of tracking trajectories caused by 3D detection errors, we propose a multi-object tracking algorithm that employs a multi-level data association strategy. On the one hand, we compute similarity using 2D IoU and 3D Mahalanobis distance, thereby mitigating the impact of detection errors in any single dimension. On the other hand, we perform grouped data association: first handling objects with both 2D and 3D information based on their confidence scores, and then conducting a third round of association specifically for objects that only contain 2D information due to clustering failures. The proposed tracking algorithm effectively mitigates the impact of detection errors on tracking performance.
Section 6 demonstrates that our method not only performs well on public datasets but also adapts to low-cost industrial computers and 32-line LiDAR, which is often unachievable by many existing MOT algorithms. The main contributions of our work are as follows:
- (1)
We propose an efficient multi-object detection method designed for low-cost UGVs. Our method combines image-based object detection techniques with point cloud data processing methods to achieve 3D object detection. The simplicity of the network enables our algorithm to run efficiently in a CPU environment, while the clustering-based feature extraction ensures compatibility with 32-line LiDAR data.
- (2)
We propose a tracking algorithm using a multi-level data association strategy, which effectively mitigates the impact of detection errors on tracking performance. In terms of similarity computation, we integrate both 2D and 3D detection results. For data association, we group the detections by first associating those with high confidence and complete information, and then perform a dedicated third matching step to handle cases where point clouds cannot be successfully clustered. Even when the 3D detector cannot provide accurate results at certain moments, our tracking process maintains high accuracy with image data.
- (3)
We conduct extensive comparative experiments on public datasets and a real UGV platform. The results demonstrate that our algorithm operates efficiently and reliably during the autonomous navigation process.
The remainder of this paper is organized as follows. Section 2 reviews the related work on multi-object detection and tracking. Section 3 presents the overall framework of the proposed method. Section 4 and Section 5 detail the principles of the proposed detection and tracking algorithms, respectively. In Section 6, the proposed method is evaluated on both the KITTI MOT dataset and a real UGV platform to experimentally validate its effectiveness. Finally, conclusions are drawn in Section 7.
3. System Framework
In this work, we focus on the study of 3D multi-object detection and tracking algorithms, emphasizing improvements in computational efficiency and deployability on low-cost UGV platforms. The overall framework of our method is illustrated in Figure 1. Our algorithm follows the tracking-by-detection paradigm, where we first perform object detection using both image and point cloud data. Then, the detected objects are matched to existing trajectories based on their 2D and 3D information. To address the limitations of low-cost devices, we design a novel data fusion-based object detection algorithm. Additionally, we enhance the critical data association process to improve the performance of the tracking system built on our detection algorithm.
In practical industrial applications, it may be challenging for UGVs to be equipped with high-performance and expensive GPUs and sensors. On industrial computers with only a CPU, complex point cloud neural networks may struggle to run efficiently. Furthermore, when using 32-line LiDAR as input, some neural networks may face difficulties in extracting 3D features. Therefore, we combine image-based 2D object detection techniques with point cloud processing methods to propose an efficient 3D object detection algorithm. Our algorithm first projects the point cloud onto the image and then extracts the region of interest in the point cloud based on the 2D detection boxes. For the point cloud regions containing objects and noise, we perform ground point filtering and clustering operations, ultimately generating 3D detection boxes as the output. In this process, we avoid using complex point cloud neural networks, making our algorithm more suitable for low-cost UGV platforms.
Consistent with mainstream tracking algorithms, our method consists of three main steps: prediction, data association, and trajectory lifecycle management. Since the tracking process does not contribute significantly to runtime overhead, our primary innovation focuses on enhancing the data association process, thereby adapting the tracking algorithm to our detector and to engineering practice. In practical applications, missed detections and false positives are inevitable, and this issue is further exacerbated on cost-constrained UGV platforms. Conventional multi-object tracking algorithms are highly susceptible to this issue, often resulting in fragmented trajectories. To address this, we integrate multi-modal data during the data association process. Specifically, we evaluate similarity using the Mahalanobis distance between the 3D detection box and the predicted box, and we adopt a multi-level data association strategy to handle situations where 3D information is unavailable, thereby mitigating trajectory fragmentation. In this process, the detection results are divided into three groups: the first and second groups both contain complete information but differ in confidence levels, while the third group lacks 3D information. Conventional tracking algorithms typically discard the third group of targets; our method instead employs a dedicated third round of data association to handle such cases, preventing trajectory fragmentation. Specifically, we leverage the targets’ 2D information for association, ensuring that their identifiers remain consistent when they are re-clustered in the LiDAR coordinate system. The combined effect of the data fusion and association strategies ensures robust reliability in practical engineering scenarios.
4. Detection with Point Cloud Clustering
The key advantage of our object detection process lies in its independence from powerful computing devices and high-line LiDAR. It consists of five main steps: performing 2D object detection, extracting regions of interest (ROIs) from the point cloud, removing ground points, applying clustering, and generating 3D bounding boxes. The first step uses the 2D detector to obtain each object’s class, location, and confidence score. The remaining steps focus on extracting the point cloud corresponding to the detected object. Since our tracking algorithm uses 2D IoU and 3D Mahalanobis distance in the data association process, our detection process outputs both the 2D bounding boxes and the 3D spatial coordinates of all objects. Information that is not needed downstream is not computed during detection, which further improves the algorithm’s efficiency.
4.1. 2D Detector
Compared to point clouds, RGB images provide richer color and contextual information in a more compact format, making them easier to process with neural networks. In addition, the rich color and texture information they provide contribute to improved detection accuracy. Given the notable variations in efficiency, accuracy, and hardware demands across different 2D object detection algorithms, the choice of a suitable detection method is crucial for real-world applications. In this work, we use the YOLO algorithm for 2D object detection, as it provides end-to-end detection and demonstrates strong performance in both efficiency and accuracy. Furthermore, we integrate Squeeze-and-Excitation (SE) modules into YOLO, which produces significant performance improvements at minimal additional computational cost. The SE module, as a channel attention mechanism, enhances the network’s focus on important features by adaptively adjusting the weight of each channel, while suppressing redundant information, thereby improving the accuracy of the detector. Final outputs of the 2D detector can be formulated as
$$
D_t^j = \left\{ (u_1, v_1), (u_2, v_2), (u_3, v_3), (u_4, v_4),\; s,\; c \right\}, \tag{1}
$$

where $D_t^j$ represents the $j$-th detection bounding box in the image frame at time $t$; $(u_k, v_k)$, $k = 1, \dots, 4$, are the horizontal and vertical coordinates of the four vertices of the detection box; $s$ is the detection confidence score; and $c$ denotes the category of the object.
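To make the channel-attention idea concrete, a generic Squeeze-and-Excitation block can be sketched as follows (this is the standard SE formulation in PyTorch, shown for illustration only; the exact placement and configuration inside our YOLO variant are not reproduced here):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Generic Squeeze-and-Excitation channel attention block."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global average pooling
        self.fc = nn.Sequential(                     # excitation: two FC layers
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)                  # per-channel statistics
        w = self.fc(w).view(b, c, 1, 1)              # per-channel weights in (0, 1)
        return x * w                                 # recalibrate the feature map
```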
4.2. ROI Extraction
After obtaining the 2D detection box, our algorithm extracts the point cloud ROI by utilizing the relative position relationship between the camera and the LiDAR. As shown in Figure 2, this process involves three coordinate systems: the LiDAR coordinate system, the camera coordinate system, and the pixel coordinate system. The LiDAR and camera coordinate systems are 3D spatial coordinate systems with their origins at the locations of the respective sensors, while the pixel coordinate system is a 2D coordinate system corresponding to the RGB image.
To obtain the point cloud region corresponding to the 2D detection box, the point cloud data in the LiDAR coordinate system must be projected onto the pixel coordinate system. Since the LiDAR and camera are mounted at different positions on the vehicle, the point cloud needs to be transformed into the camera coordinate system first. This process can be formulated as
$$
\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix}
= R \begin{bmatrix} x_l \\ y_l \\ z_l \end{bmatrix} + T, \tag{2}
$$

where $x_c$, $y_c$, and $z_c$ represent the coordinates in the camera coordinate system; $x_l$, $y_l$, and $z_l$ represent the coordinates in the LiDAR coordinate system; and $R$ and $T$ represent the rotation and translation matrices, respectively, that define the transformation between the two coordinate systems. After the transformation from the LiDAR coordinate system to the camera coordinate system, the points can be further projected into the pixel coordinate system using the corresponding camera intrinsic calibration parameters. This process can be formulated as

$$
z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
= K \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix}. \tag{3}
$$

In Equation (3), $(u, v)$ represents the coordinate of a point in the point cloud within the pixel coordinate system, and $K$ is the intrinsic matrix of the camera. The transformation matrices $R$, $T$, and $K$ are obtained through the sensor calibration process, which is a necessary step before running any multi-sensor fusion algorithm. The camera intrinsic calibration process provides the intrinsic matrix $K$, while the joint calibration of the camera and LiDAR provides the spatial transformation matrices $R$ and $T$.
Equations (2) and (3) define the transformation between the LiDAR and pixel coordinate systems, which makes it possible to project 3D point cloud data onto the 2D image space. The subsequent step involves identifying potential object points in the point cloud by evaluating their projected locations within the pixel coordinate system.
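For illustration, the projection defined by Equations (2) and (3) can be sketched in NumPy as follows (the array shapes and helper name are assumptions, not the implementation used in this work):

```python
import numpy as np

def project_lidar_to_pixels(points_lidar, R, T, K):
    """Project LiDAR points into the pixel coordinate system.

    points_lidar: (N, 3) array of (x_l, y_l, z_l)
    R: (3, 3) rotation, T: (3,) translation, K: (3, 3) camera intrinsics
    Returns (M, 2) pixel coordinates and the indices of the kept points.
    """
    # Equation (2): LiDAR frame -> camera frame
    points_cam = points_lidar @ R.T + T
    # Keep only points in front of the camera (positive depth)
    in_front = points_cam[:, 2] > 0.0
    points_cam = points_cam[in_front]
    # Equation (3): camera frame -> pixel frame (perspective division)
    uv_hom = points_cam @ K.T
    uv = uv_hom[:, :2] / uv_hom[:, 2:3]
    return uv, np.flatnonzero(in_front)
```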
For each detection box $D_t^j$, if the confidence score $s$ exceeds a predefined threshold, the box is likely to contain a valid object. Consequently, the points whose projections fall within these boxes are considered to belong to the ROI point cloud. The ROI point cloud can be formulated as

$$
P_t^j = \left\{ p_i \mid i = 1, \dots, N_j \right\}, \tag{4}
$$

$$
\min_k u_k \le u_i \le \max_k u_k, \qquad \min_k v_k \le v_i \le \max_k v_k, \qquad \forall\, p_i \in P_t^j. \tag{5}
$$

In Equation (4), $P_t^j$ represents the j-th region of interest in the point cloud at time t, which is associated with detection box $D_t^j$, and $p_i$ is the i-th point contained in $P_t^j$. As described in Equation (5), the projected location $(u_i, v_i)$ of each point in $P_t^j$ within the pixel coordinate system is constrained to lie inside the detection box $D_t^j$. The set $P_t^j$ represents the output of the ROI extraction process, which includes the object point cloud along with some background noise.
It is worth noting that the accuracy of the 2D detector we use meets the practical requirements of autonomous navigation. This ensures that the size and position of the 2D bounding boxes are generally well aligned with the objects the vehicle is focused on. In other words, the detection boxes do not include excessive non-object pixels. Hence, the point cloud after rough extraction contains a high proportion of object points, which guarantees the feasibility of subsequent feature extraction.
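Reusing the projection helper sketched above, the ROI extraction of Equations (4) and (5) can be illustrated as follows (the box is assumed to be given by its minimum and maximum pixel coordinates):

```python
import numpy as np

def extract_roi_points(points_lidar, box_2d, R, T, K):
    """Collect the point-cloud ROI behind one 2D detection box.

    box_2d: (u_min, v_min, u_max, v_max) in pixel coordinates.
    Returns the subset of LiDAR points whose projections fall inside the box.
    """
    uv, kept_idx = project_lidar_to_pixels(points_lidar, R, T, K)
    u_min, v_min, u_max, v_max = box_2d
    inside = (
        (uv[:, 0] >= u_min) & (uv[:, 0] <= u_max) &
        (uv[:, 1] >= v_min) & (uv[:, 1] <= v_max)
    )
    return points_lidar[kept_idx[inside]]
```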
4.3. Feature Extraction in Point Clouds
The 2D detector selects both object and noise pixels, so the ROI point cloud contains both object points and points from the surrounding environment, as shown in Figure 3. In the outdoor environments where UGVs operate, the objects to be detected are typically pedestrians, cars, and other similar objects, while noise in the point cloud mostly consists of ground points and background structures. Points belonging to these background structures and to the objects of interest exhibit significant differences in their spatial distribution characteristics: the height of ground points is significantly lower than that of other features, while background structures, such as buildings, lie at a different depth than the objects awaiting detection. Thus, our algorithm employs an unsupervised learning approach to classify the point cloud: it first removes ground points, then performs clustering, and finally selects the object point clouds. Compared to algorithms utilizing point cloud neural networks, our algorithm demonstrates superior efficiency. Furthermore, it excels at processing 32-line point cloud data. Consequently, our algorithm offers greater advantages in practical applications, particularly on low-cost devices.
4.3.1. Ground Point Filtering
Ground points have two distinguishing features: (1) they lie on the same horizontal plane, and (2) their Z-coordinates, when negated, are roughly equal to the height at which the LiDAR is mounted. Our algorithm searches for points that exhibit these features, starting from the point closest to the LiDAR. In the subsequent step, we evaluate whether pairs of potential ground points belong to the same horizontal plane. We begin by examining the first feature. As illustrated in Figure 4, consider two adjacent LiDAR beams intersecting a surface at points A and B, whose coordinates are $(x_A, y_A, z_A)$ and $(x_B, y_B, z_B)$. Let the height difference between points A and B be $\Delta h = \lvert z_A - z_B \rvert$, and their distance in the XOY-plane be $\Delta d = \sqrt{(x_A - x_B)^2 + (y_A - y_B)^2}$. The angle $\theta$ formed between the line connecting points A and B and the horizontal plane can be computed as

$$
\theta = \arctan\!\left( \frac{\Delta h}{\Delta d} \right). \tag{6}
$$

Theoretically, if points A and B satisfy the first feature, the angle $\theta$ should be negligible. In practice, due to possible ground slope and sensor measurement errors, we allow a tolerance. Specifically, points A and B are considered to satisfy the first feature if $\theta \le \theta_{\mathrm{th}}$, where $\theta_{\mathrm{th}}$ is a small angular threshold. If, in addition, the negated vertical coordinates of points A and B are numerically close to the installation height of the LiDAR, they are considered ground points. By iterating through all points and filtering out those identified as ground points, we obtain a refined point cloud that retains only background elements and objects of interest.
Figure 4. Schematic diagram of ground point extraction. The red lines represent the laser beams emitted by the sensor, and the orange circles mark the points where the beams hit object surfaces.
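A simplified sketch of the ground-filtering step is given below; the single-seed search and the specific thresholds are illustrative assumptions rather than the exact procedure used in our implementation:

```python
import numpy as np

def filter_ground_points(points, lidar_height, angle_th_deg=5.0, height_tol=0.2):
    """Remove likely ground points from an ROI point cloud (sketch).

    A point is treated as ground when the line to the current ground seed is
    nearly horizontal (Equation (6)) and its negated z is close to the LiDAR
    mounting height. All parameter values here are assumptions.
    """
    points = np.asarray(points, dtype=float)
    if len(points) == 0:
        return points

    # Sort by range so the search starts from the point closest to the LiDAR.
    order = np.argsort(np.linalg.norm(points[:, :2], axis=1))
    pts = points[order]

    is_ground = np.zeros(len(pts), dtype=bool)
    ref = pts[0]                                    # candidate ground seed
    for i in range(1, len(pts)):
        dz = abs(pts[i, 2] - ref[2])                # height difference
        dd = np.linalg.norm(pts[i, :2] - ref[:2]) + 1e-6   # XOY-plane distance
        theta = np.degrees(np.arctan2(dz, dd))      # Equation (6)
        near_ground_height = abs(-pts[i, 2] - lidar_height) < height_tol
        if theta < angle_th_deg and near_ground_height:
            is_ground[i] = True
            ref = pts[i]                            # grow the ground surface
    if abs(-pts[0, 2] - lidar_height) < height_tol:
        is_ground[0] = True

    keep = np.empty(len(points), dtype=bool)
    keep[order] = ~is_ground                        # map flags back to input order
    return points[keep]
```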
4.3.2. Point Cloud Clustering
After filtering out the ground points, we use the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm to extract object point clouds from the ROI point cloud. DBSCAN can identify clusters of arbitrary shape within noisy data, which aligns well with the requirements of our application.
As shown in Figure 5, the DBSCAN algorithm divides the points in the data into three categories: core points, boundary points, and noise points.
Core points: A point p is classified as a core point if the number of points q within its $\varepsilon$-neighborhood reaches a predefined threshold $MinPts$, which can be formulated as $\left| \{ q \mid \mathrm{dist}(p, q) \le \varepsilon \} \right| \ge MinPts$. The neighborhood radius $\varepsilon$ and the threshold $MinPts$ are input parameters of the algorithm. In Figure 5, circles represent the neighborhood radius, and red points indicate the core points.
Boundary points: Points that lie within the neighborhood of a core point but are not core points themselves are referred to as boundary points.
Noise points: Points that are neither core points nor located within the neighborhood of any core points are classified as noise points.
DBSCAN defines three fundamental concepts to characterize relationships between these points: directly density-reachable, density-reachable, and density-connected [38]. A point q is directly density-reachable from a point p if q lies within the $\varepsilon$-neighborhood of p and p is a core point. A point q is density-reachable from p if there exists a chain of core points $\{p_1, p_2, \dots, p_k\}$, with $p_1 = p$ and $p_k = q$, where each point is directly density-reachable from the previous one. This relationship is asymmetric. Two points p and q are density-connected if there exists a point o from which both are density-reachable. These definitions allow DBSCAN to discover clusters of arbitrary shapes and effectively identify noise based on local point density. All density-connected points are considered to belong to the same cluster. After inputting the point cloud data into the DBSCAN algorithm, several clusters of denoised point clouds can be obtained:

$$
\left\{ C_t^1, C_t^2, \dots, C_t^n \right\} = \mathrm{DBSCAN}\!\left( \tilde{P}_t^j \right). \tag{9}
$$

In Equation (9), $\{C_t^1, C_t^2, \dots, C_t^n\}$ represents the output of the clustering algorithm, which contains n clusters of point clouds of different categories; $\mathrm{DBSCAN}(\cdot)$ denotes the clustering process; and $\tilde{P}_t^j$, the ROI point cloud after ground point removal, is the input of the clustering algorithm. Since the object within each detection box corresponds to the majority of the pixels, the associated point cloud within the corresponding region is also expected to contain the highest density of points representing the object. Therefore, for the clustered point cloud, our algorithm selects the cluster with the largest number of points as the object point cloud. A 3D bounding box can be fitted to the boundary of this object point cloud; however, since our tracking algorithm does not rely on 3D IoU for similarity computation, the 3D information provided by the detection module to the tracking module in this work is limited to the coordinates of the center of each detected object.
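For illustration, the clustering and largest-cluster selection can be sketched with scikit-learn's DBSCAN as follows (the parameter values are assumptions, not the settings used in our experiments):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_and_select_object(roi_points, eps=0.5, min_pts=5):
    """Cluster a ground-free ROI point cloud and keep the largest cluster.

    Returns the object point cloud and its 3D center, or (None, None) if
    no valid cluster is found (e.g., the point cloud is too sparse).
    """
    roi_points = np.asarray(roi_points, dtype=float)
    if len(roi_points) < min_pts:
        return None, None

    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(roi_points)
    valid = labels[labels >= 0]                 # label -1 marks noise points
    if valid.size == 0:
        return None, None

    # The cluster with the most points is taken as the object point cloud.
    largest_label = np.bincount(valid).argmax()
    object_points = roi_points[labels == largest_label]
    center_3d = object_points.mean(axis=0)
    return object_points, center_3d
```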
5. Multi-Object Tracking with Multi-Sensor Fusion
The ultimate task of the multi-object tracking algorithm is the updating and management of trajectories. In tracking-by-detection frameworks, Kalman filtering is commonly used for updating trajectory states, with real-time detection results serving as the observation data. In this process, the algorithm needs to obtain predicted data based on historical trajectories. More importantly, it must perform a one-to-one matching between the predicted data and the observation data. Current mainstream algorithms follow the SORT framework, which performs data association using the IoU between the predicted and detected bounding boxes. However, when 3D detection results are inaccurate, the tracking process may not proceed smoothly. In fact, errors in 3D detection are likely to occur in real-world complex scenarios, especially when the LiDAR has a low number of scan lines. Therefore, current general-purpose 3D multi-object tracking algorithms may struggle to achieve the desired performance in real-world tasks.
In this work, we propose a multi-level data association strategy based on multi-sensor data fusion, which mitigates the impact of detection errors on tracking performance. Our tracking algorithm comprises three components: prediction, data association, and trajectory lifecycle management. In the data association stage, we calculate the similarity between detection results and predicted results using a weighted sum of 2D IoU and 3D Mahalanobis distance. Then, based on the similarity, the Hungarian algorithm is used to match trajectories with detection results. Unlike other algorithms that only use 3D IoU to calculate similarity, our algorithm can associate detection results with trajectories even in the absence of 3D detection boxes, preventing trajectory fragmentation. Furthermore, we adopt a grouped matching approach, where detection results are matched in batches according to the confidence levels and the availability of 3D information. This further enhances the accuracy of our algorithm.
5.1. State Prediction
Since 2D IoU and Mahalanobis distance are used during the data association process, both the predicted results and detection results must include the following information:

$$
O_t^j = \left\{ (u_1, v_1), (u_2, v_2), (u_3, v_3), (u_4, v_4),\; c,\; s,\; x_t^j,\; y_t^j,\; z_t^j \right\}. \tag{10}
$$

In Equation (10), $O_t^j$ is the information of the j-th object detected by our algorithm at time t. It includes the 2D bounding box, the object category $c$, the confidence score $s$, and the center coordinates $(x_t^j, y_t^j, z_t^j)$ of the corresponding object point cloud. During the tracking process, this information serves as the observation data for the state update step.
For a given trajectory, the object category and the detection box size at each historical time step are known. During the prediction stage, we assume that the size of the 2D bounding box corresponding to the object at time t remains the same as at time t−1. Therefore, only the object’s center position in both the image coordinate system and the LiDAR coordinate system needs to be estimated for time t. Consistent with the AB3DMOT algorithm, our approach employs a constant velocity model to predict the object’s position. This process can be formulated as

$$
\begin{bmatrix} \hat{x}_t^{\,i} \\ \hat{y}_t^{\,i} \\ \hat{z}_t^{\,i} \end{bmatrix}
=
\begin{bmatrix} x_{t-1}^{\,i} \\ y_{t-1}^{\,i} \\ z_{t-1}^{\,i} \end{bmatrix}
+
\begin{bmatrix} v_x \\ v_y \\ v_z \end{bmatrix} \Delta t, \tag{11}
$$

where $(\hat{x}_t^{\,i}, \hat{y}_t^{\,i}, \hat{z}_t^{\,i})$ is a prior estimate of the i-th object’s center position at time t in the LiDAR coordinate system. It is important to note that the i-th object refers to the object associated with the i-th trajectory. $v_x$, $v_y$, and $v_z$ represent the object’s velocity components along the three spatial axes, and $\Delta t$ denotes the time interval. It can be observed that we do not use the object’s angular velocity information in the prediction process. This is because previous studies have shown that including angular velocity does not significantly improve prediction accuracy [39]. Given the difficulty of estimating the object’s yaw angle, using angular velocity for position estimation may, in fact, reduce accuracy. Once the 3D center position of the object is estimated, its complete spatial information can be further derived. This process can be expressed as

$$
\hat{u}_k = \hat{u}_t^{\,i} \pm \frac{w}{2}, \qquad
\hat{v}_k = \hat{v}_t^{\,i} \pm \frac{h}{2}, \qquad k = 1, \dots, 4, \tag{12}
$$

$$
\hat{O}_t^{\,i} = \left\{ (\hat{u}_1, \hat{v}_1), (\hat{u}_2, \hat{v}_2), (\hat{u}_3, \hat{v}_3), (\hat{u}_4, \hat{v}_4),\; c,\; s,\; \hat{x}_t^{\,i}, \hat{y}_t^{\,i}, \hat{z}_t^{\,i} \right\}. \tag{13}
$$

Equation (12) provides an estimation of the boundary coordinates of the bounding box at time t. Here, $(\hat{u}_t^{\,i}, \hat{v}_t^{\,i})$ represents the estimated center position of the i-th object at time t in the pixel coordinate system, which can be computed from the estimated 3D coordinates using Equations (2) and (3), and $w$ and $h$ represent the width and height of the detection box, respectively. Equation (13) defines $\hat{O}_t^{\,i}$ as the final predicted state of the i-th object at time t, where all variables retain the same meanings as in Equation (10), except for differences in subscripts and the time index. It is worth noting that these states are derived from the historical data of the i-th trajectory. Consequently, both the confidence score and the object category remain the same as in the previous time step.
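A minimal constant-velocity Kalman predictor consistent with Equation (11) can be sketched as follows (the state layout and noise values are illustrative assumptions; its update step is sketched in Section 5.3):

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal constant-velocity Kalman filter over the 3D object center."""

    def __init__(self, center_3d, dt=0.1):
        self.dt = dt
        self.x = np.zeros(6)                      # state: [x, y, z, vx, vy, vz]
        self.x[:3] = center_3d                    # initial position, zero velocity
        self.P = np.eye(6) * 10.0                 # state covariance
        self.F = np.eye(6)                        # constant-velocity transition
        self.F[:3, 3:] = np.eye(3) * dt
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])  # observe position only
        self.Q = np.eye(6) * 0.01                 # process noise (assumed)
        self.R = np.eye(3) * 0.1                  # observation noise (assumed)

    def predict(self):
        """Propagate the state one step ahead (prior estimate of the center)."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]                         # predicted (x, y, z)
```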
5.2. Multimodal Data Association
Based on the characteristics of our multi-object detection algorithm, we propose a multimodal data association method that integrates 2D and 3D detection results. Unlike conventional algorithms, our method calculates the similarity matrix using both image and point cloud data. Furthermore, it performs data association in batches based on the 2D detection confidence scores. The combination of the similarity calculation method and the matching mechanism enables our algorithm to reduce the impact of detection errors on tracking accuracy.
Similarity calculation: For the 2D and 3D detection results, we employ distinct methods to calculate the similarity between detected and predicted objects. Since the 2D detector can output accurate bounding boxes, we calculate the similarity between the 2D detection results and predicted results using the IoU of the bounding boxes. This process can be formulated as

$$
S_{2D}^{\,i,j} = \mathrm{IoU}\!\left( B_d^{\,i}, B_p^{\,j} \right)
= \frac{\left| B_d^{\,i} \cap B_p^{\,j} \right|}{\left| B_d^{\,i} \cup B_p^{\,j} \right|}, \tag{14}
$$

where $S_{2D}^{\,i,j}$ is the similarity between the i-th 2D detection result and the j-th predicted result, $B_d^{\,i}$ and $B_p^{\,j}$ denote the image regions covered by the detection box and the prediction box, respectively, and $|\cdot|$ denotes area. The larger the similarity, the closer the predicted result is to the actual detection result, and the more likely they belong to the same trajectory. More specifically, $B_d^{\,i}$ and $B_p^{\,j}$ can be derived from the boundary coordinates of the bounding boxes contained in $O_t^{\,i}$ and $\hat{O}_t^{\,j}$, respectively.
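A straightforward implementation of Equation (14) is sketched below, with boxes given by their corner coordinates:

```python
def iou_2d(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Boxes are given as (u_min, v_min, u_max, v_max) in pixel coordinates.
    """
    u_min = max(box_a[0], box_b[0])
    v_min = max(box_a[1], box_b[1])
    u_max = min(box_a[2], box_b[2])
    v_max = min(box_a[3], box_b[3])

    inter = max(0.0, u_max - u_min) * max(0.0, v_max - v_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```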
Unlike 2D bounding boxes in images, 3D cuboid bounding boxes additionally require the objects’ pose angles and depth information. However, pedestrians and cars exhibit significant uncertainty in both the temporal and spatial dimensions of their poses. Consequently, generating accurate 3D bounding boxes is a considerable challenge in both the detection and prediction phases. In this context, it is not advisable to continue using 3D IoU to calculate similarity. Instead, we employ the Mahalanobis distance between the detection and prediction results as a measure of similarity. This process can be formulated as

$$
d_M^{\,i,j} = \sqrt{ \left( \mathbf{z}_t^{\,i} - H \hat{\mathbf{x}}_t^{\,j} \right)^{\!\top} S^{-1} \left( \mathbf{z}_t^{\,i} - H \hat{\mathbf{x}}_t^{\,j} \right) }, \tag{15}
$$

$$
S_{3D}^{\,i,j} = f\!\left( d_M^{\,i,j} \right). \tag{16}
$$

In Equation (15), $d_M^{\,i,j}$ represents the Mahalanobis distance between the detection and the prediction, $H$ is the 3D observation matrix, $S$ is the noise covariance matrix, and $\mathbf{z}_t^{\,i}$ and $\hat{\mathbf{x}}_t^{\,j}$ represent the 3D coordinates of the detection result and the predicted state of the corresponding trajectory, respectively. In Equation (16), $S_{3D}^{\,i,j}$ denotes the 3D similarity, where $f(\cdot)$ is a monotonically decreasing mapping so that smaller distances yield larger similarity values. The final similarity matrix can be obtained by weighted summation of these two similarity matrices, and the specific calculation formula is as follows

$$
S^{\,i,j} = \omega_{2D}\, S_{2D}^{\,i,j} + \omega_{3D}\, S_{3D}^{\,i,j}, \tag{17}
$$

where $\omega_{2D}$ and $\omega_{3D}$ are the weighting coefficients of the 2D and 3D similarities, respectively.
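The 3D similarity and the weighted fusion can be sketched as follows; the 1/(1 + d) mapping from Mahalanobis distance to similarity and the equal default weights are illustrative assumptions:

```python
import numpy as np

def similarity_3d(center_det, center_pred, cov):
    """Mahalanobis-distance-based 3D similarity (sketch of Eqs. (15)-(16)).

    center_det, center_pred: 3D centers of the detection and the prediction;
    cov: 3x3 noise covariance matrix.
    """
    diff = np.asarray(center_det, dtype=float) - np.asarray(center_pred, dtype=float)
    d_m = np.sqrt(diff @ np.linalg.inv(cov) @ diff)
    return 1.0 / (1.0 + d_m)        # assumed monotone mapping to a similarity

def combined_similarity(sim_2d, sim_3d, w_2d=0.5, w_3d=0.5):
    """Weighted fusion of the 2D and 3D similarities (sketch of Eq. (17))."""
    return w_2d * sim_2d + w_3d * sim_3d
```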
Matching strategy: Our 3D detector first generates 2D detection boxes and then extracts features within these boxes. The output detection results can be categorized into three classes. The first class consists of objects that are most reliably detected, characterized by complete 3D information and a confidence score above the threshold. The second class includes objects that may be partially occluded, resulting in confidence scores below the threshold; however, they can still be clustered and thus retain complete 3D information. The third category of objects can be detected in images; however, due to factors such as distance and occlusion, the corresponding point clouds are too sparse to form clusters. As a result, these objects contain only 2D detection information and lack 3D information. If the first two categories of objects are treated equally using the Hungarian algorithm, the tracking accuracy deteriorates. Moreover, if the third category of objects is directly discarded, they will be regarded as new objects once they are re-clustered, leading to fragmented tracking trajectories. Therefore, we propose a multi-level data association strategy to address this problem.
In the data association process, we initially apply the Hungarian algorithm to match objects from the high-confidence group with all available trajectories. Subsequently, the remaining trajectories are matched with the results from the low-confidence group. In both matching stages, we use the overall similarity as the input to the Hungarian algorithm. Finally, the third group of objects is associated with the remaining trajectories, where 2D IoU is used as the similarity measure during this process. Since the transformation between the camera and LiDAR coordinates is constant, maintaining consistent object IDs in the image ensures that the correct IDs can be reassigned when objects inside the 2D detection boxes are re-clustered. Even when the 3D feature extraction for certain objects is suboptimal, our algorithm can leverage 2D detection boxes to refine the tracking results. As a result, our data association process is well-suited to our detector.
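The multi-level association can be viewed as repeated Hungarian matching over a shrinking set of trajectories. A sketch with SciPy is given below; the grouping interface and the minimum-similarity threshold are assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(sim, det_idx, trk_idx, min_sim=0.1):
    """One round of Hungarian matching on a similarity sub-matrix.

    sim[r, c] is the similarity between detection det_idx[r] and trajectory
    trk_idx[c]; pairs whose similarity falls below min_sim are rejected.
    """
    if len(det_idx) == 0 or len(trk_idx) == 0:
        return [], list(det_idx), list(trk_idx)
    rows, cols = linear_sum_assignment(-sim)        # maximize total similarity
    matches, used_d, used_t = [], set(), set()
    for r, c in zip(rows, cols):
        if sim[r, c] >= min_sim:
            matches.append((det_idx[r], trk_idx[c]))
            used_d.add(det_idx[r])
            used_t.add(trk_idx[c])
    return (matches,
            [d for d in det_idx if d not in used_d],
            [t for t in trk_idx if t not in used_t])

def multi_level_association(sim_full, sim_2d, groups, all_trks):
    """Three association rounds: high-confidence, low-confidence, 2D-only.

    groups holds detection index lists under 'high', 'low', and 'only2d';
    sim_full fuses the 2D and 3D similarities, sim_2d uses 2D IoU only.
    """
    matches, unmatched_dets = [], []
    trks = list(all_trks)
    for det_group, sim in ((groups["high"], sim_full),
                           (groups["low"], sim_full),
                           (groups["only2d"], sim_2d)):
        m, ud, trks = associate(sim[np.ix_(det_group, trks)], det_group, trks)
        matches += m
        unmatched_dets += ud
    return matches, unmatched_dets, trks
```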
5.3. Trajectory Management
After completing data association, the tracking algorithm utilizes detection results to update and maintain existing trajectories. During this process, the algorithm addresses three categories: successfully matched pairs, unmatched detection results, and unmatched trajectories. Accordingly, the algorithm executes three operations: trajectory updating, new trajectory initialization, and old trajectory removal.
Trajectory updating: For successfully matched pairs, the corresponding trajectories are updated using Kalman filtering. In the previous steps, all detection and prediction data have already been obtained. Assume that the Hungarian algorithm matches the i-th prediction with the j-th detection, indicating that they belong to the same object. The state updating process of the trajectory can be formulated as

$$
\hat{\mathbf{x}}_t = \hat{\mathbf{x}}_t^{-} + K_t \left( \mathbf{z}_t - H \hat{\mathbf{x}}_t^{-} \right), \tag{18}
$$

where $\hat{\mathbf{x}}_t^{-}$ is the prior state of the object at time $t$, $\hat{\mathbf{x}}_t$ is the posterior state of the object at time $t$, $K_t$ is the Kalman gain, $\mathbf{z}_t$ is the observation of the object, and $H$ is the observation matrix. By integrating $\hat{\mathbf{x}}_t$ into the trajectory, the algorithm finalizes the update of the corresponding trajectory.
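A sketch of the corresponding update step, extending the ConstantVelocityKF example from Section 5.1, is given below:

```python
import numpy as np

def kf_update(kf, center_3d_obs):
    """Posterior update for the ConstantVelocityKF sketch (cf. Equation (18)).

    center_3d_obs is the 3D center of the matched detection; the gain is
    computed from the filter's observation matrix H, noise R, and covariance P.
    """
    z = np.asarray(center_3d_obs, dtype=float)
    innovation = z - kf.H @ kf.x                    # z_t - H x_t^-
    S = kf.H @ kf.P @ kf.H.T + kf.R                 # innovation covariance
    K = kf.P @ kf.H.T @ np.linalg.inv(S)            # Kalman gain K_t
    kf.x = kf.x + K @ innovation                    # Equation (18)
    kf.P = (np.eye(len(kf.x)) - K @ kf.H) @ kf.P    # covariance update
    return kf.x[:3]                                 # updated 3D center
```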
New trajectory initialization: In the data association process, certain detection bounding boxes exhibit high confidence levels yet remain unmatched to existing trajectories. These detections likely represent newly emerged objects, necessitating the initialization of new trajectories. However, to avoid tracking errors induced by detection inaccuracies, our algorithm refrains from immediately outputting newly initialized trajectories. These potential trajectories are only confirmed and output as new trajectories upon successful matching with detection bounding boxes in the subsequent data association cycle.
Old trajectory removal: Trajectory objects that fail to match any detection objects in data associations are typically assumed to have exited the vehicle’s detection range. However, in complex and dynamic environments, such mismatches may also result from long-term occlusion, causing the detector to miss previously tracked objects. To distinguish between objects that have left and those missed due to occlusion, the tracker employs a deletion threshold of 30 frames. This prevents the premature removal of trajectories, thereby avoiding trajectory fragmentation. Specifically, if a trajectory object fails to match any detection object over 30 consecutive frames, it is deemed to have exited the vehicle’s vicinity and is removed from the trajectory set, excluding it from further data association.
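The lifecycle rules described above can be captured with simple per-track counters, as in the sketch below (the 30-frame deletion threshold follows the description above; the two-match confirmation rule reflects the initialization policy, and other details are assumptions):

```python
class Track:
    """Minimal trajectory lifecycle bookkeeping."""

    def __init__(self, track_id, kf, max_misses=30, min_hits=2):
        self.id = track_id
        self.kf = kf                 # e.g., a ConstantVelocityKF instance
        self.hits = 1                # number of successful matches so far
        self.misses = 0              # consecutive frames without a match
        self.max_misses = max_misses
        self.min_hits = min_hits

    def mark_matched(self):
        self.hits += 1
        self.misses = 0

    def mark_missed(self):
        self.misses += 1

    @property
    def confirmed(self):
        # Only output the trajectory after it has been matched again
        # in the subsequent data association cycle.
        return self.hits >= self.min_hits

    @property
    def expired(self):
        # Remove the trajectory after too many consecutive misses.
        return self.misses > self.max_misses
```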