Design of a Robust System Architecture for Tracking Vehicle on Highway Based on Monocular Camera

Multi-target tracking is a central aspect of modeling the environment of autonomous vehicles. A mono camera is a necessary component of an autonomous driving system. One of the biggest advantages of the mono camera is that it can classify the vehicle type, and cameras are the only sensors able to interpret 2D information such as road signs or lane markings. Besides this, it has the advantage of estimating the lateral velocity of a moving object. The mono camera is now used by companies all over the world to build autonomous vehicles. In the expressway scenario, the forward-looking camera generates raw images from which information is extracted to track multiple vehicles at the same time. A multi-object tracking system is needed, composed of a convolutional neural network module, a depth estimation module, a kinematic state estimation module, a data association module, and a track management module. This paper applies the YOLO detection algorithm combined with a depth estimation algorithm, the Extended Kalman Filter, and the nearest neighbor algorithm with a gating trick to build the tracking system. Finally, the tracking system is tested on a vehicle equipped with a forward mono camera, and the results show that the lateral and longitudinal position and velocity estimates can satisfy the needs of Adaptive Cruise Control (ACC), Navigation On Pilot (NOP), Auto Emergency Braking (AEB), and other applications.


Introduction
Self-driving cars or autonomous vehicles will have a huge impact on our society once the technology is deployed at scale [1]. The camera has become an integral component of the autonomous driving system. Using this sensor, the autonomous system can perform multiple tasks critical to its autonomy, such as detecting pedestrians, lanes, and traffic signs, or tracking multiple moving obstacles at the same time [2]. Most importantly, the camera's small package size and low manufacturing cost allow car manufacturers to deploy multiple cameras, such as forward-, backward-, or side-corner-facing ones, for environment perception.
To be safe, autonomous vehicles must be capable of perceiving the surroundings and all objects that move around them. Multi-object tracking is about the accurate perception of the driving environment [3]. Multi-target tracking based on a mono camera is therefore a key enabling technology for any self-driving vehicle system: it extracts information from the raw camera sensor to constantly estimate the states of moving objects.
The challenge of 3D visual perception mainly lies in the following facts: (1) an image is the projection of the real-world object onto the image plane, and the distance information is lost in the projection transformation; (2) the size of an object in the image changes with its distance; (3) it is hard to estimate the object's size and distance. Deep learning has become the dominant approach to most computer vision problems, such as object detection, which is present in every self-driving system: two-stage detectors such as the RCNN family [16], and one-stage detectors such as the YOLO series [17] or SSD [18]. However, using deep learning adds constraints to the system because it requires more computational power. In an autonomous vehicle, one of the major concerns about a deep learning object detector is its speed. In contrast to other methods, the major advantage of YOLO is its speed, and it can be deployed in an autonomous system easily, so this paper adopts YOLO as the object detector of the multi-object tracking system.
In the mono camera system, every 3D object in the world is projected through the lens onto the image plane. The shortcoming of a monocular camera is that it will lose the depth information and cannot directly resolve this ambiguity; this paper resolves it through the depth estimator.
In the mono camera-based multi-target tracking system, another problem to deal with is data association, which determines which measurement comes from which target. The Probabilistic Data Association (PDA) family of methods is adopted in the literature [19]. These methods share a similar data association procedure: they first compute the probability of being correct for each validated measurement at the current time and then use these probabilities as weights to obtain the state estimate of the target. Another method is the Hungarian method [20]. The Hungarian algorithm solves the track and measurement assignment problem with a worst-case runtime complexity of O(n³). This paper adopts the global nearest neighbor approach with a gating trick, which not only reduces the data association complexity but is also easy to implement.
For target motion state estimation, from the perspective of the highway driving scenario, this paper applies a linear Constant Velocity (CV) motion model and a non-linear observation model, since the camera measures the object in image coordinates, which need to be converted to ego-vehicle Cartesian coordinates. The Extended Kalman Filter (EKF) [21] is widely used for filtering problems in which such nonlinear factors exist.
Track management is another crucial problem in the multi-target tracking system; it refers to track initialization, maintenance, and cancellation, because moving objects may enter or disappear from the mono camera sensor's field of view [3].
Based on the analysis of the above modules, the mono camera-based multi-target tracking framework proposed in this paper is shown in Figure 1.
The rest of the paper is organized as follows. Section 2 discusses the deep learning-based object detector YOLO and the depth estimator based on the bounding boxes published by the object detector. Section 3 describes the kinematic transition model of the moving object and the mono camera sensor measurement model. Section 4 analyzes how to apply the nonlinear filtering approach, the Extended Kalman Filter, in the mono camera tracking system. In Section 5, a gating method combined with the Hungarian data association method is proposed. In Section 6, this work adopts a simple track management policy. Finally, the performance of the mono camera tracking system is evaluated qualitatively and quantitatively.
The contributions of this paper can be summarized as follows:

1. A multi-target tracking system based on a mono camera is constructed, which can be used in the expressway scene.
2. An object detector combined with a depth estimator is designed to resolve the mono camera depth-loss problem.
3. The whole system is tested in the highway scenario, and the lateral and longitudinal performance is evaluated qualitatively and quantitatively.

The Object Detector and Depth Estimator
The object detector module adopts the deep learning approach YOLO. YOLO uses a single neural network for the full image. The network divides the image into regions and predicts bounding boxes and probabilities for each region. One of the major advantages of the YOLO framework is its speed: in contrast to other methods, it makes only a single pass through the neural network. Compared with its peers, such as SSD or the RCNN family, the results are quite decent.
In the YOLO algorithm, the mono camera image is divided into a 13 × 13 grid of cells. Depending on the size of the input image, the size of these cells in pixels varies. Each cell is then used for predicting a set of bounding boxes. For each bounding box, the network also predicts the confidence that the bounding box encloses a particular object, as well as the probability of the object belonging to a particular class. Lastly, non-maximum suppression is used to eliminate bounding boxes with a low confidence level as well as redundant bounding boxes enclosing the same object. The result of feeding the mono camera image to the object detector can be seen in Figure 2.
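As an illustration, the confidence filtering and non-maximum suppression steps described above can be sketched in a few lines of Python. The box format, thresholds, and function names here are illustrative assumptions, not the paper's implementation:

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2) in pixels; intersection over union.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, conf_thresh=0.5, iou_thresh=0.45):
    # Drop low-confidence boxes, then greedily keep the highest-scoring
    # box and suppress remaining boxes that overlap it too much.
    order = sorted((i for i, s in enumerate(scores) if s >= conf_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

The greedy loop mirrors the description above: redundant boxes enclosing the same object are removed, and only the most confident detection per object survives.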
In the mono camera system, every 3D object in the real world is projected to the 2D image plane, which loses the depth information. Thus, how to recover the depth information from the mono camera image is critical for multiple vehicle tracking. The depth estimator is shown in Figure 3. As shown in Figure 3, the distance can be computed by Equation (1):

d = H_c · F_c / (y_b − y_h) (1)

where H_c is the height at which the camera is mounted on the ego vehicle, F_c is the focal length of the mono camera, y_h is the pixel location of the vanishing point, and y_b is the pixel location of the bounding box bottom line in the image plane.
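A minimal sketch of this flat-road depth computation, assuming the variable definitions above (the function name, the flat-road assumption, and the example numbers are illustrative):

```python
def depth_from_bottom_line(y_b, y_h, H_c, F_c):
    """Flat-road depth estimate for a detected vehicle.

    y_b : pixel row of the bounding-box bottom edge
    y_h : pixel row of the vanishing point (horizon)
    H_c : camera mounting height above the road, in meters
    F_c : focal length of the mono camera, in pixels
    """
    if y_b <= y_h:
        raise ValueError("bottom edge must lie below the horizon")
    # Similar triangles between the camera height and the ground
    # intersection of the viewing ray through the box bottom.
    return H_c * F_c / (y_b - y_h)
```

For example, with a 1.5 m mounting height, a 1000-pixel focal length, and a box bottom 100 pixels below the horizon, the estimated distance is 15 m.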

System Model
After obtaining the detections from the object detector and depth estimator, the bounding box information, the moving vehicle type, and the distance to the ego vehicle are available. The mono camera tracking system then studies the estimation of time-varying parameters, that is, the state estimation problem, which refers to smoothing the past motion state of a target, filtering the present motion state, and predicting the future motion state [17]. A typical forward-looking mono camera multi-target tracking system can be seen in Figure 4. In the highway scenario, this paper adopts the constant velocity motion model as the system state transition model. The system model of an object is represented by Cartesian position and velocity components. The model assumes that the target vehicle moves with constant velocity in the lateral and longitudinal directions and injects noise into the velocity components through two independent Wiener processes. The position and velocity evolve as Equations (2)-(5):

x_{k+1} = x_k + ẋ_k T + (1/2) ν_x T² (2)
y_{k+1} = y_k + ẏ_k T + (1/2) ν_y T² (3)
ẋ_{k+1} = ẋ_k + ν_x T (4)
ẏ_{k+1} = ẏ_k + ν_y T (5)
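Under the stated constant-velocity assumption, one step of the transition model can be sketched with numpy. The matrix forms follow from the state vector [x, y, ẋ, ẏ]ᵀ and sample time T; the function names are illustrative:

```python
import numpy as np

def cv_matrices(T):
    # State X = [x, y, x_dot, y_dot]^T; process noise nu = [nu_x, nu_y]^T.
    F = np.array([[1.0, 0.0,   T, 0.0],
                  [0.0, 1.0, 0.0,   T],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    # Noise enters position as (1/2) nu T^2 and velocity as nu T.
    G = np.array([[0.5 * T**2, 0.0],
                  [0.0, 0.5 * T**2],
                  [T, 0.0],
                  [0.0, T]])
    return F, G

def predict_state(X, T):
    # Deterministic part of the transition: X(k+1) = F X(k).
    F, _ = cv_matrices(T)
    return F @ X
```

For instance, a vehicle at (10 m, 2 m) moving at (20, −1) m/s is predicted at (12 m, 1.9 m) after T = 0.1 s, with unchanged velocity.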

Assume the state vector is X(k) = [x_k y_k ẋ_k ẏ_k]^T and the process noise vector is ν(k) = [ν_x ν_y]^T. The corresponding state transition matrix F and process noise matrix G (rows separated by semicolons) are, respectively:

F = [1 0 T 0; 0 1 0 T; 0 0 1 0; 0 0 0 1],  G = [T²/2 0; 0 T²/2; T 0; 0 T]

The front-facing mono camera is mounted on the windshield of the vehicle, as shown in Figure 5. In the mono camera measurement system, the vehicles in the real world are projected onto the image plane. The multi-object tracking system tracks the moving objects in an ego-vehicle coordinate system, and three types of coordinates are involved, namely: the ego-vehicle coordinate system, the mono camera coordinate system, and the image coordinate system, as shown in Figure 6.
As shown in Figure 6, in the camera coordinate system, the x-axis is the camera's optical axis. The intersection of the optical axis and the image plane is called the image center or principal point. In the vehicle coordinate system used for tracking, the x-axis points forward, the y-axis points to the left, and the z-axis points upward. The image coordinate system usually has its origin in the upper-left corner of the image. The pixel coordinates are denoted by i for the horizontal dimension and j for the vertical dimension. Note that the pixel is not necessarily perfectly square; instead of a single focal length f, there may be two numbers (f_i, f_j) that differ slightly. The image center C and the focal length f are derived through intrinsic camera calibration.
The mono camera measurement function h(X) can be defined as Equation (6), where (p_x, p_y, p_z) is the 3D position of the vehicle in the real world; the formula summarizes how to compute the image coordinates of a 3D object given in vehicle coordinates. Projecting a 3D point into the 2D image plane makes Equation (6) a nonlinear measurement function, since for a mono camera the mapping from Cartesian coordinates to image coordinates must be calculated. So, the mono camera measurement equation can be defined as Equation (7):

z(k) = h(X(k)) + ω(k) (7)

In Equation (7), z is the measurement vector and ω is a white Gaussian measurement noise sequence with zero mean and known covariance. As can be seen from Equation (7), the measurement function is nonlinear; the next section discusses how to deal with this nonlinearity using the Extended Kalman Filter.
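Since the exact projection formula is not reproduced here, the following is a plausible pinhole-style sketch of h(X) under the coordinate conventions of Figure 6 (x forward along the optical axis, y left, z up). The sign conventions, offsets, and function name are assumptions for illustration only:

```python
def project_to_image(p_x, p_y, p_z, f_i, f_j, c_i, c_j):
    # Vehicle coordinates: x forward (optical axis), y left, z up.
    # Image coordinates: i rightward, j downward, origin top-left.
    # (c_i, c_j) is the image center; (f_i, f_j) the per-axis focal lengths.
    if p_x <= 0:
        raise ValueError("object must be in front of the camera")
    # Division by the longitudinal distance p_x is what makes the
    # measurement function nonlinear in the state.
    i = c_i - f_i * p_y / p_x
    j = c_j - f_j * p_z / p_x
    return i, j
```

A point on the optical axis maps to the image center, and points to the left of the ego vehicle (positive p_y) map to smaller i, consistent with the axis conventions above.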

The State Estimator
The most famous state estimator is the Kalman filter [18], which obtains dynamic estimates of moving targets under the linear Gaussian assumption. In many actual cases, however, the measurement function is non-linear, as shown in Equation (7). The usual approach is to turn the nonlinear filtering problem into an approximate linear one through linearization, and then apply linear filtering theory to obtain a suboptimal filtering algorithm for the original nonlinear problem. The most commonly used linearization method is the Taylor series expansion, which yields the Extended Kalman Filter [19].
The mono camera measurement function h(X) is composed of two equations that show how the predicted state is mapped into the measurement space, as shown in Equation (6). After calculating all the partial derivatives, the resulting Jacobian matrix H_j is defined as Equation (8). After linearizing the measurement equation, both the transition and measurement equations are linear, so the standard Kalman Filter can be used to predict and update the track states in the mono camera-based tracking system. The Kalman Filter includes two steps, prediction and update, shown as Equations (10)-(16).

The prediction step is defined as Equation (10):

X̂(k+1|k) = F X̂(k|k) (10)

The state prediction covariance is Equation (11):

p(k+1|k) = F p(k|k) F^T + Q (11)

The updated state estimate is shown as Equation (12):

X̂(k+1|k+1) = X̂(k+1|k) + W(k+1) ν(k+1) (12)

W(k+1) is the filter gain defined as Equation (13):

W(k+1) = p(k+1|k) H(k+1)^T S(k+1)^{-1} (13)

ν(k+1) is called the innovation or measurement residual, defined as Equation (14):

ν(k+1) = z(k+1) − ẑ(k+1|k) (14)

S(k+1) is the measurement residual covariance following Equation (15):

S(k+1) = H(k+1) p(k+1|k) H(k+1)^T + R(k+1) (15)

Finally, the updated covariance of the state at time k+1 follows Equation (16):

p(k+1|k+1) = p(k+1|k) − W(k+1) S(k+1) W(k+1)^T (16)

In the mono camera multi-target tracking system, iterating between the prediction and update steps maintains the states of the tracked objects. This mechanism can be tuned by specifying, through the noise parameters, whether the system should rely more on the motion model assumption or on the measurements. Measurement noise is typically specified by the sensor manufacturer and is based on the physical accuracy of the sensor. Process noise is the parameter that accounts for unknown or unmodeled motion. The ratio of process noise to measurement noise determines whether the tracking system relies more on the process model or on the measurements.
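One prediction-update cycle of Equations (10)-(16) can be sketched as follows. This is a minimal numpy sketch: the measurement function h and its Jacobian are passed in as callables, and all names are illustrative:

```python
import numpy as np

def ekf_step(x, P, z, F, Q, h, H_jac, R):
    # Prediction, Equations (10)-(11).
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update, Equations (12)-(16): linearize h around the prediction.
    H = H_jac(x_pred)
    nu = z - h(x_pred)                    # innovation (14)
    S = H @ P_pred @ H.T + R              # innovation covariance (15)
    W = P_pred @ H.T @ np.linalg.inv(S)   # filter gain (13)
    x_new = x_pred + W @ nu               # updated state (12)
    P_new = P_pred - W @ S @ W.T          # updated covariance (16)
    return x_new, P_new
```

With a linear h the step reduces to the standard Kalman filter; a measurement that matches the prediction exactly leaves the predicted state unchanged while still shrinking the covariance.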

Data Association
Data association is about what is being associated with what. In the highway driving scenario, data association decides which track to update with which measurement. The data association module calculates track and measurement pairs and tells which measurement probably originated from which track. For the association, two assumptions are made: each track generates at most one measurement, and each measurement originates from at most one track. A simple approach is to update each track with the closest measurement. This paper uses the Mahalanobis distance as the decision metric; the Mahalanobis distance is defined as Equation (17):

d²(z, x) = (z − h(x))^T S^{-1} (z − h(x)) (17)

where z is the measurement, x is the predicted track state, and S is the innovation covariance.
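Equation (17) can be computed directly (a minimal numpy sketch; the function name is illustrative, and the track's predicted measurement is passed in precomputed):

```python
import numpy as np

def mahalanobis_sq(z, z_pred, S):
    # Squared Mahalanobis distance between a measurement z and a track's
    # predicted measurement z_pred, weighted by the innovation covariance S.
    nu = z - z_pred
    return float(nu.T @ np.linalg.inv(S) @ nu)
```

Unlike the Euclidean distance, this metric shrinks distances along directions where the innovation covariance is large, so uncertain tracks accept measurements from a wider region.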
To decrease the computational effort, it does not make sense to calculate the distances of very unlikely, faraway combinations. A gate, or threshold, is therefore defined on the Mahalanobis distance: for every possible association between a track and a measurement, it is first checked whether the Mahalanobis distance is smaller than the threshold; if the distance is bigger, this possible association is ignored. The gating trick is shown in Figure 7.

If a measurement lies outside a track's gate, the corresponding distance in the data association matrix is set to infinity, as shown by Equation (18):

A(t, z) = ∞, if d²(z, x_t) > γ (18)

where γ is the gating threshold. In data association, it is assumed that each track generates at most one measurement and each measurement originates from at most one track. Suppose there are N tracks and M measurements. The association matrix A is an N × M matrix that contains the Mahalanobis distance between each track and each measurement.
There are also a list of unassigned tracks and a list of unassigned measurements. This paper looks for the smallest entry in A to determine which track to update with which measurement, then deletes this row and column from A and removes the corresponding track ID and measurement ID from the lists, repeating this process until A is empty.
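The gating and greedy smallest-entry association described above can be sketched as follows (a pure-Python illustration; the function name and gate value are assumptions):

```python
import math

def associate(dist, gate):
    """Greedy nearest-neighbor association with gating.

    dist : N x M matrix of squared Mahalanobis distances (tracks x measurements)
    gate : gating threshold; larger distances are set to infinity (Eq. (18))
    Returns (pairs, unassigned_tracks, unassigned_measurements).
    """
    n, m = len(dist), len(dist[0]) if dist else 0
    A = [[d if d < gate else math.inf for d in row] for row in dist]
    tracks, meas = set(range(n)), set(range(m))
    pairs = []
    while tracks and meas:
        # Pick the globally smallest remaining entry in A.
        best = min((A[t][z], t, z) for t in tracks for z in meas)
        if math.isinf(best[0]):
            break  # only gated-out combinations remain
        _, t, z = best
        pairs.append((t, z))
        tracks.remove(t)   # delete this row ...
        meas.remove(z)     # ... and this column
    return pairs, sorted(tracks), sorted(meas)
```

Tracks and measurements left over at the end become the unassigned lists that feed the track management module.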
When the data association module receives a new set of detections from the mono camera, the tracker attempts to assign these detections to the existing tracks it maintains. The assignment has three possible outcomes: a detection is left unassigned, a detection is assigned to a track, or a track is left unassigned, as depicted in Figure 8. If the assignment gate or threshold is small, there is a chance that many detections are left unassigned, leading to the creation of many tracks. If it is too large, incorrect detection associations may happen.

Track Management Strategies
A track has a lifecycle of initialization, confirmation, update, coasting, and deletion. The mono camera-based multi-target tracking system must have its track management module. The track management strategies adopted by this paper are shown in Figure 9. A track is initialized from an unassigned detection. If the detection is not classified as anything, the track is initialized as tentative, which means that the tracker is uncertain whether the track is a false alarm or a real object. The track is confirmed when a classified detection is assigned to it or it meets the confirmation criteria set by the confirmation threshold. This is a 2-element vector [M,N], which means that tentative tracks will be confirmed if assigned at least M detections are made in a span of N time steps. If the track is A track is initialized from an unassigned detection. If the detection is not classified as anything, the track is initialized as tentative, which means that the tracker is uncertain whether the track is a false alarm or a real object. The track is confirmed when a classified detection is assigned to it or it meets the confirmation criteria set by the confirmation threshold. This is a 2-element vector [M,N], which means that tentative tracks will be confirmed if assigned at least M detections are made in a span of N time steps. If the track is left unassigned, it is coasted. The tracker then has to decide to either remove these tracks or keep them in the chance that it might be updated shortly. This coasting period is controlled by the deletion threshold. If the tracker goes above these numbers of updates and an existing track is still not updated, then that track is deleted. In this paper, 6 means that a track will be deleted if it does not receive any assignments six times steps in a row.
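The confirmation and deletion logic above can be sketched as a small state machine. This is a minimal sketch under the paper's stated thresholds ([M,N] confirmation, deletion after 6 consecutive misses); the class and attribute names are assumptions, not taken from the paper.

```python
from collections import deque

class Track:
    """Minimal track lifecycle: tentative -> confirmed -> deleted."""
    def __init__(self, m=2, n=3, delete_after=6):
        self.status = "tentative"
        self.hits = deque(maxlen=n)   # outcomes of the last N steps
        self.m = m
        self.misses = 0               # consecutive unassigned steps
        self.delete_after = delete_after

    def step(self, assigned, classified=False):
        self.hits.append(1 if assigned else 0)
        if assigned:
            self.misses = 0
            # Confirm on a classified detection, or on M hits within N steps.
            if self.status == "tentative" and (classified or sum(self.hits) >= self.m):
                self.status = "confirmed"
        else:
            self.misses += 1          # track is coasting
            if self.misses >= self.delete_after:
                self.status = "deleted"
        return self.status

t = Track(m=2, n=3)
t.step(assigned=True)        # 1 hit in the window: still tentative
t.step(assigned=True)        # 2 hits in 3 steps: confirmed
for _ in range(6):
    t.step(assigned=False)   # 6 consecutive misses: deleted
```

With these parameters, two assigned detections within three steps confirm the track, and six unassigned steps in a row delete it, matching the thresholds described in the text.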
To test and evaluate the dynamic performance of the mono camera-based tracking system, the following part of this paper designs two validation platforms: one based on a Lidar sensor and the other based on RTK. Lidar is another type of sensor used in autonomous vehicles and has the advantage of accurate position measurement, but its price is much higher than that of a mono camera. Therefore, this paper compares the camera against Lidar. Besides this, to further assess tracking accuracy, another experiment is done based on RTK.

Construction of the Performance Analysis Platform
The test and validation platform is composed of a front-view Lidar and a front-facing mono camera, as depicted in Figure 10, in which the perception result of the Lidar is the baseline used to evaluate the performance of the mono camera-based multi-object tracking system.

Evaluation of Vehicle Detector
The test and validation of the YOLO-based detector is done under the highway scenario with good weather conditions, as shown in Figure 11. This paper divides the objects on the highway into three classes: vehicle, car, and truck. The whole scene is composed of a total of 16,859 objects, of which the detector detects 15,730 successfully. The performance of the detector is shown in Table 1.
As shown in Table 1, the average precision (AP) is more than 85% for all classes except truck, for which it is less than 65%. On the precision and recall indices, the detector performs better on vehicle and car than on truck; the recall for truck is less than 68%. As trucks are very common in highway scenes, this may pose potential hazards. In future work, this paper will focus on improving the recall and precision for trucks by adding more truck data to the training data set.
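The precision and recall indices reported in Table 1 follow the standard definitions, shown below with the overall detection count from the text. The per-class false-positive counts are not given here, so the `fp` value in the example is a placeholder assumption, not a figure from the paper.

```python
def detection_metrics(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Overall recall implied by the counts in the text:
# 15,730 of 16,859 labeled objects were detected (r is about 0.933).
# fp=0 is a placeholder; Table 1 reports false positives per class.
p, r = detection_metrics(tp=15730, fp=0, fn=16859 - 15730)
```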

Evaluation of Experimental Data
As shown in Figure 12, which is the highway test scene, there are two trucks: one in front of the ego vehicle and the other in the left lane of the ego vehicle. The red rectangle on the truck is the tracking result of the Lidar and the green rectangle on the truck is the tracking result of the mono camera.
To analyze the performance of the proposed algorithm, this paper adopts the Lidar perception result as the baseline. In Figure 13, the red curve is the result of the Lidar and the green curve represents the mono camera perception output. The max longitudinal position error is no more than 5 m, and the lateral position error is no more than 0.5 m.

Further Validation
Validation Platform
The validation platform is composed of the ego car and the target car, both equipped with an RTK suite and a data transceiver suite, as shown in Figure 14. The target car transfers its kinematic information to the ego car through the data transceiver, and this information finally reaches the MCU board. The MCU board also receives information from the ego car through serial communication, processes the data from the ego car and the target car, and obtains the kinematic information of the target car in the ego car coordinate system. Finally, the result is sent out through the CAN bus. The CAN signal is shown in Figure 15.
The test scenario is shown in Figure 16. The ego car drives at constant velocity and the target car moves in front of the ego car with acceleration, deceleration, cut-in, and cut-out maneuvers.
Here, pos_x_diff is the longitudinal position difference between the mono camera and RTK; pos_y_diff is the lateral position difference between the mono camera and RTK; vel_x_diff is the longitudinal velocity difference between the mono camera and RTK; vel_y_diff is the lateral velocity difference between the mono camera and RTK.
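The difference metrics just defined reduce to per-sample subtraction of time-aligned camera and RTK state sequences, followed by simple summaries. A minimal sketch is given below; the function name, array layout, and summary statistics are assumptions chosen for illustration, not the paper's implementation.

```python
import numpy as np

def track_errors(cam, rtk):
    """Elementwise differences between time-aligned camera and RTK
    state arrays with columns [pos_x, pos_y, vel_x, vel_y]."""
    diff = np.asarray(cam, dtype=float) - np.asarray(rtk, dtype=float)
    # Summaries of the kind reported as accuracy figures.
    return {
        "max_abs": np.max(np.abs(diff), axis=0),    # worst-case error
        "rmse": np.sqrt(np.mean(diff**2, axis=0)),  # typical error
    }

# Two synthetic time steps (illustrative values only)
cam = [[30.2, 0.15, 21.9, 0.02],
       [31.1, 0.12, 22.1, 0.05]]
rtk = [[30.0, 0.10, 22.0, 0.00],
       [30.5, 0.05, 22.0, 0.00]]
stats = track_errors(cam, rtk)
# stats["max_abs"] holds the largest |camera - RTK| per channel,
# i.e. [pos_x_diff, pos_y_diff, vel_x_diff, vel_y_diff] worst cases.
```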
From Figure 17, it can be seen that the mono camera-based tracking system performs better laterally than longitudinally. From Table 2, the position error is within 1.13 m longitudinally and within 0.11 m laterally, which indicates that the mono camera system has better lateral accuracy. However, disturbances exist in both the lateral and longitudinal velocity. Besides this, when the target car maneuvers, such as accelerating, decelerating, cutting in, or cutting out, the performance declines, as shown in the red frame. The longitudinal and lateral position and velocity error frequencies can be seen in Figure 18.
For the autonomous system to have a comprehensive and accurate understanding of the surrounding dynamic environment, it is necessary to integrate more types of sensors so that they compensate for each other. In the future, we will integrate a Radar sensor into the tracking system to improve its performance. Additionally, the robustness of the tracking system under different light and weather conditions is another big problem to address in future work.

Conclusions
This paper designs a robust tracking system based on a mono camera. According to the characteristics of the mono camera, a YOLO-based detector is used to produce the bounding boxes, and a vanishing point-based depth estimator is used to estimate distance, compensating for the depth information lost when projecting from the 3D Cartesian coordinate system onto the 2D image plane. To track vehicles on the highway, a constant velocity model is used, and the Extended Kalman filter is applied to deal with the mono camera's nonlinear measurement problem. Nearest neighbor with a gating trick is adopted to handle the data association problem. Besides these, a track management strategy is proposed to initialize, maintain, and delete tracks. Finally, to evaluate the mono tracking system, a Lidar-based ground truth method is proposed. The research results in this paper provide a good and beneficial reference for autonomous driving systems based on a mono camera.