Vehicle Trajectory Estimation Based on Fusion of Visual Motion Features and Deep Learning

Driver situation awareness is critical for safety. In this paper, we propose a fast, accurate method for obtaining real-time situation awareness using a single type of sensor: monocular cameras. The system tracks the host vehicle’s trajectory using sparse optical flow and tracks vehicles in the surrounding environment using convolutional neural networks. Optical flow is used to measure the linear and angular velocity of the host vehicle. The convolutional neural networks are used to measure target vehicles’ positions relative to the host vehicle using image-based detections. Finally, the system fuses host and target vehicle trajectories in the world coordinate system using the velocity of the host vehicle and the target vehicles’ relative positions with the aid of an Extended Kalman Filter (EKF). We implement and test our model quantitatively in simulation and qualitatively on real-world test video. The results show that the algorithm is superior to state-of-the-art sequential state estimation methods such as visual SLAM in performing accurate global localization and trajectory estimation for host and target vehicles.


Introduction
Intelligent Driver Assistance Systems (IDASs) must predict and update a model of the scene around the vehicle in order to anticipate possible future collisions. This requires localization and trajectory estimation, in which we must turn sensor observations into a record of a moving vehicle's position with respect to the surroundings.
In this paper, we propose a collection of new methods that together enable accurate localization and trajectory estimation for a host vehicle equipped with monocular cameras and target vehicles that are visible around the host vehicle. Localization refers to the sequential state prediction techniques used to turn sensor observations into a record of a moving vehicle's position with respect to the surroundings. Most approaches to the localization problem combine odometry information with observations of landmarks; examples include SLAM, visual tracking, and 3D reconstruction (structure from motion). The Extended Kalman Filter, particle filter, and grid-based methods are popular state estimation methods for localization. Our primary interest is to combine visual odometry techniques with image-based object detection and target vehicle trajectory estimation, all in a common reference frame, in order to give a driver or an autonomous vehicle accurate situation awareness.
Until recently, image-based object detection was insufficiently accurate and too resource intensive for use in a practical IDAS. However, more recently, the field has greatly improved accuracy and runtime performance, primarily due to better feature extraction, large-scale machine learning models, and low cost high performance embedded Graphical Processing Units (GPU). Early methods for object detection using feature engineering and classical machine learning techniques [1][2][3] have been supplanted by Deep Neural Networks (DNNs) [4][5][6]. With fast, accurate methods such as YOLO [7] now available, it is finally possible to include sophisticated object detectors in an IDAS. However, although we can adopt YOLO for its vehicle detection capabilities, an object detector cannot alone precisely localize a detected vehicle in 3D. We therefore require new algorithms for precise 3D localization of vehicles detected in 2D by YOLO or comparable 2D object detectors.
To accomplish these goals, we propose a new method for situation awareness in an IDAS that tracks the host vehicle's velocity and the relative positions of target vehicles around the host vehicle. The overall framework is shown schematically in Figure 1. The algorithm performs these localization and target vehicle trajectory estimation tasks in a common reference frame using the noisy but simple and low cost monocular camera as the only sensor. The contributions are as follows: We introduce a new method for estimation of the host vehicle's linear and angular velocity that uses optical flow and RANSAC. 2.
We introduce a new method for estimating the instantaneous positions of any visible vehicles (target vehicles) relative to the host vehicle using YOLO, a camera calibration model, and a nonlinear optimization procedure.

3.
We introduce a new solution to the multiple target tracking (MTT) problem under conditions in which target vehicles switch from one camera view to another. 4.
We introduce a sensor fusion method utilizing an extended Kalman filter with novel system and sensor models.

Relative Position 0ptimization
(θ,s)  (3). In parallel, it predicts the relative positions between the host vehicle and any visible target vehicles (2). The method also solves the problem of MTT under conditions in which target vehicles switch from one camera view to another (4). The last step of the method performs sensor fusion using an extended Kalman filter utilizing novel system and sensor models (5). The resulting trajectories are fused into a common world coordinate system (6), then visualized for situation awareness. Note: the data in part (6) are from Experiment II.

Related Work
For localization of a host vehicle under mobility in IDASs, Schubert et al. [8] compare different motion models and find that a method assuming a constant yaw rate and acceleration gives the best results. Berthelot et al. [9] and Tamke et al. [10] implement a model for vehicle trajectory estimation. They find that estimated trajectories are precise if the vehicle has movement consistent with the motion estimation model. Huang et al. [11] explore methodologies for vehicle trajectory estimation that rely on differential global positioning systems. Liu et al. [12] describe a trajectory prediction approach that relies on driving behavior prediction and classification using hidden Markov models. Sorstedt et al. [13] consider driver control input parameters to obtain better estimates.
For localization of the host vehicle as well as surrounding vehicles in IDASs, Schreier et al. [14] describe an integrated approach that predicts trajectories based on a maneuver estimation model. Driving maneuvers are inferred for each vehicle with a Bayesian network. Ammoun et al. [15] explore a collision risk estimation model that predicts trajectories of surrounding vehicles.
As input to an IDAS, various sensors could be considered. Morris et al. [16] use LIDAR, which is probably the most useful and reliable sensor for intelligent vehicles, in their vehicle tracking and trajectory estimation system. Dickmann et al. [17] and Clarke et al. [18] use radar in their autonomous driving system and driver assistance systems to model the environment around the vehicle.
For target tracking, those methods can be categorized as either Single Target Tracking (STT) or Multiple Target Tracking (MTT) methods. The simplest and probably the most frequently considered tracking problem is STT. MTT requires locating target positions, maintaining target identities, and generating target trajectories given an input video. Basit et al. [19] employ CAMSHIFT for tracking of a single target for trajectory estimation. Wojke et al. [20] propose deepSORT, an approach rely on the tracking-by-detection paradigm. DeepSORT, when combined with accurate detectors like YOLO, has recently achieved MTT performance that is sufficient for an IDAS.
Additionally, several studies in recent years have analyzed potential applications of trajectory analysis and optimization to improve traffic flow in different scenarios. One example is prediction and planning of Connected Automated Vehicle (CAV) arrivals at intersections followed by optimization of those CAVs' trajectories through the intersection by an intersection controller [21,22].
Several authors have considered localization of a host vehicle under mobility in IDASs [8][9][10][11][12][13]. These methods require integration of information from multiple sensors. We are interested in how to bypass this limitation by using just a single sensor, the monocular camera, to track the host vehicle's velocity, and the relative positions of each target vehicle surrounding the host vehicle based on camera data alone.
Various sensors are useful for an IDAS, such as LIDAR [16] and radar [17,18]. Cameras provide better information about object identity than LIDAR or radar, but position accuracy is a challenge.
Some researchers have explored traffic forecasting issues. The goal of traffic forecasting is to predict traffic conditions (e.g., traffic speed, flow, and type of traffic) network-wide based on feeds from real-time traffic sensors, considering spatial and temporal correlations to provide accurate real-time predictions that could be used to guide automated vehicles in real time or city planners over longer terms [23,24]. The work described in this paper, besides informing drivers of the current real-time situation, could also inform centralized real-time traffic forecasting systems.
Recent monocular Visual Odometry (VO) tracking methods [25][26][27][28][29][30] use deep learning models with a monocular front-facing camera to estimate odometery from visual motion. The limitation of these deep learning methods in performing the VO task is that their knowledge of VO is embedded implicitly in the deep learning model, which as a black box, lacks the explainability of the explicit knowledge represented by the mathematical relationship between optical flow and vehicle motion. Critically, none of this work attempts to track target vehicles along with ego motion, the main focus of our work.
Once objects around the host vehicle are detected, they must be tracked. IDASs require Multiple Target Tracking (MTT). Early methods used classical computer vision techniques [19,31]. DeepSORT has recently achieved MTT performance that is sufficient for an IDAS. We shall see that DeepSORT's resource utilization is currently prohibitively high for embedded systems usable in an IDAS, and it is unable to process multiple cameras or handle targets that switch between cameras in real time. As our focus is on vehicle trajectory estimation for vehicles traveling on city streets alongside multiple other vehicles using MTT, we propose a new algorithm for tracking the 3D positions of vehicles based on multiple streams of 2D detections that also handles camera switching.
We now describe the proposed algorithm for estimation of host and target vehicle trajectories then present experimental results and conclude the paper.

Proposed Method
Our method includes four steps: (1) camera calibration, (2) video processing with optical flow to obtain estimates of linear and angular velocity for the host vehicle, (3) target vehicle detection, relative position estimation, and tracking, and (4) fused vehicle trajectory estimation.

Camera Calibration
Points (X, Y, Z) T in camera coordinates are mapped to points ( f x X/Z + u x , f y Y/Z + v y , 1) T on the image plane according to a linear mapping in homogeneous coordinates We estimate camera parameters f x , f y , v x , and v y along with radial distortion parameters k 1 , k 2 , k 3 and tangential distortion parameters p 1 and p 2 through a standard procedure with a checkerboard. The radial distortion is relatively strong in our lenses, while the tangential distortion is less prevalent. Adding the rotation R and translation of the center of the camera C by −RC, we obtain where X is an arbitrary 3D point in the vehicle coordinate frame. In order to obtain R and t, we perform an on-vehicle calibration. Using eight manually identified points on the ground and eight corresponding points in the image, R and t are estimated relative to the vehicle frame via Levenberg Marquardt nonlinear least squares minimization of the projection error as implemented by OpenCV's SolvePnP routine. This gives us a projection matrix P for each camera.

Linear and Angular Velocity Computation from Optical Flow
We use the Lucas-Kanade optical flow algorithm, which is efficient but generates false matches (outliers). This means mismatched points should be removed prior to motion estimation. We use a variant of the random sample consensus (RANSAC) algorithm [32] to remove outliers. Algorithm 1 repeatedly computes optical flow, linear velocityṡ, and angular velocityθ, accounting for outliers and noisy flows. The method can precisely calculate camera motion without a scale ambiguity, as accurate 3D positions of ground points are known when P is known. RANSAC attempts to find the largest consensus set. If a majority of the points are on the ground, the largest consensus set will necessarily include those points, as their motion will be consistent with just one correct homography that can be calculated from any inlier sample. Neither points off the ground nor incorrect correspondences will have motion consistent with ground plane motion, so they will not be included in the consensus set. To ensure that a majority of optical flows are indeed on the ground, we use ROIs that include mostly ground points in the majority of driving situations see Figure 7 for the RoIs). We estimate host vehicle velocities using optical flows with outliers assumed to be for ground points, so the ROIs should include mostly ground points in the majority of driving situations. On each iteration, we sample two corresponding pairs then compute R and t using the SVD method that follows. In order to compute the rotational and linear velocity of the host vehicle from optical flow, two sets of corresponding inlier points, namely P G and Q G , are acquired by steps (b) − (e), from frame I i and I i+1 , separately. Let P G = {p 1 , p 2 , ..., p n } and Q G = {q 1 , q 2 , ..., q n } be two sets of corresponding points on the plane. The motion model of the host vehicle is shown in Figure 2. In order to acquire rotation matrix R 2×2 and translation vector t 2×1 that optimally align the two sets P G and Q G , corresponding points p i and q i (i ∈ 1, 2, . . . n) should be approximately related by rigid planar motion, i.e., The optimization problem can be considered as where X is the set Q G arranged as a 2 × n matrix and Y is the set P G arranged also as a 2 × n matrix. We solve this optimization problem using the SVD, UΣV T = XY T , R = VU T .

Algorithm 1: Velocity from Optical Flow.
Require: Video V 1 , V 2 , and V 3 from right, left, and back cameras with regions of interest R 1 , R 2 , and R 3 . Ensure:θ (angular velocity),ṡ (linear velocity). Procedure: 1. Let t = 1 and acquire the first frame I i 1 of V i in gray scale, i = 1, 2, 3; 2. For each subsequent frame I i t+1 of input video, i = 1, 2, 3: (a) Acquire frame I i t+1 in gray scale. (b) Detect sub-pixel accurate corners in R i for frame I i t to obtain feature set Q i I . (c) Calculate optical flow for I i t , Q i I , and I i t+1 to obtain corresponding sparse feature set P i I using Lucas-Kanade. (d) Remove keypoints without correspondences from P i I and Q i I . (e) Project P i I and Q i I from the image plane to the ground plane to obtain P i G and Q i G using the camera calibration information (P matrix) determined according to the method described in Section 3.1. (f) Combine point set P i G and Q i G , i = 1, 2, 3, separately to obtain ground point sets P G and Q G . (g) Remove outliers from P G and Q G using RANSAC to obtain consistent ground plane point sets P G and Q G . (h) Compute rotation matrix R and translation vector t using P G and Q G .  [t x , t y ] T is the origin of the vehicle coordinate system at time t + 1 represented in the vehicle coordinate system at time t.
After acquiring R and t in step (h), the linear and angular velocity of the vehicle over the interval can be calculated simply aṡ where r 11 and r 12 are the first two elements of the first row of R. r = t y sin θ is the turning radius. Note that in addition to arc motions, for the particular case of straight motion of the vehicle, θ = 0, and the linear velocity can be computed more simply asṡ = t y ∆t , where t y is the second element of t.

Object Detection and Relative Position Estimation Based On Deep Learning
The goal here is to achieve accurate target vehicle detection and relative position estimation by fusion of CNN-based detection, camera calibration, and optimization. A complete trajectory can be obtained from the dynamic sequence of target vehicle observations over time.

Target Vehicle Detection
In order to perform target vehicle detection in real time, we use the YOLOv3 CNN detection model for detection and classification. YOLOv3 [33] is based on the Darknet 53 CNN and has 106 layers.

3D Backprojection to Obtain Relative Position
We define two planes, the image plane and the ground plane, and compute a projective transformation from one plane to the other plane. The homography from the image plane to the ground plane is obtained from the projective camera matrix P by removing the third column of P. The process implicitly takes into account the camera's height, orientation, and position relative to the vehicle frame, as the ground points are specified in the vehicle frame in the on-vehicle calibration process. The relationship can be written where λ is an arbitrary scale factor. We also utilize the reverse mapping To obtain the relative position of a target, we first guess the relative position of the target based on Equation (8) with the backprojection of the bottom center [u, v, 1] T of the bounding box detected by YOLOv3. As the actual position of the target vehicle relative to the backprojection of the bottom center of the bounding box is variable depending on position and orientation, we use the backprojection as an initial guess, but then we refine the guess via optimization. We use Levenberg Marquardt (LM) nonlinear least squares for this purpose. Suppose that the width, length, and height of the target are w, l, and h, respectively. Let the relative position of the target (x, y) be the center of the target cuboid on the ground plane. The eight corners of the cuboid are: We project these eight points in 3D to the 2D image. We then pick the four corners of the minimal bounding box enclosing the eight points. The objective function to minimize is the difference between the predicted bounding and the detected bounding boxes in image coordinates, more precisely, where z contains the u, v coordinates of the four corners of the bounding box detected by YOLOv3. f (x, y) reprojects the 3D points to 2D points in the image and calculates the minimum enclosing bounding box.

Visual Tracking and Camera Switch Processing
Target detection and target observation were discussed in Section 3.3. In this section, we describe how the tracker maintains vehicle identity and handles camera switches for targets.

Visual Track Handling
The tracking method integrates the predicted statex t , the detected bounding box a i for each detection i, and the predicted bounding box b j for each tracked target j. In the system state, estimates of the host vehicle and target vehicles' positions and velocities are recorded. In order to determine whether to match YOLO detection a j with predicted bounding box b j calculated from the system state for target a j , we compute the IoU (Intersection over Union) cost for the two bounding boxes: where A(a) gives the set of pixels included in bounding box a. We repeat the calculation for all a i , i ∈ 1 . . . n and b j , j ∈ 1 . . . m. We use the Hungarian algorithm to find the association minimizing the total cost. For predicted targets that are matched, we save the new detection's bounding box with the correct track data structure. For unmatched boxes, we create new track data structures and associated Kalman filter state variables. The Kalman filter will on subsequent observations update the target vehicle states using the predicted bounding boxes. Tracks for vehicles unmatched for t d frames are deleted from the Kalman filter's state vector and covariance matrix. We found empirically that when a vehicle is missed for seven or more frames, it is rarely recovered by the tracker. We therefore set t d = 7.

Camera Switches
To ensure consistent detection of all nearby vehicles, the system should be designed with overlapping fields of view between cameras. As target vehicles in overlapping fields of view may be detected in multiple cameras, we ensure that the tracker only uses one of the three cameras, based on a heuristic evaluation of which camera should be used to track. The heuristic is the size of the detected bounding box in each camera. Each target vehicle is tracked in the camera for which its predicted bounding box is largest in the image. An example is shown in Figure 3. Example image set I t , t > t containing the same target in the backward and right cameras. The area of the bounding box in right camera is larger than that in the backward camera. At that point, tracking switches from the backward to the right camera. (C) The target has disappeared in the backward camera but is tracked in the right camera.

Vehicle Trajectory Estimation
We employ an extended Kalman filter to fuse the host vehicle's linear and angular velocity with measured relative positions of target vehicles.

Vehicle State
The vehicle state x t describes the host vehicle's instantaneous position, linear velocity, and angular velocity, as well as target vehicles' positions and velocities in the world coordinate system: where x t and x t+1 are the state at times t and t + 1, and v t ∼ N (0, Q t ) is a Gaussian random vector modeling the randomness in the state transition. f(·) is the system transition function.
To define f(·),θ t is the rotational velocity around the z axis in the host vehicle coordinate system. The orientation θ t+1 in the world coordinate system at time t + 1 must therefore be The actual steering angular and linear velocity are not measured, so we assume any change in linear or rotational velocity to be noise. The host vehicle velocity parameters can thus be expressed asθ With finite r, the vehicle's displacement in the vehicle coordinate system is described by For the particular case of straight movement, we can obtain The host vehicle's position at time t + 1 can be described as For the other vehicles, we assume linear dynamics To approximate nonlinear function f(·), we linearize around an arbitrary pointx t , i.e., Here J f (x t ) is the Jacobian evaluated atx t .

Observation Model
We incorporate optical flow tracking of points on the ground and the deep learning model capable of producing a prediction of the position of neighboring vehicles' projections into the image plane at time t to obtain estimated observations z t = θ t ,ṡ t , x 1 t , y 1 t , . . . , x n t , y n t , i = 0, 1, . . . n, whereθ t , andṡ t are the measured angular and linear velocity of the host vehicle, respectively. [x i t , y i t ] are the measured relative offsets between the host vehicle and vehicle i. We define the observation with an equation h(·) mapping the vehicle state x t to the corresponding observation z t : where w t ∼ N (0, S t ).
The homogeneous representation of target i's bottom center [x v t , y v t , 1] T in the vehicle coordinate system, assuming a flat ground plane, is calculated as Here the rigid transformation T w/v t maps from the world coordinate system to the vehicle coordinate system at time t. The calculation can be repeated for each vehicle i ∈ 1, . . . , n. In detail, if we obtain a rotation matrix R t from the vehicle's orientation θ h t at time t, the transformation matrix Similar to the transition model f(·), we linearize h(x t ) around arbitrary pointx t using the Jacobian evaluated atx t :

Initialization
To initialize the system state, we assume the host vehicle's initial trajectory is accurately represented by the optical flow between two initial frames. We calculate the angular velocity and linear velocity as discussed in Section 3.2. We construct the first observation z 0 without including other vehicles in the system state: The initial state is obtained from z 0 , considering the position of the host vehicle to be the origin at time t = 0. The initial state vector is thus Whenever new vehicles are detected, the observation z t includes the optimized position of the vehicle relative to the host vehicle. The world position and orientation of the host vehicle are initialized to zero, while the velocities of the host vehicle are initialized from z t at time t, written as z t = θ t ,ŝ t ,x 1 t ,ŷ 1 t , . . . ,x n t ,ŷ n t , n = 0, 1, . . . n.
We likewise augment the state estimate x t with new entries for the new vehicle's position and velocity. We initialize the velocity of each target vehicle using the first two frames in which it is detected.

Noise Parameters
For the system and sensor noise, the Kalman filter imposes the assumption of Gaussian noise. We parameterize these noise distributions assuming reasonable correlations between the system state and the noise. We assume that the observation noise is a linear function of linear velocity, turning angle rate, and the other vehicles' relative positions in the vehicle coordinate system. We assume R t is diagonal. We let the entries of R t corresponding to the angular and linear velocity be (α 1 + α 2θt ) 2 and (α 3 + α 4ŝt ) 2 . We let the entries of R t corresponding to the relative position be (α 5 + α 6x i t ) 2 and (α 7 + α 8ŷ i t ) 2 . We also suppose for simplicity that Q t , the state transition noise covariance, is diagonal. For the host vehicle, the Q t elements corresponding to the positions are ∆ 2 t (η 1ŝ + η 2 ) and ∆ 2 t (η 1ŝ + η 2 ). The Q t element corresponding to the rotation is ∆ 2 t (η 3θ + η 4 ). The Q t element corresponding to the angular velocity is ∆ 2 t (η 5θ + η 6 ). The Q t element corresponding to the linear velocity is ∆ 2 t (η 7ṡ h t + η 8 ). For the other vehicles, the Q t elements corresponding to the position are ∆ 2 t (η 9ẋt + η 10 ) and ∆ 2 t (η 11ẏt + η 12 ). The Q t elements corresponding to the velocity of target vehicles are ∆ 2 t (η 13ẋt + η 14 ) and ∆ 2 t (η 13ẏt + η 14 ). These noise distributions are simplistic and ignore many factors, but they suffice for the experiments in this paper. The parameters' values are determined through a combination of analytical and empirical methods. We begin with reasonable values based on analysis of the underlying noise source, then we validate the parameters first in simulation and then in the real world.
P 0 denotes the initial state error. We propagate the observation error for z 0 through h(x 0 ) and take into account the initial uncertainty about the optical flow.

Update Algorithm
The system state model and observation model are described in Section 3.5, Equations (11) and (21). The main difference with an ordinary Kalman filter is that when updating the system state, the prediction method of the Extended Kalman Filter can be propagated without an observation correction to deal with cases in which optical flow fails due to key point mismatches or features undetected in the field of view. Under such conditions, we assume the observation z t is not available, so we estimate the system state and allow propagation (diffusion) of the state covariance without an observation update. This means only the prediction step is implemented, without a correction, when optical flow is unavailable.

Results and Discussion
We evaluate the proposed model with five experiments. All simulations were implemented in Python with synthetic data. First, we perform a simulation to estimate the robustness of the proposed method for linear and angular velocity estimation for the host vehicle. The simulation allows analysis with different noise levels and different outlier ratios for synthetic optical flow. Our method is compared with a state of the art visual odometry method, ORB-SLAM. Second, we perform a simulation to quantitatively evaluate the accuracy of trajectory estimation for the host vehicle and target vehicles. Third, we evaluate the target vehicle tracker module under real-world motion between cameras. This experiment provides a quantitative assessment of the system's ability to handle camera switches for target vehicles. Fourth, we perform experiments that combine optical flow, YOLOv3, Hungarian association, and the Extended Kalman Filter for estimating the trajectories of the host vehicle and target vehicles in the real world. Since there is no ground truth, the evaluation is qualitative, but integrated real-world runs do indicate the effectiveness of the method. Finally, we compare estimated trajectories generated by the proposed method with those of ORB-SLAM. This provides evidence of the proposed method's robustness compared to the predominant alternative method for ego-motion estimation based on monocular cameras. All of the experiments were implemented using C++, the OpenCV library, and Python. We tested the runtime performance of the proposed model on a 2.50 GHz × 4 Intel Core i5 laptop running 64-bit Ubuntu Linux with an NVIDIA GeForce MX150 as well as on an NVIDIA Jetson TX2.
In order to evaluate the proposed method, we prepared two video datasets for Experiment V. We mounted three cameras, one oriented backwards at the top of the rear windshield and two cameras oriented at approximately 45 • from the side-view mirrors. The first dataset was collected on a long, nearly straight road in a college campus. We recorded from three cameras concurrently during driving time of approximately five minutes. Thirty six target vehicles appeared in the field of view of one or more cameras during the run. The second dataset was collected on a winding road comprising a straight segment of 100 m, a left turn of 150 m, a straight segment of 150 m, and then a turn right of 100 m. Twenty two target vehicles appeared in the field of view of one or more cameras during this run. Although there were no sudden sharp turns, the winding curves were sharp enough to disrupt monocular visual SLAM but did not disrupt optical flow.

Experiment I (Velocity Estimate Comparison Using Simulation Data)
We conducted a simulation in which we compared the proposed method's estimates of host vehicle velocity to those of visual SLAM under various noise levels and outlier ratios with synthetic optical flow feature point generation.

Results
The experimental conditions and outcomes we obtained are shown in Table 1. In each simulation, we specify a hard-coded ground truth trajectory, then we use that trajectory to synthesize optical flows for random points on the ground plane. We give ORB-SLAM sufficiently clean initialization data that it can estimate an exact initial homography, thereby addressing monocular SLAM's inherent scale ambiguity. We add Gaussian noise and outliers to the synthetic data in the synthetic image plane. Since we generally observe real-world outlier ratios between 6% and 69%, in the simulation experiments, we perform experiments with two levels of outlier ratio: 40% and 70%. We add Gaussian random noise to each image point with σ = 1.0, 2.0, or 3.0.

Discussion
The results show that for Gaussian random noise with σ = 1.0 and σ = 2.0, our proposed method is better than visual SLAM. When the noise levels are more substantial (σ = 3.0), visual SLAM is better than the proposed method.

Experiment II (Trajectory Estimation in Simulation)
Besides instantaneous velocity estimation, to verify the correctness and effectiveness of the overall trajectory estimation approach, we evaluated predicted trajectories quantitatively in four simulations.

Results
The first two simulations were performed without target vehicles, and the last two simulations included target vehicles. A detailed list of experimental conditions and outcomes is shown in Table 2.
In each simulation, we generated ground truth and noisy observation data for the host vehicle and target vehicles. For the host vehicle, the data consist of synthetic optical flow for the scene surrounding the host vehicle, from which the method must estimate odometry (angular and linear velocity). For the target vehicles, the data are exact (run 3) or noisy (run 4) observations of target vehicles' simulated relative positions assuming appropriate velocities. During the simulation, target vehicles are assumed to be correctly detected inside each camera's field of view until they exit the camera view. Table 2. Experiment II (trajectory estimation in simulation) results. RMSE pg compares predicted values to ground truth, RMSE mg compares raw observations to ground truth.

Host Vehicle Target Vehicle(s) Performance
Noise: 0 Angular velocity:θ = −0.01 rad/s. In each simulation, to model the sensor observations, we first project the synthetic data into the image plane. We use 10% outliers in the optical flow. We add Gaussian random noise to each optical flow point in each image with σ = 2.0 and use the noisy data to compute the velocity of the host vehicle. We reproject image points to the ground plane to computeθ andṡ according to the method described in Section 3. In simulation 3 and 4, we add three target vehicles with three different speeds to the simulation. The speed of the target vehicle on the left is slower than that of the host vehicle. The speed of the target vehicle on the right is faster than the host vehicle. The speed of the target vehicle behind the host vehicle matches the host vehicle. The estimated host vehicle velocities in the first two simulations are shown in Figure 4. The estimated host vehicle and target vehicle trajectories over the fourth simulation are shown in Figure 5. The estimated relative positions of the three target vehicles over the fourth simulation are shown in Figure 6.

Discussion
This experiment allows us to compare the performance of sensor-only estimation and model-based correction under the same conditions, with the simulation of optical flows proceeding the same as in Experiment I, except that we also simulate ground truth trajectories and noisy YOLO detections for a set of target vehicles. Clearly, the estimated relative positions are substantially smoother than the raw observations.

Experiment III (Visual Tracking Evaluation)
In this experiment, We tested the target vehicle tracker in a real world environment with vehicles moving between camera views. As a point of comparison, one of the best known state-of-the-art multiple object tracking methods is DeepSORT. To determine the relative accuracy and resource utilization of our method compared to the state of the art, we compare the proposed model to DeepSORT in terms of accuracy and frame rate, on both an unconstrained system and a resource-constrained system.

Results
Results are shown in Table 3 and Figure 7. Table 3 lists the results of three separate runs on different roads. Figure 7 shows image sets and tracking results for one run. In the run shown in Figure 7 on the first frame, two vehicles are detected in the backward camera, and tracking begins. Target No. 1 appears in the three cameras, one after another.

Discussion
On the unconstrained system, an Intel i5 with GPU, DeepSORT's tracking accuracy is almost identical to our method's accuracy, as they are both tracking by detection methods using the same detector (YOLOv3). DeepSORT has a nearly negligible advantage in ID switches, as its deep association metric is sometimes able to maintain track identity when multiple vehicles cause a tracking error in our system. However, the processing speed of the proposed model is substantially faster than DeepSORT on the Intel system. The mean speed of our proposed method is approximately 30 fps, whereas the speed of DeepSORT is about 21 fps. On the NVIDIA Jetson TX2, which we set up to concurrently process streams from three cameras, the processing speed for our method was approximately 5 fps for each of three concurrent video streams (15 fps total). But with DeepSORT as the tracker, we found that the system cannot run alongside the detector and optical flow modules concurrently, due to memory constraints. An additional limitation is that DeepSORT cannot deal with target vehicle switching between cameras. In contrast, the proposed system explicitly deals with parallel processing of three cameras with overlapping views and seamlessly processes target switches between cameras. We conclude that our tracker provides near state-of-the-art accuracy with far less resource utilization than DeepSORT. This target vehicle was tracked in the back-side camera. Row (C): example images acquired as the vehicle moves from the back-side view to the right-side camera view. This vehicle was detected in frame 259 in the right-side view. Switching to the right-side camera is successful. Row (D): example image set acquired after the target has exited the back-side view. This vehicle continues to be tracked by the right-side camera. Row (E): example image set in which the target has exited the field of view in the right camera. This vehicle was mis-detected in the backward camera at frames 234, 235, and 236. It was then re-detected and tracked successfully.

Experiment IV (Velocity and Trajectory Estimation in Real World)
In comparisons between estimated target vehicle position and ground truth, we find that the maximum error in relative position estimation is 5.3%. From these results, we conclude that the accuracy of the camera calibration and the relative localization method based on camera calibration is sufficient for our purposes.

Results
As it is difficult to obtain ground truth data under real world conditions, we perform a qualitative evaluation of the integrated system's ability to predict the host vehicle's velocity and host and target vehicle trajectories in real scenes. In Figures 8 and 9, we show estimated velocities and trajectories for the host vehicle and some example targets in the world coordinate system.  In each run, the host vehicle starts parked. We record this initial position as (0, 0), and we compute the velocity of the host vehicle starting from the second frame of optical flow. Target vehicle positions are initialized from the second frame they are detected in. We use YOLOv3 to detect target vehicles, we use the camera calibration model to measure relative position, and we use the Hungarian algorithm to associate detections with tracks.

Discussion
The proposed approach provides less noisy, more smooth, and more stable estimates of the vehicle's trajectory than the raw observations. During tracking, some targets are missed temporarily. This is due to partial occlusion. In order to deal with this problem, we perform system state prediction without an update computation in the Kalman filter.

Experiment V (Comparison between Proposed Method and Visual-SLAM)
To compare the performance of the proposed approach with a state-of-the-art monocular visual odometry method, we carried out two tests.

Results
We compare the results of our model with a ROS-based monocular visual SLAM method, ORB-SLAM [34]. For the ground truth, we manually traced the host vehicle's trajectories on satellite maps. We qualitatively compare results from the proposed method to those of ORB-SLAM. ORB-SLAM generates a sparse feature point cloud as the map then localizes the camera within that map for each keyframe. The sequence of keyframe estimates give the estimated vehicle trajectory. The results of the experiment are displayed in Figure 10.  Orange lines indicate estimated trajectories of the host vehicle. ORB-SLAM was not able to detect features in frames 2790-3245. When ORB-SLAM relocalizes the vehicle, it estimates an incorrect trajectory for the host vehicle. The proposed model, on the other hand, is able to maintain a much more accurate trajectory. Row 2 shows results for a winding road. In this condition, ORB-SLAM works for only a short period of time. After that, it falls and cannot relocalize. The proposed model maintains a smooth trajectory until the host vehicle parks. We conclude that the proposed model for trajectory estimation of the host vehicle is better than ORB-SLAM in outdoor road environments.

Discussion
They show that our approach is more robust and provides a good approximation to the vehicle's trajectory in real world driving situations. ORB-SLAM, on the other hand, is only able to generate an estimated trajectory for the host vehicle, not for target vehicles.

Conclusions
This paper presents a new method for estimating vehicle trajectories from video sequences captured by moving cameras without additional sensors. The host vehicle's instantaneous velocity is estimated using optical flow with RANSAC. Target vehicles in the frame are detected by a deep learning model. Relative positions of target vehicles are obtained using a perspective transformation and a new optimization method. We use an extended Kalman filter to track host vehicle linear velocity, host vehicle angular velocity, and relative positions of target vehicles, resulting in precise host and target vehicle trajectory estimates in a common world coordinate system. In a series of experiments, we show that the new method substantially reduces the trajectory errors, relative position estimation errors, linear velocity errors, and angular velocity errors inherent in the use of a noisy sensor. The new method also compares favorably against state of the art visual odometry and tracking methods in terms of accuracy and resource utilization. We conclude that the method is an extremely good candidate for the next phase of commercial exploitation, given sufficient compute power.
The main limitation of the method is the assumption that the host vehicle and all target vehicles move on a common flat ground plane. In the current version of the system, when this assumption is violated, such as when the host vehicle goes over a speed bump, the estimated velocity of the host vehicle will be erroneous, and estimated relative positions of the vehicle will vary from the ground truth. In future work, we plan to eliminate this limitation, further test the algorithm on a vehicle-based embedded camera system, experiment with more sophisticated trackers, and develop a reasonably priced consumer grade real time display for driving situation awareness.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: