Vehicle Spatial Distribution and 3D Trajectory Extraction Algorithm in a Cross-Camera Traffic Scene

Three-dimensional vehicle trajectory data are of great practical value for traffic behavior analysis. To address the narrow field of view of single-camera scenes and the lack of continuous 3D trajectories in current cross-camera trajectory extraction methods, we propose a vehicle spatial distribution and 3D trajectory extraction algorithm. First, a panoramic image of a road with spatial information is generated based on camera calibration, which is used to convert cross-camera perspectives into 3D physical space. Then, we choose YOLOv4 to obtain 2D bounding boxes of vehicles in cross-camera scenes. Based on the above information, 3D bounding boxes are built around vehicles under geometric constraints and used to obtain the projection centroids of vehicles. Finally, by calculating the spatial distribution of the projection centroids in the panoramic image, the 3D trajectories of vehicles are extracted. The experimental results indicate that our algorithm can effectively complete vehicle spatial distribution and 3D trajectory extraction in various traffic scenes and outperforms the comparison algorithms.


Introduction
Vehicle spatial distribution and 3D trajectory extraction is an important sub-task in the field of computer vision. With the development of intelligent transportation systems (ITS), a large amount of vehicle trajectory data reflecting movements is obtained through traffic surveillance videos. These data can be used for traffic behavior analysis [1,2], such as detecting speeding and lane changes, and for the calculation and prediction of traffic flow parameters (volume, density, etc.) [3,4,5]. Based on these data, traffic state estimation [6,7] and traffic management and control [8] can be conducted, which play a key role in ensuring traffic efficiency and are of great research significance and practical value.
In current applications, trajectories mainly refer to two-dimensional trajectories in the image space, which do not contain spatial information about vehicles in the real world. Compared with 2D trajectories, 3D trajectories carry one more dimension of spatial information, which offers clear advantages in practice: they can be further applied to traffic accident scene reconstruction and responsibility identification [9], as well as to collision-avoiding vehicle path planning [10] in autonomous driving and cooperative vehicle infrastructure systems (CVIS).
Currently, the most commonly used methods for obtaining 3D vehicle trajectories are based on object detection and feature point methods [11,12,13], which have been maturely applied in single-camera scenes. With the development of deep convolutional neural networks (DCNNs), several excellent object detection networks [14,15,16,17,18] have emerged, which greatly improve the accuracy and speed of object detection compared with traditional feature extraction and classifier methods [19]. Based on object detection, feature points are extracted for vehicles to obtain 3D trajectories in the world space. The main contributions of this paper are as follows:

• A method of road space fusion in cross-camera scenes based on camera calibration is proposed to generate a road panoramic image with physical information, which is used to convert cross-camera perspectives into 3D physical space.
• A method of 3D vehicle detection based on geometric constraints is proposed to accurately obtain the projection centroids of vehicles, which are used to describe the vehicle spatial distribution in the panoramic image and to extract 3D vehicle trajectories.
The rest of this paper is organized as follows. The proposed algorithm for vehicle spatial distribution and 3D trajectory extraction is illustrated in Section 2. Experimental results and comparison experiments are presented in Section 3. Conclusions and future work are given in Section 4.

Framework
The overall flow chart of the proposed algorithm is shown in Figure 1. First, a panoramic image of the road with spatial information is generated based on camera calibration, which is used to convert cross-camera perspectives into 3D physical space. Secondly, 3D bounding boxes are constructed by geometric constraints and used to obtain the projection centroids of vehicles. Finally, the 3D trajectories of vehicles are extracted by calculating the spatial distribution of the projection centroids in the road panoramic image.

Camera Calibration Model and Parameter Calculation
To complete road space fusion in cross-camera scenes, the relationship between the 2D image space and the 3D world space must be derived through camera calibration. In this paper, we follow the study [31] and our previous work [32,33] to define the coordinate systems and the camera calibration model, and choose the single vanishing point-based calibration method VWL (One Vanishing Point, Known Width and Length) to calculate the calibration parameters.
Schematic diagram of the coordinate systems and camera calibration model is shown in Figure 2. In this paper, three coordinate systems are defined, all of which are right-handed. The world coordinate system is defined by the x, y, z axes; its origin O_w is located at the projection point of the camera on the road plane, and z is perpendicular to the road plane, pointing upwards. The camera coordinate system is defined by the x_c, y_c, z_c axes; its origin O_c is located at the camera optical center, x_c is parallel to x, z_c points to the ground along the camera optical axis, and y_c is perpendicular to the plane x_cO_cz_c. The image coordinate system is defined by the u, v axes; its origin O_i is located at the image center, with u pointing horizontally right and v vertically downward. z_c intersects the road plane at r = (c_x, c_y) in the image coordinate system, which is called the principal point and is by default located at the center of the image, so c_x, c_y represent half of the image width and height, respectively.
In camera calibration, the calibration parameters usually include the camera focal length f, the camera height h above the road plane, the tilt angle φ, and the pan angle θ. In addition, the roll angle can be represented by a simple image rotation, which has no effect on the calibration results and is not considered in this paper. Through the camera model, the projection from the world coordinate system to the image coordinate system can be deduced as follows:

$$\alpha_0 \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mathbf{K}\mathbf{R}\mathbf{T} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \qquad (1)$$

where α_0 is the scale factor, and (u, v, 1)^T and (x, y, z, 1)^T are the homogeneous coordinates of the projection and of the world point, respectively. In this paper, the single vanishing point-based calibration method VWL [31,32] is adopted to solve the calibration parameters f, h, φ, θ, and the vanishing point VP = (u_0, v_0) along the direction of traffic flow is extracted from the road edge lines.
As shown in Figure 3, a line segment in the world coordinate system and its projection in the image coordinate system are presented. In Figure 3a, due to the pan angle θ, the point at infinity along the road direction can be expressed as x_∞ = (−tan θ, 1, 0, 0)^T in world homogeneous coordinates. In Figure 3b, according to the vanishing point principle, (u_0, v_0) is the projection of x_∞ in the image coordinate system. From Equation (1), the calibration parameters φ, θ can be solved as follows:

$$\varphi = \arctan\!\left(-\frac{v_0}{f}\right) \qquad (2)$$

$$\theta = \arctan\!\left(-\frac{u_0\cos\varphi}{f}\right) \qquad (3)$$

Besides vanishing points, markings on the road plane are also commonly used signs. In Figure 3a, the physical length of a line segment parallel to the road direction is l. The vertical coordinates of its back and front endpoints are y_b, y_f in the world coordinate system and v_b, v_f in the image coordinate system. The physical width of the road is w, with a pixel length δ in the corresponding image coordinate system. It can be obtained from the literature [31] that h can be expressed indirectly by w (Equation (4)) or by l (Equation (5)). Since sin φ, cos φ, and cos θ can be solved from Equations (2) and (3), equating Equations (4) and (5) and substituting sin φ, cos φ, cos θ yields a fourth-order equation in f (Equation (6)), where k_V = δτl/(wv_0). From Equation (6), f can be solved first. When f is uniquely determined, φ and θ can be solved according to Equations (2) and (3), and h can finally be solved according to Equation (4) or (5).
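To make the solving order concrete, the short sketch below computes φ and θ from the vanishing point using Equations (2) and (3) as reconstructed above (the authors' sign conventions may differ); f is assumed to be already recovered, e.g., from Equation (6), and all numeric values are illustrative only.

```python
import numpy as np

def tilt_and_pan_from_vp(u0, v0, f):
    """Solve tilt phi and pan theta from the road-direction vanishing point.

    (u0, v0): vanishing point in image coordinates relative to the principal
    point; f: focal length in pixels (assumed already solved, e.g., from the
    fourth-order Equation (6), which is omitted here).
    """
    phi = np.arctan(-v0 / f)                   # Equation (2), reconstructed form
    theta = np.arctan(-u0 * np.cos(phi) / f)   # Equation (3), reconstructed form
    return phi, theta

# Example: a vanishing point 400 px above and 150 px left of the principal point.
phi, theta = tilt_and_pan_from_vp(-150.0, -400.0, 2000.0)
print(np.degrees(phi), np.degrees(theta))  # about 11.3 deg tilt, 4.2 deg pan
```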
Thus, all the calibration parameters are calculated and the mapping between world and image can be described according to Equation (1).
To illustrate the road space in a straightforward way, the origin of the image coordinate system O_i and the y axis of the world coordinate system are adjusted. First, the origin of the image coordinate system is moved to the upper left corner of the image, corresponding to the change of the internal parameter matrix K:

$$\mathbf{K} = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

Then, the y axis is adjusted to the direction along the traffic flow. Therefore, the rotation matrix R contains two parts, respectively representing a rotation of φ + π/2 about the x axis and a rotation of θ about the z axis, which can be expressed as:

$$\mathbf{R} = \mathbf{R}_x\!\left(\varphi + \frac{\pi}{2}\right)\mathbf{R}_z(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ -\sin\varphi\sin\theta & -\sin\varphi\cos\theta & -\cos\varphi \\ \cos\varphi\sin\theta & \cos\varphi\cos\theta & -\sin\varphi \end{bmatrix}$$

The translation matrix is:

$$\mathbf{T} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & -h \end{bmatrix}$$

Therefore, the adjusted mapping from a world point (x, y, z) to an image point (u, v) in homogeneous form can be expressed as:

$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mathbf{H} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}, \qquad \mathbf{H} = \mathbf{K}\mathbf{R}\mathbf{T} \qquad (7)$$

where H = (h_{ij}), i = 1, 2, 3; j = 1, 2, 3, 4 is the 3 × 4 projection matrix from the world coordinate system to the image coordinate system, and s is the scale factor.
Finally, according to the above derivation, the inverse (image-to-world) mapping can be described as follows. For an image point (u, v) and a plane of known height z, the corresponding world point (x, y, z) satisfies:

$$\begin{bmatrix} h_{11} - u h_{31} & h_{12} - u h_{32} \\ h_{21} - v h_{31} & h_{22} - v h_{32} \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} u(h_{33}z + h_{34}) - (h_{13}z + h_{14}) \\ v(h_{33}z + h_{34}) - (h_{23}z + h_{24}) \end{bmatrix} \qquad (8)$$

which, in particular, recovers road-plane positions by setting z = 0.
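As a minimal sketch of Equations (7) and (8) under the conventions reconstructed above (not the authors' implementation, whose sign conventions may differ), the following assembles H = KRT and maps points between world and image; all numeric values are illustrative.

```python
import numpy as np

def build_projection(f, h, phi, theta, cx, cy):
    """Assemble the adjusted 3x4 projection matrix H = K R T (Equation (7)).

    f: focal length (pixels); h: camera height above the road plane;
    phi, theta: tilt and pan angles (radians); (cx, cy): half of the image
    width and height (principal point after moving the origin).
    """
    K = np.array([[f, 0.0, cx],
                  [0.0, f, cy],
                  [0.0, 0.0, 1.0]])
    a = phi + np.pi / 2.0  # rotation about x by (phi + pi/2) ...
    Rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(a), -np.sin(a)],
                   [0.0, np.sin(a), np.cos(a)]])
    Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],  # ... then about z by theta
                   [np.sin(theta), np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
    T = np.hstack([np.eye(3), [[0.0], [0.0], [-h]]])  # camera centre at (0, 0, h)
    return K @ (Rx @ Rz) @ T  # 3x4 projection matrix

def world_to_image(H, X):
    """Equation (7): project a world point (x, y, z) to pixels (u, v)."""
    p = H @ np.append(X, 1.0)
    return p[:2] / p[2]

def image_to_world(H, u, v, z=0.0):
    """Equation (8): back-project a pixel onto the plane of known height z."""
    A = np.array([[H[0, 0] - u * H[2, 0], H[0, 1] - u * H[2, 1]],
                  [H[1, 0] - v * H[2, 0], H[1, 1] - v * H[2, 1]]])
    b = np.array([u * (H[2, 2] * z + H[2, 3]) - (H[0, 2] * z + H[0, 3]),
                  v * (H[2, 2] * z + H[2, 3]) - (H[1, 2] * z + H[1, 3])])
    x, y = np.linalg.solve(A, b)
    return np.array([x, y, z])

H = build_projection(f=2000.0, h=8.0, phi=np.radians(12.0),
                     theta=np.radians(5.0), cx=960.0, cy=540.0)
uv = world_to_image(H, np.array([3.5, 40.0, 0.0]))
print(uv, image_to_world(H, *uv))  # round trip recovers (3.5, 40.0, 0.0)
```

A quick consistency check is that a road-plane point such as (3.5, 40.0, 0.0) survives the round trip through world_to_image and image_to_world with z = 0, since Equation (8) is the exact linear inverse of Equation (7) on a plane of known height.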

Unified World Coordinate System and Road Panoramic Image Generation
The mapping between world and image in a single scene can be described through camera calibration. To complete 3D vehicle trajectory extraction in cross-camera scenes, the road space needs to be fused. At present, image stitching methods are often used, but most of them rely on overlapping areas to extract feature points for matching and for obtaining the transformation between scenes. However, feature extraction and matching are time-consuming, and for multi-scene (more than two scenes) stitching, accumulated errors exist in the scene transformations, which affect the quality of the final stitching result and the measurement accuracy of physical distances. Therefore, we propose a road space fusion algorithm in cross-camera scenes based on camera calibration, which does not completely depend on overlapping areas between scenes. When there are no overlapping areas between scenes, only the distances between the cameras are needed.
Schematic diagram of road space fusion in cross-camera scenes is shown in Figure 4. In Figure 4a, the number of cameras in the scene is N (N ≥ 2), and the set of sub-scene world coordinate systems is defined as W^i_s : {O^i_w − x^i y^i z^i; i = 1, 2, ..., N}, each of which is the same as the world coordinate system in the single scene described in the previous section. The unified world coordinate system is defined as W_u : O_u − x_u y_u z_u, and the origin O_u is located at the road edge close to the camera, with O_u O^1_w perpendicular to the road edge. The mapping matrix between the world coordinate system and the image coordinate system of each scene is the adjusted result described in the previous section, which is defined as H_i, i = 1, 2, ..., N. The red dots in Figure 4 are the control points set to identify the road areas; two control points are set for each scene. The sets of control points in the image and world coordinate systems are P^i_2d : {p^i_1, p^i_2; i = 1, 2, ..., N} and P^i_3d : {P^i_1, P^i_2; i = 1, 2, ..., N}, respectively. In Figure 4b, the panoramic image coordinate system is defined as O_p − u_p v_p, and the origin O_p is located at the upper left corner of the panoramic image, similar to the image coordinate system. Schematic diagram of road distribution in the panoramic image is shown in Figure 5. The proposed road space fusion algorithm in cross-camera scenes is specifically illustrated with this figure.
Step 1: Camera calibration. The calibration method proposed in this paper is used to calculate the calibration parameters of each camera in the scene, including the internal parameter matrix K_i, rotation matrix R_i, translation matrix T_i, and projection matrix H_i = K_i R_i T_i; i = 1, 2, ..., N.
Step 2: Road area identification by setting control points. The Harris corner extraction algorithm is used to obtain the image coordinate set of the nearest and furthest marking endpoints on the road plane in each scene, which is denoted as P^i_2d : {p^i_1 = (x^i_1, y^i_1), p^i_2 = (x^i_2, y^i_2); i = 1, 2, ..., N}. Equation (8) is used to convert P^i_2d to the world coordinate set P^i_3d, from which the range of the road area is calculated.
Step 3: Set the control parameter groups and divide the pixels of the panoramic image M_p into the corresponding scenes.
The width of the road is w (mm). The scale of the road space along the width direction is r_w (pixel/mm) and along the length direction r_l (pixel/mm). The height of M_p is w·r_w, and its width is the sum of the pixel lengths corresponding to each scene along the road direction, where the corresponding length of each scene on the panoramic image is determined by the control points in P^i_3d.
Step 4: Generate the complete panoramic image M_p. The panoramic image coordinates are traversed from the origin at the upper left corner. A point (u, v) in the panoramic image coordinate system belongs to a scene i, and its corresponding world coordinate point is obtained from the mapping equation group below. The pixel I_pixel in the road area corresponding to the world coordinate point is taken out (if any) and put at the position of the panoramic image coordinate point. This process is repeated until all the pixels of the corresponding road areas in all scenes are taken out and put into the panoramic image correctly.
Since the generated panoramic image contains physical information of the road space, the position in the sub-scene world coordinate system and in the unified world coordinate system can be calculated directly from a point in the panoramic image. Conversely, the position in the unified world coordinate system and in the panoramic image coordinate system can also be derived from a point in a sub-scene world coordinate system. The specific mapping equation group (panoramic image-to-world, Equation (9), and world-to-panoramic image, Equation (10)) relates a point (u, v) in the panoramic image, the number i of the sub-scene, and a point (X, Y, 0) in sub-scene i.
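As a minimal sketch of Steps 3 and 4 under stated assumptions, the loop below maps every panoramic pixel to a unified world point, assigns it to a sub-scene, projects it into that scene's image with H_i, and copies the pixel back. The y_offsets parameter is a hypothetical stand-in for the control-point bookkeeping behind Equations (9) and (10), and per-scene x offsets are ignored for brevity.

```python
import numpy as np

def generate_panorama(frames, Hs, scene_lengths, w, r_w, r_l, y_offsets):
    """Sketch of Steps 3-4: assemble the road panoramic image M_p.

    frames: one image per sub-scene camera; Hs: the 3x4 matrices H_i.
    scene_lengths: road length covered by each scene along the traffic flow.
    w: road width; r_w, r_l: pixels per unit length across/along the road
    (consistent units assumed). y_offsets: hypothetical per-scene y of the
    road start in each sub-scene world frame.
    """
    height, width = int(w * r_w), int(sum(scene_lengths) * r_l)
    panorama = np.zeros((height, width, 3), dtype=np.uint8)
    bounds = np.cumsum([0.0] + list(scene_lengths))  # scene split positions

    for vp in range(height):        # across the road
        for up in range(width):     # along the road, from the unified origin
            x_u, y_u = vp / r_w, up / r_l          # unified world coordinates
            i = min(int(np.searchsorted(bounds, y_u, side='right')) - 1,
                    len(frames) - 1)               # scene owning this pixel
            # Unified -> sub-scene road-plane point (z = 0), x offset ignored.
            X = np.array([x_u, y_u - bounds[i] + y_offsets[i], 0.0, 1.0])
            p = Hs[i] @ X                          # project into sub-scene image
            if p[2] <= 1e-9:
                continue
            u, v = int(round(p[0] / p[2])), int(round(p[1] / p[2]))
            h_i, w_i = frames[i].shape[:2]
            if 0 <= u < w_i and 0 <= v < h_i:      # take the pixel out, if any
                panorama[vp, up] = frames[i][v, u]
    return panorama
```

The per-pixel loop is written for clarity; in practice the same mapping can be vectorized with NumPy or implemented with cv2.remap.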

3D Bounding Boxes and Projection Centroids of Vehicles
Based on road space fusion in cross-camera scenes, vehicle detection is needed to further obtain the vehicle spatial distribution and 3D trajectories. Since the heights of vehicle feature points are unknown, the projection centroid is adopted in this paper instead, which depends on 3D vehicle detection. Considering actual application requirements, we choose YOLOv4 [34] for 2D vehicle detection. The detection results contain the center point, width, and height of the 2D bounding box in the image coordinate system, the vehicle type (car, truck, bus), and its confidence. Then, the best 3D vehicle detection result and projection centroid are obtained by geometric constraints for vehicle spatial distribution and 3D trajectory extraction. Figure 6 shows the vehicle model of the 2D/3D bounding box from the left and right perspectives. In each sub-figure, the left represents the 2D model and the right the 3D model. The 2D model is in the image coordinate system. The axes of the 3D model have the same directions as the world coordinate system, and its origin is the bottom-left point of the 3D model. The vertices of the 2D bounding box model are numbered from 0 to 3, and the corresponding image coordinates are denoted as P^{2D}_i = (u^{2D}_i, v^{2D}_i), i = 0, 1, 2, 3. In the same way, the vertices of the 3D bounding box model are numbered from 0 to 7, and the corresponding world and image coordinates are denoted as P^{3D}_i and P^{3Di}_i, i = 0, 1, ..., 7. The world coordinates of the eight vertices and of the projection centroid of the vehicle from different perspectives are presented in Table 1.

Schematic diagram of 2D/3D vehicle detection is shown in Figure 7 (the left represents 2D detection while the right represents 3D detection), and the algorithm is described as follows:
Step 1: YOLOv4 is used to obtain the vertices in the image coordinate system P^{2D}_i = (u^{2D}_i, v^{2D}_i), i = 0, 1, 2, 3 and the vehicle type. The base point of the 2D bounding box is set as P^{2D}_1 = (u^{2D}_1, v^{2D}_1) in the image coordinate system, which can be converted into P^{3D}_1 = (x^{3D}_1, y^{3D}_1, z^{3D}_1) in the world coordinate system by Equation (8).
Step 2: According to the vehicle type and the 3D vehicle physical size, the world coordinates of the remaining vertices P^{3D}_0 and P^{3D}_2 to P^{3D}_7 are calculated from the base point according to Table 1.
Step 3: The calculation results in Step 2 are converted to the image coordinates P^{3Di}_i by Equation (7) to complete 3D vehicle detection.
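Since the paper's Table 1 is not reproduced here, the sketch below shows one plausible layout of the eight box vertices from the base point P^{3D}_1 and a candidate size (l_v, w_v, h_v), chosen so that the 0-3, 1-2, 4-7, and 5-6 point pairs run along the road (y) direction as required by the vanishing point constraint; the perspective flag and the exact vertex order are assumptions.

```python
import numpy as np

def box_vertices_from_base(base, size, perspective='right'):
    """Hypothetical Table-1-style layout of the 3D bounding box.

    base: world point (x, y, 0) of vertex 1 (the base point);
    size: (l_v, w_v, h_v). Vertices 0-3 form the bottom face and 4-7 the
    top face, with the 0-3, 1-2, 4-7, 5-6 pairs aligned with the road (y)
    direction. Returns the eight vertices and the projection centroid.
    """
    x, y, _ = base
    l_v, w_v, h_v = size
    sx = 1.0 if perspective == 'right' else -1.0   # mirror for the left view
    bottom = np.array([[x + sx * w_v, y,       0.0],   # 0
                       [x,            y,       0.0],   # 1 (base point)
                       [x,            y + l_v, 0.0],   # 2
                       [x + sx * w_v, y + l_v, 0.0]])  # 3
    top = bottom + np.array([0.0, 0.0, h_v])           # vertices 4-7 above 0-3
    vertices = np.vstack([bottom, top])
    # Projection centroid: the vehicle centroid projected onto the road plane.
    centroid = np.array([x + sx * w_v / 2.0, y + l_v / 2.0, 0.0])
    return vertices, centroid
```

Projecting these vertices with world_to_image from the earlier calibration sketch yields the image coordinates P^{3Di}_i used below, and the returned centroid is the projection centroid on the road plane.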

Geometric Constraints
According to the above 3D vehicle detection algorithm, obtaining an accurate 3D vehicle physical size is the premise of precise 3D vehicle detection. Due to perspective distortion and the lack of depth information in a monocular image, an accurate size cannot be obtained from the vehicle type derived by YOLOv4 alone. Therefore, geometric constraints, consisting of a diagonal constraint and a vanishing point constraint, are used to accurately calculate the 3D vehicle physical size.
3D vehicle detection is equivalent to obtaining the 3D vehicle physical size X = (l_v, w_v, h_v). The diagonal pixel length of the 2D bounding box is defined as:

$$d_{2D} = \left\| P^{2D}_1 - P^{2D}_3 \right\|_2 \qquad (11)$$

where ‖·‖₂ denotes the Euclidean distance between two points. According to the 3D bounding box model, P^{3Di}_1 and P^{3Di}_7 are selected, and the diagonal pixel length of the 3D bounding box can also be defined as:

$$d_{3D} = \left\| P^{3Di}_1 - P^{3Di}_7 \right\|_2 \qquad (12)$$

The difference between Equations (11) and (12) constitutes the diagonal constraint. Figure 8 shows the vehicle diagonal constraint, where the red/yellow wireframes represent the 2D/3D bounding boxes. When the 2D and 3D bounding boxes fit completely, the blue line segment indicates that the 2D/3D diagonals coincide in the image coordinate system and the value of the diagonal constraint is 0, which means the 3D vehicle physical size is relatively accurate; here, relatively means accurate under the diagonal constraint alone.
According to the vanishing point principle, the straight lines formed by the 0-3, 1-2, 4-7, and 5-6 point pairs in the 3D bounding box model must pass through the vanishing point along the road direction in the image coordinate system. Therefore, they can be used as another set of constraints to accurately calculate the 3D vehicle physical size.
In the image coordinate system, the included angle between two lines (one formed by a point pair, the other formed by one point of the pair and the vanishing point along the road direction) is denoted as θ.
For the four point pairs, according to the cosine theorem, we can derive θ₁, θ₂, θ₃, θ₄ as follows:

$$\cos\theta_1 = \frac{\left(P^{3Di}_3 - P^{3Di}_0\right)\cdot\left(\mathrm{VP} - P^{3Di}_0\right)}{\left\| P^{3Di}_3 - P^{3Di}_0 \right\|_2 \left\| \mathrm{VP} - P^{3Di}_0 \right\|_2} \qquad (13)$$

$$\cos\theta_2 = \frac{\left(P^{3Di}_2 - P^{3Di}_1\right)\cdot\left(\mathrm{VP} - P^{3Di}_1\right)}{\left\| P^{3Di}_2 - P^{3Di}_1 \right\|_2 \left\| \mathrm{VP} - P^{3Di}_1 \right\|_2} \qquad (14)$$

$$\cos\theta_3 = \frac{\left(P^{3Di}_7 - P^{3Di}_4\right)\cdot\left(\mathrm{VP} - P^{3Di}_4\right)}{\left\| P^{3Di}_7 - P^{3Di}_4 \right\|_2 \left\| \mathrm{VP} - P^{3Di}_4 \right\|_2} \qquad (15)$$

$$\cos\theta_4 = \frac{\left(P^{3Di}_6 - P^{3Di}_5\right)\cdot\left(\mathrm{VP} - P^{3Di}_5\right)}{\left\| P^{3Di}_6 - P^{3Di}_5 \right\|_2 \left\| \mathrm{VP} - P^{3Di}_5 \right\|_2} \qquad (16)$$

The sum of the four angles above constitutes the vanishing point constraint. As shown in Figure 8, the red line segment is used to extract the vanishing point. When the 2D and 3D bounding boxes fit completely, the deep blue lines show that the lines formed by the point pairs and the vanishing point coincide in the image coordinate system and the value of the vanishing point constraint is 0, which means the 3D vehicle physical size is relatively accurate; here, relatively means accurate under the vanishing point constraint alone.
In this paper, the steps to obtain the vehicle geometric constraints are as follows:
Step 1: YOLOv4 is used to obtain the vertices in the image coordinate system P^{2D}_i = (u^{2D}_i, v^{2D}_i), i = 0, 1, 2, 3 and the vehicle type.
Step 2: (l_v, w_v, h_v) is considered to be a set of unknown parameters. The base point in the world coordinate system can be obtained by Equation (8) as P^{3D}_1 = (x^{3D}_1, y^{3D}_1, z^{3D}_1), where z^{3D}_1 = 0. Then, according to Table 1, the world coordinates P^{3D}_0 and P^{3D}_2 to P^{3D}_7 can be calculated.
Step 3: According to Equation (11), the diagonal pixel length of the 2D bounding box is calculated. Then, the world coordinates of vertices 1 and 7 are converted to the image coordinates P^{3Di}_1 and P^{3Di}_7 according to Equation (7). Finally, the diagonal pixel length of the 3D vehicle bounding box is calculated according to Equation (12), and the diagonal constraint is formed.
Step 4: According to Equation (7), the world coordinates of vertices 0 to 7 are converted to the image coordinates P^{3Di}_0 to P^{3Di}_7. The values of cos θ₁ to cos θ₄ can then be calculated according to Equations (13) to (16), and the vanishing point constraint is formed.
According to the above algorithm, the diagonal constraint and the vanishing point constraint are obtained to construct the constraint error l_cal − l_truth, where l_cal is the actual constraint value obtained by calculation and l_truth is the ideal constraint value when the 2D and 3D bounding boxes fit completely. By analyzing the above algorithm, it can easily be seen that the variables in the constraint error are the parameters l_v, w_v, h_v, which constitute the nonlinear constraint space of the parameter vector.
To sum up, the nonlinear constraint function of the parameter X = (l_v, w_v, h_v) is:

$$\min_{X} \sum_{n=1}^{N_f} \left[ \lambda_d\, e^{(n)}_d(X) + \lambda_v\, e^{(n)}_v(X) \right] \qquad (17)$$

where e^{(n)}_d and e^{(n)}_v are the diagonal and vanishing point constraint errors in frame n, N_f is the number of video frames in which the same vehicle appears, λ_d and λ_v respectively represent the error coefficients of the diagonal constraint and the vanishing point constraint, which are usually set to 1 and can be adjusted under different conditions, and min_X denotes the value of X at which the constraint function reaches its minimum. The constraint function is nonlinear, so the LM (Levenberg-Marquardt) method, which converges easily, is adopted in this paper to solve it. The initial value X_0 can be obtained by referring to the national road vehicle size standard [35] based on the vehicle type derived by YOLOv4.
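A minimal sketch of the constraint solve is given below, assuming SciPy's Levenberg-Marquardt solver as the LM method. The frames structure and its project callback are hypothetical stand-ins for the per-frame 2D detections and the Table-1 box construction, the residuals are squared internally by least_squares, and 1 − cos θ is used in place of the angle itself (both vanish exactly when the boxes fit).

```python
import numpy as np
from scipy.optimize import least_squares

def cos_angle(p, q, vp):
    """Cosine of the angle at p between segment p->q and ray p->vp."""
    a, b = q - p, vp - p
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def residuals(X, frames, vp, lam_d=1.0, lam_v=1.0):
    """Diagonal and vanishing point constraint errors over N_f frames.

    X: candidate size (l_v, w_v, h_v). Each entry of `frames` is assumed to
    provide the 2D box corners p2d[0..3] and a project(X) callback returning
    the eight image-space 3D-box vertices p3di[0..7] (e.g., built with
    box_vertices_from_base and world_to_image from the earlier sketches).
    """
    res = []
    for fr in frames:
        p2d, p3di = fr['p2d'], fr['project'](X)
        d2d = np.linalg.norm(p2d[1] - p2d[3])    # Equation (11)
        d3d = np.linalg.norm(p3di[1] - p3di[7])  # Equation (12)
        res.append(lam_d * (d2d - d3d))          # diagonal constraint error
        for a, b in ((0, 3), (1, 2), (4, 7), (5, 6)):
            # Vanishing point constraint: 1 - cos(theta_k) vanishes when the
            # line through the pair passes exactly through the vanishing point.
            res.append(lam_v * (1.0 - cos_angle(p3di[a], p3di[b], vp)))
    return res

# Initial size X0 from the road vehicle size standard for the detected type,
# e.g., an illustrative passenger car (l_v, w_v, h_v) in metres:
X0 = np.array([4.5, 1.8, 1.5])
# sol = least_squares(residuals, X0, args=(frames, vp), method='lm')  # LM solve
```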
After the accurate 3D vehicle physical size is solved, 3D vehicle detection can be completed, and the world coordinates of the projection centroids can be calculated. According to Equations (9) and (10), the coordinates of vehicles in the panoramic image and in the other scenes can be obtained. As shown in Figure 9, the vehicle spatial distribution and 3D trajectories in cross-camera scenes can be obtained from vehicles in continuous motion.

Results
In our experiments, we used an Intel Core i7-8700 CPU, an NVIDIA 1080Ti GPU (Graphics Processing Unit), 32 GB of memory, and the Windows 10 operating system. The open-source framework Darknet is used for vehicle detection.
Experiments are carried out on the public dataset BrnoCompSpeed [36] and on an actual road scene, using the algorithm illustrated in Section 2. First, the road space fusion algorithm in cross-camera scenes is used to generate the panoramic image of the road with spatial information. Secondly, YOLOv4 combined with geometric constraints is used for 3D vehicle detection to obtain projection centroids. Finally, the projection centroids are projected onto the panoramic image to derive the vehicle spatial distribution and 3D trajectories. The experiments cover the following two aspects: (1) verifying the accuracy of the projection centroids obtained by the 3D vehicle detection algorithm for vehicle spatial distribution; (2) comparing the proposed 3D vehicle trajectory extraction algorithm with several existing 3D tracking methods.

BrnoCompSpeed Dataset Single-Camera Scene
Due to the lack of cross-camera datasets from road surveillance perspectives, we choose a public dataset of single-camera scenes from surveillance perspectives published by researchers of Brno University of Technology for our experiments. The cross-camera dataset made by ourselves and experiments carried out on this scene are described in detail in Section 3.2.
The public dataset BrnoCompSpeed contains six traffic scenes captured by roadside surveillance cameras. Each scene can be divided into left, middle, and right perspectives, for a total of 18 HD (High Definition) videos (about 200 GB). The resolution of all the videos is 1920 × 1080. The dataset contains various types of vehicles such as hatchback, sedan, SUV, truck, and bus, and the positions and velocities of the vehicles are accurately recorded by radar. Therefore, this dataset can be used to verify the accuracy of vehicle spatial distribution and 3D trajectories in single-camera scenes.
As shown in Figure 10, we select three scenes of different perspectives from six scenes for verification which do not contain winding roads. In all the three scenes, the width of a single lane is 3.5 m, the length of a single short white marking line is 1.5 m, the length of a single long white marking line is 3 m, and the length between the starting points of the long white marking lines is 9 m. First, the three scenes are calibrated separately. Calibration results are shown in Table 2. Based on calibration, the road space fusion algorithm described in Section 2.2.2 is adopted to generate the panoramic image with physical information. Since the scenes in the dataset are single-camera scenes, we generate a roadblock containing physical information for convenience which is shown in Figure 11. Each small square of the roadblock represents the actual road space size of 3.5 × 9 m.

The real position of the vehicle in the world coordinate system is defined as P_r and the measured position as P_m. The effective field of view of the scene is set to L_s (m). Then, the vehicle spatial distribution error can be defined as:

$$E_s = \frac{\left\| P_r - P_m \right\|_2}{L_s} \times 100\% \qquad (18)$$

Examples of the vehicle spatial distribution and 3D trajectories in the dataset scenes are shown in Figure 12. In this experiment, L_s is set to 450 m, and the base point in scene 2 can be selected using either the left or the right perspective. Each scene contains multiple vehicles, and there are some cases of vehicle occlusion. For each instance, the top image contains the 3D vehicle detection and 2D trajectory results, and the roadblock below contains the vehicle spatial distribution and 3D trajectory results. Each vehicle corresponds to one color without repetition. Tables 3-5 list the 3D physical size, the image and world coordinates, and the spatial distribution error of each vehicle in dataset scenes 1 to 3. The values of the y axis in the world coordinate system are presented in ascending order, which indicates that the distance between the vehicle and the camera ranges from near to far. To present the results in a straightforward way, the position and direction of each vehicle are marked in the roadblock with a white line segment and a white arrow, respectively. From the experimental results, it can be seen that the average error of vehicle spatial distribution within a scope of hundreds of meters is less than 5%, which means the accuracy can reach the centimeter level. Meanwhile, the proposed algorithm also adapts to partial vehicle occlusion.
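A small sketch of the metric as reconstructed above (the normalization by L_s is our reading of the text):

```python
import numpy as np

def spatial_distribution_error(p_real, p_measured, L_s):
    """Deviation between the real and measured positions, normalized by the
    effective field of view L_s (m) and expressed as a percentage."""
    d = np.linalg.norm(np.asarray(p_real) - np.asarray(p_measured))
    return d / L_s * 100.0

# A deviation of about 0.85 m over L_s = 450 m gives roughly 0.19%.
print(spatial_distribution_error([3.5, 120.0], [3.8, 120.8], 450.0))
```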

Actual Road Cross-Camera Scene
To further verify the application ability of the proposed algorithm, we chose an actual road with heavy traffic flow, located on the Middle Section of South Second Ring Road in Xi'an, ShaanXi Province, China, to build a small dataset of cross-camera scenes. The dataset consists of three groups of HD videos (six videos in total), each of which is about 0.5 h long. The resolution of all the videos is 1280 × 720. Figure 13 shows the actual road scenes with no overlapping area, taken by two cameras 210 m apart. In the actual road scene, the road width is 7.5 m, the length of a single white marking line on the road plane is 6 m, and the length between the starting points of the white marking lines is 11.80 m and 11.39 m in the two scenes, respectively. First, the scenes taken by the two cameras are calibrated separately. Calibration results are shown in Table 6. Based on calibration, the panoramic image with physical information is generated by the road space fusion algorithm described in Section 2.2.2, which is shown in Figure 14. A degree scale in the image represents the actual distance between the starting points of four white marking lines, and 3.75 m in the image width and height directions.

In our experiments, we choose three examples of vehicles, which are shown in Figure 15. For each example (similar to the dataset scenes), the 3D vehicle detection results from the two cameras are shown in the first two rows, and the 3D vehicle trajectory extraction results are shown in the third row. Each vehicle corresponds to one color without repetition. Table 7 shows the results of vehicle spatial distribution in the actual road scene. Similar to the single-camera scenes, we mark the position and direction of each vehicle in the panoramic image with a green line segment and a white arrow, respectively. From the experimental results, it can be seen that continuous 3D trajectories of vehicles in cross-camera scenes can be effectively extracted.
As shown in Figure 16, the proposed algorithm is compared with the 3D tracking methods based on feature points and 2D bounding boxes, which are represented by red, green, and orange, respectively.
It can be seen that the method based on feature points is greatly influenced by vehicle texture and the surrounding environment, cannot reflect the true driving direction well, and may fail to obtain continuous 3D trajectories under occlusion. The method based on 2D bounding boxes cannot accurately reflect the true driving position because the distance from the bottom edge to the road plane is unknown. The proposed algorithm is superior to the existing methods because it obtains an accurate 3D vehicle bounding box and is robust to vehicle occlusion and low camera visual angles. The performance of the several 3D tracking methods is compared in Table 8. Since the proposed 3D vehicle detection algorithm is based on geometric constraints, the overall processing speed is fast. As can be seen from the examples in Figure 15, the average processing speed of our algorithm on the GPU platform is 16 FPS with an average time of 600 ms, which can achieve real-time performance.
During the experiments, it was also found that the accuracy of vehicle spatial distribution and 3D trajectory extraction is related to the pan angle θ of the camera. Therefore, we measured the accuracy under different camera pan angles, as shown in Figure 17. When the pan angle is close to 0°, the information of the vehicle side surface is invisible, which decreases the 3D vehicle detection accuracy. In practical applications, the pan angle of the camera can be increased appropriately to retain most of the visual information of the vehicle.

Conclusions
Through experimental verification, the proposed algorithm for vehicle spatial distribution and 3D trajectory extraction in cross-camera scenes has achieved good results in both the BrnoCompSpeed single-camera scenes and the actual road cross-camera scenes. The main contributions of this paper are as follows: (1) A road space fusion algorithm in cross-camera scenes based on camera calibration is proposed to generate the panoramic image with physical information of the road space, which can be used to convert multiple cross-camera perspectives into a continuous 3D physical space. (2) A 3D vehicle detection algorithm based on geometric constraints is proposed to accurately obtain 3D vehicle projection centroids, which are used to describe the vehicle spatial distribution in the panoramic image and to extract 3D trajectories. Compared with existing vehicle tracking methods, continuous 3D trajectories can be obtained in the panoramic image with physical information through the 3D projection centroids, which is helpful for applications in large-scale road scenes.
However, the 3D vehicle projection centroids obtained by the proposed algorithm are highly dependent on the 2D vehicle detection results. When a vehicle is far from the camera, it is prone to be missed by the detector, and the accuracy decreases when the camera pan angle is close to 0°. Moreover, the proposed algorithm cannot yet adapt to various road situations or congested traffic. In future work, a more efficient road space fusion method can be developed to generate the panoramic image and calculate the vehicle spatial distribution more precisely, and a more sophisticated vehicle detection network can be designed to fuse various types of geometric constraints and further improve the accuracy of 3D vehicle detection under different camera pan angles. In addition, only straight roads and simple traffic conditions are considered in this paper; the algorithm needs to be further extended to complex traffic scenes such as intersections (including winding roads) and traffic congestion for more practical and advanced applications. Efforts are also needed to collect a large dataset of such complex traffic scenes for algorithm validation. This direction is a key and difficult point of future work.