Deep Learning Derived Object Detection and Tracking Technology Based on Sensor Fusion of Millimeter-Wave Radar/Video and Its Application on Embedded Systems

This paper proposes a deep learning-based early fusion method for mmWave radar and RGB camera sensors for object detection and tracking, together with its embedded system realization for ADAS applications. The proposed system can be used not only in ADAS systems but also in smart Road Side Units (RSU) in transportation systems to monitor real-time traffic flow and warn road users of probable dangerous situations. As mmWave radar signals are less affected by weather and lighting conditions such as cloudy, sunny, snowy, night-light, and rainy days, the radar can work efficiently in both normal and adverse conditions. Compared to using an RGB camera alone for object detection and tracking, the early fusion of the mmWave radar and RGB camera can make up for the poor performance of the RGB camera when it fails due to bad weather and/or lighting conditions. The proposed method combines the features of radar and RGB cameras and directly outputs the results from an end-to-end trained deep neural network. Additionally, the complexity of the overall system is reduced such that the proposed method can be implemented on PCs as well as on embedded systems like the NVIDIA Jetson Xavier at 17.39 fps.


Introduction
In recent years, Advanced Driving Assistance Systems (ADAS) have greatly promoted safe driving and can help avoid dangerous driving events, saving lives and preventing damage to infrastructure. Considering safety in autonomous system applications, it is crucial to accurately understand the surrounding environment under all circumstances and conditions. In general, autonomous systems need to estimate the positions and velocities of probable obstacles and make decisions that ensure safety. The input data of an ADAS comes from various sensors, such as millimeter-wave (mmWave) radars, cameras, the controller area network (CAN) bus, and light detection and ranging (LiDAR), which help road users perceive the surrounding environment and make correct decisions for safe driving. Figure 1 shows various equipment essential in an ADAS system.
Vision sensors are the most common sensors around us and their applications are everywhere. They have many advantages, such as high resolution, high frame rate, and low hardware cost. As deep learning (DL) has become extremely popular [1,2], the importance of vision sensors has steadily increased. Since vision sensors preserve the appearance information of the targets, they are well suited for DL technology. Consequently, camera-only object detection is widely used in numerous applications, such as smart roadside units (RSU), self-driving vehicles, and smart surveillance. However, the results of camera-based object detection are severely affected by ambient light and adverse weather conditions. Although the camera can distinguish the type of objects well, it cannot accurately obtain physical characteristics such as the actual distance and velocity of the detected objects. In the ADAS industry, many companies with relevant research, such as Tesla, Google, and Mobileye, use other sensors in their self-driving cars to compensate for camera failures in bad weather and lighting conditions such as nightlight, foggy, dusky, and rainy conditions, as shown in Figure 2.
In contrast to the camera, mmWave radars provide the actual distance and velocity of the detected object relative to the radar, and they can also provide the intensity of the object as a reference for identification. This makes the mmWave radar a good choice to be used together with a camera in sensor fusion applications to yield better detection and tracking in all weather and lighting conditions. Compared to LiDAR, the mmWave radar has better penetration and is cheaper. Although the mmWave radar produces a sparser point cloud than LiDAR, its points are easy to group with a clustering algorithm to find the objects. When the advantages of the mmWave radar and the camera are put to good use, the two sensors complement each other and provide a better perception capability compared to expensive 3-D LiDARs.
Basically, three main fusion schemes have been proposed for using mmWave radar and camera together, namely (i) decision-level fusion, (ii) data-level fusion, and (iii) feature-level fusion [3]. For autonomous systems, different sensors can make up for each other's shortcomings and overcome adverse situations through sensor fusion. Hence, we think that radar and camera sensor fusion is better and more reliable for drivers than using a single sensor, whether radar-only or camera-only.
The following sections of the paper discuss the related works, comprising three existing mmWave radar and camera sensor fusion methods, followed by the steps involved in the proposed early fusion of mmWave radar and camera sensors, the experimental results, and the conclusion.

Motivation
This paper focuses on the early fusion of the mmWave radar and camera sensors for object detection and tracking. The late fusion of the mmWave radar and camera sensors belongs to decision-level fusion [4][5][6]: first, the mmWave radar sensor and the camera sensor detect obstacles individually; then, their prediction results are fused to obtain the final output. However, different kinds of detection noise are involved in the predictions of these two heterogeneous sensors. Therefore, how to fuse the prediction results of these two kinds of sensors is a great challenge in the late fusion of the mmWave radar and camera sensors.
To solve the above problem, this paper proposes the early fusion of the mmWave radar and camera sensors, which is also known as feature-level fusion. To begin with, we transform the radar point cloud from the radar coordinates to the image coordinates. In the process, we write the distance, velocity, and intensity of the detected objects from the radar point cloud into separate radar image channels, each corresponding to one physical characteristic. Finally, we fuse the visual image and the radar image into a multi-channel array and use a DL object detection model to extract the information from both sensors. Through the object detection model, the early fusion learns the relationship between the data from the mmWave radar and camera sensors, which not only solves the problem of decision-level fusion but also solves the problem of detection in harsh environments when using the camera only.
This paper is organized as follows. Section 2 reviews the related works, which include the three existing mmWave radar and camera sensor fusion methods, related deep learning object detection models, some radar signal processing algorithms, and the adopted image processing methods. Section 3 introduces the proposed early fusion technology for the mmWave radar and camera sensors in detail. Section 4 presents our experiments and results, and Section 5 gives the conclusion and future works.

Types of Sensor Fusion
As discussed in the previous section, the methods of sensor fusion [7,8] are broadly categorized into three types, namely (a) decision-level fusion, (b) data-level fusion, and (c) feature-level fusion, as shown in the respective flowcharts in Figure 3. In decision-level fusion [9], two heterogeneous types of prediction results, one from the mmWave radar and one from the camera, are fused to obtain the final results. Because the data types from the mmWave radar and camera sensors are heterogeneous and each prediction carries its own detection noise, there are no good methods to fuse their respective prediction results.
The second sensor fusion method is data-level fusion [10][11][12], in which we first cluster the radar point cloud. Then, we find the positions of the clustering points to generate the regions of interest (ROIs) where there may be objects to be detected. Finally, using the ROIs, we extract the corresponding image patches from the input image and apply object detection models to obtain the final predicted results. This fusion method requires a lot of valid radar points, so an object cannot be detected if there are no valid radar points on it. Although the data-level fusion method can reduce the operational complexity and solve the decision problem in decision-level fusion, it is not suitable for autonomous systems from a safety standpoint.
The final sensor fusion method is feature-level fusion [13][14][15]. Usually, in the feature-level fusion method, the radar point cloud is transformed from the radar coordinates to the image coordinates to form the radar image, as shown in Figure 3. Then, the radar image and the corresponding vision image are fused, and their features are extracted by the DL model. Feature-level fusion not only solves the decision-making problem in decision-level fusion but also learns the relationship between the mmWave radar and the vision image using the DL model.
The contributions of the early sensor fusion method proposed in this paper are: (i) It employs the fusion of the mmWave radar and the RGB camera sensor for more precise object detection and tracking compared to either camera-only or radar-only methods. (ii) It can be used in an ADAS system for object detection and tracking, and it can also be applied to a smart Road Side Unit (RSU) in smart transportation to monitor real-time traffic flow and warn all road users of dangerous situations.

YOLO v3 Model
YOLO v3 [1] has several good characteristics, such as its bounding box prediction, the absence of softmax, and the use of feature pyramid networks (FPN). The authors use logistic regression to predict the confidence score of each object in the bounding box. The purpose is to distinguish the targets from the background. The IOU value between the bounding boxes and the ground truth is used as the criterion to evaluate the detection efficiency. One of the important features of YOLO v3 is that it does not use softmax to classify each box, because softmax imposes the assumption that each box contains only one category, whereas in practice different objects possess overlapping labels. For example, an object labeled "boy" also belongs to the category "person". In the autonomous system field there are many such multi-label scenarios, for which softmax is not suitable. In addition, YOLO v3 makes predictions on three different scales, namely 13 × 13, 26 × 26, and 52 × 52, with three bounding boxes predicted per grid cell on each scale. This approach helps YOLO v3 to better detect small objects, and the up-sampling technology helps the network learn subtle features for detecting small objects. As autonomous systems require real-time and accurate judgment, the YOLO v3 model becomes an ideal choice.
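The two ideas above, IOU as the matching criterion and independent logistic scores instead of softmax, can be illustrated with a minimal sketch. The box format `(x1, y1, x2, y2)`, the label names, and the logit values are assumptions made for illustration only and are not taken from the YOLO v3 implementation.

```python
import math

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def sigmoid(x):
    """Logistic function: scores one label independently of the others."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    """Softmax: scores compete and are forced to sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw class scores for one box, in the order ["person", "boy", "car"].
logits = [3.0, 2.5, -4.0]
independent = [sigmoid(v) for v in logits]  # "person" and "boy" can both be high
exclusive = softmax(logits)                 # "boy" is suppressed by "person"
```

With independent logistic scores, both overlapping labels exceed 0.5, whereas softmax pushes the second label below 0.5 because the scores must share a total of 1.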

YOLO v4 Model
YOLO v4 [2] is an improvement of YOLO v3, which improves the input stage during the training phase so that training can yield good results on a single GPU. For instance, the mosaic used in YOLO v4 builds on the CutMix data augmentation method [16] proposed in 2019. CutMix uses only two images for stitching, while mosaic data augmentation [17] stitches four images using random scaling, random cropping, and random arrangement. In normal training processes, the average precision of small targets is generally much lower than that of medium and large targets. The COCO dataset [18] also contains a large number of small targets, but the crucial challenge is that the distribution of small targets is not uniform. Mosaic data augmentation can balance the proportion of small, medium, and large targets. In addition, the backbone of YOLO v4 uses the Cross Stage Partial Network (CSPNet) [19], which reduces repeated gradient learning and greatly enhances the learning ability of the network. Although the model architecture of YOLO v4 is more complicated than that of YOLO v3, YOLO v4 uses a lot of 1 × 1 convolutions to reduce the number of calculations and increase the processing speed. Therefore, the YOLO v4 model is also suitable for the autonomous field.

Clustering
The radar used in our proposed system is a Frequency Modulated Continuous Wave (FMCW) radar [20]. The FMCW radar emits continuous waves whose frequency varies during the frequency sweep period. The echo reflected by the object has a certain frequency difference from the transmitted signal, and the distance between the target and the radar can be obtained by measuring this frequency difference. The frequency of the difference (beat) signal is relatively low, so the hardware processing is relatively simple. Therefore, the FMCW radar is suitable for data acquisition and digital signal processing.
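As a rough illustration of how a distance follows from the measured frequency difference, the sketch below uses the standard relation for a linear (sawtooth) chirp, R = c · f_beat / (2 · S) with chirp slope S = B / T. The function names and the bandwidth, chirp time, and beat frequency values are assumptions for illustration; the paper does not state the radar's chirp configuration in this section.

```python
C = 299_792_458.0  # speed of light in m/s

def beat_frequency_to_range(f_beat_hz, bandwidth_hz, chirp_time_s):
    """Range of a static target from the beat frequency of a linear FMCW chirp."""
    slope = bandwidth_hz / chirp_time_s  # chirp slope S in Hz/s
    return C * f_beat_hz / (2.0 * slope)

def range_resolution(bandwidth_hz):
    """Smallest separable range difference, c / (2B)."""
    return C / (2.0 * bandwidth_hz)

# Hypothetical chirp: 4 GHz sweep in 40 microseconds, 1 MHz measured beat frequency.
example_range_m = beat_frequency_to_range(1e6, 4e9, 40e-6)
example_resolution_m = range_resolution(4e9)
```

With these assumed numbers the target lies at roughly 1.5 m and the range resolution is a few centimeters, which shows why a low-frequency beat signal is enough to recover distance.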
K-means clustering [21] is the most common and well-known clustering method. It is similar to the concept of finding the center of gravity. First, it randomly selects k points as the initial cluster centers, one for each of the k groups into which the radar point cloud is to be divided. Second, it assigns each point to its nearest cluster center. Third, it recalculates the cluster center of each group. Finally, steps two and three are repeated until k stable clusters are found. However, the problem of K-means clustering is that we cannot know the number of clusters or the number of repetitions in advance. The data distribution and the initial location of the cluster centers affect the number of repetitions. Therefore, for autonomous applications, we think that K-means clustering is not the most suitable clustering algorithm.
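The K-means steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under assumed inputs (2-D points as tuples, a fixed random seed, a hypothetical `max_iter` cap), not the implementation evaluated in the paper.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    """Return (labels, centers, iterations) for a list of point tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: k random initial centers
    labels = [0] * len(points)
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # Step 2: assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        # Step 3: recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:  # stable: stop repeating steps 2 and 3
            break
        centers = new_centers
    return labels, centers, iterations

# Hypothetical radar points (in meters) forming two well-separated groups.
demo_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
demo_labels, demo_centers, demo_iterations = kmeans(demo_points, k=2)
```

Note that `k` must be supplied by the caller and that the number of iterations depends on the random initial centers, which are exactly the two drawbacks named in the text.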
The density-based spatial clustering of applications with noise (DBSCAN) [22] algorithm is one of the most commonly used clustering analysis algorithms. DBSCAN has two main parameters, the distance (ε) and the minimum number of points (minPts), as shown in Figure 4. In step 1, the parameters ε and minPts are chosen. In step 2, a random sample is selected as the center point and a circle of radius ε is drawn around it. If the number of samples in the circle is at least minPts, this sample is a core point and can reach any point in the circle. If the number of samples in the circle is less than minPts, this sample is a non-core point and cannot reach any point. In step 3, step 2 is repeated until every sample has served as the center point. In step 4, the connected sample points are divided into a group, and the other outlier points can be divided into different groups by examining whether each of them can be reached.
Compared to K-means clustering, DBSCAN does not require a pre-declared number of clusters. It can find clusters of any shape, and even a cluster that encloses but does not connect to another cluster. DBSCAN can also distinguish noise with only two parameters and is almost insensitive to the order of the points in the database.
Therefore, for applications in the autonomous domain, we believe that DBSCAN is more suitable than K-means for clustering in the proposed method.

Overview
Figure 5 depicts the overall architecture of the proposed early sensor fusion method. The x and y positions and velocity indicate the relative 2-D distance (x, y) and velocity between the proposed system and the detected object. First, we obtain the mmWave radar point cloud and the corresponding image. Then, the radar point cloud is clustered and the radar and camera calibration is performed. The purpose of clustering is to find the areas where objects are really present and to filter out the radar noise. The RGB image is represented as three channels, R, G, and B, while the radar image is represented as three channels, D, V, and I (distance, velocity, and intensity). All six channels are concatenated into a multi-channel array in the early fusion process.
In Section 3.2, the clustering process and the related parameter adjustment are described in detail. The obtained clustering points are then clustered again to find out where most of the objects are present so that we can determine the ROIs of our multi-scale object detection. In Section 3.3, the radar and camera calibration is implemented to get the radar image that corresponds to the input image. In Sections 3.4-3.6, we describe in detail how to perform early fusion on the radar and camera sensors and how to determine our ROIs for multi-scale object detection. Then, in Section 3.7, we explain how the Kalman filter is used for object tracking.
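The six-channel concatenation mentioned in the overview can be sketched as follows. This is only an illustration under stated assumptions: images are nested Python lists indexed as `[row][column][channel]`, each radar point is already projected to pixel coordinates `(u, v)`, and every point occupies a single pixel with unscaled values. How the paper rasterizes and normalizes the radar channels is not specified in this section, and the function names are hypothetical.

```python
def make_radar_image(height, width, radar_pixels):
    """Build a D, V, I image from (u, v, distance, velocity, intensity) tuples."""
    radar = [[[0.0, 0.0, 0.0] for _ in range(width)] for _ in range(height)]
    for u, v, distance, velocity, intensity in radar_pixels:
        col, row = int(round(u)), int(round(v))
        if 0 <= row < height and 0 <= col < width:  # drop points outside the image
            radar[row][col] = [distance, velocity, intensity]
    return radar

def early_fuse(rgb, radar):
    """Concatenate R, G, B with D, V, I per pixel into a six-channel array."""
    return [[rgb_px + radar_px for rgb_px, radar_px in zip(rgb_row, radar_row)]
            for rgb_row, radar_row in zip(rgb, radar)]

# Hypothetical 2 x 3 image and one radar point at pixel (u=1, v=0).
demo_rgb = [[[10, 20, 30] for _ in range(3)] for _ in range(2)]
demo_radar = make_radar_image(2, 3, [(1, 0, 12.5, -3.0, 0.8)])
demo_fused = early_fuse(demo_rgb, demo_radar)
```

The fused array keeps the image resolution and simply widens the channel dimension from three to six, so a detector whose first convolution accepts six input channels can consume it directly.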

Radar Clustering
We have to set two parameters before using DBSCAN. One is the minimum number of points (minPts) required to form a cluster. The other is the distance (ε) that defines the neighborhood of each point. We have experimentally set minPts to 4 and the distance to 40 cm. Figure 6 shows the effect of using DBSCAN. In Figure 6, the green dots are the mmWave radar point cloud and the yellow dots are the clustering points after DBSCAN. The red rectangular box is the ROI of multi-scale object detection, which will be introduced in detail in a later section.
Furthermore, we need to find the area where most of the objects appear in each frame. Therefore, DBSCAN is performed again on the above clustering points after radar and camera calibration. Since there are far fewer clustering points than raw radar points, we have experimentally set minPts to 1 and the distance to 400 pixels.
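The two DBSCAN passes described above can be sketched with a small pure-Python implementation. The parameter values (4 points and 0.4 m for the first pass, 1 point and 400 pixels for the second) come from the text; the function names, the sample coordinates, and the choice to count a point as its own neighbor are assumptions of this sketch rather than details confirmed by the paper.

```python
import math

NOISE = -1

def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx], including itself."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels

# First pass on hypothetical radar points in meters (eps = 0.4 m, minPts = 4).
radar_points = [(0, 0), (0.2, 0), (0, 0.2), (0.2, 0.2), (0.1, 0.1),
                (5, 5), (5.2, 5), (5, 5.2), (5.2, 5.2), (20, 20)]
radar_labels = dbscan(radar_points, eps=0.4, min_pts=4)

# Second pass on hypothetical clustering points in pixels (eps = 400, minPts = 1).
cluster_pixels = [(100, 200), (300, 250), (1500, 220)]
roi_labels = dbscan(cluster_pixels, eps=400, min_pts=1)
```

In the first pass the isolated point is rejected as noise, while in the second pass minPts = 1 lets every clustering point seed a group, so only the pixel distance decides which clustering points share an ROI.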

Radar and Camera Calibration
To make the proposed system easier to set up and calculate the angle faster, we have derived the camera/radar calibration formula based on [23]. Figure 7 shows the early fusion device of mmWave and camera sensors. The following is a detailed description of the entire radar and camera calibration process.
For the calibration, we have three angles to be calculated, namely the yaw angle, the horizontal angle, and the pitch angle, as shown in the schematics of Figure 8. For convenience of installation and to simplify the calculation of the other angles, we have set the horizontal angle to zero. Figure 9 shows the relationship between the mmWave radar, camera, and image coordinates. First, we need to transform the radar world coordinate $O_{rw}$-$x_{rw}y_{rw}z_{rw}$ to the camera world coordinate $O_{cw}$-$x_{cw}y_{cw}z_{cw}$. Then, the camera world coordinate $O_{cw}$-$x_{cw}y_{cw}z_{cw}$ is transformed into the camera coordinate $O_c$-$x_c y_c z_c$. Finally, we transform the camera coordinate $O_c$-$x_c y_c z_c$ to the image coordinate $O_p$-$x_p y_p$.
Before these steps, we must transform the radar coordinate $O_r$-$x_r y_r$ to the radar world coordinate $O_{rw}$-$x_{rw}y_{rw}z_{rw}$. Since the radar only has 2-D coordinates and no z-axis information, we can only get the relative 2-D distance (x, y) between the radar and the object. Therefore, to get the radar world coordinates, we need to calculate the radar yaw angle and the height difference between the radar and the object.
In the case of the height difference, since the radar does not have z-axis information, we need to consider the height difference between the radar and the object to calculate the projected depth distance $y_{r\_new}$ correctly. Figure 10 shows the height relationship between the mmWave radar and the object. The parameter $y_r$ is the depth distance from the mmWave radar and $Height_{radar\_object}$ is the height difference between the mmWave radar and the object. Equation (1) gives the projected depth distance $y_{r\_new}$.
Figure 10. The height relation of mmWave radar and object.
When we get the projected depth distance $y_{r\_new}$, we also need to apply the yaw angle $\beta$ to transform from the radar coordinate $O_r$-$x_r y_r$ to the radar world coordinate $O_{rw}$-$x_{rw}y_{rw}z_{rw}$. Figure 11 and Equation (2) show the relationship between the radar coordinate, the radar world coordinate, and the yaw angle.
Figure 11. The relationship of radar coordinate, radar world coordinate, and yaw angle.
The above steps convert the radar coordinate system to the radar world coordinate system. Next, we transform the radar world coordinate system to the camera world coordinate system Ocw-xcwycwzcw, as shown in Equation (3), where "Lx" and "Ly" are the horizontal and vertical distances between the mmWave radar sensor and the camera sensor, respectively. Here, "Lx" and "Ly" are preset to zero.
$$x_{cw} = x_{rw} - L_x, \qquad y_{cw} = y_{rw} + L_y \tag{3}$$
After transferring to the camera world coordinate Ocw-xcwycwzcw, we transform the camera world coordinate to the camera coordinate Oc-xcyczc using Equation (4), where "H" and "θ" are the height and pitch angle of the camera sensor, respectively.
Then, as in the conversion from the radar coordinate to the radar world coordinate, we also account for the effect of the yaw angle "β" on the camera coordinate. Figure 12 shows the relationship between the camera coordinate, the new camera coordinate, and the yaw angle. Equation (5) transforms the original camera coordinate to the new camera coordinate influenced by "β".

  
$$\begin{aligned} x_{c\_new} &= x_c \cos\beta + z_c \sin\beta \\ y_{c\_new} &= y_c \\ z_{c\_new} &= -x_c \sin\beta + z_c \cos\beta \end{aligned} \tag{5}$$
Finally, the new camera coordinate is transformed to the image coordinate using Equation (6). The parameters "fx" and "fy" are the focal lengths, and "cx" and "cy" are the principal point coordinates of the camera sensor. These four parameters are obtained with the MATLAB camera calibration toolbox.
$$x_p = \frac{x_{c\_new}}{z_{c\_new}} \times f_x + c_x, \qquad y_p = \frac{y_{c\_new}}{z_{c\_new}} \times f_y + c_y \tag{6}$$
Figure 13 shows the experiments conducted to measure the accuracy of the radar and camera calibration. First, we measure the latitude and longitude of the system with a GPS meter and take the center point at the rear of the vehicle as the ground truth measurement point. The ground truth distance between these two points is obtained with the haversine formula [24]. To estimate the radar distance, we take the radar point cloud, calibrate it with the camera, and then use the clustering and data association algorithm [25] to determine which radar points belong to the vehicle. Finally, these radar points are averaged to obtain the radar estimated distance. Table 1 shows the results of the radar and camera calibration, which indicate that the distance error of the calibration is at most 2% over the range from 5 m to 45 m.
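Two steps of this subsection lend themselves to a short sketch: the image projection of Equations (5) and (6), and the haversine-based distance check used for validation. The code below is an illustration under stated assumptions. The function names, the mean Earth radius of 6,371,000 m, and the handling of points behind the camera are not taken from the paper, and Equation (4) is not covered because it is not reproduced in this text.

```python
import math

def project_to_image(x_c, y_c, z_c, beta_deg, fx, fy, cx, cy):
    """Apply the yaw rotation of Equation (5), then the projection of Equation (6)."""
    beta = math.radians(beta_deg)
    x_new = x_c * math.cos(beta) + z_c * math.sin(beta)
    y_new = y_c
    z_new = -x_c * math.sin(beta) + z_c * math.cos(beta)
    if z_new <= 0:
        return None  # assumption: points at or behind the image plane are skipped
    return (x_new / z_new * fx + cx, y_new / z_new * fy + cy)

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in metres between two GPS fixes given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

def relative_error_percent(estimated_m, ground_truth_m):
    """Distance error as a percentage of the ground truth distance."""
    return abs(estimated_m - ground_truth_m) / ground_truth_m * 100.0
```

In a validation run, the mean distance of the radar points associated with the vehicle would be passed as `estimated_m` and the haversine result as `ground_truth_m`.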

Radar and Camera Data Fusion
Because the mmWave radar and the camera are heterogeneous sensors whose data are used jointly by the deep learning models for object detection, the radar point cloud must be transformed to the image coordinate system, using the radar and camera calibration method discussed in Section 3.3, before the early fusion of the two sensors. In this way, the models learn not only the sizes and shapes of the objects but also their physical characteristics, resulting in better detection results.
In our experiments, we use the distance "D", velocity "V", and intensity "I" of the mmWave radar as individual channels, and the DVI channels are arranged in different combinations and combined with the camera images. Since the pixel values of the image range from 0 to 255, we need to experimentally set the maximum value of each DVI channel. To enlarge the differences in physical characteristics, we designed the DVI conversion shown in Equation (7), where "d" is the distance measured by the mmWave radar, with the maximum value set to 90 m, and "v" is the velocity measured by the mmWave radar, with the maximum value set to 33.3 m/s. As there are no negative pixel values, the absolute value of the velocity is used in Equation (7). As for the intensity, the TI IWR6843 mmWave radar only provides the signal-to-noise ratio (SNR) and noise, so we convert them into the intensity "I", whose maximum value is set to 100 dBw. Figure 14 shows the RGB image from the vision sensor and the radar image from the mmWave radar sensor. In the radar image, pixel values exceeding 255 are set to 255, and all pixels without radar points are set to zero. The camera image and the radar image are then combined into a multi-channel array.
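Equation (7) is not reproduced in this text. The sketch below illustrates the construction of the radar image and the channel-wise fusion, assuming a simple linear scaling of each quantity onto the 0 to 255 range. The actual conversion in the paper may differ, and the function names and the nested-list image representation are illustrative choices only.

```python
D_MAX_M = 90.0     # maximum distance stated in the text
V_MAX_MPS = 33.3   # maximum speed stated in the text
I_MAX = 100.0      # maximum intensity stated in the text

def scale_channel(value, max_value):
    """Linearly map [0, max_value] onto [0, 255], saturating at both ends (assumed mapping)."""
    return max(0, min(255, round(value / max_value * 255)))

def radar_image(points, width, height):
    """Rasterize radar points that were already projected to pixel coordinates.

    points: iterable of (u, v, distance_m, velocity_mps, intensity).
    Pixels without a radar point stay at zero in all three channels.
    """
    image = [[(0, 0, 0)] * width for _ in range(height)]
    for u, v, dist, vel, inten in points:
        if 0 <= u < width and 0 <= v < height:
            image[v][u] = (
                scale_channel(dist, D_MAX_M),
                scale_channel(abs(vel), V_MAX_MPS),  # no negative pixel values
                scale_channel(inten, I_MAX),
            )
    return image

def fuse(rgb_image, radar_img):
    """Concatenate RGB and radar channels pixel by pixel (early fusion input)."""
    return [
        [rgb_px + radar_px for rgb_px, radar_px in zip(rgb_row, radar_row)]
        for rgb_row, radar_row in zip(rgb_image, radar_img)
    ]
```

Dropping the last element of each radar tuple in this sketch would give an RGB + DV input; a camera-only frame corresponds to a radar image that is zero everywhere.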

Dynamic ROI for Multi-Scale Object Detection
This section proposes a method that employs the mmWave radar to dynamically find the ROI and applies it to multi-scale object detection. We also compare the fixed ROI and the dynamic ROI, as shown in Figure 15.
In the original multi-scale object detection method, only a default ROI can be set at the beginning, because the positions of the objects are not known in advance. Therefore, we propose to use the mmWave radar sensor to find the area with the most objects and set it as the new ROI. As discussed in Section 3.2, we use the clustering algorithm to cluster the radar point cloud and find the presence of objects. Then, we cluster the resulting cluster points again to find out which area has the most objects, and this region is set as the new ROI. Figure 15 shows the advantage of the dynamic ROI: when there is no object in the default ROI, the proposed dynamic ROI can find the area where objects may appear, so that the objects are detected successfully.
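The second clustering pass can be sketched as follows. This is not the authors' implementation: the clustering algorithm of Section 3.2 is not described in this section, so the sketch substitutes a simple single-linkage grouping with a distance threshold. The names `group_points` and `dynamic_roi`, the `max_gap` and `margin` values, and the (x1, y1, x2, y2) ROI format are all assumptions.

```python
from math import hypot

def group_points(points, max_gap):
    """Single-linkage grouping: points within max_gap of a group member join that group."""
    unvisited = list(points)
    groups = []
    while unvisited:
        frontier = [unvisited.pop()]
        group = []
        while frontier:
            p = frontier.pop()
            group.append(p)
            near = [q for q in unvisited if hypot(p[0] - q[0], p[1] - q[1]) <= max_gap]
            for q in near:
                unvisited.remove(q)
            frontier.extend(near)
        groups.append(group)
    return groups

def dynamic_roi(centers, default_roi, image_size, max_gap=150.0, margin=50):
    """Return a padded box around the densest group of object centers.

    centers: (x, y) pixel positions of radar clusters (one per detected object).
    Falls back to default_roi when the radar reports no objects.
    """
    if not centers:
        return default_roi
    densest = max(group_points(centers, max_gap), key=len)
    xs = [p[0] for p in densest]
    ys = [p[1] for p in densest]
    width, height = image_size
    return (
        max(0, int(min(xs)) - margin),
        max(0, int(min(ys)) - margin),
        min(width, int(max(xs)) + margin),
        min(height, int(max(ys)) + margin),
    )
```

For example, three nearby centers and one isolated center yield an ROI around the three, which is then passed to the small-scale detection branch in place of the fixed ROI.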

Object Detection Model
For ADAS applications, the object detection models must be capable of operating in real time and of detecting various objects, ranging from small objects at a distance to larger objects nearby. Therefore, we selected YOLOv3 and YOLOv4 as our object detection convolutional neural network (CNN) models. As the inputs must be the fusion of mmWave radar and camera data, and the available open datasets comprise only image data, we recorded our own dataset including both radar data and image data. Additionally, we need to label the collected dataset ourselves, since the available open datasets are not directly suitable for training the proposed model.
To solve this problem, we used camera-only datasets, such as the COCO dataset [16] and the VisDrone dataset [26], to increase the amount of training data. Since these open datasets contain only camera data, we set all pixel values in the radar channels to zero. Figure 16 shows examples of the datasets. The Chinese characters in Figure 16a are a traffic rule carved on the road, and those in Figure 16b are the name of a business unit. Considering that our applications also require RSU perspectives, we used VisDrone and the blind-spot datasets to fit our real-life traffic scenario requirements.

Tracking
Using the object detection model, we obtain the bounding box and the type of each object, such as a person, car, motorcycle, or truck. The bounding boxes are used as the input of the trackers. Unlike late fusion of the radar and camera sensors, we do not track the radar data and the camera bounding boxes separately. We only need to track the bounding boxes generated by the object detection model [27,28].
However, certain pre-processing steps are still needed before feeding the bounding boxes to the trackers. Equation (8) defines the intersection over union (IoU), which is the overlap area divided by the total (union) area. Figure 17 shows the schematic diagram of IoU. The IoU is computed between the bounding boxes of the tracker and those of the object detection model. When the IoU value is higher than the set threshold, we treat both boxes as the same object. Based on the characteristics of the RSU application field, we adopt the Kalman filter to implement the tracking, so that the trackers keep their motion information and the ID switch issue is mitigated.
$$\mathrm{IoU} = \frac{\text{overlap area}}{\text{total area}} \tag{8}$$
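The association step built on Equation (8) can be sketched as below. The box format (x1, y1, x2, y2), the greedy matching order, and the threshold value of 0.3 are assumptions made for illustration. The paper does not state its threshold here, and the Kalman filter prediction that would supply the tracker boxes is omitted.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes, as in Equation (8)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    overlap = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    total = area_a + area_b - overlap
    return overlap / total if total > 0 else 0.0

def match_detections(tracks, detections, threshold=0.3):
    """Greedily pair tracker boxes with detections whose IoU exceeds the threshold.

    tracks: dict mapping track id -> predicted box.
    detections: list of boxes from the detection model.
    Returns a dict mapping track id -> index of the matched detection.
    """
    scored = sorted(
        ((iou(t_box, d_box), t_id, d_idx)
         for t_id, t_box in tracks.items()
         for d_idx, d_box in enumerate(detections)),
        reverse=True,
    )
    matches, used = {}, set()
    for score, t_id, d_idx in scored:
        if score <= threshold:
            break  # remaining pairs overlap too little to be the same object
        if t_id in matches or d_idx in used:
            continue
        matches[t_id] = d_idx
        used.add(d_idx)
    return matches
```

Unmatched detections would start new tracks, and unmatched tracks would continue on their motion prediction alone, which is what lets an ID survive a missed detection.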

Sensor Fusion Equipment
In the proposed work, the TI IWR6843 is chosen as the mmWave radar. The IWR6843 mmWave radar has four receive antennas (RX) and three transmit antennas (TX). As this radar sensor has its own DSP core to process the radar signal, we can directly obtain the radar point cloud for experiments. Figure 18a shows the TI IWR6843 mmWave radar sensor employed in this paper. Since our system is installed at a certain height above the vehicles, overlooking the ground, we choose this radar sensor for its large vertical field of view of 44°, which suits this application.
The IP-2CD2625F-1ZS IP camera shown in Figure 18b is employed in the proposed work. It offers 30 fps at a high image resolution of 1920 × 1080. The camera is waterproof and dustproof and images clearly against strong backlight, which helps to overcome the impact of the harsh environment in ADAS scenarios.

Figure 18. (a) TI IWR6843 mmWave radar [29]; (b) IP-2CD2625F-1ZS IP camera.
We choose the NVIDIA Jetson AGX Xavier [30] as the embedded platform to demonstrate the portability of the proposed early fusion system for the radar and camera sensors. The NVIDIA Jetson AGX Xavier comes with a pre-installed Linux environment. With the NVIDIA Jetson AGX Xavier, shown in Figure 19, we can easily create and deploy end-to-end deep learning applications. It can be regarded as an AI computer for autonomous machines, offering GPU workstation-class performance in an embedded module under 30 W. Therefore, the NVIDIA Jetson AGX Xavier enables our proposed algorithm to be conveniently implemented for low-power applications.

Implementation Details
We collected 8285 frames of radar/camera training data, using a multi-threading approach to capture the latest radar and camera data in each loop, and used 78,720 frames from camera-only datasets to make up for the lack of data. For testing, we collected 896 images for each of four conditions: morning, noon, evening, and night. Figure 20 shows the ROIs for multi-scale object detection: a big ROI covering the entire image and a small ROI for the distant region. As the mmWave radar used in this work only detects objects up to around 50 m in the given field, we measure the accuracy within the 50-m range, as shown in Figure 21.
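The multi-threaded capture mentioned above can be sketched with one reader thread per sensor that overwrites a shared "latest sample" slot, so that each processing loop pairs the most recent radar and camera data. This is an illustrative pattern, not the authors' code: `LatestSample`, `run_reader`, and the `read_fn` callables standing in for the radar and camera drivers are hypothetical.

```python
import threading

class LatestSample:
    """Thread-safe holder that keeps only the most recent sample."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None

    def put(self, value):
        with self._lock:
            self._value = value

    def get(self):
        with self._lock:
            return self._value

def run_reader(read_fn, holder, stop_event):
    """Keep polling a sensor until asked to stop.

    read_fn is a hypothetical blocking call that returns one sensor sample.
    """
    while not stop_event.is_set():
        holder.put(read_fn())

def start_reader(read_fn, holder, stop_event):
    """Start a daemon thread that runs run_reader for one sensor."""
    thread = threading.Thread(
        target=run_reader, args=(read_fn, holder, stop_event), daemon=True
    )
    thread.start()
    return thread
```

The main loop would then call `get()` on the radar and camera holders once per iteration. Older samples are discarded rather than queued, which keeps the two streams close in time at the cost of dropping frames when processing is slower than the sensors.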
Table 2 shows the accuracy of the YOLOv3 model with camera-only datasets and multi-scale object detection.
Table 2. Evaluation on YOLOv3 with camera-only datasets and multi-scale object detection, where the readings in red highlight the highest value in each row.
Table 3 shows the accuracies of the proposed method on the YOLOv4 model with camera-only datasets and multi-scale object detection. Table 4 compares the best radar and camera fusion configurations of the YOLOv3 and YOLOv4 models. The left-hand side of the table reports the models trained without the camera-only data, and the right-hand side reports the models trained with the camera-only data. Table 4 indicates that the YOLOv3 model yields the best results when the input type is RGB + DV and multi-scale object detection is used.
Table 5 compares the accuracy of the FP32 and FP16 RGB + DV models, the proposed system in INT8, and the late fusion method [4]. The proposed system has the best recall because of the addition of the Kalman filter, but its precision is reduced because of the ghost frame.

Proposed System Performance
Compared to the late fusion method, the proposed system is better in terms of precision, recall, and mAP. In addition, when implemented on the NVIDIA Jetson AGX Xavier, the proposed system runs at an average of 17.39 fps, compared with 12.45 fps for the late fusion method. Table 6 compares the proposed system and the late fusion method in rainy conditions. With the early fusion of DV and RGB data from the mmWave radar and the RGB camera, the proposed system improves the overall mAP by 10.4% relative to the late fusion method. Figure 22 shows result images for various scenarios, in which the proposed system reports the ID, type, x-y coordinate, and velocity of the detected objects.

Conclusions
The mmWave radar/camera early fusion algorithm proposed in this paper is mainly designed to solve the decision-making challenges encountered in late sensor fusion methods, and it improves the detection and tracking of objects while attaining real-time operational performance. The proposed system combines the advantages of mmWave radar and vision sensors. Compared with the camera-only object detection model, Tables 2 and 3 show a significant improvement in the detection accuracies of the proposed design.
Compared to the radar/camera late fusion method, the proposed system not only has better overall accuracy but also runs about 5 fps faster. Unlike the camera-only object detection model, the proposed system additionally provides the relative x-y coordinates and the relative velocity of the detected objects. For RSU applications, the proposed system can provide accurate relative positions of objects. Table 1 shows the distance errors of the proposed system, which are at most 2% over the range from 5 m to 45 m.
However, there is scope for future work to improve the proposed early sensor fusion method. The mmWave radar used in this paper outputs only around 30 to 70 radar points, which may not be enough in complex scenes. To overcome this limitation, the radar equipment can be improved in future work to obtain more radar information about the position and velocity of the detected objects.
Data Availability Statement: The publicly available datasets can be found at https://cocodataset.org/#home (accessed on 5 January 2021) and https://github.com/VisDrone/VisDrone-Dataset (accessed on 5 January 2021).