Sensors
  • Article
  • Open Access

2 March 2023

Deep Learning Derived Object Detection and Tracking Technology Based on Sensor Fusion of Millimeter-Wave Radar/Video and Its Application on Embedded Systems

1 Institute of Electronics, National Yang Ming Chiao Tung University, Hsinchu 30010, Taiwan
2 Pervasive Artificial Intelligence Research (PAIR) Labs, National Yang Ming Chiao Tung University, Hsinchu 30010, Taiwan
3 Wistron-NCTU Embedded Artificial Intelligence Research Center, National Yang Ming Chiao Tung University, Hsinchu 30010, Taiwan
4 Department of Multimedia, Mediatek Inc., Hsinchu 30010, Taiwan
This article belongs to the Special Issue Single Sensor and Multi-Sensor Object Identification and Detection with Deep Learning

Abstract

This paper proposes a deep learning-based mmWave radar and RGB camera early sensor fusion method for object detection and tracking, together with its embedded-system realization for ADAS applications. The proposed system can be used not only in ADAS systems but can also be applied to smart Road Side Units (RSU) in transportation systems to monitor real-time traffic flow and warn road users of probable dangerous situations. Because mmWave radar signals are far less affected by weather and lighting conditions such as cloudy, sunny, snowy, rainy, and night-time scenes, the system works efficiently in both normal and adverse conditions. Compared to using an RGB camera alone for object detection and tracking, the early fusion of mmWave radar and RGB camera data compensates for the poor performance of the RGB camera when it fails due to bad weather and/or lighting conditions. The proposed method combines the features of the radar and the RGB camera and directly outputs the results from an end-to-end trained deep neural network. Additionally, the complexity of the overall system is reduced such that the proposed method can be implemented on PCs as well as on embedded systems such as the NVIDIA Jetson Xavier, where it runs at 17.39 fps.

1. Introduction

In recent years, Advanced Driving Assistance Systems (ADAS) have greatly promoted safe driving and can help avoid dangerous driving events, saving lives and preventing damage to infrastructure. Considering safety in autonomous system applications, it is crucial to accurately understand the surrounding environment under all circumstances and conditions. In general, autonomous systems need to estimate the positions and velocities of probable obstacles and make decisions that ensure safety. The inputs to an ADAS are provided by various sensors, such as millimeter-wave (mmWave) radars, cameras, the controller area network (CAN) bus, and light detection and ranging (LiDAR), which help road users perceive the surrounding environment and make correct decisions for safe driving. Figure 1 shows the various equipment essential in an ADAS system.
Figure 1. The various devices required in collecting the inputs in an ADAS system.
Vision sensors are the most common sensors around us and their applications are everywhere. They have many advantages, such as high resolution, high frame rate, and low hardware cost. As deep learning (DL) has become extremely popular [,], the importance of vision sensors has grown steadily. Since visual sensors preserve the appearance information of the targets, they are well suited to DL technology. Accordingly, camera-only object detection is widely utilized in many fields and applications, such as smart roadside units (RSU), self-driving vehicles, and smart surveillance. However, camera-based object detection results are severely affected by ambient light and adverse weather conditions. Although a camera can distinguish object types well, it cannot accurately obtain physical characteristics such as the actual distance and velocity of the detected objects. In the ADAS industry, many companies with relevant research, such as Tesla, Google, and Mobileye, use additional sensors in their self-driving cars to make up for camera failures in bad weather and lighting conditions such as night, foggy, dusky, and rainy scenes, as shown in Figure 2.
Figure 2. (a) Night scene; (b) Rainy day.
In contrast to the camera, mmWave radars provide the actual distance and velocity of a detected object relative to the radar, and they can also provide the intensity of the object as a reference for identification. This makes the mmWave radar a good choice to employ together with a camera in sensor fusion applications, yielding better detection and tracking performance in all weather and lighting conditions. Compared to LiDAR, the mmWave radar has better penetration and is cheaper. Although mmWave radar produces a much sparser point cloud than LiDAR, the points are easy to group into objects with a clustering algorithm. When the advantages of the mmWave radar and the camera are put to good use, the two sensors complement each other and provide a better perception capability than expensive 3-D LiDARs.
Three main fusion schemes have been proposed to use mmWave radar and camera together, namely (i) decision-level fusion, (ii) data-level fusion, and (iii) feature-level fusion []. For autonomous systems, different sensors can make up for the shortcomings of the others and cope with adverse situations through sensor fusion. Hence, we consider radar and camera sensor fusion to be more reliable for drivers than utilizing a single sensor, whether radar-only or camera-only.
The following sections of the paper discuss the related works, comprising three existing mmWave radar and camera sensor fusion methods, followed by the steps involved in the proposed early fusion technology of mmWave radar and camera sensors, the experimental results, and the conclusion.

Motivation

This paper focuses on the early fusion of the mmWave radar and camera sensors for object detection and tracking, since the late fusion of the mmWave radar and camera sensors belongs to decision-level fusion [,,]. In late fusion, the mmWave radar sensor and camera sensor first detect obstacles individually, and the prediction results are then fused to obtain the final output. However, different kinds of detection noise are involved in the predictions of these two heterogeneous sensors. Therefore, how to fuse the prediction results of the two kinds of sensors is a great challenge in the late fusion of the mmWave radar and camera sensors.
To solve the above problem, this paper proposes the early fusion of the mmWave radar and camera sensors, which is also known as feature-level fusion. To begin with, we transform the radar point cloud from the radar coordinates to the image coordinates. In the process, we add information such as the distance, velocity, and intensity of the detected objects from the radar point cloud to radar image channels corresponding to the different physical characteristics of the detected objects. Finally, we fuse the visual image and the radar image into a multi-channel array and utilize a DL object detection model to extract the information from both sensors. Through the object detection model, the early fusion of the mmWave radar and camera sensors learns the relationship between the data from the two sensors, which not only solves the problem of decision-level fusion but also addresses the problem of detection in harsh environments when using the camera only.
This paper is organized as follows. Section 2 reviews the related works, which include the three existing mmWave radar and camera sensor fusion methods, related deep learning object detection models, some radar signal processing algorithms, and adopted image processing methods followed by the introduction of the proposed early fusion technology on the mmWave radar and camera sensors in detail in Section 3. Section 4 depicts our experiments and results along with the conclusion and future works in Section 5.

3. The Proposed Method

3.1. Overview

Figure 5 depicts the overall architecture of the proposed early sensor fusion method. The x and y positions and velocity indicate the relative 2-D distance (x, y) and velocity between the proposed system and the detected object. First, we obtain the mmWave radar point cloud and the corresponding image. Then, the radar point cloud is clustered and the radar and camera calibration is performed. The purpose of clustering is to find the areas where objects are really present and to filter out the radar noise. The RGB image is represented as three channels, R, G, and B, while the radar image is represented as D, V, and I. All six channels are concatenated into a multi-channel array in the early fusion process. In Section 3.2, the clustering process and the related parameter adjustment are described in detail. The obtained clustering points are then clustered again to find out where most of the objects are present so that we can determine the ROIs of our multi-scale object detection. In Section 3.3, the radar and camera calibration is implemented to get the radar image that corresponds to the input image. Section 3.4, Section 3.5 and Section 3.6 give a detailed description of how to perform early fusion on the radar and camera sensors and how to determine the ROIs for multi-scale object detection, respectively. Then, in Section 3.7, we explain how the Kalman filter is used for object tracking.
Figure 5. The overall architecture of the proposed method.

3.2. Radar Clustering

We have to set two parameters before using DBSCAN. One is the minimum number of points required to form a cluster, and the other is the neighborhood distance within which points are grouped into the same cluster. We have experimentally set the minimum number of points to 4 and the distance to 40 cm. Figure 6 shows the effect of applying DBSCAN. In Figure 6, the green dots are the mmWave radar point cloud and the yellow dots are the clustering points after DBSCAN. The red rectangular box is the ROI of multi-scale object detection, which will be introduced in detail in a later section.
Figure 6. The results of using DBSCAN: green dot: mmWave radar point cloud; yellow dot: clustering points.
Furthermore, we need to find the area where most of the objects appear in each frame. Therefore, DBSCAN is performed again on the above clustering points after the radar and camera calibration. Since the number of clustered points is much smaller, we have experimentally set the minimum number of points to 1 and the distance to 400 pixels, as sketched below.
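A minimal sketch of this two-stage clustering, using scikit-learn's DBSCAN, is shown below. The function names, the point-cloud array layout, and the use of cluster centroids are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_radar_points(points_xy_m):
    """First pass on raw radar points in metres: min_samples=4, eps=0.4 m (40 cm)."""
    points_xy_m = np.asarray(points_xy_m)
    labels = DBSCAN(eps=0.4, min_samples=4).fit_predict(points_xy_m)
    centers = [points_xy_m[labels == k].mean(axis=0)
               for k in set(labels) if k != -1]          # label -1 is DBSCAN noise
    return np.array(centers)

def cluster_projected_centers(centers_px):
    """Second pass on the calibrated (image-plane) cluster centres:
    min_samples=1, eps=400 pixels."""
    if len(centers_px) == 0:
        return np.empty((0,), dtype=int)
    return DBSCAN(eps=400.0, min_samples=1).fit_predict(np.asarray(centers_px))
```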

3.3. Radar and Camera Calibration

To make the proposed system easier to set up and the angles faster to compute, we have derived the camera/radar calibration formula based on []. Figure 7 shows the early fusion device of the mmWave radar and camera sensors. The following is a detailed description of the entire radar and camera calibration process.
Figure 7. The setting of the early fusion on the mmWave radar and camera sensors.
For the calibration, three angles have to be considered, namely the yaw angle, the horizontal angle, and the pitch angle, as shown in the schematics of Figure 8. For the convenience of installation and to simplify the calculation of the other angles, we set the horizontal angle to zero. Figure 9 shows the relationship between the mmWave radar, camera, and image coordinates. First, we need to transform the radar world coordinate O_rw-x_rw y_rw z_rw to the camera world coordinate O_cw-x_cw y_cw z_cw. Then, the camera world coordinate O_cw-x_cw y_cw z_cw is transformed into the camera coordinate O_c-x_c y_c z_c. Finally, we transform the camera coordinate O_c-x_c y_c z_c to the image coordinate O_p-x_p y_p.
Figure 8. Schematic diagram of mmWave radar installation.
Figure 9. The relationship of mmWave radar, camera, and image coordinates.
First, we must transform the radar coordinate O_r-x_r y_r to the radar world coordinate O_rw-x_rw y_rw z_rw. Since the radar only has 2-D coordinates and no z-axis information, we can only get the relative 2-D distance (x, y) between the radar and the object. Therefore, to get the radar world coordinates, we need to calculate the radar yaw angle and the height difference between the radar and the object. Regarding the height difference, since the radar has no z-axis information, we need to consider the height difference between the radar and the object to calculate the projected depth distance "y_r_new" correctly. Figure 10 shows the height relationship between the mmWave radar and the object. The parameter "y_r" is the depth distance measured by the mmWave radar and "Height_radar_object" is the height difference between the mmWave radar and the object. Equation (1) shows how we obtain the projected depth distance "y_r_new".
$y_{r\_new} = \sqrt{y_r^2 - \mathrm{Height}_{radar\_object}^2}$ (1)
Figure 10. The height relation of mmWave radar and object.
When we get the projected depth distance "y_r_new", we also need to apply the yaw angle "β" to transform from the radar coordinate O_r-x_r y_r to the radar world coordinate O_rw-x_rw y_rw z_rw. Figure 11 and Equation (2) show the relationship between the radar coordinate, the radar world coordinate, and the yaw angle.
$x_{rw} = x_r \cos\beta + y_{r\_new} \sin\beta, \qquad y_{rw} = -x_r \sin\beta + y_{r\_new} \cos\beta$ (2)
Figure 11. The relationship of radar coordinate, radar world coordinate, and yaw angle.
The above steps convert the radar coordinate system to the radar world coordinate system. Then we need to transform the radar world coordinate system to the camera world coordinate system O_cw-x_cw y_cw z_cw, as shown in Equation (3). In Equation (3), "L_x" and "L_y" are the horizontal and vertical distances between the mmWave radar sensor and the camera sensor, respectively. In our setup, "L_x" and "L_y" are preset to zero.
$x_{cw} = x_{rw} - L_x, \qquad y_{cw} = y_{rw} + L_y$ (3)
After transferring to the camera world coordinate O_cw-x_cw y_cw z_cw, we need to transform the camera world coordinate to the camera coordinate O_c-x_c y_c z_c. Equation (4) is used for this transformation, where the parameters "H" and "θ" are the height and pitch angle of the camera sensor, respectively.
$\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \sin\theta & \cos\theta \\ 0 & \cos\theta & \sin\theta \end{bmatrix} \begin{bmatrix} x_{cw} \\ y_{cw} \\ z_{cw} \end{bmatrix} + \begin{bmatrix} 0 \\ H\cos\theta \\ H\sin\theta \end{bmatrix}$ (4)
Then, similar to the above conversion from the radar coordinate to the radar world coordinate, we also account for the effect of the yaw angle "β" on the camera coordinate. Figure 12 shows the relationship between the camera coordinate, the new camera coordinate, and the yaw angle. Equation (5) transfers the original camera coordinate to the new camera coordinate influenced by "β".
$x_{c\_new} = x_c \cos\beta + z_c \sin\beta, \qquad y_{c\_new} = y_c, \qquad z_{c\_new} = -(x_c \sin\beta) + z_c \cos\beta$ (5)
Figure 12. The relationship of camera coordinate, new camera coordinate, and yaw angle.
Finally, Equation (6) transfers the new camera coordinate to the image coordinate. The parameters "f_x" and "f_y" are the focal lengths, and "c_x" and "c_y" are the principal points of the camera sensor. These four intrinsic parameters are calculated using the MATLAB camera calibration toolbox.
$x_p = \dfrac{x_{c\_new}}{z_{c\_new}} \times f_x + c_x, \qquad y_p = \dfrac{y_{c\_new}}{z_{c\_new}} \times f_y + c_y$ (6)
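To make the calibration chain concrete, the sketch below strings Equations (1)–(6) together into a single radar-point-to-pixel projection. It is a minimal illustration: the function name, the argument list, and the handling of z_cw (assumed zero, i.e., a ground-plane point) are our assumptions, and the rotation terms follow Equations (2), (4), and (5) as printed, so the signs may need adjustment for a particular mounting.

```python
import numpy as np

def radar_to_pixel(x_r, y_r, beta, height_radar_object,
                   L_x, L_y, H, theta, f_x, f_y, c_x, c_y, z_cw=0.0):
    """Project a 2-D radar detection (x_r, y_r) onto the image plane (x_p, y_p).
    beta: yaw angle, theta: camera pitch angle (radians); H: camera height (m)."""
    # Eq. (1): compensate for the radar/object height difference
    y_r_new = np.sqrt(max(y_r ** 2 - height_radar_object ** 2, 0.0))
    # Eq. (2): radar coordinate -> radar world coordinate (yaw rotation)
    x_rw = x_r * np.cos(beta) + y_r_new * np.sin(beta)
    y_rw = -x_r * np.sin(beta) + y_r_new * np.cos(beta)
    # Eq. (3): radar world -> camera world (sensor offsets; zero in this setup)
    x_cw, y_cw = x_rw - L_x, y_rw + L_y
    # Eq. (4): camera world -> camera coordinate (pitch theta, camera height H)
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.sin(theta), np.cos(theta)],
                  [0.0, np.cos(theta), np.sin(theta)]])
    t = np.array([0.0, H * np.cos(theta), H * np.sin(theta)])
    x_c, y_c, z_c = R @ np.array([x_cw, y_cw, z_cw]) + t
    # Eq. (5): account for the yaw angle in the camera frame
    x_cn = x_c * np.cos(beta) + z_c * np.sin(beta)
    y_cn = y_c
    z_cn = -(x_c * np.sin(beta)) + z_c * np.cos(beta)
    # Eq. (6): pinhole projection with the camera intrinsics
    return x_cn / z_cn * f_x + c_x, y_cn / z_cn * f_y + c_y
```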
Figure 13 shows the experiments conducted to measure the accuracy of the radar and camera calibration. First, we measure the latitude and longitude of the system with a GPS meter and then use the center point position behind the vehicle as the ground truth measurement point. These two points are used to obtain the ground truth distance by using the haversine formula []. To estimate the radar distance, we take the radar point cloud information and calibrate it with the camera, and then use the clustering and data association algorithm [] to find out which radar points belong to the vehicle. Finally, the radar points belonging to the vehicle are averaged to obtain the radar estimated distance. Table 1 shows the results of the radar and camera calibration, which indicates that the distance error of the calibration is at most 2% ranging from 5 m to 45 m.
Figure 13. (a) Car drives directly in front of our system; (b) Car drives 25 degrees to the left of our system.
Table 1. The distance error of radar and camera calibration from 5 m to 45 m.
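For reference, the ground-truth distance between the two GPS-surveyed points can be computed with the haversine formula as in the short sketch below, which assumes coordinates in decimal degrees and a spherical Earth with a radius of 6,371 km.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, earth_radius_m=6_371_000.0):
    """Great-circle distance in metres between two lat/lon points (decimal degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))
```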

3.4. Radar and Camera Data Fusion

As mmWave radar and camera sensors are heterogeneous and are used by the deep learning models for object detection, it is essential to transform the radar point cloud information to the image coordinate system for the early fusion of the radar and camera sensors using the radar and camera calibration method discussed in Section 3.3. In this way, we can not only make the models learn the sizes and shapes of the objects but also let them learn the physical characteristics of the objects resulting in better detection results.
In our experiments, we use the distance "D", velocity "V", and intensity "I" of the mmWave radar as individual channels, and different combinations of the D, V, and I channels are combined with the camera images. Since the pixel values of the image range from 0 to 255, we need to experimentally set the maximum value of each DVI channel. To magnify the differences in the physical characteristics, we design the DVI conversion shown in Equation (7), where the parameter "d" is the distance measured by the mmWave radar, with the maximum value set to 90 m, and the parameter "v" is the velocity measured by the mmWave radar, with the maximum value set to 33.3 m/s. As there are no negative pixel values, we use the absolute value of the velocity in Equation (7). As for the intensity "I", the TI IWR6843 mmWave radar only provides the signal-to-noise ratio (SNR) and the noise, so we convert them into the intensity "I", whose maximum value is set to 100 dBW. Figure 14 shows the RGB image from the vision sensor and the radar image from the mmWave radar sensor. For the radar image, pixel values exceeding 255 are clipped to 255, and all pixels where there are no radar points are set to zero. Once we have the camera image and the radar image, we combine the two images into a multi-channel array.
$D = d \times 2.83, \qquad V = |v| \times 7.65, \qquad I = \left(10 \log_{10}\!\left(10^{\,SNR \times 0.01} \times (P_{Noise} \times 0.1)\right)\right) \times 2.55$ (7)
Figure 14. (a) Camera image; (b) radar image.
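As an illustration of how Equation (7) can be applied, the sketch below rasterizes radar detections into D, V, and I channels and concatenates them with the RGB image into the six-channel early fusion input. The point-cloud field names, the one-pixel-per-point rendering, and the projection callback (standing in for the calibration of Section 3.3) are assumptions for this sketch, not the authors' implementation.

```python
import numpy as np

def build_fused_array(rgb, radar_points, project_fn):
    """rgb: HxWx3 uint8 image. radar_points: iterable of dicts with assumed keys
    'x', 'y' (m), 'v' (m/s), 'snr', 'noise'. project_fn maps radar (x, y) to a
    pixel (u, v) as in Section 3.3. Returns an HxWx6 array (R, G, B, D, V, I)."""
    h, w, _ = rgb.shape
    dvi = np.zeros((h, w, 3), dtype=np.uint8)            # zero where no radar point
    for p in radar_points:
        u, v = (int(round(c)) for c in project_fn(p['x'], p['y']))
        if not (0 <= u < w and 0 <= v < h):
            continue
        d = np.hypot(p['x'], p['y'])                     # range in metres
        D = min(d * 2.83, 255.0)                         # 90 m      -> 255 (Eq. 7)
        V = min(abs(p['v']) * 7.65, 255.0)               # 33.3 m/s  -> 255
        power_dbw = 10.0 * np.log10(10 ** (p['snr'] * 0.01) * (p['noise'] * 0.1))
        I = min(max(power_dbw, 0.0) * 2.55, 255.0)       # 100 dBW   -> 255
        dvi[v, u] = (D, V, I)
    return np.concatenate([rgb, dvi], axis=2)            # early fusion input
```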

3.5. Dynamic ROI for Multi-Scale Object Detection

This section proposes a method that employs the mmWave radar to dynamically find the ROI for multi-scale object detection. We also compare the fixed ROI and the dynamic ROI, as shown in Figure 15.
Figure 15. (a) Fixed ROI; (b) Dynamic ROI.
In the original multi-scale object detection method, only a default ROI can be set at the beginning because the positions of the objects are not known in advance. Therefore, we propose to use the mmWave radar sensor to find the area with the most objects and set it as the new ROI. As discussed in Section 3.2, we use the clustering algorithm to cluster the radar point cloud and find where objects are present. Then, we cluster the clustering points again to find out which area contains the most objects. This region is set as the new ROI found from the mmWave radar point cloud, as sketched below. Figure 15 shows the advantage of the dynamic ROI: when there is no object in the default ROI, the proposed dynamic ROI can still find the area where objects may appear, allowing them to be detected successfully.
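A minimal sketch of turning the densest cluster of projected radar points into a dynamic ROI is given below. The ROI size, the default ROI, and the clustering call mirror the parameters in Section 3.2 but are otherwise illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dynamic_roi(centers_px, roi_w=416, roi_h=416, img_w=1920, img_h=1080,
                default_roi=(752, 332, 1168, 748)):
    """Centre the ROI on the image-plane cluster containing the most radar
    detections; fall back to a default ROI when no radar points are available."""
    if len(centers_px) == 0:
        return default_roi
    centers_px = np.asarray(centers_px)
    labels = DBSCAN(eps=400.0, min_samples=1).fit_predict(centers_px)
    densest = max(set(labels), key=lambda k: int(np.sum(labels == k)))
    cx, cy = centers_px[labels == densest].mean(axis=0)
    x0 = int(np.clip(cx - roi_w / 2, 0, img_w - roi_w))
    y0 = int(np.clip(cy - roi_h / 2, 0, img_h - roi_h))
    return x0, y0, x0 + roi_w, y0 + roi_h
```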

3.6. Object Detection Model

For ADAS applications, the object detection models must be capable of operating in real time and of detecting objects ranging from small, distant objects to near, bigger objects. Therefore, we selected YOLOv3 and YOLOv4 as our object detection convolutional neural network (CNN) models. As the inputs must be the fusion of mmWave radar and camera data while the available open datasets contain only image data, the open datasets are unsuitable for training the proposed model; we therefore recorded and labeled our own dataset including both radar data and image data.
To compensate for the limited amount of self-collected data, we additionally used camera-only datasets, such as the COCO dataset [] and the VisDrone dataset [], to increase the amount of training data. Since these open datasets contain only camera data, we set all pixel values in the radar channels to zero. Figure 16 shows examples of the datasets; the Chinese characters in Figure 16a are traffic rules painted on the road, and those in Figure 16b are the name of a business unit. Considering that our applications also require RSU perspectives, we used the VisDrone and blind-spot datasets to fit our real-life traffic scenario requirements.
Figure 16. Examples of the datasets we adopted: (a) Blind spot, and (b) VisDrone.

3.7. Tracking

Using the object detection model, we obtain the bounding boxes and the type of each detected object, such as a person, car, motorcycle, or truck. We use these bounding boxes as the input to the trackers. Unlike the late fusion of the radar and camera sensors, we do not track the radar data and the camera bounding boxes separately; we only need to track the bounding boxes generated by the object detection model [,].
However, we still need to carry out certain pre-processing steps before feeding the bounding boxes to the trackers. Equation (8) defines the intersection over union (IoU), which is the overlapped area divided by the total (union) area; Figure 17 shows the schematic diagram of the IoU. The IoU is computed between the bounding boxes of the tracker and of the object detection model. When the IoU value is higher than the set threshold, we treat both as the same object. Based on the characteristics of the RSU application field, we adopt the Kalman filter to implement the tracking, ensuring that the trackers keep their motion information and thereby mitigating the ID-switch issue.
$IoU = \dfrac{\text{overlap area}}{\text{total area}}$ (8)
Figure 17. The schematic diagram of IoU.
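The IoU test of Equation (8) and the tracker/detection association it drives can be sketched as follows. The box format, the greedy matching loop, and the threshold handling are illustrative assumptions; the Kalman filter state model itself is not reproduced here.

```python
def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2). Returns overlap area / total (union) area, Eq. (8)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, iou_thresh=0.5):
    """Greedy matching: each detection is paired with the best unmatched track
    whose IoU exceeds the threshold; the rest are reported as unmatched."""
    matches, unmatched, used = [], [], set()
    for d, det in enumerate(detections):
        best, best_iou = None, iou_thresh
        for t, trk in enumerate(tracks):
            if t in used:
                continue
            score = iou(trk, det)
            if score >= best_iou:
                best, best_iou = t, score
        if best is None:
            unmatched.append(d)
        else:
            used.add(best)
            matches.append((best, d))
    return matches, unmatched
```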

4. Experimental Evaluation

4.1. Sensor Fusion Equipment

In the proposed work, the TI IWR6843 is chosen as the mmWave radar. The IWR6843 mmWave radar has four receive antennas (RX) and three transmit antennas (TX). As this radar sensor has its own DSP core to process the radar signal, we can directly obtain the radar point cloud for our experiments. Figure 18a shows the TI IWR6843 mmWave radar sensor employed in this paper. Since our application mounts the sensors at a certain height above the vehicles, overlooking the ground, we chose this radar sensor because its large vertical field of view of 44° facilitates this setup.
Figure 18. (a) TI IWR6843 mmWave radar []; (b) IP-2CD2625F-1ZS IP camera.
The IP-2CD2625F-1ZS IP camera shown in Figure 18b is employed in the proposed work. It offers 30 fps at a high image resolution of 1920 × 1080. The camera's waterproof and dustproof design and its clear imaging against strong backlight help overcome the impact of harsh environments in ADAS scenarios.
We chose the NVIDIA Jetson AGX Xavier [] as the embedded platform to demonstrate the portability of the proposed early fusion system for the radar and camera sensors. The NVIDIA Jetson AGX Xavier comes with a pre-installed Linux environment. With the NVIDIA Jetson AGX Xavier, shown in Figure 19, we can easily create and deploy end-to-end deep learning applications. It can be regarded as an AI computer for autonomous machines, offering GPU-workstation-class performance in an embedded module under 30 W. Therefore, the NVIDIA Jetson AGX Xavier enables the proposed algorithm to be conveniently implemented for low-power applications.
Figure 19. NVIDIA Jetson AGX Xavier [].

4.2. Implementation Details

We collected 8285 frames of training data as radar/camera datasets by using a multi-threading approach to capture the latest radar and camera data in each loop, and we used 78,720 frames of camera-only datasets to make up for the lack of data. For testing purposes, we collected 896 images for each of four conditions, namely morning, noon, evening, and night. Figure 20 shows the ROIs for multi-scale object detection: a big ROI covers the entire image, and a small ROI is used for the distant region. As the mmWave radar used in this work has an effective detection range of only around 50 m in a given field, we measure accuracy within the 50-m range, as shown in Figure 21.
Figure 20. The ROI for multi-scale object detection.
Figure 21. The ROI of calculation accuracy.
The accuracies of the YOLOv3 and YOLOv4 models with and without the camera-only datasets, and a comparison of the effects with and without multi-scale object detection, are tabulated in Section 4.3. The input sizes are set to 416 × 416 × N, where "N" is the number of channels of the input arrays. The confidence thresholds are set at 0.2 for pedestrians, 0.2 for bicycles, 0.2 for motorcycles, 0.4 for cars, and 0.4 for full-size vehicles, and the IoU threshold is set to 0.5; these settings are summarized in the sketch below.
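For reference, the thresholds above can be collected into a small configuration table and applied per class (a sketch; the class names and the helper function are assumptions following the categories mentioned in the text):

```python
DETECTION_CONFIG = {
    "input_size": (416, 416),       # spatial size; channel count N depends on the fusion type
    "iou_threshold": 0.5,
    "confidence_thresholds": {
        "pedestrian": 0.2,
        "bicycle": 0.2,
        "motorcycle": 0.2,
        "car": 0.4,
        "full-size vehicle": 0.4,
    },
}

def keep_detection(class_name, score, cfg=DETECTION_CONFIG):
    """Apply the per-class confidence threshold to a raw detection score."""
    return score >= cfg["confidence_thresholds"].get(class_name, 0.5)
```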

4.3. Evaluation on YOLOv3

Table 2 shows the accuracy of the YOLOv3 model with camera-only datasets and multi-scale object detection.
Table 2. Evaluation on YOLOv3 with camera-only datasets and multi-scale object detection where the readings in red highlight the highest value in each row.

4.4. Evaluation on YOLOv4

Table 3 shows the accuracies of the proposed method on the YOLOv4 model with camera-only datasets and multi-scale object detection.
Table 3. Evaluation on YOLOv4 with camera-only datasets and multi-scale object detection where the values in red are the highest value in each row.

4.5. Comparison between YOLOv3 and YOLOv4

Table 4 shows a comparison of the best radar and camera fusion results between the YOLOv3 and YOLOv4 models. The left-hand side represents the models trained without the camera-only data, and the right-hand side represents the models trained with the camera-only data. From Table 4, we can see that the YOLOv3 model yields the best results when the input type is RGB + DV and multi-scale object detection is used.
Table 4. Evaluation between YOLOv3 and YOLOv4 with highest values highlighted in red.

4.6. Proposed System Performance

Table 5 shows the accuracy comparison of the FP32 and FP16 RGB + DV models, the proposed system in INT8, and the late fusion method []. The proposed system has the best recall because of the addition of the Kalman filter, although its precision is reduced because of ghost frames.
Table 5. The comparison of the FP32, FP16, the proposed system, and the late fusion method, where the highest values in each row are highlighted in red.
Compared to the late fusion method, the proposed system is better in terms of precision, recall, and mAP. In addition, the average operational performance of the proposed system is 17.39 fps, which is better than the 12.45 fps of the late fusion method when implemented on the NVIDIA Jetson AGX Xavier. Table 6 shows the comparison of the proposed system and the late fusion method in rainy conditions. With the early fusion of the DV and RGB data from the mmWave radar and RGB camera, the proposed system improves the overall mAP by 10.4% relative to the late fusion method. Figure 22 demonstrates the result images for various scenarios, in which the proposed system reports the ID, type, x-y coordinates, and velocity of the detected objects.
Table 6. The comparison of the proposed system and late fusion method on rainy days in which the values in red indicate the highest value in each row.
Figure 22. The demo of the proposed system: (a) morning, (b) night, and (c) rainy night.

5. Conclusions

The mmWave radar/camera early sensor fusion algorithm proposed in this paper is mainly designed to solve the decision-making challenges encountered in late sensor fusion methods, while improving the detection and tracking of objects and attaining real-time operational performance. The proposed system combines the advantages of mmWave radar and vision sensors. Compared with the camera-only object detection model, Table 2 and Table 3 show a significant improvement in the detection accuracies of the proposed design.
Compared to the radar/camera late sensor fusion method, the proposed system not only has better overall accuracy but also operates about 5 fps faster. Unlike the camera-only object detection model, the proposed system additionally provides the relative x-y coordinates and the relative velocity of the detected objects. For RSU applications, the proposed system can provide accurate relative positions of objects. Table 1 shows the distance errors of the proposed system, which are at most 2% over the range of 5 m to 45 m.
However, there remains scope for future work to improve the proposed early sensor fusion method. The mmWave radar used in this paper outputs only around 30 to 70 radar points, which may not be enough in complex scenes. To overcome this limitation, better radar equipment can be adopted in future work to obtain richer position and velocity information about the detected objects.

Author Contributions

Conceptualization, J.-J.L. and J.-I.G.; methodology, J.-J.L. and J.-I.G.; software, J.-J.L.; validation and visualization, J.-J.L., V.M.S. and S.-Y.C.; writing—original draft preparation, J.-J.L. and S.-Y.C.; writing—review and editing, J.-I.G., V.M.S. and S.-Y.C.; supervision, J.-I.G.; funding acquisition, J.-I.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Science and Technology Council (NSTC), Taiwan R.O.C., under projects with grants 111-2218-E-A49-028-, 111-2634-F-A49-009-, 111-2218-E-002-039-, 111-2221-E-A49-126-MY3, 110-2634-F-A49-004-, and 110-2221-E-A49-145-MY3.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The publicly available data set can be found at: https://cocodataset.org/#home (accessed on: 5 January 2021), and https://github.com/VisDrone/VisDrone-Dataset (accessed on: 5 January 2021).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

  1. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  2. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  3. Chang, S.; Zhang, Y.; Zhang, F.; Zhao, X.; Huang, S.; Feng, Z. Spatial attention fusion for obstacle detection using mmWave radar and vision sensor. Sensors 2020, 20, 956. [Google Scholar] [CrossRef] [PubMed]
  4. Lu, J.X.; Lin, J.C.; Vinay, M.S.; Chen, P.-Y.; Guo, J.-I. Fusion technology of radar and RGB camera sensors for object detection and tracking and its embedded system implementation. In Proceedings of the 2020 Asia-Pacific Signal and Information Processing As-sociation Annual Summit and Conference (APSIPA ASC), Auckland, New Zealand, 7–10 December 2020; pp. 1234–1242. [Google Scholar]
  5. Obrvan, M.; Ćesić, J.; Petrović, I. Appearance based vehicle detection by radar-stereo vision integration. In Advances in Intelligent Systems and Computing; Elsevier: Amsterdam, The Netherlands, 2015; pp. 437–449. [Google Scholar]
  6. Wu, S.; Decker, S.; Chang, P.; Senior, T.C.; Eledath, J. Collision sensing by stereo vision and radar sensor fusion. IEEE Trans. Intell. Transp. Syst. 2009, 10, 606–614. [Google Scholar]
  7. Liu, T.; Du, S.; Liang, C.; Zhang, B.; Feng, R. A Novel Multi-Sensor Fusion Based Object Detection and Recognition Algorithm for Intelligent Assisted Driving. IEEE Access 2021, 9, 81564–81574. [Google Scholar] [CrossRef]
  8. Jha, H.; Lodhi, V.; Chakravarty, D. Object Detection and Identification Using Vision and Radar Data Fusion System for Ground-Based Navigation. In Proceedings of the 2019 6th International Conference on Signal Processing and Integrated Net-works (SPIN), Noida, India, 7–8 March 2019; pp. 590–593. [Google Scholar]
  9. Kim, K.-E.; Lee, C.-J.; Pae, D.-S.; Lim, M.-T. Sensor fusion for vehicle tracking with camera and radar sensor. In Proceedings of the 2017 17th International Conference on Control, Automation and Systems (ICCAS), Jeju, Republic of Korea, 18–21 October 2017; pp. 1075–1077. [Google Scholar]
  10. Wang, T.; Zheng, N.; Xin, J.; Ma, Z. Integrating Millimeter Wave Radar with a Monocular Vision Sensor for On-Road Obstacle Detection Applications. Sensors 2011, 11, 8992–9008. [Google Scholar] [CrossRef] [PubMed]
  11. Guo, X.; Du, J.; Gao, J.; Wang, W. Pedestrian detection based on fusion of millimeter wave radar and vision. In Proceedings of the 2018 International Conference on Artificial Intelligence and Pattern Recognition, Beijing, China, 18–20 August 2018; pp. 38–42. [Google Scholar]
  12. Wang, X.; Xu, L.; Sun, H.; Xin, J.; Zheng, N. On-road vehicle detection and tracking using MMW radar and monovision fusion. IEEE Trans. Intell. Transp. Syst. 2016, 17, 2075–2084. [Google Scholar] [CrossRef]
  13. Chadwick, S.; Maddern, W.; Newman, P. Distant vehicle detection using radar and vision. In Proceedings of the International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 8311–8317. [Google Scholar]
  14. John, V.; Mita, S. Deep sensor fusion of monocular camera and radar for image-based obstacle detection in challenging environments. In Pacific-Rim Symposium on Image and Video Technology; Springer: Berlin/Heidelberg, Germany, 2019; pp. 351–364. [Google Scholar]
  15. Geisslinger, M.; Weber, M.; Betz, J.; Lienkamp, M. A deep learning-based radar and camera sensor fusion architecture for object detection. In Proceedings of the 2019 Sensor Data Fusion: Trends, Solutions, Applications (SDF), Bonn, Germany, 15–17 October 2019; pp. 1–7. [Google Scholar]
  16. Yun, S.; Han, D.; Joon Oh, S.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. In Proceedings of the International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6023–6032. [Google Scholar]
  17. Hao, W.; Zhili, S. Improved Mosaic: Algorithms for more Complex Images. J. Phys. Conf. Ser. 2020, 1684, 012094. [Google Scholar] [CrossRef]
  18. Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  19. Wang, C.-Y.; Liao, H.-Y.M.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. CSPNet: A New Backbone That Can Enhance Learning Capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 390–391. [Google Scholar]
  20. Meta, A.; Hoogeboom, P.; Ligthart, L.P. Signal processing for FMCW SAR. IEEE Trans. Geosci. Remote Sens. 2007, 45, 3519–3532. [Google Scholar] [CrossRef]
  21. Jain, A.K. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 2010, 31, 651–666. [Google Scholar]
  22. Ester, M.; Kriegel, H.-P.; Sander, J.; Xiaowei, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 226–231. [Google Scholar]
  23. Luo, X.; Yao, Y.; Zhang, J. Unified calibration method for millimeter-wave radar and camera. J. Tsinghua Univ. Sci. Technol. 2014, 54, 289–293. [Google Scholar]
  24. Chopde, N.R.; Nichat, M.K. Landmark based shortest path detection by using A* and Haversine formula. Int. J. Innov. Res. Comput. Commun. Eng. 2013, 1, 298–302. [Google Scholar]
  25. Taguchi, G.; Jugulum, R. The Mahalanobis-Taguchi Strategy: A Pattern Technology System; John Wiley & Sons: Hoboken, NJ, USA, 2002. [Google Scholar]
  26. Zhu, P.; Wen, L.; Du, D.; Xiao, B.; Fan, H.; Hu, Q.; Ling, L. Detection and Tracking Meet Drones Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399. [Google Scholar] [CrossRef] [PubMed]
  27. Ma, K.; Zhang, H.; Wang, R.; Zhang, Z. Target tracking system for multi-sensor data fusion. In Proceedings of the 2017 IEEE 2nd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chengdu, China, 15–17 December 2017; pp. 1768–1772. [Google Scholar]
  28. Liu, Z.; Cai, Y.; Wang, H.; Chen, L.; Gao, H.; Jia, Y.; Li, Y. Robust Target Recognition and Tracking of Self-Driving Cars With Radar and Camera Information Fusion Under Severe Weather Conditions. IEEE Trans. Intell. Transp. Syst. 2021, 23, 6640–6653. [Google Scholar] [CrossRef]
  29. Texas Instruments. IWR6843: Single-Chip 60-GHz to 64-GHz Intelligent mmWave Sensor Integrating Processing Capability. Available online: https://www.ti.com/product/IWR6843 (accessed on 23 July 2022).
  30. NVIDIA. NVIDIA Jetson AGX Xavier: The AI Platform for Autonomous Machines. Available online: https://developer.nvidia.com/embedded/jetson-agx-xavier-developer-kit (accessed on 24 July 2022).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
