Article

Real-Time Object Detection and Tracking Based on Embedded Edge Devices for Local Dynamic Map Generation

1 Department of Robotics Engineering, Daegu Catholic University, 13-13, Hayang-ro, Hayang-eup, Gyeongsan-si 38430, Republic of Korea
2 IIR Seeker R&D Center, LIG Nex1, Mabuk-ro 207, Giheung-gu, Yongin-si 16911, Republic of Korea
3 Department of Electronic Engineering, Korea National University of Transportation, 50 Daehak-ro, Chungju-si 27469, Republic of Korea
4 Department of Intelligent Mechatronics Engineering, Sejong University, 209 Neungdong-ro, Gwangjin-gu, Seoul 05006, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2024, 13(5), 811; https://doi.org/10.3390/electronics13050811
Submission received: 2 January 2024 / Revised: 7 February 2024 / Accepted: 15 February 2024 / Published: 20 February 2024

Abstract

This paper proposes a camera system designed for local dynamic map (LDM) generation, capable of simultaneously performing object detection, tracking, and 3D position estimation. This paper focuses on improving existing approaches to better suit our application, rather than proposing novel methods. We modified the detection head of YOLOv4 to enhance the detection performance for small objects and to predict fiducial points for 3D position estimation. The modified detector, compared to YOLOv4, shows an improvement of approximately 5% mAP on the Visdrone2019 dataset and around 3% mAP on our database. We also propose a tracker based on DeepSORT. Unlike DeepSORT, which applies a feature extraction network to each detected object, the proposed tracker applies the feature extraction network once to the entire image. To increase the resolution of the feature maps, the tracker integrates the feature aggregation network (FAN) structure into the DeepSORT network. The difference in multiple object tracking accuracy (MOTA) between the proposed tracker and DeepSORT is a minimal 0.3%. However, the proposed tracker has a consistent computational load regardless of the number of detected objects, because it extracts a feature map once for the entire image. This characteristic makes it suitable for embedded edge devices. The proposed methods have been implemented on a system on chip (SoC), the Qualcomm QCS605, using network pruning and quantization, enabling the entire process to be executed at 10 Hz on this edge device.


1. Introduction

Local dynamic map (LDM) refers to a map where real-time dynamic information on the road is reflected [1]. LDM is primarily utilized to support autonomous driving, particularly in preventing vehicle and pedestrian collisions in blind spots near intersections [2]. To achieve this, dynamic objects such as vehicles and pedestrians around intersections need to be detected, and this information should be reflected in real-time on the map integrated into the vehicle. To detect dynamic objects, raw data from sensors can be processed centrally or at a sensor level [3]. LDM is a time-critical application, and the former approach faces challenges in meeting time constraints due to the necessity of transmitting a large amount of data to a central server. Therefore, this paper proposes a camera system that detects and tracks dynamic objects in real-time and transmits this information to surrounding vehicles.
The proposed system employs a deep neural network (DNN) to detect dynamic objects and extract features for tracking objects, and it estimates the 3D paths of the objects to be transmitted. The system is implemented using Qualcomm’s QCS605, which integrates DSP, GPU, and CPU on a single chip [4]. For LDM, both detection performance and real-time processing are crucial. Hence, rather than using computationally intensive two-stage detectors, we modified YOLOv4—a one-stage detector known for its superior detection performance with low computational complexity—to our application domain. We optimized it to operate in real-time on the system hardware [5,6,7]. The proposed tracker employs both the positional information and appearance details of objects, similar to DeepSORT [8]. While DeepSORT applies a DNN individually to each object for feature extraction, the proposed tracker has been improved to reduce the computational load by applying a DNN to the entire image once. Unlike general surveillance systems, the proposed one needs to generate 3D motion paths for objects. To achieve this, the proposed system detects the bottom face quadrilateral (BFQ) of each object and estimates the 3D path of the quadrilateral’s center by using the camera calibration information [9,10].
The rest of this paper is organized as follows. Section 2 explains the hardware and software overview of the proposed system. Section 3, Section 4 and Section 5 describe the object detection network on DSP, tracking feature extraction network on GPU, and 3D trajectory estimation on CPU, respectively. Section 6 presents the experimental results for both object detection and tracking. Finally, Section 7 concludes the paper with future research directions.

2. Overview of the Proposed Camera System

2.1. Hardware Overview

The hardware of the proposed camera system is manufactured by WithRobot Inc. (Seoul, Republic of Korea) [11]. As shown in Figure 1, it consists of a main board that processes images, a carrier board that provides interfaces with peripheral devices, and a camera module. The main board is equipped with Qualcomm's QCS605 along with 4 GB of RAM and 64 GB of permanent storage, and it runs the Android operating system. The QCS605 is a low-power system on chip (SoC) developed for internet of things (IoT) applications that integrates a DSP, GPU, and CPU [4]. The carrier board connects the main board with peripheral devices and supplies power. The carrier board and the camera are connected via MIPI CSI-2 [12], while external systems are connected through Ethernet. The camera image is compressed in H.264 on the main board and transmitted externally, along with the object detection and tracking results, using the real-time streaming protocol (RTSP). The image sensor is the Sony IMX334 with a rolling shutter, and its field of view can vary from 34.4° to 128°. The image resolution is 1920 × 1080 pixels. The main board and carrier board are compact, measuring 42 × 35 mm² and 88 × 106 mm², respectively. Moreover, the system's average power consumption is a low 15 W, making it highly suitable for mobile platforms. Power can be supplied through either a DC adapter or power over Ethernet (PoE). The specifications of the proposed camera system are summarized in Table 1.

2.2. Software Overview

The core software components of the system are object detection, tracking feature extraction, and 3D trajectory estimation. These are executed on the DSP, GPU, and CPU of the QCS605, respectively, as illustrated in Figure 2. Although the DSP supports only 8-bit integer operations, it is specifically designed for artificial intelligence tasks and is therefore significantly faster than the GPU for DNN inference. Hence, the most computationally intensive task, object detection, is executed on the DSP. The DSP takes an image as input and produces not only typical object detection results such as bounding boxes, but also the center of the BFQ for each detected object. The GPU of the QCS605 is designed for graphics display; however, in our system, it is also used for DNN inference to meet the time constraints of our application. The GPU also takes the image and generates a feature map for object tracking. The CPU, based on the bounding boxes from the DSP, extracts feature vectors from the feature map received from the GPU. It then uses the feature vectors for the association between tracks and detected objects and converts the center of each object's BFQ into 3D coordinates. The final outputs of the proposed system are the object detection results along with the 3D trajectories of the objects.
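To make this division of work concrete, the following Python-style sketch outlines the per-frame flow described above. All function and attribute names are hypothetical placeholders standing in for the actual SDK calls and tensor interfaces; it is a structural sketch, not the deployed implementation.

```python
from collections import namedtuple

# Hypothetical containers and callables; the real system exchanges raw tensors
# between the DSP, GPU, and CPU rather than Python objects.
Detection = namedtuple("Detection", ["bbox", "cls", "score", "bfq_center"])

def process_frame(image, detect_on_dsp, extract_features_on_gpu, tracker, project_to_ground):
    # DSP: int8 object detector -> bounding boxes, classes, and BFQ centers
    detections = detect_on_dsp(image)                 # list[Detection]

    # GPU: one forward pass over the whole image -> shared tracking feature map
    feature_map = extract_features_on_gpu(image)      # e.g., 208 x 128 x 128

    # CPU: per-object feature vectors (ROI-Align), association, and 3D conversion
    feats = [tracker.roi_align(feature_map, d.bbox) for d in detections]
    tracks = tracker.update(detections, feats)        # IoU + cosine-distance matching
    for trk in tracks:
        trk.ground_xy = project_to_ground(trk.bfq_center)   # inverse of Equation (1)
    return tracks
```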

3. Object Detection Network on DSP

Object detectors are categorized into two-stage and one-stage detectors. Two-stage detectors first estimate the bounding boxes of objects and subsequently classify the type of object within those boxes. One-stage detectors divide an image into multiple grids and simultaneously estimate the bounding box and object type within each grid. In general, two-stage detectors are known for superior detection performance compared to one-stage detectors, but their high computational demands make them less suitable for edge devices.
The representative one-stage detectors are the YOLO series, and we modified YOLOv4 to suit our application because it has been proven over a considerable period in various applications [13,14] and offers a good compromise between computational cost and detection accuracy in various frameworks [15,16]. The proposed camera system, installed around intersections, detects moving objects such as pedestrians and vehicles for LDM generation. The longer the object detection range of the camera system, the more economical it becomes. The detection range is directly proportional to the input image resolution of the object detector. However, there is a limit to increasing the input image resolution because of the trade-off between resolution and the computational cost of the detector. Therefore, we set the input image resolution to 416 × 256, a level at which the detector can process images at 10 Hz, and instead modified the detection head of YOLOv4 to enhance the detection of distant objects.
As shown in Figure 3, YOLOv4 has three detection heads, producing outputs at low, medium, and high resolutions. Since the camera system for LDM is installed at a height of more than 15 m above the ground, there is a certain distance between the camera system and objects on the ground. Consequently, a single object occupies only a small portion of the image. This means that the low-scale detection head of YOLOv4, which primarily detects large objects, becomes less important. Therefore, in this paper, the unnecessary low-scale detection head is removed, and an ultra-high (UH) scale detection head, with an output resolution twice that of the high-scale head, is added to the path aggregation network (PAN) of YOLOv4, as shown in Figure 4. In Figure 3 and Figure 4, CBM and CBL refer to blocks consisting of convolution, batch normalization, and a mish or leaky ReLU activation layer, respectively. CSP stands for cross-stage partial connection network block.
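For intuition, the short sketch below computes the output grid sizes before and after the modification for the 416 × 256 input. The strides of 32/16/8 for the original low/medium/high-scale heads, and 4 for the added UH head, are assumptions based on standard YOLOv4 behavior rather than values stated in the paper.

```python
# Illustrative grid-size arithmetic for the 416 x 256 input.
# Head strides are assumed (standard YOLOv4), not taken from the paper.
W, H = 416, 256

original_heads = {"low": 32, "medium": 16, "high": 8}
modified_heads = {"medium": 16, "high": 8, "ultra_high": 4}   # low-scale head removed

for title, heads in (("YOLOv4", original_heads), ("Modified YOLOv4", modified_heads)):
    print(title)
    for name, stride in heads.items():
        print(f"  {name:10s}: {W // stride:3d} x {H // stride:3d} cells")
# ultra_high -> 104 x 64, i.e., twice the resolution of the high-scale head (52 x 32)
```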
The proposed camera system also outputs the 3D positions of detected objects on the ground. Assuming a flat road surface, the relationship between a point X on the road surface and its corresponding point X′ in the image plane is expressed by the homography matrix H, as shown in Equation (1). In Equation (1), $h_{ij}$ represents the element in the i-th row and j-th column of H, and (x′, y′) and (x, y) represent the horizontal and vertical coordinates in the image and on the road surface, respectively.

$$\mathbf{X}' = \mathbf{H}\mathbf{X}, \qquad \lambda \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{1}$$
This homography matrix is estimated through camera calibration [10]. That is, if the points where the detected objects touch the road surface are known, the 3D positions of the objects can be estimated from those points. When detecting objects on the ground with a camera positioned high above, estimating the 3D position from the center of the object's bounding box can lead to significant errors, as shown in Figure 5. To overcome this, several studies have focused on detecting fiducial points in the image to estimate the 3D position of detected objects [17,18,19,20]. In this paper, we apply our previous approach, which estimates the center of the bottom face quadrilateral (BFQ) of an object as the 3D fiducial point [9]. As our previous approach is implemented by modifying the detection head of YOLOv4 to estimate the BFQ center, it can be easily applied to the proposed system with minimal additional computational cost.
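As a concrete illustration, the following NumPy sketch inverts Equation (1) to map an undistorted image point, such as a detected BFQ center, to road-surface coordinates. The helper name and the placeholder homography values are ours for illustration and do not correspond to the calibrated camera.

```python
import numpy as np

def image_to_ground(u, v, H):
    """Map an (undistorted) image point (u, v) to road-surface coordinates (x, y)
    by inverting the homography of Equation (1)."""
    p_road = np.linalg.inv(H) @ np.array([u, v, 1.0])   # homogeneous road-plane point
    return p_road[:2] / p_road[2]                        # remove the scale factor lambda

# Placeholder homography for illustration only (not the calibrated value).
H = np.array([[2.0, 0.1, 320.0],
              [0.0, 1.8, 240.0],
              [0.0, 0.002, 1.0]])
print(image_to_ground(640.0, 512.0, H))   # estimated (x, y) on the road plane
```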
The matrix H in Equation (1) can be computed from the camera's intrinsic parameter matrix K (3 × 3) and extrinsic parameter matrix P (3 × 4), as shown in Equation (2).

$$\lambda \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \mathbf{K}\mathbf{P}\begin{bmatrix} x \\ y \\ 0 \\ 1 \end{bmatrix} = \mathbf{K}\mathbf{R}\begin{bmatrix}\mathbf{I} & -\mathbf{C}\end{bmatrix}\begin{bmatrix} x \\ y \\ 0 \\ 1 \end{bmatrix} = \mathbf{H}\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{2}$$
In the case of cameras with a wide field of view, such as the proposed camera system, lens distortion occurs, as shown in Figure 6a. Equation (2) represents the pin-hole camera model, which does not consider lens distortion; to apply Equation (2), the lens distortion must therefore be compensated for. We used the equidistance model for lens distortion compensation and the MATLAB toolbox for estimating the intrinsic parameters [21,22]. Figure 6a,b show a raw image and an undistorted image from the proposed camera system, respectively. The intrinsic and extrinsic camera parameters can be estimated using a camera calibration pattern. However, as shown in Figure 6, estimating the extrinsic parameters with a calibration pattern on public roads is nearly impossible due to traffic flow interference and safety concerns. In this case, structures in the image, such as crosswalks, lanes, and buildings, can be used instead of a calibration pattern. The camera rotation matrix R can be estimated when three vanishing points are available [10]. In this paper, we find two vanishing points from two pairs of parallel lines, such as the green and red line pairs shown in Figure 6c, and estimate the remaining vanishing point under the assumption that the center of the image is the orthocenter of the triangle formed by the three vanishing points [23]. Once the rotation matrix R is obtained, the world coordinate C of the camera can be estimated using the known length of a structure, such as the traffic pole depicted by the yellow line in Figure 6c. With the camera's intrinsic and extrinsic parameters, along with the aspect ratio of the vehicle's length, width, and height, generating the 3D ground truth for a vehicle becomes as simple as drawing one side of the vehicle in the image, as exemplified by the red line in Figure 6d.
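For completeness, the sketch below shows one way to assemble H from the calibration results following Equation (2): because road points lie on the z = 0 plane, only the first two columns of R and the translation term −RC survive. All numeric values are placeholders for illustration, not the actual calibration of the proposed camera.

```python
import numpy as np

def homography_from_calibration(K, R, C):
    """Assemble the road-to-image homography of Equation (2).
    For points on the z = 0 road plane, K R [I | -C] reduces to K [r1  r2  -R C]."""
    t = -R @ C                                            # translation of [R | t]
    H = K @ np.column_stack((R[:, 0], R[:, 1], t))
    return H / H[2, 2]                                    # fix the arbitrary scale

# Placeholder intrinsics and extrinsics for illustration only.
K = np.array([[1000.0,    0.0, 960.0],
              [   0.0, 1000.0, 540.0],
              [   0.0,    0.0,   1.0]])
R = np.eye(3)                     # camera axes aligned with the road frame (toy case)
C = np.array([0.0, 0.0, 15.0])    # camera roughly 15 m above the road origin
print(homography_from_calibration(K, R, C))
```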
Applying YOLOv4 directly to edge devices with limited computational power can be challenging, despite its efficiency as a one-stage detector. Moreover, adapting it to devices like the QCS605, whose DSP only supports int8 operations, requires quantizing the network parameters. To deploy a DNN on edge devices, network structure optimization and network parameter quantization are necessary. There are various approaches to network structure optimization. First, there is the network simplification approach, which simplifies existing large networks through techniques such as tensor decomposition [24,25] and network slimming [26,27]. Another strategy is to design a new, smaller network trained with distilled knowledge from a larger model, known as knowledge distillation [28,29]. Additionally, there is an approach that algorithmically discovers efficient network structures, known as neural architecture search (NAS) [30,31]. In this paper, we adopted the network simplification approach, taking advantage of the already validated performance of the YOLOv4 network. Among the network simplification methods, we adopted the widely embraced and effective network channel pruning [7]. The simplified network undergoes quantization before being deployed on the DSP, and quantization is broadly categorized into post-training quantization and quantization-aware training (QAT) [27,28]. In general, for datasets where detection is relatively straightforward, the performance difference between the two methods is minimal. However, in challenging cases, such as when objects are small, QAT tends to outperform post-training quantization [28]. Since most targets in our application appear small in the image, QAT is applied. The process of deploying the DNN on the proposed edge device is illustrated in Figure 7. First, less important channels of the object detector are eliminated through sparsity training and channel pruning. During the fine-tuning stage, QAT is performed to enhance the performance of the simplified network while minimizing the performance degradation caused by quantization. Finally, the model is quantized and embedded into the edge device.
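The deployment pipeline in Figure 7 can be summarized with the PyTorch-style sketch below, which adds an L1 penalty on batch-normalization scale factors during sparsity training and then selects channels whose scales fall below a global threshold. It follows the general network-slimming recipe [7,27] rather than reproducing the exact implementation; the penalty weight, pruning ratio, and demo model are illustrative.

```python
import torch
import torch.nn as nn

def bn_l1_penalty(model, lam=1e-4):
    """Sparsity-training term: L1 penalty on BatchNorm scale factors (gammas),
    added to the detection loss so that unimportant channels shrink toward zero."""
    return lam * sum(m.weight.abs().sum()
                     for m in model.modules() if isinstance(m, nn.BatchNorm2d))

def channel_keep_masks(model, prune_ratio=0.7):
    """Global threshold on |gamma|: keep roughly the largest (1 - prune_ratio) fraction."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, prune_ratio)
    return [m.weight.detach().abs() > threshold
            for m in model.modules() if isinstance(m, nn.BatchNorm2d)]

# Tiny demo model standing in for the modified YOLOv4 backbone.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU(),
                      nn.Conv2d(16, 32, 3), nn.BatchNorm2d(32), nn.ReLU())
loss_sparsity = bn_l1_penalty(model)          # added to the detection loss each step
masks = channel_keep_masks(model, 0.7)        # channels to retain when rebuilding
# After rebuilding the slim network from the masks, fine-tuning with QAT
# (e.g., torch.ao.quantization) precedes int8 export for the DSP.
```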

4. Tracking Feature Extraction Network on GPU

To generate the trajectories of detected objects, object tracking is necessary. The SORT algorithm, which predicts the next position of each track using a Kalman filter and associates detected objects with the predicted positions through the Hungarian algorithm, is widely used due to its low computational complexity. However, since object appearance features are not used in the data association, track IDs are often switched [32]. To address this, the DeepSORT method was introduced, which extracts appearance feature vectors using a DNN and measures the difference between the feature vectors of detected objects and tracks using the cosine distance [8]. Subsequently, various object tracking methods have been introduced [33,34,35]. However, we modified DeepSORT to operate in real-time on our hardware, as its simpler structure is more suitable for edge devices than some state-of-the-art tracking methods with superior performance.
DeepSORT has a relatively shallow feature extraction network with only 15 layers, as shown in Figure 8. However, its computational load increases proportionally with the number of detected objects, since the network is applied individually to each detected object. The proposed camera system must perform object detection, tracking, and 3D position estimation within 100 ms per frame. Since the object detector barely operates at 10 Hz on the DSP, the tracking feature extraction network has to run on the GPU. The GPU of the QCS605, an Adreno 615, operates at a clock speed of 430 MHz with only 256 ALUs. Although DeepSORT's feature extraction network is lightweight, applying it individually to each object cannot meet the time requirements on this platform. Therefore, as depicted in Figure 8, the proposed method applies the feature extraction network to the entire input image, extracting a feature map with 128 channels. Subsequently, it applies ROI-Align to the feature map for each object to extract its feature vector. In DeepSORT, the output resolution of Residual 9, as depicted in Figure 8, is eight times lower than the network input. If the DeepSORT network were directly applied to the entire image, the resolution occupied by the detected objects in the feature map would be very low, diminishing the discriminative power of the feature vectors. To address this, the proposed method attaches the feature aggregation network (FAN) structure to the DeepSORT architecture. This structure aggregates low-resolution and high-resolution feature maps, and its output is further processed with a 1 × 1 convolution, resulting in a feature map with dimensions 208 × 128 × 128 (width, height, channels), as depicted in Figure 8.
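The per-object feature extraction from the shared map can be sketched with torchvision's roi_align as below. The 208 × 128 × 128 feature-map size follows Figure 8, while the spatial scale, pooled output size, and normalization step are our assumptions about how a 128-D appearance vector would be obtained.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

# Shared tracking feature map from the GPU network: (N, C, H, W) = (1, 128, 128, 208),
# i.e., half the 416 x 256 input resolution, following Figure 8.
feature_map = torch.randn(1, 128, 128, 208)

# Detected boxes in input-image pixels, rows of (batch_index, x1, y1, x2, y2).
boxes = torch.tensor([[0.0, 100.0, 60.0, 140.0, 150.0],
                      [0.0, 300.0, 80.0, 360.0, 200.0]])

# spatial_scale = 0.5 maps image coordinates onto the half-resolution feature map;
# pooling each ROI to 1 x 1 yields one 128-D vector per object (assumed settings).
pooled = roi_align(feature_map, boxes, output_size=(1, 1),
                   spatial_scale=0.5, aligned=True)        # (num_boxes, 128, 1, 1)
features = F.normalize(pooled.flatten(1), dim=1)           # unit vectors for cosine distance
print(features.shape)                                      # torch.Size([2, 128])
```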

5. Three-dimensional Trajectory Estimation on CPU

When the CPU receives the tensor containing the detection results from the DSP and the tracking feature map from the GPU, it generates the 3D tracking paths of the detected objects as the final output. The detection results are produced by applying non-maximum suppression (NMS) to objects whose confidence scores exceed a threshold (0.5) in the output tensor from the detector. Subsequently, for each detected object, ROI-Align is applied to the tracking feature map based on its bounding box, extracting a 128-dimensional feature vector. Afterward, data association is performed using the intersection over union (IoU) and the feature vector distance between the detected objects and tracks, following the same methodology as DeepSORT. While it is possible to use the 3D spatial distance instead of IoU during data association, one- or two-pixel errors in the image can lead to significant distance errors. Therefore, 3D positions are not employed in the data association; rather, they are used exclusively to generate the 3D tracking paths.
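A minimal sketch of this association step is shown below, combining IoU and cosine feature distance into a single cost matrix solved with the Hungarian algorithm. The blending weight and gating threshold are illustrative assumptions, not the values tuned for the system.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def associate(track_boxes, track_feats, det_boxes, det_feats,
              alpha=0.5, gate=0.7):
    """Hungarian matching on a blended IoU / cosine-distance cost.
    alpha and gate are illustrative values, not the system's tuned parameters."""
    cost = np.zeros((len(track_boxes), len(det_boxes)))
    for i, (tb, tf) in enumerate(zip(track_boxes, track_feats)):
        for j, (db, df) in enumerate(zip(det_boxes, det_feats)):
            cos_d = 1.0 - np.dot(tf, df) / (np.linalg.norm(tf) * np.linalg.norm(df) + 1e-9)
            cost[i, j] = alpha * (1.0 - iou(tb, db)) + (1.0 - alpha) * cos_d
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < gate]
```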

6. Experimental Results

6.1. Detector Performance Evaluation

For the evaluation of the detector performance, the Visdrone2019-Det dataset and our surveillance camera object detection (SCOD) dataset are utilized [36]. Visdrone2019-Det consists of images captured by drones, as illustrated in Figure 9. The dataset comprises a total of 7019 images, with 6471 images used for training and 548 images for evaluation. This dataset contains ground truth annotations for a total of 10 object categories, including cars, pedestrians, bicycles, buses, etc. Visdrone2019-Det is suitable for evaluating performance in scenarios where there is a significant distance between objects and the camera, given its inclusion of numerous distant objects.
The SCOD dataset was obtained from traffic flow monitoring cameras around intersections, as depicted in Figure 10, aligning with the primary application of the proposed camera system for LDM generation. This dataset comprises a total of 24,723 images, with 21,494 images used for training and 3229 images for evaluation. It includes ground truth annotations for three categories: vehicles, pedestrians, and bicycles.
Table 2 shows the detection performance of YOLOv4 and the modified detector. For Visdrone2019-Det and SCOD, the input image resolutions are 416 × 416 × 3 and 416 × 256 × 3, respectively. Although the modified detector incurs slightly more computational overhead than YOLOv4, its high-resolution output improves mAP by approximately 5% and 3% for Visdrone2019-Det and SCOD, respectively. Figure 11 compares the detection results of YOLOv4 and the modified detector on Visdrone2019-Det. As observed in Figure 11, the modified detector detects small objects better than YOLOv4. Figure 12 presents a comparison on the SCOD dataset, where YOLOv4 tends to detect multiple pedestrians as a single entity, whereas the modified detector successfully separates and detects them individually.
As mentioned earlier, directly applying YOLOv4 to the edge device did not meet the processing time requirements, leading to network simplification. The performance analysis of the simplified networks is shown in Table 3. As seen in Table 3, for both the Visdrone2019-Det and SCOD datasets, increasing the network pruning ratio up to 70% results in minimal performance degradation. Interestingly, for Visdrone2019-Det, the mAP of the 70% pruned detector improved by approximately 1%. This is likely attributed to the removal of unnecessary channels, resulting in less overfitting of the DNN to the training data. When the 70% pruned version of the modified YOLOv4 for SCOD was deployed on the camera system, it was able to process approximately 11.8 frames per second.

6.2. Tracker Performance Evaluation

The tracking performance was evaluated using our own dataset, referred to as the surveillance camera object tracking (SCOT) dataset. The SCOT dataset comprises 13 narrow field of view (NFOV) sequences with pedestrians as the primary tracking targets and 10 wide field of view (WFOV) sequences with vehicles as the primary tracking targets, as illustrated in Figure 13. The number of frames and tracking targets for each sequence is presented in Table 4. For instance, the narrow field of view sequence whose ID is 1 has a total of 810 image frames and 11 tracking targets, as shown in Table 4. Seven narrow field of view sequences and six wide field of view sequences from the SCOT dataset were used for training. The narrow field of view sequences used for training contain a total of 5741 image frames with 68 tracking targets, while the wide field of view sequences contain a total of 5098 frames with 156 tracking targets. For evaluation, six narrow field of view sequences and four wide field of view sequences were used. The reason for dividing the dataset into narrow and wide field of view sequences is that the difficulty of tracking the two primary target types, pedestrians and vehicles, differs considerably. For vehicles, intersections between objects are infrequent, and their movement is relatively straightforward, following lanes; therefore, tracking vehicles is relatively easy. Pedestrians, on the other hand, frequently cross paths and have unrestricted movement, making tracking more challenging.
The performance of the proposed tracker was compared with that of DeepSORT. DeepSORT resizes the image within the bounding box of each detected object and feeds it into the feature extraction network. For the primary tracking targets, vehicles and pedestrians, the aspect ratios are significantly different, and resizing them to the same resolution may degrade the tracking performance. Therefore, separate feature extraction networks were used in DeepSORT for vehicles and pedestrians: vehicles were resized to 128 × 128 and pedestrians to 64 × 128 before being fed into the DNN. Multiple object tracking accuracy (MOTA), defined in Equation (3), is used as the key performance metric [35]. In Equation (3), FN, FP, GT, and IDSW denote the numbers of false negatives, false positives, ground-truth objects, and identity switches, respectively.
$$\mathrm{MOTA} = \left(1 - \frac{\sum_{t}\left(\mathrm{FN}_{t} + \mathrm{FP}_{t} + \mathrm{IDSW}_{t}\right)}{\sum_{t}\mathrm{GT}_{t}}\right) \times 100 \tag{3}$$
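For reference, Equation (3) maps directly to the following short function, where each argument is the per-frame sequence of the corresponding quantity:

```python
def mota(fn_per_frame, fp_per_frame, idsw_per_frame, gt_per_frame):
    """Multiple object tracking accuracy, Equation (3), in percent."""
    errors = sum(fn_per_frame) + sum(fp_per_frame) + sum(idsw_per_frame)
    return (1.0 - errors / sum(gt_per_frame)) * 100.0
```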
Table 5 presents the tracking performance comparison between DeepSORT and the proposed method. As mentioned earlier, objects in the NFOV sequences are challenging to track, resulting in similarly low performance for both DeepSORT and the proposed method, with MOTA at about 41%. DeepSORT requires 0.59 BFLOPs per pedestrian, while the proposed method requires 10.66 BFLOPs regardless of the number of objects. The objects in the WFOV sequences are easy to track, resulting in both DeepSORT and the proposed method achieving more than 87% MOTA. In the case of DeepSORT, the input image resolution for vehicles is 128 × 128, leading to a higher computational load of 1.18 BFLOPs per object in the WFOV sequences. In embedded systems, minimizing the computational load is as crucial as tracking performance, and for system stability, the computational load needs to be consistent. The proposed tracking method exhibits performance comparable to DeepSORT, but its consistent computational load makes it more suitable for embedded edge devices such as the proposed camera system. If the proposed tracking feature extraction network is directly ported to the QCS605 GPU, it cannot meet the real-time requirement of 10 Hz operation. Therefore, the tracking feature extraction network was also simplified through channel pruning, but its parameters were not quantized, as the GPU supports floating-point operations. Table 6 shows that when the proposed method is pruned by up to 70%, the computational load decreases significantly from 10.66 BFLOPs to 2.99 BFLOPs. Despite this reduction, the tracking performance does not exhibit significant degradation compared to the unpruned network.

7. Conclusions and Future Works

This paper proposes a camera system for real-time object detection and tracking to generate an LDM. The object detector and tracking feature extraction network are adapted to our application and optimized through channel pruning and quantization-aware training to suit embedded edge devices. Furthermore, the main components of the algorithm, including the object detector, the tracking feature extraction network, and the tracking with 3D position estimation, are appropriately allocated to the DSP, GPU, and CPU of the Qualcomm QCS605 chip in the camera system, achieving the required operating speed of 10 Hz. Edge devices capable of real-time object detection and tracking are essential for various applications, such as autonomous driving, mobile robotics, and intelligent surveillance. Therefore, the proposed method is expected to find widespread application.
Since the proposed system utilizes a lightweight network for real-time processing on low-end hardware, its tracking performance may degrade on challenging datasets. In the future, we plan to improve the tracking performance as follows. First, we will develop a multi-functional DNN whose backbone network is shared by object tracking and detection. Since capturing subtle differences between similar objects is crucial for the feature vectors used in tracking, we will compare various backbone networks, such as DenseNet and HRNet, which effectively fuse low-resolution and high-resolution features [37,38]. Second, we will develop a network that predicts the appearance feature vector of each track based on its previous information for better appearance-based association. Lastly, we will apply various methods, such as the Dempster–Shafer theory [39], to more accurately calculate the distance between the appearance feature vectors of tracks and those of detection results. In addition, we plan to enhance the detection performance by adapting various state-of-the-art one-stage object detectors. We will compare their detection performance and select the most appropriate detector for real-time embedded systems by applying sparsity training, channel pruning, and network quantization. Finally, we plan to extend the proposed single-camera system to a multiple-camera system. To this end, we will develop a method for object handover between cameras by fusing position and appearance information. This will connect the trajectories of objects tracked across multiple cameras and provide richer information to the LDM.

Author Contributions

Conceptualization, K.C., J.M., H.G.J. and J.K.S.; methodology, K.C., J.M., H.G.J. and J.K.S.; software, K.C., J.M.; validation, K.C., J.M., H.G.J. and J.K.S.; formal analysis, K.C., J.M., H.G.J. and J.K.S.; investigation, K.C., J.M.; resources, K.C., J.M.; data curation, K.C., J.M.; writing—original draft preparation, K.C.; writing—review and editing, H.G.J. and J.K.S.; visualization, K.C.; supervision, H.G.J. and J.K.S.; project administration, J.K.S.; funding acquisition, K.C., J.K.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by research grants from Daegu Catholic University in 2023.

Data Availability Statement

The datasets presented in this article are not readily available because these are proprietary to the company.

Conflicts of Interest

Author Jongwon Moon was employed by the company LIG Nex1. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. TR 102 863—V1.1.1; Intelligent Transport Systems (ITS); Vehicular Communications; Basic Set of Applications; Local Dynamic Map (LDM); Rationale for and Guidance on Standardization. ETSI: Sophia Antipolis, France, 2011.
  2. Damerow, F.; Puphal, T.; Flade, B.; Li, Y.; Eggert, J. Intersection Warning System for Occlusion Risks Using Relational Local Dynamic Maps. IEEE Intell. Transp. Syst. Mag. 2018, 10, 47–59. [Google Scholar] [CrossRef]
  3. Carletti, C.M.R.; Raviglione, F.; Casetti, C.; Stoffella, F.; Yilma, G.M.; Visintainer, F.; Risma Carletti, C.M. S-LDM: Server Local Dynamic Map for 5G-Based Centralized Enhanced Collective Perception. SSRN 2023. [Google Scholar] [CrossRef]
  4. Qualcomm QCS605 SoC|Next-Gen 8-Core IoT & Smart Camera Chipset|Qualcomm. Available online: https://www.qualcomm.com/products/technology/processors/application-processors/qcs605 (accessed on 29 September 2022).
  5. Zaidi, S.S.A.; Ansari, M.S.; Aslam, A.; Kanwal, N.; Asghar, M.; Lee, B. A Survey of Modern Deep Learning Based Object Detection Models. Digit. Signal Process. 2021, 126, 103514. [Google Scholar] [CrossRef]
  6. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020. [Google Scholar] [CrossRef]
  7. Choi, K.; Wi, S.M.; Jung, H.G.; Suhr, J.K. Simplification of Deep Neural Network-Based Object Detector for Real-Time Edge Computing. Sensors 2023, 23, 3777. [Google Scholar] [CrossRef] [PubMed]
  8. Wojke, N.; Bewley, A.; Paulus, D. Simple Online and Realtime Tracking with a Deep Association Metric. arXiv 2017. [Google Scholar] [CrossRef]
  9. Kim, G.; Jung, H.G.; Suhr, J.K. CNN-Based Vehicle Bottom Face Quadrilateral Detection Using Surveillance Cameras for Intelligent Transportation Systems. Sensors 2023, 23, 6688. [Google Scholar] [CrossRef] [PubMed]
  10. Caprile, B.; Torre, V. Using Vanishing Points for Camera Calibration; Springer: Berlin/Heidelberg, Germany, 1990; Volume 4. [Google Scholar]
  11. RoadGaze Hardware Specification. Available online: http://withrobot.com/en/ai-camera/roadgaze/ (accessed on 2 February 2024).
  12. MIPI CSI-2. Available online: https://www.mipi.org/specifications/csi-2 (accessed on 2 February 2024).
  13. Yurdusev, A.A.; Adem, K.; Hekim, M. Detection and Classification of Microcalcifications in Mammograms Images Using Difference Filter and Yolov4 Deep Learning Model. Biomed. Signal Process. Control 2023, 80, 104360. [Google Scholar] [CrossRef]
  14. Dlamini, S.; Chen, Y.-H.; Jeffrey Kuo, C.-F. Complete Fully Automatic Detection, Segmentation and 3D Reconstruction of Tumor Volume for Non-Small Cell Lung Cancer Using YOLOv4 and Region-Based Active Contour Model. Expert Syst. Appl. 2023, 212, 118661. [Google Scholar] [CrossRef]
  15. YOLOv4. Available online: https://docs.nvidia.com/tao/tao-toolkit/text/object_detection/yolo_v4.html (accessed on 6 March 2023).
  16. Getting Started with YOLO V4. Available online: https://www.mathworks.com/help/vision/ug/getting-started-with-yolo-v4.html (accessed on 16 February 2024).
  17. Zhang, B.; Zhang, J. A Traffic Surveillance System for Obtaining Comprehensive Information of the Passing Vehicles Based on Instance Segmentation. IEEE Trans. Intell. Transp. Syst. 2021, 22, 7040–7055. [Google Scholar] [CrossRef]
  18. Mauri, A.; Khemmar, R.; Decoux, B.; Haddad, M.; Boutteau, R. Real-Time 3D Multi-Object Detection and Localization Based on Deep Learning for Road and Railway Smart Mobility. J. Imaging 2021, 7, 145. [Google Scholar] [CrossRef] [PubMed]
  19. Li, P. RTM3D: Real-Time Monocular 3D Detection from Object Keypoints for Autonomous Driving. arXiv 2020. [Google Scholar] [CrossRef]
  20. Zhu, M.; Zhang, S.; Zhong, Y.; Lu, P.; Peng, H.; Lenneman, J. Monocular 3D Vehicle Detection Using Uncalibrated Traffic Cameras through Homography. arXiv 2021. [Google Scholar] [CrossRef]
  21. Kannala, J.; Brandt, S.S. A Generic Camera Model and Calibration Method for Conventional, Wide-Angle, and Fish-Eye Lenses. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 1335–1340. [Google Scholar] [CrossRef] [PubMed]
  22. Bouguet, J.-Y. Complete Camera Calibration Toolbox for Matlab. In Jean-Yves Bouguet’s Homepage; 1999; Available online: http://robots.stanford.edu/cs223b04/JeanYvesCalib/ (accessed on 1 January 2024).
  23. Cipolla, R.; Drummond, T.; Robertson, D. Camera Calibration from Vanishing Points in Images of Architectural Scenes. In Proceedings of the 1999 British Machine Vision Conference, Nottingham, UK, 1 January 1999. [Google Scholar]
  24. Li, N.; Pan, Y.; Chen, Y.; Ding, Z.; Zhao, D.; Xu, Z. Heuristic Rank Selection with Progressively Searching Tensor Ring Network. arXiv 2020. [Google Scholar] [CrossRef]
  25. Yin, M.; Sui, Y.; Liao, S.; Yuan, B. Towards Efficient Tensor Decomposition-Based DNN Model Compression with Optimization Framework. arXiv 2021. [Google Scholar] [CrossRef]
  26. Liang, T.; Glossner, J.; Wang, L.; Shi, S.; Zhang, X. Pruning and Quantization for Deep Neural Network Acceleration: A Survey. arXiv 2021. [Google Scholar] [CrossRef]
  27. Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; Zhang, C. Learning Efficient Convolutional Networks through Network Slimming. arXiv 2017. [Google Scholar] [CrossRef]
  28. Masana, M.; Van De Weijer, J.; Herranz, L.; Bagdanov, A.D.; Alvarez, J.M. Domain-Adaptive Deep Network Compression. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  29. Yang, J.; Zou, H.; Cao, S.; Chen, Z.; Xie, L. MobileDA: Toward Edge-Domain Adaptation. IEEE Internet Things J. 2020, 7, 6909–6918. [Google Scholar] [CrossRef]
  30. Liu, C.; Zoph, B.; Neumann, M.; Shlens, J.; Hua, W.; Li, L.-J.; Fei-Fei, L.; Yuille, A.; Huang, J.; Murphy, K. Progressive Neural Architecture Search. arXiv 2017. [Google Scholar] [CrossRef]
  31. White, C.; Neiswanger, W.; Savani, Y. BANANAS: Bayesian Optimization with Neural Architectures for Neural Architecture Search. arXiv 2019. [Google Scholar] [CrossRef]
  32. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple Online and Realtime Tracking. arXiv 2016. [Google Scholar] [CrossRef]
  33. Ondrasovic, M.; Tarabek, P. Siamese Visual Object Tracking: A Survey. IEEE Access 2021, 9, 110149–110172. [Google Scholar] [CrossRef]
  34. Chen, F.; Wang, X.; Zhao, Y.; Lv, S.; Niu, X. Visual Object Tracking: A Survey. Comput. Vis. Image Underst. 2022, 222, 103508. [Google Scholar] [CrossRef]
  35. Wojke, N.; Bewley, A. Deep Cosine Metric Learning for Person Re-Identification. arXiv 2018. [Google Scholar] [CrossRef]
  36. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Ling, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  37. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. arXiv 2019. [Google Scholar] [CrossRef] [PubMed]
  38. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. arXiv 2016. [Google Scholar] [CrossRef]
  39. Triki, N.; Ksantini, M.; Karray, M. Traffic Sign Recognition System Based on Belief Functions Theory. In Proceedings of the ICAART 2021—13th International Conference on Agents and Artificial Intelligence, Virtual Event, 4–6 February 2021; SciTePress: Setúbal, Portugal, 2021; Volume 2, pp. 775–780. [Google Scholar]
Figure 1. Hardware overview [11].
Figure 2. Software overview.
Figure 3. YOLOv4 architecture.
Figure 4. Modified YOLOv4 architecture.
Figure 5. Distance error according to 3D fiducial point of an object.
Figure 6. Camera calibration.
Figure 7. Process of deploying a DNN to edge devices.
Figure 8. Tracking feature extraction network.
Figure 9. Visdrone2019-Det.
Figure 10. SCOD dataset.
Figure 11. Detection results of Visdrone2019-Det: (a) YOLOv4, (b) Modified YOLOv4.
Figure 12. Detection results of SCOD: (a) YOLOv4, (b) Modified YOLOv4.
Figure 13. Examples of SCOT: (a) NFOV sequences, (b) WFOV sequences.
Table 1. HW specification of the proposed camera system.

| Major Components | Items | Specification |
|---|---|---|
| Main board | QCS605 | CPU: Kryo 300, 64-bit, 8 cores, up to 2.5 GHz; DSP: 2 × Hexagon Vector Processor (Hexagon 685); GPU: Adreno 615 |
| | Memory | 4 GB LPDDR4, 64 GB eMMC |
| | OS | Android |
| | Size | 42 × 35 mm² |
| Carrier board | Interface | Exterior: Ethernet; Camera: MIPI |
| | Power | 15 W (PoE or DC adapter) |
| | Data transfer | CODEC: H.264; Protocol: RTSP |
| | Size | 88 × 106 mm² |
| Camera | Sensor | Sony IMX334 (CMOS), M12 mount, rolling shutter |
| | FOV | 34.4°–128° |
| | Resolution | 1920 × 1080 |
Table 2. Evaluation of object detectors.

| Network | Dataset | Input Size | mAP (%) | BFLOPs |
|---|---|---|---|---|
| YOLOv4 | Visdrone2019-Det | 416 × 416 | 20.62 | 59.75 |
| YOLOv4 | SCOD | 416 × 256 | 86.69 | 36.79 |
| Modified YOLOv4 | Visdrone2019-Det | 416 × 416 | 25.79 | 63.77 |
| Modified YOLOv4 | SCOD | 416 × 256 | 89.47 | 39.22 |
Table 3. Evaluation of the modified YOLOv4 according to the pruning rate.

| Dataset | Pruning Rate (%) | mAP (%) | Parameters | BFLOPs |
|---|---|---|---|---|
| Visdrone2019-Det | 0 | 25.79 | 48.0 M | 63.77 |
| Visdrone2019-Det | 50 | 27.38 | 15.1 M | 38.45 |
| Visdrone2019-Det | 70 | 26.99 | 6.49 M | 27.28 |
| SCOD | 0 | 89.47 | 48.0 M | 39.22 |
| SCOD | 50 | 89.98 | 12.7 M | 19.25 |
| SCOD | 70 | 89.56 | 6.48 M | 13.23 |
Table 4. Surveillance camera object tracking dataset. Each entry gives the number of image frames, with the number of tracking targets in parentheses.

| DB | FOV | Sequence ID: frames (targets) | Total |
|---|---|---|---|
| Train | N | 1: 810 (11), 2: 690 (12), 3: 1020 (7), 4: 930 (12), 5: 780 (8), 6: 900 (10), 7: 611 (8) | 5741 (68) |
| Train | W | 1: 990 (30), 2: 1020 (25), 3: 630 (26), 4: 899 (22), 5: 659 (20), 6: 990 (33) | 5098 (156) |
| Test | N | 8: 1080 (7), 9: 900 (7), 10: 509 (4), 11: 604 (10), 12: 795 (19), 13: 695 (7) | 4583 (54) |
| Test | W | 7: 630 (15), 8: 900 (22), 9: 660 (17), 10: 450 (15) | 2640 (69) |
Table 5. Tracking performance evaluation (N denotes the number of detected objects).

| Tracker | NFOV MOTA (%) | NFOV FN | NFOV FP | NFOV IDsw | NFOV BFLOPs | WFOV MOTA (%) | WFOV FN | WFOV FP | WFOV IDsw | WFOV BFLOPs |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepSORT | 41.24 | 2688 | 1885 | 56 | 0.59 × N | 87.04 | 2773 | 61 | 21 | 1.18 × N |
| Modified DeepSORT | 41.54 | 2726 | 1817 | 62 | 10.66 | 87.35 | 2713 | 58 | 14 | 10.66 |
Table 6. Tracking performance evaluation of modified DeepSORT according to pruning rate.

| Pruning Rate (%) | BFLOPs | NFOV MOTA (%) | NFOV FN | NFOV FP | NFOV IDsw | WFOV MOTA (%) | WFOV FN | WFOV FP | WFOV IDsw |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.66 | 41.54 | 2726 | 1817 | 62 | 87.35 | 2713 | 58 | 14 |
| 50 | 4.32 | 39.74 | 2772 | 1913 | 62 | 87.49 | 2643 | 49 | 23 |
| 70 | 2.99 | 40.73 | 2879 | 1732 | 58 | 87.37 | 2633 | 60 | 19 |