Real-Time Object Detection and Tracking Based on Embedded Edge Devices for Local Dynamic Map Generation

: This paper proposes a camera system designed for local dynamic map (LDM) generation, capable of simultaneously performing object detection, tracking, and 3D position estimation. This paper focuses on improving existing approaches to better suit our application, rather than proposing novel methods. We modified the detection head of YOLOv4 to enhance the detection performance for small objects and to predict fiducial points for 3D position estimation. The modified detector, compared to YOLOv4, shows an improvement of approximately 5% mAP on the Visdrone2019 dataset and around 3% mAP on our database. We also proposed a tracker based on DeepSORT. Unlike DeepSORT, which applies a feature extraction network for each detected object, the proposed tracker applies a feature extraction network once for the entire image. To increase the resolution of feature maps, the tracker integrates the feature aggregation network (FAN) structure into the DeepSORT network. The difference in multiple objects tracking accuracy (MOTA) between the proposed tracker and DeepSORT is minimal at 0.3%. However, the proposed tracker has a consistent computational load, regardless of the number of detected objects, because it extracts a feature map once for the entire image. This characteristic makes it suitable for embedded edge devices. The proposed methods have been implemented on a system on chip (SoC), Qualcomm QCS605, using network pruning and quantization. This enables the entire process to be executed at 10 Hz on this edge device.


Introduction
Local dynamic map (LDM) refers to a map where real-time dynamic information on the road is reflected [1].LDM is primarily utilized to support autonomous driving, particularly in preventing vehicle and pedestrian collisions in blind spots near intersections [2].To achieve this, dynamic objects such as vehicles and pedestrians around intersections need to be detected, and this information should be reflected in real-time on the map integrated into the vehicle.To detect dynamic objects, raw data from sensors can be processed centrally or at a sensor level [3].LDM is a time-critical application, and the former approach faces challenges in meeting time constraints due to the necessity of transmitting a large amount of data to a central server.Therefore, this paper proposes a camera system that detects and tracks dynamic objects in real-time and transmits this information to surrounding vehicles.
The proposed system employs a deep neural network (DNN) to detect dynamic objects and extract features for tracking objects, and it estimates the 3D paths of the objects to be transmitted.The system is implemented using Qualcomm's QCS605, which integrates DSP, GPU, and CPU on a single chip [4].For LDM, both detection performance and real-time processing are crucial.Hence, rather than using computationally intensive two-stage detectors, we modified YOLOv4-a one-stage detector known for its superior detection performance with low computational complexity-to our application domain.We optimized it to operate in real-time on the system hardware [5][6][7].The proposed tracker employs both the positional information and appearance details of objects, similar to DeepSORT [8].While DeepSORT applies a DNN individually to each object for feature extraction, the proposed tracker has been improved to reduce the computational load by applying a DNN to the entire image once.Unlike general surveillance systems, the proposed one needs to generate 3D motion paths for objects.To achieve this, the proposed system detects the bottom face quadrilateral (BFQ) of each object and estimates the 3D path of the quadrilateral's center by using the camera calibration information [9,10].
The rest of this paper is organized as follows.Section 2 explains the hardware and software overview of the proposed system.Sections 3-5 describe the object detection network on DSP, tracking feature extraction network on GPU, and 3D trajectory estimation on CPU, respectively.Section 6 presents the experimental results for both object detection and tracking.Finally, Section 7 concludes the paper with future research directions.

Hardware Overview
The hardware of the proposed camera system is manufactured by WithRobot Inc. (Seoul, Republic of Korea) [11].As shown in Figure 1, it consists of a main board processing an image, a carrier board providing interfaces with peripheral devices, and a camera module.The main board is equipped with Qualcomm's QCS605 along with 4 GB of RAM and 64 GB of permanent storage.It runs on the Android operating system.QCS605 is a lowpower system on chip (SoC) developed for the internet of things (IoT) applications.This SoC integrates DSP, GPU, and CPU [4].The carrier board serves the function of connecting the main board with peripheral devices and supplying power.The carrier board and the camera are connected via MIPI CSI-2 [12], while external systems are connected through ethernet.The camera image is compressed in H.264 on the main board and transmitted externally, along with object detection and tracking results, using real-time streaming protocol (RTSP).The image sensor is the Sony IMX334 with a rolling shutter mechanism and its field of view can vary from 34.4 • to 128 • .The image resolution is 1920 × 1080 pixels.The sizes of the main board and carrier board are very compact, measuring 42 × 35 mm 2 and 88 × 106 mm 2 , respectively.Moreover, the system's average power consumption is low at 15 watts, making it highly suitable for application on mobile platforms.The power can be supplied through either a DC adapter or power over ethernet (PoE).The specifications of the proposed camera system are summarized in Table 1.

Software Overview
The core software components of the system are object detection, tracking feature extraction, and 3D trajectory estimation.These are executed on the DSP, GPU, and CPU of the QCS605, as illustrated in Figure 2.Although the DSP supports only 8-bit integer operations, it is specifically designed for artificial intelligence tasks and, as a result, it is significantly faster than the GPU for the inference of DNN.Therefore, the most computationally intensive task, object detection, is executed on the DSP.The DSP takes an image as input and produces not only typical object detection results like bounding boxes, but also the center of BFQ for each detected object.The GPU of QCS605 is designed for graphic display purposes; however, in our system, it is also utilized for DNN inference to meet the time constraints of our application.The GPU also takes an image and generates a feature map for object tracking.The CPU, based on the bounding boxes from the DSP, extracts feature vectors from the feature map received from the GPU.It then utilizes the feature vectors for the association between tracking and detected objects and converts the center of a BFQ of each object into 3D coordinates.The final outputs of the proposed system are object detection results along with the 3D trajectory of the objects.as input and produces not only typical object detection results like bounding boxes, but also the center of BFQ for each detected object.The GPU of QCS605 is designed for graphic display purposes; however, in our system, it is also utilized for DNN inference to meet the time constraints of our application.The GPU also takes an image and generates a feature map for object tracking.The CPU, based on the bounding boxes from the DSP, extracts feature vectors from the feature map received from the GPU.It then utilizes the feature vectors for the association between tracking and detected objects and converts the center of a BFQ of each object into 3D coordinates.The final outputs of the proposed system are object detection results along with the 3D trajectory of the objects.

Object Detection Network on DSP
Object detectors are categorized into two-stage and one-stage detectors.Two-stage detectors first estimate the bounding boxes of objects and subsequently classify the type of object within those boxes.One-stage detectors divide an image into multiple grids and simultaneously estimate the bounding box and object type within each grid.In general, two-stage detectors are known for superior detection performance compared to one-stage detectors, but their high computational demands make them less suitable for edge devices.
The representative one-stage detector is YOLO series, and we modify YOLOv4 among YOLO series to be suitable for our application, because YOLOv4 has been proven for a considerable period in various applications [13,14] and shows a compromise between computational cost and detection accuracy in various frameworks [15,16].The proposed camera system, installed around intersections, detects moving objects such as pedestrians and vehicles for LDM generation.The longer the object detection range of the camera system, the more economical it becomes.The detection range is directly proportional to the input image resolution of the object detector.However, there is a limitation to increasing the input image resolution because of a trade-off between the resolution and the computational amounts of the detector.Therefore, we adjusted the input image resolution to 416

Object Detection Network on DSP
Object detectors are categorized into two-stage and one-stage detectors.Two-stage detectors first estimate the bounding boxes of objects and subsequently classify the type of object within those boxes.One-stage detectors divide an image into multiple grids and simultaneously estimate the bounding box and object type within each grid.In general, two-stage detectors are known for superior detection performance compared to one-stage detectors, but their high computational demands make them less suitable for edge devices.
The representative one-stage detector is YOLO series, and we modify YOLOv4 among YOLO series to be suitable for our application, because YOLOv4 has been proven for a considerable period in various applications [13,14] and shows a compromise between computational cost and detection accuracy in various frameworks [15,16].The proposed camera system, installed around intersections, detects moving objects such as pedestrians and vehicles for LDM generation.The longer the object detection range of the camera system, the more economical it becomes.The detection range is directly proportional to the input image resolution of the object detector.However, there is a limitation to increasing the input image resolution because of a trade-off between the resolution and the computational amounts of the detector.Therefore, we adjusted the input image resolution to 416 × 256, a level at which the detector can process the images at 10 Hz.Instead, we modified the detection head of YOLOv4 to enhance the detection of distant objects.
As shown in Figure 3, YOLOv4 has three detection heads, each producing output resolutions at low, medium, and high scales.Since the camera system for LDM is installed at a height of more than 15 m above the ground, there is a certain distance between the camera system and objects on the ground.Therefore, the proportion of space occupied by a single object in the image is not significantly high to cover the entire image.It means that the low-scale detection head of YOLOv4, which primarily detects larger objects, becomes less important.Therefore, in this paper, the unnecessary low-scale detection head is removed, and the ultra-high (UH) scale detection head, with an output resolution twice that of the high-scale head, is added to the path aggregation network (PAN) of YOLOv4 as shown in Figure 4.In Figures 3 and 4, CBM and CBL refer to the block consisting of convolution, batch normalization, and mish activation layer or leaky RELU activation, respectively.CSP stands for cross-stage partial connection network block.
× 256, a level at which the detector can process the images at 10 Hz.Instead, we modified the detection head of YOLOv4 to enhance the detection of distant objects.
As shown in Figure 3, YOLOv4 has three detection heads, each producing output resolutions at low, medium, and high scales.Since the camera system for LDM is installed at a height of more than 15 m above the ground, there is a certain distance between the camera system and objects on the ground.Therefore, the proportion of space occupied by a single object in the image is not significantly high to cover the entire image.It means that the low-scale detection head of YOLOv4, which primarily detects larger objects, becomes less important.Therefore, in this paper, the unnecessary low-scale detection head is removed, and the ultra-high (UH) scale detection head, with an output resolution twice that of the high-scale head, is added to the path aggregation network (PAN) of YOLOv4 as shown in Figure 4.In Figures 3 and 4, CBM and CBL refer to the block consisting of convolution, batch normalization, and mish activation layer or leaky RELU activation, respectively.CSP stands for cross-stage partial connection network block.× 256, a level at which the detector can process the images at 10 Hz.Instead, we modified the detection head of YOLOv4 to enhance the detection of distant objects.
As shown in Figure 3, YOLOv4 has three detection heads, each producing output resolutions at low, medium, and high scales.Since the camera system for LDM is installed at a height of more than 15 m above the ground, there is a certain distance between the camera system and objects on the ground.Therefore, the proportion of space occupied by a single object in the image is not significantly high to cover the entire image.It means that the low-scale detection head of YOLOv4, which primarily detects larger objects, becomes less important.Therefore, in this paper, the unnecessary low-scale detection head is removed, and the ultra-high (UH) scale detection head, with an output resolution twice that of the high-scale head, is added to the path aggregation network (PAN) of YOLOv4 as shown in Figure 4.In Figures 3 and 4, CBM and CBL refer to the block consisting of convolution, batch normalization, and mish activation layer or leaky RELU activation, respectively.CSP stands for cross-stage partial connection network block.The proposed camera system also outputs 3D positions of detected objects on the ground.Assuming a flat road surface, the relationship between a point X ′ on the road surface and its corresponding point X in the image plane is expressed by the homography matrix H as shown in Equation (1).In Equation (1), h ij represents the element in the i-th row and j-th column of H. (x, y) and (x ′ , y ′ ) represent horizontal and vertical coordinates in the image and road surface, respectively.
This homography matrix is estimated through camera calibration [10].That is, if the points on the road surface of the detected objects are known, it is possible to estimate the 3D positions of the objects by those points.When detecting objects on the ground with a camera positioned high above, estimating the 3D position based on the center of the object's bounding box can lead to significant errors, as shown in Figure 5.To overcome this, there are studies that focus on detecting fiducial points in the image to estimate the 3D position of detected objects [17][18][19][20].In this paper, we apply our previous approach, which estimates the center of the bottom face quadrilateral of an object as a 3D fiducial point [9].As our previous approach is implemented by modifying the detection head of YOLOv4 to estimate the center of BFQ, it can be easily applied to the proposed system with minimal additional computational cost.
Electronics 2024, 13, x FOR PEER REVIEW 6 of 16 The proposed camera system also outputs 3D positions of detected objects on the ground.Assuming a flat road surface, the relationship between a point ′ on the road surface and its corresponding point  in the image plane is expressed by the homography matrix  as shown in Equation (1).In Equation (1), ℎ represents the element in the i-th row and j-th column of .,  and  ,  represent horizontal and vertical coordinates in the image and road surface, respectively.
This homography matrix is estimated through camera calibration [10].That is, if the points on the road surface of the detected objects are known, it is possible to estimate the 3D positions of the objects by those points.When detecting objects on the ground with a camera positioned high above, estimating the 3D position based on the center of the object's bounding box can lead to significant errors, as shown in Figure 5.To overcome this, there are studies that focus on detecting fiducial points in the image to estimate the 3D position of detected objects [17][18][19][20].In this paper, we apply our previous approach, which estimates the center of the bottom face quadrilateral of an object as a 3D fiducial point [9].As our previous approach is implemented by modifying the detection head of YOLOv4 to estimate the center of BFQ, it can be easily applied to the proposed system with minimal additional computational cost.
In the case of cameras with a wide field of view, such as the proposed camera system, lens distortion occurs as shown in Figure 6a.Equation (2) represents the pin-hole camera model without considering lens distortion.To apply Equation (2), lens distortion must be compensated for.We utilized the equidistance model for lens distortion compensation and the MATLAB toolbox for estimating intrinsic parameters [21,22].Figure 6a,b show a raw image and an undistorted image of the proposed camera system, respectively.The intrinsic and extrinsic camera parameters can be estimated by using a camera calibration pattern.However, as shown in Figure 6, estimating external parameters using a The matrix H in Equation (1) can be computed given the camera's intrinsic parameter matrix K (3 × 3) and the extrinsic parameter matrix P (3 × 4) as shown in Equation (2).
In the case of cameras with a wide field of view, such as the proposed camera system, lens distortion occurs as shown in Figure 6a.Equation (2) represents the pin-hole camera model without considering lens distortion.To apply Equation (2), lens distortion must be compensated for.We utilized the equidistance model for lens distortion compensation and the MATLAB toolbox for estimating intrinsic parameters [21,22].Figure 6a,b show a raw image and an undistorted image of the proposed camera system, respectively.The intrinsic and extrinsic camera parameters can be estimated by using a camera calibration pattern.However, as shown in Figure 6, estimating external parameters using a calibration pattern on public roads is nearly impossible due to traffic flow interference and safety concerns.In this case, structures in the image, such as crosswalks, lanes, buildings, etc., can be utilized instead of a calibration pattern.The camera rotation matrix R can be estimated when there are three vanishing points [10].In this paper, we applied a method to find two vanishing points with two parallel line pairs, such as the green and red line pairs shown in Figure 6c and estimate the remaining vanishing point under the assumption that the center of the image is the orthocenter of a triangle formed by the three vanishing points [23].Once the rotation matrix R is obtained, the world coordinate C of the camera can be estimated using the known length of the structure such as a traffic pole depicted by the yellow line in Figure 6c.With the knowledge of the camera's intrinsic and extrinsic parameters, along with the aspect ratio of the vehicle's length, width, and height, generating the 3D ground truth for the vehicle becomes as simple as drawing one side of the vehicle in the image, exemplified by the red line in Figure 6d.
Electronics 2024, 13, x FOR PEER REVIEW 7 of 16 calibration pattern on public roads is nearly impossible due to traffic flow interference and safety concerns.In this case, structures in the image, such as crosswalks, lanes, buildings, etc., can be utilized instead of a calibration pattern.The camera rotation matrix  can be estimated when there are three vanishing points [10].In this paper, we applied a method to find two vanishing points with two parallel line pairs, such as the green and red line pairs shown in Figure 6c and estimate the remaining vanishing point under the assumption that the center of the image is the orthocenter of a triangle formed by the three vanishing points [23].Once the rotation matrix  is obtained, the world coordinate  of the camera can be estimated using the known length of the structure such as a traffic pole depicted by the yellow line in Figure 6c.With the knowledge of the camera's intrinsic and extrinsic parameters, along with the aspect ratio of the vehicle's length, width, and height, generating the 3D ground truth for the vehicle becomes as simple as drawing one side of the vehicle in the image, exemplified by the red line in Figure 6d.Applying YOLOv4 directly to edge devices with limited computational power can be challenging, despite its efficiency as a one-stage detector.Moreover, adapting it to devices like QCS605, which only supports int8 operations on its DSP, requires quantizing the network parameters.To deploy a DNN on edge devices, network structure optimization and network parameter quantization are necessary.There are various approaches for network structure optimization.First of all, there is a network simplification approach simplifying existing large networks, such as tensor decomposition [24,25] and network slimming [26,27].Another strategy involves designing a new, smaller network trained with distilled knowledge from a larger model, known as knowledge distillation [28,29].Additionally, there is an approach to algorithmically discover efficient network combinations, known as Neural Architecture Search (NAS) [30,31].In this paper, we adopted the network simplification approach, taking advantage of the already validated performance of the YOLOv4 network.Among the network simplification methods, we adopted the widely embraced and effective network channel pruning [7].The simplified network undergoes quantization before being deployed on the DSP, and the quantization is broadly categorized into post-training quantization and quantization-aware training (QAT) [27,28].In general, for datasets where detection is relatively straightforward, the performance difference between the two methods is minimal.However, in cases of challenging Applying YOLOv4 directly to edge devices with limited computational power can be challenging, despite its efficiency as a one-stage detector.Moreover, adapting it to devices like QCS605, which only supports int8 operations on its DSP, requires quantizing the network parameters.To deploy a DNN on edge devices, network structure optimization and network parameter quantization are necessary.There are various approaches for network structure optimization.First of all, there is a network simplification approach simplifying existing large networks, such as tensor decomposition [24,25] and network slimming [26,27].Another strategy involves designing a new, smaller network trained with distilled knowledge from a larger model, known as knowledge distillation [28,29].Additionally, there is an approach to algorithmically discover efficient network combinations, known as Neural Architecture Search (NAS) [30,31].In this paper, we adopted the network simplification approach, taking advantage of the already validated performance of the YOLOv4 network.Among the network simplification methods, we adopted the widely embraced and effective network channel pruning [7].The simplified network undergoes quantization before being deployed on the DSP, and the quantization is broadly categorized into post-training quantization and quantization-aware training (QAT) [27,28].In general, for datasets where detection is relatively straightforward, the performance difference between the two methods is minimal.However, in cases of challenging detection, such as when objects are small, quantization-aware training (QAT) tends to outperform [28].Due to the majority of targets in our application being visually small in projection, QAT is applied.The process of deploying the DNN on the proposed edge device is illustrated in Figure 7. First, less important channels of the object detector are eliminated through sparsity training and channel pruning.During the fine-tuning stage, QAT is performed to enhance the performance of the simplified network while minimizing the performance degradation due to the quantization.Finally, the model undergoes quantization and is embedded into edge devices.
detection, such as when objects are small, quantization-aware training (QAT) tend perform [28].Due to the majority of targets in our application being visually sma jection, QAT is applied.The process of deploying the DNN on the proposed edg is illustrated in Figure 7. First, less important channels of the object detector are eli through sparsity training and channel pruning.During the fine-tuning stage, QA formed to enhance the performance of the simplified network while minimizing formance degradation due to the quantization.Finally, the model undergoes quan and is embedded into edge devices.

Tracking Feature Extraction Network on GPU
To generate the trajectory of detected objects, object tracking is necessary.Th algorithm, which predicts the next position of each track using a Kalman filter a ciates detected objects with predicted positions through the Hungarian algor widely utilized due to its low computational complexity.However, since the o pearance feature is not used in the data association, track IDs are often switched address this, DeepSORT method has emerged, which involves extracting appeara ture vectors using a DNN and estimates the difference between the feature vecto tected objects and the tracks using cosine distance [8].Subsequently, various obje ing methods have been introduced [33][34][35].However, we have modified Deep operate in real-time on our hardware, as its simpler structure is more conducive cation on edge devices compared to some state-of-the-art tracking methods with performance.
DeepSORT has a relatively shallow feature extraction network with only 1 as shown in Figure 8.However, its computational load increases proportionally number of detected objects since the network is applied individually to each dete ject.The proposed camera system must perform object detection, tracking, and tion estimation within 100 ms per frame.Since the object detector is barely operat Hz using DSP, the tracking feature extraction network should be run on the G GPU of QCS605, Adreno 615, operates at a clock speed of 430 MHz with only 25 Although DeepSORT's feature extraction network is lightweight, applying the individually to each object is not feasible to meet the time requirements on this p Therefore, as depicted in Figure 8, the proposed method applies the feature ex network to the entire input image, extracting a feature map with 128 channels quently, it applies ROI-Align to the feature map for each object to extract its featur For DeepSORT, the output resolution of Residual 9, as depicted in Figure 8, is eig lower than the network input.If the network of DeepSORT is directly applied to th image, the resolution occupied by the detected objects in the feature map becom low, leading to a diminished discriminative power of the feature vectors.To addr the proposed method attaches the FAN structure to the DeepSORT architectu

Tracking Feature Extraction Network on GPU
To generate the trajectory of detected objects, object tracking is necessary.The SORT algorithm, which predicts the next position of each track using a Kalman filter and associates detected objects with predicted positions through the Hungarian algorithm, is widely utilized due to its low computational complexity.However, since the object appearance feature is not used in the data association, track IDs are often switched [32].To address this, DeepSORT method has emerged, which involves extracting appearance feature vectors using a DNN and estimates the difference between the feature vectors of detected objects and the tracks using cosine distance [8].Subsequently, various object tracking methods have been introduced [33][34][35].However, we have modified DeepSORT to operate in realtime on our hardware, as its simpler structure is more conducive to application on edge devices compared to some state-of-the-art tracking methods with superior performance.
DeepSORT has a relatively shallow feature extraction network with only 15 layers, as shown in Figure 8.However, its computational load increases proportionally with the number of detected objects since the network is applied individually to each detected object.The proposed camera system must perform object detection, tracking, and 3D position estimation within 100 ms per frame.Since the object detector is barely operating at 10 Hz using DSP, the tracking feature extraction network should be run on the GPU.The GPU of QCS605, Adreno 615, operates at a clock speed of 430 MHz with only 256 ALUs.Although DeepSORT's feature extraction network is lightweight, applying the network individually to each object is not feasible to meet the time requirements on this platform.Therefore, as depicted in Figure 8, the proposed method applies the feature extraction network to the entire input image, extracting a feature map with 128 channels.Subsequently, it applies ROI-Align to the feature map for each object to extract its feature vector.For DeepSORT, the output resolution of Residual 9, as depicted in Figure 8, is eight times lower than the network input.If the network of DeepSORT is directly applied to the entire image, the resolution occupied by the detected objects in the feature map becomes very low, leading to a diminished discriminative power of the feature vectors.To address this, the proposed method attaches the FAN structure to the DeepSORT architecture.This structure aggregates low-resolution and high-resolution feature maps, and its output is further processed with a 1 × 1 convolution, resulting in a feature map with dimensions 208 × 128 × 128 (width, height, channels), as depicted in Figure 8. further processed with a 1 × 1 convolution, resulting in a feature map with dimensions 208 × 128 × 128 (width, height, channels), as depicted in Figure 8.

Three-dimensional Trajectory Estimation on CPU
When receiving tensors containing detection results and a tracking feature map from both DSP and GPU, respectively, 3D tracking paths of detected objects based on these tensors are generated as a final output in the CPU.The detection results are produced by applying non-maximum suppression (NMS) to objects with confidence scores exceeding a threshold (0.5) in the output tensor from the detector.Subsequently, for each detected object, ROI-Align is applied to the tracking feature map based on its bounding box, extracting a 128-dimensional feature vector.Afterward, data association is performed using the intersection over union (IoU) and feature vector distance between the detected objects and tracks, following the same methodology as the DeepSORT approach.While it is possible to utilize 3D spatial distance instead of IoU during the data association, the presence of one-or two-pixel errors in the image can lead to significant distance errors.Therefore, 3D positions are not employed in the data association; rather, they are exclusively utilized to generate 3D tracking paths.

Detector Performance Evaluation
For the evaluation of the detector performance, the Visdrone2019-Det dataset and our surveillance camera object detection (SCOD) dataset are utilized [36].Visdrone2019-Det consists of images captured by drones, as illustrated in Figure 9.The dataset comprises a total of 7019 images, with 6471 images used for training and 548 images for evaluation.This dataset contains ground truth annotations for a total of 10 object categories, including cars, pedestrians, bicycles, buses, etc. Visdrone2019-Det is suitable for evaluating

Three-dimensional Trajectory Estimation on CPU
When receiving tensors containing detection results and a tracking feature map from both DSP and GPU, respectively, 3D tracking paths of detected objects based on these tensors are generated as a final output in the CPU.The detection results are produced by applying non-maximum suppression (NMS) to objects with confidence scores exceeding a threshold (0.5) in the output tensor from the detector.Subsequently, for each detected object, ROI-Align is applied to the tracking feature map based on its bounding box, extracting a 128-dimensional feature vector.Afterward, data association is performed using the intersection over union (IoU) and feature vector distance between the detected objects and tracks, following the same methodology as the DeepSORT approach.While it is possible to utilize 3D spatial distance instead of IoU during the data association, the presence of one-or two-pixel errors in the image can lead to significant distance errors.Therefore, 3D positions are not employed in the data association; rather, they are exclusively utilized to generate 3D tracking paths.

Detector Performance Evaluation
For the evaluation of the detector performance, the Visdrone2019-Det dataset and our surveillance camera object detection (SCOD) dataset are utilized [36].Visdrone2019-Det consists of images captured by drones, as illustrated in Figure 9.The dataset comprises a total of 7019 images, with 6471 images used for training and 548 images for evaluation.This dataset contains ground truth annotations for a total of 10 object categories, including cars, pedestrians, bicycles, buses, etc. Visdrone2019-Det is suitable for evaluating performance in scenarios where there is a significant distance between objects and the camera, given its inclusion of numerous distant objects.camera, given its inclusion of numerous distant objects.
The SCOD dataset was obtained from traffic flow monitoring cameras around intersections, as depicted in Figure 10, aligning with the primary application of the proposed camera system for LDM generation.This dataset comprises a total of 24,723 images, with 21,494 images used for training and 3229 images for evaluation.It includes ground truth annotations for three categories: vehicles, pedestrians, and bicycles.Table 2 shows the detection performance of YOLOv4 and the modified detector.For Visdrone2019-Det and SCOD the input image resolutions are 416 × 416 × 3 and 416 × 256 × 3, respectively.Although the modified detector incurs slightly more computational overhead compared to YOLOv4, its high-resolution output improves mAP by approximately 5% and 3% for Visdrone2019-Det and SCOD, respectively.Figure 11 compares the detection results of YOLOv4 and the modified detector on Visdrone2019-Det.As observed in Figure 11, it can be noted that the modified detector performs better in detecting small objects compared to YOLOv4. Figure 12 presents a comparison result on the SCOD The SCOD dataset was obtained from traffic flow monitoring cameras around intersections, as depicted in Figure 10, aligning with the primary application of the proposed camera system for LDM generation.This dataset comprises a total of 24,723 images, with 21,494 images used for training and 3229 images for evaluation.It includes ground truth annotations for three categories: vehicles, pedestrians, and bicycles.
Electronics 2024, 13, x FOR PEER REVIEW 10 of 16 performance in scenarios where there is a significant distance between objects and the camera, given its inclusion of numerous distant objects.
The SCOD dataset was obtained from traffic flow monitoring cameras around intersections, as depicted in Figure 10, aligning with the primary application of the proposed camera system for LDM generation.This dataset comprises a total of 24,723 images, with 21,494 images used for training and 3229 images for evaluation.It includes ground truth annotations for three categories: vehicles, pedestrians, and bicycles.Table 2 shows the detection performance of YOLOv4 and the modified detector.For Visdrone2019-Det and SCOD the input image resolutions are 416 × 416 × 3 and 416 × 256 × 3, respectively.Although the modified detector incurs slightly more computational overhead compared to YOLOv4, its high-resolution output improves mAP by approximately 5% and 3% for Visdrone2019-Det and SCOD, respectively.Figure 11 compares the detection results of YOLOv4 and the modified detector on Visdrone2019-Det.As observed in Figure 11, it can be noted that the modified detector performs better in detecting small objects compared to YOLOv4. Figure 12 presents a comparison result on the SCOD Table 2 shows the detection performance of YOLOv4 and the modified detector.For Visdrone2019-Det and SCOD the input image resolutions are 416 × 416 × 3 and 416 × 256 × 3, respectively.Although the modified detector incurs slightly more computational overhead compared to YOLOv4, its high-resolution output improves mAP by approximately 5% and 3% for Visdrone2019-Det and SCOD, respectively.Figure 11 compares the detection results of YOLOv4 and the modified detector on Visdrone2019-Det.As observed in Figure 11, it can be noted that the modified detector performs better in detecting small objects compared to YOLOv4. Figure 12 presents a comparison result on the SCOD dataset, where YOLOv4 tends to detect multiple pedestrians as a single entity, whereas the modified detector successfully separates and detects them individually.dataset, where YOLOv4 tends to detect multiple pedestrians as a single entity, whereas the modified detector successfully separates and detects them individually.As mentioned earlier, directly applying YOLOv4 to the edge device did not meet the processing time requirements, leading to network simplification.The performance analysis for the simplified networks is shown in Table 3.As seen in Table 3, for both Visdrone2019-det and SCOD datasets, increasing the network pruning ratio up to 70% results in minimal performance degradation.Interestingly, for Visdrone2019-det, the mAP of 70% pruned detector has improved by approximately 1%.This is likely attributed to the removal of unnecessary channels, resulting in less overfitting of the DNN to the training data.When 70% pruned version of the modified YOLOv4 for SCOD was deployed on the camera system, it was able to process approximately 11.8 frames per second.As mentioned earlier, directly applying YOLOv4 to the edge device did not meet the processing time requirements, leading to network simplification.The performance analysis for the simplified networks is shown in Table 3.As seen in Table 3, for both Vis-drone2019-det and SCOD datasets, increasing the network pruning ratio up to 70% results in minimal performance degradation.Interestingly, for Visdrone2019-det, the mAP of 70% pruned detector has improved by approximately 1%.This is likely attributed to the removal of unnecessary channels, resulting in less overfitting of the DNN to the training data.When 70% pruned version of the modified YOLOv4 for SCOD was deployed on the camera system, it was able to process approximately 11.8 frames per second.

Tracker Performance Evaluation
The tracking performance was evaluated using our own dataset, referred to as the surveillance camera object tracking (SCOT) dataset.SCOT dataset comprises 13 narrow field of view sequences with pedestrians as the primary tracking targets and 10 wide field of view sequences with vehicles as the primary tracking targets, as illustrated in Figure 13.The number of frames and objects to be tracked for each sequence are presented in Table 4.For instance, a narrow field of view sequence whose ID is 1 has a total of 810 image frames and 11 tracking targets, as shown in Table 4. Seven narrow field of view sequences and six wide field of view sequences from the SCOT dataset were used for training.The narrow field of view sequences used for training contain a total of 5741 image frames with 68 tracking targets, while the wide field of view sequences contain a total of 5098 frames with 156 tracking targets.For evaluation, six narrow field of view sequences and four wide field of view sequences were used.The reason for dividing the dataset into narrow and wide field of view sequences is that the difficulty of tracking pedestrians and vehicles, which the primary targets, is too varied.For vehicles, the frequency of intersections between objects is low, and their movement is relatively straightforward, following lanes.Therefore, tracking vehicles is relatively easy.On the other hand, for pedestrians, the frequency of intersections between objects is high, and they have unrestricted movement paths, making tracking more challenging.
age frames with 68 tracking targets, while the wide field of view sequences contain a total of 5098 frames with 156 tracking targets.For evaluation, six narrow field of view sequences and four wide field of view sequences were used.The reason for dividing the dataset into narrow and wide field of view sequences is that the difficulty of tracking pedestrians and vehicles, which are the primary targets, is too varied.For vehicles, the frequency of intersections between objects is low, and their movement is relatively straightforward, following lanes.Therefore, tracking vehicles is relatively easy.On the other hand, for pedestrians, the frequency of intersections between objects is high, and they have unrestricted movement paths, making tracking more challenging.The performance of the proposed tracker was compared with DeepSORT.DeepSORT resizes the image within the bounding box for each detected object and feeds it into the feature extraction network.For the primary tracking targets, vehicles and pedestrians, the aspect ratio is significantly different.Resizing them to the same resolution may degrade the tracking performance.Therefore, for vehicles and pedestrians, separate feature extraction networks were used for DeepSORT.Vehicles were resized to 128 × 128, and pedestrians to 64 × 128, and then fed into the DNN.Multiple object tracking accuracy (MOTA) defined in Equation ( 3) is used as a key performance metric [35].In Equation (3), FN, FP, GT and IDSW stand for false negative, false positive, ground truth, and the number of times the track ID changes.
Table 5 presents the tracking performance comparison results between DeepSORT and the proposed method.As mentioned earlier, objects in NFOV sequences are  The performance of the proposed tracker was compared with DeepSORT.DeepSORT resizes the image within the bounding box for each detected object and feeds it into the feature extraction network.For the primary tracking targets, vehicles and pedestrians, the aspect ratio is significantly different.Resizing them to the same resolution may degrade the tracking performance.Therefore, for vehicles and pedestrians, separate feature extraction networks were used for DeepSORT.Vehicles were resized to 128 × 128, and pedestrians to 64 × 128, and then fed into the DNN.Multiple object tracking accuracy (MOTA) defined in Equation ( 3) is used as a key performance metric [35].In Equation (3), FN, FP, GT and IDSW stand for false negative, false positive, ground truth, and the number of times the track ID changes.
Table 5 presents the tracking performance comparison results between DeepSORT and the proposed method.As mentioned earlier, objects in NFOV sequences are challenging to track, resulting in both DeepSORT and the proposed method having similarly low performance, with MOTA at about 41%.For DeepSORT, it requires 0.59 BFLOPs per pedestrian, while the proposed method requires 10.66 BFLOPs regardless of the number of objects.The objects in WFOV sequences are easy to track, resulting in both DeepSORT and the proposed method more than 87% MOTA.In the case of DeepSORT, the input image resolution for vehicles is 128 × 128, leading to a higher computational load of 1.18 BFLOPs per object in WFOV sequences.In embedded systems, minimizing computational load is as crucial as tracking performance.Especially for the system stability, the computational load needs to be consistent.The proposed tracking method exhibits performance comparable to DeepSORT, but its consistent computational load makes it more suitable for embedded edge devices, such as the proposed camera system.If the proposed tracking feature extraction network is directly ported to the QCS605 GPU, it cannot meet the real-time requirement of 10 Hz operation.Therefore, the tracking feature extraction network was also simplified through channel pruning, but its parameters were not quantized, as the GPU supports floating-point operations.Table 6 shows that when the proposed method is pruned up to 70%, the computational load significantly decreases from 10.66 BFLOPs to 2.99.However, despite this reduction, the tracking performance did not exhibit significant degradation compared to before pruning.

Conclusions and Future Works
This paper proposes a camera system for real-time object detection and tracking to generate LDM.The object detector and tracking feature extracting network are improved for our application and optimized through channel pruning and quantization-aware training to suit embedded edge devices.Furthermore, the main components of the algorithm, including the object detector, tracking feature extracting network, and the tracking with 3D position estimation, are appropriately allocated to the DSP, GPU, and CPU of the Qualcomm QCS605 chip in the camera system, achieving the required operation speed of 10 Hz.Edge devices capable of real-time object detection and tracking are essential for various applications, such as autonomous driving, mobile robotics, intelligent surveillance, and more.Therefore, it is expected that the proposed method will find widespread applications.Since the proposed system utilizes a lightweight network for real-time processing on low-end hardware, its tracking performance can be degraded on challenging datasets.In the future, to overcome this, we plan to improve the tracking performance as follows: First, we will develop a multi-functional DNN whose backbone network is shared by object tracking and detection.Since capturing subtle differences between similar objects is crucial for the feature vectors used in tracking, we will compare various backbone networks, such as DenseNet and HRNet, which effectively fuse low-resolution and highresolution features [37,38].Second, we will develop a network that predicts the appearance

Figure 5 .
Figure 5. Distance error according to 3D fiducial point of an object.The matrix  in Equation (1) can be computed given the camera's intrinsic parameter matrix  (3 × 3) and the extrinsic parameter matrix  (3 × 4) as shown in Equation (2).

Figure 5 .
Figure 5. Distance error according to 3D fiducial point of an object.

Figure 7 .
Figure 7. Process of deploying a DNN to edge devices.

Figure 7 .
Figure 7. Process of deploying a DNN to edge devices.

Table 1 .
HW specification of the proposed camera system.

Table 1 .
HW specification of the proposed camera system.

Table 2 .
Evaluation of object detectors.

Table 2 .
Evaluation of object detectors.

Table 3 .
Evaluation the modified YOLOv4 according to the pruning rate.

Table 3 .
Evaluation of the modified YOLOv4 according to the pruning rate.

Table 4 .
Surveillance camera object tracking dataset.

Table 4 .
Surveillance camera object tracking dataset.

Table 6 .
Tracking performance evaluation of modified DeepSORT according to pruning rate.