Real-Time Motorbike Detection: AI on the Edge Perspective

: Motorbikes are an integral part of transportation in emerging countries, but unfortunately, motorbike users are also one the most vulnerable road users (VRUs) and are engaged in a large number of yearly accidents. So, motorbike detection is very important for proper traffic surveillance, road safety


Introduction
Surveillance systems are a very important and essential part of the built environment due to safety and security.One of the major applications of surveillance is traffic monitoring.Traffic surveillance is an integral part of road safety and security.In terms of security, traffic detection is useful to identify, track, and extract the license plate and behavior of vehicles.Traditional surveillance systems are generally driven by human effort, i.e., a video is captured and people are deployed to examine activity or security risks.However, this approach has disadvantages like high labor costs, lower efficiency, human limitations, and high resource requirements.With the rising trend of artificial intelligence, computer vision and deep learning play a vital role in many advanced applications, ranging from security, medical analysis, sports, military, and general surveillance.These techniques can replace traditional monitoring systems with Intelligent Surveillance Systems (ISSs).This technique takes the input from a camera, i.e., fixed CCTV, drones, or any other imaging source, and helps to identify, track, or classify objects using deep learning algorithms.
For traffic monitoring, road vehicles can be categorized into two main classes: twowheeled vehicles, and four-wheeled and above vehicles.In developed countries, cars are the most popular means of urban transportation, but in developing nations, the most popular and common means of transportation is motorbikes, mainly because they tend to be more affordable and more easily accessible.However, motorbikes are also more likely to result in fatal road accidents.The World Health Organization (WHO) declared the motorbike as a vulnerable road user (VRU).Motorbikes are one the most venerable road users (VRUs) and are engaged in a large annual number of accidents; therefore, traffic monitoring systems need to detect them in real-time.Once detected, further processing might involve tracking, the detection of rule violations, helmet-wearing compliance, license plate reading, overloading, etc.
Most of the work that has been carried out on road surveillance is for four-wheelers or over (cars, trucks, etc.).Although motorbike detection is equally important, it is much more difficult due to their smaller size, higher occlusion (e.g., often due to the high density of bikes), varying angles of capture, and rider variations.These factors have a high impact on learning algorithms that lead to difficulty in detection.Furthermore, a major difficulty is the unavailability of large datasets.Previous works that have been carried out for motorbike detection have mainly focused on improving the accuracy of bike detection against issues such as occlusion and environmental variations, but they are impractical because of high computation costs, which makes them difficult to deploy in practice.So, the potential challenge that we have identified is in determining how we can make a neural network-based motorbike detection system more cost-efficient for real-time and practical implementation, using off-the-shelf embedded hardware.Regarding bike number plate recognition, it is worth noting that OCR (Optical Character Recognition) is a thoroughly researched technique, as evidenced by numerous studies, e.g., [1][2][3].While we recognize the potential advantages of integrating number plate detection with OCR capabilities into our research, our primary objective in this study was to enhance motorbike detection using deep learning models specifically designed for edge devices that can be a precursor to OCR-based plate recognition and association to a specific motorbike.
For real-time implementation, a brute force approach with powerful hardware can complete the work, but another critical challenge is how to make the detection system practical.So, an ideal system should be a small form factor embedded solution for using the model in a real-world scenario.Both solutions are shown in Figure 1.This work has studied multiple cost-effective solutions for motor bike detection in realtime as deployable solutions and provides a performance analysis of popular object detection models for this task on different embedded hardware.It has the following contributions: Improved Baseline Accuracy: Achieved 99% accuracy with custom YoloV5 and augmented dataset, surpassing previous results.
Optimization Analysis: Conducted a comprehensive analysis, comparing motorbike object detection performance on various embedded edge devices, benchmarking results and optimizations.
Edge Deployable Real-Time Systems: Developed real-time motorbike detection models on GPUs, optimized for deployment on NVIDIA, Intel, and Google embedded edge devices.
The rest of the paper is structured as follows: Section 2 describes related work on real-time motorbike detection models for embedded edge devices.Section 3 presents the methodology adopted in this study.Section 4 presents and analyzes the experimental results.Finally, Section 5 provides the conclusions drawn from the study and outlines potential future work.

Related Work
Traditionally, motorbike detection has comprised two main steps.Initially, features are detected, followed by classification to distinguish between motorbikes and non-motorbikes, as well as to identify specific features like helmets and license plates.The Circular Hough Transform (CHT) serves as a valuable tool for localizing regions of interest (ROI), as utilized by Silva et al. [4] in helmet detection and by Mukhtar et al. [5] for detecting helmets and headlights on motorbikes.Harr-like features, known for their real-time performance, are employed by Wonghabut et al. [6] and Gavadi et al. [7] in motorbike helmet detection.However, Harr-like features are susceptible to factors such as capture angle and proximity, rendering them less robust for surveillance applications.Moreover, Histogram of Oriented Gradient (HOG) feature extraction is employed by Ghonge et al. [8] to detect license plates of riders without helmets, and also by Dahiya et al. [9] and Singh et al. [10] for a comparative analysis of feature extraction techniques in motorbike and helmet detection.At the time of these studies, HOG was considered one of the most effective feature extractors, alongside SWIFT [11].In terms of classification, various methods such as Support Vector Machine (SVM), decision trees, random forests, and k-nearest neighbor (KNN) algorithms are employed.Mukhtar et al. [5] and Barics et al. [12] explored SVM classifiers.Shuo et al. [13] utilized SVM with Harr-like features.In contrast, Dupuis et al. [14] employed decision trees for motorcycle classification, combined with manually labeled blobs to mitigate overfitting.Le et al. [15] enhanced classification accuracy by combining random forests with other methods to classify different motorbike components.Waranusast et al. [16] employed K-nearest neighbors (KNNs) for motorcycle classification and helmet detection, while Barics et al. [12] integrated classifiers with hybrid camera systems to achieve improved accuracy.
The paper [17] employs a deep learning approach in conjunction with traditional methods for motorbike detection.They combine a Convolutional Neural Network (CNN) with Histogram of Gradient (HOG) and Support Vector Machine (SVM) for classification and detection.To address false detections, they categorize data into four classes, achieving 84% precision.However, their accuracy significantly decreases in occluded scenarios, and the processing time of two minutes per image renders it unsuitable for real-time implementation.Occlusion poses a significant challenge in motorbike detection.In [18], the authors introduce the annotated dataset MB7500 to handle occlusion, employing a Faster R-CNNbased network with a custom two-layer CNN.Despite these efforts, detection remains irregular, with accuracy dropping to 20% in heavily occluded scenes.The low frame rate per second (fps) of Faster R-CNN makes real-time application challenging.In [19], a fourlayer CNN feature extractor serves as the backbone for Faster R-CNN, augmented with the Markov Decision Process (MDP) for tracking on the MB1000 dataset, achieving 88% mean average precision (mAP) in occluded scenarios.However, the compute-intensive two-step detection process results in high inference time, making it impractical for real-time deploy-ment, especially on edge devices.Addressing false detections, missed detections, and small objects with complex backgrounds, the authors od [20] propose a customized technique alongside a Fast R-CNN-based deep neural network.Despite achieving high accuracy, reaching 90.8% and 88.6% for low and high difficulty sets, respectively, the computational costs are considerable, prioritizing detection robustness over real-time applicability.In [21], Faster R-CNN with Inception-ResNet backbone performs well, with the single-stage SSD network with Inception outperforming others in accuracy with preprocessing techniques.However, real-time applications are not considered, posing challenges for embedded solutions.The work in [22] utilizes deep learning for traffic surveillance, extending to detect traffic violations.They incorporate the "DhakaAI" dataset with their images, enhancing it with 17 vehicle types, including around 4200 motorbike images.While achieving 15-20 fps using YoloV4 and 24-30 with 72.02% mAP using YoloV4-tiny, real-time deployment on edge devices is hindered by low mAP and fps.In [23], a custom network based on SSD is proposed for detecting complex traffic scenarios, achieving an mAP of 89.05% and an overall speed of 32 fps.While meeting real-time criteria, challenges persist in deployment on embedded platforms.Proposing a methodology for motorbike rider detection with and without helmets, the paper [24] trains two models using Faster R-CNN with the Inception V2 model, achieving 93.37% accuracy overall.However, the fps is not discussed, suggesting potential difficulties in real-time operation and deployment on embedded platforms.
Regarding datasets, detecting two-wheeled vehicles presents a challenging task, primarily due to the scarcity of comprehensive datasets.Despite the critical nature of this issue, there exists no single dataset that sufficiently covers all aspects and variations essential for benchmarking two-wheeled vehicle detection.This lack of availability of a comprehensive dataset has been underscored by the authors of [25].One contributing factor to this scarcity could be the smaller contribution of motorbikes to traffic in developed nations compared to the developing world.Several benchmark datasets include the motorbike category, albeit with various limitations.For instance, the dataset introduced in [26] comprises five videos and 3880 annotated images, encompassing diverse traffic scenarios.However, it lacks motorbike category annotations and is more suited for autonomous car applications rather than wide-area surveillance.Similarly, the CBCL street scenes database [27], containing 8000 images with annotations across nine object classes, is relevant for street scene understanding but lacks motorbike images in urban settings.KITTI [28], renowned in autonomous vehicle research, offers datasets for optical flow, stereo, 3D object detection, tracking, and semantic segmentation.Despite its versatility, KITTI does not include a motorbike category.The PASCAL VOC dataset, a popular benchmark for detection, segmentation, and classification tasks, includes the motorbike category but with only 713 annotated images by 2012 [29,30].Its limited size raises concerns about overfitting, and its focus on detection makes it less suitable for surveillance applications.The Caltech dataset [31] features 30,607 images spanning 256 object categories, including the motorbike category with 798 images.However, its limitation lies in the absence of frontal views, rendering it less suitable for surveillance purposes.Some researchers resort to custom datasets, which unfortunately are not publicly available.These include datasets used by [4,13,14,[32][33][34][35][36][37].The public MB7500 dataset, published by the authors of [38], stands out for its coverage of occlusion and representative surveillance viewpoints, making it a valuable resource for two-wheeled detection datasets.To encapsulate the diverse approaches employed by various authors in motorbike detection, it is evident that different methods and models have been utilized.Among these, the approach outlined by the authors of [19] stands out prominently, boasting an exceptionally high mean average precision (96%).This work suggests that SSD Mobilenet emerges as the optimal choice for motorbike detection among available methods.Shifting the focus to dataset comparisons, it becomes apparent that MB7500 holds a position of prominence.This dataset stands out as the most suitable for motorbike detection, particularly from a surveillance perspective.Not only is MB7500 publicly accessible, but it also stands as the most frequently utilized dataset for motorbike detection in prior research publications.

Methodology
The main purpose of this work is to investigate motorbike detection using stateof-the-art neural networks on low-power, embedded edge devices in real-time.Several steps are applied to accomplish this task.A detailed flow of all these steps is depicted in Figure 2. In the first step, a dataset is chosen and prepared for the application.Secondly, an appropriate model is selected and trained.Lastly, a suitable edge device is selected and model optimization is applied to generate the final inference model file for deployment.All these steps are discussed in more detail in the following sections.Step: 1 Step: 2 Step: 3(b) Step: 4 Step: 3(a)

Dataset and Augmentation
The performance of a deep learning model heavily depends upon the quality and quantity of the dataset.As discussed earlier, for motorbike detection, MB7500 is the most suitable current dataset.However, it has some issues, such as a lack of variability, i.e., lighting condition, angle of capture, and other environmental conditions that are not covered in the dataset.To enhance the quantity and introduce variability in the dataset, image level and object level augmentation are performed.Image level is a type of augmentation that is applied on the entire image and object level augmentation is applied only within the object bounding box.In total, 14 different types of augmentations are applied that include flip, crop, 90 • rotation, minor rotations, grayscale, saturation, brightness, blur, and mosaic for image level and flip, crop, 90 • rotation: clockwise and anticlockwise, minor rotations, and brightness for object level augmentation.As a result, the dataset size is increased from 7500 images to 15,000 images.Figure 3 depicts the augmentation types.Augmentations not only help to increase the dataset but also help introduce a more difficult scenario to improve accuracy on the different datasets.To make the model less sensitive to camera orientation, roll-robustness, position, and subject translation are performed by rotation, flip, 90 • rotation, rotation of the frame (30 • rotation in this case), and by cropping the frames randomly.

Model Selection
Motorbike detection is an object detection task with several models available for this purpose, as pointed out earlier.Paper [21] reports several experiments with over 50 networks to identify the best model for motorbike detection and concludes that SSD with a Resnet backbone works best for this application.This work also uses the SSD architecture, but its Resnet backbone is replaced with Mobilenet V2.This is because Resnet can only perform on a GPU-based system and is not suitable for edge devices, while Mobilenet V2 is specialized for edge devices.Then, since motorbike detection is a surveillance application and Yolo models are well suited for this task and are suitable for edge devices, we also studied the use of YoloV5 for this task.

Single Shot Detection (SSD)
Mobilenet V2 backbone acts as a feature extractor; it is comparatively faster than other backbone architectures and is a popular choice for real-time applications on edge devices.For this reason, SSD Mobilenet V2 is supported by all of the major edge devices' optimization engines.In this work, the SSD Mobilenet V2 with the augmented images of the MB7500 dataset has been trained for 1000 epochs on a GPU.

Customized YoloV5
For most surveillance applications, Yolo architectures are used.Not too long ago, YoloV5 was released which outperformed all its predecessors.YoloV5 offers the flexibility of customizing the architecture according to the available computation resources, which can be very advantageous for real-time applications and edge deployment.Another advantage of YoloV5 is that it is known to generate very stable inference results in traffic surveillance with high mAP.The architecture used here is customized according to the computation resources on the edge devices so that it can perform in real-time.To reduce the computation of YoloV5, layer reduction and sparsity techniques have been applied.
For a fair comparison, the same dataset (Augmented MB7500) has been used for both networks.To train these two networks, transfer learning has been applied to further boost accuracy.This has been performed using pre-trained weights of these models for motorbike detection.These networks have been trained iteratively with different hyperparameters until the best accuracy is achieved in terms of mAP.After training on a GPU, we achieved 91.5% mAP for SSD Mobilenet V2 and 99% mAP for YoloV5.This sets the baseline performance for the optimization process, as shown in Table 1.

Edge Deployment
This section is divided into two main parts.First, it deals with the process of selecting suitable edge devices.Secondly, it explores the optimization of algorithms to ensure efficient execution on the chosen edge devices.

Edge Devices
Conventional detection procedures are not generally deployable solutions for realtime applications, due to extensive computation resource requirements and high power consumption.So, an appropriate edge device has to be selected for implementing a model in real-time.Edge devices come in all shapes and sizes with their respective advantages and disadvantages.Currently, there are many options available to choose from, but the main three types of edge devices are Intel, Nvidia, and Google.Before deploying a model on the edge device, it needs to pass through a vendor-specific optimization tool that reduces the computation cost and inference time of the model.Below is a list of available edge devices from these three suppliers: 1.
Google (USA, CA, Mountain View) This work uses all these devices so that the best platform can be identified.The specifications of each of these devices are shown in Table 2 and the devices are shown in Figure 4.

Optimization
The conventional deep learning and machine learning algorithm development and deployment cycle has two phases.The first phase is the training phase in which the focus is on selecting a suitable neural network architecture and dataset preparation to train the model.In this process, the priority is on the desired accuracy because the higher the validation accuracy, the better the solution.The second phase is the inference process where the trained model is used to predict unseen data and the goal is to achieve comparable accuracy in making a decision as fast as possible.Conventional approaches use brute force to increase the computation resources to make this possible.A more suitable approach is to transform the trained model such that it uses less computation and resources, this process is called optimization.Optimization offers several advantages as it helps to reduce the size of the network and uses low precision to make it possible to efficiently run large-scale networks on edge devices.Optimization is what stands between conventional inference and a smart, compact, edge deployable deep learning solution.Due to this, optimization has now become the third phase of the modern machine learning-based solution development cycle.
Optimization is a general strategy that can be performed directly on the model but where each vendor provides their proprietary tools to optimize the model.This is because the optimization of a model for an edge device heavily depends upon the device architecture, and vendors do not expose the architectural details of their devices.Ten-sorRT (Release 9.3.0) is used for optimizing the trained model for Nvidia's edge devices.OpenVINO (Release 2023.3.0) is an Intel open-source toolkit for the optimization of the network to be used for inference and quick deployment.TensorFlow Lite (Version 2.9) is an open-source package of deep learning tools/framework to carry out the optimized deep learning inference on the edge for Google Edge devices.In general, there are two types of optimization: software-related optimization and hardware-related optimization.Each optimizer uses different methods to optimize the deep learning trained graph, but the general sub-processes that are involved in the optimization process are quantization, pruning, layer fusion, clustering, sparsity, time fusion, kernel auto-tuning, dynamic tensor memory, and so on.
Since we are using all these devices, all these optimization tools have been used in this work.Optimization is not a stable process and it often generates uncertain outputs (low accuracy, high inference time, high memory consumption, etc.), which requires iterative adjustments of user-defined parameters to achieve the best results.

Results
The final optimized network file is copied to the target device to run it for producing predictions.Then, the performance of the optimized network can be measured on the target device.If the desired results are achieved, the solution can be applied in the field.Before proceeding with the results, it is necessary to understand the evaluation metrics, as discussed next.

Experimental Setup
In this work, the object detection system implemented on the edge devices is coded in Python due to its versatility, ease of development, and compatibility with the selected frameworks and libraries.These choices were made to facilitate comprehensive comparison of the performance of state-of-the-art edge devices with that of a GPU, considering metrics such as mAP (mean average precision), FPS (frames per second), model size after optimization, power consumption, and memory usage.Additionally, two baseline models have been established to mitigate any bias stemming from a particular device on a model.

Dataset
As pointed out before, there is a general lack of suitable datasets to train motorbike detection models [25].Although no comprehensive dataset is available, we have selected the MB7500 open-source dataset, contributed by Espinosa et al., because it is focused on motorbikes with surveillance-like views.Although it would have been possible to use KITTI, it is mainly aimed at autonomous driving and mobile robotics as the camera is mounted on a car, facing forward.Nevertheless, it also does not include a separate class of motorbikes.
The data was collected for MB7000 using a Phantom 4 drone with a video camera.The dataset consists of 7500 annotated images of motorbikes in urban occluded scenarios with a size of 640 × 364, including 60% of occluded scenarios.The dataset is split into a 60-40 ratio to avoid overfitting.The dataset is collected using a drone with an angle similar to a CCTV camera.This dataset covers the occluded scenario very efficiently.However, this dataset does not address the problem of diversity, i.e., lightning conditions, weather conditions, angle problems, and the diversity in shape, place, and number of riders.

Evaluation Metrics
In a normal classification-based task, the performance of a network is defined in terms of accuracy, and this is sufficient to justify the impact of the method.However, in an object detection-based task, performance is better evaluated based on mAP, which further depends on ROI (region of interest) and prediction accuracy.The actual region of interest is the area on the image where the actual object exists.The area on the image that is predicted by the model in which it believes that some object is identified can be referred to as the predicted region of interest.A correct detection is measured as the overlap (typically 50%) between the predicted region of interest and the actual region of interest, given that the object identified in that area is the same as the actual label.This is the performance criteria for object detection-based tasks.To evaluate the performance of an object detection model in real-time, further metrics are used such as frames per second or inference time per frame.However, further metrics are required to demonstrate the performance of such object detection-based tasks on small, embedded edge devices.For that, we are using model size, power consumption, and memory usage, where the latter two are monitored during the runtime.This gives us insight into the real-time performance of different devices against different optimization flows, for the common task and neural network architecture.

Result and Analysis
This section presents a detailed analysis of the performance of SSD Mobilenet V2 and YoloV5 models for real-time motorbike detection on various embedded edge devices and a standard GPU.Although YoloV8 is a recent introduction, potential deployment in the field is limited due to licensing costs, and we selected YoloV5 for comparison due to its relevance to our research objectives and full availability at the time of the study.We begin by measuring the baseline performance of SSD Mobilenet V2 on a normal GPU, serving as a reference for comparison.This section highlights the mean average precision (mAP) scores and frames per second (FPS) achieved by each device, offering insights into their computational power and real-time capabilities.Additionally, we explore the impact of model size on trade-off between accuracy and speed for edge devices.Furthermore, we present inference results of SSD Mobilenet V2 on both the GPU and edge devices, showcasing the detection performance with predicted bounding boxes.Then, we present the evaluation of YoloV5 with and without dataset augmentation, along with various input tensor sizes.A comparison of YoloV5 with other state-of-the-art models and custom networks is provided based on parameter count, mAP, and FPS on the GPU.Overall, the results reveal the advantages and drawbacks of utilizing these models for real-time motorbike detection tasks on different embedded edge devices and a standard GPU.

Results and Analysis Using SSD Mobilenet V2
First, we measure the performance of this model on a normal GPU, which gives the baseline performance.From Figure 5, it can be seen that the Single Shot Detection (SSD) Mobilenet V2 gives a 91% mean average precision score, which is higher than all the embedded edge devices but at the cost of high computation power and memory, as shown in Figure 6.On the other hand, the GPU processes around 49 FPS, which is more than the real-time requirement but is somewhat low compared to most edge devices, given the fact that edge device are very low-power and limited on resources (Figure 5).As far as mAP and FPS are concerned, it can be observed from Figure 5 that the mean average precision score of the edge devices starts to decrease with an increase in FPS.Therefore, the devices offer a straight trade-off between FPS and mean average precision.To maintain a certain FPS and real-time performance, a device has to drop its mAP.This way, despite their small sizes and limited resources, some of these devices can exceed the GPU performance in terms of FPS score, i.e., Coral Dev Board achieves an FPS twice as high as THE GPU, Jetson Xavier achieves 80% higher FPS as compared to the GPU, and Jetson TX2 achieves 63 FPS, which is also considerably higher than the 49 FPS of the GPU.On the other hand, from Figure 5, it can be seen that some smaller edge devices lack performance in terms of mean average precision (Intel compute stick); moreover, although their FPS is lower than that of the GPU (Intel compute stick and jetson nano), they are closer at reaching the threshold of real-time performance.
When looking at the impact of model size, since we are using the same model for the evaluation of the performance on all edge devices, the optimized model size can provide a direct measure of the effectiveness of the model optimization flow of a respective vendor that has caused a significant reduction in the computation complexity of a model.One core observation, from Figure 6, is that a reduction in the computation complexity of a model causes a decrease in mAP while it increases the FPS.Optimization of the Coral Dev Board provides the highest compression in terms of model size, and this in turn provides the highest frame rate.
Inference results of different frames of SSD Mobilenet V2 on GPU are shown in Figure 7.These are the same set of images as depicted during dataset selection but with predicted bounding boxes.For comparison, inference results of embedded devices are also presented in Figures 8-10 for TX2, Coral Board, and Xavier, respectively.It can be observed that the highest number of bikes are detected in the GPU-based implementation, but almost the same number of bikes are detected on the edge device-based implementations as well, and with higher FPS.

Results and Analysis of Customized YoloV5
YoloV5 has been trained with two settings, i.e., with and without dataset augmentation and also with different-sized input tensors.The results of YoloV5 on a GPU and the Jetson Xavier are presented in Table 3.It can be seen that training without augmentation gives around 97 percent mAP, while training with augmentation gives around 99 percent mAP, while in terms of FPS, it gives around 65 FPS on 640 × 364-sized images.When the image size is decreased to 300 × 300, we see some drop in mAP, but the inference rate goes up to over 100 FPS.YoloV5 provides around 90 percent mAP on a Jetson Xavier with 27 FPS.This low FPS is due to partial optimization and these results can be further improved with optimal optimization as discussed earlier.There is, however, another way to improve the FPS, i.e., image size reduction.A model has been trained with a 300 × 300 image size, and when this small network is partially optimized for Jetson Xavier, the FPS increases to 51 FPS.This significant boost is due to reduced computation, caused by a small frame size with only a marginal drop in mAP.This suggests that YoloV5 can still be used with partial optimization on a Jetson Xavier with acceptable mAP and FPS.For comparison, the inference results of different frames of YoloV5 on the GPU and on Jetson-Xavier are shown in Figures 11 and 12, respectively.
Table 4 provides a comparison of different variants of YoloV5, along with a custom network, based on parameter count, mAP and FPS on a standard GPU.It can be seen that the custom network provides almost the same mAP but has a significantly reduced parameter (size) count.Table 5 provides a comparison of YoloV5 with other state-of-the-art models used in previous papers.As mentioned earlier, we have trained YoloV5 with and without dataset augmentation.From Table 5, it can be seen that training without augmentation gives around 97% mAP while training with augmentation gives around 99% mAP.Both these accuracies are better than all other previous baseline accuracies.On the other hand, we obtain around 65 FPS on the GPU with YoloV5, which is also better than the previously reported inference rates.In conclusion, using YoloV5 has the following advantages:  One drawback of YoloV5 is that its robustness is low compared to SSD Mobilenet.In our observation, SSD Mobilenet trained on one motor bike dataset can be used for other motor bike datasets as well, but YoloV5 does not have such robustness.

Conclusions and Future Work
In this paper, we have presented an investigation of motorbike detection on edge devices for real-time applications.We have investigated five different state-of-the-art embedded edge devices from three different vendors, using state-of-the-art SSD Mobilenet and YoloV5.The most important aspect that differentiates between conventional machine learning implementation and edge deployable machine learning is the optimization of the neural network.We have used three different frameworks/toolkits to perform optimization for edge devices.We have demonstrated that with proper optimization, neural networks can be made to run on small, low-power edge devices with some pros and cons.Firstly, model size is significantly reduced by several folds after optimization.Due to this reduced model size, power consumption and memory usage drop significantly.Reduced model size is a direct outcome/indication of reduced computation complexity, which in turn increases inference rate.For some embedded devices, we have achieved inference rates that are twice as high as a GPU.For the Coral Dev Board, the inference rate is 96 FPS; for Nvidia Xaviera, the inference rate is 88 FPS; and for Nvidia Jetson TX2, the inference rate is 63 FPS, which is higher than the standard GPU inference rate of 53 FPS.A boost in inference rate comes at the cost of some decrease in mAP.In this case, this decrement lies in the range of 5 percent of the original baseline accuracy, which is acceptable for most edge devices.On the other hand, we have improved the previous best baseline accuracy of motorbike detection using a custom network based on YoloV5 on a standard GPU-based system.We have achieved 99 percent mAP with 65 FPS, which is around 2-3 percent better than the previous mAP and 15 FPS higher.
One of the possible future directions is related to the unavailability of the dataset.We have used the MB7500 dataset with some custom-labeled images.This is a small dataset in terms of the total number of images.Future work could include collecting much more representative data for motorbike-related analysis, including not only the variability of the motor vehicle, but also the atmospheric conditions, type of riders, and traffic violations.Another useful work would be to investigate and improve the lack of robustness of YoloV5 in "domain shifts", i.e., when data change in type, camera view, etc.This is a major problem that currently requires the time-consuming manual annotation of additional data.

Figure 1 .
Figure 1.Comparison between traditional surveillance methods and edge deployable surveillance solutions.

Figure 2 .
Figure 2. Optimization flow for deploying a deep learning algorithm on edge devices: illustrating the process of tailoring a model for efficient execution on edge devices.

Figure 3 .
Figure 3. Augmentation techniques applied at image and object levels: depicting crop, flip, rotate, and brightness adjustments for enhanced object detection performance.

Figure 4 .
Figure 4. Various embedded devices utilized in the study for real-time deployment of deep learning models.

FPSFigure 5 .
Figure 5. Speed and accuracy comparison across different devices for SSD Mobilenet V2: visualizing performance metrics of various edge devices for real-time deep learning model deployment.

Figure 7 .
Figure 7. Inference results of SSD Mobilenet V2 on GPU: visualization of object detection performance using GPU acceleration.

Figure 9 .
Figure 9. Inference results of SSD Mobilenet V2 on Coral Dev Board (Google): object detection performance on Coral Dev Board.

Figure 11 .
Figure 11.Inference results of YoloV5 Model on GPU: object detection performance using GPU acceleration.

Figure 12 .
Figure 12.Inference results of YoloV5 on Jetson Xavier (NVIDIA): object detection performance on Jetson Xavier.

Table 2 .
Comparison of edge devices: detailed comparison of hardware components and features across various embedded platforms.

Table 3 .
Results of YoloV5 on GPU and edge devices: object detection performance comparison (bold indicates best performance).

Table 4 .
Comparison of different variants of YoloV5 with custom model: performance evaluation of various YoloV5 configurations.

Table 5 .
Comparison of YoloV5 with other models: performance evaluation across different object detection models.