Article

Object Detection on Road: Vehicle’s Detection Based on Re-Training Models on NVIDIA-Jetson Platform

Sleiter Ramos-Sanchez, Jinmi Lezama, Ricardo Yauri and Joyce Zevallos
Grupo de Investigación en Circuitos y Sistemas Electrónicos de Alta Frecuencia, Universidad Nacional Tecnológica de Lima Sur, Villa El Salvador, Lima 15834, Peru
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
J. Imaging 2026, 12(1), 20; https://doi.org/10.3390/jimaging12010020
Submission received: 27 October 2025 / Revised: 28 November 2025 / Accepted: 2 December 2025 / Published: 1 January 2026
(This article belongs to the Special Issue Advances in Machine Learning for Computer Vision Applications)

Abstract

The increasing use of artificial intelligence (AI) and deep learning (DL) techniques has driven advances in vehicle classification and detection applications for embedded devices, whose deployment is constrained by computational cost and response time. In urban environments with high traffic congestion, such as the city of Lima, it is important to determine the trade-off between model accuracy, type of embedded system, and the dataset used. This study was developed using a methodology adapted from the CRISP-DM approach, which included the acquisition of traffic videos in the city of Lima, their segmentation, and manual labeling. Subsequently, three SSD-based detection models (MobileNetV1-SSD, MobileNetV2-SSD-Lite, and VGG16-SSD) were trained on the NVIDIA Jetson Orin NX 16 GB platform. The results show that the VGG16-SSD model achieved the highest average precision (mAP = 90.7%) at the cost of a longer training time, while the MobileNetV1-SSD (512 × 512) model achieved comparable performance (mAP = 90.4%) in a shorter time. Additionally, data augmentation through contrast adjustment improved the detection of minority classes such as Tuk-tuk and Motorcycle. The results indicate that, among the evaluated models, MobileNetV1-SSD (512 × 512) achieved the best balance between accuracy and computational load for implementation in ADAS embedded systems in congested urban environments.

1. Introduction

The use of artificial intelligence (AI) techniques has transformed perception systems for vehicles, driving advances in object detection and driving assistance worldwide [1,2]. Training deep learning (DL) models requires significant computational and communication resources, making it necessary to analyze the trade-off between computational cost and accuracy for deployment in environments with hardware limitations [3,4]. Furthermore, the integration of these techniques with smart vehicles can generate massive volumes of data, increasing transmission and storage costs. In addition, the growth of the automotive market and public policies aimed at safety have increased interest in advanced driver assistance systems (ADAS), but the heterogeneity of infrastructure (insufficient signage, non-homogeneous roads) and the diversity of vehicles present particular challenges for detectors trained with country-specific datasets [5,6]. The city of Lima, with its highly congested traffic and road obstacles, provides a realistic and challenging testbed for the evaluation of embedded computer vision models: according to the TomTom platform report [7], Lima ranks seventh worldwide in traffic congestion, with a congestion level of 47% and a loss of 150 h per year during peak hours.
The research described in this article analyzes the performance of detection architectures, evaluating the best compromise between accuracy and efficiency on an embedded device, the NVIDIA Jetson Orin NX, in the context of Peruvian traffic. The problem addressed is to evaluate and quantify which vehicle detection models can be executed in real time and with limited resources on an embedded device (NVIDIA Jetson Orin NX 16 GB) without sacrificing the minimum accuracy required for driver assistance applications. This involves evaluating not only detection accuracy in real-world scenarios in Lima, but also the requirements and methodologies for deploying detection techniques.
Previous work has shown that lightweight architectures such as MobileNet-SSD and SSD-Lite variants offer competitive accuracy and efficiency [8,9] in similar application contexts, while heavier backbones (such as VGG16) tend to provide higher mean average precision (mAP) at the cost of higher latency and power consumption [10,11]. In addition, recent studies on deployment in embedded systems highlight the importance of optimizations (quantization, pruning, TensorRT acceleration) to achieve realistic performance on Jetson platforms [12,13]. However, most evaluations are performed with datasets and conditions that differ from the Peruvian context of this research; therefore, an empirical, comparative evaluation under real conditions and on specific hardware is required.
The methodology follows an adapted CRISP-DM/KDD approach, which includes the collection and annotation of a representative dataset of Lima traffic, preprocessing, and comparative field evaluation using metrics based on the confusion matrix [14,15]. The objective of the research is to evaluate and compare the performance of the MobileNetV1-SSD, MobileNetV2-SSD-Lite, and VGG16-SSD detection models on the NVIDIA Jetson Orin NX 16 GB for the detection of vehicles characteristic of Peru in real traffic conditions in Lima. To this end, a series of steps is carried out, including the collection and annotation of a set of images and videos representative of the vehicle fleet, the implementation of the three target architectures on the Jetson Orin, the deployment of the models, and the comparison and verification of results with experiments based on new datasets. The resulting evaluation methodology allows researchers to reproduce the process of choosing the most appropriate architecture and configuration for ADAS with light autonomous vehicles in contexts similar to Peru. Based on the above, the contributions of the research are as follows:
  • Generation of a representative annotated dataset that can be made available to the community for replicability;
  • Reproducible evaluation of embedded detection models for Lima’s vehicle fleet and urban conditions;
  • Training and implementation of models on a Jetson Orin NX 16 GB embedded platform;
  • Recommendations, based on the analysis of model accuracy and preprocessing strategies, for deployment on embedded devices.
This article is organized as follows. Section 1 provides an introduction based on the context, description of the problem, contribution, and motivation. Section 2 describes related work and the state of the art in embedded detection and deployment architectures. Section 3 details the methodology for data collection, annotation, implementation procedure, and metrics used. Section 4 presents the results and discussion. Finally, Section 5 offers conclusions, limitations of the study, and future work.

2. Related Works

The autonomous navigation of vehicles in streets and avenues has become an important area of research in computer vision, robotics, and intelligent systems [16]. Recent advances have integrated multiple autonomous driving techniques, ranging from trajectory planning [17] and control systems [18] to the intensive use of perception algorithms based on computer vision, which are necessary to interpret complex urban environments [19,20]. In this context, the key focus of this research is determining the efficiency of these systems for vehicle localization, tracking, and detection in complex scenarios through the combination of optimized architectures for developing autonomous navigation systems and their application in urban mobility.
In the field of computer vision, object detection applied to vehicles has evolved from classical methods based on machine learning and distance or radiofrequency sensors to modern high-performance architectures. In this domain, models such as SSD and EfficientDet have been evaluated and compared considering metrics such as precision, processing time, and detection speed, which are critical aspects in autonomous driving scenarios and advanced driver assistance systems (ADAS), where proactive responses are necessary to prevent vehicular accidents [21,22]. On the other hand, R-CNN-based approaches have stood out for their accuracy, while lightweight models such as YOLO have demonstrated their applicability in real-time environments for terrestrial vehicle detection [23,24]. These techniques are evaluated using benchmark datasets such as KITTI, COCO, and Cityscapes, which have enabled measurement of detector performance under specific traffic conditions, such as lighting variability and high object density [25,26].
The integration of lightweight classification and detection models optimized for embedded devices has gained importance in recent years due to the need to implement real-time detection systems on hardware platforms with limited resources. Architectures such as SSD-MobileNet have been studied for their low number of parameters and reduced computational requirements, being integrated into detectors for embedded devices in computer vision scenarios [27,28]. Furthermore, research has demonstrated that such models, when deployed on low-power accelerators like Raspberry Pi, face the challenge of the trade-off between latency and precision [29]. Similarly, their integration into devices such as the Jetson Nano has been used for vehicle detection, evaluating model performance based on inference times and energy consumption [9,30]. In this context, the literature reports different optimization strategies, such as quantization, pruning, and the use of embedded accelerators in critical applications such as ADAS and autonomous systems.
The use of NVIDIA Jetson platforms for real-time vision has been widely utilized in recent years due to their capability to execute detection models in embedded environments with limited resources. Various studies have evaluated the performance of devices such as Jetson Nano and Jetson Orin NX, revealing significant differences in energy consumption, latency, and processing capacity, which guide hardware selection according to model complexity and its application [31,32,33]. To improve efficiency, tools such as TensorRT have been employed for the optimization of deep neural networks through quantization and precision reduction techniques [34,35].

3. Materials and Methods

The general structure of the work is summarized graphically in Figure 1, from the acquisition of video from a moving vehicle in a traffic environment in the city of Lima, through the training of three deep learning models used in computer vision, to live object detection tests on moving vehicles. The implementation was mounted and tested on a vehicle that traversed the streets of Lima, using the NVIDIA Jetson Orin NX 16 GB embedded platform and a webcam as the data input source for the deployed models.

3.1. System Architecture

The process shown in Figure 1 consists of the following steps:
i. Video Recording: Vehicles are captured in a traffic environment using a dashcam mounted on a vehicle.
ii. Video Splitting: Information of interest is extracted and the videos are segmented to facilitate data labeling.
iii. Video Data Storage: The selected video segments are aggregated to generate video data ready for annotation.
iv. Using the Annotation Tool: Manual labeling is performed using bounding boxes, identifying the classes involved in each frame of the videos.
v. Loading Data to the Embedded Platform: The annotations, exported in PASCAL-VOC format as *.xml files, and the extracted video frames in *.png format are loaded onto the Jetson.
vi. Training Models on the Embedded Platform: Three SSD-based detection models are re-trained on the embedded platform.
vii. Extracting Metrics and Exporting Models: *.tfevents files are extracted for interpretation with TensorBoard, visualizing metrics such as the Average Precision (AP) of each class and the mAP (the average across classes); in addition, an *.onnx file is generated for each trained model.
viii. Testing Models on the Embedded Platform: detectNet is used to load the exported models, and object detection tests are performed on the Jetson.

3.2. Data Acquisition

The video data capture scheme was performed using a dashcam installed on an automobile, which captures videos of objects (defined classes) on the road through the car’s windshield, as shown in Figure 2. The videos recorded from the route on the road in an urban environment, specifically in the San Juan de Miraflores district, Lima, Peru, are stored on a micro-SD memory card and later extracted on a desktop computer for the segmentation stage.
The captured videos have been segmented, as shown in Figure 3. The recordings obtained during the urban route are videos with a duration of 3 min, divided into a total of n + 1 video segments, with each segment having a duration of t_f seconds and recorded at m frames per second. For example, in Figure 3, video V_0 is composed of the segments V_0 = S_0 + S_1 + S_2 + ... + S_n.
Figure 4 presents a general overview of all video segments collected during the acquisition stage. Only some of the segments have been considered for the dataset; segments have been removed if, for example, they contain very few objects of interest (defined classes). Considering the selected video segments, the total number of images for the dataset is obtained as n_fg = n_f × t_f × m_fps, where n_f is the number of selected segments, t_f = 20 s is the duration of each video segment, and m_fps = 30 is the number of frames per second. This yields a total of n_fg = 3000 images for the dataset.
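As an illustration of this stage, the following Python sketch extracts the frames from the selected segments with OpenCV and checks the resulting count against n_fg = n_f × t_f × m_fps; the folder names are hypothetical and the snippet is not the authors' script.

```python
# Minimal sketch (assumed paths): extract frames from the selected 20 s
# segments with OpenCV and verify the expected image count.
import cv2
from pathlib import Path

SEGMENT_DIR = Path("segments")      # assumed folder of selected *.mp4 segments
OUTPUT_DIR = Path("frames")
OUTPUT_DIR.mkdir(exist_ok=True)

T_F = 20      # duration of each segment in seconds
M_FPS = 30    # frames per second

segments = sorted(SEGMENT_DIR.glob("*.mp4"))
expected_total = len(segments) * T_F * M_FPS   # n_fg = n_f * t_f * m_fps

saved = 0
for seg in segments:
    cap = cv2.VideoCapture(str(seg))
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(str(OUTPUT_DIR / f"{seg.stem}_{idx:05d}.png"), frame)
        idx += 1
        saved += 1
    cap.release()

print(f"expected {expected_total} frames, saved {saved}")
```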

3.3. Data Preparation

The prepared dataset of 3000 images was annotated using the Computer Vision Annotation Tool (CVAT) [36]. The tool facilitates the creation of labels for each instance of the defined classes using bounding boxes and provides an interpolation mode that alleviates the need to redraw each bounding box for adjacent frames. After completing the annotation of all dataset images, the tool allows exporting in different formats according to the project requirements. For this project, the PASCAL-VOC format [37] was employed, with annotations exported in *.xml format, along with images from video segments that contain the positions of bounding boxes for the previously defined classes of interest.
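For reference, a PASCAL-VOC *.xml annotation exported from CVAT can be read with the Python standard library, as in the following illustrative sketch; the file name is hypothetical.

```python
# Illustrative sketch: read one PASCAL-VOC *.xml annotation exported from CVAT
# and list its bounding boxes (class name plus pixel coordinates).
import xml.etree.ElementTree as ET

def read_voc_boxes(xml_path):
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.findall("object"):
        name = obj.findtext("name")                     # class label, e.g. "Car"
        bb = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (int(float(bb.findtext(k)))
                                  for k in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((name, xmin, ymin, xmax, ymax))
    return boxes

print(read_voc_boxes("frames/V0_S0_00000.xml"))   # hypothetical annotation file
```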
A scheme for data preparation and loading onto the embedded platform is shown in Figure 5, which is used after the data acquisition stage mentioned in the previous subsection.
The diagram in Figure 5 consists of four stages: i. Labeling—annotation using the CVAT annotation tool on a computer using Docker; ii. Dataset Construction—exporting the images and annotations for each image; iii. Data Preparation—segmenting the data to obtain three datasets: training, validation, and test; iv. Dataset Loading—importing the data onto the embedded platform using a remote access tool (NoMachine) to the Jetson to facilitate file upload or download.
The distribution of object classes in the 3000 labeled images with seven defined classes (Car, Tuk-tuk, Suv, Motorcycle, Bus, Pedestrian, and Van) is shown in Table 1, with a total of 14,012 objects across all labeled images.
Data augmentation was performed by adjusting the contrast for each frame of the dataset, as shown in Figure 6.
By executing a script that increases and decreases the contrast by ±25% for each image, the dataset was tripled with respect to the number of images in the initial dataset. To obtain the new annotations, the reference files exported from the CVAT annotation tool were duplicated, keeping the annotations of the original image but referencing the new contrast-adjusted images; this is possible because contrast adjustment does not modify the location of the bounding boxes.
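A minimal sketch of this augmentation step is shown below, assuming ±25% contrast factors, PNG frames with their exported *.xml annotations in the same folder, and hypothetical paths; it is not the authors' exact script.

```python
# Sketch of the contrast augmentation: each original frame yields two extra
# images, and the exported VOC annotation is duplicated with only the filename
# changed, since the bounding-box coordinates are unaffected.
import xml.etree.ElementTree as ET
from pathlib import Path
from PIL import Image, ImageEnhance

FRAMES = Path("frames")            # assumed folder with *.png and matching *.xml
AUGMENTED = Path("frames_aug")
AUGMENTED.mkdir(exist_ok=True)

for img_path in FRAMES.glob("*.png"):
    img = Image.open(img_path)
    xml_path = img_path.with_suffix(".xml")
    for tag, factor in (("c075", 0.75), ("c125", 1.25)):   # -25% / +25% contrast
        new_name = f"{img_path.stem}_{tag}"
        ImageEnhance.Contrast(img).enhance(factor).save(AUGMENTED / f"{new_name}.png")
        # Duplicate the annotation, pointing it at the new image file.
        tree = ET.parse(xml_path)
        tree.getroot().find("filename").text = f"{new_name}.png"
        tree.write(AUGMENTED / f"{new_name}.xml")
```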
For both datasets of 3000 and 9000 images (the dataset with contrast adjustment), a two-level split was performed to obtain training, test, and validation sets. At the first level, the dataset was divided into 85% and 15%, with the first portion allocated to a temporary dataset and the latter to the test set. At the second level, the temporary dataset was further split, with 20% for validation and the remainder for the training set. This resulted in a final training set of 68%, a validation set of 17%, and a test set of 15% with respect to the total number of images in the dataset before the split.
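The two-level split can be reproduced, for example, with scikit-learn's train_test_split, as in the following sketch; the image list, library choice, and random seed are assumptions rather than the authors' implementation.

```python
# Sketch of the two-level split described above (85/15, then 80/20 of the
# temporary set), using scikit-learn's train_test_split.
from pathlib import Path
from sklearn.model_selection import train_test_split

images = sorted(str(p) for p in Path("frames").glob("*.png"))

# Level 1: 85% temporary set, 15% test set.
temp, test = train_test_split(images, test_size=0.15, random_state=42)
# Level 2: 20% of the temporary set for validation, the rest for training.
train, val = train_test_split(temp, test_size=0.20, random_state=42)

print(len(train), len(val), len(test))   # about 68% / 17% / 15% of the total
```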
The distribution of objects across the three subsets for both datasets is shown in Table 2.
Because the split is performed over images rather than over individual objects, the split percentages do not translate exactly into the per-class object counts shown in Table 2.

3.4. Deep Learning Models

In the present work, three computer vision models have been explored—MobileNetV1-SSD, MobileNetV2-SSD-Lite, and VGG16-SSD—which are based on the Single-Shot-Detector (SSD) network architecture for object detection, proposed by Liu et al. [38], as shown in Figure 7.
SSD is a fast and efficient model for real-time tasks and applications such as object detection in videos. It is capable of detecting objects of different sizes in the same image since it makes predictions at different layers of the network. Additionally, SSD performs post-processing because there can be multiple bounding boxes for the same object. It uses a technique called NMS (Non-Maximum Suppression) to select the best bounding box for each object, eliminating redundant bounding boxes from the same instance.
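To illustrate the NMS step described above, the following compact NumPy sketch keeps the highest-scoring box and discards overlapping boxes above an IoU threshold; it is a generic illustration, not the exact post-processing implemented inside SSD.

```python
# Compact sketch of Non-Maximum Suppression (NMS): keep the highest-scoring
# box and drop remaining boxes whose IoU with it exceeds a threshold.
import numpy as np

def nms(boxes, scores, iou_thr=0.45):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) array. Returns kept indices."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the best box with the remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.75])
print(nms(boxes, scores))   # -> [0, 2]: the second box overlaps the first and is suppressed
```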
The three architectures mentioned possess a backbone, which is the part of the architecture that extracts features from images. They are based on MobileNetV1, MobileNetV2, and VGG16, with the first two being lightweight and fast architectures for devices with limited resources, while the latter is more complex and accurate compared to other lightweight models. The three models were retrained on the NVIDIA Jetson Orin NX embedded platform using data from both collected datasets of 3000 and 9000 images.

3.5. Training Models

Model training was performed on the same embedded platform, using the GitHub repository jetson-inference [39] as a reference and following its documentation for the training stage.
In the file structure for the object detection problem based on the SSD model, there is a script for performing training, which receives several arguments that can be modified as needed. For the development of this work, the arguments and values shown in Table 3 were considered.
The --pretrained-ssd values are the names of .pth files obtained from another repository, pytorch-ssd [40], which were used as the basis for retraining the three models mentioned in Section 3.4.
In addition to the arguments listed in Table 3, the argument --validation-mean-ap was employed to compute, every --validation-epochs epochs, the Average Precision (AP) for each class defined in the detection problem and the mean Average Precision (mAP) as the average of those results.
It is worth noting that for the resolution argument, the value of 300 × 300 was used for all three models, and the value of 512 × 512 was only used for the MobileNetV1-SSD model reported in the results section.
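A hedged example of how such a training run could be launched with the train_ssd.py script from the referenced repositories is shown below, using the values from Table 3; the dataset and output paths and the --data/--model-dir options are assumptions, so the repository documentation should be consulted for the exact invocation.

```python
# Sketch only: launch train_ssd.py (jetson-inference / pytorch-ssd) with the
# argument values of Table 3. Paths and the --data/--model-dir options are
# assumptions; verify the exact flags against the repository documentation.
import subprocess

cmd = [
    "python3", "train_ssd.py",
    "--dataset-type", "voc",
    "--data", "data/lima-traffic",                # hypothetical VOC dataset folder
    "--model-dir", "models/lima-traffic",         # hypothetical output folder
    "--net", "mb1-ssd",                           # or mb2-ssd-lite / vgg16-ssd
    "--resolution", "512",                        # 300 for the other configurations
    "--pretrained-ssd", "models/mobilenet-v1-ssd-mp-0_675.pth",
    "--batch-size", "4",
    "--epochs", "110",
    "--validation-epochs", "1",
    "--debug-steps", "10",
    "--use-cuda", "True",
    "--validation-mean-ap",                       # per-class AP/mAP at each validation epoch
]
subprocess.run(cmd, check=True)
```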

4. Results and Discussion

A diagram for the collection and extraction of metrics from the embedded platform to a computer that manages the Jetson is shown in Figure 8.
The diagram consists of three stages. i. Management: The embedded platform is remotely managed from a computer using the NoMachine application. ii. Extract metrics: During training, with the arguments mentioned in Table 3, the model metrics (Loss, mAP) are generated by an eval_results script found in the repository and stored as events in files with the *.tfevents extension, in addition to other metrics calculated on the computer (confusion matrix, Precision–Recall). iii. Results: All metrics are visualized on a computer using a Python (version 3.9.4) virtual environment with TensorBoard installed, which enables the reading and visualization of all event files extracted from the Jetson.
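Beyond the TensorBoard interface, the logged scalars can also be read programmatically with TensorBoard's EventAccumulator, as in the following illustrative sketch; the log directory and scalar tag name are assumptions that depend on how the training script writes its summaries.

```python
# Illustrative sketch: read scalar values (e.g., mAP per epoch) from a
# *.tfevents file using TensorBoard's EventAccumulator.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator("runs/mb1-ssd-512")   # hypothetical log directory
acc.Reload()

print(acc.Tags()["scalars"])                 # list the available scalar tags
for event in acc.Scalars("Mean Average Precision"):   # assumed tag name
    print(event.step, event.value)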
Figure 9 shows the loss curves for the training and validation sets, with six curves in total for each case. The trend of the curves indicates that the models learn relevant patterns from the dataset. The VGG16-SSD model presents the lowest validation loss, which is consistent with its greater feature-extraction capacity, whereas MobileNetV2-SSD-Lite shows higher and less stable loss during training. Furthermore, there are no signs of overfitting, as the validation curves do not diverge from the training curves.

4.1. Metrics

The work developed addresses an object detection problem, for which the most commonly used metric is the mean Average Precision (mAP). Computing it requires several preliminary steps: i. obtain, for each predicted bounding box, its overlap with the corresponding box identified in the data labeling stage; ii. calculate Precision and Recall; iii. calculate the PR curve for each class; iv. calculate the Average Precision (AP) for each class; v. average the AP values across the different classes.
The overlap associated with each bounding box is the Intersection over Union (IoU), calculated by dividing the area of overlap by the area of the union, as shown in the following equation:
IoU = \frac{B_{\text{Ground Truth}} \cap B_{\text{Predicted}}}{B_{\text{Ground Truth}} \cup B_{\text{Predicted}}}
where B_{\text{Ground Truth}} is the bounding box of a labeled object and B_{\text{Predicted}} is the predicted bounding box. The IoU is presented graphically in Figure 10.
The IoU score is used to decide whether a prediction is correct: it must exceed a threshold, for which a value of 0.5 is usually accepted in object detection problems. If the IoU is greater than the threshold, the prediction is a True Positive (TP); if it is lower, it is considered a False Positive (FP).
For each class, it is possible to obtain the Precision and Recall metrics using the values of TP and FP, in addition to False Negatives (FN), using the following equations:
\text{Recall} = \frac{TP}{TP + FN}; \qquad \text{Precision} = \frac{TP}{TP + FP}
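The following sketch illustrates how TP, FP, and FN counts (and from them Precision and Recall) can be obtained for a single class by greedily matching predictions to ground-truth boxes at an IoU threshold of 0.5; the box lists are placeholders, and the matching strategy is a simplification of standard evaluation code.

```python
# Illustrative sketch: per-class Precision and Recall from IoU-based matching
# of predicted boxes (x1, y1, x2, y2) to ground-truth boxes at IoU >= 0.5.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter)

def precision_recall(predictions, ground_truth, thr=0.5):
    matched, tp = set(), 0
    for pred in predictions:                      # ideally sorted by confidence
        best = max(range(len(ground_truth)),
                   key=lambda j: iou(pred, ground_truth[j]), default=None)
        if best is not None and best not in matched and iou(pred, ground_truth[best]) >= thr:
            matched.add(best)
            tp += 1
    fp = len(predictions) - tp                    # unmatched predictions
    fn = len(ground_truth) - tp                   # missed ground-truth objects
    return tp / (tp + fp), tp / (tp + fn)

print(precision_recall([(10, 10, 50, 50)], [(12, 8, 48, 52), (100, 100, 150, 150)]))
```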

4.1.1. Confusion Matrix

Figure 11a,b show row-normalized confusion matrices for the MobileNetV1-SSD model in the object detection task, obtained with the collected dataset and with the dataset augmented through contrast adjustment, for the seven labeled classes. The results indicate an improvement with data augmentation for all classes except Car and Motorcycle, most notably for the Tuk-tuk class, with an increase of up to 19.6%. The best-performing class is Car, reaching 79% with the 3000-image dataset.
Figure 12a,b show row-normalized confusion matrices for the MobileNetV2-SSD-Lite model with the collected dataset and the augmented dataset. Data augmentation improves all classes except Suv and Pedestrian, most notably Tuk-tuk, with an increase of up to 12.5%. The best-performing class is Car, reaching 78.8% with the 9000-image dataset.
Figure 13a,b show row-normalized confusion matrices for the VGG16-SSD model with the collected dataset and the augmented dataset. Data augmentation improves all classes, most notably Tuk-tuk, with an increase of up to 39.6%. The best-performing class is Motorcycle, reaching 96.8% with the 9000-image dataset.
Figure 14a,b show row-normalized confusion matrices for the MobileNetV1-SSD (512 × 512) model with the collected dataset and the augmented dataset. The difference with respect to the previous MobileNetV1-SSD model is the input resolution, adjusted to 512 × 512. Data augmentation improves all classes, most notably Tuk-tuk, with an increase of up to 15%. The best-performing class is Car, reaching 91.8% with the 9000-image dataset.
In general, the confusion matrices show that the Car class presents high accuracy, since its shape is more defined and clear than that of other elements. On the other hand, the Tuk-tuk and Motorcycle classes show higher confusion rates, especially for the MobileNetV1-SSD (300 × 300) and MobileNetV2-SSD-Lite models, due to their reduced size. The models trained with the augmented dataset show improvements, which indicates that contrast adjustment improves detection under lighting variations.

4.1.2. Precision–Recall

The Precision–Recall (PR) curve is used to evaluate the performance of object detection models; the graphs presented show, for each class, the confidence of the detections. A detection model is good if Precision remains high as Recall increases. The PR curve plots Precision on the Y-axis and Recall on the X-axis for different confidence thresholds of the detector.
Figure 15, Figure 16, Figure 17 and Figure 18 show PR graphs for the MobileNetV1-SSD, MobileNetV2-SSD-Lite, VGG16-SSD, and MobileNetV1-SSD (512 × 512) models, respectively. Additionally, subfigure (a) in all figures shows results using the collected dataset of 3000 images, while subfigure (b) shows results with the augmented dataset (9000 images) through contrast adjustment.
The PR curves show that the Car, Suv, and Motorcycle classes have curves with high precision, which indicates detections with low false positive rates. The VGG16-SSD model has more stable PR curves with larger area, which indicates its greater feature extraction capacity.

4.1.3. Average Precision

To measure the quality of the predictions of the retrained models for each of the defined classes, the Average Precision (AP) metric is considered, which combines the proportion of correct detections (Precision) and the proportion of real objects that were detected (Recall). The metric is calculated as the area under the PR curve obtained previously for a specific class.
AP = \int_{0}^{1} p(r)\, dr
where p(r) is Precision as a function of Recall.
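As a simple numerical illustration, AP can be approximated as the area under a class's PR curve (here with trapezoidal integration) and mAP as the mean over classes; the precision/recall arrays below are placeholders, and the evaluation script in the repository may instead use the interpolated VOC protocol.

```python
# Minimal sketch: approximate AP as the area under a PR curve and mAP as the
# mean of the per-class AP values. The PR points are illustrative placeholders.
import numpy as np

def average_precision(recall, precision):
    """recall, precision: sequences of matching length; recall need not be sorted."""
    order = np.argsort(recall)
    return np.trapz(np.asarray(precision)[order], np.asarray(recall)[order])

pr_curves = {
    "Car":     ([0.0, 0.5, 0.9, 1.0], [1.0, 0.95, 0.90, 0.60]),
    "Tuk-tuk": ([0.0, 0.4, 0.8, 1.0], [1.0, 0.85, 0.70, 0.40]),
}
aps = {cls: average_precision(r, p) for cls, (r, p) in pr_curves.items()}
print(aps, "mAP =", sum(aps.values()) / len(aps))
```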
Figure 19 shows the AP for each of the seven defined classes, one subfigure per class; each subfigure contains the curves of all considered models and both datasets, giving a total of eight curves per class.
In Figure 19, it is observed that subfigures (a), (c), and (d), belonging to the Car, Suv, and Motorcycle classes, respectively, present higher AP across all curves of all explored models. In subfigure (b), belonging to the Tuk-tuk class, a higher AP is seen for some of the models compared to other subfigures. Subfigure (e), of the Bus class, presents the most variable AP across the explored models compared to the other classes. On the other hand, subfigure (f), belonging to the Pedestrian class, presents a lower AP compared to the rest of the classes shown in the other subfigures. Finally, subfigure (g), of the Van class, presents a higher AP for a specific model.
The Car, Suv, and Motorcycle classes achieve high AP values quickly. Furthermore, the augmented dataset accelerates convergence in most classes, due to the greater data variability generated by contrast adjustment.

4.1.4. Mean Average Precision

This metric is the average of all the previously calculated AP (Average Precision) for each class defined in the dataset:
\text{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP(i)
where N is the number of defined classes. Figure 20 shows the mAP over all the initially defined classes, where each curve corresponds to one of the fine-tuned models trained with either the collected dataset or the dataset augmented through contrast adjustment.
According to the mAP metric reported in the graph, the VGG16-SSD model is superior to the MobileNetV1-SSD and MobileNetV2-SSD-Lite models because it uses more parameters and convolutional layers, producing more detailed feature maps that are able to discriminate elements with variable shapes (such as the Tuk-tuk class). This comes with an increase in computational cost that lengthens the training time on the NVIDIA Jetson device (as reported in Table 4). However, the MobileNetV1-SSD (512 × 512) model, with an input layer of 512 × 512, presents results similar to the performance of the VGG16-SSD model.
MobileNetV2-SSD-Lite, in contrast, behaves irregularly when the evaluated data have high shape variability and does not capture the edges or shapes of small vehicles. MobileNetV1-SSD, on the other hand, uses more stable convolutional stages that retain more spatial information and handle complex environments and variable lighting conditions better, which is why this model presents better performance, albeit with higher use of computational resources.

4.2. Test on Platform

All models have been fine-tuned on the platform. Table 4 reports, for both datasets and all models, the lowest validation loss value, the corresponding mAP, the epoch at which this value was reached, and the time the platform took to train up to that epoch.
Note that the superior performance of VGG16-SSD (90.74%) is confirmed again, although it requires the longest training time, 1.64 days. The MobileNetV1-SSD (512 × 512) model, however, reaches a very close mAP of 90.44%; even with the unaugmented dataset it achieves 88.47% in a much shorter training time (8.02 h), very similar to the 88.79% obtained by the VGG16-SSD model.
The trained MobileNetV1-SSD (512 × 512) model is implemented on the Jetson Orin NX 16 GB using the documentation and libraries available from NVIDIA. For the detection test of this trained model, detectNet is used, loading the previously exported model in *.onnx format and indicating the input source (images, videos, or camera) and the output path where the images are saved once the detections have been performed. Some examples of object detection for the different classes obtained with this model are shown in Figure 21.
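A hedged sketch of this deployment step with the jetson-inference Python bindings is shown below; the model and label file names and the camera URI are assumptions, and the exact constructor options should be checked against the jetson-inference documentation.

```python
# Sketch: load the exported *.onnx model with detectNet and run detection on a
# video source, saving annotated frames. File names and camera URI are assumed.
from jetson_inference import detectNet
from jetson_utils import videoSource, videoOutput

net = detectNet(model="models/lima-traffic/mb1-ssd-512.onnx",
                labels="models/lima-traffic/labels.txt",
                input_blob="input_0", output_cvg="scores", output_bbox="boxes",
                threshold=0.5)

camera = videoSource("/dev/video0")          # webcam used as the input source
output = videoOutput("detections/")          # save annotated frames to disk

while camera.IsStreaming():
    img = camera.Capture()
    if img is None:
        continue
    detections = net.Detect(img)             # runs inference and overlays boxes
    output.Render(img)
```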

4.3. Energy Consumption on the Jetson Platform

Energy consumption readings were obtained using the software tool available from NVIDIA, based on data from the INA3221 current sensor embedded in the Jetson Orin NX 16 GB. Measurements were performed during the detection stage of the model loaded and implemented with detectNet, using the validation image set as input. Over a fixed time period, while performing inference with each model, all readings were recorded in a *.csv file, which was then extracted from the embedded platform. The consumption of VDD_IN (total power consumption), VDD_CPU_GPU_CV (combined CPU, GPU, and CV power rail), and VDD_SOC (SOC power rail) was monitored. These readings were averaged to obtain the data displayed in Table 5, including tests with both datasets.
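The averaging of the logged readings can be reproduced with a short script such as the following; the CSV file name and column names are assumptions about how the power log was exported.

```python
# Simple sketch: average the power readings logged on the Jetson. The *.csv is
# assumed to contain one column per monitored rail with instantaneous power in mW.
import csv

def mean_power(csv_path, columns=("VDD_IN", "VDD_CPU_GPU_CV", "VDD_SOC")):
    sums = dict.fromkeys(columns, 0.0)
    n = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            for col in columns:
                sums[col] += float(row[col])
            n += 1
    return {col: sums[col] / n for col in columns}

print(mean_power("power_mb1_ssd_512.csv"))   # hypothetical log file
```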
Note that the total power consumption readings VDD_IN (≈7.3 W, ≈7.4 W, and ≈7.5 W) for each explored model (MobileNetV2-SSD-Lite, MobileNetV1-SSD-300 × 300, and MobileNetV1-SSD-512 × 512) increase in order, as does the mAP metric (75%, 80%, and 90%) achieved by each of the implemented models.

4.4. Comparative Performance with YOLO-v8-Nano

TensorRT, a C++ library, is used to facilitate inference on NVIDIA GPU units. TensorRT takes a trained model and generates an optimized engine that performs inference for that network [41]. A comparison of latency and size is made between the best model obtained (MobileNetV1-SSD) and the YOLO (You Only Look Once) model in version 8 Nano, presented in Ref. [42], using an FP32 engine for both models on the Jetson Orin NX. Additionally, the number of parameters of each model and the mAP metric for the dataset of 9000 frames with an input layer size of 512 × 512 for both models are shown in Table 6.
YOLO-v8-Nano shows good results in terms of accuracy, in addition to being lightweight in size and exhibiting lower inference latency. Despite YOLO-v8-Nano having roughly 60% fewer parameters than MobileNetV1-SSD, the difference in latency is small. All latency values, as well as the reported FPS (calculated as 1000/Latency(median)), correspond only to the pure inference of each object detection model.
When deploying within an embedded system, as in the present work, additional times such as pre- and post-processing must be considered in addition to the inference time of the model. A comparison of both models is shown in Table 7. For this test, in the case of MobileNetV1-SSD, detectNet is used to load the model, which creates an FP16 engine; in the case of YOLO-v8-Nano, TensorRT has been used to generate an engine of the same type (FP16). detectNet displays the images with the generated bounding boxes immediately after performing the inference, while YOLO does not; therefore, for a fairer comparison, the visualization stage (0.14 ms) has been omitted and three stages have been considered: pre-process, network, and post-process.
The main reason for the time difference between the two models is that the YOLO pipeline is written in Python, while detectNet uses C++ and CUDA, which reduces the time spent in the pre- and post-processing stages.
Finally, a comparison of the energy consumption is shown in Table 8, and as in the previous table, the visualization stage has also been omitted.
MobileNetV1-SSD, with the FP16 engine, presents a consumption of approximately 7.2 W, lower than the value reported in Table 5 due to the absence of the visualization stage. YOLO, on the other hand, presents a consumption of approximately 7.6 W, showing that the SSD-based model is slightly more energy-efficient than YOLO.
An energy report and analysis regarding YOLO-v8 applied in different engines (FP32, FP16, and INT8) and on the same platform used in this study (NVIDIA Jetson Orin NX) can be found in [34].

4.5. Comparative Performance with the Literature

Finally, Table 9 presents a comparison with similar works that implement computer vision models for object detection on NVIDIA embedded platforms.
The results demonstrate that the MobileNetV1-SSD (512 × 512) model implemented in this work achieves a significantly higher mAP compared to the other models on the NVIDIA Jetson platform.

5. Conclusions

In this work, the study and implementation of vehicle and person detection models based on deep learning networks, such as the SSD network, is presented, with the aim of contributing to the creation of advanced driver assistance systems in a real-world environment on the streets of Lima, Peru. The computer vision models have been implemented on an embedded platform with a GPU, specifically the NVIDIA Jetson Orin NX. In this study, a custom dataset has been collected, and data augmentation has been performed through contrast adjustment on the constructed dataset. The training of the models has been carried out on the same platform, using the tools available from NVIDIA and models available within the documentation of previous works. The times taken for the fine-tuning of each model have been reported, as well as the reliability metrics of the system, using the collected dataset and the augmented dataset. In future work, other data augmentation techniques and the capture of a larger dataset, considering possible changes in weather conditions, will be explored. Additionally, studies on monitoring the power consumption of each implemented model during execution will be carried out.
This work contributes to the exploration and implementation of intelligent models in embedded applications with high computational performance and low power consumption, specifically in real-time tasks, where detection accuracy and speed are paramount, as in the case of driver assistance systems.

Author Contributions

Conceptualization, S.R.-S., J.L., and R.Y.; methodology, S.R.-S., J.L., and R.Y.; software, S.R.-S., J.L., and J.Z.; validation, S.R.-S., J.L., and R.Y.; formal analysis, S.R.-S., J.L., and R.Y.; investigation, S.R.-S., J.L., and R.Y.; data curation, S.R.-S. and J.Z.; writing—original draft preparation S.R.-S. and R.Y.; writing—review and editing, J.L. and R.Y.; supervision, J.L.; project administration, J.L.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by CONCYTEC-PROCIENCIA as part of the “E041-2023-02 Proyectos de Investigación Aplicada” [PE501083603-2023].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in Google Drive at https://drive.google.com/drive/folders/1phXCg5s1JgLBtfeS3lXWXR2nuC7QzqSl (accessed on 17 July 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cheng, L.; Zhang, D.; Zheng, Y. Road Object Detection in Foggy Complex Scenes Based on Improved YOLOv8. IEEE Access 2024, 12, 107420–107430. [Google Scholar] [CrossRef]
  2. Sim, S.; Bae, J.; Kim, S.B. Robust Semi-Supervised Regression for Vehicle Interior Noise Prediction. IEEE Access 2024, 12, 60–72. [Google Scholar] [CrossRef]
  3. Khalil, H.A.; Hammad, S.A.; Abd El Munim, H.E.; Maged, S.A. Low-Cost Driver Monitoring System Using Deep Learning. IEEE Access 2025, 13, 14151–14164. [Google Scholar] [CrossRef]
  4. Salakapuri, R.; Navuri, N.K.; Vobbilineni, T.; Ravi, G.; Karmakonda, K.; vardhan, K.A. Integrated deep learning framework for driver distraction detection and real-time road object recognition in advanced driver assistance systems. Sci. Rep. 2025, 15, 25125. [Google Scholar] [CrossRef]
  5. Hammoud, M.S.; Lupin, S. Optimization of Road Detection using Semantic Segmentation and Deep Learning in Self-Driving Cars. Ann. Emerg. Technol. Comput. 2024, 8. [Google Scholar] [CrossRef]
  6. Mehta, A.A.; Padaria, A.A.; Bavisi, D.J.; Ukani, V.; Thakkar, P.; Geddam, R.; Kotecha, K.; Abraham, A. Securing the Future: A Comprehensive Review of Security Challenges and Solutions in Advanced Driver Assistance Systems. IEEE Access 2024, 12, 643–678. [Google Scholar] [CrossRef]
  7. TomTom International BV. Traffic Index Ranking|TomTom Traffic Index. 2025. Available online: https://www.tomtom.com/traffic-index/ranking/ (accessed on 24 October 2025).
  8. Ihsan, M.; Trinanda Putra, K.; Yusvin Mustar, M.; Faiz Setyawan, A.; Chu, H.T.; Chen, H.C. Edge AI-Driven Video Analytics: A Mobile Deep Learning Framework for Victim Detection in SAR Robotics. In Proceedings of the 2024 FORTEI-International Conference on Electrical Engineering (FORTEI-ICEE), Badung, Indonesia, 24–25 October 2024; pp. 241–246. [Google Scholar] [CrossRef]
  9. Mohd Razif, M.H.; Ismail, A.P.; Che Abdullah, S.A.; Shafie, M.A.; Isa, I.S.; Sulaiman, S.N.; Che Soh, Z.H. On Edge Crowd Traffic Counting System using Deep Learning on Jetson Nano for Smart Retail Environment. J. Adv. Res. Appl. Sci. Eng. Technol. 2024, 42, 1–13. [Google Scholar] [CrossRef]
  10. Lu, J.; Huang, T.; Zhang, Q.; Chen, X.; Zhou, J. A lightweight vehicle detection network fusing feature pyramid and channel attention. Internet Things 2024, 26, 101166. [Google Scholar] [CrossRef]
  11. Abdigapporov, S.; Miraliev, S.; Alikhanov, J.; Kakani, V.; Kim, H. Performance Comparison of Backbone Networks for Multi-Tasking in Self-Driving Operations. In Proceedings of the 2022 22nd International Conference on Control, Automation and Systems (ICCAS), Busan, Republic of Korea, 27 November–1 December 2022; pp. 819–824. [Google Scholar] [CrossRef]
  12. Nguyen, H.H.; Nguyen-Ngoc Tran, D.; Pham, L.H.; Jeon, J.W. Optimizing Monocular Driving Assistance for Real-Time Processing on Jetson AGX Xavier. IEEE Access 2024, 12, 71853–71865. [Google Scholar] [CrossRef]
  13. Jun, W.; Yoo, J.; Lee, S. Synthetic Data Enhancement and Network Compression Technology of Monocular Depth Estimation for Real-Time Autonomous Driving System. Sensors 2024, 24, 4205. [Google Scholar] [CrossRef]
  14. Svabova, L.; Culik, K.; Hrudkay, K.; Durica, M. Analysing Urban Traffic Patterns with Neural Networks and COVID-19 Response Data. Appl. Sci. 2024, 14, 7793. [Google Scholar] [CrossRef]
  15. Mehta, S.; Yusuf, A.B.; Ghafari, S. Data-driven framework for pothole repair automation using unmanned ground vehicle fleets. Autom. Constr. 2025, 174, 106176. [Google Scholar] [CrossRef]
  16. Zhuang, Y.; Liu, P.; Yang, H.; Zhang, K.; Wang, Y.; Pu, Z. Few-shot learning for novel object detection in autonomous driving. Commun. Transp. Res. 2025, 5, 100194. [Google Scholar] [CrossRef]
  17. Cheng, C.H.; Lin, H.H.; Luo, Y.Y. Research on Autonomous Vehicle Lane-Keeping and Navigation System Based on Deep Reinforcement Learning: From Simulation to Real-World Application. Electronics 2025, 14, 2738. [Google Scholar] [CrossRef]
  18. Fang, R.; Wang, Z.; Liang, J.; Yin, G.; Liu, C.; Liu, H.; Li, A. A Vision-Based Robust H Gain Scheduling Longitudinal and Lateral Following Controller for Autonomous Vehicles on Urban Curved Roads. IEEE Trans. Intell. Veh. 2024, 9, 7498–7513. [Google Scholar] [CrossRef]
  19. Iqbal, H.; Sadia, H.; Al-Kaff, A.; García, F. Novelty Detection in Autonomous Driving: A Generative Multi-Modal Sensor Fusion Approach. IEEE Open J. Intell. Transp. Syst. 2025, 6, 799–812. [Google Scholar] [CrossRef]
  20. Hong, S.; Im, J.; Noh, B. Attention-Driven Lane Change Trajectory Prediction With Traffic Context in Urban Environments. IEEE Access 2025, 13, 108059–108075. [Google Scholar] [CrossRef]
  21. Wang, C.; Zheng, B.; Li, C. Efficient traffic sign recognition using YOLO for intelligent transport systems. Sci. Rep. 2025, 15, 13657. [Google Scholar] [CrossRef]
  22. Nakaguchi, V.M.; Rasika, D.; Abeyrathna, R.; Liu, Z.; Noguchi, R.; Ahamed, T. Development of a Machine stereo vision-based autonomous navigation system for orchard speed sprayers. Comput. Electron. Agric. 2024, 227, 109669. [Google Scholar] [CrossRef]
  23. Ge, T.; Ning, B.; Xie, Y. YOLO-AFR: An Improved YOLOv12-Based Model for Accurate and Real-Time Dangerous Driving Behavior Detection. Appl. Sci. 2025, 15, 6090. [Google Scholar] [CrossRef]
  24. Wei, F.; Wang, W. SCCA-YOLO: A Spatial and Channel Collaborative Attention Enhanced YOLO Network for Highway Autonomous Driving Perception System. Sci. Rep. 2025, 15, 6459. [Google Scholar] [CrossRef] [PubMed]
  25. Pan, S.H.; Wang, S.C. Identifying vehicles dynamically on freeway CCTV images through the yolo deep learning model. Sensors Mater. 2021, 33, 1517–1530. [Google Scholar] [CrossRef]
  26. Duvvuri, B.L.; J, R.G. SSD Framework in Raspberry Pi for Real-Time Object Detection in Autonomous Vehicles. In Proceedings of the 2024 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), Bangalore, India, 12–14 July 2024; pp. 1–6. [Google Scholar] [CrossRef]
  27. Mosquera, B.G.; Castelo, B.G.; Lema, H.P.; Changoluisa, I.D.; Cruz, P.J.; Valencia, E. Object Detection and Tracking Based on Artificial Vision for a Single Board Computer (SBC). In Proceedings of the Latest Advances in Electrical Engineering, and Electronics, Quito, Ecuador, 15–17 December 2021; Botto-Tobar, M., Zambrano Vizuete, M., Diaz Cadena, A., Vizuete, A.Z., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 154–167. [Google Scholar]
  28. Silva, R.; Zapata, A.; Alulema, D. Access Control Through Mask Detection and Estimation of People Capacity in Covered Premises. In Proceedings of the Applied Technologies, Quito, Ecuador, 23–25 November 2022; Botto-Tobar, M., Zambrano Vizuete, M., Montes León, S., Torres-Carrión, P., Durakovic, B., Eds.; Springer Nature: Cham, Switzerland, 2023; pp. 227–239. [Google Scholar]
  29. Kamath, V.; A, R. Investigation of MobileNet-Ssd on human follower robot for stand-alone object detection and tracking using Raspberry Pi. Cogent Eng. 2024, 11, 2333208. [Google Scholar] [CrossRef]
  30. Saeed, Z.; Yousaf, M.H.; Ahmed, R.; Velastin, S.A.; Viriri, S. On-Board Small-Scale Object Detection for Unmanned Aerial Vehicles (UAVs). Drones 2023, 7, 310. [Google Scholar] [CrossRef]
  31. Zagitov, A.; Chebotareva, E.; Toschev, A.; Magid, E. Comparative analysis of neural network models performance on low-power devices for a real-time object detection task. Comput. Opt. 2024, 48, 242–252. [Google Scholar] [CrossRef]
  32. Magalhães, S.C.; dos Santos, F.N.; Machado, P.; Moreira, A.P.; Dias, J. Benchmarking edge computing devices for grape bunches and trunks detection using accelerated object detection single shot multibox deep learning models. Eng. Appl. Artif. Intell. 2023, 117, 105604. [Google Scholar] [CrossRef]
  33. Kim, S.; Kim, C.; Kim, S. Improving Performance of Real-Time Object Detection in Edge Device Through Concurrent Multi-Frame Processing. IEEE Access 2025, 13, 1522–1533. [Google Scholar] [CrossRef]
  34. Rey, L.; Bernardos, A.M.; Dobrzycki, A.D.; Carramiñana, D.; Bergesio, L.; Besada, J.A.; Casar, J.R. A Performance Analysis of You Only Look Once Models for Deployment on Constrained Computational Edge Devices in Drone Applications. Electronics 2025, 14, 638. [Google Scholar] [CrossRef]
  35. Terakura, K.; Chang, Q.; Miyazaki, J. Acceleration of Neural Network Inference for Embedded GPU Systems. In Proceedings of the 2024 IEEE International Conference on Big Data and Smart Computing (BigComp), Bangkok, Thailand, 18–21 February 2024; pp. 361–362. [Google Scholar] [CrossRef]
  36. CVAT.ai Corporation. Computer Vision Annotation Tool. 2025. Available online: https://www.cvat.ai/ (accessed on 31 January 2025).
  37. Everingham, M.; Eslami, S.M.A.; Gool, L.V.; Williams, C.K.I.; Winn, J.M.; Zisserman, A. The Pascal Visual Object Classes Challenge: A Retrospective. Int. J. Comput. Vis. 2014, 111, 98–136. [Google Scholar] [CrossRef]
  38. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. arXiv 2015, arXiv:1512.02325. [Google Scholar]
  39. NVIDIA. Hello AI World Guide to Deploying Deep-Learning Inference Networks and Deep Vision Primitives with TensorRT and NVIDIA Jetson. 2016–2019. Available online: https://github.com/dusty-nv/jetson-inference/tree/master (accessed on 24 October 2025).
  40. qfgaohao. MobileNetV1, MobileNetV2, VGG Based SSD/SSD-Lite Implementation in Pytorch 1.0/Pytorch 0.4. 2019. Available online: https://github.com/qfgaohao/pytorch-ssd (accessed on 24 October 2025).
  41. NVIDIA Corporation. NVIDIA TensorRT Documentation. 2025. Available online: https://docs.nvidia.com/deeplearning/tensorrt/latest/index.html (accessed on 24 October 2025).
  42. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6. [Google Scholar] [CrossRef]
  43. Barba-Guaman, L.; Eugenio Naranjo, J.; Ortiz, A. Deep Learning Framework for Vehicle and Pedestrian Detection in Rural Roads on an Embedded GPU. Electronics 2020, 9, 589. [Google Scholar] [CrossRef]
  44. Farooq, M.A.; Corcoran, P.; Rotariu, C.; Shariff, W. Object Detection in Thermal Spectrum for Advanced Driver-Assistance Systems (ADAS). IEEE Access 2021, 9, 156465–156481. [Google Scholar] [CrossRef]
  45. Elmanaa, I.; Sabri, M.A.; Abouch, Y.; Aarab, A. Efficient Roundabout Supervision: Real-Time Vehicle Detection and Tracking on Nvidia Jetson Nano. Appl. Sci. 2023, 13, 7416. [Google Scholar] [CrossRef]
  46. Liang, Z.; Wang, W.; Meng, R.; Yang, H.; Wang, J.; Gao, H.; Li, B.; Fan, J. Vehicle and Pedestrian Detection Based on Improved YOLOv7-Tiny. Electronics 2024, 13, 4010. [Google Scholar] [CrossRef]
Figure 1. Flowchart of the implementation used on the embedded platform.
Figure 2. Diagram to record and store videos from the road.
Figure 3. Segmentation of recorded videos into small segments.
Figure 4. Construction of video data into small segments.
Figure 5. Diagram for loading dataset on the Jetson platform.
Figure 6. Data augmentation by adjusting the dataset’s contrast.
Figure 7. Architecture of the Single-Shot-Detector model [38].
Figure 8. Extract metrics from Jetson platform.
Figure 9. Loss/train and loss/validation.
Figure 10. Intersection over Union (IoU) score.
Figure 11. Results with MobileNetV1-SSD model: (a) confusion matrix with a 3000 frame dataset; (b) confusion matrix with a 9000 frame dataset.
Figure 12. Results with MobileNetV2-SSD-Lite model: (a) confusion matrix with 3000 frames dataset; (b) confusion matrix with 9000 frames dataset.
Figure 13. Results with VGG16-SSD model: (a) confusion matrix with 3000 frames dataset; (b) confusion matrix with 9000 frames dataset.
Figure 14. Results with MobileNetV1-SSD model (512 × 512): (a) confusion matrix with 3000 frames dataset; (b) confusion matrix with 9000 frames dataset.
Figure 15. Results with MobileNetV1-SSD model: (a) Precision vs. Recall with 3000 frames dataset; (b) Precision vs. Recall with 9000 frames dataset.
Figure 16. Results with MobileNetV2-SSD-Lite model: (a) Precision vs. Recall with 3000 frames dataset; (b) Precision vs. Recall with 9000 frames dataset.
Figure 17. Results with VGG16-SSD model: (a) Precision vs. Recall with 3000 frames dataset; (b) Precision vs. Recall with 9000 frames dataset.
Figure 18. Results with MobileNetV1-SSD model (512 × 512): (a) Precision vs. Recall with 3000 frames dataset; (b) Precision vs. Recall with 9000 frames dataset.
Figure 19. Results per class, Average Precision vs. epoch: (a) Car; (b) Tuk-tuk; (c) Suv; (d) Motorcycle; (e) Bus; (f) Pedestrian; (g) Van.
Figure 20. Mean Average Precision for all classes for each model.
Figure 21. Object detection using MobileNetV1-SSD (512 × 512) on an embedded platform.
Table 1. Distribution of classes in the dataset.

#   Class        No. of Objects
1   Car          4924
2   Tuk-tuk      1101
3   SUV          3082
4   Motorcycle   822
5   Bus          430
6   Pedestrian   2201
7   Van          1452
    Total        14,012
Table 2. Class distribution.

                    Dataset (3000)            Dataset (9000)
#   Class         Train    Test   Val.      Train    Test   Val.
1   Car           3297     772    855       10,064   2268   2440
2   Tuk-tuk       745      174    182       2188     531    584
3   SUV           2072     443    573       6165     1449   1650
4   Motorcycle    555      126    141       1679     390    397
5   Bus           272      55     103       857      225    208
6   Pedestrian    1500     312    389       4498     1057   1048
7   Van           966      226    260       2988     629    739
Table 3. Arguments to train models.

#   Argument              Value(s)
1   --dataset-type        voc
2   --net                 mb1-ssd; mb2-ssd-lite; vgg16-ssd
3   --resolution          300; 512
4   --pretrained-ssd      mobilenet-v1-ssd-mp-0_675.pth; mb2-ssd-lite-mp-0_686.pth; vgg16-ssd-mp-0_7726.pth
5   --batch-size          4
6   --epochs              110
7   --validation-epochs   1
8   --debug-steps         10
9   --use-cuda            True
Table 4. Report on the NVIDIA Jetson platform.

Dataset        Model                          # Step (Epoch)   Time         Loss/Val   mAP
3000 frames    MobileNetV1-SSD (300 × 300)    98               7.212 h      0.7871     75.43
               MobileNetV1-SSD (512 × 512)    99               8.019 h      0.6413     88.47
               MobileNetV2-SSD-Lite           95               7.966 h      0.9252     70.71
               VGG16-SSD                      106              13.19 h      0.5194     88.79
9000 frames    MobileNetV1-SSD (300 × 300)    101              1.105 days   0.5272     80.44
               MobileNetV1-SSD (512 × 512)    106              1.103 days   0.4539     90.44
               MobileNetV2-SSD-Lite           103              1.029 days   0.6388     75.17
               VGG16-SSD                      106              1.644 days   0.3734     90.74
Table 5. Report of the energy consumption.

Model                          VDD Type      Voltage      Current      Power
MobileNetV1-SSD (300 × 300)    IN            5087.80 mV   1449.86 mA   7374.64 mW
                               CPU_GPU_CV    5072.86 mV   442.60 mA    2244.70 mW
                               SOC           5080.00 mV   320.54 mA    1627.79 mW
MobileNetV1-SSD (512 × 512)    IN            5088.00 mV   1481.60 mA   7538.90 mW
                               CPU_GPU_CV    5072.00 mV   461.60 mA    2340.68 mW
                               SOC           5080.00 mV   324.13 mA    1646.11 mW
MobileNetV2-SSD-Lite           IN            5088.00 mV   1428.31 mA   7263.58 mW
                               CPU_GPU_CV    5074.77 mV   431.38 mA    2190.00 mW
                               SOC           5080.92 mV   318.15 mA    1615.77 mW
Table 6. Performance of inference comparison: MobileNetV1-SSD vs. YOLO-v8-Nano.

Model             #Parameters   Size (Engine)   Latency (Mean)   Latency (Median)   Latency (P-99%)   FPS   Metric (Value)
MobileNetV1-SSD   7.48 M        30.6 MB         ≈12.446 ms       ≈12.380 ms         ≈13.987 ms        ≈81   mAP = 90.4%
YOLO-v8n          3.03 M        13.3 MB         ≈11.134 ms       ≈10.927 ms         ≈15.740 ms        ≈92   mAP = 91.7%
Table 7. Timing report comparison: MobileNetV1-SSD vs. YOLO-v8-Nano.

Stages         MobileNetV1-SSD   YOLO-v8-Nano
Pre-Process    ≈0.1 ms           ≈5.4 ms
Network        ≈9.0 ms           ≈7.8 ms
Post-Process   ≈0.2 ms           ≈14.9 ms
Total Time     ≈9.3 ms           ≈28.1 ms
Table 8. Report of the energy consumption: MobileNetV1-SSD vs. YOLO-v8-Nano.

              MobileNetV1-SSD                          YOLO-v8-Nano
VDD Type      Voltage      Current      Power          Voltage      Current      Power
IN            5088.00 mV   1417.71 mA   7212.86 mW     5088.00 mV   1468.00 mA   7560.38 mW
CPU_GPU_CV    5073.71 mV   424.57 mA    2153.71 mW     5073.00 mV   437.00 mA    2221.50 mW
SOC           5080.00 mV   316.00 mA    1607.43 mW     5080.00 mV   333.00 mA    1691.13 mW
Table 9. Performance comparison with the literature.

Reference                        Model                          Embedded Platform   Classes                                               Metric (Value)
Barba-Guaman et al., 2020 [43]   PedNet                         Jetson Nano         car, pedestrian                                       Acc. = 78.71%
Farooq et al., 2021 [44]         YOLO-v5                        Jetson Nano         bike, bus, bicycle, car, dog, person, pole            mAP = 85.5%
Elmanaa et al., 2023 [45]        YOLO-v7-tiny                   Jetson Nano         car, bus, motorcycle, truck                           mAP = 80.1%
Liang et al., 2024 [46]          YOLO-v7-tiny (pruned)          Jetson Xavier AGX   car, cyclist, pedestrian, tram, tricycle, truck       mAP = 72.5%
This work                        MobileNetV1-SSD (512 × 512)    Jetson Orin NX      bus, car, motorcycle, pedestrian, suv, tuk-tuk, van   mAP = 90.44%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

