The performance of the overall system depends mainly on two modules, the vision module and the distance module; the results and performance evaluation of these modules are discussed in the following sub-sections.
4.2. Object Detection: Training and Performance Evaluation
In this section, the training and performance of the object detection model are discussed in detail.
Figure 7a,b show the training loss over epochs for two metrics: box loss (accuracy of bounding box predictions) and class loss (accuracy of object classification) for both the training and validation data. Both losses decreased rapidly during the first 50 epochs as the model quickly learned the data patterns. Between epochs 50 and 181, the decline slowed, and the model eventually converged with a box loss of approximately 0.02 and a class loss of approximately 0.003, indicating high confidence in its predictions.
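For reproducibility, curves such as those in Figure 7 can be generated directly from the per-epoch metrics logged during training. The following minimal sketch assumes the Ultralytics YOLOv5 results.csv output format; the file path and column names are assumptions and may differ across versions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the per-epoch metrics written by YOLOv5 during training.
# Path and column names follow the Ultralytics results.csv convention
# (assumed here); headers are padded with spaces, so strip them first.
df = pd.read_csv("runs/train/exp/results.csv")
df.columns = [c.strip() for c in df.columns]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].plot(df["epoch"], df["train/box_loss"], label="train")
axes[0].plot(df["epoch"], df["val/box_loss"], label="val")
axes[0].set(title="Box loss", xlabel="epoch")
axes[1].plot(df["epoch"], df["train/cls_loss"], label="train")
axes[1].plot(df["epoch"], df["val/cls_loss"], label="val")
axes[1].set(title="Class loss", xlabel="epoch")
for ax in axes:
    ax.legend()
plt.tight_layout()
plt.savefig("loss_curves.png")
```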
The mAP@50 metric measures the mean average precision at a 0.50 IoU threshold, while mAP@50-95 averages the mean average precision over IoU thresholds ranging from 0.50 to 0.95 (in 0.05 increments). As shown in
Figure 7c, the mAP@50 is higher due to its less stringent matching criterion. The model learns rapidly in the first 50 epochs, slows after epoch 50, and stabilizes with the mAP@50 settling at around 0.845 and the mAP@50-95 at around 0.593.
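To make the relationship between the two metrics concrete, the sketch below shows how mAP@50-95 is obtained by averaging AP over the ten IoU thresholds from 0.50 to 0.95. The per-threshold AP values are hypothetical, for illustration only, and are not results of the trained model.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Ten IoU thresholds: 0.50, 0.55, ..., 0.95.
thresholds = np.arange(0.50, 1.00, 0.05)

# Hypothetical per-threshold AP values (illustration only):
# AP drops as the overlap criterion becomes stricter.
ap_per_threshold = np.array([0.85, 0.83, 0.80, 0.76, 0.71,
                             0.64, 0.55, 0.44, 0.30, 0.14])

map_50 = ap_per_threshold[0]          # AP at IoU = 0.50
map_50_95 = ap_per_threshold.mean()   # mean over all ten thresholds
print(f"mAP@50 = {map_50:.3f}, mAP@50-95 = {map_50_95:.3f}")
```

This also explains why mAP@50-95 is always lower than mAP@50: the average includes the strict high-IoU thresholds, where AP necessarily drops.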
To assess generalization, the model was evaluated on the unseen test dataset. The trained model achieves an mAP@50 of 0.845, which demonstrates its effectiveness in detecting five different types of interactive objects. The average precision metrics for all five interactive objects are shown in
Table 4. The table provides precision, recall, and both AP@50 and AP@50-95 for all five types of interactive objects. The high precision, recall, and average precision values across all classes indicate that the model performs well in detecting the different interactive objects. Lower mAP values at the higher IoU thresholds are expected, since stricter overlap criteria penalize even small localization errors.
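For reference, the precision and recall values in Table 4 follow the standard detection definitions, where a prediction counts as a true positive when its IoU with a same-class ground-truth box exceeds the threshold. A minimal illustration with hypothetical counts:

```python
# Hypothetical counts for one class (illustration only).
tp, fp, fn = 90, 8, 12

precision = tp / (tp + fp)  # fraction of detections that are correct
recall = tp / (tp + fn)     # fraction of ground-truth objects found
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```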
4.4. Comparative Evaluation of Object Detection Models
This section presents a comprehensive comparative analysis of SSD, YOLOv5n, and YOLOv7tiny across different dataset scales and hardware platforms. The evaluation focuses on detection accuracy, localization precision, computational efficiency, and real-time deployment feasibility. Experiments were conducted on an NVIDIA RTX 3060 GPU and the Jetson Nano embedded platform using standardized metrics, including mAP@50, mAP@50-95, inference latency, FPS, parameter count, and model size.
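A representative way to obtain latency and FPS figures of this kind is to time warmed-up forward passes with device synchronization. The sketch below assumes a PyTorch model and a CUDA-capable device; model loading and input resolution are assumptions, not the exact benchmarking harness used here.

```python
import time
import torch

def benchmark(model, input_shape=(1, 3, 640, 640), runs=200, warmup=20):
    """Measure mean inference latency (ms) and FPS for a PyTorch model."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)

    with torch.no_grad():
        for _ in range(warmup):           # warm-up: stabilize clocks/caches
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()      # wait for queued GPU work
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

    latency_ms = elapsed / runs * 1000
    return latency_ms, 1000.0 / latency_ms  # (mean latency in ms, FPS)
```

Synchronizing before reading the timer matters on GPUs: kernel launches are asynchronous, so timing without it would measure launch overhead rather than actual inference time.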
The initial experiments were conducted on a dataset comprising 1000 images to investigate the baseline performance of the selected models. As reported in
Table 5, YOLOv5n achieved an mAP@50 of 0.845 and an mAP@50-95 of 0.593, outperforming both YOLOv7tiny and SSD. In comparison, YOLOv7tiny obtained mAP@50 and mAP@50-95 values of 0.584 and 0.325, respectively, while SSD achieved 0.6219 and 0.4792. Consequently, YOLOv5n exhibited a relative improvement of 44.7% in mAP@50 over YOLOv7tiny and 35.9% over SSD. Similarly, in terms of mAP@50-95, YOLOv5n improved performance by 82.5% over YOLOv7tiny and 23.7% over SSD. From a model complexity perspective, YOLOv5n required only 1.17 million parameters and occupied 3.38 MB of storage, whereas YOLOv7tiny and SSD required 6.02 million parameters (12 MB) and 24.3 million parameters (94.35 MB), respectively. This corresponds to a 95.2% reduction in parameter count relative to SSD and an 80.6% reduction relative to YOLOv7tiny, highlighting the architectural efficiency of YOLOv5n. In terms of inference efficiency, YOLOv5n achieved an inference latency of 2.9 ms and 75 FPS, compared to 3.8 ms and 60 FPS for YOLOv7tiny, and 10.93 ms and 22 FPS for SSD. Thus, YOLOv5n reduced inference latency by 73.5% and increased FPS by 241% compared to SSD, demonstrating its suitability for real-time applications. However, given the limited dataset size, the representational capacity of the larger models, SSD and YOLOv7tiny, may not have been fully exploited. Therefore, the dataset was expanded to ensure a more reliable and statistically meaningful comparison.
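The relative improvements quoted above follow directly from the Table 5 values, e.g. (0.845 - 0.584) / 0.584 ≈ 44.7%. The short sketch below reproduces the reported percentages:

```python
def rel_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100

# Values from Table 5 (1000-image dataset).
print(f"mAP@50 vs YOLOv7tiny:    {rel_change(0.845, 0.584):+.1f}%")   # +44.7%
print(f"mAP@50 vs SSD:           {rel_change(0.845, 0.6219):+.1f}%")  # +35.9%
print(f"mAP@50-95 vs YOLOv7tiny: {rel_change(0.593, 0.325):+.1f}%")   # +82.5%
print(f"params vs SSD:           {rel_change(1.17, 24.3):+.1f}%")     # -95.2%
print(f"params vs YOLOv7tiny:    {rel_change(1.17, 6.02):+.1f}%")     # -80.6%
print(f"FPS vs SSD:              {rel_change(75, 22):+.1f}%")         # +240.9%
```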
To evaluate the scalability and generalization capability of the models, the dataset size was increased to 3000 images, and all models were retrained under identical experimental conditions. The results, summarized in
Table 5, confirm the robustness of YOLOv5n. On the enhanced dataset, YOLOv5n achieved an mAP@50 of 0.907 and an mAP@50-95 of 0.674. In contrast, YOLOv7tiny achieved 0.796 and 0.500, while SSD achieved 0.6877 and 0.4921. Quantitatively, YOLOv5n outperformed YOLOv7tiny by 13.9% and SSD by 31.9% in mAP@50. Similarly, YOLOv5n achieved improvements of 34.8% and 37.0% in mAP@50-95 over YOLOv7tiny and SSD, respectively. These results indicate that YOLOv5n not only maintains superior accuracy but also exhibits stronger data scalability. Notably, inference latency and FPS remained constant across dataset sizes, implying that YOLOv5n preserves its computational efficiency irrespective of dataset scale, which is critical for real-time systems.
To further analyze the detection characteristics of the models, a class-wise evaluation was performed on the 3000-image dataset (
Table 6). The results demonstrate that YOLOv5n consistently achieves higher accuracy across most object categories. For example, YOLOv5n achieved mAP@50 values of 0.886 (Door), 0.917 (Fire), 0.931 (Fire Extinguisher), and 0.958 (Chair). Compared to SSD, YOLOv5n improved detection accuracy by 144.8% (Door), 53.6% (Fire), and 44.1% (Chair), while showing comparable performance for Fire Extinguisher. Moreover, the mAP@50-95 metric reveals that YOLOv5n provides superior localization precision, particularly for objects with complex geometries and scale variations. This suggests that YOLOv5n benefits from more effective multi-scale feature fusion and bounding-box regression mechanisms. The class-wise analysis further indicates that SSD exhibits relatively lower robustness across classes, while YOLOv7tiny shows higher inter-class performance variance, reflecting limitations in feature representation and localization accuracy.
To assess the feasibility of real-time deployment in resource-constrained environments, YOLOv5n and YOLOv7tiny were evaluated on the Jetson Nano platform (
Table 7). These models were selected based on their superior GPU performance. YOLOv5n achieved an inference time of 52.6 ms and 19 FPS, which decreased to 12 FPS when system-level delays were considered. In contrast, YOLOv7tiny achieved an inference time of 76.9 ms and 13 FPS, dropping to 5 FPS with delays. Quantitatively, YOLOv5n reduced inference latency by 31.6% and improved FPS by 46.2% compared to YOLOv7tiny. When system-level delays were included, YOLOv5n achieved a 140% higher effective FPS, highlighting its superior suitability for embedded real-time applications. These findings confirm that model compactness and architectural efficiency play a critical role in achieving real-time performance on low-power hardware platforms.
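The gap between raw inference FPS and effective FPS can be quantified by timing the full pipeline rather than the forward pass alone. The sketch below assumes an OpenCV camera source; the resize step and the model call are placeholders standing in for the actual preprocessing and detector.

```python
import time
import cv2  # OpenCV for camera capture

def effective_fps(model, source=0, frames=100):
    """Time the full capture -> preprocess -> inference loop, so the
    result reflects system-level delays, not just model latency."""
    cap = cv2.VideoCapture(source)
    start = time.perf_counter()
    done = 0
    while done < frames:
        ok, frame = cap.read()                # capture delay included
        if not ok:
            break
        blob = cv2.resize(frame, (640, 640))  # preprocessing delay included
        _ = model(blob)                       # inference (placeholder call)
        done += 1
    cap.release()
    return done / (time.perf_counter() - start)
```

Because capture and preprocessing costs are fixed per frame, they weigh more heavily against a fast model, which is consistent with YOLOv5n losing proportionally more FPS (19 to 12) than its raw latency alone would suggest.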
The experimental results consistently demonstrate that YOLOv5n achieves an optimal balance between detection accuracy, localization precision, and computational efficiency. Its superior performance can be attributed to its lightweight backbone, efficient feature pyramid network, and optimized anchor-based detection head, which collectively enhance feature representation while minimizing computational overhead. In contrast, SSD suffers from limited multi-scale feature extraction capability, leading to reduced localization accuracy, while YOLOv7tiny, despite its improved architectural design, incurs higher computational complexity without proportional accuracy gains. Therefore, the results indicate that YOLOv5n is the most suitable model for real-time object detection in both high-performance GPU environments and embedded systems. Its consistent superiority across datasets, metrics, and hardware platforms underscores its robustness and practical applicability in real-world vision-based applications.