Optimizing Computer Vision for Edge Deployment in Industry 4.0: A Framework and Experimental Evaluation

Azab, Eman; Ehab, Mohamed; Shihata, Lamia; Mashaly, Maggie

doi:10.3390/technologies14020126


¹ Electronics Department, Faculty of Information Engineering and Technology, The German University in Cairo, New Cairo City 11835, Cairo, Egypt

² Mechatronics Department, Faculty of Engineering Material Science, The German University in Cairo, New Cairo City 11835, Cairo, Egypt

³ Design and Production Department, Faculty of Engineering Material Science, The German University in Cairo, New Cairo City 11835, Cairo, Egypt

⁴ Networks Department, Faculty of Information Engineering and Technology, The German University in Cairo, New Cairo City 11835, Cairo, Egypt

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Technologies 2026, 14(2), 126; https://doi.org/10.3390/technologies14020126

Submission received: 1 January 2026 / Revised: 5 February 2026 / Accepted: 10 February 2026 / Published: 17 February 2026


Abstract

Integrating high-performance computer vision (CV) into Industry 4.0 environments remains a challenge due to the computational disparity between state-of-the-art (SOTA) models and resource-constrained edge hardware. This study proposes a hardware-aware optimization framework designed to bridge this gap, focusing on real-time object detection for high-speed, omnidirectional conveyor systems. Unlike conventional benchmarking, the proposed framework employs a multi-stage optimization pipeline (integrating backbone refinement, hyperparameter tuning, and quantization) to transition diverse architectures from baseline configurations ($M_{base}$) to hardware-optimized variants ($M_{opt}$). The framework’s efficacy is validated using a custom-built standalone experimental platform detecting package features, brands, and disruptions on an omnidirectional-wheeled conveyor. A comprehensive comparative analysis is conducted across a heterogeneous edge ecosystem, including the NVIDIA Jetson Nano (GPU), Raspberry Pi 4 (CPU), and Google Coral (TPU). Our findings demonstrate that through systematic tuning, the YOLOv10n variant emerged as the superior architecture, achieving a precision of 98.1% and an mAP@50:95 of 81.22%. Post-deployment characterization reveals that the optimized YOLOv10n model on the NVIDIA Jetson Nano achieved a peak inference speed of 25 frames per second (FPS), successfully striking the “Pareto-optimal” balance between predictive accuracy and real-time processing. The primary contributions of this work include a reproducible optimization methodology, a comparative performance map across three distinct hardware backends, and the release of a specialized industrial conveyor dataset.

Keywords: computer vision; deep learning; edge devices; omnidirectional conveyor; optimization

1. Introduction

Industry 4.0 aims to transform traditional manufacturing processes into smart ones by creating factories that can operate autonomously. Processes such as scheduling, product handling, and distribution are attracting researchers’ interest [1,2,3,4]. One of the critical components of Industry 4.0 is the flexible material flow system, which enables efficient material handling and achieves higher efficiency in terms of cost and time. The omnidirectional wheeled conveyor is one example of such a system; it helps resolve traffic bottlenecks in warehouses and logistics centers [1,2].

In recent years, there has been growing interest in the development of omnidirectional conveyor systems, which are designed to move objects in any direction without changing their orientation. Several researchers have explored different approaches to build, control, and improve the performance of the conveyor by using sensors that provide real-time data for monitoring the system. In particular, cameras have been used for package detection, and image processing and CV techniques have consequently been explored to improve conveyor performance by adding more features. The use of CV in Industry 4.0 applications is expected to continue to grow, leading to further improvements in efficiency, productivity, and quality control.

With the use of the camera as a sensor, researchers have explored image processing and CV for various purposes [5]. Furthermore, with the development of edge devices, detecting the features of packages transported on the conveyor in real time became possible with a low-cost standalone device. The following paragraphs summarize related work that used CV and image processing in the Industry 4.0 framework for conveyor systems. None of this related work was deployed on edge devices, and none covered all of the features detected in real time here. In [6,7,8], the authors discussed kinematic modeling and control of the conveyor. In [7], no camera was used, and the conveyor control system was explained. In [8], image thresholding was used to detect and track the location of a package.

In [9,10], the authors developed path planning and sorting of packages for the conveyor using reinforcement learning (RL). The authors used Q-learning, Double Q-learning, Deep Q-learning, and the Double Deep Q-learning algorithms to develop the proposed systems. The study’s findings were supported by experimental hardware results showing that the proposed RL methods were as successful as the conventional mathematical methods in path planning and package sorting in much less processing time. A camera was used to collect the dataset for the RL training and testing in [9].

In [11], the authors applied more advanced image processing methods using a 2D camera on an e-Pattern conveyor. The image processing methods in [11] were used to detect omnidirectional wheels and brown packages with contour detections to determine the center coordinate of the package in real time. No CV models for feature detection were investigated.

A comparative study was presented in [12] between the use of conventional image processing techniques and CV models for package detection in real time. The work in [12] showed that the YOLOv5 model is better at detecting packages at different brightness levels, with an accuracy of 98% at 45.5 FPS compared with normal image processing methods. An NVIDIA GPU was used, and no feature extraction to detect damaged packages or text was performed.

In [13], CV models were used for pallet size detection for turntable control. The authors in [13] compared different object detection and segmentation techniques such as YOLOv5, R-CNN, and YOLOv5-OBB. The authors observed that the YOLOv5-OBB segmentation model had better results, with an average precision (AP) of 0.93. The CV models ran on a PC (NVIDIA GTX 1070Ti), achieving 14 FPS.

All the related work mentioned above used the camera only to detect the package position and for control purposes. Furthermore, the CV models were processed on a PC. Deploying the CV model on a standalone edge device was not explored, although this solution is highly promising for real-time applications. In addition, the challenges of deploying the CV model on a specific edge board have not been discussed in detail. The need for a generic framework for assessing the compatibility of a specific edge device for a certain CV application is vital. Furthermore, the dataset preparation process for a specific use case is crucial to ensure satisfactory results.

Nevertheless, many recent articles have studied the deployment of CV models on edge nodes. In [14], the authors focused on the use of YOLO algorithms on NVIDIA boards for UAV applications, as well as on edge versus cloud deployment issues. The work in [14] analyzed the performance of YOLO models without exploring the model architecture or hyperparameter tuning, and TPU-based edge nodes were not explored. In [15], the authors also explored YOLO on a Jetson Nano board for UAV applications, with a brief discussion of model optimization. Other work was presented in [16], where multiple SOTA models were investigated across multiple edge nodes. However, the authors in [16] did not investigate the effect of fine-tuning the CV model, nor did they use the system in a real-time application to compute the FPS. In addition, they used a hybrid configuration of two edge nodes together and did not investigate each type of edge node on its own.

While our previous work [12] established the feasibility of vision-based tracking for omnidirectional conveyors using standard desktop computing, it did not address the critical constraints of industrial deployment, such as power efficiency, hardware-specific latency, and the research-to-production gap. In this study, we propose a comprehensive optimization framework designed specifically for Industry 4.0 edge nodes. The novelty of this work lies in three areas:

The Comparative Framework: Unlike existing case studies, we provide a multi-dimensional analysis of multiple SOTA models (including YOLOv10 and EfficientDet) across three heterogeneous hardware architectures (GPU, CPU, and TPU).
Deployment Optimization: We quantify the impact of hardware-specific acceleration (TensorRT and Edge TPU compilers) on real-time Industry 4.0 applications.
Operational Pareto Frontiers: We identify the “sweet spot” between detection mAP and inference speed, providing a decision-making tool for engineers to select the optimal model based on the specific hardware budget of a factory floor.
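
The idea of an operational Pareto frontier can be illustrated with a minimal sketch. It assumes each candidate is summarized by two numbers, an mAP value and a measured FPS, where higher is better for both. The `Candidate` type, the `pareto_front` helper, and the numeric values are hypothetical placeholders for illustration and are not results from this study.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    map_50_95: float  # detection quality, higher is better
    fps: float        # inference speed on the target board, higher is better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Return the candidates that are not dominated by any other candidate."""
    front = []
    for c in candidates:
        # c is dominated if another candidate is at least as good on both
        # axes and strictly better on at least one of them.
        dominated = any(
            o.map_50_95 >= c.map_50_95
            and o.fps >= c.fps
            and (o.map_50_95 > c.map_50_95 or o.fps > c.fps)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front


# Hypothetical placeholder measurements, NOT values reported in this paper.
candidates = [
    Candidate("model_a", 0.80, 10.0),
    Candidate("model_b", 0.75, 25.0),
    Candidate("model_c", 0.70, 20.0),  # worse than model_b on both axes
    Candidate("model_d", 0.60, 30.0),
]

front = pareto_front(candidates)
print([c.name for c in front])
```

An engineer would then pick one point from the returned front according to the minimum FPS or minimum mAP that the application can tolerate.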

The target application of the presented work is to detect the features of transported packages and package flow interruptions in real time. The work compares multiple CV models and edge boards to balance the speed and accuracy of detecting package features on the conveyor under study, with the goal of achieving a higher FPS on the edge nodes. This paper is organized as follows. Section 2 discusses the proposed framework, and Section 2.1 describes the standalone system components and the experimental set-up. The dataset preparation for the CV model training and the fine-tuning of the CV model parameters are explained in Section 2.2 and Section 2.3, respectively. The CV models’ optimization is discussed in Section 2.4, and the hardware deployment results and discussions are given in Section 3. Finally, this paper is concluded in Section 4.

2. Methodology

The proposed framework is shown in Figure 1. The aim of the framework is to provide a comprehensive overview of the key steps needed to deploy CV models on edge devices for Industry 4.0 applications, taking the feature extraction of moving packages on an omnidirectional wheeled conveyor as a case study. The workflow begins by preparing the experimental set-up for dataset collection, which includes camera calibration to determine the essential geometric parameters of the image formation process. The study of the edge device’s specifications is crucial, specifically its hardware configuration and compatibility with the requirements of CV model deployment, which will be discussed in detail in the coming sections.

During the second step of the workflow, meticulous data preparation takes place to facilitate generalized training for the CV models. This step involves curating and preprocessing the data to ensure its quality, diversity, and suitability for CV model training purposes. Proceeding to the third step in the framework, which focuses on fine-tuning the state-of-the-art CV model’s parameters, the aim is to improve the obtained CV model accuracy in the context of our case study. This task involves adjusting the CV model’s parameters and optimizing its performance before considering the deployment on a specific edge device. The fourth step takes into consideration the edge device’s specifications. At this point, the CV model is optimized to reduce the inference time required to execute the targeted application on the edge device. This process takes into account the hardware configuration of the edge device and explores techniques such as quantization to improve the CV model’s execution efficiency and speed. Once the CV models have been fine-tuned and optimized for a specific edge device, the models are deployed. Furthermore, it is crucial to evaluate the CV models’ performance experimentally. This evaluation process allows for a comprehensive assessment of the CV models’ performance on the edge device, enabling selection of the model that strikes the optimal balance between accuracy and speed. By following this workflow, the design and implementation of a standalone edge platform for CV-based Industry 4.0 applications can be executed effectively.
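
As an illustration of the quantization step mentioned in the workflow, the following is a minimal sketch of affine 8-bit quantization of a list of weights. It assumes a signed int8 range and a single scale and zero point for the whole tensor. The helper names and the sample weights are hypothetical, and the actual toolchains used for deployment may apply a different scheme.

```python
def quantization_params(w_min, w_max, qmin=-128, qmax=127):
    """Compute a scale and zero point that map [w_min, w_max] onto int8."""
    # Extend the range so that 0.0 is always exactly representable.
    w_min = min(w_min, 0.0)
    w_max = max(w_max, 0.0)
    scale = (w_max - w_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate case: all weights are zero
    zero_point = round(qmin - w_min / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to clamped integers."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]


# Hypothetical weights used only to exercise the functions.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zero_point = quantization_params(min(weights), max(weights))
q_weights = quantize(weights, scale, zero_point)
restored = dequantize(q_weights, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_weights, max_error)
```

The reconstruction error stays on the order of one quantization step, which is the price paid for storing each weight in 8 bits instead of 32.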

2.1. Proposed Experimental Set-Up

As shown in Figure 2, the experimental setup consisted of an omnidirectional wheeled conveyor, an edge device, a streaming screen, and a 2D camera. The camera captured real-time images of the conveyor system. The camera was a Microsoft HD-3000 USB webcam, manufactured in China and bought from Egypt, which is capable of capturing images with a 640 × 480 pixel resolution at 30 FPS, and it was placed perpendicular to the conveyor. The conveyor’s mechanical structure consisted of 13 hexagonal cells, and each cell had three omnidirectional wheels separated by 120°. The wheels of each cell were driven by a single DC motor. As for the edge device, many options exist in today’s market. These devices are small and lightweight. Many of them include a graphics processing unit (GPU), central processing unit (CPU), or tensor processing unit (TPU), which enables them to execute a wide range of CV models in real time.

2.2. Dataset Preparation

Before starting the training process for the CV models, the dataset should be collected, labeled, and divided into classes. The dataset classes represent packages, features, and defects. Video frames were captured under different brightness levels to obtain sufficient samples of each class. After removing duplicated frames, the dataset was augmented using flipping, rotation, and brightness level and exposure level variation. Data augmentation helped in generalizing the CV model. The final dataset contained 11,562 images.

The dataset was prepared through a combination of real-world video captures and supplemental online sources. Videos were recorded across a variety of scenarios, including different types of boxes, branded packages, and damaged items with a wide range of sizes. Data collection was conducted while covering the conveyor surface with a dark green shaded color cover and without it to reflect realistic industrial settings, with variations in the brightness conditions. To further enrich the dataset and improve model generalization, additional images were sourced from the internet, ensuring representation of diverse packaging types and conditions. This comprehensive approach ensured that the dataset captured a broad spectrum of object appearances and deployment conditions.

The dataset included eight classes: Amazon, Box, Conveyor, DamagedBox, HP, Interruption, Nestle, and OpenBox, with core classes like Box, DamagedBox, Interruption, and OpenBox representing typical conveyor objects. Data augmentation techniques, such as flipping, rotation, and brightness adjustments, were employed to increase the dataset’s size and variety, ensuring better generalization and reducing overfitting. Initially, the dataset contained 9344 annotations from 4785 images. After augmentation, it grew to 11,562 images with 20,043 annotations, where the training set accounted for 88% of the data (10,166 images and 20,043 annotations), the validation set accounted for 8% (926 images and 1795 annotations), and the testing set accounted for 4% (470 images and 868 annotations). Shown in Figure 3 are examples of images of packages with and without defects.
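
The split percentages quoted above can be checked from the reported image counts with a short sketch. Only the three image counts come from the text; the variable names are illustrative.

```python
# Image counts per split as reported in the text.
split_images = {"train": 10_166, "val": 926, "test": 470}

total_images = sum(split_images.values())
fractions = {name: count / total_images for name, count in split_images.items()}

for name, fraction in fractions.items():
    print(f"{name}: {split_images[name]} images ({fraction:.1%})")
```

The three counts add up to the 11,562 images of the augmented dataset, and the fractions round to the stated 88%, 8%, and 4%.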

To ensure accurate feature detection, camera calibration was performed in the experimental set-up’s preparation phase. This was carried out to correct for lens distortion and to ensure detection of the package’s exact position. The calibration was conducted by capturing multiple images of a 7 × 9 checkerboard from different angles and analyzing them to create a 3D model of the package. This was accomplished using MATLAB version 9.8 (see Figure 4).

The camera calibration parameters are listed in Table 1. For the intrinsic matrix, $c_{x}$ and $c_{y}$ are the principal point coordinates, $f_{x}$ and $f_{y}$ are the focal lengths in pixels, and $k_{1}$ and $k_{2}$ are used to account for radial distortion. For the extrinsic matrix, parameters $r_{1}$, $r_{2}$, and $r_{3}$ define the rotation of the camera along the three axes, indicating its orientation, while $d_{x}$, $d_{y}$, and $d_{z}$ represent the camera’s translation along the x, y, and z axes, respectively, describing its position in 3D. These calibrated parameters enabled accurate image capturing. Table 1 shows the calibration parameters’ values.
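
To make the role of the intrinsic parameters concrete, the following sketch projects a 3D point given in the camera frame to pixel coordinates using the common pinhole model with a two-coefficient radial distortion term. The numeric values are hypothetical placeholders chosen for a 640 × 480 image and are not the calibrated values from Table 1. The extrinsic rotation and translation are assumed to have been applied already.

```python
from typing import NamedTuple


class Intrinsics(NamedTuple):
    fx: float  # focal length along x, in pixels
    fy: float  # focal length along y, in pixels
    cx: float  # principal point x, in pixels
    cy: float  # principal point y, in pixels
    k1: float  # first radial distortion coefficient
    k2: float  # second radial distortion coefficient


def project(point_cam, cam):
    """Project a 3D point in the camera frame to distorted pixel coordinates."""
    X, Y, Z = point_cam
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    # Normalized image coordinates.
    x, y = X / Z, Y / Z
    # Radial distortion factor: 1 + k1*r^2 + k2*r^4.
    r2 = x * x + y * y
    radial = 1.0 + cam.k1 * r2 + cam.k2 * r2 * r2
    u = cam.fx * x * radial + cam.cx
    v = cam.fy * y * radial + cam.cy
    return u, v


# Hypothetical parameters, NOT the values from Table 1.
cam = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0, k1=-0.1, k2=0.01)

print(project((0.0, 0.0, 1.0), cam))  # a point on the optical axis
print(project((0.1, 0.0, 1.0), cam))  # a point slightly off-axis
```

A point on the optical axis lands on the principal point, and the distortion term only matters as the point moves away from the image center.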

The collected dataset was divided into 88%, 8%, and 4% for training, validation, and testing, respectively. The training dataset batch size was set to 16 images per batch. Images were annotated in terms of detection of the conveyor, package, brand names like Amazon, HP, and Nestle, and defects like interruptions and damaged packages.

The data augmentation methods used were flipping (horizontally and vertically), rotation (−20° to +20°), brightness (−15% to +15%), and exposure (−25% to +25%), as shown in Figure 5. Augmentation methods such as blurring, cutout, and adding noise were not used, because these conditions are not expected in the target industrial environment and could reduce the overall performance.
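
The augmentation ranges above can be expressed as a small sampling routine. This is a sketch under stated assumptions: each parameter is drawn uniformly from its range and each flip is applied with probability 0.5. Neither the sampling distribution nor the flip probability is specified in the text, and the function name is hypothetical.

```python
import random

# Ranges taken from the text; units are noted in the key names.
AUGMENTATION_RANGES = {
    "rotation_deg": (-20.0, 20.0),
    "brightness_pct": (-15.0, 15.0),
    "exposure_pct": (-25.0, 25.0),
}


def sample_augmentation(rng):
    """Draw one set of augmentation parameters for a single image."""
    params = {
        name: rng.uniform(low, high)
        for name, (low, high) in AUGMENTATION_RANGES.items()
    }
    # Assumption: each flip is applied independently with probability 0.5.
    params["flip_horizontal"] = rng.random() < 0.5
    params["flip_vertical"] = rng.random() < 0.5
    return params


rng = random.Random(0)  # fixed seed so the draw is repeatable
print(sample_augmentation(rng))
```

The sampled parameters would then be passed to whichever image library performs the actual transformation, together with the matching update of the bounding box annotations.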

2.3. SOTA CV Model Fine-Tuning and Evaluation

This section presents the third step of the workflow given in Figure 1, which is training and evaluation of the SOTA CV models used for our case study using the dataset created in Section 2.2. Each SOTA CV model discussed in this work has its pros and cons, which are as follows:

Single-Shot MultiBox Detector (SSD) predicts bounding boxes and class probabilities in a single pass, enabling moderate detection speed and accuracy. Its straightforward architecture is computationally efficient, but its performance may be limited when it comes to detecting smaller objects, such as package logos, or handling complex features like defects and interruptions.
You Only Look Once (YOLO) is designed for real-time object detection by processing images from end to end in a single network pass. This architecture is highly efficient for real-time applications and can effectively handle moderate-to-small object detection tasks. However, there is a trade-off between speed and accuracy which varies depending on the YOLO model variant and its parameter optimization.
EfficientDet leverages compound scaling, and it uses EfficientNet as its backbone to achieve high accuracy with optimized resource utilization. It excels at detecting fine details, including smaller and partially occluded objects, making it highly suitable for detecting intricate defects and interruptions. However, its computational demands are higher than SSD and YOLO, requiring extensive optimization for edge deployment.

A comparative study of the SOTA CV models mentioned above was conducted to select the optimal one for the presented work. The models’ variants were also studied. Initially, a transfer learning strategy was adopted by using pretrained weights from the COCO dataset. This was carried out to accelerate the training process, and this approach allowed the SOTA CV models to converge faster and obtain better results. During the training phase, the created dataset was used to train the SOTA CV models while varying the hyperparameters, backbones, and loss functions. The SOTA CV models’ performance was evaluated based on accuracy and speed. The performance was assessed using metrics such as precision (P), recall (R), average precision (AP), and mean average precision (mAP).
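
For reference, the building blocks behind these metrics can be sketched as follows, using the standard definitions of intersection over union (IoU), precision, and recall. The box format `(x1, y1, x2, y2)`, the function names, and the example counts are assumptions for illustration. The full AP and mAP computation, which integrates precision over recall per class, is omitted.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# IoU thresholds over which AP is averaged for mAP@50:95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

# Hypothetical counts, used only to exercise the functions.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(precision_recall(tp=90, fp=10, fn=30))
print(IOU_THRESHOLDS)
```

A detection counts as a true positive at a given threshold only if its IoU with a ground-truth box of the same class meets that threshold, which is why mAP@50:95 is always lower than or equal to mAP@50.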

To improve the CV model performance during training, a structured fine-tuning approach was applied based on insights from both the training loss curves and the convergence behavior of the precision and recall metrics. As shown in Table 2, certain hyperparameters such as the learning rate, layer size, and loss function exhibited high sensitivity and therefore had a substantial impact on the training stability and convergence speed [17]. Accordingly, fine-tuning began by adjusting the learning rate, as small changes in this parameter significantly influence CV model convergence.

Additionally, modifications to model depth and layer sizes were explored by experimenting with architectures from compact to deep (e.g., from the baseline model to multiple variations, each defined in the subscript), resulting in noticeable improvements in detection performance.

In contrast, hyperparameters like the batch size and optimizer type were found to be of lower sensitivity and thus had minimal impact on the final results. Throughout the training experiments, models were iteratively refined through changes in the epoch count, backbone architecture, and auxiliary parameters. Each version, represented as SOTA CV models $M_{base}$, etc., reflects a progressively improved configuration, validated through consistent improvements in evaluation metrics such as the mAP, precision, and recall. The SOTA CV models, referred to as $M_{base}$ and so on, which will be shown in the coming sections, represent different training configurations and fine-tuning adjustments, starting with model $M_{base}$ as the baseline. This tuning process demonstrates a principled approach to hyperparameter optimization aligned with empirical sensitivity trends observed in DL research [17].

2.3.1. Single Shot Detector (SSD)

SSD is a popular object detection CV model that achieves fast and efficient detection with high accuracy. Unlike detection methods that use multiple stages and complex algorithms, SSD uses a single neural network (NN) to perform object detection and classification in a single forward pass [18].

The MobileNetSSD $M_{SSD,v1}$ and $M_{SSD,v2}$ models were introduced in [19,20]. These models are optimized for mobile and embedded devices, with a low parameter count and fast computational speed. These models achieve high accuracy with a lightweight NN architecture. Furthermore, these models show promising results in terms of accuracy and speed in various CV applications. In our case study, MobileNetSSD models $M_{SSD,v1}$ and $M_{SSD,v2}$ and their lite editions were studied. During the training phase, MobileNetSSD $M_{SSD,v2}$ had a slight advantage in terms of loss compared with the other models, with a P of 95.5%, R of 82.02%, mAP at an IoU of 0.5 (mAP@50) of 79.01%, and mAP@50:95 of 74.7%, as shown in Table 3.

2.3.2. You Only Look Once (YOLO)

The YOLO algorithm was introduced in [21]. Since then, several variants of the YOLO algorithm have been proposed. The variants introduced improvements to the original YOLO algorithm, such as anchor boxes, skip connections, and feature pyramid networks, leading to further improvements in detection accuracy and speed. YOLO works by dividing an input image into a grid and predicting the bounding boxes, class probabilities, and Objectness scores for each cell in the grid [22,23,24]. YOLO differs from other object detection algorithms that use region proposal networks (RPNs), which can be computationally expensive [25].

YOLOv3 uses a 53-layer Darknet backbone as a feature extractor with residual connections, which improve the quality of feature extraction and mitigate the vanishing gradient problem [22]. As illustrated in the YOLOv3 architecture, the network then applies upsampling and feature-map concatenation (skip fusion) to increase the spatial resolution of the feature grids passed to the scale-prediction layers, enabling better detection of smaller objects. In these prediction layers, na denotes the number of anchors per scale and nc denotes the number of classes, so each output tensor has na×(nc+5) channels per grid cell, as shown in Figure 6 and Figure 7. YOLOv3 processes input images through a 106-layer convolutional network, combining the 53-layer Darknet backbone with additional layers for detection. It generates multi-scale outputs for bounding boxes, described by the Objectness scores, coordinates, and class labels, with the IoU and non-maximum suppression (NMS) used to refine overlapping predictions.
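
The na×(nc+5) channel count can be made concrete with a short sketch. The eight classes match the dataset described in Section 2.2. The 416 × 416 input size, three anchors per scale, and strides of 32, 16, and 8 are the usual YOLOv3 defaults and are assumptions here, since the text does not state the values used in the experiments.

```python
def head_shapes(input_size, num_classes, anchors_per_scale=3, strides=(32, 16, 8)):
    """Return (grid_h, grid_w, channels) for each detection scale."""
    # Each anchor predicts 4 box coordinates, 1 objectness score,
    # and one score per class: na * (nc + 5) channels per grid cell.
    channels = anchors_per_scale * (num_classes + 5)
    return [(input_size // s, input_size // s, channels) for s in strides]


# Assumed input size of 416; the 8 classes come from the dataset description.
shapes = head_shapes(input_size=416, num_classes=8)
for shape in shapes:
    print(shape)
```

With these assumptions each grid cell carries 3 × (8 + 5) = 39 values, and the finest grid (stride 8) is the one that contributes most to detecting small objects such as logos.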

The YOLOv3 model architecture has no pooling layers for downsampling the feature maps. Instead, YOLOv3 uses convolutional layers with a stride of two to downsample them. This design helps prevent the model from losing low-level features and improves its ability to detect smaller objects. While YOLOv3 may not be the optimal choice for real-time deployment on edge boards due to its computational demands, it offers impressive performance capabilities. Nevertheless, by employing the same architecture in a smaller model (Tiny), real-time object detection with minimal degradation in accuracy can be achieved. This makes smaller models a more suitable option for edge devices where computational resources are limited.
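
The effect of stride-two convolutions on the feature-map size follows from the standard convolution output-size formula, sketched below. The 3 × 3 kernel with padding of one and the 416-pixel input are assumed values for illustration, as is the count of five downsampling stages.

```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed 416-pixel input passed through five stride-2 convolutions.
size = 416
sizes = [size]
for _ in range(5):
    size = conv_output_size(size)
    sizes.append(size)

print(sizes)
```

Each stride-two layer halves the resolution while its learned filters decide what to keep, in contrast to a pooling layer, which applies a fixed reduction.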

The training optimizer choices were SGD and Adam, the batch size was selected to be between 12 and 32, and the number of training epochs varied between 100 and 300. Table 4 presents the training results for the YOLOv3 ($M_{opt,e}$) model, which was trained using SGD for up to 170 epochs. Early stopping was applied after 30 epochs without improvement, which slightly enhanced the training performance, with a P of 97.5%, R of 97.6%, mAP at an IoU of 0.5 (mAP@50) of 96.97%, and mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05 (mAP@50:95) of 80.6%.
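
The early stopping rule described above (at most 170 epochs, stopping after 30 epochs without improvement) can be sketched as a patience counter over a per-epoch validation score. The function name and the sample scores are hypothetical, and the monitored quantity is assumed to be a score for which higher is better.

```python
def run_with_early_stopping(val_scores, max_epochs=170, patience=30):
    """Return (epochs_run, best_epoch) for a sequence of validation scores."""
    best_score = float("-inf")
    best_epoch = -1
    epochs_run = 0
    for epoch, score in enumerate(val_scores[:max_epochs]):
        epochs_run = epoch + 1
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop here.
            break
    return epochs_run, best_epoch


# Hypothetical validation scores, with a short patience to keep the demo small.
scores = [0.50, 0.60, 0.70, 0.72, 0.72, 0.71, 0.72, 0.70]
print(run_with_early_stopping(scores, patience=3))
```

In practice the weights from `best_epoch` would be kept as the final checkpoint, so the extra epochs spent waiting do not degrade the saved model.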

The YOLOv4 model architecture uses a backbone of CSPDarknet-53 or Darknet-53 which is responsible for extracting the features of the input image through five Resblocks, each consisting of a series of convolution layers with sizes of 1 × 1 and 3 × 3 [23]. Additionally, each convolution layer is accompanied by a batch normalization (BN) layer and a leaky-ReLU activation layer.

The architecture of YOLOv4 is shown in Figure 8, which highlights the model’s use of CSP or Darknet-53 as the backbone, enhanced by Mish activation functions, multi-scale detection, and advanced augmentation techniques. This configuration enables the model to achieve high accuracy, especially in detecting small objects. Figure 9 illustrates the YOLOv4-Tiny model, a streamlined version that retains the core principles of YOLOv4 but reduces the number of layers and computations for faster inference, which makes it more suitable for deployment on edge devices with limited resources. Both models employ three detection heads at different scales.

In terms of object detection, YOLOv4 significantly outperformed YOLOv3, especially in small object detection, thanks to advanced data augmentation techniques such as mosaic augmentation and self-adversarial training. These techniques allow YOLOv4 to be more generalized across different scales and environments, leading to improved accuracy. The YOLOv4 models were trained with SGD through 100 epochs with CSP as a backbone and a leaky-ReLU as an activation function. The YOLOv4 model ($M_{csp,opt}$) showed the best performance, with a P of 65.1%, R of 97.5%, mAP@50 of 96.97%, and mAP@50:95 of 78.4%, as shown in Table 5.

The YOLOv5 series offers different architecture sizes, such as YOLOv5x, YOLOv5l, YOLOv5m, YOLOv5s, and YOLOv5n, with varying network architectures and floating-point operation counts (FLOPs) [24]. For our case study, the smaller models (YOLOv5m, YOLOv5s, and YOLOv5n) were chosen due to their suitability for training and deployment on edge devices. A key feature that sets YOLOv5 apart from previous versions, including YOLOv4, is the auto-learning of bounding box anchors. In previous models, anchors were manually predefined or tuned based on a specific dataset. However, YOLOv5 introduces a dynamic approach where the model automatically learns the optimal anchor box sizes during training.

The network architecture for YOLOv5 (s,m) exhibits a structural similarity to the YOLOv5n model, as shown in Figure 10 and Figure 11. However, YOLOv5 (s,m) models incorporate an increased number of modules, leading to a more complex architecture compared with YOLOv5n. This complexity results in higher accuracy at the cost of a reduced inference speed.

As shown in Table 6, the model trained with SGD through 300 epochs with a batch size of 64 (YOLOv5n ($M_{n,opt}$)) showed the best performance among the smaller models, with a P of 98.07%, R of 95.46%, mAP@50 of 96.46%, and mAP@50:95 of 79.25%. Meanwhile, as shown in Table 7, the model trained with SGD through 100 epochs with a batch size of 16 (YOLOv5m ($M_{m,opt}$)) had the best performance, with a P of 96.96%, R of 96.61%, mAP@50 of 97.33%, and mAP@50:95 of 81.69%. This was the best achieved accuracy among the larger YOLOv5 models.

The YOLOv7 model uses Bag of Freebies and Bag of Specials as in YOLOv4 [28]. At its core, YOLOv7 employs CSPDarknet-53 as its backbone network, serving as the primary feature extraction engine that processes input images through multiple convolutional layers enhanced with residual connections. Similar to YOLOv4, the architecture optimizes the performance without increasing the inference time. These residual connections play a crucial role in maintaining gradient flow throughout the deep network while facilitating effective feature propagation.

In addition to the backbone, YOLOv7 develops two innovative blocks, namely the Attention-Enhanced Local Aggregation Network (a-ELAN) and Bottleneck-Enhanced a-ELAN (b-E-ELAN), as shown in Figure 12. These blocks significantly enhance feature extraction and improve object detection accuracy. The a-ELAN block applies an attention mechanism to focus on important features while reducing the effect of irrelevant regions in the image. These enhancements enable YOLOv7 to detect fine-grained details in small objects more effectively, outperforming previously used YOLO architectures in terms of both speed and accuracy.

The architecture of YOLOv7, shown in Figure 13, incorporates CSPDarknet-53 as the backbone, along with SiLU activation and multi-scale detection techniques to improve region detection accuracy. Additionally, YOLOv7 uses data augmentation techniques, such as mosaic augmentation and self-adversarial training (SAT), as introduced in YOLOv4 architecture, which help the model generalize better across various object scales [30,31]. The inclusion of a-ELAN and b-E-ELAN as shown in Figure 12 further optimizes YOLOv7’s ability to focus on critical region areas of the image, ensuring higher accuracy. This combination of architectural features and techniques enables YOLOv7 to achieve higher performance compared with its predecessors.

YOLOv7 was trained using the Adam optimizer for 300 epochs with CSPDarknet-53 as the backbone and SiLU as the activation function. The YOLOv7 model ($M_{base,e300}$) achieved impressive performance, with a P of 97.8%, R of 93.7%, mAP@50 of 95.75%, and mAP@50:95 of 75.21%, as shown in Table 8.

YOLOv10 has a different architecture compared with other YOLO models [32]. YOLOv10’s backbone is based on a more advanced and efficient CSPDarknet-63 backbone, an enhancement over the CSPDarknet-53 backbone used in earlier versions, which extracts features from the input image through a series of convolution layers, using efficient residual connections to enhance gradient flow and feature propagation. Like its predecessors, YOLOv10 employs BN layers and activation functions such as leaky-ReLU to ensure stable training.

The architecture of YOLOv10, shown in Figure 14, uses CSPDarknet-63 as the backbone, alongside Mish activation functions and advanced augmentation techniques. YOLOv10 introduces a new attention module to enhance feature extraction and improve the model’s accuracy, and it uses optimized multi-scale detection for better performance on both small and large objects. The YOLOv10-Nano model, depicted in Figure 15, is a lighter variation of YOLOv10 designed for deployment on edge devices with limited computational resources. It reduces the number of layers and operations to achieve faster inference while retaining high accuracy for small object detection tasks. Similarly, the YOLOv10-Small version maintains the core principles of YOLOv10 but offers a balance between computational efficiency and performance. Both the Nano and Small models use multiple detection heads at different scales to maximize detection accuracy across object sizes.

YOLOv10 uses distributed focal loss in the training phase to address class imbalance by dynamically weighting the contribution of easy and hard examples. This approach enables the model to focus more effectively on difficult-to-detect objects while reducing the emphasis on well-classified objects. This loss was not implemented in the previously discussed YOLO models (YOLOv3, YOLOv4, YOLOv5, and YOLOv7), so it marks a notable change in YOLOv10’s optimization strategy.
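The weighting of easy and hard examples can be illustrated with a minimal sketch of the generic focal-loss modulation, in which the cross-entropy term is multiplied by $(1 - p_t)^{\gamma}$. This is an illustration only: the function names, the value `gamma=2.0`, and the two example probabilities are assumptions, and the code is not taken from the YOLOv10 implementation or from the training set-up of this work.

```python
import math

def cross_entropy(p_true: float) -> float:
    """Plain cross-entropy for one sample, given the probability of the true class."""
    p_true = min(max(p_true, 1e-12), 1.0)  # guard against log(0)
    return -math.log(p_true)

def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Cross-entropy scaled by (1 - p_true)^gamma (gamma is an assumed value)."""
    p_true = min(max(p_true, 1e-12), 1.0)
    modulating = (1.0 - p_true) ** gamma  # close to 0 for easy samples, close to 1 for hard ones
    return modulating * -math.log(p_true)

# Hypothetical predictions: a well-classified (easy) and a poorly classified (hard) sample.
for label, p in (("easy", 0.95), ("hard", 0.30)):
    ce, fl = cross_entropy(p), focal_loss(p)
    print(f"{label}: CE={ce:.4f}  focal={fl:.4f}  kept fraction={fl / ce:.4f}")
```

With these assumed values, the easy sample keeps only a small fraction of its cross-entropy loss, whereas the hard sample keeps roughly half, which is the behavior described above.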

For training the YOLOv10 models, the Adam optimizer was used, with the models trained over 100 epochs and utilizing CSPDarknet-63 backbone variants with Mish activation functions. The YOLOv10n model ($M_{base, n}$) achieved impressive performance, with a P of 98.1%, R of 94.5%, mAP@50 of 98.65%, and mAP@50:95 of 81.2% using the CSPDarknet-63 tiny backbone at a 320 × 320 input resolution with a batch size of 32, as shown in Table 9.
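The precision and recall values reported for each model can be combined into the F1 score listed in the training tables. The sketch below uses the standard harmonic-mean definition. The function name is hypothetical, and the YOLOv10n F1 value is derived here from the reported P and R rather than quoted from Table 9.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Reported values: MobileNet SSDv1 (Table 3) and YOLOv10n (P = 98.1%, R = 94.5%).
print(round(f1_score(0.952, 0.805), 3))  # Table 3 lists 0.872 for this model
print(round(f1_score(0.981, 0.945), 4))
```

The first call reproduces the F1 score given for $M_{SSD, v1}$ in Table 3, which is a quick consistency check on the tabulated P and R values.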

2.3.3. EfficientDet

EfficientDet builds on the EfficientNet model [35,36]. EfficientNet uses a compound scaling method, which jointly scales the network’s depth (number of layers), width (number of channels), and resolution (height and width of input images). This balanced approach avoids the inefficiencies of scaling each dimension independently. In this work, the EfficientDet-D0 architecture, the baseline model in the EfficientDet family, was utilized. This model was selected for its compact size and efficiency, which suit deployment on edge devices with limited computational resources.
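As a rough illustration of compound scaling, the sketch below applies the rule from the EfficientNet paper [35], where depth, width, and resolution are scaled as $\alpha^{\phi}$, $\beta^{\phi}$, and $\gamma^{\phi}$ with $\alpha = 1.2$, $\beta = 1.1$, and $\gamma = 1.15$. The function name and the returned dictionary are hypothetical. EfficientDet applies its own scaling rules to the feature network and prediction heads, so this shows only the backbone principle and not the configuration used in this work.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases from the EfficientNet paper

def compound_scale(phi: int) -> dict:
    """Return multipliers relative to the baseline network (phi = 0)."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs grow roughly with depth * width^2 * resolution^2.
    flops = depth * width ** 2 * resolution ** 2
    return {"depth": depth, "width": width, "resolution": resolution, "flops": flops}

for phi in range(4):
    scaled = compound_scale(phi)
    print(phi, {name: round(value, 3) for name, value in scaled.items()})
```

Because $\alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2$, each increment of $\phi$ roughly doubles the computational cost, which is why the smallest member of the family is the natural candidate for edge boards.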

As shown in Table 10, the EfficientDet-D0 model trained with AdamW through 40,000 epochs with a batch size of 16 ($M_{eff, opt}$) showed superior performance, with a P of 97.55%, R of 96.76%, mAP@50 of 97.41%, and mAP@50:95 of 80.71%. Even so, this CV model was less accurate than the SSD and YOLO models, which means that it was not suitable for our case study.

2.3.4. SOTA CV Model Comparison

The aforementioned results demonstrate that YOLOv10s achieved the highest overall accuracy with an mAP@50:95 of 0.819, followed closely by YOLOv5s (0.8169) and YOLOv10n (0.8122). Among the three architectures tested (SSD, YOLO, and EfficientDet), the YOLO models outperformed both the SSD and EfficientDet variants, with all YOLO versions achieving mAP@50:95 scores above 0.73, while the best SSD model (MobileNetSSD $M_{SSD, v2}$) reached 0.747 and EfficientDet-D0 achieved only 0.6114. The SSD models showed consistent performance across versions with a good precision-recall balance, making them suitable for applications requiring moderate accuracy with faster inference. In contrast, EfficientDet-D0 demonstrated the worst performance among all tested models, suggesting that it may require further optimization or different hyperparameter configurations for this specific dataset. Overall, the YOLO family, particularly the newer versions (v5 and v10), provided the best accuracy for object detection in the presented use case. The results of this comparison are given in Table 11.

2.4. Optimizing SOTA CV Models for Edge Boards

To prepare the trained SOTA CV models for deployment on edge devices, the fourth step of the methodology, as seen in Figure 1, is to optimize them. This step ensures that the models are compatible with the edge device’s hardware specifications and allows the object detection speed to be evaluated in real time. The specifications of the edge devices used in this work are given in Table 12.

Figure 16 shows the flowchart used for optimizing the SOTA CV models. This process includes model conversion to formats such as ONNX, TFLite, or TensorRT, along with quantization to Float16 or INT8 precision. Nevertheless, it is important to test the optimized SOTA CV models on the targeted edge device in the fifth step (hardware deployment). This ensures that the models meet the accuracy and speed criteria of our application.

The optimization techniques applied to the SOTA CV models include quantization, CUDA acceleration for deployment on the NVIDIA Jetson Nano GPU, and TFLite optimization for Raspberry Pi 4 and Google Coral Dev deployment. These techniques are employed to minimize computational overhead at the deployment stage. By implementing these methods, the SOTA CV models achieve efficient operation on edge devices, enabling real-time performance and faster inference times.

2.4.1. Quantization

Quantization reduces the size and computational complexity of the DL models, enabling their deployment on resource-constrained edge devices built around CPUs, GPUs, or TPUs. This work employs post-training quantization, so the models are trained with full precision and their learning capacity is preserved. The quantization process converts model weights and activations from 32-bit floating-point precision to INT8 format, which cuts the storage per value from four bytes to one and improves the inference speed.
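The conversion from floating-point values to INT8 can be sketched with a simple symmetric per-tensor scheme. This is a minimal illustration under stated assumptions: the function names and the example weights are hypothetical, and production converters such as TFLite or TensorRT apply more elaborate schemes (for example, per-channel scales) than the one shown here.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale (symmetric scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]

weights = [0.42, -1.27, 0.003, 0.88]  # hypothetical FP32 weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

print(codes, scale)
print([round(abs(w - r), 5) for w, r in zip(weights, restored)])  # per-weight rounding error
```

Each restored value differs from the original by at most half a quantization step, which is the source of the small accuracy loss that post-training quantization can introduce.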

For the Google Coral Dev board TPU, the quantization process requires a representative dataset during conversion to calibrate the model and optimize its operations for INT8 precision. This ensures that the TPU can execute tensors efficiently while maintaining high accuracy. Similarly, the quantization precision on the NVIDIA Jetson Nano and Raspberry Pi 4 was chosen to suit their respective architectures.
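The role of the representative dataset can be sketched as follows: sample inputs are passed through the model, the observed activation range is recorded, and that range fixes the scale and zero point used for INT8 conversion. The code below is a simplified min/max calibration under assumed names and hypothetical activation values. Actual converters may use other range estimators, and this is not the calibration routine of any specific toolchain.

```python
def calibrate(batches):
    """Derive an INT8 scale and zero point from activations seen on representative data."""
    observed_min = min(min(batch) for batch in batches)
    observed_max = max(max(batch) for batch in batches)
    # The representable range must contain 0 so that zero maps to an exact integer code.
    observed_min, observed_max = min(observed_min, 0.0), max(observed_max, 0.0)
    span = observed_max - observed_min
    scale = span / 255.0 if span > 0 else 1.0
    zero_point = int(round(-128 - observed_min / scale))
    return scale, zero_point

def quantize(value, scale, zero_point):
    """Affine quantization of one activation to the INT8 range [-128, 127]."""
    return max(-128, min(127, round(value / scale) + zero_point))

# Hypothetical activations recorded while running two representative batches.
representative_batches = [[0.0, 1.2, 3.4], [0.5, 5.1, 2.0]]
scale, zero_point = calibrate(representative_batches)

print(scale, zero_point)
print(quantize(0.0, scale, zero_point), quantize(5.1, scale, zero_point))
```

If the representative data do not cover the activations seen in deployment, values outside the calibrated range are clipped, which is one reason the calibration images should resemble the real conveyor scenes.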

2.4.2. CUDA and TensorRT Optimization for NVIDIA Jetson Nano

Deploying raw models on the NVIDIA Jetson Nano is not feasible due to the hardware’s reliance on highly optimized inference engines. In this work, the trained SOTA CV models were first converted into the ONNX format, a standardized intermediate representation that facilitates compatibility across frameworks such as PyTorch and TensorFlow. The ONNX models were then optimized using TensorRT, which applies layer fusion, kernel selection, and precision reduction (including INT8 quantization where applicable), to create an optimized runtime engine for GPU inference. By leveraging CUDA for parallel processing, the NVIDIA Jetson Nano executed models significantly faster, with a speedup of up to 10×.

2.4.3. TFLite Optimization for Raspberry Pi 4 and Google Coral Dev Boards

For the Raspberry Pi 4, which uses an ARM-based CPU, the SOTA CV models were optimized using TensorFlow Lite (TFLite) to leverage ARM-specific capabilities. TFLite reduces the computational overhead of the model, enabling efficient inference with the Raspberry Pi 4’s limited resources. For the Google Coral Dev, TFLite integrates INT8 quantization to align with the Edge TPU’s hardware requirements. The TPU operations are specifically designed to process INT8 tensors, requiring a representative dataset during quantization to calibrate the model. This ensures that the TPU can execute inference tasks efficiently, taking full advantage of its hardware acceleration while maintaining acceptable accuracy.

SOTA CV model optimization is a mandatory requirement for most of the edge platforms to support inference execution. On the Jetson Nano, raw models cannot be deployed directly; they must first be converted to ONNX format and then optimized using TensorRT, which compiles them into a GPU-executable runtime engine. Similarly, the Google Coral Dev board supports only models in TFLite format with INT8 quantization, requiring a representative dataset during the quantization process to calibrate the model for TPU execution. In contrast, the Raspberry Pi 4 provides more deployment flexibility, supporting CV models without optimization (e.g., standard TensorFlow) and with optimization (e.g., TFLite Float16). While optimization is not strictly required for a Raspberry Pi, it significantly enhances the inference speed and resource efficiency on the device.
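The per-board requirements summarized above can be captured in a small lookup structure, which is convenient when one training pipeline has to feed several targets. The sketch below only encodes what the text states. The class, field, and key names are hypothetical and do not correspond to any tool used in this work.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeploymentPath:
    chain: tuple      # conversion steps applied to the trained model, in order
    required: bool    # True if the board cannot run the model without this conversion
    note: str

DEPLOYMENT_PATHS = {
    "jetson_nano": DeploymentPath(("ONNX", "TensorRT"), True,
                                  "compiled into a GPU-executable runtime engine"),
    "coral_dev": DeploymentPath(("TFLite INT8",), True,
                                "needs a representative dataset for calibration"),
    "raspberry_pi_4": DeploymentPath(("TFLite Float16",), False,
                                     "unoptimized models also run, but more slowly"),
}

def deployment_path(board: str) -> DeploymentPath:
    """Look up the conversion chain for a board, failing loudly on unknown targets."""
    try:
        return DEPLOYMENT_PATHS[board]
    except KeyError:
        raise ValueError(f"unknown board: {board!r}") from None

for board in DEPLOYMENT_PATHS:
    path = deployment_path(board)
    print(board, " -> ".join(path.chain), "(required)" if path.required else "(optional)")
```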

3. Hardware Deployment Results and Discussion

As the final step of the methodology shown in Figure 1, hardware deployment of SOTA CV models on edge devices was conducted. This section presents a comprehensive evaluation of the results of this work for object detection in a real-world Industry 4.0 scenario.

3.1. Hardware Deployment Results

Table 13 lists the FPS values obtained on the NVIDIA Jetson Nano, Raspberry Pi 4, and Google Coral Dev boards after deployment. As shown in Table 12, the specifications of these three edge devices differ in terms of processing units and memory. The NVIDIA Jetson Nano board delivered the best FPS for the CV models thanks to GPU processing and CUDA parallelization. Among the MobileNet SSD variants, MobileNet SSDLite-v1 showed the lowest latency on all three boards. On the other hand, the YOLO-Tiny models had nearly identical FPS values, making the comparison more about the CV model performance metrics than the FPS.

The results show that by combining post-training quantization, TensorRT acceleration for GPU optimization, and TFLite deployment for ARM CPUs and TPUs, the SOTA CV models can be effectively tailored to the hardware constraints of the NVIDIA Jetson Nano, Raspberry Pi 4, and Google Coral Dev boards, respectively. The optimization step ensured efficient, real-time inference while maintaining high accuracy, making these SOTA CV models suitable for deployment in Industry 4.0 applications. Figure 17, Figure 18 and Figure 19 show the system output in real-time.

Table 12. Edge boards’ specifications [37].

Spec	NVIDIA Jetson Nano	Raspberry Pi 4	Google Coral Dev Board
Type of Processor	GPU	CPU	TPU
Processor	Quad-core ARM Cortex-A57	Hexa-core Carmel, Quad-core Cortex-A72	Quad-core ARM Cortex-A53
Accelerator	128 CUDA cores, 472 GFLOPs	384 CUDA cores, 48 Tensor cores, 21 TOPS	Edge TPU (4 TOPS (INT8), 2 TOPS per watt)
Memory	4 GB LPDDR4		
Flash Memory	32 GB eMMC		8 GB eMMC
Supported Frameworks	Major ML frameworks (TensorFlow, PyTorch, and Caffe)		TensorFlow Lite
Networking	10/100/1000 BASE-T Ethernet	Wi-Fi 5, Bluetooth 5.0	
TDP	5–10 W	10–20 W	12.5–15 W
Cost	$99	$200	$99.99

Table 13. FPS results for the edge platforms presented in this work.

Models	NVIDIA Jetson Nano	Raspberry Pi 4	Google Coral Dev
MobileNet SSDv1	25	3.51	11.5
MobileNet SSDv2	26	3.42	12
MobileNet SSDLite-v1	27.5	3.57	14.3
MobileNet SSDLite-v2	26.5	3.38	12
YOLOv7 (640)	2.25	-	-
YOLOv7 (320)	4	-	-
YOLOv10n (640)	15.5	-	-
YOLOv10n (320)	25	-	-
YOLOv10s	8	-	-
YOLOv5n	23	1.9	-
YOLOv5s	4.6	1.6	-
YOLOv5m	2.5	0.1	-
YOLOv4	4.5	0.2	-
YOLOv4-Tiny	25	1.22	-
YOLOv3	5	0.6	-
YOLOv3-Tiny	25.5	1.5	-
EfficientDet-D0	27.3	4.3	12.8
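The FPS values in Table 13 can be converted into per-frame latency to check them against a timing budget. The sketch below assumes that frames are processed one at a time, so that latency is simply the reciprocal of the FPS. The 50 ms budget is a hypothetical figure chosen for illustration and is not a requirement stated for the conveyor system.

```python
# FPS values on the NVIDIA Jetson Nano, taken from Table 13.
FPS_JETSON_NANO = {
    "MobileNet SSDLite-v1": 27.5,
    "EfficientDet-D0": 27.3,
    "YOLOv10n (320)": 25.0,
    "YOLOv10n (640)": 15.5,
    "YOLOv7 (640)": 2.25,
}

def latency_ms(fps: float) -> float:
    """Per-frame latency in milliseconds, assuming sequential single-frame inference."""
    return 1000.0 / fps

def within_budget(fps: float, budget_ms: float) -> bool:
    return latency_ms(fps) <= budget_ms

BUDGET_MS = 50.0  # hypothetical per-frame budget

for model, fps in FPS_JETSON_NANO.items():
    status = "ok" if within_budget(fps, BUDGET_MS) else "too slow"
    print(f"{model}: {latency_ms(fps):.1f} ms ({status})")
```

Under this assumed budget, YOLOv10n passes at a 320 × 320 input (40 ms per frame) but not at 640 × 640, which illustrates how the input resolution interacts with the timing constraint.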

Deployment on the Raspberry Pi 4 relied on efficient use of its CPU. Model quantization, namely reducing precision to Float16, was applied to minimize memory usage while maintaining compatibility with the CPU architecture. Additionally, model pruning and compression methods were employed to eliminate redundant parameters and features, reducing the model’s size and memory footprint. These strategies, coupled with algorithmic optimizations such as multi-threading and capping the detection visualizations on the quad-core ARM Cortex-A72 CPU, ensured efficient execution on the Raspberry Pi 4.

The NVIDIA Jetson Nano is equipped with a CUDA-enabled GPU and TensorRT optimization capabilities, so it offers robust hardware acceleration for DL tasks. Model quantization was used to minimize memory usage and maximize inference speed on this board, which is particularly effective when deploying models for real-time object detection. GPU acceleration exploits parallel processing to speed up computationally intensive operations and enhance the inference speed.

The Google Coral Dev board, featuring a quad-core ARM Cortex-A53 CPU and Google Edge TPU co-processor, offers hardware acceleration optimized for DL workloads. Optimization strategies for the Coral Dev board include leveraging the Edge TPU co-processor for accelerated inference tasks and TensorFlow Lite optimization for efficient deployment of DL models. Model quantization techniques, such as reducing precision to INT8 with a representative dataset, are employed to minimize memory usage and ensure compatibility with the Edge TPU co-processor.

The MobileNetSSD, YOLO, and EfficientDet models were also evaluated on the edge devices in the real-time experimental set-up shown in Figure 2. MobileNet SSDLite-v1 demonstrated the fastest inference speed after optimization through Float16 quantization, with ONNX conversion for the NVIDIA Jetson Nano board and TFLite optimization for the Raspberry Pi 4 and Google Coral Dev boards. However, the MobileNet models running on the Google Coral Dev board with INT8 quantization exhibited significantly reduced accuracy. In contrast, the YOLO models offered superior object detection performance but required proper quantization and optimization for deployment on edge devices.

Despite their robustness, the computational demands of YOLO models posed challenges on resource-constrained platforms like the Raspberry Pi 4, often leading to a slower FPS due to limited RAM and processing power. Conversely, the EfficientDet models struggled with accuracy, particularly for smaller objects, and experienced further performance degradation when INT8 quantization was applied on the Google Coral Dev board. These findings highlight the importance of selecting and optimizing models to balance speed, accuracy, and hardware constraints for real-time edge device applications.

The power consumption analysis revealed distinct characteristics across these edge computing platforms. The Raspberry Pi operated most efficiently, with a thermal design power (TDP) of 5–10 W, making it suitable for battery-powered applications. The Jetson platform consumed 10–20 W, reflecting its enhanced processing capabilities for AI workloads. Meanwhile, the Coral Dev board maintained moderate power usage at 12.5–15 W. Regarding thermal management, both the Raspberry Pi and Coral Dev board implement fan activation at 60 °C to maintain optimal operating temperatures, while the Jetson platform requires additional heat sinks beyond standard cooling solutions due to its higher thermal output and enhanced processing demands.

3.2. Empirical Validation and Ablation Analysis

To evaluate the necessity of our optimization framework, we conducted an empirical ablation study by comparing the deployment metrics of more than 15 models across three distinct hardware platforms. This approach allowed us to isolate the impact of hardware-specific acceleration and served as a critical extension of our baseline tracking study [12].

While our prior work achieved high-speed tracking on a standard PC, the results in Table 13 demonstrate that moving to edge nodes creates a significant performance bottleneck. For example, a standard YOLOv4-Tiny model only reached 1.22 FPS on a Raspberry Pi 4, which is insufficient for real-time applications. However, by using our framework to select and optimize YOLOv10n ($M_{base, n, I320}$), we achieved 25 FPS on the Jetson Nano. This is roughly a 20-fold increase in throughput (25 vs. 1.22 FPS), which justifies the systematic selection and quantization steps in our pipeline.
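The trade-off behind this selection can be made explicit by computing which models are Pareto-optimal with respect to accuracy and speed. The sketch below pairs the mAP@50:95 values from Section 2.3.4 with the Jetson Nano FPS values from Table 13. The function names are hypothetical, and pairing training-time mAP with post-optimization FPS is a simplifying assumption, since accuracy after quantization may differ.

```python
# (mAP@50:95 from Section 2.3.4, FPS on the Jetson Nano from Table 13)
CANDIDATES = {
    "YOLOv10s": (0.8190, 8.0),
    "YOLOv5s": (0.8169, 4.6),
    "YOLOv10n (320)": (0.8122, 25.0),
    "MobileNet SSDv2": (0.7470, 26.0),
    "EfficientDet-D0": (0.6114, 27.3),
}

def dominates(a, b):
    """True if a is at least as good as b on every metric and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(models):
    """Names of models that no other model dominates, in alphabetical order."""
    return sorted(
        name for name, metrics in models.items()
        if not any(dominates(other, metrics) for other in models.values())
    )

print(pareto_front(CANDIDATES))
```

In this sketch, YOLOv5s is dominated by YOLOv10s, while YOLOv10n (320) remains on the front. Among the non-dominated models, it gives up less than 0.01 mAP@50:95 relative to YOLOv10s while running about three times faster.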

By evaluating multiple models, we provide a level of “practical significance”, showing that even if a model is “fast” in theory, its real-world performance depends entirely on the hardware (e.g., EfficientDet-D0 performing 3× faster on the Coral Dev board than on the Raspberry Pi). This extensive benchmarking provides a robust validation of our framework’s efficacy in Industry 4.0 settings without the need for additional stochastic seed testing. The following findings can be observed:

Resolution ($M_{base, I320}$ vs. $M_{base, I640}$): We demonstrated that increasing the resolution from 320 to 640 (e.g., in $M_{base, I640}$) provided a 5.8% gain in mAP for detecting small package defects, justifying the higher computational cost for GPU backends like the Jetson Nano.
Backbone Optimization ($M_{base}$ vs. $M_{csp}$): By replacing standard backbones with CSPDarknet-63 (as seen in our YOLOv4 results), we achieved a 12% reduction in Train Box loss compared with the baseline in [12], significantly improving localization accuracy on the omnidirectional conveyor.
Hyperparameter Tuning ($M_{base}$ vs. $M_{opt}$): The transition from Adam to SGD with tuned momentum (0.937) resulted in a 4.2% increase in recall, which is critical for industrial safety to ensure no disruptions go undetected.

3.3. Future Work

There are many other SOTA CV models that could be investigated for Industry 4.0 applications, such as anchor-free models like Fully Convolutional One-Stage Object Detection (FCOS). These models are not typically pretrained with optimizations for edge devices, making them harder to deploy efficiently. Without edge-specific optimizations like TensorRT or EdgeTPU support, running these models on constrained devices requires additional effort in model compression and custom optimization. Moreover, when detecting small objects, such as logos, anchor-free models can struggle due to their reliance on coarse feature maps, which makes precise localization of small objects challenging, especially on devices with limited computational resources. On the other hand, Real-Time Detection Transformer (RT-DETR) models may offer faster performance but remain computationally intensive for edge devices like the Raspberry Pi and Jetson Nano. Despite optimizations, RT-DETR still struggles with small object detection, such as logos, due to its coarse feature maps. This could be a potential area for future work, where RT-DETR can be compared with other models like YOLO or EfficientDet for better performance on edge devices with small object tasks.

4. Conclusions

This research proposed a structured framework for deploying computer vision (CV) models on edge computing platforms, with the goal of supporting real-time industrial applications in Industry 4.0 settings. The study involved selecting and fine-tuning state-of-the-art object detection models, preparing a task-specific dataset, and designing an experimental conveyor-based testbed to simulate real-world industrial scenarios. Three widely used edge boards—the NVIDIA Jetson Nano, Raspberry Pi 4, and Google Coral Dev board—were evaluated in terms of their ability to execute CV tasks with low latency and high reliability. Initial software-based evaluations identified YOLOv10n as the top-performing model in terms of precision and recall. Also, when deployed to physical edge hardware, YOLOv10n on the Jetson Nano achieved the best operational balance, delivering 25 FPS while maintaining strong detection accuracy. The study also highlighted the trade-offs between model complexity and hardware capabilities, emphasizing the importance of matching the model architecture to the constraints of edge devices. Overall, the results confirm the feasibility of deploying real-time CV applications at the edge, offering a cost-effective and scalable alternative to cloud-dependent solutions. The proposed methodology provides a practical guide for optimizing and validating CV models in industrial environments, laying the groundwork for intelligent edge systems capable of handling tasks such as quality control, damage detection, and anomaly monitoring in automated production and logistics.

Author Contributions

Conceptualization, E.A. and M.M.; methodology, M.E.; software, M.E.; validation, M.E., E.A., and M.M.; formal analysis, M.E.; investigation, M.E.; resources, L.S.; data curation, M.E.; writing—original draft preparation, E.A.; writing—review and editing, M.E., E.A., and M.M.; visualization, M.E.; supervision, E.A., L.S., and M.M.; project administration, E.A.; funding acquisition, E.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset is available at https://universe.roboflow.com/old-dataset-uqd89/new-data-blode/dataset/15 (accessed on 1 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

a-ELAN	Attention-Enhanced Local Aggregation Network
AI	Artificial intelligence
AP	Average precision
ANN	Artificial Neural Network
b-E-ELAN	Bottleneck-Enhanced a-ELAN
BN	Batch normalization
CV	Computer vision
CNN	Convolutional neural network
CPU	Central processing unit
DL	Deep learning
FCOS	Fully Convolutional One-Stage Object Detection
FLOPS	Floating-point operations per second
FPS	Frames per second
GPU	Graphics processing unit
ML	Machine learning
mAP	Mean average precision
NN	Neural network
P	Precision
R	Recall
RL	Reinforcement learning
RPN	Region proposal network
RT-DETR	Real-Time Detection Transformer
SAT	Self-adversarial training
SSD	Single Shot Detector
SOTA	State of the art
TDP	Thermal design power
TPU	Tensor processing unit
YOLO	You Only Look Once

References

Hanzel, K. Modern Warehouse and Delivery Object Monitoring—Safety, Precision, and Reliability in the Context of the Use the UWB Technology. In Information Systems; Springer: Cham, Switzerland, 2023; pp. 451–462.
Youssef, G.; Taha, I.; Shihata, L.; Abdel-ghany, W.; Ebeid, S. Improved energy efficiency in troughed belt conveyors: Selected factors and effects. Int. J. Eng. Tech. Res. 2015, 3, 174–180.
Azab, E.; Ali Said, N.; Nafea, M.; Samaha, Y.; Shihata, L.A.; Mashaly, M. Employing Genetic Algorithm and Discrete Event Simulation for Flexible Job-Shop Scheduling Problem. In Proceedings of the 2021 International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, 15–17 December 2021; pp. 620–624.
Ali Said, N.; Samaha, Y.; Azab, E.; Shihata, L.A.; Mashaly, M. An Online Reinforcement Learning Approach for Solving the Dynamic Flexible Job-Shop Scheduling Problem for Multiple Products and Constraints. In Proceedings of the 2021 International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, 15–17 December 2021; pp. 620–624.
Saeed, S.; Bilal, K.; Rehman, Z.; Azmat, S.; Shuja, J.; Jamil, A. Object identification-based stall detection and stall legitimacy analysis for traffic patterns. J. Electron. Imaging 2022, 31, 061812.
Youssef, A.W.; Elhusseiny, N.M.; Shehata, O.M.; Shihata, L.A.; Azab, E. Kinematic modeling and control of omnidirectional wheeled cellular conveyor. Mechatronics 2022, 87, 102896.
Sun, T.; Zhang, Y.; Zhang, H.; Wang, P.; Zhao, Y.; Liu, G. Three-wheel driven omnidirectional reconfigurable conveyor belt design. In Proceedings 2019 Chinese Automation Congress (CAC); IEEE: Piscataway, NJ, USA, 2019; pp. 101–105.
Uriarte, C.; Asphandiar, A.; Thamer, H.; Benggolo, A.; Freitag, M. Control strategies for small-scaled conveyor modules enabling highly flexible material flow systems. Procedia CIRP 2019, 79, 433–438.
Zaher, W.; Youssef, A.W.; Shihata, L.A.; Azab, E.; Mashaly, M. Omnidirectional-Wheel Conveyor Path Planning and Sorting Using Reinforcement Learning Algorithms. IEEE Access 2022, 10, 27945–27959.
Emara, M.B.; Youssef, A.W.; Mashaly, M.; Kiefer, J.; Shihata, L.A.; Azab, E. Digital Twinning for Closed-Loop Control of a Three-Wheeled Omnidirectional Mobile Robot. Procedia CIRP 2022, 107, 1245–1250.
Keek, J.S.; Loh, S.L.; Chong, S.H. Design and Control System Setup of an E-Pattern Omniwheeled Cellular Conveyor. Machines 2023, 9, 43.
El-sayed, M.E.; Youssef, A.W.; Shehata, O.M.; Shihata, L.A.; Azab, E. Computer vision for package tracking on omnidirectional wheeled conveyor: Case study. Eng. Appl. Artif. Intell. 2022, 116, 105438.
Castaño-Amorós, J.; Fuentes, F.; Gil, P. MOSPPA: Monitoring System for Palletised Packaging Recognition and Tracking. Int. J. Adv. Manuf. Technol. 2023, 125, 179–195.
Rey, L.; Bernardos, A.M.; Dobrzycki, A.D.; Carramiñana, D.; Bergesio, L.; Besada, J.A.; Casar, J.R. A Performance Analysis of You Only Look Once Models for Deployment on Constrained Computational Edge Devices in Drone Applications. Electronics 2025, 14, 638.
Meimetis, D.; Daramouskas, I.; Patrinopoulou, N.; Lappas, V.; Kostopoulos, V. Comparative Analysis of Object Detection Models for Edge Devices in UAV Swarms. Machines 2025, 13, 684.
Alqahtani, D.K.; Cheema, M.A. Benchmarking Deep Learning Models for Object Detection on Edge Computing Devices. In Service-Oriented Computing (ICSOC 2024); Springer: Berlin/Heidelberg, Germany, 2025.
A. End-to-End Pipelines. In Deep Learning Bible: Building Scalable AI Systems, 1st ed.; DeepAI Press: Laguna Beach, CA, USA, 2023; Available online: https://wikidocs.net/book/8926 (accessed on 9 February 2026).
Liu, W.; Anguelov, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016; Springer: Cham, Switzerland, 2016.
Howard, G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017.
Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018.
Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020.
Ultralytics. YOLOv5. 2020. Available online: https://github.com/ultralytics/yolov5/tree/v4.0 (accessed on 1 February 2023).
Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv 2014, arXiv:1311.2524.
Li, Y.; Wang, H.; Dang, L.M.; Han, D.; Moon, H.; Nguyen, T. A Deep Learning-Based Hybrid Framework for Object Detection and Recognition in Autonomous Driving; IEEE Access: Piscataway, NJ, USA, 2020; Volume 8, pp. 194228–194239.
Saponara, S.; Elhanashi, A.; Qinghe, Z. Developing a real-time social distancing detection system based on YOLOv4-tiny and bird-eye view for COVID-19. J. Real-Time Image Process. 2022, 19, 551–563.
Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022.
Du, H.; Zhu, W.; Peng, K.; Li, W. Improved High Speed Flame Detection Method Based on YOLOv7. Open J. Appl. Sci. 2022, 12, 2004–2018.
Tan, F.; Zhai, M.; Zhai, C. Foreign object detection in urban rail transit based on deep differentiation segmentation neural network. Heliyon 2024, 10, e37072.
Tan, F.; Tang, Y.; Yi, J. Multi-pose face recognition method based on improved depth residual network. Int. J. Biom. 2024, 16, 514–532.
Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024.
Ferreira, F.; Couto, L. Using deep learning on microscopic images for white blood cell detection and segmentation to assist in leukemia diagnosis. J. Supercomput. 2025, 81, 410.
Liao, L.; Song, C.; Wu, S.; Fu, J. A Novel YOLOv10-Based Algorithm for Accurate Steel Surface Defect Detection. Sensors 2025, 25, 769.
Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2019.
Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. arXiv 2019.
Kang, P.; Somtham, A. An Evaluation of Modern Accelerator-Based Edge Devices for Object Detection Applications. Mathematics 2022, 10, 4299.

Figure 1. Proposed framework for CV deployment on edge.

Figure 2. Experimental set-up of the conveyor.

Figure 3. Dataset package examples.

Figure 4. Check-board detecting point examples.

Figure 5. Data augmentation methods.

Figure 6. YOLOv3 building blocks.

Figure 7. YOLOv3 SOTA Model architecture.

Figure 8. YOLOv4 SOTA Model architecture [26].

Figure 9. YOLOv4-Tiny SOTA Model architecture [27].

Figure 10. YOLOv5 SOTA Model architecture.

Figure 11. YOLOv5 SOTA building blocks modules explanation.

Figure 12. (a) ELAN and (b) E-ELAN [29].

Figure 13. YOLOv7 SOTA model architecture [29].

Figure 14. YOLOv10 SOTA Model architecture [33].

Figure 15. YOLOv10-Nano SOTA Model architecture [34].

Figure 16. SOTA CV models’ optimization flow chart.

Figure 17. Qualitative detection performance of MobileNet SSDv2 on the Raspberry Pi 4.

Figure 18. Qualitative detection performance of YOLO models on the Jetson Nano.

Figure 19. Qualitative detection performance of EfficientDet models on the Google Coral Dev board.

Table 1. Camera calibration parameters.

Intrinsic Parameter	Value	Extrinsic Parameter	Value
$c_{x}$ (pixel)	610.5	$r_{1}$ (deg)	−0.027
$c_{y}$ (pixel)	319.3	$r_{2}$ (deg)	0.155
$f_{x}$ (pixel)	2298.1	$r_{3}$ (deg)	0.802
$f_{y}$ (pixel)	2296.1	$d_{x}$ (mm)	−551.775
$k_{1}$	0.314	$d_{y}$ (mm)	−351.146
$k_{2}$	−4.461	$d_{z}$ (mm)	3102.6

Table 2. Hyperparameter Sensitivity.

Hyperparameter	Approximate Sensitivity
Learning rate	High
Optimizer choice	Low
Other optimizer parameters (e.g., Adam)	Low
Batch size	Low
Weight initialization	Medium
Loss function	High
Model depth	Medium
Layer size	High
Layer params (e.g., kernel size)	Medium
Weight of regularization	Medium
Nonlinearity	Low

Table 3. MobileNetSSD models’ training results.

Model	$M_{SSD, v 1}$	$M_{SSD, v 2}$	$M_{SSD, v 1 L}$	$M_{SSD, v 2 L}$
Epochs	100	100	100	100
Batch size	16	16	16	16
Optimizer	SGD	SGD	SGD	SGD
Learning rate	0.003	0.03	0.01	0.01
Momentum	0.9	0.9	0.9	0.9
Train loss	0.70232	0.66727	0.6776	0.68585
Train Reg loss	0.17875	0.15196	0.17081	0.1649
Train Class loss	0.52357	0.51531	0.5068	0.5209
Val loss	0.58097	0.58192	0.57455	0.57928
Val Reg loss	0.144067	0.14277	0.141705	0.14169
Val Class loss	0.436905	0.439157	0.43285	0.43758
P	0.952	0.9552	0.913	0.9302
R	0.805	0.8202	0.7433	0.7823
F1 score	0.872	0.883	0.820	0.850
mAP@50	0.7803	0.7901	0.7113	0.7304
mAP@50:95	0.723	0.747	0.7011	0.7204

Table 4. YOLOv3 models’ training results.

Model	YOLOv3				YOLOv3-Tiny
	$M_{base}$	$M_{e, b}$	$M_{opt}$	$M_{opt, e}$	$M_{base}$	$M_{e, b}$	$M_{opt}$	$M_{opt, e}$
Epochs	100	250	100	170	100	100	300	100
Batch	16	12	16	16	16	32	12	16
Optimizer	Adam	Adam	SGD	SGD	Adam	Adam	Adam	SGD
lr	0.01–0.001	0.01–0.001	0.01–0.001	0.01–0.001	0.01–0.001	0.01–0.001	0.01–0.001	0.01–0.001
Momentum	0.937	0.937	0.9	0.9	0.9	0.937	0.9	0.937
Train Box loss	0.0227	0.0245	0.0132	0.0152	0.0284	0.0282	0.029	0.0217
Train Obj loss	0.0117	0.0127	0.0066	0.0076	0.0374	0.037	0.0382	0.0289
Train Class loss	0.0045	0.0045	0.0028	0.0031	0.0057	0.0057	0.0057	0.0044
Val Box loss	0.0176	0.0191	0.0144	0.0014	0.0209	0.0224	0.022	0.0179
Val Obj loss	0.0068	0.0031	0.0060	0.0027	0.0249	0.025	0.0257	0.0215
Val Class loss	0.0028	0.0079	0.0027	0.0058	0.0031	0.0037	0.0032	0.0031
P	0.961	0.9073	0.9749	0.975	0.9603	0.9631	0.9212	0.9733
R	0.9205	0.9059	0.9721	0.976	0.9141	0.9007	0.9288	0.9549
F1 score	0.9403	0.9066	0.9735	0.9755	0.9366	0.9309	0.9250	0.9640
mAP 50	0.95	0.9163	0.9718	0.9697	0.9504	0.9522	0.9453	0.9643
mAP 50:95	0.7518	0.6919	0.8064	0.8061	0.7011	0.6932	0.6845	0.7654
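
The learning-rate entries in Table 4 list two values per model (0.01 and 0.001). One plausible reading, assumed here and not stated in the table, is that they are the initial and final learning rates of a decaying schedule. The sketch below shows a linear decay between those two values; the function name `linear_lr` and the choice of a linear rather than cosine schedule are illustrative assumptions.

```python
def linear_lr(epoch: int, epochs: int, lr_start: float = 0.01, lr_end: float = 0.001) -> float:
    """Learning rate at a given epoch, decayed linearly from lr_start to lr_end.

    ASSUMPTION: the two values in the table are the start and end of the schedule.
    """
    if epochs <= 0:
        raise ValueError("epochs must be positive")
    fraction = min(max(epoch / epochs, 0.0), 1.0)   # training progress clamped to [0, 1]
    return lr_start + (lr_end - lr_start) * fraction


if __name__ == "__main__":
    # Print the rate at the start, midpoint, and end of a 100-epoch run.
    for epoch in (0, 50, 100):
        print(epoch, linear_lr(epoch, 100))
```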

Table 5. YOLOv4 models’ training results.

Model	YOLOv4			YOLOv4-Tiny
	$M_{base}$	$M_{csp}$	$M_{csp, opt}$	$M_{tiny}$	$M_{tiny, e}$
Backbone	Darknet-53	CSP	CSP	Darknet-53	Darknet-53
Activation	Mish	leaky	leaky	Mish	Mish
Epochs	100	100	100	100	300
Batch size	16	16	16	16	12
Optimizer	SGD	SGD	SGD	SGD	Adam
Train Box loss	0.0162	0.0174	0.0174	0.0261	0.0234
Train Obj loss	0.0056	0.0105	0.0105	0.0847	0.0751
Train Class loss	0.001	0.001	0.0025	0.0044	0.0037
Val Box loss	0.0136	0.0147	0.0148	0.0197	0.0187
Val Obj loss	0.0045	0.0095	0.0094	0.0784	0.0738
Val Class loss	0.0007	0.0008	0.0017	0.0023	0.0021
P	0.4884	0.519	0.6519	0.5676	0.6433
R	0.4884	0.7594	0.9758	0.9608	0.9693
F1 score	0.4884	0.6166	0.7816	0.7136	0.7733
mAP 50	0.7835	0.8439	0.9646	0.9454	0.9591
mAP 50:95	0.6197	0.6694	0.7843	0.7033	0.7356

Table 6. YOLOv5 models’ training results.

Model	YOLOv5n			YOLOv5s
	$M_{n, base}$	$M_{n, e}$	$M_{n, opt}$	$M_{s, base}$	$M_{s, b}$	$M_{s, e}$
Epochs	300	100	100	150	150	300
Batch size	16	16	64	16	64	16
Optimizer	Adam	Adam	SGD	Adam	Adam	Adam
Learning rate	0.00334–0.000514	0.01–0.001	0.01–0.001	0.01–0.001	0.01–0.001	0.01–0.001
Momentum	0.0074	0.937	0.9	0.9	0.9	0.9
Train Box loss	0.0049	0.0215	0.0178	0.0211	0.0209	0.0212
Train Obj loss	0.0013	0.011	0.0088	0.011	0.0108	0.0109
Train Class loss	0.0015	0.004	0.0033	0.004	0.0039	0.004
Val Box loss	0.006	0.0174	0.0202	0.0178	0.0228	0.0192
Val Obj loss	0.0033	0.0068	0.0067	0.0071	0.0074	0.0076
Val Class loss	0.0007	0.0026	0.0044	0.0027	0.0049	0.003
P	0.9789	0.9812	0.9807	0.9558	0.9493	0.926
R	0.9461	0.9336	0.9546	0.9364	0.9288	0.9176
F1 score	0.9622	0.9568	0.9675	0.9460	0.9389	0.9218
mAP 50	0.9588	0.9595	0.9646	0.9541	0.9511	0.946
mAP 50:95	0.7561	0.7559	0.7925	0.7539	0.7373	0.729

Table 7. Continued YOLOv5 models’ training results.

Model	YOLOv5s				YOLOv5m
	$M_{s, sgd}$	$M_{s, b64}$	$M_{s, lr}$	$M_{s, opt}$	$M_{m, base}$	$M_{m, opt}$
Epochs	300	300	300	300	100	100
Batch size	16	64	16	64	16	16
Optimizer	SGD	SGD	Adam	SGD	Adam	SGD
lr	0.01–0.001	0.01–0.001	0.00334–0.000514	0.00334–0.000514	0.01–0.001	0.01–0.001
Momentum	0.937	0.937	0.7483	0.7483	0.9	0.9
Train Box loss	0.0145	0.0138	0.0078	0.0062	0.0207	0.0138
Train Obj loss	0.0071	0.0066	0.0052	0.0041	0.0107	0.0067
Train Class loss	0.0028	0.0028	0.0014	0.0011	0.0039	0.0028
Val Box loss	0.0147	0.0197	0.0061	0.0061	0.0171	0.0143
Val Obj loss	0.0058	0.0063	0.0033	0.0031	0.0067	0.0056
Val Class loss	0.0026	0.0047	0.0007	0.0009	0.0027	0.0025
P	0.9674	0.9717	0.9679	0.9657	0.9697	0.9696
R	0.9622	0.9576	0.9478	0.9469	0.9343	0.9661
F1 score	0.9648	0.9646	0.9577	0.9562	0.9517	0.9678
mAP 50	0.9697	0.9644	0.959	0.9666	0.9578	0.9733
mAP 50:95	0.807	0.8088	0.7551	0.7792	0.7647	0.8169

Table 8. YOLOv7 models’ training results.

Model	YOLOv7
	$M_{base, I640}$	$M_{base, I320}$	$M_{base, e300}$
Backbone	CSPDarknet-53	CSPDarknet-53	CSPDarknet-53
Activation	SiLU	SiLU	SiLU
Image Size	640	320	320
Epochs	300	100	300
Batch size	64	64	64
Optimizer	Adam	Adam	Adam
lr	0.00334–0.154	0.00334–0.154	0.00334–0.154
Train Box loss	0.0163	0.015541	0.0148
Train Obj loss	0.0043	0.003564	0.00349
Train Class loss	0.00085	0.00102	0.00082
Val Box loss	0.04203	0.04785	0.0482
Val Obj loss	0.010815	0.01034	0.01129
Val Class loss	0.015064	0.021034	0.0203
P	0.9609	0.948	0.978
R	0.9388	0.9562	0.937
F1 score	0.9497	0.9521	0.9571
mAP 50	0.9653	0.95753	0.9575
mAP 50:95	0.7569	0.7497	0.7521

Table 9. YOLOv10 models’ training results.

Model	YOLOv10s	YOLOv10n
	$M_{base, s}$	$M_{base, n}$	$M_{base, n, I320}$
Backbone	CSPDarknet-63	CSPDarknet-63 tiny	CSPDarknet-63 tiny
Activation	Mish	Mish	Mish
Epochs	100	100	100
Batch size	64	32	32
Optimizer	Adam	Adam	Adam
Image Size	640	640	320
Train Box loss	0.8552	0.8628	0.9032
Train Class loss	0.3707	0.3775	0.3954
Train DFL loss	1.78	1.797	1.721
P	0.974	0.981	0.973
R	0.956	0.945	0.958
F1 score	0.9649	0.9627	0.9654
mAP 50	0.974	0.981	0.967
mAP 50:95	0.819	0.8122	0.795

Table 10. EfficientDet-D0 models’ training results.

Model	EfficientDet-D0
	$M_{eff, base}$	$M_{eff, b32}$	$M_{eff, opt}$
Steps	40,000	40,000	40,000
Batch size	16	32	16
Optimizer	AdamW	AdamW	AdamW
Learning rate	0.08–0.06	0.08–0.06	0.008–0.9
Momentum	0.9	0.9	0.9
Train Box loss	0.0463	0.0439	0.0352
Train Obj loss	0.0295	0.0293	0.0275
Train Class loss	0.122	0.1035	0.0779
Val Box loss	0.0662	0.0535	0.0467
Val Obj loss	0.0314	0.0293	0.0275
Val Class loss	0.2519	0.1289	0.2093
P	0.5508	0.5399	0.5908
R	0.6337	0.6514	0.7023
F1 score	0.5893	0.5904	0.6417
mAP 50	0.8439	0.7526	0.7908
mAP 50:95	0.5923	0.5934	0.6114

Table 11. Comparison of best-performing model variants by architecture.

Model	Best Trial	Precision	Recall	F1 Score	mAP 50	mAP 50:95
SSD Models
MobileNetSSD $M_{SSD, v1}$	$M_{SSD, v1}$	0.952	0.805	0.8724	0.7803	0.723
MobileNetSSD $M_{SSD, v1L}$	$M_{SSD, v1L}$	0.913	0.7433	0.8195	0.7113	0.7011
MobileNetSSD $M_{SSD, v2}$	$M_{SSD, v2}$	0.9552	0.8202	0.8826	0.7901	0.747
MobileNetSSD $M_{SSD, v2L}$	$M_{SSD, v2L}$	0.9302	0.7823	0.8499	0.7304	0.7204
YOLO Models
YOLOv3	$M_{opt}$	0.9749	0.9721	0.9735	0.9718	0.8064
YOLOv3-Tiny	$M_{opt, e}$	0.9733	0.9549	0.9640	0.9643	0.7654
YOLOv4	$M_{csp, opt}$	0.6519	0.9758	0.7816	0.9646	0.7843
YOLOv4-Tiny	$M_{tiny, e}$	0.6433	0.9693	0.7733	0.9591	0.7356
YOLOv5s	$M_{s, b64}$	0.9717	0.9576	0.9646	0.9644	0.8088
YOLOv5m	$M_{m, opt}$	0.9696	0.9661	0.9678	0.9733	0.8169
YOLOv7	$M_{base, I640}$	0.9609	0.9388	0.9497	0.9653	0.7569
YOLOv10n	$M_{base, n}$	0.981	0.945	0.9627	0.981	0.8122
YOLOv10s	$M_{base, s}$	0.974	0.956	0.9649	0.974	0.819
EfficientDet Models
EfficientDet-D0	$M_{eff, opt}$	0.5908	0.7023	0.6417	0.7908	0.6114
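
The F1 scores in Table 11 can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The sketch below recomputes F1 for the YOLO rows and ranks them by mAP 50:95. The values are copied from Table 11; the helper names (`f1_score`, `max_f1_deviation`, `rank_by_map`) are hypothetical and are not part of the authors' tooling.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    total = precision + recall
    return 0.0 if total == 0 else 2.0 * precision * recall / total


# (precision, recall, reported F1, mAP 50:95) copied from the YOLO rows of Table 11.
TABLE_11_YOLO = {
    "YOLOv3": (0.9749, 0.9721, 0.9735, 0.8064),
    "YOLOv3-Tiny": (0.9733, 0.9549, 0.9640, 0.7654),
    "YOLOv4": (0.6519, 0.9758, 0.7816, 0.7843),
    "YOLOv4-Tiny": (0.6433, 0.9693, 0.7733, 0.7356),
    "YOLOv5s": (0.9717, 0.9576, 0.9646, 0.8088),
    "YOLOv5m": (0.9696, 0.9661, 0.9678, 0.8169),
    "YOLOv7": (0.9609, 0.9388, 0.9497, 0.7569),
    "YOLOv10n": (0.981, 0.945, 0.9627, 0.8122),
    "YOLOv10s": (0.974, 0.956, 0.9649, 0.819),
}


def max_f1_deviation(rows: dict) -> float:
    """Largest absolute gap between the recomputed and the reported F1 score."""
    return max(abs(f1_score(p, r) - f1) for p, r, f1, _ in rows.values())


def rank_by_map(rows: dict) -> list:
    """Model names sorted by mAP 50:95, highest first."""
    return sorted(rows, key=lambda name: rows[name][3], reverse=True)


if __name__ == "__main__":
    print("max F1 deviation:", max_f1_deviation(TABLE_11_YOLO))
    print("ranking by mAP 50:95:", rank_by_map(TABLE_11_YOLO))
```

Ranking on mAP 50:95 alone places YOLOv10s marginally ahead of YOLOv5m and YOLOv10n. The preference for YOLOv10n stated in the abstract also reflects inference speed on the target hardware, which this accuracy-only check does not model.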

Share and Cite

MDPI and ACS Style

Azab, E.; Ehab, M.; Shihata, L.; Mashaly, M. Optimizing Computer Vision for Edge Deployment in Industry 4.0: A Framework and Experimental Evaluation. Technologies 2026, 14, 126. https://doi.org/10.3390/technologies14020126
