1. Introduction
In recent years, video-based object detection technology has played a crucial role in various fields, including smart cities [1,2,3], smart homes [4,5], intelligent transportation systems (ITS), autonomous driving, industrial automation, smart agriculture, and security systems. In particular, IoT (Internet of Things) and AIoT (Artificial Intelligence of Things) devices such as CCTV (Closed Circuit Television) and smart sensors continuously collect video data and analyze it to detect specific objects [6,7,8]. However, these IoT devices typically operate on low-power embedded systems, which have limited computational resources for running deep learning-based object detection models [9,10,11]. Although cloud computing is widely used for video data processing, this approach suffers from network bandwidth limitations and increased latency, making real-time object detection challenging. To overcome these issues, edge computing-based distributed processing has recently gained significant attention [12,13,14,15].
Despite the growing importance of edge computing, research on deploying deep learning-based object detection (OD) models directly on edge devices remains relatively limited. This gap constrains the understanding of how such models can be effectively utilized in resource-constrained environments while ensuring both high accuracy and real-time performance.
To address these issues, we propose an innovative collaborative edge computing framework in which multiple edge devices cooperate for video-based object detection. The novelty of this approach lies in its integration of state-of-the-art YOLOv11 detection with a custom communication interface and distributed processing logic specifically designed for resource-constrained IoT environments. The system employs Raspberry Pi-based devices, each capturing video from its own source and dynamically distributing the workload to enhance real-time performance while preserving model accuracy.
Edge computing enables data processing at nodes closer to the source, reducing network bottlenecks and enhancing real-time processing capabilities. In the proposed framework, the edge computers are implemented as Raspberry Pi-based embedded systems designed to perform human detection substantially faster even in resource-constrained IoT environments. Each edge device collects video data from its camera or input source and collaborates with other edge nodes to distribute the processing workload efficiently.
This study employs YOLOv11 (You Only Look Once version 11), the latest iteration of the YOLO [16,17] object detection model, to detect objects in video streams. The YOLO series [18] is a CNN-based [11,19,20,21] real-time object detection framework that has demonstrated significant improvements in both accuracy and speed across successive versions. YOLOv11 integrates advanced Vision Transformers and an enhanced backbone network, making it superior to its predecessors in terms of detection precision and computational efficiency. To effectively run the YOLOv11 model in an edge computing environment, we propose a communication interface and distributed processing logic for collaborative object detection among multiple edge devices.
Our proposed system aims to analyze how the number of edge computing nodes affects human detection performance and the processing speed of the input video. Rather than relying on traditional AI optimization techniques such as quantization, pruning, or model compression, this study employs a single base model in conjunction with distributed edge computing strategies to achieve efficient object detection. Specifically, we measure the FPS (Frames Per Second) of the processed video stream as the number of edge computers increases, evaluating whether collaborative processing improves playback smoothness. Furthermore, we examine how effectively the workload is distributed when multiple edge devices process video data simultaneously. Through this research, we aim to demonstrate that a distributed object detection system utilizing multiple edge devices can significantly improve real-time processing efficiency.
There have been several studies related to object detection using Raspberry Pi. Lu et al. [22] achieved 15 FPS at 2.8 W power consumption in their scenarios. In comparison, the standalone operation in this study achieved approximately 2.22 FPS in video display performance when capturing video at a 1280 × 720 resolution at 30 FPS; by applying the proposed approach with up to ten edge clients, the performance improved to 26.44 FPS.
The findings of this study make significant and practical contributions to the advancement of lightweight, edge-based human detection systems optimized for smart IoT environments. Furthermore, the proposed collaborative edge computing framework, along with its efficient communication protocol, establishes a foundational technology for future edge-driven video analysis systems, supporting scalable, real-time, and resource-efficient deployment across a wide range of next-generation applications.
The remainder of this paper is organized as follows. Section 2 reviews related work relevant to the implementation of this study. Section 3 presents the proposed method and describes its design and implementation to validate the approach. In Section 4, we conduct experiments and analyze the results. Finally, Section 5 discusses the effectiveness of the proposed method based on the experimental findings and concludes the paper with suggestions for future research directions.
2. Related Works
2.1. Edge Computing
Edge computing is a paradigm that enables data processing closer to the source of data generation, reducing latency and alleviating network congestion. Unlike traditional cloud computing, which relies on centralized data centers, edge computing distributes computational tasks across multiple local edge nodes, enabling real-time processing and reducing dependence on cloud infrastructure. This approach is particularly beneficial for resource-constrained IoT devices, such as CCTV cameras, smart sensors, and embedded systems, which continuously generate large volumes of video data and require real-time analysis [12,13,14,15].
The application of edge computing in video-based object detection has gained significant attention due to several key advantages:
Reduced latency: Processing video data at the edge minimizes the time required for data transmission to the cloud, leading to faster inference times.
Bandwidth optimization: By performing initial data filtering and processing at the edge, only essential information needs to be transmitted to the cloud, reducing bandwidth consumption.
Privacy and security: Since sensitive video data can be processed locally without being transmitted over the network, edge computing enhances data security and privacy.
Scalability: Multiple edge devices can collaborate to distribute computational workloads, improving the overall efficiency of object detection tasks.
Recent studies have explored various methods for implementing object detection within edge computing frameworks [23,24,25]. Some notable approaches include the following:
Lightweight object detection models for edge devices: Several researchers have focused on optimizing deep learning models for edge computing environments. For instance, MobileNet [26,27], Tiny-YOLO [28], and EfficientDet [29] have been widely adopted due to their reduced computational complexity while maintaining high accuracy in object detection tasks.
Collaborative edge computing: A growing body of research has examined distributed inference frameworks, where multiple edge devices share computational tasks to improve detection efficiency.
Edge to cloud synergy: While edge computing enhances local processing capabilities, hybrid approaches integrating edge and cloud computing have also been explored. These frameworks offload complex computations to the cloud while retaining real-time processing capabilities at the edge.
Despite its advantages, implementing deep learning-based object detection in edge computing environments presents several challenges:
Computational limitations: Most edge devices, including Raspberry Pi and Jetson Nano, have limited processing power, which constrains their ability to execute large-scale deep learning models efficiently.
Network synchronization: In multi-edge collaborative systems, maintaining synchronized inference across distributed nodes is critical to prevent inconsistencies in object detection results.
Load balancing: Dynamic allocation of computational workloads among edge devices is essential to ensure optimal performance while avoiding bottlenecks.
Building upon existing research, this study introduces a multi-edge collaborative framework designed to enhance YOLOv11-based human detection performance for real-time video processing. The proposed system employs Raspberry Pi 5-based edge computers to process video streams collaboratively, leveraging a custom communication interface for efficient task distribution. By increasing the number of edge nodes, this study evaluates how human detection processing speed (FPS) can be improved while maintaining video quality. This research provides practical insights into the scalability and effectiveness of collaborative edge computing for real-time object recognition in IoT environments.
2.2. YOLOv11
YOLO (You Only Look Once) [16,17,18] is a state-of-the-art real-time object detection model that has evolved through multiple versions, continuously improving detection speed and accuracy. The latest iteration, YOLOv11, builds upon its predecessors by integrating enhanced feature extraction techniques, improved backbone architectures, and advanced optimization strategies to achieve superior performance in object detection tasks.
YOLOv11 introduces several key advancements over previous versions:
Enhanced backbone network: The new architecture incorporates Hybrid Vision Transformers (HVTs) and EfficientNet-based feature extraction [29], allowing the model to capture both local and global image contexts more effectively.
Dynamic feature pyramid network (DFPN): Unlike traditional feature pyramids, the DFPN [30] in YOLOv11 dynamically adjusts receptive fields based on object scale, improving small and large object detection simultaneously.
Improved anchor-free mechanism: By refining the anchor-free detection approach, YOLOv11 enhances detection accuracy while reducing computational overhead.
Optimized post-processing with Soft-NMS: Non-Maximum Suppression (NMS) techniques have been improved with a soft suppression approach, minimizing false negatives and improving precision.
Computational efficiency: YOLOv11 maintains real-time processing capabilities [31], making it highly suitable for deployment in edge computing environments.
Due to these improvements, YOLOv11 achieves higher FPS (Frames Per Second) and mean Average Precision (mAP) than its predecessors, making it an ideal choice for edge-based object detection applications.
For this research, we utilize YOLOv11.pt, a pre-trained model checkpoint specifically designed for efficient deployment; it contains trained weights optimized for real-world applications, allowing inference to be performed without any additional training.
In this study, YOLOv11.pt is deployed on Raspberry Pi 5-based edge computers to efficiently detect objects from video streams. The performance of YOLOv11 is evaluated based on FPS, detection accuracy, and computational efficiency, demonstrating the feasibility of using multi-edge collaborative computing for real-time video processing.
2.3. Raspberry Pi
Raspberry Pi is a low-cost, single-board computer (SBC) originally developed by the Raspberry Pi Foundation to promote computer science education and embedded system development. Over the years, Raspberry Pi has evolved into a powerful platform for various applications, including the IoT (Internet of Things), robotics, AI, and edge computing. Due to its compact size, energy efficiency, and affordability, Raspberry Pi is widely used for real-time data processing and AI inference at the edge.
By utilizing multiple Raspberry Pi devices in a collaborative edge computing framework, this study explores how increasing the number of edge nodes can enhance human detection performance and video processing efficiency. Specifically, we measure FPS improvements in video playing as additional edge devices participate in the detection process. This approach demonstrates the feasibility of using low-cost embedded platforms for real-time AI-based object recognition in resource-constrained environments.
3. Proposed Methodology
3.1. General Operations
To implement our proposed method, communication between edge devices adopts the client–server model, and data transmission amongst edge devices is based on UDP (User Datagram Protocol). The edge device that collects and displays video images acts as the server, while all clients analyze the image frames received from the server and perform human detection.
For the proposed functionality, all participating edge devices form a clustering network, where the edge server acts as the header of the network, controlling the operational states of all edge clients within the network. To manage the clustering network, the edge server periodically broadcasts a request message to nearby nodes, asking for participation in the cluster. Edge clients that wish to join or remain in the cluster respond to the edge server using a unicast method.
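The following sketch illustrates one way such a discovery round could be realized over UDP, assuming an arbitrary control port and a simple text-based message format; neither detail is specified in the paper, so this is only a minimal sketch rather than the authors' implementation.

```python
import socket
import time

CONTROL_PORT = 50000  # assumed UDP control port (not specified in the paper)

def server_discovery_round(members: dict, timeout: float = 1.0) -> None:
    """One discovery round on the edge server: broadcast a join request, collect unicast replies."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.settimeout(timeout)
    sock.sendto(b"CLUSTER_JOIN_REQUEST", ("<broadcast>", CONTROL_PORT))
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            data, addr = sock.recvfrom(1024)
        except socket.timeout:
            break
        if data == b"CLUSTER_JOIN_REPLY":
            members[addr[0]] = time.time()  # record or refresh the member's liveness
    sock.close()

def client_respond_forever() -> None:
    """Edge client: answer every broadcast request with a unicast reply to the sender."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", CONTROL_PORT))
    while True:
        data, addr = sock.recvfrom(1024)
        if data == b"CLUSTER_JOIN_REQUEST":
            sock.sendto(b"CLUSTER_JOIN_REPLY", addr)
```

Calling server_discovery_round periodically (every execution interval j) keeps the member list fresh, while clients that stop responding simply age out of the dictionary.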
The conceptual operation we propose is illustrated in Figure 1. The server sequentially distributes the image frames collected from its attached camera to the participating edge clients in the cluster, so that the human detection process can be performed remotely. Each edge client processes the received image frames using the YOLOv11 model for human detection and reports the result values, including the frame number, the number of detected individuals, and their coordinates, back to the edge server. In this process, the messages for image frame analysis requests and responses are transmitted in a unicast manner.
3.2. Edge Server’s Operations
The edge server device, which functions as the header in the clustering network, is structured with several modules to perform the proposed functionalities, as illustrated in Figure 2. These modules include the Video Capture Manager, Communication Manager, Cluster Manager, Cooperation Manager, and Video Displayer. The main functions of each module are as follows:
Video Capture Manager: extracts image frames at a given period (FPS) and stores the Frame ID, extraction timestamp, and other metadata in the Frames DB. If the number of stored frame records exceeds a defined threshold, the oldest record that has not yet been transmitted over the network is removed to retain only the latest frame information.
Communication Manager: delivers messages generated by the Cluster Manager and Cooperation Manager to the corresponding target clients. It also forwards received messages to either the Cluster Manager or the Cooperation Manager for further processing.
Cluster Manager: forms and manages the clustering network for collaborative operations, maintaining the member list and their statuses.
Cooperation Manager: the core module of the proposed method. It sequentially extracts frames stored in the Frames DB and distributes them to cluster members using a round-robin approach, as sketched in the code example below. In this proposal, video input is captured at 30 Frames Per Second (FPS). To prevent unbounded accumulation in the Frames DB, which occurs when cluster members return responses more slowly than frames are distributed, the system limits the database to a maximum of 30 frames and employs a circular queue mechanism. Additionally, the detection results received from the cluster members are stored in the Detections DB.
Video Displayer: This module visualizes the content stored in the Detections DB and the Frames DB. To manage and display video frames sequentially, images transmitted and received by the Cooperation Manager are recorded along with their corresponding frame IDs and timestamps. The Video Displayer does not render frames immediately upon receiving detection data. Instead, it waits until either the number of detection records accumulated in the Detections DB is at least twice the number of edge devices or the timestamp of the earliest detection record differs from the current time by at least one second. Once either condition is met, the image associated with the lowest frame ID in the Detections DB is displayed. After rendering, the corresponding frame and its detection record are removed from the Frames DB and the Detections DB, respectively. If a detection record is received with a frame ID lower than the one currently displayed, it is ignored and not stored in the Detections DB.
In the illustration, the Frames DB stores captured image frames, Frame IDs, and capture timestamps, while the Detections DB maintains the IDs of the cluster members (edge devices), Frame IDs, and detected human information (such as the number of detected people and the coordinates of each of them).
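A minimal sketch of the Cooperation Manager's round-robin distribution over the bounded, circular Frames DB is given below. The deque-based buffer and the send_frame helper are illustrative stand-ins chosen for the sketch, not the authors' implementation.

```python
import itertools
from collections import deque, namedtuple

Frame = namedtuple("Frame", ["frame_id", "timestamp", "jpeg_bytes"])

MAX_FRAMES = 30  # the Frames DB is bounded to 30 entries and managed as a circular queue

class CooperationManager:
    def __init__(self, members):
        # Bounded deque: appending beyond maxlen silently drops the oldest entry
        self.frames_db = deque(maxlen=MAX_FRAMES)
        self.detections_db = []                        # (client_id, frame_id, people, boxes)
        self.member_cycle = itertools.cycle(members)   # round-robin over cluster members

    def enqueue_frame(self, frame: Frame) -> None:
        """Called by the Video Capture Manager for every captured frame."""
        self.frames_db.append(frame)

    def distribute_next(self, send_frame) -> None:
        """Pop the oldest pending frame and send it to the next member in round-robin order."""
        if not self.frames_db:
            return
        frame = self.frames_db.popleft()
        member = next(self.member_cycle)
        send_frame(member, frame)          # send_frame(): illustrative UDP transfer helper

    def record_detection(self, client_id, frame_id, people, boxes) -> None:
        """Store a detection report received from an edge client for the Video Displayer."""
        self.detections_db.append((client_id, frame_id, people, boxes))
```

The Video Displayer would then consume detections_db in ascending frame-ID order once the gating conditions described above are met.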
3.3. Edge Client’s Operations
The edge client is structured with the Cluster Manager, Communication Manager, Frame Manager, and Objects Detector modules, as illustrated in Figure 3.
The main functions of each module are as follows:
Cluster Manager: It manages the participation of the edge client in the clustering network for collaborative operations.
Communication Manager: It transmits messages generated by other modules to the server and delivers messages received from the server to the appropriate module for processing.
Frame Manager: It reconstructs complete frames by assembling image fragments received via the Communication Manager. The reconstructed frames are then stored in the Frames DB. To optimize processing performance, the Double Buffering technique is applied, meaning that only two image frames are maintained in the Frames DB. Amongst these, the image that is not currently being processed by the Objects Detector can be replaced with the most recently received image.
Objects Detector: It performs human detection on the latest image frame stored in the Frames DB using the YOLOv11 model. Finally, it generates a message containing the detected information (the number of people and their coordinates) along with the corresponding Frame ID and transmits it to the edge server via the Communication Manager.
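A minimal sketch of the double-buffered Frame Manager and the client-side detection loop follows, assuming OpenCV for JPEG decoding and the Ultralytics YOLO API for inference; these are illustrative choices, the report callback stands in for the Communication Manager, and the checkpoint name follows the Ultralytics convention rather than anything stated in the paper.

```python
import threading
import time
import cv2
import numpy as np
from ultralytics import YOLO

class FrameManager:
    """Double buffer: one slot holds the frame under analysis, the other the newest arrival."""
    def __init__(self):
        self.buffers = [None, None]   # each slot holds a (frame_id, image) pair
        self.active = 0               # index currently owned by the Objects Detector
        self.fresh = False            # True when the idle slot holds an unprocessed frame
        self.lock = threading.Lock()

    def store(self, frame_id: int, jpeg_bytes: bytes) -> None:
        """Called once a complete frame has been reassembled from UDP fragments."""
        image = cv2.imdecode(np.frombuffer(jpeg_bytes, np.uint8), cv2.IMREAD_COLOR)
        with self.lock:
            self.buffers[1 - self.active] = (frame_id, image)  # overwrite the idle slot
            self.fresh = True

    def take_latest(self):
        """Swap buffers and hand the newest frame to the detector, or None if nothing is pending."""
        with self.lock:
            if not self.fresh:
                return None
            self.active = 1 - self.active
            self.fresh = False
            return self.buffers[self.active]

def detection_loop(frames: FrameManager, report) -> None:
    """Objects Detector: run person detection on the latest frame and report the result."""
    model = YOLO("yolo11n.pt")        # Ultralytics checkpoint naming assumed
    while True:
        item = frames.take_latest()
        if item is None:
            time.sleep(0.005)
            continue
        frame_id, image = item
        results = model(image, classes=[0], verbose=False)  # class 0 = person
        boxes = results[0].boxes.xyxy.tolist()
        report(frame_id, len(boxes), boxes)                  # back to the edge server
```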
3.4. Mathematical Analysis
Assuming that there is no message loss despite UDP-based transmission, the total number of messages over the operating time can be obtained using the following equations.
To manage the clustering network over a specific period ($t$), the number of messages for cluster group creation can be derived through the following computation:

$M_{cluster} = (1 + n) \cdot \dfrac{t}{j}$ (1)

Here, $M_{cluster}$ represents the total number of messages for a cluster group: in each interval, the edge server issues one broadcast request and receives one unicast response from each client. The parameters $n$, $t$, and $j$ correspond to the number of edge client devices, the total execution time, and the execution interval, respectively.
The number of messages generated by human detection processing in the video can be obtained as follows:

$M_{detection} = \displaystyle\sum_{i=1}^{n} r_i(t)$ (2)

Here, $M_{detection}$ denotes the total count of human detection messages generated by the $n$ edge client devices during time interval $t$, where $r_i(t)$ is the number of detection-result messages returned by edge client $i$.
As a result, the total number of messages ($M_{total}$) generated through the proposed approach can be expressed as follows:

$M_{total} = M_{cluster} + M_{detection} + \alpha$ (3)

In Equation (3), $M_{total}$ represents the total number of messages, $n$ represents the number of edge client devices, $t$ denotes the execution period, $j$ refers to the message transmission interval, and finally, $\alpha$ corresponds to the additional message count for other protocols.
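As a purely illustrative check of Equation (1) under assumed values not taken from the paper ($n = 10$ clients, $t = 100$ s of operation, and a management interval of $j = 5$ s), the cluster-management traffic would amount to:

```latex
M_{cluster} = (1 + 10) \cdot \frac{100}{5} = 11 \cdot 20 = 220 \ \text{messages}
```

The detection traffic $M_{detection}$ reported by the clients and the protocol overhead $\alpha$ are then added according to Equation (3).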
4. Experiments and Results
4.1. Experiment Environments
For the implementation experiments and performance measurements, the system was configured as shown in Table 1. A total of eleven IoT devices were set up for edge computing functionality, all utilizing the Raspberry Pi 5 model with identical performance specifications. Additionally, all devices operated on the same version of the Debian Linux distribution.
All IoT devices were configured in a star network topology based on Gigabit Ethernet, where the server and clients communicated using UDP messages.
For video data collection and processing, a USB-based webcam connected to the edge server was used to capture video at a 1280 × 720 resolution at 30 FPS. The edge client devices, responsible for analyzing and processing the video frames, utilized the YOLOv11n.pt trained model. The parameter settings were configured as follows: confidence threshold = 0.5 and IoU (Intersection over Union) = 0.5. All other parameters were left at their default values.
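A minimal sketch of how an edge client might invoke the model with these settings, assuming the Ultralytics Python API and its YOLO11 checkpoint naming, is shown below; the paper does not provide the inference code, so this is only an illustrative configuration.

```python
from ultralytics import YOLO

# Load the pre-trained nano checkpoint (Ultralytics naming assumed: "yolo11n.pt")
model = YOLO("yolo11n.pt")

def detect_people(image):
    """Run person detection with the thresholds used in the experiments."""
    results = model.predict(
        image,
        conf=0.5,      # confidence threshold from Section 4.1
        iou=0.5,       # IoU threshold from Section 4.1
        classes=[0],   # restrict detection to the COCO 'person' class
        verbose=False,
    )
    boxes = results[0].boxes
    return len(boxes), boxes.xyxy.tolist()  # person count and [x1, y1, x2, y2] coordinates
```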
For the visualization of results, only processed image frames with completed person detection were displayed on the network remote screen. The output display of the remote screen was rendered on a separate desktop PC within the same network using XDMCP (X Display Manager Control Protocol) for remote visualization.
Figure 4 illustrates the network configuration for IoT devices used in the experiment, as well as the hardware setup for the testing environment. Additionally, to eliminate the impact of external network traffic on processing performance, the experiment was conducted within a closed network configuration.
4.2. Scenarios and Experiments
To measure the performance based on the number of edge clients participating in collaborative person detection, the movement of the video subject was simplified, as shown in Figure 5.
As illustrated in the figure, the video subject performed two types of movements: (A) moving both arms forward and backward and (B) raising and lowering both arms. These actions were repeated for approximately one hundred seconds while video data were collected, and the process was repeated for each variation in the number of edge clients. The repetition cycle of the arm movements was approximately 2 s.
Additionally, for comparison, the experiment was conducted in a standalone mode in which a single IoT device performed both video processing and human detection without the collaborative functionality. In this case, the same movement patterns of the subject were maintained.
4.3. Experiment Results
When the implementation is executed based on the previously mentioned environments and scenarios, the output is displayed as shown in Figure 6.
On the CLI (Command Line Interface) prompt accessed via SSH (Secure Shell) to the edge server, detection messages received from edge clients are displayed in real time. Additionally, basic detection information, which includes the edge client ID, analyzed frame ID, and the number of detected people, is output as text.
For visualization, overlay symbols representing detection results are superimposed onto the corresponding image frame, and the final video output is displayed accordingly.
Based on the experimental results from the implemented system under the proposed research scenario, the average data size per frame at a 1280 × 720 resolution was approximately 12.1 KB, with a variation of around 2.1 KB. The conversion of frames to JPEG format and the subsequent data transmission over the network took an average of 6.792 microseconds and 0.142 s, respectively, with standard deviations of approximately 0.996 microseconds and 0.024 s.
The frame update rate (FPS) based on the number of edge clients was analyzed, as shown in Figure 7. The horizontal axis at the bottom of the figure represents the number of collaborating edge clients, while the section labeled “s” indicates a standalone operation without collaboration.
As observed in the figure, when a single IoT device performed person detection independently, the measured performance was 2.22 FPS. In contrast, when the function was executed in collaboration with edge client devices, the FPS increased as more edge clients were added: with one edge client the system reached 2.83 FPS, and with eight edge clients it improved to 24.36 FPS, an increase of approximately 3 FPS per additional device.
However, when the system operated with nine edge clients, the FPS increased by approximately 2 FPS compared to eight clients, reaching 26.43 FPS. With ten edge clients, the FPS increased by only 0.008% to 26.44 FPS, indicating that further increases in the number of edge clients resulted in negligible performance improvements.
The CPU utilization of cluster nodes was measured as shown in Table 2. When performing person detection independently (in a single device), the CPU utilization was recorded at 60.4%.
When operating in a 1:1 configuration between the edge server and a single edge client, the server exhibited an average CPU utilization of 27.5%, while the edge client utilized an average of 57.4%. The edge server’s CPU usage was attributed to JPEG encoding and the transmission of captured frames, as well as to overlaying detection symbols on the display using the detection data received from the edge clients.
As the number of edge clients increased, the CPU utilization remained similar to that of a single edge client, with a variation within 0.5% up to eight edge clients. However, from nine edge clients onward, the CPU utilization decreased, and the variation in CPU usage became more pronounced.
Figure 8 presents a graphical representation of the sample monitoring values over a specific period, which were used to obtain the CPU utilization results shown in Table 2. The CPU utilization status was monitored at 100 ms intervals during the person detection process.
As observed in Figure 8, when operating with nine and ten edge clients, the CPU utilization frequently dropped significantly, occasionally reaching an idle state at certain intervals. The figure illustrates the CPU usage of a single edge device under each scenario involving multiple edge computing nodes. The experimental results showed no significant change in CPU utilization when the number of edge devices ranged from one to seven. However, beginning with eight edge devices, a noticeable decrease in CPU usage was observed.
Additionally, the experimental analysis revealed that processing a single image frame (including transmission, reception, and person detection) on a single edge client took approximately 350 ms.
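For reference, CPU sampling at 100 ms intervals of the kind used for Table 2 and Figure 8 could be collected with a short monitor such as the sketch below, which assumes the psutil package; the paper does not state which monitoring tool was actually used.

```python
import csv
import time
import psutil

def monitor_cpu(duration_s: float, interval_s: float = 0.1, out_path: str = "cpu_log.csv") -> None:
    """Sample overall CPU utilization every `interval_s` seconds and log it to a CSV file."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "cpu_percent"])
        end = time.time() + duration_s
        while time.time() < end:
            # cpu_percent blocks for `interval_s` and returns utilization over that window
            usage = psutil.cpu_percent(interval=interval_s)
            writer.writerow([time.time(), usage])

if __name__ == "__main__":
    monitor_cpu(duration_s=100.0)  # e.g., the ~100 s capture window used in the experiments
```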
5. Conclusions
This study proposed a method to enhance the video frame playback rate (FPS) by utilizing the collaborative processing of multiple IoT devices for video frame analysis and person detection. The proposed system was implemented, experimented with, and evaluated, demonstrating a significant performance improvement and thereby confirming its effectiveness.
For an input video of 1280 × 720 resolution at 30 FPS, a single IoT device performing the detection independently achieved approximately 2.22 FPS. However, with the proposed collaborative framework, operating up to ten edge clients resulted in an improved performance of approximately 26.44 FPS, as confirmed by the experimental results.
Additionally, when testing different input resolutions (1024 × 768 and 800 × 600), no significant performance difference was observed apart from network bandwidth variations. This is attributed to the fast detection characteristics of the YOLO model.
Importantly, the proposed framework holds great potential for intelligent transportation systems (ITSs). Applications such as real-time pedestrian monitoring at crosswalks, traffic incident detection, and decentralized safety alerting mechanisms can benefit from the improved FPS and scalable processing architecture. By enabling low-latency, distributed video analytics at the edge, the system can contribute to more responsive and intelligent traffic infrastructure, especially in urban environments where timely decision making is critical.
Through this study, several challenges for future work have been identified:
Regarding the Frame Synchronization Issue, the detection results from multiple edge clients are integrated and displayed by the edge server. However, due to timing differences in the received data, the display order of some frames became misaligned. To address this, a temporary buffering mechanism with 15 received frames was applied, but it introduced delays in continuous frame rendering. A more effective synchronization method is required.
In the case of Uneven Frame Processing, some image frames completed detection at varying time intervals, leading to irregularities in frame playback and an unnatural video output. A solution is needed to balance frame distribution for smoother playback.
Finally, in terms of Network Throughput Optimization, while an edge client is processing person detection, any newly received frames are automatically discarded. This results in unnecessary data transmission, reducing network throughput. A new predictive algorithm should be developed to anticipate and prevent the transmission of unnecessary frames, improving overall network efficiency.
These challenges highlight areas for further research and optimization to enhance the efficiency and reliability of collaborative IoT-based video processing systems. In particular, by refining these aspects, the proposed framework can be better positioned as a robust foundation for ITS applications, enabling smarter, safer, and more efficient transportation infrastructures.
This study did not include measurements of the system’s power consumption. However, since power efficiency is a critical factor in edge computing environments, we plan to obtain the necessary equipment and collect the relevant power metrics in future extensions of this research. We also intend to expand the proposed system to support video streaming by leveraging the GPU embedded in the edge devices, implementing the streaming functions and evaluating their performance through hardware-accelerated H.264 or H.265 encoding. Additionally, we anticipate that deploying the proposed system on platforms such as an NVIDIA Jetson-based single-board computer (SBC) or a daughterboard equipped with a Google Coral TPU could enable higher-performance inference. Future research will focus on enhancing system performance through the use of such SBCs and daughterboards. Furthermore, we aim to incorporate exception-handling scenarios, such as object detection failures, into future studies to examine their impact on system performance and reliability.
Author Contributions
Conceptualization, B.K., S.W. and J.L.; methodology, B.K.; software, S.W.; validation, B.K., S.W. and J.L.; formal analysis, B.K.; investigation, B.K.; resources, S.W.; data curation, S.W.; writing—original draft preparation, B.K.; writing—review and editing, J.L.; visualization, B.K.; supervision, J.L.; project administration, J.L.; funding acquisition, B.K. All authors have read and agreed to the published version of the manuscript.
Funding
This paper was supported by the Sahmyook University Research Fund in 2025.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The dataset presented in this paper is not readily available for the following reasons: It is a research result intended for commercial use in the final product and cannot be shared. Requests to access the dataset should be directed to dearbk@syu.ac.kr.
Acknowledgments
The authors would like to thank the members of AIoT Laboratory in Sahmyook University.
Conflicts of Interest
Author Soohyun Wang was employed by the company AI Development Team at Sensorway. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
YOLO | You only look once
FPS | Frames per second
IoT | Internet of things
AIoT | Artificial intelligence of things
CCTV | Closed circuit television
IoU | Intersection over union
HVT | Hybrid vision transformers
DFPN | Dynamic feature pyramid network
CUDA | Compute unified device architecture
mAP | Mean average precision
PTQ | Post-training quantization
ONNX | Open neural network exchange
SBC | Single-board computer
XDMCP | X display manager control protocol
CLI | Command line interface
SSH | Secure shell
References
- Zanella, A.; Bui, N.; Castellani, A.; Vangelista, L.; Zorzi, M. Internet of Things for smart cities. IEEE Internet Things J. 2014, 1, 22–32. [Google Scholar] [CrossRef]
- Sanchez, L.; Muñoz, L.; Galache, J.A.; Sotres, P.; Santana, J.R.; Gutiérrez, V.; Ramdhany, R.; Gluhak, A.; Krco, S.; Theodoridis, E.; et al. SmartSantander: IoT experimentation over a smart city testbed. Comput. Netw. 2014, 61, 217–238. [Google Scholar] [CrossRef]
- Kirimtat, A.; Krejcar, O.; Kertesz, A.; Tasgetiren, M.F. Future trends and current state of smart city concepts: A survey. IEEE Access 2020, 8, 86448–86467. [Google Scholar] [CrossRef]
- Mocrii, D.; Chen, Y.; Musilek, P. IoT-based smart homes: A review of system architecture software communications privacy and security. Internet Things 2018, 1, 81–98. [Google Scholar] [CrossRef]
- Stolojescu-Crisan, C.; Crisan, C.; Butunoi, B.-P. An IoT-Based Smart Home Automation System. Sensors 2021, 21, 3784. [Google Scholar] [CrossRef] [PubMed]
- Perera, C.; Liu, C.H.; Jayawardena, S.; Chen, M. A Survey on Internet of Things From Industrial Market Perspective. IEEE Access 2014, 2, 1660–1679. [Google Scholar] [CrossRef]
- Awaisi, K.S.; Ye, Q.; Sampalli, S. A Survey of Industrial AIoT: Opportunities, Challenges, and Directions. IEEE Access 2024, 12, 96946–96996. [Google Scholar] [CrossRef]
- Elijah, O.; Rahman, T.A.; Orikumhi, I.; Leow, C.Y.; Hindia, M.N. An overview of Internet of Things (IoT) and data analytics in agriculture: Benefits and challenges. IEEE Internet Things J. 2018, 5, 3758–3773. [Google Scholar] [CrossRef]
- Mao, H.; Yao, S.; Tang, T.; Li, B.; Yao, J.; Wang, Y. Towards Real-Time Object Detection on Embedded Systems. IEEE Trans. Emerg. Top. Comput. 2018, 6, 417–431. [Google Scholar] [CrossRef]
- Sung, T.-W.; Tsai, P.-W.; Gaber, T.; Lee, C.-Y. Artificial intelligence of things (AIoT) technologies and applications. Wirel. Commun. Mob. Comput. 2021, 2021, 9781271. [Google Scholar] [CrossRef]
- Ma, X.; Yao, T.; Hu, M.; Dong, Y.; Liu, W.; Wang, F.; Liu, J. A Survey on Deep Learning Empowered IoT Applications. IEEE Access 2019, 7, 181721–181732. [Google Scholar] [CrossRef]
- Lee, S.; Lee, S.; Lee, S.-S. Deadline-Aware Task Scheduling for IoT Applications in Collaborative Edge Computing. IEEE Wirel. Commun. Lett. 2021, 10, 2175–2179. [Google Scholar] [CrossRef]
- Porambage, P.; Okwuibe, J.; Liyanage, M.; Ylianttila, M.; Taleb, T. Survey on Multi-Access Edge Computing for Internet of Things Realization. IEEE Commun. Surv. Tutor. 2018, 20, 2961–2991. [Google Scholar] [CrossRef]
- Schulz, P.; Matthe, M.; Klessig, H.; Simsek, M.; Fettweis, G.; Ansari, J.; Ashraf, S.A.; Almeroth, B.; Voigt, J.; Riedel, I.; et al. Latency Critical IoT Applications in 5G: Perspective on the Design of Radio Interface and Network Architecture. IEEE Commun. Mag. 2017, 55, 70–78. [Google Scholar] [CrossRef]
- Coutinho, R.W.L.; Boukerche, A. Modeling and Performance Evaluation of Collaborative IoT Cross-Camera Video Analytics. In Proceedings of the IEEE International Conference on Communications (ICC), Rome, Italy, 28 May–1 June 2023. [Google Scholar] [CrossRef]
- Liu, H.; Hou, Y.; Zhang, J.; Zheng, P.; Hou, S. Research on Weed Reverse Detection Methods Based on Improved You Only Look Once (YOLO) v8: Preliminary Results. Agronomy 2024, 14, 1667. [Google Scholar] [CrossRef]
- Yang, G.; Feng, W.; Jin, J.; Lei, Q.; Li, X.; Gui, G.; Wang, W. Face Mask Recognition System with YOLOV5 Based on Image Recognition. In Proceedings of the IEEE International Conference on Computer and Communications (ICCC), Chengdu, China, 11–14 December 2020. [Google Scholar] [CrossRef]
- Terven, J.; Córdova-Esparza, D.-M.; Romero-González, J.-A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Available online: http://www.deeplearningbook.org (accessed on 17 August 2025).
- Hubel, D.H.; Wiesel, T.N. Receptive Fields and Functional Architecture of Monkey Striate Cortex. J. Physiol. 1968, 195, 215–243. [Google Scholar] [CrossRef]
- LeCun, Y.A.; Jackel, L.D.; Bottou, L.; Brunot, A.; Cortes, C.; Denker, J.S.; Drucker, H.; Guyon, I.; Muller, U.A.; Sackinger, E.; et al. Learning algorithms for classification: A comparison on handwritten digit recognition. In Neural Networks; World Scientific: London, UK, 1995; pp. 261–276. [Google Scholar]
- Lu, C.-K.; Shen, J.-Y.; Lin, C.-H.; Lien, C.-Y.; Ding, S.Y. Efficient Embedded System for Small Object Detection: A Case Study on Floating Debris in Environmental Monitoring. IEEE Embed. Syst. Lett. 2025, 17, 264–267. [Google Scholar] [CrossRef]
- Tsai, M.H.; Chen, J.-Y.; Chien, W.-C. Mobile Edge Computing for Rapid deployment Object Detection System. In Proceedings of the IET International Conference on Engineering Technologies and Applications (IET-ICETA), Changhua, Taiwan, 14–16 October 2022. [Google Scholar] [CrossRef]
- Kim, R.; Kim, G.; Kim, H.; Yoon, G.; Yoo, H. A Method for Optimizing Deep Learning Object Detection in Edge Computing. In Proceedings of the International Conference on Information and Communication Technology Convergence (ICTC), Jeju, Republic of Korea, 21–23 October 2020; pp. 1164–1167. [Google Scholar] [CrossRef]
- Guo, Y.; Zou, B.; Ren, J.; Liu, Q.; Zhang, D.; Zhang, Y. Distributed and Efficient Object Detection via Interactions Among Devices, Edge, and Cloud. IEEE Trans. Multimed. 2019, 21, 2903–2915. [Google Scholar] [CrossRef]
- Howard, A.; Sandler, M.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V. Searching for MobileNet V3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
- Wang, W.; Zhou, X.; He, X.; Qing, L.; Wang, Z. Facial expression recognition based on improved MobileNet network. Comput. Appl. Softw. 2020, 37, 137–144. [Google Scholar]
- Ahmad, A.; Pasha, M.A.; Raza, G.J. Accelerating Tiny YOLOv3 using FPGA-Based Hardware/Software Co-Design. In Proceedings of the 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Seville, Spain, 12–14 October 2020; pp. 1–5. [Google Scholar]
- Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
- Zhang, K.; Li, Z.; Hu, H.; Li, B.; Tan, W.; Lu, H.; Xiao, J.; Ren, Y.; Pu, S. Dynamic Feature Pyramid Networks for Detection. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar] [CrossRef]
- Jeong, E.; Kim, J.; Tan, S.; Lee, J.; Ha, S. Deep Learning Inference Parallelization on Heterogeneous Processors With TensorRT. IEEE Embed. Syst. Lett. 2022, 14, 15–18. [Google Scholar] [CrossRef]
- Caesar, H.; Uijlings, J.; Ferrari, V. COCO-Stuff: Thing and Stuff Classes in Context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1209–1218. [Google Scholar] [CrossRef]
- Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Kolesnikov, A. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. Int. J. Comput. Vis. 2020, 128, 1956–1981. [Google Scholar] [CrossRef]
- Nagel, M.; Amjad, R.A.; Van Baalen, M.; Louizos, C.; Blankevoort, T. Up or down? adaptive rounding for post-training quantization. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 7197–7206. [Google Scholar]
- ONNX: Open Neural Network Exchange. 2019. Available online: https://onnx.ai/ (accessed on 18 July 2025).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).