Article

Collaborative Real-Time Single-Object Anomaly Detection Framework of Roadside Facilities for Traffic Safety and Management Using Efficient YOLO

1 Department of Software, Duksung Women’s University, Seoul 01369, Republic of Korea
2 School of Electrical Engineering, Korea University, Seoul 02841, Republic of Korea
3 Division of Computer Science and Engineering, Sahmyook University, 815 Hwarang-ro, Nowon-gu, Seoul 01795, Republic of Korea
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(20), 11139; https://doi.org/10.3390/app152011139
Submission received: 1 October 2025 / Revised: 14 October 2025 / Accepted: 15 October 2025 / Published: 17 October 2025
(This article belongs to the Special Issue Advances in Intelligent Transportation and Its Applications)


Featured Application

This study proposes a collaborative Edge AI system for real-time anomaly detection of roadside facilities using vehicle-mounted cameras. It can be applied to intelligent traffic management systems to enhance road safety and enable proactive infrastructure maintenance.

Abstract

This paper proposes an Edge AI-based collaborative framework for real-time anomaly detection of roadside facilities to enhance traffic safety and management. Traditional detection methods rely on fixed cameras or manual inspections, which are time-consuming and inefficient. Our approach embeds lightweight YOLO models in vehicle dashboard cameras to collect and analyze diverse video data across multiple vehicles in real time. This distributed system overcomes the limitations of individual vehicles through collaborative data aggregation and enables robust anomaly detection in various types of roadside facilities. We evaluate several YOLO variants to identify the optimal balance between detection accuracy and computational efficiency. Experimental results demonstrate improved anomaly detection precision and faster response times, validating the feasibility of our system for practical deployment. The proposed method offers a scalable and efficient solution for proactive traffic management and accident prevention by leveraging distributed edge intelligence.

1. Introduction

With the rapid adoption of autonomous vehicles, the condition of road infrastructure has become a critical factor influencing safe driving. Proper management of road surface conditions is essential not only for vehicle performance but also for the safety of drivers and pedestrians [1,2]. Among the multiple factors affecting safe driving environments, the role of roadside facilities—such as traffic cones, barriers, and warning signs—is increasingly recognized. These facilities are installed to alert drivers of potentially hazardous road conditions, guide traffic flow, and prevent accidents. Therefore, maintaining the integrity and correct positioning of these roadside assets is vital for overall traffic safety and accident prevention [3].
Traditional research in the domain of road condition monitoring has predominantly focused on the detection of physical anomalies on the road surface itself, including potholes, cracks, and uneven pavement [2]. Accurate identification and maintenance of these defects are important for vehicle durability and traffic smoothness. However, another crucial dimension is the management of roadside facilities that contribute to accident prevention. These elements can themselves become damaged, misplaced, or degraded over time due to weather, traffic incidents, or vandalism, compromising their intended function and increasing safety risks.
Existing detection methods largely depend on fixed surveillance cameras or manual inspections, which suffer from limited coverage, high labor costs, and delayed response. Recently, some local governments and transportation agencies have attempted to enhance anomaly detection by employing vehicle-mounted cameras that traverse specific regions to capture and analyze real-time footage for road defects [4]. Though effective in certain contexts, this approach has limitations—it requires specialized vehicles to repeatedly cover targeted areas, which is time-consuming and inefficient, particularly for large-scale road networks.
Further challenges arise in the computational requirements for real-time processing. Many current studies rely on high-resolution cameras and perform analysis using powerful computing units such as desktop PCs or cloud servers. This arrangement is not optimal for wide deployment due to hardware costs, latency, and network dependency issues. To address these limitations, edge AI solutions embedded directly in vehicle dashboard cameras, which are resource-constrained but pervasive, offer a promising alternative.
This study proposes a system to perform real-time anomaly detection and monitoring of roadside facilities using lightweight deep learning models deployed on edge devices such as dashcams, which are ubiquitous in everyday vehicles. Our approach differs from conventional methods by assigning to each vehicle a specialized lightweight YOLO model that recognizes a single type of roadside facility. Through collaborative aggregation of detection information from multiple vehicles, the system achieves comprehensive coverage and robust anomaly identification without requiring dedicated inspection vehicles or extended monitoring routes.
We utilize the publicly available joint dataset from Korea AI-Hub [5] to train and evaluate various versions of the YOLO object detection model. Performance metrics such as detection accuracy, resource consumption, and inference speed are analyzed to find a suitable trade-off for real-time edge deployment. YOLO’s scalability and efficiency make it an ideal candidate for this task, supporting the development of a practical monitoring solution for large, distributed traffic environments.
Several previous studies have explored video-based road condition diagnosis and anomaly detection using fixed installations or aerial images. For example, road surface defect detection has been conducted through deep learning on UAV-captured images or stationary sensor networks [6,7]. Some works also highlight collaborative approaches like crowdsourcing, where data from multiple independent sources are combined to improve detection accuracy and coverage [8,9]. However, collaborative edge AI-based roadside facility monitoring with lightweight models on vehicle-mounted devices remains underexplored, underlining the novelty and significance of the present work.
The remainder of this paper is organized as follows. Section 2 reviews prior research on image and video analysis for road state diagnosis and collaborative monitoring networks such as crowdsensing, describes the characteristics of each YOLO version used in the experiments, and discusses case studies and advantages of single-object detection models in the object detection field. Section 3 describes the dataset used in this study and provides a detailed description and evaluation of the proposed methodology. Section 4 presents the experimental setup and results, assessing model performance and efficiency. Finally, Section 5 discusses the contributions, practical implications, and potential future advancements of this research.

2. Related Works

2.1. Road and Roadside Anomaly Detection

Anomaly detection for roads and roadside facilities has evolved rapidly in response to the demands of autonomous driving and smart transportation systems. Early research focused on pavement defect identification using vibration sensors and accelerometers installed in vehicles, which facilitated cost-effective detection of anomalies such as potholes and cracks [10,11]. However, these methods often struggled with precise localization and were sensitive to environmental factors.
With the advent of deep learning, vision-based approaches using images and videos captured by vehicle-mounted or roadside cameras have become prominent. For instance, pixel-wise anomaly detection using encoder–decoder semantic segmentation networks has proven effective for identifying road surface defects under varying lighting and weather conditions [12]. Recent video analytics studies utilize convolutional networks and spatiotemporal modeling, allowing for robust real-time anomaly recognition in urban environments [13,14]. These models are able to process large-scale traffic footage to detect not only conventional defects but also non-standard anomalies, such as obstructions and hazardous debris.
Roadside facility anomaly detection, which includes monitoring of infrastructure such as traffic cones, guardrails, and signage, offers a new perspective on traffic safety. Systematic reviews demonstrate the relevance of integrating sensor networks, IoT devices, and edge computing to track and diagnose anomalies across distributed infrastructure. Crowdsourcing and collaborative sensing platforms—where data from multiple vehicles or users are collectively analyzed—have greatly expanded the coverage and reliability of monitoring systems. Example projects such as VADtalk [15] and UIT-ADrone [16] deployed anomaly detection models across fleets of connected vehicles and drones, enabling near real-time alerts and mitigation strategies for both road and roadside threats.
As intelligent traffic management evolves, edge AI-based frameworks enable on-device anomaly recognition and decentralized aggregation of results, overcoming the limitations of centralized processing and slow response times. These modern approaches are highly scalable, reduce latency, and support the dynamic and collaborative monitoring needs of next-generation transportation networks.

2.2. Comparison of YOLO Family

The YOLO (You Only Look Once) series has evolved significantly, with each version refining the balance between detection accuracy, inference speed, and computational efficiency. These advancements have been crucial in the context of road and roadside facility anomaly detection, where real-time processing on edge devices is often required.
YOLOv5 [17] enhanced the architecture with features such as a CSPDarknet backbone, Mish activation, and PANet for better path aggregation. Released in several scales (YOLOv5s, v5m, v5l, v5x) optimized for deployment on embedded devices, it offered a latency reduction of up to 40% with comparable accuracy, reaching mAP values of about 50–65% depending on model size and dataset. These versions have been widely adopted in pavement crack and pothole detection tasks, achieving detection accuracies above 85% on some public datasets.
YOLOv6 [18] targeted industrial applications with enhanced backbone and neck designs (e.g., EfficientNet-inspired), emphasizing inference speed and object localization precision. It achieved nearly 60 FPS on edge GPUs while improving mAP by 2–3% over YOLOv5 in traffic object datasets. Studies applying YOLOv6 for vehicle detection and roadside anomaly recognition reported improved small object recall and decreased false positives owing to SimAM attention modules.
YOLOv8n, the nano variant of YOLOv8, balances speed and precision, delivering up to 120 FPS on edge devices. These models have demonstrated detection accuracy above 90% for standard road anomalies and roadside objects such as cones and barriers under varying weather and illumination, proving their suitability for autonomous vehicles and edge AI traffic monitoring scenarios.
YOLOv9tr introduced generalized efficient layer aggregation and attention mechanisms such as SE and CBAM blocks, pushing frame rates to over 136 FPS with detection accuracy reaching 70% mAP on road damage datasets. Latest experiments demonstrate improved detection of challenging anomalies such as road pits, cracks, and general roadside objects under occlusions [19,20].
YOLOv10 [21] continued to refine architecture scalability and introduced wavelet transform-based feature extraction for better fine-detail anomaly detection. YOLOv10 models report mAP scores surpassing 72% on road defect benchmarks, with inference latency under 10 ms on RTX 3080 GPUs, supporting near real-time operation in urban traffic management.
YOLOv11 [22] and YOLOv12 [23] feature advanced neural architectures with improved feature pyramid networks, adaptive attention modules, and transformer-inspired layers. These strengthen robustness in harsh environments like night driving and inclement weather, with mAP improvements of 3–6% compared to YOLOv10 in controlled studies. Real-world deployments confirm their effectiveness for multi-object tracking and anomaly classification in dense traffic scenarios.
Selecting a YOLO version thus depends on the application requirements—earlier versions offer a good balance for well-understood tasks, while newer versions incorporate advanced feature extraction, attention mechanisms, and NAS techniques that push precision without sacrificing speed [24,25]. This makes the latest YOLO models especially suitable for real-time, collaborative roadside facility anomaly detection in edge AI and intelligent transportation systems.

2.3. Specialized Single-Object Detection Models for Collaborative Multi-Agent Deployment

Conventional multi-object detection models typically aim to detect multiple classes within a single unified model, or use multi-modal inputs to improve object detection performance [26,27]. However, in resource-constrained environments such as vehicle dashcams or drone-embedded edge devices, running complex multi-class models can be computationally demanding and can reduce inference speed [28,29]. A recently proposed alternative approach involves developing multiple specialized lightweight models, each finely tuned to detect a single object type or anomaly [30]. These specialized models can then be deployed across a fleet of vehicles, drones, or other edge devices, enabling collaborative and parallel detection of diverse object categories.
This modular methodology offers clear benefits:
  • Reduced computational load and model complexity on each device, resulting in faster inference and lower power consumption.
  • Improved detection accuracy and robustness for each object type by enabling targeted training on highly specific features.
  • Flexibility in deployment, where different agents can be assigned different detection tasks based on situational needs and available resources.
For example, one study demonstrated the effectiveness of dedicated models for pothole detection and monitoring, each optimized independently using enhanced Single Shot Detector (SSD)-based architectures and lightweight YOLO variants [31]. This approach contrasts with holistic multi-class detectors that trade off detail for generality.
Collaborative or crowd sensing frameworks further amplify the advantages of this approach by aggregating detection results from multiple independent agents [32]. Vehicle networks equipped with single-class detectors for different roadside assets have shown increased coverage and higher detection confidence, ensuring robust monitoring over large road networks [33]. Similarly, drone fleets use specialized lightweight models on-board to capture diverse environmental anomalies in real time, facilitating real-world surveillance at scale with minimal hardware overhead [34,35].
This strategy aligns well with contemporary edge AI and federated learning paradigms, where decentralized, task-specific inference on distributed devices lowers communication overhead and latency [36]. Federated aggregation of detection outputs across vehicles or drones enhances anomaly detection reliability and supports scalable traffic safety systems [37,38,39].
Overall, the deployment of multiple single-object detection models across collaborative multi-agent platforms presents a promising and practical solution for efficient, scalable, and high-fidelity road and roadside anomaly detection systems. This approach enables dynamic assignment of detection tasks across agents, tailored resource usage, and improved adaptability to complex and varying urban monitoring environments.

3. Data and Method

3.1. Dataset

The “Local Government Road Facility Damage Dataset,” provided by AI Hub [5], is a comprehensive image dataset designed to support AI-driven detection and classification of damage in roadside infrastructure managed by local governments throughout Korea. Table 1 and Figure 1 show the composition of the original dataset registered in AI Hub. This dataset is specifically curated to facilitate research and development of deep learning models for automated monitoring and maintenance of diverse road facilities, including traffic cones, guardrails, signage, and other critical safety apparatuses.
Each sample within the dataset comprises high-resolution images captured in real-world urban and suburban environments under varying conditions of lighting, weather, and traffic density. The images are annotated with detailed metadata, including the precise location, facility type, and the specific nature of the observed defect or abnormality—such as deformation, dislocation, overturning, or disappearance. Object-level annotations include bounding boxes, class labels, and damage type descriptors, enabling granular model training and evaluation for both detection and fine-grained classification tasks.
A key strength of this dataset is its coverage of both normal and abnormal cases across a broad spectrum of facility types, ensuring balanced representation for supervised learning. The data encompasses diverse contexts, from lanes and intersections densely populated with roadside facilities to remote stretches where infrequent damage events occur. This variety provides a robust foundation for developing models capable of generalizing to different urban scenarios.
To ensure the efficiency and reliability of the training process, particular attention was given to the quality and diversity of the training dataset. Redundant samples, especially those captured under nearly identical conditions such as images taken from the same location or with highly similar viewpoints, can introduce unnecessary bias into the model and increase the risk of overfitting. Moreover, the inclusion of such data leads to inefficient use of computational resources, resulting in longer training times and excessive storage requirements. Therefore, to mitigate these issues, images captured in similar environments or under overlapping conditions were selectively removed, and the curated dataset was subsequently used for model training. This strategy helped maintain a balanced and diverse dataset while preventing overrepresentation of specific scenes.
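As an illustration of this curation step, near-duplicate images can be filtered with a simple perceptual-hashing pass. The sketch below is not the exact procedure used for this dataset; it assumes the Pillow and imagehash packages, a flat directory of JPEG frames, and an arbitrary example Hamming-distance threshold.

```python
from pathlib import Path
from PIL import Image
import imagehash

def filter_near_duplicates(image_dir: str, max_hamming: int = 4) -> list[Path]:
    """Keep an image only if its perceptual hash differs from every kept image
    by more than `max_hamming` bits (a rough proxy for a distinct viewpoint)."""
    kept_paths, kept_hashes = [], []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        h = imagehash.average_hash(Image.open(path))
        if all(h - other > max_hamming for other in kept_hashes):
            kept_paths.append(path)
            kept_hashes.append(h)
    return kept_paths
```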

3.2. The Proposed Method

The core of the proposed method centers on a collaborative, edge AI-powered, real-time anomaly detection system designed specifically for roadside facilities. This system is architected with the goal of overcoming the limitations inherent in traditional infrastructure monitoring approaches, such as dependence on fixed cameras, periodic manual inspections, or multi-object detection models not optimized for edge devices.
In our approach, rather than relying on a single, complex multi-class detector to recognize all types of roadside facilities (such as traffic cones, guardrails, PE drums, and PE fences), we instead develop a set of specialized, lightweight YOLO-family models, each dedicated to a single object type. By training object-specific detectors, we are able to achieve high-precision classification for both normal and damaged states of each facility class, while greatly reducing computational burden and memory usage. Each model is optimized for its assigned object, leveraging tailored hyperparameter settings and transfer learning with the large-scale, expert-annotated images provided by the road facility damage dataset. This dataset includes diverse conditions—lighting, weather, and traffic environments—ensuring that our models generalize robustly to real-world field deployment.
The deployment strategy utilizes a distributed multi-agent framework, where these specialized models are installed on a fleet of conventional vehicles, each equipped with edge compute resources (e.g., automotive black boxes or on-board computers). Each edge device performs continuous, on-device inference while patrolling road networks or urban sectors. When an abnormality (such as a damaged facility or an overturned cone) is detected, the event—along with contextual data like location, timestamp, and an image crop—is immediately packaged and transmitted to a central or federated aggregation server. In addition, the federated nature of this multi-agent system supports efficient re-training and upgrade cycles: as new types of anomalies are discovered, centralized or federated learning updates can be distributed to the edge devices, ensuring continuous improvement over time.
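As a rough illustration of this reporting step, a detection event can be serialized with its contextual data and pushed to an aggregation endpoint. The sketch below is illustrative only: the server URL, payload field names, and the HTTP POST transport are assumptions rather than the system’s actual interface.

```python
import base64
import json
import time

import cv2
import requests  # assumed transport; any messaging layer (MQTT, gRPC, ...) could be used

def report_anomaly(frame, box_xyxy, label, confidence, lat, lon,
                   server_url="http://aggregator.example/api/events"):
    """Package one anomaly detection (class, confidence, GPS fix, image crop) and send it."""
    x1, y1, x2, y2 = map(int, box_xyxy)
    ok, jpg = cv2.imencode(".jpg", frame[y1:y2, x1:x2])  # crop of the detected facility
    payload = {
        "label": label,                      # e.g., "sandwich_board_sign_damaged" (hypothetical)
        "confidence": float(confidence),
        "lat": lat,
        "lon": lon,
        "timestamp": time.time(),
        "crop_jpeg_b64": base64.b64encode(jpg.tobytes()).decode() if ok else None,
    }
    requests.post(server_url, data=json.dumps(payload),
                  headers={"Content-Type": "application/json"}, timeout=5)
```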
The proposed method provides a scalable, reliable, and resource-efficient solution for collaborative anomaly detection of roadside facilities. By eschewing monolithic models for a suite of purpose-built detectors and leveraging multi-agent field deployment with distributed data aggregation, our approach is well-suited to the practical requirements of next-generation intelligent infrastructure monitoring and smart city traffic safety management.
Figure 2 represents the core steps and procedures of our proposed method. The overall workflow progresses from data collection and model training to deployment on edge devices, on-site detection, and finally, a location-based management system.

3.2.1. Data Collection and Annotation

When a new roadside facility object to be managed is identified, separate collection and labeling of normal and abnormal data solely for this newly added object offers significant advantages over retraining or modifying existing large-scale multi-class models. By isolating dataset construction to only the new facility class, the approach ensures focused and efficient data management. This prevents data imbalance and model interference that often arise when integrating small volumes of new samples into extensively trained, multi-class datasets. Additionally, the modular dataset allows for quicker iteration cycles, facilitating rapid model development tailored specifically to the new object’s unique features and anomaly patterns. Maintaining separate data subsets also enables flexible experimental comparisons against various candidate models (e.g., YOLO variants, transformer-based detectors) without impacting existing object classes.
This initial phase establishes the foundational data needed for our new object detection model. The diagram shows a vehicle with a dash camera collecting images from diverse environments, including various locations, views, and weather conditions. The images contain both normal road facilities and abnormal objects, such as damaged or displaced items. For the model to learn to identify these anomalies, the abnormal objects are meticulously marked with bounding boxes during the labeling process, creating the ground truth data for model training.

3.2.2. Model Training and Selection for Deploying on Edge Devices

Using the collected and labeled data, we train and evaluate various lightweight object detection models, such as YOLO. The training process focuses on two key metrics: Performance and Efficiency. Performance is measured using metrics like mAP (mean Average Precision), while efficiency is gauged by model size and inference speed. The graph in the diagram illustrates how we select the optimal model by balancing these two metrics, ensuring it is both accurate and suitable for the hardware constraints of the target vehicle terminals.
Once the ideal model is selected, it is prepared for deployment. This involves converting and optimizing the model to run efficiently on the vehicle’s hardware, which may vary across different devices. The optimized model is then deployed to the vehicle’s terminal, where it performs edge inference. This means the vehicle can analyze video data in real time and detect anomalies directly on the device, without relying on a remote server.
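For reference, with the Ultralytics API that supports the YOLO family evaluated here, training a per-facility detector (with normal and damaged classes) and converting it for an edge runtime can look roughly like the sketch below; the dataset configuration file name and the ONNX export target are assumptions, not the exact settings used in this study.

```python
from ultralytics import YOLO

# Fine-tune a lightweight checkpoint on the per-facility dataset
# (two classes per facility type: normal and damaged).
model = YOLO("yolov8s.pt")
model.train(data="sandwich_board_sign.yaml",  # hypothetical dataset config
            epochs=100, imgsz=640, batch=16)

# Validation reports mAP@0.5 and mAP@0.5:0.95 for the performance side of the trade-off.
metrics = model.val()

# Export an optimized artifact for on-device inference; the target format
# (ONNX, TFLite, TensorRT, NCNN, ...) depends on the vehicle terminal hardware.
model.export(format="onnx", imgsz=640)
```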
The selection of an optimal object detection model for a resource-constrained target device is a critical step that requires a balanced consideration of both performance and efficiency. Simply choosing the model with the highest accuracy or the smallest size is often insufficient, as it may fail to meet the specific operational requirements of the target hardware. This section outlines the quantitative criteria and a systematic methodology for selecting the most suitable model from a pool of candidates.
The relationship between performance and efficiency is typically a trade-off, where improvements in one often lead to a degradation in the other. Therefore, the goal is to identify the model that occupies the “sweet spot” on the performance-efficiency curve, meeting the minimum acceptable performance while maximizing efficiency.
The model selection process is driven by two primary metrics:
  • Performance (P): This refers to the model’s accuracy in detecting the target objects. For object detection tasks, the mean Average Precision (mAP) is the standard metric used to quantify performance. A higher mAP value indicates greater detection accuracy across all object classes.
  • Efficiency (E): This encompasses the computational cost and resource footprint of the model. Key efficiency metrics include the Inference Speed (FPS, Frames Per Second), Model Size (MB), and Computational Cost (GFLOPs). A higher FPS, smaller model size, and lower GFLOPs indicate greater efficiency.
We introduce a utility function (U) that quantifies a model’s suitability based on a weighted combination of its performance and efficiency metrics. This allows for a single, comprehensive score for each candidate model.
Let $P_{model}$ be the performance (mAP) of a candidate model, and $E_{model}$ be its efficiency (e.g., FPS, normalized for comparison). The utility function can be defined as:
$$U(P_{model}, E_{model}) = w_p \cdot \frac{P_{model} - P_{min}}{P_{max} - P_{min}} + w_e \cdot \frac{E_{model} - E_{min}}{E_{max} - E_{min}}$$
where $P_{max}$, $P_{min}$ and $E_{max}$, $E_{min}$ are the maximum and minimum observed performance and efficiency values among all candidate models, used for normalization. $w_p$ and $w_e$ are weighting factors ($w_p + w_e = 1$) that reflect the relative importance of performance and efficiency for the specific application. For instance, in a real-time surveillance system, $w_e$ would be higher to prioritize inference speed, whereas in an offline image analysis task, $w_p$ might be more dominant.
The optimal model is then selected as the one that maximizes the utility function:
$$\mathrm{Optimal\ Model} = \underset{model \in Candidates}{\arg\max}\; U(P_{model}, E_{model})$$
The selection process proceeds through the following steps:
  • Hardware Constraint Analysis: Analyze the target device’s specifications, including its CPU/GPU/NPU capabilities, available RAM, and storage. These constraints define the acceptable range for model size and computational cost.
  • Candidate Pool Generation: A diverse set of lightweight object detection models is trained on the collected dataset. This pool includes models with varying architectures to ensure a wide range of performance-efficiency profiles.
  • Empirical Measurement: Each candidate model is empirically tested on the target hardware to measure its exact performance (mAP) and efficiency metrics (FPS, model size). This step is crucial as theoretical values can differ significantly from real-world performance.
  • Utility Score Calculation: Using the empirical data, the utility score for each model is calculated with pre-defined weights ($w_p$, $w_e$) that align with the project’s priorities (see the sketch following this list).
  • Final Selection: The model with the highest utility score is chosen for deployment. This systematic approach ensures that the final selection is not based on a single metric but on a holistic evaluation tailored to the specific application’s needs.
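A compact way to realize this procedure is to min–max normalize the measured metrics of the candidate pool and rank the models by the weighted utility of Equation (1). The sketch below assumes each candidate is summarized by its measured mAP@0.5:0.95 and FPS; the listed numbers are illustrative placeholders rather than measured results.

```python
def min_max(values):
    """Min-max normalize a list of metric values to the 0-1 range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def select_optimal_model(candidates, w_p=0.5, w_e=0.5):
    """candidates: list of (name, mAP, fps) measured on the target device.
    Returns the candidate maximizing U = w_p * P_norm + w_e * E_norm."""
    p_norm = min_max([m for _, m, _ in candidates])   # normalized performance (mAP)
    e_norm = min_max([f for _, _, f in candidates])   # normalized efficiency (FPS)
    scores = [w_p * p + w_e * e for p, e in zip(p_norm, e_norm)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best][0], scores[best]

# Illustrative placeholder numbers (not measured values from this study):
pool = [("yolov8n", 0.960, 1200.0), ("yolov8s", 0.978, 700.0), ("yolov8m", 0.984, 300.0)]
print(select_optimal_model(pool, w_p=0.6, w_e=0.4))
```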

3.2.3. On-Site Detection and Localization

The final phase of our proposed methodology is to perform real-time anomaly detection using the optimized model on the vehicle’s edge device and to precisely localize the detected objects. This process is crucial for enabling a dynamic and responsive management system. The core of this stage is the fusion of data from multiple on-board sensors to transform a 2D bounding box from an image into a precise 3D geographic coordinate [40].
The optimized object detection model operates on the edge device, processing a continuous video stream from the vehicle’s dash camera. For each video frame, the model performs inference to identify and classify predefined abnormal objects, such as damaged road signs or displaced barriers. The output is a set of bounding boxes ($B_{img}$) for each detected instance. Each bounding box is defined by its pixel coordinates on the image plane, typically represented as ($x_{min}$, $y_{min}$, $x_{max}$, $y_{max}$), along with a confidence score indicating the model’s certainty. This process runs autonomously on the edge device, ensuring minimal latency and network dependency.
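In practice, this per-frame step reduces to a single forward pass and a read-out of box coordinates, classes, and confidences. The minimal sketch below uses the Ultralytics runtime; the exported model file name, the class naming, and the capture device index are assumptions.

```python
import cv2
from ultralytics import YOLO

model = YOLO("sandwich_board_sign.onnx")  # hypothetical exported per-facility model
cap = cv2.VideoCapture(0)                 # dashcam stream (device index is an assumption)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    result = model(frame, verbose=False)[0]        # single forward pass per frame
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()      # pixel coordinates (x_min, y_min, x_max, y_max)
        cls_name = result.names[int(box.cls)]
        conf = float(box.conf)
        if "damaged" in cls_name:                  # abnormal-class naming is an assumption
            print(f"anomaly {cls_name} ({conf:.2f}) at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
cap.release()
```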
For each detected object, the system initiates a data fusion process to determine its precise real-world location. This involves combining the image-based detection with real-time data from the vehicle’s Global Positioning System (GPS) and Inertial Measurement Unit (IMU). This fusion process is essential because a 2D bounding box alone provides no depth information, making it impossible to determine the object’s distance from the camera.
Let $P_{veh}$ be the vehicle’s real-world position at the moment of detection, given by GPS coordinates ($Lat$, $Lon$, $Alt$). The IMU provides the vehicle’s orientation, including its heading (yaw), pitch, and roll, which is used to derive a transformation matrix $T_{veh\_imu}$. This matrix defines the vehicle’s orientation relative to a global coordinate system.
To transform the 2D bounding box ($B_{img}$) into a 3D world coordinate ($P_{obj}$), we apply the principles of camera geometry. This requires two sets of parameters: intrinsic parameters and extrinsic parameters.
  • Intrinsic Camera Parameters ($K$): These are properties inherent to the camera itself, such as focal length ($f_x$, $f_y$), principal point ($c_x$, $c_y$), and skew coefficient. They are typically determined through a one-time camera calibration process and are used to model the camera’s projection of a 3D scene onto a 2D image plane.
$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$
  • Extrinsic Parameters ($T_{ext}$): These define the camera’s position and orientation relative to the global world coordinate system. This is where the GPS and IMU data are integrated. A key assumption is that the road surface is a flat plane, and all detected objects are located on this plane. This simplification allows for more robust depth estimation. The GPS sensor is assumed to be at a known, fixed height ($h_{gps}$) above the road surface.
$$R_{veh\_cam} = Roll \cdot Pitch \cdot Yaw \cdot \begin{bmatrix} d_x & d_y & d_z & 1 \end{bmatrix}^{T},\qquad T_{world\_veh} = \begin{bmatrix} 1 & 0 & 0 & P_{veh}^{Lon} \\ 0 & 1 & 0 & P_{veh}^{Lat} \\ 0 & 0 & 1 & P_{veh}^{Alt} \\ 0 & 0 & 0 & 1 \end{bmatrix},\qquad T_{ext} = T_{world\_veh} \cdot R_{veh\_cam}$$
$T_{world\_veh}$ represents a transformation matrix from the vehicle’s local coordinate system to the global world coordinate system, derived from the GPS and IMU data. $R_{veh\_cam}$ represents the rotation transformation matrix from the vehicle coordinate frame to the camera coordinate frame, taking into account the distance ($d_x$, $d_y$, $d_z$) from the GPS sensor.
The process of converting the 2D image coordinates to 3D world coordinates is fundamentally an inverse projection. For a given detected object, we use the center point of the bottom edge of its bounding box, denoted as ($u_{base}$, $v_{base}$), as the reference point for localization. This point is assumed to be on the road plane, consistent with our flat-road assumption. For a given pixel coordinate ($u$, $v$) within the bounding box and a known depth $Z_c$ in the camera’s coordinate frame, the 3D point $P_c$ can be found:
$$P_c = \begin{bmatrix} X_c & Y_c & Z_c \end{bmatrix}^{T} = Z_c \cdot K^{-1} \cdot \begin{bmatrix} u & v & 1 \end{bmatrix}^{T},\qquad P_{obj} = T_{ext}^{-1} \cdot P_c$$
The depth ($Z_c$) is estimated based on the size of the bounding box and the known dimensions of the target object. This estimated 3D point ($P_c$) is then transformed into the global world coordinate system to yield the final object location ($P_{obj}$).
To mitigate the effects of distance measurement errors and improve the reliability of anomaly reporting, our system implements a filtering mechanism. The accuracy of depth estimation from a 2D image is inversely proportional to the object’s distance from the camera, meaning small objects far away can have large localization errors. Furthermore, as the vehicle moves, a small, distant object will eventually appear larger and closer, at which point its detection can be more accurately localized.
To address this, the system only considers objects for final anomaly reporting if their estimated distance from the vehicle falls within a predefined, reliable range, denoted as $D_{min} < D_{obj} < D_{max}$. The object distance $D_{obj}$ is derived from the depth $Z_c$. This filtering ensures that only detections with a high degree of positional certainty are reported, preventing the system from flagging distant objects prematurely and reducing false positives. The final 3D world coordinates of the detected object, $P_{obj}$, are then transmitted to a central management server. This enables the precise mapping of each detected anomaly onto a digital map, facilitating an efficient and data-driven response to roadside infrastructure issues.
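A compact sketch of this localization and filtering stage is given below. It follows the equations above under the flat-road assumption, estimates depth from the bounding-box height and an assumed physical object height, and keeps only detections inside the reliable distance band; the numeric defaults are illustrative, not the values used in this study.

```python
import numpy as np

def localize_detection(u_base, v_base, box_height_px, K, T_ext, real_height_m=0.7):
    """Back-project the bottom-centre of a bounding box into world coordinates.

    K            : 3x3 intrinsic matrix
    T_ext        : 4x4 extrinsic transform built from GPS/IMU (as defined above)
    real_height_m: assumed physical height of the facility (illustrative value)
    """
    # Pinhole depth estimate from the known object height: Z_c = f_y * H / h_px.
    z_c = K[1, 1] * real_height_m / box_height_px
    # P_c = Z_c * K^-1 * [u, v, 1]^T  (camera frame).
    p_c = z_c * np.linalg.inv(K) @ np.array([u_base, v_base, 1.0])
    # P_obj = T_ext^-1 * P_c  (homogeneous world coordinates, as in the text).
    p_obj = np.linalg.inv(T_ext) @ np.append(p_c, 1.0)
    return p_obj[:3], z_c

def within_reliable_range(distance_m, d_min=2.0, d_max=30.0):
    """Distance gate D_min < D_obj < D_max; only gated detections are reported."""
    return d_min < distance_m < d_max
```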

4. Experiment Setting and Evaluation Metrics

4.1. Experimental Environment

The experiments were conducted with Python 3.10.8, PyTorch 2.0.1 + cu117, and CUDA 11.7 on a computing setup comprising an Intel Xeon W-2235 CPU complemented by two NVIDIA RTX 3090 GPUs, designed to accommodate the computational demands of training multiple YOLO object detection models.
A curated subset of the AI Hub’s roadside facilities dataset was utilized. To enhance dataset integrity and learning effectiveness, images captured from highly similar viewpoints (e.g., near-duplicate angles or positions) were systematically identified and eliminated. This rigorous de-duplication process was critical to preventing performance inflation due to redundant samples and to ensuring robust, generalizable model evaluation.
For the anomaly detection task, a dual-class labeling approach was adopted for each object of interest: both “normal” and “abnormal” classes were explicitly annotated and used during training. Unlike single-class abnormality detection—which is prone to misclassifying normal instances as abnormal in unfamiliar contexts—this scheme was deliberately selected to improve discrimination performance. By training the network to recognize both normal and abnormal examples, the risk of over-detection (false positives) associated with anomaly-only training was substantially mitigated.
The YOLO (You Only Look Once) family of models represents a leading class of one-stage, real-time object detection algorithms. YOLO architectures utilize a single convolutional neural network that divides the input image into a grid and simultaneously predicts bounding boxes and class confidences for all regions in a single pass, making them ideally suited for speed-constrained environments such as smart cameras and embedded systems. For this study, model selection was explicitly bounded by the constraints of prospective edge deployment. Specifically, the evaluation was limited to YOLO versions and their subtypes (namely, n, s, and m models) with baseline model sizes around 50 MB, ensuring compatibility with memory and compute resources typical of devices such as the Raspberry Pi.
Each YOLO version included in the evaluation (with directly comparable n, s, and m variants) was assessed on identical hardware. This allowed direct, controlled comparison of inference speed, memory footprint, and detection accuracy across architectures.
To rigorously analyze model suitability for edge-based deployment in road and facility monitoring applications, validation experiments were designed to evaluate trade-offs between detection performance (accuracy, mAP, recall) and efficiency (model size, inference latency, and memory utilization).
Additionally, for object localization, spatial accuracy analysis was conducted using a global position estimation pipeline: smartphone-extracted geospatial metadata (latitude, longitude, altitude) together with inertial measurement unit (IMU) sensor data enabled calculation of the physical distance between detected objects and their ground truth positions. This quantitative error analysis provided a robust metric for assessing each model’s applicability to field deployment scenarios.
This experimental setup not only reflects the methodological rigor required for performance benchmarking of lightweight object detection models but also provides crucial empirical insights into the feasibility of deploying such models in real-world, resource-constrained edge environments for anomaly and infrastructure monitoring.

4.2. Evaluation Metrics

The performance evaluation of the YOLO models was conducted through a comprehensive set of metrics encompassing accuracy-related indicators and efficiency measures. Fundamental to understanding detection accuracy are the concepts of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). A True Positive refers to a correctly detected object that exists in the ground truth, whereas a True Negative denotes the correct identification of background regions where no object is present. False Positives, often called false alarms, occur when the model incorrectly predicts an object in a non-object region, while False Negatives represent missed detections where the model fails to identify an existing object.
From these basic terms, key performance metrics such as Precision and Recall are derived. Precision measures the ratio of true positive detections to all positive predictions, effectively quantifying the reliability of positive identifications by minimizing false alarms. Recall, in contrast, is the ratio of true positive detections to all actual objects, indicating the model’s ability to detect the majority of real instances and avoid missed detections.
Mean Average Precision (mAP) is an advanced metric summarizing the model’s precision-recall trade-off over multiple confidence thresholds and Intersection over Union (IoU) criteria. Specifically, mAP@0.5 calculates average precision at the typical IoU threshold of 0.5, whereas mAP@0.5:0.95 evaluates performance over a spectrum of stricter IoU thresholds, reflecting the model’s robustness in accurate localization and classification. The F1 score serves as a harmonic mean of precision and recall, providing a singular metric that balances the trade-off between false positives and false negatives.
In addition to accuracy, evaluating computational efficiency and resource demands is vital for practical deployment. Inference time represents the average processing duration for a single input sample, signifying the model’s suitability for real-time applications. Frames per second (FPS) is the inverse metric, measuring throughput capability in continuous input scenarios. Training time indicates the total duration required to learn model parameters, impacting development overhead. The number of parameters quantifies model complexity, influencing storage and computational load, while GFLOPs (Giga Floating Point Operations) measure the computational cost per inference. Model size directly affects storage and memory requirements, with smaller models preferred for deployment on resource-constrained edge devices.
The mathematical formulations and detailed definitions of these metrics, framed with practical application considerations, are summarized in Appendix A Table A1. This enables a holistic understanding of both the predictive accuracy and operational efficiency of the evaluated YOLO variants, guiding optimal model selection for specific deployment contexts.
Additionally, we introduce the 3D Euclidean Distance Error (3DEDE) metric to specifically quantify the positional error of detected objects in the real-world coordinate system. This metric utilizes the camera’s global position coordinates to transform 2D image detections into 3D world coordinates, enabling more practical evaluation of positional accuracy in edge deployment scenarios.
The 3DEDE is defined as:
$$3DEDE = \sqrt{(x_{pred} - x_{gt})^2 + (y_{pred} - y_{gt})^2 + (z_{pred} - z_{gt})^2}$$
where ($x_{pred}$, $y_{pred}$, $z_{pred}$) and ($x_{gt}$, $y_{gt}$, $z_{gt}$) are the predicted and ground truth 3D coordinates, derived from YOLO bounding box centers using camera intrinsic/extrinsic parameters and depth estimation.
In our application context, where roadside facilities are located near the ground, we project the Z-axis onto the ground plane (e.g., setting $z_{pred} = z_{gt} = 0$). This reduces 3DEDE to a 2D Euclidean Distance Error (2DEDE). We report the mean and standard deviation of these errors, with ground truth from GPS-referenced measurements.
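For completeness, this metric reduces to Euclidean distances between predicted and GPS-referenced positions; a minimal NumPy sketch (not tied to the study’s evaluation scripts) follows.

```python
import numpy as np

def euclidean_distance_error(pred, gt, project_to_ground=True):
    """Compute per-detection 3DEDE, or 2DEDE when the z-axis is projected onto the ground plane.

    pred, gt : arrays of shape (N, 3) holding predicted and ground-truth world coordinates.
    Returns the per-sample errors together with their mean and standard deviation.
    """
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    if project_to_ground:
        pred, gt = pred[:, :2], gt[:, :2]   # drop z: 3DEDE -> 2DEDE
    errors = np.linalg.norm(pred - gt, axis=1)
    return errors, errors.mean(), errors.std()
```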

4.3. Evaluation Results Analysis and Discussion

The objective of this experimental evaluation was to identify the optimal YOLO (You Only Look Once) model for deployment on edge devices with specifications similar to those of the Raspberry Pi 4 or 5 for roadside facility anomaly detection tasks. Table 2 provides a comprehensive view of the detection performance and efficiency of seven YOLO versions (v5, v6, v8, v9, v10, v11, v12) and their respective nano (n), small (s), and medium (m) sub-models for the task of distinguishing between normal and abnormal classes of ‘sandwich board sign’ objects. The analysis is based on a rigorously curated dataset, where duplicate images from similar viewpoints were removed, resulting in 21,440 normal and 27,879 abnormal images. Of these, 39,620 images were used for training and 9699 for validation, ensuring a robust evaluation environment.
Figure 3 shows a comparison of the performance indicators for each YOLO version and its sub-models. Performance metrics measure the model’s accuracy and robustness in detecting the target objects. All metrics in this section follow the principle that higher values indicate better performance.
Across all YOLO versions and sub-models, the models demonstrated consistently high precision and recall, with most values exceeding 0.98. This indicates that the models are highly effective at both correctly identifying true positives and minimizing false positives for both normal and abnormal classes. The F1-score, which harmonizes precision and recall, also remained high, reflecting the models’ balanced detection capabilities. Notably, the small (s) and middle (m) models generally achieved slightly higher precision and recall than the nano (n) models, but the differences were marginal, suggesting that even the most lightweight models are sufficiently robust for this application.
The mAP@0.5 metric, which measures the mean average precision at an Intersection over Union (IoU) threshold of 0.5, was above 0.99 for most models, underscoring their strong localization and classification performance. The more stringent mAP@0.5:0.95 metric, which averages precision across multiple IoU thresholds, showed a slight decrease, with values ranging from approximately 0.955 to 0.986. Middle models tended to outperform nano and small models in this metric, particularly in YOLOv8 and YOLOv5, indicating their superior ability to localize objects precisely across varying degrees of overlap.
Figure 4 shows a comparison of the efficiency indicators for each YOLO version and its sub-models. Efficiency metrics quantify the resource demands and operational speed of the models. For edge deployment, lower values are desirable for Inference Time (IT), Training Time (TT), Parameters (Params), GFLOPs, and Model Size (MB), while higher values are desirable for FPS (Frames Per Second).
In terms of computational efficiency, nano models consistently delivered the highest frames per second (FPS) and the lowest inference times, with YOLOv8n and YOLOv5n achieving over 1200 FPS and inference times below 1 ms. This makes them highly suitable for real-time deployment on edge devices, where low latency is critical. As model size increased from nano to small and middle, FPS decreased and inference time increased, reflecting the expected trade-off between model complexity and speed. Middle models, while offering marginally better accuracy, required significantly more computational resources, including longer training times, higher parameter counts, greater GFLOPs, and larger model sizes. Training Time is less critical for deployment but reflects the development overhead. The values provided, which are converted to hours, show a wide range. The quickest models to train were the ‘n’ variants, such as YOLOv6n (0.55 h) and YOLOv8n (0.58 h), highlighting a significant reduction in development cycles for the lightweight architecture.
In particular, the YOLOv8 and YOLOv5 nano and small variants offer the most favorable balance between detection accuracy and computational efficiency. These models deliver near state-of-the-art performance while maintaining the low resource requirements necessary for edge computing. YOLOv10n and YOLOv10s, on the other hand, exhibited slightly lower recall and mAP@0.5:0.95, suggesting that further optimization may be needed for these versions in the context of anomaly detection. YOLOv11 and YOLOv12 maintained high performance and efficiency, with their nano models providing competitive accuracy and speed.
The results, as detailed in Table 2 and visualized in Figure 3 and Figure 4, highlight the practical trade-offs between model size, speed, and detection performance. While middle models offer the highest accuracy, their resource demands may limit their applicability in resource-constrained environments. Nano and small models, especially from YOLOv8 and YOLOv5, emerge as optimal choices for real-time anomaly detection on edge devices, providing a compelling combination of speed, efficiency, and robust detection capability. These findings are consistent with recent literature, which emphasizes the importance of balancing accuracy and efficiency for practical deployment of object detection models in real-world applications, particularly when dealing with small or medium-sized objects and limited hardware resources.
In order to systematically select the optimal YOLO model for the ‘Sandwich Board Sign’ anomaly detection task, we conducted a multi-objective analysis focusing on both detection performance and computational efficiency. Specifically, we examined the relationship between mAP@0.5:0.95—a comprehensive measure of detection accuracy—and two key efficiency metrics: GFLOPs (computational complexity) and model size. As shown in Figure 5, scatter plots were generated for each metric pair, revealing a clear Pareto frontier structure in both cases: as model performance (mAP@0.5:0.95) increases, so do GFLOPs and model size, indicating an inherent trade-off between accuracy and resource consumption.
To enable fair comparison and facilitate model selection, all results from the 21 YOLO sub-models (across seven versions and three scales) were normalized to a 0–1 range for each metric. In the normalized scatter plots, models with mAP@0.5:0.95 values closer to 1 represent higher detection accuracy, while those with GFLOPs and model size values closer to 0 indicate lower computational and memory requirements. This normalization allows for direct visual and quantitative assessment of each model’s position relative to the Pareto frontier, highlighting those that achieve a favorable balance between performance and efficiency.
We applied Formula (1), proposed in Section 3.2.2, to identify the most suitable model for edge deployment. In this analysis, the formula positively weights the normalized mAP@0.5:0.95 (favoring higher accuracy) and penalizes the normalized GFLOPs and model size (favoring lower complexity and a smaller footprint). Models that score best under this weighting are considered more suitable because they combine high detection performance with minimal resource usage. In the normalized graph in Figure 5, models with shorter distances from the optimal-area point can be selected as suitable models for deployment.
The analysis of the normalized scatter plots confirms that the YOLO models form a distinct Pareto frontier: those with the highest accuracy also tend to be the most complex and largest, while the most efficient models may sacrifice some performance. By applying the weighted selection formula, we can objectively rank all candidate models and select those that best meet the dual requirements of accuracy and efficiency for real-world edge computing scenarios. This methodology not only provides a transparent and reproducible framework for model selection but also aligns with recent research trends in neural architecture search and autonomous model optimization for object detection.
In this experiment, the relationship between mAP@0.5:0.95, GFLOPs, and model size for the compared YOLO models revealed a distinct Pareto frontier. This means that as detection performance (mAP@0.5:0.95) increases, both computational complexity (GFLOPs) and model size also tend to increase, highlighting the inherent trade-off between accuracy and efficiency in object detection models.
Based on this analysis, YOLOv8s was identified as the most suitable model for the given task, as it achieves a favorable balance between high detection performance and efficient resource usage. However, it is important to recognize that the optimal model selection can vary depending on the specific deployment objectives and constraints. For example, in scenarios where real-time inference and minimal memory usage are critical—such as on embedded or edge devices—models with lower GFLOPs and smaller sizes may be preferred, even if their mAP@0.5:0.95 is slightly lower. Conversely, in applications where detection accuracy is paramount and sufficient computational resources are available, larger and more complex models may be justified despite their increased resource demands.
Furthermore, the choice of model may be influenced by other performance metrics such as precision, recall, F1-score, and inference speed, as well as the characteristics of the target environment and dataset. For instance, models that excel in detecting small or occluded objects, or those that maintain robust performance across diverse conditions, may be prioritized for certain use cases.
Figure 6 presents visual examples of anomaly detection results for the ‘Sandwich Board Sign’ object using the YOLOv8s model. This experiment used image data captured in a road environment similar to the dataset used for training. The effects of interior and exterior vehicle lighting were not considered, and the data was captured during the day and under light rain. The middle image in the first row of Figure 6 is an example image captured under light rain. While typical object detection models may encounter some occlusions depending on the object location and camera angle, we assumed that, except in some special circumstances, there would be no occlusions for auxiliary facilities installed near actual road surfaces. The detection outcomes are systematically arranged to illustrate key scenarios encountered during evaluation. The images in the left column demonstrate cases where only normal (undamaged) signs were correctly detected, as indicated by the blue bounding boxes labeled ‘Normal’ along with high confidence scores (e.g., 0.94, 0.97). In these instances, the model reliably identifies intact facilities in diverse environments.
The middle column showcases detection results for cases where solely abnormal (damaged) signs are present within the scene. Here, the model successfully localizes the damaged signs, assigning the label ‘Damaged’ and consistently high confidence values, exemplifying effective anomaly identification across various viewpoints and backgrounds.
The right column depicts images containing both normal and abnormal signs simultaneously. The model accurately distinguishes and detects both object types within the same frame, correctly assigning their respective classes and confidence scores. Such results demonstrate the model’s ability to handle complex real-world scenarios involving co-occurrence of normal and abnormal objects, which is essential for practical infrastructure monitoring and maintenance applications.
Of particular note is the bottom-right image in the right column, where a ‘False Negative’ is marked. In this example, the model failed to detect an abnormal sign that is present in the scene, highlighting one of the limitations in rare or edge cases where object visibility, occlusion, or perspective may hinder proper detection. However, in this application, there is a high probability that such a missed object will be detected as the vehicle approaches it, depending on its direction of travel.
In addition to the images included in Figure 6, a separate set of 100 images was captured using a smartphone, allowing for the acquisition of both GPS coordinates and IMU sensor data (pitch, roll, yaw angles) at the time of image capture. By leveraging this information, the position of each detected object—specifically, the center-bottom point of the bounding box—could be estimated relative to the smartphone’s location, enabling the calculation of latitude and longitude for both normal and abnormal ‘Sandwich Board Sign’ objects.
During the analysis, it was observed that when detected objects appeared near the top of the image frame, the error in distance estimation (2D Euclidean Distance Error, 2DEDE) tended to increase. The estimated 2DEDE for the ‘sandwich board sign’ object detected in the test image ranged from approximately 1.2 to 28 m, depending on the distance to the object and environmental factors. This is likely due to the geometric projection and increased uncertainty in estimating the real-world position of objects that are farther from the camera or at steeper viewing angles. As a result, the accuracy of location estimation is highest for objects situated closer to the center or lower part of the image, where the projection from image coordinates to real-world coordinates is more reliable.
Furthermore, it was found that selectively filtering out detection results for objects located beyond a certain distance threshold from the smartphone can effectively reduce both detection errors and location estimation inaccuracies. By applying such a distance-based filtering strategy, only objects within a practical and reliable range are considered for facility management, thereby improving the overall robustness of the anomaly detection and spatial mapping process.
These findings demonstrate that, by combining YOLOv8s-based object detection with smartphone sensor data, it is possible not only to identify and classify roadside facilities but also to estimate their real-world locations with practical accuracy. The ability to selectively apply detection results based on estimated distance further enhances the reliability of the system, making it a valuable tool for efficient and scalable management of road infrastructure.
In order to validate the efficiency of the single object detection model and its applicability to various objects, comparative experiments were conducted using the selected YOLOv8s model from the proposed experimental scenario. This experiment utilized a refined dataset of eight types of roadside facilities, removing similar or duplicate images. We performed anomaly detection simultaneously for all objects and individually for each of the eight object types. The experimental results for each model using the validation data are presented in Table 3. Training time increased linearly with the number of training samples per class. From an efficiency perspective, the differences in model size and GFLOPs were only driven by the number of classes, and these differences were not significant. However, from a detection performance perspective, this experiment demonstrated that a single object detection approach using separate models for each class can be more efficient than using a single model to detect multiple object classes.
The YOLOv8s models fine-tuned to detect traffic cones and tubular markers performed slightly worse on average than the multi-object YOLOv8s model. Traffic cones and tubular markers are relatively small compared with the other objects, which accounts for their slightly lower detection performance; it also reflects that YOLOv8s was selected based on the sandwich board sign. Detection performance could be improved by switching to a newer YOLO version better suited to small objects, or to the larger m, l, or x sub-models of YOLOv8.
These results suggest that, rather than relying on a single integrated model that detects multiple objects simultaneously, selecting an appropriate model optimized for each object type, considering both detection performance and computational efficiency, can offer a more robust and efficient solution.
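As an illustration of this per-object strategy, the sketch below fine-tunes one YOLOv8s model per facility type using the Ultralytics API; the dataset YAML paths, epoch count, and export format are assumptions for demonstration purposes, not the exact training configuration used in this study.

```python
from ultralytics import YOLO

# Train one single-object YOLOv8s model per facility type. Each dataset YAML
# is assumed (hypothetically) to define two classes: normal and damaged.
FACILITY_DATASETS = {
    "traffic_cone": "datasets/traffic_cone.yaml",
    "tubular_marker": "datasets/tubular_marker.yaml",
    "sandwich_board_sign": "datasets/sandwich_board_sign.yaml",
}

for name, data_yaml in FACILITY_DATASETS.items():
    model = YOLO("yolov8s.pt")                       # start from pretrained weights
    model.train(data=data_yaml, epochs=100, imgsz=640, project="runs", name=name)
    metrics = model.val()                            # per-model validation metrics
    model.export(format="onnx")                      # lightweight format for edge deployment
```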

5. Discussion and Conclusions

The findings of this study offer significant insights into the deployment and optimization of lightweight YOLO models for anomaly detection in roadside facility management, particularly in the context of cloud sensing and distributed edge AI. By systematically comparing seven YOLO versions and their respective nano, small, and medium sub-models, we observed a clear Pareto frontier between detection performance (mAP@0.5:0.95) and computational efficiency (GFLOPs and model size). This trade-off is consistent with previous research, which has emphasized the challenge of balancing accuracy and resource consumption in real-time object detection tasks.
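As a simple illustration of this trade-off analysis, the sketch below extracts the Pareto frontier over mAP@0.5:0.95 and GFLOPs for a few sub-models taken from Table 2; the full selection procedure in this study also normalizes the remaining performance and efficiency metrics.

```python
# Identify Pareto-optimal models: higher mAP@0.5:0.95 is better, lower GFLOPs
# is better. Candidate values are taken from Table 2; the selection logic is
# a simplified illustration.
candidates = {
    "YOLOv8n": {"map5095": 0.978, "gflops": 8.2},
    "YOLOv8s": {"map5095": 0.983, "gflops": 28.6},
    "YOLOv11n": {"map5095": 0.975, "gflops": 6.4},
    "YOLOv6m": {"map5095": 0.975, "gflops": 161.6},
}

def pareto_front(models):
    """Return models not dominated by any other model (i.e., no other model
    has >= mAP and <= GFLOPs while being strictly better in at least one)."""
    front = []
    for name, m in models.items():
        dominated = any(
            o["map5095"] >= m["map5095"] and o["gflops"] <= m["gflops"]
            and (o["map5095"] > m["map5095"] or o["gflops"] < m["gflops"])
            for other, o in models.items() if other != name
        )
        if not dominated:
            front.append(name)
    return front

print(pareto_front(candidates))  # YOLOv6m is dominated; the other three remain
```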
Our experiments, conducted on a rigorously curated dataset and validated with smartphone-captured images, confirmed that models such as YOLOv8s can achieve high detection accuracy while maintaining efficient inference and a manageable model size. The integration of GPS and IMU sensor data enabled reliable estimation of object locations, and selectively filtering detection results based on estimated distance proved effective in reducing both false positives and position errors, thereby improving system robustness.
A key contribution of this work is the demonstration of the advantages of deploying multiple, optimized single-object detection models across a fleet of vehicles, rather than relying on a single multi-object model. Each single-object model can be tailored for its specific detection task, resulting in improved accuracy, reduced computational load, and smaller model size. This approach is particularly well-suited for edge deployment, where resource constraints are critical. By distributing different single-object detection models to multiple vehicles, the system leverages the strengths of cloud sensing: decentralized data collection and processing enable real-time anomaly detection and facility monitoring at scale. Each vehicle acts as an intelligent sensor node, contributing specialized detection results to a centralized cloud platform, which enhances scalability, robustness, and coverage.
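The sketch below outlines how such a vehicle-side node might forward a single detection event to the aggregation platform; the endpoint URL and message schema are hypothetical placeholders, not the actual interface of the proposed system.

```python
import time
import requests  # assumed to be available on the edge device

# Hypothetical vehicle-to-cloud report for one detection event.
CLOUD_ENDPOINT = "https://example.org/api/v1/anomaly-reports"  # placeholder URL

def report_detection(vehicle_id, object_type, state, lat, lon, confidence):
    """Send one detection (e.g., a damaged sandwich board sign with its
    estimated coordinates) to the central cloud platform."""
    payload = {
        "vehicle_id": vehicle_id,
        "object_type": object_type,   # e.g., "sandwich_board_sign"
        "state": state,               # "normal" or "damaged"
        "latitude": lat,
        "longitude": lon,
        "confidence": confidence,
        "timestamp": time.time(),
    }
    resp = requests.post(CLOUD_ENDPOINT, json=payload, timeout=5)
    resp.raise_for_status()
    return resp.status_code
```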
The cloud sensing paradigm further enables efficient data aggregation, collaborative analysis, and rapid deployment of model updates. Facility managers gain access to a comprehensive, up-to-date map of infrastructure status, facilitating timely maintenance and decision-making. The flexibility to select and deploy the most suitable model for each object type—based on performance, efficiency, and deployment context—ensures that the system remains adaptable and cost-effective.
The implications of this work extend beyond roadside facility anomaly detection. The methodology and analysis framework presented here can be adapted to other domains requiring real-time, resource-efficient object detection and spatial mapping, such as smart city infrastructure, environmental monitoring, and industrial inspection. The use of normalized multi-objective metrics and Pareto optimization provides a transparent and reproducible approach to model selection, supporting scalable deployment across diverse hardware platforms.
Future research directions may include further refinement of spatial estimation algorithms, integration of additional sensor modalities, and exploration of adaptive model selection strategies based on dynamic operational requirements. Real-world field trials and feedback-driven model updates will be essential for enhancing generalization and reliability. While this study primarily proposed and validated a collaborative anomaly detection framework for vehicle-mounted edge AI applications, further improvements in detection model robustness under severe weather and occluded scenarios are essential for real-world deployment. Incorporating simulated adverse weather conditions and systematic occlusion scene generation through advanced data augmentation will be explored in future research, strengthening the adaptability and reliability of single-object detection models within the system.
In summary, this study advances the understanding of lightweight object detection model deployment in resource-constrained environments, offering practical solutions and a foundation for continued innovation in intelligent infrastructure management and collaborative cloud sensing systems.

Author Contributions

Conceptualization, J.K., W.L. and B.K.; methodology, J.K., W.L. and B.K.; software, S.J., Y.C. and J.K.; validation, S.J., Y.C. and J.K.; formal analysis, J.K., W.L. and B.K.; investigation, S.J., Y.C. and J.K.; resources, S.J., Y.C. and J.K.; data curation, S.J. and Y.C.; writing—original draft preparation, S.J., Y.C. and J.K.; writing—review and editing, all authors; visualization, S.J., Y.C. and J.K.; supervision, W.L. and B.K.; project administration, W.L. and B.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

This paper used datasets from ‘The Open AI Dataset Project (AI-Hub, S. Korea)’. All data information can be accessed through AI Hub “https://www.aihub.or.kr (accessed on 16 October 2025)”.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. The performance evaluation metrics.

Type | Evaluation Metric | Formulation | Description
Performance (Accuracy) | Precision | TP / (TP + FP) | High precision reduces false alarms of object damage, preventing unnecessary inspections and resource waste.
Performance (Accuracy) | Recall | TP / (TP + FN) | High recall minimizes missed detections of anomalies, ensuring early response to safety risks.
Performance (Accuracy) | mAP@0.5 | mAP = (1/N) Σ_{i=1}^{N} AP_i | Strong mAP@0.5 confirms that damage can be consistently localized and classified under standard inspection criteria.
Performance (Accuracy) | mAP@0.5:0.95 | – | High mAP@0.5:0.95 demonstrates stable detection of object defects across stricter evaluation conditions, proving robustness in real-world monitoring.
Performance (Accuracy) | F1 | 2 × Precision × Recall / (Precision + Recall) | A balanced F1 score means the system both avoids false detections and captures true damage, achieving efficient and safe infrastructure management.
Efficiency (Lightweight) | FPS | 1 / Inference Time (s) | Real-time guarantee.
Efficiency (Lightweight) | TT | – | A shorter training time reduces development cost. Training time depends on the size of the dataset and the complexity of the model.
Efficiency (Lightweight) | IT | Inference Time = Total Processing Time / Number of Inferences | A shorter inference time enables real-time detection and faster system response, enhancing safety and efficiency.
Efficiency (Lightweight) | Params. | – | More parameters make the model more expressive, but also increase computational cost and storage demand.
Efficiency (Lightweight) | GFLOPs | – | Lower GFLOPs indicate higher efficiency and faster inference on limited resources.
Efficiency (Lightweight) | Model size | – | Model size directly impacts deployment cost and storage needs; smaller models are well suited for memory-constrained devices such as mobile and embedded systems, while also improving loading time and transfer efficiency.
Positioning | 2DEDE | sqrt((x_pred − x_gt)² + (y_pred − y_gt)²) | The smaller the distance error, the more accurately the location of the actual object can be estimated.
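For reference, the metrics listed in Table A1 can be computed with straightforward functions such as the sketch below, written in plain Python for illustration only.

```python
import math

# Minimal sketch of the Table A1 metrics over detection counts and positions.
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0

def mean_ap(ap_per_class):
    """mAP: the mean of per-class average precision values."""
    return sum(ap_per_class) / len(ap_per_class)

def two_d_ede(pred_xy, gt_xy):
    """2D Euclidean distance error between predicted and ground-truth positions."""
    return math.hypot(pred_xy[0] - gt_xy[0], pred_xy[1] - gt_xy[1])
```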

Figure 1. Example images of normal/damaged states by target object category.
Figure 2. Procedures of our proposed schemes: (a) data collection and annotation; (b) model training and deployment on edges; (c) on-site detection.
Figure 3. Comparison of key performance indicators by YOLO version.
Figure 4. Comparison of key efficiency indicators by YOLO version.
Figure 5. Comparison of key efficiency indicators by YOLO version.
Figure 6. Results of anomaly detection using the selected YOLO sub-model (YOLOv8s). Blue boxes indicate normal and cyan boxes indicate abnormal.
Table 1. Dataset Composition and Distribution.

Category | Class | Number of Images (EA) | Ratio (%)
PE Drum | Normal | 30,511 | 4.36
PE Drum | Damaged | 40,946 | 5.84
PE Guardrails | Normal | 55,577 | 7.93
PE Guardrails | Damaged | 31,918 | 4.57
No Parking Cone | Normal | 48,027 | 6.86
No Parking Cone | Damaged | 31,970 | 4.56
Traffic Cone | Normal | 60,276 | 8.60
Traffic Cone | Damaged | 58,462 | 8.34
Tubular Marker | Normal | 72,370 | 10.33
Tubular Marker | Damaged | 39,561 | 5.65
Snow Removal Box | Normal | 30,299 | 4.32
Snow Removal Box | Damaged | 50,643 | 7.23
Sandwich Board Sign | Normal | 41,572 | 5.93
Sandwich Board Sign | Damaged | 55,655 | 7.94
PE Fence | Normal | 26,426 | 3.77
PE Fence | Damaged | 26,430 | 3.77
Total | | 700,643 | 100
Table 2. The performance evaluation for the ‘sandwich board sign’ detection by YOLO version.

Model | Precision | Recall | F1 | mAP@0.5 | mAP@0.5:0.95 | FPS | IT (ms) | TT (Hour) | Params (EA) | GFLOPs | Model Size (MB)
YOLOv5n | 0.99 | 0.991 | 0.990 | 0.995 | 0.975 | 1250 | 0.8 | 0.64 | 2,503,854 | 7.2 | 5.2
YOLOv5s | 0.992 | 0.993 | 0.992 | 0.995 | 0.981 | 833 | 1.2 | 0.65 | 9,112,310 | 23.8 | 18.6
YOLOv5m | 0.991 | 0.99 | 0.990 | 0.995 | 0.984 | 357 | 2.8 | 0.91 | 25,066,278 | 64.4 | 50.5
YOLOv6n | 0.981 | 0.981 | 0.981 | 0.994 | 0.964 | 1000 | 1 | 0.55 | 4,238,342 | 11.9 | 8.7
YOLOv6s | 0.986 | 0.985 | 0.985 | 0.994 | 0.974 | 370 | 2.7 | 0.66 | 16,306,230 | 44.2 | 32.8
YOLOv6m | 0.988 | 0.984 | 0.986 | 0.994 | 0.975 | 208 | 4.8 | 1.23 | 51,997,798 | 161.6 | 104.3
YOLOv8n | 0.991 | 0.991 | 0.991 | 0.995 | 0.978 | 1428.6 | 0.7 | 0.58 | 3,011,238 | 8.2 | 6.65
YOLOv8s | 0.993 | 0.993 | 0.993 | 0.995 | 0.983 | 500 | 2 | 0.84 | 11,136,374 | 28.6 | 22.6
YOLOv8m | 0.99 | 0.992 | 0.991 | 0.995 | 0.986 | 169.5 | 5.9 | 0.94 | 25,857,478 | 79.1 | 52.1
YOLOv9t | 0.982 | 0.975 | 0.978 | 0.993 | 0.962 | 1000 | 1 | 1.20 | 2,005,798 | 7.8 | 4.4
YOLOv9s | 0.984 | 0.985 | 0.984 | 0.994 | 0.973 | 345 | 2.9 | 1.22 | 7,167,862 | 26.7 | 15.2
YOLOv9m | 0.987 | 0.989 | 0.988 | 0.995 | 0.978 | 285.7 | 3.5 | 1.21 | 20,014,438 | 76.5 | 40.8
YOLOv10n | 0.966 | 0.958 | 0.962 | 0.99 | 0.955 | 1111.1 | 0.9 | 0.82 | 2,707,820 | 8.4 | 5.7
YOLOv10s | 0.978 | 0.968 | 0.973 | 0.993 | 0.97 | 454.5 | 2.2 | 0.88 | 8,067,900 | 24.8 | 16.5
YOLOv10m | 0.986 | 0.972 | 0.979 | 0.994 | 0.977 | 333 | 3 | 1.14 | 15,314,326 | 58.9 | 33.4
YOLOv11n | 0.99 | 0.99 | 0.990 | 0.994 | 0.975 | 1111.1 | 0.9 | 0.69 | 2,590,230 | 6.4 | 5.6
YOLOv11s | 0.987 | 0.986 | 0.986 | 0.994 | 0.98 | 526.3 | 1.9 | 0.72 | 9,413,574 | 21.3 | 19.1
YOLOv11m | 0.99 | 0.993 | 0.991 | 0.995 | 0.985 | 212.8 | 4.7 | 0.99 | 20,054,550 | 68.2 | 40.7
YOLOv12n | 0.985 | 0.99 | 0.987 | 0.995 | 0.975 | 833.3 | 1.2 | 0.84 | 2,568,422 | 6.5 | 5.6
YOLOv12s | 0.983 | 0.989 | 0.986 | 0.995 | 0.981 | 556 | 1.8 | 0.90 | 9,253,910 | 21.5 | 19
YOLOv12m | 0.989 | 0.992 | 0.990 | 0.995 | 0.984 | 270 | 3.7 | 1.24 | 20,139,030 | 67.7 | 40.9
Table 3. Comparison of the YOLOv8s-based multi-object model and single-object detection models. The dataset counts do not sum to the total because a single image file can contain two or more object types.

Type of Object | Dataset (Train/Val.) | Precision | Recall | F1 | mAP@0.5 | mAP@0.5:0.95 | TT (Hour) | GFLOPs | Params (EA) | Model Size (MB)
8 Objects (16 Classes) | 174,633/43,455 | 0.976 | 0.973 | 0.947 | 0.991 | 0.946 | 2.76 | 28.7 | 11,141,792 | 22.6
PE Drum | 17,866/4,493 | 0.979 | 0.967 | 0.973 | 0.992 | 0.951 | 0.28 | 28.6 | 11,136,374 | 22.6
PE Barrier | 22,181/5,642 | 0.977 | 0.98 | 0.978 | 0.993 | 0.962 | 0.35 | 28.6 | 11,136,374 | 22.6
No Parking Cone | 18,333/4,563 | 0.975 | 0.984 | 0.979 | 0.994 | 0.957 | 0.29 | 28.6 | 11,136,374 | 22.6
Traffic Cone | 30,430/7,348 | 0.974 | 0.972 | 0.973 | 0.992 | 0.936 | 0.48 | 28.6 | 11,136,374 | 22.6
Tubular Marker | 16,962/4,316 | 0.959 | 0.953 | 0.95 | 0.988 | 0.892 | 0.27 | 28.6 | 11,136,374 | 22.6
Snow Removal Box | 24,858/6,226 | 0.996 | 0.997 | 0.996 | 0.995 | 0.962 | 0.39 | 28.6 | 11,136,374 | 22.6
Sandwich Board Sign | 39,620/9,699 | 0.993 | 0.993 | 0.993 | 0.995 | 0.983 | 0.63 | 28.6 | 11,136,374 | 22.6
PE Fence | 16,423/4,104 | 0.995 | 0.995 | 0.995 | 0.994 | 0.983 | 0.26 | 28.6 | 11,136,374 | 22.6
