Article

A Deep Learning Framework for Real-Time Pothole Detection from Combined Drone Imagery and Custom Dataset Using Enhanced YOLOv8 and Custom Feature Extraction

by Shiva Shankar Reddy 1,*, Midhunchakkaravarthy Janarthanan 2, Inam Ullah Khan 1,3 and Kankanala Amrutha 4

1 Faculty of Engineering and Built Science, Lincoln University College, Petaling Jaya 47301, Malaysia
2 Faculty of AI Computing and Multimedia, Lincoln University College, Petaling Jaya 47301, Malaysia
3 Faculty of Computing and Informatics, Multimedia University, Cyberjaya 63000, Malaysia
4 Department of Computer Science and Engineering, Sagi Rama Krishnam Raju Engineering College (Autonomous), Bhimavaram 534204, Andhra Pradesh, India
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(5), 898; https://doi.org/10.3390/math14050898
Submission received: 16 January 2026 / Revised: 25 February 2026 / Accepted: 4 March 2026 / Published: 6 March 2026
(This article belongs to the Special Issue Advances in Machine Learning and Graph Neural Networks)

Abstract

Road safety depends heavily on the timely identification and repair of potholes; however, pothole detection is challenging under varying lighting and weather conditions. This work presents an attention-enhanced object detection framework for aerial pothole detection that builds on a pre-trained YOLOv8 backbone and a custom feature-extraction network, the Feature Pyramid Network (FPN). An enhanced detection head makes the model aware of spatially discriminative regions, enabling accurate pothole localization and overcoming major limitations of standard YOLOv8 in aerial road inspection, irrespective of road-surface type. The architecture incorporates a purpose-built data layer and a preprocessing engine that accommodates scenarios such as seasonal changes and bad weather. To further enhance learning dynamics, a customized loss function and a new optimizer framework are incorporated to improve convergence and overall detection reliability. Specifically, a custom differential optimizer applies layer-wise adaptive learning rates and momentum-based gradient updates to suppress false positives and accelerate convergence, while a custom IoU-based loss function, combined with real-time validation, stabilizes training across a range of road conditions. A major feature of the proposed system is its ability to process aerial imagery from unmanned drone platforms. Empirical analysis confirms strong results: an average precision of 0.980 at an IoU threshold of 0.5 and an F1-score of 0.97 at a confidence threshold of 0.30; precision reaches 0.97 at the 90-percent confidence level. These metrics show how well the model balances false positives and false negatives, a critical requirement in safety-critical deployments. The results make the framework a promising, scalable, and reliable candidate for integration into smart transportation systems and autonomous vehicle navigation.
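The Intersection-over-Union (IoU) criterion underlying the reported AP@0.5 metric follows the standard definition; the sketch below is a generic illustration for axis-aligned boxes, not the paper's implementation:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# At AP@0.5, a predicted pothole box counts as a true positive
# only when its IoU with a ground-truth box is at least 0.5.
```

An IoU-based loss typically minimizes 1 − IoU (or a variant such as GIoU/CIoU); the exact form of the custom loss used here is described in the methodology, not in this generic sketch.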

1. Introduction

A reliable, well-maintained transportation network provides the backbone of any thriving economy, supporting public safety, social connectivity, and sustainable urban growth. Yet road surfaces around the world keep degrading, forming cracks, potholes, and faded lane markings whose accumulation makes roads unsafe for all traffic and hampers wider infrastructure development. Conventional road inspection practices, largely based on manual field surveys, are inherently labour-intensive, slow, and expensive. Beyond their operational inefficiency, such approaches introduce a significant margin of human error and routinely expose inspection personnel to hazardous roadside environments. As urban centres evolve towards smarter, increasingly autonomous configurations, the demand for road-monitoring solutions that deliver accurate, continuous, and scalable data has grown correspondingly. Advances in deep learning, computer vision, and artificial intelligence open meaningful avenues for automating infrastructure assessment, enabling quick and reliable surface-defect identification without constant human involvement. These capabilities have practical value not only in densely populated cities but also in underserved rural corridors. Among the various types of road-surface damage, potholes are especially dangerous. They hasten the deterioration of the surrounding infrastructure, increase public vehicle maintenance costs, and are a direct contributing factor in road accidents. In rapidly developing countries such as India, the problem is further worsened by extreme weather conditions and continuously increasing traffic volumes, which place exceptional stress on ageing road networks. Sensor-equipped inspection vehicles, although usable in controlled environments, struggle to cover wide road networks with heavy traffic.
This lack of flexibility has led to a shift in research towards frameworks based on artificial intelligence (AI) for detection, such as convolutional neural networks, image segmentation, and object recognition architectures, to identify potholes and lane anomalies more efficiently. These systems have been specifically developed to work in adverse conditions in the real world, such as poor illumination, physical obstructions, wet surfaces, and degraded lane markings. Embedded within a network of intelligent transportation systems, such technologies make meaningful contributions to the management of road assets. At the same time, they can focus on achieving the much larger objectives of smart city development and the introduction of autonomous vehicles.
A good example of such an approach is FLLENet, a framework specifically designed to improve lane detection [1]. By combining attention-based mechanisms with zero-reference learning, it achieves highly effective lane identification even in adverse lighting conditions. It belongs to an ever-growing collection of novel applications that highlight the utility of deep learning architectures for realistic transportation tasks. Lane detection is a basic building block for autonomous cars and advanced driver-assistance systems; it provides the spatial awareness needed to stay safely within the correct lane. The accuracy of such mechanisms, however, degrades significantly as the road surface deteriorates; blurred lane markings, pavement deformations, and potholes create visual discontinuities that disrupt advanced detection pipelines, making it difficult to identify lane boundaries and maintain the vehicle's correct position.
Low-light conditions pose specific challenges for semantic segmentation algorithms, as low contrast and a lack of perceptible edges make it extremely difficult for these methods to detect lane markings. In the face of these limitations, the development of specialized models that maintain lane visibility under adverse conditions has attracted significant research effort, aiming to maximize the potential of deep learning architectures in real-world, degraded, and restricted visual environments [2]. This work shows that end-to-end learning can improve vehicle reliability and enable real-time operation on embedded hardware. The Transformer and Vision Transformer (ViT) paradigms have had a significant impact on lane identification research, thanks to their diverse attention mechanisms and receptive fields. A Reformer-based design uses reversible layers and Locality-Sensitive Hashing (LSH) [3] to handle complex sequence dependencies better and to generalize more reliably in real-vehicle trials. LiDAR-based methods are gaining popularity because they are insensitive to changes in lighting conditions; lane recognition in LiDAR data has improved through enhanced multi-head self-attention in a ViT-based framework [4]. Multitask learning frameworks offer another route to better road-scene understanding by simultaneously detecting potholes, lanes, and traffic objects. The CFFM architecture's pixel-centric modules and cross-layer feature fusion enable simultaneous semantic segmentation and object recognition; the method handles task conflicts and feature duplication while learning many tasks, markedly improving the environmental awareness of unmanned systems [5]. In countries like India [6], potholes are even harder to detect because roads are poorly maintained, surfaces are uneven, and markings are obscured; images of Indian roads often contain obstructions, wear and tear, and weather effects. To handle these unpredictable conditions, a YOLOv8n-based detection pipeline was developed.
Combining image enhancement with object identification is a major step towards improving vision in rural and semi-urban areas. An optimization process based on diffusion models improved rural road imagery, yielding more precise lane lines, trimmed vegetation, and clearer traffic signs. Explainable AI algorithms such as SHAP and XGBoost [7] have been used to analyze the many factors that affect driving speed, making it easier to characterize speed and driver behaviour. System-level improvements such as exit-ramp management and truck platooning, alongside object-based solutions, have offered new ways to think about how road conditions affect traffic flow and safety. A safety-first nonlinear model predictive control (NMPC) method enabled truck platoons to change lanes without collisions, and potholes were repaired following the procedure in [8].
To implement these changes, vision-based technologies are crucial. Digital cameras, drones, and image-processing algorithms make it far easier to locate and categorize road defects in high-resolution images. Object-identification frameworks and convolutional neural networks (CNNs) such as YOLOv4, YOLOv5, and YOLOv7 [9,10] can analyze road-surface data effectively. UAVs running the YOLOv7 algorithm have detected potholes, demonstrating that airborne surveillance can monitor entire road networks. Drones, or unmanned aerial vehicles (UAVs), are increasingly used for road inspection. They offer many advantages, such as adaptability, a wide field of view, and the ability to cover large areas quickly, benefits that are especially useful in places hard to reach on foot or by vehicle. UAVs equipped with imaging systems can transmit real-time data on the road surface, and deep learning systems can then use the collected data to identify road-surface defects [11]. Systems that depend on visual signals may, however, underperform in low light, heavy rain, or snow. Real-time detection also requires lightweight, efficient models, as it must run on edge devices with limited processing power [12].
Predictive modelling to determine how much wear and tear a road will experience is also becoming more popular. By combining traffic data with environmental factors like average temperature, relative humidity, snowfall, and vehicle volume, it is now possible to build machine learning models. These methods were used to determine how often potholes occur in cities. Researchers examined how well multivariate regression models performed with weather data for Seoul’s road networks [13]. These methods provide predictive maintenance, enabling prompt resource allocation and the avoidance of potentially disastrous accidents or damage. Researchers have investigated hybrid system architectures that can simultaneously handle detection and predictive inference, because neither alone can provide robust deployment in the real world. Systems like Eye Way [14] have made progress by combining externally mounted AI modules with real-time visualization tools. Nonetheless, technical challenges associated with different environmental conditions remain, preventing consistent performance. Alzamzami O et al. [15] propose a deep learning-based pothole detection framework using UAV aerial images to automate road inspection processes. The system integrates convolutional neural networks with high-resolution drone imagery to accurately localize and classify potholes, enabling large-scale surface monitoring. Their results show that UAV-assisted inspection is a more powerful and scalable alternative to traditional methods for assessing roads from the ground. Computer vision, therefore, has significant potential to improve road safety and reduce traffic accidents. Yet the actual implementation still shows a range of unsolved problems. 
These include heavy computational processing requirements, the difficulty of transferring trained models across different geographic and environmental contexts, the persistent scarcity of large-scale annotated datasets, and the need for fast, efficient inference under resource constraints. Compounding these problems is the challenge of distinguishing co-occurring road-surface anomalies, for example a pothole, faded lane markings, and surface fractures, especially when no sensor fusion is available and computation is limited. The field also lacks sufficiently standardized evaluation benchmarks that account for real-world variables such as shadows, physical obstructions, and camera-angle distortions.
Despite this progress, research on intelligent transportation is still oriented towards the construction of detection systems that can: operate in real time, generalize across diverse deployment environments, run on low-power edge hardware, and make effective use of current deep learning methodologies. The next generation of road anomaly detection systems is expected to significantly change the standards for vehicle safety and the way roads are maintained, thanks to advances in data-driven optimizations, attention-based modelling, and sophisticated image-processing techniques.
The human cost of ignoring potholes is measurable and makes the case for addressing them urgent. Approximately 1000 accidents and deaths occur each year directly due to poor road surfaces, and many could be prevented through early detection and timely maintenance. The need for effective pothole detection systems is more evident now than ever, as documented by real-world incidents in the literature and by reputable news outlets highlighting the devastating effects of poor road conditions left unaddressed.
The incidents shown in Figure 1 [16] provide strong proof of the immediate need for the deployment of state-of-the-art pothole detection and monitoring systems. In this context, unmanned aerial vehicles have become revolutionary technology, changing how roads are inspected and controlled. Compared with traditional manual surveys and vehicle-based equipment used to identify cracks, defects, and other minor road conditions, drone-based approaches are a much faster, cost-effective, and certainly more accurate method for observing road surfaces in areas that are difficult to reach or hazardous for personnel on the ground. These limitations of traditional methods have made UAVs a key solution for remotely identifying potholes: they are autonomous, can acquire high-resolution imagery, provide real-time information, and enable large-scale 3D reconstruction.
Next-generation UAV-based profiling has surpassed alternative measurement technologies, such as rotary laser levelling and conventional road-surface testing, achieving sub-centimetre precision in surface modelling and defect identification [17]. In addition, deep learning-based detection algorithms, e.g., YOLOv5s-Road, have been developed to robustly detect road-surface imperfections, such as rutting, cracks, and potholes, under severe environmental conditions, enabling resilient detection from drone imagery [18]. To complement these imaging technologies, research using multi-agent reinforcement learning (MARL) has studied economic truck–drone collaboration to maximize real-time logistics and road-observation capabilities, focusing on drones' responsiveness in dynamic transportation systems [19]. Concurrently, the Internet of Drones (IoD) paradigm has enabled communication schemes such as Smooth Message Dissemination (SMD) for effective data exchange between drone groups and control servers, an imperative for large-scale pothole-monitoring networks [20].
Additionally, drone image segmentation and 3D reconstruction technologies previously applied to measuring agricultural canopies have shown strong potential for adaptation to road-surface mapping, with much higher accuracy in identifying small surface deformations [21]. This advancement highlights the increasing importance of UAV technology in modern infrastructure management, in which drones act not only as a sensing technology but as intelligent agents for pothole detection, decision-making, and maintenance. Combining computer vision, machine learning, and real-time sensing for urban road management can significantly reduce accident numbers. Beyond strengthening pothole detection and citizen safety, such systems facilitate predictive infrastructure repair, reducing long-term maintenance costs. It is high time road safety initiatives moved beyond mere signage and low-tech measures and adopted modern, intelligent, and autonomous solutions.

2. Literature Review

Recent advances in deep learning have improved automated pothole detection systems; however, these systems still face several limitations concerning real-time performance, model efficiency, and robustness across diverse environments and defect types. Many innovative models have been proposed to overcome these limitations and open the door to intelligent transportation systems (ITS) and efficient infrastructure maintenance.
Luo et al. [22] proposed an enhanced version of the EfficientDet architecture for road damage detection, with the main goal of improving detection accuracy. A key component, the Feature-Extraction Enhancement Module (FEEM), was introduced, and the Lightweight Cascaded Bidirectional Feature Pyramid Network (LC-BiFPN) was proposed for feature extraction. On the Global Road Damage Detection Challenge 2020 dataset, a 2.41% increase in accuracy over the baseline EfficientDet-D0 was achieved, and YOLOv5s and Faster R-CNN were also outperformed. The E-EfficientDet-D2 variant further improved performance, reaching 57.51%. Building on previous advances in lightweight, real-time detection, Omer Kaya et al. [23] proposed an embedded system for automatic pothole detection and classification using YOLOv5, trained on images from eight datasets, each containing different types of defects. Providing a top-level overview, Fan et al. [24] conducted an extensive discussion of the pothole detection problem, categorizing image-based methods into fully supervised, unsupervised, and hybrid learning approaches. Various deep learning models, from traditional CNNs to transformers, were examined, including their effectiveness and challenges, and benchmark datasets such as CRACK500 and KITTI were analysed with respect to performance metrics and future directions.
To address computational complexity and accuracy issues, Jinlei Wang et al. [25] proposed an improved YOLOv8s model with several architectural enhancements. The bottleneck structure was replaced with a C2f-Faster-EMA design using partial convolutions and multi-scale attention, reducing the parameter count. The SimSPPF module provided spatial pyramid pooling for faster processing, and the Detect-Dyhead improved feature representation through dynamic attention mechanisms in both the spatial and task domains. Further developing mobile-oriented solutions, Chenguang Wu et al. [26] introduced YOLO-LWNet, a lightweight adaptation of YOLOv5 designed primarily for mobile and embedded devices. YOLO-LWNet utilizes an innovative lightweight convolution module (LWC), an improved attention mechanism, and a BiFPN-based feature fusion network to balance accuracy and efficiency. YOLO-LWNet-Small and YOLO-LWNet-Tiny were tested on the RDD-2020 dataset, with YOLO-LWNet-Small performing better. Darmawan et al. [27] introduced a UAV-based pothole detection approach that employs photogrammetric techniques to transform aerial images into detailed 3D surface reconstructions for comprehensive pavement analysis. By leveraging structure-from-motion (SfM) and multi-view stereo methods, the study generates dense point clouds that facilitate precise localisation and dimensional estimation of potholes. The proposed framework highlights the effectiveness of 3D photogrammetric modelling for accurate road-surface assessment, surpassing conventional 2D image-based detection methods.
Building on these improvements, Zhibin Han et al. [28] proposed MS-YOLOv8, an improved version of YOLOv8 designed to detect small-scale and complex pavement defects across different backgrounds. Deformable Large Kernel Attention (DLKA), Large Separable Kernel Attention (LSKA), and Multi-Scale Dilated Attention (MSDA) were integrated, resulting in significant improvements in the model's multi-scale feature extraction and localization precision. Madan Raj Upadhyay et al. [29] proposed a deep learning-based system that utilizes CSPDarknet53 as the backbone for YOLOv8, enabling efficient pothole detection. CCTV images captured from the roadside were processed through a cloud-based pipeline; the model is optimized for edge deployment using OpenVINO and achieved a precision score of 89.2% when trained on Kaggle and Roboflow datasets. Extending this concept further into user-interactive systems, Rathin Raj et al. [30] built a pothole detection and notification system combining deep learning and IoT. Images from vehicle-mounted cameras were processed with YOLOv5 and stored on a backend server, and GPS-based alerts notified passengers of pothole locations. The model achieved a detection accuracy of 87.5%, and the system was successfully tested in challenging weather and lighting conditions along the Tiruvannamalai highway in Tamil Nadu, India. For autonomous vehicle navigation, Omar Chaaban et al. [31] proposed an end-to-end pothole detection and avoidance system that integrates real-time stereo camera data with instance segmentation. YOLOv8 was used for 3D spatial pothole detection, and a path planner evaluated four possible maneuvers: avoidance, alignment, deceleration, and stopping, depending on pothole characteristics such as size and position. The proposed model is deployed on the NV-X1 autonomous vehicle and proves effective in generating and executing appropriate maneuvers across diverse pothole scenarios.
On the other hand, sensor fusion-based methods have also gained traction. Vijay Gaikwad et al. [32] developed a hybrid pothole detection system that integrates an MPU6050 Inertial Measurement Unit (IMU) for roughness detection and a NEO-6M GPS module for location tagging. High-frequency acceleration and gyroscope data from the IMU enabled reclassifying potholes by severity, while GPS data ensured accurate mapping. Focusing on the optimization problem in deep learning, Karan Thakkar et al. [33] combined transfer learning with the ResNet50 architecture and Google Maps’ geo-tagging capabilities to facilitate pothole location mapping. As the model is a transfer learning model, reliance on large datasets is reduced, thereby improving training speed and offering a practical, scalable solution for urban road management and preventive maintenance.
In another approach focused on mobile platforms, Dhanvanth Prasath et al. [34] developed a Smart Pothole Detection System that combines machine learning and geospatial analysis. Smartphone applications were used to capture images of the road together with their location coordinates. Support Vector Machine (SVM), Logistic Regression, and Logistic Regression combined with AdaBoost were used for classification. Geospatial mapping facilitated targeted resource allocation and maintenance prioritization, providing a cost-effective, scalable alternative to older inspection techniques. To enhance detection robustness using multimodal inputs, Abhiram Karukayil et al. [35] proposed a system that combines vision and LiDAR data. Geometric features such as depth and volume are extracted from LiDAR point clouds using the Convex Hull method, and GNSS (Global Navigation Satellite System) data are used for pothole mapping. YOLOv5 demonstrated better precision and mAP than earlier versions. The system was tested in Edinburgh, successfully detecting and characterizing 52 potholes of varying dimensions. P. T. SatyanarayanaMurty et al. [36] conducted a comparative study of CNN architectures, ResNet50, ResNetV2, VGG19, and YOLOv8, for pothole detection in road images. The models were trained on datasets gathered from various sources, including smartphones and robotic devices. YOLOv8 outperformed the remaining models in real-time performance and detection accuracy, making it suitable for deployment in live road-monitoring scenarios.
Leni et al. [37] proposed a distributed federated learning-based framework, primarily optimized for smart city infrastructure. A YOLOv8 model is deployed on edge devices, where captured images are processed locally; model updates are then aggregated via edge servers into a centralized global server. The federated approach preserves data privacy and improves the model's global performance, and the proposed system demonstrated significant improvements in detection accuracy and real-time efficiency. Its distributed design provides low latency and lets monitoring scale securely to large urban deployments. Extending real-time detection to video analysis, Kumar et al. [38] applied YOLOv4, YOLOv4-tiny, and SSD MobileNet v2 for object detection in video sequences. YOLOv4 achieved the highest accuracy with an mAP of 75.48%, whereas YOLOv4-tiny was better suited to real-time use thanks to its faster inference time (24.86 ms). A cosine similarity-based tracking algorithm was developed, improving pothole counting beyond traditional tracking methods. Chen et al. [39] propose a UAV-based approach for automatic road pavement inspection and pothole characterization using imagery converted to 3D point clouds via structure-from-motion (SfM). A privacy-preserving, artificial intelligence-based pothole classifier was presented by Singh et al. [40], which uses YOLOv5 fine-tuned with a CNN. The model distinguishes dry potholes from those immersed in water, supporting both road maintenance and automated navigation in self-driving cars. Meier et al. [41] present a UAV-based pothole detection and segmentation framework that leverages RGB and thermal aerial imagery combined with deep neural networks.
Their approach combines YOLOv8 for detection with geometric transformation methods to estimate pothole area from an aerial perspective, demonstrating reliable performance across various lighting and environmental conditions.
In a federated setting, YOLOv5 proved superior to YOLOv8, achieving higher precision (83%) and recall (71%) with lower loss, confirming its suitability for non-centralized, real-time environments. The federated approach enhances data confidentiality and enables adaptive learning across connected devices, supporting deployment in secure urban infrastructure. A similar study by Wijaya et al. [42] applied the YOLOv8 algorithm to four publicly available pothole datasets (2661 annotated images in total), achieving over 85% classification accuracy with stable, high precision and recall across several evaluation metrics. Bias removal and pothole-size-based annotation (minor, mid, huge) were used during data preprocessing to increase detection granularity. In a narrower implementation, Bhatt et al. [43] proposed a framework integrating PyTorch and YOLOv8 models; reported precision, recall, and mAP50-95 all exceed 70%, indicating reliable pothole detection across different shapes and orientations.
To further improve detection performance, Pandi et al. [44] proposed an ensemble learning strategy that combines three fine-tuned CNN models, namely Inception-V3, Xception, and MobileNet, on a custom pothole dataset collected from crowdsourced smartphone and in-vehicle camera images. The ensemble model outperformed individual networks, achieving 97.49% accuracy while reducing false positives. Focusing on model refinement and edge deployment, Shasikala et al. [45] developed a pothole classification system using a fine-tuned Faster R-CNN model trained on a custom dataset under varied road and environmental conditions. Huang et al. [46] introduced LEPS (Lightweight Effective Pothole Segmentation), a single-stage segmentation network that integrates an Embranchment Aggregation and Detail Enhancement (EADE) module, a GhostNet-based Context Aggregation (GCA) module, and an Optimized ProtoNet segmentation head. LEPS achieved a Mask AP50 of 0.892 and AP50:95 of 0.648, outperforming baseline models by up to 20.6%. Reddy SS et al. [47] used a YOLO model on a custom dataset to predict object classes and precise locations in a single forward pass, making it suitable for real-time applications. Designed for real-time edge computing with visual sensors, LEPS ensures accurate pothole segmentation with minimal resource usage, making it ideal for mobile or embedded deployment in urban road-monitoring systems. Table 1 provides a comprehensive overview of the existing literature on pothole detection.
As summarized in Table 1, although significant progress has been made in pothole detection, existing methods still have clear limitations that prevent them from being robust and practical. Based on this analysis, the following research gaps are identified:
1. Low-Light and Adverse Weather Conditions: Many pothole detection models perform poorly in low-light conditions, at night or under shadows, and in unfavourable weather such as rain, fog, or snow, which degrade visibility and image quality.
2. Small and Occluded Pothole Detection: Small potholes, or those partially covered by debris or vehicles, are often missed, which substantially lowers recall rates, especially in cluttered scenes.
3. High False Positives from Road Artefacts: Models frequently misclassify patched potholes, oil stains, or shadows as real potholes. These false positives reduce the reliability of detection systems in practical deployments.
4. Complex and Dynamic Environment Challenges: Performance drops significantly on busy roads, especially multi-lane areas or those with pedestrian and vehicle interference. Dynamic environments contain unpredictable visual noise and occlusions, making accurate localisation and classification more difficult.
5. Limitations in Real-Time Aerial Adaptability: Existing pothole detection systems primarily rely on ground-level or offline drone data, lacking real-time processing and the ability to adapt to altitude, lighting, and motion variations in aerial imagery.

3. Methodology

3.1. Objectives

  • To enhance image quality under poor lighting and adverse weather conditions using contrast and noise enhancement techniques.
  • To accurately detect small and partially occluded potholes through multi-scale feature extraction and attention mechanisms.
  • To reduce false positives caused by road artefacts, such as patches, stains, and shadows, through contextual feature learning.
  • To enhance detection performance in complex and dynamic road environments, such as multi-lane roads and pedestrian areas.
  • To ensure robustness against variations in lighting and image quality through a comprehensive preprocessing pipeline.
  • To develop a real-time drone-based pothole detection framework using aerial imagery and deep learning for accurate and efficient detection under diverse environmental conditions.

3.2. Dataset

The dataset used in this work combines the HighRPD drone-based pavement distress dataset [68] with a purpose-built, real-world pothole dataset to achieve greater robustness, diversity, and generalization in the deep learning framework developed for automatic pothole detection. From the HighRPD dataset, 1412 full-scene drone images were selected to train the pothole detection algorithm. The original HighRPD dataset comprises 11,696 drone-captured images at a consistent resolution of 640 × 640 pixels; however, only images containing pothole instances are used in this work.
To overcome limitations in the range of environmental conditions and enhance the dataset’s geographical diversity, another set of 8000 road images was collected across diverse urban, highway, and rural environments. These images were acquired under different lighting conditions, including bright and overcast weather, evening scenes, and even night scenes. Additionally, images were collected under various weather conditions, including clear skies, rain, and foggy road scenes. As a result, a total of 9412 full-scene images were collected, with approximately 60% containing positive road scenes with potholes and the remaining 40% containing normal road scenes without potholes.
All images were annotated using the LabelImg tool in YOLO format, with each bounding box specified in normalized coordinates. The dataset is single-class, with potholes assigned to class 0. Annotation consistency was maintained by verifying and cross-checking bounding boxes for precise localization and by eliminating labelling ambiguity. Of the 9412 images in the dataset, approximately 5647 are positive images containing at least one pothole. For experimental evaluation, the dataset was split into training (70%), validation (20%), and test (10%) subsets, with the training set containing 6588 images, the validation set containing 1882 images, and the test set containing 942 images. To avoid data leakage, images from the same road segment or drone sequence were not shared across splits. All images were resized to 640 × 640 pixels during model training.
Moreover, the diverse range of illumination, road textures, environmental conditions, and structural details helps the model capture complex visual features, such as surface reflections on damp road surfaces, shadow interference, maintenance patches, and partially covered potholes. Also, unlike the cropped data in object-centric datasets, the proposed dataset retains the full contextual information of each scene. This helps the model generalize better for practical application in drone-assisted road scenarios. Images depicting various lighting, weather, and road texture scenarios are shown in Figure 2; the last row shows drone-captured scenarios.

3.3. Proposed Work

In our earlier work [69], we developed a YOLO-based system capable of detecting potholes under challenging weather and lighting conditions. While that model performed well across various environments, its primary focus was real-time detection; issues such as overlapping predictions, varied surface textures, and low-contrast scenarios were not addressed. As an extension of that research, we propose an advanced model that improves detection accuracy and generalization by integrating feature enhancement, non-maximum suppression techniques, and deeper convolutional layers, thereby further improving performance in real-world deployment across highly variable road conditions. This section provides a detailed explanation of the custom-designed system framework for automated pothole detection using deep learning and computer vision techniques. The architecture, illustrated in Figure 3, is organized into multiple modules that operate in sequence to achieve robust pothole detection in the real world. These modules comprise data preparation, preprocessing, feature extraction, model training and prediction, evaluation, and result representation. Each module is explained in detail, with its related algorithms and mathematics.
A.
Data Layer
The data layer is the foundational component of the pothole detection architecture. It is responsible for supplying the raw and annotated data required to train, validate, and test deep learning models. A robust data layer ensures that the downstream modules—such as preprocessing, feature extraction, and classification—receive high-quality, diverse, and accurately labelled data. This layer directly affects the effectiveness and generalization of the entire detection system.
Ground Truth Annotations: All images in the dataset are carefully annotated using LabelImg, a graphical image annotation tool.
Each annotation is structured as (c, x, y, w, h), where
c is the class ID (e.g., 0 for pothole);
(x, y) are the normalized centre coordinates;
(w, h) are the normalized width and height of the bounding box.
Normalization makes annotation resolution-independent and can therefore be effectively applied across different images. High-quality annotations are essential, especially for pothole detection, as the accuracy of pothole location and shape significantly affects classification (distinguishing potholes from the road’s surroundings) and localization (localizing the pothole’s location and size). Prominently, defined bounding boxes support model convergence during training and accuracy and recall during inference, which ultimately supports the real-world applicability of the detection system.
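As an illustration, the normalized YOLO annotation format described above can be converted back to pixel coordinates with a few lines of Python; the helper name and image size here are illustrative, not part of the paper's codebase.

```python
def parse_yolo_label(line, img_w, img_h):
    """Convert one YOLO-format label line "c x y w h" (x, y, w, h normalized
    to [0, 1]) into a pixel-space box (class_id, x_min, y_min, x_max, y_max)."""
    c, x, y, w, h = line.split()
    c = int(c)
    x, y, w, h = (float(v) for v in (x, y, w, h))
    # Centre/size -> corner coordinates, scaled to the image resolution
    x_min = (x - w / 2) * img_w
    y_min = (y - h / 2) * img_h
    x_max = (x + w / 2) * img_w
    y_max = (y + h / 2) * img_h
    return c, x_min, y_min, x_max, y_max
```

Because the stored coordinates are resolution-independent, the same label file applies unchanged whether the image is used at 640 × 640 or any other size.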
The dataset is curated to address key challenges in automated road defect detection. It covers varied illumination, so the model can respond to changes in brightness, shadows, and weather conditions. This robustness enables it to perform reliably under varying real-world conditions. Furthermore, the dataset supports several stages of the system development process: training the model on labelled images in a supervised manner, validating that the model generalizes well, and assessing its performance on unseen data in terms of detection accuracy and Intersection over Union (IoU). Figure 1 presents representative samples from the dataset and demonstrates the variety of disturbances sampled. These visuals highlight the complexity and realism of the dataset, thereby improving understanding of the challenges in detecting potholes.
B.
Preprocessing Engine
The preprocessing step is important for converting raw image data into a form from which a deep neural network can learn effectively. Pothole images are taken in real-world conditions spanning a wide range of environments, from bright daylight to low-light or rainy scenes, which leads to inconsistencies such as shadows, glare, noise, and resolution mismatches. Preprocessing addresses these variations by increasing image contrast, reducing noise, preserving structural details such as edges, and standardizing image size and intensity range. These improvements enable the model to learn better from a wide variety of difficult input data.
CLAHE (Contrast-Limited Adaptive Histogram Equalization) is a contrast-enhancing method for images. This technique is useful for images taken in dim lighting or in fog, as it enhances localized areas. Unlike global histogram equalization, which can over-amplify contrast throughout the image and enhance noise, CLAHE performs histogram equalization on small image tiles while clipping each histogram at a predefined threshold. The tiles are then combined smoothly using bilinear interpolation, recovering details (such as faint potholes) that could otherwise be lost in underexposed areas. It is computed as
CLAHE(I) = AdaptiveEq(I, ClipLimit, GridSize)
where I is the input image, ClipLimit is the contrast-clipping threshold that avoids over-enhancement, and GridSize is the size of the local regions used for histogram equalization. The function AdaptiveEq divides the input image I into GridSize tiles, equalizes the histogram of each tile, clips each histogram at ClipLimit to suppress noise amplification, and finally interpolates between tiles to smooth the transitions between local regions of the image.
Noise Reduction Using Gaussian Filtering: Gaussian filtering is a technique widely used in image processing to smooth images and reduce unwanted noise. It works by convolving the image with a Gaussian kernel, which assigns higher weights to pixels closer to the centre and lower weights to those farther away. This creates a mild blurring effect that reduces high-frequency noise while minimally affecting the underlying structure. The standard deviation of the Gaussian kernel determines the amount of smoothing. For pothole detection, this helps remove irrelevant background noise while preserving useful visual information such as texture and shape.
I_denoised = G_σ ∗ I
where I is the original image, G_σ is the Gaussian kernel with standard deviation σ, and ∗ is the convolution operation.
This helps remove speckle noise and other high-frequency artefacts without losing the continuity of road textures or the edges of potholes.
Multi-Scale Feature Enhancement: Since potholes can vary dramatically in size, from small surface cracks to large cavities, they pose significant challenges for feature extraction. To ensure the model can detect features at all scales, multi-scale feature enhancement methods are used, most commonly the generation of Gaussian and Laplacian pyramids. The Gaussian pyramid produces successively lower resolutions of the image, while the Laplacian pyramid encodes the differences between successive resolution levels, preserving important high-frequency information such as edges. This multi-scale representation makes the model robust to variations in pothole size and camera distance.
The Gaussian pyramid successively reduces the image’s pixel count while retaining its important structures.
The Laplacian pyramid is concerned with the high frequencies at different scales.
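A minimal NumPy sketch of the pyramid idea, using nearest-neighbour resampling in place of the usual blur-based down/upsampling (a simplification for illustration, not the authors' implementation):

```python
import numpy as np

def gaussian_pyramid(img, levels):
    """Successively halve resolution; simple decimation stands in for
    the usual blur-then-downsample step of a true Gaussian pyramid."""
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(pyr[-1][::2, ::2])
    return pyr

def laplacian_level(img):
    """High-frequency residual: the image minus its
    downsampled-then-upsampled (nearest-neighbour) copy."""
    down = img[::2, ::2]
    up = np.repeat(np.repeat(down, 2, axis=0), 2, axis=1)
    return img - up[: img.shape[0], : img.shape[1]]
```

The Laplacian residual is near zero in smooth regions and large at edges, which is exactly the high-frequency detail (pothole rims, crack boundaries) the text describes preserving across scales.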
Edge-preserving filtering through bilateral filtering: Bilateral filtering is a nonlinear, edge-preserving smoothing method that accounts for both spatial proximity and pixel intensity differences. It ensures that similar pixels are averaged together, while distinct edges are preserved by reducing the influence of dissimilar intensities, achieving noise reduction in road images without blurring pothole edges. By preserving sharpness at image edges, bilateral filtering enhances the definition of pothole contours, which is crucial for accurate segmentation and detection.
The edge-preserving filter is
I_filtered(p) = (1 / W_p) Σ_{q∈Ω} G_s(‖p − q‖) · G_r(|I(p) − I(q)|) · I(q)
where
  • p is the current pixel and q ∈ Ω are its neighbours;
  • G_s(‖p − q‖) is the spatial Gaussian kernel based on pixel distance;
  • G_r(|I(p) − I(q)|) is the range Gaussian kernel based on intensity difference;
  • I(q) is the intensity at neighbouring pixel q;
  • W_p is the normalization factor that keeps the result in a valid range.
Normalization and Resizing: Standardizing image dimensions and pixel intensities is a fundamental preprocessing step in any deep learning pipeline. Resizing ensures that all input images match the fixed dimensions required by the neural network (e.g., 640 × 640 pixels), while normalization scales pixel intensities to a standard range, such as [0, 1] or [−1, 1]. This not only ensures consistency across data samples but also accelerates the model's convergence during training. Normalization also helps prevent numerical instability, enabling the model to generalize better across varying input conditions.
Images are resized to a fixed dimension (e.g., 640 × 640) to maintain consistency across batches.
Pixel values are normalized to a range of [0, 1] or [−1, 1], depending on the model.
The preprocessing engine is one of the core components of the pothole detection pipeline; it makes the model considerably more effective at learning and at generalizing to unseen real-world scenes. The module combines several methods, including contrast enhancement, noise reduction, edge preservation, normalization, and augmentation, to turn raw wild images into clean, consistent, and information-rich inputs. These systematic improvements increase the visibility and legibility of pothole features and provide consistently optimal inputs to the deep learning model. Ultimately, the proper use of preprocessing methods is key to designing a powerful, accurate, and efficient computer vision system for analyzing road surfaces and identifying potholes automatically.
C.
Novel Detection Architecture
The key element of the pothole detection system is the Novel Detection Architecture, designed to recognize potholes under diverse real-world conditions. The architecture integrates a pre-trained YOLOv8 model with a custom feature-extraction design, namely a Feature Pyramid Network (FPN) and an Attention-Enhanced Head. This combination enables reliable detection of potholes of varying shapes and sizes, even under different lighting conditions. The design leverages deep learning to extract, fuse, and refine features before making predictions.
i.
Backbone Network: This is a deep convolutional neural network that extracts visual patterns from input images. It converts the preprocessed image into a rich hierarchical representation across many convolutional layers. Each layer learns a different level of abstraction, from simple edges to complex shapes, which is important for correct object recognition. The backbone of this architecture is YOLOv8, chosen for its robust transfer learning capabilities.
F_l = σ(W_l ∗ F_{l−1} + b_l)
where F_l is the feature map at layer l, W_l denotes the convolutional kernel weights, b_l is the bias term, σ is the activation function (e.g., ReLU or LeakyReLU), and ∗ is the convolution operator.
Pre-trained YOLOv8: YOLOv8 is a version of the You Only Look Once (YOLO) object detector that uses an anchor-free approach and a decoupled head design for efficient object detection. It is pre-trained on large datasets such as COCO and then fine-tuned on the custom pothole dataset. YOLOv8 predicts bounding boxes, class probabilities, and confidence scores.
B = (x, y, w, h), S = sigmoid(s)
where (x, y) are the centre coordinates of the bounding box, (w, h) are its width and height, and s is the raw objectness score before activation.
L_YOLO = λ1·L_cls + λ2·L_obj + λ3·L_bbox
The YOLO loss function L_YOLO is a weighted sum of the classification loss L_cls, objectness loss L_obj, and bounding box regression loss L_bbox, where λ1, λ2, and λ3 are hyperparameters that balance the contribution of each component to the overall loss.
ii.
Custom Feature Extraction: This module enhances the base YOLOv8 model's generalization by introducing two critical improvements: a Feature Pyramid Network (FPN) and an Attention-Enhanced Head.
Feature Pyramid Network (FPN): FPN is a critical component in modern object detection systems, particularly effective at detecting objects at multiple scales. In the context of pothole detection, FPN enhances the model’s ability to identify potholes of varying sizes and levels of visibility, which can be influenced by factors such as distance, camera angle, and lighting conditions. The main concept of FPN is to fuse low-level and high-level feature maps by combining low-resolution and high-resolution feature maps extracted at various stages of our network backbone to produce a richer, more informative feature representation.
In a traditional CNN, deeper layers extract stronger semantic features at lower spatial resolution; this makes the network good at understanding what is in the image but not where it is. On the other hand, earlier layers retain substantial spatial information but lack sufficient semantic information for reliable detection. FPN solves this problem by introducing a top-down pathway in which the network successively upsamples higher-level features and merges them with features from earlier layers via lateral connections. Mathematically, this layer-to-layer fusion is stated as
F_FPN^(i) = Upsample(F_high^(i+1)) + Conv(F_low^(i))
where
  • F_FPN^(i) is the output feature at pyramid level i;
  • F_high^(i+1) is the higher-level (semantically stronger but spatially coarser) feature map;
  • F_low^(i) is the lower-level (spatially richer but semantically weaker) feature map;
  • Upsample(·) is typically nearest-neighbour or bilinear interpolation to match resolutions;
  • Conv(·) denotes a 1 × 1 convolution to match the channel dimensions.
The high-level features that are upsampled add semantic information, while the low-level features contain spatial information. After the fusion, a 3 × 3 convolution is usually applied to smooth the feature maps and remove artefacts introduced by the upsampling process. This process leads to a feature pyramid—each feature containing both spatial and semantic richness—so that the detector can perform object detection at multiple resolutions. For pothole detection, this capability is vital, as potholes can be small and shallow or large and deep, and they can appear in cluttered road scenes. By leveraging FPN, the model becomes more robust to these variations, leading to higher detection accuracy and better localization quality.
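A toy NumPy sketch of the top-down fusion step, with nearest-neighbour upsampling and the 1 × 1 convolution expressed as a channel-mixing matrix multiply (shapes and weights are hypothetical):

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return np.repeat(np.repeat(f, 2, axis=1), 2, axis=2)

def conv1x1(f, w):
    """A 1x1 convolution is just a per-pixel channel mix: w is (C_out, C_in)."""
    c, h, wd = f.shape
    return (w @ f.reshape(c, -1)).reshape(w.shape[0], h, wd)

def fpn_fuse(f_high, f_low, w_lateral):
    """Top-down FPN fusion: upsample the coarse semantic map and add the
    1x1-projected fine-resolution map (the lateral connection)."""
    return upsample2x(f_high) + conv1x1(f_low, w_lateral)
```

In a real network the sum would typically be followed by a 3 × 3 convolution to smooth upsampling artefacts, as the text notes.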
Attention-Enhanced Head: This module improves detection effectiveness by letting the model focus on the most informative locations in the feature maps. This is especially important for pothole detection, since potholes are often faint and may be shadowed, stained, patched over, or obscured by clutter. Classical convolutional filters assign equal weight to all spatial positions, which can wash out regions of interest. In contrast, by suppressing background information, attention mechanisms assign varying significance to different spatial regions, enabling the network to focus on areas likely to contain potholes.
The fundamental concept of attention is to derive a set of attention weights that enable the model to give more weight to certain features and less to others. These weights are learnt during training and dynamically adjust to the input content. The Attention-Enhanced Head computes these weights using a compatibility function over the feature maps and learned parameters and then applies a SoftMax to convert the weights into a proper probability distribution.
The attention weight α_i for a spatial position i is computed as
e_i = W_a · tanh(W_f F_i + b_f),  α_i = exp(e_i) / Σ_j exp(e_j)
where
  • α_i is the attention weight for feature F_i;
  • W_a and W_f are trainable matrices;
  • b_f is a bias term.
Once the attention weights are computed, they are used to produce a weighted combination of all features, giving greater weight to regions with higher attention scores. The final attention-enhanced feature map is given by:
F_att = Σ_i α_i F_i
where
  • F_att represents the refined feature representation that emphasizes pothole-relevant regions while reducing the influence of noise or distractions.
The mechanism provides the model with spatial selectivity, which greatly improves its ability to identify potholes under occlusion, low contrast, or clutter on the road surface. Channel-wise or self-attention variants may also be combined with spatial attention, though spatial attention alone is sufficient for localizing irregularly shaped objects such as potholes. Adding the Attention-Enhanced Head to the detection pipeline makes the architecture more context-aware and attention-centred, enabling more accurate classification and bounding box regression. This ultimately leads to a stronger, more precise detection system applicable to real-world road conditions.
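The attention computation above can be sketched in NumPy for a flattened set of spatial feature vectors; the matrix shapes here are illustrative.

```python
import numpy as np

def spatial_attention(features, w_f, b_f, w_a):
    """Spatial attention over N feature vectors (N, D):
    e_i = w_a . tanh(W_f f_i + b_f), alpha = softmax(e),
    output = sum_i alpha_i * f_i."""
    scores = np.tanh(features @ w_f.T + b_f) @ w_a   # unnormalized e_i, shape (N,)
    scores -= scores.max()                           # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()    # softmax -> probability weights
    return alpha, alpha @ features                   # weights and fused feature
```

The softmax guarantees that the weights form a probability distribution, so the fused feature is a convex combination that emphasizes high-scoring (likely pothole-bearing) positions.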
D.
Novel Optimizer Framework
The training efficiency and generalization of deep learning models depend heavily on the optimization strategy applied. In a task as challenging and variable as pothole detection across a wide range of road environments, ordinary optimizers can hardly balance convergence speed against detection robustness. To address these shortcomings, a new optimizer architecture is proposed, one that combines sophisticated gradient-based algorithms, layer-specific learning control, and regularization-based generalization to yield a less error-prone and more stable detection model.
The heart of the framework is a custom differential optimizer based on AdamW with domain-specific improvements. Model parameters are grouped by architectural function, such as bias terms, convolutional weights, and feature-extraction layers (e.g., the neck or backbone), and each group is assigned its own learning rate and weight-decay setting. This differential learning rate strategy enables precise control over how aggressively the various layers are updated during training. For example, since the backbone carries pre-trained weights, it can be fine-tuned with a reduced learning rate to avoid losing learned representations. Conversely, the detection head is assigned a higher learning rate so that it converges faster on the new task.
Formally, the learning rate of a parameter θ_i is chosen from a group-specific schedule:
η(θ_i) = { η_backbone, if θ_i ∈ backbone; η_head, if θ_i ∈ detection head; η_bias, if θ_i is a bias term }
where η_backbone, η_head, and η_bias denote the learning rates assigned to the backbone layers, head layers, and bias parameters, respectively, during training. This selective strategy substantially enhances convergence and avoids overfitting or underfitting of important model components. Momentum-based updates are used to further improve optimization stability. Momentum incorporates historical gradients into every update, filtering out oscillations and steering convergence towards a consistent direction. The update rule is defined as
v_t = μ·v_{t−1} − η·∇L_t,  θ_{t+1} = θ_t + v_t
where v_t is the velocity at iteration t, μ is the momentum factor (usually between 0.8 and 0.99), η is the learning rate, ∇L_t is the gradient of the loss function at iteration t, and θ_t represents the model parameters.
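A minimal sketch of the group-specific learning rates and the momentum update; the group names and rate values are hypothetical, not the paper's tuned settings.

```python
def group_lr(name, lr_backbone=1e-4, lr_head=1e-3, lr_bias=2e-3):
    """Assign a learning rate by parameter group, mirroring the layer-wise
    schedule described above (rates are illustrative placeholders)."""
    if "bias" in name:
        return lr_bias
    if name.startswith("backbone"):
        return lr_backbone
    return lr_head  # detection head and everything else

def momentum_step(theta, velocity, grad, lr, mu=0.9):
    """Classical momentum update: v_t = mu*v_{t-1} - lr*grad; theta += v_t."""
    velocity = mu * velocity - lr * grad
    return theta + velocity, velocity
```

In a PyTorch-style implementation the same effect is usually achieved by passing several parameter groups, each with its own `lr`, to the optimizer constructor.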
To refine the training dynamics over time, the framework integrates a cosine annealing learning rate scheduler with warm restarts, enabling cyclical learning rate reductions that help escape sharp minima and settle into flatter, more generalizable solutions. The learning rate at any iteration t follows
η_t = η_min + (1/2)(η_max − η_min)(1 + cos(π · T_cur / T_max))
where η_max and η_min are the maximum and minimum learning rates, respectively, T_cur is the current epoch within the cycle, and T_max is the total number of epochs in one cycle.
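The schedule can be computed directly from the formula; the rate bounds below are illustrative defaults.

```python
import math

def cosine_lr(t_cur, t_max, eta_min=1e-6, eta_max=1e-3):
    """Cosine annealing: eta_min + 0.5*(eta_max - eta_min)*(1 + cos(pi*t/T)).
    Starts at eta_max (t = 0) and decays smoothly to eta_min (t = T)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_max))
```

With warm restarts, `t_cur` is simply reset to 0 at the start of each new cycle, producing the periodic rate spikes that help the optimizer escape sharp minima.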
The optimizer framework also includes a linear warm-up, in which the learning rate increases progressively during the first few hundred steps. This prevents drastic parameter changes at the start of training, when gradients can be large and unstable. During a predetermined number of warm-up iterations, T_warmup, each group-specific learning rate is scaled up linearly with the current iteration.
An exponential moving average (EMA) of the model weights is maintained to smooth parameter updates over time and enhance overall stability. The EMA copy of the model parameters is updated at every iteration as
θ_EMA^(t) = α·θ_EMA^(t−1) + (1 − α)·θ_t
where θ_EMA^(t) is the averaged parameter at time t, α is the EMA decay factor (e.g., 0.9999), and θ_t is the current model parameter.
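The EMA update is a one-liner per parameter; the decay value mirrors the example in the text.

```python
def ema_update(theta_ema, theta, alpha=0.9999):
    """Exponential moving average of a weight:
    theta_ema <- alpha*theta_ema + (1 - alpha)*theta."""
    return alpha * theta_ema + (1 - alpha) * theta
```

With a decay this close to 1, the EMA copy changes very slowly, so it effectively averages the model over thousands of recent steps; the EMA weights, not the raw weights, are typically used at inference time.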
The Novel Optimizer Framework proposes a well-thought-out set of optimization best practices and task-specific improvements to address the needs of pothole detection using YOLO. It not only provides fast convergence to the model but also generalizes well across dissimilar environmental conditions, such as fog, rain, glare, and low-light conditions, making it appropriate for use in real-world intelligent transportation systems.
E.
Training Infrastructure
A robust training system is important for creating object detection models that are highly reliable and work across a wide range of real-world conditions. This work includes a modular, flexible training pipeline and performance monitoring. It has a specialized training engine, a task-based loss function, automated hyperparameter search, and a full validation module.
i.
Training Engine: The training engine is built on top of the Ultralytics YOLOv8 framework and adds customized components, supporting training on GPUs or CPUs, multi-scale inputs, and adaptive learning. The training loop iterates over batches of annotated images, processing the images and updating the model weights via backpropagation with the Adam-based optimizer, whilst tracking training statistics in real time. At every forward pass, the model produces predicted bounding boxes and classification probabilities for detected potholes. These predictions are compared against the ground truth labels, and the loss is computed to drive the parameter updates. The optimizer then updates the model weights according to the calculated gradients, using differential learning rates, cosine annealing schedules, and exponential moving averages to achieve smooth convergence and good generalization.
IoU-Based Loss Function: The main element of the training objective is the Intersection over Union (IoU) loss, which measures the spatial overlap between the predictions and the ground truth:
L_IoU = 1 − |B_p ∩ B_gt| / |B_p ∪ B_gt|
where B_p is the predicted bounding box, B_gt is the ground truth bounding box, ∩ denotes the intersection area of the two boxes, and ∪ denotes their union area. This loss function penalizes misalignment between predicted and actual boxes: a perfect overlap yields an IoU of 1, making L_IoU = 0, while no overlap leads to a loss of 1. IoU loss is particularly effective in object detection because it directly optimizes for spatial accuracy, unlike traditional L1 or L2 losses, which only consider coordinate differences.
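For axis-aligned boxes in corner format, the IoU loss can be sketched as follows (the corner-format convention is an assumption here; YOLO's centre-based boxes would first be converted):

```python
def iou_loss(box_p, box_g):
    """IoU loss for axis-aligned boxes (x_min, y_min, x_max, y_max):
    L = 1 - |intersection| / |union|."""
    # Overlap extents along each axis (clamped at zero for disjoint boxes)
    ix = max(0.0, min(box_p[2], box_g[2]) - max(box_p[0], box_g[0]))
    iy = max(0.0, min(box_p[3], box_g[3]) - max(box_p[1], box_g[1]))
    inter = ix * iy
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    union = area_p + area_g - inter
    return 1.0 - inter / union if union > 0 else 1.0
```

Identical boxes give a loss of 0 and disjoint boxes give 1, matching the behaviour described in the text.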
Hyperparameter Optimization: To ensure the model performs optimally, critical training hyperparameters are tuned using automated search strategies. These hyperparameters include:
  • Learning rate;
  • Weight decay;
  • Momentum;
  • Warm-up epochs;
  • Batch size;
  • Confidence threshold.
Two primary methods are used:
  • Grid Search: Exhaustively tries combinations of predefined hyperparameter values.
  • Random Search: Randomly samples hyperparameters and configurations within specified ranges.
This search uses a validation set to ensure generalization to unseen data. The best hyperparameter configuration is then locked in for the final training phase.
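A minimal random-search sketch in the spirit described above; the search ranges and the `train_and_validate` callback are placeholders for the real training run.

```python
import random

def random_search(train_and_validate, n_trials=20, seed=0):
    """Random hyperparameter search: sample configurations from given
    ranges and keep the one with the best validation score.
    `train_and_validate` is a caller-supplied stand-in for a real run."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {
            "lr": 10 ** rng.uniform(-5, -2),            # log-uniform learning rate
            "weight_decay": 10 ** rng.uniform(-6, -3),  # log-uniform decay
            "momentum": rng.uniform(0.8, 0.99),
            "batch_size": rng.choice([8, 16, 32]),
        }
        score = train_and_validate(cfg)  # e.g., validation mAP
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

Sampling the learning rate and weight decay on a log scale is a standard choice, since their useful values span several orders of magnitude.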
ii.
Validation Module
The validation module plays a crucial role in evaluating model performance both during and after training. It evaluates the model on a separate validation set that was not seen during training. Key metrics computed include:
  • Validation Loss: Total loss on the validation set, including IoU and classification losses.
  • IoU Score: Average Intersection over Union between predicted and actual boxes.
  • Precision: Proportion of correctly predicted pothole boxes out of all predicted boxes.
  • Recall: Proportion of actual potholes correctly identified.
The training infrastructure is designed to strike a balance between training speed, stability, and accuracy. With IoU-driven loss optimization, adaptive learning strategies, and systematic validation, the model is well-tuned for robust pothole detection across a wide range of environmental conditions. These components collectively contribute to the model’s deployability in real-time intelligent road surveillance systems.
F.
Evaluation Framework
To ensure reliable and interpretable performance analysis of the pothole detection model, an Evaluation Framework is used post-training. This module calculates quantitative metrics that reflect the model’s ability to accurately classify and localize potholes, especially under diverse environmental conditions.
Performance Analyser: The analyser uses a confusion matrix, a fundamental tool in classification tasks, to compute evaluation metrics such as precision, recall, and F1-score. These metrics are crucial in assessing the trade-offs between false detections and missed detections. For a real-world pothole detection system, high recall is important to avoid missing defects, while high precision ensures reliability by avoiding false alarms.
Metrics Calculator: In addition to standard metrics, the framework also computes:
  • Mean average precision (mAP) at different IoU thresholds (e.g., mAP@0.5, mAP@0.5:0.95);
  • IoU Accuracy: Average Intersection over Union between predicted and ground truth boxes;
  • Number of detections per image;
  • False-positive rate and false-negative rate.
These values are logged during training and validation, enabling the tracking of performance trends over epochs. The results help determine optimal model checkpoints and guide hyperparameter tuning.
G.
Inference Engine
The inference engine is responsible for deploying the trained pothole detection model on new, unseen images, either in batch processing mode or in real-time streaming scenarios.
Real-Time Inference: The inference pipeline supports both
  • Batch Inference: Efficient for testing and offline evaluation.
  • Single-Image or Stream-Based Inference: Suitable for real-time deployment on embedded devices, edge servers, or cloud platforms.
The engine processes each image through the trained YOLO-based model, generates detection predictions (bounding boxes, class probabilities), and applies postprocessing to refine the results.
Postprocessing with NMS: To avoid duplicate detections of the same pothole, the system applies non-maximum suppression (NMS). This technique filters overlapping bounding boxes based on their confidence scores and spatial overlap, measured by the IoU (Intersection over Union) metric.
IoU(A, B) = |A ∩ B| / |A ∪ B|
Here, A is the predicted bounding box with the highest confidence score and B is another overlapping predicted box. Boxes with an IoU greater than a specified threshold (typically 0.5) are considered redundant, and only the one with the highest confidence is retained. This step ensures clean, precise localization of potholes, eliminating clutter from multiple overlapping boxes. The NMS threshold can be fine-tuned depending on whether the use case prefers high recall (detecting all potholes while tolerating some overlap) or high precision (removing nearly all overlapping boxes).
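The greedy NMS procedure described above can be sketched as follows; the detection tuples and the 0.5 default threshold follow the text, while everything else is illustrative.

```python
# Minimal sketch of non-maximum suppression (NMS). Each detection is a
# (x1, y1, x2, y2, confidence) tuple; boxes overlapping a more confident
# box by IoU >= iou_threshold are treated as duplicates and suppressed.

def nms(detections, iou_threshold=0.5):
    """Greedy NMS: keep the most confident box, drop overlapping rivals."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    remaining = sorted(detections, key=lambda d: d[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # highest-confidence survivor
        kept.append(best)
        remaining = [d for d in remaining if iou(best, d) < iou_threshold]
    return kept
```

Raising `iou_threshold` keeps more overlapping boxes (favoring recall); lowering it suppresses more aggressively (favoring precision).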
H. Output Stream
Once inference is complete, the output stream module handles the formatting, storage, and visualization of results.
Visualization Module: Detected potholes are overlaid graphically on the original input images. Each bounding box is annotated with:
  • The class label (e.g., “pothole”);
  • The confidence score of the prediction.
This visual feedback is crucial for qualitative assessment and demonstration purposes. It also allows manual reviewers or human-in-the-loop systems to validate model predictions.
Result Storage: All inference results—including annotated images, raw bounding box coordinates, and confidence scores—are stored in a structured output directory. These outputs can be:
  • Used for further downstream analytics (e.g., GIS mapping, maintenance scheduling);
  • Streamed to a cloud dashboard or local web UI;
  • Exported in formats like JSON, CSV, or YOLO label files.
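The export step above can be sketched as below; the JSON field names and the YOLO-line conversion follow common conventions and are assumptions of this sketch, not the system's actual schema.

```python
# Illustrative sketch of result storage: serializing detections to JSON and
# converting corner-format boxes to normalized YOLO label lines.
import json

def to_json(image_name, detections):
    """detections: list of dicts, e.g. {"label": ..., "confidence": ...}."""
    return json.dumps({"image": image_name, "detections": detections}, indent=2)

def to_yolo_line(cls_id, box, img_w, img_h):
    """Convert an (x1, y1, x2, y2) box to a normalized YOLO label line:
    class_id center_x center_y width height (all fractions of image size)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2 / img_w, (y1 + y2) / 2 / img_h
    w, h = (x2 - x1) / img_w, (y2 - y1) / img_h
    return f"{cls_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
```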
This modular design enables seamless integration of the pothole detection system with smart city platforms, mobile apps, or automated vehicle navigation systems (see Algorithm 1).
Algorithm 1. DPD-Net: Deep Pothole Detection Network
Input Parameters: Image I ∈ ℝ^(H × W × C)
Output Parameters: Predicted bounding boxes with class labels.
Step 1:Collect raw pothole images and ground truth bounding box annotations in YOLO format. The input dataset contains labelled images with bounding boxes around potholes. The YOLO format uses normalized values for object centre coordinates, width, and height.
Step 2:Enhance image quality through preprocessing techniques such as contrast enhancement, noise reduction, and edge preservation. CLAHE improves local contrast using Equation (1); noise reduction filters remove image noise using Equation (2); and multi-scale enhancement and edge preservation, as outlined in Equation (3), enhance the visibility of potholes in challenging conditions such as fog, glare, or low-light.
Step 3:Extract features using a pre-trained YOLOv8 backbone network, as described in Equation (4). The YOLOv8 model includes a convolutional backbone pre-trained with Equations (5) and (6) on large datasets to extract spatial features that represent road textures and pothole structures.
Step 4:Enhance spatial features using a custom neck and attention-based detection head. A customized Feature Pyramid Network (FPN) aggregates features at multiple scales using Equation (7). Attention modules guide the model to focus on relevant regions, such as cracks or potholes, using Equations (7) and (8).
Step 5:Train the model using IoU-based loss and optimized training strategies. The IoU (Intersection over Union) loss helps align predicted and ground truth bounding boxes, which is crucial for accurate localization.
Step 6:Apply the novel optimizer with differential learning rates and momentum-based updates, as described in Equations (10) and (11). The optimizer groups parameters and assigns learning rates using Equation (12) differently for backbone, neck, and head layers. EMA helps stabilize training and accelerate convergence, as shown in Equation (13).
Step 7:Validate the model using a separate module to monitor loss (as in Equation (14)) and accuracy metrics. The validation module is used to validate the model after every epoch using metrics such as loss, precision, recall, etc., which help track overall learning and prevent overfitting.
Step 8:Evaluate the model’s performance using the confusion matrix and calculate precision, recall, and F1-score. The confusion matrix illustrates true positives (TP), false positives (FP), and false negatives (FN).
Step 9:Perform inference tests on new images or batches of images (using the trained model). The trained YOLOv8 model processes images it has not seen before, either individually or in bulk, predicts whether they are potholes, and then determines the location of that object.
Step 10:Use the non-maximum suppression function to filter overlapping bounding boxes. NMS is used to compare the overlapping predictions using IoU. The boxes with an IoU threshold greater than the threshold value (e.g., 0.5) are treated as duplicates and suppressed; the most confident prediction is kept using Equation (15).
Step 11:Visualize the detected potholes by annotating them on the original images. Bounding boxes and labels are drawn on the input images for human validation or deployment for monitoring systems.
Step 12:Store the output images and detection results for further analysis or deployment. Results, in the form of labelled images and bounding-box data, are saved in structured directories and formats (e.g., JSON or text files), ready for use in a dashboard or maintenance plan.
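The two ideas in Step 6, layer-wise learning rates for the backbone/neck/head parameter groups and an exponential moving average (EMA) of the weights, can be sketched as follows. The scale factors and decay value are illustrative assumptions, not the paper's tuned hyperparameters.

```python
# Illustrative sketch of Step 6: differential (layer-wise) learning rates
# and an EMA weight update. Scale factors and decay are assumed values.

def group_learning_rates(base_lr=0.01, backbone_scale=0.1, neck_scale=0.5):
    """Assign a smaller rate to pre-trained backbone layers, an intermediate
    rate to the neck, and the full base rate to the detection head."""
    return {"backbone": base_lr * backbone_scale,
            "neck": base_lr * neck_scale,
            "head": base_lr}

def ema_update(ema_weights, weights, decay=0.999):
    """Blend the current weights into a running exponential moving average,
    which stabilizes training and smooths the final model."""
    return {k: decay * ema_weights[k] + (1 - decay) * weights[k]
            for k in weights}
```

Freezing-in-spirit the backbone with a small rate preserves its pre-trained features while the head adapts quickly to pothole-specific patterns.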

4. Result Analysis

Figure 4 compares the pothole detection performance of the pre-trained standard YOLOv8 with the proposed novel pothole optimizer combined with a Feature Pyramid Network, adaptive CLAHE, and edge-preserving techniques. All experiments were conducted in a Kaggle environment with an NVIDIA Tesla P100 GPU (16 GB). The models were trained using the same settings: image size of 640 × 640, batch size of 16, 30 training epochs, SGD optimizer with momentum of 0.937 and weight decay of 0.0005, and an initial learning rate of 0.01 with cosine decay scheduling. In Figure 4, each row corresponds to input images containing potholes under various environmental and lighting conditions. The first column displays the original input image, the second column shows the YOLOv8 detection results, and the third column presents the results from the proposed model.
From Figure 4, it is evident that detection performance improves significantly with the proposed preprocessing. The proposed method achieves higher accuracy in detecting faint, deformed, or partially occluded potholes than YOLOv8 in its standard setting. The standard model fails to detect potholes, or produces poor bounding boxes, on complex textures or heavily deteriorated roads, whereas the proposed method detects multiple potholes with improved precision. Most crucially, on densely pothole-filled images, the improved pipeline identifies numerous instances that standard YOLOv8 missed or misclassified. Overall, the developed system improves the stability and credibility of pothole detection across various environmental conditions and is thus suitable for practical implementation in intelligent transportation and autonomous road maintenance systems.
Drone-captured imagery has been used to test the feasibility of the proposed model under different altitudes and lighting conditions. Such high-resolution aerial photographs clearly reveal surface features, such as surface cracks, stains, and shallow depressions. After the consolidated dataset training described earlier, the model was tested on drone-captured images of roads not included in the training dataset and was evaluated on how well it would perform in a real-world scenario, as shown in Figure 5.
These results surpass the best results obtained with the combined dataset in terms of processing efficiency, contrast detection, and misclassification reduction, and they provide evidence for drone-based imagery as a viable mechanism for large-scale automated road-surface surveillance. Deep learning combined with aerial imagery forms a scalable pothole detection system that enables real-time detection, facilitating proactive road maintenance and enhancing overall transport safety.
In pothole detection applications, the cost of errors is two-sided: false positives (surface depressions that are not potholes being flagged as potholes) and false negatives (genuine potholes going undetected). The F1-score combines precision and recall and quantifies the extent to which this balance is achieved. The images in Figure 6, Figure 7 and Figure 8 are direct outputs of the evaluation process, obtained by running the trained model on the test dataset.
From Figure 6, we see that the optimal F1-score of 0.97 is obtained at a confidence level of 0.300. This means that, with 30% confidence, the model has the best balance between detecting most potholes (recall) and avoiding false alarms (precision). This boundary point (0.3) may be the optimal operating point if the costs of both types of errors are equal.
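The selection of this operating point can be sketched as a sweep over candidate thresholds, keeping the one that maximizes F1. The (threshold, F1) samples below are illustrative, chosen only to mirror the reported optimum.

```python
# Illustrative sketch: pick the confidence threshold that maximizes F1.
# The sample curve is made up; in practice it comes from evaluating the
# model on the validation set at each candidate threshold.

def best_threshold(curve):
    """curve: iterable of (confidence_threshold, f1) pairs."""
    return max(curve, key=lambda point: point[1])

curve = [(0.1, 0.90), (0.3, 0.97), (0.5, 0.95), (0.9, 0.80)]
threshold, f1 = best_threshold(curve)
```

If false negatives are costlier than false positives (or vice versa), the selection criterion can weight recall and precision accordingly instead of using plain F1.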
A precision–recall curve is useful for assessing a model’s performance on a dataset with high class imbalance, where ROC curves can be misleading. Precision measures the fraction of predicted potholes that are actually potholes, while recall measures the fraction of actual potholes that are detected. The area under the curve gives the mean average precision (mAP).
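The area-under-the-curve computation can be sketched with the all-point interpolation commonly used by object detectors; the precision envelope and the sample curve here are illustrative, not the paper's exact evaluation code.

```python
# Illustrative sketch: average precision (AP) as the area under the
# precision-recall curve, using the monotone precision envelope.

def average_precision(recalls, precisions):
    """recalls must be ascending; precision is first made non-increasing."""
    # Envelope: precision at recall r becomes the max precision at recall >= r.
    prec = list(precisions)
    for i in range(len(prec) - 2, -1, -1):
        prec[i] = max(prec[i], prec[i + 1])
    # Integrate the step curve over recall.
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, prec):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

mAP@0.5 averages this quantity over classes with matches counted at IoU ≥ 0.5; mAP@0.5:0.95 further averages over IoU thresholds from 0.5 to 0.95.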
As shown in Figure 7, the curve begins at high levels of precision and recall, indicating that the model is quite capable of distinguishing pothole cases with minimal spurious alarms. The pothole-class mean average precision (mAP@0.5) is 0.980, and the total mean average precision (mAP) of all classes is 0.980 as well, which demonstrates the same performance level for all target labels. This extremely high mAP indicates that the model can be used with confidence in systems for live pothole detection, with a very low chance of missing or misclassifying potholes. The precision–confidence curve illustrates how high-confidence predictions affect the model’s precision. A very high threshold of confidence would decrease false positives, but it could also decrease true positives.
It is apparent from Figure 8 that precision gradually increases with confidence and reaches a maximum at 0.97 (confidence = 0.909). That means that if the model is highly confident (confidence > 90%), it is likely to make the correct prediction with high confidence. However, extremely high thresholds will reduce recall, so numerous real potholes are likely to be missed. Therefore, if false positives are more serious (e.g., unnecessary road work), a higher threshold will be optimal. But in safety-critical detection (e.g., self-driving cars), this must be done cautiously, balancing trade-offs.
The recall vs. confidence curve shows how the model’s ability to identify all potholes varies with confidence score. A low threshold will cover more real potholes, but at the expense of increased false alarm probability. The curve is significant in applications where a false alarm is safer than a failure to detect a pothole.
The graph in Figure 9 shows that recall reaches its maximum of 0.97 when the confidence threshold is at its minimum. In practice, the model finds most (if not all) of the potholes, even if it accepts predictions with low confidence. Increasing the threshold trades recall for precision: the detection criteria become stricter, at the cost of a larger share of true positives being missed. The accompanying evaluation curves therefore provide a comprehensive description of the model’s operational characteristics, including the error thresholds and conditions under which it performs optimally.
To carefully evaluate the value of each architectural change, controlled experiments were conducted in sequence, starting from the original YOLOv8 framework and adding increasingly advanced variants. The evaluation metrics used (precision, recall, F1-score, and mAP@0.5) are standard for object detection tasks. These experiments were designed to isolate the effects of the individual components, from preprocessing techniques to the Feature Pyramid Network and the attention head.
The data in Table 2 show that each architectural enhancement has a positive effect on detection performance. The baseline YOLOv8 model attains a precision of 0.89 and an mAP@0.5 of 0.87, which sets a strong baseline. The combined preprocessing techniques yield better precision (0.92) and recall (0.90), indicating improved feature representation and reduced noise. Incorporating the Feature Pyramid Network substantially improves multi-scale feature fusion, producing a significant performance gain, as evidenced by an F1-score of 0.97. This clearly confirms the importance of effective feature fusion for detecting objects at different scales. The subsequent refinement, the addition of an attention head, focuses the model further on relevant spatial locations, reflected in a strong mAP@0.5 of 0.98. Notably, although the precision and recall scores of the attention-head variant and the full proposed model are comparable, the sustained high performance confirms that the final model maintains detection quality without degradation.
To evaluate the proposed model’s performance, Table 3 compares it with other state-of-the-art methods for pothole detection. Table 3 lists a range of deep learning-based methods reported in 2024, including typical CNN architectures, ensemble classifiers, and advanced object detection frameworks such as YOLOv5, YOLOv8, and Faster R-CNN. Critical evaluation criteria such as accuracy, precision, recall, F1-score, and mean average precision (mAP) are shown to provide a proper basis for comparison. Benchmarking reflects recent trends in model development and serves as a baseline for measuring the approach’s performance.
As indicated in Table 3, the accuracy of the preceding models ranges from 75 to 93 percent, with F1-scores close to 0.86. While YOLO-based detectors still underpin the field due to their real-time capabilities, studies have also examined the fusion of CNNs with clustering or ensemble techniques to enhance classification accuracy. Despite these promising results, the current model outperforms existing solutions across all evaluation metrics: recall of 0.97, F1-score of 0.97, and mAP of 0.98. These results support the practicality of the proposed model for road-surface monitoring, since accuracy, precision, recall, F1-score, and mean average precision (mAP) all demonstrate its effectiveness under the varying lighting conditions, occlusion, and noise levels that dominate in practice. A detailed comparison with previous methodologies is given in Table 3. Classic CNN-based models, such as the AlexNet variant in [40], achieved respectable accuracy (>93%), but their generalization to harsh environments, such as low lighting, occlusion, and heavy traffic, was not explicitly addressed. Likewise, the CNN + YOLOv3 model in [46] achieved 93% accuracy but is not very robust in terms of recall and real-time adaptability, especially in challenging conditions. YOLOX with DSASNet [43] achieves 87.71% recall, 83.96% F1-score, and 0.842 mAP. Despite its improvements to the object detection pipeline, its relatively low F1-score indicates a trade-off between precision and recall, possibly due to shadows or road segments being misclassified as false positives. By contrast, our model balances both scores, yielding 0.97 for each and indicating greater robustness and generalization. In [45], the ensemble strategy combining YOLOv5, K-means, and Random Forest classifiers achieved an accuracy of 86.7%, precision of 83%, and recall of 87.5%.
Although ensemble tactics were useful for classification, these methods may be unsuitable for real-time inference due to their additional computational cost. Our model, in contrast, is designed to work in real time, using an efficient YOLOv8 backbone and postprocessing units, including non-maximum suppression and real-time visualization. Others, such as YOLOv4 + R-CNN in [47] and CNN + Faster R-CNN in [49], achieved 80% and 92.19% accuracy, respectively, but did not specifically target occluded or small potholes. In contrast, our approach outperforms them by exploiting multi-scale feature extraction. Similarly, the CycleGAN approach in [48], combined with SE-ResNet-18, achieved an F1-score of 0.86, compared with 0.97 for the proposed method; a respectable result, but still insufficient for classifying potholes under low-contrast or complex background conditions. Recent YOLOv8 implementations in [50,55] and [57] have achieved accuracies of 78.27–88.6% and an mAP@0.5 of 78.27%. These models often produce false positives due to road artefacts or changes in lighting. Our model directly addresses these problems by combining contextual feature learning with advanced image-processing techniques, such as CLAHE and edge-preserving filters. In [52], a classical machine learning pipeline consisting of VGG16, CNNs, SVMs, and MLPs achieved 75% accuracy, 68.23% precision, and an F1-score of 0.73, further demonstrating that deep, end-to-end trainable networks outperform traditional classifiers in complex environments.

5. Conclusions

This work aimed to address several important issues in pothole detection, including detection under poor illumination and in bad weather, detection of small and occluded potholes, false detection of road artefacts, and improved performance in challenging environments. To improve image quality under low-light and adverse weather conditions, a powerful preprocessing pipeline was created using CLAHE together with noise-reduction and edge-preservation filters. The updated feature-extraction pipeline significantly increases clarity, allowing many more features to be extracted even under less-than-optimal conditions. The model uses customized multi-scale feature pyramids and attention mechanisms, in a modification of the YOLOv8 detection system, to detect small, partially shaded potholes with greater precision. This design enhances the model’s ability to uncover hidden and subtle patterns, thereby increasing recall and overall reliability. The optimizer exploits varying learning rates and momentum-based updates to emphasize informative regions of an image and suppress misleading background responses.
To improve the model’s performance in complex, dynamic driving conditions, the paper also adds a real-time training tool with a custom implementation of the IoU loss function and an automated hyperparameter search. This provides robustness against lighting changes, occlusion, and the complexity of road scenes. Lastly, to enable real-world implementation, the modules for non-maximum suppression and postprocessing for annotated visualization have been incorporated, along with a real-time inference engine and a detection database. By leveraging drone-derived aerial imagery, the proposed framework enables large-scale, high-resolution monitoring of road networks, allowing rapid evaluation and precise localization of potholes across large, inaccessible terrain. The proposed model achieved a recall of 0.97, an F1-score of 0.97, and a mean average precision (mAP) of 0.98. These results clearly demonstrate the model’s efficacy and reliability for real-time pothole detection across a wide variety of environments. The framework’s scalability and flexibility are further supported when it is complemented with drone-based imaging, ensuring strong, real-time detection capabilities even under unstable environmental conditions. Although the results of the study are promising, there are certain constraints, especially for large-scale, real-time deployment on hardware with limited resources. In future research, the proposed framework will be further developed and implemented on edge devices to enable low-latency, on-device pothole detection for smart transportation systems. To enable large-scale, realistic implementation in real-life environments, this extension will focus on compression, computational efficiency, and hardware-aware optimization of the model.

Author Contributions

Methodology, S.S.R., M.J., I.U.K. and K.A.; validation, S.S.R., M.J., I.U.K. and K.A.; writing—original draft, S.S.R., M.J., I.U.K. and K.A.; writing—review and editing, S.S.R., M.J., I.U.K. and K.A.; visualization, S.S.R., M.J., I.U.K. and K.A.; supervision, S.S.R., M.J. and I.U.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article material. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Zhang, R.; Peng, J.; Gou, W.; Ma, Y.; Chen, J.; Hu, H.; Li, W.; Yin, G.; Li, Z. A robust and real-time lane detection method in low-light scenarios to advanced driver assistance systems. Expert Syst. Appl. 2024, 256, 124923. [Google Scholar] [CrossRef]
  2. Liu, Y.; Wang, Y.; Li, Q. Lane detection based on real-time semantic segmentation for end-to-end autonomous driving under low-light conditions. Digit. Signal Process. 2024, 155, 104752. [Google Scholar] [CrossRef]
  3. Li, D.; Yang, Z.; Nai, W.; Xing, Y.; Chen, Z. A road lane detection approach based on the reformer model. Egypt. Inform. J. 2025, 29, 100625. [Google Scholar] [CrossRef]
  4. Patil, O.; Nair, B.B.; Soni, R.; Thayyilravi, A.; Manoj, C. BoostedDim attention: A novel data-driven approach to improving LiDAR-based lane detection. Ain Shams Eng. J. 2024, 15, 102887. [Google Scholar] [CrossRef]
  5. Zhang, Y.; Zheng, Y.; Tu, Z.; Wu, C.; Zhang, T. CFFM: Multi-task lane object detection method based on cross-layer feature fusion. Expert Syst. Appl. 2024, 257, 125051. [Google Scholar] [CrossRef]
  6. Yaamini, H.S.G.; Swathi, K.J.; Manohar, N.; Ajay Kumar, G. Lane and Traffic Sign Detection for Autonomous Vehicles: Addressing Challenges on Indian Road Conditions. MethodsX 2025, 14, 103178. [Google Scholar] [CrossRef]
  7. Yu, B.; Zhu, Z.; Chen, Y.; Wang, J.; Gao, K.; Qian, X. A Diffusion model-based intelligent optimization method of rural road environments. Int. J. Transp. Sci. Technol. 2025. [Google Scholar] [CrossRef]
  8. Lourenço, B.; Silvestre, D. Enhancing truck platooning efficiency and safety—A distributed Model Predictive Control approach for lane-changing manoeuvres. Control Eng. Pract. 2025, 154, 106153. [Google Scholar] [CrossRef]
  9. Yik, M. YOLOv3-based real-time pothole detection with Google Maps visualization. Int. J. Image Graph. 2019, 19, 1950030. [Google Scholar]
  10. Al Shaghouri, O.; Al Kalaoni, S.; Al Nsour, M. Evaluation of Deep Learning Models for Pothole Detection: YOLOv4 vs. Others. In International Joint Conference on Neural Networks (IJCNN); IEEE: New York, NY, USA, 2021; pp. 1–8. [Google Scholar]
  11. Silva, L.A.; Leithardt, V.R.Q.; Batista, V.F.L.; González, G.V.; De Paz Santana, J.F. Automated Road Damage Detection Using UAV Images and Deep Learning Techniques. IEEE Access 2023, 11, 62918–62932. [Google Scholar] [CrossRef]
  12. Doycheva, S.; Koch, C.; Koch, R. Real-time pavement distress detection using deep learning and computer vision. In IEEE Conference on Big Data; IEEE: New York, NY, USA, 2022; pp. 551–559. [Google Scholar]
  13. Lee, S.Y.; Lee, J.; Lee, S. Improved Regression Modeling for Pothole Prediction in Urban Pavements Considering Climate and Traffic. Dev. Built Environ. 2023, 13, 100109. [Google Scholar]
  14. Fortin, L.V.; Santos, A.R.V.D.; Cagud, P.J.B.; Castor, P.R.; Llantos, O.E. Eyeway: An Artificial Intelligence of Things Pothole Detection System with Map Visualization. Procedia Comput. Sci. 2024, 251, 216–223. [Google Scholar] [CrossRef]
  15. Alzamzami, O.; Babour, A.; Baalawi, W.; Al Khuzayem, L. PDS-UAV: A deep learning-based pothole detection system using unmanned aerial vehicle images. Sustainability 2024, 16, 9168. [Google Scholar] [CrossRef]
  16. Available online: https://www.ndtv.com/topic/pothole-accident (accessed on 5 October 2025).
  17. Kuttah, D.; Waldemarson, A. Next generation gravel road profiling—The potential of advanced UAV drones in comparison with road surface testers and rotary laser levels. Transp. Eng. 2024, 17, 100260. [Google Scholar] [CrossRef]
  18. Jiang, T.Y.; Liu, Z.Y.; Zhang, G.Z. YOLOv5s-road: Road surface defect detection under engineering environments based on CNN-transformer and adaptively spatial feature fusion. Measurement 2025, 242, 115990. [Google Scholar] [CrossRef]
  19. Arishi, A.; Ahuja, P. Multi-Agent Reinforcement Learning for truck–drone routing in smart logistics: A comprehensive review. Comput. Electr. Eng. 2025, 127, 110529. [Google Scholar] [CrossRef]
  20. Yaqoob, S.; Ullah, A.; Awais, M.; Katib, I.; Albeshri, A.; Mehmood, R.; Raza, M.; ul Islam, S.; Rodrigues, J.J. Novel congestion-avoidance scheme for the Internet of Drones. Comput. Commun. 2021, 169, 202–210. [Google Scholar] [CrossRef]
  21. Rayamajhi, A.; Jahanifar, H.; Mahmud, M.S. Measuring ornamental tree canopy attributes for precision spraying using drone technology and self-supervised segmentation. Comput. Electron. Agric. 2024, 225, 109359. [Google Scholar] [CrossRef]
  22. Luo, H.; Li, C.; Wu, M.; Cai, L. An enhanced lightweight network for road damage detection based on deep learning. Electronics 2023, 12, 2583. [Google Scholar] [CrossRef]
  23. Kaya, Ö.; Çodur, M.Y. Automatic detection and classification of road defects on a global scale: Embedded system. Measurement 2025, 243, 116453. [Google Scholar] [CrossRef]
  24. Fan, L.; Wang, D.; Wang, J.; Li, Y.; Cao, Y.; Liu, Y.; Chen, X.; Wang, Y. Pavement defect detection with deep learning: A comprehensive survey. IEEE Trans. Intell. Veh. 2023, 9, 4292–4311. [Google Scholar] [CrossRef]
  25. Wang, J.; Meng, R.; Huang, Y.; Zhou, L.; Huo, L.; Qiao, Z.; Niu, C. Road defect detection based on an improved YOLOv8s model. Sci. Rep. 2024, 14, 16758. [Google Scholar] [CrossRef] [PubMed]
  26. Wu, C.; Ye, M.; Zhang, J.; Ma, Y. YOLO-LWNet: A lightweight road damage object detection network for mobile terminal devices. Sensors 2023, 23, 3268. [Google Scholar] [CrossRef] [PubMed]
  27. Darmawan, M.A.; Mukti, S.N.; Tahar, K.N. Pothole Detection Based on UAV Photogrammetry. Rev. Int. De Geomat. 2025, 34, 21–35. [Google Scholar] [CrossRef]
  28. Han, Z.; Cai, Y.; Liu, A.; Zhao, Y.; Lin, C. MS-YOLOv8-based object detection method for pavement diseases. Sensors 2024, 24, 4569. [Google Scholar] [CrossRef]
  29. Upadhyay, M.R.; Ushasukhanya, S.; Teja, V.R.; Malleswari, T.N.; Mahendra, K.V. A Deep Learning Approach for Pothole Detection Using Yolov8 Model. In Proceedings of the International Conference on Advancement in Renewable Energy and Intelligent Systems (AREIS), Hyderabad, India, 5 December 2024; IEEE: New York, NY, USA, 2024; pp. 1–6. [Google Scholar]
  30. Raj, R.; Vadapu, B.K.; Gundu, S.; Lekshmi, R.R. An Artificial Intelligence-Based Assistant to Detect, Notify and Update Potholes. In Proceedings of the IEEE International Conference on Interdisciplinary Approaches in Technology and Management for Social Innovation (IATMSI), Chennai, India, 14 March 2024; IEEE: New York, NY, USA, 2024; Volume 2, pp. 1–6. [Google Scholar]
  31. Chaaban, O.; Daher, L.; Ghaddar, Y.; Zantout, N.; Asmar, D.; Daher, N. An End-to-End Pothole Detection and Avoidance System for Autonomous Ground Vehicles. In Proceedings of the International Conference on Control, Automation, and Instrumentation (IC2AI), Dubai, United Arab Emirates, 11–13 February 2025; IEEE: New York, NY, USA, 2025; pp. 1–6. [Google Scholar]
  32. Thakkar, K.; Shah, S.; Mulchandani, B.; Katre, N.; Dalvi, H. Automated Pothole Detection using Transfer Learning. In Proceedings of the IEEE 9th International Conference for Convergence in Technology (I2CT), Mumbai, India, 5–7 April 2024; IEEE: New York, NY, USA, 2024; pp. 1–8. [Google Scholar]
  33. Gaikwad, V.; Hepat, O.; Humne, S.; Heda, K.; Ingawale, V.; Bhadke, H.; Inani, H. Automated Pothole Detection and Mapping for Road Maintenance. In Proceedings of the 4th International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE), Pune, India, 14–15 May 2024; IEEE: New York, NY, USA, 2024; pp. 1806–1812. [Google Scholar]
  34. Dhanvanth Prasath, M.P.; Avinash, M.; Prabu, M. Comparative Analysis of Machine Learning Models for Smart Pothole Detection: Enhancing Road Safety and Maintenance Efficiency. In Proceedings of the Global Conference on Communications and Information Technologies (GCCIT), Dubai, United Arab Emirates, 25–27 October 2024; IEEE: New York, NY, USA, 2024; pp. 1–6. [Google Scholar]
  35. Karukayil, A.; Quail, C.; Cheein, F.A. Deep Learning Enhanced Feature Extraction of Potholes Using Vision and LiDAR Data for Road Maintenance. IEEE Access 2024, 12, 184541–184549. [Google Scholar] [CrossRef]
  36. Murty, P.S.; Sree, P.K.; Sree, G.P.; Gubbala, D.M.; Bezawada, D.P.; Vineetha, D. Detection and Classification of Potholes using CNN. In Proceedings of the Sixth International Conference on Computational Intelligence and Communication Technologies (CCICT), Hyderabad, India, 19–20 April 2024; IEEE: New York, NY, USA, 2024; pp. 83–90. [Google Scholar]
  37. Shalen, S. Distributed Framework for Pothole Detection and Monitoring Using Federated Learning. In Proceedings of the Second International Conference on Advances in Information Technology (ICAIT), Bangalore, India, 24–25 July 2024; IEEE: New York, NY, USA, 2024; Volume 1, pp. 1–7. [Google Scholar]
  38. Kumar, A.; Reddy, R.M. Efficient Pothole Detection in Video Sequences Using YOLOv4 with Cosine Similarity: A Road Safety Solution. In Proceedings of the International Conference on Intelligent and Innovative Technologies in Computing, Electrical and Electronics (IITCEE), Delhi, India, 16–17 January 2025; IEEE: New York, NY, USA, 2025; pp. 1–6. [Google Scholar]
  39. Chen, S.; Laefer, D.F.; Zeng, X.; Truong-Hong, L.; Mangina, E. Volumetric pothole detection from UAV-based imagery. J. Surv. Eng. 2024, 150, 05024001. [Google Scholar] [CrossRef]
  40. Singh, A.; Mehta, A.; Padaria, A.A.; Jadav, N.K.; Geddam, R.; Tanwar, S. Enhanced pothole detection using YOLOv5 and federated learning. In Proceedings of the 14th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 18–19 January 2024; IEEE: New York, NY, USA, 2024; pp. 549–554. [Google Scholar]
  41. Meier, J.; Welborn, E.; Diamantas, S. Pothole segmentation and area estimation with thermal imaging using deep neural networks and unmanned aerial vehicles. Mach. Vis. Appl. 2025, 36, 17. [Google Scholar] [CrossRef]
  42. Wijaya, J.; Abiyyu’Ammaar, M.; Kusuma, I.K.; Gunawan, A.A. Enhancing Infrastructure Monitoring: Pothole Detection in Road Images Using YOLOv8 and Open Datasets. In Proceedings of the 6th International Conference on Cybernetics and Intelligent System (ICORIS), Surabaya, Indonesia, 29–30 November 2024; IEEE: New York, NY, USA, 2024; pp. 1–6. [Google Scholar]
  43. Bhatt, R.; Shah, J.; Patel, S. Enhancing Transportation Infrastructure: A Deep Learning Approach for Pothole Detection. In Proceedings of the 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kanpur, India, 24–26 June 2024; IEEE: New York, NY, USA, 2024; pp. 1–6. [Google Scholar]
  44. Vasanthi, R.; Reshmy, A.K.; Praveen, D.S. Improving Road Maintenance and Safety through Weighted Ensemble of Deep Convolutional Neural Networks: Focus on Pothole Detection. In Proceedings of the 7th International Conference on Circuit Power and Computing Technologies (ICCPCT), Kollam, India, 8–9 August 2024; IEEE: New York, NY, USA, 2024; Volume 1, pp. 1900–1905. [Google Scholar]
  45. Shasikala, Y.; Kantha, M.R.; Dinesh, M.; Nagamani, K. INCL: An Effective Deep Learning Model to Identify Road Potholes by Using Image Based Neural Classification Logic. In Proceedings of the International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES), Coimbatore, India, 12–13 December 2024; IEEE: New York, NY, USA, 2024; pp. 1–6. [Google Scholar]
  46. Huang, X.; Cheng, J.; Xiang, Q.; Dong, J.; Wu, J.; Fan, R.; Tang, X. LEPS: A Lightweight and Effective Single-Stage Detector for Pothole Segmentation. IEEE Sens. J. 2023, 24, 22045–22055. [Google Scholar] [CrossRef]
  47. Reddy, S.S.; Rao, V.V.; Priyadarshini, V.; Nrusimhadri, S. You only look once model-based object identification in computer vision. IAES Int. J. Artif. Intell. 2024, 13, 827–838. [Google Scholar] [CrossRef]
  48. Liu, J.; Wang, H.; Wei, J.; Nan, J. Research on Road Pothole Recognition Based on Deep Convolutional Neural Networks. In Proceedings of the IEEE 2nd International Conference on Image Processing and Computer Applications (ICIPCA), Zhangjiakou, China, 28–30 June 2024; IEEE: New York, NY, USA, 2024; pp. 1420–1424. [Google Scholar]
  49. Saranya, E.; Nivetha, R.; Abirami, S.; MohaideenArsath, M.; Dharaneesh, S. Revolutionizing Road Maintenance: YOLO Based Pothole Detection System. In Proceedings of the 10th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 14–16 March 2024; IEEE: New York, NY, USA, 2024; Volume 1, pp. 1991–1997. [Google Scholar]
  50. Cui, H.; Lin, K.; Li, X.; Qian, J.; Hu, H. Road Pothole Detection Model Based on AlexNet Convolutional Neural Network and Confusion Matrix. In Proceedings of the 2024 IEEE 2nd International Conference on Electrical, Automation and Computer Engineering (ICEACE), Shenzhen, China, 29–30 December 2024; IEEE: New York, NY, USA, 2024; pp. 284–287. [Google Scholar]
  51. Cai, Y.; Deng, M.; Xu, X.; Wang, W.; Xu, X. Road Pothole Recognition and Size Measurement Based on the Fusion of Camera and LiDAR. IEEE Access 2025, 13, 46210–46227. [Google Scholar] [CrossRef]
  52. Bharadwaj, D.V.; Korapati, P.; Varshitha, S.; Babu, V.L.; Sai, K.L. Smart Pothole Detection and Real-Time Monitoring Using GPS Technology. In Proceedings of the International Conference on Multi-Agent Systems for Collaborative Intelligence (ICMSCI), Bangalore, India, 20–21 January 2025; IEEE: New York, NY, USA, 2025; pp. 1020–1026. [Google Scholar]
  53. Wang, A.; Lang, H.; Chen, Z.; Peng, Y.; Ding, S.; Lu, J.J. The two-step method of pavement pothole and raveling detection and segmentation based on deep learning. IEEE Trans. Intell. Transp. Syst. 2024, 25, 5402–5417. [Google Scholar] [CrossRef]
  54. Wang, D.; Xu, Y.; Zhu, H.; Liu, K. A Novel Framework for Pothole Area Estimation Based on Object Detection and Monocular Metric Depth Estimation. In Proceedings of the IEEE International Conference on Signal, Information and Data Processing (ICSIDP), Chongqing, China, 22–24 November 2024; IEEE: New York, NY, USA, 2024; pp. 1–6. [Google Scholar]
  55. Suhas, A.; Pavan, T.; Sandeep, R.P.; Shreyas, B.; Mahendra, M. Advanced Pothole Detection Using Image Processing. In Proceedings of the International Conference on Signal Processing, Computation, Electronics, Power and Telecommunication (IConSCEPT), Mangalore, India, 4–5 July 2024; IEEE: New York, NY, USA, 2024; pp. 1–6. [Google Scholar]
  56. Hamed, H.; Khan, M.R. AI-based Detection of Potholes Ahead of a Visually Impaired Person Using Ultrasonic Sensors Array and Camera for Blind Navigation. In Proceedings of the 9th International Conference on Mechatronics Engineering (ICOM), Karachi, Pakistan, 13–14 August 2024; IEEE: New York, NY, USA, 2024; pp. 448–453. [Google Scholar]
  57. Faisal, A.; Gargoum, S. Cost-effective LiDAR for pothole detection and quantification using a low-point-density approach. Autom. Constr. 2025, 172, 106006. [Google Scholar] [CrossRef]
  58. Wang, Z.; Ma, Z.; Wang, Z.; Gao, S.; Peng, J. A novel road damage detection model with efficient attention and Dynamic Snake Convolution. Eng. Appl. Artif. Intell. 2026, 163, 112618. [Google Scholar] [CrossRef]
  59. Singh, P.; Wijethunga, R.; Sadhu, A.; Samarabandu, J. Expert evaluation system for pothole defect detection. Expert Syst. Appl. 2025, 277, 127280. [Google Scholar] [CrossRef]
  60. Nath, N.D.; Cheng, C.S.; Behzadan, A.H. Drone mapping of damage information in GPS-Denied disaster sites. Adv. Eng. Inform. 2022, 51, 101450. [Google Scholar] [CrossRef]
  61. Fortin, L.V.; Llantos, O.E. Performance Analysis of YOLO versions for Real-time Pothole Detection. Procedia Comput. Sci. 2025, 257, 77–84. [Google Scholar] [CrossRef]
  62. Parvez, M.S.; Moridpour, S. Application of smart technologies in safety of vulnerable road users: A review. Int. J. Transp. Sci. Technol. 2024, 18, 285–304. [Google Scholar] [CrossRef]
  63. Ejaz, N.; Alam, A.B.; Choudhury, S. Generative AI-Driven Edge-Cloud System for Intelligent Road Infrastructure Inspection. Results Eng. 2025, 27, 105844. [Google Scholar] [CrossRef]
  64. He, J.; Gong, L.; Xu, C.; Wang, P.; Zhang, Y.; Zheng, O.; Su, G.; Yang, Y.; Hu, J.; Sun, Y. HighRPD: A high-altitude drone dataset of road pavement distress. Data Brief 2025, 59, 111377. [Google Scholar] [CrossRef]
  65. Mirajkar, R.; Yenkikar, A.; Nawalkar, S.; Kaul, R.; Rokade, A.; Rothe, K. Enhanced Pothole Detection in Road Condition Assessment Using YOLOv8. In Proceedings of the IEEE International Conference for Women in Innovation, Technology & Entrepreneurship (ICWITE), Pune, India, 16–17 February 2024; IEEE: New York, NY, USA, 2024; pp. 429–433. [Google Scholar]
  66. Sengupta, S.; Gupta, P.; Verma, M. Enhancement of Potholes Images using Luminosity Based Enhancement Technique. In Proceedings of the IEEE International Conference on Computing, Power and Communication Technologies (IC2PCT), Allahabad, India, 9–10 February 2024; IEEE: New York, NY, USA, 2024; Volume 5, pp. 362–365. [Google Scholar]
  67. Widodo, H.; Taufiqurrohman, H.; Muis, A.; Wijayanto, Y.N.; Prihantoro, G.; Dwiyanti, H.; Cahya, Z.; Widaryanto, A.; Nugroho, T.H. Experimental Evaluation of Pothole Detection and Its Dimension Estimation Using YOLOv8 and Depth Camera for Road Surface Analysis. In Proceedings of the International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET), Bandung, Indonesia, 12–13 November 2024; IEEE: New York, NY, USA, 2024; pp. 339–344. [Google Scholar]
  68. Available online: https://data.mendeley.com/datasets/sywswj7djj/1 (accessed on 15 June 2025).
  69. Reddy, S.S.; Khan, I.U. Weather and Monsoon Resilient Pothole Detection: A YOLO Based Real-Time Application for Diverse Road Conditions. SGS-Eng. Sci. 2025, 1. Available online: https://spast.org/techrep/article/view/5240 (accessed on 3 March 2026).
Figure 1. Potholes on Indian roads: a silent killer.
Figure 2. Sample dataset images.
Figure 3. DPD-Net: Deep Pothole Detection Network.
Figure 4. Comparison of performance between YOLOv8 and DPD-Net.
Figure 5. Detection results on drone images from the HighRPD dataset.
Figure 6. F1-score.
Figure 7. PR curve.
Figure 8. Precision.
Figure 9. Recall.
Table 1. Literature review.

| Ref | Year | Models Used | Results | Limitations |
|------|------|-------------|---------|-------------|
| [48] | 2024 | EfficientNetB0, CNN | Accuracy > 90% | Limited generalization in certain lighting and road conditions |
| [49] | 2024 | YOLOv7, YOLOv3, YOLOv4 | High accuracy, fast processing, and real-time pothole detection | Challenges in varied road conditions and weather scenarios |
| [50] | 2024 | AlexNet CNN | Accuracy > 93% | Performance drops under poor lighting; struggles with complex backgrounds such as multi-lane roads; affected by pedestrian interference |
| [51] | 2025 | Fusion of camera and LiDAR, mean-shift clustering | 27.4% improvement in accuracy, 88.2% reduction in processing time, real-time processing at 45.6 ms per frame | Performance drops in complex environments; limited to scenarios with significant depth contrast |
| [52] | 2025 | Arduino, ultrasonic sensor, GPS, GSM | Real-time pothole detection reduces vehicle damage and improves road safety | Accuracy depends on sensor calibration; susceptible to false positives in complex environments |
| [53] | 2024 | YOLOX (improved), DSASNet | Recall of 87.71% for potholes and 54.97% for ravelling; F1-score of 83.96% for potholes; mAP of 0.842 | Ravelling misclassification; challenges in noisy and complex environments |
| [54] | 2024 | YOLOv5n-p6, ZoeDepth (monocular depth estimation) | High detection accuracy with a 2.6% improvement in mAP@50; real-time pothole area estimation | Challenges in detecting minor potholes and under varying lighting conditions |
| [55] | 2024 | YOLOv5, K-means clustering, Random Forest | Accuracy 86.7%, precision 83%, recall 87.5%; real-time detection with good precision | Sensitive to lighting conditions; challenges with large datasets |
| [56] | 2024 | CNN, YOLOv3 | 93% detection accuracy, adequate for both large and small potholes, integrated with ultrasonic sensors | Challenges with water-filled potholes; inconsistencies in complex environments |
| [57] | 2025 | Curvature-based LiDAR algorithm using voxelization and statistical analysis | 3–10% error margin even at 205 points/m² density; processing speed of 23–88 s/km | Limited to LiDAR datasets; no aerial (drone) data or real-time implementation |
| [58] | 2026 | YOLOv8 + bi-level routing attention + Dynamic Snake Convolution | mAP@0.5 = 90.5%, outperforming baseline YOLO models | High computational cost; trained on ground images, not drone datasets |
| [59] | 2025 | IoT-enabled vibration sensors + thresholding algorithm | Detected and measured potholes accurately in field tests with nine pothole samples | Focused on ground vehicles (UGVs); no aerial/drone-based sensing |
| [60] | 2022 | Scale-invariant feature mapping + homography estimation | RMS mapping error of 32.7–36.9 ft without GPS; effective in unstructured disaster scenes | Focused on disaster mapping, not road-specific distress; moderate spatial error |
| [61] | 2025 | YOLOv9-tiny, YOLOv10-nano, YOLOv11-nano | Dataset of 16,054 images; improved real-time performance and detection accuracy with nano/tiny models | No aerial/drone deployment; ground-level cameras only |
| [62] | 2025 | Various smart technologies, including drones, cameras, and sensors | Highlights drones as key tools for collecting real-time road safety data on cyclists and pedestrians | Focused on VRU safety; does not directly analyze pothole detection performance |
| [63] | 2025 | Edge AI (MobileNetV3) + cloud AI (EfficientNet-B4, MiDaS, T5-XL) | 50–70% reduced bandwidth, 30–50 ms edge inference, automated text report generation | Needs extensive computational setup; drone integration mentioned but not evaluated |
| [64] | 2025 | Dataset (11,696 UAV images annotated for cracks, potholes, block distress) | Large-scale UAV dataset (8192 × 5460 px resolution, YOLO-format annotations) | Dataset only; no detection model proposed or evaluated |
| [65] | 2024 | YOLOv8 | Accuracy 78.27%, mAP@0.5 of 78.27%; effective real-time detection with low computational cost | Struggles with small potholes; challenges under varying weather conditions |
| [66] | 2024 | Luminosity-based enhancement, histogram equalization | Entropy of enhanced images: 7.77; improved clarity and accuracy in pothole detection | Limited by image quality and environmental lighting; sensitive to background noise |
| [67] | 2024 | YOLOv8, Intel RealSense D455 depth camera | mAP@0.5 = 78.27% | Low F1-score (0.41); difficulty distinguishing patched potholes from real ones |
Table 2. Ablation study results of the proposed model variants.

| Model Variant | Precision | Recall | F1-Score | mAP@0.5 |
|---------------|-----------|--------|----------|---------|
| YOLOv8 (baseline) | 0.89 | 0.86 | 0.875 | 0.87 |
| + Preprocessing | 0.92 | 0.90 | 0.915 | 0.91 |
| + FPN | 0.93 | 0.93 | 0.92 | 0.94 |
| + Attention head | 0.958 | 0.96 | 0.95 | 0.96 |
| Full proposed model | 0.97 | 0.97 | 0.97 | 0.98 |
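The F1 values reported in the ablation table follow the standard definition as the harmonic mean of precision and recall. A minimal Python sketch of that relationship (the function name is illustrative, not from the paper's codebase):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Full proposed model row of Table 2: precision = 0.97, recall = 0.97
print(round(f1_score(0.97, 0.97), 3))  # 0.97
```

Small discrepancies between a table's F1 column and this formula typically come from rounding the precision and recall values before publication.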
Table 3. Comparison table with the existing works.

| Ref | Year | Model Used | Results |
|------|------|------------|---------|
| [40] | 2024 | AlexNet CNN | Accuracy > 93% |
| [43] | 2024 | YOLOX (improved), DSASNet | Recall 87.71%, F1-score 83.96%, mAP of 0.842 |
| [45] | 2024 | YOLOv5, K-means clustering, Random Forest | Accuracy of 86.7%, precision: 83%, recall: 87.5% |
| [46] | 2024 | CNN, YOLOv3 | 93% accuracy |
| [47] | 2024 | YOLOv4, R-CNN | 80% detection accuracy |
| [48] | 2024 | SE-ResNet-18 with CycleGAN for pseudosample generation | F1-score: 0.86 |
| [49] | 2024 | CNN, Faster R-CNN | Accuracy: 92.19% |
| [50] | 2024 | YOLOv8 | Accuracy: 88.6% |
| [52] | 2024 | VGG16 + CNN, MLP, SVM | Accuracy: 75%, precision: 68.23%, F1-score: 0.73 |
| [55] | 2024 | YOLOv8 | Accuracy: 78.27%, mAP@0.5: 78.27% |
| [57] | 2024 | YOLOv8, Intel RealSense D455 depth camera | mAP@0.5 = 78.27% |
| Proposed model | 2026 | Enhanced YOLOv8 with custom feature extraction | Recall 0.97, F1-score 0.97, mAP of 0.98 |
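Several results above are reported at mAP@0.5, meaning a detection counts as a true positive only when its predicted box overlaps the ground-truth box with IoU ≥ 0.5. A minimal sketch of that overlap measure for axis-aligned boxes (a generic illustration, not the authors' implementation):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # overlap area, 0 if disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted box covering exactly the top half of the ground truth scores 0.5,
# i.e., right at the mAP@0.5 matching threshold.
print(iou((0, 0, 10, 10), (0, 0, 10, 5)))  # 0.5
```

Raising the threshold (e.g., mAP@0.75) demands tighter localization, which is why mAP@0.5 figures are typically the highest reported.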
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
