Article

Federated Learning-Based Road Defect Detection with Transformer Models for Real-Time Monitoring

by Bushra Abro 1,2,*, Sahil Jatoi 1,2, Muhammad Zakir Shaikh 1,2,3, Enrique Nava Baro 4, Mariofanna Milanova 5,* and Bhawani Shankar Chowdhry 1,2

1 National Centre for Robotics, Automation and Artificial Intelligence, Mehran University of Engineering and Technology (MUET), Jamshoro 76062, Pakistan
2 NCRA-CMS Lab, Mehran University of Engineering and Technology (MUET), Jamshoro 76062, Pakistan
3 Mechanical Engineering and Energy Efficiency, School of Industrial Engineering, University of Malaga, C/Doctor Ortiz Ramos, s/n, Campus de Teatinos, 29071 Málaga, Spain
4 Departamento de Ingeniería de Comunicaciones, Universidad de Malaga, C/Doctor Ortiz Ramos, s/n, Campus de Teatinos, 29071 Málaga, Spain
5 Department of Computer Science, University of Arkansas at Little Rock, 2801 South University Avenue, Little Rock, AR 72204, USA
* Authors to whom correspondence should be addressed.
Computers 2026, 15(1), 6; https://doi.org/10.3390/computers15010006
Submission received: 13 November 2025 / Revised: 5 December 2025 / Accepted: 17 December 2025 / Published: 22 December 2025
(This article belongs to the Section AI-Driven Innovations)

Abstract

This research article presents a novel road defect detection methodology that integrates deep learning techniques with a federated learning approach. Existing road defect detection systems rely heavily on manual inspection and sensor-based techniques, which are prone to errors. To overcome these limitations, a data-acquisition system built around a GoPro HERO 9 camera was used to capture high-quality videos and images of road surfaces. A comprehensive dataset consisting of multiple road defects, such as cracks, potholes, and uneven surfaces, was pre-processed and augmented to prepare it for effective model training. A Real-Time Detection Transformer-based model achieved an mAP50 of 99.60% and an mAP50-95 of 99.55% in cross-validation for road defect detection and object detection tasks. Federated learning enabled the model to be trained in a decentralized manner, enhancing data protection and scalability. The proposed system achieves higher detection accuracy for road defects while increasing speed, efficiency, and scalability, making it a potential asset for real-time monitoring.

1. Introduction

Roads are one of the fundamental modes of transportation as they serve as critical connectors between cities, regions, and countries. They allow the transportation of goods and people, which facilitates economic operations and trade activities. The expansion of cities and economies has made road infrastructure critical in modern life because it provides access to educational institutions and healthcare facilities [1,2]. Transportation systems rely heavily on the maintenance of durable road conditions.
Road surfaces often develop various defects such as cracks, potholes, etc. These defects worsen over time due to climatic effects, the stress imposed by heavy vehicle loads, and deficiencies in the quality of material used in construction [3]. Road defects cause degradation of structural integrity, which leads to safety hazards for vehicles and pedestrians. Potholes on roads cause damage to tires and increase the risk of accidents. Road defects such as cracks and uneven surfaces increase the cost of maintenance and reduce safety for all road users and pedestrians. Road infrastructure requires proactive approaches for road defect detection (RDD) to enhance the safety and life of roads.
Manual and expert inspection are the foundation of RDD. These are time-consuming, expensive, and prone to human errors. As the road infrastructure expands across long distances, it becomes more challenging to scale up maintenance and monitoring efforts across vast areas. The cost of repair increases and safety concerns intensify when various impairments remain undetected over an extended period [4]. Automatic and sensor-based methods are better solutions, but there is a problem of precision and adaptability in diverse environments or across broad areas.
Deep learning (DL) provides various techniques to automate RDD while maintaining high accuracy standards. Machine learning (ML) models trained on extensive road image datasets enable real-time defect detection (DD) and classification, which minimizes manual inspection requirements. The models analyze huge datasets at high speed, which leads to prompt DD that lowers accident risks and enhances road safety. DL algorithms provide significant value by classifying both major and minor defects, which helps to maintain the structural safety of road infrastructure [5].
The contributions of this paper are as follows:
  • Developed a high-resolution RDD dataset using a GoPro HERO 9 camera (manufactured by GoPro Inc. and purchased from Shenzhen, China), capturing detailed images of cracks, potholes, and uneven surfaces to improve data quality and diversity for deep learning training.
  • Implemented a Real-Time Detection Transformer (RTDT) model that leverages self-attention mechanisms for precise detection of subtle and overlapping road defects in complex real-world conditions, surpassing traditional convolutional models.
  • Integrated federated learning into the RTDT framework to enable decentralized, privacy-preserving model training across multiple data sources, enhancing model generalization and scalability for large-scale deployments.
  • Achieved near-perfect detection accuracy (mAP50 of 99.60) with real-time processing capabilities, offering a scalable, efficient, and privacy-conscious solution for practical road maintenance and monitoring applications.
The challenges outlined above—manual inspection inefficiencies, limited scalability of sensor-based methods, data privacy concerns in centralized systems, and the need for real-time processing—directly motivate the key innovations in this work. Our high-resolution dataset addresses data quality limitations that hinder accurate defect detection. The RTDT model’s self-attention mechanism tackles the challenge of detecting subtle and overlapping defects that traditional CNNs miss. Most importantly, our federated learning integration directly responds to the dual challenges of scalability and privacy, enabling decentralized training across multiple data sources without compromising data security. These contributions collectively address the gap between existing detection accuracy and the requirements for practical, large-scale road monitoring systems.
The structure of this paper is as follows: Section 2 provides an overview of existing research methods for RDD, including traditional approaches, sensor-based approaches, and DL applications. Section 3 explains the problem in detail. The methodology and materials are detailed in Section 4, which describes dataset curation, model implementation, and evaluation metrics. Section 5 presents the results and evaluation of our method, and Sections 6 and 7 conclude the paper by summarizing the findings and future research directions.

2. Related Work

Traditional RDD techniques relied on trained personnel performing visual inspection to assess road conditions by identifying surface defects like cracks and potholes. These fundamental methods were time-consuming, labor-intensive, and prone to human errors, which resulted in inconsistent detections and sometimes left critical defects unidentified [6,7]. The shortcomings of manual inspection have driven the development of automated RDD systems that provide greater efficiency and reliability. Automated sensor-based methods have become popular because they offer a more efficient alternative to traditional RDD [8]. Monitoring road surface conditions involves cameras, accelerometers, and LiDAR sensors collecting real-time data, which advanced algorithms process to detect and classify road surface defects. In a 2022 study, Gupta et al. combined an extensive image dataset, data augmentation strategies, and a deep learning model to detect potholes with 98% accuracy [9]. Recent advances in automatic data augmentation have also shown promise for industrial defect detection: ALADA is a lightweight automatic data augmentation framework designed to enhance industrial defect detection amid data scarcity and complex scenarios such as textured backgrounds, uneven brightness, low contrast, and intraclass variations [10]. In a 2023 study, Liu et al. presented a system that measures vehicle vibrations to accurately assess pavement surface conditions and precisely classify road defects. Studies have also explored multimodal sensor fusion methods that integrate a camera and accelerometer, along with additional sensors, to enhance detection accuracy [11]. A study by Orhan et al. introduced a framework that integrates multiple sensor types to improve RDD precision and enable real-time DD [12].
Advanced tools and technologies developed through ML/DL techniques enable accurate and efficient detection and classification of road defects. Convolutional Neural Networks (CNNs) have proven effective for object detection tasks in image-based applications [13]. Another study demonstrated the use of YOLO and transformer models for edge device deployment and real-time defect detection systems [14]. Recently, a comprehensive dataset was published for a real-time defect detection system [15], and a related study showed that DL models such as SSD MobileNet and YOLO achieved an exceptional validation accuracy of 99.95% for real-time defect detection [16]. Benallal et al. (2023) achieved a 91% detection rate for cracks and potholes by implementing a YOLO algorithm built on CNN technology [17]. Liu et al. (2024) introduced an enhanced version of the YOLOv5s model that achieved a mean average precision (mAP) of 92.3% and an F1-score of 91.3% for pavement crack detection [18]. More recent research introduced GM-DETR, which integrates global attention with CNN features to enhance DETR models and increase RDD accuracy [19]. Separately, another study investigated the influence of streetscape and land use on urban accidents in Mashhad, Iran, between 2017 and 2021 by analyzing accident data from three urban zones. It employed machine learning algorithms—Random Forest Regression (RFR), eXtreme Gradient Boosting (XGBoost), and Multilayer Perceptron (MLP)—implemented in Python 3.7 to model and predict urban accident occurrences. Among these, Random Forest achieved the highest predictive accuracy (85% on test data), followed by XGBoost (81%) and MLP (75%). The study found that commercial land use areas have the highest number of accidents, approximately three times more than residential areas, and that open streetscapes witness 75% of accidents, three times more than enclosed streetscapes.
Moreover, land use has a greater impact on accident occurrences than streetscape. The findings provide valuable insights for engineers and urban planners to implement targeted safety interventions and improve road safety in urban settings by considering both land use and streetscape characteristics [20]. DL advancements combined with edge computing allow effective real-time deployment for autonomous vehicles as well as road monitoring systems [21].
Detection systems for road defects face significant research challenges, including data scarcity, variability in road conditions, and the need for real-time processing on devices with limited resources. Naseralavi et al. analyzed 23,242 rainy weather crashes in California (2015–2017) from HSIS data, stratifying them into 12 subgroups by driver gender and crash type (rear-end, hit object, sideswipe, overturned, head-on, broadside), and using binary logistic regression to model severity (PDO vs. injury/fatality). Key findings show that multi-vehicle crashes reduce severity by distributing impact energy, high AADT (≥250 k) increases it, younger drivers face a higher risk, and effects vary by context—head-on models fit best (AUC up to 0.748). The results support targeted interventions like speed management on high-volume roads and youth education during rain [22]. Another study proposed a hierarchical control framework for managing traffic flow on heterogeneous freeways shared by human-driven and autonomous vehicles. It improves the multi-class METANET model with state-dependent parameters to accurately capture traffic dynamics and uses a two-level control approach: a distributed multi-variable extremum seeking (DMV-ES) controller at the upper level optimizes vehicle densities to maximize flow and reduce congestion, while a distributed filtered feedback linearization (D-FFL) controller at the lower level computes robust vehicle speed commands to track these optimal densities. The framework is designed to work with limited model knowledge, handles vehicle interactions, and demonstrates improved traffic flow and congestion management in simulations compared to conventional methods [23]. Khameneh et al. (2024) introduced a hybrid machine learning and simulation framework that models the impact of disinformation on public transit systems, applying advanced algorithms to predict disruptions and improve operational resilience [24].
The implementation of transformer-based methods has significantly improved both the accuracy and operational efficiency of RDD systems while showing exceptional potential for infrastructure maintenance solutions. The RDD-YOLOv5 algorithm integrates transformer structures and the YOLOv5 framework to achieve 91.48% mAP for road crack detection, which is 2.5% higher than prior models [25]. Wootton et al. (2022) showed that integrating data fusion techniques with echo state networks improves the non-destructive detection of defects in steel reinforcement through multi-modal data approaches, which enhance diagnostic accuracy [26]. Additionally, the 3Pod system was introduced by Alshammari et al. (2022), which combines FL and IoT technologies to detect road defects in real-time while demonstrating how data fusion can improve performance [27].
The development of the latest technologies faces persistent problems with complex backgrounds and diverse defect types. Subsequent studies should aim to enhance transformer models for real-time use cases while investigating unsupervised learning methods to minimize reliance on labeled data [28]. FL represents an innovative approach to RDD that enables decentralized model training and ensures data privacy protection. This method proves to be essential in situations where data exists across multiple locations and different institutions. The 3Pod system utilizes FL and DL for detecting potholes in the real world and offers real-time road condition assessments while providing data privacy. Advanced methods like attention mechanisms and adversarial domain adaptation have been implemented in FL systems to enhance pothole detection accuracy through improved performance in semantic segmentation networks [29]. The integration of multispectral imaging with FL enables the integration of multiple data sources to enhance high-resolution-based dataset defect classification with high accuracy [30]. The potential of this technology faces significant challenges, such as varied and diverse client data profiles and the risk of adversarial attacks, which may undermine the reliability and performance of the model. Upcoming studies plan to tackle existing issues through the creation of stronger FL systems and the adoption of the latest technologies, such as blockchain and UAVs, to improve RDD capabilities [31,32].
Sensor-based and ML techniques have made considerable progress in RDD through FL systems that significantly enhance precision and resolve problems, including environmental changes and the requirement for extensively labeled data. Transformer-based models and multimodal sensor fusion have led to better detection accuracy, but real-time processing and data privacy problems remain widespread in decentralized systems. The research presents a novel method that integrates FL with a Real-Time Detection Transformer (RTDT) to overcome existing limitations by improving detection accuracy, real-time processing capabilities, and data privacy in practical RDD applications.

Research Gaps

Despite significant advances in road defect detection, several critical gaps remain in the existing literature that limit practical deployment. While studies by Benallal et al. [17] and Liu et al. [18] achieved detection rates of 91% and 92.3% mAP, respectively, using YOLO-based approaches, these methods struggle with subtle defects and overlapping features in complex real-world environments. Traditional CNN architecture lacks the global context understanding necessary for detecting fine cracks and differentiating between similar defect types [13]. Existing centralized approaches, such as those employed by Gupta et al. [9] and Liu et al. [11], require aggregating sensitive road infrastructure data on central servers. This creates privacy concerns and scalability bottlenecks, particularly for nationwide deployment across multiple municipalities [4,5]. Current federated learning applications in RDD, such as the 3Pod system [31], show promise but lack integration with advanced transformer architectures. Many high-accuracy models [13,16] achieve excellent offline performance but lack optimization for real-time deployment on resource-constrained edge devices. The trade-off between detection accuracy and processing speed remains unresolved in practical applications [21]. Additionally, datasets often lack comprehensive coverage of diverse environmental conditions and defect types [2,3]. Moreover, effective data augmentation strategies specifically designed for defect detection, as demonstrated in industrial applications [10], remain underexplored in road infrastructure monitoring contexts. These gaps collectively highlight the need for an integrated approach that combines high-quality data collection, advanced detection architectures with global context understanding, privacy-preserving distributed learning, and real-time processing capabilities—precisely the focus of our proposed methodology.

3. Problem Explanation

3.1. Road Defect Detection Challenges

Road defects manifest in various forms—cracks, potholes, and uneven surfaces—each presenting unique detection challenges. Cracks appear as thin linear features that may be partially occluded by shadows, dirt, or vehicle markings. Potholes vary in size, depth, and shape, often blending with road texture under certain lighting conditions. Uneven surfaces create subtle height variations that are difficult to distinguish from normal road texture in 2D images.

3.2. Environmental and Data Collection Complexities

The real-world detection environment introduces multiple confounding factors:
  • Variable lighting conditions: Shadows from vehicles, buildings, and roadside objects create false features.
  • Weather effects: Rain, dust, and glare reduce image clarity and obscure defect boundaries.
  • Motion blur: Vehicle movement at data collection speeds introduces artifacts.
  • Occlusions: Moving vehicles, pedestrians, and debris temporarily hide defects.
  • Perspective distortion: Camera angle and mounting height affect defect appearance.

3.3. Computational and Deployment Constraints

Practical deployment requires balancing multiple competing requirements:
  • Real-time processing: Detection must occur at frame rates matching video capture speed.
  • Accuracy requirements: False positives waste maintenance resources; false negatives compromise safety.
  • Privacy preservation: Road infrastructure data contains sensitive location and usage patterns.
  • Scalability: Systems must handle geographically distributed data sources across municipalities.
  • Resource constraints: Edge deployment requires models optimized for limited computational resources.

3.4. Problem Statement

Given these challenges, the problem is formulated as follows: Design a road defect detection system that achieves near-perfect accuracy (>99% mAP) in detecting and classifying multiple defect types (cracks, potholes, and uneven surfaces) under diverse environmental conditions, while enabling real-time processing, preserving data privacy through decentralized learning, and maintaining scalability for large-scale deployment. The system must handle high-resolution input data and provide robust performance across varying lighting, weather, and occlusion scenarios.

4. Methods and Materials

4.1. Dataset

Data curation began with the creation of an imaging system for monitoring road surface conditions. A GoPro HERO 9 camera (manufactured by GoPro Inc. and purchased from Shenzhen, China) was mounted on a vehicle 36 inches above ground level. The optimal placement and y-axis distance were determined through extensive experimentation, and the installation height was chosen to ensure optimal collection of high-resolution video data. The camera captured road surface data across a 50 m wide area on the x-axis to locate RDs. The horizontal field of view (FOV) is related to the installation height and width by
θ = 2 arctan(w / (2h))
where θ represents the horizontal field of view angle in radians.
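As a quick illustration, the relation above can be evaluated directly (a minimal sketch; the function name is ours, and w and h must be expressed in the same units):

```python
import math

def horizontal_fov(width: float, height: float) -> float:
    """Horizontal field of view theta = 2 * arctan(w / (2h)), in radians.

    width:  ground coverage on the x-axis (same units as height)
    height: camera mounting height above the road surface
    """
    return 2.0 * math.atan(width / (2.0 * height))
```

For example, a coverage width equal to twice the mounting height yields a FOV of π/2 radians (90°).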
The data collection was conducted in the cities of Jamshoro and Hyderabad, which exhibit types of RDs not commonly observed in other Pakistani cities. The dataset consists of different RDs, including potholes, cracks, and uneven surfaces. Environmental data collection for RDD encounters several challenges that impact data quality, including varying lighting conditions, car shadows, blurriness, weather effects on road features, and obstructions caused by rain and dust, all of which contribute to data noise. The movement of vehicles and pedestrians within a scene often obstructs RDs, producing data that lacks clear definition and contains redundant information. These challenges were mitigated through preprocessing and data augmentation techniques, which made it possible to train and deploy the DL model more effectively. The combination of a wide field of view and defects near the frame edges leads to sensor errors, as the non-parallel geometry of the road surface causes defects to be distorted or missed entirely by bird's-eye-view cameras. High-traffic areas generate redundant data of limited value for model development, as repetitive vehicle placements do not contribute to the identification of RDs. These factors hinder the consistent capture of clear, meaningful data, reducing the robustness of DD models even in scenes that human observers, accustomed to the complexities of outdoor environments, would interpret easily.
Environmental factors affecting data quality are quantified as
Q = Q0 · ∏i=1n (1 − λi · ei)
where Q0 is the baseline image quality (100%), λi is the impact coefficient of factor i (0 ≤ λi ≤ 1), ei is the environmental factor intensity (0 ≤ ei ≤ 1), and n is the number of environmental factors (lighting, shadows, weather, and obstruction).
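The degradation model above can be sketched in a few lines (a hypothetical helper, not the authors' implementation; factor weights and intensities are illustrative):

```python
import numpy as np

def image_quality(q0, impact, intensity):
    """Q = Q0 * prod_i (1 - lambda_i * e_i) from the degradation model.

    q0:        baseline image quality (e.g., 100.0)
    impact:    per-factor impact coefficients lambda_i in [0, 1]
    intensity: per-factor intensities e_i in [0, 1]
    """
    impact = np.asarray(impact, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    return q0 * np.prod(1.0 - impact * intensity)
```

With, say, two factors (shadows with λ = 0.4 at e = 0.5 and weather with λ = 0.2 at e = 0.5), quality drops multiplicatively rather than additively.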
Model robustness under difficult conditions was explicitly addressed during dataset curation and training. Shadow handling: The dataset includes 1847 frames (18.3%) with significant shadow presence, forcing the model to learn shadow-invariant features through brightness adjustment augmentation (α ∈ [0.7, 1.3]). The RTDT’s self-attention mechanism enables distinguishing between actual defects and shadow-induced artifacts by analyzing global context. Rain and moisture effects: While the dataset contains 412 frames (4.1%) captured during light rain or on wet surfaces, heavy rain scenarios remain underrepresented, a limitation addressed in our future work. Motion blur: Frame extraction at optimal vehicle speeds (25–40 km/h) minimized motion blur, though 623 frames (6.2%) exhibit slight blur. Data augmentation with simulated motion did not improve performance, suggesting the model learned blur-tolerant features from natural blur in training data. Occlusions: Temporary occlusions by vehicles and pedestrians occur in 931 frames (9.2%). The model’s sequential processing and federated training across diverse occlusion patterns enabled robust defect detection even when partially occluded. Validation results showing 99.60% mAP50 on the test set, which includes all these challenging conditions, demonstrate the model’s robustness. However, extreme weather conditions (heavy fog, snow) require further validation with expanded datasets from diverse climatic regions.
Video frames are extracted at frame rate f_fps. Redundancy Rt between consecutive frames t and t + 1 is measured using the structural similarity index (SSIM):
Rt = SSIM(It, It+1)
Frames with Rt > τ_redundancy (threshold = 0.95) are filtered to retain only diverse frames containing novel road defect information.
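The redundancy filter described above can be sketched as follows. This uses a simplified single-window SSIM computed over the whole grayscale frame rather than the standard sliding-window formulation, and the function names are illustrative:

```python
import numpy as np

# Standard SSIM stabilization constants for 8-bit images
C1, C2 = (0.01 * 255) ** 2, (0.03 * 255) ** 2

def global_ssim(a: np.ndarray, b: np.ndarray) -> float:
    """Simplified single-window SSIM over two grayscale frames."""
    a, b = a.astype(float), b.astype(float)
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + C1) * (2 * cov + C2)) / (
        (mu_a**2 + mu_b**2 + C1) * (var_a + var_b + C2)
    )

def filter_redundant(frames, tau=0.95):
    """Keep a frame only if its SSIM with the last kept frame is <= tau."""
    kept = [frames[0]]
    for f in frames[1:]:
        if global_ssim(kept[-1], f) <= tau:
            kept.append(f)
    return kept
```

Identical consecutive frames score SSIM = 1 and are dropped, while frames showing new road content fall below the τ = 0.95 threshold and are retained.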
The PathCare dataset is publicly available on the Mendeley Data repository as "PathCare: A Dataset for Road Fault Diagnosis" and serves as a valuable resource for research and development in RDD. The dataset contains comprehensive annotations of common road defects, such as cracks, potholes, and uneven surfaces, collected specifically from areas around Hyderabad. Its most important aspect is that it enables the development of DL applications for RDD: DL algorithms can be trained on PathCare to automatically detect and assess road conditions. By supporting proactive maintenance, such systems can improve road safety while reducing vehicle damage and passenger discomfort. The dataset is available to everyone for academic research as well as for real-world applications aimed at enhancing road safety and conditions.
Figure 1 shows three common road surface defects, which consist of cracks, potholes, and uneven surfaces. The first image depicts cracks that are a type of road surface defect. Road surface cracks are most evident through surface damage due to structural weakness or weather conditions. The second image represents potholes that develop as the asphalt layer erodes between the wearing course and base course. A road surface with height variations creates uneven surfaces that impact vehicle stability and driver comfort.

4.2. Data Preprocessing

A video of approximately 28 min and 51 s duration was captured at a resolution of 2704 × 1520 pixels. The initial step of data preprocessing was the conversion of the video into separate frames. The initial extraction process from the video yielded approximately 3000 frames. The extracted frames underwent a secondary review process to select only those with relevant road defects, such as cracks, potholes, or uneven surfaces. The dataset contained 1244 frames, which exclusively depicted road defects.
From video V of duration T and frame rate f_fps, the total number of frames is
N_total = T · f_fps
Initial extraction yielded N_raw ≈ 3000 frames. After filtering for road defect relevance
N_filtered = 1244 frames
where frames containing classes {C1, C2, C3} are retained.
The dataset was annotated with the Roboflow tool by manually drawing bounding boxes around the areas showing cracks, potholes, and uneven surfaces in each frame. Labels are used to determine the location of different types of defects with bounding boxes. The annotation process was a fundamental step in model training, as it enables the model to learn how to detect and categorize the various defects.
Data augmentation techniques were implemented on the dataset to enhance the quality and quantity of the data, strengthening the model’s performance. The annotated frames underwent augmentation through multiple image-processing techniques that generated new images. The data augmentation included cropping, brightness adjustment, random image translations, and rotations. Data augmentation helped to expand the dataset by adding more diversity that minimized the risks of overfitting and enhanced model generalization performance on unseen data.
Data augmentation techniques T_aug generate synthetic samples from annotated frames. Applied transformations include the following:
Brightness Adjustment:
I′(x,y) = I(x,y) · α
where α ∈ [0.7, 1.3] is the brightness scaling factor.
Random Rotation:
I′(x,y) = I(R_φ(x,y))
where φ ∈ [−15°, 15°] and R_φ is the rotation matrix.
Random Translation:
I′(x,y) = I(x − Δx, y − Δy)
where Δx ∈ [−50, 50] pixels and Δy ∈ [−50, 50] pixels.
Cropping:
I′(x,y) = I(x,y)|_{x ∈ [x_c, x_c + w_c], y ∈ [y_c, y_c + h_c]}
where the crop window is randomly selected.
The augmentation process increases the dataset size from N_filtered to N_augmented = 10,088 images.
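Two of the transformations above (brightness scaling and translation) can be sketched with NumPy; the helper names, the zero-fill boundary handling, and the composition order are our assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def adjust_brightness(img: np.ndarray, alpha: float) -> np.ndarray:
    """I'(x, y) = I(x, y) * alpha, clipped to the 8-bit range."""
    return np.clip(img.astype(float) * alpha, 0, 255).astype(np.uint8)

def translate(img: np.ndarray, dx: int, dy: int) -> np.ndarray:
    """I'(x, y) = I(x - dx, y - dy); vacated pixels are zero-filled."""
    out = np.zeros_like(img)
    h, w = img.shape[:2]
    src = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out

def augment(img: np.ndarray) -> np.ndarray:
    """Draw parameters from the stated ranges and apply both transforms."""
    alpha = rng.uniform(0.7, 1.3)           # brightness factor in [0.7, 1.3]
    dx, dy = rng.integers(-50, 51, size=2)  # shifts in [-50, 50] pixels
    return translate(adjust_brightness(img, alpha), int(dx), int(dy))
```

Rotation and cropping would follow the same pattern, typically via an image library's affine-warp routines rather than hand-rolled index arithmetic.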

Class Distribution and Imbalance Mitigation

Analysis of the annotated dataset revealed class imbalance across the three defect types:
Original Class Distribution (pre-augmentation, 1244 frames):
  • Cracks: 487 instances (39.2%).
  • Potholes: 368 instances (29.6%).
  • Uneven Surfaces: 289 instances (23.2%).
  • Background samples: 100 frames (8.0%).
The imbalance ratio (majority class/minority class) of 1.68:1 was moderate but sufficient to potentially bias the model toward crack detection.
Class Imbalance Mitigation Strategies:
1. Stratified Data Augmentation:
Augmentation was applied with class-specific multipliers to balance the final dataset:
  • Uneven Surfaces: 12× augmentation → 3468 instances.
  • Potholes: 10× augmentation → 3680 instances.
  • Cracks: 8× augmentation → 3896 instances.
  • Background: 4× augmentation → 400 instances.
The post-augmentation distribution achieved near-balance with a maximum imbalance ratio of 1.12:1.
2. Focal Loss for Minority Classes:
The focal loss component in our total loss function addresses the remaining imbalance:
L_focal = −α(1 − p_t)^γ log(p_t)
with α = [1.0, 1.2, 1.3] for cracks, potholes, and uneven surfaces, respectively, up-weighting minority classes during training. The focusing parameter γ = 2 reduces the relative loss for well-classified examples, concentrating learning on hard examples (often minority-class instances).
3. Federated Learning Class Distribution:
Each of the K = 3 local datasets maintained similar class distributions (within 5% variation) to prevent local model bias. The global aggregation further balanced any residual class-specific overfitting from individual nodes.
4. Evaluation on a Balanced Test Set:
The test set (1055 images) was manually curated to ensure balanced representation:
  • Cracks: 342 instances (32.4%).
  • Potholes: 289 instances (27.4%).
  • Uneven Surfaces: 264 instances (25.0%).
  • Background: 160 instances (15.2%).
Effectiveness Validation:
The confusion matrix in the Results section shows perfect classification across all classes, including the least frequent defect type (uneven surfaces), demonstrating that class imbalance was successfully mitigated. Class-specific F1-scores of 1.00 for all defect types confirm there is no bias toward majority classes.
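For reference, the class-weighted focal loss used in the mitigation strategies can be sketched as follows (a NumPy illustration with the stated α and γ values; the paper's actual training implementation is not shown):

```python
import numpy as np

ALPHA = np.array([1.0, 1.2, 1.3])  # per-class weights: cracks, potholes, uneven surfaces
GAMMA = 2.0                        # focusing parameter

def focal_loss(probs: np.ndarray, labels: np.ndarray) -> float:
    """L_focal = -alpha_c * (1 - p_t)^gamma * log(p_t), averaged over samples.

    probs:  (N, 3) softmax class probabilities
    labels: (N,) integer class indices
    """
    p_t = probs[np.arange(len(labels)), labels]  # probability of the true class
    alpha_c = ALPHA[labels]                      # class-specific weight
    return float(np.mean(-alpha_c * (1.0 - p_t) ** GAMMA * np.log(p_t)))
```

A well-classified sample (p_t near 1) contributes almost nothing, while a hard sample (p_t near 0) dominates the loss, which is exactly the behavior that concentrates learning on minority-class instances.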

4.3. Model and Architecture

Real-Time Detection Transformer

The RTDT model leverages transformers for DD by utilizing a self-attention mechanism that captures the full context of the image. The architecture builds on a typical CNN backbone and dynamically focuses on the most salient regions of a scene to detect low-level or overlapping defects visible on the road. The multi-head attention mechanism allows the model to evaluate spatial correlations and classify complex defect patterns, such as minor cracks and subtle uneven surfaces, that conventional methods struggle to detect. Through this context-driven approach, the transformer model can precisely detect defects and handle complex real-world road scenarios.
Feature Extraction Backbone
The backbone network extracts multi-scale feature maps. Using a ResNet or Swin Transformer backbone,
F = {f1, f2, f3, f4} = Backbone(I)
where I ∈ ℝ^(H × W × 3) is the input image and f_i ∈ ℝ^(H_i × W_i × C_i) are feature maps at different scales (i = 1, 2, 3, 4).
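A minimal NumPy sketch can make the multi-scale notation concrete. Here average pooling merely stands in for the learned ResNet/Swin stages, and the names (`avg_pool`, `toy_backbone`) are illustrative, not part of the actual implementation:

```python
import numpy as np

def avg_pool(x, s):
    """Average-pool an H×W×C array by integer stride s (H, W divisible by s)."""
    H, W, C = x.shape
    return x.reshape(H // s, s, W // s, s, C).mean(axis=(1, 3))

def toy_backbone(image):
    """Return a pyramid {f1..f4} at strides 4, 8, 16, 32, mimicking F = Backbone(I)."""
    return {f"f{i + 1}": avg_pool(image, s) for i, s in enumerate([4, 8, 16, 32])}

I = np.random.rand(640, 640, 3)   # input image I ∈ R^(H×W×3)
F = toy_backbone(I)
for name, f in F.items():
    print(name, f.shape)          # f1: (160, 160, 3) … f4: (20, 20, 3)
```

A real backbone would also change the channel count C_i per scale; only the spatial downsampling is shown here.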
Positional Encoding
Positional encoding P_pos adds spatial information to feature maps:
P_pos(i,d) = sin(i/10,000^(d/d_model)) if d is even
P_pos(i,d) = cos(i/10,000^((d − 1)/d_model)) if d is odd
where i is the position index and d_model is the model dimension. Feature maps are combined with positional encoding:
F_pos = F + P_pos
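The sinusoidal encoding above can be sketched directly (a NumPy illustration; the function name and dimensions are illustrative):

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    """P_pos(i, d): sin for even d, cos for odd d, as in the equations above."""
    P = np.zeros((n_positions, d_model))
    i = np.arange(n_positions)[:, None]                 # position index i
    d = np.arange(d_model)[None, :]                     # dimension index d
    angle = i / (10_000 ** ((d - (d % 2)) / d_model))   # exponent d (even) or d−1 (odd)
    P[:, 0::2] = np.sin(angle[:, 0::2])
    P[:, 1::2] = np.cos(angle[:, 1::2])
    return P

# For flattened feature maps, F_pos = F + P_pos adds this table elementwise
P = sinusoidal_encoding(n_positions=400, d_model=256)
```

At position i = 0 the even dimensions are sin(0) = 0 and the odd dimensions are cos(0) = 1, which is a quick sanity check on the implementation.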
Figure 2 shows the architecture of the RTDT, a robust and highly accurate object detection model. Feature extraction and encoding are crucial parts of the architecture, which integrates several modules to produce high-quality predictions and achieve real-time inference. Feature extraction commences with a ResNet or Swin backbone, which is augmented with multiscale features to generate transformer-compatible feature maps. Positional encoding is then added, followed by learnable embeddings, to provide the positional information required for decoding. The next crucial stage introduces an additional component that enhances invariance to temporal variations by processing the vectors through the transformer encoder and decoder modules. The transformer encoder and decoder are the most important parts, extracting and enhancing the feature vectors through cross-attention and feed-forward networks. These modules generate refined object queries, thereby enabling precise object predictions in complex, real-world scenes.
The prediction module of the RTDT framework predicts the bounding boxes of objects as well as their labels. The integration of objectness scores, in conjunction with a matching algorithm, facilitates improved prediction accuracy while simultaneously reducing the overall computational requirements. The framework uses the matchmaking algorithms in combination with the Hungarian matching algorithm and various loss functions such as cross-entropy loss, L1 loss, and GIoU loss for the evaluation of the model. The system streamlines inference and post-processing functionalities that enable optimal output processing through non-maximum suppression and thresholding to discard predictions with low confidence. The system architecture enhances speed and accuracy through lightweight backbone implementations and the use of sparse attention combined with quantization-aware training. The combination of these properties enables the model to operate at low computational costs, which makes it ideal for real-time detection. The RTDT stack combines advanced features and approaches to provide high-performance object detection while maintaining flexibility and real-time processing capabilities.
Multi-Head Self-Attention (MHSA):
MHSA(Q, K, V) = Concat(head1, …, head_h)W^O
where each head computes:
head_i = softmax(Q_i K_i^T/√d_k)V_i
Variables:
  • Q, K, V ∈ ℝ^(n×d_k) are query, key, and value matrices.
  • h = number of attention heads (typically 8 or 16).
  • d_k = dimension per head.
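A minimal NumPy sketch of the multi-head attention computation above, with the output projection W^O omitted for brevity (all names are illustrative):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mhsa(Q, K, V, h):
    """Split d_model into h heads, attend per head, concatenate (W^O omitted)."""
    n, d_model = Q.shape
    d_k = d_model // h
    heads = []
    for i in range(h):
        q, k, v = (M[:, i * d_k:(i + 1) * d_k] for M in (Q, K, V))
        attn = softmax(q @ k.T / np.sqrt(d_k))   # softmax(Q_i K_i^T / sqrt(d_k))
        heads.append(attn @ v)                    # head_i
    return np.concatenate(heads, axis=-1)         # Concat(head_1, …, head_h)

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((10, 64))
out = mhsa(Q, K, V, h=8)                          # shape (10, 64)
```

In a full implementation Q, K, and V would each be produced by learned linear projections before the split into heads.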
Transformer Encoder Layer:
F_enc = FFN(MHSA(F_pos))
where FFN is a feed-forward network:
FFN(x) = ReLU(xW1 + b1)W2 + b2
Transformer Decoder Layer:
F_dec = FFN(MHSA(MHSA(Q_obj), K_enc, V_enc))
where Q_obj are learnable object queries for detecting road defects; the inner MHSA performs self-attention over the object queries, and the outer MHSA performs cross-attention against the encoder keys and values.
The prediction module outputs bounding box coordinates and class probabilities:
B_hat = Linear(F_dec)
C_hat = Softmax(Linear(F_dec))
where
  • B_hat ∈ ℝ^(N_queries × 4) = predicted bounding boxes (normalized coordinates).
  • C_hat ∈ ℝ^(N_queries × 3) = class probability distribution.
Bounding Box Regression Loss (L1 + GIoU):
L_box = λ1 L_L1 + λ2 L_GIoU
where
L_L1 = (1/N_queries) Σi=1^N_queries ||b_hat_i − b_i*||1
L_GIoU = 1 − GIoU(b_hat_i, b_i*)
GIoU (Generalized Intersection over Union) is defined as
GIoU = IoU − |C \ (A ∪ B)|/|C|
where C is the smallest box enclosing both the predicted box A and the ground-truth box B.
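The GIoU term can be checked with a small stand-alone function (an illustrative sketch for axis-aligned boxes in (x1, y1, x2, y2) form):

```python
def iou_giou(a, b):
    """Return (IoU, GIoU) for two boxes a, b given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    iou = inter / union
    # C: smallest enclosing box of A and B
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c = (cx2 - cx1) * (cy2 - cy1)
    return iou, iou - (c - union) / c             # GIoU = IoU − |C \ (A∪B)| / |C|

print(iou_giou((0, 0, 2, 2), (0, 0, 2, 2)))   # identical boxes → (1.0, 1.0)
print(iou_giou((0, 0, 1, 1), (2, 2, 3, 3)))   # disjoint boxes → IoU 0, GIoU < 0
```

Unlike plain IoU, GIoU stays informative (negative, approaching −1) for non-overlapping boxes, which is why it is preferred as a regression loss term.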
Classification Loss (Cross-Entropy):
L_cls = −Σc=1^3 y_c log(p_hat_c)
where y_c ∈ {0, 1} are true class labels and p_hat_c are predicted class probabilities.
Total Loss:
L_total = L_cls + L_box + L_focal
where L_focal is the focal loss to handle class imbalance as follows:
L_focal = −α(1 − p_t)^γ log(p_t)
with α = balancing factor and γ = focusing parameter (typically γ = 2).
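The focusing behavior of the focal loss can be verified numerically. This sketch uses a scalar p_t with the default γ = 2; in practice the class-specific α weights ([1.0, 1.2, 1.3] above) would be applied per class:

```python
import numpy as np

def focal_loss(p_t, alpha=1.0, gamma=2.0):
    """L_focal = −α (1 − p_t)^γ log(p_t); p_t is the probability of the true class."""
    return -alpha * (1 - p_t) ** gamma * np.log(p_t)

# A well-classified example (p_t = 0.9) is strongly down-weighted relative to
# cross-entropy, while a hard example (p_t = 0.1) keeps most of its loss:
ce_easy, ce_hard = -np.log(0.9), -np.log(0.1)
fl_easy, fl_hard = focal_loss(0.9), focal_loss(0.1)
print(fl_easy / ce_easy)   # ≈ 0.01: easy example keeps ~1% of its CE loss
print(fl_hard / ce_hard)   # ≈ 0.81: hard example is barely down-weighted
```

With γ = 0 and α = 1 the expression reduces exactly to standard cross-entropy, which makes the focusing effect easy to isolate.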

4.4. Model Training

The RTDT model training utilized three distinct local datasets to implement FL. These three datasets were created from a single dataset consisting of three classes: cracks, potholes, and uneven surfaces. All three classes were represented in each local dataset, which helped to train separate models for the individual local datasets of the FL system. FL enables each node to train its model with its local dataset while preserving privacy, as raw data remains local. A central server collects updated weights and parameters of each node to create a global model.
The FL technique allowed combining the weights of individual RTDT models to form a global model. Decentralized training allows the global model to learn generalizability from the data heterogeneity that is provided by local datasets while preserving privacy. The global model improves its capability to detect cracks and potholes, as well as uneven surfaces, by integrating the strengths of the local models. The global model achieves greater generalization in detecting multiple road defects as it constantly re-trains and updates itself with consistent and homogeneous data from local nodes. This FL approach provides robust performance while preserving data privacy and is efficient for real-time RDD.
Problem Formulation
Let K = 3 local datasets corresponding to K edge nodes:
D = {D1, D2, D3}
Each local dataset D_k contains N_k samples and is not shared across nodes. The federated learning objective is to minimize:
min_w (1/N) Σk=1^K N_k · F_k(w)
where
  • w = global model parameters.
  • w_k = local model parameters at node k.
  • N = ΣN_k = total training samples.
  • F_k(w_k) = local loss function at node k.
Federated Averaging Algorithm
At communication round t, the global model is updated as
w^(t + 1) = w^(t) + η Σk=1^K (N_k/N) Δw_k^(t)
where
  • η = global learning rate.
  • Δw_k^(t) = w_k^(t) − w^(t) = parameter update from node k.
Local Training at Node k:
For τ local epochs on D_k:
w_k^(t + 1) = w_k^(t) − η_k ∇F_k(w_k^(t))
where η_k is the local learning rate and ∇F_k is the gradient of the local loss.
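The federated averaging step can be sketched as a weighted average over per-node parameter dictionaries (a toy illustration with one parameter tensor per node; the names `fedavg` and `layer` are illustrative):

```python
import numpy as np

def fedavg(local_weights, sample_counts):
    """Weighted FedAvg: w_global = Σ_k (N_k / N) · w_k for each parameter tensor."""
    N = sum(sample_counts)
    keys = local_weights[0].keys()
    return {key: sum((n / N) * w[key] for w, n in zip(local_weights, sample_counts))
            for key in keys}

# K = 3 nodes with one toy parameter each; unequal counts mirror unequal local datasets
w1, w2, w3 = ({"layer": np.full(2, v)} for v in (1.0, 2.0, 4.0))
w_global = fedavg([w1, w2, w3], sample_counts=[100, 100, 200])
print(w_global["layer"])   # 0.25·1 + 0.25·2 + 0.5·4 = 2.75 per element
```

Only these aggregated tensors cross the network; the raw datasets D_k never leave their nodes, which is the privacy property described below.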
Privacy Preservation
Data remains local at each node. The central server only receives aggregated model parameters:
P = {Parameter Update1, Parameter Update2, Parameter Update3}
No raw data D_k is transmitted, preserving privacy:
Privacy Level = log(N_k/|P|)
Model Generalization
The global model’s generalization capability improves through the aggregation of diverse local models:
w_global = (1/K) Σk=1^K w_k
After FL convergence, the global model achieves improved generalization error G:
G = G0 − ΔG_FL
where G0 is baseline error and ΔG_FL is improvement from federated learning across heterogeneous data distributions.
Figure 3 shows the ML pipeline from data collection to the model evaluation stage. Data collection is the initial stage; for this approach, a camera mounted on a vehicle collected raw videos of roads, which were then transformed into frames for further processing. The data preprocessing stage consists of several steps, including frame extraction, data annotation, and data augmentation, to prepare the collected data for training. The next stage is the train-test split of the dataset, followed by model training. The RTDT model is trained with FL to improve model performance and preserve data privacy. The final stage is model evaluation, including a comparative analysis of model performance using various metrics.

4.5. Evaluation Metrics

Models were evaluated through different metrics: confusion matrix, F1-score, precision, and recall. These metrics provide scores to evaluate the efficiency of the model.
  • Confusion Matrix: A confusion matrix is a table that describes the performance of a model with four parameters: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The matrix represents the model’s performance and errors.
                     Predicted Positive    Predicted Negative
Actual Positive             TP                    FN
Actual Negative             FP                    TN
  • Precision: Precision is the proportion of true positive predictions among all predicted positives. The formula for precision is given below:
Precision = TP/(TP + FP)
  • Recall: Recall, or sensitivity, is the ratio of true positive predictions to all actual positives, reflecting the model's ability to identify relevant instances. The equation used to calculate recall is given below:
Recall = TP/(TP + FN)
  • F1-score: The F1-score is the harmonic mean of precision and recall, providing a single summary metric. The equation used to calculate the F1-score is given below:
F1 Score = 2 · Precision · Recall/(Precision + Recall)
The above metrics were used to identify the best model to classify the road defects.
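These formulas can be expressed directly as a small helper (a sketch; the crack-class counts used below are the ones reported in the Results section):

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# With the crack-class counts reported in Section 5.3 (342 TP, 0 FP, 0 FN):
print(prf1(342, 0, 0))   # → (1.0, 1.0, 1.0)
```

As a non-degenerate example, 90 TP with 10 FP and 30 FN gives precision 0.90, recall 0.75, and F1 ≈ 0.818, showing how F1 penalizes the lower of the two.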

5. Results

5.1. Computational Resources and Deployment Feasibility

Training Infrastructure: Model training was conducted on a workstation with the following specifications:
  • Processor: Intel Core i7-9700K, manufactured by Intel Corporation, Santa Clara, California, United States (ninth generation, eight cores, 3.6 GHz base/4.9 GHz boost).
  • Memory: 16 GB DDR4 RAM (2666 MHz).
  • Graphics Card: NVIDIA GeForce RTX 2060 Super, manufactured by NVIDIA Corporation, Santa Clara, California, United States (8 GB GDDR6, 2176 CUDA cores, Turing architecture).
  • Storage: 1 TB NVMe SSD for dataset and model checkpoints.
  • Operating System: Ubuntu 20.04 LTS.
  • Software: PyTorch 2.0.1, CUDA 11.8, cuDNN 8.7.
This configuration represents mid-range consumer hardware, demonstrating that advanced transformer-based models can be trained without high-end infrastructure. Total training time for the federated learning setup (K = 3 nodes, 150 epochs each) was approximately 38 h.
Computational Requirements:
  • Model parameters: ~47 million (RTDT backbone + decoder).
  • Training memory: 7.2 GB GPU memory at a batch size of four.
  • Inference memory: 1.8 GB GPU memory per image (640 × 640).
Training used FP32 precision; FP16 mixed precision reduced memory by 30% with negligible accuracy impact.
Deployment in Resource-Constrained Environments:
  • Desktop/Server Deployment:
On the training workstation, the inference speed achieves 29 frames per second, enabling real-time processing of video streams at 30 FPS.
  • Edge Device Feasibility Analysis:
Deployment in less capable environments faces several challenges:
1. NVIDIA Jetson Xavier NX (Moderate Edge Device):
  • Estimated inference speed: 8–12 FPS (based on similar transformer models).
  • Memory constraints: 8 GB shared memory may limit batch processing.
  • Feasibility: Viable with model optimization (quantization to INT8, pruning 20–30% of parameters).
2. NVIDIA Jetson Nano (Low-Power Edge Device):
  • Estimated inference speed: 2–4 FPS.
  • Memory: 4 GB is insufficient for the full model.
  • Feasibility: Not viable without significant model compression or switching to lighter architectures (e.g., MobileNet backbone).
3. Raspberry Pi 4 (8 GB) with Coral TPU:
  • CPU-only inference: <1 FPS (impractical).
  • With TPU acceleration: 5–8 FPS possible after quantization.
  • Feasibility: Marginally viable for offline batch processing, not real-time.
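The INT8 quantization referenced in the feasibility notes above can be illustrated with symmetric per-tensor quantization. This is a NumPy sketch of the arithmetic only, not a deployment recipe; production toolchains (e.g., TensorRT) use calibrated, per-channel schemes:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ≈ scale · q with q ∈ [−127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)   # stand-in for a weight tensor
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale                  # dequantized approximation

size_ratio = q.nbytes / w.nbytes                      # 1 byte vs. 4 bytes per weight
max_err = np.abs(w - w_hat).max()                     # bounded by scale / 2
print(size_ratio)                                     # 0.25 → the ~75% size reduction
```

The storage drops to a quarter of the FP32 footprint while the per-weight error is bounded by half the quantization step, which is why accuracy typically degrades only slightly.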

5.2. Data Distribution and Hyperparameters

The preprocessed and augmented dataset was strategically split to ensure robust model training and evaluation. Table 1 presents the distribution of images across training, validation, and testing sets, with approximately 69% allocated for training, 21% for validation, and 10% for testing.
Following dataset preparation, hyperparameters were configured to optimize model training. Hyperparameter selection was conducted through systematic grid search and empirical validation across key parameter spaces. The process evaluated the following parameter ranges: learning rate (0.0001, 0.001, and 0.01), batch size (2, 4, 8, and 16), epochs (50, 100, 150, and 200), momentum from (0.9, 0.937, and 0.95), and weight decay (0.0001, 0.0005, and 0.001). Initial experiments with learning rates of 0.01 resulted in training instability and oscillating loss values, while 0.0001 showed excessively slow convergence. The selected learning rate of 0.001 provided optimal convergence speed while maintaining stability. Batch sizes larger than four exceeded GPU memory constraints on our NVIDIA GeForce RTX 2060 Super (8 GB), while smaller batches (two) increased training time without accuracy improvements. The epoch count of 150 was determined by monitoring validation loss plateaus; extending beyond this point showed no significant performance gains. A momentum value of 0.937 and weight decay of 0.0005 were adopted from RT-DETR best practices and validated through ablation studies showing superior performance compared to alternative values. An image size of 640 × 640 provided balanced detection accuracy for small defects with computational efficiency. The final hyperparameter configuration in Table 2 represents the optimal balance between detection accuracy, training stability, and computational resource constraints.
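The grid search described above can be enumerated explicitly (a sketch of the search space only; training and evaluation of each configuration are omitted):

```python
from itertools import product

# The exact parameter ranges evaluated in the text
grid = {
    "lr": [0.0001, 0.001, 0.01],
    "batch_size": [2, 4, 8, 16],
    "epochs": [50, 100, 150, 200],
    "momentum": [0.9, 0.937, 0.95],
    "weight_decay": [0.0001, 0.0005, 0.001],
}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))   # 3 · 4 · 4 · 3 · 3 = 432 candidate configurations
```

The selected configuration (lr 0.001, batch size 4, 150 epochs, momentum 0.937, weight decay 0.0005) is one of these 432 candidates; in practice memory limits and early validation results prune most of them before full training.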

5.3. Validation Results

The dataset consisted of 10,088 images and was split into three local datasets, each a subset of the total dataset containing all three road defect classes, and each used to train a local model. The confusion matrix in Figure 4 provides a detailed performance analysis of the federated learning-based RTDT model across all defect classes. The matrix dimensions are 4 × 4, representing three defect classes (cracks, potholes, uneven surfaces) plus the background class. Each cell (i,j) indicates the number of instances where actual class i was predicted as class j. The diagonal elements represent correct classifications: 342 crack instances correctly identified, 289 pothole instances correctly classified, 264 uneven surface instances accurately detected, and 160 background samples properly distinguished. All off-diagonal elements are zero, indicating perfect classification with no false positives or false negatives across any class pair. The model achieved 100% precision (TP/(TP + FP) = 342/342 = 1.0 for cracks), 100% recall (TP/(TP + FN) = 342/342 = 1.0 for cracks), and perfect F1-scores (2 × precision × recall/(precision + recall) = 1.0) for all three defect types. Class-specific performance: cracks (precision = 1.0, recall = 1.0, F1 = 1.0), potholes (precision = 1.0, recall = 1.0, F1 = 1.0), uneven surfaces (precision = 1.0, recall = 1.0, F1 = 1.0). The zero confusion between defect types demonstrates the model’s capability to distinguish between visually similar features such as longitudinal cracks and narrow potholes. This perfect classification performance on the validation set indicates that the federated learning approach successfully aggregated knowledge from distributed datasets without introducing classification errors.
Figure 5 shows the training and validation results of the RTDT model under federated learning, visualizing loss (giou, cls, l1) and performance metrics (precision, recall, mAP50, mAP50-95) across epochs. In both training and validation, loss curves consistently decrease, indicating effective learning and convergence, while all metric plots show high and stable performance after some initial variability. Precision, recall, and mean average precision (mAP) metrics stabilize near or above 0.95, reflecting the model’s strong ability to generalize and perform accurate predictions on unseen data. The downward loss trends and stable high metric values together demonstrate that the federated RTDT model is not overfitting and achieves robust performance during training and validation phases.

5.3.1. Noise Control and Mitigation Strategies

The training and validation curves in Figure 5 exhibit oscillations, particularly in the initial epochs, reflecting inherent noise in real-world road defect data. Several strategies were implemented to control and mitigate this noise:
  • Data Preprocessing and Filtering: Environmental noise sources—variable lighting, shadows, weather effects, and motion blur—were addressed through systematic filtering. Frames with excessive blur (SSIM > 0.95 with adjacent frames) were removed to eliminate redundant, low-information samples. Brightness normalization (α ∈ [0.7, 1.3]) reduced lighting inconsistencies across the dataset.
  • Data Augmentation for Robustness: Controlled augmentation techniques (random rotation φ ∈ [−15°, 15°], translation Δx, Δy ∈ [−50, 50] pixels, and cropping) exposed the model to variations in defect appearance, teaching it to distinguish actual defects from noise artifacts. This regularization prevented overfitting to specific noise patterns.
  • Federated Learning Noise Averaging: The federated learning architecture inherently reduces noise through parameter averaging across K = 3 local datasets. Local noise variations specific to individual data collection sessions are attenuated when aggregating model weights: w^(t + 1) = Σ(N_k/N)w_k^(t). This ensemble effect improves robustness compared to training on a single noisy dataset.
  • Loss Function Design: The combination of L1, GIoU, and Focal losses (L_total = L_cls + L_box + L_focal) provides resistance to noisy labels. Focal loss down-weights easy examples and focuses learning on hard, potentially noisy samples, preventing the model from being dominated by noise.
  • Validation-Based Early Stopping: Despite training for 150 epochs, model selection was based on validation performance plateaus. The consistent high performance (mAP50 > 0.99) after epoch 50 indicates that the model learned robust features rather than memorizing noise. The stabilization of all metrics (precision, recall, mAP) above 0.95 after initial oscillations demonstrates effective noise control. The final model’s near-perfect performance (mAP50 = 99.60%) confirms that noise in the dataset did not compromise detection accuracy.
Figure 6 illustrates the results of a five-fold K-fold cross-validation that evaluates the model’s performance across multiple metrics. It shows the scores for mAP50, mAP50-95, precision, recall, and box loss, each represented by a distinct color, specifically red for mAP50, teal for mAP50-95, light blue for precision, green for recall, and yellow for box loss. Performance is shown for each of the five folds, showing some variation across them. Highlighted are the top-performing results, including the highest mAP50 of 0.9960 in Fold 5 and the highest precision of 0.9950 in Fold 4. The lowest box loss, 0.0001, is also recorded in Fold 5. A dotted yellow line traces the box loss value across the folds, underscoring the model’s consistent ability to minimize errors.
Figure 7 displays the output of a road surface anomaly detection model, highlighting potholes and uneven surfaces using bounding boxes and confidence scores. On the left, three potholes are marked with scores ranging from 0.95 to 0.97, illustrating the model’s ability to precisely identify multiple small hazards in urban scenes. On the right, a large uneven surface is detected on a broader roadway with a confidence score of 0.97. Here, the model demonstrates strong effectiveness in identifying and localizing diverse road defects, supporting safer navigation and autonomous driving by providing clear, accurate visual feedback regarding hazardous areas.

5.3.2. Inference Speed and Real-Time Processing Performance

Inference Speed Measurements: Comprehensive inference speed evaluation was conducted on the training hardware (Intel i7-9700K, NVIDIA RTX 2060 Super) under controlled conditions:
Single Image Inference:
  • Input resolution: 640 × 640 pixels.
  • Preprocessing time: 2.3 ms (resize, normalize).
  • Model inference time: 28.4 ms.
  • Post-processing (NMS, thresholding): 3.1 ms.
Total time per frame: 33.8 ms.
  • Inference speed: 29.6 FPS.
Batch Processing (optimal for video streams):
  • Batch size: 4 images.
  • Total processing time: 97.2 ms.
  • Effective speed: 41.2 FPS.
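The reported throughput figures follow directly from the measured latencies:

```python
# Per-frame latency budget from the measurements above (milliseconds)
pre, infer, post = 2.3, 28.4, 3.1
total_ms = pre + infer + post
single_fps = 1000 / total_ms

# Batch processing: 4 images in 97.2 ms total
batch_ms, batch_size = 97.2, 4
batch_fps = 1000 * batch_size / batch_ms

print(round(total_ms, 1), round(single_fps, 1))   # 33.8 ms → 29.6 FPS
print(round(batch_fps, 1))                         # 41.2 FPS
```

Batching amortizes per-call overhead across frames, which is why the effective rate rises from 29.6 to 41.2 FPS even though per-image compute is unchanged.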
Real-Time Processing Requirements:
Standard video capture rates:
  • HD video (1920 × 1080): 30 FPS.
  • Action cameras (GoPro): 30–60 FPS.
  • Surveillance cameras: 15–30 FPS.
Our system’s 29.6 FPS (single-frame) and 41.2 FPS (batch) performance meet real-time processing requirements for
  • Standard road inspection video (30 FPS).
  • Fixed surveillance cameras (15–30 FPS).
  • High-speed capture (60+ FPS)—would require further optimization.
Practical Deployment Scenarios:
Mobile Inspection Vehicle (30 FPS video):
  • Vehicle speed: 40 km/h.
  • Distance covered per frame: ~37 cm (40 km/h ≈ 11.1 m/s at 30 FPS).
  • Detection coverage: Complete with no gaps.
Fixed Infrastructure Monitoring (15 FPS):
  • Processing: Two frames in parallel → effective 60 FPS capacity.
  • Can monitor two camera feeds simultaneously.
Offline Batch Processing:
  • Process 1 h of video (108,000 frames) in 45.3 min.
  • 1.32× real-time speed for archival analysis.
The system’s real-time performance validates its suitability for practical road monitoring applications where defects must be detected as vehicles pass inspection points or during continuous surveillance.

6. Discussion

In this work, the comparison between our proposed method and related studies is summarized in Table 3; references [33,34,35,36,37,38,39] from previous years and related domains are selected. Previous research utilized smartphone cameras with different resolutions to capture road defects. These images were then processed to detect and classify road defects such as potholes, cracks, patches, and others. ML algorithms such as unsupervised learning, 1D-CNN, and Mask R-CNN were used to train models to recognize and classify these defects. The accuracy achieved by these studies ranged from 87% to 97%. More recent works [34,39] utilized advanced DL algorithms, such as YOLOv5 and YOLOv8, achieving an F1-score of 31% and a mean average precision of 97.2%, respectively. However, they still exhibit limited generalization to different environments, limited scalability, lower accuracy for real-time object detection, and false positives in some cases.
In comparison, our proposed method utilized a GoPro HERO 9 with 2704 × 1520 resolution images and RTDT with FL, resulting in near-perfect accuracy with an mAP50 of 0.9960. This performance improvement is not marginal but substantial in comparison to the highest accuracy obtained by any of the previously discussed methods. The integration of FL not only enhances the model’s accuracy but also improves privacy and scalability by allowing decentralized learning from multiple sources. Furthermore, RTDT ensures faster DD with minimal false positives, supporting deployment in real-world scenarios. The synergy of these advanced models and methods makes our approach both more accurate and more adaptable to diverse road conditions, a versatility that is crucial for large-scale, real-time road monitoring systems.
Our work aligns with the scope of computer science research in intelligent transportation systems and computer vision applications. The integration of transformer architectures, federated learning, and real-time processing addresses core computer science challenges in distributed machine learning, attention mechanisms, and edge computing. Recent work in this journal [10] demonstrates its focus on practical AI applications in infrastructure monitoring. Our contribution extends this research direction by combining privacy-preserving federated learning with state-of-the-art transformer models for real-time defect detection.

6.1. Comprehensive Comparison with State-of-the-Art Methods

To contextualize our results, we provide a detailed comparison with recent state-of-the-art road defect detection methods across multiple performance dimensions.

6.1.1. Detection Accuracy Comparison

Our RTDT + FL approach achieves mAP50 of 99.60%, representing substantial improvement over recent methods:
  • YOLOv5-based approaches: Liu et al. [18] (92.3% mAP), RDD-YOLOv5 [25] (91.48% mAP).
  • Mask R-CNN: Kulambayev et al. [33] (92% precision).
  • YOLOv8-based systems: Narlan et al. [34] (28% precision-recall score at IoU 0.5).
  • Transformer-based: GM-DETR [19] (reported improvements over base DETR but lower than our results).
The 2.4–7.1 percentage point improvement over these methods translates to significant practical impact: at 99.60% accuracy on 1000 defects, our system produces 4 errors, compared to 30–80 errors for methods achieving 92–97% accuracy.

6.1.2. Real-Time Performance Analysis

  • Our RTDT + FL: 29.6 FPS single-frame (41.2 FPS with batch processing) at 640 × 640 resolution.
  • YOLOv5: ~30–45 FPS (reported in the literature).
  • YOLOv8: ~40–50 FPS (reported in the literature).
  • Mask R-CNN: ~10–15 FPS (computationally intensive).

6.1.3. Privacy and Scalability Advantages

Unlike centralized approaches [31,32,36], our federated learning architecture provides the following:
  • Data privacy preservation through local training.
  • Linear scalability across K nodes without centralized data aggregation.
  • Reduced bandwidth requirements (only model parameters transmitted).
  • Compliance with data sovereignty regulations.

6.1.4. Limitations Compared to State-of-the-Art

A transparent assessment of trade-offs is summarized below:
  • Computational Complexity: RTDT has a higher parameter count than YOLO variants, requiring more training time (150 epochs vs. 100 typically for YOLO).
  • Data Requirements: Transformer models generally require larger datasets for optimal performance; our 10,088 images with augmentation may be at the lower bound.
  • Interpretability: Attention mechanisms in transformers are less interpretable than convolutional feature maps when explaining specific detection decisions.
These comparisons demonstrate that our approach advances the state-of-the-art while acknowledging areas for continued research.

7. Future Work and Implications

7.1. Limitations and Future Research Directions

While this study demonstrates significant advances in road defect detection, several limitations warrant future investigation.
Technical Limitations:
  • Geographic Generalization: The dataset was collected exclusively from Jamshoro and Hyderabad cities in Pakistan. Future work should validate model performance across diverse geographical regions, road types (highways, rural roads, and urban streets), and climate zones to ensure global applicability.
  • Defect Class Coverage: The current three-class taxonomy (cracks, potholes, uneven surfaces) does not capture all road defect types. Expansion to include raveling, rutting, bleeding, and patching defects would enhance practical utility.
  • Edge Deployment: While our model achieves real-time processing on desktop GPUs (RTX 2060 Super), deployment on resource-constrained edge devices (Raspberry Pi, NVIDIA Jetson) remains unexplored. Model compression techniques (quantization, pruning) and hardware-specific optimizations require investigation.
  • Statistical Validation: The current evaluation lacks statistical significance testing across multiple experimental runs and cross-dataset validation. Future work should include statistical analysis with confidence intervals and hypothesis testing.
Future Research Directions:
  • Multi-modal Sensor Fusion: Integrating RGB imaging with LiDAR, thermal imaging, and accelerometer data could improve detection in challenging conditions (night, rain, fog).
  • Temporal Analysis: Incorporating temporal defect progression monitoring would enable predictive maintenance scheduling and severity assessment over time.
  • Adversarial Robustness: Testing model resilience against adversarial perturbations and edge cases (extreme weather, unusual lighting) would strengthen reliability.
  • Automated Defect Prioritization: Developing severity scoring mechanisms to prioritize maintenance interventions based on defect size, location, and traffic patterns.
  • Transfer Learning Across Cities: Investigating domain adaptation techniques to transfer learned representations from one city to another without extensive retraining.

7.2. Practical Implications and Material Impact

Municipal Infrastructure Management: The proposed system offers municipalities a scalable, cost-effective alternative to manual road inspections. Estimated cost reduction is 60–70%, compared to traditional survey methods, with inspection coverage increasing from bi-annual to continuous monitoring.
Safety Improvements: Real-time defect detection enables proactive maintenance, potentially reducing vehicle damage incidents by 40–50% and accident rates related to road conditions by an estimated 25–30% (based on correlations in the literature).
Data Privacy for Multi-Agency Collaboration: The federated learning approach enables data sharing between municipalities, regional authorities, and private contractors without exposing sensitive infrastructure data. This facilitates collaborative model improvement while maintaining data sovereignty.
Economic Impact: Early defect detection reduces repair costs by addressing minor issues before they escalate. Estimated lifecycle cost savings of 35–45% through preventive rather than reactive maintenance strategies.
Integration with Smart City Infrastructure: The system can integrate with existing smart city platforms, providing data for traffic management, autonomous vehicle navigation, and infrastructure investment planning. Real-time defect maps can inform dynamic routing to reduce vehicle wear and improve ride quality.
Deployment Scalability: The federated learning architecture supports gradual deployment across multiple jurisdictions without requiring centralized data infrastructure. Pilot deployments can begin in high-priority areas with incremental expansion as resources allow.
These practical implications position the proposed system as a viable solution for modern infrastructure management, addressing both technical performance and real-world deployment considerations.

7.3. Edge Deployment Challenges and Strategies

While this work achieved excellent performance on standard computing hardware, edge deployment on resource-constrained devices presents several technical challenges that warrant dedicated research.
Challenge 1: Model Size and Memory Constraints
Current RTDT model: ~47 M parameters, ~180 MB model file (FP32). Target edge devices are as follows:
  • Jetson Nano: 4 GB shared memory → Insufficient.
  • Jetson Xavier NX: 8 GB memory → Marginally sufficient.
  • Raspberry Pi 4: 8 GB RAM (no GPU) → Impractical without an accelerator.
Mitigation Strategy:
  • Quantization to INT8: Reduces model size to ~45 MB (75% reduction).
  • Parameter pruning: Target 30% reduction → ~33 M parameters.
  • Knowledge distillation: Train a lightweight student model (15 M parameters).
  • Expected outcome: Model size ~50 MB, 25 M parameters, suitable for Jetson Xavier NX.
Challenge 2: Computational Power and Inference Speed
Current inference: 28.4 ms on RTX 2060 Super (desktop GPU). Estimated edge performance is as follows:
  • Jetson Xavier NX: 80–120 ms per frame → 8–12 FPS.
  • Jetson Nano: 250–400 ms per frame → 2–4 FPS.
  • Raspberry Pi 4 (CPU only): >2000 ms → <1 FPS.
Mitigation Strategy:
  • TensorRT optimization: Expected 2–3× speedup on Jetson devices.
  • Attention head reduction: Decrease from 8 to 4 heads → 35% faster.
  • Lightweight backbone (MobileNetV3): Sacrifice 2–3% accuracy for 3× speed.
  • Expected outcome: 15–20 FPS on Jetson Xavier NX, acceptable for real-time.
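As a sanity check on these figures, frame rate is simply the reciprocal of per-frame latency. The sketch below converts the measured desktop latency and the assumed Jetson latencies (the edge numbers are estimates, not measurements) and applies only the low end of the TensorRT speedup estimate:

```python
def fps(latency_ms: float) -> float:
    """Frames per second from per-frame latency in milliseconds."""
    return 1000.0 / latency_ms

print(round(fps(28.4), 1))                         # RTX 2060 Super: ~35.2 FPS
print(round(fps(120.0), 1), round(fps(80.0), 1))   # Xavier NX estimate: ~8.3–12.5 FPS
# Applying a conservative 2x TensorRT speedup to the optimistic bound:
print(round(fps(80.0 / 2.0), 1))                   # ~25 FPS; runtime overheads will lower this
```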
Challenge 3: Power Consumption and Thermal Management
Edge device power budgets:
Desktop GPU: 175 W TDP (RTX 2060 Super).
  • Jetson Xavier NX: 10–20 W.
  • Jetson Nano: 5–10 W.
  • Raspberry Pi 4: 5 W (plus ~8 W for Coral TPU).
Mitigation Strategy:
  • Dynamic voltage and frequency scaling (DVFS).
  • Inference at lower precision (INT8).
  • Selective processing: Skip frames in low-defect areas.
  • Expected outcome: 12–15 W sustained power on Jetson Xavier NX.
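The selective-processing idea can be made concrete with a simple frame-gating policy. The class below is a hypothetical sketch, not part of the implemented system; the `window` and `skip` parameters are illustrative: after a run of defect-free frames, the detector is invoked only on every `skip`-th frame until a defect reappears.

```python
from collections import deque

class SelectiveProcessor:
    """Skip inference on frames in low-defect stretches to save power."""

    def __init__(self, window: int = 5, skip: int = 4):
        self.history = deque(maxlen=window)  # detection outcomes of processed frames
        self.skip = skip
        self.frame_idx = 0

    def should_process(self) -> bool:
        """Decide whether to run the detector on the next frame."""
        self.frame_idx += 1
        low_activity = (len(self.history) == self.history.maxlen
                        and not any(self.history))
        if low_activity and self.frame_idx % self.skip != 0:
            return False  # low-defect stretch: process only every skip-th frame
        return True

    def report(self, defects_found: bool) -> None:
        """Record the detector outcome for a processed frame."""
        self.history.append(defects_found)
```

On a defect-free road segment, this policy quickly drops from processing every frame to roughly one frame in four, which is the kind of duty-cycling that makes the 12–15 W sustained budget plausible.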
These challenges are substantial but not insurmountable. Dedicated research into model compression, hardware-aware neural architecture search, and adaptive inference strategies will enable practical edge deployment while maintaining high detection accuracy.

7.4. Nationwide Scaling: Challenges and Mitigation Strategies

Scaling the proposed road defect detection system from pilot deployment to nationwide infrastructure monitoring presents multifaceted challenges spanning technical, organizational, and economic domains:
Challenge 1: Data Heterogeneity Across Regions
Problem: Road types, defect patterns, and environmental conditions vary dramatically across geographical regions:
  • Urban areas: Frequent patches, potholes from utility work.
  • Rural areas: Weathering cracks, vegetation encroachment.
  • Highways: High-speed wear patterns, rutting.
  • Different climates: Snow damage (north), heat deterioration (south).
Mitigation Strategy:
  • Regional Federated Learning Clusters: Organize FL nodes into regional clusters (North, South, Urban, Rural) with intra-cluster aggregation before global updates.
  • Transfer Learning Adaptation: Fine-tune a pre-trained model on region-specific data (500–1000 images) for rapid adaptation.
  • Continuous Learning: Implement online learning to adapt to seasonal and temporal variations.
  • Expected Outcome: Model maintains >98% accuracy across diverse regions with 2–3 months of regional adaptation.
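The two-level aggregation can be illustrated with plain FedAvg arithmetic: client updates are averaged within a regional cluster, weighted by local dataset size, and the cluster models are then averaged globally. The cluster names and weights below are invented for illustration:

```python
def fedavg(weight_sets, sizes):
    """Dataset-size-weighted average of model parameter vectors (FedAvg)."""
    total = sum(sizes)
    dim = len(weight_sets[0])
    return [sum(w[i] * n for w, n in zip(weight_sets, sizes)) / total
            for i in range(dim)]

# Hypothetical two-level aggregation: clients -> regional cluster -> global.
north_clients = [[1.0, 2.0], [3.0, 4.0]]           # toy 2-parameter "models"
south_clients = [[5.0, 6.0]]
north = fedavg(north_clients, [100, 300])          # intra-cluster aggregation
south = fedavg(south_clients, [200])
global_model = fedavg([north, south], [400, 200])  # inter-cluster aggregation
print(north, global_model)
```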
Challenge 2: Network Infrastructure and Connectivity
Problem: Reliable high-speed internet is not available across all regions:
  • Rural areas: Limited 4G coverage, no 5G.
  • Remote highways: Connectivity gaps of 10–50 miles.
  • Bandwidth constraints: Thousands of devices uploading model updates.
Mitigation Strategy:
  • Asynchronous Federated Learning: Store local model updates, transmit during connectivity windows.
  • Edge-First Architecture: All processing on-device, only final results require connectivity.
  • Satellite Backup: Use LEO satellite networks (e.g., Starlink) for remote areas.
  • Compression: Gradient compression reduces FL update size by 90% (from 100 MB to 10 MB).
  • Expected Outcome: 95%+ uptime even in rural areas.
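One common way to reach that order of reduction is top-k gradient sparsification: each client transmits only the largest-magnitude entries of its update, and the server treats the rest as zero. The sketch below is a minimal illustration; the exact compression scheme behind the 90% figure is not specified in the text.

```python
def topk_compress(grad, ratio=0.1):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries.

    Returns (indices, values); the payload shrinks by roughly (1 - ratio).
    """
    k = max(1, int(len(grad) * ratio))
    idx = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    idx.sort()
    return idx, [grad[i] for i in idx]

def decompress(idx, vals, dim):
    """Rebuild a dense gradient on the server, zero-filling dropped entries."""
    out = [0.0] * dim
    for i, v in zip(idx, vals):
        out[i] = v
    return out

grad = [0.01, -0.9, 0.02, 0.5, -0.03, 0.04, 0.7, -0.05, 0.06, 0.08]
idx, vals = topk_compress(grad, ratio=0.3)  # keep top 30% here for clarity
print(idx, vals)                            # the three largest-magnitude entries
```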

8. Conclusions

In conclusion, this paper presented an in-depth RDD method that combines a DL-based, real-time processing approach with the FL technique to detect defects. The RTDT model was trained on high-quality road defect data. In a real-world environment, RTDT attained a 99.60% mean average precision at an IoU threshold of 0.50 in detecting common road defects such as cracks, potholes, and uneven surfaces. This technique outperformed traditional methods and provided a scalable, privacy-preserving solution through decentralized learning. The system’s design resolves several limitations of conventional detection methods, including inefficiency and privacy concerns, thereby offering a viable solution for real-time, large-scale RDD. The limitations of this study include reliance on a single dataset collected in a single country, which limits generalizability; the number of defect classes is also limited, statistical analysis is missing, and edge deployment has not been explored. Future work may optimize the model for broader environmental adaptability and edge deployment, and enhance real-time processing under diverse operational conditions worldwide.

Author Contributions

Conceptualization, B.A., M.Z.S. and S.J.; methodology, B.A. and S.J.; software, E.N.B.; validation, B.A., S.J. and B.S.C.; formal analysis, M.Z.S.; resources, S.J.; investigation, B.S.C.; data curation, B.A. and M.Z.S.; writing—original draft preparation, B.A.; writing—review and editing, S.J.; visualization, E.N.B.; supervision, M.M.; project administration, M.Z.S.; funding acquisition, M.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by the EU-funded Erasmus Plus Capacity Building in Higher Education ACTIVE Climate Action Project (ERASMUS-EDU-2023-CBHE, Project:101082866 ACTIVE), the EU-funded Erasmus Plus Capacity Building in Higher Education Project CENTRAL (Capacity building and ExchaNge towards attaining Technological Research and modernizing Academic Learning, 598914-EPP-1-2018-1-DK-EPPKA2-CBHE-JP) and the National Center of Robotics and Automation-Condition Monitoring Systems Laboratory, Mehran University of Engineering and Technology, Grant/Award Number: 2(1076)/HEC/M&E/2018/704.

Data Availability Statement

The dataset used in this study is available on Mendeley at https://data.mendeley.com/datasets/6p52w7d5xd/2 (accessed on 21 March 2025).

Acknowledgments

This work is supported by the National Center of Robotics and Automation Condition Monitoring Lab, MUET Jamshoro, Pakistan, and the University of Malaga, Spain.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Lin, S. Road Defect Detection System Based on Deep Learning. J. Comput. Signal Syst. Res. 2025, 2, 79–87.
2. Yu, J.; Jiang, J.; Fichera, S.; Paoletti, P.; Layzell, L.; Mehta, D.; Luo, S. Road Surface Defect Detection—From Image-Based to Non-Image-Based: A Survey. IEEE Trans. Intell. Transp. Syst. 2024, 25, 10581–10603.
3. Rathee, M.; Bačić, B.; Doborjeh, M. Automated Road Defect and Anomaly Detection for Traffic Safety: A Systematic Review. Sensors 2023, 23, 5656.
4. Yuan, Y.; Yuan, Y.; Baker, T.; Kolbe, L.M.; Hogrefe, D. FedRD: Privacy-preserving adaptive Federated learning framework for intelligent hazardous Road Damage detection and warning. Future Gener. Comput. Syst. 2021, 125, 385–398.
5. Saha, P.K.; Arya, D.; Sekimoto, Y. Federated learning–based global road damage detection. Comput. Aided Civ. Infrastruct. Eng. 2024, 39, 2223–2238.
6. Bello-Salau, H.; Aibinu, A.M.; Onwuka, E.N.; Dukiya, J.J.; Onumanyi, A.J. Image Processing Techniques for Automated Road Defect Detection: A Survey. In Proceedings of the 2014 11th International Conference on Electronics, Computer and Computation (ICECCO), Abuja, Nigeria, 29 September–1 October 2014.
7. Shaikh, M.Z.; Ahmed, Z.; Chowdhry, B.S.; Baro, E.N.; Hussain, T.; Uqaili, M.A.; Mehran, S.; Kumar, D.; Shah, A.A. State-of-the-Art Wayside Condition Monitoring Systems for Railway Wheels: A Comprehensive Review. IEEE Access 2023, 11, 13257–13279.
8. Abro, B.; Jatoi, S.; Shaikh, M.Z.; Baro, E.N.; Chowdhry, B.S.; Milanova, M. Towards Smarter Road Maintenance: YOLOv7-Seg for Real-Time Detection of Surface Defects. Lect. Notes Comput. Sci. 2025, 15618, 36–54.
9. Gupta, P.; Dixit, M. Image-based Road Pothole Detection Using Deep Learning Model. In Proceedings of the 2022 14th International Conference on Computational Collective Intelligence (ICCCI), Hammamet, Tunisia, 28–30 September 2022.
10. Wang, Y.; Chung, S.-H.; Khan, W.A.; Wang, T.; Xu, D.J. ALADA: A lite automatic data augmentation framework for industrial defect detection. Adv. Eng. Inform. 2023, 58, 102205.
11. Liu, J.; Wang, Y.; Luo, H.; Lv, G.; Guo, F.; Xie, Q. Pavement Surface Defect Recognition Method Based on Vehicle System Vibration Data and Feedforward Neural Network. Int. J. Pavement Eng. 2023, 24, 2188594.
12. Orhan, F.; Erhan Eren, P. Road Hazard Detection and Sharing with Multimodal Sensor Analysis on Smartphones. In Proceedings of the 2013 Seventh International Conference on Next Generation Mobile Apps, Services, and Technologies (NGMAST 2013), Prague, Czech Republic, 25–27 September 2013.
13. Meftah, I.; Hu, J.; Asham, M.A.; Meftah, A.; Zhen, L.; Wu, R. Visual Detection of Road Cracks for Autonomous Vehicles Based on Deep Learning. Sensors 2024, 24, 1647.
14. Shaikh, M.Z.; Baro, E.N.; Jatoi, S. A Contribution to Reliable Rail Transport: AI-Powered Real-Time Wheel Defect Detection. Sci. Rep. 2025, 15, 43854.
15. Shaikh, M.Z.; Jatoi, S.; Baro, E.N.; Das, B.; Hussain, S.; Chowdhry, B.S. FaultSeg: A Dataset for Train Wheel Defect Detection. Sci. Data 2025, 12, 309.
16. Shaikh, M.Z.; Mehran, S.; Baro, E.N.; Manolova, A.; Uqaili, M.A.; Hussain, T.; Chowdhry, B.S. Design and Development of a Wayside AI-Assisted Vision System for Online Train Wheel Inspection. Eng. Rep. 2025, 7, e13027.
17. Benallal, M.A.; Tayeb, M.S. An Image-based Convolutional Neural Network System for Road Defects Detection. IAES Int. J. Artif. Intell. 2023, 12, 577–584.
18. Liu, Q.; Liu, Z. Image Recognition of Pavement Cracks in Autonomous Driving Scenarios Based on Deep Learning. In Proceedings of the 2024 5th International Conference on Computer Vision, Image and Deep Learning (CVIDL), Zhuhai, China, 19–21 April 2024.
19. Liu, X.; Yang, X.; Shao, L.; Wang, X.; Gao, Q.; Shi, H. GM-DETR: Research on A Defect Detection Method Based on Improved DETR. Sensors 2024, 24, 3610.
20. Bagheri, S.A.M.; Mojaradi, B.; Kamboozia, N.; Faizi, M. Analyzing the effects of streetscape and land use on urban accidents and predicting future accidents by using machine learning algorithms (case study: Mashhad). Heliyon 2024, 10, e33346.
21. Lin, Y.-C.; Chen, W.-H.; Kuo, C.-H. Implementation of Pavement Defect Detection System on Edge Computing Platform. Appl. Sci. 2021, 11, 3725.
22. Naseralavi, S.; Soltanirad, M.; Ranjbar, E.; Lucero, M.; Baghersad, M.; Piri, M.; Zada, M.J.H.; Mazaheri, A. Modeling the Severity of Crashes in Rainy Weather by Driver Gender and Crash Type. Future Transp. 2025, 5, 146.
23. Shahri, P.K.; Rahmanidehkordi, A.; Ghaffari, A.; Ghasemi, A.H. Enhancing Traffic Flow in Heterogeneous Freeways: Integration of Multivariable Extremum Seeking and Filtered Feedback Linearization Control. IEEE Access 2025, 13, 129573–129587.
24. Khameneh, R.T.; Barker, K.; Ramirez-Marquez, J.E. A hybrid machine learning and simulation framework for modeling and understanding disinformation-induced disruptions in public transit systems. Reliab. Eng. Syst. Saf. 2025, 255, 110656.
25. Jiang, Y.; Yan, H.; Zhang, Y.; Wu, K.; Liu, R.; Lin, C. RDD-YOLOv5: Road Defect Detection Algorithm with Self-Attention Based on Unmanned Aerial Vehicle Inspection. Sensors 2023, 23, 8241.
26. Wootton, A.J.; Day, C.R.; Haycock, P.W. Heterogeneous Data Fusion for The Improved Non-destructive Detection of Steel-reinforcement Defects Using Echo State Networks. Struct. Health Monit. 2022, 21, 2910–2921.
27. Alshammari, S.; Song, S. 3Pod: Federated Learning-based 3 Dimensional Pothole Detection for Smart Transportation. In Proceedings of the 2022 IEEE International Smart Cities Conference (ISC2), Paphos, Cyprus, 26–29 September 2022.
28. An, J.; Yang, L.; Hao, Z.; Chen, G.; Li, L. Investigation on Road Underground Defect Classification and Localization Based on Ground Penetrating Radar and Swin Transformer. Int. J. Simul. Multidiscip. 2024, 15, 7.
29. Fan, R.; Wang, H.; Bocus, M.J.; Liu, M. We Learn Better Road Pothole Detection: From Attention Aggregation to Adversarial Domain Adaptation; Springer International Publishing: Cham, Switzerland, 2020.
30. Mukti, S.N.A.; Tahar, K.N. Low Altitude Multispectral Mapping for Road Defect Detection. Geogr. Malays. J. Soc. Space 2021, 17, 102–115.
31. Nasser, N.; Fadlullah, Z.M.; Fouda, M.M.; Ali, A.; Imran, M. A Lightweight Federated Learning Based Privacy Preserving B5G Pandemic Response Network Using Unmanned Aerial Vehicles: A Proof-of-concept. Comput. Netw. 2021, 205, 108672.
32. Gebremariam, G.G.; Panda, J.; Indu, S. Blockchain-Based Secure Localization Against Malicious Nodes in IoT-Based Wireless Sensor Networks Using Federated Learning. Wirel. Commun. Mob. Comput. 2023, 2023, 8068038.
33. Kulambayev, B.; Nurlybek, M.; Astaubayeva, G.; Tleuberdiyeva, G.; Zholdasbayev, S.; Tolep, A. Real-Time Road Surface Damage Detection Framework based on Mask R-CNN Model. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 757–765.
34. Narlan, R.; Widiyanto, E.P.; Selatan, B.S. Automated pavement defect detection using YOLOv8 object detection algorithm. Pros. KRTJ HPJI 2023, 16, 1–13.
35. Singh, J.; Shekhar, S. Road Damage Detection and Classification In Smartphone Captured Images Using Mask R-CNN. arXiv 2018, arXiv:1811.04535.
36. Perttunen, M.; Mazhelis, O.; Cong, F.; Ristaniemi, T.; Riekki, J. Distributed Road Surface Condition Monitoring Using Mobile Phones. In Proceedings of the International Conference on Ubiquitous Intelligence and Computing, Banff, AB, Canada, 2–4 September 2011; pp. 64–78.
37. Dong, D.; Li, Z. Smartphone Sensing of Road Surface Condition and Defect Detection. Sensors 2021, 21, 5433.
38. Kim, G.; Kim, S. Road Defect Detection System Using Smartphones. Sensors 2024, 24, 2099.
39. Ruseruka, C.; Mwakalonge, J.; Comert, G.; Siuhi, S.; Perkins, J. Road Condition Monitoring Using Vehicle Built-in Cameras and GPS Sensors: A Deep Learning Approach. Vehicles 2023, 5, 931–948.
Figure 1. Visual representation of cracks, potholes, and uneven surfaces collected for road condition monitoring.
Figure 2. Revolutionizing object detection: the Real-Time Detection Transformer framework.
Figure 3. Block diagram of the revolutionizing model with the federated learning technique.
Figure 4. Confusion matrix of the federated learning-based RTD-Transformer model.
Figure 5. Training and validation loss and metric trends of the RTDT model with federated learning.
Figure 6. K-Fold cross-validation results of the federated learning-based model.
Figure 7. Detection of potholes and uneven surfaces with confidence scores on road images.
Table 1. Distribution of images across training, validation, and testing sets.
Split Dataset | No. of Images
Training Set | 6922
Validation Set | 2111
Testing Set | 1055
Total | 10,088
Table 2. Configuration of training hyperparameters for instance segmentation.
Hyperparameter | Value
Image Size | 640 × 640
Epochs | 150
Learning Rate | 0.001
Final learning rate factor (lrf) | 0.1
Weight Decay | 0.0005
Momentum | 0.937
Optimizer | Adam
Batch Size | 4
Table 3. Comparison table of previous studies and our study.
S.No | Camera | Resolution | No. of Classes | Types of Defects | Techniques | Performance Metrics | Citation
1 | Smartphone Camera | Not explicitly mentioned | 12 | Linear Crack, Grid Crack, Pavement Joints, Patchings, Fillings, Potholes, Manholes, Stains, Shadow, Pavement Markings, Scratches on Markings, Grid Cracks in Patchings | Mask R-CNN | 92% Precision, 98% Recall, 95% F1-score | [33]
2 | Not explicitly mentioned | 587 × 1058 | 8 | Alligator Cracks, Longitudinal Cracks, Patching, Potholes, Raveling, Rutting, Shoving/Upheaval, Transverse Cracks | YOLOv8 | 28% Precision–Recall score (at IoU 0.5), 31% F1-score | [34]
3 | Smartphone Camera | 600 × 600 | 8 | Not explicitly mentioned | Mask R-CNN with ResNet101-FPN backbone | 52% F1-score at 50% IoU | [35]
4 | Smartphone Camera | Not explicitly mentioned | 2 | Distortion, Patching, Potholes, Rutting | SVM with spectral analysis and k-means classification using accelerometer and GPS data from smartphones | ~80–84% Accuracy | [36]
5 | Smartphone Camera | Not explicitly mentioned | 4 | Distortion, Patching, Potholes, Rutting | K-means unsupervised machine learning with Power Spectral Density (PSD) analysis and Butterworth bandpass filter (0.1–0.5 Hz) | 84% Average Accuracy | [37]
6 | Smartphone Camera | 2560 × 1440 | 3 | Potholes, Speed Bumps, Manholes | 1D-CNN, Sliding Window Algorithm | 91% Average Accuracy | [38]
7 | Vehicle Built-in Camera | 416 pixels | 9 | Fatigue/Alligator Cracks, Block Cracks, Transverse Cracks, Longitudinal (Wheel Path) Cracks, Longitudinal (Non-Wheel Path) Cracks, Edge/Joint/Reflective Cracks, Patches, Potholes, Raveling, Shoving, Rutting | YOLOv5 | 97.2% Mean Average Precision | [39]
8 | GoPro HERO 9 Camera | 2704 × 1520 | 3 | Cracks, Potholes, Uneven Surfaces | Real-Time Detection Transformer with federated learning technique | 99.60% Mean Average Precision | Proposed Method

Share and Cite

MDPI and ACS Style

Abro, B.; Jatoi, S.; Shaikh, M.Z.; Baro, E.N.; Milanova, M.; Chowdhry, B.S. Federated Learning-Based Road Defect Detection with Transformer Models for Real-Time Monitoring. Computers 2026, 15, 6. https://doi.org/10.3390/computers15010006

