Federated Learning-Based Road Defect Detection with Transformer Models for Real-Time Monitoring
Abstract
1. Introduction
- Developed a high-resolution road defect dataset (RDD) using a GoPro HERO 9 camera (manufactured by GoPro Inc. and purchased in Shenzhen, China), capturing detailed images of cracks, potholes, and uneven surfaces to improve data quality and diversity for deep learning training.
- Implemented a Real-Time Detection Transformer (RTDT) model that leverages self-attention mechanisms for precise detection of subtle and overlapping road defects in complex real-world conditions, surpassing traditional convolutional models.
- Integrated federated learning into the RTDT framework to enable decentralized, privacy-preserving model training across multiple data sources, enhancing model generalization and scalability for large-scale deployments.
- Achieved near-perfect detection accuracy (mAP50 of 99.60%) with real-time processing capabilities, offering a scalable, efficient, and privacy-conscious solution for practical road maintenance and monitoring applications.
2. Related Work
Research Gaps
3. Problem Explanation
3.1. Road Defect Detection Challenges
3.2. Environmental and Data Collection Complexities
- Variable lighting conditions: Shadows from vehicles, buildings, and roadside objects create false features.
- Weather effects: Rain, dust, and glare reduce image clarity and obscure defect boundaries.
- Motion blur: Vehicle movement at data collection speeds introduces artifacts.
- Occlusions: Moving vehicles, pedestrians, and debris temporarily hide defects.
- Perspective distortion: Camera angle and mounting height affect defect appearance.
3.3. Computational and Deployment Constraints
- Real-time processing: Detection must occur at frame rates matching video capture speed.
- Accuracy requirements: False positives waste maintenance resources; false negatives compromise safety.
- Privacy preservation: Road infrastructure data contains sensitive location and usage patterns.
- Scalability: Systems must handle geographically distributed data sources across municipalities.
- Resource constraints: Edge deployment requires models optimized for limited computational resources.
3.4. Problem Statement
4. Methods and Materials
4.1. Dataset
4.2. Data Preprocessing
Class Distribution and Imbalance Mitigation
- Cracks: 487 instances (39.2%).
- Potholes: 368 instances (29.6%).
- Uneven Surfaces: 289 instances (23.2%).
- Background samples: 100 frames (8.0%).
1. Stratified Data Augmentation:
- Uneven Surfaces: 12× augmentation → 3468 instances.
- Potholes: 10× augmentation → 3680 instances.
- Cracks: 8× augmentation → 3896 instances.
- Background: 4× augmentation → 400 instances.
2. Focal Loss for Minority Classes:
3. Federated Learning Class Distribution:
4. Evaluation of Balanced Test Set:
- Cracks: 342 instances (32.4%).
- Potholes: 289 instances (27.4%).
- Uneven Surfaces: 264 instances (25.0%).
- Background: 160 instances (15.2%).
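The stratified multipliers above can be checked with simple arithmetic. This is an illustrative sketch using the counts reported in this section, not the authors' pipeline code; the function name is ours.

```python
# Raw per-class instance counts and the stratified augmentation
# multipliers reported in Section 4.2.
raw_counts = {"cracks": 487, "potholes": 368, "uneven_surfaces": 289, "background": 100}
multipliers = {"cracks": 8, "potholes": 10, "uneven_surfaces": 12, "background": 4}

def augment_counts(raw, mult):
    """Instances per class after applying each class's multiplier."""
    return {cls: raw[cls] * mult[cls] for cls in raw}

augmented = augment_counts(raw_counts, multipliers)
# cracks -> 3896, potholes -> 3680, uneven_surfaces -> 3468, background -> 400
```

Note that minority classes receive larger multipliers, so the augmented class counts converge toward a common scale (roughly 3400 to 3900 instances for the defect classes).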
4.3. Model and Architecture
Real-Time Detection Transformer
P_pos(i, d) = sin(i/10,000^(d/d_model)) if d is even
P_pos(i, d) = cos(i/10,000^((d − 1)/d_model)) if d is odd
- Q, K, V ∈ ℝ^(n×d_k) are query, key, and value matrices.
- h = number of attention heads (typically 8 or 16).
- d_k = dimension per head.
C_hat = Softmax(Linear(F_dec))
- B_hat ∈ ℝ^(N_queries × 4) = predicted bounding boxes (normalized coordinates).
- C_hat ∈ ℝ^(N_queries × 3) = class probability distribution.
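For reference, each of the h attention heads computes the standard scaled dot-product attention, Softmax(QKᵀ/√d_k)V, over the Q, K, V matrices defined above. The plain-Python sketch below is illustrative (function names and toy shapes are ours, not the paper's implementation, which uses PyTorch):

```python
import math

def softmax(row):
    # Numerically stable row-wise softmax.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    # Naive matrix product for small illustrative inputs.
    return [[sum(a * b for a, b in zip(r, c)) for c in zip(*B)] for r in A]

def attention(Q, K, V, d_k):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V."""
    K_T = [list(c) for c in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    return matmul([softmax(row) for row in scores], V)
```

In the multi-head form, the model runs h such computations in parallel on d_k-dimensional projections and concatenates the results before the output projection.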
4.4. Model Training
- w = global model parameters.
- w_k = local model parameters at node k.
- N = ΣN_k = total training samples.
- F_k(w_k) = local loss function at node k.
- η = global learning rate.
- Δw_k^(t) = w_k^(t) − w^(t) = parameter update from node k.
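The aggregation these symbols define, w^(t+1) = Σ_k (N_k/N) w_k^(t), is a sample-weighted average of the local parameters (FedAvg). A minimal sketch, assuming flattened parameter vectors and illustrative names:

```python
def fedavg(local_weights, sample_counts):
    """Sample-weighted FedAvg: w_global = sum_k (N_k / N) * w_k."""
    total = sum(sample_counts)          # N = sum of N_k
    dim = len(local_weights[0])
    w_global = [0.0] * dim
    for w_k, n_k in zip(local_weights, sample_counts):
        for i in range(dim):
            w_global[i] += (n_k / total) * w_k[i]
    return w_global
```

Nodes with more training samples contribute proportionally more to the global update, which matches the N_k/N weighting above.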
4.5. Evaluation Metrics
- Confusion Matrix: A confusion matrix is a table that summarizes model performance with four quantities: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The matrix exposes both the model's correct predictions and its error types.
- Precision: Precision is the fraction of predicted positives that are actual defects: Precision = TP/(TP + FP).
- Recall: Recall (sensitivity) is the fraction of actual positives that the model identifies, measuring its ability to retrieve relevant instances: Recall = TP/(TP + FN).
- F1 score: The F1 score is the harmonic mean of precision and recall, combining both into a single metric: F1 = 2 × (Precision × Recall)/(Precision + Recall).
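These definitions translate directly into code; the sketch below is a straightforward transcription of the standard formulas (not the authors' evaluation script):

```python
def precision(tp, fp):
    """Fraction of predicted positives that are true positives."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are detected."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)
```

For example, 90 true positives with 10 false positives and 30 false negatives gives a precision of 0.90 and a recall of 0.75.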
5. Results
5.1. Computational Resources and Deployment Feasibility
- Processor: Intel Core i7-9700K, manufactured by Intel Corporation, Santa Clara, California, United States (ninth generation, eight cores, 3.6 GHz base/4.9 GHz boost).
- Memory: 16 GB DDR4 RAM (2666 MHz).
- Graphics Card: NVIDIA GeForce RTX 2060 Super, manufactured by NVIDIA Corporation, Santa Clara, California, United States (8 GB GDDR6, 2176 CUDA cores, Turing architecture).
- Storage: 1 TB NVMe SSD for dataset and model checkpoints.
- Operating System: Ubuntu 20.04 LTS.
- Software: PyTorch 2.0.1, CUDA 11.8, cuDNN 8.7.
- Model parameters: ~47 million (RTDT backbone + decoder).
- Training memory: 7.2 GB GPU memory at a batch size of 4.
- Inference memory: 1.8 GB GPU memory per image (640 × 640).
- Desktop/Server Deployment:
- Edge Device Feasibility Analysis:
1. NVIDIA Jetson Xavier NX (Moderate Edge Device):
- Estimated inference speed: 8–12 FPS (based on similar transformer models).
- Memory constraints: 8 GB shared memory may limit batch processing.
- Feasibility: Viable with model optimization (quantization to INT8, pruning 20–30% of parameters).
2. NVIDIA Jetson Nano (Low-Power Edge Device):
- Estimated inference speed: 2–4 FPS.
- Memory: 4 GB is insufficient for the full model.
- Feasibility: Not viable without significant model compression or switching to lighter architectures (e.g., MobileNet backbone).
3. Raspberry Pi 4 (8 GB) with Coral TPU:
- CPU-only inference: <1 FPS (impractical).
- With TPU acceleration: 5–8 FPS possible after quantization.
- Feasibility: Marginally viable for offline batch processing, not real-time.
5.2. Data Distribution and Hyperparameters
5.3. Validation Results
5.3.1. Noise Control and Mitigation Strategies
- Data Preprocessing and Filtering: Environmental noise sources—variable lighting, shadows, weather effects, and motion blur—were addressed through systematic filtering. Blurred, near-duplicate frames (SSIM > 0.95 with adjacent frames) were removed to eliminate redundant, low-information samples. Brightness normalization (α ∈ [0.7, 1.3]) reduced lighting inconsistencies across the dataset.
- Data Augmentation for Robustness: Controlled augmentation techniques (random rotation φ ∈ [−15°, 15°], translation Δx, Δy ∈ [−50, 50] pixels, and cropping) exposed the model to variations in defect appearance, teaching it to distinguish actual defects from noise artifacts. This regularization prevented overfitting to specific noise patterns.
- Federated Learning Noise Averaging: The federated learning architecture inherently reduces noise through parameter averaging across K = 3 local datasets. Local noise variations specific to individual data collection sessions are attenuated when aggregating model weights: w^(t + 1) = Σ(N_k/N)w_k^(t). This ensemble effect improves robustness compared to training on a single noisy dataset.
- Loss Function Design: The combination of L1, GIoU, and Focal losses (L_total = L_cls + L_box + L_focal) provides resistance to noisy labels. Focal loss down-weights easy examples and focuses learning on hard, potentially noisy samples, preventing the model from being dominated by noise.
- Validation-Based Early Stopping: Despite training for 150 epochs, model selection was based on validation performance plateaus. The consistent high performance (mAP50 > 0.99) after epoch 50 indicates that the model learned robust features rather than memorizing noise. The stabilization of all metrics (precision, recall, mAP) above 0.95 after initial oscillations demonstrates effective noise control. The final model’s near-perfect performance (mAP50 = 99.60%) confirms that noise in the dataset did not compromise detection accuracy.
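The focal-loss behavior described above (down-weighting easy examples) follows from its standard form, FL(p_t) = −α(1 − p_t)^γ log(p_t). The sketch below uses the common default values α = 0.25 and γ = 2, which are assumptions on our part; the paper does not report its focal-loss hyperparameters.

```python
import math

def focal_loss(p_t, alpha=0.25, gamma=2.0):
    """Standard focal loss for the predicted probability p_t of the true class.

    The (1 - p_t)**gamma factor shrinks the loss of well-classified
    (easy) examples, so training focuses on hard or noisy samples.
    """
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

A confidently correct prediction (p_t = 0.9) thus incurs orders of magnitude less loss than a hard one (p_t = 0.1), which is the mechanism preventing easy backgrounds from dominating training.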
5.3.2. Inference Speed and Real-Time Processing Performance
- Input resolution: 640 × 640 pixels.
- Preprocessing time: 2.3 ms (resize, normalize).
- Model inference time: 28.4 ms.
- Post-processing (NMS, thresholding): 3.1 ms.
- Inference speed: 29.6 FPS.
- Batch size: 4 images.
- Total processing time: 97.2 ms.
- Effective speed: 41.2 FPS.
- HD video (1920 × 1080): 30 FPS.
- Action cameras (GoPro): 30–60 FPS.
- Surveillance cameras: 15–30 FPS.
- Standard road inspection video (30 FPS).
- Fixed surveillance cameras (15–30 FPS).
- High-speed capture (60+ FPS)—would require further optimization.
- Vehicle speed: 40 km/h.
- Distance covered per frame: 33.3 cm.
- Detection coverage: Complete with no gaps.
- Processing: Two frames in parallel → effective 60 FPS capacity.
- Can monitor two camera feeds simultaneously.
- Process 1 h of video (108,000 frames) in 45.3 min.
- 1.32× real-time speed for archival analysis.
6. Discussion
6.1. Comprehensive Comparison with State-of-the-Art Methods
6.1.1. Detection Accuracy Comparison
6.1.2. Real-Time Performance Analysis
- Our RTDT + FL: 60 FPS at 640 × 640 resolution.
- YOLOv5: ~30–45 FPS (reported in the literature).
- YOLOv8: ~40–50 FPS (reported in the literature).
- Mask R-CNN: ~10–15 FPS (computationally intensive).
6.1.3. Privacy and Scalability Advantages
- Data privacy preservation through local training.
- Linear scalability across K nodes without centralized data aggregation.
- Reduced bandwidth requirements (only model parameters transmitted).
- Compliance with data sovereignty regulations.
6.1.4. Limitations Compared to State-of-the-Art
- Computational Complexity: RTDT has a higher parameter count than YOLO variants, requiring more training time (150 epochs vs. 100 typically for YOLO).
- Data Requirements: Transformer models generally require larger datasets for optimal performance; our 10,088 images with augmentation may be at the lower bound.
- Interpretability: Attention mechanisms in transformers are less interpretable than convolutional feature maps when explaining specific detection decisions.
7. Future Work and Implications
7.1. Limitations and Future Research Directions
- Geographic Generalization: The dataset was collected exclusively from Jamshoro and Hyderabad cities in Pakistan. Future work should validate model performance across diverse geographical regions, road types (highways, rural roads, and urban streets), and climate zones to ensure global applicability.
- Defect Class Coverage: The current three-class taxonomy (cracks, potholes, uneven surfaces) does not capture all road defect types. Expansion to include raveling, rutting, bleeding, and patching defects would enhance practical utility.
- Edge Deployment: While our model achieves real-time processing on desktop GPUs (RTX 2060 Super), deployment on resource-constrained edge devices (Raspberry Pi, NVIDIA Jetson) remains unexplored. Model compression techniques (quantization, pruning) and hardware-specific optimizations require investigation.
- Statistical Validation: The current evaluation lacks statistical significance testing across multiple experimental runs and cross-dataset validation. Future work should include statistical analysis with confidence intervals and hypothesis testing.
- Multi-modal Sensor Fusion: Integrating RGB imaging with LiDAR, thermal imaging, and accelerometer data could improve detection in challenging conditions (night, rain, fog).
- Temporal Analysis: Incorporating temporal defect progression monitoring would enable predictive maintenance scheduling and severity assessment over time.
- Adversarial Robustness: Testing model resilience against adversarial perturbations and edge cases (extreme weather, unusual lighting) would strengthen reliability.
- Automated Defect Prioritization: Developing severity scoring mechanisms to prioritize maintenance interventions based on defect size, location, and traffic patterns.
- Transfer Learning Across Cities: Investigating domain adaptation techniques to transfer learned representations from one city to another without extensive retraining.
7.2. Practical Implications and Material Impact
7.3. Edge Deployment Challenges and Strategies
- Jetson Nano: 4 GB shared memory → Insufficient.
- Jetson Xavier NX: 8 GB memory → Marginally sufficient.
- Raspberry Pi 4: 8 GB RAM (no GPU) → Impractical without an accelerator.
- Quantization to INT8: Reduces model size to ~45 MB (75% reduction).
- Parameter pruning: Target 30% reduction → ~33 M parameters.
- Knowledge distillation: Train a lightweight student model (15 M parameters).
- Expected outcome: Model size ~50 MB, 25 M parameters, suitable for Jetson Xavier NX.
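The INT8 size arithmetic above follows from byte widths (fp32 = 4 bytes per weight, int8 = 1 byte). The sketch below shows that storage estimate plus a minimal symmetric per-tensor quantizer; it is an illustration of the principle, not the compression toolchain that would be used in practice (e.g., TensorRT).

```python
def int8_size_reduction(n_params):
    """Fraction of storage saved moving weights from fp32 (4 B) to int8 (1 B)."""
    return 1 - (n_params * 1) / (n_params * 4)

def quantize_int8(weights):
    """Symmetric per-tensor quantization: map the max |weight| to 127."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate fp32 weights from int8 codes."""
    return [v * scale for v in q]
```

For the ~47 M-parameter model, fp32 storage is ~188 MB and int8 storage ~47 MB, consistent with the ~75% reduction cited above.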
- Jetson Xavier NX: 80–120 ms per frame → 8–12 FPS.
- Jetson Nano: 250–400 ms per frame → 2–4 FPS.
- Raspberry Pi 4 (CPU only): >2000 ms → <1 FPS.
- TensorRT optimization: Expected 2–3× speedup on Jetson devices.
- Attention head reduction: Decrease from 8 to 4 heads → 35% faster.
- Lightweight backbone (MobileNetV3): Sacrifice 2–3% accuracy for 3× speed.
- Expected outcome: 15–20 FPS on Jetson Xavier NX, acceptable for real-time.
- Jetson Xavier NX: 10–20 W.
- Jetson Nano: 5–10 W.
- Raspberry Pi 4: 5 W (plus ~8 W for Coral TPU).
- Dynamic voltage and frequency scaling (DVFS).
- Inference at lower precision (INT8).
- Selective processing: Skip frames in low-defect areas.
- Expected outcome: 12–15 W sustained power on Jetson Xavier NX.
7.4. Nationwide Scaling: Challenges and Mitigation Strategies
- Urban areas: Frequent patches, potholes from utility work.
- Rural areas: Weathering cracks, vegetation encroachment.
- Highways: High-speed wear patterns, rutting.
- Different climates: Snow damage (north), heat deterioration (south).
- Regional Federated Learning Clusters: Organize FL nodes into regional clusters (North, South, Urban, Rural) with intra-cluster aggregation before global updates.
- Transfer Learning Adaptation: Fine-tune a pre-trained model on region-specific data (500–1000 images) for rapid adaptation.
- Continuous Learning: Implement online learning to adapt to seasonal and temporal variations.
- Expected Outcome: Model maintains >98% accuracy across diverse regions with 2–3 months of regional adaptation.
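One way to realize regional clusters with intra-cluster aggregation before global updates is two-level weighted averaging. The sketch below is our own illustration (names and structure are assumptions); when the weights are sample counts, the two-level result is mathematically identical to flat FedAvg, so clustering changes communication topology, not the aggregate.

```python
def weighted_avg(weight_lists, counts):
    """Sample-weighted average of parameter vectors."""
    total = sum(counts)
    dim = len(weight_lists[0])
    return [sum((n / total) * w[i] for w, n in zip(weight_lists, counts))
            for i in range(dim)]

def hierarchical_fedavg(clusters):
    """Two-level aggregation.

    clusters: list of (node_weight_vectors, node_sample_counts), one entry
    per region; each region averages locally, then regions are averaged
    globally, weighted by their total sample counts.
    """
    regional = [(weighted_avg(ws, ns), sum(ns)) for ws, ns in clusters]
    return weighted_avg([w for w, _ in regional], [n for _, n in regional])
```

The practical benefit is that only one aggregate per region crosses the wide-area network, instead of one update per node.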
- Rural areas: Limited 4G coverage, no 5G.
- Remote highways: Connectivity gaps of 10–50 miles.
- Bandwidth constraints: Thousands of devices uploading model updates.
- Asynchronous Federated Learning: Store local model updates, transmit during connectivity windows.
- Edge-First Architecture: All processing on-device, only final results require connectivity.
- Satellite Backup: Use LEO satellite networks (Starlink) for remote areas.
- Compression: Gradient compression reduces FL update size by 90% (from 100 MB to 10 MB).
- Expected Outcome: 95%+ uptime even in rural areas.
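Update compression of the order quoted above (~90%) is commonly achieved with top-k sparsification: transmit only the largest-magnitude 10% of entries together with their indices. The paper does not name its compression scheme, so the sketch below is one plausible realization, not the authors' method.

```python
def topk_sparsify(update, keep_fraction=0.1):
    """Keep only the largest-|value| entries of a model update.

    Returns a sparse {index: value} map, which is what would be
    transmitted instead of the dense parameter-update vector.
    """
    k = max(1, int(len(update) * keep_fraction))
    order = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)
    return {i: update[i] for i in order[:k]}
```

Dropped entries are typically accumulated locally and re-sent once they grow large (error feedback), so the scheme loses little accuracy in practice.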
8. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Lin, S. Road Defect Detection System Based on Deep Learning. J. Comput. Signal Syst. Res. 2025, 2, 79–87.
- Yu, J.; Jiang, J.; Fichera, S.; Paoletti, P.; Layzell, L.; Mehta, D.; Luo, S. Road Surface Defect Detection—From Image-Based to Non-Image-Based: A Survey. IEEE Trans. Intell. Transp. Syst. 2024, 25, 10581–10603.
- Rathee, M.; Bačić, B.; Doborjeh, M. Automated Road Defect and Anomaly Detection for Traffic Safety: A Systematic Review. Sensors 2023, 23, 5656.
- Yuan, Y.; Yuan, Y.; Baker, T.; Kolbe, L.M.; Hogrefe, D. FedRD: Privacy-preserving adaptive Federated learning framework for intelligent hazardous Road Damage detection and warning. Future Gener. Comput. Syst. 2021, 125, 385–398.
- Saha, P.K.; Arya, D.; Sekimoto, Y. Federated learning–based global road damage detection. Comput. Aided Civ. Infrastruct. Eng. 2024, 39, 2223–2238.
- Bello-Salau, H.; Aibinu, A.M.; Onwuka, E.N.; Dukiya, J.J.; Onumanyi, A.J. Image Processing Techniques for Automated Road Defect Detection: A Survey. In Proceedings of the 2014 11th International Conference on Electronics, Computer and Computation (ICECCO), Abuja, Nigeria, 29 September–1 October 2014.
- Shaikh, M.Z.; Ahmed, Z.; Chowdhry, B.S.; Baro, E.N.; Hussain, T.; Uqaili, M.A.; Mehran, S.; Kumar, D.; Shah, A.A. State-of-the-Art Wayside Condition Monitoring Systems for Railway Wheels: A Comprehensive Review. IEEE Access 2023, 11, 13257–13279.
- Abro, B.; Jatoi, S.; Shaikh, M.Z.; Baro, E.N.; Chowdhry, B.S.; Milanova, M. Towards Smarter Road Maintenance: YOLOv7-Seg for Real-Time Detection of Surface Defects. Lect. Notes Comput. Sci. 2025, 15618, 36–54.
- Gupta, P.; Dixit, M. Image-based Road Pothole Detection Using Deep Learning Model. In Proceedings of the 2022 14th International Conference on Computational Collective Intelligence (ICCCI), Hammamet, Tunisia, 28–30 September 2022.
- Wang, Y.; Chung, S.-H.; Khan, W.A.; Wang, T.; Xu, D.J. ALADA: A lite automatic data augmentation framework for industrial defect detection. Adv. Eng. Inform. 2023, 58, 102205.
- Liu, J.; Wang, Y.; Luo, H.; Lv, G.; Guo, F.; Xie, Q. Pavement Surface Defect Recognition Method Based on Vehicle System Vibration Data and Feedforward Neural Network. Int. J. Pavement Eng. 2023, 24, 2188594.
- Orhan, F.; Erhan Eren, P. Road Hazard Detection and Sharing with Multimodal Sensor Analysis on Smartphones. In Proceedings of the 2013 Seventh International Conference on Next Generation Mobile Apps, Services, and Technologies (NGMAST 2013), Prague, Czech Republic, 25–27 September 2013.
- Meftah, I.; Hu, J.; Asham, M.A.; Meftah, A.; Zhen, L.; Wu, R. Visual Detection of Road Cracks for Autonomous Vehicles Based on Deep Learning. Sensors 2024, 24, 1647.
- Shaikh, M.Z.; Baro, E.N.; Jatoi, S. A Contribution to Reliable Rail Transport: AI-Powered Real-Time Wheel Defect Detection. Sci. Rep. 2025, 15, 43854.
- Shaikh, M.Z.; Jatoi, S.; Baro, E.N.; Das, B.; Hussain, S.; Chowdhry, B.S. FaultSeg: A Dataset for Train Wheel Defect Detection. Sci. Data 2025, 12, 309.
- Shaikh, M.Z.; Mehran, S.; Baro, E.N.; Manolova, A.; Uqaili, M.A.; Hussain, T.; Chowdhry, B.S. Design and Development of a Wayside AI-Assisted Vision System for Online Train Wheel Inspection. Eng. Rep. 2025, 7, e13027.
- Benallal, M.A.; Tayeb, M.S. An Image-based Convolutional Neural Network System for Road Defects Detection. IAES Int. J. Artif. Intell. 2023, 12, 577–584.
- Liu, Q.; Liu, Z. Image Recognition of Pavement Cracks in Autonomous Driving Scenarios Based on Deep Learning. In Proceedings of the 2024 5th International Conference on Computer Vision, Image and Deep Learning (CVIDL), Zhuhai, China, 19–21 April 2024.
- Liu, X.; Yang, X.; Shao, L.; Wang, X.; Gao, Q.; Shi, H. GM-DETR: Research on A Defect Detection Method Based on Improved DETR. Sensors 2024, 24, 3610.
- Bagheri, S.A.M.; Mojaradi, B.; Kamboozia, N.; Faizi, M. Analyzing the effects of streetscape and land use on urban accidents and predicting future accidents by using machine learning algorithms (case study: Mashhad). Heliyon 2024, 10, e33346.
- Lin, Y.-C.; Chen, W.-H.; Kuo, C.-H. Implementation of Pavement Defect Detection System on Edge Computing Platform. Appl. Sci. 2021, 11, 3725.
- Naseralavi, S.; Soltanirad, M.; Ranjbar, E.; Lucero, M.; Baghersad, M.; Piri, M.; Zada, M.J.H.; Mazaheri, A. Modeling the Severity of Crashes in Rainy Weather by Driver Gender and Crash Type. Future Transp. 2025, 5, 146.
- Shahri, P.K.; Rahmanidehkordi, A.; Ghaffari, A.; Ghasemi, A.H. Enhancing Traffic Flow in Heterogeneous Freeways: Integration of Multivariable Extremum Seeking and Filtered Feedback Linearization Control. IEEE Access 2025, 13, 129573–129587.
- Khameneh, R.T.; Barker, K.; Ramirez-Marquez, J.E. A hybrid machine learning and simulation framework for modeling and understanding disinformation-induced disruptions in public transit systems. Reliab. Eng. Syst. Saf. 2025, 255, 110656.
- Jiang, Y.; Yan, H.; Zhang, Y.; Wu, K.; Liu, R.; Lin, C. RDD-YOLOv5: Road Defect Detection Algorithm with Self-Attention Based on Unmanned Aerial Vehicle Inspection. Sensors 2023, 23, 8241.
- Wootton, A.J.; Day, C.R.; Haycock, P.W. Heterogeneous Data Fusion for The Improved Non-destructive Detection of Steel-reinforcement Defects Using Echo State Networks. Struct. Health Monit. 2022, 21, 2910–2921.
- Alshammari, S.; Song, S. 3Pod: Federated Learning-based 3 Dimensional Pothole Detection for Smart Transportation. In Proceedings of the 2022 IEEE International Smart Cities Conference (ISC2), Paphos, Cyprus, 26–29 September 2022.
- An, J.; Yang, L.; Hao, Z.; Chen, G.; Li, L. Investigation on Road Underground Defect Classification and Localization Based on Ground Penetrating Radar and Swin Transformer. Int. J. Simul. Multidiscip. 2024, 15, 7.
- Fan, R.; Wang, H.; Bocus, M.J.; Liu, M. We Learn Better Road Pothole Detection: From Attention Aggregation to Adversarial Domain Adaptation; Springer International Publishing: Cham, Switzerland, 2020.
- Mukti, S.N.A.; Tahar, K.N. Low Altitude Multispectral Mapping for Road Defect Detection. Geogr. Malays. J. Soc. Space 2021, 17, 102–115.
- Nasser, N.; Fadlullah, Z.M.; Fouda, M.M.; Ali, A.; Imran, M. A Lightweight Federated Learning Based Privacy Preserving B5G Pandemic Response Network Using Unmanned Aerial Vehicles: A Proof-of-concept. Comput. Netw. 2021, 205, 108672.
- Gebremariam, G.G.; Panda, J.; Indu, S. Blockchain-Based Secure Localization Against Malicious Nodes in IoT-Based Wireless Sensor Networks Using Federated Learning. Wirel. Commun. Mob. Comput. 2023, 2023, 8068038.
- Kulambayev, B.; Nurlybek, M.; Astaubayeva, G.; Tleuberdiyeva, G.; Zholdasbayev, S.; Tolep, A. Real-Time Road Surface Damage Detection Framework based on Mask R-CNN Model. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 757–765.
- Narlan, R.; Widiyanto, E.P.; Selatan, B.S. Automated pavement defect detection using yolov8 object detection algorithm. Pros. KRTJ HPJI 2023, 16, 1–13.
- Singh, J.; Shekhar, S. Road Damage Detection and Classification In Smartphone Captured Images Using Mask R-CNN. arXiv 2018, arXiv:1811.04535.
- Perttunen, M.; Mazhelis, O.; Cong, F.; Ristaniemi, T.; Riekki, J. Distributed Road Surface Condition Monitoring Using Mobile Phones. In Proceedings of the International Conference on Ubiquitous Intelligence and Computing, Banff, AB, Canada, 2–4 September 2011; pp. 64–78.
- Dong, D.; Li, Z. Smartphone Sensing of Road Surface Condition and Defect Detection. Sensors 2021, 21, 5433.
- Kim, G.; Kim, S. Road Defect Detection System Using Smartphones. Sensors 2024, 24, 2099.
- Ruseruka, C.; Mwakalonge, J.; Comert, G.; Siuhi, S.; Perkins, J. Road Condition Monitoring Using Vehicle Built-in Cameras and GPS Sensors: A Deep Learning Approach. Vehicles 2023, 5, 931–948.
| Split Datasets | No. of Images |
|---|---|
| Training Set | 6922 images |
| Validation Set | 2111 images |
| Testing Set | 1055 images |
| Total Images | 10,088 images |
| Hyperparameters | Value |
|---|---|
| Image Size | 640 × 640 |
| Epochs | 150 |
| Learning Rate | 0.001 |
| Last iteration learning rate factor (Lrf) | 0.1 |
| Weight decay | 0.0005 |
| Momentum | 0.937 |
| Optimizer | Adam |
| Batch Size | 4 |
| S.No | Camera Name | Resolution | No. of Classes | Types of Defects | Techniques | Performance Metrics | Citation No |
|---|---|---|---|---|---|---|---|
| 1. | Smartphone Camera | Not explicitly mentioned | 12 | Linear Crack, Grid Crack, Pavement Joints, Patchings, Fillings, Potholes, Manholes, Stains, Shadow, Pavement Markings, Scratches on Markings, Grid Crack in Patchings | Mask R-CNN | 92% Precision, 98% Recall, and 95% F1-score | [33] |
| 2. | Not explicitly mentioned | 587 × 1058 | 8 | Alligator Cracks, Longitudinal Cracks, Patching, Potholes, Raveling, Rutting, Shoving/Upheaval, Transverse Cracks | YOLOv8 | 28% Precision-Recall Score (at IoU 0.5), 31% F1-score | [34] |
| 3. | Smartphone Camera | 600 × 600 | 8 | Not explicitly mentioned | Mask R-CNN with ResNet101-FPN backbone | 52% F1-score at 50% IOU | [35] |
| 4. | Smartphone Camera | Not explicitly mentioned | 2 | Distortion, Patching, Potholes, Rutting | SVM with spectral analysis and k-means classification using accelerometer and GPS data from smartphones | ~80–84% Accuracy | [36] |
| 5. | Smartphone Camera | Not explicitly mentioned | 4 | Distortion, Patching, Potholes, Rutting | K-means unsupervised machine learning with Power Spectral Density (PSD) analysis and Butterworth bandpass filter (0.1–0.5 Hz) | 84% Average Accuracy | [37] |
| 6. | Smartphone Camera | 2560 × 1440 | 3 | Potholes, Speed Bumps, Manholes | 1D-CNN, Sliding Window Algorithm | 91% Average Accuracy | [38] |
| 7. | Vehicle Built-in Camera | 416 pixels | 9 | Fatigue/Alligator Cracks, Block Cracks, Transverse Cracks, Longitudinal—Wheel Path Cracks, Longitudinal—Non-Wheel Path Cracks, Edge, Joint, Reflective Cracks, Patches, Potholes, Raveling, Shoving, Rutting | YOLOv5 | 97.2% Mean Average Precision | [39] |
| 8. | GoPro HERO 9 Camera | 2704 × 1520 | 3 | Cracks, Potholes, Uneven Surfaces | Real-Time Detection Transformer with federated learning technique | 99.60% Mean Average Precision | Proposed Method |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Abro, B.; Jatoi, S.; Shaikh, M.Z.; Baro, E.N.; Milanova, M.; Chowdhry, B.S. Federated Learning-Based Road Defect Detection with Transformer Models for Real-Time Monitoring. Computers 2026, 15, 6. https://doi.org/10.3390/computers15010006