A Fixed-Budget Study of Real–Synthetic Data Mixing for PPE Detection in Construction
Abstract
1. Introduction
- 1.
- 2. Safety ethics and practicality: It is unethical and unsafe to deliberately stage severe non-compliance near heavy equipment solely for data collection, which biases datasets toward compliant behavior.
- 3. Long-tail hazards: Rare but safety-critical scenarios (unusual viewpoints, severe occlusion, atypical poses) are under-represented in real data, yet dominate risk in practice [22].
- Controlled fixed-budget evidence: We fix the total training budget and vary only the real–synthetic composition ratio across eleven configurations. This design isolates composition effects from scale effects and provides construction-specific empirical evidence on how limited synthetic supplementation affects recall under a fixed-budget setting.
- Cross-architecture comparison under matched budgets: We compare representative single-stage CNN, two-stage CNN, and real-time Transformer detectors under identical fixed-budget mixing conditions. The results show that detector families respond differently to synthetic injection, indicating that data strategy and model choice should be considered jointly in construction deployment.
- Recall-oriented construction interpretation with explicit limitations: We adopt AR@100 as the primary safety-oriented metric and analyze category-level transfer behavior under Sim-to-Real conditions, while explicitly discussing the limitations of the current taxonomy, validation strategy, and synthetic material realism.
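The fixed-budget design described above can be illustrated with a minimal sketch. It assumes eleven configurations G0 to G10 in 10% steps (the text only states G9 = 90% real/10% synthetic and G10 = 100% real), sampling without replacement, and hypothetical pool sizes, file names, and budget; the function name `build_fixed_budget_split` is illustrative and is not the authors' sampling protocol.

```python
import random


def build_fixed_budget_split(real_pool, synth_pool, budget, real_tenths, seed=0):
    """Return exactly `budget` image paths, with real_tenths/10 drawn from real_pool.

    Assumption: sampling without replacement; the paper's exact protocol may differ.
    """
    if not 0 <= real_tenths <= 10:
        raise ValueError("real_tenths must be between 0 and 10")
    n_real = budget * real_tenths // 10
    n_synth = budget - n_real  # total stays fixed, only the composition changes
    if n_real > len(real_pool) or n_synth > len(synth_pool):
        raise ValueError("pool too small for the requested composition")
    rng = random.Random(seed)
    return rng.sample(real_pool, n_real) + rng.sample(synth_pool, n_synth)


# Hypothetical pools and budget, for illustration only.
real_pool = [f"real_{i:04d}.jpg" for i in range(1000)]
synth_pool = [f"synth_{i:04d}.jpg" for i in range(1000)]
BUDGET = 1000

# G0 = all synthetic, ..., G10 = all real (assumed 10% steps).
configs = {
    f"G{k}": build_fixed_budget_split(real_pool, synth_pool, BUDGET, k)
    for k in range(11)
}
```

Because every configuration contains the same number of images, any difference in recall between groups can be attributed to composition rather than to training-set size.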
2. Research Objectives
- RQ1. Fixed-budget mixing ratio optimization: Under a fixed training budget, which real–synthetic mixing ratio yields the best recall performance on real construction imagery? We hypothesize that a moderate synthetic proportion can improve recall by increasing viewpoint and occlusion diversity, whereas excessive synthetic substitution may degrade performance because of Sim-to-Real domain shift.
- RQ2. Architectural sensitivity to synthetic data: Do different detector architectures (single-stage CNN, two-stage CNN, and real-time Transformer) respond differently to varying real–synthetic mixing ratios? This question is particularly relevant for construction deployment, where RT-DETR-L represents an emerging class of real-time Transformer detectors that offer a practical alternative to CNN-based systems for multi-camera site monitoring. Unlike CNN-based detectors, which rely heavily on local texture statistics, RT-DETR-L leverages global self-attention to model spatial relationships across the full image, a property hypothesized to confer greater robustness to the texture artifacts inherent in synthetic imagery. Including RT-DETR-L alongside YOLOv11s and Faster R-CNN therefore allows us to directly test whether global context modeling translates into a practical advantage under Sim-to-Real domain shift in construction-specific conditions.
- RQ3. Deployment trade-offs between recall and efficiency: What are the computational cost implications of each architecture under the identified optimal mixing ratio, and which architecture best balances recall and deployment feasibility for multi-camera construction monitoring?
- RQ4. Category-level transfer gaps: Does Sim-to-Real transfer quality differ across PPE categories (helmets vs. safety vests), and if so, what are the implications for digital twin pipeline design?
3. Related Work
3.1. Computer Vision for Construction Safety Monitoring
3.2. Digital Twins and Synthetic Data in Construction
Positioning Relative to Prior Synthetic–Real Mixing Studies
3.3. Architectural Sensitivity and Sim-to-Real Transfer
4. Methodology
4.1. Problem Formulation
4.2. Phase 1: Real-World PPE Dataset Curation
4.3. Phase 2: Digital Twin-Based Synthetic Data Generation
Automated Labeling via Virtual Photography
4.4. Phase 3: Fixed-Budget Mixing Strategy Design
Sampling Protocol Clarification (to Avoid Ambiguity)
4.5. Phase 4: Detection Architectures
4.5.1. YOLOv11s (Single-Stage CNN)
4.5.2. Faster R-CNN (Two-Stage CNN)
4.5.3. RT-DETR-L (Transformer-Based)
4.6. Phase 4: Implementation Details
- Batch Size: ≈16 (YOLOv11s), ≈8 (RT-DETR-L), ≈4 (Faster R-CNN).
- Optimization: SGD (Momentum = 0.937) for CNNs; AdamW [44] for RT-DETR.
- Epochs and Early Stopping: Up to 400 epochs; early stopping if validation AR@100 does not improve for 50 epochs.
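The stopping rule in the last item can be sketched as a small framework-independent helper. This is an illustration under stated assumptions, not the training code used in the study: the class name `EarlyStopper` and the helper `epochs_run` are hypothetical, and "does not improve" is taken to mean "no strictly higher validation AR@100".

```python
class EarlyStopper:
    """Stop when validation AR@100 has not improved for `patience` epochs,
    or when `max_epochs` epochs have been run (defaults follow the text)."""

    def __init__(self, patience=50, max_epochs=400):
        self.patience = patience
        self.max_epochs = max_epochs
        self.best = float("-inf")
        self.best_epoch = -1
        self.epochs_seen = 0

    def update(self, val_ar100):
        """Record one epoch's validation AR@100; return True if training should stop."""
        epoch = self.epochs_seen
        self.epochs_seen += 1
        if val_ar100 > self.best:  # assumption: only strict improvement counts
            self.best = val_ar100
            self.best_epoch = epoch
        stale_epochs = epoch - self.best_epoch
        return stale_epochs >= self.patience or self.epochs_seen >= self.max_epochs


def epochs_run(history, stopper):
    """Feed a sequence of per-epoch scores and return how many epochs were run."""
    for value in history:
        if stopper.update(value):
            break
    return stopper.epochs_seen
```

In practice the detector frameworks expose their own patience settings, so a helper like this is mainly useful for checking that all three architectures are stopped by the same criterion.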
4.7. Phase 4: Evaluation Metrics
- Average Recall (AR@100): COCO-style average recall with up to 100 detections per image (reported in %). This metric directly corresponds to false-negative risk under practical detection budgets, aligning with “Vision Zero” monitoring objectives [2].
- Localization-Oriented Metrics: Metrics such as mAP are inspected only as auxiliary diagnostics during analysis and fall outside the main scope of this paper, so they are not discussed in detail.
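To make the recall metric concrete, the following is a simplified sketch of a COCO-style average recall for a single class. It is an assumption-laden illustration rather than the evaluation code used in the study: it averages recall over IoU thresholds 0.50 to 0.95 in steps of 0.05, keeps at most 100 detections per image, and omits per-category averaging, crowd/ignore regions, and object-size ranges that the full COCO protocol handles. All function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matched_count(gts, dets, thr, max_dets=100):
    """Greedily match the highest-scoring detections to unmatched ground truths."""
    top = sorted(dets, key=lambda d: d[1], reverse=True)[:max_dets]
    used = set()
    for box, _score in top:
        best_j, best_iou = None, thr
        for j, gt in enumerate(gts):
            if j in used:
                continue
            value = iou(box, gt)
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j is not None:
            used.add(best_j)
    return len(used)


def average_recall(images, max_dets=100):
    """images: list of (gt_boxes, detections); detections are (box, score) pairs."""
    thresholds = [(50 + 5 * i) / 100 for i in range(10)]  # 0.50, 0.55, ..., 0.95
    total_gt = sum(len(gts) for gts, _ in images)
    if total_gt == 0:
        return 0.0
    recalls = []
    for thr in thresholds:
        hits = sum(matched_count(gts, dets, thr, max_dets) for gts, dets in images)
        recalls.append(hits / total_gt)
    return sum(recalls) / len(recalls)


# One image, two ground-truth boxes, one perfect detection: half the objects are found.
example = [([(0, 0, 10, 10), (20, 20, 30, 30)], [((0, 0, 10, 10), 0.9)])]
ar = average_recall(example)
```

Every unmatched ground-truth box lowers the score directly, which is why the metric tracks false-negative risk; multiplying the result by 100 gives the percentage form reported in the paper.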
Why Recall Dominates Compliance Monitoring
5. Experiments and Results
5.1. Experimental Setup
5.2. Quantitative Analysis: Mixing Ratios
5.2.1. Overall AR@100 Trends
- (1) Single-run non-linearity across detector families.
- (2) YOLOv11s: Single-run peak at G9, but robustness requires caution.
- (3) RT-DETR-L: Strong single-run mixed-data performance.
- (4) Faster R-CNN: Overall preference for pure real, with ratio-sensitive behavior.
Implications for Cross-Site Generalization
5.2.2. Supplementary Repeated-Seed Robustness Analysis for YOLOv11s
5.3. Computational Efficiency and Deployment
- YOLOv11s: Highest throughput (142 FPS), suitable for latency-critical edge monitoring.
- RT-DETR-L: Balanced profile (74 FPS) and the highest single-run AR@100 at G9 in the original benchmark, making it a promising option for gateway/server deployments that prioritize safety recall.
- Faster R-CNN: Slowest of the three (24 FPS), with weaker gains under synthetic injection, making it less attractive under strict latency and recall objectives.
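As a rough worked example of what these throughput figures mean for multi-camera monitoring, the sketch below estimates how many camera streams one detector instance could serve. The per-camera sampling rate of 10 frames per second is a hypothetical assumption, and the calculation ignores video decoding, batching, and pre/post-processing overhead, so it gives an upper bound rather than a deployment guarantee.

```python
import math

# Throughput figures as reported in the list above (frames per second).
THROUGHPUT_FPS = {"YOLOv11s": 142, "RT-DETR-L": 74, "Faster R-CNN": 24}

# Hypothetical assumption: each camera is sampled at 10 frames per second.
PER_CAMERA_FPS = 10


def max_cameras(model_fps, per_camera_fps):
    """Upper bound on camera streams one detector instance can keep up with."""
    if per_camera_fps <= 0:
        raise ValueError("per_camera_fps must be positive")
    return math.floor(model_fps / per_camera_fps)


capacity = {
    name: max_cameras(fps, PER_CAMERA_FPS) for name, fps in THROUGHPUT_FPS.items()
}
# capacity == {"YOLOv11s": 14, "RT-DETR-L": 7, "Faster R-CNN": 2}
```

Under this assumption, the recall advantage of RT-DETR-L at G9 comes at roughly half the stream capacity of YOLOv11s per device.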
5.4. Failure Analysis: The Materiality Gap
6. Conclusions and Future Work
6.1. Summary of Findings
- Contribution 1: This study provides fixed-budget empirical evidence rather than a universal mixing rule. The original G0–G10 benchmark shows that the effect of real–synthetic composition is strongly architecture-dependent and non-linear. Under the same fixed training budget, YOLOv11s and RT-DETR-L reach their highest single-run AR@100 at G9 (90% real/10% synthetic), whereas Faster R-CNN performs best at G10 (100% real). Therefore, the present study does not support a universally optimal mixing ratio; rather, it shows that the utility of synthetic supplementation depends on the detector family and evaluation setting.
- Contribution 2: This study clarifies cross-architecture sensitivity under matched data budgets. Under identical data pools, test conditions, and budget constraints, the three detector families respond differently to synthetic injection. YOLOv11s and RT-DETR-L are more competitive in mixed-data settings in the original single-run benchmark, whereas Faster R-CNN shows a stronger preference for pure-real training. This comparative evidence is one of the main contributions of this paper because it demonstrates that real–synthetic allocation should be treated as an architecture-sensitive design decision rather than a model-agnostic recipe.
- Contribution 3: The revision corrects the central YOLOv11s interpretation through supplementary robustness analysis. Because the original manuscript emphasized the G9-versus-G10 comparison for YOLOv11s, this revision adds a supplementary repeated-seed fixed-test analysis for these two settings. This analysis shows that G10 achieves higher mean AR@100 and lower variance than G9 on the same fixed real-only test set. Accordingly, the apparent G9 advantage for YOLOv11s should be interpreted as a single-run observation rather than as a stable robustness conclusion. In this sense, the revised manuscript now contributes a more careful and reproducibility-aware interpretation of the central YOLOv11s result.
- Contribution 4: This study identifies a persistent materiality gap for deformable PPE. The failure analysis suggests that Sim-to-Real transfer remains category-dependent: helmets transfer more reliably than safety vests, indicating that viewpoint diversity alone is insufficient when appearance realism depends strongly on cloth and material behavior. Thus, beyond the mixing-ratio comparison itself, this paper also contributes descriptive evidence that garment realism remains a practical bottleneck in digital twin-based PPE monitoring.
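The repeated-seed comparison referred to in Contribution 3 amounts to summarizing per-seed AR@100 by mean and spread. The sketch below shows one way to do this; the setting names and all numeric values are hypothetical placeholders chosen for easy arithmetic, NOT results from the paper, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def summarize(scores):
    """Mean and sample standard deviation of per-seed AR@100 values."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return {"n": len(scores), "mean": mean(scores), "std": stdev(scores)}


# Placeholder per-seed scores for two settings (illustrative only, not measured data).
runs = {
    "setting_a": [70.0, 72.0, 74.0],
    "setting_b": [71.0, 71.5, 72.0],
}

summary = {name: summarize(scores) for name, scores in runs.items()}
```

With only three seeds per setting, the standard deviation is itself a noisy estimate, which is consistent with the cautious wording used for the YOLOv11s result.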
6.2. Limitations
6.3. Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Seo, J.; Han, S.; Lee, S.; Kim, H. Computer vision techniques for construction safety and health monitoring. Adv. Eng. Inform. 2015, 29, 239–251.
- Mubasher, M.; Chen, W.T.; Merrett, H.C.; Marpaung, B. Bridging the Safety Gap: A Systematic Review of Traditional and Smart Personal Protective Equipment in Construction Safety. KSCE J. Civ. Eng. 2026, 30, 100510.
- Arshad, S.; Akinade, O.; Bello, S.; Bilal, M. Computer vision and IoT research landscape for health and safety management on construction sites. J. Build. Eng. 2023, 76, 107049.
- Li, L.; Huang, Z.; Wang, J.; Du, B.; Dai, L. Automated construction monitoring based on computer vision: A comprehensive review. Dev. Built Environ. 2026, 25, 100832.
- Khan, N.; Kim, D.; Kim, M.; Kim, D.; Lee, D. Message-passing framework for multi-camera worker tracking in construction. Autom. Constr. 2026, 181, 106610.
- Shanti, M.Z.; Cho, C.S.; Garcia de Soto, B.; Byon, Y.J.; Yeun, C.Y.; Kim, T.Y. Real-time monitoring of work-at-height safety hazards in construction sites using drones and deep learning. J. Saf. Res. 2022, 83, 364–370.
- Kulinan, A.S.; Park, M.; Aung, P.P.W.; Cha, G.; Park, S. Advancing construction site workforce safety monitoring through BIM and computer vision integration. Autom. Constr. 2024, 158, 105227.
- Zhou, X.; Li, X.; Zhu, Y.; Ma, C. Towards building digital twin: A computer vision enabled approach jointly using multi-camera and building information model (BIM). Energy Build. 2025, 335, 115523.
- Gugssa, M.; Li, L.; Pu, L.; Gurbuz, A.; Luo, Y.; Wang, J. Enabling near-real-time safety glove detection through edge computing and transfer learning: Comparative analysis of edge and cloud computing-based methods. Eng. Constr. Archit. Manag. 2024, 32, 4700–4717.
- Nath, N.D.; Behzadan, A.H.; Paal, S.G. Deep learning for site safety: Real-time detection of personal protective equipment. Autom. Constr. 2020, 112, 103085.
- Li, X.; Hu, M.; Li, B.; Tong, R. OAM-YOLO: A real-time small object detection framework for PPE compliance monitoring in industrial environments. Process Saf. Environ. Prot. 2025, 204, 108058.
- Li, Y.; Wei, H.; Han, Z.; Jiang, N.; Wang, W.; Huang, J. Computer Vision-Based Hazard Identification of Construction Site Using Visual Relationship Detection and Ontology. Buildings 2022, 12, 857.
- Liu, Y.; Wang, J.; Torbaghan, M.E. Data-driven safety management of worker-equipment interactions using visual relationship detection and semantic analysis. Autom. Constr. 2025, 175, 106181.
- Ding, Y.; Liu, Q.; Ji, A.; Li, H.; Luo, X. Monocular three-dimensional object detection for proximity monitoring in human-machine collision warning systems on construction sites. Eng. Appl. Artif. Intell. 2025, 159, 111722.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28, pp. 91–99.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Terven, J.; Cordova-Esparza, D. A comprehensive review of YOLO architectures in computer vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 5998–6008.
- Yang, B.; Zhang, B.; Han, Y.; Liu, B.; Hu, J.; Jin, Y. Vision Transformer-based visual language understanding of the construction process. Alex. Eng. J. 2024, 99, 242–256.
- Chen, Z.; Chen, H.; Imani, M.; Chen, R.; Imani, F. Vision language model for interpretable and fine-grained detection of safety compliance in diverse workplaces. Expert Syst. Appl. 2025, 265, 125769.
- Seong, J.; Kim, H.S.; Jung, H.J. Improving cross-site generalization in construction object detection via hard negative mining. Autom. Constr. 2026, 182, 106761.
- Lee, H.; Jeon, J.; Lee, D.; Park, C.; Kim, J.; Lee, D. Game engine-driven synthetic data generation for computer vision-based safety monitoring of construction workers. Autom. Constr. 2023, 155, 105060.
- Saif, W.; RazaviAlavi, S.; Kassem, M. Construction digital twin: A taxonomy and analysis of the application-technology-data triad. Autom. Constr. 2024, 167, 105715.
- Roberts, M.; Ramapuram, J.; Ranjan, A.; Kumar, A.; Sundaramoorthi, G.; Brox, T.; Koltun, V. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 10912–10922.
- Ros, G.; Sellart, L.; Materzynska, J.; Vazquez, D.; Lopez, A.M. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3234–3243.
- Han, Y.; Chen, M.; Li, N.; Ji, M.; Wang, X. Digital twin in construction safety management: Recent advances, challenges, and future directions from 4M1E perspective. Saf. Sci. 2025, 192, 107006.
- Speiser, K.; Teizer, J. Automatic creation of personalised virtual construction safety training in digital twins. Proc. Inst. Civ. Eng. Manag. Procure. Law 2024, 177, 173–183.
- Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; Abbeel, P. Domain randomisation for transferring deep neural networks from simulation to the real world. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: Piscataway, NJ, USA, 2017; pp. 23–30.
- Tremblay, J.; Prakash, A.; Acuna, D.; Brophy, M.; Jampani, V.; Anil, C.; To, T.; Cameracci, E.; Boochoon, S.; Birchfield, S. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 969–977.
- Geirhos, R.; Rubisch, P.; Michaelis, C.; Bethge, M.; Wichmann, F.A.; Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv 2018, arXiv:1811.12231.
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. RT-DETR: Real-Time Detection Transformer. arXiv 2023, arXiv:2304.08069.
- Huang, T.W.; Chen, Y.H.; Lin, J.J.; Chen, C.S. Deep learning without human labeling for on-site rebar instance segmentation using synthetic BIM data and domain adaptation. Autom. Constr. 2025, 171, 105953.
- Aung, P.P.W.; Sam, K.M.; Kulinan, A.S.; Cha, G.; Park, M. Enhancing deep learning in structural damage identification with 3D-engine synthetic data. Autom. Constr. 2025, 175, 106203.
- Cai, R.; Li, J.; Tan, Y.; Tang, J.; Chen, X. Convolutional neural networks for construction safety: A technical review of computer vision applications. Appl. Soft Comput. 2025, 180, 113374.
- Elrifaee, M.; Zayed, T. Smart IoT-BIM framework with modified zonal safety analysis (ZSA) for real-time safety monitoring in construction. Autom. Constr. 2025, 178, 106431.
- Liang, Y.; Cai, R.; Li, J.; Yi, W.; Xue, H.; Tan, Y. Gaze-guided activity recognition for task-specific personal protective equipment compliance monitoring in hot work: An ontology-computer vision approach. Adv. Eng. Inform. 2026, 70, 104211.
- Jeon, Y.; Tran, D.Q.; Kulinan, A.S.; Kim, T.; Park, M.; Park, S. Vision-based motion prediction for construction workers safety in real-time multi-camera system. Adv. Eng. Inform. 2024, 62, 102898.
- Kulinan, A.S.; Jeon, Y.; Aung, P.P.W.; Park, M.; Cha, G.; Park, S. BIM-based automated analysis of dynamic hazards for proactive safety measures during the earthwork construction stage using CCTV data. Adv. Eng. Inform. 2025, 65, 103296.
- Saif, W.; Doukari, O.; Kassem, M. Stakeholder-centric whole-lifecycle framework for guiding the development and implementation of construction digital twins. Autom. Constr. 2026, 183, 106773.
- Liang, D.; Wu, L.; Sun, M.; Hu, R.; Kong, L.; Pan, Y.; Xue, F. Transfer learning from building information model-based synthetic data for 3D module detection in point clouds of modular-integrated construction hoisting. Eng. Appl. Artif. Intell. 2026, 164, 113243.
- Garavaso, D.; Masi, F.; Musoni, P.; Castellani, U. Point cloud segmentation for 3D Clothed Human Layering. Comput. Graph. 2025, 132, 104393.
- Tiwari, R.; Khapre, S.; Singh, A. Reinforcement learning in robotic systems: A review on sim-to-real transfer. Robot. Auton. Syst. 2026, 198, 105327.
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularisation. arXiv 2017, arXiv:1711.05101.



| Group | Real:Synth Ratio | YOLOv11s (Single-Stage CNN) | RT-DETR-L (Transformer) | Faster R-CNN (Two-Stage CNN) |
|---|---|---|---|---|
| G0 | 0:100 | 2.63 | 3.79 | 18.88 |
| G1 | 10:90 | 45.73 | 57.96 | 36.46 |
| G2 | 20:80 | 47.75 | 68.74 | 41.65 |
| G3 | 30:70 | 51.28 | 66.19 | 44.11 |
| G4 | 40:60 | 55.67 | 69.39 | 52.03 |
| G5 | 50:50 | 53.58 | 76.43 | 52.58 |
| G6 | 60:40 | 65.58 | 68.38 | 47.07 |
| G7 | 70:30 | 68.83 | 77.09 | 46.48 |
| G8 | 80:20 | 68.87 | 69.66 | 53.99 |
| G9 | 90:10 | 72.06 | 80.59 | 49.85 |
| G10 | 100:0 | 64.83 | 69.77 | 55.57 |
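The peak configurations discussed in the text can be read off the table programmatically. The sketch below assumes that the eleven rows are ordered G0 to G10 and that the values are single-run AR@100 in percent; the variable and function names are illustrative.

```python
# Values copied from the table above, assumed to be ordered G0 ... G10.
AR100 = {
    "YOLOv11s": [2.63, 45.73, 47.75, 51.28, 55.67, 53.58, 65.58, 68.83, 68.87, 72.06, 64.83],
    "RT-DETR-L": [3.79, 57.96, 68.74, 66.19, 69.39, 76.43, 68.38, 77.09, 69.66, 80.59, 69.77],
    "Faster R-CNN": [18.88, 36.46, 41.65, 44.11, 52.03, 52.58, 47.07, 46.48, 53.99, 49.85, 55.57],
}


def best_group(scores):
    """Return (group label, score) for the highest value in a G0..G10 series."""
    idx = max(range(len(scores)), key=scores.__getitem__)
    return f"G{idx}", scores[idx]


best = {model: best_group(scores) for model, scores in AR100.items()}
# best["YOLOv11s"] == ("G9", 72.06); best["RT-DETR-L"] == ("G9", 80.59)
# best["Faster R-CNN"] == ("G10", 55.57)

# Difference between G9 (90% real) and G10 (100% real), in percentage points.
g9_minus_g10 = {model: round(s[9] - s[10], 2) for model, s in AR100.items()}
```

Under the assumed row order, the output matches the peaks stated in the text: G9 for YOLOv11s and RT-DETR-L, and G10 for Faster R-CNN. These are single-run values, so the differences should not be read as statistically established gaps.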
| Setting | Seeds | AR@100 (%) | AP50 (%) | AP@[0.50:0.95] (%) |
|---|---|---|---|---|
| G9 (real9_synth1) | 3 | | | |
| G10 (real10_synth0) | 3 | | | |
| Class | Images Containing the Class | Ground-Truth Instances |
|---|---|---|
| Person | 346 | 1131 |
| Helmet | 61 | 145 |
| Safety vest | 39 | 121 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Zhang, Z.; Zhang, Y.; Shide, K. A Fixed-Budget Study of Real–Synthetic Data Mixing for PPE Detection in Construction. Buildings 2026, 16, 2034. https://doi.org/10.3390/buildings16102034

