A Fixed-Budget Study of Real–Synthetic Data Mixing for PPE Detection in Construction

Zhang, Ziqi; Zhang, Yu; Shide, Kazuya

doi:10.3390/buildings16102034

by Ziqi Zhang ¹,*, Yu Zhang ² and Kazuya Shide ³

¹ Graduate School of Engineering and Science, Shibaura Institute of Technology, Tokyo 135-8548, Japan
² Graduate School of Computer and Information Sciences, Hosei University, Tokyo 184-8584, Japan
³ School of Architecture, Shibaura Institute of Technology, Tokyo 135-8548, Japan
* Author to whom correspondence should be addressed.

Buildings 2026, 16(10), 2034; https://doi.org/10.3390/buildings16102034

Submission received: 15 April 2026 / Revised: 18 May 2026 / Accepted: 18 May 2026 / Published: 21 May 2026

(This article belongs to the Section Construction Management, and Computers & Digitization)


Abstract

Vision-based monitoring of personal protective equipment (PPE) is central to construction safety, yet robust detectors remain limited by scarce, privacy-constrained site imagery. Digital twin simulation can generate labeled synthetic data at scale, but Sim-to-Real gaps make the effective use of synthetic data under a fixed training budget unclear. We benchmark YOLOv11s, Faster R-CNN, and RT-DETR-L using a controlled real–synthetic mixing protocol comprising a fixed real-only test set (400 images), separate sampling pools (5760 real and 4000 synthetic images), and eleven training configurations of approximately constant size (∼1450 images before the validation split) with real fractions ranging from 0% to 100%. Using average recall (AR@100) as the primary safety-oriented metric, the original single-run benchmark shows non-linear architecture-dependent responses to data mixing: YOLOv11s and RT-DETR-L achieve their single-run peaks at G9 (90% real/10% synthetic), whereas Faster R-CNN performs best at G10 (100% real). To assess robustness for the most central YOLOv11s comparison, we further conduct a targeted supplementary repeated-seed analysis for G9 and G10 and re-evaluate all resulting checkpoints on the same fixed real-only test set. This supplementary analysis shows that G10 achieves higher mean performance and lower variance than G9 for YOLOv11s, indicating that the apparent single-run advantage of limited synthetic supplementation is not stable across reruns. However, this robustness check is limited to the central YOLOv11s G9-versus-G10 case and should not be interpreted as a comprehensive robustness validation across all configurations and detector families. Persistent errors on safety vests further indicate a materiality gap for deformable PPE. 
Overall, these findings suggest that synthetic supplementation can be useful in some settings, but its value is architecture-dependent and sensitive to the evaluation setting, and it should be interpreted cautiously under robustness-oriented evaluation.

Keywords: construction safety; PPE detection; digital twin; synthetic data; sim-to-real; data mixing; average recall

1. Introduction

The construction industry, a cornerstone of infrastructure development and economic growth, has long been plagued by disproportionately high accident rates and occupational hazards. Despite strict safety regulations and the ambitious “Vision Zero” initiative, aimed at eliminating fatalities, construction sites remain dynamic, spatially complex, and inherently dangerous [1]. Among the hierarchy of safety controls, the proper usage of personal protective equipment (PPE)—specifically hard hats for head protection and high-visibility vests for conspicuity—constitutes a critical last line of defense. Consequently, automated monitoring of PPE compliance has become a prominent research direction in construction informatics and safety management [2].

Construction journal context and safety management lens. From a safety management viewpoint, PPE monitoring is a “leading indicator” problem: rather than investigating incidents after the fact, the aim is to continuously detect unsafe conditions early enough to trigger preventive intervention. This framing becomes more salient as construction projects move toward data-driven safety governance that integrates sensing, analytics, and operational workflows [3,4]. In such workflows, detection outputs are rarely consumed as raw bounding boxes; instead, they are used to support supervisory decisions (e.g., compliance reminders, work stoppage in high-risk zones, and targeted audits). Therefore, the practical value of an AI detector depends not only on average accuracy but also on whether it reduces missed-detection risk in the presence of occlusions, non-standard viewpoints, and long-tail field conditions.

In practice, PPE monitoring is increasingly expected to operate continuously and at scale: multi-camera networks (fixed CCTV, mobile handheld inspections, and drone flyovers) produce large volumes of imagery where manual inspection becomes impractical [5,6]. Manual checks by safety officers remain indispensable but are labor-intensive, spatially intermittent, and susceptible to human fatigue. Computer vision (CV) provides a non-intrusive alternative leveraging existing site cameras to deliver real-time, persistent compliance monitoring, aligning with Construction 4.0 and data-driven safety management [3,4]. Critically, from a safety engineering perspective, missed detections (false negatives) are often more consequential than occasional false alarms: failing to detect a missing helmet or vest directly increases exposure to severe incidents. This motivates the use of recall-oriented evaluation (AR@100) as a primary KPI in this study.

Deployment realities: multi-camera, BIM, and edge constraints. Modern site monitoring systems are rarely single-camera solutions. They increasingly combine multi-camera coverage with a BIM-aware context to interpret safety states relative to planned workspaces, access routes, and equipment zones [7,8]. At the same time, construction monitoring often faces compute and latency constraints: compliance alerts must be timely, and edge deployment may be necessary due to bandwidth, privacy, or governance constraints. Recent studies explicitly compare edge and cloud pipelines for PPE-related detection tasks, reflecting a shift from purely algorithmic novelty toward deployable system design [9]. These practical constraints further motivate a data strategy that can improve robustness without requiring unlimited real-world annotation.

Deep learning-based object detection has achieved strong performance in site monitoring applications. PPE-focused studies demonstrate that modern detectors can be deployed for near-real-time compliance monitoring under clutter, illumination changes, and occlusions [10,11]. Beyond PPE, recent construction safety research increasingly targets interaction-aware risk understanding (worker–equipment proximity, hazardous zone reasoning) using relationship-level representations and semantic analysis [12,13,14]. In parallel, architectures have evolved from computationally heavy two-stage models, such as Faster R-CNN [15], to efficient single-stage detectors such as the YOLO family [16,17], and more recently to Transformer-based detectors that model long-range context via self-attention [18,19]. Emerging multimodal directions (vision–language understanding and interpretable compliance reasoning) are also beginning to influence workplace safety analytics [20,21].

However, industrial deployment of vision-based PPE monitoring is severely hindered by the Data Scarcity Bottleneck. Training robust deep networks requires large, diverse, and accurately annotated datasets. In construction, acquiring such data is uniquely difficult due to the following:

1. Privacy and governance barriers: Site surveillance footage is frequently restricted by privacy policies, legal compliance, and project governance, limiting large-scale collection and open sharing [3,4].
2. Safety ethics and practicality: It is unethical and unsafe to deliberately stage severe non-compliance near heavy equipment solely for data collection, biasing datasets toward compliant behavior.
3. Long-tail hazards: Rare but safety-critical scenarios (unusual viewpoints, severe occlusion, atypical poses) are under-represented in real data, yet dominate risk in practice [22].

Construction-specific domain constraints justify synthetic data. These three barriers collectively make construction a domain where synthetic data generation is not merely convenient but structurally necessary. Unlike general object detection benchmarks—where large-scale real annotation is feasible and privacy concerns are minimal—construction monitoring faces simultaneous constraints on data volume (governance and access), annotation coverage (rare non-compliant events cannot be staged), and deployment diversity (viewpoints, lighting, and occlusion vary dramatically across sites and project phases). This combination of constraints has no adequate solution through real data alone within practical budgets. Synthetic generation via digital twins directly addresses all three: it bypasses privacy restrictions, enables controlled generation of rare safety-critical scenarios, and provides systematic viewpoint and occlusion coverage. The present study therefore treats synthetic data not as an optional augmentation but as a structurally motivated component of a construction safety monitoring pipeline.

Why fixed-budget mixing is a construction-relevant question. Unlike Internet-scale vision datasets, construction projects typically operate under fixed budgets for annotation, model iteration, and system integration. Practitioners frequently face a constrained design decision: given a limited per-iteration budget (e.g., ∼1–2 k training images), how should one allocate collection effort between additional real images (high cost, governance friction) versus synthetic generation (low marginal cost, but domain gap risk)? This paper deliberately frames the problem in this budgeted manner by fixing the per-configuration mixed dataset size so the outcome can be interpreted as an operational recommendation rather than a purely academic scaling result.

To circumvent these constraints, researchers increasingly turn to Synthetic Data generated via virtual simulation. High-fidelity game engines and BIM-enabled digital twins can render photorealistic images with automatic labels, enabling scalable data generation at reduced marginal cost [23,24]. Synthetic data has demonstrated value in adjacent domains such as indoor scene understanding [25] and urban perception [26]. In construction, synthetic environments are particularly attractive because they allow controlled generation of hazardous layouts, camera placements, and occlusion configurations that are difficult to capture ethically in the field, and they can be integrated into digital twin workflows for monitoring and training [27,28].

Despite these benefits, a major obstacle remains: the Sim-to-Real Domain Gap. Rendered images differ from real photos in texture fidelity, illumination physics, and sensor noise. Domain randomization mitigates this by diversifying simulation parameters [29,30], but a practical question persists for construction safety: how should real and synthetic data be mixed under a fixed training budget to maximize safety-critical performance? Furthermore, different architectures may react differently due to inductive bias. CNNs are known to rely heavily on local texture statistics [31], potentially making them sensitive to synthetic artifacts. By contrast, Transformer-based detectors can leverage global context and shape cues [18,19], which may improve robustness to texture inconsistencies; recent RT-DETR-style designs further provide a practical real-time path for Transformer deployment [32]. In the present study, however, such architecture-level differences should be interpreted as working hypotheses rather than experimentally verified mechanisms because we do not analyze attention maps, feature spaces, representation visualizations, or domain-distance metrics.

Research gaps and contributions of this study. While prior studies have demonstrated the value of synthetic data for construction vision tasks [23,33,34], three practically important gaps remain insufficiently addressed. First, existing work typically treats synthetic data as an additive resource—expanding the total dataset size rather than substituting real images—making it difficult to isolate the effect of data composition from dataset scale. Second, prior benchmarks predominantly report mAP, which balances precision and recall symmetrically; for Vision Zero monitoring, where missed detections carry asymmetric safety costs, recall-oriented evaluation is more appropriate yet remains understudied in the construction synthetic data literature. Third, no construction-specific study has systematically compared CNN-based and Transformer-based detectors under identical fixed-budget mixing conditions, leaving architectural sensitivity to Sim-to-Real domain shift an open practical question for deployment.

In this study, we address these gaps through a controlled empirical evaluation of mixed-reality training strategies. Rather than claiming a new detection algorithm, we focus on a construction-relevant data-allocation question under a fixed practical budget. We curate a hybrid dataset and benchmark YOLOv11s, Faster R-CNN, and RT-DETR-L across 11 mixing ratios under a fixed per-configuration data budget. The main contributions of this study are as follows:

Controlled fixed-budget evidence: We fix the total training budget and vary only the real–synthetic composition ratio across eleven configurations. This design isolates composition effects from scale effects and provides construction-specific empirical evidence on how limited synthetic supplementation affects recall under a fixed-budget setting.
Cross-architecture comparison under matched budgets: We compare representative single-stage CNN, two-stage CNN, and real-time Transformer detectors under identical fixed-budget mixing conditions. The results show that detector families respond differently to synthetic injection, indicating that data strategy and model choice should be considered jointly in construction deployment.
Recall-oriented construction interpretation with explicit limitations: We adopt AR@100 as the primary safety-oriented metric and analyze category-level transfer behavior under Sim-to-Real conditions, while explicitly discussing the limitations of the current taxonomy, validation strategy, and synthetic material realism.

2. Research Objectives

This study addresses the following four research questions (RQs), each motivated by the gaps identified in Section 1:

RQ1. Fixed-budget mixing ratio optimization: Under a fixed training budget, which real–synthetic mixing ratio yields the best recall performance on real construction imagery? We hypothesize that a moderate synthetic proportion can improve recall by increasing viewpoint and occlusion diversity, whereas excessive synthetic substitution may degrade performance because of Sim-to-Real domain shift.
RQ2. Architectural sensitivity to synthetic data: Do different detector architectures (single-stage CNN, two-stage CNN, and real-time Transformer) respond differently to varying real–synthetic mixing ratios? This question is particularly relevant for construction deployment, where RT-DETR-L represents an emerging class of real-time Transformer detectors that offer a practical alternative to CNN-based systems for multi-camera site monitoring. Unlike CNN-based detectors, which rely heavily on local texture statistics, RT-DETR-L leverages global self-attention to model spatial relationships across the full image, a property hypothesized to confer greater robustness to the texture artifacts inherent in synthetic imagery. Including RT-DETR-L alongside YOLOv11s and Faster R-CNN therefore allows us to directly test whether global context modeling translates into a practical advantage under Sim-to-Real domain shift in construction-specific conditions.
RQ3. Deployment trade-offs between recall and efficiency: What are the computational cost implications of each architecture under the identified optimal mixing ratio, and which architecture best balances recall and deployment feasibility for multi-camera construction monitoring?
RQ4. Category-level transfer gaps: Does Sim-to-Real transfer quality differ across PPE categories (helmets vs. safety vests), and if so, what are the implications for digital twin pipeline design?

These RQs collectively frame this study as a data-centric design problem rather than a purely algorithmic benchmark. The primary aim is to provide construction researchers and practitioners with empirical evidence on how a limited training budget can be allocated between real and synthetic data sources, as well as how detector families respond to such allocation choices.

3. Related Work

This section reviews (i) CV-based safety monitoring in construction, (ii) digital twin and simulation-enabled synthetic data generation, and (iii) architectural sensitivity under Sim-to-Real transfer.

3.1. Computer Vision for Construction Safety Monitoring

CV-based safety monitoring has matured into a core stream of construction informatics research, spanning compliance detection (PPE), hazard recognition, and interaction-aware risk analysis [1,3,4,35]. Compared with tag-based sensing (RFID/UWB), camera-based monitoring is non-intrusive and can reuse existing site infrastructure; recent reviews further emphasize the convergence of CV with IoT/BIM pipelines for operational safety management [3,36].

PPE compliance detection remains among the most direct and actionable tasks because violations can be flagged immediately and linked to safety workflows. Nath et al. [10] demonstrated practical real-time PPE detection in construction contexts, and subsequent work has continued to improve robustness to small objects, occlusion, and deployment constraints (edge vs. cloud trade-offs) [9,11]. Beyond helmets and vests, the scope of “compliance” increasingly includes gloves and task-specific PPE, where fine-grained reasoning and explainability are becoming important [21,37].

System-level monitoring: drones, motion prediction, and multi-camera tracking. Construction sites exhibit frequent occlusion due to scaffolding, temporary structures, and equipment placement; consequently, single-view perception can be brittle even if the underlying detector performs well against static benchmarks. Drone-based monitoring provides complementary vantage points for work-at-height and area-wide observation [6]. At the same time, multi-camera systems have motivated research on worker tracking, trajectory reasoning, and predictive analytics to support proactive interventions; motion prediction and tracking frameworks explicitly acknowledge the need to maintain continuity across views and time [5,38]. In BIM-enabled contexts, CCTV streams have also been integrated into dynamic hazard analysis workflows to support proactive safety measures in specific stages such as earthwork [39]. Collectively, these studies underscore that construction safety perception is not merely a detection task but a component of a broader monitoring and decision support system.

In parallel, construction safety research has shifted from isolated object detection toward risk understanding, e.g., modeling interactions among workers, equipment, and hazardous zones. Visual relationship modeling combined with ontology/semantic reasoning has been shown to represent safety-relevant interactions more explicitly, supporting hazard identification and data-driven safety management workflows [12,13]. For proximity risks, monocular 3D perception has also been explored for collision warning contexts [14]. At the system level, multi-camera tracking and motion prediction have been studied to support continuous site-wide monitoring under occlusion and camera handover [5,38].

A recurring bottleneck across these studies is data availability: collecting and annotating diverse images across projects, phases, and camera viewpoints is expensive and often constrained by privacy governance. More importantly, severe non-compliance and near-miss situations are rare and difficult to collect ethically, creating a mismatch between the training distribution and the risk-dominant tail scenarios encountered in real deployments [22].

3.2. Digital Twins and Synthetic Data in Construction

Construction digital twins have gained prominence as semantic virtual replicas that integrate geometry, schedule, and operational context [24,27,40]. Beyond planning and monitoring, a growing viewpoint treats digital twins as data factories for training vision models: simulation can generate diverse safety scenarios and camera placements while providing automatic labels [23,28]. In broader built-environment contexts, CV-enabled digital twin workflows increasingly integrate multi-camera systems with BIM for persistent monitoring [8].

The central advantage of synthetic data is controllability: viewpoint, lighting, occlusion, and agent behavior can be systematically varied to cover rare modes. Recent construction research has demonstrated synthetic-data-enhanced pipelines not only for safety monitoring but also for BIM-related perception and domain adaptation, such as synthetic BIM data for segmentation and cross-domain transfer [33] and transfer learning from BIM-based synthetic data for 3D module detection in point clouds of modular-integrated construction [41]. Related evidence from 3D-engine synthetic data has also shown benefits in other construction vision tasks (e.g., structural damage identification) [34]. Lessons from adjacent domains reinforce this premise: [25,26] show that synthetic corpora can be valuable when they expand geometric and contextual coverage beyond what is feasible to collect in reality.

Digital twin maturity and the “whole-lifecycle” perspective. For construction journals, digital twins are not purely visualization artifacts; they are often evaluated by their capacity to support stakeholder workflows, data integration, and operational decision-making across the lifecycle. Recent digital twin studies provide taxonomies and whole-lifecycle frameworks that emphasize the alignment between application requirements, enabling technologies, and data flows [24,40]. In this paper, the synthetic pipeline should be interpreted within this broader perspective: synthetic data generation is valuable when it serves a monitoring objective (recall-critical PPE detection) and can be operationalized within governance constraints. Therefore, the mixing-ratio study can be seen as a data-centric “design parameter” for digital twin-enabled safety analytics rather than an isolated computer vision ablation.

Positioning Relative to Prior Synthetic–Real Mixing Studies

Most prior studies in construction and adjacent vision domains treat synthetic data primarily as an additive resource that enlarges the overall training corpus or supports transfer learning and domain adaptation. By contrast, the present study adopts a fixed-budget perspective: the total per-configuration training size is held approximately constant while only the real–synthetic composition is varied. This distinction is important because it isolates composition effects from scale effects and makes the comparison directly interpretable as a constrained data allocation problem for construction deployment under annotation and governance constraints.

Nevertheless, construction PPE introduces a unique challenge: material realism. Even with physically based rendering, clothing appearance depends on wrinkling, draping, and reflective materials. Many real-time pipelines approximate vests using skinned meshes without high-fidelity cloth/material simulation; thus synthetic vests may appear “too clean” or lack realistic folds/retro-reflective behavior. Recent 3D clothed-human modeling studies further highlight the inherent difficulty of clothed geometry/appearance modeling, which conceptually supports the “materiality gap” framing used in our analysis [42].

3.3. Architectural Sensitivity and Sim-to-Real Transfer

The optimal architecture for mixed-reality training remains unclear in construction safety. YOLO-style single-stage detectors are widely adopted due to their speed and deployability [16,17]. Two-stage detectors such as Faster R-CNN can yield strong accuracy in data-rich settings [15], but their proposal generation may be sensitive to domain-specific low-level statistics when real and synthetic textures differ [31].

In contrast, Transformer-based detectors capture global dependencies and shape context [18,19]. RT-DETR represents a real-time detection Transformer that reports strong accuracy–speed trade-offs and has become an influential baseline for real-time Transformer detection [32]. Theoretically, global context modeling can reduce over-reliance on local texture cues, improving robustness to synthetic artifacts. The broader Sim-to-Real literature (including robotics transfer surveys) also emphasizes that cross-domain generalization depends on both representation bias and the fidelity/diversity of simulated experience [43]. Importantly for construction deployment, cross-site generalization is increasingly treated as a first-class objective, where hard negatives and long-tail cases often dominate performance; recent construction object detection research explicitly targets this challenge [22]. This paper contributes construction-specific evidence by evaluating CNN- and Transformer-based detectors under identical data budgets and controlled mixing ratios.

4. Methodology

This study establishes a unified framework to analyze how detection architectures respond to varying mixtures of real and synthetic data under a fixed per-configuration training budget. As summarized in Figure 1, the methodology is structured into four stages: (i) Real Data Collection; (ii) Synthetic Data Generation; (iii) Fixed-Budget Mixing Strategy Design; and (iv) Model Training and Evaluation on a fixed real-only test set.

4.1. Problem Formulation

We formulate PPE detection as a multi-class object detection problem. Let $D = \{(x_{i}, y_{i})\}_{i=1}^{N}$ denote a dataset where $x_{i}$ is an RGB image and $y_{i}$ is the ground-truth set of bounding boxes with labels. We use a unified label space of three classes: person, helmet, and safety vest. Bounding boxes follow $(c, x_{c}, y_{c}, w, h)$ with normalized coordinates.
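The box format above can be illustrated with a minimal sketch. It assumes that $c$ is an integer class index, that $(x_{c}, y_{c})$ is the box center, and that normalization divides by the image width and height; the text only says "normalized coordinates", so these details, as well as the function name, are assumptions made for illustration.

```python
def to_normalized_cxcywh(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space corner box to (c, x_c, y_c, w, h) with values in [0, 1].

    Assumption: normalization divides x-values by image width and
    y-values by image height.
    """
    if not (0 <= x_min < x_max <= img_w and 0 <= y_min < y_max <= img_h):
        raise ValueError("box must lie inside the image and have positive area")
    x_c = (x_min + x_max) / 2.0 / img_w  # normalized center, horizontal
    y_c = (y_min + y_max) / 2.0 / img_h  # normalized center, vertical
    w = (x_max - x_min) / img_w          # normalized width
    h = (y_max - y_min) / img_h          # normalized height
    return (class_id, x_c, y_c, w, h)


# Example: a 200 x 200 pixel box in a 640 x 480 image.
label = to_normalized_cxcywh(1, 100, 50, 300, 250, 640, 480)
```

Storing boxes relative to image size keeps the annotation independent of resolution, which is convenient when real and rendered images have different dimensions.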

The primary experimental variable is the mixing ratio between real and synthetic images in each per-configuration mixed dataset. To isolate the effect of composition (rather than quantity), we (i) fix a real-only test set ($N_{test} = 400$), (ii) maintain separate sampling pools ($|R| = 5760$ real; $|S| = 4000$ synthetic), and (iii) keep the per-configuration mixed dataset size approximately constant at $N_{mix} \approx 1450$ images (before the 10% validation split). This design makes the comparison interpretable for practitioners: the question becomes “how to allocate a fixed training budget between real and synthetic sources.”

Interpretation as a data-centric design problem. Under constrained budgets, the mixing ratio acts as a controllable “knob” that trades off real-domain appearance fidelity against synthetic-domain geometric coverage. Increasing the synthetic fraction can expand viewpoint diversity and reduce annotation burden, but also risks shifting the training distribution away from the real sensor’s characteristics. Conversely, increasing the real fraction improves texture anchoring but may under-sample rare configurations, driving missed detections in practice. The fixed-budget design therefore mirrors a realistic engineering decision: how much synthetic injection is justified for improving recall on real sites?

4.2. Phase 1: Real-World PPE Dataset Curation

Real images were sourced from a publicly available construction safety dataset on Kaggle and filtered to retain representative construction site scenarios. The collected imagery covers a diverse range of outdoor construction environments, including open excavation sites, scaffold-dense building structures, and ground-level work zones. Camera viewpoints span elevated CCTV-style oblique angles, eye-level handheld perspectives, and occasional elevated vantage points, reflecting the multi-camera monitoring configurations commonly deployed on real sites [5,6]. The workers in the images typically appear in standard PPE configurations (hard hats and high-visibility vests), with typical scenes containing one to five visible workers per frame under varying occlusion, illumination, and background clutter conditions. After filtering and remapping labels into our three-class taxonomy (person, helmet, vest), we randomly sampled 400 images as a fixed test set $D_{test}$ (held out from all training and model selection). The remaining 5760 images constitute the real sampling pool $R$.

Label harmonization and practical consistency. Because construction datasets often originate from heterogeneous sources and annotation policies, label harmonization is essential for controlled benchmarking. In this work, we use a minimal taxonomy (person/helmet/vest) that aligns with compliance monitoring workflows while keeping the detection objective stable across architectures. This minimal labeling also reflects operational constraints: practitioners frequently prioritize a small set of high-impact PPE indicators that can be reliably monitored across cameras and trades. At the same time, this simplified taxonomy does not capture finer-grained PPE categories or more nuanced compliance states, and therefore may reduce semantic richness relative to datasets with more detailed class definitions. The curated real pool $R$ serves as the appearance anchor for Sim-to-Real evaluation, while the test set remains strictly real-only to preserve external validity.
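A label remapping step of this kind could look like the following sketch. The source label names and the class index order are hypothetical (the text does not list the original dataset's label names), and dropping boxes whose label falls outside the three-class taxonomy is an assumed policy rather than the documented one.

```python
# Unified three-class taxonomy from the text; the index order is an assumption.
UNIFIED_CLASSES = ["person", "helmet", "vest"]

# Hypothetical source label names, for illustration only.
SOURCE_TO_UNIFIED = {
    "Person": "person",
    "worker": "person",
    "Hardhat": "helmet",
    "helmet": "helmet",
    "Safety Vest": "vest",
    "vest": "vest",
}


def harmonize(annotations):
    """Map (source_label, box) pairs into (class_index, box) pairs.

    Boxes whose source label has no counterpart in the unified
    taxonomy are dropped (assumed policy).
    """
    harmonized = []
    for source_label, box in annotations:
        unified = SOURCE_TO_UNIFIED.get(source_label)
        if unified is None:
            continue  # label outside the person/helmet/vest taxonomy
        harmonized.append((UNIFIED_CLASSES.index(unified), box))
    return harmonized


raw = [
    ("Hardhat", (0.5, 0.2, 0.1, 0.1)),
    ("machinery", (0.3, 0.6, 0.4, 0.3)),  # not part of the taxonomy
    ("Person", (0.5, 0.5, 0.2, 0.6)),
]
clean = harmonize(raw)
```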

4.3. Phase 2: Digital Twin-Based Synthetic Data Generation

We generated synthetic data using Unity 2022.3.62f1 (Unity Technologies, San Francisco, CA, USA), consistent with the digital twin paradigm [24,40]. Two large-scale virtual construction environments were used to cover distinct contexts (open excavation areas and scaffold-dense structures). Worker agents were animated using motion-capture assets. Helmets were randomized across standard site colors, and vests were rendered as high-visibility garments.

Materiality gap note: While rigid PPE (helmets) is geometrically stable and transfers well once anchored by real images, deformable PPE (vests) is cloth-like and often approximated in real-time pipelines. In practice, this can introduce a systematic appearance mismatch (wrinkles, draping, reflectance) that persists even when geometry and viewpoint diversity are well covered [42].

Automated Labeling via Virtual Photography

We implemented a virtual photography system to sample three deployment-relevant viewpoints: CCTV-like elevated oblique views, drone top-down views, and handheld eye-level views. Labels were generated by projecting 3D bounding volumes into the image plane, producing consistent annotations across viewpoints. In the automated synthetic annotation pipeline, heavily occluded instances may still be retained when their projected geometry remains part of the scene representation; this choice exposes the detector to difficult occlusion patterns, although it may also introduce limited label noise for near-invisible instances. For visual clarity, the representative examples shown in Figure 2 were manually screened to avoid ambiguous illustration cases. This viewpoint design is aligned with recent construction monitoring systems that emphasize multi-view coverage and airborne/mobile sensing for hard-to-observe risks [5,6].
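The projection step can be sketched as follows. This is not the Unity pipeline used in the study; it is a minimal illustration that assumes an ideal pinhole camera, a 3D box already expressed in camera coordinates (z pointing forward), and hypothetical intrinsics `fx`, `fy`, `cx`, `cy`. No visibility test is performed, which mirrors the remark above that heavily occluded instances may still receive a label.

```python
from itertools import product


def project_box_to_bbox(box_min, box_max, fx, fy, cx, cy, img_w, img_h):
    """Project the 8 corners of a camera-space 3D box and return the
    enclosing 2D box (x_min, y_min, x_max, y_max) in pixels, or None.

    Assumption: ideal pinhole model, u = fx * X / Z + cx, v = fy * Y / Z + cy.
    """
    us, vs = [], []
    # zip pairs (min, max) per axis; product enumerates all 8 corners.
    for x, y, z in product(*zip(box_min, box_max)):
        if z <= 0:
            return None  # simplification: skip boxes that reach behind the camera
        us.append(fx * x / z + cx)
        vs.append(fy * y / z + cy)
    # Clip the enclosing rectangle to the image bounds.
    x_min, x_max = max(0.0, min(us)), min(float(img_w), max(us))
    y_min, y_max = max(0.0, min(vs)), min(float(img_h), max(vs))
    if x_min >= x_max or y_min >= y_max:
        return None  # projected box falls entirely outside the image
    return (x_min, y_min, x_max, y_max)


# A worker-sized volume 4 to 5 m in front of a 640 x 480 camera.
bbox = project_box_to_bbox((-0.5, -1.0, 4.0), (0.5, 1.0, 5.0),
                           fx=800, fy=800, cx=320, cy=240,
                           img_w=640, img_h=480)
```

Taking the min and max over all eight projected corners yields a box that always encloses the volume, at the cost of being slightly loose for rotated objects.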

Controlled diversity to approximate field variability. To better approximate field variability without compromising label correctness, the rendering pipeline varies camera placement and orientation within realistic bounds for each viewpoint class. Such controlled diversity is particularly relevant in construction, where camera installations differ across projects and stages, and where occlusion patterns change rapidly as scaffolds and materials are relocated. While synthetic realism cannot fully match real sensing, the goal here is not photorealism alone; rather, it is to expand geometric coverage (rare viewpoints, partial occlusions) under a controlled label space.

Figure 2 shows representative synthetic samples. In total, 4000 synthetic images were generated, forming the synthetic sampling pool $S$.

4.4. Phase 3: Fixed-Budget Mixing Strategy Design

We construct eleven configurations ($G_{0}$ to $G_{10}$) by varying the real proportion as follows:

$$p_{real} = \frac{k}{10}, \quad p_{synth} = 1 - p_{real}, \quad k \in \{0, \dots, 10\}. \tag{1}$$

The per-configuration mixed dataset size is fixed at $N_{mix} \approx 1450$ images (before the validation split), with the real and synthetic counts given by

$$N_{real}^{(k)} = \lfloor N_{mix} \cdot p_{real} \rfloor, \quad N_{synth}^{(k)} = N_{mix} - N_{real}^{(k)}. \tag{2}$$
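As a worked illustration of Equations (1) and (2), the following minimal sketch tabulates the real and synthetic counts for each configuration. It assumes a budget of exactly 1450 images (the paper states only an approximate value), and the function name `mixing_counts` is hypothetical.

```python
def mixing_counts(n_mix: int = 1450, steps: int = 10) -> dict[int, tuple[int, int]]:
    """Return {k: (n_real, n_synth)} for k = 0..steps, following Eqs. (1)-(2)."""
    counts = {}
    for k in range(steps + 1):
        # floor(n_mix * k / steps), done in integer arithmetic to avoid float rounding
        n_real = (n_mix * k) // steps
        # the synthetic share fills the remainder of the fixed budget
        n_synth = n_mix - n_real
        counts[k] = (n_real, n_synth)
    return counts


if __name__ == "__main__":
    for k, (n_real, n_synth) in mixing_counts().items():
        print(f"G{k}: real={n_real:4d}  synth={n_synth:4d}")
```

Under this assumed budget, each step of $k$ shifts 145 images from the synthetic share to the real share, so $G_{9}$ would contain 1305 real and 145 synthetic images.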

Sampling Protocol Clarification (to Avoid Ambiguity)

Within each configuration $G_{k}$, images are sampled without replacement from $R$ and $S$ (no duplicates within that configuration). Across different configurations, sampling is independent, meaning that the same image may appear in multiple $G_{k}$. This is intentional: our goal is to isolate the effect of the mixing ratio under a controlled per-configuration budget rather than enforce disjointness across groups.

After constructing each $G_{k}$, we randomly hold out 10% as an internal validation split for early stopping and model selection within that configuration. Because this validation split is drawn from the mixed configuration rather than from an independent real-only validation set, the resulting model selection procedure should be interpreted as internal tuning under the same data composition setting rather than as a deployment-faithful validation protocol. Although the supplementary fixed real-only test evaluation introduced later in this revision provides a more deployment-oriented reference, it only partially mitigates this concern and does not eliminate the underlying limitation associated with model selection under mixed-data validation.
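The sampling and hold-out procedure described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name `build_configuration`, the file-name patterns, the per-configuration seeds, and the exact budget of 1450 are all assumptions, and the hold-out is a plain random split because the paper does not mention stratification.

```python
import random


def build_configuration(real_pool, synth_pool, k, n_mix=1450, val_fraction=0.10, seed=0):
    """Sample one configuration G_k and split off its internal validation set."""
    rng = random.Random(seed)          # assumed: one independent seed per configuration
    n_real = (n_mix * k) // 10         # Eq. (2), integer floor
    n_synth = n_mix - n_real
    # random.sample draws without replacement, so G_k contains no duplicates
    mixed = rng.sample(real_pool, n_real) + rng.sample(synth_pool, n_synth)
    rng.shuffle(mixed)
    n_val = round(val_fraction * len(mixed))
    return mixed[n_val:], mixed[:n_val]   # (train, val)


if __name__ == "__main__":
    real_pool = [f"real_{i:04d}.jpg" for i in range(5760)]     # stands in for R
    synth_pool = [f"synth_{i:04d}.jpg" for i in range(4000)]   # stands in for S
    for k in range(11):
        train, val = build_configuration(real_pool, synth_pool, k, seed=k)
        print(f"G{k}: train={len(train)} val={len(val)}")
```

Because each configuration is drawn from the full pools with its own random state, the same image can appear in several configurations, which matches the independence across $G_{k}$ stated above.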

Rationale for per-configuration fixed size. Holding $N_{mix}$ approximately constant prevents confounding composition effects with scale effects. In particular, if one configuration had substantially more total training images than another, performance differences could be attributed to dataset size rather than the real/synthetic mixture. The fixed-budget approach also reflects typical iterative development cycles on construction projects, where model updates must be produced within practical time and cost constraints. At the same time, this operational control should not be interpreted as implying full informational equivalence between real and synthetic samples: under the same nominal image budget, synthetic images may still embody narrower scene, asset, and environmental diversity than real images.

4.5. Phase 4: Detection Architectures

We benchmark three detectors selected to represent three distinct inductive bias categories relevant to construction deployment, rather than to compare models of equivalent release dates. This selection is deliberate: YOLOv11s represents the single-stage CNN family dominant in real-time edge monitoring; Faster R-CNN (ResNet-50-FPN) represents the two-stage CNN family with explicit proposal generation, which remains a widely used baseline in construction vision research [10,22]; and RT-DETR-L represents the emerging class of real-time Transformer-based detectors that leverage global self-attention [32]. Comparing these three families under identical data budgets and mixing conditions provides architectural insight that generalizes beyond any single model version—a practitioner choosing a deployment architecture benefits more from understanding inductive bias sensitivity than from a leaderboard comparison of the latest model releases.

Engineering rationale for including a Transformer-based detector. From an engineering standpoint, construction site monitoring imposes specific demands that motivate the inclusion of RT-DETR-L beyond academic interest. First, construction scenes are characterized by high spatial complexity: workers frequently appear at varying scales, partially occluded by scaffolding or equipment, and viewed from non-frontal angles across wide-area camera coverage. These conditions require detectors capable of reasoning about global scene context rather than relying solely on local texture cues—a capability central to self-attention mechanisms [18,19]. Second, real-time deployment constraints on construction sites (typically 10–25 FPS across multiple camera streams) demand architectures that balance accuracy and throughput. RT-DETR-L addresses this by combining Transformer-based global reasoning with an efficient hybrid encoder, achieving real-time inference without sacrificing detection quality [32]. Third, in the context of Sim-to-Real transfer, CNN-based detectors are known to over-rely on local texture statistics [31], making them potentially more sensitive to the texture gap between synthetic and real imagery. Transformer-based detectors, by leveraging shape and structural cues through attention, may generalize more robustly under domain shift—a hypothesis this study empirically tests under controlled mixing conditions.

4.5.1. YOLOv11s (Single-Stage CNN)

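YOLO-style detectors are widely adopted for real-time monitoring due to their efficiency [16,17]. The training objective is a weighted sum of box-regression, classification, and distribution focal loss terms:

$$L_{YOLO} = \lambda_{box} L_{box} + \lambda_{cls} L_{cls} + \lambda_{dfl} L_{dfl}. \tag{3}$$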

4.5.2. Faster R-CNN (Two-Stage CNN)

Faster R-CNN employs an RPN for proposals and a second-stage classifier/regressor [15]. Its two-stage design can be effective in data-rich regimes, but proposal generation may be sensitive to domain-specific low-level cues when synthetic textures differ from real imagery [31].

4.5.3. RT-DETR-L (Transformer-Based)

RT-DETR uses self-attention to model global dependencies [19,32], as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V. \tag{4}$$

Global context modeling is hypothesized to improve robustness to synthetic texture artifacts, which we test under controlled mixing.
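For readers less familiar with Equation (4), the following NumPy sketch implements single-head scaled dot-product attention on small 2-D arrays. It is a didactic illustration only and does not reproduce the RT-DETR-L encoder; the function name and array shapes are assumptions.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    q: (n, d_k) queries, k: (m, d_k) keys, v: (m, d_v) values.
    Returns the attended values (n, d_v) and the attention weights (n, m).
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    # subtract the row-wise maximum before exponentiating for numerical stability
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(3, 4))
    k = rng.normal(size=(5, 4))
    v = rng.normal(size=(5, 2))
    out, w = scaled_dot_product_attention(q, k, v)
    print(out.shape, w.sum(axis=-1))
```

Each output row is a weighted average over all value rows, which is the sense in which attention aggregates information from the whole scene rather than from a local neighborhood.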

4.6. Phase 4: Implementation Details

All models were implemented in PyTorch 2.0 with CUDA 11.8 (NVIDIA Corporation, Santa Clara, CA, USA) and trained on an NVIDIA GeForce RTX 4070 Ti SUPER GPU (16 GB; NVIDIA Corporation, Santa Clara, CA, USA). For consistency across $G_{0}$–$G_{10}$, augmentation and early stopping rules were held constant within each model family.

Hyperparameters:

Batch Size: ≈16 (YOLOv11s), ≈8 (RT-DETR-L), ≈4 (Faster R-CNN).
Optimization: SGD (Momentum = 0.937) for CNNs; AdamW [44] for RT-DETR.
Epochs and Early Stopping: Up to 400 epochs; early stopping if validation AR@100 does not improve for 50 epochs.
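The early stopping rule listed above (stop once validation AR@100 has not improved for 50 epochs, within a cap of 400 epochs) can be sketched as a small framework-independent helper. The class name `EarlyStopping` and its interface are hypothetical; the training frameworks actually used may implement the rule differently, for example with a minimum-improvement tolerance.

```python
class EarlyStopping:
    """Track a validation metric and signal when it has stalled for `patience` epochs."""

    def __init__(self, patience: int = 50):
        self.patience = patience
        self.best = float("-inf")
        self.best_epoch = -1
        self.stale = 0  # consecutive epochs without improvement

    def step(self, epoch: int, metric: float) -> bool:
        """Record the metric for this epoch; return True if training should stop."""
        if metric > self.best:
            self.best, self.best_epoch, self.stale = metric, epoch, 0
        else:
            self.stale += 1
        return self.stale >= self.patience


if __name__ == "__main__":
    stopper = EarlyStopping(patience=3)  # short patience for demonstration only
    for epoch, ar in enumerate([40.0, 45.0, 44.0, 45.0, 43.0, 50.0]):
        if stopper.step(epoch, ar):
            print(f"stop at epoch {epoch}, best epoch {stopper.best_epoch}")
            break
```

In a full training loop the helper would be called once per epoch inside a loop bounded by the 400-epoch cap, and the checkpoint from `best_epoch` would be the one kept.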

Data Augmentation: Mosaic (disabled in last 10 epochs), Mixup, and photometric HSV perturbations are applied consistently. Test-time augmentation is disabled to reflect realistic deployment latency.

Consistency Across Configurations. Within each model family, we keep augmentation policies and training protocols unchanged across $G_{0}$–$G_{10}$ so that observed differences can be attributed to dataset composition rather than tuning. This choice mirrors realistic deployment practice: in many construction monitoring pipelines, the primary engineering lever is data curation (what to collect, what to synthesize, and how to mix), while training recipes remain relatively standardized.

4.7. Phase 4: Evaluation Metrics

We evaluate on the fixed real-only test set. Because construction safety is dominated by the cost of missed detections, average recall is treated as the primary metric in the main discussion.

Average Recall (AR@100): COCO-style average recall with up to 100 detections per image (reported in %). This metric directly corresponds to false-negative risk under practical detection budgets, aligning with “Vision Zero” monitoring objectives [2].
Localization-Oriented Metrics: Localization-oriented metrics such as mAP are inspected only as auxiliary diagnostics during analysis and are outside the main scope of this paper; therefore, they are not discussed in detail.
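To make the primary metric concrete, the sketch below computes a simplified, single-class version of COCO-style average recall: detections are capped at 100 per image, matched greedily to ground truth in score order, and recall is averaged over IoU thresholds from 0.50 to 0.95. It is an assumption-laden illustration, not the evaluation code used in this study; the official COCO protocol additionally averages over categories and handles crowd and ignore regions, which are omitted here.

```python
import numpy as np


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_recall(images, max_dets=100):
    """images: list of (gt_boxes, detections); each detection is a (score, box) pair."""
    total_gt = sum(len(gt) for gt, _ in images)
    if total_gt == 0:
        return float("nan")
    recalls = []
    for thr in np.linspace(0.50, 0.95, 10):
        matched = 0
        for gt_boxes, dets in images:
            # keep only the top-scoring detections, as in AR@100
            top = sorted(dets, key=lambda d: d[0], reverse=True)[:max_dets]
            used = set()
            for _, box in top:
                best_j, best_iou = -1, thr
                for j, gt in enumerate(gt_boxes):
                    if j in used:
                        continue
                    val = iou(box, gt)
                    if val >= best_iou:
                        best_j, best_iou = j, val
                if best_j >= 0:  # each ground-truth box is matched at most once
                    used.add(best_j)
                    matched += 1
        recalls.append(matched / total_gt)
    return float(np.mean(recalls))


if __name__ == "__main__":
    gt = [(0.0, 0.0, 10.0, 10.0)]
    print(average_recall([(gt, [(0.9, (0.0, 0.0, 10.0, 7.2))])]))
```

A ground-truth box that no detection overlaps sufficiently lowers recall at every threshold, which is why the metric tracks missed PPE instances directly.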

Why Recall Dominates Compliance Monitoring

In compliance monitoring, a small number of false alarms can often be managed via operational filtering (e.g., temporal smoothing, region-of-interest gating, or human-in-the-loop verification). By contrast, a missed detection can fail to trigger any intervention. Therefore, AR@100 provides a direct proxy for the likelihood that the system will be able to retrieve relevant PPE instances under realistic detection budgets, which is more aligned with safety assurance than an exclusive emphasis on localization tightness.

5. Experiments and Results

We evaluate Sim-to-Real transfer under controlled mixing conditions, focusing on recall-driven safety performance. This analysis covers (i) AR@100 trends across mixing ratios, (ii) deployment-relevant efficiency, and (iii) a failure analysis emphasizing the materiality gap.

5.1. Experimental Setup

Experiments were conducted on a workstation with an Intel Core i5-14600K CPU, 32 GB RAM, and an NVIDIA GeForce RTX 4070 Ti SUPER GPU (16 GB). Across $G_{0}$–$G_{10}$, we keep the per-configuration mixed dataset size fixed at $N_{mix} \approx 1450$ (10% validation) and evaluate on the same fixed real-only test set (400 images). Thus, performance differences are attributable primarily to data composition and architecture, not dataset size.

5.2. Quantitative Analysis: Mixing Ratios

5.2.1. Overall AR@100 Trends

Table 1 summarizes AR@100 (%) for all mixing ratios.

Primary KPI. The following analysis uses AR@100 as the primary evaluation metric, consistent with the recall-oriented safety rationale established earlier in this manuscript. However, adopting a recall-oriented primary metric does not imply that false positives are operationally negligible. In real deployment, excessive false alarms may reduce trust, increase operator fatigue, and weaken intervention effectiveness; therefore, future deployment-oriented studies should examine precision–recall trade-offs and alert calibration more explicitly.

(1) Single-run non-linearity across detector families.

All three detectors exhibit strongly non-linear responses to the real–synthetic mixing ratio. Pure synthetic training ($G_{0}$) performs poorly for YOLOv11s and RT-DETR-L, indicating a substantial Sim-to-Real domain shift. Introducing even a small amount of real data produces a large performance jump, suggesting that realistic appearance statistics remain essential for anchoring the decision boundary on real test images.

(2) YOLOv11s: Single-run peak at G9, but robustness requires caution.

In the original one-run benchmark, YOLOv11s reaches its highest AR@100 at $G_{9}$ (90% real, 10% synthetic), improving from 64.83 at $G_{10}$ to 72.06 at $G_{9}$. Under the present fixed-budget setting, one possible interpretation is that limited synthetic supplementation can enrich viewpoint and occlusion coverage while preserving a largely real-domain training distribution. However, as shown later in the supplementary repeated-seed fixed-test analysis, this apparent $G_{9}$ advantage does not remain stable under reruns. Therefore, the single-run $G_{9}$ result should be interpreted as an informative observation rather than as a definitive robustness conclusion for YOLOv11s.

(3) RT-DETR-L: Strong single-run mixed-data performance.

RT-DETR-L shows strong single-run performance across multiple mixed settings and reaches its highest single-run AR@100 at $G_{9}$. It also remains comparatively strong at moderate ratios such as $G_{5}$ and $G_{7}$. A cautious interpretation is that attention-based global aggregation may help exploit synthetic geometric diversity while still benefiting from predominantly real-domain appearance cues. However, this explanation remains interpretive and should not be taken as a formally verified mechanism.

(4) Faster R-CNN: Overall preference for pure real, with ratio-sensitive behavior.

Faster R-CNN achieves its best single-run AR@100 at $G_{10}$ (55.57) and does not benefit from adding 10% synthetic data at $G_{9}$ (49.85). Importantly, the trend is not strictly monotonic, since several hybrid settings remain competitive. A conservative interpretation is therefore that Faster R-CNN overall prefers pure real data under the present dataset and budget, while the effect of synthetic mixing appears unstable and ratio-sensitive.

Implications for Cross-Site Generalization

The non-linear behavior across ratios suggests that the value of synthetic data depends not only on its availability but also on how it interacts with detector inductive bias, appearance anchoring, and hard-case coverage. Accordingly, the ratio should not be treated as a universal constant; rather, it should be interpreted as a data-centric design variable whose utility may vary across architectures and evaluation protocols.

Figure 3 visualizes the single-run AR@100 trends across the eleven real–synthetic mixing ratios and highlights the detector-specific peaks discussed above.

5.2.2. Supplementary Repeated-Seed Robustness Analysis for YOLOv11s

Because the original manuscript placed particular emphasis on the G9-versus-G10 comparison for YOLOv11s, we conducted a supplementary robustness-oriented analysis using three repeated seeds for each of these two settings. To preserve direct comparability with the main study objective, all resulting checkpoints were re-evaluated on the same fixed real-only test set used throughout the manuscript.

Table 2 shows that the supplementary repeated-seed evaluation does not support a stable advantage of G9 over G10 for YOLOv11s. Instead, G10 achieves both a higher mean AR@100 and markedly lower variance than G9. Specifically, G10 reaches $59.98 \pm 1.88$ AR@100, whereas G9 reaches $54.07 \pm 9.12$. The same pattern is observed for AP50 and AP@[0.50:0.95], where G10 also outperforms G9 on average.
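A mean ± standard deviation summary of this kind can be computed from per-seed scores as sketched below. The input values in the example are placeholders chosen for easy arithmetic and are not the per-seed results of this study, which are not listed in the text. The sketch returns both the sample (n − 1) and population (n) standard deviations because the text does not state which convention was used; with only three seeds the two differ noticeably.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample_std, population_std) for a list of per-seed scores."""
    mean = statistics.mean(scores)
    sample_std = statistics.stdev(scores)       # divides by n - 1
    population_std = statistics.pstdev(scores)  # divides by n
    return mean, sample_std, population_std


if __name__ == "__main__":
    placeholder_scores = [50.0, 60.0, 70.0]  # hypothetical values, not study results
    mean, s_std, p_std = summarize_runs(placeholder_scores)
    print(f"mean={mean:.2f}  sample_std={s_std:.2f}  population_std={p_std:.2f}")
```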

These supplementary results refine the interpretation of the original one-run benchmark. While the single-run result suggests that limited synthetic supplementation could be beneficial for YOLOv11s, the repeated-seed fixed-test analysis indicates that this advantage is not stable across reruns. Accordingly, for YOLOv11s, the present evidence supports a more cautious conclusion: under this supplementary robustness-oriented evaluation, the pure-real setting (G10) is more stable and achieves better mean performance than G9.

5.3. Computational Efficiency and Deployment

For practical deployment, recall must be balanced with computational cost. Benchmarking on an RTX 4070 Ti SUPER (Batch = 1, FP16, input $640 \times 640$) shows:

YOLOv11s: Highest throughput (142 FPS), suitable for latency-critical edge monitoring.
RT-DETR-L: Balanced profile (74 FPS) and the highest single-run AR@100 at G9 in the original benchmark, making it a promising option for gateway/server deployments that prioritize safety recall.
Faster R-CNN: Slower (24 FPS) and weaker gains under synthetic injection, less attractive under strict latency and recall objectives.

Construction deployment perspective. Real sites often operate with multiple cameras at 10–25 FPS. When monitoring must cover many streams simultaneously, the practical bottleneck is frequently throughput per GPU or per edge device. Hence, the choice of model family and the data strategy must be co-designed: a modest recall gain that is achievable with a deployable frame rate can be more valuable than a larger gain that is too costly to run continuously. This also explains why real-time architectures such as YOLO variants remain popular in construction monitoring, while Transformer-based real-time designs offer a promising middle ground when recall is prioritized.
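As a rough illustration of the throughput argument, the sketch below converts the reported single-image frame rates into an idealized upper bound on the number of camera streams one GPU could serve. It assumes that throughput divides evenly across streams and ignores video decoding, pre- and post-processing, and batching effects, so real capacity would likely be lower; the function name `max_streams` is hypothetical.

```python
def max_streams(model_fps: float, stream_fps: float) -> int:
    """Idealized number of camera streams one GPU can serve at a given per-stream rate."""
    return int(model_fps // stream_fps)


if __name__ == "__main__":
    reported_fps = {"YOLOv11s": 142, "RT-DETR-L": 74, "Faster R-CNN": 24}
    for name, fps in reported_fps.items():
        low = max_streams(fps, 25)   # streams at 25 FPS each
        high = max_streams(fps, 10)  # streams at 10 FPS each
        print(f"{name}: {low} to {high} streams")
```

Under these simplifying assumptions, YOLOv11s would cover roughly 5 to 14 streams, RT-DETR-L 2 to 7, and Faster R-CNN at most 2, which is consistent with the qualitative ranking given above.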

5.4. Failure Analysis: The Materiality Gap

To anchor the failure analysis in observable dataset statistics, Table 3 reports the class composition of the fixed real-only test set. The test set is dominated by person instances, while the two PPE categories occupy a much smaller but operationally critical subset. In particular, helmet instances appear in 61 images (145 instances), whereas safety vest instances appear in 39 images (121 instances). This class composition does not by itself establish category-level statistical significance, but it provides a quantitative basis for interpreting why repeated misses in the vest category are operationally important despite their smaller frequency. In addition, the strong dominance of the person class relative to helmet and safety vest instances indicates a visible class imbalance in the fixed test set. While this imbalance does not by itself explain all vest errors, it may partially contribute to the persistent difficulty of vest detection and should be considered when interpreting category-level performance gaps.

Against this quantitative background, inspection of failure patterns suggests that Sim-to-Real transfer is not uniform across PPE categories. A consistent observation is that helmets (rigid, stable silhouette) transfer more reliably than safety vests (deformable, cloth-dependent appearance). In Unity-style pipelines, vests are often approximated by skinned meshes lacking realistic wrinkles, draping, and reflective-strip behavior. Consequently, synthetic vests may appear overly idealized. Under real test imagery, this mismatch can surface as systematic false negatives under occlusions, body articulation, and harsh lighting, contributing to a persistent materiality gap. This aligns with broader evidence that clothed-human modeling and layered clothing geometry/appearance remain challenging even in dedicated 3D-vision research [42].

From a safety standpoint, vest compliance is often tied to visibility requirements and situational awareness, particularly in equipment-dense zones. Missed vest detections can therefore undermine the reliability of compliance analytics even if helmet detection remains comparatively strong. The practical implication is that geometric diversity alone (viewpoints/occlusions) is necessary but not sufficient when category appearance is dominated by cloth/material physics. Future digital twin pipelines should therefore treat garment realism as a first-class requirement rather than as a secondary visual refinement.

6. Conclusions and Future Work

This study set out to examine, under a fixed per-configuration training budget, whether synthetic supplementation provides stable value for PPE detection in construction environments, whether its effect is consistent across detector families, and what category-level limitations remain under Sim-to-Real transfer. Using a fixed real-only test set (400 images), separate real/synthetic pools ($|R| = 5760$, $|S| = 4000$), and a fixed mixed budget ($N_{mix} \approx 1450$ with a 10% validation split), the revised findings can be interpreted against the originally intended contributions of this paper as follows.

6.1. Summary of Findings

1. Contribution 1: This study provides fixed-budget empirical evidence rather than a universal mixing rule. The original G0–G10 benchmark shows that the effect of real–synthetic composition is strongly architecture-dependent and non-linear. Under the same fixed training budget, YOLOv11s and RT-DETR-L reach their highest single-run AR@100 at G9 (90% real/10% synthetic), whereas Faster R-CNN performs best at G10 (100% real). Therefore, the present study does not support a universally optimal mixing ratio; rather, it shows that the utility of synthetic supplementation depends on the detector family and evaluation setting.
2. Contribution 2: This study clarifies cross-architecture sensitivity under matched data budgets. Under identical data pools, test conditions, and budget constraints, the three detector families respond differently to synthetic injection. YOLOv11s and RT-DETR-L are more competitive in mixed-data settings in the original single-run benchmark, whereas Faster R-CNN shows a stronger preference for pure-real training. This comparative evidence is one of the main contributions of this paper because it demonstrates that real–synthetic allocation should be treated as an architecture-sensitive design decision rather than a model-agnostic recipe.
3. Contribution 3: The revision corrects the central YOLOv11s interpretation through supplementary robustness analysis. Because the original manuscript emphasized the G9-versus-G10 comparison for YOLOv11s, this revision adds a supplementary repeated-seed fixed-test analysis for these two settings. This analysis shows that G10 achieves higher mean AR@100 and lower variance than G9 on the same fixed real-only test set. Accordingly, the apparent G9 advantage for YOLOv11s should be interpreted as a single-run observation rather than as a stable robustness conclusion. In this sense, the revised manuscript now contributes a more careful and reproducibility-aware interpretation of the central YOLOv11s result.
4. Contribution 4: This study identifies a persistent materiality gap for deformable PPE. The failure analysis suggests that Sim-to-Real transfer remains category-dependent: helmets transfer more reliably than safety vests, indicating that viewpoint diversity alone is insufficient when appearance realism depends strongly on cloth and material behavior. Thus, beyond the mixing-ratio comparison itself, this paper also contributes descriptive evidence that garment realism remains a practical bottleneck in digital twin-based PPE monitoring.

Taken together, these findings indicate that the main contribution of this paper is not the proposal of a universally superior real–synthetic ratio, but rather the provision of fixed-budget empirical evidence showing that the value of synthetic supplementation is architecture-dependent, robustness-sensitive, and constrained by residual appearance–reality gaps.

6.2. Limitations

First, the fixed-budget setting should be interpreted as a controlled operational benchmark rather than as an assumption of full informational equivalence between real and synthetic samples. Under the same nominal image budget, synthetic images may still embody narrower scene, asset, and environmental diversity than real images. Second, the synthetic pipeline does not fully model cloth/material dynamics and broader realism factors such as weather variation and sensor noise, likely contributing to vest-related false negatives. Third, model validation in the main benchmark was performed within each mixed configuration rather than on an independent real-only validation set, which may affect model selection under domain shift. Although the supplementary fixed real-only test evaluation introduced in this revision provides a more deployment-oriented reference, it only partially mitigates this concern and does not remove the limitation associated with mixed-data validation during model selection. Fourth, the present repeated-seed robustness analysis is intentionally limited in scope: it was conducted only for the central YOLOv11s G9-versus-G10 comparison emphasized in the original manuscript. Therefore, the full G0–G10 benchmark and the other detector families (RT-DETR-L and Faster R-CNN) still remain primarily single-run evidence in the current revision. Fifth, the present revision reports the mean and standard deviation for the supplementary YOLOv11s robustness check, but does not yet provide a comprehensive statistical testing framework across all architectures and mixing ratios. Sixth, some architecture-level interpretations, especially regarding possible Transformer advantages under Sim-to-Real shift, remain qualitative hypotheses rather than experimentally verified mechanisms because the present study does not include attention map analysis, feature space visualization, or domain distance metrics. 
Seventh, although AR@100 is adopted as the primary safety-oriented metric, the current study does not explicitly model false-positive costs, alarm fatigue, or full precision–recall trade-offs in deployment. Eighth, the fixed real-only test set is visibly imbalanced, with the person class dominating the helmet and safety vest classes; this imbalance may partially contribute to persistent vest detection difficulty. Ninth, the simplified three-class taxonomy improves label harmonization but does not capture finer-grained PPE states. Tenth, the present study does not include cross-site validation across different construction environments, so broader generalization under site diversity remains to be established. Finally, evaluation is limited to static images; temporal stability in video and multi-camera temporal reasoning remain future work [38].

6.3. Future Work

Future work will focus on six directions. First, repeated-seed robustness evaluation should be extended beyond the targeted YOLOv11s G9-versus-G10 comparison to broader ratio ranges and to the other detector families, especially RT-DETR-L and Faster R-CNN. Second, stricter deployment-oriented validation strategies, including real-only validation or alternative model selection protocols, should be investigated more systematically. Third, deployment-oriented evaluation should examine precision–recall trade-offs, false-positive costs, and alert calibration more explicitly. Fourth, class imbalance-aware analysis should investigate how dominant person instances influence PPE category performance, especially for safety vest detection. Fifth, higher-fidelity synthetic generation is needed, including richer cloth/material simulation, weather variation, sensor noise modeling, and broader scene diversity, to reduce the persistent vest-related materiality gap. Sixth, cross-site validation across different construction environments should be conducted to assess the robustness and generalizability of the findings under broader real-world variation.

Author Contributions

Z.Z.: Conceptualization, methodology, software, investigation, formal analysis, visualization, writing—original draft, and writing—review and editing. Y.Z.: Data curation, validation, and writing—review and editing. K.S.: Supervision, validation, and writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The real-image data used in this study were derived from the publicly available Kaggle dataset “PPE Dataset YOLOv8,” which is distributed under the Apache License 2.0. Interested readers should access the original real-image data directly from Kaggle in accordance with the dataset license and platform terms. The authors’ processed annotations, experimental split files, configuration records, and other supporting materials required to reproduce the fixed-budget experiments are available from the corresponding author upon reasonable request. In addition, a minimal verification package sufficient to assess the reported results can be provided to the journal editorial office for editorial evaluation if required.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Seo, J.; Han, S.; Lee, S.; Kim, H. Computer vision techniques for construction safety and health monitoring. Adv. Eng. Inform. 2015, 29, 239–251.
2. Mubasher, M.; Chen, W.T.; Merrett, H.C.; Marpaung, B. Bridging the Safety Gap: A Systematic Review of Traditional and Smart Personal Protective Equipment in Construction Safety. KSCE J. Civ. Eng. 2026, 30, 100510.
3. Arshad, S.; Akinade, O.; Bello, S.; Bilal, M. Computer vision and IoT research landscape for health and safety management on construction sites. J. Build. Eng. 2023, 76, 107049.
4. Li, L.; Huang, Z.; Wang, J.; Du, B.; Dai, L. Automated construction monitoring based on computer vision: A comprehensive review. Dev. Built Environ. 2026, 25, 100832.
5. Khan, N.; Kim, D.; Kim, M.; Kim, D.; Lee, D. Message-passing framework for multi-camera worker tracking in construction. Autom. Constr. 2026, 181, 106610.
6. Shanti, M.Z.; Cho, C.S.; Garcia de Soto, B.; Byon, Y.J.; Yeun, C.Y.; Kim, T.Y. Real-time monitoring of work-at-height safety hazards in construction sites using drones and deep learning. J. Saf. Res. 2022, 83, 364–370.
7. Kulinan, A.S.; Park, M.; Aung, P.P.W.; Cha, G.; Park, S. Advancing construction site workforce safety monitoring through BIM and computer vision integration. Autom. Constr. 2024, 158, 105227.
8. Zhou, X.; Li, X.; Zhu, Y.; Ma, C. Towards building digital twin: A computer vision enabled approach jointly using multi-camera and building information model (BIM). Energy Build. 2025, 335, 115523.
9. Gugssa, M.; Li, L.; Pu, L.; Gurbuz, A.; Luo, Y.; Wang, J. Enabling near-real-time safety glove detection through edge computing and transfer learning: Comparative analysis of edge and cloud computing-based methods. Eng. Constr. Archit. Manag. 2024, 32, 4700–4717.
10. Nath, N.D.; Behzadan, A.H.; Paal, S.G. Deep learning for site safety: Real-time detection of personal protective equipment. Autom. Constr. 2020, 112, 103085.
11. Li, X.; Hu, M.; Li, B.; Tong, R. OAM-YOLO: A real-time small object detection framework for PPE compliance monitoring in industrial environments. Process Saf. Environ. Prot. 2025, 204, 108058.
12. Li, Y.; Wei, H.; Han, Z.; Jiang, N.; Wang, W.; Huang, J. Computer Vision-Based Hazard Identification of Construction Site Using Visual Relationship Detection and Ontology. Buildings 2022, 12, 857.
13. Liu, Y.; Wang, J.; Torbaghan, M.E. Data-driven safety management of worker-equipment interactions using visual relationship detection and semantic analysis. Autom. Constr. 2025, 175, 106181.
14. Ding, Y.; Liu, Q.; Ji, A.; Li, H.; Luo, X. Monocular three-dimensional object detection for proximity monitoring in human-machine collision warning systems on construction sites. Eng. Appl. Artif. Intell. 2025, 159, 111722.
15. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28, pp. 91–99.
16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
17. Terven, J.; Cordova-Esparza, D. A comprehensive review of YOLO architectures in computer vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716.
18. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 5998–6008.
20. Yang, B.; Zhang, B.; Han, Y.; Liu, B.; Hu, J.; Jin, Y. Vision Transformer-based visual language understanding of the construction process. Alex. Eng. J. 2024, 99, 242–256.
21. Chen, Z.; Chen, H.; Imani, M.; Chen, R.; Imani, F. Vision language model for interpretable and fine-grained detection of safety compliance in diverse workplaces. Expert Syst. Appl. 2025, 265, 125769.
22. Seong, J.; Kim, H.S.; Jung, H.J. Improving cross-site generalization in construction object detection via hard negative mining. Autom. Constr. 2026, 182, 106761.
23. Lee, H.; Jeon, J.; Lee, D.; Park, C.; Kim, J.; Lee, D. Game engine-driven synthetic data generation for computer vision-based safety monitoring of construction workers. Autom. Constr. 2023, 155, 105060.
24. Saif, W.; RazaviAlavi, S.; Kassem, M. Construction digital twin: A taxonomy and analysis of the application-technology-data triad. Autom. Constr. 2024, 167, 105715.
25. Roberts, M.; Ramapuram, J.; Ranjan, A.; Kumar, A.; Sundaramoorthi, G.; Brox, T.; Koltun, V. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 10912–10922.
26. Ros, G.; Sellart, L.; Materzynska, J.; Vazquez, D.; Lopez, A.M. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3234–3243.
27. Han, Y.; Chen, M.; Li, N.; Ji, M.; Wang, X. Digital twin in construction safety management: Recent advances, challenges, and future directions from 4M1E perspective. Saf. Sci. 2025, 192, 107006.
28. Speiser, K.; Teizer, J. Automatic creation of personalised virtual construction safety training in digital twins. Proc. Inst. Civ. Eng. Manag. Procure. Law 2024, 177, 173–183.
Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; Abbeel, P. Domain randomisation for transferring deep neural networks from simulation to the real world. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: Piscataway, NJ, USA, 2017; pp. 23–30. [Google Scholar]
Tremblay, J.; Prakash, A.; Acuna, D.; Brophy, M.; Jampani, V.; Anil, C.; To, T.; Cameracci, E.; Boochoon, S.; Birchfield, S. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 969–977. [Google Scholar]
Geirhos, R.; Rubisch, P.; Michaelis, C.; Bethge, M.; Wichmann, F.A.; Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv 2018, arXiv:1811.12231. [Google Scholar]
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. RT-DETR: Real-Time Detection Transformer. arXiv 2023, arXiv:2304.08069. [Google Scholar]
Huang, T.W.; Chen, Y.H.; Lin, J.J.; Chen, C.S. Deep learning without human labeling for on-site rebar instance segmentation using synthetic BIM data and domain adaptation. Autom. Constr. 2025, 171, 105953. [Google Scholar] [CrossRef]
Aung, P.P.W.; Sam, K.M.; Kulinan, A.S.; Cha, G.; Park, M. Enhancing deep learning in structural damage identification with 3D-engine synthetic data. Autom. Constr. 2025, 175, 106203. [Google Scholar] [CrossRef]
Cai, R.; Li, J.; Tan, Y.; Tang, J.; Chen, X. Convolutional neural networks for construction safety: A technical review of computer vision applications. Appl. Soft Comput. 2025, 180, 113374. [Google Scholar] [CrossRef]
Elrifaee, M.; Zayed, T. Smart IoT-BIM framework with modified zonal safety analysis (ZSA) for real-time safety monitoring in construction. Autom. Constr. 2025, 178, 106431. [Google Scholar] [CrossRef]
Liang, Y.; Cai, R.; Li, J.; Yi, W.; Xue, H.; Tan, Y. Gaze-guided activity recognition for task-specific personal protective equipment compliance monitoring in hot work: An ontology-computer vision approach. Adv. Eng. Inform. 2026, 70, 104211. [Google Scholar] [CrossRef]
Jeon, Y.; Tran, D.Q.; Kulinan, A.S.; Kim, T.; Park, M.; Park, S. Vision-based motion prediction for construction workers safety in real-time multi-camera system. Adv. Eng. Inform. 2024, 62, 102898. [Google Scholar] [CrossRef]
Kulinan, A.S.; Jeon, Y.; Aung, P.P.W.; Park, M.; Cha, G.; Park, S. BIM-based automated analysis of dynamic hazards for proactive safety measures during the earthwork construction stage using CCTV data. Adv. Eng. Inform. 2025, 65, 103296. [Google Scholar] [CrossRef]
Saif, W.; Doukari, O.; Kassem, M. Stakeholder-centric whole-lifecycle framework for guiding the development and implementation of construction digital twins. Autom. Constr. 2026, 183, 106773. [Google Scholar] [CrossRef]
Liang, D.; Wu, L.; Sun, M.; Hu, R.; Kong, L.; Pan, Y.; Xue, F. Transfer learning from building information model-based synthetic data for 3D module detection in point clouds of modular-integrated construction hoisting. Eng. Appl. Artif. Intell. 2026, 164, 113243. [Google Scholar] [CrossRef]
Garavaso, D.; Masi, F.; Musoni, P.; Castellani, U. Point cloud segmentation for 3D Clothed Human Layering. Comput. Graph. 2025, 132, 104393. [Google Scholar] [CrossRef]
Tiwari, R.; Khapre, S.; Singh, A. Reinforcement learning in robotic systems: A review on sim-to-real transfer. Robot. Auton. Syst. 2026, 198, 105327. [Google Scholar] [CrossRef]
Loshchilov, I.; Hutter, F. Decoupled weight decay regularisation. arXiv 2017, arXiv:1711.05101. [Google Scholar]

Figure 1. Overview of the proposed methodology for the fixed-budget study of real–synthetic data mixing for PPE detection in construction. The workflow consists of four stages: (1) real data collection and label harmonization, (2) synthetic data generation in Unity with domain randomization, (3) fixed-budget mixing strategy design, and (4) model training and evaluation using three detectors under a common real-image test set and the primary metric of average recall (AR@100).
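
Stage (3) of this workflow, the fixed-budget mixing design, can be illustrated with a minimal sketch. It assumes that the budget of approximately 1450 images is taken as exactly 1450, that per-group counts are obtained by simple rounding, and that images are drawn without replacement from the two pools. The function names and the seed are hypothetical and are not taken from the authors' implementation.

```python
import random

REAL_POOL_SIZE = 5760    # real sampling pool size reported in the abstract
SYNTH_POOL_SIZE = 4000   # synthetic sampling pool size reported in the abstract
BUDGET = 1450            # assumption: the "approximately 1450" budget taken as exact


def mix_counts(k, budget=BUDGET):
    """Return (n_real, n_synth) for group G_k, where p_real = k / 10."""
    if not 0 <= k <= 10:
        raise ValueError("k must be between 0 and 10")
    n_real = round(budget * k / 10)
    return n_real, budget - n_real


def sample_group(k, seed=0):
    """Draw image indices for G_k without replacement from each pool."""
    rng = random.Random(seed)  # hypothetical seed, used only for reproducibility
    n_real, n_synth = mix_counts(k)
    real_ids = rng.sample(range(REAL_POOL_SIZE), n_real)
    synth_ids = rng.sample(range(SYNTH_POOL_SIZE), n_synth)
    return real_ids, synth_ids


for k in range(11):
    n_real, n_synth = mix_counts(k)
    print(f"G{k}: {n_real} real + {n_synth} synthetic = {n_real + n_synth}")
```

Under these assumptions every group has the same total size, so differences between groups reflect the real fraction rather than the amount of training data.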

Figure 2. Representative Unity-synthetic images with auto-labeled person, helmet, and vest.

Figure 3. Single-run AR@100 versus real–synthetic mixing ratio for the three detectors. In the original one-run benchmark, YOLOv11s and RT-DETR-L reach their highest single-run AR@100 at $G_{9}$, whereas Faster R-CNN peaks at $G_{10}$. For YOLOv11s, this single-run trend is complemented by a supplementary repeated-seed fixed-test robustness analysis reported later in the Results section.

Table 1. Single-run AR@100 (%) on the fixed real-world test set for different real–synthetic mixing ratios. $G_{k}$ denotes $p_{real} = k / 10$ and $p_{synth} = 1 - p_{real}$. These results summarize the original one-run benchmark across $G_{0}$–$G_{10}$ and should be interpreted separately from the supplementary repeated-seed robustness analysis reported later for the central YOLOv11s G9-versus-G10 comparison. Bold values indicate the highest single-run AR@100 within each detector column.

Group	Real:Synth Ratio	YOLOv11s (Single-Stage CNN)	RT-DETR-L (Transformer)	Faster R-CNN (Two-Stage CNN)
$G_{0}$	$0 : 10$	2.63	3.79	18.88
$G_{1}$	$1 : 9$	45.73	57.96	36.46
$G_{2}$	$2 : 8$	47.75	68.74	41.65
$G_{3}$	$3 : 7$	51.28	66.19	44.11
$G_{4}$	$4 : 6$	55.67	69.39	52.03
$G_{5}$	$5 : 5$	53.58	76.43	52.58
$G_{6}$	$6 : 4$	65.58	68.38	47.07
$G_{7}$	$7 : 3$	68.83	77.09	46.48
$G_{8}$	$8 : 2$	68.87	69.66	53.99
$G_{9}$	$9 : 1$	72.06	80.59	49.85
$G_{10}$	$10 : 0$	64.83	69.77	55.57
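
The per-detector peaks and the non-monotonic steps in Table 1 can be checked with a short sketch. The values are copied from the table; the helper names `best_group` and `drops` are hypothetical and serve only to illustrate the reading of the table.

```python
# AR@100 (%) for G0..G10, copied from Table 1 (single-run results).
AR100 = {
    "YOLOv11s": [2.63, 45.73, 47.75, 51.28, 55.67, 53.58,
                 65.58, 68.83, 68.87, 72.06, 64.83],
    "RT-DETR-L": [3.79, 57.96, 68.74, 66.19, 69.39, 76.43,
                  68.38, 77.09, 69.66, 80.59, 69.77],
    "Faster R-CNN": [18.88, 36.46, 41.65, 44.11, 52.03, 52.58,
                     47.07, 46.48, 53.99, 49.85, 55.57],
}


def best_group(scores):
    """Return (k, score) for the group G_k with the highest AR@100."""
    k = max(range(len(scores)), key=lambda i: scores[i])
    return k, scores[k]


def drops(scores):
    """Return every k where G_k scores lower than G_{k-1}."""
    return [k for k in range(1, len(scores)) if scores[k] < scores[k - 1]]


for name, scores in AR100.items():
    k, top = best_group(scores)
    print(f"{name}: peak at G{k} ({top:.2f}%), drops at {drops(scores)}")
```

Each detector shows at least two steps where adding real data lowers the single-run AR@100, which is the non-linear response described in the abstract. Because each cell is a single run, these steps may partly reflect run-to-run variance.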

Table 2. Supplementary repeated-seed fixed-test evaluation for the central YOLOv11s comparison between G9 (90% real/10% synthetic) and G10 (100% real). All values are reported as mean ± standard deviation over three seeds on the same fixed real-only test set.

Setting	Seeds	AR@100 (%)	AP50 (%)	AP@[0.50:0.95] (%)
G9 (real9_synth1)	3	$54.07 \pm 9.12$	$57.13 \pm 18.22$	$31.00 \pm 13.07$
G10 (real10_synth0)	3	$59.98 \pm 1.88$	$70.13 \pm 1.37$	$39.81 \pm 1.00$
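
The contrast between G9 and G10 in Table 2 can be made explicit with simple arithmetic on the reported summaries. The sketch below uses only the published mean and standard deviation values; per-seed results are not listed in the table, so nothing is recomputed from raw runs. The coefficient of variation is an illustrative choice introduced here, not a statistic reported by the authors.

```python
# (mean, standard deviation) in percent over three seeds, copied from Table 2.
TABLE2 = {
    "AR@100": {"G9": (54.07, 9.12), "G10": (59.98, 1.88)},
    "AP50": {"G9": (57.13, 18.22), "G10": (70.13, 1.37)},
    "AP@[0.50:0.95]": {"G9": (31.00, 13.07), "G10": (39.81, 1.00)},
}


def compare(metric):
    """Return the G10 minus G9 mean gap and each setting's coefficient of variation."""
    mean9, std9 = TABLE2[metric]["G9"]
    mean10, std10 = TABLE2[metric]["G10"]
    return {
        "gap_points": round(mean10 - mean9, 2),       # percentage points
        "cv_g9_pct": round(100 * std9 / mean9, 1),    # relative spread for G9
        "cv_g10_pct": round(100 * std10 / mean10, 1), # relative spread for G10
    }


for metric in TABLE2:
    print(metric, compare(metric))
```

For AR@100 this gives a gap of 5.91 points in favor of G10, with a relative spread of about 16.9% for G9 against about 3.1% for G10. With only three seeds per setting these figures are descriptive and should not be read as a significance test.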

Table 3. Class composition of the fixed real-only test set used to support the failure analysis.

Class	Images Containing the Class	Ground-Truth Instances
Person	346	1131
Helmet	61	145
Safety vest	39	121
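
The class imbalance in Table 3 can be quantified with a short sketch. It takes the test-set size of 400 images from the abstract and derives, for each class, the share of test images containing it, its share of all ground-truth instances, and the mean number of instances per containing image. The function name `class_stats` is hypothetical.

```python
TEST_IMAGES = 400  # fixed real-only test set size reported in the abstract

# class -> (images containing the class, ground-truth instances), from Table 3
TABLE3 = {
    "Person": (346, 1131),
    "Helmet": (61, 145),
    "Safety vest": (39, 121),
}


def class_stats(table=TABLE3, n_images=TEST_IMAGES):
    """Return per-class image share, instance share, and instances per image."""
    total_instances = sum(instances for _, instances in table.values())
    stats = {}
    for name, (images, instances) in table.items():
        stats[name] = {
            "image_share_pct": round(100 * images / n_images, 2),
            "instance_share_pct": round(100 * instances / total_instances, 2),
            "instances_per_image": round(instances / images, 2),
        }
    return stats


for name, row in class_stats().items():
    print(name, row)
```

Safety vests appear in 9.75% of the test images and account for about 8.7% of all instances, so per-class results for vests rest on a much smaller sample than those for persons.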

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
