SC-YOLO: A Real-Time CSP-Based YOLOv11n Variant Optimized with Sophia for Accurate PPE Detection on Construction Sites

Teerapun Saeheaw

doi:10.3390/buildings15162854

Department of Teacher Training in Mechanical Engineering, King Mongkut’s University of Technology North Bangkok, Bangkok 10800, Thailand

Buildings 2025, 15(16), 2854; https://doi.org/10.3390/buildings15162854

This article belongs to the Section Construction Management, and Computers & Digitization

Abstract

Despite advances in YOLO-based PPE detection, existing approaches primarily focus on architectural modifications. However, these approaches overlook second-order optimization methods for navigating complex loss landscapes in object detection. This study introduces SC-YOLO, integrating CSPDarknet backbone with Sophia optimization (leveraging efficient Hessian estimates for curvature-aware updates) for enhanced PPE detection on construction sites. The proposed methodology includes three key steps: (1) systematic evaluation of EfficientNet, DINOv2, and CSPDarknet backbones, (2) integration of Sophia second-order optimizer with CSPDarknet for curvature-aware updates, and (3) cross-dataset validation in diverse construction scenarios. Traditional manual PPE inspection exhibits operational limitations, including high error rates (12–15%) and labor-intensive processes. SC-YOLO addresses these challenges through automated detection with potential for real-time deployment in construction safety applications. Experiments on VOC2007-1 and ML-31005 datasets demonstrate improved performance, achieving 96.3–97.6% mAP@0.5 and 63.6–68.6% mAP@0.5:0.95. Notable gains include a 9.03% improvement in detecting transparent objects. The second-order optimization achieves faster convergence with 7% computational overhead compared to baseline methods, showing enhanced robustness over conventional YOLO variants in complex construction environments.

Keywords: computer vision; construction safety; personal protective equipment (PPE); real-time detection; Sophia optimizer; YOLOv11n

1. Introduction

Construction sites present inherently hazardous environments where safety compliance is critical to preventing injuries and fatalities. The proper use of personal protective equipment (PPE), including hardhats, vests, gloves, and safety glasses, forms a cornerstone of occupational safety. However, manual supervision of PPE compliance is labor-intensive, error-prone, and often infeasible in large-scale operations. In response, computer vision-based automation has emerged as a practical approach for enabling real-time monitoring of PPE usage [1,2,3].

Deep learning approaches for object detection have evolved from two-stage detectors to more efficient one-stage architectures like YOLO. These architectures balance speed and accuracy requirements for real-time safety monitoring applications. However, detection of small and occluded PPE components—like gloves and safety glasses—presents ongoing technical challenges [4,5]. Additional challenges include ensuring convergence stability and training efficiency, especially on cluttered construction scenes with scale variance and background noise [6,7].

To address these challenges, this study adopts YOLOv11n as a baseline due to its computational efficiency while maintaining accuracy for small-object detection and reduced inference latency. This makes it suitable for real-time PPE detection on edge devices [8,9].

Traditional manual PPE monitoring in construction environments relies on periodic visual inspections conducted by safety supervisors and site personnel. However, documented limitations of this approach present operational challenges for construction project managers. Musarat et al. [10] quantify these limitations, reporting that manual inspection exhibits “average error incidence ranges from 12 to 15% for periodic supervisor checks and manual processes” compared to automated systems achieving “2–5% error rates.” Traditional processes are characterized as “labour-intensive (RII 0.792)” and “time-consuming (RII 0.768)” while resulting in “higher expenses with traditional methods (RII 0.652).”

Rasouli et al. [11] note that conventional approaches suffer from “labor-intensive and time-consuming observation,” where effectiveness is compromised because “workers avoid using these PPEs during their tasks.” This behavioral challenge is compounded by the limitation of passive monitoring systems, where “this passive RFID system is useless during working hours and can be only used at entrance of construction sites” [11], highlighting the need for continuous operational monitoring rather than point-in-time verification.

The comparative evaluation framework for construction safety monitoring encompasses operational dimensions that impact project management decisions. Equipment requirements differ between approaches, with manual systems requiring basic inspection tools and safety personnel, while automated systems necessitate camera infrastructure and computing resources. Labor resource allocation presents considerations where traditional methods demand dedicated supervision time compared to automated systems enabling “real-time tracking of labourers, apparatus, and progress (RII 0.82).” Coverage scope varies between periodic inspection schedules and continuous monitoring capabilities, while economic factors show that traditional approaches result in ongoing personnel costs versus technology investment for automated solutions.

Smart PPE technologies address these limitations by integrating advanced sensing capabilities that enable real-time data collection and analysis while offering a proactive approach to hazard identification and risk management [11]. Unlike traditional passive systems, smart PPE enables continuous monitoring during actual work execution, addressing the behavioral compliance gaps inherent in periodic inspection approaches. Automated PPE detection systems provide advantages for construction safety management. Musarat et al. [10] document that automated approaches provide capabilities for “hazardous situations anticipated via tracking workers/machines (RII 0.796)” while addressing the limitation where “lack of real-time information with traditional techniques (RII 0.752)” constrains effective safety oversight in complex construction environments.

Following established multi-criteria decision-making methodologies demonstrated by Abdel-Basset et al. [12] using “Analytical Hierarchy approach (AHP)” and “Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS),” this framework enables construction managers to evaluate monitoring approaches based on project scale, budget constraints, and operational requirements. The deployment of automated PPE detection systems requires resource allocation considerations that extend beyond technical capabilities. Optimal deployment strategies must account for both “risk aversion” and “cost considerations” in safety resource allocation [13]. This approach ensures that technological solutions align with project-specific constraints while maximizing safety outcomes through evidence-based resource optimization.

The research evaluates three backbone architectures—EfficientNet, DINOv2, and CSPDarknet—to explore their trade-offs between detection precision and resource demands for PPE detection tasks. Among these options, CSPDarknet’s cross-stage partial connections provided an effective balance of performance and efficiency.

To enhance model performance, this study integrates Sophia—a second-order clipped stochastic optimizer—which has shown effectiveness in improving convergence speed, training stability, and reducing gradient explosion in complex deep learning tasks [14]. The proposed SC-YOLO represents a hybrid architecture that enhances the YOLOv11n baseline by integrating the CSPDarknet backbone with Sophia optimization. The model addresses challenges in small-object detection, training instability, and real-time deployment considerations in construction safety monitoring.

The following section examines recent advancements in YOLO-based PPE detection to contextualize the current contributions and identify research gaps that SC-YOLO aims to address.

2. Related Works

Recent advancements in YOLO-based PPE detection have shown improvements in both accuracy and computational efficiency for construction site monitoring. This section synthesizes key studies from 2022–2025, focusing on architectural innovations, detection strategies, and performance benchmarks specifically within YOLO-based frameworks for PPE detection.

2.1. Evolution of Detection Models for Construction Safety

PPE detection systems have evolved considerably in recent years, with various object detection frameworks being adapted to address construction site safety challenges. Early approaches to object detection in safety applications utilized conventional models such as SSD and Faster R-CNN, which offered strong detection capabilities but exhibited high latency and limited deployment feasibility in edge environments [15,16]. The introduction of one-stage detectors such as the YOLO family has improved the trade-off between speed and accuracy. YOLOv5 and YOLOv8 have shown promising results across multiple benchmark datasets, including for PPE detection tasks [17,18], and have become widely adopted in practice.

Two-stage detectors like Faster R-CNN, while offering strong detection capabilities, have shown limitations in construction settings due to their slower inference speed and higher computational requirements. Meanwhile, attention-heavy Transformer models like DINOv2 often exhibit high box localization errors in dense detection tasks typical of construction environments [19,20]. These limitations highlight the need for lightweight yet effective detection models specifically adapted to PPE monitoring challenges.

Initial implementations primarily utilized YOLOv3 and YOLOv4 [2,21] due to their favorable balance between detection accuracy and inference speed. Subsequent research has expanded on these foundations by enhancing feature extraction through attention modules and backbone fusion, as seen in BDC-YOLOv5 [22], YOLO-PL [23], and FFA-YOLOv7 [17].

From 2023 onward, Transformer-based enhancements such as Swin-Transformer [4] and Metaheuristic Optimization [24] have begun addressing detection challenges in occluded or cluttered environments. Concurrently, YOLOv8 and its lightweight variants [25,26] have emerged as leading models due to their modular design, attention-augmented perception, and real-time inference suitability. More recently, YOLOv11 has introduced architectural refinements over YOLOv8, including dynamic head scaling and compound bottleneck designs, which improve precision and training stability, particularly for small-object detection with reduced inference latency [27,28].

2.2. Specialized Detection Strategies and Backbone Architectures

Detection of small and occluded PPE components—like gloves and safety glasses—presents ongoing technical challenges [4,5]. Several models have developed customized strategies for improved robustness in these scenarios. For example, MARA-YOLO [29] leveraged MobileOne and receptive field fusion to enhance multi-class PPE detection. SDCB-YOLO [25] and MEAG-YOLO [30] incorporated multi-scale attention mechanisms to resolve feature confusion in overlapping PPE regions. Meanwhile, semi-supervised methods [31] and edge-device deployment models [28] have emphasized practical integration in safety-critical settings.

Several backbone architectures have been explored to address the specific challenges of PPE detection.

EfficientNet applies compound scaling to balance width, depth, and resolution, and has been widely adopted in efficient object detectors, including waste detection, defect recognition, and agricultural monitoring systems [32,33,34]. DINOv2, a self-supervised vision Transformer, is increasingly applied in tasks requiring semantic generalization and long-range context modeling, such as medical image segmentation and industrial search [19,20,35]. These Transformer-based backbones have shown effectiveness in feature representation and long-range spatial reasoning, though often at the cost of localization precision and inference speed.

CSPDarknet, known for its cross-stage partial connections, has become widely adopted in YOLO variants due to its efficient design that makes it particularly suitable for real-time applications with computational constraints. Its cross-stage partial connections enhance gradient flow while reducing computational redundancy, proving effective across various applications, including industrial inspection and PPE monitoring [8,36].

Real-world deployment faces challenges in ensuring convergence stability and training efficiency, particularly in complex construction environments with scale variance and background noise [6,7]. Furthermore, applications have also extended to hazard-specific contexts. Onososen et al. [37] integrated PPE detection with fatigue monitoring, Kumar et al. [2] combined PPE with fire detection, and Ji et al. [38] enhanced detection reliability using ensemble classification.

2.3. Optimization Approaches in Object Detection

While architectural innovations have dominated PPE detection research, optimization strategies have received comparatively less attention. Most existing works rely on standard first-order optimizers like SGD or Adam, which may struggle with the complex loss landscapes encountered in dense object detection tasks. This represents a notable gap in the current research landscape.

Sophia has been recently adopted in both NLP and computer vision tasks for improving convergence speed, training stability, and reducing gradient explosion. Its efficiency has been validated in large-scale pretraining of language models and scalable Transformer-based systems [14]. However, its application to object detection frameworks, particularly YOLO-based architectures, remains largely unexplored.

Recent studies have begun investigating alternative optimization approaches for object detection. For instance, Nguyen et al. [24] explored metaheuristic optimization for PPE detection, while Ji et al. [38] utilized random forest optimization to enhance reliability. While promising, these approaches still operate primarily within first-order optimization paradigms and do not address the challenges of navigating complex loss landscapes with high-curvature regions.

2.4. Comparative Analysis and Research Gap

Table 1 presents a systematic comparison of YOLO-based PPE detection approaches, revealing three primary trends. First, models developed after 2023 increasingly adopt lightweight backbones (e.g., YOLOv8n, MobileOne) and integrate attention modules (e.g., CoordAttention, CBAM) to address real-time inference constraints. Second, architectural fusion techniques (e.g., BiFPN, RFA, CARAFE) have become widespread to enhance multi-scale feature aggregation and improve small-object detection. Third, although loss function and attention mechanism optimizations are common, innovative approaches at the optimizer level remain limited.

Table 1. Technical comparison of YOLO-based PPE detection studies (2022–2025).

While numerous architectural improvements have been proposed for YOLO-based detectors, few studies have investigated the impact of advanced optimization techniques on detection performance. This represents a notable gap in the current research landscape, particularly regarding second-order optimization methods that can potentially navigate the complex loss landscapes of dense object detection tasks more effectively.

The proposed SC-YOLO addresses this research gap by introducing second-order optimization within the YOLO framework. Unlike previous approaches that focus primarily on architectural modifications, SC-YOLO incorporates Sophia to directly improve training dynamics, especially in scenarios involving small objects and highly curved loss landscapes. While several reviewed models achieve high accuracy (e.g., FFA-YOLOv7: 98.16% mAP@0.5, MARA-YOLO: 74.7%, MEAG-YOLO: 96.5%), they often rely heavily on attention modules or complex backbone structures, potentially compromising inference speed or training stability. Notably, prior studies have not utilized second-order optimizers within the YOLO framework, establishing the methodological contribution of SC-YOLO.

SC-YOLO distinguishes itself by combining the CSPDarknet backbone, known for efficiently balancing speed and feature integrity, with Sophia, providing curvature-aware optimization. Unlike traditional first-order methods (e.g., SGD, Adam, Seahorse Optimization), Sophia enhances training dynamics under high-gradient and high-curvature scenarios common to PPE detection on construction sites.

Additionally, unlike many prior works that evaluate models on a single dataset, this study employs two complementary datasets with diverse characteristics to ensure robust assessment of model generalization capabilities. This cross-dataset validation approach provides stronger evidence for the effectiveness of the proposed methods in real-world scenarios.

To address these identified research gaps, the following section details the comprehensive methodological framework developed to implement and evaluate the proposed SC-YOLO architecture.

3. Methodology

The research framework encompasses the systematic development and evaluation of four YOLO variants, with focus on the integration of Sophia with the CSPDarknet backbone.

3.1. YOLOv11 Architecture Enhancement Framework

YOLOv11 serves as the foundational architecture for all model variants in this study, developed as an evolution of previous YOLO versions to balance computational efficiency and detection accuracy. This addresses limitations observed in earlier iterations such as YOLOv8, YOLOv9, and YOLOv10 [42]. The YOLOv11 architecture includes five variants—YOLOv11n, YOLOv11s, YOLOv11m, YOLOv11l, and YOLOv11x—each varying in network depth and complexity, catering to diverse application requirements ranging from lightweight, real-time detection to high-accuracy computation-intensive tasks [27].

The YOLOv11n variant is selected as the baseline model due to its suitability for deployment in real-time and embedded detection scenarios, particularly critical for tasks such as monitoring construction workers’ use of protective equipment like helmets and safety vests [27]. As illustrated in Figure 1, YOLOv11 architecture consists of three primary components: a backbone for robust feature extraction, a neck structured around the Feature Pyramid Network (FPN) for effective multi-scale feature fusion, and a head optimized for object classification and localization.

Figure 1. The architecture of the proposed YOLOv11-based detection framework, illustrating the backbone, feature fusion neck, and multi-scale detection heads. Color coding: green (SPPF), yellow (convolution layers), blue (C3k2 modules), red (concatenation operations), pink (upsampling), purple (detection heads).

The backbone integrates advanced modules, notably the Convolutional Cross-Connections with Kernel size 2 (C3k2) module and the Spatial Pyramid Pooling Fast (SPPF) module, enhancing feature extraction across multiple scales [43]. Additionally, the Cross-Stage Partial Self-Attention (C2PSA) module enhances contextual sensitivity by emphasizing salient features and filtering irrelevant information, improving detection performance in densely occluded scenes typical of construction sites [8].

The neck structure of YOLOv11 employs efficient upsampling, concatenation, and the refined C3k2 module to facilitate multi-scale semantic feature fusion. The head, designed for both classification and bounding box regression, incorporates depthwise separable convolutions (DWConv) to reduce computational load while preserving accuracy [8].

YOLOv11 [8] delivers multi-task processing capabilities—simultaneously performing object detection, instance segmentation, and pose estimation within a unified framework. Its modular architecture ensures adaptability to custom detection tasks with minimal retraining, proving particularly effective in dynamic, complex environments. Performance benchmarks demonstrate significant efficiency–accuracy trade-offs: YOLOv11x achieves 54.7% mAP@0.5:0.95 on COCO with superior resource optimization over YOLOv8x/YOLOv9e/YOLOv10 variants, while the lightweight YOLOv11n attains 43.2% mAP@0.5:0.95 using merely 4.7 M parameters and 9.3 GFLOPs. This represents a 1.4% mAP improvement over YOLOv10n alongside reduced inference latency. These advancements enable robust real-time detection in construction scenarios, reliably identifying PPE compliance (helmets/vests), recognizing equipment zones, and handling occlusion/variable lighting conditions through domain-specific visual noise management.

3.2. Backbone Architectures

The backbone architecture influences a detection model’s representational capacity and computational performance, serving as the foundational feature extractor that captures spatial hierarchies and semantic context. This study evaluates four distinct backbone architectures: EfficientNet, DINOv2 Vision Transformer, CSPDarknet, and the CSPDarknet variant enhanced by second-order optimization (SC-YOLO). Each architecture addresses specific trade-offs between detection precision, inference speed, and model compactness relevant for real-time construction site monitoring. Table 2 presents a comparative analysis of these backbone architectures, highlighting their key features, optimization methods, strengths, limitations, and primary applications.

Table 2. Comparative Analysis of Backbone Architectures.

3.2.1. EfficientNet Backbone (Efficient-YOLO)

Efficient-YOLO employs EfficientNet as the backbone network, utilizing its compound scaling principle that simultaneously balances network depth, width, and input resolution [32]. EfficientNet’s lightweight structure employs depthwise separable convolutions and mobile inverted bottleneck convolution (MBConv) blocks to reduce computational overhead while maintaining detection performance [33,44,45].

As shown in Figure 2, the architecture integrates an EfficientNet backbone with a multi-scale fusion neck and YOLO detection heads. This architecture benefits the detection of small or densely distributed targets through enhanced Bi-directional Feature Pyramid Networks (BiFPN) [34,46].

Figure 2. The architecture of the proposed Efficient-YOLO model, consisting of three main modules: an EfficientNet backbone for feature extraction, a multi-scale fusion neck, and YOLO detection heads for object localization and classification. Purple blocks: MBConv layers; Green blocks: Feature pyramid levels (P3–P7); Pink/Blue paths: BiFPN’s top-down/bottom-up flows; Yellow circles: Feature fusion; _td: Top-down features; _out: Final output features for detection heads (F3, F4, F5).

3.2.2. Self-Supervised Vision Transformer Backbone (DINOv2-YOLO)

DINOv2-YOLO integrates a self-supervised Vision Transformer (ViT) backbone into the YOLO detection framework as illustrated in Figure 3. The DINOv2 backbone leverages self-supervised learning principles [19,47] and multi-head self-attention mechanisms to capture long-range dependencies without extensive labeled datasets. This approach enables the identification of features and patterns in construction site imagery [20,38].

Figure 3. The architecture of the proposed DINOv2-YOLO model, integrating a self-supervised Vision Transformer backbone with the YOLO detection head.

The model processes images through patch embedding and a Transformer-based encoder, with attention layers dynamically focusing on relevant image regions. Output features are enhanced using SPPF and C2PSA modules to retain semantic details [48]. These features then flow through the YOLO architecture’s neck and detection heads for object localization and classification. This architecture offers advantages in scenarios with complex visual contexts and limited training data [49,50].

3.2.3. CSPDarknet Backbone (CSP-YOLO)

CSP-YOLO utilizes the Cross Stage Partial Darknet (CSPDarknet) architecture as its backbone, optimizing computational efficiency and gradient propagation through strategic feature map splitting [51,52]. This architecture retains comprehensive semantic features while reducing parameters and computational complexity through partial cross-stage connections [36,53].

CSPDarknet employs cross-stage partial connections that strategically divide feature maps to enhance gradient flow while reducing computational redundancy. The architecture has proven effective across various applications, including industrial inspection and PPE monitoring, making it particularly suitable for real-time applications with computational constraints.

As shown in Figure 4, CSP-YOLO comprises a CSPDarknet backbone with convolutional layers and C3 modules for hierarchical feature extraction, enhanced by SPPF and C2PSA modules for improved receptive fields and multi-scale feature interactions [5,54]. The architecture’s neck employs multi-path feature fusion with C3k2 modules, while the detection head operates at three scales for accurate object detection across various sizes [7,55]. This design achieves a balance between efficiency and accuracy for real-time detection in complex environments [56].

Figure 4. The overall architecture of the CSP-YOLO model, illustrating the integration of the CSPDarknet backbone with SPPF and C2PSA modules, followed by a neck composed of C3k2-based multi-scale feature fusion and a three-level detection head.

3.2.4. Proposed CSPDarknet with Sophia (SC-YOLO)

The proposed SC-YOLO integrates the CSPDarknet backbone with Sophia optimizer, enhancing detection performance for complex construction site monitoring scenarios. As illustrated in Figure 5, the architecture demonstrates interactions among its backbone, neck, and detection head components through a comprehensive training workflow.

Figure 5. SC-YOLO complete training architecture with integrated Sophia optimization, showing detailed workflow from model initialization through parameter updates. Numbers 0–23 indicate sequential processing layers: Backbone (Layers 0–10), Neck (Layers 11–22), and Head (Layer 23).

SC-YOLO’s backbone leverages CSPDarknet’s computational efficiency and optimized gradient propagation, using cross-stage partial connections to retain semantic-rich features while reducing computational redundancy. Enhanced with SPPF and C2PSA modules, the backbone achieves contextual sensitivity and multi-scale feature integration suitable for PPE detection.

The model’s neck employs multi-scale fusion with upsampling, concatenation, and C3k2 modules, while the detection head uses three resolution-specific YOLO prediction modules with depthwise separable convolutions for efficient object detection at various scales.

The training process employs a weighted multi-component loss function that balances detection accuracy across different aspects:

$$L_{total} = 7.5 \times L_{box} + 0.5 \times L_{cls} + 1.5 \times L_{dfl} \tag{1}$$

where L_box represents bounding box regression loss focusing on localization accuracy, L_cls denotes classification loss for object category prediction, and L_dfl indicates distribution focal loss (DFL) for box quality assessment. The weights (7.5, 0.5, 1.5) were empirically determined through preliminary experiments to optimize the precision-recall trade-off. The relatively high weight for L_box (7.5) reflects the importance of precise localization in PPE detection, where accurate bounding box prediction is essential for safety compliance monitoring. The moderate weight for L_dfl (1.5) emphasizes box quality assessment, while the lower weight for L_cls (0.5) reflects that classification between PPE categories is typically less challenging than precise localization. These weights follow established practices in object detection literature where localization losses are commonly weighted higher than classification losses [57,58].
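The weighting in Equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not the training code used in the study: the function name `total_loss` and the component values passed to it are hypothetical, and only the three weights come from the text.

```python
# Loss weights as stated in Equation (1).
LOSS_WEIGHTS = {"box": 7.5, "cls": 0.5, "dfl": 1.5}


def total_loss(l_box: float, l_cls: float, l_dfl: float) -> float:
    """Weighted sum of the three loss components from Equation (1)."""
    return (LOSS_WEIGHTS["box"] * l_box
            + LOSS_WEIGHTS["cls"] * l_cls
            + LOSS_WEIGHTS["dfl"] * l_dfl)


# Hypothetical component values, chosen only to show the arithmetic.
example_total = total_loss(l_box=0.8, l_cls=0.6, l_dfl=1.0)
# Share of the total contributed by the box regression term.
box_share = LOSS_WEIGHTS["box"] * 0.8 / example_total
```

With these illustrative inputs the box term accounts for roughly three quarters of the total, which shows how strongly the 7.5 weight steers the gradient toward localization.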

The comprehensive loss formulation in Equation (1) serves as the foundation for gradient computation throughout the network architecture. These gradients, computed according to Equation (2), provide the essential directional information that guides the Sophia optimizer in making intelligent parameter adjustments across all network layers.

The core innovation in SC-YOLO is the integration of Sophia, an optimization approach that leverages second-order curvature information from Hessian estimates. By dynamically adapting update steps based on local curvature conditions, Sophia enables accelerated and stable convergence in the challenging loss landscapes encountered during PPE detection.

During training, gradients are computed using the following formulation:

$$g_t = \nabla_θ L_{total} \tag{2}$$

where g_t represents the gradient vector at training step t, ∇_θ denotes the gradient operator with respect to all network parameters, θ encompasses all trainable parameters distributed across the backbone (θ₀ to θ₁₀), neck (θ₁₁ to θ₂₂), and head (θ₂₃) components, and L_total is the weighted multi-component loss function defined in Equation (1).

The Sophia optimizer then applies curvature-aware updates with an element-wise clipping mechanism that prevents excessive parameter updates when the Hessian estimates are near zero or negative, ensuring training stability through second-order optimization principles.
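The clipping behaviour can be illustrated with a small, framework-free sketch. It assumes the update form from the published Sophia formulation, θ ← θ − η · clip(m / max(γ·h, ε), 1), where m is the gradient moving average and h the diagonal Hessian estimate. The function name `sophia_step`, the default values, and the example numbers are hypothetical, and weight decay is left out for brevity.

```python
def sophia_step(theta, m, h, lr=0.1, gamma=0.05, eps=1e-12):
    """One element-wise clipped update on plain lists of floats.

    theta: parameters, m: gradient moving average,
    h: diagonal Hessian estimate (may be zero or negative).
    """
    new_theta = []
    for p, m_i, h_i in zip(theta, m, h):
        # max(gamma * h, eps) keeps the denominator positive even when
        # the Hessian estimate is near zero or negative.
        ratio = m_i / max(gamma * h_i, eps)
        # Clip to [-1, 1] so no coordinate moves by more than lr per step.
        clipped = max(-1.0, min(1.0, ratio))
        new_theta.append(p - lr * clipped)
    return new_theta


# Illustrative values: positive, zero, and negative curvature estimates.
updated = sophia_step(theta=[1.0, 1.0, 1.0],
                      m=[0.02, 0.5, 0.5],
                      h=[4.0, 0.0, -3.0])
```

In this example only the first coordinate takes a curvature-scaled step. The other two hit the clipping bound and move by exactly the learning rate, which is the safeguard described above.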

The optimized parameters from Equation (2) are distributed across three main network components: backbone parameters θ₀ to θ₁₀ handle feature extraction through CSPDarknet, SPPF, and C2PSA modules; neck parameters θ₁₁ to θ₂₂ manage multi-scale feature fusion via FPN-PAN architecture; and head parameters θ₂₃ control final detection outputs, including classification and regression predictions.

This integrated design advances both detection accuracy and computational efficiency, making SC-YOLO a suitable solution for real-time PPE detection in resource-constrained construction environments.

3.3. Sophia: Second-Order Clipped Stochastic Optimization

Sophia is a second-order clipped stochastic optimization technique that aims to address the limitations of first-order methods such as Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam). Unlike these optimizers, Sophia incorporates curvature information through an inexpensive estimate of the diagonal of the Hessian matrix to navigate the loss landscape more efficiently [14].

3.3.1. Motivation

Traditional optimization methods often struggle with heterogeneous curvatures across different parameter dimensions. In loss landscapes where some dimensions exhibit sharp curvature while others remain relatively flat, first-order methods apply updates that progress slowly in flat directions while potentially oscillating in sharp directions.

To illustrate this limitation, consider a simplified two-dimensional loss function $L(θ_{[1]}, θ_{[2]}) = L_{1}(θ_{[1]}) + L_{2}(θ_{[2]})$, where $L_{1}$ has much sharper curvature than $L_{2}$. With standard gradient descent, the update is constrained by the sharpest dimension:

$$θ_{[1]} \leftarrow θ_{[1]} - η \cdot L_{1}'(θ_{[1]}) \tag{3}$$

$$θ_{[2]} \leftarrow θ_{[2]} - η \cdot L_{2}'(θ_{[2]}) \tag{4}$$

The learning rate $η$ must be small enough to maintain stability in the sharp dimension $θ_{[1]}$, as shown in Equation (3), which consequently results in slow progress along the flat dimension $θ_{[2]}$, as shown in Equation (4).

Similarly, sign-based methods like SignGD (which approximates Adam’s behavior) apply uniform updates regardless of curvature:

$$θ_{[1]} \leftarrow θ_{[1]} - η \cdot \operatorname{sign}(L_{1}'(θ_{[1]})) \tag{5}$$

$$θ_{[2]} \leftarrow θ_{[2]} - η \cdot \operatorname{sign}(L_{2}'(θ_{[2]})) \tag{6}$$

This uniform update size leads to inefficient optimization, with quick but bouncing progress in sharp dimensions as seen in Equation (5) and slow convergence in flat dimensions as demonstrated by Equation (6).
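A toy numerical experiment makes the contrast between Equations (3)–(4) and (5)–(6) concrete. The sketch below assumes quadratic components L_i(θ_[i]) = 0.5 · c_i · θ_[i]², with hypothetical curvatures c = (100, 1), and all learning rates and step counts are chosen for illustration only.

```python
CURVATURE = (100.0, 1.0)  # hypothetical: sharp dimension, flat dimension


def sign(x):
    return (x > 0) - (x < 0)


def grad(theta):
    # L_i(t) = 0.5 * c_i * t**2, so the derivative is c_i * t.
    return [c * t for c, t in zip(CURVATURE, theta)]


def gd_step(theta, lr):
    # Equations (3) and (4): step proportional to the raw derivative.
    return [t - lr * g for t, g in zip(theta, grad(theta))]


def signgd_step(theta, lr):
    # Equations (5) and (6): step of fixed size lr in every dimension.
    return [t - lr * sign(g) for t, g in zip(theta, grad(theta))]


def run(step, theta, lr, n_steps):
    for _ in range(n_steps):
        theta = step(theta, lr)
    return theta


# lr must stay below 2 / 100 = 0.02 for gradient descent to be stable
# in the sharp dimension, which leaves the flat dimension barely moving.
gd_final = run(gd_step, [1.0, 1.0], lr=0.019, n_steps=50)
# SignGD moves both coordinates by the same amount, ignoring curvature.
sign_first = signgd_step([1.0, 1.0], lr=0.03)
```

After 50 gradient descent steps the sharp coordinate is close to its minimum while the flat coordinate has covered well under two thirds of the distance. The single SignGD step moves both coordinates by the same 0.03, regardless of the hundredfold difference in curvature.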

3.3.2. Hessian Estimators

Sophia employs two efficient estimators for the diagonal Hessian, both designed to minimize computational overhead. Algorithm 1 presents the Hutchinson estimator, which provides an unbiased estimate of the diagonal Hessian elements.

Algorithm 1: Hutchinson(θ)

1: Input: parameter θ
2: Compute mini-batch loss L(θ)
3: Draw u from N(0, I_{d})
4: Return u ⊙ \nabla(⟨\nabla L(θ), u⟩)

The Hutchinson estimator, where ⊙ denotes element-wise multiplication, has the property that E[\hat{h}] = diag(\nabla^{2} L(θ)), where E[\cdot] represents the expected value, \hat{h} is the estimated diagonal Hessian vector, and diag(\nabla^{2} L(θ)) refers to the diagonal elements of the full Hessian matrix.
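The unbiasedness property can be checked numerically with a small dependency-free sketch. The toy quadratic loss, the matrix `A`, and the finite-difference Hessian-vector product are assumptions made to keep the example self-contained; in practice an automatic differentiation framework would supply this product directly.

```python
import random

# Symmetric Hessian of the toy loss L(theta) = 0.5 * theta^T A theta (illustrative values).
A = [[4.0, 1.0, 0.5],
     [1.0, 2.0, 0.3],
     [0.5, 0.3, 1.0]]


def grad(theta):
    """Gradient of the toy loss, i.e. A @ theta."""
    return [sum(A[i][j] * theta[j] for j in range(3)) for i in range(3)]


def hessian_vector_product(theta, u, eps=1e-3):
    """Approximate H u = grad(<grad L, u>) by central differences of the gradient."""
    plus = grad([t + eps * ui for t, ui in zip(theta, u)])
    minus = grad([t - eps * ui for t, ui in zip(theta, u)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]


def hutchinson(theta, rng):
    """One sample of u * (H u) with u ~ N(0, I), following Algorithm 1."""
    u = [rng.gauss(0.0, 1.0) for _ in theta]
    hu = hessian_vector_product(theta, u)
    return [ui * hi for ui, hi in zip(u, hu)]


rng = random.Random(0)
theta = [0.3, -0.2, 0.5]
n_samples = 20000
mean_estimate = [0.0, 0.0, 0.0]
for _ in range(n_samples):
    sample = hutchinson(theta, rng)
    mean_estimate = [m + s / n_samples for m, s in zip(mean_estimate, sample)]

print("averaged estimate:", [round(m, 2) for m in mean_estimate])
print("true diagonal:    ", [A[i][i] for i in range(3)])
```

A single sample is noisy, which is one reason Algorithm 3 smooths the estimates with an exponential moving average; averaged over many draws, the estimate approaches the true diagonal.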

For classification tasks, Algorithm 2 presents the Gauss–Newton–Bartlett estimator, which leverages the structure of the loss function.

Algorithm 2: Gauss–Newton–Bartlett(θ)

1: Input: parameter θ
2: Draw a mini-batch of input \{x_{b}\}_{b = 1}^{B}
3: Compute logits on the mini-batch: \{f(θ, x_{b})\}_{b = 1}^{B}
4: Sample \hat{y}_{b} ~ softmax(f(θ, x_{b})), \forall b \in [B]
5: Calculate \hat{g} = \nabla(\frac{1}{B} \sum_{b = 1}^{B} l(f(θ, x_{b}), \hat{y}_{b}))
6: Return B \cdot \hat{g} ⊙ \hat{g}

This estimator exploits Bartlett’s identities to efficiently approximate the diagonal of the Gauss–Newton matrix, making it effective for cross-entropy losses.
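The steps of Algorithm 2 can be sketched for a deliberately simple case in which the logits are the parameters themselves, so that the Gauss–Newton diagonal has the closed form p_i(1 − p_i). This toy model, the batch size, and the names `softmax` and `gnb_sample` are assumptions for illustration and do not reflect a detection network.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gnb_sample(theta, batch_size, rng):
    """One Gauss-Newton-Bartlett sample for a toy model whose logits equal theta."""
    probs = softmax(theta)                                   # step 3: logits -> probabilities
    classes = list(range(len(theta)))
    g_hat = [0.0] * len(theta)
    for _ in range(batch_size):
        y = rng.choices(classes, weights=probs)[0]           # step 4: sample a label
        for i, p in enumerate(probs):
            # Cross-entropy gradient w.r.t. logit i is p_i - 1[y == i]; average over the batch.
            g_hat[i] += (p - (1.0 if i == y else 0.0)) / batch_size   # step 5
    return [batch_size * g * g for g in g_hat]               # step 6: B * g_hat ⊙ g_hat


rng = random.Random(0)
theta = [1.0, 0.0, -1.0]
trials = 5000
mean_estimate = [0.0, 0.0, 0.0]
for _ in range(trials):
    sample = gnb_sample(theta, batch_size=8, rng=rng)
    mean_estimate = [m + s / trials for m, s in zip(mean_estimate, sample)]

expected = [p * (1.0 - p) for p in softmax(theta)]
print("averaged estimate:", [round(m, 3) for m in mean_estimate])
print("p * (1 - p):      ", [round(e, 3) for e in expected])
```

Because the labels are sampled from the model's own predictions rather than taken from the data, the estimate only needs one extra backward pass and is non-negative by construction.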

3.3.3. Complete Algorithm

The full Sophia algorithm, presented in Algorithm 3, combines these estimators with an element-wise clipping mechanism to ensure stable and efficient optimization.

Algorithm 3: Sophia

1: Input: θ_{1}, learning rate \{η_{t}\}_{t = 1}^{T}, hyperparameters λ, γ, β_{1}, β_{2}, ε, and estimator choice Estimator ∈ {Hutchinson, Gauss–Newton–Bartlett}
2: Set m_{0} = 0, v_{0} = 0, h_{1 - k} = 0
3: For t = 1 to T do
4:     Compute mini-batch loss L_{t}(θ_{t})
5:     Compute g_{t} = \nabla L_{t}(θ_{t})
6:     m_{t} = β_{1} m_{t - 1} + (1 - β_{1}) g_{t}
7:     If t mod k = 1 then
8:         Compute \hat{h}_{t} = Estimator(θ_{t})
9:         h_{t} = β_{2} h_{t - k} + (1 - β_{2}) \hat{h}_{t}
10:     Else
11:         h_{t} = h_{t - 1}
12:     θ_{t} = θ_{t} - η_{t} λ θ_{t} // weight decay
13:     θ_{t + 1} = θ_{t} - η_{t} \cdot clip(m_{t} / max\{γ \cdot h_{t}, ε\}, 1)
14: End For

The algorithm maintains exponential moving averages (EMAs) of both gradients and Hessian estimates, updating the diagonal Hessian only every k steps to reduce computational overhead. The parameter γ controls the clipping threshold, preventing excessively large updates when the Hessian estimate is small or negative.

When any entry of h_{t} is negative or near zero, the corresponding entry in the preconditioned gradient becomes extremely large, triggering the clipping mechanism. This causes the optimizer to default to sign-based momentum in those dimensions, providing a reliable fallback strategy.
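The clipped update in lines 12 and 13 of Algorithm 3, including the sign-based fallback, can be sketched on plain Python lists. The function names `clip` and `sophia_update` and the numerical values of `m` and `h` are hypothetical and chosen only to exercise the three cases.

```python
def clip(x, rho=1.0):
    """Clamp x to the interval [-rho, rho]."""
    return max(-rho, min(rho, x))


def sophia_update(theta, m, h, lr, gamma=0.01, eps=1e-12, weight_decay=0.0):
    """Apply lines 12-13 of Algorithm 3 element-wise."""
    new_theta = []
    for t, m_i, h_i in zip(theta, m, h):
        t = t - lr * weight_decay * t                 # line 12: weight decay
        ratio = m_i / max(gamma * h_i, eps)           # momentum preconditioned by curvature
        new_theta.append(t - lr * clip(ratio, 1.0))   # line 13: clipped step
    return new_theta


theta = [1.0, 1.0, 1.0]
m = [0.5, 0.5, -0.5]     # gradient EMA (illustrative values)
h = [100.0, -3.0, 0.0]   # Hessian EMA: positive, negative, and zero curvature
print(sophia_update(theta, m, h, lr=0.1))
```

With these values the first coordinate takes a curvature-scaled step of half the learning rate, whereas the coordinates with negative and zero curvature are clipped and move by exactly the learning rate in the direction opposite to the sign of the momentum.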

3.3.4. Theoretical Properties

Sophia demonstrates favorable theoretical properties compared to first-order methods, particularly for optimizing functions with heterogeneous curvatures. For twice continuously differentiable, strictly convex functions under certain regularity conditions, Sophia’s convergence rate does not depend on the condition number (the ratio between maximum and minimum curvature).

This advantage emerges from the diagonal Hessian preconditioner, which automatically scales updates based on the local curvature in each dimension. In sharp dimensions (where the Hessian is large), updates are appropriately small, while in flat dimensions (where the Hessian is small), updates are larger.

For convex quadratic functions under theoretical conditions, Sophia potentially requires fewer iterations than first-order methods when the condition number is large. This is because Sophia’s convergence rate is largely independent of the ratio between maximum and minimum curvatures, unlike first-order methods, whose performance degrades substantially in poorly-conditioned optimization landscapes.
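The dependence on the condition number can be illustrated with an idealized comparison on a two-dimensional convex quadratic. The sketch below uses the exact curvature as the preconditioner and omits momentum, EMAs, and stochastic estimates, so it is a simplified stand-in for the mechanism rather than an implementation of Sophia; the function names, step sizes, and tolerance are assumptions.

```python
def gd_iterations(kappa, tol=1e-3, max_steps=10**6):
    """Steps for gradient descent on L = 0.5*(kappa*a**2 + b**2) to reach |a|, |b| < tol."""
    lr = 1.0 / kappa          # step size limited by the sharp curvature kappa
    a, b = 1.0, 1.0
    for step in range(1, max_steps + 1):
        a -= lr * kappa * a
        b -= lr * 1.0 * b
        if abs(a) < tol and abs(b) < tol:
            return step
    return max_steps


def preconditioned_iterations(kappa, lr=0.5, rho=1.0, tol=1e-3, max_steps=10**6):
    """Same problem, but each gradient entry is divided by its curvature and clipped."""
    curvatures = (kappa, 1.0)
    theta = [1.0, 1.0]
    for step in range(1, max_steps + 1):
        for i, h in enumerate(curvatures):
            ratio = (h * theta[i]) / h                    # gradient / exact Hessian diagonal
            theta[i] -= lr * max(-rho, min(rho, ratio))   # element-wise clip
        if all(abs(t) < tol for t in theta):
            return step
    return max_steps


for kappa in (10.0, 100.0, 1000.0):
    print(kappa, gd_iterations(kappa), preconditioned_iterations(kappa))
```

In this idealized setting the gradient descent step count grows roughly in proportion to the condition number, while the preconditioned update needs the same number of steps for every value of `kappa`.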

3.3.5. Practical Considerations

Several practical considerations are important when implementing Sophia. For hyperparameter selection, β_{1} ≈ 0.96 is recommended for the gradient EMA decay rate and β_{2} ≈ 0.99 for the Hessian EMA decay rate. The parameter γ should be tuned so that 10–50% of coordinates are not clipped, while k = 10 works well for the Hessian update frequency. The weight decay coefficient λ typically ranges from 0.1 to 0.2, and ε should be set to a small value such as 10⁻¹². The learning rate can be set to 3–5× that of the Lion optimizer or 0.8× that of AdamW.
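One way to monitor the tuning target for γ is to measure the share of coordinates whose update is not clipped. The helper below is a minimal sketch under the update rule of Algorithm 3; the name `non_clipped_fraction` and the example values of `m` and `h` are hypothetical.

```python
def non_clipped_fraction(m, h, gamma, eps=1e-12):
    """Fraction of coordinates with |m_i| / max(gamma * h_i, eps) < 1, i.e. not clipped."""
    not_clipped = sum(
        1 for m_i, h_i in zip(m, h)
        if abs(m_i) / max(gamma * h_i, eps) < 1.0
    )
    return not_clipped / len(m)


m = [0.5, 0.02, -0.3, 0.001, 0.8]   # gradient EMA (illustrative values)
h = [100.0, 1.0, 10.0, 50.0, -2.0]  # Hessian EMA (illustrative values)
for gamma in (0.01, 0.05):
    print(gamma, non_clipped_fraction(m, h, gamma))
```

Under this formulation, increasing γ enlarges the denominator and therefore raises the non-clipped share, so γ can be adjusted until the measured fraction falls within the recommended range.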

The Hessian estimation overhead can be minimized by computing the diagonal Hessian only every k steps, using a reduced batch size for Hessian estimation, and leveraging efficient Hessian-vector product implementations provided by modern frameworks.

Regarding memory requirements, Sophia requires storing two additional vectors (m_{t} and h_{t}) of the same dimension as the parameters, making its memory footprint comparable to that of Adam.

Empirical studies have demonstrated Sophia’s potential for improved optimization efficiency compared to conventional methods like AdamW [14].

The implementation of Sophia in modern deep learning frameworks is straightforward, requiring only standard auto-differentiation capabilities for gradient computation and Hessian-vector products. This accessibility, combined with its theoretical and empirical advantages, makes Sophia a promising optimization technique for complex deep learning architectures.

3.4. Experimental Datasets

Two complementary datasets with distinct characteristics were employed to evaluate model performance across diverse detection scenarios:

3.4.1. VOC2007-1 Dataset

The VOC2007-1 dataset [59] facilitates PPE detection evaluation in construction environments, comprising 900 images (625 training, 183 validation, 92 testing) with 7223 annotated instances (4965 training, 1374 validation, and 884 testing) across Hardhat, Vest, and Worker classes. As detailed in Table 3 and Table 4, the dataset ensures unbiased evaluation through balanced class distributions and diverse mixed indoor/outdoor scenarios with varying lighting conditions.

Table 3. Comparative analysis of VOC2007-1 and ML-31005 datasets: environmental conditions, PPE categories, and construction adaptability.

Table 4. Distribution of images and instances across training, validation, and test sets for VOC2007-1 dataset.

Figure 6 illustrates the per-class object distribution across training, validation, and testing sets. For the Hardhat class, the distribution is 36.3% in training, 37.4% in validation, and 35.0% in testing sets. The Vest class comprises 18.3% in training, 15.4% in validation, and 19.2% in testing sets. The Worker class maintains the highest proportion with 45.4% in training, 47.2% in validation, and 45.8% in testing sets. This visualization demonstrates the consistent class balance maintained across all dataset partitions, ensuring unbiased model evaluation, while Figure 7 presents sample images from each class, demonstrating the visual complexity and variation within the dataset.

Figure 6. Class distribution across training, validation, and testing sets for VOC2007-1 dataset.

Figure 7. Sample images from VOC2007-1 dataset showing different PPE classes.

3.4.2. ML-31005 Dataset

The ML-31005 dataset [60] comprises 527 images with 3591 annotated instances across six PPE categories: Boots, Glass, Glove, Helmet, Person, and Vest. As detailed in Table 3 and Table 5, the dataset maintains balanced partitioning (training: 369 images/2472 instances; validation: 80/549; testing: 78/570) and features diverse illumination conditions (outdoor daylight and indoor artificial lighting), providing comprehensive coverage for construction safety applications.

Table 5. Distribution of images and instances across training, validation, and test sets for ML-31005 dataset.

Figure 8 presents a multi-ring donut chart depicting the per-class object distribution across dataset partitions. The Boots class represents 24.3% in training, 25.1% in validation, and 22.5% in testing sets. The Glass class accounts for 9.9% in training, 8.9% in validation, and 9.3% in testing sets. The Glove class maintains approximately 19.8–20.4% across all partitions. The Helmet class comprises 13.4% in training, 12.8% in validation, and 12.3% in testing sets. The Person class represents 16.9% in training, 17.1% in validation, and 18.2% in testing sets. The Vest class accounts for 15.7% in training, 16.2% in validation, and 17.4% in testing sets. This balanced distribution across training, validation, and testing partitions ensures robust and unbiased model evaluation. Figure 9 displays representative images from each class, highlighting the diverse appearance characteristics that challenge detection algorithms.

Figure 8. Class distribution across training, validation, and testing sets for ML-31005 dataset.

Figure 9. Sample images from ML-31005 dataset showing different PPE classes.

Both datasets were selected for their complementary characteristics, enabling comprehensive evaluation of model performance across different PPE types and environmental conditions. The VOC2007-1 dataset provides a focused assessment of core PPE detection capabilities in typical construction scenarios, while the ML-31005 dataset offers a broader perspective with additional PPE classes and more diverse environmental contexts.

The strategic selection of these complementary datasets facilitates robust cross-dataset validation, mitigating potential biases that may arise from dataset-specific characteristics. Furthermore, the varied class distributions and environmental conditions represented in these datasets enable thorough assessment of model generalization capabilities, particularly for challenging detection scenarios involving transparent materials (Glass) and complex articulated objects (Gloves) present in the ML-31005 dataset. This dual-dataset evaluation methodology aligns with recent practices in computer vision research, where multi-dataset validation has become increasingly essential for establishing the reliability and generalizability of detection architectures.

3.5. Implementation Details

This section provides comprehensive details regarding the technical implementation and evaluation framework employed in this study. To ensure reproducibility and maintain experimental rigor, all aspects of the implementation are documented across three key dimensions. First, the computing infrastructure section outlines the hardware and software environment used for all experiments. Second, the training protocol section describes the hyperparameter configuration and optimization strategies applied consistently across model variants. Finally, the evaluation metrics section presents the mathematical formulations used to quantitatively assess detection performance. This documentation of implementation details enables accurate replication of results and facilitates fair comparison between the proposed SC-YOLO and alternative architectural approaches.

3.5.1. Computing Infrastructure

All experiments were conducted using a consistent computing environment to ensure fair comparison between model variants. The hardware setup included an Intel Xeon W-2265 CPU, 64 GB RAM, and an NVIDIA GeForce RTX 3090 GPU. The software environment consisted of Windows 11 operating system, Python 3.8.15 as the development language, PyTorch 1.13.1 as the deep learning framework, and CUDA 11.6 for GPU acceleration. PyCharm 2024.2 served as the integrated development environment.

3.5.2. Training Protocol

All models were trained using the identical hyperparameter configuration outlined in the training parameters table. Each model underwent 200 training epochs with an image size of 640 × 640 pixels and a batch size of 16. For the baseline models (Efficient-YOLO, DINOv2-YOLO, and CSP-YOLO), SGD optimization was employed with a momentum of 0.937 and weight decay of 0.0005. The learning rate followed a cosine annealing schedule, starting at 0.01 and decaying to a final value set by a learning-rate factor of 0.01 at the final epoch, with a 3-epoch warmup period using a momentum of 0.8 and a warmup bias learning rate of 0.1. For SC-YOLO, Sophia-specific parameters were configured as follows:

β_{1} of 0.96, β_{2} of 0.99, clipping threshold (γ) of 0.01, Hessian update frequency (k) of 10, and weight decay coefficient (λ) of 0.15. These parameters were selected based on preliminary experiments and recommendations from Liu et al. [14].
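For reference, the settings reported in this subsection can be collected in a small configuration sketch. The class and field names are hypothetical and do not correspond to any particular training framework; placing the warmup settings with the SGD baseline follows the order of the description above and is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Settings shared by all model variants."""
    epochs: int = 200
    image_size: int = 640
    batch_size: int = 16


@dataclass(frozen=True)
class BaselineSGDConfig:
    """SGD settings for Efficient-YOLO, DINOv2-YOLO, and CSP-YOLO."""
    initial_lr: float = 0.01
    momentum: float = 0.937
    weight_decay: float = 0.0005
    warmup_epochs: int = 3
    warmup_momentum: float = 0.8
    warmup_bias_lr: float = 0.1


@dataclass(frozen=True)
class SophiaConfig:
    """Sophia settings for SC-YOLO."""
    beta1: float = 0.96               # gradient EMA decay
    beta2: float = 0.99               # Hessian EMA decay
    gamma: float = 0.01               # clipping threshold
    hessian_update_interval: int = 10 # k
    weight_decay: float = 0.15        # lambda


print(TrainConfig())
print(BaselineSGDConfig())
print(SophiaConfig())
```

Freezing the dataclasses keeps the configuration immutable, which helps ensure that every variant is trained with the same shared values.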

3.5.3. Evaluation Metrics

The comparative assessment employs a suite of standard object detection metrics, each providing distinct insights into detection capabilities. The primary performance indicators include precision, recall, and mean Average Precision (mAP), which are defined as follows:

Precision (P) quantifies the proportion of correct positive predictions among all positive predictions made by the model, formulated as shown in Equation (7):

P = \frac{TP}{TP + FP}

(7)

where TP represents True Positives (correctly detected objects) and FP represents False Positives (incorrect detections).

Recall (R) measures the proportion of ground truth objects successfully detected by the model, expressed in Equation (8):

R = \frac{TP}{TP + FN}

(8)

where FN represents False Negatives (ground truth objects that the model failed to detect).
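Equations (7) and (8) translate directly into code. The sketch below uses hypothetical detection counts, and returning 0.0 for an empty denominator is a convention assumed here rather than one stated in this study.

```python
def precision(tp, fp):
    """Equation (7): TP / (TP + FP); returns 0.0 when there are no predictions."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    """Equation (8): TP / (TP + FN); returns 0.0 when there are no ground-truth objects."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Hypothetical counts: 90 correct detections, 10 spurious detections, 30 missed objects.
print(precision(tp=90, fp=10))  # 0.9
print(recall(tp=90, fn=30))     # 0.75
```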

Mean Average Precision (mAP) provides a comprehensive measure of detection performance by incorporating both precision and recall across varying confidence thresholds. For each object class, the Average Precision (AP) is calculated by computing the area under the precision-recall curve. To calculate mAP@0.5, the process begins with determining whether a detection is a True Positive or False Positive based on the Intersection over Union (IoU) threshold of 0.5. For each confidence score threshold, precision and recall values are calculated, forming a precision-recall curve. The AP@0.5 for a single class is then calculated using Equation (9):

AP@0.5 = \sum_{i = 1}^{n} (r_{i + 1} - r_{i}) \times p_{interp}(r_{i + 1})

(9)

where r_{i} represents recall at the i-th threshold, and p_{interp}(r_{i + 1}) is the interpolated precision at recall level r_{i + 1}, defined as the maximum precision for any recall level r' \geq r_{i + 1}. The mAP@0.5 is then calculated as the mean of AP@0.5 values across all classes.
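A minimal sketch of the AP computation in Equation (9) is given below. The helper names `pr_curve` and `average_precision`, the assumption that the curve starts at recall 0, and the example detections are illustrative; evaluation toolkits may sample the interpolated curve differently.

```python
def pr_curve(is_true_positive, num_ground_truth):
    """Cumulative precision and recall over detections sorted by descending confidence."""
    recalls, precisions = [], []
    tp = 0
    for rank, hit in enumerate(is_true_positive, start=1):
        tp += 1 if hit else 0
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / rank)
    return recalls, precisions


def average_precision(recalls, precisions):
    """Area under the interpolated precision-recall curve, following Equation (9)."""
    # Interpolated precision: the maximum precision at any recall >= the current one.
    interp = list(precisions)
    for i in range(len(interp) - 2, -1, -1):
        interp[i] = max(interp[i], interp[i + 1])

    ap = 0.0
    previous_recall = 0.0  # the curve is assumed to start at recall 0
    for r, p in zip(recalls, interp):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap


# Five hypothetical detections for one class with four ground-truth objects.
flags = [True, True, False, True, False]
recalls, precisions = pr_curve(flags, num_ground_truth=4)
print(average_precision(recalls, precisions))  # 0.6875
```

The mAP@0.5 then follows by averaging this per-class value over all classes.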

For mAP@0.5:0.95, this calculation is extended by averaging AP values across multiple IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05 (resulting in 10 IoU thresholds: 0.5, 0.55, 0.6,..., 0.95). This approach provides a more comprehensive evaluation of localization accuracy, as formulated in Equation (10):

mAP@0.5:0.95 = \frac{1}{|T|} \sum_{t \in T} mAP@t

(10)

where T represents the set of IoU thresholds, and mAP@t is the mean Average Precision calculated at IoU threshold t.
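The averaging in Equation (10) can be sketched as follows. The function names and the per-threshold mAP values are hypothetical placeholders, not results from this study.

```python
def iou_thresholds(start=0.5, stop=0.95, step=0.05):
    """Return the IoU thresholds 0.50, 0.55, ..., 0.95."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


def map_50_95(map_at_threshold):
    """Equation (10): mean of mAP@t over the thresholds in `map_at_threshold`."""
    return sum(map_at_threshold.values()) / len(map_at_threshold)


thresholds = iou_thresholds()
# Placeholder values that decrease as the IoU requirement becomes stricter.
placeholder = {t: 0.95 - (t - 0.5) for t in thresholds}
print(len(thresholds), thresholds)
print(round(map_50_95(placeholder), 3))
```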

The Intersection over Union (IoU) metric quantifies the overlap between predicted bounding boxes and ground truth annotations and is defined in Equation (11):

IoU = \frac{|B_{gt} \cap B_{pred}|}{|B_{gt} \cup B_{pred}|}

(11)

where B_{gt} and B_{pred} represent ground-truth and predicted bounding boxes, respectively, and |⋅| denotes the area operator. This metric ensures detections not only identify the correct object class but also accurately localize the object within the image. Higher IoU thresholds demand greater localization precision, making mAP@0.5:0.95 a stringent evaluation metric that rewards models with precise boundary delineation capabilities.
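Equation (11) can be sketched for axis-aligned boxes as shown below. The corner-based box format (x_min, y_min, x_max, y_max), the function name `box_iou`, and the example coordinates are assumptions for illustration.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
close_prediction = (2, 0, 12, 10)   # overlap 80, union 120
loose_prediction = (5, 0, 15, 10)   # overlap 50, union 150

print(round(box_iou(ground_truth, close_prediction), 3))  # 0.667
print(round(box_iou(ground_truth, loose_prediction), 3))  # 0.333
```

In this example the closer prediction would count as a True Positive at an IoU threshold of 0.5 but not at 0.75, while the looser one fails both thresholds.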

Additionally, model efficiency is assessed through the training time per epoch (seconds), which provides insights into the computational requirements of each model variant. This metric is relevant for evaluating the practical deployment potential of the different architectures, enabling analysis of the performance-efficiency trade-off that is important in resource-constrained applications.

3.6. Experimental Design

The experimental design established a comprehensive methodology for evaluating object detection performance across multiple dimensions, enabling objective comparison between the four YOLO model variants. This evaluation framework systematically assessed detection accuracy, class-specific performance patterns, convergence behavior, and computational efficiency through controlled experiments on two complementary datasets with distinct characteristics.

Additional methodological components included a detailed training dynamics analysis protocol, with performance measurements collected at regular epoch intervals (25, 50, 75, 100, 125, 150, 175, 200) throughout the training process. This temporal analysis framework allowed for quantitative assessment of overall learning dynamics, convergence rates, and loss behaviors for each model architecture.

The experimental design incorporated a balanced assessment of accuracy and efficiency through parallel tracking of computational resource requirements. This included structured measurement protocols for training time across model variants, enabling analysis of the practical trade-offs between detection performance and computational demands that are essential for real-world deployment scenarios.

The experimental results were organized into comprehensive visualization and tabulation formats that facilitate direct comparison between models. This includes standardized performance tables and consistent visualization approaches that highlight relative performance across multiple dimensions.

To ensure evaluation reliability, the model’s performance was validated through comprehensive cross-dataset validation using two distinct datasets (VOC2007-1 and ML-31005) with different characteristics. This approach confirms the consistent performance improvements across varied environmental conditions and object classes. The model was also evaluated across various challenging scenarios, including partially occluded objects, transparent materials, multiple PPE types, and poor lighting conditions, to ensure robust real-world performance.

3.7. Ablation Study Design

The ablation study methodology was structured to systematically isolate and quantify the specific contributions of Sophia within the SC-YOLO architecture. This component-level analysis approach identified causal relationships between the optimization technique and observed performance enhancements by implementing controlled variations focused exclusively on the optimization component.

The loss component decomposition framework was designed to analyze how Sophia affects each loss type’s behavior during training compared to conventional optimizers. This specialized investigation examined the differential impact on box loss trajectories for localization accuracy, classification loss evolution for category recognition, and DFL patterns for bounding box quality. Unlike general performance tracking, this analysis specifically focused on identifying optimization-induced differences in training-validation loss relationships and their correlation with generalization capabilities. This loss-specific analysis reveals distinctive patterns in how second-order optimization influences training dynamics, particularly in the plateau regions where first-order methods often struggle to make progress.

The study examined how Sophia provides advantages for challenging detection scenarios, such as transparent objects and partially occluded instances, quantifying improvements achieved through optimization changes. Additionally, convergence analysis focused on optimization-specific metrics including gradient magnitude stability and parameter update directions, isolating differences in gradient dynamics and feature acquisition patterns attributable to Sophia.

This focused ablation framework systematically isolated the optimization component’s contribution to performance improvements, complementing the broader experimental evaluation while providing deeper mechanistic insights into how Sophia specifically enhances object detection capabilities across the comprehensive performance dimensions documented in Section 4.

4. Results and Discussion

Through systematic evaluation on two distinct datasets, the advantages of integrating the Sophia optimizer with the CSPDarknet backbone in SC-YOLO are demonstrated and quantified.

4.1. Model Architecture Comparison

This study evaluates four YOLO variants designed to enhance the YOLOv11 architecture. Efficient-YOLO incorporates an EfficientNet backbone with a compound scaling methodology that systematically scales network width, depth, and resolution dimensions [32,33,61]. DINOv2-YOLO integrates a self-supervised vision Transformer backbone, utilizing attention mechanisms for modeling long-range dependencies. CSP-YOLO utilizes CSPDarknet with cross-stage partial connections, strategically dividing feature maps to enhance gradient flow while reducing computational redundancy.

Based on preliminary analyses, CSP-YOLO demonstrated a favorable performance–efficiency balance, warranting its selection for Sophia optimizer integration. This choice stems from CSPDarknet’s efficient gradient flow properties, which is consistent with findings in gangue recognition [52], skeletal image detection [53], and robotic vision systems [51]. The proposed SC-YOLO extends these architectural strengths by incorporating Sophia’s adaptive preconditioned updates.

Experimental assessments employed two complementary datasets: VOC2007-1 (PPE detection in construction environments) and ML-31005 (diverse object categories across varied conditions), enabling thorough evaluation across different application domains.

4.2. Overall Detection Performance

SC-YOLO shows improved performance across all evaluation metrics on both datasets, as illustrated in Figure 10 and Figure 11. On ML-31005, it achieves 96.3% mAP@0.5 and 68.6% mAP@0.5:0.95, compared with CSP-YOLO’s 93.9% and 68.2% (a 2.56% relative improvement in mAP@0.5). On VOC2007-1, it achieves 97.6% mAP@0.5 and 63.6% mAP@0.5:0.95 (improvements of 2.63% and 0.47%, respectively). SC-YOLO also achieves high precision (93.5% on ML-31005, 96.1% on VOC2007-1) and notable recall improvements: 3.93% on ML-31005 (95.3% vs. 91.7%) and 4.93% on VOC2007-1 (93.6% vs. 89.2%).
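The percentage gains quoted above are consistent with relative (ratio-based) improvements over CSP-YOLO rather than absolute percentage-point differences; this reading is an inference from the reported values. The short sketch below reproduces the figures for which both the SC-YOLO and CSP-YOLO values are stated, using a hypothetical helper named `relative_improvement`.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Values quoted in the text (SC-YOLO vs. CSP-YOLO).
print(round(relative_improvement(96.3, 93.9), 2))  # mAP@0.5 on ML-31005 -> 2.56
print(round(relative_improvement(95.3, 91.7), 2))  # recall on ML-31005  -> 3.93
print(round(relative_improvement(93.6, 89.2), 2))  # recall on VOC2007-1 -> 4.93
```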

Figure 10. Performance comparison of detection models on ML-31005 dataset across mAP@0.5, mAP@0.5:0.95, Precision, and Recall metrics.

Figure 11. Performance comparison of detection models on VOC2007-1 dataset across mAP@0.5, mAP@0.5:0.95, Precision, and Recall metrics.

These improvements stem from Sophia’s adaptive capabilities in managing heterogeneous curvatures across parameter dimensions [14], enabling more effective navigation of complex loss landscapes. CSP-YOLO consistently ranks second, aligning with CSPDarknet’s strong generalization [55]. Efficient-YOLO shows moderate performance consistent with lightweight deployment scenarios [44,45,62], while DINOv2-YOLO records the lowest metrics despite the Transformer’s success in medical segmentation [19] and traffic detection [50], suggesting that Transformer backbones require further adaptation for YOLO-based spatial localization.

4.3. Class-Wise Performance Analysis

A detailed examination of class-wise performance metrics, presented in Table 6 and Table 7, reveals that SC-YOLO consistently outperforms the other variants across both datasets. On VOC2007-1, SC-YOLO achieves mAP@0.5 scores of 96.8%, 97.3%, and 98.6% for the Hardhat, Vest, and Worker classes, respectively, compared to CSP-YOLO’s 93.9%, 95.5%, and 96.0%. As visualized in Figure 12, this performance advantage extends across all classes. The ML-31005 analysis (Figure 13) shows a notable improvement in Glass class detection: 93.0% versus CSP-YOLO’s 85.3%, while DINOv2-YOLO achieved only 32.5% despite Transformer success in other vision tasks [48,49]. This enhanced performance aligns with the capabilities of the SPPF and C2PSA attention-based modules [8,43].

Table 6. Evaluation metrics for VOC2007-1 dataset—Validation results (highest values for each metric are highlighted in bold).

Table 7. Evaluation metrics for ML-31005 dataset—Validation results (highest values for each metric are highlighted in bold).

Figure 12. Class-wise mAP@0.5 comparison for VOC2007-1 dataset.

Figure 13. Class-wise mAP@0.5 comparison for ML-31005 dataset.

SC-YOLO’s recall improvements are particularly important for safety applications where missed detections pose serious consequences: Worker class (95.9% vs. 91.0%) on VOC2007-1, Glass class (98.8% vs. 91.6%) and Vest class (99.5% vs. 92.1%) on ML-31005. Figure 14 illustrates mAP@0.5 improvements ranging from 0.53% to 9.03% across classes compared to CSP-YOLO, with notable gains in challenging scenarios mirroring YOLOv11n’s performance where C2PSA and refined FPN structures effectively handle translucent boundaries [9]. Efficient-YOLO maintains moderate performance similar to findings in defect detection domains [45,63]. These consistent advantages result from Sophia’s optimization capabilities, enabling effective feature learning across different object appearances and environmental conditions.

Figure 14. SC-YOLO improvement over CSP-YOLO by object class (mAP@0.5 metric). Color coding: orange (VOC2007-1 dataset classes), blue (ML-31005 dataset classes).

4.4. Training Dynamics and Efficiency

SC-YOLO achieves comparable performance to CSP-YOLO as shown in Figure 15 and Figure 16, with both methods reaching higher final mAP@0.5:0.95 values than Efficient-YOLO and DINOv2-YOLO on both ML-31005 and VOC2007-1 datasets. This consistent performance across both datasets validates Sophia’s effectiveness as an alternative optimizer [14], providing comparable results to established methods while introducing curvature-aware optimization capabilities.

Figure 15. Learning curves (mAP@0.5:0.95) for ML-31005 and VOC2007-1 datasets.

Figure 16. Convergence speed analysis at different epochs.

Computational efficiency analysis (Figure 17) reveals that SC-YOLO achieves improved performance with a modest training time increase: 4.5 s per epoch on ML-31005 and 5.6 s on VOC2007-1, compared to CSP-YOLO’s 4.2 and 5.3 s, respectively (increases of roughly 7% and 6%). This modest computational cost, consistent with Liu et al.’s [14] findings, is justified by the performance improvements. In contrast, DINOv2-YOLO requires the longest training time (5.1 and 6.4 s), while Efficient-YOLO is the third fastest (4.7 and 5.9 s).

Figure 17. Training speed comparison for ML-31005 and VOC2007-1 datasets.

Loss analysis (Table 8, Figure 18, Figure 19, Figure 20 and Figure 21) shows SC-YOLO achieves comparable performance to CSP-YOLO, with slight improvements in box loss (0.810 vs. 0.829) and DFL loss (1.007 vs. 1.021) on VOC2007-1. Both methods significantly outperform DINOv2-YOLO and Efficient-YOLO across all training loss components.

Table 8. Comparative loss measures for VOC2007-1 and ML-31005 datasets.

Figure 18. Combined loss curves for ML-31005 dataset.

Figure 19. Combined loss curves for VOC2007-1 dataset.

Figure 20. Training-validation loss gap for ML-31005 dataset.

Figure 21. Training-validation loss gap for VOC2007-1 dataset.

4.5. Generalization and Deployment Considerations

SC-YOLO demonstrates consistent performance across different datasets, maintaining improved results across different environmental conditions in both ML-31005 and VOC2007-1 evaluations. This cross-dataset performance aligns with YOLOv11 studies that highlight generalization across heterogeneous object distributions [8]. The consistent advantages are attributed to Sophia’s optimization capabilities, which produce feature representations that generalize effectively across different object appearances and environmental conditions, with enhanced performance on partially occluded PPE items.

From a deployment perspective, SC-YOLO achieves a favorable performance–efficiency balance with important improvements in safety-critical metrics, particularly in recall performance. As shown in Figure 22, SC-YOLO handles various challenging scenarios, including occlusion handling, transparent material detection, and dense scene analysis. This improved performance demonstrates SC-YOLO’s potential for PPE monitoring applications while maintaining computational requirements suitable for resource-constrained environments.

Figure 22. SC-YOLO detection examples on construction site scenarios, demonstrating: (a) occlusion handling with multiple helmet, person, and vest detections (confidence 0.60–0.91) in a crowded environment; (b) transparent safety glasses detection (confidence 0.26–0.96) alongside persons and vests; (c) comprehensive worker safety compliance monitoring with simultaneous detection of multiple PPE items (confidence 0.63–0.93); and (d) dense scene analysis with detection of boots, glasses, persons, and vests (confidence 0.28–0.90) under varying lighting conditions.

The recall improvements are particularly important for safety applications where missed detections of non-compliant workers could compromise site safety protocols. These findings provide a technical foundation for future field validation studies, though a comprehensive assessment of varying construction conditions requires dedicated industry collaboration and on-site evaluation protocols.

SC-YOLO demonstrates consistent performance across the evaluated datasets, though optimal deployment effectiveness varies with environmental characteristics. The confidence scores reported in Figure 22 (0.26–0.96) are per-detection confidence values, not system reliability percentages. The reported mAP@0.5 scores indicate high detection accuracy, though these metrics represent performance on controlled evaluation datasets, which may differ from real-world operational conditions. In practical deployment scenarios, confidence thresholds are calibrated according to risk tolerance and application requirements.

Controlled environments such as construction plants, rebar workshops, and prefabricated element production facilities offer operational advantages for automated PPE monitoring systems. These environments typically feature consistent worker positioning, standardized PPE requirements, controlled lighting conditions, and structured workflows with repetitive tasks.

Complex construction sites present different deployment considerations, including variable worker positioning, diverse PPE requirements, temporary obstructions, and changing environmental conditions. For such environments, strategic deployment approaches may focus on high-traffic areas, critical safety zones, and temporary controlled spaces. The technical capabilities demonstrated in this study suggest potential applicability across various workplace settings, though deployment strategies should be adapted to specific environmental constraints and operational requirements.

5. Conclusions

SC-YOLO demonstrates performance improvements across both datasets: 96.3% mAP@0.5 on ML-31005 (2.56% improvement) and 97.6% mAP@0.5 on VOC2007-1 (2.63% improvement). The most notable gains occur in recall metrics (3.93% and 4.93% improvements, respectively).

This work makes three primary contributions to automated PPE detection for construction safety: (1) comprehensive benchmarking of three representative backbone architectures (EfficientNet, DINOv2, and CSPDarknet) within the YOLOv11n framework, providing detailed performance-efficiency analysis for construction applications; (2) introduction of SC-YOLO, integrating CSPDarknet backbone with Sophia second-order optimizer for enhanced convergence and detection performance; and (3) cross-dataset validation using complementary datasets (VOC2007-1 and ML-31005) to evaluate model effectiveness across varied construction environments.

The performance characteristics suggest technical suitability for structured environments where workers maintain consistent positions and utilize standardized PPE configurations, such as construction plants and manufacturing facilities. Variable construction sites present additional operational considerations, including highly variable worker positioning, diverse PPE requirements, temporary obstructions, and varying environmental conditions. The extent to which the technical capabilities translate to effective performance in such complex environments requires field validation studies.

Future research should prioritize environment-specific deployment optimization through systematic comparative studies across diverse workplace types. Important research directions include evaluation across controlled and dynamic construction environments, development of adaptive deployment strategies, and investigation of hybrid monitoring frameworks. Multi-environment validation studies conducted in partnership with industry organizations will be crucial for establishing practical deployment guidelines.

Funding

This research was funded by the National Science, Research and Innovation Fund (NSRF) and King Mongkut’s University of Technology North Bangkok (Project no. KMUTNB-FF-68-B-64).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Li, J.; Zhao, X.; Zhou, G.; Zhang, M. Standardized use inspection of workers’ personal protective equipment based on deep learning. Saf. Sci. 2022, 150, 105689.
2. Kumar, S.; Gupta, H.; Yadav, D.; Ansari, I.A.; Verma, O.P. YOLOv4 algorithm for the real-time detection of fire and personal protective equipments at construction sites. Multimed. Tools Appl. 2022, 81, 22163–22183.
3. Al-Azani, S.; Luqman, H.; Alfarraj, M.; Sidig, A.A.I.; Khan, A.H.; Al-Hammed, D. Real-Time Monitoring of Personal Protective Equipment Compliance in Surveillance Cameras. IEEE Access 2024, 12, 121882–121895.
4. Riaz, M.; He, J.; Xie, K.; Alsagri, H.S.; Moqurrab, S.A.; Alhakbani, H.A.A.; Obidallah, W.J. Enhancing Workplace Safety: PPE_Swin—A Robust Swin Transformer Approach for Automated Personal Protective Equipment Detection. Electronics 2023, 12, 4675.
5. Xie, B.; He, S.; Cao, X. Target detection for forward looking sonar image based on deep learning. In Proceedings of the 2022 41st Chinese Control Conference (CCC), Hefei, China, 25–27 July 2022; pp. 7191–7196.
6. Zhang, L.; Wang, J.; Wang, Y.; Sun, H.; Zhao, X. Automatic construction site hazard identification integrating construction scene graphs with BERT based domain knowledge. Autom. Constr. 2022, 142, 104535.
7. Song, Y.; Hong, S.; Hu, C.; He, P.; Tao, L.; Tie, Z.; Ding, C. MEB-YOLO: An Efficient Vehicle Detection Method in Complex Traffic Road Scenes. Comput. Mater. Contin. 2023, 75, 3.
8. He, L.; Zhou, Y.; Liu, L.; Ma, J. Research and Application of YOLOv11-Based Object Segmentation in Intelligent Recognition at Construction Sites. Buildings 2024, 14, 3777.
9. Zhao, J.; Miao, S.; Kang, R.; Cao, L.; Zhang, L.; Ren, Y. Insulator Defect Detection Algorithm Based on Improved YOLOv11n. Sensors 2025, 25, 1327.
10. Musarat, M.A.; Khan, A.M.; Alaloul, W.S.; Blas, N.; Ayub, S. Automated monitoring innovations for efficient and safe construction practices. Results Eng. 2024, 22, 102057.
11. Rasouli, S.; Alipouri, Y.; Chamanzad, S. Smart Personal Protective Equipment (PPE) for construction safety: A literature review. Saf. Sci. 2024, 170, 106368.
12. Abdel-Basset, M.; Mohamed, R.; Chang, V. A Multi-Criteria Decision-Making Framework to Evaluate the Impact of Industry 5.0 Technologies: Case Study, Lessons Learned, Challenges and Future Directions. Inf. Syst. Front. 2024, 27, 791–821.
13. Zeibak-Shini, R.; Malka, H.; Kima, O.; Shohet, I.M. Analytical Hierarchy Process for Construction Safety Management and Resource Allocation. Appl. Sci. 2024, 14, 9265.
14. Liu, H.; Li, Z.; Hall, D.; Liang, P.; Ma, T. Sophia: A scalable stochastic second-order optimizer for language model pre-training. arXiv 2023, arXiv:2305.14342.
15. Nath, N.D.; Behzadan, A.H.; Paal, S.G. Deep learning for site safety: Real-time detection of personal protective equipment. Autom. Constr. 2020, 112, 103085.
16. Wu, J.; Cai, N.; Chen, W.; Wang, H.; Wang, G. Automatic detection of hardhats worn by construction personnel: A deep learning approach and benchmark dataset. Autom. Constr. 2019, 106, 102894.
17. Chang, R.; Zhang, B.; Zhu, Q.; Zhao, S.; Yan, K.; Yang, Y. FFA-YOLOv7: Improved YOLOv7 Based on Feature Fusion and Attention Mechanism for Wearing Violation Detection in Substation Construction Safety. J. Electr. Comput. Eng. 2023, 2023, 9772652.
18. Zhang, L.; Sun, Z.; Tao, H.; Wang, M.; Yi, W. Research on Mine-Personnel Helmet Detection Based on Multi-Strategy-Improved YOLOv11. Sensors 2024, 25, 170.
19. Ban, Y.J.; Lee, S.; Park, J.; Kim, J.E.; Kang, H.S.; Han, S. Dinov2_Mask R-CNN: Self-supervised Instance Segmentation of Diabetic Foot Ulcers. In Diabetic Foot Ulcers Grand Challenge; Springer Nature Switzerland: Cham, Switzerland, 2024; pp. 17–28.
20. Paramonov, K.; Zhong, J.X.; Michieli, U.; Moon, J.; Ozay, M. Swiss dino: Efficient and versatile vision framework for on-device personal object search. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 2564–2571.
21. Ferdous, M.; Ahsan, S.M.M. PPE detector: A YOLO-based architecture to detect personal protective equipment (PPE) for construction sites. PeerJ Comput. Sci. 2022, 8, e999.
22. Zhao, L.; Tohti, T.; Hamdulla, A. BDC-YOLOv5: A helmet detection model employs improved YOLOv5. Signal Image Video Process. 2023, 17, 4435–4445.
23. Li, H.; Wu, D.; Zhang, W.; Xiao, C. YOLO-PL: Helmet wearing detection algorithm based on improved YOLOv4. Digit. Signal Process. 2024, 144, 104283.
24. Nguyen, N.T.; Tran, Q.; Dao, C.H.; Nguyen, D.A.; Tran, D.H. Automatic detection of personal protective equipment in construction sites using metaheuristic optimized YOLOv5. Arab. J. Sci. Eng. 2024, 49, 13519–13537.
25. Yang, X.; Wang, J.; Dong, M. SDCB-YOLO: A High-Precision Model for Detecting Safety Helmets and Reflective Clothing in Complex Environments. Appl. Sci. 2024, 14, 7267.
26. Song, X.; Zhang, T.; Yi, W. An improved YOLOv8 safety helmet wearing detection network. Sci. Rep. 2024, 14, 17550.
27. Alkhammash, E.H. Multi-Classification Using YOLOv11 and Hybrid YOLO11n-MobileNet Models: A Fire Classes Case Study. Fire 2025, 8, 17.
28. Kim, D.; Xiong, S. Enhancing Worker Safety: Real-Time Automated Detection of Personal Protective Equipment to Prevent Falls from Heights at Construction Sites Using Improved YOLOv8 and Edge Devices. J. Constr. Eng. Manag. 2025, 151, 04024187.
29. Di, B.; Xiang, L.; Daoqing, Y.; Kaimin, P. MARA-YOLO: An efficient method for multiclass personal protective equipment detection. IEEE Access 2024, 12, 24866–24878.
30. Zhang, H.; Mu, C.; Ma, X.; Guo, X.; Hu, C. MEAG-YOLO: A Novel Approach for the Accurate Detection of Personal Protective Equipment in Substations. Appl. Sci. 2024, 14, 4766.
31. Chen, H.; Li, Y.; Wen, H.; Hu, X. YOLOv5s-gnConv: Detecting personal protective equipment for workers at height. Front. Public Health 2023, 11, 1225478.
32. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
33. Sun, Z.; Cui, Y.; Han, Y.; Jiang, K. Substation high-voltage switchgear detection based on improved EfficientNet-YOLOv5s model. IEEE Access 2024, 12, 60015–60027.
34. Li, R.; Wu, J.; Cao, L. Ship target detection of unmanned surface vehicle base on efficientdet. Syst. Sci. Control Eng. 2022, 10, 264–271.
35. Huang, G.; Zhou, Y.; Hu, X.; Zhang, C.; Zhao, L.; Gan, W. DINO-Mix enhancing visual place recognition with foundational vision model and feature mixing. Sci. Rep. 2024, 14, 22100.
36. Zhang, K.; Yuan, B.; Cui, J.; Liu, Y.; Zhao, L.; Zhao, H.; Chen, S. Lightweight tea bud detection method based on improved YOLOv5. Sci. Rep. 2024, 14, 31168.
37. Onososen, A.O.; Musonda, I.; Onatayo, D.; Saka, A.B.; Adekunle, S.A.; Onatayo, E. Drowsiness Detection of Construction Workers: Accident Prevention Leveraging Yolov8 Deep Learning and Computer Vision Techniques. Buildings 2025, 15, 500.
38. Ji, X.; Gong, F.; Yuan, X.; Wang, N. A high-performance framework for personal protective equipment detection on the offshore drilling platform. Complex Intell. Syst. 2023, 9, 5637–5652.
39. Yipeng, L.; Junwu, W. Personal protective equipment detection for construction workers: A novel dataset and enhanced YOLOv5 approach. IEEE Access 2024, 12, 47338–47358.
40. Han, D.; Ying, C.; Tian, Z.; Dong, Y.; Chen, L.; Wu, X.; Jiang, Z. YOLOv8s-SNC: An Improved Safety-Helmet-Wearing Detection Algorithm Based on YOLOv8. Buildings 2024, 14, 3883.
41. Park, S.; Kim, J.; Wang, S.; Kim, J. Effectiveness of Image Augmentation Techniques on Non-Protective Personal Equipment Detection Using YOLOv8. Appl. Sci. 2025, 15, 2631.
42. Alkhammash, E.H. A Comparative Analysis of YOLOv9, YOLOv10, YOLOv11 for Smoke and Fire Detection. Fire 2025, 8, 26.
43. Liu, J.; Zhao, J.; Cao, Y.; Wang, Y.; Dong, C.; Guo, C. Road manhole cover defect detection via multi-scale edge enhancement and feature aggregation pyramid. Sci. Rep. 2025, 15, 10346.
44. Yang, L.; Chen, G.; Liu, J.; Guo, J. Wear State Detection of Conveyor Belt in Underground Mine Based on Retinex-YOLOv8-EfficientNet-NAM. IEEE Access 2024, 12, 25309–25324.
45. Li, K.; Zhu, J.; Li, N. Lightweight automatic identification and location detection model of farmland pests. Wirel. Commun. Mob. Comput. 2021, 2021, 9937038.
46. Fan, J.; Cui, L.; Fei, S. Waste detection system based on data augmentation and YOLO_EC. Sensors 2023, 23, 3646.
47. Rabbani, N.; Bartoli, A. Can surgical computer vision benefit from large-scale visual foundation models? Int. J. Comput. Assist. Radiol. Surg. 2024, 19, 1157–1163.
48. Chen, F.; Giuffrida, M.V.; Tsaftaris, S.A. Adapting vision foundation models for plant phenotyping. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 604–613.
49. Käppeler, M.; Petek, K.; Vödisch, N.; Burgard, W.; Valada, A. Few-shot panoptic segmentation with foundation models. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 7718–7724.
50. Luo, Z.; Feng, T.; Li, G. Robust vision-based traffic anomaly detection: DINOv2 driven gated recurrent unit network. Adv. Transdiscip. Eng. 2024, 27, 261–268.
51. Ge, Y.; Meng, L. A Powerful Object Detection Network for Industrial Anomaly Detection. In Proceedings of the 2024 6th International Conference on Industrial Artificial Intelligence (IAI), Shenyang, China, 21–24 August 2024; pp. 1–6.
52. Qin, Y.; Kou, Z.; Han, C.; Wang, Y. Intelligent Gangue Sorting System Based on Dual-Energy X-ray and Improved YOLOv5 Algorithm. Appl. Sci. 2023, 14, 98.
53. Jeon, Y.D.; Kang, M.J.; Kuh, S.U.; Cha, H.Y.; Kim, M.S.; You, J.Y.; Yoon, D.K. Deep learning model based on you only look once algorithm for detection and visualization of fracture areas in three-dimensional skeletal images. Diagnostics 2023, 14, 11.
54. Zhou, J.; Xu, D.; Min, X.; Wu, D. An Improved Underwater Target Detection Algorithm Based on YOLOX. In Proceedings of the OCEANS 2024-Singapore, Singapore, 15–18 April 2024; pp. 1–7.
55. Deng, X.; Qi, L.; Liu, Z.; Liang, S.; Gong, K.; Qiu, G. Weed target detection at seedling stage in paddy fields based on YOLOX. PLoS ONE 2023, 18, e0294709.
56. Yue, X.; Li, H.; Meng, L. An ultralightweight object detection network for empty-dish recycling robots. IEEE Trans. Instrum. Meas. 2023, 72, 1–12.
57. Ultralytics. YOLOv8 Documentation. Available online: https://docs.ultralytics.com/models/yolov8/ (accessed on 25 July 2025).
58. Ultralytics. YOLOv5: A State-of-the-Art Real-Time Object Detection System. Available online: https://github.com/ultralytics/yolov5 (accessed on 25 July 2025).
59. Xiong, R.; Tang, P. Pose guided anchoring for detecting proper use of personal protective equipment. Autom. Constr. 2021, 130, 103828. Available online: https://github.com/ruoxinx/PPE-Detection-Pose (accessed on 18 March 2025).
60. LukeHowardUTS. ML 31005 Dataset. Available online: https://universe.roboflow.com/lukehowarduts/ml-31005 (accessed on 18 March 2025).
61. Xu, X.; Wu, X. Target recognition algorithm for UAV aerial images based on improved YOLO-X. In Proceedings of the 2023 IEEE 5th International Conference on Civil Aviation Safety and Information Technology (ICCASIT), Dali, China, 11–13 October 2023; pp. 83–87.
62. Lin, M.; Ma, L.; Yu, B. An efficient and light-weight detector for wine bottle defects. In Proceedings of the 2020 16th International Conference on Control, Automation, Robotics and Vision (ICARCV), Shenzhen, China, 13–15 December 2020; pp. 957–962.
63. Zheng, L.; Long, L.; Zhu, C.; Jia, M.; Chen, P.; Tie, J. A lightweight cotton field weed detection model enhanced with EfficientNet and attention mechanisms. Agronomy 2024, 14, 2649.

Figure 1. The architecture of the proposed YOLOv11-based detection framework, illustrating the backbone, feature fusion neck, and multi-scale detection heads. Color coding: green (SPPF), yellow (convolution layers), blue (C3k2 modules), red (concatenation operations), pink (upsampling), purple (detection heads).

Figure 2. The architecture of the proposed Efficient-YOLO model, consisting of three main modules: an EfficientNet backbone for feature extraction, a multi-scale fusion neck, and YOLO detection heads for object localization and classification. Purple blocks: MBConv layers; Green blocks: Feature pyramid levels (P3–P7); Pink/Blue paths: BiFPN’s top-down/bottom-up flows; Yellow circles: Feature fusion; _td: Top-down features; _out: Final output features for detection heads (F3, F4, F5).

Figure 3. The architecture of the proposed DINOv2-YOLO model, integrating a self-supervised Vision Transformer backbone with the YOLO detection head.

Figure 4. The overall architecture of the CSP-YOLO model, illustrating the integration of the CSPDarknet backbone with SPPF and C2PSA modules, followed by a neck composed of C3k2-based multi-scale feature fusion and a three-level detection head.

Figure 5. SC-YOLO complete training architecture with integrated Sophia optimization, showing detailed workflow from model initialization through parameter updates. Numbers 0–23 indicate sequential processing layers: Backbone (Layers 0–10), Neck (Layers 11–22), and Head (Layer 23).

Figure 6. Class distribution across training, validation, and testing sets for VOC2007-1 dataset.

Figure 7. Sample images from VOC2007-1 dataset showing different PPE classes.

Figure 8. Class distribution across training, validation, and testing sets for ML-31005 dataset.

Figure 9. Sample images from ML-31005 dataset showing different PPE classes.

Figure 10. Performance comparison of detection models on ML-31005 dataset across mAP@0.5, mAP@0.5:0.95, Precision, and Recall metrics.

Figure 11. Performance comparison of detection models on VOC2007-1 dataset across mAP@0.5, mAP@0.5:0.95, Precision, and Recall metrics.

Figure 12. Class-wise mAP@0.5 comparison for VOC2007-1 dataset.

Figure 13. Class-wise mAP@0.5 comparison for ML-31005 dataset.

Figure 14. SC-YOLO improvement over CSP-YOLO by object class (mAP@0.5 metric). Color coding: orange (VOC2007-1 dataset classes), blue (ML-31005 dataset classes).

Figure 15. Learning curves (mAP@0.5:0.95) for ML-31005 and VOC2007-1 datasets.

Figure 16. Convergence speed analysis at different epochs.

Figure 17. Training speed comparison for ML-31005 and VOC2007-1 datasets.

Figure 18. Combined loss curves for ML-31005 dataset.

Figure 19. Combined loss curves for VOC2007-1 dataset.

Figure 20. Training-validation loss gap for ML-31005 dataset.

Figure 21. Training-validation loss gap for VOC2007-1 dataset.

Figure 22. SC-YOLO detection examples on construction site scenarios, demonstrating: (a) occlusion handling with multiple helmet, person, and vest detections (confidence 0.60–0.91) in a crowded environment; (b) transparent safety glasses detection (confidence 0.26–0.96) alongside persons and vests; (c) comprehensive worker safety compliance monitoring with simultaneous detection of multiple PPE items (confidence 0.63–0.93); and (d) dense scene analysis with detection of boots, glasses, persons, and vests (confidence 0.28–0.90) under varying lighting conditions.

Table 1. Technical comparison of YOLO-based PPE detection studies (2022–2025).

Study	Model	YOLO Base	Key Enhancements	Optimizer	PPE Classes	Dataset	mAP@0.5	Notable Contribution
Kumar et al. [2]	-	YOLOv4	CSPDarknet53 + SPP + PANet	Standard	Fire, Person_With_Helmet, Person, Safety Vest, Fire Extinguisher, Safety Glass	Custom	76.86%	Real-time multi-class PPE and fire detection
Chang et al. [17]	FFA-YOLOv7	YOLOv7	Feature Fusion + Attention	Standard	Ladder, Insulator, Helmet (with and without), Safety Belt (with and without)	Custom substation dataset	98.16%	Feature fusion pathway combining shallow position with deep semantic features
Zhao et al. [22]	BDC-YOLOv5	YOLOv5x	BiFPN, CBAM Attention, Extra Head	SGD	Helmet	SHWD	94.50%	BiFPN + CBAM + 160 × 160 head for dense scenes
Ji et al. [38]	RFA-YOLO	YOLOv4	Residual Feature Augmentation	SGD	Person, Helmet, Workwear	Offshore Platform	88.41%	Hybrid detection-classification with position features
Chen et al. [31]	YOLOv5s-gⁿConv	YOLOv5s	gnConv (Gated Convolution)	Standard	Helmet, Safety Harness	Custom	92.96%	Higher-order spatial interactions through gated convolution
Yipeng and Junwu [39]	AL-YOLOv5	YOLOv5	Coordinate Attention + SEIoU Loss	Standard	Hardhat, Person, Reflective Clothes, Other Clothes	Custom	93.8%	Solving overlapping detection frames
Di et al. [29]	MARA-YOLO	YOLOv8-s	MobileOne-S0, AS-Block, R-C2F, RASFF	Adam	Hardhat, Mask, No_Head_PPE, Gloves, No_Gloves, Safety Vest, No_Safety Vest, No_PPEs, Safety Cone	KSE-PPE	74.7%	MobileOne and receptive field fusion for multi-class detection
Yang et al. [25]	SDCB-YOLO	YOLOv8n	SE Attention, DIOU Loss, CARAFE, BiFPN	Standard	Safe, Unsafe, No_Helmet, No_Jacket	Custom	97.1%	Lightweight upsampling and attention for cluttered scenes
Song et al. [26]	-	YOLOv8	DWR Attention, ASPP, NWD Loss	Anchor-free + NWD	Helmet, No_Helmet	SHWD	92.0%	Small, distant helmet detection in complex scenes
Li et al. [23]	YOLO-PL	YOLOv4	DCSPX, E-PAN, L-VoVN, MP, Swish	CSPDarknet53	Helmet	SHWD, SHD, MHD	94.23%	Lightweight variant for small-helmet detection
Nguyen et al. [24]	-	YOLOv5s	Four-scale Detection	Seahorse Optimization	Gloves, Hardhat, Mask, Safety Vest, Shoes, No_Gloves, No_Hardhat, No_Mask, No_Safety Vest, No_Shoes	Custom	66.4%	Metaheuristic optimization for small/missing PPE
Zhang et al. [30]	MEAG-YOLO	YOLOv8n	MSCA, EC2f, ASFF, GhostConv, PAN	Standard	Helmet, Person, Badge, Gloves, Operating Bar, Wrong Gloves	Substation	96.5%	Multi-attention and fusion modules for efficient detection
Han et al. [40]	YOLOv8s-SNC	YOLOv8s	SPD-Conv Module + SEResNeXt Detection Head + C2f-CA Module + Small Object Detection Layer (4P)	SGD	Helmet, No_Helmet	SHWD + EWHD	92.6%	Enhanced small object detection via SPD-Conv (reduced information loss), SEResNeXt head (superior feature extraction), and dedicated small-target layer for complex construction sites.
He et al. [8]	YOLOv11-Seg	YOLOv11	C3K2 Module + C2PSA (Cross-Stage Partial Self-Attention) + DWConv (Depthwise Convolution) + CSPDarknet Backbone	AdamW	Bulldozer, Concrete Mixer, Crane, Excavator, Hanging Head, Loader, Other Vehicle, Pile Driving, Pump Truck, Roller, Static Crane, Truck, Worker	SODA + MOCS	80.8%	Real-time multi-object segmentation for construction sites, robust in dynamic scenarios at 1080P resolution.
Park et al. [41]	-	YOLOv8	ViT, Swin, and PVT	Standard	No_Helmet, No_Mask, No_Gloves, No_Vest, No_Shoes	Custom	73.12% (PVT)	Brightness and scale augmentation for PPE absence detection
Kim and Xiong [28]	-	YOLOv8s	CA module, GhostConv, Transfer Learning, Merge-NMS	SGD	Helmet, No_Helmet, Harness, No_Harness, Lanyard	Custom	92.52%	Edge-based detection of fall-prevention PPE
This study (2025)	SC-YOLO	YOLOv11n	CSPDarknet + Sophia Optimizer	Sophia (Second-order)	Boots, Glass, Gloves, Helmet, Person, Vest	VOC2007-1, ML-31005	96.3–97.6%	Second-order optimization for robust small-object detection

Notes: “-” indicates model name not explicitly mentioned in original study. Standard optimizer refers to default settings without specific optimization innovations.

Table 2. Comparative Analysis of Backbone Architectures.

Backbone	Key Features	Optimization Method	Strengths	Limitations	Primary Applications
EfficientNet	Compound scaling, MBConv blocks	Standard (SGD)	Balanced efficiency–accuracy trade-off, parameter efficiency	Moderate feature representation power	Resource-constrained detection
DINOv2	Self-supervised learning, Transformer architecture	Standard (SGD)	Long-range dependencies, strong semantic representation	High computational cost, localization challenges	Semantic segmentation, context-rich scenarios
CSPDarknet	Cross-stage partial connections, feature reuse	Standard (SGD)	Efficient gradient flow, feature integrity	Standard optimization limitations	Real-time detection, edge deployment
SC-YOLO (This study)	CSPDarknet backbone, curvature-aware updates	Sophia (second-order)	Enhanced small-object detection, faster convergence	Marginally increased computation	Construction PPE monitoring

Table 3. Comparative analysis of VOC2007-1 and ML-31005 datasets: environmental conditions, PPE categories, and construction adaptability.

Dataset	Images	Instances	Classes	Conditions	PPE Types	Adaptability For Construction Use
VOC2007-1	900	7223	3	Mixed environments (indoor/outdoor), diverse lighting	Hardhat, Vest, Worker	Moderate
ML-31005	527	3591	6	Outdoor daylight, indoor artificial lighting	Boots, Glass, Gloves, Helmet, Person, Vest	High

Table 4. Distribution of images and instances across training, validation, and test sets for VOC2007-1 dataset.

Class	Train Set		Validation Set		Test Set		Total
	Images	Instances	Images	Instances	Images	Instances	Images	Instances
All	625	4965	183	1374	92	884	900	7223
Hardhat	569	1803	167	514	82	309	818	2626
Vest	307	908	84	212	49	170	440	1290
Worker	625	2254	183	648	92	405	900	3307
Set Distribution		68.7%		19.0%		12.3%		100%

Table 5. Distribution of images and instances across training, validation, and test sets for ML-31005 dataset.

Class	Train Set		Validation Set		Test Set		Total
	Images	Instances	Images	Instances	Images	Instances	Images	Instances
All	369	2472	80	549	78	570	527	3591
Boots	290	601	68	138	65	128	423	867
Glass	232	244	48	49	50	53	330	346
Glove	262	489	58	109	60	116	380	714
Helmet	276	331	55	70	60	70	391	471
Person	347	419	74	94	74	104	495	617
Vest	326	388	73	89	74	99	473	576
Set Distribution		68.8%		15.3%		15.9%		100%
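The "Set Distribution" row of Table 5 appears to be computed from instance counts rather than image counts. The sketch below recomputes it from the "All" row under that assumption; the helper name `split_shares` is illustrative.

```python
def split_shares(counts):
    """Return each split's share of the total, in percent with one decimal."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


# Instance counts from the "All" row of Table 5 (ML-31005).
ml31005_instances = {"train": 2472, "validation": 549, "test": 570}

shares = split_shares(ml31005_instances)
print(shares)
```

The same helper can be applied to image counts to see how the two ways of measuring the split differ.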

Table 6. Evaluation metrics for VOC2007-1 dataset—Validation results (highest values for each metric are highlighted in bold).

Object	Efficient-YOLO
Class	Image	Instances	Precision	Recall	mAP@0.5	mAP@0.5:0.95
All	183	1374	0.887	0.808	0.887	0.534
Hardhat	167	514	0.915	0.794	0.879	0.495
Vest	84	212	0.822	0.811	0.871	0.539
Worker	183	648	0.924	0.819	0.912	0.567
Object	DINOv2-YOLO
Class	Image	Instances	Precision	Recall	mAP@0.5	mAP@0.5:0.95
All	183	1374	0.788	0.715	0.778	0.279
Hardhat	167	514	0.727	0.675	0.661	0.212
Vest	84	212	0.802	0.689	0.789	0.277
Worker	183	648	0.835	0.781	0.883	0.348
Object	CSP-YOLO
Class	Image	Instances	Precision	Recall	mAP@0.5	mAP@0.5:0.95
All	183	1374	0.953	0.892	0.951	0.633
Hardhat	167	514	0.974	0.876	0.939	0.567
Vest	84	212	0.945	0.890	0.955	0.640
Worker	183	648	0.940	0.910	0.960	0.692
Object	SC-YOLO
Class	Image	Instances	Precision	Recall	mAP@0.5	mAP@0.5:0.95
All	183	1374	0.961	0.936	0.976	0.636
Hardhat	167	514	0.976	0.924	0.968	0.563
Vest	84	212	0.955	0.927	0.973	0.647
Worker	183	648	0.953	0.959	0.986	0.694

Table 7. Evaluation metrics for ML-31005 dataset—Validation results (highest values for each metric are highlighted in bold).

Object	Efficient-YOLO
Class	Image	Instances	Precision	Recall	mAP@0.5	mAP@0.5:0.95
All	80	549	0.923	0.855	0.908	0.591
Boots	68	138	0.977	0.870	0.908	0.639
Glass	48	49	0.869	0.898	0.889	0.441
Glove	58	109	0.895	0.860	0.882	0.495
Helmet	55	70	0.963	0.750	0.887	0.591
Person	74	94	0.898	0.843	0.913	0.677
Vest	73	89	0.934	0.910	0.967	0.702
Object	DINOv2-YOLO
Class	Image	Instances	Precision	Recall	mAP@0.5	mAP@0.5:0.95
All	80	549	0.770	0.741	0.777	0.301
Boots	68	138	0.863	0.797	0.878	0.290
Glass	48	49	0.412	0.408	0.325	0.0639
Glove	58	109	0.776	0.761	0.780	0.210
Helmet	55	70	0.864	0.816	0.874	0.354
Person	74	94	0.783	0.787	0.862	0.374
Vest	73	89	0.918	0.876	0.946	0.517
Object	CSP-YOLO
Class	Image	Instances	Precision	Recall	mAP@0.5	mAP@0.5:0.95
All	80	549	0.935	0.917	0.939	0.682
Boots	68	138	0.960	0.906	0.942	0.734
Glass	48	49	0.833	0.916	0.853	0.485
Glove	58	109	0.934	0.890	0.944	0.619
Helmet	55	70	0.946	0.943	0.958	0.705
Person	74	94	0.956	0.925	0.946	0.756
Vest	73	89	0.982	0.921	0.990	0.790
Object	SC-YOLO
Class	Image	Instances	Precision	Recall	mAP@0.5	mAP@0.5:0.95
All	80	549	0.935	0.953	0.963	0.686
Boots	68	138	0.972	0.933	0.956	0.746
Glass	48	49	0.832	0.988	0.930	0.467
Glove	58	109	0.967	0.917	0.954	0.613
Helmet	55	70	0.981	0.971	0.974	0.712
Person	74	94	0.928	0.909	0.951	0.778
Vest	73	89	0.930	0.995	0.997	0.799
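The class-wise comparison summarized in Figure 14 can be recomputed from the CSP-YOLO and SC-YOLO blocks of Table 7. The sketch below ranks the ML-31005 classes by relative mAP@0.5 gain, assuming the figure reports relative rather than absolute differences; the variable names are illustrative.

```python
# mAP@0.5 per class on ML-31005 (Table 7, validation results).
csp_yolo = {"Boots": 0.942, "Glass": 0.853, "Glove": 0.944,
            "Helmet": 0.958, "Person": 0.946, "Vest": 0.990}
sc_yolo = {"Boots": 0.956, "Glass": 0.930, "Glove": 0.954,
           "Helmet": 0.974, "Person": 0.951, "Vest": 0.997}

# Relative gain of SC-YOLO over CSP-YOLO, in percent.
class_gain = {
    name: round((sc_yolo[name] - csp_yolo[name]) / csp_yolo[name] * 100.0, 2)
    for name in csp_yolo
}

# Classes ordered from largest to smallest gain.
ranking = sorted(class_gain, key=class_gain.get, reverse=True)
for name in ranking:
    print(f"{name}: +{class_gain[name]}%")
```

Under this reading, the Glass class shows the largest gain, which matches the 9.03% figure reported for transparent objects.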

Table 8. Comparative loss measures for VOC2007-1 and ML-31005 datasets.

Model	Set	VOC2007-1 Dataset			ML-31005 Dataset
Model	Set	Box Loss	Cls Loss	Dfl Loss	Box Loss	Cls Loss	Dfl Loss
Efficient-YOLO	Train	1.232	0.786	1.420	1.096	0.769	1.323
	Validation	1.387	0.801	1.548	1.231	0.827	1.418
DINOv2-YOLO	Train	1.620	0.821	1.532	1.362	0.728	1.346
	Validation	1.874	0.896	1.684	1.615	0.810	1.389
CSP-YOLO	Train	0.829	0.428	1.021	0.736	0.414	1.005
	Validation	1.191	0.558	1.231	0.964	0.564	1.169
SC-YOLO	Train	0.810	0.430	1.007	0.741	0.436	1.027
	Validation	1.192	0.563	1.239	1.019	0.573	1.228
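The training-validation loss gap plotted in Figures 20 and 21 can be approximated from Table 8 as the validation loss minus the training loss for each component. The sketch below does this for the VOC2007-1 columns. It assumes the gap is a simple difference of the tabulated values, which may differ from how the figures were produced.

```python
# VOC2007-1 losses from Table 8 as (box, cls, dfl) for each model and set.
voc_losses = {
    "Efficient-YOLO": {"train": (1.232, 0.786, 1.420), "val": (1.387, 0.801, 1.548)},
    "DINOv2-YOLO": {"train": (1.620, 0.821, 1.532), "val": (1.874, 0.896, 1.684)},
    "CSP-YOLO": {"train": (0.829, 0.428, 1.021), "val": (1.191, 0.558, 1.231)},
    "SC-YOLO": {"train": (0.810, 0.430, 1.007), "val": (1.192, 0.563, 1.239)},
}


def loss_gap(entry):
    """Validation minus training loss for each component, rounded to 3 decimals."""
    return tuple(round(v - t, 3) for t, v in zip(entry["train"], entry["val"]))


gaps = {model: loss_gap(entry) for model, entry in voc_losses.items()}
for model, (box, cls, dfl) in gaps.items():
    print(f"{model}: box={box}, cls={cls}, dfl={dfl}")
```

A larger positive gap indicates that training loss has fallen further below validation loss, so the gaps are best interpreted together with the validation metrics in Tables 6 and 7.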

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
