Article

ESPCN-YOLO: A High-Accuracy Framework for Personal Protective Equipment Detection Under Low-Light and Small Object Conditions

by Suphawut Malaikrisanachalee 1, Narongrit Wongwai 2,* and Ekasith Kowcharoen 1
1 Department of Civil Engineering, Faculty of Engineering, Kasetsart University, Lat Yao, Chatuchak, Bangkok 10900, Thailand
2 Department of Civil Engineering, Faculty of Engineering at Sriracha, Kasetsart University, Sriracha Campus, Thung Sukhla, Sriracha, Chonburi 20230, Thailand
* Author to whom correspondence should be addressed.
Buildings 2025, 15(10), 1609; https://doi.org/10.3390/buildings15101609
Submission received: 8 April 2025 / Revised: 4 May 2025 / Accepted: 8 May 2025 / Published: 10 May 2025
(This article belongs to the Special Issue Digital Management in Architectural Projects and Urban Environment)

Abstract:
This study introduces ESPCN-YOLO, an innovative deep learning framework designed to enhance the detection accuracy of Personal Protective Equipment (PPE) under challenging conditions, including low-light environments, long-distance scenarios, and small object detection. The proposed system integrates a YOLOv8-based object detection model with an Efficient Sub-Pixel Convolutional Neural Network (ESPCN) to perform real-time super-resolution enhancement on low-resolution footage. The framework was trained on a custom dataset containing 21,750 annotated images categorized into four PPE classes: helmets, shoes, vests, and persons. Extensive experiments were conducted under varying conditions, including distances ranging from 4 to 14 m, resolutions of 640 × 480 and 1920 × 1080, and brightness levels adjusted from −90% to +70%. The results demonstrate that integrating an ESPCN (3×) with YOLOv8 significantly improves detection accuracy, particularly for small objects and poorly illuminated environments. The model achieved a mean average precision (mAP@0.5) of 0.922 and a stringent mAP@0.5:0.95 of 0.741. Additionally, an automated alert system was implemented to enable real-time PPE compliance monitoring. This study highlights the effectiveness of super-resolution enhancement in increasing detection robustness and provides a practical solution for real-time safety monitoring in industrial environments.

1. Introduction

Ensuring occupational safety in construction, mining, and industrial environments remains a critical global concern. Among the key preventive measures enforced at job sites is the mandatory use of PPE—including helmets, safety goggles, gloves, and vests—to mitigate the risk of injuries and fatalities caused by falling objects, electrical hazards, and physical trauma. According to the International Labor Organization, over 2.3 million people worldwide succumb annually to work-related accidents and diseases, many of which could be prevented through proper safety protocols such as PPE usage [1]. With the increasing proliferation of surveillance technologies, automated PPE detection using computer vision has gained significant attention as a method to continuously monitor safety compliance. These systems are particularly valuable in scenarios where human oversight is limited or impractical. However, operational conditions often impose challenges. Low-light scenarios—prevalent in underground mining, night shift construction, and poorly illuminated factory zones—degrade image quality, thereby impeding visual analysis by both human inspectors and AI-based systems [2]. Consequently, there is a growing demand in civil engineering for intelligent detection methods that can reliably function under such suboptimal visual conditions.
Automated PPE detection faces two primary technical challenges. First, PPE items such as gloves, helmets, and safety goggles are typically small and may be partially occluded or visually ambiguous in surveillance footage, especially when captured by long-range fixed cameras [3]. Second, standard object detectors like You Only Look Once (YOLO) and Faster R-CNN, while highly effective under standard lighting conditions, experience a marked decline in accuracy in low-light environments due to insufficient contrast and feature degradation [4,5]. The combination of small object scale and poor visibility poses substantial limitations for real-world applications, highlighting the need for both enhanced object detection algorithms and preprocessing techniques capable of improving visual input quality.
Several studies have explored AI-driven approaches to PPE detection. For instance, a YOLOv5-based system for helmet detection was proposed in a previous study [6]. Similarly, Faster R-CNN was employed for PPE detection in another study [7]. Although deep learning models demonstrate substantial capabilities, their dependence on high-quality input imagery reduces robustness in environments characterized by poor illumination or image noise. To address this limitation, researchers have increasingly adopted image super-resolution (SR) techniques, aiming to reconstruct high-resolution images from low-resolution inputs. The ESPCN model, introduced in a previous study [8], provides an effective method for real-time image upscaling using sub-pixel convolution layers, preserving spatial structure while minimizing computational costs. When integrated with detection models, SR techniques such as the ESPCN significantly enhance the visibility of small or distant PPE items, particularly in low-light footage. However, existing systems either neglect SR integration or suffer increased inference time due to computational overhead, rendering them unsuitable for real-time applications.
To address these limitations, this paper presents ESPCN-YOLO, a novel AI-powered framework that integrates an ESPCN-based super-resolution module with a YOLOv8 object detector to improve PPE detection accuracy and speed under challenging conditions. The key academic contributions of this study are as follows:
1. Novel integration of super-resolution and object detection for PPE recognition. Unlike traditional PPE detection models that rely solely on object detection networks (e.g., YOLO, Faster R-CNN, SSD), this study incorporates an ESPCN for super-resolution enhancement before object detection. This improves small object detection accuracy and enhances feature extraction in low-resolution or low-quality images.
2. Improved low-light PPE detection with adaptive illumination enhancement. The experimental results demonstrate substantially improved detection accuracy across varying lighting conditions, particularly in low-illumination scenarios.
3. Enhanced small object detection for industrial safety applications. Standard YOLO models struggle with detecting small PPE items (e.g., gloves, goggles, safety badges) due to feature loss at deeper layers. By incorporating an ESPCN, this method preserves fine details, enabling superior detection of small-scale PPE in industrial environments.
4. Benchmarking against state-of-the-art methods. This study conducts a comparative analysis against a leading object detection model (YOLOv8), demonstrating that ESPCN-YOLO achieves higher accuracy. These findings confirm the effectiveness of super-resolution-enhanced object detection in industrial applications.
5. Practical impact for industrial safety and compliance monitoring. The proposed model provides a robust solution for real-time PPE monitoring in industries such as construction, manufacturing, and healthcare, ensuring regulatory compliance and worker safety. Unlike traditional vision-based PPE detection systems, ESPCN-YOLO offers better adaptability to low-quality surveillance footage, making it practical for real-world deployment.
6. Potential for future extensions in edge AI and transformer-based detection. This study opens new research directions in edge AI, as ESPCN-YOLO’s lightweight architecture makes it suitable for deployment on industrial edge devices. Future research could integrate Vision Transformers (ViTs) or Swin Transformers with an ESPCN for further performance improvements.

2. Related Works

2.1. PPE Detection Approach

2.1.1. Sensor-Based PPE Detection

Traditional methods for PPE detection often rely on sensor-based systems, including Global Positioning Systems (GPSs) and Radio Frequency Identification (RFID). These systems track workers’ locations and assess their PPE compliance by detecting the presence of RFID-tagged equipment.
A GPS-based safety monitoring system that enabled real-time location tracking of construction workers was introduced in a previous study [9], helping supervisors ensure PPE compliance. Similarly, RFID-based PPE detection systems have been proposed as non-visual alternatives for monitoring workers’ adherence to safety regulations [10,11].
While sensor-based methods are not affected by external environmental factors such as lighting conditions, they present several limitations: (1) workers must wear additional tracking equipment, increasing discomfort and potential resistance; (2) real-time data communication is required for effective operation, leading to higher deployment costs; and (3) these systems demand long-term maintenance, making them impractical for large-scale construction sites.

2.1.2. Vision-Based PPE Detection

To overcome the limitations of sensor-based systems, researchers have shifted toward vision-based approaches, leveraging cameras and image processing techniques to detect PPE objects. Early studies focused on traditional computer vision algorithms, which applied handcrafted features for PPE detection. For instance, the Circle Hough Transform algorithm was used to detect helmets based on shape characteristics in one study [12].
Other methods include edge detection for identifying PPE items, such as helmets and face shields [13], and color-based classification systems using predefined color thresholds to recognize different PPE classes [14].
However, traditional vision-based methods suffer from poor generalization in complex environments due to variations in illumination, perspective, and occlusion. As a result, deep learning-based methods have gained popularity for their superior performance, as discussed in Section 2.2.

2.2. Deep Learning for PPE Detection

Deep learning techniques have been extensively applied to automated PPE detection across various industries. For example, a real-time monitoring system was developed to detect workers at heights and prevent fall-related accidents [15]. Lightweight deep learning models optimized for PPE detection in industrial environments have also demonstrated high accuracy with reduced computational cost [8,9]. Similarly, deep learning techniques have been applied in university laboratories to enhance student safety by ensuring proper PPE usage [16]. Within the construction industry, CNN-based object detection models—such as Faster R-CNN, SSD, and YOLO—have been widely adopted for PPE monitoring. Early implementations utilized Faster R-CNN for detecting helmets, achieving high precision in controlled environments [17,18,19]. In one study, Faster R-CNN was trained on over 100,000 images extracted from construction site surveillance videos, reaching an accuracy of 95.7%, although the dataset was not publicly available [18]. Other research explored multi-model approaches, such as employing two Faster R-CNN networks—one for detecting workers and another for identifying helmets and vests [17].
The YOLO series of models has gained popularity due to its balance between speed and accuracy. A comparison of YOLOv3, YOLOv4, and YOLOv5 for detecting helmets, vests, and workers found that YOLOv5x achieved the highest accuracy (mAP 90%), while YOLOv5s provided the fastest inference speed [20]. An anchor-free model, YOLOX, was later shown to outperform previous YOLO versions, improving detection accuracy by 3 percentage points [21]. However, while deep learning models perform well in well-lit outdoor construction sites, their effectiveness in low-light conditions remains limited, necessitating specialized image enhancement techniques (Section 2.3).

2.3. Challenges in Low-Light PPE Detection

Despite the success of deep learning-based PPE detection, current models struggle in low-light environments, such as tunnel construction sites, night shifts, and poorly illuminated indoor facilities. Images captured under poor illumination exhibit low brightness, reduced contrast, and color distortions, which significantly degrade detection accuracy [22,23]. To mitigate these challenges, low-light image enhancement techniques have been developed. Histogram Equalization (HE) is a widely used method for increasing image contrast, but it often over-enhances certain regions, leading to unnatural artifacts [24,25]. Adaptive Histogram Equalization (AHE) and Contrast-Limited AHE (CLAHE) attempt to balance contrast enhancement, reducing over-exposure in bright regions [20]. Retinex-based algorithms, inspired by human vision, dynamically adjust brightness to enhance local contrast and feature visibility [26,27]. Recent developments such as Retinex-Net and Zero-DCE++ extend this concept using deep learning frameworks to enable end-to-end image enhancement under challenging lighting conditions [28,29,30,31,32].
Recently, deep learning-based low-light enhancement techniques have gained popularity. An unsupervised Generative Adversarial Network (GAN) called EnlightenGAN was introduced to enhance images without requiring paired low-light and normal-light datasets [32]. A lightweight deep model known as Zero-DCE was later proposed to dynamically adjust image brightness and contrast [33]. However, these methods are computationally expensive and unsuitable for real-time applications.

2.4. Research Gaps and Objectives

While deep learning has significantly improved PPE detection, several key research gaps remain:
  • Limited research on PPE detection in low-light environments, particularly in underground construction sites and nighttime conditions.
  • Lack of publicly available datasets for training and evaluating low-light PPE detection models.
  • Insufficient comparative analysis of low-light image enhancement techniques for improving PPE recognition.
Objectives of this study: To address these gaps, this study proposes an AI-powered detection framework integrating YOLOv8 with ESPCN-based super-resolution enhancement, a novel low-light PPE dataset for tunnel construction environments, and a comparative evaluation of enhancement techniques for real-time PPE compliance monitoring.
By addressing these research gaps, this study advances computer vision applications in construction safety, enabling real-time, high-accuracy PPE detection under challenging lighting conditions.

3. Methodology

This section outlines the methodological framework developed for ESPCN-YOLO, a novel object detection pipeline designed specifically for recognizing PPE under low-illumination conditions and small object scenarios within complex construction environments. The methodology integrates advanced model training, resolution enhancement, and rigorous performance evaluation protocols to ensure scalability and applicability in real-world settings.

3.1. Training the YOLOv8 Object Detection Model

As shown in Figure 1, the dataset used in this study comprises 21,750 high-resolution RGB images, annotated across four PPE classes: helmet, vest, shoes, and person. Images were collected from both real construction environments and simulated settings to capture a wide range of operational variations, including changes in illumination, distance, scale, motion blur, and partial occlusion. Annotations were performed using industry-standard tools following the YOLO format, capturing normalized bounding box coordinates and class identifiers. Manual verification was conducted to ensure label accuracy and inter-annotator consistency, which are critical for achieving high-performance model convergence. To facilitate effective learning and unbiased evaluation, the dataset was divided into a training set (70%; 15,225 images), a validation set (20%; 4350 images), and a testing set (10%; 2175 images). The model was trained for 150 epochs with a batch size of 16 and an input image size of 640 × 640 on an RTX 2060 GPU with 6 GB of memory (Nvidia, Santa Clara, CA, USA), requiring approximately 15 h of training time.
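For illustration, the listing below gives a minimal training sketch under the settings reported above, assuming the Ultralytics YOLOv8 Python API; the dataset configuration file name (ppe_dataset.yaml) and the backbone checkpoint (yolov8n.pt) are illustrative assumptions, as the exact files are not specified here.
# Minimal training sketch using the Ultralytics YOLOv8 API.
# The dataset YAML and the backbone checkpoint are illustrative assumptions.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")        # start from a pretrained YOLOv8 checkpoint
model.train(
    data="ppe_dataset.yaml",      # 70/20/10 split over the four PPE classes
    epochs=150,                   # training settings reported in Section 3.1
    batch=16,
    imgsz=640,
    device=0,                     # single GPU (e.g., RTX 2060, 6 GB)
)
metrics = model.val()             # precision, recall, mAP@50, mAP@50:95 on the validation set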

3.2. Video Footage Capturing

This section describes the procedures employed for capturing video footage for evaluating the proposed ESPCN-YOLO framework under various conditions, including low-light environments and small object detection. Both laboratory and on-site data acquisition processes were conducted to ensure comprehensive evaluation.

3.2.1. Laboratory Data Acquisition

Controlled laboratory tests were performed to ensure consistent and reproducible conditions. The video footage was captured using a high-quality mobile phone camera with the following specifications:
  • Camera Specifications: 10 MP, f/2.4, 70 mm (telephoto), 1/3.52″, 1.12 µm, PDAF, OIS, and 3× optical zoom.
  • Recording Time: Approximately 10:00 AM in Thailand under natural daylight conditions, estimated at 80,000–100,000 lux (typical mid-morning sunlight).
  • Camera Mounting: The mobile phone was securely mounted on a stationary pole to prevent motion-induced artifacts and maintain consistent positioning.
  • Duration of Recordings: Each video segment lasted between 2 and 3 s, containing an average of 110 frames per recording.
  • Resolution and Distances Experiments: As shown in Figure 2, two distinct resolutions were used: 640 × 480 pixels and 1920 × 1080 pixels. Recordings were made at fixed distances from a reference pole: 4, 6, 8, 9, 11, 12, 13, and 14 m. The variation in resolution and distance aimed to comprehensively assess the YOLOv8 model’s detection performance under different spatial and quality conditions within a controlled environment.
  • Brightness Variation Experiments: As shown in Figure 3, to examine the model’s robustness under varying illumination conditions, additional experiments involving brightness manipulation were conducted using Kapwing, an online video editing tool. Baseline Video: Selected from the laboratory dataset, recorded at 640 × 480 pixels at a distance of 12 m from the reference pole. Brightness Adjustment Process: Brightness levels were modified using Kapwing’s filter tool to create a diverse set of testing conditions. Brightness Variations Applied: Increases: +10%, +20%, +40%, +50%, +60%, and +70%; Decreases: −30%, −60%, and −90%. These modified videos were subsequently processed through the YOLOv8 detection pipeline to evaluate the impact of brightness fluctuations on detection accuracy. This process provided valuable insights into the model’s performance under non-ideal lighting conditions frequently encountered in real-world applications.

3.2.2. On-Site Data Acquisition

As shown in Figure 4, to validate the laboratory findings under real-world conditions, on-site tests were conducted at a Sino-Thai construction site. Unlike the laboratory setup, these tests aimed to capture data reflective of operational conditions commonly encountered in construction environments.
  • Camera Specifications: Same as laboratory acquisition.
  • Resolution Setting: Adjusted to 720 × 540 pixels.
  • Camera Positioning: Handheld, to simulate realistic operational scenarios where stability cannot be guaranteed.
  • Environmental Factors: Dynamic backgrounds, varying lighting conditions, and potential occlusions.
  • Total Frames Captured: 356 frames per video.
The purpose of this on-site testing was to assess the robustness of the YOLOv8 model when applied in complex and uncontrolled environments.

3.3. Video Footage Resolution Upscaling

The process of enhancing video resolution plays a critical role in improving detection accuracy, particularly under low-light conditions and when detecting small objects. This section outlines the comparative analysis between bilinear interpolation and an ESPCN for resolution upscaling.

3.3.1. Bilinear Interpolation

Bilinear interpolation is a traditional upscaling technique that generates higher-resolution images by linearly interpolating each new pixel value from the four nearest pixels in the original image and performing a weighted average, calculated using Equation (1).
I(x, y) = (1 − d_x)(1 − d_y) I_{i,j} + d_x(1 − d_y) I_{i+1,j} + (1 − d_x) d_y I_{i,j+1} + d_x d_y I_{i+1,j+1}        (1)
The variables are defined as follows, where
I_{i,j} is the original pixel value at position (i, j);
d_x is the fractional distance from the original pixel index i to the target position x;
d_y is the fractional distance from the original pixel index j to the target position y;
I(x, y) is the interpolated pixel value.
Although the bilinear interpolation technique has advantages such as simple implementation, low computational cost, and suitability for real-time applications, it often produces blurred results at higher scaling factors and fails to preserve fine details [34,35]. This drawback is particularly detrimental to tasks such as small object detection.
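As a concrete illustration of Equation (1), the sketch below interpolates a single target location with NumPy and shows the equivalent whole-frame upscaling call in OpenCV; the file name and the 3× scaling factor are illustrative.
# Bilinear interpolation sketch: Equation (1) for one target location,
# plus the equivalent whole-frame upscaling call in OpenCV.
import cv2
import numpy as np

def bilinear_sample(img, x, y):
    i, j = int(np.floor(x)), int(np.floor(y))      # nearest top-left source pixel
    dx, dy = x - i, y - j                          # fractional offsets
    return ((1 - dx) * (1 - dy) * img[j, i]
            + dx * (1 - dy) * img[j, i + 1]
            + (1 - dx) * dy * img[j + 1, i]
            + dx * dy * img[j + 1, i + 1])

frame = cv2.imread("frame.png")                    # hypothetical low-resolution frame
up = cv2.resize(frame, None, fx=3, fy=3, interpolation=cv2.INTER_LINEAR)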

3.3.2. Efficient Sub-Pixel Convolutional Neural Network (ESPCN)

As shown in Figure 5, an ESPCN is a deep learning-based approach designed to improve image resolution by learning a non-linear mapping from low- to high-resolution images, first proposed in [36]. Unlike bilinear interpolation, this model uses convolutional layers followed by a sub-pixel shuffling operation to upscale the image while preserving high-frequency details critical for object detection. The input low-resolution image is passed through a series of convolutional layers to generate a feature tensor of size H × W × r²C, where r is the upscale factor and C is the number of channels. A periodic shuffling operation is then applied to rearrange this tensor into a high-resolution output of size rH × rW × C. For an upscale factor r = 2, a 2 × 2 low-resolution matrix is transformed into a 4 × 4 high-resolution matrix through a feature tensor of size 2 × 2 × 4. The transformation is governed by Equation (2):
Y(x, y) = F(⌊x/r⌋, ⌊y/r⌋, r · mod(y, r) + mod(x, r) + 1)        (2)
where r denotes the upscale factor. The learnable nature of the ESPCN preserves high-frequency details, edges, and textures, which are essential for detecting small-scale PPE items, such as shoes and distant helmets.
The ESPCN therefore offers advantages in preserving high-frequency details, scales well to real-time video, and delivers superior upscaling quality compared with conventional interpolation.
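A minimal PyTorch sketch of this architecture is given below: feature-extraction convolutions that output r² × C channels, followed by the periodic shuffling (sub-pixel) layer of Equation (2). The layer widths (64 and 32 feature maps) follow the original ESPCN paper and are assumptions with respect to the implementation used in this study.
# ESPCN sketch in PyTorch: convolutions producing r^2 * C feature maps,
# followed by the periodic shuffling (sub-pixel) layer of Equation (2).
import torch
import torch.nn as nn

class ESPCN(nn.Module):
    def __init__(self, upscale_factor: int = 3, channels: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=5, padding=2), nn.Tanh(),
            nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.Tanh(),
            nn.Conv2d(32, channels * upscale_factor ** 2, kernel_size=3, padding=1),
        )
        self.shuffle = nn.PixelShuffle(upscale_factor)   # H x W x r^2*C -> rH x rW x C

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.features(x))

lr = torch.rand(1, 3, 480, 640)                  # a 640 x 480 RGB frame (N, C, H, W)
hr = ESPCN(upscale_factor=3)(lr)                 # output shape: (1, 3, 1440, 1920)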

3.4. Object Detection Process

The object detection process involves utilizing the trained YOLOv8 model to identify PPE items from upscaled high-resolution video footage. The process ensures efficient detection and data recording for further analysis, as shown in Figure 6.
The object detection process involves capturing inputs from sources such as CCTV cameras or mobile phone cameras, which are then processed using the trained YOLOv8 model. The outputs include bounding boxes, class labels, and confidence scores for each detected object. The detection results are recorded in an Excel file, with details such as object class, detection time, and bounding box coordinates. The system continuously monitors the input stream and, if additional frames are available, repeats the detection process in real time.

3.5. Evaluation Methodology

3.5.1. Static Image Evaluation

Evaluating object detection models requires robust performance metrics that accurately reflect the model’s ability to identify and classify objects in images. Among the most commonly used metrics are precision, recall, and mean average precision (mAP), particularly at thresholds mAP@50 and mAP@95. These metrics are critical in various domains, including computer vision, autonomous vehicles, medical imaging, and remote sensing.
Precision is a measure of the accuracy of positive predictions made by the model. It is defined as the ratio of correctly predicted positive instances (true positives) to the total predicted positive instances (true positives + false positives), calculated using Equation (3). A high precision indicates a low false positive rate, meaning the model is efficient at avoiding incorrect predictions.
Recall, also known as the sensitivity or true positive rate, measures the ability of the model to detect all relevant objects in an image. It is defined as the ratio of correctly predicted positive instances to the total actual positive instances (true positives + false negatives), calculated using Equation (4). A high recall implies a low false negative rate, indicating the model’s robustness in capturing most of the relevant objects.
The mean average precision (mAP) is a comprehensive metric used to evaluate the overall performance of object detection models. It measures the area under the precision–recall curve, providing a single value to summarize the performance across various Intersection over Union (IoU) thresholds. mAP@50 evaluates the model’s performance using a single IoU threshold of 0.5. In simpler terms, a predicted bounding box is considered correct if its overlap with the ground truth bounding box is at least 50%, calculated using Equation (5). mAP@95 provides a more rigorous evaluation by averaging the mAP over 10 IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05. It is also known as COCO mAP, calculated using Equation (6).
P = TP / (TP + FP)        (3)
R = TP / (TP + FN)        (4)
mAP@50 = (1/N) Σ_{i=1}^{N} AP_i        (5)
mAP@95 = (1/10) Σ_{k=0}^{9} (1/N) Σ_{i=1}^{N} AP_i^{(k)}        (6)
The variables are defined as follows, where
P is the precision;
R is the recall;
TP denotes the true positives (correctly predicted positive instances);
FP denotes the false positives (incorrectly predicted positive instances);
FN denotes the false negatives (positive instances incorrectly predicted as negative);
N is the number of classes;
AP_i in Equation (5) represents the average precision for class i at IoU = 0.5;
AP_i^{(k)} in Equation (6) represents the average precision for class i at IoU = 0.5 + 0.05k;
k is the index representing each threshold from 0.5 to 0.95 (0.5, 0.55, 0.6, …, 0.95).
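The short sketch below shows how the quantities in Equations (3)–(5) are obtained in practice: a box IoU function that decides whether a prediction counts as a true positive at a given threshold, and precision/recall computed from the resulting counts. The (x1, y1, x2, y2) box format is an assumption.
# Building blocks behind Equations (3)-(5): box IoU and precision/recall.
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); the intersection is clamped to zero if empty.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # Equation (3)
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # Equation (4)
    return precision, recall

# A prediction is a true positive for mAP@50 when iou(pred, gt) >= 0.5;
# averaging AP over all classes gives Equation (5).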
Although these static metrics may appear impressive, they offer only a snapshot of performance within controlled environments; real-world scenarios require further evaluation.

3.5.2. Real-Time Evaluation

Conventional evaluation metrics—such as mAP, precision, and recall—are calculated using static datasets. While a model might achieve a mAP@95 of 0.999 under these conditions, such measures fail to capture the challenges encountered in live, real-time applications. Issues like motion blur, occlusions, and fluctuating lighting conditions are not accounted for in static evaluations.
To address these limitations, a frame-based accuracy metric has been developed. This metric evaluates the model’s performance across a continuous video stream. For example, in a 10 s video recorded at 60 frames per second (FPS), there are 600 frames. The accuracy was calculated using Equation (7).
α = (f_c / f_t) × 100        (7)
The variables are defined as follows, where
α is the detection accuracy expressed as a percentage;
f_c is the number of frames with correct detections;
f_t is the total number of frames.
This approach evaluates the model’s temporal consistency, which is crucial for applications such as surveillance, autonomous navigation, and industrial monitoring—contexts where relying solely on static metrics may yield misleading assessments.
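A minimal sketch of Equation (7) applied to the per-frame counts logged by the detection loop (Section 3.6) is shown below; the structure of per_frame_counts and the expected-object dictionary are illustrative assumptions.
# Frame-based accuracy (Equation (7)): percentage of frames whose detections
# match the expected object counts. Data structures are illustrative.
def frame_accuracy(per_frame_counts, expected):
    correct = sum(
        1 for counts in per_frame_counts
        if all(counts.get(cls, 0) >= n for cls, n in expected.items())
    )
    return 100.0 * correct / len(per_frame_counts)

# Example: a 10 s clip at 60 FPS yields 600 frames; if one person wearing a
# helmet and a vest should appear in every frame:
# alpha = frame_accuracy(rows, {"person": 1, "helmet": 1, "vest": 1})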

3.5.3. IoU Threshold for Real-Time Detection

An Intersection over Union (IoU) threshold of 0.35 was selected for real-time deployment following extensive field testing. In dynamic environments, adopting a higher threshold (e.g., 0.5) may result in missed detections of partially visible objects, whereas a lower threshold increases the risk of false positives. The chosen threshold of 0.35 effectively balances sensitivity and precision, ensuring robust performance under live operational conditions.

3.6. Python Coding Implementation

The implementation of the ESPCN-YOLO framework for PPE detection involves a structured pipeline that integrates image super-resolution (via the ESPCN) and object detection (via YOLOv8). The coding process is divided into seven main steps, detailed below with pseudocode and descriptions.
The following pseudocode outlines the complete process from input acquisition to output recording, incorporating the ESPCN module and YOLOv8 object detection.
BEGIN
// Step 1: Initialize Input and Output
IF input_mode IS “webcam” THEN
Open webcam stream using specified resolution (e.g., 1280 × 720)
ELSE IF input_mode IS “video_file” THEN
Open video file (e.g., video.mp4)
END IF
Initialize video writer for saving annotated output (if required)
Initialize Excel file (or DataFrame) with header columns:
[Frame_Number, Person_Count, Helmet_Count, Shoe_Count, Vest_Count, Additional_Metadata]
// Step 2: Main Processing Loop
SET frame_number TO 0
WHILE video stream IS open DO
READ frame FROM video stream
IF frame IS NOT successfully read THEN
BREAK loop
END IF
INCREMENT frame_number
// Step 3: Enhance Frame Quality with ESPCN Super-Resolution
enhanced_frame = ESPCN_super_resolution(frame)
// Explanation:
// - ESPCN upsamples the frame using convolutional layers followed by periodic shuffling.
// - The result is a higher-resolution frame ready for improved detection.
// Step 4: Perform Object Detection Using YOLOv8
detection_results = YOLOv8_detect(enhanced_frame)
// YOLOv8 performs the following internally:
// a. Feature Extraction: Extracts relevant features using convolutional layers.
// b. Prediction: Predicts bounding boxes, confidence scores, and class probabilities.
// c. Post-Processing: Applies Non-Max Suppression to filter overlapping boxes.
// Step 5: Count and Annotate Detections
INITIALIZE count_person, count_helmet, count_shoe, count_vest TO 0
FOR each detection IN detection_results DO
IF detection.confidence > threshold (e.g., 0.3) THEN
IDENTIFY detection.class (Person, Helmet, Shoe, Vest)
INCREMENT corresponding counter
DRAW bounding box and label on enhanced_frame
END IF
END FOR
// Step 6: Save Detection Data for Accuracy Evaluation
  // Detection data (e.g., object counts per frame) are saved to an Excel file.
  // These data are later evaluated using metrics described in Section 3.5,
  // including mAP@50, mAP@95, and frame-based accuracy.
CREATE row_data = [frame_number, count_person, count_helmet, count_shoe, count_vest]
APPEND row_data TO Excel file or DataFrame
// Display or save the annotated frame
DISPLAY enhanced_frame
WRITE enhanced_frame TO video writer IF required
// Check for exit condition (e.g., ‘ESC’ key pressed)
IF exit_condition_met THEN
BREAK loop
END IF
END WHILE
// Step 7: Finalize and Save Results
SAVE Excel file with detection data
RELEASE video stream and video writer resources
CLOSE all display windows
END
While the current implementation logs detection results in an Excel file (via pandas DataFrame), this decision was made for offline batch analysis and visualization, not for real-time performance. The real-time detection loop operates fully in memory, and writing to disk occurs only after frame processing. Excel offers advantages for structured annotation and easy integration with analysis tools. For high-frequency or live deployment scenarios, we acknowledge that CSV or database formats would provide better I/O performance and scalability, and we plan to adopt these solutions in future real-time systems.
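A condensed, runnable counterpart of the pseudocode above is sketched below. It assumes the opencv-contrib dnn_superres module with a pretrained ESPCN_x3.pb weights file and the Ultralytics YOLOv8 API; the file paths, class index mapping, and the 0.3 confidence threshold are illustrative rather than the exact implementation used in this study.
# Condensed pipeline sketch: ESPCN super-resolution (OpenCV dnn_superres)
# followed by YOLOv8 detection and per-frame logging to Excel.
import cv2
import pandas as pd
from ultralytics import YOLO

sr = cv2.dnn_superres.DnnSuperResImpl_create()
sr.readModel("ESPCN_x3.pb")                 # pretrained ESPCN weights (assumed path)
sr.setModel("espcn", 3)                     # 3x upscaling

model = YOLO("ppe_yolov8.pt")               # trained PPE detector (assumed path)
classes = {0: "person", 1: "helmet", 2: "shoe", 3: "vest"}  # assumed class order

rows = []
cap = cv2.VideoCapture("video.mp4")
frame_number = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    frame_number += 1
    enhanced = sr.upsample(frame)                            # Step 3: super-resolution
    result = model(enhanced, conf=0.3, verbose=False)[0]     # Steps 4-5: detect and filter
    counts = {name: 0 for name in classes.values()}
    for cls_id in result.boxes.cls.tolist():
        counts[classes[int(cls_id)]] += 1
    rows.append({"Frame_Number": frame_number, **counts})
cap.release()
pd.DataFrame(rows).to_excel("detections.xlsx", index=False)  # Step 7: offline log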

3.7. Latency Analysis

To evaluate the responsiveness of the system for real-time applications, we measured the processing latency per frame on our test hardware:
1.
Input size 640 × 480:
  • Preprocessing: 2 ms;
  • Inference: 10 ms;
  • Post-processing: 2 ms;
  • Total ≈ 14 ms (≈70 FPS).
2.
Input size 1920 × 1080 (Full HD):
  • Preprocessing: 9 ms;
  • Inference: 38.9 ms;
  • Post-processing: 1 ms;
  • Total ≈ 48.9 ms (≈20 FPS).
These results confirm that the system maintains low latency, suitable for real-time detection tasks. Additionally, it is important to note that detection data are written to an Excel/DataFrame only after processing each frame, meaning that disk I/O operations do not interfere with the main detection loop.
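The per-stage timings above can be reproduced with a simple wrapper such as the sketch below, which times each stage with time.perf_counter(); the stage functions are the hypothetical preprocessing, inference, and post-processing steps of the pipeline rather than named functions from the actual implementation.
# Per-stage latency measurement sketch using a high-resolution timer.
import time

def timed(stage_fn, *args):
    t0 = time.perf_counter()
    out = stage_fn(*args)
    return out, (time.perf_counter() - t0) * 1000.0   # elapsed time in milliseconds

# Hypothetical usage inside the detection loop:
# enhanced, t_pre = timed(sr.upsample, frame)          # preprocessing (ESPCN)
# result, t_inf = timed(model, enhanced)               # YOLOv8 inference
# _, t_post = timed(count_and_annotate, result)        # post-processing
# fps = 1000.0 / (t_pre + t_inf + t_post)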

4. Experiments and Results

This section presents a comprehensive analysis of the proposed ESPCN-YOLO framework through three experimental scenarios: (1) YOLOv8 object detection training results, (2) brightness impact, and (3) distance impact. Each experiment aims to evaluate the framework’s robustness and accuracy under diverse conditions, ensuring its applicability in real-world environments.

4.1. Training YOLOv8 Object Detection Model Results

As shown in Table 1, the performance of the trained YOLOv8 model was evaluated on the testing set using conventional metrics:
  • Precision (Box Precision): Measures the accuracy of predicted bounding boxes.
  • Recall: Measures the completeness of predictions.
  • mAP@50: Mean average precision at an Intersection over Union (IoU) threshold of 0.50.
  • mAP@95: Mean average precision averaged over multiple IoU thresholds from 0.50 to 0.95.
The performance of the YOLOv8 object detection model was evaluated using multiple metrics to assess its accuracy and robustness across several classes, including person, helmet, shoes, and vest. The model achieved an overall precision of 0.916, indicating a high level of accuracy in predicting bounding boxes across all categories. Among these classes, vest detection exhibited the highest precision score of 0.939, reflecting the model’s strong ability to accurately identify vests with minimal false positives. This high precision likely results from the distinctive visual features and relatively larger size of vests, making them easier to detect. The person class also showed high precision, scoring 0.931, slightly lower than vests, suggesting occasional misclassifications or inaccuracies in bounding boxes. Similarly, the shoe class achieved a precision of 0.924, indicating some challenges in accurately maintaining bounding boxes for smaller objects. The helmet class obtained a precision of 0.922, demonstrating robust performance; however, this metric alone does not fully reflect the complexities of detecting helmets under challenging conditions, such as low lighting or partial occlusion.
Regarding recall, the model achieved an overall recall of 0.876, suggesting satisfactory detection coverage but also highlighting instances where objects were missed in the dataset. The helmet class achieved the highest recall at 0.946, demonstrating the model’s strong capability to detect helmets under varied conditions, including diverse angles and partial occlusions. The person class similarly showed excellent recall at 0.948, underscoring the model’s consistency in detecting larger and more recognizable objects. Conversely, the shoe class exhibited the lowest recall at 0.819, revealing difficulties in detecting smaller objects, possibly due to the model’s sensitivity to resolution or scale variations, particularly when shoes appear distant or partially obstructed. The vest class achieved a high recall score of 0.936, indicating good detection coverage with minimal missed instances, further confirming its robustness under various conditions.
The model’s accuracy was also evaluated using the mAP@50 metric, which allows moderate overlap between predicted and ground truth boxes. The overall mAP@50 score was 0.922, reflecting high accuracy across most categories. Vest detection achieved the highest score of 0.975, indicating the model’s effectiveness even with minor overlaps, likely due to the vest’s distinct visual boundaries. The person class also performed exceptionally well, achieving an mAP@50 of 0.962, likely due to the relatively large size of individuals, facilitating easier detection and accurate localization. The helmet class demonstrated high accuracy with an mAP@50 of 0.956; however, minor errors persisted under challenging conditions like occlusions or low-light environments. The shoe class scored lowest with an mAP@50 of 0.920, indicating persistent challenges in achieving high precision for smaller objects, especially when partially obscured or located at frame edges.
The mAP@95 metric, averaging precision across multiple IoU thresholds ranging from 0.50 to 0.95, provides a more stringent evaluation of localization accuracy. The overall mAP@95 score was 0.741, significantly lower than mAP@50, highlighting challenges in precise localization across stricter IoU thresholds. Nevertheless, the person (0.828) and vest (0.832) classes scored relatively high, indicating robustness even under stringent conditions. These scores suggest that the model effectively detects these objects with greater localization accuracy. The helmet class achieved a moderate score of 0.763, but its notable decline compared to mAP@50 highlights difficulties in precise localization, particularly when helmets are partially occluded or viewed from challenging angles. The shoe class recorded the lowest performance with an mAP@95 of 0.713, underscoring its sensitivity to resolution and positioning challenges. This low score emphasizes the model’s difficulty in accurately detecting smaller objects, especially when partially obscured or lacking distinct visual features.
Overall, the YOLOv8 model demonstrated strong detection capabilities for larger and more distinguishable objects such as persons and vests. However, the results revealed significant challenges in detecting smaller objects like shoes and, to a lesser extent, helmets, particularly when evaluated using the stricter mAP@95 metric. These findings suggest potential improvements through enhancing input image resolution with techniques like an ESPCN, augmenting the training dataset with more challenging scenarios, and optimizing the model’s hyperparameters to improve small object detection. Incorporating these enhancements is expected to increase the model’s robustness and accuracy under diverse real-world conditions.

4.2. Brightness Impact

This experiment aimed to evaluate the model’s robustness under varying brightness conditions. The Kapwing tool was used to artificially adjust the brightness of test images to simulate real-world low-light conditions. The results are summarized in Table 2.
  • Resolution: 640 × 480.
  • Brightness levels: ranging from +70% (overexposed) to −90% (severely darkened).
  • Classes evaluated: person, helmet, shoes, and vest.
The performance of the YOLOv8 model was evaluated under varying brightness conditions to assess its robustness and reliability when subjected to extreme lighting variations. The experiment included brightness adjustments ranging from +70% (overexposure) to −90% (severe darkness), with detection results measured across four PPE classes: person, helmet, shoes, and vest.
The person class exhibited exceptional robustness, maintaining a consistent detection rate of 100.00% across all brightness levels. This resilience can be attributed to the relatively large size of persons within the frame and distinct structural features that remain identifiable even under extreme lighting conditions. Unlike other PPE items, which are smaller and more susceptible to brightness-related distortions, individuals were consistently and accurately detected by the YOLOv8 model regardless of brightness variations.
In contrast, the helmet class demonstrated significant sensitivity to brightness alterations. Under neutral brightness conditions, helmet detection accuracy reached 99.16%, indicating strong performance in ideal lighting. However, accuracy deteriorated rapidly with brightness changes. When brightness increased to +70%, helmet detection dropped to 61.02%, likely due to overexposure, causing loss of texture details and reduced contrast. As brightness decreased, helmet detection worsened dramatically; accuracy fell to 61.86% at −30% brightness and sharply declined to 5.08% at −60%. Under the most extreme darkness condition (−90%), helmet detection completely failed, with an accuracy of 0.00%. This sensitivity suggests that helmets, being relatively small and potentially reflective objects, are highly affected by brightness changes, especially in low-light conditions.
The shoe class was the most adversely impacted by brightness variation, showing the lowest overall detection accuracy. Even under neutral brightness, shoe detection was only 7.98%, highlighting inherent challenges in detecting such small objects. At high brightness levels (+70% and +60%), shoe detection completely failed (0.00%). Detection improved marginally with moderate brightness increases (+50% to +10%), reaching a peak accuracy of 26.69% at +40%. However, at all negative brightness adjustments (−30%, −60%, −90%), shoe detection remained consistently at 0.00%. This poor performance indicates that shoes, especially when captured at a distance or under low-light conditions, lack sufficient contrast and detail for reliable detection. Additionally, their small size further compounds detection difficulty under degraded image quality.
The vest class exhibited strong robustness across varying brightness conditions, achieving 100.00% detection accuracy across all positive brightness adjustments and maintaining high performance under moderate darkness. At brightness levels ranging from neutral to +10%, vests were consistently detected with 100.00% accuracy. Even at decreased brightness levels (−30% and −60%), detection accuracy remained perfect (100.00%). This robustness can be attributed to the typically high visibility and reflective nature of vests, enhancing their contrast against different backgrounds. However, at the most extreme negative brightness level (−90%), detection accuracy significantly dropped to 65.25%, suggesting that extreme darkness can still negatively affect detection performance.
The impact of brightness variations on the YOLOv8 model’s detection performance was analyzed by systematically adjusting brightness levels using the Kapwing video editor. The adjustments were applied uniformly across video frames through a process known as scale-and-clip, where each pixel value is modified according to Equation (8).
I_new = min((1 + B) × I_o, 255)        (8)
In this formula, B represents the brightness adjustment factor. A positive B increases brightness, while a negative B decreases it. If the calculated value exceeds 255, it is clipped to 255 to avoid overflow. This operation directly affects the edge contrast between objects and their backgrounds, which is critical for YOLOv8’s convolutional filters to detect and localize objects effectively.
The analysis involved evaluating the effect of brightness adjustments on the edge contrast between an object and its background. For example, if a background pixel has an intensity of 180 and an object pixel has an intensity of 100, the original edge contrast is calculated as I_o = 180 − 100 = 80. Increasing brightness by 10% results in new pixel values of 198 and 110, respectively, resulting in an improved edge contrast of 88. This moderate increase enhances detection performance by making edges slightly more pronounced. However, further increasing brightness by 70% results in pixel clipping, where the background value reaches the maximum limit of 255. This leads to a contrast reduction to 85, indicating that excessive brightening can cause overexposure and reduce edge contrast. Conversely, brightness reduction significantly degrades edge contrast. For instance, a 30% decrease in brightness reduces the original edge contrast from 80 to 56, while a 90% decrease almost eliminates the contrast, reducing it to 8. Such severe dimming makes it nearly impossible for YOLOv8’s convolutional filters to generate strong activations, particularly when detecting small objects like helmets and shoes.
The relationship between brightness changes and detection performance was further analyzed by comparing edge contrast values with detection accuracy across various brightness levels. It was observed that moderate brightening (+20% to +50%) boosts contrast above a threshold where YOLOv8’s filters can effectively detect objects. At +40%, the edge contrast reaches 112, resulting in optimal detection for helmets (99%), shoes (27%), and vests (100%). However, overexposure at +70% reduces contrast to 85, negatively impacting shoe detection (0%) despite reasonably good performance for helmets (61%) and vests (91%). Similarly, reducing brightness has a detrimental effect on detection performance. At −30%, the edge contrast decreases to 56, reducing helmet detection accuracy to 62% and completely eliminating shoe detection. Further decreasing brightness to −60% and −90% collapses the contrast to 32 and 8, respectively, rendering helmets and shoes entirely undetectable. Only vests, which are typically larger and feature high-contrast reflective materials, maintain relatively high detection accuracy (100%) until brightness reaches −90%, where accuracy drops to 65%.
These findings highlight the importance of maintaining sufficient edge contrast for effective object detection. YOLOv8’s convolutional filters rely heavily on pixel-to-pixel differences to activate and produce accurate predictions. When brightness adjustments diminish these differences, the filters fail to detect objects effectively. This issue is particularly pronounced for smaller objects like helmets and shoes, which depend on fine detail and subtle contrasts to be successfully identified. Overall, the analysis confirms that brightness variation significantly impacts detection performance. Moderate brightening generally improves detection accuracy by enhancing edge contrast, while excessive brightening or severe dimming severely degrades performance.
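The scale-and-clip operation of Equation (8) and the edge-contrast figures quoted above can be reproduced with the short sketch below; the 180/100 pixel pair is the worked example from the text.
# Scale-and-clip brightness adjustment (Equation (8)) and its effect on the
# edge contrast between a background pixel (180) and an object pixel (100).
import numpy as np

def adjust_brightness(values, B):
    return np.clip((1.0 + B) * np.asarray(values, dtype=np.float32), 0, 255)

for B in (0.10, 0.70, -0.30, -0.90):
    bg, obj = adjust_brightness([180, 100], B)
    print(f"B = {B:+.0%}: edge contrast = {round(float(bg)) - round(float(obj))}")
# Output: +10% -> 88, +70% -> 85 (background clipped at 255), -30% -> 56, -90% -> 8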
Integrating ESPCN preprocessing or employing brightness normalization techniques could significantly improve detection robustness under various lighting conditions, as shown in Table 3 and Figure 7.
The application of ESPCN super-resolution (3×) demonstrated significant improvements in detection accuracy for all classes, particularly under challenging brightness conditions where the baseline YOLOv8 model struggled. By increasing the resolution of input frames by a factor of 3×, the model gained enhanced clarity and detail, which proved essential for detecting smaller objects such as helmets and shoes. The enhanced resolution effectively amplified subtle gradients and textures, thereby improving object distinguishability from their backgrounds. This improvement was most noticeable under conditions of overexposure and severe dimming, where standard detection methods often fail. The analysis revealed that applying ESPCN super-resolution helped mitigate the negative effects of extreme brightness adjustments. Notably, the detection accuracy of helmets and shoes—classes most sensitive to brightness changes—improved considerably with ESPCN preprocessing. Under severe darkening conditions, such as −60% and −90% brightness, the baseline YOLOv8 model’s accuracy for helmets was drastically reduced to 5.08% and 0.00%, respectively. However, when ESPCN super-resolution was applied, detection accuracy at −60% brightness improved significantly to approximately 95%. This improvement suggests that enhancing resolution with the ESPCN effectively restores some of the lost contrast and detail, allowing helmets to be more readily identified even under extremely poor lighting conditions. The improvement was also evident in the detection of shoes, which were particularly challenging to detect in the baseline model.
Under most brightness variations, shoe detection frequently dropped to 0.00%, indicating a near-complete failure to detect such small objects. However, with the application of the ESPCN (3×), detection accuracy improved across all brightness levels. For instance, under extreme darkening (−90%), where shoes were previously undetectable, the use of the ESPCN resulted in detection accuracy improving from 0.00% to approximately 42%. This substantial enhancement demonstrates that super-resolution preprocessing is particularly effective for detecting small objects that would otherwise be missed due to low contrast or poor illumination. This suggests that the application of super-resolution effectively enhances object visibility by increasing contrast and detail resolution, especially under adverse brightness conditions. The improvements achieved through ESPCN super-resolution (3×) can be attributed to several factors. First, enhancing resolution provides more detail for the YOLOv8 model to process, making smaller objects that were previously undetectable more distinguishable. Additionally, the increased resolution helps restore contrast that is otherwise lost under extreme brightness changes. For example, when images are overexposed, pixel clipping occurs, causing important details to be lost.
By applying the ESPCN, some of these details are reconstructed, allowing the model to better differentiate objects from the background. Conversely, in darkened conditions, the enhanced resolution amplifies subtle textures and gradients, making objects more detectable even when visibility is poor. Overall, the use of ESPCN super-resolution (3×) proved effective in enhancing the detection accuracy of YOLOv8 under various brightness conditions. The improvement is most evident for smaller objects such as helmets and shoes, which typically suffer from low detection accuracy due to poor contrast and limited resolution. By increasing image resolution and restoring lost details, the ESPCN provides a viable solution for improving detection performance under both overexposed and severely darkened conditions. These findings confirm that integrating super-resolution preprocessing with YOLOv8 is a robust approach to enhancing object detection accuracy under challenging environmental conditions.

4.3. Distance Impact

As shown in Table 4, experiments were conducted at two resolutions—640 × 480 and 1920 × 1080—across various distances (ranging from 4 m to 14 m). The results clearly demonstrate that high-resolution inputs (1920 × 1080) maintain nearly 100% detection accuracy even at extended distances, whereas lower-resolution inputs (640 × 480) experience significant drops in performance, particularly for smaller objects.
The impact of distance on detection performance was evaluated through experiments conducted at two resolutions: 640 × 480 and 1920 × 1080, across distances ranging from 4 m to 14 m. The objective was to determine how resolution and distance influence detection accuracy for different object classes, including person, helmet, shoes, and vest. The results revealed a clear disparity in detection performance between the low- and high-resolution scenarios, especially as distance increased. At the higher resolution (1920 × 1080), the model consistently achieved nearly 100% detection accuracy across all object classes, even at the maximum distance of 14 m. This sustained performance is due to the higher pixel density available at the high-resolution input, enabling accurate detection and localization of objects despite their smaller appearance at greater distances. Conversely, at the lower resolution (640 × 480), detection performance significantly deteriorated as distance increased. At shorter distances (4 m and 6 m), the model maintained 100% detection accuracy for all classes. However, beyond 8 m, accuracy declined noticeably, particularly for smaller objects such as helmets and shoes. For instance, at 9 m, helmet detection accuracy remained at 100%, while shoe detection accuracy dropped to 87.60%. At 11 m, shoe detection accuracy drastically decreased to 12.40%, further declining to 7.98% at 12 m. Beyond 12 m, shoe detection became entirely ineffective, with 0.00% accuracy at 13 m and 14 m. Helmet detection also experienced a sharp decline at lower resolutions. While accuracy remained relatively high (99.16%) at 12 m, it significantly dropped to 31.86% at 13 m and became completely ineffective (0.00%) at 14 m. This trend underscores the model’s heightened sensitivity to resolution when detecting small objects at longer distances. In terms of pixel density, the higher resolution (1920 × 1080) consistently provided more pixels per object, facilitating effective identification and classification of PPE items even at greater distances. For example, at 14 m, the person class was represented by approximately 18,099 pixels, the helmet class by 630 pixels, and the shoe class by approximately 313 pixels—values sufficient for accurate detection. Conversely, at the lower resolution (640 × 480), at 14 m, the person class was represented by only 2282 pixels, while helmets, shoes, and vests had significantly fewer pixels, contributing to drastically reduced detection performance. These findings clearly demonstrate that resolution critically influences detection accuracy over extended distances. The 1920 × 1080 resolution enabled consistently high detection accuracy across all classes and distances, attributable to the model’s ability to capture detailed features and maintain sufficient pixel density despite object size reduction. Conversely, the 640 × 480 resolution resulted in significant performance degradation, particularly for smaller objects like helmets and shoes. Helmet detection accuracy fell sharply to 31.86% at 13 m and was non-existent at 14 m, while shoes were undetectable beyond 12 m. These results highlight the limitations of lower-resolution inputs when detecting small or distant objects.
The impact of distance on detection performance was analyzed using the YOLOv8 model’s multi-scale grid system, which divides an image into cells of three sizes: 32, 16, and 8 pixels. Each cell is responsible for predicting objects whose centers fall within its boundaries, and detection accuracy significantly depends on how many cells an object spans. The results highlight the substantial influence of image resolution and object distance on detection performance, especially for smaller objects such as shoes and helmets. When an image with a resolution of 640 × 480 pixels is captured at a distance of 11 m, the YOLOv8 grid system’s finest grid size is 8 pixels per cell, resulting in a horizontal grid of 80 columns and a vertical grid of 60 rows. Using a simplified camera model, the pixel width of an object is calculated based on the camera’s effective focal length, the real size of the object, and its distance from the camera. For example, a shoe that appears 14 pixels wide at 8 m distance reduces to approximately 10 pixels wide at 11 m. On the finest grid of 8-pixel cells, this shoe spans roughly 1.25 cells. On coarser grids, however, it spans only approximately 0.625 cells (16-pixel cells) and about 0.3125 cells (32-pixel cells). Empirical evidence indicates that detection becomes unreliable when an object spans fewer than approximately two cells. Thus, the model’s detection ability diminishes significantly as an object’s pixel representation shrinks with increasing distance. In contrast, when capturing the same scene at a higher resolution of 1920 × 1080 pixels from the same distance (11 m), the grid becomes much denser. The finest grid (8-pixel cells) consists of 240 columns and 135 rows, offering substantially more pixels to represent objects. Consequently, objects that appeared small at lower resolutions now occupy more pixels. For instance, the shoe that appeared 10 pixels wide in the 640 × 480 image now measures approximately 20 pixels wide in the 1920 × 1080 image. This increased pixel width allows the object to span about 2.5 cells on the finest grid, surpassing the critical threshold of two cells required for reliable detection. Even on the coarser grids with cell sizes of 16 and 32 pixels, the object spans about 1.25 and 0.625 cells, respectively. Therefore, higher-resolution images provide sufficient detail for accurate detection, even at extended distances. These findings clearly demonstrate the critical role of image resolution in maintaining detection accuracy over long distances. At lower resolutions (e.g., 640 × 480), objects quickly become too small to span the necessary number of cells for reliable detection, particularly on finer grids. This limitation is most pronounced for smaller objects, such as shoes and helmets, which frequently fail to meet the minimum required cell coverage. As distance increases, the pixel representation of these objects decreases further, resulting in poor detection accuracy or total detection failure. At 11 m, shoes covered only about 1.25 cells on the finest grid, making detection inconsistent and unreliable. Conversely, employing a higher resolution of 1920 × 1080 pixels significantly enhances detection performance due to increased pixel density.
This improvement allows objects to span more cells, thereby improving detection accuracy across all object classes. For example, an object spanning only 1.25 cells in a 640 × 480 image can span up to 2.5 cells in a 1920 × 1080 image, surpassing the threshold for reliable detection. The higher resolution provides denser scene sampling, enabling YOLOv8’s convolutional filters to accurately identify objects. Moreover, the increased resolution particularly benefits the detection of smaller objects that would otherwise go unnoticed due to limited pixel coverage. In conclusion, this analysis confirms that higher-resolution inputs (1920 × 1080) significantly improve detection accuracy, especially for smaller objects and at greater distances. Performance at lower resolutions (640 × 480) quickly deteriorates beyond distances of approximately 8 to 11 m, notably affecting the detection of helmets and shoes. These results underscore the importance of resolution-enhancing techniques such as an ESPCN for improving detection performance under such conditions. Additionally, combining high-resolution inputs with super-resolution preprocessing may further enhance model robustness and reliability in detecting PPE items across various distances.
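The grid-span reasoning above can be summarized in a short sketch. It uses the worked pixel widths from this example (about 10 px for the shoe at 11 m in a 640 × 480 frame and about 20 px in a 1920 × 1080 frame) together with the approximate two-cell reliability heuristic; both figures are taken from the discussion above, not re-measured.

```python
STRIDES = (8, 16, 32)   # YOLOv8 detection strides, finest to coarsest
RELIABLE_SPAN = 2.0     # heuristic minimum cell coverage for dependable detection

def cell_spans(object_width_px):
    """Return the number of grid cells an object of the given width covers per stride."""
    return {stride: object_width_px / stride for stride in STRIDES}

# Shoe widths from the worked example at 11 m.
for label, width_px in (("640x480 frame", 10), ("1920x1080 frame", 20)):
    spans = cell_spans(width_px)
    verdict = "likely detectable" if spans[8] >= RELIABLE_SPAN else "likely missed"
    print(f"{label}: {spans} -> {verdict} on the finest (stride-8) grid")
```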
Resolution enhancement techniques such as an ESPCN or bilinear interpolation could therefore be crucial for improving detection performance under these challenging conditions, as shown in Table 5 and Figure 8.
The evaluation of resolution enhancement techniques, including bilinear interpolation (2× and 3×) and ESPCN super-resolution (2× and 3×), revealed significant improvements in detection accuracy over the baseline YOLOv8 model under challenging distance conditions. The tests assessed the model's performance on four classes: person, helmet, shoes, and vest. Without any enhancement, the baseline model detected persons and vests with 100.00% accuracy, but helmet detection fell to 31.86% and shoe detection was completely ineffective at 0.00%. These results indicate that the baseline model struggles to detect smaller objects such as helmets and shoes at extended distances.
Applying bilinear interpolation (2×) provided a moderate improvement. The person and vest classes maintained a perfect detection rate of 100.00%, and helmet detection rose from 31.86% to 100.00%. Shoe detection, however, remained at 0.00%, indicating that bilinear interpolation alone cannot restore the visibility of small objects that occupy only a few pixels in the frame. Bilinear interpolation (3×) yielded further gains: detection accuracy for persons, helmets, and vests remained at 100.00%, and shoe detection improved to 29.65%. This suggests that increasing the scaling factor from 2× to 3× allocates more pixels to small objects, enhancing their detectability, although the gain is still limited by the inherent blurring of bilinear interpolation, which obscures the finer details needed for accurate detection.
ESPCN super-resolution (2×) likewise kept person and vest detection at 100.00% and raised helmet detection to 100.00%, while shoe detection increased from 0.00% (baseline) to 4.46%. Although modest, this improvement demonstrates the ESPCN model's ability to preserve essential details that are often lost during standard interpolation. The largest gain was achieved with ESPCN super-resolution (3×): person, helmet, and vest detection reached 100.00%, and shoe detection improved dramatically to 49.55%. This result confirms that higher scaling factors provide a more detailed representation of smaller objects, making them distinguishable even at extended distances, and highlights the ESPCN's ability to enhance resolution without introducing the excessive blurring typical of traditional interpolation.
Overall, the analysis confirms that ESPCN super-resolution (3×) provides the most substantial improvement in detection accuracy compared with bilinear interpolation (2× and 3×) and the baseline YOLOv8 model. The gains are most evident for small objects such as shoes, which are most affected by resolution loss at extended distances. While bilinear interpolation (3×) improves detection accuracy moderately, it is less effective than the ESPCN (3×) because it blurs high-frequency details, whereas the ESPCN reconstructs finer textures, making objects more detectable under challenging distance conditions.
These findings suggest that integrating ESPCN preprocessing, particularly at higher scaling factors, significantly enhances the robustness and accuracy of YOLOv8 for detecting small or distant objects. This improvement is critical for applications requiring reliable detection of PPE over extended ranges.
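A minimal sketch of this preprocessing pipeline is shown below, assuming the OpenCV contrib dnn_superres module with a pretrained ESPCN 3× model and the Ultralytics YOLOv8 API. The weight files, input image name, and confidence threshold are illustrative placeholders, not the exact artifacts used in this study.

```python
import cv2                      # requires opencv-contrib-python for dnn_superres
from ultralytics import YOLO    # Ultralytics YOLOv8 API

# Load a pretrained ESPCN super-resolution model (3x scaling factor).
sr = cv2.dnn_superres.DnnSuperResImpl_create()
sr.readModel("ESPCN_x3.pb")       # illustrative path to pretrained ESPCN weights
sr.setModel("espcn", 3)

detector = YOLO("ppe_yolov8.pt")  # hypothetical PPE detection weights

frame = cv2.imread("site_frame_640x480.jpg")   # low-resolution input frame
enhanced = sr.upsample(frame)                   # 3x super-resolved frame
# For comparison, a bilinear 3x upscale of the same frame:
bilinear = cv2.resize(frame, None, fx=3, fy=3, interpolation=cv2.INTER_LINEAR)

# Run detection on the enhanced frame; the confidence threshold is illustrative.
results = detector(enhanced, conf=0.25)
for box in results[0].boxes:
    print(detector.names[int(box.cls)], float(box.conf))
```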
To further evaluate the robustness of the proposed ESPCN-YOLOv8 framework under real-world conditions, we conducted an additional test using video footage captured at an actual construction site. This test scenario included non-frontal worker poses, such as side and oblique angles. Figure 9 illustrates the comparative results: panel (a) shows the original frame (720 × 540 pixels), while panel (b) displays the enhanced image after applying ESPCN with a 3× scaling factor, resulting in a 2160 × 1620-pixel image. In the original frame, the YOLOv8 detector identified only two persons. However, after applying ESPCN super-resolution, the system successfully detected three persons, one helmet, and one pair of shoes. These results clearly demonstrate that super-resolution enhances detection performance by improving object visibility—especially for small or partially obscured PPE items in low-resolution footage.

5. Conclusions, Limitations, and Recommendations

5.1. Conclusions

The proposed ESPCN-YOLO framework successfully enhances the detection accuracy of PPE under challenging conditions, including low-light environments, small object scenarios, and extended distances. By integrating an ESPCN with YOLOv8, the system achieves substantial improvements over baseline models and traditional upscaling techniques such as bilinear interpolation.
The experimental results confirm that the ESPCN (3×) provides the most significant enhancement in detection accuracy, especially for smaller objects like helmets and shoes, which are more susceptible to resolution loss and brightness degradation. The ESPCN-YOLO framework demonstrated consistent performance across a variety of scenarios:
Low-Light Conditions: The model retained useful detection capability even under severe darkness (e.g., −90% brightness). Applying the ESPCN (3×) restored shoe detection from 0.00% (baseline) to roughly 38–45% across the −30% to −90% brightness range and recovered helmet detection to 100.00% at −30% and −60% brightness, although helmets remained undetected at −90% (Table 3). This improvement is attributed to the network's ability to enhance subtle details and restore lost contrast, allowing effective detection in poorly illuminated environments.
Small Object Detection: The enhanced resolution provided by the ESPCN (3×) significantly improved the detection of small objects, particularly shoes, which were nearly undetectable under baseline conditions. Detection accuracy for shoes increased from 0.00% (baseline) to 49.55% with the ESPCN (3×). This demonstrates the framework’s effectiveness in amplifying critical features necessary for identifying small-scale objects.
Distance Impact: Under extended distances (e.g., 13–14 m), the proposed approach maintained high detection accuracy for all classes when using high-resolution inputs (1920 × 1080). The ESPCN-YOLO framework demonstrated resilience to scale variations, with the person, helmet, and vest classes consistently achieving 100.00% detection accuracy even at 14 m. The model’s ability to detect shoes, a particularly challenging class, also improved significantly from 0.00% (baseline) to 49.55% with the ESPCN (3×).
Comparison with Bilinear Interpolation: The ESPCN-YOLO framework outperformed traditional bilinear interpolation in all scenarios. While bilinear interpolation (3×) provided moderate improvements, it introduced excessive blurring, which reduced the model’s ability to detect finer details. Conversely, the ESPCN (3×) preserved high-frequency textures, ensuring superior detection accuracy.
Compared to prior studies using models such as YOLOv5 [6] and Faster R-CNN [7], which demonstrated performance degradation under low-light and small object conditions, our ESPCN-YOLO framework achieves significantly higher accuracy. For example, Teske et al. (2022) reported an mAP@0.5 of 0.88 for helmet detection under ideal lighting conditions, whereas our model achieved 0.922 under various challenging scenarios, including extreme brightness reduction and long-distance footage [6]. Moreover, while Xie et al. (2021) reported detection failures under low-light conditions using Faster R-CNN, our framework maintained high detection rates for helmets and shoes by incorporating super-resolution preprocessing [7]. These findings validate the effectiveness of integrating the ESPCN with YOLOv8 to enhance PPE detection under real-world constraints.
Overall, the ESPCN-YOLO framework demonstrates a robust capacity to enhance detection performance across various environmental and operational challenges. The findings indicate that super-resolution preprocessing, particularly using the ESPCN (3×), is a highly effective strategy for improving the accuracy of YOLOv8 in real-world PPE monitoring applications.
For practical safety impact, the proposed ESPCN-YOLO framework addresses key causes of construction site accidents—particularly falls from height and a lack of PPE—by enabling automated, real-time monitoring of safety compliance. The system accurately detects whether workers are wearing essential PPE, such as helmets and shoes, even under challenging conditions (e.g., poor lighting, long distances). This allows site managers to take immediate corrective action when non-compliance is identified, especially in high-risk areas such as scaffolding and elevated platforms. By reducing human error and increasing visibility of compliance, the system directly contributes to accident prevention. Although the system does not physically intervene, its role in alerting and documenting violations significantly enhances the effectiveness of proactive safety management and incident reduction strategies.

5.2. Study Limitations

Although the proposed ESPCN-YOLO framework demonstrated significant improvements in PPE detection accuracy under low-light and small object conditions, several limitations should be acknowledged. First, the dataset used for training and evaluation, while diverse, may not fully represent the wide range of real-world variations in PPE types, worker postures, and industrial environments. Second, the experiments were conducted using specific camera settings and distances; generalization to other hardware configurations or deployment conditions may require further adjustment. Third, although the ESPCN model is effective, it increases computational overhead, which may impact real-time performance on low-power or embedded systems. Addressing these limitations in future work will be essential for improving the scalability and practical applicability of the framework.

5.3. Recommendations

Based on the results of this study, several recommendations are proposed to further enhance the performance and applicability of the ESPCN-YOLO framework for PPE detection:
Deployment in Edge AI Systems: Future work should explore the deployment of the ESPCN-YOLO framework in edge AI devices to improve real-time monitoring capabilities in resource-constrained environments. Integrating this framework with mobile or embedded systems could provide real-time feedback to enhance safety compliance in construction, manufacturing, and other industrial settings.
Incorporating Transformer Models: Considering the success of recent Vision Transformers (ViTs) and Swin Transformers in improving object detection accuracy, future research should investigate combining ESPCN super-resolution with these models. Such integration may offer enhanced feature extraction capabilities, particularly under complex scenarios involving occlusion, motion blur, and extreme lighting variations.
Improvement of Training Datasets: Expanding the training dataset to include a broader range of environmental conditions, PPE types, and industrial scenarios will improve the model’s generalization capabilities. Special emphasis should be placed on incorporating more instances of small objects, varying lighting conditions, and diverse camera angles.
Optimizing Brightness Normalization Techniques: Implementing adaptive brightness normalization techniques in conjunction with ESPCN preprocessing could further improve robustness under extreme lighting conditions. Techniques such as Contrast-Limited Adaptive Histogram Equalization (CLAHE) or Retinex-based enhancement could be employed to ensure consistent detection accuracy across various brightness levels; a minimal illustrative sketch of CLAHE preprocessing is provided after this list of recommendations.
Enhancing Detection Efficiency: To improve computational efficiency, optimizing the ESPCN architecture or implementing lighter-weight versions of YOLOv8 may be necessary. Techniques such as model pruning, quantization, or knowledge distillation could be explored to reduce computational complexity without sacrificing accuracy.
Integration with Alert Systems: The proposed framework could be extended by integrating real-time alert systems that provide instant feedback upon detecting PPE non-compliance. Incorporating email or messaging notifications could enhance safety monitoring and ensure timely intervention in hazardous environments.
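As referenced in the brightness-normalization recommendation above, the following is a minimal sketch of CLAHE preprocessing with OpenCV, applied to the luminance channel so that color balance is preserved. The clip limit, tile size, and file name are illustrative values rather than settings validated in this study.

```python
import cv2

def clahe_normalize(bgr_image, clip_limit=2.0, tile_grid=(8, 8)):
    """Apply CLAHE to the L channel of a BGR image and return the normalized image."""
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    l_eq = clahe.apply(l)
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)

# Example: normalize a dark frame before super-resolution and detection.
dark_frame = cv2.imread("dark_site_frame.jpg")   # illustrative file name
normalized = clahe_normalize(dark_frame)
```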
The findings from this study provide a solid foundation for advancing PPE detection systems through the use of super-resolution techniques. By addressing the outlined recommendations, future research can further enhance the accuracy, robustness, and efficiency of detection frameworks, contributing to improved occupational safety and regulatory compliance across various industries.

Author Contributions

Conceptualization, S.M. and N.W.; Methodology, N.W.; Software, E.K.; Validation, N.W. and E.K.; Formal Analysis, S.M. and N.W.; Investigation, E.K.; Resources, N.W. and E.K.; Data curation, N.W. and E.K.; Writing—original draft preparation, N.W. and E.K.; Writing—review and editing, S.M. and N.W.; Visualization, N.W.; Supervision, S.M.; Project administration, N.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Faculty of Engineering, Kasetsart University.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. International Labour Organization. Safety and Health at Work. Available online: https://www.ilo.org/topics/safety-and-health-work (accessed on 24 March 2025).
2. Webert, H.; Döß, T.; Kaupp, L.; Simons, S. Fault Handling in Industry 4.0: Definition, Process and Applications. Sensors 2022, 22, 2205.
3. Gbadamosi, A.; Oyedele, L.O.; Delgado, J.M.D.; Kusimo, H.; Akanbi, L.; Olawale, O.; Muhammed-yakubu, N. IoT for predictive assets monitoring and maintenance: An implementation strategy for the UK rail industry. Autom. Constr. 2021, 122, 103486.
4. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149.
6. Teske, B.E.; Adjekum, D.K. Understanding the relationship between High Reliability Theory (HRT) of mindful organizing and Safety Management Systems (SMS) within the aerospace industry: A cross-sectional quantitative assessment. J. Saf. Sci. Resil. 2022, 3, 105–114.
7. Xie, H.; Li, P. A density-based evolutionary clustering algorithm for intelligent development. Eng. Appl. Artif. Intell. 2021, 104, 104396.
8. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
9. Zhang, S.; Teizer, J.; Pradhananga, N.; Eastman, C.M. Workforce location tracking to model, visualize and analyze workspace requirements in building information models for construction safety planning. Autom. Constr. 2015, 60, 74–86.
10. Kelm, A.; Laußat, L.; Meins-Becker, A.; Platz, D.; Khazaee, M.J.; Costin, A.M.; Helmus, M.; Teizer, J. Mobile passive radio frequency identification (RFID) portal for automated and rapid control of personal protective equipment (PPE) on construction sites. Autom. Constr. 2013, 36, 38–52.
11. Zhang, H.; Yan, X.; Li, H.; Jin, R.; Fu, H.F. Real-time alarming, monitoring, and locating for non-hard-hat use in construction. J. Constr. Eng. Manag. 2019, 145, 04019006.
12. Rubaiyat, A.H.; Toma, T.T.; Kalantari-Khandani, M.; Rahman, S.A.; Chen, L.; Ye, Y.; Pan, C.S. Automatic detection of helmet uses for construction safety. In Proceedings of the 2016 IEEE International Conference on Web Intelligence Workshops (WIW), Omaha, NE, USA, 13–16 October 2016.
13. Shrestha, K.; Shrestha, P.P.; Bajracharya, D.; Yfantis, E.A. Hard-hat detection for construction safety visualization. J. Constr. Eng. Manage. 2015, 2015, 721380.
14. Du, S.; Shehata, M.; Badawy, W. Hard hat detection in video sequences based on face features, motion and color information. In Proceedings of the 2011 3rd International Conference on Computer Research and Development, Shanghai, China, 11–13 March 2011.
15. Choo, H.; Lee, B.; Kim, H.; Cho, B. Automated detection of construction work at heights and deployment of safety hooks using IMU with a barometer. Autom. Constr. 2023, 147, 104714.
16. Ali, L.; Alnajjar, F.; Parambil, M.M.A.; Younes, M.I.; Abdelhalim, Z.I.; Aljassmi, H. Development of YOLOv5-based real-time smart monitoring system for increasing lab safety awareness in educational institutions. Sensors 2022, 22, 8820.
17. Akbarzadeh, M.; Zhu, Z.; Hammad, A. Nested network for detecting PPE on large construction sites based on frame segmentation. In Proceedings of the Creative Construction e-Conference 2020, Budapest University of Technology and Economics, Amadria Park, Conference Park 7/25, Opatija, Croatia, 28 June–1 July 2020.
18. Fang, Q.; Li, H.; Luo, X.; Ding, L.; Luo, H.; Rose, T.M.; An, W. Detecting non-hardhat-use by a deep learning method from far field surveillance videos. Autom. Constr. 2018, 85, 1–9.
19. Saudi, M.M.; Ma’arof, A.H.; Ahmad, A.; Saud, A.S.M.; Ali, M.H.; Narzullaey, A.; Ghazali, M.I.M. Image detection model for construction worker safety conditions using faster R-CNN. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 246–250.
20. Wang, Z.; Wu, Y.; Yang, L.; Thirunavukarasu, A.; Evison, C.; Zhao, Y. Fast personal protective equipment detection for real construction sites using deep learning approaches. Sensors 2021, 21, 3478.
21. Ferdous, M.; Ahsan, S.M.M. PPE detector: A YOLO-based architecture to detect personal protective equipment (PPE) for construction sites. PeerJ Comput. Sci. 2022, 8, e999.
22. Kim, M.; Park, D.; Han, D.K.; Ko, H. A novel framework for extremely low-light video enhancement. In Proceedings of the 2014 IEEE International Conference on Consumer Electronics (ICCE), Shenzhen, China, 9–13 April 2014.
23. Wang, W.; Wu, X.; Yuan, X.; Gao, Z. An experiment-based review of low-light image enhancement methods. IEEE Access 2020, 8, 87884–87917.
24. Acharya, T.; Ray, A.K. Image Processing: Principles and Applications; John Wiley & Sons: Hoboken, NJ, USA, 2005; pp. 1–428.
25. Lee, S.; Kim, N.; Paik, J. Adaptively partitioned block-based contrast enhancement and its application to low light-level video surveillance. SpringerPlus 2015, 4, 431.
26. Land, E.H.; McCann, J.J. Lightness and retinex theory. JOSA 1971, 61, 1–11.
27. Jobson, D.J.; Rahman, Z.-u.; Woodell, G.A. Properties and performance of a center/surround retinex. IEEE Trans. Image Process. 1997, 6, 451–462.
28. Ma, Q.; Wang, Y.; Zeng, T. Retinex-based variational framework for low-light image enhancement and denoising. IEEE Trans. Multimed. 2023, 25, 5580–5588.
29. Lei, C.; Tian, Q. Low-light image enhancement algorithm based on deep learning and retinex theory. Appl. Sci. 2023, 13, 10336.
30. Liu, X.; Xie, Q.; Zhao, Q.; Wang, H.; Meng, D. Low-light image enhancement by retinex-based algorithm unrolling and adjustment. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 15758–15771.
31. Jiang, Y.; Zhu, J.; Li, L.; Ma, H. A joint network for low-light image enhancement based on retinex. Cogn. Comput. 2024, 16, 3241–3259.
32. Zhao, C.; Yue, W.; Wang, Y.; Wang, J.; Luo, S.; Chen, H.; Wang, W. Low-light image enhancement integrating retinex-inspired extended decomposition with a plug-and-play framework. Mathematics 2024, 12, 4025.
33. Jiang, Y.; Gong, X.; Liu, D.; Cheng, Y.; Fang, C.; Shen, X.; Yang, J.; Zhou, P.; Wang, Z. EnlightenGAN: Deep light enhancement without paired supervision. IEEE Trans. Image Process. 2021, 30, 2340–2349.
34. Guo, C.; Li, C.; Guo, J.; Loy, C.C.; Hou, J.; Kwong, S.; Cong, R. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
35. Keys, R. Cubic convolution interpolation for digital image processing. IEEE Trans. Acoust. Speech Signal Process. 1981, 29, 1153–1160.
36. Turkowski, K. Filters for common resampling tasks. In Graphics Gems, 1st ed.; Andrew, S.G., Ed.; Academic Press Inc.: London, UK, 1990; Volume 1, pp. 147–165.
Figure 1. Partial samples of training dataset.
Figure 2. Partial samples of resolution and distances experiments: (a) 640 × 480 pixels, 4 m; (b) 640 × 480 pixels, 14 m; (c) 1920 × 1080 pixels, 4 m; (d) 1920 × 1080 pixels, 14 m.
Figure 3. Partial samples of brightness variation experiments: (a) 640 × 480 pixels, 12 m, +10% brightness; (b) 640 × 480 pixels, 12 m, +70% brightness; (c) 640 × 480 pixels, 12 m, −30% brightness; (d) 640 × 480 pixels, 12 m, −90% brightness.
Figure 4. Partial samples of onsite data acquisition.
Figure 5. Structure of the ESPCN model for super-resolution enhancement. Blue indicates convolutional layers, green shows the sub-pixel shuffling operation, and gray and orange represent input and output images, respectively. Colors denote different functional stages in the upscaling process.
Figure 6. Object detection process.
Figure 7. Partial samples of brightness variation experiments: (a) YOLOv8 baseline, −30% brightness; (b) ESPCN super-resolution (3×), −30% brightness; (c) YOLOv8 baseline, −60% brightness; (d) ESPCN super-resolution (3×), −60% brightness; (e) YOLOv8 baseline, −90% brightness; (f) ESPCN super-resolution (3×), −90% brightness.
Figure 8. Partial samples of resolution enhancement experiments under distance conditions: (a) YOLOv8 baseline; (b) bilinear interpolation (2×); (c) bilinear interpolation (3×); (d) ESPCN super-resolution (2×); (e) ESPCN super-resolution (3×).
Figure 9. Detection comparison using on-site video footage: (a) original frame (720 × 540 pixels); (b) super-resolved frame using ESPCN (3×) (2160 × 1620 pixels). Detection improved from 2 persons to 3 persons, along with detection of 1 helmet and 1 pair of shoes.
Table 1. The performance of the trained YOLOv8 model.
Class        Box Precision    Recall    mAP@0.5    mAP@0.5:0.95
All (avg)    0.916            0.876     0.922      0.741
Helmet       0.922            0.946     0.956      0.763
Person       0.931            0.948     0.962      0.828
Shoe         0.924            0.819     0.920      0.713
Vest         0.939            0.936     0.975      0.832
Table 2. Accuracy results under varying brightness conditions.
Brightness    Person %    Helmet %    Shoes %    Vest %
+70%          100.00      61.02       0.00       90.68
+60%          100.00      63.56       0.00       98.31
+50%          100.00      99.15       20.34      100.00
+40%          100.00      99.15       26.69      100.00
+20%          100.00      97.46       22.88      100.00
+10%          100.00      99.15       22.88      100.00
Neutral       100.00      99.16       7.98       100.00
−30%          100.00      61.86       0.00       100.00
−60%          100.00      5.08        0.00       100.00
−90%          100.00      0.00        0.00       65.25
Table 3. Improved accuracy results after applying ESPCN under varying brightness conditions.
Brightness    Helmet Detection Improvement (%)    Shoe Detection Improvement (%)
−30%          61.89 → 100.00                      0.00 → 38.46
−60%          5.08 → 100.00                       0.00 → 44.87
−90%          0.00 → 0.00                         0.00 → 41.88
Table 4. Accuracy results under varying distance conditions.
Resolution     Distance (m)    Person %    Helmet %    Shoes %    Vest %    Person Pixel    Helmet Pixel    Shoe_1 Pixel    Shoe_2 Pixel    Vest Pixel
640 × 480      4               100.00      100.00      100.00     100.00    28,675.00       968.70          537.78          592.29          4969.09
               6               100.00      100.00      100.00     100.00    13,267.37       527.94          260.46          284.38          2378.94
               8               100.00      100.00      94.44      100.00    7357.54         329.99          147.04          166.08          1356.59
               9               100.00      100.00      87.60      100.00    5777.25         250.62          111.56          128.63          1109.12
               11              100.00      100.00      12.40      100.00    4297.61         185.50          101.37          0.00            759.99
               12              100.00      99.16       7.98       100.00    3317.76         164.75          88.95           0.00            688.41
               13              100.00      31.86       0.00       100.00    2767.21         139.72          0.00            0.00            592.96
               14              100.00      0.00        0.00       100.00    2282.73         0.00            0.00            0.00            547.61
1920 × 1080    4               100.00      100.00      100.00     100.00    190,069.19      6165.34         3735.29         4362.38         34,415.69
               6               100.00      100.00      100.00     100.00    89,580.70       2861.51         1491.44         1550.45         15,961.13
               8               100.00      100.00      99.54      100.00    53,007.95       1759.79         646.44          818.35          9841.44
               9               100.00      100.00      100.00     100.00    42,255.44       1413.90         581.86          607.44          7932.50
               11              100.00      100.00      100.00     100.00    28,315.86       1042.94         389.25          382.07          5293.44
               12              100.00      100.00      100.00     100.00    23,868.94       846.62          377.83          394.08          4459.13
               13              100.00      100.00      100.00     100.00    20,216.22       746.83          386.82          348.52          3892.34
               14              100.00      100.00      100.00     100.00    18,099.20       630.61          312.92          315.48          3340.51
Table 5. Improved accuracy results after applying ESPCN/bilinear interpolation under varying distance conditions.
Method                          Person %    Helmet %    Shoe %    Vest %
Baseline YOLOv8                 100.00      31.86       0.00      100.00
Bilinear Interpolation (2×)     100.00      100.00      0.00      100.00
Bilinear Interpolation (3×)     100.00      100.00      29.65     100.00
ESPCN Super-Resolution (2×)     100.00      100.00      4.46      100.00
ESPCN Super-Resolution (3×)     100.00      100.00      49.55     100.00