Peer-Review Record

Integrated Construction-Site Hazard Detection System Using AI Algorithms in Support of Sustainable Occupational Safety Management

Sustainability 2025, 17(23), 10584; https://doi.org/10.3390/su172310584
by Zuzanna Woźniak 1,*, Krzysztof Trybuszewski 2, Tomasz Nowobilski 1, Marta Stolarz 3 and Filip Šmalec 4
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Reviewer 4: Anonymous
Reviewer 5: Anonymous
Submission received: 1 October 2025 / Revised: 8 November 2025 / Accepted: 15 November 2025 / Published: 26 November 2025
(This article belongs to the Section Sustainable Engineering and Science)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The study outlines the development of a real-time visual detection system for construction site safety, dwelling on two major areas: absence of safety helmets and falls of site workers. This is a timely study which seeks to proffer answers to one of the fundamental issues plaguing the delivery of construction projects. However, the study can be improved in the following areas:

The study presents the integration of two existing, prior-trained models (YOLO); the authors have not proposed a novel architecture or algorithm within the context of the problem being tackled. The system is essentially a straightforward application and integration of off-the-shelf components (YOLO, a cloud API, a web dashboard), a task that is well within the capabilities of standard software engineering. Also, the reliance solely on the bounding box aspect ratio is overly simplistic. This method can also be prone to errors from camera perspective and human pose diversity.

Furthermore, the authors should carefully assess the resulting outcome of the fall detection (a recall of 0.45 and accuracy of 0.62). Also, the system is not compared to any baseline (or benchmark), making it difficult to evaluate the system objectively. Moreover, the study claims the system is "lightweight" and has "low hardware requirements," but provides no data on computational cost, inference speed (FPS), or power consumption to support this claim. Using a cloud-based architecture also introduces latency and bandwidth requirements that may not be "low" for all SMEs.

The authors should consider the above highlighted points for the improvement of the manuscript. 

Author Response

Integrated construction site hazard detection system using AI algorithms: case study

Thank you for all the valuable comments in this review. We hope that the correction of the article according to the Editor's and Reviewers' suggestions will significantly improve its quality. Below are responses to the Reviewer's comments. Changes in the text are marked in red.

The study outlines the development of a real-time visual detection system for construction site safety, dwelling on two major areas: absence of safety helmets and falls of site workers. This is a timely study which seeks to proffer answers to one of the fundamental issues plaguing the delivery of construction projects. However, the study can be improved in the following areas:

  • Comment 1: The study presents the integration of two existing, prior-trained models (YOLO); the authors have not proposed a novel architecture or algorithm within the context of the problem being tackled. The system is essentially a straightforward application and integration of off-the-shelf components (YOLO, a cloud API, a web dashboard), a task that is well within the capabilities of standard software engineering. Also, the reliance solely on the bounding box aspect ratio is overly simplistic. This method can also be prone to errors from camera perspective and human pose diversity.

 

Response: YOLOv8 serves as the basic detection framework, and we believe that the contribution of this work goes beyond algorithmic novelty to address a critical and well-documented research gap: the barrier to practical implementation of safety systems in small and medium-sized enterprises (SMEs).

As clearly stated in section 2.3 of our paper:

‘Despite the growing number of solutions using artificial intelligence algorithms to monitor occupational safety hazards, their practical implementation on construction sites faces numerous difficulties. The main barriers include the complexity of integration, the lack of standardised interfaces and the limited availability of technology for smaller companies.’

Recent empirical research confirms that high initial implementation costs, the lack of standardised integration protocols and computational requirements are the main obstacles to the adoption of AI in construction SMEs.

Our innovative contribution is a system-level architecture that creates a closed loop of ‘detection – warning – recording – traceability’, which eliminates these documented barriers by:

  1. Open API integration enabling seamless connection to existing construction site management systems via RESTful architecture with asynchronous operations (section 3.2)
  2. Lightweight cloud architecture that reduces hardware requirements to ordinary edge devices, achieving an average inference time of 123.88 milliseconds for no-helmet detection (NHD) and 131.84 milliseconds for fall detection (FD)
  3. Event logging and traceability module that creates a complete audit trail with timestamp, location, and event type to ensure regulatory compliance (see the sketch after this list)
  4. Web-based administration interface providing real-time monitoring accessible via computer and smartphone.
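To make point 3 above concrete, here is a minimal sketch of an audit-trail record of the kind described; the field names and the JSON-lines storage format are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: one incident record per line forms a simple audit trail.
# Field names (timestamp, camera_id, event_type, confidence) are assumptions.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class SafetyEvent:
    timestamp: str      # ISO-8601 detection time
    camera_id: str      # camera that observed the event
    event_type: str     # e.g. "no_helmet" or "fall"
    confidence: float   # detector confidence score

def log_event(event: SafetyEvent, path: str = "events.log") -> None:
    """Append one event as a JSON line to the audit trail."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(event)) + "\n")

log_event(SafetyEvent(datetime.now(timezone.utc).isoformat(),
                      "cam-01", "no_helmet", 0.91))
```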

 

Regarding the bounding-box aspect-ratio limitations, in Section 3.2 we added:

The width-to-height ratio threshold (width ≥ 2 × height) used to identify potential human falls was determined empirically during preliminary tests as a simplified heuristic for detecting horizontal body postures, aimed at ensuring computational efficiency in the prototype implementation.
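For concreteness, a minimal sketch of how such a heuristic might be implemented; the (x1, y1, x2, y2) box convention, function name, and example values are assumptions for illustration, not code from the paper.

```python
# Hedged sketch of the width >= 2 x height fall heuristic quoted above.
def is_potential_fall(box: tuple[float, float, float, float],
                      ratio_threshold: float = 2.0) -> bool:
    """Flag a detected person as a potential fall when the bounding box
    is at least ratio_threshold times wider than it is tall."""
    x1, y1, x2, y2 = box
    width, height = x2 - x1, y2 - y1
    return height > 0 and width >= ratio_threshold * height

print(is_potential_fall((10, 40, 210, 120)))  # width 200, height 80 -> True
print(is_potential_fall((10, 10, 60, 160)))   # upright posture -> False
```

As the response notes, a single fixed ratio cannot separate a lying person from a crouching one viewed obliquely, which is consistent with the error pattern reported in Table 8.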

 

Additionally, we conducted systematic evaluation of these limitations (Section 4.4, Table 8), analysing misclassified fall-detection cases across multiple camera viewpoints (front, side, top, oblique) and illumination conditions (normal, low/uneven). Our results demonstrate:​

  • Highest error rates occurred at oblique camera angles under uneven lighting (13 misclassifications)
  • Front viewpoint under normal illumination showed only 2 misclassifications
  • The model maintains high precision (1.00 for FD) while recall is affected (0.45)

 

We provided additional explanation:

"Beyond the counts in Table 8, misclassifications cluster into three recurring failure modes: (i) geometric ambiguity under oblique viewpoints, where the width–height proxy becomes non-discriminative; (ii) appearance degradation due to low/uneven illumination that suppresses texture/edge cues for small human silhouettes; and (iii) partial occlusion by materials or equipment."

  • Comment 2: Furthermore, the authors should carefully assess the resulting outcome of the fall detection (a recall of 0.45 and accuracy of 0.62). Also, the system is not compared to any baseline (or benchmark), making it difficult to evaluate the system objectively. Moreover, the study claims the system is "lightweight" and has "low hardware requirements," but provides no data on computational cost, inference speed (FPS), or power consumption to support this claim. Using a cloud-based architecture also introduces latency and bandwidth requirements that may not be "low" for all SMEs.

 

Response: Firstly, the values obtained for the fall detection module (sensitivity = 0.45, accuracy = 0.62) result mainly from the use of a simplified geometric heuristic adopted for the prototype implementation. As described in Section 3.2, the threshold for the width-to-height ratio (width ≥ 2 × height) was empirically determined to ensure computational efficiency and low latency when running on edge devices, while maintaining high precision in the initial deployment phase. The lower sensitivity stems from the model's susceptibility to variations in camera angle and lighting conditions, which is confirmed by the error analysis in Table 8.

Secondly, regarding the lack of a reference benchmark, the present work focused on validating the system concept and real-time operation. Benchmark comparison will be included in future research once the extended dataset and evaluation framework are established.

The solution is based on a lightweight YOLOv8s architecture, chosen for its good performance with low resource consumption on edge-class devices (e.g., budget mini-PCs). Actual processing times per frame, as shown in the added Table 9, averaged 123.88 ms for helmet detection and 131.84 ms for fall detection (tested on 100 samples of each category). The minimum recorded times were 95.88 ms and 88.46 ms, respectively, which translates to an effective throughput of 7–10 FPS on average hardware without a dedicated GPU. Such parameters enable system deployment in SME environments, particularly within edge architectures.
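As an illustration of how such per-frame figures can be obtained, a minimal timing sketch using the Ultralytics YOLOv8 API; the weights file and frame paths are placeholders, and the measured values will of course depend on the hardware used.

```python
# Hedged sketch: measure per-frame inference time and effective FPS.
import time
from statistics import mean
from ultralytics import YOLO

model = YOLO("yolov8s.pt")  # lightweight variant discussed above
frames = [f"frames/sample_{i:03d}.jpg" for i in range(100)]  # placeholders

times_ms = []
for path in frames:
    t0 = time.perf_counter()
    model.predict(path, verbose=False)           # single-frame inference
    times_ms.append((time.perf_counter() - t0) * 1000.0)

print(f"mean {mean(times_ms):.2f} ms, min {min(times_ms):.2f} ms, "
      f"max {max(times_ms):.2f} ms, ~{1000.0 / mean(times_ms):.1f} FPS")
```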

For the cloud-processing variant, the architecture was equipped with frame buffering and asynchronous rendering mechanisms to minimize the risk of data loss even under transmission delays. However, it was observed that throughput and latency issues may depend on specific infrastructure and should therefore be verified during deployment. The system is thus designed to be flexible, allowing implementation in both local and cloud environments.

 

  • The authors should consider the above highlighted points for the improvement of the manuscript. 

Response: We would like to express our sincere gratitude to the reviewer for their constructive comments and valuable suggestions, which have greatly contributed to improving the quality and clarity of our manuscript. All issues raised have been thoroughly addressed in the revised version.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The paper proposes a real-time visual detection system based on YOLOv8 to identify the absence of safety helmets and the risk of worker falls on construction sites, integrating an API, event logging, and a web interface, tailored to the low hardware requirements of small and medium-sized enterprises. Overall, the paper holds practical application value, and the following are my main concerns:
- Introduction: While the importance of helmet and fall detection is emphasized, a brief overview of the current state of automated detection research should be added to strengthen the scientific significance, establish the research background, and highlight the innovation, avoiding redundancy with the literature review.
- Literature Review: The identification of research gaps is reasonable, but the description is somewhat generalized and lacks sufficient evidence. Section 2.3 points out the complexity of integration and suitability for small enterprises as gaps, yet it relies only on general observations without specific references, weakening the persuasiveness of the gaps.
- Materials and Methods: 
1. The source of the dataset is not detailed, and the data annotation process is unclear, leading to insufficient reproducibility. 
2. YOLOv8 has been surpassed by more efficient versions (e.g., YOLOv10), and it is recommended to update to the latest model or explain the rationale for sticking with v8. 
3. Figure 4 should be carefully reviewed, as the confusion matrix contains an error with "True Positive" listed twice.
- Conclusions: The conclusion highlights the system's potential but should further specify its contributions in both theoretical and practical aspects.

Author Response

Thank you for all the valuable comments in this review. We hope that the correction of the article according to the Editor's and Reviewers' suggestions will significantly improve its quality. Below are responses to the Reviewer's comments. Changes in the text are marked in red.

The paper proposes a real-time visual detection system based on YOLOv8 to identify the absence of safety helmets and the risk of worker falls on construction sites, integrating an API, event logging, and a web interface, tailored to the low hardware requirements of small and medium-sized enterprises. Overall, the paper holds practical application value, and the following are my main concerns:



  • Comment 1: Introduction: While the importance of helmet and fall detection is emphasized, a brief overview of the current state of automated detection research should be added to strengthen the scientific significance, establish the research background, and highlight the innovation, avoiding redundancy with the literature review.

 

Response 1: In the revised version of the manuscript, two new paragraphs have been added to the Introduction section to strengthen the scientific background and clarify the context of the study:

In recent years, the automation of occupational safety monitoring through computer vision and artificial intelligence has gained significant attention. Numerous studies have focused on detecting the absence of personal protective equipment (PPE) or identifying unsafe worker behaviors using deep learning models, particularly convolutional neural networks (CNN) and real-time object detection frameworks such as YOLO, SSD, or Faster R-CNN. These approaches have demonstrated high accuracy in controlled or simulated conditions; however, their implementation on real construction sites remains limited due to high computational demands, lack of integration capabilities, and difficulties in adapting to variable lighting and environmental conditions. The prototype developed in this study aims to address these challenges by employing a lightweight YOLOv8s model, which ensures real-time performance without the need for high-end hardware. Combined with a simple web-based interface and an open API, the system provides a flexible foundation for further research and development of affordable and easily deployable safety monitoring tools for small and medium-sized construction enterprises.

 

 

  • Comment 2: Literature Review: The identification of research gaps is reasonable, but the description is somewhat generalized and lacks sufficient evidence. Section 2.3 points out the complexity of integration and suitability for small enterprises as gaps, yet it relies only on general observations without specific references, weakening the persuasiveness of the gaps.

 

Response 2: We have revised Section 2.3 by adding specific references and evidence to support the identified research gaps and strengthen the discussion on integration complexity and suitability for small enterprises:

Recent studies confirm that these challenges are not merely general observations but well-documented barriers in the implementation of computer vision systems for safety management in construction. Research to date [26, 27] highlights the technical and organisational complexity of integrating AI-based detection modules with BIM, IoT and site management platforms, largely due to the absence of standardised data exchange protocols and interoperability frameworks, which increases both the cost and the risk of deployment in heterogeneous environments. Furthermore, computational requirements and limited hardware capabilities remain a significant obstacle for small and medium-sized enterprises, as many high-performance AI models are unsuitable for real-time inference on low-cost edge devices. However, recent works demonstrate that lightweight architectures such as YOLOv8s or MobileNet variants can maintain adequate detection accuracy while significantly reducing resource consumption, thus enabling feasible on-site deployment [43]. These findings reinforce the relevance of the proposed solution, which addresses the identified gaps by combining a lightweight detection model, open API integration, and event traceability mechanisms, ensuring both scalability and affordability for SMEs.

 

  • Comment 3: Materials and Methods:

  1. The source of the dataset is not detailed, and the data annotation process is unclear, leading to insufficient reproducibility.
  2. YOLOv8 has been surpassed by more efficient versions (e.g., YOLOv10), and it is recommended to update to the latest model or explain the rationale for sticking with v8.
  3. Figure 4 should be carefully reviewed, as the confusion matrix contains an error with "True Positive" listed twice.

 

Response 3:

 

  1. Additional information has been added to the revised manuscript to clarify the origin and structure of the dataset, as well as the annotation procedure. The dataset was collected in situ by the authors during controlled sessions on an active construction site. All images were manually labeled.
    Due to copyright and personal image protection concerns, the dataset cannot be made publicly available. However, the authors can share the data upon reasonable request by contacting the corresponding author.
    The following paragraph has been added to the manuscript (Section 3.3):

“The dataset used in this study was collected in situ by the authors during controlled observation sessions conducted on an active construction site. All image data were gathered directly by the research team and manually annotated to identify the presence or absence of safety helmets and human falls.”

 

  2. We appreciate the reviewer’s insightful remark. We are aware of the release of newer YOLO versions, including YOLOv9, YOLOv10, and later releases, which introduce architectural and computational improvements. However, YOLOv8 was deliberately selected for this study due to its proven stability, broad adoption in academic and industrial research, and the availability of pre-trained weights and task-specific variants (e.g., hard-hat detection models). The aim of this research was to develop and validate a functional prototype rather than to perform state-of-the-art benchmarking. Therefore, YOLOv8 offered the best balance between detection accuracy, computational efficiency, and ease of deployment on low-resource edge devices.

To clarify this choice, the following sentence has been added to the Materials and Methods section:

“YOLOv8 was selected as the detection framework due to its stability and proven performance in industrial safety applications, which made it well suited for the prototype implementation on low-cost edge devices [46, 47].”

  3. Figure 4 has been corrected.



  • Comment 4: Conclusions: The conclusion highlights the system's potential but should further specify its contributions in both theoretical and practical aspects.

Response 4: The Conclusions section has been revised to explicitly specify both theoretical and practical contributions of the study. A new paragraph has been added at the end of the section as follows:

“From a theoretical standpoint, the conducted research contributes to the advancement of adaptable frameworks for real-time hazard detection based on deep learning and open API integration. The proposed architecture demonstrates how convolutional neural networks can be effectively utilized to identify occupational risks in dynamic and unstructured construction environments. From a practical perspective, the developed system confirms the feasibility of deploying AI-driven monitoring solutions in conditions characterized by limited computational resources and constrained infrastructure. Furthermore, the modular design of the platform provides a foundation for future extensions aimed at improving proactive safety management and reducing accident rates in the construction industry.”

This addition clarifies the scientific contribution of the paper and strengthens the overall conclusion by linking the research findings to both theoretical development and real-world implementation potential.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

I appreciate the authors for presenting an innovative manuscript that addresses the critical topic of construction safety, a domain that should be actively advanced through contemporary technologies to reduce hazards on job sites. While I commend the effort and the overall rigor, I have several substantive questions and comments that I encourage the authors to address in revision so the paper attains a higher level of scientific clarity and practical value for readers across research and industry.

First, the Abstract should clearly articulate the specific novelty introduced, whether, for example, the system integrates real-time video analysis rather than sequential photo analysis, and explicitly state how this approach departs from and improves upon current practice.

The Introduction reads clearly, but it under-emphasizes prior work that has deployed YOLOv8 in similar construction settings. A concise comparison to those studies, highlighting any trade-offs in accuracy, response time, robustness, or other metrics, would be particularly valuable. This problem currently appears in both the Introduction and the Literature Review sections and should be addressed in depth.

More broadly, the paper should define its core contribution relative to closely related efforts, given the substantial body of research on fall detection, posture recognition, and heavy-equipment pose detection. Relatedly, several research groups, particularly at universities in Hong Kong, have moved beyond detection alone. The field now increasingly emphasizes reactive measures that trigger real-time alerts and constrain unsafe actions, as well as proactive measures that sense environmental conditions, predict emerging hazards, and prevent incidents before they occur. In contrast, this study remains focused primarily on the initial detection stage with limited attention to alerting or intervention. In my opinion, the authors should clarify how their work advances the state of the art despite this focus and articulate its distinct contribution within this evolving landscape.

In the Methods section, please also detail the API architecture: describe the development steps, data flows, and sequence of operations, ideally supported by either pseudo-code or a process diagram to improve reproducibility.

With respect to presentation, several figures are overly wide and appear to lose resolution; consider reformatting them with a more vertical orientation and improving the underlying image quality.

Although the manuscript is promising, the Results and Discussion are comparatively brief and would benefit from deeper analysis and interpretation, e.g., error breakdowns, failure modes, ablation or sensitivity checks, and a more explicit linkage between findings and practical implications on site.

Please also verify the journal’s policies on publishing photographs of identifiable individuals; such images often require documented consent, and compliance may be needed from both the authors and the journal.

Finally, the references should be expanded to include the most relevant recent studies, particularly those that use YOLOv8 or adjacent real-time detection frameworks in comparable construction safety contexts.

I think, strengthening these elements that I have detailed will substantially enhance the manuscript’s rigor, clarity, and impact.

Comments on the Quality of English Language

Authors need to improve the English language in their writing.

Author Response

Integrated construction site hazard detection system using AI algorithms: case study

I appreciate the authors for presenting an innovative manuscript that addresses the critical topic of construction safety, a domain that should be actively advanced through contemporary technologies to reduce hazards on job sites. While I commend the effort and the overall rigor, I have several substantive questions and comments that I encourage the authors to address in revision so the paper attains a higher level of scientific clarity and practical value for readers across research and industry.

Thank you for all the valuable comments in this review. We hope that the correction of the article according to the Editor's and Reviewers' suggestions will significantly improve its quality. Below are responses to the Editor's and Reviewers' comments. Changes in the text are marked in red.

  • Comment 1: First, the Abstract should clearly articulate the specific novelty introduced, whether, for example, the system integrates real-time video analysis rather than sequential photo analysis, and explicitly state how this approach departs from and improves upon current practice.

Response: We appreciate the reviewer’s valuable comment. The Abstract has been revised to clearly emphasize the novelty of the proposed approach: “This study advances current practice by providing an integrated, low-resource solution that unites multi-hazard detection, event documentation, and system interoperability, thereby addressing a key gap in existing research and implementations.”

  • Comment 2: The Introduction reads clearly, but it under-emphasizes prior work that has deployed YOLOv8 in similar construction settings. A concise comparison to those studies, highlighting any trade-offs in accuracy, response time, robustness, or other metrics, would be particularly valuable. This problem currently appears in both the Introduction and the Literature Review sections and should be addressed in depth.

Response: We thank the reviewer for this insightful comment. In response, the analysis of prior studies employing YOLOv8 in construction-related safety applications has been substantially expanded. Table 1 has been revised and complemented with a new Table 2, which provides a detailed comparison of the reviewed studies, including dataset characteristics, performance metrics, applicable scenarios, and key innovations. In addition, a new comparative test and Table 3 have been added to highlight trade-offs in accuracy, response time, robustness, and hardware requirements among YOLO-based and alternative models. Furthermore, these findings have been summarized and critically discussed in the Research Gap subsection (Section 2.3) to better position the proposed system within the current state of the art. These enhancements strengthen both the Introduction and Literature Review sections and provide a clearer contextualization of the study’s novelty and contribution.

  • Comment 3: More broadly, the paper should define its core contribution relative to closely related efforts, given the substantial body of research on fall detection, posture recognition, and heavy-equipment pose detection. Relatedly, several research groups, particularly at universities in Hong Kong, have moved beyond detection alone. The field now increasingly emphasizes reactive measures that trigger real-time alerts and constrain unsafe actions, as well as proactive measures that sense environmental conditions, predict emerging hazards, and prevent incidents before they occur. In contrast, this study remains focused primarily on the initial detection stage with limited attention to alerting or intervention. In my opinion, the authors should clarify how their work advances the state of the art despite this focus and articulate its distinct contribution within this evolving landscape.

Response: We appreciate the reviewer’s insightful comment. We acknowledge that while the present study focuses primarily on the detection stage, its main contribution lies in establishing a scalable foundation for future development of reactive and predictive safety modules. At the same time, a comprehensive literature review and synthesis of the current state of research were also among the study’s primary objectives. The revised version highlights that the paper provides an extensive and structured comparative analysis of YOLOv8-based and related methods for construction-site safety monitoring, including trade-offs in accuracy, response time, robustness, and integration capabilities (Tables 1–3). The practical novelty therefore lies in consolidating and contextualizing existing research while bridging the gap between high-performance AI models and their real-world applicability in small and medium-sized construction enterprises, where resource constraints often limit the adoption of advanced safety systems.

  • Comment 4: In the Methods section, please also detail the API architecture: describe the development steps, data flows, and sequence of operations, ideally supported by either pseudo-code or a process diagram to improve reproducibility.

Response: We thank the Reviewer for this valuable comment. In response, an additional description of the API architecture has been included in the Methods section:

“The developed system uses a RESTful API that enables communication between the monitoring module, the detection model, and the database. The communication between system components follows a sequence of asynchronous operations designed to ensure real-time performance and secure data handling.

The typical sequence of operations is as follows:

  1. Image acquisition: the camera periodically captures an image frame and sends it to the backend server via an HTTP POST request.
  2. Data validation: the API verifies the request parameters and stores the received image in the temporary data repository.
  3. Detection request: the backend triggers the YOLOv8 detection module through an internal API call, passing the image reference and metadata.
  4. Analysis and classification: the detection module performs object detection (helmet absence / fall event) and returns a JSON response containing bounding box coordinates, class labels, and confidence scores.
  5. Incident registration: if a safety violation is detected, the API automatically records the event in the incident database with time, camera ID, and event type.
  6. Alert notification: the system updates the web-based interface through a WebSocket channel, displaying the new event and enabling user acknowledgment.”
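A minimal sketch of this six-step sequence; the endpoint path, the detect() placeholder, and the in-memory incident store are illustrative assumptions (here using FastAPI), not the authors' actual implementation.

```python
# Hedged sketch of the detection -> registration flow described above.
from datetime import datetime, timezone
from fastapi import FastAPI, File, UploadFile

app = FastAPI()
incidents: list[dict] = []  # stand-in for the incident database (step 5)

def detect(image_bytes: bytes) -> list[dict]:
    """Placeholder for the internal YOLOv8 call (steps 3-4); would return
    bounding boxes, class labels, and confidence scores as JSON."""
    return [{"class": "no_helmet", "conf": 0.91, "box": [12, 30, 88, 160]}]

@app.post("/frames")                          # step 1: camera POSTs a frame
async def receive_frame(camera_id: str, frame: UploadFile = File(...)):
    data = await frame.read()                 # step 2: validate and store
    detections = detect(data)
    for d in detections:                      # step 5: incident registration
        incidents.append({"time": datetime.now(timezone.utc).isoformat(),
                          "camera_id": camera_id, "event": d["class"]})
    return {"detections": detections}         # step 6 would push via WebSocket
```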
  • Comment 5: With respect to presentation, several figures are overly wide and appear to lose resolution; consider reformatting them with a more vertical orientation and improving the underlying image quality.

Response: We thank the reviewer for this helpful observation. Due to the nature of the presented scenes, particularly those depicting individuals in a lying position, the authors decided to retain a horizontal orientation for certain figures to ensure better readability and a wider view of the construction site and its surroundings. High-resolution versions of all images have been provided as supplementary attachments to the manuscript. We hope that these higher-quality files will be used during the publication process, as image resolution tends to decrease automatically when embedded in the Word document.

  • Comment 6: Although the manuscript is promising, the Results and Discussion are comparatively brief and would benefit from deeper analysis and interpretation, e.g., error breakdowns, failure modes, ablation or sensitivity checks, and a more explicit linkage between findings and practical implications on site.

Response: Thank you for the helpful suggestion. We have substantially expanded the Results and Discussion sections to provide deeper analysis and interpretation while keeping changes lightweight. Specifically, we: (i) present an explicit error breakdown and a concise failure-mode taxonomy (oblique viewpoints, low/uneven illumination, partial occlusion), (ii) add a qualitative sensitivity check for the posture aspect-ratio threshold (a proxy for ablation, without new experiments), (iii) articulate practical, on-site implications (camera placement, illumination provisioning, multi-camera/short-clip support for FD), and (iv) report runtime characteristics confirming near-real-time feasibility. We also broadened the references with recent studies.

  • Comment 7: Please also verify the journal’s policies on publishing photographs of identifiable individuals; such images often require documented consent, and compliance may be needed from both the authors and the journal.

Response: We thank the reviewer for this important comment. All individuals depicted in the photographs have provided informed consent for the publication of their image in this article, in accordance with the journal’s policies.

  • Comment 8: Finally, the references should be expanded to include the most relevant recent studies, particularly those that use YOLOv8 or adjacent real-time detection frameworks in comparable construction safety contexts.

Response: We thank the reviewer for this helpful comment. The literature review section has been expanded to include several additional recent studies related to YOLOv8 and other real-time detection frameworks applied in construction safety contexts. The following publications have been added to the revised manuscript:

  1. Wang, Z.; Wu, Y.; Yang, L.; Thirunavukarasu, A.; Evison, C.; Zhao, Y. Fast Personal Protective Equipment Detection for Real Construction Sites Using Deep Learning Approaches. Sensors 2021, Vol. 21, Page 3478, 2021, 21 (10), 3478. https://doi.org/10.3390/S21103478.
  2. Fang, Q.; Li, H.; Luo, X.; Ding, L.; Luo, H.; Rose, T. M.; An, W. Detecting Non-Hardhat-Use by a Deep Learning Method from Far-Field Surveillance Videos. Autom Constr, 2018, 85, 1–9. https://doi.org/10.1016/J.AUTCON.2017.09.018.
  3. Wang, Z.; Cai, Z.; Wu, Y. An Improved YOLOX Approach for Low-Light and Small Object Detection: PPE on Tunnel Construction Sites. J Comput Des Eng, 2023, 10 (3), 1158–1175. https://doi.org/10.1093/JCDE/QWAD042.
  4. An, Q.; Xu, Y.; Yu, J.; Tang, M.; Liu, T.; Xu, F. Research on Safety Helmet Detection Algorithm Based on Improved YOLOv5s. Sensors (Basel), 2023, 23 (13), 5824. https://doi.org/10.3390/S23135824.
  5. Ren, H.; Fan, A.; Zhao, J.; Song, H.; Liang, X. Lightweight Safety Helmet Detection Algorithm Using Improved YOLOv5. J Real Time Image Process, 2024, 21 (4). https://doi.org/10.1007/S11554-024-01499-5.

 

These updates broaden the theoretical background and improve the completeness of the literature review.

  • Comment 9: I think, strengthening these elements that I have detailed will substantially enhance the manuscript’s rigor, clarity, and impact.

Response: We sincerely thank the reviewer for their constructive feedback and valuable suggestions. All recommended improvements have been carefully implemented to enhance the manuscript’s rigor, clarity, and overall impact.

  • Comment 10: Authors need to improve the English language in their writing.

Response: We appreciate the reviewer’s comment. The entire manuscript has been carefully proofread, and the English language has been revised and improved throughout the paper. We hope the revised version reads more clearly now, and we thank the reviewer for this helpful suggestion.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

1) The paper does not clearly define the core scientific questions or technical challenges of the research in the introduction. Suggest clearly listing the research objectives and key issues to be addressed at the end of the introduction.
2) The literature review lacks systematic comparison. Table 1 lists multiple related studies, but does not systematically compare them (such as model performance, data size, applicable scenarios), nor does it clearly indicate the innovation and differences of this study.
3) The paper did not specify the source, scale, annotation method, and data augmentation strategy of the dataset used to train the YOLOv8 model, which affected the reproducibility of the experiment.
4) Each detection task uses 100 images for testing, and the sample size is not sufficient to ensure statistical significance, especially for rare events such as 'fall detection'. Please explain the reason.
5) The paper uses "width ≥ 2 × height" as the basis for fall judgment but does not provide the source or verification process for this threshold, leaving it without theoretical or experimental support.

Author Response

Thank you for all the valuable comments in this review. We hope that the correction of the article according to the Editor's and Reviewers' suggestions will significantly improve its quality. Below are responses to the Editor's and Reviewers' comments. Changes in the text are marked in red.

 

  • Comment 1: The paper does not clearly define the core scientific questions or technical challenges of the research in the introduction. Suggest clearly listing the research objectives and key issues to be addressed at the end of the introduction.

 

Response 1: The following fragment has been added to the introduction section:

Based on the identified research gap, the main objective of this study is to design and experimentally evaluate a prototype system for the automatic detection of key safety hazards on construction sites using computer vision and deep learning methods. From a technical perspective, the research addresses several important challenges related to the development of AI-based detection systems, including the need to ensure reliable real-time performance, achieve accurate classification with limited training data, and integrate detection, alerting, and event recording within a unified modular architecture. Accordingly, the study seeks to answer the following research questions:

RQ1: How effective is the proposed YOLOv8-based model in detecting the absence of safety helmets and worker falls within the designed prototype system?

RQ2: How can detection, alert generation, and event recording be conceptually integrated through an open API to support real-time safety monitoring?

RQ3: What are the main factors influencing detection accuracy and system performance under test conditions, and how can they inform further development of the model and its architecture?

 

  • Comment 2: The literature review lacks systematic comparison. Table 1 lists multiple related studies, but does not systematically compare them (such as model performance, data size, applicable scenarios), nor does it clearly indicate the innovation and differences of this study.

 

Response 2:

The information originally presented in Table 1 has been expanded and reorganized into two coordinated tables. The new tables add study-by-study details, including dataset size (and splits where available) and model characteristics. In Section 2.2, we introduced brief edits to explicitly reference these tables and we further clarified the specific contribution of our study.

 

 

  • Comment 3: The paper did not specify the source, scale, annotation method, and data augmentation strategy of the dataset used to train the YOLOv8 model, which affected the reproducibility of the experiment.

 

Response 3:

 

Thank you for pointing this out. We have added a detailed description of the dataset used for model training and validation, including its origin, size, annotation process, and augmentation methods. As clarified in the revised manuscript (Section 3.3), the dataset consisted of 200 manually annotated images collected directly from an active construction site, covering both “no hardhat” and “fall” classes. All images were labeled using the open-source LabelImg tool in YOLO format, and standard augmentation techniques (rotation, flipping, brightness adjustment, and Gaussian noise) were applied to increase dataset variability and model robustness. These details have been included to improve reproducibility and methodological transparency.

 

The following passage was added to the article: The proprietary dataset used for model training and validation consisted of 200 RGB images (100 for helmet detection and 100 for fall detection) captured on an active construction site under varying lighting and weather conditions using a 1080p IP camera. All images were manually annotated using the open-source LabelImg tool, and labels were saved in YOLO format (.txt files containing class ID and bounding box coordinates). The labeling process distinguished two main classes: person_with_helmet / person_without_helmet and person_falling.  Given the exploratory scope of this study, the present experiments were conducted on the baseline dataset. In subsequent phases, we plan to extend the dataset and adopt a systematic augmentation pipeline (e.g., horizontal/vertical flips, small-angle rotations, photometric adjustments, and controlled noise injection) together with stratified train/validation/test protocols to preserve class balance. This staged approach is intended to improve generalization and robustness of the YOLOv8s model while maintaining methodological transparency to support reproducibility.
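For readers unfamiliar with the YOLO label format mentioned above, a small sketch of reading one such .txt file; the file name is a placeholder and the class-ID mapping is an assumption.

```python
# Hedged sketch: each line of a YOLO label file holds
# "class_id x_center y_center width height", all normalized to [0, 1].
def read_yolo_labels(path: str) -> list[dict]:
    labels = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            cls, xc, yc, w, h = line.split()
            labels.append({"class_id": int(cls),
                           "x_center": float(xc), "y_center": float(yc),
                           "width": float(w), "height": float(h)})
    return labels

# Example line for a hypothetical person_without_helmet box:
# "1 0.500 0.500 0.120 0.340"
```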

 

 

 

  • Comment 4: Each detection task uses 100 images for testing, and the sample size is not sufficient to ensure statistical significance, especially for rare events such as 'fall detection'. Please explain the reason.

 

Response 4: We sincerely thank the reviewer for this valuable comment. We acknowledge that the test dataset used in this study is relatively small. However, this research was designed as a pilot experimental stage aimed at validating the functionality and feasibility of the proposed prototype system, rather than establishing statistically significant results or providing a large-scale benchmark model. The current dataset, comprising 200 manually annotated images, was collected in situ on an active construction site to capture realistic conditions and worker behaviour, which inherently limited the available sample size particularly for rare events such as falls.

To clarify this, information indicating that the presented system is a prototype has been added to the Introduction section. Furthermore, the Materials and Methods section has been supplemented with the following statement:

“The dataset used in this study was collected in situ by the authors during controlled observation sessions conducted on an active construction site. All image data were gathered directly by the research team and manually annotated to identify the presence or absence of safety helmets and human falls.”

Due to copyright and personal image protection regulations, the dataset cannot be made publicly available in open repositories. Nevertheless, the authors declare that it can be shared upon reasonable request for research purposes by contacting the corresponding author.

Future research will include extending the dataset and performing a more comprehensive statistical evaluation across diverse construction environments to improve the robustness and generalisability of the model.

 

  • Comment 5: The paper uses "width ≥ 2 × height" as the basis for fall judgment but does not provide the source or verification process for this threshold, leaving it without theoretical or experimental support.

Response 5:

We appreciate the reviewer’s valuable observation. The heuristic assumption that a person’s bounding box width being at least twice its height indicates a lying posture was derived from preliminary empirical observations conducted during prototype testing. This simplified ratio-based rule was adopted to reduce computational complexity and enable real-time detection on low-resource edge devices. The criterion was not intended as a universal threshold but as a practical approximation for initial system validation. Future research will include a larger dataset and multi-perspective video analysis to empirically verify and optimize this threshold value.

To clarify this aspect, the following explanatory sentence has been added to the Materials and Methods section:

“The width-to-height ratio threshold (width ≥ 2 × height) used to identify potential human falls was determined empirically during preliminary tests as a simplified heuristic for detecting horizontal body postures, aimed at ensuring computational efficiency in the prototype implementation.”

Author Response File: Author Response.pdf

Reviewer 5 Report

Comments and Suggestions for Authors This paper focuses on construction-site scenarios and proposes an integrated hazard-detection system built upon YOLOv8. By tackling the two most critical risks—missing safety helmets and worker falls—the topic directly addresses a long-standing pain point of the construction industry and has obvious practical value. From the perspectives of academic rigor, research depth and result completeness, however, the manuscript still leaves room for improvement. Specific comments are as follows:
  1. The Introduction cites ILO data (60,000 fatal accidents per year worldwide) and Polish economic losses (US$ 375 million, 2015–2022), but it does not explicitly link these figures to “the limitations of traditional supervision”. Please add a transitional argument such as: “Although approximately US$ XX billion are spent annually on manual safety inspections worldwide (add reference), accident rates remain high because of limited inspection frequency and human error, highlighting the necessity of automated detection systems.”
  2. At present the Introduction only states that “an integrated detection system is developed”. The exact differences from existing studies (e.g. Wu et al. [35], Qin et al. [37] reviewed in Section 2.2) are not spelled out. Add 1–2 sentences that summarize the key innovations, e.g. “Unlike previous single-task models, the proposed system integrates an open API, an incident-logging module and a lightweight cloud architecture, forming a closed loop of ‘detection–alert–record–traceability’. Hardware requirements are reduced to ordinary edge devices (e.g. Jetson Nano), making the solution affordable for small and medium-sized contractors.”
  3. The literature review praises YOLO for “speed and accuracy” but offers no systematic comparison with other mainstream detectors (R-CNN family, SSD, RetinaNet) in construction scenarios. Please insert a table (similar to Table 1) that contrasts speed (FPS), accuracy (mAP), hardware demand and site-specific robustness (occlusion, illumination), thereby justifying the choice of YOLOv8 scientifically instead of qualitatively.
  4. The test set contains only 200 images (100 helmets, 100 falls), which is too small. Expand it to at least 500–1000 images and make the data publicly available (e.g. GitHub or supplementary material).
  5. Fall detection achieves Recall = 0.45, yet no statistical analysis of missed cases (viewpoint, occlusion, illumination) is provided. Add an error-type diagram (confusion matrix grouped by viewpoint/illumination) and supply ROC curves with AUC for the fall-detection task.
  6. Report the positive/negative sample ratios for both tasks (e.g. missing-helmet XX % vs. normal XX %; falls XX % vs. normal posture XX %) and the diversity of scenarios (day/night, partial/full occlusion). Explain the low recall of fall detection (e.g. “falls account for only XX % of the test set, leading to class imbalance”).
  7. Section 3.2 mentions “asynchronous video processing” but does not clarify how asynchrony is implemented (parallel frame extraction & inference, buffering, local storage upon network failure). Add a flow chart or textual description that shows how real-time performance and computational load are balanced, echoing the “lightweight” goal.
  8. The experiment relies on a single hold-out test set; no cross-validation is reported. Conduct 5-fold cross-validation (4/5 training, 1/5 testing, average the results) to reduce random error caused by the small sample size.
  9. Section 5.2 attributes low fall recall to “single-frame analysis and bounding-box heuristics” but offers no experimental proof. Provide a controlled experiment: select 20 fall videos and compare single-frame detection with a 3-frame majority-vote sequence; report the recall increase (e.g. “recall rises to XX %”), quantitatively demonstrating the limitation of single-frame analysis.
  10. Where appropriate, cite DOI: 10.1007/s11069-025-07601-9.
  11. The reference “Safety and Health at Work: A Vision for Sustainable Prevention; International Labour Organization: Frankfurt, 2014” lacks a DOI; “Central Statistical Office. Accidents at Work; 2024” lacks place of publication and URL/access date. Complete these bibliographic details.

Author Response

Thank you for all the valuable comments in this review. We hope that the correction of the article according to the Editor's and Reviewers' suggestions will significantly improve its quality. Below are responses to the Editor's and Reviewers' comments. Changes in the text are marked in red.

 

This paper focuses on construction-site scenarios and proposes an integrated hazard-detection system built upon YOLOv8. By tackling the two most critical risks—missing safety helmets and worker falls—the topic directly addresses a long-standing pain point of the construction industry and has obvious practical value. From the perspectives of academic rigor, research depth and result completeness, however, the manuscript still leaves room for improvement. Specific comments are as follows:

 

  • Comment 1: The Introduction cites ILO data (60,000 fatal accidents per year worldwide) and Polish economic losses (US$ 375 million, 2015–2022), but it does not explicitly link these figures to “the limitations of traditional supervision”. Please add a transitional argument such as: “Although approximately US$ XX billion are spent annually on manual safety inspections worldwide (add reference), accident rates remain high because of limited inspection frequency and human error, highlighting the necessity of automated detection systems.”

 

Response 1: We added the following information to the Introduction section:


“Although approximately US$ 1.8 billion is spent annually on workplace safety audits and inspections, and the construction worker safety market is valued at around US$ 3.5 billion in 2025 [6, 7], accident rates remain high due to limited inspection frequency and human error, emphasizing the urgent need for automated hazard detection systems.”

 

  • Comment 2: At present the Introduction only states that “an integrated detection system is developed”. The exact differences from existing studies (e.g. Wu et al. [35], Qin et al. [37] reviewed in Section 2.2) are not spelled out. Add 1–2 sentences that summarize the key innovations, e.g. “Unlike previous single-task models, the proposed system integrates an open API, an incident-logging module and a lightweight cloud architecture, forming a closed loop of ‘detection–alert–record–traceability’. Hardware requirements are reduced to ordinary edge devices (e.g. Jetson Nano), making the solution affordable for small and medium-sized contractors.”

 

Response 2: The following sentences were added in Section 2.3: “Unlike previous single-task models, the proposed system integrates an open API, an incident-logging module and a lightweight cloud architecture, forming a closed loop of ‘detection–alert–record–traceability’. Hardware requirements are reduced to ordinary edge devices, making the solution affordable for small and medium-sized contractors.”

 

  • Comment 3: The literature review praises YOLO for “speed and accuracy” but offers no systematic comparison with other mainstream detectors (R-CNN family, SSD, RetinaNet) in construction scenarios. Please insert a table (similar to Table 1) that contrasts speed (FPS), accuracy (mAP), hardware demand and site-specific robustness (occlusion, illumination), thereby justifying the choice of YOLOv8 scientifically instead of qualitatively.

 

Response 3:

Dear Reviewer, thank you for the helpful comment. We have added a new systematic Table 3 that contrasts the requested detector families (R-CNN, SSD, RetinaNet) with representative YOLO variants in construction scenarios, reporting mAP, speed (FPS) together with the stated hardware, computational demand, and site-specific robustness (occlusion, illumination). In the revised text we clarify our rationale: on identical safety helmet datasets, YOLO achieved higher mAP and/or higher throughput than SSD and Faster R-CNN; YOLOv5 sustained real-time operation on GPU with a quantified AP drop under partial occlusion (≈7 pp); and YOLOX variants maintained strong performance in low-light tunnel conditions. We further highlight the compute overhead of two-stage RPN pipelines and the availability of lightweight YOLO adaptations for edge deployments. Taken together, these quantitative and implementation-oriented factors provide a scientific justification for selecting YOLOv8 as the best speed-accuracy-robustness trade-off for construction-site analytics.

 

  • Comment 4: The test set contains only 200 images (100 helmets, 100 falls), which is too small. Expand it to at least 500–1000 images and make the data publicly available (e.g. GitHub or supplementary material).

 

Response 4: We sincerely thank the reviewer for this valuable comment. We acknowledge that the test dataset used in this study is relatively small. However, this research was designed as a pilot experimental stage aimed at validating the functionality and feasibility of the proposed prototype system, rather than establishing statistically significant results or providing a large-scale benchmark model. The current dataset, comprising 200 manually annotated images, was collected in situ on an active construction site to capture realistic conditions and worker behaviour, which inherently limited the available sample size, particularly for rare events such as falls.

To clarify this, information indicating that the presented system is a prototype has been added to the Introduction section. Furthermore, the Materials and Methods section has been supplemented with the following statement:

“The dataset used in this study was collected in situ by the authors during controlled observation sessions conducted on an active construction site. All image data were gathered directly by the research team and manually annotated to identify the presence or absence of safety helmets and human falls.”

Due to copyright and personal image protection regulations, the dataset cannot be made publicly available in open repositories. Nevertheless, the authors declare that it can be shared upon reasonable request for research purposes by contacting the corresponding author.

Future research will include extending the dataset and performing a more comprehensive statistical evaluation across diverse construction environments to improve the robustness and generalisability of the model.

 

  • Comment 5: Fall detection achieves Recall = 0.45, yet no statistical analysis of missed cases (viewpoint, occlusion, illumination) is provided. Add an error-type diagram (confusion matrix grouped by viewpoint/illumination) and supply ROC curves with AUC for the fall-detection task.

 

Response 5: We would like to thank the Reviewer for this valuable and constructive comment. In response, we conducted an additional statistical analysis of missed detections (false negatives) for the fall-detection module. We added the following text:

 

“Table 8. Number of misclassified samples grouped by camera viewpoint and illumination conditions.

Viewpoint    Normal illumination    Low / uneven illumination
Front        2                      0
Side         6                      4
Top          11                     2
Oblique      18                     13

The highest error rate was observed for oblique camera angles under uneven lighting, confirming the model’s sensitivity to complex visual conditions.”

 

We appreciate the Reviewer’s remark regarding the ROC curve and AUC analysis. In the current version, the fall detection is implemented using a deterministic, single-frame heuristic based on the aspect ratio of the detected person’s bounding box. As this approach does not involve probabilistic classification or adjustable confidence thresholds, a conventional ROC curve cannot be directly computed. Instead, the model’s performance was evaluated exhaustively across all samples (n = 100) through manual verification, and the results were summarized using confusion matrices and error-type analyses (Table 8). Future work will include the integration of a learning-based, probabilistic fall classifier, which will enable ROC and AUC evaluation in a more conventional manner.

 

 

  • Comment 6: Report the positive/negative sample ratios for both tasks (e.g. missing-helmet XX % vs. normal XX %; falls XX % vs. normal posture XX %) and the diversity of scenarios (day/night, partial/full occlusion). Explain the low recall of fall detection (e.g. “falls account for only XX % of the test set, leading to class imbalance”).

 

Response 6:

Thank you for this valuable comment. All images in the test dataset were captured under daylight conditions; however, to assess the impact of lighting on detection accuracy, two illumination levels were defined: normal and low/uneven. We added the following information to the manuscript:

“For the fall detection task, 87% of the samples represented actual fall events, while 13% depicted people standing or crouching, which occasionally resulted in misclassification. In the helmet detection task, 60% of the images showed workers without helmets, and 40% showed workers wearing proper protective equipment. The relatively small number of fall cases compared to the total number of frames introduced a class imbalance, which may have contributed to the lower Recall observed in the fall-detection module.

All images in the test dataset were captured under daylight conditions; however, due to the relatively low recall observed in the fall-detection task, an additional analysis was conducted to identify potential sources of error. To assess the impact of lighting on detection accuracy, two illumination levels were defined: normal and low/uneven. The test dataset included samples captured from multiple viewpoints under varied illumination conditions to evaluate the model’s robustness in real construction-site environments. The analysis of misclassified fall-detection cases revealed a relationship between detection accuracy, camera viewpoint, and lighting conditions. Table 8 summarizes the number of misclassified samples grouped by these two factors.

Table 8. Number of misclassified samples grouped by camera viewpoint and illumination conditions.

Viewpoint | Normal illumination | Low / uneven illumination
Front     | 2  | 0
Side      | 6  | 4
Top       | 11 | 2
Oblique   | 18 | 13

The highest error rate was observed for oblique camera angles under uneven lighting, confirming the model’s sensitivity to complex visual conditions.”

 

  • Comment 7: Section 3.2 mentions “asynchronous video processing” but does not clarify how asynchrony is implemented (parallel frame extraction & inference, buffering, local storage upon network failure). Add a flow chart or textual description that shows how real-time performance and computational load are balanced, echoing the “lightweight” goal.

 

Response 7:

The architecture of the asynchronous processing pipeline is already illustrated in the paper (see Fig. 3). To clarify its operation, an additional description has been included in Section 4.4.

The system performs frame acquisition, YOLO-based inference, and output rendering in parallel threads. Incoming frames are buffered locally when temporary delays or network interruptions occur, ensuring continuous operation without frame loss. To quantitatively verify the lightweight and real-time characteristics, average inference times were measured for 100 test images using both detection modules. The results are presented in Table 9.

Table 9. Average processing times for the asynchronous inference pipeline.

Task | Number of Samples | Mean Time [ms] | Min Time [ms] | Max Time [ms]
NHD  | 100 | 123.88 | 95.88 | 383.46
FD   | 100 | 131.84 | 88.46 | 363.13

 
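As a rough illustration of the producer–consumer scheme described above (a sketch under assumed names — queue size, OpenCV capture — not the exact implementation):

    import queue
    import threading

    import cv2  # assumed dependency for frame acquisition

    frame_buffer = queue.Queue(maxsize=64)  # local buffer absorbs short stalls

    def capture_frames(source=0):
        """Producer thread: read frames and buffer them locally so that
        temporary inference or network delays do not lose footage."""
        cap = cv2.VideoCapture(source)
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            try:
                frame_buffer.put(frame, timeout=1)
            except queue.Full:
                frame_buffer.get_nowait()       # drop the oldest frame
                frame_buffer.put_nowait(frame)  # keep the newest one
        cap.release()

    def run_inference(model):
        """Consumer thread: YOLO inference runs independently of capture."""
        while True:
            frame = frame_buffer.get()
            results = model(frame)  # YOLO-style callable
            # ...render detections / trigger alerts / log incidents...

    # threading.Thread(target=capture_frames, daemon=True).start()
    # threading.Thread(target=run_inference, args=(model,), daemon=True).start()

Decoupling capture from inference in this way lets short inference spikes (cf. the max times in Table 9) be absorbed by the buffer instead of stalling acquisition.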

  • Comment 8: The experiment relies on a single hold-out test set; no cross-validation is reported. Conduct 5-fold cross-validation (4/5 training, 1/5 testing, average the results) to reduce random error caused by the small sample size.

 

Response 8: Thank you for this valuable comment. In this study, a pre-trained YOLOv8 model (https://github.com/keremberke/awesome-yolov8-models) was used. This model had already been trained and validated by its authors on large and diverse datasets using standard cross-validation procedures. The purpose of the present research was not to retrain the model, but to implement it, integrate it with an external alerting system, and evaluate its detection performance in a specific application context under real-world conditions (varied viewpoints, illumination, and partial occlusions). Therefore, additional k-fold cross-validation was not required, as the experiments focused on verifying the inference reliability and system integration of an already trained model rather than its training process. Nevertheless, to minimize potential random bias, the test dataset included samples representing varied environmental and lighting conditions, ensuring the representativeness of the evaluation results.
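For reference, a minimal sketch of how such a pre-trained checkpoint is typically loaded for inference-only use with the ultralytics package; the weights filename below is a placeholder, not the exact checkpoint used in the study:

    from ultralytics import YOLO

    # Inference only: the checkpoint is used as published, with no retraining.
    model = YOLO("hard-hat-detection.pt")  # placeholder filename

    results = model("site_frame.jpg", conf=0.25)  # single-image inference
    for box in results[0].boxes:
        print(box.cls, box.conf, box.xyxy)  # class id, score, box corners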

 

  • Comment 9: Section 5.2 attributes low fall recall to “single-frame analysis and bounding-box heuristics” but offers no experimental proof. Provide a controlled experiment: select 20 fall videos and compare single-frame detection with a 3-frame majority-vote sequence; report the recall increase (e.g. “recall rises to XX %”), quantitatively demonstrating the limitation of single-frame analysis.

 

Response 9: This direction has already been outlined in the directions for future research in Section 8 of the paper. The current study focused on verifying the feasibility of implementing the detection pipeline in real-time conditions using a lightweight model.
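To make the suggested extension concrete, a minimal sketch of a 3-frame majority vote over per-frame fall flags (illustrative only; this module is future work, not part of the evaluated system):

    from collections import deque

    def majority_vote_falls(frame_decisions, window=3):
        """Smooth per-frame fall flags: report a fall only when a majority
        of the last `window` frames are classified as falls."""
        recent = deque(maxlen=window)
        smoothed = []
        for decision in frame_decisions:
            recent.append(decision)
            smoothed.append(sum(recent) > len(recent) // 2)
        return smoothed

    # An isolated single-frame detection is suppressed, while a sustained
    # fall (three consecutive flags) is kept.
    print(majority_vote_falls([False, True, False, True, True, True]))
    # -> [False, False, False, True, True, True]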

 

  • Comment 10: Where appropriate, cite DOI: 10.1007/s11069-025-07601-9.

 

Response 10: In response to the comment, we re-examined the entire bibliography and supplied missing DOIs wherever available, and we incorporated a citation to the recommended article at the most substantively appropriate point in the manuscript, where it aligns thematically and methodologically with our argument, as reference 56.

 

  • Comment 11: The reference “Safety and Health at Work: A Vision for Sustainable Prevention; International Labour Organization: Frankfurt, 2014” lacks a DOI; “Central Statistical Office. Accidents at Work; 2024” lacks place of publication and URL/access date. Complete these bibliographic details.

 

Response 11:

We thank the reviewer for the helpful comment. The missing bibliographic details have been completed. The corrected references now appear in the manuscript as follows:

  1. Safety and Health at Work: A Vision for Sustainable Prevention; International Labour Organization: Frankfurt, 2014. ISBN: 9789221289081
  2. Central Statistical Office. Accidents at Work. https://stat.gov.pl/ (accessed Oct 15, 2025).

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have properly addressed all my comments, and I have no further comments. 

Author Response

We sincerely thank the Reviewer for the positive evaluation and confirmation that all previous comments have been properly addressed. We greatly appreciate your valuable feedback and time devoted to improving our manuscript.

Reviewer 3 Report

Comments and Suggestions for Authors

I appreciate the authors’ substantial revision effort. The manuscript now reads more clearly and demonstrates stronger scientific rigor, with a cleaner narrative that helps readers grasp the central message. That said, I still have a few critical questions that should be addressed to further strengthen the paper’s contribution.

First, the claim of novelty remains insufficiently articulated. The response suggests the work is primarily a structured synthesis and comparative analysis of YOLOv8-based and related approaches, consolidating and contextualizing existing research. This positioning aligns more with a literature review than a case study, especially since the presented case focuses on detection of safety equipment rather than reactive or proactive interventions (alerts, constraint/lockout, prediction, prevention), where the real safety impact occurs. If the novelty is “establishing a scalable foundation” for future reactive/predictive modules, please clarify why such a foundation is still needed given existing work. Alternatively, specify what is genuinely new within the detection stage: improved accuracy, expanded classes (e.g., additional PPE or posture states), lower latency, deployment constraints solved, or algorithmic advances beyond a standard YOLOv8 pipeline. If not, the study risks overlapping with extensive prior art, while the field is already moving toward newer YOLO versions.

Second, Table 2 is overly dense. The “Performance” column aggregates too many indicators for most readers to parse. Please streamline to the key outcomes per cited study and shift the comparative analysis into the text, where you can make clearer, sentence-level contrasts on the most salient metrics.

Third, Table 3 should broaden the benchmark set. Please include additional YOLO variants (e.g., YOLOv9/YOLOv10/YOLOv12 and earlier v5/v6 where relevant) and complementary detectors such as CenterNet, FCOS, and Mask R-CNN. Given the manuscript’s comparative thrust, expanding these baselines will better situate your case study and mitigate the impression of overlap with existing literature.

Comments on the Quality of English Language

English language is fine and understandable; however, the editor needs to run through a proofread.

Author Response

Thank you for all the valuable comments in this review. We hope that correcting the article according to the Editor's and Reviewers' suggestions will significantly improve its quality. Below are responses to the Editor's and Reviewers' comments. Changes in the text are marked in red.

  • Comment 1: First, the claim of novelty remains insufficiently articulated. The response suggests the work is primarily a structured synthesis and comparative analysis of YOLOv8-based and related approaches, consolidating and contextualizing existing research. This positioning aligns more with a literature review than a case study, especially since the presented case focuses on detection of safety equipment rather than reactive or proactive interventions (alerts, constraint/lockout, prediction, prevention), where the real safety impact occurs. If the novelty is “establishing a scalable foundation” for future reactive/predictive modules, please clarify why such a foundation is still needed given existing work. Alternatively, specify what is genuinely new within the detection stage: improved accuracy, expanded classes (e.g., additional PPE or posture states), lower latency, deployment constraints solved, or algorithmic advances beyond a standard YOLOv8 pipeline. If not, the study risks overlapping with extensive prior art, while the field is already moving toward newer YOLO versions.

Response: Thank you for this important comment. We have clarified the novelty of the work and revised the manuscript accordingly.

The contribution of the study is system-level, not algorithmic: we present a lightweight, deployable and fully integrated detection–alert–record pipeline combining two YOLOv8 modules with an asynchronous inference loop, an incident-logging backend and an open REST API. Such end-to-end architectures are absent in existing YOLOv8-based PPE/fall-detection studies, which focus exclusively on model accuracy and do not include alerting, data persistence, interoperability or low-resource deployment.
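As a minimal sketch of what the incident-logging end of such a detection–alert–record pipeline can look like (endpoint names and the use of Flask are assumptions for illustration, not the authors’ actual API):

    from datetime import datetime, timezone

    from flask import Flask, jsonify, request  # assumed web framework

    app = Flask(__name__)
    incidents = []  # in-memory store; a real deployment would use a database

    @app.post("/api/incidents")  # hypothetical endpoint name
    def log_incident():
        """Record a detection event (e.g. missing helmet, fall) posted by
        the inference loop, timestamped for later reporting."""
        event = request.get_json()
        event["received_at"] = datetime.now(timezone.utc).isoformat()
        incidents.append(event)
        return jsonify({"status": "logged", "count": len(incidents)}), 201

    @app.get("/api/incidents")
    def list_incidents():
        """Expose logged incidents to the dashboard over the REST API."""
        return jsonify(incidents)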

We also explain why this foundation is needed: current research lacks solutions that SMEs can deploy in real conditions, and no prior work provides a unified, API-driven workflow enabling further reactive or predictive safety modules.

These clarifications were added concisely to the Abstract, Introduction, Section 2.3, System Architecture, and Conclusions, without altering the core structure of the paper.

  • Comment 2: Second, Table 2 is overly dense. The “Performance” column aggregates too many indicators for most readers to parse. Please streamline to the key outcomes per cited study and shift the comparative analysis into the text, where you can make clearer, sentence-level contrasts on the most salient metrics.

Response: We thank the Reviewer for this valuable suggestion. In response, we have simplified Table 2 by removing less relevant numerical details from the Performance column and retaining only the key indicators for clarity. The comparative discussion of the main performance trends has been moved to the text directly below the table.

Additionally, we have added the following explanatory paragraph to accompany the table:

“As presented in Table 2, recent YOLO-based architectures demonstrate mean Average Precision (mAP@0.5) values ranging from approximately 84.7% to 93.2%, while maintaining real-time inference performance between 60 and 90 frames per second. Wu et al. [39] proposed the most computationally efficient lightweight configuration, achieving 89 FPS. In contrast, studies such as those by Qin et al. [41] and Huang et al. [43] emphasized robustness in fall detection tasks, particularly under challenging visual conditions including occlusion and variable illumination.”

  • Comment 3: Third, Table 3 should broaden the benchmark set. Please include additional YOLO variants (e.g., YOLOv9/YOLOv10/YOLOv12 and earlier v5/v6 where relevant) and complementary detectors such as CenterNet, FCOS, and Mask R-CNN. Given the manuscript’s comparative thrust, expanding these baselines will better situate your case study and mitigate the impression of overlap with existing literature.

Response: Thank you for this helpful suggestion. In response, Table 3 has been expanded to include additional YOLO variants (YOLOv9, YOLOv10, YOLOv12, as well as YOLOv5/YOLOv6 where applicable) and complementary detectors commonly used as baselines in construction-safety research, including CenterNet, FCOS and Mask R-CNN. These entries have been added together with representative peer-reviewed studies to more comprehensively situate the proposed system within the broader object-detection landscape. The revisions strengthen the comparative context while maintaining the focus and scope of the manuscript.

  • Comment 4: English language is fine and understandable; however, the editor needs to run through a proofread.

Response: We appreciate the reviewer’s comment regarding language consistency. Minor grammatical and stylistic corrections were introduced throughout the text (e.g., verb tense consistency, punctuation adjustments, and redundant word removal). The revised version has been fully checked for English grammar and readability.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

The authors have made good revisions to the paper and it is ready for publication.

Author Response

We sincerely thank the Reviewer for the positive evaluation and confirmation that all previous comments have been properly addressed. We greatly appreciate your valuable feedback and time devoted to improving our manuscript.

Reviewer 5 Report

Comments and Suggestions for Authors

accept

Author Response

We sincerely thank the Reviewer for the positive evaluation and confirmation that all previous comments have been properly addressed. We greatly appreciate your valuable feedback and time devoted to improving our manuscript.
