Article
Peer-Review Record

Real-Time Deep-Learning-Based Recognition of Helmet-Wearing Personnel on Construction Sites from a Distance

Appl. Sci. 2025, 15(20), 11188; https://doi.org/10.3390/app152011188
by Fatih Aslan 1,* and Yaşar Becerikli 2,3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 10 September 2025 / Revised: 30 September 2025 / Accepted: 2 October 2025 / Published: 18 October 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors
  1. The introduction mentions that facial recognition has limitations and that QR codes have distance and angle limitations. This justifies the move to symbols, but it could be more explicit. Why symbols, and not other methods of unique identification?
  2. The idea of using symbols is not detailed: the paper does not explain how this solution integrates with existing ones or why it is more advantageous than other approaches.
  3. Figure 5 could benefit from a better presentation. What were the considerations for choosing these symbols?
  4. The process of generating symbols and applying them to the helmets (including variations in angle, lighting, scale) should be explained in much more detail.
  5. The description of CNNs and YOLO is quite generic. The authors should focus more on the specific aspects of YOLOv5 and YOLOv8 that are relevant to their problem.
  6. Pseudocode is provided, but a more detailed explanation of each step, with a concrete visual example, would be helpful. How are the exceptions handled?
  7. Figure 9 is hard to read.
  8. The results are not commented enough.
  9. A discussion section is missing, where several aspects should be presented, such as: why YOLOv5 outperforms YOLOv8 at greater distances, despite YOLOv8 being considered more advanced in some respects; the impact of camera resolution and symbol size on distance performance; which symbol combinations are more robust at distance and why; the importance of "confidence averages" and how they correlate with distance; why certain shapes are detected better than others; complexity; computational time; implementation in a real system; etc.

Author Response

1. The introduction mentions that facial recognition has limitations and that QR codes have distance and angle limitations. This justifies the move to symbols, but it could be more explicit. Why symbols, and not other methods of unique identification?

Response 1: A paragraph has been added at the end of the introduction section about the current methods' vulnerabilities, and the importance of our method is highlighted.

2. The idea of using symbols is not detailed: the paper does not explain how this solution integrates with existing ones or why it is more advantageous than other approaches.

Response 2: It is mentioned that battery usage, distance limitations, and intrusiveness problems make identifying and tracking workers highly impractical. Therefore, no existing solution is used in this study; instead, an object detection framework is used to detect objects.

We also added to the introduction section that a tracking algorithm can be integrated into our identification system to ease tracking after each detection of a worker from the symbols placed on the helmet. However, our main focus is not tracking for now; it will be our future work, as mentioned in the conclusion section.

3. Figure 5 could benefit from a better presentation. What were the considerations for choosing these symbols??

Response 3: The criteria for selecting the symbols are summarized in the symbols section. Also, a systematic symbol selection process is planned as future work and has been added to the conclusion section.

4. The process of generating symbols and applying them to the helmets (including variations in angle, lighting, scale) should be explained in much more detail.

Response 4: A symbol placement picture has been added, and a corresponding paragraph has been included in the symbols section. The symbols appear upright when viewed from the front.

5. The description of CNNs and YOLO is quite generic. The authors should focus more on the specific aspects of YOLOv5 and YOLOv8 that are relevant to their problem.

Response 5: The YOLOv8 architecture diagram has been removed.

YOLOv5 and YOLOv8 were the most widely used methods at the time the study began. Their speed and accuracy are very high, particularly owing to their single-stage detection architecture, which is already discussed in the CNN section.

6. Pseudocode is provided, but a more detailed explanation of each step, with a concrete visual example, would be helpful. How are the exceptions handled?

Response 6: The pseudocode covers only the ordering correction, based on the top-left location of each detected symbol. An explanation of the algorithm has been added accordingly.
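Since the pseudocode itself is not reproduced in this record, a minimal sketch of such a position-based ordering step may help. This is an illustration only, not the authors' actual pseudocode; the function and tuple layout are assumed for the example.

```python
# Illustrative sketch only, not the authors' actual pseudocode.
# Each detection is (label, x_top_left, y_top_left); names are hypothetical.

def order_symbols(detections):
    """Quicksort-like recursion that orders detected symbols left to right
    by their top-left x-coordinate, as the response describes."""
    if len(detections) <= 1:
        return list(detections)
    pivot, rest = detections[0], detections[1:]
    left = [d for d in rest if d[1] < pivot[1]]
    right = [d for d in rest if d[1] >= pivot[1]]
    return order_symbols(left) + [pivot] + order_symbols(right)

# YOLO may report the three helmet symbols in any order; sorting by x
# restores the left-to-right code used to identify the worker.
dets = [("star", 412, 120), ("circle", 130, 118), ("square", 270, 121)]
print([d[0] for d in order_symbols(dets)])  # prints ['circle', 'square', 'star']
```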

7. Figure 9 is hard to read.

Response 7: Their size has been increased.

8. The results are not commented enough.

Response 8: The results section is divided into two sub-sections and improved.

9. A discussion section is missing, where several aspects should be presented, such as: why YOLOv5 outperforms YOLOv8 at greater distances, despite YOLOv8 being considered more advanced in some respects; the impact of camera resolution and symbol size on distance performance; which symbol combinations are more robust at distance and why; the importance of "confidence averages" and how they correlate with distance; why certain shapes are detected better than others; complexity; computational time; implementation in a real system; etc.

Response 9: Symbol size and camera resolution were not studied in this work. In our training dataset, some symbols appear more frequently than others, which would affect the detection performance of some symbols relative to others.

Computational time was studied previously, and the results have now been added to the results section as a table.

Reviewer 2 Report

Comments and Suggestions for Authors

This paper focuses on real-time recognition of helmet-wearing workers on construction sites from a distance using deep learning techniques. The study aims to address the limitations of traditional hardware-based helmet detection methods (e.g., battery dependence, low worker acceptance) and provide a passive, camera-based solution for construction site worker safety monitoring and tracking.

  1. The abstract mentions "addressing the issue of unfixed symbol detection order in inference mode through a position-based symbol sorting algorithm" but fails to briefly elaborate on the core logic of the algorithm—"sorting symbols from left to right". This omission hinders readers from quickly grasping the key mechanism of the algorithm.
  2. Although hardware-based (such as helmet sensors and RFID) and vision-based helmet detection methods have been introduced, the paper does not analyze the limitations of early hardware solutions in terms of battery life and detection distance, or how these limitations drove deep-learning-based visual detection methods to become mainstream.
  3. The paper only states the selection of 14 symbols and the use of the "three in one group" combination form, without explaining the design basis. It is not explained whether the number of 14 symbols was determined by comparing the impact of different symbol counts on recognition efficiency and uniqueness.
  4. In the section on symbol dataset construction, only the use of a 64-megapixel mobile phone camera for image capture is mentioned. Critical details such as lighting conditions and shooting angle ranges are not specified. These factors directly impact the model’s adaptability to complex real-world construction site environments.
  5. While the precision, recall, and other metrics of YOLOv5 and YOLOv8 for symbol detection are compared, the paper does not analyze differences in computational latency and memory consumption between the two models. The absence of these metrics undermines informed model selection for practical applications.
  6. The conclusion notes that "the symbol detection rate decreases significantly beyond 10 meters" but does not specifically analyze the primary causes of this decline (e.g., reduced resolution leading to indistinct symbol features, increased background interference).
  7. The paper does not conduct model testing for common special interference scenarios in construction sites, such as dust, mechanical occlusion, and cases where the color of workers' clothing is similar to that of helmets. These scenarios are high-frequency environments at construction sites.

Author Response

1. The abstract mentions "addressing the issue of unfixed symbol detection order in inference mode through a position-based symbol sorting algorithm" but fails to briefly elaborate on the core logic of the algorithm—"sorting symbols from left to right". This omission hinders readers from quickly grasping the key mechanism of the algorithm.

Response 1: The explanation of the algorithm is added accordingly.

2. Although hardware-based (such as helmet sensors and RFID) and vision-based helmet detection methods have been introduced, the paper does not analyze the limitations of early hardware solutions in terms of battery life and detection distance, or how these limitations drove deep-learning-based visual detection methods to become mainstream.

Response 2: At the end of the introduction section, the limitations are mentioned again and highlighted. In fact, one of the main contributions listed at the end of the introduction is "using already placed cameras", which highlights that no active, battery-powered sensor is needed. Since our focus in this study was on generating and training the symbol dataset, the other methods and their vulnerabilities are only mentioned briefly.

3. The paper only states the selection of 14 symbols and the use of the "three in one group" combination form, without explaining the design basis. It is not explained whether the number of 14 symbols was determined by comparing the impact of different symbol counts on recognition efficiency and uniqueness.

Response 3: Admittedly, we did not meticulously research the symbols in the symbol determination phase. Our next study will involve generating symbol selection criteria for different requirements, such as the number of workers. In that future study, different symbol sizes will also be analyzed.

In addition to that, we added symbol selection choices in the symbols section.

4. In the section on symbol dataset construction, only the use of a 64-megapixel mobile phone camera for image capture is mentioned. Critical details such as lighting conditions and shooting angle ranges are not specified. These factors directly impact the model’s adaptability to complex real-world construction site environments.

Response 4: We have added sample symbol dataset images; they were captured in daytime, mostly inside a hangar.

The images were shot from the front of the helmet, as the symbols are positioned on the front side.

5. While the precision, recall, and other metrics of YOLOv5 and YOLOv8 for symbol detection are compared, the paper does not analyze differences in computational latency and memory consumption between the two models. The absence of these metrics undermines informed model selection for practical applications.

Response 5: Average computational processing times per image have been added for the two methods at the end of the results section; YOLOv5 is faster. Memory consumption, however, was not analyzed.

6. The conclusion notes that "the symbol detection rate decreases significantly beyond 10 meters" but does not specifically analyze the primary causes of this decline (e.g., reduced resolution leading to indistinct symbol features, increased background interference).

Response 6: In the symbols section, we already state that "obviously, increasing size will have a positive effect on distance accuracy". Also, in the class-wise metric analysis, we discuss the less precise symbols. Mainly, a symbol's size and its precision/recall metrics affect the performance.

7. The paper does not conduct model testing for common special interference scenarios in construction sites, such as dust, mechanical occlusion, and cases where the color of workers' clothing is similar to that of helmets. These scenarios are high-frequency environments at construction sites.

Response 7: Our primary objective is to establish an identification system that operates reliably at longer distances. Building on the base framework, a tracking method will be incorporated to facilitate consistent identification. Subsequently, we plan to extend the system to more challenging scenarios, such as crowded environments and the use of helmets with varying colors.

Since our approach employs three consecutive object detections in sequence—person, helmet, and symbol—and YOLO, as an artificial neural network, is trained to distinguish helmet-like shapes from clothing, helmets did not cause any confusion with garments in our experiments. However, if a helmet-shaped pattern were present on a worker’s T-shirt, it would inevitably confuse the algorithm.
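The three-stage cascade described above can be sketched as follows. This is a minimal illustration with stand-in detector callables and a nested-list image; none of these names come from the paper, and the real system would use trained YOLO models in place of the callables.

```python
# Illustrative sketch of the person -> helmet -> symbol cascade.
# The detector callables and crop helper are hypothetical stand-ins.

def crop(image, box):
    """Crop region (x0, y0, x1, y1) from an image stored as nested lists."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

def identify_workers(frame, detect_person, detect_helmet, detect_symbols):
    """Return one left-to-right symbol triple per fully detected helmet."""
    identities = []
    for person_box in detect_person(frame):
        person_crop = crop(frame, person_box)
        for helmet_box in detect_helmet(person_crop):
            helmet_crop = crop(person_crop, helmet_box)
            symbols = detect_symbols(helmet_crop)
            if len(symbols) == 3:                 # all three symbols visible
                symbols.sort(key=lambda s: s[1])  # left-to-right by x
                identities.append(tuple(s[0] for s in symbols))
    return identities
```

Because each stage searches only inside the previous stage's crop, a helmet-like pattern elsewhere in the frame is never matched against the symbol database unless it also lies on a detected person.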

Reviewer 3 Report

Comments and Suggestions for Authors

The paper proposes a deep learning-based framework for real-time identification of helmet-wearing workers on construction sites from a distance using visual data. The approach integrates the detection of persons, helmets, and distinct symbols placed on helmets to identify individuals uniquely. Two YOLO versions, YOLOv5 and YOLOv8, are applied with transfer learning on a novel symbol dataset of 11,243 images containing 14 different symbols. The study shows high precision and recall in symbol detection, achieving an accuracy of up to 10 meters in identifying workers via helmet symbols. The authors also propose a location-based symbol ordering algorithm to correct the detection sequence. 

This reviewer suggests some relevant points to consider in order to improve the quality of the presented paper:

  1. The Introduction is too large, so it is recommended to separate the introduction and the state-of-the-art into different sections.
  2. The study focuses exclusively on white helmets; lack of evaluation on helmets with varied colors or designs may limit generalizability.
  3. The system’s scalability and robustness under multi-person crowded scenes with overlapping helmets and occlusions are not evaluated.
  4. Lack of analysis on the effect of symbol size variations on detection range and accuracy. Is the symbol ordering robust to helmets with damaged, worn, or partially covered symbols?
  5. Lack of comparison with alternative state-of-the-art detection and identification approaches (e.g., more recent versions of YOLO or ViT models).
  6. Can you provide confusion matrices to explore the class-wise evaluation misclassification?
Comments on the Quality of English Language

English language contains minor grammatical, typographical, and stylistic issues affecting clarity and readability.

Author Response

1. The Introduction is too large, so it is recommended to separate the introduction and the state-of-the-art into different sections.

Response 1: The introduction is divided into sub-sections accordingly as:

-Hardware-Based Helmet Detection and Identification

-Visual-Based Helmet Detection

-Face-Based Identification

-QR-Based Identification

-Traffic Signs and Symbol Detection

-Human Detection and Tracking

Also two studies that MDPI suggested us are included into introduction section:

https://www.mdpi.com/1424-8220/23/3/1682

https://www.mdpi.com/2075-5309/14/6/1644

2. The study focuses exclusively on white helmets; lack of evaluation on helmets with varied colors or designs may limit generalizability.

Response 2: Using different helmet colors will be our future work, as mentioned in the future work part of the conclusion section; the dataset will be extended with different helmet colors.

3. The system’s scalability and robustness under multi-person crowded scenes with overlapping helmets and occlusions are not evaluated.

Response 3: A two-person scenario was studied and did not change the results in terms of distance and accuracy. However, those results were not included in this study.

If any of the three symbols is occluded, the detection cannot succeed. However, integrating a tracking algorithm, which is one of our planned future works, will resolve this issue. The proposed method's success rate is measured at 30 frames per second: if identification occurs in any frame, a tracking mechanism can then take over and resolve subsequent identification problems. This is another future work mentioned in the conclusion section.
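As a purely hypothetical illustration of that future-work plan (not part of the submitted system): once any frame yields a full three-symbol identification, a tracker can carry the identity through frames where symbols are occluded. The function and data layout below are assumed for the example.

```python
# Hypothetical sketch of the planned tracking integration; not in the paper.

def carry_identity(per_frame_ids):
    """per_frame_ids holds one entry per frame: an identity tuple when all
    three symbols were detected, or None on a miss/occlusion. The identity
    from the most recent successful frame is carried forward."""
    assigned, last_known = [], None
    for ident in per_frame_ids:
        if ident is not None:
            last_known = ident       # fresh identification this frame
        assigned.append(last_known)  # reuse it through misses
    return assigned

# At 30 fps, one successful frame suffices to label later occluded frames.
frames = [None, ("circle", "square", "star"), None, None]
print(carry_identity(frames))
```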

4. Lack of analysis on the effect of symbol size variations on detection range and accuracy. Is the symbol ordering robust to helmets with damaged, worn, or partially covered symbols?

Response 4: Symbol size was not meticulously researched in this paper. A helmet's dimensions were measured, and a 4 cm × 4 cm label per symbol was accepted as large enough. It would be possible to use symbols of up to 8 cm × 8 cm while still leaving enough space to place three symbols in a row visible from the front view.

Partially visible symbols cannot be identified, and defects on helmets were not studied.

5. Lack of comparison with alternative state-of-the-art detection and identification approaches (e.g., more recent versions of YOLO or ViT models).

Response 5: It would have been possible to use newer YOLO versions; in fact, we tried YOLOv12 but have not included it in this paper yet. Other major deep learning architectures, such as R-CNN and Faster R-CNN, were also tried, but they ran very slowly. We therefore decided not to mention them.

6. Can you provide confusion matrices to explore the class-wise evaluation misclassification?

Response 6: Yes, they were originally in the paper but had been excluded without removing their references. They are now included again in the results section.

Reviewer 4 Report

Comments and Suggestions for Authors


This manuscript presents a deep learning-based approach for real-time identification of helmet-wearing individuals on construction sites using symbol recognition. The authors employ YOLOv5 and YOLOv8 for object detection and introduce a custom symbol dataset to facilitate worker identification from a distance. While the topic is relevant to occupational safety and surveillance, the manuscript has several critical shortcomings that limit its suitability for publication in its current form.

1. Limited Novelty

The core detection pipeline relies on standard YOLO architectures and public datasets for person and helmet detection. The proposed symbol-based identification mechanism, while practical, lacks algorithmic innovation and is conceptually similar to existing object classification tasks such as traffic sign recognition.

2. Restricted Generalizability

The system assumes fixed symbol placement and helmet visibility, which may not hold in real-world scenarios involving occlusions, varied lighting, or worker movement. The evaluation is conducted in controlled environments with limited diversity, raising concerns about robustness and scalability.


3. Insufficient Comparative Analysis

The manuscript compares YOLOv5 and YOLOv8 but omits benchmarking against other relevant models (e.g., Faster R-CNN, SSD). There is no ablation study to isolate the impact of symbol size, helmet color, or camera resolution on detection performance.

4. Evaluation Gaps

The performance metrics focus on mAP, precision, and recall but do not address practical deployment concerns such as latency, false identification rates, or multi-worker scenarios. The ordering algorithm assumes left-to-right alignment, which may not be reliable in dynamic environments.

5. Presentation and Language Quality

The manuscript contains grammatical errors, inconsistent terminology, and vague figure captions. The writing style is often awkward and detracts from clarity. A thorough language revision is recommended.


Comments on the Quality of English Language

The manuscript contains grammatical errors, inconsistent terminology, and vague figure captions. The writing style is often awkward and detracts from clarity. A thorough language revision is recommended.

Author Response

1. Limited Novelty

The core detection pipeline relies on standard YOLO architectures and public datasets for person and helmet detection. The proposed symbol-based identification mechanism, while practical, lacks algorithmic innovation and is conceptually similar to existing object classification tasks such as traffic sign recognition.

Response 1: We did not contribute a new artificial neural network in theory; we only used existing YOLO versions. However, a quicksort-like recursive algorithm was developed to cope with ordering the detected symbols correctly.

Also, the symbol selection and the generation of a large dataset are unique contributions of our study, as mentioned at the end of the introduction section.

2. Restricted Generalizability

The system assumes fixed symbol placement and helmet visibility, which may not hold in real-world scenarios involving occlusions, varied lighting, or worker movement. The evaluation is conducted in controlled environments with limited diversity, raising concerns about robustness and scalability.

Response 2: Up to a distance of 10 meters, all three symbols were generally captured accurately and successfully matched to the database using the proposed algorithm. As outlined in the future work section, a sequential tracking method will be integrated into the identification mechanism, thereby enabling a more robust and reliable approach.

3. Insufficient Comparative Analysis

The manuscript compares YOLOv5 and YOLOv8 but omits benchmarking against other relevant models (e.g., Faster R-CNN, SSD). There is no ablation study to isolate the impact of symbol size, helmet color, or camera resolution on detection performance.

Response 3: The conclusion highlights symbol size and helmet color as future research considerations, with the current study serving as a reliable basis.

At the time this study was initiated, YOLOv5 and YOLOv8 were among the most widely used detection methods. Their performance in both speed and accuracy is notable, particularly due to their single-stage detection architecture; this one-stage characteristic has already been discussed in the CNN section. Other major deep learning architectures, such as R-CNN and Faster R-CNN, were also tried, but they ran very slowly. We therefore decided not to mention them, since our main concern is a solution feasible in real time.

4. Evaluation Gaps

The performance metrics focus on mAP, precision, and recall but do not address practical deployment concerns such as latency, false identification rates, or multi-worker scenarios. The ordering algorithm assumes left-to-right alignment, which may not be reliable in dynamic environments.

Response 4: Average computational processing times per image have been added for the two methods at the end of the results section; YOLOv5 is faster.

Crowd scenarios were tested with two individuals and yielded consistent results, since the algorithm operates sequentially by first detecting the person, then the helmet, and finally the symbols located on the helmet. Solving the single-worker problem at a reliable distance also eases the crowd scenarios.

The ordering algorithm performs robustly as long as the helmet is not placed on the ground and rotated more than 90 degrees from its upright position.

5. Presentation and Language Quality

The manuscript contains grammatical errors, inconsistent terminology, and vague figure captions. The writing style is often awkward and detracts from clarity. A thorough language revision is recommended.

Response 5: We will have the article proofread through MDPI author services, by either an AI tool or a human editor.

Figure titles have been re-arranged.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Small adjustments to the arrangement of figures (Figures 9, 10, and 11) and tables (Table 3) are necessary. Tables with results should not be placed in the conclusions.

Author Response

Comment:

Small adjustments to the arrangement of figures (Figures 9, 10, and 11) and tables (Table 3) are necessary. Tables with results should not be placed in the conclusions.

Response:

Figures 9, 10, and 11 and Table 3 have been rearranged accordingly by resizing and repositioning, and the Table 3 headers have been reorganized. The conclusion no longer contains any tables or figures, and the figures are placed without overlapping one another.

Reviewer 3 Report

Comments and Suggestions for Authors

I would like to thank the authors for considering my suggestions and their kind responses. I recommend continuing with further research lines in order to improve current results. The paper can be accepted in its present form. 

Author Response

Comment:

I would like to thank the authors for considering my suggestions and their kind responses. I recommend continuing with further research lines in order to improve current results. The paper can be accepted in its present form. 

Response:

We sincerely thank the reviewer for their thoughtful feedback, constructive suggestions, and kind words. We greatly appreciate the recommendation to pursue further research directions to enhance the current results, and we are encouraged by the reviewer’s positive assessment that the paper can be accepted in its present form.

Reviewer 4 Report

Comments and Suggestions for Authors

This present form can be accepted.

Author Response

Comment:

This present form can be accepted.

Response:

We thank the reviewer for their positive evaluation and recommendation for acceptance. We also appreciate the suggestion regarding the clarity of English expression. In response, we have revised the manuscript to further improve the language and ensure that the research is presented as clearly as possible.
