1. Introduction
Masonry work plays a crucial role in construction projects, directly influencing both schedules and costs [1,2]. The efficiency of masonry work is determined by several factors, including brick transportation speed, placement accuracy, and worker fatigue [3,4,5]. Given the labor-intensive nature of masonry tasks, workers are required to perform repetitive and physically demanding activities [6], which can lead to variations in productivity and an increased risk of fatigue over time [3]. These variations can cause delays in project timelines and inconsistencies in construction quality [7]. To mitigate these challenges, real-time monitoring and quantitative productivity assessment are essential for optimizing efficiency and maintaining consistency in masonry work [7,8,9].
Therefore, computer vision-based monitoring systems have been widely adopted in construction to automate worker activity tracking [10,11,12], productivity assessment [13,14], and safety monitoring [12,15,16,17,18]. These systems leverage deep learning-based object detection and tracking algorithms to identify workers [17,19,20], analyze movement patterns [21,22], and evaluate task efficiency [23,24]. Among these, You Only Look Once (YOLO), Faster Region-based Convolutional Neural Network (Faster R-CNN), and Single Shot Multibox Detector (SSD) are commonly used models, offering real-time detection capabilities that facilitate the continuous monitoring of worker behavior in dynamic construction environments [25,26,27,28].
Computer vision technologies have been utilized in construction for various monitoring applications, including safety compliance verification [29,30,31], progress tracking [32,33,34,35,36], and automated productivity analysis [37,38]. For instance, several studies have employed YOLO-based detection systems to identify workers wearing personal protective equipment (PPE) and to monitor compliance with safety regulations [39,40]. Additionally, motion analysis and pose estimation techniques have been integrated into construction site monitoring to evaluate worker posture, fatigue levels, and ergonomic risks [41,42,43]. Other research has explored the automated recognition of construction tasks, enabling productivity assessments based on detected worker actions and movements [44,45].
Despite these advancements, existing object detection models primarily rely on frame-by-frame image processing, which presents challenges in highly dynamic and cluttered construction environments [15,46,47,48]. Workers frequently interact with materials, tools, and other workers, resulting in occlusion scenarios that disrupt tracking continuity [49,50,51]. Addressing these limitations requires more advanced tracking methodologies that can maintain persistent worker identification even in occlusion-heavy environments [52,53,54].
Occlusion is a major challenge in computer vision-based construction monitoring, particularly in masonry work [48,55]. During brick transportation, workers carrying multiple bricks may partially or fully obscure their own bodies [56], preventing conventional object detection models from maintaining continuous tracking [48]. Similarly, as the height of the masonry wall increases, parts of the worker’s body become progressively occluded by the stacked bricks, leading to tracking inconsistencies and data loss in movement analysis [57]. The inability to accurately track workers under these conditions results in gaps in productivity assessments and unreliable performance metrics. For example, in a continuous monitoring situation, if a worker is frequently occluded by construction materials, the detection model may fail to recognize their presence for extended periods [58]. As a result, the system might incorrectly register the worker as inactive, leading to inaccurate productivity measurement results [38,59]. Furthermore, the limitations of existing models can lead to misinterpretations of worker activities, inaccurate performance evaluations, and difficulties in automating construction site monitoring [15,60]. Addressing the occlusion problem is crucial for enhancing the accuracy of automated worker tracking and ensuring reliable productivity evaluations [58,61].
To overcome these challenges, the SAM-based unified and robust zero-shot visual tracker with motion-aware instance-level memory (SAMURAI) has emerged as a promising solution [62]. Unlike conventional object detection models that process each frame independently, SAMURAI incorporates a motion-aware memory selection mechanism, enabling objects to be tracked continuously even in occlusion scenarios [62]. By integrating temporal continuity across multiple frames, SAMURAI significantly improves tracking accuracy in highly dynamic environments where objects frequently disappear and reappear due to physical obstructions [62,63]. SAMURAI was developed to address long-term object tracking (LTOT) challenges [62], particularly in scenarios where standard object detection models fail due to occlusion [62]. Unlike YOLO, which relies solely on frame-based detection, SAMURAI leverages previous frame information to maintain object identities, ensuring greater tracking stability [62].
SAMURAI is based on the Segment Anything Model (SAM) and its advanced version, SAM2, incorporating a motion-aware instance-level memory selection mechanism [62]. Unlike conventional object detection models, SAMURAI utilizes SAM2’s segmentation-driven framework to enhance object tracking performance, particularly in occlusion-heavy environments [62]. This approach allows SAMURAI to refine object tracking by integrating motion-aware selection and affinity-based filtering, ensuring persistent identification of occluded objects across frames [62]. Recent studies have demonstrated its potential in long-term tracking tasks, including surveillance and autonomous systems [62], suggesting that its occlusion-resistant capabilities could also be applicable to construction site monitoring.
Despite the increasing adoption of SAMURAI in other industries [62], studies investigating SAMURAI’s application in construction remain limited. To facilitate the adoption of SAMURAI in the construction sector, it is essential to evaluate whether it can provide a robust and reliable alternative to YOLO for real-time construction monitoring. While previous research has extensively explored YOLO-based worker tracking [64,65], no comprehensive study has directly compared the performance of YOLO and SAMURAI in masonry work. Since YOLO remains the most widely used detection model but continues to exhibit weaknesses in persistent worker tracking under occlusion [66,67], a direct comparative analysis with SAMURAI is necessary to determine its viability as an effective solution for construction monitoring challenges.
This study aims to bridge this gap by assessing how SAMURAI performs under occlusion and its potential impact on masonry productivity monitoring. Building on this assessment, the study further evaluates the applicability of occlusion-aware object tracking techniques for improving the accuracy of masonry worker productivity analysis on construction sites. By examining the strengths and limitations of YOLO and SAMURAI in occlusion-heavy environments, this research explores methods to enhance the reliability of productivity assessments in masonry work. Furthermore, by providing a comparative analysis of the two models, this study supports the decision-making process for selecting appropriate monitoring techniques based on real-world construction site requirements. The findings may also support advancements in real-time construction productivity monitoring and facilitate data-driven task analysis, ultimately improving the efficiency and accuracy of worker tracking in dynamic construction environments.
2. Methodology
2.1. Research Framework
This section presents the proposed framework that integrates experimental setup and data collection and compares YOLOv8 and SAMURAI. The primary objective is to evaluate the applicability of these models for detecting construction workers during masonry tasks, focusing on their performance under different levels of occlusion.
Figure 1 illustrates the overall process. YOLOv8 and SAMURAI are applied to both brick transportation and bricklaying tasks, using datasets with labeled worker images to ensure accurate detection. Video data are collected using a single camera positioned at a fixed location, capturing worker movements and occlusion scenarios. The framework categorizes occlusion into three levels: (1) non-occlusion (no obstruction), (2) partial occlusion (partial obstruction by materials or tools), and (3) severe occlusion (significant obstruction of the worker’s body).
While YOLOv8 processes each frame independently, SAMURAI maintains temporal continuity across frames, enabling more consistent detection even when workers are partially or fully obscured. The detailed structure and methodology of each algorithm will be explained in subsequent sections.
2.2. Experiment Design and Data Collection
This section describes the experimental design and data collection process to evaluate the performance of YOLOv8 and SAMURAI in brick transportation and brick laying tasks. The experiment was conducted in an outdoor environment simulating a construction site.
Figure 2a illustrates the experimental setup, showing the fixed camera position and the relative locations of the worker and bricks. A single fixed camera was used to capture two tasks performed by the worker: (1) brick transportation and (2) brick laying. The camera was positioned to clearly capture the worker from the front.
In the brick transportation task, the worker pushed a trolley loaded with bricks, and the level of occlusion was classified based on the degree to which the worker’s body was obstructed by an obstacle.
Figure 2b–d visually represent these occlusion levels: Case ① (non-occlusion) occurred when the worker’s entire body was visible without any obstruction; Case ② (partial occlusion) occurred when the obstacle obscured the worker below the knees; and Case ③ (severe occlusion) occurred when the obstacle obscured the worker from the waist up, significantly limiting visibility. The intersection over union (IoU) values for each occlusion level in the brick transportation task were as follows: Case ① (non-occlusion) ranged between 0.85 and 0.92, with an average IoU of 0.89; Case ② (partial occlusion) ranged between 0.67 and 0.75, with an average IoU of 0.71; and Case ③ (severe occlusion) exhibited the lowest values, ranging from 0.45 to 0.58, with an average IoU of 0.52.
In the brick laying task, the worker crouched while stacking bricks, adding one layer at a time until reaching a total of 13 layers. The levels of occlusion were classified based on the degree to which the worker’s body was obscured by the stacked bricks. As depicted in Figure 2e–g, Case ① (non-occlusion) occurred when the worker’s feet were visible (up to two to three layers of bricks); Case ② (partial occlusion) occurred when only the worker’s upper body was visible (up to seven to eight layers); and Case ③ (severe occlusion) occurred when only the worker’s face was visible (up to twelve layers). For the brick laying task, IoU values varied depending on the occlusion level: Case ① (non-occlusion) had IoU values between 0.82 and 0.90, with an average of 0.86; Case ② (partial occlusion) had values between 0.55 and 0.68, averaging 0.61; and Case ③ (severe occlusion) demonstrated the lowest IoU values, ranging from 0.38 to 0.50, with an average of 0.44.
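As a minimal illustration of how the reported IoU ranges could be mapped to the three occlusion cases in an analysis script, the following Python sketch applies threshold values chosen between the task-specific averages; the function name and cut-off values are illustrative assumptions rather than part of the study's annotation pipeline.

```python
def classify_occlusion(iou: float, task: str) -> str:
    """Map a measured IoU value to an occlusion case.

    Thresholds are illustrative assumptions placed between the IoU ranges
    reported for each task (transportation averages: 0.89 / 0.71 / 0.52;
    laying averages: 0.86 / 0.61 / 0.44); they are not part of the study's
    published procedure.
    """
    if task == "transportation":
        non_occ, severe = 0.80, 0.62   # non-occlusion above 0.80, severe below 0.62
    elif task == "laying":
        non_occ, severe = 0.75, 0.52
    else:
        raise ValueError(f"unknown task: {task}")

    if iou >= non_occ:
        return "Case 1: non-occlusion"
    if iou >= severe:
        return "Case 2: partial occlusion"
    return "Case 3: severe occlusion"


# Example: an IoU of 0.71 during brick transportation falls into Case 2.
print(classify_occlusion(0.71, "transportation"))
```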
For each task, videos were recorded until the worker reached each occlusion level. The collected video frames were annotated with bounding boxes around the worker, and the resulting dataset was used to evaluate the performance of YOLOv8 and SAMURAI, as described in the subsequent sections.
2.3. YOLOv8 and SAMURAI Architecture
This section describes the architectures of YOLOv8 and SAMURAI, outlining their core components and operational principles. YOLOv8 processes each frame independently, enabling rapid object detection [68,69]. In contrast, SAMURAI incorporates temporal dependencies through memory-based learning, facilitating continuous tracking across frames [62]. The subsequent subsections detail the structural elements of each architecture and explain the detection and tracking methodologies applied in this study. Additionally, the differences between YOLOv8 and SAMURAI in object detection and tracking are outlined to provide a clear understanding of their respective mechanisms and implementation in this research.
2.3.1. YOLOv8 Architecture
YOLOv8 is a deep learning-based object detection model that builds upon the efficiency and accuracy of its predecessors in the YOLO family [70,71]. It maintains the fundamental one-stage detection pipeline, allowing for rapid object localization and classification within a single forward pass [72,73]. YOLOv8 consists of three primary components: the backbone, neck, and head, as illustrated in Figure 3 [74,75].
Compared to YOLOv5, YOLOv8 introduces several architectural improvements. It replaces the C3 modules of YOLOv5 with Cross-Stage Partial connections with two convolutional layers and fusion (C2f), which improves feature fusion while reducing the number of parameters. Additionally, YOLOv8 removes the anchor box mechanism (anchor-free), simplifying the detection head and enabling more flexible object localization [76]. The backbone is responsible for extracting low-level to high-level spatial features from input images using convolutional layers. Enhanced modules such as C2f and the use of the Sigmoid Linear Unit (SiLU) activation function in place of the Leaky Rectified Linear Unit (Leaky ReLU) further contribute to better convergence and detection precision [76]. The backbone also employs Cross-Stage Partial (CSP) connections to enhance gradient flow and reduce computational redundancy while maintaining detection performance. The extracted features are then forwarded to the neck, which utilizes Spatial Pyramid Pooling-Fast (SPPF) layers to aggregate multi-scale features. This enhances the model’s ability to detect objects of varying sizes, making it particularly useful in dynamic environments. Finally, the YOLO head performs object classification and bounding box regression, producing the final detection results.
For this study, YOLOv8 was applied to detect construction workers performing brick transportation and bricklaying tasks. The model was trained on an annotated dataset consisting of worker images captured in an outdoor environment. The detection performance was evaluated across different occlusion conditions, ensuring a comprehensive analysis of the model’s effectiveness in real-world construction scenarios.
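For context, a frame-by-frame detection loop of the kind described above can be sketched with the Ultralytics YOLOv8 API; the weights file, video path, and confidence threshold shown here are placeholder assumptions and do not reflect the exact training configuration or dataset used in this study.

```python
import cv2
from ultralytics import YOLO  # Ultralytics YOLOv8 package

# Illustrative paths and parameters; the study's actual weights and thresholds may differ.
model = YOLO("yolov8n.pt")                 # pretrained or fine-tuned worker-detection weights
cap = cv2.VideoCapture("masonry_task.mp4")

frame_idx = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Each frame is processed independently: no information is carried over
    # from previous frames, which is why occlusion interrupts detection.
    results = model(frame, conf=0.5, classes=[0], verbose=False)  # class 0 = person in COCO-pretrained weights
    detected = len(results[0].boxes) > 0
    print(f"frame {frame_idx}: worker detected = {detected}")
    frame_idx += 1

cap.release()
```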
Figure 3 illustrates the improved architecture of YOLOv8, including the updated backbone modules and the anchor-free detection head, which distinguish it from previous YOLO versions and contribute to its enhanced performance.
2.3.2. SAMURAI Architecture
SAMURAI is an object tracking architecture that utilizes temporal relationships between consecutive frames to track object locations [62]. Unlike single-frame-based detection models, SAMURAI incorporates a memory-based approach to maintain object information over time [62]. This study applies SAMURAI to analyze worker movements in a construction environment and assess detection performance under different occlusion conditions.
The SAMURAI architecture consists of key components, including an image encoder, memory attention, mask decoder, and additional modules for processing and evaluation. The image encoder processes input frames and extracts feature maps, converting visual data into structured representations. These extracted features are then used to detect objects and analyze spatial relationships. The memory attention module utilizes stored object information from previous frames to determine object locations in the current frame, facilitating tracking across sequential frames.
The mask decoder reconstructs object locations using the memory-attended information. Additionally, the motion modeling module accounts for object movement characteristics such as displacement and velocity. The memory encoder further refines detection results, and two auxiliary components, the affinity head and the object head, assess the confidence of detected objects based on the consistency of their features across frames.
SAMURAI includes a motion-aware memory selection mechanism, which prioritizes and utilizes relevant temporal features for each object across sequential frames. This mechanism enables continuous object tracking by dynamically selecting memory features that reflect motion patterns, even under conditions involving occlusion or scene changes.
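The idea behind motion-aware memory selection can be summarized with the simplified Python sketch below, which scores candidate memory entries by appearance affinity and motion consistency; all names, data structures, and weightings are illustrative assumptions and do not reproduce the actual SAMURAI implementation.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    frame_idx: int
    affinity: float      # mask/appearance confidence (e.g., from an affinity head)
    motion_score: float  # agreement with the predicted motion state of the tracked worker


def select_memory(bank: list[MemoryEntry], k: int = 5,
                  w_affinity: float = 0.6, w_motion: float = 0.4) -> list[MemoryEntry]:
    """Keep the k memory entries most consistent with appearance and motion.

    Illustrative weighting only; the real SAMURAI selection criteria differ in detail.
    """
    scored = sorted(
        bank,
        key=lambda m: w_affinity * m.affinity + w_motion * m.motion_score,
        reverse=True,
    )
    return scored[:k]


# During occlusion, low-affinity entries are filtered out, so subsequent attention
# relies on reliable past observations rather than the occluded current view.
bank = [MemoryEntry(10, 0.9, 0.8), MemoryEntry(11, 0.3, 0.2), MemoryEntry(12, 0.7, 0.9)]
print([m.frame_idx for m in select_memory(bank, k=2)])  # -> [10, 12]
```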
In this study, SAMURAI is used to track construction workers performing brick transportation and bricklaying tasks. The collected data are processed through the memory module, allowing for the verification of detection consistency across frames. The detailed evaluation of detection performance will be discussed in subsequent sections.
Figure 4 illustrates the SAMURAI architecture implemented in this study.
2.3.3. Differences in Detection Between YOLOv8 and SAMURAI
This section presents the structural and operational differences between YOLOv8 and SAMURAI, focusing on their distinct detection methodologies. The primary difference between these models lies in how they process frames and track objects over time. YOLOv8 functions as a frame-based object detection model, where each image is analyzed independently without referencing previous frames. This approach involves extracting features directly from each frame using a convolutional neural network (CNN) and performing object localization and classification within that frame. As a result, YOLOv8 provides detection results based only on the current frame without incorporating temporal dependencies.
On the other hand, SAMURAI utilizes a memory-based tracking mechanism that references object information from previous frames. By integrating memory attention, SAMURAI can refine detection results based on past observations, allowing for frame-to-frame consistency in tracking. Additionally, motion modeling is employed to estimate object movement trajectories, leveraging historical motion patterns to complement detection results.
Table 1 summarizes the key differences between YOLOv8 and SAMURAI in object detection and data processing.
2.4. Performance Comparison Between YOLOv8 and SAMURAI
This section presents the methodology for evaluating the detection performance of YOLOv8 and SAMURAI under varying occlusion conditions. The evaluation process systematically segments video sequences into fixed temporal intervals and measures detection accuracy at each time-step. To ensure structured analysis, detection results are compiled in tabular format, allowing for a comparative assessment based on task type, applied algorithm, occlusion case, and time-based evaluation intervals.
Figure 5 visually represents the evaluation segments for each occlusion scenario, including key time points (T1 to TL) and occlusion severity levels. Additionally, it illustrates the overall experimental workflow, detailing the processes of data preprocessing, model implementation (YOLOv8 and SAMURAI), and systematic performance evaluation across different tasks and conditions. T1 denotes the initial evaluation time-step for each occlusion scenario, while TL (the last time-step) represents the final evaluation time point.
For the brick transportation task, performance evaluation encompasses Case ① (non-occlusion), Case ② (partial occlusion), and Case ③ (severe occlusion). The evaluation period starts from the moment the worker enters the occlusion region until they are fully visible again. In all cases, a total duration of 15 s is analyzed, with detection results recorded at 0.25 s intervals, yielding 60 evaluation time points (T1 to T60). In Case ①, the worker remains fully visible without obstruction, providing a baseline for detection performance. In Case ②, the worker becomes partially occluded as an obstacle obstructs the lower body, and evaluation continues until visibility is fully restored. In Case ③, severe occlusion occurs when the worker’s upper body and head are significantly obscured by the obstacle, posing additional detection challenges. By dividing the evaluation into 0.25 s intervals, the study assesses how YOLOv8 and SAMURAI perform during the occlusion period.
For the brick laying task, performance evaluation is based on occlusion progression due to the increasing height of stacked bricks rather than worker movement. The non-occlusion condition (Case ①) is defined as the stage when the worker remains fully visible, corresponding to brick layers up to the second and third rows. Partial occlusion (Case ②) is observed when only the upper body of the worker is visible, occurring from the seventh to eighth brick layers. Severe occlusion (Case ③) begins when only the worker’s face or part of the upper body is visible, occurring from the twelfth brick layer onward. Since stacking two brick layers takes approximately one minute, the evaluation follows the same segmentation approach as the brick transportation task, recording detection results at 0.25 s intervals over 240 time-steps (T1 to T240).
By structuring the evaluation in this manner, detection trends can be analyzed across time, and the comparative performance of YOLOv8 and SAMURAI under varying occlusion levels can be assessed. The subsequent section discusses the experimental results and provides a detailed performance comparison based on the collected data.
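To make the segmentation procedure explicit, the following Python sketch samples frame-level detection flags at 0.25 s intervals and computes the per-case detection accuracy; the frame rate, variable names, and helper functions are illustrative assumptions rather than the exact evaluation scripts used in this study.

```python
def sample_time_steps(detections_per_frame, fps, interval_s=0.25, duration_s=15.0):
    """Sample frame-level detection flags at fixed intervals (e.g., every 0.25 s).

    detections_per_frame[i] is True if the worker was detected in frame i.
    Returns one boolean per time-step (60 for a 15 s clip, 240 for a 60 s clip).
    """
    steps = int(round(duration_s / interval_s))
    sampled = []
    for t in range(steps):
        frame_idx = min(int(round(t * interval_s * fps)), len(detections_per_frame) - 1)
        sampled.append(detections_per_frame[frame_idx])
    return sampled


def detection_accuracy(sampled_steps):
    """Detection accuracy (%) = successfully detected time-steps / total time-steps * 100."""
    return 100.0 * sum(sampled_steps) / len(sampled_steps)


# Example: 56 detected time-steps out of 60 corresponds to 93.33% accuracy.
example_steps = [True] * 56 + [False] * 4
print(round(detection_accuracy(example_steps), 2))  # 93.33
```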
All experimental evaluations were conducted using Python 3.10 and PyTorch 2.0 on a workstation equipped with an NVIDIA RTX 3090 GPU and Intel Core i9-12900K CPU. Both YOLOv8 and SAMURAI were implemented in the same software environment to ensure consistency in detection performance measurement across time-steps. Detection results were recorded and processed systematically at each predefined interval, supporting accurate and reproducible evaluation.
3. Experimental Evaluation
This section presents the evaluation results of YOLOv8 and SAMURAI for worker detection in brick transportation and brick laying tasks. The analysis focuses on detection performance across different occlusion levels, comparing the two models based on their ability to identify workers under varying degrees of visibility. The evaluation follows a structured time-based analysis, ensuring a comprehensive comparison of detection trends.
To systematically compare the detection performance, the evaluation framework categorizes the results into three occlusion cases: Case ① (non-occlusion), Case ② (partial occlusion), and Case ③ (severe occlusion). For each case, detection accuracy is recorded at predefined time intervals to analyze temporal detection consistency. The results are presented in a tabular format, with additional visual representations highlighting key detection outcomes at different occlusion stages.
The results are divided into two sections based on experimental tasks. Section 3.1 analyzes detection performance in brick transportation, where occlusion occurs as the worker moves behind obstacles, such as a trolley loaded with bricks. Section 3.2 evaluates detection performance in brick laying, where occlusion progressively increases as more brick layers are stacked in front of the worker. Each subsection details the detection trends observed for YOLOv8 and SAMURAI, offering insights into their object detection consistency across time and occlusion conditions. A final summary in Section 3.3 consolidates the findings and provides a comparative analysis of the detection effectiveness of both models.
3.1. Detection Performance in Brick Transportation
The detection performance of YOLOv8 and SAMURAI was evaluated for the brick transportation task by analyzing detection results under three occlusion conditions: non-occlusion, partial occlusion, and severe occlusion. The evaluation process involved segmenting video sequences into specific time intervals and measuring detection accuracy at each time-step. Performance assessment was conducted using multiple metrics, including the F1–confidence curve, recall–confidence curve, precision–recall curve, precision–confidence curve, and confusion matrix, as shown in Figure 6.
Detection results were analyzed over 60 evaluation time-steps, recorded at 0.25 s intervals. Detection accuracy was determined based on the classification of true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs). The accuracy for each occlusion case was computed using the following equation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Based on the values in the confusion matrix, the accuracy was calculated to be 0.975.
Each detection event was recorded at 0.25 s intervals, and detection success was systematically analyzed at each time-step. The evaluation period for each case was set to 15 s, resulting in 60 detection time-steps from T1 to T60. The dataset included sample detection images for visualization, with four representative detections per algorithm for each occlusion level, totaling twenty-four images, as shown in Figure 7.
Detection accuracy varied across occlusion levels and algorithms. The time-series detection status for YOLOv8 and SAMURAI in brick transportation is summarized in Table 2, which provides a structured comparison of detection performance across different occlusion conditions, incorporating accuracy values. The non-occlusion case was evaluated separately for YOLOv8 and SAMURAI. For the partial occlusion case, the analysis focused on frames where the worker was partially obstructed by an obstacle, continuing until the obstruction was fully removed. The severe occlusion case was evaluated for the period when the worker’s upper body was significantly obstructed, affecting visibility.
Table 2 provides a detailed time-step-based evaluation, outlining detection success at each recorded time-step. The overall performance for each case was computed using the following equation:

Detection accuracy (%) = (successfully detected time-steps / total time-steps) × 100

YOLOv8 detected 56 out of 60 time-steps in the non-occlusion case, achieving an accuracy of 93.33%. In the partial occlusion case, it detected 49 out of 60 time-steps, achieving an accuracy of 81.67%. In the severe occlusion case, it detected 41 out of 60 time-steps, resulting in an accuracy of 68.33%. Similarly, SAMURAI detected 59 out of 60 time-steps in the non-occlusion case, achieving an accuracy of 98.33%. In the partial occlusion case, it detected 57 out of 60 time-steps, achieving an accuracy of 95.00%. In the severe occlusion case, it detected 56 out of 60 time-steps, resulting in an accuracy of 93.33%.
3.2. Detection Performance in Brick Laying
The detection performance of YOLOv8 and SAMURAI was also evaluated for the brick laying task. Unlike the brick transportation task, which used 60 evaluation time-steps, the brick laying task was analyzed over 240 time-steps. As in the previous analysis, three occlusion conditions were considered: non-occlusion, partial occlusion, and severe occlusion.
Figure 8 presents representative detection cases for each occlusion level of YOLOv8 and SAMURAI, including four cases per algorithm and per occlusion condition, totaling 24 detection examples.
Table 3 provides a structured comparison of detection performance across different occlusion conditions, summarizing the number of correctly detected time-steps for each algorithm. As in Section 3.1, detection accuracy was calculated using the previously introduced formula.
Table 3 provides a detailed breakdown of detection success at each recorded time-step for the brick laying task. YOLOv8 detected 223 out of 240 time-steps in the non-occlusion case, achieving an accuracy of 92.92%. In the partial occlusion case, it detected 185 out of 240 time-steps, achieving an accuracy of 77.08%. In the severe occlusion case, it detected 117 out of 240 time-steps, resulting in an accuracy of 48.75%. Similarly, SAMURAI detected 236 out of 240 time-steps in the non-occlusion case, achieving an accuracy of 98.33%. In the partial occlusion case, it detected 226 out of 240 time-steps, achieving an accuracy of 94.17%. In the severe occlusion case, it detected 222 out of 240 time-steps, resulting in an accuracy of 92.50%.
3.3. Comparison of the Detection Accuracy of YOLOv8 and SAMURAI
Section 3.1 and Section 3.2 presented the detection performance of YOLOv8 and SAMURAI in two different construction tasks: brick transportation and brick laying. Detection accuracy varied across occlusion conditions and tasks, with YOLOv8 and SAMURAI demonstrating different levels of robustness under various occlusion scenarios. To provide a direct comparison, the detection accuracy of each case was computed as the ratio of correctly detected time-steps to the total number of time-steps across both tasks.
The total number of time-steps used for evaluation was 300, derived from 60 time-steps in the brick transportation task and 240 time-steps in the brick laying task. The final accuracy comparison was computed using the following equation:

Overall accuracy (%) = (correctly detected time-steps across both tasks / 300) × 100
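For example, applying this equation to the severe occlusion condition with the per-task counts reported in Section 3.1 and Section 3.2 gives:

Overall accuracy (YOLOv8, severe occlusion) = (41 + 117) / 300 × 100 ≈ 52.67%
Overall accuracy (SAMURAI, severe occlusion) = (56 + 222) / 300 × 100 ≈ 92.67%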
Table 4 summarizes the comparative detection accuracy of YOLOv8 and SAMURAI for each occlusion condition across both tasks.
The results in Table 4 indicate that detection accuracy was highest in Case ① (non-occlusion), followed by Case ② (partial occlusion), and lowest in Case ③ (severe occlusion). Additionally, SAMURAI consistently achieved higher detection accuracy than YOLOv8 across all occlusion conditions. These findings highlight variations in detection robustness depending on the occlusion level and the nature of the construction task. Further analysis of these results will be discussed in subsequent sections.
4. Discussion
This section analyzes the detection performance of YOLOv8 and SAMURAI across different occlusion levels and construction tasks. The impact of occlusion severity on detection accuracy is examined, along with differences between brick transportation and brick laying tasks. Additionally, this section discusses how detection speed and accuracy affect each other, explaining the advantages and disadvantages of each model. The practical applications and limitations of YOLOv8 and SAMURAI are evaluated, and potential improvements, such as hybrid detection models, are proposed for future research.
4.1. Analysis of Detection Performance
The detection performance of YOLOv8 and SAMURAI was assessed across different occlusion conditions and construction tasks, focusing on accuracy, processing speed, and computational efficiency. The evaluation results highlight the strengths and limitations of each model in various scenarios.
Both models demonstrated high detection accuracy under non-occlusion conditions (Case ①). However, as occlusion increased, YOLOv8’s performance declined significantly. In partial occlusion (Case ②), YOLOv8 showed reduced accuracy compared to SAMURAI, and in severe occlusion (Case ③), its accuracy dropped to approximately 52.67%. In contrast, SAMURAI maintained a relatively high accuracy of 92.67%, indicating that its memory-based tracking mechanism helps maintain consistent detection results even when objects are occluded for extended periods.
The impact of occlusion also varied by task type. In the brick transportation task, which involved 15 s video clips, both models achieved relatively high accuracy due to the worker’s continuous movement, which periodically exposed them to the camera. However, in the brick laying task, which used 60 s video clips, detection accuracy decreased due to the prolonged periods in which workers remained stationary behind stacked bricks. YOLOv8 performed better in dynamic environments where occlusion was temporary, whereas SAMURAI demonstrated greater robustness in scenarios where occlusion persisted over time.
In addition to accuracy, key evaluation metrics such as precision, recall, F1-score, and IoU were analyzed to provide a comprehensive assessment of detection performance. YOLOv8 achieved a precision of 97.00%, recall of 98.00%, and F1-score of 0.97, indicating strong performance under minimal occlusion conditions. Similarly, SAMURAI recorded a precision of 96.50%, recall of 94.80%, and F1-score of 0.95, maintaining high detection quality even under occlusion-heavy scenarios.
Beyond accuracy, detection speed is also a crucial factor in evaluating model performance. YOLOv8 processed frames at approximately 28–32 FPS, making it suitable for real-time applications. However, its frame-by-frame detection approach made it more susceptible to occlusion. In contrast, SAMURAI prioritized accuracy over speed, operating at approximately 9–12 FPS. While this slower speed limits its suitability for real-time applications, its ability to maintain high accuracy in occlusion-heavy environments makes it more reliable for scenarios where precise detection is required.
Table 5 provides a summary of detection accuracy, precision, recall, F1-score, processing speed, and video processing times for each task, offering a detailed comparison between the two models.
These findings indicate that YOLOv8 is approximately 2.5 to 3.5 times faster than SAMURAI in terms of frame processing speed. However, SAMURAI achieves about 1.3 times higher accuracy on average across all occlusion conditions. YOLOv8 is well-suited for real-time applications requiring fast inference, while SAMURAI provides superior accuracy, particularly under occlusion, albeit with higher computational demands. The inclusion of precision, recall, and F1-score enhances the reliability of performance evaluation, providing a more comprehensive view of each model’s detection capabilities under various site conditions.
4.2. Contribution and Limitations
The comparative analysis of YOLOv8 and SAMURAI highlights the balance between detection speed and accuracy. YOLOv8 is highly effective in environments with minimal occlusion, ensuring fast detection with lower computational costs. However, its frame-by-frame processing makes it vulnerable to detection failures under occlusion. In contrast, SAMURAI excels in environments with frequent occlusions, maintaining higher accuracy even in challenging detection scenarios. Quantitatively, SAMURAI demonstrates approximately 1.3 times higher average detection accuracy compared to YOLOv8 across all occlusion conditions, though at the cost of 2.5–3.5 times slower processing speed. These differences suggest that a hybrid detection approach combining YOLOv8’s speed and SAMURAI’s occlusion resilience could serve as a viable alternative for optimizing performance across diverse construction environments. Moreover, this comparison contributes to the decision-making process in selecting an appropriate monitoring algorithm based on the specific requirements of construction sites. By analyzing the strengths and weaknesses of both models, this study provides insights that can guide the development of optimized monitoring strategies tailored to the specific requirements of construction site operations.
The implementation of accurate worker tracking systems has significant implications for construction productivity and safety management [58,59]. Enhanced detection accuracy, particularly in occlusion-heavy environments, enables more reliable productivity measurements, facilitates data-driven performance evaluations, and supports proactive safety monitoring by maintaining continuous worker visibility [15,58]. For instance, in masonry construction, accurate tracking can help identify inefficient work patterns, optimize material handling procedures, and reduce safety risks associated with blind spots.
From a practical standpoint, integrating SAMURAI into existing monitoring systems may influence project workflows, budget allocation, and system scalability. For example, deploying SAMURAI may require more robust hardware infrastructure, such as high-performance GPUs, and increased maintenance due to its computational demands. These factors can affect cost-efficiency and scalability, especially in large or resource-constrained projects. Nonetheless, in environments where precise worker tracking is critical, the enhanced detection performance may justify the additional investment by improving safety outcomes and operational oversight.
However, this study has several limitations. The experiments were conducted in a controlled environment, meaning that real-world variations in lighting, background complexity, and unpredictable worker movements could introduce variability in detection performance. The focus on masonry work limits the generalizability of the findings to other construction tasks, such as steel frame assembly or excavation, which may present different occlusion patterns and movement dynamics. Additionally, the evaluation focused on specific occlusion cases, which may not fully capture the diversity of real-world occlusion scenarios in construction sites. Moreover, the occlusion levels in this study were defined using IoU thresholds. However, IoU alone may not fully capture the dynamic nature and complexity of occlusion in real-world environments. Future studies should consider incorporating additional metrics or qualitative assessments to improve occlusion characterization. Furthermore, SAMURAI’s computational performance under high-resolution inputs or complex scenes was not evaluated in this study. Future research should assess whether SAMURAI can maintain its detection accuracy advantage in high-resolution video streams (e.g., 4K) and evaluate latency across different hardware configurations to determine practical deployment feasibility. From a practical implementation perspective, SAMURAI’s higher computational requirements present challenges for deployment in resource-constrained environments, potentially necessitating more substantial hardware investments compared to YOLOv8-based systems. The trade-off between detection accuracy and computational efficiency must be carefully considered within the context of available infrastructure and monitoring objectives. Therefore, further testing in diverse environmental conditions and various computational setups is necessary to validate the generalizability and practical feasibility of the findings.
While YOLOv8 is expected to be advantageous in real-time safety monitoring applications where immediate detection is critical, SAMURAI is anticipated to offer superior accuracy in occlusion-heavy environments where long-term worker tracking is required. Based on the current technological capabilities, YOLOv8 may be more suitable for open construction sites with minimal occlusion and where rapid detection is prioritized, whereas SAMURAI would be better suited for complex, multi-level construction environments with frequent occlusions where tracking continuity is essential. Understanding the strengths and limitations of each model allows for the development of more adaptable detection systems, potentially leveraging a hybrid approach that balances speed and accuracy to optimize detection performance in construction environments.
4.3. Future Research
To further enhance the effectiveness of worker detection systems in construction environments, several future research directions are proposed. One potential avenue is the development of hybrid models that leverage the strengths of both YOLOv8 and SAMURAI. By combining YOLOv8’s fast detection speed with SAMURAI’s robust occlusion handling, a more efficient and accurate detection framework could be achieved. For instance, a conceptual two-stage detection process could be implemented, in which YOLOv8 serves as the primary detection system and SAMURAI applies additional occlusion correction techniques. The illustrative workflow of this conceptual hybrid detection framework is shown in Figure 9.
The figure illustrates the hypothetical two-stage detection process that integrates YOLOv8 and SAMURAI, highlighting the role of each model and the expected detection outcome. The first stage employs YOLOv8, which excels in fast real-time object detection. YOLOv8 is applied as the primary detection model, quickly identifying workers under non-occlusion or low-occlusion conditions. If YOLOv8 successfully detects a worker, the result is directly output as the final detection. However, in cases where occlusion is present, YOLOv8 often fails to maintain detection due to its frame-based processing.
In this conceptual framework, YOLOv8’s detection failure is inferred through predefined conditions such as low detection confidence or temporal discontinuity (e.g., missing detection in sequential frames). When these conditions are met, the system assumes potential occlusion and activates SAMURAI for secondary tracking.
To address this issue, the second stage introduces SAMURAI, which incorporates motion-aware instance-level memory to enhance detection under high-occlusion conditions. When YOLOv8 fails to detect a worker due to severe occlusion, SAMURAI is activated as a secondary detection model, leveraging its tracking mechanism to re-identify the worker. By maintaining continuous traceability across frames, SAMURAI can recover detection failures caused by occlusion and provide more stable worker tracking in construction environments.
This proposed hybrid framework is designed to balance the trade-off between speed and accuracy by prioritizing YOLOv8 for rapid detection in clear conditions while leveraging SAMURAI for occlusion correction in challenging scenarios. The integration of both models minimizes the computational burden associated with using SAMURAI for the entire detection process while ensuring reliable worker tracking under varying occlusion levels.
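A minimal sketch of this conceptual dispatch logic is given below; the confidence threshold, the tolerated detection gap, and the tracker interfaces are illustrative assumptions, since no integrated hybrid implementation was developed or validated in this study.

```python
CONF_THRESHOLD = 0.5   # illustrative: below this, the YOLOv8 output is treated as unreliable
MAX_MISSED = 4         # illustrative: consecutive missed frames before assuming occlusion


def hybrid_track(frames, yolo_detect, samurai_track):
    """Conceptual two-stage dispatch: YOLOv8 first, SAMURAI as an occlusion fallback.

    yolo_detect(frame)   -> (box or None, confidence)   # assumed detector interface
    samurai_track(frame) -> box or None                  # assumed memory-based tracker interface
    """
    missed = 0
    results = []
    for frame in frames:
        box, conf = yolo_detect(frame)
        if box is not None and conf >= CONF_THRESHOLD:
            # Clear view: keep the fast frame-based result.
            missed = 0
            results.append(("yolo", box))
            continue
        # Low confidence or no detection counts toward temporal discontinuity.
        missed += 1
        if missed >= MAX_MISSED:
            # Assume occlusion and hand over to the memory-based tracker.
            results.append(("samurai", samurai_track(frame)))
        else:
            results.append(("none", None))
    return results
```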
In addition, future research should consider benchmarking SAMURAI against other state-of-the-art tracking algorithms, such as transformer-based trackers and memory-augmented networks. Comparative evaluations with these advanced models would provide deeper insights into SAMURAI’s unique strengths and limitations, allowing for a more comprehensive understanding of its potential applications across diverse tracking scenarios.
Another crucial research direction involves evaluating performance in diverse environmental conditions. While this study was conducted in a controlled setting, future experiments should consider varied lighting conditions and complex backgrounds to assess the adaptability of YOLOv8 and SAMURAI. Additionally, testing in environments with more intricate occlusion patterns could provide further insights into SAMURAI’s robustness and limitations.
Overall, future studies should focus on enhancing hybrid detection strategies and testing in varied real-world conditions to develop more reliable and scalable worker detection models for construction environments.
5. Conclusions
This study evaluated the detection performance of YOLOv8 and SAMURAI for monitoring workers in masonry tasks, specifically in brick transportation and brick laying. The results demonstrate that YOLOv8 provides fast real-time detection, making it well-suited for dynamic environments where immediate monitoring is necessary. However, its reliance on frame-by-frame detection leads to a significant decline in accuracy under occlusion, particularly in cases of prolonged or severe obstructions. In contrast, SAMURAI maintains a high detection accuracy even in occlusion-heavy conditions due to its memory-based tracking mechanism, but its higher computational demands limit its real-time applicability.
Through this comparative analysis, it was observed that each model has distinct advantages and limitations depending on the work environment and occlusion conditions. YOLOv8 performs effectively when occlusion is minimal, ensuring rapid inference, while SAMURAI is more robust in environments with frequent occlusion, allowing for the more stable tracking of workers despite visual obstructions. These findings highlight the need to select the appropriate detection model based on the specific requirements of the construction site, balancing the need for speed and accuracy according to the operational demands.
Furthermore, this study proposes a hybrid detection approach that integrates YOLOv8’s speed with SAMURAI’s robustness against occlusion. While the potential benefits of such an approach have been outlined, further research is required to develop and validate an integrated framework that optimizes both real-time performance and detection accuracy. Additionally, future studies should focus on evaluating the models under diverse environmental conditions, including varying lighting, complex backgrounds, and different types of occlusions, to ensure broader applicability in real-world construction settings.
By addressing the challenges of worker detection during the performance of masonry tasks, this study contributes to improving productivity monitoring and safety management on construction sites. The findings provide a foundation for developing more adaptive and reliable vision-based monitoring systems, ensuring their effectiveness in diverse and dynamic construction environments.