The aim of this case study is to demonstrate the viability of the proposed framework through experiments in three distinct virtual scenarios representing different construction environments. Flat ground, rugged terrain, and sloped surfaces were utilized, allowing for a comprehensive evaluation of the effectiveness of the suggested methodology in varied settings.
4.1. Virtual Scene Generation
Using virtual construction scenarios to evaluate the system's viability serves two purposes: on the one hand, data can be acquired without instrumental error or external factors influencing it; on the other hand, the simulated scenario prevents worker deaths and injuries. Therefore, the authors conducted the experiments in virtual scenes. In this research, the authors built the virtual scenes with the Unity real-time development platform (version 2019.1.0f2), one of the most popular tools for creating 3D scenes [45]. Unity allowed the authors to easily design the appropriate working circumstances.
The three scenes feature different construction environments, flat ground, rugged ground, and slope, since excavators are commonly employed in all three settings [46]. The flat-ground scene has neither a slope nor an uneven road. The rugged-ground scene features an uneven road with a maximum height difference of approximately 1.5 m between the highest and lowest points, while the slope scene has a road surface inclined at 6 degrees. These three virtual construction scenes are designed to simulate typical topographic variations at real construction sites, aiming to validate the framework's generality and adaptability across diverse environments. In addition, to systematically evaluate the proposed method, specific working intervals and operational states were designed for both the worker and the excavator during the simulation, as illustrated in
Figure 7. It should be noted that although the virtual scenes in Figure 8 replicate key topographic characteristics, they differ from actual construction sites in aspects such as environmental complexity (e.g., the absence of random obstacles, variable lighting, and multiple concurrent operations), equipment wear, and human movement variability. To mitigate the impact of such differences on the framework's practical applicability, this study ensured that the virtual scenes retain the core spatial relationships and operational logic between excavators and workers, including realistic movement trajectories, equipment operational states, and distance ranges, and verified the framework's performance based on physical principles consistent with real-world scenarios.
Based on the above experimental configuration, the proposed method is evaluated using a simulation-based environment. The adoption of simulation in this study is motivated by its ability to provide repeatable and controllable experimental conditions, which are essential for feasibility validation of safety-related vision algorithms. In particular, simulation enables precise control over terrain configurations (including flat, rugged, and sloped surfaces), worker–excavator relative positions, motion trajectories, and camera viewpoints, while also allowing access to accurate ground-truth distance information that is difficult to obtain consistently in real construction sites.
Nevertheless, it should be explicitly noted that the simulated environment does not fully capture several factors commonly encountered in real-world construction scenarios. These include camera calibration inaccuracies (e.g., intrinsic and extrinsic parameter drift), sensor noise, motion blur caused by equipment vibration, lighting variations due to time-of-day changes or shadows, adverse weather conditions (such as rain, fog, or dust), and long-term sensor aging. Such factors may introduce additional uncertainty in depth estimation and safety-state classification when the system is deployed on real sites.
Therefore, the simulation-based evaluation conducted in this study should be interpreted as an upper-bound feasibility assessment under controlled conditions, rather than a direct indicator of real-site deployment performance. The primary objective at this stage is to systematically examine the effectiveness and limitations of the proposed stereo-vision-based safety identification framework before extending it to real construction environments using actual camera hardware under diverse operational conditions.
In the three scenarios, the stereo camera is emulated using two virtual cameras with identical configurations. Specifically, the cameras are mounted at a height of 5.0 m, with a sensor size of 36 × 24 mm, a focal length of 40 mm, and a downward tilt angle of 10°, while remaining parallel to each other. The focal length is the distance between the lens and the image plane, which affects the perspective and magnification of an image; a shorter focal length produces a wider field of view, capturing a broader scene. The sensor dimension is the physical size of the image sensor's photosensitive area; the larger the sensor, the more detailed the image and the higher the quality that can be achieved. The two camera lenses are separated by 10 cm. Although the stereo camera's parameters are known, the authors also applied a camera calibration method to obtain the parameters, to better simulate reality. The calibration is conducted with the Camera Calibrator function of MATLAB (version R2022a) [47], based on the method proposed by Zhengyou Zhang. In the virtual scene, the authors recorded 100 images (50 each from the left and right cameras) containing a black-and-white chessboard with a grid of 1 × 1 cm squares, placed at different angles and locations, as shown in Figure 9. MATLAB automatically obtained the camera parameters when the series of images was input into the Camera Calibrator function. The results show that the parameters obtained by the camera calibration method are close to the predefined values.
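For readers reproducing this step outside MATLAB, the following is a minimal sketch of Zhang's chessboard calibration using OpenCV in Python; the file pattern, board layout (9 × 6 inner corners), and corner-refinement settings are illustrative assumptions rather than the exact setup used here.

```python
# A minimal sketch of Zhang's chessboard calibration with OpenCV, as an
# open-source stand-in for MATLAB's Camera Calibrator. File names and the
# board size (inner corners) are illustrative assumptions.
import glob
import cv2
import numpy as np

BOARD = (9, 6)    # inner corners per row/column (assumed board layout)
SQUARE = 0.01     # 1 x 1 cm squares, in metres

# 3D corner positions of the board in its own plane (Z = 0)
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE

obj_points, img_points = [], []
for path in glob.glob("left_*.png"):      # e.g., the 50 left-camera images
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, BOARD)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# Recover intrinsics (focal length, principal point) and distortion terms
rms, K, dist, _, _ = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection RMS:", rms, "\nintrinsic matrix:\n", K)
```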
Based on the maximum digging radius, maximum digging depth, and maximum digging height of the LG6300E excavator (Shandong Lingong Construction Machinery Co., Ltd. (SDLG), Linyi, China), the safety distance threshold is defined as 8 m. In addition, the authors designed two extra scenes for setting the thresholds: an excavator moving back and forth and an excavator rotating in place while raising and lowering its arm. After experimenting with these two situations, the excavator thresholds for working status, motion, and part distribution, and the threshold for determining the workers' working status, are set to 5 pixels, 1 pixel, 0.90, and 0.90, respectively.
4.2. Dataset Generation and Model Training
In this study, the YOLOv5 object detection model was trained using construction-related images from the MOCS dataset [48] to improve detection accuracy on job sites. The MOCS image dataset, constructed by An et al., is a construction-relevant dataset annotated with thirteen categories of moving objects, including the excavator and the worker. The authors selected images with excavator and worker annotations as the image dataset for the object detection model. Additionally, crowdsourcing was adopted to label the cab and crawler of the excavator on top of the MOCS annotations. Five professional specialists used the LabelImg annotation tool (version 1.8.1) [49] to guarantee annotation correctness. LabelImg, created by Tzutalin, is one of the best-known image annotation tools, providing an easy labeling workflow and various annotation representation types; it was therefore used to label the images in this study. Finally, 21,615 images with object annotations for 12,051 excavators, 8,879 excavator cabs, 7,431 excavator crawlers, and 89,410 workers were extracted from the MOCS dataset and annotated (owing to object occlusion and projection angles, the numbers of excavators and of cabs and crawlers do not correspond exactly).
To ensure the reliability of the detection and identification results, a manual inspection procedure was conducted as part of the validation process. Manual inspection was performed by a team of three independent reviewers with experience in construction safety and computer vision–based annotation. These reviewers were different from the specialists involved in the initial dataset annotation stage, during which five professional annotators labeled the training data. For manual inspection, each reviewer independently examined a stratified random sample of frames drawn from the flat, rugged, and sloped scenarios, and determined whether the detected bounding boxes for workers and excavators were correct.
The sampling strategy ensured balanced coverage across the different terrain types, with an equal sampling fraction applied to each scenario. In cases of disagreement among reviewers, the final decision was determined by majority voting. To assess the consistency of the manual inspection process, inter-annotator agreement was calculated on a held-out subset of the inspected frames using Cohen's kappa coefficient, and the resulting agreement score is reported along with the total number of inspected samples.
The identification rate reported in this study was computed by comparing the algorithm outputs with the manually verified labels. A detection was considered correct if the Intersection over Union (IoU) between the predicted bounding box and the manually validated bounding box exceeded 0.5, a commonly adopted criterion in object detection evaluation. In addition to the identification rate, detection performance is also reported in terms of standard metrics such as precision and recall, where applicable.
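As a concrete reference, the IoU criterion described above can be computed as follows; the helper below is a minimal sketch (boxes are assumed to be pixel corners in (x1, y1, x2, y2) order), not the exact implementation used in this study.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection counts as correct when IoU with the verified box exceeds 0.5
is_correct = iou((10, 20, 110, 220), (15, 25, 118, 230)) > 0.5
```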
Manual inspection and annotation review were carried out using the LabelImg tool (version 1.8.1) and an internal visualization interface developed for this study. The inspection process was not fully blind to the algorithm outputs; however, to mitigate potential confirmation bias, a subset of the sampled frames was re-evaluated through a double-checked blind re-annotation procedure, and no statistically significant differences were observed between the blind and non-blind evaluations.

This research separated the image dataset into training, validation, and test datasets at a ratio of 0.7:0.15:0.15 for training and testing the object detection model. There were 15,130 images in the training dataset, 3,242 in the validation dataset, and 3,243 in the test dataset. Precision and recall were utilized as the evaluation metrics to assess the model validation performance, and the equations are shown below:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (8)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (9)$$

where $TP$ represents that the true category is positive and the model predicts the category as positive; $FN$ represents that the true category is positive but the model predicts the category as negative; and $FP$ represents that the true category is negative but the model predicts the category as positive. The test results for the object detection model in the test dataset are shown in
Table 1, illustrating the object detection model's performance.
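For illustration, Equations (8) and (9) can be evaluated directly from accumulated per-class counts; the counts in the example below are placeholders, not results from Table 1.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Equations (8) and (9): precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Placeholder counts for one class; real values come from the test dataset
p, r = precision_recall(tp=950, fp=40, fn=55)
print(f"precision={p:.4f}, recall={r:.4f}")
```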
For the stereo matching model, a related construction safety stereo matching image dataset is lacking. In addition, the pre-trained model provided with the BGNet model has high accuracy on public stereo datasets such as SceneFlow [44]. Therefore, the pre-trained model named 'Sceneflow-BGNet-Plus.pth' provided by the BGNet-related research is adopted here.
4.3. Experimental Comparison
To examine the effectiveness of the stereo-vision-based proximity estimation module, this research also used the reference-based method to determine spatial distance. In contrast to the stereo-vision-based method, the reference-based method assumes that the ground is flat and that the actual height of the representative point in the image is known in advance. In the three scenarios, the representative point's actual height for workers and excavators is 0 m and 1 m, respectively. For the reference-based method, the depth information is calculated based on the actual height of the point. As shown in
Figure 5, the angle $\theta$ between the optical axis and the measured point is first calculated as follows:

$$\theta = \arctan\left(\frac{y}{f}\right) \quad (10)$$

where $y$ corresponds to the distance along the y-axis from the optical center to the projection of the measured point on the camera plane and $f$ is the focal length of the camera lens. Then, through the angle $\theta$, the distance $D$ between the camera and the measured point is calculated as shown below:

$$D = \frac{H}{\sin(\alpha + \theta)} \quad (11)$$

where $H$ indicates the camera height and $\alpha$ denotes the tilt angle of the camera. Based on $D$, the depth value $Z$ of the representative point is obtained as in Equation (12):

$$Z = D \cos\theta \quad (12)$$
Finally, similar to the proximity estimation module, the camera coordinates $(X_c, Y_c, Z_c)$ can be calculated through Equation (6), and the spatial distance between the worker and the excavator is calculated using Equation (7).
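A compact sketch of the reference-based depth computation in Equations (10)–(12) is given below; the function name and the example image offset are illustrative, and the camera height should be taken relative to the representative point's known height (e.g., 5.0 − 1.0 m for the excavator point).

```python
import math

def reference_based_depth(y_img: float, f: float, cam_height: float,
                          tilt_deg: float) -> float:
    """Depth of a known-height point from a single calibrated view,
    following Equations (10)-(12): theta = arctan(y/f),
    D = H / sin(alpha + theta), Z = D * cos(theta).

    y_img      : signed offset (same units as f) of the point's projection
                 from the optical centre along the image y-axis
    f          : focal length (same units as y_img)
    cam_height : camera height H above the point's known height
    tilt_deg   : downward tilt angle alpha of the camera, in degrees
    """
    theta = math.atan2(y_img, f)                  # Equation (10)
    alpha = math.radians(tilt_deg)
    d = cam_height / math.sin(alpha + theta)      # Equation (11)
    return d * math.cos(theta)                    # Equation (12)

# Example with the paper's setup: 5 m camera height, 10 degree tilt,
# 40 mm focal length; the 2 mm image offset is an illustrative value.
z = reference_based_depth(y_img=0.002, f=0.040, cam_height=5.0, tilt_deg=10.0)
print(f"estimated depth: {z:.2f} m")
```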
4.4. Results
The object detection performance of the object tracking module is strong in all three scenes. The authors used manual inspection to evaluate the detection performance and took the identification rate as the evaluation criterion. The identification rate is the percentage of total frames that are correctly recognized:

$$\text{Identification rate} = \frac{N_{\text{correct}}}{N_{\text{total}}} \times 100\% \quad (13)$$

where $N_{\text{correct}}$ denotes the number of frames that are correctly identified and $N_{\text{total}}$ denotes the overall number of frames.
In the flat-ground scene, through manual inspection, the identification rates for the walking worker, excavator, excavator cab, and excavator crawler are 96.50%, 99.67%, 99.67%, and 99.00%, respectively; in the rugged-ground scene, they are 96.17%, 98.33%, 97.67%, and 95.17%, respectively; and in the slope scene, they are 93.50%, 99.67%, 99.67%, and 99.17%, respectively. The details of the results are presented in Table 2.
In the activity recognition module, $Accuracy_{part}$ is the evaluation metric used to assess the performance of identifying the worker's and excavator's working states:

$$Accuracy_{part} = \frac{S_{\text{correct}}}{S_{\text{total}}} \quad (14)$$

where $S_{\text{correct}}$ means the number of correctly identified states and $S_{\text{total}}$ means the total number of identified states. For the excavator, the accuracies are 98.17%, 98.00%, and 99.00% in the flat-ground, rugged-ground, and slope scenes, respectively. On the other hand, although the worker inside the excavator cab is not always fully detected, the working status of the worker, whether in the excavator cab or walking around the excavator, is recognized with 100% accuracy in all three scenes.
For the proximity estimation module, in the flat-ground scene, the average error of the spatial distance with the reference-based method is 0.38 m, while for the stereo-vision-based method, it is 0.81 m. In the rugged-ground scene, the average error of the reference-based method is 3.17 m, while that of the stereo-vision-based method is 0.89 m. In the slope scene, the average error of the reference-based method is 7.22 m, while for the stereo-vision-based method, it is 0.59 m. The details are shown in Figure 10, Figure 11 and Figure 12.
Regarding the worker safety identification results, the study also compared the results of the framework with the reference-based and stereo-vision-based methods. Precision, recall, F1-score, and accuracy were used to evaluate the performance of worker safety status identification, and Equations (8), (9), (15) and (16) show the details. For the flat-ground scenario, the overall accuracies of the reference-based and stereo-vision-based methods are 94.79% and 92.71%; in the rugged-ground scenario, they are 64.62% and 90.04%; and in the slope scene, they are 74.96% and 94.25%. The identification results for worker safety status in the three scenes are shown in Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8. Specifically, Table 3, Table 4 and Table 5 report the safety identification results (safe vs. unsafe) for the flat, rugged, and sloped scenes, respectively, while Table 6, Table 7 and Table 8 further summarize the corresponding quantitative performance metrics, including precision, recall, F1-score, and overall accuracy for each scenario. These tables collectively demonstrate the effectiveness of the proposed framework across different environmental conditions.
For the flat-ground scenario, Table 6 shows that the reference-based method achieves an overall accuracy of 94.79%, while the proposed stereo-vision-based method achieves 92.71%, corresponding to a difference of 2.08 percentage points.
This result can be explained by the underlying assumptions of the two approaches. The reference-based method explicitly assumes a planar ground surface and relies on predefined representative heights for workers and equipment. Under ideal flat-ground conditions, this prior knowledge can be effectively exploited, leading to slightly higher accuracy. In contrast, the stereo-vision-based method does not rely on planar assumptions and is designed to operate in both planar and non-planar environments. However, in strictly flat scenes, the performance of stereo matching can be affected by factors such as disparity noise, reduced matching reliability at larger distances, and limited texture information, which may result in marginal accuracy degradation.
To further examine whether this difference is statistically meaningful, a bootstrap-based confidence interval analysis was conducted on the flat-ground results. The analysis indicates that the observed accuracy gap between the two methods is small and lies within a narrow confidence range, suggesting that the difference, while measurable, is not dominant under ideal planar conditions.
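The bootstrap analysis can be reproduced along the following lines, assuming paired per-frame 0/1 correctness indicators for the two methods; the function and variable names are illustrative.

```python
import numpy as np

def bootstrap_accuracy_gap(correct_a, correct_b, n_boot=10_000, seed=0):
    """95% percentile bootstrap CI for the paired accuracy difference
    between two methods, given per-frame 0/1 correctness arrays."""
    rng = np.random.default_rng(seed)
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    n = len(a)
    idx = rng.integers(0, n, size=(n_boot, n))  # resample frames with replacement
    gaps = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    return np.percentile(gaps, [2.5, 97.5])

# Illustrative use with paired per-frame results on the flat-ground scene:
# lo, hi = bootstrap_accuracy_gap(ref_correct, stereo_correct)
# A narrow interval near zero indicates the gap is not dominant.
```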
In addition, a per-distance-bin analysis was performed by grouping samples into ranges of 0–4 m, 4–8 m, and greater than 8 m. The results show that the stereo-vision-based method exhibits slightly lower accuracy than the reference-based method at longer distances on flat ground, where disparity estimation becomes less reliable. This analysis helps clarify the source of the observed performance difference and highlights that the stereo-based approach maintains superior robustness in non-planar environments, as demonstrated in the rugged and sloped scenarios (Table 7 and Table 8).
In practical safety identification scenarios, particular attention must be given to cases where the estimated worker–excavator distance is close to the predefined safety threshold. Due to depth estimation uncertainty, small errors in stereo matching may cause borderline cases to be alternately classified as safe or unsafe when a strict binary threshold is applied. This phenomenon is especially pronounced when the estimated distance lies within a narrow interval around the safety threshold.
To analyze this effect, the estimated distance is more appropriately interpreted as an interval rather than a deterministic value. When the confidence interval of the estimated distance overlaps with the safety threshold, the corresponding safety state is regarded as uncertain. Such near-threshold cases are therefore treated conservatively in the decision-making process to reduce the risk of unsafe misclassification.
In addition, a sensitivity analysis was conducted by perturbing the safety threshold within a range of ±20% and evaluating the resulting changes in safety classification performance, including accuracy, precision, recall, and false alarm rate. The results indicate that while extreme threshold shifts lead to expected performance degradation, the proposed framework maintains stable safety identification performance within a reasonable error range, demonstrating robustness to moderate proximity estimation errors.
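A sketch of this threshold sensitivity sweep is shown below, assuming arrays of estimated distances and ground-truth unsafe flags; the names and step count are illustrative.

```python
import numpy as np

def threshold_sweep(dist_est, unsafe_gt, base_threshold=8.0,
                    rel_range=0.2, steps=9):
    """Perturb the safety threshold by up to +/- rel_range and report the
    resulting classification accuracy against fixed ground-truth flags.
    dist_est : estimated worker-excavator distances (m)
    unsafe_gt: ground-truth unsafe flags at the nominal 8 m threshold
    """
    dist_est = np.asarray(dist_est)
    unsafe_gt = np.asarray(unsafe_gt, dtype=bool)
    for t in np.linspace((1 - rel_range) * base_threshold,
                         (1 + rel_range) * base_threshold, steps):
        unsafe_pred = dist_est < t        # binary decision at perturbed t
        acc = np.mean(unsafe_pred == unsafe_gt)
        print(f"threshold {t:5.2f} m -> accuracy {acc:.3f}")
```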
From an engineering perspective, several mitigation strategies are incorporated to further reduce the impact of near-threshold errors. These include temporal smoothing of distance estimates across consecutive frames, the integration of relative motion information through time-to-collision analysis, and the use of multi-level safety states instead of a single binary decision. Together, these mechanisms help prevent frequent decision oscillations caused by small estimation errors near the safety threshold and improve the reliability of safety identification in realistic operating conditions.
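As one possible realization of these mitigation strategies, the sketch below combines temporal smoothing of distance estimates with a three-level safety state and a hysteresis band around the 8 m threshold; the window size and band half-width are illustrative assumptions, not values prescribed by the framework.

```python
from collections import deque

class SafetyStateFilter:
    """Smooths distance estimates over a short window and maps them to
    three safety levels with a hysteresis band to avoid oscillation."""

    def __init__(self, threshold=8.0, margin=0.5, window=5):
        self.threshold = threshold    # safety distance threshold (m)
        self.margin = margin          # warning band half-width (m), assumed
        self.history = deque(maxlen=window)

    def update(self, distance_m: float) -> str:
        self.history.append(distance_m)
        smoothed = sum(self.history) / len(self.history)  # temporal smoothing
        if smoothed < self.threshold - self.margin:
            return "unsafe"       # clearly inside the danger zone
        if smoothed < self.threshold + self.margin:
            return "warning"      # near-threshold: treated conservatively
        return "safe"

# Per-frame use: a near-threshold estimate no longer flips safe/unsafe
f = SafetyStateFilter()
for d in [8.3, 7.9, 8.1, 7.8, 8.2]:
    print(f.update(d))
```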