Article

Development and Validation of a Computer Vision Dataset for Object Detection and Instance Segmentation in Earthwork Construction Sites

by JongHo Na 1, JaeKang Lee 2,*, HyuSoung Shin 1 and IlDong Yun 3
1 Department of Future & Smart Construction Research, Korea Institute of Civil Engineering and Building Technology, Goyang-si 10223, Republic of Korea
2 Department of Civil Engineering, Dong-A University, Busan 49315, Republic of Korea
3 Department of Computer Engineering, Hankuk University of Foreign Studies, Yongin-si 17035, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(16), 9000; https://doi.org/10.3390/app15169000
Submission received: 14 July 2025 / Revised: 6 August 2025 / Accepted: 8 August 2025 / Published: 14 August 2025
(This article belongs to the Section Civil Engineering)

Featured Application

This dataset facilitates the development of AI-based safety monitoring systems for earthwork construction sites by enabling accurate object detection and segmentation of construction machinery and terrain.

Abstract

Construction sites report the highest rate of industrial accidents, prompting the active development of smart safety management systems based on deep learning-based computer vision technology. To support the digital transformation of construction sites, securing site-specific datasets is essential. In this study, raw data were collected from an actual earthwork site. Key construction equipment and terrain objects primarily operated at the site were identified, and 89,766 images were processed to build a site-specific training dataset. This dataset includes annotated bounding boxes for object detection and polygon masks for instance segmentation. The performance of the dataset was validated using representative models—YOLO v7 for object detection and Mask R-CNN for instance segmentation. Quantitative metrics and visual assessments confirmed the validity and practical applicability of the dataset. The dataset used in this study has been made publicly available for use by researchers in related fields. This dataset is expected to serve as a foundational resource for advancing object detection applications in construction safety.

1. Introduction

The construction industry is inherently high-risk due to its diverse risk factors [1,2]. Earthwork sites, in particular, are prone to serious accidents involving large machinery and heavy equipment. In response to these risks, the Serious Accidents Punishment Act, enacted in 2022 and amended in 2023, mandates the implementation of AI-based real-time safety management systems at construction sites employing five or more workers [3]. In addition, Seoul City has introduced institutional measures to enhance safety management, such as requiring CCTV installation at small-scale private construction sites with budgets under KRW 5 billion starting in 2024 [4]. Consequently, interest in intelligent safety systems utilizing CCTV footage to detect and respond to on-site hazardous situations in real time is increasing [5,6].
A core requirement for real-time safety management is the ability to accurately identify objects, such as workers, equipment, and terrain, in visual data [7,8,9,10]. Advances in deep learning-based computer vision have enabled AI systems to detect human and equipment movement in construction site CCTV footage and issue timely warnings for potential hazards [11,12]. However, the effectiveness of such systems relies on the availability of high-quality training datasets that reflect diverse real-world site environments. Public datasets, such as ACID 2.0 [13], MOCS [14], and SODA [15], have supported research in this area, but they primarily focus on general construction activities and lack detailed representations specific to earthwork environments.
To address these limitations, this study focused on a land development earthwork site as the target environment. Over a year, long-term image collection was conducted, capturing key objects, such as construction machinery and terrain elements. Bounding box and polygon annotations were applied to the collected data to support training for object detection and instance segmentation tasks. The reliability of the dataset was confirmed by evaluating its quality and format validity, ensuring suitability for training AI models using CCTV footage from large-scale construction sites.
Based on the Construction Safety Management Guidelines, the Enforcement Decree of the Construction Technology Management Act [16], and the Construction Information Classification System, a total of 15 hazardous object categories were defined, focusing on civil engineering contexts. These included various types of construction equipment and site-specific terrain features.
For optimal data acquisition, collection devices were installed at a housing development site in locations offering a panoramic field of view. Images were captured from eight different camera angles using image collection equipment. The labeling metadata adhered to the COCO format [17], a standard widely used in computer vision for object detection tasks. To assess the generalizability and performance of the dataset, experiments were conducted using the processed data with the representative object detection model YOLO v7 [18] and the instance segmentation model Mask R-CNN [19]. The final dataset comprised 89,766 images, divided into training, validation, and test subsets for comprehensive evaluation.
The resulting dataset can be widely applied to earthwork, road, and land development projects and has been made publicly available on the Mendeley Data platform for use by researchers in related fields [20]. It holds significant potential as a public dataset for AI-based safety management systems, contributing to the protection of construction workers and helping to reduce the time and cost of construction site safety management as well as accident mortality rates. This study also describes the object selection process, the field data collection methodology, and quantitative and qualitative evaluations of the dataset, and discusses the potential applicability of the dataset for developing general-purpose AI learning systems across various applications.

2. Establishment and Operation of a System to Secure Image Data at Earthwork Construction Sites

2.1. Site Selection for Effective Image Collection

The Incheon land development site was selected for the image collection system used in this study, considering the diversity of its earthwork operations, the suitability of locations for equipment installation, and the ease of acquiring high-quality images. Figure 1 shows the construction site status map, highlighting the positions of the installed CCTV cameras and the primary observation points (P1–P8). To capture a comprehensive view of the site, the image collection system was installed in a high-altitude area approximately 20 m above ground level, as shown in the left panel of Figure 1. Moreover, the panoramic view obtained from the installed equipment, shown in Figure 2, provides wide-angle footage with minimal occlusion, an essential advantage for effective image data collection.

2.2. Configuration and Observation Coverage of the PTZ-Based Image Acquisition System

In this study, a PTZ-based image acquisition system was implemented to capture high-resolution images across a large-scale earthwork construction site. The selected device, featuring up to 30× optical zoom, enabled detailed monitoring of distant targets and ensured stable performance under the challenging communication conditions typical of construction environments. The device was also compatible with various wireless protocols, including 3G, 4G, Bluetooth, and Wi-Fi.
To assess and visualize the operational coverage of the installed system, concentric circles with radii of 0.25 km, 0.5 km, 0.75 km, and 1.0 km were overlaid on the site map, as shown in Figure 1. These intervals were derived from empirical testing of the device’s 30× optical zoom, which confirmed effective image acquisition up to approximately 1.0 km under site conditions. This visualization was designed to illustrate the approximate monitoring range and clarify the spatial relationships between the camera, observation points, and major engineering zones.
The locations of key construction areas and observation targets were annotated to aid interpretation of spatial coverage and to provide contextual understanding of target distances. This setup supports the evaluation of detection performance relative to spatial variability within the site.
Detailed technical specifications of the deployed device are listed in Table 1.

2.3. Operation Process of the Image Collection System

Based on the topographical characteristics of the site and the distribution of active construction equipment, eight observation areas were strategically selected. Observation points P1 to P4, located within the blue circular area in Figure 1, were selected because active earthwork operations such as filling and cutting were concentrated in this zone. Meanwhile, points P5 to P8, within the green circular area, were chosen due to frequent equipment movements and transport activities, making them suitable for capturing diverse operational scenes. These locations are marked with ‘x’ symbols in Figure 1. The image collection system was configured to automatically capture each observation point in sequence at regular frame intervals using its coordinate control function. Captured images were transmitted in real time via an LTE communication network and stored on a network-attached storage (NAS) system.
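The paper does not publish the control code, but the capture cycle described above can be sketched as a simple loop over the eight presets. In the sketch below, move_to_preset and capture_frame are hypothetical placeholders for the camera vendor's PTZ and streaming API, and the NAS mount point and dwell time are assumptions rather than values reported in the study:

```python
# Illustrative sketch (not the authors' code) of the capture cycle: the PTZ camera
# steps through preset observation points P1-P8 at a fixed interval and each frame
# is stored on network-attached storage.

import time
from datetime import datetime
from pathlib import Path

PRESETS = [f"P{i}" for i in range(1, 9)]       # eight observation points
NAS_ROOT = Path("/mnt/nas/earthwork_images")   # assumed NAS mount point
CAPTURE_INTERVAL_S = 60                        # assumed dwell time per preset

def move_to_preset(preset_id: str) -> None:
    """Hypothetical stand-in for the camera's PTZ coordinate-control command."""
    print(f"moving camera to preset {preset_id}")

def capture_frame(preset_id: str) -> bytes:
    """Hypothetical stand-in for grabbing one JPEG frame from the camera."""
    return b""  # placeholder image payload

def capture_cycle() -> None:
    for preset in PRESETS:
        move_to_preset(preset)
        frame = capture_frame(preset)
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        out_dir = NAS_ROOT / preset
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / f"{preset}_{stamp}.jpg").write_bytes(frame)
        time.sleep(CAPTURE_INTERVAL_S)

if __name__ == "__main__":
    while True:            # repeat the P1-P8 sweep continuously
        capture_cycle()
```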
Figure 3 presents sample images from each observation point, with each point corresponding to specific construction zones within the site development project where key target objects were operated in real time.

3. Definition of Target Objects and AI Data Annotation Format at Earthwork Construction Sites

3.1. Selection of Objects at Earthwork Construction Sites

The hazardous objects to be detected at the earthwork construction site were identified based on the Guidelines for Performing Safety Management Tasks in Construction Projects and the Enforcement Decree of the Construction Machinery Management Act [21], with reference to the hazardous object profile and construction information classification system.
As a result, 15 hazardous object categories, consisting of 10 types of construction equipment and 5 types of terrain features, were identified and organized as presented in Table 2. To improve consistency and simplify annotation, similar equipment types were grouped under unified categories: excavators and backhoes were categorized as excavators, crawler drills and earth augers as crawler drills, and tower and mobile cranes as cranes. Similarly, the various slope types were grouped into two categories, soil slope and rock slope.
Figure 4 shows schematic diagrams illustrating the labeled results for the target objects listed in Table 2, with only the relevant object parts extracted from the entire images. Panels (a) to (i) show construction equipment; panels (j) to (m) present terrain objects; and panels (n) and (o) present other objects.

3.2. AI Data Labeling for Earthwork Construction Sites

In computer vision, object recognition methods are generally divided into two categories: object detection and segmentation. Object detection marks the region around objects in an image with bounding boxes (rectangles), while segmentation identifies objects using polygons that outline their precise shapes. For AI training, the dataset must include the original images and their corresponding labels. In this study, the COCO Annotator [22], an open-source labeling tool, was used to simultaneously generate object detection and segmentation labels. The metadata was saved in the widely used COCO format [17].
Figure 5 provides an example of the metadata, where the metadata file name matches the image file name. The metadata includes the image file name (filename), width (width), height (height), class name of the objects, bounding box coordinates (xmin, ymin, xmax, ymax), and segmentation coordinates.
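The metadata can be read with standard COCO tooling. The snippet below is a minimal sketch, assuming polygon-style annotations and an annotation file named annotations/earthwork_train.json; the actual file names and layout of the released dataset may differ:

```python
# Minimal sketch: reading COCO-style metadata with pycocotools.
from pycocotools.coco import COCO

coco = COCO("annotations/earthwork_train.json")    # assumed file name

cats = coco.loadCats(coco.getCatIds())
print("categories:", [c["name"] for c in cats])    # e.g., excavator, dump truck, ...

img_id = coco.getImgIds()[0]
img = coco.loadImgs(img_id)[0]
print(img["file_name"], img["width"], img["height"])

for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
    x, y, w, h = ann["bbox"]                       # COCO boxes are [x, y, width, height]
    name = coco.loadCats(ann["category_id"])[0]["name"]
    print(name,
          "bbox:", (x, y, x + w, y + h),           # converted to (xmin, ymin, xmax, ymax)
          "polygon points:", len(ann["segmentation"][0]) // 2)  # polygon-style annotation assumed
```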

4. Deep Learning-Based Dataset Validation Experiment

4.1. Current Status of Earthwork Construction Site Learning Data

This study processed a total of 89,766 images for object detection, derived from an earthwork construction site. From these, 7600 images were selected to generate data for object segmentation. Table 3 presents the number of images at each of the eight observation points, categorized by data type. The distribution is skewed toward certain observation points (View 1–View 4), as images were primarily captured in areas where construction equipment was actively in use.
Figure 6 shows the current distribution of objects in the object detection and segmentation data by category. During the data collection period, the primary construction activities on site were earthworks and excavation. This led to the inclusion of a significant number of construction equipment specially designed for these tasks. Excluding regular vehicles, the highest inclusion rates were observed for excavators, dump trucks, and bulldozers, in that order.
Figure 7 illustrates the training/test split of the object detection dataset by object category. Data analysis revealed that some objects had a small sample size, resulting in a data imbalance. The dataset was therefore split at an 8:2 ratio, ensuring that all object categories were represented in both the training and test sets. Consequently, the training set contained 71,812 images, while the test set comprised 17,954 images.
Figure 8 presents the distribution of the object segmentation dataset by object category. Data analysis revealed that specific objects, such as rollers, loaders, and pile drivers, were not represented in the processed images, resulting in the inability to construct data for these items. Owing to these discrepancies in data availability across object categories, the entire dataset was split into training and test sets in an 8:2 ratio, ensuring that all major objects were represented in the training and evaluation phases. Consequently, the training dataset consisted of 6048 images, while the test dataset contained 1512 images.
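For reference, one simple way to reproduce the kind of category-aware 8:2 split described above is to group images by the set of categories they contain and split each group 80/20. The sketch below is illustrative only, not the authors' procedure, and the annotation file name is an assumption:

```python
# Illustrative category-aware 8:2 split at the image level for a COCO-format dataset.
import json
import random
from collections import defaultdict

random.seed(0)

with open("annotations/earthwork_all.json") as f:   # assumed file name
    coco = json.load(f)

# map each image to the set of category ids it contains
img_to_cats = defaultdict(set)
for ann in coco["annotations"]:
    img_to_cats[ann["image_id"]].add(ann["category_id"])

# group images by their category signature, then split each group 8:2
groups = defaultdict(list)
for img_id, cats in img_to_cats.items():
    groups[tuple(sorted(cats))].append(img_id)

train_ids, test_ids = [], []
for ids in groups.values():
    random.shuffle(ids)
    cut = int(round(0.8 * len(ids)))
    train_ids.extend(ids[:cut])
    test_ids.extend(ids[cut:])

print(len(train_ids), len(test_ids))  # roughly 80% / 20% of the annotated images
```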

4.2. Deep Learning Models

4.2.1. You Only Look Once (YOLO) Model for Object Detection Data Learning

Deep learning-based object detection methods are generally categorized into two approaches: one-stage and two-stage detectors.
One-stage detectors, such as YOLO [23,24], SSD [25], and RetinaNet [26], perform object localization and classification in a single forward pass of the network. By directly predicting object classes and bounding boxes from the input image without a separate region proposal step, these models achieve high inference speed, making them particularly suitable for real-time applications. Among them, the YOLO (You Only Look Once) series is notable for prioritizing processing speed by focusing on object areas defined by bounding boxes. While early versions of YOLO were limited in accuracy, recent variants, including YOLOv3 [24] and YOLOv7 [18], have achieved significant improvements in detection precision while maintaining real-time performance.
In contrast, two-stage detectors adopt a more sequential process. Models like R-CNN [27], Fast R-CNN [28], and Faster R-CNN [29] first generate region proposals that likely contain objects, and then perform classification and bounding box refinement on these candidate regions. This two-step process allows for more accurate localization and better incorporation of contextual information, which often results in superior detection accuracy compared to one-stage methods—albeit at the cost of slower inference speed.
This study adopted the YOLOv7 model owing to its ability to provide real-time object detection with high training efficiency, in line with the goal of enabling real-time analysis in dynamic environments such as construction sites. Object detection accuracy was then evaluated using the dataset constructed in this study.
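Because the released annotations follow the COCO format while YOLOv7's reference training pipeline expects one plain-text label file per image (each line holding a class index and a box center, width, and height normalized to [0, 1]), a conversion step is typically required before training. The sketch below is illustrative only; the annotation file name and output directory are assumptions:

```python
# Illustrative conversion of COCO bounding boxes to YOLO-style label files.
import json
from pathlib import Path

with open("annotations/earthwork_train.json") as f:   # assumed file name
    coco = json.load(f)

images = {img["id"]: img for img in coco["images"]}
# map COCO category ids to contiguous 0-based class indices
categories = sorted(coco["categories"], key=lambda c: c["id"])
class_index = {c["id"]: i for i, c in enumerate(categories)}

labels_dir = Path("labels/train")                      # assumed output directory
labels_dir.mkdir(parents=True, exist_ok=True)

for ann in coco["annotations"]:
    img = images[ann["image_id"]]
    x, y, w, h = ann["bbox"]                           # COCO: top-left corner + size, in pixels
    cx = (x + w / 2) / img["width"]                    # normalized box center
    cy = (y + h / 2) / img["height"]
    nw, nh = w / img["width"], h / img["height"]       # normalized box size
    line = f"{class_index[ann['category_id']]} {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}\n"
    out = labels_dir / (Path(img["file_name"]).stem + ".txt")
    with out.open("a") as label_file:
        label_file.write(line)
```

A data configuration file listing the image directories and the 15 class names would then point the training script at these labels.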
Figure 9 illustrates the architecture of the YOLO object detection model, which processes an input image of size 448 × 448 × 3 through a deep convolutional neural network. The initial layers consist of a 7 × 7 convolution with 64 filters, followed by a 2 × 2 max pooling layer. This configuration initiates the feature extraction process by reducing the spatial resolution while capturing local patterns in the image. Subsequently, the network passes through a series of 3 × 3 and 1 × 1 convolutional layers. These layers serve to increase both the depth and abstraction of the extracted features while preserving essential spatial information. Each group of convolutional layers is followed by a max pooling operation, which progressively reduces the spatial dimensions and enables the network to focus on increasingly complex patterns.
During the final stages, the model employs three convolutional layers, each with 1024 filters, to capture high-level semantic representations with large receptive fields. These layers enhance the model’s ability to detect objects across various scales and contexts.
The resulting feature map is then flattened and passed through a fully connected layer comprising 4096 units. Finally, an output layer with 30 units encodes the object location (bounding box coordinates), objectness score, and class probability for each grid cell in the image. This architecture allows YOLO to perform real-time object detection by simultaneously predicting multiple objects within a single forward pass of the network.
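For context, the 30 units per grid cell follow from the original YOLO head, which predicts B bounding boxes (four coordinates plus one objectness score each) together with C class probabilities per cell; with B = 2 and C = 20 (the PASCAL VOC classes), this yields 30. As an illustrative aside, the same head retrained on the 15 classes defined in this study would use 25 units per cell:

B \times 5 + C = 2 \times 5 + 20 = 30, \qquad 2 \times 5 + 15 = 25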
The YOLOv7 model adopted in this study is a recent member of the YOLO object detection series, designed to strike a balance between accuracy and computational efficiency. Compared with its predecessors, YOLOv7 introduces several architectural enhancements that significantly improve training speed and detection performance, including E-ELAN (Extended Efficient Layer Aggregation Network) blocks, re-parameterized convolutional layers, and auxiliary heads. These innovations contribute to better feature representation, increased training stability, and higher detection precision while maintaining a lightweight architecture suitable for real-time applications. Architectural re-parameterization also allows different configurations during training and inference, further optimizing performance under practical deployment conditions.

4.2.2. Mask R-CNN Model for Instance Segmentation Data Learning

Figure 10 illustrates the model architecture of Mask R-CNN [19], a representative two-stage deep learning framework designed for both object detection and instance segmentation. Built upon the Faster R-CNN architecture, Mask R-CNN introduces an additional parallel branch to produce pixel-level binary segmentation masks for each detected object, in addition to the conventional object classification and bounding box regression tasks.
In the first stage, a Region Proposal Network (RPN) extracts candidate regions of interest (RoIs) where objects are likely to exist by scanning convolutional feature maps generated by a backbone CNN (e.g., ResNet-FPN). These proposed RoIs are then refined in the second stage, where three parallel tasks are performed: (1) object classification, (2) bounding box refinement, and (3) mask prediction. The segmentation branch generates a fixed-size mask for each RoI independently, which is later resized and mapped back to the original image space.
A notable innovation in Mask R-CNN is the RoI Align technique, which replaces the previous RoI Pooling method. RoI Align preserves spatial alignment by using bilinear interpolation instead of quantization, significantly improving the accuracy of the generated masks—especially for small or overlapping objects.
This architectural design allows Mask R-CNN to precisely capture both the position and shape of objects at the pixel level. This is particularly beneficial in construction environments, where machinery and terrain often exhibit complex boundaries, occlusions, and variability in scale. Thus, Mask R-CNN enables reliable and high-resolution instance segmentation even under challenging visual conditions.
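As a point of reference (not the authors' training setup), torchvision ships a Mask R-CNN implementation with a ResNet-50-FPN backbone that returns boxes, labels, scores, and per-instance masks in a single call. A minimal inference sketch is shown below; the image path, score threshold, and use of COCO-pretrained weights are assumptions, and the classification head would need to be replaced and fine-tuned for the 15 site-specific classes:

```python
# Minimal Mask R-CNN inference sketch using torchvision (illustrative only).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()  # COCO-pretrained head; would be fine-tuned for the 15 site classes

image = to_tensor(Image.open("samples/view4_excavator.jpg").convert("RGB"))  # assumed path

with torch.no_grad():
    output = model([image])[0]       # dict with 'boxes', 'labels', 'scores', 'masks'

keep = output["scores"] > 0.5        # assumed confidence threshold
boxes = output["boxes"][keep]        # (N, 4) boxes as xmin, ymin, xmax, ymax
masks = output["masks"][keep] > 0.5  # (N, 1, H, W) binarized instance masks
print(f"{keep.sum().item()} instances detected")
```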

4.3. Dataset Standardization Through Deep Learning Training

4.3.1. Evaluation of Object Detection Result

To assess the performance of the object detection model, the mean average precision (mAP) was used as the primary evaluation metric. mAP is derived from two key components: precision and recall, as defined in Equations (1) and (2). Precision refers to the ratio of correctly predicted true instances to the total instances predicted as true by the model. Recall is the ratio of correctly predicted true instances to all actual true instances.
Ultimately, a precision–recall curve is plotted, with precision on the Y-axis and recall on the X-axis. The AP is calculated as the area under the curve [30,31,32].
\mathrm{Precision} = \frac{TP}{TP + FP} \quad (1)
\mathrm{Recall} = \frac{TP}{TP + FN} \quad (2)
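For illustration, the sketch below computes Equations (1) and (2) cumulatively over detections sorted by confidence and takes AP as the area under the resulting precision–recall curve; mAP is then the mean of the per-class AP values. The numbers in the toy example are placeholders, not results from this dataset:

```python
# Illustrative AP computation from precision-recall values (toy data only).
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    """AP as the area under the precision-recall curve for one class."""
    order = np.argsort(-np.asarray(scores))               # sort by descending confidence
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    precision = cum_tp / (cum_tp + cum_fp)                # Eq. (1)
    recall = cum_tp / num_ground_truth                    # Eq. (2)
    return float(np.trapz(precision, recall))             # area under the PR curve

# toy example: 6 detections of one class against 5 ground-truth objects
ap = average_precision(
    scores=[0.95, 0.9, 0.8, 0.7, 0.6, 0.3],
    is_true_positive=[1, 1, 0, 1, 0, 1],
    num_ground_truth=5,
)
print(f"AP = {ap:.3f}")
```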
Figure 11 presents the per-object mAP evaluation results for the object detection dataset. Overall, construction equipment classes such as dump trucks, rollers, and excavators achieved high detection performance, with mAP scores exceeding 80%. This can be attributed to their well-defined geometric structures, rigid shapes, and clear visual boundaries, which facilitate reliable feature extraction and classification by the detection model. In contrast, terrain and background-related objects—including rock slopes, soil mounds, and work areas—exhibited comparatively lower mAP scores. This performance disparity is likely caused by several factors: ambiguous object contours resulting from irregular and unstructured shapes, high intra-class variability, visual similarity between classes (e.g., rock versus soil), and imbalanced training data distributions. These results highlight that object distinctiveness, visual separability, and annotation quality substantially affect detection accuracy. To address these limitations, future efforts should focus on enhancing annotation consistency, applying class-specific data augmentation, and refining class definitions, particularly for terrain and environmental categories that pose greater challenges for detection models.

4.3.2. Evaluation of Instance Segmentation Result

For instance segmentation, the evaluation process is similar to that of object detection, with mAP based on bounding boxes being assessed. However, it also focuses on pixel-level accuracy to evaluate the precision of object boundary segmentation. Therefore, the evaluation metric used is mAP@IoU [33], which incorporates the intersection-over-union (IoU) threshold, as defined in Equation (3).
\mathrm{mAP@IoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{1}{T} \sum_{t=1}^{T} AP_{c,t} \quad (3)
AP is typically calculated across various IoU thresholds, with the average obtained to quantify overall segmentation performance. In this study, mAP@0.50 was used as the primary performance metric in accordance with the COCO evaluation standards.
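As an illustration of how Equation (3) is evaluated, the sketch below computes the pixel-level IoU of two binary masks and then averages a table of per-class AP values over classes (C) and IoU thresholds (T). The AP values are random placeholders, not results from this study, and the COCO-style thresholds 0.50:0.05:0.95 are assumed:

```python
# Illustrative mask IoU and mAP@IoU aggregation (placeholder AP values).
import numpy as np

def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between two binary masks of the same shape."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection) / float(union) if union > 0 else 0.0

pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:3] = True   # 4-pixel predicted mask
gt = np.zeros((4, 4), dtype=bool);   gt[1:4, 1:4] = True     # 9-pixel ground-truth mask
print(f"mask IoU = {mask_iou(pred, gt):.2f}")                # 4 / 9 ~ 0.44

# ap_table[c][t] = AP of class c at IoU threshold t (random placeholders here)
iou_thresholds = np.arange(0.50, 1.00, 0.05)                 # 0.50, 0.55, ..., 0.95
ap_table = np.random.rand(15, len(iou_thresholds))           # 15 classes in this dataset

map_at_iou = ap_table.mean()        # Eq. (3): mean over classes and thresholds
map_at_50 = ap_table[:, 0].mean()   # mAP@0.50, the primary metric used in this study
print(f"mAP@[.50:.95] = {map_at_iou:.3f}, mAP@0.50 = {map_at_50:.3f}")
```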
Figure 12 compares the per-object mAP and mIoU performance for the instance segmentation dataset. Overall, the mAP, which reflects the accuracy of object localization, demonstrated relatively high values across most object categories. In contrast, the mIoU, which evaluates the pixel-level overlap between predicted and ground-truth masks, had a weaker performance. This indicates that while the model was generally successful in identifying the location and presence of target objects, it struggled to accurately delineate their boundaries.
These results suggest that segmentation errors are more prevalent in scenarios where object contours are ambiguous or where the distinction between objects and background is visually subtle. Such challenges are particularly common in terrain-related categories, where irregular shapes, overlapping textures, and inconsistent annotations can hinder precise segmentation. Furthermore, small objects or those with complex geometries may be partially segmented or misclassified, contributing to reduced mIoU scores.
It is also noteworthy that three object categories—roller, loader, and pile driver—were excluded from evaluation due to the absence of relevant instances in the dataset. This limitation points to an imbalance in the data distribution, which can lead to biased model training and insufficient generalization across underrepresented classes. Addressing these gaps through dataset augmentation and balanced sampling would be essential to improve the reliability and coverage of the segmentation model.

4.3.3. Visual Evaluation of Object Detection and Segmentation Data

Figure 13 visualizes the results of object detection and segmentation across 15 object classes defined in this study. Each object was detected using a model trained on original images captured at actual construction sites, with red boxes highlighting the corresponding objects.
Construction equipment objects, with their relatively clear shapes and distinct boundaries, were accurately detected and segmented, as confirmed by the object detection and segmentation results. By contrast, terrain-based objects, such as rock slopes, soil mounds, and work areas, exhibited unclear boundaries or similar backgrounds, leading to some omissions or errors in the segmented areas. This is consistent with the poorer performance observed in the evaluation metrics discussed earlier.
Overall, the visualization results are consistent with the numerical performance evaluation, demonstrating that the dataset constructed in this study is well-suited for training and validating object detection and segmentation models for construction sites.

5. Conclusions

This study aimed to create a CCTV-based image dataset to serve as foundational data for AI-driven construction safety management, with the goal of preventing industrial accidents and evaluating the feasibility and performance of object detection and segmentation models in construction-specific environments. To this end, images were collected over approximately one year from large-scale civil engineering land development sites, resulting in a COCO format-based dataset containing 15 classes of construction equipment and terrain objects.
To ensure reproducibility and provide a standardized benchmark, this study employed publicly available object detection and segmentation models (YOLOv7 and Mask R-CNN) using default parameters, without any custom modifications or hyperparameter tuning. YOLOv7 achieved a mean average precision (mAP) of 0.85 for object detection, and in particular, it excelled in categories with distinct shapes such as excavators and bulldozers. Mask R-CNN attained a mean average precision at IoU = 0.50 (mAP@0.50) of 0.78 and a mean intersection-over-union (mIoU) of 0.62 for instance segmentation, indicating reasonable performance despite the challenges of segmenting terrain objects with ambiguous boundaries.
The key innovations of this study include (1) long-term, real-site data collection from earthwork environments; (2) dual-format annotation supporting detection and segmentation tasks; (3) reproducible benchmark evaluation using open-source models with no tuning; and (4) open access publication of the dataset via Mendeley Data [20], facilitating broader adoption in both research and industry. These contributions are expected to accelerate the development of intelligent construction safety technologies and support ongoing digital transformation efforts in the construction sector.
One key limitation observed in this study was the imbalance in object category distribution, especially for segmentation tasks. In addition, although the dataset was constructed at a large scale, it was collected from a single earthwork construction site, which may limit the model’s generalization performance when applied to other types of construction environments.
Future research should expand the dataset to include diverse types of construction sites and validate the model across multiple project environments to enhance its generalizability. Additionally, efforts should be made to improve model robustness under variable conditions such as low-light environments and adverse weather, as well as to explore dynamic analysis tasks such as object tracking and behavior recognition.

Author Contributions

Conceptualization, J.N. and J.L.; methodology, J.N.; software, J.N.; validation, J.N. and H.S.; formal analysis, J.N. and H.S.; investigation, J.N. and I.Y.; resources, H.S.; data curation, J.N.; writing—original draft preparation, J.N.; writing—review and editing, J.L. and H.S.; visualization, J.N.; supervision, J.L.; project administration, H.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Smart Construction Promotion Center grant funded by the Ministry of Land, Infrastructure and Transport (Grant 20250053-015). This study was also supported by the Dong-A University research fund.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset presented in this study is openly available in Mendeley Data at https://doi.org/10.17632/rz8723t6d7.2, accessed on 1 July 2025.

Acknowledgments

This work was supported by the Smart Construction Promotion Center grant funded by the Ministry of Land, Infrastructure and Transport (Grant 20250053-015). This study was also supported by the Dong-A University research fund. We appreciate the support provided by various institutions for this research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Love, P.E.D.; Teo, P.; Smith, J.; Ackermann, F.; Zhou, Y. The nature and severity of work-related injuries in construction: Promoting operational benchmarking. Ergonomics 2019, 62, 1273–1288. [Google Scholar] [CrossRef] [PubMed]
  2. Love, P.E.D.; Ika, L.; Luo, B.; Zhou, Y.; Zhong, B.; Fang, W. Rework, failure and unsafe behavior: Moving from a blame culture to an error management mindset in construction. IEEE Trans. Eng. Manag. 2020, 69, 1489–1501. [Google Scholar] [CrossRef]
  3. Ministry of Employment and Labor (MOEL). Amendment to the Serious Accidents Punishment Act; MOEL Press: Sejong, Republic of Korea, 2023. [Google Scholar]
  4. Seoul Metropolitan Government (SMG). Safety Management Guidelines for Small Construction Sites; SMG Publications: Seoul, Republic of Korea, 2024. [Google Scholar]
  5. Akinosho, T.D.; Oyedele, L.O.; Bilal, M.; Ajayi, A.O.; Delgado, M.D.; Akinade, O.O.; Ahmed, A.A. Deep learning in the construction industry: A review of current status and future innovations. J. Build. Eng. 2020, 32, 101827. [Google Scholar] [CrossRef]
  6. Fang, W.; Love, P.E.; Luo, H.; Ding, L. Computer vision for behavior-based safety in construction: A review and future directions. Adv. Eng. Inform. 2020, 43, 100980. [Google Scholar] [CrossRef]
  7. Pal, A.; Hsieh, S.-H. Deep learning-based visual data analytics for smart construction management. Autom. Constr. 2021, 131, 103892. [Google Scholar] [CrossRef]
  8. Liu, Y.; Wang, P.; Li, H. An Improved YOLOv5s-Based Algorithm for Unsafe Behavior Detection of Construction Workers in Construction Scenarios. Appl. Sci. 2025, 15, 1853. [Google Scholar] [CrossRef]
  9. Hayat, A.; Morgado Dias, F. Deep Learning-Based Automatic Safety Helmet Detection System for Construction Safety. Appl. Sci. 2022, 12, 8268. [Google Scholar] [CrossRef]
  10. Liu, J.; Luo, H.; Liu, H. Deep Learning-based Data Analytics for Safety in Construction. Autom. Constr. 2022, 140, 104302. [Google Scholar] [CrossRef]
  11. Hou, L.; Chen, H.; Zhang, G.K.; Wang, X. Deep learning-based applications for safety management in the AEC industry: A review. Appl. Sci. 2021, 11, 821. [Google Scholar] [CrossRef]
  12. Zhong, B.; Wu, H.; Ding, L.; Love, P.E.; Li, H.; Luo, H.; Jiao, L. Mapping computer vision research in construction: Trends, knowledge gaps and implications. Autom. Constr. 2019, 107, 102919. [Google Scholar] [CrossRef]
  13. Xiao, B.; Kang, S.-C. Development of an image data set of construction machines for deep learning object detection. J. Comput. Civ. Eng. 2021, 35, 05020005. [Google Scholar] [CrossRef]
  14. An, X.; Li, Z.; Zuguang, L.; Wang, C. Dataset and benchmark for detecting moving objects in construction sites. Autom. Constr. 2021, 122, 103482. [Google Scholar] [CrossRef]
  15. Duan, R.; Deng, H.; Tian, M.; Deng, Y.; Lin, J. SODA: A large-scale open site object detection dataset for deep learning in construction. Autom. Constr. 2022, 142, 104499. [Google Scholar] [CrossRef]
  16. Ministry of Land, Infrastructure and Transport (MOLIT). Development of Risk Factor for Construction Project; MOLIT Research Report; 2014. Available online: https://www.codil.or.kr/viewDtlConRpt.do?gubun=rpt&pMetaCode=OTKCRK160160 (accessed on 2 August 2023). (In Korean).
  17. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Computer Vision–ECCV 2014, Proceedings Part V; Springer: Zurich, Switzerland, 2014; pp. 740–755. [Google Scholar] [CrossRef]
  18. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar] [CrossRef]
  19. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar] [CrossRef]
  20. Na, J.; Shin, H.; Yun, I.; Lee, J. Development of an AI Dataset for Object Detection at Construction Sites. Mendeley Data, V2, 2025. Available online: https://data.mendeley.com/datasets/rz8723t6d7/2 (accessed on 2 August 2023).
  21. Ministry of Government Legislation (MGL). Construction Machinery Management Act. 2023. Available online: https://www.law.go.kr/법령/건설기계관리법 (accessed on 2 August 2023). (In Korean).
  22. COCO-Annotator. Available online: https://github.com/jsbroks/coco-annotator (accessed on 1 July 2025).
  23. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  24. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  25. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar] [CrossRef]
  26. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
  27. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 142–158. [Google Scholar] [CrossRef] [PubMed]
  28. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  29. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28; Curran Associates, Inc.: Red Hook, NY, USA, 2015; pp. 91–99. Available online: https://proceedings.neurips.cc/paper/2015/hash/14bfa6bb14875e45bba028a21ed38046-Abstract.html (accessed on 17 June 2021).
  30. Powers, D.M. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2011, arXiv:2010.16061. [Google Scholar] [CrossRef]
  31. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  32. Davis, J.; Goadrich, M. The Relationship between Precision-Recall and ROC Curves. In Proceedings of the 23rd International Conference on Machine Learning, ICML, Pittsburgh, PA, USA, 25–29 June 2006; pp. 233–240. [Google Scholar] [CrossRef]
  33. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9520–9529. [Google Scholar] [CrossRef]
Figure 1. CCTV installation locations and observation points (P1–P8) within the construction site.
Figure 2. Construction site overview from the CCTV viewpoint.
Figure 3. Image samples in setup viewpoints.
Figure 4. Labeling samples for target object types.
Figure 5. Information on JSON annotation files.
Figure 6. Status of the construction site AI dataset by object category.
Figure 7. Training/testing distribution of object detection dataset by object category.
Figure 8. Training/test distribution of object segmentation dataset by object category.
Figure 9. YOLO model architecture [23].
Figure 10. Mask R-CNN model architecture [19].
Figure 11. mAP evaluation results for object detection across object categories.
Figure 12. mAP and mIoU evaluation results for instance segmentation by object categories.
Figure 13. Visual inference results according to the target object.
Table 1. Detailed specifications of the deployed CCTV camera.

Category | Specification
Manufacturer | KEDACOM
Image Sensor | 1/2.8″ CMOS
Pixels | 2.0 Megapixel
Focal Length | 4.5–135 mm, 30× optical
Max. Aperture Ratio | F1.6 (WIDE) / F4.4 (TELE)
Pan/Tilt Range | 360° / −15° to 90°
Compression | H.265/H.264
Bitrate | 64 kbps–16 Mbps
Wireless | 3G/4G/Bluetooth/Wi-Fi
Zoom | 30× optical zoom
Table 2. Classification of target objects.

Type of Work | Object Type | AI Application | Target Objects | Grouped Objects
Earthwork | Terrain | Instance Segmentation | Slope (Soil slope, Rock slope, Excavated slope, Cut slope) | Soil slope, Rock slope
Earthwork | Terrain | Instance Segmentation | Soil mound | Soil mound
Earthwork | Terrain | Instance Segmentation | Rock mound | Rock mound
Earthwork | Construction equipment | Object Detection | Excavator, Backhoe | Excavator
Earthwork | Construction equipment | Object Detection | Dump truck | Dump truck
Earthwork | Construction equipment | Object Detection | Loader | Loader
Earthwork | Construction equipment | Object Detection | Bulldozer | Bulldozer
Earthwork | Construction equipment | Object Detection | Roller | Roller
Ground improvement and Reinforcement; Piling | Construction equipment | Object Detection | Crawler drill, Earth auger | Crawler drill
Piling | Construction equipment | Object Detection | Pile driver | Pile driver
Construction of steel structure; Civil plumbing and drainage | Construction equipment | Object Detection | Tower crane, Mobile crane | Crane
Etc | Etc | – | Car | Car
Etc | Etc | – | Worker | Worker
Etc | Etc | – | Work area | Work area
Table 3. Number of images per viewpoint by data type.

Data Type | View 1 | View 2 | View 3 | View 4 | View 5 | View 6 | View 7 | View 8 | Total
Number of images (Object detection) | 10,305 | 5262 | 15,161 | 50,629 | 262 | 1955 | 517 | 5675 | 89,766
Number of images (Segmentation) | 866 | 442 | 1270 | 3562 | 262 | 164 | 517 | 477 | 7600