Article

A Hybrid Game Engine–Generative AI Framework for Overcoming Data Scarcity in Open-Pit Crack Detection

1 School of Electrical Engineering, Computing, and Mathematical Sciences, Curtin University, Perth 6102, Australia
2 The Western Australian School of Mines, Curtin University, Kalgoorlie 6430, Australia
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2026, 8(4), 99; https://doi.org/10.3390/make8040099
Submission received: 6 March 2026 / Revised: 9 April 2026 / Accepted: 10 April 2026 / Published: 12 April 2026

Abstract

Open-pit mining operations rely heavily on visual inspection to identify indicators of slope instability such as surface cracks. Early identification of these geotechnical hazards enables timely safety interventions to protect both workers and assets in the event of slope failures or landslides. While computer vision (CV) approaches offer a promising avenue for autonomous crack detection, their effectiveness remains constrained by the scarcity of labelled geotechnical datasets. Deep learning (DL)-based models, in particular, require large amounts of representative training data to generalize to unseen conditions; however, collecting such data from operational mine sites is limited by safety, cost, and data confidentiality constraints. To address this challenge, this study proposes a novel hybrid game engine–generative artificial intelligence (AI) framework for large-scale dataset generation without requiring real-world training data. Leveraging a parameterized virtual environment developed in Unreal Engine 5 (UE5), the framework generates realistic images of open-pit surface cracks and enhances their fidelity and diversity using StyleGAN2-ADA. The synthesized datasets were used to train the YOLOv11 real-time object detection model and evaluated on a held-out real-world dataset of open-pit slope imagery to assess the effectiveness of the proposed framework in improving model generalizability under extreme data scarcity. Experimental results demonstrated that models trained using the proposed framework consistently outperformed the UE5 baseline, with average precision (AP) at intersection over union (IoU) thresholds of 0.5 and [0.5:0.95] increasing from 0.792 to 0.922 (+16.4%) and 0.536 to 0.722 (+34.7%), respectively, across the best-performing configurations. These findings demonstrate the effectiveness of hybrid generative AI frameworks in mitigating data scarcity in CV applications and supporting the development of scalable automated slope monitoring systems for improved worker safety and operational efficiency in open-pit mining.

1. Introduction

Open-pit mines are among the most hazardous industrial environments due to the risks posed by geotechnical events such as slope failures and landslides [1,2,3]. The devastating consequences of these incidents are exemplified by recent disasters such as the 2023 Xinjing coal mine landslide in China, which resulted in 53 fatalities and approximately USD 28 million in economic losses [4], and the 2020 Hpakant jade mine landslide in Myanmar, which claimed nearly 200 lives and severely impacted local communities [5]. These events can be triggered by several factors, including weak geological structures [6], intense or prolonged rainfall [7,8], seismic activity, and vibrations from excavation and blasting [9]. Early warning systems are therefore designed to monitor slope displacement and detect hazards such as surface cracks [10,11], enabling safety interventions such as exclusion zones to protect both workers and assets [12]. Despite advances in technologies such as stability radar, visual inspection remains central to hazard identification across many open-pit mines in Australia [13,14,15]. However, this manual practice is both labor-intensive and subjective, exposing workers to hazardous environments and compromising safety [16,17,18].
Consequently, recent studies have focused on developing automated approaches that leverage technologies such as AI to reduce dependence on manual monitoring while enhancing operational safety and efficiency [19]. Specifically, advances in DL [20] have driven substantial progress in CV tasks such as object detection, image classification, and image segmentation [21], thereby enabling machines to derive meaningful information from real-world visual data [22]. To that end, neural network architectures such as convolutional neural networks (CNNs) [23] and vision transformers (ViTs) [24] have been widely utilized across diverse domains [25,26,27,28,29,30,31,32,33,34,35,36,37,38,39], underscoring their potential for geotechnical risk management. In particular, recent studies have demonstrated the use of CNN-based CV models such as YOLOv8 [40], YOLOv10 [41], Mask R-CNN [42], U-Net [43,44], and ENet [45] for automated surface crack detection in open-pit mining. While these works demonstrate the feasibility of DL models for geotechnical hazard identification, their effectiveness remains constrained by the limited availability of labelled crack images.
This issue, commonly referred to as data scarcity, is a ubiquitous problem in the field of DL [46], where domain generalization, or the capacity of a CV model to recognize objects in unseen settings or environments [47], is driven by the volume and representativeness of the data used for training [48]. Data scarcity is especially pronounced in industrial domains such as mining [49], where datasets are inherently commercially sensitive and limited by the significant cost and expertise required for data collection and annotation [50,51,52]. To address this challenge, techniques such as transfer learning [53] and data augmentation [54] have been widely adopted in the literature, particularly in healthcare and other data-constrained domains. Transfer learning reduces reliance on large datasets but is vulnerable to source–target domain mismatch [55,56], while data augmentation, though capable of artificially increasing dataset size [57], is prone to amplifying existing distributional biases [58,59]. As both methods remain fundamentally bounded by the quality of the original training data, interest continues to grow in techniques capable of generating entirely new and diverse training data at scale [60]. Specifically, two principal approaches have emerged to address this need, each with complementary strengths and limitations.
The first approach leverages commercial game engines such as Unreal Engine (UE) [61] and Unity [62] to render synthetic images for training CV models. These programs enable controlled variation in scene parameters and environmental conditions, as well as automated dataset generation and ground-truth annotation [63]. Moreover, modern game engines such as UE5 achieve near-photorealism through physically based rendering (PBR) with virtualized geometry [64] and real-time global illumination (RTGI) [65]. Despite these advances, the fundamental distributional disparity between synthetic and real-world imagery, known as the reality gap [66], continues to constrain the generalizability of CV models trained on game engine outputs, which often lack the stochastic qualities of real-world data. In addition, game engine rendering remains inherently bounded by the manual effort required for scene and asset development, imposing practical limitations on both dataset scale and variation.
The second approach employs generative models such as generative adversarial networks (GANs) [67], which learn continuous data distributions directly from training samples to synthesize realistic and structurally diverse images. In particular, StyleGAN2-ADA [68] provides controllable latent space manipulation and adaptive discriminator augmentation (ADA) to mitigate overfitting for small datasets, making it well suited to data-scarce domains [69]. However, generative models typically require substantial training data to learn meaningful distributions [70], presenting a barrier in domains where the very data scarcity that motivates synthetic dataset generation also constrains the training of generative models themselves.
The respective strengths and limitations of game engine rendering and generative modelling suggest a largely unexplored opportunity. Game engines can produce large volumes of labelled synthetic data, yet their outputs remain constrained by a residual reality gap and bounded diversity [71]. Generative models offer a potential means of addressing both of these limitations: they can synthesize structural variations that extend dataset diversity beyond what parametric rendering alone can achieve, while the stochastic nature of adversarial generation and transfer learning from broader source domains can alleviate aspects of the reality gap by introducing variation and texture priors [72,73,74,75,76] not captured by game engine imagery. A hybrid framework in which high-fidelity game engine outputs serve as the sole training data for a generative model could therefore address both constraints without requiring any real-world training imagery. Despite this, no prior study has systematically evaluated whether game engine synthetic data can serve as an effective substitute for real-world data in training generative models for scalable image synthesis in data-scarce CV applications.
To address this gap, the main contributions of this work are as follows:
  • We design and implement a novel hybrid framework for large-scale synthetic dataset generation without requiring real-world training data, combining a parameterized UE5 virtual environment, StyleGAN2-ADA-based image enhancement, and automatic annotation using a vision language model (VLM).
  • We conduct a systematic evaluation of three transfer learning strategies for training StyleGAN2-ADA on game engine data, comparing semantically unrelated and domain-aligned source domains to assess their impact on generation quality. The fidelity, diversity, and domain gap of the generated images are evaluated through a combination of distributional, perceptual, statistical, and embedding-based analyses.
  • We evaluate downstream effectiveness by training YOLOv11 on synthetic data generated by the proposed framework and testing its object detection performance on a held-out set of 200 real-world open-pit mining images, demonstrating that the best-performing configurations improve AP@0.5 and AP@[0.5:0.95] over the UE5 baseline by up to 16.4% and 34.7%, respectively.
To the best of our knowledge, this study presents the first systematic evaluation of a hybrid game engine–generative AI framework for synthetic dataset generation utilizing UE5 and StyleGAN2-ADA, and the first to demonstrate that generative models trained solely on game engine data can produce synthetic images sufficient for training CV models that generalize effectively to real-world conditions. Applied to surface crack detection in open-pit mining, the framework improves object detection accuracy for slope failure precursors, supporting improved safety outcomes while offering a methodology with potential transferability to other data-constrained domains.

2. Related Works

2.1. Synthetic Dataset Generation Using Game Engines

Advances in game engine technology have pushed the boundaries of visual realism and provided a promising means of addressing data scarcity in CV. Unlike manual data collection, which can be costly and time-consuming, game engines enable synthetic data generation with automated ground-truth annotation and fine-grained control of scene parameters and environmental conditions [77]. For instance, Half-Life 2’s Source engine [78], renowned for its detailed and lifelike animation and physics technology, was first utilized nearly two decades ago to develop and test an autonomous surveillance system [79]. Since then, researchers have increasingly leveraged synthetic data generated by photorealistic game engines to train and validate CV models for tasks such as object detection and image segmentation [80]. Table 1 summarizes recent works which apply game engine synthetic data across diverse applications, from construction safety monitoring to autonomous vehicle detection.
These studies demonstrate performance improvements ranging from modest gains of less than 1% [84] to substantial enhancements exceeding 65% [81], highlighting the variable influence of game engine data on model generalizability. The most common development platforms used are UE4 and Unity due to their accessibility and integration with CV plugins such as NDDS [92] and Unity Perception Package [93], which support domain randomization techniques and automatically generate ground-truth labels such as bounding boxes and segmentation masks for captured data. An alternative approach is the use of commercial games, such as Grand Theft Auto V (GTA V) [94], where a combination of in-engine tools and community mods are adapted for synthetic data capture and annotation [88]. Collectively, these works demonstrate the versatility of game engines for synthetic data generation, producing datasets ranging from approximately 1500 [82] to over one million images [88]. Studies utilizing fewer than 3000 synthesized images for training of downstream CV models reported limited performance improvements or underfitting [82,84], suggesting insufficient data diversity to support robust model generalization. In contrast, approaches which synthesized larger training datasets demonstrated substantial gains [81,83,86], highlighting the impact of dataset size on downstream performance. These findings indicate that while game engines enable scalable data generation, meaningful performance benefits require datasets of sufficient scale and diversity to capture the contextual variability necessary for effective transferability.
In addition to dataset size, the realism of rendered images remains a key determinant of downstream performance. Several studies have demonstrated that models trained on data from high-fidelity game engines tend to generalize better than those trained on less photorealistic platforms [87]. For instance, [83] reported that the enhanced texture detail, lighting, and reflections of images generated in UE5 led to a 17.3% increase in mean average precision (mAP) relative to a model trained on data produced in UE4, highlighting the influence of visual fidelity on learned feature distributions for CV models. Similarly, commercial games, developed with significant budgets and advanced rendering pipelines, typically produce more realistic images than those generated by less robust engines such as Unity (see Figure 1) [88]. As a result, models trained on lower fidelity datasets often struggle to generalize in real-world applications because their limited realism widens the reality gap [87].
To address this issue, several studies employ transfer learning by pre-training using synthetic data and fine-tuning with small quantities of real-world data, thereby better aligning the synthetic and real feature domains. This mixed dataset approach, demonstrated in [89], resulted in a 5% improvement in the percentage of correct keypoints (PCK) for pose estimation relative to a model trained solely on synthetic data. Furthermore, ref. [86] reported a 79.5% improvement in mAP when fine-tuning a model trained on 400,000 synthetic images with only 760 real images (a 526:1 synthetic-to-real ratio), underscoring the effectiveness of limited real-world data in enhancing model generalizability. Complementary to this, domain randomization techniques have proven effective in bolstering model robustness across unseen conditions. For example, ref. [90] demonstrated a 22.8% increase in mAP through successive randomization of parameters such as camera positioning and object location, while refs. [82,84,86] leveraged NDDS and Unity Perception Package to improve cross-domain generalization and downstream performance.
While these mitigation strategies demonstrate measurable downstream performance benefits, the reality gap and diversity constraints remain a central limitation of game engine synthetic data, impacting the transferability of CV models to real-world settings. Consequently, recent studies have explored generative modelling for producing realistic synthetic images, an approach examined in the following section.

2.2. Synthetic Dataset Generation Using Generative Models

Introduced in 2014, GANs [67] employ an adversarial training framework in which two neural networks compete in a minimax game to generate realistic synthetic images. Unlike game engines, which rely on manual scene construction, GANs learn underlying data distributions directly from training samples, enabling them to produce sharper and more lifelike outputs than prior generative methods such as variational autoencoders (VAEs) [95]. Since their inception, various GAN architectures have been proposed for image synthesis, with StyleGAN2-ADA emerging as a particularly effective approach due to its incorporation of adaptive discriminator augmentation (ADA), which mitigates overfitting when training on limited datasets. As shown in Table 2, StyleGAN2-ADA has been successfully applied across diverse domains spanning medical diagnostics, environmental monitoring, and infrastructure inspection, demonstrating the versatility and effectiveness of generative modelling for enhancing downstream performance across a range of CV tasks. Notably, all studies summarized in Table 2 train exclusively on real-world imagery, leaving the effectiveness of purely synthetic training data for generative models largely unexplored.
For generative models, the Fréchet Inception Distance (FID) metric has been widely adopted to quantify the fidelity of generated samples relative to real data, with lower scores typically indicating closer alignment to real-world statistical distributions [96]. For example, ref. [97] demonstrated high quantitative fidelity using StyleGAN2-ADA to synthesize images of skin lesions for melanoma classification, achieving an FID score of 0.79. Similarly, studies such as [72,98,99] produced anatomically realistic magnetic resonance imaging (MRI) scans, with FID values ranging from 18.14 to 67.53. Despite this, recent studies have scrutinized the reliability of FID due to inherent limitations such as biased estimations and incorrect distributional assumptions [100,101]. Notably, dental radiographs synthesized in [73] were rated indistinguishable from real scans by domain experts, yet they achieved an FID of 72.76, the highest recorded in Table 2. Conversely, despite obtaining an FID score of 20.90, chest X-rays produced in [102] contained artefacts that hindered downstream classification performance. These findings highlight that FID may not necessarily reflect the practical utility of synthetic images for downstream tasks, and that quantitative metrics alone may be insufficient to assess generation quality. Dataset size is also closely tied to the effectiveness of generative models, with insufficient data often leading to discriminator overfitting and training instability [69]. For instance, ref. [103] found that over 10,000 training images were required to synthesize realistic petrographic samples.
Table 2. Summary of related works on synthetic dataset generation using StyleGAN2-ADA.
Application | Downstream Task | Training Dataset | Training Configuration | FID
Petrographic image classification [103] | Classification | 10,070 real petrographic images | 6520 kimg, NVIDIA Quadro RTX 5000 | 12.49
Brain tumor classification [72] | Classification | 3064 real brain scans | NVIDIA Tesla P100 | 58.11–67.53
Abdominal scan synthesis [99] | Not reported | 1300 real abdominal scans | 7800 kimg, NVIDIA GeForce RTX 2080 | 18.14
Algal bloom detection [74] | Semantic segmentation | 3114 real algal bloom images | NVIDIA Tesla P100 | 42.56
Dental radiograph classification [73] | Classification | 1456 real dental radiographs | NVIDIA Tesla A100 | 72.76
Brain scan synthesis [98] | Not reported | 1412 real brain scans | 1800 kimg, NVIDIA Tesla A100 | 20.21
Chest X-ray classification [102] | Classification | 3616 real chest X-rays | NVIDIA Tesla K80 | 20.90
Skin cancer classification [97] | Classification | 33,126 real skin lesion images | NVIDIA GeForce RTX 3090 | 0.79
Landslide detection [104] | Semantic segmentation | 770 real landslide images | Not reported | 67.47
Wildfire detection [76] | Object detection | 1865 real wildfire images | 25,000 kimg, NVIDIA GeForce RTX 3090 Ti | 24.07
Pavement crack detection [105] | Semantic segmentation | 778 real crack images | 32,000 kimg, NVIDIA Tesla T4 | 6.30
Similarly, ref. [97] reported improved image quality when training with datasets exceeding 30,000 images, demonstrating the correlation between dataset size and synthetic image fidelity. Conversely, studies utilizing fewer than 2000 images [73,98] generally produced lower-quality outputs with inconsistent results, underscoring the significance of adequate training data. Despite this general trend, synthesis quality can also be influenced by domain complexity, independent of dataset size. For example, ref. [105] generated realistic pavement crack images using only 778 training samples, while ref. [99] achieved high-fidelity abdominal MRI scan synthesis with 1300 images. In contrast, ref. [104] obtained an FID of 67.47 when training on 770 landslide images, likely due to the visual diversity inherent to environmental images. These findings suggest that while larger datasets generally enhance synthesis quality, the amount of data required to achieve high-fidelity generation varies considerably across domains depending on visual complexity.
To address these data requirements, recent studies have explored transfer learning strategies whereby pre-trained models are fine-tuned using small amounts of real-world data from the target domain [75]. For instance, ref. [72] demonstrated that pre-training StyleGAN2-ADA on unrelated source domains such as FFHQ [106] improves synthesis quality for brain tumor MRI scans. Similarly, ref. [73] reported that transfer learning not only improved FID scores of synthesized dental radiographs but also enhanced the accessibility of generative modelling for researchers with limited access to computational resources such as GPUs. It should be highlighted that existing transfer learning strategies for StyleGAN2-ADA rely uniformly on real-world data as the fine-tuning source, and no study has investigated whether synthetic data, particularly from game engines, can serve as an effective substitute. Nonetheless, these studies indicate that transfer learning may offer a promising avenue for addressing aspects of data scarcity in the field of generative modelling [72,73,74,75]. Regardless, generative models fundamentally require substantial quantities of training data to learn meaningful distributions. Even StyleGAN2-ADA, designed specifically for data-constrained scenarios, typically requires thousands of images to produce high-fidelity outputs [69], presenting a significant barrier in domains where data acquisition is limited by cost or accessibility, thereby motivating the hybrid framework for synthetic data generation introduced in the following section.

3. Materials and Methods

This study proposes a novel hybrid synthetic dataset generation framework that integrates game engine rendering with generative modelling to address data scarcity in open-pit crack detection without requiring real-world training data. Leveraging UE5 and StyleGAN2-ADA, the framework synthesizes diverse and realistic images of surface cracks that are automatically annotated using Grounding DINO [107] to train the YOLOv11 real-time object detection model. The methodology comprises three primary stages, as outlined in Figure 2.
The first stage (Section 3.1) focuses on the development of a parameterized UE5 environment. A virtual scene representing an open-pit mine wall is first constructed, followed by the development of an automated dataset generation algorithm for domain randomization and data acquisition. By systematically varying crack decals, ground material textures, and lighting parameters, this algorithm generates a dataset of 20,000 labelled images of open-pit surface cracks. In the second stage (Section 3.2), StyleGAN2-ADA is trained solely on the UE5 dataset to enhance the fidelity and diversity of the synthesized imagery, generating crack images with structural variations not present in the original UE5 dataset. To assess the influence of transfer learning, three initialization strategies are evaluated, with 20,000 images generated per configuration. Each generated image is automatically annotated using Grounding DINO, a VLM capable of detecting objects from a text prompt and generating bounding boxes for downstream CV model training. In the final stage (Section 3.3), a dataset-level ablation study is conducted to assess how each dataset configuration influences downstream YOLOv11 crack detection performance. Model accuracy is assessed on 200 real-world mining images across key performance metrics such as AP, precision, recall, and F1 score to quantify the generalization capability of object detection models trained solely on game engine data and those trained on enhanced images generated by the proposed framework.
Together, these three stages form a unified framework that synthesizes training data through game engine rendering, enhances dataset fidelity and diversity through generative modelling, and trains generalizable real-time object detection models, thereby enabling autonomous hazard identification to improve operational safety in data-scarce open-pit mining while minimizing manual data collection and annotation effort.

3.1. Synthetic Dataset Generation Using UE5

We adopt UE5 for dataset synthesis due to its state-of-the-art (SOTA) PBR pipeline with support for virtualized geometry [64] and RTGI [65]. These features enable the generation of high-fidelity terrain surfaces and illumination effects, supporting more robust downstream generalization [83]. We first construct a configurable virtual open-pit environment (Section 3.1.1) wherein surface cracks are rendered and photographed under systematically randomized conditions. Parameters such as surface appearance and texture (Section 3.1.2), crack morphology (Section 3.1.3), illumination (Section 3.1.4), and camera viewpoint (Section 3.1.5) are independently varied to enhance dataset diversity. High-resolution images (Section 3.1.6) and corresponding bounding box annotations (Section 3.1.7) are automatically generated using a dataset generation pipeline (Section 3.1.8) prior to downstream generative modelling using StyleGAN2-ADA (Section 3.2).

3.1.1. Virtual Environment Construction

As demonstrated in Figure 2, a virtualized section of an open-pit wall is constructed using the Landscape tool [108] in UE5. Designed to support photorealistic visual rendering rather than explicit geotechnical modelling, the generated heightmap incorporates representative slope surfaces and surrounding context largely consistent with operational open-pit settings, without attempting to reproduce site-specific stratigraphy or failure mechanics. The primary objective of this 3D scene is to provide a visually realistic background surface against which surface cracks are rendered, and for this reason, emphasis is placed on surface texture, color variation, and roughness, an approach consistent with domain randomization methodologies commonly adopted in synthetic data generation to enhance model generalization across unseen settings [109,110].

3.1.2. Terrain Material Parameterization

To reflect the visual diversity typical of open-pit mine sites, the generated landscape is parameterized using 12 distinct terrain surface materials selected to span the range of surface conditions typically documented across Australian mining operations [111,112]. These materials represent commonly observed surface types including soils, sandy deposits, compacted gravel, weathered rock formations, ironstone, and mixed debris zones, with color variations spanning red, ochre, grey, and brown tones. Each high-resolution material, sourced from Quixel Megascans [113], incorporates PBR properties including albedo, normal, and roughness, as well as Nanite [64] displacement maps to support realistic light interaction. By randomly varying surface appearance while maintaining fixed terrain geometry during dataset generation, controlled background variability is introduced to improve model robustness under domain shift, thereby enhancing generalizability across different operational environments.

3.1.3. Surface Crack Decal Parameterization

Surface crack decals are generated from real-world crack imagery to preserve authentic morphological characteristics. Crack images acquired from field surveys undergo image preprocessing such as cropping prior to being manually processed to extract binary crack masks. These masks are then standardized to a resolution of 2048 × 2048 pixels and converted into greyscale opacity maps prior to additional processing to generate auxiliary texture maps such as normal, height, and roughness for enhanced decal realism. These textures are then imported into UE5 and assembled into deferred decal materials, forming the unique surface cracks used for automated dataset generation. The complete surface crack decal generation workflow is illustrated in Figure 3.
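To make this texture workflow concrete, the sketch below illustrates one way the opacity and normal map conversion could be scripted outside UE5. It is a minimal sketch, not the exact tooling used in this study: it assumes binary masks stored as image files, and the gradient-based normal approximation and relief strength are illustrative choices.

```python
import numpy as np
from PIL import Image

def mask_to_decal_textures(mask_path: str, size: int = 2048):
    """Convert a binary crack mask into opacity and normal maps (illustrative sketch)."""
    # Load the binary mask and standardize it to the target decal resolution.
    mask = Image.open(mask_path).convert("L").resize((size, size), Image.LANCZOS)
    opacity = np.asarray(mask, dtype=np.float32) / 255.0  # greyscale opacity map

    # Treat the inverted opacity as a height field: cracks recede into the surface.
    height = 1.0 - opacity
    # Approximate surface normals from height-field gradients (assumed method).
    gy, gx = np.gradient(height)
    strength = 4.0  # assumption: exaggerates relief for visual effect
    nx, ny, nz = -gx * strength, -gy * strength, np.ones_like(height)
    norm = np.sqrt(nx**2 + ny**2 + nz**2)
    normal = np.stack([nx, ny, nz], axis=-1) / norm[..., None]
    # Pack normals from [-1, 1] into [0, 255] RGB as expected by PBR pipelines.
    normal_rgb = ((normal * 0.5 + 0.5) * 255).astype(np.uint8)
    return opacity, normal_rgb
```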
A total of 22 crack decal variants are generated, representing common patterns such as single cracks, bifurcated cracks, and crossed cracks [114], as summarized in Table 3. During instantiation in UE5, each decal is parameterized using a Data Table that stores per-crack length and width values. Decals are scaled anisotropically to preserve the proportions observed in the source field imagery, thereby maintaining geometric consistency with the visual morphology of real-world cracks. The decal sizing strategy was selected to remain broadly consistent with surface cracks documented in operational open-pit environments, where field surveys have reported crack lengths of approximately 5 to 8 m and aperture widths ranging from 1 to 40 cm [115]. Accordingly, the resulting decal library was designed to provide a field-informed visual parameterization of representative cracks for CV model training rather than a site-specific statistical model of crack occurrence.

3.1.4. Lighting Positioning and Intensity Parameterization

Lighting parameters are independently randomized to replicate the diverse conditions encountered during site inspections. The directional light component, representing solar illumination, is configured with continuously varying position and intensity. Azimuth angle θ spans a full 360° rotation along the horizontal plane, while elevation angle ψ varies between 30° and 90° to represent solar position changes from early morning to overhead midday conditions. Light intensity I ranges from 5 to 10 lux with a fixed color temperature of 5000 K to represent daylight white balance. This systematic randomization of solar position and intensity produces natural changes in shadow direction and surface contrast, helping prevent overfitting and improving model robustness to real-world lighting variability.

3.1.5. Camera Viewpoint Parameterization

Virtual camera positioning employs randomized spherical coordinate parameterization relative to the center of the instantiated crack, thereby ensuring coverage of geotechnically relevant inspection viewpoints. The standoff distance d is sampled between 1 and 10 m to represent the typical range for both ground-based and unmanned aerial vehicle (UAV) inspection [116]. The azimuth angle ϕ provides 360° rotation for complete directional coverage around the crack, eliminating bias towards specific viewing directions, while the elevation angle α varies between 45° and 90° to span oblique to nadir perspectives. Slight camera jitter ρ between −10° and 10° across pitch and roll axes simulates natural handheld tilt and UAV attitude changes, whereas field of view (FOV) variations from 70° to 110° represent the typical range of smartphone and digital single-lens reflex (DSLR) cameras. Focus distance is automatically set to match the standoff distance, ensuring consistent sharpness across viewpoints. Additional intrinsic camera parameters for the Cine Camera Actor component [117] are configured as summarized in Table 4.
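As a concrete illustration of the sampling scheme in Section 3.1.4 and Section 3.1.5, the following Python sketch draws lighting and camera parameters from the stated ranges and converts the spherical camera coordinates to a Cartesian position. It is a minimal stand-in for the Blueprint logic in Algorithm 1; a Z-up convention (as in UE5) and illustrative function names are assumed.

```python
import math
import random

def sample_lighting():
    """Sample solar azimuth, elevation, and intensity per Section 3.1.4."""
    theta = random.uniform(0.0, 360.0)       # azimuth (deg)
    psi = random.uniform(30.0, 90.0)         # elevation (deg)
    intensity = random.uniform(5.0, 10.0)    # lux
    return theta, psi, intensity

def sample_camera(crack_center):
    """Sample a viewpoint on a sphere around the crack per Section 3.1.5."""
    d = random.uniform(1.0, 10.0)                      # standoff distance (m)
    phi = math.radians(random.uniform(0.0, 360.0))     # azimuth
    alpha = math.radians(random.uniform(45.0, 90.0))   # elevation
    jitter = (random.uniform(-10, 10), random.uniform(-10, 10))  # pitch/roll (deg)
    fov = random.uniform(70.0, 110.0)                  # field of view (deg)
    # Spherical-to-Cartesian conversion (Z-up).
    x = crack_center[0] + d * math.cos(alpha) * math.cos(phi)
    y = crack_center[1] + d * math.cos(alpha) * math.sin(phi)
    z = crack_center[2] + d * math.sin(alpha)
    return (x, y, z), jitter, fov, d  # focus distance is later set to d
```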

3.1.6. Synthetic Image Rendering

For each randomized scene instance, a 1920 × 1080 resolution image is rendered using the High Resolution Screenshot Tool (HRSST) [118]. Global illumination is enabled via Lumen [65] in hardware ray tracing (RT) mode with high-quality settings to achieve photorealistic lighting behavior. Temporal anti-aliasing (TAA) is used to suppress spatial aliasing, while post-processing effects such as auto-exposure, chromatic aberration, vignetting, film grain, and lens distortion are disabled to maintain consistent image quality across captures.

3.1.7. Bounding Box Computation

As illustrated in Figure 4, each crack decal actor is approximated by a 3D box with center pc and half-extents ex, ey, and ez, with corresponding corner points defined as:
$$p_i = p_c + \left(\pm e_x,\ \pm e_y,\ \pm e_z\right), \qquad i = 1, \ldots, 8. \tag{1}$$
To transform these 3D world coordinates into 2D image space for bounding box computation, each corner point is projected to pixel space using the camera projection operator Π(·):
$$(u_i, v_i) = \Pi(p_i), \tag{2}$$
where ui and vi represent the horizontal and vertical pixel coordinates of the i-th projected corner clamped to the viewport bounds [0, W] × [0, H], with W = 1920 and H = 1080.
The operator Π(·) performs perspective division and maps world coordinates to pixel coordinates using a 4 × 4 reversed-Z perspective projection matrix P:
$$P = \begin{bmatrix} \dfrac{1}{\tan(\alpha/2)} & 0 & 0 & 0 \\ 0 & \dfrac{W}{H\tan(\alpha/2)} & 0 & 0 \\ 0 & 0 & \dfrac{n}{n-f} & 1 \\ 0 & 0 & \dfrac{-fn}{n-f} & 0 \end{bmatrix}, \tag{3}$$
where α represents the vertical FOV (so that α/2 is the half-angle from the optical axis to the edge of the viewable area), and n and f correspond to the near and far clipping planes, respectively [119]. From the set of eight projected corner points {(ui, vi)}, the enclosed 2D bounding box coordinates are computed as:
$$u_{min} = \min_i u_i, \quad u_{max} = \max_i u_i, \quad v_{min} = \min_i v_i, \quad v_{max} = \max_i v_i. \tag{4}$$
Pixel-space center coordinates and bounding box dimensions w and h are then calculated as:
$$x_c^{px} = \frac{u_{min} + u_{max}}{2}, \quad y_c^{px} = \frac{v_{min} + v_{max}}{2}, \quad w^{px} = u_{max} - u_{min}, \quad h^{px} = v_{max} - v_{min}. \tag{5}$$
These quantities are then normalized with respect to the viewport dimensions W and H and exported as a label file in the YOLO annotation format [120]:
$$x_c = \frac{x_c^{px}}{W}, \quad y_c = \frac{y_c^{px}}{H}, \quad w = \frac{w^{px}}{W}, \quad h = \frac{h^{px}}{H}. \tag{6}$$
The reliability of the projection-based annotation strategy is assessed using a representative subset of 300 synthetic images by comparing automatically generated bounding boxes against manually annotated reference labels using IoU-based agreement metrics.
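The projection chain in Equations (1)–(6) reduces to a few lines of NumPy. The sketch below is illustrative rather than the in-engine implementation: it assumes the eight world-space corners and the view and projection matrices (e.g., built from Equation (3)) are supplied by the rendering pipeline, with a row-vector convention.

```python
import numpy as np

def project_decal_to_yolo(corners_world, view, proj, W=1920, H=1080):
    """Project the 8 corners of a decal's 3D box to a normalized YOLO bbox (Eqs. 1-6)."""
    # Homogeneous world coordinates -> clip space via view and projection matrices.
    pts = np.hstack([corners_world, np.ones((8, 1))])        # (8, 4)
    clip = pts @ view.T @ proj.T
    ndc = clip[:, :2] / clip[:, 3:4]                         # perspective division
    # NDC [-1, 1] -> pixel coordinates, clamped to the viewport bounds.
    u = np.clip((ndc[:, 0] * 0.5 + 0.5) * W, 0, W)
    v = np.clip((1.0 - (ndc[:, 1] * 0.5 + 0.5)) * H, 0, H)   # flip y for image space
    u_min, u_max, v_min, v_max = u.min(), u.max(), v.min(), v.max()
    # Pixel-space centre/size, then normalization to YOLO format (Eqs. 5-6).
    xc = (u_min + u_max) / 2 / W
    yc = (v_min + v_max) / 2 / H
    w = (u_max - u_min) / W
    h = (v_max - v_min) / H
    return xc, yc, w, h
```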

3.1.8. Automated Dataset Generation Pipeline

Automated dataset generation is implemented through the UE5 Blueprint script presented in Algorithm 1, which iteratively randomizes crack decals, terrain materials, lighting conditions, and camera geometry for each rendered image using the distributions defined in Section 3.1.2, Section 3.1.3, Section 3.1.4 and Section 3.1.5. The corresponding annotation for each image is computed as described in Section 3.1.7, producing a total of 20,000 labelled synthetic surface crack images. Dataset generation is performed using an NVIDIA GeForce RTX 5090 GPU, 64 GB RAM, and an Intel Core Ultra 9 285H CPU in UE5 version 5.6.1.
Algorithm 1. Automated Synthetic Dataset Generation Pipeline for UE5
Input: Dataset size N, Crack decals C, Terrain materials T
Output: Images I = {I1, …, IN}, Labels L = {L1, …, LN}
1: for i = 1 to N do
2:  //Sample crack decal and terrain material
3:    crack ← SampleCrack(C), terrain ← SampleMaterial(T)
4:  //Sample lighting and camera parameters
5:    (θ, ψ, I_L) ← SampleLighting(), (d, ϕ, α, ρ, FOV) ← SampleCamera()
6:  //Configure scene with sampled parameters
7:    SpawnDecal(crack), ApplyTerrainMaterial(terrain), SetDirectionalLight(θ, ψ, I_L)
8:  //Compute and assign camera position
9:    position ← SphericalToCartesian(d, ϕ, α)
10:  SetPosition(position, roll = ρ, perspective = FOV)
11:   //Render image and compute annotation
12:  Ii = CaptureImage(), Li = ComputeBoundingBox()
13:   //Prepare for next iteration
14:  DestroyDecal(crack)
15: end for
16: return I, L

3.2. Synthetic Dataset Fidelity and Diversity Enhancement Using StyleGAN2-ADA

In this stage, StyleGAN2-ADA (Section 3.2.1) is trained exclusively on the UE5 dataset to enhance the fidelity and diversity of the synthesized imagery, generating realistic samples of surface cracks with structural variations beyond those achievable with parametric rendering alone. To investigate the effect of different initialization strategies, three transfer learning configurations (Section 3.2.2) are evaluated with respect to generation fidelity, quantified using FID (Section 3.2.3), and generation diversity, measured using LPIPS (Section 3.2.4). Additionally, the domain gap relative to both UE5 imagery and real-world surface crack samples is examined using feature-space analyses (Section 3.2.5). The resulting 20,000 images generated for each StyleGAN2-ADA configuration are then automatically annotated with bounding boxes using Grounding DINO and filtered using confidence thresholding (Section 3.2.6).

3.2.1. StyleGAN2-ADA Architecture Overview

We adopt a GAN-based architecture over diffusion-based alternatives due to the fine-grained, deterministic control GANs provide over the generation process [121]. Specifically, we employ StyleGAN2-ADA, which is well suited to generating high-fidelity synthetic imagery even under limited-data conditions. The architecture consists of a generator, which includes mapping and synthesis networks that produce synthetic images, and a discriminator, which evaluates the realism of generated samples, as illustrated in Figure 5. The mapping network contains a multilayer perceptron (MLP) with eight fully connected (FC) layers that transform an input vector in latent space z ∈ Z into an intermediate latent code w ∈ W. The synthesis network then produces images through a hierarchy of style-modulated convolution blocks with learned weight demodulation spanning multiple resolution scales. Per-layer affine transforms A from w produce the style parameters that control the convolutions performed in each of these style blocks, thereby controlling visual attributes and image characteristics, while injected stochastic noise B provides finer, unstructured detail. The discriminator employs a multi-stage residual architecture, downsampling full-resolution inputs and passing the final feature map through an FC layer to produce a single scalar output D(x) representing the estimated probability that the input image is real rather than generated. To maintain training stability under limited data conditions, ADA dynamically applies random geometric and color-space perturbations to discriminator inputs to prevent overfitting while preserving generator output diversity.
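For readers unfamiliar with this interface, the hedged sketch below shows how the mapping and synthesis networks are exercised when sampling from a trained checkpoint. The pickle path is a placeholder, and the attribute names follow NVIDIA's official PyTorch implementation.

```python
import pickle
import torch

# Load a trained checkpoint produced by the official NVIDIA repository; a plain
# pickle.load works for pickles written by the same codebase (the repo's
# legacy.load_network_pkl handles older formats).
with open("network-snapshot.pkl", "rb") as f:   # placeholder path
    G = pickle.load(f)["G_ema"].cuda().eval()   # exponential-moving-average generator

z = torch.randn(4, G.z_dim, device="cuda")      # latent vectors z in Z
c = torch.zeros(4, G.c_dim, device="cuda")      # class labels (unconditional here)
with torch.no_grad():
    w = G.mapping(z, c, truncation_psi=0.7)     # mapping network: z -> w
    imgs = G.synthesis(w, noise_mode="const")   # synthesis network: w -> images
# Outputs are NCHW float tensors in [-1, 1]; rescale to [0, 255] for saving.
imgs = ((imgs.clamp(-1, 1) + 1) * 127.5).to(torch.uint8)
```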

3.2.2. StyleGAN2-ADA Training Configuration

As discussed throughout Section 2.2, generative models typically require tens of thousands of training images to produce high-fidelity outputs [69,76], with techniques such as transfer learning potentially enhancing their generalizability in data-constrained settings [72,73,75]. Pre-training leverages visual priors learned from a source domain to accelerate convergence and improve generation quality in a target domain. To examine the impact of transfer learning on the synthesis of surface crack images using StyleGAN2-ADA, three training configurations are developed to represent distinct points along a domain similarity spectrum.
The first configuration, SG2, trains StyleGAN2-ADA solely on game engine data, serving as a baseline for quantifying the generation quality achievable without external knowledge transfer. Both the generator and discriminator are initialized with random weights and trained for 2000 kimg on the UE5 dataset.
The second configuration, SG2 + FFHQ, leverages weights pre-trained on Flickr-Faces-HQ (FFHQ) [106], a dataset of 70,000 high-resolution images of human faces, to examine the impact of transfer learning from a semantically unrelated source domain. This configuration is motivated by prior research demonstrating that low-level visual features, such as edge and texture primitives, exhibit strong transferability across semantically different domains [72]. The pre-trained model is fine-tuned using the UE5 dataset for 2000 kimg.
The third configuration, SG2 + DTD, utilizes weights pre-trained on the Describable Textures Dataset (DTD) [122], comprising 5640 images across 47 texture categories at 1024 × 1024 resolution [123]. Unlike the FFHQ dataset, DTD contains, amongst other categories, explicit representations of cracked, fractured, and rough surface patterns, providing strong low-level visual correspondence between source and target domains. Fine-tuned using the UE5 dataset for 2000 kimg, this configuration enables assessment of the impact of explicit texture-focused pre-training on the generation of surface crack images.
All training configurations use the official NVIDIA StyleGAN3 repository [124] with the StyleGAN2-ADA architecture configuration, owing to its improved compatibility with current versions of PyTorch. Training is conducted on an NVIDIA Tesla A100 GPU with CUDA 12.8, PyTorch 2.2.0, and Python 3.10. Training images are downsampled from 1920 × 1080 to 512 × 512 resolution to balance spatial detail with computational efficiency and training stability. A batch size of 16 is selected to ensure gradient stability within GPU memory constraints. As crack morphology remains invariant under horizontal reflection, mirror augmentation is enabled to increase training data diversity and improve generalization. Additional training hyperparameters follow the StyleGAN2-ADA default values [125] summarized in Table 5.
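A representative training launch is sketched below. The flag names reflect the StyleGAN3 repository's train.py at the time of writing and should be verified against the specific checkout; the output directory, dataset archive, resume pickle, and R1 regularization weight are placeholders rather than the exact values used here.

```python
import subprocess

# Representative launch of the SG2 + FFHQ configuration using the official
# repository's train.py (flag names assumed from that codebase).
subprocess.run([
    "python", "train.py",
    "--outdir=runs/sg2_ffhq",
    "--data=datasets/ue5_cracks_512.zip",    # packaged with the repo's dataset_tool.py
    "--cfg=stylegan2",                       # StyleGAN2-ADA architecture configuration
    "--gpus=1", "--batch=16",
    "--mirror=1",                            # horizontal-flip augmentation
    "--kimg=2000",
    "--gamma=8",                             # R1 regularization weight (placeholder)
    "--resume=pretrained/ffhq512.pkl",       # placeholder path to FFHQ weights
], check=True)
```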

3.2.3. Generation Fidelity Evaluation

We employ the FID [126] to assess generation fidelity, using it as a complementary distributional metric alongside perceptual and embedding-based analyses (Section 3.2.4 and Section 3.2.5). This metric quantifies the distance between the feature distributions of reference and generated images in the activation space of a pre-trained Inception-v3 network as follows [127]:
$$d^2\big((m, C), (m_w, C_w)\big) = \lVert m - m_w \rVert_2^2 + \mathrm{Tr}\Big(C + C_w - 2\big(C C_w\big)^{1/2}\Big), \tag{7}$$
where (m, C) and (mw, Cw) denote the mean vectors and covariance matrices of the real and generated image features extracted from the Inception-v3 activation space, respectively, and Tr represents the matrix trace. Lower FID scores generally indicate greater similarity between the Gaussian approximations of the real and generated feature distributions p and pw, thereby reflecting higher generation fidelity.
To monitor generation quality throughout training, FID scores are computed on selected checkpoints (every 200 kimg) spanning the training duration by synthesizing 50,000 random samples and comparing their feature distributions against 50,000 images sampled with replacement from the training set. This temporal tracking facilitates analysis of convergence behavior, typically marked by FID score stabilization, and enables detection of potential mode collapse, which may manifest as sudden increases or sustained oscillations in FID scores during training. Upon convergence and completion of training, the best checkpoint for each of the three configurations is selected based on the lowest FID score achieved using full latent space sampling to ensure unbiased evaluation. Finally, 20,000 surface crack images are generated using the best checkpoint for each training configuration through random sampling of the latent space.
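Equation (7) reduces to a few lines once the Inception-v3 feature means and covariances have been estimated. The sketch below is a minimal implementation assuming those statistics are provided; production FID code typically adds numerical safeguards beyond the complex-drift correction shown.

```python
import numpy as np
from scipy import linalg

def fid(mu_r, cov_r, mu_g, cov_g):
    """Frechet Inception Distance between two Gaussian approximations (Eq. 7)."""
    # Squared Euclidean distance between the feature means.
    diff = mu_r - mu_g
    # Matrix square root of the covariance product; may acquire a small
    # imaginary component from numerical error, which is discarded.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean)
```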

3.2.4. Generation Diversity Evaluation

We evaluate generation diversity by utilizing LPIPS [128], a metric which measures perceptual distance aligning closely with human visual judgement by quantifying how far apart two images are in the feature space of a pre-trained network. As shown in Equation (8), LPIPS computes this perceptual distance by measuring the squared l2 difference between the feature activations yl and yl0 of two images across multiple layers of a network F, in our case, Visual Geometry Group-16 (VGG-16) [129]. These feature differences are unit-normalized and scaled by vector wl prior to being averaged spatially and summed channel-wise, producing a scalar output d that correlates strongly with human judgements of visual similarity. Lower scores, typically close to 0, indicate greater perceptual similarity, whereas higher scores suggest that image pairs look more different and diverse to humans.
$$d(x, x_0) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \left\lVert w_l \odot \left(\hat{y}_l^{hw} - \hat{y}_{0l}^{hw}\right) \right\rVert_2^2. \tag{8}$$
To assess whether StyleGAN2-ADA has generated surface crack images with greater diversity than those present in the original UE5 training dataset, we conduct an intra-set diversity comparison by analyzing both LPIPS and clustering in the VGG-16 feature space. To do so, we randomly sample 5000 pairs of images from each of our synthesized datasets and calculate the distance between each pair. By comparing the resulting LPIPS distributions, including their mean and median values, differences in generation diversity across dataset configurations are quantified. To determine whether these observed differences are statistically significant, the Mann–Whitney U test is employed due to its suitability for comparing distributions without assuming normality [130]. We also extract the feature vectors from our datasets using the VGG-16 backbone and use t-distributed Stochastic Neighbor Embedding (t-SNE) to visualize the image embeddings. The number and separation of clusters in t-SNE space provide qualitative insight into distributional coverage, where a greater number of separated clusters suggests broader perceptual variability and therefore, increased dataset diversity.
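The pairwise sampling procedure can be sketched with the reference lpips package as follows. The batch handling is illustrative and assumes images are pre-loaded as an NCHW tensor scaled to [−1, 1], as the package expects.

```python
import random
import torch
import lpips  # pip install lpips

# Perceptual distance in VGG-16 feature space, as in Eq. (8).
loss_fn = lpips.LPIPS(net="vgg").cuda().eval()

def intra_set_diversity(images: torch.Tensor, n_pairs: int = 5000) -> float:
    """Mean LPIPS over random image pairs; `images` is NCHW in [-1, 1]."""
    dists = []
    with torch.no_grad():
        for _ in range(n_pairs):
            i, j = random.sample(range(images.shape[0]), 2)
            d = loss_fn(images[i : i + 1].cuda(), images[j : j + 1].cuda())
            dists.append(d.item())
    return sum(dists) / len(dists)
```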

3.2.5. Domain Gap Evaluation

To quantitatively assess whether StyleGAN2-ADA reduces the domain gap between synthetic and real-world imagery, we perform a feature-space distribution analysis. Specifically, we compare the feature representations of the UE5 baseline dataset and the three StyleGAN2-ADA-enhanced configurations against the real-world surface crack dataset (described in Section 3.3.3). This yields four pairwise comparisons, enabling direct assessment of which initialization strategy produces the smallest domain gap relative to real-world conditions. Image embeddings are extracted using a pre-trained CLIP ViT-L/14 image encoder [131]. This model was selected because its contrastive language-image pretraining yields general-purpose embeddings that capture broad semantic and visual structure. In contrast, Inception-v3 and ResNet-based alternatives were originally developed for supervised image classification, so their embeddings may be shaped by class-discriminative objectives [132]. Given the limited size of the real-world test set, domain difference is quantified using the squared Maximum Mean Discrepancy (MMD2) computed on the CLIP embeddings [133]. A third-degree polynomial kernel is employed, with embeddings l2 normalized prior to comparison. The MMD2 metric is defined as:
$$\mathrm{MMD}^2(S, R) = \mathbb{E}_{x, x' \sim S}\left[k(x, x')\right] + \mathbb{E}_{y, y' \sim R}\left[k(y, y')\right] - 2\,\mathbb{E}_{x \sim S,\, y \sim R}\left[k(x, y)\right], \tag{9}$$
where S and R denote the synthetic and real-world feature distributions, respectively, and k(·, ·) is the polynomial kernel. Lower MMD2 values indicate greater alignment between the synthetic and real-world image distributions, thereby providing a direct quantitative measure of whether the StyleGAN2-ADA-enhanced datasets more closely approximate real-world open-pit surface crack imagery than the UE5 baseline. This feature-space analysis complements the downstream YOLOv11 evaluation (Section 3.3), which provides a task-specific assessment of whether closer feature-space alignment corresponds to improved real-world detection performance.
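A minimal estimator for Equation (9) is sketched below. It assumes the CLIP embeddings are provided as row-wise feature matrices; the unbiased diagonal exclusion and the kernel scaling (following the common KID-style convention) are implementation choices, as the exact kernel constants are not prescribed by the equation.

```python
import torch

def mmd2_poly(S: torch.Tensor, R: torch.Tensor, degree: int = 3, c: float = 1.0) -> float:
    """Unbiased squared MMD with a polynomial kernel over L2-normalized rows (Eq. 9)."""
    S = S / S.norm(dim=1, keepdim=True)
    R = R / R.norm(dim=1, keepdim=True)
    d = S.shape[1]
    k = lambda a, b: (a @ b.T / d + c) ** degree  # KID-style kernel scaling (assumed)
    k_ss, k_rr, k_sr = k(S, S), k(R, R), k(S, R)
    m, n = S.shape[0], R.shape[0]
    # Exclude self-similarity (diagonal) terms for the unbiased estimate.
    e_ss = (k_ss.sum() - k_ss.diagonal().sum()) / (m * (m - 1))
    e_rr = (k_rr.sum() - k_rr.diagonal().sum()) / (n * (n - 1))
    return float(e_ss + e_rr - 2.0 * k_sr.mean())
```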

3.2.6. Automatic Annotation Using Grounding DINO

Manual annotation in CV has long been a time-consuming and resource-intensive task. To address this challenge, we employ pseudo-labelling using Grounding DINO [107], a unified VLM designed for open-set object detection through language-guided text prompts. This zero-shot inference approach enables the generation of labels for previously unseen data, such as the images synthesized by StyleGAN2-ADA. Grounding DINO adopts a transformer-based encoder–decoder architecture that processes visual features extracted from images and fuses them with prompt information via cross-attention. The visual encoder, implemented using a pre-trained Swin Transformer [134], extracts multi-scale visual features from input images, while a BERT-based [135] text encoder converts prompt tokens into semantic embeddings that guide the detection process. The model integrates these representations within a feature enhancer module, producing language-conditioned object queries that are passed to the decoder, whose outputs are directly projected into bounding-box coordinates and text-region alignment scores through prediction layers.
We utilize Grounding DINO in a zero-shot manner with the text prompt “crack” to generate candidate bounding boxes for each StyleGAN2-ADA image. Low-confidence predictions are removed via confidence thresholding, and the remaining annotations are converted to YOLO format to produce a large-scale pseudo-labelled dataset suitable for downstream training. To determine an appropriate confidence threshold and evaluate the reliability of the pseudo-labelling pipeline, a sensitivity analysis is performed on a representative subset of 300 manually annotated images by comparing pseudo-label quality across multiple threshold values using IoU-based agreement metrics. This analysis identifies the threshold that provides the best balance between retaining valid crack detections and suppressing lower-confidence predictions, thereby defining the final pseudo-labelling strategy used for downstream training.
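The pseudo-labelling step can be sketched against the official Grounding DINO inference utilities as follows. The config and weight paths follow the repository's published layout, and the thresholds shown are placeholders standing in for the values selected by the sensitivity analysis.

```python
from groundingdino.util.inference import load_model, load_image, predict

# Paths follow the official repository layout (assumed); adjust to your checkout.
model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth",
)

def pseudo_label(image_path: str, box_threshold=0.35, text_threshold=0.25):
    """Generate YOLO-format pseudo-labels for one image with the prompt 'crack'."""
    _, image = load_image(image_path)
    boxes, logits, phrases = predict(
        model=model, image=image, caption="crack",
        box_threshold=box_threshold, text_threshold=text_threshold,
    )
    # Boxes are returned as normalized (cx, cy, w, h), matching YOLO format.
    return [(0, *box.tolist()) for box in boxes]  # class id 0 = crack
```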

3.3. Crack Detection Using YOLOv11

3.3.1. YOLOv11 Architecture Overview

We utilize YOLOv11 [136] to evaluate the downstream effectiveness of the datasets synthesized by the proposed framework. As an established, high-performing one-stage real-time object detection model, YOLOv11 is well suited to operational mining contexts requiring real-time inference. As illustrated in Figure 6, YOLOv11 follows a modular design comprising three principal components: a backbone, a neck, and a detection head.
The backbone consists of stacked CBS and C3K2 blocks that progressively reduce the spatial resolution of input images while increasing channel depth, enabling efficient extraction of features such as crack width and curvature. It also incorporates the spatial pyramid pooling-fast (SPPF) block to expand the receptive field and combine multi-scale contextual information, and C2PSA attention modules to more effectively capture both local and global features, thereby improving detection accuracy across objects of varying scales. The neck facilitates multi-scale feature fusion through a series of upsampling, downsampling, and concatenation operations that refine information from different stages of the backbone, enhancing robustness to scale variation. Finally, the detection head employs decoupled classification and regression branches, enabling more precise object localization and bounding box prediction by processing the fused features transmitted from the neck.

3.3.2. YOLOv11 Training Configuration

To determine whether the proposed framework provides measurable benefits for surface crack detection, we conduct a dataset-level ablation study using YOLOv11. Four dataset variants are evaluated: (i) UE5 only (UE5O), representing the baseline; (ii) UE5O augmented with StyleGAN2-ADA (UE5O + SG2), used to assess whether the proposed framework improves detection performance beyond UE5O; (iii) SG2 with FFHQ pre-training (UE5O + SG2-FFHQ); and (iv) SG2 with DTD pre-training (UE5O + SG2-DTD). These configurations, summarized in Table 6, are designed to evaluate the impact that generative enhancement of synthetic data has on downstream object detection performance. Training YOLOv11 on each dataset separately also provides a controllable basis for isolating the effects of StyleGAN2-ADA initialization strategies on real-world generalization performance.
Training is conducted using the official YOLOv11 implementation on an NVIDIA Tesla A100 GPU with CUDA 12.8, PyTorch 2.2.0, and Python 3.10. The medium model variant (YOLOv11m) is selected to balance detection performance and computational efficiency. COCO pre-trained weights are used for the UE5O configuration, while SG2-enhanced models are fine-tuned from the best-performing UE5O weights. All remaining training hyperparameters are reported in Table 7.
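Using the Ultralytics API, the fine-tuning step might look like the sketch below; the dataset YAML and the hyperparameter values shown are placeholders standing in for the configurations reported in Table 7.

```python
from ultralytics import YOLO

# Medium variant initialized from COCO pre-trained weights (UE5O configuration);
# SG2-enhanced runs would instead load the best UE5O checkpoint.
model = YOLO("yolo11m.pt")
model.train(
    data="ue5o_sg2.yaml",  # hypothetical dataset config pointing at images/labels
    epochs=100,            # placeholder hyperparameters (see Table 7)
    imgsz=640,
    batch=16,
    device=0,
)
metrics = model.val()      # reports precision, recall, mAP@0.5, mAP@[0.5:0.95]
```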

3.3.3. Object Detection Performance Evaluation

Object detection performance is evaluated using a held-out real-world test set of 200 images collected across four operational open-pit mine locations in Australia. The dataset includes 49 UAV images and 151 handheld-camera images, spanning a range of aerial and ground-based viewpoints from close-range inspections to wider contextual views. It also includes scenes from benches, haul roads, pit slopes, and crest regions, providing a realistic evaluation set across varied viewing geometries, surface contexts, and operational inspection conditions. Figure 7 provides a representative overview of the test set.
Model performance is quantified using precision, recall, F1 score, and AP evaluated at different IoU thresholds. In this study, a predicted bounding box is considered a true positive (TP) if its overlap with a ground-truth bounding box exceeds an IoU threshold of 0.5. Predictions below this threshold are classified as false positives (FP), while ground-truth objects without matching predictions constitute false negatives (FN). From these quantities we compute precision (P), representing the proportion of predicted detections that correctly identify actual surface cracks:
$$P = \frac{TP}{TP + FP} \tag{10}$$
We also calculate recall (R), indicating the proportion of actual surface cracks successfully detected by the model:
$$R = \frac{TP}{TP + FN} \tag{11}$$
We leverage the F1 score as a balanced indicator of detection performance, providing a single measure that reflects model performance with respect to both consistent and accurate crack identification:
$$F_1 = \frac{2 \times P \times R}{P + R} \tag{12}$$
Finally, the AP metric, corresponding to the area under the precision-recall curve, is used to provide a consolidated measure of detection quality by integrating precision and recall across all confidence thresholds:
$$AP = \int_0^1 P(R)\, dR \tag{13}$$
We report AP@0.5 (IoU 0.5) and AP@[0.5:0.95] to quantify both detection accuracy and spatial extent localization, two attributes essential for effective surface crack monitoring.
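For clarity, the sketch below shows how these quantities follow from IoU-based matching at a fixed threshold (Eqs. 10–12). The greedy confidence-ordered matching is one common convention and is illustrative rather than the exact evaluation harness used here.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def detection_metrics(preds, gts, iou_thr=0.5):
    """Greedy one-to-one matching, then precision, recall, F1 (Eqs. 10-12).

    `preds` is a list of (x1, y1, x2, y2, confidence); `gts` a list of boxes.
    """
    matched, tp = set(), 0
    for p in sorted(preds, key=lambda x: x[4], reverse=True):  # by confidence
        best_j = max(
            (j for j in range(len(gts)) if j not in matched),
            key=lambda j: iou(p[:4], gts[j]), default=None,
        )
        if best_j is not None and iou(p[:4], gts[best_j]) >= iou_thr:
            matched.add(best_j)
            tp += 1
    fp, fn = len(preds) - tp, len(gts) - tp
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    return precision, recall, f1
```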

4. Results and Discussion

This section evaluates the effectiveness of the proposed hybrid synthetic dataset generation framework in addressing data scarcity for open-pit surface crack detection. We first examine the convergence characteristics of the StyleGAN2-ADA training configurations, followed by assessments of generation fidelity, diversity, and domain gap, as well as validation of the UE5 projection-based annotations and the Grounding DINO pseudo-labelling strategy (Section 4.1). We then evaluate real-world crack detection performance through a dataset-level ablation study using YOLOv11, assessing downstream model robustness and generalizability to real-world open-pit mining conditions (Section 4.2). Finally, we discuss the broader implications of the proposed framework for automated surface crack detection, evaluating its practical utility, scalability, and limitations for real-world deployment (Section 4.3). Together, these analyses provide a comprehensive assessment of whether the proposed framework meaningfully improves real-world object detection performance under limited-data conditions.

4.1. StyleGAN2-ADA Training and Image Generation Assessment

4.1.1. Training Dynamics

Figure 8 illustrates the training behavior of the three StyleGAN2-ADA configurations used for image synthesis in this study.
As highlighted in Figure 8a, the baseline configuration (SG2) exhibits higher generator loss at initialization compared to both pre-trained models (SG2 + FFHQ and SG2 + DTD). This behavior is expected when training from scratch, as the generator initially produces unstructured noise, allowing the discriminator to easily classify its outputs as fake samples with a high degree of confidence. SG2 generator loss rapidly decreases within the first 100 kimg as the generator starts to form a coherent latent representation, then decays slowly, converging to approximately 1.29 at around 1250 kimg. Both pre-trained models, conversely, achieve more rapid convergence with lower initial generator loss due to their already well-structured latent representations. SG2 + FFHQ ultimately achieves the lowest generator loss at 1.13 and converges the most smoothly, indicating that pre-trained weights from the FFHQ-1024 dataset may provide effective feature representations for surface crack image generation, despite originating from a semantically distant and unrelated domain. Domain-aligned pre-training by way of the SG2 + DTD configuration exhibits better initialization and faster convergence than both SG2 and SG2 + FFHQ; however, generator loss fluctuations throughout training suggest comparatively reduced stability relative to the other training configurations.
While generator loss highlights the impact of initialization and pre-training on generative capacity, discriminator loss, quantified in Figure 8b, provides complementary insight into the stability and balance of adversarial training dynamics. Discriminator loss across all three configurations converges rapidly within the first 200 kimg and remains tightly clustered for the remainder of training, indicating that the adversarial game stabilizes early and reaches a Nash equilibrium [137]. All three configurations exhibit an initial loss spike caused by the discriminator rapidly adapting to the highly unrealistic outputs synthesized by the generator; however, the discriminator then settles into a narrow range between 0.85 and 0.90 following this transient phase, with only minor fluctuations throughout the full 2000 kimg training window. SG2 displays slightly lower discriminator loss than the pre-trained configurations during the early stages (<50 kimg), reflecting the confidence with which the discriminator classifies the random noise synthesized by the uninitialized generator as fake samples. As generator fidelity improves, the discriminator loss rises to match that of the pre-trained configurations, signaling convergence toward a stable equilibrium. Both pre-trained configurations, however, reach this equilibrium more quickly, exhibiting nearly indistinguishable trajectories beyond the 100 kimg mark. Overall, the minimal separation between curves indicates that, unlike generator loss, discriminator loss is only weakly influenced by pre-training and instead reflects the balance of the adversarial process once both networks have stabilized, ultimately confirming that all three configurations maintained stable training dynamics without evidence of discriminator collapse.
Figure 8c shows the evolution of the augmentation probability applied by the ADA module to mitigate discriminator overfitting. This value increases when the discriminator exhibits signs of overfitting to the training data, so its trajectory reflects the degree of overfitting pressure throughout training. The baseline configuration exhibits the lowest augmentation probability throughout the training run, rising gradually from near-zero to just 32%, indicating comparatively weaker overfitting pressure as the generator initially produces low-quality outputs from its randomly initialized weights. In contrast, SG2 + DTD rises steadily and plateaus around 50%, while SG2 + FFHQ reaches the highest augmentation probability of 60%, suggesting that the discriminator is under stronger overfitting pressure, likely due to the coherent outputs of the FFHQ-initialized generator, thereby requiring ADA to inject stronger regularization to maintain adversarial balance. From a training stability perspective, the rising augmentation probability for the pre-trained configurations indicates that ADA is actively preventing the discriminator from overfitting to training samples, helping to maintain stable generation quality throughout training. Without this mechanism, the discriminator would be prone to memorizing aspects of the limited training set rather than learning generalizable features, ultimately degrading the generator’s ability to synthesize diverse and realistic outputs.
Changes in generator fidelity throughout training are demonstrated in Figure 8d, which illustrates the FID score progression of the three StyleGAN2-ADA initialization strategies. Across all configurations, FID decreases sharply during the first 400 kimg as the network rapidly learns structural and textural characteristics from the training data. Both pre-trained configurations exhibit a steeper initial decline than the baseline configuration, reflecting the advantage conferred by transfer learning through the inheritance of low-level feature priors that improve early-stage generation fidelity. This acceleration in early synthesis quality is further supported by Figure 9, which shows that crack-like structures emerge within the first 30 kimg of training for both pre-trained configurations.
In contrast, the baseline configuration requires nearly three times longer to form comparably coherent latent-space structure. Following the initial descent, all curves plateau and show only marginal improvements for the remainder of training, indicating that each configuration reaches a point of diminishing returns relatively early, at around 400 kimg. This behavior also confirms that all three configurations achieved stable training dynamics with no evidence of mode collapse, which is often signaled by late-stage FID fluctuations. SG2 + FFHQ ultimately achieves the lowest FID at convergence (12.75), outperforming both SG2 + DTD (15.88) and SG2 (15.99), further demonstrating that FFHQ pre-training provides the most effective and stable transfer of feature representations for realistic surface crack image synthesis in this study.
Taken together, the adversarial loss trajectories, ADA probability trends, and FID score evolution indicate that all three StyleGAN2-ADA configurations converged stably, with the pre-trained configurations, particularly SG2 + FFHQ, achieving faster learning and superior generation fidelity overall. These dynamics provide evidence against severe discriminator overfitting or obvious mode collapse across the evaluated configurations, but do not by themselves quantify the realism or variability of the final generated images. Accordingly, the following section evaluates the fidelity and diversity of the generated samples in greater detail.

4.1.2. Generation Fidelity and Diversity Evaluation

Table 8 provides a quantitative comparison of the final synthesis quality attained by each StyleGAN2-ADA training configuration, evaluated using the FID and LPIPS metrics described in Section 3.2.3 and Section 3.2.4.
Consistent with the training dynamics analysis in Section 4.1.1, SG2 + FFHQ achieves the lowest final FID score of 12.75, indicating the strongest distributional fidelity among the evaluated configurations. SG2 + DTD (15.88) and SG2 (15.99) also attain comparatively low FID values, suggesting that all three configurations generate high-quality synthetic crack imagery, with FFHQ pre-training providing the most effective transfer of visual priors for this task. However, as discussed in Section 3.2.3, FID is interpreted here as a complementary fidelity metric rather than a standalone indicator of real-world transferability.
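As an illustration of how such a post-hoc FID comparison can be computed, the sketch below uses the torchmetrics implementation; the tensor names and preprocessing are assumptions, and the values reported in Table 8 come from the StyleGAN2-ADA training pipeline rather than this exact routine.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Sketch: FID between a reference image set and a generated image set.
# Both tensors are assumed to be uint8 batches of shape (N, 3, H, W).
fid = FrechetInceptionDistance(feature=2048)
fid.update(reference_images, real=True)   # e.g., UE5 training imagery
fid.update(generated_images, real=False)  # e.g., SG2 + FFHQ samples
print(f"FID: {fid.compute().item():.2f}")
```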
Representative samples from each configuration are shown in Figure 10. While UE5O produces photorealistic imagery, its morphological variation is limited. The StyleGAN2-ADA configurations introduce a broader range of crack structures and surface textures, with SG2 + FFHQ exhibiting the sharpest crack boundaries and most coherent backgrounds, consistent with its superior FID. In addition to achieving the strongest FID score, SG2 + FFHQ also exhibits the highest mean (0.472 ± 0.090) and median (0.477) LPIPS, indicating that samples generated from this configuration contain greater perceptual diversity than the other StyleGAN2-ADA training configurations. SG2 + DTD (0.457 ± 0.092) and SG2 (0.452 ± 0.094) also achieve higher LPIPS values than the UE5O baseline, indicating modest but meaningful diversity gains. Relative to UE5O, the LPIPS improvements for SG2, SG2 + FFHQ, and SG2 + DTD are 2.80%, 7.49%, and 3.96%, respectively.
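A mean pairwise LPIPS score of this kind can be sketched as follows using the reference lpips package; the AlexNet backbone, the [-1, 1] input scaling, and the pair-sampling cap are assumptions rather than the study's exact protocol.

```python
import itertools
import torch
import lpips  # pip install lpips

# Perceptual diversity as the mean pairwise LPIPS over a batch of images.
# `images` is assumed to be a float tensor of shape (N, 3, H, W) in [-1, 1].
loss_fn = lpips.LPIPS(net="alex").eval()

@torch.no_grad()
def mean_pairwise_lpips(images: torch.Tensor, max_pairs: int = 5000) -> float:
    pairs = list(itertools.combinations(range(len(images)), 2))[:max_pairs]
    dists = [loss_fn(images[i:i + 1], images[j:j + 1]).item() for i, j in pairs]
    return sum(dists) / len(dists)
```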
To assess whether the observed differences in LPIPS are statistically significant, Mann–Whitney U tests with Holm correction for multiple comparisons were conducted on the LPIPS distributions of each configuration against UE5O, as shown in Table 9. All three comparisons yielded statistically significant results, while the rank-biserial effect sizes indicate that the practical magnitude of these differences varies across configurations. SG2 + FFHQ exhibits the largest effect (r = 0.2276), followed by SG2 + DTD (r = 0.1442), while the baseline SG2 configuration shows a smaller effect (r = 0.0995). These results confirm that, although all StyleGAN2-ADA configurations produce LPIPS distributions that are statistically distinguishable from UE5O, FFHQ pre-training provides the most practically meaningful improvement in perceptual diversity.
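A minimal sketch of this testing procedure is given below using SciPy and statsmodels; the `lpips_scores` dictionary and configuration names are illustrative placeholders for the precomputed per-pair LPIPS distributions.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

def rank_biserial(u: float, n1: int, n2: int) -> float:
    """Rank-biserial effect size, r = 1 - 2U / (n1 * n2)."""
    return 1.0 - 2.0 * u / (n1 * n2)

baseline = lpips_scores["UE5O"]  # array of LPIPS values for the UE5O set
names = ["SG2", "SG2+FFHQ", "SG2+DTD"]
p_values, effects = [], {}
for name in names:
    u, p = mannwhitneyu(lpips_scores[name], baseline, alternative="two-sided")
    p_values.append(p)
    effects[name] = abs(rank_biserial(u, len(lpips_scores[name]), len(baseline)))

# Holm correction across the three comparisons against UE5O.
reject, p_adjusted, _, _ = multipletests(p_values, method="holm")
```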
Figure 11 visualizes the t-SNE embeddings of the UE5O dataset and the three StyleGAN2-ADA training configurations, providing further insight into the diversity trends observed in Table 8 and Table 9. Figure 11a, specifically, shows that UE5O forms a series of relatively compact clusters, whereas the StyleGAN2-ADA configurations occupy a broader and more continuous manifold in feature space. Among these, SG2 + FFHQ appears to exhibit the widest and most overlapping spread, consistent with its higher LPIPS statistics and larger rank-biserial effect size. The t-SNE plots are presented for qualitative visualization of feature-space structure only; quantitative assessment of domain alignment is provided by the MMD2 analysis in Section 4.1.3.
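A visualization of this kind can be produced with scikit-learn as sketched below; the `features` array of image embeddings and the `labels` array of dataset names are assumed to be precomputed by an upstream feature extractor.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# `features`: (N, D) image embeddings; `labels`: (N,) dataset names.
embedding = TSNE(n_components=2, perplexity=30.0, init="pca",
                 random_state=0).fit_transform(features)
for name in np.unique(labels):
    mask = labels == name
    plt.scatter(embedding[mask, 0], embedding[mask, 1],
                s=4, alpha=0.5, label=str(name))
plt.legend()
plt.axis("off")
plt.show()
```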

4.1.3. Domain Gap Evaluation

To quantify the domain gap between synthetic images generated by the proposed framework and real-world open-pit mining imagery, MMD2 was computed in CLIP ViT-L/14 feature space between each StyleGAN2-ADA configuration and the held-out real-world test set. As summarized in Table 10, lower MMD2 values indicate greater alignment between the synthetic and real-world distributions and therefore a smaller domain gap.
Among the evaluated configurations, SG2 + FFHQ achieves the lowest MMD2 (0.000404), representing a 14.4% reduction in domain gap relative to the UE5O baseline. Additionally, its 95% CI also does not overlap with that of UE5O, indicating a clear reduction in distributional distance relative to real-world imagery. By contrast, neither SG2 nor SG2 + DTD produces a comparable reduction, with both configurations increasing the distance and exhibiting confidence intervals that overlap with that of UE5O. These results indicate that the choice of transfer learning strategy plays a significant role in determining whether generative modelling improves synthetic-to-real distributional alignment. In particular, the FFHQ pre-trained configuration appears to introduce visual priors that are beneficial for narrowing the domain gap, whereas training from scratch or from a texture-aligned source domain does not produce the same improvement in feature-space alignment. At the same time, MMD2 is interpreted as a complementary domain gap analysis metric, rather than a standalone indicator of practical effectiveness. A reduction in feature-space distance provides quantitative evidence that the synthetic dataset has moved closer to the target real-world distribution but does not by itself establish improved downstream object detection performance. Accordingly, the downstream YOLOv11 evaluation in Section 4.2 is used to determine whether the combined gains in domain alignment, fidelity, and diversity translate to more robust real-world crack detection.
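The sketch below shows one common way to estimate squared MMD between two feature sets; the RBF kernel with a median-heuristic bandwidth is an assumption, and extraction of the CLIP ViT-L/14 features is presumed to happen upstream.

```python
import numpy as np

def mmd2_unbiased(x: np.ndarray, y: np.ndarray) -> float:
    """Unbiased squared MMD between feature sets x (n, d) and y (m, d)."""
    def sq_dists(a: np.ndarray, b: np.ndarray) -> np.ndarray:
        return (np.sum(a**2, axis=1)[:, None]
                + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T)

    z = np.vstack([x, y])
    d = sq_dists(z, z)
    gamma = 1.0 / np.median(d[d > 0])  # median heuristic bandwidth
    kxx = np.exp(-gamma * sq_dists(x, x))
    kyy = np.exp(-gamma * sq_dists(y, y))
    kxy = np.exp(-gamma * sq_dists(x, y))
    n, m = len(x), len(y)
    return (float((kxx.sum() - np.trace(kxx)) / (n * (n - 1)))
            + float((kyy.sum() - np.trace(kyy)) / (m * (m - 1)))
            - 2.0 * float(kxy.mean()))
```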

4.1.4. Annotation Reliability Assessment

To assess the reliability of the automatic annotation strategies developed in Section 3.1.7 and Section 3.2.6, two targeted validation analyses were conducted. To evaluate the accuracy of the projection-based UE5 annotation pipeline, a representative subset of 300 synthetic images was manually annotated and compared against the automatically generated bounding boxes. As summarized in Table 11, overall agreement was high, with a mean IoU of 0.931 and a median IoU of 0.958. The method further achieved F1 scores of 0.990 and 0.960 at IoU thresholds of 0.5 and 0.75, respectively, indicating that the projection-based annotation strategy provides strong agreement with manual reference labels and is unlikely to introduce substantial systematic bias into the UE5 training dataset.
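The agreement statistics above reduce to per-box IoU computations between automatic and manual annotations; a minimal helper of the kind used for such validation is sketched below, assuming boxes in (x1, y1, x2, y2) pixel format.

```python
def box_iou(a, b) -> float:
    """IoU between two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```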
To determine an appropriate confidence threshold and assess Grounding DINO pseudo-label quality, a sensitivity analysis was conducted on a representative subset of 300 manually annotated StyleGAN2-ADA images. As shown in Table 12, Grounding DINO achieves consistently high agreement with the reference labels, with mean IoU values ranging from 0.897 to 0.918 and median IoU exceeding 0.945 for all threshold values. A threshold of 0.35 was used for annotation of the StyleGAN2-ADA datasets as it achieved the strongest F1@0.5 while maintaining a competitive F1@0.75, providing a suitable overall balance between retaining valid crack detections and suppressing low-confidence predictions. Moreover, these results indicate that Grounding DINO produces labels of sufficient spatial accuracy and detection reliability to support training downstream models.
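For context, pseudo-labelling with Grounding DINO at a fixed confidence threshold can be sketched as below using the Hugging Face transformers port; the checkpoint name, text prompt, text threshold, and file path are illustrative assumptions rather than the study's exact configuration.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

ckpt = "IDEA-Research/grounding-dino-base"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModelForZeroShotObjectDetection.from_pretrained(ckpt).eval()

image = Image.open("sg2_sample.png").convert("RGB")
inputs = processor(images=image, text="surface crack.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep only detections above the 0.35 box-confidence threshold from Table 12.
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)[0]
pseudo_boxes = results["boxes"]  # (x1, y1, x2, y2) pseudo-label boxes
```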

4.2. YOLOv11 Training and Crack Detection Performance Evaluation

4.2.1. Training Dynamics

Figure 12 summarizes the YOLOv11 training behavior of the four dataset configurations developed in this study. To enable direct comparison with the SG2-enhanced models, which were trained for 100 epochs due to early stopping, only the first 100 epochs of UE5O are shown. As highlighted by the validation box loss curves in Figure 12a, all models converged stably with no evidence of overfitting or divergence. UE5O, trained solely on game engine data, exhibited more gradual convergence and ultimately reached a final validation box loss of 0.17 at 300 epochs. In contrast, the three SG2-enhanced variants stabilized within the first 20 epochs. This accelerated convergence reflects the effect of fine-tuning, as the SG2-enhanced models begin from weights already optimized on the UE5O dataset. All configurations achieved strong validation performance across AP@0.5, AP@[0.5:0.95], and precision, indicating that YOLOv11 learns the UE5 domain almost perfectly. This also suggests that the synthetic validation set may lack sufficient complexity to meaningfully differentiate the SG2-enhanced models from UE5O.
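A two-stage schedule of this kind can be sketched with the Ultralytics API as shown below; the dataset YAML names, model scale, image size, and patience value are placeholders, not the study's exact hyperparameters.

```python
from ultralytics import YOLO

# Stage 1: train on the UE5-only dataset from COCO-pretrained weights.
base = YOLO("yolo11s.pt")
base.train(data="ue5o_cracks.yaml", epochs=300, imgsz=512)

# Stage 2: fine-tune the UE5O weights on an SG2-enhanced dataset,
# with early stopping controlled by the `patience` argument.
tuned = YOLO("runs/detect/train/weights/best.pt")
tuned.train(data="ue5o_sg2_ffhq_cracks.yaml", epochs=100,
            patience=20, imgsz=512)
```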

4.2.2. Real-World Performance Evaluation

Table 13 documents the object detection performance of the four YOLOv11 dataset configurations evaluated on a held-out test set of 200 real-world open-pit surface crack images. The UE5O baseline achieves the weakest performance overall, with an AP@0.5 of 0.792 and an AP@[0.5:0.95] of 0.536. While this demonstrates that high-fidelity game engine data can support reasonable detection performance, its comparatively weaker precision and recall relative to the SG2-enhanced models suggest that a residual domain gap limits its ability to generalize reliably to real-world conditions.
This limitation causes the model to miss more challenging cracks or to incorrectly identify background material as a surface crack, as illustrated in Figure 13.
The precision-recall characteristics shown in Figure 14 further reinforce this behavior, with UE5O exhibiting earlier precision degradation as recall increases relative to the SG2-enhanced configurations, indicating reduced robustness to confidence threshold variation. This pattern is consistent with a residual domain gap and suggests that certain real-world visual characteristics are potentially underrepresented in the UE5O training distribution. As a result, the model is more prone to missed detections and occasional false positives on occluded regions, as illustrated in Figure 13d. All SG2-enhanced variants, in contrast, demonstrate improved detection performance across all reported metrics, suggesting that the proposed framework introduces stochastic variation and structural diversity that benefits downstream generalization. As shown in Table 13, these configurations achieved higher AP, precision, recall, and F1 scores relative to the UE5O baseline, indicating that the generated images provide a more effective training distribution for real-world open-pit surface crack detection. Moreover, paired bootstrap 95% CIs with 1000 resamples provide additional evidence that these improvements are robust to test set sampling variability, with non-overlapping intervals observed between UE5O and both UE5O + SG2 and UE5O + SG2-FFHQ for AP@0.5 and AP@[0.5:0.95]. UE5O + SG2-FFHQ attained the highest precision (0.808), indicating the strongest suppression of false positives, while UE5O + SG2 achieved the highest recall (0.902), correctly identifying the largest proportion of surface cracks. Although UE5O + SG2-DTD exhibited slightly weaker performance than the other SG2 variants, it still delivered an overall improvement over the UE5O baseline. Despite achieving a slightly lower AP@0.5 than UE5O + SG2, UE5O + SG2-FFHQ provided superior localization performance across a broader range of IoU thresholds, with an AP@[0.5:0.95] of 0.722. This is consistent with the fidelity and diversity metrics reported in Section 4.1.2 and Section 4.1.3, where FFHQ pre-training achieved the strongest FID, the largest LPIPS effect size, and the greatest reduction in domain gap.
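A paired bootstrap of the kind referenced above can be sketched as follows; the function and argument names are illustrative, and `metric_fn` stands in for whichever per-image metric aggregator (e.g., AP@0.5) is being resampled.

```python
import numpy as np

def paired_bootstrap_cis(metric_fn, preds_a, preds_b, gts,
                         n_boot=1000, alpha=0.05, seed=0):
    """Paired bootstrap CIs for a detection metric over a shared test set.
    `metric_fn(preds, gts)` aggregates the metric over a list of images."""
    rng = np.random.default_rng(seed)
    n = len(gts)
    scores_a, scores_b = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # identical resample for both models
        sample_gts = [gts[i] for i in idx]
        scores_a.append(metric_fn([preds_a[i] for i in idx], sample_gts))
        scores_b.append(metric_fn([preds_b[i] for i in idx], sample_gts))
    lo, hi = 100 * alpha / 2, 100 * (1 - alpha / 2)
    return (np.percentile(scores_a, [lo, hi]),
            np.percentile(scores_b, [lo, hi]))
```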
Interestingly, UE5O + SG2 achieved higher recall than UE5O + SG2-FFHQ, suggesting that the baseline SG2 configuration promotes greater detection sensitivity, albeit at the expense of localization precision. By contrast, the comparatively weaker performance of UE5O + SG2-DTD suggests that texture-focused initialization is less effective for downstream surface crack detection, despite achieving broadly comparable image-level fidelity and diversity, as shown in Table 8. These initialization-dependent effects are also reflected in the precision-recall behavior in Figure 14b–d, where all SG2-enhanced variants maintain higher precision as recall increases than UE5O, indicating greater robustness to confidence threshold variation. In particular, UE5O + SG2 maintains near-perfect precision until a recall of approximately 0.70, while both UE5O + SG2-FFHQ and UE5O + SG2-DTD exhibit similarly strong behavior with slightly earlier precision decay at higher recall.
Figure 15 further illustrates how StyleGAN2-ADA initialization influences detection confidence and localization behavior by comparing representative test detections across the four YOLOv11 dataset configurations. As shown in Figure 15b, UE5O + SG2 generally produces the highest confidence predictions but occasionally includes additional background material within predicted bounding boxes. This suggests that the model favors detection sensitivity and objectness confidence, albeit at the expense of tighter localization. This qualitative behavior is consistent with the strong recall and AP@0.5 seen for UE5O + SG2 in Table 13, as well as the consistently high precision maintained with increasing levels of recall in Figure 14b, and helps explain the slightly weaker AP@[0.5:0.95] of this configuration, reflecting reduced localization accuracy at stricter IoU thresholds.
In contrast, UE5O + SG2-FFHQ produces more selective detections with improved spatial alignment relative to ground truth annotations, as seen by the tighter bounding boxes in Figure 15c. While this behavior can result in lower confidence predictions, the increased localization accuracy leads to stronger performance at higher IoU thresholds, resulting in a stronger AP@[0.5:0.95], as seen in Table 13. UE5O + SG2-DTD attains comparatively weaker qualitative results, exhibiting reduced robustness that manifests as missed detections for lower-contrast or partially occluded cracks, and FPs with overlapping predictions on background material, as observed in Figure 15d. This behavior impacts both precision and AP, resulting in earlier precision drops in Figure 14d and weaker overall model performance relative to the other SG2-enhanced variants.
To further examine this behavior, Table 14 reports the distribution of FP and FN counts for each YOLOv11 dataset configuration evaluated on the real-world test set, providing a breakdown of the underlying causes of the aforementioned quantitative and qualitative performance trends. Consistent with its weaker overall performance, UE5O exhibits a comparatively high number of FPs and FNs, further quantifying the sample-level behavior seen in Figure 13. This directly contributes to the precision-recall characteristics observed in Figure 14a, where the high FP count causes precision degradation with increasing recall, while the curve also terminates earlier due to the relatively high number of FNs. All SG2-enhanced variants demonstrate reduced FP and FN counts relative to the UE5O baseline. UE5O + SG2 records the lowest FP count overall, consistent with the near-perfect precision it sustains across much of the recall range and with its strong AP@0.5 performance. Its slightly weaker AP@[0.5:0.95] can be attributed to localization imprecision, whereby otherwise correct detections fail to match the ground-truth annotation at stricter IoU thresholds, thereby increasing the number of false negatives. By contrast, although UE5O + SG2-FFHQ exhibits a moderately higher FP count than UE5O + SG2, it achieves substantially fewer missed detections under stricter IoU criteria, resulting in superior localization robustness. This is consistent with the higher precision and AP@[0.5:0.95] reported in Table 13, as well as the tighter detections shown in Figure 15c. UE5O + SG2-DTD performs weakest among the SG2-enhanced variants, exhibiting elevated FP and FN counts relative to both UE5O + SG2 and UE5O + SG2-FFHQ, consistent with the earlier precision degradation seen in Figure 14d. As illustrated in Figure 15d, the higher FP count for this configuration appears to arise from frequent misclassifications of background debris or rubble as surface cracks, while its elevated FN count is associated with occasional missed detections.
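FP/FN bookkeeping of this kind typically rests on greedy confidence-ordered matching of predictions to ground truth; the sketch below illustrates the idea (reusing the box_iou helper sketched in Section 4.1.4) and is not the exact evaluation code behind Table 14.

```python
def count_fp_fn(pred_boxes, pred_scores, gt_boxes, iou_thr=0.5):
    """Greedy one-to-one matching of predictions (highest confidence first)
    to ground-truth boxes; unmatched predictions are FPs, unmatched GT are FNs."""
    order = sorted(range(len(pred_boxes)), key=lambda i: -pred_scores[i])
    matched, fp = set(), 0
    for i in order:
        candidates = [(box_iou(pred_boxes[i], g), k)
                      for k, g in enumerate(gt_boxes) if k not in matched]
        best_iou, best_k = max(candidates, default=(0.0, None))
        if best_iou >= iou_thr:
            matched.add(best_k)
        else:
            fp += 1
    fn = len(gt_boxes) - len(matched)
    return fp, fn
```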
One persistent limitation seen across all configurations, regardless of initialization strategy, is the difficulty in detecting heavily occluded, fine-grained surface cracks. This behavior, highlighted in Figure 16, causes missed detections that degrade overall recall, and is likely attributable in part to the inherent architectural constraints of CNN-based object detection models such as YOLOv11 [138]. A number of factors such as backbone downsampling, localization sensitivity, and receptive field size mismatch can result in the model ignoring or missing smaller features [139]. As a result, while more pronounced cracks are generally detected with high confidence, hairline cracks hidden within the background terrain remain challenging to localize consistently.
To facilitate comparison across the full evaluation framework, Table 15 consolidates the key generation quality and downstream detection metrics for each dataset configuration.

4.3. Practical Deployment Considerations and Limitations

While the results presented throughout the preceding sections demonstrate that the proposed hybrid synthetic dataset generation framework improves real-world open-pit surface crack detection performance and alleviates the issue of data scarcity, its practical utility depends not only on detection accuracy but also on its deployability in operational settings.
A significant practical advantage of the proposed framework is its ability to generate extensive datasets at scale with minimal human intervention. Unlike conventional CV-based monitoring, which necessitates costly and laborious field data collection and manual annotation [50,51,52], the images synthesized in this study are labelled automatically. For mining operations, this eliminates the need to deploy workers into the field solely for data collection purposes, reducing operational disruption and safety risks. Furthermore, this decoupling of dataset size from annotation effort enables rapid dataset expansion without a proportional increase in manual labelling time, enhancing operational efficiency. Beyond this, the nature of virtualized game engine environments allows for iterative refinement and adaptation to new contexts, enabling continuous improvement of training data as domain requirements evolve. This reduces the overall cost associated with the implementation of object detection models and enhances the accessibility of domain-specific training data.
One practical consideration is the upfront computational overhead associated with both UE5-based image synthesis and StyleGAN2-ADA-based generative enhancement. In this study, full dataset generation of 20,000 UE5 images required approximately 6 h on high-end consumer hardware (NVIDIA GeForce RTX 5090 GPU and 64 GB RAM). Similarly, the use of StyleGAN2-ADA incurred additional GPU costs, requiring nearly 12 h for training and a further 4 h for image synthesis and pseudo-labelling per configuration on an NVIDIA Tesla A100 GPU, amounting to a total of 64 h of GPU runtime. As such, the overall framework is best viewed as an upfront investment to generate large-scale datasets for data-scarce domains, rather than as a lightweight data augmentation technique.
Ultimately, the primary practical outcome of the proposed framework is improved downstream model robustness to real-world variability, particularly in data-scarce environments. As demonstrated in Section 4.2.2, the combination of game engine data and GAN-based generative enhancement enables CV models to generalize more effectively across diverse viewpoints, crack morphologies, and background conditions, all without requiring access to real-world training datasets. Improvements in metrics essential for automated inspection workflows, such as recall and AP@[0.5:0.95], indicate that the framework trains consistent and reliable object detection models that can enhance downstream analysis and decision support. More broadly, these findings demonstrate that the absence of large-scale real-world datasets need not restrain the effective application of CV models in data-scarce domains. By carefully constructing domain-specific virtualized environments within a game engine and subsequently enhancing their fidelity and diversity through generative modeling, it is possible to produce training data that generalizes effectively to real-world conditions. In this context, the proposed hybrid synthetic dataset generation framework provides a practical pathway for addressing data scarcity in domains constrained by privacy, intellectual property, and proprietary concerns. For mining operations in particular, where site-specific geotechnical data is often commercially sensitive and difficult to share across organizations, this capability offers a pathway to develop robust inspection systems without compromising data security.
Notwithstanding these results, a number of limitations should be acknowledged. The real-world test set comprises 200 images from four Australian mine sites, which, while representative of the target domain, limits the extent to which the reported findings can be generalized to climatically diverse mining contexts. Similarly, downstream detection performance was assessed using a single CNN-based model to isolate the contributions of the proposed framework; however, evaluation across other architectures such as ViTs would help further establish the utility of the reported findings. Additionally, while image synthesis was conducted at 512 × 512 resolution to balance spatial detail with computational efficiency and training stability, this may constrain the preservation of fine-grained crack detail. Finally, while the three transfer learning configurations evaluated in this study were selected to span a domain similarity spectrum, additional source domains and training schedules remain unexplored and may offer further improvements to both generation quality and downstream performance.

5. Conclusions and Future Work

Autonomous surface crack detection in open-pit mining offers numerous benefits such as enhanced worker safety and improved operational efficiency. However, CV models require large amounts of representative training data to generalize effectively to unseen conditions, impacting their applicability in commercial domains constrained by safety, cost, and data confidentiality considerations. To address this challenge, this study presented a novel hybrid game engine–generative AI framework for synthetic dataset generation without requiring real-world training data and evaluated its effectiveness for surface crack detection in real-world open-pit mining imagery. The proposed approach combined the high-fidelity rendering capabilities of the UE5 game engine with StyleGAN2-ADA, enabling the synthesis of large-scale, fully labelled surface crack datasets that improve the generalizability of CV models without reliance on extensive field data collection or manual annotation.
Comprehensive evaluation on a held-out real-world test set demonstrated that object detection models trained on images generated by the proposed framework outperformed those trained solely on synthetic data from UE5. In particular, AP@0.5 and AP@[0.5:0.95] increased by up to 16.4% and 34.7%, respectively, for the best-performing GAN-enhanced configurations. These performance gains were accompanied by higher recall and reduced missed detections, which decreased from 44 to as few as 8. A systematic comparison of transfer learning strategies revealed that FFHQ pre-training, despite originating from a semantically unrelated domain, consistently outperformed both from-scratch and texture-aligned initialization across generation fidelity, diversity, domain gap, and downstream detection metrics. From a practical perspective, this work demonstrates the viability of synthetic dataset generation pipelines in autonomous inspection workflows, reducing dependence on manual labeling while mitigating operational, safety, and confidentiality constraints associated with real-world data collection. More broadly, it demonstrates that synthetic data can generalize effectively to real-world conditions when underpinned by an appropriate generation framework, suggesting that the proposed approach may be extended to other data-constrained domains where large-scale labelled datasets are similarly difficult to obtain.
Future work will build on this research in several ways. First, diffusion-based generative models will be explored as an alternative to StyleGAN2-ADA to examine whether their distinct synthesis characteristics provide additional downstream benefits. Second, the detection pipeline will be extended toward multi-scale learning for improved object detection at varying distances, and instance or semantic segmentation to enable more precise delineation of surface crack boundaries for downstream analysis such as propagation measurement. Additionally, to address the limitation of small object detection identified in the study, future work will investigate alternative architectures, including ViT-based models, alongside architectural enhancements to object detection and segmentation models to further improve sensitivity to fine-grained cracks. Further automated or learned domain randomization strategies will also be investigated to reduce manual parameterization effort, and higher-resolution generation strategies, such as patch-based synthesis, will be explored to improve fine crack detail preservation. Finally, framework integration with edge-based inference platforms and UAV-based data acquisition will be examined to support real-time autonomous inspection workflows across diverse open-pit mining environments.

Author Contributions

Conceptualization, R.L.R., S.K., M.S. and I.M.; methodology, R.L.R.; software, R.L.R.; validation, R.L.R.; formal analysis, R.L.R.; investigation, R.L.R.; resources, R.L.R.; data curation, R.L.R.; writing—original draft preparation, R.L.R.; writing—review and editing, R.L.R.; visualization, R.L.R.; supervision, S.K., M.S. and I.M.; project administration, S.K., M.S. and I.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ADA	Adaptive Discriminator Augmentation
AI	Artificial Intelligence
AP	Average Precision
BERT	Bidirectional Encoder Representations from Transformers
CBS	Convolutional Block with Batch normalization and SiLU
CIoU	Complete Intersection over Union
CNN	Convolutional Neural Network
COCO	Common Objects in Context
CPU	Central Processing Unit
CUDA	Compute Unified Device Architecture
CV	Computer Vision
DL	Deep Learning
DSLR	Digital Single-Lens Reflex
DTD	Describable Textures Dataset
FC	Fully Connected
FID	Fréchet Inception Distance
FN	False Negative
FOV	Field of View
FP	False Positive
GAN	Generative Adversarial Network
GPU	Graphics Processing Unit
GTA V	Grand Theft Auto V
HRSST	High Resolution Screenshot Tool
IoU	Intersection over Union
LPIPS	Learned Perceptual Image Patch Similarity
mAP	Mean Average Precision
MLP	Multilayer Perceptron
MMD	Maximum Mean Discrepancy
NDDS	NVIDIA Deep Learning Dataset Synthesizer
PCK	Percentage of Correct Keypoints
PBR	Physically Based Rendering
RAM	Random Access Memory
RT	Ray Tracing
RTGI	Real-Time Global Illumination
SG2	StyleGAN2-ADA
SOTA	State of the Art
SPPF	Spatial Pyramid Pooling-Fast
TAA	Temporal Anti-Aliasing
TP	True Positive
t-SNE	t-Distributed Stochastic Neighbor Embedding
UAV	Unmanned Aerial Vehicle
UE	Unreal Engine
UE4	Unreal Engine 4
UE5	Unreal Engine 5
UE5O	Unreal Engine 5 only
VAE	Variational Autoencoder
VGG-16	Visual Geometry Group-16
ViT	Vision Transformer
VLM	Vision Language Model
YOLO	You Only Look Once

References

  1. Li, G.; Hu, Z.; Wang, D.; Wang, L.; Wang, Y.; Zhao, L.; Jia, H.; Fang, K. Instability Mechanisms of Slope in Open-Pit Coal Mines: From Physical and Numerical Modeling. Int. J. Min. Sci. Technol. 2024, 34, 1509–1528. [Google Scholar] [CrossRef]
  2. Kolapo, P.; Oniyide, G.O.; Said, K.O.; Lawal, A.I.; Onifade, M.; Munemo, P. An Overview of Slope Failure in Mining Operations. Mining 2022, 2, 350–384. [Google Scholar] [CrossRef]
  3. de Graaf, P.J.H.; Desjardins, M.; Tsheko, P.; Fourie, A.B.; Tibbett, M. Geotechnical Risk Management for Open Pit Mine Closure: A Sub-Arctic and Semi-Arid Case Study; Australian Centre for Geomechanics: Crawley, Australia, 2019; pp. 211–234. [Google Scholar]
  4. Zhang, N.; Wang, Y.; Zhao, F.; Wang, T.; Zhang, K.; Fan, H.; Zhou, D.; Zhang, L.; Yan, S.; Diao, X.; et al. Monitoring and Analysis of the Collapse at Xinjing Open-Pit Mine, Inner Mongolia, China, Using Multi-Source Remote Sensing. Remote Sens. 2024, 16, 993. [Google Scholar] [CrossRef]
  5. Lin, Y.N.; Park, E.; Wang, Y.; Quek, Y.P.; Lim, J.; Alcantara, E.; Loc, H.H. The 2020 Hpakant Jade Mine Disaster, Myanmar: A Multi-Sensor Investigation for Slope Failure. ISPRS J. Photogramm. Remote Sens. 2021, 177, 291–305. [Google Scholar] [CrossRef]
  6. Martin, C.D.; Stacey, P.F.; Dight, P.M. Pit Slopes in Weathered and Weak Rocks; Australian Centre for Geomechanics: Crawley, Australia, 2013; pp. 3–28. [Google Scholar]
  7. Zhong, Z.; Hu, B.; Li, J.; Sheng, J.; Wan, C. Impact of Rainfall Dry-Wet Cycles on Slope Deformation and Landslide Prediction in Open-Pit Mines: A Case Study of Mohuandang Landslide, Emeishan, China. Results Eng. 2025, 26, 105011. [Google Scholar] [CrossRef]
  8. Wang, W.; Griffiths, D. Case Study of Slope Failure during Construction of an Open Pit Mine in Indonesia. Can. Geotech. J. 2018, 56, 636–648. [Google Scholar] [CrossRef]
  9. Kong, K.W.K.; Dight, P.M. Blasting Vibration Assessment of Rock Slopes and a Case Study; Australian Centre for Geomechanics: Crawley, Australia, 2013; pp. 1335–1344. [Google Scholar]
  10. Wang, J.; Zhou, Z.; Chen, C.; Wang, H.; Chen, Z. Failure Mechanism and Stability Analysis of an Open-Pit Slope under Excavation Unloading Conditions. Front. Earth Sci. 2023, 11, 1109316. [Google Scholar] [CrossRef]
  11. Bridges, M.C.; Dight, P.M. An Extensional Mechanism of Instability and Failure in the Walls of Open Pit Mines; Australian Centre for Geomechanics: Crawley, Australia, 2013; pp. 137–150. [Google Scholar]
  12. Whittall, J.R.; McDougall, S.; Eberhardt, E. A Risk-Based Methodology for Establishing Landslide Exclusion Zones in Operating Open Pit Mines. Int. J. Rock Mech. Min. Sci. 2017, 100, 100–107. [Google Scholar] [CrossRef]
  13. McQuillan, A.; Canbulat, I.; Oh, J. Methods Applied in Australian Industry to Evaluate Coal Mine Slope Stability. Int. J. Min. Sci. Technol. 2020, 30, 151–155. [Google Scholar] [CrossRef]
  14. Vaziri, A.; Moore, L.; Ali, H. Monitoring Systems for Warning Impending Failures in Slopes and Open Pit Mines. Nat. Hazards 2010, 55, 501–512. [Google Scholar] [CrossRef]
  15. Mohammed, M.M. A Review On Slope Monitoring And Application Methods In Open Pit Mining Activities. Int. J. Min. Sci. Technol. Res. 2021, 10, 181–186. [Google Scholar]
  16. Ching, J.; Phoon, K.-K. Value of Geotechnical Site Investigation in Reliability-Based Design Advances in Structural Engineering. Adv. Struct. Eng. 2012, 15, 1935–1945. [Google Scholar] [CrossRef]
  17. Zumrawi, M. Effects of Inadequate Geotechnical Investigations on Civil Engineering projects. Int. J. Sci. Res. IJSR 2014, 3, 927–931. [Google Scholar]
  18. Crisp, M.P.; Jaksa, M.; Kuo, Y. Optimal Testing Locations in Geotechnical Site Investigations through the Application of a Genetic Algorithm. Geosciences 2020, 10, 265. [Google Scholar] [CrossRef]
  19. Le Roux, R.; Sepehri, M.; Khaksar, S.; Murray, I. Slope Stability Monitoring Methods and Technologies for Open-Pit Mining: A Systematic Review. Mining 2025, 5, 32. [Google Scholar] [CrossRef]
  20. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of Deep Learning: Concepts, CNN Architectures, Challenges, Applications, Future Directions. J. Big Data 2021, 8, 53. [Google Scholar] [CrossRef]
  21. Matsuzaka, Y.; Yashiro, R. AI-Based Computer Vision Techniques and Expert Systems. AI 2023, 4, 289–302. [Google Scholar] [CrossRef]
  22. Kalluri, P.R.; Agnew, W.; Cheng, M.; Owens, K.; Soldaini, L.; Birhane, A. Computer-Vision Research Powers Surveillance Technology. Nature 2025, 643, 73–79. [Google Scholar] [CrossRef] [PubMed]
  23. Chimakurthi, V.N.S.S. Application of Convolution Neural Network for Digital Image Processing. Eng. Int. 2020, 8, 149–158. [Google Scholar] [CrossRef]
  24. Kameswari, C.; J, K.; Reddy, T.; Chinthaguntla, B.; Jagatheesaperumal, S.; Gaftandzhieva, S.; Doneva, R. An Overview of Vision Transformers for Image Processing: A Survey. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 273–289. [Google Scholar] [CrossRef]
  25. Fawole, O.A.; Rawat, D.B. Recent Advances in 3D Object Detection for Self-Driving Vehicles: A Survey. AI 2024, 5, 1255–1285. [Google Scholar] [CrossRef]
  26. Bratulescu, R.-A.; Vatasoiu, R.-I.; Sucic, G.; Mitroi, S.-A.; Vochin, M.-C.; Sachian, M.-A. Object Detection in Autonomous Vehicles. In Proceedings of the 2022 25th International Symposium on Wireless Personal Multimedia Communications (WPMC), Herning, Denmark, 30 October–2 November 2022; pp. 375–380. [Google Scholar]
  27. Albuquerque, C.; Henriques, R.; Castelli, M. Deep Learning-Based Object Detection Algorithms in Medical Imaging: Systematic Review. Heliyon 2025, 11, e41137. [Google Scholar] [CrossRef]
  28. Saraei, M.; Lalinia, M.; Lee, E.-J. Deep Learning-Based Medical Object Detection: A Survey. IEEE Access 2025, 13, 53019–53038. [Google Scholar] [CrossRef]
  29. Malburg, L.; Rieder, M.-P.; Seiger, R.; Klein, P.; Bergmann, R. Object Detection for Smart Factory Processes by Machine Learning. Procedia Comput. Sci. 2021, 184, 581–588. [Google Scholar] [CrossRef]
  30. Fatima, Z.; Zardari, S.; Tanveer, M.H. Advancing Industrial Object Detection Through Domain Adaptation: A Solution for Industry 5.0. Actuators 2024, 13, 513. [Google Scholar] [CrossRef]
  31. Di Mucci, V.M.; Cardellicchio, A.; Ruggieri, S.; Nettis, A.; Renò, V.; Uva, G. Artificial Intelligence in Structural Health Management of Existing Bridges. Autom. Constr. 2024, 167, 105719. [Google Scholar] [CrossRef]
  32. Plevris, V.; Papazafeiropoulos, G. AI in Structural Health Monitoring for Infrastructure Maintenance and Safety. Infrastructures 2024, 9, 225. [Google Scholar] [CrossRef]
  33. Lee, J.; Lee, S. Construction Site Safety Management: A Computer Vision and Deep Learning Approach. Sensors 2023, 23, 944. [Google Scholar] [CrossRef]
  34. Rabbi, A.B.K.; Jeelani, I. AI Integration in Construction Safety: Current State, Challenges, and Future Opportunities in Text, Vision, and Audio Based Applications. Autom. Constr. 2024, 164, 105443. [Google Scholar] [CrossRef]
  35. Yan, Y.; Zhang, Y.; Zhou, L.; Zhang, F.; Wang, H. Acoustic Signals-Based Probabilistic Fault Diagnosis for Expansion Joints of Small and Medium Bridges Using Bayesian Ensemble Learning. Eng. Struct. 2026, 354, 122379. [Google Scholar] [CrossRef]
  36. Liao, R.; Zhang, Y.; Wang, H.; Zhao, T.; Wang, X. Multi-Objective Optimisation of Surveillance Camera Placement for Bridge–Ship Collision Early-Warning Using an Improved Non-Dominated Sorting Genetic Algorithm. Adv. Eng. Inform. 2026, 69, 103918. [Google Scholar] [CrossRef]
  37. Khalife, S.; Emadi, S.; Wilner, D.; Hamzeh, F. Developing Project Value Attributes: A Proposed Process for Value Delivery on Construction Projects. In Proceedings of the IGLC 30—International Group for Lean Construction Conference, Edmonton, AB, Canada, 25–29 July 2022; pp. 913–924. [Google Scholar]
  38. Demirel, Z.; Nasraldeen, S.T.; Pehlivan, Ö.; Shoman, S.; Albdairi, M.; Almusawi, A. Comparative Evaluation of YOLO and Gemini AI Models for Road Damage Detection and Mapping. Future Transp. 2025, 5, 91. [Google Scholar] [CrossRef]
  39. Zhao, M.; Wang, S.; Guo, B.; Gu, W. Review of Crack Depth Detection Technology for Engineering Structures: From Physical Principles to Artificial Intelligence. Appl. Sci. 2025, 15, 9120. [Google Scholar] [CrossRef]
  40. Ruan, S.; Hu, Y.; Liu, J.; Wang, J. An Advanced Crack Detection Method for Slope Management in Open-Pit Mines: Applying Enhanced YOLOv8 Network. Int. J. Min. Reclam. Environ. 2026, 40, 70–87. [Google Scholar] [CrossRef]
  41. An, J.; Dong, S.; Wang, X.; Li, C.; Zhao, W. Research on UAV Aerial Imagery Detection Algorithm for Mining-Induced Surface Cracks Based on Improved YOLOv10. Sci. Rep. 2025, 15, 30101. [Google Scholar] [CrossRef] [PubMed]
  42. Ruan, S.; Liu, D.; Gu, Q.; Jing, Y. An Intelligent Detection Method for Open-Pit Slope Fracture Based on the Improved Mask R-CNN. J. Min. Sci. 2022, 58, 503–518. [Google Scholar] [CrossRef]
  43. Letshwiti, T.M.; Shahsavar, M.; Moniri-Morad, A.; Sattarvand, J. Deep Learning-Based Image Segmentation for Highwall Stability Monitoring in Open Pit Mines. J. Eng. Res. 2025, 13, 3595–3608. [Google Scholar] [CrossRef]
  44. Wang, K.; Wei, B.; Zhao, T.; Wu, G.; Zhang, J.; Zhu, L.; Wang, L. An Automated Approach for Mapping Mining-Induced Fissures Using CNNs and UAS Photogrammetry. Remote Sens. 2024, 16, 2090. [Google Scholar] [CrossRef]
  45. Winkelmaier, G.; Battulwar, R.; Khoshdeli, M.; Valencia, J.; Sattarvand, J.; Parvin, B. Topographically Guided UAV for Identifying Tension Cracks Using Image-Based Analytics in Open-Pit Mines. IEEE Trans. Ind. Electron. 2021, 68, 5415–5424. [Google Scholar] [CrossRef]
  46. Bansal, M.A.; Sharma, D.R.; Kathuria, D.M. A Systematic Review on Data Scarcity Problem in Deep Learning: Solution and Applications. ACM Comput. Surv. 2022, 54, 208:1–208:29. [Google Scholar] [CrossRef]
  47. Wang, J.; Lan, C.; Liu, C.; Ouyang, Y.; Qin, T.; Lu, W.; Chen, Y.; Zeng, W.; Yu, P.S. Generalizing to Unseen Domains: A Survey on Domain Generalization. arXiv 2022, arXiv:2103.03097. [Google Scholar] [CrossRef]
  48. Alzubaidi, L.; Bai, J.; Al-Sabaawi, A.; Santamaría, J.; Albahri, A.S.; Al-dabbagh, B.S.N.; Fadhel, M.A.; Manoufali, M.; Zhang, J.; Al-Timemy, A.H.; et al. A Survey on Deep Learning Tools Dealing with Data Scarcity: Definitions, Challenges, Solutions, Tips, and Applications. J. Big Data 2023, 10, 46. [Google Scholar] [CrossRef]
  49. Harle, S.M.; Wankhade, R.L. Machine Learning Techniques for Predictive Modelling in Geotechnical Engineering: A Succinct Review. Discov. Civ. Eng. 2025, 2, 86. [Google Scholar] [CrossRef]
  50. Ramasamy, D.; Sivamani, S. The Future of Geotechnical Engineering Through Deep Learning: A Concise Literature Review. J. Inf. Syst. Eng. Manag. 2025, 10, 685–694. [Google Scholar] [CrossRef]
  51. Yamani, A.; AlAmoudi, N.; Albilali, S.; Baslyman, M.; Hassine, J. Data Requirement Goal Modeling for Machine Learning Systems. arXiv 2025, arXiv:2504.07664. [Google Scholar] [CrossRef]
  52. Taye, M.M. Understanding of Machine Learning with Deep Learning: Architectures, Workflow, Applications and Future Directions. Computers 2023, 12, 91. [Google Scholar] [CrossRef]
  53. Hutchinson, M.L.; Antono, E.; Gibbons, B.M.; Paradiso, S.; Ling, J.; Meredig, B. Overcoming Data Scarcity with Transfer Learning. arXiv 2017, arXiv:1711.05099. [Google Scholar] [CrossRef]
  54. Wang, Z.; Wang, P.; Liu, K.; Wang, P.; Fu, Y.; Lu, C.-T.; Aggarwal, C.C.; Pei, J.; Zhou, Y. A Comprehensive Survey on Data Augmentation. IEEE Trans. Knowl. Data Eng. 2026, 38, 47–66. [Google Scholar] [CrossRef]
  55. Zhao, Z.; Alzubaidi, L.; Zhang, J.; Duan, Y.; Gu, Y. A Comparison Review of Transfer Learning and Self-Supervised Learning: Definitions, Applications, Advantages and Limitations. Expert Syst. Appl. 2024, 242, 122807. [Google Scholar] [CrossRef]
  56. Brodzicki, A.; Piekarski, M.; Kucharski, D.; Jaworek-Korjakowska, J.; Gorgon, M. Transfer Learning Methods as a New Approach in Computer Vision Tasks with Small Datasets. Found. Comput. Decis. Sci. 2020, 45, 179–193. [Google Scholar] [CrossRef]
  57. Kumar, T.; Mileo, A.; Brennan, R.; Bendechache, M. Image Data Augmentation Approaches: A Comprehensive Survey and Future Directions. IEEE Access 2024, 12, 187536–187571. [Google Scholar] [CrossRef]
  58. Mumuni, A.; Mumuni, F. Data Augmentation: A Comprehensive Survey of Modern Approaches. Array 2022, 16, 100258. [Google Scholar] [CrossRef]
  59. Li, M.; Chen, H.; Wang, Y.; Zhu, T.; Zhang, W.; Zhu, K.; Wong, K.-F.; Wang, J. Understanding and Mitigating the Bias Inheritance in LLM-Based Data Augmentation on Downstream Tasks. arXiv 2025, arXiv:2502.04419. [Google Scholar] [CrossRef]
  60. Nikolenko, S. Synthetic Data for Deep Learning; Springer: Cham, Switzerland, 2021; ISBN 978-3-030-75177-7. [Google Scholar]
  61. Unreal Engine 5. Available online: https://www.unrealengine.com/en-US/unreal-engine-5 (accessed on 15 September 2025).
  62. Unity Real-Time Development Platform|3D, 2D, VR & AR Engine. Available online: https://unity.com (accessed on 15 September 2025).
  63. Li, Y.; Dong, X.; Chen, C.; Li, J.; Wen, Y.; Spranger, M.; Lyu, L. Is Synthetic Image Useful for Transfer Learning? An Investigation into Data Generation, Volume, and Utilization. arXiv 2024, arXiv:2403.19866. [Google Scholar] [CrossRef]
  64. Nanite Virtualized Geometry in Unreal Engine|Unreal Engine 5.6 Documentation|Epic Developer Community. Available online: https://dev.epicgames.com/documentation/en-us/unreal-engine/nanite-virtualized-geometry-in-unreal-engine (accessed on 16 September 2025).
  65. Lumen Global Illumination and Reflections in Unreal Engine|Unreal Engine 5.6 Documentation|Epic Developer Community. Available online: https://dev.epicgames.com/documentation/en-us/unreal-engine/lumen-global-illumination-and-reflections-in-unreal-engine (accessed on 16 September 2025).
  66. Ulhas, S.S.; Kannapiran, S.; Berman, S. GAN-Based Domain Adaptation for Creating Digital Twins of Small-Scale Driving Testbeds: Opportunities and Challenges. In Proceedings of the 2024 IEEE Intelligent Vehicles Symposium (IV), Jeju Island, Republic of Korea, 2–5 June 2024; pp. 137–143. [Google Scholar]
  67. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv 2014, arXiv:1406.2661. [Google Scholar] [CrossRef]
  68. Karras, T.; Laine, S.; Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. arXiv 2019, arXiv:1812.04948. [Google Scholar] [CrossRef]
  69. Karras, T.; Aittala, M.; Hellsten, J.; Laine, S.; Lehtinen, J.; Aila, T. Training Generative Adversarial Networks with Limited Data. arXiv 2020, arXiv:2006.06676. [Google Scholar] [CrossRef]
  70. Bandi, A.; Adapa, P.V.S.R.; Kuchi, Y.E.V.P.K. The Power of Generative AI: A Review of Requirements, Models, Input–Output Formats, Evaluation Metrics, and Challenges. Future Internet 2023, 15, 260. [Google Scholar] [CrossRef]
  71. Werda, M.S.; Taibi, H.; Kouiss, K.; Chebak, A.; Ben Halima, S.; Decottignies, M.; Dilliott, C. Towards Minimizing Domain Gap When Using Synthetic Data in Automotive Vision Control Applications. IFAC-Pap. 2024, 58, 522–527. [Google Scholar] [CrossRef]
  72. Tariq, U.; Qureshi, R.; Zafar, A.; Aftab, D.; Wu, J.; Alam, T.; Shah, Z.; Ali, H. Brain Tumor Synthetic Data Generation with Adaptive StyleGANs. In Artificial Intelligence and Cognitive Science; Longo, L., O’Reilly, R., Eds.; Springer Nature: Cham, Switzerland, 2023; pp. 147–159. [Google Scholar]
  73. Yang, S.; Kim, K.-D.; Ariji, E.; Takata, N.; Kise, Y. Evaluating the Performance of Generative Adversarial Network-Synthesized Periapical Images in Classifying C-Shaped Root Canals. Sci. Rep. 2023, 13, 18038. [Google Scholar] [CrossRef] [PubMed]
  74. Barrientos-Espillco, F.; Gascó, E.; López-González, C.I.; Gómez-Silva, M.J.; Pajares, G. Semantic Segmentation Based on Deep Learning for the Detection of Cyanobacterial Harmful Algal Blooms (CyanoHABs) Using Synthetic Images. Appl. Soft Comput. 2023, 141, 110315. [Google Scholar] [CrossRef]
  75. Achicanoy, H.; Chaves, D.; Trujillo, M. StyleGANs and Transfer Learning for Generating Synthetic Images in Industrial Applications. Symmetry 2021, 13, 1497. [Google Scholar] [CrossRef]
  76. Park, G.; Lee, Y. Wildfire Smoke Detection Enhanced by Image Augmentation with StyleGAN2-ADA for YOLOv8 and RT-DETR Models. Fire 2024, 7, 369. [Google Scholar] [CrossRef]
  77. Man, K.; Chahl, J. A Review of Synthetic Image Data and Its Use in Computer Vision. J. Imaging 2022, 8, 310. [Google Scholar] [CrossRef]
  78. Half-Life 2 on Steam. Available online: https://store.steampowered.com/app/220/HalfLife_2/ (accessed on 9 October 2025).
  79. Taylor, G.R.; Chosak, A.J.; Brewer, P.C. OVVV: Using Virtual Worlds to Design and Evaluate Surveillance Systems. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8. [Google Scholar]
  80. Richter, S.R.; Vineet, V.; Roth, S.; Koltun, V. Playing for Data: Ground Truth from Computer Games. arXiv 2016, arXiv:1608.02192. [Google Scholar] [CrossRef]
  81. Lee, H.; Jeon, J.; Lee, D.; Park, C.; Kim, J.; Lee, D. Game Engine-Driven Synthetic Data Generation for Computer Vision-Based Safety Monitoring of Construction Workers. Autom. Constr. 2023, 155, 105060. [Google Scholar] [CrossRef]
  82. Rasmussen, I.; Kvalsvik, S.; Andersen, P.-A.; Aune, T.N.; Hagen, D. Development of a Novel Object Detection System Based on Synthetic Data Generated from Unreal Game Engine. Appl. Sci. 2022, 12, 8534. [Google Scholar] [CrossRef]
  83. Turkcan, M.K.; Li, Y.; Zang, C.; Ghaderi, J.; Zussman, G.; Kostic, Z. Boundless: Generating Photorealistic Synthetic Data for Object Detection in Urban Streetscapes. arXiv 2024, arXiv:2409.03022. [Google Scholar] [CrossRef]
  84. Hwang, H.; Adhikari, K.; Shodhaka, S.; Kim, D. Synthetic Data Augmentation for Robotic Mobility Aids to Support Blind and Low Vision People. In Robot Intelligence Technology and Applications 9; Park, D., Liu, C., Lee, D.-Y., Kim, M.J., Eds.; Springer Nature: Cham, Switzerland, 2025; pp. 92–102. [Google Scholar]
  85. Cauli, N.; Reforgiato Recupero, D. Synthetic Data Augmentation for Video Action Classification Using Unity. IEEE Access 2024, 12, 156172–156183. [Google Scholar] [CrossRef]
  86. Borkman, S.; Crespi, A.; Dhakad, S.; Ganguly, S.; Hogins, J.; Jhang, Y.-C.; Kamalzadeh, M.; Li, B.; Leal, S.; Parisi, P.; et al. Unity Perception: Generate Synthetic Data for Computer Vision. arXiv 2021, arXiv:2107.04259. [Google Scholar] [CrossRef]
  87. Naidoo, J.; Bates, N.; Gee, T.; Nejati, M. Pallet Detection from Synthetic Data Using Game Engines. arXiv 2023, arXiv:2304.03602. [Google Scholar] [CrossRef]
  88. Angus, M.; ElBalkini, M.; Khan, S.; Harakeh, A.; Andrienko, O.; Reading, C.; Waslander, S.; Czarnecki, K. Unlimited Road-Scene Synthetic Annotation (URSA) Dataset. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 985–992. [Google Scholar]
  89. Shooter, M.; Malleson, C.; Hilton, A. SyDog: A Synthetic Dog Dataset for Improved 2D Pose Estimation. arXiv 2021, arXiv:2108.00249. [Google Scholar] [CrossRef]
  90. Lee, J.G.; Hwang, J.; Chi, S.; Seo, J. Synthetic Image Dataset Development for Vision-Based Construction Equipment Detection. J. Comput. Civ. Eng. 2022, 36, 04022020. [Google Scholar] [CrossRef]
  91. Natarajan, S.A.; Madden, M.G. Hybrid Synthetic Data Generation Pipeline That Outperforms Real Data. J. Electron. Imaging 2023, 32, 023011. [Google Scholar] [CrossRef]
  92. NVIDIA Corporation. NVIDIA Deep Learning Dataset Synthesizer (NDDS). Available online: https://github.com/NVIDIA/Dataset_Synthesizer (accessed on 10 September 2025).
  93. Perception Package|Perception Package|1.0.0-Preview.1. Available online: https://docs.unity3d.com/Packages/com.unity.perception@1.0/manual/index.html (accessed on 9 October 2025).
  94. Games, R. Grand Theft Auto V. Available online: https://www.rockstargames.com/gta-v (accessed on 10 October 2025).
  95. Sengar, S.S.; Hasan, A.B.; Kumar, S.; Carroll, F. Generative Artificial Intelligence: A Systematic Review and Applications. arXiv 2024, arXiv:2405.11029. [Google Scholar] [CrossRef]
  96. Deijn, R.d.; Batra, A.; Koch, B.; Mansoor, N.; Makkena, H. Reviewing FID and SID Metrics on Generative Adversarial Networks. In Proceedings of the AI, Machine Learning and Applications, Copenhagen, Denmark, 27 January 2024; pp. 111–124. [Google Scholar]
  97. Wang, R.; Chen, X.; Wang, X.; Wang, H.; Qian, C.; Yao, L.; Zhang, K. A Novel Approach for Melanoma Detection Utilizing GAN Synthesis and Vision Transformer. Comput. Biol. Med. 2024, 176, 108572. [Google Scholar] [CrossRef] [PubMed]
  98. Lai, M.; Marzi, C.; Mascalchi, M.; Diciotti, S. Brain MRI Synthesis Using Stylegan2-ADA. In Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI), Athens, Greece, 27–30 May 2024; pp. 1–5. [Google Scholar]
  99. Gonçalves, B.; Vieira, P.; Vieira, A. Abdominal MRI Synthesis Using StyleGAN2-ADA. In Proceedings of the 2023 IST-Africa Conference (IST-Africa), Tshwane, South Africa, 2 June–31 May 2023; pp. 1–9. [Google Scholar]
  100. Chong, M.J.; Forsyth, D. Effectively Unbiased FID and Inception Score and Where to Find Them. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2020; pp. 6070–6079. [Google Scholar]
  101. Jayasumana, S.; Ramalingam, S.; Veit, A.; Glasner, D.; Chakrabarti, A.; Kumar, S. Rethinking FID: Towards a Better Evaluation Metric for Image Generation. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2024; pp. 9307–9315. [Google Scholar]
  102. Fedoruk, O.; Klimaszewski, K.; Ogonowski, A.; Możdżonek, R. Performance of GAN-Based Augmentation for Deep Learning COVID-19 Image Classification. AIP Conf. Proc. 2024, 3061, 030001. [Google Scholar] [CrossRef]
  103. Ferreira, I.; Ochoa, L.; Koeshidayatullah, A. On the Generation of Realistic Synthetic Petrographic Datasets Using a Style-Based GAN. Sci. Rep. 2022, 12, 12845. [Google Scholar] [CrossRef] [PubMed]
  104. Feng, X.; Du, J.; Wu, M.; Chai, B.; Miao, F.; Wang, Y. Potential of Synthetic Images in Landslide Segmentation in Data-Poor Scenario: A Framework Combining GAN and Transformer Models. Landslides 2024, 21, 2211–2226. [Google Scholar] [CrossRef]
  105. Ghosh, R.; Yamany, M.S.; Smadi, O. Generation of Synthetic Dataset to Improve Deep Learning Models for Pavement Distress Assessment. Innov. Infrastruct. Solut. 2025, 10, 41. [Google Scholar] [CrossRef]
  106. Karras, T.; Laine, S.; Aila, T. Flickr-Faces-HQ Dataset (FFHQ). Available online: https://github.com/NVlabs/ffhq-dataset (accessed on 10 September 2025).
  107. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv 2024, arXiv:2303.05499. [Google Scholar]
  108. Creating Landscapes in Unreal Engine|Unreal Engine 5.7 Documentation|Epic Developer Community. Available online: https://dev.epicgames.com/documentation/en-us/unreal-engine/creating-landscapes-in-unreal-engine (accessed on 7 January 2026).
  109. Tremblay, J.; Prakash, A.; Acuna, D.; Brophy, M.; Jampani, V.; Anil, C.; To, T.; Cameracci, E.; Boochoon, S.; Birchfield, S. Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization. arXiv 2018, arXiv:1804.06516. [Google Scholar] [CrossRef]
  110. Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; Abbeel, P. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. arXiv 2017, arXiv:1703.06907. [Google Scholar] [CrossRef]
  111. Viscarra Rossel, R.A.; Bui, E.N.; de Caritat, P.; McKenzie, N.J. Mapping Iron Oxides and the Color of Australian Soil Using Visible–near-Infrared Reflectance Spectra. J. Geophys. Res. Earth Surf. 2010, 115. [Google Scholar] [CrossRef]
  112. van Vreeswyk, A.M.E.; Leighton, K.A.; Payne, A.L.; Hennig, P. An Inventory and Condition Survey of the Pilbara Region, Western Australia; Technical Bulletin 92; Department of Agriculture, Western Australia: Perth, Australia, 2004. [Google Scholar]
  113. Quixel. Available online: https://quixel.com/megascans (accessed on 8 January 2026).
  114. Xiao, Y.; Deng, H.; Li, J.; Zhou, M.; Assefa, E.; Chen, X. A Quantitative Method for the Determination of Rock Fragmentation Based on Crack Density and Crack Saturation. Sci. Rep. 2023, 13, 11747. [Google Scholar] [CrossRef]
  115. Research on Deformation Characteristics and Mechanisms of an Open Pit Coal Mine Landslide Event in Extremely Cold Region. Sci. Rep. Available online: https://www.nature.com/articles/s41598-025-27509-5 (accessed on 29 March 2026).
  116. Wang, X.; Wang, Y.; Wang, Y.; Chan, T.O. A Fast and Reliable Crack Measurement Approach Based on Perspective Projection Simulation Models and UAV Imaging for Dam and Levee Inspections. Surv. Rev. 2025, 58, 134–145. [Google Scholar] [CrossRef]
  117. Epic Games. Cine Camera Actor. Unreal Engine 4.27 Documentation. Available online: https://dev.epicgames.com/documentation/en-us/unreal-engine/cinematic-cameras-in-unreal-engine (accessed on 7 January 2026).
  118. Epic Games. Taking Screenshots in Unreal Engine. Unreal Engine 5.7 Documentation. Available online: https://dev.epicgames.com/documentation/en-us/unreal-engine/taking-screenshots-in-unreal-engine (accessed on 9 January 2026).
  119. Epic Games. FPerspectiveMatrix. Available online: https://dev.epicgames.com/documentation/en-us/unreal-engine/API/Runtime/Core/Math/FPerspectiveMatrix (accessed on 13 January 2026).
  120. Ultralytics. Object Detection Datasets Overview. Available online: https://docs.ultralytics.com/datasets/detect/ (accessed on 13 January 2026).
  121. Synthetic Scientific Image Generation with VAE, GAN, and Diffusion Model Architectures. Available online: https://www.mdpi.com/2313-433X/11/8/252 (accessed on 28 November 2025).
  122. Describable Textures Dataset. Available online: https://www.robots.ox.ac.uk/~vgg/data/dtd/ (accessed on 29 October 2025).
  123. Pinkney, J. Awesome Pretrained StyleGAN2. Available online: https://github.com/justinpinkney/awesome-pretrained-stylegan2 (accessed on 10 September 2025).
  124. Karras, T.; Aittala, M.; Laine, S.; Härkönen, E.; Hellsten, J.; Lehtinen, J.; Aila, T. StyleGAN3: Official PyTorch Implementation of Alias-Free Generative Adversarial Networks. Available online: https://github.com/NVlabs/stylegan3 (accessed on 10 September 2025).
  125. Karras, T.; Aittala, M.; Hellsten, J.; Laine, S.; Lehtinen, J.; Aila, T. StyleGAN2-ADA: Official PyTorch Implementation. Available online: https://github.com/NVlabs/stylegan2-ada-pytorch (accessed on 10 September 2025).
  126. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv 2018, arXiv:1706.08500. [Google Scholar] [CrossRef]
  127. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. arXiv 2015, arXiv:1512.00567. [Google Scholar] [CrossRef]
  128. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. arXiv 2018, arXiv:1801.03924. [Google Scholar] [CrossRef]
  129. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar] [CrossRef]
  130. Chicco, D.; Sichenze, A.; Jurman, G. A Simple Guide to the Use of Student’s t-Test, Mann-Whitney U Test, Chi-Squared Test, and Kruskal-Wallis Test in Biostatistics. BioData Min. 2025, 18, 56. [Google Scholar] [CrossRef] [PubMed]
  131. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
  132. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  133. Bińkowski, M.; Sutherland, D.J.; Arbel, M.; Gretton, A. Demystifying MMD GANs. arXiv 2021, arXiv:1801.01401. [Google Scholar] [CrossRef]
  134. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. arXiv 2021, arXiv:2103.14030. [Google Scholar] [CrossRef]
  135. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar] [CrossRef]
  136. Ultralytics. Ultralytics YOLO11. Available online: https://docs.ultralytics.com/models/yolo11/ (accessed on 20 January 2026).
  137. Nash, J. Non-Cooperative Games. Ann. Math. 1951, 54, 286–295. [Google Scholar] [CrossRef]
  138. Liu, H.; Li, X.; Wang, L.; Zhang, Y.; Wang, Z.; Lu, Q. MS-YOLOv11: A Wavelet-Enhanced Multi-Scale Network for Small Object Detection in Remote Sensing Images. Sensors 2025, 25, 6008. [Google Scholar] [CrossRef]
  139. Mu, D.; Guo, Y.; Wang, W.; Peng, R.; Guo, C.; Marinello, F.; Xie, Y.; Huang, Q. URT-YOLOv11: A Large Receptive Field Algorithm for Detecting Tomato Ripening Under Different Field Conditions. Agriculture 2025, 15, 1060. [Google Scholar] [CrossRef]
Figure 1. Graphical comparison between (a) GTA V and (b) a nature scene in Unity. GTA V demonstrates greater photorealism than Unity due to advanced rendering effects such as ray-traced reflections, global illumination, and detailed material texturing, illustrating the disparity in visual fidelity that contributes to the reality gap.
Figure 2. Overview of the proposed hybrid synthetic dataset generation framework. Stage 1 constructs a parameterized virtual open-pit environment in UE5 and automatically captures 20,000 crack images with ground-truth bounding boxes. These labelled UE5 images are used to train StyleGAN2-ADA in Stage 2, where three initialization strategies are evaluated to enhance the fidelity and diversity of the synthesized imagery, generating 60,000 images that are subsequently pseudo-labelled using Grounding DINO. In Stage 3, YOLOv11 is trained on these synthetic datasets and tested on real-world imagery to assess the effectiveness of the proposed framework in improving surface crack detection performance.
Figure 3. Workflow for generating surface crack decals from field-acquired crack imagery. Crack images are manually processed to obtain refined binary crack masks, which are subsequently standardized to 2048 × 2048 pixels and used to derive opacity, height, normal, and roughness maps. These texture maps are then imported into UE5 and assembled into deferred decal materials, with per-crack dimensional parameters stored in a Data Table for deployment in automated synthetic dataset generation.
Figure 4. Overview of the automated bounding box computation pipeline. A virtual 3D scene containing the crack decal is captured from the active Cine Camera Actor viewpoint. The decal world coordinates are extracted and geometrically projected into 2D image space. A min/max pixel search is performed on the projected coordinates to determine the crack extent, after which the calculated bounding box is normalized and exported as a label file.
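For clarity, the projection and min/max search in Figure 4 can be expressed as a short NumPy routine. This is a minimal sketch rather than the project's UE5 implementation: it assumes the decal corner points and the camera's combined view–projection matrix (cf. FPerspectiveMatrix [119]) are already available, and it emits a normalized YOLO-format box [120].

```python
import numpy as np

def project_to_yolo_bbox(world_pts, view_proj, img_w=1920, img_h=1080):
    """Project 3D decal corner points (Nx3, world space) through a
    combined view-projection matrix and return a normalized YOLO-style
    bounding box (x_center, y_center, width, height)."""
    # Homogeneous coordinates, then clip space.
    pts_h = np.hstack([world_pts, np.ones((len(world_pts), 1))])
    clip = pts_h @ view_proj.T                      # (N, 4)
    ndc = clip[:, :3] / clip[:, 3:4]                # perspective divide
    # NDC [-1, 1] -> pixel coordinates (y flipped for image space).
    px = (ndc[:, 0] * 0.5 + 0.5) * img_w
    py = (1.0 - (ndc[:, 1] * 0.5 + 0.5)) * img_h
    # Min/max pixel search, clamped to the image bounds.
    x0, x1 = np.clip([px.min(), px.max()], 0, img_w)
    y0, y1 = np.clip([py.min(), py.max()], 0, img_h)
    # Normalize to YOLO label format.
    return ((x0 + x1) / 2 / img_w, (y0 + y1) / 2 / img_h,
            (x1 - x0) / img_w, (y1 - y0) / img_h)
```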
Figure 5. Architecture of StyleGAN2-ADA, comprising (a) the generator and (b) the discriminator. The mapping network transforms a latent vector z ∈ Z into an intermediate latent representation w ∈ W, which modulates the synthesis network through per-layer affine transforms A. The synthesis network starts from a constant input c1 and progressively refines features using modulated style blocks and stochastic noise B to introduce unstructured detail. The discriminator uses ADA to apply random geometric and color-space perturbations to real and generated images, mitigating overfitting under limited-data conditions. Progressive downsampling with residual connections produces a scalar output D(x) representing the predicted probability that an image is real rather than generated.
Figure 6. Architectural overview of YOLOv11, comprising a backbone, neck, and detection head for real-time crack detection. The backbone uses stacked CBS and C3K2 blocks to progressively downsample the input while increasing channel depth to extract features relevant to crack morphology. The SPPF block enlarges the receptive field by combining multi-scale context, and C2PSA modules enhance feature representation through spatial and channel attention. The neck performs multi-scale feature fusion via upsampling, downsampling, and concatenation operations that integrate information from different backbone stages. The decoupled detection head then applies concurrent classification and regression branches to generate bounding box coordinates and class scores for detected cracks.
Figure 7. Examples of real-world open-pit mine surface crack images used for YOLOv11 performance evaluation.
Figure 8. StyleGAN2-ADA training behavior for three initialization strategies. Subplots show (a) generator loss, (b) discriminator loss, (c) augmentation probability, and (d) FID score progression over a 2000 kimg training window for SG2 (Baseline), SG2 + FFHQ, and SG2 + DTD configurations.
Figure 9. Early synthesis progression of the three StyleGAN2-ADA training configurations. Samples are shown at initialization, 30 kimg, 60 kimg, and 100 kimg. At initialization, the baseline configuration produces unstructured noise, while the pre-trained configurations generate coherent textures that reflect their source-domain priors. By 30 kimg, SG2 begins to acquire coarse color and texture distributions, whereas the pre-trained configurations already synthesize recognizable crack-like structures. All configurations improve in fidelity and structural realism by 60 kimg, with the pre-trained configurations presenting more developed crack morphology. By 100 kimg, all configurations produce reasonable crack patterns, although the pre-trained configurations contain sharper edges, more consistent textures, and realistic background materials, highlighting the benefits of transfer learning.
Figure 10. Representative samples generated by each StyleGAN2-ADA configuration at convergence, with UE5O baseline images included for comparison. Each column shows two randomly selected samples from a single dataset configuration. UE5O images exhibit high photorealism but limited morphological variation. SG2 (Baseline) introduces structural diversity but with slightly reduced background coherence. SG2 + FFHQ produces the sharpest crack boundaries and most coherent surface textures, consistent with its superior FID score. SG2 + DTD shows comparable diversity to SG2 + FFHQ but with marginally less consistent textural detail.
Figure 11. t-SNE visualizations of UE5O and StyleGAN2-ADA feature embeddings. (a) All datasets plotted together, showing UE5O forming a series of compact clusters while StyleGAN2-ADA samples occupy a broader manifold. (b) UE5O vs. SG2 illustrates the expansion in diversity introduced by generative modelling. (c) UE5O vs. SG2 + FFHQ highlights the large dispersion and heavy distributional overlap achieved through effective pre-training. (d) UE5O vs. SG2 + DTD shows a similar but slightly less pronounced expansion in feature space.
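The Figure 11 visualizations follow a standard recipe: extract a deep feature vector per image, then project the pooled set to 2D with t-SNE. The sketch below uses scikit-learn with stand-in feature arrays and common default settings rather than the paper's exact configuration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-in (N, D) feature arrays; in the study these are deep embeddings
# extracted from each dataset's images.
rng = np.random.default_rng(0)
feats_ue5o = rng.normal(0.0, 1.0, (500, 512))
feats_sg2 = rng.normal(0.5, 1.2, (500, 512))

X = np.vstack([feats_ue5o, feats_sg2])
labels = np.array([0] * len(feats_ue5o) + [1] * len(feats_sg2))

# Perplexity and PCA initialization are typical defaults, not necessarily
# the values used in the paper.
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(X)

for lbl, name in [(0, "UE5O"), (1, "StyleGAN2-ADA")]:
    pts = emb[labels == lbl]
    plt.scatter(pts[:, 0], pts[:, 1], s=4, label=name)
plt.legend()
plt.show()
```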
Figure 12. Training behavior of the four YOLOv11 dataset configurations evaluated in this study, showing (a) validation box loss, (b) AP@0.5, (c) AP@[0.5:0.95], and (d) precision for UE5O, UE5O + SG2, UE5O + SG2-FFHQ, and UE5O + SG2-DTD.
Figure 13. UE5O surface crack detection performance on real-world open-pit test images, showing low-confidence and missed detections across subplots (a–d). Fragmented and inconsistent spatial coverage is observed in subplots (a,c), a missed detection is seen in subplot (b), and subplot (d) highlights a false positive triggered by background debris and shadowing effects.
Figure 14. Precision-recall characteristics for the four YOLOv11 dataset configurations evaluated on the real-world open-pit surface crack test set. The curves demonstrate the trade-off between precision and recall across confidence thresholds, for (a) UE5O, (b) UE5O + SG2, (c) UE5O + SG2-FFHQ, and (d) UE5O + SG2-DTD, highlighting the improved robustness and extended high-precision regions of the SG2-enhanced variants relative to the UE5O baseline.
Figure 15. Qualitative comparison of SG2-enhanced surface crack detection performance on real-world open-pit test images. Each row shows a test image with (a) ground truth annotation, followed by predictions from (b) UE5O + SG2, (c) UE5O + SG2-FFHQ, and (d) UE5O + SG2-DTD. Across the representative examples, UE5O + SG2 demonstrates high prediction confidence but produces less tight bounding box localization, while UE5O + SG2-FFHQ frequently provides improved spatial alignment at the cost of slightly reduced confidence. UE5O + SG2-DTD demonstrates reduced robustness under challenging conditions, including missed detections on low contrast or partially occluded cracks, as well as false positives on background debris and rubble.
Figure 16. Representative failure case highlighting the difficulty in detecting fine-grained surface cracks in real-world open-pit test images for (a) UE5O, (b) UE5O + SG2, (c) UE5O + SG2-FFHQ, and (d) UE5O + SG2-DTD, where the red bounding box denotes the ground truth annotation for the missed detection.
Table 1. Summary of related works on synthetic dataset generation using game engines.

| Application | Downstream Task | Platform | Dataset Size | Performance |
|---|---|---|---|---|
| Construction monitoring [81] | Object detection | Unity | 7000 | mAP@[0.5:0.95]: 0.46 |
| Generic object detection [82] | Object detection | UE4 with NDDS | 1500 | Not reported |
| Autonomous driving [83] | Object detection | UE5 | 16,700 | mAP@0.5: 0.67 |
| Navigation assistance [84] | Object detection | UE4 with NDDS | 3000 | Precision: 0.92; Recall: 0.91 |
| Exercise monitoring [85] | Pose estimation | Unity | 5000 | I3D test accuracy: 0.99 |
| Grocery item detection [86] | Object detection | Unity | 400,000 | mAP@[0.5:0.95]: 0.68 |
| Warehouse object detection [87] | Semantic segmentation | Unity | 7140 | mAP@0.5: 0.65 |
| Autonomous driving [88] | Semantic segmentation | GTA V | 1,355,568 | CIoU: 0.45 |
| Animal monitoring [89] | Pose estimation | Unity | 32,000 | PCK: 0.13 |
| Construction monitoring [90] | Object detection | Unity | 6000 | Precision: 0.92 |
| Generic object detection [91] | Classification | UE4 | 31,200 | Top-1 accuracy: 0.72 |

Abbreviations: mAP—Mean AP; NDDS—NVIDIA DL Dataset Synthesizer; I3D—Inflated 3D Networks; CIoU—Complete Intersection over Union; PCK—Percentage of Correct Keypoints.
Table 3. Summary of surface crack decal classes by morphological type.

| Crack Type | Count | Description |
|---|---|---|
| Single | 10 | Linear or slightly curved cracks with a single continuous trace |
| Bifurcated | 5 | Cracks exhibiting branching into two subsidiary traces |
| Crossed | 7 | Intersecting crack traces forming networked geometries |
Table 4. Cine Camera Actor settings used for synthetic dataset generation in UE5.

| Setting | Value |
|---|---|
| Sensor Format | 36 mm × 20.25 mm |
| Aspect Ratio | 16:9 |
| Resolution | 1920 × 1080 pixels |
| Aperture | ƒ/5.6 |
| ISO | 100 |
| Shutter Speed | 1/500 s |
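The Table 4 settings correspond to editable properties on the Cine Camera Actor [117], which can be scripted from the UE5 editor's Python environment. The sketch below is illustrative only: property names follow the public API, and the exact automation tooling used in this study may differ.

```python
import unreal  # only available inside the UE5 Editor Python environment

# Spawn a Cine Camera Actor and apply the filmback/lens settings in Table 4.
cam = unreal.EditorLevelLibrary.spawn_actor_from_class(
    unreal.CineCameraActor, unreal.Vector(0.0, 0.0, 500.0))
comp = cam.get_cine_camera_component()

filmback = comp.get_editor_property("filmback")
filmback.set_editor_property("sensor_width", 36.0)    # mm
filmback.set_editor_property("sensor_height", 20.25)  # mm (16:9 aspect)
comp.set_editor_property("filmback", filmback)
comp.set_editor_property("current_aperture", 5.6)     # f/5.6
# ISO 100 and the 1/500 s shutter are exposure settings applied through
# the camera's manual-exposure post-process controls.
```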
Table 5. Default hyperparameter configuration used for training StyleGAN2-ADA.

| Hyperparameter | Value |
|---|---|
| Learning Rate | 0.002 |
| Optimizer | Adam (β1 = 0, β2 = 0.99, ε = 1 × 10⁻⁸) |
| R1 Regularization Weight | 10.0 |
| Effective R1 Weight | 160 |
| Path Length Regularization Interval | 4 iterations |
| R1 Regularization Interval | 16 iterations |
| ADA Target | 0.60 (60%) |
| Loss Function | Non-saturating logistic loss |
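The "Effective R1 Weight" entry reflects lazy regularization: the R1 penalty is evaluated only every 16th discriminator iteration and is therefore scaled by the interval, giving 10.0 × 16 = 160. A minimal PyTorch sketch of the Table 5 optimizer configuration, with placeholder modules standing in for the actual networks:

```python
import torch
from torch import nn

# Placeholder modules standing in for the StyleGAN2-ADA generator and
# discriminator networks.
G, D = nn.Linear(512, 512), nn.Linear(512, 1)

# Adam configuration from Table 5.
opt_g = torch.optim.Adam(G.parameters(), lr=0.002, betas=(0.0, 0.99), eps=1e-8)
opt_d = torch.optim.Adam(D.parameters(), lr=0.002, betas=(0.0, 0.99), eps=1e-8)

# Lazy regularization: the R1 penalty is applied every 16th discriminator
# step, so its weight is scaled by the interval to preserve the effective
# regularization strength.
r1_gamma, r1_interval = 10.0, 16
effective_r1_weight = r1_gamma * r1_interval  # 160, as reported in Table 5
```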
Table 6. Dataset configurations used for training YOLOv11 for open-pit surface crack detection.

| Dataset | Total Images | Training Images | Validation Images |
|---|---|---|---|
| UE5O | 20,000 | 17,000 | 3000 |
| UE5O + SG2 | 40,000 | 34,000 | 6000 |
| UE5O + SG2-FFHQ | 40,000 | 34,000 | 6000 |
| UE5O + SG2-DTD | 40,000 | 34,000 | 6000 |
Table 7. Hyperparameter configuration used for training YOLOv11.

| Hyperparameter | Value |
|---|---|
| Model | YOLOv11m |
| Initialization Weights | COCO |
| Input Resolution | 512 × 512 pixels |
| Batch Size | 64 |
| Epochs | 300 |
| Optimizer | Adam (β1 = 0.90, β2 = 0.99) |
| Initial Learning Rate | 0.001 |
| Learning Rate Schedule | Cosine decay |
| Warmup Epochs | 3 |
| Weight Decay | 0.0005 |
| Data Augmentation | On (scaling, translation, flip, mosaic) |
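Table 7 maps closely onto the Ultralytics training API [136]. A minimal sketch follows; the dataset YAML name is a placeholder, and the Adam betas are left at library defaults since the API does not expose them directly.

```python
from ultralytics import YOLO

model = YOLO("yolo11m.pt")           # COCO-pretrained initialization
results = model.train(
    data="ue5o_sg2_cracks.yaml",     # hypothetical dataset config file
    imgsz=512,
    batch=64,
    epochs=300,
    optimizer="Adam",
    lr0=0.001,                       # initial learning rate
    cos_lr=True,                     # cosine learning-rate decay
    warmup_epochs=3,
    weight_decay=0.0005,
    # Scaling, translation, flip, and mosaic augmentation remain enabled
    # at their default magnitudes; Table 7 lists only the categories.
)
```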
Table 8. Fidelity (FID), perceptual diversity (LPIPS), and relative LPIPS improvement over the UE5O baseline (LPIPS vs. UE5O) for the three StyleGAN2-ADA initialization strategies. The best scores are marked in bold.

| Configuration | FID Score | Mean LPIPS | Median LPIPS | LPIPS Range | LPIPS vs. UE5O |
|---|---|---|---|---|---|
| SG2 (Baseline) | 15.99 | 0.452 ± 0.094 | 0.456 | [0.011, 0.725] | +2.80% |
| SG2 + FFHQ | **12.75** | **0.472 ± 0.090** | **0.477** | [0.140, 0.709] | **+7.49%** |
| SG2 + DTD | 15.88 | 0.457 ± 0.092 | 0.462 | [0.115, 0.735] | +3.96% |
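Mean LPIPS in Table 8 is a pairwise diversity measure: a larger average perceptual distance between random sample pairs indicates a more varied dataset [128]. A minimal sketch using the reference lpips package; the VGG backbone (suggested by the cited references [128,129]) and the number of sampled pairs are assumptions.

```python
import random
import torch
import lpips  # pip install lpips

# VGG-backbone LPIPS; the backbone choice is an assumption.
loss_fn = lpips.LPIPS(net="vgg")

def mean_pairwise_lpips(images, n_pairs=1000, seed=0):
    """Mean LPIPS over randomly sampled image pairs.
    images: list of tensors scaled to [-1, 1], shape (3, H, W)."""
    rng = random.Random(seed)
    dists = []
    with torch.no_grad():
        for _ in range(n_pairs):
            i, j = rng.sample(range(len(images)), 2)
            d = loss_fn(images[i].unsqueeze(0), images[j].unsqueeze(0))
            dists.append(d.item())
    return sum(dists) / len(dists)
```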
Table 9. Statistical comparison of LPIPS distributions relative to the UE5O baseline using the Mann–Whitney U test. The best scores are marked in bold.

| Comparison | Holm-Adjusted p-Value | Rank-Biserial Effect Size r |
|---|---|---|
| SG2 (Baseline) vs. UE5O | 1.74 × 10⁻³⁴ | 0.0995 |
| SG2 + FFHQ vs. UE5O | **6.89 × 10⁻¹⁷¹** | **0.2276** |
| SG2 + DTD vs. UE5O | 8.70 × 10⁻⁷⁰ | 0.1442 |
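The Table 9 analysis can be reproduced with SciPy and statsmodels. The sketch below uses stand-in score arrays (real LPIPS distributions are bounded and need not be normal), and computes the rank-biserial effect size as r = 1 − 2U/(n₁n₂).

```python
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

# Stand-in LPIPS score arrays; in the study these are the pairwise
# perceptual distances computed for each dataset configuration.
rng = np.random.default_rng(0)
lpips_ue5o = rng.normal(0.440, 0.090, 5000).clip(0, 1)
lpips_cfgs = {
    "SG2 (Baseline)": rng.normal(0.452, 0.094, 5000).clip(0, 1),
    "SG2 + FFHQ": rng.normal(0.472, 0.090, 5000).clip(0, 1),
    "SG2 + DTD": rng.normal(0.457, 0.092, 5000).clip(0, 1),
}

p_raw, effects = [], []
for name, scores in lpips_cfgs.items():
    u, p = mannwhitneyu(scores, lpips_ue5o, alternative="two-sided")
    p_raw.append(p)
    # Rank-biserial effect size: r = 1 - 2U / (n1 * n2), folded to |r|.
    effects.append(abs(1.0 - 2.0 * u / (len(scores) * len(lpips_ue5o))))

# Holm correction across the three comparisons, as in Table 9.
_, p_holm, _, _ = multipletests(p_raw, method="holm")
```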
Table 10. Synthetic-to-real domain gap of images generated by the proposed framework relative to the held-out real-world test set quantified using MMD² in CLIP feature space. The best scores are marked in bold.

| Dataset | Mean MMD² | 95% CI | Change vs. UE5O |
|---|---|---|---|
| UE5O | 0.000472 | 0.000446–0.000503 | - |
| SG2 (Baseline) | 0.000487 | 0.000456–0.000516 | −3.20% |
| SG2 + FFHQ | **0.000404** | 0.000379–0.000425 | **+14.40%** |
| SG2 + DTD | 0.000480 | 0.000456–0.000501 | −1.70% |
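Table 10 quantifies the synthetic-to-real gap with MMD² over CLIP embeddings [131,133]. The estimator below is a minimal sketch: the unbiased MMD² with an RBF kernel and median-heuristic bandwidth, both of which are assumptions since the paper specifies only the feature space ([133] also discusses polynomial kernels).

```python
import numpy as np

def mmd2_unbiased(X, Y, sigma=None):
    """Unbiased MMD^2 between feature sets X (n, d) and Y (m, d)
    with an RBF kernel; sigma defaults to the median heuristic."""
    def sq_dists(A, B):
        return (np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :]
                - 2.0 * A @ B.T)
    dxy = sq_dists(X, Y)
    if sigma is None:
        sigma = np.sqrt(np.median(dxy) / 2.0)  # median heuristic
    k = lambda d2: np.exp(-d2 / (2.0 * sigma**2))
    kxx, kyy, kxy = k(sq_dists(X, X)), k(sq_dists(Y, Y)), k(dxy)
    n, m = len(X), len(Y)
    np.fill_diagonal(kxx, 0.0)  # drop i == j terms for unbiasedness
    np.fill_diagonal(kyy, 0.0)
    return (kxx.sum() / (n * (n - 1)) + kyy.sum() / (m * (m - 1))
            - 2.0 * kxy.mean())
```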
Table 11. Validation of projection-based UE5 bounding box annotations against manual reference labels.

| Metric | Value |
|---|---|
| Mean IoU | 0.931 |
| Median IoU | 0.958 |
| F1@0.5 | 0.990 |
| F1@0.75 | 0.960 |
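The Table 11 metrics follow from standard box IoU between each projected UE5 annotation and its matched manual reference label: a projected box counts as a true positive for F1@0.5 (respectively F1@0.75) when its IoU is at least 0.5 (respectively 0.75). A minimal sketch:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```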
Table 12. Grounding DINO pseudo-labelling threshold sensitivity analysis against manual reference annotations. The best scores are marked in bold.

| Threshold | Mean IoU | Median IoU | F1@0.5 | F1@0.75 |
|---|---|---|---|---|
| 0.30 | 0.897 | 0.945 | 0.894 | 0.847 |
| 0.35 | 0.909 | 0.945 | **0.946** | 0.889 |
| 0.40 | 0.900 | 0.945 | 0.945 | **0.896** |
| 0.45 | **0.918** | **0.947** | 0.894 | 0.856 |
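The thresholds in Table 12 correspond to the box-confidence threshold of Grounding DINO's zero-shot detection interface [107]. A sketch following the reference repository's inference API; the checkpoint paths, image name, and text prompt are illustrative.

```python
from groundingdino.util.inference import load_model, load_image, predict

model = load_model("GroundingDINO_SwinT_OGC.py",    # model config (example)
                   "groundingdino_swint_ogc.pth")   # pre-trained weights

image_source, image = load_image("sg2_sample_000001.png")
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="surface crack",    # illustrative text prompt
    box_threshold=0.35,         # best F1@0.5 setting in Table 12
    text_threshold=0.25,
)
# `boxes` are normalized (cx, cy, w, h), directly usable as YOLO labels.
```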
Table 13. YOLOv11 performance evaluation results for each dataset configuration evaluated on the real-world open-pit surface crack test set. The best scores are marked in bold.

| Dataset | Precision | Recall | F1@0.5 | AP@0.5 (95% CI) | AP@[0.5:0.95] (95% CI) |
|---|---|---|---|---|---|
| UE5O | 0.692 | 0.729 | 0.710 | 0.792 (0.748–0.849) | 0.536 (0.475–0.595) |
| UE5O + SG2 | 0.792 | **0.902** | **0.844** | **0.922** (0.876–0.948) | 0.706 (0.649–0.744) |
| UE5O + SG2-FFHQ | **0.808** | 0.850 | 0.829 | 0.911 (0.880–0.939) | **0.722** (0.680–0.763) |
| UE5O + SG2-DTD | 0.730 | 0.828 | 0.776 | 0.858 (0.805–0.895) | 0.638 (0.593–0.689) |

Note: 95% CIs for AP@0.5 and AP@[0.5:0.95] were calculated using paired bootstrap resampling of the 200 held-out real-world test set images with 1000 resamples.
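The resampling procedure described in the note can be sketched as follows. The routine resamples whole test images (not individual boxes) with replacement and recomputes AP on each resample; `ap_fn` is a hypothetical callable mapping per-image predictions and ground truths to an AP value.

```python
import numpy as np

def bootstrap_ap_ci(per_image_preds, per_image_gts, ap_fn,
                    n_boot=1000, alpha=0.05, seed=0):
    """Paired bootstrap CI for AP over a fixed test set of images."""
    rng = np.random.default_rng(seed)
    n = len(per_image_preds)
    aps = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)         # resample images
        preds = [per_image_preds[i] for i in idx]
        gts = [per_image_gts[i] for i in idx]
        aps.append(ap_fn(preds, gts))
    lo, hi = np.quantile(aps, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```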
Table 14. FP and FN count for each YOLOv11 dataset configuration evaluated on the real-world open-pit surface crack test set. The best scores are marked in bold.

| Configuration | FP Count | FN Count |
|---|---|---|
| UE5O | 53 | 44 |
| UE5O + SG2 | 27 | **20** |
| UE5O + SG2-FFHQ | **4** | 28 |
| UE5O + SG2-DTD | 48 | 31 |
Table 15. Consolidated summary of generation quality and downstream object detection performance across all dataset configurations. The best scores are marked in bold.

| Dataset | FID | Mean LPIPS | MMD² | AP@0.5 | AP@[0.5:0.95] |
|---|---|---|---|---|---|
| UE5O | - | 0.440 | 0.000472 | 0.792 | 0.536 |
| UE5O + SG2 | 15.99 | 0.452 | 0.000487 | **0.922** | 0.706 |
| UE5O + SG2-FFHQ | **12.75** | **0.472** | **0.000404** | 0.911 | **0.722** |
| UE5O + SG2-DTD | 15.88 | 0.457 | 0.000480 | 0.858 | 0.638 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
