1. Introduction
In safety-critical industries such as automotive and aerospace, the structural integrity of metal assemblies (e.g., automotive door lock strikes) forms the cornerstone of product reliability and user safety [
1]. Current industrial production relies primarily on two-dimensional vision inspection systems based on optical imaging. However, unlike the objects with simple geometries and uniform textures found in standard datasets such as MVTec [2], these industrial components typically possess complex three-dimensional features formed through multiple processes such as stamping and riveting, and their surfaces encompass heterogeneous areas such as highlights, rough textures, and transition zones [3]. This complexity makes the imaging results extremely sensitive to lighting angle, surface curvature, and positional variation. Under conventional lighting, optical artifacts such as specular reflections, local overexposure, and irregular shadows readily occur [4], severely obscuring the genuine features of micro defects such as cracks and scratches and resulting in a high miss rate. Concurrently, the exceptionally high production yield in industry makes real defect samples extremely scarce, so supervised learning methods that rely on large amounts of annotated data are difficult to apply directly [
5,
6].
While 3D sensing or CAD model-based approaches might seem like natural alternatives, they face significant barriers in real-world, high-speed production settings. Deploying accurate 3D inspection in-line is often hindered by high costs, sensitivity to industrial environmental interference, and computational complexity for real-time processing [
7,
8,
9]. On the other hand, generating synthetic data from precise CAD models to overcome sample scarcity is challenged by the non-trivial domain gap between rendered images and real-world captures, as well as inherent geometric and appearance variations in manufactured parts that perfect CAD models do not capture [
10,
11].
Therefore, the core challenge for reliable defect detection of complex metal parts lies in how to transform the 3D defect recognition problem with strong spatial dependencies into a robust and deployable 2D vision solution, without relying on expensive 3D sensing or precise CAD models. This transformation faces the dual systemic challenges of “imaging interference” and “sample scarcity”.
Existing research often addresses only one of these problems in isolation. While a stream of studies primarily mitigates imaging interference through specialized hardware or sensor design, it frequently overlooks the scarcity of real defect data needed for training robust models [
12]. Another significant body of work concentrates on overcoming sample scarcity by developing advanced data generation or synthesis algorithms [
13,
14], yet these methods often rely on idealized imaging assumptions, creating a gap between synthetically generated training data and real captured images affected by complex optical phenomena. This fragmentation of the “imaging-generation-detection” pipeline—where hardware-centric methods lack data adaptability, and data-centric methods lack physical grounding—makes it difficult for most solutions to form a complete and reliable system in real industrial scenarios. Although emerging approaches attempt to integrate multiple information sources to improve detection effectiveness [
15,
16], such methods are generally confined to fusing data under predetermined imaging conditions and fail to fundamentally co-optimize the stages of physical imaging, data generation, and detection. Consequently, designing a tightly synergistic framework that proactively integrates “imaging-generation-detection” to systematically tackle the dual challenges of imaging interference and sample scarcity remains an open and critical problem.
To address the imaging challenges posed by complex geometries and heterogeneous surfaces, researchers have attempted various hardware solutions. Fixed lighting schemes are prone to introducing unpredictable reflections and shadows on curved surfaces. Gerges and Chen employed a large-scale LED array and convex optimization to enhance 2D recognition [
17], while Wu et al. utilized dome lighting and photometric stereo to encode depth information [
18]. These methods improve imaging quality but increase system complexity and cost, and do not solve the fundamental problem of sample scarcity. On the other hand, solutions represented by 3D sensing (e.g., structured light, laser scanning) or CAD model comparison, while capable of directly acquiring geometric information, face limitations for large-scale, high-speed production lines due to their high costs, complex on-site deployment and maintenance, sensitivity to highly reflective materials, and the misalignment between geometric reconstruction and the characterization of optical defects (e.g., fine scratches) [
19,
20]. These solutions collectively reflect a limitation in current research: the optimization of hardware imaging is often independent of downstream data generation and recognition tasks, lacking systematic design oriented towards the final detection performance.
At the data level, to alleviate the problem of sample scarcity, generative methods have garnered significant attention. Traditional data augmentation techniques struggle to preserve the morphological integrity of defects [
21]. Copy–paste-based methods (e.g., CutPaste [
22]) can change defect locations, but their fusion with the background is often unnatural; similarly, methods like NSA [
23] improve boundary naturalness through Poisson image editing, but their core function remains redistributing existing defect patterns, making it difficult to create defects with entirely new morphologies or textures. Generative Adversarial Networks (GANs) demonstrate the potential for high-fidelity image synthesis [
24], and works like Defect-GAN [
25] and SDGAN [
26] have demonstrated the feasibility of GAN-based defect image generation. However, the conventional GAN paradigm, which aims to directly generate complete defect images end-to-end from random noise, faces inherent limitations in complex industrial scenarios. This approach requires the generator to jointly model the intricate optical properties of the non-defective background and the subtle morphological features of the defect itself, which often leads to conflicting objectives and training instability, as the model struggles to decouple these two distinct learning tasks [27, 28]. Substantial computation is wasted on reconstructing the already well-represented normal background, and complex backgrounds can severely interfere with defect generation, often leading to mode collapse. Algorithmic improvements for few-shot scenarios (e.g., FastGAN [
29]) enhance training efficiency [
30], but their static fusion mechanisms remain insufficient when dealing with complex geometries and heterogeneous textures. Consequently, dynamic multi-granularity feature fusion has become key to enhancing model representational capacity. For example, the Squeeze-and-Excitation (SE) module enhances feature sensitivity through channel-wise recalibration [
31], but its single-resolution processing limits cross-scale adaptability; Residual Networks (ResNet) mitigate the vanishing gradient problem via skip connections [
32] but lack a strategy for dynamic cross-layer fusion. Recent breakthroughs combine attention mechanisms [
33] with Multi-Scale Fusion (MSF) [
34] to achieve resolution-adaptive feature weighting, such as the CNN–Transformer parallel architecture for industrial image super-resolution [
35], or multi-scale convolutional kernels with global feature selection [
36]. These advances at the algorithmic level indicate that multi-granularity dynamic fusion is an effective way to improve a model’s ability to handle complex content.
However, a fundamental, system-level bottleneck persists: even with more advanced dynamic fusion mechanisms, existing generative methods are still largely developed and validated on relatively “clean” image data. They are not co-designed with the physical characteristics of the front-end imaging system (e.g., specialized lighting schemes designed to suppress highlights and shadows). When applied to images acquired in real industrial settings, which contain complex optical interference (such as non-uniform reflections and strong shadows) that is difficult to strip away algorithmically, the generative capability of these models degrades significantly. Optimizing the generation algorithm in isolation, while ignoring deep coupling with the front-end imaging, therefore cannot fundamentally solve defect sample generation in complex industrial scenarios. This underscores the necessity of systematically integrating “imaging hardware optimization” and “data generation algorithms”; bridging this gap is one of the core objectives of this work.
In summary, there is a clear fragmentation in current research across the three key stages of “imaging hardware, data generation, and defect recognition.” Hardware research does not fully consider providing optimal input for few-shot learning; generative algorithm research is often detached from real imaging physical constraints. This fragmentation makes it difficult for existing solutions to constitute a complete, reliable, usable, and deployable system under the harsh conditions of real industry (strong imaging interference, extremely low defect rates).
To overcome this systemic bottleneck, this paper proposes and implements a task-driven, end-to-end paradigm for constructing a defect detection system. The core idea of this work is, through the co-design of imaging physics, data synthesis, and detection models, to transform the 3D defect recognition problem for complex metal parts into a robust, deployable 2D vision solution. We no longer view imaging, generation, or detection in isolation, but integrate them into a closed-loop optimization process: the imaging subsystem provides “clean” input for data generation; the data generation engine, constrained by the characteristics of front-end imaging, efficiently synthesizes high-quality defect samples; finally, the performance of the downstream detector validates and provides feedback on the effectiveness of the entire system. The main contributions of this system are reflected in the following three closely related aspects:
(1) 3D-to-2D Defect Mapping Strategy: The core of this paper is a systematic modeling method with a rigorous theoretical framework (Section 2.1). By co-designing the imaging system and deep learning algorithms, together with methods adapted to highly reflective metal materials, this strategy uses 2D images as an intermediary to convert 3D defects into discriminable 2D features, achieving a stable mapping from 3D defects to 2D images. Unlike methods for conventional textured objects, it targets metal parts with high curvature, complex structures, and directionally sensitive defects, balancing lightweight implementation and high detection performance without relying on 3D reconstruction, point clouds, or CAD models.
(2) Robust Imaging Subsystem for Highly Reflective Complex Workpieces: A dedicated imaging device integrating precision fixtures, dome diffuse lighting, and a small-aperture lens was developed. This subsystem effectively suppresses specular highlights and irregular shadows, providing high-quality images with uniform illumination and salient features for subsequent processing stages, reducing imaging interference at the source.
(3) Task-Driven Defect Imaging + Generation Software Architecture: With the ultimate goal of “training a few-shot detection model,” the first stage, based on an improved FastGAN, focuses on generating local defect patches. By reducing the dimensionality of the generation space, it effectively avoids interference from background textures and significantly reduces parameter count and computational cost. Simultaneously, a Dynamic Multi-Granularity Fusion (DMGF-SLE) module is introduced, which employs cross-layer channel and spatial attention gating mechanisms to achieve adaptive multi-scale feature fusion, thereby enhancing the representation capability for defect texture, shape, and structural anomalies. Furthermore, a perceptual loss is incorporated during training to improve the texture details and visual realism of the generated outputs. The second stage employs an optimized fusion mechanism based on the principles of Poisson image editing, which uses a designed key region mask to precisely control the location of defect generation. A composite loss function is also designed to ensure edge smoothness and geometric continuity, ultimately producing complete defect samples.
The overall workflow of the proposed system is illustrated in
Figure 1. Initially, the imaging system is employed for data acquisition to obtain original defective and non-defective samples, which are then processed to yield key region masks for the normal samples and the original defect patches. Random noise is fed into the generator, which is focused on generating local defects, thereby mitigating the interference of complex optical features from the normal regions on the global synthesis process. A pre-trained VGG16 network [
37] is introduced. By combining a perceptual loss function with the adversarial loss, the generator is optimized to produce defect patches of higher quality, enriched with optical details such as texture and illumination consistency. These generated defect patches are combined with real defect patches to form an expanded defect patch dataset, which is then paired with real normal images and their corresponding masks. Within the defect fusion and optimization module, the defect patches are fused onto the normal images at the defect locations designated by the masks. The fusion process employs Poisson image editing, a content loss, and a boundary smoothing loss to refine the blending result, ensuring that the generated defect regions retain high-quality details and exhibit natural transitions; the output of this stage is a complete synthetic defective image. These synthetic defective images are subsequently fed into a deep learning model for defect recognition, which identifies and localizes defects and produces automated inspection results, completing the end-to-end defect detection pipeline. Representative images generated by the proposed method are shown in
Figure 2.
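To make the perceptual supervision in this pipeline concrete, the following minimal PyTorch sketch illustrates a VGG16 feature-space loss that can be weighted against the adversarial term; the variable names and the choice of feature layer are our own assumptions, not fixed by the pipeline description above.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    """L1 distance between frozen VGG16 feature maps of two image batches."""
    def __init__(self, n_layers: int = 16):  # up to relu3_3; layer choice assumed
        super().__init__()
        self.features = vgg16(weights="IMAGENET1K_V1").features[:n_layers].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)  # the backbone stays fixed during GAN training

    def forward(self, fake: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
        return nn.functional.l1_loss(self.features(fake), self.features(real))

# Generator objective: adversarial term plus a weighted perceptual term, e.g.
# loss_G = loss_adv + lambda_per * PerceptualLoss()(fake_patch, real_patch)
```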
3. Experiments
To validate the effectiveness of the proposed method, systematic experiments were conducted on an automotive door lock strike dataset, encompassing imaging quality assessment, defect patch generation, complete defect sample fusion, object detection performance improvement, and comparative analysis.
3.1. Experimental Setup: Dataset, Calibration, and Configuration
Dataset and Annotation. The dataset comprises 864 images of non-defective samples and 566 images of defective samples. The irregular defect regions on the defective samples were annotated using LabelMe (v3.16.7). These annotated regions were cropped and placed onto a pure black background to isolate the optical characteristics of the defects. The defects were categorized based on their type (linear, punctate, compound), with concentrations primarily in the top, middle, and bottom regions of the samples, showing a relatively fixed spatial distribution: linear defects were predominantly located in the bottom and middle areas, while punctate defects were mostly distributed in the top region. Some ambiguous defects were classified as “compound defects.” Given the relatively consistent locations of the defects, the complete door lock strike samples were further segmented into different regions. Mask images for these key regions were generated using LabelMe to guide the defect generation process based on optical imaging properties.
System Calibration and Physical-Pixel Scale. Our imaging system was calibrated using a high-precision calibration board to establish the physical-pixel correspondence. The original captured images have a physical scale of 1 pixel ≈ 17 μm (3000 pixels ≈ 50 mm). After image processing (padding followed by 4× downsampling), the final working resolution corresponds to 1 pixel ≈ 68 μm (0.068 mm), i.e., approximately 14.7 pixels/mm. This calibration was verified using multiple reference measurements across the image field to ensure accuracy.
Defect Size Distribution and Statistics. We conducted a comprehensive analysis of the physical dimensions of all defect samples. As shown in
Table 2, the dataset exhibits a diverse range of defect sizes. The minimum detectable defect size is 0.73 × 0.57 mm (10.9 × 8.5 pixels), which is below the 1.0 mm threshold specified in automotive quality standards ISO 286-2 [
67], validating the high sensitivity of our inspection system. The maximum defect size measured is 9.13 × 3.60 mm (137.0 × 54.0 pixels), with an average equivalent diameter of 2.28 ± 1.17 mm. This distribution reflects real-world industrial defect scenarios where medium-sized defects are most common, while extremely small or large defects occur less frequently.
Model Training Configuration. The generative adversarial networks were trained on an RTX 3080 (20 GB) GPU (NVIDIA Corporation, Santa Clara, CA, USA) using PyTorch 2.0. We adopted a “divide-and-conquer” strategy, training a separate GAN for each defect category so that each model focuses exclusively on the data distribution of a specific defect pattern, thereby enhancing the diversity and representativeness of the generated samples. Input sizes and batch sizes were configured according to the characteristics of each defect category, as detailed in
Table 2.
3.2. Imaging Quality Assessment Experiment
To systematically validate the effectiveness of the proposed imaging scheme on complex metal workpieces, this study conducted a comprehensive comparative analysis across visual quality, objective image metrics, and downstream detection performance and robustness. A 2 × 2 factorial design was employed to deconstruct the individual and combined contributions of the two core design principles: uniform dome illumination versus directional lighting, and a small aperture (large depth of field) versus a large aperture. Four representative optical configurations were selected: Dome+f/16 (our proposal), Ring+f/16 (ring light + small aperture), Bar+f/16 (bar light + small aperture), and Dome+f/4 (dome light + large aperture). All images were captured from the same batch of automotive door lock strike workpieces under strictly controlled, fixed working distances and illumination intensities. Image quality was assessed using sharpness (Laplacian gradient norm) and noise level (image standard deviation σ). Downstream performance was evaluated using a YOLOv8 model, measuring mAP@50, mAP@50:95, F1-Score, and critically, the average number of false positives per image (FPs/img) to quantify detection robustness under identical training and validation settings.
As shown in
Figure 8, the visual comparative analysis first qualitatively demonstrates the effect differences. (a) Mobile phone with indoor lighting exhibits low contrast and blurred edges. (b) Bar light + small aperture enhances local contrast but introduces strong directional reflections and overexposure. (c) Dome light + large aperture suppresses highlights but suffers from a shallow depth of field, preventing full-field sharpness on curved surfaces. (d) Ring light + small aperture offers a balance but retains top reflections that reduce local contrast. (e) Our method (Dome light + small aperture) effectively suppresses reflections and shadows across the entire field of view, resulting in a smooth background where defects are highlighted as high-contrast features, confirming the necessity of co-designing “uniform illumination” and “large depth of field.”
3.2.1. Experimental Validation of Aperture Selection
To further validate the aperture selection, we conducted ablation experiments under identical dome illumination and exposure-matched conditions. The experiments used the same door-lock components, captured images with f/11, f/16, and f/22 apertures, respectively, and employed the same YOLOv8 model for defect detection, with mAP@50 as the evaluation metric.
The experimental results show that f/11, due to insufficient DOF, caused local defocus on curved surfaces, reducing defect sharpness and detection performance. Although f/22 provided sufficient DOF, diffraction blur and reduced SNR degraded image quality, also affecting defect detection accuracy. f/16 maintained adequate DOF while controlling diffraction blur and noise, achieving the best detection performance. Quantitative results are summarized in
Table 3.
Although a smaller aperture reduces light throughput, controlled dome illumination and exposure compensation prevent excessive noise amplification. Consequently, the SNR remains sufficient for weak defect detection, as reflected by the stable mAP@50 and low false-positive rates achieved with the f/16 configuration.
While reducing the aperture increases the depth of field, it also introduces diffraction-related blur. The diffraction-limited blur size can be estimated by the Airy disk diameter, d = 2.44λN, where N is the f-number. For white-light imaging, taking λ ≈ 550 nm, the corresponding Airy disk diameters for f/11, f/16, and f/22 are approximately 14.8 μm, 21.5 μm, and 29.5 μm, respectively, corresponding to about 8.0, 11.6, and 16.0 pixels on the employed sensor. In our application, the characteristic spatial extent of typical surface defects (e.g., scratches and small surface irregularities spanning tens of pixels) remains significantly larger than the diffraction blur at f/16, indicating that diffraction is not the dominant resolution-limiting factor under this setting.
However, further stopping down to f/22 results in increased diffraction blur and reduced photon efficiency, which may adversely affect weak defect visibility. Therefore, f/16 represents a practical balance between depth-of-field coverage and diffraction-induced resolution loss.
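The diffraction figures quoted above follow directly from the Airy-disk relation; the short script below assumes λ = 550 nm and the ≈1.85 μm sensor pixel pitch implied by the quoted pixel counts.

```python
WAVELENGTH_UM = 0.55    # mid-band white light, assumed
PIXEL_PITCH_UM = 1.85   # implied by the quoted 8.0 / 11.6 / 16.0 pixel diameters

for f_number in (11, 16, 22):
    airy_um = 2.44 * WAVELENGTH_UM * f_number  # Airy disk diameter d = 2.44 * lambda * N
    print(f"f/{f_number}: {airy_um:.1f} um = {airy_um / PIXEL_PITCH_UM:.1f} px")
# f/11: 14.8 um = 8.0 px,  f/16: 21.5 um = 11.6 px,  f/22: 29.5 um = 16.0 px
```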
3.2.2. Comprehensive Performance Comparison
At the image quality level, significant differences are observed (
Table 4). The Ring+f/16 configuration achieved the highest sharpness (303.70) due to directional edge enhancement, but at the cost of the highest noise level (3.71). In contrast, our Dome+f/16 scheme maintained excellent sharpness (266.40) while achieving the lowest noise (2.50), demonstrating effective speckle noise suppression. The Dome+f/4 result is critical: although using the same uniform illumination, its large aperture causes a severe sharpness drop (131.68), directly proving the necessity of a “small aperture” for full-field sharpness. It is important to clarify the intentional trade-off here: the relatively small aperture (f/16) was selected to ensure a large depth of field. While a reduced aperture may introduce slight diffraction-induced blur at the theoretical resolution limit, this choice prioritizes full-field focus and, crucially, suppresses unstable specular reflections on highly reflective surfaces. For defect detection, maximizing stable defect-to-background contrast and robustness is paramount over ultimate spatial resolution.
These fundamental differences in image quality directly and consistently translated to downstream defect detection performance and, most notably, robustness (
Table 4). The model trained on Dome+f/16 data achieved the best overall performance, particularly on the comprehensive metric mAP@50:95 (50.4%) and F1-Score (0.89). The Ring+f/16 configuration, leveraging its high sharpness, achieved the second-best mAP@50:95 (47.1%). However, a decisive advantage of our proposed scheme is revealed in the false positive analysis. The Ring+f/16 configuration produced the highest rate of false alarms (2.3 FPs/img), nearly triple that of our method (0.7 FPs/img). This indicates that the directional highlights from the ring light, while enhancing some edges, create numerous spurious, defect-like features that degrade model reliability. The Bar+f/16 configuration, with the most unstable lighting, performed worst across all detection metrics, including false positives (3.0 FPs/img). Conversely, the isotropic diffuse light from the dome source provides a consistent, low-noise background, drastically reducing false alarms. The comparison between Dome+f/4 and Dome+f/16 further isolates the aperture’s role: the large aperture variant, despite its uniform lighting, suffers in overall detection performance (mAP@50:95: 47.7%) due to poor sharpness, though it maintains a moderate false positive rate (1.5 FPs/img) thanks to its stable illumination. This systematic ablation confirms that uniform dome illumination is the key to minimizing false positives and ensuring robustness, while the small aperture is foundational for achieving the full-field sharpness required for high detection accuracy. The proposed Dome+f/16 configuration optimally balances these factors, delivering superior performance where it matters most for industrial inspection: high detection accuracy coupled with high operational reliability.
3.3. Validation of the Effectiveness of the DMGF-SLE Module and Perceptual Loss
3.3.1. Ablation Study on Core Modules: DMGF-SLE and Perceptual Loss
To evaluate the effectiveness of the DMGF-SLE module and perceptual loss in the first stage of our defect sample generation framework, we used the original FastGAN as the baseline and conducted experiments with three variants: FastGAN+DMGF, FastGAN+PER, and FastGAN+All (combining both DMGF-SLE and perceptual loss). Training was performed on an RTX 3080 GPU using the same defect patch dataset for 4 h per variant, with three random seeds to ensure reproducibility. We conducted ablation experiments on defect patches using FID (Fréchet Inception Distance) [
68], LPIPS (Learned Perceptual Image Patch Similarity) [
69], and IS (Inception Score) [
70] as evaluation metrics. All experiments were conducted with multiple runs, and the mean values are reported. FID quantifies the feature distribution difference between real and generated images:

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)$$

Here, Tr denotes the trace of a matrix, and $\mu_r, \mu_g$ and $\Sigma_r, \Sigma_g$ represent the means and covariance matrices of the real and generated image datasets in the feature space, respectively. LPIPS measures perceptual similarity via weighted feature distances:

$$\mathrm{LPIPS}(x, y) = \sum_{l} \frac{w_l}{H_l W_l} \sum_{h,w} \bigl\lVert \phi_l(x)_{h,w} - \phi_l(y)_{h,w} \bigr\rVert_2^2$$

Here, $x$ and $y$ are the two images to be compared, $\phi_l$ represents the feature maps extracted by a pre-trained deep network (such as AlexNet [71] or VGG) at the $l$-th layer, and $w_l$ is the weighting coefficient used to balance the contributions of different feature layers. IS assesses diversity using the entropy of category distributions:

$$\mathrm{IS} = \exp\!\left( \mathbb{E}_x \left[ D_{\mathrm{KL}}\bigl( p(y \mid x) \,\Vert\, p(y) \bigr) \right] \right)$$
Here, $x$ represents a generated image, $p(y \mid x)$ denotes the category probability distribution predicted by the classification network for the generated image $x$, and $p(y)$ represents the average (marginal) category distribution over the generated image dataset. Results in
Table 5 demonstrate that adding DMGF-SLE or the perceptual loss individually reduces FID, and that the combined approach (FastGAN+All) achieves the lowest FID (e.g., 46.25 for linear defects) alongside lower LPIPS and higher IS scores, indicating enhanced quality, perceptual similarity, and diversity in defect patch generation. A visual evaluation was also conducted by professional workers from the production line, comparing the defect patches generated by the different methods after the same number of iterations, as shown in
Figure 9. The first column displays real defect patches as reference images, while the other columns present defect patches generated by different methods, with each row corresponding to a specific defect type. When using FastGAN alone, the generated defect patches exhibit darker tones, simpler shapes and blurred edges, deviating significantly from real defect appearances. When DMGF-SLE is added alone, cross-layer feature interactions enhance the structure and texture details, resulting in sharper edges, though some unnatural jagged edges appear, particularly in the first and fourth rows of the third column. When only the perceptual loss module is used, it refines details by eliminating jagged edges and improving texture and structure, but the shapes tend to become more conservative. When both DMGF-SLE and perceptual loss are combined (FastGAN+All), the generated defect patches show significant improvements in detail capture, structural consistency, and natural transitions, closely resembling real defect patches in terms of texture, edge clarity, and overall visual quality.
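For completeness, the FID defined above can be computed from Inception feature statistics as in the following sketch (a standard implementation outline rather than the exact evaluation code used here).

```python
import numpy as np
from scipy import linalg

def fid(feat_real: np.ndarray, feat_gen: np.ndarray) -> float:
    """FID from two (N, D) arrays of Inception-v3 pooling features."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sigma_r = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):  # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```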
3.3.2. In-Depth Analysis of the Dynamic Fusion Mechanism
To validate the effectiveness of our dynamic fusion mechanism, we conducted comprehensive ablation studies comparing three variants: Baseline (the original FastGAN without the DMGF-SLE module), Static (DMGF-SLE with the fusion parameters α and β held fixed), and Dynamic (the full DMGF-SLE with learned parameters).
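The difference between the Static and Dynamic variants can be made explicit in code. The sketch below is a simplified stand-in for the full DMGF-SLE gating (the channel and spatial attention are omitted, and the shapes and 0.5 initial values are our assumptions); the two variants differ only in whether α and β are learnable.

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    """Blend a low-level skip feature with a high-level feature map."""
    def __init__(self, dynamic: bool = True, alpha: float = 0.5, beta: float = 0.5):
        super().__init__()
        if dynamic:  # learned scalars, adapted by backpropagation during training
            self.alpha = nn.Parameter(torch.tensor(alpha))
            self.beta = nn.Parameter(torch.tensor(beta))
        else:        # fixed coefficients, as in the Static ablation variant
            self.register_buffer("alpha", torch.tensor(alpha))
            self.register_buffer("beta", torch.tensor(beta))

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        return self.alpha * low + self.beta * high  # attention gating omitted

fuse = FusionGate(dynamic=True)
out = fuse(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```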
The ablation results, as visually summarized in
Figure 10, demonstrate that our dynamic fusion mechanism achieves significant and consistent improvements over both the baseline and static variants. The bar charts clearly show that the
Dynamic variant (dark purple) achieves the lowest bar (best score) for every defect type in both FID and LPIPS metrics. Quantitatively, compared to the Baseline, the Dynamic variant reduces the FID by 6.93% for linear defects, 3.06% for punctate defects, and a substantial 22.01% for compound defects. Concurrently, it lowers the LPIPS by approximately 4.8% consistently across all defect types. This indicates a clear and simultaneous enhancement in both the realism (FID) and perceptual quality (LPIPS) of the generated patches. More importantly, the Dynamic variant consistently outperforms the Static variant. The visual gap between the medium pink (Static) and dark purple (Dynamic) bars is evident across nearly all categories. For instance, in the most challenging compound defects task, the Dynamic variant achieves a 15.1% lower FID (58.7 vs. 69.10) and a 5.5% lower LPIPS (0.1969 vs. 0.2083) than the Static variant, which is directly observable in the rightmost set of bars in both charts.
This consistent performance gap, made intuitive by the comparative bar charts, provides strong evidence that both the α and β parameters adapt actively and beneficially during training. If the parameters were not learning effectively, the dynamic version would not reliably surpass the statically tuned version across all scenarios. The progressive adaptation of these parameters (evident in
Table 1) correlates directly with this improved generation quality, validating our design hypothesis that dynamic multi-granularity fusion is superior to a fixed fusion strategy, especially for handling complex and irregular defect patterns.
3.3.3. Cross-Dataset Generalization Validation
To further assess the generalizability of our improvements across diverse datasets, we evaluated FID on randomly selected images from the AFHQ_v2 (Animal Faces-HQ) [
72] dog and cat categories and our car door latch dataset. Results and setups are presented in
Table 6, with generated images shown in
Figure 11. The consistent FID reduction across datasets (e.g., from 46.19 to 34.26 for the Car Door Latches dataset) validates the robustness of DMGF-SLE in enhancing defect patch generation for various applications.
3.4. Quality Evaluation Experiments Based on Defect Sample Fusion
To verify the authenticity of integrating generated defect patches into normal samples in the second stage, we conducted a quality comparison experiment. Since car door latch defects are small and easily confused with similar non-defective regions, we expanded the defect areas into centered patches. Given that defect types vary by location, we selected the top, middle, and bottom positions for patching. Three methods were tested—CutPaste, Poisson image editing, and our proposed approach—using the same defect patch, with quality assessed via FID.
Table 7 shows that CutPaste yields the highest FID scores (e.g., 160.98 at the bottom position) due to poor fusion with the background, which reduces sample authenticity. Poisson editing and our method achieve much lower FID scores, with ours outperforming Poisson at the top (25.34 vs. 28.32) and bottom (54.73 vs. 60.57) positions and performing comparably at the middle position (33.63 vs. 34.78). This is because our method not only embeds the defect patch into the target sample but also emphasizes preserving the features of the defect region, effectively balancing fusion quality between the defect patch and the background in complex scenes.
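For reference, the Poisson baseline in this comparison can be reproduced with OpenCV's seamless cloning, a standard implementation of Poisson image editing; the sketch below uses illustrative file names and paste coordinates, and omits the content and boundary-smoothing losses that distinguish our full method.

```python
import cv2
import numpy as np

normal = cv2.imread("normal_strike.png")          # defect-free sample (illustrative)
patch = cv2.imread("generated_defect_patch.png")  # stage-one output (illustrative)

# Full-patch mask; the paste centre must keep the patch inside the target image
mask = np.full(patch.shape[:2], 255, dtype=np.uint8)
center = (412, 980)  # (x, y) inside the key-region mask; illustrative value

blended = cv2.seamlessClone(patch, normal, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("synthetic_defective.png", blended)
```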
We also validated our method on a self-collected ampoule dataset, with results shown in
Figure 12. The generated images demonstrate seamless defect integration and effective preservation of defect features.
To quantitatively evaluate the visual realism of the generated defects, an expert blind test was designed and conducted. Three senior engineers, each with over five years of experience in quality inspection on latch production lines, were invited as evaluators. The test material comprised 200 complete latch images with defects, of which 100 were from the real test set and the other 100 were randomly generated by our method. These images were randomly mixed. Under a single-blind condition, each evaluator was required to independently judge whether each image was “real” or “generated.” The average rate at which the generated samples were misclassified as real was used as the core evaluation metric. The results show that the average misclassification rate among the three experts was 64.3% (standard deviation ± 2.1%). This result, significantly higher than the random guessing level of 50%, strongly confirms from the perspective of human visual perception that our method is capable of generating defect samples with high authenticity.
3.5. Complete Defect Sample Replacement Experiments
To validate the authenticity of the generated samples and their effectiveness in replacement training, we designed and conducted a sample replacement experiment based on YOLO. This experiment utilized defect patch datasets of different categories for model training, and generated 312 complete defective samples through the second-stage defect patch fusion and optimization step. During the generation process, defects were precisely positioned and pasted onto target locations, with the system automatically recording the paste coordinates. Based on this location data, annotation files compliant with the YOLO format (normalized center coordinates + width/height) were generated programmatically, achieving fully automated end-to-end annotation. Subsequently, the real samples and generated samples were divided into training, validation, and test sets according to a ratio of 0.65, 0.15, and 0.2, respectively. A total of six experimental groups were established, as detailed in
Table 8, with the Real-Real group serving as the baseline. The input image size for the YOLO model was uniformly set to 1024 × 1024. Common object detection data augmentation techniques (image translation, mirror flipping, color jittering) were applied during training. The experiment utilized mAP@50:95, mAP@50, precision, and recall as evaluation metrics. mAP@50 is a widely used evaluation metric in object detection tasks, measuring the detection performance when the Intersection over Union (IoU) threshold is set at 50%. A higher mAP@50 value indicates that the model can detect and localize targets more accurately. mAP@50:95 is a comprehensive evaluation metric that assesses model performance across multiple IoU thresholds. A higher mAP@50:95 value suggests that the model possesses a stronger capability for precise target localization.
$$\mathrm{mAP@50{:}95} = \frac{1}{\lvert T \rvert} \sum_{t \in T} \mathrm{AP}_t, \qquad T = \{0.50, 0.55, \ldots, 0.95\}$$

Here, $T$ is the set of IoU thresholds, and $\mathrm{AP}_t$ represents the average precision at threshold $t$. As shown in
Table 8, the comprehensive cross-domain evaluation reveals the multifaceted characteristics of the generated data in model training and its interplay with real data. The Real-Real group (mAP@50: 91.6, mAP@50:95: 50.4) establishes a definitive performance benchmark. Crucially, the Syn-Syn group (mAP@50: 91.4, mAP@50:95: 50.3) achieves nearly indistinguishable performance, demonstrating the exceptional internal consistency, feature fidelity, and self-domain effectiveness of the generated data. This validates its core utility in creating controlled, balanced, and scalable training environments.
The cross-domain evaluations further delineate the nature of the distributional shift. The Syn-Real group (mAP@50: 88.5) shows a measurable performance gap when transferring from synthetic to real domains, while the Real-Syn group (mAP@50: 90.1) exhibits a smaller gap in the reverse direction. This asymmetry confirms that while our generation process captures the macro-level defect distribution, subtle discrepancies in geometric fidelity, texture complexity, and background stochasticity between synthetic and real data remain. The model trained on the more complex real distribution generalizes better to the simpler synthetic domain than vice versa. The pronounced performance degradation in the Mixed-Real group (mAP@50: 80.2), where the model is trained on a blend of both domains but tested solely on real data, serves as a critical quantitative indicator. This result, consistent with findings in related work [
61], demonstrates that a naive concatenation of synthetic and real data can be suboptimal. It suggests that without a sophisticated fusion mechanism, the combined dataset may introduce distributional bias and optimization conflict, preventing the model from effectively reconciling the two domains and potentially overfitting to the more regular patterns in the synthetic subset, thereby impairing its discriminative capability on pure real data.
These results provide a nuanced and actionable understanding: (1) The generated data is highly effective within its own domain and for tasks requiring domain-consistent data augmentation. (2) A measurable but manageable distribution gap exists, explaining the cross-domain generalization limits. (3) Most importantly, the performance drop in Mixed-Real is not merely a limitation but a clear diagnostic signal. It underscores that the key to unlocking the full potential of synthetic data lies not in simple mixing, but in developing advanced strategies such as curriculum learning, domain-invariant representation learning, or adaptive data weighting to intelligently bridge the domain gap. Thus, our work not only delivers a high-quality synthetic dataset but also empirically charts a clear path for future research in hybrid data utilization.
The object detection results are visualized in
Figure 13.
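The automated annotation step described in this experiment can be sketched as follows; the function name and example values are illustrative.

```python
def yolo_label(cls_id: int, x_paste: int, y_paste: int,
               w_patch: int, h_patch: int,
               img_w: int = 1024, img_h: int = 1024) -> str:
    """Build one YOLO annotation line from recorded paste coordinates.

    (x_paste, y_paste) is the top-left corner of the pasted defect patch.
    """
    xc = (x_paste + w_patch / 2) / img_w
    yc = (y_paste + h_patch / 2) / img_h
    return f"{cls_id} {xc:.6f} {yc:.6f} {w_patch / img_w:.6f} {h_patch / img_h:.6f}"

# e.g., a linear defect (class 0) pasted at (380, 910) with a 137 x 54 px patch
print(yolo_label(0, 380, 910, 137, 54))
```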
3.6. Sample Augmentation Experiments Based on Object Detection Model
To evaluate the effectiveness of generated samples in augmentation training, we designed and conducted a sample augmentation experiment. The experiment utilized 312 real defect samples, of which 20% were randomly selected as a fixed test set. All experimental groups were evaluated using this identical test set. Different training sets were constructed by adding 400, 800, 1200, 1600, 2000, and 2400 generated samples, respectively, to the remaining real samples. To account for the data volume difference introduced by the generated samples, we also created equivalent-sized training sets using traditional data augmentation methods (random translation by 3 mm and random rotation) based on the real samples as a baseline group. These methods ensure minimal divergence between the augmented samples and the real images by introducing minor transformations. Within each training set group, 15% of the data was randomly selected as a validation set, and the input image size was set to 1024 × 1024 pixels. During training, mainstream object detection data augmentation techniques (such as image translation, mirror flipping, and color jittering) were applied. The experiment used mAP@50:95 and mAP@50 as the primary evaluation metrics. The YOLOv8 experimental results shown in
Figure 14a,b indicate that both our generated sample method and the traditional augmentation method effectively improve model performance. When the sample size increased to 2400, the generative method outperformed the traditional method in both mAP@50:95 (60.5% vs. 59.8%) and mAP@50 (94.9% vs. 94.5%), demonstrating its performance advantage. Although the performance of both methods showed an overall upward trend with increasing sample size, the traditional method exhibited significant instability, with a notable performance drop at 800 samples (mAP@50 = 89.2%), likely due to distribution bias introduced by geometric transformations. In contrast, the generative method exhibited only slight fluctuations at small sample sizes (400) due to initial distribution bias, followed by stable improvement, indicating that the feature distribution of its generated samples is closer to the real data, providing a more robust augmentation effect. It is noteworthy that the generative method consistently achieved higher gains in mAP@50:95, reflecting its particular effectiveness in improving localization accuracy. When the sample size was large (2000–2400), the incremental improvements gradually diminished, suggesting that performance might be approaching its upper limit under the current constraints. Overall, while maintaining stability, the generated samples demonstrated superior potential for performance improvement compared to traditional methods.

To eliminate the dependency of our conclusions on a single model architecture, we conducted comparative experiments on the RT-DETR model, with results shown in Figure 14c,d. RT-DETR demonstrated a higher baseline performance (mAP@50 = 92.0%, mAP@50:95 = 53.0%), which aligns with expectations for an advanced detection model. Crucially, the core advantage of our method was fully replicated on RT-DETR. At the 2400 sample size, our method also achieved the highest performance on RT-DETR (mAP@50 = 96.3%, mAP@50:95 = 63.5%), and its advantage over the traditional method became more pronounced (the performance gap increased from 0.7 percentage points on YOLOv8 to 1.5 percentage points). This suggests that RT-DETR’s powerful representational capacity can more fully utilize the rich information in high-quality generated samples, indicating a positive synergistic effect between our method and advanced model architectures.

The experiments validate the effectiveness of the samples generated by our method in enhancing object detection performance. The data trends indicate that performance improvement is positively correlated with sample size, and the GAN shows greater potential in increasing dataset diversity. Although traditional augmentation is also effective, its growth is relatively limited. It is noteworthy that the impact of GAN-generated data has a complex relationship with sample size, with the most consistent performance improvements observed at medium to large sample sizes. Larger sample sizes may require further optimization of the generated samples to prevent performance instability caused by potential distribution bias.
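For clarity, the traditional-augmentation baseline can be sketched as below; the 3 mm shift is converted to pixels via the 68 μm/pixel working scale, and the rotation range is an assumption since it is not specified above.

```python
import random
import cv2
import numpy as np

MM_PER_PIXEL = 0.068
MAX_SHIFT_PX = 3.0 / MM_PER_PIXEL  # 3 mm ~ 44 px at the working resolution

def augment(img: np.ndarray, max_deg: float = 5.0) -> np.ndarray:
    """Random translation (<= 3 mm) and rotation, mimicking the baseline group."""
    h, w = img.shape[:2]
    tx = random.uniform(-MAX_SHIFT_PX, MAX_SHIFT_PX)
    ty = random.uniform(-MAX_SHIFT_PX, MAX_SHIFT_PX)
    ang = random.uniform(-max_deg, max_deg)  # rotation range is an assumption
    m = cv2.getRotationMatrix2D((w / 2, h / 2), ang, 1.0)
    m[:, 2] += (tx, ty)                      # fold the shift into the affine map
    return cv2.warpAffine(img, m, (w, h), borderMode=cv2.BORDER_REPLICATE)
```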
3.7. Comparative Experiments
Defect-GAN [
25], as a single-stage defect generation method, requires training on a dataset containing non-defective images, defective images, and their corresponding defect masks. This method employs a composite layer structure based on inpainting and restoration processes to synthesize diverse and realistic defect samples. The Defect-aware Feature Manipulation Generative Adversarial Network (DFMGAN) [
21] represents another two-stage approach. It first trains StyleGAN2 on defect-free images to generate high-fidelity background textures, and then employs defect-aware residual blocks to learn defect mask generation and local feature manipulation from limited defective samples. Compared to our two-stage defect generation method, Defect-GAN directly synthesizes complete defective images, requiring simultaneous modeling of both background and defect features. This joint optimization can lead to unnatural transition artifacts in complex backgrounds, and due to the lack of independent optimization of background and defect features, its capability in modeling complex defect patterns (e.g., local textures and geometric morphologies in automotive door lock strikes) is limited. Compared to our approach, DFMGAN requires pre-generating complete defect-free images. In practical industrial scenarios (e.g., door lock strike inspection), non-defective samples are abundant and highly consistent (fixed viewing angle, stable illumination), whereas defective samples are scarce and localized, having minimal global impact. Generating complete defect-free images introduces computational redundancy and unnecessary noise, thereby reducing generation efficiency and compromising output quality.

To evaluate the performance differences, we designed a comparative experiment using 300 defect-free and 120 defective images of automotive door lock strikes. The defect regions were annotated using LabelMe. Data augmentation techniques (such as image translation, mirror flipping, and color jittering) were also employed. Three datasets were prepared: a defect patch dataset (for our method), a dataset conforming to the MVTec standard (used by DFMGAN), and a dataset in Defect-GAN format. All experiments were conducted on an RTX 3090 GPU (24 GB VRAM) (NVIDIA Corporation, Santa Clara, CA, USA) using PyTorch 1.9.0, Python 3.8, and CUDA 11.1. Image quality was quantified using the Fréchet Inception Distance (FID), and computational resource requirements were additionally analyzed. In the task of generating defects for automotive door lock strikes, our method demonstrated significant performance advantages (
Table 9). Compared to DFMGAN, our two-stage generation strategy reduced the FID score by 18.3% (31.51 vs. 38.59), while also significantly improving training efficiency. As shown in
Figure 15, although the defect details generated by DFMGAN exhibit high realism, insufficient training in its first-stage background generation leads to two key issues: (1) distortion of background shape and texture, and (2) unnatural color shifts (an anomalous bluish tint). The single-stage architecture of Defect-GAN has fundamental limitations in modeling complex geometries, resulting in both poor generation quality and low training efficiency (FID = 59.18, training time 10 h).
Figure 15 shows that images generated by Defect-GAN often exhibit missing defects, primarily due to insufficient training samples and inadequate training duration under the experimental conditions. Our method instead focuses on learning local defect features (such as texture and shape): by directly generating defect image patches without modeling the global background, it significantly reduces the computational burden and balances generated image quality against resource consumption, making it particularly suitable for resource-constrained industrial environments.
3.8. Summary of Experimental Results
The experiments in
Section 3.1 detail the construction of the dataset, system calibration, and experimental configuration, laying the foundation for all subsequent work. The experiments in
Section 3.2 conducted a comprehensive evaluation of the imaging scheme through qualitative and quantitative experiments, verifying the effectiveness of the imaging system and providing 2D data support for subsequent work. The experiments in
Section 3.3 validated the effectiveness of the first-stage method in generating high-quality defect patches. Quantitative metrics (FID, LPIPS, IS) and visual assessment demonstrated the realism and diversity of the generated samples. In the first stage, compared to the original FastGAN, our method reduced the FID score by 11–24% across different defect categories, indicating that the generated defect patches are more similar to real ones in texture and structural details, which is critical for industrial imaging applications. For instance, the FID for linear defects decreased from 55.85 to 46.25, showing that the generated defects more closely resemble the appearance of real defects under varying imaging conditions. The LPIPS and IS scores further confirmed the improvement in perceptual quality and diversity, ensuring that the generated patches capture a variety of defect manifestations. We also conducted ablation studies to validate the learning mechanism of the α and β parameters, demonstrating their dynamic adaptation during training. On other datasets (such as AFHQ_v2 and the automotive door lock strike dataset), FID showed consistent reductions of 5–20%, highlighting the generalization ability of the method across different imaging scenarios. In the second stage, the experiments in
Section 3.4 showed that our method significantly improved the FID compared to other methods, ensuring that the generated defect patches maintain realistic visual characteristics after being fused with normal samples. In blind tests, experienced quality inspectors misclassified 64.3% (±2.1%) of the generated samples as real ones, providing objective evidence for the high quality of the images produced by our method. The experiments in
Section 3.5 further highlighted the practical impact of our method on deep learning-based defect detection models. By introducing varying amounts of generated samples into the training of object detection models, we observed a significant improvement in detection performance compared to using only real samples. Specifically, on YOLOv8, the mAP@50:95 metric saw a maximum increase of 10.1 percentage points (rising from 50.4% to 60.5%), and mAP@50 also showed significant improvement when training with a mix of real and generated samples. These improvements indicate that the generated samples effectively augment the training dataset, enhancing the model’s ability to detect defects in challenging imaging scenarios, such as non-planar metal workpieces with reflective surfaces or complex geometries. It is noteworthy that the training process remained efficient, requiring only a few hours on an RTX 3080 GPU to generate high-quality defect samples, making it practical for industrial applications requiring rapid deployment. The experiments in
Section 3.6 and
Section 3.7 further validated the framework.
Section 3.6 demonstrated the effectiveness of the generated samples in augmenting the training data for object detection models.
Section 3.7 included comparative experiments with other defect generation methods (DFMGAN and Defect-GAN), demonstrating the performance superiority of our method in generating defects for automotive door lock strikes.
4. Conclusions and Future Work
This research overcomes the systemic bottleneck of fragmentation across imaging, generation, and recognition stages by introducing a task-driven, end-to-end paradigm for defect detection on highly reflective, complex metal workpieces. Through the co-design and closed-loop optimization of imaging physics, data synthesis, and recognition models, the solution transforms the challenging 3D defect recognition problem into a robust, deployable 2D vision system, significantly enhancing feasibility under industrial constraints of limited samples and harsh imaging conditions. The main contributions of this work, which are closely interlinked, are summarized as follows:
(1) A 3D-to-2D Defect Mapping Strategy via Systemic Co-design: The core of this work is a systematic modeling philosophy. It achieves a stable and reliable mapping from 3D surface defects to discriminable 2D image features through the co-optimization of the imaging subsystem, generative algorithms, and methods adapted for highly reflective metals. This strategy successfully converts a spatially complex 3D inspection task into a more tractable 2D recognition problem without reliance on 3D sensors or CAD models.
(2) A Robust Imaging Subsystem for Complex Reflective Workpieces: We developed a dedicated imaging device integrating precision fixturing, dome diffuse lighting, and a small-aperture lens. This hardware subsystem effectively suppresses specular highlights and irregular shadows at the source, providing uniformly illuminated, high-contrast images that serve as “clean” input for subsequent data generation and detection, thereby laying the physical foundation for system robustness.
(3) A Task-Driven Software Architecture for Defect Imaging and Generation: Constructed with the ultimate goal of “training a few-shot detection model,” this architecture integrates a two-stage generation process: a defect-patch synthesis stage based on an improved FastGAN with a Dynamic Multi-Granularity Fusion module, which efficiently learns defect representations; and a realistic fusion stage employing optimized Poisson editing and a composite loss for precise, natural-looking defect injection. This integrated pipeline provides a clear, systematic engineering pathway for few-shot defect data synthesis.
Both quantitative and qualitative experiments demonstrate that our integrated system surpasses traditional fragmented solutions in imaging quality, data generation fidelity (e.g., FID score), and ultimately, the performance of downstream few-shot detectors. From an industrial application perspective, this work significantly reduces dependency on massive manual annotation, decreases the cost of model training, and improves deployment practicality.
Although this study has made significant progress in improving defect sample generation quality and system integration, some challenges remain. For example, for the categories classified as “compound defects,” whose features vary widely and whose morphology is complex, model training is not yet sufficiently stable, and generating high-fidelity pixel-level details remains difficult, limiting the accurate representation of complex defect patterns.
Future research will focus on the following aspects. Firstly, we will further optimize the structure of the feature extraction and generation modules to enhance the model’s ability to adapt to and reconstruct diverse defect features, thereby improving the generation quality for complex defects. Secondly, we will investigate a robotic arm-assisted “pose unification strategy,” in which a robotic grasping and positioning system stably places the workpiece in the same location and orientation, effectively reducing imaging interference caused by pose variations and improving data consistency and generation stability. Furthermore, we will optimize the multi-light-source configuration and illumination parameters of the optical imaging system to further enhance the extraction of faint defects and improve feature visibility. We plan to extend this method to more types of non-planar metal workpieces with complex 3D structures, such as door hinges, support arms, and connectors, and to conduct extensive validation under diverse fixturing and lighting conditions to further test the system’s generalization ability and robustness. We will also explore controllable defect parameter generation, enabling the method to adapt to different manufacturing standards and quality assessment needs and further expanding its application potential in industrial visual inspection.