Abstract
The collection and annotation of large-scale image datasets remains a significant challenge in training vision-based AI models, particularly in industrial automation. This limitation is especially critical for quality inspection tasks within Flexible Manufacturing Systems and Batch-Size-of-One production, where high variability in components restricts the availability of relevant datasets. This study presents a pipeline for generating photorealistic synthetic images to support automated visual inspection. Rendered images derived from geometric models of manufactured parts are enhanced using a Cycle-Consistent Adversarial Network (CycleGAN), which transfers pixel-level features from real camera images. The pipeline is applied in two scenarios: (1) domain transfer between similar objects for data augmentation, and (2) domain transfer between dissimilar objects to synthesize images before physical production. The generated images are evaluated using mean Average Precision (mAP) and a Turing-test-style classification, respectively. The pipeline is further validated in two industrial setups: object detection for a pick-and-place task using a Niryo robot, and anomaly detection in products manufactured by a FESTO machine. The successful implementation of the pipeline demonstrates its potential to generate effective training data for vision-based AI in industrial applications and highlights the importance of enhancing domain quality in industrial synthetic data workflows.
1. Introduction
Recent advances in Artificial Intelligence (AI) have revolutionized the manufacturing industry, facilitating its digitalization. AI is currently applied in various industrial domains, such as optimizing manufacturing processes, product and process design, scientific machine learning, computational experimentation, and automation []. However, real-time AI-driven automation is still in its early stages.
One area where AI is having a significant impact is quality control. Vision-based systems are being used to inspect and assess the quality of manufactured components, aiming to enhance both precision and efficiency. In 2023, Swarit Anand Singh and Kaushal A. Desai proposed a computational framework for an automated vision-based defect detection system, achieving a classification accuracy of 99.7% and a precision rate of 100% []. Sarvesh Sundaram and Abe Zeid later proposed a custom Convolutional Neural Network (CNN) model to detect casting defects in submersible pump impellers. The model was trained on a set of 7348 images of submersible pump impellers and achieved an accuracy of 99.86%, illustrating that CNNs can be applied to highly specific industrial tasks, such as detecting casting defects [].
Despite these significant advancements, the widespread adoption of vision-based AI models in manufacturing quality inspection remains limited, largely due to their data-intensive nature. These models rely on large, varied, and high-quality datasets to train effectively, which presents a significant challenge for implementing reliable quality inspection in manufacturing. To address this issue, researchers have proposed various techniques, including data pre-processing and data augmentation [], transfer learning [], and few-shot learning []. In 2023, Anastasiia Safonova et al. conducted a survey on the problem of small data in remote sensing for deep learning applications, suggesting ten promising deep learning methods, such as zero-shot learning and ensemble deep learning, to enhance model performance []. Similarly, Chen Li et al. developed a domain adaptation YOLOv5 model for automatic surface defect detection on magnetic tile surfaces. Their study demonstrates that this model outperforms traditional mixed training and pre-train fine-tuning methods, particularly when limited datasets are available [].
Although these methods have shown promising results, they cannot be generalized to vision-based AI applications for quality control in industrial environments. Training a model for classification or detection requires a dataset covering each product class and its associated defects, which together span a large number of possibilities. Collecting such datasets is time-consuming and resource-intensive, making it impractical to deploy these vision-based AI models across all industries. This is particularly true in advanced manufacturing plants that use Flexible Manufacturing Systems (FMS) [] or Batch-Size-of-One (BSO) [] production. Due to the uniqueness of each product in these systems, applying the aforementioned techniques after collecting datasets post-production is neither efficient nor practical. This creates a critical need for methods that can generate training data for AI-based quality inspection without compromising the efficiency of industrial environments.
Our research aims to address these issues by using synthetic datasets to develop a solution that can be applied across industrial environments to detect defects without impacting the productivity of the manufacturing plant. Recent research highlights the growing significance of synthetic data in training AI models for industrial quality inspection. A notable study by Aleksei Boikov et al. (2021) [] explored the effectiveness of synthetic data by training two neural network models, U-Net and Xception, on datasets generated using Blender. Their findings demonstrated that a synthetic dataset could serve as a viable alternative when real-world data are scarce or difficult to obtain, producing robust and accurate models for steel defect recognition in automated production control systems [].
In 2023, we conducted an in-depth study on different methods for generating synthetic datasets for computer vision tasks in industrial settings. The study highlighted the benefits of using synthetic data, explored various generation techniques, and emphasized the importance of achieving photorealistic images to improve the performance of trained models []. In this paper, we present a pipeline that leverages geometric models of products manufactured in industries to generate realistic images for training AI models. This approach offers an efficient solution for industrial environments, as it allows data generation without disrupting overall plant productivity. It aligns with the growing adoption of digital twin technologies in various industries [].
The primary drawback of using synthetic data is that the performance of AI models relies heavily on how accurately the synthetic data replicates real camera images. In 2022, Ole Schmedemann et al. [] proposed a pipeline that enhances synthetic training data by rendering images of objects under inspection. In their approach, defects are incorporated as 3D models on the surface of the object, and domain randomization techniques are applied to minimize the domain gap between synthetic and real data []. Similarly, Krishnakant Singh et al. expanded on the assessment of synthetic data by examining the resilience of models developed with synthetic images generated by pre-trained diffusion models. They compared supervised, self-supervised, and multi-modal learning approaches against models trained on real images, analyzing their resilience to factors such as shape and background bias. Their results revealed that synthetic data models, particularly self-supervised and multi-modal ones, could match or even surpass real-image-trained models. However, these models were also more susceptible to adversarial attacks and real-world noise. Their findings suggest that a hybrid approach, combining real and synthetic data, could enhance the robustness of the model [].
Building upon this line of research, we propose a synthetic data generation pipeline that utilizes geometric models of industrial products and incorporates Cycle-Consistent Adversarial Networks (CycleGANs) [] for domain adaptation. This enables image-to-image translation between rendered images and real camera images, enhancing the realism of synthetic data and improving model generalization in real-world applications. The methodology for this pipeline is detailed in Section 2, and the generated images are presented in the results Section 3. The generated dataset is then evaluated by comparing it with camera-captured data, as shown in Section 4.1.
Furthermore, the proposed pipeline is utilized to train an object detection (OD) model deployed on the Niryo Robot for real-time detection of mechanical components for pick-and-place operations. In another application, the pipeline is used to detect anomalies in temperature sensors mounted on a casing cylinder in a FESTO production environment, as discussed in Section 4.2. This demonstrates the potential for industrial application in quality control, where vision-based AI models can be effectively trained using synthetically generated photorealistic images derived from geometric models of manufactured parts.
Despite these promising results, the limitations of the AI model, along with the evaluation results presented in Section 4.1, raise several critical questions that must be addressed to develop a fully autonomous defect detection pipeline for manufactured parts. These challenges and considerations are discussed in detail in Section 4.3 (Limitations of the Pipeline) and Section 5 (Conclusions and Future Work), respectively.
2. Materials and Methods
This work aims to address the challenge of acquiring large, diverse, and high-quality datasets required to train vision-based AI models for quality inspection in industrial settings, particularly in FMS and BSO production environments. As discussed in Section 1, the use of synthetic datasets presents a viable solution for addressing the challenges associated with data acquisition in such industrial environments. This approach takes advantage of the fact that most industrial products have digital representations, such as geometric models, which can be used to generate synthetic datasets.
The goal is to automate the creation of highly realistic synthetic datasets, eliminating the need for human participation. Figure 1 shows an automated pipeline that generates a photorealistic dataset for vision-based AI applications. To generate photorealistic images, the pipeline is divided into three primary phases. The first phase involves collecting or designing the geometric models of the products of interest. In this work, we use a set-up that includes multiple parts, such as aluminum extensions, t-nuts, ball bearings, and screws, as shown in Figure 1. These parts were modeled on a 1:1 scale using SolidWorks 2022 software. The resulting .STL files were then used in the next phase to generate synthetic images using a rendering tool.
Figure 1.
The pipeline for generating photorealistic synthetic dataset for vision-based AI applications.
There are various rendering tools available, such as KeyShot, SolidWorks, and Blender, among others. Since our goal is to automate the pipeline without human intervention, we chose to use the Blender Python module. This module allows us to create a Python v3.9.5 script that automates the image rendering process using the .STL files.
Typically, during rendering, several factors must be taken into account, including lighting, object color, the environment in which the object is placed, camera settings, material properties, and surface characteristics. These elements have a significant influence on the realism of the rendered images, and achieving optimal results usually requires expertise in the field. However, for our automated approach, we intentionally do not focus on these aspects. This is because the third phase of domain transfer adds realism to the images while only requiring basic rendered images of the objects, eliminating the need for complex visual details. A large variety of rendered images are generated by automatically rotating and moving the objects. This process ensures that multiple perspectives and angles of the objects are captured without requiring manual intervention.
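The listing below is a minimal sketch of such an automated rendering script for Blender's Python module (bpy); the file paths, placement ranges, resolution, and image counts are illustrative and not the exact values used in this work.

```python
# Minimal sketch of automated rendering with Blender's Python API (bpy).
# Paths, ranges, and counts are placeholders; run with Blender's bundled interpreter.
import math
import random
import bpy

STL_FILES = ["models/t_nut.stl", "models/ball_bearing.stl"]  # hypothetical .STL exports
OUTPUT_DIR = "renders"
NUM_IMAGES = 100

# Remove the default mesh objects but keep the camera and light
for obj in list(bpy.data.objects):
    if obj.type == "MESH":
        bpy.data.objects.remove(obj, do_unlink=True)

# Import the 1:1 scale geometric models
for stl in STL_FILES:
    bpy.ops.import_mesh.stl(filepath=stl)

parts = [o for o in bpy.data.objects if o.type == "MESH"]
scene = bpy.context.scene
scene.render.resolution_x = scene.render.resolution_y = 512

for i in range(NUM_IMAGES):
    # Randomly reposition and rotate each part to capture varied perspectives
    for part in parts:
        part.location = (random.uniform(-0.05, 0.05),
                         random.uniform(-0.05, 0.05), 0.0)
        part.rotation_euler = (0.0, 0.0, random.uniform(0.0, 2 * math.pi))
    scene.render.filepath = f"{OUTPUT_DIR}/render_{i:04d}.png"
    bpy.ops.render.render(write_still=True)
```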
The third phase involves mapping the source image domain to the target domain using CycleGANs []. CycleGANs are selected over other domain transfer models, such as Pix2Pix, primarily because of their ability to perform unpaired image-to-image translation, which is essential in industrial scenarios where paired datasets (i.e., matching real and rendered images) are scarce or unavailable. Unlike models that require supervised learning with matching image pairs, CycleGAN uses a cycle-consistency loss to ensure that an image translated to the other domain and then back again remains structurally consistent with the original.
This makes it particularly suitable for pixel-level transformations, allowing us to transfer complex visual details, such as lighting, shadows, material properties, and surface characteristics, from camera-captured images onto the rendered images from Phase 2, creating photorealistic results.
In this process, the CycleGAN model is trained using two input datasets: one containing the source domain A (camera-captured images) and the other containing the target domain B (rendered images from Phase 2). Once trained, the CycleGAN model can transfer the visual details from the real images to the rendered images, while preserving the geometric properties and positions of the objects in the rendered images.
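For reference, with generators G: A→B and F: B→A and discriminators D_A and D_B, the training objective follows the original CycleGAN formulation [], combining the adversarial losses with the cycle-consistency term:

```latex
\mathcal{L}_{\mathrm{cyc}}(G,F) =
  \mathbb{E}_{a \sim p_{\mathrm{data}}(a)}\big[\lVert F(G(a)) - a \rVert_{1}\big]
+ \mathbb{E}_{b \sim p_{\mathrm{data}}(b)}\big[\lVert G(F(b)) - b \rVert_{1}\big],
\qquad
\mathcal{L}(G,F,D_A,D_B) =
  \mathcal{L}_{\mathrm{GAN}}(G,D_B,A,B)
+ \mathcal{L}_{\mathrm{GAN}}(F,D_A,B,A)
+ \lambda\,\mathcal{L}_{\mathrm{cyc}}(G,F).
```

An additional identity loss, as used in the reference implementation, can be added with its own weight to encourage color preservation.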
The application of CycleGAN can be divided into two types, depending on the type of input dataset used during training:
- Domain Transfer with Similar Objects: In this scenario, the images in both domains contain the same objects. The model recognizes the objects in domain A and transfers the domain-specific details of that particular object to domain B, resulting in photorealistic images that still retain the object shapes and positions.
- Domain Transfer with Different Objects: In this case, the objects in the two domains are completely different. The model transfers the domain of one object to the other, producing a photorealistic image based purely on synthetic images. This method is particularly beneficial, as it enables the generation of realistic images of products prior to manufacturing, which is very useful in systems such as BSO and FMS production environments.
Both techniques are employed to create synthetic images, with the outcomes presented in Section 3. The synthetic data produced by the pipeline can subsequently be utilized to train vision-based AI models for tasks such as object classification (OC), object detection (OD), and anomaly detection (AD).
3. Results
In the previous section, we described how we generate rendered images from geometric models using the Blender Python module and how their domain is transferred using CycleGANs. In this section, we present the results produced by our pipeline. The outcomes are categorized based on the type of input data used for training the CycleGANs.
3.1. Domain Transfer with Similar Objects
In this case, we compile a small dataset of 400 images captured by a camera, featuring various parts such as aluminum extensions, ball bearings, t-nuts, and screws, as shown in Figure 2a. For each of these images, we create a synthetic counterpart by designing 1:1 scaled geometric models and rendering them using Blender’s Python module, as illustrated in Figure 2b. These 400 real images and 400 rendered images are then used to train the CycleGAN model to transfer the domain from the real images to the rendered ones. Additionally, the dataset size is expanded to 5000 images per domain (real and rendered) through data augmentation techniques, including image transposition and rotation.
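A minimal sketch of this kind of augmentation is shown below, using Pillow; the directory layout and the exact set of transformations are illustrative rather than the precise configuration used to reach 5000 images per domain.

```python
# Sketch of rotation/transposition augmentation to enlarge both domains.
# Directory names are placeholders; the paper's exact transform set may differ.
from pathlib import Path
from PIL import Image, ImageOps

SRC = Path("datasets/trainA_400")        # e.g., the 400 camera-captured images
DST = Path("datasets/trainA_augmented")
DST.mkdir(parents=True, exist_ok=True)

def variants(im):
    """Yield rotated and mirrored versions of an image (8 variants per input)."""
    for angle in (0, 90, 180, 270):
        rotated = im.rotate(angle, expand=True)
        yield f"rot{angle}", rotated
        yield f"rot{angle}_mirror", ImageOps.mirror(rotated)

for img_path in sorted(SRC.glob("*.png")):
    with Image.open(img_path) as im:
        for tag, out in variants(im):
            out.save(DST / f"{img_path.stem}_{tag}.png")
```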
Figure 2.
(a) Realistic camera-captured image. (b) Rendered image using Blender Python module. (c) CycleGAN domain-transferred image generated using 400 training images. (d) CycleGAN domain-transferred image generated using 5000 training images.
The training is carried out using the official CycleGAN repository [,], running on a 24 GB NVIDIA TITAN RTX GPU. We implemented the CycleGAN framework in PyTorch 20.12, employing the standard 70 × 70 PatchGAN architecture for the discriminator and a ResNet-based generator with 9 residual blocks. The number of filters in the final convolutional layer of the generator and the initial layer of the discriminator was set to 128. Both networks were trained using the Adam optimizer with momentum parameters β1 = 0.5 and β2 = 0.999 and an initial learning rate of 0.0002. The cycle-consistency loss was scaled by a factor of λ = 10, and an identity loss term with a weight of 0.5 was also included. The input images were initially resized to 286 × 286 pixels and then randomly cropped to 256 × 256 during training. For the models trained with 400 and 5000 images, the learning rate was kept constant for the first 100 and 50 epochs, respectively, and then linearly decayed to zero over the subsequent 10 and 20 epochs, resulting in total training durations of 110 and 70 epochs with a batch size of 1.
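For illustration, the settings above map onto the command-line options of the official CycleGAN implementation roughly as follows; this is a sketch only, the flag names follow the current public repository and may differ between versions, and the dataset and experiment names are placeholders.

```python
# Sketch of launching training with the official CycleGAN repository.
# Dataset/experiment names are placeholders; flag names may vary by repo version.
import subprocess

subprocess.run([
    "python", "train.py",
    "--dataroot", "./datasets/real2render",  # trainA: camera images, trainB: renders
    "--name", "real2render_cyclegan",
    "--model", "cycle_gan",
    "--netG", "resnet_9blocks",
    "--ngf", "128", "--ndf", "128",          # filter counts used in this work
    "--lr", "0.0002", "--batch_size", "1",
    "--load_size", "286", "--crop_size", "256",
    "--lambda_identity", "0.5",
    "--n_epochs", "100", "--n_epochs_decay", "10",  # 400-image variant (50/20 for 5000)
], check=True)
```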
Figure 2c,d show the outputs when the rendered image from Figure 2b is passed through the trained models. As observed, the models successfully preserve object geometry and positioning while realistically transferring background, lighting, shadows, material properties, and surface details onto the rendered images. However, the difference in visual quality between the models trained with 400 and 5000 images is marginal. Although the model trained on 5000 images exhibits slightly enhanced photorealism, particularly in the refinement of shadows and surface textures, the overall improvement is minimal.
This method is especially useful when working with a small dataset (like 400 images in this case), as it allows for the creation of more realistic images by simulating various scenarios. The ability to rotate and position objects differently provides virtually endless possibilities for generating additional images, thus creating larger datasets and improving AI model training without needing vast amounts of real-world data.
3.2. Domain Transfer with Different Objects
In this approach, we train a CycleGAN model to perform domain transfer between two datasets containing completely different objects. Initially, we construct a dataset of 400 real camera images of an aluminum extension, as shown in Figure 3a. In parallel, we generate a separate dataset consisting of 400 rendered images of a bracket using the Blender Python module, as illustrated in Figure 3b. These two unpaired datasets are used to train the CycleGAN model, which then produces the output shown in Figure 3c. While we do not manually apply data augmentation to the datasets, the official CycleGAN algorithm automatically performs random cropping, horizontal flipping, and scaling during training. The CycleGAN was trained using the same hyperparameters described in Section 3.1, except for the following modifications. The number of filters in the generator’s final convolutional layer and the discriminator’s first layer was set to 64. The model was trained for 60 epochs using a fixed learning rate of 0.0002, and no learning rate decay was employed. The trained model is able to transfer various visual properties, such as background details, material appearance, and surface textures, from the real aluminum extension images to the rendered bracket images. Interestingly, the model also learns to incorporate structural features such as the bracket’s slot during the transformation.
Figure 3.
(a) Camera-captured image. (b) Rendered image using Blender Python module. (c) CycleGAN domain-transferred image generated using 400 training images.
However, the generated images fall short of achieving the full level of realism seen in actual camera-captured images. Although the model effectively transfers some material characteristics and surface textures, it has difficulty capturing more nuanced visual elements, such as consistent lighting, natural shadow formation, and depth perception. These limitations are more evident when compared to the results obtained from domain transfer involving similar objects. To address this, we conducted multiple experiments, including extending the number of training epochs and adjusting various hyperparameters. However, these attempts led to overfitting, where the model began to reproduce the geometry of the objects themselves rather than merely transferring surface-level visual features. This signifies a saturation threshold at which further training degrades the model’s intended behavior. A comprehensive discussion of this limitation is provided in Section 4.3.
The quality of these generated images is further evaluated in Section 4.1. This domain transfer technique allows realistic material and surface characteristics of physical objects, such as metallic parts, to be projected onto synthetic images created from geometric models, even before the actual production of those parts. This proves especially useful for generating high-quality datasets to train vision-based AI models in FMS and BSO environments, where each variant of a product is unique. Moreover, this data generation process does not interfere with the operational productivity of the manufacturing facility.
4. Discussion
4.1. Evaluation
In the previous section, we presented the photorealistic images generated using our pipeline. It is crucial to evaluate the quality of these output images, especially since they are used to train vision-based AI models. Although various metrics are available for assessing image quality, such as MSE (Mean Square Error), PSNR (Peak Signal-to-Noise Ratio), SSIM (Structured Similarity Indexing Method), and FSIM (Feature Similarity Indexing Method) [], we assess the quality of the dataset generated by our pipeline relative to camera-captured images through its downstream use. Given that the primary purpose of the dataset is to train AI models, the evaluation focuses on the performance of models trained on it. Specifically, the mean Average Precision (mAP) achieved by the trained models serves as a benchmark for assessing the effectiveness of the domain-transferred synthetic images relative to real images of similar objects. For domain transfer involving different objects, a simplified Turing test approach is adopted, wherein a classifier is used to evaluate whether the generated images are visually indistinguishable from real images. This approach offers deeper insight into the behavior of AI models across different datasets, as discussed later in this section.
4.1.1. Domain Transfer with Similar Objects
To evaluate this, we train OD models and use the mean Average Precision (mAP) at 0.5–0.95 Intersection over Union (IoU) as the metric to assess the dataset generated by our pipeline.
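For reference, this metric averages the per-class Average Precision over ten IoU thresholds between 0.5 and 0.95 in steps of 0.05:

```latex
\mathrm{mAP}_{0.5:0.95}
  = \frac{1}{|C|} \sum_{c \in C} \frac{1}{10}
    \sum_{t \in \{0.50,\,0.55,\,\dots,\,0.95\}} \mathrm{AP}_c(t),
```

where AP_c(t) denotes the average precision of class c computed at IoU threshold t and C is the set of object classes.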
There are three key factors that contribute to this evaluation metric:
- The architecture of the OD models;
- The domain of the training images;
- The number of training images.
All three factors are considered in our evaluation, and the results are shown in Figure 4 and Figure 5. To address the different architectures, we use four state-of-the-art OD models: YOLOv7 [], EfficientDet [], Faster R-CNN [], and SSD MobileNet []. To account for the domain of the images, we use four different datasets (the datasets can be provided for research purposes upon request): the real camera-captured images (Figure 2a), rendered images from the Blender Python module (Figure 2b), CycleGAN-generated images trained with 400 images (Figure 2c), and CycleGAN-generated images trained with 5000 images (Figure 2d).
Figure 4.
Mean Average Precision (mAP) at 0.5–0.95 IoU using 50 training images.
Figure 5.
Mean Average Precision (mAP) at 0.5–0.95 IoU using 400 training images.
The OD models are trained on these datasets and then tested on the real camera-captured images to compare the models’ performance on real-world data when trained on images from different domains.
Additionally, we train the models on two distinct dataset sizes: 50 images and 400 images. For each OD model, all hyperparameters, such as learning rate, number of epochs, batch sizes, and optimizers, are kept consistent throughout training on different domain images. This allows us to evaluate how the domain of the images affects the models’ performance. The code for evaluating the datasets using OD models is available in our GitHub repository [].
The graphs presented in Figure 4 and Figure 5 show the results of the dataset evaluation. The x-axis indicates the training and testing data combinations, defined by the domain of the dataset. The OD models were trained on real images, rendered images, and CycleGAN-generated images produced with 400 and 5000 training images (as shown in Figure 2a–d), and tested on real camera-captured images. We also included results from an OD model trained and tested on rendered images to assess performance on this data type. The y-axis represents the mAP at 0.5–0.95 IoU, shown as a percentage.
From the graphs, we observe that the models perform well when trained and tested within the same domain, averaging 79.9% and 84.8% for 50 training images, and 84% and 88.4% for 400 training images, for real and rendered images, respectively, across the four OD models. However, when the models are trained on rendered images and tested on real images, their performance drops significantly, averaging 56.2% for 50 training images and 64.1% for 400 training images. This performance drop highlights the importance of domain similarity between the training and testing images, indicating that the domain of the training data significantly impacts model performance. Real images include variability in lighting, color, shadows, and occlusions, while rendered images lack these variations, leading to a performance gap. This indicates that more photorealistic synthetic data are required to train models effectively.
We also see that performance improves for CycleGAN-generated images compared to rendered images. For models trained on CycleGAN-generated datasets, the average mAP is 73.9% and 74.6% for 50 images, and 79.1% and 77.1% for 400 images, corresponding to training CycleGAN with 400 and 5000 images, respectively. This indicates that using data generated by our pipeline results in better model performance compared to using a purely rendered dataset.
Furthermore, the SSD MobileNet model performs the worst, with a sharp reduction to about 40% mAP when trained on rendered images and tested on real images, demonstrating that model architecture has a substantial influence on performance with synthetic datasets. We also observe that increasing the number of training images from 50 to 400 leads to higher mAP scores for all models except EfficientDet, which shows a decline, and the improvement is particularly pronounced for YOLOv7, indicating that the underlying model architecture plays a significant role in how performance scales with training data size.
In summary, by enhancing the rendered images with added realism through CycleGANs, our pipeline significantly improves the training of OD models compared to using purely rendered images. The average mAP achieved across models trained on our synthetic dataset is 74% with 50 training images and 79.1% with 400 training images. These results closely approach the performance of models trained on real camera-captured images, which reached an average mAP of 80% and 84% for 50 and 400 training images, respectively, across the four OD models.
Recent advances in the field emphasize that achieving high mAP scores—typically higher than 90%—is vital for reliable manufacturing quality inspection [,], where both high precision and recall are crucial, as even minimal classification errors can result in defective products or costly recalls. In our implementation, the highest performance was observed with the YOLOv7 model trained on 400 real camera images, achieving a mAP of 98.3%. When trained on our synthetic dataset of the same size, the model achieved a mAP of 90.4%. This level of accuracy is particularly important in industrial settings, such as automated visual inspection of components, where strict quality standards and low defect tolerance are mandatory. Figure 6a,b show the confusion matrices of the YOLOv7 model trained on real images and CycleGAN-generated images, respectively, with both models evaluated on real camera images. As observed, the number of false negatives increases when the model is trained on generated images compared to training on real images. The most notable differences arise in the background false-negative and false-positive entries, indicating that the model frequently fails to detect certain objects and also incorrectly classifies background regions as objects, respectively. Ongoing work focuses on identifying the root causes of these errors and mitigating them to enhance the realism and diversity of synthetic data, with the goal of narrowing the 7.9% performance gap between models trained on real versus synthetic datasets.
Figure 6.
Confusion matrix for testing YOLOv7 on real images when trained on (a) real images and (b) CycleGAN-generated images.
4.1.2. Domain Transfer with Different Objects
In this section, we assess the quality of the images generated by CycleGAN when domain transfer is carried out between images containing entirely different objects. As depicted in Figure 7c, each output image features a single object. To evaluate the suitability of these generated images for training AI models, we employ an OC strategy inspired by the Turing test []. A classification model is trained using real camera-captured images and Blender-rendered images, as shown in Figure 7a,b. The trained model is then tested on CycleGAN-generated images, which are labeled as real. This approach helps to determine whether the classification model perceives the synthetic images as real. To quantify the results, a confusion matrix is generated, highlighting whether the model’s predictions align more closely with real images or with rendered counterparts.
Figure 7.
(a) Camera-captured image. (b) Rendered image using Blender Python module. (c) CycleGAN domain-transferred image generated using 400 training images.
For this experiment, a dataset comprising 100 real camera-captured images of a bracket (Figure 7a) and 100 rendered images (Figure 7b) was used to train a MobileNetV2 classification model. The trained model was first evaluated on a test set consisting of 100 real and 100 rendered images. As shown in Figure 8a, the resulting confusion matrix indicates that the model achieved 100% classification accuracy, perfectly distinguishing between real and rendered images.
Figure 8.
Confusion Matrix for Classification on (a) Camera-captured Images versus Rendered images (b) CycleGAN-Generated Images versus Rendered images.
The model is then evaluated on 100 CycleGAN-generated images (Figure 7c), labeled as real, and 100 rendered images. The confusion matrix in Figure 8b reveals that only 67 of the CycleGAN-generated images were correctly classified as real by the model, whereas 33 were misclassified as synthetic. This decline in classification performance highlights that the generated images still fall short of full photorealism. As discussed in Section 4.1.1, deploying AI-based models in industrial settings demands a high level of accuracy, ideally above 90%. The current misclassification rate of 33% indicates that further enhancements in image generation quality are needed, especially when transferring between domains with dissimilar objects, to improve the reliability of these models for vision-based training applications.
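The Turing-test-style evaluation described above can be sketched as follows with a tf.keras MobileNetV2 backbone; the framework choice, directory layout, and training settings here are assumptions for illustration, not the exact implementation used in this work.

```python
# Sketch of the Turing-test style evaluation (assumed tf.keras implementation).
# Folder names are placeholders; each split has "real/" and "rendered/" subfolders.
import tensorflow as tf

IMG_SIZE = (224, 224)
CLASS_NAMES = ["rendered", "real"]   # label 0 = rendered, label 1 = real

# Step 1: train on real camera images vs. Blender renders
train_ds = tf.keras.utils.image_dataset_from_directory(
    "turing_test/train", class_names=CLASS_NAMES,
    image_size=IMG_SIZE, batch_size=32)

base = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SIZE + (3,), include_top=False, weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),   # scale pixels to [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),       # P(image is real)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=10)

# Step 2: CycleGAN outputs are placed in the "real/" folder (i.e., labeled as real);
# the resulting accuracy/confusion matrix shows how many are accepted as real (Figure 8b).
test_ds = tf.keras.utils.image_dataset_from_directory(
    "turing_test/generated_vs_rendered", class_names=CLASS_NAMES,
    image_size=IMG_SIZE, batch_size=32, shuffle=False)
model.evaluate(test_ds)
```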
4.2. Application of the Synthetic Data Generation Pipeline
Building on the previous discussion of the synthetic image generation pipeline and its evaluation, this section presents its application in two real-world industrial use cases:
- Real-time object detection using a Niryo robot [].
- Anomaly detection in temperature sensors assembled by an in-house Festo production system.
These case studies demonstrate the practical feasibility and industrial relevance of the proposed synthetic data-driven training approach for both OD and AD tasks.
4.2.1. Object Detection Using Niryo Robot
The adoption of robotic systems for automation has seen significant growth in recent years. In this section, we demonstrate the application of our proposed pipeline within such a robotic system for OD in a pick-and-place operation. The primary objective is to evaluate the performance of a model trained using synthetically generated images, produced using our pipeline, when deployed in a real-world industrial environment.
Figure 9 illustrates the complete pipeline for training an OD model using CycleGAN-generated images and deploying it in a real-time application with the Niryo Robot. The first step involves labeling the CycleGAN-generated images for the OD task, which is done using the labelImg 1.8.6 tool []. After labeling the images, the SSD MobileNet model from the TensorFlow model zoo [] is trained for 1000 epochs to perform the OD task. The trained model is then used to detect mechanical parts in real-time camera images from the Niryo robot, as shown in Figure 10, which includes both the Niryo robot and the mechanical parts, along with the output of the camera system performing real-time OD.
Figure 9.
Object detection with the Niryo robot using a model trained on the synthetic dataset generated by our pipeline.
Figure 10.
(a) Niryo Robot Setup. (b) Output of Niryo Robot object detection pipeline.
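The real-time detection step described above can be sketched as follows using an exported TensorFlow Object Detection API SavedModel; the model path and the OpenCV camera source are placeholders, since the deployed system streams images from the Niryo robot's camera.

```python
# Sketch of real-time inference with a TF2 Object Detection API SavedModel.
# Paths and the camera source are placeholders for the deployed Niryo setup.
import cv2
import numpy as np
import tensorflow as tf

detect_fn = tf.saved_model.load("exported_ssd_mobilenet/saved_model")

cap = cv2.VideoCapture(0)            # stand-in for the robot's camera stream
ret, frame = cap.read()
cap.release()

rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
input_tensor = tf.convert_to_tensor(rgb[np.newaxis, ...], dtype=tf.uint8)

detections = detect_fn(input_tensor)
boxes = detections["detection_boxes"][0].numpy()    # normalized [ymin, xmin, ymax, xmax]
scores = detections["detection_scores"][0].numpy()
classes = detections["detection_classes"][0].numpy().astype(int)

for box, score, cls in zip(boxes, scores, classes):
    if score > 0.5:
        print(f"class {cls} detected with confidence {score:.2f} at {box}")
```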
The entire pipeline is executed on a 24 GB NVIDIA TITAN RTX GPU. To streamline and automate the process, we use Docker containers. Each step of the pipeline, such as rendering with the Blender Python module, training the CycleGAN model, generating photorealistic images, labeling the images using labelImg, training the SSD MobileNet model for OD, and deploying the trained model for real-time OD with the Niryo robot, runs in its own dedicated Docker container. These containers are linked through a bash script, which manages their activation, deactivation, and file transfers during the pipeline’s execution. The code for this pipeline is available on our GitHub repository [].
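The orchestration idea can be sketched as follows; the pipeline itself uses a bash script, so this Python equivalent, with hypothetical image names and entry commands, is only illustrative of the sequential container execution.

```python
# Illustrative sequential orchestration of the pipeline's Docker containers.
# Image names, entry commands, and paths are placeholders.
import subprocess

SHARED = "/data/pipeline"   # host folder mounted into every container for file hand-off

STAGES = [
    ("blender-render", ["python", "render_parts.py"]),
    ("cyclegan",       ["python", "train.py", "--model", "cycle_gan"]),
    ("cyclegan",       ["python", "test.py", "--model", "cycle_gan"]),
    ("labelimg",       ["python", "label_images.py"]),
    ("tf-od-train",    ["python", "train_ssd_mobilenet.py"]),
    ("niryo-deploy",   ["python", "run_realtime_detection.py"]),
]

for image, command in STAGES:
    # --gpus all exposes the GPU to the container; -v shares intermediate files
    subprocess.run(
        ["docker", "run", "--rm", "--gpus", "all",
         "-v", f"{SHARED}:/workspace", "-w", "/workspace", image, *command],
        check=True)  # abort the pipeline if any stage fails
```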
4.2.2. Anomaly Detection of Temperature Sensors
The core focus of our work is the use of AI models for defect detection in manufactured products. This is achieved through an AD model trained exclusively on images of defect-free products, enabling the identification of deviations that may indicate defects. In this section, we demonstrate the application of the proposed pipeline within the FESTO production system to detect anomalies in assembled products.
Figure 11 presents the complete pipeline used to train an AD model based on CycleGAN-generated synthetic images, and its deployment in a real-time industrial application. Specifically, the pipeline is applied to detect anomalies in temperature sensors mounted by a Festo machine system, which comprises multiple workstations capable of assembling sensors onto casing cylinders. The trained AD model verifies whether the sensor has been correctly mounted and outputs the corresponding AD results.
Figure 11.
Anomaly detection in temperature sensors assembled by the Festo machine, using a model trained on the synthetic dataset generated by our pipeline.
The process of collecting rendered images using geometric models and applying domain transfer via CycleGAN is consistent with the previously discussed Niryo Robot use case. The synthetic images generated through this approach are used to train the Reverse Distillation AD model [], as implemented in the Anomalib library []. This model requires only non-defective images during training; it subsequently identifies deviations from the learned normal patterns and produces output in the form of prediction scores, segmentation maps, heatmaps, and binary anomaly masks. For inference, real-time images are captured using an IDS camera mounted on the Festo machine, as shown in Figure 11.
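A minimal training sketch using the Anomalib library is shown below; class and module names follow the v1.x API and may differ between Anomalib releases, and the directory names and epoch count are placeholders.

```python
# Sketch of training the Reverse Distillation AD model with Anomalib (v1.x API);
# module paths and directory names are assumptions for illustration.
from anomalib.data import Folder
from anomalib.engine import Engine
from anomalib.models import ReverseDistillation

datamodule = Folder(
    name="temperature_sensor",
    root="datasets/festo_sensor",
    normal_dir="good_synthetic",   # CycleGAN-generated images of correctly mounted sensors
    abnormal_dir="defective",      # real defective captures, used only for evaluation
)

model = ReverseDistillation()
engine = Engine(max_epochs=50)     # epoch count is illustrative

engine.fit(model=model, datamodule=datamodule)              # train on normal images only
results = engine.test(model=model, datamodule=datamodule)   # reports image_AUROC, etc.
```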
The entire pipeline is executed on a 24 GB NVIDIA TITAN RTX GPU using Docker containers, as previously described. The complete source code for this implementation is publicly available via our GitHub repository [].
The results produced by the Reverse Distillation AD model are presented in Figure 12a,b. In Figure 12a, no anomalies are detected, as the temperature sensor is correctly mounted on the casing cylinder. This is reflected in the segmentation map and masks, which indicate a normal condition. In contrast, Figure 12b illustrates the scenario in which the temperature sensor is improperly mounted on the casing cylinder. In this case, the model successfully identifies the anomaly, generating segmentation maps, masks, and bounding boxes that highlight the defective region, as shown in Figure 12b.
Figure 12.
Output of Festo Machine Anomaly Detection. (a) Anomaly Detection with no anomalies—Good Mounting (b) Anomaly Detection with anomalies—Bad Mounting.
The results obtained in the use cases mentioned above, featuring an average precision of 0.958 at 0.75 IoU for OD and an image_AUROC of 0.97 for AD, indicate that the proposed pipeline’s performance is comparable to conventional OD and AD models trained on real-world camera images. This highlights the viability of synthetic data for training high-performance vision models in industrial contexts. In addition, the pipeline offers promising opportunities for further development. For example, outputs from OD models can be integrated into downstream industrial processes such as automated pick-and-place operations, thereby enhancing the adaptability and efficiency of robotic systems like the Niryo robot. Similarly, the predictions generated by AD models can facilitate automated sorting of nondefective and defective components, providing a scalable solution for quality assurance in manufacturing environments.
4.3. Limitations of the Pipeline
A key drawback of the proposed pipeline is its dependence on generative models such as CycleGAN for domain adaptation. Achieving high-quality results with these models requires precise tuning of hyperparameters and careful selection of the number of training epochs to prevent both overfitting and underfitting. As illustrated in Figure 13, several types of errors were observed during domain transfer involving similar and different objects. In the case of similar objects, overfitting frequently led to issues such as objects blending into the background or completely disappearing, as shown in Figure 13a,b. Distortions in object geometry (Figure 13b) and hallucinated features such as holes (Figure 13c) were also observed. When domains were transferred between different objects, overfitting became even more pronounced, leading to severe artifacts, as seen in Figure 13d,e. Additional anomalies such as inconsistent color tones or the appearance of white bubbles were also evident, particularly in Figure 13f.
Figure 13.
Examples of typical errors observed during CycleGAN domain transfer.
Although the cycle-consistency loss in CycleGAN is effective, it does not explicitly guarantee accurate preservation of geometric structure. Addressing these shortcomings often requires considerable manual intervention, such as reviewing intermediate outputs and adjusting hyperparameters to maintain structural integrity when results are unsatisfactory, making it difficult to establish a fully automated workflow. Although automated hyperparameter optimization techniques such as grid search or random search could help identify ideal settings, they require significantly more computational resources and longer training times. Future work will explore strategies to minimize these artifacts and develop more efficient methods for automatically determining optimal hyperparameters to enhance overall model performance.
5. Conclusions and Future Work
In this work, we introduced a domain transfer pipeline that utilizes Cycle-Consistent Adversarial Networks (CycleGANs) to narrow the domain gap between real and synthetic data, with a specific focus on enhancing the datasets used to train AI models in manufacturing scenarios such as FMS and BSO production systems. Through comprehensive experimentation, we showed that the proposed approach significantly boosts the training performance of OD and classification models over conventional rendered datasets. Two strategies were employed based on the nature of the data: domain transfer between visually similar objects yielded the highest performance, achieving a mean Average Precision (mAP) of 90.4% using the YOLOv7 model trained on 400 CycleGAN-generated images. In contrast, for domain transfer between dissimilar objects, 67% of the generated images were accepted as real by a classifier trained to distinguish real from rendered images.
Although the synthetic images produced were of comparable quality to real-world data, showing a 7.9% difference in mAP, they also revealed key limitations of the generative models. Issues such as overfitting, hallucinations, and distortion of object geometry were observed. In the case of domain transfer between dissimilar objects, additional shortcomings, such as poor rendering of lighting, shadows, and depth, further limited the effectiveness of the generated images for use in high-precision industrial inspection tasks.
Nevertheless, our two case studies, OD using the Niryo Robot and AD in FESTO-assembled temperature sensors, demonstrate the practical viability of using domain-transferred synthetic images. They provide a cost-effective and scalable alternative to real data in scenarios in which data collection is difficult or resource-intensive. However, the requirement for manual hyperparameter tuning and frequent human validation presents a barrier to complete pipeline automation.
Future research will advance along two primary directions. The first involves improving the photorealism and overall fidelity of the generated images, with particular emphasis on challenging domain transfers between different objects. In 2024, Yue He et al. introduced a cycle-transformer architecture that utilizes patch-level self-attention and cross-level attention to optimize the style-mapping function, enabling domain transfer from only a single set of input images []. This approach presents a promising alternative for high-quality style transfer, especially when dealing with domain transfer between different objects. Its ability to operate with reduced training data also makes it well-suited for synthetic data generation in advanced manufacturing environments such as FMS and BSO systems. Additionally, efforts will be directed toward improving the training pipeline through automated strategies, including hyperparameter tuning and automatic data labeling for OD, with the goal of achieving a fully autonomous workflow.
The second research direction aims to gain a deeper understanding of the factors that influence the performance of vision-based AI models. Specifically, it seeks to identify which image features contribute most to model behavior. For instance, we observe a significant performance drop in the SSD MobileNet model when trained on Blender-generated images, while YOLOv7 shows a notable mAP improvement when the training images increase from 50 to 400. These findings emphasize the significance of identifying the factors that activate specific responses in neural networks. To investigate this, techniques such as feature visualization [] and attribution will be employed to understand what individual layers look for in an image and which regions of the image influence the predictions, respectively.
In 2024, Alain Andres et al. [] introduced the Detector Morphological Fragmental Perturbation Pyramid (D-MFPP), which generates explanation masks using multi-level superpixels, along with D-Deletion, a method for assessing both class probability and localization quality in object detection explainability. Their work demonstrated improved trustworthiness in industrial applications by producing more focused, object-specific explanations []. These insights can guide improvements in the image generation process, ensuring that synthetic images align more effectively with the features on which AI models depend.
Ultimately, our goal is to develop a fully autonomous and generalizable pipeline capable of generating high-quality synthetic data tailored for diverse vision-based applications in industrial settings such as FMS and BSO production systems, thus advancing the role of AI in smart manufacturing.
Author Contributions
Conceptualization, N.N. and J.E.; Methodology, N.N. and J.E.; Software, N.N.; Validation, N.N.; Formal analysis, N.N.; Investigation, N.N.; Resources, N.N.; Writing—original draft, N.N.; Writing—review & editing, N.N. and J.E.; Supervision, J.E.; Project administration, J.E.; Funding acquisition, J.E. All authors have read and agreed to the published version of the manuscript.
Funding
This paper is part of the KIDZ project funded by the Carl Zeiss Stiftung.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The project is accessible via GitHub at https://github.com/nishanthnandakumar/Synthetic-data-Generation-Pipeline- (accessed on 14 September 2025), and the associated dataset is available from the authors on request for research purposes.
Conflicts of Interest
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
References
- Plathottam, S.J.; Rzonca, A.; Lakhnori, R.; Iloeje, C.O. A review of artificial intelligence applications in manufacturing operations. J. Adv. Manuf. Process. 2023, 5, e10159. [Google Scholar] [CrossRef]
- Singh, S.A.; Desai, K.A. Automated surface defect detection framework using machine vision and convolutional neural networks. J. Intell. Manuf. 2023, 34, 1995–2011. [Google Scholar] [CrossRef]
- Sundaram, S.; Zeid, A. Artificial intelligence-based smart quality inspection for manufacturing. Micromachines 2023, 14, 570. [Google Scholar] [CrossRef] [PubMed]
- Maharana, K.; Mondal, S.; Nemade, B. A review: Data pre-processing and data augmentation techniques. Glob. Transit. Proc. 2022, 3, 91–99. [Google Scholar] [CrossRef]
- Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A comprehensive survey on transfer learning. Proc. IEEE 2020, 109, 43–76. [Google Scholar] [CrossRef]
- Wang, Y.; Yao, Q.; Kwok, J.T.; Ni, L.M. Generalizing from a few examples: A survey on few-shot learning. ACM Comput. Surv. (Csur) 2020, 53, 1–34. [Google Scholar] [CrossRef]
- Safonova, A.; Ghazaryan, G.; Stiller, S.; Main-Knorn, M.; Nendel, C.; Ryo, M. Ten deep learning techniques to address small data problems with remote sensing. Int. J. Appl. Earth Obs. Geoinf. 2023, 125, 103569. [Google Scholar] [CrossRef]
- Li, C.; Yan, H.; Qian, X.; Zhu, S.; Zhu, P.; Liao, C.; Tian, H.; Li, X.; Wang, X.; Li, X. A domain adaptation YOLOv5 model for industrial defect inspection. Measurement 2023, 213, 112725. [Google Scholar] [CrossRef]
- Kostal, P.; Velisek, K. Flexible manufacturing system. World Acad. Sci. Eng. Technol. 2011, 77, 825–829. [Google Scholar]
- Jin, Z.; Marian, R.M.; Chahl, J.S. Achieving batch-size-of-one production model in robot flexible assembly cells. Int. J. Adv. Manuf. Technol. 2023, 126, 2097–2116. [Google Scholar] [CrossRef]
- Boikov, A.; Payor, V.; Savelev, R.; Kolesnikov, A. Synthetic data generation for steel defect detection and classification using deep learning. Symmetry 2021, 13, 1176. [Google Scholar] [CrossRef]
- Nandakumar, N.; Eberhardt, J. Overview of Synthetic Data Generation for Computer Vision in Industry. In Proceedings of the 2023 8th International Conference on Mechanical Engineering and Robotics Research (ICMERR), Krakow, Poland, 8–10 December 2023; pp. 31–35. [Google Scholar]
- Singh, M.; Srivastava, R.; Fuenmayor, E.; Kuts, V.; Qiao, Y.; Murray, N.; Devine, D. Applications of digital twin across industries: A review. Appl. Sci. 2022, 12, 5727. [Google Scholar] [CrossRef]
- Schmedemann, O.; Baaß, M.; Schoepflin, D.; Schüppstuhl, T. Procedural synthetic training data generation for AI-based defect detection in industrial surface inspection. Procedia CIRP 2022, 107, 1101–1106. [Google Scholar] [CrossRef]
- Singh, K.; Navaratnam, T.; Holmer, J.; Schaub-Meyer, S.; Roth, S. Is synthetic data all we need? benchmarking the robustness of models trained with synthetic images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2505–2515. [Google Scholar]
- Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
- Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
- Sara, U.; Akter, M.; Uddin, M.S. Image quality assessment through FSIM, SSIM, MSE and PSNR—A comparative study. J. Comput. Commun. 2019, 7, 8–18. [Google Scholar] [CrossRef]
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
- Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
- Nandakumar, N. nishanthnandakumar/Synthetic-Data-Generation-Pipeline. GitHub. Available online: https://github.com/nishanthnandakumar/Synthetic-data-Generation-Pipeline- (accessed on 14 September 2025).
- Li, A.; Hamzah, R.; Rahim, S.K.N.A.; Gao, Y. YOLO algorithm with hybrid attention feature pyramid network for solder joint defect detection. IEEE Trans. Components Packag. Manuf. Technol. 2024, 14, 1493–1500. [Google Scholar] [CrossRef]
- Dehaerne, E.; Dey, B.; Halder, S.; De Gendt, S. Optimizing YOLOv7 for semiconductor defect detection. In Proceedings of the Metrology, Inspection, and Process Control XXXVII, San Jose, CA, USA, 26 February–2 March 2023; Volume 12496, pp. 635–642. [Google Scholar]
- Turing, A.M. Computing machinery and intelligence. In Parsing the Turing Test: Philosophical and Methodological Issues in the Quest for the Thinking Computer; Springer: Dordrecht, The Netherlands, 2007; pp. 23–65. [Google Scholar]
- Ned2—Educational Desktop Robotic Arm. Niryo. Available online: https://niryo.com/product/educational-desktop-robotic-arm/ (accessed on 28 November 2024).
- Tzutalin, D. HumanSignal LabelImg. Github Repository. 2015. Available online: https://github.com/HumanSignal/labelImg (accessed on 25 November 2022).
- Yu, H.; Chen, C.; Du, X.; Li, Y.; Rashwan, A.; Hou, L.; Jin, P.; Yang, F.; Liu, F.; Kim, J.; et al. TensorFlow Model Garden. GitHub Repository. 2020. Available online: https://github.com/tensorflow/models (accessed on 5 July 2024).
- Deng, H.; Li, X. Anomaly detection via reverse distillation from one-class embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9737–9746. [Google Scholar]
- Akcay, S.; Ameln, D.; Vaidya, A.; Lakshmanan, B.; Ahuja, N.; Genc, U. Anomalib: A deep learning library for anomaly detection. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 1706–1710. [Google Scholar]
- He, Y.; Chen, L.; Yuan, Y.J.; Chen, S.Y.; Gao, L. Multi-level patch transformer for style transfer with single reference image. In International Conference on Computational Visual Media; Springer Nature: Singapore, 2024; pp. 221–239. [Google Scholar]
- Nguyen, A.; Yosinski, J.; Clune, J. Understanding neural networks via feature visualization: A survey. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning; Springer International Publishing: Cham, Switzerland, 2019; pp. 55–76. [Google Scholar]
- Andres, A.; Martinez-Seras, A.; Laña, I.; Del Ser, J. On the black-box explainability of object detection models for safe and trustworthy industrial applications. Results Eng. 2024, 24, 103498. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).