Peer-Review Record

A Synthetic Image Generation Pipeline for Vision-Based AI in Industrial Applications

Appl. Sci. 2025, 15(23), 12600; https://doi.org/10.3390/app152312600
by Nishanth Nandakumar * and Jörg Eberhardt
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4:
Reviewer 5: Anonymous
Submission received: 15 September 2025 / Revised: 20 November 2025 / Accepted: 25 November 2025 / Published: 28 November 2025
(This article belongs to the Special Issue Artificial Intelligence for Industrial Informatics)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper addressed a highly relevant industrial problem: the data bottleneck for vision AI in flexible manufacturing. It presented a complete, end-to-end pipeline with two compelling, real-world industrial case studies. The text provides a thorough and candid analysis of the generative approach’s weaknesses and failure points.

However, several issues remain:

1. Key experimental details required for reproducibility are missing, particularly hyperparameters for both the object detection models and the CycleGAN.

2. There are some discrepancies in how the numbers are described; see the attached file.

3. There are a few grammatical errors; see the attached file.

4. Figure 4 (“...using 50 training images”) and Figure 5 (“...using 400 training images”) are almost visually identical; consider merging them into a single, clearer comparison plot.

5. The “True/Predicted” labels for Tables 1 and 2 are a little unclear. Standard confusion matrix notation is “Predicted Label” on the x-axis and “True Label” on the y-axis. Consider reformatting for clarity. 

6. The current results show that the YOLOv7 model’s mAP drops from 98.3% on real data to 90.4% on synthetic data. Instead of just reporting the final score, a deeper discussion could analyze the types of errors the synthetic-trained model makes more frequently. 

7. A good way to improve your introduction is to demonstrate that the problem of limited data is a well-known issue that’s being addressed in various ways. By presenting your work (photorealistic rendering + GANs) as a strong alternative to statistical generation methods like SMOTE, you clearly define your specific contribution. For example, virtual sample generation in manufacturing can be used for comparison. Check the attached file.

8. It’s recommended to contextualize the “future work” suggestion from a simple idea into a well-supported, state-of-the-art research direction. It shows you understand how XAI can be used to gain physical insights, diagnose anomalies, and build trust in black-box models, which is exactly what’s needed for industrial adoption. Check the attached file. 
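Regarding point 5, the row/column convention can be illustrated with a minimal, hypothetical sketch (NumPy assumed; the labels and counts below are illustrative, not taken from the paper):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true label (y-axis), columns the predicted label (x-axis)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Hypothetical two-class example: one sample of class 1 is misclassified as 0,
# so it lands in row 1 (true label) and column 0 (predicted label).
cm = confusion_matrix([0, 1, 1], [0, 0, 1], n_classes=2)
```

Formatting Tables 1 and 2 so that each row is a true class and each column a predicted class would match this widely used convention.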

Comments for author File: Comments.pdf

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Please find comments and suggestions in the attached file.

Comments for author File: Comments.pdf

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The article presents a pipeline for generating photorealistic synthetic images to support automated visual inspection in industry, employing a Cycle-Consistent Adversarial Network to enhance rendered synthetic datasets by transferring pixel-level features from real camera images.

The introduction presents well the problems found in vision-based Artificial Intelligence applications in advanced manufacturing plants using Flexible Manufacturing Systems or Batch-Size-of-One production.

It is inappropriate for the related work section to be a subsection of Materials and Methods; they should be independent sections, at the same level, in sequence.

Otherwise, the presentation of related work is good, creating a foundation for the following sections.

The idea of using STL models, rendered by Blender and made realistic by CycleGAN, is interesting and well justified.

The results section presents the findings clearly, with very good explanations and figures.

In Figures 4 and 5, "5k" could be replaced by "5k epochs", if that is indeed the meaning.

It would be better if the caption for Table 1 were "Confusion Matrix for Classification on Camera-captured Images versus Rendered Images", and for Table 2, "Confusion Matrix for Classification on CycleGAN-Generated Images versus Rendered Images".

The discussion section was comprehensive and in-depth, including a good analysis of the limitations of the proposed method.

The conclusion offers a good analysis of the work and follows logically from it.

The list of references is consistent with the work, comprehensive, and up-to-date.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

1. The performance difference should be 7.9% (98.3 − 90.4), but it was reported as 8.3%.

2. The demonstrated applications on a Niryo robot and a Festo machine show practical utility. The performance achieved is promising for industrial adoption (this could be substantiated in the Outlook or in the Introduction).

3. The pipeline’s design allows it to work with any object, provided a 3D geometric model exists. This feature is useful for various quality control and automation tasks in today’s manufacturing, but the introduction should provide more details and examples, with some citations, to emphasize this point.  

4. Does the 7.9% performance drop come from more false positives (detecting objects that aren’t there), false negatives (missing real objects), or poor localization (inaccurate bounding boxes)? Showing a few qualitative examples of failure cases would be very insightful.

5. Are some objects more challenging for the model? For example, it could find pull handles well, yet it often overlooks tiny bolts because of CycleGAN artifacts. A more detailed view would come from analyzing each object class’s average precision.

6. Why does SSD MobileNet’s performance collapse so dramatically when trained on rendered images, while YOLOv7 and Faster R-CNN are more resilient? The discussion could hypothesize that lightweight, single-stage detectors like SSD MobileNet are more sensitive to low-level feature shifts (like texture and lighting), whereas more complex architectures might learn more robust, higher-level shape representations.

7. The paper notes that as training images increased from 50 to 400, only the YOLOv7 model’s mAP improved significantly. This result is very interesting and benefits from deeper analysis. It might suggest that certain architectures are better at leveraging larger synthetic datasets, a crucial point for practical applications.

8. Is there a correlation between the 33% of images identified as “synthetic” and poor performance in the downstream task? Perhaps the visual artifacts that fool the classifier are the same ones that confuse the object detector.

9. The discussion could explore the concept of a “sufficient realism” threshold. Does an image need to be 100% indistinguishable from reality to be effective for training? This result suggests that even imperfectly realistic data is highly valuable, reducing the domain gap significantly compared to basic rendered images.

10. The discussion could mention that the cycle-consistency loss in CycleGAN, while powerful, does not explicitly enforce perfect preservation of geometric structure. This can lead to the kinds of shape distortions and disappearing parts seen in Figure 11.

11. The theoretical link would make the future work section stronger. Instead of just saying “explore more advanced generative models,” the authors could suggest specific models (e.g., CUT or StyleGAN variants, with related references) that are known to offer better structural preservation, directly addressing the observed weaknesses of CycleGAN.

12. The authors proposed two future research directions:

“Future research will aim to increase the photorealism and utility of generated images specifically for domain transfer …. In addition, explainability techniques such as feature visualization, attribution, and eXplainable AI (XAI) ….”

Could you please elaborate on these points? With concrete illustrations?

Following text for your reference:

Future research will aim to increase the photorealism and utility of generated images, particularly for the challenging domain transfer between different objects. This can be achieved through enhanced training strategies, automated hyperparameter tuning, and the integration of more advanced generative models. While CycleGANs are effective, recent advancements in diffusion models have shown superior performance in high-fidelity image synthesis tasks. For example, in the related field of blind face restoration, a visual style prompt learning framework was developed that uses diffusion probabilistic models to generate “visual prompts” within the latent space of a pre-trained model to guide the image recovery process [related references]. A similar prompt-based approach could be explored for our pipeline; one could use a small set of real images to generate style prompts that guide a diffusion model to render CAD data in that specific visual style. This could offer more precise control over the final appearance and potentially mitigate the geometric distortions and feature hallucinations observed with the current CycleGAN implementation.

In addition, explainability techniques such as feature visualization and attribution are critical for building trust and diagnosing model failures, especially in high-stakes industrial settings. Instead of treating models as black boxes, eXplainable AI (XAI) can uncover the visual cues driving their decisions. For example, work on data-driven models with physical interpretability for real-time cavity profile prediction in machining processes [related references] successfully used methods like SHAP (SHapley Additive exPlanations) to identify key factors affecting cavity size, aligning model predictions with domain knowledge. Applying these techniques to our vision models could reveal whether they are focusing on correct object features or relying on spurious background textures introduced by the GAN. These insights can then guide targeted improvements in the synthesis process, ensuring that synthetic images align more effectively with the patterns on which AI models depend.
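On the structural-preservation limitation raised in comment 10: the standard CycleGAN objective constrains only the round-trip reconstruction, which is consistent with the distortions discussed. As a reference, with generators $G: X \to Y$ and $F: Y \to X$, the cycle-consistency loss is

```latex
\mathcal{L}_{\text{cyc}}(G, F) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\lVert F(G(x)) - x \rVert_1\big]
+ \mathbb{E}_{y \sim p_{\text{data}}(y)}\big[\lVert G(F(y)) - y \rVert_1\big].
```

Because the penalty requires only $F(G(x)) \approx x$, the forward generator may warp or drop geometric detail as long as the backward generator can invert the change; nothing in the loss itself forbids the shape distortions and disappearing parts seen in Figure 11.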

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 5 Report

Comments and Suggestions for Authors

The article addresses a significant issue in the field of computer vision for industrial applications – the lack of large, annotated datasets in FMS and BSO systems. The proposed solution utilizing CycleGAN for generating synthetic images is relevant and has the potential to bring value to this field. However, the paper requires substantial revisions, particularly in terms of methodology, evaluation, and depth of analysis, before it can be considered for publication.

Among the most critical shortcomings that require improvement are:

  • Lack of a dedicated literature review section – The article does not contain an independent literature review chapter. The shallow analysis of existing works is unfortunately included in the "Materials and Methods" section (subsection 2.1), which is structurally and methodologically inappropriate. This review is limited to listing a few individual works, rather than providing a critical synthesis of the current state of knowledge. The authors fail to identify specific research gaps that their work aims to fill. They do not address the key question: What exactly is missing from existing approaches that necessitates the proposal of a new method? A proper literature review section should therefore be added, clearly demonstrating the research gap. The objective of the work should align with the gap thus identified.
  • The Introduction also requires revision – This section should clearly articulate the aim of the article and explain its theoretical contribution. The introduction should also briefly describe the content of the subsequent sections of the article (the abstract also fails to clearly state the aim).
  • The article does not adequately justify the choice of CycleGAN – The authors do not explain why CycleGAN was selected over other generative models (e.g., Diffusion Models, StyleGAN), which may offer better quality and stability. Was a comparison with other architectures conducted? The article should include such a comparison and provide a clear rationale for choosing CycleGAN.
  • The article lacks a description of training parameters, such as CycleGAN hyperparameters (learning rate, loss function weights) – this information should be added.
  • It is also necessary to describe the data augmentation process (if it was indeed performed).
  • The article relies primarily on mAP to assess data quality. Metrics specific to generative models (e.g., FID – Fréchet Inception Distance), which better reflect the realism of the images, are missing – it seems essential to extend the study with more appropriate image quality metrics.
  • A notable weakness of the article is the poor quality of transfer between different objects – the 67% result in the Turing test indicates low realism of the generated images. Is this sufficient for industrial applications that require high precision? A discussion of these results should be added to the article.
  • The mAP results (Figures 4–5) are not subjected to statistical testing (e.g., ANOVA), which makes it impossible to assess the significance of the differences between models. Including such results in the article would be beneficial.
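On the FID suggestion above: in practice, the score is the Fréchet distance between Gaussians fitted to Inception-v3 activations of real and generated image sets (libraries such as pytorch-fid or torchmetrics handle the feature extraction). As a minimal sketch of only the closed-form distance, assuming NumPy/SciPy and precomputed feature statistics:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians N(mu1, sigma1), N(mu2, sigma2).

    In FID, (mu, sigma) are the mean and covariance of Inception-v3
    activations; that feature-extraction step is omitted here.
    """
    diff = mu1 - mu2
    # Matrix square root of the covariance product.
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerics
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

# Identical feature distributions give a distance near zero;
# a lower score means the generated set is statistically closer to the real set.
d = frechet_distance(np.zeros(4), np.eye(4), np.zeros(4), np.eye(4))
```

Reporting FID alongside mAP would separate the realism of the generated images from their downstream training utility.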

Despite my critical remarks, I believe the article holds significant potential. Implementing the suggested revisions will greatly contribute to achieving a much higher scientific quality of the text.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

1. While I appreciate the revision, there are still many issues to be resolved, especially in the data analysis. Section 3.2 changes the hyperparameters without further explanation. The analysis of the mAP evaluation (Sec. 4.1.1) is STILL flawed: it misstates that only YOLOv7 improved with more data and completely misses the anomalous negative result for EfficientDet.

2. The confusion matrix results look suspicious (too good to be true; maybe I'm wrong, but if you could provide the source code, I can look into it).

3. The new outlook and discussion sound OK, but they lack the engineering perspective and context implied by the title (“... in industrial applications”). For instance, what could a diffusion-based model offer for industrial application? How could knowing “what individual layers look for in an image and which regions of the image influence the predictions” help in industry?

My positive comments were: “This paper addressed a highly relevant industrial problem: the data bottleneck for vision AI in flexible manufacturing. It presented a complete, end-to-end pipeline with two compelling, real-world industrial case studies.” It appears these new discussions even undermine these positive points.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

1. Citation [13] is by the same authors (Nandakumar, N., & Eberhardt, J.). This should be noted, e.g., “In our previous work [13], we emphasized...”

2. Claim in Section 2 “...we intentionally do not focus on these aspects. This is because the third phase of domain transfer adds realism...” is a key hypothesis. The discussion (Sec 4) should explicitly revisit this. The 7.9% mAP gap suggests the base rendering quality might still be a factor. Please add discussion on whether a higher-fidelity initial rendering could further close this gap.

3. The goal is to make rendered images look real (translate Rendered -> Real). This text (line 157: “In this process, the CycleGAN model is trained using two input datasets: one containing the source domain A (camera-captured images) and the other containing the target domain B (rendered images from Phase 2).”) implies the source is Real (A) and target is Rendered (B), suggesting a Real -> Rendered translation. Please clarify the direction of translation. Based on the following paragraph, the generator F: Y -> X (Rendered -> Real) is being used.

4. The epochs were described in several ways:

a. Explicitly stated, 400 and 5000 epochs

b. A learning rate schedule that decreases after 100 epochs also implies about 110 epochs in total.

c. The evaluation figures (4 and 5) use “5k epochs”

5. In section 3.2, the added text: “The CycleGAN was trained using the same hyperparameters described in section 3.1, except for the following modifications. The number of filters... was set to 64. The model was trained for 60 epochs... no learning rate decay was employed.” Why were these specific hyperparameters changed from Section 3.1? The use of 60 epochs (vs. 400/5000) and different filter counts makes a direct comparison of the “similar” vs. “different” object scenarios difficult. Please justify these methodological changes.

6. Line 317: “We also see that as the number of training images increases from 50 to 400, only the mAP of the YOLOv7 model improves...”. This analysis is incorrect. Based on Figures 4 and 5, Faster R-CNN (e.g., 80.3% -> 80.7% for cgan vs real) and SSD MobileNet (e.g., 77.4% -> 80.9% for cgan vs real) also show slight mAP improvements. Most notably, EfficientDet’s performance decreases with more data (e.g., 68.9% -> 64.6% for real vs real). This entire observation (Line 565) must be corrected, and the anomalous result for EfficientDet requires discussion.

7. Line 432: “The results obtained in the above-mentioned use cases... demonstrate performance comparable to conventional OD and AD models trained on real-world camera images.”. This claim is not supported. Section 4.2 provides qualitative examples (Figs 10, 12) but no quantitative metrics (mAP, F1-score, etc.) to compare against a real-data baseline. The term “comparable” is unsubstantiated. Either remove this claim or provide quantitative data for the applications in Sec 4.2.

8. I didn’t see “industrial applications” in the text from lines 488 to 513. They are quite general, and even the references are not based on “industrial applications”. Please check the review comments (Comment #8) in the previous round.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 5 Report

Comments and Suggestions for Authors

I believe that after the revisions have been made, the article is ready for publication.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 3

Reviewer 1 Report

Comments and Suggestions for Authors

Accept. 

Reviewer 4 Report

Comments and Suggestions for Authors

Accept. 
