Peer-Review Record

Operationalizing the R4VR-Framework: Safe Human-in-the-Loop Machine Learning for Image Recognition

Processes 2025, 13(12), 4086; https://doi.org/10.3390/pr13124086
by Julius Wiggerthale *,† and Christoph Reich †
Submission received: 24 November 2025 / Revised: 11 December 2025 / Accepted: 15 December 2025 / Published: 18 December 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

-This study does not propose a new framework; instead, it aims to demonstrate concrete steps for implementing the existing R4VR framework in practice. The title and content don't seem to match up. The article needs to be revised in many aspects.

-In their earlier work, the authors conducted a review study aiming to create safe and reliable ML models by integrating XAI throughout the entire model lifecycle, and proposed the R4VR framework. They may consider explicitly stating within the manuscript that this framework is their own prior contribution.

-Figure 1 is taken from the previous study of authors and must be properly referenced.

-Sections such as ‘2.3. EU AI Act for High Risk Applications’ contain an excessive number of short paragraphs. This creates the impression of a ‘bullet-point summary,’ which disrupts the flow of an academic manuscript and overwhelms the reader with fragmented details. The goal of a scholarly article is not to present raw information as-is, but to synthesize and interpret it in a coherent narrative. Therefore, this section—as well as the entire manuscript—should be carefully revised to improve structural clarity and readability. Particular attention should be given to Sections 3.1, Sec. 4, and all corresponding subsections, as well as the ‘6. Conclusions’ section. A thorough revision would significantly enhance the overall quality of the paper.

-Again, in Section 2.3, references such as “Article 9” are used, but no such reference exists. The paper has not even been checked before being submitted to the journal; it is very careless. Table 1 also contains the same type of referencing issue. These references likely correspond to the entries in the bibliography, but this is difficult to understand. The entire manuscript should be checked carefully and these ambiguities must be resolved.

-The sentence ‘More detailed information on single XAI techniques can be found in [21].’ does not seem to connect well with the preceding text, and I do not understand why it was included.

-I find the statement ‘There are two major categories of UQ approaches, namely Bayesian methods and ensembles’ incomplete, as the literature also includes additional uncertainty quantification methods beyond these two categories.

-There is likely a structural error in the manuscript. Section 3 is titled ‘3. R4VR-Framework for Visual Inspection,’ but subsection 3.2, which examines a medical use case, contains nothing related to visual inspection.

-There is a statement saying, ‘In medical use cases, this is relatively easy since physicians regularly diagnose patients based on images.’ However, I do not think this is entirely accurate, as there are many steps involved, such as patient confidentiality and ethical approval procedures. It would be more appropriate to thoroughly review Section 3.2 with references from the literature. I also consider Figure 3 to be incomplete. It only presents model development and model deployment. Data collection and other components could be added. Sections such as Validation, Verification, and Deployment could be presented in a way that aligns better with the text. A similar situation applies to Section 3.3 as well; this part should also be supported with references to the literature. Both Sections 3.2 and 3.3 are written in very general terms, and their structural readability is low.

In Section 3.3, it is stated that the Reliability phase can be conducted exactly as in Section 3.2. However, since the data collection stages and the types of problems differ between the two domains, differences in the Reliability phase are also likely. If there are indeed no differences, it should be briefly explained why the approach from Section 3.2 can be used directly. The other subsections under Section 3.3 should also be reviewed in a similar manner.

-The sentence states that “the images represent common defects from 6 different classes”, namely …, but only 5 class names are listed.

-The study claims to demonstrate how the R4VR framework can be used to train highly reliable machine learning models. What evidence in the paper supports the claim of ‘highly reliable’? Which results or analyses validate this level of reliability?

-The dataset is described as resembling a “real” factory visual inspection problem, but what is the basis for this claim? The number of samples appears small, and to my knowledge, this is not a widely used benchmark dataset in the literature. Moreover, the dataset lacks real-world variation such as differences in lighting, camera conditions, and production environments (although the dataset does not contain lighting variations comparable to real-world conditions, this issue is addressed extensively in a later section and the authors attempt to resolve it). The images look very clean and do not seem representative of an actual factory setting. Therefore, the validation of the R4VR framework remains limited.

-It is stated that since this is not a high-risk application, a domain expert is not needed in the validation phase. However, is high-risk status the only reason to require a domain expert? Even if the developer performs the model validation himself, a certain level of expertise in the subject matter is still necessary.

-The experiments in the Data Collection (4.1) section were carried out using VGG16. I am wondering why the results were not also obtained with ResNet18. In this section, the authors draw conclusions such as ‘when more images are used at the beginning of training, the number of images requiring manual labeling increases.’ Can these conclusions be assumed to hold for different datasets as well? If not, this limitation should be stated clearly in the text.

-In sections such as 4.2–3, information should be provided about the hyperparameters used in the deep learning methods and about how many repetitions were performed in the experiments. They use MC dropout for the purpose of UQ and Grad-CAM as the XAI technique. Why were different techniques not tested? The methods and parameters used for XAI and UQ should be explained in more detail. Mathematical details should be given, as XAI and UQ provide benefits at many stages.

-There is a statement that says, “In order to create sufficiently different models, we train each model only on half of the images.” This reduces the amount of data and may lower the quality of comparisons between different models. What if easy data is given to one model and difficult data to another? These problems can be solved by repeating the process on different sets, but the article does not provide any information on this.

-How was it determined that the issues were due to the lighting problem? Could there be other problems? Could there have been other solutions for the lighting problem besides artificial gradient in brightness?

-Solutions such as MLOps (and its derivatives) exist, so why haven't they been mentioned at all?

-Future actions and current challenges should be provided in greater detail.

-There are no comparisons regarding literature studies and their results.

Author Response

Dear Reviewer,

We are grateful that you took the time to review our manuscript. You made several good points that helped us improve the manuscript and increase its impact. Please find the detailed responses below and the corresponding revisions/corrections highlighted/in track changes in the re-submitted manuscript.


Comments 1

This study does not propose a new framework; instead, it aims to demonstrate concrete steps for implementing the existing R4VR framework in practice. The title and content don't seem to match up. The article needs to be revised in many aspects.

Response: We clarified throughout the manuscript that we do not propose a new framework but illustrate the application of our existing R4VR framework. Also, we adapted the title.

Comments 2

-In their earlier work, the authors conducted a review study aiming to create safe and reliable ML models by integrating XAI throughout the entire model lifecycle, and proposed the R4VR framework. They may consider explicitly stating within the manuscript that this framework is their own prior contribution.

Response: We explicitly stated in Sec. 1 and Sec. 2 that we proposed the R4VR framework.

Comments 3

-Figure 1 is taken from the previous study of authors and must be properly referenced.

Response: We added a reference to our prior work.

Comments 4

-Sections such as ‘2.3. EU AI Act for High Risk Applications’ contain an excessive number of short paragraphs. This creates the impression of a ‘bullet-point summary,’ which disrupts the flow of an academic manuscript and overwhelms the reader with fragmented details. The goal of a scholarly article is not to present raw information as-is, but to synthesize and interpret it in a coherent narrative. Therefore, this section—as well as the entire manuscript—should be carefully revised to improve structural clarity and readability. Particular attention should be given to Sections 3.1, Sec. 4, and all corresponding subsections, as well as the ‘6. Conclusions’ section. A thorough revision would significantly enhance the overall quality of the paper.

Response: We carefully revised the entire manuscript, with a particular focus on the sections mentioned, for clarity and readability. We merged short paragraphs into more cohesive units and rewrote individual sections to provide a clear narrative.

Comments 5

-Again, in Section 2.3, references such as “Article 9” are used, but no such reference exists. The paper has not even been checked before being submitted to the journal; it is very careless. Table 1 also contains the same type of referencing issue. These references likely correspond to the entries in the bibliography, but this is difficult to understand. The entire manuscript should be checked carefully and these ambiguities must be resolved.

Response: We thoroughly revised the entire paper including all references to the EU AI Act. In the text, we added the reference to the EU AI Act every time we referred to articles from the EU AI Act. In Table 1, we changed the column heading to “Corresponding EU AI Act Articles” to provide a clear reference.

Comments 6

-The sentence ‘More detailed information on single XAI techniques can be found in [21].’ does not seem to connect well with the preceding text, and I do not understand why it was included.

Response: We removed the sentence.

Comments 7

-I find the statement ‘There are two major categories of UQ approaches, namely Bayesian methods and ensembles’ incomplete, as the literature also includes additional uncertainty quantification methods beyond these two categories.

Response: The statement followed the categorization made in this manuscript (https://doi.org/10.1016/j.inffus.2021.05.008). However, we failed to point out that it only reflects the view taken in that manuscript, so your comment is legitimate. We added additional approaches (EDL, DUQ) to the section in order to provide a more complete overview and removed the statement “There are two major categories”.

Comments 8

-There is likely a structural error in the manuscript. Section 3 is titled ‘3. R4VR-Framework for Visual Inspection,’ but subsection 3.2, which examines a medical use case, contains nothing related to visual inspection.

Response: The subsection on medical use cases was added at a late stage to reduce one-dimensionality, and we forgot to adapt the heading of Section 3. We now corrected this and renamed Section 3 to “R4VR Framework in Practice”.

Comments 9

-There is a statement saying, ‘In medical use cases, this is relatively easy since physicians regularly diagnose patients based on images.’ However, I do not think this is entirely accurate, as there are many steps involved, such as patient confidentiality and ethical approval procedures. It would be more appropriate to thoroughly review Section 3.2 with references from the literature. I also consider Figure 3 to be incomplete. It only presents model development and model deployment. Data collection and other components could be added. Sections such as Validation, Verification, and Deployment could be presented in a way that aligns better with the text. A similar situation applies to Section 3.3 as well; this part should also be supported with references to the literature. Both Sections 3.2 and 3.3 are written in very general terms, and their structural readability is low.

Response: We completely updated Sections 3.2 and 3.3 and added references to the literature. Also, we adapted the figures showing the process by adding the data collection process as well as monitoring and logging activities.

Comments 10

In Section 3.3, it is stated that the Reliability phase can be conducted exactly as in Section 3.2. However, since the data collection stages and the types of problems differ between the two domains, differences in the Reliability phase are also likely. If there are indeed no differences, it should be briefly explained why the approach from Section 3.2 can be used directly. The other subsections under Section 3.3 should also be reviewed in a similar manner.

Response: In the course of revising Sec. 3.2 and Sec. 3.3 in response to Comment 9, the individual phases were explained in more detail. In Sections 3.2 and 3.3, we now provide a targeted description of each phase for both medical and industrial use cases, and we highlight domain-specific differences in the individual phases.

Comments 11

-The sentence states that “the images represent common defects from 6 different classes”, namely …, but only 5 class names are listed.

Response: We now use a different dataset with 4 defect classes. They are listed correctly in the manuscript.

Comments 12

-The study claims to demonstrate how the R4VR framework can be used to train highly reliable machine learning models. What evidence in the paper supports the claim of ‘highly reliable’? Which results or analyses validate this level of reliability?

Response: We replaced the term “reliable” with “safe” and now connect the claim explicitly to our empirical findings. In Section 4, we show that fewer than 0.1% of images are misclassified without being detected by the warning mechanisms. We clarified in the text that our claim of safety is based on this detection performance, and we also discuss limitations in the Discussion section.
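For illustration, a warning mechanism of this kind can be as simple as thresholding the model's confidence and its predictive uncertainty for each image. The following is a minimal, hypothetical sketch; both threshold values are illustrative assumptions, not those used in the paper.

```python
# Minimal sketch of a confidence/uncertainty-based warning mechanism for a
# single image; conf_thresh and unc_thresh are hypothetical values.
def flag_for_review(mean_probs, uncertainty, conf_thresh=0.9, unc_thresh=0.2):
    """Return True if the prediction should be routed to a human reviewer.

    mean_probs:  per-class probabilities averaged over stochastic passes
    uncertainty: per-class spread (e.g., std) across those passes
    """
    confidence = max(mean_probs)   # top-class probability
    spread = max(uncertainty)      # largest per-class disagreement
    return confidence < conf_thresh or spread > unc_thresh
```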

Comments 13

-The dataset is described as resembling a “real” factory visual inspection problem, but what is the basis for this claim? The number of samples appears small, and to my knowledge, this is not a widely used benchmark dataset in the literature. Moreover, the dataset lacks real-world variation such as differences in lighting, camera conditions, and production environments (although the dataset does not contain lighting variations comparable to real-world conditions, this issue is addressed extensively in a later section and the authors attempt to resolve it). The images look very clean and do not seem representative of an actual factory setting. Therefore, the validation of the R4VR framework remains limited.

Response: We now use a more complex dataset. However, we acknowledge in the discussion section that we only tested on images from the steel industry and validation for other use cases as well as in real industrial settings should be addressed by future work.

Comments 14

-It is stated that since this is not a high-risk application, a domain expert is not needed in the validation phase. However, is high-risk status the only reason to require a domain expert? Even if the developer performs the model validation himself, a certain level of expertise in the subject matter is still necessary.

Response: We thank the reviewer for this very good comment. We added an explanation that domain experts should be consulted when risks are high or when the developers themselves are not able to properly perform the Validation phase (e.g., because the data is too specialized).

Comments 15

-The experiments in the Data Collection (4.1) section were carried out using VGG16. I am wondering why the results were not also obtained with ResNet18. In this section, the authors draw conclusions such as ‘when more images are used at the beginning of training, the number of images requiring manual labeling increases.’ Can these conclusions be assumed to hold for different datasets as well? If not, this limitation should be stated clearly in the text.

Response: We added a note that we conducted the experiment with both models and provide a reference to our GitHub repository, where the results are available. We also point out that we observed similar results with both models. Beyond that, we added the statement “At this point, it has to be mentioned that all results observed only apply to our specific use case. For other datasets, other results may be observed.” at the beginning of the “Discussion of Results” section.
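To make the experimental setup concrete: in a human-in-the-loop data-collection scheme of the kind studied here, confidently predicted images can be pseudo-labeled automatically, while low-confidence images are routed to a human annotator. The sketch below illustrates this split under stated assumptions; predict_proba and the 0.95 threshold are hypothetical placeholders, not the authors' exact setup.

```python
# Hypothetical sketch of confidence-based routing during data collection;
# predict_proba() is a placeholder for a model wrapper returning a list of
# per-class probabilities.
def split_unlabeled_pool(model, pool, threshold=0.95):
    """Auto-label confident images; route the rest to manual labeling."""
    auto, manual = [], []
    for img in pool:
        probs = predict_proba(model, img)                # hypothetical helper
        if max(probs) >= threshold:
            auto.append((img, probs.index(max(probs))))  # pseudo-label
        else:
            manual.append(img)                           # needs a human label
    return auto, manual
```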

Comments 16

-In sections such as 4.2–3, information should be provided about the hyperparameters used in the deep learning methods and about how many repetitions were performed in the experiments. They use MC dropout for the purpose of UQ and Grad-CAM as the XAI technique. Why were different techniques not tested? The methods and parameters used for XAI and UQ should be explained in more detail. Mathematical details should be given, as XAI and UQ provide benefits at many stages.

Response: We added a table indicating the hyperparameters and the number of samples for MC dropout at the beginning of Sec. 4. For more details, we added a reference to our GitHub repository. Furthermore, we state the use of only MC dropout and Grad-CAM as a limitation of the study.
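For readers unfamiliar with the technique: MC dropout estimates predictive uncertainty by keeping dropout active at inference time and aggregating several stochastic forward passes. The following is a minimal PyTorch sketch under that assumption; the number of passes (30) is an illustrative choice, not the value from the paper's hyperparameter table.

```python
# Minimal MC dropout sketch in PyTorch; n_samples = 30 is an illustrative
# assumption. Works for any classifier containing nn.Dropout layers.
import torch

def mc_dropout_predict(model, x, n_samples=30):
    model.eval()
    for m in model.modules():                  # re-enable dropout only,
        if isinstance(m, torch.nn.Dropout):    # keeping batch norm in eval mode
            m.train()
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=1) for _ in range(n_samples)]
        )
    return probs.mean(dim=0), probs.std(dim=0)  # predictive mean and spread
```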

Comments 17

-There is a statement that says, “In order to create sufficiently different models, we train each model only on half of the images.” This reduces the amount of data and may lower the quality of comparisons between different models. What if easy data is given to one model and difficult data to another? These problems can be solved by repeating the process on different sets, but the article does not provide any information on this.

Response: At the beginning of Sec. 4, we explain that the models were trained 10 times, with images assigned randomly to the individual models in each training cycle.
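The procedure can be summarized as follows, assuming generic train() and evaluate() routines (hypothetical placeholders): in each of the 10 cycles, the image pool is reshuffled and split into two disjoint halves, one per model, so no model systematically receives only the "easy" or only the "difficult" images.

```python
# Hypothetical sketch of the repeated random half-split training; train()
# and evaluate() are placeholders, not functions from the paper's code.
import random

def repeated_half_split_training(image_ids, n_repeats=10, seed=0):
    rng = random.Random(seed)
    results = []
    for _ in range(n_repeats):
        ids = list(image_ids)
        rng.shuffle(ids)                 # fresh random assignment each cycle
        half = len(ids) // 2
        model_a = train(ids[:half])      # each model sees only half the data
        model_b = train(ids[half:])
        results.append(evaluate(model_a, model_b))
    return results
```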

Comments 18

-How was it determined that the issues were due to the lighting problem? Could there be other problems? Could there have been other solutions for the lighting problem besides artificial gradient in brightness?

Response: We acknowledge that we did not describe the process sufficiently. We now state that we compared the exposure of images flagged by the warning mechanisms to the exposure of unflagged images. We also point out that there could be other possible problems and other solutions to overcome the issues. Beyond that, we describe why we used data augmentation to enhance the models: it requires no changes to the existing inspection setup and can be integrated easily into a generic training pipeline.
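As an illustration of such an augmentation, a synthetic brightness gradient can be produced by adding a linear ramp along one image axis. The sketch below assumes images as float arrays in [0, 1]; the gradient strength and direction are illustrative assumptions.

```python
# Minimal sketch of a brightness-gradient augmentation; strength = 0.3 and
# the left-to-right direction are illustrative assumptions.
import numpy as np

def add_brightness_gradient(img, strength=0.3):
    """Darken one side and brighten the other, simulating uneven lighting."""
    w = img.shape[1]
    ramp = np.linspace(-strength, strength, w)   # left-to-right lighting ramp
    if img.ndim == 3:
        ramp = ramp[None, :, None]               # broadcast over rows/channels
    else:
        ramp = ramp[None, :]
    return np.clip(img + ramp, 0.0, 1.0)
```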

Comments 19

-Solutions such as MLOps (and its derivatives) exist, so why haven't they been mentioned at all?

Response: We added a subsection 2.4 - “Related Concepts” that briefly presents concepts of HiL machine learning, active learning, XAI as well as UQ and MLOps. In the section, we also point out what distinguishes the concepts from the R4VR-framework and how the R4VR-framework complements MLOps.

Comments 20

-Future actions and current challenges should be provided in greater detail.

Response: We extended the discussion section. We now point out several limitations of the approach in general, as well as of the study in particular, and give an outlook on how future work could address these limitations.

Comments 21

-There are no comparisons regarding literature studies and their results.

Response: We added a comparison to other research in Sec. 4.3.

Reviewer 2 Report

Comments and Suggestions for Authors

The topic presented to me for review has significant strengths. The following weaknesses should be noted:
- a relatively small dataset that does not reflect real conditions, while at the same time some of the production factors are limited.
- it has not been tested in real production, which limits the assessment of time, load, and long-term stability.
- insufficient statistical validity is observed.
- no comparisons are presented to minimize risks, such as SHAP, LRP, and others.
- outdated architectures are applied in practice, and better explainable and more stable ones can be added.
- the article offers a general framework, but a limited presentation of validity.
- the exposition is too long.
- the figures for Grad-CAM do not have a quantitative metric for quality.

Author Response

Dear Reviewer,

We are grateful that you took the time to review our manuscript. You made several good points that helped us improve the manuscript and increase its impact. Please find the detailed responses below and the corresponding revisions/corrections highlighted/in track changes in the re-submitted manuscript.


Comments 1:

a relatively small dataset that does not reflect real conditions, while at the same time some of the production factors are limited.

Response: We conducted the experiments with a different dataset (Severstal - Dataset Ninja). The new dataset contains more images (~12,500 labeled images), is more complex, and better reflects real-world conditions.

Comments 2:

it has not been tested in real production, which limits the assessment of time, load, and long-term stability.

Response: At present, we do not have access to an industrial partner that would allow us to test the framework in a real production environment. We now explicitly acknowledge this limitation in the discussion Section and highlight it as an important direction for future work.

Comments 3:

insufficient statistical validity is observed.

Response:  Due to computational constraints, we limited our experiments to 10 training runs with different random splits. We now describe this procedure explicitly in the beginning of Sec. 4.

Comments 4:

no comparisons are presented to minimize risks, such as SHAP, LRP, and others.

Response: Our main aim was to demonstrate the practical applicability of the framework. We focused on Grad-CAM as the XAI method since it is intuitive to interpret for image data. We added a reference to our GitHub repository, where we implemented SHAP as well. We acknowledge this weakness in the discussion and point out that future research should address how different XAI (and UQ) methods affect the outcome.
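For context, Grad-CAM weights the feature maps of a chosen convolutional layer by the spatially averaged gradients of the target class score and applies a ReLU to the weighted sum. A minimal PyTorch sketch follows; the layer choice and normalization are illustrative details, not the authors' exact implementation.

```python
# Minimal Grad-CAM sketch; target_layer (e.g., the last conv block) and the
# normalization are illustrative choices.
import torch

def grad_cam(model, x, target_layer, class_idx=None):
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    try:
        logits = model(x)                                  # x: (1, C, H, W)
        idx = class_idx if class_idx is not None else logits.argmax(1).item()
        model.zero_grad()
        logits[0, idx].backward()
        weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # GAP of gradients
        cam = torch.relu((weights * acts["a"]).sum(dim=1)).squeeze(0)
        return cam / (cam.max() + 1e-8)                    # normalize to [0, 1]
    finally:
        h1.remove()
        h2.remove()
```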

Comments 5:

outdated architectures are applied in practice, and better explainable and more stable ones can be added.

Response: The models were those that performed well on the dataset used in the original elaboration. With the more complex dataset, we found that ResNet18 and EfficientNet-B0 perform well. We also tested a vision transformer as well as several other architectures but found insufficient performance. This is not uncommon for smaller datasets of structured images: the superior performance of CNN-based models such as ResNet on such datasets can (to the best of our knowledge) be attributed to their local filters and to the smaller number of parameters, which avoids overfitting.

Comments 6:

the article offers a general framework, but a limited presentation of validity.

Response: We now use a more representative dataset and point out how our framework improves over a normal baseline model. Nevertheless, we acknowledge in the discussion section that the framework’s validity has to be tested in other domains and in real production scenarios.

Comments 7:

the exposition is too long.

Response: We streamlined the exposition where possible. However, in response to requests for additional detail, some sections had to be expanded. We aimed to balance completeness and conciseness and improved the structure to maintain readability.

Comments 8:

the figures for Grad-CAM do not have a quantitative metric for quality.

Response: We updated the figure according to the new dataset and added a quantitative metric.
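One common family of quantitative checks for saliency maps scores how much of the heatmap's energy falls inside the annotated defect region (an energy-based pointing-game style measure). The sketch below illustrates that idea; it is an assumption for illustration and not necessarily the metric adopted in the revised paper.

```python
# Illustrative CAM-quality score: fraction of heatmap energy inside the
# ground-truth defect mask; not necessarily the paper's metric.
import numpy as np

def energy_inside_mask(cam, mask):
    """cam: non-negative heatmap (H, W); mask: boolean defect mask (H, W)."""
    total = cam.sum()
    return float(cam[mask].sum() / total) if total > 0 else 0.0
```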

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The article has been revised properly.  I have no further questions. 

Reviewer 2 Report

Comments and Suggestions for Authors

The notes are duly reflected.

The development can be published in the journal.
