Article
Peer-Review Record

Reliable Detection of Unsafe Scenarios in Industrial Lines Using Deep Contrastive Learning with Bayesian Modeling

Automation 2025, 6(4), 84; https://doi.org/10.3390/automation6040084 (registering DOI)
by Jesús Fernández-Iglesias 1,2,3,*, Fernando Buitrago 2,4 and Benjamín Sahelices 1,2,5
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 27 September 2025 / Revised: 16 November 2025 / Accepted: 25 November 2025 / Published: 2 December 2025
(This article belongs to the Section Industrial Automation and Process Control)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Overall the paper is in good shape already; I would however strongly suggest that the authors try to shorten it significantly. It appears that the authors made use of GenAI, which is okay, but this typically generates rather lengthy phrases. A lot of paragraphs are good to read but communicate little. The figures are way too small.

The related work would benefit from a stronger structure. 

For bias detection, you failed to disclose how you did the embedding for images. I guess you used the embedding of one of your models. Different models might provide different results.

You used t-SNE, which is problematic; see why at https://jmlr.org/papers/volume22/20-1061/20-1061.pdf

Additionally, while you might not have biases in your embedding space, you might still have a bias in your dataset, e.g., in where certain objects appear, or in the model which you used.

Regarding the training specification, you argue that you use augmentation to ensure consistent performance, yet in the discussion you highlight issues with lighting. Please provide numbers on how much training improves due to augmentation.

Have you also checked that your uniform/no-bias-in-latent-space claim holds for the different datasets?

Figure 7 is way too small, and you don't disclose which architecture belongs to which plot (also in Figure 8). It would also be interesting to discuss how the model architecture, latent space representation, and Euclidean distance plot are interlinked. Especially in the latter, do we not only want a higher distance between OK and KO but maybe also higher or lower variance?

As far as I understood it, your BGMM used the downprojection into R² to do distribution checks. I wonder why ResNet-18 comes out here as the best standalone model. Why didn't you use the high-dimensional space for the distribution fitting in the beginning? You also claim that combining two models into an R⁴ space is beneficial. It looks to me that for ResNet-18 the downprojection yielded the best results because the information loss was smallest here. Please clarify this section.

The XAI pictures are really bad, which is why I cannot really judge this section. The input feature ablation method is interesting, particularly because you remove/replace the unsafe part. However, this should be compared with introducing unsafe objects into safe spaces as well; that is rather easy to do. Another approach here would be to make use of diffusion models.

The discussion/limitation/future work section is very limited. Please extend this to more specific limitations of your study and design methodology. 

 

Author Response

Comment 1 — The related work would benefit from a stronger structure.

Reply: We fully agree. We have redesigned the section and created a more organized structure based
on three main points: quality control and risk detection in architecture, engineering, and construction
(AEC) environments, industrial anomaly detection, and the integration of AI systems with functional
safety methodologies. Each paragraph in the section 2. Related work addresses one of the points. In
addition, we have emphasized the differences between the cited literature and our work.

 

Comment 2 — For bias detection, you failed to disclose how you did the embedding for images. I guess you used the embedding of one of your models. Different models might provide different results. You used t-SNE, which is problematic; see why at https://jmlr.org/papers/volume22/20-1061/20-1061.pdf. Additionally, while you might not have biases in your embedding space, you might still have a bias in your dataset, e.g., in where certain objects appear, or in the model which you used.

Reply: In the original version of the manuscript, we used two techniques to reduce the dimensionality
of the dataset. First, principal component analysis (PCA) preserving as much variance as possible, and
then t-distributed stochastic neighbor embedding (t-SNE) to project the principal components onto a
R2 space. We have carefully read the methodology presented in [Wang et al., 2020], and we understand
the advantages of the pairwise controlled manifold approximation projection (PaCMAP) method over
the uniform manifold approximation and projection (UMAP) and t-SNE techniques. While t-SNE and
UMAP usually offer good performance maintaining local structure, they can miss large-scale geometry.
However, PaCMAP preserves both local and global structure. We have applied the PaCMAP technique
to the entire dataset using the GitHub code provided by the authors. We have chosen the standard
parameter settings: 10 neighbors for the k-Nearest Neighbor graph, 0.5 as the ratio between the number
of mid-near pairs and the number of neighbors, 2 as the ratio between the number of farthest pairs and
the number of neighbors, and PCA as the initialization method of the lower-dimensional embedding.
The resulting embedding is displayed in Figure 5 of the manuscript. There seem to be no clear
patterns or isolated categories in clusters unrelated to the rest of the observations. They all seem to
form a relatively homogeneous global structure, so there does not appear to be any clear bias in the
dataset. We agree that further studies can be conducted to determine the quality and balance of the dataset.
To this end, we have conducted a study on the position of the persons (a subset of 850 random images)
and all anomalous objects with respect to the machine layout. We have manually labelled the middle
point of each object, and the results are shown in Figure 6. As can be seen, all areas of the cell have
been sampled reasonably uniformly. There is no anomaly category that is only found in a very specific
zone of the layout. 

 

Comment 3 — Regarding the training specification, you argue that you use augmentation to ensure consistent performance, yet in the discussion you highlight issues with lighting. Please provide numbers on how much training improves due to augmentation.

Reply: We agree that the effect of data augmentation on training is a very relevant aspect, although it
is not the purpose of this work to perform that analysis. Instead, we choose certain data augmentation
techniques that are well established and proven effective in the deep learning community (geometric
and color space alterations) that simulate conditions that the deep learning (DL) models will face in
deployment. We have improved the explanation in the paragraph describing data augmentation in
which we explain this issue, as it was not clearly presented previously. In any case, we agree that
evaluating the best data augmentation techniques for similar industrial problems is really interesting for
future research. In this regard, in a study we are currently conducting (Aparicio-Sanz et al., 2026),
we introduce a special data augmentation technique that allows illegitimate elements to be realistically
integrated into safe scenes, facilitating the roll-out process for new machines or layout changes. 

 

Comment 4 — Have you also checked that your uniform/no-bias-in-latent-space claim holds for the different datasets?

Reply: We have conducted the discussed bias analysis for the entire dataset presented in Table 1 (6835 instances). We performed random sampling (850 images) of the Ko category only for the location density map of the elements, as manual labeling is required for each instance. We have not used any other external data source that, in our opinion, may require a bias study.

 

Comment 5 — Figure 7 is way too small, and you don't disclose which architecture belongs to which plot (also in Figure 8). It would also be interesting to discuss how the model architecture, latent space representation, and Euclidean distance plot are interlinked. Especially in the latter, do we not only want a higher distance between OK and KO but maybe also higher or lower variance?

Reply: It is true. In the original manuscript, we did not show the relationship between subfigures and
architectures for both figures. We have now added it. We have also increased the size of the figures
to make them easier to analyze. At the same time, we have expanded the information shown in the
graphs in the previous Figure 8 (now Figure 10) to help interpret the relationship between the Euclidean
distances represented and the latent space plot shown in the immediately preceding figure (now Figure
9). We have also added some sentences to clarify this relationship. Regarding the variance, we agree
that it is interesting to discuss. We have added an explanation at the end of the penultimate paragraph
of Section 4. Experimental Results for the Base safe/unsafe Scenario detailing why, above a certain
threshold determined by the γ value of the loss function, DL algorithms do not optimize their loss by
increasing that value. Thus, as long as the average (and minimum) value is greater than this threshold,
it is not relevant to study the variance in the distances obtained, since there is no explicit constraint
requiring them to be close to each other.
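The argument about the γ threshold can be illustrated with a generic margin-based contrastive loss (a simplified textbook form for illustration; the manuscript's exact loss function may differ): once a negative pair's distance exceeds the margin, its loss term is exactly zero, so the optimizer has no incentive to push distances further apart, which is why the variance of the distances above the threshold is not constrained.

```python
def contrastive_loss(d, same_pair, gamma=1.0):
    """Generic margin-based contrastive loss (illustrative, not the paper's exact form).
    same_pair=True: pull the pair together (loss grows with distance d).
    same_pair=False: push the pair apart only until the margin gamma is reached."""
    if same_pair:
        return d ** 2
    return max(0.0, gamma - d) ** 2

# Beyond the margin, all negative-pair distances are equally "good" (zero loss):
assert contrastive_loss(1.2, same_pair=False, gamma=1.0) == 0.0
assert contrastive_loss(5.0, same_pair=False, gamma=1.0) == 0.0
# Inside the margin there is still gradient pressure to separate the pair:
assert contrastive_loss(0.5, same_pair=False, gamma=1.0) > 0.0
```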

 

Comment 6 — As far as I understood it, your BGMM used the downprojection into R² to do distribution checks. I wonder why ResNet-18 comes out here as the best standalone model. Why didn't you use the high-dimensional space for the distribution fitting in the beginning? You also claim that combining two models into an R⁴ space is beneficial. It looks to me that for ResNet-18 the downprojection yielded the best results because the information loss was smallest here. Please clarify this section.

Reply: Thank you for the detailed comment. The contrastive learning encoders project the original images onto an R² subspace, that is, each image is converted into a vector with two values. Subsequently, the Bayesian Gaussian mixture model (BGMM) summarizes the distribution made up of the safe scenarios. In this way, the BGMM is able to quantify the likelihood that a new representation belongs to the cluster of safe scenarios. We need to apply the BGMM to an embedding with the synthesized information; if we applied it to the original dimensional space (2 MPx images), the high amount of noise and irrelevant information would cause the methodology to perform poorly. We consider ResNet-18 to be the best model because, in Table 3, it can be seen that the average AUC obtained for all categories (AUCMean) is the highest among all the encoders tested (0.9928). Regarding subsection 5.3. Hybrid latent space for performance maximization, we combine the latent representations of two encoders to produce an R⁴ space. In this way, the discriminative power of each individual encoder is exploited for those images in which performance is optimal, and aggregation maximizes overall performance. This hybridization between models shows optimal results with all the encoders we have tested in the study. For example, in Table 4, we can see that two encoders that perform worse than ResNet-18 (XCiT-nano and EfficientNet-B0) have individual AUCMean values of 0.9225 and 0.8729, while the AUC of the hybrid model is 0.9649, more than 4 and 9 percentage points higher, respectively.
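The scoring step can be illustrated with a minimal sketch on synthetic 2-D points, using scikit-learn's `BayesianGaussianMixture` (the component count and priors here are illustrative, not the configuration from the paper): the model is fitted on safe embeddings only, and new representations are scored by their log-likelihood under that fitted distribution.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# synthetic R^2 embeddings of safe scenes, clustered around the origin
safe = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(500, 2))

# fit the BGMM on safe scenarios only
bgmm = BayesianGaussianMixture(n_components=5, random_state=0).fit(safe)

# new representations are scored by log-likelihood under the safe distribution
ll_safe = bgmm.score_samples(np.array([[0.05, -0.02]]))[0]
ll_unsafe = bgmm.score_samples(np.array([[3.0, 3.0]]))[0]
assert ll_safe > ll_unsafe  # far-away (unsafe-like) points get much lower likelihood
```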

 

Comment 7 — The XAI pictures are really bad, which is why I cannot really judge this section. The input feature ablation method is interesting, particularly because you remove/replace the unsafe part. However, this should be compared with introducing unsafe objects into safe spaces as well; that is rather easy to do. Another approach here would be to make use of diffusion models.

Reply: Thank you for pointing it out. We have not increased the size of the figures in the manuscript
since it would make it longer than it already is, but we have added more examples to our GitHub
(see github.com/jesusferigl/) to facilitate their study and interpretation. We fully agree that it is very
interesting to study the output of the proposed method when an unsafe element (anomalous person
or object) is found in a safe area (outside the facility). To that end, we have selected four relevant
examples from the test data set and created Figure A3 (appendix). The figure shows the output of the
input feature ablations method for the following scenarios:
• One Ko image caused by the presence of a person inside the cell.
• Two Ok images with persons outside the cell but visible in the images.
• One Ko image caused by the presence of an anomaly and with a person outside the cell but visible
in the image.
It can be seen that, for the real unsafe patches (red circles), the latent space displacements caused by replacing them with a safe patch are 0.18881 and 1.05342. For the remaining situations (green circles), the
displacements are several orders of magnitude lower, even when there are persons located next to the
cell. This result shows that the models are capable of ignoring the presence of people and, in general,
variable situations outside the cell. We have added a new paragraph at the end of subsection 6.1. Input
feature ablations explaining this new result. 
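The displacement measure used in this analysis can be sketched as follows. The encoder below is a toy stand-in (the study uses its trained contrastive-learning encoders), but the mechanics are the same: replace a region with a safe patch, re-encode, and measure how far the latent representation moves.

```python
import numpy as np

def encode(img):
    # toy stand-in for the contrastive encoder's projection into R^2
    return np.array([img.mean(), img.std()])

def ablation_displacement(img, safe_patch, y, x):
    """Replace the patch at (y, x) with a safe patch and measure how far the
    latent representation moves (a large shift => the patch drove the decision)."""
    h, w = safe_patch.shape
    ablated = img.copy()
    ablated[y:y + h, x:x + w] = safe_patch
    return float(np.linalg.norm(encode(img) - encode(ablated)))

img = np.zeros((64, 64))
img[20:28, 20:28] = 1.0          # bright "anomalous object" inside the cell
safe_patch = np.zeros((8, 8))    # corresponding patch from a safe scene
assert ablation_displacement(img, safe_patch, 20, 20) > 0.0   # removing the object moves the embedding
assert ablation_displacement(img, safe_patch, 50, 50) == 0.0  # ablating empty background does nothing
```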

 

Comment 8 — The discussion/limitation/future work section is very limited. Please extend
this to more specific limitations of your study and design methodology.

Reply: We agree that subsection 7.1. Limitations and future work can be expanded to reflect the
limitations of the work carried out. We have modified the paper to improve this section as follows.
First, we have expanded on the explanation of the two main limitations that we believe the study has:
performance in the face of unexpected changes in light and the difficulty of retraining the system in the
event of a layout change or deployment in a new cell. It is worth noting that a study we are currently
conducting (Aparicio-Sanz et al., 2026) uses a setup based on near infrared (NIR) lighting devices and
cameras to shield the system from ambient and plant light, as well as a special data augmentation
technique to simulate unsafe scenarios having only safe scenarios. In addition, we have added a third
point that represents a limitation of the experimental design conducted. The number of anomalies
captured in the dataset is small, much lower than the number of Ok and Ko scenarios. Therefore, the
validity of the results obtained for these categories does not have the same statistical significance as for
Ok and Ko scenarios. Another point addressed is the size of the anomalous objects. In this study, only
anomalous elements of a relatively big size (the smallest is a drill) were captured. In fact, in subsection
5.2. Results and 5.3 Hybrid latent space for performance maximization it is seen that, when anomalous
objects become smaller, the effectiveness of the safety system slightly decreases. If smaller elements are
to be detected, it is necessary to use high-resolution industrial cameras or several cameras in parallel to
cover the entire surface of the industrial facility. In addition, we have extended the conclusions of the
study to reflect the aforementioned limitations.

 

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The article is relevant because the use of information technologies in industrial practice is very important today, as it significantly reduces organizational costs and increases the competitiveness and sustainability of the industry. The authors have appropriately applied the pyramid aggregation network for strip steel surface defect segmentation. After incorporating my comments, I recommend publishing the article in Automation journal. My comments on the article are as follows:

  1. The text in Figure 1 is not legible and needs to be enlarged.
  2. The authors wrote on line 41 that "A significant percentage of accidents occur within automated manufacturing lines". It would be helpful to indicate the percentage here.
  3. The version number of the VideoPose3D software should be included in the article.
  4. The authors of the article mention on line 163 that they are aiming to detect potentially dangerous anomalous situations caused by foreign objects of any kind. It would be useful to specify exactly which anomalies are being referred to here.
  5. The authors describe the physical environment used for research in Chapter 3, represented by Figure 2. However, this figure appears in Chapter 1. I suggest moving this figure to Chapter 3, immediately after it is referenced in the text. The same applies to Table 3. It would be better chronologically if the figure or table immediately followed its reference in the chapter text.

Author Response

Comment 1 — The text in Figure 1 is not legible and needs to be enlarged.

Reply: Thank you for pointing it out. We have increased the size of the text in the figure.

 

Comment 2 — The authors wrote on line 41 that ”A significant percentage of accidents occur
within automated manufacturing lines”. It would be helpful to indicate the percentage here. 

Reply: We fully agree. We must specify the percentage when making such a statement. We have tried
to find this information on various official websites of the European Union, the United States, and China,
but we have not found any metrics that match the data we need. However, we have found an interesting
article (see [Lee et al., 2021]) that analyzes 369 robot-related accidents that caused injuries to operators
in Korea during the period 2010-2020, showing that more than 95% of robot-related accidents occurred
in manufacturing companies, while the remaining 5% were recorded in the service and construction
sectors. We have incorporated this information into the paragraph to support the explanation of the
safety risk existing in automated manufacturing processes. 

 

Comment 3 — The version number of the VideoPose3D software should be included in the
article.

Reply: We have tried to find the software version used by the authors in [Tao et al., 2023] and have
not been able to find it. This software comes from the GitHub repository in [Pavllo et al., 2019], which
does not appear to have any release record. On the other hand, as requested by another reviewer, we
have slightly modified section 2. Related work to make the structure clearer. The work is still cited,
but details of the software used in it are no longer provided. 

 

Comment 4 — The authors of the article mention on line 163 that they are aiming to detect
potentially dangerous anomalous situations caused by foreign objects of any kind. It would be
useful to specify exactly which anomalies are being referred to here.

Reply: Since we have restructured section 2. Related work, the aforementioned sentence is now at the end of the first paragraph. We have extended it to improve the explanation, so we hope it is now clear that we can detect anomalies coming from the inclusion of any type of object. Detectability depends on the size of the object, but we prove that most of the most common objects can be detected. This includes objects like tools, cleaning utensils, wires, or components misplaced from another industrial process.

 

Comment 5 — The authors describe the physical environment used for research in Chapter 3,
represented by Figure 2. However, this figure appears in Chapter 1. I suggest moving this figure to
Chapter 3, immediately after it is referenced in the text. The same applies to Table 3. It would be
better chronologically if the figure or table immediately followed its reference in the chapter text. 

Reply: We fully agree. We have moved the location where Figure 2 appears so that it is within
subsection 3.1. Industrial configuration. Similarly, Table 3 is now within the subsection 5.2. Results. 

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The article discusses a very interesting and timely topic. The authors propose an AI-based safety system that combines contrastive learning with Bayesian modeling. The solution is designed to detect hazardous situations on production lines — for example, the presence of a person or foreign objects in the machine’s working area.

The idea is innovative and aligns well with the direction of Industry 4.0. However, the manuscript still needs some refinement to make it more transparent and convincing.

The main concept has strong practical potential, but it would be helpful to clarify how the proposed approach differs from previous studies. The literature review is extensive but tends to summarize other works rather than emphasize how this method extends or improves them.

The methodological section is detailed — at times a bit too technical, which makes it harder to follow. Adding a clear block diagram showing the entire process, from data collection to anomaly detection, would improve readability. It would also be useful to explain in more detail how data bias was evaluated, since PCA/t-SNE analysis alone is not sufficient.

The reported results are very high (AUC values close to 1.0), which may indicate possible overfitting — this aspect deserves further attention. It is also unclear whether the model was tested on independent datasets or under varying conditions (e.g., changes in lighting or camera angle). Including more information about runtime performance and practical applicability would strengthen the paper, especially since these factors are crucial for safety systems.

 

The use of patch-based ablation analysis is an excellent idea that improves model transparency. I would suggest including more examples with heatmaps and discussing how such visualizations help assess the reliability of the model.

The manuscript is rather long (over 30 pages) and contains some repetitions. It could be shortened and simplified for clarity. The English is generally correct but would benefit from light language editing — simplifying certain sentences, improving syntax, and removing redundancies.

The conclusions summarize the results well but are somewhat conservative. It would be valuable to include a reflection on limitations (e.g., possible errors, data constraints, scalability issues) and briefly describe how the authors plan to implement the system in practice — for instance, integration with PLC systems or testing on other production lines.


Author Response

Comment 1 — The main concept has strong practical potential, but it would be helpful to
clarify how the proposed approach differs from previous studies. The literature review is extensive
but tends to summarize other works rather than emphasize how this method extends or improves
them.

Reply: We fully agree. We have restructured section 2. Related work to provide a clearer structure.
We draw on literature from three different fields: quality control and risk detection in architecture,
engineering, and construction (AEC) environments, industrial anomaly detection, and the integration of
artificial intelligence (AI) systems with functional safety methodologies. Each paragraph of the section
discusses main research on that field. Also, in each paragraph, we express more clearly the differences
between the cited literature and the contribution of our work. 

 

Comment 2 — The methodological section is detailed — at times a bit too technical, which
makes it harder to follow. Adding a clear block diagram showing the entire process, from data
collection to anomaly detection, would improve readability. 

Reply: Thank you for pointing it out. We have tried to explain the concepts and proposed methodology
in simple terms, but we think it is a good idea to add an infographic summarizing all the stages. At
the end of section 3. Methods, we have added Figure 8. It is a diagram showing, sequentially, the
steps that had to be followed to develop the methodology presented in our work, from the placement
and adjustment of the camera to the validation of the test dataset and integration into the full stack
software and hardware. We have also added a paragraph at the end of section 3. Methods explaining
the new figure. 

 

Comment 3 — It would also be useful to explain in more detail how data bias was evaluated,
since PCA/t-SNE analysis alone is not sufficient.

Reply: We agree that the use of principal component analysis (PCA) and t-distributed stochastic
neighbor embedding (t-SNE) is not sufficient to conduct a comprehensive study of the possible presence
of bias in the dataset. We have improved this part in two different ways:
• First, by request of another reviewer, we have changed the way we calculate the embeddings. As discussed in [Wang et al., 2020], widely employed dimensionality reduction techniques like t-SNE and uniform manifold approximation and projection (UMAP) usually offer good performance maintaining local structure properties between instances, but they can miss large-scale geometry. However, the pairwise controlled manifold approximation projection (PaCMAP) method preserves both local and global structure. This way, it is possible to obtain an embedding space that is more representative of the intrinsic relationships between the images in the dataset. The new embedding space can be observed in Figure 5. It can be seen how observations belonging to different categories are mostly mixed in similar areas of the R² space. There is no category that is completely isolated in a cluster unrelated to the rest of the observations, which indicates that there does not appear to be a clear bias in the embedding space.
• Second, we have conducted a study to determine the distribution of the location of persons and
anomalous objects in the industrial cell. The results of the analysis can be seen in Figure 6.
Overall, it can be seen that all the categories are distributed reasonably uniformly across the
entire surface of the cell. There is no group of elements that has been captured only in an isolated
area of the machine. In order to present and explain these results, we have added a paragraph at
the end of subsection 3.2. Dataset.
We believe that these two modifications have improved the exploration of potential hidden biases in
the dataset and provided greater transparency regarding the nature and distribution of the data. 
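The location study described in the second bullet can be sketched as a coarse occupancy count over the cell surface. The coordinates below are synthetic stand-ins (the actual study uses the manually labelled object midpoints over the machine layout): if every zone of the grid is populated, there is no obvious positional gap in the sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for the manually labelled midpoints (x, y) of persons/objects in the cell
points = rng.uniform(0.0, 1.0, size=(850, 2))

# coarse 4x4 occupancy grid over the normalized cell surface
counts, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                              bins=4, range=[[0, 1], [0, 1]])
assert counts.sum() == 850   # every labelled point falls in some zone
assert counts.min() > 0      # no empty zone => no obvious positional gap
```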

 

Comment 4 — The reported results are very high (AUC values close to 1.0), which may
indicate possible overfitting — this aspect deserves further attention. It is also unclear whether the
model was tested on independent datasets or under varying conditions (e.g., changes in lighting or
camera angle)

Reply: It is true that the results obtained for detecting the Ko scenarios are very high (AUC = 1.0). However, we believe that this does not indicate overfitting. Overfitting typically occurs when a model memorizes the data used for training and is then unable to generalize the synthesized knowledge to unseen scenarios. The dataset presented in this work has been collected over a period of 12 consecutive days, at different hours, trying to achieve diverse ambient lighting conditions. In addition, the power of the different LED lighting devices integrated into the cell has also been altered. This way, we built a balanced dataset representative of a wide range of realistic lighting conditions that may occur in a long-term deployment. All these images have been divided into 4 subsets: train, validation, supplementary train, and test. Therefore, using the test set, we can assess the model's performance under different lighting conditions. We have expanded the fourth paragraph of subsection 3.2. Dataset to improve the explanation about the variety of lighting conditions covered in the dataset. In parallel, we have added a new paragraph at the end of subsection 5.2. Results discussing that supervised contrastive learning techniques are capable of achieving state-of-the-art results when using a balanced dataset with a large number of positive (safe) and negative (unsafe) examples, as in this work.

 

Comment 5 — Including more information about runtime performance and practical applicability would strengthen the paper, especially since these factors are crucial for safety systems. 

Reply: Thank you for pointing this out. We have added more information regarding the practical applicability of the solution at the end of the first part of section 7. Industrial deployment. We show the graphical interface we have built to house the methodology presented in the paper. The scenario shown in the image differs from the one discussed in the rest of the manuscript because, for confidentiality reasons, we cannot show screenshots of software that is already running in an industrial plant. The image in the figure shows the software running at our facilities. It can be seen how two unsafe scenes (brush and people's feet) are detected and, by activating a toggle button, a heatmap is displayed with which the cell worker can determine the presence of anomalies in the industrial facility. This heatmap is dynamically calculated using the input feature ablation method presented in the work. We believe that seeing the integration of the methodology into a software application provides insight into how the methodology can be integrated into an industrial process.

 

Comment 6 — The use of patch-based ablation analysis is an excellent idea that improves
model transparency. I would suggest including more examples with heatmaps and discussing how
such visualizations help assess the reliability of the model.

Reply: Thank you very much for the comment. In order to put more emphasis on that part of the work, we have added a new figure (Figure A3). It shows the output of the patch-based feature ablation method when an unsafe element (anomalous person or object) is found in a safe area (outside the facility). Specifically, the following scenarios have been considered:
• One Ko image caused by the presence of a person inside the cell.
• Two Ok images with persons outside the cell but visible in the images.
• One Ko image caused by the presence of an anomaly and with a person outside the cell but visible in the image.
It can be seen that, for the genuinely unsafe patches (red circles), the latent-space displacement caused by replacing them with a safe patch is 0.18881 and 1.05342, respectively. For the remaining situations (green circles), the displacements are several orders of magnitude lower, even when there are persons located next to the cell. This result shows that the models are capable of ignoring the presence of people and, in general, variable situations outside the cell. We have added a new paragraph at the end of subsection 6.1. Input feature ablations explaining this new result. In addition, we have added more examples of the generated heatmaps in the GitHub profile of one of the authors (see github.com/jesusferigl/).
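The displacement measure underlying these heatmaps can be sketched as follows. This is an illustrative implementation only: the toy per-channel-mean encoder, the patch size, and the helper names are our own stand-ins, not the paper's trained contrastive encoder.

```python
import numpy as np

def patch_ablation_displacements(image, safe_ref, encoder, patch=32):
    """For every patch, replace it with the corresponding patch of a safe
    reference image and measure how far the encoder's embedding moves
    (L2 distance in latent space). Large displacements flag the patches
    responsible for an unsafe classification."""
    base = encoder(image)
    h, w = image.shape[:2]
    heat = np.zeros((h // patch, w // patch))
    for i in range(h // patch):
        for j in range(w // patch):
            ablated = image.copy()
            ys = slice(i * patch, (i + 1) * patch)
            xs = slice(j * patch, (j + 1) * patch)
            ablated[ys, xs] = safe_ref[ys, xs]
            heat[i, j] = np.linalg.norm(encoder(ablated) - base)
    return heat

# Toy stand-in encoder (per-channel mean); the real system would use the
# trained contrastive encoder instead.
def toy_encoder(img):
    return img.reshape(-1, img.shape[-1]).mean(axis=0)

rng = np.random.default_rng(1)
safe = rng.uniform(0.4, 0.6, size=(128, 128, 3))
unsafe = safe.copy()
unsafe[32:64, 32:64] = 1.0  # simulated anomalous object in patch (1, 1)
heat = patch_ablation_displacements(unsafe, safe, toy_encoder)
```

Ablating the anomalous patch produces the largest displacement, while patches identical to the safe reference leave the embedding unchanged, mirroring the gap between red- and green-circle patches described above.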

 

Comment 7 — The manuscript is rather long (over 30 pages) and contains some repetitions.
It could be shortened and simplified for clarity. The English is generally correct but would benefit from light language editing — simplifying certain sentences, improving syntax, and removing
redundancies.

Reply: We agree. We have tried to simplify the wording and make some parts more straightforward and less dense to read. We have reduced repetitions across the entire work, but especially in sections 2. Related work and 5. Generalization to Unknown Non-Legitimate Scenarios: Uncertainty Quantification. The length of the manuscript has actually increased because we have added several figures and new paragraphs based on the new review comments. Nevertheless, we hope that the reader's comprehension and follow-up of the manuscript has improved.

 

Comment 8 — The conclusions summarize the results well but are somewhat conservative.
It would be valuable to include a reflection on limitations (e.g., possible errors, data constraints,
scalability issues) and briefly describe how the authors plan to implement the system in practice
— for instance, integration with PLC systems or testing on other production lines.

Reply: Thank you for the suggestion; we agree. We have added a new paragraph at the end of the conclusions section discussing the main limitations of our work. We emphasize scalability (the amount of work needed to set up the system for a new installation), and we also mention robustness against unexpected lighting changes and the need for a broader set of anomalies to obtain more representative results.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Regarding Comment 5: The variance addition to the plot is nice. However, what I wanted to point out here is the following. If you think about your KO/OK clusters as distributions, you don't just want a sufficiently good threshold; you want your two clusters not to overlap. You can visualize it, e.g., https://arxiv.org/pdf/2309.12667, compute it, e.g., https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2019.01089/full, or, if you assume an underlying type of distribution, test for statistical significance, e.g., using the t-test. The latter would give you insight into how high the probability is that these clusters overlap and provide a good numerical assessment of whether the distance is "high" enough. Please have a look.

Clarification regarding Comment 6: My point here was that you could have adapted the projector to multi-dimensionality, e.g., https://arxiv.org/pdf/2309.11782, or used dedicated improvements like https://proceedings.neurips.cc/paper_files/paper/2022/hash/297f7c6c56af81239f7c47d21558b75a-Abstract-Conference.html. However, nothing needs to be done here.

 

Author Response

Comment 1 — Regarding Comment 5: The variance addition to the plot is nice. However, what I wanted to point out here is the following. If you think about your KO/OK clusters as distributions, you don't just want a sufficiently good threshold; you want your two clusters not to overlap. You can visualize it, e.g., [link], compute it, e.g., [link], or, if you assume an underlying type of distribution, test for statistical significance, e.g., using the t-test. The latter would give you insight into how high the probability is that these clusters overlap and provide a good numerical assessment of whether the distance is "high" enough. Please have a look.

Reply: Thank you very much for pointing out the interesting references. We agree that we can calculate some statistical measures to further demonstrate the distinct distributions of the Ok and Ko latent representations for the different encoders. To that end, we use two metrics:
• Mann-Whitney U test.
• Pastore & Calcagni overlapping index.

We use the former to test the null hypothesis that two samples come from the same population, while we apply the latter (using the code from [Pastore, 2018]) to quantify the similarity between two or more empirical distributions through the overlap between their kernel density estimates. We choose the non-parametric Mann-Whitney U test because the normality assumption is not satisfied. The results obtained are presented in Table 3. It can be seen that, for the Mann-Whitney test, all p-values are extremely low (the minimum that can be achieved computationally). Therefore, the hypothesis that the safe and unsafe distances come from the same population can be rejected. As for the overlap index, all values are also very close to 0, indicating a near-total absence of overlap between the safe and unsafe distributions.
We have added a new paragraph in section 4. Experimental results for the safe/unsafe baseline scenario in which the results are presented and explained. In addition, we have also generated the associated overlapping bell curves (see Appendix A Overlapping bell curves), so that it can be visually checked that the Ok and Ko distributions do not overlap for any of the encoders.
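For reference, both measures can be reproduced with standard tools. The sketch below uses synthetic stand-in distances (not the paper's data) and computes the overlapping index directly as the shared area under the two kernel density estimates, rather than with the original R code of [Pastore, 2018].

```python
import numpy as np
from scipy.stats import gaussian_kde, mannwhitneyu

rng = np.random.default_rng(0)
# Synthetic stand-ins for the latent-space distances of safe (Ok) and
# unsafe (Ko) samples; the real values would come from each encoder.
ok_dist = rng.normal(loc=0.2, scale=0.05, size=500)
ko_dist = rng.normal(loc=1.0, scale=0.10, size=500)

# Non-parametric test of the null hypothesis that both samples
# come from the same population.
u_stat, p_value = mannwhitneyu(ok_dist, ko_dist, alternative="two-sided")

# Overlapping index: area shared by the two kernel density estimates,
# i.e. the integral of min(f, g) over a common grid.
grid = np.linspace(min(ok_dist.min(), ko_dist.min()),
                   max(ok_dist.max(), ko_dist.max()), 2048)
f = gaussian_kde(ok_dist)(grid)
g = gaussian_kde(ko_dist)(grid)
eta = float(np.minimum(f, g).sum() * (grid[1] - grid[0]))
```

A p-value at the numerical minimum rejects the common-population hypothesis, and an overlap index near 0 indicates well-separated Ok/Ko clusters, which is the pattern reported in Table 3.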

 

Comment 2 — Clarification regarding Comment 6: My point here was you could have adapted
the projector to multi dimensionality, e.g., [link] or use dedicated improvements like [link]. However
nothing needs to be done here.

Reply: Thank you very much for pointing out references with such an interesting application to our line of work. As we are currently working on a version of the safety system with some improvements (presented in 7.1. Limitations and future work), we will thoroughly review and consider these studies to improve the quality of the proposed methodology and the results obtained.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The article presents an interesting and relevant study on the application of contrastive learning combined with Bayesian modeling to enhance functional safety in industrial production lines. The work is original, methodologically correct, and well supported by experimental validation, with results confirming the effectiveness of the proposed approach.

The authors have developed a coherent and well-structured detection system with clear potential for practical implementation in the context of Industry 4.0. The proposed methodology (a combination of supervised contrastive learning and Bayesian modeling) is justified and contributes meaningfully to the development of artificial intelligence methods in industrial applications.

Minor suggestions concern clarity and style — some sentences could be shortened or simplified, and the description of technical details made more concise. It would also be useful to slightly expand the part discussing model interpretability (patch-based ablation and heatmap analysis).

Overall, the paper is well prepared, logical, and valuable. It can be accepted in its current form or after minor editorial and language polishing.

Author Response

Comment 1 — Minor suggestions concern clarity and style — some sentences could be shortened or simplified, and the description of technical details made more concise. It would also be useful to slightly expand the part discussing model interpretability (patch-based ablation and heatmap analysis).

Reply: We agree that some parts of the manuscript could be simplified. We have tried to shorten some sentences and remove repetitive technical details and explanations in section 3. Methods. We have attached a file showing the differences from the previous version of the manuscript to help identify the modified parts. We have also added a new subsection (6.3. Discussion) at the end of section 6. Confidence against uncertainty: explainable artificial intelligence (XAI), and relocated the final part of subsection 6.2. Saliency maps to it. Additionally, we have added information about the value of the feature ablation method in determining that the contrastive learning (CL) scheme is capable of ignoring the presence of illegitimate elements in areas outside the cell. We have also pointed out that section 7. Industrial deployment shows how these heatmaps can be included in a software application to enhance usability and provide transparency regarding the safety system's decision-making process.

Author Response File: Author Response.pdf
