Comparing Application-Level Hardening Techniques for Neural Networks on GPUs
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Dear Authors,
After consulting the current version of your manuscript, a few issues need to be corrected.
First, the Introduction is very long and should not have subchapters. It is recommendable to display a foreword in regard to your methods and results, but as it is, you need to rethink it.
Second, the entire text, top to bottom, lacks a proper research manuscript structure.
The literature review appears to be found here under the title Background and related works. But the chapter is unclear: what are the variables you are referring to? What does the literature say in this regard?
What are your hypotheses? Is it backed up with previous results?
Literature review and manuscripts in general avoid bullet point/numbering information display. You can use that for primary data reports and/or notes.
The methodology at this point seems incomplete. As well as the Results section, it does not display thresholds, methods choices or any data array in the light of accepted literature.
For this reason, results at this point seem semi-valid. Also, you need to consider what happened to your H and whether you accepted/rejected it and why.
Limitations and future remarks are recommended as individual subchapters at this point.
Your reference list also needs improvement.
Best regards,
Author Response
Concern#1: the Introduction is very long and should not have subchapters. It is recommendable to display a foreword in regard to your methods and results, but as it is, you need to rethink it.
Author response#1:
We thank the reviewer for the comment and the suggestion. We apologize for not correctly structuring the manuscript, especially in the first half (Sections 1 and 2).
Author Action#1:
Section 1 (Introduction) and Section 2 (Background and Related works) were updated according to the reviewer’s comments.
Section 1 (Introduction) now addresses 3 main topics:
- Edge AI in safety-critical applications;
- Reliability concerns regarding the approaches used to evaluate the effectiveness of software-based reliability enhancement strategies; in this way, we highlight the limitations found in previous works that we overcome with the current work;
- Contributions and achieved results.
Concern#2: The literature review appears to be found here under the title Background and related works. But the chapter is unclear: what are the variables you are referring to? What does the literature say in this regard? What are your hypotheses? Is it backed up with previous results?
Author response#2:
We thank the reviewer for the suggestion to clarify the literature review. We introduced the related works more thoroughly in the Introduction and clarified our hypothesis alongside them. The main gap that we have found in previous works, specifically in the evaluation of software-implemented Hardening Techniques (HTs) working at the application level, is that they often resort to simplistic and hardware-agnostic error models.
We have considered various factors, one of which is the accuracy of the error model used to evaluate the effectiveness of the proposed hardening techniques. Specifically, previous studies that propose application-level hardening techniques often simulate faults by performing multiple bit-flips at random neuron locations. However, this way of simulating faults overlooks error propagation, which can be influenced by the scheduling policy of the underlying hardware; as a result, it increases the level of hardware agnosticism. Moreover, these evaluations always involve a different test bench (e.g., NN and dataset under test, error model) and, for this reason, there exists no unified approach that allows for a direct comparison between the HTs proposed in the literature. In addition to the lack of a standardized approach, previous works do not provide information about the complexity or engineering effort required to implement each HT.
For this reason, we propose an evaluation approach that, by employing a hardware-aware error model, allows us to compare the effectiveness of the hardening techniques. Furthermore, our evaluation methodology also assesses the practicality of implementing each Hardening Technique by introducing qualitative and quantitative metrics that objectively grade the implementation effort and the impact of the HT on NN performance in the absence of hardware faults (namely, the Effort of Implementation and the Inference Time Overhead). Our experiments assess the effectiveness of application-level HTs from different perspectives: the complexity of the implementation, the accuracy in the fault-free scenario, the estimation of the inference time overhead introduced by the implementation of the HT, and the accuracy degradation due to errors arising from faults.
Author Action#2:
We have updated the name of Section 2 to “Background”, where we provided the needed background to better understand the rest of the article. For this reason, we included the following subsections:
- GPU organization and fault propagation;
- Neural Networks;
- Software solutions for enhancing NNs reliability.
Concern#3: literature review and manuscripts in general avoid bullet point/ numbering information display. You can use that for primary data reports and/or notes.
Author response#3:
We thank the reviewer for the suggestion. We followed the suggestion and avoided bullet points in the Background section.
Author Action#3:
We have turned the list in Section 2.1 (Neural Networks) into paragraphs describing the various Neural Network architectures.
Concern#4: the methodology at this point seems incomplete. As well as the Results section, it does not display thresholds, methods choices or any data array in the light of accepted literature. For this reason, results at this point seem semi-valid. Also, you need to consider what happened to your H and whether you accepted/ rejected it and why.
Author response#4:
We apologize for the lack of clarity in the methodology and results sections. In this work we propose an evaluation methodology for the most prominent application-level Hardening Techniques (HTs), which comprises three main steps: i) HT implementation, ii) reliability assessment through hardware-aware fault injection (FI) campaigns, and iii) data analysis.
The first step involves the implementation of existing application-level HTs on the NN architectures. To do so, a set of NN architectures is modified according to each HT's original implementation procedure (e.g., changing the order of the layers or including an additional operation in the original NN architecture).
The second step is an experimental stage that assesses the reliability enhancement capabilities of the implemented HTs. Specifically, FI campaigns are conducted at the application level resorting to two existing hardware-aware error models: i) Weight Single Bit Flip (WSBF) and ii) Neuron Bit Error Rate (NBER).
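For illustration, the following is a minimal sketch of how a weight single bit-flip injection can be performed at the application level on a PyTorch model; the helper names and the random selection strategy are illustrative assumptions, not the exact fault injector used in the manuscript.

```python
import random
import struct

import torch

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a 32-bit IEEE-754 float and return the corrupted value."""
    as_int = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))[0]

def inject_wsbf(model: torch.nn.Module) -> None:
    """Corrupt one randomly chosen weight with a single bit flip (WSBF-style)."""
    weights = [p for p in model.parameters() if p.dim() > 1]  # weight tensors only
    target = random.choice(weights).data.view(-1)
    idx = random.randrange(target.numel())
    bit = random.randrange(32)
    target[idx] = flip_bit(float(target[idx]), bit)
```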
The last step in our evaluation procedure involves an evaluation that targets i) the HT implementation impact and ii) the effectiveness of the HTs as fault countermeasures.
Remarkably, the implementations of the considered HTs require modifications to the NN architecture, which can increase inference time. Additionally, the new architecture may require an extra training phase, and there is a possibility that the new neural network may not reach the same level of convergence as the original architecture. For this reason, during the HT implementation impact evaluation, we evaluated the impact on the NN accuracy (quantitative assessment), the effort required to make HTs successful (qualitative evaluation) and the introduced overhead (overhead estimation), whereas in the faulty scenarios (during WSBF FI and NBER FI) we evaluated the NN accuracy degradation. In order to perform the qualitative evaluation and the overhead estimation, we introduced two metrics: the Effort of Implementation and the Inference Time Overhead, respectively.
Concerning the HT effectiveness study, we have conducted a quantitative assessment of the NN accuracy degradation recorded after the WSBF and NBER FI campaigns. To do so, taking inspiration from previous studies on DNN reliability assessment, we classified the faults based on the Relative Accuracy Degradation metric, which allowed us to assess the fault mitigation capabilities of the Hardening Technique under test.
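For clarity, a commonly used formulation of the Relative Accuracy Degradation is sketched below; the exact expression adopted in the manuscript may differ slightly.

```latex
\mathrm{RAD} = \frac{A_{\text{fault-free}} - A_{\text{faulty}}}{A_{\text{fault-free}}} \times 100\%
```

where A_fault-free is the accuracy measured in the golden (fault-free) run and A_faulty is the accuracy measured after a fault injection.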
Author Action#4:
We modified Section 1 (Introduction), stressing how our contribution fills the gap that we found in the literature. Moreover, we removed the discussion of software reliability concerns, as it was already covered in the Background section. In the same section, we elaborated more on the general evaluation flow, and we changed Figure 4 accordingly. We included the definition of our qualitative metric (Effort of Implementation) in Section 3.3.
Concern#5: Your reference list also needs improvement.
Author response#5:
We thank the reviewer for the suggestion. We have carefully revisited all the references in the document, making sure they follow the reference guidelines. The following references to recent works further support our work and have been added to the list:
- Rech, P.; “Artificial Neural Networks for Space and Safety-Critical Applications: Reliability Issues and Potential Solutions,” 2024, in IEEE Transactions on Nuclear Science. The article provides practical guidelines for performing reliability assessments for faults, referring to application-level error models for soft errors.
- Jing, W. et al.; “Enhancing Neural Network Reliability: Insights From Hardware/Software Collaboration With Neuron Vulnerability Quantization,” 2024, in IEEE Transactions on Computers. The article presents a software-based application-level fault countermeasure inspired by hardware scheduling policy.
- Ruospo, A.; “Selective Hardening of Critical Neurons in Deep Neural Networks,” in 2022 25th International Symposium on Design and Diagnostics of Electronic Circuits and Systems (DDECS). The article proposes a selective hardening technique applying Triple Modular Redundancy on the most sensitive neurons of Neural Networks.
- Bono, F.M. et al.; “A novel approach for quality control of automated production lines working under highly inconsistent conditions,” 2023 in Engineering Applications of Artificial Intelligence. Reviewer 2 suggested the article in Concern#2, which is about an application of Neural Networks for industry quality control.
- Bourechak, A. et al.; “At the Confluence of Artificial Intelligence and Edge Computing in IoT-Based Applications: A Review and New Perspectives,” 2023 in MDPI Sensors. The paper provides a literature review of the application of Artificial Intelligence for Edge Computing.
Author Action#5:
We included more recent references in the list. We also included the date when the website was visited for the references to the visited websites.
Concern#6: Limitations and future remarks are recommended as individual subchapters at this point.
Author response#6:
We appreciate the reviewer’s suggestion about providing limitations and future plans for our work.
In the future, we plan to expand our test bench to include more neural network architectures commonly used in Edge AI applications, such as Split Computing NNs, Quantized NNs, and Sparse NNs. Moreover, we will develop a new software-based hardening technique based on the results presented in this work.
Author Action#6:
We included an additional paragraph in the conclusion chapter describing limitations and future remarks of our work.
Reviewer 2 Report
Comments and Suggestions for Authors
Dear authors,
Interesting and well described work. I have some comments to improve the work:
1) The first footnote should not be placed in the first page but in the last page in the dedicated section.
2) In section 2.1 when describing general NN, you should refer to previous works. I suggest citing the following: https://www.sciencedirect.com/science/article/pii/S0952197623003330
3) Conclusion should be more quantitative and supported by results
4) Did you test your approach on other datasets?
5) Eq 5 is out of bound – consider putting it on two lines
6) Please describe in details fig 2 and 3
Author Response
Concern#1: The first footnote should not be placed in the first page but in the last page in the dedicated section.
Author response#1: We thank the reviewer for pointing out the misplaced footnote.
Author Action#1: We have moved the footnote from the first page to the section dedicated to project funding.
Concern#2: In section 2.1, when describing general NN, you should refer to previous works. I suggest citing the following: https://www.sciencedirect.com/science/article/pii/S0952197623003330
Author response#2: We appreciate the suggestions provided by the reviewer. We have included the suggested reference in the Introduction.
Author Action#2: We updated Section 2.1 (Neural Networks) by including the suggested reference.
Concern#3: Conclusion should be more quantitative and supported by results
Author response#3: We thank the reviewer for the comment and concern and apologize for not including enough quantitative results to support our conclusions.
Our study proposes an evaluation strategy that allows us to present an independent assessment based on i) the impact that the HT implementation has on the original NN architecture and ii) the HTs' effectiveness when subjected to hardware-aware error models.
Experimental results show that implementing Hardening Techniques (HTs) requires significant effort, which can render them impractical. For example, applying Swap ReLU6 to MobileNet V2 with WSBF results in a modest accuracy drop to 78.99%, despite being superior to other HTs. Its implementation effort is rated at 6 out of 10, which may lead to a preference for alternative HTs.
In contrast, range restriction-based HTs, such as Adaptive Clipper and Ranger, offer a better trade-off between implementation effort and performance, with Adaptive Clipper rated up to 6/10 for effort and a maximum accuracy degradation of 4.5%. Additionally, Adaptive Clipper incurs only 3.52% inference time overhead when implemented on Mnasnet, while Ranger shows a higher overhead of 7.54%. Furthermore, range restriction-based HTs can effectively mask fault effects in over 58% of cases during hardware-aware fault injections, particularly when Adaptive Clipper is used on Lenet5 compared to the unhardened version.
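To make the comparison concrete, the following is a minimal sketch of what a range restriction-based HT does at the application level (an activation-clamping module in PyTorch); the class name and the example bounds are illustrative assumptions rather than the exact Adaptive Clipper or Ranger implementation.

```python
import torch
import torch.nn as nn

class RangeRestriction(nn.Module):
    """Clamp activations to bounds profiled on fault-free data so that
    out-of-range values produced by hardware faults are suppressed."""

    def __init__(self, lower: float, upper: float):
        super().__init__()
        self.lower = lower
        self.upper = upper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.clamp(x, self.lower, self.upper)

# Example: harden an activation with bounds observed during profiling (values are illustrative).
hardened_activation = nn.Sequential(nn.ReLU(), RangeRestriction(0.0, 6.0))
```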
Author Action#3: We updated section 6 (Discussion), providing more quantitative evidence on the effectiveness of the Hardening Techniques resulting from our experiments.
Concern#4: Did you test your approach on other datasets?
Author response#4: We thank the reviewer for the question about our evaluation setup. To achieve a systematic and reproducible study, we utilized three of the most representative datasets that accurately describe real-world conditions (CIFAR-10, MNIST, STL10). However, our proposed evaluation approach is applicable to any dataset (see Concern#1 from Reviewer 3).
Author Action#4: In Section 4 (Experimental Setup) we clarified the reason behind the choice of the selected dataset along with the assigned NN architectures.
Concern#5: Eq 5 is out of bound – consider putting it on two lines
Author response#5:
We thank the reviewer for the suggestion; we modified Equation 5 accordingly.
Author Action#5:
We modified Equation 5 so that it fits in the template bounds.
Concern#6: Please describe in details fig 2 and 3
Author response#6:
We thank the reviewer for the suggestion.
Figure 2 highlights the GPU's hierarchical organization. We understand that its original description lacked a description of the visual elements presented in the figure. The grey box on the left represents the software running on the GPU based on the Single Instruction Multiple Threads (SIMT) programming paradigm for parallel computing. The program organizes the threads into grids and blocks. A block is a collection of threads, and a grid is a collection of blocks.
On the right-hand side of the image, the GPU hardware components are depicted: a scheduler defines the policy for distributing the threads to the physical components. Each Streaming Multiprocessor (SM) executes a Thread Block (TB), while all SMs share the GPU Device Memory. An SM includes local memories and Register Files to support parallel thread execution. Subsequently, the SM schedules several warps onto the Streaming/Scalar Processors (SPs) to perform integer, floating-point, and trigonometric operations. This hierarchical representation provides useful preliminary insights to better understand the error propagation induced by a physical defect throughout the GPU architecture.
Figure 3 shows how a physical defect can lead to errors in an application. It illustrates the varying levels of impact the defect can have, affecting both hardware (shown in the lower part of the figure) and software (shown in the upper part). This supports the error model we used for our evaluations, which describes the errors occurring at the output of the Fused Multiply-Accumulate (FMA) cores during the execution of the Tiling Matrix Multiplication algorithm due to a physical defect.
The Tiling Matrix Multiplication divides the inputs (feature maps and weight arrays) into smaller submatrices known as tiles, which are then distributed among the GPU's parallel cores. The computation is carried out through several TB computations, which are further divided into smaller tiles at the warp level within the GPU's SMs. This tiling method requires each thread in a warp to compute up to four small matrix multiplications between tiles, each of size 4x4. When a single SM has a faulty FMA core, errors can propagate to multiple threads within a warp, resulting in data corruption at the output of the tile. Consequently, if more than one tile is processed on the faulty SM, a similar error pattern will likely affect the results of the algorithm.
However, the structural organization of the hardware or the intrinsic nature of the application code may mask some of these effects, meaning they can go unnoticed if they do not produce a visible corruption in the application's outputs or in the system itself.
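As an illustration of the error pattern described above, the following NumPy sketch mimics a tiled matrix multiplication in which every tile mapped to a hypothetical faulty core receives the same corruption; the tile size, the tile-to-core mapping, and the corruption value are illustrative assumptions, not the real GPU scheduling.

```python
import numpy as np

TILE = 4  # warp-level tile size used in the description above

def tiled_matmul(A: np.ndarray, B: np.ndarray, faulty_core: bool = False) -> np.ndarray:
    """Compute C = A @ B tile by tile (assumes square matrices with size a multiple of TILE);
    if faulty_core is True, every tile assigned to the faulty core gets the same corruption."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=np.float32)
    for i in range(0, n, TILE):
        for j in range(0, n, TILE):
            tile = A[i:i + TILE, :] @ B[:, j:j + TILE]
            # Illustrative mapping: every other tile is computed on the faulty core.
            if faulty_core and ((i // TILE + j // TILE) % 2 == 0):
                tile[0, 0] += 1e6  # the same error pattern repeats at the output of affected tiles
            C[i:i + TILE, j:j + TILE] = tile
    return C
```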
Author Action#6:
In Section 2.2 we improved the description of the GPU organization depicted in Figure 2 and of the error propagation shown in Figure 3 where we included the error propagation pattern during the execution of the Tiling Matrix Multiplication.
Reviewer 3 Report
Comments and Suggestions for Authors
In this paper, the authors proposed a method for comparing and evaluating the application-level hardening techniques (HTs) for neural networks (NNs) on GPUs. The authors explained the need for their study and the importance of their work in the introduction section and mentioned the various applications of the problem in practice. They have clearly explained the limitations of the existing approaches used to motivate the current research. The authors have summarized the paper's contribution in subsection 1.4, which concurs with the paper's content.
There is a list of some points that I consider important to be addressed in the paper:
- Comments concerning Section 4 (Experimental Setup):
- The authors evaluated the effectiveness of different HTs on five existing architectures, and four different input datasets were used. This is an important feature confirming the generalizability of the findings. However, I noticed that you chose rather different deep neural networks (DNNs). I understand what could be the reason for this, but can you elaborate on it more in this section? Also, the same can be applied to the datasets chosen. Of course, these are standardly used datasets from the literature, but how did you choose which architecture will be evaluated on which dataset?
- Also, you mention trying different hyperparameter settings and configurations for training the models. How did you finally select the hyperparameters? Are they chosen based on hit-and-miss, or did you use an optimization algorithm for this? If yes, which one? Please elaborate on this in more detail in this section.
- The list of references is rather novel, as 28 of the total 59 references originate from the last five years (2020-2025). Is it possible to extend the reference list with more (most recent) references? This would additionally confirm that the current topic is novel and actual, especially nowadays with the usage of various edge devices in everyday applications.
- The paper is concluded in Section 7, but the authors do not mention their future work. This should be included in the paper, and the authors should comment on the possible next steps of their research. Maybe they would like to work on the main shortcomings of their proposed model or to improve on the weak points of their methodology if those exist in the paper. I would kindly ask the authors to address this accordingly.
There are some general comments on the formatting of the paper that should be improved:
- The references in the paper follow the order they appear in the text, and all figures are referenced correctly. I might be wrong, but I think that Table 3 is the only one not referenced in the text. Also, there might be a reference number [33] that is not referenced in the text. Please correct this if needed.
- When referencing equations in the paper, the authors sometimes refer to them as “Eq. X”, and sometimes as “equation X” (for example I have noticed this in subsection 3.2.1). On the other hand, in subsection 3.3, the equations are referenced in the text as “Equation X”. Please make the referencing of the equations unique and consistent throughout the paper (and according to the journal requirements). Also, I have noticed inconsistency when referencing the figures, as in subsection 5.4 a figure is referenced as “Fig. 7”, and then as “Figure 7” – all within the same paragraph. Please ensure consistency when referencing both equations and figures.
- In general, I believe that the paper has a very large number of abbreviations, to the extent that it becomes difficult for the reader to follow the text. I needed to “go back” several times and remind myself of the meaning of a certain abbreviation so that I could understand the paragraphs. This could be difficult to accomplish, as it could be needed due to the specific topic of the paper, but do you think it would be possible to reduce the total number of abbreviations in the paper? Or, maybe it would be easier to add the list of references at the end of the paper?
- The abbreviation SW is used in the abstract without providing its full name.
- There are some abbreviations whose full names are not given in the main paper text. I have noticed this for ReLU, lr... There could be more, so please check. Namely, the full name of each abbreviation should be given only once (when first mentioned), so please address this accordingly. For example, the abbreviation BL is used a lot in the text, but its full name is given very late (in subsection 5.1). There might be other cases like this, so please check all the abbreviations you have.
- Also, check that there are no abbreviations whose full names are given more than once. I have noticed this for AC, R, SR, MF, TMR (redefined in subsection 5.1), ITO (redefined in subsection 5.2), and NBER (redefined in Section 7). Please check if other examples exist and correct them accordingly.
- In the list of references, when you put a link or a website that you visited (like ref. [2], [5], [27], [57], [58], [59]), I think that it would be needed to add the date and time the website was accessed.
Author Response
Concern#1: The authors evaluated the effectiveness of different HTs on five existing architectures and four different input datasets were used. This is an important feature confirming the generalizability of the findings. However, I noticed that you chose rather different deep neural networks (DNNs). I understand what could be the reason for this, but can you elaborate on it more in this section? Also, the same can be applied to the datasets chosen. Of course, these are standardly used datasets from the literature, but how did you choose which architecture will be evaluated on which dataset?
Author response#1:
We appreciate the reviewer’s comments and concerns.
It is true that multiple DNN architectures exist, and they all deserve deep evaluation. Nonetheless, in choosing DNNs, we aimed to cover diverse architectures that differ in depth, layer composition, and computational characteristics. This diversity allows us to analyze how different HTs perform under varying workload conditions, ensuring our findings are not biased toward a single type of architecture. Similarly, the selection of datasets was guided by their standard use in the literature and their relevance to the specific architectures being evaluated. We aimed to ensure that each DNN was tested in a realistic and meaningful context that reflects common deployment scenarios. To achieve a systematic and reproducible study, we utilized three of the most representative datasets that accurately describe real-world conditions. However, our proposed evaluation approach is applicable to any dataset.
Regarding the selection of the pairings between dataset and architecture, we adhered to the experimental setups outlined in the original articles that introduced the considered neural network architectures, with the exception of ResNet18. We chose to evaluate ResNet18 using the CIFAR-10 dataset instead of ImageNet, as specified in the original paper. This decision was made to demonstrate how the impact of errors depends on the neural network architecture, specifically when comparing MnasNet, ResNet18, and MobileNet V2. MobileNet V2, MnasNet, and ResNet18 all utilize skip connections, but they employ them in different ways: MobileNet V2 and MnasNet incorporate skip connections within an Inverted Residual Block, while ResNet18 integrates them in a Residual Block. Testing these three neural network architectures on the same dataset allows us to evaluate the impact of faults when HTs are implemented across these different architectures.
Author Action#1:
In Section 4 (Experimental Setup) we clarified the reasons behind the choice of the selected dataset along with the assigned NN architectures.
Concern#2: You mention trying different hyperparameter settings and configurations for training the models. How did you finally select the hyperparameters? Are they chosen based on hit-and-miss, or did you use an optimization algorithm for this? If yes, which one? Please elaborate on this in more detail in this section.
Author response#2:
We apologize for not explaining the process for choosing the hyperparameters used to train the Neural Network configurations. As three out of the five application-level software-implemented Hardening Techniques (HTs) (Adaptive Clipper, Median Filter and Swap ReLU) require re-training or fine-tuning the NN architectures, we performed a hyperparameter tuning step to guarantee convergence of the training algorithm with the best NN performance at inference time.
Starting from the hyperparameter configuration required by the original training process of the Baseline Neural Network architecture, the hyperparameters were chosen by hit-and-miss (a minimal sketch of this procedure is given after the list):
- The number of epochs is increased as long as the training algorithm has not converged.
- The optimizer is always the one used in the original implementation of the NN architecture's training process.
- The initial learning rate (lr) is reduced if the training process ends with a steady training loss.
- The learning rate scaling factor is kept constant.
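Below is a minimal sketch of the hit-and-miss procedure summarized above; the train() callback, its return values, and the trial limit are illustrative assumptions, not the exact training scripts used in the experiments.

```python
def tune_hyperparameters(train, base_epochs: int, base_lr: float,
                         lr_reduction: float = 0.1, max_trials: int = 5):
    """Hit-and-miss tuning: keep the original optimizer, add epochs while the
    training has not converged, and lower the initial lr if the loss plateaus."""
    epochs, lr = base_epochs, base_lr
    for _ in range(max_trials):
        converged, loss_plateaued = train(epochs=epochs, lr=lr)
        if converged:
            break
        if loss_plateaued:
            lr *= lr_reduction      # steady training loss: reduce the initial learning rate
        else:
            epochs += base_epochs   # not converged yet: allow more epochs
    return epochs, lr
```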
Author Action#2:
We have revisited the manuscript and included the details about the hyperparameter tuning in Section 4 (Experimental setup), as suggested.
Concern#3: The list of references is rather novel, as 28 of the total 59 references originate from the last five years (2020-2025). Is it possible to extend the reference list with more (most recent) references? This would additionally confirm that the current topic is novel and actual, especially nowadays with the usage of various edge devices in everyday applications.
Author response#3:
We thank the reviewer for the suggestion to increase the number of citations to recent papers. The following references to recent works further support our work and have been added to the list:
- Rech, P.; “Artificial Neural Networks for Space and Safety-Critical Applications: Reliability Issues and Potential Solutions,” 2024, in IEEE Transactions on Nuclear Science. The article provides practical guidelines for performing reliability assessments for faults, referring to application-level error models for soft errors.
- Jing, W. et al.; “Enhancing Neural Network Reliability: Insights From Hardware/Software Collaboration with Neuron Vulnerability Quantization,” 2024, in IEEE Transactions on Computers. The article presents a software-based application-level fault countermeasure inspired by hardware scheduling policy.
- Ruospo, A.; “Selective Hardening of Critical Neurons in Deep Neural Networks,” in 2022 25th International Symposium on Design and Diagnostics of Electronic Circuits and Systems (DDECS). The article proposes a selective hardening technique applying Triple Modular Redundancy on the most sensitive neurons of Neural Networks.
- Bono, F.M. et al.; “A novel approach for quality control of automated production lines working under highly inconsistent conditions,” 2023 in Engineering Applications of Artificial Intelligence. Reviewer 2 suggested the article in Concern#2, which is about an application of Neural Networks for industry quality control.
- Bourechak, A. et al.; “At the Confluence of Artificial Intelligence and Edge Computing in IoT-Based Applications: A Review and New Perspectives,” 2023 in MDPI Sensors. The paper provides a literature review of the application of Artificial Intelligence for Edge Computing.
Author Action#3:
We have added new references in the introduction and in the Background section.
Concern#4: The paper is concluded in Section 7, but the authors do not mention their future work. This should be included in the paper, and the authors should comment on the possible next steps of their research. Maybe they would like to work on the main shortcomings of their proposed model or to improve on the weak points of their methodology if those exist in the paper. I would kindly ask the authors to address this accordingly.
Author response#4:
We appreciate the reviewer’s suggestion about providing limitations and future plans for our work.
In the future, we plan to expand our test bench to include more neural network architectures commonly used in Edge AI applications, such as Split Computing NNs, Quantized NNs, and Sparse NNs. Moreover, we will develop a new software-based hardening technique based on the results presented in this work.
Author Action#4:
We included the future works in the Conclusion section.
Concern#5: The references in the paper follow the order they appear in the text, and all figures are referenced correctly. I might be wrong, but I think that Table 3 is the only one not referenced in the text. Also, there might be a reference number [33] that is not referenced in the text. Please correct this if needed.
Author response#5:
We appreciate the suggestions provided by the reviewer. We have revisited the paper and crosschecked that all figures, references, and tables are appropriately cited across the text, including Table 3 and reference [33].
Author Action#5:
We have revisited the manuscript and included the reference to Table 3 in Section 4 (Experimental Setup) and the reference [33] in Section 2.2 (GPU organization and fault propagation).
Concern#6: When referencing equations in the paper, the authors sometimes refer to them as “Eq. X“, and sometimes as “equation X“ (for example I have noticed this in subsection 3.2.1). On the other hand, in subsection 3.3, the equations are referenced in the text as “Equation X“. Please make the referencing of the equations unique and consistent throughout the paper (and according to the journal requirements). Also, I have noticed inconsistency when referencing the figures, as in subsection 5.4 a figure is referenced as “Fig. 7“, and then as “Figure 7“ – all within the same paragraph. Please ensure consistency when referencing both equations and figures.
Author response#6:
We apologize for not referencing figures and equations consistently throughout the paper. We have fixed the references to the equations and figures.
Author Action#6:
According to the journal guidelines, we fixed the references to Equations 1 and 2, Equation 3, and Figure 7, and moved Table 1 to the Experimental Results section.
Concern#7: In general, I believe that the paper has a very large number of abbreviations, to the extent that it becomes difficult for the reader to follow the text. I needed to “go back“ several times and remind myself of the meaning of a certain abbreviation so that I could understand the paragraphs. This could be difficult to accomplish, as it could be needed due to the specific topic of the paper, but do you think it would be possible to reduce the total number of abbreviations in the paper? Or, maybe it would be easier to add the list of references at the end of the paper?
Author response#7:
We thank the reviewer for the comment and the suggestion.
We reduced the number of acronyms. For example, we understand the acronyms BL, AC, R, MF and SR (which correspond to Baseline, Adaptive Clipper, Ranger, Median Filter and Swap ReLU) can be hard to remember while reading, so we removed them from the paper and used the full names of the different techniques instead.
Author Action#7:
We have modified the sections: Software solutions for reliability enhancement of NNs, Experimental setup, Fault mitigation w.r.t. WSBFs, Fault mitigation w.r.t. NBER, by removing the acronyms for Baseline (BL), Ranger (R), Swap ReLU (SR), Adaptive Clipper (AC) and Median Filter (MF).
Concern#8: The abbreviation SW is used in the abstract without providing its full name. There are some abbreviations whose full names are not given in the main paper text. I have noticed this for ReLU, lr... There could be more, so please check. Namely, the full name of each abbreviation should be given only once (when first mentioned), so please address this accordingly. For example, the abbreviation BL is used a lot in the text, but its full name is given very late (in subsection 5.1). There might be other cases like this, so please check all the abbreviations you have. Also, check that there are no abbreviations whose full names are given more than once. I have noticed this for AC, R, SR, MF, TMR (redefined in subsection 5.1), ITO (redefined in subsection 5.2), and NBER (redefined in Section 7). Please check if other examples exist and correct them accordingly.
Author response#8:
We thank the reviewer for the suggestion, and we apologize for our incorrect use of the acronyms. We have revisited all the acronyms in the paper, making sure they were all introduced properly.
Author Action#8:
We now define each acronym only when it is first mentioned. Specifically, we modified the abstract to include the complete word “Software” rather than the SW acronym, and we added the definitions of ReLU, lr, NBER, BER and RAD according to the suggestions. Moreover, as suggested in Concern#7, we removed the acronyms for Baseline (BL), Ranger (R), Swap ReLU (SR), Adaptive Clipper (AC) and Median Filter (MF).
Concern#9: In the list of references, when you put a link or a website that you visited (like ref. [2], [5], [27], [57], [58], [59]), I think that it would be needed to add the date and time the website was accessed.
Author response#9:
We thank the reviewer for pointing out the improvement in our manuscript references; we have carefully revisited all the references, following the journal guidelines.
Author Action#9:
We added the date when the websites referenced as [2], [5], [27], [57], [58], [59] were accessed.
Reviewer 4 Report
Comments and Suggestions for Authors
This paper aims to guide balancing resilience and efficiency in neural network deployments for safety-critical systems. To this end, the authors evaluate five different application-level, software-based hardening techniques (HTs): Adaptive Clipper (AC), Ranger (R), Swap ReLU6 (SR), Median Filter (MF), and Fine-Grain Triple Modular Redundancy (TMR). Their evaluation is based on systematic fault injection experiments that simulate bit-flip errors in neural network weights (WSBF) and neuron activations (NBER) on GPUs.
With AI playing an increasingly important role in safety-critical applications, ensuring reliability is crucial. This study makes a meaningful contribution by systematically comparing different fault mitigation strategies in terms of accuracy retention and computational overhead. The discussion of range restriction methods (AC and R) as lightweight alternatives to redundancy-based approaches like TMR is particularly insightful, as it provides practical takeaways for real-world AI deployment.
The paper is well-written and well-structured, but I have three points for further clarification and potential improvement:
- The authors primarily evaluate (i) Weight Single Bit-Flip (WSBF) and (ii) Neuron Bit Error Rate (NBER), but it is unclear why these two were chosen over other well-documented fault types. Real-world AI systems experience a variety of GPU failures, including:
- DRAM corruption (e.g., RowHammer)
- Soft errors (e.g., single-event upsets from cosmic rays)
- Aging-induced failures (e.g., Bias Temperature Instability, Hot Carrier Injection)
- Thermal-induced faults
Since these faults can also impact neural network reliability, it would be helpful to provide some discussion on why WSBF and NBER were prioritized while other failure mechanisms were not considered. Are these two models the most representative for AI workloads on GPUs? Or were they chosen for practical reasons, such as ease of fault injection? Some clarification would strengthen this section.
- The paper appears to assume that fault behavior is the same across all GPUs, but isn’t it the case that NVIDIA, AMD, and Intel GPUs differ significantly in terms of:
- Memory architectures
- Error correction mechanisms
- Instruction sets and parallelization strategies
If certain hardening techniques work well on one brand but not another, that could impact the generalizability of the results. Has there been any validation on different GPU architectures, or do the authors expect the findings to hold across vendors? Some discussion on this would help clarify the scope of the conclusions.
- Since this paper aims to provide practical recommendations, should energy consumption also be considered? The study evaluates computational overhead but does not address power efficiency, which is a major concern for:
- Mobile and edge AI applications
- Cloud-based AI deployments
- Large-scale AI inference where power costs matter
Some of the tested techniques—especially TMR—are likely to significantly increase energy usage, making them impractical in power-constrained environments. A discussion on energy trade-offs would provide a more complete picture of how these hardening techniques fit into real-world AI deployment scenarios.
Author Response
Concern#1: The authors primarily evaluate (i) Weight Single Bit-Flip (WSBF) and (ii) Neuron Bit Error Rate (NBER), but it is unclear why these two were chosen over other well-documented fault types. Real-world AI systems experience a variety of GPU failures, including:
- DRAM corruption (e.g., RowHammer)
- Soft errors (e.g., single-event upsets from cosmic rays)
- Aging-induced failures (e.g., Bias Temperature Instability, Hot Carrier Injection)
- Thermal-induced faults
Since these faults can also impact neural network reliability, it would be helpful to provide some discussion on why WSBF and NBER were prioritized while other failure mechanisms were not considered. Are these two models the most representative for AI workloads on GPUs? Or were they chosen for practical reasons, such as ease of fault injection? Some clarification would strengthen this section.
Author response#1:
We thank the reviewer for pointing out the presence of further failure mechanisms. It is true that, in the real world, AI-based systems may experience different types of faults and failures, including for example DRAM corruption, soft errors, aging-induced failures, and thermal-induced faults. However, error models can, in general, describe the effects of these types of faults. On the other hand, the evaluation approaches previously employed to assess the effectiveness of application-level Hardening Techniques do not consider error propagation, while WSBF and NBER model how an error can propagate during the execution of matrix multiplication algorithms, covering faults such as aging or single-event upsets, among others.
Author Action#1:
We elaborated more on our experiment design choices in Section 3 (Proposed evaluation methodology) and on the motivation behind them in Section 1 (Introduction).
Concern#2: The paper appears to assume that fault behavior is the same across all GPUs, but isn’t it the case that NVIDIA, AMD, and Intel GPUs differ significantly in terms of:
- Memory architectures
- Error correction mechanisms
- Instruction sets and parallelization strategies
If certain hardening techniques work well on one brand but not another, that could impact the generalizability of the results. Has there been any validation on different GPU architectures, or do the authors expect the findings to hold across vendors? Some discussion on this would help clarify the scope of the conclusions.
Author response#2:
We thank the reviewer for the comment.
It is true that different hardware architectures can have different error propagation patterns due to different memory architectures, error correction mechanisms, and instruction sets and parallelization strategies, but Weight Single Bit Flip (WSBF) and Neuron Bit Error Rate (NBER) are well-known hardware-aware error models that mimic the effect of faults in the memory units and basic cores of GPUs, respectively. In particular, WSBF resorts to the single stuck-at fault model, which has been demonstrated to mimic well the effect of permanent faults in the temporary memory units storing the weights. Applying such a fault model at the application level guarantees the generalizability of the error model to any GPU architecture. On the other hand, NBER simulates the error propagation that occurs during the Tiling Matrix Multiplication algorithm. Since all GPUs use the Single Instruction Multiple Data or Single Instruction Multiple Threads paradigms, modeling the error propagation throughout the Matrix Multiplication execution makes the model valid across all GPUs.
Author Action#2:
We supported the generalizability of our results in Section 1 (Introduction) and Section 2.2 (GPU organization and fault propagation).
Concern#3: Since this paper aims to provide practical recommendations, should energy consumption also be considered? The study evaluates computational overhead but does not address power efficiency, which is a major concern for:
- Mobile and edge AI applications
- Cloud-based AI deployments
- Large-scale AI inference where power costs matter
Some of the tested techniques—especially TMR—are likely to significantly increase energy usage, making them impractical in power-constrained environments. A discussion on energy trade-offs would provide a more complete picture of how these hardening techniques fit into real-world AI deployment scenarios.
Author response#3:
We thank the reviewer for suggesting that we include power consumption in our evaluations. Assessing power consumption in Edge AI applications is crucial, but this article mainly proposes an evaluation strategy for system designers and integrators developing software-based fault countermeasures. As these approaches only involve modifications to the software, estimating the Inference Time Overhead in terms of basic operations provides a reasonable approximation of the expected power consumption.
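For reference, a typical way to express the Inference Time Overhead is sketched below; the exact definition adopted in the manuscript may differ slightly.

```latex
\mathrm{ITO} = \frac{T_{\text{hardened}} - T_{\text{baseline}}}{T_{\text{baseline}}} \times 100\%
```

where T_baseline and T_hardened denote the inference times of the original and hardened NN, respectively; since the HTs only add a bounded number of extra operations per inference, this overhead also tracks the additional energy per inference to a first approximation.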
Author Action#3:
We included the discussion of the Inference Time Overhead in the "Data analysis" section in light of its inherent correlation with power consumption.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Dear Authors,
I congratulate you for the work performed in regard to your manuscript.
In order to comply with the academic standards of a scientific manuscript, a series of issues need your attention.
Despite regrouping the text, the Introduction is now clear, but it is very long, making it not reach its purpose. You should shorten it and add the Research Question, which is still unclear.
The literature review is now more visible, but the text starting from your phrase "...in this study we have selected five HT..." should be part of the Methodology section.
You still don't propose Hypotheses and don't back them up with literature. I suggest a graphical output based on your Hypotheses and results that would make everything clearer.
Moreover, these Hypotheses must be followed within your Analysis, and the results (accepted/rejected) should be displayed in a clear way. Further, discuss them in the light of previous literature, since now you don't only refer to the Hypotheses, but to the results you have got.
Best regards,
Author Response
Concern#1: Despite regrouping the text, the Introduction is now clear, but it is very long, making it not reach its purpose. You should shorten it and add the Research Question, which is still unclear.
Author response#1:
We thank the reviewer for the comment.
In our literature review we considered the works proposing application-level software-based hardening techniques. Specifically, we focused on the two main features that characterize their evaluations: i) error models and ii) test benches.
In particular, previous works proposing application-level hardening techniques evaluate them through different hardware-agnostic error models (i.e., different types of injected faults that neglect the error propagation throughout the different levels of the system). The employed test benches (i.e., Neural Networks and datasets) also differ across the works that propose such hardening techniques.
Consequently, considering the mismatch in terms of both error models and test benches, a direct comparison between the application-level HTs proposed in the literature becomes challenging when resorting only to the results reported in the works that introduced them.
Here comes our first research question: how effective are the different hardening techniques in terms of reliability enhancement when they are evaluated with consistent hardware-aware error models and on the same Neural Networks?
Furthermore, the evaluations always neglect the impact of implementing the Hardening Techniques, which can yield valuable insights regarding their practicality of implementation, beyond purely evaluating their effectiveness.
Consequently, the second research question is: which Hardening Technique shows the best trade-off between the cost of its implementation and its capability to enhance NN reliability?
To answer the proposed research questions, this study presents a systematic evaluation methodology for application-level software-based Hardening Techniques. The proposed evaluation methodology enables a direct and fair comparison of the Hardening Techniques in terms of i) practicality of implementation and ii) effectiveness as a fault countermeasure. For the first aspect, we proposed a combination of qualitative and quantitative metrics that describe the features required to implement a hardening technique in a given NN. For the second aspect, we used hardware-aware error models (WSBF and NBER), measuring the impact that injected errors have on the NN accuracy (i.e., accuracy degradation). The fairness of our results is guaranteed by the consistency of the experiments with each other in terms of target number format and target dataset during the Fault Injection campaigns.
Our achieved results provide answers to our research questions:
1. Not all the hardening techniques are suitable for all Neural Networks, and our study of their impact can help us understand this phenomenon. For example, while Swap ReLU6 achieves an accuracy degradation of only up to 13% w.r.t. the accuracy achieved in the fault-free scenario when it is submitted to the Weight Single Bit Flip Fault Injection (FI) campaigns, the same Hardening Technique suffers a critical degradation (more than 35%) when it is implemented on ResNet18. On the other hand, when Swap ReLU6 is implemented on MobileNet V2 and the NN is submitted to the Neuron Bit Error Rate FI campaigns, it suffers an accuracy degradation of up to 16% w.r.t. the fault-free scenario, whereas when it is implemented on ResNet18, it suffers only 1.5% of degradation. These results prove that the effectiveness of each hardening technique strongly depends on the original NN topology on which it is implemented and on the error model employed during the evaluations.
2. Range restriction-based hardening techniques show the best trade-off between the implementation cost and the reliability enhancement effectiveness. For example, Adaptive Clipper scores up to 6/10 in effort and produces only a 4.5% accuracy degradation. Inference Time Overhead is modest for Adaptive Clipper (up to 3.52%), while Ranger reaches 7.54%. Notably, these HTs can effectively mask fault effects in up to 58.23% of cases during fault injections.
Author Action#1: We updated the Introduction, making it shorter and adding the research questions to the text. Moreover, at the end of the Introduction, we discussed the results in the light of the presented research questions.
Concern#2: The literature review is now more visible, but the text starting from your phrase "...in this study we have selected five HT..." should be part of the Methodology section.
Author response#2:
We thank the reviewer for the comment. We agree that our experimental choice should not be part of our background (specifically, Section 2.3). However, the Hardening Techniques that we described in the Background Section are the state-of-the-art among the existing application-level hardening techniques; for this reason we believe it is important to provide some background on their implementation already in Section 2.
Consequently, we updated the sentence to refer to these techniques as the state of the art in application-level hardening techniques.
Moreover, our methodology not only assesses the current hardening techniques but is also applicable to all other hardening techniques documented in the literature, whichever neural network architecture they are implemented on. Thus, we decided not to include the paragraphs describing the state-of-the-art HTs in the methodology section and to leave the description of our test benches in the dedicated section (Experimental Setup).
Author action#2:
We updated the sentence in “Software solutions for reliability enhancement of NNs” section accordingly.
Concern#3: You still don't propose Hypotheses and don't back them up with literature. I suggest a graphical output based on your Hypotheses and results that would make everything clearer. Moreover, these Hypotheses must be followed within your Analysis, and the results (accepted/rejected) should be displayed in a clear way. Further, discuss them in the light of previous literature, since now you don't only refer to the Hypotheses, but to the results you have got.
Author response#3:
We thank the reviewer for the comment.
Based on the research gap that we have identified in the answer to Concern#1, we can formulate the main hypothesis that guided the choices taken along the path of our research activity. As a result of our work, where the hardening techniques are evaluated with consistent hardware-aware error models rather than only with hardware-agnostic (thus simplistic) ones, our hypothesis is: “Not all the application-level hardening techniques really enhance the reliability of the Neural Networks (NNs) when they are submitted to both Weight Single Bit Flip and Neuron Bit Error Rate error models.”
In the literature, the reliability of neural networks (NNs) is assessed by evaluating accuracy degradation. This involves comparing the impact of faults on the accuracy of the NN architecture both with and without the implementation of a Hardening Technique. By doing this, we can quantitatively measure the effectiveness of the Hardening Technique in enhancing the NN architecture. Consequently, our experimental results show that the effectiveness of the Hardening Techniques strongly depends on the error model employed for the evaluations and on the Neural Network architecture on which the Hardening Technique is implemented. For example, while Swap ReLU6 achieves an accuracy degradation of only up to 13% w.r.t. the accuracy achieved in the fault-free scenario when it is submitted to the Weight Single Bit Flip Fault Injection (FI) campaigns, the same Hardening Technique suffers a critical degradation (more than 35%) when it is implemented on ResNet18. On the other hand, when Swap ReLU6 is implemented on MobileNet V2 and the NN is submitted to the Neuron Bit Error Rate FI campaigns, it suffers an accuracy degradation of up to 16% w.r.t. the fault-free scenario, whereas when it is implemented on ResNet18, it suffers only 1.5% of degradation.
Author action#3:
We clarified our hypotheses in the introduction and discussed the results in the context of accepting or rejecting them. Specifically, in the Discussion section, we included an additional table that details which Hardening Technique effectively improved reliability for each Neural Network (NN) architecture, based on the Fault Injection strategy. This table compares the accuracy of the original NN architecture, and the accuracy observed during the fault injection campaign, particularly focusing on FIs that induced the highest level of corruption.
Round 3
Reviewer 1 Report
Comments and Suggestions for Authors
Dear Authors,
I congratulate you for your results.
Best regards,