NPU-Aware Fault Injection and Statistical Sensitivity Analysis for CNN Reliability Evaluation
Abstract
1. Introduction
- Multi-level FI Plug-in: Built on the PyTorch framework, this highly configurable plug-in adapts to the hierarchical computation of CNNs and the parallel architecture of NPUs, balancing evaluation accuracy and efficiency.
- High-Precision Module Modeling: We construct hardware-level models for key components (computing units, memory, and data paths), bridging the fidelity gap caused by hardware opacity and ensuring simulation results align with physical hardware behavior.
- Statistical-Oriented FI Method: By identifying fault-sensitive layers in Quantized CNNs (QCNNs) through statistical analysis, we implement a prioritized, staged FI approach that significantly reduces computational overhead while maintaining statistical validity.
2. Related Work
2.1. Hardware-Level Fault Tolerance
2.2. Fault Injection and Simulation Tools
3. Methodology
3.1. Fault Injection Framework
3.1.1. Framework Overview
3.1.2. Fault Characterization and Modeling
- Fault Types: We model Single-Event Upsets within storage units and Single-Event Transients that induce disturbances in computational arrays.
- Parameterization and Unified Metrics: Key parameters govern the injection space and impact scope.
  - Fault Injection Rate: The fault density within a specific injection target, defined as the proportion of elements in the target layer's weights or activations that are subjected to random bit-flips during an injection pass.
  - Screening Threshold: The Top-1 accuracy loss tolerance (e.g., 0.001) used exclusively as a decision boundary during the preliminary screening phase to identify fault-sensitive layers.

  Under these parameterized constraints, we support both Single-Bit Flips (SBF) and Multi-Bit Flips (MBF) across registers, processing elements, and weight buffers.
- Quantization-Aware FI: To accommodate low-precision hardware, we develop specialized fault models for INT8 and FP8 formats, accounting for the differential sensitivity between Most Significant Bits (MSB) and Least Significant Bits (LSB).
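The fault model above can be illustrated with a minimal sketch, assuming signed INT8 values stored in two's complement and a plain Python list standing in for a layer's weight tensor. The helper names `flip_bits_int8` and `inject_faults` and the parameters `rate` and `bits_per_fault` are hypothetical and are not taken from the plug-in itself.

```python
import random


def flip_bits_int8(value, bit_positions):
    """Flip the given bit positions (0 = LSB, 7 = MSB/sign) of a signed INT8 value."""
    raw = value & 0xFF                      # view the value as a two's-complement byte
    for b in bit_positions:
        raw ^= 1 << b                       # XOR toggles exactly that bit
    return raw - 256 if raw >= 128 else raw  # back to the signed range [-128, 127]


def inject_faults(weights, rate, bits_per_fault=1, rng=None):
    """Return a faulty copy of `weights` with bit flips in a `rate` fraction of elements.

    bits_per_fault=1 models a single-bit flip (SBF); a larger value models a
    multi-bit flip (MBF) confined to one element. Both are illustrative assumptions.
    """
    rng = rng or random.Random()
    faulty = list(weights)
    n_faults = round(rate * len(faulty))
    for idx in rng.sample(range(len(faulty)), n_faults):   # distinct target elements
        bits = rng.sample(range(8), bits_per_fault)        # distinct bits per element
        faulty[idx] = flip_bits_int8(faulty[idx], bits)
    return faulty


# Same stored value, different bit: the MSB flip moves it far more than the LSB flip.
print(flip_bits_int8(5, [0]))   # 4
print(flip_bits_int8(5, [7]))   # -123

rng = random.Random(0)
weights = [rng.randint(-128, 127) for _ in range(1000)]
faulty = inject_faults(weights, rate=0.01, rng=rng)
print(sum(a != b for a, b in zip(weights, faulty)))  # 10 corrupted elements
```

Flipping bit 0 of the value 5 changes it by 1, whereas flipping bit 7 changes it by 128, which is the MSB/LSB asymmetry that a quantization-aware fault model has to account for.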
3.1.3. NPU Architectural Modeling
3.1.4. Networks and Datasets
3.1.5. Evaluation Metrics
- Maximum Error: While average error characterizes the overall degradation of a model, the maximum error serves as a critical reliability metric for safety-critical deployments. This indicator is defined as the maximum deviation observed between the faulty output and the golden (fault-free) output across all injection iterations. In space-borne environments, the worst-case scenario, in which a single bit-flip causes a catastrophic output shift, is more indicative of system vulnerability than average performance. By capturing these extreme outliers, the maximum error provides an upper-bound assessment of the system’s susceptibility to hardware-induced failures.
- Peak Signal-to-Noise Ratio (PSNR): Beyond final classification accuracy, we use PSNR to quantify the fidelity of intermediate feature maps. Unlike top-level metrics such as mAP or accuracy, PSNR provides a high-sensitivity numerical probe into the feature distribution [23]. During fault injection, many faults manifest as Silent Data Corruptions (SDCs); these are disturbances that alter intermediate activation values but may not be large enough to trigger an immediate label change. By calculating the PSNR between the faulty feature map and the reference map, we can capture these hidden fault effects. A significant drop in PSNR serves as an early warning of structural feature distortion, providing a standardized basis for identifying fault-sensitive layers and guiding subsequent hardening strategies.
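The two metrics above can be sketched as follows. This is an illustrative sketch only: it assumes outputs and feature maps are flattened into lists of floats, and it takes the PSNR peak value to be the largest magnitude in the reference map, which is an assumption and may differ from the definition used in the paper. The function names `max_error` and `psnr` are hypothetical.

```python
import math


def max_error(golden_runs, faulty_runs):
    """Largest absolute deviation between faulty and golden outputs over all runs."""
    return max(
        abs(f - g)
        for golden, faulty in zip(golden_runs, faulty_runs)
        for g, f in zip(golden, faulty)
    )


def psnr(reference, faulty, peak=None):
    """PSNR in dB between a flattened reference feature map and its faulty version."""
    mse = sum((r - f) ** 2 for r, f in zip(reference, faulty)) / len(reference)
    if mse == 0:
        return math.inf                      # identical maps: nothing was corrupted
    if peak is None:
        # Assumption: peak = largest reference magnitude (must be non-zero).
        peak = max(abs(r) for r in reference)
    return 10 * math.log10(peak ** 2 / mse)


reference = [0.0, 1.0, 2.0, 4.0]
mild = [0.0, 1.0, 2.0, 3.9]      # small perturbation, e.g. a low-order bit flip
severe = [0.0, 1.0, 2.0, -4.0]   # large perturbation, e.g. a sign-bit flip

print(psnr(reference, mild))     # high PSNR: the map is barely distorted
print(psnr(reference, severe))   # 0 dB: the error power equals the peak power
print(max_error([reference, reference], [mild, severe]))  # 8.0
```

In this toy case neither perturbation has to change the predicted label, yet the PSNR separates the mild case from the severe one, which is the property that makes it useful for detecting silent corruption in intermediate layers.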
3.2. Statistical Methodology for Fault Resilience Analysis
3.2.1. Error Propagation in Convolutional Layers
3.2.2. Error Mitigation via Activation Functions
3.2.3. Structural Fault Tolerance
3.2.4. Fault Injection Priority Sequence
4. Experimental Results and Discussion
4.1. Convolutional Layer Reliability Analysis
4.2. Activation Function Analysis
4.3. Hierarchical and Block-Level Analysis
4.4. Efficiency Analysis and Dataset Validation
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Ibrahim, Y.; Wang, H. Soft errors in DNN accelerators: A comprehensive review. Microelectron. Reliab. 2020, 115, 113969.
2. Zhang, J.J.; Gu, T.; Basu, K.; Garg, S. Analyzing and mitigating the impact of permanent faults on a systolic array based neural network accelerator. In 2018 IEEE 36th VLSI Test Symposium (VTS); IEEE: Piscataway, NJ, USA, 2018; pp. 1–6.
3. Trindade, M.G. Assessment of Edge Machine-Learning Systems Under Radiation-Induced Effects. Ph.D. Thesis, Université Grenoble Alpes, Saint-Martin-d’Hères, France, 2021.
4. LaBel, K.A.; Cohn, L.M. Radiation testing and evaluation issues for modern integrated circuits. In Eighth European Conference on Radiation and Its Effects on Components and Systems (RADECS05); IEEE: Piscataway, NJ, USA, 2005.
5. Ruospo, A.; Luza, L.M.; Bosio, A.; Traiola, M.; Dilillo, L.; Sanchez, E. Pros and cons of fault injection approaches for the reliability assessment of deep neural networks. In 2021 IEEE 22nd Latin American Test Symposium (LATS); IEEE: Piscataway, NJ, USA, 2021; pp. 1–5.
6. Geier, J.; Mueller-Gritschneder, D.; Schlichtmann, U. Techniques and tools for fast fault injection simulations of RISC-V processors at RTL. In RISC-V Summit Europe; IEEE: Piscataway, NJ, USA, 2024.
7. Kchaou, A.; Saad, S.; Garrab, H.; Machhout, M. Reliability of LEON3 processor program counter against SEU, MBU, and SET fault injection. Cryptography 2025, 9, 54.
8. Mahmoud, A.; Aggarwal, N.; Nobbe, A.; Vicarte, J.R.S.; Adve, S.V.; Fletcher, C.W.; Frosio, I.; Hari, S.K.S.P. PyTorchFI: A runtime perturbation tool for DNNs. In 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W); IEEE: Piscataway, NJ, USA, 2020; pp. 25–31.
9. Chen, Z.; Narayanan, N.; Fang, B.; Li, G.; Pattabiraman, K.; DeBardeleben, N. TensorFI: A flexible fault injection framework for TensorFlow applications. In 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE); IEEE: Piscataway, NJ, USA, 2020; pp. 426–435.
10. Zheng, Y.; Feng, Z.; Hu, Z.; Pei, K. MindFI: A fault injection tool for reliability assessment of MindSpore applications. In 2021 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW); IEEE: Piscataway, NJ, USA, 2021; pp. 235–238.
11. Ruospo, A.; Gavarini, G.; Bragaglia, I.; Traiola, M.; Bosio, A.; Sanchez, E. Selective hardening of critical neurons in deep neural networks. In 2022 25th International Symposium on Design and Diagnostics of Electronic Circuits and Systems (DDECS); IEEE: Piscataway, NJ, USA, 2022; pp. 136–141.
12. Huang, H.; Xue, X.; Liu, C.; Wang, Y.; Luo, T.; Cheng, L.; Li, H.; Li, X. Statistical modeling of soft error influence on neural networks. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2023, 42, 4152–4163.
13. Li, G.; Hari, S.K.S.; Sullivan, M.; Tsai, T.; Pattabiraman, K.; Emer, J.; Keckler, S.W. Understanding error propagation in deep learning neural network (DNN) accelerators and applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis; IEEE: Piscataway, NJ, USA, 2017; pp. 1–12.
14. Libano, F.; Wilson, B.; Anderson, J.; Wirthlin, M.J.; Cazzaniga, C.; Frost, C.; Rech, P. Selective hardening for neural networks in FPGAs. IEEE Trans. Nucl. Sci. 2018, 66, 216–222.
15. Xu, D.; Zhu, Z.; Liu, C.; Wang, Y.; Zhao, S.; Zhang, L.; Liang, H.; Li, H.; Cheng, K.T. Reliability evaluation and analysis of FPGA-based neural network acceleration system. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2021, 29, 472–484.
16. Bertoa, T.G.; Gambardella, G.; Fraser, N.J.; Blott, M.; McAllister, J. Fault-tolerant neural network accelerators with selective TMR. IEEE Des. Test 2022, 40, 67–74.
17. Tan, J.; Wang, Q.; Yan, K.; Wei, X.; Fu, X. Saca-FI: A microarchitecture-level fault injection framework for reliability analysis of systolic array based CNN accelerator. Future Gener. Comput. Syst. 2023, 147, 251–264.
18. Taheri, M.; Daneshtalab, M.; Raik, J.; Jenihhin, M.; Pappalardo, S.; Jimenez, P.; Deveautour, B.; Bosio, A. Saffira: A framework for assessing the reliability of systolic-array-based DNN accelerators. In 2024 27th International Symposium on Design & Diagnostics of Electronic Circuits & Systems (DDECS); IEEE: Piscataway, NJ, USA, 2024; pp. 19–24.
19. Huang, H.; Liu, C.; Xue, X.; Liu, B.; Li, H.; Li, X. MRFI: An open-source multiresolution fault injection framework for neural network processing. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2024, 32, 1325–1335.
20. Ozen, E.; Orailoglu, A. SNR: Squeezing numerical range defuses bit error vulnerability surface in deep neural networks. ACM Trans. Embed. Comput. Syst. 2021, 20, 1–25.
21. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2016; pp. 770–778.
22. Aketi, S.A.; Roy, K. Cross-feature contrastive loss for decentralized deep learning on heterogeneous data. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 10897–10910.
23. Kang, J.H.; Jeong, H.W.; Choi, C.K.; Ali, M.S.; Bae, S.H.; Kim, H.Y. An analysis on the properties of features against various distortions in deep neural networks. J. Korea Inst. Broadcast. Media Eng. 2021, 26, 868–876.

| Index | Layer Name | Small-Sample (%) | Full-Space (%) |
|---|---|---|---|
| 1 | layer4.0.downsample.0 | 22.18 | 16.18 |
| 2 | layer4.1.conv1 | 14.68 | 25.32 |
| 3 | conv1 | 4.98 | 5.39 |
| 4 | layer4.2.conv1 | 4.71 | 4.71 |
| 5 | layer4.0.conv2 | 3.48 | 2.59 |
| 6 | layer2.0.downsample.0 | 3.34 | 0.20 |
| 7 | layer1.0.conv1 | 1.77 | 0.75 |
| 8 | layer4.0.conv3 | 2.87 | 2.05 |
| 9 | layer3.1.conv3 | 1.77 | 2.73 |
| 10 | layer4.1.conv2 | 1.77 | 0.55 |
| | Average Top-1 Loss Rate | 0.0155 | 0.0161 |
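To read the table as a ranking comparison, the short sketch below checks how many of the top-k layers under the small-sample column also appear in the top-k under the full-space column. The table's caption did not survive, so the interpretation of both columns as per-layer Top-1 accuracy loss is an assumption, and the helper `top_k` is hypothetical. The values are copied from the ten layer rows of the table.

```python
# (layer name, small-sample %, full-space %) copied from the ten layer rows of the table
rows = [
    ("layer4.0.downsample.0", 22.18, 16.18),
    ("layer4.1.conv1", 14.68, 25.32),
    ("conv1", 4.98, 5.39),
    ("layer4.2.conv1", 4.71, 4.71),
    ("layer4.0.conv2", 3.48, 2.59),
    ("layer2.0.downsample.0", 3.34, 0.20),
    ("layer1.0.conv1", 1.77, 0.75),
    ("layer4.0.conv3", 2.87, 2.05),
    ("layer3.1.conv3", 1.77, 2.73),
    ("layer4.1.conv2", 1.77, 0.55),
]

SMALL, FULL = 1, 2  # column indices within each row tuple


def top_k(rows, column, k):
    """Names of the k layers with the largest value in the given column."""
    ranked = sorted(rows, key=lambda r: r[column], reverse=True)
    return {r[0] for r in ranked[:k]}


for k in (3, 4, 5):
    shared = top_k(rows, SMALL, k) & top_k(rows, FULL, k)
    print(f"top-{k}: {len(shared)} of {k} layers shared")
# top-3: 3 of 3 layers shared
# top-4: 4 of 4 layers shared
# top-5: 4 of 5 layers shared
```

Under this reading, the four most sensitive layers are the same in both columns, although their order differs, and the rankings begin to diverge at the fifth position.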

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Hua, Y.; Zhang, J.; Piao, Q.; Zhuang, W.; Zhao, Y. NPU-Aware Fault Injection and Statistical Sensitivity Analysis for CNN Reliability Evaluation. Electronics 2026, 15, 1295. https://doi.org/10.3390/electronics15061295

