NPU-Aware Fault Injection and Statistical Sensitivity Analysis for CNN Reliability Evaluation

Hua, Yang; Zhang, Jianyu; Piao, Quanyu; Zhuang, Wei; Zhao, Yuanfu

doi:10.3390/electronics15061295

by Yang Hua, Jianyu Zhang, Quanyu Piao, Wei Zhuang and Yuanfu Zhao *

Beijing Microelectronics Technology Institute, Beijing 100076, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(6), 1295; https://doi.org/10.3390/electronics15061295

Submission received: 15 February 2026 / Revised: 15 March 2026 / Accepted: 17 March 2026 / Published: 20 March 2026

(This article belongs to the Special Issue Artificial Intelligence and Microsystems)

Abstract

Artificial intelligence (AI) is propelling space exploration into a new era. Synergistic breakthroughs in chip design and high-speed communications have facilitated the large-scale deployment of on-board satellite computing. Assessing the reliability of these systems via fault injection (FI) remains difficult due to the massive computational demands of Convolutional Neural Networks (CNNs) and the complex architectures of Neural Processing Units (NPUs). This research presents a high-precision, efficient FI methodology specifically tailored for NPU architectures to optimize both evaluation accuracy and execution efficiency. Implementing a hierarchical injection strategy to identify fault-sensitive layers minimizes computational overhead while ensuring statistical validity. Experimental results on the ResNet-50 network demonstrate that the proposed methodology constrains accuracy degradation to less than 0.1% while achieving a 60.80% reduction in total execution time.

Keywords: neural processing unit; fault injection; reliability evaluation; convolutional neural networks

1. Introduction

As semiconductor manufacturing scales down to nanometer nodes, Neural Processing Units (NPUs) have become increasingly susceptible to Single-Event Effects (SEEs) caused by high-energy particle radiation [1]. To achieve peak energy efficiency, NPUs employ specialized architectures such as dense computing arrays and low-precision quantization [2], which further exacerbate this vulnerability. Single-Event Upsets (SEUs), Single-Event Transients (SETs), and Single-Event Functional Interruptions (SEFIs) can lead to degraded inference accuracy, functional anomalies, or even catastrophic system failure, severely limiting NPU deployment in safety-critical applications like aerospace and autonomous driving.

NPU reliability assessment primarily relies on two approaches: radiation hardening tests and fault injection simulations [3]. While radiation tests provide the most authentic reflection of hardware behavior, they suffer from high costs, long cycles, and poor reproducibility [4]. Conversely, FI simulation offers a cost-effective and flexible alternative for analyzing fault propagation and reliability. Based on the abstraction level, FI techniques are categorized into RTL-level, architecture-level, and software-level. RTL-level FI provides high fidelity but suffers from prohibitive computational overhead. Software-level FI is highly efficient but overlooks the critical impact of the underlying hardware architecture [5].

FI simulation uses a hardware or software model of the NPU to emulate SEE faults caused by high-energy particle radiation, tracks how each fault propagates through the computation, storage, and control units, and analyzes its effect on the NPU output. According to the abstraction level of the NPU model, and the resulting trade-off among accuracy, efficiency, and applicable scenarios, NPU SEE fault injection simulation is divided into RTL-level [6], architecture-level [7], and software-level [8,9,10] approaches. RTL-level FI builds the hardware model from the NPU RTL code. Faults are injected directly into hardware units such as registers and combinational logic, so SEU/SET injection timing, fault propagation paths, and hardware responses can be simulated accurately. Its drawback is extremely low simulation efficiency and a heavy demand for computing resources, which restricts it to fine-grained SEE evaluation of small internal NPU modules. Architecture-level FI works on abstract models of the core architectural components of the NPU (computing arrays, caches, and data paths) and injects SEU/SET faults through a fault model. Software-level FI operates on the NPU software runtime or the neural network model and does not depend on an NPU hardware model; it emulates SEU/SET faults by modifying weights, activation values, and the instruction flow in software. It offers the highest simulation efficiency and broad applicability and can quickly estimate the effect of faults on inference accuracy, but it cannot reproduce fault propagation in hardware and therefore ignores the influence of the NPU architecture.

CNNs involve an enormous computational volume and a complex hierarchical data flow. Their inherent fault tolerance makes the generation, propagation, and final impact of faults within the network highly nonlinear and random [11]. Because the fault injection space of a CNN is extremely large, and every experiment requires fault injection, an inference run, and result collection, exhaustive fault injection far exceeds the capacity of current computing platforms [12].

However, existing related studies often fail to balance evaluation accuracy and efficiency. Specifically, hardware-level fault simulations incur prohibitive computational overhead, while software-level simulations typically lack hardware architectural fidelity. Moreover, the complex error propagation patterns in CNNs, influenced by factors such as layer depth, activation functions, and architectural features, are often overlooked. This gap underscores the need for a comprehensive fault injection framework that integrates high-fidelity hardware modeling with a statistically rigorous analysis of CNN sensitivity to hardware faults.

To address these limitations, this research presents a comprehensive fault injection framework tailored for NPU reliability assessment. The core contributions are threefold:

Multi-level FI Plug-in: Developed based on the PyTorch framework, this highly configurable plug-in adapts to CNN hierarchical computing and NPU parallel architectures, balancing evaluation accuracy and efficiency.
High-Precision Module Modeling: We construct hardware-level models for key components (computing units, memory, and data paths), bridging the fidelity gap caused by hardware opacity and ensuring simulation results align with physical hardware behavior.
Statistical-Oriented FI Method: By identifying fault-sensitive layers in Quantized CNNs (QCNNs) through statistical analysis, we implement a prioritized, staged FI approach that significantly reduces computational overhead while maintaining statistical validity.

The remainder of this paper is structured as follows: Section 2 reviews recent work on neural network fault injection. Section 3 details the proposed fault injection framework and the statistical-oriented fault injection methodology. Section 4 describes the experimental setup and analyzes the results obtained on CNN models with the proposed framework. Finally, Section 5 concludes the paper and discusses potential future research directions.

2. Related Work

2.1. Hardware-Level Fault Tolerance

Hardware fault tolerance focuses on suppressing error propagation through architectural or circuit-level techniques during the NPU design stage. Li et al. [13] investigated the Soft Error Rate (SER) across various buffers, revealing that DNN fault tolerance is highly dependent on data types, numerical values, data reuse rates, and layer characteristics. Their findings underscore the necessity of targeted hardware redundancy. For FPGA-based accelerators, Libano et al. [14] proposed a selective hardening strategy that reinforces only critical components, such as key registers and specific memory blocks, thereby balancing reliability with area and power constraints. Similarly, Xu et al. [15] demonstrated that system-level malfunctions are primarily triggered by faults within the DMA, control units, and instruction memory. To optimize resource usage, Bertoa et al. [16] developed Selective Triple Modular Redundancy (STMR), an automated tool that achieves high fault tolerance with significantly lower overhead than traditional Triple Modular Redundancy (TMR).

2.2. Fault Injection and Simulation Tools

Fault simulation is indispensable for evaluating the reliability of neural network models and their underlying hardware. Current simulation tools generally fall into two categories: model-level and architecture-level. General-purpose tools such as TensorFI [9], PyTorchFI [8], and MindFI [10] utilize hook functions to inject perturbations into specific layers or feature maps, allowing for flexible configuration of fault types and locations. For domain-specific architectures, tools like Saca-FI [17] and Saffira [18] incorporate hardware data paths into the simulation to provide higher-fidelity reliability analysis. MRFI [19] introduced a multi-resolution approach, supporting fault injection at various granularities (e.g., weights, activations, and feature maps) for a multi-tiered impact assessment. Furthermore, Ozen and Orailoglu [20] explored error sensitivity reduction by compressing numerical ranges, while Huang et al. [12] established a statistical vulnerability model for feature maps to quantify soft error impacts.

As AI accelerators evolve toward higher density and lower power consumption, traditional hardware redundancy alone is becoming insufficient. Bridging the gap between hardware-induced soft errors and software-level mitigation mechanisms remains a pivotal challenge for next-generation NPU design.

3. Methodology

3.1. Fault Injection Framework

3.1.1. Framework Overview

The proposed fault injection tool is integrated into the PyTorch 2.7.1 framework and comprises three core modules—Configuration, Execution, and Evaluation—as illustrated in Figure 1. The Configuration Module allows researchers to define FI targets and mechanisms via standardized configuration files, offering extensible interfaces for various fault models and NPU abstraction layers. The Execution Module implements data quantization interfaces, fault injectors, and observers, enabling perturbations without necessitating modifications to the underlying neural network (NN) architecture. Finally, the Evaluation Module leverages PyTorch APIs to quantify layer-wise sensitivity and assess the model’s overall robustness.

3.1.2. Fault Characterization and Modeling

The fidelity of the fault model is paramount for simulation accuracy. Our modeling captures the stochastic nature of Single-Event Effects, specifically focusing on:

Fault Types: We model Single-Event Upsets within storage units and Single-Event Transients that induce disturbances in computational arrays.
Parameterization and Unified Metrics: Key parameters govern the injection space and impact scope.
– Fault Injection Rate ($R_{inj}$): Represents the fault density within a specific injection target. It is defined as the proportion of elements (e.g., $0.1\%$ or $1\%$ of the target layer weights/activations) subjected to random bit-flips during an injection pass.
– Screening Threshold ($\tau_{screen}$): The Top-1 accuracy loss tolerance (e.g., 0.001) used exclusively as a decision boundary during the preliminary screening phase to identify fault-sensitive layers.
Under these parameterized constraints, we support both Single-Bit Flips (SBF) and Multi-Bit Flips (MBF) across registers, processing elements, and weight buffers.
Quantization-Aware FI: To accommodate low-precision hardware, we develop specialized fault models for INT8 and FP8 formats, accounting for the differential sensitivity between Most Significant Bits (MSB) and Least Significant Bits (LSB).
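As an illustration of the parameters above, the following minimal sketch flips single bits in two's-complement INT8 values at a given injection rate. It is not the framework's implementation: the function names `flip_bit_int8` and `inject_sbf`, the uniform choice of element and bit position, and the rule of injecting at least one fault are assumptions made for this example.

```python
import random

def flip_bit_int8(value: int, bit: int) -> int:
    """Flip one bit (0 = LSB, 7 = sign bit) of a two's-complement INT8 value."""
    if not -128 <= value <= 127:
        raise ValueError("value out of INT8 range")
    if not 0 <= bit <= 7:
        raise ValueError("bit index must be in 0..7")
    raw = (value & 0xFF) ^ (1 << bit)        # flip in the unsigned 8-bit view
    return raw - 256 if raw >= 128 else raw  # reinterpret as signed

def inject_sbf(weights, r_inj, rng):
    """Single-bit flips in a fraction r_inj of the elements.

    Assumption: elements and bit positions are drawn uniformly, and at
    least one element is always hit.
    """
    n_faults = max(1, round(r_inj * len(weights)))
    faulty = list(weights)
    for idx in rng.sample(range(len(weights)), n_faults):
        faulty[idx] = flip_bit_int8(faulty[idx], rng.randrange(8))
    return faulty

rng = random.Random(0)
weights = [rng.randint(-128, 127) for _ in range(10_000)]
faulty = inject_sbf(weights, r_inj=0.001, rng=rng)
changed = sum(a != b for a, b in zip(weights, faulty))

# An LSB flip moves the value by 1, a sign-bit flip by 128.
print(flip_bit_int8(5, 0), flip_bit_int8(5, 7))  # 4 -123
```

The last line shows why MSB and LSB faults are treated differently: the same single-bit upset changes the stored value by 1 or by 128 depending on its position.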

3.1.3. NPU Architectural Modeling

The NPU model serves as the physical substrate for fault injection. By integrating the deep learning framework with a specialized NPU simulation environment, we developed a co-simulation platform that synchronizes neural network inference with NPU runtime behaviors. This model enables the customization of micro-architectural parameters, including the arrangement of computing arrays and memory hierarchies.

Through this architectural abstraction, we emulate Single-Event Effects induced by high-energy particle radiation, enabling the precise tracking of fault propagation across arithmetic logic units, buffers, and control paths. Furthermore, the model is tightly coupled with the algorithm layer, specifically accounting for the quantization-induced sensitivity variations inherent in low-precision NPU deployments. This cross-layer approach allows for a holistic quantification of the NPU’s reliability under radiation-induced hardware upsets.

3.1.4. Networks and Datasets

To evaluate the proposed methodology, the ResNet [21] family was selected as the primary benchmark. Its modular residual structure and deterministic data paths provide an ideal pipeline for investigating the correlation between network depth, parameter redundancy, and fault vulnerability. We utilized a spectrum of configurations—including ResNet-18, 34, 50, 101, and 152—to ensure the scalability and generalizability of our fault injection strategy across varying depths and capacities within residual architectures.

For the dataset, we employed Imagenette, a curated subset of ImageNet consisting of 10 representative classes with 9469 training and 3925 validation samples. Imagenette offers a balanced trade-off between semantic complexity and computational tractability [22]. Importantly, Imagenette employs an identical data interface, resolution, and preprocessing pipeline to the full ImageNet dataset. This structural consistency guarantees that our proposed fault injection framework can be directly deployed on the complete ImageNet without any architectural or codebase modifications. Its moderate scale significantly accelerates the fault injection process while maintaining high fidelity; experimental insights derived from this dataset are demonstrably transferable to the full-scale ImageNet challenge, providing a statistically sound basis for reliability assessment.

3.1.5. Evaluation Metrics

To rigorously quantify the impact of hardware faults on NPU performance, this study employs a multi-dimensional evaluation approach. In addition to conventional metrics such as Top-1 Accuracy and Mean Squared Error (MSE), we prioritize the following two indicators:

Maximum Error ( $E_{\max}$ )
While average error characterizes the overall degradation of a model, the maximum error $E_{\max}$ serves as a critical reliability metric for safety-critical deployments. This indicator is defined as the maximum deviation observed between the faulty output and the golden fault-free output across all injection iterations. In space-borne environments, the worst-case scenario involving a single bit-flip that causes a catastrophic output shift is more indicative of system vulnerability than average performance. By capturing these extreme outliers, $E_{\max}$ provides an upper-bound assessment of the system’s susceptibility to hardware-induced failures.
Peak Signal-to-Noise Ratio (PSNR)
Beyond final classification accuracy, we utilize PSNR to quantify the fidelity of intermediate feature maps. Unlike top-level metrics like mAP or Accuracy, PSNR provides a high-sensitivity numerical probe into the feature distribution [23]. During fault injection, many faults manifest as Silent Data Corruptions (SDCs); these are disturbances that alter intermediate activation values but may not be of sufficient magnitude to trigger an immediate label change. By calculating the PSNR between the faulty feature map and the reference map, we can sensitively capture these hidden fault effects. A significant drop in PSNR serves as an early warning for structural feature distortion, providing a standardized physical basis for identifying fault-sensitive layers and guiding subsequent hardening strategies.
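The two indicators can be sketched as follows. The source does not state which peak value it uses for PSNR, so this example assumes the conventional definition $PSNR = 10 \log_{10}(peak^2 / MSE)$ with the peak taken as the largest absolute value of the fault-free reference; the function names and the flat-list tensor representation are likewise illustrative.

```python
import math

def max_error(golden, faulty_runs):
    """Largest absolute deviation from the golden output over all runs."""
    return max(abs(f - g) for run in faulty_runs for f, g in zip(run, golden))

def psnr(reference, faulty, peak=None):
    """PSNR in dB between a reference and a faulty feature map (flat lists).

    Assumption: peak defaults to the largest absolute reference value.
    """
    if len(reference) != len(faulty):
        raise ValueError("feature maps must have the same size")
    mse = sum((r - f) ** 2 for r, f in zip(reference, faulty)) / len(reference)
    if mse == 0:
        return math.inf  # identical maps: no corruption
    if peak is None:
        peak = max(abs(r) for r in reference)
    if peak <= 0:
        raise ValueError("peak must be positive")
    return 10.0 * math.log10(peak ** 2 / mse)

reference = [0.0, 1.0, 2.0, 4.0]
small_fault = [0.0, 1.0, 2.0, 3.9]   # low-order perturbation
large_fault = [0.0, 1.0, 2.0, -4.0]  # sign flip on the largest activation

print(psnr(reference, small_fault))
print(psnr(reference, large_fault))
print(max_error(reference, [small_fault, large_fault]))
```

In this toy case the small perturbation leaves the PSNR high while the sign flip drives it to about 0 dB, and `max_error` reports the single worst deviation across both runs.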

3.2. Statistical Methodology for Fault Resilience Analysis

3.2.1. Error Propagation in Convolutional Layers

The error propagation within convolutional layers exhibits strong spatial correlation. A single fault does not merely corrupt a single output pixel. For a $3 \times 3$ kernel with unit stride, a corrupted input activation affects a $3 \times 3$ region of the immediate output feature map, and a corrupted weight, because the kernel is shared across all spatial positions, perturbs every position of the affected output channel. As the network depth increases, the affected region widens with each successive layer, potentially corrupting large-scale feature representations in deeper layers. Furthermore, the channel dimension significantly influences fault resilience; increased network width provides redundant information paths, which can dilute the impact of a single-channel disturbance. We also observe that layer position is critical: errors in early layers undergo multiple non-linear transformations and pooling operations that may partially attenuate or mask the fault. Conversely, faults in layers proximal to the output directly bias the final classification logits, offering less opportunity for structural error suppression.

3.2.2. Error Mitigation via Activation Functions

As the primary source of non-linearity, the activation function determines the network’s error response characteristics through its mathematical properties.

Saturating activation functions such as Sigmoid and Tanh exhibit heightened error sensitivity. When a soft error shifts the input into a saturation region, the output becomes fixed at an extreme value. Although this mechanism may resemble filtering, it frequently leads to severe activation bias. In such cases, the neuron becomes unresponsive and loses its discriminative power, subsequently propagating a biased signal that compromises the integrity of deeper layers.

Non-saturating activation functions including ReLU and LeakyReLU generally offer superior fault tolerance. The linear nature of ReLU for positive inputs ensures that subtle errors do not necessarily trigger extreme non-linear distortions. Crucially, the sparsity-inducing property of ReLU, which sets negative values to zero, acts as a natural error filter. This characteristic effectively discards negative-going transients and prevents them from contaminating the feature map, thereby maintaining the structural stability of the data flow.
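A numerical sketch of the two behaviours described above, using hypothetical pre-activation values and a hypothetical fault magnitude of 8.0: ReLU removes a negative-going fault entirely and passes a positive one linearly, while Sigmoid pushed into saturation stops responding to further input changes.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def output_error(activation, x, delta):
    """Change at the activation output when the input x is shifted by delta."""
    return activation(x + delta) - activation(x)

# Negative-going fault on an already inactive neuron: fully masked by ReLU.
masked = output_error(relu, -0.5, -8.0)

# Positive-going fault: ReLU passes it through linearly, without distortion.
passed = output_error(relu, 0.5, 8.0)

# Sigmoid driven into saturation: the output is pinned near 1 ...
saturated_output = sigmoid(0.5 + 8.0)
# ... and a further unit change of the input barely moves it.
residual_response = sigmoid(9.5) - sigmoid(8.5)

print(masked, passed)
print(saturated_output, residual_response)
```

The saturated neuron no longer distinguishes between inputs of 8.5 and 9.5, which is the loss of discriminative power referred to above.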

3.2.3. Structural Fault Tolerance

Modern architectural innovations like Residual Connections and Multi-scale Fusion provide inherent redundancy mechanisms that enhance system reliability.

Shortcut Paths: The primary advantage of residual learning lies in its identity shortcut mechanism. In the event of a fault within the convolutional blocks, this shortcut path preserves the original, uncorrupted input information. This allows subsequent layers to receive a combination of the intact skip signal and the corrupted residual signal, effectively creating a parallel redundant path that mitigates the impact of localized hardware upsets.

Multi-scale Feature Complementarity: Structures that employ parallel convolution kernels of varying sizes exhibit high resilience. Because features are extracted across multiple scales simultaneously, an error in one scale-specific branch can be compensated for by the intact features from parallel branches during the fusion stage.

3.2.4. Fault Injection Priority Sequence

Operators are categorized and quantified based on their degradation across different network levels. To ensure high screening efficiency and computational simplicity, two primary indicators are defined in Equations (1) and (2):

$$\overline{DR} = \frac{1}{N} \sum_{i=1}^{N} \frac{Acc_{0}(i) - \overline{Acc_{e}}(i)}{Acc_{0}(i)} \times 100\% \qquad (1)$$

$$MaxDR = \max_{1 \leq i \leq N} \left( \frac{Acc_{0}(i) - \overline{Acc_{e}}(i)}{Acc_{0}(i)} \times 100\% \right) \qquad (2)$$

Here, $N$ denotes the total number of samples in the validation subset, and $M$ represents the number of repeated fault injection iterations for a given operator or layer. For a specific sample $i$ ($1 \leq i \leq N$), $Acc_{0}(i)$ is its baseline fault-free accuracy, and $\overline{Acc_{e}}(i)$ denotes its average accuracy calculated across the $M$ injection iterations. Higher $\overline{DR}$ values signify more pronounced performance degradation and diminished robustness. $MaxDR$ reflects the worst-case performance degradation, where higher values indicate increased target sensitivity.
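Equations (1) and (2) translate directly into code. In the sketch below the accuracy values are invented for illustration and are not results from the paper; the function names and the list-of-lists layout ($N$ entries, each holding $M$ faulty accuracies) are assumptions.

```python
def degradation_rates(acc0, acc_faulty):
    """Per-index degradation rate in percent.

    acc0:       N baseline (fault-free) accuracies, each > 0
    acc_faulty: N lists, each holding the M accuracies measured under faults
    """
    rates = []
    for base, runs in zip(acc0, acc_faulty):
        mean_faulty = sum(runs) / len(runs)  # average over the M iterations
        rates.append((base - mean_faulty) / base * 100.0)
    return rates

def mean_dr(rates):
    """Equation (1): average degradation rate."""
    return sum(rates) / len(rates)

def max_dr(rates):
    """Equation (2): worst-case degradation rate."""
    return max(rates)

# Hypothetical values: N = 3, M = 2.
acc0 = [0.80, 0.90, 1.00]
acc_faulty = [[0.80, 0.76], [0.90, 0.90], [0.50, 0.50]]

rates = degradation_rates(acc0, acc_faulty)
print(rates)
print(mean_dr(rates), max_dr(rates))
```

Here the third entry dominates $MaxDR$ even though the average stays moderate, which is why both indicators are tracked.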

Operators are ranked by $\overline{DR}$ and $MaxDR$ in descending order and classified into high-, medium-, and low-sensitivity candidates. We employ a combination of the Mann–Whitney U Test and Cliff's $\delta$ effect size analysis to verify the significance of these differences. A candidate is confirmed as a Final Highly Sensitive Target if it satisfies two conditions: first, its $\overline{DR}$ must be significantly greater than that of low-sensitivity targets ($p < 0.05$); second, the degradation must have practical engineering significance ($|\delta| \geq 0.33$), as formalized in Equation (3):

$$\text{Final Highly Sensitive Targets} = \text{Candidates} \cap (p < 0.05) \cap (|\delta| \geq 0.33) \qquad (3)$$

Based on these verified targets, the final error priority sequence is established by ranking the product $\overline{DR} \times MaxDR$. High-priority targets in this sequence are subsequently selected for refined fault injection.
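The confirmation and ranking steps can be sketched with the standard library alone. Several choices here are assumptions, since the source does not specify them: the tests are run on per-sample degradation rates, the Mann–Whitney p-value is one-sided and uses the normal approximation with tie and continuity corrections, and the layer names and numbers are invented.

```python
import math

def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    gt = sum(x > y for x in xs for y in ys)
    lt = sum(x < y for x in xs for y in ys)
    return (gt - lt) / (len(xs) * len(ys))

def mann_whitney_p_greater(xs, ys):
    """One-sided Mann-Whitney U p-value for 'xs tends to exceed ys'.

    Normal approximation with tie and continuity corrections (assumed).
    """
    n1, n2 = len(xs), len(ys)
    n = n1 + n2
    u = sum((x > y) + 0.5 * (x == y) for x in xs for y in ys)
    pooled = sorted(list(xs) + list(ys))
    tie_term, i = 0, 0
    while i < n:  # accumulate t^3 - t over each group of tied values
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:
        return 1.0
    z = (u - n1 * n2 / 2.0 - 0.5) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def priority_sequence(candidates, low_group, alpha=0.05, min_delta=0.33):
    """Apply Equation (3), then rank survivors by mean DR x MaxDR."""
    confirmed = []
    for name, rates in candidates.items():
        p = mann_whitney_p_greater(rates, low_group)
        d = cliffs_delta(rates, low_group)
        if p < alpha and abs(d) >= min_delta:
            score = (sum(rates) / len(rates)) * max(rates)
            confirmed.append((score, name))
    return [name for score, name in sorted(confirmed, reverse=True)]

# Hypothetical per-sample degradation rates (percent).
low_group = [0.5, 0.3, 0.7, 0.2, 0.6, 0.4, 0.8, 0.1]
candidates = {
    "layer_a": [12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 12.5, 15.5],
    "layer_b": [6.0, 7.5, 5.5, 8.0, 6.5, 7.0, 9.0, 5.0],
    "layer_c": [0.45, 0.35, 0.65, 0.25, 0.55, 0.15, 0.75, 0.85],
}
print(priority_sequence(candidates, low_group))  # ['layer_a', 'layer_b']
```

`layer_c` is dropped because its distribution is statistically indistinguishable from the low-sensitivity group, so it never reaches the refined injection stage.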

4. Experimental Results and Discussion

4.1. Convolutional Layer Reliability Analysis

To evaluate the fault tolerance of convolutional layers against Single-Event Upsets, we conducted fault injection experiments across various kernel sizes and channel counts. The fault injection rate ($R_{inj}$) was maintained at 0.001. The experimental results are presented in Figure 2 and Figure 3.

As the kernel size increases from $1 \times 1$ to $7 \times 7$, the numerical deviation in the output feature maps becomes more pronounced. A $1 \times 1$ convolution primarily performs channel-wise transformations, where the computational process only involves the fusion of individual pixels. This configuration exhibits high parameter redundancy and a localized fault impact, which limits error diffusion. In contrast, larger kernels extract spatial neighborhood features through the fusion of adjacent pixels. Consequently, errors propagate through these spatial correlations, leading to a degradation in fault tolerance as the kernel size increases.

Regarding the impact of channel density, the $3 \times 3$ convolution kernel was fixed while varying the number of channels. The resulting reliability distribution follows a convex pattern. At a low channel count (e.g., 8 channels), the limited feature representation and lack of redundancy provide insufficient compensation for injected errors. Conversely, when the channel count exceeds 64, the sharp increase in parameter scale at a constant fault rate aggravates inter-channel error accumulation.

The results demonstrate that the fault-tolerant capacity of a convolutional layer is jointly determined by its kernel dimensions and channel density. Optimizing these parameters is essential for enhancing the SEU resilience of neural networks.

4.2. Activation Function Analysis

In the fault injection experiments for activation functions, perturbations were exclusively introduced into the input tensors. The results indicate that activation functions exert a significant inhibitory effect on soft errors. These findings are illustrated in Figure 4 and Figure 5.

The mitigation of soft errors by activation functions is primarily attributed to the intrinsic mechanisms of error shielding and amplitude attenuation. Activation functions can map negative values or extreme deviations toward zero. By inherently masking negative-going transients, this mechanism effectively attenuates a significant portion of localized soft errors, preventing them from propagating to subsequent network layers.

Smooth activation functions (such as Swish) facilitate the attenuation of error magnitudes via their continuous non-linear mapping characteristics. Unlike the sharp thresholding and abrupt gradient changes of standard ReLU, the smooth transition of Swish around zero prevents abrupt error amplification. This smooth, non-monotonic curve helps to dampen small localized perturbations, preserving the stability of the feature maps and ensuring that input deviations—such as those induced by bit-flips—are smoothly absorbed rather than sharply propagated.

Smooth activation functions like Swish are the preferred choice for enhancing hardware fault tolerance. They simultaneously offer robust masking for negative perturbations and a smooth, continuous transformation that actively dampens error propagation, thereby preventing individual hardware upsets from destabilizing the overall inference process.
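The shape differences discussed above can be inspected numerically. This sketch only illustrates the two curves and does not reproduce the measurements in Figures 4 and 5; it assumes the common definition of Swish, $x \cdot \sigma(x)$, and a hypothetical finite-difference step.

```python
import math

def relu(x):
    return max(0.0, x)

def swish(x):
    """Swish with unit scale, x * sigmoid(x) (assumed definition)."""
    return x / (1.0 + math.exp(-x))

def slope(f, x0, x1):
    """Finite-difference slope of f between x0 and x1."""
    return (f(x1) - f(x0)) / (x1 - x0)

h = 1e-4
# ReLU has a kink at zero: slope 0 on the left, slope 1 on the right.
relu_left, relu_right = slope(relu, -h, 0.0), slope(relu, 0.0, h)
# Swish crosses zero smoothly with slope 0.5.
swish_at_zero = slope(swish, -h, h)

# Large negative inputs are driven toward zero by both functions.
deep_negative = swish(-20.0)

# Swish is non-monotonic: it dips below zero and then returns toward it.
dip, tail = swish(-1.0), swish(-5.0)

print(relu_left, relu_right, swish_at_zero)
print(deep_negative, dip, tail)
```

Unlike ReLU, Swish does not zero out moderately negative inputs exactly: it attenuates them, with a bounded dip, and suppresses them fully only for strongly negative values.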

4.3. Hierarchical and Block-Level Analysis

Fault injection experiments were conducted on an INT8-quantized ResNet-50 model with a fault injection rate ($R_{inj}$) of 0.001. Perturbations were introduced into both input feature maps and weights of all convolutional layers. The PSNR of the final output feature map was utilized as the primary metric to evaluate fault resilience. The experimental results are illustrated in Figure 6.

The results demonstrate significant hierarchical and intra-module variations in fault tolerance, closely correlated with the network depth. As the initial feature extraction stage, conv1 lacks a residual structure to provide error compensation. Because its output serves as the foundation for all subsequent operations, errors propagate and accumulate throughout the entire network without being suppressed by downstream redundancy. Consequently, the PSNR for conv1 is markedly lower than that of other layers.

From a global perspective, fault resilience exhibits a clear hierarchical degradation trend. Low-level feature extraction modules demonstrate superior robustness compared to high-level semantic modules. Early layers primarily process redundant basic textures and edges; however, as network depth increases, feature maps decrease in size while abstraction increases. High-level features possess high semantic specificity and low redundancy, causing errors to accumulate and amplify rapidly.

Intra-module analysis reveals consistent fault tolerance patterns across the three-layer convolutional blocks. Within the same residual block, the resilience of conv1, conv2, and conv3 layers typically increases in sequence. This trend is dictated by their functional roles: conv1 executes channel dimension reduction, while conv3 facilitates channel recovery and the subsequent residual addition. These findings confirm that residual connections significantly enhance the overall fault tolerance of the architecture.

4.4. Efficiency Analysis and Dataset Validation

The computational efficiency of the proposed methodology was evaluated using an INT8-quantized ResNet-50 as the benchmark network. A comprehensive comparative analysis was conducted between the full-space injection test, which targets all layers sequentially, and the focused injection strategy. Validation was performed on the Imagenette2 dataset, using the 3925 images of its validation set. Prior to inference, all images underwent standard preprocessing, including scaling and normalization. To ensure a rigorous assessment, the fault injection rate ($R_{inj}$) was fixed at 0.01, so that random bit-flips were introduced into 1% of the weights in the target layer.

The focused injection strategy was implemented by first conducting an initial screening on 100 randomly selected images to identify fault-sensitive layers. This phase utilized a screening threshold ($\tau_{screen}$) of 0.001 for Top-1 accuracy loss to generate a high-sensitivity injection sequence. Subsequently, extensive fault injection was performed exclusively on these identified high-priority layers using the full validation set. The total execution time for this focused approach encompasses both the initial sequence generation and the refined testing phase. To ensure the statistical validity of the comparison, all experiments were executed under identical random seeds and unified inference logic.
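The two-phase procedure can be outlined as follows. This is a sketch, not the authors' code: the layer names and screening losses are invented, a layer is assumed to be selected when its loss strictly exceeds the threshold, and the cost model (time proportional to layers times images) is a simplification that will not reproduce the measured timings in Table 1.

```python
def select_sensitive_layers(screen_loss, tau_screen):
    """Layers whose screening Top-1 accuracy loss exceeds tau_screen,
    ordered from most to least sensitive."""
    picked = [(loss, name) for name, loss in screen_loss.items()
              if loss > tau_screen]
    return [name for loss, name in sorted(picked, reverse=True)]

def estimated_time_reduction(n_layers, n_selected, n_screen, n_full):
    """Fractional saving of the focused strategy over full-space injection.

    Assumed cost model: one unit of time per (layer, image) pair.
    """
    full_cost = n_layers * n_full
    focused_cost = n_layers * n_screen + n_selected * n_full
    return 1.0 - focused_cost / full_cost

# Hypothetical screening result on 100 images.
screen_loss = {
    "layer_a": 0.012,
    "layer_b": 0.0004,
    "layer_c": 0.003,
    "layer_d": 0.0,
    "layer_e": 0.001,
}
selected = select_sensitive_layers(screen_loss, tau_screen=0.001)
saving = estimated_time_reduction(
    n_layers=len(screen_loss),
    n_selected=len(selected),
    n_screen=100,   # screening subset size from the text
    n_full=3925,    # full validation set size from the text
)
print(selected)  # ['layer_a', 'layer_c']
print(saving)
```

Under this cost model the saving is governed mainly by the fraction of layers that survive screening, since the screening pass itself touches only 100 of the 3925 images.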

Detailed performance metrics, including total time consumption and speedup factors, are summarized in Table 1. The experimental results demonstrate that the focused injection strategy significantly curtails the fault injection space through preliminary screening, resulting in a 60.80% reduction in total time consumption. Notably, while the computational overhead was substantially reduced, the accuracy degradation remained below 0.1%. This performance confirms that the proposed method achieves an optimal balance between high-precision reliability assessment and execution efficiency.

5. Conclusions

This paper presents a fault injection framework developed within the PyTorch environment, specifically designed for SEE fault injection and convolutional neural network reliability evaluation. The framework enables refined fault injection by incorporating the impact of model quantization on fault characteristics and supporting fine-grained injection configurations.

Experimental results demonstrate that the framework comprehensively evaluates the fault characteristics of NPUs and convolutional neural networks across multiple dimensions, effectively facilitating research on fault tolerance and meeting rigorous experimental requirements. Based on this framework, this study proposes a statistically oriented efficiency optimization method for quantized convolutional neural network fault injection. This methodology provides a critical experimental basis and theoretical support for the radiation-hardened design of low-precision quantized NPUs. While our experiments focus on INT8, the framework’s quantization adapter is inherently precision-agnostic, enabling direct evaluation of other quantization schemes.

Author Contributions

Conceptualization, Y.H. and Y.Z.; methodology, Y.H.; software, J.Z. and Q.P.; validation, Y.H., J.Z. and W.Z.; formal analysis, Y.H.; investigation, J.Z. and W.Z.; resources, Y.Z.; data curation, Y.H. and W.Z.; writing—original draft preparation, Y.H.; writing—review and editing, J.Z., Q.P. and W.Z.; visualization, W.Z. and Q.P.; supervision, Y.Z.; project administration, Y.Z.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Ibrahim, Y.; Wang, H. Soft errors in DNN accelerators: A comprehensive review. Microelectron. Reliab. 2020, 115, 113969.
Zhang, J.J.; Gu, T.; Basu, K.; Garg, S. Analyzing and mitigating the impact of permanent faults on a systolic array based neural network accelerator. In 2018 IEEE 36th VLSI Test Symposium (VTS); IEEE: Piscataway, NJ, USA, 2018; pp. 1–6.
Trindade, M.G. Assessment of Edge Machine-Learning Systems Under Radiation-Induced Effects. Ph.D. Thesis, Université Grenoble Alpes, Saint-Martin-d’Hères, France, 2021.
LaBel, K.A.; Cohn, L.M. Radiation testing and evaluation issues for modern integrated circuits. In Eighth European Conference on Radiation and Its Effects on Components and Systems (RADECS05); IEEE: Piscataway, NJ, USA, 2005.
Ruospo, A.; Luza, L.M.; Bosio, A.; Traiola, M.; Dilillo, L.; Sanchez, E. Pros and cons of fault injection approaches for the reliability assessment of deep neural networks. In 2021 IEEE 22nd Latin American Test Symposium (LATS); IEEE: Piscataway, NJ, USA, 2021; pp. 1–5.
Geier, J.; Mueller-Gritschneder, D.; Schlichtmann, U. Techniques and tools for fast fault injection simulations of RISC-V processors at RTL. In RISC-V Summit Europe; IEEE: Piscataway, NJ, USA, 2024.
Kchaou, A.; Saad, S.; Garrab, H.; Machhout, M. Reliability of LEON3 processor program counter against SEU, MBU, and SET fault injection. Cryptography 2025, 9, 54.
Mahmoud, A.; Aggarwal, N.; Nobbe, A.; Vicarte, J.R.S.; Adve, S.V.; Fletcher, C.W.; Frosio, I.; Hari, S.K.S.P. PyTorchFI: A runtime perturbation tool for DNNs. In 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W); IEEE: Piscataway, NJ, USA, 2020; pp. 25–31.
Chen, Z.; Narayanan, N.; Fang, B.; Li, G.; Pattabiraman, K.; DeBardeleben, N. TensorFI: A flexible fault injection framework for TensorFlow applications. In 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE); IEEE: Piscataway, NJ, USA, 2020; pp. 426–435.
Zheng, Y.; Feng, Z.; Hu, Z.; Pei, K. MindFI: A fault injection tool for reliability assessment of MindSpore applications. In 2021 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW); IEEE: Piscataway, NJ, USA, 2021; pp. 235–238.
Ruospo, A.; Gavarini, G.; Bragaglia, I.; Traiola, M.; Bosio, A.; Sanchez, E. Selective hardening of critical neurons in deep neural networks. In 2022 25th International Symposium on Design and Diagnostics of Electronic Circuits and Systems (DDECS); IEEE: Piscataway, NJ, USA, 2022; pp. 136–141.
Huang, H.; Xue, X.; Liu, C.; Wang, Y.; Luo, T.; Cheng, L.; Li, H.; Li, X. Statistical modeling of soft error influence on neural networks. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2023, 42, 4152–4163.
Li, G.; Hari, S.K.S.; Sullivan, M.; Tsai, T.; Pattabiraman, K.; Emer, J.; Keckler, S.W. Understanding error propagation in deep learning neural network (DNN) accelerators and applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis; IEEE: Piscataway, NJ, USA, 2017; pp. 1–12.
Libano, F.; Wilson, B.; Anderson, J.; Wirthlin, M.J.; Cazzaniga, C.; Frost, C.; Rech, P. Selective hardening for neural networks in FPGAs. IEEE Trans. Nucl. Sci. 2018, 66, 216–222.
Xu, D.; Zhu, Z.; Liu, C.; Wang, Y.; Zhao, S.; Zhang, L.; Liang, H.; Li, H.; Cheng, K.T. Reliability evaluation and analysis of FPGA-based neural network acceleration system. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2021, 29, 472–484.
Bertoa, T.G.; Gambardella, G.; Fraser, N.J.; Blott, M.; McAllister, J. Fault-tolerant neural network accelerators with selective TMR. IEEE Des. Test 2022, 40, 67–74.
Tan, J.; Wang, Q.; Yan, K.; Wei, X.; Fu, X. Saca-FI: A microarchitecture-level fault injection framework for reliability analysis of systolic array based CNN accelerator. Future Gener. Comput. Syst. 2023, 147, 251–264.
Taheri, M.; Daneshtalab, M.; Raik, J.; Jenihhin, M.; Pappalardo, S.; Jimenez, P.; Deveautour, B.; Bosio, A. Saffira: A framework for assessing the reliability of systolic-array-based DNN accelerators. In 2024 27th International Symposium on Design & Diagnostics of Electronic Circuits & Systems (DDECS); IEEE: Piscataway, NJ, USA, 2024; pp. 19–24.
Huang, H.; Liu, C.; Xue, X.; Liu, B.; Li, H.; Li, X. MRFI: An open-source multiresolution fault injection framework for neural network processing. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2024, 32, 1325–1335.
Ozen, E.; Orailoglu, A. SNR: Squeezing numerical range defuses bit error vulnerability surface in deep neural networks. ACM Trans. Embed. Comput. Syst. 2021, 20, 1–25.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2016; pp. 770–778.
Aketi, S.A.; Roy, K. Cross-feature contrastive loss for decentralized deep learning on heterogeneous data. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 10897–10910.
Kang, J.H.; Jeong, H.W.; Choi, C.K.; Ali, M.S.; Bae, S.H.; Kim, H.Y. An analysis on the properties of features against various distortions in deep neural networks. J. Korea Inst. Broadcast. Media Eng. 2021, 26, 868–876.

Figure 1. Fault injection framework for NPU reliability assessment.

Figure 2. Output distribution of the convolutional layer for kernel sizes: (a) $1 \times 1$, (b) $3 \times 3$, (c) $5 \times 5$, and (d) $7 \times 7$.

Figure 3. PSNR of output feature maps for $3 \times 3$ kernels under varying channel counts.

Figure 4. Mitigation of fault injection tensors: (a) ReLU Output, (b) ReLU Absolute Error, (c) Sigmoid Output, and (d) Sigmoid Absolute Error.

Figure 5. Error amplitude attenuation across various activation functions under bit-flip conditions.

Figure 6. PSNR of the final output feature map under fault injection across individual ResNet-50 layers.

Table 1. Comparison of $MaxDR$ and global Top-1 loss for high-sensitivity layers.

Index	Layer Name	Small-Sample $MaxDR$ (%)	Full-Space $MaxDR$ (%)
1	`layer4.0.downsample.0`	22.18	16.18
2	`layer4.1.conv1`	14.68	25.32
3	`conv1`	4.98	5.39
4	`layer4.2.conv1`	4.71	4.71
5	`layer4.0.conv2`	3.48	2.59
6	`layer2.0.downsample.0`	3.34	0.20
7	`layer1.0.conv1`	1.77	0.75
8	`layer4.0.conv3`	2.87	2.05
9	`layer3.1.conv3`	1.77	2.73
10	`layer4.1.conv2`	1.77	0.55
Average Top-1 Loss Rate		0.0155	0.0161
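One way to read Table 1 is to ask how many of the most sensitive layers the small-sample campaign and the full-space campaign agree on. The sketch below computes a simple top-k overlap from the tabulated $MaxDR$ values. The overlap measure and the helper names `top_k` and `overlap` are assumptions made for illustration, not a metric defined in this article.

```python
# MaxDR values (%) copied from Table 1: layer -> (small-sample, full-space)
max_dr = {
    "layer4.0.downsample.0": (22.18, 16.18),
    "layer4.1.conv1": (14.68, 25.32),
    "conv1": (4.98, 5.39),
    "layer4.2.conv1": (4.71, 4.71),
    "layer4.0.conv2": (3.48, 2.59),
    "layer2.0.downsample.0": (3.34, 0.20),
    "layer1.0.conv1": (1.77, 0.75),
    "layer4.0.conv3": (2.87, 2.05),
    "layer3.1.conv3": (1.77, 2.73),
    "layer4.1.conv2": (1.77, 0.55),
}

def top_k(column, k):
    """Set of the k layers with the largest MaxDR in the given column (0 or 1)."""
    ranked = sorted(max_dr, key=lambda name: max_dr[name][column], reverse=True)
    return set(ranked[:k])

def overlap(k):
    """Number of layers shared by the small-sample and full-space top-k sets."""
    return len(top_k(0, k) & top_k(1, k))

print(overlap(3))  # 3: both campaigns pick the same three most sensitive layers
print(overlap(5))  # 4: the fifth-ranked layer differs between the two campaigns
```

Three layers tie at 1.77% in the small-sample column, so overlaps for k of 8 or 9 depend on how ties are broken and are best left out of any comparison.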
