1. Introduction
Secondary equipment in electric power grids plays a crucial role in monitoring, protection, control, and communication functions, ensuring the safe and efficient operation of primary equipment [1,2,3]. In recent years, there has been increasing demand for higher computing power and lower power consumption from power systems [4,5,6,7]. To meet this demand, related devices have increasingly embraced advanced semiconductor manufacturing processes and modular design [8,9].
The reliability of the secondary equipment, however, currently faces new challenges. Microprocessor-based relays, for instance, can be affected by single event effects (SEEs) caused by atmospheric neutrons or by α-particles generated from radioactive impurities, such as U and Th, in the chip package [10,11]. During decay, these impurities emit high-energy α-particles that, when striking the sensitive nodes of the chip, deposit energy and generate electron–hole pairs. This process can lead to SEEs.
In 2018 and 2020, SEE incidents were recorded in Chinese relay protection devices, highlighting the urgent need to assess the impact of SEEs on these devices. While some previous studies have examined SEEs induced by neutrons in the atmospheric environment [12,13,14], the current research assesses the impact of SEEs caused by α-particles generated within the packaging on the reliability of Chinese relay protection devices, especially from the system and application perspectives. Such system-level and application-level assessments are largely absent from existing research.
Owing to the uncertainties associated with irradiation tests, including variations in irradiation duration and cost, irradiation experiments are used in the current study primarily to verify whether α-particles can induce SEEs in a chip fabricated with the same manufacturing technology as the core processor of the targeted secondary equipment. Building on the irradiation experiments, the subsequent research emphasizes the outcomes of Monte Carlo simulations and software fault injection. Compared with irradiation tests alone, Monte Carlo simulation combined with software fault injection provides more detailed information that might be difficult to obtain through irradiation tests [15,16,17,18]. Furthermore, software fault injection can trigger a broader range of events inside the relay protection process or system, including failures to act, false operations, and others. In this study, an SEE assessment framework was developed for relay protection devices. It incorporates Monte Carlo simulation of α-particles generated in the chip package, software fault injection of α-particle-induced single event upsets (SEUs), and a system-level assessment that takes into account the specific applications running on the secondary equipment. This can provide new insights into the reliability assessment of secondary equipment in power systems.
2. Alpha Particle Irradiation
SEU irradiation experiments were conducted using ²⁴¹Am α-particle sources targeting a system-on-chip (SoC) that employed the same technology process as the relay protection device. Before irradiation, the SoC was de-capped and the ²⁴¹Am α-particle sources covered the top of the bare chip [19]. The particle flux of the ²⁴¹Am α-particle source was 3759.8 cm⁻²·s⁻¹. The emitted α-particles had a Linear Energy Transfer (LET) value of approximately 0.576 MeV·cm²·mg⁻¹, with a range of 27.93 μm in silicon. Irradiation was performed on the 64 kB memory of the SoC, and the duration was 324 min with a final cumulative fluence of 7.31 × 10⁷ cm⁻².
Different types of SEE were observed during the irradiation. Specifically, single-bit upset (SBU), dual-cell upset (DCU), triple-cell upset (TCU), and single event functional interruption (SEFI) events were detected. Table 1 lists the details of the observed events. The corresponding SEU cross section was 2.48 × 10⁻¹¹ cm²/bit.
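The quoted irradiation figures can be cross-checked with a few lines of arithmetic. The sketch below recomputes the fluence from the flux and duration, and then derives the number of upset bits implied by the reported cross section. It assumes that "64 kB" means 64 × 1024 bytes of 8 bits each and that the cross section is defined as upsets divided by (fluence × bits); neither is stated explicitly in the text, so the implied count is only indicative and should be compared against Table 1.

```python
# Values quoted in the text.
FLUX_PER_CM2_S = 3759.8        # 241Am source flux, cm^-2 s^-1
DURATION_MIN = 324             # irradiation time, minutes
SIGMA_CM2_PER_BIT = 2.48e-11   # reported SEU cross section, cm^2/bit

# Assumption: "64 kB" = 64 * 1024 bytes, 8 bits per byte.
MEMORY_BITS = 64 * 1024 * 8


def fluence(flux_per_cm2_s, minutes):
    """Cumulative fluence (cm^-2) for a constant flux."""
    return flux_per_cm2_s * minutes * 60.0


def implied_upsets(sigma_cm2_per_bit, fluence_cm2, bits):
    """Upset count implied by sigma = N / (fluence * bits)."""
    return sigma_cm2_per_bit * fluence_cm2 * bits


phi = fluence(FLUX_PER_CM2_S, DURATION_MIN)
n_upsets = implied_upsets(SIGMA_CM2_PER_BIT, phi, MEMORY_BITS)

print(f"fluence             = {phi:.3e} cm^-2")
print(f"implied upset count = {n_upsets:.0f}")
```

Under these assumptions the recomputed fluence agrees with the reported 7.31 × 10⁷ cm⁻², and the cross section corresponds to roughly 950 upset bits over the whole run.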
From Table 1, it can be confirmed that α-particles emitted from the packaging can induce SEEs in chips fabricated with the same process as those used in the secondary equipment. Nevertheless, to date, a comprehensive assessment of α-particle-induced SEEs in relay protection devices is lacking. Open questions include how an SEE in a memory block affects the specific applications of the secondary equipment and how its propagation through the system can be evaluated quantitatively.
Whereas irradiation tests verify that SEEs occur inside the relay protection device, the propagation mechanisms of system-level SEEs across different modules in the device can be investigated through simulation and fault injection techniques, particularly when these are combined with probabilistic safety analysis methods such as fault tree analysis.
3. Monte Carlo Simulations
The Geant4 toolkit (version 10.5) can be applied to evaluate SEEs in the target relay protection device [20,21,22]. The current research follows the same model as that adopted in [12]. The simulation model is based on the actual package structure of the target device. For the α-particle simulation, arrays of 1 K and 1 M sensitive volumes were simulated for comparison. Each sensitive volume measured 160 nm × 160 nm × 160 nm in the simulation model, and the critical charge was 3820 eV. The sensitive volume and the critical charge were not chosen arbitrarily; the reliability of these parameters must be confirmed by irradiation experiments and Geant4 simulations. Specifically, they were verified in [12] for the target device and were obtained iteratively from the results of SRAM irradiation tests on the same technology in [17,18,19,20,21] and other works. Thus, they can be applied in this project.
Although ²⁴¹Am is one of the most widely available α sources, ²³²Th is an important impurity in packaging materials and emits mainly 4.013 MeV α-particles. This work aims to mimic α-particles from package impurities inducing SEEs. Hence, the energy of the α-particles was set to 4.013 MeV in the simulation, and the corresponding LET was 0.70 MeV·cm²·mg⁻¹. The physical processes included ionization, decay, Coulomb scattering, etc. A total of 10⁸ impinging particles were generated at various angles.
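To give a feel for how the LET relates to the upset threshold, the following back-of-the-envelope sketch converts the simulated LET into an energy loss per nanometre of silicon and compares the energy deposited across one sensitive volume with the 3820 eV threshold. It is not part of the Geant4 model: the silicon density (2.33 g/cm³), the 3.6 eV mean energy per electron–hole pair, and the constant-LET, straight-chord approximation are assumptions introduced here for illustration.

```python
# Values quoted in the text.
LET_MEV_CM2_PER_MG = 0.70   # LET of the 4.013 MeV alpha particle
SV_EDGE_NM = 160.0          # edge of the cubic sensitive volume
THRESHOLD_EV = 3820.0       # upset threshold quoted as "critical charge"

# Assumptions (not from the text).
SI_DENSITY_MG_PER_CM3 = 2330.0          # bulk silicon, 2.33 g/cm^3
EV_PER_PAIR = 3.6                       # mean energy per e-h pair in Si
ELEMENTARY_CHARGE_C = 1.602176634e-19


def energy_loss_ev_per_nm(let_mev_cm2_per_mg, density_mg_per_cm3):
    """Linear energy loss, assuming a constant LET along the track."""
    mev_per_cm = let_mev_cm2_per_mg * density_mg_per_cm3
    return mev_per_cm * 1e6 / 1e7   # MeV -> eV, cm -> nm


loss = energy_loss_ev_per_nm(LET_MEV_CM2_PER_MG, SI_DENSITY_MG_PER_CM3)
deposit_normal_ev = loss * SV_EDGE_NM   # chord at normal incidence
min_chord_nm = THRESHOLD_EV / loss      # shortest chord reaching threshold
threshold_fc = THRESHOLD_EV / EV_PER_PAIR * ELEMENTARY_CHARGE_C * 1e15

print(f"energy loss        = {loss:.1f} eV/nm")
print(f"deposit over 160nm = {deposit_normal_ev / 1e3:.1f} keV")
print(f"minimum chord      = {min_chord_nm:.1f} nm")
print(f"threshold charge   = {threshold_fc:.3f} fC")
```

With these assumptions a track crossing the full 160 nm deposits several times the threshold energy, and a chord of only a few tens of nanometres is already sufficient, which is consistent with a single α-particle upsetting more than one neighbouring cell at oblique incidence.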
As α-particles can be emitted in any direction within the chip packaging, different incoming angles (angular deviation from the normal line of the chip surface), as shown in Figure 1, were used in the simulation to comprehensively evaluate the impact of particle incidence angles on SEUs.
Finally, various SEU events were detected. Specifically, up to 18 simultaneously upset bits were observed in the 1 Kbit simulation, and up to 22 simultaneously flipped bits were observed in the 1 Mbit simulation. The details of the recorded SEUs are listed in Table 2 and Table 3.
Figure 2 shows schematic bit-flip distributions for several multiple-cell upsets (MCUs), which may span multiple words and take different spatial forms.
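Whether an upset event spans one word or several matters later, when ECC operates per word. A minimal sketch of how simulated events could be classified is given below. The function name, the linear bit addressing without interleaving, and the 64-bit word width are illustrative assumptions; events with more than two flipped cells are labelled MCU here, following the DCU/MCU split used in the fault injection tests.

```python
from collections import Counter


def classify_event(bit_addresses, word_bits=64):
    """Classify one upset event given the linear addresses of flipped bits.

    Returns (kind, max_flips_in_one_word, words_touched).
    Assumes consecutive addresses belong to the same word (no interleaving).
    """
    cells = set(bit_addresses)
    if not cells:
        raise ValueError("an event must contain at least one flipped bit")

    if len(cells) == 1:
        kind = "SBU"
    elif len(cells) == 2:
        kind = "DCU"
    else:
        kind = "MCU"

    flips_per_word = Counter(addr // word_bits for addr in cells)
    return kind, max(flips_per_word.values()), len(flips_per_word)


# Hypothetical events, not taken from Table 2 or Table 3.
print(classify_event([10]))            # one bit in one word
print(classify_event([63, 64]))        # two bits straddling a word boundary
print(classify_event([0, 1, 2, 70]))   # four bits across two words
```

The second value in the returned tuple is the quantity that determines whether a per-word SEC-DED code can correct (one flip), detect (two flips), or possibly mishandle (three or more flips) the event.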
4. Software Fault Injection
For the SoC deployed in secondary equipment, the actual workload may not use the entire memory capacity, and not all SEUs occurring in the workload will lead to malfunction. Software fault injection is therefore needed to evaluate and analyze the fault manifestation probability within the actual workload of the relay protection device.
4.1. Actual Failure and Fault Injection
The evaluation was carried out using the above-mentioned framework for assessing SEEs in relay protection devices, with faults injected into memory cells according to the Monte Carlo simulation results. Because SBUs account for a high proportion of the upsets caused by α-particles within the packaging, and because SBUs can be eliminated with error-correcting codes (ECC), this phase focuses solely on the soft errors resulting from DCUs and MCUs in the equipment. It is worth noting that the SEU cross section of the device under α-particle strikes, derived from Monte Carlo simulations, does not directly represent the failure rate during the actual operation of the secondary equipment, as not all soft errors necessarily lead to functional failures that trigger anomalies such as false tripping or refusal to trip. Therefore, the fault manifestation probability, which is the likelihood that soft errors translate into perceivable failures, was obtained using software fault injection methods. This probability depends on multiple factors, including, but not limited to:
Error masking: In certain cases, even though soft errors occur, they may have no discernible impact on the operational process or program output because of logical relationships or how the data are used, or the erroneous data may be overwritten by new data before causing an exception.
Error tolerance and recovery technologies: System design may include specific redundancy or fault-tolerance techniques, such as ECC or other types of error correction technology, which can reduce the probability of soft errors manifesting as failures.
Position of the soft error: The significance of the data or execution path affected by the error also influences its manifestation; for example, a soft error that occurs in a hardware region outside the active workload will not cause an exception.
4.2. Fault Injection in General Test Programs
The Fast Fourier Transform (FFT) is widely used in secondary equipment for applications such as harmonic detection, fault detection and location, system stability analysis, and load monitoring and analysis [23]. Consequently, a general test program based on FFT was developed in C. The main functionalities of the program are as follows:
- The authentication section simulates user input and verifies credentials;
- The I/O reading section acquires time-domain signals of voltage and current;
- The data analysis section checks instantaneous values and performs FFT;
- The power calculation section computes active power, root mean square values, apparent power, and power factor based on the available data.
The general test program operates with a cycle time of approximately 1000 ms.
In the implementation of ECC, the Single Error Correct–Double Error Detect (SEC-DED) code was used [24], where each 64-bit data segment underwent parity calculation to produce the first 7 bits of checksum, and the 8th bit was a parity bit that checked all other data bits. The data and checksum bits (64-bit + 8-bit) were stored separately in different members of a structure. During fault injection, logic that may induce errors in the checksum bits was added. In the ECC checking process, syndromes were calculated for error detection and correction, as shown in Table 4. It is important to note that when an odd number of bit errors greater than or equal to three occurs, which is possible during fault injection, it may not be possible to identify or calculate the error location; in that case, however, the number of errors in the final data does not exceed the number of originally erroneous bits.
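The behaviour of a (72, 64) SEC-DED code can be illustrated with a compact extended Hamming construction. The sketch below is not the authors' C implementation and does not reproduce Table 4: the bit placement (data bits on the non-power-of-two positions 1 to 71), the function names, and the status strings are assumptions chosen for illustration, and the overall parity bit is assumed to cover both the data and the 7 check bits.

```python
# Codeword positions 1..71; powers of two are reserved for the 7 check bits,
# leaving exactly 64 positions for data bits.
DATA_POSITIONS = [p for p in range(1, 72) if p & (p - 1)]
assert len(DATA_POSITIONS) == 64


def _popcount(x):
    return bin(x).count("1")


def _position_xor(data):
    """XOR of the codeword positions of all set data bits."""
    acc = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            acc ^= pos
    return acc


def encode(data):
    """Return (7-bit check value, overall parity bit) for a 64-bit word."""
    check = _position_xor(data)
    overall = (_popcount(data) + _popcount(check)) & 1
    return check, overall


def decode(data, check, overall):
    """Return (possibly corrected data, status string)."""
    syndrome = _position_xor(data) ^ check
    parity_bad = (_popcount(data) + _popcount(check) + overall) & 1
    if syndrome == 0:
        return data, "parity_bit_error" if parity_bad else "ok"
    if not parity_bad:
        return data, "double_error_detected"
    if syndrome in DATA_POSITIONS:
        bit = DATA_POSITIONS.index(syndrome)
        return data ^ (1 << bit), "corrected_data_bit"
    if syndrome & (syndrome - 1) == 0:
        return data, "corrected_check_bit"
    return data, "uncorrectable"   # syndrome points outside the codeword


word = 0x0123456789ABCDEF
check, overall = encode(word)
print(decode(word, check, overall)[1])
print(decode(word ^ (1 << 5), check, overall)[1])
print(decode(word ^ (1 << 5) ^ (1 << 40), check, overall)[1])
```

A single flipped bit produces a non-zero syndrome together with a parity mismatch and is corrected, whereas two flipped bits leave the overall parity intact and are only detected. With three or more flips in one word the syndrome may point at a wrong position or at no valid position at all, which is the situation described in the paragraph above.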
During the fault injection testing process, the pattern of bit flips was based on the coordinates of DCUs and MCUs in the memory cell array obtained from Geant4 simulation results. For each bit to be flipped in DCU or MCU, the corresponding byte and bit positions were calculated, and each designated bit was flipped using XOR operations. The ratios of dual-bit and multi-bit flips used in the tests reflect the outcomes of various particle incidence angles derived from Geant4 simulations.
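The byte and bit arithmetic described above can be sketched as follows. The test program itself was written in C; this Python version only illustrates the index calculation and the XOR flip. The function names, the row-major cell layout, and the least-significant-bit-first ordering within a byte are assumptions, not details given in the text.

```python
def cell_to_bit_index(row, col, cols_per_row):
    """Map a (row, col) cell coordinate to a linear bit index.

    Assumes a row-major layout; the real mapping depends on the memory
    array organisation, which the text does not specify.
    """
    return row * cols_per_row + col


def flip_bits(memory, bit_indices):
    """Flip the given bits of a bytearray in place using XOR."""
    for index in bit_indices:
        byte_pos, bit_pos = divmod(index, 8)
        memory[byte_pos] ^= 1 << bit_pos


# Hypothetical DCU: two vertically adjacent cells in an 8-column array.
memory = bytearray(4)
dcu = [cell_to_bit_index(0, 3, 8), cell_to_bit_index(1, 3, 8)]
flip_bits(memory, dcu)
print(memory.hex())
```

Because XOR is its own inverse, applying the same pattern a second time restores the original contents, which is convenient for resetting the target between injections.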
The fault injection process involved varying the deflection angle of the impinging particles and whether ECC was activated. A total of 16,385 fault injections were conducted, with the proportion of faults varying according to the deflection angle as follows: at a deflection angle of 30°, DCUs accounted for 91.93% and MCUs for 9.07%; at 60°, DCUs constituted 99.96% and MCUs 0.04%; and at 90°, DCUs represented 86.09% and MCUs 13.91%. As shown in Figure 3, five outcomes were observed during fault injection in the general test program: abnormal exit (AE), system halt (SH), time out (TO), error result (ER), and normal [12].
- Abnormal exit (AE): The program exits with an abnormal exit code.
- System halt (SH): Program execution is halted.
- Time out (TO): Program execution exceeds the expected duration.
- Error result (ER): The execution results differ from the expected results.
- Normal: The injected faults have no visible influence on the tested program’s execution.
Among them, the first four outcomes are abnormal for the secondary equipment. Figure 4 shows the abnormal results caused by DCUs and MCUs in different cases.
As depicted in Figure 3, the occurrence probability of SH and ER was significantly higher than that of AE and TO. For both DCUs and MCUs, the use of ECC effectively reduced the occurrence of SH, potentially converting SH into one of the other three abnormal outcomes (mainly ER). Figure 4 illustrates that MCUs were more likely to cause abnormal results compared to DCUs, and in practice, ECC did not significantly reduce the total number of errors.
4.3. Fault Injection in Actual Workloads
A test platform was constructed using the actual operational software of secondary equipment, featuring an embedded system design identical to the real product. The testing system consisted of a real-time CPU core (bare metal) and a management CPU core (ARM side). The real-time CPU core, operating without an operating system, handled functions such as logic, alarms, and control, which require high real-time performance. The ARM side, running an embedded Linux operating system, handled the management, communication, and display functions of the device. The modules shown in Figure 5 represent the core function modules running on the management CPU core. They include the algorithm-processing and alarm analysis module, task flow control module, data transmission and display module, and data acquisition module.
Given the specific characteristics of the test platform and the workload, the outcomes were reduced from five categories to three: System Halt (SH), Error Result (ER), and Normal. Table 5 displays the results of fault injection in the core function modules.
Using the same test platform, fault injection tests were conducted on the bare-metal program. Variable swapping was employed to introduce errors into the operation of the real-time CPU core. The operation states were observed, and only SH appeared in the test, with a count of 295 out of a total of 504 tests.
To quantitatively analyze the system’s reliability, a fault tree analysis model, as illustrated in Figure 6, was constructed from the probabilities of the various abnormal results to analyze failures throughout the entire secondary equipment system [25]. This fault tree model helped to identify potential vulnerabilities and assess the overall reliability of the system.
The fault tree shows the impact of each module’s soft errors on system reliability, based on the fault injection results, and allows the sensitive modules to be identified directly. The overall probability of system failure was approximately 33%, considering failures of multiple modules on the management CPU core and the single failure source of the real-time CPU core. Each module on the management CPU core contributed differently to system failures. Specifically, the task flow control module had the greatest impact on SH (7.61%), and the data acquisition module had a significant probability of causing ER (2.46%). A failure of the real-time CPU core directly caused SH (11.69%); although this probability was lower than the total failure probability of the management CPU core (21.36%), its influence as a single source of failure was significant.
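The top-level combination of the two branch probabilities can be illustrated with a small OR-gate calculation. The gate semantics of the fault tree in Figure 6 are not spelled out in the text, so the sketch below shows two common conventions side by side: a simple sum (valid for mutually exclusive or rare events) and the independent-event formula. Function names are illustrative.

```python
# Branch probabilities quoted in the text.
P_MANAGEMENT_CORE = 0.2136   # total failure probability, management CPU core
P_REALTIME_CORE = 0.1169     # SH probability, real-time CPU core


def or_gate_sum(probabilities):
    """OR gate as a plain sum (mutually exclusive / rare-event approximation)."""
    return sum(probabilities)


def or_gate_independent(probabilities):
    """OR gate for statistically independent basic events."""
    none_fail = 1.0
    for p in probabilities:
        none_fail *= 1.0 - p
    return 1.0 - none_fail


branches = [P_MANAGEMENT_CORE, P_REALTIME_CORE]
p_sum = or_gate_sum(branches)
p_ind = or_gate_independent(branches)

print(f"sum of branches    = {p_sum:.2%}")
print(f"independent events = {p_ind:.2%}")
```

The plain sum gives 33.05%, which matches the approximately 33% quoted above. Treating the two cores as independent would give a slightly lower value of about 30.55%, so the choice of gate model changes the top-event probability by a few percentage points.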
5. Discussion
From Table 2 and Table 3, it can be observed that α-particles deflected by 15° induced the highest number of bit flips owing to SEEs. For a 1 K sensitive volume array, the SEE cross section under a 15° deflection of α-particles was 2.16 × 10⁻¹⁰ cm²/bit, while for a 1 M sensitive volume array, it was 2.67 × 10⁻¹⁰ cm²/bit. These results indicate that, for α-particles, the simulation outcomes for 1 K and 1 M sensitive volume arrays were similar, primarily because α-particle-induced SEEs resulted from direct ionization and were mainly related to the Linear Energy Transfer (LET), showing negligible dependence on array capacity. Additionally, from the statistics of SBUs, DCUs, and MCUs caused by α-particle incidence, it is evident that SBUs played a significant role across different incidence scenarios.
For the α-particle irradiation test in Section 2, the SEU cross section was 2.48 × 10⁻¹¹ cm²/bit. In the simulation, the cross section corresponding to the highest number of bit flips was 2.16 × 10⁻¹⁰ cm²/bit. The LET in the irradiation test was 0.576 MeV·cm²·mg⁻¹, while in the simulation, it was 0.70 MeV·cm²·mg⁻¹. The higher LET in the simulation is consistent with its higher SEU cross section, which exceeds the measured value by roughly one order of magnitude. On this basis, the constructed simulation was considered reasonable, as were the subsequent fault injection and analysis.
Additionally, the failure rate could be estimated based on the simulation results. Since the 28 nm CMOS SRAM was packaged with an α-particle emission rate of 0.001 α/(cm²·h), deploying approximately 10⁵ CPU board chips with a memory capacity of 1 Mbit using the same technology could result in 270 SEUs per year. Each SEU had a probability of approximately 99.989% of being an SBU, approximately 0.01% of being a DCU, and less than 0.01% of being an MCU.
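The order of magnitude of this estimate can be reproduced with a simple rate calculation. The sketch below is an approximation under stated assumptions rather than the authors' exact procedure: it treats the package emission rate as the flux reaching the memory array, uses the 15° cross section of the 1 M array (2.67 × 10⁻¹⁰ cm²/bit) from the simulation, reads "1 M bits" as 10⁶ bits, takes the deployment size as 10⁵ chips, and assumes continuous operation (8760 h per year).

```python
# Values quoted in the text.
EMISSION_PER_CM2_H = 0.001    # alpha emission rate of the package
DCU_FRACTION = 0.0001         # ~0.01% of SEUs are DCUs

# Assumptions (see lead-in).
SIGMA_CM2_PER_BIT = 2.67e-10  # 15 deg cross section, 1 M array
BITS_PER_CHIP = 1_000_000     # "1 M bits" read as 1e6 bits
CHIPS = 100_000               # deployment size, 1e5 chips
HOURS_PER_YEAR = 8760         # continuous operation


def seu_per_year(emission, sigma, bits, chips, hours):
    """Expected fleet-wide SEU count per year for a constant alpha flux."""
    per_chip_per_hour = emission * sigma * bits
    return per_chip_per_hour * chips * hours


fleet_seu = seu_per_year(EMISSION_PER_CM2_H, SIGMA_CM2_PER_BIT,
                         BITS_PER_CHIP, CHIPS, HOURS_PER_YEAR)
fleet_dcu = fleet_seu * DCU_FRACTION

print(f"SEUs per year (fleet) = {fleet_seu:.0f}")
print(f"DCUs per year (fleet) = {fleet_dcu:.3f}")
```

Under these assumptions the estimate comes out at roughly 234 SEUs per year, the same order of magnitude as the 270 quoted above; the difference depends on which cross section and angular weighting are used, which the text does not specify. The expected number of DCUs across the whole fleet is then only a few hundredths per year.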
Software fault injection revealed how the test load responded to the DCUs and MCUs induced by α-particles. When ECC was evaluated without considering SBUs, it was found that ECC did not significantly reduce the abnormal results caused by DCUs and MCUs in the test load. However, ECC was able to mitigate the severity of these events to some extent. Based on the fault injection results of the actual workload, a fault tree was established. Analysis of this fault tree indicates that, in the event of a DCU or MCU, the real-time CPU core in the secondary equipment is most likely to cause a failure to act because of an SH. In contrast, the data acquisition module is most likely to cause a false operation owing to an ER.
In summary, when evaluating the reliability of secondary equipment systems, especially where no guidelines or standards exist for this area, the following measures are recommended. First, calculate the SEU cross section of a representative memory cell array to quantitatively assess its sensitivity to SEUs. Second, conduct software fault injection testing to determine the failure rates of different software architectures or versions under SEUs, taking into account the memory occupancy of the various software loads, to evaluate overall software robustness. Third, examine the effectiveness of mitigation techniques, such as ECC, periodic refreshing, and triple voting systems, in alleviating the adverse effects of SEUs, and assess the performance of these techniques under actual workload conditions. Finally, build a fault tree to quantitatively assess how the influence of SEEs in memory cells propagates through the system.
Furthermore, the solution proposed in this paper is not limited to relay protection devices; it is also applicable to a wide range of microprocessor-based devices, including SoCs and other integrated systems. The paper focuses on relay protection devices for two reasons. First, relay protection systems are high-reliability systems that are increasingly susceptible to SEE threats. Second, the paper aims to remind researchers to consider the effects of SEEs on terrestrial systems, such as relay protection devices, in addition to aerospace electronic systems.
6. Conclusions
SEEs induced by α-particles on a relay protection device chip in secondary equipment were assessed. Through Monte Carlo simulations, the SEUs caused by α-particles striking at various deflection angles were obtained. The simulation results indicate that α-particles are most likely to trigger SEUs when they are incident at a 15° deflection angle. Based on the Monte Carlo simulation data for SEUs, fault injection was performed on a general test program and under real workload conditions, cataloging various anomalous results, such as abnormal exit, system halt, time out, and error result. The results suggest that, to enhance the mitigation effectiveness against soft errors induced by DCUs or MCUs, the chip would require more robust ECC or other hardening measures. Fault tree analysis identified the modules within the core CPU of the secondary equipment that were the most susceptible to different errors during operation. The Monte Carlo simulation method is universally applicable, and the software fault injection framework can be ported to similar systems, providing reliability assessments for SEUs in a variety of secondary equipment, not limited to relay protection devices.
Author Contributions
Conceptualization, W.Y. and C.H.; methodology, H.Z. and Z.Z.; software, H.Y.; validation, Z.S.; resources, H.Z. and Z.X.; writing—original draft preparation, H.Y. All authors have read and agreed to the published version of the manuscript.
Funding
This project was supported by the Science and Technology Project of NARI Technology Co., Ltd. (Grant No. SGNRGF00XAJS2301697).
Data Availability Statement
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
Authors H. Zhou, Z. Zou, Z. Su, and Z. Xu were employed by the company NARI Group Corporation (State Grid Electric Power Research Institute) and NARI Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
References
1. Laloux, D.; Rivier, M. Technology and Operation of Electric Power Systems; Regulation of the Power Sector; Springer: London, UK, 2013; pp. 1–46.
2. Weedy, B.M.; Cory, B.J.; Jenkins, N.; Ekanayake, J.B.; Strbac, G. Electric Power Systems; John Wiley & Sons: Hoboken, NJ, USA, 2012.
3. Zhang, B.; Hao, Z.; Bo, Z. New development in relay protection for smart grid. Prot. Control. Mod. Power Syst. 2016, 1, 1–14.
4. Esfahani, M.M.; Mohammed, O. An intelligent protection scheme to deal with extreme fault currents in smart power systems. Electr. Power Energy Syst. 2020, 115, 105434.
5. Jimada-Ojuolape, B.; Teh, J. Impact of the integration of information and communication technology on power system reliability: A review. IEEE Access 2020, 8, 24600–24615.
6. Aruna, S.B.; Suchitra, D.; Rajarajeswari, R.; Fernandez, S.G. A comprehensive review on the modern power system reliability assessment. Int. J. Renew. Energy Res. 2021, 11, 1734–1747.
7. Babu, S.; Hilber, P.; Jürgensen, J.H. On the status of reliability studies involving primary and secondary equipment applied to power system. In Proceedings of the 2014 International Conference on Probabilistic Methods Applied to Power Systems (PMAPS), Durham, UK, 7–10 July 2014; pp. 1–6.
8. Armstrong, K.O.; Das, S.; Cresko, J. Wide bandgap semiconductor opportunities in power electronics. In Proceedings of the 2016 IEEE 4th Workshop on Wide Bandgap Power Devices and Applications (WiPDA), Fayetteville, AR, USA, 7–9 November 2016; pp. 259–264.
9. Zakarian, A.; Rushton, G.J. Development of modular electrical systems. IEEE/ASME Trans. Mechatron. 2001, 6, 507–520.
10. Zimmerman, K.; Haas, D. Impacts of single event upsets on protective relays. In Proceedings of the 72nd Annual Conference for Protective Relay Engineers, College Station, TX, USA, 25–28 March 2019; pp. 25–28.
11. Haas, D.; Zimmerman, K. Single Event Upsets in SEL Relays. March 2018. Available online: https://selinc.com (accessed on 1 March 2023).
12. Zhou, H.; Yu, H.; Zou, Z.; Su, Z.; Zhao, Q.; Yang, W.; He, C. Evaluation of Single Event Upset on a Relay Protection Device. Electronics 2024, 13, 64.
13. Hands, A.; Morris, P.; Ryden, K.; Dyer, C.; Truscott, P.; Chugg, A.; Parker, S. Single event effects in power MOSFETs due to atmospheric and thermal neutrons. IEEE Trans. Nucl. Sci. 2011, 58, 2687–2694.
14. Infantino, A.; Alía, R.G.; Brugger, M. Monte Carlo evaluation of single event effects in a deep-submicron bulk technology: Comparison between atmospheric and accelerator environment. IEEE Trans. Nucl. Sci. 2016, 64, 596–604.
15. Chen, H.; Chen, Y.; Wang, F.; Liang, T.; Jia, X.; Ji, Q.; Hu, C.; He, W.; Yin, W.; He, K.; et al. Target station status of China Spallation Neutron Source. Neutron News 2018, 29, 2–6.
16. Tang, J.; An, Q.; Bai, J.; Bao, J.; Bao, Y.; Cao, P.; Chen, H.L.; Chen, Q.P.; Chen, Y.H.; Chen, Z.; et al. Back-n white neutron source at CSNS and its applications. Nucl. Sci. Tech. 2021, 32, 11.
17. Yang, W.; Song, W.; Guo, Y.; Li, Y.; He, C.; Wu, L.; Wang, B.; Liu, H.; Shi, G. Enhancement of Deep Neural Network Recognition on MPSoC with Single Event Upset. Micromachines 2023, 14, 2215.
18. Yang, W.; Li, Y.; Li, Y.; Hu, Z.; Xie, F.; He, C.; Wang, S.; Zhou, B.; He, H.; Khan, W.; et al. Atmospheric neutron single event effect test on Xilinx 28 nm system on chip at CSNS-BL09. Microelectron. Reliab. 2019, 99, 119–124.
19. Du, X.; He, C.; Liu, S.; Zhang, Y.; Li, Y.; Yang, W. Measurement of single event effects induced by alpha particles in the Xilinx Zynq-7010 System-on-Chip. J. Nucl. Sci. Technol. 2017, 54, 287–292.
20. Agostinelli, S.; Allison, J.; Amako, K.A.; Apostolakis, J.; Araujo, H.; Arce, P.; Asai, M.; Axen, D.; Banerjee, S.; Barrand, G.J.; et al. GEANT4—A simulation toolkit. Nucl. Instrum. Methods Phys. Res. Sect. A Accel. Spectrometers Detect. Assoc. Equip. 2003, 506, 250–303.
21. Yang, W.; Li, Y.; Li, Y.; Hu, Z.; Cai, J.; He, C.; Wang, B.; Wu, L. Neutron Irradiation Testing and Monte Carlo Simulation of a Xilinx Zynq-7000 System on Chip. Electronics 2023, 12, 2057.
22. Yang, W.; Li, Y.; Zhang, W.; Guo, Y.; Zhao, H.; Wei, J.; Li, Y.; He, C.; Chen, K.; Guo, G.; et al. Electron inducing soft errors in 28 nm system-on-Chip. Radiat. Eff. Defects Solids 2020, 175, 745–754.
23. Girgis, A.A.; Ham, F.M. A New FFT-Based Digital Frequency Relay for Load Shedding. IEEE Trans. Power Appar. Syst. 1982, PAS-101, 433–439.
24. Hamming, R. Error correcting and error detecting codes. Bell Sys. Tech. J. 1950, 29, 147–160.
25. Volkanovski, A.; Čepin, M.; Mavko, B. Application of the fault tree analysis for assessment of power system reliability. Reliab. Eng. Syst. Saf. 2009, 94, 1116–1127.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).