FPGA-Based Reliable Fault Secure Design for Protection against Single and Multiple Soft Errors

Field programmable gate arrays (FPGAs) are increasingly used in industry (e.g., biomedical, space, and automotive industries). FPGAs are subjected to single, as well as multiple event upsets (SEUs and MEUs), due to the continuous shrinking of transistor dimensions. These upsets inevitably decrease system lifetime. Fault-tolerant techniques are often used to mitigate these problems. In this research, penta and hexa modular redundancy, as well as dynamic partial reconfiguration (DPR), are used to increase system reliability. We show, depending on the relative rates of the SEUs and MEUs, that penta modular redundancy has a higher reliability than hexa modular redundancy, which is a counter-intuitive result in some cases since increasing redundancy is expected to increase reliability. Focusing on penta modular redundancy, an error detection and recovery mechanism (voter) is designed. This mechanism uses the internal configuration access port (ICAP) and its associated controller, as well as DPR to mitigate SEUs and MEUs. Then, it is implemented on Xilinx Vivado tools targeting the Kintex7 7k410tfbg676 device. Finally, we show how to render this design fault secure in the event that SEUs or MEUs affect the voter itself. This fault secure voter either produces the correct output or gives an indication that the output is incorrect.


Introduction
The automotive industry is one of the largest global markets. Vehicles contain many electronic systems to meet driver requirements. They combine engine control units (ECUs) with other systems to control many functions such as airbags, tire pressure, the antilock braking system, infotainment, body electronics, CAN/LIN bus controllers, lighting systems, and advanced driver assistant systems (ADAS) [1]. Embedded processors are well suited for many of these systems. Because field programmable gate arrays (FPGAs) have high speed and large logic densities, they facilitate the development of vehicle infotainment and communication [2]. FPGAs are the best candidates for some applications such as automotive, space, and biomedical applications [1][2][3].
FPGAs provide the capability of being reprogrammed to execute different functions. They have high performance via implementing an algorithm in an efficient way. They simplify the task of adding new features to the product as compared with ASIC-based embedded systems.
There are different families of FPGAs, i.e., SRAM, Flash/EPROM, and Antifuse [4]. The majority of FPGA families are SRAM-based. SRAM-based FPGAs are susceptible to atmospheric radiation, and many fault-tolerant techniques have been proposed to recover from FPGA upsets. Some of these techniques can be used to mitigate SEUs, such as scrubbing and TMR. However, penta and hexa redundancy techniques can be used to mitigate both SEUs and MEUs (DEUs).
Scrubbing is the most commonly used technique for mitigating upsets in FPGAs. The major problem with scrubbing is the detection time; there is a certain amount of time between the occurrence of an error in the configuration memory and its discovery and repair. This time interval can be on the order of milliseconds before the correct state of the configuration memory (CRAM) bits is restored [8]. The system may produce an erroneous result during this time interval, which is not acceptable in safety critical modules in automotive systems; therefore, scrubbing cannot be used to recover from SEUs or MEUs.
Triple modular redundancy (TMR) is also a potential solution to mitigate upsets in FPGA. The targeted module (M) is triplicated, and a majority voter circuit produces the correct output. Dynamic partial reconfiguration (DPR) can be used to recover the failed module. TMR produces an incorrect output if the majority voter circuit fails, i.e., the detection time interval is not a problem with TMR. Hence, TMR can only detect and correct a single module failure at a time. The problem with TMR is the occurrence of non-adjacent double event upsets (DEUs) affecting two modules, which produces a system failure. TMR solves DEU problems if they occur in the same module but produces an erroneous output when two memory cells of two modules are simultaneously affected [10,12].
The shortcoming of TMR can be solved using penta modular redundancy (5MR), which can mitigate SEUs, as well as DEUs [13,14] and follows the same methodology as that of TMR. Five identical copies of the module (M) are connected to an error detection and recovery mechanism. The failed modules can be recovered using DPR if the problem is transient. 5MR can tolerate 3 successive single failures in 3 different modules (Ms) or a double failure in two Ms followed by a single failure in a third M but the sequence of a single module failure followed by a double failure cannot be mitigated by 5MR.
The hexa modular redundancy (6MR) fault-tolerant technique [7,10] can detect this specific sequence, as well as additional failure sequences. 6MR follows the same methodology as TMR and 5MR. Six identical copies of the module (M) are connected to an error detection and recovery mechanism. 6MR can tolerate 4 module failures as follows: (1) 4 successive single failures in 4 different Ms, (2) a single failure followed by a double failure followed by another single failure, and (3) a double failure followed by 2 successive single failures. These failure sequences of 6MR are explained in Figure 1. The failed modules can be recovered using DPR.
In [7], it was proven that 6MR was more reliable than 5MR, with λs (SEU rate) 10 times higher than λd (DEU rate), based on the data in [3,15]. The reliability was calculated by simulating Markov models for 5MR and 6MR using SHARPE [16]. Five identical copies of a 32-bit embedded MicroBlaze soft core processor (widely used in automotive electronic systems) were implemented for the 5MR system and six identical copies for the 6MR system. The effect of different ratios between λs and λd on reliability is studied in the next section.

Effect of Failure Rates Ratio on Reliability
In this subsection, the reliability was studied for the 5MR and 6MR fault tolerance techniques using different ratios between λs and λd as the ratio may be 10 or less [17,18]. Figures 2 and 3 show Markov models for 5MR and 6MR [10,19].

As mentioned in this section, for the 5MR system, five identical copies of the module (M) are connected to a voter (an error detection and recovery mechanism). 5MR can tolerate 3 successive single failures in 3 different Ms or a double failure in two Ms followed by a single failure in a third M. DPR is used to recover the failed modules [7]. For the Markov model in Figure 2, the state is named according to the number of operating modules in this state and "F" is the failure state. Let λs be the rate of SEUs affecting any module M and λd be the rate of DEUs affecting two modules.
The model has different repair rates (µ1, µ2, µ3, and µ4) because the repair time depends on the amount of time taken to complete the DPR and this time depends on the size of the bit file (=1/µ1). Therefore, in state "5", only one M has to be reconfigured while, in state "4", two Ms have to be reconfigured (µ2 = 0.5 µ1 and µ3 = 0.333 µ1) [7]. Failures and repair times are both assumed to be exponentially distributed; hence, the repair rate µ and the failure rate λ are constant [10].
Reliability, $R(t)$, is the probability that a system is alive at time $t$ given that it was alive at $t = 0$ [10]. Reliability is the probability of NOT being in state "F" at any time $t$ as follows:

$$R(t) = 1 - P_F(t) \qquad (1)$$

where $R(t)$ is the system reliability and $P_F(t)$ is the probability of the system being in state F (failure state). The 5MR Markov model can be solved using the Chapman-Kolmogorov equations [10]. Let $P_i(t)$ be the probability of residing in state $i$ at time $t$, let $P$ be the row vector of these state probabilities, and let $T$ be the transition rate matrix of the model. Then,

$$\frac{dP}{dt} = P \times T \qquad (4)$$

Therefore, assuming that $P_5(0) = 1$, $P_4(0) = P_3(0) = P_2(0) = P_F(0) = 0$, and using the transition rate matrix in Equation (3) and substituting in Equation (2), the Chapman-Kolmogorov equations can be solved in order to obtain $P_i(t)\ \forall i \in \{F, 2, 3, 4, 5\}$. Then, Equation (1) is used to obtain $R(t)$.
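The solution procedure can be illustrated with a minimal numerical sketch that integrates dP/dt = P × T with a fourth-order Runge-Kutta scheme and then applies R(t) = 1 − P_F(t). The transition structure in `build_generator` is an assumption inferred from the prose (per-module and per-pair multipliers, failure on a double upset once fewer than five modules operate, repair back to state "5"); it is not a transcription of Figure 2, and the authors used SHARPE rather than this code. All function names are hypothetical.

```python
# Sketch only: the rate structure below is ASSUMED from the text, not taken from Figure 2.

STATES = ["5", "4", "3", "2", "F"]  # number of operating modules, plus the failure state


def build_generator(lam_s, lam_d, mu1):
    """Return an assumed 5x5 transition rate matrix T whose rows sum to zero."""
    idx = {name: i for i, name in enumerate(STATES)}
    rates = {
        ("5", "4"): 5 * lam_s,          # any one of five modules upset (assumed factor)
        ("5", "3"): 10 * lam_d,         # any one of C(5,2) module pairs upset (assumed)
        ("4", "3"): 4 * lam_s,
        ("4", "F"): 6 * lam_d,          # a single followed by a double is not tolerated
        ("3", "2"): 3 * lam_s,
        ("3", "F"): 3 * lam_d,
        ("2", "F"): 2 * lam_s + lam_d,  # any further upset is fatal (assumed)
        ("4", "5"): mu1,                # one bit file to download
        ("3", "5"): mu1 / 2,            # two bit files
        ("2", "5"): mu1 / 3,            # three bit files
    }
    n = len(STATES)
    T = [[0.0] * n for _ in range(n)]
    for (src, dst), rate in rates.items():
        T[idx[src]][idx[dst]] = rate
    for i in range(n):
        T[i][i] = -sum(T[i])            # diagonal makes each row sum to zero
    return T


def derivative(p, T):
    """Row vector times matrix: (P x T)_j = sum_i P_i * T_ij."""
    n = len(p)
    return [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]


def rk4_step(p, T, dt):
    """Advance the state probabilities by one classical Runge-Kutta step."""
    k1 = derivative(p, T)
    k2 = derivative([x + 0.5 * dt * k for x, k in zip(p, k1)], T)
    k3 = derivative([x + 0.5 * dt * k for x, k in zip(p, k2)], T)
    k4 = derivative([x + dt * k for x, k in zip(p, k3)], T)
    return [x + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)]


def reliability(t_end, lam_s=0.012, ratio=3.0, mu1=144.0, dt=0.005):
    """Return (R(t_end), P(t_end)) starting from P_5(0) = 1; times are in hours."""
    T = build_generator(lam_s, lam_s / ratio, mu1)
    p = [1.0, 0.0, 0.0, 0.0, 0.0]
    for _ in range(int(round(t_end / dt))):
        p = rk4_step(p, T, dt)
    return 1.0 - p[-1], p
```

Calling `reliability(10.0, ratio=3.0)` and `reliability(10.0, ratio=5.0)` gives a quick way to see how the assumed model reacts to the λs/λd ratio. The step size is kept well below 1/µ1 so that the explicit integrator remains stable.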
6MR follows the same methodology as 5MR. Six identical copies of the module (M) are connected to a voter (an error detection and recovery mechanism). 6MR can tolerate 4 module failures as follows: (1) 4 successive single failures in 4 different Ms, (2) a single failure followed by a double failure followed by another single failure, (3) a double failure followed by 2 successive single failures. The Markov model in Figure 3 can be explained and solved using the same reasoning as the 5MR model in Figure 2 [7].
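The tolerated failure sequences listed for 5MR and 6MR can be summarised by a small checker. The rule it encodes (a single failure is tolerated while at least three modules operate, a double failure only while at least five operate, and no repair happens between events) is an inference that reproduces the sequences stated in the text; it is not a rule given by the authors, and `tolerates` is a hypothetical name.

```python
def tolerates(n_modules, events):
    """Check whether a sequence of failures is tolerated without intermediate repair.

    events is a list whose entries are 1 (single module failure) or
    2 (two modules failing simultaneously). The thresholds are inferred
    from the sequences listed in the text, not taken from the authors.
    """
    working = n_modules
    for size in events:
        if size == 1 and working >= 3:
            working -= 1          # remaining modules still out-vote the failed one
        elif size == 2 and working >= 5:
            working -= 2          # at least three correct modules remain
        else:
            return False          # the voter can no longer form a correct majority
    return True


# Sequences stated in the text
print(tolerates(5, [1, 1, 1]))     # three successive single failures
print(tolerates(5, [2, 1]))        # double then single
print(tolerates(5, [1, 2]))        # single then double: not mitigated by 5MR
print(tolerates(6, [1, 2, 1]))     # tolerated by 6MR
```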

Penta Modular Redundancy (5MR) Implementation
In this subsection, we discuss the penta modular redundancy (5MR) voter implementation. The 5MR voter is implemented in addition to its modules (Mi) on the same FPGA, as shown in Figure 4.

Proposed Design
The 5MR reconfiguration scheme is shown in Figure 5. Next is a very brief description of the functionality of each of the three blocks V, W, and MAJ. Block V compares each of the module outputs to the voter output using an XOR gate, as shown in Figure 6. When a module fails, its XOR gate produces a logic 1, which is then stored in a D flip-flop (with output qi) to indicate that this module has to be removed from the voting process. The outputs of the 5 XOR gates (DPR1 through DPR5) are also sent to another module, as will be explained later. Each D flip-flop has a separate reset signal ri.
Block W has two sets of inputs, i.e., the outputs of the 5 modules (M1 through M5) and the outputs of the 5 flip-flops (q1 through q5) in Block V. Block W consists of five sub-blocks; one of these sub-blocks is shown in Figure 7. In the error-free case, Block W produces the outputs of the five modules, i.e., Mi = hi in Figure 5. When one module fails, its corresponding h output becomes a logic 0 while the output h of another module becomes a logic 1. Because Block MAJ is a 5-input majority voter and two of its inputs are then constant at 0 and 1, it functions as a 3-input majority voter (as in triple modular redundancy (TMR)). If two modules fail simultaneously, one of their corresponding h outputs becomes a logic 0, while the other becomes a logic 1. Again, Block MAJ becomes a 3-input majority voter. All D flip-flops are reset at the beginning of operation and share the same clock. If DPRi = 1, then qi is set (qi = 1) when the clock pulse occurs.
If all modules are fault free, then qi = 0 for all D flip-flops. It is worth noting that the design of Block W is identical to the one in [20].
Since the fault model in this research includes SEUs and MEUs (transient faults), it is important to use DPR to attempt to repair a failed module instead of removing it from the voting process, especially since the frequency of occurrence of transient faults is much higher than that of permanent faults [21].
Block V is designed to consider DPR, as shown in Figure 6. The outputs of the XOR gates (DPR1 through DPR5) are used by another module, described below. DPRi = 1 if Mi has a fault (Mi has a different value from the system output F).
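Two of the behaviours described for the voter can be checked with a short behavioural sketch: the XOR comparison that produces the DPRi flags, and the claim that a 5-input majority voter with two inputs held at 0 and 1 acts as a 3-input majority voter. This is a software model under simplifying assumptions (single-bit module outputs, Block W treated as a pass-through, hypothetical function names); it does not reproduce the gate-level design of Figures 5 to 7.

```python
from itertools import product


def majority(bits):
    """Majority vote over an odd number of single-bit inputs."""
    return 1 if sum(bits) > len(bits) // 2 else 0


def dpr_flags(module_outputs, voter_output):
    """Block V comparison: flag i is 1 when module i disagrees with the voter output."""
    return [m ^ voter_output for m in module_outputs]


def reduces_to_three_input_voter():
    """Exhaustively check that MAJ5(a, b, c, 0, 1) equals MAJ3(a, b, c)."""
    return all(
        majority([a, b, c, 0, 1]) == majority([a, b, c])
        for a, b, c in product((0, 1), repeat=3)
    )


outputs = [1, 1, 0, 1, 1]          # module 3 is faulty in this example
f = majority(outputs)              # system output F
print(f, dpr_flags(outputs, f))    # prints: 1 [0, 0, 1, 0, 0]
print(reduces_to_three_input_voter())  # prints: True
```

The exhaustive check covers all eight combinations of the three free inputs, which is why the constant 0 and 1 inputs can be ignored once a failed module has been masked.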
Another important part of the voter design is the use of the internal configuration access port (ICAP) and its associated controller [22]; ICAP has direct access to the FPGA configuration memory. ICAP is a primitive used for DPR. Figure 8 shows how the ICAP interacts with the voter to take DPR into account. As soon as DPRi = 1, the D flip-flop associated with module Mi is set and Mi is not considered in the voting process. While the system is still operating with the remaining four fault-free modules, the ICAP performs DPR on Mi and monitors DPRi. If DPRi returns to "0", the fault must have been transient and the ICAP resets the D flip-flop associated with Mi (using the separate reset signal ri, as mentioned above). Each module has an ICAP status bit (status flip-flop i) associated with it (see Figure 8) and controlled by the ICAP module. Let this status bit be called DPRDonei. This status bit is initially reset, indicating that its associated module is fault free or that DPR was performed on this module and was able to recover from the transient fault. If DPR does not succeed in repairing Mi, the ICAP writes a "1" in the corresponding status bit. Consequently, Mi is considered to have an irrecoverable fault and is permanently removed from the list of modules on which the ICAP can still perform DPR. The enable signal (Enablei) is equal to "1" if the ith module failed (DPRi = 1) and (DPRDonei = 0).
Remember that the signal "DPRDonei = 0" is an indication that DPR has already been applied to Mi and that it was successful, or the module Mi has never suffered from any upsets. DPRDonei is connected to the reset signal corresponding to this module (ri).
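The recovery behaviour described above can be paraphrased as a small behavioural model. It is a sketch, not the ICAP controller itself: `ModuleState`, `enable`, and `service` are hypothetical names, and the argument `dpr_after_reconfig` stands in for the value of DPRi that the hardware would observe after the ICAP has rewritten the module.

```python
from dataclasses import dataclass


@dataclass
class ModuleState:
    q: int = 0          # Block V flip-flop: 1 means the module is excluded from voting
    dpr_done: int = 0   # ICAP status bit: 1 means DPR failed (irrecoverable fault)


def enable(dpr, state):
    """Enable_i = 1 when the module disagrees (DPR_i = 1) and DPRDone_i = 0."""
    return int(dpr == 1 and state.dpr_done == 0)


def service(state, dpr, dpr_after_reconfig):
    """One detection and recovery round for a single module."""
    if dpr == 1:
        state.q = 1                       # remove the module from the vote
    if enable(dpr, state):
        if dpr_after_reconfig == 0:
            state.q = 0                   # transient fault repaired: flip-flop reset via r_i
        else:
            state.dpr_done = 1            # DPR did not help: mark as irrecoverable
    return state


transient = service(ModuleState(), dpr=1, dpr_after_reconfig=0)
permanent = service(ModuleState(), dpr=1, dpr_after_reconfig=1)
print(transient)   # module is back in the vote
print(permanent)   # module stays excluded and is no longer reconfigured
```

Once `dpr_done` is set, `enable` returns 0 for that module, which mirrors the statement that the module is removed from the list on which the ICAP can still perform DPR.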

Fault Secure Voter Design
The area occupied by the voting circuitry is much smaller than that occupied by the five modules. Hence, the reliability of the voter is expected to be much higher than that of any of the five modules. However, it may still be affected by SEUs or DEUs. This will cause the voter to produce an incorrect output and the rest of the system will have no indication that the signal received is incorrect. Next, we show how to solve this problem using the concept of fault secure circuits [9,10].

A circuit is fault secure if, for every fault from a prescribed set, the circuit never produces an incorrect codeword output for codeword inputs [9]. Let "A" be the set of all input vectors applied to the voter system. There are 32 (= 2^5) input vectors in this set (as long as at least two of the five identical modules are functioning). Let "a" be any vector in "A".
The fault set considered for the voter is the same as that considered for the five modules, namely SEUs and DEUs. Therefore, let "B" be the set of all faults considered in this research, namely SEUs and DEUs in the voter system and "b" any member of this set. In addition, let ϕ be the null fault, i.e., the circuit is fault free.
The voter is triplicated, as shown in Figure 9. Its output consists of a three-bit vector. Let OUT(a,ϕ) be the three-bit vector produced by the three voters in the fault free situation, OUT(a,b) the output of the three voters in the presence of any fault b (b ∈ B), and C the set of valid codeword outputs. Only two of the eight possible combinations constitute valid codeword outputs, i.e., 000 and 111. The reason for triplicating the voter circuitry instead of just duplicating it is the effect of DEUs, even if the three copies of the voter are not physically adjacent. Floor planning of the design voters can be controlled during the design implementation on FPGA via user constraints [6]. Therefore, designed voters can be spatially mapped away from each other to mitigate adjacent MEUs but this feature cannot guarantee the mitigation of non-adjacent MEUs. Two of the three copies may be affected by one DEU. If the voter had been duplicated instead of triplicated, there would not have been any indication that the output was incorrect; a correct 00 codeword could be changed by the DEU to another correct codeword, namely 11. The three-voter system is fault secure as follows:

$$OUT(a,\phi) \in C, \qquad OUT(a,b) \in C \;\rightarrow\; OUT(a,b) = OUT(a,\phi)$$

The disadvantage of the proposed scheme is that any module receiving the output of the voter has to "decode" this voter output.
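The decoding step required of a downstream module can be sketched as follows. The function name and the `(value, valid)` return convention are illustrative choices rather than part of the published design; the only facts taken from the text are that 000 and 111 are the valid codewords and that any other pattern must be treated as an indication of an incorrect output.

```python
def decode_voter_output(bits):
    """Decode the three-bit output of the triplicated voter.

    Returns (value, valid). value is 0 or 1 for the codewords 000 and 111;
    for any non-codeword the result is (None, False), signalling that the
    voter output must not be used.
    """
    if len(bits) != 3 or any(b not in (0, 1) for b in bits):
        raise ValueError("expected three single-bit values")
    if tuple(bits) == (0, 0, 0):
        return 0, True
    if tuple(bits) == (1, 1, 1):
        return 1, True
    return None, False


print(decode_voter_output((1, 1, 1)))  # prints: (1, True)
print(decode_voter_output((1, 1, 0)))  # prints: (None, False)
```

With a correct output of 000, a DEU that corrupts two of the three copies yields a pattern such as 110, which is flagged as invalid, whereas a duplicated voter would silently move from 00 to 11.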

Reliability Results
The presented Markov models for 5MR and 6MR are simulated using SHARPE [16] to calculate the reliability. Machine learning and deep learning have many applications in the automotive field, inside and outside the vehicle [23]. The module (Mi) used in this research is the Alexnet accelerator (an important module in machine learning and neural networks [24]). Let the module SEU rate be λs = 0.012/h, based on the data in [24,25]. According to the data in [18], λs is approximately twice λd (DEU rate). Remember that in [7], the ratio used was λs = 10 λd; this ratio was based on the data in [3,15]. Obviously, this ratio can differ based on the environment in which the system is operating. Therefore, several ratios are simulated next (λs = 3 λd and λs = 5 λd) to investigate the effect of the ratio on system reliability. In addition, let the repair rate of one module M, µ1, be around 144/h (where 1/µ1 is the average time to download the one bit file corresponding to the Alexnet accelerator neural network [24,26]), and let µ2 = 1/2 µ1, since 1/µ2 is the time needed for downloading 2 bit files. Similarly, µ3 = 1/3 µ1 and µ4 = 1/4 µ1, depending on the time needed to download 3 and 4 bit files, respectively. The reliability results from SHARPE for 5MR and 6MR are shown in Tables 1 and 2.
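As a worked check of the parameter values quoted above, the repair rates translate into mean reconfiguration times and the simulated ratios into DEU rates as follows. Only arithmetic on the stated numbers is used; nothing here comes from Tables 1 and 2.

```latex
\begin{align*}
\mu_1 &= 144\ \mathrm{h^{-1}} &&\Rightarrow\quad 1/\mu_1 = 3600/144\ \mathrm{s} = 25\ \mathrm{s} \\
\mu_2 &= \mu_1/2 = 72\ \mathrm{h^{-1}} &&\Rightarrow\quad 1/\mu_2 = 50\ \mathrm{s} \\
\mu_3 &= \mu_1/3 = 48\ \mathrm{h^{-1}} &&\Rightarrow\quad 1/\mu_3 = 75\ \mathrm{s} \\
\mu_4 &= \mu_1/4 = 36\ \mathrm{h^{-1}} &&\Rightarrow\quad 1/\mu_4 = 100\ \mathrm{s} \\
\lambda_s &= 3\lambda_d &&\Rightarrow\quad \lambda_d = 0.012/3 = 0.004\ \mathrm{h^{-1}} \\
\lambda_s &= 5\lambda_d &&\Rightarrow\quad \lambda_d = 0.012/5 = 0.0024\ \mathrm{h^{-1}}
\end{align*}
```

Repair is therefore several orders of magnitude faster than the upset rates, which is why the ratio between λs and λd, rather than µ, drives the comparison between 5MR and 6MR.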

Implementation Results
The proposed design is verified and simulated using ModelSim, as shown in Figure 10.

The proposed voter design was implemented using Xilinx Vivado tools targeting the Kintex7 7k410tfbg676 device. Table 3 shows the resource utilization for the proposed design.

Table 3. Implementation results of the proposed design.
Resource       Proposed Design
Slice LUTs     21
Slice FFs      11

Discussion
As shown in Tables 1 and 2, 6MR is more reliable than 5MR in the case λs = 5 λd, but 5MR is more reliable than 6MR in the case λs = 3 λd, even though 5MR has a lower cost than 6MR (since it uses only five copies of the module M and not six). This result is counter-intuitive, because reliability is expected to increase with redundancy. Several simulations were then run on the model in Figure 3 using different transition rates from State 6 to State 4 (15 λd, 14 λd, 13 λd, and 12 λd). They show that the reason for this counter-intuitive result is mainly the "15 λd" transition from State 6 to State 4 in Figure 3. If λd is as large as one third of λs, this rate is relatively large.
This result can be considered another benefit of using FPGAs, as they can be reconfigured with DPR based on the ratio between λs and λd. The failure rate (λ) is based on flux [15], which varies with the environment as it depends on several factors such as geographical location (latitude and longitude), altitude, and barometric pressure [27]. Therefore, depending on the environment where the system is expected to operate, the appropriate fault-tolerant architecture (5MR or 6MR) can be downloaded onto the FPGA using DPR.
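To make the size of the State 6 to State 4 transition concrete, the sketch below evaluates 15 λd for the ratios discussed in the paper, using λs = 0.012/h. Reading the factor 15 as the number of distinct module pairs among six modules, C(6,2), is an interpretation that the source does not state explicitly; the function name is hypothetical.

```python
from math import comb


def rate_6_to_4(lam_s, ratio, n_modules=6):
    """Rate of the double-upset transition, assuming one term per module pair."""
    lam_d = lam_s / ratio
    return comb(n_modules, 2) * lam_d   # comb(6, 2) == 15


lam_s = 0.012  # SEU rate per hour, as quoted in the Reliability Results section
for ratio in (3, 5, 10):
    rate = rate_6_to_4(lam_s, ratio)
    print(f"ratio {ratio}: 15*lambda_d = {rate:.4f}/h, i.e. {rate / lam_s:.1f} x lambda_s")
```

At a ratio of 3 the transition rate is five times λs, while at a ratio of 10 it is only 1.5 times λs, which is consistent with the observation that 6MR loses its advantage when DEUs are relatively frequent.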

Conclusions
Nowadays, FPGAs are used in a lot of applications such as automotive, space, and biomedical applications. SRAM-based FPGAs are susceptible to radiation-induced SEUs and MEUs. Several fault-tolerant techniques have been studied in previous works within the context of FPGA, namely scrubbing, TMR, as well as penta and hexa modular redundancy (5MR and 6MR). Scrubbing and TMR have problems with detecting MEUs.
The first contribution of this paper is a counter-intuitive result in some cases. We have shown that a 5MR fault-tolerant architecture can be more reliable than a 6MR architecture, because of the relative rate of DEUs as compared with that of SEUs. If the rate of SEUs is, at most, four times higher than that of DEUs, the 5MR architecture becomes more reliable (and has a lower cost) than the 6MR architecture. However, if the rate of SEUs is more than four times higher than that of the DEUs, the 6MR becomes more reliable than the 5MR, as reported in the literature. Markov models and the SHARPE software package were used in the analysis.
The second contribution of this paper concerns the 5MR architecture. Recovery circuits for this architecture can be found in the literature, but they do not take into account the occurrence of SEUs/DEUs and the ability of DPR to recover from these upsets. A proposed error detection and recovery mechanism (voter) is designed taking into consideration transient faults and DPR. It is implemented using the Xilinx Vivado tools targeting the Kintex7 7k410tfbg676 device, and testing shows that it performs correctly.
The third contribution of this paper is related to the concept of fault security. If the recovery mechanism suffers from any type of failure, it must alert the rest of the system in order not to use this incorrect output. Therefore, the developed design is further modified to produce a coded output.
As the technology evolves and the transistor size shrinks, the effect of multiple event upsets is expected to be more and more important, especially in harsh environments (e.g., space applications and nuclear plants). Therefore, future work should focus on fault-tolerant architectures other than 5MR that can withstand MEUs as well as SEUs. The ratio of MEUs to SEUs is expected to play an important role in determining the efficiency of these new architectures.