Revisiting Symptom-Based Fault Tolerant Techniques against Soft Errors

Aggressive technology scaling and near-threshold computing have made soft error reliability one of the leading design considerations in modern embedded microprocessors. Although traditional hardware/software redundancy-based schemes can provide a high level of protection, they incur significant overheads in terms of performance and hardware resources. The considerable overheads of such full redundancy-based techniques have motivated researchers to propose low-cost soft error protection schemes, such as symptom-based error protection schemes. The main idea behind a symptom-based error protection scheme is that soft errors in the system will quickly generate some symptoms, such as exceptions, branch mispredictions, cache or TLB misses, or unpredictable variable values. Therefore, monitoring such infrequent symptoms makes it possible to cover the manifestation of failures caused by soft errors. Symptom-based protection schemes have been suggested as shortcuts to achieve acceptable reliability with comparable overheads. Since the symptom-based protection schemes seem attractive due to their generality and simplicity, even state-of-the-art protection schemes exploit them as the baseline protections. However, our detailed analysis of the fault coverage and performance overheads of such schemes reveals that the user-visible failure coverage, particularly of ReStore, is limited (29% on average). By contrast, the runtime overheads are significant (40% on average) because the majority of the fault injection experiments, which were considered as detected/recovered failures by low-level symptoms, are actually benign faults owing to program-level masking effects.


Introduction
Soft errors or transient faults are considered primary sources of unreliability in modern processors. Soft errors caused by external sources, such as high-energy neutrons and alpha particles, or internal events, such as noise in the power voltage, can alter the state of a transistor or change the logical value stored in a memory element of the microprocessor. Despite the existence of several masking effects, ranging from circuit-level [1] to software-level [2,3], some soft errors might not be masked, causing system failures. Traditionally, soft errors have been problematic for high-altitude applications, such as airplanes [4] and spacecraft [5,6]. However, even ground-level applications have experienced soft error-induced failures owing to sub-nano transistor scaling and near-threshold computing. The International Technology Roadmap for Semiconductors (ITRS) [7] predicted that even schemes, the accurate way is beam testing to generate radiation-induced soft errors [27]. However, beam testing is time-consuming and challenging to set up correctly. Therefore, fault injection, which mimics transient faults, can be an alternative for estimating the fault coverage. As Schirmeier et al. [28] revealed, the error coverage metric usually misrepresents the soft error protection and creates the illusion that the fault-tolerant scheme has improved the system reliability when, in fact, it might not have. To quantify the importance of using the wrong error coverage metric, we calculate the effectiveness of symptom-based protection schemes in the same way as in the original study. In this case, the precise evaluation showed only a 30% failure rate reduction.

Background and Motivation
Both academia and industry have been dealing with soft error problems for more than half a century. The first widespread indication of soft errors was reported in 1970 when it was demonstrated that high-energy neutrons from cosmic radiation could cause soft errors in memory and combinational circuits [29]. Although advanced FinFET technology has been shown to reduce soft error rates against alpha particles [30], the soft error rate trends for FinFET technology still increase because of proton and muon-induced single-event upsets [31].

Protection Schemes against Soft Errors
In order to protect systems against soft errors, diverse hardware-based approaches have been proposed. One of the most straightforward techniques is hardening, which makes hardware resist the damage or system malfunction caused by ionizing radiation. The amount of radiation can be affected by altitude, nuclear energy, and cosmic rays [32]. However, it is impossible to protect systems perfectly by hardware hardening, e.g., neutron-induced soft errors can pass through many meters of concrete [33]. Moreover, hardware hardening techniques also induce severe overheads in terms of area and power consumption. In order to mitigate these overheads, optimized hardware-based protection has been proposed. For memory systems, information redundancy techniques, such as error detection codes (e.g., parity) and error correction codes (e.g., Hamming code), have been proposed [34]. On the other hand, modular redundancy (e.g., dual or triple modular redundancy) has been presented for non-memory systems [35].
Many software-based techniques for protecting computations from soft errors have been proposed. Most of the proposed schemes are redundancy-based solutions, in which the computations are executed redundantly in space or time, and the presence of an error is determined from any discrepancy in the results. It has been shown that the hardware implementation of these simple fault-tolerant strategies is extremely effective, and such techniques have been used in many safety-critical applications [36], such as air traffic control systems [4] and NASA spacecraft [5,6]. However, because the cost of such hardware redundancy-based schemes is considerably high, they cannot be used in many cost-sensitive embedded applications. Software-level redundancy-based schemes shift the cost of using redundant hardware to the software and trade program execution time against extra hardware. Through both radiation-based testing and extensive fault injection campaigns, it has been shown that software-level redundancy-based schemes can achieve a high degree of error coverage [37,38].
However, the main drawback of hardware/software redundancy-based schemes is that they impose a considerably large overhead on the system. For instance, software redundancy-based schemes such as EDDI [39] and nZDC [38] cause a considerable performance overhead (more than double) to the system by applying temporal redundancy to low-level instructions. This considerably high overhead of redundancy-based error-tolerant schemes has motivated researchers to develop low-cost error-tolerant approaches. Existing low-cost error-tolerant schemes can be divided into four main categories: (i) partial-redundancy schemes [40,41], (ii) control-flow checking schemes [42], (iii) vulnerability-reduction schemes [43], and (iv) symptom-based fault-tolerant schemes.
Partial-redundancy schemes mitigate the performance overhead of full-redundancy schemes by protecting only the more important data [40] or duplicating only the more critical instructions [41]. However, they have limited fault coverage because they only protect a subset of the system. Although control-flow checking schemes [42] have been considered reliable protection techniques for detecting control-flow violations, a quantitative analysis [44] showed that control-flow checking is ineffective and unreliable against soft errors. Vulnerability-reduction schemes [43] use a reliability-driven instruction scheduler at the compiler level to improve the hardware reliability against soft errors. However, their fault coverage is limited because they are not system-level schemes. This study focuses on symptom-based error-tolerant schemes because the fault coverage and effectiveness of symptom-based techniques have yet to be elucidated through a comprehensive analysis.

Symptom-Based Fault Tolerant Schemes
The key idea of symptom-based fault-tolerant schemes is that the manifestation of errors can be detected by monitoring some rare execution phenomena. In general, the existing symptom-based fault-tolerant schemes can be divided into three categories, as described in Table 1: (i) low-level hardware symptoms for error detection: techniques that monitor low-level hardware symptoms, (ii) OS-level symptoms for error detection: schemes that utilize OS-level abnormalities, and (iii) application-level symptoms for error detection: techniques that explore application-level features. For error recovery, all such schemes rely on the existence of some type of regular checkpoint support. They can then simply roll back to the last checkpoint and re-execute the instructions.
Application-level symptoms for error detection [11,23]: soft and hard errors; range-based invariant; offline profiling; application-level modification.
Table 1 summarizes the main features of these techniques. The first symptom-based fault-tolerant scheme is called ReStore [8,21], whose authors propose that soft error detection is easily achievable by monitoring low-level hardware symptoms. Figure 1 depicts a conceptual diagram of a ReStore-protected microprocessor. In such an architecture, symptoms such as branch mispredictions, ISA-defined exceptions, and cache misses can be detected during the execution. If one of these symptoms is detected, the processor re-executes the last N instructions. The main argument of the research into ReStore is that errors leading to failures usually generate symptoms "noisily" and "quickly". Noisily means that if there is a failure-inducing soft error in a system, it causes abnormal behaviors in the program execution. For instance, a branch misprediction will occur because errors usually alter the control flow of the program.
Alternatively, an exception such as an unknown opcode, a divide-by-zero exception, or an illegal memory access may occur. TLB and cache misses can also be considered symptoms because an incorrect data flow can affect the spatial locality. Finally, as a key part of the argument, soft errors usually generate symptoms almost immediately (within 100 instructions) after their occurrence. Therefore, assuming that the hardware can provide an effective checkpointing scheme, simply rolling back to the last checkpoint and resuming the execution from there can eliminate the effect of soft errors from the system. Furthermore, the authors of ReStore have argued that because the hardware structures required for creating low-level checkpointing/rollback mechanisms already exist in modern speculative processors to handle improper speculations, the ReStore technique can be applied to existing processors through a minimal hardware modification. The main drawback of the ReStore scheme is that the symptoms are often natural and appear even during the normal execution of the program. For instance, cache misses and branch mispredictions can occur even in a fault-free run of the programs. These natural symptoms can cause a false alarm, leading to an unnecessary rollback and a re-execution of the instructions. Figure 2 illustrates the execution traces of a program running on a normal and a ReStore architecture. Figure 2a shows the program execution trace for a simple system with one natural symptom and without any symptom-based protection scheme. Figure 2b illustrates how the execution trace of the program changes on a ReStore-protected machine. As shown, upon observing a symptom, the ReStore architecture re-executes the instructions from the recovery window. If the same symptom reappears, it assumes that the symptom is natural and keeps executing the program. Figure 2c demonstrates the ReStore execution in the presence of a soft error.
In this case, the soft error caused a symptom, and the symptom-generation latency was less than that of the re-execution window. Therefore, the ReStore architecture can roll back the execution of the program to a fault-free checkpoint, and it eliminates the effect of the error from the system through a re-execution. Because the symptoms will not be generated again, it is assumed that an error is detected and successfully recovered. However, if the error does not create a symptom or the symptom-generation latency of the error is larger than the re-execution window, the error will remain unrecoverable. Owing to its generality and simplicity, the ReStore strategy has also been used in hybrid approaches (symptom + redundancy), for example, Shoestring [45] and profiling-based soft error protection schemes [46].
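The rollback decision described above can be summarized in a few lines. The following is a minimal sketch rather than ReStore's actual hardware logic: the function name `restore_outcome`, its parameters, and the returned labels are hypothetical, and all positions are counted in committed instructions.

```python
from typing import Optional

WINDOW = 100  # re-execution window in instructions, as assumed in the text


def restore_outcome(symptom_at: int, fault_at: Optional[int] = None,
                    window: int = WINDOW) -> str:
    """Classify a symptom after rollback and re-execution (illustrative model).

    symptom_at: instruction index at which the symptom is observed.
    fault_at:   instruction index of the soft error, or None for a fault-free run.
    """
    checkpoint = symptom_at - window  # rollback target
    if fault_at is None or fault_at > symptom_at:
        # No fault caused this symptom: it reappears on replay and is
        # treated as natural (a false alarm that only costs run time).
        return "natural"
    if fault_at >= checkpoint:
        # The fault lies inside the replayed window, so the replay removes it
        # and the symptom does not reappear.
        return "recovered"
    # The fault predates the checkpoint: the symptom reappears on replay and
    # is mistaken for a natural one, so the error stays in the system.
    return "unrecovered"
```

Under these assumptions, `restore_outcome(250)` yields `"natural"`, `restore_outcome(250, fault_at=200)` yields `"recovered"`, and `restore_outcome(250, fault_at=100)` yields `"unrecovered"`, which mirrors the three situations discussed for Figure 2 and the latent-error case.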

Symptom-based fault tolerant techniques
OS-level symptom techniques [22] look at the high-level impact of symptoms, such as hardware traps, crashes, and high OS activity, and extend the fault coverage to hard errors. Rather than detecting low-level symptoms (i.e., exceptions and TLB/cache misses), OS-level symptom techniques postpone error detection until the impact of a fault causes atypical behaviors at the operating system level. Therefore, OS-level symptom techniques trade the high false alarm rate of low-level symptom-based error detectors for a detection latency that is at least five orders of magnitude longer. The high error detection latency implies that OS-level symptom techniques require a recovery/rollback strategy to re-execute the last tens of millions of instructions upon observing operating-system-level symptoms. In this case, the overheads of a false alarm are enormous because of the large number of re-executed instructions. Offline program profiling is used to determine the thresholds of the hang and high OS activity symptoms. SWAT techniques [23] improve upon OS-level symptom techniques by including likely program-invariant and value-range checking as application-level symptoms. The key idea is to extract the likely values of important program variables by offline profiling and to check whether the dynamically computed value is within the accepted range. The mSWAT framework [11] enhances the SWAT concept for error detection and recovery in multicore systems. The main drawback of OS-level and application-level symptom-detection-based schemes is the lack of generality. Many OS-level symptoms (e.g., hangs, high OS activity, and an acceptable range of variables) are extracted through offline program profiling. Such profiles may not hold under different conditions, i.e., changing the input data or using a different architecture may produce completely different results.
Moreover, to replay tens of millions of instructions, sophisticated checkpointing mechanisms that demand major hardware modifications are required.
To examine the fault coverage of application-level symptoms through preliminary experiments, we profiled the value changes of program-defined variables. Then, we heuristically picked the specific variables that can affect the program output. Lastly, we added high-level assertions to detect notable changes in the selected variables. For this simple set of experiments, we chose the susan (smoothing) benchmark. The benchmark smooths a multimedia image, and many variables are bounded by the size of the image in ordinary cases. For susan (smoothing), this approach can prevent 37% of failures with less than 10% performance overhead. However, the value ranges depend on the input: the same benchmark cannot detect any failures if we use another set of inputs. Further, it is challenging to select the target variables and their ranges at the profiling stage. Thus, we use hardware-detectable symptoms in this manuscript.
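The profile-then-assert procedure above can be illustrated with a small sketch. It is not the instrumentation used in the experiments: the helper names `profile_ranges` and `violates`, the variable name `x`, and the sample values are all hypothetical, and the profile is assumed to be a simple [min, max] interval per variable.

```python
from typing import Dict, Iterable, Tuple

Ranges = Dict[str, Tuple[int, int]]


def profile_ranges(runs: Iterable[Dict[str, int]]) -> Ranges:
    """Collect the [min, max] interval of each variable over profiling samples."""
    ranges: Ranges = {}
    for sample in runs:
        for name, value in sample.items():
            lo, hi = ranges.get(name, (value, value))
            ranges[name] = (min(lo, value), max(hi, value))
    return ranges


def violates(ranges: Ranges, name: str, value: int) -> bool:
    """Range assertion: True if the value leaves the profiled interval."""
    lo, hi = ranges[name]
    return not (lo <= value <= hi)


# Hypothetical profile: a column index bounded by a 64-pixel-wide image.
profiled = profile_ranges([{"x": 0}, {"x": 17}, {"x": 63}])
```

With this profile, a bit flip that turns `x = 5` into `x = 5 + 1024` is flagged by `violates(profiled, "x", 1029)`, whereas a flip that keeps `x` inside `[0, 63]` passes unnoticed. The interval is also only as representative as the profiling input, which is the input dependence noted above.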
Even state-of-the-art protection techniques rely on symptom-based techniques due to their simplicity. For example, Minotaur [14] and gem5-Approxilyzer [15] assumed that hardware faults could be detected without severe overheads if they generate observable symptoms. However, we have to answer the following two questions before exploiting symptom-based protections as the baseline guideline. First, how often do failure-inducing soft errors eventually generate symptoms? Second, are symptom-based approaches effective in terms of performance? In this paper, we analyze the performance effectiveness and reliability improvement of symptom-based protection schemes.

Reconsidering Low-Level Symptoms
OS- and application-level symptom-based techniques are limited because they are challenging to apply to general-purpose processors. On the other hand, low-level symptom-based schemes, mainly ReStore, are an attractive solution for embedded applications due to their high level of generality. However, research into symptom-based protection schemes has not provided any intuition, other than fault injection experimental results, that explains why a failure-inducing soft error will generate a symptom quickly after its occurrence. We notice the following weaknesses in the experimental results (the sole reason for bounding failures to symptoms), highlighting the need for a precise evaluation of such schemes.
Provably as a metric to demonstrate the effectiveness of the scheme. They injected several faults into microarchitectural components, and they counted the number of system failures due to faults. For instance, if 30 out of 100 injected faults induce system failures, the failure rate will be 30%. However, as a study [28] revealed, such metrics can significantly overestimate the protection capability of schemes, such as symptom-based protection schemes, that prolong the runtime of the program and require additional hardware. For example, consider an imaginary fault tolerance scheme that, rather than re-executing the last 100 instructions upon observing a symptom, re-executes the last 100 instructions at some random points of execution without reason. If we randomly inject the same number of faults into the original and imaginary fault-tolerant schemes, the percentage of failures will be reduced. This is because faults inserted close to a randomly selected re-execution point in the imaginary fault-tolerant scheme will become masked, and fewer faults can cause a failure. Therefore, the overall percentage of failures will appear to improve upon the baseline architecture, which is an illusion. We strongly encourage the reader to consult [28] for a more detailed explanation of the flaws in the traditional and widely used fault coverage metrics extracted from random fault injection experiments.
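A small numeric sketch makes the illusion concrete. It is a simplified model, not the analysis of [28]: it assumes that faults are injected uniformly over the executed instructions, that a fault landing in work later discarded by a rollback is always masked, and that every other fault fails with the same probability `fail_prob`. The function names and the 500-instruction program are hypothetical.

```python
def failure_percentage(n_instr: int, fail_prob: float,
                       n_replays: int = 0, window: int = 100) -> float:
    """Fraction of uniformly injected faults that end in a failure."""
    executed = n_instr + n_replays * window  # each replay lengthens the run
    discarded = n_replays * window           # first-pass work undone by rollback
    return fail_prob * (executed - discarded) / executed


def failure_exposure(n_instr: int, fail_prob: float,
                     n_replays: int = 0, window: int = 100) -> float:
    """Failure percentage weighted by run length, i.e., by the fault-space size."""
    executed = n_instr + n_replays * window
    return failure_percentage(n_instr, fail_prob, n_replays, window) * executed


baseline_pct = failure_percentage(500, 0.30)               # unprotected program
random_pct = failure_percentage(500, 0.30, n_replays=1)    # one pointless replay
baseline_exp = failure_exposure(500, 0.30)
random_exp = failure_exposure(500, 0.30, n_replays=1)
```

In this model the failure percentage drops from 30% to 25% after a single pointless replay, yet the exposure weighted by run length is identical in both cases, so the program is no more reliable than before.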
Interestingly, in [18], the authors estimate the coverage of ReStore by both a fault injection campaign and ACE analysis [2]. Because the coverage from the ACE analysis was significantly (around 10×) lower than the fault coverage extracted from the fault injection results, they improperly conclude that the latter is correct; however, from [28], we know that the fault injection results were misinterpreted.
Immature failure definition: Many terms and definitions in studies on ReStore are inconsistent with the widely accepted versions, which has caused frequent misinterpretations regarding the fault coverage capability of the ReStore architecture. For instance, ReStore defines silent data corruptions (SDCs) as cases in which the injected faults corrupt the architectural state of the program (register file and memory state). However, in most studies [12,38,39,45,47], an SDC is considered to be a case in which the error affects the final (user-visible) output of a program [48]. Because several software-level masking effects prevent error propagation from the architectural state to the final output of the program, it is possible that ReStore detects/recovers benign errors (those that eventually do not affect the program outputs). Likewise, ReStore considers all control-flow violations and latent errors as failures. However, as [3] demonstrated and our experimental results also verified, on average, more than 50% of conditional branches do not affect the correct program behavior even when forced into incorrect paths. It is also possible that the effects of many latent errors (errors remaining in the system for a long time) will eventually become masked by the program. Therefore, it is crucial to evaluate the protection offered by the ReStore architecture in terms of real user-visible failures. These failures are defined as silent output corruptions (SOCs) [49], a meaningful subset of silent data corruptions.

Fault Coverage: Overview of Results
To quantify the coverage offered by an architecture protected by symptom-based protection schemes, we equipped a cycle-accurate, microarchitecture-level simulated microprocessor with perfect branch predictors and caches. There is no branch misprediction, cache miss, or hardware exception in the fault-free runs of programs on the simulated microprocessor. We then injected more than 600,000 single-bit flips into different core components of the simulated processor. We collected information regarding whether the injected fault causes a symptom. If it does, we estimate the distance (in terms of committed instructions) between the fault injection time and the appearance of the first symptom. Finally, if the symptom-generating fault eventually leads to a user-visible failure, it is defined as a silent output corruption. Figure 3 demonstrates the overall soft error protection coverage of the ReStore architecture with a checkpointing interval of 100 instructions. Note that we have used the cycle-accurate gem5 simulator [50] to implement the low-level symptom protection scheme, ReStore. We have injected faults that mimic soft errors into the simulator and then estimated the fault coverage of the protection scheme. The Y-axis represents the percentage of silent output corruptions that can be covered by the ReStore architecture, whereas the X-axis shows programs from the MiBench [26] test suite. As the figure demonstrates, even considering all types of symptoms (the rightmost bar for each program), the coverage offered by ReStore is on average approximately 29%. If we consider only branch mispredictions or only cache misses as symptoms, the average fault coverage is approximately 19% and 16%, respectively. However, because of overlapping symptom phenomena (some faults generate several different symptoms), the overall coverage of ReStore when combining all symptoms is less than the sum of the coverage offered by considering the symptoms separately.
As we will discuss later in Section 6.2, the runtime overhead when considering all architecture symptoms is approximately 40%. Overall, we conclude that the assumption that a high level of coverage can be achieved with an extremely low performance overhead by monitoring low-level symptoms is incorrect, as revealed by our comprehensive analysis and exhaustive fault injection campaigns.

Fault Coverage Analysis
According to [28], estimating the protection offered by fault-tolerant schemes that modify the fault space of a program requires the performance and hardware overheads of such schemes. Then, the correct failure rate can be computed by taking the conventional failure rate from random fault injection experiments (the number of failure-inducing fault injection experiments divided by the total number of fault injection experiments) and multiplying it by the correlation factor γ (essentially the product of the performance and the hardware overheads). However, because the failure rate estimation mentioned above is heavily implementation dependent, we develop an analytical model for symptom-based protection coverage estimation.
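The scaling described above can be written as a short calculation. This is a sketch of the idea as stated in the text, not the exact formulation of [28]: the function name and parameters are hypothetical, and both overheads are assumed to be expressed as ratios relative to the unprotected baseline (for example, 1.2 for a run that takes 20% longer, and 1.0 for no extra hardware).

```python
def scaled_failure_rate(failures: int, injections: int,
                        runtime_factor: float, hardware_factor: float) -> float:
    """Conventional failure rate multiplied by the factor gamma.

    gamma is taken here as the product of the runtime and hardware
    overhead ratios relative to the unprotected baseline (an assumption
    about how the overheads are expressed).
    """
    raw_rate = failures / injections          # conventional failure rate
    gamma = runtime_factor * hardware_factor  # growth of the fault space
    return raw_rate * gamma


# Hypothetical campaign: 25 failures in 100 injections on a scheme that
# runs 1.2x longer and adds no hardware.
example_rate = scaled_failure_rate(25, 100, runtime_factor=1.2, hardware_factor=1.0)
```

For these hypothetical numbers, the raw rate of 25% scales back to 30%, so an apparent improvement over a 30% baseline disappears once the longer run time is accounted for.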
We argue that the protection provided by symptom-based protection schemes is proportional to P, where P is the probability that a failure-inducing soft error generates a symptom quickly after its occurrence. In our definition, failure-inducing soft errors are those that modify the user-visible output of the programs. We demonstrate the correctness of our argument for two cases. Note that we have used three symptoms for detecting soft errors: cache misses and branch mispredictions as the low-level symptoms [21], and exceptions as the OS-level symptom [22]. For brevity's sake, we refer to the specific architectures protected by symptom-based protection schemes as ReStore. We assume that the ReStore architecture can detect symptoms from various levels and that it re-executes the latest 100 instructions upon detection.
A perfect system with no natural symptom: As the ideal case for ReStore coverage, when the branch predictor, as well as the caches and TLBs, are perfect, the program never exhibits any symptoms during its fault-free execution. In this case, there will be no false alarms, and soft errors can be safely considered as the sole reason for any symptoms. By applying ReStore to such an ideal system, however, only soft errors that generate a symptom within the re-execution window size of 100 instructions can be recovered. The ReStore architecture is based on the ability of the system to roll back to a checkpoint and re-execute the last 100 instructions upon the observation of symptoms. If the distance (in terms of instructions) between an error and the generated symptom is larger than the re-execution window size, the error has propagated to the checkpoint, and the re-execution from the checkpoint cannot eliminate the effect of the soft error from the system. In such latent error cases, the symptom will appear again, and the ReStore logic considers it a natural symptom. Therefore, even in the case of an ideal branch predictor and cache, the overall protection of ReStore is equal to P, the probability that a failure-inducing soft error will generate a symptom after its occurrence within a distance less than the checkpointing interval. Note that the results shown in Figure 3 are extracted from fault injection in a symptom-free system based on the same argument. Parts (a) and (b) of Figure 4 demonstrate the execution trace of a program with 500 instructions on an unprotected and a ReStore-protected machine.
A real system with natural symptoms: As we mentioned before, symptoms do occur even in the fault-free run of the programs. For instance, branch mispredictions and cache misses are part of the executions in real processors. When considering the implementation of the ReStore architecture for a real system, we can assume that the program execution within 100 instructions of natural symptoms is protected (this is an overestimation of the fault coverage because soft errors sometimes, in 2% of the cases in our experimental results, occur close to a natural symptom and cause the natural symptom to disappear). However, we argue that it will not improve the soft error protection of the entire program because the probability that an error occurs before a natural symptom should be considered equal to the probability that an error occurs after the natural symptom (during the re-execution of the last 100 instructions). In other words, as compared to unprotected systems, natural symptoms protect against soft errors that occur within the re-execution window before natural symptoms; however, because they prolong the program execution time, they introduce almost equal soft error vulnerability to the program through the re-execution. Figure 4c,d show traces of a program with 500 instructions and one natural symptom. The entire program execution, 500 instructions, is susceptible to soft errors on an unprotected machine, as shown in part (c). By applying the ReStore scheme to the machine, we can mostly protect the execution time of the 100 instructions before the natural symptom.
Moreover, the remaining parts, that is, the re-executed instructions and the rest of the program execution, are protected with probability P. Note that if we subtract the 100 protected instructions from the total (600 executed instructions on a ReStore-protected machine), there are still 500 (=150 + 100 + 250) instructions protected with probability P, which is exactly the same as in a system without natural symptoms in part (b) of Figure 4. Natural symptoms impact the coverage of the ReStore scheme both positively and negatively, and overall, they do not affect the total program fault coverage. Therefore, the coverage of the ReStore scheme is not affected by natural symptoms, and this case can be simplified to that of a perfect system, which means that even in cases with natural symptoms, the coverage of the ReStore architecture remains equal to P.
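The counting argument for the 500-instruction example can be written out explicitly. The symbols below are introduced here only for illustration and are not part of the original model: N is the program length, W the re-execution window, and k the number of natural symptoms.

```latex
\begin{align*}
N_{\text{exec}} &= N + kW = 500 + 1 \cdot 100 = 600
  && \text{(instructions executed with one natural symptom)} \\
N_{\text{full}} &= kW = 100
  && \text{(window before the symptom, fully protected)} \\
N_{P} &= N_{\text{exec}} - N_{\text{full}} = N = 500 = 150 + 100 + 250
  && \text{(protected with probability } P\text{)} \\
E_{\text{unrecovered}} &\propto (1 - P)\, N_{P} = (1 - P)\, N
  && \text{(independent of } k\text{)}
\end{align*}
```

Under these assumptions, the expected number of unrecovered failure-inducing errors depends only on N and P, which is why the case with natural symptoms reduces to the perfect system.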
Overall, even for systems with natural symptoms, we can estimate the coverage of the ReStore scheme by merely estimating P, the probability that a failure-inducing fault will generate a symptom within 100 instructions of its occurrence. If this probability is high, the coverage of the ReStore architecture will also be high; in any case, the coverage of ReStore is only as good as P.
Figure 4. Coverage provided by the ReStore architecture on a system without and with natural symptoms. In the first case, the protection of the ReStore architecture is P, which is the probability that a failure-inducing soft error generates a symptom quickly after its occurrence. Even in the case of systems with natural symptoms, the coverage of ReStore remains the same.

Experimental Setup
As described in the previous section, we reduce the effectiveness of ReStore to the probability P that a user-visible failure-inducing soft error generates at least one symptom within 100 instructions after its occurrence. For this purpose, we conducted extensive fault injection experiments on gem5, a cycle-accurate microarchitectural-level simulator [50]. Figure 5 depicts our fault injection and output classification framework. We simulate an ARM Cortex-A53-like processor, which is a modern high-performance and low-power embedded microprocessor (the simulated processor is a 32-bit in-order processor with a fixed pipeline; symptom-based fault-tolerant techniques, however, do not depend on the architecture). For the ReStore protection estimation, we customized the branch predictor, cache replacement, and prefetching policy for each program so that there are no natural symptoms (branch mispredictions and cache misses) in the fault-free run of the programs. For each fault injection run, we save the fault injection time, the final output of the program, and the time and type of any symptoms introduced by the injected fault. We use the ARM-GCC 4.6.2 cross compiler with optimization flag O3 to compile benchmarks from the MiBench suite [26]. We categorized the MiBench benchmarks as computation-intensive and communication-intensive applications. The computation-intensive applications comprise numerical calculation (e.g., basicmath, bitcount, qsort, and stringsearch) and image processing (e.g., jpeg encoding and decoding). The communication-intensive applications comprise communication (e.g., gsm and FFT) and security algorithms (sha and crc).
Figure 5. Diagram of our fault injection framework. In our campaigns, we gather symptom event logs from more than 1000 silent output corruptions (SOCs) per benchmark through more than 600,000 fault injections.
We have used fault injection to mimic the impact of soft errors since soft errors per bit are very rare in reality [28]. We consider the commonly used single transient bit flip as the primary fault model since the majority of system failures are caused by soft errors in the physical register file rather than in other microarchitectural components [51]. Further, most soft errors in microprocessors can eventually modify the state of the physical register file [45]. For each fault injection experiment, we randomly selected a fault injection cycle and injected the fault into a randomly chosen bit of the register file. At the end of each fault injection experiment, we classified the outcomes as follows:
Masked: The program execution terminates normally, and the final output of the program is correct.
SOC (Silent Output Corruption): The program execution usually terminates; however, the final output of the program, compared to that from the fault-free run, is incorrect.
Others: The injected fault leads to an early termination of the program execution by causing fatal hardware exceptions, i.e., unknown instruction opcode and illegal access to the memory. It is also defined as "Others" if the processor hangs or the program loops forever.
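The three outcome classes above can be sketched as a small classifier. This is a minimal illustration, not the framework used in the experiments: the `RunResult` record and its field names are hypothetical, and the golden output is assumed to come from a fault-free run.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    MASKED = "Masked"
    SOC = "SOC"
    OTHERS = "Others"


@dataclass
class RunResult:
    # Hypothetical record of one fault injection run (names are assumptions).
    terminated_normally: bool  # False on fatal exception, hang, or endless loop
    output: bytes              # final program output, empty if none was produced


def classify(run: RunResult, golden_output: bytes) -> Outcome:
    """Map one fault injection run to Masked, SOC, or Others."""
    if not run.terminated_normally:
        # Early termination, hang, or infinite loop.
        return Outcome.OTHERS
    if run.output == golden_output:
        # Normal termination with the same output as the fault-free run.
        return Outcome.MASKED
    # Normal termination, but the output silently differs.
    return Outcome.SOC
```

Only the runs classified as `Outcome.SOC` are carried forward to the symptom analysis.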
Because our fault injection experiments aim to quantify the user-level reliability provided by low-level symptom-based error detectors, we focus on the SOC fault injection experiments in this section. In addition, as [28] shows, the number of masked faults is irrelevant to the coverage provided by an error-tolerant scheme. We repeated the fault injection experiments until we collected more than 1000 SOC cases for each benchmark, as shown in Figure 5. Note that because the probability of a fault causing an SOC varies with the benchmark, the number of fault injection experiments in our framework differs from benchmark to benchmark. For instance, the SOC rate can be as low as 2% for the susan benchmark and as high as 50% for the sha benchmark; overall, it is approximately 15% in our experimental results. Consequently, we performed more than 600,000 random fault injections across all benchmarks.
We explored whether each SOC-inducing fault satisfies the ReStore error coverage condition, i.e., whether it generates a symptom within 100 instructions. For this, we checked the symptom event log files of the SOC fault injection cases. If there is a record of at least one symptom within 100 instructions after the fault injection time, we consider the SOC-inducing fault as covered by the ReStore architecture. Otherwise, we mark the error as unrecoverable by the ReStore architecture. Table 2 lists the absolute number of SOC-inducing faults that can be detected/recovered by each type of low-level symptom in the ReStore architecture with a checkpoint interval of 100 instructions. Benchmarks are sorted in descending order of the fault coverage of symptom-based protection schemes. The main result is that, in most cases (more than 70% on average), SOC-inducing soft errors do not generate a symptom soon after their occurrence. The results also show that branch misprediction is the most helpful symptom: on average, approximately 19% of SOC failures can be recovered by simply treating a branch misprediction as a hint for error detection. Cache misses are the second most effective symptom: on average, approximately 16% of failures can be avoided if we interpret such events as the presence of errors. However, approximately 6% of the SOC-inducing faults generate both cache misses and branch mispredictions, so the two coverages overlap. Note that hardware exceptions are the least effective symptoms, and almost all cases in which a soft error induces a hardware exception are also covered by either a branch misprediction or a cache miss symptom. ReStore coverage varies significantly across applications. For instance, when a branch misprediction is used as the single symptom, ReStore performs poorly for the sha benchmark (2%) and much better for the qsort benchmark (41%).
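The coverage check described above can be sketched as follows. This is a minimal sketch under stated assumptions: the log layout (a dictionary mapping a symptom name to the instruction distance from the injection to the first symptom of that type), the symptom names, and the inclusive comparison `distance <= window` are all illustrative choices, not the format of the actual event logs. Exceptions are omitted because, as noted above, nearly all of them coincide with the other two symptoms.

```python
from typing import Dict, List, Optional

WINDOW = 100  # checkpoint interval in instructions, as in the text

# Hypothetical log of one SOC run: symptom name -> instruction distance from
# the injection to the first symptom of that type (None if it never occurred).
SymptomLog = Dict[str, Optional[int]]


def covered_by(log: SymptomLog, symptom: str, window: int = WINDOW) -> bool:
    """True if the given symptom fired within `window` instructions."""
    distance = log.get(symptom)
    return distance is not None and distance <= window


def coverage_report(logs: List[SymptomLog], window: int = WINDOW) -> Dict[str, float]:
    """Fraction of SOC runs covered per symptom, by both, and by either."""
    total = len(logs)
    if total == 0:
        raise ValueError("at least one SOC run is required")
    branch = sum(covered_by(log, "branch_mispredict", window) for log in logs)
    cache = sum(covered_by(log, "cache_miss", window) for log in logs)
    both = sum(
        covered_by(log, "branch_mispredict", window)
        and covered_by(log, "cache_miss", window)
        for log in logs
    )
    return {
        "branch": branch / total,
        "cache": cache / total,
        "both": both / total,
        # Inclusion-exclusion: runs covered by both must be counted only once.
        "any": (branch + cache - both) / total,
    }
```

The `any` entry applies inclusion-exclusion, which is consistent with the reported averages: 19% + 16% - 6% = 29% of SOC-inducing faults covered by at least one of the two symptoms.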
The ratio of branch instructions is one of the main factors that affect the fault coverage of the ReStore scheme, as shown in Figure 6. The ratio of branch instructions is defined as the number of branch instructions over the total number of instructions. Note that the branch instructions include both conditional and unconditional control-flow instructions. For a control-intensive benchmark (i.e., one with many branch instructions), injected faults can cause more branch mispredictions owing to the large number of branch instructions. For the sha benchmark, however, only a small fraction of the instructions are branch instructions, and the fault coverage is only 2% based on branch misprediction symptoms.

Ineffective Failure Coverage
By contrast, 20% of the total instructions in the qsort benchmark are branch instructions. Thus, faults are likely to cause branch mispredictions owing to the many control-flow instructions, and the fault coverage of the qsort benchmark is larger than those of the other benchmarks. Similarly, the fault coverage from cache miss symptoms depends on the ratio of memory instructions. The ratio of memory instructions for the qsort benchmark is 36%, and the fault coverage achieved from cache miss symptoms is 29%. By comparison, the fault coverage from cache miss symptoms is only 5% for the bitcount benchmark because only 7% of its instructions are memory instructions. However, the ratio of branch instructions alone does not determine the coverage. For example, the ReStore fault coverage for the bitcount and gsm (untoast) benchmarks is similar to that of the crc benchmark, even though the branch instruction ratio of bitcount and gsm (untoast) is much larger than that of crc. The reason is that applications differ not only in the ratio of branch instructions but also in how those instructions are distributed. To estimate the uniformity of the branch instructions, we measured the number of instructions between consecutive branch instructions and calculated the standard deviation of this interval. If the standard deviation is large, the branch instructions are not uniformly distributed. For the bitcount and gsm (untoast) benchmarks, the standard deviations are 7 and 6, respectively, and their fault coverages achieved from branch mispredictions are close to each other. For the crc benchmark, the standard deviation is only 2, and the fault coverage achieved from branch misprediction symptoms is 7%, even though this application has a lower branch instruction ratio than bitcount and gsm (untoast).
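The two metrics used above, the branch instruction ratio and the standard deviation of the interval between consecutive branches, can be computed from a dynamic instruction trace as sketched below. The trace representation (one boolean per executed instruction) and the use of the population standard deviation are assumptions made for illustration; the text does not state which estimator was used.

```python
from statistics import pstdev
from typing import List, Sequence, Tuple


def branch_intervals(is_branch: Sequence[bool]) -> List[int]:
    """Instruction distances between consecutive branches in a dynamic trace."""
    positions = [i for i, flag in enumerate(is_branch) if flag]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


def branch_stats(is_branch: Sequence[bool]) -> Tuple[float, float]:
    """Return (branch ratio, standard deviation of the branch interval)."""
    if len(is_branch) == 0:
        raise ValueError("trace must not be empty")
    ratio = sum(is_branch) / len(is_branch)
    gaps = branch_intervals(is_branch)
    # With fewer than two branches there is no interval to measure.
    spread = pstdev(gaps) if gaps else 0.0
    return ratio, spread


# Two toy traces with the same branch ratio but different distributions.
uniform = [i % 5 == 0 for i in range(20)]           # a branch every 5 instructions
clustered = [i in (0, 1, 2, 19) for i in range(20)]  # three adjacent, one far away
```

Both toy traces contain four branches in 20 instructions, yet the interval spread is zero for `uniform` and large for `clustered`, which is the situation in which a fault is more likely to be far from the next branch.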
In general, symptom-based fault-tolerant techniques provide better fault coverage when there are many uniformly distributed symptoms. If there are many symptoms (e.g., a large ratio of branch or memory instructions) and they are uniformly distributed (e.g., with a small standard deviation), the number of instructions between an injected fault and the next symptom tends to be small. If a benchmark has fewer symptoms than other benchmarks (e.g., a small ratio of branch or memory instructions), the number of instructions between faults and symptoms can be considerable. If symptoms are not uniformly distributed (e.g., they have a large standard deviation), the number of instructions between faults and symptoms can also be large, despite the large number of symptoms.

Considerable Performance Overhead
In general, because symptom-based schemes re-execute some portion of the instructions (e.g., the last N instructions) whenever a symptom is observed, the runtime overhead of such schemes is a function of the frequency of natural symptoms (false alarms) in a system. We use Equation (2) to estimate the overhead (in terms of extra instructions) of applying the ReStore scheme to a system. This equation simply captures the number of additional instructions that are re-executed in a symptom-based error coverage scheme in which branch mispredictions and cache misses are interpreted as signs of soft errors. (We do not include exceptions as a symptom here because they are rare, and even re-executing the last N instructions upon observing an exception does not impose a considerable performance overhead on the system.) However, we should still be careful regarding overlapping symptoms (symptoms at a distance of less than N from each other). If two or more natural symptoms occur within a window smaller than the checkpointing interval N, using only Equation (2) may significantly overestimate the overhead. Therefore, in such cases, we assume an overhead of N instructions for the first symptom, and for each following overlapped symptom, we count only its distance (in instructions) from the previous symptom as its false alarm overhead. Consequently, the runtime overhead cannot exceed the runtime of the original program because we do not re-execute instructions that are already covered by other symptoms.
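The overlap handling described above can be sketched as follows. This is an illustrative sketch, not the estimator used for the reported results. It assumes that each natural symptom is given as the number of instructions executed before it occurred, and that a rollback cannot reach back past the program start (so a symptom closer than N to the start costs only its distance from the start); both are assumptions of this sketch.

```python
from typing import Iterable


def reexecuted_instructions(symptom_positions: Iterable[int], n: int = 100) -> int:
    """Extra instructions re-executed because of natural symptoms.

    An isolated symptom costs n instructions. A symptom that follows the
    previous one by fewer than n instructions costs only that distance,
    because the earlier instructions were already re-executed.
    """
    extra = 0
    previous = 0  # program start acts as the first rollback boundary (assumption)
    for position in sorted(symptom_positions):
        extra += min(n, position - previous)
        previous = position
    return extra


def runtime_overhead(
    symptom_positions: Iterable[int], total_instructions: int, n: int = 100
) -> float:
    """Re-executed instructions relative to the fault-free instruction count."""
    return reexecuted_instructions(symptom_positions, n) / total_instructions
```

For example, with N = 100 and natural symptoms after 150, 200, and 500 instructions, the naive count of N per symptom gives 300 extra instructions, whereas the overlap-aware count gives 100 + 50 + 100 = 250. Because every symptom is charged at most its distance from the previous one, the total can never exceed the length of the original run.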
To collect the information required for the performance overhead estimation of the ReStore architecture, we profiled programs from the MiBench test suite and collected the required statistics, such as the number of branch and memory instructions, the branch misprediction and cache miss ratios, and their distributions. Because the performance overhead of the ReStore scheme depends on the branch predictor and cache performance, we compute the overhead of ReStore for four different hardware configurations. We modified the cache memory and branch predictor to guarantee a certain level of accuracy in each configuration: (i) configuration 1, a branch prediction accuracy and cache hit ratio of 95% on average; (ii) configuration 2, a branch prediction accuracy and cache hit ratio of 99% and 95% on average, respectively; (iii) configuration 3, a branch prediction accuracy and cache hit ratio of 95% and 99% on average, respectively; and (iv) configuration 4, a branch prediction accuracy and cache hit ratio of 99% on average. Figure 7 shows the performance overhead of the ReStore scheme with a checkpointing interval of 100 instructions on the four hardware configurations. Interestingly, even with a 99% accurate branch predictor and cache memory (configuration 4), ReStore increases the runtime by 34% on average. With a less accurate branch predictor and cache memory, which induce many branch mispredictions and cache misses (configuration 1), the runtime overhead is more than 80% compared to an unprotected architecture.
Figure 7. Runtime overheads of symptom-based techniques depend on the accuracy of the branch predictor and cache hit ratio. Even though we have exploited 99% accurate branch predictor and cache memory, the performance overhead is more than 40% as compared to unprotected architectures. (Conf, configuration options; BP, accuracy of branch prediction; and CH, cache hit ratio).
To analyze the runtime overheads on the default configuration, we measured the accuracy of the branch prediction and the cache hit ratio. On average, approximately 95% of the predicted branches are correct for our set of benchmarks. Furthermore, the cache hit ratio is more than 98% for all benchmarks, and the average cache hit ratio is almost 100% for our benchmark suite. Symptom-based protection techniques need to re-execute instructions when they encounter natural symptoms, such as cache misses and branch mispredictions, during a fault-free run. As a result, the runtime increases by 40% on average compared to the unprotected architectures. For the basicmath benchmark, the runtime overhead from a branch misprediction as a single symptom is approximately 75%. Branch mispredictions occur every 70 instructions on average during a fault-free run. Because this interval is shorter than the 100-instruction checkpoint interval, the runtime overhead is close to the original runtime of the basicmath benchmark. On average, branch mispredictions are the most frequent and most costly symptoms, followed by cache misses and exceptions.
Figure 8 shows the average fault coverage and runtime overhead of ReStore with different checkpointing intervals or recovery windows. For instance, if the recovery window is less than or equal to 10 instructions, 16% of the silent output corruptions can be covered with a 13% runtime overhead. In other words, the ReStore architecture can cover or avoid 16% of the silent output corruptions if the hardware provides checkpointing and rollback to the last 10 instructions at any point of the program execution. If system architects want to avoid more than 50% of silent output corruptions, a checkpoint/rollback strategy with the ability to re-execute 100,000 instructions is required; however, the runtime overhead then exceeds 94% compared to unprotected architectures.
Interestingly, approximately 69% of silent output corruptions generate at least one symptom before the end of the application, whereas 31% of silent output corruptions do not generate any symptoms. These 31% of the silent output corruptions cannot be handled or covered by symptom-based fault-tolerant techniques, regardless of the window size. We also observed that the fault coverage achieved with a given recovery instruction window varies according to the benchmark. We examined the fault coverage of ReStore with different recovery instruction window sizes for two benchmarks, bitcount and gsm (toast). For the bitcount and gsm (toast) benchmarks, only 14% and 16% of the SOC-inducing faults generate symptoms within 100 instructions, respectively. Even with a recovery window of one million instructions, the fault coverage of the bitcount benchmark is still only 22%. By contrast, the fault coverage of the gsm (toast) benchmark increases significantly as the recovery instruction window grows: with a window size of one million instructions, 97% of the silent output corruptions can be covered. Thus, it is difficult to determine an optimal recovery instruction window size that suits all types of applications.
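The dependence of coverage on the recovery window can be sketched as a simple sweep. This is a minimal sketch that assumes each SOC run has been reduced to a single number, the instruction distance from the injection to the first symptom of any type, or `None` if no symptom ever occurred; this input format is hypothetical.

```python
from typing import Dict, Optional, Sequence


def coverage_by_window(
    first_symptom_distance: Sequence[Optional[int]],
    windows: Sequence[int],
) -> Dict[int, float]:
    """Fraction of SOC runs whose first symptom falls within each window size."""
    total = len(first_symptom_distance)
    if total == 0:
        raise ValueError("at least one SOC run is required")
    return {
        window: sum(
            distance is not None and distance <= window
            for distance in first_symptom_distance
        ) / total
        for window in windows
    }


def max_coverage(first_symptom_distance: Sequence[Optional[int]]) -> float:
    """Upper bound on coverage: runs that produce any symptom at all."""
    total = len(first_symptom_distance)
    if total == 0:
        raise ValueError("at least one SOC run is required")
    return sum(d is not None for d in first_symptom_distance) / total
```

Runs with a distance of `None` are never covered by any window, so `max_coverage` bounds the whole curve; in the results above this bound is approximately 69% on average.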

Recovery Window Size Slightly Improves Coverage and Drastically Hurts Overhead
Quantifying the Negative Impact of the Program-Level Masking Effects
As we described in Section 2.3, one of the main problems with the evaluation of the ReStore work is that it did not consider the impact of program-level masking effects on the coverage of ReStore. However, we found that the majority of faults that generate a symptom soon after their occurrence eventually become masked by program-level masking effects. Figure 9 shows the probability that a symptom-generating fault actually leads to an SOC, separately for branch mispredictions and cache misses. On average, only 46% and 48% of the errors detected by branch misprediction and cache miss symptoms, respectively, in the ReStore scheme are harmful. The probability that soft errors that quickly generate symptoms result in SOCs depends on the benchmark. For instance, Figure 9 shows that only 6% of branch misprediction symptoms cause SOCs for the crc benchmark, whereas more than 82% cause SOCs for the jpeg (decode) benchmark. This occurs because even incorrectly taken branch instructions, called Y-branches, can result in correct outputs, and the portion of Y-branches depends on the application [3]. To analyze the effectiveness of branch misprediction symptoms, we changed the control flow of 300 randomly selected branch instructions over our benchmarks. For the jpeg (decode) benchmark, approximately 36% of the control flow violations resulted in correct outputs, whereas the rate was approximately 84% for the crc benchmark. Because the crc benchmark is less sensitive to control flow violations than the jpeg (decode) benchmark, the branch misprediction symptom is not an effective indicator of SOCs for crc. Overall, the fact that more than 57% of branches are Y-branches on average demonstrates that a branch misprediction is a poor indicator of the existence of soft errors.
The effectiveness of cache miss symptoms is affected by silent/dead memory instructions [24,25], which result in correct outputs even when they are not executed. To analyze the effectiveness of cache miss symptoms, we selected 100 store and load instructions from each benchmark and skipped one of the selected memory instructions in each simulation. For the susan and sha benchmarks, almost 80% and 4% of memory instructions, respectively, do not affect the program results even when they are not executed. Approximately 8% of cache miss symptoms induce SOCs for the susan benchmark, as shown in Figure 9, because many memory instructions are not critical in this benchmark. By contrast, 96% of the symptoms cause SOCs for the sha benchmark because most memory instructions are not silent/dead. On average over our benchmarks, 48% of memory instructions do not cause failures at all even when they are not executed, which makes cache miss symptoms overprotective.

Conclusions
With aggressive technology scaling, the soft error rate is increasing, particularly in modern embedded systems. In order to protect embedded systems against soft errors, several hardware and software redundancy schemes have been proposed. However, they can be expensive in terms of performance and hardware, and they are not suitable for resource-constrained embedded systems. Symptom-based techniques have been suggested as an alternative to protect embedded processors effectively. The main claim behind symptom-based techniques is that soft errors that cause a failure generate symptoms quickly. Failures can then be avoided by re-executing the last 100 instructions when symptoms are detected. Symptom-based fault-tolerant techniques seem compelling because they do not have to duplicate all instructions or require expensive hardware modifications. Since the symptom-based protection schemes seem attractive due to their generality and simplicity, even state-of-the-art protection schemes exploit them as the baseline protections. In this work, we have implemented a reliability analysis module to re-evaluate the existing protection schemes, and we have found that there is no royal road to high reliability. Our experimental results show that symptom-based techniques can cover only 29% of silent output corruptions. Furthermore, their runtime overhead is almost 40% compared to unprotected architectures owing to frequent false alarms caused by natural symptoms. Finally, symptom-based fault-tolerant techniques are ineffective because more than half of the quickly generated symptoms do not cause failures owing to several masking effects.
Our future work will include implementing a more general framework to analyze the efficacy of existing protection schemes, since this work focuses on low-level symptom-based protection schemes. Our detailed experimental results indicate that selective protection schemes need to be validated thoroughly and comprehensively. Such an analysis is the first step toward reaching the reliability goal.