An Engineered Minimal-Set Stimulus for Periodic Information Leakage Fault Detection on a RISC-V Microprocessor

Recent evaluations of counter-based periodic testing strategies for fault detection in microprocessors (µPs) have shown that only a small set of counters is needed to provide complete coverage of severe faults. Severe faults are defined as faults that leak sensitive information, e


Introduction
Information leakage in µPs, a security vulnerability that occurs when sensitive information is accessed or transmitted without proper authorization while executing applications such as cryptographic algorithms, has become a hotbed for research over the last couple of years [1,2]. The challenges associated with providing leakage-safe implementations are numerous and stem from the existence of countless vectors that lead to these situations. For example, Electromagnetic Induction (EMI) can cause a µP to enter an unexpected state, or a physical attack can damage the µP or disrupt its operation. Another common cause of stuck-at faults is hardware failure, such as a faulty transistor or a damaged electrical connection. Software bugs can also cause a system stuck-at fault. For example, a bug in the operating system or a malicious application can cause the µP to enter an infinite loop, preventing it from servicing other concurrent applications. When these faults occur, it is important to identify the root cause of the failure, which typically involves examining hardware components, analyzing software logs, and performing diagnostic tests.
In this paper, we propose a low-overhead method that utilizes already existing DFT scan chains and a handful of counters in conjunction with a specially designed binary to achieve low-latency hardware fault detection. We propose two methods of creating the specially designed binary. In the first method, we use a counter-based leakage detection method and a processor run-cycle analysis to determine the failure point on the µP caused by an injected fault and to recreate the processor state at that failure point. In the second method, a binary is constructed with the assistance of ATPG tools, which coerces ATPG vectors into closely matched processor instructions and register file values. The goal is to create a relatively small binary program that provides high levels of fault coverage. Experimental results are provided for each method and compared to determine which process provides the highest coverage when considering binary generation complexity.
The specific contributions of this work include the following:
• The evaluation of fault propagation latency for leakage scenarios when executing a common cryptography algorithm on a RISC-V processor.
• An analysis and discussion of the process that converts exact processor states, i.e., register values, peripheral states, and instruction inputs, into specialized low-instruction-count binaries targeted at triggering hard-to-reach faults. Given that the binary will be stored in memory for periodic fault detection, minimizing size, e.g., to fewer than 100 instructions, is an important design goal.
• An analysis of the specialized binary executables and a comparison of their fault trigger latencies with the latencies obtained when executing common cryptography algorithms.
The remainder of this paper is organized as follows. Section 2 discusses additional related work. Section 3 describes the experimental design and attributes of the binary generation sequence. Section 4 presents the details of the proposed Periodic Built-In-Self-Test (PBIST). Section 5 presents the fault coverage and latency results for the generated binaries. Section 6 presents our conclusions.

Related Work
An overview of the different strategies that can be employed to detect faults through either continuous checkers (also called concurrent) or periodic testing is provided in [3]. The authors describe four general approaches, namely redundant execution, PBIST, dynamic verification, and anomaly detection. The periodic execution of a specialized binary described in this work falls under the periodic built-in self-test category. The method is uniquely applied here to detect faults before information leakage occurs and is portable to a wide range of µP architectures and input-output peripherals. The methods described in previous work have higher overhead and do not address protection against information leakage.
Software-only fault detection methodologies are described in [4][5][6], which significantly improve reliability without requiring hardware modifications. This makes software redundancy techniques significantly cheaper and easier to deploy. For example, the authors of [4] used code transformation and specialized instructions to create fault-resistant binaries, which require a lengthy processor-specific, fault-agnostic run and do not provide true fault detection, only fault tolerance under certain circumstances.
The authors of [7] introduced a RISC-V framework for hardware-software codesign that can aid in the implementation of secure and safe SoCs based on RISC-V. The script-based framework provides cycle-true verification, ensuring accuracy in the simulation of hardware and software interactions. The framework's versatility makes it applicable in various scenarios, including designing systems resilient against Side-Channel Attacks (SCAs) and other vulnerability points. Additionally, the authors show that the framework enables the fast implementation, functional verification, and post-synthesis verification of projects such as the design of Post-Quantum Cryptography ISA extensions for RISC-V and cryptographic hardware accelerators for the Advanced Encryption Standard (AES). While this framework is effective in speeding up the evaluation of software-aware, hardware-dependent metrics such as performance, power consumption, and area utilization, it has not been shown to be capable of aiding in the detection of hardware-based information leakage faults.
The authors of [8] proposed a predominantly software-based fault detection scheme supplemented by hardware. They utilized a special instruction set, which they termed the Access-Control Extension (ACE), that interacts with a custom-instrumented, full-scan chain to test the µP. Unfortunately, the specialized instructions add complexity to the µP and create a side-channel attack vector. The implementation of their approach is complicated because the ACE instructions are restricted to the ACE firmware. Additionally, the tree architecture exposes no avenue to target information leakage sites. In contrast, our proposed periodic testing method introduces only a small set of counters and utilizes standard instructions, eliminating the need for custom instructions.
Austin [9] proposed a µP with a unique architecture called Dynamic Implementation Verification Architecture (DIVA), designed to detect both transient and permanent faults. In DIVA, a checker validates the functional unit result by recomputing it using the instruction's input operands and compares this result before permitting the instruction to commit. Despite the advantage of a simplified checker design due to the leveraging of processor pipeline decisions, there is considerable overhead in the checker pipeline. This limitation restricts its practical use to super-scalar architectures. Additionally, the effectiveness of DIVA relies on the assumption that the register file and memory employ Error-Correcting Code (ECC) for error detection and correction, serving as a mitigation strategy against faults related to storage.
In [10], a high-level, symptom-based fault detection technique combining hardware and software was introduced. This method monitors software execution to identify anomalous behavior. The fault detection process occurs at a high level by observing hardware traps and utilizing µP performance counters. While the technique demonstrates an ability to detect 95% of unmasked faults, it comes with a potential drawback of high latency. Most faults are identified within the first 100,000 instructions, but some may take longer, extending up to 10 million instructions.
In recent work [11], a high-speed fault emulation platform was developed on an FPGA to assess the Potato RISC-V µP [12]. A dynamic verification or continuous symptom-monitoring approach was proposed to evaluate information leakage events introduced by faults from various classes. The study delved into the effectiveness and latency associated with a set of countermeasures based on self-assertions called Self-Assertion-Based Countermeasures (SABCs). SABCs perform consistency checks on instruction and datapath values during program execution. The fault detection results and the associated latency are compared to those provided by a periodic counter-based countermeasure proposed in [13]. The evaluation of the SABCs includes assessing the number of severe faults they can detect, the latency associated with these detections, and the extent of collateral coverage for active faults. The results demonstrated that SABCs are nearly as effective as the node counter-based countermeasures (CMs) in detecting all active faults and are nearly equivalent in effectiveness for detection of severe faults. Notably, all severe faults are successfully detected by SABCs, highlighting the effectiveness of the proposed countermeasures in preventing information leakage during program execution. The SABCs, however, are expected to scale somewhat poorly to more complex microprocessor architectures, including super-scalar architectures. Integrating them will demand adjustments and additional resources to navigate the heightened intricacies of the pipeline. Specifically, this involves synchronizing assertions with out-of-order executed instructions and managing the complexities associated with branch prediction and execution.
In other recent work [13,14], a counter-based node-monitoring technique and a fault injection technique were proposed.In this paper, we will expand on previous work to explore µP information leakage by analyzing internal node fault effects and discuss a low-overhead fault detection methodology that enables periodic fault detection without the need for special instructions or to take the µP offline.

System Overview
This section describes the RISC-V architecture used in the emulation experiments, including a special add-on feature referred to as Emulation ROM Side Loading (ERSL), which enables binary executable loading to be accomplished at run time, as well as the characteristics of the fault campaign, Fault Injection Manager (FIM), and Fault Emulation Engine (FE). Also discussed are the CAD tools used in the synthesis and implementation, the testing process, and details regarding counter-based periodic testing.

RISC-V Architecture
The architecture of the Potato µP utilized in this research is shown in Figure 1. Potato is compliant with the RISC-V v2.0 standard [12] and is classified as a 32-bit RISC-V ISA CPU core (RV32I). It possesses a complete set of integer instructions with Control and Status Register (CSR) and exception handling while supporting RISC-V integer (I), multiplication and division (M), and CSR instruction (Z) extensions (RV32IMZicsr). All instructions except load and store execute in one clock cycle. Potato utilizes the Wishbone B4 standard [15] as an internal bus.

Fault Campaign Characteristics
A fault campaign refers to the characteristics of the Fault Injection (FI) system [16].The architecture employed in this research is shown in Figure 2 and has the following features:

• The Potato µP [12] serves as the processor under test, configured with a 32 KB ROM for application code and a 132 KB BRAM for scratch memory. The netlist for Potato is generated using an ASIC synthesis and place-and-route computer-aided design (CAD) tool flow in which 34,110 fault injection circuits are integrated. The netlist is instrumented with scan chains, which provide access to fault injection circuits and counters. The instrumented netlist is used as input to an FPGA CAD tool flow to produce the programming bitstream for the FPGA.
• The Xilinx UltraScale+ Multiprocessor System-On-Chip (MPSoC) FPGA on the ZCU102 development board serves as the emulation platform for the Potato µP.
• The Fault Injection Manager (FIM) is implemented as a C program that runs on an embedded processor within the FPGA. Similar to the FI architecture proposed in [11], we leverage two 32-bit high-speed, memory-mapped General-Purpose Input/Output (GPIO) registers to facilitate fault injection, control, and counter data retrieval between the Processing System (PS) and Programmable Logic (PL) components.
• The FE is realized as a set of State Machines (SMs) designed to collect serial and address bus data as Potato executes the Advanced Encryption Standard (AES) algorithm [17]. Configured by the FIM, the SMs limit the number of run cycles. When combined with a binary search routine implemented within the FIM C program, this setup enables the latency of fault effects to be determined.
• A Wishbone-based independent ROM binary side-load architecture is integrated into the design, which significantly accelerates the testing process.
• The C program running in the PS of the FPGA is used for communication with and control of the Fault Emulation Engine (FE) and the ROM side-load module, which are both implemented in the PL.
• The fault detection capabilities and detection latencies of the countermeasures (CMs) are assessed offline using data collected from the scan chains.

System Architecture
The emulation hardware platform uses a Xilinx ZCU102 development board [18], in which both the PS and PL are utilized in a codesign-based system architecture.

Fault Injection Circuit with Counter (FIC)
The Fault Injection (FI) circuit structure is implemented using three scan chains, namely scan_in[0], scan_in[1], and scan_in[2]. The first scan chain, with scan_in[0], is used to selectively enable one of the faults, while scan_in[1] and scan_in[2] are used to select from one of four fault types. The scan chain consists of 34,110 elements, i.e., one instance of the FI is added to each of the gate input signals driving the logic gates within an instance of Potato's core ASIC design. The term fault injection with counter, or FIC, refers to the encompassing circuit, which includes both a counter and an FI circuit instance. The scan chains are extended into the counter circuit, as shown in Figure 3, to enable the count values to be scanned out after each FI experiment. The counters record the number of rising and falling transitions that occur on the node during the program's execution.
In prior work [11], we showed that a substantial proportion of the active faults within various fault types, including Stuck At 0 (SA0), Stuck At 1 (SA1), delay, and invert fault classes, can be identified by a relatively small number of counters. In particular, a set of five counters has been shown to identify a large fraction of all faults. Notably, this identical set of counters also proves adept at detecting all severe faults, underscoring their efficacy across all active fault scenarios. However, several severe faults have been shown to have high latency, with as many as 6 million cycles during program execution.
In this work, we demonstrate that a small number of carefully crafted µP instructions designed to exercise specific nodes, together with the small subset of counters identified in previous work, referred to as TopCounters, can be used to detect all faults that lead to information leakage with very low latency. Therefore, an effective countermeasure can be constructed with the node-monitoring counters in Figure 3 (without the fault injection portions) to serve as a part of a Periodic Built-In-Self-Test (PBIST) Counter-Measure (CM) for the detection of information leakage faults in the Potato RISC-V design.
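The behavior of a node-monitoring counter can be illustrated with a small software model. The sketch below is not the authors' hardware; it assumes a node is sampled once per clock cycle, that each counter is 24 bits wide (the width quoted later in the overhead analysis), and that counters wrap on overflow. The names `count_transitions` and `apply_stuck_at` are hypothetical.

```python
COUNTER_BITS = 24                      # width taken from the overhead analysis
MASK = (1 << COUNTER_BITS) - 1         # assumption: counters wrap on overflow


def count_transitions(samples):
    """Return (rising, falling) transition counts for a list of 0/1 samples."""
    rising = falling = 0
    for prev, cur in zip(samples, samples[1:]):
        if prev == 0 and cur == 1:
            rising = (rising + 1) & MASK
        elif prev == 1 and cur == 0:
            falling = (falling + 1) & MASK
    return rising, falling


def apply_stuck_at(samples, value):
    """Model a stuck-at fault: the node holds `value` on every cycle."""
    return [value] * len(samples)


# Toy trace for one monitored node (illustrative data only).
golden_trace = [0, 1, 1, 0, 1, 0, 0, 1]
golden = count_transitions(golden_trace)
faulty = count_transitions(apply_stuck_at(golden_trace, 0))

# A fault is flagged when the scanned-out counts differ from the fault-free counts.
fault_detected = faulty != golden
```

In this toy trace, the stuck-at-0 node never toggles, so both of its counts differ from the fault-free pair and the mismatch flags the fault.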

Fault Trigger Binary Executable (FTBE)
The Fault Trigger Binary Executable (FTBE) is a minimized set of instructions designed to recreate a processor state that leads to information leakage in the presence of a fault with minimal latency. In this section, we describe two methods that can be used to generate an FTBE and discuss the tradeoffs between the two. The first, which we call the Fault-Run-Cycle-Based FTBE (RCBE), is created by utilizing the counters already present in the FE, where we identify the run cycle in which faults are observable while running the cryptography algorithm encode/decode sequence. The second method, which we call the ATPG-Based FTBE (ATPGB), is created by using an ATPG flow to create high-coverage test vectors, which are coerced into µP instructions and represent a sequence of concise stimuli.

Run Cycle-Based Binary
The process of generating the Fault-Run-Cycle-Based FTBE (RCBE) is visually presented in Figure 5 and is categorized into three segments. The first segment is called Incremental Search Fault-Detected Cycle (IFTC) and is color-coded in blue. It involves determining the specific run cycles during which faults become observable on the node counters while the AES algorithm is executing. The second segment, determination of the µP state at the IFTC, is color-coded in purple; it is executed once the IFTC is found and determines the state of the µP at the identified fault-observable run cycles. The last segment, color-coded in green, is called binary creation for fault-triggered state replication. In this step, using the knowledge of the IFTC and µP state, a binary is crafted to replicate the state of the µP where faults are triggered. Each step is discussed in greater detail in the following subsections.

Identification of Fault-Observable Run Cycles (IFTCs)
Identifying the IFTC requires generating and searching through node counter values for over 6 million clock cycles. Carrying out this search in a single-cycle incremental search would be impractical; therefore, the task is broken down into stages that combine a binary search and an incremental search to define a solution that has much lower memory and run-time overhead.
The process for identifying the IFTC is as follows:
1. Fault-free counter values for each node while running the AES binary are generated and stored in increments of 1024 cycles from cycle 0 to cycle 6,717,440.
2. A binary search with an exponentially increasing multiple of 1024 clock cycles is then performed.
   • The binary search concludes when a multiple of 1024 clock cycles is identified where the fault remains undetected at the lower bound but is detected at the upper bound.
   • Since the search spans from 1024 to 6,717,440 clock cycles in increments of 1024, each fault necessitates 13 iterations to complete the process.
3. When a fault is detected through the binary search method, the run cycle is stored as a Binary Search Fault-Detected Cycle (BFTC).
Table 1 shows the BFTCs (severe-fault detect cycles) at 1024-cycle increments for Potato when running the AES binary. Each run cycle is passed on to the next step in the process, the incremental sweep search, as a starting value for the single-cycle increment analyses. The highest-latency fault is triggered at cycle 6,078,464 and correlates with delay faults. Detecting these delay faults at much lower latencies is crucial for this work. A script that takes multiple BFTCs as input arguments and performs the incremental search while iterating through the BFTCs is utilized to automate this process.
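The two-stage search described above can be sketched in a few lines. This is a minimal illustration rather than the FIM implementation: it uses plain bisection over the 1024-cycle checkpoints, and it assumes a hypothetical predicate `detected_at(cycle)` that reports whether the counters differ from their fault-free values after running for `cycle` cycles, and that detection is monotonic (once a fault is visible, it stays visible).

```python
STEP = 1024
LAST_CHECKPOINT = 6_717_440            # final checkpoint quoted in the text


def find_bftc(detected_at, step=STEP, last=LAST_CHECKPOINT):
    """Bisect over checkpoint indices; return (bftc, probes) or (None, 0)."""
    lo, hi = 1, last // step
    if not detected_at(hi * step):     # fault never observed in this run
        return None, 0
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if detected_at(mid * step):
            hi = mid                   # detected: first detection is at or below mid
        else:
            lo = mid + 1               # undetected: first detection is above mid
    return lo * step, probes


def find_iftc(detected_at, bftc, step=STEP):
    """Single-cycle sweep over the 1024-cycle window that ends at the BFTC."""
    for cycle in range(bftc - step + 1, bftc + 1):
        if detected_at(cycle):
            return cycle
    return None


# Toy fault that first becomes observable at cycle 550,800 (illustrative only).
def toy_fault(cycle):
    return cycle >= 550_800


bftc, probes = find_bftc(toy_fault)
iftc = find_iftc(toy_fault, bftc)
```

With 6560 checkpoints, bisection needs at most 13 probes per fault, which matches the iteration count stated in the list above; the sweep then examines at most 1024 individual cycles.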
Table 2 shows the IFTCs (severe-fault detect cycles) at single-cycle increments for Potato when running the AES binary. Each run cycle is passed on to the next step as a stop cycle for the simulation run. The µP state at the IFTC is extracted with a SystemVerilog test bench, which instantiates a clean, non-instrumented version of Potato loaded with the same AES executable as the instrumented Potato. This ensures that, in a fault-free run, both the simulated and emulated Potato would be in the same state, including register values, peripheral states, and instruction inputs. The test bench starts at run cycle 0 and simulates Potato to the IFTC; it then stores the values of each general-purpose register, as well as the previous 10 instructions before that point. The test bench also stores the values read from and written to execution memory and the Wishbone interconnect bus for that time frame.

Binary Creation for Fault-Triggered State Replication
With the µP state values at the IFTC extracted, the FTBE is created with the following steps:
1. First, register states are recreated. This is accomplished by utilizing Load Upper Immediate (LUI) and Add Immediate (ADDI) instructions. (Note that Potato does not implement the Load Word Immediate (LWI) instruction; LWI should be used for processors that do implement it.)
2. Second, memory locations that are accessed by the binary executable are reconstructed by storing the values read by the µP in the simulation in the addresses from which the binary executable will later be read.
Combining these into a binary program, with the memory and peripheral state reconstruction instructions executed first, followed by the register load instructions, creates an executable that mimics the processor state at the IFTC. Figure 6 shows a system-level view for generating the RCBE from node counter values and which device is utilized. The counter values are extracted from the Potato core by the PL, the fault trigger cycles are analyzed by the PS, the processor state values are determined by a simulation running on the host, and the binary is then constructed. Finally, the binary is loaded into the Potato core for testing using the ERSL.
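The register-recreation step can be illustrated with the standard RISC-V idiom for materializing a 32-bit constant. The sketch below is not the authors' generator; `split_lui_addi` and `load_register` are hypothetical helper names. Because ADDI sign-extends its 12-bit immediate, the upper 20 bits passed to LUI must be adjusted whenever bit 11 of the target value is set.

```python
def split_lui_addi(value):
    """Split a 32-bit value into (upper20, signed_lower12) for LUI + ADDI."""
    value &= 0xFFFFFFFF
    low = value & 0xFFF
    if low >= 0x800:
        low -= 0x1000                  # ADDI sign-extends its 12-bit immediate
    high = ((value - low) >> 12) & 0xFFFFF
    return high, low


def load_register(reg, value):
    """Emit assembly text that places `value` in `reg`."""
    high, low = split_lui_addi(value)
    lines = [f"lui {reg}, {high:#x}"]
    if low != 0:
        lines.append(f"addi {reg}, {reg}, {low}")
    return lines


# Example register value captured from a simulation (illustrative only).
example = load_register("x5", 0xDEADBEEF)
```

For `0xDEADBEEF`, the low 12 bits (`0xEEF`) have bit 11 set, so the sketch emits `lui x5, 0xdeadc` followed by `addi x5, x5, -273` instead of using `0xdeadb` as the upper immediate.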

ATPG-Based Binary
Creating an ATPG-Based FTBE (ATPGB) requires the generation of ATPG test vectors and conversion of those normally serial test vectors into binaries that correlate with high fault coverage of the ATPG test vectors. Figure 7 shows a flowchart for the ATPGB generation process. The first step is to modify the Potato RTL so that the values of the instruction memory and general-purpose registers are observable on output ports at the top level. This is done to make extracting the µP state values easier following synthesis and to ensure minimal impact on the generated test patterns. Next, an open-source Design for Testing (DFT) solution, AUCOHL-Fault (Fault) [19], is utilized to automatically generate test patterns. The test patterns are then converted into an FTBE. To create the ATPG test vectors, the Potato module is synthesized and mapped to the osu035_stdcells library [20] by calling the Fault synth command. This initiates a Yosys [21]-based synthesis script and generates a flattened netlist. The netlist is then cut using the Fault cut command, which eliminates the flip-flops in the netlist, converting it into a purely combinational netlist, which is utilized in conjunction with the original netlist to generate the test vectors. Lastly, test vectors are generated with AUCOHL-Fault's built-in PODEM [22] test pattern algorithm. Patterns are generated with default values of 100 test vectors and an expected minimum of 95% coverage. The generated test vectors for Potato offer 82% fault coverage of the design.
Figure 8 shows a diagram of the conversion process from ATPG test vectors to the FTBE, which is described as follows:
1. The DFT scan-inserted Potato µP is simulated in test_mode, i.e., scan_en asserted, with the ATPG vectors as a stimulus in the scan_in port.
2. During the simulation, instruction memory values, as well as the general-purpose registers, are captured from the signal exported to the top-level ports and stored for each test vector.
3. Processor state values are parsed to extract only those that contain valid instruction memory inputs and fall within valid address ranges. The processor state values extracted from the ATPG vectors differ from the RCBE in that preceding instructions are not available due to the non-contiguous nature of ATPG vectors. As a result, the binary program conversion process is slightly different.
4. A RISC-V assembler [23] is used to convert the instructions into the RV32I [24] set that Potato supports.
5. The set of instructions is used to heuristically construct a coherent binary program while avoiding endless loops and other unwanted processor states.
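The parsing in step 3 can be pictured as a coarse filter over the captured (address, instruction word) pairs. The source does not detail the exact validity criteria, so the sketch below is an assumption: it keeps a word only if its 7-bit opcode field belongs to the RV32I base set and its address is word-aligned and inside the 32 KB ROM. The ROM base address of 0 and the names `plausible_rv32i` and `filter_states` are hypothetical, and the check does not validate funct3/funct7 fields.

```python
# Major opcodes (bits 6..0) of the RV32I base instruction set.
RV32I_OPCODES = {
    0b0110111,  # LUI
    0b0010111,  # AUIPC
    0b1101111,  # JAL
    0b1100111,  # JALR
    0b1100011,  # BRANCH
    0b0000011,  # LOAD
    0b0100011,  # STORE
    0b0010011,  # OP-IMM
    0b0110011,  # OP
    0b0001111,  # MISC-MEM (FENCE)
    0b1110011,  # SYSTEM (ECALL, EBREAK, CSR)
}


def plausible_rv32i(word):
    """Coarse check: only the opcode field is inspected."""
    return (word & 0x7F) in RV32I_OPCODES


def filter_states(states, rom_base=0x0, rom_size=32 * 1024):
    """Keep (addr, word) pairs with an aligned in-ROM address and a known opcode."""
    kept = []
    for addr, word in states:
        in_range = rom_base <= addr < rom_base + rom_size and addr % 4 == 0
        if in_range and plausible_rv32i(word):
            kept.append((addr, word))
    return kept


# Illustrative captures: a NOP, an all-zero word, a LUI, an out-of-range
# address, and a misaligned address.
captured = [
    (0x0, 0x00000013),
    (0x4, 0x00000000),
    (0x8, 0xDEADC2B7),
    (0x9000, 0x00000013),
    (0x6, 0x00000013),
]
valid = filter_states(captured)
```

A filter of this kind only narrows the candidate set; assembling the survivors into a coherent program that avoids endless loops remains the heuristic step described in item 5.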

Experimental Results
The primary goal of the FTBE fault coverage experiment is to determine the fault detection capabilities of each binary executable and to identify the minimum latency in which they trigger those faults. In this section, we focus on identifying the designed binary programs that provide the smallest possible severe fault trigger latency and compare the results with the latencies of other binary executables. The analysis in this section is carried out on the faults in the SevereFaults [14] class only and the five counters discussed earlier in Section 3.4 that are determined to provide the highest fault coverage.
For each test run, the ERSL is utilized to override the initial binary that Potato is synthesized with, saving hours of bitstream generation time by avoiding the need to resynthesize Potato each time a new binary is tested. To avoid running Potato while the instruction memory is changing, the ERSL loads the binary under test to the emulated ROM module using a secondary clock while ensuring that the clock is de-asserted to the rest of Potato.
A fault-free run is then performed with each binary to obtain fault-free counter values at 1024-cycle increments. Next, Potato is run with faults introduced for SA0, SA1, invert, and delay severe faults while examining the TopCounters values. Figure 9 plots the fault trigger results for each binary program on faults in the SevereFaults class as percentages of the total number of faults in the class. Each binary is assigned a unique color to differentiate it from others. The size of each binary in kilobits is presented along the x-axis. The fraction of SevereFaults detected by each binary is presented along the y-axis. Each point represents the fraction of SevereFaults detected by each binary in comparison to the binary size. The results for each tested binary program are summarized below.

• For the Fault-Run-Cycle-Based FTBE methodology, the optimal number of binaries needed to satisfy requirements is determined to be three; B0, B1, and B2 are generated from IFTCs 6,076,072, 1,728,512, and 550,800, respectively. Each binary targets specific faults that are triggered at the IFTC from which it is derived.
• The ATPG binary requires fewer instructions in total, which directly correlates with the memory overhead needed to store the target binary; however, it also detects fewer SevereFaults than the previous binaries, at 188.
• The CoreMark algorithm requires the most instructions in total and triggers 191 faults in the SevereFaults class throughout a full program run.
• The hello world program, included in the analysis for completeness, requires the fewest instructions but triggers only one fault in the SevereFaults class.
We surmise from this analysis that while the ATPGB provides adequate coverage, it is likely limited by the test vector coverage achieved by the pattern generator, as well as the complexities of converting disjointed ATPG test vectors into effective coherent binary programs. A possibly better utilization of ATPG principles in this research track could be the creation of a special five-node scan chain that is made up of the TopCounter nodes. This concept is discussed fully in Section 5.3. Additionally, RCBEs are shown to be effective in detecting the specific faults observed at the IFTCs they are generated from but are not guaranteed to detect any other faults, even though they often do. Also, their modular nature means that multiple binaries or a larger contiguous binary would have to be stored to fully utilize this method. Combining multiple RCBEs into one is a relatively simple process, consisting of overwriting registers that differ between binary programs, then updating instructions. Overall, we believe the RCBE method provides a favorable combination of small binary size and low latency for the achievement of 100% information leakage fault coverage.
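The register-overwrite part of combining RCBEs can be sketched as a comparison of register snapshots. This is an illustration under assumptions, not the authors' tooling: each RCBE is represented by a hypothetical dictionary that maps register names to the values it establishes, and `register_delta` is a hypothetical helper that lists only the registers that must be rewritten before the next binary's trigger instructions run.

```python
def register_delta(current, target):
    """Return the registers in `target` whose values differ from `current`."""
    return {reg: val for reg, val in target.items() if current.get(reg) != val}


# Illustrative register snapshots for two RCBEs (values are made up).
b0_state = {"x5": 0x00000010, "x6": 0x00000020, "x7": 0x00000030}
b1_state = {"x5": 0x00000010, "x6": 0x00000099, "x7": 0x00000030}

# Only x6 needs to be reloaded when B1 is appended after B0.
reload_for_b1 = register_delta(b0_state, b1_state)
```

Emitting load instructions only for the registers in the delta keeps the merged binary shorter than a simple concatenation of the individual binaries.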

Latency Analysis
The objective of latency analysis is to determine whether the FTBEs can detect the presence of faults much earlier than the AES algorithm does, thereby justifying the area overhead that their storage incurs. This goal is addressed for the FTBEs by evaluating latency for each designed binary, then comparing them to other generic binaries. For this analysis, only faults in the SevereFaults class are considered. First, fault-free runs are executed for each binary to acquire fault-free counter values. Then, fault-injected runs are performed for each binary and fault type at 4096-cycle increments from clock cycle 0 to 1,024,000. The latency results are presented as a cumulative fault detection graph in Figure 10, where the number of clock cycles that Potato is run for is plotted along the x-axis and the cumulative number of detected faults is plotted along the y-axis. Each binary program is assigned a different color for differentiation, and a color guide is provided to the right. Both FTBEs demonstrate much better performance in terms of latency when compared to other standard binaries.
The RCBE detects all SevereFaults by 4096 run clock cycles, while the ATPGB triggers 188 faults by the 4096 run clock cycle index.
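A cumulative detection curve of the kind plotted in Figure 10 can be derived from per-fault first-detection cycles. The sketch below assumes a hypothetical list in which each entry is the first cycle at which a fault was detected, or `None` if it was never detected; the data shown are illustrative, not the measured results.

```python
def cumulative_detections(first_detect_cycles, step=4096, last=1_024_000):
    """Return [(checkpoint, faults detected by that checkpoint), ...]."""
    curve = []
    for checkpoint in range(0, last + 1, step):
        count = sum(
            1 for cycle in first_detect_cycles
            if cycle is not None and cycle <= checkpoint
        )
        curve.append((checkpoint, count))
    return curve


# Illustrative data: two early detections, one later, one never detected.
toy_first_detect = [1350, 1350, 5000, None]
curve = cumulative_detections(toy_first_detect)
```

Each point gives the number of faults whose first detection falls at or before that checkpoint, so a binary with lower latency produces a curve that saturates closer to the origin.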

Overhead Analysis
In this section, a comparative analysis is undertaken to evaluate the performance and area overhead of the proposed counter-based Counter-Measure (CM), along with the FTBE, in contrast to previous works. Because our primary focus is detecting leakage-sensitive faults, comparing the overhead of our technique with previous methodologies that aim for complete fault coverage is not a clear comparison, as they essentially address different problems. Leakage-sensitive faults have inherent latency and sequence dependencies, while generic faults might not; this discrepancy is likely to lead to differences in overhead costs of the detection techniques. Additionally, techniques such as continuous symptom monitoring (CSM) and PBIST also have unique complexities that must be taken into account when making comparisons.
The counter circuit, consisting of two 24-bit counters, is analyzed for area overhead using the Synopsys Design Compiler and the ASAP7 standard cell library. The synthesis report indicates an area of 339 µm² per counter, and deploying five counters results in an overhead of 1695 µm². By comparison, the Potato core has an area of 28,510 µm², so the fractional area overhead is approximately 5.9%. The performance overhead is estimated using a checkpoint interval of 100 million instructions, akin to the ACE technique reported above. Notably, unlike the ACE methodology, the number of scan clock cycles is minimal (120 with five 24-bit counters), and the vast majority of self-test time is attributed to program execution. Using the full runtime required for the generated FTBEs to reach maximum fault detection, 1350 clock cycles for the RCBE and 1072 clock cycles for the ATPGB, the performance overhead is estimated at (1350, 1072) ÷ 100 million ≈ (0.00135%, 0.00107%) for the RCBE and the ATPGB, respectively. Thus, our proposed counter-based CM + FTBE methodology incurs minimal overhead compared to existing methods.
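The counter-area and performance estimates can be reproduced from the quantities quoted in the paragraph above. The sketch below is a plain arithmetic check; the variable names are hypothetical, and it treats one clock cycle as one instruction slot, which is an approximation.

```python
COUNTER_AREA_UM2 = 339          # per counter, from the synthesis report
NUM_COUNTERS = 5                # TopCounters
COUNTER_BITS = 24
CORE_AREA_UM2 = 28_510          # Potato core
CHECKPOINT_INTERVAL = 100_000_000
FTBE_CYCLES = {"RCBE": 1350, "ATPGB": 1072}

# Area overhead of the five counters relative to the core.
counters_area_um2 = COUNTER_AREA_UM2 * NUM_COUNTERS
area_overhead_pct = 100.0 * counters_area_um2 / CORE_AREA_UM2

# Scan-out length quoted in the text: five counters of 24 bits each.
scan_cycles = NUM_COUNTERS * COUNTER_BITS

# Performance overhead: FTBE run time as a share of the checkpoint interval.
perf_overhead_pct = {
    name: 100.0 * cycles / CHECKPOINT_INTERVAL
    for name, cycles in FTBE_CYCLES.items()
}
```

Adding the scan cycles to each FTBE run time changes the performance overhead only in the fourth significant digit, which supports the observation that program execution dominates the self-test time.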
It is assumed that the FTBE will be stored in non-volatile memory, which is tightly coupled with the µP, perhaps on chip at some offset from the bootloader. This presents another overhead that must be considered and analyzed. Using an area of 0.52 µm² per memory bit, we estimate the total area of each FTBE. We calculate area overhead by taking into account the total number of bits per binary and the total area of Potato, namely ((6300, 4800) × 0.52 µm²) ÷ 28,510 µm² ≈ (11.4%, 8.7%) for the RCBE and the ATPGB, respectively. Table 3 shows the area overhead comparison of our two generated binaries with the closest equivalent in the literature [8]. While the FTBEs require more overhead area than existing works, the Potato µP analyzed in this work is a much smaller core than the OpenSPARC T1 µP used in [8]. A comparison with the Rocket µP analyzed in [13], which has an overall area of 112,224 µm², yields better overhead results, at 2.9% and 2.2% for the RCBE and the ATPGB, respectively. Our prior analysis of both Potato and Rocket [11,13] implies that the number of counters necessary to detect all severe faults is similar across most µP architectures; therefore, we have reason to believe that the counter-based periodic testing methodology will scale well with larger processor systems.
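The storage-area estimate follows directly from the bit counts, the per-bit area, and the core areas given above. The sketch below is a worked check with hypothetical names; its results agree with the quoted percentages to within about a tenth of a percentage point.

```python
BIT_AREA_UM2 = 0.52                              # area per memory bit
FTBE_BITS = {"RCBE": 6300, "ATPGB": 4800}        # binary sizes in bits
CORE_AREA_UM2 = {"Potato": 28_510, "Rocket": 112_224}


def storage_overhead_pct(bits, core_area_um2, bit_area_um2=BIT_AREA_UM2):
    """FTBE storage area as a percentage of the core area."""
    return 100.0 * bits * bit_area_um2 / core_area_um2


overheads = {
    (core, name): storage_overhead_pct(bits, area)
    for core, area in CORE_AREA_UM2.items()
    for name, bits in FTBE_BITS.items()
}

for (core, name), pct in sorted(overheads.items()):
    print(f"{core:6s} {name:5s} {pct:.2f}%")
```

Because the FTBE size is fixed while the core area grows, the relative storage overhead shrinks roughly in proportion to the size of the core, which is the trend seen when moving from Potato to Rocket.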

Next Steps
The output of the Fault Trigger Binary Executable (FTBE) is a counter value that could be stored in memory and later compared to the fault-free counter value, which is also stored in memory. Additional effort would be needed to create a test controller software binary that works in conjunction with hardware timers to run the FTBE periodically. This test controller would need special rights if running on an operating system, and further analysis would be needed to determine which test frequencies are ideal when PBIST-induced downtime is considered.
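The comparison step of such a test controller can be sketched in a few lines of Python. This is only a software model under stated assumptions: `run_ftbe` is a hypothetical callable that stands in for executing the FTBE and reading back the counters, the golden values are placeholders rather than measured data, and the hardware timer is replaced by a plain loop.

```python
from typing import Callable, List, Sequence, Tuple

Counters = Tuple[int, ...]

# Placeholder fault-free counter values; NOT measured data from this work.
GOLDEN: Counters = (0x00A1B2, 0x00C3D4)


def counters_match(observed: Sequence[int], golden: Sequence[int]) -> bool:
    """True when every observed counter equals its fault-free value."""
    return len(observed) == len(golden) and all(
        o == g for o, g in zip(observed, golden)
    )


def run_periodic_checks(run_ftbe: Callable[[], Counters],
                        golden: Counters,
                        num_checks: int) -> List[int]:
    """Run the FTBE num_checks times; return the indices of failing checks."""
    failures = []
    for check in range(num_checks):
        if not counters_match(run_ftbe(), golden):
            failures.append(check)
    return failures


def make_fake_ftbe(fault_at: int) -> Callable[[], Counters]:
    """Hypothetical stand-in: counters are corrupted from run `fault_at` on."""
    state = {"runs": 0}

    def run() -> Counters:
        run_index = state["runs"]
        state["runs"] += 1
        if run_index >= fault_at:
            return (GOLDEN[0] ^ 0x1, GOLDEN[1])  # flip one bit of one counter
        return GOLDEN

    return run


failed_checks = run_periodic_checks(make_fake_ftbe(fault_at=3), GOLDEN, num_checks=5)
print(failed_checks)  # [3, 4]
```

In a real controller, the loop body would be triggered by a timer interrupt, and a mismatch would raise an alarm rather than append to a list.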
The ATPG-Based FTBE performs poorly compared to the Fault-Run-Cycle-Based FTBE. This is largely because ATPG vectors do not take into account sequential operations and transitions in the internal states of the circuit, which are typically associated with clock cycles or other triggering events. ATPG vectors applied through a scan chain bypass state transitions: the objective is to directly set or reach a particular state without going through the intermediate states that would occur in normal circuit operation. This can be useful in certain testing scenarios where the primary goal is to reach a specific state quickly for fault detection or analysis. However, replicating these optimizations in a standalone binary and achieving the same levels of efficiency is difficult, as discussed in Section 5. A minimized scan chain consisting of only the TopCounter nodes could offer the precision of ATPG vectors without the need to serially scan in data to the more than 34,000 nodes in Potato. Future work could investigate the process of mixing and chaining the TopCounter nodes into a "mini-scan-chain", as well as implementing a specialized BIST controller [25] that serially scans in the test vector and inspects the serial data output. These test vectors would likely be much shorter than those described in this work, while the BIST controller would likely add minimal overhead because it interacts with only a handful of nodes.
In addition, the final goal for this track of research is to demonstrate the counter-based PBIST in a manufactured processor on a viable technology node. Future research could tape out the Potato µP with the counter-based countermeasures inserted at the five TopCounter nodes, with the FTBE and fault-free values stored in memory and a program set to run the FTBE periodically. This instrumented µP could then be placed in a high-radiation environment to test the fault countermeasures. The insights gained from these experiments could be invaluable in assessing failure probability for both temporary and permanent faults and in comparing theoretical countermeasure performance to actual performance.

Conclusions
This paper investigates the generation of specially designed executable binary programs for a counter-based periodic BIST intended for the detection of faults in the Potato RISC-V microprocessor using an FPGA emulation platform. The specially designed binary programs are generated using two different methods, with varying success. The detection and latency capabilities of the designed binaries using the counter-based approach are evaluated on a subset of the active faults referred to as severe faults, which are defined as faults that leak sensitive information, e.g., a portion of the plain text and/or encryption key, through the serial port output. The designed binary programs, when utilized in combination with a small set of strategically placed counters, are shown to achieve high fault coverage and low latency while adding little overhead when compared to competing approaches.

Figure 2. Block diagram of the experimental setup with ROM side load for rapid binary loading and testing. Adapted from [13].

Figure 3. Schematic of the counter circuit without fault injection signals.

3.5. Testing Process
To initialize each test, the C program uses the Processor Configuration Access Port (PCAP) interface to configure the FPGA fabric with the instrumented Potato bitstream. This bitstream incorporates the default AES binary executable stored in the BRAM-based emulated ROM of the µP. The ERSL shown in Figure 4 is used to override the boot memory locations with the executable binary being tested. The fault-free counter values associated with each designed binary are then computed by executing a fault-free run of Potato at 1024 clock cycle increments for the entirety of the AES algorithm execution, which spans 6,717,440 clock cycles. In subsequent steps of the testing process, faults are injected in a set of faulty runs of the selected binary to determine the latency to fault detection.
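The latency measurement described above amounts to comparing counter snapshots from a faulty run against the fault-free snapshots taken at the same cycle counts. The following Python sketch illustrates that comparison. The function name, the snapshot indexing convention (snapshot i is taken at cycle (i + 1) × 1024), and the example values are assumptions made for illustration, not details of the actual test harness.

```python
from typing import Optional, Sequence

SNAPSHOT_INTERVAL = 1024          # clock cycles between counter snapshots
AES_RUN_CYCLES = 6_717_440        # length of the AES run, from the text
NUM_SNAPSHOTS = AES_RUN_CYCLES // SNAPSHOT_INTERVAL  # snapshots per full run


def detection_latency(golden: Sequence[int],
                      faulty: Sequence[int],
                      injection_cycle: int = 0,
                      interval: int = SNAPSHOT_INTERVAL) -> Optional[int]:
    """Cycles from fault injection to the first mismatching snapshot.

    Assumes snapshot i is taken at cycle (i + 1) * interval.
    Returns None when the faulty run never diverges from the golden run.
    """
    for index, (g, f) in enumerate(zip(golden, faulty)):
        if g != f:
            detect_cycle = (index + 1) * interval
            return detect_cycle - injection_cycle
    return None


# Illustrative snapshot values only; not data from the experiments.
golden_run = [10, 20, 30, 40]
faulty_run = [10, 20, 31, 40]

latency = detection_latency(golden_run, faulty_run, injection_cycle=2500)
print(NUM_SNAPSHOTS, latency)  # 6560 572
```

Because snapshots are taken every 1024 cycles, a latency measured this way is resolved only to the snapshot interval.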

Figure 4. Block diagram of the Wishbone side-load architecture.

Figure 5. Flowchart for the generation of the RCBE.

Figure 9. Binary size-to-effectiveness comparison for both FTBEs and several other general binaries.

Figure 10. Subset of fault trigger latencies for select binaries.
Author Contributions: Conceptualization, I.O.S., J.P. and B.D.; methodology, I.O.S.; software, I.O.S.; validation, I.O.S.; resources, I.O.S. and J.P.; data curation, I.O.S.; writing-original draft preparation, I.O.S.; writing-review and editing, I.O.S.; visualization, I.O.S.; supervision, I.O.S., J.P., T.J.M. and B.D.; project administration, I.O.S. and J.P.; funding acquisition, I.O.S., B.D. and T.J.M. All authors have read and agreed to the published version of the manuscript.

Funding: Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.

Table 3. Memory allocation per Counter-Measure.