Fault-Tolerant Hardware Acceleration for High-Performance Edge-Computing Nodes

Abstract: High-performance embedded systems with powerful processors, specialized hardware accelerators, and advanced software techniques are all key technologies driving the growth of the IoT. By combining hardware and software techniques, it is possible to increase the overall reliability and safety of these systems by designing embedded architectures that can continue to function correctly in the event of a failure or malfunction. In this work, we fully investigate the integration of a configurable hardware vector acceleration unit in the fault-tolerant RISC-V Klessydra-fT03 soft core, introducing two different redundant vector co-processors coupled with the Interleaved-Multi-Threading paradigm on which the microprocessor is based. We then illustrate the pros and cons of both approaches, comparing their impacts on performance and hardware utilization with their vulnerability, presenting a quantitative large-fault-injection simulation analysis on typical vector computing benchmarks, and comparing and classifying the obtained results. The results demonstrate, under specific conditions, that it is possible to add a hardware co-processor to a fault-tolerant microprocessor, improving performance without degrading safety and reliability.


Introduction
High-performance embedded systems are becoming increasingly prevalent in daily life, because they are used in various applications, from smartphones and laptops to automotive and aerospace industries, industrial automation, medical equipment, and telecommunications. Since power consumption is critical for battery-powered embedded systems, they must be designed to consume minimal power, which reduces cost and improves portability. For these reasons, one of the key challenges in embedded systems is balancing the need for high performance with low power consumption. This challenge is addressed using hardware acceleration units, which can process large amounts of data in real time and offload specific, computationally intensive tasks from the main processor to specialized hardware, allowing for significant power savings while maintaining a high level of performance.
In recent years, hardware acceleration has become increasingly popular in the Internet of Things (IoT), Artificial Intelligence (AI), and powerful embedded systems [1]. Hardware accelerators implemented as digital signal processors (DSPs) or developed inside Field-Programmable Gate Arrays (FPGAs) can accelerate a wide range of applications [2]. They can be used to implement security features critical in the IoT and embedded systems; they can perform complex encryption and decryption operations in real time that are essential
• Filling the gap in the literature about fault-tolerant hardware acceleration in edge-computing devices.
• Demonstrating how a full-TMR hardware accelerator works, describing its architecture, and proving its functionalities with extensive fault-injection (FI) tests.
• Demonstrating that, thanks to the inherent behavior of an Interleaved-Multi-Threading structure, it is possible to convert a full-TMR hardware accelerator into a single-buffered accelerator without degrading system reliability, decreasing hardware overhead, cost, and power consumption.
The rest of the paper is organized as follows: Section 2 discusses the related works on FT in the IoT and edge-computing systems; Section 3 outlines the proposed architecture ideas and methodology; Section 4 presents FPGA-synthesis comparisons; Section 5 presents the FI tests and the results obtained in terms of fault resilience; Section 6 reports the comparison with existing FT techniques on RISC-V processors. Finally, Section 7 summarizes the conclusions of our study.

Related Works
In the literature, a considerable number of fault-tolerant applications have been proposed for the IoT and for edge and cloud computing. However, none of them provides redundant hardware accelerators "on the edge". The authors in [12] present a review of FT in IoT systems, classifying many FT-IoT works according to the techniques used to obtain FT. Particularly relevant are the cloud-based approaches, in which the impact of a failed component is minimized within an integrated fog-cloud platform; microservices, which implement small single-responsibility applications and support real-time fault detection; and layered views, in which the system is treated as a complex heterogeneous entity decomposed into three to five interacting layers that communicate via software. Microservices are also discussed in [13], where they are used to build a framework with two fault-tolerant features, real-time fault detection and online machine learning, to detect fault patterns and mitigate faults before they are activated, on top of a scalable architecture with individual responsibilities for each micro-application. The work in [14] presents energy-efficient, low-overhead fault mitigation in the IoT, using ECC for single and multiple faults, with SEC DED TAEC 6AED techniques to add redundancy in the IoT and Network on Chip (NoC), and evaluates all these techniques with respect to joint-crosstalk-avoidance multiple-error correction (JCAMEC) and joint-crosstalk multiple-error correction (JMEC).
The authors in [15] present a military view of the IoT (MIoT) for mission-critical scenarios, in which the same techniques that can be applied in safety-critical systems are used, providing virtualization and replication capabilities with cloud or fog platforms able to reduce latency and increase the resilience of all MIoT layers, integrating advanced machine learning techniques for fault detection and diagnosis.
Other mechanisms are based on neural networks or customized for neural network applications. The work [16] presents the Selective TMR tool, which analyzes the sensitivity in a neural network architecture with respect to the overall accuracy and triples the most sensitive parts, increasing functional safety without resulting in full TMR. The work [17] presents a prediction of faults to obtain preemptive migration decisions in case of a future failure of the system, while the authors in [18] present the OR-ML approach to enhance the reliability of "special purpose" ML accelerators by opportunistically exploring the chances of runtime redundancy provided by neighboring PEs without adding computing resources, significantly reducing implementation overhead. Further, in the field of special-purpose accelerators, the authors in [19] present fault-aware pruning (FAP) and retraining (FAP + T). They declare that it is possible to modify the deep neural network architecture to adapt it to the faults without deleting them by re-executing the test, spending extra time overhead.
Over the years, it was taken for granted that node redundancy and network contingencies were the best solutions [20], and a lot of work has been performed on high-level structures, software or hybrid cloud-computing frameworks [21], and virtualization layers provided by containers with checkpointing mechanisms in case of a node failure. However, with the increasing number of edge devices, this solution is no longer feasible on edge-computing nodes [22].
The focus of the proposed study is the definition of an FT microarchitecture for those edge-computing applications requiring both safety and high performance. In this work, we deal with the edge node, trying to obtain redundancy and protection directly inside the node and lightening the load on subsequent higher layers of the system, which are also necessary. We, therefore, deal with the hardware aspects of the single node, equipped with a general-purpose vector co-processor capable of performing any vector arithmetic operation in an FT way, differently from most of the FT vector accelerators created for specific special-purpose applications.
Different techniques have been proposed for FT in IoT systems [12]: time redundancy at the instruction and task levels, with the replication of tasks to obtain redundancy; active or passive replication with multiple processors in fog or cloud, which executes the same process concurrently or executes tasks only in case of fault, respectively; network control, where a cluster head makes requests to the other nodes and confirms a failure in case of no reply message; distributed recovery block, with lock-step execution between nodes and the testing of the results, which is performed only if the test passed. To the best of our knowledge, little work has been performed to protect a single edge node, especially if it has hardware acceleration features.
This work builds on the preliminary idea presented in [4]. With respect to [4], the work presents the following:
• An in-depth analysis of the idea of the replicated VCU, with an expanded and detailed description;
• The introduction of the hybrid-VCU mode, assuming and analyzing the impact of memory-scrubbing techniques within the accelerator scratchpad memory;
• The novel detailed analysis of fault-resilience performance on algorithms of practical interest in hardware acceleration, such as FFT and Matmul.

Methodology
The framework for this study is the Klessydra-fT13 architecture, a fault-tolerant 32-bit RISC-V IMT soft processor integrated inside the PULPino [23] open-source System-on-Chip architecture. The processor is composed of a fault-tolerant non-accelerated scalar core, resembling Klessydra fT03 [7-9], tightly coupled with a fault-tolerant configurable accelerating co-processor unit (Figure 1).
The processor was implemented as an RTL design and synthesized on the Vivado platform. Our analysis focused on the effect of SEU faults on sequential cells in the hardware. The RTL files were integrated in an ad hoc Universal Verification Methodology (UVM) environment, as it has dedicated support for SEU fault injection in the Flip-Flop (FF) cells of the design. In order to be able to inject faults in the RTL code corresponding to FFs, an automatic pre-processing routine was run at the RTL to identify all the synchronous assignments, which generate FFs in hardware implementation. The target of the present work was the validation of the vector accelerator subsystem's resilience, while the scalar core was already validated in previous works, so we concentrated on the fault-injection campaign on the accelerator microarchitecture.
The FI campaign was defined according to the Time Frame simulation methodology (described in Section 5.1), which allows for the calculation of an upper bound to the application program's failure probability assuming that an SEU fault hits the processor. The analysis was performed on individual units of the accelerator microarchitecture, so as to obtain a detailed identification of the critical parts of the design for FT. The analysis was further enhanced by integrating memory scrubbing in the simulation to evaluate the impact of this feature on the overall FT. Further details of the research methodology steps are described in the corresponding sections.

Fault-Tolerant Scalar-Core Microarchitecture
IMT cores are not immediately usable for fault-tolerant applications, because they rely neither on replicated hardware, as multi-core architectures or cores with multiple Functional Units (FUs) do, nor on Simultaneous-Multi-Threading (SMT), so the results of different threads are not available at the same time [4]. We exploited the IMT architecture to merge spatial redundancy and temporal redundancy, developing the Buffered TMR technique, in which three threads are instances of the same program that maintain their state in redundant registers (spatial redundancy) with shared combinational logic and execute interleaved instructions with a one-cycle time distance between any two instructions belonging to different threads (temporal redundancy). We modified a non-hardened core [10,11], proving the intrinsic FT capability of an IMT architecture with three identical threads, each having its own register file (RF), program counter (PC), and control/status registers (CSRs). The Buffered TMR paradigm relies on specific buffer registers that hold the state produced by the corresponding thread until the last of the three threads (Thread 0) has executed, and only then are the results voted. In the pipeline on the left side of Figure 1, the buffer registers can be seen within the units that produce data, such as the Execution Unit (EXE) or Load-Store Unit (LSU). Write-back voting occurs when Thread 0 reaches the write-back phase; likewise, load/store operations are performed by Thread 0 on voted values. Because of the buffer registers in the LSU, it is possible to prevent replicated load/store access to the same location, consume less power, and avoid undesired behavior when reading peripherals. The interested reader may refer to [7,8] for additional details and performance evaluation of fault-tolerant scalar cores.
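The Buffered TMR flow described above can be illustrated with a minimal Python sketch. It is a behavioral toy model, not the Klessydra RTL: the names `majority_vote` and `BufferedTMR`, the 32-bit word width, and the assumption that the threads produce their results in the order 2, 1, 0 are all introduced here for illustration only.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit datapath


def majority_vote(a, b, c):
    """Bitwise two-out-of-three majority of three words."""
    return ((a & b) | (a & c) | (b & c)) & MASK32


class BufferedTMR:
    """Toy model: Threads 2 and 1 park their results in buffer registers;
    the arrival of Thread 0 (assumed to be the last one) triggers the vote."""

    def __init__(self):
        self.buffers = {}

    def produce(self, thread_id, value):
        if thread_id in (1, 2):
            # Earlier threads only fill their buffer; nothing is committed yet.
            self.buffers[thread_id] = value & MASK32
            return None
        # Thread 0: vote among the two buffered values and its own result.
        voted = majority_vote(self.buffers[2], self.buffers[1], value)
        self.buffers.clear()
        return voted


unit = BufferedTMR()
unit.produce(2, 0x000000FF)
unit.produce(1, 0x000000FF ^ 0x10)  # single-bit upset in Thread 1's result
result = unit.produce(0, 0x000000FF)
print(hex(result))
```

In this sketch the upset bit in one buffered copy is outvoted by the two agreeing copies, which is the behavior the buffer registers are meant to enable.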

Fault-Tolerant Vector Co-Processor
The primary targets of this work are the application of the Buffered TMR concept to the vector acceleration unit and its evaluation. The Vector Co-Processing Unit (VCU) in Klessydra processors is a highly configurable general-purpose vector accelerator [10,11] capable of processing up to 256 bits of data in a single cycle. It comprises two main components: the Vector Co-Processing Engine (VCE), which executes vector instructions, and the Scratchpad Memory Subsystem (SPMI), which locally stores intermediate results of the computation kernel, similarly to a vector register file. The VCU can be customized using a variety of parameters, such as the number of SPMs per SPMI, SPM size, Single-Instruction Multiple-Data (SIMD) width, and execution paradigm, which usually enables Multiple-Instructions Multiple-Data (MIMD) capabilities by dedicating a co-processor to each thread of execution in the application. In this study, we particularly focus on the execution paradigm parameter, as customizing its usage can affect the FT features of the co-processor. There are three main execution paradigm configurations in the vector co-processor: a single VCU shared by all threads, a distinct VCU for each hart, and a hybrid VCU. The first configuration offers no redundancy and thus is not suitable for FT purposes. In the scope of fault-tolerant accelerators, we are particularly interested in the latter two schemes, because they maximize instruction-level parallelism (ILP) and offer some degree of potential redundancy in the execution. For details on the internal operation of the vector accelerators used in this work, the interested reader may refer to [10,11].
In the replicated-VCU scheme, each thread in the fT13 core has its own VCE and SPMI. When a vector instruction arrives at the decoding stage, it is directed to the VCU corresponding to its hardware-thread identifier (hartID). The instruction then requests the data from its own SPMI, is processed, and writes the result back to its own destination SPM. A stall in a thread can only occur when its own VCU is busy. Figure 2 illustrates this scheme. In a non-hardened accelerator, the scheme allows for the parallel execution of vector instructions, whereas in fT13, it is exploited to obtain redundant execution of the same vector instruction among three parallel accelerators. Therefore, the first characteristic of this structure is the concept of spatial redundancy at the architectural level. The thread-dedicated VCUs can implement a hardware replication typical of multi-core fault-tolerant systems, offering instruction execution redundancy at the cost of reducing the MIMD capabilities of the vector co-processor, which can only work in SIMD mode. Since the three SPM banks work in parallel and independently from each other, yet each of them writes the same data owing to the temporal redundancy of the instructions, it would be theoretically possible to further increase the architectural redundancy using voting when the three threads read data from all of the three scratchpads simultaneously. However, since each SPM bank has a single read port, this solution is not feasible in terms of hardware utilization, and neither is the cost of implementing a voting mechanism on writes. We opted for a solution in which each thread writes its results on its own SPM only, and any SPM writing errors are corrected when writing back the results in the data memory using redundancy in the LSU [8]. While this mechanism can solve isolated inconsistencies in the final data, it does not prevent error accumulation and propagation within the SPM banks.
As a consequence, errors in an SPM related to one thread could add up to other errors in another SPM related to another thread, thereby preventing majority voting. Depending on the statistical frequency of such situations, ECC protection or a memory-scrubbing mechanism may be necessary, capable of autonomously checking and correcting all the SPM banks, as it already happens within the data memory and instruction memory of the scalar core.
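The accumulation problem can be made concrete with a small Python sketch. It models the three thread-private SPMs as plain lists and the LSU redundancy as a word-wise majority vote; the function names and the scrubber, which here simply rewrites every copy with the voted content, are assumptions for illustration and do not describe the actual hardware.

```python
def vote(a, b, c):
    """Bitwise two-out-of-three majority."""
    return (a & b) | (a & c) | (b & c)


def write_back(spms):
    """Word-wise vote across the three SPM copies, as done on the way to data memory."""
    return [vote(a, b, c) for a, b, c in zip(*spms)]


def scrub(spms):
    """Assumed ideal scrubber: overwrite each copy with the voted content."""
    golden = write_back(spms)
    for spm in spms:
        spm[:] = golden


golden = [0x11, 0x22, 0x33]

# Case 1: an isolated fault in one SPM is outvoted at write-back.
spms = [list(golden) for _ in range(3)]
spms[0][1] ^= 0x04
single_fault_ok = write_back(spms) == golden

# Case 2: the same bit is hit in two different SPMs; the vote now picks the wrong value.
spms = [list(golden) for _ in range(3)]
spms[0][1] ^= 0x04
spms[1][1] ^= 0x04
accumulated_ok = write_back(spms) == golden

# Case 3: scrubbing between the two faults restores the majority.
spms = [list(golden) for _ in range(3)]
spms[0][1] ^= 0x04
scrub(spms)
spms[1][1] ^= 0x04
scrubbed_ok = write_back(spms) == golden

print(single_fault_ok, accumulated_ok, scrubbed_ok)
```

The second case is the situation the text warns about: two errors in different SPMs line up and defeat majority voting unless something repairs the first one in time.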
In the hybrid-VCU scheme, each hardware thread in the fT13 core has its own SPMI, but all threads share the Functional Units in the VCE. This scheme allows for the parallel execution of vector instructions as long as the different threads use different Functional Units; otherwise, they are buffered in a queue. When a vector instruction arrives at the decoding stage, it is directed to the VCU. The instruction then requests and stores local data from/to its thread-related SPMI. During vector-instruction execution, if a vector instruction of another thread arrives at the VCU, it checks whether the Functional Units to be used are free, and in such a case, the thread goes ahead in parallel with the first thread, on the same accelerator yet in a different Functional Unit. Figure 3 illustrates this scheme. A key characteristic of this scheme is that harts executing instructions redundantly always contend for the same, busy Functional Unit in the VCU. Hence, the scheme automatically deploys temporally redundant execution of instructions, as an identical instruction is dispatched after the first instruction is executed. Since, in the hybrid-VCU scheme, each thread has its SPMI and the VCU operation is no longer symmetrically parallel, it is not possible to vote on writing back the results from the SPMs to the main memory. In this context, memory-scrubbing mechanisms are more necessary than in the previous case and should be implemented after each memory write operation.
Figure 3. Hybrid-VCU scheme with shared hardware. Once the instruction is decoded for each thread, it goes to the VCU and requests data from its own SPMI, processing the instruction inside the shared free FU, waiting in case of busyness, and then writing back the result inside the SPM.
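The arbitration behavior of the hybrid scheme can be sketched with a few lines of Python. This is only a timing toy model under stated assumptions: the `schedule` function, the Functional Unit names, and the latencies are hypothetical, and instructions are assumed to be served in issue order.

```python
def schedule(instrs, latency):
    """Serve instructions in issue order on shared Functional Units.

    instrs:  list of (issue_cycle, thread, fu) tuples
    latency: dict mapping an FU name to its (hypothetical) latency in cycles
    Returns a list of (thread, fu, start_cycle, end_cycle).
    """
    free_at = {}  # cycle at which each FU becomes free again
    timeline = []
    for issue, thread, fu in sorted(instrs):
        start = max(issue, free_at.get(fu, 0))  # wait in the queue while the FU is busy
        end = start + latency[fu]
        free_at[fu] = end
        timeline.append((thread, fu, start, end))
    return timeline


latency = {"mul": 4, "add": 2}  # illustrative values only

# Redundant execution: the three harts issue the same instruction one cycle apart.
redundant = schedule([(0, 2, "mul"), (1, 1, "mul"), (2, 0, "mul")], latency)

# Non-redundant execution: different FUs, so the two threads overlap.
mixed = schedule([(0, 2, "mul"), (1, 1, "add")], latency)

print(redundant)
print(mixed)
```

With identical instructions the three executions are serialized on the same unit, which is the temporally redundant behavior described above; with different units the threads proceed in parallel.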
Table 1 shows the hardware configuration of Klessydra-fT13 in both approaches, compared with that of the non-hardened Klessydra-T13. Both fT13 versions handle only SIMD mode: since instructions are replicated among the threads, Klessydra-fT13 behaves as a single-thread processor and loses the MIMD capabilities. From the hardware point of view, synthesis results from Vivado 2022 targeting a Genesys2 board show larger hardware utilization than the Klessydra-T13 version, as well as a decrease in hardware acceleration performance and frequency due to the redundancy and voting blocks. Comparing the hardware utilization of the two accelerators shows that choosing the hybrid-VCU scheme over the replicated VCU saves hardware resources at the expense of the FT level.

Validation
To validate the resilience of both architectures, we implemented a Time Frame Spanning fault-injection (FI) simulation [24], within a Universal Verification Methodology (UVM) environment able to simulate SEU faults hitting synchronous registers in the design. We, furthermore, implemented a classical random Monte Carlo FI simulation specifically devoted to analyzing the potential improvement given by an ideal memory-scrubbing unit in the SPMs of the VCUs.
In the analysis, we used two common machine learning (ML) and digital signal processing (DSP) application algorithms as the benchmark workload, namely, FFT and Matrix multiplication. Each FI experiment covers the entire program execution, injecting logic value flips in target bits previously selected within the processor design as an Architecturally Correct Execution (ACE) bit [25]. By definition, an ACE bit is a bit of the architecture in which a fault in a specific clock cycle may cause a program failure.

Failure Probability Estimation with Time Frame Spanning FI
The Time Frame Spanning methodology [24] divides the whole execution time into M different intervals called time frames. Unlike classical Monte Carlo FI methods, the approach performs the deterministic injection of many faults within the architecture in a specific time frame within the total execution time for each target bit in the hardware architecture. An ad hoc simulation is launched for each target bit, injecting at least one fault every 40 clock cycles for each of the M time frames, resulting in an extremely high-fault-rate analysis. Table 2 shows the Time Frame Spanning setup planned to analyze the core under FI. We decided to only target the internal bits of the VCU, as the other units were previously studied in different works [7,8], comparing the results with those obtained on the non-hardened Klessydra-T13 processor, containing the same vector accelerator.
Figure 4 shows the FI results on the FFT benchmark for all the architectures under analysis. Comparing the FI results of both processors highlights how the replicated-VCU case has greater FT properties than the hybrid-VCU case. The Klessydra-fT13-hybrid processor has a much lighter structure than the fT13-replicated one, with a single set of FUs shared among the threads. Having many shared hardware resources, it is inherently more prone to failures. Nonetheless, owing to the Buffered TMR paradigm, we found that this shared version of the VCU may exhibit relatively good FT features. Observing the results, it can be seen that the green bars of the replicated-VCU case and the blue bars of the hybrid VCU represent the advantage obtained in terms of fault tolerance. In the replicated case, the advantage is represented by the reduction in the total failure rate, obtained as the weighted average based on the number of bits in each register, from 40% to about 3.7%. In comparison, this advantage is minor in the hybrid case, with a decrease from 31% to about 22%.
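Two steps of the procedure above lend themselves to a short Python sketch: selecting the injection cycles inside one time frame, and collapsing per-register failure rates into a single bit-weighted figure. The function names, the equal-length split of the execution time, and the example numbers are assumptions for illustration; they are not the actual setup of Table 2 or the data of Figure 4.

```python
def frame_bounds(total_cycles, m_frames, k):
    """Start (inclusive) and end (exclusive) cycle of time frame k, assuming equal-length frames."""
    return k * total_cycles // m_frames, (k + 1) * total_cycles // m_frames


def injection_cycles(total_cycles, m_frames, k, period=40):
    """Cycles at which a fault is injected in frame k: one every `period` cycles."""
    start, end = frame_bounds(total_cycles, m_frames, k)
    return list(range(start, end, period))


def weighted_failure_rate(registers):
    """Average of per-register failure rates, weighted by register width in bits.

    registers: dict mapping a register name to (width_in_bits, failure_rate).
    """
    total_bits = sum(bits for bits, _ in registers.values())
    return sum(bits * rate for bits, rate in registers.values()) / total_bits


# Hypothetical run: 1000 cycles split into M = 5 frames, faults injected in frame 1.
cycles = injection_cycles(1000, 5, 1)

# Hypothetical registers: a 1-bit enable failing 50% of the time, a 3-bit latch failing 10%.
rate = weighted_failure_rate({"enable": (1, 0.5), "latch": (3, 0.1)})

print(cycles, rate)
```

Weighting by width means that a wide register with a modest failure rate can dominate the total, which is worth keeping in mind when reading the aggregated percentages.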
The lowest level of failures related to the T13 hybrid case is due to the fault-masking effect. For the non-hardened T13 processor, it is possible to hypothesize that some faults are masked by the execution of the last thread in interleaving order. With three threads interleaved in temporal execution, faults that occur during Thread 2 or Thread 1 can be masked by the execution of Thread 0, resulting in a low failure rate. Thread 0 is used to write down the obtained result into the same scratchpad memory location of the previous threads, deleting any previous error. Forcing high fault rates avoids the masking effects that lead to observing errors inside the scratchpad memory, which in turn are written inside the data memory despite the ability to hide them. In the T13-hybrid case, to further break down the masking problem, the test was only performed on the single Thread 0, reporting failures only for its related signals. The minor advantage of fault tolerance in the hybrid case is due to the Functional Units shared by all the threads that do not allow for the parallel execution of the instructions. There is now no complete spatial redundancy, and the Buffered TMR technique guarantees the only redundancy level. For both FFT tests, comparisons are made with the replicated-VCU and hybrid-VCU versions having memory scrubbing, bringing the weighted failure rates to 1.53% and 1.18% for replicated and hybrid, respectively. As a proof of concept, we also developed an analysis for the system with memory scrubbing. For both cases, it can be noted that the impact on the use of memory scrubbing is considerable and almost comparable, pointing out how the periodic control of memory systems in FT architectures is a critical and essential task. Figure 5 shows the breakdown of the system failure probabilities associated with different registers inside the VCU for the Matmul benchmark, for the best-performing fT13-replicated VCU architecture. 
In this figure, we can recognize the same characteristics described for Figure 4. The non-protected architecture experiences failure rates close to 90% in most registers. Similar to Klessydra-fT03 [7,8], the protection level is also extended with temporal redundancy with the use of the same instructions among different threads, providing a dual protection layer (spatial and temporal) on all operations that activate the co-processor [4], saving the output produced by each thread within the individual, dedicated SPMs. Figure 5 also shows that each SPM can be treated as a result buffer in the context of a Buffered TMR paradigm. Voting among the SPMs during LSU operations leads to the deletion of most of the faults that would otherwise enter the main data memory. Figure 5 points out again how SPM scrubbing leads to enormous benefits even in the replicated-VCU case compared with the non-scrubbed case.
[Figure 5: failure probabilities per VCU register for the Matmul benchmark. The axis lists the VCU registers, each replicated per thread with indices (0) to (2): dsp_data_gnt_i_lat, adder_stage_1_en, adder_stage_2_en, adder_stage_3_en, halt_dsp_lat, busy_DSP_internal_lat, shift_en, add_en, mul_en, carry_16.]

Monte Carlo FI Simulation to Analyze Memory-Scrubbing Impact
We performed an additional, conventional, random Monte Carlo FI simulation to better highlight the impact of SPM scrubbing. In this simulation, each bit is tested individually, and for each test run, a certain number of random faults is inserted, gradually increasing. There is no interest in seeing the failure rate of the individual bits, and the test results of all the architecture bits are averaged among them and reported in a graph. Without loss of generality, we report the results for the fT13-hybrid architecture, with the other case being substantially equivalent. Figure 6 shows the Monte Carlo FI process with randomly injected faults in a range from 10 up to 2000 for the FFT benchmark, for each of the 2154 target bits in the VCU. Comparing a non-hardened T13-hybrid processor and the fT13-hybrid processor shows a constantly lower failure rate in the latter. For a low fault rate, this solution results in good FT performance. Comparing Figure 6 and Figure 7, it is possible to note that most of the failures are related to memory inconsistencies with respect to the golden reference, due to incorrect values inside the scratchpad memories, which, if not corrected by a voting system, inevitably lead to incorrect values in the data memory. As proof of concept, we analyzed the impact of an ideal memory-scrubbing mechanism that checks the SPM banks and periodically corrects wrong values when scratchpads are not in use. The orange line in Figure 6 represents the fT13-hybrid processor with SPM scrubbing, which results in a lower failure rate. This shows the theoretical optimal performance that we would obtain by having the ideal hardware units capable of performing perfect SPM scrubbing (Figure 7). Practical implementations with corresponding hardware impacts will be addressed in future works.
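The shape of such a campaign can be sketched with a toy Monte Carlo model in Python. It is far simpler than the UVM-based simulation used in this work: faults are injected only into three redundant copies of a small scratchpad, a run counts as a failure when the word-wise majority of the copies is wrong, and the scrubber is assumed ideal, running after every fault. The geometry, function names, trial count, and seed are all assumptions for illustration.

```python
import random

WORDS, BITS, COPIES = 8, 8, 3  # hypothetical toy scratchpad geometry


def majority(a, b, c):
    """Bitwise two-out-of-three majority."""
    return (a & b) | (a & c) | (b & c)


def run_trial(n_faults, scrub, rng):
    """Return True if the voted scratchpad content ends up wrong (a failure)."""
    # err[k][w] is the XOR error mask of word w in copy k (0 means correct).
    err = [[0] * WORDS for _ in range(COPIES)]
    for _ in range(n_faults):
        k, w = rng.randrange(COPIES), rng.randrange(WORDS)
        err[k][w] ^= 1 << rng.randrange(BITS)  # single-bit upset
        if scrub:
            # Ideal scrubber: rewrite the hit word in all copies with the voted value.
            fixed = majority(err[0][w], err[1][w], err[2][w])
            for copy in err:
                copy[w] = fixed
    return any(majority(err[0][w], err[1][w], err[2][w]) for w in range(WORDS))


def failure_rate(n_faults, scrub, trials=200, seed=0):
    """Fraction of failed runs for a given number of injected faults."""
    rng = random.Random(seed)
    return sum(run_trial(n_faults, scrub, rng) for _ in range(trials)) / trials


# Sweep the number of injected faults, without and with the ideal scrubber.
sweep = {n: (failure_rate(n, False), failure_rate(n, True)) for n in (1, 10, 100, 2000)}
for n, (plain, scrubbed) in sweep.items():
    print(n, plain, scrubbed)
```

Because the scrubber in this model repairs every single fault before the next one arrives, its failure rate stays at zero by construction; it represents the theoretical best case rather than a practical scrubbing unit.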
Overall, the failure rate of the system stays around 1-2% with the support of memory-scrubbing techniques, with a decrease from 17,006 to 12,117 in FFs and from 53,198 to 50,913 in LUTs for the SIMD 8 configuration, thanks to the shared Functional Units. While we do not have reliable, precise data on power consumption, we expect the significantly lower use of hardware resources to also result in a decrease in power consumption.

Comparison with Existing Fault-Tolerance Techniques
Since our experiments target the RISC-V instruction set architecture (ISA) [26], RISC-V-compliant fault-tolerant cores are the reference architectures for performance comparison, so that the results are not influenced by the use of different ISAs [8]. In order to compare different FT designs, it is possible to refer to the normalized failure percentage (NFP) figure, obtained as the ratio between the failure percentage of the protected microarchitecture and the failure percentage of the unprotected version of the same microarchitecture subjected to the same FI campaign [27]. Table 3 reports the outcome of the analysis. In [28], the authors present the BL-TMR software suite, which can analyze a Xilinx Vivado netlist and add triple redundancy to critical nodes of the RISC-V Taiga processor implemented on a Kintex Ultrascale device, resulting in hardware overheads of 3X in FFs and 5.64X in LUTs. Based on the declared improvement of 24X in the average time between failures, we can estimate an NFP of 1/24, approximately 4.2%, by assuming a uniform time distribution of radiation-induced faults. In [27], the authors protected the Rocket RISC-V processor with a distributed TMR technique, exhibiting 3X overhead in FFs and LUTs and 2.3% NFP. In contrast, the work reported in [29] exploits the idea that only the most statistically frequent ALU operations require protection to reduce hardware overhead, obtaining NFP in a range from 40% to 5.63%, with FI affecting only the protected units. In [30], the authors use a Hamming code to protect the register file and the program counter, and TMR to protect the ALU and control logic. The modified microarchitecture was implemented on a Xilinx Zynq FPGA, utilizing 1.19X FFs and 1.70X LUTs and obtaining 7.7% NFP. In the non-accelerated Klessydra-fT03 design [8], the overhead was reported as 1.09X in FFs and 1.17X in LUTs, with an NFP of 2.3%, compared with the 1.53% NFP obtained in the presented work.
The latter result is related to FI tests carried out on the VCU accelerator subsystem, which was the target of our analysis, so the actual NFP on the entire processor (Figure 1) can be estimated as an average between 1.53% and 2.3% depending on the VCU configuration and the consequent ratio between the areas of the scalar processing pipeline and of the VCU.
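The figures of merit used in this comparison reduce to simple arithmetic, sketched below in Python. The function names are hypothetical, and the area-weighted combination is only one possible reading of the averaging mentioned above, with the area values chosen arbitrarily for illustration.

```python
def nfp(protected_failure_pct, unprotected_failure_pct):
    """Normalized failure percentage: protected over unprotected failure percentage, in percent."""
    return 100.0 * protected_failure_pct / unprotected_failure_pct


def nfp_from_mtbf_gain(gain):
    """NFP implied by an improvement factor in mean time between failures,
    assuming radiation-induced faults are uniformly distributed in time."""
    return 100.0 / gain


def area_weighted_nfp(nfp_vcu, nfp_scalar, area_vcu, area_scalar):
    """Assumed whole-processor estimate: average of the two NFP values weighted by area."""
    return (nfp_vcu * area_vcu + nfp_scalar * area_scalar) / (area_vcu + area_scalar)


print(round(nfp_from_mtbf_gain(24), 2))   # 24X gain, about 4.17
print(nfp(2.0, 50.0))                     # hypothetical failure percentages
print(area_weighted_nfp(1.53, 2.3, area_vcu=1.0, area_scalar=3.0))  # hypothetical areas
```

Whatever the area ratio, an area-weighted average of this kind stays between the two input values, consistent with the estimate lying between 1.53% and 2.3%.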

Conclusions
This work investigated the potentials and the issues of designing fault-tolerant hardware acceleration for edge-computing devices. The first research goal mentioned in the introduction was attained by performing a comprehensive study on hardware acceleration in edge-computing devices and demonstrating how hardware techniques like Buffered TMR can improve FT using redundancy in hardware co-processors within an IMT RISC-V core for high-performance embedded systems, filling a gap in the existing literature. We described two co-processor architectures, replicated VCU and hybrid VCU, allowing us to obtain a fault-tolerant co-processor companion to a fault-tolerant scalar core. The second research goal was attained by proving all the full-TMR replicated-VCU hardware accelerator functionalities, with extensive fault-injection tests on DSP benchmarks, commenting and reporting all the results regarding hardware overhead and fault tolerance. The third research goal was reached by describing a different hybrid architecture, demonstrating that thanks to the inherent behavior of an Interleaved-Multi-Threading structure, it is possible to replace a full-TMR hardware accelerator like the replicated VCU with a single-buffered accelerator like the hybrid VCU with limited degradation of system reliability and significantly less hardware overhead, also describing the potential impact of SPM scrubbing on reducing the failure probability.