Fault Tolerant Digital Data-Path Design via Control Feedback Loops

: In this paper, we propose a novel fault tolerant methodology for digital pipelined data-paths called Control Feedback Loop Error Decimation (CFLED), that reduces the error magnitude at the outputs. The data-path is regarded from a control perspective as a process affected by perturbations or faults. Based on the corresponding dynamic model, we design feedback control loops with the goal of attenuating the effect of the faults on the output. The correction loops apply correction factors to selected data-path registers from blocks that have their execution rewinded. We apply the proposed methodology on the data-path of a controller designed for a 2-degree of freedom robot arm, and compare the cost and reliability to the generic triple modular redundancy. For Field Programmable Gate Array (FPGA) technology, the solution we propose uses 30% less slices with respect to Triple Modular Redundancy (TMR), while having a third less digital signal processing blocks. Simulation results show that our approach improves the reliability and error detection.


Introduction
Reliability represents a key factor in electronic devices that operate in radiation prone environments, frequent in applications in the aerospace domain. It is achieved by adding redundancy to the digital circuits, with cost increase acting as the main drawback. Fault tolerant design is based on adding some form of replication, or using specially designed error detection and correction codes, so that the cost overhead with respect to the nominal solution is acceptable, and the target reliability is achieved.
The most common way of increasing the reliability of a circuit is to employ Triple Modular Redundancy (TMR), i.e., using three copies of the digital circuit and a voter structure [1]. The main drawbacks of TMR are its high cost and the sensitivity of errors affecting the voters. Regarding the former, schemes to reduce the cost of the TMR, such as approximate TMR [2,3] or inexact TMR [4], have been proposed. Furthermore, the use of TMR in the context of approximate computing is discussed in [5]. For arithmetic circuits, the concept of reduced precision replicas has been applied, with the most significant part of the circuit being triplicated [6,7]. Reduced precision replicas can also be used in a dual manner, with the lower cost data-path working in a higher supply voltage domain and providing a highly reliable reference [8]. Other works improved reliability by increasing the resilience of the voting circuits within the TMR, such as the self checking voting circuits [9], adaptable bit-width voter [10], or employing dedicated voters for approximate TMR [11]. Modular redundancy approaches use multiple instances or truncated versions of the same nominal circuit. Other solutions use different types of redundancies. For example, approaches using error correction codes-Hamming or Reed-Muller codes-in combinational circuits have been proposed in [12]. Redundant residue number systems have been employed for signal processing data-paths in [13,14]; in this case, modular redundancy is employed, with each replica having its own residue representation. Another approach uses a design inspired from the Markov Random Field (MRF) theory. In [15], a complementary dual modular redundancy with an MRF voter is proposed, while [16] uses coding-based partial MRF, where the circuit is employed using MRF-based logic components. The common drawback of MRF-based implementations is the high implementation cost. Therefore, for non time-critical applications, several approaches using testing methods and redundancy have been proposed [17], with fault detection being performed during a circuit testing phase.
Different from all the mentioned approaches, this work proposes a control engineering approach-Control Feedback Loop Error Decimation (CFLED)-for improving the reliability of digital data-path processing pipelines, by employing feedback control loops to reduce the error magnitude. CFLED targets applications that rely heavily on arithmetic operations, such as signal and image processing or control and artificial intelligence applications. In arithmetic data-paths, error magnitude is an important factor, as low-magnitude errors may by tolerated by the application. The goal of reducing the error magnitude, and not to completely remove the fault, is also employed in fault tolerant techniques that rely on approximate redundancy [2,5], inexact redundancy [4], or reduced precision replicas [7]. The proposed technique may be applied for arithmetic data-paths implemented on both Application-Specific Integrated Circuit (ASIC) and Field Programmable Gate Array (FPGA) technology.
To the best of our knowledge, CFLED represents the first approach that uses control theory in order to improve the reliability of processing data-paths. Fault-tolerant control for finite state machines has been considered in e.g., [18,19]. The design for these use cases is aided by the finite state machine characteristics: finite and rather low (at most tens) number of states, low number of inputs (typically 1-bit inputs), and finite number of state transitions. In contrast, the proposed approach considers a dynamic model that captures high order input-state-output dynamics, where inputs, states, and outputs have continuous values within their variation domains.
We analyzed the proposed approach on the hardware architecture for the two-degree of freedom robotic arm controller (2-DOF) presented in [20]. The considered use case consists of a processing pipeline that relies on function evaluations-performed using Taylor series-scalar-matrix multiplications, and vector-matrix multiplications. Such operations are widely used in applications such as graphic processing and signal and image processing. Furthermore, this use case represents a novel application in control engineering: design of fault tolerant controller when the faults affect the controller implementation. Other approaches for fault-tolerant control, such as the ones used for electronic power converters [21], induction motors [22], filters [23], robot control, and localization [24,25], assume faults in components of the controlled system, usually actuators or sensors. This paper is organized as follows: Section 2 discusses the methodology-the discrete time dynamical model of the circuit used to derive the feedback controller; Section 3 discusses a case study for the proposed methodology and presents FPGA implementation costs and reliability estimates; concluding remarks are presented in Section 4.

Control Feedback Loop for Fault Mitigation
The proposed methodology targets arithmetic data-path circuits consisting of several processing stages built out of arithmetic operations: additions, accumulations, multiplications, and multiply-add fused. The operands can be either registers or constants. Thus, we consider data-paths commonly used in a wide variety of applications fields that rely on matrix-vector/matrix-matrix multiplications, convolutions, or weighted sums, and are therefore dominant in fields like signal and image processing, computer graphics, or artificial intelligence. Each processing stage i consists of a sequence of registers {R i 0 , R i 1 , . . . , R i n i }, and it takes n i clock cycles to compute a set of n i s data output elements out of a total of n i el data values stored to memory or inside registers. Thus, in order to produce a set of output results for stage i, a total of n i × n i el /n i s clock cycles (cc) are required. The design process for enhancing the reliability of a given nominal circuit data-path involves: (i) Modeling the nominal circuit, (ii) CFLED design and simulation, (iii) CFLED operation, and (iv) hardware implementation of the enhanced CFLED circuit.

Dynamic Model of Digital Data-Path
Given a digital data-path, our first goal is to develop a mathematical model of the operations performed and the propagation of values through the data-path. This model forms the basis for designing a control law that will be used to correct the output of the circuit.
A common way to model a system from a control perspective is by using a state space representation. Since digital circuits are synchronous, we employ a discrete time-domain state-space representation, where the sample k corresponds to the current clock cycle, and has the general form: where x denotes the states describing the circuit, u the inputs, η some parameters of the dynamic system, and f is a vector function-to be determined-that describes the evolution of the states in time. For a given circuit, each (partial) result (in case of iterative loops) is mapped to a state variable. Hence, each meaningful register value at different clock cycles has a corresponding state variable. Furthermore, a data-path (process) model is built by the composition of several processing stages i. In this work we consider arithmetic data-path circuits, with the operations: addition, accumulation, multiplication, and multiply-add. Since we map each (partial) result to a state, the value of the state may be obtained as either: (1) addition of a constant to a state; (2) addition of two states; (3) multiplication of a state by a constant; or (4) multiplication of two states. The dynamic model that describes the circuit aggregates all the operations throughout the clock cycles and therefore it may contain only these four mathematical operations. Consequently, the model chosen is of the form: where κ(x(k)) denotes the Kronecker product of the vector x(k) with x(k), Q, and A are matrices of appropriate dimensions, a is the vector of biases or affine terms, and δ denotes the faults, possibly with a known distribution.
To determine the exact values of the parameters-matrices Q and A and affine term a-and the correspondence between the states and registers, we adopt the following convention: 1. The affine term a contains all inputs to the circuit. 2. Since scalar multiplications are commutative, the Q matrix is not unique. Nonzero values should appear in the first element corresponding to a multiplication.
The model and its parameters are obtained as follows: 1. Write the chronological operations for each clock cycle; 2. assign a state variable to every register containing a partial result at different clock cycles; 3. define the dynamic of each state variable as the operations performed to obtain the value in the corresponding register at the corresponding clock cycle; 4. separate the bilinear, linear, and affine terms in the resulting equations; 5. collect the coefficients of the bilinear terms in the matrix Q, the coefficients of the linear terms in the matrix A, while the affine terms, including the inputs to the data-path, will form the vector a.
Note that a model developed for a given circuit is not unique. We have formulated a generic dynamic structure that can be used for any circuit implementing arithmetic operations. In what follows, we illustrate the modeling steps on the circuit in Figure 1. In this example, In 1 and In 2 denote the inputs of the circuit. We define the state variables as the register values (states) in reverse order, i.e., x 1 (k + 1) = R3(n + 2), x 2 (k + 1) = R2(n + 1), etc., and express each of them as a function of the state variables at the previous sample. This leads to the state equations: . . .
which can be written in a vector form as with 0 n×m we denote the zero matrix of dimensions n × m, I n the identity matrix, and 1 n a column vector of 1 s. Models of multiple pipelined sub-components can be combined as follows: (1) The outputs of pipeline stage block i become states of pipeline stage i + 1; (2) modify the matrices A and Q and vector a appropriately. Thus, larger models of more complex circuits can be built from individual sub-models of its parts.

Controller Design
The first objective of the controller is to reduce the fault effect at the digital circuit outputs, such that it is as close as possible to a reference value (i.e., gold output result). The second objective is to converge as fast as possible to the correct value. A way to include control inputs into model (2) is to include a linear input, modifying the system model as where u denotes the vector of the control inputs, k is the current clock cycle, B is the input matrix that determines which state is affected by which input and in what measure, y is the vector of outputs available for correction, and C is the output selection matrix. Next, we define some specific design constraints. From the circuit side, we need to ensure that even if multiple states are mapped to the same register, the control input is the same. Note that in (4), the matrix B is an n x -by-n u matrix, where n x is the number of states and n u is the number of inputs. In this matrix, each element B ij indicates the measure in which the input corresponding to the column j is applied to the state corresponding to the row i. If the same input is applied to several states, then the values corresponding to these states can all be set to the same value. Thus, we enforce that the same control (correction) input is applied to all the states corresponding to the same register by the appropriate choice of the values in the matrix B.
The control input u(k) is usually computed based on the outputs of the circuit and a given reference. Cost-wise, the preferred solution is the simplest possible, mainly, a linear control law, i.e., when the control input u(k) is computed as u(k) = −Ky(k), with K a constant matrix. The controller gain K is determined such that the closed-loop system-meaning that the control input is applied to the system described by (4)-is asymptotically stable. Thus, we have experimented with the control input computed as u(k) = −K(y(k) − y d ), where y d are the desired values. However, since model (4) is in most cases nonlinear, such a controller may not suffice, in which case nonlinear controllers can be designed.
The proposed methodology consists of the steps depicted in Figure 2. The inputs are: a register transfer level nominal digital design, a set of application specific constraints (e.g., the acceptable output error), and the aforementioned design specific constraints. The output is a CFLED model.
A final remark concerns the operands data representation, since it is an important implementation cost factor. Although in control engineering floating point is the preferred representation format, from the circuit design point of view, having fixed point operands is a major source of cost saving. Many designs are fine tuned to the application specific needs and use custom fixed point representation. Thus, if a fixed-point representation is chosen, the CFLED gains, as well as resulting states, cannot exceed the representation used for the initial implementation, and use fixed point representation as well.

Fault Tolerant CFLED Operation
Before describing the CFLED operation, we revise the circuit design constraints with respect to timing (i.e., number of clock cycles required computing a result). In digital circuit operation, a pipeline stage result is obtained at well defined moments in time (i.e., after a number of n i clock cycles for an arbitrary processing stage i). In addition to this, we add the latency constraint that the CFLED enhanced circuit execution can not be larger than twice the execution of the nominal circuit. The latter implies that re-execution of an input data set is not applicable, otherwise we would be unable to justify the additional logic overhead with respect to time redundancy. However, given the principle of the correction loop, an extended execution, such that the correction offsets are added is required. This is handled in parallel for all the pipeline stages having non-zero coefficients in the K matrix, and we refer to it as pipeline stages execution rewind. Thus, latency increase is limited, and all computation sub-blocks are synchronized to the worst case number of clock-cycles it takes to compute a pipeline stage output result. This operating principle is depicted in Figure 3.
To summarize, the operation of the CFLED augmented digital circuit requires the following steps: 1. compute the difference between the output and the reference (y(k) − y d )), 2. determine the correction factors by means of −K × (y(k) − y d ), 3. rewind the computation by adding the obtained correction factors to the corresponding registers.
It should be emphasized that for a circuit composed of several sub-blocks, the rewinding process is performed only in the sub-modules for which the coefficients in K are non-zero. Furthermore, the re-executions is performed in parallel for the sub-blocks where the correction factors are applied-see Figure 3. Therefore, a correction phase duration is equal to the highest latency of the rewinded sub-modules. The correction process may take several phases/iterations, until the result is below the tolerated threshold.
A key issue is represented by the need of the y d reference. We solved this problem by employing two CFLED augmented processing pipelines, each providing the reference to the other circuit. In this case, an application specific filter may be employed, in order to exclude results that are out-of-bounds for the specific application. . CFLED execution example. Three stages are considered in this example, each computing n i el results. Each result takes n i cc for computation. First, the normal operation is performed. If, after finishing the computation, a difference is detected at the output, the correction operation is started. All three stages are executed-rewinded -in parallel with correction inputs added to designated registers.

Nominal Circuit Design
To illustrate the proposed approach, we consider the hardware implementation of a fuzzy controller for a two degree of freedom (2-DOF) robot arm. The parameters of the physical robot are described in [26], and the hardware implementation of the controller in [20]. The output of the nominal (controller) circuit implementation is the vector u (2 elements), that represents the inputs for controlling the robot arm. With x Ir (2 element vector) we denote the controller state, while x r represents the current state of the robot (4 element vector), and y r is the reference that the robot arm should follow.
The hardware architecture is a fixed point implementation, using 24 bits operands, 8 bits for the integer part, and 16 bits for the fractional part. The equations and the numerical approximations used are described in [20]. For the purpose of this study, we focus on the register transfer level operations of the five pipelined processing blocks of the data-path depicted in Figure 4: 1. Trigonometric function approximator, that computes the sin(x), sin(2x), and cos(x)-these three functions are evaluated based on the 2nd and 3rd order Taylor series, and are computed in parallel; 2. computation of weighting functions-the computation of the three weighting functions h 1 , h 2 , and v is serialized, with one element computed at a time; 3. multiplication between the weighting functions-this block performs the 8 multiplications between h 1 or (1 − h 1 ), h 2 or 1 − h 2 , and v or (1 − v) sequentially; 4. final gain matrix computation-this block performs the accumulation of 8 scalar-matrix products; the size of the matrices is 2 × 6; it employs a single multiply-add fused unit, and therefore, the 12 elements of the final gain elements are computed sequentially; 5. output computation-the output control vector is obtained by multiplication of the final gain matrix with the vector composed of the process state x r and internal state x Ir ; the two elements of the output vector are computed sequentially; The 2-DOF robot arm controller contains 34 data registers and the processing of a set of input samples requires 175 clock cycles.

Dynamic Model and CFLED Design
For the implemented architecture, a dynamic model has been developed, following the steps described in Section 2. The output of the system has been defined as the output of the circuit. The model is non-linear (consequence of the multiplication operation), of the form (4), and has 310 states. In addition to the discussion from Section 2, in order to avoid obtaining a switching system or the need to include clock-dependence in the model, the blocks that compute results sequentially (e.g., weighting functions, weighting function multiplication, and output computation) are modeled as parallel in the dynamic model.
For the actual design of the controller, we have opted to linearize the model in the zero equilibrium point, reduce it using a balanced realization, and compute the gain matrix by placing the closed-loop system poles as close as possible to the origin [27]. As a result that K is constant, it is computed offline. Finally, the resulting controller has been tested for the original nonlinear model. The correction gains are constant during the entire correction process and are independent of the fault rate affecting the circuit. The computed gains are multiplied by the difference between the actual and desired output vector (u[k] − u d [k]), and their sum added at each clock cycle to the designated register during the correction operation phase. Furthermore, the vector u d is updated at the end of each correction phase execution.
Regarding the values in the gain matrix K, we have obtained values that are non-negligible for two out of five blocks: the final gain matrix computation and output computation blocks (5 registers out of 34). Thus, we employ CFLED for these two computational stages. The values for the gain factors corresponding to the two blocks are given in Table 1, where R 3 i denotes registers within the final gain matrix computation block, while R 4 i denotes registers within the output computation. During the correction phase, processing is performed only on these two blocks and the duration of the correction is given by the maximum of the latency of the two modules augmented with CFLED. Since the correction gain values are constant (computed offline), optimized implementation of multiplication with constants can be employed, using a simple shift-and-add approach.
First, the Matlab model of the CFLED enhanced 2-DOF controller with golden reference has been built. In the next step, the golden reference has been replaced by a second CFLED controller that is also subject to probabilistic errors. A simple control unit determines if the difference between the outputs is smaller than the application specific error tolerance, chosen as 10 −3 . If the computed output register difference is larger than this bound, the error correction phase is started. Thus, the fault tolerant solution relies on two CFLED enhanced models connected in reaction, that exchange data at the beginning of each correction phase.
We simulated the Matlab model to asses the performance of the overall system. The behavior of the proposed solution can be observed in Figure 5, which depicts outputs u 1 , and u 2 of the two controllers in the presence of errors. Note that the error at the output of one circuit influences the other. The simulation shows that up to cycle 27, the output of the two CFLED circuits are identical; thus, the black and red lines overlap. Then, the first controller has errors manifesting at the output u 1 ; thus, triggering the correction phase during the next simulation cycle. Due to the correction factors the two outputs influence each other, leading to a small perturbation in output u 2 , that is eliminated successfully during further correction phases. This is also the case for the other two errors that are larger in amplitude. The only difference is that the recovery time is larger. Given this observation, it makes sense to include a check in the control unit, that verifies if the output values are within the allowed range. For the considered use-case, the range is [−3.5, 3.5]. If both CFLED enhanced circuits yield out of bounds results, then system failure occurs, and computation is restarted. Otherwise, they may serve as reference for each other during an error correction phase, if needed. If only one of the CFLED enhanced circuit outputs exceeds the range, then the other output is considered correct.

Results and Discussion
In this section, we discuss the results obtained by the proposed approach and compare them with those obtained by a TMR.

Reliability Analysis
The reliability analysis of the proposed approach has been performed using simulated fault injection, in a Matlab-Modelsim co-simulation environment. Two main components are needed: the Matlab model for the process, and the HDL CFLED RTL design simulated using Modelsim. The Matlab simulation is responsible for generating and sending input data to the robot arm controller. It also generates the correct output results, and handles the statistics recording and result post-processing. The communication between the two CAD programs is through TCP/IP connection.
The simulated fault injection is performed in three phases: I Setup: initializes all data structures from Matlab from the configuration files, loads the HDL design inside the Modelsim environment with the parameters sent from Matlab, and generates a TCL data structure in the HDL simulation environment that mirrors the Matlab simulated fault injection configuration data (i.e., bit error probability, fault location). II Simulation runs: One simulation step is run for both Matlab and Modelsim. In the Matlab environment, a set of input data and output gold values are computed. A TCL script with appropriate command parameters is invoked through the TCP/IP connection. Next, the register values for outputs, and the debug registers are read in Matlab, and data are logged in *.csv files. III Results processing: Matlab scripts process the logged data and compute statistics. We have performed the simulated fault injection both on the CFLED augmented architecture, and on a fault tolerant controller that uses TMR. We have considered bit-flips that affect each flip-flop within all data registers, with different fault rates per clock cycle: 10 −4 , 10 −5 , and 10 −6 . During a sample processing, for a bit-flip rate of 10 −4 , the number of faults injected in the baseline architecture-without any fault tolerant mechanism-is 14. For the fault tolerant designs-CFLED and TMR-faults have been applied for the entire design and during the entire execution.
We present the results in terms of: 1. Output failure rate-in this case, we have counted the number of outputs that have a difference greater than 10 −3 with respect to the correct output. 2. Gaussian distribution of the output error magnitude in terms of mean (µ) and covariance (σ)-this type of analysis has been performed because in arithmetic data-paths the error magnitude is in many cases more important than the number of errors (such as bit errors) affecting the result. Smaller values of the mean and of the covariance mean an improved distribution of error magnitude, and therefore, increased fault tolerance.
Results are depicted in Table 2 for the first component of the output vector (u 1 ), and Table 3 for the second component of the output vector (u 2 ). Although the TMR presents better output failure rate-a smaller number of erroneous frames-the distribution of error magnitude is improved in the case of CFLED. This indicates that although the number of erroneous results is higher in the case of CFLED, the error magnitudes achieved with the proposed method are lower compared to TMR. Therefore, it can be noted that CFLED achieves its primary goal, to reduce or mitigate the magnitude of errors in arithmetic data-paths.  Figure 6 depicts the average number of correction phase executions, depending on the bit error probability. It can be noted that for a bit error probability of 10 −6 , less than a quarter of phase executions are required on average. Given that a phase execution requires 20 clock cycles-out of 175 clock cycles required for processing one sample-less than 5 clock cycles are added on average to each sample processing. This means that the execution time is increased on average with less than 0.3%. Average error correction phase execution Figure 6. CFLED average phase execution increase for a maximum of 24 correction phase executions.

FPGA Implementation Results
The implementation results for the proposed CFLED enhanced 2-DOF controller design for Xilinx Virtex-7 VX485T-2 FPGA device, using the Xilinx Vivado 2017.1 tool are depicted in Table 4 for the baseline-non-fault tolerant-circuit, and the controlled, fault tolerant version, with two CFLED enhanced circuits providing the reference to the other. FPGA implementation results indicate that the two circuit CFLED solution has an overhead of 2.1× in terms of slices, with respect to the baseline circuit. In terms of DSP blocks, the CFLED solution contains the same number of blocks as two baseline circuits. Therefore, the CFLED solution uses 30% less slices with respect to the TMR, while having a third less DSP blocks. The results in Table 4 also indicate the low cost of the correction feedback with respect to the baseline circuit. As a result that this block consists of 4 multipliers with constants-using a simple shift-and-add approach-no dedicated DSP blocks are required, while the overhead in logic slices is rather negligible.

Conclusions
To the best of our knowledge, this paper represents the first attempt to improve the reliability of digital processing pipelines by employing control engineering techniques. CFLED requires the development of a dynamic model associated to the processing pipeline, and the design of a feedback controller that computes correction factors. The correction to be applied is computed as the multiplication between offline computed constants with the difference between the circuit's output and the reference, and they are added to a sub-set of registers within the processing pipeline. A key issue is represented by obtaining the reference. In this paper, we propose to use two CFLED augmented data-paths, each providing the reference to the other.
Reliability estimates obtained by means of simulated fault injection for a hardware architecture of a two degree of freedom robotic arm controller indicates that the error magnitude of the CFLED augmented circuit is reduced with respect to a classic TMR-based solution. Regarding the cost overhead for an FPGA-based implementation, CFLED shows a 30% reduction in slice-based resources compared to the TMR.
Thus, future work will consist of reducing the cost of this solution, by employing reduced precision replicas or approximate methods for part of the data-paths. Funding: This research was funded by European Space Agency ITI REDOUBT activity.

Conflicts of Interest:
The authors declare no conflict of interest.