Self-Adaptive Run-Time Variable Floating-Point Precision for Iterative Algorithms: A Joint HW/SW Approach

: Using standard Floating-Point (FP) formats for computation leads to signiﬁcant hardware overhead since these formats are over-designed for error-resilient workloads such as iterative algorithms. Hence, hardware FP Unit (FPU) architectures need run-time variable precision capabilities. In this work, we propose a new method and an FPU architecture that enable designers to dynamically tune FP computations’ precision automatically at run-time called Variable Precision in Time (VPT), leading to signiﬁcant power consumption, execution time, and energy savings. In spite of its circuit area overhead, the proposed approach simpliﬁes the integration of variable precision in existing software workloads at any level of the software stack (OS, RTOS, or application-level): it only requires lightweight software support and solely relies on traditional assembly instructions, without the need for a specialized compiler or custom instructions. We apply the technique on the Jacobi and the Gauss–Seidel iterative methods taking full advantage of the suggested FPU. For each algorithm, two modiﬁed versions are proposed: a conservative version and a relaxed one. Both algorithms are analyzed and compared statistically to understand the effects of VPT on iterative applications. The implementations demonstrate up to 70.67% power consumption saving, up to 59.80% execution time saving, and up to 88.20% total energy saving w.r.t the reference double precision implementation, and with no accuracy loss.


Introduction
Recently, many industrial applications have emerged in domains such as the Internet of Things (IoT), Artificial Intelligence (AI), Neural Networks (NNs), etc. with a common characteristic: inherent error-resilience. Thus, HW/SW designers could trade the precision of computations against cost, resource, and power savings for such a class of applications.
Transprecision Computing (TC) [1] is a paradigm that came to implement this vision by introducing efficient, flexible, and reconfigurable architectures adequate for such applications. TC especially targets Floating-Point (FP) arithmetic, since Floating-Point Units (FPUs) are present in most modern application classes and even embedded processors since they significantly boost computationally-intensive applications. However, an FPU is usually responsible for an extensive amount of power consumption and high memory bandwidth. Moreover, the energy consumption associated with FP arithmetic is known to be higher than that of its integer counterpart [2], making FPU optimization a priority.
A typical FP algorithm implementation flow is shown in Figure 1. It is a process that usually starts with the algorithm design and mathematical stability analysis [3][4][5]. The next step is often a naive implementation, where all variables are declared in high precision formats, e.g., using double or long double type variables in C/C++. Next, the process of Variable Type Optimization (VTO) [6][7][8][9] is overtaken. Its objective is to migrate as many variables as possible from high precisions to lower ones for a given constraint. For example, by carefully changing double variables to float or long double variables to double, either manually or automatically, one can still satisfy a given Quality of Result (QoR) constraint on the output. Then, designers can use arbitrary Reduced Precision (ARP) [10] for fine-grained bit-width optimization. It consists of reducing the exponent and/or mantissa bit-widths to either narrower standard bit-widths such as the 16-bit binary16 format [11], or custom reduced arbitrary bit-widths such as Intel Nervana's Flexpoint [12], Microsoft Brainwave's 9-bit floats [13], Google TPU's 16-bit BFloats [14], and NVIDIA's 19-bit format [15].
Algorithm design & stability analysis [3][4][5] Naive implementation with high-precision FP types only Variable Type Optimization (VTO) [6][7][8][9] Optimization with Arbitrary Reduced Precision (ARP) [9,10,[16][17][18][19] Dynamic optimization with Variable Precision in Time (VPT) [20,21, this work] Many tools have been proposed to allow the functional simulation of such reduced precision operators [16][17][18][19][20]. However, when it comes to arbitrary reduced precision, most of these tools are only adequate for simulation without actually providing appropriate hardware architectures that leverage such concepts to quantify hardware-level gains. In this work, a new method that enables designers to automatically and dynamically tune FP computations' precision at run-time is proposed. The theoretical and statistical study performed in this work shows a drastic energy reduction for iterative algorithms. These algorithms are the cornerstone of Computer Vision applications which are widely used in the context of Artificial Intelligence. Reducing the power consumption of such applications is crucial to meet the current international trends.
This paper is an extension of our previous work [21] in which the concept of Variable Precision in Time (VPT) was introduced for the first time. VPT consists of allowing the programmer to configure computations' precision dynamically at run-time. The previous paper presented a preliminary feasibility study to evaluate the potential of the methodology. In the current paper, we focus on generalizing VPT to all stationary iterative algorithms such as Jacobi and Gauss-Seidel. For such applications, the requirements in terms of precision vary at run-time. This means that standard FP formats are over-designed for these workloads. Moreover, it is shown that even fixed reduced precision formats are over-designed for some parts of these applications. Hence, there is a need for FPU architectures that enable fine-grained VPT. A video presentation explaining our previous work is available at [22]. This paper first revisits the proposed hardware architecture and its lightweight software support in more detail (Section 4). Then the concept of VPT is pushed further by proposing two transformed iterative algorithms that self-adapt their computation precision automatically. These algorithms do not need intervention from the user or the programmer. The first is a conservative version and the second is a relaxed (approximate) version. Programmers can choose one of them depending on their application; the first provides more robust guarantees while achieving interesting energy gains, while the second has relaxed guarantees but with a higher power-efficiency. Moreover, the software implementation of these algorithms is discussed and a set of statistical studies is performed to evaluate the effectiveness of the two approaches across many inputs.
This paper proposes the following new contributions: • Generalization of VPT to other iterative methods (Section 5) and proposition of two modified VPT-enabled algorithms with self-adaptive run-time precision (Section 6); • An in-depth statistical verification study of the VPT's impact on Jacobi algorithm's behavior (Section 7); and • An ASIC implementation of the proposed FPU on a 28 nm FD-SOI technology node and a post-synthesis evaluation of the proposed VPT algorithms' power consumption, execution time, and overall dissipated energy (Section 8).
Before diving into the details, some related works are presented in Section 2, and a set of definitions and the motivations behind this work are introduced in Section 3.1.

Related Works
The proposed approach takes advantage of non-standard arbitrary reduced operators [10,19] to achieve significant gains in terms of execution time, power consumption, and overall energy consumption. The following are some related works from the literature.

FP Variable Type Optimization (VTO)
For a given triplet {application, input dataset, QoR constraint}, tools such as [6][7][8] perform coarse-grained VTO using the delta-debugging search heuristic [23]. The authors' objective is to minimize the number of high-precision variables and maximize the number of low-precision variables. For some of these tools [7,8], the objective is to optimize for speed, whereas for others, such as Promise [6], the goal is to maximize the number of float variables.

Non-Standard/Arbitrary Precision Support
The mentioned tools support standard IEEE 754 [11] formats only, except Precimonious [8], which also supports Intel's 80-bit format implemented as long double in C. The authors of fpPrecisionTuning [9] proposed an arbitrary precision impact simulation methodology based on an automatic source code transformation tool. Libraries such as FlexFloat and FloatX [17,18] enable developers to simulate the impact of reduced precision on application-level Quality of Result (QoR). These tools support arbitrary precision only in simulation to help designers decide which precision is adequate for their application. However, they do not provide hardware-level support to leverage these decisions for power, execution time, or energy savings.

Mixed-Precision for Linear Algebra
There are two categories of linear system solvers: direct solvers (e.g., Gaussian Elimination, Cholesky decomposition) and iterative solvers. Carson and Higham [24,25] have proposed a general algorithm for solving linear systems based on iterative refinement [26] with three standard FP precisions. However, this method only uses standard precisions that traditional processors support. Moreover, the technique is mixed-precision in space but not in time, i.e., it contains mixed-precision instructions. Still, each instruction keeps its precision for all iterations. Authors of [27] have built on the previous work and proposed a solution with five different precisions. In contrast, new FP data types supported in NVIDIA GPUs were introduced in [28]. Finally, others [29] proposed an adaptive scheme to reduce communication overhead by selectively storing parts of the system preconditioner in different precision formats (half, single, or double). In this paper, we focus specifically on classical stationary iterative solvers such as Jacobi, Gauss-Seidel, Richardson, Successive Over-Relaxation, [30,31] etc.

Arbitrary Reduced Precision
In [10,19], we proposed a simulator (AxQEMU) (source code available at [32]) that simulates the impact of arbitrary reduced FP precision on applications in a non-intrusive way without modifying the source code. This approach is complementary to VTO since the tools mentioned before can be used as a starting point. Furthermore, using AxQEMU allows primary memory footprint optimization and enables fine-grained accuracy/energy trade-offs.

Variable Precision in Time (VPT)
VPREC [20] is a software back-end component that enables non-intrusive run-time variable precision simulation in the Verificarlo [33] software toolchain. This tool simulates run-time variable precision, specifically for iterative algorithms. However, VPREC is designed for impact simulation only in software and does not provide a hardware implementation for power reduction, whereas this work targets both software and hardware-level implementations. Moreover, VPREC evaluates the needed reduced precision for each iteration through off-line and data-dependent studies performed after the execution. In contrast, we propose a technique to automatically select the adequate precision for each iteration online at run-time in this work.

Definitions
A binary floating-point number can be written in the form (−1) s (1 + m)2 e , where s is the sign bit, m is the mantissa (also called significand or fraction), and e is the exponent. An FP number can be encoded either following standard IEEE 754 [11] formats, e.g., binary32 (32-bit single-precision format), binary64 (64-bit double precision format), or using a custom bit-width representation.

Definition 1.
A Floating-Point (FP) format is defined by the pair (E, M), where E is the bitwidth of its exponent and M is the bit-width of its mantissa field.
An FP format's "precision" refers to its mantissa bit-width M. The exponent bit-width E reflects its dynamic range. Custom non-standard (arbitrary) formats can be defined when some loss of precision is tolerated, or the numbers represented have a limited dynamic range. Hardware implementations contain an additional bit in the mantissa field, also called hidden bit [11]. In this work, the hidden bit is not counted in the mantissa bit-width M. Definition 2. The machine epsilon (or machine precision) of an FP format (E, M) provides an upper bound on the relative error caused by rounding. It is defined as The machine epsilon M also constitutes the distance from 1.0 to the next larger FP number [34].
For this work, only software Application Binary Interfaces (ABI) that support FP arithmetic in hardware (i.e., hard-float ABI) are considered. We also focus on CPU architectures and precisely on the 64-bit RISC-V [35] Instruction Set Architecture (ISA), with single-precision (F) and double precision (D) extensions. This architecture is referred to as "RV64FD" and will constitute our reference baseline FPU architecture.
In the rest of this paper, let P be the maximum number of precisions supported by the FPU and p an integer index such that 1 ≤ p ≤ P. Then, let (E 1 , M 1 ), . . . , (E P , M P ) denote the list of reduced precision FP formats supported by the hardware.

Motivation
This section presents the motivations for this work. First, the fact that floating-point computations constitute an essential part of iterative workloads is demonstrated. Second, we show to which extent the standard FP double precision is over-designed for such applications using the AxQEMU Arbitrary Reduced Precision simulator [10,19].

FP Computation Usage in Jacobi and Gauss-Seidel
The first motivation behind this work is the fact that iterative algorithms such as Jacobi and Gauss-Seidel spend a lot of time performing floating-point (FP) computations. To evaluate the execution time associated with such computational operations, a cycle-accurate simulation is performed.
To do that, a cycle-accurate processor model (called CVA6 [36]) running at 200 MHz is used. The Jacobi and Gauss-Seidel algorithms have been executed on top of this model with a randomly generated input. Figure 2 depicts dynamic assembly instructions' breakdown for both Jacobi and Gauss-Seidel, i.e., how many instances of assembly instructions have been issued through the processor pipeline during the execution of the application.  As depicted in Figure 2, for a 10 −12 tolerance threshold, 10.2% (resp. 16.2%) of the total assembly instructions executed are FP arithmetic instructions, and 19.8% (resp. 31.5%) are FP memory instructions for Jacobi (resp. Gauss-Seidel). The remaining portions concern instructions manipulating integers (arithmetic and memory) as well as other instructions such as branch and control flow instructions, system instructions that manipulate CSRs, etc.
These statistics might vary slightly from one input to another and depend on the SoC parameters (cache parameters, memory bus/interconnect, . . . ). However, it is safe to say that targeting FP computation optimization for these applications is a good decision. Moreover, when the approach is combined with classic state-of-the-art techniques, designers can also benefit from low memory overhead in addition to the computation gains afforded by our technique.

The Limitation of Arbitrary Fixed Reduced Precision
The second motivation behind this work is the fact that even fixed ARP is overdesigned for iterative algorithms. Figure 3 shows the impact of precision variation on the convergence profile (variation of the computed solution's accuracy at each iteration cf., Definition 3) of Jacobi when operating on different arbitrary reduced precisions. The reference application is simulated using the AxQEMU simulator [10,19] for multiple precisions (4, 8, . . . , 48, 52). Each of the colored lines shows the evolution of the error metric when the application is executed with arbitrary reduced precision. For each simulation, the precision is defined at launch time and stays constant at run-time.
This example demonstrates that it is unnecessary to compute all iterations with double precision. For instance, for Jacobi to reach a target 10 −10 error threshold, the designer only needs an FP format with a 32-bit mantissa, i.e., the FP format (11,32), which has a total bit-width of 44 bits. Similarly, for Gauss-Seidel, a 28-bit mantissa, i.e., a total bit-width of 40 bits, is sufficient to reach a target error threshold of 10 −10 .
For Jacobi, even though the (11,32) format will dissipate less power than the original binary64 format, there is still room for significant improvement. For instance, the (11,32) format is still over-designed for iterations 0 to 457. A similar case could be made for Gauss-Seidel.
As a conclusion, the standard FP formats are over-designed for these applications and ARP optimizes the energy consumption compared to the original reference precision. However, there is still room for improvement using an Arbitrary Reduced Precision variable at run-time. This would allow precision to be varied during the execution depending on the error threshold needed. In this paper, an FPU capable of computing different configurable ARPs at run-time is proposed with two modified self-adaptive algorithms that automatically vary the computation precision without user intervention.

Proposed Hardware Architecture
We propose a hardware FPU architecture named VPT-FPU. It enables dynamic variable precision at run-time. We also propose a lightweight software library that facilitates its integration in an existing CPU core.

Architecture Overview
VPT-FPU contains two types of data-paths: precise data-paths, that contain standardprecision arithmetic operators, and approximate data-paths, that contain reduced arbitrary precision operators. The proposed architecture ( Figure 4) includes: A Non-computational operations data-path: performs non computational operations such as FP-to-integer and integer-to-FP conversion, comparison, input classification etc. In practice, the hardware/energy cost of these instructions is negligible when compared to computational instructions, so we chose to not optimize them.
B Multi-precision DIV/SQRT data-path [37]: a block that computes division and square root operations using an iterative algorithm implemented in hardware [38].
C Precise Fused Multiply Add (FMA) data-path: contains the precise FMA operators that perform addition, subtraction, multiplication, and multiplication-accumulation in binary32 and binary64 formats. For each supported non-standard FP format (E p , M p ), 1 ≤ p ≤ P, an approximate block is integrated inside either D F or D D at the RTL level at synthesis time. This means that the set of supported precisions is fixed, once synthesized. Each of these approximate blocks contains: • A reduction block that converts the inputs from the original format (binary32 or binary64) to the target reduced format (E p , M p ).
• An FMA computational operator in (E p , M p ) format.
• An extension block that converts the result from (E p , M p ) back to the original format.
The blocks A , B , and C constitute the original standard RV64FD FPU. The block B containing DIV/SQRT operators do not need to be duplicated or surrounded by reduction and extension blocks since their precision can be adjusted via a precision selection input signal that varies at run-time. This is why only FMA blocks are duplicated.
Section 8 studies the area overhead introduced by block D and the savings in terms of power/energy and execution time.

Custom VPT Registers
To simply and flexibly support VPT, three custom Control and Status Registers (CSR) have been added to the FPU (see Figure 4): the VPT_STATUS register, which is used to enable/disable the VPT mode, and the two registers VPT_FLOAT_PREC and VPT_DOUBLE_PREC, which, respectively, select the precision settings for float and double operations via software.

VPT Software Support
The added registers are memory-mapped when the FPU is integrated into a CPU core. This enables lightweight software support since read/write (R/W) operations can then be performed using the usual CSR assembly instructions, i.e., no special compiler modification is needed. The additional CSRs belong to the custom R/W user-level address space [35], which means that no particular machine-or supervisor-level privileges are needed to perform R/W operations.
A small set of Hardware Abstraction Layer (HAL) functions has been developed to enable the programmer to select float and double operations' precision via software. Of course, at the SW level, the programmer can only select a precision among the ones supported in hardware (E p , M p ), 1 ≤ p ≤ P. The proposed HAL functions are wrappers for the CSR read/write assembly instructions. For a seamless integration in existing applications, the functions are bundled as a header-only library. Listing 1 depicts a few examples. For instance, vpt_set_prec_float (resp. vpt_set_prec_double), sets the precision of float (resp. double) operations, i.e., by configuring the content of the register VPT_FLOAT_PREC (resp. VPT_DOUBLE_PREC) to 1 << M, where M is the target arbitrary reduced precision.

Iterative Methods: Mathematical Foundations
The Jacobi and Gauss-Seidel iterative algorithms are used as case studies to evaluate the gains and the limitations of the proposed VPT-FPU. The original algorithms are presented as well as two VPT-enabled versions that benefit from the run-time variable precision capability of suggested hardware FPU.

Presentation of Jacobi and Gauss-Seidel Iterative Methods
Stationary iterative methods are algorithms that determine the solution of a square n × n system of linear equations, in the form: Given the n × n real coefficient matrix A and the right-hand side n-vector b, an iterative algorithm aims to find the unknown vector x satisfying Equation (2). This equation can be transformed to a fixed-point iteration [30]. The system is solved by computing, at each iteration (k + 1), the (k + 1)th approximation of the vector x (k+1) as a function of the previous result x (k) . This paper focuses on two main iterative algorithms to solve this problem based on the fixed-point method: • The Jacobi method (published in 1845), whose formula can be written as follows: is the ith element of the vector x (k+1) .
• The Gauss-Seidel method (published in 1874), which uses a slightly different equation: These methods and others are well documented in the literature [30,34].

Convergence of Iterative Algorithms
To solve this problem iteratively, the convergence of the system should first be mathematically ensured. In the remainder of this paper, we assume A to be strictly diagonally dominant: the elements of matrix A satisfy the following condition |a ii | > ∑ j =i a ij . For both Jacobi and Gauss-Seidel, this is a sufficient (but not necessary) condition to ensure the convergence for any initial guess vector x (0) [30] (Theorem 4.5, p. 111).
The iterative algorithm reaches convergence when the computed approximation x (k) is very close to the exact solution of the system x * , i.e., the forward error e (k) defined below is small enough However, the exact solution x * is unknown. This is why the error of the computed solution is estimated in an indirect way using a convergence metric. The latter is a scalar that represents the evolution of the solution accuracy at each iteration k. The idea is to compute the convergence metric at the end of each iteration and stop the iterative process when progress is no longer being made, i.e., when the computed metric is below a given user-defined threshold denoted TOL. The result of this comparison is called the stopping criterion.
For the sake of simplicity, the distance between two consecutive results is chosen as the metric. It is defined as follows: where ||.|| refers to the Euclidean norm. The metric is computed at the end of each iteration. Other metrics (e.g., residual error || b − A x||) can also be used. At the end of each iteration, the stopping criterion is evaluated to (1) identify when the forward error e (k) is small enough to stop iterations [31] (p. 63), (2) detect when the error is no longer decreasing or decreasing too slowly, and (3) limit the maximum amount of time spent iterating.
The following definitions are provided for the remaining of this document: Definition 3. The Convergence Profile (CP) is the curve representing the variation of the convergence metric (e.g., || x (k+1) − x (k) ||) through iterations k. An average CP is an average curve computed across many inputs.

Definition 4. The Precision Variation Profile
(PVP) is the curve representing the variation of operating precision (M p ) through iterations k. An average PVP is an average curve computed across many inputs.

Implementation of VPT-Enabled Iterative Methods
This part presents the implementation details of iterative algorithms.

The Original Algorithm
The following algorithm (Algorithm 1) depicts a typical implementation of an iterative method with only standard precision, (i.e., single-precision/double precision) with no custom variable precision involved.

Algorithm 1: general structure of iterative methods.
Inputs : A : a diagonally dominant matrix of size n × n, b : the right-hand side vector of size n × 1, x (0) : an initial guess vector of size n × 1, MAX_ITER : maximum number of iterations, TOL : original global error threshold. Output : x (k+1) : the solution of the linear system.
Compute convergence metric (Equation (6)) The original algorithm contains a main loop (lines 2-6), where the elements x (k+1) i of the solution vector x (k+1) are computed in line 3, using either Equation (3) for Jacobi, or Equation (4) for Gauss-Seidel. Then, the convergence metric is computed according to Equation (6) (line 4). After that, the stopping criterion evaluation is performed by checking if the computed metric is lower than the tolerance threshold TOL or if the number of iterations k has reached its limit MAX_ITER. If the stopping criterion is satisfied, the algorithm stops and returns the last computed result. Otherwise, the algorithm continues until reaching convergence in the next iterations or potentially reaching the maximum iterations limit MAX_ITER.

The Transformed Algorithm
To take advantage of the VPT-FPU presented in Section 4, Jacobi and Gauss-Seidel algorithms should be manually transformed. Algorithm 2 depicts the general structure of an iterative method with VPT enabled. The regions added to the original baseline implementation (depicted in Algorithm 1) are colored in blue.
Our transformation methodology consists of starting the process with the lowest possible precision and increasing it gradually until convergence. In this example, both the original and the VPT-enabled algorithms are applied to the same input matrix A and vector b for comparison. These inputs are generated randomly with a randomization seed equal to zero. The steps of the transformation process are:

Define a list of intermediate tolerance thresholds TOL p (input of Algorithm 2).
Since the objective is to increase precision gradually, the intermediate thresholds

3.
Enable VPT at the beginning of the algorithm (line 2 of Algorithm 2).

4.
Iterate over the supported precisions to gradually improve the accuracy of the solution. This process is illustrated through the example depicted in Figure 5. It shows the evolution of the convergence profile (left axis), with iterations (horizontal axis) for the double precision reference original algorithm (continuous blue line) and the VPT-enabled algorithm (continuous red line) of Jacobi.
In As shown in Figure 5, the precision is increased from M p to M p+1 when the convergence metric reaches the intermediate threshold TOL p . Through this example, it is shown that the VPT-enabled algorithm follows the same trend and provides the same accuracy at the input while operating with much lower and auto-adaptive precision.
The presented methodology does not alter the convergence of the algorithm, and programmers can apply it to other applications, (e.g., Successive Over-Relaxation (SOR), Richardson method, etc. [30]) as long as its convergence is guaranteed mathematically. In strict logic, we verify that the input couple (A, b) remains strictly diagonally dominant for each reduced precision. The variation of the convergence profile for Gauss-Seidel is depicted in Figure A1 in the Appendix A.  Reference The choice of intermediate thresholds is critical: the smaller they are, the harder they can be reached. On one hand, choosing thresholds that are easy to reach will lead to premature precision increment; hence more iterations will be spent on higher precisions, and power consumption will be increased. On the other hand, if the thresholds are very difficult to reach for a given precision, it can cause convergence stagnation. In this paper, two threshold policies are proposed for choosing these thresholds. The first one provides conservative thresholds and is explained in Section 6.2.1. The second generates smaller thresholds and handles stagnation cases. The latter is detailed in Section 6.2.2.

Details of Threshold Policy (1): Conservative Thresholds
An intermediate threshold should be computed for each precision M p . With this threshold policy, the computed thresholds are more conservative, i.e., they are sufficiently high, which makes them more reachable with a given precision. They are mathematically computed according to the smallest distance between two consecutive points in a given FP format.
In the case where the distance metric is used for convergence, an upper bound can be computed in the precision M p , by assuming that the distances between all elements of i are as low as some small positive floating-point value u. This assumption provides an upper bound on the tolerance threshold equal to u √ n, as explained below.
This upper bound has to be computed and rounded to the precision M p . For example, if u is set to M p , then TOL p = round( √ n M p , M p ), where round(X, M p ) is a function that rounds the result of a FP number X to M p mantissa bits following one of the standard FP rounding modes, typically the round-to-nearest rounding mode.
As the convergence metric is computed in low precision, the designer should also take the rounding errors associated with the metric computation into consideration [31] (Section 4.2.5, p. 56), since the metric computation involves n multiplications, (n − 1) additions, plus a final square root operation (see Equation (6)). All these computations will contribute to the final computed TOL p values.
To compute an estimation of the thresholds, a proof assistant called Gappa [39] is used to compute the thresholds for all possible precisions M p ∈ {1, 2, . . . , 52}. Listing A1 of the Appendix A shows the Gappa script used to compute the thresholds. In this case, it is applied to a precision M p equal to 40 bits. Furthermore, this script is run offline only once for each precision. This means that it does not add an overhead to the iterative application itself at run-time.

Example
Consider n = 50, M p = 40, and u = 40 = 2 −40 . If the threshold is computed ideally, the result would be TOL p = 2 −40 × √ 50 = 6.43109 . . . 10 −12 , which is similar to Gappa's result. However, for M p = 4, the ideal result would be TOL p = 0.44194 . . . , whereas using Gappa, it provides the result 0.5. The latter is more conservative and takes rounding error into account.

Details of Threshold Policy (2): Relaxed Thresholds with Stagnation Detection
This threshold policy is more "relaxed" in the sense that the chosen intermediate thresholds are as low as possible to maximize the number of iterations spent on lower precisions.

Definition of Convergence Stagnation
Choosing very low thresholds is riskier since there are no guarantees that the convergence metric can actually go as low as the specified intermediate TOL p thresholds, i.e., there is a chance that the convergence metric will stagnate at a specific value or oscillate around it. Hence, it is important to consider this effect when choosing to lower down the selected convergence thresholds. This process will be referred to as "stagnation detection". The stagnation behavior has been observed only for a small subset of the tested inputs, yet it is important to consider when the thresholds are selected.
Example of Convergence Stagnation Figure 6 depicts the convergence profile and the precision variation profile for three separate cases 0, 1, and 2 (each one has a randomly generated matrix A and vector b). In this case, TOL p is set to M p for each precision M p and a discrete subset of precisions, i.e., M p ∈ {4, 8, 12, . . . , 48, 52} is chosen so that the total bit-width (1 + 11 + M p ) of the reduced VPT-FPU formats are multiples of 4 bits.
As shown in Figure 6, input matrices 0 and 1 converge normally without problems. However, the distance metric associated with input matrix 2 oscillates between two values 0.14312744140625 (which is equal to 2 −6.1265 ) and 0.13732910156250 (which is equal to 2 −6.1862 ) while operating in 8-bit precision. The residual error r (k) (i.e., || b (k) − A x (k) ||) also stagnates at the value 0.0000889301300049. For an 8-bit mantissa, the expected convergence threshold is normally 2 −8 , which seems to be difficult to reach for input matrix 2. Hence the importance of being able to detect such stagnation cases. Figure A2 of the Appendix A shows the same phenomenon for the residual error metric.

The Proposed Convergence Stagnation Detection Mechanism
To avoid the stagnation of the convergence profile, a condition to the stopping criterion should be added to detect this stagnation or oscillation phenomenon [31] (Section 4.2.4, p. 56). For that, we define the Stagnation Maximum Iterations (denoted SMI) as the maximum number of iterations for which stagnation can be tolerated. If the convergence metric stagnates at a fixed level or oscillates, the precision should be increased if higher precision is available. Otherwise, if there is no higher available precision, then the algorithm is stopped.
A comprehensive VPT-enabled iterative algorithm is proposed (Algorithm 3) to account for the convergence stagnation effect. In addition to the standard version of the algorithm (black-colored statements) and the original VPT statements (colored in blue), additional instructions (colored in red) have been added to implement stagnation detection.
From an implementation point of view, an additional integer variable should be added to keep stagnation iterations count, as well as one subtraction instruction and one comparison to check if two consecutive iterations have similar or very close distance metric (|| x (k+1) − x (k) ||) values (line 11). This overhead is negligible compared to the cost of the main iteration computations, which is confirmed by the performed energy consumption study. • Theoretical conservative thresholds: computed using the formula √ n M p and rounded to the nearest.

•
Conservative thresholds computed with Gappa: these are generated using the Gappa proof assistant as explained in Section 6.2.1 to take rounding errors into account.

Statistical Analysis
Section 6.2, demonstrated the effect of VPT by applying it to a single randomly generated input. In this section, the behavior of VPT and its gains will be evaluated across a set of 1000 randomly generated input (A, b).
The first objective of this section is to statistically study the effects of VPT on the generated 1000-input set. The second is to study and compare the influence of each convergence threshold policy (conservative thresholds vs. relaxed thresholds) on the convergence profile, the precision variation profile, the total number of iterations, the distribution of iterations across the intermediate precisions.
To perform the statistical study, each input will be identified by an ID ranging from 0 to 999. Then, each input is generated using a randomization seed equal to its ID. This section only studies the Jacobi iterative method, but the conclusions also apply to Gauss-Seidel and similar iterative methods.

Software Implementation Aspects
In the following sections, an open-source Jacobi implementation [40] written in C is considered. It takes as an input a randomly generated 50 × 50 diagonally dominant matrix A, with pseudo-random values between 0 and 1. The default target tolerance threshold is set to 10 −10 unless otherwise is specified. Gauss-Seidel is also implemented following the same structure of the original Jacobi application.
The software applications have been cross-compiled for the Proxy Kernel execution environment (a lightweight bare-metal-like OS dedicated to RISC-V-based systems) [41], using the RISC-V GCC compiler in a similar fashion to the cycle-accurate study performed earlier in Section 3.2.1.

Effects of VPT on the Convergence Profile and Precision Variation Profile
Considering the two threshold policies and their respective parameters presented in Section 6.2, five different use cases are established to be evaluated and compared:
In this case, the original algorithm (Algorithm 1) double precision implementation is applied to the randomly generated 1000-input set. Figure 8 depicts the average convergence profile (continuous blue line). The area highlighted in light-blue covers the range of possible convergence profiles obtained for the 1000 inputs. This case is represented by the blue line (Figures 8-10).
In this case, the VPT-enabled algorithm (Algorithm 1) is implemented. Here, the thresholds are computed with the formula below as explained in Section 6.2.1.
The corresponding convergence profile is depicted in Figure 9, which overlaps with the original standard baseline convergence profile. This means that in this case, the convergence speed does not change compared to the reference even though lower precisions are used. This case is represented by the orange line (Figures 9 and 10).

VPT-enabled results, with threshold policy (1), using conservative thresholds generated with Gappa.
In this case, the VPT-enabled algorithm (Algorithm 1) is implemented. Here the thresholds computed using the Gappa proof assistant are used (Section 6.2.1). Please note that these thresholds are similar to the latter case except for low precisions 1 ≤ M p ≤ 5. This case is represented by the green line (Figures 9 and 10).

VPT-enabled, with threshold policy (2), along with M p thresholds and the Stagnation Maximum Iterations set to 2 (SMI = 2).
In this case, the VPT-enabled algorithm (Algorithm 2) is implemented. The convergence thresholds used here are more relaxed, and the stagnation detection mechanism presented in Section 6.2.2 is activated. For this experiment, stagnation is tolerated for at most two consecutive iterations before incrementing the precision. This case is represented by the red line (Figures 9 and 10).

VPT-enabled, with threshold policy (2), along with M p thresholds and the Stagnation Maximum Iterations set to 4 (SMI = 4).
This case is represented by the purple line (Figures 9 and 10). Figure 9 depicts the average convergence profiles for each one of the five use cases. The conservative thresholds (overlapping orange and green) produce a convergence profile similar to the standard baseline (blue). The relaxed thresholds (overlapping red and purple) also result in similar convergence profiles on average, although its speed slows down compared to the baseline starting from iteration 700. Figure 10 shows the precision variation profile for the five cases. For the baseline reference, the operating precision is fixed at 52 bits, i.e., double precision format for all iterations. Both conservative threshold sets ( √ n M p and Gappa) produce a similar overlapping precision variation profile (green and orange). Relaxed thresholds also produce similar overlapping profiles (red and purple) indifferent to the value of SMI.

Effects of VPT on the Total Number of Iterations
In this paragraph, the effect of VPT on the total number of iterations is studied by analyzing each one of the five use cases stated previously. Figures 11 and 12 show the distribution of the total number of iterations (vertical axis) for each use case (horizontal axis), for two target thresholds 10 −4 and 10 −12 , respectively. To get these data, each version of Jacobi has been executed on the 1000 inputs. The figure represents the medians (red lines in the middle of the box plot) and the means (black dot). The blue crosses (+) represent non-significant outliers.
For higher target thresholds, and specifically 10 −4 in this case (Figure 11), there is a clear effect on iterations' number distribution: using the relaxed thresholds induces less time to converge. It is also clear that an SMI value of 4 achieves slightly less number of iterations compared to an SMI of 2. On the other hand, threshold policy (1) (conservative thresholds) shows no noticeable effect on the number of iterations.
When the target tolerance threshold is set to 10 −12 (Figure 12), no significant effect is observed. However, the relaxed thresholds lead to slightly higher iteration numbers on average compared to the double precision reference.
There seems to be an effect on the outliers (+). Although this is not statistically significant, a worst-case and best-case HW-level comparison will be performed to evaluate to which extent this increase affects the power, energy, and execution time savings. Figures 13 and 14 present the distribution of the overheads brought by each threshold policy with respect to the total number of iterations for the whole 1000-input set for a 10 −04 and 10 −12 tolerance threshold, respectively. Positive percentage values signify an increase in the number of iterations, whereas negative ones mean that there was a reduction in the number of iterations. Furthermore, the conservative policies do not alter the number of iterations much compared to the double precision reference: they achieve a bit less than ±1.5%. However, the relaxed policies have a slightly important but scattered effect on iterations' count. For example, for a threshold of 10 −4 (Figure 13), relaxed policies achieve a −17% to −1% reduction in the number of iterations for more than 75% of inputs. At 10 −12 (Figure 14), nearly half of the input dataset achieves a variation between −1% and +2.5%.

Effects of VPT on Iterations' Distribution
Studying the effects on the total number of iterations alone is insufficient to understand the consequences of applying VPT to such algorithms fully. Hence the necessity of also studying the distribution of iterations across the intermediate precisions. The reason for that is that, even though two use cases have the same total number of iterations, some cases will tend to over-use lower precisions more than the higher ones or vice versa, hence leading to potentially different power/energy consumption. Figures 15 and 16 illustrate a study similar to the one performed in the last section, this time by considering the distribution of the number of iterations per each precision for the 1000 inputs. Figure 15 depicts the study when the target threshold is set to 10 −4 . As you can see, threshold policy (2) (relaxed thresholds) tends to maximize iterations at the lower precisions, whereas threshold policy (1) (conservative thresholds) tend to maximize iterations at the higher precisions. Thus, if the two policies have the same total number of iterations, it is more likely that most of these iterations will be skewed towards lower precisions for threshold policy (2) and skewed to higher precisions for threshold policy (1). This will translate into an important difference in terms of power consumption. Figure 16 provides a similar study with a target threshold of 10 −12 to examine the previous trend in the long term. The figure shows that the earlier conclusion holds since threshold policy (1) seems to minimize iterations at lower precisions and maximize them at higher ones. For example, at M p = 44, most cases already finished their execution when operating with threshold policy (2), whereas they have around 45 iterations left when working with threshold policy (1). Moreover, those 45 iterations are executed in higher precision, which means that there will be significant overhead in terms of energy compared to threshold policy (1).

Conclusions
The conservative thresholds provide a similar convergence profile to the baseline reference, whereas the relaxed thresholds tend to be slightly slower at the end of convergence. However, the relaxed thresholds tend to use less precision, which will translate to lower power consumption.
In the next section, a hardware-level power, execution time, and total computational energy evaluation will be performed for a single (typical) input and then on two cases among the 1000 random inputs.

Hardware-Level Evaluation & Discussion
After assessing the effectiveness of the proposed approach from a statistical and software point of view, this section presents a hardware-level evaluation of the power, execution time, and energy savings related to computations occurring inside the FPU.

Hardware Synthesis Conditions
The presented VPT-FPU was implemented in SystemVerilog, based on an open-source parametrized FPU [42]. The hardware design is globally parametric so that the list of supported formats {(E 1 , M 1 ), . . . , (E P , M P )} in hardware are variable at design time.
The implementation was synthesized as an ASIC, on a 28-nm FD-SOI technology node, in the typical corner (Regular V t , 1.00 V, 25°C, No Body Biasing) for a 200-MHz frequency target. Synthesis has been performed on Synopsys Design Compiler ® with automatic clock-gating enabled and default effort levels. In addition, Post-synthesis gatelevel simulations were performed using Synopsys VCS ® , and power consumption was estimated by considering both static power and dynamic switching activity associated with the application using Synopsys PrimeTime ® .

HW-Level Evaluation with One Input and Relaxed Thresholds (Nominal Scenario)
This section will quantitatively evaluate the hardware savings and overheads introduced by VPT in terms of execution time, average power consumption, and overall energy. In this subsection, only one randomly generated matrix A and vector b is considered (the same one presented in Section 6.2). Table 1 (Table 1).

Baseline Results
The baseline performance results are obtained by executing the reference SW configuration Ref double on the reference RV64FD hardware FPU. This experiment is repeated for each tolerance threshold. All upcoming results will be normalized w.r.t this one.

VPT Results
For each tolerance threshold and each {SW configuration, HW configuration} pair, a gate-level simulation is performed. The absolute values of execution times, average power, and the dissipated energy spent on FP computations are reported in Table 1 for Jacobi and Gauss-Seidel. The results were obtained by applying the algorithm on a single input matrix A and vector b auto-generated with a randomization seed equal to zero. Column-wise normalized values (%) are reported by dividing the estimated values by the reference ones for each target error threshold, i.e., there is a reference for each target threshold column.  Table 1 shows that the VPT-enabled implementations always achieve better performance with no accuracy loss compared to the reference. For example, the VPT double SW configuration of Jacobi achieves power consumption savings ranging from 59.71% up to 74.31%, execution time savings ranging from 39.17% up to 59.77%, and energy savings between 77.21% and 87.89%. Meanwhile, the same SW configuration for Gauss-Seidel application achieves power consumption savings ranging from 50.83% up to 78.18%, execution time savings ranging from 31.70% up to 51.70%, and energy savings between 66.40% and 89.46%.

Conclusions
To sum up, starting from an already optimized standard software version (using VTO) instead of a double precision, one can achieve higher power and energy savings. Using the VPT approach allows to refine and enhance VTO savings one step further. Even though the additional VTO gains shown may seem limited to a few percentage points in terms of computations, starting from an already optimized version (using VTO) guarantees subsequent memory footprint savings too. However, this is only feasible for 10 −4 and 10 −6 target error thresholds which are reachable using single-precision.

Worst Case/Best Case HW-Level Evaluation
With everything said in Section 7 in mind, it is important to perform an empirical HW-level evaluation. Only edge cases where VPT affects either negatively or positively the energy consumption of the algorithm are considered since it would take months to perform the HW-level study for all the 1000 inputs. Moreover, for the sake of concision, only the conservative thresholds generated with Gappa in the case of threshold policy (1) are considered. Furthermore, for threshold policy (2), only the case where SMI is set to 4 is evaluated.
The edge cases have been chosen from the 1000-input dataset based on the software simulation presented in Section 7. They correspond to the input couples (A, b) that produce the lowest and the biggest iterations' overhead for each threshold in Figures 13 and 14. Table 2 lists the nomenclature that will be used in the remaining of this paper. For example, BC-04 refers to the input couple (A, b) that represents the best case for a 10 −4 tolerance threshold.
To ensure a fair comparison from a hardware standing point, the following experiments are all run on top of a VPT-FPU hardware configuration that will be referred to as VPT_H. This configuration supports the following precisions {4, 8, . . . , 48} within the D D data-path (with an exponent bit-width maintained at 11). Post-synthesis power simulations are performed with the same synthesis conditions as explained in Section 8.1. Baseline HW-Level Results for All Edge Cases Table 3 shows the baseline average power, execution time, and energy consumption results for the standard double precision version of Jacobi applied to each of the selected cases and executed on top of the reference baseline RV64FD architecture. This data will constitute the baseline against which all the following studies will be compared. Overhead of VPT-FPU When Operating in Full Precision Table 4 depicts HW results when executing the application on the precise part of the VPT-FPU for each selected edge case. This evaluates the overhead brought by the static energy dissipated in the approximate part of the VPT-FPU circuitry when execution mode is fully precise.
As shown in the table, execution time is not affected. Only power increases due to static power (leakage) dissipated in the non-active approximate parts, leading to a 3.72% (input BC-12) up to 3.85% (input WR-12) energy increase, which is negligible compared to the savings exposed in the following sections. Gains Using the Conservative Thresholds Table 5 depicts the HW-level gains for the conservative threshold policy (1). In addition, the table presents the results of the best case and worst case inputs at 10 −4 and 10 −12 target thresholds. As depicted in the table, there is not much difference between best-case and worst-case scenarios in terms of computational energy, i.e., 84.06% (best case for 10 −4 ) vs. 83.43% (worst case for 10 −4 ) then 63.13% (best case for 10 −12 ) vs. 62.33% (worst case for 10 −12 ). This can be explained by the fact that using VPT threshold policy (1) does not affect the number of iterations executed.
Upon a closer look to Figures 13 and 14, we conclude that this is due to the negligible effect on the number of iterations when using conservative threshold policies.

Gains Using the Relaxed Thresholds
Results for this experiment are depicted in Table 6. For a target threshold of 10 −4 , this threshold policy achieves energy savings between 82.95%(worst case) up to 88.20% (best case), mainly due to 42.20% up to 59.80% savings in terms of execution time. Similarly, for a target of 10 −12 , the energy gain stands between 63.20% (worst case) and 68.97% (best case). These percentages are slightly more interesting than in threshold policy (1), especially in the best case. Please remember that, in this situation, the VPT relaxed threshold policy (2) has a more substantial impact on the number of iterations, as shown in Figures 13 and 14. For example, input BR-12 sees a decrease in the number of iterations from 612 to 578 (−5.5% variation), translating to a 68.97% energy gain. On the other hand, WR-12 sees an increase from 700 to 763 iterations (+9.0% increase) which translates to 63.20% total energy gain, which is still an important achievement despite the extra iterations' overhead. Table 1 reports the total cell area overhead for each HW configuration w.r.t the baseline standard RV64FD FPU.

Circuit Area Results
The area overhead depends on the additional supported reduced precision formats. Overheads can range from 1.19× (VPT_A) up to 2.63× (VPT_G).
When VTO is also applied along with VPT, (i.e., when the VPT float SW configuration is considered), this technique can attenuate the area overhead. For example, overhead can be reduced from 1.27× (VPT_B) down to 1.19× (VPT_A) for a 10 −4 threshold.
These ratios also depend on the number of intermediate pipeline registers inserted in each approximate datapath, configurable at design time. The area overhead is the only significant disadvantage of the proposed architecture.

Conclusions
These results demonstrate that the VPT-enabled software implementation is a onesize-fits-all solution, i.e., using the same software implementation along with our proposed VPT-FPU, the designer can drastically reduce consumed resources by using only the needed precision instead of an over-designed solution such as standard FPUs. The price to pay is in terms of circuit area overhead. Designers can also use standard VTO techniques jointly to enhance the energy savings of our technique and reduce its area overhead.

Conclusions
This work proposes a new method and an FPU architecture that enable designers to dynamically tune FP computations' precision automatically at run-time called Variable Precision in Time (VPT). In spite of its circuit area overhead, the proposed approach simplifies the integration of variable precision in existing software workloads at any level of the software stack (OS, RTOS, or application-level): it only requires lightweight software support and solely relies on traditional assembly instructions, without the need for a specialized compiler or custom instructions.
The technique was applied to the Jacobi and the Gauss-Seidel iterative methods taking full advantage of the suggested FPU. For each algorithm two threshold policies were proposed: a conservative policy (1) and a relaxed policy (2).
The last two sections presented a statistical study that explored the effects of each VPT threshold policy on many aspects of the application: impact on the total number of iterations, impact on the iterations' distribution across different precisions, impact on HW-level estimations such as execution time, power and the overall energy consumption.
The implementations demonstrate up to 70.67% power consumption saving, up to 59.80% execution time saving, and up to 88.20% total energy saving w.r.t the reference double precision implementation, and with no accuracy loss.
To conclude, the threshold policy (1) is a conservative approach that brings some predictability and safety along with all the run-time variable precision benefits. On the other hand, threshold policy (2) is a relaxed approach that further optimizes power and energy consumption. However, it tends to alter the total number of iterations, sometimes favorably and sometimes negatively. However, even when the total number of iterations is increased, there is still a very interesting energy gain. Thus, threshold policy (1) is the safest solution, and threshold policy (2) represents a risky but optimized solution.

Limitations
To choose between the two threshold policies for a given problem, the user/designer should use representative datasets and especially evaluate the memory-related aspects, especially for threshold policy (2). Furthermore, the memory-related aspects should be studied further to evaluate whether the cost of the potential added iterations (hence more load and store operations) is lower than the gain associated with computation optimization.
The proposed methodology is application-dependent, i.e., the designer should transform the algorithm manually, ensure convergence after modification, and select adequate intermediate tolerance thresholds depending on the convergence metric used in the application.

Future Works
Our future studies will focus on optimizing the circuit area by merging some components of the VPT-FPU. We also intend to explore the usability of the proposed VPT-FPU with other kinds of workloads such as Machine Learning and Computer Vision applications.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:  Listing A1. Gappa script for the generation of conservative thresholds. It is applied on n = 50, M p = 40. Here we use RNE (round-to-nearest tie to even) rounding for the round() function.