Low-Latency Bit-Accurate Architecture for Configurable Precision Floating-Point Division

Floating-point division is indispensable and increasingly important in many modern applications. To improve the speed of floating-point division in actual microprocessors, this paper proposes a low-latency, multi-precision architecture for floating-point division that meets the IEEE-754 standard. The design has three parts: pre-configuration, mantissa division, and quotient normalization. For mantissa division, a Predict-Correct algorithm based on the fast division algorithm is employed, which yields more partial quotient bits per cycle without consuming too much circuit area. A detailed analysis supports the guaranteed accuracy per cycle with no restriction to specific parameters. Synthesized with the TSMC 90 nm standard cell library, the proposed architecture has ≈63.6% latency, ≈30.23% total time (latency × period), ≈31.8% total energy (power × latency × period), and ≈44.6% efficient average energy (power × latency × period/efficient length) overhead over the latest floating-point division structure. In terms of latency, the proposed division architecture is much faster than several classic processors.


Introduction
Modern applications comprise several floating-point (FP) operations including FP addition, multiplication, and division. In recent FP units, emphasis has been placed on designing ever-faster adders and multipliers, with division receiving less attention. Typically, the range for FP addition latency is two to four cycles, and the range for FP multiplication is two to eight cycles [1]. In contrast, the latency for double precision division in modern floating point units ranges from less than eight cycles to over 60 cycles [2].
Literature exists describing division algorithms, of which digit recurrence, functional iteration, variable latency, very high radix, and look-up table are five typical division implementations [3].
The digit-recurrence algorithm is based on iterative subtraction and includes the restoring [4], non-restoring [5], and Sweeney-Robertson-Tocher (radix-n SRT) algorithms (SRT is in fact one of the non-restoring algorithms) [6]. It works digit-by-digit with an iterative subtraction and produces the quotient in sequence [7]. According to [8], a low-radix digit-recurrence algorithm is likely to cause long latency in high-precision calculation because of its linear convergence. In contrast, a high-radix digit-recurrence algorithm can reduce latency at the expense of multifold area consumption.
Functional iteration algorithm is mainly comprised of Newton-Raphson [9,10], Goldschmidt [11,12], Series expansion [13], and Taylor series algorithm [14,15]. A functional iteration divider computes the quotient of division by prediction; thus, based on multiplication instead of subtraction, it can give more than one digit of the quotient in one

• Leading Zero Detection module to transfer subnormal inputs;
• Exception Judgement module to check exception conditions;
• Finite State Machine module to reduce the number of multipliers and to perform basic fast division steps;
• Quotient Selection Unit module to finish the critical part of our proposed Predict-Correct algorithm and to gain the most approximative 32-bit quotient per cycle;
• Rounding Unit module to ensure the accuracy of the unit in the last place of the quotient.
The remainder of this paper is structured as follows. In Section 2, two classical division implementations employing the very high-radix algorithm are reviewed and higher radix division implementation is discussed. In Section 3, a novel Predict-Correct algorithm based on fast division and its mathematical arguments are presented. In Section 4, the general architecture of the proposed multi-precision FP division and its main techniques are detailed. In Section 5, the results of our hardware implementation are reported and compared with prior works. Finally, in Section 6, conclusions are drawn.

Background
Very high-radix class algorithm is similar to non-restoring digit-recurrence algorithm. Their differences lie in hardware and logic arrangements for quotient selection and partial remainder generation. A simple basic schematic of very high-radix class algorithm is presented in Figure 1.
Proposed by Wong and Flynn, fast division [18] is the earliest high-radix algorithm. Fast division requires hardware with at least one look-up table of size $2^{m-1} \times m$ bits and three multipliers, a carrying assimilation multiplier of size (m + 1) × n for the divisor's initial multiplications and a carry-save multiplier of size (m + 1) × m for the quotient segments computation. As for the basic version of fast division, the look-up table has m = 11, i.e., $2^{(11-1)} = 1024$ entries, each 11 bits wide, so in total, 11 K bits are required in the look-up table. As for the advanced version, 736 K bits are required in the look-up table when m = 16.
The high-radix algorithm proposed by Lang and Nannarelli [19] shows the construction of a radix-$2^K$ divider for implementing a radix-10 divider whose quotient digit is decomposed into two parts, one in radix-5 and the other in radix-2. In radix-5, the quotient digit is represented as values {−2, −1, 0, 1, 2}, requiring three multipliers. Radix-2 is used to perform division on the most significant slice. It uses an estimation technique in the quotient selection component, which requires the use of a redundant digit format.
In brief, the high-radix division algorithm works by scaling the dividend and divisor with a correct initial approximation of the reciprocal, followed by quotient selection logic with a multiplier and subtraction. Beyond this, high-radix dividers are almost the same as SRT-based radix dividers.
When it comes to ways to implement higher radix dividers, prior works focus on enhancing the complexity and criticality of SRT-based radix dividers [37,38]. Combining two or more alternatives is another way. Many works provide different standpoints for high-radix dividers. Possible approaches applicable to high-radix dividers include: using different look-up tables along with a quotient-digit selection logic look-up table [39][40][41]; speculating the quotient digit and using arithmetic functions with multiplicative rather than subtractive iterations [42]; pre-scaling operands [43][44][45]; using Fourier division [46,47]; using alternative digit codes such as binary-coded decimal (BCD) digits instead of decimal and basic binary digits [48]; cascading multiple stages of lower radix dividers [49]; overlapping two or more stages of low radix [50,51]; a truncated schema of an exact cell binary shifted adder array [52][53][54]; on-line serial and pipelined operand division [55]; parallel implementation of low-radix dividers [8]; and array implementation [56].

Predict-Correct Algorithm for Division
Inspired by the fast division method [18], this paper proposes a Predict-Correct algorithm that increases iteration speed by producing n more quotient bits per iteration than fast division without consuming much area.
Our proposed division is similar to fast division in that both use multiplication for divisor multiple formation and look-up tables to obtain an initial approximation to the reciprocal of divisor. Their differences lie in the number and type of subsequent operations used in each cycle and the technique used for quotient-digit selection.

Predict-Correct Algorithm with Accurate Quotient Approximation
In the Predict-Correct algorithm, truncated versions of the integer dividend X and divisor Y are used, denoted $X_h$ and $Y_h$. $X_h$ is defined as the high-order p bits of X extended with 0s to obtain a q-bit number, i.e., $X_h = X_{(q-1)} \ldots X_{(q-p)}00\ldots00$, where the number of 0s in $X_h$ is q − p. Similarly, $Y_h$ is defined as the high-order m bits of Y extended with 1s to obtain a q-bit number, i.e., $Y_h = Y_{(q-1)} \ldots Y_{(q-m)}11\ldots11$, where the number of 1s in $Y_h$ is q − m. By these definitions, $X_h$ is always less than or equal to X, and $Y_h$ is always greater than or equal to Y. Let $\Delta X = X - X_h$ and $\Delta Y = Y - Y_h$. The deltas ∆X and ∆Y are the adjustments needed to obtain the true X and Y from $X_h$ and $Y_h$. This implies that ∆X is always nonnegative and ∆Y nonpositive. The fraction $1/Y_h$ is always less than or equal to 1/Y, and, therefore, $X_h/Y_h$ is always less than or equal to X/Y.
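The truncation rules above can be illustrated with a short sketch. The function name `truncate_operands` and the 8-bit toy operands are hypothetical choices made for illustration only; the bit manipulation follows the definitions of $X_h$ (pad with 0s) and $Y_h$ (pad with 1s) given in the text.

```python
def truncate_operands(x: int, y: int, q: int, p: int, m: int) -> tuple[int, int]:
    """Return (x_h, y_h) for q-bit integers x and y.

    x_h keeps the top p bits of x and pads the remaining q - p bits with 0s.
    y_h keeps the top m bits of y and pads the remaining q - m bits with 1s.
    """
    low_x = q - p                      # number of padded 0s in x_h
    low_y = q - m                      # number of padded 1s in y_h
    x_h = (x >> low_x) << low_x
    y_h = ((y >> low_y) << low_y) | ((1 << low_y) - 1)
    return x_h, y_h


# Toy operands (hypothetical values): q = 8, p = 4, m = 3.
x, y = 0b10110111, 0b11010010
x_h, y_h = truncate_operands(x, y, q=8, p=4, m=3)
delta_x, delta_y = x - x_h, y - y_h    # delta_x >= 0, delta_y <= 0
print(bin(x_h), bin(y_h), delta_x, delta_y)
```

In this toy case $X_h \le X$ and $Y_h \ge Y$ hold as stated, so ∆X is nonnegative and ∆Y is nonpositive.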
The Taylor series approximation equation for 1/Y about $Y = Y_h$ is:

$$\frac{1}{Y} = \frac{1}{Y_h} - \frac{\Delta Y}{Y_h^{2}} + \frac{(\Delta Y)^{2}}{Y_h^{3}} - \cdots \qquad (1)$$

The Predict-Correct algorithm (Algorithm 1) is conceptually summarized as follows.
where t is the number of the used leading terms of (1). The relationship of parameter p and m is p = m × t − t + 2.
3: Compute an approximation B to 1/Y using the leading t terms of (1): $B = 1/Y_h - \Delta Y/Y_h^{2} + \cdots + (-\Delta Y)^{t-1}/Y_h^{t}$. Truncate B to the most significant m × t − t + 4 bits, which reduces the sizing of multipliers.
4: Compute intermediate value $P = X_h \times B$ and round P to m × t − t − 1 bits to obtain $P'$.
5: List the $2^n$ kinds of (m × t − t + n − 1)-bit quotients that combine the (m × t − t − 1)-bit intermediate value $P'$ with the subsequent $2^n$ kinds of n-bit predictive values ranging from 00...0 to 11...1.
6: Pick the most approximative (m × t − t + n − 1)-bit quotient $Q_a$ from the $2^n$ kinds of (m × t − t + n − 1)-bit quotients by multiplication and comparison.
7: Calculate $j' = j + m \times t - t + n - 1$. Update j with $j'$.
8: New dividend is $X' = X - Q_a \times Y$. New quotient is $Q' = Q + Q_a \times 1/2^{(q-j)}$.
9: Left-shift $X'$ by m × t − t + n − 1 bits.
10: Update Q with $Q'$. Update X with $X'$.
11: Repeat Steps 4 through 10 until j ≥ q.
12: Final quotient is Q.
In the Predict-Correct algorithm, parameter t represents the number of the used leading terms of (1). For instance, if t = 2, the approximation to 1/Y is $B = 1/Y_h - \Delta Y/Y_h^{2}$.
Parameter m represents the number of digits of the index of look-up tables $G_1, G_2, \ldots, G_t$. m plays a crucial role in the sizes of look-up tables $G_1, G_2, \ldots, G_t$. Moreover, the data width of entries in look-up table $G_i$ is $b_i = (m \times t - t) + \log_2 t - (m \times i - m - i)$, i = 1, 2, ..., t. Parameter n represents the number of additional digits of quotient owing to Steps 5 and 6 per iteration. For instance, if n = 2, the subsequent $2^n$ kinds of (m × t − t + n − 1)-bit possible quotients are $P'$-00, $P'$-01, $P'$-10 and $P'$-11, where the most approximative quotient $Q_a$ is picked from these possible quotients via quotient selection in Step 6.
To help to understand the abovementioned algorithm, an example of fixed-point division using the proposed Predict-Correct algorithm (Algorithm 2) is demonstrated as follows.
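The predict-then-correct loop can also be sketched in software. The model below is a simplified illustration under stated assumptions, not the paper's Algorithm 2 and not a bit-accurate model of the hardware: the function name is hypothetical, B is kept as an exact rational (no look-up-table or arithmetic truncation), the operands are q-bit integers standing for fractions with 1/2 ≤ Y < 1 and X < Y, and the correction step searches one spare bit beyond the n predictive bits ($2^{n+1}$ candidates) so that the exact digit is always among the candidates in this simplified setting.

```python
from fractions import Fraction
from math import floor


def predict_correct_divide(x: int, y: int, q: int, m: int, t: int, n: int, iters: int) -> int:
    """Return floor(x * 2**(k * iters) / y), producing k = m*t - t + n - 1 bits per iteration.

    x and y are q-bit integers standing for x / 2**q and y / 2**q,
    with the top bit of y set (1/2 <= Y < 1) and x < y.
    """
    k = m * t - t + n - 1              # quotient bits produced per iteration
    p = m * t - t + 2                  # dividend bits used by the prediction
    assert q >= p and (y >> (q - 1)) == 1 and 0 <= x < y

    # Y_h: top m bits of Y padded with 1s, so Y_h >= Y.
    low = q - m
    y_h = Fraction(((y >> low) << low) | ((1 << low) - 1), 1 << q)
    neg_dy = y_h - Fraction(y, 1 << q)             # -dY = Y_h - Y >= 0
    # B: first t Taylor terms of 1/Y about Y_h (exact here, an assumption of this sketch).
    b = sum(neg_dy ** (i - 1) / y_h ** i for i in range(1, t + 1))

    quotient, rem = 0, x
    for _ in range(iters):
        # Predict: leading k - n quotient bits from X_h * B (never an overestimate).
        rem_h = (rem >> (q - p)) << (q - p)
        pred = floor(Fraction(rem_h, 1 << q) * b * (1 << (k - n)))
        # Correct: extend the prediction and keep the largest candidate
        # whose product with Y does not exceed the shifted dividend.
        target = rem << k
        for s in reversed(range(1 << (n + 1))):
            cand = (pred << n) + s
            if cand * y <= target:
                break
        quotient = (quotient << k) + cand
        rem = target - cand * y                    # new dividend, already left-shifted
    return quotient


# Small hypothetical parameters: m = 4, t = 2, n = 2 gives k = 7 bits per iteration.
print(predict_correct_divide(40000, 50000, q=16, m=4, t=2, n=2, iters=3))
```

The spare candidate bit is a design choice of this sketch: with an untruncated B and p = m × t − t + 2, the prediction can fall at most one unit short in its last place, so doubling the candidate range is enough to recover the exact digit. The hardware described later in the paper also compares 16 candidates when n = 3.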

Guaranteed Bits per Cycle Using Predict-Correct Algorithm
There are three sources of inaccuracy affecting the approximation B ≈ 1/Y, denoted $R_b$, $R_c$, and $R_d$. $R_b$ represents the error due to truncating the Taylor series after t terms. Since the truncated terms in the series are all nonnegative, $R_b$ is nonnegative. $R_c$ represents the error in using look-up tables with finite-width words to calculate B. As tables are rounded down, $R_c$ is always nonnegative. $R_d$ represents the error in truncating the arithmetic used to calculate B.
To obtain the maximal value of X, $R_b$, $R_c$, and $R_d$ should be accurately bounded. The (w + 1)th term of the Taylor series for B is $B_{w+1} = (-\Delta Y)^{w}/Y_h^{(w+1)}$. The worst case occurs when $-\Delta Y = 1/2^{m} - 1/2^{q}$, the bound holds: Therefore, the remainder $R_b$ is bounded by For m >> 1, For m >> 5, a nonstringent bound can be posed on $R_b$: Suppose the error in each table look-up is $\varepsilon_i$ and the truncation error in computing each multiplicative term is $\delta_i$. Let $\delta_0$ be an additional term that represents truncating B to a certain number of bits after the summation. A cumulative error will be Then, and Since table $G_i$ is $b_i$ bits wide and $1/2 < Y_h < 1$, the maximal value of $(1/Y_h)^{i}$ is slightly less than $2^{i}$. If words in table $G_i$ can represent values up to but not including $2^{i}$, the unit of the most significant bit in table $G_i$ should have value $2^{(i-1)}$ while the unit of the least significant bit should have value $1/2^{(b_i - i)}$.
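One way to see why the Taylor truncation error stays small is to sum the truncated series in closed form. The short derivation below is a supplementary sketch rather than the paper's own bound: it assumes B is formed from exactly t untruncated terms of (1), introduces the shorthand $r = -\Delta Y/Y_h$ (not a symbol from the paper), and uses $1/2 \le Y \le Y_h < 1$ and $0 \le -\Delta Y < 1/2^{m}$.

```latex
\begin{aligned}
B &= \sum_{i=1}^{t} \frac{(-\Delta Y)^{i-1}}{Y_h^{\,i}}
   = \frac{1}{Y_h}\cdot\frac{1-r^{t}}{1-r},
   \qquad r = \frac{-\Delta Y}{Y_h},\\
1-r &= \frac{Y_h+\Delta Y}{Y_h} = \frac{Y}{Y_h}
   \;\Longrightarrow\; B = \frac{1-r^{t}}{Y},\\
R_b &= \frac{1}{Y}-B = \frac{r^{t}}{Y}
   < 2\left(\frac{2^{-m}}{1/2}\right)^{t}
   = \frac{1}{2^{(m\times t-t-1)}}.
\end{aligned}
```

Under these assumptions the relative error of B is exactly $r^{t}$, which is nonnegative and shrinks by roughly m − 1 bits for every additional Taylor term.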
Each $\varepsilon_i$ is less than the unit of the LSB: $\varepsilon_i < 1/2^{(b_i - i)}$. The worst case for −∆Y occurs i.e., $\delta_i$ represents each maximal permissible truncation error and it allows the arithmetic of B to be reduced into an appropriate size to accelerate the computation of B. The allowable truncation error $\delta_i$ can be used both in discarding least significant partial products to prune multiplier trees and in truncating results to smaller widths.
As 1 ≤ B < 2, the most significant bit (MSB) of B has unit 1. Suppose B is truncated to $b_b$ bits. According to the definition of $\delta_0$, $\delta_0$ is less than the LSB, i.e., $\delta_0 < 1/2^{(b_b - 1)}$. To restrict $R_d$ as follows: $R_d < 1/2^{(m \times t - t + 2)}$, $\delta_0$ should be restricted as follows: $\delta_0 < 1/2^{(m \times t - t + 3)}$, i.e., B can be truncated to be $b_b = m \times t - t + 4$ bits; meanwhile, the remaining $\delta_i$ should be restricted to: We already have where S is the follow-up n-bit digits after $P'$ in the partial quotient $Q_a$. Since $P'$ is the truncated version of $P = X_h \times B$ and X the left-shifted version of X, Now substitute the bounds for $R_b$, $R_c$ and $R_d$ into (13) to determine the maximum value of X. Since $X_h < 1$ and $\Delta X \le 1/2^{p} - 1/2^{q}$, the worst case for X is: Since $Y_h \ge Y$, set $Y_h = Y$ in the worst case for X and yield: Take then To determine the worst case of the value of $X_2$ in the case of 1/2 ≤ Y < 1, all possible maxima and minima should be located through setting the partial derivative $\partial X_2/\partial Y$ to zero.
It can be demonstrated that the highest possible value of $X_2$ occurs at Y = 1/2. When p = m × t − t + 2, In Step 4 of every cycle, the highest-order bit of X that could possibly be 1 is the (m × t − t)th bit, $X_{(q - m \times t + t)}$. For any p such that p ≥ m × t − t + 2, the worst case for the value of X is bounded by the above inequality. As a result, at least the front m × t − t − 1 bits of quotient per cycle before Step 5 can be guaranteed in the proposed Predict-Correct algorithm.
According to (12) and (13), Since S is the follow-up n-bit digits after $P'$ in the partial quotient $Q_a$, the worst case of the value of S × Y also occurs in the case of 1/2 ≤ Y < 1 after Steps 5 and 6. Meanwhile, the highest possible value of $X_2$ locates at Y = 1/2, Similarly, the highest-order bit of X that could possibly be 1 is the (m × t − t + n)th bit. As for the subsequent n-bit predictive quotient values, the n-bit value is accurate within the least significant bit in Step 6. In short, at least the front m × t − t + n − 1 bits of quotient per cycle after Step 6 can be guaranteed.
Fast division [18] generates m × t − t − 1 new quotient bits per cycle. Therefore, the algorithm requires q/(m × t − t − 1) cycles, where q is the number of digits of dividend X and divisor Y. In contrast, the proposed Predict-Correct algorithm generates m × t − t + n − 1 bits per cycle, and its implementation theoretically requires q/(m × t − t + n − 1) cycles.

Choice of Parameters m, t, and n in Predict-Correct Algorithm
As stated in Section 3.2, the proposed Predict-Correct algorithm generates m × t − t + n − 1 bits per clock cycle. Changing parameter m, t, or n leads to a different number of guaranteed bits per cycle. The choice of parameters m, t, and n adopted in the FP division hardware architecture is discussed as follows.
Table 1 lists five options of parameter n from 1 to 5. It can be seen from Table 1 that as n increases, the number of possible quotients increases exponentially. When n > 3, the increase in guaranteed bits of quotient per cycle cannot make up for the foreseeable extra cost of the subsequent selectors that accompany more possible quotients in Step 6. Compared with n = 2, n = 3 brings about one more guaranteed bit of quotient with only four more subsequent selectors needed. Compared with n = 3, n = 4 takes eight more subsequent selectors. One more guaranteed bit of quotient apparently cannot outweigh the follow-up computational burden of eight more selectors. To sum up, parameter n in the Predict-Correct algorithm for FP division is chosen to be 3.
When t ≥ 5, the increase in guaranteed bits of quotient per cycle comes at the cost of pre-computing the terms in the approximation B. Pre-computation of the terms in B will definitely add more clock cycles, which is not friendly to low-precision division. When t = 1, the Predict-Correct algorithm becomes the classical reciprocal method. Therefore, t = 2, 3, 4 will be discussed. Table 2 lists different patterns of parameters m and t when n = 3. Since this paper mainly focuses on high-precision FP division, the clock cycles column in Table 2 is analyzed based on quadruple precision FP division. In that column, the former figure represents the number of iterations in the 113-bit mantissa division; the latter represents the number of cycle(s) needed in the pre-computation of the approximation B.
In the case of Row 1 in Table 2, ⌈113/20⌉ = 6; t = 2 means that at least one cycle is needed to pre-compute the approximation B. It can be seen from Table 2 that the listed patterns of parameters m and t have 6 or 7 clock cycles. As parameter m increases from 10 to 11, the number of entries in the look-up table (LUT) increases by 512. As parameter m increases from 11 to 12, the number of LUT entries increases by 1024. Apparently, the sharp increase in LUT size will bring about considerable area. Moreover, the increase in parameter m brings no benefit to clock cycles. Hence, the patterns with m = 12 are not considered, and the remaining four patterns are more rational: m = 10, t = 3; m = 10, t = 4; m = 11, t = 3; m = 11, t = 4. The quotient bits achieved over all iterations by patterns m = 10, t = 3 and m = 10, t = 4 are 116 and 114. Error may occur in the rounding of the iterated 116-bit or 114-bit quotient to the desired 113-bit normalized quotient, so these two patterns are removed from consideration. As for patterns m = 11, t = 3 and m = 11, t = 4, the former achieves more quotient bits than the latter with the same clock cycles. In all, this paper takes m = 11, t = 3 and n = 3.
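The arithmetic behind this comparison can be reproduced with a few lines. This is a sketch only: the function name and dictionary keys are hypothetical, the LUT entry count $2^{m-1}$ is taken from the fast division description in Section 2, and pre-computation cycles are not modeled because Table 2 is not reproduced here.

```python
from math import ceil


def pattern(m: int, t: int, n: int = 3, mantissa_bits: int = 113) -> dict:
    """Summarize one (m, t, n) choice for a mantissa of the given width."""
    bits_per_cycle = m * t - t + n - 1
    iterations = ceil(mantissa_bits / bits_per_cycle)
    return {
        "bits_per_cycle": bits_per_cycle,
        "iterations": iterations,
        "total_bits": bits_per_cycle * iterations,   # quotient bits after all iterations
        "lut_entries": 2 ** (m - 1),                 # table indexed by m - 1 bits
    }


for m, t in [(10, 2), (10, 3), (10, 4), (11, 3), (11, 4)]:
    print(m, t, pattern(m, t))
```

For m = 10 the iterated quotient is 116 bits (t = 3) or 114 bits (t = 4), only slightly wider than the 113-bit target, whereas m = 11 yields 128 bits (t = 3) or 126 bits (t = 4).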

General Architecture and Main Parts
The proposed Predict-Correct algorithm can apply to both fixed-point division and floating-point division. Based on the bit-accurate Predict-Correct algorithm, this paper designs a multi-precision FP division architecture with low latency.
In our design, take m = 11, t = 3 and n = 3. According to the Predict-Correct algorithm, 32 bits of new quotient can be generated per cycle. Denote the desired accuracy of the quotient as precision, which is determined by the 2-bit input signal type. According to the IEEE-754 standard, precision and type cover three FP formats: type = 2'b00 and precision = 23 for single precision (SP, 32 bits, 23 bits of mantissa), type = 2'b01 and precision = 52 for double precision (DP, 64 bits, 52 bits of mantissa), and type = 2'b10 and precision = 113 for quadruple precision (QP, 128 bits, 113 bits of mantissa).
The overall architecture of our proposed FP divider is illustrated in Figure 2. The proposed design is divided into three parts. The inputs are two multi-precision FP numbers: dividend_In and divisor_In. Part 1 PRECONFIG involves pre-configuration and exception judgement of the two inputs, Part 2 MANTISSA_DIVIDE mainly fulfills 29-bit-accurate quotient approximation and subsequent 3-bit quotient selection, and Part 3 NORMALIZE achieves normalization.

Part 1 PRECONFIG
As presented in Figure 2, PRECONFIG first judges whether exception situations exist after breaking down the two FP inputs dividend_In and divisor_In into three portions: sign, exponent, and mantissa. If either one of the two inputs or both belong to the following exception situations: Zero, Infinity or NaN (Not a Number), the exception signal Exception is then judged to be Zero or NaN depending on dividend_In and divisor_In. Specific judgement is illustrated in Table 3.
Otherwise, the exception signal Exception is set to Normal. Next, Leading Zero Detection is performed on the mantissas of dividend_In and divisor_In for subnormal checks.
Two output signals dividend_Out and divisor_Out are the mantissas of dividend_In and divisor_In after subnormal checks with an implicit bit "1".
At last, the exponent difference Exp_out and the exclusive-OR value Sign_out (if the exception signal Exception is normal) of the inputs dividend_In and divisor_In are computed and delivered to NORMALIZE along with the signal Exception. The above explanation of the operations in Part 1 PRECONFIG can be observed in Figure 2.
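The PRECONFIG data path for non-exceptional inputs can be sketched as follows. This is an illustrative software model for IEEE-754 double precision only, with hypothetical function names; it assumes finite, nonzero inputs (Exception is Normal), models Leading Zero Detection as a left shift that makes the leading 1 explicit, and reports the exponent difference in unbiased form, which is an assumption since the text does not state how Exp_out is encoded.

```python
import struct

BIAS, FRAC_BITS = 1023, 52             # IEEE-754 double precision


def unpack(value: float) -> tuple[int, int, int]:
    """Split a double into (sign, exponent field, fraction field)."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    return bits >> 63, (bits >> FRAC_BITS) & 0x7FF, bits & ((1 << FRAC_BITS) - 1)


def normalized(exp_field: int, frac: int) -> tuple[int, int]:
    """Return (unbiased exponent, 53-bit mantissa with an explicit leading 1)."""
    if exp_field != 0:                                  # normal number
        return exp_field - BIAS, (1 << FRAC_BITS) | frac
    lead_zeros = (FRAC_BITS + 1) - frac.bit_length()    # leading zero detection
    return 1 - BIAS - lead_zeros, frac << lead_zeros    # subnormal, shifted left


def preconfig(dividend: float, divisor: float) -> tuple[int, int, int, int]:
    """Return (Sign_out, Exp_out, dividend mantissa, divisor mantissa)."""
    s_a, e_a, f_a = unpack(dividend)
    s_b, e_b, f_b = unpack(divisor)
    exp_a, man_a = normalized(e_a, f_a)
    exp_b, man_b = normalized(e_b, f_b)
    return s_a ^ s_b, exp_a - exp_b, man_a, man_b


print(preconfig(6.0, -1.5))
```

After this step both mantissas have their leading 1 in the same bit position, so the mantissa divider does not need a separate path for subnormal inputs.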
The detailed architecture of Exception Judgement is demonstrated in Figure 3. Exception Cases mainly check whether the input FP number belongs to one of the following exception conditions: Zero (the input's exponent and mantissa are both all "0"), Infinity (the input's exponent is all "1" while the mantissa is all "0"), and NaN (the input's exponent is all "1" but the mantissa is not all "0").
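The three exception conditions translate directly into field tests. The sketch below applies them to an IEEE-754 double; the function name `classify` is hypothetical, and returning "Normal" for everything else (including subnormal inputs, which are left to Leading Zero Detection) is an assumption of this illustration.

```python
import struct


def classify(value: float) -> str:
    """Classify a double by its exponent and mantissa fields."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    exp_field = (bits >> 52) & 0x7FF           # 11-bit exponent field
    frac = bits & ((1 << 52) - 1)              # 52-bit mantissa field
    if exp_field == 0 and frac == 0:
        return "Zero"                          # exponent and mantissa all 0
    if exp_field == 0x7FF:
        return "Infinity" if frac == 0 else "NaN"
    return "Normal"                            # includes subnormal inputs


for v in (0.0, -0.0, float("inf"), float("nan"), 1.5, 5e-324):
    print(v, classify(v))
```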

Part 2 MANTISSA_DIVIDE
MANTISSA_DIVIDE is intended to accelerate the calculation and to reduce the number of multipliers (i.e., to increase the use rate of the multipliers employed in MANTISSA_DIVIDE).
Since SP, DP, and QP are three common FP formats in our architecture, 23, 52, and 113 are the mantissa bit numbers of SP, DP, and QP with an implicit bit "1". Furthermore, a customized precision can be supported by our architecture by adjusting only the Exception Judgement module in PRECONFIG, the 3:1 multiplexer (MUX) in MANTISSA_DIVIDE, and the Rounding Unit module in NORMALIZE.
As the Predict-Correct algorithm demonstrated, MANTISSA_DIVIDE first calculates the value of B. Since m = 11 and t = 3, $B = 1/Y_h - \Delta Y/Y_h^{2} + (\Delta Y)^{2}/Y_h^{3}$. The traditional fast division algorithm looks up $1/Y_h$, $1/Y_h^{2}$, $1/Y_h^{3}$, ... in the tables $G_1$, $G_2$, $G_3$, ... at the same time, which costs more area and power. In our implementation, to accelerate the calculation of B, a Finite State Machine unit is employed with a Reciprocal Look-up Table. The least significant partial products can be truncated for high-order terms.
The state transition diagram of the Finite State Machine unit appearing in Figure 2 is shown in Figure 4. In detail, the procedure of every cycle is presented in Figure 5. The critical path of the design in this paper lies in a 34-bit × 34-bit multiplier and the Quotient Selection Unit, as demonstrated with red dotted lines in Figure 5. The first cycle only looks up $1/Y_h$ with the low 10 bits of the divisor Y's high 11 bits in the Reciprocal Look-up Table, calculates the product deltaY_Y with a 116-bit × 34-bit multiplier, where deltaY_Y is truncated to 34 bits, and then calculates with a 34-bit × 34-bit multiplier.
In the second cycle, the 34-bit × 34-bit multiplier is employed again, after which B can be obtained. In the following cycles, the intermediate value $X_h \times B$ is computed with the 34-bit × 34-bit multiplier, where $X_h$ is the leading 32 bits of X, and the product is rounded to obtain a 29-bit intermediate value q_h. The Quotient Selection Unit, which contains the 116-bit × 34-bit multiplier, realizes the multiplicative quotient selection method in order to attain the most approximative 32-bit quotient q_32 and the partial product tmp_product.
Appl. Sci. 2021, 11, x FOR PEER REVIEW 14 of 23
The architecture of the Quotient Selection Unit is shown in Figure 6. In Figure 6, the 116-bit × 34-bit multiplier is employed again to calculate the product of the 29-bit intermediate value q_h and divisor Y. Through addition and subtraction operations, 6 minor possible quotients and 2 major possible quotients are generated afterwards. All in all, there are 16 possible 32-bit quotients. The architecture of Partial Product Comparator is presented in Figure 7. ① in Figure 7 [20]). Such rule is applicable for the rest of 16 MUX.
The Partial Product Comparator module calculates the 16 partial products of the 16 possible 32-bit quotients and divisor Y using product and Y, and obtains the final partial product tmp_product through the Product Comparator. The architecture of Partial Product Comparator is displayed in Figure 8. As displayed in Figure 9, the architecture of module Quotient Comparator in Figure 6 is quite similar to module Partial Product Comparator. While the Product Comparator module outputs tmp_product with a 16-stage MUX group, the Quotient Comparator module outputs the most approximative 32-bit partial quotient q_32, which makes up 32 bits of quotient_out_tmp every cycle.
After the Quotient Selection Unit, cnt is updated, and the new dividend and new quotient are thereby obtained. Since Y is invariant during the whole division process and the first and second cycles have already computed B, only two multiplications (Xh × B and q_32 × Y) and one subtraction are needed in the third and later cycles.
Afterwards, the third-cycle procedure is repeated until cnt > precision. Once this condition holds, the recurrence terminates and control jumps to NORMALIZE.

Part 3 NORMALIZE
NORMALIZE normalizes the quotient signal quotient_out generated in MANTISSA_DIVIDE into the standardized output quotient, following the round-to-nearest principle.
The Rounding Unit module is performed as follows:
As shown in Figure 10, SP, DP, and QP inputs need 3, 4, and 6 cycles, respectively.
Figure 10. Latency of the proposed FP division for SP, DP, and QP inputs.
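The round-to-nearest step in NORMALIZE can be illustrated on an integer mantissa that carries a few extra low-order bits. The sketch below is an assumption-laden illustration, not the paper's Rounding Unit: it assumes ties are resolved to even (the IEEE-754 default) and that any sticky information from a non-zero remainder has already been folded into the dropped bits. The function name is hypothetical.

```python
def round_to_nearest_even(value, drop_bits):
    """Drop the low `drop_bits` bits of a non-negative integer.

    Rounds to nearest; an exact tie goes to the even result (assumed here,
    as in the IEEE-754 default rounding mode).
    """
    if drop_bits <= 0:
        return value
    kept = value >> drop_bits                 # bits that survive
    rem = value & ((1 << drop_bits) - 1)      # bits being discarded
    half = 1 << (drop_bits - 1)               # exactly one half ULP
    if rem > half or (rem == half and (kept & 1)):
        kept += 1                             # round up
    return kept


rounded_up = round_to_nearest_even(0b1011, 2)    # 2.75 -> 3
tie_to_even = round_to_nearest_even(0b1010, 2)   # 2.5  -> 2
tie_to_odd = round_to_nearest_even(0b1110, 2)    # 3.5  -> 4
rounded_down = round_to_nearest_even(0b1001, 2)  # 2.25 -> 2
```

Note that rounding up can carry out of the mantissa width (as in the 3.5 -> 4 case), in which case a real unit has to shift the mantissa right by one and increment the exponent.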

Results and Comparisons
The proposed Predict-Correct iterative FP division architecture is synthesized with the TSMC 90 nm standard cell library, using Synopsys Design Compiler. The implementation details are shown in Table 4. The FP division unit is synthesized with the best achievable timing constraints, with the max-area constraint set to zero and a global operating voltage of 0.9 V.

In Table 4, two novel metrics are proposed. One is total time (latency × period) and the other is efficient average time (latency × period/efficient length). Total time is defined as the time a division unit needs from receiving its inputs to producing its result. Efficient average time is defined as the time a division unit needs to process a single bit of the input numbers. Total time measures the computational speed of a single division operation. It is significant because division is infrequent in processors and co-processors, which makes the speed of each individual division important. Efficient average time measures the ability of a division unit to process high-precision inputs, which makes it useful in large-scale high-precision applications. In our implementation, the division unit produces a quotient portion of at least 29 bits in a single cycle, and a longer quotient portion when the correction mechanism between iterations is applied in the same cycle. In addition, there is only one pre-computational cycle before the iterations, used for unpacking the dividend and divisor inputs and for pre-configuration. Finally, no post-processing cycle for rounding is needed after the iterations.
The total latency of the division unit consists of the cycles of PRECONFIG, MANTISSA_DIVIDE, and NORMALIZE. Additionally, as the proposed division unit is iterative in nature, the next input can be applied as soon as the finite state machine (FSM) in MANTISSA_DIVIDE finishes the current processing. Thus, the Predict-Correct FP division unit has a latency of 3, 4, and 6 cycles for SP, DP, and QP FP division computations, respectively.
In DP mode, the Predict-Correct FP division unit requires one more cycle than in SP mode, and in QP mode it requires two more cycles than in DP mode.
It can be seen from Table 4 that the higher the precision of the proposed FP division unit, the more economical its latency and total time (latency × period) in the implementation results. Furthermore, efficient average time (latency × period/efficient length) also indicates that the proposed FP division architecture is of more value when used in more accurate computation.
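To make the two time metrics concrete, the following sketch evaluates them for the three latencies stated above. Only the latencies (3, 4, and 6 cycles) come from the text; the clock period is a placeholder rather than a value from Table 4, and taking the efficient length to be the full format width (32, 64, and 128 bits) is an assumption made for illustration.

```python
LATENCY_CYCLES = {"SP": 3, "DP": 4, "QP": 6}   # from the text
FORMAT_BITS = {"SP": 32, "DP": 64, "QP": 128}  # ASSUMED efficient length
PERIOD_NS = 2.0                                # PLACEHOLDER, not from Table 4


def total_time(latency_cycles, period_ns):
    """Total time = latency x period."""
    return latency_cycles * period_ns


def efficient_average_time(latency_cycles, period_ns, efficient_length_bits):
    """Efficient average time = latency x period / efficient length."""
    return latency_cycles * period_ns / efficient_length_bits


metrics = {
    name: (
        total_time(LATENCY_CYCLES[name], PERIOD_NS),
        efficient_average_time(LATENCY_CYCLES[name], PERIOD_NS, FORMAT_BITS[name]),
    )
    for name in LATENCY_CYCLES
}
```

With a fixed period, the per-bit time falls as the precision grows, because the latency grows more slowly than the operand width. This is the trend the efficient average time metric is meant to expose.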

Functional Verification
The functional verification of the proposed FP division unit is carried out using 5 million random test cases covering normal-normal, normal-subnormal, subnormal-normal, and subnormal-subnormal operand combinations, along with verification of the other exceptional cases, in quadruple mode.
The proposed Predict-Correct FP division unit with QP produces a maximum precision loss of 1 ULP (unit in the last place). The statistical correct rate of the proposed divider over the 5 million random QP test cases is 99.996%, compared against Bigfloat reference results computed using Python.
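A check of this kind can be sketched with exact rational arithmetic from the Python standard library. The code below is not the authors' test bench: it uses `fractions.Fraction` in place of the Bigfloat reference, it covers only normal operands, and `truncating_divide` is a hypothetical stand-in for the device under test (it drops bits instead of rounding, so it also shows at most 1 ULP of error).

```python
import random
from fractions import Fraction


def scale_quotient(x, y, precision):
    """Return (scaled, e) with x/y == scaled * 2**e and 2**(p-1) <= scaled < 2**p."""
    q = Fraction(x, y)
    e = q.numerator.bit_length() - q.denominator.bit_length() - precision
    scaled = q / Fraction(2) ** e
    while scaled >= 2 ** precision:
        scaled /= 2
        e += 1
    while scaled < 2 ** (precision - 1):
        scaled *= 2
        e -= 1
    return scaled, e


def reference_divide(x, y, precision):
    """Correctly rounded quotient as (significand, exponent)."""
    scaled, e = scale_quotient(x, y, precision)
    sig = round(scaled)            # Fraction rounds half to even
    if sig == 2 ** precision:      # rounding carried out of the significand
        sig, e = sig // 2, e + 1
    return sig, e


def truncating_divide(x, y, precision):
    """HYPOTHETICAL device under test: truncates instead of rounding."""
    scaled, e = scale_quotient(x, y, precision)
    return scaled.numerator // scaled.denominator, e


def ulp_error(dut, ref):
    """Distance between two (significand, exponent) results in reference ULPs."""
    dut_val = dut[0] * Fraction(2) ** dut[1]
    ref_val = ref[0] * Fraction(2) ** ref[1]
    return abs(dut_val - ref_val) / Fraction(2) ** ref[1]


def run_trials(n, precision=113, seed=0):
    """Return (fraction of exact matches, worst ULP error) over n random cases."""
    rng = random.Random(seed)
    exact, worst = 0, Fraction(0)
    for _ in range(n):
        # Force the top bit so both operands are normalized significands.
        x = rng.getrandbits(precision) | (1 << (precision - 1))
        y = rng.getrandbits(precision) | (1 << (precision - 1))
        err = ulp_error(truncating_divide(x, y, precision),
                        reference_divide(x, y, precision))
        exact += err == 0
        worst = max(worst, err)
    return exact / n, worst


correct_rate, worst_ulp = run_trials(200)
```

Replacing `truncating_divide` with the results captured from a simulation of the actual unit would give the correct rate and the maximum ULP loss in the same form as reported above; subnormal and exceptional operands would need separate generators.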

Related Work and Comparisons
Javier D. Bruguera proposed a low-latency FP division unit in [8] based on a radix-64 digit-recurrence algorithm. However, the implementation in [8] does not report necessary parameters such as technology, area, and power. In [57], Jaiswal et al. proposed a QP FP division unit with SP and DP support, based on a series expansion methodology. The multi-precision architecture in [57] is implemented on an FPGA device at a frequency of 89 MHz, without matching parameters for a comparison. In [58], Jaiswal et al. proposed an iterative dual-mode DPdSP division architecture using the series expansion algorithm, a digit-recurrence method, synthesized with the TSMC 90 nm library. Compared with [8] and [57], [58] is a more appropriate object for our result comparison, as it provides detailed information after hardware implementation.
A comparison with the prior work [58] on FP division architecture is shown in Table 5. The definitions of total energy and efficient average energy are analogous to those of total time and efficient average time. Total energy (power × latency × period) is defined as the energy a division unit needs from receiving its inputs to producing its result. Efficient average energy (power × latency × period/efficient length) is defined as the energy a division unit needs to process a single bit of the input numbers. These two novel metrics measure the computational efficiency of a division unit in terms of power. Since [58] has no QP synthesis results, only DP FP inputs are compared between [58] and the proposed division unit. A technology-independent comparison is presented in terms of area, latency, period, and power. The comparison is also made in terms of four unified metrics, total time (latency × period), total energy (power × latency × period), efficient average time (latency × period/efficient length), and efficient average energy (power × latency × period/efficient length), all of which should be smaller for a low-latency design. Both division units include support for subnormal computation.
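The two energy metrics can be written down in a few lines, as sketched below. The class and field names are hypothetical. The DP latencies (4 cycles for the proposed unit, 15 cycles for the 2-Stage Multiplier design in [58]) are taken from the text, while every period, power, and efficient-length value is a placeholder chosen for easy arithmetic and is not a figure from Table 5.

```python
from dataclasses import dataclass


@dataclass
class DividerResult:
    latency_cycles: int
    period_ns: float
    power_mw: float
    efficient_length_bits: int

    @property
    def total_time_ns(self):
        """latency x period"""
        return self.latency_cycles * self.period_ns

    @property
    def total_energy_pj(self):
        """power x latency x period (mW x ns = pJ)"""
        return self.power_mw * self.total_time_ns

    @property
    def efficient_average_energy_pj_per_bit(self):
        """power x latency x period / efficient length"""
        return self.total_energy_pj / self.efficient_length_bits


# PLACEHOLDER period, power, and length values; only the latencies are sourced.
proposed_dp = DividerResult(latency_cycles=4, period_ns=4.0,
                            power_mw=10.0, efficient_length_bits=64)
reference_dp = DividerResult(latency_cycles=15, period_ns=2.0,
                             power_mw=8.0, efficient_length_bits=64)

energy_ratio = proposed_dp.total_energy_pj / reference_dp.total_energy_pj
```

Because all four unified metrics are products of the same reported quantities, a design with a longer period or higher power can still come out ahead when its latency is sufficiently lower.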
The architecture with 1-Stage Multiplier in [58] has a latency of 10 clock cycles for DP, and 8 clock cycles for SP; while that with 2-Stage Multiplier has a latency of 15 clock cycles for DP, and 11 clock cycles for SP.
In comparison to the dual-mode architecture of Jaiswal et al. [58], the proposed Predict-Correct iterative FP division architecture achieves much lower latency and total time. The unified metrics total energy (power × latency × period) and efficient average energy (power × latency × period/efficient length) of the proposed architecture are much better than those of the architecture with 2-Stage Multiplier in [58]. In detail, the proposed FP division unit has ≈63.6% latency, ≈30.23% total time (latency × period), ≈31.8% total energy (power × latency × period), and ≈44.6% efficient average energy (power × latency × period/efficient length) overhead over 2-Stage Multiplier in [58].
Table 6 compares the latency and total time of the proposed division unit with those of classic processors for FP SP and DP with normalized operands and result, namely Intel Penryn [59], IBM zSeries [60], IBM z13 [61], HAL Sparc [62], AMD K7 [63], and AMD Jaguar [64].
It must be pointed out that the comparison is made in terms of latency and total time, without taking into account that different processors may use different technologies and run at different frequencies. Moreover, since the classic processors are usually pipelined, their clock speeds are considerably high. However, the architecture proposed in this paper is not pipelined, since division is an infrequent operation in computer arithmetic, which leads to a somewhat lower clock speed. If our architecture were pipelined, its clock speed is predicted to at least double, which we may study in the future. Most of the designs in Table 6 use a multiplicative division algorithm or a radix-16/8/4 digit-recurrence algorithm. The Intel Penryn processor [59] implements a radix-16 combined division unit by cascading two radix-4 iterations every cycle. Consequently, the latency is almost halved with respect to that of the radix-4 unit. The IBM z13 processor [61] has a divide unit supporting SP, DP, QP, and all the hexadecimal FP data types. The underlying algorithm is a radix-8 division generating 3 bits per cycle. The major challenge was to perform a radix-8 divide step on a wide QP mantissa, 113 bits plus some extra rounding bits, and fit it in a single cycle.
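The relationship between radix and iteration count mentioned above follows from the fact that a radix-2^k digit-recurrence step retires k quotient bits. The sketch below counts quotient-digit iterations only; it ignores extra rounding bits and any pre- or post-processing cycles, so it is an idealized estimate rather than the latency of any of the listed processors. The function names are illustrative.

```python
import math


def bits_per_iteration(radix):
    """Quotient bits retired per step; the radix must be a power of two."""
    if radix < 2 or radix & (radix - 1):
        raise ValueError("radix must be a power of two")
    return radix.bit_length() - 1


def digit_iterations(mantissa_bits, radix):
    """Idealized number of digit-recurrence steps for a given mantissa width."""
    return math.ceil(mantissa_bits / bits_per_iteration(radix))


dp_radix4 = digit_iterations(53, 4)    # one radix-4 step per cycle
dp_radix16 = digit_iterations(53, 16)  # two cascaded radix-4 steps per cycle
qp_radix8 = digit_iterations(113, 8)   # radix-8, 3 bits per cycle
```

For a 53-bit DP mantissa this gives 27 steps at radix 4 and 14 at radix 16, which is the "almost halved" effect of cascading two radix-4 iterations; a 113-bit QP mantissa at radix 8 needs 38 steps.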
As shown in Table 6, our proposal obtains much lower latencies. The multiplicative implementation is limited by the latency of the multiplier of multiply-and-accumulate units. On the other hand, the implementation in [60] uses a very low radix, which implies a high number of iterations, although its implementation is quite simple. As for the total time for SP or DP, owing to its low clock speed, our proposal falls in the middle of the range of the classic processors.

Conclusions
This paper has presented a novel Predict-Correct iterative architecture for configurable FP division arithmetic. It can be dynamically configured for SP, DP, QP, or other user-defined precisions. Aiming at fast and efficient FP division processing, the architecture is also proposed with period and power trade-offs.
The Predict-Correct algorithm is based on very high-radix arithmetic. The entire logic path has been tuned to perform a low-latency computation. The proposed FP division unit has ≈63.6% latency and ≈30.23% total time overhead over [58]. Moreover, the proposed FP division unit outperforms the prior art in terms of total energy (power × latency × period) and efficient average energy (power × latency × period/efficient length), which are two unified metrics relevant to effective energy. The implementation results show that the proposed FP division unit is most favorable for DP, QP, or other high-precision computations.
Based on the current proposed division architecture, similar units for division can be formed using other algorithms, such as Newton-Raphson, Goldschmidt, and series expansion. Moreover, the proposed division architecture can also be employed in fast fixed-point division after simple adjustment.