Linear and Quadratic Interpolators Using Truncated-Matrix Multipliers and Squarers

: This paper presents a technique for designing linear and quadratic interpolators for function approximation using truncated multipliers and squarers. Initial coefﬁcient values are found using a Chebyshev-series approximation and then adjusted through exhaustive simulation to minimize the maximum absolute error of the interpolator output. This technique is suitable for any function and any precision up to 24 bits (IEEE single precision). Designs for linear and quadratic interpolators that implement the 1 /x , 1 / √ x , log 2 (1+2 x ) , log 2 ( x ) and 2 x functions are presented and analyzed as examples. Results show that a proposed 24-bit interpolator computing 1 /x with a design speciﬁcation of ± 1 unit in the last place of the product (ulp) error uses 16.4% less area and 15.3% less power than a comparable standard interpolator with the same error speciﬁcation. Sixteen-bit linear interpolators for other functions are shown to use up to 17.3% less area and 12.1% less power, and 16-bit quadratic interpolators are shown to use up to 25.8% less area and 24.7% less power.


Introduction
The approximation of functions, such as reciprocal, reciprocal root, logarithm, exponential and others, is important for a wide variety of applications, including general-purpose computing, scientific computing, digital signal processing (DSP) and application-specific processors, such as those used for graphics acceleration.The logarithmic number system (LNS), which greatly reduces the area and power required for multiplication and other operations, requires approximation of a special logarithm to perform addition and subtraction.As transistor densities continue to increase, more of these functions are implemented in hardware.Piecewise-polynomial function approximation, using coefficients stored in a lookup table, is an efficient technique with many papers written on the subject [2][3][4][5][6][7][8][9][10][11][12][13].
This paper describes the application of truncated-matrix multipliers and squarers to linear and quadratic interpolators to approximate 1/x, 1/ √ x, log 2 (1 + 2 x ), log 2 (x) and 2 x .Some of the techniques and results discussed in this paper were presented at the 17th IEEE Symposium on Computer Arithmetic in 2005 [1].This work is a significant extension of that paper.Section 2 discusses polynomial function approximation and how initial coefficients are obtained using a Chebyshev-series approximation.Section 4 describes general function-interpolator hardware designs that use standard multipliers and squarer and presents an error analysis that is used to determine initial coefficient lengths and rounding precisions.Section 5 describes a method for minimizing the size of the coefficient lookup table, then truncating the partial-product matrices of the arithmetic units to reduce area and power while maintaining the design specifications for error.Section 6 presents results for several interpolator designs using these techniques.

Polynomial Function Approximation
Approximating a function Y = f (X) is usually done in three steps.First, range reduction is performed to reduce X to x, where x ∈ [x min , x max ).Next, y = f (x) is computed.Finally, Y = f (X) is computed by performing a reconstruction step.Range-reduction and reconstruction techniques for common function approximations are well known [4,13] and are not discussed here.
Polynomial approximations have the following form: where N is the number of terms in the polynomial approximation, f (x) is the function to be approximated and a i is the coefficient of the i-th term.In practice, many hardware implementations are limited to linear or quadratic interpolators.For a linear interpolator, N = 2, and the approximation is: For a quadratic interpolator, N = 3, and the approximation is:

Investigated Functions
The techniques in this paper are applicable to interpolators for a wide variety of functions.In this work, five functions commonly used in high-performance DSP systems are investigated: reciprocal, reciprocal root, the LNS-addition logarithm, logarithm and exponential.For each of these functions, Table 1 gives the range-reduced function computed by the interpolator, the input and output ranges of the interpolator and the maximum absolute values of the second and third derivatives, |f (ξ)| and |f (ξ)|, which are used in the design of the interpolator and its error analysis.The reciprocal function, f (x) = 1/x, is often used in DSP systems to implement division efficiently by converting it to a multiplication by a reciprocal.For example, A ÷ B can be implemented as A × 1/B.The reciprocal can also be used as an initial value for computing higher-precision reciprocals or performing higher-precision division using iterative algorithms, such as Newton-Raphson or the Goldschmidt algorithm [14,15].

Reciprocal Root
The reciprocal root function, f (x) = 1/ √ x, is used in DSP systems to implement division by the square root of a number.One common use for this function is vector normalization.For example, consider the two-dimensional vector V , which has x-axis and y-axis components of V i and V j , respectively.The components of the normalized vector, v, are computed by: These values can be computed efficiently by computing x = V 2 i + V 2 j , then 1/ √ x, then: The reciprocal root function can also be used to compute the square root using the following identity:

LNS-Addition Logarithm
In the logarithmic number system, a real number, A, is represented by its sign and the logarithm of its absolute value [16].Any base can be chosen for the logarithm, but base-two is common and is used exclusively in this work.(In much of the literature on LNS, X and Y are used to represent real numbers, with x and y representing their logarithms.In much of the literature on function interpolation, Y = f (X) is used to represent the value of a function and its argument, with y = f (x) used to represent the value of a function after range reduction has taken place.In order to prevent confusion, X, Y , x and y are used in this paper for function interpolation, while A, B, a and b are used for LNS numbers.)The number A is given as: where S A is the sign of A, with "1" being negative, and the exponent, a, is log 2 (|A|).If |A| < 1, then a is negative, so a must be signed.In this work, a is quantized and stored as a two's-complement fixed-point number with K integer bits and F fractional bits.The K integer bits includes the sign bit for a.Counting the sign bit for A, S A , and the exponent, a, K + F + 1 bits are required to represent A.
The range of a is: so the range of the magnitude of values that can represented is: The representable numbers adjacent to the representation of A are 2 a−2 −F and 2 a+2 −F .From this, it can be seen that the range of representable values is set primarily by K, and F sets the precision.
One significant advantage of LNS is that multiplication and division are simple operations.For multiplication, the exponents of the two operands are simply added.For division, the exponent of the divisor is subtracted from the exponent of the dividend.Squaring and the square root are even simpler.To square a number in LNS, the exponent is multiplied by two, which is easily accomplished by a left shift of one bit.The square root of a number is computed by dividing the exponent by two, which is done by shifting one bit to the right.These operations are summarized in Table 2.
For many DSP algorithms, the majority of silicon area and power are consumed by the operations listed in Table 2, so LNS has great promise.The primary challenge of LNS implementations is that addition and subtraction operations are computationally expensive.As such, optimization of addition and subtraction is an important topic in the LNS literature.
LNS addition of two numbers, A and B, is commonly implemented as [17]: where s b (x) is known as the LNS-addition logarithm.For base-two LNS, which is often implemented by interpolation.Using Equation (10) for addition ensures that x ≤ 0, which in turn ensures that 0 it rounds to zero because the smallest representable number is 2 −F .This occurs when x < log 2 (2 2 −F −1 − 1) and is known as "essential zero".Interpolator implementations can take advantage of the essential zero with no loss in accuracy by limiting the input range to x ∈ [log 2 (2 2 −F −1 − 1), 0], and outputting zero for inputs below that range.The logic for detecting the essential zero can be simplified with a small loss in accuracy by recognizing that log LNS subtraction can be implemented in a similar manner using the LNS-subtraction logarithm, d b (x) = log 2 (1 − 2 x ).Unfortunately, d b (x) is difficult to implement efficiently because it approaches −∞ as x → 0. Coleman's transformation [18] mitigates this problem by eliminating the need to evaluate d b (x) near x = 0, so d b (x) is easier to implement efficiently.However, Coleman's transformation requires an LNS processor to implement both the s b (x) and the d b (x) functions.Arnold et al.'s transformation [19] eliminates the need to implement both functions by implementing d b (x) as a function of s b (x).The drawback is that s b (x) must accept positive values of x.Arnold's improved cotransformation [20] eliminates this problem, allowing d b (x) to be implemented as a function of s b (x) with x ≤ 0. Vouzis et al. [17] improve this cotransformation to work in cases not considered in [20].Thus, the LNS-addition logarithm investigated in this work can be used for both LNS-addition and LNS-subtraction operations.

Logarithm
The base-two logarithm function, f (x) = log 2 (x), can be used to convert a number in two's-complement representation to LNS representation.It can also be used to compute logarithms for other bases using the identity: Note that 1/ log 2 (y) can be precomputed and stored in memory, so that log y (x) can be computed as log 2 (x) multiplied by a constant.The logarithm function is also important in many classes of graphics-shading programs and is implemented in the instruction sets of modern graphics-processing units (GPUs) [13].

Exponential
The base-two exponential function, f (x) = 2 x , can be used to convert a number in LNS representation to two's-complement representation.Like the logarithm, exponentiation is also important for graphics processing and is implemented by modern GPUs.
The logarithm and exponential functions can be used together to compute the powering function, f (x) = x y , using the following identity: x y = 2 y log 2 (x) (13)

Partitioning
In order to reduce the order of the polynomial while maintaining the desired output accuracy, the interval [x min , x max ) is often partitioned into 2 m subintervals of equal size, each with a different set of coefficients.To implement this efficiently in hardware, the interpolator input is split into an m-bit most-significant part, x m , and an (n − m)-bit least-significant part, x l , where n is the number of bits from x input to the interpolator.If any of the leading bits in x are constants, they are not input to the interpolator, so n is not necessarily the number of bits in x.The value input to the interpolator is x m + x l , which is a fixed-point number with n int integer bits.
For the LNS-addition logarithm, x ∈ [x min , 0), where x min varies with the application.As an example, consider the case where x min = −16 and F = 8.In this example, x has the form "1xxxx.xxxxxxxx", and the number of integer bits input to the interpolator, n int , is four.The relationship of x to x m + x l is: In the case where x ∈ [0, 1), such as for the exponential function, x has the form "0.xxxxxxxx".The leading "0" is not input to the interpolator, so n int = 0, and the relationship of x to x m + x l is: In the case where x ∈ [1, 2), such as for the reciprocal function, x has the form "1.xxxxxxxx".The leading "1" is not input to the interpolator, so n int = 0, and the relationship of x to x m + x l is: For the reciprocal root function, when the exponent is odd, x ∈ [2, 4) and has the form "1x.xxxxxxx".In this situation, n int = 1, and the relationship of x to x m + x l is:

Determining the Partitioning Interval
The coefficients for each subinterval are stored in a lookup table .x m is used to select the coefficients, so Equation (1) becomes: where a 0 (x m ), a 1 (x m ), . . ., a N −1 (x m ) are the coefficients.
The approach presented in this paper uses a Chebyshev-series approximation [21] to select the initial coefficient values for each subinterval.These coefficients are then quantized as described in Section 4 and optimized as described in Section 5. Schulte and Swartzlander [4] describe Chebyshev-series approximation in detail.This section summarizes the equations used to generate initial coefficient values for linear and quadratic approximations.
First, the number of subintervals, 2 m , must be determined.When using infinite-precision arithmetic, the error for a function approximated using the Chebyshev-series over the interval [x 1 , x 2 ) is: where ξ is the point on the interval being approximated, such that the N -th derivative of f (x) is at its maximum absolute value [21].Table 1 gives these values for the investigated functions.The range of each subinterval is [x m , x m + 2 n int −m ), so Equation (19) becomes: The maximum allowable error of the interpolator is defined to be 2 −q , which is selected as a design parameter.Error budgets for linear and quadratic designs are discussed in later sections.For both designs, the error budget for E Chebyshev is 1/4 of the total maximum error, so E Chebyshev is limited to 2 −q−2 , and Equation ( 20) is solved for m.Since m must be an integer, this gives:

Calculating the Coefficients
For a linear Chebyshev interpolator, the coefficients used in Equation (18) for the interval [x m , x m + 2 n int −m ) are given by: where: For a quadratic Chebyshev interpolator, the coefficients for the interval [x m , x m + 2 n int −m ) are given by: where: Coefficients for cubic Chebyshev interpolators are given in [22].Coefficients for higher-order Chebyshev interpolators can be computed using the equations presented in [4].

Truncated Multipliers and Squarers
Truncated multipliers and squarers are units in which several of the least-significant columns of partial products are not formed [23].Eliminating partial products from the multiplication or squaring matrix reduces the area of the unit by eliminating the logic needed to generate those terms, as well as reducing the number of adder cells required to reduce the matrix prior to the final addition.Additional area savings are realized because a shorter carry-propagate adder can be used to compute the final results, which often yields reduced delay, as well.Eliminating adder cells and, thus, their related switching activity also results in reduced power consumption.
Figure 1 shows a 14 × 10-bit truncated multiplier, where r denotes the number of unformed columns and k denotes the number of columns that are formed, but discarded in the final result.In this example, r = 7 and k = 2. Eliminating partial products introduces a reduction error, E r , into the output.This error ranges from E r_low , which occurs when each of the unformed partial-product bits is a "1", to E r_high , which occurs when each is a "0".E r_low is given by [24]: where ulps is units in the last place of the product.Since E r_high is zero, the range of the reduction error is: In the example given in Figure 1, the range of reduction error is −1.502 ulps ≤ E r ≤ 0 ulps.In comparison, the error due to rounding a product to the nearest has a range of −0.5 ulps < E rnd ≤ 0.5 ulps.Figure 2 shows a 12-bit truncated squarer.Truncated squarers are an extension of specialized squarers, which are described in [25].As with truncated multipliers, r denotes the number of unformed columns, and k denotes the number of columns that are formed, but discarded in the final result.Unlike truncated multipliers, it is impossible for each of the unformed partial-product bits in a truncated squarer to be "1".E r_low for a truncated squarer is given by [26]: The range of reduction error for a truncated squarer is E r_low ≤ E r ≤ 0. In the example given in Figure 2, k = 2, r = 10, and the range of reduction error is −1.000 ulps ≤ E r ≤ 0 ulps.For generality, k and r will be used in equations given in later sections to describe both standard and truncated multipliers and squarers.For a standard unit, r = 0, because all columns of partial products are formed.

Constant Correction
From the previous discussion, it can be seen that the reduction error for both truncated multipliers and truncated squarers is always negative.One way to offset this error is to add a constant to the partial-product matrix [24].In some applications, it is desirable to select a constant that makes the average error of the multiplier or squarer as close to zero as possible.When designing a function interpolator, however, it is usually desirable to minimize the maximum absolute error.This is done by selecting a correction constant, C cc_abs , equal to the additive inverse of the midpoint of the range of the error.This value is: where E r_low is given by Equation ( 32) for truncated multipliers and Equation ( 34) for truncated squarers.
In practice, C cc_abs is rounded, so that it does not have any bits in the r least-significant columns, since those bits will have no effect on the final output.

Variable Correction
Another way to offset reduction error is through variable correction [27].Figure 3 shows a truncated squarer with variable correction.With this technique, the most-significant column of unformed partial products is added to the next most-significant column.This can be thought of as rounding each row of partial products, such that the r least-significant columns are eliminated.

Finite-Precision Arithmetic Effects
In addition to the error of the Chebyshev approximation, there are errors due to quantization effects, multiplier/squarer rounding and rounding of the final addition that affect the accuracy of the interpolator output.
Each coefficient a i is quantized to n i bits using the round-to-nearest approach and stored in a lookup table.The least-significant bit of each coefficient has a weight of 2 −n f i , where n f i is the number of fractional bits in coefficient a i .The difference between a quantized coefficient and its infinite-precision value is defined as ε i , where In order to prevent excessive intermediate word lengths, multiplier and squarer outputs are rounded.When standard multipliers are used, rounding is accomplished by adding a "1" to the column immediately to the right of the rounding point, then truncating the k least-significant bits at the output.Recall that k is the number of columns in the multiplication matrix that are formed and added, but discarded in the final result.The maximum rounding error, E rnd , is 0.5 ulps, where 1 ulp is the weight of the least-significant bit output by the multiplier or squarer.

Linear Interpolator
Figure 4 shows the block diagram of a linear interpolator that computes Equation ( 2).x m is used to select coefficients a 0 and a 1 from a lookup table.Multiplier 1 computes a 1 • x l , which is then added to a 0 to produce the output.The multiplier output can be kept in carry-save form or a multiply-accumulate (MAC) unit may be used to reduce the overall area and delay of the interpolator.Table 3 gives the error budget for the initial linear-interpolator design.As described earlier, the maximum allowable error for the interpolator is defined to be 2 −q , which is selected as a design parameter.For the Chebyshev approximation, m is selected to limit the approximation error to 1/4 of the total error.The total error budget is slightly more than the maximum allowable error, 2 −q .However, parameters are selected for each source of error, such that the maximum error is less than or equal to the budget amount.Therefore, it is unlikely that each source of error would be at its maximum value at the same time, so these budget values have been found to work for most designs.When these values result in an initial design that does not meet the error specification, one or more of the design parameters are tightened to bring the design into compliance.Table 3. Error budget for linear interpolators.

Source of Error
Error Budget Absolute Fraction Chebyshev approximation Coefficient quantization (2 −q−3 each) Multiplier rounding 2 −q−4 1/16 Rounding at interpolator output Errors in the output due to the quantization of a 0 and a 1 are E ε 0 and E ε 1 , respectively.Since a 0 contributes directly to the output, E ε 0 = ε 0 , so: a 1 is multiplied by x l , which has a maximum value less than 2 n int −m , so: The error budget for quantization of each coefficient is 1/8 of the total error.Thus, the coefficient length is determined by setting E ε i = 2 −q−3 and solving for n f i , so: and: Section 5 describes how these lengths are then reduced to the minimum precision that maintains the desired accuracy of the interpolator.In addition to the fractional bits, a sign bit is needed, so: If additional bits to the left of the binary point are required (e.g., if any value of |a i | is greater than or equal to one), these values are incremented accordingly.Likewise, if some of the most-significant fractional bits are always constant, these values are decremented accordingly.For example, if all values of a i are less than 0.25, the two most-significant fractional bits are always zero and do not need to be stored in the coefficient table or input to the multiplier.
In addition to quantization errors, rounding the multiplier output introduces a rounding error, E rnd_m1 , at the interpolator output.The least-significant-bit (LSB) weight of a 1 is 2 −n f 1 , and the LSB weight of x l is 2 n int −n , so the LSB weight of a full-precision product would be 2 −n f 1 +n int −n .Since r columns of partial products are not formed and k output bits are discarded, the LSB weight of the multiplier output is 2 −n f 1 +n int −n+k m1 +r m1 , so: where k m1 and r m1 are k and r for the multiplier.
The initial design for linear interpolation uses a standard multiplier.The error budget for multiplier rounding is 1/16 of the total error, so k m1 is chosen by setting E rnd_m1 = 2 −q−4 and solving for k m1 .
For a standard multiplier, all columns of partial products are formed, so r m1 = 0. Substituting Equation (39) for n f 1 gives: The equations for finding the initial design parameters for linear interpolators are summarized in Table 4. Table 4. Initial design parameters for linear interpolators.

Design
Initial Value Description Parameter Multiplier 1, number of columns formed, but discarded in the final result

Quadratic Interpolator
Figure 5 shows the block diagram of a quadratic interpolator that computes Equation (3).x m is used to select coefficients a 0 , a 1 and a 2 from a lookup table.A specialized squarer computes x 2 l .Multiplier 1 computes a 1 • x l , and Multiplier 2 computes a 2 • x 2 l , both of which are then added to a 0 to produce the output.As with the linear interpolator, the multiplier outputs can be kept in carry-save form.
E ε 0 and E ε 1 for a quadratic interpolator are the same as for a linear interpolator, given by Equations ( 36) and (37).a 2 is multiplied by the squarer output, which has a maximum value less than 2 2(n int −m) , so: Table 5 gives the error budget for the initial quadratic-interpolator design.The error budget for quantization of each coefficient is 1/16 of the total error, so the coefficient length is determined by setting E ε i equal to 2 −q−4 and solving for n f i , giving: and: Assuming a sign bit in addition to the fractional bits, As noted above, these values are incremented accordingly if additional bits to the left of the binary point are required, or decremented if all values of |a i | are less than 0.5.

Source of Error
Error Budget Absolute Fraction Chebyshev approximation Truncation of squarer input (x l ) 2 −q−4 1/16 Multiplier and squarer rounding (2 −q−5 each) 3 • 2 −q−5 3/32 Rounding at interpolator output Analysis shows that for some configurations, x l , can be truncated at the input to the squarer to reduce the size of the squarer.Assume that t least-significant bits of x l are truncated, such that x l = x l + ε x l , where x l is the truncated version of x l .The squarer output is then x l 2 , rather than x 2 l , resulting in a squarer output error of x l is negligible, the magnitude of the squarer output error is less than 2 2n int −n−m+t+1 .This error is then multiplied by a 2 , so the error at the interpolator output due to ε x l is: assuming all values of |a 2 | are less than or equal to one.The error budget for truncation of the squarer input is 1/16 of the total error, so E εx l is set equal to 2 −q−4 to find the maximum value for t, which gives: If any value of |a 2 | is larger than one, or if all values of |a 2 | are less than 0.5, |E εx l | varies accordingly, which affects the calculation for t.For every additional bit required to represent a 2 , t decreases by one.If all values of |a 2 | are less than 0.5, then one or more of the most-significant fractional bits are always zero.For each of those bits, t can be increased by one.For example, if all values of |a 2 | are less than 0.25, two of the most-significant fractional bits are always zero, so t can be increased by two.If t ≤ 0, x l cannot be truncated.
As with the linear interpolator, k is set for the multipliers and the squarer to limit the rounding error.The error budget for rounding is 1/32 of the total error for each unit, so E rnd is set equal to 2 −q−5 , and the equation is solved for k.
The LSB weight of the Multiplier 1 output is 2 −n f 1 +n int −n+k m1 +r m1 , so: Setting this equal to 2 −q−5 and solving for k m1 gives: Standard multipliers are used for the initial design, so r m1 = 0. Substituting Equation (47) for n f 1 gives: The LSB weight of the squarer output is 2 2n int −2n+2t+ksq+rsq , so: where k sq and r sq are k and r for the squarer.The squarer output is multiplied by a 2 , which is assumed to be less than or equal to one, so the rounding error for the squarer is set equal to 2 −q−5 to find k sq : A standard squarer is used for the initial design, so r sq = 0, so: If any values of |a 2 | are greater than 1.0, k sq is decreased accordingly, and if all values of |a 2 | are less than 0.5, k sq is increased accordingly.
The LSB weight of the Multiplier 2 output equals the LSB weight of a 2 multiplied by the LSB weight of the squarer output, which is 2 −n f 2 +2n int −2n+2t+ksq+rsq+k m2 +r m2 , so: Setting the maximum rounding error equal to 2 −q−5 and solving for k m2 gives: Substituting Equation (48) for n f 2 gives: Standard multipliers and squarers are used for the initial design, so r sq = 0 and r m2 = 0. Substituting Equation (58) for k sq gives: Because k sq is subtracted, the range of values for |a 2 | affects the calculation.If any values of |a 2 | are greater than 1.0, k m2 is increased accordingly, and if all values of |a 2 | are less than 0.5, k m2 is decreased accordingly.
The equations for finding the initial design parameters for quadratic interpolators are summarized in Table 6.

Design Optimization
After the preliminary design is complete, the coefficients are optimized through exhaustive simulation and coefficient adjustment.First, simulation is done using standard multipliers and squarers to find the minimum coefficient lengths that can be used while maintaining a maximum absolute output error less than 2 −q .This reduces the lookup table size.Next, the standard multipliers and squarers are replaced with truncated-matrix units, and simulation is done to maximize the number of unformed columns while remaining within design specifications for error.This reduces the area and power consumption of the interpolator unit.

Simulation Software
Exhaustive simulation is performed using a software tool based on an algorithm for fast, bit-accurate simulation of truncated-matrix multipliers and squarers [28].An interactive graphical user interface (GUI) is provided that allows the user to change the design parameters, then to perform the simulation of all possible input values on some or all of the 2 m subintervals.Figure 6 shows a screenshot.The simulation is bit-accurate, taking into account rounding effects of table lookups and arithmetic units, as well as accounting for the reduction error and correction techniques of truncated-matrix multipliers and squarers.The software also includes a routine for adjusting the coefficients to minimize the maximum absolute error of the output.Simplified pseudocode for this routine is as follows: for delta = -3 to +3 ulps simulate exhaustively, using a(i) + delta if error is improved, set a(i) = a(i) + delta next delta if a(i) was changed i = 0 else i++ end if end while This method for adjusting coefficients is similar to that presented by Schulte and Swartzlander [4].One difference is that the coefficient a i is varied by ±3 ulps, rather than ±1 ulp, because it is possible for increasing/decreasing values of a i to produce identical or worse error results before a better result is achieved, due to the non-monotonic nature of truncated-matrix arithmetic units.Another difference is that if a i changes, i is reset to zero to readjust all lower-order coefficients before higher-order coefficients are adjusted.Thus, priority for adjustment goes in the order a 0 , a 1 , . . ., a N −1 .
The software performs a large number of simulations, so there are practical limits to the size of the interpolator that can be simulated.For this research, the program is run on a laptop computer with a 1.86-GHz Pentium M processor.The time required to run the coefficient-optimization routine for a 24-bit quadratic interpolator is on the order of five to fifteen minutes.This must be done every time a parameter is changed to see if the new configuration meets the design specifications, so the time required to optimize the design is on the order of several hours.Since the software is developed for research purposes, rather than production purposes, there is a great deal of room for improvement in execution speed.For example, the code could be easily ported to C++, optimized, and run on a faster machine.Exhaustive simulation lends itself well to parallel processing, offering another approach to improve speed.Considering the time and money involved in developing commercial hardware that could benefit from this approach, a few hours of computer time is probably insignificant.

Optimization Using Standard Multipliers and Squarers
Using the simulation software just described, the initial hardware design is tested and adjusted to reduce the coefficient lengths as much as possible while maintaining the desired accuracy.This is done through trial-and-error by reducing a single coefficient length by one bit, then running the coefficient-optimization routine described above.If the resulting output error range is acceptable, the length of another coefficient is reduced by one bit.If the error is not acceptable, the coefficient is restored to its previous length.This process continues until each coefficient is reduced to a minimum length, typically one or two bits less than the initial lengths.At this point, the size of the lookup table is minimized.Next, multiplier and squarer output lengths are minimized by attempting to increase k m1 , k m2 and k sq through trial-and-error, as is done for the coefficient lengths, resulting in the final design for a standard interpolator.
If desired, the software allows the user to further explore the design space.Increasing the precision of the multipliers and the squarer by decreasing k below the originally calculated values may allow one or more coefficient lengths to be further reduced, at the expense of larger arithmetic units.The flexibility of the software allows such design trade-offs to be easily quantified.

Optimization Using Truncated-Matrix Units
After the size of the lookup table is minimized as described in the previous section, the multipliers and the squarer are replaced with truncated-matrix units.The goal is to maximize the number of unformed columns of partial-product bits in each unit without exceeding the interpolator output error bounds.This reduces the area and power consumption of the interpolator.
As with finding the minimum coefficient lengths, the maximum value of r for each unit is obtained by trial-and-error.Note that for a standard multiplier or squarer, r = 0, because all partial products are formed.As r is increased for a truncated-matrix unit, k must be decreased by an equal amount in order to maintain the precision and weight of the output product or square.
A good place to start when using truncated-matrix units with constant correction is to set k = 4 and r = k std − 4, where k std is the value of k used for the original standard multiplier or squarer.For variable-correction units, which are generally more accurate for equivalent values of r, one should start by setting k = 3 and r = k std − 3. The coefficient-optimization routine is run each time k and r are changed, and the error bounds are determined through exhaustive simulation.In some designs, r cannot be set very high without violating error constraints.In these cases, decreasing k by one often allows r to be set much higher, with the large decrease in matrix size more than offsetting the effect of the increase in output size.
As with coefficient length reduction described above, k is increased and r is decreased for each unit until coefficient optimization fails to produce an acceptable range of output error.At this point, the previously acceptable values for k and r are restored, and the process continues until the maximum number of unformed columns is determined for each unit.When this is complete, the final truncated-matrix interpolator design is found.The top graph shows the ideal outputs of a digital function interpolator, indicated by circles.Using infinite precision for intermediate calculations and true round-to-nearest at the output, the output error is limited to ± 1 2 ulp.For any function, a ± 1 2 ulp error can be achieved using finite-precision intermediate calculations [4], but for some functions, this may result in long intermediate-precision word lengths and a large lookup table.

Interpolator Error
For many applications, errors larger than ± 1 2 ulp can be tolerated [9].The middle graph of Figure 7 adds squares to indicate the output of an interpolator that uses standard multipliers and squarers, optimized as described in Section 5.2 for a maximum error less than ±1 ulp.Input values of 1, 12 and 24 are shown to have outputs that differ from the ideal values, but still within a ±1 ulp error.Note that keeping the error less than ±1 ulp is also known as faithful rounding [29].
The bottom graph of Figure 7 adds "×"s to indicate the output of a truncated-matrix interpolator that is optimized as described in Section 5.3.When r is zero for each multiplier and squarer in a truncated-matrix interpolator, it is identical to a standard round-to-nearest interpolator and has the same output values.As r is increased, the resulting errors cause some outputs to change.In the bottom graph, the input values of 3, 10, 17 and 24 produce outputs that differ from the standard interpolator.For the first three of those, the error increases, but for f (24), the error actually decreases, because the errors due to finite-precision arithmetic and the errors of the truncated-matrix units are opposite and partially cancel out.This explains why a standard interpolator with less than ±1 ulp error can tolerate additional errors due to using truncated-matrix multipliers and squarers and still have less than a ±1 ulp error.

Methodology
In order to verify the techniques described in this paper and to quantify the benefits, a number of interpolator designs are developed and synthesized.First, linear and quadratic interpolators for the reciprocal function are created for several bit widths.Next, 16-bit linear and quadratic interpolators are designed for the other functions given in Table 1.
For each of the designs, the maximum error is specified to be less than ±1 ulp.The designs are simulated using the bit-accurate simulation tools described in Section 5.1 to ensure that they meet the error specifications.Verilog models for each design are generated using the Verilog-generation tools that have evolved from work presented in [30].For a given design, the constant-correction and variable-correction truncated-matrix versions use the same size of lookup table as the standard round-to-nearest interpolators, but with slightly different values.Because the tables are so similar and can be implemented in different ways, the lookup tables are not included in the model, just the arithmetic units and the pipeline registers.Parallel-tree multipliers and squarers using reduced-area reduction are employed, and the multi-operand adder also uses a parallel-tree structure with reduced-area reduction [31].Otherwise, the models match the block diagrams given in Figures 4 and 5.The models are pipelined in three stages, with the lookup table (and squarer for quadratic designs) in the first stage, the multiplier(s) in the second stage and the multi-operand adder in the final stage.
The designs are synthesized using Synopsys Design Compiler and the Taiwan Semiconductor Manufacturing Company (TSMC) tcbn65gplustc 65-nm standard-cell library with nominal-case operating conditions of 25 • C and V DD = 1.0 V .Timing and area constraints for linear interpolators are set to 0.45 ns and 1 µm 2 , respectively, to find the smallest designs that can run at that speed.A worst-case delay of 0.45 ns is chosen because it pushes the limits of how fast each design can run, yet each design can meet that speed requirement.Synthesizing each design at the same speed provides a more accurate comparison of area and power for each model.Timing and area constraints for quadratic interpolators are set to 0.60 ns and 1 µm 2 , respectively, for the same reasons.

Improvements in Area, Delay and Power
Table 7 shows baseline designs using standard multipliers and squarers.The number of address lines for the coefficient lookup table are given, as well as the total size of the table in bits.Lookup table sizes are comparable to other interpolator design techniques [11,13].Estimates for area, delay and power are given.Delay is reported in nanoseconds, as well as in fanout-of-four (FO4) delay, which is 31.3picoseconds for the TSMC library that is used for synthesis.As one would expect, area and power increase as the number of input bits increases.Furthermore, the arithmetic unit portion of quadratic interpolators requires more area and power than for linear interpolators of the same input size.As mentioned above, the area, delay and power shown in Table 7 do not include the lookup table.Each of the designs in Table 7 are also synthesized using both constant-correction and variable-correction truncated-matrix multipliers and squarers.Table 8 gives results for 12-, 16-and 20-bit linear interpolators for the reciprocal function.The number of bits input to the interpolator, n, is given for each size.As discussed in Section 2.2, constants, such as the leading "1" for the reciprocal interpolator, are not input.The baseline designs using standard multipliers and squarers are listed as SMRTN under type.Constant-correction and variable-correction variants are listed as CCTM and VCTM, respectively.The number of multiplier columns that are formed, but rounded away in the output, k m1 , and the number of columns that are not formed in the truncated-matrix units, r m1 , are given.The minimum and maximum errors, E min and E max , as well as the variance of the error, E σ 2 , are given in ulps.The number of full adders (FAs) and half adders (HAs) used in the multiplier and the multi-operand adder are given for each interpolator.Area, delay and power estimates are given, normalized to the standard design.Parameters affecting the size of the lookup table are given for each design.This includes m, the number of address lines for the lookup table, as well as n 0 and n 1 , the lengths of coefficients a 0 and a 1 .The number of entries in the lookup table is 2 m .A careful examination of the coefficient values shows that some of the coefficient bits do not change and therefore do not need to be stored.For example, the first two bits in the 16-bit reciprocal lookup table are always "01" for a 0 , and the first bit for a 1 is always "1", which can be hard wired rather than stored.Taking this into account, the size of the lookup table in bits is given for each interpolator size.
Table 9 shows results for 16-, 20-and 24-bit quadratic interpolators for the reciprocal function.As with Table 8, key parameters of each design are given, and the size of the lookup table is calculated.Additional parameters that are applicable only to quadratic interpolators are also given.This includes t, the number of LSBs in x l not input to the squarer, and k and r values for the second multiplier and the squarer.As with the linear interpolators, error statistics, the number of FAs and HAs used in the multipliers, squarer and multi-operand adder and area, delay and power estimates are reported.
In order to demonstrate the generality of this method, interpolators for the remaining functions listed in Table 1 are designed and synthesized.Table 10 shows results for 16-bit linear interpolators, and Table 11 shows results for 16-bit quadratic interpolators.The reciprocal root interpolators require two lookup tables; one for the case where the exponent is even and the other for the case where it is odd.From these results, it can be seen that a significant reduction in area and power can be achieved without exceeding the design specification of a ±1 ulp error.For the linear-interpolator designs that are developed, area reductions up to 25.2% and power reductions up to 24.5% are obtained.For quadratic interpolators, area reductions up to 34.3% and power reductions up to 25.7% are obtained.If desired, savings in area can be traded-off for increased speed by changing the constraints in the synthesis tools.
There is no clear tendency for either constant correction or variable correction to produce better results.Constant-correction truncated-matrix multipliers and squarers with r = r −1 unformed columns have the same number of unformed partial products as variable-correction units with r = r unformed columns.For a given number of unformed partial products, constant-correction and variable-correction units have similar error characteristics.Due to the nonlinear nature of these units, however, the actual error bounds of the interpolator using them can only be determined through exhaustive simulation.

Future Work
The main contribution of this work is to extend [1] and show that standard multipliers and squarers can be replaced with truncated-matrix units in function interpolators to reduce area, delay and power for many different functions without changing the specified error bound.
A Chebyshev-series approximation is used to get initial coefficient values.Another approximation method, such as minimax, could be used.Minimax polynomials are known to be better approximations than Chebyshev polynomials.However, these are only initial values, which are modified using the algorithm described in Section 5.1.During the course of this work, it was found that the coefficients only change by a few ulps at most, so using a better initial approximation, such as minimax, would likely produce the same final results.The algorithm for finding the best coefficients is simple, and a better optimization method could be used.Since the coefficients only change by a few ulps, a better algorithm would again likely produce the same final results.While a better initial approximation method and a better optimization method may produce the same final results, they are worth investigating because they would probably reduce the run time of the overall method significantly, which would be beneficial when designing interpolators with larger operand sizes.
In this work, constant-correction [24] and variable-correction [27] methods are used to compensate for errors due to truncating partial-product bits in the multiplication and squaring matrices.These methods are simple to implement and have good error characteristics.Much work has been published in this area, with [32][33][34] being some of the more recent.Although these methods have better error characteristics, they are better by only a fraction of an ulp and require a more involved design procedure.Furthermore, the system-level correction provided by adjusting coefficients using the proposed algorithm may have a greater benefit to the simpler methods, because they have more room for improvement.However, it is worth investigating the use of fixed-width multipliers and squarers with better error characteristics, because they may further reduce area, delay and power and may do so more consistently as the operand size and the function being evaluated are changed.
Much work has been published on function approximation and evaluation, some of the more recent being [35][36][37].It should be noted that the methods presented in this paper can be applied to most, if not all other designs for function approximation.This paper uses a piecewise polynomial evaluation based on a Chebyshev-series approximation as a starting point, because it is well known and produces good results.Designs for function evaluation that use other arithmetic structures, different methods for finding and optimizing coefficients or that are optimized for specific applications can also use the methods presented in this paper.Future work could be done to extend other designs by replacing standard arithmetic units with truncated-matrix versions and applying the methods in this work to adjust coefficients to maximize the number of partial-product bits that are eliminated.

Conclusions
Unlike many high-performance DSP kernels, function interpolators tend to require more precise results, so it would seem that they are not a good application for truncated-matrix units.However, this work shows that even applications with tight error specifications can benefit from using truncated-matrix multipliers and squarers.While this paper presents a technique for designing linear and quadratic

Figure 6 .
Figure 6.Screenshot of the function-interpolator simulation GUI.

Figure 7
Figure 7  shows an example of interpolator error.In each graph, the true value of an arbitrary function f (x) is shown over a small range by a continuous line.The tick marks on the x-axis represent the possible inputs to the interpolator, and the tick marks on the y-axis represent the possible outputs, each tick mark being one ulp.

Table 6 .
Initial design parameters for quadratic interpolators.
Number of bits in x l that are not input to the squarer