Piecewise Parabolic Approximate Computation Based on an Error ‐ Flattened Segmenter and a Novel Quantizer

: This paper proposes a novel Piecewise Parabolic Approximate Computation method for hardware function evaluation, which mainly incorporates an error ‐ flattened segmenter and an im ‐ plementation quantizer. Under a required software maximum absolute error (MAE), the segmenter adaptively selects a minimum number of parabolas to approximate the objective function. By com ‐ pletely imitating the circuit’s behavior before actual implementation, the quantizer calculates the minimum quantization bit width to ensure a non ‐ redundant fixed ‐ point hardware architecture with an MAE of 1 unit of least precision (ulp), eliminating the iterative design time for the circuits. The method causes the number of segments to reach the theoretical limit, and has great advantages in the number of segments and the size of the look ‐ up table (LUT). To prove the superiority of the proposed method, six common functions were implemented by the proposed method under TSMC ‐ 90 nm technology. Compared to the state ‐ of ‐ the ‐ art piecewise quadratic approximation methods, the proposed method has advantages in the area with roughly the same delay. Furthermore, a uni ‐ fied function ‐ evaluation unit was also implemented under TSMC ‐ 90 nm technology.


Introduction
High speed hardware function evaluation has been widely deployed in computer graphics. Graphical processing units (GPU) use special function units (SFU) to compute elementary functions, which are essential operations in graphical processing [1]. As the constraints on performance become much more stringent, more and more hardware-oriented methods for high-speed elementary functions have been proposed. Generally, implementation methods of unary functions can be separated into two categories: iterative methods and non-iterative methods. From the perspective of VLSI implementation, iterative methods need to repeat the same or similar calculations multiple times on the time axis, while non-iterative methods use the calculation resource only once.
The mainstream iterative methods of complex computing units are the Newton-Raphson (NR) method [2] and Goldschmidt algorithms [3], which are multiplicativebased. Although these methods converge quickly, each iteration requires several multipliers, which lead to a long execution time and large area. Coordinate Rotation Digital Computer (CORDIC) is also popular [4,5], but its latency is too high due to its multicycle execution delays.
Non-iterative methods are usually based on look-up tables (LUTs). According to the size of LUTs and computation complexity, table-based methods are further divided into three categories: computer-bound methods, table-bound methods, and in-between methods. Computer-bound methods use small-sized LUTs to store significant parameters which are used in cubic or higher-degree polynomial calculation. Therefore, a lot of multiplications and additions are necessary, which leads to a high hardware overhead. The polynomial is usually implemented with fused multiply-add units using Horner's rule. Table-bound methods use large LUTs and simple calculation units, usually a few additions. Bipartite table methods (BTM) [6] and multipartite table methods [7] are typical examples of table-bound methods, which use a few tables and some additions. In general, table-bound methods are suitable for low-precision applications, because as the accuracy increases, the size of the LUTs increases exponentially. In high-precision applications, the hardware overhead of large LUTs is unaffordable.
In middle-to-high-precision applications, in-between methods are more popular, because they obtain a better trade-off between the size of LUTs and arithmetic complexity. This method can be further subdivided into linear approximation and quadratic approximation depending on the polynomial approximation degree. Linear approximation has advantages when the accuracy is low, because of its high speed and low computation complexity [8,9]. It is usually used in low-precision applications or to provide an initial value for other methods. Ref. [10] proposed two levels of approximation for medium-tohigh-precision applications. In the first level of initial approximation, piecewise degreeone polynomials are used to approximate the target function, and approximations to the corresponding difference functions are generated. Afterward, in the second level of refined approximation, the shared normalized difference function is computed using either direct LUTs or another piecewise interpolation. This method reduces the total area at the cost of a long delay. In general, quadratic approximation offers a better compromise between LUT size and computation complexity in medium-to-high-precision applications [11,12].
The uniform segmentation method is a common method in quadratic approximation. The input X is split into two parts, the most significant part X and the least significant part X . The target function is evenly divided into 2 subintervals and approximated by a quadratic polynomial a for each subinterval. The optimal quadratic polynomial for a special subinterval can be obtained by interpolation at Chebyshev points [13] or minimax approximation [14], which needs Maple toolbox. In hardware implementation, X can be used to index the coefficients a, b, c directly. Many works have adopted this method and focus on reducing the coefficients bit width and optimizing the multiplier. Ref. [15] used an enhanced minimax approximation to reduce the LUT size, which takes into account the effect of rounding the coefficients to a finite size through an iterative process. Ref. [16] proposed a novel technique to minimize polynomial coefficients bit width, which uses integer linear programming (ILP) to optimize the polynomial coefficients, considering all error components simultaneously. The multiplier in [17] is optimized in detail based on [16]. Uniform segmentation method has two obvious advantages. Firstly, the index logic is simple because X contains the information of each subinterval. Secondly, the multiplier bit width is small. On the one hand, X has less bit width than X. On the other, the coefficients can be reduced by novel methods. However, the accuracy of such methods is limited by the largest error among all segments. In high-precision applications, the number of segments is too large, leading to an LUT size that is too large and just unnecessary. When evaluating functions with high nonlinearity, uniform segmentation methods are not efficient. As for SFU in GPU, multiple functions need to be implemented, the shortcomings caused by large LUTs are much more apparent.
Therefore, several non-uniform segmentation methods have been proposed [18][19][20][21][22][23]. Refs. [18,19] proposed multilevel hierarchical segmentation methods, in which the segment size at each level can be the same or different by increasing or decreasing the order of power-of-two. Ref. [20] presented another similar segmentation method targeting for floating-point arithmetic which places the boundary of segments so that every segment contains consecutive numbers, and thus the segment index encoder can be realized by simple combinational logic. These methods are not flexible enough to dynamically fit target functions with high nonlinearity, although the hardware design of the segment index encoder is simple. In [21], the segmentation boundary could be at any position of the interval depending on the adopted segmentation schemes. However, the design of the segment index encoder is complicated. Refs. [22,23] presented new non-uniform segmentation approaches that merge a fixed number of uniform base segments into a larger nonuniform segment and design an innovative method to remap addresses. However, this method does not cause the number of segments to reach the theoretically minimum value, and there is still room for improvement.
The SFU of GPU usually needs the implementation of multiple functions, so the size of LUTs is significant to the whole area. This paper proposes a new method, which cuts down the LUTs' size by reducing the number of segments to the theoretical minimum value for any given precision.
An error-flattened segmenter and a novel quantizer are components of the piecewise parabolic approximate computation method. The segmenter uses piecewise parabolas to approximate the objective function under the required software maximum absolute error (MAE), and ensures that the required accuracy is achieved with the minimum number of segments. In detail, we discretize the data and set a target value as the unified software MAE for each segment. In each segment, we guarantee the real MAE to be at the minimum. Then, the segmenter will select the maximum range for each segment and derive the corresponding coefficients until the whole input range has been covered. Because the range of each region has been maximized, the number of segments is minimized. So far, we have used piecewise parabolas with the lowest number of segments to approximate the target function in the input region. Based on the segmenter, we design a general hardware architecture. The quantizer is designed for this hardware architecture. It can completely imitate the input and output behavior of the circuits and ensure that the hardware overhead is minimized with an MAE of 1 unit of least precision (ulp). Therefore, the designer does not have to adjust the hardware bit width iteratively.
The main contributions of this paper are summarized as follows:  The proposed method reduces the number of segments to the theoretical minimum value in any given precision to reduce the size of the LUTs, which has obvious advantages in reducing the area of applications that implement multiple functions simultaneously, such as SFU.  The proposed arithmetic circuits can be implemented without a waste of hardware resources. With the help of an innovative quantizer, we quantize the coefficients to the minimum bit width, which can guarantee the precision up to 1 ulp.  It is a calculation method with an absolutely controllable error. The exact bits of the result in hardware can be set in advance.


The proposed method is applicable to all unary functions with finite codomain.
The remainder of this paper is organized as follows. Section 2 depicts the basic principles of the proposed method. The error-flattened segmenter is introduced and verified in Section 3. In Section 4, the hardware architecture and the quantizer are described. The details of hardware implementation and comparison with the existing methods are demonstrated in Section 5. Finally, Section 6 summarizes our work.

Basis of Piecewise Parabola Approximation
Let the objective function be expressed as f x , x ∈ M, N . Piecewise parabola approximation uses a special parabola segmenter to split the input range x ∈ M, N] into several segments. Each segment is approximated by a parabola ℎ . The coefficients ， ， are stored in the LUT in advance. Over the entire input range, the function can be approximated as Four sets of data need to be determined when using the proposed method. The first is the segment start point value in the input range , , ,… . The others are the quadratic term coefficients , , ,… , linear term coefficients , , ,… , and constant term coefficients , , ,… . Supposing the objective function has n segments in the input range, n − 1 start point values, n quadratic term coefficients, n linear term coefficients, and n constant term coefficients need to be stored in the LUT.

Discretization
The data are discrete in fixed-point calculations in VLSI. Let the objective function be f x , x ∈ M, N . Assume that the number of fractional bits of x in hardware is Q. We sample the input range [M,N] with an equal interval of 2 . According to Equation (2), the input becomes a one-dimensional vector with the size of NUM: At this time, the output fq(x) also becomes a one-dimensional vector with the size of NUM: The design of parabola segmenter is based on the discrete point set {xq(i),fq(i)}.

Metric for Error Evaluation
The maximum absolute error (MAE) is a common metric to evaluate the approximate error. Considering the fixed-point implementation with Q fractional bits in hardware, the continuous input range [M,N] is sampled into NUM points: In hardware implementation, xq, fq(x), and h(x) are one-dimensional vectors with the size of NUM. MAE is the absolute value of the worst-case approximate error. It is defined as

Minimization of MAE
In this subsection, we explore ways to minimize MAE for a given segment. Suppose the input of the segment is xq j: k , 1 j k NUM. Ref. [14] pointed out that interpolation at cleverly chosen points could be a sensible solution: interpolating a function at Chebyshev points gives a polynomial almost as good as the minimax polynomial. The parabola generated by Chebyshev interpolation can be directly calculated by MATLAB, while the minimax method needs a Maple toolbox. Thus, with the aim of minimizing MAE, we choose three Chebyshev points of the target function and fine-tune it to generate the optimal parabola [24]. As for n-order interpolation, the Chebyshev nodes on [−1, 1] are: Therefore, the third-order Chebyshev nodes on [−1, 1] are: The Chebyshev nodes are then transformed from [−1,1] to [a, b] by the following formula: The three Chebyshev nodes in the i-th subinterval xq j: k become: In order to show more clearly the relationship between the Chebyshev nodes and xq j , xq k , Equation (9) can be rewritten as where c1 is √3 2 1/2 , c2 is 1/2, c3 is √3 2 1/2 . The target function values of and are expressed as , , and . The coefficients of the parabola in the i-th subinterval are determined by these three points: We plot the error distribution diagram of the first segment for function ln 1 x in Figure 1. As can be seen from Figure 1a, there are four error peaks, and the absolute values of four peaks are almost equal. However, there is still a small gap between the four peaks. Ref. [14] pointed out that p* is the minimax degree-2 approximation to function f on [a,b] if and only if there exist at least four values x ,x , x , x ∈ a, b , such that |p * x f x | ‖f p * ‖ . Therefore, we need to make the four peaks equal. The first error peak and the last error peak are taken as the benchmark. Through the error distribution diagram, we can see that the second peak is slightly larger and the third peak is slightly smaller. Therefore, we only need to move x to the right-that is, to increase c2. As shown in Figure  1b, when c2 becomes 0.501, the four peaks are equal. For other functions, we can perform the same operation to make the four error peaks in the subinterval equal, as shown in Figure 2. The fine-tuned c2 can be expressed as c2 c2 δ. The δ for a few common functions are shown in Table 1.

Minimization of the Segments Number with a Given Precision
Here, we propose an innovative segmentation method, which can adaptively generate the minimum number of segments under a given precision and a certain input range.
First of all, the input range of a segment needs to be maximized with a given MAE. The process is easy to understand. From the previous section, we can calculate the minimum MAE of a certain segment. If the MAE is greater than the predefined error, it means that the segment width is too large and should be reduced; if the MAE is no greater than the predetermined error, it means that the segment width may be too small, and there may be room for enlargement.
Assume that the required MAE of the software segmenter is MAE_sw and the objective function is xq i , fq i : . Then, suppose that the function in xq 1: j 1 has been successfully approximated and the function in xq j: NUM remains to be addressed. Now, we need to determine the input range of the next segment in xq j: NUM . The start point is xq j and the end point xq ep needs to be found. In theory, there are (NUM-j) possibilities for the selection of the end point xq ep , and the exhaustion method can be used to select it. According to Equation (6), the $MAE$ for every choice of xq ep can be calculated and expressed as MAE j 1: NUM . Among those MAEs which are no greater than the MAE_sw, we can choose the largest corresponding segment. However, the calculation of the exhaustion method is time consuming, resulting in the segmenter running on the software for hours or even days. Therefore, we optimize the calculation process of the segmenter based on the method proposed in [9].
We design the calculation process of segment width maximization based on dichotomy, as depicted in Figure 3. At initialization time, we assign the start index number j of the unsegmented input xq j: NUM to the segment start pointer sp and left pointer of the dichotomy window lp. Similarly, we assign the end index number NUM of the unsegmented input xq j: NUM to the segment end pointer ep and right pointer of the dichotomy window rp. It can be seen that, after the right window is shifted to the left, the updated indexes from rp to NUM cannot be used as the end pointer ep. Then, the judgment condition of the maximum segmentation end point that satisfies the calculation precision requirements should be ep rp 1-that is, the previous point of rp cannot meet the requirements. Therefore, even if MAE meets the precision requirement, it still necessary to judge whether the segment width xq sp: ep at this time is the largest according to whether ep = rp − 1 is true. If it does not satisfy the judgment, the left window lp needs to be updated to the current end pointer ep, then the end point ep moves to the right by half of xq ep: rp and MAE is recalculated. After repeated iterations of the above calculation process, the judgment condition ep rp 1 will finally hold and the corresponding coefficients, segment start pointer sp, and segment end pointer ep will be the outputs.
The segmentation in the whole input range can be carried out as in the flowchart in Figure 4. First of all, the calculation process of segment width maximization is adopted to select the first segment xq 1: ep . The start pointer, the end pointer, and the corresponding coefficients are stored. Then, the segment start pointer sp is updated to ep+1, and the segment end pointer ep is updated to NUM. The calculation of the segment width maximization is continuously performed until the sp points to NUM+1. Now, segmentation over the whole input range has been completely discussed. Assuming the number of the segments is n, we put the start pointer sp of every segment together to get a vector of start pointers: Similarly, we can obtain the vectors of quadratic coefficients, linear coefficients, and constant coefficients: Up until now, the objective function fq 1: NUM can be approximated by piecewise parabolas:

Test of Segmenter Performance
We test whether the segmenter meets the preset performance in MATLAB. The discretization interval in Equation (2) is set as 2 . The error distribution curve of square root √x with 12 segments is shown in Figure 5. The preset MAE of software segmenter MAE_sw is 4.502 10 . As can be seen, the maximum absolute error of each segment is equal. There are four peaks per segment in the error distribution diagram, whose absolute values are all equal to MAE_sw. Thus, the MAE in each segment is minimized and error flattening in the whole input range is realized. In conclusion, the number of segments reaches the theoretical minimum. In addition, the MAE_sw is set in advance; this reflects that the error is absolutely controllable. Experiments show that the designed segmenter achieves basic functionality. Furtherly, we set the discretization interval in Equation (2) as 2 . For different target functions, such as 2 , log , sin x , 1/ 1 x , we set different MAE_sw and plot the approximate function and error distribution in Figure 6. As is shown in Figure 6, for each segment of each function, there are four peaks whose absolute values equal MAE_sw. According to [14], the MAE in each segment is minimized and error flattening in the whole input range is realized.

Hardware Architecture
The principle of the indexing of parabola coefficients is similar to that of the indexing of PWL coefficients. The conventional indexing method comprises judging segment by segment. For each judgment, a subtractor and a 2-1 MUX (multiplexer) are used in the hardware implementation. By this method, (n − 1) subtractors and (n − 1) 2-1 MUXes are cascaded to determine which segment the input x is located in. The delay increases linearly with the augment of the number of segments.
In order to solve the problem of too high a delay in the above indexing method, this paper proposes a new indexing method based on [8,9], as shown in the left side of Figure  7. For a function with n segments, the input is simultaneously compared with the starting points S , S , . . . , S through (n − 1) parallel subtractors. (n − 1) subtractors calculate and output sign bits sign at the same time; then, (n − 1) symbol bits are concatenated into the MUX selection signals sign, and coefficients of corresponding segments are directly indexed. It is worth mentioning that the (n − 1) subtractors only have carry chains, which occupy much less area than ordinary subtractors. In addition, the proposed method has no encoder. The delay of this method is only one subtractor and an n − 1 MUX, which is lower than the existing method. Naive implementation ax 2 + bx + c requires three multipliers and a carry save adder (CSA) [25]. By horner's rule: (ax + b)x + c two cascaded fused multiply-add (FMA) units are required. The FMA inserts the addend of the addition into the partial product array of the multiplication and then processes them together to reduce delay and save hardware overhead [26]. Compared to the conventional method, one multiplier is saved.
The hardware architecture is shown in Figure 7. The hardware resources include: (n − 1) subtractors (carry chains), an n − 1 MUX, and two FMA units. This architecture does not have a pipeline, in order to compare with related papers. Because the hardware architecture is a forward circuit, we can easily add registers as needed. As for the indexing of coefficients, the proposed method occupies (n − 1) subtractors (carry chains) and an n − 1 MUX, while Ref. [16] only uses an n − 1 MUX. In the calculation unit, the proposed method contains two FMAs, while Ref. [16] has one more multiplier.

A Novel Quantizer
A novel quantizer is proposed to imitate input and output behavior in hardware and quantize the related data in order to make the MAE in hardware equal to 1 ulp with a non-redundant hardware overhead. Figure 8 is a detailed description of the arithmetic circuits regarding the proposed quantizer. The input is discretized by the step of 2 (ow is the width of fractional bit) in the segmenter, so the input value has been quantized in accordance with the circuits. To ensure consistency, the output fractional bit width is set equal to that of the input. Assume the approximate output in the circuits is R. Then, the maximum absolute error in circuits MAE_C is defined as The quantizer aims to make MAE_C no greater than 1 ulp, which is 2 . Firstly, the fractional bit width of coefficients a, b, and c need to be determined. In order to be accurate, they are quantized by rounding: a_q round a 2 2 b_q round b 2 2 c_q round c 2 2 .
It should be noted that, although the coefficients are all quantized to qw bits, not every bit is saved in the LUT. We only keep the required significant bits in the LUT.
The quantization of intermediate results is discussed below. The FMA combines the multiplier and adder, so we do not have to quantize the result of multiplication. For the first FMA, we only need to keep the high qw bit of the result and discard the lower bit. This operation does not require any additional hardware overhead. The quantizer can use floor operation to completely imitate the truncation operation when the hardware is implemented. Assuming that the result of the first FMA is R1, the quantized R1 is referred to as R1_q: R1_q floor a_q xq b_q 2 2 .
The quantization of the final result remains to be discussed. If we still choose the truncation operation, this step causes an error of 1 ulp, and we cannot make the total error within 1 ulp. Thus, we round the final result to ow bits. The final output R can be expressed as: During hardware implementation, the round operation can be implemented by a one-bit adder. If the output is required to round to ow bits, we just need a one-bit adder to add the value in (ow+1) bit to ow bit and discard the lower bits.
Algorithm 1 depicts the quantizer in detail. The input of the quantizer includes discretized input xq, benchmark f, fractional bit width of input and output ow, a set of segment start pointers S S , S , S , … , S , a set of segment end pointers E E , E , E , … , E , the number of segments n, and the number of sample points for the objective function NUM (refer to Equation (2)). It should be noted that the end pointers set E does not appear in the hardware implementation, because the location of the segment can be determined only by the start pointers. This concept is introduced here to simplify the description of the algorithm. The output of the quantizer includes the fractional width of coefficients and intermediate results qw, and the MAE in circuits MAE_C. The main target of the quantizer is to find the smallest qw to ensure that the MAE_C is less than or equal to 1 ulp. Due to error propagation, setting qw equal to ow may not meet the circuit's error requirements, and additional guard bits gw is required. Therefore, qw can be expressed as qw ow gw. The main purpose of the quantizer is to continuously increase guard bits gw and calculate MAE_C until it is less than or equal to 1 ulp.
In order to explore the relationship between the error requirements in software and hardware, we introduce a trade-off factor called TF, which is defined as The MAE_sw in the segmenter can be determined by adjusting TF: MAE_sw TF 1 ulp, TF 1.
The error of the circuit consists of three aspects: the algorithm error, the coefficient and intermediate result quantization error, and the rounding error. The algorithm error is caused by the gap between the approximate quadratic function and the objective function. The MAE of the algorithm error is expressed as MAE_sw, which is analyzed in the segmenter. The coefficient and intermediate result quantization error is written as MAE_quantized. The rounding error is caused by the rounding operation of the output, and its MAE is recorded as MAE_round. In order not to lose generality, we consider the most pessimistic case-that is, the total error is the algebraic sum of the above three errors. The total error is less than or equal to 1 ulp. Thus: Because of MAE_round is 0.5 ulp, we can get the following relationship: MAE_sw 0.5 ulp.
Therefore, TF is less than 0.5. We can change TF between 0 and 0.5. With the ulp unchanged, the smaller TF is, the smaller the segmenter's error requirement MAE_sw is, which may lead to too many unnecessary segments. The larger TF is, the bigger the MAE_sw is. Although the quantizer can obtain the desired effect by increasing the guard bit width, an excessive bit width will lead to an unnecessary increase in area. Therefore, designers have the flexibility to choose TF for better trade-offs.
In general, TF is chosen by two steps. Firstly, candidates are chosen by observing the error distribution of the last segment. If the maximum absolute error of the last segment is not equal to MAE_sw, as shown in Figure 9a, the minor reduction in TF will only cause the error to increase but will not increase the number of segments. Smaller TF will strive for more space for quantization error, which is a pure optimization. Therefore, TF is not the candidate. When the maximum absolute error in the last segment is equal to MAE_sw, as shown in Figure 9b, the reduction in TF will cause an increase in the number of segments, which is undesirable. Thus, TF is the candidate. For the sinusoidal function sin x , x ∈ 0,1 with the ow of 23, the candidate TF and related parameters are shown in Table 2. Secondly, the optimal solution with the minimum LUT size, which is calculated through the number of segments and bit width, is chosen from the candidates.

Simulation on CPU
We perform the proposed algorithm of the segmenter and quantizer in MATLAB and test the computing time under different functions and different accuracies. Table 3 reports the CPU time running on an AMD Ryzen 5 Six-Core Processor with 16 GB RAM. The CPU time of different functions with the same precision is slightly different because of the different properties of the functions. The CPU time with different precision varies greatly, from tens of milliseconds to hundreds of seconds, but they are all acceptable. In the worst case of the test, for the cosine function with the accuracy of 2 , the total required CPU time is only about six minutes.

Comparison of LUT Size
To highlight the advantages of our method with regard to the size of LUT, we compare it with many existing methods, including the most widely accepted uniform segmentation method [15], the non-uniform segmentation method [22], the most advanced method using ILP [16], and other classic methods [11,12,27]. Table 4 shows in detail the number of segments, coefficient bit width, and LUT size of various methods under different functions when the accuracy is 2 and 2 , respectively. The size of LUTs compared to [11,12,16,27] for various functions and various accuracies is reported in Table 5. It can be seen that the proposed method has the smallest LUT size and number of segments for all functions and all precisions tested. As for the bit width of coefficients, in many cases, our method is slightly inferior, which makes the arithmetic unit larger. The LUT size is a trade-off with the bit width of the arithmetic unit. The proposed method is excellent in the number of segments and LUT size, leaving little space for bit width optimization. The proposed method is suitable for applications where the size of LUTs has a great impact on performance.

Implementations and Comparisons in 90 nm CMOS
The proposed hardware architecture is coded in Verilog HDL and implemented in 90 nm technology. We synthesize the circuits by Synopsys Design Compiler and present the implementation results in Table 6.
Ref. [10] and [16] reported several implementation results in 90 nm technology of 23bit rounded interpolators using: the classic Chebyshev quadratic interpolation method [13], minimax degree-two polynomial approximation [15], the two-level function evaluation approach proposed in [10], and the ILP method targeted at minimizing coefficients' bit width in [16]. For the fairness of comparison, we implemented the same set of functions, with the same accuracy and the same technology, using our technique. These functions cover the common function types in hardware implementation, which are representative. In addition, the proposed method has obvious advantages in middle-to-highprecision applications. These functions usually require medium to high precision when implemented in hardware. We also designed a multi-function unit that can compute several arithmetic functions in the same hardware as in [10].
We test the LUT and address generation area of the proposed method for several single functions and compare it to the LUT area (without address generation area) in [10,13,15]. Obviously, the proposed method is better. This is consistent with the previous analysis. Ref. [16] did not provide the result of the LUT area. Moreover, the method proposed in this paper lacks details, and so we cannot reproduce it. However, the number of segments of uniform segmentation method is almost the same, and so the proportion of LUT area in the total area is almost the same. We take the average value of the proportion of LUT area in [13] and [15] as the proportion of LUT area in [16]. We can obtain the estimated value of LUT area in [16]. As shown in Table 6, the proposed method has a smaller LUT and address generation area than [16] for all cases in the table.
For the implementation of single functions, the synthesis results show that the proposed method is superior to the classical method in area and delay [10,13,15]. For the functions y √ and y √x, the proposed method has advantages over ILP [16], which is the state-of-the-art. Compared with ILP, the proposed method has a smaller area and a higher delay for the functions y log x , y 2 and y 1 x ⁄ . Meanwhile, for the function y sin x , the area of the proposed method has weak disadvantage over ILP in area and delay. To provide a fair comparison, we use area delay as a new metric of cost. As shown in Table 6, the proposed method has a lower cost than [10,13,15]. For the function, y log x , y √ , y , and y √x, the proposed method costs less than in [16].
For the multi-function unit, each function needs an individual LUT, but the calculation unit can be reused. The hardware architecture of the proposed multi-function unit is shown in Figure 10. For the methods in this paper and references, the index units of all functions are accumulated, while the calculation units are reused, although the index units and calculation units in this paper and references are different. The method proposed has apparent advantages because the LUT is very small. The synthesis results support this view. Compared with the methods in [13,15,16], the proposed method saves 58.67%, 42.86%, 35.06% of the area and 35.48%, 66.23%, and 50.75% of the delay, respectively.