Low-Latency Hardware Implementation of High-Precision Hyperbolic Functions Sinh x and Cosh x Based on Improved CORDIC Algorithm

: CORDIC algorithm is used for low-cost hardware implementation to calculate transcendental functions. This paper proposes a low-latency high-precision architecture for the computation of hyperbolic functions sinh x and cosh x based on an improved CORDIC algorithm, that is, the QH-CORDIC. The principle, structure, and range of convergence of the QH-CORDIC are discussed, and the hardware circuit architecture of functions sinh x and cosh x using the QH-CORDIC is plotted in this paper. The proposed architecture is implemented using an FPGA device, showing that it has 75% and 50% latency overhead over the two latest prior works. In the synthesis using TSMC 65 nm standard cell library, ASIC implementation results show that the proposed architecture is also superior to the two latest prior works in terms of total time (latency × period), ATP (area × total time), total energy (power × total time), energy efﬁciency (total energy/efﬁcient bits), and area efﬁciency (efﬁcient bits/area/total time). Comparison of related works indicates that it is much more favorable for the proposed architecture to perform high-precision ﬂoating-point computations on functions sinh x and cosh x than the LUT method, stochastic computing, and other CORDIC algorithms.


Introduction
Scientific computing has penetrated almost all scientific and engineering computing and is widely used in energy survey, game rendering, meteorology and oceanography, finance and insurance, computer-aided design, etc. As for numerical precision during computation, different fields have different requirements. Currently, IEEE's 64-bit floatingpoint (FP) standard is accurate enough for most scientific applications. However, a higher level of numerical precision is required for the rapidly growing number of important scientific computing applications such as climate modeling, fluid mechanics, etc. This means that these applications require hundreds or more digits to achieve meaningful numerical results. Furthermore, high demand for real-time computing is usually put forward in these scientific computing applications.
In scientific computing, hyperbolic functions such as sinhx and coshx find wide applications in engineering fields such as signal processing, power transmission, aerospace, statistics, etc. [1,2]. Hyperbolic functions were typically implemented only in software until recently, wherein their hardware implementation has become important; this is largely due to the performance gains of hardware systems compared with software implementations. Extensive literature exists describing hardware implementation of functions sinhx and coshx. Look-up table (LUT) approach, polynomial approximation technique, and coordinate rotation digital computer (CORDIC) algorithm are three typical computational results of our proposed architecture with previously published work and reports the ASIC implementation results of the proposed architecture. Finally, Section 6 concludes this paper.

Computation of Functions Sinhx and Coshx with CORDIC
Based on the recurrent Equation (1) and appropriate choice of initial values (X 0 , Y 0 , and Z 0 for circular coordinates or linear coordinates; X 1 , Y 1 , and Z 1 for hyperbolic coordinates), a variety of functions can be generated [23]. Table 1 lists common functions that can be calculated with the CORDIC algorithm.

Range of Convergence for Basic Hyperbolic CORDIC Algorithm
For basic CORDIC in hyperbolic coordinates, convergence conditions are expressed as in (4) [24].

Another Computation of Functions Sinhx and Coshx
Restricted to limited ROC, rough implementation of functions sinhx and coshx with basic CORDIC seems inappropriate. To realize the across-all-range computation of functions sinhx and coshx, this paper proposes another methodology.
Hyperbolic functions sinhx and coshx can be defined in terms of exponential function e x , where e −x = 1/e x . It can be seen from (5) and (6) that computation of sinhx and coshx consists of function e x , division (to compute e −x ), addition/subtraction operation, and shift operation (right shift). When it comes to the computation of function e x , several studies [25,26] address this problem using an approximation method. In addition to the approximation approach, iterative methods are also widely exploited. Iterative methods include digit-recurrence method [27][28][29] and hyperbolic CORDIC [30,31].
To enhance computational precision of function e x as high as possible with less complex hardware, hyperbolic CORDIC was chosen for this study. However, hyperbolic CORDIC brings about high-precision computation at the cost of high latency, which cannot be tolerated by modern hardware. To eliminate the high-latency flaw from the hyperbolic CORDIC algorithm, this paper proposes a novel QH-CORDIC architecture.

Improvement of Basic CORDIC Algorithm
Inspired by the double-step CORDIC algorithm [32], this paper proposes a QH-CORDIC architecture, which combines four sequential iterations into one single iteration step. Recurrent equations of the proposed QH-CORDIC are shown in (7)- (9).

General Architecture of QH-CORDIC
The hardware architecture of basic CORDIC and QH-CORDIC is presented in Figure 1a,b, respectively. Figure 1a bears a close resemblance to Figure 1b because they both have three major data paths (X data path, Y data path, and Z data path). Their differences mainly lie in the signal(s) that determines the rotation direction of the next iteration. The hardware iteration of QH-CORDIC in two modes, vectoring mode and rotating mode, is briefly demonstrated in Figure 2a,b, respectively. As explained in Section 2.3, in order to ensure ROC of hyperbolic CORDIC, when i = 5, 13, 41, 121, · · · , (3 u+2 -1)/2, · · · where u starts from 0, repeated iterations are needed. Therefore, except X/Y/Z iterative data path that performs iterative formulae of X i+4 /Y i+4 /Z i+4 , X/Y/Z repetitive data path when i = 5, 13, 41, · · · is also listed.
It should be noted that QH-CORDIC in rotating mode can be used to compute exponential function e x . According to (5) and (6), this paper only employs QH-CORDIC in rotating mode to implement hyperbolic functions sinhx and coshx. As for QH-CORDIC in vectoring mode, it can be used to compute the logarithmic function lnx.

ROC of QH-CORDIC for Exponential Function
The computation of exponential function e v is performed through (10).
Initial conditions and terminated statuses for QH-CORDIC-based computation of e v are listed in (11) and (12), respectively.

Validity of Computing Exponential Function with QH-CORDIC
To study the validity of computation of exponential function e x in FP format using QH-CORDIC, suppose input FP number x as (-1) S × M × 2 E where S is the sign of x, E is the exponent of x after correcting bias, and M is mantissa of x after complementing the implicit bit. The assumption is made that the output of function e x is A × 2 B where 0.5 < A < 1 and B is an integer.
Suppose S = 0 first. The discussion of sign S = 1 will be involved later. From we can obtain Performing the two-based-log operation of both sides to (15), we obtain Since B is an integer, and the value of B can be attained with (16).
In order to ensure the value of A, suppose 2 B = e Z . Then, Substitute (17) into (13) and yield By (16), the value of B can be computed. A is in the range of (0.5,1). According to the graph of exponential function e x , M × 2 E -B × ln2 must locate in the ROC of CORDIC, i.e., (−1.7433,1.7433). Therefore, the value of A can be attained by (18).
When S = 1, e x = -A × 2 B . Following the abovementioned steps, we can obtain Similarly, for the condition where S = 1, the value of B can be computed by (19) and A is also in the range of (0.5,1). According to the graph of exponential function e x , -M × 2 E -B × ln2 must locate in the ROC of CORDIC. Therefore, the value of A can be attained by (20).
Thus, the validity of computing exponential function e x with CORDIC is checked. (16) or (19) Since the proposed QH-CORDIC architecture is mainly for quadruple precision FP hyperbolic functions sinhx and coshx, it is necessary to reduce the area of circuit design in the context of high-precision FP input. In Section 3.4, if input FP number x is a quadruple precision FP number, M will be a 113-bit fixed-point number. The difficulty of computing B in Formula (16) or (19) lies in the calculation of M × 2 E/ ln2 where both M and 1/ln2 are 113bit fixed-point numbers. Multiplying M with 1/ln2 straightforward is theoretically feasible. However, in practice, such operation will take an extremely large circuit design area.

Simplified Computing of B in Formula
It can be observed that in the context of the above situation, B will be a 15-bit fixedpoint number, which means that the complex multiplication of M and 1/ln2 can be simplified. The challenge is to reduce effective digits of M and 1/ln2 in the actual calculation. Denote M and 1/ln2 as (21) and (22), x −(p−1) 00 · · · 00 + p 0.00 · · · 00 x −p x −(p+1) · · · x −111 x −112 1/ ln 2 = p where p and q are two positive integers. P is defined as the high-order p bits of M extended with 0s to obtain a 113-bit number, while Q is defined as the high-order q bits of 1/ln2 extended with 0s to obtain a 113-bit number. Let ∆P = M -P and ∆Q = 1/ln2 -Q. Hence, P < 2 and Q < 2; |∆P| < 2 −p , and |∆Q| < 2 −q .
Let p = q = 16, and (23) becomes From the check of (28), we can derive that setting p and q to 16 can ensure |P × Q -M × 1/ln2| < 2 −13 . That is to say, effective digits of M and 1/ln2 only need to be 16 rather than 113. This helps to simplify the calculation of B in formula (16) or (19), which is also reflected in the architecture of state PRE_B in Section 4.

Hardware Implementation of Hyperbolic Functions Sinhx and Coshx with QH-CORDIC
The proposed QH-CORDIC architecture can apply to both fixed-point and FP operations. Meanwhile, the QH-CORDIC architecture is appropriate for configurable precision. In this paper, based on the QH-CORDIC architecture, a quadruple precision FP hardware implementation of hyperbolic functions sinhx and coshx is presented.
The overall architecture of the quadruple precision FP hyperbolic functions sinhx and coshx is illustrated in Figure 3. The proposed architecture is divided into three parts: Module Pre_deal, Module Cordic_core, and Module exp_divide_sinh_cosh. Inputs are an FP number, input_num, and two signals-clk and rst_n. Outputs are sinh_result, cosh_result, sinh_cosh_done, and sinh_cosh_exception, which are a 128-bit calculated FP result of function sinhx, a 128-bit calculated FP result of function coshx, a completion signal, and an exception signal, respectively. Module Pre_deal is to judge whether exception situations exit after breaking down the FP input input_num into three portions: 1-bit sign (sign), 15-bit exponent (e), and 112-bit mantissa (m). The output of Module Pre_deal is a 3-bit signal exception. There are five possible values of exception: 3 b000 (no exception), 3 b001 (input_num is not a number), 3 b010 (input_num is negative infinite), 3 b011 (input_num is positive infinite), and 3 b100 (input_num is small enough to be seen as zero when 15-bit exponent of input_num is smaller than 15'h3f8c).
After Module Pre_deal, Module Cordic_core computes function e input_num with the proposed QH-CORDIC algorithm under the circumstance of no exception. If any exception exists, signal exception_out will be outputted and completion signal finish turns to be 1. Module Cordic_core is mainly composed of a finite state machine (FSM), which has six states in total. The state transition diagram of the FSM is shown in Figure 4. Among these six states, state PRE_B and state PRE_A are to calculate the value of B and A, respectively, in (16) and (18). The architectures of state PRE_B and state PRE_A are shown in Figure 5a,b, respectively. State INIT is to perform the initialization process of exponential function e input_num . Figure 6 shows the data path of state INIT. In Figure 6, input A is the output of state PRE_A. K_inv is a constant, and its value is expressed in (21). State ITE is to perform QH-CORDIC computation of exponential function e input_num . Figure 7 shows the data path of state ITE. The red box in Figure 7 corresponds to the rotating-mode X/Y/Z iterative data path in Figure 2b. The three important modules x_pre, y_pre, and z_pre, respectively, perform iterative data path of X i+4 in (7), Y i+4 in (8), and Z i+4 in (9). In addition, the signal in the register cnt_next and signal exception_in determine the next state of state ITE together.   Figure 8a,b make up X/Y/Z repetitive iterative data path and in Figure 2b for exponential function jointly. After Module Cordic_core, output signals x_out, y_out, exp_out, exception_out and finish are generated. Receiving the above-mentioned five signals and two control signals clk and rst_n, Module exp_divide_sinh_cosh firstly calculates exponential function e −input_num with x_in and y_in. As Figure 9 demonstrates, e input_num = x_in + y_in. Furthermore, e −input_num = 1/e input_num . The computation of e −input_num is implemented through the Predict-Correct algorithm in [33] with p = 113, q = 113, m = 11, n = 3 and t = 3. After obtaining e −input_num and e input_num , the two desired hyperbolic functions sinh(input_num) and cosh(input_num) can be attained with (21) and (22).
It can be inferred from (21) and (22) that computation of sinh(input_num) and cosh (input_num) is made up with a 128-bit FP addition/subtraction operation and a right-shift operation, which is also demonstrated in Figure 9.
There also exists exception handling in Module exp_divide_sinh_cosh. If no exception conditions exist, Module exp_divide_sinh_cosh outputs the result of hyperbolic functions sinh(input_num) and cosh(input_num), respectively, sinh_out and cosh_out. Otherwise, Module exp_divide_sinh_cosh outputs an exception flag signal sinh_cosh_exception and the corresponding exceptional result of sinh(input_num) and cosh(input_num), respectively, sinh_out and cosh_out.

Implementation and Comparisons
The proposed architecture was coded in Verilog Hardware Description Language. Verification of hardware implementation of the two functions sinhx and coshx is presented in Section 5.1. After that, it was synthesized in the Xilinx ISE Design Suite and mapped to an FPGA device (xc7vx485). Comparisons in terms of timing analysis and device utilization are discussed in Section 5.2. The proposed architecture was also synthesized with TSMC 65 nm standard cell library, using Synopsys Design Compiler. The ASIC implementation details are shown in Section 5.3. Section 5.4 compares the proposed architecture with the LUT method, stochastic computing, and other CORDIC algorithms to show its characteristics of high accuracy, low error, and vast ROC when performing high-precision computing.

Functional Verification
The functional verification of the proposed architecture was carried out using 1-million random test cases for normal, sub-normal, and other exceptional input numbers with IEEE's 128-bit FP mode. This paper compares the hardware simulation results of the proposed architecture with software results using the bigfloat package. The bigfloat package is a Python wrapper for the GNU MPFR library for arbitrary-precision FP reliable arithmetic. It provides precise control over precisions and gives correctly rounded reproducible platformindependent results.
In the case of the 1-million random 128-bit FP tests, the statistical correct rate of the proposed architecture is 99.6%, compared with bigfloat data results using Python. Among the 0.4% not-matched 128-bit FP tests, the proposed architecture produces a maximum of 2-ULP (unit at last place) precision loss.

FPGA Implementation Analysis
Timing analysis and device utilization are discussed in this subsection. This paper mainly implements a 128-bit (i.e., quadruple precision) FP hyperbolic functions architecture, where the number of internal iterations is up to 128.
The methods developed by [3,32] and the proposed architecture in this study are three variants of the CORDIC algorithm. For an equal comparison, set N to 128 in [3,32]. Meanwhile, the study by [3] only focuses on hyperbolic functions with fixed-point inputs that are convergent to ROC of basic CORDIC. Hence, for equal comparison, only Module Cordic_core (without states PRE_B and PRE_A) of the proposed architecture is synthesized.
The designed hardware was simulated with a clock of period 10 ns. Table 4 provides the timing analysis and device utilization of [3] (N = 128), [32] (N = 128), and the proposed architecture. According to Table 4, the method by [3] takes three times more clock cycles than the proposed architecture, while the method by [32] takes one time more clock cycles than the proposed architecture. It can be inferred that for [3], the number of clock cycles absolutely depends on the value of N; for [32] and the proposed architecture, the number of clock cycles equals N/2 and N/4, respectively. The reason why clock cycles of the proposed architecture are so few lies in the fact that the proposed architecture performs four-bits computation every iteration.  Table 4 also shows that the number of device resources consumed by [3] (N = 128), [32] (N = 128), and the proposed architecture (only Module Cordic_core). According to Table 4, the number of bonded IOBs consumed by the proposed architecture is the same as or even smaller than that consumed by [3] or [32]. The number of slice flip flops consumed by the proposed architecture is half-time more than or slightly larger than that consumed by [3] or [32], respectively.
However, the number of slices and four-input LUTs consumed by the proposed architecture is about 7.5 times more than those consumed by [3]. The reason why the proposed architecture consumes so many slices and LUTs lies in the calculation of X, Y, and Z's 16 predictive formulae. Considering the amount of calculation magnified by a factor of 16 in theory, the practical utilization of the device resources of the proposed architecture seems to be acceptable.

ASIC Implementation Performance
The proposed architecture is synthesized with the best achievable timing constraints, with a constraint of the max-area set to zero and a global operating voltage of 0.9 V. Section 5.3 compares the performance of ASIC implementation of the proposed architecture with [3] (N = 128) and [32] (N = 128). This paper retrieves studies [3,32] after enlarging the ROC of [3,32] to (−2 15 , 2 15 ) and reducing their error to be below 2 −113 . Table 5 lists nine parameters of ASIC implementation of the three variants of the CORDIC algorithm. Since the clock period is set to be 3.3 ns for [3,32] and the proposed architecture, the clock frequency of ASIC implementation is 300 MHz. Keeping the same clock frequency, the latency parameter of [3,32] and the proposed architecture is 137, 73, and 41, respectively, for 128-bit FP input numbers. The downward trend of parameter latency from [3], to [32], to the proposed architecture, is steeper, showing that the proposed architecture can dramatically cut down on latency. Therefore, it is with the total time parameter. However, the latency and total time of the proposed architecture are reduced at the expense of area and power. In comparison to [3], the area and power of the proposed architecture are approximately three times those of [3]. In comparison to [32], the area and power of the proposed architecture are approximately 1.5 times those of [32].
ATP and total energy parameters are usually used to evaluate ASIC performance more properly and roundly. The smaller ATP and total energy are, the better the ASIC design is. In Table 5, ATP and total energy of the proposed architecture are smaller than those of [3,32]. This can be explained as the advantage of the proposed architecture is low latency at the cost of area and power. To solve the problem of the expanded area and power, the proposed architecture employs module re-using, clock gating, and other techniques. Meanwhile, low latency leads to less computing time, which eventually makes the proposed architecture superior to the first two CORDIC variants in terms of ATP and total energy.
According to the definitions of energy efficiency and area efficiency, the smaller the energy efficiency is and the larger the area efficiency is, the better the ASIC design is. As for the energy efficiency and area efficiency of the two architectures, the proposed architecture also achieves better performance. Due to low latency, less energy is consumed, and more area is utilized per bit in the computing of hyperbolic functions with 128-bit FP inputs using the proposed architecture. Specifically, the proposed architecture has 15.1% energy efficiency and 19.2% energy efficiency overhead over [3,32], respectively. The proposed architecture has 12.7% and 22.4% more area efficiency over, respectively, [3,32].
To summarize, the proposed architecture does not supersede [3] or [32] in terms of parameter area and power. However, it outperforms the other two variants of the CORDIC algorithm in terms of ATP, energy efficiency, and area efficiency parameters since the proposed QH-CORDIC algorithm brings about a low-latency feature.

Related Works and Comparisons
The proposed architecture also focuses on high-precision computing of the two functions sinhx and coshx by enhancing accuracy, lowering function error, and enlarging ROC. Table 6 demonstrates the comparisons of the LUT method, stochastic computing, and CORDIC algorithms. It should be noted that the data of the CORDIC algorithm is adopted from original studies [3,9,32], without retrieval.
LUT method is a way to compute hyperbolic functions sinhx and coshx. The study by [5] computes trigonometric and hyperbolic functions using look-up tables whose size is 77 bit × 14 to achieve the accuracy of 4 bits. In order to improve accuracy, the volume of look-up tables used in this method will increase exponentially; that is, high-precision function values will run out of a huge amount of LUTs. Meanwhile, a larger look-up table brings about the lower searching speed.
Another way to compute hyperbolic functions is stochastic computing, as performed in studies by [20,34]. Stochastic computing applies stochastic bitstreams to compute, and its main features are having a low cost and low power [35]. The accuracy of stochastic computing is related to the length of stochastic numbers. According to [36], the length of stochastic numbers l is related to the precision i, and the number of independent variables n in the calculated function, i.e., l = 2 i−n . High-precision function values require a larger length of stochastic numbers. For 128-bit FP inputs, the accuracy of 113 for the mantissa part should be guaranteed. In this case, l = 2 113−n . In practice, l cannot be too large, so n needs to be appropriate. This means that for high-precision computation, a large number of stochastic data will be generated, leading to tremendous latency, area, and power.
From Table 6, the function error of the proposed architecture is less than 2 −113, and ROC is expanded to (−2 15 ,2 15 ). It is a dramatic improvement, compared with the other structures. To summarize, both the LUT method and stochastic computing are disadvantageous when performing high-precision computation. Among the above four CORDIC algorithms, metrics accuracy (or function error) and ROC are both considered in the proposed architecture.

Conclusions
A new method and hardware architecture were proposed to compute hyperbolic functions sinhx and coshx based on the QH-CORDIC algorithm in this study.
Restricted to limited ROC of basic CORDIC algorithm, hardware implementation of functions sinhx and coshx with all-floating-point-domain inputs on basis of basic CORDIC seems infeasible. The proposed QH-CORDIC algorithm is based on a basic hyperbolic CORDIC algorithm. Explaining the principle and structure of the QH-CORIDC, this study discussed ROC and validity of the QH-CORDIC when computing exponential function e x with all-floating-point-domain inputs since function e x mainly consists of functions sinhx and coshx.
As for the circuit design of functions sinhx and coshx with QH-CORDIC, the entire logic path was tuned to perform a low-latency computation. The proposed circuit architecture has 75% clock cycles overhead over [3] and 50% clock cycles overhead over [32]. From the trade-off aspect of performance-power-area, in Section 5.3, the proposed architecture was proved to be superior to [3,32] in terms of metrics of total time, ATP, total energy, energy efficiency, and area efficiency. Section 5.4 showed that it is much more favorable for the proposed architecture to perform high-precision computing of hyperbolic functions.
In addition, the proposed architecture can be configured for single-precision, doubleprecision, quadruple-precision, or other user-defined precisions. Meanwhile, the proposed architecture can also be adapted in computations of hyperbolic functions sinhx and coshx with fixed-point input numbers after simple adjustment. Moreover, other common hyperbolic functions such as tanhx, arcsinhx, arccoshx, and arctanhx can also be computed using the QH-CORDIC algorithm.