Low-Latency and Minor-Error Architecture for Parallel Computing X Y -like Functions with High-Precision Floating-Point Inputs

: This paper proposes a novel architecture for the computation of X Y -like functions based on the QH CORDIC (Quadruple-Step-Ahead Hyperbolic Coordinate Rotation Digital Computer) methodology. The proposed architecture converts direct computing of function X Y to logarithm, multiplication, and exponent operations. The QH CORDIC methodology is a parallel variant of the traditional CORDIC algorithm. Traditional CORDIC suffers from long latency and large area, while the QH CORDIC has much lower latency. The computation of functions ln x and e x is accomplished with the QH CORDIC. To solve the problem of the limited range of convergence of the QH CORDIC, this paper employs two speciﬁc techniques to enlarge the range of convergence for functions ln x and e x , making it possible to deal with high-precision ﬂoating-point inputs. Hardware modeling of function X Y using the QH CORDIC is plotted in this paper. Under the TSMC 65 nm standard cell library, this paper designs and synthesizes a reference circuit. The ASIC implementation results show that the proposed architecture has 30 more orders of magnitude of maximum relative error and average relative error than the state-of-the-art. On top of that, the proposed architecture is also superior to the state-of-the-art in terms of latency, word length and energy efﬁciency (power × latency × period /efﬁcient bits).


Introduction
X Y -like functions usually find their place in engineering and scientific applications such as digital signal processing, real-time 3D (three dimensions) graphics, scientific computing and so forth [1]. Currently, customized hardware designs for X Y -like functions are becoming more promising due to the demanding timing constraints of these applications [2].
For most scientific applications, 64-bit floating-point (FP) numbers conforming to the IEEE-754(2008) standard are extensively applied. However, a rapidly growing number of scientific applications such as climate modeling, fluid mechanics, and economic analysis require a higher level of numerical precision [3]. This means that hundreds or more digits, such as 128 bits, are needed to gain valid numerical results.
Since X Y = e YlnX , the computation of function X Y can be decomposed into the logarithmic computation of lnX, the multiplication of lnX and Y, and the exponential computation e YlnX . Therefore, the computation of X Y -like functions can be converted to the computation of logarithmic and exponential functions. The hardware methods of computing X Y -like functions directly or indirectly can be divided into four categories: digit-recurrence [4][5][6], functional iteration [7][8][9], LUT (Look-Up Table)-based [10][11][12], and CORDIC-based [13][14][15]. In fact, CORDIC is one of the digit-recurrence algorithms.
However, there is not much research on general computation for X Y -like functions with high-precision FP inputs. Such a generic approach for N th power computation is proposed in [1] based on the natural logarithm-exponent relation, i.e., X Y = e YlnX .
Duprat and Muller [29] introduced the Branching CORDIC algorithm, which enables a fast implementation of the CORDIC algorithm by performing two basic CORDIC rotations in parallel in two separate modules. D.S. Phatak [30] has improved the algorithm and proposed a double step branching CORDIC algorithm where two circular mode rotations are performed in a single step with little additional hardware. To achieve the goal of high-precision, high-accuracy, and low-latency in computing X Y -like functions, this paper adopts the QH CORDIC methodology [31], which is inspired by the double step branching CORDIC algorithm. In this paper,

•
We propose a parallel computing architecture with low-latency based on the QH CORDIC methodology; • We enlarge the feasible range of FP inputs of the proposed architecture with specific techniques to make sure the proposed architecture applies to high-precision computing; • We conduct hardware modeling on the proposed architecture to achieve the lowest possible circuit complexity and resource consumption; • We compare the hardware implementation results with related works to show the minor-error and high-accuracy features of the proposed architecture.
The rest of this paper is organized as follows. Section 2 provides the necessary theoretical background of the QH CORDIC methodology. Section 3 introduces the hardware modeling of X Y -like functions for 128-bit FP numbers. Section 4 shows the ASIC implementation results of the proposed architecture and compares it with the state-of-the-art in terms of correctness, word length, timing, and power. Section 5 concludes this paper.

QH CORDIC-Based Methodology of X Y -Like Functions
In Section 2, emphasis is placed on the feasibility of logarithmic function lnx and exponential function e x computing with the QH CORDIC methodology.
We first review the QH CORDIC methodology in terms of iterative formulae. Then, we discuss the range of convergence (ROC) of the QH CORDIC methodology. Given the ROC of the QH CORDIC, the validity of computing of logarithmic function lnx and exponential function e x based on the QH CORDIC is analyzed.

Iterative Formulae of QH CORDIC Methodology
Based on shift-addition and vector rotation, the hyperbolic CORDIC algorithm is simple and efficient. However, hyperbolic CORDIC only generates one accurate bit per iteration, which is an apparent drawback for real-time scientific computing. Unlike traditional hyperbolic CORDIC, the QH CORDIC methodology combinates four sequential iterations into a single integrated iteration, greatly cutting down the quantity of iterations.
The key of the QH CORDIC lies in the prediction of σ in four sequential steps, which is also the necklace of traditional hyperbolic CORDIC. In basic hyperbolic CORDIC, the value of σ is either −1 (rotating in a clockwise direction) or 1 (rotating in a counterclockwise direction). For a group of four sequential steps, the corresponding {σ n , σ n+1 , σ n+2 , σ n+3 } has 16 possible cases for its values, from {−1, −1, −1, −1} to {1, 1, 1, 1}.
For an iteration step of the QH CORDIC in vectoring mode, parallelly compute 16 iterative formulae of y n+4 shown in Table 1 and obtain a group of 16 different values. Sort the closest-to-zero value out from the 16 y n+4 values and take it as the output of y n+4 in the current iteration step of the QH CORDIC. Simultaneously, take {σ n , σ n+1 , σ n+2 , σ n+3 } corresponding to the iterative formula of the output of y n+4 as rotation directions in the current iteration step. Then, the computer outputs x n+4 and z n+4 with the iterative formulae of x n+4 and z n+4 corresponding to the rotation directions, respectively.
For an iteration step of the QH CORDIC in rotating mode, parallelly compute 16 iterative formulae of z n+4 shown in Table 2 and obtain a group of 16 different values. Sort the closestto-zero value out from the 16 z n+4 values and take it as the output of z n+4 in the current iteration step of the QH CORDIC. Simultaneously, take {σ n , σ n+1 , σ n+2 , σ n+3 } corresponding to the iterative formula of the output of z n+4 as the rotation directions in the current iteration step. Then, the computer outputs x n+4 and y n+4 with the iterative formulae of x n+4 and y n+4 corresponding to the rotation directions, respectively.
It can be seen in Table 1 that the eight upper iterative formulae of y n+4 (Cases 1-8) are partly symmetric to the eight lower iterative formulae (Cases 9-16). Such elaborate symmetry also exists with the iterative formulae of x n+4 . To reduce the computational burden for every iteration step of the QH CORDIC, multiplications with the same absolute value of the coefficients can be simplified. By merging repeated multiplications into one multiplication and one sign-inversing operation, it takes 34 additions, 12 multiplications, and four shifts to parallelly finish the computation of the 16 iterative formulae of y n+4 shown in Table 1. Similarly, it takes 34 additions, 12 multiplications, and four shifts to parallelly finish the computation of the 16 iterative formulae of x n+4 .

Range of Convergence of QH CORDIC Methodology
The ROCs of traditional hyperbolic CORDIC [32] are showed in Equation (3): where y 1 and x 1 are initial inputs. It can be inferred that the angle of an input vector in radians for traditional hyperbolic CORDIC must be located in (−1.1182, 1.1182).
Similar to traditional hyperbolic CORDIC, constraints on the ROC of the QH CORDIC also exist.
Since the logarithmic function lnu cannot be attained directly by the QH CORDIC, the computation of function lnu is done through Equation (4): The initial conditions and terminated statuses for QH CORDIC-based computation of lnu are listed in Equations (5) and (6), respectively: Accompanied by Equation (3), the ROC of input u for function lnu is (0.11, 9.51).

Validity of Computation for Logarithmic Function and Exponential Function with QH CORDIC
As X Y = e YlnX , it is necessary to study the validity of the computation for logarithmic function lnu and exponential function e v with the QH CORDIC, that is to say, to enlarge ROC of the QH CORDIC for logarithmic and exponential functions.
Since the inputs of the proposed architecture for X Y -like functions are all FP numbers, suppose that the input FP number u is (−1) S × M × 2 E , where S is sign of u, E is the exponent of u after correcting bias, M is the mantissa of u after complementing the implicit bit, and M ∈ [1,2).
There is nothing ambiguous about S = 0 because the input FP number u for the logarithmic function lnu is bound to be positive. So, we obtain u = M × 2 E . Perform natural logarithmic computation of both sides of u = M × 2 E to obtain ln u = lnM + e × ln 2 (10) To adjust M into the range of (0.11, 9.51), right-shift M one bit. Represent the rightshifted M as M . Equation (10) is updated as ln u = ln M + (e + 1) × ln 2 (11) From Equation (11) we can see that the computation of lnu can be split into one logarithmic operation and one constant multiplication, as well as one addition.
In a similar way, the validity of the computation for exponential function e v with the QH CORDIC can also be ensured [15].

Hardware Modeling of X Y -Like Functions with QH CORDIC
The QH CORDIC can be applied to both fixed-point and FP operations. Based on the QH CORDIC, this paper presents a quad-precision (128 bits) FP hardware modeling of X Y -like functions.
The overall architecture of the quad-precision FP X Y -like functions is illustrated in Figure 1. The proposed architecture is divided into three parts. As for X Y -like functions, X is base while Y is index. Inputs are two quad-precision FP numbers: base and index (X and Y) and two control signals: clk and rst_n. Outputs are power_result and finish, which are a 128-bit calculated result of function X Y and a completed signal of function X Y , respectively. There are three segments in the overall hardware architecture for X Y -like functions, the preprocessing module, the QH module, and the postprocessing module. Since the inputs of the proposed architecture for X Y -like functions are all FP numbers, suppose that the input FP number u is (−1) S × M × 2 E , where S is sign of u, E is the exponent of u after correcting bias, M is the mantissa of u after complementing the implicit bit, and M∈ [1,2).
There is nothing ambiguous about S = 0 because the input FP number u for the logarithmic function lnu is bound to be positive. So, we obtain u = M × 2 E . Perform natural logarithmic computation of both sides of u = M × 2 E to obtain To adjust M into the range of (0.11, 9.51), right-shift M one bit. Represent the rightshifted M as M'. Equation (10) is updated as From Equation (11) we can see that the computation of lnu can be split into one logarithmic operation and one constant multiplication, as well as one addition.
In a similar way, the validity of the computation for exponential function e v with the QH CORDIC can also be ensured [15].

Hardware Modeling of X Y -Like Functions with QH CORDIC
The QH CORDIC can be applied to both fixed-point and FP operations. Based on the QH CORDIC, this paper presents a quad-precision (128 bits) FP hardware modeling of X Y -like functions.
The overall architecture of the quad-precision FP X Y -like functions is illustrated in Figure 1. The proposed architecture is divided into three parts. As for X Y -like functions, X is base while Y is index. Inputs are two quad-precision FP numbers: base and index (X and Y) and two control signals: clk and rst_n. Outputs are power_result and finish, which are a 128-bit calculated result of function X Y and a completed signal of function X Y , respectively. There are three segments in the overall hardware architecture for X Y -like functions, the preprocessing module, the QH module, and the postprocessing module.

Preprocessing Module
The preprocessing module judges whether exception situations exist after breaking down the two FP inputs base and index into three portions: sign (base_sign and index_sign), exponent (base_e and index_e) and mantissa (base_m and index_m). The program flowchart of the preprocessing module is presented in Figure 2. As it is shown in Figure 2, for the quad-precision FP input base, base_sign is a 1-bit sign; base_e is a 15-bit exponent of base after correcting bias, and base_m is a 113-bit mantissa of base after complementing an implicit bit "1". Meanwhile, index_sign, index_e, and index_m are generated in the same way.

QH Module
In this paper, hardware modeling of the function X Y is done through the logarithmic function lnX, the multiplication of lnX and Y, and the exponential function e YlnX . In the QH module, the implementation of the logarithmic function and the exponential function with QH CORDIC methodology in Section 2.1 is abstracted as in Figure 3.
In Figure 3, the vectoring mode of the QH CORDIC is for the logarithmic function lnX, while the rotating mode of the QH CORDIC is for exponential function e YlnX . The vectoring mode of the QH CORDIC bears a close resemblance to the rotating mode of the QH CORDIC because they both have three major data paths and seven submodules. Their differences mainly lie in the signal that determines the rotation direction of the next iteration (signal s_ln and signal s_exp). As explained in Section 2.2, in order to ensure the ROC of QH CORDIC, when i = 5, 13, 41, 121, ···, (3 i + 2 − 1)/2, ··· where i starts from 0, repeated iterations are needed. Therefore, except for the iterative X/Y/Z data path that performs the The new definitions sign_1_2, sign_1_4, sign_3_4, sign_1_8, sign_3_8, sign_5_8, and sign_7_8 in Figure 2 denote that index equals 1/2, 1/4, 3/4, 1/8, 3/8, 5/8, and 7/8, respectively. Although the inputs base and index have quite a wide range around their numerical values, both of them are still rational numbers, which means that base and index can be expressed by p/q where p and q are two integers with the denominator q not equal to 0. Denote base = p 1 /q 1 and index = p 2 /q 2 . When the denominator q 2 is even, base must be nonnegative. This is difficult to check in an actual implementation, as p 2 and q 2 of index are difficult to confirm when index is an irregular high-precision FP number. In this paper, we only focus on eight common cases: index equals 1/2, 1/4, 3/4, 1/8, 3/8, 5/8, and 7/8. When index = 1/2, 1/4, 3/4, 1/8, 3/8, 5/8, or 7/8, and, in the meantime, base is negative (base_sign = 1), the 1-bit exception judgement signal exception of the preprocessing module is set to be 1 and the 128-bit exception output signal exception_result is set to be NaN (128 h7fff_8000_0000_0000_0000_0000 _0000_0000).
According to Figure 2, there exist certain cases the where exception judgement signal exception is set to be 1 and exception output signal exception_result is set to a corresponding numerical value. Otherwise, exception is set to be 0, which means that there is no exceptional situation in the preprocessing module. After the preprocessing module, the signals base_sign, index_sign, base_e, index_e, base_m, index_m, exception, and exception_result are transferred to next module, the QH module.

QH Module
In this paper, hardware modeling of the function X Y is done through the logarithmic function lnX, the multiplication of lnX and Y, and the exponential function e YlnX . In the QH module, the implementation of the logarithmic function and the exponential function with QH CORDIC methodology in Section 2.1 is abstracted as in Figure 3.   In Figure 3, the vectoring mode of the QH CORDIC is for the logarithmic function lnX, while the rotating mode of the QH CORDIC is for exponential function e YlnX . The vectoring mode of the QH CORDIC bears a close resemblance to the rotating mode of the QH CORDIC because they both have three major data paths and seven submodules. Their differences mainly lie in the signal that determines the rotation direction of the next iteration (signal s_ln and signal s_exp). As explained in Section 2.2, in order to ensure the ROC of QH CORDIC, when i = 5, 13, 41, 121, ···, (3 i+2 − 1)/2, ··· where i starts from 0, repeated iterations are needed. Therefore, except for the iterative X/Y/Z data path that performs the iterative formulae x n+4 /y n+4 /z n+4 , the iterative X/Y/Z data paths when i = 5, 13, 41, 121 are also listed.
The actual implementation of the QH module establishes a finite state machine (FSM), which has twelve states : INIT_LN, ITE_LN, ONE_STEP_1_LN, ONE_STEP_2_LN, IN  To some extent, the architectures of the twelve states are parallelly similar. Figure 5 shows the initialization process of the logarithmic function and the exponential function. In Figure 5a, z_in = {base_m, 35′b0} where base_m is one of the outputs of the preprocessing module. In Figure 5b, K_inv is a constant and its value is as Equation (12).  To some extent, the architectures of the twelve states are parallelly similar. Figure 5 shows the initialization process of the logarithmic function and the exponential function. In Figure 5a, z_in = {base_m, 35 b0} where base_m is one of the outputs of the preprocessing module. In Figure 5b, K_inv is a constant and its value is as Equation (12).
Input exp_pre_mantissa is one of the outputs of the state INNER_DEAL. The architecture of the states INNER_DEAL_0, INNER_DEAL_1, INNER_DEAL_2, and IN-NER_DEAL_3 is shown in Figure 6.
In Figure 6, the external inputs are index_s, index_m, index_e, base_e, ite_z_ln, and a constant 1/ln2. Among them, index_s, index_m, index_e, and base_e are the outputs of the preprocessing module, while ite_z_ln is one of the outputs of the module z_pre in the state ITE_LN.
The state INNER_DEAL_0 in Figure 6 realizes the normalization of the calculated results base_e and ite_z_ln after the states INIT_LN, ITE_LN, ONE_STEP_1_LN, and ONE_STEP_2_LN, while the state INNER_DEAL_0 turns the results of logarithmic function lnX into a normalized 128-bit FP number {ln_sign, ln_e, ln_m}.
INNER_DEAL_1 in Figure 6 computes the multiplication of two 128-bit FP numbers, lnX and Y. The result of INNER_DEAL_1 is also a normalized 128-bit FP number {rslt_s, rslt_e, rslt_e}.
The states INNER_DEAL_2 and INNER_DEAL_3 realize predealing of the 128-bit FP input {rslt_s, rslt_e, rslt_e} of exponential function e YlnX , including exception checking and ensuring the computational validity problem of the exponential function. Generally, state INNER_DEAL_2 performs the former multiplication operation while state INNER_DEAL_3 performs the later multiplication operation and an addition operation. Figure 7a,b demonstrate the architectures of state ITE_LN and state ITE_EXP, respectively. The yellow box in Figure 7a corresponds to the yellow box in Figure 3, while the red box in Figure 7b corresponds to the red box in Figure 3. Module x_pre in Figure 7a,b is the same, so it is with the module y_pre and the module z_pre. The three important modules x_pre, y_pre, and z_pre consist of iterative data paths of x n+4 , y n+4 , and z n+4 in Equation (2). Furthermore, the signal in the register cnt_next and signal exception determine the next state after ITE_LN or ITE_EXP.
The architecture of module x_pre is shown in detail in Figure 8. The module x_pre is performs iterative X data path calculation and involves many multiplications with single integer constants. Binary decomposition is used to encode single integer constants to reduce delay. Module x_pre divides the iterative X data path into three layers. Layer1 evaluates 2, 4, 8, 16, and 32 times of the inputs x and y with shift operations. Layer2 uses the results of layer1 and the binary decomposed results of single integer constants with a series of compressors and adders to obtain 9, 15, 19, 21, and 35 times the input x, and 3, 5, 7, 9, 11, 13, and 15 times the input y. Layer3 performs shift and compress operations on the results of layer2 according to Equation (2).
Architecture of the module y_pre is quite similar to module x_pre. Iterative Z data path calculation is also optimized and its architecture is shown in Figure 9.
The iterations of the Z data path require 16 calculations for the inputs {σ n , σ n+1 , σ n+2 , σ n+3 } from 0000 to 1111 to obtain 16 results. The calculating process of each result is quite similar. Figure 9 takes {σ n , σ n+1 , σ n+2 , σ n+3 } = {0, 0, 0, 0} as an example.  To some extent, the architectures of the twelve states are parallelly similar. Figure 5 shows the initialization process of the logarithmic function and the exponential function. In Figure 5a, z_in = {base_m, 35′b0} where base_m is one of the outputs of the preprocessing module. In Figure 5b, K_inv is a constant and its value is as Equation (12).     Figure 7a corresponds to the yellow box in Figure 3, while the red box in Figure 7b corresponds to the red box in Figure 3. Module x_pre in Figure 7a,b is the same, so it is with the module y_pre and the module z_pre. The three important modules x_pre, y_pre, and z_pre consist of iterative data paths of xn + 4, yn + 4, and zn + 4 in Equation (2). Furthermore, the signal in the register cnt_next and signal exception determine the next state after ITE_LN or ITE_EXP.   Figure 10a,b jointly make up the repetitive iterative x n+4 , y n+4 and z n+4 data path for logarithmic functions. However, when register cnt is 5 or 41, the FSM jumps to state ONE_STEP_1_LN; when register cnt is 13 or 121, the FSM jumps to state ONE_STEP_2_LN. Figure 11a,b demonstrate architectures of State ONE_STEP_1_EXP and state ONE_STE-P_2_EXP, respectively. The two blue boxes in Figure 11a,b jointly make up the repetitive iterative x n+4 , y n+4 and z n+4 data path for exponential functions.

Postprocessing Module
After the QH module, the outputs x_out, y_out, exp_out, and exception_out are generated. As Figure 12 shows, the signal exp_pre_exp is just exp_out of the QH module and signal exception_out_inner is exception_out of the QH module. The postprocessing module mainly serves to merge a normalized 128-bit FP output of powering function X Y . The architecture of module x_pre is shown in detail in Figure 8. The module x_pre is performs iterative X data path calculation and involves many multiplications with single integer constants. Binary decomposition is used to encode single integer constants to reduce delay. Module x_pre divides the iterative X data path into three layers. Layer1 evaluates 2, 4, 8, 16, and 32 times of the inputs x and y with shift operations. Layer2 uses the results of layer1 and the binary decomposed results of single integer constants with a se-   Architecture of the module y_pre is quite similar to module x_pre. Iterative Z data path calculation is also optimized and its architecture is shown in Figure 9 Figure 9. Architecture of module z_pre.   Architecture of the module y_pre is quite similar to module x_pre. Iterative Z data path calculation is also optimized and its architecture is shown in Figure 9 σi+1, σi+2, σi+3}={0,0,0,0}  {σi, σi+1, σi+2, σi+3}={0,0,0,1}   {σi, σi+1, σi+2, σi+3}={1,1,1,0}  {σi, σi+1, σi+2, σi+3}={1,1,1,1}   Figure 9. Architecture of module z_pre. Figure 9. Architecture of module z_pre. σn + 3} from 0000 to 1111 to obtain 16 results. The calculating process of each result is quite similar. Figure 9 takes {σn, σn + 1, σn + 2, σn + 3} = {0, 0, 0, 0} as an example. Figure 10a,b demonstrate architectures of the state ONE_STEP_1_LN and state ONE_STEP_2_LN, respectively. The two green boxes in Figure 10a,b jointly make up the repetitive iterative xn + 4, yn + 4 and zn + 4 data path for logarithmic functions. However, when register cnt is 5 or 41, the FSM jumps to state ONE_STEP_1_LN; when register cnt is 13 or 121, the FSM jumps to state ONE_STEP_2_LN.  Figure 11a,b demonstrate architectures of State ONE_STEP_1_EXP and state ONE_STEP_2_EXP, respectively. The two blue boxes in Figure 11a,b jointly make up the repetitive iterative xn + 4, yn + 4 and zn + 4 data path for exponential functions.  register cnt is 5 or 41, the FSM jumps to state ONE_STEP_1_LN; when register cnt is 13 or 121, the FSM jumps to state ONE_STEP_2_LN.  Figure 11a,b demonstrate architectures of State ONE_STEP_1_EXP and state ONE_STEP_2_EXP, respectively. The two blue boxes in Figure 11a,b jointly make up the repetitive iterative xn + 4, yn + 4 and zn + 4 data path for exponential functions.   After the QH module, the outputs x_out, y_out, exp_out, and exception_out are generated. As Figure 12 shows, the signal exp_pre_exp is just exp_out of the QH module and signal exception_out_inner is exception_out of the QH module. The postprocessing module mainly serves to merge a normalized 128-bit FP output of powering function X Y . mantissa [14] mantissa [15]

ASIC Implementation Results of the Proposed Architecture
The proposed architecture was coded in Verilog HDL and synthesized with the

ASIC Implementation Results of the Proposed Architecture
The proposed architecture was coded in Verilog HDL and synthesized with the TSMC 65 nm standard cell library, using Synopsys Design Compiler. The proposed architecture was synthesized with the best achievable timing constraints, with the constraint of the max-area set to zero and a global operating voltage of 0.9 V. The ASIC implementation details are shown in Table 3. ATP and total energy are usually used to properly and roundly evaluate ASIC performance. The smaller the ATP and total energy are, the better the ASIC design. In a similar way, according to the definitions of energy efficiency and area efficiency, the smaller the energy efficiency is and the larger the area efficiency is, the better the ASIC design.

Evalutation and Comparative Analysis
In order to reveal the superiority of our approach based on the same conditions, this paper makes a comparative analysis using different indicators against other state-of-the-art approaches for computing X Y -like functions, including computation correctness, word length, computation latency, and power consumption.

Computational Correctness
In Python, we verify the computational correctness of the proposed architecture and other approaches by evaluating their relative errors. The definition of the indicator relative error (RE) is as Equation (13): where V T stands for the theoretical value of function X Y and V M stands for the measured value of function X Y with the proposed and other approaches. The maximum relative error is denoted as max(RE k ) of k test quantities. Another important indicator, the average relative error (ARE), is defined as Equation (14): where k stands for test quantity of function X Y . Currently, only [12,14] in the state-of-the-art approaches implement both N √ X and X N computation. For [12,14], the test quantity k is 40,000. For [14], the number of iterations n is set to be 10. For [12], 1024 pieces of result data are stored in either LUT1 or LUT2. Software verification of [12,14] and the proposed architecture for function X Y is presented in Table 4. From Table 4, in terms of max(RE) and ARE, it is evident that the proposed architecture for N √ X computation is superior compared with the state-of-the-art approaches [12,14]. The proposed architecture for X N computation is also superior compared with the state-ofthe-art approaches [12,14].

Word Length
In this subsection, this paper analyzes the word length required for each approach's hardware implementation based on the conditions in Section 4.2.1. The longer the word length is, the better the precision.
The word lengths of each module in [12,14], and the proposed architecture are shown in Table 5. By contrast, the word length of the proposed architecture is much longer than the state-of-the-art approaches, which means that the proposed architecture has the characteristic of high precision. However, the long word length consumes more area and power, increases the critical path, and lowers the working frequency to some extent.

Timing Analysis and Power Analysis
In this section, this paper first analyzes computation latency of [14,15] and the proposed architecture to calculate N √ X or X N . NL refers to latency savings compared with other architectures and it is defined as Equation (15): where L is latency. Compared with [14,15], the percentage of NL is as Equation (16): Under the circumstances, we can calculate NL, P NL for [14,15] and the proposed architecture, as shown in Table 6.  [14,15], and proposed architecture is 13, 38 and 113 respectively.
From Table 6, for N √ X computation, the proposed architecture saves 2.56% and 12.64% latency compared with [14,15], respectively. For X N computation, the proposed architecture saves 3.79% and 12.64% latency, respectively. The direct comparison of latency in terms of cycle is not fair, as the periods of [14,15] and the proposed architecture may be different. Hence, the indicator total time is employed to measure the timing efficiency of the three architectures. According to Table 6, the total time of the proposed architecture is about 6 times that of [14] and about 2.8 times that of [15]. It should be noticed that the comparison is done in terms of total time without taking into account the fact that the three architectures run on different technologies.
For fairness, the indicator throughput was also measured. From Table 6, for N √ X computation, the proposed architecture has ≈35.4% and ≈3.2% throughput overhead over [14] and [15], respectively; for X N computation, the proposed architecture has ≈37.1% and ≈3.2% throughput overhead over [14,15], respectively. It can be inferred that the throughput of our proposed architecture is larger, as our precision is much higher. For highprecision applications, the proposed architecture can achieve better timing performance.
Next, this paper focuses on the power consumption to calculate N √ X or X N for [14,15] and the proposed architecture. Similar with Section 4.1, this paper employs energy efficiency to compare with other architectures.
As seen in Table 7, the proposed design for N √ X computation saves 25.44% and 34.18% energy efficiency when it is compared with [14,15], respectively. For X N computation, our design can save 21.58% and 34.18% energy efficiency compared with [14] and [15], respectively.

Conclusions
In this paper, a novel hardware architecture is proposed based on the QH CORDIC methodology to compute X Y -like functions. The computation of function X Y is divided into the computation of function lnX, a multiplication operation, and function e x , of which function lnx and function e x are calculated using the QH CORDIC.
The QH CORDIC methodology merges four single iterations into a whole parallel iteration, simultaneously computes a total of 16 possible values of x n+4 , y n+4 , and z n+4 respectively, and picks up rotation directions of the next iteration according to z n+4 for rotating mode or y n+4 for vectoring mode. Compared with traditional hyperbolic CORDIC algorithms, the QH CORDIC only needs 32 clock cycles to achieve 128-bit FP data accuracy, which greatly reduces the circuit latency.
The proposed architecture solves the problem of limited ROC of CORDIC algorithm and explains the validity of computing function lnx and e x with the QH CORDIC, ensuring that the domain of inputs for the proposed architecture are high-precision (128 bits) floatingpoint numbers.
The ASIC implementation results show that the proposed architecture's circuit latency, area, and power consumption are 76 cycles, 1417366 µm 2 , and 36.2189 mW, respectively, under a working frequency of 300 MHz and a precision of at least 113 bits. The conspicuous timing performances of the proposed architecture with high-precision inputs and dramatically accurate outputs are achieved at the cost of area and power consumption. This is related to the adopted QH CORDIC methodology to a large extent, as the QH CORDIC follows the parallel computing strategy.
Compared with other architectures, the entire logic path turned out to be minor error and high accuracy. The proposed architecture is about 30 orders of magnitude superior to [12,14] in terms of max(RE) and ARE. The proposed architecture was also proved to be low-latency. For N √ X computation, the proposed architecture has 2.56% and 12.64% latency overhead over [14,15], respectively. For X N computation, it has 3.79% and 12.64% latency overhead over [14,15], respectively. In addition, the proposed architecture outmatches [14,15] in terms of indicator energy efficiency and throughput. The word length of the proposed architecture is 128 bits, three or four times that of [12,14].
Therefore, the proposed architecture is highly favored to perform high-precision floating-point computing of X Y -like functions. However, the long word length may weaken the advantages in terms of area and power. Our focus in the future will firstly be rigorous error analysis that cuts down on the word length of x n+4 , y n+4 , and z n+4 during the QH CORDIC iterations. Secondly, the word length of the two multipliers Y and lnX can be shortened with the accuracy of product YlnX maintained. In this way, the computation of YlnX in e YlnX can be simplified. These improvements will help to reduce the total area and power of the circuit design.
In addition, the proposed hardware architecture can be modified into a low-latency and minor-error architecture for X Y -like functions whose inputs are configurable in terms of precision. In line with the precisions of current scientific computing applications, not only quadruple precision (128 bits), but also half-precision (16 bits), single precision (32 bits), and double precision (64 bits) are accessible in the configurable architecture.

Conflicts of Interest:
The authors declare no conflict of interest.