A Noniterative Radix-8 CORDIC Algorithm with Low Latency and High Efficiency

Abstract: An efficient, noniterative Radix-8 (NR-8) coordinate rotation digital computer (CORDIC) algorithm is proposed for low-latency, high-efficiency computation of the sine and cosine functions or of a phase shift. The function values are computed precisely using only angles in the narrow range [0, π/12] rather than the wide range [0, π/2]. The algorithm is expressed by a formula that replaces the traditional iterative process with a complex multiplier. Simulation and FPGA experiments show that the NR-8 CORDIC algorithm operates well: the 16-bit precision output has an absolute error of only 0.012% when computing the sine or cosine function with a step of 0.001°. Compared with the best conventional CORDIC algorithm, the clock latency decreases to less than 50%, and the algorithm needs only half of the logic resources and consumes half of the power. The algorithm also has advantages over other recently improved CORDIC algorithms, requiring less than half of their clock latency even for a 23-bit precision output. Therefore, this algorithm has potential applications in real-time systems such as radar digital beamforming.


Introduction
As one of the most common transcendental functions, the sine or cosine function has been widely used in real-time digital signal processing systems, such as radar, ultrasound, robotics and communication [1][2][3][4][5][6][7]. The accuracy and efficiency of computing these functions are two key requirements for evaluating the performance of such systems. For this purpose, many methods for calculating the sine or cosine function have been developed, such as the lookup table, Taylor series and polynomial approximation [8][9][10]. However, these methods suffer from either high complexity or high latency, and thus an efficient method is needed to provide accurate and efficient computation for real-time systems. Fortunately, the coordinate rotation digital computer (CORDIC) algorithm [11] provides accurate and efficient computation by iteratively decomposing the calculation into a series of addition, subtraction and shift operations, which has led to its wide use in digital circuits to compute trigonometric, exponential and other functions [12]. However, as an iterative algorithm, the accuracy of the CORDIC algorithm strongly relies on the number of iterations, so increasing the number of iterations increases the clock latency, thus lowering the efficiency of the computation.
Electronics 2020, 9,  Some of the improved CORDIC algorithms have been widely used in radar digital beamforming (DBF) systems. For instance, Lee et al. developed a CORDIC-based algorithm to be used in Multi-Gbps MIMO systems, which is implemented by a Virtex-6 FPGA using 49,752 slices, and the algorithm needs 260 ns (250 MHz, 65 clock periods) of latency due to the many iterations required for computations [4]. Similarly, Jun et al. described look-ahead, pipelined CORDIC-based adaptive filters and their application to adaptive beamforming [5], and the pipeline level m depends on the m-bit precision. However, the CORDIC algorithm often requires many iterations to converge, which has become a major bottleneck for real-time applications.
In this work, a new noniterative Radix-8 (NR-8) CORDIC algorithm is proposed for low-latency implementation on FPGAs. The NR-8 CORDIC algorithm was developed in three steps: (1) The NR-8 CORDIC algorithm was derived from the conventional Radix-2 CORDIC one. (2) The input angle θ was restricted to a narrow range by simultaneously transforming the input variables $x_0$ and $y_0$. (3) A formula was deduced and optimized. These steps narrow the selected range of the iteration angle and yield a noniterative formula of the CORDIC algorithm; in addition, the algorithm can be accelerated by the multiplier modules readily available in FPGAs [18]. As a result, the algorithm reduces the 7-17 clock cycles of latency of the conventional CORDIC (16-bit precision) algorithm to three clock cycles, needs fewer logic resources and consumes less power. Compared with the LLH algorithm [16], it has great advantages in terms of time and resources. The rest of this paper is organized as follows. Section 2 presents the derivation from the conventional CORDIC algorithm. Section 3 introduces the proposed NR-8 CORDIC. Section 4 presents its FPGA implementation and analysis. Section 5 introduces the application of the NR-8 CORDIC in radar DBF. Finally, a conclusion is drawn from the results obtained in the above sections.

Conventional CORDIC Rotator Algorithm
The CORDIC algorithm usually operates in rotation mode or vector mode [11,12], following linear, circular or hyperbolic coordinate trajectories. In this paper, we focus on the rotation mode using circular trajectory.
The rotation mode is depicted in Figure 1, where θ is the angle between the vectors $V_0(x_0, y_0)$ and $V_d(x_d, y_d)$. As the vector $V_0$ rotates counterclockwise to the vector $V_d$, the coordinate $(x_d, y_d)$ can be described as in Equation (1):
$$\begin{bmatrix} x_d \\ y_d \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \tag{1}$$
If the initial vector $(x_0, y_0)$ is set to $x_0 = 1$, $y_0 = 0$, Equation (1) can be used to compute $\cos\theta$ and $\sin\theta$. θ is decomposed into a series of micro angles, each of which corresponds to one step rotation as shown in Figure 1 and described as in Equation (2):
$$\theta = \sum_{i=0}^{n-1} \theta_i, \quad \theta_i = \tan^{-1}\left(\sigma_i R^{-(i+1)}\right) \tag{2}$$
where n denotes the number of rotations, R denotes the radix, $R = 2^l$, $l \in \mathbb{N}$, $\theta_i$ denotes the micro angles and $\sigma_i$ are the selection factors, defined as the integers within the interval $\sigma_i \in [-R/2, R/2]$. Substituting Equation (2) into Equation (1) yields Equation (3):
$$\begin{bmatrix} x_d \\ y_d \end{bmatrix} = \prod_{i=0}^{n-1} \cos\theta_i \begin{bmatrix} 1 & -\sigma_i R^{-(i+1)} \\ \sigma_i R^{-(i+1)} & 1 \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \tag{3}$$
Equation (3) describes the computation process illustrated in Figure 1. Apparently, the recursion formula of the ith rotation can be written as Equation (4):
$$\begin{bmatrix} x_{i+1} \\ y_{i+1} \end{bmatrix} = \cos\theta_i \begin{bmatrix} 1 & -\sigma_i R^{-(i+1)} \\ \sigma_i R^{-(i+1)} & 1 \end{bmatrix} \begin{bmatrix} x_i \\ y_i \end{bmatrix} \tag{4}$$
If R equals 2, it is the conventional Radix-2 (R-2) CORDIC iterative algorithm. If R equals 4, it becomes a conventional Radix-4 (R-4) CORDIC iterative algorithm. If R equals 8, it becomes a conventional Radix-8 (R-8) CORDIC iterative algorithm, and thus Equation (3) can be written as Equation (5):
$$\begin{bmatrix} x_d \\ y_d \end{bmatrix} = K \prod_{i=0}^{n-1} \begin{bmatrix} 1 & -\sigma_i 8^{-(i+1)} \\ \sigma_i 8^{-(i+1)} & 1 \end{bmatrix} \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \tag{5}$$
where the scale factor K is defined as Equation (6):
$$K = \prod_{i=0}^{n-1} \cos\theta_i = \prod_{i=0}^{n-1} \frac{1}{\sqrt{1 + \sigma_i^2\, 8^{-2(i+1)}}} \tag{6}$$
The conventional CORDIC algorithm is implemented in an iterative fashion, in which the input angle is approached step by step using the series of micro angles. After the ith iteration, the residual error angle is defined as in Equation (7):
$$z_i = \theta - \sum_{j=0}^{i-1} \theta_j \tag{7}$$
where $z_0 = \theta$. For the next iteration, an optimal factor $\sigma_i$ is selected so that the residual error angle becomes minimal, which can be calculated by using Equation (8):
$$\sigma_i = \arg\min_{\sigma \in [-R/2,\, R/2]} \left| z_i - \tan^{-1}\left(\sigma R^{-(i+1)}\right) \right| \approx \arg\min_{\sigma \in [-R/2,\, R/2]} \left| \omega_i - \sigma \right| \tag{8}$$
where $\omega_i = z_i R^{i+1}$. It can be directly solved using a rounding operation via Equation (9):
$$\sigma_i = U(\omega_i) \tag{9}$$
where the function $U(\omega_i)$ rounds $\omega_i$ to the nearest integer.
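The iterative procedure described above can be illustrated with a short floating-point sketch. It is not the hardware implementation: the function name `cordic_radix`, the default of six rotations and the clamping of the selection factor are assumptions made for illustration, and the sketch is exercised only for θ in [0, π/12], where the rounded selection factors stay within [−R/2, R/2].

```python
import math

def cordic_radix(theta, R=8, n=6, x0=1.0, y0=0.0):
    """Iterative radix-R CORDIC rotation (floating-point illustration only)."""
    x, y, z = x0, y0, theta          # z is the residual error angle, z_0 = theta
    K = 1.0                          # accumulated scale factor
    for i in range(n):
        w = z * R ** (i + 1)         # omega_i = z_i * R^(i+1)
        sigma = max(-R // 2, min(R // 2, round(w)))  # rounding, clamped to [-R/2, R/2]
        a = sigma * R ** -(i + 1)    # tangent of the micro angle
        x, y = x - a * y, y + a * x  # one unscaled micro rotation
        K *= 1.0 / math.sqrt(1.0 + a * a)
        z -= math.atan(a)            # update the residual error angle
    return K * x, K * y, z

c, s, residual = cordic_radix(0.2)
print(c, s, residual)
```

With R = 8 and six rotations, the residual angle left after the loop bounds the error of the returned cosine and sine values.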
As the radix R increases, the number of iterations decreases, but the range of the selection factor $\sigma_i$ grows, thus increasing the complexity of each iteration of the conventional CORDIC algorithm. Hence, we found that the R-8 CORDIC (R = 8) strikes a good balance between complexity and efficiency. However, the current iterative R-8 algorithm still needs several iterations. For example, six iterations are necessary for a 16-bit depth digital signal processing application. To address this problem, this paper proposes a noniterative form of the R-8 CORDIC (NR-8 CORDIC) algorithm.

Noniterative Radix-8 CORDIC Algorithm
We propose a noniterative computation structure for the R-8 CORDIC algorithm by working within a narrow input angle interval, using an explicit solution formula, simplifying the scale factor and transforming the input variables $x_0$ and $y_0$ to accelerate the convergence of the algorithm.

Narrow Input Angle θ Range
Conventionally, one only needs to consider the input angle θ ∈ [0, π/2] of the first quadrant, from which the rest of the quadrants can be easily computed by invoking the symmetry property of the sine or cosine function. Thus, the rest of the quadrants can be mapped to the first quadrant by simple transformation. In this article, we first narrow the input angle interval into an angle range of [0, π/12].
The first quadrant of the coordinate system is equally divided into six regions, marked from A to F, each spanning an angle of π/12. Then the angle θ ∈ [0, π/2] can be folded to the range of ϕ ∈ [0, π/12]. The CORDIC output mappings between θ and ϕ are given in Table 1. Accordingly, the input variables $x_0$, $y_0$ need to be changed to $x'_0$, $y'_0$, respectively. Therefore, we can readily compute the output values in the angle range of ϕ according to the CORDIC algorithm, on the basis of which, and as shown in Table 1, the output values $x_d$, $y_d$ in the whole range of θ are obtained with ease ($\sqrt{3}$ is calculated by using the Taylor series, and a matrix is defined as $RT(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$).
Table 1. The CORDIC output mappings between θ and ϕ.
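The folding idea can be sketched as follows. The exact region mappings of Table 1 are not reproduced in the text, so the scheme below is an assumption: θ is measured from the nearest multiple of π/6 (which reflects every second region) and the result is unfolded with a fixed rotation by that multiple. The names `fold_angle` and `unfold` are hypothetical.

```python
import math

STEP = math.pi / 6  # two adjacent regions of width pi/12 share one multiple of pi/6

def fold_angle(theta):
    """Fold theta in [0, pi/2] to phi in [0, pi/12] (assumed mapping)."""
    k = round(theta / STEP)       # nearest multiple of pi/6, k in {0, 1, 2, 3}
    d = theta - k * STEP          # signed offset in [-pi/12, pi/12]
    return abs(d), k, d < 0       # phi, multiple index, reflection flag

def unfold(cos_phi, sin_phi, k, reflected):
    """Recover cos(theta), sin(theta) from the values at phi by a fixed rotation."""
    s = -sin_phi if reflected else sin_phi
    c_k, s_k = math.cos(k * STEP), math.sin(k * STEP)  # constants: 0, 1/2, sqrt(3)/2, 1
    return c_k * cos_phi - s_k * s, s_k * cos_phi + c_k * s

phi, k, reflected = fold_angle(1.0)
print(phi, k, reflected, unfold(math.cos(phi), math.sin(phi), k, reflected))
```

The fixed rotation constants are 0, 1/2, √3/2 and 1, which is consistent with the remark that only $\sqrt{3}$ needs to be evaluated.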

Explicit Formula of Convergence
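Equation (5) can be computed naturally by using iterations. The scale factor K is temporarily ignored for the sake of simplicity. The iterative formula of Equation (5) is given as follows. Let us define
$$a_i = \sigma_i 8^{-(i+1)} \tag{10}$$
If i = 1, then
$$x_1 = x_0 - a_0 y_0, \quad y_1 = a_0 x_0 + y_0$$
If i = 2, then
$$x_2 = (1 - a_0 a_1) x_0 - (a_0 + a_1) y_0, \quad y_2 = (a_0 + a_1) x_0 + (1 - a_0 a_1) y_0$$
If i = 3, then
$$x_3 = (1 - a_0 a_1 - a_0 a_2 - a_1 a_2) x_0 - (a_0 + a_1 + a_2 - a_0 a_1 a_2) y_0$$
$$y_3 = (a_0 + a_1 + a_2 - a_0 a_1 a_2) x_0 + (1 - a_0 a_1 - a_0 a_2 - a_1 a_2) y_0$$
A deductive formula can be summarized as
$$x_n = A_n x_0 - B_n y_0, \quad y_n = B_n x_0 + A_n y_0 \tag{12}$$
where $A_n$, $B_n$ are respectively defined as
$$A_n = \sum_{k \ge 0,\ 2k \le n} (-1)^k f_a(C_n^{2k}), \quad B_n = \sum_{k \ge 0,\ 2k+1 \le n} (-1)^k f_a(C_n^{2k+1}) \tag{13}$$
where the function $f_a(C_n^m)$, $0 \le m \le n$, is defined as the sum of the products of m different elements selected from the set $\{a_0, a_1, \cdots, a_{n-1}\}$, with $f_a(C_n^0) = 1$, as described in Equation (14):
$$f_a(C_n^m) = \sum_{\{i_1, \cdots, i_m\} \in I(m)} a_{i_1} a_{i_2} \cdots a_{i_m} \tag{14}$$
where $I(m)$ denotes all possible combinatorial sets of m unique indices selected from $\{0, 1, \cdots, n-1\}$. Apparently, there are a total of $C_n^m$ sets in $I(m)$. Substituting Equation (10) into the products in Equation (14) yields the inequality of Equation (15), which involves the quantities F(m) and S(m). Figure 2 shows that F(m) quickly vanishes as m or S(m) increases, from which we have two observations as follows:
Observation (1): If m ≥ 3, when S(m) ≥ 7, F(m) < 0.000031, which can be ignored.
Observation (2): If m = 1 or m = 2, when S(m) ≥ 6, F(m) < 0.000061, which can be ignored.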
Note that S(m) ≥ 8 is necessary in order to achieve a high accuracy. Thus, we can ignore the terms that satisfy either of the above two conditions in $f_a(C_n^m)$ of Equation (13) to greatly simplify the computation. For example, the terms $a_0a_1a_2$, $a_0a_2a_3$, …, $a_1a_2a_3$, …, $a_0a_1a_2a_3$, …, $a_1a_2a_3a_4$, $a_0a_1a_2$ … $a_2a_3a_4$, and $a_0a_3$, $a_0a_4$, …, $a_1a_2$, …, $a_2a_3$, …, $a_4$, $a_5$, … can be ignored. Thus, the variables $A_n$, $B_n$ in Equation (13) can be simplified as in Equation (17).
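The relation between the iterative recursion and the explicit sums of products can be checked numerically. The sketch below is only an illustration under stated assumptions: the function names and the sample selection factors are hypothetical, the scale factor K is ignored as in the text, and no terms are dropped.

```python
from itertools import combinations
from math import prod, isclose

def f_a(a, m):
    """Sum of the products of every m-element subset of a (equals 1 for m = 0)."""
    return sum(prod(c) for c in combinations(a, m))

def explicit_AB(a):
    """A_n and B_n as alternating sums over even and odd m, respectively."""
    n = len(a)
    A = sum((-1) ** (m // 2) * f_a(a, m) for m in range(0, n + 1, 2))
    B = sum((-1) ** (m // 2) * f_a(a, m) for m in range(1, n + 1, 2))
    return A, B

def iterate(a, x0=1.0, y0=0.0):
    """Unscaled radix-8 recursion applied one micro rotation at a time."""
    x, y = x0, y0
    for ai in a:
        x, y = x - ai * y, y + ai * x
    return x, y

sigma = [2, -3, 4, 1]                                   # sample selection factors
a = [s * 8.0 ** -(i + 1) for i, s in enumerate(sigma)]  # a_i = sigma_i * 8^-(i+1)
print(explicit_AB(a), iterate(a))
```

With $x_0 = 1$, $y_0 = 0$, the recursion returns $(A_n, B_n)$ directly, so both printed pairs should agree up to floating-point rounding.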

Scale Factor
Now we consider the scale factor K of Equation (6). It can be expanded as a Taylor series, in which $C = 1 - \frac{1}{2}a_0^2 + \frac{3}{8}a_0^4$; apparently, all the values of C and $\frac{1}{2}a_1^2$ can be calculated by enumerating all values of $a_0$ and $a_1^2$, respectively. In order to speed up the parallel computation of the NR-8 CORDIC algorithm, we can compensate the input variables $x_0$, $y_0$ instead of $A_n$, $B_n$ by the scale factor K. More details about this process are described in the following section.
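As a quick numerical check of the truncated series for the first factor of K, the sketch below compares $1/\sqrt{1 + a_0^2}$ with C. It assumes $\sigma_0 \in \{0, 1, 2\}$, which follows from rounding 8ϕ for ϕ ∈ [0, π/12]; the function names are hypothetical.

```python
import math

def exact_factor(a):
    """Exact contribution of one micro rotation to K: 1 / sqrt(1 + a^2)."""
    return 1.0 / math.sqrt(1.0 + a * a)

def taylor_C(a0):
    """Truncated series C = 1 - a0^2/2 + 3*a0^4/8."""
    return 1.0 - 0.5 * a0 ** 2 + 0.375 * a0 ** 4

for sigma0 in (0, 1, 2):          # assumed range of sigma_0
    a0 = sigma0 / 8
    print(sigma0, exact_factor(a0), taylor_C(a0), abs(exact_factor(a0) - taylor_C(a0)))
```

Because only a few values of $a_0$ occur, the corresponding values of C can be tabulated instead of being computed at run time.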

3.4. Transformation of the Inputs $x_0$ and $y_0$
According to Table 1, the input angle θ ∈ [0, π/2] can be folded to the range of [0, π/12]. Accordingly, the input variables $x_0$, $y_0$ should also be transformed to $x'_0$, $y'_0$ following the mappings of Table 1, in which $\sqrt{3}$ can be calculated by using the Taylor series. The input variables $x'_0$, $y'_0$ are then multiplied by the scale factor K to compensate for the gain error due to the iterations, which produces two new variables, $x''_0$, $y''_0$, described as
$$x''_0 = K x'_0, \quad y''_0 = K y'_0$$
Let $A = A_n$, $B = B_n$; the explicit formula of $x_n$ and $y_n$ in Equation (12) can then be rewritten as
$$x_n = A x''_0 - B y''_0, \quad y_n = B x''_0 + A y''_0 \tag{20}$$
As a result, the final outputs $x_d$, $y_d$ in Equation (1) can be expressed by $x_n$, $y_n$ in Equation (20), respectively, and Equation (20) can be easily implemented by using complex multiplication [19].
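The final rotation has the structure of a single complex product $(A + jB)(x + jy)$, which the following floating-point sketch illustrates. The helper name and the sample selection factors are hypothetical, and A, B and K are computed exactly here rather than with the simplified formulas.

```python
import math

def rotate_by_complex_multiply(x0, y0, a):
    """Rotate (x0, y0) by sum(atan(a_i)) using one final complex multiplication."""
    rot = complex(1.0, 0.0)
    K = 1.0
    for ai in a:
        rot *= complex(1.0, ai)            # A + jB = prod(1 + j*a_i)
        K /= math.sqrt(1.0 + ai * ai)      # scale factor K = prod(cos(theta_i))
    xs, ys = K * x0, K * y0                # gain-compensated inputs
    out = rot * complex(xs, ys)            # x_n + j*y_n = (A + jB)(xs + j*ys)
    return out.real, out.imag

sigma = [2, 3, -1, 4]                                   # sample selection factors
a = [s * 8.0 ** -(i + 1) for i, s in enumerate(sigma)]
theta = sum(math.atan(ai) for ai in a)
print(rotate_by_complex_multiply(1.0, 0.0, a), (math.cos(theta), math.sin(theta)))
```

Applying K to the inputs rather than to A and B lets the compensation run in parallel with the computation of A and B, as described above.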

Implementation and Analysis
In this section, the architecture and performance of the NR-8 CORDIC algorithm are discussed with a simulation and an FPGA implementation.
For the residual $z_i$, $i \ge 2$, $z_i$ can be described as in Equation (21), where the approximation $\tan^{-1}(x) \approx x$, $x < 1/16$, is taken. The error bound of such an approximation can be easily estimated to be $8.119 \times 10^{-5}$. By considering m-bit fixed-point processing, where all variables are stored in an FPGA as m-bit integers, we use $Z_i = \lfloor z_i \times 2^m \rfloor$ for the sake of simplicity, where $\lfloor \cdot \rfloor$ denotes rounding down.
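The quoted error bound and the fixed-point scaling can be reproduced with a few lines. This is a sketch only: the helper `to_fixed` and the sample values are assumptions for illustration.

```python
import math

# Largest error of atan(x) ~ x on 0 <= x < 1/16 occurs at the upper end of the range.
x = 1 / 16
bound = x - math.atan(x)
print(f"{bound:.4e}")

def to_fixed(z, m):
    """m-bit fixed-point scaling with rounding down: Z = floor(z * 2^m)."""
    return math.floor(z * 2 ** m)

print(to_fixed(0.05, 16))
```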
From Equation (21), the range of the residual error angle can be found. To denote the bit width of an n-bit fixed-point variable X, we use the form $X[n-1:0]$. For example, $Z_1$ is expressed as $Z_1[m-3:0]$. Now, let us expand $Z_1$ to the form of Equation (23), where $q = \lceil (m-5)/3 \rceil$. Here, $\lceil \cdot \rceil$ denotes rounding up. Accordingly, $z_1$ can be rewritten as in Equation (24). According to Equation (2), $\theta_i = \tan^{-1}\left(\sigma_i 8^{-(i+1)}\right)$, we found that the variables $\sigma_1, \sigma_2, \cdots, \sigma_q$ and $a_1, a_2, \cdots, a_q$ (in Equation (10), $a_i = \sigma_i 8^{-(i+1)}$) should be selected so that they both satisfy the computation of Equation (5) and automatically fulfill the equation $\theta = \sum_{i=0}^{q} \theta_i$. Thus, it is not necessary to follow the iterative solutions in Equations (8) and (9). Instead, from the proposed expansion in Equations (23) and (24), we can directly give the variables as in Equation (25).
As a result, the residual $z_1$ and, likewise, the variables $z_2, z_3, \cdots$ can be rewritten as in Equations (26)-(28). Note that the original iterative approximation of the input angle is now replaced by the new formula in Equations (25)-(28), which is directly computable. The computation process of the variables $\sigma_i$ is shown in Figure 3. In summary, the computation takes the following steps:
1. Compute $\sigma_0$ via rounding 8ϕ.
2. Compute $Z_1$ via the constant values stored in registers and one subtractor in Equation (21).
The remaining steps obtain the $\sigma_i$ and evaluate Equation (20) using the $\sigma_i$ from the third step, all of which are small integers.
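The steps above can be sketched in floating point as follows. Equations (23)-(28) are not reproduced here, so the digit read-out is an assumption: the remaining selection factors are taken as the sign-magnitude base-8 digits of $z_1$, relying on $\tan^{-1}(a) \approx a$ for small a. The function name `nr8_sketch` and the choice q = 4 are hypothetical, and K is computed exactly instead of with the Taylor series.

```python
import math

def nr8_sketch(phi, q=4):
    """Noniterative radix-8 rotation for phi in [0, pi/12] (illustration only)."""
    sigma0 = round(8 * phi)                    # step 1: sigma_0 by rounding 8*phi
    z1 = phi - math.atan(sigma0 / 8)           # step 2: subtract a stored constant
    sign = -1 if z1 < 0 else 1                 # assumed sign-magnitude digit read-out
    mag = abs(z1)
    a = [sigma0 / 8]
    for i in range(1, q + 1):
        weight = 8.0 ** -(i + 1)               # digit weight 8^-(i+1)
        digit = int(mag / weight)              # all digits come straight from z1
        mag -= digit * weight
        a.append(sign * digit * weight)        # a_i = sigma_i * 8^-(i+1)
    rot, K = complex(1.0, 0.0), 1.0
    for ai in a:
        rot *= complex(1.0, ai)                # A + jB
        K /= math.sqrt(1.0 + ai * ai)          # exact scale factor
    return K * rot.real, K * rot.imag          # approximations of cos(phi), sin(phi)

print(nr8_sketch(0.2), (math.cos(0.2), math.sin(0.2)))
```

Every digit is read from $z_1$ without feeding a result back into a later step, which is what removes the iteration; the residual error comes from the $\tan^{-1}(a) \approx a$ approximation and from truncating the digits.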
Thus, Equation (17) can be rewritten as Equation (29). Since all $\sigma_i$ are small integers, the multiplications in Equation (29) can be easily implemented by using shifts and additions.
According to the above deduction, the digital circuit structure of the proposed NR-8 CORDIC algorithm is shown in Figure 4. The contents of the green dashed box can be implemented with a Digital Signal Processing (DSP) module. Therefore, no iterative process is required, and the NR-8 CORDIC algorithm takes only three clock cycles for computation:
Cycle 1: Fold the angle θ ∈ [0, π/2] to the range of ϕ ∈ [0, π/12], and transform the input variables from $x_0$, $y_0$ to $x'_0$, $y'_0$, according to Table 1, Section 3.4 and Figure 4. Compute $\sigma_0$ via rounding 8ϕ, and compute $Z_1$ using the equation $Z_i = z_i \times 2^m$ and a three-entry register in Equation (21).
Cycle 2: Directly fetch the values of $\sigma_i$ and $z_i$ ($i = 1, 2, \cdots, q$) from $Z_1$ as in Equations (25)-(28), which are substituted into Equation (29) for computing A, B; meanwhile, compensate the amplitude of the input variables through the equations $x''_0 = K \times x'_0$, $y''_0 = K \times y'_0$.

Resource Utilization and Performance Analysis
Here, two comparisons are presented for analyzing resource utilization (RU) and performance as follows.

RU Comparison of Conventional CORDIC Algorithms
The NR-8 CORDIC algorithm and several conventional algorithms, namely R-2, R-4 and R-8 [11][12][13][15], are implemented on a Xilinx FPGA (xcku040-ffva1156), and their critical RU, clock latency and power consumption are evaluated. Note that in the experiments, 16-, 8- and 6-level pipelines are used for the R-2, R-4 and R-8 CORDIC cores, respectively, to achieve the same accuracy [15]. Table 2 lists the RU comparisons of the R-2, R-4 and R-8 CORDIC algorithms with the NR-8 CORDIC algorithm obtained with a synthesis tool (Vivado 2019.2, Xilinx, San Jose, CA, USA). The results demonstrate that the proposed NR-8 CORDIC algorithm has advantages over the conventional algorithms in many aspects, such as the RU of Configurable Logic Block (CLB) Lookup Tables (LUTs), flip-flops (FFs) and DSPs, the clock latency and the power consumption. For example, for a 16-bit precision output, compared with the corresponding parameters of the R-2, R-4 and R-8 CORDIC algorithms in Table 2, the proposed NR-8 algorithm requires only one-half to one-eighth of the RU, and reduces the clock latency to one-half to one-sixth and the power consumption to one-half. We then implemented the algorithm in the Verilog Hardware Description Language (HDL) using a pipelined approach. The place and route tool reports a worst negative slack of 0.302 ns and a worst hold slack of 0.024 ns at a clock frequency of 250 MHz. Compared with conventional implementations, such as the CORDIC IP core (6.0) from Xilinx with 16-bit precision and three iterations, the power of the NR-8 CORDIC algorithm decreases to below 70%, and the proposed algorithm needs only one-third of the flip-flops, although the utilization of the low-power-consumption LUTs increases by 43%.

Performance Comparison of Newly Developed CORDIC Algorithms
The performance comparisons of the newly developed algorithms [12][13][14][15] with the NR-8 CORDIC algorithm are shown in Table 3. The conventional Radix-X CORDIC algorithms, such as the R-2 CORDIC with m-bit precision, require m iterations. Normally, the number of iterations decreases as X in Radix-X increases, while the complexity and timing (critical path) of the algorithms remain almost unchanged. The high-performance R-4 CORDIC algorithm [14] requires m/2 iterations and has O(m) complexity and low latency. The low-latency hybrid (LLH) CORDIC algorithm [16] requires 3m/8 + 1 iterations and has a higher complexity of O(3m). Although the high performance/low-latency (HPLL) CORDIC algorithm [17] has low latency, it is not conducive to pipeline optimization for improving the speed, owing to its inherent iterative structure.
For the NR-8 CORDIC algorithm, when the precision is less than 24 bits, the complexity is less than $O(2^q)$, $q = \lceil (23-5)/3 \rceil = 6$, and $\sigma_i$ has only seven types ($i \in \{0, 1, 2, 3, 4, 5, 6\}$). Thus, Equations (15) and (26) are rewritten as Equation (30). We can conclude from Equation (30) that when $4 \le m \le 6$, the maximum of F(m) is reached if and only if $a_{i_1} a_{i_2} \cdots a_{i_m} = a_0 a_1 a_2 a_3$. Apparently, when the NR-8 CORDIC algorithm requires 23-bit precision, the following approximations are produced: $F(m)|_{4 \le m \le 6} \approx 0$ and $f_a(C_n^m)|_{4 \le m \le 6} \approx 0$ ($f_a(C_n^m)$ from Equation (14)). The most time-consuming path is attributed to the computation of the variables A, B. According to the above analysis and Equation (13), A, B can be simplified as in Equation (32). Equation (32) can be realized with a two-clock latency in the pipeline. Therefore, only a four-clock latency is required for the NR-8 CORDIC algorithm with 23-bit precision, and the complexity is less than O(15). For instance, compared with the 10-clock latency required for $\frac{3}{8} \times 23 + 1 \approx 10$ iterations using the LLH CORDIC algorithm [16], the clock latency of the NR-8 CORDIC algorithm decreases to less than 50%, requiring only the four-clock latency.
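The arithmetic behind this comparison can be reproduced directly. The variable names below are chosen for illustration only, and the figures are those quoted in the text for 23-bit precision.

```python
import math

m = 23                                   # output precision in bits
q = math.ceil((m - 5) / 3)               # number of directly selected factors
llh_iterations = 3 * m / 8 + 1           # LLH CORDIC iteration count, 3m/8 + 1
nr8_latency = 4                          # clock cycles quoted for NR-8 at 23 bits
ratio = nr8_latency / round(llh_iterations)
print(q, llh_iterations, ratio)
```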

Comparisons with Low-Latency Hybrid (LLH) CORDIC
Following the literature [16], a simulation was performed to compute the cosine and sine functions for angles θ ranging from 0 to π/2 in steps of π/500. $\sigma_0$, $\sigma_1$, $z_1$, $z_2$ come from Equations (25)-(28). $x_0$, $y_0$ come from Table 1 and Figure 4. For m-bit precision, the critical descriptive codes for the NR-8 CORDIC algorithm are described in Algorithm 1.
Algorithm 1. The descriptive codes of the NR-8 CORDIC.
$x_0 = 2^m$, $y_0 = 0$; $NR8_{\cos} = R_{\cos}/2^{2m}$; $NR8_{\sin} = R_{\sin}/2^{2m}$;
Two functions, cos θ and sin θ, are produced by using standard functions from MATLAB, and the amplitude errors are described as
$$\delta_{NR8c} = \left| NR8_{\cos} - \cos\theta \right|, \quad \delta_{NR8s} = \left| NR8_{\sin} - \sin\theta \right| \tag{33}$$
Figure 5a shows the values of cosine and sine produced by the NR-8 CORDIC algorithm. Figure 5b,c compare the errors for the cosine and sine functions between the NR-8 CORDIC and the LLH CORDIC [16] with 16-bit precision, respectively. The symbols $\delta_{NR8c}$ and $\delta_{NR8s}$ stand for the absolute differences of cosine and sine between the value computed by the NR-8 CORDIC and the theoretical value produced by MATLAB functions, respectively. Similarly, the symbols $\delta_{LLHc}$ and $\delta_{LLHs}$ denote the absolute differences of cosine and sine between the value computed by the LLH CORDIC [16] and the theoretical value produced by MATLAB functions, respectively. The maximum errors reported in the literature [16] are $MAX(\delta_{LLHc}) = 8.04 \times 10^{-4}$ and $MAX(\delta_{LLHs}) = 5.50 \times 10^{-4}$ for the cosine and sine functions, respectively, which decrease to $MAX(\delta_{NR8c}) = 9.20 \times 10^{-5}$ and $MAX(\delta_{NR8s}) = 9.01 \times 10^{-5}$ in the NR-8 CORDIC algorithm, respectively, thus indicating that the proposed NR-8 algorithm has high precision. Moreover, according to our analyses of the structures of the two algorithms, similar results should be obtained for 24-bit precision.

Comparison of Conventional CORDIC Algorithms
Here, we analyze the computation errors of cos θ and sin θ, as calculated by the R-2, R-4, R-8 and NR-8 CORDIC algorithms. With $x_0 = 2^M - 1$, $y_0 = 0$ (M ≤ 16) and a series of angles θ from 0 to 90° with angle steps of 1°, 0.1°, 0.01° and 0.001°, the values of cos θ and sin θ are computed by the above algorithms on the FPGA and simulated with ModelSim SE (10.6e). The errors are calculated by subtracting from these values those computed by MATLAB using floating-point computation and rounding to M-bit integers. Figure 6 shows the maximum absolute errors (MAE) for the cos θ and sin θ functions, denoted as $\delta_{\cos}(x)$, $\delta_{\sin}(x)$ (steps = 1°, 0.1°, 0.01°, 0.001°), and the corresponding root mean squared errors (RMSE) are shown in Figure 7. The proposed algorithm was simulated with ModelSim SE and MATLAB fixed-point processing, and verified on the FPGA. We obtained the same results, indicating that the algorithm is feasible for engineering implementation.
Specifically, both the MAE and RMSE values for the cosine function calculated using the NR-8 CORDIC algorithm are almost the same as the corresponding values for the sine function, indicating that the outputs of the cosine and sine functions are mostly orthogonal. However, for the other conventional algorithms, the orthogonality is relatively weak. Moreover, we repeated the statistical test more than 1000 times and found that all of the MAEs and RMSEs for the cosine and sine functions fall in the corresponding ranges described above, indicating the significance of our proposed method. Therefore, we conclude that the NR-8 CORDIC algorithm developed in this paper has lower clock latency, less complexity and lower power consumption, giving it higher efficiency than the other algorithms and making it a candidate for real-time systems such as radar digital beamforming.
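The two error measures can be computed as in the sketch below. The values under test are only a stand-in: truncated instead of rounded M-bit cosine values replace the actual FPGA outputs, which are not available here, and the function names are hypothetical.

```python
import math

def mae(outputs, references):
    """Maximum absolute error between two sequences."""
    return max(abs(o - r) for o, r in zip(outputs, references))

def rmse(outputs, references):
    """Root mean squared error between two sequences."""
    n = len(outputs)
    return math.sqrt(sum((o - r) ** 2 for o, r in zip(outputs, references)) / n)

M = 16
amplitude = 2 ** M - 1                       # x_0 = 2^M - 1
angles_deg = list(range(0, 91))              # 0 to 90 degrees in 1-degree steps
reference = [round(amplitude * math.cos(math.radians(d))) for d in angles_deg]
# Stand-in for the hardware output (assumption): truncation instead of rounding.
under_test = [math.floor(amplitude * math.cos(math.radians(d))) for d in angles_deg]

cos_mae = mae(under_test, reference)
cos_rmse = rmse(under_test, reference)
print(cos_mae, cos_rmse)
```

Smaller angle steps only change `angles_deg`; the same two functions apply to the sine outputs.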

Application of the NR-8 CORDIC Algorithm to DBF
The diagram of the DBF mode for the MIMO millimeter wave radar is shown in Figure 8. The interface of the FPGA and ADC is the LVDS bus, and the phase shift Transmission (TX) is implemented by the FPGA, which transmits commands to AWR1243 registers through the SPI bus. The desired steering angles are defined to be β 1 , β 2 , · · · β n for n TX antennas. The received I, Q complex data from ADC for each Reception (RX) channel go through a DSP module that includes the range and Doppler FFT. RX DBF is performed to steer the RX beam towards the same β i . After the corresponding phase delay, the echoes are summed to achieve a beamforming [20,21].

According to Equation (1), if the input vector $[x_0, y_0]^T$ is not a constant vector and the input angle θ is a desired value, the vector $[x_0, y_0]^T$ will be phase-shifted by θ. Thus, we can realize the beam delay of the desired steering angles with the NR-8 CORDIC algorithm. In Figure 4, if $x_0$, $y_0$ are replaced by the I, Q complex data, respectively, and the angle θ is replaced by $\beta_i$, the output values $x_d$, $y_d$ will be obtained as the corresponding beam delay vector.
... the beam delay of the desired steering angles according to the NR-8 CORDIC algorithm. In Figure 4, if x_0 and y_0 are replaced by the I and Q components of the complex data, respectively, and the angle θ is replaced by β_i, the output values x_d and y_d are obtained as the corresponding beam delay vector.

In this section, the NR-8 CORDIC algorithm is applied to the phase delay for DBF on a 77 GHz MIMO millimeter wave radar system powered by TI AWR1243 chips. The experimental device of DBF for the MIMO millimeter wave radar is shown in Figure 9. The related parameters are as follows: the sampling rate f_s = 4 MHz, the bandwidth bw = 1120 MHz and the number of sampling points N = 16, where the bandwidth refers to the bandwidth of the radio frequency. However, when the baseband signal is produced by the radar chip with digital down converters, the frequency is reduced from 1120 MHz to less than 1 MHz. Therefore, we can use f_s = 4 MHz for sampling.

Figure 9. The experimental device of DBF for the MIMO millimeter wave radar.
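The beam delay described above amounts to rotating each I/Q sample by the steering phase. The following minimal Python sketch is a floating-point reference only: it calls `math.cos` and `math.sin` directly and does not implement the NR-8 datapath of Figure 4. It assumes the standard counterclockwise rotation x_d = x_0 cos β − y_0 sin β, y_d = x_0 sin β + y_0 cos β. The function name `phase_shift`, the 0.5 MHz test tone and the 1.5° test angle are illustrative assumptions; f_s = 4 MHz and N = 16 are taken from the text.

```python
import cmath
import math


def phase_shift(i_samples, q_samples, beta_deg):
    """Rotate every (I, Q) sample by beta_deg degrees (floating-point reference)."""
    c = math.cos(math.radians(beta_deg))
    s = math.sin(math.radians(beta_deg))
    # Assumed standard rotation: x_d = I*cos - Q*sin, y_d = I*sin + Q*cos
    x_d = [i * c - q * s for i, q in zip(i_samples, q_samples)]
    y_d = [i * s + q * c for i, q in zip(i_samples, q_samples)]
    return x_d, y_d


FS = 4e6        # sampling rate from the text
N = 16          # number of sampling points from the text
F_TONE = 0.5e6  # hypothetical baseband tone (below 1 MHz), for illustration only
BETA_DEG = 1.5  # illustrative phase shift angle

# Synthetic unit-amplitude complex echo standing in for measured radar data.
i_in = [math.cos(2 * math.pi * F_TONE * n / FS) for n in range(N)]
q_in = [math.sin(2 * math.pi * F_TONE * n / FS) for n in range(N)]

x_d, y_d = phase_shift(i_in, q_in, BETA_DEG)

# Per-sample phase advance in degrees; each entry should be close to BETA_DEG.
shifts = [
    math.degrees(cmath.phase(complex(xd, yd) / complex(i, q)))
    for xd, yd, i, q in zip(x_d, y_d, i_in, q_in)
]
```

Because the rotation is a multiplication by a unit-magnitude complex number, the amplitude of each sample is preserved and only its phase changes, which is the property a fixed-point implementation would be checked against.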
One echo (I/Q) is taken for phase shift angle θ from 1.5° to 15° with a 1.5° step, producing 10 phases. Let x_0 = I and y_0 = Q in Figure 4. The phase shift effects are shown in Figure 10, where the red and green lines represent the I and Q signals, respectively. The 1st and 10th symbols in the illustration represent beam delays of 1.5° (beam_1) and 15° (beam_10), respectively. An FFT is applied to each beam, and the phase errors are listed in Table 4. The variables p, Δp, δ_Δp and δ_max_p are described as follows:

where n = 1, 2, ..., 10 and the functions FFT(beam_n) and FFT(beam_1) represent the FFT of the nth delay beam and the first delay beam, respectively. Then, the phase differences corresponding to the peak values of the spectral lines are obtained. In Table 4, the small bound δ_Δp < 0.064° indicates that the NR-8 CORDIC algorithm can be applied in real-time systems such as radar digital beamforming with a high precision of the phase shift.

Overall, our algorithm is mainly based on noniterative methods, whereas the majority of the conventional algorithms are iterative, and thus our algorithm achieves low latency and high efficiency for high-precision output. For the conventional CORDIC algorithm, higher precision raises the computational complexity, which means that key modules must change; in particular, the number of adders required for the same task, such as the task in Figure 4, increases. For example, the number of adders used to obtain the values of A_n and B_n from Equation (13) grows with the computational complexity. For the R-2 CORDIC algorithm, m iterators are required to achieve an m-bit precision output, each of which needs an adder and a subtractor; thus the computational complexity is O(m).
Meanwhile, the data we obtained are based on statistical tests over several trials, and the results are found to be reliable with a relatively low error, indicating that the NR-8 CORDIC algorithm can be applied in these fields with low latency and high efficiency.
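To make the O(m) cost of the iterative R-2 scheme concrete, the following minimal Python sketch shows the textbook radix-2 CORDIC rotation in floating point. It is not the NR-8 method and not the authors' hardware design. The function name `cordic_rotate_r2` and the choice m = 16 are illustrative assumptions. Each pass through the loop corresponds to one iterator stage (one add and one subtract on the x/y pair, plus the angle update), so m stages are needed for an m-bit result.

```python
import math


def cordic_rotate_r2(x, y, theta, m=16):
    """Rotate (x, y) by theta radians using m radix-2 CORDIC iterations.

    Valid for |theta| within the CORDIC convergence range (about 99.9 degrees).
    """
    # Constant scale factor that compensates the gain of the m micro-rotations.
    gain = 1.0
    for i in range(m):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = theta  # residual angle still to be rotated
    for i in range(m):
        d = 1.0 if z >= 0.0 else -1.0  # rotation direction for this stage
        shift = 2.0 ** (-i)            # a right shift by i bits in hardware
        # One subtract and one add per stage (old x and y are used on the right).
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.atan(shift)      # angle table lookup in hardware
    return x * gain, y * gain


theta = math.radians(15.0)
cos_est, sin_est = cordic_rotate_r2(1.0, 0.0, theta, m=16)
```

With m = 16 the residual angle after the last stage is bounded by atan(2^-15), so the estimates agree with `math.cos` and `math.sin` to roughly 3e-5; halving the stage count makes the result correspondingly coarser, which is the latency and precision trade-off that a noniterative scheme avoids.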

Conclusions
The proposed NR-8 CORDIC algorithm has low latency, low complexity and low RU in comparison with the conventional R-X CORDIC and some newly developed CORDIC algorithms. In particular, when the output precision m is less than 24 bits, this algorithm has great advantages, e.g., the clock latency can be reduced from 10 to 4 with much lower complexity. This algorithm adopts a narrow input angle range to obtain a high calculation speed, and it uses a uniform output formula to efficiently compute the sine and cosine functions or the phase shift in a noniterative fashion. Therefore, this algorithm is of great value in time-critical applications, such as DBF, robot controllers, FFT transformation, signal modulation and demodulation, recently developed rapid convolutional neural networks (CNNs) [22,23] and so on. We anticipate that the algorithm will provide higher precision, lower complexity and lower clock latency after further optimization in the future.