A Noniterative Radix-8 CORDIC Algorithm with Low Latency and High Efficiency

Tang, Wenming; Xu, Feng

doi:10.3390/electronics9091521


Wenming Tang * and Feng Xu

The Key Laboratory for Information Science of Electromagnetic Waves, School of Information Science and Technology, Fudan University, Shanghai 200433, China

* Author to whom correspondence should be addressed.

Electronics 2020, 9(9), 1521; https://doi.org/10.3390/electronics9091521

Submission received: 6 August 2020 / Revised: 30 August 2020 / Accepted: 1 September 2020 / Published: 17 September 2020

(This article belongs to the Section Circuit and Signal Processing)


Abstract

An efficient, noniterative Radix-8 (NR-8) coordinate rotation digital computer (CORDIC) algorithm is proposed for low-latency, high-efficiency computation of the sine, cosine and phase-shift functions. The function values are computed using only an angle in the narrow range [0, π/12] rather than the wide range [0, π/2]. The algorithm is expressed by a closed-form formula that replaces the traditional iterative process with a complex multiplication. Simulation and FPGA experiments show that the NR-8 CORDIC algorithm operates well: the 16-bit precision output has an absolute error of only 0.012% when computing the sine or cosine function with a step of 0.001°. Compared with the best conventional CORDIC algorithm, the clock latency of this algorithm decreases to less than 50%, and it needs only half of the logic resources and consumes half of the power. The algorithm also has advantages over other recently improved CORDIC algorithms, requiring less than half of their clock latency even for a 23-bit precision output. Therefore, this algorithm has potential applications in real-time systems such as radar digital beamforming.

Keywords:

CORDIC; sine and cosine; phase shift; FPGA; digital beamforming

1. Introduction

As one of the most common transcendental functions, the sine or cosine function has been widely used in real-time digital signal processing systems, such as radar, ultrasound, robotics and communication [1,2,3,4,5,6,7]. The accuracy and efficiency with which these functions are computed are two key criteria for evaluating the performance of such systems. For this purpose, many methods for calculating the sine or cosine function have been developed, such as the lookup table, Taylor series and polynomial approximation [8,9,10]. However, these methods suffer from either high complexity or high latency, so a more efficient method is needed for accurate computation in real-time systems. The coordinate rotation digital computer (CORDIC) algorithm [11] provides accurate and efficient computation by iteratively decomposing the calculation into a series of addition, subtraction and shift operations, which has led to its wide use in digital circuits for computing trigonometric, exponential and related functions [12]. However, because the algorithm is iterative, its accuracy depends strongly on the number of iterations: more iterations mean more clock latency and thus lower computational efficiency.

To further enhance the efficiency, more progress has been made by improving the architecture of the CORDIC algorithm to achieve a more efficient algorithm, such as the Scaling-Free (SF) CORDIC, Radix-4 CORDIC, Radix-8 CORDIC and low-latency Hybrid (LLH) CORDIC algorithms [13,14,15,16,17]. Some of the improved CORDIC algorithms have been widely used in radar digital beamforming (DBF) systems. For instance, Lee et al. developed a CORDIC-based algorithm to be used in Multi-Gbps MIMO systems, which is implemented by a Virtex-6 FPGA using 49,752 slices, and the algorithm needs 260 ns (250 MHz, 65 clock periods) of latency due to the many iterations required for computations [4]. Similarly, Jun et al. described look-ahead, pipelined CORDIC-based adaptive filters and their application to adaptive beamforming [5], and the pipeline level m depends on the m-bit precision. However, the CORDIC algorithm often requires many iterations to converge, which has become a major bottleneck for real-time applications.

In this work, a new noniterative Radix-8 (NR-8) CORDIC algorithm is proposed for low-latency implementation on FPGAs. In the process of the development of an NR-8 CORDIC algorithm, three steps were taken: (1) The NR-8 CORDIC algorithm was derived from the conventional Radix-2 CORDIC one. (2) The input angle θ was set to a narrow range by simultaneously transforming the input variables

x_{0}

and

y_{0}

. (3) A formula was deduced and optimized. These steps narrow the range of the iteration angle and yield a noniterative formulation of the CORDIC algorithm; in addition, the algorithm can be accelerated by the multiplier modules readily available in FPGAs [18]. As a result, the algorithm reduces the 7–17 clock cycles of latency of the conventional CORDIC algorithm (16-bit precision) to three clock cycles, needs fewer logic resources and consumes less power. Compared with the LLH algorithm [16], it has clear advantages in terms of time and resources. The rest of this paper is organized as follows. Section 2 presents the derivation from the conventional CORDIC algorithm. In Section 3, the proposed NR-8 CORDIC is introduced. Section 4 presents its FPGA implementation and analysis. Section 5 introduces the application of the NR-8 CORDIC in radar DBF. Finally, conclusions are drawn from the results obtained in the preceding sections.

2. Conventional CORDIC Rotator Algorithm

The CORDIC algorithm usually operates in rotation mode or vector mode [11,12], following linear, circular or hyperbolic coordinate trajectories. In this paper, we focus on the rotation mode using circular trajectory.

The rotation mode is depicted in Figure 1, where

θ

is the angle between the

{\overset{⇀}{V}}_{0} (x_{0}, y_{0})

and

{\overset{⇀}{V}}_{d} (x_{d}, y_{d})

vectors. As the vector

{\overset{⇀}{V}}_{0}

rotates counterclockwise to the vector

{\overset{⇀}{V}}_{d}

, the coordinate,

(x_{d}, y_{d})

, can be described as in Equation (1):

[\begin{array}{l} x_{d} \\ y_{d} \end{array}] = [\begin{matrix} \cos θ & - \sin θ \\ \sin θ & \cos θ \end{matrix}] [\begin{array}{l} x_{0} \\ y_{0} \end{array}] = \cos θ [\begin{matrix} 1 & - \tan θ \\ \tan θ & 1 \end{matrix}] [\begin{array}{l} x_{0} \\ y_{0} \end{array}]

(1)

If the initial vector

(x_{0}, y_{0})

is set to

x_{0} = 1, y_{0} = 0

, Equation (1) can be used to compute

\cos θ

and

\sin θ

.

θ

is decomposed into a series of micro angles, each of which corresponds to one step rotation as shown in Figure 1 and described as in Equation (2):

θ = \sum_{i = 0}^{n} θ_{i}, θ_{i} = \tan^{- 1} (σ_{i} R^{- (i + 1)})

(2)

where n denotes the number of rotations, R denotes the radix,

R = 2^{l}, l \in N

,

θ_{i}

denotes micro angles and

σ_{i}

is the selection factor, which takes integer values within the interval

σ_{i} \in [- R / 2, R / 2]

.

Substituting Equation (2) into Equation (1) yields Equation (3):

[\begin{array}{l} x_{d} \\ y_{d} \end{array}] = \prod_{i = 0}^{n} [\begin{matrix} \cos θ_{i} & - \sin θ_{i} \\ \sin θ_{i} & \cos θ_{i} \end{matrix}] [\begin{array}{l} x_{0} \\ y_{0} \end{array}] = [\prod_{i = 0}^{n} \cos θ_{i}] \times \prod_{i = 0}^{n} [\begin{matrix} 1 & - \tan θ_{i} \\ \tan θ_{i} & 1 \end{matrix}] [\begin{array}{l} x_{0} \\ y_{0} \end{array}]

(3)

Equation (3) describes the computation process illustrated in Figure 1.

Apparently, the recursion formula of the ith rotation can be written as Equation (4):

[\begin{array}{l} x_{i + 1} \\ y_{i + 1} \end{array}] = \cos θ_{i} [\begin{matrix} 1 & - \tan θ_{i} \\ \tan θ_{i} & 1 \end{matrix}] [\begin{array}{l} x_{i} \\ y_{i} \end{array}]

(4)

If R equals 2, it is the conventional Radix-2 (R-2) CORDIC iterative algorithm. If R equals 4, it becomes a conventional Radix-4 (R-4) CORDIC iterative algorithm. If R equals 8, it becomes a conventional Radix-8 (R-8) CORDIC iterative algorithm, and thus Equation (3) can be written as Equation (5):

[\begin{array}{l} x_{d} \\ y_{d} \end{array}] = K \times \prod_{i = 0}^{n} [\begin{matrix} 1 & - σ_{i} R^{- (i + 1)} \\ σ_{i} R^{- (i + 1)} & 1 \end{matrix}] [\begin{array}{l} x_{0} \\ y_{0} \end{array}]

(5)

where the scale factor K is defined as Equation (6):

K = \prod_{i = 0}^{n} \cos θ_{i} = \prod_{i = 0}^{n} {(1 + σ_{i}^{2} R^{- 2 (i + 1)})}^{- 1 / 2}

(6)

The conventional CORDIC algorithm is implemented in an iterative fashion, in which the input angle is approached step by step through the series of micro angles. After the ith iteration, the residual error angle is defined as in Equation (7):

z_{i + 1} = θ - \sum_{j = 0}^{i} θ_{j}, i = 0, 1, 2, \dots n

(7)

where

z_{0} = θ

. For the next iteration, an optimal factor

σ_{i + 1}

is selected so that the residual error angle becomes minimal, which can be calculated by using Equation (8):

σ_{i} = argmin | z_{i} - σ_{i} R^{- (i + 1)} | = argmin | ω_{i} - σ_{i} |, s . t . σ_{i} \in [- R / 2, R / 2]

(8)

where

ω_{i} = z_{i} R^{i + 1}

. It can be directly solved using a rounding operation via Equation (9):

σ_{i} = \begin{cases} R / 2, & ω_{i} \geq (R - 1) / 2 \\ - R / 2, & ω_{i} \leq - (R - 1) / 2 \\ U (ω_{i}), & \text{otherwise} \end{cases}

(9)

where the function

U (ω_{i})

rounds each element of

ω_{i}

to the nearest integer.
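The iteration defined by Equations (4) and (7)–(9) can be sketched in floating point as follows. This is only a minimal illustration, not the hardware design: the function name `radix_cordic`, the use of Python and the default of n = 6 iterations are assumptions made here, and the input is assumed to be a small angle (roughly [0, π/12]) so that the clamped selection factor can still follow the residual.

```python
import math

def radix_cordic(theta, R=8, n=6):
    """Rotate (1, 0) by theta with n radix-R micro-rotations (float sketch)."""
    x, y, z = 1.0, 0.0, theta
    K = 1.0
    for i in range(n):
        w = z * R ** (i + 1)                          # omega_i = z_i * R^(i+1)
        sigma = max(-R // 2, min(R // 2, round(w)))   # Equation (9): round, then clamp
        a = sigma / R ** (i + 1)                      # tan(theta_i)
        x, y = x - a * y, y + a * x                   # unscaled micro-rotation
        K /= math.sqrt(1.0 + a * a)                   # accumulate cos(theta_i)
        z -= math.atan(a)                             # residual angle, Equation (7)
    return K * x, K * y

c, s = radix_cordic(0.2)
print(c, s, math.cos(0.2), math.sin(0.2))
```

The scale factor is accumulated inside the loop here purely for readability; a hardware design would normally precompute or compensate it separately.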

As the radix R increases, the number of iterations decreases, but the range of the selection factor

σ_{i}

increases, thus increasing the complexity of the conventional CORDIC algorithm. Hence, we found that R-8 CORDIC (R = 8) is a good balance between complexity and efficiency. However, the current iterative R-8 algorithm still needs several iterations. For example, six iterations are necessary for a 16-bit depth digital signal processing application. To address this problem, this paper proposes a noniterative form of the R-8 CORDIC (NR-8 CORDIC) algorithm.

3. Noniterative Radix-8 CORDIC Algorithm

We propose a noniterative computation structure of the R-8 CORDIC algorithm by restricting the input angle to a narrow interval, using an explicit formula of solution, simplifying the scale factor and transforming the input variables

x_{0}

and

y_{0}

to accelerate the convergence of the algorithm.

3.1. Narrow Input Angle θ Range

Conventionally, one only needs to consider the input angle

θ \in [0, π / 2]

of the first quadrant, from which the rest of the quadrants can be easily computed by invoking the symmetry property of the sine or cosine function. Thus, the rest of the quadrants can be mapped to the first quadrant by simple transformation. In this article, we first narrow the input angle interval into an angle range of

[0, π / 12]

.

The first quadrant of the coordinate system is equally divided into six regions, marked from A to F, each of which is mapped onto the range

[0, π / 12]

. Then the angle

θ \in [0, π / 2]

can be folded to the range of

φ \in [0, π / 12]

. The CORDIC output mappings between

θ

and

φ

are given in Table 1. Accordingly, the input variables

x_{0}, y_{0}

need to be changed to

x_{0}^{'}, y_{0}^{'}

, respectively. Therefore, we can readily compute the output values in the angle range of

φ

according to the CORDIC algorithm, on the base of which, and as shown in Table 1, the output values

x_{d}, y_{d}

in the whole range of

θ

are achieved with ease (

\sqrt{3}

is calculated by using the Taylor series, and a matrix is defined as

R T (θ) = [\begin{matrix} \cos θ & - \sin θ \\ \sin θ & \cos θ \end{matrix}]

).
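Table 1 is not reproduced here, so the following floating-point sketch shows only one plausible way to fold θ ∈ [0, π/2] into φ ∈ [0, π/12]. The helper names `fold_angle` and `unfold` are hypothetical, and the mapping used (nearest multiple of π/6 plus a signed offset) is an assumption that may differ from the mapping actually listed in Table 1.

```python
import math

STEP = math.pi / 6  # spacing of the reference angles 0, pi/6, pi/3, pi/2

def fold_angle(theta):
    """Split theta in [0, pi/2] into a reference angle and a small offset."""
    k = min(3, max(0, round(theta / STEP)))   # index of the nearest reference angle
    base = k * STEP
    offset = theta - base                     # lies in [-pi/12, pi/12]
    sign = 1 if offset >= 0 else -1
    return base, abs(offset), sign            # phi = |offset| is in [0, pi/12]

def unfold(base, cos_phi, sin_phi, sign):
    """Recover cos(theta), sin(theta) from the results for the folded angle."""
    s = sign * sin_phi
    cb, sb = math.cos(base), math.sin(base)   # constants: 0, 1/2, sqrt(3)/2, 1
    return cb * cos_phi - sb * s, sb * cos_phi + cb * s

base, phi, sign = fold_angle(1.0)
print(unfold(base, math.cos(phi), math.sin(phi), sign))
```

In this sketch only the constants 1/2 and √3/2 are needed for the reference angles, which matches the appearance of √3 in the text.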

3.2. Explicit Formula of Convergence

Equation (5) can be computed naturally by using iterations. The scale factor K is temporarily ignored for the sake of simplicity. The iterative formula of Equation (5) is given as follows. Let us define

a_{i} = σ_{i} 8^{- (i + 1)}

(10)

If i = 1, then

{\begin{cases} x_{1} = x_{0} - a_{0} y_{0} \\ y_{1} = y_{0} + a_{0} x_{0} \end{cases} .

If i = 2, then

{\begin{cases} x_{2} = (1 - a_{0} a_{1}) x_{0} - (a_{0} + a_{1}) y_{0} \\ y_{2} = (1 - a_{0} a_{1}) y_{0} + (a_{0} + a_{1}) x_{0} \end{cases} .

If i = 3, then

{\begin{cases} x_{3} = (1 - a_{0} a_{1} - a_{0} a_{2} - a_{1} a_{2}) x_{0} - (a_{0} + a_{1} + a_{2} - a_{0} a_{1} a_{2}) y_{0} \\ y_{3} = (1 - a_{0} a_{1} - a_{0} a_{2} - a_{1} a_{2}) y_{0} + (a_{0} + a_{1} + a_{2} - a_{0} a_{1} a_{2}) x_{0} \end{cases} .

If i = 4, then

{\begin{cases} x_{4} = (1 - a_{0} a_{1} - a_{0} a_{2} - a_{0} a_{3} - a_{1} a_{2} - a_{1} a_{3} - a_{2} a_{3} + a_{0} a_{1} a_{2} a_{3}) x_{0} - (a_{0} + a_{1} + a_{2} + a_{3} - a_{0} a_{1} a_{2} - a_{0} a_{1} a_{3} - a_{0} a_{2} a_{3} - a_{1} a_{2} a_{3}) y_{0} \\ y_{4} = (1 - a_{0} a_{1} - a_{0} a_{2} - a_{0} a_{3} - a_{1} a_{2} - a_{1} a_{3} - a_{2} a_{3} + a_{0} a_{1} a_{2} a_{3}) y_{0} + (a_{0} + a_{1} + a_{2} + a_{3} - a_{0} a_{1} a_{2} - a_{0} a_{1} a_{3} - a_{0} a_{2} a_{3} - a_{1} a_{2} a_{3}) x_{0} \end{cases}

(11)

A deductive formula can be summarized as

{\begin{cases} x_{n} = A_{n} \times x_{0} - B_{n} \times y_{0} \\ y_{n} = A_{n} \times y_{0} + B_{n} \times x_{0} \end{cases}

(12)

where

A_{n}, B_{n}

are respectively defined as

\begin{array}{l} A_{n} = 1 - \sum_{i = 1}^{k} {(- 1)}^{i + 1} f_{a} (C_{n}^{2 i}), \quad n \in N \\ B_{n} = \begin{cases} \sum_{i = 1}^{k} {(- 1)}^{i + 1} f_{a} (C_{n}^{2 i - 1}), & n = 2 k, k \in N \\ \sum_{i = 0}^{k} {(- 1)}^{i} f_{a} (C_{n}^{2 i + 1}), & n = 2 k + 1, k \in N \end{cases} \end{array}

(13)

where the function of

f_{a} (C_{n}^{m})

,

0 \leq m \leq n

is defined as the sum of the products of m different elements selected from the set

\{a_{0}, a_{1}, \dots a_{n - 1}\}

, as described in Equation (14):

f_{a} (C_{n}^{m}) = \sum_{(i_{1}, i_{2}, \dots i_{m}) \in I (m)} a_{i_{1}} a_{i_{2}} \dots a_{i_{m}}

(14)

where

{I (m)}

denotes all possible combinatorial sets of m unique indices selected from

{0, 1, \dots n - 1}

. Apparently, there are a total of

C_{n}^{m}

sets in

{I (m)}

. Substituting Equation (10) into the products in Equation (14), we have the following inequality:

F (m) = | a_{i_{1}} a_{i_{2}} \dots a_{i_{m}} | = | \frac{σ_{i_{1}} σ_{i_{2}} \dots σ_{i_{m}}}{8^{S (m)}} | \leq \frac{4^{m}}{8^{S (m)}}

(15)

where

S (m) = (i_{1} + 1) + (i_{2} + 1) + \dots + (i_{m} + 1)

(16)

Note that the equality in Equation (15) holds if and only if

| σ_{i_{1}} | = | σ_{i_{2}} | = \dots = | σ_{i_{m}} | = 4

. It can be observed that Equation (16) has a minimum of

S (m) = \frac{(1 + m) m}{2}

.
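Equations (12)–(14) can be checked numerically: the coefficients are the real and imaginary parts of the product of the factors (1 + a_k i). The sketch below assumes that f_a(C_n^m) is the sum over all products of m distinct elements, which is what the expansions in Equation (11) require; the function names and the sample values of a_k are illustrative only.

```python
from itertools import combinations
from math import prod

def f_a(a, m):
    """Sum of the products of every choice of m distinct elements of a."""
    return sum(prod(c) for c in combinations(a, m))

def coefficients(a):
    """A_n and B_n of Equation (12) as alternating sums of f_a terms."""
    n = len(a)
    A = sum((-1) ** (m // 2) * f_a(a, m) for m in range(0, n + 1, 2))
    B = sum((-1) ** (m // 2) * f_a(a, m) for m in range(1, n + 1, 2))
    return A, B

a = [2 / 8, -5 / 128, 3 / 1024, 7 / 8192]    # arbitrary sample values
A, B = coefficients(a)
p = prod(complex(1, ak) for ak in a)          # (1 + a_0 i)(1 + a_1 i)...
print(A, B, p)
```

For four elements this reproduces the i = 4 case of Equation (11): the real part contains the pairwise and fourfold products, and the imaginary part the single and threefold products.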

Figure 2 shows that F(m) quickly vanishes as m or S(m) increases, from which we have two observations as follows:

Observation (1): If $m \geq 3$ , when $S (m) \geq 7$ , $F (m) < 0.000031$ , which can be ignored.
Observation (2): If $m = 1$ or $m = 2$ , when $S (m) \geq 6$ , $F (m) < 0.000061$ , which can be ignored.

Note that

S (m) \geq 8

is necessary in order to achieve a high accuracy. Thus, we can ignore the terms that satisfy any of the above two conditions in

f_{a} (C_{n}^{m})

of Equation (13) to greatly simplify computation. For example, the terms of

a_{0} a_{1} a_{2},

a_{0} a_{2} a_{3},

\dots, a_{1} a_{2} a_{3},

\dots,

a_{0} a_{1} a_{2} a_{3},

\dots, a_{1} a_{2} a_{3} a_{4},

a_{0} a_{1} a_{2}

…

a_{2} a_{3} a_{4},

and

a_{0} a_{3}, a_{0} a_{4}, \dots, a_{1} a_{2}, \dots, a_{2} a_{3}, \dots, a_{4}, a_{5},

… can be ignored. Thus, the variables of

A_{n}, B_{n}

in Equation (13) can be simplified as

[\begin{matrix} A_{n} \\ B_{n} \end{matrix}] = [\begin{matrix} 1 - a_{0} \sum_{i = 1}^{3} a_{i} - a_{1} a_{2} \\ \sum_{i = 0}^{4} a_{i} \end{matrix}] or [\begin{matrix} 1 - a_{0} \sum_{i = 1}^{n} a_{i} - a_{1} \sum_{i = 2}^{n} a_{i} \\ \sum_{i = 0}^{4} a_{i} \end{matrix}]

(17)

3.3. Scale Factor

Now we consider the scale factor K of Equation (6). It can be written as a Taylor series, i.e.,

K = \prod_{i = 0}^{n} (1 - \frac{1}{2} a_{i}^{2} + \frac{3}{8} a_{i}^{4} - \frac{5}{16} a_{i}^{6} + \dots) \approx C - \frac{1}{2} a_{1}^{2}

(18)

where

C = 1 - \frac{1}{2} a_{0}^{2} + \frac{3}{8} a_{0}^{4}

, and apparently we can calculate all the values of C and

\frac{1}{2} a_{1}^{2}

by enumerating all values of

a_{0}

and

a_{1}^{2}

, respectively. In order to speed up the parallel computation of the NR-8 CORDIC algorithm, we can compensate the input variables

x_{0}, y_{0}

instead of

A_{n}, B_{n}

by the scale factor K. More details about this process are described in the following section.
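The quality of the truncated series in Equation (18) can be gauged with a short floating-point comparison. This sketch assumes the corrected form of Equation (6), K = ∏ (1 + a_i²)^(−1/2), and uses hypothetical function names and sample values of a_i.

```python
import math

def k_exact(a):
    """Scale factor as the product of cos(atan(a_i)) over all a_i."""
    return math.prod(1.0 / math.sqrt(1.0 + ai * ai) for ai in a)

def k_approx(a):
    """Two-term approximation K ~ C - a_1^2 / 2 from Equation (18)."""
    a0, a1 = a[0], a[1]
    C = 1.0 - a0 ** 2 / 2 + 3 * a0 ** 4 / 8
    return C - a1 ** 2 / 2

for a in ([0.25, 0.0, 0.0], [0.25, 0.0625, 0.005], [0.125, 0.0625, 0.0]):
    print(a, k_exact(a), k_approx(a))
```

For these sample values the two results agree to within about 1e-4; the dominant neglected terms are the sixth-order term in a_0 and the cross term in a_0²a_1².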

3.4. Transformation of the Inputs $x_{0}$ and $y_{0}$

According to Table 1, the input angle

θ \in [0, π / 2]

can be folded to the range of

[0, π / 12]

. Accordingly, the input variable

x_{0}, y_{0}

should also be transformed to

x_{0}^{'}, y_{0}^{'}

. The transformation rules are as follows:

If $θ \in [0, π / 12)$ , then $[\begin{array}{l} x_{0}^{'} \\ y_{0}^{'} \end{array}] = [\begin{array}{l} x_{0} \\ y_{0} \end{array}]$ .
If $θ \in [\frac{π}{12}, \frac{π}{6})$ , then $[\begin{array}{l} x_{0}^{'} \\ y_{0}^{'} \end{array}] = \frac{1}{2} [\begin{array}{l} \sqrt{3} x_{0} - y_{0} \\ - x_{0} - \sqrt{3} y_{0} \end{array}]$ .
If $θ \in [\frac{π}{6}, \frac{π}{4})$ , then $[\begin{array}{l} x_{0}^{'} \\ y_{0}^{'} \end{array}] = \frac{1}{2} [\begin{array}{l} \sqrt{3} x_{0} - y_{0} \\ x_{0} - \sqrt{3} y_{0} \end{array}]$ .
If $θ \in [\frac{π}{4}, \frac{π}{3})$ , then $[\begin{array}{l} x_{0}^{'} \\ y_{0}^{'} \end{array}] = \frac{1}{2} [\begin{array}{l} x_{0} - \sqrt{3} y_{0} \\ \sqrt{3} x_{0} - y_{0} \end{array}]$ .
If $θ \in [\frac{π}{3}, \frac{5 π}{12})$ , then $[\begin{array}{l} x_{0}^{'} \\ y_{0}^{'} \end{array}] = \frac{1}{2} [\begin{array}{l} x_{0} - \sqrt{3} y_{0} \\ \sqrt{3} x_{0} + y_{0} \end{array}]$ .
If $θ \in [\frac{5 π}{12}, \frac{π}{2}]$ , then $[\begin{array}{l} x_{0}^{'} \\ y_{0}^{'} \end{array}] = [\begin{array}{l} - x_{0} \\ - y_{0} \end{array}]$ .

\sqrt{3}

can be calculated by using the Taylor series, which is

\sqrt{3} \approx 2 - \frac{1}{4} - \frac{1}{64} - \frac{1}{512} \approx 1.732

. The input variables

x_{0}^{'}, y_{0}^{'}

are multiplied by the scale factor K to compensate for the gain introduced by the rotations, which produces two new variables,

x_{0}^{″}, y_{0}^{″}

, described as,

[\begin{matrix} x_{0}^{″} \\ y_{0}^{″} \end{matrix}] = K \times [\begin{matrix} x_{0}^{'} \\ y_{0}^{'} \end{matrix}]

(19)
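The value of √3 quoted above consists of the first four terms of the series 2(1 − 1/4)^(1/2) = 2 − 1/4 − 1/64 − 1/512 − …, and every term is a power of two. The sketch below checks the constant and shows how it could be applied to an integer with shifts only; `mul_sqrt3` is a hypothetical helper, since the source does not describe the circuit at this level.

```python
import math

def mul_sqrt3(x):
    """Approximate x * sqrt(3) for an integer x using shifts and subtractions."""
    return (x << 1) - (x >> 2) - (x >> 6) - (x >> 9)

approx = 2 - 1 / 4 - 1 / 64 - 1 / 512
print(approx, math.sqrt(3), mul_sqrt3(1 << 16) / (1 << 16))
```

The truncated constant differs from √3 by less than 4 × 10⁻⁴.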

Let

A = A_{n}, B = B_{n}

and the explicit formula of

x_{n}

and

y_{n}

in Equation (12) can be rewritten as

\begin{cases} x_{n} = A \times x_{0}^{″} - B \times y_{0}^{″} \\ y_{n} = A \times y_{0}^{″} + B \times x_{0}^{″} \end{cases} \Leftrightarrow (x_{n} + y_{n} i) = (A + B i) \times (x_{0}^{″} + y_{0}^{″} i)

(20)

As a result, the final outputs

x_{d}, y_{d}

in Equation (1) can be expressed by

x_{n}, y_{n}

in Equation (20), respectively, and Equation (20) can be easily implemented by using complex multiplication [19].

4. Implementation and Analysis

In this section, the architecture and performance of the NR-8 CORDIC algorithm are discussed with a simulation and an FPGA implementation.

4.1. Noniterative Implementation

After narrowing the input angle range from

θ

to

φ

, we set the initial angle of

z_{0}

in Equation (7) to be

z_{0} = φ \in [0, π / 12]

. Thus, we have

ω_{0} = z_{0} R = 8 φ

and

8 φ \in [0, 2 π / 3]

, so

8 φ < 2.5

. Following from Equations (7)–(9),

σ_{0}

is obtained as

σ_{0} = U (ω_{0}) = U (8 φ)

, where

U (8 φ)

rounds

8 φ

to the nearest integer, so that

U (8 φ) \in \{0, 1, 2\}

. Subsequently, the residual

z_{1}

can be described as,

z_{1} = z_{0} - \tan^{- 1} (σ_{0} / 8) = \begin{cases} z_{0}, & σ_{0} = 0 \\ z_{0} - \tan^{- 1} (1 / 8), & σ_{0} = 1 \\ z_{0} - \tan^{- 1} (1 / 4), & σ_{0} = 2 \end{cases}

(21)

Apparently,

\tan^{- 1} (•)

in Equation (21) has only three values, which can be implemented by using a small look-up table or registers.

For the residual

z_{i}, i \geq 2

,

z_{i}

can be described as,

z_{i} = z_{i - 1} - \tan^{- 1} (σ_{i - 1} 8^{- i}) \approx z_{i - 1} - σ_{i - 1} 8^{- i}

(22)

where an approximation of

\tan^{- 1} (x) \approx x, | x | \leq 1 / 16

is taken. The error bound of such an approximation can be easily estimated to be 8.119 × 10⁻⁵.
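The quoted bound can be reproduced directly: the error of the approximation tan⁻¹(x) ≈ x grows with |x|, so it is largest at the edge of the interval, x = 1/16. The snippet is a plain floating-point check and is not part of the original design.

```python
import math

x = 1 / 16
bound = x - math.atan(x)    # largest error of atan(x) ~ x for |x| <= 1/16
print(f"{bound:.3e}")
```

The leading term of the error is x³/3, which explains why the bound is of the order of 8 × 10⁻⁵.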

By considering m-bit fixed-point processing, where all variables are stored in an FPGA as m-bit integers, we use

Z_{i}, i = 0, 1, 2, \dots

to denote the fixed-point integer of

z_{i}, i = 0, 1, 2, \dots

(i.e.,

Z_{i} = ⌊ z_{i} 2^{m} ⌋

) for the sake of simplicity, where

⌊ • ⌋

denotes rounding down.

From Equation (21), it can be found that the residual error angle is

| z_{1} | < 1 / 8

(i.e.,

| Z_{1} | = | z_{1} 2^{m} | < 2^{m - 3}

) so the bit width of

Z_{1}

is m−2. To denote the bit width of an n-bit fixed-point variable X, we use the form of

X [n - 1 : 0]

. For example,

Z_{1}

is expressed as

Z_{1} [m - 3 : 0]

.

Now, let us expand

Z_{1}

to the following form,

Z_{1} = Z_{1} [m - 3 : m - 7] 2^{m - 7} + Z_{1} [m - 8 : m - 10] 2^{m - 10} + \dots + Z_{1} [m - 2 - 3 \times q : 0]

(23)

where

q = ⌈ \frac{m - 5}{3} ⌉

. Here,

⌈ • ⌉

denotes rounding up.

Accordingly,

z_{1}

can be rewritten as,

\begin{array}{ll} z_{1} & = Z_{1} [m - 3 : m - 7] 2^{- 7} + Z_{1} [m - 8 : m - 10] 2^{- 10} + \dots + Z_{1} [m - 2 - 3 \times q : 0] 2^{- m} \\ & \approx \tan^{- 1} (Z_{1} [m - 3 : m - 7] 2^{- 7}) + \tan^{- 1} (Z_{1} [m - 8 : m - 10] 2^{- 10}) + \dots + \tan^{- 1} (Z_{1} [m - 2 - 3 \times q : 0] 2^{- m}) \end{array}

(24)

According to Equation (2),

θ_{i} = \tan^{- 1} (σ_{i} 8^{- (i + 1)})

, we found that the variables

σ_{1}

,

σ_{2}

, …

σ_{q}

and

a_{1}

,

a_{2}

, …

a_{q}

(in Equation (10),

a_{i} = σ_{i} 8^{- (i + 1)}

) should be selected, which can satisfy both the computation of Equation (5) and automatically fulfill the equation of

θ = \sum_{i = 0}^{q} θ_{i}

. Thus, it is not necessary to follow the iterative formula of the solutions in Equations (8) and (9). Instead, from the proposed expansion in Equations (23) and (24), we can directly give the variables as,

\begin{cases} σ_{1} = Z_{1} [m - 3 : m - 7] \\ σ_{2} = Z_{1} [m - 8 : m - 10] \\ \vdots \\ σ_{q} = Z_{1} [m - 2 - 3 \times q : 0] \end{cases}

(25)

a_{i} = \begin{cases} σ_{0} 2^{- 3}, & i = 0 \\ σ_{i} 2^{- (4 + 3 i)}, & \text{otherwise} \end{cases}

(26)

As a result, the residual

z_{1}

is rewritten as,

\begin{array}{ll} z_{1} & = \tan^{- 1} \frac{σ_{1}}{2^{7}} + \tan^{- 1} \frac{σ_{2}}{2^{10}} + \dots + \tan^{- 1} \frac{σ_{q}}{2^{m}} \\ & \approx \frac{σ_{1}}{2^{7}} + \frac{σ_{2}}{2^{10}} + \dots + \frac{σ_{q}}{2^{m}} \\ & = \sum_{j = 1}^{q} a_{j} \end{array}

(27)

Likewise, the variables

z_{2}, z_{3}, \dots

can be expressed as,

z_{i} = \sum_{j = i}^{q} a_{j}

(28)

Note that the original iterative formula of approximation of the input angle is now replaced by the new formula in Equations (25)–(28), which becomes directly computable. The computation process of variables σ_i is shown in Figure 3.
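The bit-field extraction of Equation (25) can be sketched with integer shifts and masks. The example assumes m = 16 (so q = 4 and the last field ends exactly at bit 0), treats Z_1 as a two's-complement integer whose top field σ_1 carries the sign, and uses the hypothetical name `sigma_fields`.

```python
import math

def sigma_fields(Z1, m=16):
    """Return [sigma_1, ..., sigma_q] read from the bits of Z1 (m = 16 assumed)."""
    q = math.ceil((m - 5) / 3)
    sigmas = [Z1 >> (m - 7)]                  # sigma_1: signed field Z1[m-3 : m-7]
    for j in range(2, q + 1):
        shift = m - 4 - 3 * j                 # low bit index of the j-th field
        sigmas.append((Z1 >> shift) & 0b111)  # unsigned 3-bit field
    return sigmas

print(sigma_fields(4131), sigma_fields(-3761))
```

Because Python's right shift on negative integers is arithmetic, the first field keeps the sign of Z_1 while the remaining fields are non-negative, so the weighted sum of the fields gives back Z_1.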

In summary, the computation takes the following steps:

Compute $σ_{0}$ via rounding $8 φ$ .
Compute $Z_{1}$ via the constant values stored in registers and one subtractor in Equation (21).
Compute $σ_{i}, i = 1, \dots, q$ by directly fetching bits from $Z_{1}$ as in Equation (25).
Compute A and B as in Equation (20) using the $σ_{i}$ obtained in the third step, all of which are small integers. For example, $σ_{0} \in [0, 2]$ is a 2-bit unsigned integer and $σ_{1} \in [- 8, 8]$ is a 5-bit signed integer, while $σ_{i} \in [0, 7], i = 2, 3, \dots q$ are unsigned integers of at most 3 bits.

Thus, Equation (17) can be rewritten as

[\begin{matrix} A \\ B \end{matrix}] = [\begin{matrix} A_{n} \\ B_{n} \end{matrix}] = \frac{1}{2^{m}} [\begin{matrix} 2^{m} - \frac{σ_{0}}{2^{3}} \times z_{1} - \frac{σ_{1}}{2^{7}} \times z_{2} \\ 2^{(m - 3)} \times σ_{0} + z_{1} \end{matrix}]

(29)

Since all

σ_{i}

are small integers, their multiplication computations in Equation (29) can be easily implemented by using shifting and additions.

According to the above derivation, the digital circuit structure of the proposed NR-8 CORDIC algorithm is shown in Figure 4. The contents of the green dashed box can be implemented with a Digital Signal Processing (DSP) module. No iterative process is required, and thus the NR-8 CORDIC algorithm takes only three clock cycles for computation:

Cycle 1: Fold the angle

θ \in [0, π / 2]

to the range of

φ \in [0, π / 12]

, and transform the input variables from

x_{0}, y_{0}

to

x_{0}^{'}, y_{0}^{'}

, according to Table 1, Section 3.4 and Figure 4. Compute

σ_{0}

,

Z_{1}

via rounding

8 φ

and using the equation of

Z_{i} = ⌊ z_{i} \times 2^{m} ⌋

and a three-entry register in Equation (21), respectively.

Cycle 2: Directly fetch the values of

σ_{i}

and

z_{i}

,

(i = 1, 2, \dots q)

from

Z_{1}

as in Equations (25)–(28), respectively, which are substituted to Equation (29) for computing A, B, and meanwhile compensate the amplitude of the variables

x_{0}^{'}, y_{0}^{'}

through the equations

x_{0}^{″} = K \times x_{0}^{'}, y_{0}^{″} = K \times y_{0}^{'}

.

Cycle 3: Compute the final results

[\begin{array}{l} x_{d} \\ y_{d} \end{array}]

according to Equation (20) and Table 1 by using the multiplier module [18].

4.2. Resource Utilization and Performance Analysis

Here, two comparisons are presented for analyzing resource utilization (RU) and performance as follows.

4.2.1. RU Comparison of Conventional CORDIC Algorithms

The NR-8 CORDIC algorithm and several conventional algorithms (R-2, R-4 and R-8 [11,12,13,15]) were implemented on a Xilinx FPGA (xcku040-ffva1156) to evaluate the critical RU, clock latency and power consumption. Note that in the experiments, 16-, 8- and 6-level pipelines are used for the R-2, R-4 and R-8 CORDIC cores, respectively, to achieve the same accuracy [15].

Table 2 lists RU comparisons of the R-2, R-4 and R-8 CORDIC algorithms with the NR-8 CORDIC algorithm, obtained with a synthesis tool (Vivado 2019.2, Xilinx, San Jose, CA, USA). The results demonstrate that the proposed NR-8 CORDIC algorithm has advantages over the conventional algorithms in many aspects, such as the RUs of Configurable Logic Block (CLB) Lookup Tables (LUTs), flip-flops (FFs) and DSPs, as well as clock latency and power consumption. For example, for a 16-bit precision output, compared with the corresponding parameters of the R-2, R-4 and R-8 CORDIC algorithms in Table 2, the proposed NR-8 algorithm requires only one-half to one-eighth of the RU, reduces the clock latency to between one-half and one-sixth, and halves the power consumption. The algorithm was implemented in Verilog Hardware Description Language (HDL) using a pipelined approach. At a clock frequency of 250 MHz, the place and route tool reports a worst negative slack of 0.302 ns and a worst hold slack of 0.024 ns. Compared with a conventional implementation, such as the CORDIC IP core (6.0) from Xilinx with 16-bit precision and three iterations, the power of the NR-8 CORDIC algorithm decreases to below 70%, and the proposed algorithm needs only one-third of the flip-flops, although the utilization of the low-power LUTs increases by 43%.

4.2.2. Performance Comparison of Newly Developed CORDIC Algorithms

The comparisons of performance of the newly developed algorithms [12,13,14,15] with the NR-8 CORDIC algorithm are shown in Table 3. The conventional Radix-X CORDIC algorithms, such as the R-2 CORDIC with m-bit precision, require m iterations. Normally, the number of iterations decreases as the number of X in Radix-X increases, and the complexity and timing (critical path) of the algorithms are almost unchanged. The high-performance R-4 CORDIC algorithm [14] requires m/2 iterations, O(m) complexity and low latencies. The low-latency hybrid (LLH) CORDIC algorithm [16] requires 3m/8 + 1 iterations and more complexity O(3m). Although the high performance/low-latency (HPLL) CORDIC algorithm [17] has low latency, this algorithm is not conducive to pipeline optimization to improve the speed, owing to the inherent iterative structure.

For the NR-8 CORDIC algorithm, when the precision is less than 24 bits, complexity is less than

O (2^{q})

,

q = ⌈ \frac{23 - 5}{3} ⌉ = 6

, and

σ_{i}

has only seven types (

i \in {0, 1, 2, 3, 4, 5, 6}

). Thus, Equations (15) and (26) are rewritten as

\begin{cases} F (m) = | a_{i_{1}} a_{i_{2}} \dots a_{i_{m}} |, \quad (i_{1}, i_{2}, \dots i_{m}) \in \{0, 1, 2, 3, 4, 5, 6\}, \quad 0 \leq m \leq 6 \\ a_{i} = \begin{cases} σ_{0} 2^{- 3}, & i = 0 \\ σ_{i} 2^{- (4 + 3 i)}, & i = 1, 2, 3, 4, 5, 6 \end{cases} \end{cases}

(30)

We can make a conclusion from Equation (30) that when

4 \leq m \leq 6

, if and only if

a_{i_{1}} a_{i_{2}} \dots a_{i_{m}}

=

a_{0} a_{1} a_{2} a_{3}

, the maximum of

F (m)

is given as

\max_{4 \leq m \leq 6} F (m) = | a_{0} a_{1} a_{2} a_{3} | = | \frac{σ_{0}}{2^{3}} \times \frac{σ_{1}}{2^{7}} \times \frac{σ_{2}}{2^{10}} \times \frac{σ_{3}}{2^{13}} | \leq \frac{2 \times 8 \times 7 \times 7}{2^{33}} \leq \frac{1}{2^{23}}

(31)
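The arithmetic behind Equation (31) can be verified exactly with rational numbers. The magnitudes used below are the largest values of σ_0 to σ_3 stated earlier in the text, and the shift amounts follow Equation (26); the variable names are illustrative.

```python
from fractions import Fraction

sigma_max = (2, 8, 7, 7)     # largest |sigma_0| .. |sigma_3|
shifts = (3, 7, 10, 13)      # a_i = sigma_i / 2**shift_i, Equation (26)

bound = Fraction(1)
for s, k in zip(sigma_max, shifts):
    bound *= Fraction(s, 2 ** k)

print(bound, bound <= Fraction(1, 2 ** 23))
```

Since 2 × 8 × 7 × 7 = 784 is smaller than 2¹⁰ = 1024, the product stays below 2⁻²³, which is why the fourfold term can be dropped at 23-bit precision.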

Apparently, when the NR-8 CORDIC algorithm requires 23-bit precision, the following approximations are produced:

{F (m) |}_{4 \leq m \leq 6} \approx 0

and

{f_{a} (C_{n}^{m}) |}_{4 \leq m \leq 6} \approx 0

(

f_{a} (C_{n}^{m})

from Equation (14)). The most time-consuming path is attributed to the computation of variables A, B. According to the above analysis and Equation (13), A, B can be simplified as

[\begin{matrix} A \\ B \end{matrix}] = [\begin{matrix} A_{n} \\ B_{n} \end{matrix}] = \frac{1}{2^{m}} [\begin{matrix} 2^{m} - \frac{σ_{0} z_{1}}{2^{3}} - \frac{σ_{1} z_{2}}{2^{7}} - \frac{σ_{2} z_{3}}{2^{10}} - \frac{σ_{3} z_{4}}{2^{13}} \\ 2^{(m - 3)} \times σ_{0} + z_{1} - \frac{σ_{0} σ_{1} z_{2}}{2^{10}} - \frac{σ_{0} σ_{2} z_{3}}{2^{13}} \end{matrix}]

(32)

Equation (32) can be realized by two-clock latency in the pipeline. Therefore, only four-clock latency is required for the NR-8 CORDIC algorithm with 23-bit precision, and the complexity is less than

O (15)

. For instance, compared with the 10-clock latency required for

\frac{3}{8} \times 23 + 1 \approx 10

iterations using the LLH CORDIC algorithm [16], the clock latency of the NR-8 CORDIC algorithm significantly decreases to less than 50%, which needs only the four-clock latency.

4.3. Error Analysis

4.3.1. Comparisons with Low-Latency Hybrid (LLH) CORDIC

Following the literature [16], a simulation was performed to compute the cosine and sine functions for angles

θ

, ranging from 0 to

π / 2

in the step of

π / 500

.

σ_{0}, σ_{1}, z_{1}, z_{2}

come from Equations (25)–(28).

x_{0}^{'}, y_{0}^{'}

come from Table 1 and Figure 4. For m-bit precision, the critical descriptive codes for the NR-8 CORDIC algorithm are described in Algorithm 1.

Algorithm 1. The descriptive codes of the NR-8 CORDIC.

x_{0} = 2^{m}, y_{0} = 0;

\begin{array}{l} A = 2^{m} - \mathrm{fix} ((σ_{0} \times z_{1}) / 2^{3}) - \mathrm{fix} ((σ_{1} \times z_{2}) / 2^{7}); \\ B = 2^{(m - 3)} \times σ_{0} + z_{1}; \end{array}

K = 2^{15} - 2^{8} \times σ_{0}^{2} + 3 \times σ_{0}^{4} - σ_{1}^{2};

\begin{array}{l} x_{0}^{″} = \mathrm{fix} ((x_{0}^{'} \times K) / 2^{15}); \\ y_{0}^{″} = \mathrm{fix} ((y_{0}^{'} \times K) / 2^{15}); \end{array}

\begin{array}{l} x_{d} = A \times x_{0}^{″} - B \times y_{0}^{″}; \\ y_{d} = \mathrm{sign\_sel} (A \times y_{0}^{″} + B \times x_{0}^{″}); \end{array}

\begin{array}{l} \mathrm{NR8cos} = \mathrm{Rcos} / 2^{2 m}; \\ \mathrm{NR8sin} = \mathrm{Rsin} / 2^{2 m}; \end{array}
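Algorithm 1 can be turned into a short integer-arithmetic sketch for the first region only, φ ∈ [0, π/12], where the inputs need no transformation. Several simplifications are assumptions made here and not taken from the source: m is fixed to 16, `math.atan` stands in for the three-entry register of Equation (21), the Table 1 mapping and the `sign_sel` step are omitted, z_1 and z_2 are used as their fixed-point integers Z_1 and Z_2, and the helper names `fix_div` and `nr8_core` are hypothetical.

```python
import math

M = 16  # word length; the bit-field layout below assumes M = 16

def fix_div(a, b):
    """Integer division rounding toward zero, like MATLAB's fix(a / b)."""
    q = abs(a) // b
    return q if a >= 0 else -q

def nr8_core(phi):
    """Approximate (cos(phi), sin(phi)) for phi in [0, pi/12]."""
    sigma0 = int(8 * phi + 0.5)                  # round 8*phi to 0, 1 or 2
    z1 = phi - math.atan(sigma0 / 8)             # Equation (21)
    Z1 = math.floor(z1 * (1 << M))               # fixed-point residual
    sigma1 = Z1 >> (M - 7)                       # signed top field of Z1
    Z2 = Z1 & ((1 << (M - 7)) - 1)               # remaining low bits, i.e. z_2
    A = (1 << M) - fix_div(sigma0 * Z1, 8) - fix_div(sigma1 * Z2, 128)
    B = (sigma0 << (M - 3)) + Z1
    K = (1 << 15) - 256 * sigma0 ** 2 + 3 * sigma0 ** 4 - sigma1 ** 2
    x0, y0 = 1 << M, 0                           # first region: no input transform
    x0c = fix_div(x0 * K, 1 << 15)               # inputs compensated by K
    y0c = fix_div(y0 * K, 1 << 15)
    xd = A * x0c - B * y0c                       # complex product, Equation (20)
    yd = A * y0c + B * x0c
    return xd / 2 ** (2 * M), yd / 2 ** (2 * M)

print(nr8_core(math.pi / 12), math.cos(math.pi / 12), math.sin(math.pi / 12))
```

The sketch contains no loop over micro-rotations: after σ_0 and Z_1 are known, A, B and K follow from a few small multiplications, and a single complex product gives the result.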

The reference values $\cos θ$ and $\sin θ$ are produced by the standard MATLAB functions, and the amplitude errors are defined as

\left\{ \begin{matrix} δ_{NR8c} = | \mathrm{NR8cos} - \cos θ | \\ δ_{NR8s} = | \mathrm{NR8sin} - \sin θ | \end{matrix} \right.

(33)

Figure 5a shows the values of cosine and sine produced by the NR-8 CORDIC algorithm. Figure 5b,c compare the errors of the cosine and sine functions, respectively, between the NR-8 CORDIC and the LLH CORDIC [16] with 16-bit precision. The symbols $δ_{NR8c}$ and $δ_{NR8s}$ stand for the absolute differences of cosine and sine, respectively, between the value computed by the NR-8 CORDIC and the reference value produced by the MATLAB functions. Similarly, $δ_{LLHc}$ and $δ_{LLHs}$ denote the corresponding absolute differences for the LLH CORDIC [16]. The maximum errors in the literature [16] are $MAX(δ_{LLHc})$ = 8.04 × 10⁻⁴ and $MAX(δ_{LLHs})$ = 5.50 × 10⁻⁴ for the cosine and sine functions, respectively. In the NR-8 CORDIC algorithm they decrease to $MAX(δ_{NR8c})$ = 9.20 × 10⁻⁵ and $MAX(δ_{NR8s})$ = 9.01 × 10⁻⁵, i.e., they are about 8.7 and 6.1 times smaller, indicating that the proposed NR-8 algorithm has high precision. Moreover, according to our analyses of the structures of the two algorithms, similar results should be obtained for 24-bit precision.

4.3.2. Comparison of Conventional CORDIC Algorithms

Here, we analyze the errors of $\cos θ$ and $\sin θ$ as calculated by the R-2, R-4, R-8 and NR-8 CORDIC algorithms. With $x_{0} = 2^{M} - 1, y_{0} = 0$ ($M \leq 16$) and a series of angles $θ$ from 0 to 90° in steps of 1, 0.1, 0.01 and 0.001°, the values of cos $θ$ and sin $θ$ are computed by the algorithms above on the FPGA and simulated with ModelSim SE (10.6e). The errors are the differences between these values and those computed by MATLAB using floating-point computation and rounding to M-bit integers. Figure 6 shows the maximum absolute errors (MAEs) for the cos $θ$ and sin $θ$ functions, denoted $δ_{\cos} (x), δ_{\sin} (x)$ (steps = 1, 0.1, 0.01, 0.001°), and the corresponding root mean squared errors (RMSEs) are shown in Figure 7. The proposed algorithm was simulated with ModelSim SE and MATLAB fixed-point processing and verified on the FPGA. All of them gave the same results, indicating that the algorithm is feasible for engineering implementation.
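The error metrics described above can be sketched as follows. This is a minimal illustration, not the authors' test bench: the function names are hypothetical, and `truncating_cos` is a made-up device under test that merely truncates instead of rounding, standing in for a CORDIC output.

```python
import math

def error_stats(approx_cos, m_bits=16, step_deg=1.0):
    """Return (MAE, RMSE) in LSBs of an M-bit integer output over 0..90 degrees."""
    scale = 2**m_bits - 1                # x0 = 2^M - 1, as in the text
    n = round(90 / step_deg)
    errors = []
    for i in range(n + 1):
        theta = math.radians(i * step_deg)
        # Reference: floating-point cosine rounded to an M-bit integer.
        reference = round(scale * math.cos(theta))
        errors.append(approx_cos(theta, scale) - reference)
    mae = max(abs(e) for e in errors)    # maximum absolute error
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

def truncating_cos(theta, scale):
    """Hypothetical device under test: truncates instead of rounding."""
    return math.floor(scale * math.cos(theta))

mae, rmse = error_stats(truncating_cos, m_bits=16, step_deg=1.0)
print(mae, rmse)
```

Replacing `truncating_cos` with the integer output of an actual CORDIC model, and repeating the sweep for each step size and bit width M, would give curves of the kind plotted in Figures 6 and 7.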

From Figure 6 and Figure 7, it is found that the proposed NR-8 CORDIC algorithm is the most sensitive to the bit width $M$: as $M$ decreases, both the MAE and RMSE values decrease sharply. Even when $M$ equals 15 or 16, the MAE and RMSE values of the proposed NR-8 CORDIC algorithm are almost as small as those of the other algorithms, and when $M$ is smaller than 15 they are much smaller than those of the other algorithms. As a consequence, the overall MAE and RMSE values of the NR-8 CORDIC algorithm are relatively small in comparison with the conventional algorithms. In addition, the angle step has less influence on the MAE and RMSE values of the proposed NR-8 CORDIC algorithm than on those of the conventional algorithms. Furthermore, both the MAE and RMSE values for the cosine function calculated using the NR-8 CORDIC algorithm are almost the same as the corresponding values for the sine function, indicating that the cosine and sine outputs are mostly orthogonal, whereas for the conventional algorithms the orthogonality is relatively weak. We also repeated the statistical test more than 1000 times and found that all of the MAEs and RMSEs for the cosine and sine functions fell within the corresponding ranges described above, which supports the reliability of these results. Therefore, we conclude that the NR-8 CORDIC algorithm developed in this paper has lower clock latency, lower complexity and lower power consumption than the other algorithms, which gives it higher efficiency and makes it a candidate for real-time systems such as radar digital beamforming.

5. Application of the NR-8 CORDIC Algorithm to DBF

The diagram of the DBF mode for the MIMO millimeter wave radar is shown in Figure 8. The FPGA and the ADC are connected by the LVDS bus, and the transmission (TX) phase shift is implemented by the FPGA, which sends commands to the AWR1243 registers through the SPI bus. The desired steering angles are defined as $β_{1}, β_{2}, \dots, β_{n}$ for n TX antennas. The received I, Q complex data from the ADC for each reception (RX) channel go through a DSP module that includes the range and Doppler FFT. RX DBF is performed to steer the RX beam towards the same $β_{i}$. After the corresponding phase delay, the echoes are summed to form the beam [20,21].

According to Equation (1), if the input vector $[\begin{matrix} x_{0} \\ y_{0} \end{matrix}]$ is not a constant vector and the input angle $θ$ is a desired value, the vector is phase-shifted by $θ$. Thus, the beam delay for the desired steering angles can be realized with the NR-8 CORDIC algorithm. In Figure 4, if $x_{0}, y_{0}$ are replaced by the I, Q complex data, respectively, and the angle $θ$ is replaced by $β_{i}$, the output values $x_{d}, y_{d}$ form the corresponding beam delay vector.
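The phase shift of an I/Q sample can be illustrated with a plain rotation. The sketch below is an assumption-laden stand-in: the function name is hypothetical, and floating-point `math.cos` and `math.sin` replace the fixed-point NR-8 outputs, so it shows only the rotation itself, not the proposed hardware.

```python
import math

def phase_shift(i_sample, q_sample, beta_deg):
    """Rotate (I, Q) by beta: x_d = I*cos(b) - Q*sin(b), y_d = Q*cos(b) + I*sin(b)."""
    beta = math.radians(beta_deg)
    c, s = math.cos(beta), math.sin(beta)   # stand-ins for the NR-8 cosine and sine
    return i_sample * c - q_sample * s, q_sample * c + i_sample * s

# Rotate one sample (I = -168, Q = -64) by a steering angle of 15 degrees.
xd, yd = phase_shift(-168, -64, 15.0)
shift = math.degrees(math.atan2(yd, xd) - math.atan2(-64, -168))
print(shift, math.hypot(xd, yd))
```

The rotation changes the phase of the sample by the steering angle and leaves its magnitude unchanged, which is the property the beam delay relies on.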

In this section, the NR-8 CORDIC algorithm is applied to the phase delay for DBF on a 77 GHz MIMO millimeter wave radar system based on TI AWR1243 chips. The experimental device of DBF for the MIMO millimeter wave radar is shown in Figure 9. The related parameters are as follows: the sampling rate f_s = 4 MHz, the bandwidth bw = 1120 MHz and the number of sampling points N = 16, where the bandwidth refers to the bandwidth of the radio frequency. When the baseband signal is produced by the radar chip with digital down converters, the frequency is reduced from 1120 MHz to less than 1 MHz, so f_s = 4 MHz is sufficient for sampling. The sample points are $\{I, Q\}$ = {−168 − 64i, −60 − 224i, 224 − 88i, 148 + 84i, 52 + 196i, −164 + 60i, −36 − 132i, 276 − 4i, 128 + 164i, −84 + 284i, −188 + 56i, 12 − 168i, 128 + 40i, −64 + 264i, −236 + 104i, −132 − 204i}. A corner reflector is placed in front of the radar at a distance of 3 m and an azimuth of 20°.

One echo (I/Q) is taken and phase-shifted by angles $θ$ from 1.5° to 15° in 1.5° steps, producing 10 phases. Let $x_{0} = I$ and $y_{0} = Q$ in Figure 4. The phase shift effects are shown in Figure 10, where the red and green lines represent the I and Q signals, respectively. The 1st and 10th symbols in the illustration represent beam delays of 1.5° (beam_1) and 15° (beam_10), respectively. An FFT is applied to each beam, and the phase errors are listed in Table 4. The variables $p$, ${\tilde{Δ}}_{p}$, $Δ_{p}$ and $δ_{Δ p}$ are described as follows:

$p = \mathrm{ANGLE}(\mathrm{FFT}(\mathrm{beam\_n} \text{ or } \{I, Q\}))|_{\mathrm{PEAK}}$ , the phase angle of the nth delay beam or of $\{I, Q\}$ .
${\tilde{Δ}}_{p} = \mathrm{ANGLE}(\mathrm{FFT}(\mathrm{beam\_n}))|_{\mathrm{PEAK}} - \mathrm{ANGLE}(\mathrm{FFT}(\{I, Q\}))|_{\mathrm{PEAK}}$ , the phase difference between the nth delay beam and the original echo $\{I, Q\}$ .
$Δ_{p}$ = the desired steering angle.
$δ_{Δ p} = {\tilde{Δ}}_{p} - Δ_{p}$ = the error of the phase shift.

where n = 1, 2, ⋯, 10, and $\mathrm{FFT}(\mathrm{beam\_n})$ and $\mathrm{FFT}(\{I, Q\})$ represent the FFT of the nth delay beam and of the original echo, respectively. The phase differences are taken at the peak values of the spectral lines. In Table 4, the phase shift error satisfies $| δ_{Δ p} | \leq 0.064^{°}$, which indicates that the NR-8 CORDIC algorithm can be applied in real-time systems such as radar digital beamforming with high phase shift precision.
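The measurement of the phase shift at the spectral peak can be sketched as below, using the 16 sample points listed earlier. The details are assumptions: a plain 16-point DFT without windowing or zero padding is used (the FFT settings are not given in the text), the function names are hypothetical, and the rotation is an ideal floating-point one rather than the NR-8 implementation.

```python
import cmath
import math

# The 16 I/Q sample points listed in the text.
IQ = [complex(-168, -64), complex(-60, -224), complex(224, -88), complex(148, 84),
      complex(52, 196), complex(-164, 60), complex(-36, -132), complex(276, -4),
      complex(128, 164), complex(-84, 284), complex(-188, 56), complex(12, -168),
      complex(128, 40), complex(-64, 264), complex(-236, 104), complex(-132, -204)]

def dft(samples):
    """Direct discrete Fourier transform (small N, so no FFT library is needed)."""
    n = len(samples)
    return [sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(samples))
            for k in range(n)]

def measured_shift_deg(echo, beam):
    """Phase of the beam relative to the echo at the echo's spectral peak."""
    spectrum_echo, spectrum_beam = dft(echo), dft(beam)
    peak = max(range(len(echo)), key=lambda k: abs(spectrum_echo[k]))
    return math.degrees(cmath.phase(spectrum_beam[peak] * spectrum_echo[peak].conjugate()))

errors = []
for n in range(1, 11):                       # beam_1 ... beam_10
    desired = 1.5 * n                        # desired steering angle in degrees
    beam = [s * cmath.exp(1j * math.radians(desired)) for s in IQ]
    errors.append(measured_shift_deg(IQ, beam) - desired)
print(max(abs(e) for e in errors))
```

With an ideal rotator the measured shift matches the desired angle to floating-point precision, so this sketch illustrates only the measurement procedure and does not reproduce the errors listed in Table 4.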

Overall, our algorithm is noniterative, whereas the majority of the conventional algorithms are iterative, and thus our algorithm achieves low latency and high efficiency for high-precision output. For a conventional CORDIC algorithm, the computational complexity grows with the required precision because the critical module changes: the number of adders needed for the same task, such as the one in Figure 4, increases. For example, the number of adders used to obtain the values of A_n and B_n from Equation (13) increases with the precision. For the R-2 CORDIC algorithm, m iterations are required to achieve an m-bit precision output, each of which needs an adder and a subtractor; thus the computational complexity can be expressed as O(m). Meanwhile, the data reported here are based on statistical tests over several trials, and the results are found to be reliable with a relatively low error, indicating that the NR-8 CORDIC algorithm can be applied in these fields with low latency and high efficiency.
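For contrast, the iterative structure of a conventional radix-2 CORDIC rotator can be sketched as follows. This is the textbook rotation-mode algorithm written in floating point for readability, not the fixed-point R-2 design compared in Table 2, and the function name is hypothetical.

```python
import math

def cordic_r2(theta, m=16):
    """Textbook radix-2 CORDIC in rotation mode; returns (cos, sin) after m iterations."""
    angles = [math.atan(2.0**-i) for i in range(m)]
    # Accumulated gain of the m micro-rotations, compensated up front.
    gain = 1.0
    for i in range(m):
        gain *= math.sqrt(1.0 + 2.0**(-2 * i))
    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(m):                      # one micro-rotation per iteration
        d = 1.0 if z >= 0 else -1.0         # rotation direction from the residual angle
        x, y = x - d * y * 2.0**-i, y + d * x * 2.0**-i
        z -= d * angles[i]
    return x, y

c, s = cordic_r2(math.pi / 6, m=16)
print(c, s)
```

Each pass through the loop performs one addition and one subtraction on the vector, so the work grows linearly with the number of iterations m, which is the O(m) behavior described above.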

6. Conclusions

The proposed NR-8 CORDIC algorithm has low latency, low complexity and low resource utilization (RU) in comparison with the conventional R-X CORDIC and some newly developed CORDIC algorithms. In particular, when the required precision is below 24 bits, this algorithm has clear advantages, e.g., the clock latency can be reduced from 10 to 4 with much lower complexity. The algorithm adopts a narrow input angle range to obtain a high calculation speed, and it uses a uniform output formula to compute the sine and cosine functions or the phase shift efficiently in a noniterative fashion. Therefore, this algorithm is of great value in time-critical applications, such as DBF, robot controllers, FFT transformation, signal modulation and demodulation, and recently developed rapid convolutional neural networks (CNNs) [22,23]. We anticipate that the algorithm will provide higher precision, lower complexity and lower clock latency after further optimization in the future.

Author Contributions

W.T. and F.X. developed the theory, performed the experiment, and drafted the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

This work was supported by The Key Laboratory for Information Science of Electromagnetic Waves, School of Information Science and Technology, Fudan University.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Fang, L.; Xie, Y.; Li, B.; Chen, H. Generation scheme of chirp scaling phase functions based on floating-point CORDIC processor. J. Eng. 2019, 2019, 7436–7439.
2. Vyas, P.; Vachhani, L. CORDIC-Based Azimuth Calculation and Obstacle Tracing via Optimal Sensor Placement on a Mobile Robot. IEEE/ASME Trans. Mechatron. 2016, 21, 2317–2329.
3. Wong, C.C.; Liu, C.C. FPGA realisation of inverse kinematics for biped robot based on CORDIC. Electron. Lett. 2013, 49, 332–334.
4. Lee, H.; Oh, K.; Cho, M.; Jang, Y.; Kim, J. Efficient Low-Latency Implementation of CORDIC-Based Sorted QR Decomposition for Multi-Gbps MIMO Systems. IEEE Trans. Circuits Syst. II Express Briefs 2018, 65, 1375–1379.
5. Jun, M.; Parhi, K.K.; Deprettere, E.F. Annihilation-Reordering Look-Ahead Pipelined CORDIC-Based RLS Adaptive Filters and Their Application to Adaptive Beamforming. IEEE Trans. Signal Process. 2000, 48, 2414–2431.
6. Nikolov, S.I.; Jensen, J.A.; Tomov, B.G. Fast parametric beamformer for synthetic aperture imaging. IEEE Trans. Ultrason. Ferroelectr. Freq. Control 2008, 55, 1755–1767.
7. Pilato, L.; Fanucci, L.; Saponara, S. Real-Time and High-Accuracy Arctangent Computation Using CORDIC and Fast Magnitude Estimation. Electronics 2017, 6, 22.
8. Lakshmi, B.; Dhar, A.S. CORDIC architectures: A survey. VLSI Des. 2010, 2010, 794891.
9. Ylostalo, J. Function approximation using polynomials. IEEE Signal Process. Mag. 2006, 23, 99–102.
10. Ercegovac, M.D.; Lang, T.; Muller, J.M.; Tisserand, A. Reciprocation, square root, inverse square root, and some elementary functions using small multipliers. IEEE Trans. Comput. 2000, 49, 628–637.
11. Volder, J.E. The CORDIC Trigonometric Computing Technique. IEEE Trans. Electr. Comput. 1959, 8, 330–334.
12. Meher, P.K.; Valls, J.; Juang, T.B.; Sridharan, K.; Maharatna, K. 50 Years of CORDIC: Algorithms, Architectures, and Applications. IEEE Trans. Circuits Syst. I Regul. Pap. 2009, 56, 1893–1907.
13. Maharatna, K.; Banerjee, S.; Grass, E.; Krstic, M.; Troya, A. Modified virtually scaling-free adaptive CORDIC rotator algorithm and architecture. IEEE Trans. Circuits Syst. Video Technol. 2005, 15, 1463–1474.
14. Antelo, E.; Villalba, J.; Bruguera, J.D.; Zapata, E.L. High performance rotation architectures based on the radix-4 CORDIC algorithm. IEEE Trans. Comput. 1997, 46, 855–870.
15. Rudagi, J.; Subbaraman, S. Comparative Analysis of Radix-2, Radix-4, Radix-8 CORDIC Processors. In Proceedings of the 2017 International Conference on Inventive Computing and Informatics (ICICI), Coimbatore, India, 23–24 November 2017; pp. 378–382.
16. Shukla, R.; Ray, K. Low latency hybrid CORDIC algorithm. IEEE Trans. Comput. 2014, 63, 3066–3078.
17. Wu, C.S.; Wu, A.Y.; Lin, C.H. A High-Performance/Low-Latency Vector Rotational CORDIC Architecture Based on Extended Elementary Angle Set and Trellis-Based Searching Schemes. IEEE Trans. Circuits Syst. II Analog Digit. Signal Process. 2003, 50, 589–601.
18. DSP48 Macro v3.0, Xilinx Inc., USA. 2015. Available online: https://www.xilinx.com/support/documentation/ip_documentation/xbip_dsp48_macro/v3_0/pg148-dsp48-macro.pdf (accessed on 12 August 2020).
19. UltraScale Architecture Configurable Logic Block User Guide. Xilinx Inc., USA. 2017. Available online: https://www.xilinx.com/support/documentation/user_guides/ug574-ultrascale-clb.pdf (accessed on 12 August 2020).
20. Fischman, M.A.; Le, C. Digital beamforming developments for the joint NASA/Air Force Space Based Radar. In Proceedings of the IGARSS 2004, 2004 IEEE International Geoscience and Remote Sensing Symposium, Anchorage, AK, USA, 20–24 September 2004; pp. 687–690.
21. Lialios, D.I.; Ntetsikas, N.; Paschaloudis, K.D.; Zekios, C.L.; Georgakopoulos, S.V.; Kyriacou, G.A. Design of True Time Delay Millimeter Wave Beamformers for 5G Multibeam Phased Arrays. Electronics 2020, 9, 1331.
22. Cao, Y.X.; Xiao, W.A.; Jia, J. A Cordic-based Acceleration Method on FPGA for CNN Normalization layer. In Proceedings of the 2020 International Conference on High Performance Big Data and Intelligent Systems (HPBD&IS), Shenzhen, China, 23 May 2020.
23. Parmar, Y.; Sridharan, K. A Resource-Efficient Multiplierless Systolic Array Architecture for Convolutions in Deep Networks. IEEE Trans. Circuits Syst. II Express Briefs 2020, 67, 370–374.

Figure 1. The CORDIC vector rotation model.

Figure 2. Decreasing of F(m) as m or S(m) increases.

Figure 3. The architectures of the variable $σ_{i}$ and the variable $Z_{i}$ ($Z_{i} = ⌊ z_{i} \times 2^{m} ⌋$).

Figure 4. The architecture diagram of the NR-8 CORDIC algorithm on an FPGA. * The multipliers can be optimized as shifters and adders.

Figure 5. (a) The curves of cosine and sine computed by the NR-8 CORDIC algorithm. (b) Comparison of errors for cosine between the NR-8 and the LLH CORDIC with 16-bit precision. (c) Comparison of errors for sine between the NR-8 and the LLH CORDIC with 16-bit precision.

Figure 6. The relations between the maximum absolute errors (MAEs) and the input bit width M of the NR-8, R-8, R-4 and R-2 CORDIC algorithms. The angles $θ$ change from 0 to 90° in steps of 1, 0.1, 0.01 and 0.001°.

Figure 7. The relations between the root mean squared errors (RMSEs) and the input bit width M of the NR-8, R-8, R-4 and R-2 CORDIC algorithms. The angles $θ$ change from 0 to 90° in steps of 1, 0.1, 0.01 and 0.001°.

Figure 8. The diagram of digital beamforming (DBF) mode for the MIMO millimeter wave radar.

Figure 9. The experimental device of DBF for the MIMO millimeter wave radar.

Figure 10. The effects of the phase shift showing the partial amplification of delay beams.

Table 1. The CORDIC output mappings between θ and φ.

Regions	$θ$	$x_{0}^{'}, y_{0}^{'}$	$x_{d}, y_{d}$
$A, [0, \frac{π}{12})$	$φ$	$[\begin{array}{l} x_{0}^{'} \\ y_{0}^{'} \end{array}] = [\begin{array}{l} x_{0} \\ y_{0} \end{array}]$	$[\begin{array}{l} x_{d} \\ y_{d} \end{array}] = R T (θ) [\begin{array}{l} x_{0}^{'} \\ y_{0}^{'} \end{array}]$
$B, [\frac{π}{12}, \frac{π}{6})$	$\frac{π}{6} - φ$	$[\begin{array}{l} x_{0}^{'} \\ y_{0}^{'} \end{array}] = \frac{1}{2} [\begin{array}{l} \sqrt{3} x_{0} - y_{0} \\ - x_{0} - \sqrt{3} y_{0} \end{array}]$	$[\begin{array}{l} x_{d} \\ - y_{d} \end{array}] = R T (θ) [\begin{array}{l} x_{0}^{'} \\ y_{0}^{'} \end{array}]$
$C, [\frac{π}{6}, \frac{π}{4})$	$\frac{π}{6} + φ$	$[\begin{array}{l} x_{0}^{'} \\ y_{0}^{'} \end{array}] = \frac{1}{2} [\begin{array}{l} \sqrt{3} x_{0} - y_{0} \\ x_{0} - \sqrt{3} y_{0} \end{array}]$	$[\begin{array}{l} x_{d} \\ y_{d} \end{array}] = R T (θ) [\begin{array}{l} x_{0}^{'} \\ y_{0}^{'} \end{array}]$
$D, [\frac{π}{4}, \frac{π}{3})$	$\frac{π}{3} - φ$	$[\begin{array}{l} x_{0}^{'} \\ y_{0}^{'} \end{array}] = \frac{1}{2} [\begin{array}{l} x_{0} - \sqrt{3} y_{0} \\ \sqrt{3} x_{0} - y_{0} \end{array}]$	$[\begin{array}{l} x_{d} \\ - y_{d} \end{array}] = R T (θ) [\begin{array}{l} x_{0}^{'} \\ y_{0}^{'} \end{array}]$
$E, [\frac{π}{3}, \frac{5 π}{12})$	$\frac{π}{3} + φ$	$[\begin{array}{l} x_{0}^{'} \\ y_{0}^{'} \end{array}] = \frac{1}{2} [\begin{array}{l} x_{0} - \sqrt{3} y_{0} \\ \sqrt{3} x_{0} + y_{0} \end{array}]$	$[\begin{array}{l} x_{d} \\ y_{d} \end{array}] = R T (θ) [\begin{array}{l} x_{0}^{'} \\ y_{0}^{'} \end{array}]$
$F, [\frac{5 π}{12}, \frac{π}{2}]$	$\frac{π}{2} - φ$	$[\begin{array}{l} x_{0}^{'} \\ y_{0}^{'} \end{array}] = [\begin{array}{l} - x_{0} \\ - y_{0} \end{array}]$	$[\begin{array}{l} x_{d} \\ - y_{d} \end{array}] = R T (θ) [\begin{array}{l} x_{0}^{'} \\ y_{0}^{'} \end{array}]$

Table 2. Utilization comparison of the NR-8 CORDIC algorithm with the conventional CORDIC algorithms with 16-bit precision.

Algorithms	CLB LUTs ^a (242,400)/UT (%)	FF (484,800)/UT (%)	DSPs (1920)/UT (%)	Clock ^b Latency	Power (Dynamic/Static) (W)
R-2 [11]	1095/0.45	785/0.16	2/0.1	17	0.071/0.479
R-4 [12]	975/0.4	329/0.07	5/0.26	9	0.066/0.478
R-8 [15]	880/0.36	234/0.05	6/0.31	7	0.065/0.478
NR-8	300/0.12	98/0.02	5/0.21	3	0.031/0.478

^a A CLB contains eight 6-input LUTs and 16 flip-flops [19]. ^b The working clock frequency is 250 MHz.

Table 3. Performance comparison of newly developed CORDIC algorithms with m-bit precision.

Algorithms	R-2 (conventional [12,13,15])	R-4 (conventional [12,13,15])	R-8 (conventional [12,13,15])	High-Performance R-4 [14]	Low-Latency Hybrid (LLH) [16]	High-Performance/Low-Latency [17]	Proposed NR-8 CORDIC
Iterations	m + 1	(1/2)m	(3/8)m	m/2	(3/8)m+1	-	0
Complexity ^a	O(2m)	O(2m)	O((15/8)m)	O(m)	O(3m)	16 Adders/28 Adders (m = 16)	$O (2^{⌈ \frac{m - 5}{3} ⌉})$
Timing (Critical path) ^b	Tadd/sub	Tadd/sub	Tadd/sub	Tadd/sub	2Tadd/sub	2Tadd/sub	2Tadd/sub
Latency (m = 16)	17	9	7	8	6	68T_FA/26T_FA ^c	3

^a Based on analysis of the critical rotator module. O(●): order in terms of full adders. -: not reported. ^b Tadd/sub means adder/subtractor delay. ^c T_FA means a full adder delay.

Table 4. Phase errors of FFT transform.

Beams	(I,Q)	Beam_1	Beam_2	Beam_3	Beam_4	Beam_5	Beam_6	Beam_7	Beam_8	Beam_9	Beam_10
$p$ (°)	−129.305	−127.807	−126.354	−124.814	−123.361	−121.804	−120.241	−118.812	−117.344	−115.858	−114.290
${\tilde{Δ}}_{p}$ (°)	0	1.498	2.951	4.491	5.944	7.501	9.064	10.493	11.961	13.447	15.015
$Δ_{p}$ (°)	0	1.5	3.0	4.5	6.0	7.5	9.0	10.5	12.0	13.5	15.0
$δ_{Δ p}$ (°)	0	−0.002	−0.049	−0.009	−0.056	0.001	0.064	−0.007	−0.039	−0.053	0.015

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Tang, W.; Xu, F. A Noniterative Radix-8 CORDIC Algorithm with Low Latency and High Efficiency. Electronics 2020, 9, 1521. https://doi.org/10.3390/electronics9091521

