FPGA Implementation of an Efficient FFT Processor for FMCW Radar Signal Processing

This paper presents the design and implementation results of an efficient fast Fourier transform (FFT) processor for frequency-modulated continuous wave (FMCW) radar signal processing. The proposed FFT processor is designed with a memory-based FFT architecture and supports variable lengths from 64 to 4096. Moreover, it is designed with a floating-point operator to prevent the performance degradation of fixed-point operators. FMCW radar signal processing requires windowing operations to increase the target detection rate by reducing clutter side lobes, magnitude calculation operations based on the FFT results to detect the target, and accumulation operations to improve the detection performance of the target. In addition, in some applications such as the measurement of vital signs, the phase of the FFT result has to be calculated. In general, only the FFT is implemented in the hardware, and the other FMCW radar signal processing is performed in the software. The proposed FFT processor implements not only the FFT, but also windowing, accumulation, and magnitude/phase calculations in the hardware. Therefore, compared with a processor implementing only the FFT, the proposed FFT processor uses 1.69 times the hardware resources but achieves an execution time 7.32 times shorter.


Introduction
Recently, various types of sensors (passive infrared (PIR), ultrasonic, cameras, and lidar) have been used for target detection [1][2][3][4], but they all have weaknesses. PIR sensors cannot detect stationary targets or multiple targets, and a false alarm may occur if warm air is injected [1]. Ultrasonic sensors have trouble detecting targets at a distance greater than 5 m, and their angular resolution is poor compared with that of other sensors [2]. Camera sensors are less effective in the dark or in the presence of obstacles, require high-performance hardware because of their heavy signal-processing load, and raise serious privacy issues [1,3]. Finally, lidar sensors are limited by their high cost and susceptibility to weather conditions [4].
Unlike these types of sensors, radar sensors are not affected by harsh environmental conditions, such as light and weather, and can measure the range, velocity, and angle of a target directly. They can detect stationary or moving objects and can detect multiple targets simultaneously. Unlike cameras, they are also free from privacy issues. Radar sensors can also measure small movements such as breathing and heartbeat for vital-sign monitoring, and can track gestures and gait [1]. Because of these strengths, radar sensors are used in industrial machinery, drones, automobiles, and wearable devices [5][6][7][8][9].
Recently, [10] developed a method for detecting a human subject by investigating physical characteristics using Doppler radar. The trained support vector machine (SVM) had an accuracy of 96%. In addition, the method in [11], which simultaneously performs target classification and movement-direction estimation using the "you only look once" (YOLO) scheme, showed an identification accuracy of 85% even for newly acquired data. In [12], a compact radar system for autonomous walking for the visually impaired and blind was developed. By integrating a Tx/Rx circuit board with a radar antenna, the whole radar system was miniaturized.
Radar can be broadly classified into pulse radar and frequency-modulated continuous wave (FMCW) radar, the latter of which is simple to implement and has received increasing attention [13]. FMCW radar systems can be divided into slow- and fast-ramp FMCW according to the transmission waveform used. Slow-ramp FMCW uses a triangular transmission waveform and a pairing technique to extract the range and velocity of the target. However, slow-ramp FMCW suffers from a serious disadvantage in that ghost targets appear when extracting the target's range and velocity. Therefore, fast-ramp FMCW radar systems are more widely used. These use a sawtooth transmission waveform and extract the target's range and velocity using a two-dimensional fast Fourier transform (2D FFT) [5,8].
Figure 1 shows a block diagram of typical fast-ramp FMCW radar signal processing for target detection. The received beat signal is digitized in the pre-processing step and the DC component is removed. If a low-reflectance target exists, it may not be detected because of the relatively strong side lobes of the clutter; these are reduced by applying a window function before the FFT. The 2D FFT is then applied to extract the range and velocity. The two suboperations of the 2D FFT are called the range FFT and the Doppler FFT, and the result is called a range-Doppler map (RDM). The range FFT and Doppler FFT lengths determine the maximum detection range and the Doppler resolution. Therefore, FFT processors must support variable lengths because the length used depends on the performance required for each application [14].
FFT processors are generally designed and implemented using a fixed-point format because of its simplicity. However, because fixed-point formats have a limited representable range, implementing the FFT with a fixed-point operator requires scaling the results to fit that range. As a result, fixed-point FFT processors offer poor FFT performance because of quantization noise (Q-noise). In FMCW radar, Q-noise accumulates because the FFT is performed repeatedly during signal processing. Therefore, many studies have proposed designing and implementing the FFT processor using a floating-point operator [15][16][17][18][19].
After the 2D FFT, constant false alarm rate (CFAR) detection is commonly conducted, in which the range and velocity of a target are measured by adapting the threshold to the local average noise power. The accumulation of magnitude components improves detection performance [20][21][22][23][24]. Therefore, it is necessary to calculate and accumulate the magnitude component of the FFT result. In addition, the phase component of the FFT result is often used to acquire various vital signs such as respiration and heart rate [25,26]. Therefore, it is also necessary to calculate the phase component of the FFT result.
Because FFT computations are the most resource-intensive among these, an optimized hardware implementation is required. The hardware architecture of FFT processors can be divided into two types: pipelined FFTs and memory-based FFTs. Memory-based FFTs are also called "in-place" or "iterative" FFTs [27]. In long FFT processors, pipelined FFT structures consume a lot of area, so memory-based FFT structures are preferred [28][29][30][31][32]. Furthermore, given the computational speed and variable length of the transformations, it is generally appropriate to use a mixed-radix butterfly unit with a mixture of radix-4 and radix-2 [28].
In this paper, we propose an FFT processor hardware structure that supports variable lengths from 64 to 4096 as well as windowing, magnitude/phase calculation, and accumulation operations. We also present the results of an FPGA-based implementation of the proposed processor. Our results show that the proposed FFT processor can carry out the signal processing required for FMCW radar systems, reduce computation times, and achieve a high signal-to-quantization-noise ratio (SQNR) by using a floating-point operator.
The rest of this paper is organized as follows. Section 2 reviews the FMCW radar signal-processing algorithm. Section 3 describes the hardware architecture of the proposed FFT processor. Section 4 presents the implementation results. Finally, Section 5 presents our conclusions.

Measuring Range and Velocity in FMCW Radar
FMCW radar employs a transmission waveform the frequency of which varies linearly with time. This waveform can be either triangular or sawtooth. It is challenging to use FMCW radar based on the triangular transmission waveform in multi-target scenarios because it is difficult to remove the ghost target signal. Therefore, FMCW radar systems with sawtooth transmission waveforms are commonly used. The frequency of the sawtooth waveform is defined by Equation (1).
where f_c is the carrier frequency; B is the bandwidth; T is the period; and t is time.
For convenience, B/T is replaced by α. The instantaneous phase of the transmission waveform can be obtained by integrating the frequency of the transmission waveform with respect to time t, as in Equation (2).
Here, t = nT + t_s, and t_s is the time between 0 and T. If the initial phase of the transmission signal is ϕ_0 and the amplitude is A, the transmission waveform of the first chirp is given by Equation (3).
By substituting t = nT + t_s into Equation (3), the nth transmission signal can be obtained as shown in Equation (4).
After the delay time τ, the received signal can be expressed as in Equations (5) and (6).
Here, A′ is the amplitude of the received signal; R is the range of the target; v is the velocity of the target; and c is the speed of light.
Demodulation is performed by multiplying the received signal reflected from the target and the transmitted signal as shown in Equation (7).
The demodulated signal s_M(n, t_s) takes the form of a sum of cosine terms, and the high-frequency term is removed by a low-pass filter. Thus, the in-phase components of the signal can be arranged as shown in Equation (8).
where C is the product of A and A′. Equation (9) can be derived by expanding Equation (8).
In Equation (9), c is very large, so the 1/c^2 terms are negligible. Moreover, 2f_c v/c and 2αvnT/c are very small compared to 2αR/c and can be ignored. If the same approach is applied to 2αv t_s^2/c, Equation (9) can be approximated by the expression for s_{M−I} shown in Equation (10). Through the same process, the quadrature components can be approximated by Equation (11).
The range to the target R and the beat frequency f b can be defined as in Equations (12) and (13).
The frequency of the received signal reflected by the moving target can be defined as shown in Equation (14) by considering the Doppler effect.
Here, f_r is the reception frequency, and f_t is the transmission frequency. Equation (14) can be transformed into Equation (15) using the binomial series.
Because the speed of light c is very large, the higher-order terms can be removed to obtain Equation (16).
The Doppler frequency is defined as f_d = 2v f_c/c. By substituting f_b and f_d in Equations (10) and (11), Equations (17) and (18) are obtained.
Equation (19) can be derived by expressing the in-phase and quadrature components in exponential functions using Euler's formula.
The beat frequency can be obtained by performing the discrete Fourier transform (DFT) on the expression of s_M(n, t_s) shown in Equation (19) for one chirp, that is, over t_s. Using Equation (12), we can obtain the range to the target from the beat frequency. In addition, the Doppler frequency can be obtained by performing the DFT on the frequency change of the signal over several chirps, that is, over nT. Because f_d = 2v f_c/c, we can use the Doppler frequency to calculate the velocity of the target.
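The two DFT steps described above can be illustrated with a small simulation. The following is a minimal sketch, not the processor's algorithm: it assumes the beat signal has the complex-exponential form exp(j2π(f_b t_s + f_d nT)) with f_b = 2αR/c (the dominant term identified in the text) and f_d = 2v f_c/c, uses a direct DFT in place of an FFT, and picks hypothetical radar parameters so that the target falls exactly on a bin. The function names are illustrative only.

```python
import cmath
import math

C = 3.0e8  # speed of light in m/s

def dft(x):
    """Direct O(N^2) DFT; a stand-in for the FFT that is fine for tiny sizes."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]

def range_doppler_peak(f_c, bandwidth, period, m_len, n_len, target_range, velocity):
    """Simulate one ideal target and locate it with a range DFT and a Doppler DFT."""
    alpha = bandwidth / period                 # chirp slope B/T
    f_b = 2.0 * alpha * target_range / C       # beat frequency (assumed form)
    f_d = 2.0 * velocity * f_c / C             # Doppler frequency from the text
    t_step = period / m_len                    # sampling interval T/M
    # chirps[n][m]: sample m of chirp n of the assumed complex beat signal
    chirps = [[cmath.exp(2j * math.pi * (f_b * m * t_step + f_d * n * period))
               for m in range(m_len)] for n in range(n_len)]
    range_fft = [dft(row) for row in chirps]   # DFT over t_s, one per chirp
    # DFT over the chirp index n for every range bin: rdm[m][d]
    rdm = [dft([range_fft[n][m] for n in range(n_len)]) for m in range(m_len)]
    _, m_pk, d_pk = max((abs(rdm[m][d]), m, d)
                        for m in range(m_len) for d in range(n_len))
    est_range = m_pk * (1.0 / period) * C / (2.0 * alpha)        # bin spacing 1/T
    est_velocity = d_pk * (1.0 / (n_len * period)) * C / (2.0 * f_c)
    return est_range, est_velocity

# Hypothetical parameters: 24 GHz carrier, 150 MHz bandwidth, 100 us chirp.
est_r, est_v = range_doppler_peak(24e9, 150e6, 100e-6, 16, 8, 5.0, 15.625)
print(est_r, est_v)
```

In this sketch, positive velocities map to the lower half of the Doppler bins; handling negative velocities (upper half of the bins) is omitted.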
If the number of samples in the range direction, i.e., the range FFT length, is defined as M, the sampling interval becomes T/M, and thus the sampling frequency F_s is given by Equation (20). Furthermore, the relationship between the maximum detection range R_max and M can be derived from Equations (21) and (22).
Moreover, if we define the number of chirps, i.e., the Doppler FFT length, as N, then the sampling frequency in the Doppler direction is 1/T, and Δf_D can be derived as in Equation (23).
Here, ∆v is the Doppler resolution. Equations (22) and (25) confirm that the range FFT and Doppler FFT lengths are essential parameters for determining the maximum detection range and Doppler resolution, respectively. Depending on the radar application, the maximum detection range and Doppler resolution requirements vary. Therefore, the FFT processor should ideally support variable lengths.
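To make the dependence on M and N concrete, the relationships above can be evaluated numerically. This is a hedged sketch: F_s = M/T and f_d = 2v f_c/c follow from the text, whereas the expression used for R_max assumes complex (I/Q) sampling so that all M range bins are unambiguous; with real sampling only half of them would be. The function name and parameter values are hypothetical.

```python
C = 3.0e8  # speed of light in m/s

def fmcw_grid(f_c, bandwidth, period, m_len, n_len):
    """Return sampling and resolution figures for given FFT lengths M and N."""
    f_s = m_len / period                    # sampling interval is T/M
    range_bin = C / (2.0 * bandwidth)       # beat-bin spacing 1/T mapped to range
    r_max = m_len * range_bin               # ASSUMPTION: complex (I/Q) sampling
    delta_f_d = 1.0 / (n_len * period)      # Doppler bin spacing, sampling rate 1/T
    delta_v = C * delta_f_d / (2.0 * f_c)   # from f_d = 2 v f_c / c
    return {"f_s": f_s, "range_bin": range_bin, "r_max": r_max,
            "delta_f_d": delta_f_d, "delta_v": delta_v}

# Hypothetical 24 GHz radar with B = 150 MHz and T = 100 us.
for m_len, n_len in [(64, 64), (1024, 64), (4096, 4096)]:
    print(m_len, n_len, fmcw_grid(24e9, 150e6, 100e-6, m_len, n_len))
```

Under these assumptions, a longer range FFT extends the maximum range and a longer Doppler FFT refines the velocity resolution, which is why a variable-length FFT is useful.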

CFAR Algorithm
The simplest way to detect the range and velocity of a target in an FMCW radar system is to set a constant threshold. The detection algorithm then compares the magnitude component of the FFT result to this threshold. However, the average noise power varies with time. This is because various parameters of the environment where the radar operates, such as temperature and humidity, are not constant. Therefore, the false alarm detection rate can be very high while using a constant threshold. False alarms directly affect system performance by wasting radar resources owing to continuous detection.
The CFAR algorithm is widely used to reduce the false alarm rate in radar systems. The CFAR algorithm does not maintain the threshold constant but instead adjusts it according to the average noise power. The basic CFAR algorithm proceeds as follows.
(1) The magnitude component of the FFT result is calculated.
(2) The signal that is being examined to decide whether it is a target is called the test signal. The local average noise power is estimated from the signals surrounding it.
(3) The algorithm checks whether the test signal is a target by comparing it to the threshold generated from the surrounding signals.
(4) Steps (2) and (3) are repeated for all signals.
The FMCW radar system should apply CFAR detection along both the range and Doppler axes to extract the range and velocity information of the target. To improve the detection performance, 1D data are generated by accumulating RDMs along the range or Doppler axis [23,24]. Therefore, a function that calculates the magnitude component of the FFT results and accumulates it is required.
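The accumulation and thresholding steps above can be sketched as follows. This is an illustrative example only: it assumes the cell-averaging variant of CFAR (CA-CFAR), and the guard-cell count, training-cell count, and scale factor are hypothetical choices that the text does not specify.

```python
def accumulate_over_doppler(rdm_mag):
    """Sum RDM magnitudes along the Doppler axis: rdm_mag[range][doppler] -> 1D."""
    return [sum(row) for row in rdm_mag]

def ca_cfar(profile, num_train=4, num_guard=1, scale=3.0):
    """Cell-averaging CFAR on a 1D profile; returns indices declared as targets.

    num_train and num_guard are counted per side of the test cell.
    """
    hits = []
    n = len(profile)
    for i in range(n):
        cells = []
        for off in range(num_guard + 1, num_guard + num_train + 1):
            for j in (i - off, i + off):
                if 0 <= j < n:              # skip cells beyond the edges
                    cells.append(profile[j])
        if not cells:
            continue
        threshold = scale * sum(cells) / len(cells)  # adapts to local noise
        if profile[i] > threshold:
            hits.append(i)
    return hits

# Flat noise floor of 1.0 with one strong target at range bin 9.
rdm_mag = [[1.0] * 4 for _ in range(16)]
rdm_mag[9] = [10.0] * 4
profile = accumulate_over_doppler(rdm_mag)
print(ca_cfar(profile))
```

Because the threshold follows the local average, the same code keeps a roughly constant false alarm rate when the noise floor rises or falls.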

Hardware Architecture of the Proposed FFT Processor
As shown in Figure 2, the proposed FFT processor consists of a window multiplication unit (WMU), a butterfly unit (BFU), a magnitude/phase calculation unit (MPU), and an accumulation unit (ACU). In addition, it was designed with four channels to reduce execution time. The memory of the processor consists of FFT RAM to store input/output values, WIN RAM to store window coefficient values, and ACC RAM to store accumulated values.
The WMU performs windowing before the FFT operation. The WMU was designed to read its coefficients from a separate WIN RAM. Therefore, the window coefficients can be changed easily by the user. Windowing is performed on the input data, but no windowing is performed on the intermediate calculated values of the FFT. Therefore, the WMU selectively outputs through a multiplexer (MUX). In addition, because the window coefficients are real-valued, each complex sample requires only two real multiplications; eight multipliers are therefore used for the four channels.
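The behavior of the WMU can be sketched in software as below. The Hann window is an assumption made for illustration (the text only says that coefficients are loaded from WIN RAM), and the function names are hypothetical.

```python
import math

def hann_coefficients(length):
    """Real-valued Hann window; stands in for the contents of WIN RAM."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (length - 1))
            for k in range(length)]

def window_multiply(samples, win_ram, is_input_stage=True):
    """Apply the window to input data only; bypass for intermediate FFT values."""
    if not is_input_stage:
        return list(samples)                 # MUX selects the unmodified path
    # One real coefficient times one complex sample: two real multiplications.
    return [complex(w * s.real, w * s.imag) for w, s in zip(win_ram, samples)]

win_ram = hann_coefficients(8)
data = [complex(1.0, 1.0)] * 8
print(window_multiply(data, win_ram))
```

Replacing `hann_coefficients` with another table mirrors how a user could change the window by rewriting WIN RAM.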
The BFU performs the butterfly operation of the FFT. This unit can perform radix-4/2 butterfly operations for various transform lengths. Because the input comes from four channels, inputs 3 and 4 are set to zero when radix-2 butterfly operations are performed. The intermediate value of the FFT obtained through the BFU is stored in the FFT RAM. Then, the BFU repeatedly performs butterfly operations until the final FFT result is obtained.
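The reuse of a four-input butterfly for radix-2 operations can be illustrated with a short sketch. It is written under assumptions: twiddle-factor multiplication is omitted, the input/output ordering is the textbook 4-point DFT ordering rather than the BFU's actual wiring, and the stage schedule (radix-4 stages followed by one radix-2 stage when log2 of the length is odd) is a common mixed-radix choice, not one confirmed by the text.

```python
def radix4_butterfly(x0, x1, x2, x3):
    """Twiddle-free radix-4 butterfly (a 4-point DFT)."""
    y0 = x0 + x1 + x2 + x3
    y1 = x0 - 1j * x1 - x2 + 1j * x3
    y2 = x0 - x1 + x2 - x3
    y3 = x0 + 1j * x1 - x2 - 1j * x3
    return y0, y1, y2, y3

def stage_plan(length):
    """ASSUMED schedule: as many radix-4 stages as possible, then one radix-2."""
    log2n = length.bit_length() - 1
    if length < 2 or (1 << log2n) != length:
        raise ValueError("length must be a power of two")
    return [4] * (log2n // 2) + [2] * (log2n % 2)

# With the last two inputs zeroed, outputs 0 and 2 form a radix-2 butterfly.
a, b = complex(1, 2), complex(3, -1)
y = radix4_butterfly(a, b, 0, 0)
print(y[0], y[2])            # a + b and a - b
for n in (64, 128, 4096):
    print(n, stage_plan(n))
```

Zeroing two inputs lets a single datapath serve both radices, which matches the motivation for a mixed-radix unit supporting lengths from 64 to 4096.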
The MPU performs an operation that calculates, from the FFT result, the corresponding magnitude and phase components. We implemented it using an algorithm that approximates the magnitude and phase components to reduce the necessary hardware resources. Therefore, we implemented the MPU using only shifters and adders. The algorithms for approximating the magnitude and phase are discussed in detail in Section 3.2.
The ACU accumulates the FFT results. In contrast to windowing, accumulation is performed directly on the FFT results, but not on the intermediate calculated values of the FFT. Therefore, the ACU selectively outputs through a MUX. The accumulation process requires adding the current FFT result to the accumulated value. Thus, the accumulated values are written to, and read from, a separate ACC RAM.

HFP Operation
To measure the range and velocity of a target using FMCW radar, a 2D FFT has to be performed on the input data from the ADC. Because the 2D FFT increases quantization noise compared to the 1D FFT, essential information may be lost. For example, in a hand gesture recognition radar system, the echo signal is very weak because the radar cross section of the human hand is very small [33,34]. If the quantization noise accumulates through the 2D FFT, important data for a hand gesture with a small echo signal value may be lost. To achieve reasonable recognition performance, the SQNR of the 2D FFT needs to be large enough. Table 1 compares the 2D FFT SQNR performance of a fixed-point and a floating-point operator when the number of bits in the input data ranges from 16 to 28. A floating-point format with 16 bits of data is called the half-precision floating-point (HFP) format. As shown in Table 1, the SQNR of the fixed-point number system degrades seriously, especially when the number of bits in the input data is 16 to 24 for 4096 × 4096 data. Fixed-point systems do not exhibit a significant performance penalty when the bit width is 28 bits; their SQNR performance is then close to that of HFP systems.
Table 2 compares the hardware resources used after designing and synthesizing an FFT processor based on either the 28-bit fixed-point operator or the HFP operator. The results show that the FFT processor implemented with the HFP operator uses more LUTs than that implemented with the fixed-point operator. Because the Xilinx FPGA's block RAM (BRAM) is organized in 16-bit units, the FFT processor implemented with the 28-bit fixed-point operator requires twice the BRAM. The Xilinx FPGA's DSP block contains a fixed-width two's complement multiplier and is used to implement multipliers. Because the DSP bit width is fixed, wide multiplications consume several DSP blocks, so a processor configured with the fixed-point operator requires approximately three times as many DSPs. Therefore, it is preferable to design the FFT processor with an HFP operator in terms of both FFT performance degradation and the required hardware.
A floating-point number consists of a sign, an exponent, and a mantissa, and operations are performed by treating these components separately. The HFP-adder performs addition by separating the input data into sign, exponent, and mantissa, as shown in Figure 3. The sign is determined using sign logic. If the two numbers have the same sign, the result has that sign. If the signs of the two numbers are different, the sign of the result must be determined by comparing the numbers' exponents and mantissas. The sign logic also determines whether the mantissas are added or subtracted. The exponent of the result is determined in three steps. First, the larger value is selected by comparing the two exponent values. Then, the exponent change resulting from the mantissa calculation is added. Finally, if an overflow occurs in post-normalization, the exponent is adjusted to obtain its final value. This adjustment ensures that the exponent does not overflow.
The mantissa is determined through a more complex process than the previous two components. It is calculated through alignment, operation, normalization, rounding, and post-normalization. First, if the two input values have different exponents, an alignment step is required to match their digit positions. To use only a single adder/subtractor, the two mantissas are compared and swapped if necessary. After alignment and swapping, the mantissas are added or subtracted according to the sign logic.
Then, leading zeros are detected, the change in the exponent value is calculated, and normalization to the floating-point format is performed. The least significant bits (LSBs) lost in this process are used as rounding bits, with which rounding is performed. If an overflow occurs during rounding, normalization is performed again through post-normalization. Finally, the components are combined to generate the final result.
Similar to the HFP-adder, the HFP-multiplier performs multiplication by separating the input data into sign, exponent, and mantissa, as shown in Figure 4. The sign is determined by an exclusive-OR logic gate. The exponent of the floating-point number system uses a biased notation instead of two's complement [35]. Therefore, the HFP-multiplier adds the two exponent values and subtracts the bias. Then, the exponent change resulting from the mantissa calculation is added. Finally, the result is adjusted to ensure that the exponent does not overflow. Mantissa calculations are performed in the following order: operation, normalization, rounding, and post-normalization. In contrast to the HFP-adder, the HFP-multiplier does not require alignment because the mantissas can be multiplied regardless of their exponents. The product has twice as many bits as each operand. Because this is too wide, the number of bits is reduced by shifting. The LSBs lost at this point are used to generate three rounding bits.
Subsequently, a process similar to that of the HFP-adder follows. The leading zeros are detected and the change in the exponent value is calculated. Normalization to the floating-point format is then performed. The LSBs lost in this process are used as rounding bits. Rounding is then performed using the rounding bits generated in the operation and normalization steps. Again, an overflow may occur during rounding; if it does, normalization is performed again through post-normalization. Finally, the components are combined to generate the final result.
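The multiplier steps described above (sign by XOR, biased exponent addition, mantissa product, normalization, rounding, and post-normalization) can be traced in a bit-level software sketch. The sketch assumes the IEEE 754 binary16 layout (1 sign bit, 5 exponent bits with bias 15, 10 mantissa bits) and round-to-nearest-even; neither is confirmed by the text as the exact HFP configuration. It handles only normal, nonzero operands whose product stays in the normal range.

```python
import struct

BIAS = 15  # exponent bias of the assumed binary16 layout

def to_bits(x):
    """Encode a Python float as a 16-bit half-precision pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def from_bits(b):
    """Decode a 16-bit half-precision pattern to a Python float."""
    return struct.unpack("<e", struct.pack("<H", b))[0]

def hfp_mul(a_bits, b_bits):
    """Multiply two normal half-precision numbers given as bit patterns."""
    sign = ((a_bits >> 15) ^ (b_bits >> 15)) & 1         # exclusive-OR of signs
    exp = ((a_bits >> 10) & 0x1F) + ((b_bits >> 10) & 0x1F) - BIAS
    man_a = (a_bits & 0x3FF) | 0x400                     # restore hidden one
    man_b = (b_bits & 0x3FF) | 0x400
    prod = man_a * man_b                                 # 22-bit product
    shift = 10
    if prod & (1 << 21):                                 # product in [2, 4)
        shift = 11                                       # normalize
        exp += 1
    man = prod >> shift                                  # 11 bits with hidden one
    rem = prod & ((1 << shift) - 1)                      # bits lost by shifting
    half = 1 << (shift - 1)
    if rem > half or (rem == half and (man & 1)):        # round to nearest even
        man += 1
        if man == 0x800:                                 # post-normalization
            man >>= 1
            exp += 1
    if not 0 < exp < 31:
        raise ValueError("result outside the normal range is not handled")
    return (sign << 15) | (exp << 10) | (man & 0x3FF)

print(from_bits(hfp_mul(to_bits(1.5), to_bits(2.5))))   # 3.75
```

The hardware described in the text keeps three rounding bits; the sketch keeps all discarded bits in `rem`, which yields the same rounding decision for round-to-nearest-even.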

Magnitude/Phase Calculation Unit
The MPU is used to calculate the magnitude and phase components of the FFT result. If the magnitude component is calculated using an approximation method, such as that shown in Equation (26), the number of calculations can be efficiently reduced by replacing the multiplication with an addition without significant performance degradation [36].
Here, x is the FFT result; Re(·) is a function that calculates the real part of an input value; Im(·) is a function that calculates the imaginary part of an input value; and max(a, b) is a function that selects the larger of two given values. In this case, it selects the larger absolute value between the real and imaginary parts.
The approximated norm unit was implemented as shown in Figure 5. After comparing the real and imaginary parts, the resulting magnitude is approximated using shifters and adders. A comparison between two floating-point numbers is performed through exponent and mantissa comparisons. The shifter subtracts 2 and 3 from the exponent to scale a value by 1/4 and 1/8, respectively. Finally, the numbers are added using HFP-adders, producing the same result as Equation (26).
The calculation of the phase component of the FFT result was implemented using the coordinate rotation digital computer (CORDIC) algorithm. The CORDIC algorithm is an iterative computation method that treats the input as a vector in a two-dimensional plane and obtains a converged value through repeated vector rotation. In Equations (27) through (29), if the real value is substituted for x(1) and the imaginary value for y(1), and the operation is repeated until y(i) becomes 0, the phase value is obtained in z(i) [37]. Here, d_i = −sign(x(i) · y(i)).
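A multiplier-free magnitude approximation of the kind described above can be sketched as follows. Equation (26) is not reproduced here, so the exact form is an assumption: the sketch uses the common "alpha max plus beta min" estimate with beta = 1/4 + 1/8, chosen because it needs only the 1/4 and 1/8 scalings and additions that the text mentions.

```python
import math

def approx_magnitude(re, im):
    """ASSUMED form: max + min/4 + min/8, using only scalings and additions."""
    big = max(abs(re), abs(im))
    small = min(abs(re), abs(im))
    # ldexp(v, -k) scales by 2**-k, i.e. subtracts k from the exponent.
    return big + math.ldexp(small, -2) + math.ldexp(small, -3)

# Compare with the exact magnitude for a few inputs.
for re, im in [(3.0, 4.0), (1.0, 0.0), (1.0, 1.0), (-2.0, 0.5)]:
    exact = math.hypot(re, im)
    approx = approx_magnitude(re, im)
    print(re, im, exact, approx, (approx - exact) / exact)
```

For this assumed form the relative error stays within roughly 7%, which illustrates the trade of accuracy for hardware cost.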

The arctangent unit, composed of a shifter, a controller, a MUX, and an adder, was implemented as shown in Figure 6. If CORDIC is implemented in a pipelined architecture, as many units are needed as there are iterations. Therefore, a single unit is used repeatedly to calculate the phase component. The shifter was implemented so that 0 to 12 can be subtracted from the exponent, and the ith constant has a value of arctan(2^(−i)).
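The iterative phase calculation can be sketched in software as below. The update rules are the standard vectoring-mode CORDIC recurrences with d_i = −sign(x·y) as stated in the text; they are assumed to match Equations (27) through (29), which are not reproduced here. Thirteen iterations are used to mirror the shift range of 0 to 12, and the function name is hypothetical.

```python
import math

# Constants arctan(2^-i) for shift amounts 0..12.
ATAN_TABLE = [math.atan(2.0 ** -i) for i in range(13)]

def sign(v):
    return (v > 0) - (v < 0)

def cordic_phase(re, im, iterations=13):
    """Vectoring-mode CORDIC: drives y toward 0 and accumulates the angle in z.

    Returns arctan(im / re); assumes re != 0. Quadrant correction for a full
    four-quadrant angle is not included.
    """
    x, y, z = re, im, 0.0
    for i in range(iterations):
        d = -sign(x * y)
        if d == 0:                      # y has reached 0: converged
            break
        # Scaling by 2^-i corresponds to subtracting i from the exponent.
        x, y = x - d * math.ldexp(y, -i), y + d * math.ldexp(x, -i)
        z -= d * ATAN_TABLE[i]
    return z

print(cordic_phase(1.0, 1.0), math.pi / 4)
print(cordic_phase(1.0, 0.5), math.atan(0.5))
```

Reusing one loop body for every iteration corresponds to the single, repeatedly used arctangent unit; the residual error after the last iteration is bounded by the smallest table entry, arctan(2^−12).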

Implementation Results of the Proposed FFT Processor
The proposed FFT processor was designed using a hardware description language (HDL) and implemented on a Xilinx Zynq UltraScale+ device-based FPGA platform. As shown in Table 3, the FFT processor was implemented with 10,891 LUTs, 6365 FFs, and 20 DSPs. It used 1.69 times the hardware resources of the BFU alone, which performs only the FFT operation. As shown in Figure 7, the proposed FFT processor was configured on the FPGA platform using an advanced extensible interface (AXI) bus interface for verification. Figure 8 shows the verification environment for the FPGA platform. The system consisted of the FFT processor, a master interface for data transmission/reception with double data rate (DDR) memory, a slave interface for communication with a microprocessor (MP), internal RAM, and a register that changes the operation mode of the FFT processor. Input data for hardware verification were initialized in DDR memory, and the FFT length was set using the MP. When the start signal of the FFT IP was issued by the MP, the initial data in DDR memory were stored in the internal RAM of the FFT IP. After reading all the data, the FFT processor performed the necessary operations. When these operations were completed, the result was stored in DDR memory through the master interface.
Table 4 shows the evaluation results for the execution time of FMCW radar signal processing, which refers to windowing the input data, performing a 2D FFT, calculating the magnitude/phase components, and accumulating them. To evaluate the speed of the FFT processor, we measured the execution times of three versions: one using only software, one using dedicated hardware only for the FFT (similar to existing FFT processors), and one using the proposed FFT processor. When performing radar signal processing with 4096 × 4096 data, implementing only the FFT in hardware shortened the execution time from 32.97 to 4.54 s compared with using only software. This corresponds to a 7.26-fold acceleration. The execution time was reduced from 32.97 to 0.62 s when the proposed FFT processor was used instead of only software. This corresponds to a 53.29-fold acceleration. Compared with implementing only the FFT in hardware, the proposed FFT processor accelerated the radar signal processing by a factor of 7.32.
Table 5 shows a comparison between the hardware resources of the proposed FFT processor and those of an existing FFT processor [38] and Xilinx's FFT IP [39], both of which were implemented with a memory-based architecture using a floating-point operator. Because the memory-based FFT architecture is built around a single butterfly operator, the transform length has little effect on the LUT and FF counts in the FPGA. Therefore, no normalization for transform length was applied. Because [38] and [39] perform only the FFT operation, it is more appropriate to compare them with the hardware resources of the BFU of the proposed processor. Even after the LUT and FF counts of [38] were normalized by bit width, the BFU of the proposed processor required fewer hardware resources. Compared with [39], the BFU of the proposed FFT processor required a similar amount of hardware resources at a similar clock frequency. However, the proposed FFT processor is expected to be much faster than that of [39] for FMCW radar signal processing owing to the integration of the WMU, MPU, and ACU. Therefore, the proposed FFT processor is more efficient than the others when considering the trade-off between hardware resources and execution time.

Discussion and Conclusions
We developed an FFT processor for FMCW radar signal processing to support variable lengths by applying a mixed-radix algorithm. It also supports windowing, magnitude/phase calculations, and accumulation functions. The processor was implemented using a Xilinx Zynq UltraScale+ device. In our implementation, 10,891 LUTs, 6365 FFs, 10 RAM blocks, and 20 DSPs were used as hardware resources.
Because a general FFT processor supports only the FFT operation, it is appropriate to compare it with the BFU of the proposed processor. The Xilinx FFT processor and the BFU of the proposed FFT processor used similar hardware resources. However, the complete proposed processor, which also includes the WMU, MPU, and ACU, required more hardware resources. In terms of the execution time of windowing, 2D FFT, magnitude/phase calculation, and accumulation, the proposed processor was 7.32 times faster than a hardware implementation of only the FFT.
As mentioned, the proposed FFT processor supports a high SQNR and special functions such as windowing, magnitude/phase calculation, and accumulation. Therefore, it is very efficient for FMCW radar signal processing and can be used for other applications, such as wireless communication with orthogonal frequency division multiplexing (OFDM) modulation and voice recognition systems with frequency analysis, which require a high SQNR and the abovementioned special functions [40,41].
In future work, we will implement a radar signal processor that includes the proposed FFT processor in VLSI. It is expected to find wide use in automobiles, drones, and wearable devices that require low-cost, low-power implementation.