Area-Efficient Pipelined FFT Processor for Zero-Padded Signals

Abstract: This paper proposes an area-efficient fast Fourier transform (FFT) processor for zero-padded signals based on the radix-$2^2$ and radix-$2^3$ single-path delay feedback pipeline architectures. The delay elements that align the data in each pipeline stage are among the most complex units, and those of stage 1 are the largest. By exploiting the fact that the input data sequence is zero-padded and that the twiddle factor multiplication in stage 1 is trivial, the proposed FFT processor can dramatically reduce the required number of delay elements. Moreover, 256-point FFT processors were designed using a hardware description language (HDL) and synthesized to gate-level circuits using a standard cell library for a 65 nm CMOS process. The proposed architecture results in a logic gate count of 40,396, which makes it efficient and suitable for zero-padded FFT processors.


Introduction
The fast Fourier transform (FFT) is a mathematical algorithm for reducing the computational complexity of the discrete Fourier transform (DFT) and is widely used for frequency analysis [1][2][3]. The zero-padded FFT offers increased frequency resolution by extending the length of the input data sequence in the time domain, padding zeros at the tail of the discrete-time signal. Because of this, it has been widely used in wireless communications and radar systems that require high frequency resolution [4][5][6][7][8][9].
The radix-2 and radix-4 algorithms are the most widely used for implementing FFT processors because of their simple architectures. For pipeline architectures, the radix-4 algorithm has a smaller number of non-trivial multiplications than the radix-2 algorithm [10]. However, the radix-4 algorithm complicates the control of butterfly architectures more than the radix-2 algorithm. Thus, radix-$2^2$ and radix-$2^3$ algorithms have been proposed to reduce the complexity of high-radix algorithms. The radix-$2^2$ algorithm has the same number of non-trivial multiplications as the radix-4 algorithm but maintains the butterfly architecture of the radix-2 algorithm. Similarly, the radix-$2^3$ algorithm has the same number of non-trivial multiplications as the radix-8 algorithm [11][12][13][14]. The pruned FFT algorithm can also be applied to zero-padded signals to reduce the computational complexity, and many studies have been conducted on it [15][16][17][18][19][20]. However, a pruned FFT processor based on the pipeline architecture requires an additional memory unit, sized to the FFT length, to re-arrange the data sequence [20].
Single-path delay feedback (SDF) pipeline FFT architectures are commonly used because they have the smallest number of non-trivial multiplications compared with other pipeline architectures, such as single-path delay commutator (SDC) and multi-path delay commutator (MDC). However, as the number of FFT points increases, the SDF architecture requires significantly more circuit area because of the delay elements for data reordering [21][22][23][24][25][26].
In this paper, we propose an area-efficient FFT processor for zero-padded signals by taking advantage of the fact that the data sequence is zero-padded and that the twiddle factor (TF) operation in stage 1 is a trivial multiplication in the radix-$2^2$ and radix-$2^3$ algorithms. The rest of this paper is organized as follows. In Section 2, we review the zero-padded FFT. The hardware architecture of the proposed FFT processor is described in Section 3. In Section 4, we compare the proposed zero-padded FFT architecture with conventional architectures. Finally, Section 5 concludes the paper.

Zero-Padded FFT
The DFT of a complex data sequence x(n) of length N is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \tag{1}$$

where the twiddle factor is

$$W_N = e^{-j 2\pi / N}. \tag{2}$$

When analyzing the resolution of the DFT, there are two factors to consider. The first is the spectral resolution, which refers to the algorithm's capability to detect closely spaced spectral components. The second is the frequency resolution, which is the spacing between frequency bins. Whereas the spectral resolution can only be increased by lengthening the time window of the signal, the frequency resolution is determined by the number of input data points in the sequence given to the DFT [27][28][29][30]. A longer data sequence is usually obtained by zero-padding, which is described below. Assume that a new data sequence y(n) is created by zero-padding the original data sequence x(n) of length N to a length of M:

$$y(n) = \begin{cases} x(n), & 0 \le n \le N-1 \\ 0, & N \le n \le M-1 \end{cases} \tag{3}$$
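The effect of zero-padding on the bin spacing can be illustrated with a minimal sketch. It uses a direct O(N²) DFT in plain Python; the helper names `dft` and `zero_pad` and the sample values are illustrative assumptions, not part of the proposed design.

```python
import cmath

def dft(seq):
    """Direct DFT: X(k) = sum_n x(n) * W^(n*k), with W = exp(-j*2*pi/len(seq))."""
    n_pts = len(seq)
    return [
        sum(seq[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]

def zero_pad(seq, m):
    """Append zeros to the tail of seq until its length is m."""
    return list(seq) + [0j] * (m - len(seq))

x = [1 + 0j, 2 - 1j, 0 + 3j, -1 + 0.5j]   # arbitrary example data, N = 4
X = dft(x)
Y = dft(zero_pad(x, 16))                  # M = 4N = 16

# Every (M/N)-th bin of the padded spectrum coincides with an original bin;
# the bins in between are additional samples on a four-times finer frequency grid.
max_err = max(abs(Y[4 * k] - X[k]) for k in range(len(x)))
print(len(X), len(Y), max_err < 1e-9)
```

The padded transform returns four times as many bins over the same frequency range, while the original bins are reproduced unchanged at every fourth index.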
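The M points of the DFT are calculated as

$$Y(k) = \sum_{n=0}^{M-1} y(n)\, W_M^{nk}, \quad k = 0, 1, \ldots, M-1. \tag{4}$$

Based on the divide-and-conquer algorithm, indices n and k can be written as

$$n = \frac{M}{2} n_1 + \frac{M}{4} n_2 + n_3, \tag{5}$$

$$k = k_1 + 2k_2 + 4k_3, \tag{6}$$

where $0 \le n_1, n_2, k_1, k_2 \le 1$ and $0 \le n_3, k_3 \le M/4 - 1$. Substituting Equations (5) and (6) into Equation (4), we obtain

$$Y(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{M/4-1} \left[ H(k_1, k_2, n_3)\, W_M^{n_3 (k_1 + 2k_2)} \right] W_{M/4}^{n_3 k_3}, \tag{10}$$

where the output of the stage 2 butterfly $H(k_1, k_2, n_3)$ is expressed as shown in Equation (11):

$$H(k_1, k_2, n_3) = \left[ y(n_3) + (-1)^{k_1} y\!\left(n_3 + \tfrac{M}{2}\right) \right] + (-j)^{(k_1 + 2k_2)} \left[ y\!\left(n_3 + \tfrac{M}{4}\right) + (-1)^{k_1} y\!\left(n_3 + \tfrac{3M}{4}\right) \right]. \tag{11}$$

In particular, assuming that M is 4N in order to increase the frequency resolution four times, the samples from y(N) to y(4N − 1) are zero, so that Equation (11) can be simplified as follows:

$$H(k_1, k_2, n_3) = y(n_3). \tag{12}$$

Therefore, Equation (10) can be summarized as follows:

$$Y(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{M/4-1} \left[ y(n_3)\, W_M^{n_3 (k_1 + 2k_2)} \right] W_{M/4}^{n_3 k_3}. \tag{13}$$

Similarly, even if the frequency resolution is increased by more than four times, $y(M/4 + n_3)$ in Equation (11) becomes zero and the radix-$2^2$ algorithm is derived as shown in Equation (13).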
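The simplification for M = 4N can be checked numerically. The sketch below assumes the standard radix-$2^2$ index mapping k = k1 + 2·k2 + 4·k3 and that, with three quarters of the input being zero, the stage 2 butterfly output reduces to the input sample itself. The function names `twiddle`, `dft` and `zero_padded_dft_4x` are hypothetical helpers for this illustration, not the hardware data path.

```python
import cmath

def twiddle(m, e):
    """W_m^e = exp(-j*2*pi*e/m)."""
    return cmath.exp(-2j * cmath.pi * e / m)

def dft(seq):
    """Direct DFT used both as the reference and as the M/4-point sub-transform."""
    m = len(seq)
    return [sum(seq[n] * twiddle(m, n * k) for n in range(m)) for k in range(m)]

def zero_padded_dft_4x(x):
    """M = 4N point DFT of x padded with 3N zeros, with stages 1 and 2 reduced to H = y(n3)."""
    n_len = len(x)
    m = 4 * n_len
    out = [0j] * m
    for k1 in range(2):
        for k2 in range(2):
            # Twiddle multiplication that follows the stage 2 butterfly.
            scaled = [x[n3] * twiddle(m, n3 * (k1 + 2 * k2)) for n3 in range(n_len)]
            # Remaining M/4-point DFT over n3, indexed by k3.
            sub = dft(scaled)
            for k3 in range(n_len):
                out[k1 + 2 * k2 + 4 * k3] = sub[k3]
    return out

x = [1 + 2j, 3 - 1j, -2 + 0.5j, 0 + 4j]          # arbitrary example data, N = 4
reference = dft(x + [0j] * (3 * len(x)))         # direct 16-point DFT of the padded signal
result = zero_padded_dft_4x(x)
max_diff = max(abs(a - b) for a, b in zip(reference, result))
print(max_diff < 1e-9)
```

Each of the four (k1, k2) combinations reuses the same N input samples with a different twiddle sequence, which is what allows the hardware to replay one stored copy of the input instead of buffering the zero-padded sequence.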
When increasing the frequency resolution by more than four times using the radix-$2^3$ algorithm, the M points of the DFT are derived as shown in Equation (14) in a way similar to the radix-$2^2$ algorithm: where 0

Double Frequency Resolution
In order to double the frequency resolution, the tail of the input data sequence x(n) of length N is padded with N zeros to double its length in the time domain. The FFT signal flow graph (SFG) of the radix-$2^2$ algorithm for a zero-padded signal with double frequency resolution is shown in Figure 1.
To implement the zero-padded FFT using the conventional radix-$2^2$ SDF architecture, delay elements of length N are required for data sequence reordering in stage 1, and the length of the delay elements required for each subsequent stage is halved, as shown in Figure 2. That is, implementing the FFT processor for a zero-padded signal of length 2N using the conventional radix-$2^2$ SDF architecture requires delay elements with a total length of 2N − 1 [31]. As a result, the number of delay elements grows linearly with the number of FFT data points. To solve this problem, we propose the hardware architecture depicted in Figure 3, whose data flow is shown in Figure 4. Tracing this data flow verifies that the outputs of the stage 2 butterfly unit are equal to the results of the stage 2 butterfly operation shown in Figure 1. This demonstrates that the number of delay elements in stage 1 can be reduced by 50% compared with the conventional architecture. In addition, multiplication by −j is a trivial multiplication that consists of swapping the real and imaginary parts of the complex number and negating the new imaginary part, and the butterfly unit of stage 1 can be omitted as shown in Equation (9).
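The arithmetic behind the double-resolution data flow can be sketched as follows. The code models only the values that appear at the stage 2 butterfly outputs, not the cycle-level timing of Figure 4. The function names `mul_neg_j` and `stage2_outputs_double` are hypothetical, and the upper half of the input is recovered as x[i] − (x[i] − x[N/2 + i]) before the rotation by −j.

```python
def mul_neg_j(z):
    """(-j) * (a + jb) = b - ja: swap the parts and negate the new imaginary part."""
    return complex(z.imag, -z.real)

def stage2_outputs_double(x):
    """Stage 2 butterfly outputs for M = 2N, computed from x alone (no stage 1 butterfly)."""
    half = len(x) // 2
    lower, upper = x[:half], x[half:]
    sums = [a + b for a, b in zip(lower, upper)]       # x[i] + x[N/2 + i]
    diffs = [a - b for a, b in zip(lower, upper)]      # x[i] - x[N/2 + i], fed back
    # Recover x[N/2 + i] from the two stored operands, then rotate it by -j.
    rotated = [mul_neg_j(a - d) for a, d in zip(lower, diffs)]
    minus_j = [a + r for a, r in zip(lower, rotated)]  # x[i] - j*x[N/2 + i]
    plus_j = [a - r for a, r in zip(lower, rotated)]   # x[i] + j*x[N/2 + i]
    return sums, diffs, minus_j, plus_j

x = [1 + 2j, 3 - 1j, -2 + 0.5j, 0 + 4j]   # arbitrary example data, N = 4
sums, diffs, minus_j, plus_j = stage2_outputs_double(x)
half = len(x) // 2
ok_minus = all(abs(minus_j[i] - (x[i] - 1j * x[i + half])) < 1e-12 for i in range(half))
ok_plus = all(abs(plus_j[i] - (x[i] + 1j * x[i + half])) < 1e-12 for i in range(half))
print(ok_minus, ok_plus)
```

Because the rotation by −j needs no multiplier, the second pass through the stage 2 butterfly costs only a subtraction and a swap of the real and imaginary parts.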

Four-times Frequency Resolution
When the tail of an input data sequence of length N is padded with 3N zeros, the number of data points in the sequence in the time domain becomes 4N and, consequently, the frequency resolution increases by a factor of 4. The DFT for a 4N-long zero-padded signal is expressed in Equation (13) and the corresponding SFG is depicted in Figure 5. As can be seen from the SFG, the outputs of stage 1 from N to (2N − 1) and from 3N to (4N − 1) are zeros. Hence, the outputs of the butterfly unit in stage 2 consist of the input data sequence repeated four times. Therefore, the hardware architecture at stage 1 and stage 2 of the SDF for a zero-padded signal with four times the frequency resolution requires N delay elements and one complex multiplier, as illustrated in Figure 6. In the proposed hardware architecture, an input data sequence of length N is delayed using N delay elements, and the delayed data sequence is fed back to the delay elements and simultaneously transferred to a complex multiplier for multiplication by the TF. As a result, the input data sequence is repeated four times. After the calculations of stages 1 and 2 are completed, an N-point DFT calculation with $\log_2 N$ stages is performed. In other words, the proposed SDF architecture for a zero-padded signal with four times the frequency resolution can reduce the total number of delay elements by 50% compared with the conventional SDF architecture by eliminating stage 1, which has the largest number of delay elements.
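The delay-element saving can be estimated with a short counting sketch. It assumes that a conventional M-point SDF pipeline uses M/2 + M/4 + … + 1 = M − 1 delay elements, and that the proposed four-times architecture uses N delay elements for the merged stages 1 and 2 plus a conventional N-point SDF pipeline for the remaining stages. The function names are hypothetical and the model ignores control logic and word length.

```python
def conventional_sdf_delays(m_points):
    """Total delay elements of a conventional SDF pipeline: M/2 + M/4 + ... + 1 = M - 1."""
    total, length = 0, m_points // 2
    while length >= 1:
        total += length
        length //= 2
    return total

def proposed_sdf_delays_4x(n_len):
    """N delay elements for merged stages 1 and 2, plus N - 1 for the remaining N-point FFT."""
    return n_len + conventional_sdf_delays(n_len)

n_len = 64                                   # input length; padded FFT length is 4N = 256
conv = conventional_sdf_delays(4 * n_len)
prop = proposed_sdf_delays_4x(n_len)
reduction = 100 * (conv - prop) / conv
print(conv, prop, round(reduction, 1))       # prints: 255 127 50.2
```

Under these assumptions, the 256-point case gives 255 versus 127 delay elements, which agrees with the 50.2% reduction in delay elements reported with the synthesis results.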

$2^m$-Times Frequency Resolution
When the tail of an input data sequence of length $2^{q-m}$ is padded with $2^q - 2^{q-m}$ zeros, the frequency resolution increases by a factor of $2^m$, where m is 2 or more and q is m + 1 or more. Figure 7 shows the SFG when a data sequence of length $2^q$ is decomposed using the radix-$2^2$ algorithm. Among the outputs of stage 1, those from $2^{q-m}$ to $(2^{q-1} - 1)$ and from $(2^{q-1} + 2^{q-m})$ to $(2^q - 1)$ are zeros, and those from $2^{q-1}$ to $(2^{q-1} + 2^{q-m} - 1)$ are generated again in the same form as the input data sequence. In addition, the outputs of the butterfly unit in stage 2 repeat four times a block made up of the input data sequence of length $2^{q-m}$ followed by $(2^{q-2} - 2^{q-m})$ zeros. Therefore, the hardware architecture at stages 1 and 2 of the SDF for a zero-padded signal with $2^m$-times frequency resolution requires $2^{q-m}$ delay elements and one complex multiplier. Additionally, a multiplexer that inserts the $(2^{q-2} - 2^{q-m})$ zeros is required for the stage 2 butterfly outputs, as illustrated in Figure 8. In the proposed hardware architecture, the input data sequence of length $2^{q-m}$ is delayed using delay elements of length $2^{q-m}$, and the delayed input data sequence is simultaneously transferred to a complex multiplier and back to the delay elements of length $2^{q-m}$. As a result, in the outputs of the butterfly unit in stage 2, the input data sequence and the zeros are repeated. After the calculations of stage 1 and stage 2 are completed, $2^{q-2}$-point DFT calculations are performed over q − 2 stages. In other words, the proposed SDF architecture for a zero-padded signal with $2^m$-times frequency resolution eliminates stage 1, which has the largest number of delay elements. Moreover, for eight-times frequency resolution or higher, the input of the $2^{q-2}$-point DFT calculations is itself zero-padded after the operations of stage 2, so the number of delay elements in the $2^{q-2}$-point FFT processors can be reduced in the same way as in the proposed hardware architecture.
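The structure of the stage 1 outputs for a zero-padded input can be reproduced with a small sketch. It applies a plain radix-2 butterfly across the two halves of the padded sequence and ignores the trivial −j rotation, which for m ≥ 2 only touches samples that are zero. The function name `stage1_outputs` and the values of q and m are illustrative assumptions.

```python
def stage1_outputs(y):
    """Stage 1 butterfly of an M-point pipeline: sums in the first half, differences in the second."""
    half = len(y) // 2
    top = [y[n] + y[n + half] for n in range(half)]
    bottom = [y[n] - y[n + half] for n in range(half)]
    return top + bottom

q, m = 5, 3                          # 32-point FFT, eight-times frequency resolution
data_len = 2 ** (q - m)              # 4 input samples
total = 2 ** q
x = [complex(i + 1, -i) for i in range(data_len)]   # arbitrary example data
y = x + [0j] * (total - data_len)    # zero-padded input

s1 = stage1_outputs(y)
half = total // 2
first_copy = s1[:data_len]                    # indices 0 .. 2^(q-m) - 1
second_copy = s1[half:half + data_len]        # indices 2^(q-1) .. 2^(q-1) + 2^(q-m) - 1
zeros_a = s1[data_len:half]                   # indices 2^(q-m) .. 2^(q-1) - 1
zeros_b = s1[half + data_len:]                # indices 2^(q-1) + 2^(q-m) .. 2^q - 1
all_zero = all(v == 0 for v in zeros_a + zeros_b)
print(first_copy == x, second_copy == x, all_zero)
```

Since stage 1 only produces copies of the input and runs of zeros, its butterfly and its $2^{q-1}$ delay elements carry no information that the stored input sequence does not already hold.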

Comparison
Table 1 shows a comparison of the hardware area and performance between the conventional pipelined FFT architectures and the proposed hardware architecture for a zero-padded signal of length $2^q$ when the frequency resolution is increased by a factor of $2^m$. The table lists the number of complex adders, complex multipliers and delay elements. The latency is also presented in terms of the number of cycles. Because all the architectures process single-path data, their throughput is one sample per clock cycle. The number of complex multipliers is the same as in the radix-$2^2$ SDF architecture, but the number of complex adders is reduced by 2m compared with the radix-$2^2$ SDF architecture. Most notably, compared with the conventional hardware architectures (in which the number of delay elements grows sharply with the FFT length and the number of data paths), the proposed hardware architecture significantly reduces the number of delay elements. Moreover, the latency is significantly reduced compared with other single-path pipeline architectures.
Table 1. Comparison of pipeline hardware architectures for the computation of a $2^q$-point zero-padded FFT on complex-valued data (frequency resolution is assumed to be increased by a factor of $2^m$).

Pipelined Architecture | Complex Adders | Complex Multipliers | Delay Elements | Latency (Cycles)
Proposed SDF Radix-$2^2$ (m: Odd/Even)

To confirm the advantage of the proposed architecture, we implemented two 256-point FFT processors with the proposed and conventional radix-$2^2$ SDF architectures. For four-times frequency resolution, the tail of an input data sequence of length 64 is padded with 192 zeros. A 12-bit word for the real and imaginary data paths was selected to satisfy the requirement for a signal-to-quantization-noise ratio (SQNR) of 40 dB. We designed the zero-padded FFT processor for integration in a frequency modulated continuous wave (FMCW) radar signal processor and confirmed that the performance degradation due to quantization noise is minimized when the SQNR is above 40 dB. In addition, for an FFT processor in an orthogonal frequency division multiplexing (OFDM) baseband processor, Reference [32] reports that quantization noise has no effect when the SQNR is 40 dB or more.
The two FFT processors were designed using a hardware description language (HDL) and synthesized to gate-level circuits using a standard cell library for a 65 nm CMOS process. Table 2 compares the logic gate counts. As shown in this table, the proposed architecture reduces the gate count by 34.6% compared with the conventional architecture, owing to a 50.2% reduction in delay elements. Table 3 compares this work with the other FFT processors in References [33][34][35][36]. For a fair comparison, we normalized the area with respect to N and Tech, where N and Tech are the FFT length and the process technology in nanometers, respectively. As shown in Table 3, the normalized area of the proposed FFT processor is the smallest among the compared FFT processors because it significantly reduces the number of delay elements.

Conclusions
In this paper, we proposed an area-efficient FFT processor for zero-padded signals based on the radix-$2^2$ and radix-$2^3$ SDF pipeline architectures by taking advantage of the fact that the input data sequence is zero-padded and that the twiddle factor multiplication in stage 1 is trivial. The proposed FFT processor can dramatically reduce the required number of delay elements. For four-times frequency resolution, in which the tail of an input data sequence of length 64 is padded with 192 zeros, the gate count is reduced by 34.6% compared with the conventional architecture by using the feedback path of the SDF architecture and exploiting the trivial multiplication of stage 1.

Figure 1. Signal flow graph for double frequency resolution.

Figure 2. Hardware architecture of the conventional SDF FFT processor.

Figure 3. Hardware architecture of the proposed single-path delay feedback (SDF) fast Fourier transform (FFT) processor for double frequency resolution.

The data flow of the proposed hardware architecture is shown in Figure 4. First, x[0] to x[N/2 − 1] go through the delay elements of stage 2 for the butterfly operation of stage 2. After N/2 cycles, x[N/2] to x[N − 1] enter the butterfly unit of stage 2 and x[0] to x[N/2 − 1] are simultaneously output from the delay elements of stage 2. x[0] to x[N/2 − 1], which are now the output of the delay elements of stage 2, are delayed by the delay elements of length N/2 of stage 1; at the same time, x[0] to x[N/2 − 1] and x[N/2] to x[N − 1] undergo the butterfly operation in stage 2. The outputs of the butterfly unit of stage 2 are (x[0] + x[N/2]) to (x[N/2 − 1] + x[N − 1]) and (x[0] − x[N/2]) to (x[N/2 − 1] − x[N − 1]). The sums (x[0] + x[N/2]) to (x[N/2 − 1] + x[N − 1]) are transferred to stage 3 and the differences (x[0] − x[N/2]) to (x[N/2 − 1] − x[N − 1]) are fed back to the delay elements of stage 2.

After N cycles, x[0] to x[N/2 − 1], which are now the output of the delay elements of stage 1, enter the delay elements of stage 2. In addition, the feedback data (x[0] − x[N/2]) to (x[N/2 − 1] − x[N − 1]) are multiplied by the TF read from the TF ROM and then transferred to stage 3. At the same time, (x[0] − x[N/2]) to (x[N/2 − 1] − x[N − 1]) enter the delay elements of stage 1.

After 3N/2 cycles, x[0] to x[N/2 − 1] are output from the delay elements of stage 2 and (x[0] − x[N/2]) to (x[N/2 − 1] − x[N − 1]) are output from the delay elements of stage 1. Consequently, (−jx[N/2]) to (−jx[N − 1]) can be obtained by subtracting (x[0] − x[N/2]) to (x[N/2 − 1] − x[N − 1]) from x[0] to x[N/2 − 1] and then multiplying by −j. Additionally, x[0] to x[N/2 − 1] and (−jx[N/2]) to (−jx[N − 1]) undergo the butterfly operation in stage 2. The butterfly unit outputs of stage 2 from (x[0] − jx[N/2]) to (x[N/2 − 1] − jx[N − 1]) are transferred to stage 3 and (x[0] + jx[N/2]) to (x[N/2 − 1] + jx[N − 1]) are fed back to the delay elements of stage 2, so the stage 2 butterfly outputs match those of the SFG in Figure 1.

Figure 4. Timing diagram of the proposed SDF FFT processor for double frequency resolution.

Figure 5. Signal flow graph for four-times frequency resolution.

Figure 6. Hardware architecture of the proposed SDF FFT processor for four-times frequency resolution.

Figure 7. Signal flow graph for $2^m$-times frequency resolution.

Figure 8. Hardware architecture of the proposed SDF FFT processor for $2^m$-times frequency resolution.

Table 2. Comparison of logic synthesis results of a 256-point four-times frequency resolution zero-padded FFT on complex-valued data.

Table 3. Comparison of the proposed FFT processor with previous research results.