# Distributed-Memory-Based FFT Architecture and FPGA Implementations

## Abstract


## 1. Introduction

- Large transform sizes: LTE and 802.11ax with 2048 points; Digital Video Broadcasting-Second Generation Terrestrial (DVB-T2) [5] with 32,768 points.
- High throughputs from multiple-input–multiple-output (MIMO) data streams and carrier aggregation: 8 × 8 MIMO in both uplink and downlink plus 160 MHz bandwidths in 802.11ax [4].

## 2. Background

#### 2.1. FPGA Implementations

#### 2.2. Related Work

## 3. Algorithm

#### 3.1. Base-b Algorithm

**N**, where b is the “base”, (1) becomes

[…] $N_2 = b$, $W_b$ is an $N_1 \times N_1$ matrix with elements $W_b[k_1,n_1] = W_M^{n_1 k_1}$, $C_{M1}$ is an $N_1 \times b$ coefficient matrix with elements $C_{M1}[k_1,n_2] = W_{N_2}^{n_2 k_1}$, $X_b$ is a $b \times N_1$ matrix with elements $X_b[n_2,n_1] = X(n_1 + N_1 n_2)$, $Y_b$ is an $N_1 \times N_1$ matrix with elements $Y_b[k_1,n_1] = Y(k_1,n_1)$ from [25], $C_{M2}$ is a $b \times N_1$ coefficient matrix with elements $C_{M2}[k_2,n_1] = W_{N_2}^{n_1 k_2}$, $Z_b$ is a $b \times N_1$ matrix containing the transform outputs $Z_b[k_2,k_1] = Z(k_1 + k_2 N_1)$, “$\bullet$” indicates element-by-element multiplication and $t$ denotes matrix transposition [25]. (Note that in this section, $N_2$ is used for the value b when there might be confusion about the use of “b” as a number or as a symbol to imply base-b processing.) $C_{M1}$ and $C_{M2}$ contain $M/b^2$ sub-matrices $C_b = [c_1 | c_2 | \dots | c_{N_2}]$ with the form $C_{M1} = [C_b^t | C_b^t | \dots]$ and $C_{M2} = [C_b | C_b | \dots]$ due to the periodicity of $W_{N_2}$, and $c_i$ are constant vectors.
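The element definitions above are enough to check the base-b form numerically. The following is a minimal sketch, not the circuit's data flow: it assumes that (3) reads $Y_b = W_b \bullet (C_{M1} X_b)$ followed by $Z_b = C_{M2} Y_b^t$ (inferred from the index definitions, since the equation itself is not reproduced here), that $W_K = e^{-j2\pi/K}$, and that $N_1$ is a multiple of b. The function name `base_b_dft` is hypothetical.

```python
import numpy as np

def base_b_dft(x, b):
    """Length-M DFT via the two matrix products of the base-b form.

    Assumes M = N1 * b with N1 a multiple of b, and W_K = exp(-2j*pi/K).
    """
    x = np.asarray(x, dtype=complex)
    M = x.size
    N1, N2 = M // b, b
    if N1 * N2 != M or N1 % b != 0:
        raise ValueError("need M = N1 * b with N1 a multiple of b")
    k1 = np.arange(N1)[:, None]
    n1 = np.arange(N1)[None, :]
    W_b = np.exp(-2j * np.pi * k1 * n1 / M)                        # N1 x N1, [k1, n1]
    C_M1 = np.exp(-2j * np.pi * k1 * np.arange(N2)[None, :] / N2)  # N1 x b,  [k1, n2]
    C_M2 = np.exp(-2j * np.pi * np.arange(N2)[:, None] * n1 / N2)  # b x N1,  [k2, n1]
    X_b = x.reshape(N2, N1)    # X_b[n2, n1] = x[n1 + N1 * n2]
    Y_b = W_b * (C_M1 @ X_b)   # element-by-element multiply by W_b
    Z_b = C_M2 @ Y_b.T         # Z_b[k2, k1] = Z(k1 + k2 * N1)
    return Z_b.reshape(M)
```

Comparing the result with `np.fft.fft` for M = 16 and b = 4 is a quick way to confirm that the input and output index mappings are consistent with each other.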

[…] $N_1 \times b$ arrays of PEs that do the matrix–matrix computations $C_{M1}X_b$ and $C_{M2}Y_b^t$ using systolic algorithms [19], with the multiplication by $W_b$ occurring in between the two arrays.

[…] $X_b$, containing the input data, is stored in an input buffer and $C_{M1}$ values are stored in PE registers in the left-hand side (LHS) PE array, or vice versa. The systolic multiplication symmetry allows this option [19]. Then on the right-hand side (RHS), there is another input buffer for $C_{M2}$ containing all the radix-b butterfly coefficients that will be used (all preloaded before processing begins). During processing, data flows in groups of b and their movement is pipelined through the arrays of PEs. At the end of processing, the result $Z_b$ is stored in registers associated with each of the RHS $N_1 \times b$ PEs.

#### 3.2. Base b = 4 Systolic Array

[…] $N_2 = b = 4$ (“base-4”) represents a good tradeoff between circuit performance and circuit complexity. This selection results in

$$C_b = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -j & -1 & j \\ 1 & -1 & 1 & -1 \\ 1 & j & -1 & -j \end{bmatrix}$$

$C_b$ above is the coefficient matrix for a four-point DFT, describing a radix-4 decimation-in-time butterfly, and $j$ is the imaginary unit. Consequently, in (3) the matrix multiplications by $C_{M1}$ and $C_{M2}$ represent repeated use of a radix-4 butterfly.

[…] $N_1 = 4$ with the architecture shown in Figure 3, corresponding to a transform size M = 16. The matrix multiplication $C_{M1}X_b$ is performed on the LHS. It is obtained from the 4 × 4 input stream $X_b$ that is clocked into the PE array at time steps as shown and multiplied by the values of $C_{M1}$ stored internally in the LHS array PEs. The multiplier PEs contain coefficients $W_{ri}$ for row i of $W_b$ and produce $Y_b^t = W_b \bullet C_{M1}X_b$. Finally, the RHS array of PEs accumulates in place (one matrix element per PE) the result $Z_b$, from $Y_b^t$ and the streaming input $C_{M2}$.

[…] $C_{M1}X_b$ and $C_{M2}Y_b^t$ involve only exchanges of real and imaginary parts plus additions, because the elements of $C_{M1}$ and $C_{M2}$ contain only ±1 or $\pm j$, whereas the product in (2) requires complex multiplications. The distribution of the elements in $C_{M2}$ does not impose significant bandwidth requirements because full complex numbers are not used.

[…] $W_b$ in (3) is M/4 × M/4 vs. the M × M size of C in (2). Consequently, the matrix–vector multiply (2) requires $M^2$ multiplications, whereas the element-by-element multiply of (3) requires only $M^2/16$.
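Both claims in this subsection (trivial butterfly coefficients for b = 4 and the factor-of-16 reduction in multiplications) are easy to check numerically. The sketch below assumes the butterfly matrix has elements $e^{-j2\pi ik/b}$; the helper names are hypothetical and only count multiplications, they do not model the PE hardware.

```python
import numpy as np

def butterfly_matrix(b):
    """b-point DFT coefficient matrix with elements exp(-2j*pi*i*k/b)."""
    i = np.arange(b)
    return np.exp(-2j * np.pi * np.outer(i, i) / b)

def only_trivial_entries(C):
    """True if every element of C is one of +1, -1, +j, -j."""
    units = np.array([1, -1, 1j, -1j])
    return bool(np.isclose(C[..., None], units).any(axis=-1).all())

def complex_multiplies(M, b=4):
    """(matrix-vector DFT count, element-by-element count) for M points."""
    N1 = M // b
    return M * M, N1 * N1
```

For b = 4 every coefficient is ±1 or ±j, so the two matrix products reduce to additions and real/imaginary swaps; for b = 8 the check fails because entries such as $(1 - j)/\sqrt{2}$ appear.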

#### 3.3. Matrix–Matrix Systolic Array

[…] $N_1N_2$, $N_1 = 4$, $N_2 = 4$), the basic structure can be used to perform other transform sizes in conceptually the same way. For example, a 15-point transform could be performed as a five-point transform followed by a three-point transform using the structure shown in Figure 4. Here, the LHS SA reads a 3 × 5 array of input data, $X_b$, and performs three five-point transforms. Next, the multiplier PEs perform the twiddle multiplications with $W_{15}^{nm}$, n = 0, 1, 2, 3, 4, m = 0, 1, 2, stored in a small ROM, and finally the RHS performs five three-point transforms.
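The 15-point flow described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the systolic schedule: the 3 × 5 input array is assumed to hold $x[3p + q]$ at row q and column p, the sign convention is $W_{15} = e^{-j2\pi/15}$, and `dft_matrix` and `dft15` are hypothetical names.

```python
import numpy as np

def dft_matrix(n):
    """n-point DFT matrix with elements exp(-2j*pi*i*k/n)."""
    i = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(i, i) / n)

def dft15(x):
    """15-point DFT as five-point transforms, twiddles, then three-point transforms."""
    x = np.asarray(x, dtype=complex)
    X = x.reshape(5, 3).T               # 3 x 5 array, X[q, p] = x[3*p + q]
    A = X @ dft_matrix(5)               # three five-point DFTs, one per row
    m = np.arange(3)[:, None]
    n = np.arange(5)[None, :]
    A = A * np.exp(-2j * np.pi * m * n / 15)   # twiddle multiplications W_15^{nm}
    B = dft_matrix(3) @ A               # five three-point DFTs, one per column
    return B.reshape(15)                # B[m, n] holds output k = n + 5*m
```

The 3 × 5 twiddle array has only 15 entries, which is consistent with the text's remark that it fits in a small ROM.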

## 4. Architecture

#### 4.1. Introduction

[…] $N_1 = 1024/4 = 256$, meaning the DMBA would require 256 PE rows (a “PE row” contains all LHS, multiplier, and RHS PEs along a single horizontal data flow path in Figure 4). This inefficiency can be seen in the first equation of (3), which is rewritten as

[…] $C_b$ is applied multiple times to $X_b$, computing the same result each time. Clearly, it would be best to keep the size of $X_b$ small to avoid the repeated computations. A natural way to do this is to factor the transform size as $N = N_3N_4$. Here, the $N_3 \times N_4$ DFT matrix X contains input samples $x_i$ that are arranged $x_1, x_2, \dots, x_{N_4}$ on row 1, $x_{N_4+1}, x_{N_4+2}, \dots, x_{2N_4}$ on row 2, etc. Using the “row/column” factorization, which is the traditional method to simplify and reduce DFT computations, this requires computation of two sets of smaller DFTs using (3): $N_4$ transforms of length $N_3$ (referred to as column DFTs) and $N_3$ transforms of length $N_4$ (referred to as row DFTs). In between column and row transforms, it is necessary to multiply each of the N points by the twiddle factor, […]

[…] $N_3N_4$, followed by a second factorization, $M = N_1N_2$, where M applies to either the row or column DFTs. The second factorization leads to the base-b FFT processing flow described in the remainder of this section and in more detail in Section 5.
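The first (row/column) factorization can be sketched as follows. This is a minimal illustration, assuming the input is laid out row by row in an $N_3 \times N_4$ array as described above and that the twiddle applied between the two stages is the standard $W_N^{k_3 c}$ with $W_N = e^{-j2\pi/N}$ (the twiddle expression itself is not reproduced in this section). NumPy's `fft` stands in for the base-b column and row DFTs, and `row_column_fft` is a hypothetical name.

```python
import numpy as np

def row_column_fft(x, N3, N4):
    """N-point DFT (N = N3*N4) via column DFTs, twiddles, then row DFTs."""
    x = np.asarray(x, dtype=complex)
    N = N3 * N4
    if x.size != N:
        raise ValueError("input length must equal N3 * N4")
    X = x.reshape(N3, N4)                    # row r holds x[r*N4 : (r+1)*N4]
    A = np.fft.fft(X, axis=0)                # N4 column DFTs of length N3
    k3 = np.arange(N3)[:, None]
    c = np.arange(N4)[None, :]
    A = A * np.exp(-2j * np.pi * k3 * c / N) # twiddle W_N^(k3*c) on all N points
    Z = np.fft.fft(A, axis=1)                # N3 row DFTs of length N4
    return Z.T.reshape(N)                    # Z[k3, k4] is output k = k3 + N3*k4
```

With $N_3 = 32$ and $N_4 = 16$ this reproduces a 512-point FFT from sixteen 32-point and thirty-two 16-point transforms.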

#### 4.2. Row/Column Factorization

#### 4.2.1. Architecture

[…] $(N_3/b) \times b$ PE arrays connected by an $(N_3/b) \times 1$ array of complex multipliers. Each PE on the LHS and RHS contains the PE logic shown in Figure 6 with the addition of a small dual-port RAM to store results. RHS PEs write to the RAM and LHS PEs read from it (Figure 7). The column and row DFTs based on (3) are described below, and again M refers to the transform size of either the column or row DFTs.

#### 4.2.2. Column DFTs

[…] $X_{bci}$, i = 1, ..., $N_4$, from the $N_3 \times N_4$ DFT matrix, with each array of column elements organized as an $M = N_3 = N_{1ci} \times b$ matrix, where c refers to column data. Also, each PE in the $(N_3/b) \times b$ LHS contains an element of the $N_{1ci} \times b$ matrix $C_{M1}$. The systolic matrix–matrix multiplication result $C_{M1}X_{bci}$ then flows out of the LHS through the multiplier array shown in Figure 5, where the coefficient multiplication by $W_b = W_M = W_{N_3}$ (stored in ROM) produces $Y_{bci}^t$. A second systolic matrix–matrix multiplication is then performed by the RHS with inputs $Y_{bci}^t$ from the left and $C_{M2}$ from the bottom, producing the results $Z_{bci}$, which are stored in a distributed fashion in RHS PE RAMs and become the $X_{bri}$ for row DFTs after the twiddle multiplication by $W_N$.

#### 4.2.3. Row DFTs

[…] $N_4 = N_{1ri} \times b$, is identical to that for the column DFTs with two exceptions. First, there is a juxtaposition of the $C_{M1}$ values with the $X_{bri}$ row inputs. In this case, the row $X_{bri}$ values are retrieved from the PE internal RAMs, while the $C_{M1}$ values now flow into the LHS SA from the bottom. Second, the $Z_{bri}$ FFT outputs are stored in a separate RAM output buffer, one per PE row, if normal order data output is desired.

[…] $N_3 \neq N_4$. In this case, the structure of the base-b DMBA will be different for the column processing vs. row processing stages, e.g., one stage could have more PE rows than the other, so that a direct topological PE array match between the $Z_b$ outputs of the column DFT stage and row DFT $X_b$ inputs would not follow. For example, if $N_3 = 32$ and $N_4 = 16$, a base b = 4 DMBA during the column DFTs would require an RHS array of 8 × 4 PEs, since M = 32, and each PE RAM would store one element of each of the 8 × 4 $Z_b$ column DFT outputs. However, row DFT processing would require a 4 × 4 input from PE RAMs to a 4 × 4 LHS array, since for this stage M = 16. This is not possible given the 8 × 4 $Z_b$ PE array size.

[…] $N_3 = 32$, $N_4 = 16$), after the 16 32-point column DFTs, have data for each row DFT written to a single physical PE row. That means that with an 8 × 4 RHS physical array, at the end of column DFT processing the four RAMs in each physical PE row would contain data associated with four of the 32 DFT matrix rows. Note that after column processing, there are always b DFTs contained in the RAMs of a physical PE row because the array length is $N_3/b$ and the number of DFT rows is always $N_3$.

[…] $X_b$ for a row DFT transform come from a single physical PE row and the $Z_b$ outputs are saved within the same single physical PE row in a single output buffer RAM. More details on the rationale for this approach and the virtual processing associated with it are provided in [25].

#### 4.3. Control, PE Structure and Memory

#### 4.4. Reachable Transform Sizes

[…] $N_1/b$ to integer values, so $N_1 = bn$, where n is an integer. Therefore, in the column/row factorization using (3) it follows that both $N_3$ and $N_4$ must be multiples of $b^2$ because $M = N_1N_2 = (bn)b = nb^2$. Then, since $N = N_3N_4$, transform lengths are restricted to integer multiples of $b^4$ when using (3). In [25], the value b = 4 was used, so the circuits reported there were limited to transform sizes that were a multiple of 256. However, other values of b lead to different transform length constraints. For example, if the base b is chosen to be two, then the same analysis would show that a base-2 circuit design could perform any transform that is an integer multiple of 16. Additionally, it is possible that column processing might use one value of b and row processing another.
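The size constraint can be made concrete by enumerating the admissible $(N_3, N_4)$ pairs for a given N and b. The helper below is a hypothetical sketch that applies only the rule stated above (both factors are multiples of $b^2$) and assumes the same b for column and row processing.

```python
def reachable_factorizations(N, b):
    """All (N3, N4) with N = N3 * N4 and both factors multiples of b**2."""
    step = b * b
    return [
        (n3, N // n3)
        for n3 in range(step, N // step + 1, step)
        if N % n3 == 0 and (N // n3) % step == 0
    ]
```

For example, `reachable_factorizations(1024, 4)` yields `[(16, 64), (32, 32), (64, 16)]`, while `reachable_factorizations(128, 4)` is empty because 128 is not a multiple of $4^4 = 256$.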

[…]$^{a}\,3^{b}\,5^{c}\,6^{d}$ and a, b, c, d are integers.

#### 4.5. Physical Array

## 5. Implementation

#### 5.1. Introduction

#### 5.2. Row/Column Processing Details

[…] $X_b$ and $C_{M1}$ in the first PE row, i.e., that connected directly to the input RAM and ROM buffers. Since the operations in all PE rows are identical, it is only necessary to describe data flow in one of the $N_3/b$ PE rows.

[…] $N_4$ column sets $X_{bci}$, and then distributes these column sets to the b RAMs labeled “B”, so that each contains $N_4N_3/b = N_4N_{1ci}$ values prior to the start of FFT processing. More input buffer detail is provided in [25].

#### 5.2.1. Column DFTs

[…] $X_{bci}$ from input buffer RAMs up through the LHS array. The LHS PEs in row 1 use preloaded $C_{M1}$ values, also selected by multiplexors, to perform vector–matrix products as a linear systolic array, producing the first row of $C_{M1}X_{bci}$. Successively produced elements are multiplied by appropriate elements of $W_b = W_M = W_{N_3}$ (stored in a ROM) and become the first row of $Y_{bci}^t$, i = 1, 2, …, $N_4$. As the RHS PE row 1 receives elements of $Y_{bci}^t$, it performs similar matrix–vector multiplications, producing the first column of $Z_{bci}$, i = 1, 2, …, $N_4$. Here, the elements $C_{M2}$ (not shown in Figure 8a or 8b) flow up through registers in the RHS SA from a small ROM (not shown) at the array bottom edge into RHS PEs.

[…] $Z_{bci}$ is computed in the RHS PEs, a multiplexor selects it and sends it sequentially to a normalizer. Here, the growth in the word length at that point from the original n-bit input length is determined and a shift occurs so that the word is restored to n bits with the shift amount saved as an exponent.
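The normalization step can be illustrated with integer arithmetic. This is a sketch only: it assumes two's-complement words, a single shift shared by the real and imaginary parts, and truncation (arithmetic right shift) rather than rounding, none of which is specified in this section. The function name `normalize_complex` is hypothetical.

```python
def normalize_complex(re, im, n_bits):
    """Shift a complex integer until both parts fit in n_bits.

    Returns (re_mantissa, im_mantissa, exponent), where exponent is the
    number of right shifts applied to both parts.
    """
    lo = -(1 << (n_bits - 1))
    hi = (1 << (n_bits - 1)) - 1
    shift = 0
    # Python's >> on negative ints is an arithmetic (sign-preserving) shift.
    while not (lo <= (re >> shift) <= hi and lo <= (im >> shift) <= hi):
        shift += 1
    return re >> shift, im >> shift, shift
```

A value that has grown to 18 bits, such as 100000, comes back as a 16-bit mantissa of 25000 with an exponent of 2.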

[…] $Z_{bci}$ flow to the correct memory.

#### 5.2.2. Row DFTs

[…] $N_4$ based on (1) closely follows that for the column DFTs, with two minor differences. First, the LHS PE input multiplexors are set to take the $X_{bri}$ inputs from the internal PE memories “M” and $C_{M1}$ values from the bottom ROM in Figure 8b, so $C_{M1}X_{bri}$ is computed using a systolic flow of $C_{M1}$ values into the LHS SA. Second, the $Z_{bri}$ outputs (normalized n-bit values plus exponents) are stored in the simple dual-port output buffer “B” instead of the internal memories (equivalent to twiddle multiplication by 1).

#### 5.2.3. Programmability

#### 5.3. Dynamic Range

[…] $z_i$ and the superscript “ref” refers to the value of $z_i$ obtained using Matlab’s double-precision floating-point FFT routine. Here, SQNR is always computed using complex inputs composed of random real and imaginary numbers.

[…] $N_3 \times N_4$ DFT matrix and the maximum scaling amount for that row of the DFT matrix results in an exponent value. Therefore, at the end of the column processing, there are $N_3$ different exponents stored. Normally, for an N-point FFT using BFP scaling, a single exponent applies to all N data values; however, since there are $N_3$ different exponent values here, the SQNR values can be higher than for BFP scaling.

[…] $N_3$ registers to keep track of maximum exponents. The factor of 2 is required because row/column processing stages overlap, and it can be necessary to simultaneously keep track of the scaling exponent values from each stage. A comparator is included in the BFP control block to test for the maximum exponent in each of the $N_3$ DFT matrix rows.

[…] $X_{bri}$ mantissa and exponent values are read from the RAMs (“M” in Figure 9), the exponent values are subtracted from the corresponding maximum exponent value for that DFT matrix row and the difference is supplied to a shifter. The shifter then normalizes the mantissa value so that all $X_{bri}$ input values have the same exponent. When an FFT block is being computed, the $N_3$ maximum exponent values stored in the BFP control block are added to the corresponding output exponents as data is streamed out of the output buffer “B” in Figure 9. Each output value then has a real and imaginary mantissa and a single exponent. A simple output converter has been tested that will convert each to a standard BFP format (single exponent per FFT block); however, it reduces dynamic range.
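The per-row exponent alignment described above amounts to shifting each mantissa by the difference between the row maximum and its own exponent. The sketch below shows that step for real integer mantissas; it assumes truncating right shifts, and `align_row` is a hypothetical name rather than a block in the design.

```python
def align_row(mantissas, exponents):
    """Bring all values of one DFT-matrix row to the row's maximum exponent.

    Each value is mantissa * 2**exponent; returns (aligned_mantissas, e_max).
    """
    e_max = max(exponents)
    aligned = [m >> (e_max - e) for m, e in zip(mantissas, exponents)]
    return aligned, e_max
```

For instance, mantissas `[25000, 1234, -20000]` with exponents `[2, 0, 1]` become `[25000, 308, -10000]` with a shared exponent of 2. Only the values with smaller exponents lose low-order bits, which is why keeping one exponent per DFT matrix row rather than one per FFT block can preserve more precision.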

## 6. Results

#### 6.1. Introduction

[…] $C_{M1/2}$ contains only {±1, ±j} (see Figure 3). The third design (Section 6.3) is like the first, but adds a requirement for run-time choice of FFT size. The last three power-of-two FFT circuits (Section 6.4) use single-precision floating-point formats (IEEE 754), one of which targets the latest in FPGA embedded floating-point hardware support. Finally, in Section 6.5, a complex, non-power-of-two design is presented that uses mixed radices and offers run-time FFT choice, yet has a very simple programming model for performing any size of FFT.

#### 6.2. 256-Point and 1024-Point Streaming (Normal Order In and Out) Fixed-Size FFTs

#### 6.3. Variable FFT Streaming (Normal Order In and Out)

[…] $N_3 = 16$ for all transform sizes. This choice leads to simple 4 × 4 SAs, as in Figure 3 and Figure 6, for the 16-point column transforms with processing based on (3). All the desired twiddle factors $W_N$ can be conveniently found among the elements of $W_{2048}$.

[…] $N_4$-point row transforms using (3) exclusively; however, better efficiency is achieved by factoring some of the row DFTs again ($N_4 = N_5N_6$) and differently for each transform size, so row DFTs are broken into two steps. The rows in Table 3 show how each DFT transform of the 16 DFT matrix rows of length $N_4$ is performed in the SAs. For 128/512/2048 points, the factorization “4 × 2” means that an 8-point transform is done as a four-point transform in the LHS SA followed by a 2-point transform in the RHS SA. For 1024 and 512 points, “4 × 1” means the LHS SA only does the four-point transforms. In Table 3, all 16-point transforms are factored as “4 × 4” and follow the base-4 processing defined by (3) and shown in Figure 3. For example, to do a 2048-point transform, each row of length $N_4 = 128$ is factored with $N_5 = 8$ and $N_6 = 16$. Therefore, for these row transforms, 16 8-point transforms are computed using both LHS and RHS SAs (“4 × 2”), followed by eight 16-point DFTs (“4 × 4”) using (3). The linear multiplier array in Figure 3 is used to perform all associated twiddle multiplications.
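The factorizations in Table 3 can be checked for consistency, and the number of sub-transforms per DFT matrix row derived from them, with a short script. The dictionary below transcribes Table 3; the helper `row_steps` is hypothetical and assumes that a row of length $N_4 = N_5N_6$ is computed as $N_6$ transforms of length $N_5$ followed by $N_5$ transforms of length $N_6$, as in the 2048-point example.

```python
N3 = 16  # column transform length used for every size in this section

# N -> (N4, N5, N6); N5 and N6 are None where the row DFT is done in one step
TABLE3 = {
    128: (8, None, None),
    256: (16, None, None),
    512: (32, 8, 4),
    1024: (64, 4, 16),
    2048: (128, 8, 16),
}

def row_steps(N):
    """(count, length) pairs of sub-transforms for one DFT-matrix row."""
    N4, N5, N6 = TABLE3[N]
    if N3 * N4 != N:
        raise ValueError("table entry inconsistent with N = N3 * N4")
    if N5 is None:
        return [(1, N4)]
    if N5 * N6 != N4:
        raise ValueError("table entry inconsistent with N4 = N5 * N6")
    return [(N6, N5), (N5, N6)]
```

`row_steps(2048)` returns `[(16, 8), (8, 16)]`, matching the sixteen 8-point and eight 16-point transforms quoted in the text.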

[…] $2^2$ delay-feedback (SDF) architecture from Intel [37]. Like most pipelined FFTs, it requires N clock cycles to perform an N-point transform. Finally, at the other end of the spectrum, a circuit architecture is included that adds parallelism based on multi-path delay commutator (MDC) and SDF stages and computes the FFT in far fewer than N clock cycles [4]. All the circuits allow run-time choice of 128/256/512/1024/2048-point transform sizes, except for [36], which also offers 16/32/64 points, and [4], which does all but 128 points but offers a multi-mode capability for processing some transform sizes in parallel.

[…]$^{4}$ SDF stages, all of which have 8 I/O ports. Data output is not in natural order, so a reordering circuit would be necessary to make this design closely comparable. The parallel operation provides processing of eight complex samples per clock and reduces overall latency, but the added circuit complexity results in 11.4/6.7 times more LUTs/registers, yet achieves an FFT throughput time that is only 83% faster than the DMBA. Although the latency (first point in to first point out) in clock cycles (299 for 2048 points) is 11.7 times less than that of the DMBA (3507 clock cycles for 2048 points), the latency in time is 2.69 µs vs. 7.16 µs for the DMBA, an improvement of only 2.7-fold. With a reordering output buffer, the improvement would be reduced by approximately two-fold [4].
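The ratios quoted in this comparison can be re-derived from the cycle counts, the latencies, and the LUT and register figures in Table 4. The short script below does that arithmetic. The implied clock rates are an inference, assuming latency equals cycle count divided by clock frequency; they are not stated in the text.

```python
# Figures as quoted in the text and in Table 4 (2048-point case for latency)
dmba = {"cycles": 3507, "latency_us": 7.16, "luts": 7020, "registers": 7044}
ref4 = {"cycles": 299, "latency_us": 2.69, "luts": 80088, "registers": 47129}

def ratio(a, b, key):
    """a[key] / b[key] for two design records."""
    return a[key] / b[key]

def implied_clock_mhz(design):
    """Clock rate implied by latency = cycles / f_clk (an assumption)."""
    return design["cycles"] / design["latency_us"]

cycle_ratio = ratio(dmba, ref4, "cycles")        # latency ratio in clock cycles
time_ratio = ratio(dmba, ref4, "latency_us")     # latency ratio in time
lut_ratio = ratio(ref4, dmba, "luts")            # LUT cost of the parallel design
reg_ratio = ratio(ref4, dmba, "registers")       # register cost of the parallel design
```

Rounded to one decimal place these give 11.7, 2.7, 11.4 and 6.7, in agreement with the text. The gap between the cycle ratio and the time ratio reflects the very different implied clock rates of the two designs.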

#### 6.4. Floating Point FFT

#### 6.4.1. Introduction

#### 6.4.2. Floating-Point without FPGA Embedded Hardware Support

#### 6.4.3. Floating Point with Embedded Hardware Support

- The Intel design in Table 7, compared to the DMBA (v1), used 2.2 times more ALMs, 1.7 times more LUTs, and 2.2 times more registers, leading to throughput per logic cell numbers that were 169% better. The DMBA “v2” trades off embedded memory (M20Ks) for LUTs, but still used 15% fewer ALMs.
- For the same Arria 10 FPGAs, the DMBA provided 35% higher throughputs.
- The DMBA design clock rates are at Fmax values very near the Arria 10 “restricted” speed limit of 608 MHz. This example shows that DMBA circuits are better able to take advantage of all the built-in speed provided by FPGAs.

#### 6.5. Non-Power-of-Two Circuit (LTE)

#### 6.5.1. Introduction

[…] $2^a 3^b 5^c$ and a, b, c positive integers. FPGAs are targeted because of their rapidly growing use in communications applications, e.g., base stations and remote radio heads at the top of cell phone towers. Here, we provide results of mapping the DMBA to Xilinx Virtex-6 devices.

#### 6.5.2. Related Work

[…] $2^n$, a variety of mixed-radix approaches have been proposed [38,39,40,41]. The performance of the different designs is primarily related to the complexity of the butterfly unit design.

#### 6.5.3. DMBA Design Approach

#### 6.5.4. On-the-Fly-Twiddle Coefficient Calculation

#### 6.5.5. Programmability

[…] $N_3 \times N_4$. Since our implementation consists of 6 × 6 SAs, it would be most efficient to choose the factorization $N_3N_4 = 36 \times 15 = 6^2 \times (3 \times 5)$ because this makes best use of all the hardware. In this case, the processing consists of 15 36-point column DFTs followed by 36 15-point row DFTs. The input $X_b$ is stored in the input buffers in such a way that it is accessible as a sequence of blocks $X_{bci}$, i = 1, …, 15, of 6 × 6 column data. Then, the 36-point column DFTs are done using (3) with $M = N_3 = 36 = N_{1ci}N_2 = 6 \times 6$, $C_{M1} = C_{M2} = C_b$ (b = 6) and $C_b^{i,k} = e^{-j(2\pi ik/b)}$, i = 0, 1, …, b − 1; k = 0, 1, …, b − 1. Each $X_{bci}$ enters the array at the bottom of the LHS SA (Figure 5 and Figure 8) and flows upward with systolic matrix–matrix multiplications performed as before. As each of these 36-point column DFTs is computed, it is multiplied by elements in the 36 × 15 twiddle matrix $W_{540}$, which are generated on-the-fly. During this processing, all PEs are used with 100% efficiency.

[…] $W_{540}$, the multiplexor in Figure 8a is used to store data for the 15-point row DFTs in a way that they can be accessed as 3 × 5 data input blocks, $X_{bri}$, i = 1, …, 36, from the internal PE RAMs. Each of the six PE virtual rows is responsible for storing six DFT matrix rows as six 3 × 5 blocks in associated internal RAMs.

[…] $C_5$ values fed from the bottom of the LHS array. The multiplier array then multiplies these transform values arriving from the LHS array by appropriate elements of a 3 × 5 twiddle matrix stored in a small ROM. Finally, only three PE columns are used on the RHS array to perform all the three-point transforms. Results are stored in an output buffer and are output in normal order.
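The nested factorization used for the 540-point example (36 × 15 at the top level, then 6 × 6 for the column DFTs and 5 × 3 for the row DFTs) can be sketched with one reusable two-stage routine. This is an illustration under stated assumptions: plain Cooley-Tukey splits with twiddles $W_N = e^{-j2\pi/N}$ stand in for the systolic arrays, all twiddles are computed directly rather than generated on-the-fly as in the hardware, and the names `direct`, `two_stage`, `dft36`, `dft15` and `dft540` are hypothetical.

```python
import numpy as np

def direct(n):
    """Return a function applying an n-point DFT along the last axis."""
    i = np.arange(n)
    F = np.exp(-2j * np.pi * np.outer(i, i) / n)
    return lambda V: V @ F

def two_stage(V, Na, Nb, dft_a, dft_b):
    """DFT of length Na*Nb along the last axis via an Na x Nb split."""
    N = Na * Nb
    A = V.reshape(V.shape[:-1] + (Na, Nb))                  # A[r, c] = v[Nb*r + c]
    A = np.swapaxes(dft_a(np.swapaxes(A, -1, -2)), -1, -2)  # Nb column DFTs, length Na
    A = A * np.exp(-2j * np.pi * np.outer(np.arange(Na), np.arange(Nb)) / N)
    A = dft_b(A)                                            # Na row DFTs, length Nb
    return np.swapaxes(A, -1, -2).reshape(V.shape[:-1] + (N,))  # k = ka + Na*kb

def dft36(V):
    return two_stage(V, 6, 6, direct(6), direct(6))

def dft15(V):
    return two_stage(V, 5, 3, direct(5), direct(3))  # five-point, then three-point

def dft540(x):
    return two_stage(np.asarray(x, dtype=complex), 36, 15, dft36, dft15)
```

Here `dft540` performs 15 column DFTs of length 36, multiplies by the 36 × 15 array of $W_{540}$ twiddles, and then performs 36 row DFTs of length 15, mirroring the processing order described above.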

#### 6.5.6. DMBA LTE SC-FDMA Transform Throughput and Latency

#### 6.5.7. Comparison with Commercial Circuits

#### 6.5.8. Other FPGA LTE Implementations

## 7. Conclusions

- Architecture suitability for a wide range of both power-of-two and non-power-of-two transform sizes, so that all Section 6 example circuits could be based on the same architecture, whereas the Intel designs rely on several architectures to do the equivalent.
- Improved throughput rates resulting from high clock speeds and new algorithms (>500 MHz for 65 nm FPGA technologies).
- Reduced use of FPGA LUT and register fabric. For example, Intel’s latest IEEE 754 floating-point FFTs using an Arria 10 FPGA target device required 2.2/1.7/2.2 times more ALMs/LUTs/registers for a 1024-point FFT than our DMBA equivalent, which also has a 39% higher throughput (Table 7).
- A 20 to 30 dB increase in SQNR derived from the combined block floating-point and floating-point scaling scheme (Table 1).
- Throughput scalability by increasing the PE array length, $N_3/b$, using more than one array to do the column processing, or increasing the size of b.
- Ability to do 2-D and 3-D transforms by not performing the row/column twiddle multiplications.

## Funding

## Acknowledgments

## Conflicts of Interest

## References

- Brigham, E.O. The Fast Fourier Transform and Its Applications; Prentice Hall: Upper Saddle River, NJ, USA, 1988.
- Rao, K.R.; Kim, D.N.; Hwang, J.J. Fast Fourier Transform: Algorithms and Applications; Springer: Berlin, Germany, 2011.
- The Mobile Broadband Standard. Available online: http://www.3gpp.org/LTE (accessed on 22 May 2018).
- Dinh, P.; Lanante, L.; Nguyen, M.; Kurosaki, M.; Ochi, H. An area-efficient multimode FFT circuit for IEEE 802.11ax WLAN devices. In Proceedings of the 19th IEEE International Conference on Advanced Communications Technology (ICACT2017), PyeongChang, Korea, 19–22 February 2017; pp. 735–739.
- DVB-T2. Available online: https://www.dvb.org/standards/dvb-t2 (accessed on 22 May 2018).
- Yang, Z.-X.; Hu, Y.-P.; Pan, C.-Y.; Yang, L. Design of a 3780-point IFFT processor for TDS-OFDM. IEEE Trans. Broadcast. **2002**, 48, 57–61.
- Richards, M.A.; Sheer, J.A.; Holm, W.A. Principles of Modern Radar: Basic Principles; SciTech Publishing: Raleigh, NC, USA, 2010.
- Maimaitijiang, Y.; Wee, H.C.; Roula, A.; Watson, S.; Patz, R.; Williams, R.J. Evaluation of parallel FFT implementations on GPU and multi-core PCs for magnetic induction tomography. In Proceedings of the World Congress on Medical Physics and Biomedical Engineering (IFMBE), Munich, Germany, 7–12 September 2009.
- Sheng, J.; Humphries, B.; Zhang, H.; Herbordt, M. Design of 3D FFTs with FPGA clusters. In Proceedings of the High Performance Extreme Computing Conference (HPEC), Boston Area, MA, USA, 9–11 September 2014.
- Hockney, R.W.; Eastwood, J.W. Computer Simulation Using Particles; Adam Hilger: Bristol, UK, 1988.
- Rodriguez-Andina, J.J.; Moure, M.J.; Valdes-Pena, M.D. Advanced features and industrial applications of FPGAs: A review. IEEE Trans. Ind. Inform. **2015**, 11, 853–864.
- Tessier, R.; Pocek, K.; DeHon, A. Reconfigurable computing architectures. Proc. IEEE **2015**, 103, 332–354.
- Trimberger, S.M. Three ages of FPGAs: A retrospective on the first thirty years of FPGA technology. Proc. IEEE **2015**, 103, 318–331.
- Stratix 10 Overview. Available online: https://www.altera.com/products/fpga/stratix-series/stratix-10/overview.html (accessed on 22 May 2018).
- Garrido, M.; Grajal, J.; Sanchez, M.A.; Gustafsson, O. Pipelined radix-$2^k$ feedforward FFT architectures. IEEE Trans. Very Large Scale Integr. Syst. **2011**, 21, 23–32.
- Ingemarsson, C.; Källström, P.; Qureshi, F.; Gustafsson, O. Efficient FPGA Mapping of Pipeline. IEEE Trans. Very Large Scale Integr. Syst. **2017**, 25, 2486–2497.
- Garrido, M.; Sanchez, M.A.; Lopez-Vallejo, M.L.; Grajal, J. A 4096-point radix-4 memory-based FFT using DSP slices. IEEE Trans. Very Large Scale Integr. Syst. **2017**, 25, 375–379.
- Xing, Q.-J.; Ma, Z.-G.; Xu, Y.-K. A novel conflict-free parallel memory access scheme for FFT processors. IEEE Trans. Circuits Syst. **2017**, 64, 1347–1351.
- Kung, S. VLSI Array Processors; Prentice Hall: Upper Saddle River, NJ, USA, 1988.
- He, S.; Torkelson, M. A new expandable 2D systolic array for DFT computation based on symbiosis of 1D arrays. In Proceedings of the Algorithms and Architectures for Parallel Processing (ICAPP), Brisbane, Australia, 19–21 April 1995.
- Kar, D.C.; Rao, V.V.B. A new systolic realization for the discrete Fourier transform. IEEE Trans. Signal Proc. **1993**, 41, 2008–2010.
- Ling, N.; Bayoumi, M.A. Systematic algorithm mapping for multidimensional systolic arrays. J. Parallel Distrib. Comput. **1989**, 7, 368–382.
- Mamatha, I.; Sudarshan, T.S.B.; Tripathi, S.; Bhattar, N. Triple-Matrix Product-Based 2D Systolic. Circuits Syst. Signal Process. **2015**, 34, 3221–3239.
- Cong, J.; Xiao, B. FPGA-RPI: A novel FPGA architecture with RRAM-based programmable interconnects. IEEE Trans. Very Large Scale Integr. Syst. **2014**, 22, 864–877.
- Nash, J.G. Computationally efficient systolic architecture for computing the discrete Fourier transform. IEEE Trans. Signal Process. **2005**, 53, 4640–4651.
- Nash, J.G. High-throughput programmable systolic array FFT architecture and FPGA implementations. In Proceedings of the International Conference on Computing, Networking and Communications (ICNC), Honolulu, HI, USA, 3–6 February 2014.
- Nash, J. A new class of high performance FFTs. In Proceedings of the Acoustics, Speech and Signal Processing (ICASSP), Honolulu, HI, USA, 15–20 April 2007.
- Nash, J.G. A high performance scalable FFT. In Proceedings of the Wireless Communications and Networking Conference (WCNC), Hong Kong, China, 11–15 March 2007.
- Cortes, A.; Velez, I.; Sevillano, J.F.; Irizar, A. An approach to simplify the design of IFFT/FFT cores for OFDM systems. IEEE Trans. Consum. Electron. **2006**, 52, 26–32.
- Wenqi, L.; Wang, X.; Xiangran, S. Design of fixed-point high-performance FFT processor. In Proceedings of the 2nd International Conference on Education Technology and Computer (ICETC), Shanghai, China, 22–24 June 2010.
- Lee, Y.; Yu, T.; Huang, K.; Wu, A. Rapid IP design of variable-length cached-FFT processor for OFDM-based communication systems. In Proceedings of the IEEE Workshop on Signal Processing Systems Design and Implementation, Banff, AB, Canada, 2–4 October 2006.
- Altera FFT. MegaCore Function User Guide (ug-fft-13.1); Altera FFT: San Jose, CA, USA, 2013.
- Products/FFT IP Cores. Available online: http://www.girdsystems.com/prod-FFTcores-pd.html (accessed on 22 May 2018).
- Spiral Software/Hardware Generation for DSP Algorithms. Available online: http://www.spiral.net/hardware/dftgen.html (accessed on 28 June 2018).
- Jacobson, A.T.; Truong, D.; Baas, B. The design of a reconfigurable continuous-flow mixed-radix FFT processor. In Proceedings of the IEEE International Symposium on Circuits and Systems, Taipei, Taiwan, 24–27 May 2009; pp. 1133–1136.
- Revanna, D.; Cucchi, M.; Anjum, O.; Airoldi, R.; Nurmi, J. A scalable FFT processor architecture for OFDM based communication systems. In Proceedings of the Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIII), Samos, Greece, 15–18 July 2013; pp. 19–27.
- Altera FFT. MegaCore Function User Guide (IPv17.1); Altera FFT: San Jose, CA, USA, 2017.
- Altera FFT. DFT/IDFT Reference Design, Application Note 464; Altera FFT: San Jose, CA, USA, 2007.
- Chen, J.; Hu, J.; Li, S. High throughput and hardware efficient FFT architecture for LTE application. In Proceedings of the 2012 IEEE Wireless Communications and Networking Conference, Las Vegas, NV, USA, 10–15 June 2012; pp. 826–831.
- Niras, C.V.; Thomas, V. Systolic variable length architecture for discrete Fourier transform in Long Term Evolution. In Proceedings of the International Symposium on Electronic System Design, Kolkata, India, 19–22 December 2012.
- Xilinx. Xilinx Discrete Fourier Transform v3.1, DS615; Xilinx: San Jose, CA, USA, 2011.

**Figure 2.**Processing architecture associated with (3). Input buffers are the non-colored boxes on the bottom. Data transfers are in groups of b words between blocks as indicated by the arrows.

**Figure 3.**Basic architecture for b = 4 and N

_{1}= 4 showing calculation of a 16-point DFT. Inputs X

_{b}(LHS array) and C

_{M}

_{2}(RHS array) are shown at the bottom at clock cycle time t. (Subscript b for matrix elements x and z are not shown).

**Figure 4.**SA used for a 15-point transform. C

_{b}are radix-5 and three butterfly matrices. Here, X

_{b}can be stored in LHS internal PE RAMs with C

_{b}flowing upward from the LHS input buffer or vice-versa.

**Figure 5.**Systolic processing flow for column and row DFTs. Subscripts c and r refer to column and row DFT transform data. Data transfers are in groups of b words between blocks as indicated by the arrows.

**Figure 7.**Physical architecture created by folding the LHS and RHS arrays from Figure 5 on top of each other.

**Figure 8.** Systolic processing flow for column (**a**) and row (**b**) DFTs. For both column and row DFTs, the $C_{M2}$ inputs to the RHS PEs, which come from the bottom of the array (Figure 5), are not shown. In (**a**) the $C_{M1}$ data flow paths are not shown and in (**b**) the $X_{bci}$ data flow paths are not shown.

**Figure 9.** Additional details of the BFP/FP systolic data flow not shown in Figure 8.

**Figure 10.** LTE RB composition. Each symbol (vertical column) corresponds to a DFT of size N, where N is divided into minimum groups of 12 subcarrier coefficients to support FDMA. The minimum transmission RB consists of 12 DFT coefficients by seven consecutive DFTs.

**Table 1.** SQNR values for different FFT scaling methodologies (16-bit input data and 1024-point transform).

Circuit | Intel | DMBA | Spiral | Intel | DMBA | Spiral | Gird |
---|---|---|---|---|---|---|---|
Word length | 20-Bits | 16-Bits | 20-Bits | 20-Bits | 16-Bits | 20-Bits | 16-Bits |
Transform Size | 256 | 256 | 256 | 1024 | 1024 | 1024 | 1024 |
ALMs | 4261 | 3982 | 3632 | 4394 | 4357 | 5237 | n.a. |
LUTs | 4416 | 6046 | 3934 | 4632 | 6426 | 3517 | 5649 |
Registers | 7841 | 6437 | 7413 | 8206 | 6718 | 9411 | 6190 |
Memory (M9Ks) | 38 | 31 | 44 | 38 | 31 | 79 | 52 |
Memory (K-bits) | 49 | 41 | 85 | 195 | 145 | 362 | n.a. |
Real Multipliers | 24 | 33 | 112 | 24 | 33 | 144 | 96 |
Fmax (MHz) | 387 | 566 | 292 | 382 | 533 | 293 | 263 |
SQNR | 88 | 87 | 86 | 81 | 83 | 80 | n.a. |
µJ/FFT | 1.3 | 1.1 | n.a. | 6.4 | 4.3 | n.a. | n.a. |
Time per FFT (µs) | 0.66 | 0.45 | 0.44 | 2.7 | 1.9 | 1.7 | 3.9 |
Thrpt/logic cell | 25 | 35 | 40 | 5.8 | 7.9 | 8.9 | 4.3 |
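The per-FFT time row (in µs) can be cross-checked against the Fmax row. The sketch below assumes a streaming circuit that accepts a fixed number of complex samples per clock, so that an N-point transform takes roughly N / (samples per cycle × Fmax). This relationship is not stated in this section; it is an assumption that happens to reproduce the tabulated values. The helper name `fft_time_us` is hypothetical.

```python
def fft_time_us(n_points: int, fmax_mhz: float, samples_per_cycle: int = 1) -> float:
    """Approximate time per N-point FFT in microseconds.

    Assumes a fully streaming circuit clocked at fmax_mhz that consumes
    samples_per_cycle complex samples on every clock (an assumption, not a
    statement from the table).
    """
    cycles = n_points / samples_per_cycle
    return cycles / fmax_mhz  # MHz is cycles per microsecond


# (circuit, N, Fmax in MHz, tabulated time in µs), one sample per clock assumed
rows = [
    ("Intel", 256, 387, 0.66),
    ("DMBA", 256, 566, 0.45),
    ("Intel", 1024, 382, 2.7),
    ("DMBA", 1024, 533, 1.9),
    ("Gird", 1024, 263, 3.9),
]

for name, n, fmax, tabulated in rows:
    print(f"{name:5s} N={n:4d}: estimated {fft_time_us(n, fmax):.2f} us, tabulated {tabulated} us")
```

Under the same assumption, the Spiral entries (0.44 µs and 1.7 µs) match `samples_per_cycle=2`, which suggests, but does not confirm, that those circuits process two samples per clock.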

N | $N_4$ | F | $N_5$ | F | $N_6$ | F |
---|---|---|---|---|---|---|
128 | 8 | (4 × 2) | - | - | - | - |
256 | 16 | (4 × 4) | - | - | - | - |
512 | 32 | - | 8 | (4 × 2) | 4 | (4 × 1) |
1024 | 64 | - | 4 | (4 × 1) | 16 | (4 × 4) |
2048 | 128 | - | 8 | (4 × 2) | 16 | (4 × 4) |

**Table 4.** Comparisons of commercial (Intel, DMBA) and other streaming variable FFT circuits (16-bit, run-time choice of sizes).

Circuit | Intel | DMBA | [36] | [4] |
---|---|---|---|---|
ALMs | 6089 | 4785 | n.a. | n.a. |
LUTs | 5453 | 7020 | 1143 | 80,088 |
Registers | 9752 | 7044 | 1754 | 47,129 |
Memory (K-bits) | 203 | 290 | n.a. | 117 |
Memory (M9Ks) | 28 | 42 | n.a. | n.a. |
Real Multipliers | 68 | 33 | 4 | n.a. |
Fmax (MHz) | 283 | 490 | 200 | 111 |
SQNR (average) | 90 | 84 | n.a. | n.a. |
2048-point FFT time (µs) | 7.2 | 4.2 | 57 | 2.3 |
Thrpt/logic cell | 1.8 | 3.4 | 1.2 | 0.68 |

**Table 5.** Comparisons of commercial (Intel, DMBA) and Spiral single-precision floating-point FFT circuits, when the FPGA has no embedded floating-point support.

Circuit | Intel | DMBA v1 | DMBA v2 | Spiral | Intel | DMBA | Spiral |
---|---|---|---|---|---|---|---|
Transform | 256-Points | 256-Points | 256-Points | 256-Points | 1024-Points | 1024-Points | 1024-Points |
ALMs | 10,834 | 7137 | 7834 | 22,039 | 13,559 | 7186 | 27,014 |
LUTs | 16,519 | 11,050 | 12,006 | 16,818 | 21,801 | 11,193 | 21,252 |
Registers | 15,545 | 10,431 | 12,535 | 32,946 | 18,169 | 10,495 | 42,054 |
Memory (M9Ks) | 54 | 62 | 30 | 36 | 87 | 62 | 90 |
Real Multipliers | 48 | 129 | 129 | 96 | 64 | 129 | 128 |
Fmax (MHz) | 299 | 456 | 426 | 260 | 285 | 386 | 260 |
Thrpt/logic cell | 7.3 | 16.6 | 13.6 | 8.2 | 1.4 | 3.5 | 1.6 |

**Table 6.** Error comparisons for single-precision floating-point FFT circuits in Table 5.

Circuit | Intel | DMBA | Spiral | Intel | DMBA | Spiral |
---|---|---|---|---|---|---|
Transform | 256-Point | 256-Point | 256-Point | 1024-Point | 1024-Point | 1024-Point |
Mean | $2.4 \times 10^{-7}$ | $3.1 \times 10^{-8}$ | $3.5 \times 10^{-7}$ | $2.9 \times 10^{-7}$ | $4.2 \times 10^{-8}$ | $4.4 \times 10^{-7}$ |
Std Deviation | $2.5 \times 10^{-7}$ | $3.1 \times 10^{-8}$ | $4.2 \times 10^{-7}$ | $3.4 \times 10^{-7}$ | $7.7 \times 10^{-8}$ | $5.6 \times 10^{-7}$ |
Maximum | $2.4 \times 10^{-5}$ | $4.3 \times 10^{-6}$ | $3.4 \times 10^{-5}$ | $9.1 \times 10^{-5}$ | $2.7 \times 10^{-5}$ | $8.2 \times 10^{-5}$ |

**Table 7.** Commercial single-precision 1024-point floating-point FFT circuit comparisons, when the FPGA has embedded floating-point support.

Circuit | Intel | DMBA v1 | DMBA v2 |
---|---|---|---|
ALMs | 4852 | 2251 | 4106 |
LUTs | 6058 | 3531 | 6795 |
Registers | 10,844 | 4969 | 6121 |
Memory (M20Ks) | 20 | 62 | 30 |
MLAB Memory Bits | 4136 | 4776 | 70,312 |
DSP Blocks | 64 | 96 | 96 |
Fmax (MHz) | 432 | 585 | 572 |
Thrpt/logic cell | 5.0 | 13.4 | 8.7 |

N | T | L | N | T | L | N | T | L |
---|---|---|---|---|---|---|---|---|
1296 | 2592 | 5006 | 576 | 1154 | 2211 | 180 | 365 | 609 |
1200 | 3601 | 5798 | 540 | 1081 | 2096 | 144 | 289 | 569 |
1152 | 3458 | 5516 | 480 | 962 | 1878 | 120 | 244 | 424 |
1080 | 2160 | 4178 | 432 | 864 | 1694 | 108 | 221 | 390 |
972 | 3891 | 5550 | 384 | 770 | 1510 | 96 | 200 | 347 |
960 | 2881 | 4599 | 360 | 721 | 1412 | 72 | 149 | 283 |
900 | 1801 | 3464 | 324 | 651 | 1244 | 60 | 124 | 243 |
864 | 1728 | 3350 | 300 | 601 | 1168 | 48 | 102 | 206 |
768 | 2304 | 3686 | 288 | 578 | 1143 | 36 | 36 | 133 |
720 | 1440 | 2798 | 240 | 481 | 951 | 24 | 24 | 95 |
648 | 1296 | 2522 | 216 | 437 | 717 | 12 | 12 | 66 |
600 | 1201 | 2338 | 192 | 384 | 759 | | | |
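The N columns of the table above list 35 transform sizes between 12 and 1296. These values coincide with every integer of the form 12 · 2^a · 3^b · 5^c up to 1296, which fits the 12-subcarrier grouping described in the Figure 10 caption. That factorization is an observation drawn from the tabulated values, not a rule stated in this section, and the function name `dft_sizes` is hypothetical. A minimal Python sketch that enumerates the same set:

```python
def dft_sizes(max_n: int = 1296, base: int = 12) -> list[int]:
    """Enumerate base * 2^a * 3^b * 5^c <= max_n, largest first.

    base=12 reflects the minimum group of 12 subcarriers; the restriction to
    factors 2, 3 and 5 is an assumption inferred from the tabulated sizes.
    """
    sizes = set()
    p2 = 1
    while base * p2 <= max_n:
        p3 = 1
        while base * p2 * p3 <= max_n:
            p5 = 1
            while base * p2 * p3 * p5 <= max_n:
                sizes.add(base * p2 * p3 * p5)
                p5 *= 5
            p3 *= 3
        p2 *= 2
    return sorted(sizes, reverse=True)


sizes = dft_sizes()
print(len(sizes))   # number of distinct sizes
print(sizes[:5])    # largest few sizes
```

Sizes with any other prime factor, such as 1008 = 12 · 84 (84 contains a factor of 7), are excluded by construction.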

Circuit | FPGA | LUT | Reg | Blk RAM | Mult (18-Bit) | Fmax (MHz) | RB Avg | Thrpt Norm |
---|---|---|---|---|---|---|---|---|
DMBA | Virtex-6 | 2915 | 2581 | 19 | 72 | 401 | 16.6N | 1.00 |
Xilinx | Virtex-6 | 3851 | 4326 | 10 | 16 | 407 | 23.4N | 0.72 |
DMBA | Stratix III | 3816 | 3188 | 29 | 60 | 417 | 16.6N | 1.00 |
Intel | Stratix III | 2600 | n.a. | 17 | 32 | 260 | 32.9N | 0.31 |

Circuit | FPGA | LUT | Reg | Blk RAM | Mult (18-Bit) | Fmax (MHz) | Thrpt (Cycles) | Thrpt Norm |
---|---|---|---|---|---|---|---|---|
[39] | Virtex-5 | 7791 | n.a. | 7 | 44 | 123 | N | 0.64 |
[40] | Virtex-6 | 10,768 | 786 | 45 | 41 | 61 | N | 0.32 |
DMBA | Virtex-6 | 2915 | 2581 | 19 | 72 | 401 | 2.1N | 1.00 |

Function | Circuit | Size (Points) | FPGA | Efficiency Increase (%) |
---|---|---|---|---|
Fixed | Intel | 256 | Stratix III | 44 |
Fixed | Spiral | 256 | Stratix III | −12 |
Fixed | Intel | 1024 | Stratix III | 36 |
Fixed | Spiral | 1024 | Stratix III | −11 |
Variable | Intel | 128:2048 | Stratix IV | 87 |
Variable | [36] | 16:2048 | Stratix V | 181 |
Flt. Pt. | Intel | 256 | Stratix IV | 128 |
Flt. Pt. | Spiral | 256 | Stratix IV | 103 |
Flt. Pt. | Intel | 1024 | Stratix IV | 150 |
Flt. Pt. | Spiral | 1024 | Stratix IV | 117 |
Flt. Pt. | Intel | 1024 | Arria 10 | 169 |
SC-FDMA | Xilinx | 12:1296 | Virtex 6 | 130 |
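The efficiency figures in the table above appear consistent with the relative increase in throughput per logic cell (the "Thrpt/logic cell" rows of the earlier comparison tables) of the DMBA circuit over each competing design. The formula below is an assumption, since this section does not define it, and the helper name `efficiency_increase_pct` is hypothetical. Because the tabulated throughput values are rounded, the recomputed percentages can differ from the table by a few points.

```python
def efficiency_increase_pct(dmba_thrpt: float, other_thrpt: float) -> float:
    """Relative increase (%) of DMBA throughput per logic cell over another design.

    Assumed definition: (DMBA / other - 1) * 100. A negative result means the
    other design has the higher throughput per logic cell.
    """
    return (dmba_thrpt / other_thrpt - 1.0) * 100.0


# (comparison, DMBA Thrpt/logic cell, other Thrpt/logic cell, tabulated %)
cases = [
    ("Fixed 1024, Intel", 7.9, 5.8, 36),
    ("Fixed 1024, Spiral", 7.9, 8.9, -11),
    ("Flt. Pt. 1024, Intel", 3.5, 1.4, 150),
    ("Flt. Pt. 1024, Intel (Arria 10)", 13.4, 5.0, 169),
]

for label, dmba, other, tabulated in cases:
    pct = efficiency_increase_pct(dmba, other)
    print(f"{label}: recomputed {pct:.0f}%, tabulated {tabulated}%")
```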

© 2018 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Nash, J.G.
Distributed-Memory-Based FFT Architecture and FPGA Implementations. *Electronics* **2018**, *7*, 116.
https://doi.org/10.3390/electronics7070116
