Improving Generalized Discrete Fourier Transform (GDFT) Filter Banks with Low-Complexity and Reconfigurable Hybrid Algorithm

With ever-increasing wireless network demands, low-complexity reconfigurable filter design is expected to continue to require research attention. Extracting and reconfiguring channels of choice from multi-standard receivers using a generalized discrete Fourier transform filter bank (GDFT-FB) is computationally intensive. In this work, a lower-complexity algorithm is developed for this filter bank. The design employs two different approaches: hybridization of the generalized discrete Fourier transform filter bank with frequency response masking and coefficient decimation method 1; and the improvement and implementation of the hybrid generalized discrete Fourier transform using a parallel distributed arithmetic-based residual number system (PDA-RNS) filter. The design is evaluated using MATLAB 2020a. Synthesis of area, resource utilization, delay, and power consumption was done on a Quartus 11 Altera 90 using the very high-speed integrated circuits (VHSIC) hardware description language. In the MATLAB simulations, the proposed HGDFT algorithm attained a 66% reduction in the number of multipliers, compared with existing algorithms. In co-simulation on the Quartus 11 Altera 90, optimization of the filter with PDA-RNS resulted in a 77% reduction in the number of occupied lookup table (LUT) slices, an 83% reduction in power consumption, and an 11% reduction in execution time, when compared with existing methods.


Introduction
The high computational complexity and low reconfigurability of generalized discrete Fourier transform filter banks (GDFT-FBs) render them unfit to handle the upcoming radio standards in software-defined radio (SDR) handsets. The main cause of this complexity is the large number of multipliers consumed during the channelization operations. Multipliers contribute substantially to the complexity of digital filters and channelization algorithms, as evidenced by the high filter orders obtained during implementation. They slow down computation, limit filter bank reconfigurability, and increase resource utilization, production costs, and power consumption. The extent of complexity and reconfigurability differs among existing channelization algorithms, from uniform channelization algorithms, such as the per-channel (PC), pipelined/binary, and pipelined frequency transform (PFT) algorithms, to the non-uniform ones. A review of these algorithms is summarized in Table 1. The major challenges of channelization algorithms, as shown in Table 1, are the high filter orders, with attendant computational complexity, and low reconfigurability. The computational load is rated on the following scale: Very High, High, and Low. Very High indicates a high filter order and a large number of filter coefficients, High indicates a moderately high filter order, and Low denotes a low filter order and few filter coefficients. Furthermore, the reconfigurability performance in Table 1 is rated on a scale whose levels include Good.
In order to address these challenges, different approaches have been proposed to reduce the effects of multipliers in the design of FIR filters. Distributed arithmetic (DA) is a multiplierless memory-based architecture, which was proposed to replace the multiplications in signal processing with a combinational lookup table (LUT) [27][28][29][30].
DA replaces the multiply-accumulate (MAC) operation of convolution with a bit-serial lookup table read-write operation. This approach reduces the number of multipliers to a minimum, but compromises the operating speed and the required memory. Many researchers have addressed the problems facing DA. Partial or full parallel structures [31,32] can be used to overcome the speed limitations of bit-serial DA, but at the cost of an exponential increase in memory requirements. Yoo and Anderson also proposed an LUT-less architecture comprised of multiplexers and adder pairs. However, the gain in area reduction was offset by the cost of an increased critical path. LUT decomposition, or slicing of the LUT, has been suggested in [33]. An indexed LUT DA FIR filter has been proposed [27], which consists of indexed LUT pages (each of size $2^n$) and an m-bit multiplexer unit as a page selection module. Indexing of the LUT controls the exponential DA growth and eliminates the need for adders. LUT partitioning has been proposed in [34] to reduce the memory usage of the LUT for higher-order FIR filters. This design provides lower latency, lower memory usage, and higher throughput, when compared with conventional DA. The author in [35] proposed a memoryless distributed arithmetic-based adaptive filter for low power and area efficiency. In this case, the conventional DA was replaced by 2:1 multiplexers, in order to reduce area. Replacing the normal adder with a 4:2 compressor adder further improved the area complexity. The author in [29] proposed the use of a modified DA method to compute the sum of products, saving a considerable number of multiply and accumulate blocks and reducing the circuit size considerably. There were 40% fewer LUT flip-flop pairs used, at the expense of speed. A DA-based LMS adaptive filter using offset binary coding without an LUT has been presented, in order to improve the performance of bit-serial operation [36].
Additionally, a DA-RNS-based filter implementation has been used for the effective calculation of modular inner products in an FIR filter [28]. The RNS enables high-speed processing, due to the absence of carry propagation, thus offering a solution to the speed limitations of the conventional DA approach. Reducing the memory requirements, area utilization, and delay of DA is very important for the implementation of a digital FIR filter. A residue number system (RNS) has been proposed to offer such a solution, providing high operating speeds with reduced word length, area utilization, and power consumption. The RNS is a non-weighted number system that can speed up arithmetic operations, due to its features of carry-free propagation and parallelism. This results in carry-free addition, multiplication, and subtraction [37][38][39]. The most important factor to consider when choosing an RNS for an FIR filter is the moduli set. The choice of moduli set greatly influences the area utilization, speed, cost, and power consumption of the hardware design. Different research efforts on the influence of the moduli set on hardware complexity can be found in the literature. Sweidan and Hiasat proposed an algorithm that requires four binary adders, of which two operate in parallel mode, which resulted in higher speed and a smaller silicon area [40]. Furthermore, Amir Sabbagh and Keivan [41] presented two residue-to-binary converters, using the $\{2^n, 2^{n+1} + 1, 2^{n+1} - 1\}$ moduli set. This moduli set consists of well-formed moduli and is balanced, resulting in better and faster RNS implementation. Prem Kumar, in [42], described a residue number to binary converter, which converts numbers in the moduli set $\{2n + 2, 2n + 1, 2n\}$, with 2 as a common factor. This algorithm achieved faster conversion.
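The carry-free, channel-wise arithmetic described above can be illustrated with a minimal Python sketch. It assumes the three-moduli set $\{2^n - 1, 2^n, 2^n + 1\}$ with a small illustrative value n = 3 and unsigned operands whose results stay below the dynamic range; the function names are hypothetical and the reconstruction uses the Chinese remainder theorem, not any particular converter from the cited works.

```python
from math import prod

def moduli_set(n):
    """Return the three-moduli set (2^n - 1, 2^n, 2^n + 1)."""
    return (2**n - 1, 2**n, 2**n + 1)

def to_rns(x, moduli):
    """Forward conversion: one independent residue per modulus."""
    return tuple(x % m for m in moduli)

def rns_add(a, b, moduli):
    """Channel-wise addition; no carry crosses between channels."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, moduli))

def rns_mul(a, b, moduli):
    """Channel-wise multiplication; each channel works on short words."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))

def from_rns(residues, moduli):
    """Reverse conversion with the Chinese remainder theorem (CRT)."""
    M = prod(moduli)                      # dynamic range
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m): modular inverse
    return total % M

moduli = moduli_set(3)                    # (7, 8, 9), dynamic range 504
a, b = to_rns(13, moduli), to_rns(21, moduli)
sum_value = from_rns(rns_add(a, b, moduli), moduli)
product_value = from_rns(rns_mul(a, b, moduli), moduli)
```

Each channel operates only on residues smaller than its modulus, which is why the word length, and hence the adder and multiplier width, shrinks relative to a conventional binary datapath.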
Moreover, [43] discussed a high-speed realization of a residue-to-binary converter for the $\{2^n + 1, 2^n, 2^n - 1\}$ moduli set, which improved upon the best-known implementation twofold, in terms of the overall delay time. The algorithm exploited certain symmetrical properties in its implementation, in order to reduce the hardware requirement by n − 1 full adders. It also reduced redundancy in its implementation. Another approach has been proposed to perform inner product computation based on distributed arithmetic principles [44]. The input data are represented in the residue domain and encoded with thermometer code, while the output data are encoded in the one-hot code format. The operating speed of a one-hot code modular adder was superior to that of the conventional binary code. A non-recursive digital filter was presented based on the moduli set $\{2^n - 1, 2^n, 2^n + 1\}$, using diminished-1 representation [45]. The method investigated the usage of an (n + 1)-bit circuit for the $2^n + 1$ channel.
A forward converter for RNS with diminished-1 encoded channels has been proposed by [46]. Furthermore, multiplication was eliminated in the design of a RNS converter [43]. Thus, fewer multipliers and adders were used in the design. This invariably reduced the hardware complexity and increased the speed. A dual sum carry look-ahead adder [47], which consists of a circular carry generator and a multiplexer, has been designed with reduced complexity. Jemmy, Yung Shem eliminated the bottleneck encountered in the carry propagation additions and modular adder free of the existing designs. This method resulted in a reduced power factor and leakage power.
Vinnakota and Rao discussed an RNS-to-binary converter [48] and showed it to be a simple modification of the well-known mixed radix conversion technique. The evaluation of this algorithm and comparison with the existing algorithms showed improvements in terms of speed and cost, but not in terms of delay and area. A conjugate moduli set was presented in hardware-efficient two-level implementations of the weighted-to-RNS and RNS-to-weighted conversions [49]. The design offered 25 to 40% hardware savings, a reduction of 80% in the complexity of the CRT, and achieved a higher dynamic range.
Kotha et al. [39] proposed a new modular multiplication for the moduli set $\{2^k - 1, 2^k, 2^{k+1} - 1\}$ for a fixed-point coefficient FIR filter. This algorithm improved the clock rates and reduced the area and power consumption, compared to conventional modular multiplication. Ahmad Hiasat [40] designed a converter consisting of three 4n-bit carry save adders (CSAs), together with an additional modulo $2^{4n} - 1$ adder. This led to a reduction in hardware requirements, concerning area, delay, power, and energy efficiency. Richard Conway and John Nelson, in [50], used a moduli set of the form $\{2^n - 1, 2^n, 2^n + 1\}$, with a design primarily based on CSAs and one carry propagation adder (CPA), without the need for a look-up table (LUT). Their design occupied less silicon area and was very fast. The authors proposed a new CRT property, in order to reduce the total dynamic range. The overall result was faster and more efficient, with improved delay and area cost.
Kazeem Alagbe Gbolade et al. [51] used the CRT to obtain a reverse converter that uses mod (2n − 1) operations, instead of mod (2n + 1)(2n − 1) and (2n)(2n − 1). This approach is traditional in nature, but it yielded better performance in terms of conversion time, area, cost, and power consumption. Mohan [48] compared the designs of Vinnakota and Rao and of Piestrak with the design of Andraros and Ahmad. It was seen that the design of Andraros and Ahmad was more cost-effective, in terms of delay and speed, when compared to Vinnakota and Rao's design. The design used the moduli set $\{2^n - 1, 2^n, 2^n + 1\}$ and is a variation of the mixed radix conversion technique. Ahmad Hiasat [52] used the Chinese remainder theorem (CRT) approach to produce a simpler converter structure for the four-moduli set $\{2^n - 1, 2^n, 2^{2n} + 1, 2^{2n+p}\}$, using common factors. This led to considerable reductions in area, delay, time, energy, and power utilization, when compared with other published works.
From the foregoing, it can be stated that most of the research in the literature has focused on the speed improvement of FIR filters, while the costs of the area utilization and delay time are too high for the future trends of software-defined radio. Therefore, the goal of this work was to improve the performance of the generalized discrete Fourier transform, in terms of speed, area utilization, and delay time. This was approached by: (i) hybridization of a generalized discrete Fourier transform filter bank with frequency response masking and the coefficient decimation method 1; and (ii) filter bank design using a parallel distributed arithmetic-based residual number systems (PDA-RNS) filter. The final algorithm can, therefore, be described as hybridized GDFT with a PDA-RNS filter design. The design methodology and a detailed analysis are presented in the following.

Methodology
As proposed, two designs are investigated herein. The first approach is based on the hybridization of frequency-response masking with coefficient decimation filters and the classical generalized discrete Fourier transform (GDFT) filter bank. For ease of reference, this will be referred to as the hybrid GDFT (HGDFT). The second approach explores the improvement of the HGDFT using a parallel distributed arithmetic based residual number system (PDA-RNS). The algorithms for the two approaches and their simulation methods using the VHSIC hardware description language in the Altera DSP builder platform are presented in the following.

Proposed Hybrid Generalized Discrete Fourier Transform (HGDFT-FB)
The HGDFT-based filter bank consists of two branches: the upper and the lower branch. The upper branch is made up of the FRM-interpolated coefficient decimated filter and the masking filter, whereas the lower branch consists of the complementary FRM-interpolated coefficient decimated filter and the complementary masking filter. A low-pass interpolated coefficient decimated linear phase FIR filter, $H_a(z^{L/M})$, is formed from the cascade of the base interpolating filter, $H_a(z^L)$, and the coefficient decimating filter, $H_{cd}(z^{1/M})$, in order to extract the sharp narrow-band channel of choice. Furthermore, a bandpass edge-complementary interpolated coefficient decimated base filter, $H_c(z^{L/M})$, is formed from the cascade of the complementary base interpolating filter, $H_c(z^L)$, and the complementary coefficient decimating filter, $H_{cd}(z^M)$, in order to isolate multi-band frequency responses. The low-pass interpolated coefficient decimated base filter, $H_a(z^{L/M})$, cascades with the masking filter, $H_{ma}(z)$, in the upper branch, while the bandpass complementary interpolated coefficient decimated base filter, $H_c(z^{L/M})$, cascades with the complementary masking filter, $H_{mc}(z)$, in the lower branch, in order to produce reconfigurable, low-complexity, multiple narrow frequency bands. The desired passband ($\omega_p$) and cutoff frequency ($\omega_s$) of the base filter response, $H_a(z)$, are calculated as indicated in Table 2. The transfer function of the FRM interpolated coefficient decimated filter is given by Equation (1). The interpolated coefficient decimated base and complementary filters are symmetrical and asymmetrical linear phase FIR filters, respectively. A half-band filter is introduced into the FRM interpolated coefficient decimated filter, in order to further reduce its computational complexity. This is possible as a result of the symmetrical properties possessed by the half-band filter.
The time-domain impulse response of the CDM-1 technique requires every other component to be zero, except for the components at the centre. This indicates that it is symmetrical around the centre, which translates to reduced complexity in terms of the number of multipliers required by the filter. The transfer function of the half-band FRM interpolated coefficient decimated filter can be expressed in terms of two polyphase components, as in Equation (2). The masking filters are replaced with two GDFT-FBs, as shown in Figure 1, and the transfer function for the H-GDFT is given by Equation (3). Applying polyphase decomposition yields Equation (4), where $E_{Ai}(z^k)$ and $E_{Bi}(z^k)$ are the k polyphase components of $A(z)$ and $B(z)$, respectively. The GDFT-FB modulated bandpass filters are obtained from the lowpass prototype filter by applying complex modulations, as in Equation (5). Finally, Equation (6) represents each of the modulated bandpass filters, as depicted in Figure 2. The transition band of the H-GDFT FB is centred at $\pi/2$ rad, whereas the complementary filter bank is centred at $2\pi K/M$, where $K$ is an integer ranging from 0 to $(M - 1)$.
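The complex modulation of a lowpass prototype into bandpass channel filters can be sketched in Python as follows. Because the modulation equations are not reproduced in this text, the sketch assumes the widely used GDFT form $h_k[n] = h[n]\,e^{j 2\pi (k + k_0)(n + n_0)/K}$; the offsets `k0` and `n0`, the function names, and the crude moving-average prototype are illustrative assumptions, not the filters designed in this work.

```python
import cmath
import math

def gdft_modulate(prototype, K, k, k0=0.0, n0=0.0):
    """Shift a real lowpass prototype to channel k of a K-channel GDFT bank.

    k0 is an assumed frequency offset and n0 an assumed phase offset.
    """
    return [h * cmath.exp(2j * math.pi * (k + k0) * (n + n0) / K)
            for n, h in enumerate(prototype)]

def freq_response(h, omega):
    """Evaluate H(e^{j omega}) of an FIR filter directly from its taps."""
    return sum(c * cmath.exp(-1j * omega * n) for n, c in enumerate(h))

# Illustrative prototype only: an 8-tap moving average with unit DC gain.
prototype = [1.0 / 8.0] * 8
K = 4
bank = [gdft_modulate(prototype, K, k, k0=0.5) for k in range(K)]

# With k0 = 0.5, channel k is centred at 2*pi*(k + 0.5)/K.
centres = [2.0 * math.pi * (k + 0.5) / K for k in range(K)]
gains = [abs(freq_response(bank[k], centres[k])) for k in range(K)]
```

Each modulated filter reuses the prototype taps, so only one lowpass design is needed; the gain of every channel at its own centre frequency equals the DC gain of the prototype.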

Proposed Design Steps
The design steps for the proposed filter bank are outlined below:
4. Calculate the decimation factor $M$ of the masking filter using the formula $M = \pi/\omega_{ms}$. The interpolation factor is calculated using the formula $L = \pi/\omega_{ms}$, where $s_i$ is the stopband frequency for each channel. Thus, the fractional rate for the masking filter can be calculated as $L_{ma}/M_{ma}$.
5. Calculate the decimation factor of the complementary filter using the formula $M = \pi/(\pi + \omega_{mcs})$. The interpolation factor is calculated using the formula $L = \pi/(\pi + \omega_{mcs})$. Thus, the fractional rate for the complementary filter can be calculated as $L_{mc}/M_{mc}$.
6. Determine the transition bandwidth for the masking and complementary filters, $tbw_i$, such that $tbw_k = tbw_k \times L_k$.
10. Find the stopband ripple using $\delta_{s1} = \delta_{s1}$.
11. The modal passband peak ripple is calculated as $\delta_{pmodal} = \min(\delta_{p1}, \delta_{p2}, \ldots, \delta_{pn})$.
12. The cutoff frequencies of the prototype and masking filters are calculated using Table 2.
13. Determine the prototype filter order and the individual channel filter order using the Bellanger formula $N = \frac{-2\log_{10}(\delta_p \delta_s)}{3\Delta_{TBW}} - 1$ [58].
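The order estimate in the last design step can be sketched as a small Python helper. The sketch uses the commonly cited form of Bellanger's estimate, which places a factor of 10 inside the logarithm; that factor, the use of linear (not dB) ripple values, and a transition bandwidth normalised to the sampling frequency are assumptions here, and the function name is hypothetical.

```python
import math

def bellanger_order(delta_p, delta_s, tbw):
    """Estimate the FIR filter order with Bellanger's formula.

    delta_p, delta_s: linear passband and stopband ripples (assumed linear).
    tbw: transition bandwidth normalised to the sampling frequency (assumed).
    """
    return -2.0 * math.log10(10.0 * delta_p * delta_s) / (3.0 * tbw) - 1.0

# Illustrative values only, not the specifications used in this work.
order_wide = math.ceil(bellanger_order(0.01, 0.001, 0.05))
order_narrow = math.ceil(bellanger_order(0.01, 0.001, 0.005))
```

The estimate grows in inverse proportion to the transition bandwidth, which is why sharp narrow-band channels dominate the multiplier count and why the masking and decimation steps aim to relax the transition band of each sub-filter.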

Improving HGDFT with Parallel Distributed Arithmetic-Based Residual Number System (PDA-RNS)
In an attempt to lower the filter complexity and improve the reconfigurability of the proposed hybrid filter bank, we propose our second approach, the parallel distributed arithmetic-based residual number system. The design and co-simulation were carried out on the Quartus 11 Altera DSP builder 10, using the very high-speed integrated circuits (VHSIC) hardware description language. Consider an input signal sampling rate of 40 MHz for the filter design example in Section 2.2.1, with the filter specifications given in Table 3. The moduli set $\{2^n - 1, 2^n, 2^n + 1\}$ was selected from the literature, due to its high speed and reduced hardware complexity. With n = 5 bits, the resulting moduli set is {31, 32, 33}. The filter coefficients of the reconfigurable filter generated in Table 3 are converted to binary using num2bin(Q, 1, b). The format parameter specifies the (word length, fraction length) of the binary numbers in signed fixed-point mode. The input signal and the filter coefficients use a 16-bit precision format, with a word length of 16 and a fraction length of 15. Distributed arithmetic can be expressed as in Equation (7), where $H_i$ represents the filter coefficient and $X_{i,j}$ denotes the input signal vectors. The fixed-point binary values of the input signals and the filter coefficients are converted into residue form using the arbitrary forward converter mechanism outlined below. The inputs $y_i$ in Equation (7) are converted to RNS, as shown in Equation (8). Assume that the input block $X$ is partitioned into blocks of bits, as follows: $B_{k-1}, B_{k-2}, \ldots, B_0$.
Then, the blocks of bits can be represented by Equation (9). The total residue of $X$ is calculated as the sum of the residues of the partitioned blocks of bits, with respect to the chosen modulus, as depicted in Equation (10). The reverse converter, which converts from residue to binary numbers, is placed at the back-end of the architecture; the shift-and-add method is used for its implementation. The $2^k$ possible values of $r_1$, $r_2$, and $r_3$ are pre-computed and stored in a $2^k \times D$-bit LUT. After the residue values are computed, Equation (8) becomes Equation (11). Without loss of generality, let us assume that $r_k$ is an $N$-bit residue in Q1.$(N - 1)$ format, as in Equation (12). The dot product can then be written as in Equation (13), and rearranging the terms yields Equation (14). For K = 2 and N = 3, the rearrangement forms the entries in the ROM indicated in Equation (15). The input values and filter coefficients pre-stored in the LUT are partitioned into different LUTs, and a modulo accumulator (ACC) performs the modulo shift-accumulate operation, in order to generate $y_i$ in D cycles, as shown in Figure 3. Performance was compared in terms of the resource utilization, power consumption, and delay.
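The chain of forward conversion, LUT-based modulo shift-accumulate, and reverse conversion can be sketched end to end in Python for the moduli set {31, 32, 33}. This is a simplified software model under stated assumptions: operands are unsigned integers, the inner product stays below the dynamic range 31 × 32 × 33 = 32736, the reverse step uses the Chinese remainder theorem instead of the shift-and-add converter, and all function names are hypothetical.

```python
from math import prod

N_BITS = 5
MODULI = (2**N_BITS - 1, 2**N_BITS, 2**N_BITS + 1)   # (31, 32, 33)

def forward_convert(x, n=N_BITS):
    """Residues of a non-negative integer from its n-bit blocks B_0, B_1, ..."""
    mask = (1 << n) - 1
    blocks = []
    while x:
        blocks.append(x & mask)
        x >>= n
    blocks = blocks or [0]
    r1 = sum(blocks) % (2**n - 1)          # 2^n = 1 (mod 2^n - 1): add blocks
    r2 = blocks[0]                         # mod 2^n keeps the lowest block
    r3 = sum(b if i % 2 == 0 else -b       # 2^n = -1 (mod 2^n + 1): alternate
             for i, b in enumerate(blocks)) % (2**n + 1)
    return (r1, r2, r3)

def build_lut(coeffs, modulus):
    """LUT[addr] = sum of coeffs[k] over the set bits k of addr, mod modulus."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1) % modulus
            for addr in range(1 << len(coeffs))]

def da_channel(lut, inputs, nbits, modulus):
    """Bit-serial DA: one LUT read and one modulo shift-accumulate per bit."""
    acc = 0
    for bit in range(nbits - 1, -1, -1):   # most significant bit first
        addr = 0
        for k, x in enumerate(inputs):
            addr |= ((x >> bit) & 1) << k
        acc = (2 * acc + lut[addr]) % modulus
    return acc

def reverse_convert(residues, moduli=MODULI):
    """CRT reconstruction (assumed stand-in for the shift-and-add converter)."""
    M = prod(moduli)
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(residues, moduli)) % M

def da_rns_inner_product(coeffs, inputs):
    """Multiplierless inner product: three independent residue channels."""
    c_res = [forward_convert(c) for c in coeffs]
    x_res = [forward_convert(x) for x in inputs]
    out = []
    for ch, m in enumerate(MODULI):
        lut = build_lut([c[ch] for c in c_res], m)
        nbits = (m - 1).bit_length()       # bits needed for a residue
        out.append(da_channel(lut, [x[ch] for x in x_res], nbits, m))
    return reverse_convert(tuple(out))

y = da_rns_inner_product([3, 5, 7, 2], [100, 200, 50, 25])
```

The three channels never exchange data until the reverse conversion, so in hardware they can run in parallel on 5- or 6-bit words, and no multiplier appears anywhere in the datapath.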

Application of HGDFT with PDA-RNS Filter Bank to Non-Uniform Channels: BT, ZIGBEE, WCDMA
The HGDFT channelization algorithm was applied to non-uniform input channels. The three multi-standard channels considered were Bluetooth (BT), Zigbee, and wideband code division multiple access (WCDMA). The input sampling rate used was $F_s$ = 40 MHz, with the channel bandwidths for BT, Zigbee, and WCDMA specified as 1 MHz, 4 MHz, and 5 MHz, respectively. The transition bandwidths for BT, Zigbee, and WCDMA were specified as 50 kHz, 200 kHz, and 500 kHz, respectively. The passband and stopband ripple specifications for BT and Zigbee were 0.1 dB and −40 dB, while those of the WCDMA channels were 0.1 dB and −55 dB, respectively. The filters $H_a(z)$, $H_{ma}(z)$, and $H_{mc}(z)$ were the base, masking, and complementary filters, respectively, characterised using Case 1 of Table 2. The filter specifications shown in Table 4 were simulated following the design steps in Section 2.1.1. The following parameters were used to compare the performance of the new results: passband and stopband width, passband ripples, and stopband attenuation. The results obtained were compared with the designs in [53,54], using the same filter specifications and parameters. The realized HGDFT filter design specifications are shown in Tables 3-6. The filter coefficients obtained for BT, Zigbee, and WCDMA in Table 3 were converted into RNS format. The double-precision filter coefficients were quantized to 16 bits and converted into integer values. These values were transformed into three modular RNS representations. The parallel distributed arithmetic architecture was used for implementing the addition of these three RNS values. The co-simulations of HGDFT with PDA-RNS used the following parameters on the Quartus 11 Altera software: LUT slices, total slices, slice registers and slice LUTs, flip-flops, power, and delay. Performance comparisons were made against NU-MDFT CSD optimized with Pareto ABC [55], NU-MDFT SID-CSE [55], and the SDR channelisers in [56] and [57].

Results and Discussion
Using the information contained in Table 3, the normalized channel bandwidths of Bluetooth (BT), Zigbee, and wideband code division multiple access (WCDMA) were 0.05, 0.2, and 0.25, respectively. From Step 2 of Section 2.1.1, the passband width of the prototype filter was thus set to 0.025. The fractional rate of each channel was calculated using the formula in Step 3 of Section 2.1.1. The fractional sampling rate for the modal filter was 39/40, while those of the masking filters for BT, Zigbee, and WCDMA were 39/40, 9/10, and 7/8, respectively. When the fractional rate of 39/40 was applied to the modal filter, the transition bandwidth computed was 0.002375, with a passband peak ripple of 0.1 dB, stopband peak ripple of −50 dB, and filter length of 196. When the fractional rate of 39/40 was applied to the BT channels, the transition bandwidth was 0.0026, with a passband peak ripple of 0.00975 dB, stopband peak ripple of −39 dB, and filter length of 159. With the Zigbee fractional rate of 9/10, the transition bandwidth was calculated to be 0.011, with a passband ripple of 0.09 dB, stopband peak ripple of −39 dB, and filter order of 34. With the WCDMA fractional rate of 7/8, the transition bandwidth was calculated to be 0.021, with a passband ripple of 0.0875 dB, stopband peak ripple of −48.125 dB, and filter order of 19. The frequency characteristic input is shown in Table 3, while Figures 4-7 show the magnitude responses of the input for the modal filter, BT masking filter, Zigbee masking filter, and WCDMA masking filter. The stopband for the complementary masking frequency $\omega_{mcs}$ was also calculated, using the equation in Table 2. The stopband edge, passband edge, and fractional rate values were calculated using design steps 5 and 9 in Section 2.1.1. The complementary masking decimator factors for the modal filter, BT, Zigbee, and WCDMA were 8/9, 8/9, 8/9, and 7/8, respectively.
The complementary masking transition bandwidths for the modal filter, BT, Zigbee, and WCDMA were 0.00222, 0.00222, 0.0089, and 0.021875, with filter orders of 209, 150, 37, and 13, respectively. Table 5 shows the filter characteristics of the complementary masking filter using the HGDFT channelisation algorithm.
The total number of multiplications used by the proposed HGDFT filter bank was 526, while the multiplications used in [53][54][55] were found to be 1745, 1545, and 1090, respectively. The number of multipliers utilized by the proposed HGDFT filter bank was therefore lower than those of the CDFB [54] and ICDM [53,55] methods, as indicated in Table 7. Figures 5-7 show the magnitude responses of the modal filter, BT masking filter, Zigbee masking filter, and WCDMA masking filter.
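The relative savings implied by these multiplier counts can be recomputed with a short Python sketch. Only the counts quoted above are used; the helper name is hypothetical. Against the 1545 multiplications of [54], the saving comes to roughly 66%, and the savings against [53] and [55] are larger and smaller, respectively.

```python
def reduction_percent(baseline, proposed):
    """Percentage reduction of `proposed` relative to `baseline`."""
    return 100.0 * (baseline - proposed) / baseline

proposed_multiplications = 526
reported_multiplications = {"[53]": 1745, "[54]": 1545, "[55]": 1090}

# Percentage saving of the proposed bank against each reported design.
reductions = {ref: reduction_percent(count, proposed_multiplications)
              for ref, count in reported_multiplications.items()}
```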
The results obtained from improving HGDFT with the PDA-RNS filter are as follows. The filter coefficients obtained in Table 2 were quantized with 16-bit representation, as it showed better passband ripples and stopband attenuation, when compared with 8- and 12-bit representations, as indicated in Table 8. By replacing the multipliers in the HGDFT with the PDA-RNS filter, there was a 100% decrease in multipliers, from 526 to 0, as the use of multipliers was totally eliminated. Tables 9 and 10 show a device utilization comparison. The total hardware resources occupied by HGDFT with PDA-RNS were as follows: 941 total slices, 2073 slice LUTs, 2338 flip-flops, total power of 333.53 mW, and total delay of 3.328 ns. For NU-MDFT CSD with Pareto ABC, the total slices occupied were 2406, the slice LUTs utilized were 8950, the flip-flops consumed were 8980, the total power consumed was 1751 mW, and the total delay was 3.75 ns. The NU-MDFT filter optimized with SID-CSE consumed 1633 total slices, 5901 slice LUTs, and 5911 flip-flops, with a total power of 1281 mW and a delay of 2.6 ns. From these performance results, the plot in Figure 8 shows that HGDFT with PDA-RNS utilized 12.97% of the total LUT, 14% of the LUT slices, had a 12.7% reduction in flip-flops used, 17% power consumption, and 4% delay in the execution time, when compared with NU-MDFT CSD optimized with Pareto ABC [55]. The number of occupied slices fell from 2406 to 941, a reduction of approximately 61%. There was an 83% reduction in power consumption and an 11% reduction in execution delay time, from 3.75 ns to 3.328 ns. However, when the HGDFT PDA-RNS filter was compared with the NU-MDFT SID-CSE in [55], the execution delay for NU-MDFT SID-CSE was found to be lower and, thus, faster. The total resources utilised are compared in Table 10. The slice registers utilized by the HGDFT with PDA-RNS filter bank were 2279 out of 239,616.
This is a drastic reduction in slice registers, when compared with the NU-MDFT FB in [55], which consumed 29,797 out of 301,440; that in [56], which used 15,295 out of 58,880; and that in [57], which used 29,797 out of 301,440. The total slice LUTs used by the HGDFT with PDA-RNS filter were 2022 out of 119,808, while the NU-MDFT filter in [55] used 5901 out of 298,600, the design in [57] used 21,169 out of 150,720, and that in [56] used 14,726 out of 58,880. Thus, Table 10 shows that the hardware resource utilization of the HGDFT with PDA-RNS filter bank was lower than that of the SDR channelisers in [55][56][57]. The lower filter order of the proposed design, coupled with the modularity of RNS, contributed to its lower slice requirements and lower power consumption, when compared to the designs in [39,55].

Conclusions
The proposed HGDFT with PDA-RNS method was found to be an effective channelization algorithm for low-complexity reconfigurable filters in multi-standard receivers. Two improvement methods were used for the realization of the algorithm. The first improvement was achieved by hybridizing CDM-1 and FRM filters with the GDFT. The performance of the method was further improved by using a parallel distributed arithmetic-based residue number system. The HGDFT filter bank demonstrated a reduction in the number of multiplications and filter coefficients used, compared with FRM-based or modified GDFT filter banks. The HGDFT filter bank was also optimized with a PDA-RNS, after which the number of adders, the number of multipliers, and the overall filter complexity were reduced to a minimum, while reconfigurability was preserved. This resulted in substantial reductions in resource utilization, delay, and power consumption.