Scalable Hardware Efficient Architecture for Parallel FIR Filters with Symmetric Coefficients

Abstract: Symmetric convolutions can be utilized for potential hardware resource reduction. However, they have not been realized in state-of-the-art transposed block FIR designs. Therefore, we explore the feasibility of symmetric convolution in transposed parallel FIRs and propose a scalable hardware efficient parallel architecture. The proposed design inserts delay elements after multipliers for temporal reuse of intermediate tap products. By doing this, the number of required multipliers can be reduced by half. As a result, we can achieve up to 3.2× and 1.64× area efficiency improvements over the modern transposed block method on reconfigurable and fixed designs, respectively. These results confirm the effectiveness of the proposed STB-FIR architecture for hardware-efficient, high-speed signal processing.


Introduction
The finite impulse response (FIR) filter, one of the primary digital filters, has been widely used in signal processing due to its stability and linear phase characteristics. With the time-domain input $x(n)$ and the filter coefficients $h(i)$, the corresponding output $y(n)$ of a T-tap FIR can be obtained as $y(n) = \sum_{i=0}^{T-1} h(i) \cdot x(n-i)$, according to the discrete-time convolution [1]. Due to the accuracy requirement in the frequency domain, T is generally large [2] and consequently incurs a large silicon area with significant power consumption. Therefore, over the past decades, much research effort has been directed toward hardware efficient FIR filter implementations.
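As a minimal illustration of the convolution above, the following Python sketch evaluates a T-tap FIR filter sample by sample. The function name `fir_direct`, the zero initial condition (x(n) = 0 for n < 0), and the coefficient and input values are assumptions made for this example only.

```python
def fir_direct(x, h):
    """Direct convolution y(n) = sum_i h(i) * x(n - i), with x(n) = 0 for n < 0."""
    T = len(h)
    y = []
    for n in range(len(x)):
        acc = 0
        for i in range(T):
            if n - i >= 0:          # samples before the start are taken as zero
                acc += h[i] * x[n - i]
        y.append(acc)
    return y


# Illustrative 4-tap symmetric filter and a short input (arbitrary values).
h = [1, 2, 2, 1]
x = [1, 0, 0, 0, 3]
y = fir_direct(x, h)
print(y)  # [1, 2, 2, 1, 3]
```

Every output sample needs up to T multiplications, which is why a large T translates directly into multiplier area in a hardware implementation.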
Because multipliers are generally more expensive than adders in terms of area and power consumption, many previous works have focused on FIR filter designs with area-efficient multipliers. In some application-specific FIRs, the coefficients can be predetermined; thus, instead of costly general-purpose multipliers, several constant-multiplier (CM)-based designs have been proposed [3–8]. These CM-based FIRs, however, only work for a specific coefficient set; therefore, they are not suitable for reconfigurable systems with programmable coefficients for real-time applications, such as adaptive pulse shaping and signal equalization [9].
On the other hand, because FIR filters are widely used in high-throughput multimedia signal processing and cellular wireless communication systems, several parallel FIR implementations have also been proposed, such as the fast FIR algorithm (FFA) [10–12] and block FIRs [13–15]. The basic idea of FFA is to break up an FIR filter into several sub-filters using polyphase decomposition so that they can operate in parallel with reduced computational complexity. In an FFA-based FIR filter design, the required number of multipliers can be greatly reduced at the cost of additional adders for pre-processing and post-processing. Symmetric FFA-based designs that exploit symmetric convolutions have also been proposed in [11,12]. Although FFA-based methods can achieve a significant reduction in multipliers, they are only effective for parallel FIR filters with low parallelism; otherwise, the additional adders introduce significant area overhead and design complexity. Unlike FFA-based designs, block FIRs [13–15] can be combined with CM-based methods for a specific coefficient set; however, the corresponding hardware resources increase linearly with the degree of parallelism. Therefore, the area and power efficiency of existing parallel FIRs remain design challenges.
The symmetry of coefficients, which can lead to a significant saving in hardware cost, has not yet been exploited in existing block FIR designs. Therefore, we explore the feasibility of symmetric convolution in transposed parallel FIRs and propose a symmetric transposed block FIR filter (STB-FIR) architecture for area- and power-efficient implementation of block FIR filters. In this architecture, delay elements are inserted after the multipliers for temporal reuse of intermediate tap products.
The remainder of this paper is organized as follows. The proposed STB-FIR architecture is illustrated in Section 2 with the corresponding generalized formulation. Evaluation results are provided in Section 3. Finally, the conclusion is given in Section 4.

Proposed STB-FIR
A T-tap digital FIR filter can be generally implemented either in the direct form or in the transposed form, as shown in Figure 1.

Generalized Mathematical Formulation for TB-Based FIRs
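For L-parallel processing, the transposed block FIR proposed in [15] takes a block of L new input samples $\{x(kL), x(kL-1), \ldots, x(kL-L+1)\}$ and produces a block of L output samples $\{y(kL), y(kL-1), \ldots, y(kL-L+1)\}$ in each clock cycle. For a T-tap L-parallel transposed block FIR with T = M·L, where L and M indicate the degree of processing parallelism and the total number of blocks, respectively, the operations can be expressed in matrix form as:

$$
\begin{bmatrix} y(kL) \\ y(kL-1) \\ \vdots \\ y(kL-L+1) \end{bmatrix}
=
\begin{bmatrix}
x(kL) & x(kL-1) & \cdots & x(kL-T+1) \\
x(kL-1) & x(kL-2) & \cdots & x(kL-T) \\
\vdots & \vdots & \ddots & \vdots \\
x(kL-L+1) & x(kL-L) & \cdots & x(kL-L-T+2)
\end{bmatrix}
\begin{bmatrix} h(0) \\ h(1) \\ \vdots \\ h(T-1) \end{bmatrix}
\qquad (1)
$$

where $\{x(kL), x(kL-1), \ldots, x(kL-L+1)\}$ is the input of the current clock cycle, $\{x((k-1)L), x((k-1)L-1), \ldots, x((k-1)L-L+1)\}$ represents the input in the previous clock cycle, and $\{x((k-m)L), x((k-m)L-1), \ldots, x((k-m)L-L+1)\}$ represents the input m clock cycles before. It is obvious that the output of an FIR is a linear combination of the current input and some previous values.
According to (1), as an L-parallel FIR has L new inputs in each cycle, we define $\mathbf{X}_k$ as

$$
\mathbf{X}_k =
\begin{bmatrix}
x(kL) & x(kL-1) & \cdots & x(kL-L+1) \\
x(kL-1) & x(kL-2) & \cdots & x(kL-L) \\
\vdots & \vdots & \ddots & \vdots \\
x(kL-L+1) & x(kL-L) & \cdots & x(kL-2L+2)
\end{bmatrix}
\qquad (2)
$$

Hence, substituting it into (1), with $\mathbf{Y}_k$ denoting the output block and $\mathbf{h} = [h(0), h(1), \ldots, h(T-1)]^{\top}$ the coefficient vector, we can obtain

$$
\mathbf{Y}_k = \begin{bmatrix} \mathbf{X}_k & \mathbf{X}_{k-1} & \cdots & \mathbf{X}_{k-M+1} \end{bmatrix} \mathbf{h}
\qquad (3)
$$

Similarly, the coefficients can also be divided into M blocks as

$$
\mathbf{h} = \begin{bmatrix} \mathbf{H}_0^{\top} & \mathbf{H}_1^{\top} & \cdots & \mathbf{H}_{M-1}^{\top} \end{bmatrix}^{\top},
\qquad \mathbf{H}_m = [h(mL), h(mL+1), \ldots, h(mL+L-1)]^{\top}
\qquad (4)
$$

Thus, (3) can be further rewritten as

$$
\mathbf{Y}_k = \sum_{m=0}^{M-1} \mathbf{X}_{k-m} \cdot \mathbf{H}_m
\qquad (5)
$$

Here, let the calculation of $\mathbf{X}_{k-m} \cdot \mathbf{H}_m$ be $TB_m$, called the time block in the following.
According to (5), a TB-based scalable FIR architecture can be implemented in a transposed form, as shown in Figure 2, where a T-tap L-parallel block FIR filter has M (= T/L) time blocks and processes L input samples per clock cycle with all the TBs working simultaneously. It should be noted that the input $\mathbf{X}_k$ contains both the current inputs and the previous inputs. Moreover, each time block ($TB_m$) has its corresponding coefficients, $\mathbf{H}_m$, and its result is added to the output of the neighboring time block ($TB_{m+1}$) to generate the output of $TB_m$, which is then sent to the next time block ($TB_{m-1}$).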
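The block decomposition can be checked numerically. The sketch below is a behavioural check only, not a model of the hardware in Figure 2. It assumes the block convention in which $\mathbf{X}_k$ is the L × L matrix whose (j, i) entry is x(kL − j − i) and $\mathbf{H}_m$ holds the coefficients h(mL), ..., h(mL + L − 1). The function names, zero initial conditions, and test values are hypothetical choices for this example.

```python
def sample(x, n):
    """x(n), with zero initial conditions outside the recorded input."""
    return x[n] if 0 <= n < len(x) else 0


def fir_direct(x, h):
    """Reference: y(n) = sum_i h(i) * x(n - i)."""
    return [sum(h[i] * sample(x, n - i) for i in range(len(h)))
            for n in range(len(x))]


def block_matrix(x, k, L):
    """Assumed X_k: L x L matrix whose (j, i) entry is x(kL - j - i)."""
    return [[sample(x, k * L - j - i) for i in range(L)] for j in range(L)]


def fir_block(x, h, L):
    """Evaluate Y_k = sum_m X_{k-m} H_m block by block."""
    T = len(h)
    assert T % L == 0, "this sketch assumes T = M * L"
    M = T // L
    H = [h[m * L:(m + 1) * L] for m in range(M)]   # coefficient blocks H_m
    y = {}
    k = 0
    while k * L - L + 1 < len(x):                  # block k covers n = kL-L+1 .. kL
        X = [block_matrix(x, k - m, L) for m in range(M)]
        for j in range(L):
            n = k * L - j                          # row j produces y(kL - j)
            if 0 <= n < len(x):
                y[n] = sum(X[m][j][i] * H[m][i]
                           for m in range(M) for i in range(L))
        k += 1
    return [y[n] for n in range(len(x))]


x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
h = [1, 2, 3, 3, 2, 1]
for L in (1, 2, 3, 6):
    assert fir_block(x, h, L) == fir_direct(x, h)
```

Each `block_matrix` holds only 2L − 1 distinct samples, L from the current cycle and L − 1 from the previous one.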

Proposed STB-FIR Design
For a linear-phase FIR filter, the impulse response can be symmetric ($h(i) = h(T-1-i)$) or anti-symmetric ($h(i) = -h(T-1-i)$). To simplify the explanation, only the symmetric realization with $h(i) = h(T-1-i)$ will be discussed in the following. It is worth noting that the proposed architecture can also be applied to anti-symmetric FIR implementations. As shown in (2), there are 2L − 1 input samples that are multiplied with the corresponding coefficients in each time block. Among the 2L − 1 different input samples, the L samples $\{x(kL), x(kL-1), \ldots, x(kL-L+1)\}$ are the inputs in the current clock cycle, while the other L − 1 samples $\{x(kL-L), x(kL-L-1), \ldots, x(kL-2L+2)\}$ were obtained in the previous clock cycle. Therefore, $L^2$ different multipliers are required in each time block. In [15], delay elements are inserted on the input side to temporarily store the L − 1 samples for later calculation. Unfortunately, in the transposed block form, the symmetry of coefficients cannot be easily exploited.
To explore the feasibility of symmetric convolution in parallel block FIR filters, a hardware efficient symmetric transposed block FIR architecture (STB-FIR) is proposed. In STB-FIR, delay elements are inserted after multipliers; thus, temporal reuse of the intermediate tap products becomes possible. Consequently, half of the multipliers can be saved at the cost of increased registers.
For a T-tap L-parallel transposed FIR with symmetric coefficients where T = M·L, there are two cases (i.e., M is odd or even) that should be considered in the proposed STB-FIR design method, as shown in Figure 3.
Figure 3. Proposed STB-FIR structure with two cases: (a) TB-based fully symmetric folded structure when M is even and (b) non-symmetric folded structure when M is odd.

Case 1: M Is Even
A linear-phase FIR filter that falls into this category has an even number of taps (i.e., T is even), an even number of TBs (i.e., M is even), and symmetric coefficients (i.e., $h(i) = h(T-1-i)$). Without loss of generality, let us consider two symmetric TBs ($TB_m$ and $TB_{M-1-m}$) in a TB-based symmetric FIR.
For the time block $TB_m$ shown in (6), we can divide it into two terms as in (7), and each of them can be implemented as shown in Figure 4a and Figure 4b, respectively.
Figure 4. Implementation of the time block, $TB_m$, with the corresponding two terms in (7): (a) the former term and (b) the latter term.
Here, it should be mentioned that there are 2L − 1 data inputs that are multiplied with the corresponding coefficients ($\mathbf{H}_m$) in each time block ($TB_m$). Among the 2L − 1 different input samples, if the L samples $\{x(kL), x(kL-1), \ldots, x(kL-L+1)\}$ are the inputs in the current clock cycle, the other L − 1 samples $\{x(kL-L), x(kL-L-1), \ldots, x(kL-2L+2)\}$ were obtained in the previous clock cycle. Since, in the transposed form, every input sample should be multiplied with all the coefficients to generate the intermediate tap products, we can perform the multiplications first and then save the products in registers for later addition. By doing this, intermediate products can be reused at the cost of several additional registers, while the required multipliers are reduced by half. As a result, the two parts shown in Figure 4 can be combined into one circuit, as shown in Figure 5, where the current L input samples $\{x(kL), x(kL-1), \ldots, x(kL-L+1)\}$ are first multiplied with all the coefficients; some of the products are delivered directly for addition, while the others are first saved into registers and then sent to the adder.
Furthermore, in a TB-based symmetric FIR (i.e., $h(i) = h(T-1-i)$), the coefficients in the two symmetric TBs ($TB_m$ and $TB_{M-1-m}$) are $\mathbf{H}_m = [h(mL), h(mL+1), \ldots, h(mL+L-1)]^{\top}$ and $\mathbf{H}_{M-1-m}$, respectively, and then we have

$$
\mathbf{H}_{M-1-m} = [h(T-mL-L), \ldots, h(T-mL-1)]^{\top} = [h(mL+L-1), h(mL+L-2), \ldots, h(mL)]^{\top},
$$

that is, the vector of coefficients $\mathbf{H}_m$ in reverse order. Although $\mathbf{H}_m \neq \mathbf{H}_{M-1-m}$ in general, they consist of the same L separate coefficients $\{h(mL), h(mL+1), \ldots, h(mL+L-1)\}$. Since, in the transposed form, every input sample should be multiplied with all the coefficients to generate the intermediate tap products, we can implement each pair of symmetric TBs ($TB_m$ and $TB_{M-1-m}$) as one symmetric time block ($STB_m$) to take advantage of the same separate coefficients for tap product reuse.
The proposed STB-FIR structure with an even M is shown in Figure 3a; it consists only of the basic STB units, and the detailed STB design is shown in Figure 6. Each STB has L data inputs (i.e., $x(kL), x(kL-1), \ldots, x(kL-L+1)$) and L coefficients ($\mathbf{H}_m$), where $\mathbf{H}_m = [h(mL), h(mL+1), \ldots, h(mL+L-1)]^{\top}$. It also accepts data from the neighboring STBs, performs the sum operations, and then sends the results to the two corresponding neighboring STBs. The proposed STB design can be viewed as the combination of two TBs, while the total number of multipliers is halved compared with an implementation using two separate TBs. Meanwhile, the number of adders is the same as in the existing transposed block FIR [15]. Therefore, the number of multipliers is reduced at the cost of additional delay elements in STB-FIR. Since multipliers are much more expensive in area and power consumption than registers, STB-FIR can achieve significant area and power savings when compared with the existing transposed block FIRs.
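The product reuse described above can be sketched as a behavioural model. It is not the circuit of Figure 6: it multiplies each current input sample once with each distinct coefficient of a mirrored block pair, keeps the previous cycle's products in a register array, and builds both time blocks from those products. The indexing convention (block k carries x(kL), ..., x(kL − L + 1), and coefficient block m holds h(mL), ..., h(mL + L − 1)), the function names, and the test data are assumptions for illustration.

```python
def sample(x, n):
    return x[n] if 0 <= n < len(x) else 0


def fir_direct(x, h):
    return [sum(h[i] * sample(x, n - i) for i in range(len(h)))
            for n in range(len(x))]


def stb_fir(x, h, L):
    """Behavioural model; returns (outputs, multiplications per clock cycle)."""
    T = len(h)
    assert T % L == 0 and h == h[::-1], "assumes T = M * L and symmetric h"
    M = T // L
    n_cycles = (len(x) + L - 1) // L + 1
    u = [[None] * M for _ in range(n_cycles)]    # u[k][m]: block m fed with cycle-k data
    mults = 0
    for m in range((M + 1) // 2):
        mirror = M - 1 - m
        folded = (mirror == m)                   # middle block when M is odd
        n_coef = (L + 1) // 2 if folded else L   # distinct coefficients to multiply by
        coef = h[m * L:m * L + n_coef]
        mults += L * n_coef
        prev = [[0] * n_coef for _ in range(L)]  # registers: last cycle's products
        for k in range(n_cycles):
            # One multiplication per (current sample, distinct coefficient).
            cur = [[sample(x, k * L - a) * c for c in coef] for a in range(L)]

            def product(s, c):
                """Product of sample x(kL - s) with coefficient index c of block m."""
                if folded:
                    c = min(c, L - 1 - c)        # palindromic middle block
                return cur[s][c] if s < L else prev[s - L][c]

            u[k][m] = [sum(product(j + i, i) for i in range(L))
                       for j in range(L)]
            # The mirrored block uses the same products with reversed coefficient order.
            u[k][mirror] = [sum(product(j + i, L - 1 - i) for i in range(L))
                            for j in range(L)]
            prev = cur
    y = {}
    for k in range(n_cycles):
        for j in range(L):
            n = k * L - j
            if 0 <= n < len(x):
                # Transposed form: block m contributes with a delay of m cycles.
                y[n] = sum(u[k - m][m][j] for m in range(M) if k - m >= 0)
    return [y[n] for n in range(len(x))], mults


x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
h = [1, 2, 3, 3, 2, 1]
y, mults = stb_fir(x, h, 3)
assert y == fir_direct(x, h)
print(mults)  # 9
```

In this model, a mirrored pair of blocks costs L² multiplications per cycle instead of 2L², while the `prev` array plays the role of the delay elements placed after the multipliers.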

Case 2: M Is Odd
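When the number of time blocks (M) is odd, the FIR structure cannot be fully folded on a TB basis, and a half-STB unit for the middle time block ($TB_{(M-1)/2}$) is required, as shown in Figure 3b. Fortunately, due to the symmetry of coefficients (i.e., $h(i) = h(T-1-i)$), we have

$$
h\!\left(\tfrac{(M-1)L}{2} + i\right) = h\!\left(\tfrac{(M-1)L}{2} + L - 1 - i\right), \qquad i = 0, 1, \ldots, L-1.
$$

Thus, with $\mathbf{H}_m = [h(mL), h(mL+1), \ldots, h(mL+L-1)]^{\top}$ denoting the coefficient block of $TB_m$, $TB_{(M-1)/2}$ can be calculated as

$$
TB_{(M-1)/2} = \mathbf{X}_{k-(M-1)/2} \cdot \mathbf{H}_{(M-1)/2},
$$

in which the coefficient vector $\mathbf{H}_{(M-1)/2}$ reads the same in both directions. Due to the symmetry in $\mathbf{H}_{(M-1)/2}$, $TB_{(M-1)/2}$ can be realized using the STB-like unit design with only the length changed from L to $\lceil L/2 \rceil$.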
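Both folding cases rest on a simple property of the coefficient blocks, which the following sketch checks. Block M − 1 − m is block m in reverse order, and for odd M the middle block is a palindrome. The helper names and example coefficients are hypothetical, and the block split h(mL), ..., h(mL + L − 1) is an assumed convention.

```python
def split_blocks(h, L):
    """Split the T coefficients into M = T / L consecutive blocks of length L."""
    assert len(h) % L == 0, "assumes T = M * L"
    return [h[m * L:(m + 1) * L] for m in range(len(h) // L)]


def blocks_are_mirrored(h, L):
    """True if every block M-1-m equals block m in reverse order."""
    H = split_blocks(h, L)
    M = len(H)
    return all(H[M - 1 - m] == H[m][::-1] for m in range(M))


h = [1, 2, 3, 3, 2, 1]            # symmetric 6-tap example
assert blocks_are_mirrored(h, 3)  # M = 2: blocks [1, 2, 3] and [3, 2, 1]
assert blocks_are_mirrored(h, 2)  # M = 3: blocks [1, 2], [3, 3], [2, 1]

middle = split_blocks(h, 2)[1]
assert middle == middle[::-1]     # palindromic middle block: one distinct coefficient
```

For the 2-parallel case, the middle block [3, 3] needs only one distinct coefficient, which is why a shortened STB-like unit is sufficient there.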
For better illustration, Figure 7 gives example designs of the proposed STB-FIR and the existing transposed block FIR [15] for two 6-tap transposed FIRs with different degrees of parallelism (L). For the 6-tap 3-parallel transposed FIR (T = 6 and L = 3), because M equals 2, it is TB-based fully symmetric folded, and the corresponding implementation of the proposed STB-FIR is shown in Figure 7a. Because L = 3, three data inputs (i.e., $x(3k)$, $x(3k-1)$, and $x(3k-2)$) are sent to the STB in each clock cycle. When compared with the transposed block FIR [15], the number of multipliers is reduced from 18 to 9 with the same number of adders, while the number of registers is increased by 3. On the other hand, for the 6-tap 2-parallel transposed FIR (T = 6 and L = 2), as M = 3, it is not TB-based fully symmetric folded, and a half-STB unit is required, as shown in Figure 7b. Because L = 2, two input samples (i.e., $x(2k)$ and $x(2k-1)$) are applied in each clock cycle. The STB and the half-STB units work in a similar way: they obtain data from the neighboring units, perform the sum operations, and then send the results to the corresponding neighboring units. Finally, two outputs (i.e., $y(2k)$ and $y(2k-1)$) are generated in each clock cycle. When compared with [15], the number of multipliers is reduced from 12 to 6 with the same number of adders, and the number of registers is increased by only 2 in STB-FIR.
From this simple example, it can be observed that, unlike the existing transposed block FIR structure [15], in which delay elements are inserted on the input side, the proposed STB-FIR inserts delay elements after the multipliers for temporal reuse of tap products. By doing this, the number of costly multipliers is reduced by half at the cost of a slight increase in low-cost registers.
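The multiplier counts quoted in this example can be reproduced with two small helper functions. Both formulas are assumptions chosen to be consistent with the counts above: L·T multipliers for the transposed block FIR and ⌈T/2⌉·L for STB-FIR. The function names are hypothetical.

```python
def multipliers_transposed_block(T, L):
    """Assumed count for the transposed block FIR: every tap in every lane."""
    return L * T


def multipliers_stb(T, L):
    """Assumed count for STB-FIR: ceil(T / 2) * L, using integer arithmetic."""
    return (T + 1) // 2 * L


for T, L in [(6, 3), (6, 2)]:
    print(T, L, multipliers_transposed_block(T, L), multipliers_stb(T, L))
# prints "6 3 18 9" and then "6 2 12 6"
```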

Evaluation Results and Comparisons
To examine the effectiveness of the proposed STB-FIR architecture, implementation results of various FIR filters are provided and compared with the existing works. In our work, all the designs were implemented in ROHM 180 nm CMOS process technology, and post-synthesis simulation was conducted on the netlist for timing and power evaluation.

Comparison of Hardware Complexity
For the general FIR structure, the overall hardware complexity of the proposed STB-FIR is compared with the transposed block FIR [15], the DA-based approach [13], and the FFA-based L-parallel structure [1] in Table 1. In Table 1, the FFA entries are expressed in terms of the numbers of multipliers, registers, and adders of the i-th parallel FFA basic unit, and the register count listed for STB-FIR is an approximate value that depends on the filter parameters.
A canonical T-tap FIR filter consists of T multipliers, T − 1 adders, and registers, all of which depend on the tap number (T) chosen for a specific accuracy requirement. Therefore, a straightforward implementation of a T-tap L-parallel FIR design requires L times the hardware resources. The transposed block structure proposed in [15] involves L·T multipliers, L(T − 1) adders, and the registers listed in Table 1, of which L − 1 registers are inserted at the input side to store L − 1 samples for later calculation. The required hardware resource of the distributed arithmetic (DA)-based FIR structure [13] is estimated according to the results shown in [15] for comparison purposes. Here, it should be noted that DA [13] was implemented in the direct form, while the transposed block FIR [15] was implemented in the transposed form. As for the FFA-based method, the required hardware resource is formulated according to the analysis in [1,11,12]. As illustrated above, the proposed STB-FIR in total involves ⌈T/2⌉·L multipliers, L(T − 1) adders, and the registers listed in Table 1; one group of these registers serves approximately as the delay elements for intermediate product sharing, and the remaining registers are required for the output storage of the adder trees.
When compared with the DA-based FIR [13] and the transposed block FIR [15], the number of multipliers in STB-FIR is reduced by half with the same number of adders, which indicates promising area savings in STB-FIR. The cost of this multiplier reduction is a slightly increased number of low-cost registers, which are used to store the intermediate tap products for temporal reuse. According to our analysis, as the filter parameters increase, the ratio of saved multipliers to added registers approaches 4/3, which indicates that four multipliers can be saved at the cost of three additional registers. Because the area and power of a general multiplier are much larger than those of the corresponding number of registers, great area and power savings can be achieved in the proposed STB-FIR.

Comparison of Reconfigurable FIR Implementations
For evaluation and comparison purposes, reconfigurable FIRs with 8, 16, 24, 32, 64, and 128 taps and various degrees of processing parallelism (i.e., L = 2, 4, and 8) are implemented using STB-FIR and the existing transposed block FIR [15], and the corresponding synthesis results are shown in Table 2. The design in [15] has already been evaluated against the DA-based design [13]; therefore, we implemented it as the state-of-the-art design, and the corresponding results are presented in Table 2 for comparison. The results are obtained using logic synthesis tools, and the power consumption is extracted from simulation of the synthesized netlist with back-annotated toggling activity, where uniformly distributed input sample values are applied. In the table, the sample frequency is calculated as the degree of parallelism (L) divided by the minimum clock period (MCP). Therefore, the L-parallel FIR designs, including both the proposed method and the existing method, improve the sampling frequency over the baseline non-parallel FIR design at the cost of increased silicon area. On average, the proposed STB-FIR achieves 39.12% area savings and 35.16% power consumption reduction across all 18 FIRs when compared with the existing transposed block-based designs [15].
As the structure of reconfigurable FIR filters is very regular, the area efficiency per tap (AE) is introduced in (12) as a measure, where Tap is the tap number of the FIR, the parallelism term indicates the degree of processing parallelism (i.e., L in the proposed STB-FIR), and MCP is the minimum clock period. The normalized results are given in Figure 8, in which the baseline design has a parallelism of 1, and STB-FIR achieves up to 3.2× improvement.
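To make the normalization concrete, the sketch below computes the sample frequency as L / MCP, as stated above, and then an area efficiency per tap. The AE expression used here (tap number times sample frequency, divided by area) is only an assumed reading for illustration and may differ from the exact form of (12). All function names and numbers are hypothetical and are not taken from Table 2.

```python
def sample_frequency(L, mcp):
    """Sample frequency = degree of parallelism / minimum clock period."""
    return L / mcp


def area_efficiency_per_tap(taps, L, mcp, area):
    # ASSUMPTION: AE = taps * sample frequency / area (not confirmed by the source).
    return taps * sample_frequency(L, mcp) / area


def normalized(ae, ae_baseline):
    """AE relative to a baseline design (parallelism of 1 in Figure 8)."""
    return ae / ae_baseline


# Purely illustrative numbers (arbitrary units).
baseline = area_efficiency_per_tap(taps=16, L=1, mcp=2.0, area=4.0)
parallel = area_efficiency_per_tap(taps=16, L=4, mcp=2.0, area=8.0)
print(normalized(parallel, baseline))  # 2.0
```

Under this assumed definition, a parallel design improves AE only when its throughput gain outpaces its area growth, which is the trade-off the normalized comparison is meant to expose.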

Comparison of Fixed FIR Implementations
As for fixed FIR filters in which the coefficients are pre-determined, by referring to [16], a 105-tap FIR with L = 3 and a 60-tap FIR with L = 2, 3, and 4 are implemented, in which the CSE multiplier [4] is adopted for area saving.
For the 105-tap 3-parallel FIR, STB-FIR can achieve 13.08% area saving and 23.05% power reduction when compared with the fixed transposed block FIR [15]. On the other hand, when compared with the fixed symmetric FFA designs [11] and [12], STB-FIR can achieve up to 31.5% and 40.5% sampling frequency improvement for the 105-tap 3-parallel FIR and the 60-tap 2-parallel one, respectively.
Without loss of generality, the normalized area efficiency comparison is presented in Figure 9 in which the baseline is the FFA-based design in [1]. For the 105-tap 3-parallel FIR, the proposed STB-FIR architecture can achieve 1.64× and 1.20× improvements over the transposed block FIR [15] and the fixed symmetric FFA design [11], respectively. Moreover, for the 60-tap FIR, the proposed STB-FIR filter design can achieve up to 1.29× AE improvement over the existing FFA designs.

Conclusions
The feasibility of a symmetric transposed block FIR filter is explored in this paper by taking advantage of the symmetric coefficients for area- and power-efficient implementation. In the proposed STB-FIR architecture, using registers to save intermediate tap products for temporal reuse makes it possible to realize hardware-efficient symmetric transposed block FIR filters. The evaluation results show that, compared with the state-of-the-art reconfigurable architecture [15], the proposed STB-FIR architecture can achieve up to 39.97% area saving and 41.23% power reduction. In addition, compared with the existing symmetric FFA designs [11,12], the proposed STB-FIR architecture can achieve up to 31.5% and 40.5% sampling frequency improvement as well as 1.20× AE improvement for the fixed FIR implementations. These results illustrate the efficiency of the proposed STB-FIR architecture and confirm that STB-FIR is applicable to both reconfigurable and fixed FIR implementations for area- and power-efficient high-speed signal processing. However, the optimization of multipliers increases the relative importance of the adder trees, especially in fixed FIRs. Therefore, further optimization of the adder tree implementation is left for future work.