Scalable Hardware Efficient Architecture for Parallel FIR Filters with Symmetric Coefficients

Ye, Jinghao; Yanagisawa, Masao; Shi, Youhua

doi:10.3390/electronics11203272


by Jinghao Ye ¹, Masao Yanagisawa ² and Youhua Shi ²,*

¹ NVIDIA Semiconductor Technology (Shanghai) Co., Ltd., Shanghai 200001, China

² Faculty of Science and Engineering, Waseda University, Tokyo 169-8555, Japan

* Author to whom correspondence should be addressed.

Electronics 2022, 11(20), 3272; https://doi.org/10.3390/electronics11203272

Submission received: 31 August 2022 / Revised: 8 October 2022 / Accepted: 9 October 2022 / Published: 11 October 2022

(This article belongs to the Special Issue FPGAs Based Hardware Design)


Abstract:

Symmetric convolutions can be utilized for potential hardware resource reduction. However, they have not been realized in state-of-the-art transposed block FIR designs. Therefore, we explore the feasibility of symmetric convolution in transposed parallel FIRs and propose a scalable hardware efficient parallel architecture. The proposed design inserts delay elements after multipliers for temporal reuse of intermediate tap products. By doing this, the number of required multipliers can be reduced by half. As a result, we can achieve up to 3.2× and 1.64× area efficiency improvements over the modern transposed block method on reconfigurable and fixed designs, respectively. These results confirm the effectiveness of the proposed STB-FIR architecture for hardware-efficient, high-speed signal processing.

Keywords:

FIR filter; symmetric transposed FIR; hardware efficient; high-speed signal processing

1. Introduction

Finite impulse response (FIR) filters, one of the primary classes of digital filters, have been widely used in signal processing due to their stability and linear-phase characteristics. With the time-domain input $x_{n}$ and the filter coefficients $h_{m}$, the corresponding output $y_{n}$ of a $T$-tap FIR can be obtained as $y_{n} = \sum_{m = 0}^{T - 1} h_{m} \cdot x_{n - m}$, according to the discrete-time convolution [1]. Due to the accuracy requirement in the frequency domain, $T$ is generally large [2] and consequently incurs a large silicon area with significant power consumption. Therefore, over the past decades, much research effort has been directed toward hardware-efficient FIR filter implementations.
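The convolution sum above can be sketched in a few lines of Python. This is only an illustrative software model, not a hardware description; the function name `fir_direct` and the choice to treat samples before the start of the stream as zero are assumptions made for the example.

```python
def fir_direct(x, h):
    """Direct evaluation of y[n] = sum_{m=0}^{T-1} h[m] * x[n-m].

    Assumption for this sketch: x[k] is taken as 0 for k < 0.
    """
    T = len(h)
    y = []
    for n in range(len(x)):
        acc = 0
        for m in range(T):
            if n - m >= 0:  # samples before the start of the stream count as zero
                acc += h[m] * x[n - m]
        y.append(acc)
    return y
```

Each output needs $T$ multiplications in this form, which is why a large $T$ translates directly into multiplier cost.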

Because multipliers are generally more expensive than adders in terms of area and power consumption, many previous works have focused on the design of FIR filters with area-efficient multipliers. In some application-specific FIRs, the coefficients can be pre-determined; thus, instead of using the costly general multipliers, several constant-multiplier-based designs (CM) have been proposed [3,4,5,6,7,8]. These CM-based FIRs, however, are application-specific and only work for a specific coefficient set; therefore, they are not suitable in reconfigurable systems with programmable coefficients for real-time applications, such as adaptive pulse shaping and signal equalization [9]. On the other hand, because FIR filters are widely used in high-throughput multimedia signal processing and cellular wireless communication systems, there are also several parallel FIR implementations, such as fast FIR algorithm (FFA) [10,11,12] and block FIRs [13,14,15]. The basic idea of FFA is to break up an FIR filter into several sub-filters using polyphase decomposition so that they can operate in parallel with reduced computation complexity. In an FFA-based FIR filter design, the required number of multipliers can be greatly reduced at the cost of an increase in adders for extra pre-processing and post-processing. Symmetric FFA-based designs have also been proposed in [11,12] with the consideration of symmetric convolutions. Although FFA-based methods can achieve a significant reduction in multipliers, they are only effective for parallel FIR filters with low parallelism. Otherwise, the increased adders will introduce significant area overhead with the increased design complexity. On the other hand, block FIRs have also been proposed in [13,14,15] for high-throughput signal processing. Unlike FFA-based designs, block FIRs can be combined with CM-based methods for a specific coefficient set; however, the corresponding hardware resource increases linearly with the degree of parallelism. 
Therefore, the area and power efficiency of existing parallel FIRs remain design challenges.

The symmetry of coefficients, which can lead to a significant saving in hardware cost, has not been taken into consideration in the existing block FIR designs yet. Therefore, we explore the feasibility of symmetric convolution in transposed parallel FIRs and propose a symmetric transposed block FIR filter (STB-FIR) architecture for area/power-efficient implementation of block FIR filters in which delay elements are inserted after multipliers for temporal reuse of intermediate tap products.

The remainder of this paper is organized as follows. The proposed STB-FIR architecture is illustrated in Section 2 with the corresponding generalized formulation. Evaluation results are provided in Section 3. Finally, the conclusion is given in Section 4.

2. Proposed STB-FIR

A T-tap digital FIR filter can be generally implemented either in the direct form or in the transposed form, as shown in Figure 1.

2.1. Generalized Mathematical Formulation for TB-Based FIRs

For $L$-parallel processing, the transposed block FIR proposed in [15] takes a block of $L$ new input samples $\{x_{n}, x_{n - 1}, \dots, x_{n - L + 1}\}$ and produces a block of $L$ output samples $\{y_{n}, y_{n - 1}, \dots, y_{n - L + 1}\}$ in each clock cycle. For a $T$-tap $L$-parallel transposed block FIR with $T = M L$, where $L$ and $M$ denote the degree of processing parallelism and the total number of blocks, respectively, the operations can be expressed in matrix form as:

$$
\begin{bmatrix} y_{n} \\ y_{n - 1} \\ ⋮ \\ y_{n - L + 1} \end{bmatrix} = \begin{bmatrix} x_{n} & \dots & x_{n - L + 1} & x_{n - L} & \dots & x_{n - 2 L + 1} & \dots & x_{n - M L + 1} \\ x_{n - 1} & \dots & x_{n - L} & x_{n - L - 1} & \dots & x_{n - 2 L} & \dots & x_{n - M L} \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋮ & ⋮ & ⋮ & ⋮ \\ x_{n - j} & \dots & x_{n - j - L + 1} & x_{n - j - L} & \dots & x_{n - j - 2 L + 1} & \dots & x_{n - j - M L + 1} \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋮ & ⋮ & ⋮ & ⋮ \\ x_{n - L + 1} & \dots & x_{n - 2 L + 2} & x_{n - 2 L + 1} & \dots & x_{n - 3 L + 2} & \dots & x_{n - (M + 1) L + 2} \end{bmatrix} \cdot \begin{bmatrix} h_{0} \\ h_{1} \\ ⋮ \\ h_{L - 1} \\ ⋮ \\ h_{M L - 1} \end{bmatrix} \tag{1}
$$

where $\{x_{n}, x_{n - 1}, \dots, x_{n - L + 1}\}$ is the input of the current clock cycle, $\{x_{n - L}, x_{n - L - 1}, \dots, x_{n - 2 L + 1}\}$ represents the input in the previous clock cycle, and $\{x_{n - m L}, x_{n - m L - 1}, \dots, x_{n - (m + 1) L + 1}\}$ represents the input $m$ clock cycles earlier. The output of an FIR is thus a linear combination of the current input and some previous values.

According to (1), as an $L$-parallel FIR has $L$ new inputs in each cycle, we define $B_{m}$ as

$$
B_{m} \equiv \begin{bmatrix} x_{n - m L} & x_{n - m L - 1} & \dots & x_{n - (m + 1) L + 1} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ x_{n - m L - j} & x_{n - m L - (j + 1)} & \dots & x_{n - (m + 1) L - (j - 1)} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ x_{n - (m + 1) L + 1} & x_{n - (m + 1) L} & \dots & x_{n - (m + 2) L + 2} \end{bmatrix} \tag{2}
$$

Hence, substituting it into (1), we can obtain

$$
\begin{bmatrix} y_{n} \\ y_{n - 1} \\ ⋮ \\ y_{n - L + 1} \end{bmatrix} = \begin{bmatrix} B_{0} & B_{1} & \dots & B_{m} & \dots & B_{M - 1} \end{bmatrix} \cdot \begin{bmatrix} h_{0} \\ h_{1} \\ ⋮ \\ h_{M L - 1} \end{bmatrix} \tag{3}
$$

Similarly, the coefficients can also be divided into $M$ blocks as

$$
H = \begin{bmatrix} h_{0} \\ h_{1} \\ ⋮ \\ h_{M L - 1} \end{bmatrix} \equiv \begin{bmatrix} H_{0} \\ H_{1} \\ ⋮ \\ H_{M - 1} \end{bmatrix} \quad \text{where} \quad H_{m} = \begin{bmatrix} h_{m L} \\ h_{m L + 1} \\ ⋮ \\ h_{(m + 1) L - 1} \end{bmatrix} \tag{4}
$$

Thus, (3) can be further rewritten as

$$
\begin{aligned} \begin{bmatrix} y_{n} \\ ⋮ \\ y_{n - j} \\ ⋮ \\ y_{n - L + 1} \end{bmatrix} &= \begin{bmatrix} B_{0} & B_{1} & \dots & B_{m} & \dots & B_{M - 1} \end{bmatrix} \cdot \begin{bmatrix} H_{0} \\ ⋮ \\ H_{m} \\ ⋮ \\ H_{M - 1} \end{bmatrix} \\ &= B_{0} \cdot H_{0} + B_{1} \cdot H_{1} + \dots + B_{M - 1} \cdot H_{M - 1} \\ &= \sum_{m = 0}^{M - 1} B_{m} \cdot H_{m} \end{aligned} \tag{5}
$$
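The block decomposition in (2) to (5) can be checked numerically with a small software model. The sketch below is an illustration only: the helper names (`sample`, `block_matrix`, `block_outputs`, `direct_output`) are hypothetical, and samples outside the recorded stream are assumed to be zero.

```python
def sample(x, k):
    """x[k], with zero padding outside the recorded stream (an assumption)."""
    return x[k] if 0 <= k < len(x) else 0


def block_matrix(x, n, m, L):
    """B_m of (2): entry (j, i) is x[n - m*L - j - i]."""
    return [[sample(x, n - m * L - j - i) for i in range(L)] for j in range(L)]


def block_outputs(x, h, n, L):
    """[y_n, ..., y_{n-L+1}] computed as sum_m B_m . H_m, following (5)."""
    assert len(h) % L == 0, "T must equal M * L"
    M = len(h) // L
    y = [0] * L
    for m in range(M):
        B = block_matrix(x, n, m, L)
        H = h[m * L:(m + 1) * L]  # H_m of (4)
        for j in range(L):
            y[j] += sum(B[j][i] * H[i] for i in range(L))
    return y


def direct_output(x, h, n):
    """Reference value of y_n from the plain convolution sum."""
    return sum(h[k] * sample(x, n - k) for k in range(len(h)))
```

For any block position `n`, `block_outputs(x, h, n, L)` should match `[direct_output(x, h, n - j) for j in range(L)]`, which is exactly the statement of (5).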

Here, let the calculation of $B_{m} \cdot H_{m}$ be $T B_{m}$, called a time block in the following.

$$
\begin{aligned} T B_{m} &\equiv B_{m} \cdot H_{m} \\ &= \begin{bmatrix} x_{n - m L} & x_{n - m L - 1} & \dots & x_{n - (m + 1) L + 1} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ x_{n - m L - j} & x_{n - m L - (j + 1)} & \dots & x_{n - (m + 1) L - (j - 1)} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ x_{n - (m + 1) L + 1} & x_{n - (m + 1) L} & \dots & x_{n - (m + 2) L + 2} \end{bmatrix} \cdot \begin{bmatrix} h_{m L} \\ h_{m L + 1} \\ ⋮ \\ h_{(m + 1) L - 1} \end{bmatrix} \end{aligned} \tag{6}
$$

According to (5), a TB-based scalable FIR architecture can be implemented in a transposed form, as shown in Figure 2, where a $T$-tap $L$-parallel block FIR filter has $M$ ($= T / L$) time blocks and can process $M L$ input samples in $M$ clock cycles with all the TBs working simultaneously. It should be noted that the input sample matrix $B_{m}$ contains both current and previous inputs. Moreover, each time block ($T B_{m}$) has its own coefficients, $H_{m}$; its result is added to the output of the neighboring time block ($T B_{m + 1}^{'}$) to generate the output of $T B_{m}$, which is then sent to the next time block ($T B_{m - 1}$).
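The chained accumulation between neighboring time blocks can be modeled cycle by cycle. The following is a behavioral sketch under stated assumptions, not a description of the circuit in Figure 2: every time block is assumed to see the current input block, a register is assumed between neighboring time blocks, and the names `sample` and `transposed_block_fir` are hypothetical.

```python
def sample(x, k):
    """x[k], with zero padding outside the recorded stream (an assumption)."""
    return x[k] if 0 <= k < len(x) else 0


def transposed_block_fir(x, h, L):
    """Cycle-by-cycle model: TB_m adds its local result to the registered
    output of TB_{m+1} and passes the sum on toward TB_{m-1}."""
    T = len(h)
    assert T % L == 0 and len(x) % L == 0
    M = T // L
    # regs[m] holds what TB_m produced in the previous cycle;
    # regs[M] stays zero because nothing feeds the last time block.
    regs = [[0] * L for _ in range(M + 1)]
    y = [0] * len(x)
    for cycle in range(len(x) // L):
        n = (cycle + 1) * L - 1  # index of the newest sample in this cycle
        nxt = [[0] * L for _ in range(M + 1)]
        for m in range(M):
            H = h[m * L:(m + 1) * L]
            for j in range(L):
                local = sum(sample(x, n - j - i) * H[i] for i in range(L))
                nxt[m][j] = local + regs[m + 1][j]
        for j in range(L):
            y[n - j] = nxt[0][j]  # TB_0 delivers the L outputs of this cycle
        regs = nxt
    return y
```

In this model the output of $T B_{0}$ after each cycle equals the direct convolution result for the $L$ newest output indices, which is a quick way to sanity-check the block partitioning for a given $T$ and $L$.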

2.2. Proposed STB-FIR Design

For a linear-phase FIR filter, the impulse response can be symmetric ($h_{i} = h_{T - 1 - i}$) or anti-symmetric ($h_{i} = - h_{T - 1 - i}$). To simplify the explanation, only the symmetric realization with $h_{i} = h_{T - 1 - i}$ is discussed in the following. It is worth noting that the proposed architecture can also be applied to anti-symmetric FIR implementations.

As shown in (2), there are $2 L - 1$ input samples that are multiplied with the corresponding coefficients in each time block. Among the $2 L - 1$ different input samples, the $L$ samples $\{x_{n - m L}, x_{n - m L - 1}, \dots, x_{n - (m + 1) L + 1}\}$ are the inputs in the current clock cycle, while the other $L - 1$ samples $\{x_{n - (m + 1) L}, x_{n - (m + 1) L - 1}, \dots, x_{n - (m + 2) L + 2}\}$ were obtained in the previous clock cycle. Therefore, $L^{2}$ multipliers are required in total. In [15], delay elements are inserted on the input side to temporarily store the $L - 1$ samples for later calculation. Unfortunately, because $H_{i} \neq H_{M - 1 - i}$ in the transposed block form, the symmetry of coefficients cannot be easily exploited.

To explore the feasibility of symmetric convolution in parallel block FIR filters, a hardware efficient symmetric transposed block FIR architecture (STB-FIR) is proposed. In STB-FIR, delay elements are inserted after multipliers; thus, temporal reuse of the intermediate tap products becomes possible. Consequently, half of the multipliers can be saved at the cost of increased registers.

For a $T$-tap $L$-parallel transposed FIR with symmetric coefficients where $T = M L$, there are two cases (i.e., $M$ is odd or even) that should be considered in the proposed STB-FIR design method, as shown in Figure 3.

2.2.1. Case 1: M Is Even

A linear-phase FIR filter that falls into this category has an even number of taps (i.e., $T$ is even), an even number of TBs (i.e., $M$ is even), and symmetric coefficients (i.e., $h_{i} = h_{T - 1 - i}$). Without loss of generality, let us consider two symmetric TBs ($T B_{m}$ and $T B_{M - 1 - m}$) in a TB-based symmetric FIR.

For the time block $T B_{m}$ shown in (6), we can divide it into two terms as below, which can be implemented as shown in Figure 4a,b, respectively.

$$
\begin{aligned} T B_{m} &= B_{m} \cdot H_{m} \\ &= \begin{bmatrix} x_{n - m L} & x_{n - m L - 1} & \dots & x_{n - (m + 1) L + 1} \\ x_{n - m L - 1} & ⋮ & ⋰ & 0 \\ ⋮ & x_{n - (m + 1) L + 1} & ⋰ & ⋮ \\ x_{n - (m + 1) L + 1} & 0 & \dots & 0 \end{bmatrix} \cdot \begin{bmatrix} h_{m L} \\ h_{m L + 1} \\ ⋮ \\ h_{(m + 1) L - 1} \end{bmatrix} \\ &\quad + \begin{bmatrix} 0 & 0 & \dots & 0 \\ 0 & ⋮ & ⋰ & x_{n - (m + 1) L} \\ ⋮ & 0 & ⋰ & ⋮ \\ 0 & x_{n - (m + 1) L} & \dots & x_{n - (m + 2) L + 2} \end{bmatrix} \cdot \begin{bmatrix} h_{m L} \\ h_{m L + 1} \\ ⋮ \\ h_{(m + 1) L - 1} \end{bmatrix} \end{aligned} \tag{7}
$$

Here, it should be mentioned that there are $2 L - 1$ data inputs that are multiplied with the corresponding coefficients ($H_{m}$) in each time block ($T B_{m}$). Among the $2 L - 1$ different input samples, if the $L$ samples $\{x_{n - m L}, x_{n - m L - 1}, \dots, x_{n - (m + 1) L + 1}\}$ are the inputs in the current clock cycle, the other $L - 1$ samples $\{x_{n - (m + 1) L}, x_{n - (m + 1) L - 1}, \dots, x_{n - (m + 2) L + 2}\}$ were obtained in the previous clock cycle. Since, in transposed form, every input sample should be multiplied with all the coefficients to generate the intermediate tap products, we can perform the multiplications first and then save the products in registers for later addition. By doing this, intermediate products can be reused at the cost of several additional registers, while the required multipliers are reduced by half. As a result, the two parts shown in Figure 4 can be combined into one circuit, as shown in Figure 5, where the current $L$ input samples $\{x_{n - m L}, x_{n - m L - 1}, \dots, x_{n - (m + 1) L + 1}\}$ are first multiplied with all the coefficients; some of the products are delivered directly for addition, while the others are first saved into registers and then sent to the adder.
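The idea of multiplying first and delaying the products can be illustrated with a small behavioral model of one time block. This is a sketch under assumptions, not the circuit of Figure 5: the product table `cur`, the register table `prev`, and the function names are hypothetical, and samples before the start of the stream are taken as zero.

```python
def sample(x, k):
    """x[k], with zero padding outside the recorded stream (an assumption)."""
    return x[k] if 0 <= k < len(x) else 0


def tb_reference(x, H, n):
    """Row j of B . H computed directly from samples: sum_i x[n-j-i] * H[i]."""
    L = len(H)
    return [sum(sample(x, n - j - i) * H[i] for i in range(L)) for j in range(L)]


def tb_with_product_registers(x, H):
    """Same result, but only the L current inputs are multiplied in each cycle;
    products that belong to older samples are read from registers."""
    L = len(H)
    assert len(x) % L == 0
    prev = [[0] * L for _ in range(L)]  # products registered in the previous cycle
    outputs = []
    for cycle in range(len(x) // L):
        n = (cycle + 1) * L - 1
        u = [x[n - s] for s in range(L)]  # current inputs, newest first
        cur = [[u[s] * H[i] for i in range(L)] for s in range(L)]
        out = []
        for j in range(L):
            acc = 0
            for i in range(L):
                s = i + j  # sample offset needed by row j, tap i
                acc += cur[s][i] if s < L else prev[s - L][i]
            out.append(acc)
        outputs.append(out)
        prev = cur  # the delay elements placed after the multipliers
    return outputs
```

The first term of (7) corresponds to the entries read from `cur`, and the second term to the entries read from `prev`; no sample from the previous cycle is multiplied again.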

Furthermore, in a TB-based symmetric FIR (i.e., $h_{i} = h_{T - 1 - i}$), the coefficients in the two symmetric TBs ($T B_{m}$ and $T B_{M - 1 - m}$) are $H_{m}$ and $H_{M - 1 - m}$, respectively, and we have

$$
H_{M - 1 - m} = \begin{bmatrix} h_{(M - 1 - m) L} \\ h_{(M - 1 - m) L + 1} \\ ⋮ \\ h_{[(M - 1 - m) + 1] L - 2} \\ h_{[(M - 1 - m) + 1] L - 1} \end{bmatrix} = \begin{bmatrix} h_{(m + 1) L - 1} \\ h_{(m + 1) L - 2} \\ ⋮ \\ h_{m L + 1} \\ h_{m L} \end{bmatrix} = E^{- 1} \cdot H_{m} \tag{8}
$$

where $E^{- 1} = \begin{bmatrix} 0 & \dots & 1 \\ ⋮ & ⋰ & ⋮ \\ 1 & \dots & 0 \end{bmatrix}$ reverses the order of the coefficient vector, and

$$
\begin{aligned} T B_{M - 1 - m} &= \begin{bmatrix} x_{n - (M - 1 - m) L} & x_{n - (M - 1 - m) L - 1} & \dots & x_{n - (M - 1 - m) L - (L - 1)} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ x_{n - (M - 1 - m) L - j} & x_{n - (M - 1 - m) L - (j + 1)} & \dots & x_{n - (M - 1 - m) L - (L + j - 1)} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ x_{n - (M - 1 - m) L - (L - 1)} & x_{n - (M - 1 - m) L - L} & \dots & x_{n - (M - 1 - m) L - (2 L - 2)} \end{bmatrix} \cdot E^{- 1} \cdot H_{m} \\ &= \begin{bmatrix} x_{n - (M - 1 - m) L} & x_{n - (M - 1 - m) L - 1} & \dots & x_{n - (M - 1 - m) L - (L - 1)} \\ x_{n - (M - 1 - m) L - 1} & ⋮ & ⋰ & 0 \\ ⋮ & x_{n - (M - 1 - m) L - (L - 1)} & ⋰ & ⋮ \\ x_{n - (M - 1 - m) L - (L - 1)} & 0 & \dots & 0 \end{bmatrix} \cdot E^{- 1} \cdot H_{m} \\ &\quad + \begin{bmatrix} 0 & 0 & \dots & 0 \\ 0 & ⋮ & ⋰ & x_{n - (M - 1 - m) L - L} \\ ⋮ & 0 & ⋰ & ⋮ \\ 0 & x_{n - (M - 1 - m) L - L} & \dots & x_{n - (M - 1 - m) L - (2 L - 2)} \end{bmatrix} \cdot E^{- 1} \cdot H_{m} \end{aligned} \tag{9}
$$

Although $H_{m} \neq H_{M - 1 - m}$, they consist of the same $L$ separate coefficients $\{h_{m L}, h_{m L + 1}, \dots, h_{(m + 1) L - 1}\}$. Since, in transposed form, every input sample should be multiplied with all the coefficients to generate the intermediate tap products, we can implement each pair of symmetric TBs ($T B_{m}$ and $T B_{M - 1 - m}$) as one symmetric time block ($S T B_{m}$) to take advantage of the same separate coefficients for tap product reuse.
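The product sharing between a symmetric pair can be sketched as follows. This is a behavioral illustration only, not the STB circuit of Figure 6: it assumes that both time blocks of a pair operate on the same current input block, as in a transposed structure, and the names `coefficient_block`, `block_product`, and `stb_pair` are hypothetical.

```python
def sample(x, k):
    """x[k], with zero padding outside the recorded stream (an assumption)."""
    return x[k] if 0 <= k < len(x) else 0


def coefficient_block(h, L, m):
    """H_m: the m-th group of L consecutive coefficients."""
    return h[m * L:(m + 1) * L]


def block_product(x, H, n):
    """Reference: row j is sum_i x[n-j-i] * H[i]."""
    L = len(H)
    return [sum(sample(x, n - j - i) * H[i] for i in range(L)) for j in range(L)]


def stb_pair(x, H):
    """Local results for coefficient vectors H and reversed(H), both read
    from ONE table of L*L products per cycle."""
    L = len(H)
    assert len(x) % L == 0
    prev = [[0] * L for _ in range(L)]  # registered products of the previous cycle
    results, multiplications = [], 0
    for cycle in range(len(x) // L):
        n = (cycle + 1) * L - 1
        cur = [[x[n - s] * H[k] for k in range(L)] for s in range(L)]
        multiplications += L * L
        fwd, rev = [], []
        for j in range(L):
            f = r = 0
            for i in range(L):
                s = i + j
                row = (cur if s < L else prev)[s % L]
                f += row[i]          # tap i of H
                r += row[L - 1 - i]  # tap i of reversed(H) is H[L-1-i]
            fwd.append(f)
            rev.append(r)
        results.append((fwd, rev))
        prev = cur
    return results, multiplications
```

Two separate time blocks would need $2 L^{2}$ multiplications per cycle in this model, whereas the shared table needs $L^{2}$, which is the halving described above.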

The proposed STB-FIR structure with an even $M$ is shown in Figure 3a, which only consists of the basic STB units, and the detailed STB design is shown in Figure 6. Each STB has $L$ data inputs (i.e., $x_{n}, x_{n - 1}, x_{n - 2}, \dots, x_{n - L + 1}$) and $L$ coefficients ($H_{k}$), where $H_{k} = \begin{bmatrix} h_{k L} & h_{k L + 1} & \dots & h_{(k + 1) L - 1} \end{bmatrix}^{T}$. It also accepts data from the neighboring STBs, performs the sum operations, and then sends the results to the corresponding two neighboring STBs.

The proposed STB design can be viewed as the combination of two TBs, while the total number of multipliers is reduced by half when compared with the implementation of using two separate TBs. Meanwhile, the number of adders is kept the same as the existing transposed block FIR [15]. Therefore, the number of multipliers can be reduced at the cost of increased delay elements in STB-FIR. Since multipliers are much more expensive in area and power consumption than registers, STB-FIR can achieve significant area and power savings when compared with the existing transposed block FIRs.

2.2.2. Case 2: M Is Odd

When the number of time blocks ($M$) is odd, the FIR structure cannot be fully TB-based folded, and a half-STB unit ($T B_{(M - 1) / 2}$) is required, as shown in Figure 3b. Fortunately, due to the symmetry of coefficients (i.e., $h_{i} = h_{T - 1 - i}$), we have

$$
H_{\frac{M - 1}{2}} = \begin{bmatrix} h_{\frac{M - 1}{2} \cdot L} \\ ⋮ \\ h_{\frac{M - 1}{2} \cdot L + j} \\ ⋮ \\ h_{\frac{M - 1}{2} \cdot L + L - 1} \end{bmatrix} = \begin{bmatrix} h_{\frac{M - 1}{2} \cdot L} \\ ⋮ \\ ⋮ \\ h_{\frac{M - 1}{2} \cdot L + 1} \\ h_{\frac{M - 1}{2} \cdot L} \end{bmatrix} \tag{10}
$$

Thus, $T B_{(M - 1) / 2}$ can be calculated as

$$
T B_{\frac{M - 1}{2}} = B_{\frac{M - 1}{2}} \cdot H_{\frac{M - 1}{2}} = B_{\frac{M - 1}{2}} \cdot \begin{bmatrix} h_{\frac{M - 1}{2} \cdot L} \\ ⋮ \\ ⋮ \\ h_{\frac{M - 1}{2} \cdot L + 1} \\ h_{\frac{M - 1}{2} \cdot L} \end{bmatrix} \tag{11}
$$

Due to the symmetry in $H_{(M - 1) / 2}$, $T B_{(M - 1) / 2}$ can be realized using an STB-like unit design with only the length changed from $L$ to $\lceil \frac{L}{2} \rceil$.
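The self-symmetry of the middle coefficient block can be checked with a short sketch. It assumes that the bracket around $L/2$ in the text denotes rounding up, so that an odd $L$ keeps its center coefficient; the function names are hypothetical.

```python
def middle_block(h, L):
    """H_{(M-1)/2}: the center group of L coefficients when M = T / L is odd."""
    T = len(h)
    assert T % L == 0, "T must equal M * L"
    M = T // L
    assert M % 2 == 1, "this case only applies when M is odd"
    mid = (M - 1) // 2
    return h[mid * L:(mid + 1) * L]


def distinct_middle_coefficients(h, L):
    """For symmetric h the middle block is a palindrome, so only the first
    ceil(L / 2) coefficients are distinct (rounding up is an assumption)."""
    block = middle_block(h, L)
    assert block == block[::-1], "h is expected to be symmetric"
    return block[:(L + 1) // 2]
```

For the 6-tap symmetric filter with $L = 2$ this leaves a single distinct coefficient in the middle block, which matches the intuition behind the half-STB unit.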

For better illustration, Figure 7 gives example designs of the proposed STB-FIR and the existing transposed block FIR [15] for two 6-tap transposed FIRs with different degrees of parallelism ($L$). For the 6-tap 3-parallel transposed FIR ($T = 6$ and $L = 3$), because $M$ equals 2, it is TB-based fully symmetric folded, and the corresponding implementation of the proposed STB-FIR is shown in Figure 7a. Because $L = 3$, three data inputs (i.e., $x_{n}$, $x_{n - 1}$, and $x_{n - 2}$) are sent to the STB in each clock cycle. When compared with the transposed block FIR [15], the number of multipliers is reduced from 18 to 9 with the same number of adders, while the number of registers is increased by 3. On the other hand, for the 6-tap 2-parallel transposed FIR ($T = 6$ and $L = 2$), as $M = 3$, it is not TB-based fully symmetric folded, and a half-STB unit is therefore required, as shown in Figure 7b. Because $L = 2$, two input samples (i.e., $x_{n}$ and $x_{n - 1}$) are applied in each clock cycle. The STB and the half-STB units work in a similar way: they obtain data from the neighboring units, perform the sum operations, and then send the results to the corresponding neighboring units. Finally, two outputs (i.e., $y_{n}$ and $y_{n - 1}$) are generated in each clock cycle. When compared with [15], the number of multipliers is reduced from 12 to 6 with the same number of adders, and the number of registers is increased by only 2 in STB-FIR.

From this simple example, it can be observed that, unlike the existing transposed block FIR structure [15] in which delay elements are inserted on the input side, the proposed STB-FIR inserts delay elements after multipliers for temporal reuse of tap products. By doing this, the number of costly multipliers can be reduced by half at the cost of a slight increase in low-cost registers.

3. Evaluation Results and Comparisons

To examine the effectiveness of the proposed STB-FIR architecture, implementation results of various FIR filters are provided and compared with the existing works. In our work, all the designs are implemented in ROHM 180 nm CMOS process technology, and the post-synthesis simulation was conducted on the netlist for timing and power evaluation.

3.1. Comparison of Hardware Complexity

The overall hardware complexity of the proposed STB-FIR, as a general FIR structure, is compared with the transposed block FIR [15], the DA-based approach [13], and the FFA-based $L$-parallel structure [1] in Table 1.

A canonical $T$-tap FIR filter consists of $T$ multipliers, $T - 1$ adders, and $T$ registers, where the tap number ($T$) depends on the specific accuracy requirement. Therefore, a straightforward implementation of a $T$-tap $L$-parallel FIR design requires $L$ times the hardware resources. The transposed block structure proposed in [15] involves $T L$ multipliers, $L (T - 1)$ adders, and $T + L - 1$ registers, of which $L - 1$ registers are inserted at the input side to store $L - 1$ samples for later calculation. The required hardware resource of the distributed arithmetic (DA)-based FIR structure [13] is estimated according to the results shown in [15] for comparison purposes. Here, it should be noted that DA [13] was implemented in the direct form, while the transposed block FIR [15] was in the transposed form. As for the FFA-based method, the required hardware resource is formulated according to the analysis in [1,11,12]. As illustrated above, the proposed STB-FIR involves in total $T L / 2$ multipliers, $L (T - 1)$ adders, and approximately $\frac{T (3 L - 2)}{8} + T$ registers, among which about $\frac{T (3 L - 2)}{8}$ registers are used as the delay elements for intermediate product sharing, and the other $T$ registers are required for the output storage of the adder trees.
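The resource expressions quoted above can be evaluated for concrete $T$ and $L$ with a small helper. This is only a convenience sketch of the stated formulas; the function names and the dictionary layout are hypothetical, and the STB-FIR register count is the approximate expression from the text, so it need not be an integer.

```python
def transposed_block_cost(T, L):
    """Resource counts stated for the transposed block FIR of [15]."""
    return {"multipliers": T * L, "adders": L * (T - 1), "registers": T + L - 1}


def stb_fir_cost(T, L):
    """Resource counts stated for STB-FIR; the register count is approximate."""
    return {
        "multipliers": T * L / 2,
        "adders": L * (T - 1),
        "registers": T * (3 * L - 2) / 8 + T,
    }
```

For the two 6-tap examples discussed earlier, these expressions give 18 versus 9 multipliers for $L = 3$ and 12 versus 6 for $L = 2$, with identical adder counts.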

When compared with the DA-based FIR [13] and the transposed block FIR [15], the number of multipliers in STB-FIR is reduced by half with the same number of adders, which indicates promising area savings in STB-FIR. The cost of this multiplier reduction is a slightly increased number of low-cost registers, which are used to store the intermediate tap products for temporal reuse. According to our analysis, as $L$ and $T$ increase, the ratio of saved multipliers to added registers approaches $4 / 3$, which indicates that four multipliers can be saved at the cost of three additional registers. Because the area and power of a general multiplier are much larger than those of the corresponding number of registers, great area and power savings can be achieved in the proposed STB-FIR.
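The limiting ratio can be reproduced from the resource counts given for the two architectures. The short derivation below is a sketch that assumes the register increase is measured against the $T + L - 1$ registers of the transposed block FIR [15] and uses the approximate STB-FIR register count.

```latex
\begin{aligned}
\text{saved multipliers} &= T L - \frac{T L}{2} = \frac{T L}{2}, \\
\text{added registers} &\approx \frac{T (3 L - 2)}{8} + T - (T + L - 1) = \frac{T (3 L - 2)}{8} - (L - 1), \\
\frac{\text{saved multipliers}}{\text{added registers}} &\approx \frac{4 T L}{T (3 L - 2) - 8 (L - 1)} = \frac{4}{3 - \frac{2}{L} - \frac{8 (L - 1)}{T L}} \;\longrightarrow\; \frac{4}{3} \quad (L, T \to \infty).
\end{aligned}
```

For finite sizes the ratio lies above the limit; for example, $T = 128$ and $L = 8$ give $512 / 345 \approx 1.48$ under these assumptions.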

3.2. Comparison of Reconfigurable FIR Implementations

For evaluation and comparison purposes, reconfigurable FIRs with 8, 16, 24, 32, 64, and 128 taps and various degrees of processing parallelism (i.e., $L = 2$, $4$, and $8$) are implemented using STB-FIR and the existing transposed block FIR [15], and the corresponding synthesis results are shown in Table 2.

In [15], Mohanty et al. have shown that their transposed block FIR designs outperform the existing DA-based method [13]; therefore, we implemented their work as the state-of-the-art design, and the corresponding results are presented in Table 2 for comparison. The results are obtained using logic synthesis tools, and the power consumption is extracted based on simulation of the synthesis results with back-annotation of toggling activity, where uniformly distributed sample input values are applied. In the table, the sample frequency is calculated as the degree of parallelism ($L$) divided by the minimum clock period (MCP). Therefore, the $L$-parallel FIR designs, including both the proposed method and the existing method, can improve the sampling frequency over the baseline non-parallel FIR design at the cost of increased silicon area. On average, the proposed STB-FIR can achieve 39.12% area savings and 35.16% power consumption reduction for all the 18 FIRs when compared with the existing transposed block-based designs [15].

As the structure of reconfigurable FIR filters is very regular, the measurement of area efficiency per tap (AE) is introduced as

$$
\mathit{AE} = \frac{\mathit{Tap} \times \mathit{Parallelism}}{\mathit{Area} \times \mathit{MCP}} \tag{12}
$$

where $\mathit{Tap}$ is the tap number of the FIR, $\mathit{Parallelism}$ indicates the degree of processing parallelism (i.e., $L$ in the proposed STB-FIR), and $\mathit{MCP}$ is the minimum clock period. The normalized $\mathit{AE}$ results are given in Figure 8, in which the baseline design has a parallelism of 1, and STB-FIR can achieve up to 3.2× $\mathit{AE}$ improvement.
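The two metrics used in this comparison are simple ratios and can be written down directly. The sketch below only restates (12) and the sample frequency definition; the function names are hypothetical, and any numbers passed to them are placeholders rather than measured results from the tables.

```python
def sample_frequency(parallelism, mcp):
    """Sample frequency = degree of parallelism / minimum clock period."""
    return parallelism / mcp


def area_efficiency(taps, parallelism, area, mcp):
    """AE of (12): (Tap x Parallelism) / (Area x MCP)."""
    return taps * parallelism / (area * mcp)


def normalized(values, baseline):
    """Scale a list of AE values so that the baseline design equals 1."""
    return [v / baseline for v in values]
```

Because area and MCP appear in the denominator, a design that halves the area at an unchanged clock period doubles its AE in this metric.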

3.3. Comparison of Fixed FIR Implementations

As for fixed FIR filters in which the coefficients are pre-determined, by referring to [16], a $105$-tap FIR with $L = 3$ and a $60$-tap FIR with $L = 2$, $3$, and $4$ are implemented, in which the CSE multiplier [4] is adopted for area saving.

For the 105-tap 3-parallel FIR, STB-FIR can achieve 13.08% area saving and 23.05% power reduction when compared with the fixed transposed block FIR [15]. On the other hand, when compared with the fixed symmetric FFA designs [11,12], STB-FIR can achieve up to 31.5% and 40.5% sampling frequency improvement for the 105-tap 3-parallel FIR and the 60-tap 2-parallel one, respectively.

Without loss of generality, the normalized area efficiency comparison is presented in Figure 9, in which the baseline is the FFA-based design in [1]. For the 105-tap 3-parallel FIR, the proposed STB-FIR architecture can achieve 1.64× and 1.20× $\mathit{AE}$ improvements over the transposed block FIR [15] and the fixed symmetric FFA design [11], respectively. Moreover, for the $60$-tap FIR, the proposed STB-FIR filter design can achieve up to 1.29× $\mathit{AE}$ improvement over the existing FFA designs.

4. Conclusions

The feasibility of a symmetric transposed block FIR filter is explored in this paper by taking advantage of the symmetric coefficients for area-power efficient implementation. In the proposed STB-FIR architecture, using registers to save intermediate tap products for temporal reuse makes it possible to realize hardware-efficient symmetric transposed block FIR filters. The evaluation results show that compared with the state-of-the-art reconfigurable architecture [15], the proposed STB-FIR architecture can achieve up to 39.97% and 41.23% area saving and power reduction, respectively. On the other hand, compared with the existing symmetric FFA designs [11,12], the proposed STB-FIR architecture can achieve up to 31.5% and 40.5% sampling frequency improvement and 1.20× AE improvement as well for the fixed FIR implementations. These results clearly illustrate the efficiency of the proposed STB-FIR architecture and confirm that STB-FIR can be applicable to both reconfigurable and fixed FIR implementations for area-power efficient high-speed signal processing. On the other hand, the optimization of multipliers will result in an increased importance of adder trees, especially in fixed FIRs. Therefore, further optimization of adder tree implementation will be one of our future works.

Author Contributions

Conceptualization, J.Y. and Y.S.; methodology, J.Y.; validation, J.Y., M.Y. and Y.S.; data curation, J.Y.; writing—original draft preparation, Y.S.; writing—review and editing, J.Y., M.Y. and Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by Waseda University Grant for Special Research Projects (Project number: 2021C-147).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All the necessary data are included in the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Parhi, K.K. VLSI Digital Signal Processing Systems: Design and Implementation; Wiley: New York, NY, USA, 1999.
2. Mirchandani, G.; Zinser, R.L.; Evans, J.B. A new adaptive noise cancellation scheme in the presence of crosstalk [speech signals]. IEEE Trans. Circuits Syst. II Analog. Digit. Signal Process. 1995, 39, 681–694.
3. Dempster, A.G.; Macleod, M.D. Use of minimum-adder multiplier blocks in FIR digital filters. IEEE Trans. Circuits Syst. II Analog. Digit. Signal Process. 1995, 42, 569–577.
4. Mahesh, R.; Vinod, A.P. A new common subexpression elimination algorithm for realizing low-complexity higher order digital filters. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2008, 27, 217–229.
5. Lou, X.; Yu, Y.J.; Meher, P.K. Fine-grained critical path analysis and optimization for area-time efficient realization of multiple constant multiplications. IEEE Trans. Circuits Syst. I Regul. Pap. 2015, 62, 863–872.
6. Meidani, M.; Mashoufi, B. Introducing new algorithms for realizing an FIR filter with less hardware in order to eliminate power line interference from the ECG signal. IET J. Signal Process. 2016, 10, 709–716.
7. Ye, J.; Togawa, N.; Yanagisawa, M.; Shi, Y. A low cost and high speed CSD-based symmetric transpose block FIR implementation. In Proceedings of the IEEE International Conference on ASIC (ASICON), Guiyang, China, 25–28 October 2017.
8. Ye, J.; Togawa, N.; Yanagisawa, M.; Shi, Y. Static error analysis and optimization of faithfully truncated adders for area-power efficient FIR designs. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Sapporo, Japan, 26–29 May 2019.
9. Park, J.; Jeong, W.; Meimand, H.M.; Wang, Y.; Choo, H.; Roy, K. Computation sharing programmable FIR filter for low-power and high-performance applications. IEEE J. Solid-State Circuits 2004, 39, 348–357.
10. Parker, D.A.; Parhi, K.K. Low-area/power parallel FIR digital filter implementations. J. VLSI Signal Process. Syst. 1997, 17, 75–92.
11. Tsao, Y.; Choi, K. Area-efficient parallel FIR digital filter structures for symmetric convolutions based on fast FIR algorithm. IEEE Trans. Very Large Scale Integr. Syst. 2012, 20, 366–371.
12. Tsao, Y.; Choi, K. Area-efficient VLSI implementation for parallel linear-phase FIR digital filters of odd length based on fast FIR algorithm. IEEE Trans. Circuits Syst. II Express Briefs 2012, 59, 371–375.
13. Mohanty, B.K.; Meher, P.K. A high-performance energy-efficient architecture for FIR adaptive filter based on new distributed arithmetic formulation of block LMS algorithm. IEEE Trans. Signal Process. 2013, 61, 921–932.
14. Mohanty, B.K.; Meher, P.K.; Al-Maadeed, S.; Amira, A. Memory footprint reduction for power-efficient realization of 2-D finite impulse response filters. IEEE Trans. Circuits Syst. I Regul. Pap. 2014, 61, 120–133.
15. Mohanty, B.K.; Meher, P.K. A high performance FIR filter architecture for fixed and reconfigurable applications. IEEE Trans. Very Large Scale Integr. Syst. 2016, 24, 444–452.
16. Shahein, A.; Zhang, Q.; Lotze, N.; Manoli, Y. A novel hybrid monotonic local search algorithm for FIR filter coefficients optimization. IEEE Trans. Circuits Syst. I Regul. Pap. 2011, 59, 616–627.

Figure 1. General FIR implementations: (a) direct form and (b) transposed form.

Figure 2. Time-block-based transposed block FIR architecture.

Figure 3. Proposed STB-FIR structure with two cases: (a) TB-based fully symmetric folded structure when $M$ is even and (b) non-symmetric folded structure when $M$ is odd.

Figure 4. Implementation of the time block, $TB_{m}$, with the corresponding two terms in (7): (a) the former term and (b) the latter term.

Figure 5. Combined implementation of the two circuits shown in Figure 4.

Figure 6. STB design in the proposed STB-FIR.

Figure 7. Comparisons of parallel FIR implementations (proposed STB-FIR vs. transposed block FIR). (a) Implementations of a 6-tap 3-parallel FIR (T = 6, L = 3 and M = 2) and (b) implementations of a 6-tap 2-parallel FIR (T = 6, L = 2 and M = 3).

Figure 8. Normalized AE results of various reconfigurable FIR implementations.

Figure 9. Normalized AE results of various fixed FIR implementations.

Table 1. Comparisons of various parallel FIR architectures.

Architecture	No. of Multipliers	No. of Adders	No. of Registers
Transposed block	$T L$	$L (T - 1)$	$T + L - 1$
DA	$T L$	$L (T - 1)$	$T + L - 1$
L-Parallel FFA	$\frac{T}{\prod_{i = 1}^{r} L_{i}} \prod_{i = 1}^{r} M_{i}$	$A_{1} \prod_{i = 1}^{r} L_{i} + \sum_{i = 2}^{r} (A_{i} (\prod_{j = i + 1}^{r} L_{j}) (\prod_{k = 1}^{i - 1} M_{k})) + (\prod_{i = 1}^{r} M_{i}) (\frac{T}{\prod_{i = 1}^{r} L_{i}} - 1)$	$\frac{T}{L} (R_{1} \prod_{i = 2}^{r} M_{i} + R_{r}) + T$
Proposed STB-FIR	$T L / 2$	$L (T - 1)$	$\frac{T (3 L - 2)}{8} + T$

* In DA, the number of multipliers indicates the required LUT-based multiplier blocks.

* In $L$-Parallel FFA, $L = \prod_{i = 1}^{r} L_{i}$, and $M_{i}$, $R_{i}$, and $A_{i}$ indicate the number of multipliers, registers, and adders of the $i$th parallel FFA basic unit ($L_{i}$), respectively.

* For the registers in STB-FIR, $\frac{T (3 L - 2)}{8}$ is an approximate value depending on $M$ and $L$.
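The closed-form counts in Table 1 can be checked numerically. The following is a minimal Python sketch, not the authors' tooling: the function names are hypothetical, the STB-FIR multiplier entry is read as $TL/2$ (which matches the multiplier counts reported in Table 2), and the STB-FIR register count uses the approximate expression from the footnote, so it may deviate from the synthesized value for some $T$ and $L$.

```python
def transposed_block_counts(T, L):
    """Resource counts of a T-tap, L-parallel transposed block FIR (Table 1)."""
    return {
        "multipliers": T * L,
        "adders": L * (T - 1),
        "registers": T + L - 1,
    }


def stb_fir_counts(T, L):
    """Resource counts of the proposed STB-FIR (Table 1).

    Assumption: the multiplier entry is interpreted as T*L/2.
    The register count is the approximate formula T(3L-2)/8 + T.
    """
    return {
        "multipliers": T * L // 2,
        "adders": L * (T - 1),
        "registers": T * (3 * L - 2) / 8 + T,
    }


if __name__ == "__main__":
    # Compare both architectures for a 16-tap filter at several block sizes.
    for L in (2, 4, 8):
        print(L, transposed_block_counts(16, L), stb_fir_counts(16, L))
```

For $T = 16$ and $L = 4$, the sketch gives 64 versus 32 multipliers, 60 adders for both designs, and 19 versus 36 registers, in line with the corresponding rows of Table 2.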

Table 2. Comparisons of various parallel FIR architectures.

Tap	Parallelism	FIR Structure	No. of Multipliers	No. of Adders	No. of Registers	Sampling Freq. (MHz)	Area ($\mu m^{2}$)	Area Saving (%)	Power (mW)	Power Saving (%)
8	Parallel L = 2	Transposed block	16	14	9	207.04	203,380	39.35	12.47	34.40
	Parallel L = 2	Proposed STB-FIR	8	14	10	206.83	123,353	39.35	8.18	34.40
	Parallel L = 4	Transposed block	32	28	11	382.78	383,343	39.97	21.72	34.16
	Parallel L = 4	Proposed STB-FIR	16	28	18	382.41	230,127	39.97	14.30	34.16
	Parallel L = 8	Transposed block	64	56	15	735.97	731,514	37.39	42.17	35.83
	Parallel L = 8	Proposed STB-FIR	32	56	30	733.27	457,993	37.39	27.06	35.83
16	Parallel L = 2	Transposed block	32	30	17	207.04	410,745	39.28	26.29	37.31
	Parallel L = 2	Proposed STB-FIR	16	30	24	206.83	249,384	39.28	16.48	37.31
	Parallel L = 4	Transposed block	64	60	19	382.78	769,615	39.49	44.72	36.67
	Parallel L = 4	Proposed STB-FIR	32	60	36	382.41	465,661	39.49	28.32	36.67
	Parallel L = 8	Transposed block	128	120	23	717.49	1,474,409	39.03	88.39	41.23
	Parallel L = 8	Proposed STB-FIR	64	120	60	732.60	898,994	39.03	51.95	41.23
24	Parallel L = 2	Transposed block	48	46	25	207.04	618,293	39.66	34.96	36.13
	Parallel L = 2	Proposed STB-FIR	24	46	36	206.83	373,099	39.66	22.33	36.13
	Parallel L = 4	Transposed block	96	92	27	382.78	1,155,887	39.07	64.99	36.42
	Parallel L = 4	Proposed STB-FIR	48	92	54	382.41	704,290	39.07	41.32	36.42
	Parallel L = 8	Transposed block	192	184	31	733.27	2,217,483	38.81	120.65	32.95
	Parallel L = 8	Proposed STB-FIR	96	184	90	732.60	1,356,819	38.81	80.90	32.95
32	Parallel L = 2	Transposed block	64	62	33	207.04	827,379	39.62	50.40	32.34
	Parallel L = 2	Proposed STB-FIR	32	62	48	206.83	499,542	39.62	34.10	32.34
	Parallel L = 4	Transposed block	128	124	35	382.78	1,542,082	39.26	88.22	33.51
	Parallel L = 4	Proposed STB-FIR	64	124	72	382.41	936,727	39.26	58.66	33.51
	Parallel L = 8	Transposed block	256	248	39	733.27	2,963,939	39.00	159.96	33.09
	Parallel L = 8	Proposed STB-FIR	128	248	120	732.60	1,808,023	39.00	107.03	33.09
64	Parallel L = 2	Transposed block	128	126	65	207.04	1,613,977	39.29	99.83	36.44
	Parallel L = 2	Proposed STB-FIR	64	126	96	206.83	979,802	39.29	63.45	36.44
	Parallel L = 4	Transposed block	256	252	67	382.78	3,087,828	39.30	177.32	34.97
	Parallel L = 4	Proposed STB-FIR	128	252	144	382.41	1,874,254	39.30	115.32	34.97
	Parallel L = 8	Transposed block	512	504	71	733.27	5,932,523	38.87	322.64	33.07
	Parallel L = 8	Proposed STB-FIR	256	504	240	732.60	3,626,413	38.87	215.93	33.07
128	Parallel L = 2	Transposed block	256	254	129	207.04	3,272,258	38.90	208.11	36.97
	Parallel L = 2	Proposed STB-FIR	128	254	192	206.83	1,999,407	38.90	131.18	36.97
	Parallel L = 4	Transposed block	512	508	131	382.78	6,178,507	39.12	356.05	34.70
	Parallel L = 4	Proposed STB-FIR	256	508	288	382.41	3,761,321	39.12	232.5	34.70
	Parallel L = 8	Transposed block	1024	1016	135	733.27	11,872,904	38.82	648.16	32.76
	Parallel L = 8	Proposed STB-FIR	512	1016	480	732.60	7,263,638	38.82	435.81	32.76
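The saving columns in Table 2 appear to be the relative reduction of the proposed STB-FIR with respect to the transposed block design in the same configuration. This reading is inferred from the reported numbers rather than stated in the table, and the helper name below is hypothetical. A minimal Python sketch of that calculation, using the 8-tap, $L = 2$ rows:

```python
def saving_percent(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# Values copied from the 8-tap, L = 2 rows of Table 2.
area_saving = saving_percent(203_380, 123_353)   # area in square micrometres
power_saving = saving_percent(12.47, 8.18)       # power in milliwatts

if __name__ == "__main__":
    print(f"area saving:  {area_saving:.2f} %")
    print(f"power saving: {power_saving:.2f} %")
```

Under this assumption, the computed values agree with the 39.35% area saving and 34.40% power saving listed for that configuration.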

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
