A Method of Increasing Digital Filter Performance Based on Truncated Multiply-Accumulate Units

Abstract: This paper proposes a new digital filter architecture based on a modified multiply-accumulate (MAC) unit architecture called truncated MAC (TMAC), with the aim of increasing the performance of digital filtering. The paper provides a theoretical analysis of the proposed TMAC units and their hardware simulation. The theoretical analysis demonstrates that replacing conventional MAC units with TMAC units as the basis for the implementation of digital filters can reduce the filtering time by up to 29.86%. Hardware simulation showed that TMAC units increased the performance of digital filters by up to 10.89% compared to digital filters using conventional MAC units, but were associated with increased hardware costs. The results of this research can be used in the theory of digital signal processing to solve practical problems such as noise reduction, amplification and suppression of the frequency spectrum, interpolation, decimation, equalization and many others.


Introduction
Digital filtering is the core of digital signal processing, since it underlies the solution to most practical problems in this area: noise reduction [1], amplification and suppression of frequencies [2], interpolation [3], decimation [4], equalization [5] and many others. The main tool of digital signal processing is the digital filter (DF). DFs are usually divided into filters with finite impulse response (FIR) and filters with infinite impulse response (IIR). In digital circuit design, there is a constant need to increase device performance. Two approaches to improving the performance of digital devices are usually distinguished: pipelining [6] and parallelization [7].
The implementation principle of IIR filters is similar to that of FIR filters in terms of arithmetic units. The difference is the recursive connections, which do not affect system performance in terms of frequency [8]. Therefore, in this article, we will consider the architecture of FIR filters.
FIR filters are usually components of complex digital signal processing systems; therefore, FIR filter performance affects the performance of the entire system. For instance, FIR filters were applied to image filtering in [9], where the authors proposed a vectorizing pattern that accelerates FIR filters. The authors of [10] proposed a method of separable two-dimensional FIR filter design to increase device performance. In [11], an adaptive FIR filter was developed and simulated for various device parameters. The authors of [12] proposed a constant multiplier based on the vertical-horizontal binary common sub-expression elimination algorithm and applied it to FIR filter implementation. This multiplication technique improved the area, delay and power consumption of the device. Another way to improve the technical characteristics of FIR filters is the use of a residue number system, which allows the computation to be parallelized across multiple channels [13].
The main tool of digital filtering is the multiply-accumulate (MAC) unit [14,15]. In [16], the authors propose a MAC unit architecture based on truncated multipliers and approximate adders. The authors of [17] designed an approximate MAC unit based on input awareness, which consisted of approximate multipliers and input-aware conditional blocks. This approach allowed for the achievement of high energy efficiency. In [18], the authors proposed a modified MAC unit that had a novel partial product reduction block. The authors of [19] conducted a comparative analysis of different precision-scalable MAC unit architectures.
The form of number representation has a great influence on the performance of blocks. In [20], MAC units were proposed, in which the technique of generating partial products combined the multiplier and the adder. Two's complement arithmetic was used to perform operations on negative numbers in these units. The effectiveness of this approach for application in deep neural networks was considered. Another way to represent numbers is in posit format. In [21], the authors presented a generator of posit MAC units and their use in deep-learning applications. In [22], for the implementation of DFs, single precision floating point Vedic multipliers were used.
In order to reduce the latency and hardware cost of devices containing MAC units, researchers have used various compression techniques. The authors of [23] proposed a 4:2 compression technique to add partial products in MAC units. In [24], the authors proposed a low-latency MAC unit architecture using the column bit compression technique.
In the present study, a new modified MAC unit called truncated MAC (TMAC) was developed to increase the performance of FIR DFs. This paper contains a theoretical analysis and a hardware simulation, on a field-programmable gate array (FPGA), of FIR DFs containing the proposed TMAC units, together with a comparative analysis against FIR DFs using traditional MAC units.
The remainder of this paper is organized as follows: Section 2 discusses the structure of FIR DFs and presents the FIR filter architecture using the proposed TMAC units. Section 3 presents the theoretical analysis and hardware simulation results, and the conclusion of the paper is reported in Section 4.

Digital Filters
A sequence of signal samples is generated by an analog-to-digital converter or transmitted over a computing bus from a digital source. The digital signal $X(N)$ is then fed to the input of the FIR DF. The output signal $Y(N)$ is generated by the formula

$$Y(N) = \sum_{i=0}^{K} b_i X(N - i), \tag{1}$$

where $b_i$ are the filter coefficients and $K$ is the filter order. Figure 1 shows the FIR DF circuit. The blocks $z^{-1}$ denote one-sample signal delays, which in practice are implemented using buffers. In other words, when the signal $X(N)$ arrives at the input of a $z^{-1}$ block, the signal $X(N - 1)$ is generated at the output of this block. The basis of the circuit shown in Figure 1 is the repeated execution of a multiplication followed by an addition with an intermediate value. In modern digital signal processing, it is customary to combine these two operations into one MAC unit. Since no previous result is available for addition in the first MAC unit, 0 is fed to the input of that unit as a summand.
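The chain of MAC units described above can be sketched in software as follows. This is only an illustrative behavioural model, not the authors' hardware description: the function names `mac`, `fir_output` and `fir_filter` are hypothetical, and samples before the start of the signal are assumed to be zero.

```python
def mac(y_prev, b_i, x_i):
    """One multiply-accumulate step: previous result plus coefficient times sample."""
    return y_prev + b_i * x_i


def fir_output(x, b, n):
    """Output sample Y(n) of a FIR filter with coefficients b[0..K].

    Assumption: samples with a negative index (before the signal starts) are zero.
    """
    y = 0  # the first MAC unit receives 0 as its summand
    for i, b_i in enumerate(b):
        x_i = x[n - i] if n - i >= 0 else 0
        y = mac(y, b_i, x_i)
    return y


def fir_filter(x, b):
    """Filter a whole sequence by evaluating the MAC chain for every sample."""
    return [fir_output(x, b, n) for n in range(len(x))]


if __name__ == "__main__":
    # A unit impulse returns the coefficients themselves (the impulse response).
    print(fir_filter([1, 0, 0, 0], [3, 5, 7]))
```

Feeding a unit impulse is a convenient sanity check, since the output then reproduces the coefficient sequence followed by zeros.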

Multiply-Accumulate Units
Consider the implementation of a MAC unit in the FIR DF node corresponding to the coefficient $b_i$. This unit performs calculations using the formula

$$Y_i = Y_{i-1} + b_i X(N - i), \tag{2}$$

where $Y_i$ is the result of the current MAC unit and $Y_{i-1}$ is the result of the previous MAC unit.
To obtain a result according to Formula (2), there is no need to perform the complete multiplication $b_i X(N - i)$. It is enough to use a generator of $k$ partial products, where $k = \lceil \log_2 b_i \rceil$ is the bit width of the filter coefficient $b_i$, and a carry-save adder (CSA) tree [25], omitting the final addition in a Kogge-Stone adder (KSA) [26] that would complete the multiplication. The additional term $Y_{i-1}$ can then be fed to the CSA tree together with the partial products, and the outputs $A$ and $B$ of this tree can be summed in the KSA.
The MAC unit operating according to this principle is shown in Figure 2. The notation $((k + 1){:}2)$ indicates that $(k + 1)$ terms are fed to the input of the CSA tree and two terms are formed at its output.
The basic device for performing arithmetic operations is a full adder (FA) [25]. The bits $\alpha$, $\beta$ and the carry $C_{in}$ are the inputs of the device, which are converted to the output bits $S$ and $C_{out}$ using the formulas

$$S = \alpha \oplus \beta \oplus C_{in}, \qquad C_{out} = (\alpha \,\&\, \beta) \vee (C_{in} \,\&\, (\alpha \oplus \beta)), \tag{3}$$

where bit $S$ is the sum, the output bit $C_{out}$ is the carry obtained in the FA, $\oplus$ is an exclusive disjunction, $\&$ is a conjunction and $\vee$ is a disjunction.
The main idea of a CSA is to transform three input vectors into two output vectors: sum and carry. Each such step reduces the number of terms to be processed at the next step by a factor of 1.5.
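The carry-save principle and its use inside a MAC unit can be illustrated with a short Python sketch. It is a behavioural model under stated assumptions, not the circuit of Figure 2: operands are non-negative integers of unbounded width, the names `csa`, `partial_products`, `csa_tree` and `mac_via_csa` are hypothetical, and the final `a + b` stands in for the KSA.

```python
def csa(a, b, c):
    """3:2 carry-save adder: full-adder equations applied to every bit position."""
    s = a ^ b ^ c                            # bitwise sum without carry propagation
    carry = ((a & b) | (c & (a ^ b))) << 1   # carries move one position to the left
    return s, carry


def partial_products(b_i, x, k):
    """k partial products of b_i * x: x shifted by j wherever bit j of b_i is set."""
    return [(x << j) if (b_i >> j) & 1 else 0 for j in range(k)]


def csa_tree(terms):
    """Compress a list of terms to two, level by level, three terms at a time."""
    terms = list(terms)
    while len(terms) > 2:
        nxt = []
        for j in range(0, len(terms) - 2, 3):
            nxt.extend(csa(*terms[j:j + 3]))
        nxt.extend(terms[len(terms) - len(terms) % 3:])  # terms left over at this level
        terms = nxt
    while len(terms) < 2:
        terms.append(0)
    return terms[0], terms[1]


def mac_via_csa(y_prev, b_i, x, k):
    """MAC step: (k + 1) terms go into the tree, two come out, one final addition."""
    a, b = csa_tree(partial_products(b_i, x, k) + [y_prev])
    return a + b  # in hardware this addition is the KSA


if __name__ == "__main__":
    print(csa(5, 6, 7))               # two terms whose sum is 5 + 6 + 7
    print(mac_via_csa(10, 13, 7, 4))  # 10 + 13 * 7
```

The multiplication is never completed on its own: the previous result enters the same tree as the partial products, so only one carry-propagating addition is needed per MAC step.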
Another type of adder is the Kogge-Stone parallel-prefix adder. Consider the addition of two $k$-bit numbers $A$ and $B$. Parallel-prefix addition is performed in three stages. At the first stage, the carry-generate bits $G_i$, the carry-propagate bits $P_i$ and the half-sums $H_i$ are pre-calculated for $0 \le i \le k - 1$:

$$G_i = A_i \,\&\, B_i, \qquad P_i = A_i \vee B_i, \qquad H_i = A_i \oplus B_i. \tag{4}$$

The second stage of addition, called the parallel-prefix network, computes the carry bits $C_i$, $0 \le i \le k - 1$, using $G_i$ and $P_i$. For this, an operator $\bullet$ is used that combines pairs of carry-generate and carry-propagate bits and is defined as

$$(G, P) \bullet (G', P') = (G \vee (P \,\&\, G'),\; P \,\&\, P'). \tag{5}$$

The sequential combination of carry-generate and carry-propagate pairs is denoted as $(G_{i:j}, P_{i:j})$, $i > j$, where the pair is calculated from the bits $i, i - 1, \ldots, j$ in the following way:

$$(G_{i:j}, P_{i:j}) = (G_i, P_i) \bullet (G_{i-1}, P_{i-1}) \bullet \cdots \bullet (G_j, P_j). \tag{6}$$

Since the carry is $C_i = G_{i:0}$ for all $i > 0$, all carries can be calculated using only the operator $\bullet$ [26]. At the third stage, the sum is calculated as

$$S_i = H_i \oplus C_{i-1} \tag{7}$$

for $0 \le i \le k - 1$, where $C_{-1} = C_{in}$. Figure 3 shows the basic blocks for parallel-prefix addition. Block 3a implements Formula (4). Block 3b implements Formula (5). No action takes place in block 3c. Block 3d implements Formula (7). Figure 4 shows the parallel-prefix adder scheme with the parallel-prefix network organized according to the Kogge-Stone method.
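The three stages can be traced in a bit-level Python sketch. It is an illustrative model only, written under assumptions not taken from the source: the carry-propagate bit is taken as the OR of the operand bits, the input carry is 0, and the function name `kogge_stone_add` is hypothetical.

```python
def kogge_stone_add(a, b, k):
    """Add two k-bit non-negative integers with a Kogge-Stone prefix network.

    Returns the (k + 1)-bit result, i.e. the sum including the carry-out.
    """
    a_bits = [(a >> i) & 1 for i in range(k)]
    b_bits = [(b >> i) & 1 for i in range(k)]

    # Stage 1: carry-generate, carry-propagate and half-sum bits.
    g = [x & y for x, y in zip(a_bits, b_bits)]
    p = [x | y for x, y in zip(a_bits, b_bits)]
    h = [x ^ y for x, y in zip(a_bits, b_bits)]

    # Stage 2: prefix network. Each level combines position i with position
    # i - d using the prefix operator; d doubles per level, so about log2(k)
    # levels are needed. Afterwards g[i] spans bits i..0 and equals carry C_i.
    d = 1
    while d < k:
        g_new, p_new = g[:], p[:]
        for i in range(d, k):
            g_new[i] = g[i] | (p[i] & g[i - d])
            p_new[i] = p[i] & p[i - d]
        g, p = g_new, p_new
        d *= 2

    # Stage 3: sum bits; with a zero input carry, bit 0 needs no XOR.
    s_bits = [h[0]] + [h[i] ^ g[i - 1] for i in range(1, k)]
    total = sum(bit << i for i, bit in enumerate(s_bits))
    return total | (g[k - 1] << k)  # append the carry-out as bit k


if __name__ == "__main__":
    print(kogge_stone_add(200, 100, 8))
```

Because the distance doubles at every level, the depth of the network grows with the logarithm of the bit width, which is why the adder delay in the unit-gate model depends on $\log_2 k$.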
For a theoretical analysis of the digital device parameters, we used an abstract model for calculating the delay and area of very-large-scale integration (VLSI) circuits, known as the unit-gate model [27]. Let $U_{delay}$ denote the delay of a logical device calculated according to this model and $U_{area}$ its area. In this model, each two-input AND or OR gate counts as one unit of delay and one unit of area, while for the XOR gate

$$U_{delay}(\mathrm{XOR}) = 2, \qquad U_{area}(\mathrm{XOR}) = 2. \tag{11}$$

Then, taking into account Formulas (3) and (8), the FA parameters are

$$U_{delay}(\mathrm{FA}) = 4, \tag{12}$$

$$U_{area}(\mathrm{FA}) = 7. \tag{13}$$

CSAs consist of FA blocks (Figure 3); therefore, their delay and area parameters are defined as follows:

$$U_{delay}(\mathrm{CSA}) = U_{delay}(\mathrm{FA}) = 4, \tag{14}$$

$$U_{area}(\mathrm{CSA}) = k \cdot U_{area}(\mathrm{FA}) = 7k. \tag{15}$$

For the KSA, when the condition $C_{in} = 0$ is satisfied, so that no $\oplus$ operation is required to calculate $S_0$ by Formula (7), the delay and area parameters are determined by Formulas (16) and (17), where the delay is

$$U_{delay}(\mathrm{KSA}) \approx 2\log_2 k + 4. \tag{16}$$

The approximately-equal sign in Formulas (16) and (17) reflects the assumption $\lceil \log_2 k \rceil \approx \log_2 k$, which does not introduce any error in the most common cases of 8-bit, 16-bit and 32-bit addition, etc.
Let us estimate the delay and area parameters of the MAC unit shown in Figure 2 for the worst case, where $b_i$ is not known in advance. In this case, we have

$$U_{delay}(\mathrm{MAC}) \approx 8.8\log_2 k + 5, \tag{18}$$

$$U_{area}(\mathrm{MAC}) \approx 3k\log_2 k + 8k^2 - 4k + 1. \tag{19}$$

The delay and area of the computational part of the FIR DF shown in Figure 1 are equal to the sum of the delays and areas of the MAC units, respectively. If we denote the computational part of the $K$-th order FIR DF with $k$-bit coefficients based on MAC units by $FIR^{K,k}_{MAC}$, then

$$U_{delay}\left(FIR^{K,k}_{MAC}\right) = (K + 1) \cdot U_{delay}(\mathrm{MAC}) \approx 8.8K\log_2 k + 8.8\log_2 k + 5K + 5, \tag{20}$$

$$U_{area}\left(FIR^{K,k}_{MAC}\right) = (K + 1) \cdot U_{area}(\mathrm{MAC}) \approx 3kK\log_2 k + 3k\log_2 k + 8k^2K + 8k^2 - 4kK - 4k + K + 1. \tag{21}$$
Analysis of the derivation of Formulas (20) and (21) shows that the main part of the delay and area of $FIR^{K,k}_{MAC}$ is made up of KSAs.

Proposed FIR Filter Architecture Using Truncated MAC Units
The number of KSAs in the filter can be reduced to one if we exploit the iterative structure of the circuit in Figure 1 and the operating principle of the MAC unit in Figure 2. In Figure 1, the output of each internal MAC unit is fed to the input of the CSA tree of the subsequent MAC unit. Instead, the numbers A and B from the previous MAC unit can be fed directly to the input of the adder tree of the next MAC unit, without first adding them in a KSA. We call this unit a truncated MAC (TMAC); its operation principle is shown in Figure 5.
The input of each TMAC unit receives the signal $X(N - i)$, the filter coefficient $b_i$ and the terms $A_{i-1}$ and $B_{i-1}$ from the output of the previous TMAC unit. The output of the TMAC unit is a pair of numbers $A_i$ and $B_i$, which are fed to the input of the next TMAC unit or, if this TMAC unit is the last one in the FIR DF, are added in a KSA. The main differences between TMAC and MAC units are the absence of a KSA, which requires the most delay and area, and a slightly wider CSA tree, which compresses one more term.
The FIR DF scheme based on TMAC units is shown in Figure 6. Two zero signals must be fed to the inputs of the first TMAC unit, and outputs A K and B K of the last TMAC block must be fed to the input of the KSA.
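The behaviour of the TMAC chain can be checked with a small Python model. It is a functional sketch under assumptions, not the authors' VHDL design: operands are non-negative integers of unbounded width, the order in which terms are compressed is simplified (the result does not depend on it), and the names `csa`, `reduce_to_two`, `tmac` and `fir_tmac` are hypothetical.

```python
def csa(a, b, c):
    """3:2 carry-save adder applied bitwise to non-negative integers."""
    return a ^ b ^ c, ((a & b) | (c & (a ^ b))) << 1


def reduce_to_two(terms):
    """Functional model of a CSA tree: compress any number of terms to two."""
    terms = list(terms)
    while len(terms) > 2:
        terms = terms[3:] + list(csa(*terms[:3]))
    while len(terms) < 2:
        terms.append(0)
    return terms[0], terms[1]


def tmac(a_prev, b_prev, coeff, x, k):
    """Truncated MAC: k partial products plus the incoming pair, no final adder."""
    products = [(x << j) if (coeff >> j) & 1 else 0 for j in range(k)]
    return reduce_to_two(products + [a_prev, b_prev])


def fir_tmac(samples, coeffs, k):
    """One FIR output; samples[i] is the delayed input that meets coeffs[i]."""
    a, b = 0, 0  # two zero signals feed the first TMAC unit
    for coeff, x in zip(coeffs, samples):
        a, b = tmac(a, b, coeff, x, k)
    return a + b  # the only carry-propagating addition (the single KSA)


if __name__ == "__main__":
    coeffs, samples = [3, 5, 7], [2, 4, 6]
    print(fir_tmac(samples, coeffs, 3))                # 3*2 + 5*4 + 7*6
    print(sum(c * x for c, x in zip(coeffs, samples)))  # direct reference value
```

The intermediate pair never has to be collapsed into a single number, because the sum of the pair, rather than either member alone, carries the running result from one unit to the next.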
To describe the device shown in Figure 6 in terms of delay and area, we must first find the parameters $U_{delay}$ and $U_{area}$ of the TMAC unit. For the delay,

$$U_{delay}(\mathrm{TMAC}) \approx 6.8\log_2 k + 1. \tag{22}$$

The delay and area of the FIR DF computational part shown in Figure 6 are equal to the sum of the delays and areas of the TMAC units and the KSA, respectively. If we denote the $K$-th order FIR DF computational part with $k$-bit coefficients based on TMAC units by $FIR^{K,k}_{TMAC}$, then

$$U_{delay}\left(FIR^{K,k}_{TMAC}\right) = (K + 1) \cdot U_{delay}(\mathrm{TMAC}) + U_{delay}(\mathrm{KSA}) \approx 6.8K\log_2 k + 8.8\log_2 k + K + 5. \tag{24}$$

A comparison of Formulas (20), (21) and (24), (25) shows that the proposed units can reduce the FIR delay by about $2K\log_2 k$ and reduce its area by about $3kK\log_2 k$.

Digital Filters Theoretical Comparative Analysis
Appl. Sci. 2020, 10, 9052
For a comparative analysis of the technical characteristics of FIR DFs based on known MAC units [28] and DFs based on the proposed TMAC units, we alternately fix the parameters $K$ and $k$. Let us first consider the case of a 15th-order filter (i.e., $K = 15$). For this case, we vary the bit width $k$ over the most popular data formats: 8, 16, 32 and 64 bits. Table 1 shows the obtained values of the parameters $U_{delay}$ and $U_{area}$ for the corresponding devices. After that, we fix the bit width $k = 16$ and vary the order of the FIR DF over 3, 7, 15 and 31. Table 2 shows the obtained values of the parameters $U_{delay}$ and $U_{area}$ for the corresponding devices.
Table 1. Comparison of 15th-order FIR DFs with different bit widths based on the known architecture [28] and on the proposed architecture.
The analysis of the data in Tables 1 and 2 shows that the transition from MAC units to TMAC units as the basis for FIR DF implementation can theoretically reduce the filtering time by 22.39-29.86% and the hardware costs by 2.41-6.32%, depending on the filter order and the bit width of the processed data.
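The quoted reduction ranges can be recomputed for the configurations listed above with a short script. The delay functions follow Formulas (20) and (24), and the MAC area follows Formula (21). The TMAC area expression is an assumption made for this sketch and is not taken from the source: it uses $8k^2$ per TMAC unit plus $3k\log_2 k + 3k + 1$ for the single KSA, chosen because it reproduces the 2.41% and 6.32% endpoints quoted in the text. All function names are hypothetical.

```python
from math import log2


def delay_mac_fir(K, k):
    """Formula (20): unit-gate delay of the MAC-based FIR computational part."""
    return (K + 1) * (8.8 * log2(k) + 5)


def delay_tmac_fir(K, k):
    """Formula (24): unit-gate delay of the TMAC-based FIR computational part."""
    return 6.8 * K * log2(k) + 8.8 * log2(k) + K + 5


def area_mac_fir(K, k):
    """Formula (21): unit-gate area of the MAC-based FIR computational part."""
    return (K + 1) * (3 * k * log2(k) + 8 * k ** 2 - 4 * k + 1)


def area_tmac_fir(K, k):
    """ASSUMPTION (not from the source): 8k^2 per TMAC unit plus one KSA."""
    return (K + 1) * 8 * k ** 2 + 3 * k * log2(k) + 3 * k + 1


def reduction(old, new):
    """Relative saving in percent when moving from `old` to `new`."""
    return 100 * (old - new) / old


# The sweeps described in the text: fixed order K = 15, then fixed width k = 16.
configs = [(15, k) for k in (8, 16, 32, 64)] + [(K, 16) for K in (3, 7, 15, 31)]

if __name__ == "__main__":
    for K, k in configs:
        d = reduction(delay_mac_fir(K, k), delay_tmac_fir(K, k))
        a = reduction(area_mac_fir(K, k), area_tmac_fir(K, k))
        print(f"K={K:2d} k={k:2d}  delay -{d:.2f}%  area -{a:.2f}%")
```

Under these formulas the largest delay saving occurs for the 15th-order filter with 8-bit data and the smallest for the 3rd-order filter with 16-bit data, since the single remaining KSA weighs more heavily when the filter has few taps.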

Hardware Simulation of Digital Filters
Hardware simulation was performed on an Artix xc7a200tffg1156-3 FPGA in Xilinx Vivado 18.3 using the very-high-speed integrated circuit (VHSIC) hardware description language (VHDL).
The goal of the simulation was to compare the technical characteristics of FIR DFs containing TMAC units with FIR DFs using traditional MAC units [28].
The results of the hardware simulation of FIR DFs are shown in Figures 7 and 8. They demonstrate that using TMAC units in the implementation of FIR DFs increased the devices' maximum clock frequency by 4.41-10.89%, but at the same time increased the hardware costs: the number of look-up tables (LUTs) used rose by 0.63-18.63% and the power consumption by 1.80-27.17%. The difference between the theoretical and practical results is explained by the features of the FPGA and by the weaknesses of the "unit-gate" model, which ignores the effect of the output load of individual logic units and of the circuit as a whole.
The approach proposed in this paper may be applied in real-time systems and other systems where performance is the main characteristic, for example in medical image processing systems. Increasing the maximum clock frequency of a medical tomogram processing system would allow for an increase in its performance (i.e., the number of processed frames per second).

Conclusions
In this work, we developed a new FIR DF architecture based on a modified MAC unit architecture called TMAC, with the aim of increasing digital filtering performance. A theoretical analysis of the digital filter parameters was performed using the abstract "unit-gate" model. According to this analysis, FIR DF implementation based on TMAC units can reduce the filtering time by 22.39-29.86% and the hardware costs by 2.41-6.32%. The results of the hardware simulation on an FPGA show that the use of TMAC units increased the FIR DF performance by up to 10.89% but incurred higher hardware costs than FIR DFs using traditional MAC units. The results of this research can be used in digital signal processing theory and for solving practical problems such as machine learning, multimedia processing, noise reduction and many others.
In future work, we plan to study the application of the proposed approach to the discrete wavelet transform of medical tomograms. This type of medical image usually uses an 8-, 12- or 16-bit data representation, and the wavelet filter banks have various orders. Thus, many of the FIR configurations discussed in this article may be applied in practice to process medical tomograms using a discrete wavelet transform.