Algorithmic Structures for Realizing Short-Length Circular Convolutions with Reduced Complexity

A set of efficient algorithmic solutions suitable for the fully parallel hardware implementation of short-length circular convolution cores is proposed. The advantage of the presented algorithms is that they require significantly fewer multiplications than the naive method of implementing this operation. During the synthesis of the presented algorithms, the matrix notation of the cyclic convolution operation was used, which made it possible to represent this operation as a matrix–vector product. The fact that the matrix factor is a circulant matrix allows its effective factorization, which reduces the number of multiplications needed to calculate such a product. The proposed algorithms are oriented towards a completely parallel hardware implementation, but compared with a naive fully parallel hardware implementation, they require significantly fewer hardwired multipliers. Since a hardwired multiplier occupies a much larger area on a VLSI chip and consumes more power than a hardwired adder, the proposed solutions are resource-efficient and energy-efficient in terms of their hardware implementation. We considered circular convolutions for sequences of lengths N = 2, 3, 4, 5, 6, 7, 8, and 9.


Introduction
Digital convolution is used in various applications of digital signal and image processing. Its most prominent areas of application are wireless communication and artificial neural networks [1][2][3][4][5]. The general principles of developing convolution algorithms were described in [6][7][8][9][10][11][12]. Various algorithmic solutions have been proposed to speed up the computation of circular convolution [7][8][9][10][11][13][14][15][16]. The most common approach to efficiently computing the circular convolution is the Fast Fourier Transform (FFT) algorithm, as well as a number of other discrete orthogonal transformations [17][18][19][20]. There are also known methods for implementing discrete orthogonal transformations using circular convolution [20][21][22]. FFT-based convolution relies on the fact that convolution can be performed as simple multiplication in the frequency domain [23]. The FFT-based approach to computing circular convolution is traditionally used for long sequences. However, in many practical applications, a situation arises where both convolved sequences are relatively short. As examples, we can refer to algorithms for calculating short linear convolutions, as well as the overlap-save and overlap-add methods [24][25][26]. It is known that these methods split a long data sequence into small segments, calculate short cyclic convolutions of these segments with the impulse response coefficients of a Finite Impulse Response (FIR) filter, and then combine the short convolutions into a single whole.
To date, many algorithmic solutions have been developed that involve the computation of cyclic convolution in the time domain [7][8][9][10][21][27][28][29]. In the cited publications, methods for calculating short convolutions were presented either as a set of arithmetic relations or as a set of matrix–vector products. Such descriptions of the computations give no idea of how to organize the structures of processor cores intended for implementing the convolution operation. The solutions presented in the literature do not give a complete picture of the structural organization of such cores, if only because they (except for the cases N = 2 and N = 3) do not show the corresponding signal flow graphs. The absence of signal flow graphs in known publications also does not allow us to assess the obtained solutions from the point of view of their parallel implementation. Therefore, in this paper, we propose a set of algorithmic solutions for the circular convolution of sequences of small lengths N = 2 to 9.

Preliminary Remarks
Let {h_n} and {x_n} be two N-point sequences. Their circular convolution is the sequence {y_n}, defined by [6]:

y_n = Σ_{m=0}^{N−1} h_m · x_{(n−m) mod N},  n = 0, 1, ..., N − 1,  (1)

Usually, the elements of one of the convolved sequences are constants. For definiteness, we assume that these are the elements of the sequence {h_n}.
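For reference, definition (1) can be sketched directly in Python (a naive implementation, used here only to fix the indexing convention):

```python
def cyclic_convolution(h, x):
    """Direct N-point circular convolution per (1):
    y[n] = sum over m of h[m] * x[(n - m) mod N]."""
    n_len = len(h)
    assert len(x) == n_len
    return [sum(h[m] * x[(n - m) % n_len] for m in range(n_len))
            for n in range(n_len)]
```

For example, convolving with the shifted unit impulse h = (0, 1, 0) rotates the input by one position, as expected from (1).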
Because the sequences {x_n} and {h_n} are finite in length, their circular convolution (1) can also be represented as a matrix–vector product:

Y_N×1 = H_N · X_N×1,  (2)

where X_N×1 = [x_0, x_1, ..., x_{N−1}]^T, Y_N×1 = [y_0, y_1, ..., y_{N−1}]^T, and H_N is the N × N circulant matrix whose entry in row n and column m is h_{(n−m) mod N}. In the following, we assume that X_N×1 is the input data vector, Y_N×1 is the output data vector, and H_N×1 = [h_0, h_1, ..., h_{N−1}]^T is the vector containing the constants from which H_N is formed.
Calculating (2) directly requires N^2 multiplications and N(N − 1) additions. Hence, a completely parallel hardware implementation of the circular convolution requires N^2 multipliers and N N-input adders. Since the multiplier is a very cumbersome device and, when implemented in hardware, requires far more resources than the adder, minimizing the number of multipliers required for the fully parallel implementation of the algorithms is an important task.
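The circulant structure behind (2) can be illustrated with a short sketch (the helper names are hypothetical; the naive product below performs exactly the N^2 multiplications counted above):

```python
def circulant(h):
    """N x N circulant matrix H_N built from h: the entry in row n,
    column m is h[(n - m) mod N], so each column is a cyclic shift."""
    n_len = len(h)
    return [[h[(n - m) % n_len] for m in range(n_len)] for n in range(n_len)]

def matvec(mat, vec):
    """Naive matrix-vector product: N*N multiplications, N*(N-1) additions."""
    return [sum(row[m] * vec[m] for m in range(len(vec))) for row in mat]
```

Applying `matvec(circulant(h), x)` reproduces the direct definition (1), which is the equivalence that the factorizations in the following sections exploit.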
Thus, taking into account the above, the purpose of this article is to develop and describe fully parallel resource-efficient algorithms for N = 2, 3, 4, 5, 6, 7, 8, and 9.
Circular Convolution for N = 2

Let X_2×1 = [x_0, x_1]^T and H_2×1 = [h_0, h_1]^T be two-dimensional data vectors being convolved and Y_2×1 = [y_0, y_1]^T be an output vector representing the circular convolution for N = 2. The task is reduced to calculating the following product:

Y_2×1 = H_2 · X_2×1,  (4)

where H_2 is the 2 × 2 circulant matrix with rows [h_0, h_1] and [h_1, h_0]. Calculating (4) directly requires four multiplications and two additions. It is easy to see that the matrix H_2 has a special structure. Taking this specificity into account allows the number of multiplications in the calculation of the two-point circular convolution to be reduced [7].
The optimized computational procedure for computing the two-point circular convolution is as follows:

Y_2×1 = A_2 · D_2 · A_2 · X_2×1,  (5)

where A_2 is the 2 × 2 matrix with rows [1, 1] and [1, −1], and D_2 = diag(s_0, s_1) with s_0 = (h_0 + h_1)/2 and s_1 = (h_0 − h_1)/2. Figure 1 shows a signal flow graph of the proposed algorithm, which also provides a simplified algorithmic structure of a fully parallel processing core for the resource-effective implementation of the two-point circular convolution. All signal flow graphs are oriented from left to right. Straight lines denote the data circuits. The circles in these figures show the hardwired multipliers by the constant inscribed inside the circle. Points where lines converge denote adders, and dotted lines indicate the sign-change data circuits (datapaths with multiplication by −1). Therefore, it takes only two multiplications and four additions to compute the two-point circular convolution. As for the arithmetic blocks, a completely parallel hardware implementation of the processor core to compute the two-point convolution requires two multipliers and four two-input adders, instead of the four multipliers and two two-input adders in the case of a completely parallel implementation of (4).
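A minimal Python sketch of a two-multiplication two-point kernel with this structure follows; the constants s_0, s_1 and the ±1 butterflies match the operation counts quoted above, though the paper's exact matrix arrangement may differ:

```python
def cyclic2_fast(h, x):
    """2-point circular convolution with 2 multiplications and 4 additions."""
    h0, h1 = h
    x0, x1 = x
    # constants precomputed from the fixed sequence h; the divisions by 2
    # are folded into them, so they cost nothing at run time
    s0 = (h0 + h1) / 2
    s1 = (h0 - h1) / 2
    # 2 input additions
    t0 = x0 + x1
    t1 = x0 - x1
    # 2 multiplications
    m0 = s0 * t0
    m1 = s1 * t1
    # 2 output additions
    return [m0 + m1, m0 - m1]
```

For h = (3, 5) and x = (2, 7), the result equals the direct computation h0·x0 + h1·x1 = 41 and h0·x1 + h1·x0 = 31.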

Circular Convolution for N = 3
Let X_3×1 = [x_0, x_1, x_2]^T and H_3×1 = [h_0, h_1, h_2]^T be three-dimensional data vectors being convolved and Y_3×1 = [y_0, y_1, y_2]^T be an output vector representing the circular convolution for N = 3. The task is reduced to calculating the following product:

Y_3×1 = H_3 · X_3×1,  (6)

where H_3 is the 3 × 3 circulant matrix with rows [h_0, h_2, h_1], [h_1, h_0, h_2], and [h_2, h_1, h_0]. Calculating (6) directly requires nine multiplications and six additions. It is easy to see that the matrix H_3 has a special structure. Taking this specificity into account allows the number of multiplications in the calculation of the three-point circular convolution to be reduced [7,8,11,27].
Therefore, the optimized computational procedure (7) for computing the three-point circular convolution factorizes H_3 into a product of sparse matrices with entries 0 and ±1 and a diagonal matrix D_4 = diag(s_0, s_1, s_2, s_3), whose entries are constants precomputed from h_0, h_1, and h_2. As for the arithmetic blocks, a completely parallel hardware implementation of the processor core to compute the three-point convolution (7) requires four multipliers and eleven two-input adders, instead of the nine multipliers and six two-input adders in the case of a completely parallel implementation of (6). Therefore, we have exchanged five multipliers for five two-input adders.
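As an illustration, a four-multiplication three-point kernel can be built in the standard Winograd style from the factorization x^3 − 1 = (x − 1)(x^2 + x + 1); this is a sketch with the same operation counts as quoted above, not necessarily the paper's exact factorization:

```python
def cyclic3_fast(h, x):
    """3-point circular convolution with 4 multiplications (CRT-style sketch)."""
    h0, h1, h2 = h
    x0, x1, x2 = x
    # constants precomputed from h; the divisions by 3 are folded in
    s0 = (h0 + h1 + h2) / 3     # residue mod (x - 1)
    s1 = (h0 - h2) / 3          # residue mod (x^2 + x + 1)
    s2 = (h1 - h2) / 3
    s3 = s1 + s2
    # input additions
    t0 = x0 + x1 + x2
    a = x0 - x2
    b = x1 - x2
    # 4 multiplications
    m0 = s0 * t0
    m1 = s1 * a
    m2 = s2 * b
    m3 = s3 * (a + b)
    # reconstruction (multiplications by 2 are shifts, not multipliers)
    u = m1 - m2
    v = m3 - m1 - 2 * m2
    return [m0 + 2 * u - v, m0 - u + 2 * v, m0 - u - v]
```

With h = (0, 1, 0) the kernel reduces to a cyclic shift of the input, matching definition (1).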
Circular Convolution for N = 4

Let X_4×1 = [x_0, x_1, x_2, x_3]^T and H_4×1 = [h_0, h_1, h_2, h_3]^T be four-dimensional data vectors being convolved and Y_4×1 = [y_0, y_1, y_2, y_3]^T be an output vector representing the circular convolution for N = 4. The task is reduced to calculating the following product:

Y_4×1 = H_4 · X_4×1,  (8)

where H_4 is the 4 × 4 circulant matrix formed from the elements of H_4×1. Calculating (8) directly requires 16 multiplications and 12 additions. It is easy to see that the matrix H_4 has a special structure. Taking this specificity into account allows the number of multiplications in the calculation of the four-point circular convolution to be reduced.
Therefore, the optimized computational procedure for computing the four-point circular convolution is as follows: the matrix H_4 is factorized (9) into a product of sparse matrices built from I_N, the N × N identity matrix, and H_2, the 2 × 2 Hadamard matrix, where the signs "⊗" and "⊕" denote the Kronecker product and the direct sum of two matrices, respectively [30,31], together with a diagonal matrix D_5 = diag(s_0, ..., s_4) of constants precomputed from h_0, ..., h_3. As for the arithmetic blocks, computing the four-point convolution (9) requires five multipliers and fifteen two-input adders, instead of the sixteen multipliers and twelve two-input adders in the case of a completely parallel implementation of (8). The proposed algorithm saves eleven multiplications at the cost of three extra additions compared to the ordinary matrix-vector multiplication method.
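A five-multiplication four-point kernel with the counts quoted above can likewise be sketched from the factorization x^4 − 1 = (x − 1)(x + 1)(x^2 + 1); again, this is an illustrative reconstruction, not necessarily the paper's exact matrices:

```python
def cyclic4_fast(h, x):
    """4-point circular convolution with 5 multiplications (CRT over x^4 - 1)."""
    h0, h1, h2, h3 = h
    x0, x1, x2, x3 = x
    # constants precomputed from h; divisions by 4 and 2 are folded in
    sa = (h0 + h1 + h2 + h3) / 4    # residue mod (x - 1)
    sb = (h0 - h1 + h2 - h3) / 4    # residue mod (x + 1)
    c = (h0 - h2) / 2               # residue mod (x^2 + 1)
    d = (h1 - h3) / 2
    # input additions
    t_a = x0 + x1 + x2 + x3
    t_b = x0 - x1 + x2 - x3
    a = x0 - x2
    b = x1 - x3
    # 5 multiplications
    p = sa * t_a
    q = sb * t_b
    m2 = c * a
    m3 = d * b
    m4 = (c + d) * (a + b)
    # reconstruction
    u = m2 - m3
    v = m4 - m2 - m3
    return [p + q + u, p - q + v, p + q - u, p - q - v]
```

With h = (0, 1, 0, 0) the kernel rotates the input by one position, as definition (1) requires.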

Circular Convolution for N = 5
Let X_5×1 = [x_0, x_1, x_2, x_3, x_4]^T and H_5×1 = [h_0, h_1, h_2, h_3, h_4]^T be five-dimensional data vectors being convolved and Y_5×1 = [y_0, y_1, y_2, y_3, y_4]^T be an output vector representing the circular convolution for N = 5.
The task is reduced to calculating the following product:

Y_5×1 = H_5 · X_5×1,  (10)

where H_5 is the 5 × 5 circulant matrix formed from the elements of H_5×1. Calculating (10) directly requires 25 multiplications and 20 additions. It is easy to see that the matrix H_5 has a special structure. Taking this specificity into account allows the number of multiplications in the calculation of the five-point circular convolution to be reduced. Therefore, an efficient algorithm for computing the five-point circular convolution can be represented using a matrix-vector procedure (11) in which H_5 is factorized into sparse matrices with entries 0 and ±1 and a diagonal matrix D_10 = diag(s_0, ..., s_9) of constants precomputed from h_0, ..., h_4. Figure 4 shows a data flow graph of the proposed algorithm for the implementation of the five-point circular convolution. As for the arithmetic blocks, computing the five-point convolution (11) requires ten multipliers and thirty two-input adders, instead of the twenty-five multipliers and twenty two-input adders in the case of a completely parallel implementation of (10). The proposed algorithm saves 15 multiplications at the cost of 10 extra additions compared to the ordinary matrix-vector multiplication method.
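Fast kernels such as these are easy to validate against the direct definition; the following harness (a hypothetical helper, not from the paper) compares any candidate n-point implementation with the naive method on random inputs:

```python
import random

def cyclic_reference(h, x):
    """Direct circular convolution used as the ground truth."""
    n = len(h)
    return [sum(h[m] * x[(k - m) % n] for m in range(n)) for k in range(n)]

def check_kernel(fast, n, trials=100, tol=1e-9, seed=0):
    """Compare a candidate fast n-point kernel against the direct method
    on random integer inputs; True if every output agrees within tol."""
    rng = random.Random(seed)
    for _ in range(trials):
        h = [rng.randint(-8, 8) for _ in range(n)]
        x = [rng.randint(-8, 8) for _ in range(n)]
        ref = cyclic_reference(h, x)
        out = fast(h, x)
        if any(abs(r - o) > tol for r, o in zip(ref, out)):
            return False
    return True
```

A tolerance is used because the fast kernels fold rational constants such as 1/3 into the precomputed multipliers, which introduces floating-point rounding.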

Circular Convolution for N = 6
Let X_6×1 = [x_0, x_1, x_2, x_3, x_4, x_5]^T and H_6×1 = [h_0, h_1, h_2, h_3, h_4, h_5]^T be six-dimensional data vectors being convolved and Y_6×1 = [y_0, y_1, y_2, y_3, y_4, y_5]^T be an output vector representing the circular convolution for N = 6.
The task is reduced to calculating the following product:

Y_6×1 = H_6 · X_6×1,  (12)

where H_6 is the 6 × 6 circulant matrix formed from the elements of H_6×1. Calculating (12) directly requires 36 multiplications and 30 additions. It is easy to see that the matrix H_6 has a special structure. Taking this specificity into account allows the number of multiplications in the calculation of the six-point circular convolution to be reduced.
Therefore, an efficient algorithm for computing the six-point circular convolution can be represented using a matrix-vector procedure (13) in which H_6 is factorized into sparse matrices with entries 0 and ±1 and a diagonal matrix D_8 = diag(s_0, s_1, ..., s_7) of constants precomputed from h_0, ..., h_5. As far as arithmetic blocks are concerned, eight multipliers and thirty-four two-input adders are needed for the completely parallel hardware implementation of the processor core to compute the six-point convolution (13), instead of the thirty-six multipliers and thirty two-input adders in the case of a completely parallel implementation of (12). The proposed algorithm saves twenty-eight multiplications at the cost of four extra additions compared to the ordinary matrix-vector multiplication method.
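The eight-multiplication count for N = 6 is consistent with the classical Agarwal-Cooley nesting: since gcd(2, 3) = 1, a 6-point cyclic convolution maps to a 2 × 3 two-dimensional one, and plugging the 2-multiplication and 4-multiplication short kernels into the two dimensions gives 2 · 4 = 8 general multiplications. A sketch of the index mapping follows (the 2-D convolution is written naively for clarity; whether the paper uses exactly this construction is not stated):

```python
def cyclic6_via_2x3(h, x):
    """6-point circular convolution mapped to a 2x3 two-dimensional one
    via the CRT index map n -> (n mod 2, n mod 3), since Z6 ~ Z2 x Z3."""
    H = [[0.0] * 3 for _ in range(2)]
    X = [[0.0] * 3 for _ in range(2)]
    for n in range(6):
        H[n % 2][n % 3] = h[n]
        X[n % 2][n % 3] = x[n]
    # 2-D cyclic convolution: cyclic along each axis independently
    Y = [[0.0] * 3 for _ in range(2)]
    for i1 in range(2):
        for i2 in range(3):
            for j1 in range(2):
                for j2 in range(3):
                    Y[(i1 + j1) % 2][(i2 + j2) % 3] += H[i1][i2] * X[j1][j2]
    # map the 2-D result back to the 1-D output with the same index map
    return [Y[n % 2][n % 3] for n in range(6)]
```

Replacing the inner loops with the fast 2-point and 3-point kernels along the two axes yields the reduced multiplication count.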
Circular Convolution for N = 7

Let X_7×1 = [x_0, x_1, x_2, x_3, x_4, x_5, x_6]^T and H_7×1 = [h_0, h_1, h_2, h_3, h_4, h_5, h_6]^T be seven-dimensional data vectors being convolved and Y_7×1 = [y_0, y_1, y_2, y_3, y_4, y_5, y_6]^T be an output vector representing the circular convolution for N = 7. The task is reduced to calculating the following product:

Y_7×1 = H_7 · X_7×1,  (14)

where H_7 is the 7 × 7 circulant matrix formed from the elements of H_7×1. Calculating (14) directly requires 49 multiplications and 42 additions. It is easy to see that the matrix H_7 has a special structure. Taking this specificity into account allows the number of multiplications in the calculation of the seven-point circular convolution to be reduced.
Therefore, an efficient algorithm for computing the seven-point circular convolution can be represented using a matrix-vector procedure (15) in which H_7 is factorized into sparse matrices with entries 0 and ±1 and a diagonal matrix D_16 = diag(s_0, s_1, ..., s_15), whose entries are constants precomputed from h_0, ..., h_6 (for example, s_11 = (−h_3 + h_1)/6). Figure 6 shows a data flow graph of the proposed algorithm for the implementation of the seven-point circular convolution.
As far as arithmetic blocks are concerned, sixteen multipliers and sixty-eight two-input adders are needed for the completely parallel hardware implementation of the processor core to compute the seven-point convolution (15), instead of the forty-nine multipliers and forty-two two-input adders in the case of a completely parallel implementation of (14). The proposed algorithm saves 33 multiplications at the cost of 26 extra additions compared to the ordinary matrix-vector multiplication method.

Circular Convolution for N = 8

Let X_8×1 = [x_0, x_1, ..., x_7]^T and H_8×1 = [h_0, h_1, ..., h_7]^T be eight-dimensional data vectors being convolved and Y_8×1 = [y_0, y_1, ..., y_7]^T be an output vector representing the circular convolution for N = 8. The task is reduced to calculating the following product:

Y_8×1 = H_8 · X_8×1,  (16)

where H_8 is the 8 × 8 circulant matrix formed from the elements of H_8×1. Calculating (16) directly requires 64 multiplications and 56 additions. It is easy to see that the matrix H_8 has a special structure. Taking this specificity into account allows the number of multiplications in the calculation of the eight-point circular convolution to be reduced.
Therefore, an efficient algorithm for computing the eight-point circular convolution can be represented using a matrix-vector procedure (17) in which H_8 is factorized into sparse matrices with entries 0 and ±1 and a diagonal matrix D_14 = diag(s_0, s_1, ..., s_13) of constants precomputed from h_0, ..., h_7. As far as arithmetic blocks are concerned, fourteen multipliers and forty-six two-input adders are needed for the completely parallel hardware implementation of the processor core to compute the eight-point convolution (17), instead of the sixty-four multipliers and fifty-six two-input adders in the case of a completely parallel implementation of (16). The proposed algorithm saves 50 multiplications and 10 additions compared to the ordinary matrix-vector multiplication method.
Circular Convolution for N = 9

Let X_9×1 = [x_0, x_1, ..., x_8]^T and H_9×1 = [h_0, h_1, ..., h_8]^T be nine-dimensional data vectors being convolved and Y_9×1 = [y_0, y_1, ..., y_8]^T be an output vector representing the circular convolution for N = 9. The task is reduced to calculating the following product:

Y_9×1 = H_9 · X_9×1,  (18)

where H_9 is the 9 × 9 circulant matrix formed from the elements of H_9×1. Calculating (18) directly requires 81 multiplications and 72 additions. It is easy to see that the matrix H_9 has a special structure. Taking this specificity into account allows the number of multiplications in the calculation of the nine-point circular convolution to be reduced. Therefore, an efficient algorithm for computing the nine-point circular convolution can be represented using the following matrix-vector procedure:

Y_9×1 = A^(9)_9×15 · A^(9)_15×19 · D^(9)_19 · A^(9)_19×16 · A^(9)_16×14 · A^(9)_14×9 · A^(9)_9 · X_9×1,  (19)

where D^(9)_19 = diag(s_0, s_1, ..., s_18) is a diagonal matrix whose entries are constants precomputed from h_0, ..., h_8, and the matrices A^(9) contain only the entries 0 and ±1. Figure 8 shows a data flow graph of the proposed algorithm, i.e., the algorithmic structure of the processing core for the computation of the nine-point circular convolution.
As far as arithmetic blocks are concerned, nineteen multipliers and seventy-four two-input adders are needed for the completely parallel hardware implementation of the processor core to compute the nine-point convolution (19), instead of the eighty-one multipliers and seventy-two two-input adders in the case of a completely parallel implementation of (18). The proposed algorithm saves sixty-two multiplications at the cost of two extra additions compared to the ordinary matrix-vector multiplication method.

Implementation Complexity
We now estimate the hardware implementation costs of each solution. We assume that the hardware implementation cost of a hardwired multiplier is α and that of a two-input adder is β. By the hardware implementation cost of a solution, we mean a generalized assessment of the hardware complexity of implementing it, considering the area within the VLSI, the dissipated power, and therefore, the consumed energy. We also take into account that an N-input adder consists of N − 1 two-input adders; thus, we treat the implementation cost of an N-input adder as the sum of the implementation costs of N − 1 two-input adders. Then, the total hardware implementation cost C of each solution is equal to:

C = α·M + β·A,  (20)

where M and A denote, respectively, the number of multipliers and the number of two-input adders required for the fully parallel implementation of a particular solution. We can normalize the above equation with respect to the cost of the adder, obtaining the normalized cost:

C = γ·M + A,  (21)

where γ = α/β is the relative cost coefficient of the multiplier. Table 1 shows estimates of the number of arithmetic blocks for the fully parallel implementation of the short-length circular convolution algorithms. The last two columns of the table show the normalized hardware costs for the implementation of the corresponding solutions, expressed in terms of the implementation cost of one two-input adder. The charts presented in Figure 9 illustrate the normalized hardware implementation costs C of the proposed solutions and the naive method for various values of γ and N. Cost comparisons can also be made using percentage changes:

Δ = 100% · (C_n − C_p)/C_n,  (22)

where C_p and C_n are the normalized costs of the proposed algorithm and the naive method, respectively. Figure 10 shows the percentage change as a function of the relative cost coefficient for various values of N.
Assuming that the cost of the multiplier is always at least equal to, and most often greater than, the cost of the adder (γ ≥ 1), the cost of the proposed algorithms is never greater than that of the naive method, and in some cases, cost savings of over 70% are obtained.
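The cost model and the percentage comparison can be reproduced with a short script using the multiplier and adder counts quoted in the text (the dictionary below collects those counts; it is assumed to match Table 1):

```python
def normalized_cost(m, a, gamma):
    """Normalized hardware cost C = gamma * M + A, in adder units."""
    return gamma * m + a

def percent_saving(mp, ap, mn, an, gamma):
    """Percentage change 100 * (Cn - Cp) / Cn between naive and proposed."""
    cn = normalized_cost(mn, an, gamma)
    cp = normalized_cost(mp, ap, gamma)
    return 100.0 * (cn - cp) / cn

# (multipliers, adders) pairs quoted in the text: proposed vs. naive
COUNTS = {
    2: ((2, 4), (4, 2)),
    3: ((4, 11), (9, 6)),
    4: ((5, 15), (16, 12)),
    5: ((10, 30), (25, 20)),
    6: ((8, 34), (36, 30)),
    7: ((16, 68), (49, 42)),
    8: ((14, 46), (64, 56)),
    9: ((19, 74), (81, 72)),
}
```

For example, for N = 9 and γ = 32, the proposed core costs 32·19 + 74 = 682 adder units against 32·81 + 72 = 2664 for the naive one, a saving of roughly 74%, consistent with the over-70% figure above.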

Conclusions
In this article, we analyzed the possibilities of reducing the multiplicative complexity of computing circular convolutions for input sequences of small lengths. We synthesized new hardware-efficient, fully parallel algorithms to implement these operations for N = 3, 4, 5, 6, 7, 8, and 9. The reduced multiplicative complexity of the proposed algorithms is especially important when developing specialized fully parallel VLSI processors, since it minimizes the number of necessary hardware multipliers and reduces the power dissipation, as well as the total cost of implementing the entire system being developed [30][31][32]. Thus, a decrease in the number of multipliers, even at the expense of a moderate increase in the number of adders, plays an important role in the hardware implementation of the proposed algorithms. Consequently, the use of the proposed solutions makes it possible to reduce the complexity of the hardware implementation of cyclic convolution kernels. In addition, as can be seen from Figures 1-8, the algorithms presented in this article have a pronounced regular and modular structure. This facilitates the mapping of these algorithms to an ASIC structure and unifies their implementation in FPGAs. Finally, the computation can be further accelerated by exploiting the parallelism inherent in these structures.
Funding: This research received no external funding.