Some Algorithms for Computing Short-Length Linear Convolution

Cariow, Aleksandr; Paplinski, Janusz P.

doi:10.3390/electronics9122115

Open AccessArticle

Some Algorithms for Computing Short-Length Linear Convolution

by

Aleksandr Cariow

^* and

Janusz P. Paplinski

Faculty of Computer Science and Information Technology, West Pomeranian University of Technology, Żołnierska 49, 71-210 Szczecin, Poland

^*

Author to whom correspondence should be addressed.

Electronics 2020, 9(12), 2115; https://doi.org/10.3390/electronics9122115

Submission received: 19 October 2020 / Revised: 30 November 2020 / Accepted: 7 December 2020 / Published: 10 December 2020

(This article belongs to the Special Issue Theory and Applications in Digital Signal Processing)

Download

Browse Figures

Versions Notes

Abstract

:

In this article, we propose a set of efficient algorithmic solutions for computing short linear convolutions focused on hardware implementation in VLSI. We consider convolutions for sequences of length

N =

2, 3, 4, 5, 6, 7, and 8. Hardwired units that implement these algorithms can be used as building blocks when designing VLSI -based accelerators for more complex data processing systems. The proposed algorithms are focused on fully parallel hardware implementation, but compared to the naive approach to fully parallel hardware implementation, they require from 25% to about 60% less, depending on the length N and hardware multipliers. Since the multiplier takes up a much larger area on the chip than the adder and consumes more power, the proposed algorithms are resource-efficient and energy-efficient in terms of their hardware implementation.

Keywords:

linear convolution algorithms; fast hardware-oriented computations; convolution neural networks

1. Introduction

Discrete convolution is found in many applications in science and engineering. Above all, it plays a key role in modern digital signal and image processing. In digital signal processing, it is the basis of filtering, multiresolution decomposition, and optimization of the calculation of orthogonal transform [1,2,3,4,5,6,7,8,9,10]. In digital image processing, convolution is a basic operation of denoising, smoothing, edge detection, blurring, focusing, etc. [11,12,13]. There are two types of discrete convolutions: the cyclic convolution and the linear convolution. General principles for the synthesis of convolution algorithms were described in [1,2,3]. The main emphasis in these works was made primarily on the calculation of cyclic convolution, while in many digital signal and image processing applications, the calculation of linear convolutions is required.

In recent years, convolution has found unusually wide application in neural networks and deep learning. Among the various kinds of deep neural networks, convolutional neural networks (CNNs) are most widely used [14,15,16,17]. In CNNs, linear convolutions are the most computationally intensive operations, since in a typical implementation, their multiple computations occupy more than 90% of the CNN execution time [14]. Only one convolutional level in a typical CNN requires more than two thousand multiplications and additions. Usually, there are several such levels in the CNN. That is why developers of such type of networks seek and design efficient ways of implementing linear convolution using the smallest possible number of arithmetic operations.

To speed up linear convolution computation, various algorithmic methods have been proposed. The most common approach to effective calculating linear convolution is dipping it in the space of a double-size cyclic convolution with the subsequent application of a fast Fourier transform (FFT) algorithm [15,16,17]. The FFT-based linear convolution method is traditionally used for large length finite impulse response (FIR) filters; however, modern CNNs use predominantly small length FIR filters. In this situation, the most effective algorithms used in the computation of a small-length linear convolution are the Winograd-like minimal filtering algorithms [18,19,20], which are the most used currently. The algorithms compute linear convolution over small tiles with minimal complexity, which makes them more effective with small filters and small batch sizes; however, these algorithms do not calculate the whole convolution. They calculate only two inner products of neighboring vectors formed from the current data stream by a moving time window of length N; therefore, these algorithms do not compute the true linear convolution.

At the same time, there are a number of CNNs in which it is necessary to calculate full-size small-length linear convolutions. In addition, in many applications of digital signal processing, there is the problem of calculating a one-dimensional convolution using its conversion into a multidimensional convolution. The algorithm thus obtained has a modular structure, and each module calculates a short-length one-dimensional convolution [21].

The most popular sizes of sequences being convoluted are sequences of length 2, 3, 4, 5, 6, 7, and 8. However, in the papers known to the authors, there is no description of resource-efficient algorithms for calculation of linear convolutions for lengths greater than four [1,4,6,21,22]. In turn, the solutions given in the literature for N = 2, N = 3, and N = 4 do not give a complete imagination about the organization of the linear convolution calculation process, since their corresponding signal flow graphs are not presented anywhere. In this paper, we describe a complete set of solutions for linear convolution of small length N sequences from 2 to 8.

2. Preliminaries

Let

{h_{m}}

and,

{x_{n}}

,

m = 0, 1, \dots, M - 1

,

n = 0, 1, \dots, N - 1

be two finite sequences of length M and N, respectively. Their linear convolution is the sequence

{y_{1}}

,

i = 0, 1, \dots, M + N - 2

defined by [1]:

y_{i} = \sum_{n = 0}^{N - 1} h_{i - n} x_{n},

(1)

where we take

h_{i - n} = 0

, if

i - n < 0

.

As a rule, the elements of one of the sequences to be convolved are constant numbers. For definiteness, we assume that it will be a sequence

{h_{m}}

.

Because sequences

{x_{n}}

and

{h_{m}}

are finite length, then their linear convolution (1) can also be implemented as matrix-vector multiplication:

Y_{(N + M - 1) \times 1} = H_{(N + M - 1) \times N} X_{N \times 1}

(2)

where

H_{(N + M - 1) \times N} = [\begin{matrix} h_{0} \\ h_{1} & h_{0} \\ ⋮ & h_{1} & ⋱ \\ h_{M - 1} & ⋮ & \dots & h_{0} \\ h_{M - 1} & \dots & h_{1} \\ ⋱ & ⋮ \\ h_{M - 1} \end{matrix}],

(3)

X_{N \times 1} = {[x_{0}, x_{1}, \dots x_{N - 1}]}^{T},

Y_{(N + M - 1) \times 1} = {[y_{0}, y_{1}, \dots, y_{N + M - 2}]}^{T},

H_{M \times 1} = {[h_{0}, h_{1}, \dots, h_{M - 1}]}^{T} .

In the future, we assume that

X_{N \times 1}

is a vector of input data,

Y_{(N + M - 1) \times 1}

is a vector of output data, and

H_{M \times 1}

is a vector containing constants.

Direct computation of (3) takes MN multiplications and (M − 1)(N − 1) addition. This means that the fully parallel hardware implementation of the linear convolution operation requires MN multipliers and N + M − 3 multi-input adders with a different number of inputs, which depends on the length of the sequences being convolved. Traditionally, the convolution for which M = N is assumed as a basic linear convolution operation. Resource-effective cyclic convolution algorithms for benchmark lengths (N = 2, 3,...,16) have long been published [1,2,3,4,5,6,7,8,9]. For linear convolution, optimized algorithms are described only for the cases N = 2, 3, 4 [4,6,21,22]. Below we show how to reduce the implementation complexity of some benchmark-lengths linear convolutions for the case of completely parallel hardware their implementation. For completeness, we also consider algorithms for the sequences of lengths M = N = 2, 3, and 4.

So, considering the above, the goal of this article is to develop and describe fully parallel resource-efficient algorithms for N = 2, 3, 4, 5, 6, 7, 8.

3. Algorithms for Short-Length Linear Convolution

The main idea of presented algorithm is to transform the linear convolution matrix into circular matrix and two Toeplitz matrices. Then we can rewrite (3) in following form:

Y_{(2 N - 1) \times 1} = H_{(2 N - 1) \times N} X_{N \times 1} = [[\begin{matrix} H_{K \times N} \\ {\overset{˘}{H}}_{N} \\ H_{L \times N} \end{matrix}] - [\begin{matrix} 0_{K \times N} \\ H_{L \times N} \\ 0_{1 \times N} \\ H_{K \times N} \\ 0_{L \times N} \end{matrix}]] X_{N \times 1}

(4)

where

H_{K \times N} = [\begin{matrix} T_{K}^{(l)} & 0_{K \times (N - K)} \end{matrix}]

and

H_{L \times N} = [\begin{matrix} 0_{L \times (N - L)} & T_{L}^{(r)} \end{matrix}]

are matrices that are horizontal concatenations of null matrices and left-triangular or right-triangular Toeplitz matrices, respectively:

T_{K \times N}^{(l)} = [\begin{matrix} h_{0} & 0 & \dots & 0 \\ h_{1} & h_{0} & \dots & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ \\ h_{K - 1} & h_{K - 2} & \dots & h_{0} \end{matrix}], T_{L}^{(r)} = [\begin{matrix} h_{N - 1} & h_{N - 2} & \dots & h_{N - L - 2} \\ 0 & h_{N - 1} & \dots & h_{N - L - 1} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & \dots & h_{N - 1} \end{matrix}],

which gives

H_{K \times N} = [\begin{matrix} h_{0} & 0 & \dots & 0 & 0 & 0 & \dots & 0 \\ h_{1} & h_{0} & \dots & 0 & 0 & 0 & ⋮ & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ h_{K - 1} & h_{K - 2} & \dots & h_{0} & 0 & 0 & \dots & 0 \end{matrix}],

H_{L \times N} = [\begin{matrix} \begin{matrix} 0 & 0 & \dots & 0 \\ 0 & 0 & \dots & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & \dots & 0 \end{matrix} & \begin{matrix} h_{N - 1} & h_{N - 2} & \dots & h_{N - L - 2} \\ 0 & h_{N - 1} & \dots & h_{N - L - 1} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & \dots & h_{N - 1} \end{matrix} \end{matrix}] .

A circulant matrix

{\overset{˘}{H}}_{N}

is a matrix of cyclic convolution H_N with rows cyclically shifted by n positions down:

{\overset{˘}{H}}_{N} = I_{N}^{(\leftarrow n)} H_{N} = [\begin{matrix} h_{K} & h_{K - 1} & \dots & h_{K + 1} \\ h_{K + 1} & h_{K} & \dots & h_{K + 2} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ h_{N - 1} & h_{N - 2} & \dots & h_{0} \\ h_{0} & h_{N - 1} & \dots & h_{1} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ h_{K - 1} & h_{K - 2} & \dots & h_{K} \end{matrix}],

where

I_{N}^{(\leftarrow n)}

- is permutation matrix obtained from the identity matrix by cyclic shift of its columns by n positions to the left and:

H_{N} = [\begin{matrix} h_{0} & h_{N - 1} & \dots & h_{1} \\ h_{1} & h_{0} & \dots & h_{2} \\ ⋮ & ⋮ & ⋮ & ⋮ \\ h_{N - 1} & h_{N - 2} & \dots & h_{0} \end{matrix}] .

The coefficients K and L are natural numbers arbitrary taken and fulfilling the dependence K + L = N − 1. These values are selected heuristically for each N separately.

The product

{\overset{˘}{H}}_{N} X_{N \times 1}

is calculated using the well-known fast convolution algorithm. The products of

H_{K \times N} X_{N \times 1}

and

H_{L \times N} X_{N \times 1}

are also calculated using fast algorithms for matrix-vector multiplication with Toeplitz matrices. We use all of the above techniques to synthesize the final short-length linear convolution algorithms with reduced multiplicative complexity.

3.1. Algorithm for N = 2

Let

X_{2 \times 1} = {[x_{0}, x_{1}]}^{T}

and

H_{2 \times 1} = {[h_{0}, h_{1}]}^{T}

be 2-dimensional data vectors being convolved and

Y_{3 \times 1} = {[y_{0}, y_{1}, y_{2}]}^{T}

be an input vector representing a linear convolution. The problem is to calculate the product

Y_{3 \times 1} = H_{3 \times 2} X_{2 \times 1},

(5)

where

H_{3 \times 2} = [\begin{matrix} h_{0} \\ h_{1} & h_{0} \\ h_{1} \end{matrix}] .

Direct computation of (5) takes four multiplications and one addition. It is easy to see that the matrix

H_{3 \times 2}

possesses an uncommon structure. By the Toom–Cook algorithmic trick, the number of multiplications in the calculation of the 2-point linear convolution can be reduced [1,21].

With this in mind, the rationalized computational procedure for computing 2-point linear convolution has the following form:

Y_{3 \times 1} = A_{3}^{(2)} D_{3} A_{3 \times 2}^{(2)} X_{2 \times 1}

(6)

where

A_{3 \times 2}^{(2)} = [\begin{matrix} 1 \\ 1 & - 1 \\ 1 \end{matrix}], A_{3}^{(2)} = [\begin{matrix} 1 \\ 1 & - 1 & 1 \\ 1 \end{matrix}], D_{3} = d i a g (h_{0}, h_{0} - h_{1}, h_{1}),

(7)

s_{0}^{(2)} = h_{0}, s_{1}^{(2)} = h_{0} - h_{1}, s_{2}^{(2)} = h_{1} .

(8)

Figure 1 shows a signal flow graph for the proposed algorithm, which also provides a simplified schematic view of a fully parallel processing unit for resource-effective implementing of 2-point linear convolution. In this paper, the all data flow graphs are oriented from left to right. Straight lines in the figures denote the data transfer (data path) operations. The circles in these figures show the operation of multiplication (multipliers in the case of hardware implementation) by a number inscribed inside a circle. Points where lines converge denote summation (adders in the case of hardware implementation) and dotted lines indicate the sign-change data paths (data paths with multiplication by −1). We use the usual lines without arrows on purpose, so as not to clutter the picture [23].

As it can be seen, the calculation of 2-point linear convolution requires only three multiplications and three additions. In terms of arithmetic units, a fully parallel hardware implementation of the processor unit for calculating a 2-point convolution will require three multipliers, one two-input adder, and one three-input adder. In terms of arithmetic units, a fully parallel hardware implementation of the processor unit for calculating a 2-point convolution will require three multipliers, one two-input adder, and one three-input adder instead of four multipliers and one two-input adder in the case of completely parallel implementation of (6). So, we exchanged one multiplier for one three-input adder.

3.2. Algorithm for N = 3

Let

X_{3 \times 1} = {[x_{0}, x_{1}, x_{2}]}^{T}

and

H_{3 \times 1} = {[h_{0}, h_{1}, h_{2}]}^{T}

be 3-dimensional data vectors being convolved and

Y_{5 \times 1} = {[y_{0}, y_{1}, y_{2}, y_{3}, y_{4}]}^{T}

be an input vector represented linear convolution for N = 3. The problem is to calculate the product

Y_{5 \times 1} = H_{5 \times 3} X_{3 \times 1},

(9)

where

H_{5 \times 3} = [\begin{matrix} h_{0} \\ h_{1} & h_{0} \\ h_{2} & h_{1} & h_{0} \\ h_{2} & h_{1} \\ h_{2} \end{matrix}] .

(10)

Direct computation of (9) takes nine multiplications and five addition. Because the matrix

H_{5 \times 3}

also possesses uncommon structure, the number of multiplications in the calculation of the 3-point linear convolution can be reduced too [1,4,21].

An algorithm for computation 3-point linear convolution with reduced multiplicative complexity can be written with the help of following matrix-vector calculating procedure:

Y_{5 \times 1}^{(3)} = A_{5 \times 6}^{(3)} D_{6}^{(3)} A_{6 \times 3}^{(3)} X_{3 \times 1},

(11)

where

A_{6 \times 3}^{(3)} = [\begin{matrix} 1 \\ 1 \\ 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \end{matrix}], A_{5 \times 6}^{(3)} = [\begin{matrix} 1 \\ - 1 & - 1 & 1 \\ - 1 & 1 & - 1 & 1 \\ - 1 & - 1 & 1 \\ 1 \end{matrix}],

(12)

D_{6}^{(3)} = d i a g (h_{0}, h_{1}, h_{2}, h_{0} + h_{1}, h_{0} + h_{2}, h_{1} + h_{2}) .

(13)

Figure 2 shows a signal flow graph of the proposed algorithm for the implementation of 3-point linear convolution. As it can be seen, the calculation of 3-point linear convolution requires only 6 multiplications and 10 additions. Thus, the proposed algorithm saves six multiplications at the cost of six extra additions compared to the ordinary matrix-vector multiplication method.

In terms of arithmetic units, a fully parallel hardware implementation of the processor unit for calculating a 3-point linear convolution (11) will require only six multipliers, four two-input adders, one three-input adder, and one four-input adder instead of 12 multipliers, 2 two-input adders, and 1 two-input adder in the case of fully parallel implementation of (9).

3.3. Algorithm for N = 4

Let

X_{4 \times 1} = {[x_{0}, x_{1}, x_{2}, x_{3}]}^{T}

and

H_{4 \times 1} = {[h_{0}, h_{1}, h_{2}, h_{3}]}^{T}

be 4-dimensional data vectors being convolved and

Y_{7 \times 1} = {[y_{0}, y_{1}, y_{2}, y_{3}, y_{4}, y_{5}, y_{6}]}^{T}

be an input vector represented linear convolution for N = 4.

The problem is to calculate the product:

Y_{7 \times 1} = H_{7 \times 4} X_{4 \times 1},

(14)

where

H_{7 \times 4} = [\begin{matrix} h_{0} \\ h_{1} & h_{0} \\ h_{2} & h_{1} & h_{0} \\ h_{3} & h_{2} & h_{1} & h_{0} \\ h_{3} & h_{2} & h_{1} \\ h_{3} & h_{2} \\ h_{3} \end{matrix}] .

Direct computation of (14) takes 16 multiplications and 9 addition. Due to the specific structure of matrix

H_{7 \times 4}

, the number of multiplication operations in the calculation of (14) can be significantly reduced.

An algorithm for computation 4-point linear convolution with reduced multiplicative complexity can be written using the following matrix-vector calculating procedure:

Y_{7 \times 1} = A_{7 \times 8}^{(4)} A_{8 \times 9}^{(4)} D_{9}^{(4)} A_{9 \times 8}^{(4)} A_{8 \times 4}^{(4)} X_{4 \times 1},

(15)

where

A_{8 \times 4}^{(4)} = [\begin{matrix} 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & - 1 \\ 1 & - 1 \\ 1 \\ 1 \\ 1 \end{matrix}], A_{9 \times 8}^{(4)} = 1 \oplus H_{2} \oplus [\begin{matrix} 1 \\ 1 \\ 1 & 1 \end{matrix}] \oplus I_{3},

where

I_{N}

is anidentity

N \times N

matrix,

H_{2}

is the

(2 \times 2)

Hadamard matrix, and sign,”⊕” denotes direct sum of two matrices [23,24,25].

D_{9}^{(4)} = d i a g (s_{0}^{(4)} s_{1}^{(4)}, \dots, s_{8}^{(4)}),

s_{0}^{(4)} = h_{0}, s_{1}^{(4)} = (h_{0} + h_{1} + h_{2} + h_{3}) / 4, s_{2}^{(4)} = (h_{0} - h_{1} + h_{2} - h_{3}) / 4,

s_{3}^{(4)} = (h_{0} - h_{1} - h_{2} + h_{3}) / 2, s_{4}^{(4)} = (h_{0} + h_{1} - h_{2} - h_{3}) / 2, s_{5}^{(4)} = (h_{0} - h_{2}) / 2, s_{6}^{(4)} = h_{3},

s_{7}^{(4)} = h_{2}, s_{8}^{(4)} = h_{3},

A_{8 \times 9}^{(4)} = 1 \oplus H_{2} \oplus [\begin{matrix} - 1 & 1 \\ - 1 & 1 \end{matrix}] \oplus I_{3}, A_{7 \times 8}^{(4)} = [\begin{array}{c} 1 \\ 1 & 1 & - 1 & - 1 \\ 1 & - 1 & - 1 \\ 1 & - 1 \\ - 1 & 1 & 1 \\ 1 & 1 \\ 1 \end{array}] .

Figure 3 shows a signal flow graph of the proposed algorithm for the implementation of 4-point linear convolution. So, the proposed algorithm saves 7 multiplications at the cost of 11 extra additions compared to the ordinary matrix-vector multiplication method.

In terms of arithmetic units, a fully parallel hardware implementation of the processor unit for calculating a 4-point linear convolution will require only 9 multipliers, 13 two-input adders, 2 three-input adders, and 1 four-input adder instead of 16 multipliers, 2 two-input adders, 2 three-input adders, and 1 four-input adder in the case of fully parallel implementation of (14).

3.4. Algorithm for N = 5

Let

X_{5 \times 1} = {[x_{0}, x_{1}, x_{2}, x_{3}, x_{4}]}^{T}

and

H_{5 \times 1} = {[h_{0}, h_{1}, h_{2}, h_{3}, h_{4}]}^{T}

be 5-dimensional data vectors being convolved and

Y_{9 \times 1} = {[y_{0}, y_{1}, y_{2}, y_{3}, y_{4}, y_{5}, y_{6}, y_{7}, y_{8}]}^{T}

be an input vector representing a linear convolution for N = 5.

The problem is to calculate the product:

Y_{9 \times 1} = H_{9 \times 5} X_{5 \times 1},

(16)

where

H_{9 \times 5} = [\begin{matrix} h_{0} \\ h_{1} & h_{0} \\ ⋮ & h_{1} & ⋱ \\ h_{4} & ⋮ & ⋱ & h_{0} \\ h_{4} & ⋱ & h_{1} \\ ⋱ & ⋮ \\ h_{4} \end{matrix}] .

Direct computation of (16) takes 25 multiplications and 16 addition. Due to the specific structure of matrix

H_{9 \times 5}

, the number of multiplication operations in the calculation of (16) can be significantly reduced.

Thus, an algorithm for computation 5-point linear convolution with reduced multiplicative complexity can be written using the following matrix-vector calculating procedure:

Y_{9 \times 1} = A_{9 \times 11}^{(5)} A_{11 \times 13}^{(5)} A_{13 \times 16}^{(5)} D_{16}^{(5)} A_{16 \times 15}^{(5)} A_{15 \times 11}^{(5)} A_{11 \times 5}^{(5)} X_{5 \times 1},

(17)

where

A_{11 \times 5}^{(5)} = [\begin{array}{c} \begin{array}{c} 1 \\ 1 & 0_{3} \\ 1 \end{array} \\ \begin{matrix} 1 & - 1 \\ 1 & - 1 \\ 1 & - 1 \\ 1 & - 1 \\ 1 & 1 & 1 & 1 & 1 \end{matrix} \\ \begin{array}{c} 1 \\ 0_{3} & 1 \\ 1 \end{array} \end{array}], A_{15 \times 11}^{(5)} = [\begin{array}{c} \begin{matrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 & 1 \\ 1 \\ 1 \\ 1 & 1 \\ 1 & - 1 \\ 1 & - 1 \end{matrix} & 0_{11 \times 4} \\ 0_{4 \times 7} & I_{4} \end{array}]

and

0_{M \times N}

is a null matrix of order

M \times N

[23,24,25],

D_{16}^{(5)} = d i a g (s_{0}^{(5)} s_{1}^{(5)}, \dots, s_{15}^{(5)}),

s_{0}^{(5)} = h_{0}, s_{1}^{(5)} = h_{1}, s_{2}^{(5)} = h_{0}, s_{3}^{(5)} = (h_{0} - h_{2} + h_{3} - h_{4}) / 4, s_{4}^{(5)} = (h_{1} - h_{2} + h_{3} - h_{4}) / 4,

s_{5}^{(5)} = (3 h_{2} - 2 h_{1} + 2 h_{0} - 2 h_{3} + 3 h_{4}) / 5, s_{6}^{(5)} = (- h_{0} + h_{1} - h_{2} + h_{3}), s_{7}^{(5)} = (- h_{0} + h_{1} - h_{2} + h_{3}),

s_{8}^{(5)} = (3 h_{0} - 2 h_{1} + 3 h_{2} - 2 h_{3} - 2 h_{4}) / 5, s_{9}^{(5)} = - h_{2} + h_{3}, s_{10}^{(5)} = h_{1} - h_{2},

s_{11}^{(5)} = (- h_{0} - h_{1} + 4 h_{2} - h_{3} - h_{4}) / 5, s_{12}^{(5)} = (h_{0} + h_{1} + h_{2} + h_{3} + h_{4}) / 5, s_{13}^{(5)} = h_{4}, s_{14}^{(5)} = h_{3},

s_{15}^{(5)} = h_{4},

A_{16 \times 15}^{(5)} = [\begin{array}{c} \begin{array}{c} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 & - 1 \end{array} & 0_{12 \times 4} \\ 0_{4 \times 11} & I_{4} \end{array}],

A_{13 \times 16}^{(5)} = I_{3} \oplus [\begin{array}{c} 1 & 1 \\ 1 & 1 & 0_{3} \\ 1 & 1 \\ 1 & 1 \\ 0_{3} & 1 & 1 \\ 1 & 1 \end{array}] \oplus I_{4},

A_{11 \times 13}^{(5)} = I_{3} \oplus [\begin{array}{c} 1 & - 1 & 1 \\ - 1 & - 1 & - 1 & - 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & - 1 & 1 \end{array}] \oplus I_{3},

A_{9 \times 11}^{(5)} = [\begin{array}{c} 1 & 0_{2 \times 6} \\ 1 & 1 \\ 1 & - 1 & - 1 \\ 0_{3 \times 5} & 1 & - 1 \\ 1 \\ - 1 & 1 & \begin{matrix} 0_{2 \times 6} \end{matrix} \\ - 1 & - 1 & 1 \\ \begin{matrix} 0_{2 \times 5} \end{matrix} & 1 & 1 \\ 1 \end{array}] .

Figure 4 shows a data flow diagram of the proposed algorithm for the implementation of 5-point linear convolution. The algorithm saves 9 multiplications at the cost of 22 extra additions compared to the ordinary matrix-vector multiplication method.

In terms of arithmetic units, a fully parallel hardware implementation of the processor unit for calculating a 5-point linear convolution will require 16 multipliers, 20 two-input adders, 2 three-input adders, 1 four-input adder, and 1 five-input adder instead of 25 multipliers, 2 two-input adders, 2 three-input adders, 2 four-input adders, and 1 five-input adder in the case of fully parallel implementation of expression (16).

3.5. Algorithm for N = 6

Let

X_{6 \times 1} = {[x_{0}, x_{1}, x_{2}, x_{3}, x_{4}, x_{5}]}^{T}

and

H_{6 \times 1} = {[h_{0}, h_{1}, h_{2}, h_{3}, h_{4}, h_{5}]}^{T}

be 6-dimensional data vectors being convolved and

Y_{11 \times 1} = {[y_{0}, y_{1}, y_{2}, y_{3}, y_{4}, y_{5}, y_{6}, y_{7}, y_{8}, y_{9}, y_{10}]}^{T}

be an input vector representing a linear convolution for N = 6.

The problem is to calculate the product:

Y_{11 \times 1} = H_{11 \times 6} X_{6 \times 1},

(18)

where

H_{11 \times 6} = [\begin{matrix} h_{0} \\ h_{1} & h_{0} \\ ⋮ & h_{1} & ⋱ \\ h_{5} & ⋮ & ⋱ & h_{0} \\ h_{5} & ⋱ & h_{1} \\ ⋱ & ⋮ \\ h_{5} \end{matrix}] .

Direct computation of (18) takes 36 multiplications and 25 addition. We proposed an algorithm that takes only 16 multiplications and 44 additions. It saves 20 multiplications at the cost of 19 extra additions compared to the ordinary matrix-vector multiplication method.

Proposed algorithm for computation 6-point linear convolution can be written with the help of following matrix-vector calculating procedure:

Y_{11}^{(6)} = A_{11}^{(6)} {\overset{˘}{A}}_{11}^{(6)} A_{11 \times 14}^{(6)} {\hat{A}}_{14}^{(6)} {\overset{˘}{A}}_{14}^{(6)} A_{14 \times 16}^{(6)} D_{16}^{(6)} A_{16 \times 14}^{(6)} {\overset{˘}{A}}_{14}^{(6)} A_{14}^{(6)} A_{14 \times 6}^{(6)} X_{6 \times 1},

(19)

where

A_{14 \times 6}^{(6)} = [\begin{array}{c} \begin{matrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{matrix} & 0_{5 \times 3} \\ 0_{3} & \begin{matrix} 1 \\ 1 \\ 1 \end{matrix} \\ \begin{matrix} 1 \end{matrix} & \begin{matrix}  \end{matrix} \\ 0_{3} & \begin{matrix} 1 \\ 1 \\ 1 \end{matrix} \\ 0_{2 \times 3} & \begin{matrix} 1 & 1 \\ 1 \end{matrix} \end{array}], A_{14}^{(6)} = I_{3} \oplus (H_{2} \otimes I_{3}) \oplus I_{5},

{\overset{˘}{A}}_{14}^{(6)} = I_{3} \oplus (I_{2} \otimes [\begin{matrix} 1 & 1 & 1 \\ 1 & - 1 \\ 1 & - 1 \end{matrix}]) \oplus I_{5}, A_{16 \times 14}^{(6)} = I_{3} \oplus (I_{2} \otimes [\begin{matrix} 1 \\ 1 \\ 1 \\ 1 & 1 \end{matrix}]) \oplus I_{5},

D_{16} = d i a g (s_{0}^{(6)}, s_{1}^{(6)}, \dots, s_{15}^{(6)}),

\begin{matrix} s_{0}^{(6)} = 6 h_{0}, s_{1}^{(6)} = 6 h_{1}, s_{2}^{(6)} = 6 h_{0}, s_{3}^{(6)} = h_{0} + h_{3} + h_{4} + h_{1} + h_{2} + h_{5}, \\ s_{4}^{(6)} = 3 (h_{4} + h_{1} - h_{0} - h_{3}), s_{5}^{(6)} = 3 (h_{2} + h_{5} - h_{0} - h_{3}), \\ s_{6}^{(6)} = 3 (h_{0} + h_{3}) - (h_{0} + h_{3} + h_{4} + h_{1} + h_{2} + h_{5}), s_{7}^{(6)} = h_{0} - h_{3} + h_{4} - h_{1} + h_{2} - h_{5}), \\ s_{8}^{(6)} = 3 (h_{4} - h_{1} - h_{0} + h_{3}), s_{9}^{(6)} = 3 (h_{2} - h_{5} - h_{0} + h_{3}), \\ s_{10}^{(6)} = 3 (h_{0} + h_{3}) - (h_{0} - h_{3} + h_{4} - h_{1} + h_{2} - h_{5}), s_{11}^{(6)} = 6 h_{5}, s_{12}^{(6)} = 6 (- h_{4} + h_{5}), \\ s_{13}^{(6)} = 6 (h_{3} - h_{4}), s_{14}^{(6)} = 6 h_{4}, s_{15}^{(6)} = 6 h_{5}, \end{matrix}

A_{14 \times 16}^{(6)} = I_{4} \oplus [\begin{array}{c} \begin{matrix} 1 & 1 \\ 1 & 1 \end{matrix} & 0_{2 \times 4} \\ 0_{1 \times 3} & 1 & 0_{1 \times 3} \\ 0_{2 \times 4} & \begin{matrix} 1 & 1 \\ 1 & 1 \end{matrix} \end{array}] \oplus I_{5}, {\hat{A}}_{14}^{(6)} = I_{4} \oplus [\begin{matrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{matrix}] \oplus I_{5},

\begin{matrix} A_{11 \times 14}^{(6)} = [\begin{matrix} 1 \\ 1 & 1 \end{matrix}] \oplus (H_{2} \otimes I_{3}) \oplus [\begin{matrix} 1 & 1 & 1 \\ 1 & 1 \\ 1 \end{matrix}], {\overset{˘}{A}}_{11}^{(6)} = [\begin{array}{c} I_{3} & 0_{3 \times 4} & 0_{3} \\ 0_{4 \times 3} & \begin{matrix} 1 \\ 1 \\ 1 \\ 1 \end{matrix} & 0_{4 \times 3} \\ 0_{4 \times 3} & 0_{4 \times 3} & I_{4} \end{array}], \end{matrix}

and sign “⊗” denotes tensor or Kronecker product of two matrices [23,24,25].

Figure 5 shows a data flow diagram of the proposed algorithm for the implementation of 6-point linear convolution.

In terms of arithmetic units, a fully parallel hardware implementation of the processor unit for calculating a 6-point linear convolution will require 16 multipliers, 32 two-input adders, and 5 three-input adders, instead of 36 multipliers, 2 two-input adders, 2 three-input adders, 2 four-input adders, 2 five-input adders, and 1 six-input adder in the case of completely parallel implementation of expression (18).

3.6. Algorithm for N = 7

Let

X_{7 \times 1} = {[x_{0}, x_{1}, x_{2}, x_{3}, x_{4} . x_{5}, x_{6}]}^{T}

and

H_{7 \times 1} = {[h_{0}, h_{1}, h_{2}, h_{3}, h_{4}, h_{5}, h_{6}]}^{T}

be 7-dimensional data vectors being convolved and

Y_{13 \times 1} = {[y_{0}, y_{1}, y_{2}, y_{3}, y_{4}, y_{5}, y_{6}, y_{7}, y_{8}, y_{9}, y_{10}, y_{11}, y_{12}]}^{T}

be an input vector representing a linear convolution for N = 7.

The problem is to calculate the product:

Y_{13 \times 1} = H_{13 \times 7} X_{7 \times 1},

(20)

where

H_{13 \times 7} = [\begin{matrix} h_{0} \\ h_{1} & h_{0} \\ ⋮ & h_{1} & ⋱ \\ h_{6} & ⋮ & ⋱ & h_{0} \\ h_{6} & ⋱ & h_{1} \\ ⋱ & ⋮ \\ h_{6} \end{matrix}] .

Direct computation of (20) takes 49 multiplications and 36 addition. We developed an algorithm that contains only 26 multiplications and 79 additions. It saves 23 multiplications at the cost of 43 extra additions compared to the ordinary matrix-vector multiplication method.

The proposed algorithm for computation 7-point linear convolution with reduced multiplicative complexity can be written using the following matrix-vector calculating procedure:

Y_{13 \times 1}^{(7)} = A_{13 \times 26}^{(7)} D_{26}^{(7)} A_{26 \times 7}^{(7)} X_{7 \times 1},

(21)

where

A_{13 \times 26}^{(7)} = A_{13 \times 15}^{(7)} A_{15 \times 20}^{(7)} A_{20 \times 21}^{(7)} A_{21 \times 22}^{(7)} A_{22 \times 25}^{(7)} A_{25 \times 21}^{(7)} A_{21}^{(7)} A_{21 \times 26}^{(7)},

A_{26 \times 7}^{(7)} = A_{26}^{(7)} A_{26 \times 28}^{(7)} A_{28 \times 21}^{(7)} A_{21 \times 18}^{(7)} A_{18 \times 7}^{(7)}

and

\begin{matrix} A_{18 \times 7}^{(7)} = [\begin{array}{c} \begin{matrix} 1 \\ 1 \\ 1 \\ 1 & 1 \\ 1 \\ 1 & 1 & 1 \end{matrix} & 0_{6 \times 3} \\ I_{4} & \begin{array}{c} 0_{4 \times 2} & - 1_{4 \times 1} \end{array} \\ 0_{2 \times 4} & \begin{matrix} 1 & - 1 \\ 1 & - 1 \end{matrix} \\ 0_{6 \times 4} & \begin{matrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 & 1 \\ 1 \end{matrix} \end{array}], A_{21 \times 18}^{(7)} = I_{5} \oplus [\begin{matrix} 1 \\ 1 & 1 \\ 1 \\ - 1 & 1 \\ 1 \\ 1 & 1 \\ - 1 & 1 \\ 1 & 1 \\ - 1 & 1 \\ 1 \\ 1 \end{matrix}] \oplus I_{5}, \end{matrix}

- 1_{4 \times 1} = {[- 1, - 1, - 1, - 1]}^{T},

A_{28 \times 21}^{(7)} = I_{7} \oplus [\begin{array}{c} 1 \\ 1 & 0_{3 \times 4} & 0_{3} \\ 1 \\ 1 & 1 \\ 1 & 1 & 0_{4 \times 3} \\ 1 & 1 \\ 1 & 1 \\ 1 \\ 1 \\ 0_{6 \times 3} & 1 & 0_{6 \times 3} \\ 1 \\ 1 \\ 1 \\ 1 \\ - 1 & 1 \\ 0_{4 \times 3} & 0_{4} & 1 \\ 1 \end{array}] \oplus I_{4},

A_{26 \times 28}^{(7)} = I_{10} \oplus [\begin{array}{c} \begin{matrix} 1 \\ - 1 & 1 \\ 1 \\ 1 \\ - 1 & 1 \\ 1 \end{matrix} & 0_{6} \\ \begin{matrix} 1 & \begin{matrix}  \end{matrix} \\ 1 \end{matrix} & \begin{matrix} 1 & 1 & 1 \\ 1 & 1 & 1 \end{matrix} \end{array}] \oplus I_{8}

A_{26}^{(7)} = I_{5} \oplus [\begin{array}{c} \begin{matrix} 1 & 1 & 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{matrix} & 0_{6} \\ 0_{7} & \begin{matrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ - 1 & 1 \\ 1 \end{matrix} \end{array}] \oplus I_{8},

D_{26}^{(7)} = d i a g (s_{0}^{(7)}, s_{1}^{(7)}, \dots, s_{25}^{(7)}),

\begin{matrix} s_{0}^{(7)} = h_{0}, s_{1}^{(7)} = h_{2} - h_{1}, s_{2}^{(7)} = h_{0} - h_{1}, s_{3}^{(7)} = h_{1}, s_{4}^{(7)} = h_{0}, \\ s_{5}^{(7)} = (h_{6} + h_{5} + h_{4} + h_{3} + 2 h_{2} + h_{1} + h_{0}) / 7, s_{6}^{(7)} = (- h_{6} - 2 h_{5} + 3 h_{4} - h_{3} - 2 h_{2} + h_{1} + 2 h_{0}) / 2, \\ s_{7}^{(7)} = (2 h_{4} - h_{3} - 2 h_{2} + h_{1}) / 2, s_{8}^{(7)} = (- h_{6} + h_{5} + 2 h_{4} - h_{3} - 2 h_{2} + 3 h_{1} - h_{0}) / 2, \\ s_{9}^{(7)} = (10 h_{6} + 3 h_{5} - 11 h_{4} + 10 h_{3} + 3 h_{2} - 11 h_{1} - 4 h_{0}) / 14, \\ s_{10}^{(7)} = (- 2 h_{6} - 2 h_{5} - 2 h_{4} + 12 h_{3} + 5 h_{2} - 9 h_{1} - 2 h_{0}) / 14, s_{11}^{(7)} = (2 h_{6} + 3 h_{5} - h_{4} - 2 h_{3} + 3 h_{2} - h_{1}) / 6, \\ s_{12}^{(7)} = (3 h_{6} - 11 h_{5} - 4 h_{4} + 10 h_{3} + 3 h_{2} - 11 h_{1} + 10 h_{0}) / 14, s_{13}^{(7)} = (- 2 h_{3} + 3 h_{2} - h_{1}) / 6, \\ s_{14}^{(7)} = (3 h_{6} - h_{5} - 2 h_{3} + 3 h_{2} - h_{1} - 2 h_{0}) / 6, s_{15}^{(7)} = (- h_{6} + h_{4} - h_{3} + h_{1}) / 6, s_{16}^{(7)} = (- h_{3} + h_{1}) / 6, \\ s_{17}^{(7)} = (h_{5} - h_{3} + h_{1} - h_{0}) / 6, s_{18}^{(7)} = 2 h_{6} - h_{5} - 2 h_{4} + 3 h_{3} - 2 h_{2} - 2 h_{1} + h_{0}, \\ s_{19}^{(7)} = 2 h_{3} - h_{2} - 2 h_{1} + h_{0}, s_{20}^{(7)} = - h_{6} - 2 h_{5} + h_{4} + 2 h_{3} - h_{2} - 2 h_{1} + 3 h_{0}, s_{21}^{(7)} = h_{6}, \\ s_{22}^{(7)} = h_{4} - h_{5}, s_{23}^{(7)} = h_{6} - h_{5}, s_{24}^{(7)} = h_{5}, s_{25}^{(7)} = h_{6}, \end{matrix}

A_{21 \times 26}^{(7)} = I_{6} \oplus [\begin{array}{c} \begin{matrix} 1 & 1 \\ 1 & 1 \\ 1 & 1 \end{matrix} & 0_{3 \times 6} \\ 0_{2 \times 9} & \begin{matrix} 1 & 1 \\ 1 & 1 \end{matrix} \\ \begin{matrix} - 1 & 1 \\ - 1 & 1 \\ - 1 & 1 \end{matrix} & 0_{3 \times 6} \\ 0_{2 \times 9} & \begin{matrix} - 1 & 1 \\ - 1 & 1 \end{matrix} \end{array}] \oplus I_{5},

\begin{matrix} A_{21}^{(7)} = I_{7} \oplus [\begin{array}{c} \begin{matrix} 1 & 1 \\ 1 \\ 1 \\ 1 \end{matrix} & 0_{4} \\ 0_{4} & \begin{matrix} 1 \\ 1 & 1 \\ 1 \\ 1 \end{matrix} \end{array}] \oplus I_{6}, A_{25 \times 21}^{(7)} = I_{7} \oplus [\begin{array}{c} \begin{matrix} 1 & 1 \\ 1 & 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{matrix} & 0_{6 \times 4} \\ 0_{6 \times 4} & \begin{matrix} 1 \\ 1 \\ 1 & 1 \\ 1 \\ 1 \\ 1 \end{matrix} \end{array}] \oplus I_{6}, \end{matrix}

A_{22 \times 25}^{(7)} = I_{6} \oplus [\begin{array}{c} \begin{array}{c} 1 & 1 \\ 1 & 0_{3} \\ 1 & - 1 \\ 1 & 1 & 1 & 1 \\ 1 \end{array} & 0_{5 \times 7} \\ 0_{6 \times 7} & \begin{array}{c} 1 \\ 1 & 0_{3} \\ 1 \\ 1 & - 1 \\ 1 & 1 & 1 & 1 \\ 1 \end{array} \end{array}] \oplus I_{5},

\begin{matrix} A_{21 \times 22}^{(7)} = I_{6} \oplus [\begin{array}{c} \begin{matrix} 1 \\ - 1 & - 1 & - 1 \\ 1 \\ 1 & 1 \\ 1 \end{matrix} & 0_{5} \\ 0_{5 \times 6} & \begin{matrix} 1 \\ 1 \\ 1 & 1 \\ 1 \\ 1 \end{matrix} \end{array}] \oplus I_{5}, A_{20 \times 21}^{(7)} = I_{10} \oplus [\begin{matrix} 1 & 1 \\ 1 \\ 1 \\ 1 \end{matrix}] \oplus I_{6}, \end{matrix}

A_{15 \times 20}^{(7)} = [\begin{matrix} 1 \\ 1 & 1 \\ 1 & 1 & 1 \end{matrix}] \oplus I_{5} \oplus [\begin{array}{c} \begin{matrix} 1 \\ - 1 & - 1 & - 1 \\ 1 \\ 1 \end{matrix} & 0_{4 \times 5} \\ 0_{3 \times 5} & \begin{matrix} 1 & 1 & 1 \\ 1 & 1 \\ 1 \end{matrix} \end{array}],

A_{13 \times 15}^{(7)} = [\begin{array}{c} I_{3} & 0_{3 \times 4} & 0_{3 \times 5} & 0_{3} \\ 0_{4 \times 3} & \begin{matrix} 1 \\ 1 \\ 1 \\ 1 & 1 \end{matrix} & \begin{matrix} 1 \\ 1 \\ 1 \end{matrix} & \begin{matrix} - 1 \\ - 1 \\ - 1 \end{matrix} \\ \begin{matrix} - 1 & 1 \\ - 1 & 1 \\ - 1 & 1 \end{matrix} & \begin{matrix} 1 \\ 1 & \begin{matrix}  \end{matrix} \end{matrix} & \begin{matrix} 1 \\ 1 \end{matrix} & 0_{3} \\ 0_{3 \times 4} & 0_{3} & 0_{3 \times 5} & I_{3} \end{array}],

Figure 6 shows a data flow diagram of the proposed algorithm for the implementation of 7-point linear convolution.

In terms of arithmetic units, a fully parallel hardware implementation of the processor unit for calculating a 7-point linear convolution will require 27 multipliers, 49 two-input adders, 7 three-input adders, and 5 four-input adders, instead of 49 multipliers, 2 two-input adders, 2 three-input adders, 2 four-input adders, 2 five-input adders, 2 six-input adders, and 1 seven-input adder in the case of completely parallel implementation of expression (20).

3.7. Algorithm for N = 8

Let

X_{8 \times 1} = {[x_{0}, x_{1}, x_{2}, x_{3}, x_{4} . x_{5}, x_{6}, x_{7}]}^{T}

and

H_{8 \times 1} = {[h_{0}, h_{1}, h_{2}, h_{3}, h_{4}, h_{5}, h_{6}, h_{7}]}^{T}

be 8-dimensional data vectors being convolved and

Y_{15 \times 1} = {[y_{0}, y_{1}, y_{2}, y_{3}, y_{4}, y_{5}, y_{6}, y_{7}, y_{8}, y_{9}, y_{10}, y_{11}, y_{12}, y_{13}, y_{14}]}^{T}

be an input vector representing a linear convolution for N = 8.

The problem is to calculate the product:

Y_{15 \times 1} = H_{15 \times 8} X_{8 \times 1},

(22)

where

H_{15 \times 8} = [\begin{matrix} h_{0} \\ h_{1} & h_{0} \\ ⋮ & h_{1} & ⋱ \\ h_{7} & ⋮ & ⋱ & h_{0} \\ h_{7} & ⋱ & h_{1} \\ ⋱ & ⋮ \\ h_{7} \end{matrix}] .

Direct computation of (22) takes 64 multiplications and 49 addition. We developed an algorithm that contains only 27 multiplications and 67 additions. Thus, the proposed algorithm saves 22 multiplications at the cost of 18 extra additions compared to the ordinary matrix-vector multiplication method.

Proposed algorithm for computation 8-point linear convolution with reduced multiplicative complexity can be written using the following matrix-vector calculating procedure:

Y_{15 \times 1} = A_{15}^{(8)} {\hat{A}}_{15}^{(8)} {\overset{˘}{A}}_{15}^{(8)} A_{15 \times 17}^{(8)} A_{17 \times 27}^{(8)} D_{27}^{(8)} A_{27 \times 17}^{(8)} A_{17 \times 15}^{(8)} A_{15 \times 8}^{(8)} X_{8 \times 1}

(23)

where

A_{15 \times 8}^{(8)} = [\begin{array}{c} I_{3} & 0_{3 \times 5} \\ I_{4} & I_{4} \\ I_{4} & - I_{4} \\ 0_{4} & I_{4} \end{array}], A_{17 \times 15}^{(8)} = I_{3} \oplus (H_{2} \otimes I_{2}) \oplus [\begin{array}{c} I_{2} & 0_{2} \\ 0_{2} & I_{2} \\ I_{2} & I_{2} \end{array}] \oplus I_{4},

A_{27 \times 17}^{(8)} = [\begin{matrix} 1 \\ 1 \\ 1 \\ 1 & 1 \\ 1 \end{matrix}] \oplus H_{2} \oplus (I_{4} \otimes [\begin{matrix} 1 \\ 1 \\ 1 & 1 \end{matrix}]) \oplus [\begin{array}{c} 1 \\ 1 & 1 & 1 \\ 1 & - 1 & 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{array}],

D_{27}^{(8)} = d i a g (s_{0}^{(8)}, s_{1}^{(8)}, . . ., s_{27}^{(8)}),

\begin{matrix} s_{0}^{(8)} = h_{0}, s_{1}^{(8)} = h_{2} - h_{1}, s_{2}^{(8)} = h_{0} - h_{1}, s_{3}^{(8)} = h_{1}, s_{4}^{(8)} = h_{0}, \\ s_{5}^{(8)} = \frac{1}{8} (h_{0} + h_{1} + h_{2} + h_{3} + h_{4} + h_{5} + h_{6} + h_{7}), s_{6}^{(8)} = \frac{1}{8} (h_{0} - h_{1} + h_{2} - h_{3} + h_{4} - h_{5} + h_{6} - h_{7}), \\ s_{7}^{(8)} = \frac{1}{4} (- h_{0} + h_{1} + h_{2} - h_{3} - h_{4} + h_{5} + h_{6} - h_{7}), s_{8}^{(8)} = \frac{1}{4} (- h_{0} - h_{1} + h_{2} + h_{3} - h_{4} - h_{5} + h_{6} + h_{7}), \\ s_{9}^{(8)} = \frac{1}{4} (h_{0} - h_{2} + h_{4} - h_{6}), s_{10}^{(8)} = \frac{1}{2} (h_{0} - h_{1} - h_{2} + h_{3} - h_{4} + h_{5} + h_{6} - h_{7}), \\ s_{11}^{(8)} = \frac{1}{2} (h_{0} + h_{1} - h_{2} + h_{3} - h_{4} - h_{5} + h_{6} - h_{7}), s_{12}^{(8)} = \frac{1}{2} (- h_{0} + h_{2} + h_{4} - h_{6}), \\ s_{13}^{(8)} = \frac{1}{2} (h_{0} - h_{1} + h_{2} - h_{3} - h_{4} + h_{5} - h_{6} + h_{7}), s_{14}^{(8)} = \frac{1}{2} (h_{0} - h_{1} + h_{2} + h_{3} - h_{4} + h_{5} - h_{6} - h_{7}), \\ s_{15}^{(8)} = \frac{1}{2} (- h_{0} - h_{2} + h_{4} + h_{6}), s_{16}^{(8)} = \frac{1}{2} (- h_{0} + h_{1} + h_{4} - h_{5}), s_{17}^{(8)} = \frac{1}{2} (- h_{0} - h_{3} + h_{4} + h_{7}), \\ s_{18}^{(8)} = \frac{1}{2} (h_{0} - h_{4}), s_{19}^{(8)} = h_{4} - h_{6}, s_{20}^{(8)} = \frac{1}{2} (h_{5} + h_{6}), s_{21}^{(8)} = \frac{1}{2} (h_{6} - h_{5}), s_{22}^{(8)} = h_{5} - h_{7}, \\ s_{23}^{(8)} = h_{7}, s_{24}^{(8)} = h_{6}, s_{25}^{(8)} = h_{7}, s_{26}^{(8)} = h_{7}, \end{matrix}

\begin{matrix} A_{17 \times 27}^{(8)} = [\begin{matrix} 1 \\ 1 & 1 \\ 1 & 1 & 1 \end{matrix}] \oplus H_{2} \oplus (I_{4} \otimes [\begin{matrix} 1 & 1 \\ 1 & 1 \end{matrix}]) \oplus [\begin{matrix} 1 & 1 & 1 & 1 \\ 1 & - 1 & - 1 \end{matrix}] \oplus [\begin{matrix} 1 & 1 \\ 1 \end{matrix}], \end{matrix}

A_{15 \times 17}^{(8)} = I_{3} \oplus (H_{2} \otimes I_{2}) \oplus [\begin{array}{c} 0_{2} & I_{2} & I_{2} \\ I_{2} & 0_{2} & I_{2} \end{array}] \oplus I_{4}, {\overset{˘}{A}}_{15}^{(8)} = I_{3} \oplus (H_{2} \otimes I_{4}) \oplus I_{4},

{\hat{A}}_{15}^{(8)} = I_{3} \oplus [\begin{matrix} 0_{5 \times 3} & I_{5} \\ I_{3} & 0_{3 \times 5} \end{matrix}] \oplus I_{4}, A_{15}^{(8)} = [\begin{array}{c} I_{7} & 0_{3 \times 4} \oplus (- I_{4}) \\ [\begin{matrix} - 1 \\ - 1 \\ - 1 \end{matrix}] \oplus I_{4} & I_{8} \end{array}] .

Figure 7 shows a data flow diagram of the proposed algorithm for the implementation of 8-point linear convolution.

In terms of arithmetic units, a fully parallel hardware implementation of the processor unit for calculating a 8-point linear convolution will require 27 multipliers, 57 two-input adders, 4 three-input adders and 1 four-input adder, instead of 64 multipliers, 2 two-input adders, 2 three-input adders, 2 four-input adders, 2 five-input adders, 2 six-input adders, 2 seven-point adders and 1 eight-input adder in the case of completely parallel implementation of expression (22).

4. Implementation Complexity

Since the lengths of the input sequences are relatively small, and the data flow graphs representing the organization of the computation process are fairly simple, it is easy to estimate the implementation complexity of the proposed solutions. Table 1 shows estimates of the number of arithmetic blocks for the fully parallel implementation of the short lengths linear convolution algorithms. Since a parallel N-input adder consists of N-1 two-input adders, we give integrated estimates of the implementing costs of the sets of adders for each proposed solution expressed as the sums of two-input adders. The penultimate column of the Table 1 shows the percentage reduction in the number of multipliers, while the last column shows the percentage increase in the number of adders. As you can see, the implementation of the proposed algorithms requires fewer multipliers than the implementation based on naive methods of performing the linear convolution operations.

It should be noted that our solutions are primarily focused on efficient implementation in application specific integrated circuits (ASICs). In low-power designing low-power digital circuits, optimization must be performed both at the algorithmic level and at the logic level. From the point of view of designing an ASIC-chip that implements fast linear convolution, you should pay attention to the fact that the hardwired multiplier is a very resource-intensive arithmetic unit. The multiplier is also the most energy-intensive arithmetic unit, occupying a large crystal area [26] and dissipating a lot of energy [27]. Reducing the number of multipliers is especially important in the design of specialized fully parallel ASIC-based processors because minimizing the number of necessary multipliers reduces power dissipation and lowers the cost implementation of the entire system being implemented. It is proved that the implementation complexity of a hardwired multiplier grows quadratically with operand size, while the hardware complexity of a binary adder increases linearly with operand size [28]. Therefore, a reduction in the number of multipliers, even at the cost of a small increase in the number of adders, has a significant role in the ASIC-based implementation of the algorithm. Thus, it can be argued categorically that algorithmic solutions that require fewer hardware multipliers in an ASIC-based implementation are better than those that require more embedded multipliers.

This statement is also true for field-programmable gate array (FPGA)-based implementations. Most modern high-performance FPGAs contain a number of built-in multipliers. This means that instead of implementing the multipliers with a help of a set of conventional logic gates, you can use the hardwired multipliers embedded in the FPGA. Thus, all multiplications contained in a fully parallel algorithm can be efficiently implemented using these embedded multipliers; however, their number may not be enough to meet the requirements of a fully parallel implementation of the algorithm. So, the developer uses the embedded multipliers to implement the multiplication operations until all of the multipliers built into the chip have been used. If the embedded multipliers available in the FPGA run out, the developer will be forced to use ordinary logic gates instead. This will lead to significant difficulties in the design and implementation of the computing unit. Therefore, the problem of reducing the number of multiplications in fully parallel hardware-oriented algorithms is critical. It is clear that you can go the other way—use a more complex FPGA chip from the same or another family, which contains a larger number of embedded multipliers; however, it should be remembered that the hardwired multiplier is a very resource-intensive unit. The multiplier is the most resource-intensive and energy-consuming arithmetic unit, occupying a large area of the chip and dissipating a lot of power; therefore, the use of complex and resource-intensive FPGAs containing a large number of multipliers without a special need is impractical.

Table 2 shows FPGA devices of the Spartan-3 family, in which the number of hardwired multipliers allows one to implement the linear convolution operation in a single chip. So, for example, a 4-point convolution implemented using our proposed algorithm can be implemented using a single Spartan XC3S200 device, while a 4-point convolution implemented using a naive method requires a more voluminous Spartan XC3S400 device. A 5-point convolution implemented using our proposed algorithm can be implemented using a single Spartan XC3S200A chip, while a 5-point convolution implemented using a naive method requires a more voluminous Spartan XC3S1500A chip, and so on.

Thus, the hardware implementation of our algorithms requires fewer hardware multipliers than the implementation of naive calculation methods, all other things being equal. Taking into account the previously listed arguments, this proves their effectiveness.

5. Conclusions

In this paper, we analyzed possibilities to reduce the multiplicative complexity of calculating the linear convolutions for small length input sequences. We also synthesized new algorithms for implementing these operations for N = 3, 4, 5, 6, 7, and 8. Using these algorithms reduces the computational complexity of linear convolution, thus reducing its hardware implementation complexity too. In addition, as can be seen from Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6 and Figure 7, the proposed algorithms have a pronounced parallel modular structure. This simplifies the mapping of the algorithms into an ASIC structure and unifies its implementation in FPGAs. Thus, the acceleration of computations during the implementation of these algorithms can also be achieved due to the parallelization of the computation processes.

Author Contributions

Conceptualization, A.C.; methodology, A.C.; validation, J.P.P.; formal analysis, A.C. and J.P.P.; writing—original draft preparation, A.C.; writing—review and editing, J.P.P.; visualization, A.C. and J.P.P.; supervision, A.C. and J.P.P. Both authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

Blahut, R.E. Fast Algorithms for Signal Processing; Cambridge University Press: Cambridge, UK, 2010. [Google Scholar]
Tolimieri, R.; An, M.; Lu, C. Algorithms for Discrete Fourier Transform and Convolution; Springer Science & Business Media: New York, NY, USA, 1989. [Google Scholar]
McClellan, J.H.; Rader, C.M. Number Theory in Digital Signal Processing; Prentice-Hall: Englewood Cliffs, NJ, USA, 1979. [Google Scholar]
Berg, L.; Nussbaumer, H. Fast Fourier Transform and Convolution Algorithms. Z. fur Angew. Math. Und Mech. 1982, 62, 282. [Google Scholar] [CrossRef]
Burrus, C.S.; Parks, T. Convolution Algorithms; John Wiley and Sons: New York, NY, USA, 1985. [Google Scholar]
Krishna, H. Digital Signal Processing Algorithms: Number Theory, Convolution, Fast Fourier Transforms, and Applications; Routledge: New York, NY, USA, 2017. [Google Scholar]
Bi, G.; Zeng, Y. Transforms and Fast Algorithms for Signal Analysis and Representations; Springer Science & Business Media: New York, NY, USA, 2004. [Google Scholar]
Mallat, S.G. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell. 1989, 11, 674–693. [Google Scholar] [CrossRef] [Green Version]
Parhi, K.K. VLSI Digital Signal Processing Systems: Design and Implementation; John Wiley & Sons: New York, NY, USA, 2007. [Google Scholar]
Chan, Y.H.; Siu, W.C. General approach for the realization of DCT/IDCT using convolutions. Signal Process. 1994, 37, 357–363. [Google Scholar] [CrossRef]
Huang, T.S. Two-dimensional digital signal processing II. Transforms and median filters. In Two-Dimensional Digital Signal Processing II. Transforms and Median Filters; Topics in Applied Physics; Springer: Berlin/Heidelberg, Germany, 1981; Volume 43. [Google Scholar]
Pratt, W.K. Digital Image Processing; John Wiley & Sons: Hoboken, NJ, USA, 2007. [Google Scholar]
Wolberg, G.; Massalin, H. A fast algorithm for digital image scaling. In Communicating with Virtual Worlds; Springer: Tokyo, Japan, 1993; pp. 526–539. [Google Scholar]
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
Mathieu, M.; Henaff, M.; LeCun, Y. Fast training of convolutional networks through ffts. arXiv 2013, arXiv:1312.5851. [Google Scholar]
Lin, S.; Liu, N.; Nazemi, M.; Li, H.; Ding, C.; Wang, Y.; Pedram, M. FFT-based deep learning deployment in embedded systems. In Proceedings of the 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, Germany, 19–23 March 2018; pp. 1045–1050. [Google Scholar]
Abtahi, T.; Shea, C.; Kulkarni, A.; Mohsenin, T. Accelerating convolutional neural network with fft on embedded hardware. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2018, 26, 1737–1749. [Google Scholar] [CrossRef]
Lavin, A.; Gray, S. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4013–4021. [Google Scholar]
Wang, X.; Wang, C.; Zhou, X. Work-in-progress: WinoNN: Optimising FPGA-based neural network accelerators using fast winograd algorithm. In Proceedings of the 2018 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ ISSS), Turin, Italy, 30 September–5 October 2018; pp. 1–2. [Google Scholar]
Cariow, A.; Cariowa, G. Minimal Filtering Algorithms for Convolutional Neural Networks. arXiv 2020, arXiv:2004.05607. [Google Scholar]
Wang, Y.; Parhi, K. Explicit Cook-Toom algorithm for linear convolution. In Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, Istanbul, Turkey, 5–9 June 2000; Volume 6, pp. 3279–3282. [Google Scholar]
Ju, C.; Solomonik, E. Derivation and Analysis of Fast Bilinear Algorithms for Convolution. arXiv 2019, arXiv:1910.13367. [Google Scholar]
Cariow, A. Strategies for the Synthesis of Fast Algorithms for the Computation of the Matrix-vector Products. J. Signal Process. Theory Appl. 2014, 3, 1–19. [Google Scholar] [CrossRef]
Regalia, P.A.; Sanjit, M.K. Kronecker products, unitary matrices and signal processing applications. SIAM Rev. 1989, 31, 586–613. [Google Scholar] [CrossRef]
Granata, J.; Conner, M.; Tolimieri, R. The tensor product: A mathematical programming language for FFTs and other fast DSP operations. IEEE Signal Process. Mag. 1992, 9, 40–48. [Google Scholar] [CrossRef]
Wong, H.; Betz, V.; Rose, J. Comparing FPGA vs. custom CMOS and the impact on processor microarchitecture. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Monterey, CA, USA, 27 February–1 March 2011; pp. 5–14. [Google Scholar]
Jhamb, M.; Lohani, H. Design, implementation and performance comparison of multiplier topologies in power-delay space. Eng. Sci. Technol. Int. J. 2016, 19, 355–363. [Google Scholar] [CrossRef] [Green Version]
Oudjida, A.K.; Chaillet, N.; Berrandjia, M.L.; Liacha, A. A new high radix-2r (r ≥ 8) multibit recoding algorithm for large operand size (N ≥ 32) multipliers. J. Low Power Electron. 2013, 9, 50–62. [Google Scholar] [CrossRef] [Green Version]

Figure 1. The signal flow graph of the proposed algorithm for computation of 2-point linear convolution.

Figure 2. The signal flow graph of the proposed algorithm for computation of 3-point linear convolution.

Figure 3. The signal flow graph of the proposed algorithm for computation of 4-point linear convolution.

Figure 4. The signal flow graph of the proposed algorithm for computation of 5-point linear convolution.

Figure 5. The signal flow graph of the proposed algorithm for computation of 6-point linear convolution.

Figure 6. The signal flow graph of the proposed algorithm for computation of 7-point linear convolution.

Figure 7. The signal flow graph of the proposed algorithm for computation of 8-point linear convolution.

Table 1. Implementation complexities of naïve method and proposed solutions.

Length N	Number of Arithmetical Units (Multipliers — “×” and Adders — “+”)
Length N	Naïve Method		Proposed Solutions		Percentage Stimate
	“×”	“+”	“×”	“+”	“×"	“+”
2	4	1	3	3	25%	66.7%
3	9	4	6	10	33.3%	60%
4	16	9	9	20	43.8%	55%
5	25	16	16	38	57.9%	66%
6	36	25	16	44	55.6%	43.2%
7	49	36	26	79	47%	54.4%
8	64	49	27	67	58%	26.9%

Table 2. The possibility of implementation the naive method and proposed solution on the field-programmable gate array (FPGA) devices of the Spartan-3 family.

Length N	Features of the Implementation in Spartan-3 Family Devices
Length N	Naïve Method	Proposed Solutions
	Type of Device	Type of Device
2	XC3S50	XC3S50AN
3	XC3S200	XC3S200
4	XC3S400	XC3S200
5	XC3S1400AN	XC3S200AN
6	XC3S2000	XC3S200
7	XC3S4000	XC3S1400AN

Sample Availability: Samples of the compounds are available from the authors.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Cariow, A.; Paplinski, J.P. Some Algorithms for Computing Short-Length Linear Convolution. Electronics 2020, 9, 2115. https://doi.org/10.3390/electronics9122115

AMA Style

Cariow A, Paplinski JP. Some Algorithms for Computing Short-Length Linear Convolution. Electronics. 2020; 9(12):2115. https://doi.org/10.3390/electronics9122115

Chicago/Turabian Style

Cariow, Aleksandr, and Janusz P. Paplinski. 2020. "Some Algorithms for Computing Short-Length Linear Convolution" Electronics 9, no. 12: 2115. https://doi.org/10.3390/electronics9122115

APA Style

Cariow, A., & Paplinski, J. P. (2020). Some Algorithms for Computing Short-Length Linear Convolution. Electronics, 9(12), 2115. https://doi.org/10.3390/electronics9122115

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Some Algorithms for Computing Short-Length Linear Convolution

Abstract

1. Introduction

2. Preliminaries

3. Algorithms for Short-Length Linear Convolution

3.1. Algorithm for N = 2

3.2. Algorithm for N = 3

3.3. Algorithm for N = 4

3.4. Algorithm for N = 5

3.5. Algorithm for N = 6

3.6. Algorithm for N = 7

3.7. Algorithm for N = 8

4. Implementation Complexity

5. Conclusions

Author Contributions

Funding

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI