A Faster Algorithm for Reducing the Computational Complexity of Convolutional Neural Networks

Abstract: Convolutional neural networks have achieved remarkable improvements in image and video recognition but incur a heavy computational burden. To reduce the computational complexity of a convolutional neural network, this paper proposes an algorithm that combines the Winograd minimal filtering algorithm with the Strassen algorithm. Theoretical assessment shows that the proposed algorithm can dramatically reduce computational complexity. Furthermore, the Visual Geometry Group (VGG) network is employed to evaluate the algorithm in practice. The results show that the proposed algorithm achieves the best performance by combining the savings of the two algorithms, reducing runtime by 75% compared with the conventional algorithm.


Introduction
Deep convolutional neural networks have achieved remarkable improvements in image and video processing [1][2][3]. However, the computational complexity of these networks has also increased significantly. Since the prediction process of the networks used in real-time applications requires very low latency, the heavy computational burden is a major problem with these systems. Detecting faces from video imagery is still a challenging task [4,5]. The success of convolutional neural networks in these applications is limited by their heavy computational burden.
There have been a number of studies on improving the efficiency of convolutional neural networks. Denil et al. [6] indicate that there are significant redundancies in the parameterizations of neural networks. Han et al. [7] and Guo et al. [8] use certain training strategies to compress neural network models without significantly weakening their performance. Some researchers [9][10][11] have found that low-precision computation is sufficient for these networks. Binary/Ternary networks [12,13] restrict the parameters to two or three values. Zhang et al. [14] used low-rank approximation to reconstruct the convolution matrix, which reduces the complexity of convolution. These algorithms are effective in accelerating computation in the network, but they also degrade accuracy. The Fast Fourier Transform (FFT) is also useful in reducing the computational complexity of convolutional neural networks without losing accuracy [15,16], but it is only effective for networks with large kernels, whereas convolutional neural networks tend to use small kernels because they achieve better accuracy than networks with larger kernels [1]. For these reasons, there is a demand for an algorithm that can improve the efficiency of networks with small kernels.
In this paper, we present an algorithm based on the minimal filtering algorithm, which was proposed by Toom [17] and Cook [18] and generalized by Winograd [19]. The minimal filtering algorithm can reduce the computational complexity of each convolution in the network without losing accuracy. However, the computational complexity is still too large for real-time requirements. To further reduce the computational complexity of these networks, we also utilize the Strassen algorithm, which reduces the number of convolutions in the network. Moreover, we evaluate our algorithm with the Visual Geometry Group (VGG) network. Experimental results show that it can save 75% of the time spent on computation when the batch size is 32.
The rest of this paper is organized as follows. Section 2 reviews related work on convolutional neural networks, the Winograd algorithm and the Strassen algorithm. The proposed algorithm is presented in Section 3. Several simulations are included in Section 4, and the work is concluded in Section 5.

Convolutional Neural Networks
Machine-learning has produced impressive results in many signal processing applications [20,21]. Convolutional neural networks extend the machine-learning capabilities of neural networks by introducing convolutional layers to the network. Convolutional neural networks are mainly used in image processing. Figure 1 shows the structure of a classical convolutional neural network, LeNet. It consists of two convolutional layers, two subsampling layers and three fully connected layers. Usually, the convolutional layers account for most of the network's computation.
Convolutional layers extract features from the input feature maps via different kernels. Suppose there are Q input feature maps of size Mx × Nx and R output feature maps of size My × Ny. The size of the convolutional kernel is Mw × Nw. The computation of the output in a single layer is given by Equation (1), where X is the input feature map, Y is the output feature map, and W is the kernel. The subscripts x and y indicate the position of the pixel in the feature map. The subscripts u and v indicate the position of the parameter in the kernel. Equation (1) can be rewritten as Equation (2).
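Equation (1) itself was lost from this copy; a reconstruction consistent with the notation above (the standard convolutional-layer sum over input channels and kernel positions) is:

```latex
Y^{r}_{y,x} \;=\; \sum_{q=1}^{Q}\,\sum_{u=1}^{M_w}\,\sum_{v=1}^{N_w}
W^{r,q}_{u,v}\, X^{q}_{\,y+u-1,\;x+v-1},
\qquad 1 \le y \le M_y,\; 1 \le x \le N_y .
```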
Suppose there are P images that are sent together to the neural network, which means the batch size is P. Then the output Y in Equation (2) can be expressed by Equation (3).
If we regard y_{r,p}, w_{r,q} and x_{q,p} as the elements of the matrices Y, W and X, respectively, the output can be expressed as the convolution matrix in Equation (4).
Matrix Y and matrix X are special matrices of feature maps. Matrix W is a special matrix of kernels. This convolutional matrix provides a new view of the computation of the output Y.
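To make Equations (3) and (4) concrete, the following is a minimal pure-Python sketch (sizes and values are illustrative only, not from the paper): each output map Y[r][p] is the sum over input channels q of the 2-D convolution of kernel W[r][q] with input map X[q][p].

```python
# Toy sketch of Equations (3) and (4): Y[r][p] = sum_q conv2d(X[q][p], W[r][q]).
# Illustrative sizes: Q = 2 input maps, R = 2 output maps, P = 1 image.

def conv2d_valid(x, w):
    """'Valid' 2-D convolution (CNN-style correlation) of map x with kernel w."""
    mw, nw = len(w), len(w[0])
    my, ny = len(x) - mw + 1, len(x[0]) - nw + 1
    return [[sum(x[i + u][j + v] * w[u][v]
                 for u in range(mw) for v in range(nw))
             for j in range(ny)] for i in range(my)]

def add_maps(a, b):
    """Element-wise sum of two equally-sized feature maps."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def conv_layer(X, W):
    """X[q][p]: Q x P input maps; W[r][q]: R x Q kernels -> R x P output maps."""
    Q, P, R = len(X), len(X[0]), len(W)
    Y = [[None] * P for _ in range(R)]
    for r in range(R):
        for p in range(P):
            acc = conv2d_valid(X[0][p], W[r][0])
            for q in range(1, Q):
                acc = add_maps(acc, conv2d_valid(X[q][p], W[r][q]))
            Y[r][p] = acc
    return Y
```

Viewing Y, W and X as R × P, R × Q and Q × P matrices whose entries are feature maps and kernels is exactly what later allows the Strassen algorithm to be applied to this computation.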

Winograd Algorithm
We denote an r-tap FIR filter with m outputs as F(m, r). The conventional algorithm for F(2, 3) is shown in Equation (6), where d0, d1, d2 and d3 are the inputs of the filter, and h0, h1 and h2 are the parameters of the filter. As Equation (6) shows, it uses 6 multiplications and 4 additions to compute F(2, 3).
If we use the minimal filtering algorithm [19] to compute F(m, r), it requires only (m + r − 1) multiplications. The process of the algorithm for computing F(2, 3) is shown in Equations (7)–(11).
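Equations (7)–(11) were lost from this copy; the standard minimal filtering identities for F(2, 3), as given by Winograd [19] and used in [22], are:

```latex
\begin{aligned}
m_1 &= (d_0 - d_2)\,h_0, &
m_2 &= (d_1 + d_2)\,\frac{h_0 + h_1 + h_2}{2},\\
m_3 &= (d_2 - d_1)\,\frac{h_0 - h_1 + h_2}{2}, &
m_4 &= (d_1 - d_3)\,h_2,\\
y_0 &= m_1 + m_2 + m_3, &
y_1 &= m_2 - m_3 - m_4 .
\end{aligned}
```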
The computation can be written in matrix form as Equation (12).
We substitute A, G and B for the matrices in Equation (12). Equation (12) can then be rewritten as Equation (13).
In Equation (13), • indicates element-wise multiplication, and the superscript T indicates the transpose operator. A, G and B are defined in Equation (14). We can see from Equations (7)–(11) that the whole process needs 4 multiplications. However, it also needs 4 additions to transform the data, 3 additions and 2 multiplications by a constant to transform the filter, and 4 additions to transform the final result. (To compare complexity easily, we count a multiplication by a constant as an addition.) The 2-dimensional filter F(m × m, r × r) can be generalized from the filter F(m, r), as in Equation (15) [22].
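Equation (14) is likewise missing here; the standard transform matrices for F(2, 3) (see, e.g., [22]) are:

```latex
B^{T} = \begin{bmatrix} 1 & 0 & -1 & 0\\ 0 & 1 & 1 & 0\\ 0 & -1 & 1 & 0\\ 0 & 1 & 0 & -1 \end{bmatrix},\qquad
G = \begin{bmatrix} 1 & 0 & 0\\ \tfrac12 & \tfrac12 & \tfrac12\\ \tfrac12 & -\tfrac12 & \tfrac12\\ 0 & 0 & 1 \end{bmatrix},\qquad
A^{T} = \begin{bmatrix} 1 & 1 & 1 & 0\\ 0 & 1 & -1 & -1 \end{bmatrix}.
```

With these matrices, B^T d costs the 4 data-transform additions, Gh the 3 additions and 2 constant multiplications, and A^T m the 4 output additions counted above.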
F(2 × 2, 3 × 3) needs 4 × 4 = 16 multiplications, 32 additions to transform data, 28 additions to transform the filter, and 24 additions to transform the final result. The conventional algorithm needs 36 multiplications to calculate the result. This algorithm can reduce the number of multiplications from 36 to 16.
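As a small executable check of the F(2, 3) scheme (a pure-Python sketch, not the paper's implementation), four multiplications reproduce the six-multiplication direct filter:

```python
def f23(d, h):
    """Winograd minimal filtering F(2, 3): two FIR outputs with 4 multiplications."""
    d0, d1, d2, d3 = d
    h0, h1, h2 = h
    m1 = (d0 - d2) * h0                  # data transform: 4 additions in total
    m2 = (d1 + d2) * (h0 + h1 + h2) / 2  # filter transform: 3 adds, 2 scalings
    m3 = (d2 - d1) * (h0 - h1 + h2) / 2
    m4 = (d1 - d3) * h2
    return [m1 + m2 + m3, m2 - m3 - m4]  # output transform: 4 additions

def f23_direct(d, h):
    """Conventional F(2, 3): 6 multiplications and 4 additions."""
    return [d[0] * h[0] + d[1] * h[1] + d[2] * h[2],
            d[1] * h[0] + d[2] * h[1] + d[3] * h[2]]
```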
F(2 × 2, 3 × 3) can be used to compute the convolutional layer with 3 × 3 kernels. Each input feature map can be divided into smaller feature maps in order to use Equation (15). If we substitute U = GwG^T and V = B^T xB, then Equation (3) can be rewritten as Equation (16).
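A runnable sketch of this building block (pure Python, with the standard transform matrices from [22]; tile values below are illustrative): F(2 × 2, 3 × 3) is computed as Y = A^T[(GwG^T) • (B^T xB)]A with 16 element-wise multiplications, and checked against the 36-multiplication direct computation.

```python
# Standard F(2x2, 3x3) transform matrices (see, e.g., [22]).
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G  = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def winograd_2x2_3x3(x, w):
    """2x2 output tile from a 4x4 input tile x and a 3x3 kernel w."""
    U = matmul(matmul(G, w), transpose(G))    # transformed kernel, 4x4
    V = matmul(matmul(BT, x), transpose(BT))  # transformed data, 4x4
    M = [[U[i][j] * V[i][j] for j in range(4)] for i in range(4)]  # 16 mults
    return matmul(matmul(AT, M), transpose(AT))

def direct_2x2_3x3(x, w):
    """Conventional computation of the same tile: 36 multiplications."""
    return [[sum(x[i + u][j + v] * w[u][v] for u in range(3) for v in range(3))
             for j in range(2)] for i in range(2)]
```

Precomputing U once per kernel and V once per input tile is exactly the separation of transforms exploited by Algorithm 1 in Section 3.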

Strassen Algorithm
Suppose there are two matrices A and B, and matrix C is the product of A and B. The numbers of rows and columns of A, B and C are even. We can partition A, B and C into block matrices of equal size, as shown in Equation (17). According to the conventional matrix multiplication algorithm, we then have Equation (18).
As Equation (18) shows, we need 8 multiplications and 4 additions to compute matrix C. The Strassen algorithm can be used to reduce the number of multiplications [23]. The process of the Strassen algorithm is shown in Equations (19)–(25), where I, II, III, IV, V, VI and VII are temporary matrices. The whole process requires 7 multiplications and 18 additions, reducing the number of multiplications from 8 to 7 without changing the computational results. More multiplications can be saved by applying the Strassen algorithm recursively, as long as the numbers of rows and columns of the submatrices are even. With N recursions of the Strassen algorithm, 1 − (7/8)^N of the multiplications are saved. The Strassen algorithm is suitable for the special convolutional matrix in Equation (4) [24]. Therefore, we can use the Strassen algorithm to handle a convolutional matrix.
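One level of the Strassen recursion, written out for scalar entries (a minimal sketch; these seven products are the standard Strassen identities [23], and may not match the exact I–VII ordering of the paper's lost Equations (19)–(25)):

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 multiplications instead of 8."""
    a11, a12, a21, a22 = A[0][0], A[0][1], A[1][0], A[1][1]
    b11, b12, b21, b22 = B[0][0], B[0][1], B[1][0], B[1][1]
    p1 = (a11 + a22) * (b11 + b22)   # 7 products, 18 additions in total
    p2 = (a21 + a22) * b11
    p3 = a11 * (b12 - b22)
    p4 = a22 * (b21 - b11)
    p5 = (a11 + a12) * b22
    p6 = (a21 - a11) * (b11 + b12)
    p7 = (a12 - a22) * (b21 + b22)
    return [[p1 + p4 - p5 + p7, p3 + p5],
            [p2 + p4, p1 - p2 + p3 + p6]]
```

In the full algorithm each a_ij and b_ij is itself a submatrix, and the scheme is applied recursively as long as the submatrix dimensions remain even.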

Proposed Algorithm
As we can see from Section 2.2, the Winograd algorithm incurs extra additions. To avoid repeating the transforms of W and X in Equation (16), we calculate the matrices U and V separately, which reduces the number of additions incurred by the algorithm. The practical implementation is listed in Algorithm 1. The calculation of the output M in Algorithm 1 accounts for most of the multiplication complexity of the whole computation process. To reduce the computational complexity of the output M, we can use the Strassen algorithm. Before using the Strassen algorithm, we need to reform the expression of M as follows.
The output M in Algorithm 1 can be written as the equation where U_{r,q} and V_{q,p} are temporary matrices, and A is the constant parameter matrix. To present the equation compactly, we omit matrix A here. (Matrix A is not omitted in the actual implementation of the algorithm.) The output M can then be written as shown in Equation (31).
We denote three special matrices M, U and V, whose elements are M_{r,p}, U_{r,q} and V_{q,p}, respectively, as shown in Equation (33). The output M can then be written as the product of matrix U and matrix V.
In this case, we can partition the matrices M, U and V into equal-sized block matrices, and then use the Strassen algorithm to reduce the number of multiplications between U_{r,q} and V_{q,p}. The multiplication in the Strassen algorithm is redefined as the element-wise multiplication of matrices U_{r,q} and V_{q,p}. We name this new combination the Strassen-Winograd algorithm. To compare theoretically the computational complexity of the conventional algorithm, Strassen algorithm, Winograd algorithm and Strassen-Winograd algorithm, we list the complexity of multiplication and addition in Table 1. The output feature map size is set to 64 × 64, and the kernel size is set to 3 × 3.

Table 1. Theoretical computational complexity of the four algorithms (output feature map 64 × 64, kernel 3 × 3).

Matrix Size | Conventional Mul | Conventional Add | Strassen Mul | Strassen Add | Winograd Mul | Winograd Add | Strassen-Winograd Mul | Strassen-Winograd Add
64          | 9.66 × 10⁹       | 9.65 × 10⁹       | 4.34 × 10⁹   | 6.65 × 10⁹   | 4.29 × 10⁹   | 7.63 × 10⁹   | 1.93 × 10⁹            | 9.37 × 10⁹
128         | 7.73 × 10¹⁰      | 7.72 × 10¹⁰      | 3.04 × 10¹⁰  | 4.68 × 10¹⁰  | 3.44 × 10¹⁰  | 6.06 × 10¹⁰  | 1.35 × 10¹⁰           | 7.19 × 10¹⁰
256         | 6.18 × 10¹¹      | 6.18 × 10¹¹      | 2.13 × 10¹¹  | 3.29 × 10¹¹  | 2.75 × 10¹¹  | 4.83 × 10¹¹  | 9.45 × 10¹⁰           | 5.55 × 10¹¹
512         | 4.95 × 10¹²      | 4.95 × 10¹²      | 1.49 × 10¹²  | 2.31 × 10¹²  | 2.20 × 10¹²  | 3.86 × 10¹²  | 6.61 × 10¹¹           | 4.29 × 10¹²

We can see from Table 1 that, although the algorithms incur more additions when the matrix size is small, the number of extra additions is less than the number of saved multiplications. Moreover, a multiplication usually costs more time than an addition. Hence the three algorithms are all theoretically effective in reducing the computational complexity. Figure 2 shows a comparison of the computational complexity ratios. The Strassen algorithm shows less reduction in multiplications when the matrix size is small, but it incurs fewer additions. The Winograd algorithm shows a stable performance, and the number of additions it incurs slightly decreases as the matrix size increases. For small matrices, the Strassen-Winograd algorithm shows a much better reduction in multiplication complexity than the Strassen algorithm.
Although it incurs more additions, the number of extra additions is much less than the number of saved multiplications. The Strassen-Winograd algorithm shows performance similar to the Winograd algorithm: when the matrix size is small, the Winograd algorithm performs slightly better, whereas the Strassen-Winograd algorithm and the Strassen algorithm perform much better as the matrix size increases.
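A toy sketch of the combination (illustrative only, not the paper's exact Algorithm 1): the Strassen identities remain valid when "multiplication" is redefined as the element-wise product of Winograd-domain tiles, since tiles under element-wise product and addition form a commutative ring. Here U and V are 2 × 2 block matrices of small tiles, and M = UV is computed with 7 element-wise tile products instead of 8.

```python
def ew(u, v):
    """Element-wise (Hadamard) product of two tiles."""
    return [[a * b for a, b in zip(ru, rv)] for ru, rv in zip(u, v)]

def add(u, v):
    return [[a + b for a, b in zip(ru, rv)] for ru, rv in zip(u, v)]

def sub(u, v):
    return [[a - b for a, b in zip(ru, rv)] for ru, rv in zip(u, v)]

def strassen_winograd(U, V):
    """Block product M[r][p] = sum_q U[r][q] (.) V[q][p] via 7 tile products."""
    U11, U12, U21, U22 = U[0][0], U[0][1], U[1][0], U[1][1]
    V11, V12, V21, V22 = V[0][0], V[0][1], V[1][0], V[1][1]
    p1 = ew(add(U11, U22), add(V11, V22))
    p2 = ew(add(U21, U22), V11)
    p3 = ew(U11, sub(V12, V22))
    p4 = ew(U22, sub(V21, V11))
    p5 = ew(add(U11, U12), V22)
    p6 = ew(sub(U21, U11), add(V11, V12))
    p7 = ew(sub(U12, U22), add(V21, V22))
    return [[add(sub(add(p1, p4), p5), p7), add(p3, p5)],
            [add(p2, p4), add(add(sub(p1, p2), p3), p6)]]
```

In the full algorithm, each tile would be a Winograd-transformed 4 × 4 block U_{r,q} or V_{q,p}, the output transform A would be applied afterwards, and the block partitioning would be recursed while the block counts stay even.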

Simulation Results
Several simulations were conducted to evaluate our algorithm. We compare our algorithm with the Strassen algorithm and the Winograd algorithm, measuring performance by runtime in MATLAB R2013b (CPU: Intel(R) Core(TM) i7-3770K). For objectivity, we apply Equation (18) to the conventional algorithm and use it as a benchmark. All input data x and kernels w are randomly generated. We measure the accuracy of our algorithm by the absolute element error in the output feature maps. As a benchmark, we use the conventional algorithm with double precision data, kernels, intermediate variables and outputs. The other algorithms in this comparison use double precision data and kernels but single precision intermediate variables and outputs.
The VGG network [1] was applied to our simulation. There are nine different convolutional layers in the VGG network. The parameters of the convolutional layer are shown in Table 2. The depth indicates the number of times a layer occurs in the network. Q indicates the number of input feature maps. R indicates the number of output feature maps. Mw and Nw represent the size of the kernel. My and Ny represent the size of the output feature map. The size of the kernel in the VGG network is 3 × 3. We apply F(2 × 2, 3 × 3) to the operation of convolution. For the computation of the output feature map with size My × Ny, the map is partitioned into (My/2) × (Ny/2) sets, each using one computation of F(2 × 2, 3 × 3).

As Table 2 shows, the numbers of rows and columns are not always even, and the matrices are not always square. To solve this problem, we pad a dummy row or column into a matrix whenever we encounter an odd number of rows or columns; the matrix can then continue through the Strassen algorithm. We apply these nine convolutional layers in turn to our simulations. For each convolutional layer, we run the four algorithms with batch sizes from 1 to 32. The runtime consumption of the algorithms is listed in Table 3, and the numerical accuracy of the different algorithms in different layers is shown in Table 4. Table 4 shows that the Winograd algorithm is slightly more accurate than the Strassen algorithm and the Strassen-Winograd algorithm. The maximum element error of these algorithms is 6.16 × 10⁻⁴. Compared with the minimum value of 1.09 × 10³ in the output feature map, the accuracy loss incurred by these algorithms is negligible. As we can see from Section 2, the processes in all of these algorithms theoretically incur no loss in accuracy; in practice, the loss is mainly caused by the single precision data.
Because the conventional algorithm with low precision data is sufficiently accurate for deep learning [10,11], we conclude that the accuracy of our algorithm is equally sufficient.
To compare runtime easily, we use the conventional algorithm as a benchmark, and calculate the saving on runtime displayed by the other algorithms. The result is shown in Figure 3.
The Strassen-Winograd algorithm shows a better performance than the benchmark in all layers except layer1. This is because the number of input feature maps Q in layer1 is three, which limits the performance of the algorithm as a small matrix size incurs more additions. Moreover, odd numbers of rows or columns need dummy rows or columns for matrix partitioning, which causes more runtime.

The performance of the Winograd algorithm is stable from layer2 to layer9. It saves 53% of the runtime on average, which is close to the 56% reduction in multiplications. The performance of the Strassen algorithm and the Strassen-Winograd algorithm improves as the batch size increases. For example, in layer7, when the batch size is 1, we cannot partition the matrix to use the Strassen algorithm, and there is almost no saving in runtime; the Strassen-Winograd algorithm saves 52% of the runtime, a similar saving to the Winograd algorithm. When the batch size is 2, the Strassen algorithm saves 13% of the runtime, which matches the 13% reduction in multiplications, and the Strassen-Winograd algorithm saves 58% of the runtime, which is close to the 61% reduction in multiplications. As the batch size increases, the Strassen algorithm and Strassen-Winograd algorithm can use more recursions, which further reduces the number of multiplications and saves more runtime. When the batch size is 32, the Strassen-Winograd algorithm saves 75% of the runtime, while the Strassen algorithm and Winograd algorithm save 49% and 53%, respectively.
Though experiments with larger batch sizes were not carried out due to limitations on time and memory, we can see the trend in performance as the batch size increases. This is consistent with the theoretical analysis in Section 3. We conclude therefore that the proposed algorithm can provide the optimal performance by combining the savings of these two algorithms.

Conclusions and Future Work
The computational complexity of convolutional neural networks is an urgent problem for real-time applications. Both the Strassen algorithm and Winograd algorithm are effective in reducing the computational complexity without losing accuracy. This paper proposed to combine these algorithms to reduce the heavy computational burden. The proposed strategy was evaluated with the VGG network. Both the theoretical performance assessment and the experimental results show that the Strassen-Winograd algorithm can dramatically reduce the computational complexity.
There remain limitations that need to be addressed in future research. Although the algorithm reduces the computational complexity of convolutional neural networks, the cost is an increased difficulty of implementation, especially in real-time systems and embedded devices. It also increases the difficulty of parallelizing an artificial neural network for hardware acceleration. In future work, we aim to apply this method to hardware accelerators in practical applications.