Algorithms • Article • Open Access

18 October 2018

A Faster Algorithm for Reducing the Computational Complexity of Convolutional Neural Networks

1 Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
2 Key Laboratory of Information Technology for Autonomous Underwater Vehicles, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
3 University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.

Abstract

Convolutional neural networks have achieved remarkable improvements in image and video recognition but incur a heavy computational burden. To reduce the computational complexity of a convolutional neural network, this paper proposes an algorithm that combines the Winograd minimal filtering algorithm with the Strassen algorithm. A theoretical assessment shows that the proposed algorithm can dramatically reduce the computational complexity. Furthermore, the Visual Geometry Group (VGG) network is employed to evaluate the algorithm in practice. The results show that the proposed algorithm combines the savings of the two underlying algorithms and saves 75% of the runtime compared with the conventional algorithm.

1. Introduction

Deep convolutional neural networks have achieved remarkable improvements in image and video processing [1,2,3]. However, the computational complexity of these networks has also increased significantly. Since prediction in real-time applications requires very low latency, the heavy computational burden is a major problem for these systems. For example, detecting faces in video imagery in real time remains a challenging task [4,5], and the success of convolutional neural networks in such applications is limited by their computational cost.
There have been a number of studies on improving the efficiency of convolutional neural networks. Denil et al. [6] show that there are significant redundancies in the parameterizations of neural networks. Han et al. [7] and Guo et al. [8] use training strategies to compress neural network models without significantly weakening their performance. Several works [9,10,11] have found that low-precision computation is sufficient for these networks. Binary and ternary networks [12,13] restrict the parameters to two or three values. Zhang et al. [14] use low-rank approximation to reconstruct the convolution matrix, which reduces the complexity of convolution. These approaches are effective in accelerating computation, but they also degrade accuracy. The fast Fourier transform (FFT) can also reduce the computational complexity of convolutional neural networks without losing accuracy [15,16], but it is only effective for networks with large kernels, whereas convolutional neural networks tend to use small kernels because they achieve better accuracy than networks with larger kernels [1]. For these reasons, there is a demand for an algorithm that accelerates networks with small kernels.
In this paper, we present an algorithm based on the minimal filtering algorithm proposed by Toom [17] and Cook [18] and generalized by Winograd [19]. The minimal filtering algorithm reduces the computational complexity of each convolution in the network without losing accuracy. However, the resulting complexity is still too high for real-time requirements. To further reduce it, we additionally use the Strassen algorithm to reduce the number of convolutions performed across feature maps. We evaluate our algorithm with the Visual Geometry Group (VGG) network; experimental results show that it saves 75% of the computation time when the batch size is 32.
The rest of this paper is organized as follows. Section 2 reviews related work on convolutional neural networks, the Winograd algorithm and the Strassen algorithm. The proposed algorithm is presented in Section 3. Several simulations are included in Section 4, and the work is concluded in Section 5.

3. Proposed Algorithm

As shown in Section 2.2, the Winograd algorithm incurs more additions. To avoid repeating the transforms of W and X in Equation (16), we calculate the matrices U and V separately, which reduces the number of additions incurred by the algorithm. The practical implementation is listed in Algorithm 1. The calculation of the output M in Algorithm 1 accounts for most of the multiplications in the whole computation, so we reduce its complexity with the Strassen algorithm. Before applying the Strassen algorithm, we need to reformulate the expression for M as follows.
The output M in Algorithm 1 can be written as the equation
$$M_{r,p} = \sum_{q=1}^{Q} A^{T} \left( U_{r,q} \odot V_{q,p} \right) A ,$$
where $U_{r,q}$ and $V_{q,p}$ are temporary matrices, $A$ is the constant parameter matrix, and $\odot$ denotes element-wise multiplication. For clarity of presentation, we omit the matrix $A$ below. (Matrix $A$ is not omitted in the actual implementation of the algorithm.) The output M can then be written as in Equation (31).
$$M_{r,p} = \sum_{q=1}^{Q} U_{r,q} \odot V_{q,p} \qquad (31)$$
We define three block matrices M, U and V whose entries are $M_{r,p}$, $U_{r,q}$ and $V_{q,p}$, respectively, as shown in Equation (33). The output M can then be written as the product of the matrices U and V.
$$M = U \times V$$
$$
M = \begin{bmatrix} M_{1,1} & \cdots & M_{1,P} \\ \vdots & \ddots & \vdots \\ M_{R,1} & \cdots & M_{R,P} \end{bmatrix}, \quad
U = \begin{bmatrix} U_{1,1} & \cdots & U_{1,Q} \\ \vdots & \ddots & \vdots \\ U_{R,1} & \cdots & U_{R,Q} \end{bmatrix}, \quad
V = \begin{bmatrix} V_{1,1} & \cdots & V_{1,P} \\ \vdots & \ddots & \vdots \\ V_{Q,1} & \cdots & V_{Q,P} \end{bmatrix} \qquad (33)
$$
In this case, we can partition the matrices M, U and V into equal-sized block matrices and then use the Strassen algorithm to reduce the number of multiplications between the $U_{r,q}$ and $V_{q,p}$. The multiplication in the Strassen algorithm is redefined as the element-wise multiplication of the matrices $U_{r,q}$ and $V_{q,p}$. We call this new combination the Strassen-Winograd algorithm.
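To illustrate this reformulation, the following NumPy sketch applies the Strassen recursion to the block structure of U and V, with the scalar multiplication replaced by the element-wise product of the transformed tiles. It is a minimal sketch under our own assumptions (array layout, function names, and the direct einsum fallback for odd or small dimensions), not the authors' MATLAB implementation.

```python
import numpy as np

def direct_product(U, V):
    # M[r, p] = sum over q of the element-wise product U[r, q] * V[q, p]
    return np.einsum('rqij,qpij->rpij', U, V)

def strassen_winograd_product(U, V):
    """One Strassen-style recursion over the block structure of U and V.

    U: shape (R, Q, t, t), one transformed kernel tile per (output map r, input map q).
    V: shape (Q, P, t, t), one transformed input tile per (input map q, image/batch index p).
    The 'multiplication' of the Strassen algorithm is the element-wise tile product.
    """
    R, Q = U.shape[:2]
    P = V.shape[1]
    if R % 2 or Q % 2 or P % 2 or min(R, Q, P) <= 2:
        return direct_product(U, V)           # fall back for odd or small dimensions

    r, q, p = R // 2, Q // 2, P // 2
    U11, U12, U21, U22 = U[:r, :q], U[:r, q:], U[r:, :q], U[r:, q:]
    V11, V12, V21, V22 = V[:q, :p], V[:q, p:], V[q:, :p], V[q:, p:]

    f = strassen_winograd_product             # 7 block products instead of 8
    M1 = f(U11 + U22, V11 + V22)
    M2 = f(U21 + U22, V11)
    M3 = f(U11, V12 - V22)
    M4 = f(U22, V21 - V11)
    M5 = f(U11 + U12, V22)
    M6 = f(U21 - U11, V11 + V12)
    M7 = f(U12 - U22, V21 + V22)

    M = np.empty((R, P) + U.shape[2:], dtype=U.dtype)
    M[:r, :p] = M1 + M4 - M5 + M7
    M[:r, p:] = M3 + M5
    M[r:, :p] = M2 + M4
    M[r:, p:] = M1 - M2 + M3 + M6
    return M
```

For random U and V with even block dimensions, `strassen_winograd_product(U, V)` agrees with `direct_product(U, V)` up to floating-point rounding, because the Strassen identities hold when the multiplications are element-wise tile products.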
Algorithm 1. Implementation of the Winograd algorithm.
for r = 1 to the number of output maps
    for q = 1 to the number of input maps
        U_{r,q} = G w_{r,q} G^T
    end
end
for p = 1 to batch size
    for q = 1 to the number of input maps
        for k = 1 to the number of image tiles
            V_{q,p,k} = B^T x_{q,p,k} B
        end
    end
end
for p = 1 to batch size
    for r = 1 to the number of output maps
        for j = 1 to the number of image tiles
            M_{r,p,j} = 0
            for q = 1 to the number of input maps
                M_{r,p,j} = M_{r,p,j} + A^T (U_{r,q} ⊙ V_{q,p,j}) A
            end
        end
    end
end
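For reference, the following NumPy sketch spells out the per-tile computation in Algorithm 1 for F(2 × 2, 3 × 3), using the standard transform matrices of Lavin and Gray [22], and checks it against direct correlation. It is an illustrative sketch, not the authors' code.

```python
import numpy as np

# Transform matrices for F(2x2, 3x3): a 2x2 output tile from a 4x4 input tile
# and a 3x3 kernel, y = A^T [(G w G^T) * (B^T x B)] A with * element-wise.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G   = np.array([[1.0,  0.0, 0.0],
                [0.5,  0.5, 0.5],
                [0.5, -0.5, 0.5],
                [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_tile(x, w):
    U = G @ w @ G.T                 # transformed 3x3 kernel   -> 4x4
    V = B_T @ x @ B_T.T             # transformed 4x4 input tile -> 4x4
    return A_T @ (U * V) @ A_T.T    # 16 multiplications instead of 36

def direct_tile(x, w):
    # Reference: valid 2D correlation of the 4x4 tile with the 3x3 kernel.
    return np.array([[np.sum(x[i:i + 3, j:j + 3] * w) for j in range(2)]
                     for i in range(2)])

x, w = np.random.randn(4, 4), np.random.randn(3, 3)
assert np.allclose(winograd_tile(x, w), direct_tile(x, w))
```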
To compare the theoretical computational complexity of the conventional, Strassen, Winograd, and Strassen-Winograd algorithms, we list the numbers of multiplications and additions in Table 1. The output feature map size is set to 64 × 64, and the kernel size is set to 3 × 3.
Table 1. Computational complexity of different algorithms.
Table 1 shows that, although the fast algorithms incur more additions when the matrix size is small, the number of extra additions is smaller than the number of multiplications saved. Moreover, a multiplication usually costs more time than an addition. Hence all three algorithms are theoretically effective in reducing the computational complexity.
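As a concrete worked example (our own arithmetic, assuming the F(2 × 2, 3 × 3) tiling that is also used in Section 4), one convolution producing a 64 × 64 output map from a 3 × 3 kernel requires

$$
\underbrace{64 \cdot 64 \cdot 3 \cdot 3}_{\text{conventional}} = 36{,}864
\qquad \text{versus} \qquad
\underbrace{\frac{64}{2} \cdot \frac{64}{2} \cdot 4 \cdot 4}_{F(2\times2,\,3\times3)} = 16{,}384
$$

multiplications, a ratio of $16/36 \approx 0.44$; each level of Strassen recursion further multiplies the number of tile products by $7/8$.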
Figure 2 compares the computational complexity ratios. The Strassen algorithm reduces multiplications less when the matrix size is small, but it also incurs fewer additions. The Winograd algorithm shows stable performance, and its number of additions decreases slightly as the matrix size increases. For small matrices, the Strassen-Winograd algorithm reduces multiplications much more than the Strassen algorithm; although it incurs more additions, the number of extra additions is far smaller than the number of multiplications saved. Its performance is then similar to that of the Winograd algorithm, which is slightly better for small matrices, whereas the Strassen and Strassen-Winograd algorithms improve markedly as the matrix size increases.
Figure 2. Comparisons of the complexity ratio with different matrix sizes.

4. Simulation Results

Several simulations were conducted to evaluate our algorithm. We compare it with the Strassen algorithm and the Winograd algorithm, measuring performance by the runtime in MATLAB R2013b (CPU: Intel(R) Core(TM) i7-3370K). For objectivity, we apply Equation (18) to the conventional algorithm and use it as a benchmark. All input data x and kernels w are randomly generated. We measure the accuracy of our algorithm by the absolute element error in the output feature maps. As a reference, we use the conventional algorithm with double-precision data, kernels, intermediate variables and outputs. The other algorithms in this comparison use double-precision data and kernels but single-precision intermediate variables and outputs.
The VGG network [1] was used in our simulations. There are nine different convolutional layers in the VGG network; their parameters are shown in Table 2. The depth indicates the number of times a layer occurs in the network, Q the number of input feature maps, and R the number of output feature maps. $M_w$ and $N_w$ denote the kernel size, and $M_y$ and $N_y$ the size of the output feature map. The kernel size in the VGG network is 3 × 3, and we apply F(2 × 2, 3 × 3) to each convolution. To compute an output feature map of size $M_y \times N_y$, the map is partitioned into $(M_y/2) \times (N_y/2)$ tiles, each requiring one evaluation of F(2 × 2, 3 × 3).
Table 2. Parameters of the convolutional layers in the Visual Geometry Group (VGG) network.
As Table 2 shows, the numbers of rows and columns are not always even, and the matrices are not always square. To handle this, we pad a dummy row or column whenever we encounter an odd number of rows or columns, after which the matrix can be processed with the Strassen algorithm. We apply these nine convolutional layers in turn in our simulations. For each convolutional layer, we run the four algorithms with batch sizes from 1 to 32. The runtime consumption of the algorithms is listed in Table 3, and the numerical accuracy of the different algorithms in different layers is shown in Table 4.
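A minimal sketch of this padding step (our own illustration; zero padding is one correct choice, since the dummy rows and columns contribute nothing to the block products and the corresponding rows and columns of the result are simply dropped afterwards):

```python
import numpy as np

def pad_block_dims_even(T):
    """Pad the first two (block) dimensions of a tile array with zeros so that
    both are even, leaving the tile dimensions untouched."""
    pad = [(0, T.shape[0] % 2), (0, T.shape[1] % 2)] + [(0, 0)] * (T.ndim - 2)
    return np.pad(T, pad)

# Usage with the earlier sketch (hypothetical names):
#   M = strassen_winograd_product(pad_block_dims_even(U),
#                                 pad_block_dims_even(V))[:U.shape[0], :V.shape[1]]
# The padded rows/columns of M are all zero and are sliced away.
```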
Table 3. Runtime consumption of different algorithms.
Table 4. Maximum element error of different algorithms in different layers.
Table 4 shows that the Winograd algorithm is slightly more accurate than the Strassen and Strassen-Winograd algorithms. The maximum element error of these algorithms is 6.16 × 10⁻⁴. Compared with the minimum value of 1.09 × 10³ in the output feature maps, the accuracy loss incurred by these algorithms is negligible. As discussed in Section 2, none of these algorithms loses accuracy in theory; in practice, the loss is mainly caused by the single-precision intermediate variables and outputs. Because the conventional algorithm with low-precision data is sufficiently accurate for deep learning [10,11], we conclude that the accuracy of our algorithm is equally sufficient.
To compare runtime easily, we use the conventional algorithm as a benchmark, and calculate the saving on runtime displayed by the other algorithms. The result is shown in Figure 3.
Figure 3. Comparisons with different batch sizes.
The Strassen-Winograd algorithm outperforms the benchmark in all layers except layer1. This is because the number of input feature maps Q in layer1 is only three; such a small matrix size limits the algorithm, since it incurs relatively more additions. Moreover, odd numbers of rows or columns require dummy rows or columns for matrix partitioning, which adds runtime.
The performance of the Winograd algorithm is stable from layer2 to layer9: it saves 53% of the runtime on average, which is close to the 56% reduction in multiplications. The performance of the Strassen and Strassen-Winograd algorithms improves as the batch size increases. For example, in layer7 with a batch size of 1, the matrix cannot be partitioned for the Strassen algorithm, so it yields almost no runtime saving, while the Strassen-Winograd algorithm saves 52% of the runtime, similar to the Winograd algorithm. With a batch size of 2, the Strassen algorithm saves 13% of the runtime, which matches the 13% reduction in multiplications, and the Strassen-Winograd algorithm saves 58%, which is close to the 61% reduction in multiplications. As the batch size increases, the Strassen and Strassen-Winograd algorithms can use more recursion levels, which further reduces the number of multiplications and saves more runtime. With a batch size of 32, the Strassen-Winograd algorithm saves 75% of the runtime, while the Strassen and Winograd algorithms save 49% and 53%, respectively.
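As a rough consistency check of these figures (our own back-of-the-envelope arithmetic, not taken from the paper), assume that each Strassen recursion level removes 1/8 of the remaining tile products and that F(2 × 2, 3 × 3) needs 16 multiplications per 2 × 2 output tile instead of 36:

```python
# Predicted reduction in multiplications for k Strassen recursion levels,
# alone and combined with F(2x2, 3x3); k ~ 1 roughly corresponds to batch
# size 2 and k ~ 5 to batch size 32 in layer7.
for k in (1, 5):
    strassen = 1 - (7 / 8) ** k
    combined = 1 - (7 / 8) ** k * (16 / 36)
    print(f"k={k}: Strassen {strassen:.1%}, Strassen-Winograd {combined:.1%}")
# k=1: Strassen 12.5%, Strassen-Winograd 61.1%
# k=5: Strassen 48.7%, Strassen-Winograd 77.2%
```

These predictions line up with the reported 13% and 61% reductions at batch size 2 and are close to the 49% and 75% savings observed at batch size 32, where the extra additions and other overheads presumably account for the remaining gap.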
Although experiments with larger batch sizes were not carried out due to limitations on time and memory, the trend as the batch size increases is clear and is consistent with the theoretical analysis in Section 3. We therefore conclude that the proposed algorithm achieves the best performance by combining the savings of the two underlying algorithms.

5. Conclusions and Future Work

The computational complexity of convolutional neural networks is an urgent problem for real-time applications. Both the Strassen algorithm and Winograd algorithm are effective in reducing the computational complexity without losing accuracy. This paper proposed to combine these algorithms to reduce the heavy computational burden. The proposed strategy was evaluated with the VGG network. Both the theoretical performance assessment and the experimental results show that the Strassen-Winograd algorithm can dramatically reduce the computational complexity.
There remain limitations to be addressed in future research. Although the algorithm reduces the computational complexity of convolutional neural networks, it comes at the cost of a more difficult implementation, especially in real-time systems and embedded devices, and it complicates parallelizing a network for hardware acceleration. In future work, we aim to apply this method to hardware accelerators in practical applications.

Author Contributions

Y.Z. performed the experiments and wrote the paper. D.W. provided suggestions about the algorithm. L.W. analyzed the complexity of the algorithms. P.L. checked the paper.

Funding

This research received no external funding.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 61801469, and in part by the Young Talent Program of the Institute of Acoustics, Chinese Academy of Sciences, under Grant QNYC201622.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv, 2014; arXiv:1409.1556. [Google Scholar]
  2. Liu, N.; Wan, L.; Zhang, Y.; Zhou, T.; Huo, H.; Fang, T. Exploiting Convolutional Neural Networks with Deeply Local Description for Remote Sensing Image Classification. IEEE Access 2018, 6, 11215–11228. [Google Scholar] [CrossRef]
  3. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the International Conference on Neural Information Processing Systems, 3–6 December 2012; Volume 60, pp. 1097–1105. [Google Scholar]
  4. Le, N.M.; Granger, E.; Kiran, M. A comparison of CNN-based face and head detectors for real-time video surveillance applications. In Proceedings of the Seventh International Conference on Image Processing Theory, Tools and Applications, Montreal, QC, Canada, 28 November–1 December 2018. [Google Scholar]
  5. Ren, S.; He, K.; Girshick, R. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 91–99. [Google Scholar]
  6. Denil, M.; Shakibi, B.; Dinh, L. Predicting Parameters in Deep Learning. In Proceedings of the International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 2148–2156. [Google Scholar]
  7. Han, S.; Pool, J.; Tran, J. Learning both Weights and Connections for Efficient Neural Networks. In Proceedings of the International Conference on Neural Information Processing Systems, Istanbul, Turkey, 9–12 November 2015; pp. 1135–1143. [Google Scholar]
  8. Guo, Y.; Yao, A.; Chen, Y. Dynamic Network Surgery for Efficient DNNs. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2016; pp. 1379–1387. [Google Scholar]
  9. Qiu, J.; Wang, J.; Yao, S. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 21–23 February 2016; pp. 26–35. [Google Scholar]
  10. Courbariaux, M.; Bengio, Y.; David, J.P. Low Precision Arithmetic for Deep Learning. arXiv, 2014; arXiv:1412.0724. [Google Scholar]
  11. Gupta, S.; Agrawal, A.; Gopalakrishnan, K.; Narayanan, P. Deep Learning with Limited Numerical Precision. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015. [Google Scholar]
  12. Rastegari, M.; Ordonez, V.; Redmon, J. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 525–542. [Google Scholar]
  13. Zhu, C.; Han, S.; Mao, H. Trained Ternary Quantization. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  14. Zhang, X.; Zou, J.; Ming, X.; Sun, J. Efficient and accurate approximations of nonlinear convolutional networks. In Proceedings of the Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2014; pp. 1984–1992. [Google Scholar]
  15. Mathieu, M.; Henaff, M.; LeCun, Y. Fast Training of Convolutional Networks through FFTs. arXiv, 2013; arXiv:1312.5851. [Google Scholar]
  16. Vasilache, N.; Johnson, J.; Mathieu, M. Fast Convolutional Nets with fbfft: A GPU Performance Evaluation. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  17. Toom, A.L. The complexity of a scheme of functional elements simulating the multiplication of integers. Dokl. Akad. Nauk SSSR 1963, 150, 496–498. [Google Scholar]
  18. Cook, S.A. On the Minimum Computation Time for Multiplication. Ph.D. Thesis, Harvard University, Cambridge, MA, USA, 1966. [Google Scholar]
  19. Winograd, S. Arithmetic Complexity of Computations; SIAM: Philadelphia, PA, USA, 1980. [Google Scholar]
  20. Jiao, Y.; Zhang, Y.; Wang, Y.; Wang, B.; Jin, J.; Wang, X. A novel multilayer correlation maximization model for improving CCA-based frequency recognition in SSVEP brain-computer interface. Int. J. Neural Syst. 2018, 28, 1750039. [Google Scholar] [CrossRef] [PubMed]
  21. Wang, R.; Zhang, Y.; Zhang, L. An adaptive neural network approach for operator functional state prediction using psychophysiological data. Integr. Comput. Aided Eng. 2015, 23, 81–97. [Google Scholar] [CrossRef]
  22. Lavin, A.; Gray, S. Fast Algorithms for Convolutional Neural Networks. In Proceedings of the Computer Vision and Pattern Recognition, Caesars Palace, NV, USA, 26 June–1 July 2016; pp. 4013–4021. [Google Scholar]
  23. Strassen, V. Gaussian elimination is not optimal. Numer. Math. 1969, 13, 354–356. [Google Scholar] [CrossRef]
  24. Cong, J.; Xiao, B. Minimizing Computation in Convolutional Neural Networks. In Proceedings of the Artificial Neural Networks and Machine Learning—ICANN 2014, Hamburg, Germany, 15–19 September 2014. [Google Scholar]
