On the Reduction of Computational Complexity of Deep Convolutional Neural Networks †

Deep convolutional neural networks (ConvNets), which are at the heart of many emerging applications, achieve remarkable performance in audio and visual recognition tasks. Unfortunately, achieving high accuracy often implies significant computational costs, limiting deployability. In modern ConvNets it is typical for the convolution layers to consume the vast majority of computational resources during inference. This has made the acceleration of these layers an important research area in academia and industry. In this paper, we examine the effects of co-optimizing the internal structures of the convolutional layers and the underlying implementation of the fundamental convolution operation. We demonstrate that a combination of these methods can have a large impact on the overall speedup of a ConvNet, achieving a ten-fold speedup over the baseline. We also introduce a new class of fast one-dimensional (1D) convolutions for ConvNets using the Toom–Cook algorithm. We show that our proposed scheme is mathematically well-grounded, robust, and does not require any time-consuming retraining, while still achieving speedups solely from convolutional layers with no loss in baseline accuracy.


Introduction
Convolutional neural networks (ConvNets) are becoming a mainstream technology for an array of new embedded applications, including speech recognition, language translation, object detection, image recognition, and numerous other complex tasks [1–5]. This breakthrough has been made possible by recent progress in deep learning, although the theoretical understanding remains unsatisfactory. Basic questions about the optimal architecture, the number of required layers, and the number of neurons per layer are not well understood. Most state-of-the-art deep models require millions of parameters and billions of operations to produce human-level accuracy [6–8]. The memory and computational requirements in particular complicate the deployment of deep neural networks on low-power embedded platforms, as these have a very limited computational and power budget. To avoid running end-to-end inference on embedded systems, the current state-of-the-art solutions enable this type of application by off-loading the computation to cloud-based infrastructures, where server-grade machines (GPUs and other application-specific accelerators) perform the heavy number crunching. Unfortunately, the cloud-assisted approach places severe limitations on the usability and scalability of deep learning-based embedded and Internet of Things (IoT) applications. First and foremost, user data is sent to the cloud, with serious privacy implications. Second, sending large amounts of data (e.g., every frame of a video) over a wireless network consumes significant power due to the communication overhead. For applications where continuous data exchange is required between the server and the mobile device, latency is also a major concern. For example, a wearable continuous glucose level monitoring sensor must detect an abnormal condition and perform an action in real time. The third limitation is scalability, which has mid- to long-term implications.
Gartner Inc., one of the world's leading research and advisory companies, estimates that by 2020, 26 billion IoT units will be installed globally [9]. The staggering amount of data generated by IoT devices will easily exceed the storage limits of cloud infrastructure. To truly scale deep learning-based applications globally in various scenarios, we have to enable these applications without the requirement of always having to connect to the cloud infrastructure.
In this paper, we propose a robust and easy-to-implement acceleration scheme, called One-Dimensional Fast Approximate Low-rank CONvolution (1D-FALCON), which can be applied to readily available state-of-the-art pre-trained models. Very recently, Tishby et al. showed that deep neural networks can be explained from an information-theoretic perspective [10]. The authors showed that the goal of deep learning can be expressed as an information-theoretic trade-off between compression and prediction accuracy. Our proposed scheme exploits the inherent redundancy present in the convolution layers in order to reduce the computational complexity of deep networks. Additionally, we decompose each filter bank into multiple one-dimensional (1D) low-rank vectors to reduce the total number of operations required per layer. We then apply a modified version of the Toom-Cook algorithm to compute the convolution using one-dimensional filters, which further reduces the number of multiplications in discrete convolution. Figure 1 presents the high-level optimization pipeline of our 1D-FALCON scheme.
Although many earlier studies have focused on reducing the overall memory footprint by compression, only a few have aimed at speeding up convolutional layers. Unlike many previously proposed pruning and regularization techniques, our scheme does not involve any time-consuming iterative retraining cycle. Furthermore, since rank selection and decomposition depend only on the inherent properties of each individual layer, each convolution layer can be approximated in parallel. Our approximation scheme is mathematically well-grounded, robust, and thus easily tunable using a numerical formulation, without sacrificing baseline accuracy. To the best of our knowledge, this paper is the first to study a co-optimization scheme that combines both a one-shot low-rank model approximation technique and a fast arithmetic scheme that exploits convolution by separability.

... the desired output, Y, has a significantly lower dimensionality of the predicted categories. In between the input and the output layer, the structure of a deep network forms a Markov chain of intermediate representations made out of many hidden layers $h_1, h_2, \ldots, h_m$ (see Figure 2). In supervised learning we are interested in good representations of the input patterns that enable good predictions of the labels. The deep neural network obtains a Markov chain of such representations, the hidden layers, by minimizing the empirical error over the weights of the network layer by layer. This optimization takes place via stochastic gradient descent (SGD), using a noisy estimate of the gradient of the empirical error with respect to each weight obtained through back propagation. This SGD-based optimization process has two distinct phases: empirical error minimization and representation compression. During the first phase of the SGD-based training process, the network tries to memorize the data using a maximum-entropy weight distribution. In the second phase of the training, it adds noise to the network, which helps it to generalize.
How much information flows between the input and the output of a layer defines the trade-off between complexity and accuracy. Mutual information is a measure of correlation between different variables. With the ReLU activation function, the information is also compressed at each layer. In our research we noticed that deep neural networks trained using SGD-based optimization ended up with many correlated filters in the hidden layers. We exploit this redundancy to trade off complexity against accuracy. The following section covers this trade-off process in more detail.

Figure 2. An example of a deep neural network with an input layer X, output layer $Y_p$, and m hidden layers in between. During the training phase, the desired output Y is observed and is used to learn the connectivity matrices between the layers. In the inference phase, the network forms a Markov chain, which predicts output $Y_p$ for any input X.

Methodology
The proposed 1D-FALCON scheme consists of two main stages: an approximation stage followed by a fast arithmetic stage, as shown in Figure 1. We first approximate each convolutional layer to the level needed to reduce computational complexity, and then decompose each filter bank into two rank-1 filter banks by introducing an intermediate layer in between. If the classification accuracy drops after the layer restructuring stage, we fine-tune the model using the training dataset. Then, we apply a modified version of the Toom-Cook algorithm, which computes each 1D convolution for a chosen set of distinct data points, to further reduce the number of strong operations (in this case multiplications). We will show that the combined application of these two schemes results in a significant reduction in computational complexity. The following sections describe each phase of our optimization pipeline in detail. We first introduce the idea of a separable filter in the context of convolution.

Separable Filters
The concept of separable filters, which splits convolution operations into convergent sums of matrix-valued stages, was proposed by Hummel and Lowe in the 1980s, before ConvNets became popular for automatic feature learning [26]. This property was exploited in many early image-processing filters, e.g., the Sobel edge detection filter and the Gaussian blurring filter. This approach is very powerful but restricted to filters that are decomposable, which is often not the case for a trained filter such as those in a ConvNet. However, due to the inherent redundancy between different filters or feature maps within a layer, this property can be exploited to accelerate ConvNet models.
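Consider an arbitrary kernel of a ConvNet described by the (m × n) matrix W:

$$W = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m1} & w_{m2} & \cdots & w_{mn} \end{bmatrix} \tag{1}$$

We say that kernel W is separable when it can be split into the outer product of an m-length column vector v and an n-length row vector h as follows:

$$W = v\,h \tag{2}$$

or, W can be explicitly expressed as:

$$W = \begin{bmatrix} v_1 h_1 & v_1 h_2 & \cdots & v_1 h_n \\ v_2 h_1 & v_2 h_2 & \cdots & v_2 h_n \\ \vdots & \vdots & \ddots & \vdots \\ v_m h_1 & v_m h_2 & \cdots & v_m h_n \end{bmatrix} \tag{3}$$

From Equations (1) and (3), it is apparent that the rows of a separable kernel are equivalent up to a scale factor (each is a multiple of h), and likewise the columns (each is a multiple of v). Storing the original kernel W in Equation (1) requires (mn) space. However, if the kernel W is separable, then we see from Equation (3) that it requires only (m + n) space. As m and n become large, a separable kernel W therefore yields substantial savings in computational time and storage.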
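The saving promised by a separable kernel can be checked numerically. The following is a minimal sketch, assuming unpadded sliding-window filtering without kernel flipping; the helper names (`outer`, `correlate2d_valid`, `separable_valid`) and the test image are illustrative and not taken from the paper. It builds the Sobel kernel as an outer product and confirms that one 2D pass and two 1D passes give the same output.

```python
def outer(v, h):
    """Outer product of a column vector v (length m) and a row vector h (length n)."""
    return [[vi * hj for hj in h] for vi in v]

def correlate2d_valid(image, kernel):
    """Direct 2D sliding-window filtering (no padding, no kernel flip)."""
    m, n = len(kernel), len(kernel[0])
    rows, cols = len(image) - m + 1, len(image[0]) - n + 1
    return [[sum(kernel[a][b] * image[y + a][x + b]
                 for a in range(m) for b in range(n))
             for x in range(cols)] for y in range(rows)]

def separable_valid(image, v, h):
    """The same filter applied as a vertical 1D pass followed by a horizontal 1D pass."""
    m, n = len(v), len(h)
    rows, cols = len(image) - m + 1, len(image[0]) - n + 1
    vertical = [[sum(v[a] * image[y + a][x] for a in range(m))
                 for x in range(len(image[0]))] for y in range(rows)]
    return [[sum(h[b] * vertical[y][x + b] for b in range(n))
             for x in range(cols)] for y in range(rows)]

# Sobel kernel: smoothing column [1, 2, 1] times derivative row [-1, 0, 1].
v, h = [1, 2, 1], [-1, 0, 1]
sobel = outer(v, h)

# Arbitrary small integer test image (an assumption made only for this example).
image = [[(3 * y + 5 * x * x) % 11 for x in range(6)] for y in range(5)]

full_2d = correlate2d_valid(image, sobel)
two_pass = separable_valid(image, v, h)
```

Per output pixel, the direct form uses mn = 9 multiplications while the two-pass form uses m + n = 6, which mirrors the (mn) versus (m + n) storage argument above.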
Unfortunately, we cannot generally expect a trained kernel in a ConvNet to satisfy such stringent conditions. The collection of kernels in a ConvNet is generally of full rank and expensive to convolve with large images. However, we can aim for W to be approximately separable such that

$$W = v\,h + E \tag{4}$$

where E is an error kernel, which we would like to be as small as possible in relation to the original kernel W. We can further generalize Equation (4) to the following form:

$$W = \sum_{i=1}^{r} v_i h_i + E_r \tag{5}$$

where each term $v_i h_i$ is an exactly separable rank-1 outer product of a column vector of length m and a row vector of length n, and $E_r$ is the error matrix associated with the r-term approximation of the original kernel W, as shown in Figure 3. Eckart and Young showed that the SVD is the solution to the problem of minimizing $E_r$ [27]. Furthermore, if the original kernel W can be well approximated by r rank-1 updates, we only require r(m + n) parameters to describe the kernel instead of the original mn elements. The key idea here is that if we choose r such that r(m + n) << mn, then the kernel requires less storage and computation. We can extend this idea to convolutional neural networks to reduce the overall cost of computation.

Figure 3. A two-dimensional (2D) matrix can be represented by the sum of r rank-1 updates.
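To illustrate the r-term approximation, the sketch below peels off rank-1 terms one at a time and reports the size of the residual $E_r$. It is only an illustration under stated assumptions: it uses power iteration with deflation as a standard-library stand-in for a library SVD routine, the function names are hypothetical, and the example matrix `W2` is constructed by hand to have singular values 20 and 4.

```python
import math

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius(A):
    return math.sqrt(sum(x * x for row in A for x in row))

def top_singular_triplet(A, iterations=200):
    """Power iteration on A^T A; returns (sigma, u, v) for the largest singular value."""
    At = transpose(A)
    # Arbitrary non-zero start vector (assumed not orthogonal to the answer).
    v = [1.0 / (j + 1) for j in range(len(A[0]))]
    for _ in range(iterations):
        w = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:                      # A is numerically zero
            return 0.0, [0.0] * len(A), v
        v = [x / norm for x in w]
    Av = matvec(A, v)
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av] if sigma > 0 else [0.0] * len(A)
    return sigma, u, v

def low_rank_terms(W, r):
    """Remove r rank-1 terms sigma * u v^T from W; return the terms and the residual E_r."""
    E = [row[:] for row in W]
    terms = []
    for _ in range(r):
        sigma, u, v = top_singular_triplet(E)
        terms.append((sigma, u, v))
        E = [[E[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
             for i in range(len(u))]
    return terms, E

# Hand-built 4 x 4 example of rank 2: 5 * ones + c c^T with c = [1, -1, 1, -1].
c = [1, -1, 1, -1]
W2 = [[5 + c[i] * c[j] for j in range(4)] for i in range(4)]

terms1, E1 = low_rank_terms(W2, 1)   # one rank-1 term: residual norm equals sigma_2
terms2, E2 = low_rank_terms(W2, 2)   # two terms: residual vanishes
params_full, params_rank2 = 4 * 4, 2 * (4 + 4)
```

For a kernel this small, r(m + n) = 16 equals mn = 16, so nothing is saved; the benefit appears only when r(m + n) << mn, as stated above.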

Layerwise Approximation and Convolution by Separability
In ConvNets, multiple layers of convolutional filter (also known as kernel) banks are stacked on top of each other, followed by a non-linear activation function. Significant redundancy exists between those spatial filter dimensions and also along cross-channel feature maps. Most of the previous research has focused on either exploiting approximation along spatial filter dimensions or along one of the feature channel dimensions. In our approach, we aim at approximating the redundancy across both the input and output feature maps.
Let us assume that, in a convolutional neural network, a four-dimensional kernel can be represented as $W \in \mathbb{R}^{F_I \times (m \times n) \times F_O}$, where the spatial two-dimensional kernels are of size (m × n) and $F_I$, $F_O$ are the numbers of input and output channels within a layer, respectively. We can also represent an input feature map as $X \in \mathbb{R}^{M \times N \times F_I}$ and the corresponding kernels as $W_i \in \mathbb{R}^{m \times n \times F_I}$ for the ith set of weights, where each input feature map is of size (M × N). The original convolution for the ith set of weights in a given layer now becomes

$$W_i * X = \sum_{f=1}^{F_I} W_i^{f} * x^{f} \tag{7}$$

where $W_i^{f}$ and $x^{f}$ denote the fth channel of the kernel and of the input, respectively. Our goal is to find an approximation $\tilde{W}_i$ of kernel $W_i$, such that $\tilde{W}_i = W_i + E$. Using the concept of separable filters [18], let us assume that for a small error E, the chosen rank is R. How the rank R is chosen will be explained in the next section. The modified kernel can now be represented by Equation (8),

$$\tilde{W}_i * X = \sum_{f=1}^{F_I} \Big( \sum_{j=1}^{R} v_j^{f}\, h_i^{j} \Big) * x^{f} = \sum_{j=1}^{R} h_i^{j} * \Big( \sum_{f=1}^{F_I} v_j^{f} * x^{f} \Big) \tag{8}$$

where $V \in \mathbb{R}^{R \times (m \times 1 \times F_I)}$ is the approximate column kernel, and $H \in \mathbb{R}^{F_O \times (1 \times n \times R)}$ is the approximate row kernel. Figure 4 depicts the idea of reconstructing the convolution layer using the newly constructed column and row low-rank kernels and compares it with the original two-dimensional (2D) direct convolution. We compute the column and row kernels (V, H) statically using generalized eigenvalue decomposition by minimizing the error E. Since we decide the magnitude of the approximation statically, we avoid the long running time of learning-based techniques. Additionally, as the approximation is an inherent property of each layer, we can restructure all the convolutional layers in a ConvNet in parallel, which also saves time. If the accuracy of a model drops at this stage after approximating all the layers, we fine-tune the complete model once using the training dataset.
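The restructuring into a vertical stage followed by a horizontal stage can be sanity-checked on a toy layer. The sketch below is an illustration only: it assumes channel-first nested lists, unpadded filtering, small random integer weights, and a kernel bank that is exactly of rank R (so E = 0); all names and sizes are hypothetical. It verifies that the two-stage layer reproduces the direct multi-channel 2D convolution.

```python
import random

def direct_conv(X, W):
    """Y[i][y][x] = sum over f, a, b of W[i][f][a][b] * X[f][y+a][x+b] (no padding)."""
    F_O, F_I = len(W), len(W[0])
    m, n = len(W[0][0]), len(W[0][0][0])
    rows, cols = len(X[0]) - m + 1, len(X[0][0]) - n + 1
    return [[[sum(W[i][f][a][b] * X[f][y + a][x + b]
                  for f in range(F_I) for a in range(m) for b in range(n))
              for x in range(cols)] for y in range(rows)] for i in range(F_O)]

def two_stage_conv(X, V, H):
    """Vertical (m x 1) stage into R intermediate maps, then horizontal (1 x n) stage."""
    R, F_I, m = len(V), len(V[0]), len(V[0][0])
    F_O, n = len(H), len(H[0][0])
    rows, width = len(X[0]) - m + 1, len(X[0][0])
    T = [[[sum(V[j][f][a] * X[f][y + a][x] for f in range(F_I) for a in range(m))
           for x in range(width)] for y in range(rows)] for j in range(R)]
    cols = width - n + 1
    return [[[sum(H[i][j][b] * T[j][y][x + b] for j in range(R) for b in range(n))
              for x in range(cols)] for y in range(rows)] for i in range(F_O)]

random.seed(0)
F_I, F_O, R, m, n, M, N = 3, 4, 2, 3, 3, 6, 7      # toy sizes, not from the paper

def rnd():
    return random.randint(-3, 3)

X = [[[rnd() for _ in range(N)] for _ in range(M)] for _ in range(F_I)]
V = [[[rnd() for _ in range(m)] for _ in range(F_I)] for _ in range(R)]   # column kernels
H = [[[rnd() for _ in range(n)] for _ in range(R)] for _ in range(F_O)]   # row kernels

# Rank-R kernel bank assembled from the column and row kernels.
W = [[[[sum(V[j][f][a] * H[i][j][b] for j in range(R))
        for b in range(n)] for a in range(m)] for f in range(F_I)] for i in range(F_O)]

Y_direct = direct_conv(X, W)
Y_two_stage = two_stage_conv(X, V, H)
```

With a trained kernel bank the equality holds only up to the approximation error E; the exact match here comes from constructing W to be of rank R.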

Rank Search and Layer Restructuring Algorithm
The rank R is chosen by the one-shot minimization criterion described before. We apply singular value decomposition to the 2D tensor in $\mathbb{R}^{(F_I m) \times (n F_O)}$, which we obtain from the original four-dimensional (4D) tensor in $\mathbb{R}^{F_I \times m \times n \times F_O}$. Unlike other minimization criteria such as the Mahalanobis distance metric or the data covariance distance metric [15], our simple criterion gives us an exact decomposition. Algorithm 1 describes the main steps of our low-rank approximation and ConvNet layer restructuring scheme.
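The 4D-to-2D flattening that precedes the SVD can be sketched as follows. The index ordering (row index f·m + a, column index b·F_O + o) is an assumption made for this illustration; the text above fixes only the resulting shape, and `flatten_kernel_bank` is a hypothetical helper name.

```python
def flatten_kernel_bank(W4):
    """Map W4[f][a][b][o] of shape F_I x m x n x F_O to a 2D list of shape (F_I*m) x (n*F_O)."""
    F_I, m = len(W4), len(W4[0])
    n, F_O = len(W4[0][0]), len(W4[0][0][0])
    return [[W4[f][a][b][o] for b in range(n) for o in range(F_O)]
            for f in range(F_I) for a in range(m)]

# Toy tensor whose entries encode their own indices, so the mapping is easy to inspect.
F_I, m, n, F_O = 2, 3, 3, 4
W4 = [[[[1000 * f + 100 * a + 10 * b + o for o in range(F_O)]
        for b in range(n)] for a in range(m)] for f in range(F_I)]
W2d = flatten_kernel_bank(W4)
```

Each row of the flattened matrix collects one vertical tap of one input channel, and each column one horizontal tap of one output channel, which is what lets the rank-R factors be read back as column kernels V and row kernels H.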

The Modified Toom-Cook's Fast 1D Convolution
Once we have obtained the newly constructed multi-stage 1D convolution layers, we apply a modified version of the Toom-Cook algorithm to further reduce the number of multiplications. In the Toom-Cook method, a linear convolution can be written as the product of two polynomials in the real field [28,29]:

$$s(p) = w(p)\,x(p), \qquad w(p) = \sum_{i=0}^{L-1} w_i p^{i}, \qquad x(p) = \sum_{j=0}^{N-1} x_j p^{j}$$

The output polynomial s(p) has degree L + N − 2 and L + N − 1 coefficients. Instead of explicitly multiplying the polynomials w(p) and x(p) using the discrete convolution, the Toom-Cook algorithm evaluates the polynomials w(p) and x(p) at a set of data points $\beta_i$ and then multiplies their values, $s(\beta_i) = w(\beta_i)\,x(\beta_i)$. Afterwards, the product polynomial s(p) is constructed using Lagrange interpolation (see Figure 5). The algorithm consists of four steps:

1. Choose L + N − 1 distinct data points $\beta_0, \beta_1, \ldots, \beta_{L+N-2}$.
2. Evaluate $w(\beta_i)$ and $x(\beta_i)$ for all the data points.
3. Compute $s(\beta_i) = w(\beta_i)\,x(\beta_i)$.
4. Finally, compute s(p) by Lagrange interpolation as follows:

$$s(p) = \sum_{i=0}^{L+N-2} s(\beta_i) \prod_{j \neq i} \frac{p - \beta_j}{\beta_i - \beta_j}$$

Since (L + N − 1) distinct data points are chosen in step 1, a total of (L + N − 1) multiplications are required in step 3. The Toom-Cook algorithm can also be viewed as a method of factoring matrices and can be expressed in the following form ($\odot$ denotes element-wise multiplication):

$$s = S\big[(W w) \odot (X x)\big]$$

where W, X and S are the transform matrices for the kernels, input, and output, respectively. The cost of computing $Ww$ is amortized over reuse of the result for many input slices. The matrices X and S consist of small integers (0, ±1, ±2, ...), making it possible to realize them by a number of pre- and post-additions. In addition, in ConvNets multiple channels from the same layer can be computed at the same time. For example, a typical convolution layer with C channels will result in the following C output transforms S:

$$s = \sum_{c=1}^{C} S\big[(W w_c) \odot (X x_c)\big]$$

We can rewrite the equation as follows and apply the output transform S only once, on the final sum:

$$s = S\Big[\sum_{c=1}^{C} (W w_c) \odot (X x_c)\Big]$$

This amortizes the cost of the output transform over the number of channels in a layer.
Finally, the only dominant cost left is the (L + N − 1) element-wise multiplications from step 3.
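The evaluate, multiply, and interpolate steps can be written out directly. The following is a minimal sketch in exact rational arithmetic, assuming coefficient lists ordered from lowest to highest degree; the function names and the example inputs are illustrative, and the interpolation is done naively rather than through precomputed transform matrices.

```python
from fractions import Fraction

def evaluate(coeffs, point):
    """Horner evaluation of a polynomial given lowest-degree coefficient first."""
    result = Fraction(0)
    for c in reversed(coeffs):
        result = result * point + c
    return result

def poly_mul_linear(poly, root):
    """Multiply poly(p) by (p - root)."""
    out = [Fraction(0)] * (len(poly) + 1)
    for k, c in enumerate(poly):
        out[k + 1] += c
        out[k] -= root * c
    return out

def toom_cook(w, x, points):
    """Linear convolution of w (length L) and x (length N) from L + N - 1 pointwise products."""
    assert len(points) == len(w) + len(x) - 1 and len(set(points)) == len(points)
    # Steps 2 and 3: evaluate both polynomials and multiply the values pointwise.
    products = [evaluate(w, b) * evaluate(x, b) for b in points]
    # Step 4: Lagrange interpolation of s(p) from the L + N - 1 products.
    s = [Fraction(0)] * len(points)
    for i, b_i in enumerate(points):
        basis, denom = [Fraction(1)], Fraction(1)
        for j, b_j in enumerate(points):
            if j != i:
                basis = poly_mul_linear(basis, b_j)
                denom *= (b_i - b_j)
        for k, c in enumerate(basis):
            s[k] += products[i] * c / denom
    return s

def direct_convolution(w, x):
    """Reference implementation using L * N multiplications."""
    s = [0] * (len(w) + len(x) - 1)
    for i, wi in enumerate(w):
        for j, xj in enumerate(x):
            s[i + j] += wi * xj
    return s

w, x = [1, 2, 3], [4, 5]            # L = 3, N = 2
points = [0, 1, -1, 2]              # L + N - 1 = 4 distinct points
fast = toom_cook(w, x, points)
slow = direct_convolution(w, x)
```

Only the four pointwise products depend on both operands; the evaluations and the interpolation involve fixed constants, which is why they can be folded into the transform matrices W, X, and S.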

A Fast Convolution Algorithm for Filtering of Dimension Three Using the Modified Toom-Cook Scheme
In our 1D-FALCON scheme, we have chosen an input block size of (6 × 1) to be convolved with (3 × 1) 1D filters. This results in a (4 × 1) output block, and we denote this algorithm as F(4 × 1, 3 × 1, {6 × 1}). Alternatively, one can also start with a (4 × 1) input block and swap the output and input transforms to obtain the same result, as shown in Equation (34). Using this alternative approach, we now compute the necessary transformation matrices, namely W, X, and S.

$$w(p) = w_0 + w_1 p + w_2 p^2 \tag{14}$$

$$x(p) = x_0 + x_1 p + x_2 p^2 + x_3 p^3 \tag{15}$$

$$s(p) = w(p)\,x(p) = s_0 + s_1 p + s_2 p^2 + s_3 p^3 + s_4 p^4 + s_5 p^5 \tag{16}$$

Since L = 3 and N = 4, L + N − 3 = 4. Therefore we can choose $\beta_0 = 0$, $\beta_1 = 1$, $\beta_2 = -1$, $\beta_3 = 2$, and $\beta_4 = -2$. Now, let us calculate the individual $w(\beta_k)$ and $x(\beta_k)$ as follows:

$$\begin{aligned}
w(\beta_0) &= w_0, & x(\beta_0) &= x_0 \\
w(\beta_1) &= w_0 + w_1 + w_2, & x(\beta_1) &= x_0 + x_1 + x_2 + x_3 \\
w(\beta_2) &= w_0 - w_1 + w_2, & x(\beta_2) &= x_0 - x_1 + x_2 - x_3 \\
w(\beta_3) &= w_0 + 2w_1 + 4w_2, & x(\beta_3) &= x_0 + 2x_1 + 4x_2 + 8x_3 \\
w(\beta_4) &= w_0 - 2w_1 + 4w_2, & x(\beta_4) &= x_0 - 2x_1 + 4x_2 - 8x_3
\end{aligned}$$

According to the modified Toom-Cook algorithm, the polynomial of degree (L + N − 3) can now be expressed as follows:

$$s'(p) = s(p) - w_2 x_3 p^5, \qquad s'(\beta_k) = w(\beta_k)\,x(\beta_k) - w_2 x_3 \beta_k^5$$

Using Lagrange interpolation,

$$s'(p) = \sum_{k=0}^{4} s'(\beta_k) \prod_{j \neq k} \frac{p - \beta_j}{\beta_k - \beta_j},$$

which can be simplified further and rearranged in polynomial form. Since we have modified the Toom-Cook algorithm to reduce the number of additions, we can recover s(p) by using $s(p) = s'(p) + w_2 x_3 p^5$. Finally, we obtain the output in matrix form by replacing all $\beta_k$ in the previous equation. The Toom-Cook algorithm can be viewed as a method of factoring matrices and can be expressed in the following form ($\odot$ denotes element-wise multiplication):

$$s = S\big[(W w) \odot (X x)\big]$$

We can transpose this solution for a larger block size using the matrix exchange theorem from linear algebra. According to the matrix exchange theorem, if a matrix M can be factored into a product with a diagonal matrix D as the middle factor, then it can also be factored in a transposed form that uses $\bar{S}$, the matrix obtained from S by reversing the order of its columns, and $\bar{X}$, the matrix obtained from X by reversing the order of its rows. We can now apply the same to our final equation to obtain an alternative form. Finally, we obtain the transformation matrices $S^T$, $X^T$, and W from Equations (36)-(38), respectively.
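The modified scheme for L = 3 and N = 4 can be traced numerically. The sketch below is an illustration under stated assumptions: it uses the five finite points 0, ±1, ±2 plus the leading product $w_2 x_3$ named in the text, but the closed-form interpolation steps are derived here for this example and are not claimed to match the paper's transformation matrices. Inputs are assumed to be integers or `Fraction` values.

```python
from fractions import Fraction

POINTS = (0, 1, -1, 2, -2)

def modified_toom_cook_3x4(w, x):
    """Linear convolution of a 3-tap filter with a 4-sample block using 6 general products."""
    w0, w1, w2 = w
    x0, x1, x2, x3 = x
    top = w2 * x3                                  # coefficient of p^5 (one product)
    w_at = [w0 + b * (w1 + b * w2) for b in POINTS]
    x_at = [x0 + b * (x1 + b * (x2 + b * x3)) for b in POINTS]
    products = [wb * xb for wb, xb in zip(w_at, x_at)]   # five more products
    # s'(beta) = s(beta) - w2 * x3 * beta^5 is a polynomial of degree L + N - 3 = 4.
    f0, f1, fm1, f2, fm2 = [Fraction(p) - top * b ** 5
                            for p, b in zip(products, POINTS)]
    # Split into even and odd parts, then solve two small 2 x 2 systems.
    even1, even2 = (f1 + fm1) / 2 - f0, (f2 + fm2) / 2 - f0   # c2 + c4, 4*c2 + 16*c4
    odd1, odd2 = (f1 - fm1) / 2, (f2 - fm2) / 2               # c1 + c3, 2*c1 + 8*c3
    c4 = (even2 - 4 * even1) / 12
    c2 = even1 - c4
    c3 = (odd2 - 2 * odd1) / 6
    c1 = odd1 - c3
    return [f0, c1, c2, c3, c4, Fraction(top)]

def direct_convolution(w, x):
    """Reference implementation using L * N = 12 multiplications."""
    s = [0] * (len(w) + len(x) - 1)
    for i, wi in enumerate(w):
        for j, xj in enumerate(x):
            s[i + j] += wi * xj
    return s

w, x = [1, 2, 3], [4, 5, 6, 7]
fast = modified_toom_cook_3x4(w, x)
slow = direct_convolution(w, x)
```

The six products (five pointwise plus $w_2 x_3$) replace the twelve of the direct form; the remaining work is additions and multiplications by small fixed constants.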

Results and Discussion
In order to evaluate the effectiveness of our scheme, we evaluated it on several popular networks targeting the MNIST, CIFAR-10, ImageNet, and PASCAL VOC datasets. In this paper, we demonstrate our results for the VGG-16 model, which won the ImageNet challenge in 2014 [30]. VGG-16 is a deep architecture and consists of 13 convolutional layers out of a total of 16 layers. To make a comparison with a wide variety of speedup techniques, we chose a direct 2D convolutional scheme [30], a low-rank scheme based on the Tucker decomposition [31], two popular pruning techniques [7,32], a sparsification scheme [33], and the 2D Winograd filtering scheme [23].
We used three main metrics for comparison:
• MULs: Total number of strong operations (i.e., multiplications) in the convolutional layers
• Speedup: Total speedup achieved as compared to baseline 2D convolution
• Fine-Tuning Time: Average fine-tuning time in number of epochs. Fine-tuning is the process of re-training a CNN after having trained it once and then having reduced its complexity. An epoch is a complete pass through the training set.
As can be seen from Table 1, our 1D-FALCON scheme achieves a significant speedup compared to the other schemes and does not require a long fine-tuning time. The overall speedup comes from the combined application of both the low-rank approximation scheme and the fast 1D convolution technique using the modified Toom-Cook algorithm. The following sections detail the speedup achieved by the individual stages of our optimization pipeline.

Speedup from the Low-Rank Approximation Stage
The computational cost of the baseline 2D direct convolution is $O(F_I M N m n F_O)$, where each input feature map is of size (M × N), the spatial two-dimensional kernels are of size (m × n), and $F_I$, $F_O$ are the numbers of input and output channels within a layer, respectively. However, using our 1D-FALCON approximation scheme, the computational costs for the vertical stage and the horizontal stage are $O(F_I M N m R)$ and $O(R M N n F_O)$, respectively, resulting in a total computational cost of $O((m F_I + n F_O) M N R)$. If we choose R such that $R(m F_I + n F_O) \ll m n F_I F_O$, then the computational cost is reduced. In practice, current state-of-the-art convolutional neural networks use square kernels. Hence, let us assume m = n = p, which is the size of the kernel in the model. Under this assumption, the condition simplifies to $R(F_I + F_O) \ll p F_I F_O$. In addition, most modern ConvNets use more channels in the higher layers than in the corresponding lower layers, i.e., the channel ratio $F_O / F_I \gg 1$. The higher the ratio, the larger the value of R can be. In most layers, the computational cost can be reduced by a factor of p, the dimension of the kernel in the respective layer. Our evaluation on VGG-16 showed an average speedup of 3-5 times across all layers and a maximum speedup of 8-9 times on many individual layers. Table 2 shows the layer-wise speedup of convolutions achieved in the VGG-16 model using an Intel i7-5930k system. It is possible to push the limit of approximation further at the cost of an increased loss in classification accuracy. This loss can be recovered by fine-tuning the model; however, more approximation in a layer leads to a longer fine-tuning time. Beyond a certain limit in the rank, the original baseline accuracy cannot be recovered to an acceptable level. Figure 6 shows the accuracy vs. approximation trade-off for a few selected layers from the VGG-16 model. In the figure, the horizontal dashed line represents an acceptable loss of accuracy of 1% from the baseline.
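The cost expressions above translate into a small calculator. This is a minimal sketch that counts multiply-accumulate operations only and ignores constant factors; the layer dimensions and the rank R = 128 are hypothetical values chosen for illustration, not figures from Table 2.

```python
def direct_macs(F_I, F_O, M, N, m, n):
    """Multiply-accumulates of the baseline 2D convolution: F_I * M * N * m * n * F_O."""
    return F_I * M * N * m * n * F_O

def two_stage_macs(F_I, F_O, M, N, m, n, R):
    """Vertical stage (F_I * M * N * m * R) plus horizontal stage (R * M * N * n * F_O)."""
    return F_I * M * N * m * R + R * M * N * n * F_O

def speedup(F_I, F_O, M, N, m, n, R):
    return direct_macs(F_I, F_O, M, N, m, n) / two_stage_macs(F_I, F_O, M, N, m, n, R)

def break_even_rank(F_I, F_O, m, n):
    """Rank at which both schemes cost the same; R must be well below this to gain."""
    return (m * n * F_I * F_O) / (m * F_I + n * F_O)

# Hypothetical layer: 256 input channels, 512 output channels, 28 x 28 maps, 3 x 3 kernels.
layer = dict(F_I=256, F_O=512, M=28, N=28, m=3, n=3)
gain = speedup(R=128, **layer)
limit = break_even_rank(layer["F_I"], layer["F_O"], layer["m"], layer["n"])
```

For this hypothetical layer the break-even rank is 512 and a rank of 128 gives a four-fold reduction, which is in line with the 3-5 times average reported above.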

Speedup from the Fast Convolution Stage
The 1D Toom-Cook algorithm requires (N + L − 1) multiplications, compared to a direct implementation, which requires (N × L) multiplications, where N and L are the dimensions of an input feature slice and a 1D filter, respectively. In the case of the VGG-16 model, we chose N = 4 and L = 3, resulting in a 2× saving in the total number of multiplications (6 instead of 12). As our modified VGG-16 model has vertical and horizontal stages, it achieves this 2× saving in the number of multiplications in each 1D stage. A 4× reduction in computational intensity is also possible if we use a variant of the algorithm with an output block size of 6. The $S^T$, $X^T$ and W transformation matrices corresponding to this variant are shown in Appendix D. However, this speedup is achievable at the cost of a seven-fold increase in the memory footprint of the filters.

Efficient Use of Memory Bandwidth and Improved Local Reuse
Our 1D-FALCON scheme not only helps to reduce the overall computational intensity but also reduces the storage cost arising from the convolutional layers. The storage cost without this scheme is $F_I F_O p^2$, whereas it drops to $(F_I p R + R p F_O)$ after approximating the kernels and separating them into two rectangular ones. If we choose $R \ll p F_I F_O / (F_I + F_O)$, significant savings can be made in the storage cost of the kernels. Table 3 shows an average 5× reduction in the overall memory footprint of the model, whereas many individual layers achieve a 9-10× reduction. Fetching data from off-chip main memory (DRAM) costs an order of magnitude more than fetching it from on-chip or local storage [34,35]. Chen et al., in their Eyeriss research project, showed that row-stationary 1D convolution is the optimal solution for throughput and energy efficiency, as compared to schemes that use classical 2D convolution [36]. Separable filters enable row-stationary 1D convolutions by dividing the convolution into two 1D stages, which reduces the number of unnecessary data loads in padded convolution. To preserve information, many convolutional networks use zero-padding in many layers. Around each image tile, there is an apron of pixels that is required in order to filter the tile. Note that the apron of one block also overlaps with the adjacent blocks. If we separate the convolution into vertical and horizontal passes, it is no longer necessary to load the top and bottom apron regions for the horizontal stage of computation. Similarly, for the vertical stage, it is no longer necessary to load the left and right apron regions. This allows more efficient use of the available memory bandwidth and on-chip storage. In the case of strided convolution, this approach works very well.

Table 3. VGG16 model approximation summary.
Layer | No. of Parameters | Reduction in Layer Size
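The apron argument can be made concrete by counting the input elements a single tile has to load. This is a rough sketch under simplifying assumptions: a square output tile of side `tile`, a square p × p kernel (or p-tap 1D kernels), and no accounting for caching or for the overlap between neighbouring tiles; the function names and the 16 × 16 tile are illustrative.

```python
def tile_loads_2d(tile, p):
    """Elements loaded for one tile in 2D convolution: the tile plus a full apron."""
    return (tile + p - 1) ** 2

def tile_loads_1d_stage(tile, p):
    """Elements loaded for one tile in a single 1D stage: apron along one axis only."""
    return (tile + p - 1) * tile

def apron_only(loads, tile):
    """Loaded elements that lie outside the tile itself."""
    return loads - tile * tile

tile, p = 16, 3
loads_2d = tile_loads_2d(tile, p)
loads_1d = tile_loads_1d_stage(tile, p)
apron_2d = apron_only(loads_2d, tile)
apron_1d = apron_only(loads_1d, tile)
```

In this example each 1D stage loads less than half the apron of the 2D form, because the corner regions and one pair of sides are never touched.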

Extension of the 1D Algorithm to a 2D Variant and Its Limitations
We can extend the one-dimensional convolution solution shown in Equation (35) to two-dimensional convolution easily by nesting the first transforms inside the second transforms as follows:

$$s = S^{T}\big[(W w W^{T}) \odot (X^{T} x X)\big]\,S$$

As a result of nesting, a 2D convolution requires $(L + N - 1)^2$ element-wise multiplications. By choosing different values of N, a number of variants of this algorithm can be produced using the steps shown in Figure 5, each offering a different speedup. Figure 7 shows a comparison of computational intensity between fast convolution and direct convolution for different choices of output tile size. We can see from the figure that larger tile sizes lead to a higher speedup. However, for larger tile sizes the associated memory footprint increases dramatically. As we choose larger tile sizes, the filter tile size also needs to be increased to match the dimensions, which results in an increased memory footprint. A comparison between the reduction in computational intensity and the increase in memory footprint associated with the filters is shown in Figure 8. As an example, using an output tile size of 6, the computational intensity can be reduced by almost five times. However, this speedup results in a seven-fold increase in the memory footprint associated with the filters, which may be significant for embedded systems, as they do not have large on-chip memory.
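The trade-off between fewer multiplications and larger transformed filters follows from simple counts. The sketch below assumes that N is the output tile size per dimension, that the direct form needs $(N L)^2$ multiplications per tile, and that a transformed filter occupies $(L + N - 1)^2$ entries instead of $L^2$; the function names are illustrative.

```python
def reduction_2d(N, L=3):
    """Direct multiplications per tile, (N*L)^2, over fast ones, (L+N-1)^2."""
    return (N * L) ** 2 / (L + N - 1) ** 2

def filter_growth_2d(N, L=3):
    """Transformed filter size, (L+N-1)^2, over original filter size, L^2."""
    return (L + N - 1) ** 2 / L ** 2

# Output tile sizes matching the 1D variants discussed in the text and appendices.
summary = {N: (reduction_2d(N), filter_growth_2d(N)) for N in (2, 3, 4, 6)}
```

For an output tile size of 6 this gives a reduction of about 5.06 and a filter growth of about 7.1, consistent with the "almost five times" and "seven-fold" figures quoted above.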

Conclusions
In this paper, we demonstrated that co-optimization of the internal structure of ConvNet models and the underlying implementation of the fundamental algebraic operation forms an efficient approach to speeding up inference in convolutional neural networks. In the first stage of our optimization pipeline, to facilitate the structural optimization of the models, we introduced an easy-to-implement, correlation-based, and mathematically well-grounded technique that approximates each filter bank by exploiting inherent redundancies among feature maps in ConvNets. Unlike many iterative pruning and regularization techniques, our scheme does not require any time-consuming fine-tuning and yet preserves the baseline accuracy. In addition, the availability of several pre-tuned models with different performance-accuracy targets can provide significant advantages for deploying ConvNets in emerging applications with a fast time-to-market. The first stage of reduction in computational intensity is augmented with a further fast convolution stage using the modified Toom-Cook algorithm. In this stage, the total number of strong operations is reduced dramatically without any approximations that may affect accuracy. We have evaluated our 1D-FALCON optimization scheme on a variety of ConvNets targeting different datasets and model sizes. The results from the evaluation running on a range of real hardware provide strong evidence that a significant speedup in ConvNets can be achieved without sacrificing baseline accuracy by jointly optimizing the structure of the network and the underlying implementation of the fundamental convolution operation.

Acknowledgments:
The authors would like to thank Daniel Bates from the computer architecture research group at the University of Cambridge for insightful discussions, and Sean Holden from the artificial intelligence research group at the University of Cambridge for valuable feedback in the early stages of this research. The authors also would like to thank Rune Holm from ARM Ltd. for providing insightful feedback on the applicability of our technique in the embedded system space. This research was supported by an EPSRC doctoral scholarship.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

ConvNet      Convolutional Neural Network
ILSVRC       The ImageNet Large Scale Visual Recognition Challenge
1D-FALCON    One-Dimensional Fast Approximate Low-rank Convolution

Appendix A. Algorithm F(2×1, 3×1, {4×1})
The following 1D algorithm can be used to convolve a (4 × 1) input with a (3 × 1) filter. The output of this algorithm is a (2 × 1) output block. This 1D algorithm can also be easily nested for use with a (3 × 3) filter on a (4 × 4) input tile to produce a (2 × 2) output region. The output transformation is computed from a set of intermediate results, which in turn use the transformed filter.

Appendix B. Algorithm F(3×1, 3×1, {5×1})
The following 1D algorithm can be used to convolve a (5 × 1) input with a (3 × 1) filter. The output of this algorithm is a (3 × 1) output block. This 1D algorithm can also be easily nested for use with a (3 × 3) filter on a (5 × 5) input tile to produce a (3 × 3) output region. The output transformation is computed from a set of intermediate results, which in turn use the transformed filter.
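For orientation, the smallest of these algorithms, F(2×1, 3×1, {4×1}), can be sketched in its widely used textbook form. This is the standard minimal filtering formulation with four multiplications; it is given as an assumption-laden illustration and may differ in scaling or sign conventions from the transformation matrices in this appendix. The names `d` (input block), `g` (filter), and the helper functions are hypothetical.

```python
from fractions import Fraction

def filter_transform(g):
    """Transformed filter; can be precomputed once per filter."""
    g0, g1, g2 = g
    return [Fraction(g0),
            Fraction(g0 + g1 + g2, 2),
            Fraction(g0 - g1 + g2, 2),
            Fraction(g2)]

def f2_3(d, g):
    """Two outputs of a 3-tap filter over a 4-sample block using 4 multiplications."""
    d0, d1, d2, d3 = d
    u0, u1, u2, u3 = filter_transform(g)
    # Intermediate results: one multiplication each.
    m1 = (d0 - d2) * u0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * u3
    # Output transformation: additions only.
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_f2_3(d, g):
    """Reference: y_k = d_k*g0 + d_(k+1)*g1 + d_(k+2)*g2, using 6 multiplications."""
    return [sum(d[k + t] * g[t] for t in range(3)) for k in range(2)]

d, g = [1, 2, 3, 4], [1, 0, -1]
fast = f2_3(d, g)
slow = direct_f2_3(d, g)
```

Nesting this 1D routine over rows and then columns yields the (2 × 2) output region from a (4 × 4) input tile mentioned above, with 16 multiplications instead of 36.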

Appendix C. Additional Results from Other Widely Used CNNs
We have applied our technique to many other widely used CNNs trained on the ImageNet dataset. In this section we provide a comprehensive summary of the speedup achieved in our experiments.

Appendix D. Algorithm F(6×1, 3×1, {8×1})
The following 1D algorithm can be used to convolve an (8 × 1) input with a (3 × 1) filter. The output of this algorithm is a (6 × 1) output block. This 1D algorithm can also be easily nested for use with a (3 × 3) filter on an (8 × 8) input tile to produce a (6 × 6) output region. The output transformation is computed from a set of intermediate results, which in turn use the transformed filter.