1. Introduction
Deep learning-based computer vision shows good performance in various areas such as image segmentation [1,2], image synthesis [3], facial recognition [4], classification [5], person re-identification [6], and object detection [7,8]. However, in spite of the remarkable achievements in difficult computer vision tasks, conventional deep convolutional neural networks (CNNs) use a large number of parameters, which limits their use on devices with limited resources such as smartphones and embedded systems. Even though it is known that there exists a great deal of redundancy between the parameters and the feature maps in deep models, over-parametrized CNN models are used because over-parametrization makes the training of the network easier, as has been shown in the experiments in [9]. The reason for this phenomenon is believed to be a better gradient flow in over-parametrized networks.
Meanwhile, it has been shown in [10] that even with the use of regularization methods, there still exists excessive capacity in trained networks, which again implies that the redundancy between the parameters is still large. Therefore, much research focuses on finding a better network structure so that the parameters can be expressed in a structured subspace with a smaller number of coefficients. The research topic of compressing large-scale deep learning models is increasing in importance, as compressed deep learning models are needed on edge devices such as smartphones and IoT devices. While early works focused on compressing the parameters of pre-trained large-scale deep learning models [11,12,13,14,15,16,17,18,19,20,21,22], studies are also actively under way to limit the number of parameters by proposing small networks in the first place [23,24,25,26,27,28,29,30,31,32,33,34,35,36,37]. Most recently, research has focused on how to efficiently use these compressed models on edge devices [38,39,40,41]. We provide a detailed overview of these research trends in Section 2.
In this paper, we propose a training method for low-rank convolutional neural networks, which we call the alternating tensor compose-decompose (ATCD) method. The proposed method can train compressed low-rank models better than existing training methods, thus obtaining a compressed deep learning model with higher performance. In general, when training deep learning models, the same neural network structure is used during the training and testing stages. Conventional tensor-decomposed networks are trained in the decomposed form; we call this conventional approach the fixed-structure based training method. In comparison, the proposed training method does not use a fixed network structure in the training stage, but allows the tensors to be alternately composed and decomposed so that a better gradient flow can pass through the tensors in the backpropagation step. This better gradient flow results in better parameter values than the conventional training method, so that the compressed model can achieve higher performance. As an example, we apply the proposed method to the rank-1 CNN, which, in the training stage, is iteratively and alternately composed into a 3-D rank-1 CNN structure and decomposed into 1-D vectors, where the 3-D rank-1 filters are constructed by the outer products of the 1-D vectors. The number of parameters in the 3-D rank-1 filters is the same as in the 3-D filters of standard CNNs, allowing a good gradient flow in the backpropagation stage. The difference from the backpropagation stage of standard CNNs is that the gradient also flows through the 1-D vectors from which the 3-D rank-1 filters are constructed, updating the parameters in the 1-D vectors as well. After the backpropagation step, the 3-D filters lose their rank-1 property.
However, at the next composition step, the parameters in the 3-D filters are updated again by the outer product operation to be projected onto the rank-1 subspace. By iterating this two-step update, all the 3-D filters in the network are trained to minimize the loss function while maintaining their rank-1 property. This is different from approaches which approximate the trained filters by a low-rank approximation after the training has finished, e.g., the low-rank approximation in [14], and from approaches which use the same fixed CNN structure in both the training and testing stages. The composition operation is included in the training phase of our network, which directs the parameter update in a different direction from that of standard CNNs, steering the solution to live on a rank-1 subspace.
In the testing phase, we no longer need the tensor composing stage, and the 3-D rank-1 filters can be permanently decomposed into 1-D filters. In the testing stage, the rank-1 CNN is therefore reconstructed into a 1-D rank-1 CNN structure with the trained 1-D vectors used as the 1-D filters. As a result, the rank-1 CNN has the same accuracy as the 3-D rank-1 CNN, but the same inference speed as the 1-D rank-1 CNN, i.e., the inference speed is exactly the same as that of the Flattened network. Moreover, with the proposed method, the network can be trained even in cases where the Flattened network cannot be trained at all. In other words, the proposed training method can be applied to networks with very limited gradient paths due to the low-rank property, which cannot be trained with conventional training methods.
We also show how the same training method can be applied to the well-known MobileNet. We first show how the channel-wise filters can be expressed as a linear combination of low-rank filters, and then show how the proposed alternating tensor compose-decompose (ATCD) training method can be applied to the training of the low-rank filters. The low-rank filters are composed back into the MobileNet structure at the end of the training. Thereby, better parameters are obtained than with conventional training of the fixed MobileNet structure.
4. Application of the Proposed Training Method to the Rank-1 CNN
As an example of how the tensor composing-decomposing method can be applied to train low-rank CNNs, we apply the proposed training method to the rank-1 CNN, which is composed of rank-1 convolutional filters only. In comparison with other CNN models which use 1-D rank-1 filters, we propose the use of 3-D rank-1 filters $W$ in the training stage, where the 3-D rank-1 filters are constructed by the outer product of three 1-D vectors, say $\mathbf{u}$, $\mathbf{v}$, and $\mathbf{t}$:
$$W = \mathbf{u} \otimes \mathbf{v} \otimes \mathbf{t}.$$
This is an extension of the 2-D rank-1 planes used in the B2DPCA, where the 2-D planes are constructed by $\mathbf{u}\mathbf{v}^{T}$.
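As a concrete illustration, the following sketch (not the paper's code; names are ours) builds a 3-D rank-1 filter from three 1-D vectors by the outer product $w(x,y,z) = u(x)\,v(y)\,t(z)$, so every 2-D slice of the resulting tensor is a scalar multiple of $\mathbf{u}\mathbf{v}^{T}$ and hence rank-1:

```python
# Build a 3-D rank-1 filter as the outer product of three 1-D vectors:
# w[x][y][z] = u[x] * v[y] * t[z]
def outer3(u, v, t):
    return [[[ux * vy * tz for tz in t] for vy in v] for ux in u]

u, v, t = [1.0, 2.0], [3.0, 4.0], [5.0, -1.0]
w = outer3(u, v, t)

# Each fixed-z slice is t[z] * (u v^T), so its 2x2 determinant vanishes,
# confirming the rank-1 property of every slice.
det_slice0 = w[0][0][0] * w[1][1][0] - w[0][1][0] * w[1][0][0]
```

Note that the tensor has $|u|\cdot|v|\cdot|t|$ entries but only $|u|+|v|+|t|$ free parameters, which is the source of the compression.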
Figure 2 shows the training and testing phases of the proposed method. The structure of the proposed network differs between the training phase and the testing phase. In comparison with the Flattened network (Figure 1), in the training phase the gradient flow first passes through the 3-D rank-1 filters and then through the 1-D vectors. Therefore, the gradient flow is different from that of the Flattened network, resulting in a different and better solution for the parameters in the 1-D vectors. With the proposed method, a solution can be obtained even in large networks for which the gradient flow in the Flattened network cannot obtain a solution at all. Furthermore, at test time, i.e., at the end of optimization, we can use the 1-D vectors directly as 1-D filters in the same manner as in the Flattened network, resulting in the same inference speed and number of operations as the Flattened network (Figure 2).
4.1. Construction of the 3-D Rank-1 Filters
We first observe that a 2-D convolution can be seen as shifting inner products, where each component $y(i,j)$ at position $(i,j)$ of the output matrix $Y$ is computed as the inner product of a 2-D filter $W$ and the image patch $X_{(i,j)}$ centered at $(i,j)$:
$$y(i,j) = \langle W, X_{(i,j)} \rangle.$$
If $W$ is constructed by the outer product of two 1-D vectors $\mathbf{u}$ and $\mathbf{v}$, i.e., $W = \mathbf{u}\mathbf{v}^{T}$, then $W$ becomes a 2-D rank-1 filter. In this case, it can be observed that
$$y(i,j) = \mathbf{u}^{T} X_{(i,j)} \mathbf{v}.$$
As has been explained in the case of B2DPCA, since $\mathbf{v}$ is multiplied to the rows of $X_{(i,j)}$, $\mathbf{v}$ tries to extract the features from the rows of $X_{(i,j)}$ which can minimize the loss function. That is, $\mathbf{v}$ searches the rows in all patches for some common features which can reduce the loss function, while $\mathbf{u}$ looks for the features in the columns of the patches. In analogy to the B2DPCA, this bilateral projection removes the redundancies among the rows and columns in the 2-D filters.
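The separability above can be checked numerically. The following sketch (our own toy code, with an assumed small input) verifies that a 2-D valid correlation with a rank-1 filter $W = \mathbf{u}\mathbf{v}^{T}$ equals filtering the rows with $\mathbf{v}$ and then the columns with $\mathbf{u}$:

```python
# Direct 2-D (valid) correlation of an image X with a filter W.
def corr2d(X, W):
    fh, fw = len(W), len(W[0])
    H, Wd = len(X), len(X[0])
    return [[sum(W[m][n] * X[i + m][j + n] for m in range(fh) for n in range(fw))
             for j in range(Wd - fw + 1)] for i in range(H - fh + 1)]

u, v = [1.0, 2.0], [1.0, -1.0]
W = [[ux * vy for vy in v] for ux in u]      # rank-1 filter u v^T
X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]

full = corr2d(X, W)                          # one 2-D correlation
# Separable form: filter each row with v, then each column with u.
rows = [[sum(v[n] * row[j + n] for n in range(2)) for j in range(2)] for row in X]
sep = [[sum(u[m] * rows[i + m][j] for m in range(2)) for j in range(2)] for i in range(2)]
```

This is exactly why the 1-D filters of the Flattened network can reproduce a rank-1 2-D convolution at a lower operation count.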
In convolutional neural networks, the input $X$ and the convolutional filter $W$ are both three dimensional, where the third dimension refers to the depth of the input, i.e., the number of input channels. In this case, the 3-D rank-1 filter $W$ is constructed by the outer product of three 1-D vectors $\mathbf{u}$, $\mathbf{v}$, and $\mathbf{t}$,
$$W = \mathbf{u} \otimes \mathbf{v} \otimes \mathbf{t},$$
where the length of $\mathbf{t}$ is the same as the depth of the input $X$. In analogy to the B2DPCA, the 3-D rank-1 filters which are learned by the three dimensional multilateral projection will have fewer redundancies among the rows, columns, and channels than the normal 3-D filters in standard CNNs.
The three dimensional convolution of the 3-D rank-1 filter $W$ and $X$ can be expressed as the sum of channel-wise 2-D convolutions. Let $W_i$ denote the $i$-th 3-D rank-1 filter that results in the $i$-th output channel $Y_i$, and $W^{2D}_i$ the 2-D rank-1 filter that relates with $W_i$, which is constructed by the outer product of two 1-D vectors $\mathbf{u}_i$ and $\mathbf{v}_i$. We construct a rank-1 2-D filter for each output channel $i = 1, \ldots, q$,
$$W^{2D}_i = \mathbf{u}_i \mathbf{v}_i^{T}.$$
The total number of 2-D filters is $q$, where $q$ is the number of output channels. Then, the 3-D rank-1 filters can be constructed by
$$W_i = W^{2D}_i \otimes \mathbf{t}_i.$$
Furthermore, let $X_j$ denote the $j$-th 2-D channel in $X$. Then the 3-D convolution which results in the $i$-th output channel $Y_i$ can be expressed as
$$Y_i = W_i \ast X = \sum_{j} t_i(j)\,\big(W^{2D}_i \circledast X_j\big),$$
where $\ast$ and $\circledast$ denote the 3-D and the 2-D convolution operations, respectively, and $t_i(j)$ refers to the $j$-th component of the vector $\mathbf{t}_i$. Figure 3 visualizes how the 3-D rank-1 filters are constructed and how they convolve with the 2-D channels in $X$.
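The channel-wise decomposition above can be verified on a toy example. The following sketch (our own code, with assumed tiny shapes) checks that convolving with a rank-1 3-D filter $W = W^{2D} \otimes \mathbf{t}$ equals the weighted sum $\sum_j t(j)\,(W^{2D} \circledast X_j)$:

```python
# 2-D (valid) correlation, as in the separable-filter example.
def corr2d(X, W):
    fh, fw = len(W), len(W[0])
    return [[sum(W[m][n] * X[i + m][j + n] for m in range(fh) for n in range(fw))
             for j in range(len(X[0]) - fw + 1)] for i in range(len(X) - fh + 1)]

W2 = [[1.0, 0.0], [0.0, 1.0]]            # shared 2-D filter
t = [0.5, 2.0]                           # one weight per input channel
X = [[[1.0, 2.0], [3.0, 4.0]],           # channel 0
     [[5.0, 6.0], [7.0, 8.0]]]           # channel 1

# Direct 3-D correlation with w[z][x][y] = t[z] * W2[x][y] (single output pixel).
direct = sum(t[z] * W2[m][n] * X[z][m][n]
             for z in range(2) for m in range(2) for n in range(2))

# Channel-wise form: sum_j t[j] * (W2 correlated with X_j).
chanwise = sum(t[j] * corr2d(X[j], W2)[0][0] for j in range(2))
```

The equality holds for any $W^{2D}$ and $\mathbf{t}$, since the depth dimension of a rank-1 3-D filter only scales the shared 2-D filter per channel.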
As explained above, at test time we can use the trained 1-D vectors as the 1-D filters, so that at test time only 1-D convolutions are used. As has been shown in [30], when using only 1-D convolutions, the number of operations reduces to
$$X_w X_h q\,(C + F_h + F_w) \qquad (12)$$
instead of
$$X_w X_h q\,C F_h F_w,$$
where $X_w$ and $X_h$ are the width and height of the input feature map, respectively, $q$ is the number of output channels, $C$ is the number of input channels, and $F_w$ and $F_h$ are the width and height of the filter.
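For a concrete sense of the savings, the following sketch evaluates the two operation counts for an assumed example layer (the layer sizes here are illustrative, not taken from the paper's tables):

```python
# Assumed example layer: 32x32 feature map, 3 input channels,
# 64 output channels, 3x3 spatial filter.
Xw, Xh, C, q, Fh, Fw = 32, 32, 3, 64, 3, 3

ops_standard = Xw * Xh * q * C * Fh * Fw     # full 3-D convolutions
ops_rank1 = Xw * Xh * q * (C + Fh + Fw)      # three 1-D convolutions
ratio = ops_standard / ops_rank1             # speedup factor
```

The ratio $C F_h F_w / (C + F_h + F_w)$ grows with the filter size and the input depth, so the savings are larger in deeper layers with many channels.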
4.2. Training Process
Figure 4 explains the training process with the proposed network structure in detail. At every epoch of the training phase, we first take the outer product of the three 1-D vectors $\mathbf{u}$, $\mathbf{v}$, and $\mathbf{t}$. Then, we assign the result of the outer product to the weight values of the 3-D convolution filter, i.e., for every weight value $w(x,y,z)$ in the 3-D convolution filter $W$, we assign
$$w(x,y,z) = u(x)\,v(y)\,t(z),$$
where $(x,y,z)$ correspond to the 3-D coordinates in $\Omega$, the 3-D domain of the 3-D convolution filter $W$. Since the tensor constructed by the outer product of vectors always has a rank of one, the 3-D convolution filter $W$ is a rank-1 filter.
During the back-propagation phase, every weight value in $W$ will be updated by
$$w'(x,y,z) = w(x,y,z) - \eta\,\frac{\partial L}{\partial w(x,y,z)}, \qquad (15)$$
where $\frac{\partial L}{\partial w(x,y,z)}$ denotes the gradient of the loss function $L$ with respect to the weight $w(x,y,z)$, and $\eta$ is the learning rate. In standard convolutional neural networks, $w'(x,y,z)$ in (15) is the final updated weight value at each update step. However, the updated filter $W'$ is normally not a rank-1 filter. This is due to the fact that the update in (15) is done in a direction which considers only the minimization of the loss function and not the rank of the filter.
With the proposed training network structure, we take a further update step, i.e., we update the 1-D vectors $\mathbf{u}$, $\mathbf{v}$, and $\mathbf{t}$:
$$u'(x) = u(x) - \eta\,\frac{\partial L}{\partial u(x)}, \quad v'(y) = v(y) - \eta\,\frac{\partial L}{\partial v(y)}, \quad t'(z) = t(z) - \eta\,\frac{\partial L}{\partial t(z)}.$$
Here, $\frac{\partial L}{\partial u(x)}$, $\frac{\partial L}{\partial v(y)}$, and $\frac{\partial L}{\partial t(z)}$ can be calculated by the chain rule as
$$\frac{\partial L}{\partial u(x)} = \sum_{y,z} \frac{\partial L}{\partial w(x,y,z)}\,v(y)\,t(z), \quad \frac{\partial L}{\partial v(y)} = \sum_{x,z} \frac{\partial L}{\partial w(x,y,z)}\,u(x)\,t(z), \quad \frac{\partial L}{\partial t(z)} = \sum_{x,y} \frac{\partial L}{\partial w(x,y,z)}\,u(x)\,v(y).$$
At the next feed forward step, an outer product of the updated 1-D vectors $\mathbf{u}'$, $\mathbf{v}'$, and $\mathbf{t}'$ is taken to concatenate them back into a 3-D convolution filter $W''$, which we call the tensor composing step:
$$w''(x,y,z) = u'(x)\,v'(y)\,t'(z). \qquad (22)$$
As the outer product of 1-D vectors always results in a rank-1 filter, $W''$ is a rank-1 filter, as compared with $W'$ which is not. Comparing (15) with (22), we can write
$$w''(x,y,z) = w'(x,y,z) + \epsilon(x,y,z).$$
Therefore, we can say that $\epsilon$ is the incremental update which projects $W'$ back onto the rank-1 subspace. The use of rank-1 filters is not constrained to replacing the filters in the standard CNN structure, but can also replace the full-rank filters in ResNet or DenseNet-like architectures. In this case, the rank-1 filters can also reduce the parameters and accelerate the inference speed in ResNet or DenseNet architectures.
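One ATCD step can be sketched as follows. This is our own toy illustration (the loss gradient `G` is an assumed placeholder, not a real network gradient): the gradient on the 3-D weights is pushed down to the 1-D vectors by the chain rule, and the recomposed filter is rank-1 by construction:

```python
# One ATCD update step on a 2x2x2 rank-1 filter (toy sizes).
eta = 0.1
u, v, t = [1.0, 2.0], [1.0, -1.0], [0.5, 1.0]
# Placeholder gradient tensor G[x][y][z] = dL/dw(x,y,z) (assumed values).
G = [[[0.1 * (x + y + z) for z in range(2)] for y in range(2)] for x in range(2)]

# Chain rule: dL/du(x) = sum_{y,z} G[x][y][z] * v[y] * t[z], etc.
du = [sum(G[x][y][z] * v[y] * t[z] for y in range(2) for z in range(2)) for x in range(2)]
dv = [sum(G[x][y][z] * u[x] * t[z] for x in range(2) for z in range(2)) for y in range(2)]
dt = [sum(G[x][y][z] * u[x] * v[y] for x in range(2) for y in range(2)) for z in range(2)]

# Gradient step on the 1-D vectors.
u2 = [u[x] - eta * du[x] for x in range(2)]
v2 = [v[y] - eta * dv[y] for y in range(2)]
t2 = [t[z] - eta * dt[z] for z in range(2)]

# Tensor composing step: the recomposed filter is rank-1 by construction.
w2 = [[[u2[x] * v2[y] * t2[z] for z in range(2)] for y in range(2)] for x in range(2)]
```

In contrast, a plain gradient step on the full tensor $W$ would generally leave the rank-1 subspace, which is exactly what the composing step repairs.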
5. Application of the Proposed Training Method to the MobileNet
Here, we show that the proposed rank-1 network training method can also be applied to train the well-known MobileNet (version 1); the performance becomes better when the parameters are obtained by the proposed method than when they are obtained by the original MobileNet training method. The main idea is that we can extend the separate 2-D filters to 3-D filters by the outer product of rank-1 2-D filters with 1-D vectors, resulting in rank-1 3-D filters, train the rank-1 3-D filters with the ATCD training method, and then compress the rank-1 3-D filters back to full-rank 2-D filters. In the original version (version 1) of the MobileNet, the 2-D input channels are separately convolved with 2-D filters, and then combined by $1 \times 1$ convolutions. The output of a single layer of the original MobileNet can be written as
$$Z_m = \sum_{j} g_m(j)\,\big(W^{2D}_j \circledast X_j\big), \qquad (25)$$
where $Z_m$ is the $m$-th output channel, $X_j$ is the $j$-th input channel, $W^{2D}_j$ is the $j$-th filter that convolves with the $j$-th input channel, $\circledast$ is the 2-D convolution operator, and $g_m$ is the $m$-th $1 \times 1$ convolution filter that produces the $m$-th output channel. Meanwhile, the outputs $Y_i$ which are obtained by the convolutions of the 3-D rank-1 filters with the input channels, as shown in Figure 3, become
$$Y_i = \sum_{j} t_i(j)\,\big(W^{2D}_i \circledast X_j\big). \qquad (26)$$
It has to be noted that the index of $W^{2D}$ in (26) is $i$ (the index of output channels), compared with (25) where the index of $W^{2D}$ is $j$ (the index of input channels).
Now, adding an extra layer which computes linear combinations of the outputs $Y_i$ in (26) by $1 \times 1$ convolutions with the filters $g_m$, we have
$$Z_m = \sum_{i} g_m(i)\,Y_i. \qquad (27)$$
By putting (26) into (27) and rearranging the order of summation, we get
$$Z_m = \sum_{j} \sum_{i} g_m(i)\,t_i(j)\,\big(W^{2D}_i \circledast X_j\big). \qquad (28)$$
After the values of $g_m(i)$ and $t_i(j)$ are all fixed for all $m$, $i$, and $j$, i.e., after they have been trained, we can construct vectors $\mathbf{c}_{m,j}$ so that the entries in the vectors are assigned as follows:
$$c_{m,j}(i) = g_m(i)\,t_i(j).$$
Then, we can rewrite (28) as
$$Z_m = \sum_{j} \sum_{i} c_{m,j}(i)\,\big(W^{2D}_i \circledast X_j\big),$$
which can now be rewritten as
$$Z_m = \sum_{j} \big(\widetilde{W}^{2D}_{m,j} \circledast X_j\big).$$
By letting
$$\widetilde{W}^{2D}_{m,j} = \sum_{i=1}^{K} c_{m,j}(i)\,W^{2D}_i,$$
the formula becomes of the same form as that for the MobileNet described in (25). This means that after being trained by the proposed method, we can implement the inference system in the MobileNet style as well. It has to be noticed that even though each $W^{2D}_i$ is a two dimensional rank-1 filter, since it is composed of the outer product of two 1-D vectors, the filter $\widetilde{W}^{2D}_{m,j}$ can have a rank of $K$, as the summation of $K$ independent rank-1 filters results in a rank-$K$ filter. As shown in the experiments, the proposed method learns better parameters due to the over-parametrization produced by the outer product into a 3-D filter, and therefore, the MobileNet constructed by the proposed rank-1 training method has a higher accuracy than the original MobileNet, which is trained with a smaller number of parameters. Therefore, the proposed training method can contribute to obtaining MobileNets with higher classification accuracies.
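The rank-$K$ claim above can be checked on a minimal example. The sketch below (our own notation; the coefficients and vectors are assumed toy values) collapses two rank-1 2-D filters into one combined filter and verifies that the result has rank 2:

```python
# A combined filter as a weighted sum of K = 2 rank-1 filters:
# Wt = sum_i c[i] * outer(u_i, v_i).
def outer(u, v):
    return [[ux * vy for vy in v] for ux in u]

c = [1.0, 1.0]                    # combination coefficients (toy values)
u = [[1.0, 0.0], [0.0, 1.0]]      # u_1, u_2 (independent vectors)
v = [[1.0, 0.0], [0.0, 1.0]]      # v_1, v_2 (independent vectors)

Wt = [[sum(c[i] * outer(u[i], v[i])[x][y] for i in range(2))
       for y in range(2)] for x in range(2)]

# A nonzero 2x2 determinant means Wt has rank 2, even though each
# summand outer(u_i, v_i) is only rank-1.
det = Wt[0][0] * Wt[1][1] - Wt[0][1] * Wt[1][0]
```

This is the mechanism by which the collapsed depthwise filters regain full rank at inference time, even though each trained component was rank-1.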
6. Experiments
We compared the performance of the proposed ATCD training method with the conventional fixed-structure training method for the rank-1 CNN and the MobileNet on various datasets. We also compared the validation and testing accuracies with a standard full-rank CNN. We used the same number of layers for all the models, where for the fixed-structure Flattened CNN we regarded the combination of the lateral, vertical, and horizontal 1-D convolutional layers as a single layer. Furthermore, we used the same numbers of input and output channels in each layer for all the models, and also the same ReLU (rectified linear unit), batch normalization, and dropout operations.
Table 1, Table 2, Table 3 and Table 4 show the different structures of the models used for each dataset in the training stage. The outer product operation of the three 1-D filters $\mathbf{u}$, $\mathbf{v}$, and $\mathbf{t}$ into a 3-D rank-1 filter $W$ is denoted as $\otimes$ in the tables. We did not tune the structures to produce optimal performance, but only tried to make them the same for a fair comparison. Furthermore, we did not use any recent structures with extra components such as skip-connections, element-wise or channel-wise concatenations, or multi-scale filters, as in the ResNet or DenseNet, but intentionally used simple VGG-like structures with consecutive convolutional filters to isolate the effect of the proposed training method. However, to verify that the proposed training method can also be applied to the training of rank-1 filters inside a ResNet or DenseNet structure, we further performed an experiment on the CIFAR10 dataset with the ResNet structure, where we replaced all the convolutional filters with rank-1 filters and then applied our ATCD training method.
The datasets that we used in the experiments were the MNIST, CIFAR10, CIFAR100, and 'Dog and Cat' (https://www.kaggle.com/c/dogs-vs-cats) datasets. We used different structures for the different datasets, which we denote as CNN1, CNN2, CNN3, and CNN4 in the tables, corresponding to the MNIST, CIFAR10, CIFAR100, and 'Dog and Cat' datasets, respectively. The MNIST and CIFAR10 datasets both consisted of 60,000 images in 10 different classes, divided into 50,000 training images and 10,000 test images, where the images in the CIFAR10 dataset were color images of size 32 × 32, while those in the MNIST dataset were gray images of size 28 × 28. The CIFAR100 dataset consisted of 100 classes, each with 500 training and 100 test color images of size 32 × 32. The 'Dog and Cat' dataset contained 25,000 color images of dogs and cats, which we divided into 24,900 training and 100 test images for the validation test along the training. At the end of each training session, we tested the final testing accuracy on the 'Dog and Cat' dataset by taking the average over the 100 test images; the final testing accuracy was then obtained by taking the mean of 10 such sessions. For the experiments on the MNIST, CIFAR10, and CIFAR100 datasets, we trained on 50,000 images, then tested on 100 batches each consisting of 100 random test images, and calculated the overall average accuracy both for the validation along the training and for the final testing. We plotted the validation accuracy values for every training epoch in Figure 5, Figure 6, Figure 7, Figure 8 and Figure 9. The number of epochs in the figures was chosen to be greater than the number of epochs for which the validation accuracies of the proposed model sufficiently converged, resulting in graphs from which it is possible to visually compare the results of the different methods.
Figure 5, Figure 6, Figure 7, Figure 8 and Figure 9 show the validation accuracies at every epoch along the training process for a single training run of each model. Even though it is customary in deep learning to train a model only once, we performed multiple training sessions for each dataset and obtained a different model at the end of each session. Then, we calculated the means and standard deviations of the final testing accuracies of each trained model, and recorded them in Table 5. For the experiments on the MNIST, CIFAR10, and CIFAR100 datasets, 20 training sessions were performed; for the 'Dog and Cat' dataset, which took a long time to train, 10 training sessions were performed to obtain the values in Table 5. The slight differences between the testing accuracies of the different models are due to the different initializations of the convolutional filters and the randomness of the stochastic gradient descent-based backpropagation.
The rank-1 CNN trained with the proposed training method achieved slightly higher validation and testing accuracies on the MNIST dataset than the standard full-rank CNN and the Flattened CNN trained with the conventional fixed-structure training method (Figure 10 and Table 5). This may be due to the fact that the MNIST dataset is by nature low-ranked, so the proposed method, which constrains the filters to a low-rank subspace, could find the best approximation. On the CIFAR10 and CIFAR100 datasets, the accuracy was slightly lower than that of the standard CNN, which may be because the images in the CIFAR10 and CIFAR100 datasets are of higher rank than those in the MNIST dataset. However, the validation accuracy of the rank-1 CNN trained with the proposed method was higher than that of the Flattened CNN trained with the conventional fixed-structure training method on the CIFAR10 dataset, which shows that the better gradient flow in the proposed training method achieves a better solution. On the CIFAR100 dataset, the Flattened CNN could not be trained by the conventional fixed-structure training method due to the deep CNN4 structure. The 'Dog and Cat' dataset was used in the experiments to verify the validity of the proposed training method on real-sized images and on a deep structure. In this case, again, the Flattened network could not be trained with the conventional fixed-structure training method. This may be due to the limited gradient flow in deep structures with the direct fixed 1-D structure of the Flattened CNN. The standard CNN trained with the conventional training method and the rank-1 CNN trained with the proposed training method achieved similar validation accuracies, as can be seen in Figure 6. The validation accuracies for the CIFAR100 dataset are shown in Figure 7. Again, it can be seen that the rank-1 CNN trained with the proposed ATCD method achieved validation and testing accuracies similar to those of the standard CNN.
The number of operations was reduced according to (12). For example, for the first layer of the CNN3 structure, the number of operations for the standard CNN and the rank-1 CNN (type-1) became 28,901,376 and 3,512,320, respectively. Therefore, the number of computation operations in the first convolutional layer of the CNN3 model was about eight times larger for the standard CNN.
Table 6 summarizes the number of parameters for the different models and structures. We also measured the inference times of the different models in CPU and GPU environments, and recorded the values in the fifth and sixth columns of Table 5. We used TensorFlow version 0.12.1 and ran it on the Windows 10 OS, with an NVIDIA 1080Ti GPU, an Intel i9 CPU and 16 GB of RAM. It should be noted that the Flattened model and the proposed model had the same inference speed, as the structures at testing time were the same. Compared with the reduction ratio of the number of parameters, the increase in inference speed was not as high, which is mainly due to the fact that the TensorFlow framework is not optimized for factorized neural networks and that the loading of the image took a long time in the inference stage. We believe that in the future, the speed of factorized filtering will increase with a framework optimized for it.
Figure 8 compares the validation accuracies of networks with ResNet structures composed of standard 3-D filters and rank-1 3-D filters, respectively, where the rank-1 3-D filters were trained with the proposed training method. We used the 56-layer ResNet structure for CIFAR10 as proposed in [50], and replaced all the standard 3-D filters by rank-1 3-D filters in the experiments with the proposed training method. As can be seen, the rank-1 ResNet trained with the proposed method achieved a validation accuracy similar to that of the normal ResNet, which shows that the proposed training method can be applied to diverse network structures containing low-rank filters.
Finally, Figure 9 compares the validation accuracies of the MobileNet when trained with the normal method and with the proposed training method. We used the CNN4 structure and replaced all the 3-D convolution operations with depth-wise separable 2-D convolutions followed by 1 × 1 pointwise convolutions, as suggested in the MobileNet structure. It can be seen that the proposed training method accelerated the training process and achieved a better validation accuracy than the standard training method, which is due to the intentional over-parametrization obtained by the outer product in the proposed method.
Figure 11 shows the means and the positive standard deviations of the validation accuracies over the different training sessions. It can be seen that the standard deviations at the early epochs of training with the standard training method were large, while with the proposed training method the variation was not as large. However, after long training, the final validation accuracies of all training sessions converged to similar values, as shown by the standard deviation decreasing toward the end of the training.
7. Conclusions
We proposed a training method which alternately composes and decomposes the filters in the training stage for a better training of low-rank filters. As an exemplary case, we showed that a rank-1 CNN, which cannot be trained with conventional training methods, can be trained with the proposed method. We used 3-D rank-1 filters in the training phase, so that the redundancy in the filters is reduced by the rank-deficient property of the rank-1 filters. The proposed training method updates the 3-D filter parameters and projects them back onto the rank-1 subspace at each epoch to find good parameter values for the 1-D vectors which constitute the 3-D filters. At test time, the trained 1-D vectors can be used directly as 1-D filters, which filter the image by 1-D convolutions instead of 3-D convolutions for fast inference. We showed in the experiments that the accuracy of the rank-1 CNN is almost the same as that of the standard CNN, while reducing the number of parameters to about 11% and the number of operations in the convolution filters to about 12% of those of the standard CNN. We also showed that the proposed method can be used in the ResNet structure and that it can be utilized for a better training of the MobileNet. This suggests that the proposed rank-1 training method can also be used with diverse structures, such as the ResNet or other small-sized networks, to obtain more efficient networks.
In this paper, we applied the proposed training method to well-known deep learning models that can be factorized. Whether the proposed training method can be applied to other, more complicated models still leaves much room for investigation. Combining the proposed method with existing training methods and applying it to other complex deep learning models can be an additional topic of this study. Moreover, the experimental results showed that the inference time was not reduced as much as the number of parameters, which is due to the fact that existing deep learning frameworks are not optimized for factorized models. Therefore, in order to accelerate the inference speed, further studies on hardware designs that can effectively adopt factorized models should be carried out in the future.