A Training Method for Low Rank Convolutional Neural Networks Based on Alternating Tensor Compose-Decompose Method

Abstract: Over the past decade, deep learning-based computer vision methods have been shown to surpass previous state-of-the-art computer vision techniques in various fields, and have made great progress in various computer vision problems, including object detection, object segmentation, and face recognition. Nowadays, major IT companies are adding new deep learning-based computer technologies to edge devices such as smartphones. However, since the computational cost of deep learning-based models is still high for edge devices, research is being actively carried out to compress deep learning-based models without sacrificing high performance. Recently, many lightweight architectures based on low-rank approximation have been proposed for deep learning-based models. In this paper, we propose an alternating tensor compose-decompose (ATCD) method for the training of low-rank convolutional neural networks. The proposed training method can train a compressed low-rank deep learning model better than the conventional fixed-structure based training method, so that a compressed deep learning model with higher performance is obtained at the end of the training. As a representative and exemplary model to which the proposed training method can be applied, we propose a rank-1 convolutional neural network (CNN) which has a structure alternately containing 3-D rank-1 filters and 1-D filters in the training stage and a 1-D structure in the testing stage. After being trained, the 3-D rank-1 filters can be permanently decomposed into 1-D filters to achieve fast inference at test time. The reason that the 1-D filters are not trained directly in 1-D form in the training stage is that the training of the 3-D rank-1 filters is easier due to the better gradient flow, which makes the training possible even in the case when the fixed-structure network with consecutive 1-D filters cannot be trained at all.
We also show that the same training method can be applied to the well-known MobileNet architecture so that better parameters can be obtained than with the conventional fixed-structure training method. Furthermore, we show that the 1-D filters in a ResNet-like structure can also be trained with the proposed method, which shows that the proposed method can be applied to various network structures.


Introduction
Deep learning-based computer vision shows good performance in various computer vision areas such as image segmentation [1,2], image synthesis [3], facial recognition [4], classification [5], person re-identification [6], and object detection [7,8]. However, in spite of these remarkable achievements in difficult computer vision tasks, conventional deep convolutional neural networks (CNNs) use a large number of parameters, which limits their use on devices with limited resources such as smartphones and embedded systems. Even though it is known that there exists a lot of redundancy between the parameters and the feature maps in deep models, over-parametrized CNN models are used because over-parametrization makes the training of the network easier, as has been shown in the experiments in [9]. This phenomenon is believed to be due to a better gradient flow in over-parametrized networks.
Meanwhile, it has been shown in [10] that even with the use of regularization methods, there still exists excessive capacity in trained networks, which again implies that the redundancy between the parameters is still large. Therefore, much research focuses on finding a better network structure so that the parameters can be expressed in a structured subspace with a smaller number of coefficients. Research on compressing large-scale deep learning models is increasing in importance, as compressed deep learning models are needed on edge devices such as smartphones and IoT devices. While early works focused on compressing the parameters of pre-trained large-scale deep learning models [11][12][13][14][15][16][17][18][19][20][21][22], studies are also actively under way to limit the number of parameters by proposing small networks in the first place [23][24][25][26][27][28][29][30][31][32][33][34][35][36][37]. Most recently, research has turned to how to use these compressed models efficiently on edge devices [38][39][40][41]. We provide a detailed overview of these research trends in Section 2.
In this paper, we propose a training method for low-rank convolutional neural networks, which we call the alternating tensor compose-decompose (ATCD) method. The proposed training method can train compressed low-rank models better than existing training methods, thus obtaining a compressed deep learning model with higher performance. In general, when training deep learning models, the same structure of the neural network is used during the training and the testing stages. Conventional tensor-decomposition networks are trained in the decomposed form; we call this conventional approach the fixed-structure based training method. In comparison, the proposed training method does not use a fixed neural network structure in the training stage, but allows the tensors to be alternately composed and decomposed so that the gradient can flow better through the tensors in the backpropagation step. This better gradient flow results in better parameter values than with the conventional training method, so that the compressed model can achieve a higher performance. As an example of the proposed training method, we apply it to the rank-1 CNN, which is iteratively and alternately composed into a 3-D rank-1 CNN structure and decomposed into 1-D vectors in the training stage, where the 3-D rank-1 filters are constructed by the outer products of the 1-D vectors. The number of parameters in the 3-D rank-1 filters is the same as in the 3-D filters in standard CNNs, allowing a good gradient flow in the backpropagation stage. The difference from the backpropagation stage in standard CNNs is that the gradient also flows through the 1-D vectors from which the 3-D rank-1 filters are constructed, so that the parameters in the 1-D vectors are updated as well. After the backpropagation step, the 3-D filters lose their rank-1 property. However, at the next composition step, the parameters in the 3-D filters are updated again by the outer product operation and thereby projected onto the rank-1 subspace. By iterating this two-step update, all the 3-D filters in the network are trained to minimize the loss function while maintaining their rank-1 property. This is different from approaches which approximate the trained filters by a low-rank approximation after the training has finished, e.g., the low-rank approximation in [14], and from approaches which use the same fixed CNN structure in both the training and the testing stages. The composition operation is included in the training phase in our network, which directs the parameter update in a different direction from that of standard CNNs, directing the solution to live on a rank-1 subspace.
In the testing phase, we do not need the tensor composing stage anymore, and the 3-D rank-1 filters can be permanently decomposed into 1-D filters. So in the testing stage, the rank-1 CNN is now reconstructed into a 1-D rank-1 CNN structure with the trained 1-D vectors used as the 1-D filters. So the rank-1 CNN has the same accuracy as the 3-D rank-1 CNN, but has the same inference speed as the 1-D rank-1 CNN, i.e., the inference speed is exactly the same as that of the Flattened network. Moreover, with the proposed method, the network can be trained even in the case when the Flattened network cannot be trained at all. In other words, the proposed training method can be applied to train networks with very limited gradient paths due to the low rank property which cannot be trained with conventional training methods.
We also show how the same training method can be applied to the well-known MobileNet. We first show how the channel-wise filters can be expressed as a linear combination of low-rank filters, and then show how the proposed alternating tensor compose-decompose (ATCD) training method can be applied to the training of the low-rank filters. The low-rank filters are composed into the MobileNet structure again at the end of the training. Thereby, better parameters are obtained than with conventional training with the fixed MobileNet structure.

Related Works
In this section, we summarize the works related to ours in accordance with the evolving trend of research in this field. It should be noted, however, that our work differs from the related works in that we neither compress the parameters of pre-trained models nor propose a new model architecture, but propose a new training method which trains existing factorized structures to have better parameter values.

Works on Compressing the Parameters of Pre-Trained CNNs
Early works on compressing the CNN focused on how to compress the pre-trained parameters without loss of information. As has been well summarized in [42], research on the compression of deep models can be categorized into works which try to eliminate unnecessary weight parameters [11,12], works which try to compress the parameters by projecting them onto a low-rank subspace [13][14][15][16], and works which try to group similar parameters and represent them by representative features [18][19][20][21][22]. These works follow the common framework of first training the original uncompressed CNN model by back-propagation to obtain the uncompressed parameters, and then trying to find a compressed expression for these parameters to construct a new compressed CNN model.

Works on Designing a Compressed Model
In contrast to the works on compressing pre-trained parameters, research which tries to restrict the number of parameters in the first place by proposing small networks is also actively in progress. However, as mentioned above, the reduction in the number of parameters changes the gradient flow, so the networks have to be designed carefully to achieve a trained network with good performance. For example, MobileNets [23] and Xception networks [24] use depthwise separable convolution filters, while SqueezeNet [25] uses a bottleneck approach to reduce the number of parameters. MobileNet was modified in its version 2 model using inverted residuals [26]. Recently, Google announced EfficientNet [27], which scales up the MobileNet and the ResNet to obtain a new family of efficient CNN models, while CondenseNet [28] and ShuffleNet [29] use group convolutions to reduce the number of convolutions. Other models use 1-D filters to reduce the size of networks, such as the highly factorized Flattened network [30], or the models in [31] where 1-D filters are used together with other filters of different sizes. Recently, Google's Inception model has also adopted 1-D filters in version 4 [43]. One difficulty in using 1-D filters is that they are not easy to train, and therefore they are used only partially, as in Google's Inception model or in the models in [31], except for the Flattened network, which consists of consecutive 1-D filters only. However, only three layers of 1-D filters are used in the experiments in [30], which may be due to the difficulty of training 1-D filters with many layers.
Until now, all the efficient neural network architectures have been developed manually by human experts. Though still at an early stage, research is ongoing on automatically searching for architectures that are efficient and satisfy resource or computation constraints [32][33][34][35][36][37]. However, it is known that such an automatic neural architecture search is extremely difficult, and therefore, in practice, manually designed architectures are still widely used.

Works on Edge AI
Now that many efficient and compressed CNN architectures have been proposed, many researchers and major IT companies are focusing on how to move these compressed models to edge devices, as customers spend more time on mobile devices [38][39][40][41][44][45][46][47][48]. In particular, in [38], Eshratifar et al. propose an efficient training for intelligent mobile cloud computing services, while in [39] Li et al. propose how to accelerate DNN inference via edge computing. In [40], a deep learning architecture for intelligent mobile cloud computing services called BottleNet is proposed, which reduces the size of the features that need to be sent to the cloud. Furthermore, Bateni and Lie propose a timing-predictable runtime system that is able to guarantee deadlines of DNN workloads via efficient approximation [41]. Edge AI research and algorithm development in both commercial and academic laboratories are expected to be very active in the next three to five years.

Preliminaries for the Proposed Method
The following works are prerequisites for understanding the proposed method. The work on bilateral-projection based 2-D principal component analysis (B2DPCA) gave us the insight for the bilateral filters and the tensor compose-decompose procedure, which we utilize to train the rank-1 Net.

Bilateral-Projection Based 2DPCA
In [49], a bilateral-projection based 2-D principal component analysis (B2DPCA) has been proposed, which minimizes the following energy functional:

$$[P_{opt}, Q_{opt}] = \arg\min_{P,Q} \left\| X - P C Q^{T} \right\|_F^2, \tag{1}$$

where $X \in R^{m \times n}$ is the two-dimensional image, $P \in R^{m \times l}$ and $Q \in R^{n \times r}$ are the left- and right-multiplying projection matrices, respectively, and $C = P^{T} X Q$ is the extracted feature matrix for the image $X$. The optimal projection matrices $P_{opt}$ and $Q_{opt}$ are simultaneously constructed, where $P_{opt}$ projects the column vectors of $X$ to a subspace, while $Q_{opt}$ projects the row vectors of $X$ to another one. It has been shown in [49] that the advantage of the bilateral projection over the unilateral-projection scheme is that $X$ can be represented effectively with a smaller number of coefficients than in the unilateral case, i.e., a small-sized matrix $C$ can well represent the image $X$. This means that the bilateral projection effectively removes the redundancies among both the rows and the columns of the image. Furthermore, it can be seen that the components of $C$ are the 2-D projections of the image $X$ onto the 2-D planes $p_1 q_1^{T}, p_1 q_2^{T}, \ldots, p_l q_r^{T}$ made up by the outer products of the column vectors of $P$ and $Q$. The 2-D planes have a rank of one, since they are the outer products of two 1-D vectors. Therefore, the fact that $X$ can be well represented by a small-sized $C$ also implies that $X$ can be well represented by a few rank-1 2-D planes, i.e., by only a few 1-D vectors $p_1, \ldots, p_l, q_1, \ldots, q_r$, where $l \ll m$ and $r \ll n$.
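The statement that the entries of C are projections of X onto rank-1 planes can be checked numerically. The following is a minimal sketch, not the B2DPCA algorithm itself: the sizes are arbitrary, the shapes are chosen so that the product P^T X Q is well defined, and random orthonormal matrices stand in for the learned P_opt and Q_opt (all of these are assumptions made for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
rows, cols, l, r = 6, 5, 2, 3                 # illustrative sizes (assumed)
X = rng.standard_normal((rows, cols))         # stand-in for a 2-D image

# Random orthonormal bases stand in for the learned P_opt and Q_opt (assumption).
P, _ = np.linalg.qr(rng.standard_normal((rows, l)))
Q, _ = np.linalg.qr(rng.standard_normal((cols, r)))

C = P.T @ X @ Q                               # l x r feature matrix

# planes[a, b] is the rank-1 plane p_a q_b^T built from columns of P and Q.
planes = np.einsum("ia,jb->abij", P, Q)
# The inner product of X with every plane reproduces C entry by entry.
C_from_planes = np.einsum("abij,ij->ab", planes, X)

# Approximation of X from only l * r coefficients, written in two equivalent ways.
X_hat = P @ C @ Q.T
X_hat_from_planes = np.einsum("ab,abij->ij", C, planes)

print(np.allclose(C, C_from_planes))
print(np.linalg.matrix_rank(planes[0, 0]))
```

With learned rather than random bases, the same l * r coefficients would be the ones that minimize the approximation error of X.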
In the case of (1), the learned 2-D planes try to minimize the loss function, i.e., they learn to best approximate $X$. A natural question is whether good rank-1 2-D planes can also be obtained for other loss functions, e.g., loss functions related to the image classification problem, such as a loss that measures the discrepancy between $y_{true}$ and $y(X, P, C, Q)$, where $y_{true}$ denotes the true classification label for a certain input image $X$, and $y(X, P, C, Q)$ is the output of the network constituted by the outer products of the column vectors in the learned matrices $P$ and $Q$. In this paper, we extend this case to the rank-1 3-D filter case, where the rank-1 3-D filter is constituted as the outer product of three column vectors from three different learned matrices, and show that good parameters can be learned for the image classification task. Furthermore, these learned rank-1 3-D filters can be decomposed into 1-D filters for fast inference.

Flattened Convolutional Neural Networks
In [30], the 'Flattened CNN' has been proposed for fast feed-forward execution by separating the conventional 3-D convolution filter into three consecutive 1-D filters. The 1-D filters sequentially convolve the input over different directions, i.e., the lateral, horizontal, and vertical directions. Figure 1 shows the network structure of the Flattened CNN. The Flattened CNN uses the same network structure in both the training and the testing phases. This is in contrast to our proposed model, where we use a different network structure in the training phase, as will be seen later. However, the consecutive use of 1-D filters in the training phase makes the training difficult. One reason is that the gradient path becomes longer than in a normal CNN, and therefore the gradient vanishes faster while the error accumulates more. Another reason is that the reduction in the number of parameters causes a gradient flow different from that of the standard CNN, with which it is more difficult to find an appropriate solution for the parameters. This coincides with the experiments in [9], which show that the gradient flow in a network with a small number of parameters cannot find good parameters. Therefore, a particular weight initialization method has to be used together with this setting. Furthermore, in [30], the networks in the experiments have only three layers of convolution, which may be due to the difficulty of training networks with more layers.
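The separation of a 3-D filter into three consecutive 1-D filters can be illustrated with a small NumPy check. This is a sketch under stated assumptions, not the implementation of [30]: it uses "valid" cross-correlation (the convention of most deep learning frameworks), applies no nonlinearity or bias between the 1-D stages, and all sizes and names are illustrative.

```python
import numpy as np

def corr3d_valid(x, w):
    """Valid cross-correlation of x (C, H, W) with one filter w (C, fh, fw)."""
    _, H, W = x.shape
    _, fh, fw = w.shape
    out = np.empty((H - fh + 1, W - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[:, i:i + fh, j:j + fw] * w)
    return out

rng = np.random.default_rng(0)
C, H, W, fh, fw = 3, 8, 8, 3, 3               # illustrative sizes (assumed)
x = rng.standard_normal((C, H, W))
t = rng.standard_normal(C)                    # lateral filter (across channels)
p = rng.standard_normal(fh)                   # vertical filter
q = rng.standard_normal(fw)                   # horizontal filter

# One 3-D convolution with the rank-1 filter built from t, p and q.
w3d = np.einsum("c,i,j->cij", t, p, q)
y_3d = corr3d_valid(x, w3d)

# Three consecutive 1-D convolutions: lateral, then vertical, then horizontal.
lateral = np.tensordot(t, x, axes=(0, 0))                                 # (H, W)
vertical = sum(p[a] * lateral[a:a + H - fh + 1, :] for a in range(fh))    # (H-fh+1, W)
y_1d = sum(q[b] * vertical[:, b:b + W - fw + 1] for b in range(fw))       # (H-fh+1, W-fw+1)

print(np.allclose(y_3d, y_1d))
```

The two outputs agree because the rank-1 filter factorizes exactly; the training difficulty discussed above concerns how gradients reach the three 1-D stages, not the forward computation.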

Application of the Proposed Training Method to the Rank-1 CNN
As an example of how the tensor compose-decompose method can be applied to train low-rank CNNs, we apply the proposed training method to the rank-1 CNN, which is composed only of rank-1 convolutional filters. In comparison with other CNN models which use 1-D rank-1 filters, we propose the use of 3-D rank-1 filters ($W$) in the training stage, where the 3-D rank-1 filters are constructed by the outer product of three 1-D vectors, say $p$, $q$, and $t$:

$$W = p \otimes q \otimes t.$$

This is an extension of the 2-D rank-1 planes used in the B2DPCA, where the 2-D planes are constructed by $W = p \otimes q = p q^{T}$. Figure 2 shows the training and the testing phases of the proposed method. The structure of the proposed network is different for the training phase and the testing phase. In comparison with the Flattened network (Figure 1), in the training phase, the gradient first flows through the 3-D rank-1 filters and then through the 1-D vectors. Therefore, the gradient flow is different from that of the Flattened network, resulting in a different and better solution for the parameters in the 1-D vectors. With the proposed method, a solution can be obtained even in large networks for which the gradient flow in the Flattened network cannot obtain a solution at all. Furthermore, at test time, i.e., at the end of optimization, we can use the 1-D vectors directly as 1-D filters in the same manner as in the Flattened network, resulting in the same inference speed and number of operations as the Flattened network (Figure 2).

Construction of the 3-D Rank-1 Filters
We first observe that a 2-D convolution can be seen as shifting inner products, where each component $y(r)$ at position $r$ of the output matrix $Y$ is computed as the inner product of a 2-D filter $W$ and the image patch $X(r)$ centered at $r$:

$$y(r) = \langle W, X(r) \rangle.$$

If $W$ is constructed by the outer product of two 1-D vectors $p$ and $q$, i.e., $W = p \otimes q = p q^{T}$, then $W$ becomes a 2-D rank-1 filter. In this case, it can be observed that

$$y(r) = \langle W, X(r) \rangle = \langle p q^{T}, X(r) \rangle = p^{T} X(r)\, q.$$
As has been explained in the case of B2DPCA, since p is multiplied to the rows of X(r), p tries to extract the features from the rows of X(r) which can minimize the loss function. That is, p searches the rows in all patches X(r), ∀ r for some common features which can reduce the loss function, while q looks for the features in the columns of the patches. In analogy to the B2DPCA, this bilateral projection removes the redundancies among the rows and columns in the 2-D filters.
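The identity between the inner product with a rank-1 filter and the bilinear form in p and q can be verified directly. The sketch below uses arbitrary sizes and random data (assumptions for illustration) and evaluates only the positions where the 3 x 3 patch fits inside the image.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(1)
p = rng.standard_normal(3)
q = rng.standard_normal(3)
W = np.outer(p, q)                            # 2-D rank-1 filter p q^T

image = rng.standard_normal((7, 7))           # illustrative input (assumed)
patches = sliding_window_view(image, W.shape) # patches[a, b] is one 3 x 3 patch X(r)

# Inner product <W, X(r)> at every valid position r.
y_inner = np.einsum("abij,ij->ab", patches, W)
# Bilinear form p^T X(r) q at every valid position r.
y_bilinear = np.einsum("i,abij,j->ab", p, patches, q)

print(np.allclose(y_inner, y_bilinear))
```

In the bilinear form, p acts on one axis of every patch and q on the other, which is the bilateral projection described above.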
In convolutional neural networks, the input $X_{3D}$ and the convolutional filter $W_{3D}$ are both three-dimensional, where the third dimension refers to the depth of the input, i.e., the number of input channels. In this case, the 3-D rank-1 filter $W_{3D}$ is constructed by the outer product of three 1-D vectors $p$, $q$, and $t$, where the length of $t$ is the same as the depth of the input $X_{3D}$. In analogy to the B2DPCA, the 3-D rank-1 filters which are learned by the three-dimensional multilateral projection will have fewer redundancies among the rows, the columns, and the channels than the normal 3-D filters in standard CNNs.
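As a small illustration of this construction, the sketch below composes one 3-D rank-1 filter from three 1-D vectors and checks that every unfolding of it has matrix rank one. The filter size 3 x 3 and the depth 16 are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
fh, fw, depth = 3, 3, 16                      # illustrative filter size and input depth
p = rng.standard_normal(fh)
q = rng.standard_normal(fw)
t = rng.standard_normal(depth)                # length of t equals the input depth

# w[i, j, k] = p[i] * q[j] * t[k]
W3d = np.einsum("i,j,k->ijk", p, q, t)

free_params = p.size + q.size + t.size        # parameters actually learned
dense_params = W3d.size                       # weights held by the composed filter

# Unfold the tensor along each axis; a rank-1 tensor gives rank-1 matrices.
ranks = [
    np.linalg.matrix_rank(np.moveaxis(W3d, axis, 0).reshape(W3d.shape[axis], -1))
    for axis in range(3)
]
print(free_params, dense_params, ranks)
```

The composed filter holds as many weights as a standard 3-D filter of the same size, while only the entries of p, q, and t are free.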
The three-dimensional convolution of the 3-D rank-1 filter $W_{3D}$ and $X_{3D}$ can be expressed as a sum of channel-wise 2-D convolutions. Let $W_{2D}^{(i)}$ be the 2-D rank-1 filter associated with the output channel $Y^{(i)}$, which is constructed by the outer product of two 1-D vectors $p_l$ and $q_r$. We construct a rank-1 2-D filter for each output channel $Y^{(i)}$,

$$W_{2D}^{(i)} = p_l \otimes q_r = p_l q_r^{T}, \quad l = 1, \ldots, m, \; r = 1, \ldots, n. \tag{9}$$

The total number of 2-D filters is $q = m \times n$, where $q$ is the number of output channels. Then, the 3-D rank-1 filters can be constructed by

$$W_{3D}^{(i)} = W_{2D}^{(i)} \otimes t_i, \quad i = 1, \ldots, q.$$

Furthermore, let $X^{(j)}$ denote the $j$-th 2-D channel in $X_{3D}$. Then the 3-D convolution which results in the $i$-th output channel $Y^{(i)}$ can be expressed as

$$Y^{(i)} = W_{3D}^{(i)} * X_{3D} = \sum_{j} t_i[j] \left( W_{2D}^{(i)} \circ X^{(j)} \right),$$

where $*$ and $\circ$ denote the 3-D and the 2-D convolution operations, respectively, and $t_i[j]$ refers to the $j$-th component of the vector $t_i$. Figure 3 visualizes how the 3-D rank-1 filters are constructed and how they convolve with the 2-D channels in $X_{3D}$. As explained above, at test time, we can use the trained 1-D vectors as the 1-D filters, so that only 1-D convolutions are used at test time. As has been shown in [30], when using only 1-D convolutions, the number of operations reduces to

$$I_x \times I_y \times (C_{out} + f_x + f_y) \tag{12}$$

instead of

$$I_x \times I_y \times C_{out} \times f_x \times f_y,$$

where $I_x$ and $I_y$ are the width and height of the input feature map, respectively, $C_{out}$ is the number of output channels, and $f_x$ and $f_y$ are the width and height of the filter.
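The decomposition of the 3-D convolution into channel-wise 2-D convolutions weighted by the entries of t can be sketched as follows for a single output channel. The helper `corr2d_valid`, the sizes, and the random data are assumptions made for illustration, and "valid" cross-correlation is used in place of convolution, as is common in deep learning frameworks.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def corr2d_valid(x, w):
    """Valid 2-D cross-correlation of one channel x with a 2-D filter w."""
    windows = sliding_window_view(x, w.shape)          # (H', W', fh, fw)
    return np.einsum("abij,ij->ab", windows, w)

rng = np.random.default_rng(3)
depth, H, Wd = 4, 6, 6                                 # illustrative sizes (assumed)
X3d = rng.standard_normal((depth, H, Wd))
p, q = rng.standard_normal(3), rng.standard_normal(3)
t = rng.standard_normal(depth)

W2d = np.outer(p, q)                                   # 2-D rank-1 filter
W3d = np.einsum("k,ij->kij", t, W2d)                   # 3-D rank-1 filter

# Direct 3-D correlation: the filter spans the full depth, so the depth axis collapses.
windows3d = sliding_window_view(X3d, W3d.shape)[0]     # (H', W', depth, fh, fw)
Y_direct = np.einsum("abkij,kij->ab", windows3d, W3d)

# Sum of channel-wise 2-D correlations, weighted by the entries of t.
Y_sum = sum(t[j] * corr2d_valid(X3d[j], W2d) for j in range(depth))

print(np.allclose(Y_direct, Y_sum))
```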
Figure 3. The 3-D rank-1 filters $W_{3D}$ that are convolved with the 3-D input $X_{3D}$ are constructed by the outer products of the 2-D filters $W_{2D}$ and the 1-D filters $t_1, t_2, \ldots, t_q$, respectively, where the 2-D filters are again constructed by the outer products of the 1-D filters $p_1, p_2, \ldots, p_m$ and $q_1, q_2, \ldots, q_n$ according to Equation (9).

Training Process
Figure 4 explains the training process with the proposed network structure in detail. At every epoch of the training phase, we first take the outer product of the three 1-D vectors $p$, $q$, and $t$. Then, we assign the result of the outer product to the weight values of the 3-D convolution filter, i.e., for every weight value in the 3-D convolution filter $W$, we assign

$$w_{i,j,k} = p_i\, q_j\, t_k, \quad \forall (i,j,k) \in \Omega(W),$$

where $i, j, k$ correspond to the 3-D coordinates in $\Omega(W)$, the 3-D domain of the 3-D convolution filter $W$. Since a tensor constructed by the outer product of vectors always has a rank of one, the 3-D convolution filter $W$ is a rank-1 filter. During the back-propagation phase, every weight value in $W$ will be updated by

$$w'_{i,j,k} = w_{i,j,k} - \alpha \frac{\partial L}{\partial w_{i,j,k}}, \tag{15}$$

where $\frac{\partial L}{\partial w_{i,j,k}}$ denotes the gradient of the loss function $L$ with respect to the weight $w_{i,j,k}$, and $\alpha$ is the learning rate. In standard convolutional neural networks, $w'_{i,j,k}$ in (15) is the final updated weight value at each update step. However, the updated filter $W'$ normally is not a rank-1 filter. This is because the update in (15) is made in the direction which considers only the minimization of the loss function and not the rank of the filter.
With the proposed training network structure, we take a further update step, i.e., we update the 1-D vectors $p$, $q$, and $t$:

$$p'_i = p_i - \alpha \frac{\partial L}{\partial p_i}, \qquad q'_j = q_j - \alpha \frac{\partial L}{\partial q_j}, \qquad t'_k = t_k - \alpha \frac{\partial L}{\partial t_k}.$$

At the next feed-forward step, an outer product of the updated 1-D vectors $p'$, $q'$, and $t'$ is taken to combine them back into a 3-D convolution filter $W''$, which we call the tensor composing step:

$$w''_{i,j,k} = p'_i\, q'_j\, t'_k = w_{i,j,k} - \alpha \Delta_{i,j,k}, \tag{22}$$

where

$$\Delta_{i,j,k} = q_j t_k \frac{\partial L}{\partial p_i} + p_i t_k \frac{\partial L}{\partial q_j} + p_i q_j \frac{\partial L}{\partial t_k} - \alpha \left( t_k \frac{\partial L}{\partial p_i} \frac{\partial L}{\partial q_j} + q_j \frac{\partial L}{\partial p_i} \frac{\partial L}{\partial t_k} + p_i \frac{\partial L}{\partial q_j} \frac{\partial L}{\partial t_k} \right) + \alpha^2 \frac{\partial L}{\partial p_i} \frac{\partial L}{\partial q_j} \frac{\partial L}{\partial t_k}.$$

As the outer product of 1-D vectors always results in a rank-1 filter, $W''$ is a rank-1 filter, as compared with $W'$, which is not. Comparing (15) with (22), we get

$$w''_{i,j,k} = w'_{i,j,k} - \alpha \left( \Delta_{i,j,k} - \frac{\partial L}{\partial w_{i,j,k}} \right).$$

Therefore, we can say that $\Delta_{i,j,k} - \frac{\partial L}{\partial w_{i,j,k}}$ is the incremental update vector which projects $W'$ back onto the rank-1 subspace.
The use of rank-1 filters is not restricted to replacing the filters in the standard CNN structure; they can also replace the full-rank filters in ResNet or DenseNet-like architectures. In this case, the rank-1 filters can also reduce the parameters and accelerate the inference speed in ResNet or DenseNet architectures.
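The compose, update, and re-compose cycle described in this section can be sketched in NumPy for a single filter. This is a minimal illustration under stated assumptions rather than the authors' implementation: the quadratic loss, the `target` tensor, the learning rate, and the function names `compose` and `atcd_step` are all hypothetical stand-ins, and the gradients with respect to p, q, and t are obtained from the gradient with respect to the composed filter by the chain rule.

```python
import numpy as np

def compose(p, q, t):
    """Tensor composing step: w[i, j, k] = p[i] * q[j] * t[k]."""
    return np.einsum("i,j,k->ijk", p, q, t)

def atcd_step(p, q, t, grad_fn, lr):
    """One compose / back-propagate / update cycle for a single filter."""
    W = compose(p, q, t)
    G = grad_fn(W)                               # dL/dW from ordinary back-propagation
    gp = np.einsum("ijk,j,k->i", G, q, t)        # chain rule: dL/dp
    gq = np.einsum("ijk,i,k->j", G, p, t)        # chain rule: dL/dq
    gt = np.einsum("ijk,i,j->k", G, p, q)        # chain rule: dL/dt
    return p - lr * gp, q - lr * gq, t - lr * gt, (W, G, gp, gq, gt)

def unfold_rank(W):
    """Matrix rank of the unfolding of W along its first axis."""
    return np.linalg.matrix_rank(W.reshape(W.shape[0], -1))

rng = np.random.default_rng(4)
p, q, t = rng.standard_normal(3), rng.standard_normal(3), rng.standard_normal(8)
target = rng.standard_normal((3, 3, 8))          # hypothetical stand-in for the task
grad_fn = lambda W: W - target                   # gradient of 0.5 * ||W - target||^2
lr = 0.05

p2, q2, t2, (W, G, gp, gq, gt) = atcd_step(p, q, t, grad_fn, lr)
W_std = W - lr * G                               # standard update of the 3-D weights
W_new = compose(p2, q2, t2)                      # filter re-composed from updated vectors

# Expansion of the re-composed filter as W - lr * delta.
first = compose(gp, q, t) + compose(p, gq, t) + compose(p, q, gt)
second = compose(gp, gq, t) + compose(gp, q, gt) + compose(p, gq, gt)
delta = first - lr * second + lr**2 * compose(gp, gq, gt)

print(np.allclose(W_new, W - lr * delta))
print(unfold_rank(W_std), unfold_rank(W_new))
```

In this sketch the standard update generally leaves the rank-1 subspace, while the re-composed filter stays in it, and the two differ by the learning rate times (delta - G), matching the comparison made in the text.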

Application of the Proposed Training Method to the MobileNet
Here, we show that the proposed rank-1 network training method can also be applied to train the well-known MobileNet (version 1). Moreover, the performance is better when the parameters are obtained by the proposed method than when they are obtained by the original MobileNet type training method. The main idea of applying the proposed training method to the training of the MobileNet is that we can extend the separate 2-D filters to 3-D filters by the outer product of rank-1 2-D filters with 1-D vectors, resulting in rank-1 3-D filters, train the rank-1 3-D filters with the ATCD training method, and then compress the rank-1 3-D filters back to full-rank 2-D filters. In the original version (version 1) of the MobileNet, the 2-D input channels are separately convolved with 2-D filters, and the results are then combined by 1 × 1 convolutions. The output of a single layer of the original version of the MobileNet can be written as

$$Y^{(m)} = \sum_{j} a_m[j] \left( W^{(j)} \circ X^{(j)} \right), \tag{25}$$

where $Y^{(m)}$ is the $m$-th output channel, $X^{(j)}$ is the $j$-th input channel, $W^{(j)}$ is the $j$-th filter that convolves with the $j$-th input channel, $\circ$ is the 2-D convolution operator, and $a_m$ is the $m$-th 1 × 1 convolution filter that produces the $m$-th output channel. Meanwhile, the outputs $\hat{Y}^{(i)}$, $i = 1, \ldots, K$, which are obtained by the convolutions of the 3-D rank-1 filters and the input channels as shown in Figure 3, become

$$\hat{Y}^{(i)} = \sum_{j} t_i[j] \left( \hat{W}^{(i)} \circ X^{(j)} \right). \tag{26}$$

It has to be noted that the index of $\hat{W}^{(i)}$ in (26) is $i$ (the index of output channels), compared to (25) where the index of $W^{(j)}$ is $j$ (the index of input channels). Now, adding an extra layer which computes the linear combinations of the outputs $\hat{Y}^{(i)}$, $i = 1, \ldots, K$ in (26) by 1 × 1 convolutions with the filters $k_m$, $m = 1, \ldots, q$, we have

$$Y^{(m)} = \sum_{i=1}^{K} k_m[i]\, \hat{Y}^{(i)}. \tag{27}$$

By putting (26) into (27) and rearranging the order of summation, we get

$$Y^{(m)} = \sum_{j} \sum_{i=1}^{K} k_m[i]\, t_i[j] \left( \hat{W}^{(i)} \circ X^{(j)} \right). \tag{28}$$

After the values of $k_m[i]$ and $t_i[j]$ are all fixed for all $m$, $i$, and $j$, i.e., after they have been trained, we can arbitrarily construct the vectors $a_m$, $m = 1, \ldots, q$ and $b_j$, $j = 1, \ldots, N$ so that the entries in the vectors are assigned as follows:

$$a_m[j]\, b_j[i] = k_m[i]\, t_i[j].$$

Then, we can rewrite (28) as

$$Y^{(m)} = \sum_{j} \sum_{i=1}^{K} a_m[j]\, b_j[i] \left( \hat{W}^{(i)} \circ X^{(j)} \right),$$

which can now be rewritten as

$$Y^{(m)} = \sum_{j} a_m[j] \left( \left( \sum_{i=1}^{K} b_j[i]\, \hat{W}^{(i)} \right) \circ X^{(j)} \right).$$

By letting

$$W^{(j)} = \sum_{i=1}^{K} b_j[i]\, \hat{W}^{(i)},$$

the formula becomes the same as that for the MobileNet described in (25). This means that after being trained by the proposed method, the inference system can also be implemented in the MobileNet style. It has to be noted that even though $\hat{W}^{(i)}$ is a two-dimensional rank-1 filter, since it is composed of the outer product of two 1-D vectors, the filter $W^{(j)}$ can have a rank of $K$, as the summation of $K$ independent rank-1 filters results in a rank-$K$ filter. As shown in the experiments, the proposed method learns better parameters due to the over-parametrization produced by the outer product into a 3-D filter, and therefore the MobileNet constructed by the proposed rank-1 training method has a higher accuracy than the original MobileNet, which is trained with a smaller number of parameters. Therefore, the proposed training method can contribute to obtaining MobileNets with higher classification accuracies.
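The rearrangement of the order of summation that leads to (28) relies only on the linearity of convolution, which the sketch below checks numerically. It is an illustration under stated assumptions: the sizes, the array names (`T`, `Kmix`, `W_hat`, `W_eff`), and the helper `corr2d_valid` are hypothetical, and "valid" cross-correlation is used. The further step of writing the coefficients as a product of the vectors a_m and b_j is not covered here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def corr2d_valid(x, w):
    """Valid 2-D cross-correlation of one channel x with a 2-D filter w."""
    windows = sliding_window_view(x, w.shape)
    return np.einsum("abij,ij->ab", windows, w)

rng = np.random.default_rng(5)
N, K, M = 3, 3, 2                        # input channels, rank-1 filters, output channels
X = rng.standard_normal((N, 6, 6))
P = rng.standard_normal((K, 3))          # row i holds the vector p of filter i
Q = rng.standard_normal((K, 3))          # row i holds the vector q of filter i
T = rng.standard_normal((K, N))          # T[i, j] plays the role of t_i[j]
Kmix = rng.standard_normal((M, K))       # Kmix[m, i] plays the role of k_m[i]

W_hat = np.einsum("ia,ib->iab", P, Q)    # K rank-1 2-D filters

# Two-layer form: rank-1 3-D convolutions followed by 1 x 1 mixing, as in (26) and (27).
Y_hat = [sum(T[i, j] * corr2d_valid(X[j], W_hat[i]) for j in range(N)) for i in range(K)]
Y_two_layer = [sum(Kmix[m, i] * Y_hat[i] for i in range(K)) for m in range(M)]

# Rearranged form, as in (28): the sum over i is folded into one 2-D filter per (m, j).
W_eff = np.einsum("mi,ij,iab->mjab", Kmix, T, W_hat)
Y_rearranged = [sum(corr2d_valid(X[j], W_eff[m, j]) for j in range(N)) for m in range(M)]

print(all(np.allclose(a, b) for a, b in zip(Y_two_layer, Y_rearranged)))
print(np.linalg.matrix_rank(W_hat[0]), np.linalg.matrix_rank(W_eff[0, 0]))
```

Each folded filter is a sum of K rank-1 filters, so its rank can exceed one even though every trained 2-D filter is rank-1.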

Experiments
We compared the performance of the proposed ATCD training method with the conventional fixed-structure training method for the rank-1 CNN and the MobileNet on various datasets. We also compared the validation and testing accuracies with those of a standard full-rank CNN. We used the same number of layers for all the models, where for the fixed-structure Flattened CNN we regarded the combination of the lateral, vertical, and horizontal 1-D convolutional layers as a single layer. Furthermore, we used the same numbers of input and output channels in each layer for all the models, and also the same ReLU (Rectified Linear Unit), batch normalization, and dropout operations. Tables 1-4 show the different structures of the models used for each dataset in the training stage. The outer product operation of the three 1-D filters p, q, and t into a 3-D rank-1 filter w is denoted as w = p ⊗ q ⊗ t in the tables. We did not tune the structures to produce optimal performance, but only tried to make them the same for a fair comparison. Furthermore, we did not use any recent structures with extra components like skip-connections, element-wise or channel-wise concatenations, or multi-scale filters, such as the ResNet or DenseNet, but intentionally used simple VGG-like structures with simple consecutive convolutional filters to see only the effect of the proposed training method. However, to verify that the proposed training method can also be applied to the training of rank-1 filters inside a ResNet or DenseNet structure, we further performed an experiment on the CIFAR10 dataset with the ResNet structure, where we replaced all the convolutional filters with rank-1 filters and then applied our ATCD training method.

The datasets that we used in the experiments were the MNIST, the CIFAR10, the CIFAR100, and the 'Dog and Cat' datasets (https://www.kaggle.com/c/dogs-vs-cats). We used different structures for different datasets, which we denoted as CNN1, CNN2, CNN3, and CNN4 in the tables, corresponding to the MNIST, CIFAR10, CIFAR100, and 'Dog and Cat' datasets, respectively. The MNIST and the CIFAR10 datasets both consisted of 60,000 images in 10 different classes, divided into 50,000 training images and 10,000 test images, where the images in the CIFAR10 dataset were colour images of size 32 × 32, while those in the MNIST dataset were gray images of size 28 × 28. The CIFAR100 dataset consisted of 100 classes, each with 500 training and 100 test colour images of size 32 × 32. The 'Dog and Cat' dataset contained 25,000 colour images of dogs and cats of size 224 × 224, which we divided into 24,900 training and 100 test images for the validation test during the training. At the end of each training session, we computed the final testing accuracy on the 'Dog and Cat' dataset as the average over the 100 test images. Then, the final testing accuracy was obtained by taking the mean over 10 such sessions. For the experiments on the MNIST, the CIFAR10, and the CIFAR100 datasets, we trained on 50,000 images, tested on 100 batches each consisting of 100 random test images, and calculated the overall average accuracy both for the validation during the training and for the final testing. We plotted the validation accuracy values for every training epoch in Figures 5-9. The number of epochs in the figures was chosen to be greater than the number of epochs at which the validation accuracies with the proposed model had sufficiently converged, which resulted in graphs from which the results of the different methods can be compared visually. Figures 5-9 show the validation accuracies for every epoch of the training process for a single training run of each model.
Even though it is customary to learn a model only once in deep learning, we performed multiple training sessions for each dataset, and obtained different models at the end of each training. Then, we calculated the means and the standard deviations of the different final accuracy values of the testing datasets for each trained model, and recorded them in Table 5. For the experiments on MNIST, CIFAR10, and CIFAR100 datasets, 20 training sessions were performed, and for the 'Dog and Cat' dataset, which took a long time to train, 10 training sessions were performed to obtain the values in Table 5. The slight differences between the testing accuracies of the different models are due to the different initialization of convolutional filters and the randomness of the stochastic gradient descent-based backpropagation. The rank-1 CNN trained with the proposed training method achieved slightly larger validation and testing accuracies on the MNIST dataset than the standard full-rank CNN and the Flattened CNN trained with the conventional fixed-structure training method ( Figure 10 and Table 5). This is maybe due to the fact that the MNIST dataset was in its nature a low-ranked one, for which the proposed method could find the best approximation since the proposed method constrains the filters to a low rank sub-space. With the CIFAR10 and the CIFAR100 dataset, the accuracy was slightly less than that of the standard CNN which is maybe due to the fact that the images in the CIFAR10 and the CIFAR100 datasets were of higher ranks than those in the MNIST dataset. However, the validation accuracy of the rank-1 CNN trained with the proposed method was higher than that of the Flattened CNN trained with the conventional fixed structure training method on the CIFAR10 dataset which shows the fact that the better gradient flow in the proposed training method achieves a better solution. 
With the CIFAR100 dataset, the Flattened CNN could not be trained by the conventional fixed-structure training method due to the depth of the CNN4 structure. The 'Dog and Cat' dataset was used in the experiments to verify the validity of the proposed training method on real-sized images and on a deep structure. In this case, again, the Flattened network could not be trained with the conventional fixed-structure training method. This may be due to the limited gradient flow in deep structures built directly from the fixed 1-D filters of the Flattened CNN. The standard CNN trained with the conventional training method and the proposed rank-1 CNN trained with the proposed training method achieved similar validation accuracies, as can be seen in Figure 6. The validation accuracies for the CIFAR100 dataset are shown in Figure 7. Again, it can be seen that the rank-1 CNN trained with the proposed ATCD method achieved validation and testing accuracies similar to those of the standard CNN.

Table (columns: Standard CNN, Flattened CNN, Proposed CNN)
Conv1: 64 filters [filter constitution per column not recoverable]

Table (columns: Standard CNN, Proposed CNN) [filter constitution per column not recoverable]
Conv1: 64 filters, then Batch Normalization + ReLU + Drop Out (P = 0.5)
Conv2: 64 filters, then Batch Normalization + ReLU + Max Pool (1/2)
Conv3: 144 filters, then Batch Normalization + ReLU + Drop Out (P = 0.4)
Conv4: 144 filters, then Batch Normalization + ReLU + Max Pool (1/2)
Conv5: 256 filters, then Batch Normalization + ReLU + Drop Out (P = 0.4)
Conv6: 256 filters, then Batch Normalization + ReLU + Drop Out (P = 0.4)
Conv7: 256 filters, then Batch Normalization + ReLU + Max Pool (1/2)
Conv8: 484 filters, then Batch Normalization + ReLU + Drop Out (P = 0.4)
Conv9: 484 filters, then Batch Normalization + ReLU + Drop Out (P = 0.4)
Conv10: 484 filters

The number of operations is reduced according to (12). For example, for the first layer of the structure CNN3, the numbers of operations for the standard CNN and the rank-1 (type-1) CNN are I_x × I_y × C_out × f_x × f_y = 224 × 224 × 64 × 3 × 3 = 28,901,376 and I_x × I_y × (C_out + f_x + f_y) = 224 × 224 × (64 + 3 + 3) = 3,512,320, respectively. Therefore, the standard CNN requires about eight times as many operations as the rank-1 CNN in the first convolutional layer of the CNN3 model. Table 6 summarizes the number of parameters for the different models and structures. We also measured the inference times of the different models in CPU and GPU environments and recorded the values in the fifth and sixth columns of Table 5. We used TensorFlow version 0.12.1 on Windows 10, with an NVIDIA 1080Ti GPU, an Intel i9 CPU, and 16 GB of RAM. It should be noted that the Flattened model and the proposed model had the same inference speed, as their structures at test time were the same.
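The first-layer arithmetic above can be checked with a few lines of Python. This is a small sketch that only restates the two expressions quoted in the text; the function names `standard_ops` and `rank1_ops` are hypothetical and not part of the paper.

```python
def standard_ops(i_x, i_y, c_out, f_x, f_y):
    # Operation count quoted in the text for the standard CNN layer.
    return i_x * i_y * c_out * f_x * f_y

def rank1_ops(i_x, i_y, c_out, f_x, f_y):
    # Operation count quoted in the text for the rank-1 (type-1) layer.
    return i_x * i_y * (c_out + f_x + f_y)

# First convolutional layer of the example: 224 x 224 input, 64 filters, 3 x 3 kernels.
std = standard_ops(224, 224, 64, 3, 3)
r1 = rank1_ops(224, 224, 64, 3, 3)
ratio = std / r1

print(std)              # 28901376
print(r1)               # 3512320
print(round(ratio, 2))  # 8.23
```

The ratio depends only on (C_out × f_x × f_y) / (C_out + f_x + f_y) = 576 / 70, so it is the same for any input size with these filter settings.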
Compared to the reduction in the number of parameters, the gain in inference speed was modest. This is mainly because the TensorFlow framework was not optimized for factorized neural networks and because loading the image took a long time in the inference stage. We believe that the speed of factorized filtering will increase in the future with a framework optimized for it.
Figure 8 compares the validation accuracies of networks with ResNet structures composed of standard 3-D filters and rank-1 3-D filters, respectively, where the rank-1 3-D filters were trained with the proposed training method. We used the 56-layer ResNet structure for CIFAR10 as proposed in [50], and replaced all the standard 3-D filters in the ResNet by rank-1 3-D filters in the experiments with the proposed training method. As can be seen, the rank-1 ResNet trained with the proposed method achieved a validation accuracy similar to that of the normal ResNet, which shows that the proposed training method can be applied to diverse network structures containing low-rank filters.
Finally, Figure 9 compares the validation accuracies of the MobileNet when trained with the normal method and with the proposed training method. We used the CNN4 structure and replaced all the 3-D convolution operations with depth-wise separable 2-D convolutions followed by 1 × 1 pointwise convolutions, as suggested in the MobileNet structure. It can be seen that the proposed training method accelerated the training process and achieved a better validation accuracy than the standard training method, which is due to the intentional over-parametrization obtained by the outer product in the proposed method. Figure 11 shows the means and the positive standard deviations of the validation accuracies over the different training sessions. The standard deviations at the early epochs were large with the standard training method, while the variation was smaller with the proposed training method. After long training, however, the final validation accuracies of all training sessions converged to a similar value, as shown by the standard deviation decreasing toward the end of the training.

Conclusions
We proposed a training method which alternately composes and decomposes the filters in the training stage for better training of low-rank filters. As an exemplary case, we showed that a rank-1 CNN that cannot be trained with conventional training methods can be trained with the proposed method. We used 3-D rank-1 filters in convolutional neural networks in the training phase, so that the redundancy in the filters is reduced by the rank-deficient property of the rank-1 filters. The proposed training method updates the 3-D filter parameters and projects them back onto the rank-1 subspace at each epoch to find good parameter values for the 1-D vectors which constitute the 3-D filters. At test time, the trained 1-D vectors can be used directly as 1-D filters, which filter the image by 1-D convolutions instead of a 3-D convolution for fast inference. We showed in the experiments that the accuracy of the rank-1 CNN is almost the same as that of the standard CNN, while the number of parameters in the convolution filters is reduced down to about 11% and the number of operations down to about 12% of those of the standard CNN. We also showed that the proposed method can be used in the ResNet structure and that it can be utilized for better training of the MobileNet. This suggests that the proposed rank-1 training method can also be used with diverse structures, such as the ResNet or other small-sized networks, to obtain a more efficient network structure.
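The relation between a 3-D rank-1 filter and its constituent 1-D filters can be illustrated numerically. The NumPy sketch below is an assumption-laden toy, not the training procedure itself: the vector names `a`, `b`, `c`, the 3 × 3 × 3 size, and the axis order (channel, vertical, horizontal) are all hypothetical. It composes a 3-D filter as an outer product of three 1-D vectors and checks that its response on one image patch equals the response obtained by applying the three 1-D filters in succession.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical 1-D filters (assumed axis order: channel, vertical, horizontal).
a = rng.standard_normal(3)
b = rng.standard_normal(3)
c = rng.standard_normal(3)

# Compose: the 3-D rank-1 filter is the outer product of the 1-D vectors.
w = np.einsum("i,j,k->ijk", a, b, c)

# A random patch standing in for one 3 x 3 x 3 window of the input.
patch = rng.standard_normal((3, 3, 3))

# Response of the composed 3-D filter on the patch.
resp_3d = np.sum(w * patch)

# Response of the decomposed form: contract one axis at a time with a 1-D filter.
t = np.tensordot(a, patch, axes=(0, 0))  # contract channel axis -> shape (3, 3)
t = b @ t                                # contract vertical axis -> shape (3,)
resp_1d = t @ c                          # contract horizontal axis -> scalar

print(np.isclose(resp_3d, resp_1d))
print(w.size, a.size + b.size + c.size)  # parameter counts: 3-D filter vs. 1-D vectors
```

In this toy setting the composed filter stores 27 values while the three 1-D vectors store 9, which illustrates why the decomposed form is used at test time.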
In this paper, we applied the proposed training method to well-known deep learning models that can be factorized. Whether the proposed training method can be applied to other, more complicated models remains an open question. Combining the proposed method with existing training methods and applying it to other complex deep learning models is a topic for further study. Moreover, the experimental results showed that the inference time was not reduced as much as the number of parameters, which is due to the fact that existing deep learning frameworks are not optimized for factorized models. Therefore, in order to accelerate inference, further studies on hardware designs that can effectively exploit factorized models should be carried out in the future.