MobilePrune: Neural Network Compression via ℓ0 Sparse Group Lasso on the Mobile System

It is hard to directly deploy deep learning models on today's smartphones due to the substantial computational costs introduced by millions of parameters. To compress the model, we develop an ℓ0-based sparse group lasso model called MobilePrune, which can generate extremely compact neural network models for both desktop and mobile platforms. We adopt the group lasso penalty to enforce sparsity at the group level to benefit General Matrix Multiply (GEMM), and we develop the first algorithm that optimizes the ℓ0 norm in an exact manner and achieves a global convergence guarantee in the deep learning context. MobilePrune also allows complicated group structures to be applied to the group penalty (i.e., trees and overlapping groups) to suit DNN models with more complex architectures. Empirically, we observe substantial improvements in compression ratio and reductions in computational cost for various popular deep learning models on multiple benchmark datasets compared to state-of-the-art methods. More importantly, the compressed models are deployed on the Android system to confirm that our approach achieves lower response delay and battery consumption on mobile phones.


Introduction
Deep neural networks (DNNs) have achieved tremendous success in many real-world applications. However, the computational cost of DNN models significantly restricts their deployment on platforms with limited computational resources, such as mobile devices. To address this challenge, numerous model compression algorithms have been proposed to reduce the sizes of DNN models. The most popular solution is to prune the weights with small magnitudes by adding ℓ0 or ℓ1 penalties [1][2][3][4]. The non-zero weights selected by these methods are randomly distributed and do not reduce memory consumption, due to the matrix operations widely adopted in today's deep learning architectures, as shown in Figure 1(b.2). The implementation of such a non-structured sparse matrix in cuDNN [5], the GPU-accelerated library of primitives used by deep learning models, has memory consumption similar to that of the original matrix without pruning, as shown in Figure 1(b.2),(c.2). To overcome this problem, structured pruning models [6][7][8][9] have been proposed to enforce group sparsity by pruning a group of pre-defined variables together. By tying the weights connecting to the same neuron together, these approaches are able to prune a number of hidden neurons to reduce the sizes of weight matrices and benefit General Matrix Multiply (GEMM) used in cuDNN, as shown in Figure 1(b.3),(c.3). However, one of the main problems of the structured methods is that they do not take advantage of hardware accelerator architectures.
In this paper, we observe three key factors that could lead to an extremely compact deep network. First, we observe that most modern deep learning architectures rely on the General Matrix Multiply (GEMM) functions implemented in the cuDNN package. We therefore propose a new network compression algorithm that harmonizes the group selections in the structured penalty with the implementation of GEMM in cuDNN, as well as with hardware accelerators using the sparse systolic tensor array [10,11]. Figure 1 demonstrates the basic rationale of our observation. Second, in comparison to the pruned model in Figure 1(c.3), the model in Figure 1(c.4) requires additional sparsity within the remaining groups, which can be exploited by the sparse systolic tensor array for hardware acceleration.
Third, most of these algorithms were originally designed for desktop platforms, where computational resources are relatively rich. However, few of them have been deployed on real mobile systems to measure running time and energy consumption and thereby verify their assumptions. Therefore, we develop the first algorithm, named MobilePrune, that solves the optimization of the ℓ0 sparse group lasso regularization in an exact manner. The main technical contribution is that we solve the associated proximal operators exactly under different grouping conditions. In addition, we conduct extensive experiments on multiple public datasets and find that MobilePrune achieves superior performance at sparsifying networks with both fully connected and convolutional layers. More importantly, we deploy our system on the real Android system on several mobile devices and test the performance of the algorithm on multiple Human Activity Recognition (HAR) tasks. The results show that MobilePrune achieves much lower energy consumption and a higher pruning rate while still retaining high prediction accuracy. Besides a powerful network compression algorithm, this work also provides a valuable platform and mobile dataset for future work to evaluate methods in a realistic scenario.
The rest of this paper is organized as follows. Section 2 provides the relevant background and related work. In Section 3, we give a brief overview of the proposed MobilePrune method. In Section 4, we discuss the proposed methods and algorithms in detail. In Section 5, we describe how the experiments are set up and evaluate the proposed methods from different perspectives. Section 6 discusses future work and summarizes the paper.

Sparsity for Deep Learning Models
Many pruning algorithms for deep learning models achieve slim neural networks by introducing sparsity-inducing norms into the models. ℓ1 regularization [31][32][33] and ℓ0 regularization [8] were applied to induce sparsity on each individual weight. However, such individual sparsity has arbitrary structure, which cannot be utilized by software and hardware. Wen et al. [24] applied group sparsity to prune filters or channels, which can reduce the matrix size used in GEMM in cuDNN. Because the pruned models are compatible with cuDNN, they achieved large speedups. There are methods [34,35] aiming to find sparse models at both the individual and group levels, which is similar to our goal. However, they all used the ℓ1 norm to induce individual sparsity in addition to group sparsity. We have performed a comprehensive comparison and demonstrate in Section 5 that our MobilePrune method is the best at inducing sparsity at both the individual and group levels for pruning deep learning models.

Software & Hardware Compatibility
In this paper, we aim to design an algorithm that makes the pruned DNN models compatible with the cuDNN library [5] and with hardware accelerator architectures that use the sparse systolic tensor array [11,36]. cuDNN is the GPU-accelerated library used by deep learning models [5]. As shown in Figure 1a, the convolution used in a convolutional neural network is lowered to matrix multiplication. Therefore, the size of the filter matrix can be reduced by inducing group sparsity column-wise, as shown in Figure 1(b.3),(b.4), which reduces the memory footprint of the DNN models and achieves practical performance improvement. The systolic tensor array is an efficient hardware accelerator for structured sparse matrices, as shown in Figure 1(c.4); specifically, each column in Figure 1(c.4) is sparse. To achieve a pruned DNN model that is compatible with both cuDNN and the systolic tensor array, sparsity needs to be induced at both the group level and the within-group level. We show how we achieve this in the following sections.
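To make the lowering concrete, the following NumPy sketch (our own illustration, not code from the paper; the `im2col` helper and the toy shapes are assumptions) shows how a convolution becomes one GEMM, and why zeroing whole columns of the lowered filter matrix lets both matrices physically shrink:

```python
import numpy as np

def im2col(x, kh, kw):
    """Lower a (C, H, W) input to a patch matrix so that convolution
    with flattened filters becomes a single matrix multiply (GEMM)."""
    C, H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((out_h * out_w, C * kh * kw))
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[idx] = x[:, i:i + kh, j:j + kw].ravel()
            idx += 1
    return cols

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))             # 3 input channels
filters = rng.standard_normal((4, 3 * 3 * 3))  # 4 filters, flattened to match im2col

# Column-wise group sparsity: zero every filter's weights for input channel 0.
# All-zero columns can then be removed from BOTH matrices before the GEMM.
filters[:, 0:9] = 0.0
keep = ~np.all(filters == 0.0, axis=0)

cols = im2col(x, 3, 3)
y_full = cols @ filters.T                      # original GEMM
y_small = cols[:, keep] @ filters[:, keep].T   # smaller GEMM, identical output
assert np.allclose(y_full, y_small)
```

The smaller GEMM touches only the surviving 18 of 27 columns, which is exactly the memory reduction Figure 1(b.3),(b.4) depicts.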

Overview
The central idea of MobilePrune is to compress the deep learning model in a way that is compatible with the architecture of data organization in memory, by combining ℓ0 regularization and group lasso regularization. The group lasso regularization helps to keep important groups of weights that benefit cuDNN, while the ℓ0 regularization helps to achieve additional sparsity within those important groups, which is needed for hardware acceleration. Figure 2 provides an overview of the proposed MobilePrune method. As illustrated in Figure 2a-c, the group lasso penalty removes all the weights together with the i-th neuron if it is less important for the prediction; if a group is selected, the ℓ0 penalty further removes the weights with small magnitudes within the group. Note that zeroing out the weights connected to the i-th neuron results in removing the i-th neuron and all its associated weights entirely. We discuss the method in more detail in the next section.

Methods
Our main objective is to obtain a sparse deep neural network with a significantly smaller number of parameters at both the individual and group levels by using the proposed novel combined regularizer: the ℓ0 sparse group lasso.

ℓ0 Sparse Group Lasso
We aim to prune a generic (deep) neural network, which includes fully connected feed-forward networks (FCNs) and convolutional neural networks (CNNs). Assume the generic neural network has N neurons in its fully connected layers and M channels in its convolutional layers. Let W^i denote the outgoing weights of the i-th neuron and T^j the 3D tensor of all filters applied to the j-th channel; neurons and channels can come from different layers. The training objective for the neural network is given as follows:

min_{W,T} (1/P) ∑_{p=1}^{P} L(x_p, y_p; W, T) + ∑_{i=1}^{N} Ω^η_λ(W^i) + ∑_{j=1}^{M} Γ^α_{β,γ}(T^j),   (1)

where {(x_p, y_p)}_{p=1}^{P} is a training dataset with P instances, L is an arbitrary loss function parameterized by W and T, and Ω^η_λ(W) and Γ^α_{β,γ}(T) represent the ℓ0 sparse group lasso penalties for neurons and channels, respectively. Specifically, Ω^η_λ(W^i) is defined as

Ω^η_λ(W^i) = η ‖W^i‖_0 + λ ‖W^i‖_g,   (2)

where η ≥ 0 and λ ≥ 0 are regularization parameters. Let n(i) represent the set of outgoing edge weights of neuron i. Then ‖W^i‖_0 = ∑_{j∈n(i)} ‖W^i_j‖_0 (W^i_j is the j-th outgoing edge weight of neuron i; ‖W^i_j‖_0 = 0 when W^i_j = 0 and ‖W^i_j‖_0 = 1 otherwise) counts the number of non-zero edges in W^i, and ‖W^i‖_g = sqrt(∑_{j∈n(i)} (W^i_j)^2) aggregates the weights associated with the i-th neuron as a group. The core spirit of Equation (2) is illustrated in Figure 2a-c: the group lasso penalty ‖W^i‖_g tends to remove all the weights together with the i-th neuron if it is less important; if a group is selected, the ℓ0 penalty further removes the weights with small magnitudes within the group. The group sparsity term ‖W^i‖_g can help remove neurons from the neural network, which reduces its size and further improves efficiency. The individual sparsity term ‖W^i‖_0 helps to achieve additional sparsity within the remaining neurons. Such structured sparsity can be used by the systolic tensor array [10,11]. The other regularization term Γ^α_{β,γ}(T^j) is defined as follows:

Γ^α_{β,γ}(T^j) = α ‖T^j‖_0 + β ‖T^j‖_g + γ ∑_{h,w} ‖T^j_{:,h,w}‖_g,   (3)

where α, β, and γ are non-negative regularization parameters. Equation (3) defines a hierarchically structured sparse penalty whose structure is guided by the memory organization of GEMM in cuDNN [5].
As demonstrated in Figure 2d, the pruning strategy encoded in Equation (3) explicitly takes advantage of the GEMM used in cuDNN. The term ‖T^j‖_g enforces group sparsity over all the filters applied to the j-th channel, ‖T^j_{:,h,w}‖_g enforces group sparsity across the filters at the same position of the same channel, and ‖T^j‖_0 = ∑_f ∑_h ∑_w ‖T^j_{f,h,w}‖_0 prunes the small weights within the remaining channels and filters. Equation (3) can help to achieve an extremely compact model, as in Figure 1(c.4). Therefore, the computation can be accelerated at both the software and hardware levels.
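As a concrete reading of Equations (2) and (3), the two penalties can be evaluated as follows (a hedged NumPy sketch; the helper names `omega`/`gamma_penalty` and the toy shapes are ours, not the paper's):

```python
import numpy as np

def omega(W_i, eta, lam):
    """Eq. (2): eta*||W_i||_0 + lam*||W_i||_g for one neuron's outgoing weights."""
    return eta * np.count_nonzero(W_i) + lam * np.sqrt(np.sum(W_i ** 2))

def gamma_penalty(T_j, alpha, beta, gam):
    """Eq. (3) as described in the text, for the 3D tensor T_j of shape
    (filters, h, w) holding all filters applied to channel j:
    alpha*||T_j||_0 + beta*||T_j||_g + gam*sum_{h,w} ||T_j[:, h, w]||_g."""
    l0 = np.count_nonzero(T_j)
    root = np.sqrt(np.sum(T_j ** 2))                   # whole-channel group
    cross = np.sum(np.sqrt(np.sum(T_j ** 2, axis=0)))  # per-position groups
    return alpha * l0 + beta * root + gam * cross

W = np.array([0.0, 3.0, -4.0])
print(omega(W, eta=0.1, lam=1.0))  # ≈ 5.2, i.e., 0.1*2 + 1.0*5
```

Note how the ℓ0 term counts survivors while the group terms aggregate magnitudes, which is why the former drives within-group sparsity and the latter drives whole-group removal.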

Exact Optimization by PALM
In this subsection, we first briefly review the general PALM (Proximal Alternating Linearized Minimization) framework used in our MobilePrune algorithm. Then, we introduce how we adapt PALM to efficiently optimize the ℓ0 sparse group lasso for neural network compression. PALM is designed to solve a general optimization problem of the form

min_{W,T} F(W, T) + Φ_1(W) + Φ_2(T),   (4)

where F(W, T) is a smooth function, and Φ_1(W) and Φ_2(T) do not need to be convex or smooth but are required to be lower semi-continuous. The PALM algorithm applies the proximal forward-backward algorithm [37] to optimize Equation (4) with respect to W and T in an alternating manner. Specifically, at iteration k + 1, the values of W^(k+1) and T^(k+1) are derived from the proximal forward-backward mappings by solving the following sub-problems:

W^(k+1) ∈ arg min_W (c_k/2) ‖W − U_k‖^2 + Φ_1(W),  where U_k = W^(k) − (1/c_k) ∇_W F(W^(k), T^(k)),   (5)

T^(k+1) ∈ arg min_T (d_k/2) ‖T − V_k‖^2 + Φ_2(T),  where V_k = T^(k) − (1/d_k) ∇_T F(W^(k+1), T^(k)).   (6)

Additionally, c_k and d_k are positive constants. This optimization process has been proven to converge to a critical point when the functions F, Φ_1, and Φ_2 are bounded [37]. We further extend the convergence proof in [37] and show that the global convergence of PALM holds for training deep learning models under mild conditions. The detailed proof can be found in Appendix A.
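The alternating forward-backward updates can be illustrated on a toy instance (our own sketch, not the paper's model: a synthetic quadratic F with an ℓ0 penalty as Φ_1 and Φ_2 = 0; the names A, w, t and all constants are assumptions):

```python
import numpy as np

# Toy instance: F(w, t) = 0.5 * ||A w - t||^2 (smooth), Phi1 = eta * ||w||_0,
# Phi2 = 0 (its prox is the identity).
rng = np.random.default_rng(1)
A = rng.standard_normal((10, 5))
w = rng.standard_normal(5)
t = rng.standard_normal(10)
eta = 0.05
c = np.linalg.norm(A, 2) ** 2 + 1.0  # step constant >= Lipschitz const. of grad_w F
d = 2.0                              # step constant >= Lipschitz const. of grad_t F

def prox_l0(u, thresh):
    """Exact prox of (eta/c)*||.||_0: hard-threshold at sqrt(2*eta/c)."""
    out = u.copy()
    out[np.abs(out) < thresh] = 0.0
    return out

for _ in range(200):
    # forward (gradient) step then backward (prox) step in w
    grad_w = A.T @ (A @ w - t)
    w = prox_l0(w - grad_w / c, np.sqrt(2.0 * eta / c))
    # forward/backward step in t; the prox of Phi2 = 0 is the identity
    grad_t = t - A @ w
    t = t - grad_t / d
```

Because c and d dominate the blockwise Lipschitz constants, each iteration decreases the objective, mirroring the sufficient-decrease argument behind the convergence guarantee.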
To optimize Equation (1), we define two proximal operators for the two penalty terms:

π^η_λ(y) ≡ arg min_x (1/2) ‖x − y‖_2^2 + Ω^η_λ(x),   (7)

θ^α_{β,γ}(y) ≡ arg min_x (1/2) ‖x − y‖_2^2 + Γ^α_{β,γ}(x).   (8)

Here, the functions Ω^η_λ(·) and Γ^α_{β,γ}(·) take vectors as inputs and are equivalent to Equations (2) and (3) once we vectorize W^i and T^j. The overall optimization process of MobilePrune is described in Algorithm 1. Once we can efficiently compute the optimal solutions of π^η_λ(·) and θ^α_{β,γ}(·), the computational burden mainly concentrates on the partial derivative calculations of the functions H(·) and F(·), which is the same as training a normal DNN model.

Algorithm 1
The framework of the MobilePrune algorithm.
Proximal Operator π^η_λ(·)

The difficulty of solving π^η_λ(y) in Equation (9) is that both ‖x‖_g and ‖x‖_0 are not differentiable when the vector x = 0. Furthermore, ‖x‖_0 counts the number of non-zeros in the vector x ∈ R^n, and there are C(n, 0) + C(n, 1) + · · · + C(n, n) = 2^n possible support patterns (C(n, k) counts the patterns in which exactly k elements of x are non-zero), which indicates that a brute-force method needs 2^n computations to find the global optimal solution. However, here we prove that π^η_λ(y) in Equation (9) can be efficiently solved in closed form by an O(n log(n)) algorithm. We present the algorithm in Algorithm 2 and prove its correctness in Theorem 1. To the best of our knowledge, this is the first efficient algorithm that computes this novel proximal operator.
Theorem 1. The proximal operator π^η_λ(y) can be written as

π^η_λ(y) = arg min_{x∈R^n} (1/2) ‖x − y‖_2^2 + η ‖x‖_0 + λ ‖x‖_g.   (9)

The optimal solution of this proximal operator can be computed by Algorithm 2.
Proof. Without loss of generality, we assume y = [y_1, y_2, . . . , y_n]^T ∈ R^n is ordered such that |y_1| ≥ |y_2| ≥ · · · ≥ |y_n|. We then define →y^k = [y_1, y_2, . . . , y_k, 0, . . . , 0]^T, in which the top k elements with the largest absolute values are kept and all remaining elements are set to zero. We define the set Φ_k = {x ∈ R^n : ‖x‖_0 = k} of all n-dimensional vectors with exactly k non-zero elements. For any x = [x_1, x_2, . . . , x_n]^T ∈ Φ_k, we further define a mask vector e with e_i = 1{x_i ≠ 0} to indicate the non-zero locations of x.
Since we do not know how many non-zero elements remain in the optimal solution of Equation (9), we enumerate all possible k and solve n + 1 sub-problems over x^k ∈ Φ_k for k = 0, . . . , n. For each k, the sub-problem is defined as

f(x^{k*}) = min_{x^k ∈ Φ_k} (1/2) ‖x^k − y‖_2^2 + ηk + λ ‖x^k‖_g.   (10)

Based on Lemma A4 in Appendix A.2, if ‖y^k‖_g ≤ λ, then x^{k*} = 0 and f(x^{k*}) = (1/2) ‖y‖_2^2. If ‖y^k‖_g > λ, then x^{k*} = (‖y^k‖_g − λ) y^k / ‖y^k‖_g, and the value of the objective function can be computed as

f(x^{k*}) = −(1/2) (‖y^k‖_g − λ)^2 + ηk + (1/2) ‖y‖_2^2.   (11)

Equation (11) tells us that f(x^{k*}) is a function of y^k. The task of calculating the minimum of f(x^{k*}) is thus transformed into minimizing −(1/2)(‖y^k‖_g − λ)^2 + ηk + (1/2)‖y‖_2^2 with respect to y^k, which is equivalent to asking which k components of y achieve the minimum value of f. Since we assume ‖y^k‖_g > λ, the optimal choice is clearly the top k components of y with the largest absolute values. Therefore, we have →y^k = arg min_{y^k} −(1/2)(‖y^k‖_g − λ)^2 + ηk + (1/2)‖y‖_2^2, and minimizing over k = 0, . . . , n yields the solution. Hence, problem (10) has a closed-form solution.
As shown in Algorithm 2, the heaviest computation is sorting the input vector y; therefore, the time complexity of solving Equation (9) is O(n log(n)).
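The support search above can be sketched in a few lines of NumPy (our own minimal illustration of the proof, not the authors' reference implementation; the function and variable names are assumptions):

```python
import numpy as np

def prox_l0_sparse_group_lasso(y, eta, lam):
    """Exact prox of eta*||x||_0 + lam*||x||_g in O(n log n).

    Solves argmin_x 0.5*||x - y||^2 + eta*||x||_0 + lam*||x||_g by scanning
    the support size k over the top-k magnitudes of y (cf. Theorem 1).
    """
    order = np.argsort(-np.abs(y))             # sort once: O(n log n)
    norms = np.sqrt(np.cumsum(y[order] ** 2))  # ||y^k||_g for k = 1..n
    # objective relative to the k = 0 candidate: eta*k - 0.5*(||y^k||_g - lam)_+^2
    objs = eta * np.arange(1, y.size + 1) - 0.5 * np.maximum(norms - lam, 0.0) ** 2
    k = 0 if objs.min() >= 0.0 else int(objs.argmin()) + 1
    x = np.zeros_like(y)
    if k > 0 and norms[k - 1] > lam:
        # group-lasso shrinkage applied to the selected top-k entries
        x[order[:k]] = (1.0 - lam / norms[k - 1]) * y[order[:k]]
    return x
```

With y = [3, -0.1, 2], η = 0.5, λ = 1, for instance, the scan selects k = 2: the small entry is removed by the ℓ0 term and the two survivors are shrunk by the group term.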

Proximal Operator θ^α_{β,γ}(y)

Similar to π^η_λ(y), θ^α_{β,γ}(y) is the solution of the following optimization problem:

θ^α_{β,γ}(y) = arg min_x (1/2) ‖x − y‖_2^2 + α ‖x‖_0 + β ‖x‖_g + γ ∑_{i=1}^{d} ‖x_{G_i}‖_g,   (12)

where we assume x, y ∈ R^n, G_i ⊆ {1, . . . , n} is the index set of the i-th group, and d is the number of groups. Note that the grouping structure specified in Equation (12) is a special case of the grouped tree structures [30], where ‖x‖_g is the group lasso term for the root of the tree and the ‖x_{G_i}‖_g are the group lasso terms of its children. Notice that groups at the same depth of the tree do not overlap, and furthermore ‖x‖_0 = ∑_{i=1}^{d} ‖x_{G_i}‖_0. To simplify the notation, assuming x ≠ 0, we define h(x) = (1/2) ‖x − y‖_2^2 + β ‖x‖_g, which is convex and differentiable, and rewrite the problem as

min_x h(x) + α ‖x‖_0 + γ ∑_{i=1}^{d} ‖x_{G_i}‖_g.   (13)

We can use the proximal method [38] to find a solution x† of Equation (13). In the proximal method, we need to estimate the Lipschitz constant L(x) = 1 + β/‖x‖_g. In addition, we need to use Algorithm 2 to solve π^α_γ(·) for each group x_{G_i}. After obtaining x†, we can recover the solution of Equation (12). We elaborate the algorithm for the proximal operator θ^α_{β,γ}(y) in Algorithm 3. The major computational cost is the proximal method; therefore, the convergence rate of Algorithm 3 is O(1/k) [38].
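A simplified sketch of this proximal-gradient scheme follows (our own code, not the paper's Algorithm 3; it assumes non-overlapping child groups, reuses the closed-form group prox from Theorem 1 for each child, and takes L(x) = 1 + β/‖x‖_g as a local Lipschitz estimate):

```python
import numpy as np

def prox_l0_group(y, eta, lam):
    """Closed-form prox of eta*||x||_0 + lam*||x||_g (cf. Theorem 1)."""
    order = np.argsort(-np.abs(y))
    norms = np.sqrt(np.cumsum(y[order] ** 2))
    objs = eta * np.arange(1, y.size + 1) - 0.5 * np.maximum(norms - lam, 0.0) ** 2
    k = 0 if objs.min() >= 0.0 else int(objs.argmin()) + 1
    x = np.zeros_like(y)
    if k > 0 and norms[k - 1] > lam:
        x[order[:k]] = (1.0 - lam / norms[k - 1]) * y[order[:k]]
    return x

def prox_tree(y, groups, alpha, beta, gamma_, steps=100):
    """Proximal-gradient sketch for theta: the smooth part is
    h(x) = 0.5*||x - y||^2 + beta*||x||_g (root of the tree); the l0 and
    child-group terms are handled per group with the closed-form prox."""
    x = y.copy()
    for _ in range(steps):
        g_norm = np.linalg.norm(x)
        if g_norm == 0.0:
            break
        L = 1.0 + beta / g_norm                  # local Lipschitz estimate
        z = x - (x - y + beta * x / g_norm) / L  # gradient step on h
        for G in groups:                         # separable backward step
            z[G] = prox_l0_group(z[G], alpha / L, gamma_ / L)
        x = z
    return x
```

On a small vector with two child groups, the root term shrinks every surviving entry while the per-group prox zeroes the tiny entries inside each child, which is the tree-structured behavior Equation (12) encodes.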

Performance on Image Benchmarks
In this subsection, we compare our proposed MobilePrune approach with other state-of-the-art pruning methods in terms of pruning rate, computational cost, and test accuracy. We mainly compare against structured pruning methods, because DNN models pruned by non-structured pruning methods cannot obtain practical speedup, as shown in Figure 1. Notably, we only compare results that can be reproduced with the source code provided by the competing methods. First, we briefly summarize their methodologies. PF [32] and NN slimming [33] are simple magnitude-based pruning methods based on the ℓ1 norm. BC [9], SBP [7], and VIBNet [39] cast DNN pruning as probabilistic Bayesian models. C-OBD [40], C-OBS [2], Kron-OBD [40,41], Kron-OBS [2,41], and EigenDamage [42] are Hessian-based methods. The ℓ0-norm-penalized method [8] and the group-lasso-penalized method [24] are also well-known methods.
In our experiments, we use an NVIDIA GPU and a 12-core CPU. All the baseline models were trained from scratch via stochastic gradient descent (SGD) with a momentum of 0.9. We trained the networks for 150 epochs on MNIST and 400 epochs on CIFAR-10 and Tiny-ImageNet, with an initial learning rate of 0.1 and a weight decay of 5 × 10^-4. The learning rate is decayed by a factor of 10 at epochs 50 and 100 on MNIST, and at epochs 100 and 200 on CIFAR-10 and Tiny-ImageNet, respectively. The details of the hyper-parameters for all experiments are summarized in Appendix B. We also report the computational efficiency of our methods in Appendix C.
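The step-decay schedule above can be written as a small helper (a sketch; the function name and signature are ours, not from the paper's code):

```python
def lr_at_epoch(epoch, base_lr=0.1, milestones=(100, 200), factor=0.1):
    """Step-decay schedule used in the experiments: the learning rate starts
    at base_lr and is divided by 10 at each milestone epoch
    (50/100 for MNIST, 100/200 for CIFAR-10 and Tiny-ImageNet)."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr
```

For example, at epoch 150 of a CIFAR-10 run this gives a learning rate of 0.01, after the first decay and before the second.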
As shown in the top half of the MNIST section of Table 1, our model achieves the smallest number of neurons after pruning the LeNet-300-100 model and the lowest prediction accuracy drop (0.01%) compared to the other methods. More importantly, our pruned model achieves the lowest FLOPs. Note that the architecture of our pruned model is as compact as that of L0-sep [8] but is extremely sparse, with only 5252 weights left. This additional sparsity is critical when applying hardware acceleration [10,11] to our pruned model.
The bottom half of the MNIST section of Table 1 shows the performance comparison on pruning the LeNet-5 model. The LeNet-5 model pruned by our method achieves the lowest FLOPs (113.50 K) with the smallest prediction accuracy drop (0.01%). Moreover, our pruned model also has the smallest number of weights (around 2310). In addition, we compared with SSL on pruning the first two convolutional layers, as done in [24] (Table A2). SSL has the same group lasso penalty term as ours but without the ℓ0 norm regularization. More details about SSL can be found in Appendix C.2. As shown, our method decreases the sizes of the filters from 25 and 500 to 16 and 65, respectively, which dramatically lowers the FLOPs. In addition, the non-zero parameters in the remaining filters are very sparse in our model.

CIFAR-10 Dataset
We further evaluated our method on pruning more sophisticated DNN architectures on CIFAR-10 [46]: VGG-like [44] and ResNet-32 [42,45], the latter widened by a factor of 4. Similarly, we compared with the state-of-the-art structured pruning methods [2,7,32,39-42] in terms of various metrics. As shown in the middle of Table 1, the pruned VGG-like model obtained by our method achieves the lowest FLOPs with the smallest test accuracy drop. Similar to the previous results, our pruned model keeps the smallest number of weights in comparison to the other methods, which is key for hardware acceleration [10,11]. As presented in Table 1, the pruned ResNet-32 model achieved by our method outperforms the other pruned models in terms of pruned test accuracy and FLOPs. In addition, in terms of remaining weights, our pruned model is at the same sparsity level as C-OBD [40], while our pruned accuracy outperforms C-OBD by a large margin.

Tiny-ImageNet Dataset
Besides the experiments on the MNIST and CIFAR-10 datasets, we further evaluated the performance of our method on a more complex dataset, Tiny-ImageNet [47], using VGG-19 [48]. Tiny-ImageNet is a subset of the full ImageNet, consisting of 100,000 images for training and 10,000 images for validation over 200 classes. We compared our method with several state-of-the-art methods [2,33,40,42] in Table 1. As shown in Table 1, the test accuracy of the pruned model derived from our method outperforms all the other methods by a significant margin, about 10%, except EigenDamage. Our proposed method obtains the same level of test accuracy as EigenDamage. However, our method achieves a much sparser DNN model, with 1.16 million fewer weights than EigenDamage. Meanwhile, our pruned model achieves lower FLOPs.

Performance on Human Activity Recognition Benchmarks
To demonstrate the efficacy and effectiveness of our proposed MobilePrune method, we perform a series of comparison studies against other state-of-the-art pruning methods, namely the ℓ0 norm, ℓ1 norm, ℓ2 norm, group lasso, and ℓ1 sparse group lasso penalties, on all three datasets: WISDM [49,50], UCI-HAR [51,52], and PAMAP2 [53][54][55]. We evaluate the pruned accuracy and the pruning rates of weights (parameters) and nodes for our proposed MobilePrune approach and all the other pruning methods, using the same learning rate (0.0001) and the same number of epochs (150) for all three datasets. The pruning thresholds are 0.015, 0.005, and 0.01 for the WISDM, UCI-HAR, and PAMAP2 datasets, respectively. In addition, we evaluate the computational cost and battery consumption of our proposed method against all the other pruning methods as well. The details of the dataset descriptions and the hyper-parameters for all experiments are summarized in Appendix D.

Performance on the Desktop
We use Google Colab [56] to build a PyTorch backend for the above datasets. The GPU for Google Colab is an NVIDIA Tesla K80, and its CPU has 2 cores. As shown in Table 2, if we only use the ℓ0 norm penalty or the ℓ2 norm penalty, there is no effect on neuron or channel pruning, as expected, for all three datasets. Similarly, if we only employ the group lasso penalty, the pruned model still has more weights and nodes left. For the UCI-HAR dataset, the ℓ1 norm penalty and the ℓ1 sparse group lasso penalty cannot sparsify the model, while for the other two datasets these two penalties achieve better sparsity, but still not better than the MobilePrune approach. There exists a trade-off between pruned accuracy and pruning rate. As can be seen in Table 2, our MobilePrune method still attains high pruned accuracy even though few parameters and nodes are left. In addition, we compare our MobilePrune method with the ℓ1 sparse group lasso penalty. The ℓ0 sparse group lasso model still significantly outperforms the ℓ1 sparse group lasso model in weight and node pruning, which demonstrates its superiority in pruning CNN models. We also calculate the response delay and the time-saving percentage for all of the above methods on the desktop platform. Response delay is the time needed for the desktop to run the pre-trained model after the raw input signal is ready. In Table 2, the response delay results are obtained after running 200 input samples. As can be seen in Table 2, MobilePrune saves up to 66.00%, 57.43%, and 90.20% in response delay on the WISDM, UCI-HAR, and PAMAP2 datasets, respectively.
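The response-delay measurement can be sketched as follows (our own hedged illustration of the protocol just described, not the paper's benchmarking harness; `model_fn` stands in for a pruned model's forward pass):

```python
import time

def response_delay(model_fn, samples):
    """Total wall-clock time to run a pre-trained model over the prepared
    input samples (200 samples in the desktop measurements above)."""
    start = time.perf_counter()
    for s in samples:
        model_fn(s)
    return time.perf_counter() - start

# toy stand-in: a "model" that just sums its input window
delay = response_delay(lambda s: sum(s), [[0.1] * 128] * 200)
```

A sparser model makes each `model_fn` call cheaper, which is what drives the time-saving percentages reported in Table 2.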
Overall, if we apply the MobilePrune method, the pruned CNN models achieve the best sparsity in terms of both neurons (or channels) and weights without loss of performance. Additionally, the results in Table 2 show that our MobilePrune method achieves 28.03%, 46.83%, and 3.72% weight (parameter) sparsity, and 52.52%, 68.75%, and 10.74% node sparsity, for the WISDM, UCI-HAR, and PAMAP2 datasets, respectively.

Performance of Mobile Phones
We evaluate the computational cost and battery consumption of our proposed MobilePrune approach against all the other state-of-the-art pruning methods. In order to measure how these models perform on today's smartphones, we implement an Android application using Android Studio on a Huawei P20 and a OnePlus 8 Pro. The PyTorch Android API [57] is used for running the trained models on Android devices. Currently, these Android devices only support running machine learning models on the CPU.
For the Huawei P20, the CPU is a Cortex-A73. The OnePlus 8 Pro uses an octa-core CPU. We use the Batterystats tool and the Battery Historian script [58] to test battery consumption. Table 3 shows the response delay and battery usage results for our proposed method and all the other state-of-the-art pruning methods. Response delay is the time needed for the smartphone's system to run the pre-trained model after the raw input signal is ready. In Table 3, the response delay results are obtained after running 200 input samples, and the battery consumption results after running 2000 input samples, for each penalty on all three datasets. For the UCI-HAR dataset, our MobilePrune approach saves up to 40.14% and 22.22% in response delay and 34.52% and 19.44% in battery usage for the Huawei P20 and OnePlus 8 Pro, respectively, while the other pruning methods stay almost the same as the uncompressed version. For the WISDM and PAMAP2 datasets, the ℓ0 norm, ℓ2 norm, and group lasso penalties cannot sparsify the model, and therefore they cannot provide any savings in either response delay or battery consumption. The ℓ1 norm and ℓ1 sparse group lasso methods provide better time and battery savings than those three penalties, but they still cannot outperform the MobilePrune method, which saves 61.94% and 88.15% in response time, and 37.50% and 36.71% in battery consumption, for the WISDM and PAMAP2 datasets, respectively, on the Huawei P20. Additionally, it saves 52.08% and 77.66% in response time, and 32.35% and 37.93% in battery consumption, for the WISDM and PAMAP2 datasets, respectively, on the OnePlus 8 Pro. Overall, the results in Table 3 demonstrate MobilePrune's superiority in pruning HAR CNN models for battery usage and computational cost on today's smartphones.

Table 3. Comparison of our pruning method with other state-of-the-art pruning methods in terms of computational cost and battery usage on the mobile devices, for the HAR datasets WISDM, UCI-HAR, and PAMAP2, respectively. (We highlight our MobilePrune results and mark the best performance among the different penalties in blue for each device in each dataset.)

Ablation Studies
To demonstrate the efficacy and effectiveness of the ℓ0 sparse group lasso penalty, we performed a series of ablation studies on various DNN models. As shown in Table 4, if we only use the ℓ0 norm penalty, there is no effect on neuron or channel pruning, as expected. Similarly, if we only employ the group lasso penalty, the pruned model still has more weights left. However, if we apply the ℓ0 sparse group lasso, we can achieve pruned DNN models that are sparse in terms of both neurons (or channels) and weights. In addition, we compare our ℓ0 sparse group lasso model with the ℓ1 sparse group lasso [59] on pruning DNN models. Table 4 shows their comparison on pruning various DNN models. More details can be found in Appendix A.3 and Appendix C. As shown in Table 4 and the supplementary results, the ℓ0 sparse group lasso model significantly outperforms the ℓ1 sparse group lasso model.

Conclusions
In this work, we proposed a new DNN pruning method, MobilePrune, which is able to generate compact DNN models that are compatible with both cuDNN and hardware acceleration. MobilePrune compresses DNN models at both the group and individual levels by using the novel ℓ0 sparse group lasso regularization. We further developed a globally convergent optimization algorithm based on PALM to directly train the proposed compression models without any relaxation or approximation. Furthermore, we developed several efficient algorithms to solve the proximal operators associated with the ℓ0 sparse group lasso under different grouping strategies, which is the key computation of MobilePrune. We performed empirical evaluations on several public benchmarks. The experimental results show that the proposed compression model outperforms existing state-of-the-art algorithms in terms of computational cost and prediction accuracy. MobilePrune has great potential for designing slim DNN models that can be deployed on dedicated hardware that uses a sparse systolic tensor array. More importantly, we deployed our system on the real Android system on both a Huawei P20 and a OnePlus 8 Pro, and tested the performance of the algorithm on multiple Human Activity Recognition (HAR) tasks. The results show that MobilePrune achieves much lower energy consumption and a higher pruning rate while still retaining high prediction accuracy.
There are other options to further compress neural network models, such as Neural Logic Circuits and Binary Neural Networks, both of which use binary variables to represent inputs and hidden neurons. These models are orthogonal to our method, which means our pruning model could be adopted for Neural Logic Circuits, Binary Neural Networks, and other neural network architectures designed for mobile systems. We will explore which mobile neural network can best be integrated with our network compression model in future work.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A

Appendix A.1. The Convergence Analysis of Applying the PALM Algorithm to Deep Learning Models
For simplicity, we prove the convergence of the PALM algorithm on a neural network that only has fully connected layers, and the regularization is added on the weight matrix of each layer. The proof can be easily extended to DNN models with regularization added on each neuron and DNN with convolutional layers.
Given a feed-forward neural network with N − 1 hidden layers, there are d_i neurons in the i-th layer. Let d_0 and d_N represent the numbers of neurons in the input and output layers, respectively. The input data can then be represented by X := {x_1, . . . , x_n} ∈ R^{d_0×n} and the output data by Y := {y_1, . . . , y_n} ∈ R^{d_N×n}. Let W_i ∈ R^{d_i×d_{i−1}} be the weight matrix between the (i − 1)-th layer and the i-th layer; to simplify the notation, we let W_i absorb the bias of the i-th layer. We denote the collection of the W_i as W := {W_1, . . . , W_N}. The DNN model training problem can be formulated as

min_W R(W) := ℓ(Φ(X; W), Y) + ∑_{i=1}^{N} r_i(W_i),   (A1)

where Φ(X; W) = σ_N(W_N σ_{N−1}(· · · σ_1(W_1 X))) is the DNN model with N layers of model parameters W, σ_i is the activation function for neurons in the i-th layer, and r_i is the regularization function applied to W_i. We make the following assumptions for our DNN model.

Assumption A1. Suppose that the DNN model satisfies the following assumptions:

1.

2. The derivatives of the loss function ℓ and of all activation functions σ_i, i = 1, . . . , N, are bounded and Lipschitz continuous.

3. The loss function ℓ, the activation functions σ_i, ∀i, and the regularization functions r_i, ∀i, are either real analytic or semi-algebraic [37], and continuous on their domains.
Remark A1. A DNN model satisfying Assumption A1 may use the squared, logistic, hinge, cross-entropy, or soft-max loss function; smooth activation functions such as the sigmoid or hyperbolic tangent; and ℓ1-norm, ℓ2-norm, or ℓ0-norm regularization terms. Since the assumptions require the activation functions to be smooth, activation functions such as the rectified linear unit (ReLU) do not satisfy the requirement; however, ReLU can be replaced by the Softplus or Swish activation function.
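As an illustration, both smooth replacements can be written in a few lines of plain Python (a minimal sketch; the function names are ours):

```python
import math

def softplus(x, beta=1.0):
    # Softplus: (1/beta) * log(1 + exp(beta * x)); a smooth,
    # everywhere-differentiable approximation of ReLU.
    return (1.0 / beta) * math.log1p(math.exp(beta * x))

def swish(x):
    # Swish: x * sigmoid(x); smooth, and close to ReLU for large |x|.
    return x / (1.0 + math.exp(-x))
```

For large positive inputs both functions approach ReLU's identity branch while remaining smooth at 0, which is why they satisfy Assumption A1 where ReLU does not.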
We rewrite the training problem as

\[
\min_{W_1, \ldots, W_N} \; H(W_1, \ldots, W_N) + \sum_{i=1}^{N} r_i(W_i), \tag{A2}
\]

where H is exactly the same as R in (A1) but appears in a different form, written explicitly as a function of the blocks W_1, …, W_N. We propose Algorithm A1 to solve (A2) and prove its convergence via the following theorem. In Algorithm A1, we use at each step the proximal operator prox_t^σ(x), defined as

\[
\mathrm{prox}_t^{\sigma}(x) := \arg\min_{u} \; \sigma(u) + \frac{t}{2}\|u - x\|^2.
\]

Theorem A1. Suppose Assumption A1 holds. As in [37], the sequence W^k = (W_1^k, …, W_N^k) generated by Algorithm A1 converges to a critical point of (A2) if the following conditions hold: (i) Ψ(W) is a KL function; (ii) the partial gradients ∇_{W_i} H, ∀i, are Lipschitz continuous, and there exist positive constants l_i, l̄_i such that c_i^k ∈ (l_i, l̄_i), k = 1, 2, …; (iii) ∇_W H is Lipschitz continuous on bounded sets.
Proof. For the first condition, utilizing Proposition 1 and Lemmas 3–6 in [60], we can prove that, under our assumptions, Ψ(W) is a KL function. Equation (1) in the main text is also a KL function because the ℓ0 norm and the group lasso penalty are semi-algebraic [60,61]. According to our model assumptions and Remark 1 in [62], we can therefore show that Ψ(W) is a KL function. For the second condition, based on Assumption A1 and Lemma A1 below, ∇_{W_i} H, ∀i, is Lipschitz continuous. In addition, Algorithm A1 uses a backtracking strategy to estimate L_i at each iteration; therefore, there exist l_i = L_i^0 and l̄_i = L̄_i such that c_i^k ∈ (L_i^0, L̄_i), k = 1, 2, …
For the last condition, based on Lemma A3 below, ∇_W H(W_1, …, W_N) is Lipschitz continuous on any bounded set.

Algorithm A1 PALM Algorithm for Deep Learning Models
Initialize η > 1, L_1^0, …, L_N^0, and W_1^0, …, W_N^0.
for k = 1, 2, … do
    for i = 1 to N do
        Find the smallest i_k such that the backtracking (sufficient-decrease) condition holds, and update W_i^k by the corresponding proximal-gradient step.
    end for
end for

Lemma A1. Under Assumption A1, the derivatives of the loss function and of all activation functions used in the function R in (A1) are bounded and Lipschitz continuous; hence ∇_{W_i} H = ∇_{W_i} R is also bounded and Lipschitz continuous.
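To make the block structure of Algorithm A1 concrete, one sweep over the blocks can be sketched in Python. Everything here is illustrative: `grad_H`, `prox_r`, and `H` stand in for the partial gradient, the proximal operator of r_i, and the smooth part of the objective, and the backtracking test shown is the standard descent-lemma inequality, not necessarily the paper's exact condition.

```python
import numpy as np

def replace(W, i, V):
    # Return a copy of the block list W with block i replaced by V.
    out = list(W)
    out[i] = V
    return out

def palm_step(W, grad_H, prox_r, L0, eta, H):
    """One PALM sweep: block-wise proximal-gradient updates with backtracking.

    W:      list of weight arrays (one per block).
    grad_H: grad_H(i, W) -> partial gradient of H w.r.t. W[i].
    prox_r: prox_r(i, V, step) -> proximal operator of r_i at V.
    L0:     initial Lipschitz estimates per block; eta > 1 inflates them.
    H:      the smooth part of the objective (callable on the block list).
    """
    for i in range(len(W)):
        L = L0[i]
        g = grad_H(i, W)
        while True:
            V = prox_r(i, W[i] - g / L, 1.0 / L)
            # Backtracking: accept L once the sufficient-decrease
            # (descent-lemma) inequality holds for block i.
            quad = H(W) + np.sum(g * (V - W[i])) + (L / 2) * np.sum((V - W[i]) ** 2)
            if H(replace(W, i, V)) <= quad:
                break
            L *= eta
        W[i] = V
    return W
```

With r_i ≡ 0 (identity prox) and a quadratic H, one sweep reduces to plain block gradient descent, which makes the structure easy to sanity-check.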
Proof. By the chain rule, the partial derivative of R with respect to W_i can be written as a product of the derivative ∂ℒ/∂Φ of the loss function and the derivatives of the activation functions of the layers above. From Assumption A1, we know that both ∂ℒ/∂Φ and the derivatives of the activation functions are bounded and Lipschitz continuous. Based on Lemma A2 and (A6), the product of bounded and Lipschitz continuous functions is still bounded and Lipschitz continuous. Therefore, ∇_{W_i} H is bounded and Lipschitz continuous.
Proof. We first prove that the product of two bounded Lipschitz continuous functions is still Lipschitz continuous, and then extend the argument to the product of multiple functions to prove the lemma. Here L_max denotes the largest Lipschitz constant among L_1, …, L_N; defining M = L_max, we obtain the desired bound.
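The two-function case underlying this argument can be spelled out explicitly. Assuming f and g are Lipschitz with constants L_f, L_g and bounded by B_f, B_g (this notation is ours), adding and subtracting f(x)g(y) gives:

```latex
\begin{aligned}
|f(x)g(x) - f(y)g(y)|
  &\le |f(x)|\,|g(x) - g(y)| + |g(y)|\,|f(x) - f(y)| \\
  &\le \bigl(B_f L_g + B_g L_f\bigr)\,\|x - y\|,
\end{aligned}
```

so fg is Lipschitz with constant B_f L_g + B_g L_f, and induction over the factors yields the multi-function case.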
Proof. Based on Theorem 2 in [63] and Theorem 6.4 in [61], we can prove the above theorem.
Appendix A.2. The Lemma Used in the Proof of Theorem 1

Lemma A4. The proximal operator associated with the Euclidean norm ‖·‖ has a closed-form solution:

\[
\arg\min_{x} \; \frac{1}{2}\|x - y\|^2 + \lambda\|x\| \;=\; \max\!\left(1 - \frac{\lambda}{\|y\|},\, 0\right) y,
\]

where x ∈ R^n and y ∈ R^n are both n-dimensional vectors and λ > 0. This is a known result and has been used in a previous study [64].
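A direct transcription of this closed form (block soft-thresholding; the function name `prox_l2` is ours) is:

```python
import numpy as np

def prox_l2(y, lam):
    # Closed-form proximal operator of lam * ||x||_2 (Lemma A4):
    # shrink y toward the origin, zeroing it entirely when ||y|| <= lam.
    norm = np.linalg.norm(y)
    if norm <= lam:
        return np.zeros_like(y)
    return (1.0 - lam / norm) * y
```

Unlike entrywise soft-thresholding, this operator zeroes the whole vector at once, which is what produces group-level sparsity.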
Appendix A.3. The Algorithm for ℓ1 Sparse Group Lasso

In Section 5.3, we conduct experiments on the ℓ1-norm sparse group lasso for comparison with our proposed ℓ0 sparse group lasso. Here, we elaborate the ℓ1-norm sparse group lasso algorithm. We first consider the following proximal operator associated with the ℓ1 overlapping group lasso regularization:

\[
\pi_{\lambda_1 \lambda_2}(v) := \arg\min_{x} \; \frac{1}{2}\|x - v\|^2 + \lambda_1 \|x\|_1 + \lambda_2 \sum_{i=1}^{k} \|x_{G_i}\|,
\]

where the regularization coefficients λ_1 and λ_2 are non-negative, and G_i ⊆ {1, 2, …, n}, i = 1, 2, …, k, denotes the set of indices corresponding to the i-th group. According to Theorem 1 in [59], π_{λ_1 λ_2}(·) can be derived from π_{0 λ_2}(·); we present this conclusion in Lemma A5.

Lemma A5. Let u = sgn(v) ⊙ max(|v| − λ_1, 0); then π_{λ_1 λ_2}(v) = π_{0 λ_2}(u).

According to Lemma A4, it is easy to verify that, given x_{G_i} ∩ x_{G_j} = ∅ for i ≠ j, the optimal x_{G_i} minimizing h_{λ_2}(x) is given by the block soft-thresholding in (A14), i = 1, 2, …, k. For the fully connected layers, we define the ℓ1 overlapping group lasso regularizer over the columns W_{:,q}, where W_{:,q} represents the output weights of the q-th neuron of W. According to Lemma A5, let W̃ = sgn(W) ⊙ max(|W| − λ_1, 0); then ψ_{λ_1 λ_2}(W_{:,q}) reduces to ψ_{0 λ_2}(W̃_{:,q}), which can be solved via (A14).
For the convolutional layers, we denote the ℓ1 overlapping group lasso regularizer analogously over the channel slices W_{:,c,:,:}, where W_{:,c,:,:} represents the weights of the c-th channel. By incorporating (A15) and (A17) into our model, the ℓ1-norm sparse group lasso regularized problem can be formulated as in (A19). We elaborate the DNN_PALM algorithm associated with the ℓ1-norm sparse group lasso in Algorithm A2, in which the partial derivatives of H are used in the proximal updates.
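Under the disjoint-group case used in the reduction above, the two-stage operator of Lemma A5 — entrywise soft-thresholding at λ1 followed by groupwise block soft-thresholding at λ2 — can be sketched as follows (the function name is ours; groups are given as disjoint index lists):

```python
import numpy as np

def prox_sparse_group_l1(v, groups, lam1, lam2):
    # Two-stage proximal operator from Lemma A5 (non-overlapping groups):
    # (1) entrywise soft-thresholding at lam1,
    # (2) block soft-thresholding of each group at lam2.
    u = np.sign(v) * np.maximum(np.abs(v) - lam1, 0.0)
    x = np.zeros_like(u)
    for g in groups:
        norm = np.linalg.norm(u[g])
        if norm > lam2:
            x[g] = (1.0 - lam2 / norm) * u[g]
    return x
```

Groups whose soft-thresholded norm falls below λ2 are zeroed entirely, which is the structured-sparsity effect exploited for GEMM.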

Algorithm A2 DNN_PALM Algorithm for ℓ1-norm Group Lasso
Initialize µ > 1, L_i^0 > 0, (W^{(i)})^0, ∀i, and L_j^0 > 0, (T^{(j)})^0, ∀j. for k = 1, 2, … do … end for

Iterative pruning [4] is another effective method for obtaining a sparse network while maintaining high accuracy. As iterative pruning is orthogonal to our method, we can couple the two to obtain even better performance per number of parameters used; specifically, we replace the usual weight-decay regularizer used in [4] with our ℓ0 sparse group lasso regularizer. In practice, we find empirically that the iterative method achieves better performance. All results reported in the paper are from the iterative method.
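Schematically, coupling iterative pruning with regularized training amounts to alternating two steps; `train_one_round` and `prune` below are placeholders for the actual retraining and pruning routines, not names from the paper's code:

```python
def iterative_prune(model, train_one_round, prune, n_rounds=5):
    # Schematic prune/retrain loop: each round retrains the model
    # (with the regularized objective) and then prunes it, so that
    # accuracy can recover between successive pruning steps.
    for _ in range(n_rounds):
        model = train_one_round(model)  # optimize loss + regularizer
        model = prune(model)            # zero out small / grouped weights
    return model
```

The alternation matters: pruning once at the end removes weights the network never had a chance to compensate for, whereas interleaving lets the remaining weights absorb the lost capacity.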

Appendix B.2. Hyper-Parameter Settings
In our experiments, all baseline models were trained from scratch via stochastic gradient descent (SGD) with a momentum of 0.9. We trained the networks for 150 epochs on MNIST and 400 epochs on CIFAR-10 and Tiny-ImageNet, with an initial learning rate of 0.1 and a weight decay of 5 × 10⁻⁴. The learning rate was decayed by a factor of 10 at epochs 50 and 100 on MNIST and at epochs 100 and 200 on CIFAR-10 and Tiny-ImageNet, respectively.
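The step-decay schedule described above can be expressed as a small helper (a sketch; `step_lr` is our name, not from the paper's code):

```python
def step_lr(initial_lr, epoch, milestones, gamma=0.1):
    # Step decay: multiply the learning rate by gamma (here 1/10)
    # at each milestone epoch that has already been reached.
    lr = initial_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

For MNIST this corresponds to `step_lr(0.1, epoch, [50, 100])`, and for CIFAR-10 and Tiny-ImageNet to `step_lr(0.1, epoch, [100, 200])`.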
The experimental hyper-parameter settings for all DNN models used in the paper are summarized in Table A1. We employ an iterative pruning strategy to prune all models; namely, the pruning process and the retraining process are performed alternately. Table A1. List of hyper-parameters and their values ("-" denotes "not applicable").

Appendix C.1. Computational Efficiency
We note that our main focus is compressing DNN models via ℓ0 sparse group lasso, and our main contribution is solving the corresponding optimization problems; speed is not our primary concern. Nevertheless, we provide the run time of our method as a reference. We compared the run time of MobilePrune with that of the baseline methods (the original training procedures without pruning) on the same machine and found that MobilePrune is, on average, around 5× slower.

Appendix C.3. Additional Ablation Studies
In this section, we perform ablation studies comparing DNN models regularized by the proposed ℓ0 sparse group lasso against models regularized by its individual components. Specifically, we compare models regularized by the proposed ℓ0 sparse group lasso with models regularized by the ℓ0-norm penalty alone (group lasso penalty set to 0) and by the group lasso penalty alone (ℓ0-norm penalty set to 0), respectively. For a fair comparison, all regularized DNN models use the same hyper-parameter settings.
From Tables A3–A6, we observe that, as expected, the ℓ0-norm penalty alone has no effect on structured pruning, whereas the group lasso penalty can effectively remove redundant structural components. Furthermore, their combination (our proposed ℓ0 sparse group lasso penalty) yields sparser models at both the structural level and the individual-weight level; notably, the ℓ0 norm helps the group lasso remove more redundant structural components. Therefore, better acceleration in terms of FLOPs can be obtained by applying our proposed ℓ0 sparse group lasso penalty. We note that when computing FLOPs, we do not take individual-weight sparsity into account; however, based on [11,65], lower FLOPs could be achieved by utilizing weight-level sparsity on dedicated architectures. In addition, we compare the proposed ℓ0 sparse group lasso with the ℓ1-norm group lasso. The algorithm for DNN models with the ℓ1-norm group lasso penalty is introduced in Algorithm A2. For the hyper-parameter setting, we use the same parameters as for the ℓ0 sparse group lasso penalty (as shown in Table A1), except for the ℓ1-norm coefficient, which we search in [0.0001, 0.01], reporting the best results in terms of pruned test accuracy.
From Tables A3–A6, we find that ℓ0 sparse group lasso penalized models outperform ℓ1 sparse group lasso penalized models in terms of test accuracy and FLOPs. For the VGG19 model (Table A6), the ℓ1 sparse group lasso penalized model achieves the fewest parameters, but its pruned test accuracy and FLOPs are much worse than those of the ℓ0 sparse group lasso penalized model.
Appendix C.5. The Effect of the Coefficient of the ℓ0-Norm Regularizer

In our proposed ℓ0 sparse group lasso, the ℓ0-norm regularizer plays an important role in pruning networks effectively and efficiently, as shown by the ablation studies with and without the ℓ0-norm penalty. We further explore the effect of the ℓ0-norm coefficient on pruning performance by varying the shrinkage strength of the ℓ0-norm penalty by factors of 10 while keeping the other settings fixed.
As can be seen from Tables A7 and A8, the larger the ℓ0-norm coefficient, the more parameters are pruned, as expected. Additionally, there is a trade-off between the shrinkage coefficients of the ℓ0-norm penalty and the group lasso penalty, which depends on the practical demand.

K-Fold Cross-Validation
In order to improve the proposed model's final performance, k-fold cross-validation is used after the above segmentation step. The principle of k-fold cross-validation is to split the input samples into k groups; it can lead to a less biased (less optimistic) assessment of the model's ability than other methods [67]. All training data samples are used for both training and validation in a k-fold cross-validation approach. First, we divide the training data samples into k equal subsets. Then, we pick one subset as the validation set and the remaining k − 1 subsets as the training set. There are k different ways to select the validation set, and therefore k different pairs of training and validation datasets. In this paper, we choose k = 5 and evaluate all 5 training/validation pairs for our proposed method and all state-of-the-art pruning methods. The final training and validation datasets are selected according to the final performance on the test set.
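The split construction described above can be sketched as follows (assuming the sample count divides evenly by k; the function name is ours):

```python
def k_fold_splits(n_samples, k=5):
    # Enumerate the k (train, validation) index pairs of k-fold
    # cross-validation: each fold serves as the validation set exactly once.
    indices = list(range(n_samples))
    fold = n_samples // k
    splits = []
    for i in range(k):
        val = indices[i * fold:(i + 1) * fold]
        train = indices[:i * fold] + indices[(i + 1) * fold:]
        splits.append((train, val))
    return splits
```

Every sample appears in exactly one validation fold across the k pairs, which is what makes the resulting estimate less biased than a single fixed split.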

Appendix D.4. Hyper-Parameters Tuning
Hyper-parameters have a great impact on the deep learning model performance. In the following context, we will present how to select the training subset, validation subset based on the 5-fold cross-validation before the training stage, how to pick the learning rate during the training stage, how to select the model by the epochs during the training stage, and how to pick pruning threshold to compress the final model. The experiments are implemented on all three datasets and the model performance is evaluated by varying several model parameters.

Appendix D.4.1. Cross-Validation Tuning
In order to improve the proposed model's final performance, k-fold cross-validation is used after segmenting the input samples; this approach can lead to a less biased (less optimistic) assessment of the model's ability than other methods [67]. Table A9 shows the results on the test set corresponding to different validation-set choices. The best results, based on both pruned accuracy and the percentage of nonzero parameters, are obtained when fold numbers 4, 5, and 1 are used as the validation set for the WISDM, UCI-HAR, and PAMAP2 datasets, respectively.

Appendix D.4.2. Learning Rate Tuning
The learning rate is a hyper-parameter that controls how much the model changes in response to the estimated error each time the model weights are updated [68]. Table A9 shows the experimental results of different learning-rate settings. For the WISDM dataset, the best pruned accuracy and parameter remaining percentage are achieved when the learning rate equals 1.5 × 10⁻⁴. For the UCI-HAR dataset, the best results are obtained when the learning rate is 2 × 10⁻⁴. For the PAMAP2 dataset, the best pruned accuracy is achieved when the learning rate is 1.5 × 10⁻⁴. To keep the experimental settings consistent and comparable, we set the learning rate to 1.0 × 10⁻⁴ for all three datasets.

Table A9. Impact of different cross-validation fold numbers and learning rates of the proposed ℓ0 sparse group lasso approach on each HAR dataset (WISDM, UCI-HAR, and PAMAP2, respectively; our selected fold number and learning rate for each dataset are highlighted).

The number of epochs is a hyper-parameter that defines how many times the learning algorithm works through the entire training dataset. Figure A1a–c show the training, validation, and testing accuracy versus the number of epochs for the WISDM, UCI-HAR, and PAMAP2 datasets, respectively. For all three datasets, the total number of epochs is 150; we pick the epoch number with the highest validation accuracy for each dataset. The validation dataset is distinct from the test dataset and is used to give an unbiased estimate of the final model. The highest validation accuracy occurs at epochs 150, 102, and 113 for the WISDM, UCI-HAR, and PAMAP2 datasets, respectively.

For all pruning methods, including ℓ0 sparse group lasso, the weights after pruning cannot be exactly zero due to finite-precision binary representation. Therefore, if a weight's magnitude is below a custom pruning threshold, we set it to zero in the pruned models. Figure A1d–f show the experimental results of different pruning-threshold settings for the WISDM, UCI-HAR, and PAMAP2 datasets, respectively. For the WISDM dataset, the best pruned accuracy and parameter remaining percentage are achieved when the threshold equals 0.015; for the UCI-HAR dataset, when the threshold is 0.005; for the PAMAP2 dataset, the best pruned accuracy is achieved when the threshold is 0.01.
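The thresholding step described above can be sketched as follows (the function name and the returned percentage are ours):

```python
import numpy as np

def threshold_prune(weights, tau):
    # Zero out weights whose magnitude falls below the pruning threshold
    # tau, and report the percentage of nonzero parameters that remain.
    pruned = np.where(np.abs(weights) < tau, 0.0, weights)
    remaining = 100.0 * np.count_nonzero(pruned) / pruned.size
    return pruned, remaining
```

With the thresholds selected above, one would call `threshold_prune(w, 0.015)` for WISDM, `threshold_prune(w, 0.005)` for UCI-HAR, and `threshold_prune(w, 0.01)` for PAMAP2.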