Robust CNN Compression Framework for Security-Sensitive Embedded Systems

Convolutional neural networks (CNNs) have achieved tremendous success in solving complex classification problems. Motivated by this success, various compression methods have been proposed for downsizing CNNs so that they can be deployed on resource-constrained embedded systems. However, a new type of vulnerability of compressed CNNs, known as adversarial examples, has been discovered recently. This vulnerability is critical for security-sensitive systems because adversarial examples can cause CNNs to malfunction and can be crafted easily in many cases. In this paper, we propose a compression framework that produces compressed CNNs robust against such adversarial examples. To achieve this goal, our framework uses both pruning and knowledge distillation with adversarial training. We formulate our framework as an optimization problem and provide a solution algorithm based on the proximal gradient method, which is more memory-efficient than the popular ADMM-based compression approaches. In experiments, we show that our framework improves the trade-off between adversarial robustness and compression rate compared to the existing state-of-the-art adversarial pruning approach.


Introduction
In the past few years, convolutional neural networks (CNNs) have achieved great success in many applications including image classification and object detection. Despite this success, the excessively large number of learning parameters and the vulnerability to adversarial examples [1][2][3][4][5][6][7][8] make it difficult to deploy CNNs, especially in resource-constrained environments such as smartphones, automobiles, and wearable devices. To address the former, various model compression methods have been proposed, many of which are based on weight pruning [9][10][11][12][13][14][15][16][17]. Weight pruning generates sparse learning weights by solving an optimization problem with sparsity constraints on the weights, and the actual compression is then accomplished by removing zero weights from the trained model. Although the approach is quite simple, state-of-the-art weight pruning methods [16,17] achieve a high compression rate with little drop in accuracy.
On the other hand, it has been reported that even state-of-the-art CNNs are vulnerable to adversarial attacks [1][2][3][4][5][6][7][8]. Adversarial attacks use perturbed inputs that cause misclassification even though the modification is nearly imperceptible. Such perturbations can be easily produced by exploiting the gradient information of the target neural network [1,4,6]. Furthermore, some works show that an adversary can even generate adversarial examples without knowing anything about the target neural network [5].
Adversarial training [1,6] has been proposed as a countermeasure to adversarial attacks, bringing robustness to neural networks against adversarial inputs. This method trains a classifier not only with training examples but also with adversarial examples actively generated by the defender for known types of adversarial perturbations. In particular, adversarial training based on the projected gradient descent attack [6] is known to provide high robustness against first-order adversaries [1,4,6]. However, it has been shown that adversarial training requires a significantly larger network capacity to achieve high accuracy on both original and adversarial examples [6].
Recently, the vulnerability of compressed neural networks has been raised as an issue [18]. As shown in Madry et al. [6], the adversarial robustness of compressed neural networks is hard to achieve due to their limited architectural capacity. This prevents compressed neural networks from being deployed in trust-sensitive domains. Despite the seriousness of this problem, only a few methods have been proposed [19,20]. One notable technique is to consider adversarial robustness and model compression at the same time. Ye et al. [19] and Gui et al. [20] formulated an optimization problem by combining adversarial training with pruning and solved it with the alternating direction method of multipliers (ADMM) framework. These works demonstrated that considering weight pruning and adversarial training concurrently yields a better trade-off between robustness and compression rate than considering them separately. However, the ADMM framework requires two auxiliary tensors, each of which has the same size as the learning parameter tensor of a CNN, which leads to a heavy memory burden in a resource-constrained environment. In this paper, we show that the joint optimization of pruning and adversarial training can be solved more memory-efficiently using the proximal gradient method (PGM) without any auxiliary tensors.
Furthermore, we found that consistently providing information about the pretrained original network during adversarial training can improve the robustness of the resulting compressed network. Based on this observation, we propose a novel robust pruning framework that jointly uses pruning and knowledge distillation [21] within the adversarial training procedure. Knowledge distillation is a technique to transfer the information of one network (the teacher) to another network (the student) by minimizing the gap between the SoftMax outputs of the two networks. In our framework, we use a pretrained original network as the teacher and provide its SoftMax output to a student network being compressed. We summarize our contributions as follows:

• We propose a new robust weight compression framework for CNNs that uses pruning and knowledge distillation jointly within the adversarial training procedure. Our method is described as an optimization problem that deals with pruning, knowledge distillation, and adversarial training concurrently.

• We show that our optimization problem can be solved with the proximal gradient method. Although the popular ADMM approach can also solve our optimization problem, it must keep two auxiliary tensors during optimization, which can be a burden in a memory-constrained environment. Our proximal gradient-based approach solves the optimization problem without using any auxiliary tensor.

• In experiments, we demonstrate that the knowledge distillation in our framework improves the adversarial robustness of the compressed CNNs. In addition, our method shows a better trade-off between adversarial robustness and compression rate compared to the state-of-the-art methods [15,19,22].

Adversarial Attacks
Adversarial attacks try to find allowable perturbations that change the prediction of the target network. In the image classification domain, the set of allowable perturbations is generally defined by bounding the $\ell_p$ norm of the perturbation to satisfy an imperceptibility constraint. Such perturbations can be generated by exploiting information about the target network. According to the amount of this information, adversarial attacks are categorized into black-box and white-box attacks. A black-box attack assumes a weak adversary who does not have any information about the target model. In this situation, the adversary must rely on query access for chosen input data [5] or on the transferability of adversarial examples [2,3]. In a white-box setting, an adversary can access the details of the target model such as its structure, parameters, and training dataset. Based on this strong assumption, most white-box attack methods [1,4,6] exploit the first-order information of the target model to generate sophisticated perturbations. In this paper, we focus on white-box attacks because it is important to study such attacks to implement effective defenses.

Adversarial Training
Adversarial training is a simple and intuitive learning strategy to enhance the robustness of a neural network against adversarial attacks. It generates adversarial examples using a first-order white-box attack [1,4,6] while training a neural network so that the network will correctly classify not only the training examples but also the generated examples. Adversarial training with a single-step attack such as the fast gradient sign method (FGSM) [1] is known to suffer from so-called label leaking [23], caused by the correlation between the perturbation and the true label. To prevent label leaking and to generate strong adversarial examples, Madry et al. [6] proposed projected gradient descent (PGD) attack-based adversarial training.
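To make the attack that drives this kind of training concrete, the following is a minimal pure-Python sketch of an $\ell_\infty$-bounded PGD attack on a toy linear SoftMax classifier. The linear model, the helper names (`softmax`, `cross_entropy`, `input_gradient`, `pgd_attack`), and the default budget and step size are assumptions chosen for illustration; they are not taken from the cited works, and clipping to a valid pixel range is omitted.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, x):
    # Toy linear classifier: one row of W per class.
    return [sum(wj * xj for wj, xj in zip(row, x)) for row in W]

def cross_entropy(W, x, y):
    p = softmax(logits(W, x))
    return -sum(yc * math.log(pc) for yc, pc in zip(y, p) if yc > 0)

def input_gradient(W, x, y):
    # Gradient of the cross-entropy w.r.t. the input: W^T (p - y).
    p = softmax(logits(W, x))
    r = [pc - yc for pc, yc in zip(p, y)]
    return [sum(W[c][j] * r[c] for c in range(len(W))) for j in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def pgd_attack(W, x, y, eps=0.3, step=0.05, iters=10, random_start=True, rng=None):
    """Untargeted PGD: ascend the loss, then project back onto the eps-ball."""
    rng = rng or random.Random(0)
    if random_start:
        # Uniform noise inside the ball as the starting point.
        x_adv = [xj + rng.uniform(-eps, eps) for xj in x]
    else:
        x_adv = list(x)
    for _ in range(iters):
        g = input_gradient(W, x_adv, y)
        x_adv = [xa + step * sign(gj) for xa, gj in zip(x_adv, g)]
        # Projection onto {x + d : ||d||_inf <= eps} is a coordinate-wise clip.
        x_adv = [min(max(xa, xj - eps), xj + eps) for xa, xj in zip(x_adv, x)]
    return x_adv

W = [[1.0, -2.0, 0.5], [-1.0, 0.5, 2.0]]
x = [0.2, -0.1, 0.4]
y = [1, 0]
x_adv = pgd_attack(W, x, y, eps=0.1, step=0.02, iters=10)
```

In adversarial training, examples such as `x_adv` would be regenerated for each mini-batch with the current weights and used in place of, or alongside, the clean inputs.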

Weight Pruning
Weight pruning is a model compression technique that sets unimportant learning weights to zero, resulting in sparse weights, and thereby removes redundant connections or components from a neural network. According to the unit of pruning, weight pruning is categorized into element-wise pruning and filter-wise pruning.
In their early stage, pruning methods focused on element-wise pruning, which generates irregular sparsity patterns. To set the values of redundant weights to zero, element-wise pruning [9] usually measures the importance of weights by their absolute values. Han et al. [10] showed that this simple pruning process can be effectively combined with weight quantization and Huffman coding to achieve further compression.
Filter-wise pruning is attracting more interest since it is better suited to GPU acceleration as well as to compressing convolution filters in CNNs. Some early works prune the filters of a CNN by measuring their importance by the $\ell_2$ norm [13] or by their effect on the activation map [12]. Based on these works, several advanced filter pruning methods [14][15][16][17][24] have been proposed by varying the way the importance of each filter is measured and the composition of the pruning procedure.
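The two granularities can be contrasted with a small pure-Python sketch. The function names, the sparsity-fraction interface, and the tie-breaking behaviour are illustrative assumptions rather than the procedure of any cited method: element-wise pruning zeroes the individual weights with the smallest absolute values, whereas filter-wise pruning zeroes whole filters with the smallest $\ell_2$ norm.

```python
import math

def prune_elementwise(weights, sparsity):
    """Zero the fraction `sparsity` of entries with the smallest magnitude."""
    n_zero = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_zero])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def prune_filterwise(filters, sparsity):
    """Zero the fraction `sparsity` of filters with the smallest l2 norm."""
    n_zero = int(len(filters) * sparsity)
    norms = [math.sqrt(sum(w * w for w in f)) for f in filters]
    order = sorted(range(len(filters)), key=lambda g: norms[g])
    drop = set(order[:n_zero])
    return [[0.0] * len(f) if g in drop else list(f) for g, f in enumerate(filters)]

flat = prune_elementwise([0.5, -0.1, 0.05, -2.0], sparsity=0.5)
filt = prune_filterwise([[1.0, 1.0], [0.1, 0.0], [0.0, 3.0], [0.2, 0.2]], sparsity=0.5)
```

The element-wise result has zeros scattered irregularly through the weight vector, while the filter-wise result removes entire rows, which is the regular pattern that maps more directly onto dense GPU kernels.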

Knowledge Distillation
The main idea of knowledge distillation [21] is to transfer the knowledge of a trained teacher network to a student network by training the student network using the input and the SoftMax output of the teacher. In its early stage, it was usually applied to model compression, achieved by transferring the knowledge of an over-parameterized teacher model to a smaller student model. Bucila et al. [25] first used this strategy with unlabeled synthesized data to transfer the knowledge of a large ensemble teacher. Hinton et al. [21] formally defined the knowledge distillation loss with temperature and showed that distillation is effective for transferring knowledge with the original training dataset.
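A temperature-based distillation loss can be sketched in a few lines of pure Python. The function names and the example logits below are assumptions for illustration; the sketch uses the common form in which both networks' logits are softened by the same temperature `t` and the cross-entropy is scaled by `t * t` so that gradient magnitudes stay comparable across temperatures.

```python
import math

def softmax_t(logits, t=1.0):
    """SoftMax with temperature t; larger t gives a softer distribution."""
    scaled = [z / t for z in logits]
    m = max(scaled)
    e = [math.exp(v - m) for v in scaled]
    s = sum(e)
    return [v / s for v in e]

def distillation_loss(student_logits, teacher_logits, t=4.0):
    """t^2-scaled cross-entropy between softened teacher and student outputs."""
    p_teacher = softmax_t(teacher_logits, t)
    p_student = softmax_t(student_logits, t)
    ce = -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
    return t * t * ce

teacher = [2.0, 1.0, 0.0]      # hypothetical teacher logits
student = [0.0, 1.5, 0.5]      # hypothetical student logits
loss_mismatch = distillation_loss(student, teacher)
loss_match = distillation_loss(teacher, teacher)
```

The loss is smallest when the student's softened output equals the teacher's, which is what pulls the student toward the teacher's decision boundary rather than only toward the one-hot labels.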
Distillation can also be used as a defense against adversarial examples. Defensive distillation [22] achieves adversarial robustness by applying distillation to student and teacher models that have the same structure. However, it has been shown that this defense can be easily broken [4].
Many methods have been proposed to improve the effectiveness of distillation. Distillation with boundary support samples [26] tries to improve the generalization performance of a student model by conducting the distillation with adversarial examples near the decision boundary. Distillation with a teacher assistant [27] fills the gap between student and teacher models by using intermediate models called teacher assistants.

Adversarially Robust Model Compression
To preserve the robustness of the compressed model, adversarial pruning, which combines the ideas of adversarial training and pruning, can be applied in most cases. Ye et al. [19] and Gui et al. [20] formulated an objective that includes both adversarial training and sparsity constraints, and showed that applying adversarial training and pruning concurrently yields better robustness than applying them separately. Xie et al. [28] used blind adversarial training [29] during adversarial pruning, which generates adversarial examples dynamically during adversarial training to reduce the sensitivity to the budget of adversarial examples. Madaan et al. [30] proposed a new pruning criterion to reduce the vulnerability of the latent space, represented by the difference between the activation map of an adversarial example and that of its original input.
Some works have also considered the adversarial robustness of types of compression other than pruning. Bernhard et al. [31] observed how adversarial robustness changes with different levels of quantization. Lin et al. [32] proposed a defensive quantization method that reduces the sensitivity of the neural network to its input. Goldblum et al. [33] used knowledge distillation to transfer the robustness of an over-parameterized model to a predefined smaller model.

Methods
The main objective of our method is to preserve the adversarial robustness of CNNs during the pruning procedure. An adversarially robust CNN should demonstrate high generalization performance on both original and adversarial inputs. One existing approach to generate such a CNN is adversarial pruning, which is the combination of adversarial training and pruning. However, adversarial pruning alone is not enough to achieve this goal, since the decision boundary of the original network quickly collapses during the initial stage of the pruning procedure due to the decrease in network capacity, which results in a large decrease in generalization performance on the original inputs. To solve this problem, we propose a novel robust pruning framework that combines adversarial pruning with knowledge distillation. Using this combination, we can consistently provide information about the decision boundary of the original network during adversarial pruning.
In this section, we first describe our definition of the adversary, and then formulate our entire framework as a single optimization problem showing that it can be solved efficiently by the proximal gradient method without using any auxiliary tensors.

The Attack Model
Before describing our proposed method, we first elaborate on the attack model. For this purpose, let us define the SoftMax output of a CNN with weight parameter $w \in \mathbb{R}^p$ as $f(\cdot; w)$. Let the data pairs $\{(x_i, y_i)\}_{i=1}^{n}$ be a training dataset. Here, $x_i \in \mathbb{R}^d$ is an input and $y_i \in \{0, 1\}^k$ is the corresponding one-hot encoded true label. Then, the training procedure of a CNN can be described as the following optimization problem.
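Written out with the notation just introduced, the training problem takes the standard empirical-risk form. The display below is a reconstruction from the surrounding definitions (with $L$ the loss defined in the next paragraph), offered as an assumption rather than a quotation:

```latex
\begin{equation}
\min_{w \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^{n} L\big(f(x_i; w),\, y_i\big) \tag{1}
\end{equation}
```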
Here, $L$ is the cross-entropy loss [34] that indicates the gap between the SoftMax output and the true label. For given discrete probability distributions $p$ and $q$, the cross-entropy loss is defined as follows:
The objective of the adversary is to change the prediction of the trained CNN by adding an imperceptible perturbation to the input image, which can be generated by either a targeted attack or an untargeted attack. In the targeted attack, the adversary generates a perturbation that minimizes the cross-entropy between the SoftMax output and a pre-defined target label that is different from the true label. Given an input data pair $(x, y)$ and a target label $y_t$, the targeted attack can be described as follows:
Since the effectiveness of the targeted adversarial attack varies depending on the chosen target label, most of the robust pruning literature [19,20,30] focuses on the untargeted attack for experimenting with adversarial examples, and we take the same approach. In the untargeted adversarial attack, we generate adversarial examples by maximizing the cross-entropy between the SoftMax output and the true label:
Also, we suppose a white-box setting where the adversary has full knowledge of the target CNN. In this case, the adversary can solve (2) and (3) by exploiting the gradient of the target CNN.
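For reference, the cross-entropy loss and the two attack problems described in this subsection can be written as follows. These displays are reconstructions from the prose, not quotations: the argument order of $L$ (SoftMax output first, label second), the symbol $\epsilon$ for the perturbation budget, and the generic $\ell_p$ constraint are assumptions.

```latex
\begin{align}
L(p, q) &= -\sum_{j=1}^{k} q_j \log p_j, \notag \\
\min_{\delta \,:\, \|\delta\|_p \le \epsilon} \; & L\big(f(x + \delta; w),\, y_t\big), \tag{2} \\
\max_{\delta \,:\, \|\delta\|_p \le \epsilon} \; & L\big(f(x + \delta; w),\, y\big). \tag{3}
\end{align}
```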

Adversarial Pruning with Distillation
Adversarial training is a type of robust optimization procedure that can be stated as the following min-max problem:
To solve the inner maximization problem of (4), we consider the projected gradient descent (PGD) attack method [6] with an $\ell_\infty$-norm feasible set. For a given data pair $(x, y)$, the PGD attack is defined as follows:
Here, $\Pi_{B(x,\epsilon)}$ is the projection onto the $\ell_\infty$-norm ball around $x$, defined as $B(x,\epsilon) := \{x + \delta : \|\delta\|_\infty \le \epsilon\}$. Note that uniformly distributed random noise is added to $x$ in the initial stage of the PGD attack to prevent the label leaking problem [23]. The solution of (4), which we denote as $w^*_{den}$, is generally non-sparse since there is no sparsity constraint in this optimization problem. By adding a sparse regularization term to (4), we obtain the objective of adversarial pruning,
where $\lambda > 0$ is a hyperparameter that controls the sparsity of $w$.
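The three problems referred to as (4), (5), and (6) can be written out as below. These are reconstructions consistent with the prose rather than quotations; in particular, the iteration index $j$ and the step-size symbol $\gamma$ in the PGD update are assumed names.

```latex
\begin{align}
\min_{w} \; & \frac{1}{n} \sum_{i=1}^{n} \max_{\|\delta_i\|_\infty \le \epsilon}
   L\big(f(x_i + \delta_i; w),\, y_i\big), \tag{4} \\
x^{(j+1)} &= \Pi_{B(x,\epsilon)}\Big( x^{(j)} + \gamma \,
   \operatorname{sign}\big( \nabla_x L(f(x^{(j)}; w),\, y) \big) \Big), \tag{5} \\
\min_{w} \; & \frac{1}{n} \sum_{i=1}^{n} \max_{\|\delta_i\|_\infty \le \epsilon}
   L\big(f(x_i + \delta_i; w),\, y_i\big) + \lambda \|w\|_0 . \tag{6}
\end{align}
```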
Generally, the solution of (1), denoted by $w^*$, is used as the initial weights for solving (6). Here, our question is how to effectively preserve the accuracy of $w^*$ on original inputs during the adversarial pruning procedure. The accuracy on the original inputs drops substantially during adversarial pruning since the one-hot encoded label $y_i$ in (6) does not contain any information about the decision boundary of $w^*$.
To consistently provide the information of $w^*$ during pruning, we combine the knowledge distillation idea with adversarial pruning. In our method, the pretrained network works as a teacher and provides its SoftMax output $f_t(\cdot; w^*)$ on the original input during the adversarial pruning procedure. The proposed objective is formulated as follows:
Here, $\delta$ is the solution of (3) and $t$ is a distillation hyperparameter [21]. The second term is multiplied by $t^2$ to prevent the gradient shrinking problem [21]. The second term in (7) is the distillation loss, which indicates the cross-entropy between the SoftMax output of the currently pruned model $f(\cdot; w)$ and that of the teacher model $f(\cdot; w^*)$. The overall formulation of (7) can be interpreted as a linear combination of the adversarial pruning loss (6) and the distillation loss. By solving (7), we can obtain a sparse but robust solution that approximates the decision boundary of $w^*$. Our framework can be extended to filter pruning by replacing the third (regularizer) term with the number of non-zero filters as follows:
Here, $G$ is the number of filters and $w_g$ is the weight vector of the $g$-th filter.
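One form of the objective that is consistent with the description above is sketched below. It should be read as an assumption-laden reconstruction, not as the exact published formula: the placement of the mixing weight $\alpha$ (here $1-\alpha$ on the adversarial term and $\alpha$ on the distillation term), and the choice to evaluate the student on the perturbed input while the teacher sees the original input, are inferred from the text. $f_t$ denotes the SoftMax output computed with temperature $t$.

```latex
\begin{align}
\min_{w} \; & \frac{1}{n} \sum_{i=1}^{n} \Big[ (1-\alpha)\,
   L\big(f(x_i + \delta_i; w),\, y_i\big)
   + \alpha\, t^2\, L\big(f_t(x_i + \delta_i; w),\, f_t(x_i; w^*)\big) \Big]
   + \lambda \|w\|_0, \tag{7} \\
\min_{w} \; & \frac{1}{n} \sum_{i=1}^{n} \Big[ (1-\alpha)\,
   L\big(f(x_i + \delta_i; w),\, y_i\big)
   + \alpha\, t^2\, L\big(f_t(x_i + \delta_i; w),\, f_t(x_i; w^*)\big) \Big]
   + \lambda \sum_{g=1}^{G} \mathbb{1}\big[\|w_g\|_2 \neq 0\big]. \tag{8}
\end{align}
```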

Optimization
Most adversarial pruning approaches, for example Ye et al. [19] and Gui et al. [20], use the alternating direction method of multipliers (ADMM) to solve the resulting optimization problem. However, by construction, ADMM requires two additional tensors besides the learning weights during optimization, which can be prohibitive in a resource-constrained environment with limited memory. Here, we suggest another algorithm based on the proximal gradient method to solve our proposed optimization problem (7), which does not require such auxiliary tensors. For simplicity, we denote the linear combination of the two cross-entropy losses in (7) by $L_{APD}$:
Here, APD stands for adversarial pruning with distillation. Then we can rewrite (7) as
By applying a second-order Taylor approximation around $w^k$ and the Hessian approximation $\nabla^2 L_{APD}(w^k) \approx \frac{1}{\eta_k} I_{p \times p}$ for some $\eta_k > 0$ to (9), we obtain the following formulation:
Here, $I_{p \times p}$ denotes the $p \times p$ identity matrix. Based on this successive approximation, the weight update can be formulated as follows:
By removing the redundant parts of the above weight update equation, we obtain
We can rewrite the above equation as follows:
By adding the constant $\|w^k - \eta_k \nabla L_{APD}(w^k)\|^2$, we obtain
Then, we get the following equation:
This is exactly the form of the proximal operator, which is described as
For each element, the proximal operator with the $\ell_0$ regularization term can be computed as
It is simply the thresholding operation that sets each updated weight parameter whose magnitude is smaller than $\sqrt{\lambda}$ to zero. Note that by controlling the value of $\lambda$, we can explicitly control the sparsity of the network. The entire process of our method is described in Algorithm 1.
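The chain of steps described above follows the standard proximal gradient derivation, which can be written compactly as below. This is a sketch under one common scaling convention, where $\operatorname{prox}_{h}(z) := \arg\min_w \tfrac{1}{2}\|w - z\|^2 + h(w)$; the constants are therefore assumptions and may differ from the original displays. Under this convention the element-wise threshold is $\sqrt{2\eta_k\lambda}$, and the threshold $\sqrt{\lambda}$ quoted in the text corresponds to absorbing the factor $2\eta_k$ into $\lambda$.

```latex
\begin{align}
w^{k+1}
 &= \arg\min_{w} \; L_{APD}(w^k) + \nabla L_{APD}(w^k)^{\top}(w - w^k)
    + \frac{1}{2\eta_k}\|w - w^k\|^2 + \lambda\|w\|_0 \notag \\
 &= \arg\min_{w} \; \nabla L_{APD}(w^k)^{\top}(w - w^k)
    + \frac{1}{2\eta_k}\|w - w^k\|^2 + \lambda\|w\|_0 \notag \\
 &= \arg\min_{w} \; \frac{1}{2\eta_k}
    \big\|w - \big(w^k - \eta_k \nabla L_{APD}(w^k)\big)\big\|^2 + \lambda\|w\|_0 \notag \\
 &= \operatorname{prox}_{\eta_k \lambda \|\cdot\|_0}
    \big(w^k - \eta_k \nabla L_{APD}(w^k)\big), \notag \\
\big[\operatorname{prox}_{\eta_k \lambda \|\cdot\|_0}(z)\big]_i
 &= \begin{cases} z_i, & |z_i| > \sqrt{2\eta_k\lambda}, \\ 0, & \text{otherwise.} \end{cases} \notag
\end{align}
```

The second line drops the term $L_{APD}(w^k)$, which does not depend on $w$; the third completes the square, which only adds another constant. No auxiliary tensor of the size of $w$ is needed beyond the gradient itself.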

Algorithm 1: Adversarial Pruning with Distillation (APD)
Input: a distillation temperature $t$, a learning rate for the student $\eta_s$, a learning rate for the teacher $\eta_t$, the train dataset $\{(x_i, y_i)\}_{i=1}^{n}$ where
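The core update of the method, a gradient step on the smooth loss followed by hard thresholding, can be sketched in pure Python. This is an illustration under stated assumptions rather than a transcription of Algorithm 1: `hard_threshold`, `proximal_gradient_step`, `toy_grad`, and the toy quadratic loss standing in for $L_{APD}$ are hypothetical, and the threshold is taken as $\sqrt{\lambda}$ as stated in the text.

```python
import math

def hard_threshold(z, lam):
    """Element-wise l0 proximal step: zero entries with |z_i| <= sqrt(lam)."""
    thr = math.sqrt(lam)
    return [zi if abs(zi) > thr else 0.0 for zi in z]

def proximal_gradient_step(w, grad_fn, lr, lam):
    """One update: gradient step on the smooth loss, then thresholding."""
    g = grad_fn(w)
    z = [wi - lr * gi for wi, gi in zip(w, g)]
    return hard_threshold(z, lam)

# Toy smooth loss 0.5 * ||w - target||^2 standing in for L_APD (assumption).
target = [1.0, 0.05, -0.8, 0.01]

def toy_grad(w):
    return [wi - ti for wi, ti in zip(w, target)]

w = [0.0, 0.0, 0.0, 0.0]
for _ in range(50):
    w = proximal_gradient_step(w, toy_grad, lr=0.5, lam=0.01)
```

Only the weights and their gradient are held in memory during the loop; the small entries of `target` never exceed the threshold and stay exactly zero, while the large entries converge to their unregularized values.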
The CIFAR10 dataset consists of 32 × 32 color images, with 50,000 training images and 10,000 test images. As in Han et al. [10], we use the term "compression rate" to indicate the ratio of the number of zeros to the number of entire weight parameters in a CNN. We denote the test accuracy on the original images as "original accuracy" and the test accuracy on the adversarial images as "adversarial accuracy". As in other literature [19,20,33], we consider that the robustness of a model is improved when both the original accuracy and the adversarial accuracy are improved. Otherwise, we consider the model with the higher mean of the original and adversarial accuracy to be more robust. Given the time spent on adversarial training for large networks, we set the number of iterations of the projected gradient descent (PGD) attack to 5 for the adversarial training of VGG16 and ResNet18. In this case, we evaluated the adversarial accuracy against both 10 iterations of the PGD attack (denoted by PGD10) and 5 iterations of the PGD attack (denoted by PGD5). For the remaining PGD attack parameters, we followed Ye et al. [19]; these settings are strong enough to make the adversarial accuracy of the naturally trained LeNet, VGG16, and ResNet18 close to zero. The implementation of our method is available as open source (https://github.com/JEONGHYUN-LEE/APD).
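The comparison rule stated above can be expressed as a short function. The function name, the tuple layout `(original_accuracy, adversarial_accuracy)`, and the accuracy values in the usage lines are hypothetical and serve only to illustrate the rule.

```python
def is_more_robust(a, b):
    """Return True if model `a` is judged more robust than model `b`.

    Each model is a pair (original_accuracy, adversarial_accuracy) in percent.
    """
    a_orig, a_adv = a
    b_orig, b_adv = b
    # Case 1: both accuracies improve, so robustness is considered improved.
    if a_orig > b_orig and a_adv > b_adv:
        return True
    # Case 2: otherwise, fall back to the mean of the two accuracies.
    return (a_orig + a_adv) / 2 > (b_orig + b_adv) / 2

# Hypothetical numbers, not results from the experiments.
both_better = is_more_robust((98.0, 94.0), (97.0, 92.0))
trade_off = is_more_robust((80.0, 45.0), (92.0, 20.0))   # means 62.5 vs 56.0
```

The second call illustrates the fallback: a model that gives up original accuracy can still be ranked higher when its gain in adversarial accuracy raises the mean.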

The Effect of Knowledge Distillation
We compared the results of adversarial pruning (denoted by AP) (6) and our method (denoted by APD) (7) to show the effectiveness of knowledge distillation, for both element-wise pruning and filter pruning. In this comparison, we set the value of $\alpha$ in (7) to 1 to maximize the effect of the SoftMax output of the teacher network. Also, we set the temperature $t$ of the knowledge distillation to 10 for the MNIST dataset and to 100 for the CIFAR10 dataset for a similar reason.

Element-Wise Pruning
Generally, element-wise pruning [9,10] can achieve higher sparsity with only a small drop in accuracy compared to filter pruning [11][12][13][14][15]. Therefore, we tested element-wise pruning at relatively high compression rates (×2, ×3, ×4) compared to filter pruning [39]. As in Ye et al. [19], we applied the same sparsity to every convolution layer in the target neural network. For instance, if the compression rate of a given network is set to ×2, we set the fraction of zero weights in every layer of this network to 0.5. With this pruning scheme, we compared the element-wise pruning result of our method (7) with that of adversarial pruning (6). Both methods were optimized with proximal gradient descent. This comparison shows how much improvement is achieved by the knowledge distillation in our method. The results on MNIST and CIFAR10 are summarized in Tables 1 and 2, respectively.
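The uniform per-layer scheme can be illustrated as follows. The mapping from a compression rate ×r to a zero fraction of 1 − 1/r is an assumption extrapolated from the single example in the text (×2 corresponds to 0.5); the function names and the toy layers are likewise hypothetical.

```python
def sparsity_from_rate(rate):
    """Fraction of zero weights for a compression rate of x`rate` (assumed mapping)."""
    return 1.0 - 1.0 / rate

def prune_layers_uniform(layers, rate):
    """Apply the same sparsity to every layer by magnitude."""
    s = sparsity_from_rate(rate)
    pruned = []
    for w in layers:
        n_zero = int(round(len(w) * s))
        order = sorted(range(len(w)), key=lambda i: abs(w[i]))
        drop = set(order[:n_zero])
        pruned.append([0.0 if i in drop else v for i, v in enumerate(w)])
    return pruned

layers = [[0.3, -0.1, 0.7, 0.05], [1.0, -2.0]]   # two toy layers
pruned = prune_layers_uniform(layers, rate=2)
zero_fractions = [sum(v == 0.0 for v in w) / len(w) for w in pruned]
```

Each layer ends up with the same fraction of zeros regardless of its size, which is the property the per-layer scheme is meant to guarantee.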
A popular small network, LeNet [35], is enough to achieve high accuracy on the MNIST dataset. Our baseline LeNet, trained by the original training process, achieves an original accuracy of 99.34% and an adversarial accuracy of 0%. With LeNet, our method (APD) showed a large improvement in both original accuracy and adversarial accuracy over adversarial pruning (AP). At the compression rate of ×2, APD improved the original accuracy by 1.01% and the adversarial accuracy by 2.28% over AP. At the relatively high compression rates of ×3 and ×4, APD achieved a larger improvement in both original accuracy and adversarial accuracy. In particular, the improvement in adversarial accuracy achieved by APD at the compression rates of ×3 and ×4 was more than 20%. Compared to the baseline performance, APD achieved the compression rate of ×4 with an adversarial accuracy of 94.25% while reducing the original accuracy by about 1%.
We also applied APD and AP to two CNNs, VGG16 [36] and ResNet18 [37], with the CIFAR10 dataset. Achieving high adversarial robustness on the CIFAR10 dataset is more challenging since it requires a higher architectural capacity of the CNN compared to the MNIST dataset. Our baseline VGG16 achieved an original accuracy of 92.99% and an adversarial accuracy of 0%. Despite the difficulty, APD showed an improvement with VGG16 at all compression rates. For instance, at the compression rate of ×4, APD improved the original accuracy by 0.88% and the adversarial accuracy against both PGD5 and PGD10 by more than 1% over AP. Though ResNet18 consists of fewer parameters than VGG16 (11 M vs. 138 M), the generalization performance of ResNet18 on the CIFAR10 dataset is higher than that of VGG16. The baseline ResNet18 showed an original accuracy of 94.40% and an adversarial accuracy of 0.03%. With ResNet18, APD improved the original accuracy and the adversarial accuracy against both PGD5 and PGD10 by more than 2% over AP at all compression rates. Based on these results, we conclude that consistently providing the SoftMax output of the baseline CNN through knowledge distillation improves the adversarial robustness of the element-wise pruning solution.

Filter Pruning
Filter pruning [11][12][13][14][15] generates sparsity patterns that are better suited to GPU acceleration than those of element-wise pruning [9,10]. However, the sparsity that filter pruning can achieve is often lower than that of element-wise pruning [39]. Therefore, we used smaller compression rates (×1.5, ×2, and ×2.5) than those used for element-wise pruning. As with element-wise pruning, we set the same sparsity for each convolution layer. We compared our method (APD) with adversarial pruning (AP) to show the effectiveness of knowledge distillation for filter pruning. The results on MNIST and CIFAR10 are summarized in Tables 3 and 4, respectively.
With LeNet, APD improved both original accuracy and adversarial accuracy at all compression rates. For instance, at the largest compression rate of ×2.5, APD improves the original accuracy by 0.36% and the adversarial accuracy by 1.44%. The improvement in original accuracy tends to be smaller than the improvement in adversarial accuracy since the original accuracy is already close to that of the baseline network. APD also showed an improvement in both accuracy measures on the CIFAR10 dataset. With VGG16, APD improved the original accuracy significantly at high compression rates. For instance, at the compression rate of ×2.5, the original accuracy is improved by 5.23%. The adversarial accuracy against both PGD5 and PGD10 attacks is also improved by APD. At the compression rate of ×2.5, the adversarial accuracy increases by 2.09% against PGD5 attacks and by 0.6% against PGD10 attacks. With ResNet18, APD also showed a consistent improvement in both original accuracy and adversarial accuracy at all compression rates. For instance, at the largest compression rate of ×2.5, APD improves the original accuracy by about 2% and the adversarial accuracy by about 1% against both PGD5 and PGD10. These results imply that the knowledge distillation in our method improves the adversarial robustness of the filter pruning solution.

The Convergence Behavior
To investigate the effect of knowledge distillation on the convergence behavior of adversarial pruning, we traced both the original accuracy and the adversarial accuracy of AP and APD at every epoch. The results at epoch 0 indicate the initial performance of the currently pruned model, where the weight parameters were initialized with the baseline model. We focused on the original accuracy in the early stage of the optimization to show how well APD preserved the original accuracy of the baseline model during adversarial pruning.

Element-Wise Pruning
We traced both the original accuracy and the adversarial accuracy of AP and APD with the element-wise pruning scheme at the compression rates of ×2, ×3, and ×4. The results are shown in Figure 1. Note that the adversarial accuracy is measured against PGD10. APD achieved a significant improvement in the original accuracy in the early stage of optimization with LeNet, VGG16, and ResNet18. With LeNet, the original accuracy of AP fell below 20% in the first epoch, whereas the original accuracy of APD was maintained above 90% across the entire optimization process. With VGG16, the original accuracy of both AP and APD dropped in the first epoch. However, the decrease in original accuracy in the first epoch was smaller for APD than for AP. For instance, at the compression rate of ×4, the original accuracy of APD in the first epoch was higher than that of AP by about 20%. Moreover, with LeNet and VGG16, APD improved the convergence behavior of both original accuracy and adversarial accuracy compared to AP. For instance, at the compression rate of ×3 with VGG16, APD required only 40 epochs for the average of the original accuracy and the adversarial accuracy to reach 61.00% (the maximum average value achieved by AP), whereas AP required 46 epochs. With ResNet18, APD reduced the drop in original accuracy in the first epoch by about 10% across all compression rates, though the improvement in the convergence behavior of both original accuracy and adversarial accuracy is smaller than that for the other networks.

Filter Pruning
We also traced both the original accuracy and the adversarial accuracy of AP and APD with the filter pruning scheme at the compression rates of ×1.5, ×2, and ×2.5. The results are shown in Figure 2. APD improved the overall convergence behavior of filter pruning. With LeNet, APD reduced the drop in original accuracy in the first epoch by about 5%. With VGG16, the improvement in the first epoch was more significant. For instance, at the compression rate of ×1.5, APD reduced the drop in original accuracy in the first epoch by about 20%. Mitigating the drop in original accuracy in the first epoch led to an improvement in the overall convergence behavior. For instance, at the compression rate of ×1.5 with LeNet, APD required 49 epochs for the average of the original accuracy and the adversarial accuracy to reach 96.63% (the maximum average value achieved by AP), whereas AP required 86 epochs. At the compression rate of ×1.5 with VGG16, APD required 33 epochs for the average of both accuracies to reach 54.46% (the maximum average value achieved by AP), whereas AP required 59 epochs. With ResNet18, APD also reduced the drop in original accuracy in the initial stage of pruning, but the amount of improvement decreased at high compression rates.

Comparison with the State-of-the-Art Methods
To show the relative benefit of our method (denoted as APD) compared to other state-of-the-art methods, we also compared APD to Defensive Distillation [22] (denoted as DD), Filter Pruning via Geometric Median [15] (denoted as FPGM), and Ye et al. [19]. The results are summarized in Table 5. DD is a well-known defense strategy that generates a robust model by using knowledge distillation. It trains a teacher model with a high temperature value in a modified SoftMax output and then applies knowledge distillation to a student model whose architecture is the same as that of the teacher model. We compared the original accuracy and the adversarial accuracy of APD and DD with LeNet at the compression rate of ×2. For DD, we set the temperature $t$ to 40 and the number of epochs to 100. In this comparison, APD showed about 6% higher original accuracy and 10% higher adversarial accuracy than DD.
FPGM is a state-of-the-art (SOTA) filter pruning method that effectively prunes redundant filters by measuring the Geometric Median [40] of each filter. To show that pruning alone is not enough to generate sparse but robust solutions, we compared our pruned VGG16 with a compression rate of ×1.5 to FPGM's pruned VGG16 with a compression rate of ×1.3. APD showed 26.45% higher adversarial accuracy and 12.12% lower original accuracy compared to FPGM. The mean of the original and adversarial accuracy is 61.82% for APD and 54.65% for FPGM. This result demonstrates that a model generated by a pruning method alone is vulnerable to adversarial attack.
Ye et al. is a SOTA robust pruning method. To solve the adversarial pruning problem (6) using the alternating direction method of multipliers (ADMM), the method introduces two additional tensors, one for the auxiliary parameters and one for the Lagrangian multipliers. Each of these two tensors has exactly the same size as the weight parameters; therefore, during the optimization procedure the method requires additional memory equal to twice the memory needed to store the weight parameters. On the other hand, APD solves our optimization problem (7) with proximal gradient descent, which does not require any auxiliary tensor. We compared the result of APD and

Computational and Space Complexity
To show the computational and memory efficiency of APD in comparison to other methods, here we provide a short analysis without big O notation. The most dominant part of the training procedure of CNNs in terms of computational complexity is the forward and backward operations. For a given network and input data, we denote the amount of computation for a forward pass as $F$ and the amount of computation for a backward pass as $B$. In addition, we suppose that the number of iterations for training the given network is $I_T$ and the number of iterations for generating an adversarial example is $I_A$. Then, the computational complexity of most pruning methods, such as FPGM, is $I_T \times (F + B)$. DD contains additional forward operations for generating the SoftMax output of the teacher network, resulting in $I_T \times (2F + B)$. A relatively large increase in computational complexity for APD and Ye et al. is inevitable, since adversarial training requires an iterative adversarial attack in every training iteration. Considering this, the computational complexity of Ye et al. is $I_T \times (F + B + I_A \times F)$, whereas APD requires $I_T \times (2F + B + I_A \times F)$ since it contains both adversarial training and knowledge distillation.
On the other hand, the most dominant part of the space complexity of the training procedure is the number of learning parameters. To describe the space complexity, let us denote the number of weights of the given network as $P$. FPGM requires no additional parameters, and therefore its space complexity is $P$. The space complexity of DD and APD is $2P$, since they require a teacher and a student network to perform knowledge distillation. Ye et al. requires two additional tensors of size $P$ for ADMM, resulting in a larger space complexity of $3P$. Compared to Ye et al., the analysis shows that APD requires less memory ($2P$ versus $3P$) at the cost of one additional forward pass per training iteration.
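The counts in this section can be tabulated with a small helper. This is only a sketch that transcribes the expressions stated above; the function names, the dictionary keys, and the numeric values of F, B, I_T, I_A, and P are hypothetical and chosen purely for illustration.

```python
def training_cost(method, F, B, I_T, I_A):
    """Total forward/backward cost, following the expressions in the analysis."""
    per_iteration = {
        "FPGM": F + B,                  # plain training
        "DD": 2 * F + B,                # extra teacher forward pass
        "Ye et al.": F + B + I_A * F,   # adversarial training
        "APD": 2 * F + B + I_A * F,     # adversarial training + distillation
    }
    return I_T * per_iteration[method]

def parameter_memory(method, P):
    """Number of stored parameters, following the expressions in the analysis."""
    multiplier = {"FPGM": 1, "DD": 2, "APD": 2, "Ye et al.": 3}
    return multiplier[method] * P

methods = ("FPGM", "DD", "Ye et al.", "APD")
# Hypothetical units: a forward pass costs 1 and a backward pass costs 2.
costs = {m: training_cost(m, F=1, B=2, I_T=100, I_A=10) for m in methods}
# Hypothetical network with one million weights.
memory = {m: parameter_memory(m, P=1_000_000) for m in methods}
```

With these illustrative values, the cost of APD exceeds that of Ye et al. by exactly I_T × F (one extra forward pass per training iteration), while its parameter memory is smaller by P.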

Effectiveness of Knowledge Distillation on Other Attack Methods
To test our method on other adversarial attacks, we evaluated the adversarial accuracy of our PGD-based trained LeNet (MNIST) against the Fast Gradient Sign Method (FGSM) attack [1] and the Carlini-Wagner (CW) $\ell_2$ attack [4]. For the FGSM attack, we set the attack radius to 0.3. For the CW attack, we used an $\ell_2$-bounded perturbation and set the maximum number of iterations to 1000. The baseline LeNet showed an original accuracy of 99.41% and an adversarial accuracy of 1.08% against FGSM and 0.48% against CW. The results are shown in Table 6. APD showed higher original accuracy and higher adversarial accuracy against both the FGSM and the CW $\ell_2$ attacks compared to AP at all compression rates. In particular, the improvement in adversarial accuracy against the CW $\ell_2$ attack is significant. These results imply that our PGD-based approach is also effective against other attack methods.
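For readers unfamiliar with FGSM, the single-step attack can be sketched on a toy model. The sketch below assumes a two-input logistic-regression classifier with inputs in [0, 1] instead of LeNet; the weights, the input, and the function names are hypothetical, and only the attack radius of 0.3 is taken from the text.

```python
import math

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def predict(x, w, b):
    """Probability of class 1 under the toy model sigmoid(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(x, y, w, b, radius):
    """One FGSM step: move each input by `radius` along the loss-gradient sign."""
    p = predict(x, w, b)
    # Gradient of the cross-entropy loss with respect to the input x.
    grad = [(p - y) * wi for wi in w]
    # Clip to the valid input range [0, 1] after the step.
    return [min(1.0, max(0.0, xi + radius * sign(g))) for xi, g in zip(x, grad)]

w, b = [2.0, -1.0], 0.0  # hypothetical model parameters
x, y = [0.5, 0.5], 1     # hypothetical input with true label 1
x_adv = fgsm(x, y, w, b, radius=0.3)
clean_p = predict(x, w, b)
adv_p = predict(x_adv, w, b)
```

In this toy case the perturbation stays within the radius in every coordinate, yet it pushes the predicted probability of the true class from above 0.5 to below 0.5.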

Conclusions
The adversarial robustness of compressed CNNs is essential for deploying them to real-world embedded systems. In this paper, we proposed a robust model compression framework for CNNs. Our framework uses knowledge distillation to improve the result of the existing adversarial pruning approach. In several experiments, our framework

Figure 1. The original accuracy and the adversarial accuracy of AP and APD (ours) with respect to the epoch of the element-wise pruning procedure for (a) LeNet, (b) VGG16, and (c) ResNet18. In each row, the left panel shows the result at the compression rate of ×2, the middle panel at ×3, and the right panel at ×4. The blue lines show the original accuracy and the red lines show the adversarial accuracy. The solid lines are the results of APD and the dashed lines are the results of AP.
ResNet18 Accuracy Trajectory on the Filter Pruning

Figure 2. The original accuracy and the adversarial accuracy of AP and APD (ours) with respect to the epoch of the filter pruning procedure for (a) LeNet, (b) VGG16, and (c) ResNet18. In each row, the left panel shows the result at the compression rate of ×1.5, the middle panel at ×2, and the right panel at ×2.5. The blue lines show the original accuracy and the red lines show the adversarial accuracy. The solid lines are the results of APD and the dashed lines are the results of AP.
the $\ell_\infty$ bound for imperceptibility;
Initialize the student weight vector $w_s \in \mathbb{R}^p$;
Initialize the teacher weight vector $w_t \in \mathbb{R}^p$;
while $w_t$ not converged do
    Sample a data pair $(x, y)$ from the train dataset;
    Compute $L(f(x; w_t), y)$;
    Weight update: $w_t \leftarrow w_t - \eta_t \nabla L(f(x; w_t), y)$;
end
while $w_s$ not converged do
    Sample a data pair $(x, y)$ from the train dataset;
    For each pixel of $x$, generate a uniformly random noise $\varepsilon = (\varepsilon_1, \cdots, \varepsilon_d) \sim U(-\epsilon, \epsilon)$;
    $x_{adv} \leftarrow x + \varepsilon$;
    while $x_{adv}$ not converged do
        Update: $x_{adv} = \Pi_{B(x, \epsilon)}\left(x_{adv} + \alpha \cdot \mathrm{sgn}(\nabla_{x_{adv}} L(f(x_{adv}, w_s), y))\right)$;
    end
    Compute the teacher SoftMax output: $f_t(x; w_t)$;
    Compute the student SoftMax output: $f_t(x_{adv}; w_s)$;
    Compute $L_{APD}(w_s)$ with (8);
    Update weight: $w_s = \mathrm{prox}_{\eta_s \lambda \|\cdot\|_0}(w_s - \eta_s \nabla L_{APD}(w_s))$;
end
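Two building blocks of the procedure above, the projected sign-gradient update of the adversarial example and the proximal update of the student weights, can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the proximal operator uses the standard closed form of hard thresholding for the penalty tau·‖·‖_0 (an entry is kept only if its square exceeds 2·tau, with tau standing for the product of the student learning rate and λ), the "not converged" loop is replaced by a fixed number of steps, and the gradient function is a hypothetical stand-in, since the loss in (8) is not reproduced here. All function names and numeric values are illustrative.

```python
import random

def prox_l0(v, tau):
    """Proximal operator of tau * ||.||_0: keep v_i only if v_i**2 > 2 * tau."""
    return [vi if vi * vi > 2.0 * tau else 0.0 for vi in v]

def project_linf(x_adv, x, eps):
    """Project x_adv onto the l-infinity ball of radius eps centered at x."""
    return [min(xi + eps, max(xi - eps, ai)) for ai, xi in zip(x_adv, x)]

def pgd_attack(x, grad_fn, eps, alpha, steps, rng):
    """Random start in the ball, then projected sign-gradient ascent steps."""
    x_adv = [xi + rng.uniform(-eps, eps) for xi in x]
    for _ in range(steps):
        g = grad_fn(x_adv)
        stepped = [ai + alpha * ((gi > 0) - (gi < 0)) for ai, gi in zip(x_adv, g)]
        x_adv = project_linf(stepped, x, eps)
    return x_adv

# Hypothetical constant loss gradient, used only to exercise the update rule.
constant_grad = lambda x_adv: [1.0, -2.0]
x = [0.5, 0.5]
x_adv = pgd_attack(x, constant_grad, eps=0.1, alpha=0.05, steps=10,
                   rng=random.Random(0))

# Hypothetical weights after a gradient step; entries with |v_i| <= 0.4 are pruned.
pruned = prox_l0([0.9, -0.05, 0.3, -1.2], tau=0.08)
```

Because the proximal step operates on the weight vector in place, it needs no auxiliary tensor of the same size as the weights.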

Table 1. Summary of element-wise pruning results of APD (ours) and AP on MNIST.

Table 2. Summary of element-wise pruning results of APD (ours) and AP on CIFAR10.

Table 3. Summary of filter-wise pruning results of APD (ours) and AP on MNIST.

Table 4. Summary of filter-wise pruning results of APD (ours) and AP on CIFAR10.

Table 5. Summary of filter-wise pruning results of APD (ours) and other state-of-the-art methods.
Ye et al. with LeNet and ResNet18. VGG16 was excluded from this comparison since the exact values of the original accuracy and the adversarial accuracy with VGG16 are not available in the original paper of Ye et al. We set the compression rates to ×2, ×4, and ×8 for LeNet, and ×2 for ResNet18. With LeNet, APD slightly improved both the original accuracy and the adversarial accuracy over Ye et al. at all compression rates. With ResNet18, APD improved the original accuracy by 0.26% and the adversarial accuracy by 0.03% compared to Ye et al. The adversarial robustness of APD appears to be similar to that of Ye et al.; however, APD requires less memory than Ye et al. and is therefore more suitable for generating robust models in memory-constrained environments, as we discuss in the next section.

Table 6. Summary of AP and APD results against FGSM and CW $\ell_2$ attacks on the MNIST dataset.