Rethinking Weight Decay for Efficient Neural Network Pruning

Introduced in the late 1980s for generalization purposes, pruning has now become a staple for compressing deep neural networks. Despite many innovations in recent decades, pruning approaches still face core issues that hinder their performance or scalability. Drawing inspiration from early work in the field, and especially the use of weight decay to achieve sparsity, we introduce Selective Weight Decay (SWD), which carries out efficient, continuous pruning throughout training. Our approach, theoretically grounded on Lagrangian smoothing, is versatile and can be applied to multiple tasks, networks, and pruning structures. We show that SWD compares favorably to state-of-the-art approaches, in terms of performance-to-parameters ratio, on the CIFAR-10, Cora, and ImageNet ILSVRC2012 datasets.


Introduction
In recent decades, deep neural networks have become the reference for many machine learning tasks, especially computer vision. Their popularity quickly grew once deep convolutional networks managed to outclass classical methods on benchmark tasks, such as image classification on the ImageNet dataset [1]. Since their introduction by Le Cun et al. [2], many architectural innovations have now contributed to their performance and efficiency [3][4][5][6][7][8]. However, for any given type of deep neural network architecture, the number of parameters tends to correlate with performance, resulting in the best-performing networks having prohibitive requirements in terms of memory footprint, computation power, and energy consumption [9]. This is a crucial issue for multiple reasons. Indeed, many applications, such as autonomous vehicles, require networks that can provide adequate, real-time responses on energy-efficient hardware: for such tasks, one cannot afford to have either an accurate network that is too slow to run or one that performs quickly but crudely. Additionally, research on deep learning relies heavily on iterative experiments that require a lot of computation time and power: lightening the networks would help to speed up the whole process.
Many approaches have been proposed to tackle this issue. These include techniques such as distillation [10,11], quantization [12,13], factorization [14], and pruning [15]; most of them can be combined [16]. The whole field tends to indicate that there may exist a Pareto optimum, between performance, memory occupation, and computation power, that compression could help to attain. However, progress in the field shows that this optimum has yet to be reached.
Our work focuses on pruning. The basis of most pruning methods is to train a network and, according to a certain criterion, to identify which parts of it contribute the least to its performance. These parts are then removed and the network is fine-tuned to recover the incurred loss in performance [15,17].
Multiple decades of innovation in the field have uncovered many issues at stake when pruning networks, such as structure [18], scalability [19,20], or continuity [21]. However, many approaches, while trying to tackle these issues, tend to resort to complex methods involving intrusive processes that make them harder to actually use, re-implement, and adapt to different networks, datasets, or tasks.
Our contribution aims to solve these key problems in a more straightforward and efficient way that avoids human intervention in the training process as much as possible. Our method, Selective Weight Decay (SWD), is a pruning method for deep neural networks that is based on Lagrangian smoothing. It consists in a regularization which, at each step during the training process, penalizes the weights that would be pruned according to a given criterion. The penalization grows in the course of training until the magnitude of the targeted parameters is so close to zero that pruning them induces no drop in performance. This method has many desired properties, including avoiding any discontinuity, since pruned weights are progressively nullified. The weight removal is, itself, learned, which reduces the manual aspects of the pruning process. Moreover, since the penalized weights are not completely removed before the very end of training, the subset of the targeted parameters can be adjusted during training, depending on the current distribution of the weight magnitudes. The dependencies between weights can, thus, be better taken into account.
Our experiments show that SWD works well for both light-weight and large-scale datasets and networks with various pruning structures. Our method shines especially for aggressive pruning rates (few remaining parameter targets) and manages to achieve great results with targets for which classical methods experience a large drop in performance.
We therefore make the following claims about SWD, which prunes deep neural networks continuously during training:
• using standardized benchmark datasets, we show that SWD performs significantly better on aggressive pruning targets than standard methods;
• we show that SWD needs fewer hyperparameters, introduces no discontinuity, needs no fine-tuning, and can be applied to any pruning structure with any pruning criterion.
In the following sections, we will review in detail the field of network pruning, describe our method, present our experiments and their results, and then discuss our observations.

Problem Statement and Related Work
We now review the main pruning methods and attempt to organize them into sub-families.

Notations
We first recall the standard optimization problem with weight decay. Let $\mathcal{N}$ be a network with parameters $w$, trained over a dataset $\mathcal{D}$ containing $N$ input/ground-truth pairs $(x_i, y_i)$. The network is trained through an error function $E$ and penalized by a weight decay with coefficient $\mu$ [22,23]. The training process thus involves minimizing the objective function $L$, defined as:

$$L(w) = \mathrm{Err}(w) + \mu \|w\|_2^2, \quad \text{with} \quad \mathrm{Err}(w) = \sum_{(x_i, y_i) \in \mathcal{D}} E(\mathcal{N}(x_i, w), y_i) \tag{1}$$
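As a minimal numerical sketch of this objective, assuming the usual form of weight decay (the summed error plus µ times the squared L2 norm of the weights), the function below evaluates it from precomputed per-sample errors. The name `objective` and the toy values are illustrative and do not come from the original work.

```python
def objective(errors, weights, mu):
    """Return Err(w) + mu * ||w||_2^2 for precomputed per-sample errors."""
    err = sum(errors)                     # Err(w): error summed over the dataset
    l2_squared = sum(w * w for w in weights)  # squared L2 norm of the parameters
    return err + mu * l2_squared


# Toy values (assumptions): two samples, two weights.
loss = objective(errors=[0.5, 0.25], weights=[1.0, -2.0], mu=0.5)
print(loss)  # 0.75 + 0.5 * 5.0 = 3.25
```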

The Birth and Rebirth of Pruning
Although network size correlates with performance, the fundamental observation that motivates pruning is that not all of a trained network's parts seem to be useful. Unnecessary parts may be removed without penalizing performance.
At the end of the 1980s and at the beginning of the 1990s, the field of pruning quickly expanded from a few seminal studies [24][25][26]. At the time, as observed by Reed [27], two major branches cohabited: (1) sensitivity calculation methods, which consisted in evaluating the contribution of each parameter to the error function and in pruning those which contributed the least, and (2) penalty-term methods, which penalized weights globally so as to encourage convergence to networks having a few big weights rather than lots of small ones. It is worth mentioning that pruning was originally intended to help the generalization of networks, rather than being a compression method per se.
This field of research seems to have almost completely vanished during the ensuing decade, only to be resurrected by Han et al. [15]. Since then, the number of pruning investigations has expanded so quickly that it has made any reviewing task a challenging one [28].
Assume we want to train and prune a given model with a target pruning rate T. The method of Han et al. [15], which is currently the prototypical pruning technique, consists first in training, then in pruning and fine-tuning the network iteratively, with each time an increasing pruning rate t until T is reached. In particular, pruning is achieved here by reducing to and then maintaining at zero a proportion of the parameters of the whole network whose absolute magnitude is the smallest. Though it is still possible to prune and fine-tune the model only once, doing so can be viewed as a particular case of the method.
The literature that followed the work of Han et al. [15] has highlighted many questions that tend to be raised when pruning a neural network: "Which parameters should be pruned?", "How can we prune them and recover from the loss?", and "What kind of structures should be pruned?" We will tackle these questions.

Which Parameters Should Be Pruned?
One crucial prerequisite to pruning networks is to have a good criterion to define which parameters to prune. Many pruning criteria have been tested [29][30][31][32][33], for example: Anwar and Sung [29] try various random masks and select the one which induces the least degradation; Hu et al. [30] prune on the basis of the average rate of null activation after each pruned layer. The two most widespread criteria are gradient magnitude and weight magnitude, both of which we will detail.
The early branch of sensitivity calculation methods, birthed by the studies of Le Cun [25] and Mozer and Smolensky [26] and then studied within multiple articles [24,34,35,36], led some recent studies to prune the weights of the least back-propagated gradient [37,38]. Nevertheless, the criterion that remains the most common, namely, the mere magnitude of the parameters, turns out to be surprisingly effective while also intuitive. Although re-introduced by Han et al. [15], it was first introduced by Chauvin [39] and Hanson and Pratt [40], then presented under the name of "clipping" by Janowsky [41]. Segee and Carter [42] observed the surprising correlation between this intuitive criterion and that of Mozer and Smolensky [26], which is more theoretically grounded. These studies tend to confirm that magnitude is a good proxy for the contribution of a parameter to optimization problems summed up by Equation (1), which is why we used it in our experiments.
The other main branch identified by Reed [27] revolved around enforcing sparsity using various kinds of weight decay regularization. The commonly stated motivation was that, if a certain parameter contributes poorly to the error term Err, then the weight decay term should outweigh it so that this very parameter would decrease toward zero.
Since weight decay is required for weight-magnitude pruning, which is the favored criterion among several of the best implementations, it seems that sparsity-inducing regularizations are worth exploring further.

How to Prune Parameters and Recover from the Loss
One may object by noting that removing weights, even those that seem the least important, may damage the network in such a way that no fine-tuning could ever allow it to recover. Indeed, doing so severely disrupts the training process, for example, by removing parts of the network while it is trying to learn to solve a problem. The work of Le Cun, Bengio, and Hinton [43] tends to show that the less the training process is disrupted, the better it performs.
For example, there is no guarantee that weights which seemed unimportant at first could not become crucial again in the new context of the pruned network. That is the reason why many efforts have focused on allowing weights to regrow in one way or another. Different approaches have been proposed [44][45][46] to either regrow previously pruned weights or to not completely prune parameters by still allowing them to be trained once they are reduced to zero by pruning.
The principle of regrowing weights is central to the family of methods that could be called sparse training. Sparse training was first introduced by Mocanu et al. [47] and then further explored within the literature [48][49][50]. It involves training the network with a constant level of sparsity, at first spread randomly with uniform probability, and then adjusted during steps which combine (1) pruning of a certain portion of the weights, according to a certain criterion, and (2) regrowing an equivalent amount of weights, depending on another criterion.
Such a family of methods provided a promising way to work around the problem of falsely unimportant weights, while limiting the impact of increasing sparsity all throughout training.
Of course, sparse training is not the only family of methods to tackle this issue. Indeed, this technique belongs to a vast trend in the literature: discovering the importance of sparsity during training to achieve better results with shrunken networks. Pruning networks early, so that they start training with their definitive sparsity from the beginning, is the whole point of a whole range of work [51][52][53][54], as well as one of the motivations behind the field surrounding the lottery ticket hypothesis [55][56][57][58][59][60]. Renda et al. [61], while studying the lottery ticket hypothesis, came up with a method called learning rate rewinding, which proposes replacing the fine-tuning step with a full retraining stage that uses the weights of the trained and pruned network as a new initialization.
Another distinct branch of methods involves work that aims to induce sparsity in a more continuous way throughout the training process, possibly avoiding any fine-tuning or retraining. One intuition that motivates these methods is that delegating the care of sparsity to the gradient descent is a sensible way to not disrupt training too much.
One sub-family of this branch focuses on finding a way to learn a pruning mask during training [21,62,63,64]; some of these studies propose learning such a mask using variants of the quantization method of Courbariaux et al. [12] on auxiliary learnable parameters. The other major sub-family counts methods grounded on a Bayesian mathematical formalism [65][66][67][68][69]. They mainly consist in various kinds of sparsity-inducing regularization, whose parameters are tuned through variational inference. These methods are the ones that stick most closely to the former family of penalty-term methods: Ullrich et al. [69] even reference the work of Nowlan and Hinton [70].
However, this whole family of methods tends not to be among the simplest to implement and adapt to various kinds of tasks, datasets, or structures. Adapting variational dropout [67] to structured pruning is the focus of whole contributions [66,68], and Gale et al. [20] show that the work of Molchanov et al. [67] and Louizos et al. [21] does not scale easily to large datasets such as ImageNet ILSVRC2012.
Unfortunately, one problem that encompasses the whole field, and about which Blalock et al. [28] raise the alarm, is the lack of comparability between the various methods in the literature. Indeed, contributions to the domain rarely compare to the same reference methods, tasks, models, training conditions, or datasets and do not always show the same metrics computed in a consistent way. Hence, it is very difficult to know which method brings actual methodological or theoretical improvements on the topic, and it appears that none of the questions or aspects we mentioned can be considered solved for now. Therefore, while taking inspiration from these methods and their desirable properties, we propose one that is easier to scale, adapt, and apply.

What Kind of Structures Should Be Pruned?
As pointed out by both Anwar et al. [71] and Li et al. [18], the parameter-wise pruning of Han et al. [15] produces sparse matrices that are hardly exploitable by modern hardware and deep learning frameworks. That is why a whole field of pruning is dedicated to finding ways to eliminate parts of the networks in a structured way that can actually induce a measurable speedup.
Because of the predominance of convolutional neural networks in the literature, the most widespread type of structure to be considered by the field is convolutional channels (or filters) [17,18,29,31,46,71,72,73,74]. Indeed, pruning filters induces a direct shrinking of the very architecture of the network and a quadratic reduction in the parameter count, as each removed filter is one less input feature map for the next convolution layer.
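The quadratic effect can be checked with a small worked example. The sketch below assumes that a k × k convolution holds f_in × f_out × k × k weights (biases ignored); the layer widths are hypothetical and chosen only for illustration.

```python
def conv_params(f_in, f_out, k):
    """Number of weights in a k x k convolution layer (biases ignored)."""
    return f_in * f_out * k * k


# Hypothetical layer: 64 input feature maps, 64 filters, 3x3 kernels.
full = conv_params(64, 64, 3)

# Keep half the filters in this layer AND in the previous one: pruning the
# previous layer halves f_in, pruning this layer halves f_out.
pruned = conv_params(32, 32, 3)

ratio = pruned / full
print(full, pruned, ratio)  # 36864 9216 0.25
```

Halving the number of filters in consecutive layers thus leaves a quarter of the weights in the second layer, not a half.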
Other types of structures have been experimented on, such as kernels or intra-kernel strided structures [29,71] or the reduction of convolutions to shift operations [75]. In this work, we focus on two granularity levels: parameter-wise (unstructured) and convolutional filter-wise (structured).

Selective Weight Decay
We now present our contribution, illustrated in Figure 1, and explain how it addresses the aforementioned issues.

Figure 1. To prune deep neural networks continuously during training, we apply distinct types of weight decay (penalty p on the y-axis) depending on weight magnitude (weight value w on the x-axis). Weights whose magnitude exceeds a threshold t (defined according to the number of weights to prune) are penalized by a regular weight decay. Those beneath this threshold are targeted by a stronger weight decay whose intensity grows during training. This stronger weight decay, only applied to a subset of the network, is the Selective Weight Decay. This approach can be equally well applied to weights (unstructured case) or groups of weights (structured case).

Principle
Selective Weight Decay (SWD) is a regularization which induces sparsity continuously on any type of structure: at each training step, a penalization is applied to the parameters that would be pruned at this very step according to a given criterion and a given structure. The criterion we chose is weight magnitude, or variants of it according to the chosen structure. The penalized optimization problem can be viewed as:

$$L(w) = \mathrm{Err}(w) + \mu \|w\|_2^2 + a \mu \|w^*\|_2^2 \tag{2}$$

with $a$ being a coefficient which determines the importance of SWD relative to the rest of the optimization problem and $w^*$ the subset of $w$ to be pruned at a certain step. SWD is summed up as Algorithm 1.

Algorithm 1: Summary of SWD
Data: network $\mathcal{N}$ of weights $w$, dataset $\mathcal{D}$, target pruning rate $T$
$a \leftarrow a_{\min}$;
while the network is not fully trained and $a \le a_{\max}$ do
    draw batches $x$ and $y$ from $\mathcal{D}$;
    determine the subset $w^*$ of $w$ to be pruned at this step, according to $T$ and the criterion;
    update $w$ by minimizing the penalized objective of Equation (2) on the batch;
    increase $a$ according to Equation (3);
end
prune $w^*$;

The evolution of $a$ is designed to be exponential and, according to two bounds $a_{\min}$ and $a_{\max}$, is defined at a certain training step $s$ as:

$$a(s) = a_{\min} \left( \frac{a_{\max}}{a_{\min}} \right)^{s / s_{final}} \tag{3}$$

with $s_{final}$ being the step at which SWD reaches $a_{\max}$ and, usually, the total number of training iterations. The exponential increase in Equation (3) allows the network to converge before applying a strong penalization. We favored an exponential increase over a linear one so that $a$ can reach large final values without penalizing training too much throughout the training process. In addition, setting $a_{\min}$ and $a_{\max}$ manually allows precise control over the evolution of the penalty and enables a careful study of how training behaves under this constraint.
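To make the difference between an exponential and a linear increase concrete, here is a minimal sketch. It assumes that the exponential schedule interpolates geometrically between the two bounds, so that it equals a_min at step 0 and a_max at the final step. The function names are illustrative, and the linear variant is a hypothetical alternative shown only for comparison.

```python
def swd_coefficient(step, final_step, a_min, a_max):
    """Exponential growth of a from a_min (step 0) to a_max (final_step)."""
    return a_min * (a_max / a_min) ** (step / final_step)


def linear_coefficient(step, final_step, a_min, a_max):
    """Hypothetical linear alternative, shown only for comparison."""
    return a_min + (a_max - a_min) * step / final_step


a_min, a_max, final_step = 1.0, 1e4, 1000
halfway_exp = swd_coefficient(500, final_step, a_min, a_max)     # about 100
halfway_lin = linear_coefficient(500, final_step, a_min, a_max)  # about 5000
print(halfway_exp, halfway_lin)
```

Halfway through training, the exponential schedule is still about fifty times weaker than the linear one, which is in line with the stated motivation of letting the network converge before the penalty becomes strong.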

SWD as a Lagrangian Smoothing of Pruning
Penalizing weights until they reach zero appears to be a viable method to relax the hard constraint that is pruning. Indeed, pruning can be seen as a constraint on the $L_0$ norm of the network, which is non-differentiable (this is a problem in differentiable optimization methods such as those used in deep learning). The $L_2$ norm can be used as a differentiable relaxation of $L_0$. We designed SWD so that it can be viewed as a Lagrangian smoothing, with the coefficient $a$ of the SWD term in Equation (2) being a Lagrangian multiplier.
As pointed out by the work of Murray and Ng [76], Lagrangian smoothing allows convergence relative to both the error term and the constraint. While a can be mostly negligible at the start of the training, it becomes preponderant at the end and forces the target weights to be pruned, so as not to hinder the convergence of the network during training while allowing for pruning. The fact that SWD only penalizes weights selectively during training allows them to recover as soon as they are no longer targeted by the pruning criterion, thereby combining both the pruning and regrowing criteria of sparse training. Therefore, SWD is a non-greedy method that allows weights to recover when needed.

On the Adaptability of SWD to Structures
The exact definition of w * depends on the chosen type of structure to prune. As SWD induces no constraint on such considerations, it can be applied to any type of structure.
Unstructured pruning is defined by removing all the weights of least magnitude in the whole network so that the proportion of pruned parameters matches the pruning target as closely as possible. Formally stated, $w^*$ is the subset of $w$ containing the $T \cdot n(w)$ weights of smallest magnitude (up to rounding), with $T$ being the pruning target and $n(w)$ the number of elements of the parameters $w$.
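The unstructured selection and the resulting penalty can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the targeted subset is taken to be the round(T · n(w)) weights of smallest magnitude over a flat list of weights, and the penalty is assumed to be the regular weight decay plus a · µ times the squared L2 norm of the targeted weights. The names `select_targets` and `swd_penalty` are hypothetical.

```python
def select_targets(weights, target_rate):
    """Indices of the round(T * n(w)) weights of smallest magnitude."""
    n_pruned = round(target_rate * len(weights))
    # Stable sort by magnitude: ties are broken by position in the list.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    return sorted(order[:n_pruned])


def swd_penalty(weights, targeted, mu, a):
    """Regular weight decay on all weights plus a stronger one on w*."""
    decay = mu * sum(w * w for w in weights)
    selective = a * mu * sum(weights[i] ** 2 for i in targeted)
    return decay + selective


weights = [0.5, -2.0, 0.25, 1.0]                # toy flattened network
targets = select_targets(weights, target_rate=0.5)
penalty = swd_penalty(weights, targets, mu=0.5, a=4.0)
print(targets, penalty)  # [0, 2] 3.28125
```

Because the selection is recomputed from the current magnitudes at every step, a weight that grows past the threshold simply stops being penalized by the selective term.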
We based the structured version of our SWD on the method of Liu et al. [17] to solve a problem induced by residual connections in modern convolutional networks: to ensure that the exact output dimensions of the feature maps after each residual connection match the desired target, one must prune exactly the same channels among the connections and the last convolution before them. To the best of our knowledge, this problem has not been tackled, and approaches that use certain norms of the filters as a criterion could not be adjusted to tackle this important problem without altering them too drastically, if the desire was to remain true to the original contribution.
However, since the method of Liu et al. [17] prunes multiplicative coefficients of batchnorm layers, it is easy for it to solve the residual connection issue as soon as a batchnorm layer is inserted after each residual connection (which does not change the overall performance of the network).
Liu et al. [17] considered the magnitudes of multiplicative coefficients in a batchnorm layer to be an estimator of the importance of their corresponding filters. These batchnorm layers were then penalized during training by a smooth-$L_1$ norm. In their work, a global threshold was applied to all the batchnorm layers in order to globally prune a target percentage of all the smallest batchnorm coefficients.
However, in order to have better control over the exact number of parameters at the end of the pruning process, we instead prune the smallest batchnorm coefficients until the targeted portion of the overall network (once the parameters of the corresponding convolutional filters have been subtracted) is removed.
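The difference between pruning a fixed percentage of coefficients and pruning until a parameter budget is met can be illustrated with a simplified sketch. It assumes that each channel has a fixed, known parameter cost, which ignores the interaction between neighbouring layers; the function name and the toy values are hypothetical.

```python
def select_channels(gammas, params_per_channel, target_rate):
    """Pick channels by increasing |gamma| until the parameter budget is met.

    gammas: batchnorm multiplicative coefficients, one per channel.
    params_per_channel: parameters removed with each channel (assumed fixed).
    target_rate: fraction of the total parameter count to remove.
    """
    total = sum(params_per_channel)
    budget = target_rate * total
    order = sorted(range(len(gammas)), key=lambda i: abs(gammas[i]))
    removed, selected = 0, []
    for i in order:
        if removed >= budget:
            break
        selected.append(i)
        removed += params_per_channel[i]
    return sorted(selected), removed / total


gammas = [0.9, 0.05, -0.4, 0.01]
costs = [100, 100, 300, 300]       # toy per-channel parameter counts
selected, fraction = select_channels(gammas, costs, target_rate=0.5)
print(selected, fraction)  # [1, 3] 0.5
```

In this toy case, two of the four channels are selected and they account for exactly half of the parameters; with unequal costs, a fixed percentage of coefficients would in general not remove the same fraction of parameters.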

General Training Conditions
In order to eliminate all unwanted variables, each series of experiments was run under the same conditions, except when explicitly stated, with the same initialization and same seed for the random number generators of the various used libraries. We used no pre-trained networks and we trained all of them in a very standard way.
The training conditions were as follows: all our networks were trained using the PyTorch framework (Paszke et al. [77]), with SGD as the optimizer, a base learning rate of $1 \times 10^{-1}$ for the first third of the training, then $1 \times 10^{-2}$ for the second, and finally $1 \times 10^{-3}$ for the last third, and momentum set to 0.9. All networks were initialized using the default initialization from PyTorch. Our code is available at https://github.com/HugoTessier-lab/SWD, accessed on 26 February 2022.
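The three-stage learning rate described above can be written as a small function. This is a sketch assuming that the changes happen at epoch boundaries; the function name is illustrative and the released code may organize this differently.

```python
def learning_rate(epoch, total_epochs):
    """Step schedule: 1e-1, then 1e-2, then 1e-3, one third of training each."""
    third = total_epochs / 3
    if epoch < third:
        return 1e-1
    if epoch < 2 * third:
        return 1e-2
    return 1e-3


# With 90 epochs, the rate changes at epochs 30 and 60 (0-indexed).
rates = [learning_rate(e, 90) for e in (0, 29, 30, 59, 60, 89)]
print(rates)  # [0.1, 0.1, 0.01, 0.01, 0.001, 0.001]
```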

Chosen Baselines and Specificities of Each Method

Unstructured pruning: Han et al. [15]
The networks trained with this method were pruned and fine-tuned for five iterations. At each step, the pruning target is a fraction of the final one: for example, when first pruned, only a fifth of the final pruning target is actually removed. The pruned weights are those of least magnitude.
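The intermediate targets can be sketched as below, assuming that they increase linearly up to the final target (the text only states that the first step removes a fifth of it). The function name is hypothetical.

```python
def iterative_targets(final_target, iterations=5):
    """Pruning rate applied at each prune/fine-tune iteration.

    Assumes a linear progression: iteration i (1-based) removes
    i / iterations of the final target.
    """
    return [final_target * (i + 1) / iterations for i in range(iterations)]


schedule = iterative_targets(0.9)
# Roughly 0.18, 0.36, 0.54, 0.72 and 0.9, up to floating-point rounding.
print(schedule)
```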

Structured pruning: Liu et al. [17]
The network is only pruned once and fine-tuned. In accordance with the paper, a smooth-$L_1$ norm is applied as a regularization to every prunable batchnorm layer with a coefficient $\lambda$ that depends on the dataset. This method appeared to us to be the most straightforward one for allowing an accurate evaluation of the number of pruned parameters. The pruned filters are those whose multiplicative coefficients in the batch normalization layer are of least magnitude.

LR-Rewinding: Renda et al. [61]
When networks are trained following this method, the post-removal fine-tuning is replaced by a retraining which consists in repeating the pre-removal training process exactly, with the same learning rate values and the same number of epochs. This method updates and significantly improves the train, prune, and fine-tune framework that serves as a basis for both previous methods.

SWD
Whether on unstructured or structured pruning, when trained with SWD, the network is not fine-tuned at all and only pruned once at the end. The values $a_{\min}$ and $a_{\max}$ vary according to the model and the dataset.

Overall methodology
In order to isolate the respective gain of each method:
• All the unstructured pruning methods use weight magnitude as their criterion;
• All the structured pruning methods are applied to batch normalization layers;
• Structured LR-Rewinding also applies the smooth-$L_1$ penalty from Liu et al. [17];
• The hyper-parameters specific to the aforementioned methods, namely, the number of iterations and the values of the smooth-$L_1$ norm, are directly extracted from their respective original papers.
Here are the only notable differences:
• SWD does not apply any fine-tuning;
• Unstructured LR-Rewinding only re-trains the network once (because of the extra cost of fully retraining networks, compared to fine-tuning);
• SWD does not apply a smooth-$L_1$ norm (since it would clash with SWD's own penalty).
Table 1 shows results from different techniques, as presented in the related papers, on different datasets, networks, compression rates, and pruning structures. To achieve the best performance possible, the SWD results for structured pruning on ImageNet were obtained with warm restarts and 180 epochs in total. Since these results do not come from identical networks on the same datasets, trained in the same conditions, and pruned at the same rate, the comparisons have to be interpreted with caution. However, Table 1 gives quantified indications as to how our method compares to the state of the art, in terms of performance and allowed compression rates.

Experiments on ImageNet ILSVRC2012
The results of the experiments on ImageNet ILSVRC2012 are shown in Table 2, which presents the top-one and top-five accuracies of ResNet-50 from He et al. [3] on ImageNet ILSVRC2012, under the conditions described in Sections 4.1 and 4.2. The "Baseline" method is a regular, non-pruned network, which serves as a reference. SWD outperforms the reference method for both unstructured and structured pruning. For models trained on ImageNet ILSVRC2012, the standard weight decay (parameter $\mu$) is set to $1 \times 10^{-4}$ and models were trained for 90 epochs. For the method from Han et al. [15], each fine-tuning step lasted 5 epochs, except for the last iteration, which lasted 15 epochs. For Liu et al.'s [17] method, the network was only pruned once and fine-tuned over 40 epochs. The smooth-$L_1$ norm had a coefficient set to $\lambda = 1 \times 10^{-5}$. For unstructured pruning, SWD was applied with $a_{\min} = 1 \times 10^{-1}$ and $a_{\max} = 1 \times 10^{5}$; for structured pruning, $a_{\min} = 10$ and $a_{\max} = 1 \times 10^{4}$.

Impact of SWD on the Pruning/Accuracy Trade-Off
Even though obtaining better accuracy for a given pruning target is not without interest, it makes more sense to know what maximal compression rate SWD would allow for a given accuracy target. Figures 2 and 3 show the influence of SWD on the pruning/accuracy trade-off for ResNet-20, with an initial embedding of 64 and 16 feature maps, respectively, on CIFAR-10 [82]. We used that lighter dataset instead of ImageNet ILSVRC2012 because of the high cost of computing so many points.
Since each point is the result of only one experiment, there may be some fluctuations due to low statistical power. However, since the same random seed and model initialization were used each time, these fluctuations should not prevent us from drawing conclusions about the behavior of each method. As we stated in Section 2.2, pruning originally served as a method to improve generalization. This suggests that the relationship between performance and pruning may be subtle enough to lead to local optima that may not be possible to predict. The standard weight decay (parameter $\mu$) is set to $5 \times 10^{-4}$ for CIFAR-10. The models on CIFAR-10 were trained for 300 epochs and each fine-tuning lasted 15 epochs, except the last one (or the only one when applying Liu et al. [17]), which lasted 50 epochs. When using the smooth-$L_1$ norm, its coefficient is set to $\lambda = 1 \times 10^{-4}$. For an initial embedding of 64 feature maps with unstructured pruning, we set $a_{\min} = 1 \times 10^{-1}$ and $a_{\max} = 1 \times 10^{5}$; with structured pruning, $a_{\min} = 1 \times 10^{2}$ and $a_{\max} = 1 \times 10^{7}$. For 16 feature maps, we set $a_{\min} = 1$ and $a_{\max} = 1 \times 10^{4}$ for unstructured pruning and $a_{\min} = 100$ and $a_{\max} = 1 \times 10^{6}$ when structured.
Exact results are reported in Table 3, in which the expected compression ratios, in terms of operations, are also displayed. Since unstructured pruning produces sparse matrices, whereas structured pruning leads to networks of smaller sizes, some authors such as Ma et al. [83] have argued against the use of the former and in favor of the latter. Indeed, sparse matrices need either specific hardware or expensive indexing methods, which makes them less efficient than structured pruning. Therefore, because of how hardware- or method-specific the gains of unstructured pruning can be, we preferred not to indicate any compression ratio in terms of operation count for unstructured pruning. However, concerning structured pruning, it is far easier to estimate what the operation count will be. The operations are calculated in the following way, with $f_{in}$ being the number of input channels, $f_{out}$ being the number of output channels, $k$ being the kernel size, $h$ being the height (in pixels) of the input feature maps, and $w$ being their width:
• convolution layer:
We make no distinction between multiplications and additions in our count.
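A common convention for counting the operations of a convolution layer, assumed here rather than taken from the original work, multiplies the number of weights by the number of spatial positions. The sketch below further assumes stride 1 and padding that preserves the h × w resolution, and counts one operation per weight application; the function name and layer sizes are hypothetical.

```python
def conv_operations(f_in, f_out, k, h, w):
    """Operation count of a k x k convolution under the stated assumptions."""
    return f_in * f_out * k * k * h * w


# Hypothetical layer on 32x32 feature maps, before and after keeping half
# of the channels on both its input and its output.
full_ops = conv_operations(16, 16, 3, 32, 32)
pruned_ops = conv_operations(8, 8, 3, 32, 32)
remaining = 100 * pruned_ops / full_ops
print(full_ops, pruned_ops, remaining)  # 2359296 589824 25.0
```

Under this convention, the percentage of remaining operations of a layer follows the same quadratic trend as its parameter count.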

Grid Search on Multiple Models and Datasets
To show the influence of the values of $a_{\min}$ and $a_{\max}$ on the performance of networks right before and right after the final pruning step, we conducted a grid search using LeNet-5 and ResNet-20 on MNIST and CIFAR-10 with both unstructured and structured pruning.
The LeNet-5 models were trained for 200 epochs, with a learning rate of 0.1 and no weight decay (even though $\mu$ is set to $5 \times 10^{-4}$ for SWD); the momentum is set to 0. The pruning targets are 90% and 99%. The results of these grid searches are reported in Table 4.
Another grid search, on CIFAR-10 with ResNet-20 (64 channels), is reported in Table 5 with an extended range of values explored in order to showcase the importance of the increase of $a$ during training. As it involved testing cases in which $a$ decreases, we named the start and end values of $a$ as $a_{start}$ and $a_{end}$ instead of $a_{\min}$ and $a_{\max}$. Otherwise, the conditions were the same as described in Sections 4.1 and 4.5. Table 6 shows another distinct grid search, performed with structured pruning on various pruning targets.
In order to tease apart the sensitivity of SWD from variations of the model or of the dataset, we provide additional grid searches in Tables 7 and 8. These tables feature results on CIFAR-10 with ResNet-18 and ResNet-20 to showcase the influence of the model's depth, and on CIFAR-100 with ResNet-34 to have results on another, more complex dataset. Each network has an initial embedding of 64 and we show results for both structured and unstructured pruning.
Additionally, as highlighted by both Tables 4 and 6, the choice of $a_{\min}$ and $a_{\max}$ depends on the pruning target. To highlight this fact, we show a complete trade-off figure for various values of $a_{\min}$ and $a_{\max}$ in Figure 4, whose results are reported in Table 9.

Table 3. Top-1 accuracy of ResNet-20, with an initial embedding of 64 or 16 feature maps, on CIFAR-10 for various pruning targets, with different unstructured and structured pruning methods. In both cases, SWD outperforms the other methods for high-pruning targets. For each point, the corresponding estimated percentage of remaining operations ("Ops") is given (except for unstructured pruning). The missing point in the table (*) is due to the fact that too high values of SWD can lead to overflow of the value of the gradient, which induced a critical failure of the training process on this specific point. However, if the value of $a_{\max}$ is instead set to $1 \times 10^{6}$, we obtain 95.19% accuracy, with a compression rate of operations of 82.21%. The best performance for each target is indicated in bold. Operations are reported in light grey for readability reasons.

Table 4. Top-1 accuracy after the final unstructured removal step and the difference of performance it induces, for LeNet-5 on MNIST with pruning targets of 10% and 1%. We observe that sufficiently high values of $a_{\max}$ are needed to prevent the post-removal drop in performance. Higher values of $a_{\min}$ seem to work better than smaller ones. The difference induced by $a_{\min}$ and $a_{\max}$ seems to be more dramatic for higher pruning targets. Colors are added to ease the interpretation of the results.

[Grid Search with LeNet-5 on MNIST. Panels: Top-1 Accuracy after Removal (%) and Change of Accuracy through Removal (%); axes: a_min and a_max.]

Table 5. On ResNet-20 with an initial embedding of 64 feature maps, trained on CIFAR-10 for a pruning target of 90%. Top-1 accuracy after the final unstructured removal step and the difference in performance it induces. The best results are obtained for reasonably low a_start and high a_end, in accordance with the motivation behind SWD we provided in Section 3. Colors are added to ease the interpretation of the results.

[Extended Grid Search. Panel: Top-1 accuracy after removal (%); axes: a_start (1 × 10^4, 1 × 10^3, 1 × 10^2, 1 × 10^1, 1 × 10^0, 1 × 10^−1, 1 × 10^−2, 1 × 10^−3, 1 × 10^−4, 1 × 10^−5) and a_end.]

Table 6. Top-1 accuracy after the final structured removal step and the difference in performance it induces, for ResNet-20 with an initial embedding of 64 feature maps, trained on CIFAR-10 and pruning targets of 75% and 90%. Structured pruning with SWD turned out to require exploring a wider range of values than unstructured pruning, as well as being even more sensitive to a. Colors are added to ease the interpretation of the results.

[Grid Search with Unstructured Pruning. Panels: Top-1 Accuracy after Removal (%) and Change of Accuracy through Removal (%); axes: a_min (1 × 10^−1, 1 × 10^−2, 1 × 10^−3, 1 × 10^−4, 1 × 10^−5) and a_max; first block: ResNet-18 on CIFAR-10.]

Table 8. Top-1 accuracy after the final structured removal step and the difference in performance it induces, for various networks and datasets with a pruning target of 90%. The influence of a_min and a_max varies significantly depending on the problem, although common tendencies persist. As previously shown in Table 6, structured pruning is a lot more sensitive to variations of a_min and a_max. Colors are added to ease the interpretation of the results.

[Grid Search with Structured Pruning. Panels: Top-1 Accuracy after Removal (%) and Change in Accuracy through Removal (%).]

Table 9. Top-1 accuracy for ResNet-20 on CIFAR-10, with an initial embedding of 16 feature maps, with different unstructured pruning targets, for SWD with different values of a_min and a_max. Depending on the pruning rate, the best values to choose are not always the same. If, for each pruning target, we picked the best value among these, SWD would outclass the other technique from Table 3 by a larger margin. The best performance for each target is indicated in bold.

[Influence of a_min and a_max; axis: a_min.]

Experiment on Graph Convolutional Networks
In order to verify that SWD can be applied to tasks that are not image classification (such as CIFAR-10/100 or ImageNet ILSVRC2012), we ran experiments on a Graph Convolutional Network (GCN) based on Kipf and Welling [84] on the Cora dataset [85]. We instantiated the GCN with 16 hidden units and trained it with the Adam optimizer [86] with a weight decay of 5 × 10^−4 and a learning rate of 1 × 10^−2. The dropout rate was set at 50%.
Pruning the models introduced severe instabilities when training for the original 200 epochs, which is why we trained models for 2000 epochs instead. For SWD, we set a_min = 1 × 10^−1 and a_max = 1 × 10^6. For magnitude pruning, models were pruned across 5 iterations, with each fine-tuning lasting 200 epochs, except for the last one, which lasted 2000. The results are reported in Figure 5.

Ablation Test: The Need for Selectivity
Section 4.6 studied the sensitivity of performance to the pace at which SWD increases during training. However, we need to show the necessity of its other characteristic: its selectivity. Indeed, SWD is only applied to a subset of the network's parameters.
We ran this ablation test using ResNet-20, with an initial embedding of 64 feature maps, on CIFAR-10, with unstructured pruning, and under the same conditions as stated in Section 4.5. Without any fine-tuning, we compared three cases: (1) using only simple weight decay, (2) using a weight decay that grows in the same way as SWD, and (3) SWD. Figure 6 shows that neither weight decay nor increasing weight decay achieves the same performance as SWD. Indeed, the weight decay curve amounts to pruning a normally trained network without any fine-tuning, which is expected to be sub-optimal. Increasing global weight decay amounts to applying SWD everywhere and, thus, to pruning the whole network.
Therefore, we can deduce that (1) SWD is a more efficient removal method than the manual nullification of small weights and (2) the selectivity of SWD is necessary.
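
The three cases compared above differ only in which weights receive the extra penalty and in how its coefficient evolves. The following is a minimal sketch, not the authors' implementation: it assumes a regularization term of the form mu * (||w||^2 + a * ||w*||^2), where w* is the set of weights that magnitude pruning would currently remove, and all names (`smallest_magnitude_indices`, `penalty_gradient`, `mu`, `mode`) are hypothetical.

```python
def smallest_magnitude_indices(weights, prune_fraction):
    """Indices of the weights that magnitude pruning would remove."""
    n_pruned = int(round(prune_fraction * len(weights)))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    return set(order[:n_pruned])


def penalty_gradient(weights, mu, a, prune_fraction, mode):
    """Gradient of the regularization term for the three ablation cases.

    Assumed forms (illustrative, not taken from the paper's code):
      "wd"            : mu * ||w||^2
      "increasing_wd" : mu * (1 + a) * ||w||^2, i.e. the growing penalty
                        is applied to every weight
      "swd"           : mu * (||w||^2 + a * ||w*||^2), i.e. the growing
                        penalty is applied to targeted weights only
    """
    if mode == "wd":
        return [2.0 * mu * w for w in weights]
    if mode == "increasing_wd":
        return [2.0 * mu * (1.0 + a) * w for w in weights]
    if mode == "swd":
        targeted = smallest_magnitude_indices(weights, prune_fraction)
        return [
            2.0 * mu * ((1.0 + a) if i in targeted else 1.0) * w
            for i, w in enumerate(weights)
        ]
    raise ValueError(f"unknown mode: {mode}")
```

In this sketch, the "increasing_wd" case pushes every weight toward zero as a grows, whereas the "swd" case leaves the largest weights under ordinary weight decay, which is the selectivity the ablation isolates.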

Computational Cost of SWD
We measured the additional computing time caused by using SWD. Results are presented in Table 10. We performed experiments on ImageNet and CIFAR-10. In both cases, we obtained an increase in computation time on the order of 40% to 50%. These numbers should be put into perspective with the fact that most pruning techniques come with additional training epochs, which can easily double the computation time when compared with the corresponding baselines.

Figure 6. Ablation test: SWD without fine-tuning is compared to a network that has been pruned without fine-tuning and to which either normal weight decay or a weight decay that increases at the same pace as SWD was applied. It appears that weight decay alone is insufficient for obtaining the performance of SWD, and that an increasing global weight decay prunes the entire network. Therefore, the selectivity, as well as the increase, of SWD is necessary to its performance.

Discussion
The experiments on CIFAR-10 have shown that SWD performs on par with standard methods for low pruning targets and greatly outperforms them for high ones. Our method allows for much higher targets at the same accuracy. We think that the multiple desirable properties brought by SWD over standard pruning methods are responsible for a much more efficient identification and removal of the unnecessary parts of networks. Indeed, dramatic degradations of performance, which could come from removing a necessary parameter or filter by mistake, are limited by two things: (1) the continuity of SWD, which lets the other parameters compensate progressively for the loss, and (2) its ability to adapt its targeted parameters, so that the weights that are the most relevant to remove are penalized at a more appropriate time.
We compared SWD to multiple methods, described in Section 4.2. Because of the large number of diverging methods in the literature, we preferred to stick to very standard ones that still serve as baselines for many works and remain relevant points of comparison [20]. The values of the hyper-parameters specific to these methods were taken directly from their original papers. Concerning the other hyper-parameters, we ran each experiment under the same conditions and initialization to separate the influence of the hyper-parameters from that of the initialization and of the actual pruning method.
Because of the low granularity of filter-wise structured pruning, there is always the risk of pruning all filters of a single layer and, thus, breaking the network irremediably. This likely explains the sudden drops in performance that can be observed for reference methods in Figure 2. Since SWD can adapt to induce no such damage, the network does not fall to random guessing even at extreme pruning targets, such as 99.9%. Our results also confirm that SWD can be applied to different datasets and networks, or even pruning structures, and yet stay ahead of the reference methods. Indeed, Figure 5 suggests that the gains observed for a visual classification task carry over to a graph neural network trained on a non-visual task. That means that the properties of SWD are not task- or network-specific and can be transposed to various contexts (see Figure 5), which is an important issue, as shown by Gale [20].
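
The risk described above, where a filter-wise criterion removes every filter of one layer, can be checked explicitly. The following is a minimal sketch under the assumption of a single global threshold on per-filter norms; the names `global_filter_mask` and `collapsed_layers` and the layer names in the usage note are hypothetical and do not come from the paper.

```python
def global_filter_mask(filter_norms, prune_fraction):
    """Keep/remove decision per filter under one global norm ranking.

    filter_norms maps a layer name to the list of its per-filter norms.
    Returns a dict of the same shape holding True for kept filters.
    """
    # Rank every filter of the network by norm, smallest first.
    flat = sorted(
        (norm, layer, idx)
        for layer, norms in filter_norms.items()
        for idx, norm in enumerate(norms)
    )
    n_pruned = int(prune_fraction * len(flat))
    removed = {(layer, idx) for _, layer, idx in flat[:n_pruned]}
    return {
        layer: [(layer, idx) not in removed for idx in range(len(norms))]
        for layer, norms in filter_norms.items()
    }


def collapsed_layers(mask):
    """Layers in which no filter survives the pruning mask."""
    return [layer for layer, kept in mask.items() if not any(kept)]
```

For instance, with norms `{"conv1": [0.9, 0.8], "conv2": [0.01, 0.02], "conv3": [0.5, 0.03]}` and a pruning fraction of 0.5, the three smallest filters are removed and `collapsed_layers` reports `conv2`, even though half of the filters remain overall.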
Multiple observations can be drawn from the grid searches displayed in Tables 4-6. Experiments on MNIST show that the effect of a_min and a_max on both performance and post-removal drop in accuracy depends on the pruning target: the higher the target, the more dramatic the differences in behavior between given ranges of values. Upon comparison with experiments on CIFAR-10, we can tell that these behaviors are also sensitive to the models, datasets, and structures.
Both Tables 4 and 5 show that high values of a (or at least of a_max or a_end) are needed to prevent the post-removal drop in performance. This means that the penalty must be strong enough to reduce weights almost to zero, so that they can be removed seamlessly. Table 5 shows that cases of high a_start work quite well. This is consistent with the literature in Section 2, which tends to demonstrate the importance of sparsity during training. However, the best results are obtained for reasonably low a_min and high a_max, which is consistent with our arguments in favor of SWD in Section 3.
Experiments on ResNet-20 with an initial embedding of 16 feature maps, instead of 64, revealed that these networks were much more sensitive to pruning, had a lower threshold for high a_max values, and were more prone to local instabilities, such as the spike for structured magnitude pruning visible in Figure 3. This is understandable: slimmer networks are expected to be more difficult to prune, since they are less likely to be overparameterized. Because of their thinness, they also tend to be more vulnerable to layer collapse [53], which tends to prematurely reduce the network to random guessing by pruning entire layers and, hence, irremediably breaking the network.
These experiments show that the performance of SWD scales well on this slimmer network relative to the reference methods. However, the increased sensitivity to the values of a_min and a_max highlights how sub-optimal it may be to apply the same values to every pruning target. Indeed, picking the best-performing combination for each target in Table 9 results in a trade-off that outmatches the reference methods by a larger margin than what Figure 3 shows.
However, one may notice that we generally used a different pair of values a_min and a_max for each experiment. Indeed, as stated previously, the behavior of SWD for given values of a is very sensitive to the overall task, and we had to choose empirically the best values we could for our hyperparameters. Moreover, for each pruning/performance trade-off figure, we used the same pair of values, while Table 4 showed the choice to be quite sensitive to the pruning target. Therefore, it is very possible that our results are substantially suboptimal compared with what SWD could achieve with better hyperparameter values. As finding them is very time- and energy-consuming, a method to make this process easier (or to bypass it) would be a significant improvement.
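
The per-target selection discussed above, in which the best pair of bounds is picked separately for each pruning target instead of reusing one pair everywhere, can be sketched as follows. This is only an illustration of the bookkeeping; the function name `best_setting_per_target` and the layout of the `results` dictionary are assumptions, and no measured values are implied.

```python
def best_setting_per_target(results):
    """Pick the best (a_min, a_max) pair for each pruning target.

    results maps (pruning_target, a_min, a_max) to a top-1 accuracy in
    percent, for example as collected during a grid search. Returns a dict
    mapping each pruning target to ((a_min, a_max), accuracy).
    """
    best = {}
    for (target, a_min, a_max), accuracy in results.items():
        # Keep the first pair seen for a target, then replace it only
        # when a strictly better accuracy is found.
        if target not in best or accuracy > best[target][1]:
            best[target] = ((a_min, a_max), accuracy)
    return best
```

Such a selection still requires one full training run per grid point and per target, which is the cost that a dedicated tuning method would need to reduce.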
Moreover, our contribution includes multiple aspects that could be expanded on and further explored, such as the penalty and the evolution function. Indeed, we chose the L2 norm to stick to the definition of weight decay, leaving open the question of how SWD would perform with other norms. Similarly, although the exponential increase of a brought satisfying results, other kinds of functions could be tested.
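
The exponential increase of a mentioned above can be sketched as a geometric interpolation between a_min and a_max. The exact parameterization below (a step index `step` out of `total_steps`) is an assumption made for illustration, and the linear variant is only one example of an alternative function that could be tested; neither function name comes from the paper.

```python
def exponential_a(step, total_steps, a_min, a_max):
    """Geometric interpolation: a_min at step 0, a_max at the last step."""
    if total_steps < 2:
        return a_max
    progress = step / (total_steps - 1)
    return a_min * (a_max / a_min) ** progress


def linear_a(step, total_steps, a_min, a_max):
    """Linear interpolation between the same bounds, for comparison."""
    if total_steps < 2:
        return a_max
    progress = step / (total_steps - 1)
    return a_min + (a_max - a_min) * progress
```

With bounds that span several orders of magnitude, the two schedules behave very differently: halfway through training, the exponential schedule sits at the geometric mean of the bounds, whereas the linear one is already at about half of a_max.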
Let us add a last note about the introduced hyperparameters a_min and a_max. Our tests suggest that a poor choice of these values may dramatically harm performance. Interestingly, however, reasonable choices (a_min small enough and a_max large enough) lead to consistently good results across datasets and architectures. These parameters have to be compared with the ones introduced by other methods. For example, in [15], defining multiple subtargets of pruning at various epochs during training is required, leading to a large combinatorial search space.
Overall, the principle of SWD is flexible enough to serve as a framework for multiple variations. It could be possible to combine SWD with progressive pruning [20] or to choose gradient magnitude as a pruning criterion instead of weight magnitude.
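
As a sketch of this flexibility, the set of parameters targeted by the selective penalty can be computed from any per-parameter score. The code below is illustrative only: `select_targets`, `weight_magnitude`, and `gradient_magnitude` are hypothetical names, and the gradient-based score is merely one possible instance of the alternative criterion mentioned above.

```python
def weight_magnitude(weights, grads):
    """Score each parameter by the absolute value of its weight."""
    return [abs(w) for w in weights]


def gradient_magnitude(weights, grads):
    """Score each parameter by the absolute value of its gradient."""
    return [abs(g) for g in grads]


def select_targets(weights, grads, prune_fraction, criterion=weight_magnitude):
    """Sorted indices of the lowest-scoring parameters under a criterion."""
    scores = criterion(weights, grads)
    n_targeted = int(prune_fraction * len(scores))
    order = sorted(range(len(scores)), key=scores.__getitem__)
    return sorted(order[:n_targeted])
```

Because the penalty only needs the resulting index set, swapping the criterion leaves the rest of the training loop unchanged in this sketch.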

Conclusions
We have proposed a new approach to prune deep neural networks continuously during training. Our theoretically motivated method, Selective Weight Decay (SWD), shows a better performance/parameters trade-off when compared with reference methods from the literature. We have shown that our method performs better while removing the need for any fine-tuning after the network is pruned. One great advantage of SWD is that it can be combined with virtually any pruning criterion on any pruning structure, which opens up many possibilities. The hyperparameter a and its bounds, a_min and a_max, deserve to be studied further, leaving room for future improvements to our method.