# Rethinking Weight Decay for Efficient Neural Network Pruning


## Abstract


## 1. Introduction

- using standardized benchmark datasets, we show that SWD performs significantly better than standard methods at aggressive pruning targets;
- we show that SWD requires fewer hyperparameters, introduces no discontinuity, needs no fine-tuning, and can be applied to any pruning structure with any pruning criterion.

## 2. Problem Statement and Related Work

#### 2.1. Notations

#### 2.2. The Birth and Rebirth of Pruning

#### 2.3. Which Parameters Should Be Pruned?

#### 2.4. How to Prune Parameters and Recover from the Loss

#### 2.5. What Kind of Structures Should Be Pruned?

## 3. Selective Weight Decay

#### 3.1. Principle

**Algorithm 1.** Summary of SWD.
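The mechanism can be sketched in a few lines (an illustrative reconstruction, not the authors' released code; the helper names `swd_penalty_grad` and `a_schedule` are ours): SWD adds a weight-decay term that acts only on the weights currently selected by the pruning criterion, with a factor $a$ that grows from ${a}_{min}$ to ${a}_{max}$ over training.

```python
import numpy as np

def swd_penalty_grad(w, a, mu, target_sparsity):
    """Gradient of the selective penalty a * mu * ||w_pruned||^2, where
    w_pruned are the weights the magnitude criterion currently selects
    for pruning. `mu` is the base weight-decay coefficient."""
    k = int(round(target_sparsity * w.size))
    if k == 0:
        return np.zeros_like(w)
    threshold = np.sort(np.abs(w), axis=None)[k - 1]  # k-th smallest magnitude
    mask = (np.abs(w) <= threshold).astype(w.dtype)   # 1 where selected for pruning
    return 2.0 * a * mu * mask * w

def a_schedule(step, total_steps, a_min, a_max):
    """Exponential ramp of the penalty factor from a_min to a_max."""
    t = step / max(total_steps - 1, 1)
    return a_min * (a_max / a_min) ** t
```

Because the mask is recomputed at every step, weights can move in and out of the penalized set, which is what makes the pruning smooth rather than a hard, discontinuous removal.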

#### 3.2. SWD as a Lagrangian Smoothing of Pruning

#### 3.3. On the Adaptability of SWD to Structures

## 4. Experiments

#### 4.1. General Training Conditions

#### 4.2. Chosen Baselines and Specificities of Each Method

**Unstructured pruning: Han et al. [15]**

**Structured pruning: Liu et al. [17]**

**LR-Rewinding: Renda et al. [61]**

**SWD**

**Overall methodology**

- All the unstructured pruning methods use weight magnitude as their criterion;
- All the structured pruning methods are applied to batch normalization layers;
- Structured LR-Rewinding also applies the smooth-${L}_{1}$ penalty from Liu et al. [17];
- The hyperparameters specific to the aforementioned methods, namely the number of iterations and the smooth-${L}_{1}$ penalty coefficients, are taken directly from their respective original papers.

- SWD does not apply any fine-tuning;
- Unstructured LR-Rewinding only re-trains the network once (because of the extra cost from fully retraining networks, compared to fine-tuning);
- SWD does not apply a smooth-${L}_{1}$ norm (since it would clash with SWD’s own penalty).
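To make the shared unstructured criterion concrete, a minimal magnitude mask might look like the following sketch (`magnitude_mask` is a hypothetical helper, not code from any of the compared methods):

```python
import numpy as np

def magnitude_mask(weights, pruning_target):
    """Unstructured magnitude criterion: zero out the `pruning_target`
    fraction of weights with the smallest absolute values."""
    flat = np.abs(weights).ravel()
    k = int(round(pruning_target * flat.size))
    if k == 0:
        return np.ones_like(weights)
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    return (np.abs(weights) > threshold).astype(weights.dtype)
```

Multiplying a layer’s weights by this mask implements the removal step; the structured variants instead rank whole channels, e.g., by the scaling factors of their batch normalization layers.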

#### 4.3. Comparison with the State of the Art

#### 4.4. Experiments on ImageNet ILSVRC2012

#### 4.5. Impact of SWD on the Pruning/Accuracy Trade-Off

**Figure 2.** Comparison of the trade-off between pruning target and top-1 accuracy for ResNet-20 (with an initial embedding of 64 feature maps) on CIFAR-10, for SWD and two reference methods. “Magnitude pruning” refers either to the method used in Han et al. [15] or in Liu et al. [17]. SWD has a better performance/parameter trade-off at high pruning targets. (**a**) Unstructured pruning. (**b**) Structured pruning.

**Figure 3.** Comparison of the trade-off between pruning target and top-1 accuracy for ResNet-20 on CIFAR-10, with an initial embedding of 16 feature maps, for SWD and two reference methods. “Magnitude pruning” refers either to the method used in Han et al. [15] or in Liu et al. [17]. SWD has a better performance/parameter trade-off at high pruning targets. (**a**) Unstructured pruning. (**b**) Structured pruning.

- convolution layer: ${f}_{in}\times {f}_{out}\times {k}^{2}\times h\times w$;
- batch normalization layer: ${f}_{in}\times h\times w\times 2$;
- dense layer: ${f}_{in}\times {f}_{out}+{f}_{out}$.
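These per-layer counts translate directly into code; the helpers below simply mirror the formulas above, with the same symbols as arguments:

```python
def conv_ops(f_in, f_out, k, h, w):
    # k x k convolution mapping f_in channels to f_out channels,
    # evaluated over an h x w output feature map
    return f_in * f_out * k**2 * h * w

def batchnorm_ops(f_in, h, w):
    # one scale and one shift per output value
    return f_in * h * w * 2

def dense_ops(f_in, f_out):
    # one multiplication per weight plus one bias addition per output
    return f_in * f_out + f_out
```

Summing these over a network’s layers, with pruned channels excluded from $f_{in}$ and $f_{out}$, yields estimates such as the “Ops” percentages reported in Table 3.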

**Table 3.** Top-1 accuracy of ResNet-20, with an initial embedding of 64 or 16 feature maps, on CIFAR-10 for various pruning targets, with different unstructured and structured pruning methods. In both cases, SWD outperforms the other methods at high pruning targets. For each point, the corresponding estimated percentage of remaining operations (“Ops”) is given (except for unstructured pruning). The missing point in the table (*) is due to the fact that excessively high values of the SWD penalty can make the gradient overflow, which caused training to fail on this specific point. However, if the value of ${a}_{max}$ is instead set to $1\times {10}^{6}$, we obtain 95.19% accuracy, with an estimated 82.21% of operations remaining. The best performance for each target is indicated in bold. Operations are reported in light grey for readability reasons.

**ResNet-20 on CIFAR-10 (Unstructured)**

| Target | Base (64 FM) | LRR (64 FM) | SWD (64 FM) | Base (16 FM) | LRR (16 FM) | SWD (16 FM) |
|---|---|---|---|---|---|---|
| 10 | 95.45 | 94.82 | 95.43 | 92.23 | 90.47 | 92.63 |
| 20 | 95.47 | 95.15 | 95.55 | 92.25 | 89.70 | 92.47 |
| 30 | 95.43 | 95.03 | 95.47 | 92.27 | 92.57 | 92.45 |
| 40 | 95.48 | 94.94 | 95.40 | 92.31 | 90.15 | 92.36 |
| 50 | 95.44 | 95.33 | 95.46 | 92.43 | 91.06 | 92.08 |
| 60 | 95.32 | 95.04 | 95.37 | 91.95 | 89.93 | 92.15 |
| 70 | 95.30 | 95.45 | 95.04 | 91.78 | 89.80 | 91.69 |
| 75 | 95.32 | 95.15 | 95.34 | 91.46 | 89.39 | 90.90 |
| 80 | 95.32 | 95.14 | 95.09 | 90.77 | 91.52 | 91.37 |
| 85 | 95.05 | 95.03 | 94.99 | 90.22 | 88.51 | 90.97 |
| 90 | 94.77 | 94.72 | 94.90 | 85.26 | 88.12 | 90.15 |
| 92.5 | 94.48 | 94.74 | 94.58 | 79.98 | 86.07 | 89.88 |
| 95 | 94.03 | 93.66 | 94.40 | 77.15 | 83.27 | 88.90 |
| 96 | 93.38 | 93.63 | 94.14 | 79.41 | 82.96 | 88.69 |
| 97 | 91.95 | 93.34 | 93.76 | 68.85 | 82.75 | 86.95 |
| 97.5 | 91.43 | 92.48 | 93.52 | 68.51 | 79.32 | 86.16 |
| 98 | 90.58 | 91.64 | 93.49 | 58.15 | 75.21 | 84.88 |
| 98.5 | 87.44 | 90.36 | 93.00 | 41.60 | 62.52 | 83.33 |
| 99 | 83.42 | 87.38 | 92.50 | 41.26 | 51.93 | 75.89 |
| 99.5 | 66.90 | 82.21 | 91.05 | 34.88 | 37.22 | 29.35 |
| 99.8 | 48.52 | 65.46 | 86.81 | 10.00 | 10.00 | 16.47 |
| 99.9 | 27.78 | 45.44 | 81.32 | 10.00 | 10.00 | 11.11 |

**ResNet-20 on CIFAR-10 (Structured)**

| Target | Base (64) | Ops | LRR (64) | Ops | SWD (64) | Ops | Base (16) | Ops | LRR (16) | Ops | SWD (16) | Ops |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 94.83 | 84.13 | 95.10 | 85.37 | * | * | 91.96 | 90.06 | 89.95 | 85.70 | 91.88 | 77.68 |
| 20 | 94.88 | 70.41 | 95.39 | 76.45 | 95.38 | 70.21 | 91.25 | 78.64 | 88.91 | 76.21 | 91.97 | 64.63 |
| 30 | 94.88 | 58.20 | 95.53 | 67.45 | 95.48 | 59.67 | 90.55 | 69.77 | 90.65 | 63.10 | 91.22 | 59.23 |
| 40 | 94.91 | 48.06 | 95.32 | 53.40 | 95.44 | 51.96 | 89.59 | 62.20 | 89.94 | 54.85 | 90.67 | 51.07 |
| 50 | 94.92 | 40.25 | 94.31 | 43.68 | 94.96 | 44.31 | 89.11 | 51.72 | 88.07 | 44.04 | 90.27 | 41.95 |
| 60 | 94.29 | 33.88 | 95.02 | 35.82 | 94.93 | 37.90 | 87.70 | 42.16 | 85.84 | 35.60 | 89.66 | 33.42 |
| 70 | 93.24 | 26.01 | 94.98 | 28.08 | 94.64 | 30.20 | 85.08 | 33.12 | 85.84 | 30.29 | 88.93 | 28.76 |
| 75 | 92.08 | 21.36 | 94.67 | 24.26 | 94.25 | 24.89 | 82.61 | 28.68 | 83.58 | 22.92 | 88.23 | 25.59 |
| 80 | 84.20 | 16.55 | 94.45 | 19.97 | 94.15 | 22.46 | 79.71 | 24.18 | 83.50 | 19.07 | 87.82 | 23.94 |
| 85 | 77.18 | 12.07 | 94.36 | 16.45 | 94.27 | 19.02 | 10.00 | 17.29 | 82.53 | 18.36 | 86.79 | 19.20 |
| 90 | 71.01 | 8.04 | 93.42 | 11.67 | 93.72 | 14.35 | 10.00 | 12.31 | 80.19 | 12.06 | 85.25 | 15.75 |
| 92.5 | 10.00 | 5.87 | 92.94 | 8.93 | 93.06 | 12.84 | 10.00 | 10.40 | 74.81 | 9.86 | 83.67 | 12.38 |
| 95 | 10.00 | 4.01 | 91.14 | 6.66 | 91.93 | 9.65 | 64.23 | 8.10 | 66.30 | 5.89 | 80.66 | 11.08 |
| 96 | 10.00 | 3.39 | 89.80 | 5.72 | 91.67 | 8.89 | 10.00 | 7.16 | 64.90 | 4.81 | 78.39 | 9.99 |
| 97 | 10.00 | 2.84 | 89.25 | 4.68 | 90.57 | 7.45 | 10.00 | 5.08 | 39.30 | 4.25 | 75.45 | 8.53 |
| 97.5 | 10.00 | 2.26 | 87.92 | 4.27 | 89.90 | 7.05 | 10.00 | 4.21 | 10.00 | 3.79 | 72.73 | 7.80 |
| 98 | 10.00 | 1.80 | 88.00 | 3.63 | 89.07 | 6.00 | 10.00 | 3.70 | 10.00 | 3.12 | 71.45 | 6.73 |
| 98.5 | 10.00 | 1.37 | 74.97 | 2.73 | 87.68 | 5.29 | 10.00 | 2.13 | 10.00 | 2.64 | 66.71 | 5.08 |
| 99 | 10.00 | 0.99 | 57.99 | 2.32 | 84.66 | 4.22 | 10.00 | 1.79 | 10.00 | 2.24 | 51.49 | 4.25 |
| 99.5 | 10.00 | 0.57 | 10.00 | 1.45 | 79.86 | 2.42 | 10.00 | 0.80 | 10.00 | 1.53 | 47.63 | 1.90 |
| 99.8 | 10.00 | 0.26 | 10.00 | 0.75 | 70.18 | 0.97 | 10.00 | 0.03 | 10.00 | 0.13 | 36.67 | 0.53 |
| 99.9 | 10.00 | 0.12 | 10.00 | 0.37 | 66.96 | 0.45 | 10.00 | 0.01 | 10.00 | 0.01 | 10.00 | 0.35 |

#### 4.6. Grid Search on Multiple Models and Datasets

**Table 4.** Top-1 accuracy after the final unstructured removal step and the difference in performance it induces, for LeNet-5 on MNIST with pruning targets of 90% and 99% (i.e., 10% and 1% of the weights remaining). We observe that sufficiently high values of ${a}_{max}$ are needed to prevent the post-removal drop in performance. Higher values of ${a}_{min}$ seem to work better than lower ones. The difference induced by ${a}_{min}$ and ${a}_{max}$ appears more pronounced for higher pruning targets. Colors are added to ease the interpretation of the results.

**Grid Search with LeNet-5 on MNIST** (“Acc.” = top-1 accuracy after removal, in %; “Δ” = change of accuracy through removal, in %; column headers give ${a}_{min}$)

| ${a}_{max}$ | Acc. $10^{-1}$ | Acc. $10^{-2}$ | Acc. $10^{-3}$ | Acc. $10^{-4}$ | Δ $10^{-1}$ | Δ $10^{-2}$ | Δ $10^{-3}$ | Δ $10^{-4}$ |
|---|---|---|---|---|---|---|---|---|
| **Pruning target 90%** | | | | | | | | |
| $10^{1}$ | | | | | | | | |
| $10^{2}$ | | | | | | | | |
| $10^{3}$ | | | | | | | | |
| $10^{4}$ | | | | | | | | |
| **Pruning target 99%** | | | | | | | | |
| $10^{1}$ | | | | | | | | |
| $10^{2}$ | | | | | | | | |
| $10^{3}$ | | | | | | | | |
| $10^{4}$ | | | | | | | | |
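The grid searches in this section amount to sweeping $({a}_{min},{a}_{max})$ pairs and recording the post-removal accuracy. A minimal sketch follows, where `train_and_eval` is a hypothetical callable standing in for a full SWD training run followed by weight removal:

```python
import itertools

def grid_search(train_and_eval, a_min_values, a_max_values):
    """Return (best_accuracy, best_a_min, best_a_max) over the grid."""
    best = None
    for a_min, a_max in itertools.product(a_min_values, a_max_values):
        accuracy = train_and_eval(a_min, a_max)
        if best is None or accuracy > best[0]:
            best = (accuracy, a_min, a_max)
    return best
```

Each cell of the grid tables corresponds to one such $(a_{min}, a_{max})$ run; since every run is independent, the sweep parallelizes trivially.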

**Table 5.** Top-1 accuracy after the final unstructured removal step and the difference in performance it induces, for ResNet-20 with an initial embedding of 64 feature maps, trained on CIFAR-10 with a pruning target of 90%. The best results are obtained for reasonably low ${a}_{start}$ and high ${a}_{end}$, in accordance with the motivation behind SWD provided in Section 3. Colors are added to ease the interpretation of the results.

**Extended Grid Search** (column headers give ${a}_{start}$; rows give ${a}_{end}$)

| ${a}_{end}$ | $10^{4}$ | $10^{3}$ | $10^{2}$ | $10^{1}$ | $10^{0}$ | $10^{-1}$ | $10^{-2}$ | $10^{-3}$ | $10^{-4}$ | $10^{-5}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| **Top-1 accuracy after removal (%)** | | | | | | | | | | |
| $10^{1}$ | | | | | | | | | | |
| $10^{2}$ | | | | | | | | | | |
| $10^{3}$ | | | | | | | | | | |
| $10^{4}$ | | | | | | | | | | |
| $10^{5}$ | | | | | | | | | | |
| **Change in accuracy through removal (%)** | | | | | | | | | | |
| $10^{1}$ | | | | | | | | | | |
| $10^{2}$ | | | | | | | | | | |
| $10^{3}$ | | | | | | | | | | |
| $10^{4}$ | | | | | | | | | | |
| $10^{5}$ | | | | | | | | | | |

**Table 6.** Top-1 accuracy after the final structured removal step and the difference in performance it induces, for ResNet-20 with an initial embedding of 64 feature maps, trained on CIFAR-10 with pruning targets of 75% and 90%. Structured pruning with SWD turned out to require exploring a wider range of values than unstructured pruning, and to be even more sensitive to the value of $a$. Colors are added to ease the interpretation of the results.

**Grid Search with Structured Pruning** (“Acc.” = top-1 accuracy after removal, in %; “Δ” = change of accuracy through removal, in %; column headers give ${a}_{min}$)

| ${a}_{max}$ | Acc. $10^{0}$ | Acc. $10^{-1}$ | Acc. $10^{-2}$ | Acc. $10^{-3}$ | Acc. $10^{-4}$ | Δ $10^{0}$ | Δ $10^{-1}$ | Δ $10^{-2}$ | Δ $10^{-3}$ | Δ $10^{-4}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| **Pruning target 75%** | | | | | | | | | | |
| $10^{1}$ | | | | | | | | | | |
| $10^{2}$ | | | | | | | | | | |
| $10^{3}$ | | | | | | | | | | |
| $10^{4}$ | | | | | | | | | | |
| $10^{5}$ | | | | | | | | | | |
| $10^{6}$ | | | | | | | | | | |
| **Pruning target 90%** | | | | | | | | | | |
| $10^{1}$ | | | | | | | | | | |
| $10^{2}$ | | | | | | | | | | |
| $10^{3}$ | | | | | | | | | | |
| $10^{4}$ | | | | | | | | | | |
| $10^{5}$ | | | | | | | | | | |
| $10^{6}$ | | | | | | | | | | |

**Table 7.** Top-1 accuracy after the final unstructured removal step and the difference in performance it induces, for various networks and datasets with a pruning target of 90%. The influence of ${a}_{min}$ and ${a}_{max}$ varies significantly depending on the problem, although common tendencies persist. Colors are added to ease the interpretation of the results.

**Grid Search with Unstructured Pruning** (“Acc.” = top-1 accuracy after removal, in %; “Δ” = change of accuracy through removal, in %; column headers give ${a}_{min}$)

| ${a}_{max}$ | Acc. $10^{-1}$ | Acc. $10^{-2}$ | Acc. $10^{-3}$ | Acc. $10^{-4}$ | Acc. $10^{-5}$ | Δ $10^{-1}$ | Δ $10^{-2}$ | Δ $10^{-3}$ | Δ $10^{-4}$ | Δ $10^{-5}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| **ResNet-18 on CIFAR-10** | | | | | | | | | | |
| $10^{1}$ | | | | | | | | | | |
| $10^{2}$ | | | | | | | | | | |
| $10^{3}$ | | | | | | | | | | |
| $10^{4}$ | | | | | | | | | | |
| $10^{5}$ | | | | | | | | | | |
| **ResNet-20 on CIFAR-10** | | | | | | | | | | |
| $10^{1}$ | | | | | | | | | | |
| $10^{2}$ | | | | | | | | | | |
| $10^{3}$ | | | | | | | | | | |
| $10^{4}$ | | | | | | | | | | |
| $10^{5}$ | | | | | | | | | | |
| **ResNet-34 on CIFAR-100** | | | | | | | | | | |
| $10^{1}$ | | | | | | | | | | |
| $10^{2}$ | | | | | | | | | | |
| $10^{3}$ | | | | | | | | | | |
| $10^{4}$ | | | | | | | | | | |
| $10^{5}$ | | | | | | | | | | |

**Figure 4.** Comparison of the trade-off between the unstructured pruning target and top-1 accuracy for ResNet-20 on CIFAR-10, with an initial embedding of 16 feature maps, for SWD with different values of ${a}_{min}$ and ${a}_{max}$. The best values to choose depend on the pruning rate.

**Table 8.** Top-1 accuracy after the final structured removal step and the difference in performance it induces, for various networks and datasets with a pruning target of 90%. The influence of ${a}_{min}$ and ${a}_{max}$ varies significantly depending on the problem, although common tendencies persist. As previously shown in Table 6, structured pruning is considerably more sensitive to variations of ${a}_{min}$ and ${a}_{max}$. Colors are added to ease the interpretation of the results.

**Grid Search with Structured Pruning** (“Acc.” = top-1 accuracy after removal, in %; “Δ” = change in accuracy through removal, in %; column headers give ${a}_{min}$)

| ${a}_{max}$ | Acc. $10^{0}$ | Acc. $10^{-1}$ | Acc. $10^{-2}$ | Acc. $10^{-3}$ | Acc. $10^{-4}$ | Acc. $10^{-5}$ | Δ $10^{0}$ | Δ $10^{-1}$ | Δ $10^{-2}$ | Δ $10^{-3}$ | Δ $10^{-4}$ | Δ $10^{-5}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **ResNet-18 on CIFAR-10** | | | | | | | | | | | | |
| $10^{1}$ | | | | | | | | | | | | |
| $10^{2}$ | | | | | | | | | | | | |
| $10^{3}$ | | | | | | | | | | | | |
| $10^{4}$ | | | | | | | | | | | | |
| $10^{5}$ | | | | | | | | | | | | |
| $10^{6}$ | | | | | | | | | | | | |
| **ResNet-20 on CIFAR-10** | | | | | | | | | | | | |
| $10^{1}$ | | | | | | | | | | | | |
| $10^{2}$ | | | | | | | | | | | | |
| $10^{3}$ | | | | | | | | | | | | |
| $10^{4}$ | | | | | | | | | | | | |
| $10^{5}$ | | | | | | | | | | | | |
| $10^{6}$ | | | | | | | | | | | | |
| **ResNet-34 on CIFAR-10** | | | | | | | | | | | | |
| $10^{1}$ | | | | | | | | | | | | |
| $10^{2}$ | | | | | | | | | | | | |
| $10^{3}$ | | | | | | | | | | | | |
| $10^{4}$ | | | | | | | | | | | | |
| $10^{5}$ | | | | | | | | | | | | |
| $10^{6}$ | | | | | | | | | | | | |

**Table 9.** Top-1 accuracy for ResNet-20 on CIFAR-10, with an initial embedding of 16 feature maps, for different unstructured pruning targets and different values of ${a}_{min}$ and ${a}_{max}$ for SWD. The best values depend on the pruning rate: if, for each pruning target, we picked the best value among these, SWD would outperform the other techniques from Table 3 by an even larger margin. The best performance for each target is indicated in bold.

**Influence of ${a}_{min}$ and ${a}_{max}$**

| Target | ${a}_{min}=0.1$, ${a}_{max}=1\times 10^{4}$ | ${a}_{min}=0.1$, ${a}_{max}=5\times 10^{4}$ | ${a}_{min}=1$, ${a}_{max}=1\times 10^{4}$ | ${a}_{min}=1$, ${a}_{max}=5\times 10^{4}$ |
|---|---|---|---|---|
| 10 | 92.38 | 92.50 | 92.63 | 92.56 |
| 20 | 92.32 | 92.57 | 92.47 | 92.62 |
| 30 | 92.53 | 92.34 | 92.45 | 92.55 |
| 40 | 92.58 | 92.35 | 92.36 | 91.98 |
| 50 | 92.15 | 92.02 | 92.08 | 92.02 |
| 60 | 92.28 | 92.09 | 92.15 | 91.89 |
| 70 | 92.01 | 91.87 | 91.69 | 91.57 |
| 75 | 92.27 | 91.89 | 90.90 | 91.70 |
| 80 | 91.85 | 91.52 | 91.37 | 91.04 |
| 85 | 91.44 | 91.48 | 90.97 | 90.70 |
| 90 | 90.91 | 90.83 | 90.15 | 90.22 |
| 92.5 | 90.59 | 90.16 | 89.88 | 89.36 |
| 95 | 89.30 | 89.00 | 88.90 | 88.28 |
| 96 | 88.11 | 88.64 | 88.69 | 87.72 |
| 97 | 87.01 | 87.00 | 86.95 | 86.67 |
| 97.5 | 85.76 | 86.09 | 86.16 | 85.91 |
| 98 | 83.56 | 84.27 | 84.88 | 85.11 |
| 98.5 | 75.47 | 81.62 | 83.33 | 82.91 |
| 99 | 37.24 | 74.07 | 75.89 | 80.20 |
| 99.5 | 21.61 | 47.26 | 29.35 | 64.01 |
| 99.8 | 12.27 | 11.33 | 16.47 | 25.19 |
| 99.9 | 9.78 | 16.39 | 11.11 | 10.90 |

#### 4.7. Experiment on Graph Convolutional Networks

#### 4.8. Ablation Test: The Need for Selectivity

**Figure 6.** Ablation test: SWD without fine-tuning is compared to a network pruned without fine-tuning to which either normal weight decay or a global weight decay that increases at the same pace as SWD was applied. Weight decay alone is insufficient to match the performance of SWD, and an increasing global weight decay ends up pruning the entire network. Both the selectivity and the growth of the SWD penalty are therefore necessary for its performance.

#### 4.9. Computational Cost of SWD

## 5. Discussion

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. **2012**, 25, 1097–1105.
- Le Cun, Y.; Haffner, P.; Bottou, L.; Bengio, Y. Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision; Springer: Berlin/Heidelberg, Germany, 1999; pp. 319–345.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv **2015**, arXiv:cs.LG/1502.03167.
- Lin, M.; Chen, Q.; Yan, S. Network in network. arXiv **2013**, arXiv:1312.4400.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv **2014**, arXiv:1409.1556.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500.
- Boukli Hacene, G. Processing and Learning Deep Neural Networks on Chip. Ph.D. Thesis, Ecole nationale supérieure Mines-Télécom Atlantique Bretagne Pays de la Loire, Brest, France, 2019.
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. Stat **2015**, 1050, 9.
- Lassance, C.; Bontonou, M.; Hacene, G.B.; Gripon, V.; Tang, J.; Ortega, A. Deep Geometric Knowledge Distillation with Graphs. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 8484–8488.
- Courbariaux, M.; Bengio, Y.; David, J.P. BinaryConnect: Training Deep Neural Networks with binary weights during propagations. In Advances in Neural Information Processing Systems 28; Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; pp. 3123–3131.
- Rastegari, M.; Ordonez, V.; Redmon, J.; Farhadi, A. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 525–542.
- Denton, E.L.; Zaremba, W.; Bruna, J.; Le Cun, Y.; Fergus, R. Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation. In Advances in Neural Information Processing Systems 27; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; pp. 1269–1277.
- Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both Weights and Connections for Efficient Neural Network. In Advances in Neural Information Processing Systems 28; Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; pp. 1135–1143.
- Han, S.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv **2015**, arXiv:1510.00149.
- Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; Zhang, C. Learning Efficient Convolutional Networks Through Network Slimming. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
- Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; Graf, H.P. Pruning Filters for Efficient ConvNets. arXiv **2016**, arXiv:1608.08710.
- Liu, Z.; Sun, M.; Zhou, T.; Huang, G.; Darrell, T. Rethinking the Value of Network Pruning. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
- Gale, T.; Elsen, E.; Hooker, S. The State of Sparsity in Deep Neural Networks. arXiv **2019**, arXiv:1902.09574.
- Louizos, C.; Welling, M.; Kingma, D.P. Learning Sparse Neural Networks through L_0 Regularization. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
- Krogh, A.; Hertz, J.A. A Simple Weight Decay Can Improve Generalization. In Advances in Neural Information Processing Systems 4; Moody, J.E., Hanson, S.J., Lippmann, R.P., Eds.; Morgan-Kaufmann: Burlington, MA, USA, 1992; pp. 950–957.
- Plaut, D.C.; Nowlan, S.J.; Hinton, G.E. Experiments on Learning Back Propagation; Technical Report CMU–CS–86–126; Carnegie–Mellon University: Pittsburgh, PA, USA, 1986.
- Hassibi, B.; Stork, D.G. Second order derivatives for network pruning: Optimal Brain Surgeon. In Advances in Neural Information Processing Systems 5; Hanson, S.J., Cowan, J.D., Giles, C.L., Eds.; Morgan-Kaufmann: Burlington, MA, USA, 1993; pp. 164–171.
- Le Cun, Y.; Denker, J.S.; Solla, S.A. Optimal Brain Damage. In Advances in Neural Information Processing Systems 2; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1990; pp. 598–605.
- Mozer, M.C.; Smolensky, P. Skeletonization: A Technique for Trimming the Fat from a Network via Relevance Assessment. In Advances in Neural Information Processing Systems 1; Touretzky, D.S., Ed.; Morgan-Kaufmann: Burlington, MA, USA, 1989; pp. 107–115.
- Reed, R. Pruning algorithms-a survey. IEEE Trans. Neural Netw. **1993**, 4, 740–747.
- Blalock, D.; Gonzalez Ortiz, J.J.; Frankle, J.; Guttag, J. What is the State of Neural Network Pruning? arXiv **2020**, arXiv:2003.03033.
- Anwar, S.; Sung, W. Compact Deep Convolutional Neural Networks With Coarse Pruning. arXiv **2016**, arXiv:1610.09639.
- Hu, H.; Peng, R.; Tai, Y.W.; Tang, C.K. Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures. arXiv **2016**, arXiv:1607.03250.
- Luo, J.H.; Wu, J.; Lin, W. ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
- Srinivas, S.; Venkatesh Babu, R. Data-free parameter pruning for Deep Neural Networks. arXiv **2015**, arXiv:1507.06149.
- Yu, R.; Li, A.; Chen, C.; Lai, J.; Morariu, V.I.; Han, X.; Gao, M.; Lin, C.; Davis, L.S. NISP: Pruning Networks Using Neuron Importance Score Propagation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9194–9203.
- Hassibi, B.; Stork, D.G.; Wolff, G. Optimal Brain Surgeon: Extensions and performance comparisons. In Advances in Neural Information Processing Systems 6; Cowan, J.D., Tesauro, G., Alspector, J., Eds.; Morgan-Kaufmann: Burlington, MA, USA, 1994; pp. 263–270.
- Karnin, E.D. A simple procedure for pruning back-propagation trained neural networks. IEEE Trans. Neural Netw. **1990**, 1, 239–242.
- Tresp, V.; Neuneier, R.; Zimmermann, H.G. Early Brain Damage. In Advances in Neural Information Processing Systems 9; Mozer, M.C., Jordan, M.I., Petsche, T., Eds.; MIT Press: Cambridge, MA, USA, 1997; pp. 669–675.
- Dong, X.; Chen, S.; Pan, S. Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon. In Advances in Neural Information Processing Systems 30; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 4857–4867.
- Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; Kautz, J. Pruning convolutional neural networks for resource efficient inference. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017-Conference Track Proceedings, Toulon, France, 24–26 April 2017.
- Chauvin, Y. A Back-Propagation Algorithm with Optimal Use of Hidden Units. In Advances in Neural Information Processing Systems 1; Touretzky, D.S., Ed.; Morgan-Kaufmann: Burlington, MA, USA, 1989; pp. 519–526.
- Hanson, S.J.; Pratt, L.Y. Comparing Biases for Minimal Network Construction with Back-Propagation. In Advances in Neural Information Processing Systems 1; Touretzky, D.S., Ed.; Morgan-Kaufmann: Burlington, MA, USA, 1989; pp. 177–185.
- Janowsky, S.A. Pruning versus clipping in neural networks. Phys. Rev. A **1989**, 39, 6600–6603.
- Segee, B.E.; Carter, M.J. Fault tolerance of pruned multilayer networks. In Proceedings of the IJCNN-91-Seattle International Joint Conference on Neural Networks, Seattle, WA, USA, 8–14 July 1991; Volume ii, pp. 447–452.
- Le Cun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature **2015**, 521, 436–444.
- Bellec, G.; Kappel, D.; Maass, W.; Legenstein, R. Deep Rewiring: Training very sparse deep networks. arXiv **2017**, arXiv:1711.05136.
- Dai, X.; Yin, H.; Jha, N.K. NeST: A Neural Network Synthesis Tool Based on a Grow-and-Prune Paradigm. IEEE Trans. Comput. **2019**, 68, 1487–1497.
- He, Y.; Kang, G.; Dong, X.; Fu, Y.; Yang, Y. Soft filter pruning for accelerating deep convolutional neural networks. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 2234–2240.
- Mocanu, D.; Mocanu, E.; Stone, P.; Nguyen, P.; Gibescu, M.; Liotta, A. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nat. Commun. **2018**, 9.
- Dettmers, T.; Zettlemoyer, L. Sparse Networks from Scratch: Faster Training without Losing Performance. arXiv **2019**, arXiv:1907.04840.
- Evci, U.; Gale, T.; Menick, J.; Castro, P.S.; Elsen, E. Rigging the Lottery: Making All Tickets Winners. arXiv **2019**, arXiv:1911.11134.
- Mostafa, H.; Wang, X. Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization. arXiv **2019**, arXiv:1902.05967.
- Frankle, J.; Dziugaite, G.K.; Roy, D.M.; Carbin, M. Pruning Neural Networks at Initialization: Why are We Missing the Mark? arXiv **2020**, arXiv:2009.08576.
- Lee, N.; Ajanthan, T.; Torr, P.H. SNIP: Single-shot network pruning based on connection sensitivity. In Proceedings of the International Conference on Learning Representations, ICLR, New Orleans, LA, USA, 6–9 May 2019.
- Tanaka, H.; Kunin, D.; Yamins, D.L.; Ganguli, S. Pruning neural networks without any data by iteratively conserving synaptic flow. Adv. Neural Inf. Process. Syst. **2020**, 33, 6377–6389.
- Wang, C.; Zhang, G.; Grosse, R. Picking Winning Tickets Before Training by Preserving Gradient Flow. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
- Frankle, J.; Carbin, M. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
- Frankle, J.; Dziugaite, G.K.; Roy, D.; Carbin, M. Linear mode connectivity and the lottery ticket hypothesis. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 3259–3269.
- Frankle, J.; Dziugaite, G.K.; Roy, D.M.; Carbin, M. Stabilizing the lottery ticket hypothesis. arXiv **2019**, arXiv:1903.01611.
- Malach, E.; Yehudai, G.; Shalev-Schwartz, S.; Shamir, O. Proving the lottery ticket hypothesis: Pruning is all you need. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 6682–6691.
- Morcos, A.S.; Yu, H.; Paganini, M.; Tian, Y. One ticket to win them all: Generalizing lottery ticket initializations across datasets and optimizers. Stat **2019**, 1050, 6.
- Zhou, H.; Lan, J.; Liu, R.; Yosinski, J. Deconstructing lottery tickets: Zeros, signs, and the supermask. arXiv **2019**, arXiv:1905.01067.
- Renda, A.; Frankle, J.; Carbin, M. Comparing rewinding and fine-tuning in neural network pruning. arXiv **2020**, arXiv:2003.02389.
- Guo, Y.; Yao, A.; Chen, Y. Dynamic Network Surgery for Efficient DNNs. In Advances in Neural Information Processing Systems 29; Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; pp. 1379–1387.
- Srinivas, S.; Subramanya, A.; Venkatesh Babu, R. Training Sparse Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Honolulu, HI, USA, 21–26 July 2017.
- Xiao, X.; Wang, Z.; Rajasekaran, S. AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters. In Advances in Neural Information Processing Systems; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32.
- Dai, B.; Zhu, C.; Guo, B.; Wipf, D. Compressing neural networks using the variational information bottleneck. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1135–1144.
- Louizos, C.; Ullrich, K.; Welling, M. Bayesian Compression for Deep Learning. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
- Molchanov, D.; Ashukha, A.; Vetrov, D. Variational dropout sparsifies deep neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 2498–2507.
- Neklyudov, K.; Molchanov, D.; Ashukha, A.; Vetrov, D. Structured Bayesian pruning via log-normal multiplicative noise. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6778–6787.
- Ullrich, K.; Meeds, E.; Welling, M. Soft weight-sharing for neural network compression. Stat **2017**, 1050, 13.
- Nowlan, S.J.; Hinton, G.E. Simplifying neural networks by soft weight-sharing. Neural Comput. **1992**, 4, 473–493.
- Anwar, S.; Hwang, K.; Sung, W. Structured Pruning of Deep Convolutional Neural Networks. ACM J. Emerg. Technol. Comput. Syst. **2015**, 13.
- He, Y.; Zhang, X.; Sun, J. Channel Pruning for Accelerating Very Deep Neural Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1398–1406.
- Huang, Q.; Zhou, K.; You, S.; Neumann, U. Learning to Prune Filters in Convolutional Neural Networks. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 709–718.
- Yamamoto, K.; Maeno, K. PCAS: Pruning Channels with Attention Statistics for Deep Network Compression. arXiv **2019**, arXiv:1806.05382.
- Hacene, G.B.; Lassance, C.; Gripon, V.; Courbariaux, M.; Bengio, Y. Attention based pruning for shift networks. arXiv
**2019**, arXiv:1905.12300. [Google Scholar] - Murray, W.; Ng, K.M. Algorithm for nonlinear optimization problems with binary variables. Comput. Optim. Appl.
**2010**, 47, 257–288. [Google Scholar] [CrossRef] - Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8026–8037. [Google Scholar]
- Liu, Z.; Xu, J.; Peng, X.; Xiong, R. Frequency-domain dynamic pruning for convolutional neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 1051–1061. [Google Scholar]
- Zhu, M.; Gupta, S. To prune, or not to prune: Exploring the efficacy of pruning for model compression. arXiv
**2017**, arXiv:1710.01878. [Google Scholar] - Molchanov, P.; Mallya, A.; Tyree, S.; Frosio, I.; Kautz, J. Importance estimation for neural network pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11264–11272. [Google Scholar]
- Ye, J.; Lu, X.; Lin, Z.; Wang, J.Z. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. arXiv
**2018**, arXiv:1802.00124. [Google Scholar] - Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. Master’s Thesis, University of Toronto, Toronto, ON, Canada, 2009. [Google Scholar]
- Ma, X.; Lin, S.; Ye, S.; He, Z.; Zhang, L.; Yuan, G.; Tan, S.H.; Li, Z.; Fan, D.; Qian, X.; et al. Non-Structured DNN Weight Pruning–Is It Beneficial in Any Platform? IEEE Trans. Neural Netw. Learn. Syst.
**2021**, 1–15. [Google Scholar] [CrossRef] [PubMed] - Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv
**2016**, arXiv:1609.02907. [Google Scholar] - Sen, P.; Namata, G.; Bilgic, M.; Getoor, L.; Galligher, B.; Eliassi-Rad, T. Collective classification in network data. AI Mag.
**2008**, 29, 93. [Google Scholar] [CrossRef] [Green Version] - Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv
**2014**, arXiv:1412.6980. [Google Scholar]

**Figure 1.** To prune deep neural networks continuously during training, we apply distinct types of weight decay (penalty p on the y-axis) depending on weight magnitude (weight value w on the x-axis). Weights whose magnitude exceeds a threshold t (defined according to the number of weights to prune) are penalized by a regular weight decay. Those beneath this threshold are targeted by a stronger weight decay whose intensity grows during training. This stronger weight decay, applied only to a subset of the network, is the Selective Weight Decay. This approach can be applied equally well to weights (unstructured case) or groups of weights (structured case).
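
The thresholded penalty described in the caption can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name and the flat-list interface are ours, and in the actual method the penalty is computed per tensor and the coefficient `a` is grown over the course of training.

```python
def swd_penalty(weights, prune_ratio, wd, a):
    """Selective Weight Decay penalty (illustrative sketch).

    Weights whose magnitude exceeds the threshold t receive the regular
    decay `wd`; the `prune_ratio` fraction of smallest-magnitude weights
    (those at or below t) receive the stronger decay `a * wd`.
    """
    mags = sorted(abs(w) for w in weights)
    k = max(1, int(prune_ratio * len(weights)))
    t = mags[k - 1]  # k-th smallest magnitude defines the threshold
    return sum((a * wd if abs(w) <= t else wd) * w * w for w in weights)
```

With `a = 1` this reduces to ordinary weight decay on all weights; as `a` grows, the below-threshold weights are pushed ever closer to zero, so pruning them at the end of training causes little discontinuity.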

**Table 1.** Quick comparison between SWD and multiple pruning methods, for different datasets and networks. All lines marked with an * are results obtained with our own implementations; all the others are extracted from the original papers.

| Method | Type | Dataset | Network | Comp. | Accuracy |
|---|---|---|---|---|---|
| Liu et al. [78] | Unstructured | ImageNet | AlexNet | ×22.6 | 56.82% (+0.24%) |
| Zhu et al. [79] | Unstructured | ImageNet | InceptionV3 | ×8 | 74.6% (−3.5%) |
| Zhu et al. [79] | Unstructured | ImageNet | MobileNet | ×10 | 61.8% (−8.8%) |
| Xiao et al. [64] | Unstructured | ImageNet | ResNet50 | ×2.2 | 74.50% (−0.40%) |
| SWD (ours) * | Unstructured | ImageNet | ResNet50 | ×2 | 75.0% (−0.7%) |
| SWD (ours) * | Unstructured | ImageNet | ResNet50 | ×10 | 73.1% (−1.8%) |
| SWD (ours) * | Unstructured | ImageNet | ResNet50 | ×40 | 67.8% (−7.1%) |
| Liu et al. [17] * | Structured | ImageNet | ResNet50 | ×2 | 63.6% (−12.1%) |
| Luo et al. [31] | Structured | ImageNet | ResNet50 | ×2.06 | 72.03% (−3.27%) |
| Luo et al. [31] | Structured | ImageNet | ResNet50 | ×2.95 | 68.17% (−7.13%) |
| Molchanov et al. [80] | Structured | ImageNet | ResNet50 | ×1.59 | 74.5% (−1.68%) |
| Molchanov et al. [80] | Structured | ImageNet | ResNet50 | ×2.86 | 71.69% (−4.49%) |
| SWD (ours) * | Structured | ImageNet | ResNet50 | ×1.33 | 74.7% (−1.0%) |
| SWD (ours) * | Structured | ImageNet | ResNet50 | ×2 | 73.9% (−1.8%) |
| Liu et al. [17] | Structured | CIFAR10 | DenseNet40 | ×2.87 | 94.35% (+0.46%) |
| Liu et al. [17] | Structured | CIFAR10 | ResNet164 | ×1.54 | 94.73% (+0.15%) |
| Ye et al. [81] | Structured | CIFAR10 | ResNet20-16 | ×1.6 | 90.9% (−1.1%) |
| Ye et al. [81] | Structured | CIFAR10 | ResNet20-16 | ×3.1 | 88.8% (−3.2%) |
| SWD (ours) * | Structured | CIFAR10 | ResNet20-16 | ×1.42 | 91.22% (−1.15%) |
| SWD (ours) * | Structured | CIFAR10 | ResNet20-16 | ×3.33 | 88.93% (−3.44%) |
| Liu et al. [17] * | Structured | CIFAR10 | ResNet20-64 | ×2 | 94.92% (−0.75%) |
| SWD (ours) * | Structured | CIFAR10 | ResNet20-64 | ×2 | 94.96% (−0.71%) |
| SWD (ours) * | Structured | CIFAR10 | ResNet20-64 | ×50 | 89.07% (−6.5%) |

**Table 2.** Results with ResNet-50 on ImageNet ILSVRC2012, with unstructured and structured pruning for different rates of remaining parameters. SWD outperforms the reference method (or its counterpart with additional LR-Rewinding) in both cases. All values are in %. The best performance for each target is indicated in bold.

**Unstructured pruning**

| Target (%) | Han et al. [15] Top-1 | Han et al. [15] Top-5 | +LRR [61] Top-1 | +LRR [61] Top-5 | SWD (ours) Top-1 | SWD (ours) Top-5 |
|---|---|---|---|---|---|---|
| 50 | 74.9 | **92.2** | 58.4 | 82.1 | **75.0** | **92.2** |
| 10 | 71.1 | 90.5 | 54.6 | 79.6 | **73.1** | **91.3** |
| 2.5 | 47.2 | 73.2 | 34.8 | 61.54 | **67.8** | **88.4** |

**Structured pruning**

| Target (%) | Liu et al. [17] Top-1 | Liu et al. [17] Top-5 | +LRR [61] Top-1 | +LRR [61] Top-5 | SWD (ours) Top-1 | SWD (ours) Top-5 |
|---|---|---|---|---|---|---|
| 90 | **74.7** | **92.2** | 56.1 | 80.7 | 74.2 | 91.9 |
| 75 | 73.4 | **91.6** | 51.1 | 77.1 | **73.5** | 91.5 |
| 50 | 63.6 | 85.7 | 40.0 | 66.2 | **69.0** | **88.8** |
| 20 | 0.1 | 0.5 | 0.1 | 0.5 | **69.0** | **88.7** |
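
Note that Table 1 reports compression ratios while Table 2 reports targets as the percentage of remaining parameters; the two views are reciprocals. A one-line helper (name ours, for illustration) makes the mapping explicit:

```python
def compression_to_remaining(comp):
    """A ×comp compression ratio corresponds to keeping 100/comp % of parameters."""
    return 100.0 / comp

# Table 1's unstructured SWD rows (×2, ×10, ×40) map onto
# Table 2's targets (50%, 10%, 2.5% of remaining parameters).
assert compression_to_remaining(2) == 50.0
assert compression_to_remaining(10) == 10.0
assert compression_to_remaining(40) == 2.5
```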

**Table 10.** Increase in training duration, in seconds per epoch, for two different networks and datasets. Results on ImageNet ILSVRC2012 are averaged over 5 epochs; those on CIFAR-10 are averaged over 50 epochs. Each epoch includes both training and testing.

**ResNet-50 on ImageNet on 3 NVIDIA Quadro K6000**

| SWD type | None | Unstructured | Structured |
|---|---|---|---|
| Seconds per epoch | 2936 | 4352 | 4143 |
| Increase (%) | 0 | 48 | 41 |

**ResNet-20 on CIFAR-10 on 1 NVIDIA GeForce RTX 2070**

| SWD type | None | Unstructured | Structured |
|---|---|---|---|
| Seconds per epoch | 55 | 77 | 78 |
| Increase (%) | 0 | 40 | 41 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Tessier, H.; Gripon, V.; Léonardon, M.; Arzel, M.; Hannagan, T.; Bertrand, D.
Rethinking Weight Decay for Efficient Neural Network Pruning. *J. Imaging* **2022**, *8*, 64.
https://doi.org/10.3390/jimaging8030064
