Rewarded meta-pruning: Meta Learning with Rewards for Channel Pruning

Convolutional Neural Networks (CNNs) have a large number of parameters and require significant hardware resources to compute, so edge devices struggle to run large modern networks. This paper proposes a novel method to reduce the parameters and FLOPs of deep learning models for computational efficiency. We introduce accuracy and efficiency coefficients to control the trade-off between the accuracy of the network and its computing efficiency. The proposed Rewarded meta-pruning algorithm trains a network to generate weights for a pruned model chosen based on the approximate parameters of the final model, controlling the interactions through a reward function. The reward function allows more control over the metrics of the final pruned model. Extensive experiments demonstrate the superior performance of the proposed method over state-of-the-art methods in pruning ResNet-50, MobileNetV1, and MobileNetV2 networks.


Introduction
Convolutional Neural Networks (CNNs) have been shown to achieve state-of-the-art results in various computer vision tasks [23, 26, 29-31]. However, training the parameters of a CNN requires a significant amount of labeled data, and substantial hardware resources are needed to process such large training sets. Recently, network pruning has become an important topic as a way to simplify and accelerate large CNNs [24, 30, 52, 58].
Many issues are at stake when pruning networks, such as structure [32], continuity [42], or scalability [41]. There are primarily two ways to compress neural networks: weight pruning [3, 12, 50] and channel pruning [8, 10, 38]. Reducing parameters by pruning connections is the most intuitive way to prune a network. Weight pruning identifies low-performing weights to be pruned [14]; this involves simply removing weights with small magnitudes, which is easy to implement [15]. However, most frameworks cannot accelerate computation on sparse matrices. Achieving real compression and speedup requires specifically designed software [24] or hardware [13] to handle the sparsity; otherwise, the actual cost is unchanged no matter how many weights are pruned.
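Magnitude-based weight pruning as described above can be sketched in a few lines. This is a minimal illustration, not the paper's method; the function name and the flat-list weight representation are ours:

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    weights: flat list of floats; sparsity: fraction in [0, 1] to remove.
    Note: the resulting zeros give no real speedup on dense hardware,
    which is exactly the drawback discussed above.
    """
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Threshold at the n_prune-th smallest magnitude.
    threshold = sorted(abs(w) for w in weights)[n_prune - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

For example, pruning `[0.5, -0.1, 0.3, -0.05]` at 50% sparsity zeroes the two smallest-magnitude weights while the matrix shape, and hence the dense computation cost, stays unchanged.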
Therefore, channel pruning, which removes whole filters rather than merely setting weight values to zero, is often preferred [27, 39]. Removing whole filters yields a model with structured sparsity [20]. With structured sparsity, the model can take full advantage of high-efficiency Basic Linear Algebra Subprograms (BLAS) libraries, making the pruned model more structured and achieving practical acceleration [13].
MetaPruning [40] is one such channel pruning approach that can accelerate CNNs. Its central idea is to generate weights for pruned structures instead of pruning weights or filters of the existing network. The accuracy of the untrained models is computed to rank each Network Encoding Vector (NEV). Evolutionary algorithms, motivated by the processes of natural evolution [28], are used to find the NEV that produces the model of highest accuracy. However, MetaPruning can only choose the best accuracy within a preset range of FLOPs: the algorithm finds the highest accuracy within that range instead of seeking a proper balance between sacrificing accuracy and reducing FLOPs. This paper addresses this issue. In the proposed Rewarded meta-pruning, instead of finding NEVs that produce the highest accuracy, the model balances accuracy against the FLOPs of the network to find the highest accuracy possible for the given FLOPs. In MetaPruning, the reward is the accuracy itself, so it is directly proportional to accuracy; the increase in reward across subsequent mutations is therefore lower than in Rewarded meta-pruning, where the reward is proportional to the square of the accuracy. At the same time, Rewarded meta-pruning controls the FLOPs of the final model by computing a score that takes both accuracy and FLOPs into account to find models with high accuracy and low FLOPs. This score, the reward, can be further tweaked to include various parameters, controlling both the metrics of the pruned model and how those parameters interact with each other.
Our contribution is threefold:
• We propose a channel pruning method, Rewarded meta-pruning, that can learn how to assign weights to pruned networks.
• We explore the importance of reward functions and the characteristics that define an effective reward function.
• We experimentally show the superiority of the proposed pruning method on publicly available pretrained CNNs: ResNet-50, MobileNetV1, and MobileNetV2.

Related Work
The Lottery ticket hypothesis states that a randomly initialized dense neural network contains a subnetwork which, when trained in isolation, can yield results as good as or even superior to the original network [12]. In other words, a standard pruned network can reach the same, if not higher, accuracy than the original network. There are several methods to find the right tickets.
Unstructured network pruning: Various random pruning methods [3, 12, 50] prune parameters based on various factors of the weights. [60] uses the L1 and L2 norms of each weight to compute their importance; the final pruned model is generated by pruning the less important weights. [19] computes the geometric median of the weights, while [46] uses a Taylor series expansion to estimate weight importance. Other weight pruning methods such as [44] and [35] use KL-divergence importance and empirical sensitivity of the weights, respectively, to prune them.
Structured network pruning: In other approaches, AutoPruner [43] integrates filter selection into model training so that the finetuned network can select unimportant filters automatically. Sparse Structure Selection (SSS) [25] introduces a new parameter, the scaling factor, which scales the output of specific structures; sparsity regularization on these scaling factors pushes them to zero during training. Discrimination-aware channel pruning (DCP) [62] can find channels with true discriminative power and updates the model by pruning stage-wise using discrimination-aware losses. Adaptive DCP [38] introduces an additional discrimination-aware loss using the p-th loss, plus additional losses such as the additive angular margin loss [8]. AutoML for Model Compression (AMC) [18] leverages reinforcement learning to automatically sample the design space and improve model compression quality. Simpler methods like HRank [36] determine the rank of the feature maps generated by filters to rank the filters and their effect on final accuracy; this, however, takes more epochs to train after pruning. [61] leverages the Lottery ticket hypothesis to greedily search through a network, finding subnetworks with lower loss than networks trained with gradient descent. Pruning algorithms inspired by Hebbian theory, like Fire Together Wire Together (FTWT) [10], prune filters based on the binary mask of each layer and the activation of the previous layer.
Meta-learning: Meta-learning is the learning of algorithms from other learning algorithms [11, 40, 53]. Fundamentally, there are three paradigms [54] in meta-learning: meta-optimizer, meta-representation, and meta-objective. The meta-optimizer is the optimizer used to learn the optimization in the outer loop of meta-learning [21]. Meta-representation aims to learn and update the meta-knowledge [11]. Lastly, the meta-objective is the final task achieved after completing training [34].
Learning to prune filters: Reinforcement Learning (RL) algorithms have been used to generate network architecture descriptions using Recurrent Neural Networks trained via the policy gradient method [63]. The same has also been implemented using Q-learning [49]. The try-and-learn algorithm [24] uses RL to compute a reward for each filter, and these rewards are then used to rank filters. It aggressively prunes filters in the baseline network while maintaining performance at a desired level. The model computes a reward as the product of an accuracy term and an efficiency term, and then uses REINFORCE [55] to estimate the gradients. The gradients are used to compute the loss and train the network, which learns to prune filters. The reward function makes it possible to control the trade-off between network performance and scale without human intervention. The try-and-learn algorithm automatically discovers redundant filters and removes them by repeating the process for every layer.
Neural architecture search: Many methods have been proposed to search for optimal network structures among possible neural architectures [51, 56, 63]. There are primarily five approaches to searching for an optimized network: reinforcement learning [1, 63], genetic algorithms [47, 57], gradient-based approaches [56], parameter sharing [5], and weight prediction [4]. [63] uses RL to optimize networks generated from model descriptions given by a recurrent network. MetaQNN [1] uses RL with a greedy exploration strategy to generate high-performing CNN architectures. Genetic algorithms solve search and optimization problems using bio-inspired operators [57]; [47] uses genetic algorithms to discover neural network architectures, minimizing the role of humans in the design. Gradient-based learning allows a network to efficiently optimize new instances of a task [51]. FBNets (Facebook-Berkeley-Nets) [56] use a gradient-based method to optimize CNNs created by a differentiable neural architecture search framework.

Rewarded meta-pruning
The Rewarded meta-pruning algorithm uses a reward coefficient to control the trade-off between the accuracy and efficiency of each model, instead of finding the model with the highest accuracy within a preset range of FLOPs. The goal is to maximize the reward, which increases with accuracy and decreases with the FLOPs of the model. The method is implemented in three phases: training, searching, and retraining, as shown in Algorithm 1.

Training
Most popular CNNs [16, 22, 48] mainly use three types of layers: convolution blocks, bottlenecks, and linear layers. The channel scale represents the size of each layer. For the initial convolution block and each type of bottleneck, the channel sizes range from 10% to 100% of the size of the original architecture, equally distributed over 31 initial widths. Once a model is created, it is trained on each batch of the training data with these initial weights, and the cross-entropy loss is computed to update the weights. Each model is essentially a slice of a complete model, where the slice is defined by the NEV. Validation data is not used to update the model, only to measure progress.
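The "slice of a complete model" idea can be illustrated with plain nested lists standing in for weight matrices (out_channels × in_channels). This is a sketch under our own assumptions: the function names are illustrative, and the round-down width rule is not specified by the paper:

```python
def slice_layer(full_weight, in_keep, out_keep):
    # Keep only the leading output and input channels of a full-width layer.
    return [row[:in_keep] for row in full_weight[:out_keep]]

def build_sliced_model(full_weights, nev, base_widths):
    # nev holds one channel scale per layer; widths round down, minimum 1.
    widths = [max(1, int(s * w)) for s, w in zip(nev, base_widths)]
    layers = []
    prev = len(full_weights[0][0])  # the input dimension stays fixed
    for full_w, out_w in zip(full_weights, widths):
        layers.append(slice_layer(full_w, prev, out_w))
        prev = out_w  # the next layer consumes only the kept channels
    return layers
```

Because every sliced model shares the leading entries of the same full-width weights, training one NEV's slice also updates weights that other NEVs will reuse.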

Searching
The models created from the NEV candidates use the weights of the trained model to create pruned models. Each NEV is thus converted to a pruned model, and a reward is calculated for each of them. Evolutionary search [45] is then used to find the best NEV, thereby finding the optimal pruned model. The initial weights are the trained weights, but the final model will be trained from scratch to remove bias from the pre-trained model.

Creating genes
Random candidates are generated to seed the evolutionary search. Each gene is created as a list of sizes corresponding to the model, with values representing the weights from the dictionary. However, since the metrics of the models created by these NEVs are also random, hyperparameters are used to constrain the final model. A gene is considered valid if the FLOPs of the model it creates lie between min_FLOPs and max_FLOPs. The FLOPs of each gene are stored as the last element of the NEV to reduce overall computation time; this value is later replaced by the reward.
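Seeding can be sketched as rejection sampling against the FLOPs budget. The toy per-layer FLOPs model and all names here are our assumptions; a real estimator would account for input × output channel counts:

```python
import random

# 31 candidate channel scales from 10% to 100%, as in the Training section.
SCALES = [round(0.10 + 0.03 * i, 2) for i in range(31)]

def flops_of(gene, base_layer_flops):
    # Toy FLOPs estimate: each layer's cost scales with its channel fraction.
    return sum(s * f for s, f in zip(gene, base_layer_flops))

def random_gene(n_layers, base_layer_flops, min_flops, max_flops, rng):
    # Rejection-sample until the gene's model lands inside the FLOPs budget;
    # the estimated FLOPs are appended as the gene's last element, as in
    # the text, to avoid recomputing them during ranking.
    while True:
        gene = [rng.choice(SCALES) for _ in range(n_layers)]
        flops = flops_of(gene, base_layer_flops)
        if min_flops <= flops <= max_flops:
            return gene + [flops]
```

Storing the FLOPs in the gene itself means the validity check and later ranking never re-evaluate the model's cost.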

Reward and selection of NEVs
The candidates are ranked after each epoch according to the reward, computed as the product of the accuracy and efficiency coefficients given by Equations (2) and (3) for each NEV, as shown in Figure 2. The reward is computed as

    R(G_i) = α(G_i) × ψ(G_i),    (1)

where the accuracy and efficiency coefficients, denoted by α and ψ, are defined as

    α(G_i) = 1 / (b_a − A(G_i))²,    (2)
    ψ(G_i) = (b_f − F(G_i)) / b_f,    (3)

where G_i denotes a gene of index i from all candidate genes G, and A and F are functions that return the accuracy and FLOPs of the model created from the gene passed into them. The accuracy coefficient α increases sharply with model accuracy, and as the accuracy approaches the base accuracy, its value tends to infinity. Since the model is not fine-tuned, the accuracy does not get close enough to the accuracy of the base model, b_a, for the reward to reach such levels. If knowledge distillation were used to increase the accuracy of the new model, the base accuracy would be the accuracy of the new model, thereby eliminating any negative effect from the symmetric nature of Equation (2). On the other hand, the efficiency coefficient ψ decreases linearly with increasing FLOPs, bounded by the FLOPs of the original model, b_f. Since prune rates are inversely proportional to FLOPs, a higher efficiency coefficient corresponds to a higher prune rate for the most part. The reward is thus directly proportional to accuracy and prune rate: the accuracy coefficient drives the reward up but is moderated by the efficiency coefficient. This creates a balance between them, so that high accuracy is not achieved at the cost of low prune rates in the final model. Once a reward is computed, it is stored as the last element of the NEV and later used to rank the NEVs. The top 50 NEVs from every epoch are stored, and the 10 best among them are mutated and crossed over to produce the candidate genes for the next epoch.
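The ranking step can be sketched as follows. The exact functional forms of α and ψ are assumptions chosen to match the behavior described above (α diverges as the accuracy approaches b_a; ψ falls linearly with FLOPs and vanishes for an unpruned model):

```python
def accuracy_coefficient(acc, base_acc):
    # Grows rapidly as acc approaches base_acc and diverges at acc == base_acc.
    # The inverse-squared gap is an assumed form, not taken from the paper.
    return 1.0 / (base_acc - acc) ** 2

def efficiency_coefficient(flops, base_flops):
    # Decreases linearly with FLOPs; zero for an unpruned model.
    return (base_flops - flops) / base_flops

def reward(acc, flops, base_acc, base_flops):
    # The reward is the product of the two coefficients.
    return accuracy_coefficient(acc, base_acc) * efficiency_coefficient(flops, base_flops)
```

Sorting candidate (accuracy, FLOPs) pairs by this reward prefers high accuracy at low FLOPs, rather than accuracy alone as in MetaPruning.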

Mutation and crossover
The best candidates from each epoch are mutated and crossed over to create candidates for the next epoch. Mutation changes a few elements in a gene to create a new gene: each element has a 10% chance of being changed to a random valid element. Crossover combines two random genes to create a new gene: for each index, an element is randomly picked from one of the two chosen genes. Channel configurations in a local region of the configuration space tend to have similar metrics [33], so the new candidates also have similar accuracy and FLOPs. As a result, the reward of at least the best candidate tends not to decrease. If the reward has not increased for many epochs, the genetic search may be stuck in a local minimum. However, since evolutionary search is a high-dimensional non-convex search, critical points with errors much larger than that of the global minimum are likely to be saddle points [6]. In other words, a found local minimum is likely to be close enough to the global minimum, so the search can be terminated. Mutating and crossing over more genes might discover better candidates, but increasing the mutation and crossover rates would affect the integrity of the evolutionary search. If the two evolutionary operators cannot create enough new candidates, the remainder are created as random genes.
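The two evolutionary operators follow directly from the description above (10% per-element mutation, per-index crossover). This is a minimal sketch; SCALES stands in for the set of valid channel fractions assumed in the Training section:

```python
import random

SCALES = [round(0.10 + 0.03 * i, 2) for i in range(31)]  # valid channel scales

def mutate(gene, rng, p=0.10):
    # Each element has probability p of being replaced by a random valid value.
    return [rng.choice(SCALES) if rng.random() < p else g for g in gene]

def crossover(parent_a, parent_b, rng):
    # For each index, pick the element from one of the two parents at random.
    return [rng.choice((a, b)) for a, b in zip(parent_a, parent_b)]
```

Because both operators only recombine or lightly perturb valid values, offspring stay in the same local region of the configuration space, which is why their metrics stay close to their parents'.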

Retraining
Once the evolutionary search has been completed, the best gene is selected as the first gene in the list of candidates. This is the gene with the highest reward found after multiple epochs of genetic search. The best NEV is converted to a model and trained from scratch. Most pruning algorithms train a pruned model for a few epochs to regain the lost accuracy, a process called fine-tuning [15]. In the Rewarded meta-pruning algorithm, a model is created from an NEV rather than using the NEV to prune an existing model, and is then trained from scratch. Hence the accuracy saturates at a later epoch during finetuning.

Experimental Results
In this section, we demonstrate the superiority of the Rewarded meta-pruning method. We first describe the experimental settings needed to reproduce the experiments. Then we compare the results with those of other methods on three major networks. Lastly, we perform an ablation study to understand the effectiveness of the proposed method.

Experimental setting
ResNet-50 [16] is trained for 32 epochs, while MobileNetV1 [22] and MobileNetV2 [48] are trained for 64 epochs. The experiments are conducted on three commonly used networks: ResNet-50, MobileNetV1, and MobileNetV2. The networks are trained from scratch using ImageNet [7] as described in the previous section. ImageNet consists of 1.2M training images and 50K validation images. It also includes 100K test images, but since the labels of the test data are not released, the validation data is used for testing. The validation data has not been used in any part of the training process except to compute the accuracy at each stage. As a natural consequence of using evolutionary computation, Rewarded meta-pruning is resource-heavy when computing the pruned networks, but this cost is balanced by the efficiency and accuracy of the models generated. A network need not be created each time it is used, because the knowledge distilled from it can be transferred and used in varying contexts.

Evaluation protocol
Four metrics are used to evaluate the pruning algorithms: parameter ratio, top-1 and top-5 errors, and FLOPs. The parameter ratio is the ratio of the size of the pruned model to that of the baseline model, computed as P_m / P_b, where P_m and P_b are the numbers of weights in the pruned and baseline models, respectively. Accuracy is the percentage of validation data identified correctly out of the whole dataset. Top-1 error is the complement of accuracy, i.e., the proportion of images for which the predicted label with the highest probability is wrong. Top-5 error is the proportion of images for which the correct label is not among the five most probable predicted labels. FLOPs measures the number of floating-point operations required by the model.
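These metrics can be computed as follows; this is an illustrative sketch, and the function names are ours:

```python
def parameter_ratio(pruned_params, baseline_params):
    # Fraction of weights the pruned model retains relative to the baseline.
    return pruned_params / baseline_params

def topk_error(prob_rows, labels, k):
    # Proportion of images whose true label is absent from the k most
    # probable predicted classes (k=1 gives top-1 error, k=5 top-5 error).
    wrong = 0
    for probs, label in zip(prob_rows, labels):
        topk = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)[:k]
        if label not in topk:
            wrong += 1
    return wrong / len(labels)
```

Note that top-5 error is never larger than top-1 error, since the top-1 class is always contained in the top-5 set.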

Performance on ResNet-50
ResNet-50 is a CNN with a depth of 50 layers. It was created to solve the degradation that occurs as deeper layers are stacked [16]. ResNet uses skip connections for identity mapping, which adds the features to their original inputs before passing them into the next layer. Identity mapping followed by linear projection is used to expand the channels of the features so they can be added to the original inputs.

Performance on MobileNetV2
MobileNetV2 contains depth-wise and point-wise convolutions. It has an inverted residual with a linear bottleneck, which takes a low-dimensional compressed representation as input, expands it to a higher dimension, and then filters it with lightweight depthwise convolutions, as in MobileNetV1 [48]. MobileNetV2 is an efficient network with a relatively low error. Thus, demonstration on MobileNetV2 is an effective way to show the performance of a pruning algorithm.

Performance on MobileNetV1
MobileNetV1 has a streamlined architecture that builds lightweight deep networks using depth-wise separable convolutions. All layers use Batch Normalization and ReLU, except the fully connected layer, which is followed by a softmax layer for classification [22].

Method                     Top-1 Error   Top-5 Error   FLOPs
Baseline [22]              29.40%        -             569M
0.75 MobileNet-224 [22]    31.60%        -             325M
FTWT (r=1.0) [10]          30.34%        -             335M
MetaPruning [40]           29.10%        -             324M
Rewarded meta-pruning      29.60%        9.65%         295M

In Table 3, we compare the Rewarded meta-pruning method with other competing pruning techniques for pruning a MobileNetV1 model. It almost regains the accuracy of the baseline network [22], with 0.2% lower accuracy and 48.15% lower FLOPs. The method clearly achieves superior results compared to the Fire Together Wire Together [10] pruning method, with a 0.74% lower error and 7.03% lower FLOPs. It outperforms 0.75 MobileNet-224 [22], which is MobileNetV1 with 25% lower width, by 2% while using 5.27% lower FLOPs. While MetaPruning [40] shows a lower error than even the baseline network, our method has lower FLOPs; the network pruned by MetaPruning is still 9.83% larger than that of the proposed method. This could be due to the lack of shortcut connections in MobileNetV1, despite it being a smaller network, leading to a large number of fully connected layers. In terms of performance achieved per unit of resources, Rewarded meta-pruning edges out MetaPruning. It is fair to assume that the Rewarded meta-pruning method could achieve better results with more advanced reward functions; this will be validated by further research on the robustness of various hyperparameters.

Discussion
From the results, it is clear that the proposed method performs best under the right reward functions.The reward function in this case is directly proportional to the accuracy and inversely proportional to FLOPs.
As the accuracy of the pruned model approaches the baseline accuracy, the reward increases sharply. In MetaPruning [40], the reward is directly proportional to accuracy, whereas in this method the reward is proportional to the square of the accuracy of the model. This by itself would not increase the accuracy of the final model, but chasing a higher reward that depends only on accuracy could lead to a final model with high FLOPs. This can be seen in MetaPruning, where the pruned model tends to show FLOPs as high as the preset maximum allows. In our method, the reward is moderated by the efficiency coefficient to prevent this.
The reward decreases with increasing FLOPs because the efficiency coefficient is inversely related to the FLOPs. However, if this dependence were too strong, the reward would be throttled; hence it is controlled by the linearly decreasing efficiency coefficient. From the definition of the reward function in Equation (1), it is clear that high accuracy alone is not enough for a model to be selected. As accuracy increases, the probability of the model being selected increases, but only as long as the FLOPs of the model are also low enough; this can be inferred from how the green distribution shifts to the right in Figure 3. Likewise, the reward increases as the FLOPs decrease, but a model is not deemed acceptable until its accuracy is high enough, as the red distribution in Figure 3 shows with decreasing FLOPs. Thus, by definition, the Rewarded meta-pruning method leans more toward accuracy.
Compared to MetaPruning, the reward of the Rewarded meta-pruning method increases faster with each iteration of searching, because the reward of MetaPruning is directly proportional to accuracy while the reward of Rewarded meta-pruning is proportional to the square of the accuracy. This can be observed in Figure 4, which shows the rewards, accuracies, and FLOPs of the best model after each iteration of searching. The accuracy and FLOPs at the beginning are chosen randomly, and this cannot be changed without tampering with the fundamentals of evolutionary search. It can be observed that the FLOPs of the best model in MetaPruning tend to increase for the most part, whereas in the case of the Rewarded meta-pruning method they tend to stay low. Accuracy increases in both cases, but MetaPruning saturates earlier than Rewarded meta-pruning.
It can also be inferred that if the two methods started with the same batch of randomly initialized models, the Rewarded meta-pruning method would reach a higher accuracy and lower FLOPs, because the slopes of the best-accuracy and best-FLOPs curves are respectively higher and lower than those of MetaPruning.
However, the way NEVs are chosen, both randomly and after mutation or crossover, means the FLOPs of the pruned models approximately form a bell curve between 1350 and 2100. This can be changed by controlling the generation of random filter sizes in the NEVs, as shown in Figure 5.
The robustness of the hyperparameters used by Rewarded meta-pruning has already been explored by He et al. [17]. As the reward function is tweaked to add different hyperparameters, various metrics of the final model can be controlled, and various other coefficients could be used in the reward. For instance, the prune rate of the pruned model could be made directly proportional to the reward, as higher prune rates automatically lead to lower FLOPs. FLOPs are also directly proportional to hardware latency, the runtime of networks [9], but latency depends on the hardware, and different hardware has different latencies. Other metrics such as energy consumption have also been used for pruning: NetAdapt [59] uses energy consumption as a metric to measure the complexity of the network at every stage and prunes the network further while maintaining accuracy.

Conclusion
In this work, we have presented the following: 1) a better reward function to meta-learn parameters for pruning, allowing finer control over various parameters of the pruned model; 2) evidence that the Rewarded meta-pruning method is superior to other state-of-the-art methods, with higher accuracy and lower FLOPs than traditional channel pruning methods; 3) a reward function that can be optimized using other metrics to maximize the reward; and 4) effective pruning of ResNet-50, MobileNetV1, and MobileNetV2.

Figure 2 .
Figure 2. Computing reward for network encoding vectors.
This work was supported by the National Research Foundation of Korea (NRF) grants funded by the Korean Government (MSIT) (No. 2021R1C1C1012590, No. NRF-2022R1A4A1023248) and the Information Technology Research Center (ITRC) support program supervised by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean Government (MSIT) (IITP-2022-2020-0-01808).

Figure 4 .
Figure 4. Rewards, Accuracy and FLOPs of the best models after each iteration of searching.
ResNet-50 and MobileNetV2 are retrained after searching for 400 epochs, but MobileNetV1 requires only 320 epochs. Searching for the NEVs takes 20 epochs for all the networks. Each epoch searches 50 NEVs, searching through 1000 unique NEVs throughout the run. MobileNetV1 and MobileNetV2 both use the Lambda scheduler with a decay γ of 0.1 every epoch from the initial learning rate.

Algorithm 1: Rewarded meta-pruning
Hyperparameters: max_training: number of training epochs; max_iter: number of searching epochs; max_tuning: number of finetuning epochs.
Input: dataset: training images that can be split into batches; r_i: random integer indexed at i; w_i: random weights indexed at i; ∇L: gradient of the loss of the given model.
for epoch = 1, ..., max_training do
  for each batch in dataset do
    nev = [r_1, r_2, ..., r_n]
    {w_1, w_2, ..., w_n} = norm(nev)
    x = FC({w_1, w_2, ..., w_n})
    L = f(x, batch)
    x += ∇L
  end for
end for
candidates = list of n random NEVs
for i = 1, ..., max_iter do
  rewards = [r_1, r_2, ..., r_n]
  for j = 1, ..., n do
    r_j = reward(nev_j)
  end for
end for

Table 2 .
Benchmarking state-of-the-art channel pruning methods with MobileNetV2

Table 3 .
Benchmarking state-of-the-art channel pruning methods with MobileNetV1