GradFreeBits: Gradient-Free Bit Allocation for Mixed-Precision Neural Networks

Quantized neural networks (QNNs) are among the main approaches for deploying deep neural networks on low-resource edge devices. Training QNNs using different levels of precision throughout the network (mixed-precision quantization) typically achieves superior trade-offs between performance and computational load. However, optimizing the precision levels of QNNs can be complicated, as the bit allocations are discrete and difficult to differentiate through. Moreover, adequately accounting for the dependencies between the bit allocations of different layers is not straightforward. To meet these challenges, in this work, we propose GradFreeBits: a novel joint optimization scheme for training mixed-precision QNNs, which alternates between gradient-based optimization for the weights and gradient-free optimization for the bit allocation. Our method achieves performance that is better than or on par with the current state-of-the-art low-precision classification networks on CIFAR10/100 and ImageNet, semantic segmentation networks on Cityscapes, and several graph neural network benchmarks. Furthermore, our approach can be extended to a variety of other applications involving neural networks used in conjunction with parameters that are difficult to optimize.


Introduction
Deep neural networks have been shown to be highly effective in solving many real-world problems. However, deep neural networks often require a large amount of computational resources for both training and inference [1][2][3]. This limits the adoption and spread of this technology in scenarios with low computational resources.
To mitigate this computational burden, recent efforts have focused on developing specialized hardware to support the computational demands [4], as well as model compression methods to reduce them [5]. These include various techniques such as pruning [6], knowledge distillation [7,8], neural architecture search (NAS) [9], and, as in this paper, quantization [10], which can naturally be combined with other approaches [11].
Quantization methods enable the computations performed by neural networks to be carried out with fixed-point operations rather than floating-point arithmetic [10,12,13]. This improves their computational efficiency and reduces their memory requirements. However, as with other compression methods, this typically comes at the cost of a reduced performance [5]. Recent efforts in the field have focused on improving the trade-offs between model compression and performance by proposing a plethora of quantization schemes tailored for different scenarios.
Quantization schemes can be divided into post-training and quantization-aware training schemes. Post-training schemes decouple model training from the quantization of its weights and/or activations and are most suitable when training data are unavailable at compression time [12,[14][15][16]. Quantization-aware training schemes perform both optimization tasks together and do require training data, which tends to provide better performance [17][18][19].

Problem Definition
In this paper, we focus on the problem of training quantized neural networks using mixed-precision quantization-aware training to improve the trade-offs between the performance and computational requirements of QNNs. The goal is to develop a training scheme that produces a fully trained QNN with optimal bit allocations per layer, according to the task and properties of the target edge devices. Furthermore, to ensure hardware compatibility, uniform quantization methods are preferred. Such methods divide the real-valued domain into equally sized bins, as in [18][19][20][21]. However, the proper allocation of bits between layers is combinatorial in nature and is hard to optimize. Furthermore, delicate interactions between weights and bit allocations must be considered to maximize the performance.

System Model
We propose a novel quantization-aware training procedure for uniform and mixed-precision QNNs, where a different number of bits is allocated per layer. Our training procedure alternates between gradient-based quantization-aware training of the weights and gradient-free optimization of the per-layer bit allocation for the weights and activations. Gradient-free algorithms are known to perform well in difficult scenarios with complex dependencies between variables while maintaining an excellent sample efficiency during optimization [22]. In particular, we use the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [23], a highly versatile evolutionary search algorithm, which iteratively updates the parameters of a multivariate normal distribution to improve the fitness of the samples drawn from it. To summarize our approach, the network weights are updated by a gradient-based method, while the bit allocation is updated using the gradient-free method CMA-ES (see Figure 1).

Our Contributions
The advantages of our approach are as follows:
• Our training scheme for mixed-precision QNNs optimizes the network as a whole. That is, it considers the dependencies between the layers of the neural network and the dependencies between the weights and their bit allocation.
• Our approach for optimizing the bit allocation is gradient free and can thus handle multiple, possibly non-differentiable, hardware constraints. This enables tailoring QNNs to the resources of specific edge devices.
• We propose a bit-dependent parameterization of the quantization clipping parameters that allows for a better performance evaluation when sampling the network with a varying bit allocation.
• The systematic combination of gradient-based and gradient-free optimization algorithms can be utilized in other applications and scenarios, e.g., a search over other network hyperparameters.
We demonstrate the performance of our method on popular tasks such as image classification and semantic segmentation, and also for graph node classification. For all test cases, our method achieves a better or on par performance with the current state-of-the-art low-precision methods and, in particular, yields a comparable accuracy for a lower model size when compared to a fixed-precision setting.

Figure 1.
Our proposed training scheme: iterative optimization of the model weights and bit allocation. Given a fixed bit allocation (right), the weights are optimized using a gradient-based quantization-aware training procedure. Then, the weights are fixed (left) and the bit allocation is optimized for those weights using the CMA-ES [23] gradient-free optimization algorithm, and the training process is repeated in an iterative manner.

Fixed-Precision Methods
Most uniform per-layer quantization methods rely on learned quantization parameters, such as the scaling parameters applied to the numbers before rounding. Quantization-aware training with fixed clipping parameters was initially proposed in [17], while the works [18,19,21] suggested ways to learn the clipping parameters. Further advances include weight normalization before quantization [20,24], a scale adjustment of the activations [19], soft quantization [25], and coarse gradient correction [26]. Recent efforts have focused on non-uniform quantization methods [27,28], which use lookup tables, making them difficult to deploy efficiently on existing hardware. However, all the methods mentioned above use fixed-precision quantization (with the same bit allocation in all layers), which does not take into account the specific computational requirements and sensitivity to quantization noise that different layers may have.

Mixed-Precision Methods
Recent efforts to tackle the mixed-precision quantization problem have included the use of reinforcement learning [29,30], a Hessian analysis [31,32], quantizer parametrization [33], and differentiable NAS approaches [34][35][36]. Among these methods, only the NAS approaches account for the dependencies between the bit allocations in the different layers by forming a super network that includes multiple branches for each precision at each layer. The NAS approaches, however, are often more expensive to train due to the multiple network branches which are used. Furthermore, they typically restrict their search spaces to a subset of bit allocations [35,36], which may harm the trade-offs between the performance and computational requirements.

Joint Search Methods
A recent trend explores the joint search of mixed precision and architecture design to produce high-performance networks with low resource requirements. Such approaches, however, can be computationally expensive and are often tailored to highly specific architectures [15,[37][38][39]. In contrast, joint mixed-precision and pruning methods are typically not architecture-specific and can be applied to multiple architectures [40,41]. Though such methods reduce the number of operations performed, sparse operations and/or structured pruning are required to achieve measurable reductions in the computational cost.

Quantization-Aware Training
In the uniform quantization scheme we consider, the real values of the weights are clipped to [−α, α] and the activations to [0, α]. As in [24], these ranges are mapped to the target integer ranges [−2^(b−1) + 1, 2^(b−1) − 1] for the weights and [0, 2^b − 1] for the activations, where b is the number of bits. In this scheme, the "clipping parameters" α are trainable and typically take on different values in different layers. Furthermore, to define the point-wise quantization operations used in our quantization-aware training scheme, we use two utility operations: the round(z) operation rounds all values in z to the nearest integer, and the clip(z, a, b) operation replaces all values z ≤ a with a and all values z ≥ b with b. These are used in our point-wise quantization operations:

W_b = round(clip(W/α_W, −1, 1) · (2^(b−1) − 1)),   (1)
X_b = round(clip(X/α_X, 0, 1) · (2^b − 1)),   (2)

where b is the number of bits used during quantization, W, W_b are the real-valued and quantized weight tensors, X, X_b are the real-valued and quantized input tensors, and α_W, α_X are their associated scale (or clipping) parameters, respectively. Equations (1) and (2) are used during training only, where the weights and clipping parameters are optimized using the Straight-Through Estimator (STE) approach [24,42]. During inference, the weights and activations are quantized, and all operations are performed using integers in mixed precision while taking the clipping parameters into account. To improve stability, [20,24] also apply weight normalization before quantization: Ŵ = (W − µ)/(σ + ε), where µ and σ are the mean and standard deviation of W, respectively, and ε = 10^(−6).
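As a rough illustration of this scheme, the following scalar sketch mimics the clip-scale-round quantizer and the weight normalization step. It is a minimal sketch in plain Python; the function names are ours, and the straight-through estimator is only described in a comment, since it requires an autograd framework such as PyTorch.

```python
import math

def quantize(x, alpha, bits, signed=True):
    """Forward pass of the uniform fake-quantizer sketched above.
    (During training, round() would be wrapped with a straight-through
    estimator so gradients flow through it as an identity.)"""
    n = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    lo = -1.0 if signed else 0.0
    xc = min(max(x / alpha, lo), 1.0)     # clip to [-alpha, alpha] or [0, alpha], rescaled
    return alpha * round(xc * n) / n      # scale to the integer grid, round, rescale

def normalize(w, mean, std, eps=1e-6):
    """Weight normalization applied before quantization, as in [20,24]."""
    return (w - mean) / (std + eps)
```

In a real implementation, these operations are applied element-wise to weight and activation tensors, with a separate trainable alpha per layer.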

CMA-ES
Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [23] is a population-based gradient-free optimization algorithm. It is known to be highly versatile and has been applied to a large variety of settings, such as reinforcement learning [43], the placement of wave energy converters [44], hyperparameter optimization [45], and more. It is designed to work in d-dimensional continuous spaces and to optimize discontinuous, ill-conditioned, and non-separable objective functions in a black-box optimization setting [46].
At a high level, at the g-th generation, CMA-ES draws λ d-dimensional samples from a multivariate normal distribution N(m^(g), C^(g)):

x_k^(g+1) ∼ m^(g) + σ^(g) N(0, C^(g)),  for k = 1, . . . , λ,

where m^(g), C^(g), σ^(g) are the mean, covariance matrix, and step-size of the current generation, respectively. Then, keeping only the top µ samples with the best objective values, the m^(g+1), C^(g+1), and σ^(g+1) of the next generation are calculated using a set of update rules [46]. This process is repeated until one of several stopping criteria is fulfilled. More details about CMA-ES are given in Appendix A.
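To convey the flavor of this sampling-and-selection loop, here is a toy evolution strategy minimizing a shifted sphere function. This is a deliberately simplified stand-in: real CMA-ES adapts the full covariance matrix and step-size with dedicated update rules [46], which we replace here with an isotropic Gaussian and a fixed geometric step-size decay.

```python
import random

def simple_es(objective, dim, generations=40, lam=12, mu=4, sigma=0.4, seed=0):
    """Toy (mu, lambda) evolution strategy: sample around a mean, keep the
    best mu samples, recompute the mean. Full CMA-ES additionally adapts
    the covariance matrix C and step-size sigma; here sigma just decays."""
    rng = random.Random(seed)
    m = [0.0] * dim
    for _ in range(generations):
        pop = [[mi + sigma * rng.gauss(0, 1) for mi in m] for _ in range(lam)]
        pop.sort(key=objective)                 # lowest objective first
        elite = pop[:mu]
        m = [sum(x[i] for x in elite) / mu for i in range(dim)]
        sigma *= 0.95                           # crude stand-in for step-size adaptation
    return m

# minimize a shifted sphere function; the optimum is at (1, 1, 1)
best = simple_es(lambda v: sum((vi - 1.0) ** 2 for vi in v), dim=3)
```

Even this stripped-down variant converges on simple landscapes; the covariance adaptation in CMA-ES is what makes it robust to the ill-conditioned, non-separable objectives discussed above.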

The GradFreeBits Method
In this work, we base our uniform quantization-aware training scheme (for the weights and activations) on [24]. We alternate between gradient-based training rounds for the model weights and gradient-free training rounds, using CMA-ES, for the mixed-precision bit allocation. This process is referred to as iterative alternating retraining and is illustrated in Figure 1.

Motivation: CMA-ES for Mixed Precision.
We argue that CMA-ES is highly compatible with the problem of bit allocation in QNNs. We assume the objective function is the differentiable loss function used during training, with additional, possibly non-differentiable, constraints related to computational requirements (exact details appear below). As recent evidence suggests [30,31], the optimization landscape of the bit allocation problem is likely to be discontinuous and ill-conditioned, and therefore amenable to optimization using gradient-free optimizers. Moreover, because the constraints may be non-differentiable, they must be evaluated in a black-box setting, as is done in gradient-free optimization (CMA-ES) and reinforcement learning [30]. Additionally, as shown in [31,32], the Hessian eigenvalues vary widely across the layers of QNNs, meaning that certain layers are typically more sensitive to changes in bit allocation than others. This is in part what motivated us to choose CMA-ES for this work, as it is capable of adapting to high variations in the Hessian eigenvalues and is considered one of the best and most widely used gradient-free methods.

Setting the Stage for CMA-ES
In order to optimize bit allocations of a QNN, two items must be defined: the search space and objective function.

Search Space
We define the search space as a vector containing the bit allocations of the weights and activations in all layers of the network, aside from the first and last layers, as these are quantized with a fixed allocation of 8 bits. We found it beneficial to optimize the logarithm (base 2) of this vector rather than the vector itself. Thus, the vector optimized by CMA-ES is the log-precision vector

v = log2([r_W, r_X]),   (3)

where r_W, r_X are the bit allocations of the weights and activations, respectively, and [·, ·] is the concatenation operation.
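For a concrete illustration of this encoding, consider a hypothetical four-layer network (the bit widths below are made up for the example):

```python
import math

# hypothetical 4-layer network: per-layer bit widths for weights and activations
r_w = [4, 8, 2, 4]
r_x = [4, 4, 8, 2]

# CMA-ES operates on the base-2 logarithm of the concatenated bit allocation
v = [math.log2(r) for r in r_w + r_x]
```

The continuous vector v is what CMA-ES perturbs and resamples; integer bit widths are recovered from it with a ceiling operation when the network is actually quantized.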

Objective Function
Our objective function to minimize is a combination of the network's performance measure, the differentiable loss function, subject to a number of possibly non-differentiable computational constraints:

min_v L(v; θ),   (4)
s.t. h_j(v) ≤ C_j,  j = 1, . . . , M,   (5)

where L(v; θ) is the loss function over the training set, parameterized by the network parameters θ, which are assumed to be fixed during the bit allocation optimization stage. Furthermore, h_j(v) are the computational requirements for a given precision vector v (e.g., model size, inference time, etc.), the C_j's are the target requirements that we wish to achieve, and M is the number of constraints. To combine the constraints into the objective function, we use the penalty method:

O(v; θ) = L(v; θ) + Σ_{j=1}^{M} ρ_j max(0, h_j(v) − C_j)²,   (6)

where ρ_j are balancing constraint parameters. This is similar to the approach taken in [33], but here it is applied to gradient-free optimization. We define the computational constraints by matching the requirements of our mixed-precision network to those of a fixed-precision one. For example, we may require that the model size be less than that of a fixed 4-bit allocation. We define a model size function MB(·), which takes in a log-precision vector for the weights and outputs the model size obtained when it is used to quantize the network. More formally, it is defined as follows:

MB(v_W) = (1/(8 · 2^20)) (Σ_{i∈L} ⌈2^{v_W,i}⌉ |W_i| + Σ_{j∉L} r_j |W_j|),   (7)

where ⌈x⌉ is the "ceil" operator, which rounds its argument up to the nearest integer, ⌈2^{v_W,i}⌉ is the precision used to quantize layer i, |W_i| is the number of parameters in layer i, and L is the set of layers to be quantized in mixed precision (all conv layers excluding the first). r_j is the precision of the layers not in L; specifically, r_j = 8 bits for the first conv and last linear layers, and r_j = 32 bits for batch norm layers. Using the notation in (4), we define the constraints on the model size for the weight entries v_W and on the mean bit allocation for the activation entries v_X:

h_1(v) = MB(v_W),  C_1 = β_1 · MB(v_F,W),   (8)
h_2(v) = (1/|L|) Σ_{i∈L} 2^{v_X,i},  C_2 = β_2 · (1/|L|) Σ_{i∈L} 2^{v_F,X,i},   (9)

where v_F is the log-precision vector of the target fixed precision that we wish to achieve, and v is the mixed log-precision vector. As in [24,33], MB(·) calculates the model size given the weight entries v_W, |L| is the number of relevant layers, and β_1, β_2 > 0 control the target compression rates of the weights and activations, respectively.
The constraints above are designed to limit the computational requirements while allowing the gradient-free optimization algorithm to explore non-trivial solutions which satisfy them. It is important to note that though these are mostly related to memory requirements, other constraints can easily be incorporated into our framework, such as power usage or inference time measurements, chip area, etc.
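A minimal sketch of the model size function and the penalized objective might look as follows. The helper names are ours, and the exact penalty form (a squared hinge per constraint) is our assumption about how the penalty method is instantiated:

```python
import math

def model_size_mb(v_w, mixed_params, fixed_params_bits):
    """Sketch of MB(.): model size in megabytes implied by the log-precision
    entries v_w. mixed_params holds |W_i| for the mixed-precision layers;
    fixed_params_bits holds (|W_j|, r_j) pairs for fixed-precision layers."""
    bits = sum(math.ceil(2 ** v) * n for v, n in zip(v_w, mixed_params))
    bits += sum(n * r for n, r in fixed_params_bits)
    return bits / (8 * 2 ** 20)            # bits -> megabytes

def penalized_objective(loss, v, constraints, rho):
    """Penalty method sketch: add rho_j * max(0, h_j(v) - C_j)^2 per
    constraint, where constraints is a list of (h_j, C_j) pairs."""
    penalty = sum(r * max(0.0, h(v) - c) ** 2
                  for (h, c), r in zip(constraints, rho))
    return loss + penalty
```

For instance, one layer of 2^20 parameters at log-precision 2.0 (i.e., 4 bits) plus a fixed 8-bit layer of the same size yields (4 + 8)·2^20 bits = 1.5 MB.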

Gradient-Free Rounds
We define gradient-free steps as steps in which the CMA-ES algorithm optimizes the bit allocation (given the network's weights) according to the objective function (6). In each gradient-free step, multiple generations of samples of the log-precision vector v are evaluated on (6). Because CMA-ES operates in a continuous space, the bit allocations (positive integers) are extracted from v using r_W = ⌈2^(v_W)⌉, r_X = ⌈2^(v_X)⌉, where ⌈x⌉ is the "ceil" operator. At each objective evaluation, the sample of v is used to quantize the weights and activations of the model. Then, the loss (6) is calculated over a set of minibatches, named a "super-batch", yielding the value of the objective for each of the sampled bit allocations. Using this information, CMA-ES gradually minimizes the objective function, which enables non-trivial bit allocations to be found. To reduce subsampling noise during objective function evaluation, we define a moving super-batch as a set of minibatches replaced in a queue-like manner. That is, in each iteration of the super-batch, we replace one of the minibatches within it. More details are given below in Section 3.5, and an ablation study on several replacement schemes is given in our results section.
We define each gradient-free step to include a predefined number of objective evaluations M. It is important to note that gradient-free steps require significantly fewer computational resources than traditional epochs, even if the number of minibatches is matched. This is because they do not require backpropagation, and the CMA-ES computations are negligible compared to the forward passes of the model.
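The extraction of integer bit widths from a continuous CMA-ES sample can be sketched as:

```python
import math

def bits_from_log_precision(v):
    """Recover integer bit allocations from a continuous log-precision
    sample, as in r = ceil(2^v)."""
    return [math.ceil(2 ** vi) for vi in v]
```

The ceiling rounds conservatively upward, so a sample between two integer bit widths is always evaluated at the higher (safer) precision.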
A gradient-free round is described in Algorithm 1, which applies several gradient-free steps utilizing the CMA-ES optimization algorithm. The terms θ, v, v_F denote the network weights and the log-precision parameters of the mixed- and fixed-precision bit allocations, respectively. Furthermore, d is the number of log-precision parameters to optimize. First, the CMA-ES parameters m^(g), σ^(g), C^(g) are initialized, then used to sample log-precision vectors in line (2), and are updated in line (11) according to the CMA-ES update rules (see Appendix A for more details). In line (5), the network parameters θ and log-precision parameters v are inserted into the model, and the loss L(v; θ) is evaluated on the super-batch.

Algorithm 1: Gradient-free round.
Input: θ, v, v_F, d
(1)  initialize m^(g), σ^(g), C^(g); O ← 0_λ; X ← 0_{λ×d}
for each gradient-free step do
    (2)  sample λ log-precision vectors x_1, . . . , x_λ from N(m^(g), (σ^(g))² C^(g))
    for k = 1, . . . , λ do
        (5)  insert θ and v = x_k into the model; O_k ← objective (6) on the super-batch; X_k ← x_k
    end
    (10) if the budget of M objective evaluations is exhausted then break
    (11) update_CMAES(m^(g), σ^(g), C^(g), O, X); O ← 0_λ; X ← 0_{λ×d}
end
(12) return v*, the sampled log-precision vector with the best objective value

Iterative Alternating Retraining
To start the optimization process, the model is pretrained with the quantization-aware training scheme in Section 2.1, using a fixed bit allocation. After this stage, the model is passed to the gradient-free optimizer CMA-ES to optimize its bit allocation for a round of N GF steps, as described in Algorithm 1. This adapts the bit allocation to the model weights, which are fixed at this stage in their floating-point values, and enables CMA-ES to maximize the performance of quantized networks, subject to the computational constraints (Equation (6)). Once the gradient-free round is completed, the bit allocation with the lowest objective value is passed to the gradient-based optimizer for a gradient-based round of N GB epochs. This adapts the model weights to the bit allocation, which is kept fixed, using the quantization-aware training scheme described in Section 2.1. The cycle is repeated several times until the performance and computational requirements are satisfactory. The process is illustrated in Figure 1. The output of this process is a fully trained mixed-quantization model, which can be readily deployed on the target edge device.
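The outer loop described above can be sketched as follows. The two round routines are placeholder callables standing in for Algorithm 1 and the quantization-aware training of Section 2.1, not the paper's actual implementations:

```python
def alternating_retraining(v0, gf_round, gb_round, n_rounds=3):
    """Skeleton of iterative alternating retraining. gf_round searches a
    bit allocation with the weights frozen (gradient-free, Algorithm 1);
    gb_round retrains the weights with the allocation frozen (gradient-based
    quantization-aware training). Both callables are placeholders."""
    v = v0
    for _ in range(n_rounds):
        v = gf_round(v)      # gradient-free round: N_GF CMA-ES steps
        gb_round(v)          # gradient-based round: N_GB QAT epochs
    return v
```

The structure makes the division of labor explicit: each round adapts one set of variables (bits or weights) while the other is held fixed, which is why the bit-dependent clipping parameterization discussed later is needed to keep the two phases compatible.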

Variance Reduction in CMA-ES Sampling
Variance reduction has been shown to improve the convergence rate of optimization algorithms [47]. The main source of variance in our objective function (Equation (6)) is in the first term, related to the performance of the model for different bit allocations. There are two main causes of variance in this term: subsampling noise, caused by using small minibatches of randomly selected samples, and sensitivity to quantization errors, which networks are typically not robust to. In this section, we propose a mitigation to the first cause of variance, while in the next section, we propose a mitigation for the second.

Moving Super-Batches
To mitigate subsampling noise in our objective function (Equation (6)), we define a moving super-batch as a set of minibatches which are replaced in a queue-like manner. That is, in each iteration of the super-batch, we replace part of the minibatches within it. Figure 2 illustrates this approach. During each objective evaluation of CMA-ES, the entire super-batch is run through the model in order to calculate the first term of (6). The queue-like replacement scheme enables CMA-ES to encounter new data samples in each objective evaluation but with a larger overlap of data samples as compared to SGD, where the minibatches are re-sampled at each iteration. Several strategies for the frequency of replacement can be considered, such as replacing one or more minibatches after each objective evaluation, or doing so every fixed number of evaluations. These different settings are explored in the ablation study in Section 5.2.
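A queue-like super-batch along these lines might be implemented as follows (a sketch; the class name and the one-batch-per-advance replacement policy are our illustrative choices, and other replacement frequencies from the ablation study fit the same structure):

```python
from collections import deque

class MovingSuperBatch:
    """Queue of minibatches evaluated together as one 'super-batch'.
    After an objective evaluation, advance() replaces the oldest
    minibatch with the next one drawn from the data stream."""
    def __init__(self, data_iter, size):
        self.data_iter = data_iter
        self.batches = deque((next(data_iter) for _ in range(size)), maxlen=size)

    def advance(self):
        self.batches.append(next(self.data_iter))   # maxlen evicts the oldest batch
```

Because consecutive evaluations share all but one minibatch, objective values for different bit-allocation samples are compared on largely overlapping data, which reduces the subsampling noise seen by CMA-ES.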

Adapting the Clipping Parameters to Varying Bit Allocations
Any alternating minimization scheme that operates similarly to the one described above has the following shortcoming: sampling the loss of the network with new bit allocations that are incompatible with the training of the other parameters may give a misleading measure of performance. The most significant effect is due to the clipping parameters α_X and α_W in (1) and (2). To this end, we parameterize the clipping parameters and train the network's weights to be compatible with multiple bit allocations through a stochastic choice of bit allocations during training. Because the weights remain in floating point throughout the optimization process, we can quantize them with a different bit allocation each time, but we need the clipping parameters to adapt automatically to the bit allocation. To support this desirable property, we follow the analysis in [12] regarding the influence of the bit allocation on the optimal clipping parameters.
According to [12], for a Laplace(0, β) distribution and 2^b quantization intervals, the quantization noise is given by

E[(X − Q(X))²] ≈ 2β² e^(−α/β) + α²/(3 · 2^(2b)),   (10)

and therefore,

∂E/∂α ≈ −2β e^(−α/β) + 2α/(3 · 2^(2b)).   (11)

Setting (11) to zero led [12] to an optimal clipping parameter in post-training quantization, while here we use it to obtain a relation between the optimal clipping parameter and the number of bins used for the quantization:

α e^(α/β) = 3β · 2^(2b).   (12)

Taking the logarithm of both sides yields

ln α + α/β = c + 2b ln 2,   (13)

where b is the number of bits and c = ln(3β) is a constant.
There is no closed-form solution to Equation (13), and, in general, its derivation involves a few possibly inaccurate assumptions. First, the derivation of (10) is only approximate. Second, the assumption that the weights or activations are drawn from a Laplace distribution is reasonable but not entirely accurate. Third, the scale β is unknown. Lastly, the MSE is a reasonable error measure but not necessarily the best one for optimizing a multi-layer neural network. Hence, we do not solve (13) directly but use it as guidance for parameterizing the clipping parameters α(b).
To deal with the choice of the clipping parameter α, we adopt the gradient-based optimization described in Section 2.1, but we parameterize it to account for different bit allocations. Specifically, we replace each of the clipping parameters in (1) and (2) with a simple linear approximation in the number of bits:

α(b) = α^(0) + α^(1) · b.   (14)

This is a reasonable choice under the assumptions above (Laplace distribution, MSE as the error measure); see Figure 3. Different assumptions on the distribution (e.g., a normal distribution), the error measure (e.g., mean absolute error), and similar uniform quantization schemes yield similar results. We note that the relation is not exactly linear and may benefit from a richer parameterization than (14). Equation (14) is used for the quantization in (1) and (2). To train α^(0), α^(1) for each layer, we perturb the bit allocation during the pretraining stage, randomly changing per-layer bit allocations by +1, −1, or 0 bits (no change) around the predefined fixed bit allocation. Finally, we also conduct an ablation study to verify the expected performance gains throughout the training (see Section 5).
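The near-linearity that motivates (14) can be checked numerically. Under the Laplace(0, β)/MSE assumptions of [12], the optimal clipping value satisfies α·e^(α/β) = 3β·2^(2b); solving this by bisection for several bit widths shows that α grows almost linearly with b (a sketch; the bracket and iteration count are arbitrary solver choices):

```python
import math

def optimal_alpha(b, beta=1.0):
    """Bisection solve of alpha * exp(alpha/beta) = 3*beta*2**(2*b),
    the stationarity condition of the Laplace MSE quantization noise."""
    target = 3.0 * beta * 2.0 ** (2 * b)
    lo, hi = 0.0, 100.0 * beta            # f is increasing; f(hi) >> target
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if mid * math.exp(mid / beta) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

alphas = [optimal_alpha(b) for b in range(2, 9)]
# successive differences being nearly constant means alpha(b) is close to linear
diffs = [a2 - a1 for a1, a2 in zip(alphas, alphas[1:])]
```

The successive differences hover around 2β·ln 2 and drift only slowly, which is exactly the regime where the two-parameter linear model (14) is a good fit.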

Experiments and Results
To quantitatively compare our mixed-precision scheme (GradFreeBits) to other related works, we apply it to several neural network architectures for image classification, semantic segmentation, and semi-supervised graph node classification tasks. The properties of the datasets and training configurations of the image datasets are detailed in Table 1. Throughout all the experiments, we use the bit-dependent clipping parameters described in Equation (14). For the mixed case, the averaged number of bits across the layers is considered. We compare our approach to the related works that use uniform quantization, with either fixed (F) or mixed (M) bit allocation schemes, which quantize both the weights and activations. For the ImageNet and segmentation encoder models, we used pretrained weights from TorchVision [48]. Our code is written in the PyTorch framework [49], and the experiments were conducted on an NVIDIA RTX 2080ti GPU.

CIFAR 10/100
The CIFAR10 and CIFAR100 image classification benchmarks [50] have 10 and 100 classes, respectively. The full dataset properties and training configuration can be found in Table 1. Additionally, we used random horizontal flips and crops and mixup [53] data augmentations.
The results for the CIFAR10 dataset are presented in Table 2. Our method outperforms the previous state-of-the-art EBS(M) for both the ResNet20 and ResNet56 models at both mixed-precision settings, e.g., +0.5% for 4-bit ResNet20 and +0.5% for 4-bit ResNet56. Table 2 also includes the results for the CIFAR100 dataset, in which our method also outperforms all the other related works, by +1.6% for 4-bit ResNet20 and +0.7% for 3-bit ResNet20. We believe that our method outperforms the related methods because it considers a larger search space of bit allocations than the other methods, which limit the search space [34,36] or use a fixed bit allocation [17,18,20,21,26]. For example, we consider 1-8 bits for the weights and activations in each layer, a total of 64 combinations, while [36] uses a set of 8 such manually selected combinations. Even though the search space is larger, our method is able to optimize it efficiently due to the excellent sample efficiency of CMA-ES [23].

ImageNet
The ImageNet [51] image classification benchmark has 1K classes, with 1.2M training and 150K test RGB images. The full dataset properties and training configuration can be found in Table 1. The data augmentations are identical to those used in the CIFAR10/100 experiments (above).
The results for the ImageNet dataset are presented in Table 3. Our ResNet18 model achieves the highest Top1 accuracy and the lowest model size compared to all the other methods in the 2W/2A-3W/3A settings. For example, our 3W/3A ResNet18 model achieves a smaller model size (−0.2 MB) and a higher Top1 accuracy (+0.5) compared to the next best uniform method, APoT [24]. For the ResNet50 model, our method consistently achieves the smallest model size, with a Top1 accuracy comparable to the other methods. Our method achieves the smallest model sizes (−1 MB and −0.3 MB) at the cost of a slightly reduced performance (−0.2 and −0.6), compared to [19,36], for the 3W/3A and 4W/4A models, respectively. However, for the 2W/4A model, our method achieves a slightly worse Top1 accuracy (−0.3) compared to [32], though at a significantly smaller model size (−4.9 MB, a 37.4% reduction). For the 2W/2A model, our method achieves the highest accuracy in this category (+1.0 compared to [19]), despite having the same model size. We believe that the improvements in the trade-offs are primarily due to our use of larger search spaces: 64 bit-width combinations for each layer, compared to the 49, 8, and 6 combinations in [30,35,36], respectively. Our method is able to properly optimize the bit allocations in this large search space as it uses CMA-ES, which has an excellent sample efficiency [23]. Furthermore, our method does not make simplifying assumptions regarding the interactions between the layers, as used in [31,32], which again enlarges the search space and enables our method to find superior bit allocations. Table 3. Top1 accuracy on ImageNet. (M) denotes mixed precision, (·) denotes model size, measured in MB. Subscripts denote the reported difference in accuracy compared to the FP accuracy reported in the original papers. * identifies methods that do not quantize the first and last layers. The cost of each method is presented as the total number of epochs required for pretraining, search, and fine-tuning.
∼X presents estimated cost obtained from text descriptions in the original papers.

We also provide a cost comparison with the other fixed- and mixed-precision quantization methods, in terms of the number of epochs, in Table 3. Using the experimental details for ImageNet from Table 1, we find that GFB uses only 57 effective epochs: 30 quantization-aware training epochs and 27 effective epochs for the bit optimization. Because the gradient-free steps have similar costs to the gradient-based epochs (see Appendix B), our training cost is calculated as five gradient-based epochs plus four gradient-free steps, applied iteratively for three rounds: effective_epochs = 3 × (4 + 5) = 27. More details can be found in Appendix B.
Our method requires the lowest computational cost compared to all the other methods, which typically require 100-200 epochs. We believe that our improved training costs may be due to the excellent sampling efficiency of the CMA-ES, which is able to find the optimal bit allocations in a relatively conservative budget of gradient-free steps.

Image Semantic Segmentation
In this section, we demonstrate the advantage of mixed precision for semantic segmentation, another common task for low-resource devices such as automobiles, robots, and drones. Semantic segmentation is an image-to-image task where every pixel needs to be classified. Hence, this task is much more sensitive to quantization than the classification tasks above, where the network produces a single class [54]. We present our results on the popular Cityscapes dataset [52]. The dataset properties and training procedure are described in Table 1. We adopt the popular segmentation architectures DeepLabV3 [55] and DeepLabV3+ [56] with ResNet50 [57] and MobileNetV2 [58] encoders, and with the standard ASPP [55] module for the decoder. For a fair comparison to [16], the images are resized to 256, with random horizontal flips, during training.
In Table 4, we compare our method to the fine-tuned models in [16], where the data were available for training. Even though the method of [16] uses knowledge distillation, our method achieves a strictly superior model size and mIoU in all bit allocations. This is best seen in our 8-bit DeepLabV3(ResNet50), which achieves a +2.5 mIoU despite having a significantly smaller model size of −7.3 MB. We also provide comparisons between our fixed- and mixed-precision models in Figure 4, as well as sample images in Figure 5. The results clearly demonstrate the added value of mixed over fixed quantization models, as the former typically provides better trade-offs between the model size and performance. For example, in Figure 4a, our mixed quantization DeepLabV3+ (MobileNet) models typically achieve a 30% improvement in model size, though with a small degradation of −1.5 in the mIoU, a better trade-off than is obtained by reducing the precision of fixed quantization models. For the DeepLabV3+ (MobileNet) in Figure 4b, the improvement in trade-offs is smaller, though still apparent. Furthermore, the sample images in Figure 5 demonstrate that though the quality of the segmentation maps clearly degrades as the network is quantized to a lower precision, the mixed-precision model (Figure 5f,g) provides more accurate segmentation maps than the fixed-precision model (Figure 5c-e). For example, the shape of the "yield" sign in the top right part of Figure 5b is better preserved by the mixed-precision model (Figure 5f,g).

Semi-Supervised Node Classification
In this section, we demonstrate an application to semi-supervised node classification, as the compression of graph neural networks is important for several real-world applications, such as autonomous vehicles. We compare GFB to [59], using the GCNII [60] with 32 layers, on three common semi-supervised node classification datasets: Cora, Pubmed, and Citeseer [61]. The details of these datasets are provided in Table 5.
The training procedure uses the following hyperparameters. In all the experiments, we used 8 bits for the first and last layers, SGD with lr = 0.01 and momentum = 0.9, a cosine scheduler, β1, β2 = 0.98, ρ1, ρ2 = 10.0, M = 1024, N_GF = 5, N_GB = 4, N_Rounds = 3, 30 pretraining epochs, and a search space limited to [0.0, 3.6]. We compare to the GCNII in [59], which uses only quantization (no wavelet compression) and 8-bit weights while reducing the activation precision. The results for semi-supervised node classification are reported in Table 6. GFB achieves significantly higher accuracy at comparable compression rates. For example, in the 8W/2A setting on Pubmed, GFB achieves a significantly higher accuracy (+37.1) compared to [59], even when accounting for the differences in the 8W/8A baselines. We believe this may be due to a higher observed bit allocation in the initial graph convolution layers, which tend to have a greater effect on the performance of the GCNII than the deeper layers. This is due to the design of the GCNII, which reduces the effect of the deeper layers on the outputs to avoid the oversmoothing phenomenon observed in the original GCN [60].
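The hyperparameter list above can be collected into a single configuration object. The sketch below is purely illustrative: the key names are hypothetical (not from the paper's code), and the roles of β1, β2, ρ1, ρ2, and M are as defined in the method section.

```python
# Hypothetical configuration summarizing the node-classification setup.
# Key names are illustrative; parameter roles follow the method section.
gcnii_config = {
    "first_last_bits": 8,              # first and last layers kept at 8 bits
    "optimizer": {"type": "SGD", "lr": 0.01, "momentum": 0.9,
                  "scheduler": "cosine"},
    "beta1": 0.98, "beta2": 0.98,      # method hyperparameters beta_1, beta_2
    "rho1": 10.0, "rho2": 10.0,        # method hyperparameters rho_1, rho_2
    "M": 1024,
    "n_gradient_free_steps": 5,        # N_GF
    "n_gradient_based_epochs": 4,      # N_GB
    "n_rounds": 3,                     # N_Rounds
    "pretrain_epochs": 30,
    "search_space": (0.0, 3.6),
}
```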

Bit-Dependent Clipping Parameters
We examine the effects of fixed vs. bit-dependent clipping parameters (Equation (14)), where the latter were pretrained using perturbed bit allocations around the predefined fixed bit allocation. The results are displayed in Figure 6. The fixed clipping parameters (red) tend to perform slightly worse than the bit-dependent clipping parameters (blue). As expected, they lead to larger performance drops during the gradient-free rounds (grey), where the different bit allocations are evaluated and optimized. Moreover, the bit-dependent parameterization seems to have a lower variance. We believe this variance reduction is caused by the perturbation noise forcing the network to learn the proper relationship α(b), or robust values of α, rather than an arbitrary combination of α^(0) and α^(1) that provides a low training loss only for the given fixed bit allocation.

Figure 6. Accuracy during the iterative alternating retraining stage, using different clipping parameter bit dependencies, for 4-bit mixed-precision ResNet20 on CIFAR10. Grey regions correspond to gradient-free steps, while white regions correspond to gradient-based epochs.
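To make the idea concrete, the sketch below shows a uniform quantizer whose clipping value depends on the bit allocation b through the two parameters α^(0) and α^(1) mentioned above. The linear form α(b) = α^(0) + α^(1)·b is an assumption for illustration only; the exact parameterization of Equation (14) may differ.

```python
import numpy as np

def clip_param(alpha0, alpha1, b):
    """Illustrative bit-dependent clipping value alpha(b).
    A linear combination of alpha0 and alpha1 is assumed here;
    the paper's Equation (14) may use a different form."""
    return alpha0 + alpha1 * b

def quantize(x, b, alpha):
    """Symmetric uniform quantization of x to b bits, clipped to [-alpha, alpha]."""
    levels = 2 ** (b - 1) - 1          # number of positive quantization levels
    x = np.clip(x, -alpha, alpha)
    step = alpha / levels
    return np.round(x / step) * step

# With bit-dependent clipping, lowering b shrinks both the grid and the range.
x = np.linspace(-2.0, 2.0, 9)
for b in (2, 4, 8):
    alpha = clip_param(0.5, 0.15, b)   # alpha grows with the bit allocation
    xq = quantize(x, b, alpha)
```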

Iterative Alternating Retraining
Here, we examine the effects of pretraining, iterative alternating retraining, and the number of minibatches in the super-batch. All the experiments are conducted with 4-bit mixed-precision ResNet20 models on CIFAR100, with the same hyperparameters used in Section 4.1.
The results of the ablation study are presented in Table 7. The −3.6% accuracy degradation demonstrates that pretraining plays a crucial role in reducing the performance degradation caused by changes in the bit allocation. Iterative alternating retraining, as opposed to separating the bit optimization and weight optimization stages, leads to a small +0.2% increase in performance, demonstrating the added value of this approach. Regarding the super-batch settings, the optimal setting is to use 32 minibatches and replace a single batch after each objective evaluation (SB), also leading to a performance increase of +0.2%.

Table 7. Top1 accuracy of 4-bit mixed-precision ResNet20 on CIFAR100, for various system settings. We use the following shorthand: "SS." for super-batch setting, "|B|" for number of minibatches in the super-batch, "IAR." for iterative alternating retraining, "PRET." for pretraining.
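The optimal super-batch setting above (32 minibatches, replacing one after each objective evaluation) can be sketched as a small rolling pool. This is an illustrative reading of the "SB" setting, with hypothetical names; integers stand in for minibatches in the toy usage.

```python
import itertools
from collections import deque

class SuperBatch:
    """Sketch of a super-batch for gradient-free objective evaluations:
    a pool of minibatches in which the single oldest batch is replaced by
    a fresh one after each evaluation, so consecutive evaluations share
    most of their data. Names are illustrative, not the paper's code."""

    def __init__(self, loader, size=32):
        self.loader = iter(loader)
        self.batches = deque(maxlen=size)
        for _ in range(size):
            self.batches.append(next(self.loader))

    def evaluate(self, objective):
        # Average the objective over all minibatches in the pool.
        score = sum(objective(b) for b in self.batches) / len(self.batches)
        # Replace the single oldest batch with a fresh one (the "SB" setting).
        self.batches.append(next(self.loader))
        return score

# Toy usage: integers stand in for minibatches.
sb = SuperBatch(itertools.count(), size=4)   # pool starts as [0, 1, 2, 3]
score = sb.evaluate(lambda b: float(b))      # mean over the pool; then 0 is replaced by 4
```

Sharing most of the pool between consecutive evaluations keeps the gradient-free objective estimates correlated, which reduces the evaluation noise seen by the optimizer.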

Conclusions
We proposed GradFreeBits, a novel framework for mixed-precision neural networks, which enables customization to meet multiple hardware constraints. The framework combines a gradient-based quantization-aware training scheme for the weights with gradient-free optimization of the bit allocation based on CMA-ES. The combination of the two approaches in an iterative alternating retraining scheme is quite general, making it straightforward to extend to other applications. Additionally, we proposed a novel parameterization of the clipping parameters to facilitate their adaptation to different bit allocations.
Through extensive experimentation, we find that our method achieves superior or comparable trade-offs between accuracy and model size, at lower training costs, compared to several mixed and fixed quantization methods on a wide variety of tasks, including image classification, semantic segmentation, and semi-supervised node classification on graphs. Furthermore, we find that our proposed bit-dependent clipping parameters provide measurable gains in the performance of mixed-precision models with negligible added parameters.
Future work includes utilizing additional constraints, such as measurements from hardware simulators. Additionally, we believe that extending our iterative retraining approach to new scenarios, such as optimizing per-layer pruning rates, may provide similar benefits in trade-offs between accuracy and computational cost.

Funding: The research reported in this paper was supported by the Israel Innovation Authority through the Avatar consortium.
Institutional Review Board Statement: Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. CMA-ES
In this section, we provide a brief overview of the main derivations used in the CMA-ES, without the theoretical background behind these update rules or the choices of hyperparameters. For more details on these topics, we refer the curious reader to [63]; we follow the same notation here.
The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [23] is a population-based gradient-free optimization algorithm. At a high level, the optimization process of the CMA-ES is as follows. At the g-th generation, a set of λ d-dimensional samples x_k ∈ R^d is drawn from a multivariate normal distribution N(m^(g), C^(g)), scaled by the step size σ^(g):

x_k^(g+1) ∼ m^(g) + σ^(g) N(0, C^(g)), for k = 1, ..., λ,

where m^(g) and C^(g) are the mean and covariance matrix of the population at the g-th generation, respectively, λ is the population size, and σ^(g) is the step size. Once the samples are drawn from this distribution, they are evaluated and ranked based on their objective function values. The ranked samples x_{i:λ}^(g+1) are used to calculate m^(g+1), C^(g+1), and σ^(g+1) of the next generation, using the update rules provided below.
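The sampling step above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the CMA-ES implementation used in the paper; a Cholesky factorization is one standard way to draw samples from N(0, C).

```python
import numpy as np

def sample_population(m, sigma, C, lam, seed=None):
    """Draw lam candidate solutions x_k ~ m + sigma * N(0, C),
    the sampling step of CMA-ES (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    d = m.shape[0]
    A = np.linalg.cholesky(C)            # C = A A^T
    z = rng.standard_normal((lam, d))    # z_k ~ N(0, I)
    return m + sigma * z @ A.T           # x_k ~ N(m, sigma^2 C)

# Draw a population and rank it by a toy objective (squared norm).
m = np.zeros(3)
X = sample_population(m, sigma=0.5, C=np.eye(3), lam=8, seed=0)
X_ranked = X[np.argsort(np.sum(X ** 2, axis=1))]
```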
Though the CMA-ES is a highly effective optimization algorithm, its computational complexity is O(d^2) in both space and time [23], where d is the dimension of the parameter vector to be optimized. Thus, the method is inefficient for high-dimensional problems, where d is larger than a few hundred.

Appendix A.1. Hyperparameters
The CMA-ES uses several hyperparameters in order to perform optimization [63]. These include a damping parameter d_σ and c_1, c_μ, c_c, c_m, c_σ, which are "momentum"-like parameters that control the amount of information retained from previous generations. Furthermore, the w_i are known as the "recombination weights", which are used in most update rules. They are typically chosen such that

w_1 ≥ w_2 ≥ ... ≥ w_μ > 0 and Σ_{i=1}^{μ} w_i = 1.

These are also used to calculate the effective population size for recombination:

μ_eff = (Σ_{i=1}^{μ} w_i)^2 / Σ_{i=1}^{μ} w_i^2 = 1 / Σ_{i=1}^{μ} w_i^2.

For more details regarding the specific choices of these hyperparameters, please refer to [63].
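A common default choice of the recombination weights, following the tutorial in [63], sets w_i proportional to log(μ + 1/2) − log(i) for the best μ = λ/2 samples and normalizes them to sum to one. The sketch below computes these weights and the effective population size; it is an illustration of one standard default, not the only valid choice.

```python
import numpy as np

def recombination_weights(lam):
    """Default positive recombination weights (one common CMA-ES choice):
    w_i proportional to log(mu + 1/2) - log(i) for the best mu = lam // 2
    samples, normalized so that sum(w) = 1."""
    mu = lam // 2
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()
    # Effective selection mass; since sum(w) = 1, this is 1 / sum(w^2).
    mu_eff = 1.0 / np.sum(w ** 2)
    return w, mu_eff

w, mu_eff = recombination_weights(10)   # mu = 5 ranked samples are recombined
```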

Appendix A.2. Mean Update Rule
As mentioned above, several update rules are employed in the CMA-ES. The first of these is the update of the mean m^(g+1):

m^(g+1) = m^(g) + c_m Σ_{i=1}^{μ} w_i (x_{i:λ}^(g+1) − m^(g)).

In terms of training cost, our scheme uses 27 epochs for the bit optimization (5 gradient-based epochs + 4 gradient-free steps, applied iteratively for 3 rounds: 27 = 3 · (4 + 5)), in addition to pretraining, and it requires the lowest computational cost compared to all the other methods, which typically require 100-200 epochs. Using Equation (A11) with the experimental details from Table 1, it can be shown that a gradient-based epoch uses 1.2 M data samples, while a gradient-free step uses 1.6 M data samples. However, gradient-free steps have a lower per-sample computational cost than gradient-based epochs, because backpropagation is not performed during gradient-free steps and because the internal CMA-ES operations have a negligible computational cost compared to the forward-pass operations of the networks. In practice, we find they have similar runtimes. For example, on our NVIDIA RTX2080-Ti, our training scheme applied to ResNet50 requires 4.91 GPU hours for a gradient-based epoch and 4.90 h for a gradient-free step.
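As a sanity check, the CMA-ES mean update rule can be written in a few lines. The sketch below is illustrative: with c_m = 1 and weights summing to one, the update reduces to the weighted mean of the μ best-ranked samples.

```python
import numpy as np

def update_mean(m, X_ranked, w, c_m=1.0):
    """CMA-ES mean update:
    m_new = m + c_m * sum_i w_i * (x_{i:lambda} - m),
    where X_ranked holds samples ordered by objective value and
    w contains the mu recombination weights (sum(w) = 1)."""
    mu = len(w)
    return m + c_m * np.sum(w[:, None] * (X_ranked[:mu] - m), axis=0)

# With c_m = 1 and sum(w) = 1, this is the weighted mean of the best samples.
m = np.zeros(2)
X_ranked = np.array([[1.0, 0.0], [0.0, 1.0]])
m_new = update_mean(m, X_ranked, np.array([0.7, 0.3]))
```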