AMED: Automatic Mixed-Precision Quantization for Edge Devices

Quantized neural networks are well known for reducing latency, power consumption, and model size without significant harm to performance. This makes them highly appropriate for systems with limited resources and low power capacity. Mixed-precision quantization offers better utilization of customized hardware that supports arithmetic operations at different bitwidths. Quantization methods either aim to minimize the compression loss given a desired reduction or optimize a dependent variable for a specified property of the model (such as FLOPs or model size); both lead to inefficient performance when deployed on specific hardware. More importantly, these methods assume that the loss manifold of a quantized model holds a global minimum that coincides with the global minimum of its full-precision counterpart. Challenging this assumption, we argue that the optimal minimum changes as the precision changes, and thus, it is better to view quantization as a random process. This lays the foundation for a different approach to quantizing neural networks, which, during the training procedure, quantizes the model to different precisions, treats the bit allocation as a Markov Decision Process, and then finds an optimal bitwidth allocation for specified behaviors on a specific device via direct signals from the particular hardware architecture. By doing so, we avoid the basic assumption that the loss behaves the same way for a quantized model. Automatic Mixed-Precision Quantization for Edge Devices (dubbed AMED) demonstrates its superiority over current state-of-the-art schemes in terms of the trade-off between neural network accuracy and hardware efficiency, backed by a comprehensive evaluation.


Introduction
Deep neural networks have established themselves as the primary algorithmic solution for a wide array of real-world applications. However, the computational and memory requirements of DNNs are considerable, leading to notable latency and power consumption during both the training and inference processes.
Quantization is a promising and straightforward technique to accelerate neural network architectures because it reduces the model's memory footprint and the computation complexity by a large factor.
The standard metric to evaluate a quantized model's computational complexity is the number of multiply-accumulate (MAC) operations or bit operations (BOPs) it demands. Neither considers other factors such as communication complexity [13], memory utilization, and inter-layer relations. Using ultra-low-bit quantization does not, therefore, consistently improve the chip's overall performance due to communication and memory boundaries [14].
To this end, recent developments in industry [15,16] and academia [17,18] have introduced support for various precision matrix multiplications, leading to a new line of work focusing on mixed-precision (MP) quantization [19,20,21], that is, assigning a different bitwidth to each matrix (i.e., weights and activations). MP quantization methods typically estimate the loss manifold of a full-precision model and assess which layers would cause the least steep change to the local minimum the model reached in one quantization step [19,20]. In this work, we argue that, when quantizing a model during the training process, the minimum on this loss manifold could be very different. This realization led us to the conclusion that quantization should be looked at as a process, and the optimal step depends on the intermediate state of the model, not on a final form. This way, an intermediate quantized model is easier to train and the quantization loss is lower, so the trajectory to the final quantized model is smoother.
Hardware-aware methodologies, sometimes coupled with Neural Architecture Search (NAS), have shown promising progress. Current work [8,9] directly measures signals from the hardware simulator. Nevertheless, to incorporate hardware constraints into the loss, these methods neglect the dependencies between layers, that is, they assume that the latency per layer per precision is a fixed number. This assumption is inaccurate in cases where the memory utilization is high because the communication boundary agrees with the computational one. NAS can provide a solution that is better tailored to the task. Even so, the search space is commonly huge, and the cost of training [22] and the carbon footprint [23] are high.
Our solution, Automatic Mixed-Precision Quantization for Edge Devices (dubbed AMED), is an algorithmic framework that chooses a mixed-precision bit allocation per layer by looking at reduced precision as a Markov Decision Process (MDP), using signals directly from the hardware. The procedure is simple to use, not bound to a specific HW design, and easily fine-tuned by the user to find a good balance between performance and resource consumption, as shown in Figure 1.

This paper makes the following contributions:
• A novel framework for mixed-precision quantization of DNNs that looks at reduced precision as a Markov process.
• A quality score that represents the accuracy-latency trade-off with respect to the hardware constraint. This allows us to create custom-fit solutions for a range of device-specific hardware constraints via direct hardware signals in the training procedure.
• Extensive experiments conducted on different hardware setups with different models on standard image classification benchmarks (CIFAR100, ImageNet). These outperform previous methods in terms of the accuracy-latency trade-off.
• A proposed modular framework, i.e., the sampling method, hardware properties, accelerator simulator, and neural network architecture are all independent modules, making it applicable to any given case.

Multi-Objective Optimization
In a simple learning paradigm, the optimization process attempts to minimize an objective function; the example in this work is a classification model, minimizing the cross-entropy loss, denoted by $L_{CE}$.
Multi-objective problems can have both positively correlated objectives and negatively correlated objectives (also known as conflicting objectives or adversary objectives). The problem we aim to solve in this paper is the accuracy-latency trade-off of the quantized model, in which one objective is minimizing $L_{CE}$, while the other is minimizing the latency, denoted by $L_{lat}$. This idea can be extended to other hardware objectives (such as model size or power consumption); however, these typically correlate with $L_{lat}$ and are beyond the scope of this work.
The two objectives conflict because quantizing the model to a lower-precision representation reduces the latency while increasing the quantization error. Due to these conflicting objectives, multi-objective optimization problems are known to lack solutions that are optimal with respect to all of the objectives [24]. Typically, there are many optimal (or suboptimal) solutions because one is not comparable with the other. Solution A dominates solution B (A ≺ B) if all objective values of A are better than or equal to the respective objective values of B. Dominant solutions define a Pareto-optimal curve in the objectives' plane. To solve multi-objective optimization, it is very common to use single-objective approaches or a Pareto approach.
A single-objective approach aggregates the objectives into one term, where the basic aggregation method is the weighted sum vector [25], denoted by α, which encodes the user's priorities over the different objectives.This method is very common when training a DNN, but has the following drawbacks:

• The weighted sum vector α must be determined beforehand and requires a grid search or meta-learning, both of which can be costly and have difficulty converging.
• Not all objectives can be optimized via the same optimization scheme. For example, a gradient-based optimizer cannot be used when only some of the objectives are differentiable.
A Pareto approach is based on sampling methods and finds the Pareto curve by adopting only dominant solutions. The main drawback of this approach from the perspective of a quantization scheme is the computational cost. For each sample, a neural network must first be quantized and trained before the objectives can be evaluated.
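The dominance relation and the resulting Pareto curve described above can be illustrated with a minimal Python sketch. It assumes that every objective is minimized (for example, a pair of cross-entropy loss and latency) and adds the usual requirement that a dominating solution be strictly better in at least one objective; the function names and the sample values are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` dominates `b` when all objectives are minimized.

    `a` must be no worse than `b` in every objective and strictly better
    in at least one (the strictness is an assumption of this sketch).
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (loss, latency) pairs for four sampled bit allocations.
samples = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0)]
front = pareto_front(samples)
```

In this toy example, the third sample is dropped because the second one is better in both objectives. In the quantization setting, each point would require quantizing and training a network, which is the cost the paragraph above refers to.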

Quantized Neural Networks
Quantization is one of the most efficient and commonly used techniques for the acceleration and compression of models. Generally, quantization procedures can be categorized into Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT):
• PTQ uses a small calibration set to obtain the optimal quantization parameters without the need or capability to use the entire dataset.
• QAT conducts the quantization during the training process and, thus, uses the entire training corpus.
QAT is more appealing when quantizing a model into ultra-low precision because the network is trained to withstand quantization noise [26]. Because we are focusing on ultra-low precision, we use the QAT technique for this work.
Homogeneous quantization quantizes both weights and activations using the same bitwidth throughout the network, scaling from 32-bit to as low as binary [27,28]. Ref. [29] proposes a learnable quantizer with a basis vector that is adaptable to the weights and activations. Refs. [12,30] present a clipping method that automatically optimizes the quantization scales during model training. In [31], a non-uniform step size was learned as a model parameter so that it became more sensitive to the quantization transition points. Ref. [32] takes this method a step further and proposes a weight regularization algorithm that encourages a sharp distribution for each quantization bin and encourages it to be as close to the target quantized value as possible. Ref. [33] employed a series of hyperbolic tangent functions to gradually approach the staircase function for low-bit quantization.
Recent research, however, has shown that different layers in a DNN model contribute to the performance differently and contain different redundancies. Therefore, a mixed-precision quantization scheme that assigns a different precision bitwidth to each layer of the DNN can offer better accuracy while compressing the model even further. Nevertheless, determining each layer's precision is quite challenging because of the large search space. In [34], NAS was used for allocating the bitwidth by employing BOPs in the cost function. HAQ [19] leverages reinforcement learning to determine the quantization policy layerwise. Additionally, it takes the hardware accelerator's feedback into account in the architecture design, attaining an optimized solution for deterministic constraints. It fails, however, to find a solution that stochastically reduces latency and power consumption. FBNET [8] combines all possible design choices into a stochastic super-net and approximates the optimal scheme via sampling. These search methods can require a large quantity of computational resources, which scale up quickly with the number of layers.
Therefore, other works attempt to allocate the bitwidth using different methods. Ref. [35] used the first-order Taylor expansion to evaluate the loss sensitivity due to the quantization of each channel, and then adjusted the bitwidth channelwise. Ref. [36] considered each layer's bitwidth as an independent trainable parameter. Ref. [37] allocated differentiable bitwidths to layers, which consequently offers smooth transitions between neighboring quantization bit levels, all while meeting a target computational constraint. HAWQ [20,38] applies mixed-precision quantization using the Hessian information. Power iteration is adopted to compute the top Hessian eigenvalue, which is used to determine which layers are more sensitive to quantization. It relies on a proxy signal that assumes that all devices benefit from a higher compression rate and a lower number of operations, and does not consider the specific hardware design. Ref. [39] also used the Hessian trace and formulated the mixed-precision quantization as a discrete constrained optimization problem solved by a greedy search algorithm. Ref. [40] differentially learned the quantization parameters for each layer, including the bitwidth, quantization level, and dynamic range. FILM-QNN [21] investigates mixed-precision quantization in the layer scope. This means that each layer's parameters have a different precision. While this has the benefit of being even more efficient, handling the metadata that records which parameters are sent to which multiply-accumulate (MAC) unit is not feasible in most current hardware architectures.
The quantization can also be performed in two different ways. Works such as [19,31,33,36,38] perform uniform quantization, in which the width of the quantization bins is a single parameter. A more complex non-uniform quantization allows different bin widths, which reduces the quantization error [29,35].
To perform a standard MAC operation in a fixed-point representation on a given hardware, all of the data (weights and activations) must be equally scaled. This means that, when implementing non-uniform quantization, a lookup table to a higher representation is required. While the communication costs are reduced, the computational and area costs of non-uniform quantization are very high and inefficient.

Method
This section introduces our proposed technique for mixed-precision quantization. It leverages the strengths of both approaches described in Section 2. We achieve this by first combining the desired properties of the model into a single, unified objective function. Then, we employ the Pareto optimization approach to efficiently sample solutions that achieve a desirable trade-off between these properties. We define the problem (Section 4.1) and describe our perspective of the quantization process of neural networks (Section 4.2), following which we describe our MP quantizer (Section 4.3) and our MP simulator (Section 4.4). Finally, we introduce the technique of AMED (Section 4.5).

Problem Definition
The DNN architecture parameters grouped into N layers are denoted by θ (Figure 2). The bit-allocation vector is denoted by $A \in M^{N}$, where M is the number of possible bit allocations. Additionally, a quantized set of model parameters is denoted by $\theta_A$. A is a temporal vector, denoted by $A_t$ at time t, and the l-th layer bitwidth at time t is denoted by $A^{l}_{t}$. Due to the dependence on specific accelerator properties, a differentiable form of the hardware objective (e.g., latency) cannot be directly measured in advance. This renders traditional optimization techniques relying on gradient descent inapplicable. Furthermore, the non-i.i.d. nature of latency across layers adds another layer of complexity. Quantization choices for one layer (e.g., $A^{l-1}$) can influence the cache behavior of subsequent layers (e.g., l), leading to unpredictable changes in the overall latency. This makes individual layer optimization ineffective.

Motivated by multi-objective optimization approaches that can handle trade-offs between competing goals, we propose encoding user-specified properties of the DNN model (e.g., accuracy, memory footprint, and latency) into a single, non-differentiable objective function.By doing so, we aim to avoid introducing relaxations that might not accurately capture the true behavior of the hardware platform, potentially leading to suboptimal solutions.
We define the objective Z as a weighted sum of the losses (Equation (1)). Since the objective function Z is non-differentiable with respect to the model parameters, we employ a nested optimization approach. This approach involves optimizing the inner loop (minimizing $L_{CE}$) before tackling the outer loop (minimizing Z).
To identify a set of solutions that best balance the trade-off between model performance and hardware constraints, we introduce two modifications to our objective function:

•
We introduce a penalty term that becomes increasingly negative as the quantized model's accuracy falls below a user-specified threshold.This ensures that solutions prioritize models meeting the desired accuracy level.

•
We incorporate a hard constraint on the model size.This constraint acts as a filter, preventing the algorithm from exploring solutions that exceed a user-defined maximum size for the quantized model.
With respect to the above and by using a maximization problem, our objective is defined in Equation (2), in which $Z_{ref}$ denotes the baseline performance. We used the performance of a uniformly 8-bit quantized network as $Z_{ref}$.

Multivariate Markov Chain
Quantization error contributes to the overall approximation error.These errors can be conceptualized as non-orthogonal signals within the error space.Crucially, the error vectors may not always point in the same direction.
Simply quantizing the model's weight matrices and adding the resulting error to predictions is insufficient.Furthermore, quantized models without a subsequent finetuning step often underperform.
Nahshan et al. [41] demonstrate that the quantization error signals of models with similar bitwidths exhibit smaller angles in the error space, indicating higher similarity. This suggests that, even for models destined for very low precision, an intermediate quantization step (e.g., 8 bits) can be beneficial.
To the best of our knowledge, the optimal approach for progressively quantizing deep learning models, especially those with mixed precision and complex architectures, remains an open question.The high dimensionality of such a process makes traditional ablation studies challenging.
While it is well known that the optimization of DNNs is neither an MDP nor canonical for each precision [42,43], the observations of Nahshan et al. motivate our exploration of temporal precision in the quantization process. We propose modeling the network's state as a function of its previous mixed-precision configuration, akin to a multivariate Markov chain.
Formally, we define the bit allocation as a multivariate Markov chain, as defined in [44].
Let $A^{(i)}_{t}$ be the bit allocation of the i-th layer at time t, defined by Equation (3), in which $P^{(ii)}$ is a one-step transition probability matrix for the i-th layer precision as a Markov chain.
In the context of this work, every layer i of the neural network has a precision at time t, and its transition to a new precision at t + 1 is given by Equation (3).
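For orientation, a multivariate Markov chain over N categorical sequences is commonly written in the general form below. This is a sketch that assumes the standard formulation is the one intended by [44]; the mixing weights $\lambda_{ij}$ and the cross-layer matrices $P^{(ij)}$ for $i \neq j$ are symbols introduced here for illustration.

```latex
A^{(i)}_{t+1} \;=\; \sum_{j=1}^{N} \lambda_{ij}\, P^{(ij)} A^{(j)}_{t},
\qquad \lambda_{ij} \ge 0, \quad \sum_{j=1}^{N} \lambda_{ij} = 1,
\qquad i = 1, \dots, N .
```

Under this reading, $A^{(j)}_{t}$ is a probability vector over the M bitwidth options of layer j, the term with $j = i$ captures the layer's own one-step transition through $P^{(ii)}$, and the remaining terms capture the influence of the other layers.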
Thus, the bit-allocation multivariate Markov chain follows an update rule governed by a transition matrix Q. Modeling Q explicitly is challenging, since we would like to model it so that states close to minima of the loss have the highest probability. As an alternative to explicit modeling, we suggest using sampling techniques over a distribution $\tilde{Q}$, where $\tilde{Q} \propto Q$. In this study, we use random walk Metropolis-Hastings [45], a Markov chain Monte Carlo (MCMC) method. For $\tilde{Q}$, we construct a distribution from Objective (2).
We use an exponential moving average (EMA) with coefficient γ to avoid cases of diverged samples of quantized models, where γ = 0.01 empirically reduces the number of required samples. A new sample q is obtained by applying a logarithmic scale over Equation (2), where q is the update step of the transition matrix Q, which simply updates the probability of transitioning to a different precision based on the scaled loss. Z is defined in Equation (1) by the weighted sum of losses. Note that the only differences here from Equation (2) are the monotonic logarithm function and taking the mask out of the log, since the mask operates elementwise (Hadamard product) with the objective. We employ the random walk Metropolis-Hastings algorithm (Algorithm 1) to determine whether to accept a new bit-allocation vector A or retain the current one.
The algorithm proceeds in the following steps, where α is the layerwise acceptance ratio:
1. Candidate Generation: A candidate vector, denoted by A*, is proposed for the next allocation.
2. Acceptance Ratio: The layerwise acceptance ratio α of the candidate is computed.
3. Bernoulli-Based Acceptance:
• If α ≤ 1, a Bernoulli distribution with probability α is used for acceptance:
- If the Bernoulli trial succeeds, A* is accepted.
- Otherwise, the current allocation is retained.
This acceptance scheme adaptively balances exploration and exploitation:
• Exploration Phase: Accepts a higher proportion of new candidates, encouraging broad space exploration.
• Exploitation Phase: Preferentially accepts candidates that leverage knowledge from previously discovered samples.
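The sampling steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the score table `scores` (one positive score per layer and bitwidth option, standing in for the estimated Q), the neighbor-step proposal, and the EMA form are all hypothetical choices made for this example.

```python
import numpy as np


def propose(alloc: np.ndarray, num_options: int, rng: np.random.Generator) -> np.ndarray:
    """Random-walk proposal: each layer moves to a neighboring bitwidth index or stays."""
    step = rng.integers(-1, 2, size=alloc.shape)  # values in {-1, 0, 1}
    return np.clip(alloc + step, 0, num_options - 1)


def mh_step(alloc: np.ndarray, scores: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """One layerwise Metropolis-Hastings step over bitwidth indices.

    `scores[l, k]` is an assumed non-negative score for layer l at option k;
    the score of the current option must be positive.
    """
    candidate = propose(alloc, scores.shape[1], rng)
    layers = np.arange(alloc.size)
    alpha = scores[layers, candidate] / scores[layers, alloc]  # acceptance ratio
    # alpha >= 1 is always accepted; otherwise accept with probability alpha.
    accept = rng.random(alloc.size) < np.minimum(alpha, 1.0)
    return np.where(accept, candidate, alloc)


def ema_update(scores: np.ndarray, new_scores: np.ndarray, gamma: float = 0.01) -> np.ndarray:
    """Blend freshly measured scores into the running estimate (assumed EMA form)."""
    return (1.0 - gamma) * scores + gamma * new_scores
```

A typical loop would call `mh_step` to obtain a new allocation, quantize and fine-tune the model, measure the objective, and pass the result to `ema_update`. Clipping at the boundaries keeps the proposal symmetric between neighboring options, which is what allows the plain ratio of scores to be used as the acceptance ratio.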

Quantizer
Our approach is agnostic to the specific quantization technique employed.As long as the chosen quantizer is compatible with the user's target accelerator, it can be seamlessly integrated.The experiments presented in this paper leverage a quantization-aware training (QAT) quantizer along with a learnable quantization scale, as advocated in [31].
During the optimization of the quantized network, we employ a technique called fake quantization.This involves maintaining a full-precision (FP32) copy of the weights, while using a simple straight-through estimator for back-propagation [46].This estimator approximates the gradients of the quantized weights during training.
To quantize a matrix M with b bits, we follow a two-step process:
1. Scaling and rounding: M is scaled by a factor S (often referred to as the scaling factor) and the result is rounded. Here, $M_{int}$ represents the resulting integer version of M.
2. Clamping: The scaled and rounded integer $M_{int}$ is then clamped to the valid range representable by b bits. This ensures the quantized values stay within the intended range.
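The two steps above can be sketched with NumPy as follows. The sketch assumes that scaling means dividing by S (as in step-size-based quantizers), that the integer range is signed and symmetric around zero with no zero point, and that rounding is NumPy's round-half-to-even; these choices and the function names are illustrative assumptions rather than the paper's exact quantizer.

```python
import numpy as np


def quantize(m: np.ndarray, scale: float, bits: int) -> np.ndarray:
    """Scale and round `m`, then clamp to the signed range of `bits` bits."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    m_int = np.round(m / scale)                       # step 1: scaling and rounding
    return np.clip(m_int, qmin, qmax).astype(np.int32)  # step 2: clamping


def fake_quantize(m: np.ndarray, scale: float, bits: int) -> np.ndarray:
    """Quantize and map back to floating point, as used in the forward pass of QAT."""
    return quantize(m, scale, bits).astype(np.float32) * scale


weights = np.array([0.26, -1.0, 10.0])
w_int = quantize(weights, scale=0.1, bits=4)      # values outside [-8, 7] are clamped
w_fake = fake_quantize(weights, scale=0.1, bits=4)
```

During training, a straight-through estimator would treat the rounding as the identity in the backward pass; that part is omitted here because plain NumPy has no automatic differentiation.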

Figure 3. Quantization of the i-th layer of the network (panel labels: asymmetric quantizer, Int-MM). $W_i$ indicates the weights, $X_i$ the input activations, and $\hat{W}_i$, $\hat{X}_i$ their quantized versions, respectively. $S_{W_i}$, $S_{X_i}$ are the learnable parameters of the quantization (scales). Dynamically changing tensors are shown in blue, and parameters in orange.
Weights and activations are quantized with respect to the scaling factor (with rounding and clamping as described in Section 4.3). The quantized versions are multiplied in an integer matrix multiplication accelerator and produce a quantized vector $\hat{Z}_i$. Using the scaling factors, we can dequantize it into $Z_i$, which can yield a prediction in FP, or quantize it again at a different precision in the next layer.
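The data path described above (quantize, integer matrix multiplication, dequantize) can be sketched as follows. The sketch assumes a symmetric signed quantizer without zero points, so that the integer product can be dequantized by the product of the two scales; the helper names are hypothetical.

```python
import numpy as np


def quantize(m: np.ndarray, scale: float, bits: int) -> np.ndarray:
    """Symmetric signed quantization to integers (assumed form, no zero point)."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(m / scale), -qmax - 1, qmax).astype(np.int64)


def quantized_linear(x, w, s_x: float, s_w: float, bits_x: int, bits_w: int):
    """Integer matmul of quantized activations and weights, then dequantization."""
    x_int = quantize(x, s_x, bits_x)
    w_int = quantize(w, s_w, bits_w)
    z_int = x_int @ w_int.T        # integer accumulation (Int-MM)
    z = z_int * (s_x * s_w)        # dequantize with the product of the scales
    return z_int, z


x = np.array([[0.5, -0.25]])       # one input row with two features
w = np.array([[1.0, 0.5]])         # one output neuron
z_int, z = quantized_linear(x, w, s_x=0.25, s_w=0.5, bits_x=4, bits_w=4)
```

Because the chosen values are exactly representable at these scales, the dequantized output matches the floating-point product `x @ w.T`. With an asymmetric quantizer, additional zero-point correction terms would be needed.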

Simulator
To enable direct sampling of signals from emulated hardware, we developed a hardware accelerator simulator.This simulator allows us to model various aspects of an accelerator, including the inference time for different architectures, and generate signals usable within our algorithm.These signals can extend beyond latency to encompass memory utilization, energy consumption, or other relevant metrics.

Underlying Architecture Modeled
Our simulator draws heavily on SCALE-Sim v2 [17], a Systolic Accelerator Simulator (SAS) capable of cycle-accurate timing analysis.A SAS is ideal for DNN computations due to its efficient operand movement and high compute density.This setup minimizes global data movement, largely keeping data transfer local (neighbor to neighbor within the array), which improves both energy efficiency and speed.Our SAS can additionally provide power/energy consumption, memory bandwidth usage, and trace results, all tailored to a specific accelerator configuration and neural network architecture.We extended its capabilities by incorporating support for diverse bitwidths and convolutional neural network (CNN) architectures not originally supported, such as Inverted Residuals (described in [47]).

Simulator Approximations
1. Optimal Data Flow Assumptions: The simulator models specific types of dataflows, namely Output Stationary (OS), Weight Stationary (WS), or Input Stationary (IS), and assumes an ideal scenario where outputs can be transferred out of the compute array without stalling the compute operations. In real-world implementations, such smooth operations might not always be feasible, potentially leading to a higher actual runtime.
2. Memory Interaction: The simulator models the memory hierarchy simplistically, assuming a double-buffered setup to hide memory access latencies. This model may not fully capture the complex interactions and potential bottlenecks of real memory systems.
The original SCALE-Sim simulator was validated against real hardware setups using a detailed in-house RTL model [17].

Using the Simulator
The simulator operates on two key inputs:
1. Network Architecture Topology File: This file specifies the structure of the neural network, including the arrangement of layers and their connections.
2. Accelerator Properties Descriptor: This descriptor defines the characteristics of the target hardware accelerator, such as its memory configuration and processing capabilities.
We evaluated the simulator using various network architectures: ResNet-18, ResNet-50 [48], and MobileNetV2 [47].While MobileNetV3 could potentially be explored in future work, it is not included in the current set of experiments.The specific accelerator properties used are detailed in Table 1.
Table 1. The accelerator setups: a compact accelerator based on SCALE-Sim [49], the SCALE-Sim micro-controller with low memory, and Eyeriss [50]. The properties we used in the simulator for each setup are listed in the table. Data flow indicates the stationarity (weights-"ws"; activations-"as"; output-"os"), i.e., what data should remain in the SRAM for the next computed layer. "os" writes the output directly to the input feature map SRAM.

SRAM Utilization Estimation: The current simulator estimates SRAM utilization based on bandwidth limitations. Incorporating a more accurate calculation of SRAM utilization within the simulator is a potential area for future improvement. This would provide a more precise signal for the algorithm.
Simulator Output: The simulator generates a report for each layer, containing various metrics such as the following:
- Compute cycles;
- Average bandwidths for DRAM accesses (input feature map, filters, output feature map);
- Stall cycles;
- Memory utilization (potentially improved in future work);
- Other details specified in Appendix C.

Extracting Latency Metrics: From the reported compute cycles C and the clock speed f, we calculate the computation latency as C/f. Similarly, a memory latency is estimated for each SRAM. The total latency of the quantized model is determined by the maximum latency value obtained from these calculations (computation and memory latencies for each layer).
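The latency extraction above amounts to taking the larger of a compute bound and the memory bounds. A minimal sketch follows; it assumes the per-SRAM memory latencies are already available in seconds (the formula for them is not reproduced here), and the function and parameter names are hypothetical.

```python
from typing import Iterable


def bounded_latency(compute_cycles: int, clock_hz: float,
                    memory_latencies_s: Iterable[float]) -> float:
    """Return the bounding latency in seconds.

    The computation latency is C / f; the result is the maximum of that
    value and the supplied per-SRAM memory latencies.
    """
    compute_latency = compute_cycles / clock_hz
    return max([compute_latency, *memory_latencies_s])


# Memory-bound case: 1000 cycles at 1 MHz take 1 ms, but one SRAM needs 2 ms.
memory_bound = bounded_latency(1000, 1e6, [0.002, 0.0005])
# Compute-bound case: 5000 cycles at 1 MHz take 5 ms, more than any SRAM.
compute_bound = bounded_latency(5000, 1e6, [0.002])
```

Taking the maximum reflects the idea that computation and memory transfers overlap, so the slower of the two determines the observed latency.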

Training and Quantizing with AMED
AMED employs a Metropolis-Hastings algorithm (Algorithm 1) to sample precision vectors A. These precision vectors guide the quantization process, resulting in a mixed-precision model θ A .Details on the quantization procedure, including activation quantization matching weight precision, can be found in Section 4.3.
Following quantization, we perform a two-epoch optimization step using stochastic gradient descent (SGD) to fine-tune both the quantized model parameters $\theta_A$ and the scaling factors $S_{W_i}$ and $S_{X_i}$ associated with the weights and activations, respectively.
The quantized model's performance is evaluated on the validation set using the cross-entropy loss $L^{l}_{CE}$ and on a simulated inference scenario detailed in Section 4.4, using the latency loss $L^{l}_{lat}$. Both loss values contribute to updating the estimated expected utility Q (Equation (7)) and guide the sampling of a new precision vector A.
Figure 4 illustrates the quantization process, while Algorithm 2 details the complete workflow.
The algorithm described in Algorithm 2 is agnostic to the quantizer and to the hardware specification and simulation.This means that one can use any quantization technique that relies on the statistics of the weights and activations in a single layer and any hardware or hardware simulator and use AMED to choose the best mixed-precision bit allocation for the hardware.

Results
In this section, we present our quantization experimental results on the ImageNet dataset [51] for different architectures.
We applied our AMED algorithm to determine the bitwidth of each layer. We used an SGD optimizer with a momentum of 0.9 and a weight decay of $10^{-3}$ for ResNet and $10^{-5}$ for MobileNet. We ran each network for 80-90 epochs with a starting learning rate of $10^{-2}$ for ResNet and $10^{-3}$ for MobileNet. The learning rate dropped by a factor of 10 every 30 epochs. The batch size was 256, and we used the common data augmentations of random horizontal flip and random crop. The values of β as listed in Table 2 are $\beta_1 = 1$ and $\beta_2 = 10$. We also followed the common practice of not quantizing the classifier (FC layer), which accounts for less than 3% of the latency of the smallest network we trained. All experiments used a pretrained model quantized uniformly to 8 bits by the regime of [31]. We applied a uniform quantization technique and a learnable scale for both weights and activations, as described in Figure 3. The scale factor $S_W$ for the weights was initialized with the statistics from the INT8 quantized model, and $S_X$ for the activations was initialized with the statistics from one batch (of size 256), where X is the first batch of images and b is the number of bits that represent the quantized layer. All experiments using the SCALE-Sim accelerator hardware setup for the ImageNet dataset are listed in Table 2. Other hardware performance parameters are given in Appendix A1.
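The step schedule described above (a drop by a factor of 10 every 30 epochs) can be written as a short Python function. This is a sketch of the schedule as stated in the text; the function name and default arguments are hypothetical, and the training code itself is not shown.

```python
def step_lr(epoch: int, base_lr: float, drop_every: int = 30, factor: float = 0.1) -> float:
    """Learning rate at `epoch` when it is multiplied by `factor` every `drop_every` epochs."""
    return base_lr * factor ** (epoch // drop_every)


# ResNet setting from the text: start at 1e-2 and train for up to 90 epochs.
resnet_schedule = [step_lr(e, 1e-2) for e in range(90)]
```

With these settings the rate is 1e-2 for epochs 0 to 29, about 1e-3 for epochs 30 to 59, and about 1e-4 for epochs 60 to 89.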
Table 2. Performance comparison with state-of-the-art methods on ImageNet; notably good results are in bold. N MP indicates mixed precision using N as the minimum allowed bitwidth. ψ is our reimplementation, with pretrained FP32 weights from [54]. We only present our implementation when we achieved better results than the original paper. If the original paper we are comparing to did not publish the model's bit allocation, we could not run the simulator and find the latency or calculate the model size.
We tested our model on various hardware setups. Figure 5 shows that our method produced a different bit allocation for MobileNetV2 for each hardware constraint. The effect of different values of β for the same model and the same hardware simulator is shown in Figure 6, and other outcomes for different hardware constraints for ResNet-50 can be seen in Figure A1.
Figure 5. Quantization bit allocation of MobileNetV2 following our method using the simulator. The top figure is the SCALE-Sim setup; the middle is the Eyeriss setup; the bottom is SCALE-Sim with low memory. Depthwise convolutions have a larger feature map and, thus, a higher memory footprint, and we can see that Algorithm 2 allocates fewer bits when the system memory is low, i.e., the model is memory-bound. Models with higher memory allocate the bits differently due to the locality of the boundary (memory or computational) by the layer. This figure does not include the first and last layers, which we quantize to 8 bits.
We report our quantized results in Table 2. Because AMED can efficiently provide a trade-off between latency and accuracy, we can control the accuracy degradation and latency requirements easily by a simple adjustment of the hyperparameters. For each architecture, we report multiple results that demonstrate this trade-off. As seen in Table 2, for ResNet-18, we achieved more than a 2.6× latency improvement with only a 0.23% drop in accuracy compared to the latest state-of-the-art quantization methods: LSQ [31] and DDQ [40]. For ResNet-50, compared to the 4-bit models, we outperformed in accuracy by 0.2-0.93% while still improving latency. For MobileNetV2, we can also see more than a 1 ms improvement in latency compared to the HAQ [19] 4-bit mixed-precision model with a 0.5% degradation in accuracy. We emphasize that all results are sensitive to the β setting, and a degradation in accuracy can easily be compensated for by allowing a higher latency. As demonstrated in Figures 1, 7, and 8, AMED achieves a better Pareto curve for the accuracy-latency trade-off, which can dominate other quantized models.

Ablation Study
In this section, we describe our tests of AMED on CIFAR100 [55] with a ResNet-18 architecture and different hyperparameters, as listed in Table 3. As expected, the method chooses bitwidths that result in lower latency when β is higher because, as shown in Equation (1), a higher β increases the weight of the latency component in the objective. The use of EMA helps when we fix a small β value. This result emphasizes the importance of averaging the score of each bitwidth with older samples.
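The two effects discussed in this ablation can be illustrated with a small sketch. It assumes an objective of the form task loss plus β times latency, which is a simplification and not necessarily the exact form of Equation (1), and it uses a standard exponential moving average. The helper names, the decay value, and all numbers are hypothetical.

```python
from typing import Dict


def ema_update(prev: float, new: float, decay: float = 0.9) -> float:
    """Blend the running score with a new sample (exponential moving average)."""
    return decay * prev + (1.0 - decay) * new


def pick_bitwidth(task_loss: Dict[int, float],
                  latency: Dict[int, float],
                  beta: float) -> int:
    """Pick the bitwidth minimizing task_loss + beta * latency (assumed form)."""
    return min(task_loss, key=lambda b: task_loss[b] + beta * latency[b])


# Hypothetical per-bitwidth task loss and latency (seconds) for one layer.
task_loss = {2: 1.20, 4: 0.90, 8: 0.80}
latency = {2: 0.010, 4: 0.020, 8: 0.040}

low_beta_choice = pick_bitwidth(task_loss, latency, beta=1.0)
high_beta_choice = pick_bitwidth(task_loss, latency, beta=50.0)

# Smoothing a noisy sequence of scores for a single bitwidth.
samples = [1.0, 3.0, 1.0]
running = samples[0]
for s in samples[1:]:
    running = ema_update(running, s)

print(low_beta_choice, high_beta_choice, round(running, 2))
```

With a small β the latency term barely changes the ranking, so a single noisy score sample can flip the choice; averaging with older samples damps that noise, which is consistent with EMA helping most in the small-β setting.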

Discussion
In this paper, we introduced a novel mixed-precision quantization method called AMED. The proposed method relies on a novel meta-bit allocation strategy that finds an optimal bitwidth among different neural network layers by measuring direct signals from a hardware accelerator simulator. Our method significantly reduces the required exploration space compared to previous mixed-precision methods, owing to the simplicity of the low-degree objective. The extensive evaluations we performed demonstrated the superiority of our method on standard image classification benchmarks in terms of the accuracy-latency trade-off compared to the prior state of the art. The ability to obtain higher accuracy in a shorter training time results in a shorter time to market.
Future work to examine this method for efficient NAS could yield a computationally efficient search, which will reduce the carbon footprint of the procedure and reveal better models in terms of inference time.
Another intriguing future topic is the exploration of different loss terms for solving dense prediction tasks such as semantic segmentation [56,57].
Improvements in the simulator, such as enabling dynamic workflow based on the computational graph, or support for special hardware solutions such as fast sparse matrix multiplication [58], combined with pruning, could result in very promising outcomes.

Future Directions
Computationally efficient NAS: Integrating AMED with efficient Neural Architecture Search (NAS) techniques has the potential to yield a significantly more computationally efficient search process. This would not only reduce the carbon footprint associated with the search procedure, but also potentially uncover models with superior inference times.
Exploration of diverse loss terms: The ability to directly measure various empirical values within the deployment system through the hardware simulator opens doors for exploring a much broader range of loss terms. Unlike previous approaches, differentiability is no longer a prerequisite for loss terms, allowing us to consider factors like power consumption, memory usage, and bandwidth within the existing deployment environment.
Enhanced simulator capabilities: Further improvements to the simulator, such as enabling dynamic workflows based on the computational graph or incorporating support for specialized hardware solutions like fast sparse matrix multiplication [58] alongside pruning techniques, hold immense promise for achieving groundbreaking results.

Appendix B. Bit Allocations
In Figure A1, we show the bit allocation for ResNet-50 under different hardware setups, and in Figure 6, we use different values of β for ResNet-18.
One can see that both significantly shift the bit allocation towards a more efficient network when the memory boundary is closer or when we increase the weight of the hardware constraints in the objective.

Figure A2. Visualization of the loss surface of two subsequent layers of ResNet-18. At a higher bitwidth (left), the interactions between layers are relatively small, making layerwise optimization possible. On the other hand, with a bitwidth decrease (right), with an increase in the quantization loss, the interactions become tangible and the loss is higher. A per-layer optimization depends on the initial point and is potentially sub-optimal.

Appendix C. Reports
The simulator provides the following detailed reports about the performance during the inference of a CNN:

Figure 1. ResNet-50 quantized models on a latency-accuracy plane. Circles are mixed-precision quantization, and triangles are uniform quantization. Our models achieve a better Pareto curve of dominant solutions in the two-dimensional plane for ultra-low precision.

Figure 2. A diagram of DNN architecture activation maps.
$$\frac{b}{\text{M-BW} \times \text{word size} \times f}$$

where
- C denotes the compute cycles;
- f denotes the clock speed;
- b denotes the number of bits required for the specific SRAM;
- M-BW denotes the memory bandwidth;
- the word size is assumed to be 16 bits.
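As a rough illustration of the quantities defined above, the sketch below evaluates a compute time of C / f and a memory-transfer time of b / (M-BW × word size × f). It assumes that the memory bandwidth is expressed in words per cycle and the clock speed in Hz; both unit choices, the function names, and the example values are assumptions made for this sketch. The two quantities are kept separate because the way they are combined depends on the full latency expression.

```python
import math


def compute_time_s(cycles: int, clock_hz: float) -> float:
    """Time spent on compute: C / f."""
    return cycles / clock_hz


def sram_transfer_time_s(bits: int,
                         mem_bw_words_per_cycle: float,
                         clock_hz: float,
                         word_size_bits: int = 16) -> float:
    """Time to move `bits` through memory: b / (M-BW * word size * f)."""
    return bits / (mem_bw_words_per_cycle * word_size_bits * clock_hz)


# Hypothetical layer: one million values, 10 words/cycle, 1 GHz clock.
n_values = 1_000_000
t_compute = compute_time_s(cycles=1_000_000, clock_hz=1e9)
t_mem_8bit = sram_transfer_time_s(bits=8 * n_values,
                                  mem_bw_words_per_cycle=10, clock_hz=1e9)
t_mem_4bit = sram_transfer_time_s(bits=4 * n_values,
                                  mem_bw_words_per_cycle=10, clock_hz=1e9)

print(t_compute, t_mem_8bit, t_mem_4bit)
```

Because b scales linearly with the bitwidth, halving the precision of a layer halves its transfer term, which is why memory-bound layers benefit most from fewer bits.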
Figure 5. Quantization bit allocation of MobileNetV2 following our method using the simulator. The top figure is the SCALE-Sim setup; the middle is the Eyeriss setup; the bottom is SCALE-Sim with low memory. Depthwise convolutions have larger feature maps and, thus, a higher memory footprint, and we can see that Algorithm 2 allocates fewer bits when the system memory is low, i.e., the model is memory-bounded. Models with higher memory allocate the bits differently due to the locality of the boundary (memory or computational) by the layer. This figure does not include the first and last layers, which we quantize to 8 bits.

Figure 6. Quantization bit allocation of ResNet-18 following our method using the simulator. Both are for the SCALE-Sim setup. The top figure is for β = 1, and the bottom is for β = 10. When choosing a higher value for β, the algorithm chooses lower precision for the model for the same hardware constraint.

Figure 7. ResNet-18 quantized models on a latency-accuracy plane. Circles are mixed precision, and triangles are uniform quantization. Our models achieved a better Pareto curve of the dominant solution in the two-dimensional plane for ultra-low precision.

Figure 8. MobileNetV2 quantized models on a latency-accuracy plane. Circles are mixed precision, and triangles are uniform quantization. Our models achieved a better Pareto curve of the dominant solution in the two-dimensional plane for ultra-low precision.

Figure A1. Quantization bit allocation of ResNet-50 following our method using the simulator. The top figure is the SCALE-Sim setup; the middle is the Eyeriss setup; the bottom is SCALE-Sim with low memory.

Table 3. Performance comparison of different hyperparameters of ResNet-18 on CIFAR100. We used the SCALE-Sim simulator described in Table 1, and the latency is normalized to one image inference.

Table A1. The same comparison as in Table 2 and the same notations, but with the latency from the simulator of the Eyeriss setup from Table 1. Note that our model yields a different bitwidth for each layer because the signal from the hardware is different in this setup.

• Computation report: Provides layerwise details about Total Cycles, Stall Cycles, Overall Utilization, Mapping Efficiency, and Computation Utilization. An example is shown in Table A2.
• Bandwidth report: Provides layerwise details about Average IFMAP SRAM Bandwidth, Average FILTER SRAM Bandwidth, Average OFMAP SRAM Bandwidth, Average IFMAP DRAM Bandwidth, Average FILTER DRAM Bandwidth, and Average OFMAP DRAM Bandwidth.
• Detailed access report: Provides layerwise details about the number of reads and writes, and the start and stop cycles, for both of the above-mentioned reports.
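A layerwise computation report of this kind can be reduced to model-level quantities with a few lines of code. The sketch below assumes a CSV layout whose column headers match the field names listed above; the actual file layout produced by the simulator may differ, and the layer names and values are hypothetical.

```python
import csv
import io

# Hypothetical report contents; the column names follow the fields listed above.
REPORT_CSV = """Layer,Total Cycles,Stall Cycles,Overall Utilization
conv1,1000,100,80.0
conv2,3000,500,60.0
"""


def summarize(report_csv: str) -> dict:
    """Sum cycles over all layers and compute the fraction spent stalled."""
    rows = list(csv.DictReader(io.StringIO(report_csv)))
    total = sum(int(r["Total Cycles"]) for r in rows)
    stalls = sum(int(r["Stall Cycles"]) for r in rows)
    return {"total_cycles": total, "stall_fraction": stalls / total}


summary = summarize(REPORT_CSV)
print(summary)
```

The summed cycle count is the kind of direct hardware signal that can be fed back into the bit-allocation objective.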

Table A2. Example of the SCALE-Sim simulator computation report, similar to [49], for MobileNetV2 uniformly quantized to 2 bits (not including the FC layer).