# AMED: Automatic Mixed-Precision Quantization for Edge Devices


## Abstract


## 1. Introduction

- A novel framework for mixed-precision quantization of DNNs that treats precision reduction as a Markov process.
- A quality score that represents the accuracy–latency trade-off with respect to the hardware constraint. This allows us to create custom-fit solutions for a range of device-specific hardware constraints via direct hardware signals in the training procedure.
- Extensive experiments conducted on different hardware setups with different models on standard image classification benchmarks (CIFAR100, ImageNet). These outperform previous methods in terms of the accuracy–latency trade-off.
- A proposed modular framework, i.e., the sampling method, hardware properties, accelerator simulator, and neural network architecture are all independent modules, making it applicable to any given case.

## 2. Materials

#### Multi-Objective Optimization

- The weighted sum vector $\alpha $ must be determined beforehand and requires a grid search or meta-learning, both of which can be costly and have difficulty converging.
- Not all objectives can be optimized via the same optimization scheme. For example, a gradient-based optimizer cannot be used when only some of the objectives are differentiable.
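To make the first point concrete, the following sketch (our illustration, not the paper's formulation) shows weighted-sum scalarization of a task loss and a latency objective, and why the weight vector $\alpha$ must be fixed before optimization: different choices of $\alpha$ prefer different candidates.

```python
def weighted_sum(objectives, alpha):
    """Scalarize a vector of objectives with a fixed weight vector alpha
    (illustrative sketch; not the paper's exact formulation)."""
    assert len(objectives) == len(alpha)
    return sum(a * f for a, f in zip(alpha, objectives))

# Two hypothetical candidate models: (task loss, latency in ms).
slow_accurate = (0.25, 40.0)
fast_inaccurate = (0.30, 10.0)

# The preferred candidate flips with alpha, which is why alpha must be
# tuned (e.g., by grid search or meta-learning) before optimization starts.
prefers_fast = weighted_sum(fast_inaccurate, (1.0, 0.01)) < weighted_sum(slow_accurate, (1.0, 0.01))
prefers_accurate = weighted_sum(fast_inaccurate, (1.0, 0.0001)) > weighted_sum(slow_accurate, (1.0, 0.0001))
```

With a latency weight of 0.01 the fast model wins; with 0.0001 the accurate one does, so no single fixed $\alpha$ covers all deployment targets.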

## 3. Literature Review

#### Quantized Neural Networks

**PTQ** uses a small calibration set to obtain the optimal quantization parameters and does not require (or cannot use) the entire dataset. **QAT** performs the quantization during the training process and, thus, uses the entire training corpus.

## 4. Method

#### 4.1. Problem Definition

- We introduce a penalty term that becomes increasingly negative as the quantized model’s accuracy falls below a user-specified threshold. This ensures that solutions prioritize models meeting the desired accuracy level.
- We incorporate a hard constraint on the model size. This constraint acts as a filter, preventing the algorithm from exploring solutions that exceed a user-defined maximum size for the quantized model.
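The two mechanisms above can be sketched as a single scoring function; the exact functional form below is our own illustration (the paper's quality score differs), but it shows the soft accuracy penalty and the hard size filter acting together.

```python
def quality_score(acc, latency_ms, size_mb, acc_threshold, max_size_mb, beta=1.0):
    """Toy accuracy-latency quality score (hypothetical form, for illustration).

    - Hard constraint: allocations over the size budget are filtered out
      by returning -inf, so the search never explores them.
    - Soft penalty: grows as accuracy drops below the user threshold.
    """
    if size_mb > max_size_mb:                       # hard model-size filter
        return float("-inf")
    penalty = min(0.0, acc - acc_threshold) * 10.0  # increasingly negative below threshold
    return acc - beta * latency_ms + penalty
```

An oversized model scores `-inf` regardless of accuracy, while a below-threshold model is merely penalized, so it can still be visited during exploration.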

#### 4.2. Multivariate Markov Chain

**Algorithm 1.** Random-walk Metropolis–Hastings step.

Input: $\widehat{\mathcal{Q}}$, ${\mathbf{A}}_{i}$

${\mathbf{A}}_{*}=\arg\max \widehat{\mathcal{Q}}$ ▹ Axis 2

$\alpha =\frac{\widehat{\mathcal{Q}}\left[{\mathbf{A}}_{*}\right]}{\widehat{\mathcal{Q}}\left[{\mathbf{A}}_{i}\right]}$ ▹ $\alpha$ is the layerwise acceptance ratio

if $0\le \alpha \le 1$ then ▹ Element-wise

$\mathbf{B}\sim \mathrm{Bern}\left(\alpha \right)$

${\mathbf{A}}_{i+1}={\mathbf{A}}_{*}\mathbf{B}+{\mathbf{A}}_{i}(1-\mathbf{B})$

else

${\mathbf{A}}_{i+1}={\mathbf{A}}_{*}$

end if

1. **Candidate Generation:** A candidate vector, denoted by ${\mathbf{A}}_{*}$, is proposed for the next allocation.
2. **Layerwise acceptance ratio:** A layerwise acceptance ratio, $\alpha$, is calculated.
3. **Bernoulli-Based Acceptance:**
    - If $\alpha \le 1$, a Bernoulli distribution with probability $\alpha$ is used for acceptance:
        - If the Bernoulli trial succeeds, ${\mathbf{A}}_{*}$ is accepted.
        - Otherwise, the current allocation is retained.
    - If $\alpha \ge 1$, ${\mathbf{A}}_{*}$ is automatically accepted.
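The element-wise accept/reject step can be sketched as follows. This is our illustration of the update in Algorithm 1, with `q_i` and `q_star` standing in for the per-layer quality estimates $\widehat{\mathcal{Q}}[{\mathbf{A}}_{i}]$ and $\widehat{\mathcal{Q}}[{\mathbf{A}}_{*}]$ (hypothetical inputs, not the paper's API):

```python
import numpy as np

rng = np.random.default_rng(0)

def mh_step(A_i, A_star, q_i, q_star):
    """One element-wise random-walk Metropolis-Hastings update of a
    layerwise bit allocation (a sketch of Algorithm 1)."""
    alpha = q_star / q_i                              # layerwise acceptance ratio
    # Bernoulli(min(alpha, 1)): alpha >= 1 is always accepted,
    # alpha < 1 is accepted with probability alpha, per layer.
    accept = rng.random(A_i.shape) < np.minimum(alpha, 1.0)
    return np.where(accept, A_star, A_i)
```

Because the ratio is computed per layer, one step can accept the candidate bitwidth for some layers while keeping the current bitwidth for others.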

**Exploration Phase:** Accepts a higher proportion of new candidates, encouraging broad space exploration.

**Exploitation Phase:** Preferentially accepts candidates that leverage knowledge from previously discovered samples.

#### 4.3. Quantizer

1. Scaling and rounding:
$${M}_{int}=\mathrm{round}\left(\frac{M}{S}\right)$$
Here, ${M}_{int}$ represents the integer version of $M$, obtained by scaling $M$ by a factor $S$ (often referred to as the scaling factor) and then rounding the result.
2. Clamping:
$$\overline{M}=\mathrm{clamp}({M}_{int},min,max)$$
The scaled and rounded integer ${M}_{int}$ is then clamped to the valid range representable by $b$ bits. This ensures the quantized values stay within the intended range.
    - For symmetric quantization, $min=-{2}^{b-1}+1$ and $max={2}^{b-1}-1$.
    - For asymmetric quantization, $min=0$ and $max={2}^{b}-1$.
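The two steps above can be sketched as a single function; the function and argument names are ours, assuming the symmetric/asymmetric ranges given above:

```python
import numpy as np

def quantize(M, S, b, symmetric=True):
    """Scale, round, and clamp a tensor to a b-bit integer grid
    (a minimal sketch of the quantizer described in Section 4.3)."""
    M_int = np.round(M / S)                 # step 1: scale and round
    if symmetric:                           # step 2: clamp to the b-bit range
        lo, hi = -2 ** (b - 1) + 1, 2 ** (b - 1) - 1
    else:
        lo, hi = 0, 2 ** b - 1
    return np.clip(M_int, lo, hi)

# 4-bit symmetric quantization: values land in [-7, 7].
q = quantize(np.array([0.12, 5.0, -5.0]), S=0.1, b=4)
```

Out-of-range values such as 5.0 (which scales to 50) are saturated at the clamping bound rather than wrapped.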

#### 4.4. Simulator

#### 4.4.1. Underlying Architecture Modeled

#### 4.4.2. Simulator Approximations

`Optimal Data Flow Assumptions`: The simulator models specific types of dataflows—Output Stationary (OS), Weight Stationary (WS), or Input Stationary (IS)—and assumes an ideal scenario where outputs can be transferred out of the compute array without stalling the compute operations. In real-world implementations, such smooth operations might not always be feasible, potentially leading to a higher actual runtime.

`Memory Interaction`: The simulator models the memory hierarchy simplistically, assuming a double-buffered setup to hide memory access latencies. This model may not fully capture the complex interactions and potential bottlenecks of real memory systems.

#### 4.4.3. Using the Simulator

**SRAM Utilization Estimation:**

**Simulator Output:**

- Compute cycles;
- Average bandwidths for DRAM accesses (input feature map, filters, output feature map);
- Stall cycles;
- Memory utilization (potentially improved in future work);
- Other details specified in Appendix C.

**Extracting Latency Metrics:**

- $C$ denotes the compute cycles;
- $f$ denotes the clock speed;
- $b$ denotes the number of bits required for the specific SRAM;
- $M\text{-}BW$ denotes the memory bandwidth;
- The word size is assumed to be 16 bits.
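With these quantities, the compute-side latency follows directly from the cycle count and clock speed. The helper below is our sketch (the paper's full latency model also accounts for memory traffic); the example numbers are layer 0 of Table A2 and the 0.2 GHz SCALE-Sim clock of Table 1.

```python
def compute_latency_ms(total_cycles, clock_hz):
    """Latency contribution of compute: cycles C divided by clock speed f,
    reported here in milliseconds."""
    return total_cycles / clock_hz * 1e3

# Layer 0 of the 2-bit MobileNetV2 report at a 0.2 GHz clock:
layer0_ms = compute_latency_ms(137_148, 0.2e9)   # ~0.686 ms
```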

#### 4.5. Training and Quantizing with AMED

**Algorithm 2.** Training procedure of AMED.

Input: dataset $\mathbf{D}(x,y)$, model $\theta$, params $\beta ,\gamma$, simulator $\mathbf{S}$

${\mathbf{A}}_{0}^{i}=8,\;\forall i\in L$

${\theta}_{0}={\theta}_{{\mathbf{A}}_{0}}$

$\widehat{\mathcal{Q}}\sim U(2,8)$ ▹ uniform over all bit representations

Fit ${\theta}_{0}$

Compute reference ${\widehat{\mathcal{L}}}_{CE}({\theta}_{0},\mathbf{D})$, ${\widehat{\mathcal{L}}}_{Lat}(\mathbf{S},{\theta}_{0})$

for $i$ in epochs do

Evaluate ${\mathcal{L}}_{CE}({\theta}_{i},\mathbf{D})$, ${\mathcal{L}}_{Lat}(\mathbf{S},{\theta}_{i})$

Update $\widehat{\mathcal{Q}}$ ▹ from (7)

Update ${\mathbf{A}}_{i}$ ▹ by Algorithm 1

Quantize the model ${\theta}_{{\mathbf{A}}_{i}}$

Fit ${\theta}_{{\mathbf{A}}_{i}}$

end for
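The control flow of Algorithm 2 can be sketched as a plain loop. Every callable below is a hypothetical placeholder for one of the paper's components: `evaluate` returns the task and latency losses, `update_q` refreshes the quality table (Equation (7)), `sample_allocation` is the Metropolis-Hastings step of Algorithm 1, and `fit` fine-tunes the quantized model.

```python
def amed_train(num_layers, epochs, evaluate, update_q, sample_allocation, fit):
    """Skeleton of the AMED training loop (our sketch of Algorithm 2)."""
    A = [8] * num_layers                  # A_0: every layer starts at 8 bits
    fit(A)                                # fit the initial 8-bit model
    for _ in range(epochs):
        losses = evaluate(A)              # L_CE and L_Lat for the current model
        Q = update_q(A, losses)           # refresh the quality estimates
        A = sample_allocation(Q, A)       # propose/accept new bitwidths
        fit(A)                            # fine-tune under allocation A_i
    return A
```

The modularity claimed in the contributions shows up here: the sampler, simulator-backed evaluation, and network fitting only meet through these four callables.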

## 5. Results

**Table 2.** Performance comparison with state-of-the-art methods on ImageNet; the most notable results are in bold. ${N}_{\mathrm{MP}}$ indicates mixed precision with N as the minimum allowed bitwidth. $\psi$ denotes our re-implementation, with pretrained FP32 weights from [53]; we present it only when it achieved better results than the original paper. When the original paper did not publish the model's bit allocation, we could not run the simulator to obtain the latency or calculate the model size.

Network | Method | Bitwidth | Acc (%) | Latency (ms) | Size (MB)
---|---|---|---|---|---
ResNet-18 | FP32 | 32/32 | 71.1 | 92.34 | 43.97
 | $\mathrm{LSQ}_{\psi}$ | 8/8 | 71.0 | 34.97 | 14.68
 | $\mathrm{LSQ}_{\psi}$ | 4/4 | 68.73 | 24.4 | 7.34
 | HAWQv2 | ${4}_{\mathrm{MP}}/{4}_{\mathrm{MP}}$ | 70.22 | 32.83 | 8.52
 | MCKP | ${3}_{\mathrm{MP}}/{4}_{\mathrm{MP}}$ | 69.66 | 28.05 | 7.66
 | DDQ | ${4}_{\mathrm{MP}}/{4}_{\mathrm{MP}}$ | 71.2 | 29.03 | 7.83
 | LIMPQ | ${3}_{\mathrm{MP}}/{3}_{\mathrm{MP}}$ | 69.7 | 20.03 | 7.55
 | LIMPQ | ${4}_{\mathrm{MP}}/{4}_{\mathrm{MP}}$ | 70.8 | 33.05 | 8.52
 | AMED (${\beta}_{1}$) | ${2}_{\mathrm{MP}}/{2}_{\mathrm{MP}}$ | 70.87 | 9.3 | 7.87
 | AMED (${\beta}_{2}$) | ${2}_{\mathrm{MP}}/{2}_{\mathrm{MP}}$ | 67.77 | 5.07 | 7.06
ResNet-50 | FP32 | 32/32 | 80.1 | 263.64 | 102.06
 | $\mathrm{LSQ}_{\psi}$ | 8/8 | 79.9 | 101.4 | 20.74
 | $\mathrm{LSQ}_{\psi}$ | 4/4 | 78.3 | 55.84 | 10.37
 | $\mathrm{LSQ}_{\psi}$ | 3/3 | 77.6 | 25.44 | 7.79
 | MCKP | ${2}_{\mathrm{MP}}/{4}_{\mathrm{MP}}$ | 75.28 | 46.42 | 7.96
 | HAQ | ${3}_{\mathrm{MP}}/{3}_{\mathrm{MP}}$ | 75.3 | — | —
 | HAWQv2 | ${2}_{\mathrm{MP}}/{4}_{\mathrm{MP}}$ | 76.1 | 86.51 | 10.13
 | LIMPQ | ${3}_{\mathrm{MP}}/{4}_{\mathrm{MP}}$ | 76.9 | 32.51 | 8.11
 | AMED (${\beta}_{1}$) | ${2}_{\mathrm{MP}}/{2}_{\mathrm{MP}}$ | 79.23 | 37.52 | 11.89
 | AMED (${\beta}_{2}$) | ${2}_{\mathrm{MP}}/{2}_{\mathrm{MP}}$ | 78.5 | 34.47 | 7.75
MobileNetV2 | FP32 | 32/32 | 71.80 | 104.34 | 17.86
 | $\mathrm{LSQ}_{\psi}$ | 8/8 | 71.6 | 39.52 | 12.54
 | MCKP | ${2}_{\mathrm{MP}}/8$ | 71.2 | 22.42 | 9.82
 | HAQ | ${3}_{\mathrm{MP}}/{3}_{\mathrm{MP}}$ | 66.99 | — | —
 | HAQ | ${4}_{\mathrm{MP}}/{4}_{\mathrm{MP}}$ | 71.47 | 15.89 | 10.47
 | DDQ | ${4}_{\mathrm{MP}}/{4}_{\mathrm{MP}}$ | 71.8 | 29.25 | 10.217
 | PROFIT | ${4}_{\mathrm{MP}}/{4}_{\mathrm{MP}}$ | 71.5 | — | —
 | LSQ + BR | 3/3 | 67.4 | 11.96 | 11.429
 | AMED (${\beta}_{1}$) | ${2}_{\mathrm{MP}}/{2}_{\mathrm{MP}}$ | 71.29 | 15.01 | 6.34
 | AMED (${\beta}_{2}$) | ${2}_{\mathrm{MP}}/{2}_{\mathrm{MP}}$ | 71.2 | 11.85 | 10.12

#### Ablation Study

## 6. Discussion

#### Future Directions

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Meaning
---|---
FLOP | floating-point operations
MAC | multiply–accumulate
DNNs | deep neural networks
MP | mixed-precision
FP32 | floating-point with 32 bits
NAS | Neural Architecture Search
CE | cross-entropy
PTQ | Post-Training Quantization
QAT | Quantization-Aware Training
PE | processing element
SRAM | static random-access memory

## Appendix A. Additional Experiments

Network | Method | Bitwidth | Acc (%) | Latency (ms) | Size (MB)
---|---|---|---|---|---
ResNet-18 | FP32 | 32/32 | 71.1 | — | 43.97
 | $\mathrm{LSQ}_{\psi}$ | 8/8 | 70.0 | 3.42 | 14.68
 | $\mathrm{LSQ}_{\psi}$ | 4/4 | 68.73 | 0.85 | 7.34
 | HAWQv2 | ${4}_{\mathrm{MP}}/{4}_{\mathrm{MP}}$ | 70.22 | 2.94 | 8.52
 | MCKP | ${3}_{\mathrm{MP}}/{4}_{\mathrm{MP}}$ | 69.66 | 2.13 | 7.66
 | DDQ | ${4}_{\mathrm{MP}}/{4}_{\mathrm{MP}}$ | 71.2 | 2.22 | 7.83
 | AMED | ${2}_{\mathrm{MP}}/{2}_{\mathrm{MP}}$ | 70.84 | 0.32 | 6.16
 | AMED | ${2}_{\mathrm{MP}}/{2}_{\mathrm{MP}}$ | 71.0 | 0.55 | 6.6
ResNet-50 | FP32 | 32/32 | 80.1 | — | 102.06
 | $\mathrm{LSQ}_{\psi}$ | 8/8 | 79.9 | 9.65 | 20.74
 | $\mathrm{LSQ}_{\psi}$ | 4/4 | 78.3 | 2.42 | 10.37
 | $\mathrm{LSQ}_{\psi}$ | 3/3 | 77.6 | 0.8 | 7.79
 | MCKP | ${2}_{\mathrm{MP}}/{4}_{\mathrm{MP}}$ | 75.28 | 3.22 | 7.96
 | HAWQv2 | ${2}_{\mathrm{MP}}/{4}_{\mathrm{MP}}$ | 76.1 | 7.82 | 10.13
 | AMED | ${2}_{\mathrm{MP}}/{2}_{\mathrm{MP}}$ | 79.34 | 3.26 | 9.74
 | AMED | ${2}_{\mathrm{MP}}/{2}_{\mathrm{MP}}$ | 79.43 | 3.45 | 9.75
MobileNetV2 | FP32 | 32/32 | 71.80 | — | 17.86
 | $\mathrm{LSQ}_{\psi}$ | 8/8 | 71.6 | 7.8 | 12.54
 | MCKP | ${2}_{\mathrm{MP}}/8$ | 71.2 | 4.44 | 9.82
 | HAQ | ${3}_{\mathrm{MP}}/{3}_{\mathrm{MP}}$ | 70.9 | — | —
 | HAQ | ${4}_{\mathrm{MP}}/{4}_{\mathrm{MP}}$ | 71.47 | 2.03 | 10.47
 | DDQ | ${4}_{\mathrm{MP}}/{4}_{\mathrm{MP}}$ | 71.8 | 5.87 | 10.217
 | PROFIT | ${4}_{\mathrm{MP}}/{4}_{\mathrm{MP}}$ | 71.5 | — | —
 | LSQ + BR | 3/3 | 67.4 | 3.9 | 11.429
 | AMED | ${2}_{\mathrm{MP}}/{2}_{\mathrm{MP}}$ | 71.24 | 1.56 | 12.1
 | AMED | ${2}_{\mathrm{MP}}/{2}_{\mathrm{MP}}$ | 70.97 | 1.21 | 9.53

## Appendix B. Bit Allocations

**Figure A1.**Quantization bit allocation of ResNet-50 following our method using the simulator. The top figure is the SCALE-Sim setup; the middle is the Eyeriss setup; the bottom is SCALE-Sim with low memory.

**Figure A2.** Visualization of the loss surface of two subsequent layers of ResNet-18. At a higher bitwidth (left), the interactions between layers are relatively small, making layerwise optimization possible. As the bitwidth decreases (right), the quantization loss grows, the interactions between layers become tangible, and the overall loss is higher. A per-layer optimization then depends on the initial point and is potentially sub-optimal.

## Appendix C. Reports

- Computation report: Provides layerwise details about Total Cycles, Stall Cycles, Overall Utilization, Mapping Efficiency, and Computation Utilization. An example is shown in Table A2.
- Bandwidth report: Provides layerwise details about Average IFMAP SRAM Bandwidth, Average FILTER SRAM Bandwidth, Average OFMAP SRAM Bandwidth, Average IFMAP DRAM Bandwidth, Average FILTER DRAM Bandwidth, and Average OFMAP DRAM Bandwidth.
- Detailed access report: Provides layerwise details about the number of reads and writes, and the start and stop cycles, for both of the above-mentioned reports.
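Such reports are straightforward to post-process; the snippet below sums the per-layer compute cycles from a two-layer excerpt of a computation report (the CSV column names follow the fields listed above and Table A2, though the exact headers emitted by SCALE-Sim may differ).

```python
import csv
import io

# Two-layer excerpt in the computation-report column layout (values from Table A2).
report = """\
LayerID,Total Cycles,Stall Cycles,Overall Util %,Mapping Efficiency %,Compute Util %
0,137148,0,16.10753928,53.09483493,16.10742183
1,132649,0,43.61369102,53.00234993,43.61336223
"""

rows = list(csv.DictReader(io.StringIO(report)))
# Total compute cycles across layers: the quantity fed into the
# latency extraction of Section 4.4.3.
total_cycles = sum(int(r["Total Cycles"]) for r in rows)
```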

**Table A2.**An example of the SCALE-Sim simulator computation report similar to [17] for MobileNetV2 uniformly quantized to 2 bits (not including the FC layer).

LayerID | Total Cycles | Stall Cycles | Overall Util % | Mapping Efficiency % | Compute Util %
---|---|---|---|---|---
0 | 137,148 | 0 | 16.10753928 | 53.09483493 | 16.10742183
1 | 132,649 | 0 | 43.61369102 | 53.00234993 | 43.61336223
2 | 36,847 | 0 | 9.574727929 | 28.125 | 9.574468085
3 | 61,151 | 0 | 15.70538503 | 76.5625 | 15.70512821
4 | 701,907 | 0 | 71.27146118 | 76.38573961 | 71.27135964
5 | 15,483 | 0 | 24.68513854 | 40.625 | 24.6835443
6 | 25,283 | 0 | 21.22176957 | 76.04166667 | 21.22093023
7 | 374,807 | 0 | 71.87994421 | 75.31844429 | 71.87975243
8 | 20,187 | 0 | 28.399465 | 40.625 | 28.39805825
9 | 25,283 | 0 | 21.22176957 | 76.04166667 | 21.22093023
10 | 374,807 | 0 | 71.87994421 | 75.31844429 | 71.87975243
11 | 5149 | 0 | 36.4002719 | 52.0625 | 36.39320388
12 | 9399 | 0 | 25.28460475 | 74.265625 | 25.28191489
13 | 157,519 | 0 | 70.24724002 | 72.76722301 | 70.24679406
14 | 6349 | 0 | 39.36052922 | 52.0625 | 39.35433071
15 | 9399 | 0 | 25.28460475 | 74.265625 | 25.28191489
16 | 157,519 | 0 | 70.24724002 | 72.76722301 | 70.24679406
17 | 6349 | 0 | 39.36052922 | 52.0625 | 39.35433071
18 | 9399 | 0 | 25.28460475 | 74.265625 | 25.28191489
19 | 157,519 | 0 | 70.24724002 | 72.76722301 | 70.24679406
20 | 3555 | 0 | 34.11392405 | 45.1171875 | 34.10433071
21 | 6173 | 0 | 38.2998542 | 75.390625 | 38.29365079
22 | 123,129 | 0 | 76.17864191 | 77.54464286 | 76.17802323
23 | 6243 | 0 | 38.8515137 | 45.1171875 | 38.84529148
24 | 6173 | 0 | 38.2998542 | 75.390625 | 38.29365079
25 | 123,129 | 0 | 76.17864191 | 77.54464286 | 76.17802323
26 | 6243 | 0 | 38.8515137 | 45.1171875 | 38.84529148
27 | 6173 | 0 | 38.2998542 | 75.390625 | 38.29365079
28 | 123,129 | 0 | 76.17864191 | 77.54464286 | 76.17802323
29 | 6243 | 0 | 38.8515137 | 45.1171875 | 38.84529148
30 | 6173 | 0 | 38.2998542 | 75.390625 | 38.29365079
31 | 123,129 | 0 | 76.17864191 | 77.54464286 | 76.17802323
32 | 6243 | 0 | 57.68861124 | 66.9921875 | 57.6793722
33 | 11,059 | 0 | 48.01858215 | 79.0234375 | 48.01424051
34 | 262,299 | 0 | 80.32093146 | 81.28125 | 80.32062524
35 | 8931 | 0 | 60.48874706 | 66.9921875 | 60.48197492
36 | 11,059 | 0 | 48.01858215 | 79.0234375 | 48.01424051
37 | 262,299 | 0 | 80.32093146 | 81.28125 | 80.32062524
38 | 8931 | 0 | 60.48874706 | 66.9921875 | 60.48197492
39 | 11,059 | 0 | 48.01858215 | 79.0234375 | 48.01424051
40 | 262,299 | 0 | 80.32093146 | 81.28125 | 80.32062524
41 | 3827 | 0 | 58.33714398 | 64.59960938 | 58.32190439
42 | 7103 | 0 | 51.84649092 | 71.92687988 | 51.83919271
43 | 139,231 | 0 | 72.87237576 | 73.39477539 | 72.87185238
44 | 6131 | 0 | 60.69054803 | 64.59960938 | 60.68065068
45 | 7103 | 0 | 51.84649092 | 71.92687988 | 51.83919271
46 | 139,231 | 0 | 72.87237576 | 73.39477539 | 72.87185238
47 | 6131 | 0 | 60.69054803 | 64.59960938 | 60.68065068
48 | 7103 | 0 | 51.84649092 | 71.92687988 | 51.83919271
49 | 139,231 | 0 | 72.87237576 | 73.39477539 | 72.87185238
50 | 12,263 | 0 | 60.31099649 | 64.20084635 | 60.30607877
51 | 16,043 | 0 | 61.18127844 | 73.03059896 | 61.1774651
52 | 21,471 | 0 | 2.916724885 | 3.057861328 | 2.916589046

## References

1. Lebedev, V.; Ganin, Y.; Rakhuba, M.; Oseledets, I.; Lempitsky, V.S. Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition. arXiv 2015, arXiv:1412.6553.
2. Ullrich, K.; Meeds, E.; Welling, M. Soft Weight-Sharing for Neural Network Compression. arXiv 2017, arXiv:1702.04008.
3. Chmiel, B.; Baskin, C.; Zheltonozhskii, E.; Banner, R.; Yermolin, Y.; Karbachevsky, A.; Bronstein, A.M.; Mendelson, A. Feature Map Transform Coding for Energy-Efficient CNN Inference. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–9.
4. Baskin, C.; Chmiel, B.; Zheltonozhskii, E.; Banner, R.; Bronstein, A.M.; Mendelson, A. CAT: Compression-Aware Training for bandwidth reduction. J. Mach. Learn. Res. 2021, 22, 1–20.
5. Han, S.; Pool, J.; Tran, J.; Dally, W.J. Learning both Weights and Connections for Efficient Neural Network. arXiv 2015, arXiv:1506.02626.
6. Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; Li, H.H. Learning Structured Sparsity in Deep Neural Networks. In Proceedings of the NIPS, Barcelona, Spain, 9 December 2016.
7. Liu, H.; Simonyan, K.; Yang, Y. DARTS: Differentiable Architecture Search. arXiv 2019, arXiv:1806.09055.
8. Wu, B.; Dai, X.; Zhang, P.; Wang, Y.; Sun, F.; Wu, Y.; Tian, Y.; Vajda, P.; Jia, Y.; Keutzer, K. FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10726–10734.
9. Cai, H.; Zhu, L.; Han, S. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. arXiv 2019, arXiv:1812.00332.
10. Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; Zou, Y. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv 2016, arXiv:1606.06160.
11. Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. J. Mach. Learn. Res. 2018, 18, 1–30.
12. Choi, J.; Wang, Z.; Venkataramani, S.; Chuang, P.I.J.; Srinivasan, V.; Gopalakrishnan, K. PACT: Parameterized Clipping Activation for Quantized Neural Networks. arXiv 2018, arXiv:1805.06085.
13. Sze, V.; Chen, Y.H.; Yang, T.J.; Emer, J.S. How to Evaluate Deep Neural Network Processors: TOPS/W (Alone) Considered Harmful. IEEE Solid-State Circuits Mag. 2020, 12, 28–41.
14. Karbachevsky, A.; Baskin, C.; Zheltonozhskii, E.; Yermolin, Y.; Gabbay, F.; Bronstein, A.M.; Mendelson, A. Early-Stage Neural Network Hardware Performance Analysis. Sustainability 2021, 13, 717.
15. Apple. Apple Describes 7 nm A12 Bionic Chips; EENews: Washington, DC, USA, 2018.
16. Nvidia. Nvidia Docs Hub: Train With Mixed Precision. 2023. Available online: https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html (accessed on 28 January 2023).
17. Samajdar, A.; Joseph, J.M.; Zhu, Y.; Whatmough, P.; Mattina, M.; Krishna, T. A systematic methodology for characterizing scalability of DNN accelerators using SCALE-sim. In Proceedings of the 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Boston, MA, USA, 23–25 August 2020; pp. 58–68.
18. Sharma, H.; Park, J.; Suda, N.; Lai, L.; Chau, B.; Kim, J.K.; Chandra, V.; Esmaeilzadeh, H. Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Network. In Proceedings of the 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), Los Angeles, CA, USA, 1–6 June 2018; pp. 764–775.
19. Wang, K.; Liu, Z.; Lin, Y.; Lin, J.; Han, S. HAQ: Hardware-Aware Automated Quantization. arXiv 2018, arXiv:1811.08886.
20. Dong, Z.; Yao, Z.; Gholami, A.; Mahoney, M.W.; Keutzer, K. HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 293–302.
21. Sun, M.; Li, Z.; Lu, A.; Li, Y.; Chang, S.E.; Ma, X.; Lin, X.; Fang, Z. FILM-QNN: Efficient FPGA Acceleration of Deep Neural Networks with Intra-Layer, Mixed-Precision Quantization. In Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Virtual, 27 February–1 March 2022.
22. Sun, J.; Li, G. An End-to-End Learning-based Cost Estimator. arXiv 2019, arXiv:1906.02560.
23. Strubell, E.; Ganesh, A.; McCallum, A. Energy and Policy Considerations for Deep Learning in NLP. arXiv 2019, arXiv:1906.02243.
24. Srinivas, N.; Deb, K. Muiltiobjective Optimization Using Nondominated Sorting in Genetic Algorithms. Evol. Comput. 1994, 2, 221–248.
25. Deb, K. Multi-Objective Optimization Using Evolutionary Algorithms; John Wiley & Sons: Hoboken, NJ, USA, 2001.
26. Li, H.; De, S.; Xu, Z.; Studer, C.; Samet, H.; Goldstein, T. Training Quantized Nets: A Deeper Understanding. In Proceedings of the NIPS, Long Beach, CA, USA, 4–9 December 2017.
27. Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized Neural Networks. arXiv 2016, arXiv:1602.02505.
28. Rozen, T.; Kimhi, M.; Chmiel, B.; Mendelson, A.; Baskin, C. Bimodal Distributed Binarized Neural Networks. arXiv 2022, arXiv:2204.02004.
29. Zhang, D.; Yang, J.; Ye, D.; Hua, G. LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks. arXiv 2018, arXiv:1807.10029.
30. Baskin, C.; Liss, N.; Chai, Y.; Zheltonozhskii, E.; Schwartz, E.; Giryes, R.; Mendelson, A.; Bronstein, A.M. NICE: Noise Injection and Clamping Estimation for Neural Network Quantization. arXiv 2021, arXiv:1810.00162.
31. Esser, S.K.; McKinstry, J.L.; Bablani, D.; Appuswamy, R.; Modha, D.S. Learned Step Size Quantization. arXiv 2020, arXiv:1902.08153.
32. Han, T.; Li, D.; Liu, J.; Tian, L.; Shan, Y. Improving Low-Precision Network Quantization via Bin Regularization. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 5241–5250.
33. Gong, R.; Liu, X.; Jiang, S.; Li, T.H.; Hu, P.; Lin, J.; Yu, F.; Yan, J. Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4851–4860.
34. Zur, Y.; Baskin, C.; Zheltonozhskii, E.; Chmiel, B.; Evron, I.; Bronstein, A.M.; Mendelson, A. Towards Learning of Filter-Level Heterogeneous Compression of Convolutional Neural Networks. arXiv 2019, arXiv:1904.09872.
35. Zhao, S.; Yue, T.; Hu, X. Distribution-aware Adaptive Multi-bit Quantization. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 9277–9286.
36. Yang, H.; Duan, L.; Chen, Y.; Li, H. BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization. arXiv 2021, arXiv:2102.10462.
37. Yang, L.; Jin, Q. FracBits: Mixed Precision Quantization via Fractional Bit-Widths. In Proceedings of the AAAI, Palo Alto, CA, USA, 22–24 March 2021.
38. Dong, Z.; Yao, Z.; Cai, Y.; Arfeen, D.; Gholami, A.; Mahoney, M.W.; Keutzer, K. HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks. arXiv 2020, arXiv:1911.03852.
39. Chen, W.; Wang, P.; Cheng, J. Towards Mixed-Precision Quantization of Neural Networks via Constrained Optimization. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 5330–5339.
40. Zhang, Z.; Shao, W.; Gu, J.; Wang, X.; Ping, L. Differentiable Dynamic Quantization with Mixed Precision and Adaptive Resolution. arXiv 2021, arXiv:2106.02295.
41. Nahshan, Y.; Chmiel, B.; Baskin, C.; Zheltonozhskii, E.; Banner, R.; Bronstein, A.M.; Mendelson, A. Loss Aware Post-training Quantization. Mach. Learn. 2021, 110, 3245–3262.
42. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159.
43. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.
44. Ching, W.K.; Zhang, S.; Ng, M.K. On Multi-dimensional Markov Chain Models. Pac. J. Optim. 2007, 3, 235–243.
45. Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H.; Teller, E. Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. 1953, 21, 1087–1092.
46. Bengio, Y.; Léonard, N.; Courville, A.C. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arXiv 2013, arXiv:1308.3432.
47. Sandler, M.; Howard, A.G.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
48. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
49. Chen, Y.H.; Krishna, T.; Emer, J.S.; Sze, V. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE J. Solid-State Circuits 2017, 52, 127–138.
50. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
51. Tang, C.; Ouyang, K.; Wang, Z.; Zhu, Y.; Wang, Y.; Ji, W.; Zhu, W. Mixed-Precision Neural Network Quantization via Learned Layer-wise Importance. arXiv 2022, arXiv:2203.08368.
52. Park, E.; Yoo, S. PROFIT: A Novel Training Method for sub-4-bit MobileNet Models. arXiv 2020, arXiv:2008.04693.
53. Wightman, R. PyTorch Image Models. 2019. Available online: https://github.com/rwightman/pytorch-image-models (accessed on 21 April 2022).
54. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. Master's Thesis, University of Toronto, Toronto, ON, Canada, 2009. Available online: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 19 September 2021).
55. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. arXiv 2018, arXiv:1802.02611.
56. Kimhi, M.; Kimhi, S.; Zheltonozhskii, E.; Litany, O.; Baskin, C. Semi-Supervised Semantic Segmentation via Marginal Contextual Information. arXiv 2023, arXiv:2308.13900.
57. Srivastava, N.; Jin, H.; Liu, J.; Albonesi, D.H.; Zhang, Z. MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise Product. In Proceedings of the 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Athens, Greece, 17–21 October 2020; pp. 766–780.

**Figure 1.**ResNet-50 quantized models on a latency–accuracy plane. Circles are mixed-precision quantization, and triangles are uniform quantization. Our models achieve a better Pareto curve of dominant solutions in the two-dimensional plane for ultra-low precision.

**Figure 3.** Quantization of the i-th layer of the network. ${W}_{i}$ indicates the weights, ${X}_{i}$ the input activations, and $\overline{{W}_{i}},\overline{{X}_{i}}$ their quantized versions. ${S}_{{W}_{i}},{S}_{{X}_{i}}$ are the learnable quantization parameters (scales). Dynamically changing tensors are shown in blue, and parameters in orange. Weights and activations are quantized with respect to the scaling factor (with rounding and clamping as described in Section 4.3). The quantized versions are multiplied in an integer matrix multiplication accelerator, producing a quantized vector $\overline{{Z}_{i}}$. Using the scaling factors, we can dequantize it into ${Z}_{i}$, which can yield a prediction in FP, or requantize it at a different precision for the next layer.

**Figure 4.** An illustration of Algorithm 2. The $\widehat{\mathcal{Q}}$ table represents the bit-allocation vector $\mathbf{A}$.

**Figure 5.**Quantization bit allocation of MobileNetV2 following our method using the simulator. The top figure is the SCALE-Sim setup; the middle is the Eyeriss setup; the bottom is SCALE-Sim with low memory. Depthwise convolutions have a higher feature map and, thus, higher memory footprint, and we can see that Algorithm 2 allocates fewer bits when the system memory is low, i.e., the model is memory-bounded. Models with higher memory allocate the bits differently due to the locality of the boundary (memory or computational) by the layer. This figure does not include the first and last layers, which we quantize to 8 bits.

**Figure 6.**Quantization bit allocation of ResNet-18 following our method using the simulator. Both are for the SCALE-Sim setup. The top figure is for $\beta =1$, and the bottom is for $\beta =10$. When choosing a higher value for $\beta $, the algorithm chooses lower precision for the model for the same hardware constraint.

**Figure 7.**ResNet-18 quantized models on a latency–accuracy plane. Circles are mixed precision, and triangles are uniform quantization. Our models achieved a better Pareto curve of the dominant solution in the two-dimensional plane for ultra-low precision.

**Figure 8.**MobileNetV2 quantized models on a latency–accuracy plane. Circles are mixed precision, and triangles are uniform quantization. Our models achieved a better Pareto curve of the dominant solution in the two-dimensional plane for ultra-low precision.

**Table 1.** The accelerator setup for a compact accelerator based on SCALE-Sim [17], the SCALE-Sim micro-controller with low memory, and Eyeriss [49]. The properties we used in the simulator for each setup are listed in the table. Data flow indicates the stationarity (weights: "ws"; activations: "as"; output: "os"), i.e., what data should remain in the SRAM for the next computed layer. "os" writes the output directly to the input feature map SRAM.

Name | SCALE-Sim | SCALE-Sim Low Mem | Eyeriss v1
---|---|---|---
PE array height | 32 | 32 | 12
PE array width | 32 | 32 | 14
Input feature map SRAM (KB) | 64 | 4 | 108
Filter SRAM (KB) | 64 | 4 | 108
Output feature map SRAM (KB) | 64 | 4 | 108
Data flow | os | os | ws
Bandwidth (w/c) | 10 | 10 | 10
Memory banks | 1 | 1 | 1
Speed (GHz) | 0.2 | 0.1 | 0.2

**Table 3.**Performance comparison of different hyperparameters of ResNet-18 on CIFAR100. We used the SCALE-Sim simulator described in Table 1, and the latency is normalized to one image inference.

$\mathit{\beta}$ | EMA | Top-1 (%) | Top-5 (%) | Latency (ms)
---|---|---|---|---
1 | 0.9 | 75.59 | 96.15 | 10.14
1 | 0.5 | 73.08 | 92.19 | 14.68
1 | 0.2 | 77.86 | 95.09 | 14.68
1 | 0.1 | 77.58 | 92.97 | 12.47
1 | 0.01 | 78.21 | 94.97 | 8.99
100 | 0.01 | 77.29 | 91.44 | 6.15
10 | 0.01 | 78.29 | 93.97 | 8.15
1 | 0.01 | 78.52 | 94.53 | 8.98
0.1 | 0.01 | 78.72 | 96.23 | 12.61


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Kimhi, M.; Rozen, T.; Mendelson, A.; Baskin, C.
AMED: Automatic Mixed-Precision Quantization for Edge Devices. *Mathematics* **2024**, *12*, 1810.
https://doi.org/10.3390/math12121810
