# Quantization and Deployment of Deep Neural Networks on Microcontrollers


## Abstract


## 1. Introduction

## 2. State of the Art on Embedded Execution of Quantized Neural Networks

On an Intel Xeon® E5640 microprocessor with an x86 architecture, using 8-bit integer instructions instead of floating-point instructions provides an execution speedup of more than 2×, without loss of accuracy on a speech recognition problem. In this case, the training is performed using single-precision floating-point arithmetic, and the evaluation is done after the quantization of the network parameters.

## 3. Real Numbers Representation

#### 3.1. Floating-Point

#### 3.2. Fixed-Point

## 4. Training and Quantization of Deep Neural Networks

#### 4.1. Floating-Point to Fixed-Point Quantization of a Deep Neural Network

#### 4.1.1. Uniform and Non-Uniform

#### 4.1.2. Scaling and Offset

#### 4.1.3. Per-Network, Per-Layer and Per-Filter Scale Factor

#### 4.1.4. Conversion Method

#### 4.2. Post-Training Quantization

#### 4.3. Quantization-Aware Training

## 5. Deployment of Quantized Neural Network

- exporting the weights of the deep neural network and encoding them into a format suitable for on-target inference,
- generating the inference program according to the topology of the deep neural network,
- compiling the inference program,
- and then uploading the inference program with the weights onto the microcontroller’s ROM.

#### 5.1. Existing Embedded AI Frameworks

#### 5.1.1. TensorFlow Lite for Microcontrollers

#### 5.1.2. STM32Cube.AI

#### 5.1.3. Other Frameworks

#### 5.2. MicroAI: Our Framework Proposition

- open-source frameworks do not support convolutional neural networks with non-sequential topologies,
- frameworks that support convolutional neural networks are proprietary or too complex to be modified and extended easily,
- frameworks do not provide 16-bit quantization,
- some frameworks are dedicated to a limited family of hardware targets.

- The neural network training code that relies on Keras or PyTorch.
- The conversion tool (called KerasCNN2C) that takes a trained Keras model and produces portable C code for inference.

#### 5.3. MicroAI: General Flow

- The number of iterations for the experiment (for statistical purposes)
- The dataset to use for training and evaluation
- The preprocessing steps to apply to the dataset
- The framework used for training
- The various model configurations to train, deploy and evaluate
- The configuration of the optimizer
- The post-processing steps to apply to the trained model
- The target configuration for deployment and evaluation

#### 5.4. MicroAI: Training

`RawDataModel` instance, which gathers the train and test sets. This instance contains numpy arrays for the data and the labels. A higher-level data model, `HARDataModel`, is also available for Human Activity Recognition to process subjects and activities more easily. This model is then converted to a `RawDataModel` using the `DatamodelConverter` in the preprocessing phase. The preprocessing phase also includes features such as normalization. Dataset import modules for UCI-HAR, SMNIST and GTSRB (described in Section 6) are included and can easily be extended to new datasets.

- MLP: a simple multi-layer perceptron with configurable number of layers and neurons per layer.
- CNN: a 1D or 2D convolutional neural network with configurable number of layers, filters per layer, kernel and pool sizes, and number of neurons per fully connected layer for the classifier.
- ResNet: a 1D or 2D residual neural network (v1) with convolutional layers. The number of blocks and filters per layer, stride, kernel size, and optional BatchNorm can be configured.

`[[model]]` block. Each model will be trained sequentially. A common configuration for all the models can be specified in a `[model_template]` block. Model configuration also includes optimizer configuration and other parameters such as the batch size and the number of epochs.
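The relationship between the `[model_template]` block and the `[[model]]` blocks can be sketched as follows. This is only an illustration: the key names (`epochs`, `batch_size`, `filters`, `optimizer`) and the rule that a model entry overrides the template are assumptions, not the actual MicroAI configuration schema. The dictionary stands in for what a TOML parser would return.

```python
from copy import deepcopy

# Hypothetical parsed configuration. All key names are illustrative
# assumptions, not the real MicroAI schema.
config = {
    "model_template": {
        "epochs": 100,
        "batch_size": 32,
        "optimizer": {"kind": "Adam", "lr": 1e-3},
    },
    "model": [  # one entry per [[model]] block
        {"name": "resnet_16", "filters": 16},
        {"name": "resnet_32", "filters": 32, "batch_size": 64},
    ],
}


def expand_models(cfg):
    """Return one complete configuration per [[model]] entry.

    Each entry starts from a copy of the template; keys set in the entry
    replace the template values (shallow override, assumed behaviour).
    """
    template = cfg.get("model_template", {})
    expanded = []
    for entry in cfg.get("model", []):
        merged = deepcopy(template)
        merged.update(entry)
        expanded.append(merged)
    return expanded


models = expand_models(config)
for m in models:  # models would then be trained one after the other
    print(m["name"], m["filters"], m["batch_size"], m["epochs"])
```

With such a scheme, shared settings are written once and each model block lists only what differs.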

`RemoveKerasSoftmax` module. This layer is indeed useless when only inference is performed.

`QuantizationAwareTraining` module. The actual training step before post-processing is seen as a general training, before optionally performing post-training quantization or quantization-aware training. The quantization-aware training can be seen as a fine-tuning on top of the more general training (which can also be skipped if necessary). The quantization-aware training does not actually convert the weights from a floating-point data type to an integer data type with fixed-point representation. This conversion is rather performed by the KerasCNN2C conversion tool or another deployment tool.
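The idea that quantization-aware training keeps floating-point values while constraining them to the fixed-point grid can be illustrated with a quantize-then-dequantize step. The sketch below is not the MicroAI implementation: the function name `fake_quantize`, the power-of-two scale, the rounding by truncation (floor) and the default bit widths are all assumptions made for the example.

```python
import math


def fake_quantize(values, width=8, frac_bits=5):
    """Snap floats to a signed fixed-point grid and return floats again.

    width     -- total number of bits of the simulated integer type
    frac_bits -- number of fractional bits (n in a Qm.n format)
    """
    scale = 1 << frac_bits
    q_min = -(1 << (width - 1))
    q_max = (1 << (width - 1)) - 1
    out = []
    for v in values:
        q = math.floor(v * scale)        # truncation, an assumed rounding mode
        q = max(q_min, min(q_max, q))    # saturation to the integer range
        out.append(q / scale)            # back to float: the type is unchanged
    return out


# 0.30 and -1.27 are rounded down to the grid, 5.0 saturates at 127/32.
print(fake_quantize([0.30, -1.27, 5.0]))
```

The returned values are still Python floats, which mirrors the statement above that the conversion to an integer data type is left to the deployment tool.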

`LearningFramework` interface and by supplying compatible model templates.

#### 5.5. MicroAI: Deployment

#### 5.6. KerasCNN2C: Conversion Tool from Trained Keras Model to Portable C Code

`HDF5` file, a C library for the inference. It can also be used independently of the MicroAI framework.

- Add
- AveragePooling1D
- BatchNormalization
- Conv1D
- Dense
- Flatten
- MaxPooling1D
- ReLU
- SoftMax
- ZeroPadding1D

`model.h` header to run the inference process. In its signature, `number_t` is the data type used during inference, defined in the `number.h` header, and `MODEL_INPUT_CHANNELS` and `MODEL_INPUT_SAMPLES` are the dimensions of the input, defined in the generated `model.h` header. The input and output arrays must be allocated by the caller.

`x_float` can be converted to a fixed-point number `x_fixed` with a call that relies on `long_number_t`, a type twice the size of `number_t`, and on `clamp_to_number_t`, which saturates and converts to `number_t`. Both are defined in the `number.h` header. The conversion uses `INPUT_SCALE_FACTOR`, the scale factor for the first layer, defined in the `model.h` header.
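The float-to-fixed conversion described here can be emulated in a few lines. This is a sketch under assumptions, not the generated C code: it supposes that the scale factor is a number of fractional bits (a power-of-two scale, as in a $Qm.n$ coding), that the cast truncates toward zero as a C float-to-integer cast does, and the helper names `clamp` and `float_to_fixed` are hypothetical.

```python
def clamp(value, width):
    """Saturate an integer to the signed range of a `width`-bit type."""
    lo = -(1 << (width - 1))
    hi = (1 << (width - 1)) - 1
    return max(lo, min(hi, value))


def float_to_fixed(x_float, scale_factor, width=16):
    """Convert a float to a fixed-point integer with `scale_factor` fractional bits."""
    # int() truncates toward zero, like a C cast from float to an integer type.
    # Python integers are unbounded, so they stand in for the wider type here.
    wide = int(x_float * (1 << scale_factor))
    return clamp(wide, width)


print(float_to_fixed(0.5, 9))     # 0.5 * 2**9
print(float_to_fixed(-1.3, 9))    # truncated toward zero
print(float_to_fixed(100.0, 9))   # out of range for 16 bits, saturates
```

Performing the multiplication in a wider type before clamping is what lets out-of-range inputs saturate instead of wrapping around.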

#### 5.7. KerasCNN2C: Conversion Process

- Combine ZeroPadding1D layers (if they exist) with the next Conv1D layer
- Combine ReLU activation layers with the previous Conv1D, MaxPooling1D, Dense or Add layer
- Convert BatchNorm [50] weights from the mean $\mu$, the variance $V$, the scale $\gamma$, the offset $\beta$ and $\epsilon$ to a multiplicand $w$ and an addend $b$ using the following formulas: $$w=\frac{\gamma}{\sigma}$$ $$\sigma =\sqrt{V+\epsilon}$$ $$b=\beta -\frac{\gamma \times \mu}{\sigma}$$
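The BatchNorm conversion in the last step can be checked numerically. The sketch below folds per-channel BatchNorm parameters into a multiplicand $w$ and an addend $b$ and compares $w \times x + b$ with the usual normalization $\gamma (x-\mu)/\sqrt{V+\epsilon}+\beta$. The function names and the sample values are arbitrary and chosen only for the illustration.

```python
import math


def fold_batchnorm(gamma, beta, mean, var, eps):
    """Return per-channel (w, b) such that w * x + b equals BatchNorm(x)."""
    sigma = [math.sqrt(v + eps) for v in var]
    w = [g / s for g, s in zip(gamma, sigma)]
    b = [bt - g * m / s for bt, g, m, s in zip(beta, gamma, mean, sigma)]
    return w, b


def batchnorm_reference(x, gamma, beta, mean, var, eps):
    """Unfolded BatchNorm inference, one value per channel."""
    return [g * (xi - m) / math.sqrt(v + eps) + bt
            for xi, g, bt, m, v in zip(x, gamma, beta, mean, var)]


# Arbitrary parameters for two channels.
gamma, beta = [1.5, 0.8], [0.1, -0.2]
mean, var, eps = [0.4, -1.0], [2.0, 0.5], 1e-3

w, b = fold_batchnorm(gamma, beta, mean, var, eps)
x = [0.7, 1.2]
folded = [wi * xi + bi for wi, xi, bi in zip(w, x, b)]
reference = batchnorm_reference(x, gamma, beta, mean, var, eps)
print(folded, reference)
```

After folding, inference needs only one multiplication and one addition per value, with no square root or division at run time.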

`-Ofast` optimization level is enabled. Moreover, the code is written in a simple and straightforward way. So far, no special effort has been made to further optimize the source code for faster execution.

`cnn(...)` is generated. This function only contains the allocation of the buffers done by the allocator module and a sequence of calls to each of the layers’ inference function. The correct input and output buffers are passed to each layer according to the graph of the model.

#### 5.8. KerasCNN2C: Quantization and Fixed-Point Computation

`float` to an integer data type, such as `int8_t` for 8-bit quantization or `int16_t` for 16-bit quantization.

`int16_t`, then the intermediate results in a layer are computed and stored with an `int32_t` data type. The result is then scaled back to the correct output scale factor before saturating and converting it back to the original operands’ data type.
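The sequence described here (accumulate in a wider type, rescale, saturate) can be emulated as follows. This is a minimal sketch, not the generated C code: it assumes scale factors expressed as numbers of fractional bits, and the names `saturate` and `fixed_dot` are hypothetical. Python integers are unbounded, so the accumulator only plays the role of the wider type without enforcing its width.

```python
def saturate(value, width):
    """Clamp an integer to the signed range of a `width`-bit type."""
    return max(-(1 << (width - 1)), min((1 << (width - 1)) - 1, value))


def fixed_dot(inputs, weights, in_frac, w_frac, out_frac, width=16):
    """Fixed-point dot product with rescaling and saturation of the result."""
    acc = 0  # stands in for the wider accumulator (e.g. 32 bits for 16-bit operands)
    for x, w in zip(inputs, weights):
        acc += x * w  # each product carries in_frac + w_frac fractional bits
    shift = in_frac + w_frac - out_frac
    # Bring the accumulator back to the output scale factor.
    acc = acc >> shift if shift >= 0 else acc << -shift
    return saturate(acc, width)


# 1.5 * 2.0 + (-0.25) * 1.0 = 2.75, all values with 8 fractional bits.
y = fixed_dot([384, -64], [512, 256], in_frac=8, w_frac=8, out_frac=8)
print(y, y / 256)
```

Rescaling once after the accumulation, rather than after each product, keeps the extra fractional bits until the end of the sum.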

`SMLABB` that performs a multiply–accumulate operation in one cycle (instead of two cycles). However, the compiler does not make use of the `SSAT` operation that could allow saturating in one cycle. Instead, it uses the same instructions as a regular max operation, i.e., a compare instruction and a conditional move instruction requiring a total of two cycles.

## 6. Results

#### 6.1. Evaluation of the MicroAI Quantization Method

#### 6.1.1. Human Activity Recognition dataset (UCI-HAR)

#### 6.1.2. Spoken Digits Dataset (SMNIST)

#### 6.1.3. The German Traffic Sign Recognition Benchmark (GTSRB)

#### 6.2. Evaluation of Frameworks and Embedded Platforms

## 7. Discussion

## 8. Conclusions

`torch.fx` module of the newly released PyTorch 1.8.0 is used.

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Conflicts of Interest

## Appendix A. Comparison of Inference Time between a Microcontroller, a CPU and a GPU

**Table A1.** Microcontroller (STM32L452RE), CPU (Intel Core i7-8850H) and GPU (NVidia Quadro P2000M) platforms. Power consumption figures for GPU and CPU are the TDP values from the manufacturer and do not reflect the exact power consumption of the device.

Platform | Model | Framework | Power Consumption |
---|---|---|---|
MCU | STM32L452RE | STM32Cube.AI | 0.016 W |
CPU | Intel Core i7-8850H | TensorFlow | 45 W |
GPU | NVidia Quadro P2000M | TensorFlow | 50 W |

**Table A2.** Comparison of 32-bit floating-point inference time (in ms) for a single input on a microcontroller, a CPU and a GPU. The neural network architecture is described in Section 6, with the number of filters per convolution layer varying from 16 to 80, and the dataset is described in Section 6.1.1. For CPU and GPU, the inference batch size is set to 512 and the dataset is repeated 104 times to try to compensate for the large startup overhead compared to the total inference time. Measurements are averaged over at least 5 runs.

Platform | 16 Filters | 24 Filters | 32 Filters | 40 Filters | 48 Filters | 64 Filters | 80 Filters |
---|---|---|---|---|---|---|---|
MCU | 85 | 174 | 271 | 404 | 544 | 921 | 1387 |
CPU | 0.0396 | 0.0552 | 0.0720 | 0.0937 | 0.1134 | 0.1538 | 0.2046 |
GPU | 0.0227 | 0.0197 | 0.0223 | 0.0284 | 0.0317 | 0.0395 | 0.0515 |

## Appendix B. Comparison between TensorFlow Lite for Microcontrollers and MicroAI Quantizations

**Figure A1.** Accuracy vs. filters for baseline (float32), 8-bit Post-Training Quantization from TensorFlow Lite (int8 TFLite PTQ), 8-bit Quantization-Aware Training from our framework (int8 MicroAI QAT), and 9-bit Post-Training Quantization from our framework (int9 MicroAI PTQ). The neural network architecture is described in Section 6, with the number of filters per convolution layer varying from 32 to 48, and the dataset is described in Section 6.1.1.

## Appendix C. MicroAI Commands to Run for Automatic Training and Deployment of Deep Neural Networks

## Appendix D. Detailed Results of Frameworks and Embedded Platforms Evaluation

ROM Footprint (kiB)

Framework | Target | Data Type | 16 Filters | 24 Filters | 32 Filters | 40 Filters | 48 Filters | 64 Filters | 80 Filters |
---|---|---|---|---|---|---|---|---|---|
TFLiteMicro | SparkFunEdge | float32 | 116.520 | 133.988 | 157.957 | 188.426 | 225.395 | 318.926 | 438.363 |
MicroAI | SparkFunEdge | float32 | 54.316 | 67.066 | 91.035 | 121.512 | 158.473 | 251.863 | 371.332 |
MicroAI | NucleoL452REP | float32 | 55.770 | 68.145 | 92.129 | 122.582 | 159.559 | 253.004 | 372.434 |
STM32Cube.AI | NucleoL452REP | float32 | 61.965 | 79.449 | 103.410 | 133.898 | 170.859 | 264.289 | 383.742 |
MicroAI | SparkFunEdge | int16 | 46.952 | 50.629 | 62.629 | 77.832 | 96.355 | 142.973 | 202.699 |
MicroAI | NucleoL452REP | int16 | 48.129 | 51.629 | 63.613 | 78.855 | 97.340 | 144.051 | 203.770 |
TFLiteMicro | SparkFunEdge | int8 | 111.051 | 117.066 | 124.691 | 133.957 | 144.832 | 171.473 | 204.613 |
MicroAI | SparkFunEdge | int8 | 43.256 | 42.249 | 48.229 | 55.854 | 65.089 | 88.343 | 118.202 |
MicroAI | NucleoL452REP | int8 | 45.038 | 43.474 | 49.464 | 57.078 | 66.322 | 89.683 | 119.541 |
STM32Cube.AI | NucleoL452REP | int8 | 72.742 | 77.746 | 84.336 | 92.582 | 102.430 | 126.996 | 158.098 |

Response Time (ms)

Framework | Target | Data Type | 16 Filters | 24 Filters | 32 Filters | 40 Filters | 48 Filters | 64 Filters | 80 Filters |
---|---|---|---|---|---|---|---|---|---|
TFLiteMicro | SparkFunEdge | float32 | 179.633 | 294.157 | 438.541 | 624.172 | 860.835 | 1406.945 | 2087.241 |
MicroAI | SparkFunEdge | float32 | 53.247 | 153.732 | 259.212 | 394.494 | 569.852 | 1017.118 | 1561.264 |
MicroAI | NucleoL452REP | float32 | 55.762 | 152.426 | 259.160 | 395.721 | 559.249 | 976.732 | 1512.143 |
STM32Cube.AI | NucleoL452REP | float32 | 85.359 | 174.082 | 271.362 | 403.898 | 544.406 | 921.646 | 1387.083 |
MicroAI | SparkFunEdge | int16 | 40.867 | 113.035 | 191.439 | 287.655 | 389.450 | 667.547 | 1041.617 |
MicroAI | NucleoL452REP | int16 | 44.915 | 120.308 | 205.499 | 318.310 | 459.880 | 796.310 | 1223.513 |
TFLiteMicro | SparkFunEdge | int8 | 92.529 | 130.760 | 172.673 | 225.092 | 280.942 | 418.198 | 591.785 |
MicroAI | SparkFunEdge | int8 | 39.417 | 101.704 | 172.551 | 259.830 | 375.840 | 658.441 | 1003.365 |
MicroAI | NucleoL452REP | int8 | 43.003 | 107.705 | 180.830 | 272.986 | 383.761 | 659.996 | 1034.033 |
STM32Cube.AI | NucleoL452REP | int8 | 32.297 | 53.871 | 80.388 | 111.635 | 146.022 | 242.002 | 352.079 |

Energy (µWh)

Framework | Target | Data Type | 16 Filters | 24 Filters | 32 Filters | 40 Filters | 48 Filters | 64 Filters | 80 Filters |
---|---|---|---|---|---|---|---|---|---|
TFLiteMicro | SparkFunEdge | float32 | 0.135 | 0.221 | 0.330 | 0.469 | 0.647 | 1.058 | 1.569 |
MicroAI | SparkFunEdge | float32 | 0.040 | 0.116 | 0.195 | 0.297 | 0.428 | 0.765 | 1.174 |
MicroAI | NucleoL452REP | float32 | 0.247 | 0.675 | 1.148 | 1.753 | 2.478 | 4.327 | 6.700 |
STM32Cube.AI | NucleoL452REP | float32 | 0.378 | 0.771 | 1.202 | 1.789 | 2.412 | 4.083 | 6.146 |
MicroAI | SparkFunEdge | int16 | 0.031 | 0.085 | 0.144 | 0.216 | 0.293 | 0.502 | 0.783 |
MicroAI | NucleoL452REP | int16 | 0.199 | 0.533 | 0.910 | 1.410 | 2.038 | 3.528 | 5.421 |
TFLiteMicro | SparkFunEdge | int8 | 0.070 | 0.098 | 0.130 | 0.169 | 0.211 | 0.314 | 0.445 |
MicroAI | SparkFunEdge | int8 | 0.030 | 0.076 | 0.130 | 0.195 | 0.283 | 0.495 | 0.754 |
MicroAI | NucleoL452REP | int8 | 0.191 | 0.477 | 0.801 | 1.209 | 1.700 | 2.924 | 4.581 |
STM32Cube.AI | NucleoL452REP | int8 | 0.143 | 0.239 | 0.356 | 0.495 | 0.647 | 1.072 | 1.560 |
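The response-time and energy figures can be related through energy = power × time. The sketch below is only a cross-check under an assumption: it takes the power draw to be the supply voltage multiplied by the run current listed for the board (3.3 V and 0.82 mA for the SparkFun Edge), which may differ from how the energy values were actually obtained. The function name `energy_uwh` is hypothetical.

```python
def energy_uwh(voltage_v, current_a, time_ms):
    """Energy in microwatt-hours for a constant power draw over `time_ms`."""
    power_w = voltage_v * current_a          # P = V * I
    energy_j = power_w * time_ms / 1000.0    # E = P * t, with t in seconds
    return energy_j / 3600.0 * 1e6           # joules -> Wh -> µWh


# MicroAI, SparkFun Edge, float32, 16 filters: response time of 53.247 ms.
e = energy_uwh(voltage_v=3.3, current_a=0.82e-3, time_ms=53.247)
print(round(e, 3))
```

Under this assumption the result is close to the 0.040 µWh listed for that configuration, which suggests that energy scales directly with response time on a given board.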

## Appendix E. Number of Integer ALU Operations for a Fixed-Point Residual Neural Network

**Table A6.** Number of arithmetic and logic operations for fixed-point inference on integers for the main layers of a residual neural network, with $f$ the number of filters (output channels), $s$ the number of input samples, $c$ the number of input channels, $k$ the kernel size, $n$ the number of neurons and $i$ the number of input layers to the residual Add layer. Conv1D is assumed to be without padding and with a stride of 1.

Layer | MACC (1 Cycle) | Add (1 Cycle) | Shift (1 Cycle) | Max/Saturate (2 Cycles) |
---|---|---|---|---|
Conv1D | $f\times s\times c\times k$ | N/A | $2\times f\times s$ | $f\times s$ |
ReLU | N/A | N/A | N/A | $c\times s$ |
Maxpool | N/A | N/A | N/A | $c\times s\times k$ |
Add | N/A | $s\times c\times (i-1)$ | $s\times c\times i$ | $c\times s$ |
FullyConnected | $n\times s$ | N/A | $2\times n$ | $n$ |
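The formulas of Table A6 can be turned into a small cycle estimator. This is a sketch that only transcribes the table, with the per-operation cycle costs taken from its column headers. The function names and the example layer dimensions are arbitrary assumptions, and the estimate ignores memory accesses and loop overhead.

```python
# Cycle cost per operation type, from the column headers of Table A6.
CYCLES = {"macc": 1, "add": 1, "shift": 1, "max_saturate": 2}


def conv1d_ops(f, s, c, k):
    return {"macc": f * s * c * k, "add": 0, "shift": 2 * f * s, "max_saturate": f * s}


def relu_ops(c, s):
    return {"macc": 0, "add": 0, "shift": 0, "max_saturate": c * s}


def maxpool_ops(c, s, k):
    return {"macc": 0, "add": 0, "shift": 0, "max_saturate": c * s * k}


def add_ops(s, c, i):
    return {"macc": 0, "add": s * c * (i - 1), "shift": s * c * i, "max_saturate": c * s}


def fully_connected_ops(n, s):
    return {"macc": n * s, "add": 0, "shift": 2 * n, "max_saturate": n}


def cycles(ops):
    """Total ALU cycles for one layer, given its operation counts."""
    return sum(count * CYCLES[name] for name, count in ops.items())


# Arbitrary example: 16 filters, 128 samples, 9 input channels, kernel size 3.
conv = conv1d_ops(f=16, s=128, c=9, k=3)
print(conv, cycles(conv))
```

For convolution layers the multiply-accumulate term dominates, so the shift and saturation overhead of fixed-point arithmetic is small compared with the core computation.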

## References

1. Wang, Y.; Wei, G.; Brooks, D. Benchmarking TPU, GPU, and CPU Platforms for Deep Learning. arXiv **2019**, arXiv:1907.10701.
2. Lin, J.; Chen, W.M.; Lin, Y.; Cohn, J.; Gan, C.; Han, S. MCUNet: Tiny Deep Learning on IoT Devices. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Online, 6–12 December 2020.
3. Lai, L.; Suda, N. Enabling Deep Learning at the IoT Edge. In Proceedings of the International Conference on Computer-Aided Design (ICCAD’18), San Diego, CA, USA, 5–8 November 2018; Association for Computing Machinery: New York, NY, USA, 2018.
4. Kromes, R.; Russo, A.; Miramond, B.; Verdier, F. Energy consumption minimization on LoRaWAN sensor network by using an Artificial Neural Network based application. In Proceedings of the 2019 IEEE Sensors Applications Symposium (SAS), Sophia Antipolis, France, 11–13 March 2019; pp. 1–6.
5. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (PMLR 2019), Long Beach, CA, USA, 9–15 June 2019; Chaudhuri, K., Salakhutdinov, R., Eds.; Volume 97, pp. 6105–6114.
6. Novac, P.E.; Russo, A.; Miramond, B.; Pegatoquet, A.; Verdier, F.; Castagnetti, A. Toward unsupervised Human Activity Recognition on Microcontroller Units. In Proceedings of the 2020 23rd Euromicro Conference on Digital System Design (DSD), Kranj, Slovenia, 26–28 August 2020; pp. 542–550.
7. Pimentel, J.J.; Bohnenstiehl, B.; Baas, B.M. Hybrid Hardware/Software Floating-Point Implementations for Optimized Area and Throughput Tradeoffs. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. **2017**, 25, 100–113.
8. Choi, J.; Chuang, P.I.J.; Wang, Z.; Venkataramani, S.; Srinivasan, V.; Gopalakrishnan, K. Bridging the accuracy gap for 2-bit quantized neural networks (qnn). arXiv **2018**, arXiv:1807.06964.
9. Esser, S.K.; McKinstry, J.L.; Bablani, D.; Appuswamy, R.; Modha, D.S. Learned step size quantization. arXiv **2019**, arXiv:1902.08153.
10. Nikolić, M.; Hacene, G.B.; Bannon, C.; Lascorz, A.D.; Courbariaux, M.; Bengio, Y.; Gripon, V.; Moshovos, A. Bitpruning: Learning bitlengths for aggressive and accurate quantization. arXiv **2020**, arXiv:2002.03090.
11. Uhlich, S.; Mauch, L.; Yoshiyama, K.; Cardinaux, F.; Garcia, J.A.; Tiedemann, S.; Kemp, T.; Nakamura, A. Differentiable quantization of deep neural networks. arXiv **2019**, arXiv:1905.11452.
12. Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized Neural Networks. In Advances in Neural Information Processing Systems; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Barcelona, Spain, 2016; Volume 29.
13. Rastegari, M.; Ordonez, V.; Redmon, J.; Farhadi, A. Xnor-net: Imagenet classification using binary convolutional neural networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Cham, Switzerland, 2016; pp. 525–542.
14. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv **2017**, arXiv:1704.04861.
15. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv **2016**, arXiv:1602.07360.
16. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
18. Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both weights and connections for efficient neural network. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 1, Montreal, QC, Canada, 7–10 December 2015; MIT Press: Cambridge, MA, USA, 2015; pp. 1135–1143.
19. Yamamoto, K.; Maeno, K. PCAS: Pruning Channels with Attention Statistics. arXiv **2018**, arXiv:1806.05382.
20. Hacene, G.B.; Lassance, C.; Gripon, V.; Courbariaux, M.; Bengio, Y. Attention based pruning for shift networks. arXiv **2019**, arXiv:1905.12300.
21. Ramakrishnan, R.K.; Sari, E.; Nia, V.P. Differentiable Mask for Pruning Convolutional and Recurrent Networks. In Proceedings of the 2020 17th Conference on Computer and Robot Vision (CRV), Ottawa, ON, Canada, 13–15 May 2020; pp. 222–229.
22. He, Y.; Ding, Y.; Liu, P.; Zhu, L.; Zhang, H.; Yang, Y. Learning Filter Pruning Criteria for Deep Convolutional Neural Networks Acceleration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2009–2018.
23. Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv **2015**, arXiv:1510.00149.
24. Fard, M.M.; Thonet, T.; Gaussier, E. Deep k-means: Jointly clustering with k-means and learning representations. Pattern Recognit. Lett. **2020**, 138, 185–192.
25. Cardinaux, F.; Uhlich, S.; Yoshiyama, K.; García, J.A.; Mauch, L.; Tiedemann, S.; Kemp, T.; Nakamura, A. Iteratively training look-up tables for network quantization. IEEE J. Sel. Top. Signal Process. **2020**, 14, 860–870.
26. He, Z.; Fan, D. Simultaneously Optimizing Weight and Quantizer of Ternary Neural Network Using Truncated Gaussian Approximation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 11430–11438.
27. Lee, E.; Hwang, Y. Layer-Wise Network Compression Using Gaussian Mixture Model. Electronics **2021**, 10, 72.
28. Vogel, S.; Raghunath, R.B.; Guntoro, A.; Van Laerhoven, K.; Ascheid, G. Bit-Shift-Based Accelerator for CNNs with Selectable Accuracy and Throughput. In Proceedings of the 2019 22nd Euromicro Conference on Digital System Design (DSD), Kallithea, Greece, 28–30 August 2019; pp. 663–667.
29. Courbariaux, M.; Bengio, Y.; David, J.P. Training deep neural networks with low precision multiplications. arXiv **2015**, arXiv:1412.7024.
30. Holt, J.L.; Baker, T.E. Back propagation simulations using limited precision calculations. In Proceedings of the IJCNN-91-Seattle International Joint Conference on Neural Networks, Seattle, WA, USA, 8–12 July 1991; Volume ii, pp. 121–126.
31. Vanhoucke, V.; Senior, A.; Mao, M.Z. Improving the speed of neural networks on CPUs. In Proceedings of the Deep Learning and Unsupervised Feature Learning Workshop (NIPS 2011), Granada, Spain, 12–17 December 2011.
32. Garofalo, A.; Tagliavini, G.; Conti, F.; Rossi, D.; Benini, L. XpulpNN: Accelerating Quantized Neural Networks on RISC-V Processors Through ISA Extensions. In Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition, DATE 2020, Grenoble, France, 9–13 March 2020; IEEE: New York, NY, USA, 2020; pp. 186–191.
33. Cotton, N.J.; Wilamowski, B.M.; Dundar, G. A Neural Network Implementation on an Inexpensive Eight Bit Microcontroller. In Proceedings of the 2008 International Conference on Intelligent Engineering Systems, Miami, FL, USA, 25–29 February 2008; pp. 109–114.
34. Nair, V.; Hinton, G.E. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning (ICML’10), Haifa, Israel, 21–24 June 2010; Omnipress: Madison, WI, USA, 2010; pp. 807–814.
35. Zhang, Y.; Suda, N.; Lai, L.; Chandra, V. Hello Edge: Keyword Spotting on Microcontrollers. arXiv **2018**, arXiv:1711.07128.
36. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019 (Revision of IEEE 754-2008); IEEE: Piscataway, NJ, USA, 2019; pp. 1–84.
37. Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G.; et al. Mixed Precision Training. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
38. ARM. ARM Developer Suite AXD and armsd Debuggers Guide, 4.7.9 Q-Format; ARM DUI 0066D Version 1.2; Arm Ltd.: Cambridge, UK, 2001.
39. David, R.; Duke, J.; Jain, A.; Reddi, V.; Jeffries, N.; Li, J.; Kreeger, N.; Nappier, I.; Natraj, M.; Regev, S.; et al. TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems. arXiv **2020**, arXiv:2010.08678.
40. STMicroelectronics. STM32Cube.AI. Available online: https://www.st.com/content/st_com/en/stm32-ann.html (accessed on 19 March 2021).
41. Google. TensorFlow Lite for Microcontrollers Supported Operations. Available online: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/kernels/micro_ops.h (accessed on 22 March 2021).
42. Google. TensorFlow Lite 8-Bit Quantization Specification. Available online: https://www.tensorflow.org/lite/performance/quantization_spec (accessed on 19 March 2021).
43. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2704–2713.
44. STMicroelectronics. Supported Deep Learning Toolboxes and Layers, Documentation Embedded in X-CUBE-AI Expansion Package 5.2.0. 2020. Available online: https://www.st.com/en/embedded-software/x-cube-ai.html (accessed on 19 March 2021).
45. Nordby, J. Emlearn: Machine Learning Inference Engine for Microcontrollers and Embedded Devices. 2019. Available online: https://doi.org/10.5281/zenodo.2589394 (accessed on 18 February 2021).
46. Sakr, F.; Bellotti, F.; Berta, R.; De Gloria, A. Machine Learning on Mainstream Microcontrollers. Sensors **2020**, 20, 2638.
47. Givargis, T. Gravity: An Artificial Neural Network Compiler for Embedded Applications. In Proceedings of the 26th Asia and South Pacific Design Automation Conference (ASPDAC’21), Tokyo, Japan, 18–21 January 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 715–721.
48. Wang, X.; Magno, M.; Cavigelli, L.; Benini, L. FANN-on-MCU: An Open-Source Toolkit for Energy-Efficient Neural Network Inference at the Edge of the Internet of Things. IEEE Internet Things J. **2020**, 7, 4403–4417.
49. Tom’s Obvious Minimal Language. Available online: https://toml.io/ (accessed on 19 March 2021).
50. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; Bach, F., Blei, D., Eds.; PMLR: Lille, France, 2015; Volume 37, pp. 448–456.
51. Jinja2. Available online: https://palletsprojects.com/p/jinja/ (accessed on 19 March 2021).
52. Zhang, H.; Cissé, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
53. Davide, A.; Alessandro, G.; Luca, O.; Xavier, P.; Jorge, L.R.O. A Public Domain Dataset for Human Activity Recognition using Smartphones. In Proceedings of the ESANN, Bruges, Belgium, 24–26 April 2013.
54. Khacef, L.; Rodriguez, L.; Miramond, B. Written and Spoken Digits Database for Multimodal Learning. 2019. Available online: https://doi.org/10.5281/zenodo.3515935 (accessed on 18 February 2021).
55. Warden, P. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv **2018**, arXiv:1804.03209.
56. Stallkamp, J.; Schlipsing, M.; Salmen, J.; Igel, C. The German Traffic Sign Recognition Benchmark: A multi-class classification competition. In Proceedings of the 2011 International Joint Conference on Neural Networks, San Jose, CA, USA, 31 July–5 August 2011; pp. 1453–1460.
57. Capotondi, A.; Rusci, M.; Fariselli, M.; Benini, L. CMix-NN: Mixed Low-Precision CNN Library for Memory-Constrained Edge Devices. IEEE Trans. Circuits Syst. II Express Briefs **2020**, 67, 871–875.
58. Park, E.; Kim, D.; Kim, S.; Kim, Y.; Kim, G.; Yoon, S.; Yoo, S. Big/little deep neural network for ultra low power inference. In Proceedings of the 2015 International Conference on Hardware/Software Codesign and System Synthesis (CODES + ISSS), Amsterdam, The Netherlands, 4–9 October 2015; pp. 124–132.
59. Anwar, S.; Hwang, K.; Sung, W. Structured Pruning of Deep Convolutional Neural Networks. J. Emerg. Technol. Comput. Syst. **2017**, 13, 1–18.
60. Arcaya-Jordan, A.; Pegatoquet, A.; Castagnetti, A. Smart Connected Glasses for Drowsiness Detection: A System-Level Modeling Approach. In Proceedings of the 2019 IEEE Sensors Applications Symposium (SAS), Sophia Antipolis, France, 11–13 March 2019; pp. 1–6.

**Figure 11.** ROM footprint for TFLite Micro, STM32Cube.AI and MicroAI with 80 filters per convolution.

**Figure 12.** Inference time for 1 input for TFLite Micro, STM32Cube.AI and MicroAI with 80 filters per convolution.

**Figure 13.** Energy consumption for 1 input for TFLite Micro, STM32Cube.AI and MicroAI with 80 filters per convolution.

32-bit floating-point layout:

Bits | 31 | 30–23 | 22–0 |
---|---|---|---|
Field | sign | exponent | significand |

32-bit fixed-point layout:

Bits | 31 … | … 0 |
---|---|---|
Field | integer part | fractional part |

Board | Nucleo-L452RE-P | SparkFun Edge |
---|---|---|
MCU | STM32L452RE | Ambiq Apollo3 |
Core | Cortex-M4F | Cortex-M4F |
Max Clock | 80 MHz | 48 MHz (96 MHz “Burst Mode”) |
RAM | 128 kiB | 384 kiB |
Flash | 512 kiB | 1024 kiB |
CoreMark/MHz | 3.42 | 2.479 |
Run current @3.3 V, 48 MHz | 4.80 mA | 0.82 mA * |

Framework | STM32Cube.AI | TFLite Micro | MicroAI |
---|---|---|---|
Source | Keras, TFLite, … | Keras, TFLite | Keras, PyTorch * |
Validation | Integrated tools | None | Integrated tools |
Metrics | RAM/ROM footprint, inference time, MACC | None | ROM footprint, inference time |
Portability | STM32 only | Any 32-bit MCU | Any 32-bit MCU |
Built-in platform support | STM32 boards (Nucleo, …) | 32F746GDiscovery, SparkFun Edge, … | SparkFun Edge, Nucleo-L452-RE-P |
Sources | Private | Public | Public |
Data type | float, int8_t | float, int8_t | float, int8_t, int16_t |
Quantized data | Weights, activations | Weights, activations | Weights, activations |
Quantizer | Uniform (from TFLite) | Uniform | Uniform |
Quantized coding | Offset and scale | Offset and scale | Fixed-point $Qm.n$ |


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Novac, P.-E.; Boukli Hacene, G.; Pegatoquet, A.; Miramond, B.; Gripon, V.
Quantization and Deployment of Deep Neural Networks on Microcontrollers. *Sensors* **2021**, *21*, 2984.
https://doi.org/10.3390/s21092984
