# Design of a 2-Bit Neural Network Quantizer for Laplacian Source


## Abstract


## 1. Introduction

## 2. A 2-Bit Uniform Scalar Quantizer of Laplacian Source

A uniform scalar quantizer is specified by the decision thresholds $x_i$ and the representative levels $y_i$ [6,7]. For such a uniform quantizer, the thresholds and the representative levels are equally spaced, that is, $x_i - x_{i-1} = y_i - y_{i-1} = \Delta$, where $\Delta$ denotes the step size.
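To make the construction concrete, the following sketch (the function name and NumPy formulation are ours, not from the paper) implements a symmetric mid-rise uniform quantizer; for the 2-bit case it reproduces the codebook used throughout Section 2:

```python
import numpy as np

def uniform_quantize(x, delta, n_levels=4):
    """Symmetric mid-rise uniform quantizer with step size delta.

    Decision thresholds sit at integer multiples of delta and the
    representative levels at odd multiples of delta/2; inputs beyond
    the support region (+/- n_levels/2 * delta) map to the outermost
    (overload) levels.
    """
    half = n_levels // 2
    k = np.clip(np.floor(x / delta), -half, half - 1)  # cell index
    return (k + 0.5) * delta

# 2-bit case (n_levels = 4): codebook {-3d/2, -d/2, d/2, 3d/2},
# inner decision thresholds {-d, 0, d}, support threshold x_max = 2d.
```

For example, with `delta = 1.0`, an input of 0.3 maps to 0.5 and any input beyond the support region maps to ±1.5.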

#### 2.1. The Variance-Matched 2-Bit Uniform Quantizer

The support region of the quantizer consists of two parts: the inner region, defined in $(-x_{\max}, x_{\max})$, and the outer region, defined in $(-\infty, -x_{\max}) \cup (x_{\max}, \infty)$. Therefore, the MSE distortion is the sum of the distortions incurred in the inner region ($D_{\mathrm{in}}$) and the outer region ($D_{\mathrm{o}}$), defined using the following lemmas:

**Lemma 1.**

**Proof of Lemma 1.**

Here, the source is assumed to have unit variance ($\sigma_q = 1$). For a 2-bit quantizer, we obtain:

**Lemma 2.**

**Proof of Lemma 2.**

where $x_{\max}$ is the support region threshold value, whereas $y_{\max}$ is the last representative level in the codebook. For the observed 2-bit uniform quantizer, $x_{\max} = 2\Delta$, whereas $y_{\max} = 3\Delta/2$. Thus, the overload distortion of the 2-bit uniform quantizer of the Laplacian source is defined as:

The total distortion $D_t$ for the 2-bit uniform quantizer of the Laplacian source is defined using the following expression:

The optimal step size ($\Delta^{\mathrm{opt}}$) is specified using the following lemma:

**Lemma 3.**

**Proof of Lemma 3.**

(this initial value follows from the asymptotic formula for $x_{\max}$ of the $N$-level uniform quantizer of the Laplacian source). Moreover, by substituting this initial value into (13), one can obtain the asymptotic step size value:

The SQNR is defined as $\mathrm{SQNR} = 10\log_{10}(1/D)$, which is a standard objective performance measure of a quantization process [6,7]. Let $\mathrm{SQNR}(\Delta^{a} = 1.061)$ and $\mathrm{SQNR}(\Delta^{\mathrm{opt}} = 1.087)$ denote the SQNR obtained using the asymptotic and the optimal step size value, respectively. It can be shown that these two SQNRs are very close, as the calculated relative error amounts to 0.08%, meaning that the proposed asymptotic step size is very accurate compared to the optimal one. Nevertheless, the analysis conducted in this paper focuses only on the optimal 2-bit uniform quantizer of the Laplacian source. Next, we show that the minimum of the total distortion is achieved for $\Delta = \Delta^{\mathrm{opt}}$, as stated in the following lemma.

**Lemma 4.** *The minimum of the total distortion is achieved for $\Delta = \Delta^{\mathrm{opt}}$.*

**Proof of Lemma 4.** The total distortion, evaluated at $\Delta = \Delta^{\mathrm{opt}}$, is specified as (see Lemma 1):

It can be shown that $\Delta^{\mathrm{opt}} < \sqrt{2}$. Using this fact and applying it to (17), it holds that:

which confirms that the minimum of the total distortion is achieved for $\Delta = \Delta^{\mathrm{opt}}$. □
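As a numerical cross-check of the step sizes discussed above (the integration routine is our own sketch; the paper derives these quantities in closed form), one can evaluate the total distortion of the 2-bit uniform quantizer for the unit-variance Laplacian source and compare the SQNR obtained with Δ = 1.061 and Δ = 1.087:

```python
import numpy as np

def total_distortion(delta, n_grid=200_001):
    """Granular + overload MSE of the symmetric 2-bit uniform quantizer
    for the unit-variance Laplacian pdf f(x) = (lam/2) exp(-lam|x|),
    lam = sqrt(2). The granular part is integrated numerically on
    (0, 2*delta); the overload tail is evaluated in closed form.
    Both halves are doubled by symmetry of the pdf."""
    lam = np.sqrt(2.0)
    x = np.linspace(0.0, 2.0 * delta, n_grid)
    y = np.where(x < delta, 0.5 * delta, 1.5 * delta)  # nearest level
    g = (x - y) ** 2 * 0.5 * lam * np.exp(-lam * x)
    d_gran = 2.0 * np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(x))
    # overload: integral of (x - 3d/2)^2 f(x) over (2d, inf), t = x_max - y_max
    t = 0.5 * delta
    d_over = np.exp(-lam * 2.0 * delta) * (t * t + 2.0 * t / lam + 2.0 / lam**2)
    return d_gran + d_over

def sqnr_db(delta):
    return 10.0 * np.log10(1.0 / total_distortion(delta))

# SQNR(1.061) and SQNR(1.087) agree to well under 1% relative difference,
# consistent with the 0.08% relative error reported in the text.
```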

#### 2.2. The Variance-Mismatched 2-Bit Uniform Quantizer

We now consider the scenario in which the quantizer, designed for the variance $\sigma_q^2 = 1$ (see Section 2.1), is used for processing Laplacian data with variance $\sigma_p^2$, where $\sigma_q^2 \neq \sigma_p^2$. This scenario is worth investigating, as it is often encountered in practice and reveals the robustness of the quantizer model, a very important property when dealing with non-stationary data [6,7]. On the other hand, it is known that the variance-mismatch effect may cause serious degradation of quantizer performance [6,7,36,37]. In this subsection, we derive closed-form expressions for the performance evaluation of the discussed quantizer.

where $\Delta(\sigma_q) = \sigma_q \Delta$, and $\Delta$ denotes the optimal step size value determined for variance $\sigma_q^2 = 1$ (see Section 2.1).

We introduce the mismatch ratio $\rho = \sigma_p/\sigma_q$ [36]. Then, the total distortion becomes:

The maximum SQNR is attained in the variance-matched case ($\sigma_p = \sigma_q = 1$, that is, $\rho = 1$), but the SQNR does not retain that value over the entire variance range and significantly decreases. Accordingly, the robustness of the quantizer is not at a satisfactory level, as the variance-mismatch effect has a strong influence on its performance; this, in turn, is reflected in limited efficiency of processing various Laplacian data.

#### 2.3. Adaptation of the 2-Bit Uniform Quantizer

Let us denote by $x_i$ the data of the input source $X$, where $i = 1, \dots, M$, and $M$ is the total number of data samples. The flowchart of the adaptation procedure is depicted in Figure 4 and can be described with the following steps:

**Step 1. Estimation of the mean value and quantization.** The mean value of the input data can be estimated as [6,7]:

**Step 2. Estimation of the standard deviation (rms value) and quantization.** The rms of the input data can be evaluated according to [6,7]:

**Step 3. Form the zero-mean input data.** Each element of the input source X is reduced by the quantized mean, and zero-mean data denoted by T are obtained:

where $\mu^{q}$ is the quantized version of $\mu$. Note that this is carried out in order to properly use the quantizer (as it is designed for a zero-mean Laplacian source).

**Step 4. Design of the adaptive quantizer and quantization of the zero-mean data.** The quantized standard deviation, $\sigma^{q}$, is used to scale the crucial design parameter Δ as follows:

The data $t_i$ of the source T are passed through the adaptive quantizer, and the quantized data $t_i^{q}$ are obtained.

**Step 5. Recover the original data.** Since the mean value is subtracted from the original data and further quantized (using 32 bits), an inverse process has to be performed to recover the original data:

where $x_i^{Q}$ denotes the data recovered after quantization. It should be emphasized that the described process is equivalent to the normalization process widely used in neural network applications [15,18,22], as the same performance in terms of SQNR can be achieved [40]. In particular, the normalization process assumes the following steps:

**Step 1. Estimation of the mean value and quantization.**

**Step 2. Estimation of the standard deviation (rms value) and quantization.**

**Step 3. Normalization of the input data.** Each element of the input source X is normalized according to:

$$T=\frac{X-{\mu}^{q}}{{\sigma}^{q}\left(1+\epsilon \right)}$$

**Step 4. Quantization of the normalized data.** To quantize the normalized data (modeled by a PDF with zero mean and unit variance), the quantizer designed in Section 2.1 can be used, and the quantized data $t_i^{q}$ are obtained.

**Step 5. Denormalization of the data.** Since the input data are appropriately transformed for the purpose of efficient quantization, an inverse process referred to as denormalization has to be performed to recover the original data:

$${x}_{i}^{Q}={t}_{i}^{q}{\sigma}^{q}+{\mu}^{q},\quad i=1,\dots ,M$$
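The five steps above can be sketched as follows (our own illustration; the step size Δ = 1.087 is the optimal value from Section 2.1, and the 32-bit quantization of the mean and rms is modeled here as exact):

```python
import numpy as np

def adaptive_2bit(x, delta=1.087, eps=0.0):
    """Forward-adaptive 2-bit quantization of a data block: estimate
    the mean and rms, normalize, quantize with the unit-variance
    2-bit uniform quantizer, then denormalize."""
    mu = x.mean()                              # Step 1: mean estimate
    sigma = x.std()                            # Step 2: rms estimate
    t = (x - mu) / (sigma * (1.0 + eps))       # Step 3: normalization
    k = np.clip(np.floor(t / delta), -2, 1)
    t_q = (k + 0.5) * delta                    # Step 4: 2-bit quantization
    return t_q * sigma + mu                    # Step 5: denormalization

# Laplacian data with non-unit variance and non-zero mean
rng = np.random.default_rng(1)
w = 0.2 + rng.laplace(scale=3.0 / np.sqrt(2.0), size=100_000)
w_q = adaptive_2bit(w)
sqnr = 10.0 * np.log10(np.var(w) / np.mean((w - w_q) ** 2))
```

Because the block is renormalized to (approximately) zero mean and unit variance before quantization, the resulting SQNR stays near the matched value of roughly 7 dB regardless of the original mean and variance.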

The performance of the adaptive quantizer follows by substituting the step size $\Delta(\sigma_p)$ defined with (26), which gives:

## 3. Experimental Results and Discussion

The weights of the trained MLP can be modeled by the Laplacian PDF with variance $\sigma_w^2$ and a mean value $\mu_w$ that is very close to zero. This, in turn, enables proper implementation of the developed adaptive quantizer model (Section 2.3).

We define $\mathrm{SQNR}^{\mathrm{ex}}$, by which the experimental value of SQNR can be measured:

where $D_w$ is the distortion introduced by the adaptive uniform quantization (using 2 bits) of the weights, $W$ is the total number of weights, and $w_i$ are the original while $w_i^{q}$ are the quantized values of the weights. Recall that, besides classification accuracy, this is an additional objective performance measure used for the analysis of the quantized neural network.

Figure 8 shows $\mathrm{SQNR}^{\mathrm{ex}}$ versus the parameter ε. It can be observed that SQNR decreases as ε increases, which is in accordance with the theoretical results presented in Figure 5 (observing one particular variance value). In addition, the theoretical and experimental values of SQNR agree well (considering some specific value of ε for a given variance value). Moreover, we examined the influence of the parameter ε (over the same range as in Figure 8) on the MLP performance obtained on the test data [41], as shown in Figure 9. Note that increasing ε slightly increases the performance (classification accuracy), with the maximum achieved for ε = 0.09. Thus, we can conclude that ε affects the introduced performance measures differently for the given network configuration and input data. Since classification accuracy is the relevant measure for neural networks, for the purpose of further analysis we adopt the values of classification accuracy and SQNR achieved for ε = 0.09, which are listed in Table 1. In addition, Figure 10 plots the classification accuracy as a function of the step size $\Delta/\sigma_w$, for ε = 0.09. It can be seen that the maximum classification accuracy is achieved for Δ = 1.09, which corresponds to the theoretically optimal value, confirming the applicability of the optimal quantizer.

For comparison, we specify the baseline 2-bit quantizers by their positive representative levels and decision thresholds. The quantizer defined in [17] is given by $\{y_3 = w_{\max} - \Delta,\ y_4 = w_{\max}\}$ and by the set of decision thresholds $\{x_0 = 0,\ x_1 = \Delta,\ x_2 = 2\Delta\}$, where $\Delta = 2w_{\max}/(2^{R}-1)$ [17], $R = 2$, and $w_{\max}$ is the maximal value of the weights. For the 2-bit uniform quantizer defined in [18], it holds: $\{y_3 = w_{\max}^{a} - 3\Delta/2,\ y_4 = w_{\max}^{a} - \Delta/2\}$ and $\{x_0 = 0,\ x_1 = \Delta,\ x_2 = 2\Delta\}$, where $\Delta = 2w_{\max}^{a}/2^{R}$ [18], $R = 2$, and $w_{\max}^{a}$ is the maximal absolute value of the weights. In the case of the 2-bit non-uniform quantizer described in [20], it holds: $\{y_3 = \Delta/2,\ y_4 = 2\Delta\}$ and $\{x_0 = 0,\ x_1 = \Delta,\ x_2 = 3\Delta = x_{\max}^{\mathrm{opt}}\}$, where $\Delta = 2x_{\max}^{\mathrm{opt}}/3$ [20] and $x_{\max}^{\mathrm{opt}}$ denotes the value of the optimal support region threshold of the proposed 2-bit uniform quantizer. Finally, the 2-bit non-uniform quantizer [21] is defined as follows: $\{5/8 = F(y_3),\ 7/8 = F(y_4)\}$ and $\{x_0 = 0,\ 3/4 = F(x_1)\}$, where $F\left(x\right)=1-\frac{1}{2}\mathrm{exp}\left(-\sqrt{2}x\right)$.

## 4. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Teerapittayanon, S.; McDanel, B.; Kung, H.T. Distributed Deep Neural Networks Over the Cloud, the Edge and End Devices. In Proceedings of the 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), Atlanta, GA, USA, 5–8 June 2017; pp. 328–339.
2. Gysel, P.; Pimentel, J.; Motamedi, M.; Ghiasi, S. Ristretto: A Framework for Empirical Study of Resource-Efficient Inference in Convolutional Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. **2018**, 29, 5784–5789.
3. Breiman, L.; Friedman, J.; Olshen, R.; Stone, C. Classification and Regression Trees; CRC Press: Belmont, CA, USA, 1984.
4. Langley, P.; Iba, W.; Thompson, K. An analysis of Bayesian classifiers. In Proceedings of the 10th National Conference on Artificial Intelligence, San Jose, CA, USA, 12–16 July 1992; AAAI and MIT Press: Cambridge, MA, USA, 1992; pp. 223–228.
5. Fu, L. Quantizability and learning complexity in multilayer neural networks. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) **1998**, 28, 295–299.
6. Sayood, K. Introduction to Data Compression, 5th ed.; Morgan Kaufmann: Burlington, MA, USA, 2017.
7. Jayant, N.S.; Noll, P. Digital Coding of Waveforms: Principles and Applications to Speech and Video; Prentice Hall: Hoboken, NJ, USA, 1984.
8. Perić, Z.; Simić, N.; Nikolić, J. Design of single and dual-mode companding scalar quantizers based on piecewise linear approximation of the Gaussian PDF. J. Frankl. Inst. **2020**, 357, 5663–5679.
9. Nikolic, J.; Peric, Z.; Jovanovic, A. Two forward adaptive dual-mode companding scalar quantizers for Gaussian source. Signal Process. **2016**, 120, 129–140.
10. Na, S.; Neuhoff, D.L. Asymptotic MSE Distortion of Mismatched Uniform Scalar Quantization. IEEE Trans. Inf. Theory **2012**, 58, 3169–3181.
11. Na, S.; Neuhoff, D.L. On the Convexity of the MSE Distortion of Symmetric Uniform Scalar Quantization. IEEE Trans. Inf. Theory **2017**, 64, 2626–2638.
12. Na, S.; Neuhoff, D.L. Monotonicity of Step Sizes of MSE-Optimal Symmetric Uniform Scalar Quantizers. IEEE Trans. Inf. Theory **2018**, 65, 1782–1792.
13. Banner, R.; Hubara, I.; Hoffer, E.; Soudry, D. Scalable Methods for 8-bit Training of Neural Networks. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montreal, QC, Canada, 2–8 December 2018.
14. Pham, P.; Abraham, J.; Chung, J. Training Multi-Bit Quantized and Binarized Networks with a Learnable Symmetric Quantizer. IEEE Access **2021**, 9, 47194–47203.
15. Banner, R.; Nahshan, Y.; Soudry, D. Post Training 4-bit Quantization of Convolutional Networks for Rapid-Deployment. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–10 December 2019.
16. Choi, J.; Venkataramani, S.; Srinivasan, V.; Gopalakrishnan, K.; Wang, Z.; Chuang, P. Accurate and Efficient 2-Bit Quantized Neural Networks. In Proceedings of the 2nd SysML Conference, Stanford, CA, USA, 31 March–2 April 2019.
17. Bhalgat, Y.; Lee, J.; Nagel, M.; Blankevoort, T.; Kwak, N. LSQ+: Improving Low-Bit Quantization Through Learnable Offsets and Better Initialization. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020.
18. Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. J. Mach. Learn. Res. **2018**, 18, 1–30.
19. Zamirai, P.; Zhang, J.; Aberger, C.R.; De Sa, C. Revisiting BFloat16 Training. arXiv **2020**, arXiv:2010.06192v1.
20. Li, Y.; Dong, X.; Wang, W. Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Conference, formerly Addis Ababa, Ethiopia, 26 April–1 May 2020.
21. Baskin, C.; Liss, N.; Schwartz, E.; Zheltonozhskii, E.; Giryes, R.; Bronstein, M.; Mendelson, A. UNIQ: Uniform Noise Injection for Non-Uniform Quantization of Neural Networks. ACM Trans. Comput. Syst. **2021**, 37, 1–15.
22. Simons, T.; Lee, D.-J. A Review of Binarized Neural Networks. Electronics **2019**, 8, 661.
23. Qin, H.; Gong, R.; Liu, X.; Bai, X.; Song, J.; Sebe, N. Binary Neural Networks: A Survey. Pattern Recognit. **2020**, 105, 107281.
24. Li, Y.; Bao, Y.; Chen, W. Fixed-Sign Binary Neural Network: An Efficient Design of Neural Network for Internet-of-Things Devices. IEEE Access **2018**, 8, 164858–164863.
25. Zhao, W.; Teli, M.; Gong, X.; Zhang, B.; Doermann, D. A Review of Recent Advances of Binary Neural Networks for Edge Computing. IEEE J. Miniat. Air Space Syst. **2021**, 2, 25–35.
26. Perić, Z.; Denić, B.; Savić, M.; Despotović, V. Design and Analysis of Binary Scalar Quantizer of Laplacian Source with Applications. Information **2020**, 11, 501.
27. Gazor, S.; Zhang, W. Speech Probability Distribution. IEEE Signal Process. Lett. **2003**, 10, 204–207.
28. Simić, N.; Perić, Z.; Savić, M. Coding Algorithm for Grayscale Images—Design of Piecewise Uniform Quantizer with Golomb–Rice Code and Novel Analytical Model for Performance Analysis. Informatica **2017**, 28, 703–724.
29. Banner, R.; Nahshan, Y.; Hoffer, E.; Soudry, D. ACIQ: Analytical Clipping for Integer Quantization of Neural Networks. arXiv **2018**, arXiv:1810.05723.
30. Zhang, A.; Lipton, Z.C.; Li, M.; Smola, A.J. Dive into Deep Learning. arXiv **2020**, arXiv:2106.11342.
31. Wiedemann, S.; Shivapakash, S.; Wiedemann, P.; Becking, D.; Samek, W.; Gerfers, F.; Wiegand, T. FantastIC4: A Hardware-Software Co-Design Approach for Efficiently Running 4Bit-Compact Multilayer Perceptrons. IEEE Open J. Circuits Syst. **2021**, 2, 407–419.
32. Kim, D.; Kung, J.; Mukhopadhyay, S. A Power-Aware Digital Multilayer Perceptron Accelerator with On-Chip Training Based on Approximate Computing. IEEE Trans. Emerg. Top. Comput. **2017**, 5, 164–178.
33. Savich, A.; Moussa, M.; Areibi, S. A Scalable Pipelined Architecture for Real-Time Computation of MLP-BP Neural Networks. Microprocess. Microsyst. **2012**, 36, 138–150.
34. Wang, X.; Magno, M.; Cavigelli, L.; Benini, L. FANN-on-MCU: An Open-Source Toolkit for Energy-Efficient Neural Network Inference at the Edge of the Internet of Things. IEEE Internet Things J. **2020**, 7, 4403–4417.
35. Hui, D.; Neuhoff, D.L. Asymptotic Analysis of Optimal Fixed-Rate Uniform Scalar Quantization. IEEE Trans. Inf. Theory **2001**, 47, 957–977.
36. Na, S. Asymptotic Formulas for Mismatched Fixed-Rate Minimum MSE Laplacian Quantizers. IEEE Signal Process. Lett. **2008**, 15, 13–16.
37. Na, S. Asymptotic Formulas for Variance-Mismatched Fixed-Rate Scalar Quantization of a Gaussian Source. IEEE Trans. Signal Process. **2011**, 59, 2437–2441.
38. Peric, Z.; Denic, B.; Savić, M.; Dincic, M.; Mihajlov, D. Quantization of Weights of Neural Networks with Negligible Decreasing of Prediction Accuracy. Inf. Technol. Control **2012**, accepted for publication.
39. Peric, Z.; Savic, M.; Dincic, M.; Vucic, N.; Djosic, D.; Milosavljevic, S. Floating Point and Fixed Point 32-bit Quantizers for Quantization of Weights of Neural Networks. In Proceedings of the 12th International Symposium on Advanced Topics in Electrical Engineering (ATEE), Bucharest, Romania, 25–27 March 2021.
40. Peric, Z.; Nikolic, Z. An Adaptive Waveform Coding Algorithm and its Application in Speech Coding. Digit. Signal Process. **2012**, 22, 199–209.
41. LeCun, Y.; Cortes, C.; Burges, C. The MNIST Handwritten Digit Database. Available online: yann.lecun.com/exdb/mnist/ (accessed on 15 May 2021).

**Figure 3.** SQNR of the 2-bit uniform quantizer (designed optimally with respect to MSE distortion) in a wide dynamic range of input data variances.

**Figure 5.** SQNR of the adaptive 2-bit uniform quantizer in a wide dynamic range of input data variances.

**Figure 7.** Distribution of weights of the trained MLP network: (**a**) between input and hidden layer and (**b**) between hidden and output layer.

**Figure 10.** Classification accuracy of the quantized MLP network as a function of the quantization step size, ε = 0.09.

**Table 1.** Performance (classification accuracy and SQNR) of the quantized MLP for various applied quantization models.

| Quantizer | 1-Bit [26] | 2-Bit Uniform [17] | 2-Bit Uniform [18] | 2-Bit Non-Uniform [20] | 2-Bit Non-Uniform [21] | 2-Bit Uniform (Proposed) | Full Precision |
|---|---|---|---|---|---|---|---|
| Accuracy (%) | 91.12 | 94.70 | 94.49 | 92.38 | 92.73 | 96.26 | 96.86 |
| SQNR (dB) | 4.25 | 1.63 | 1.19 | −8.89 | −2.41 | 8.71 | – |

**Table 2.** Performance (classification accuracy and SQNR) of the quantized CNN for various applied quantization models.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Perić, Z.; Savić, M.; Simić, N.; Denić, B.; Despotović, V.
Design of a 2-Bit Neural Network Quantizer for Laplacian Source. *Entropy* **2021**, *23*, 933.
https://doi.org/10.3390/e23080933
