# Two Novel Non-Uniform Quantizers with Application in Post-Training Quantization


## Abstract


## 1. Introduction

## 2. Related Work and Motivation

Both quantizers have support regions of the form [−3Δ, 3Δ] and [−3Δ^{mod}, 3Δ^{mod}], respectively, while differing in the way the decision and representation levels of the quantizer are specified. In the SPTQ design, the representation levels are the midpoints of the quantization cells, as in the simplest UQ design, while, unlike UQ, its quantization cells are not of equal widths. In the MSPTQ design, the quantizer decision thresholds are centered between the nearest representation levels, similar to the UQ design; however, unlike UQ, the quantization cells of MSPTQ are not of equal widths and the representation levels are not midpoints of the quantization cells. More details about the SPTQ and MSPTQ models are provided in the following sections. To determine the parameters of the novel quantizers as precisely as possible, we also provide a thorough analysis and a description of the optimization procedure for the two-bit SPTQ and MSPTQ. Specifically, we describe their design for the assumed Laplacian source and perform their optimization both iteratively and by a numerical optimization procedure. Afterwards, we perform post-training quantization with SPTQ and MSPTQ, study the resulting QNN accuracy, and present the benefits relative to the case where the two-bit UQ from [18] is utilized for the same classification task. We believe that both NUQs are particularly well suited to memory-constrained devices, where simple and acceptably accurate solutions are among the key requirements.

## 3. Symmetric SPTQ Design for the Laplacian Source

The goal of fixed-rate scalar quantization is to minimize the error of the quantized signal (Q_{N}(X)), compared to the original (X), for a given N and bit rate R, where R = log_{2}N [20]. An N-level scalar quantizer is denoted by Q_{N}. By the quantization procedure, the input signal amplitude range is divided into a granular region ℜ_{g} and an overload region ℜ_{o} (see Figure 1 for SPTQ). For any symmetric quantizer, such as the ones we design here, these regions are separated by the support region thresholds, denoted by −x_{max} and x_{max}, respectively [20]. The granular region ℜ_{g} is further partitioned into quantization cells, where the i^{th} cell is:

Here, y_{i} denotes the i^{th} representation level, and ${\left\{{\Re}_{i}\right\}}_{i=-N/2}^{-1}$ and ${\left\{{\Re}_{i}\right\}}_{i=1}^{N/2}$ denote the granular cells from the negative and positive amplitude regions, which are symmetrically placed around the zero mean. In symmetric quantization, the quantizer's main parameter sets are halved, since only the positive, or the absolute, values of the quantizer's parameters need to be determined and stored. The symmetry also holds for the overload cells, that is, for the pair of quantization cells of unlimited width in the overload region, ℜ_{o}. Each cell is assigned a representation level y_{i} (see Figure 1), specified as the midpoint of the cell by:

Here, x_{max} denotes the support region threshold of our two-bit SPTQ and is one of its key parameters. From Equations (5)–(8), one can conclude that x_{max}, or equivalently the step size Δ, completely determines the decision thresholds, x_{i}, and the representation levels, y_{i}, of the proposed two-bit SPTQ. In other words, the quantizer in question is completely determined by knowing the support region threshold, x_{max} = x_{max}^{SPTQ}. Therefore, we introduce the notation Q^{SPTQ}(x; x_{max}) for the transfer characteristic of the symmetric two-bit SPTQ (see Figure 2, where the characteristic is presented for x_{max} = 2.5512; the notation [J] comes from the name of the author of [20]).

For the assumed Laplacian source of zero mean and unit variance, σ^{2} = 1, we set x_{3} = ∞, denoting the upper limit of the integral in Equation (11). Then, the total distortion of our symmetric two-bit SPTQ can be rewritten as:

By setting the first derivative of the distortion, D^{SPTQ}, with respect to Δ equal to zero:

we obtain the rule for iteratively determining the optimal step size. Differentiating D^{SPTQ} twice with respect to Δ yields:

which confirms that D^{SPTQ} is a convex function of Δ. Moreover, to confirm that we end up iteratively with the unique optimal value of Δ, in the numerical results section we provide the results of numerical distortion optimization per Δ.
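Since the numbered equations do not survive in this text, the following sketch reconstructs the two-bit SPTQ distortion directly from the cell layout of Table 1a (thresholds {0, Δ, 3Δ}, representation levels {Δ/2, 2Δ}) for the zero-mean, unit-variance Laplacian pdf p(x) = e^{−√2|x|}/√2, and minimizes it numerically. The closed-form integral and the `minimize_scalar` helper are our own and are not the paper's Equations (11)–(18):

```python
import math

SQRT2 = math.sqrt(2.0)  # unit-variance Laplacian pdf: p(x) = exp(-sqrt(2)|x|)/sqrt(2)

def sptq_distortion(step):
    """Closed-form MSE of the symmetric two-bit SPTQ for a zero-mean,
    unit-variance Laplacian source. Positive-side mapping (Table 1a):
    [0, step) -> step/2, and [step, inf) -> 2*step, which covers both the
    second granular cell [step, 3*step) and the overload region."""
    lam = SQRT2
    return (step ** 2 / 4 - step / lam + 2 / lam ** 2
            + math.exp(-lam * step) * (3 * step ** 2 / 4 - 3 * step / lam))

def minimize_scalar(f, lo, hi, tol=1e-8):
    """Golden-section search for the minimum of a unimodal function on [lo, hi]."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - invphi * (b - a)
        d = a + invphi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return (a + b) / 2.0

step_opt = minimize_scalar(sptq_distortion, 0.1, 3.0)
sqnr_th = 10 * math.log10(1.0 / sptq_distortion(step_opt))  # SQNR = 10*log10(sigma^2 / D)
print(round(step_opt, 4), round(3 * step_opt, 4), round(sqnr_th, 2))  # 0.8504 2.5512 6.98
```

The numerical minimum agrees with the iteratively obtained Δ = 0.8504 and x_{max}^{SPTQ} = 2.5512 reported in Section 6, and the resulting theoretical SQNR matches the 6.9790 dB listed for SPTQ in Table 3, Case 3, up to rounding.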

## 4. Symmetric MSPTQ Design for the Laplacian Source

MSPTQ retains the same support region form as SPTQ, [−3Δ^{mod}, 3Δ^{mod}], as well as the same rule for specifying the representation levels, here denoted by ${y}_{1}^{\mathrm{mod}}$ and ${y}_{2}^{\mathrm{mod}}$, where the superscript **mod** indicates modification. Let us further assume an additional specification of MSPTQ: the quantizer decision threshold is centered between the nearest representation levels (similarly as in the UQ design), so that the support region [−3Δ^{mod}, 3Δ^{mod}] is preserved. More precisely, the condition stated in Equation (23), previously mentioned as the additional specification, is one of the prerequisites for the MSPTQ design. To completely specify MSPTQ, it is necessary to determine the quantization step size Δ^{mod}. To do so, we can minimize the distortion of the MSPTQ whose support region is [−3Δ^{mod}, 3Δ^{mod}]. For the given x_{max}^{MSPTQ} = 3Δ^{mod} = ${x}_{2}^{\mathrm{mod}}$, according to Equations (22) and (23), we can determine the codebook ${Y}^{\mathrm{MSPTQ}}\equiv \left\{{y}_{-2}^{\mathrm{mod}},{y}_{-1}^{\mathrm{mod}},{y}_{1}^{\mathrm{mod}},{y}_{2}^{\mathrm{mod}}\right\}\subset \mathbb{R}$ of our two-bit MSPTQ and the decision threshold ${x}_{1}^{\mathrm{mod}}$. Let us highlight again that symmetry about the zero-mean value holds, so that by specifying ${x}_{0}^{\mathrm{mod}}=0$ and identifying that ${x}_{-i}^{\mathrm{mod}}=-{x}_{i}^{\mathrm{mod}}, i\in \left\{1, 2\right\}$, MSPTQ is completely determined. To clearly distinguish the two NUQ models proposed in this paper, Table 1 summarizes the main parameters that unambiguously describe our NUQs, SPTQ (Table 1a) and MSPTQ (Table 1b). Note that the representation levels of SPTQ and MSPTQ follow the same rule, y_{1} = Δ/2, y_{1}^{mod} = Δ^{mod}/2 and y_{2} = 2Δ, y_{2}^{mod} = 2Δ^{mod}; the main difference lies in specifying the decision thresholds ${x}_{1}$ and ${x}_{1}^{\mathrm{mod}}$.

Expressing the distortion in terms of Δ^{mod} alone, and not of the individual representation and decision levels, we can derive the expression for the distortion as:

By setting the first derivative of D^{MSPTQ} with respect to Δ^{mod} equal to zero:

the optimal Δ^{mod} can be determined iteratively from:

For the obtained Δ^{mod}, we can calculate the total distortion of the MSPTQ from Equation (25).

Taking the second derivative with respect to Δ^{mod}:

we conclude that D^{MSPTQ} is a convex function of Δ^{mod}, which guarantees the existence of a unique minimum of D^{MSPTQ}. As in the case of SPTQ, to confirm that we end up iteratively with the unique optimal value of Δ^{mod}, in the numerical results section we provide the results of numerical distortion optimization per Δ^{mod}.
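The MSPTQ distortion can be cross-checked in the same way as for SPTQ. This sketch assumes the cell layout of Table 1b (thresholds {0, 5Δ^{mod}/4, 3Δ^{mod}}, representation levels {Δ^{mod}/2, 2Δ^{mod}}) and the unit-variance Laplacian pdf; the closed-form integral and the golden-section helper are our own, not the paper's Equations (25)–(28):

```python
import math

SQRT2 = math.sqrt(2.0)  # unit-variance Laplacian pdf: p(x) = exp(-sqrt(2)|x|)/sqrt(2)

def msptq_distortion(step):
    """Closed-form MSE of the two-bit MSPTQ for a zero-mean, unit-variance
    Laplacian source. Positive-side mapping (Table 1b): [0, 5*step/4) -> step/2
    and [5*step/4, inf) -> 2*step, i.e. the decision threshold sits midway
    between the two representation levels."""
    lam = SQRT2
    return (step ** 2 / 4 - step / lam + 2 / lam ** 2
            - (3 * step / lam) * math.exp(-5 * lam * step / 4))

def minimize_scalar(f, lo, hi, tol=1e-8):
    """Golden-section search for the minimum of a unimodal function on [lo, hi]."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - invphi * (b - a)
        d = a + invphi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return (a + b) / 2.0

step_mod = minimize_scalar(msptq_distortion, 0.1, 3.0)
sqnr_th = 10 * math.log10(1.0 / msptq_distortion(step_mod))
print(round(step_mod, 4), round(3 * step_mod, 4), round(sqnr_th, 2))  # 0.9021 2.7063 7.52
```

Again the numerical minimum agrees with the iteratively obtained Δ^{mod} = 0.9021 and x_{max}^{MSPTQ} = 2.7063 from Section 6, and the theoretical SQNR matches the 7.5165 dB listed for MSPTQ in Table 5, Case 4.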

## 5. Application of Two Novel Non-Uniform Quantizers in Post-Training Quantization

Here, D_{ex} and SQNR_{ex} are the experimentally determined distortion and SQNR, **Ŵ** = {w_{j}}_{j = 1, 2, …, W} denotes the vector of weights represented in FP32 format, and **Ŵ**^{*} = {w_{j}^{*}}_{j = 1, 2, …, W} denotes the vector of weights to be loaded into the QNN. In brief, at the very beginning of the post-training quantization, the NN weights are normalized to zero mean and unit variance, forming the vector **Ŵ**^{N} = {w_{j}^{N}}_{j = 1, 2, …, W}. After all normalized weights are quantized by applying SPTQ or MSPTQ and denormalized to the original range, **Ŵ**^{*} = {w_{j}^{*}}_{j = 1, 2, …, W} is loaded into the QNN model (see Algorithm 1).

| Algorithm 1: Weight compression by means of post-training quantization using SPTQ/MSPTQ |
|---|
| **Notation:** w_{j} — pretrained weight; w_{j}^{SPTQ} — quantized weight using SPTQ; w_{j}^{MSPTQ} — quantized weight using MSPTQ |
| **Input:** Ŵ = {w_{j}}_{j = 1, 2, …, W}, weights represented in FP32 format; ε_{min} = 10^{−4} |
| **Output:** quantized weights for SPTQ — Ŵ^{SPTQ} = {w_{j}^{SPTQ}}_{j = 1, 2, …, W}; quantized weights for MSPTQ — Ŵ^{MSPTQ} = {w_{j}^{MSPTQ}}_{j = 1, 2, …, W}; ${\mathrm{SQNR}}_{\mathrm{th}}^{\mathrm{SPTQ}}$, ${\mathrm{SQNR}}_{\mathrm{ex}}^{\mathrm{SPTQ}}$, ${\mathrm{SQNR}}_{\mathrm{th}}^{\mathrm{MSPTQ}}$, ${\mathrm{SQNR}}_{\mathrm{ex}}^{\mathrm{MSPTQ}}$; Accuracy^{SPTQ}, Accuracy^{MSPTQ} |
| 1: load initial pretrained and stored weights Ŵ = {w_{j}}_{j = 1, 2, …, W} |
| 2: normalize weights and form Ŵ^{N} = {w_{j}^{N}}_{j = 1, 2, …, W} |
| 3: w_{min} ← minimal value of the normalized weights from Ŵ^{N} |
| 4: w_{max} ← maximal value of the normalized weights from Ŵ^{N} |
| 5: select the SPTQ model to quantize the normalized weights |
| 6: initialize ε^{SPTQ} ← 1, Δ^{(0)} = Δ^{SPTQ} ← 1 (or some other given value), i ← 1 |
| 7: **while** ε^{SPTQ} ≥ ε_{min} **do** |
| 8: calculate Δ^{(i+1)} by using (18) |
| 9: calculate ε^{SPTQ} = abs(Δ^{(i+1)} − Δ^{SPTQ}) |
| 10: Δ^{SPTQ} ← Δ^{(i+1)} |
| 11: i ← i + 1 |
| 12: **end while** |
| 13: Δ ← Δ^{SPTQ} |
| 14: x_{max}^{SPTQ} ← 3Δ |
| 15: calculate {x_{−2}, x_{−1}, x_{0}, x_{1}, x_{2}} by using (7) for x_{max}^{SPTQ} |
| 16: form codebook Y^{SPTQ} = {y_{−2}, y_{−1}, y_{1}, y_{2}} by using (8) or Table 1a |
| 17: quantize the normalized weights by using codebook Y^{SPTQ} |
| 18: denormalize the quantized weights and form vector Ŵ^{SPTQ} = {w_{j}^{SPTQ}}_{j = 1, 2, …, W} |
| 19: select the MSPTQ model to quantize the normalized weights |
| 20: initialize ε^{MSPTQ} ← 1, Δ^{mod(0)} = Δ^{MSPTQ} ← Δ^{SPTQ}, i ← 1 |
| 21: **while** ε^{MSPTQ} ≥ ε_{min} **do** |
| 22: calculate Δ^{mod(i+1)} by using (28) |
| 23: calculate ε^{MSPTQ} = abs(Δ^{mod(i+1)} − Δ^{MSPTQ}) |
| 24: Δ^{MSPTQ} ← Δ^{mod(i+1)} |
| 25: i ← i + 1 |
| 26: **end while** |
| 27: Δ^{mod} ← Δ^{MSPTQ} |
| 28: x_{max}^{MSPTQ} ← 3Δ^{MSPTQ} |
| 29: calculate {x_{−2}^{mod}, x_{−1}^{mod}, x_{0}^{mod}, x_{1}^{mod}, x_{2}^{mod}} by using Table 1b for x_{max}^{MSPTQ} |
| 30: form codebook Y^{MSPTQ} ≡ {y_{−2}^{mod}, y_{−1}^{mod}, y_{1}^{mod}, y_{2}^{mod}} by using Table 1b |
| 31: quantize the normalized weights by using codebook Y^{MSPTQ} |
| 32: denormalize the quantized weights and form vector Ŵ^{MSPTQ} = {w_{j}^{MSPTQ}}_{j = 1, 2, …, W} |
| 33: calculate ${\mathrm{SQNR}}_{\mathrm{ex}}^{\mathrm{SPTQ}}$, ${\mathrm{SQNR}}_{\mathrm{th}}^{\mathrm{SPTQ}}$, ${\mathrm{SQNR}}_{\mathrm{ex}}^{\mathrm{MSPTQ}}$, ${\mathrm{SQNR}}_{\mathrm{th}}^{\mathrm{MSPTQ}}$ by using Equations (15), (25), (30)–(33); estimate the accuracies of the QNNs |
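The normalize–quantize–denormalize flow of Algorithm 1 can be sketched in a few lines. This is an illustrative reduction to the SPTQ branch only: the function name and the toy weight vector are our own, the iterative step-size search (steps 6–12) is omitted, and the reported Δ = 0.8504 is plugged in directly:

```python
import math

def quantize_weights(weights, step):
    """Sketch of Algorithm 1, SPTQ branch: normalize to zero mean and unit
    variance, map each weight onto the two-bit SPTQ codebook
    {-2*step, -step/2, step/2, 2*step} (decision thresholds {-step, 0, step}),
    then denormalize back to the original range."""
    n = len(weights)
    mean = sum(weights) / n
    std = math.sqrt(sum((w - mean) ** 2 for w in weights) / n)
    normalized = [(w - mean) / std for w in weights]

    def sptq(x):
        sign = 1.0 if x >= 0 else -1.0
        mag = abs(x)
        # cell [0, step) -> step/2; [step, inf) -> 2*step (granular cell + overload)
        return sign * (step / 2 if mag < step else 2 * step)

    return [sptq(x) * std + mean for x in normalized]

# toy usage with the iteratively optimal step size reported in Section 6
w = [0.3, -1.2, 0.05, 2.4, -0.7, 0.9, -0.1, 1.5]
wq = quantize_weights(w, step=0.8504)
print(len(set(wq)) <= 4)  # True: at most four distinct quantized values
```

After denormalization the quantized weights still take at most four distinct values, which is what makes the two-bit representation memory efficient.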

Here, D^{SPTQ} and D^{MSPTQ} are specified by Equations (15) and (25), respectively.

## 6. Numerical Results and Analysis

Algorithm 1 is first applied for determining Δ and Δ^{mod}, or equivalently, for determining x_{max}^{SPTQ} and x_{max}^{MSPTQ}. To initialize Algorithm 1 for determining Δ, we use the different values of Δ^{(0)} specified in Table 2. To determine Δ^{mod}, we use the result of the first iterative process, that is, we assume Δ^{mod(0)} = Δ. The same condition as in [22], that two adjacent iterations differ by less than 10^{−4}, is used as the output criterion of the algorithm. By observing the statistics of the trained and normalized NN weights, we have found that the minimum and maximum weights in the original FP32 format amount to w_{min} = −7.063787 and w_{max} = 4.8371024. Following the predefined rule for specifying the support region of SPTQ as [−3Δ, 3Δ], we use Δ^{(0)} = |w_{max}|/3 = 1.61237 and Δ^{(0)} = |w_{min}|/3 = 2.3546 to initialize Algorithm 1 for SPTQ. Moreover, we assume Δ^{(0)} = x_{max}[H]/3 = 0.6536 and Δ^{(0)} = x_{max}[J]/3 = 0.7249, where x_{max}[H] and x_{max}[J] are the optimal and asymptotically optimal x_{max} values for UQ given by Hui [23] and Jayant [20] (see Table 2). It is worth highlighting that the different initializations all require around 40 iterations (see Table 2 and Figure 5). Moreover, all the observed initializations lead to the unique final value of Δ (Δ = 0.8504) and x_{max}^{SPTQ} = 3Δ = 2.5512. If we further use Δ^{mod(0)} = Δ = 0.8504 for iteratively determining Δ^{mod}, with the same output criterion, only seven iterations are needed. As a result of the second iterative process, we determine Δ^{mod} = 0.9021, as well as x_{max}^{MSPTQ} = 3Δ^{mod} = 2.7063. To additionally confirm that with the output criterion of Algorithm 1 we ended up with the optimal values of Δ and Δ^{mod}, Figure 6 depicts the dependence of the distortion of the applied quantizers on the corresponding basic step sizes. The iteratively obtained values of Δ and Δ^{mod} are marked with asterisks in Figure 6 and are indeed optimal, as they give the minima of D^{SPTQ} and D^{MSPTQ}.

In Case 1, the support region of the quantizer is specified as [−min(|w_{min}|, |w_{max}|), min(|w_{min}|, |w_{max}|)], which in our experiment is simply [−w_{max}, w_{max}]. Therefore, in Case 1, the support region depends on the maximum value of the normalized trained model weights in full precision, which for the observed trained weights (for the MNIST dataset) amounts to w_{max} = 4.8371024. With the support region of SPTQ and MSPTQ set in this way, it includes 99.988% of all the normalized weights. One can notice from Table 3 that Case 1 thus defined provides the highest QNN model accuracy of all the observed cases, amounting to 97.61% of correctly classified validation samples of the MNIST dataset. Compared to the application of the simple UQ, the applied SPTQ provides an increase in accuracy of 0.64% (see Table 4). This represents a significant increase in accuracy, especially taking into account that the only difference is in the applied quantizers, the same bit rate is assumed, and we are very close to the full-precision accuracy of the baseline NN. Similarly, the experimentally and theoretically obtained SQNR of SPTQ is higher than that of UQ, with gains of 1.8078 dB and 2.5078 dB, respectively. We can notice that the theoretically determined SQNR has lower values than the experimentally determined SQNR. As explained in [18,19], the reason is that in the experimental analysis the quantized weights, originating from a Laplacian-like distribution, come from the limited range [−7.063787, 4.8371024] (see Figure 7), while the theoretical analysis assumes quantization of values from the unrestricted Laplacian source, which increases the amount of distortion, that is, decreases the theoretical SQNR value. In summary, Case 1 already shows the benefits of implementing SPTQ over UQ, providing an increase in all significant performance indicators observed.
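The share of weights falling within a candidate granular region, reported in the "Within ℜ_{g}" rows of Tables 3–6, can be checked with a few lines. This is a sketch: the synthetic Laplacian sample stands in for the actual normalized weights, and the helper name is our own:

```python
import math
import random

def fraction_within_support(weights, x_max):
    """Share of (normalized) weights inside the granular region [-x_max, x_max];
    the remaining weights fall into the overload cells and are clipped."""
    inside = sum(1 for w in weights if -x_max <= w <= x_max)
    return inside / len(weights)

# For a unit-variance Laplacian, P(|X| <= x_max) = 1 - exp(-sqrt(2) * x_max).
random.seed(0)
lap = [random.expovariate(math.sqrt(2)) * random.choice((-1, 1))
       for _ in range(200_000)]
empirical = fraction_within_support(lap, 2.5512)           # x_max^SPTQ from Case 3
theory = 1 - math.exp(-math.sqrt(2) * 2.5512)              # ~0.973
print(abs(empirical - theory) < 0.01)                       # True
```

For the optimal SPTQ support threshold 2.5512 the theoretical coverage is about 97.3% for an ideal Laplacian source, broadly consistent with the 98.567% measured for the actual normalized weights in Case 3.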

In Case 2, the support region is specified as [−max(|w_{min}|, |w_{max}|), max(|w_{min}|, |w_{max}|)], as defined in [31]. In our experiment, it can be expressed as [−|w_{min}|, |w_{min}|], which in practice becomes [w_{min}, −w_{min}], forming the support region (in the case of the MNIST dataset) as [−7.063787, 7.063787]. It can be observed that the support region in Case 2 includes 100% of the weights and even extends beyond the maximum value of the normalized weights, which makes it unnecessarily wide and representative of an unfavorable choice of ℜ_{g}.

In Case 3, the granular region, ℜ_{g}, is determined by the iteratively calculated optimal quantization step, Δ = 0.8504. The support region threshold of SPTQ is defined as x_{max}^{SPTQ} = 3Δ and amounts to 2.5512. This value represents the theoretically optimal support region threshold, therefore providing the maximal theoretical SQNR value for this case. Although Case 3 yields the highest theoretical and a higher experimental SQNR compared to the previous cases, the QNN model accuracy is significantly lower. This further confirms the premise that the quantizer that provides the highest SQNR does not necessarily provide the highest accuracy of the QNN, which strongly depends on the choice of ℜ_{g} as well.

Cases 4 and 5 additionally highlight the importance of the ℜ_{g} choice for both SQNR and QNN performance. One can notice that, although not optimal for SPTQ, the support region thresholds specified in Cases 4 and 5 provide the highest experimental SQNR values, with the theoretical SQNR relatively close to the optimal value (maximum difference of about 0.4 dB). Unlike the SQNR, in these two cases we obtain the lowest QNN model accuracy. The maximum accuracy difference in these two cases is about 4% compared to the best-performing Case 1, which represents a significant NN performance degradation. The lower accuracy is a direct result of the overly narrow support region, with 94.787% and 96.691% of the weights lying within ℜ_{g} for Cases 4 and 5, respectively.

From the presented analysis, one can conclude that the choice of ℜ_{g} has a very strong influence on the QNN accuracy and the obtained SQNR. Moreover, it has been shown that SPTQ performs much better than UQ when a wider ℜ_{g} is used, which we intuitively expected and which is generally known for non-uniform quantization [20]. In contrast, the accuracy of the QNN with UQ outperforms the QNN with SPTQ in Cases 3–5, with Case 3 being the most significant, as it represents the optimal solution for the theoretical SQNR of SPTQ. Based on the SQNR analysis of SPTQ, we can conclude that the SPTQ designed in Case 3 could be highly applicable in traditional quantization tasks. To overcome the mentioned imperfection of SPTQ in post-training quantization, we have introduced in this paper a simple yet efficient modification and optimization of SPTQ, denoted MSPTQ, whose performance is presented and analyzed in the following.

We have analyzed the performance of MSPTQ for several choices of ℜ_{g}, including the one constructed with the optimal support region threshold value. The performance of SPTQ and MSPTQ is presented in Table 5 for slightly different cases of the support region threshold, as the previously defined Cases 4 and 5 are not relevant to non-uniform quantization. Cases 1 to 3 are the same as in the previous analysis, so that a direct performance comparison can be conducted. One can notice that MSPTQ outperforms both SPTQ and UQ in all the observed performance indicators for Cases 1 and 3, while the QNN with SPTQ obtains a higher accuracy, by a margin of 0.44%, for Case 2. Moreover, MSPTQ provides a gain in both theoretical and experimental SQNR values over the QNN with UQ for all comparable cases. Case 2 is a specific one, being an example of an unfavorable choice of ℜ_{g}, and the QNN with SPTQ performs better than its modified counterpart. On the other hand, for a carefully designed ℜ_{g}, applying MSPTQ yields a significant increase in both SQNR and QNN accuracy compared to SPTQ. The gain in accuracy is especially large in Case 3, which utilizes the support region threshold optimally determined for SPTQ, and amounts to 1.42%. For the ℜ_{g} specified in Case 3, MSPTQ obtains the highest experimental SQNR value, with an increase of 0.874 dB compared to SPTQ and 0.4514 dB compared to UQ. Case 4 implements the numerically optimized ℜ_{g} width for MSPTQ, with a support region threshold value close to the one obtained for SPTQ. Expectedly, the theoretically obtained SQNR is the highest for this case, while the experimentally obtained SQNR, as well as the accuracy, takes the second highest value among all the cases, with 99.001% of the normalized weights within ℜ_{g}. As the degradation of accuracy in Case 4 amounts to about 0.9% compared to the full-precision case (98.1% − 97.23% = 0.87%), we can conclude that with this simple modification we have managed not only to improve the SQNR but also to increase the accuracy of the QNN compared to the cases where UQ and SPTQ are implemented.

We also compare our results with the μ-law logarithmic quantizer from [16] for the reference variance σ_{ref}, that is, for 20 log_{10}(σ/σ_{ref}) = 0. In [16], the SQNR was calculated for different representative values of the parameter μ over a wide dynamic range of variances. By comparing the achieved SQNR values from Table I of [16] (SQNR = 4.44 dB for μ = 255) with the SQNR values from our Table 3 and Table 5 (SQNR = 7.8099 dB for SPTQ Case 3 and SQNR = 8.5608 dB for MSPTQ Case 4), we can conclude that SPTQ and MSPTQ achieve higher SQNR values than the μ-law logarithmic quantizer from [16]. This confirms that the advantages of the logarithmic quantizer do not come to the forefront at lower bit rates.

Since the ternary quantizer has N = 3 cells, one fewer than our two-bit quantizers (N = 2^{2} = 4), our anticipation was that the ternary quantizer would provide a lower SQNR value, as well as a lower accuracy, compared to UQ, which the numerical results in Table 6 confirm. Namely, Table 6 shows the results of applying ternary quantization to the NN weights of the previously trained MLP, where part (a) assumes the same support region as in the cases of the optimal UQ (Case 5), the asymptotically optimal UQ (Case 4) and the optimal SPTQ (Case 3), while part (b) illustrates the importance of choosing the value of the scale factor, α, which scales the entire ternary quantizer model with the aim of adapting it. In [20], the optimal support region threshold value for the ternary quantizer is given, which for the assumed Laplacian pdf is 3/√2. By setting α = 1, we indeed determined the highest theoretical SQNR value for the ternary quantizer. However, as shown in [18], the highest SQNR does not always guarantee the highest accuracy; we have therefore varied the value of the scale factor from 1/4 to 7/4 in steps of 1/4, and we have found that, for the observed values of α, the highest accuracy is achieved for α = 3/2, providing a QNN accuracy of 97.51%. In short, both of our new non-uniform quantizer models (SPTQ and MSPTQ) have superior performance (higher SQNR and accuracy) compared to the quantizer models from [32,33], which is a consequence not only of the fact that the number of cells in our quantizers is greater by one, but also of the judicious distribution of the cell widths and the optimization of the support region threshold, which together provide a significant step forward in the field of quantization.
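The scaled ternary quantizer discussed above can be sketched generically. The threshold and level placement are left as parameters here, since the exact designs of [32,33] are not reproduced; only the scaling by α follows the adaptation described in the text:

```python
import math

def ternary_quantize(x, threshold, level, alpha=1.0):
    """Three-level (ternary) quantizer whose whole characteristic is scaled by
    the factor alpha: values inside (-alpha*threshold, alpha*threshold) map to
    zero, the rest to +/- alpha*level. threshold and level are illustrative
    parameters, not the exact designs of [32,33]."""
    t, y = alpha * threshold, alpha * level
    if x > t:
        return y
    if x < -t:
        return -y
    return 0.0

# Illustrative usage: support region threshold 3/sqrt(2) from [20], with the
# inner threshold placed (arbitrarily, for this sketch) at half the support,
# scaled by alpha = 3/2, the best-accuracy setting reported above.
x_max = 3.0 / math.sqrt(2.0)
print(ternary_quantize(2.5, x_max / 2, x_max, alpha=1.5) != 0.0)  # True
```

Scaling by α shifts both the zero cell and the outer representation levels together, which is why a single scalar suffices to adapt the quantizer to the weight statistics.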

## 7. Summary and Conclusions

In this paper, we have described the design of two novel non-uniform quantizers and the iterative determination of their basic step sizes, Δ and Δ^{mod}, which is of the utmost importance in traditional quantization. We have shown that the iterative algorithms utilized output values of Δ and Δ^{mod} that are indeed optimal, as they match the corresponding results of numerical distortion optimization per basic step size and give the minimum distortion of SPTQ and MSPTQ. Moreover, the main benefits of this paper are meaningful conclusions that specify how the performance of the proposed quantizers, as well as of the QNNs implementing the novel NUQs, varies with the basic step sizes. We have also confirmed the premise that the quantizer that provides the highest SQNR does not necessarily provide the highest accuracy of the QNN. Finally, with the introduction of the MSPTQ model, we have managed not only to improve the SQNR but also to increase the accuracy of the QNN compared to the cases where UQ and SPTQ are implemented. Although UQ is a highly exploited quantizer model, due to its simple nature and its potential for modification, it is natural to expect that the research and development of modified quantization models will continue to attract the attention of the scientific community.

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References

- Number of Internet of Things (IoT) Connected Devices Worldwide in 2018, 2025 and 2030. Available online: https://www.statista.com/statistics/802690/worldwide-connected-devices-by-accesstechnology (accessed on 1 November 2021).
- Teerapittayanon, S.; McDanel, B.; Kung, H.T. Distributed deep neural networks over the cloud, the edge and end devices. In Proceedings of the 37th IEEE International Conference on Distributed Computing Systems (ICDCS), Atlanta, GA, USA, 5–8 June 2017; pp. 328–339. [Google Scholar]
- Vestias, M.; Duarte, R.; Sousa, J.; Neto, H. Moving Deep Learning to the Edge. Algorithms
**2020**, 13, 125. [Google Scholar] [CrossRef] - Liu, D.; Kong, H.; Luo, X.; Liu, W.; Subramaniam, R. Bringing AI to edge: From Deep Learning’s Perspective. arXiv
**2020**, arXiv:2011.14808. [Google Scholar] [CrossRef] - Gholami, A.; Kim, S.; Dong, Z.; Yao, Z.; Mahoney, M.W.; Keutzer, K. A Survey of Quantization Methods for Efficient Neural Network Inference. arXiv
**2021**, arXiv:2103.13630. [Google Scholar] - Guo, Y. A Survey on Methods and Theories of Quantized Neural Networks. arXiv
**2018**, arXiv:1808.04752. [Google Scholar] - Fung, J.; Shafiee, A.; Abdel-Aziz, H.; Thorsley, D.; Georgiadis, G.; Hassoun, J. Post-Training Piecewise Linear Quantization for Deep Neural Networks. In Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 69–86. [Google Scholar]
- Lin, D.; Talathi, S.; Soudry, D.; Annapureddy, S. Fixed Point Quantization of Deep Convolutional Networks. In Proceedings of the 33rd International Conference on Machine Learning Conference on Neural Information Processing Systems, New York, NY, USA, 8–14 June 2016; pp. 2849–2858. [Google Scholar]
- Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. J. Mach. Learn. Res.
**2017**, 18, 6869–6898. [Google Scholar] - Huang, K.; Ni, B.; Yang, D. Efficient Quantization for Neural Networks with Binary Weights and Low Bit Width Activations. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 3854–3861. [Google Scholar]
- Yang, Z.; Wang, Y.; Han, K.; Xu, C.; Xu, C.; Tao, D.; Xu, C. Searching for Low-Bit Weights in Quantized Neural Networks. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, BC, Canada, 6–12 December 2020. [Google Scholar]
- Banner, R.; Nahshan, Y.; Hoffer, E.; Soudry, D. ACIQ: Analytical Clipping for Integer Quantization of Neural Networks. arXiv
**2018**, arXiv:1810.05723. [Google Scholar] - Sanghyun, S.; Juntae, K. Efficient Weights Quantization of Convolutional Neural Networks Using Kernel Density Estimation Based Non-Uniform Quantizer. Appl. Sci.
**2019**, 9, 2559. [Google Scholar] [CrossRef] - Nikolić, J.; Perić, Z.; Aleksić, D.; Tomić, S. On Different Criteria for Optimizing the Two-bit Uniform Quantizer. In Proceedings of the 2022 21st International Symposium INFOTEH-JAHORINA (INFOTEH), East Sarajevo, Bosnia and Herzegovina, 16–18 March 2022; pp. 1–4. [Google Scholar] [CrossRef]
- Na, S.; Neuhoff, D. Monotonicity of Step Sizes of MSE-Optimal Symmetric Uniform Scalar Quantizers. IEEE Trans. Inf. Theory
**2019**, 65, 1782–1792. [Google Scholar] [CrossRef] - Perić, Z.; Denić, B.; Dinčić, M.; Nikolić, J. Robust 2-bit Quantization of Weights in Neural Network Modeled by Laplacian Distribution. Adv. Electr. Comput. Eng.
**2021**, 21, 3–10. [Google Scholar] [CrossRef] - Hubara, I.; Courbariaux, M.; Soudry, D.; Ran, E.Y.; Bengio, Y. Binarized Neural Networks. In Proceedings of the 30th Conference on Neural Information Processing Systems (NeurIPS 2016), Barcelona, Spain, 1–9 December 2016. [Google Scholar]
- Tomić, S.; Nikolić, J.; Perić, Z.; Aleksić, D. Performance of Post-training Two-bit Uniform and Layer-wise Uniform Quantization for MNIST Dataset from the Perspective of Support Region Choice. Math. Probl. Eng.
**2022**, 2022, 1463094. [Google Scholar] [CrossRef] - Nikolić, J.; Perić, Z.; Aleksić, D.; Tomić, S.; Jovanović, A. Whether the Support Region of Three-bit Uniform Quantizer has a Strong Impact on Post-training Quantization for MNIST Dataset? Entropy
**2021**, 23, 1699. [Google Scholar] [CrossRef] [PubMed] - Jayant, N.S.; Noll, P. Digital Coding of Waveforms; Prentice Hall: Hoboken, NJ, USA, 1984. [Google Scholar]
- Uhlich, S.; Mauch, L.; Cardinaux, F.; Yoshiyama, K. Mixed precision DNNs: All you Need is a Good Parametrization. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- Nikolić, J.; Aleksić, D.; Perić, Z.; Dinčić, M. Iterative Algorithm for Parameterization of Two-Region Piecewise Uniform Quantizer for the Laplacian Source. Mathematics
**2021**, 9, 3091. [Google Scholar] [CrossRef] - Hui, D.; Neuhoff, D.L. Asymptotic Analysis of Optimal Fixed-Rate Uniform Scalar Quantization. IEEE Trans. Inf. Theory
**2001**, 47, 957–977. [Google Scholar] [CrossRef] [Green Version] - Zhao, J.; Xu, S.; Wang, R.; Zhang, B.; Guo, G.; Doermann, D.; Sun, D. Data-adaptive Binary Neural Networks for Efficient Object Detection and Recognition. Pattern Recognit. Lett.
**2022**, 153, 239–245. [Google Scholar] [CrossRef] - Gong, R.; Liu, X.; Jiang, S.; Li, T.; Hu, P.; Lin, J.; Yu, F.; Yan, J. Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 4851–4860. [Google Scholar] [CrossRef]
- Deng, L. The MNIST Database of Handwritten Digit Images for Machine Learning Research [Best of the Web]. IEEE Signal Process. Mag.
**2012**, 29, 141–142. [Google Scholar] [CrossRef] - Velichko, A. Neural Network for Low-Memory IoT Devices and MNIST Image Recognition Using Kernels Based on Logistic Map. Electronics
**2020**, 9, 1432. [Google Scholar] [CrossRef] - Python Software Foundation. Python Language Reference, Version 2.7. Available online: http://www.python.org. (accessed on 1 December 2021).
- Available online: https://github.com/zalandoresearch/fashion-mnist (accessed on 20 August 2022).
- Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv
**2017**, arXiv:1708.07747. [Google Scholar] - Soufleri, E.; Roy, K. Network Compression via Mixed Precision Quantization Using a Multi-Layer Perceptron for the Bit-Width Allocation. IEEE Access
**2021**, 9, 135059–135068. [Google Scholar] [CrossRef] - Zhou, A.; Yao, A.; Guo, Y.; Xu, L.; Chen, Y. Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights. arXiv
**2017**, arXiv:1702.03044. [Google Scholar] - Wang, P.; Chen, Q.; He, X.; Cheng, J. Towards Accurate Post-Training Network Quantization via Bit-Split and Stitching. In Proceedings of the 37th International Conference on Machine learning (ICML’20), online. 12–18 July 2020; pp. 9847–9856. [Google Scholar]

**Figure 2.** Transfer characteristic of the two-bit SPTQ, Q^{SPTQ}(x), for [−x_{max}^{SPTQ}, x_{max}^{SPTQ}] = [−2.5512, 2.5512].

**Figure 3.** Granular region, $\Re_{g}$, and overload region, $\Re_{o}$, of the symmetric two-bit MSPTQ.

**Figure 5.** Algorithm convergence illustration: (**a**) Δ^{(0)} = 1, Δ^{(0)} = x_{max}[H]/3, Δ^{(0)} = x_{max}[J]/3; (**b**) Δ^{(0)} = |w_{max}|/3, Δ^{(0)} = |w_{min}|/3.

| Quantizer type | x_{0} | x_{1} | x_{2} = x_{max}^{SPTQ} | y_{1} | y_{2} |
|---|---|---|---|---|---|
| SPTQ | 0 | Δ | 3Δ | Δ/2 | 2Δ |

(a)

| Quantizer type | x_{0}^{mod} | x_{1}^{mod} | x_{2}^{mod} = x_{max}^{MSPTQ} | y_{1}^{mod} | y_{2}^{mod} |
|---|---|---|---|---|---|
| MSPTQ | 0 | 5Δ^{mod}/4 | 3Δ^{mod} | Δ^{mod}/2 | 2Δ^{mod} |

(b)

**Table 2.** Number of iterations of Algorithm 1 for different initializations Δ^{(0)} of SPTQ.

| SPTQ | Δ^{(0)} = 1 | Δ^{(0)} = \|w_{max}\|/3 | Δ^{(0)} = \|w_{min}\|/3 | Δ^{(0)} = x_{max}[H]/3 | Δ^{(0)} = x_{max}[J]/3 |
|---|---|---|---|---|---|
| number of iterations | 40 | 39 | 40 | 41 | 41 |

**Table 3.** SQNR and QNN model accuracy for the MLP trained on the MNIST dataset: different SPTQ designs for R = 2 bit/sample.

| w_{min} = −7.063787, w_{max} = 4.8371024, 3Δ = 2.5512, x_{max}[H] = 1.9605, x_{max}[J] = 2.1748 | Case 1: ℜ_{g} = [−w_{max}, w_{max}] | Case 2: ℜ_{g} = [w_{min}, −w_{min}] | Case 3: ℜ_{g} = [−3Δ, 3Δ] | Case 4: ℜ_{g} = [−x_{max}[H], x_{max}[H]] | Case 5: ℜ_{g} = [−x_{max}[J], x_{max}[J]] |
|---|---|---|---|---|---|
| SQNR_{ex}^{SPTQ} (dB) | 4.6899 | 2.5745 | 7.8099 | 7.9051 | 8.0068 |
| SQNR_{th}^{SPTQ} (dB) | 4.4438 | 1.6044 | 6.9790 | 6.5437 | 6.8086 |
| Accuracy (%) | 97.61 | 97.42 | 95.75 | 93.77 | 94.52 |
| Within ℜ_{g} (%) | 99.988 | 100 | 98.567 | 94.787 | 96.691 |

**Table 4.** SQNR and QNN model accuracy for the MLP trained on the MNIST dataset with the application of different UQ designs for a bit rate of R = 2 bit/sample (part of the results are from [18]).

| w_{min} = −7.063787, w_{max} = 4.8371024, 3Δ = 2.5512, x_{max}[H] = 1.9605, x_{max}[J] = 2.1748 | Case 1: ℜ_{g} = [−w_{max}, w_{max}] | Case 2: ℜ_{g} = [w_{min}, −w_{min}] | Case 3: ℜ_{g} = [−3Δ, 3Δ] | Case 4: ℜ_{g} = [−x_{max}[H], x_{max}[H]] | Case 5: ℜ_{g} = [−x_{max}[J], x_{max}[J]] |
|---|---|---|---|---|---|
| SQNR_{ex}^{UQ} (dB) | 2.8821 | −1.2402 | 8.2325 | 8.7676 | 8.7639 |
| SQNR_{th}^{UQ} (dB) | 1.9360 | −2.0066 | 6.8237 | 6.9787 | 7.0707 |
| Accuracy (%) | 96.97 | 94.58 | 97.12 | 96.34 | 96.74 |
| Within ℜ_{g} (%) | 99.988 | 100 | 98.567 | 94.787 | 96.691 |

**Table 5.** SQNR and QNN model’s accuracy for MLP trained on MNIST dataset: the application of MSPTQ designs for a bit rate of R = 2 bit/sample.

w_{min} = −7.063787, w_{max} = 4.8371024, x_{max}^{SPTQ} = 3Δ = 2.5512, x_{max}^{MSPTQ} = 3Δ^{mod} = 2.7063 | Case 1: ℜ_{g} = [−w_{max}, w_{max}] | Case 2: ℜ_{g} = [w_{min}, −w_{min}] | Case 3: ℜ_{g} = [−3Δ, 3Δ] | Case 4: ℜ_{g} = [−3Δ^{mod}, 3Δ^{mod}]
---|---|---|---|---
SQNR_{ex}^{MSPTQ} (dB) | 5.5741 | 2.9114 | 8.6839 | 8.5608
SQNR_{th}^{MSPTQ} (dB) | 5.0581 | 1.9158 | 7.4890 | 7.5165
Accuracy (%) | 97.91 | 96.98 | 97.17 | 97.23
Within ℜ_{g} (%) | 99.988 | 100 | 98.567 | 99.001

**Table 6.** SQNR and QNN model’s accuracy for MLP trained on MNIST dataset: (a) the application of different ternary quantizer designs; (b) adaptation of the ternary quantizer with the scale factor α.

w_{min} = −7.063787, w_{max} = 4.8371024, 3Δ = 2.5512, x_{max}[H] = 1.9605, x_{max}[J] = 2.1748 | Case 1: ℜ_{g} = [−w_{max}, w_{max}] | Case 2: ℜ_{g} = [w_{min}, −w_{min}] | Case 3: ℜ_{g} = [−3Δ, 3Δ] | Case 4: ℜ_{g} = [−x_{max}[H], x_{max}[H]] | Case 5: ℜ_{g} = [−x_{max}[J], x_{max}[J]]
---|---|---|---|---|---
SQNR_{ex} (dB) | 1.6988 | 0.43 | 5.7178 | 6.7273 | 6.47
SQNR_{th} (dB) | 2.7275 | 1.1827 | 5.5681 | 5.7436 | 5.7762
Accuracy (%) | 90.56 | 24.02 | 96.58 | 94.1 | 95.01
Within ℜ_{g} (%) | 99.988 | 100 | 98.567 | 94.787 | 96.691

(a)

x_{max} = (3/√2)·α, ℜ_{g} = [−x_{max}, x_{max}] | Case 1: α = 1/4 | Case 2: α = 1/2 | Case 3: α = 3/4 | Case 4: α = 1 | Case 5: α = 5/4 | Case 6: α = 3/2 | Case 7: α = 7/4
---|---|---|---|---|---|---|---
SQNR_{ex} (dB) | 2.52 | 5.01 | 6.6 | 6.552 | 5.49 | 4.28 | 3.23
SQNR_{th} (dB) | 2.1424 | 4.0509 | 5.3544 | 5.7800 | 5.4708 | 4.8068 | 4.0695
Accuracy (%) | 22.01 | 94.37 | 94.81 | 94.68 | 96.49 | 97.51 | 97.33
Within ℜ_{g} (%) | 41.142 | 72.968 | 89.17 | 96.283 | 98.869 | 99.648 | 99.887

(b)

**Table 7.** SQNR and QNN model’s accuracy (MLP and CNN trained on Fashion-MNIST dataset) with the application of (a) SPTQ and (b) MSPTQ, designed for a bit rate of R = 2 bit/sample.

x_{max}^{SPTQ} = 3Δ = 2.5512 | MLP [−3Δ, 3Δ] | CNN [−3Δ, 3Δ]
---|---|---
SQNR_{ex}^{SPTQ} (dB) | 7.3242 | 7.51
SQNR_{th}^{SPTQ} (dB) | 6.9790 | 6.9790
Accuracy (%) | 86.05 | 83.69
Within ℜ_{g} (%) | 98.112 | 97.696

(a)

x_{max}^{SPTQ} = 3Δ = 2.5512, x_{max}^{MSPTQ} = 3Δ^{mod} = 2.7063 | MLP [−3Δ, 3Δ] | MLP [−3Δ^{mod}, 3Δ^{mod}] | CNN [−3Δ, 3Δ] | CNN [−3Δ^{mod}, 3Δ^{mod}]
---|---|---|---|---
SQNR_{ex}^{MSPTQ} (dB) | 8.13 | 8.082 | 8.12 | 8.13
SQNR_{th}^{MSPTQ} (dB) | 7.4890 | 7.5165 | 7.4890 | 7.5165
Accuracy (%) | 87.95 | 87.42 | 83 | 83.72
Within ℜ_{g} (%) | 98.118 | 98.552 | 97.696 | 98.232

(b)


© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Perić, Z.; Aleksić, D.; Nikolić, J.; Tomić, S.
Two Novel Non-Uniform Quantizers with Application in Post-Training Quantization. *Mathematics* **2022**, *10*, 3435.
https://doi.org/10.3390/math10193435
