Design and Analysis of Binary Scalar Quantizer of Laplacian Source with Applications

Abstract: A compression method based on non-uniform binary scalar quantization, designed for the memoryless Laplacian source with zero mean and unit variance, is analyzed in this paper. Two quantizer design approaches are presented that investigate the effect of clipping with the aim of reducing the quantization noise, where the minimal mean-squared error distortion is used to determine the optimal clipping factor. A detailed comparison of both models is provided, and their performance is evaluated over a wide dynamic range of input data variances. The binary scalar quantization models under consideration are applied to standard signal processing tasks, such as speech and image quantization, but also to the quantization of neural network parameters. The motivation behind the binary quantization of neural network weights is model compression by a factor of 32, which is crucial for implementation on mobile or embedded devices with limited memory and processing power. The experimental results follow the theoretical models well, confirming their applicability in real-world applications.


Introduction
Quantization can be classified into two categories: scalar and vector [1,2]. The classification is based on whether only one sample is quantized at a time, using a fixed number of bits per sample (scalar quantization), or a number of samples is quantized at a time (vector quantization). The main advantage of scalar quantization is its reduced design complexity, which may be a crucial point in applications where processing delay is a critical parameter, such as speech coding. In scalar quantization, the real line is divided into a certain number of non-overlapping cells and, for each cell, a representative level is defined [1–3]. Hence, the input data is mapped onto the appropriate representative level, depending on the cell to which the input belongs.
The binary quantizer is the simplest scalar quantization model, where each symbol is represented by only one bit. The main benefit achieved with this model concerns data compression rather than signal quality. Besides being widely used in speech [1–9] and image coding [1,10–13], recently its most prominent application has become the compression of neural networks [14–21]. Although extensively exploited, a detailed analysis of binary quantization from the viewpoint of signal (data) processing under a known data distribution, including a design for the reference variance and a performance analysis over a wide dynamic range of variances, is not available in the literature. This motivated the authors to perform a detailed analysis of such a quantization model. The data is assumed to follow the Laplacian probability density function (PDF), which is known to model various real data well, including speech [1–3,22], differences between neighboring pixels in an image [1], and weights in neural networks [23,24]. In brief, the main contributions of this paper can be summarized as follows:

• We introduce two types of binary non-uniform scalar quantizers, named binary quantizer type 1 and binary quantizer type 2, and provide detailed descriptions of the design methods. Furthermore, we investigate the effect of clipping with the aim of reducing the quantization noise. The quantizers are designed for the memoryless Laplacian source with zero mean and unit variance.

• We conduct a detailed analysis of both binary quantizers and provide recommendations for quantizer selection in applications where a non-optimal design is required.

• We analyze the performance of both quantizers in a wide range of input data variances and investigate the robustness property.

• We propose a method, based on the forward adaptation technique, to improve the performance in a wide dynamic range.

• We verify the correctness of the theoretical quantizer models by applying them to several types of real data, including speech, images, and neural network parameters.
The rest of the paper is organized as follows. Section 2 gives an overview of state-of-the-art applications of scalar quantization. In Section 3, two design approaches for non-uniform binary scalar quantization are described in detail, and an appropriate comparison of the models is given. In Section 4, an analysis in a wide dynamic range is provided. In Section 5, applications to speech coding, image coding, and neural networks are presented, and the obtained results are discussed. Finally, Section 6 concludes the paper.

Previous Work
Scalar quantization models have been used in various data processing applications and, hence, play an important role in signal processing and compression. In this section, we provide a brief overview of state-of-the-art applications in several research areas, including speech coding, image coding, and neural network compression.
Speech belongs to the class of time-varying signals. Therefore, for its efficient processing, adaptive schemes that operate on a frame-by-frame basis are recommended [1,2]. A frame here refers to a group of consecutive samples of finite length. We are particularly interested in quantization techniques such as delta modulation (DM), where compression is more important than the preserved quality of the speech signal. DM is a well-known predictive coding technique where the difference between successive samples is encoded using a single bit [1–3,25,26]. An improved version, known as adaptive delta modulation (ADM), has been proposed to overcome the main issues of DM, known as overload and granular distortion. The algorithms based on ADM can be classified into instantaneous (where the adaptation is done at the sample level) [6] and frame-based (where the adaptation is done at the frame level) [7–9]. Although in its original implementation ADM assumes the use of a one-bit (two-level) quantizer, some recent works propose substantial performance improvements using two-bit [6,7], two-digit [8], and three-level ADM [9], with only a minimal increase in complexity.
Block truncation coding (BTC) [10–13] is a lossy compression algorithm widely used for the compression of monochrome images. It divides an image into non-overlapping blocks, estimates the statistical parameters (mean and variance) of each block, and uses a binary quantizer to represent the pixels based on the block statistics [10,11]. Modifications that utilize high-rate non-uniform quantizers have also been analyzed in the literature [12,13].

Finally, scalar quantization plays an important role in neural network (NN) compression, motivated by the deployment of NNs on mobile or embedded devices with limited memory and processing power, or in latency-critical services. The effects of scalar quantization in NNs have been analyzed in numerous papers, and it has been shown that quantizing network weights and activations to 8-bit fixed-point representations has only a negligible effect on performance, whereas a further reduction of precision may lead to severe performance degradation [27–29]. Recent attempts provide competitive performance using bitrates lower than eight bits [23,24,30,31], or even binary quantizers in the extreme case [14–21].

Design Methods of Binary Quantizer
The binary scalar quantizer (N = 2 levels) is characterized by three parameters: the decision threshold t and the representative levels y1 and y2. The model we address is zero-symmetric, which assumes t = 0 and y1 = −y2. Accordingly, the design target is to find only the representative level in the positive part, y2. Usually, during the design process, the input data is modeled by a PDF, and a certain performance criterion is adopted to select the required parameter. In this paper, we suppose that the data follow the Laplacian PDF given by [1–3]:

p(x, σ) = (1/(σ√2)) exp(−√2|x|/σ),    (1)

where σ² is the variance of the data.
The signal-to-quantization noise ratio (SQNR) is a widely employed quantizer performance measure, defined as [1–3]:

SQNR [dB] = 10 log10(σ²/D),    (2)

where D is the mean-squared error (MSE) distortion, D = ∫ (x − Q(x))² p(x, σ) dx, with the integral taken over the real line. Minimal MSE distortion, or equivalently maximal SQNR, is the most common performance criterion. In the following subsections, we describe in detail the design of the two binary quantizer models. In particular, the design is performed by assuming unit variance (σ² = σref² = 1); that is, we use p(x, σ = 1), which is the standard approach in scalar quantization [1].
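The definitions above can be cross-checked numerically. A minimal sketch (in Python with NumPy; the function names are illustrative and not part of the paper) evaluates the closed-form distortion of a symmetric one-bit quantizer with outputs ±y2 under the unit-variance Laplacian PDF, D(y2) = 1 − √2·y2 + y2², and compares it against a Monte Carlo estimate:

```python
import numpy as np

def distortion_closed_form(y2):
    # D(y2) = E[x^2] - 2*y2*E[|x|] + y2^2 for a symmetric binary quantizer
    # with outputs {-y2, +y2}; for the unit-variance Laplacian, E[|x|] = 1/sqrt(2).
    return 1.0 - np.sqrt(2.0) * y2 + y2 ** 2

def sqnr_db(d):
    # SQNR = 10 log10(sigma^2 / D); here sigma^2 = 1 (Equation (2)).
    return 10.0 * np.log10(1.0 / d)

rng = np.random.default_rng(0)
# Unit-variance Laplacian samples: scale parameter b = sigma / sqrt(2).
x = rng.laplace(scale=1.0 / np.sqrt(2.0), size=1_000_000)

y2 = 1.0 / np.sqrt(2.0)            # the optimal representative level
xq = np.where(x >= 0.0, y2, -y2)   # one-bit quantization
d_mc = np.mean((x - xq) ** 2)      # Monte Carlo MSE estimate

print(round(distortion_closed_form(y2), 4))           # 0.5
print(round(sqnr_db(distortion_closed_form(y2)), 2))  # 3.01 (dB)
```

The Monte Carlo estimate d_mc agrees with the closed form to within sampling error, and the maximal SQNR of a binary quantizer for this source is about 3.01 dB.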

Binary Quantizer Type 1
In the design of this quantizer, we introduce the clipping factor xclip, an additional parameter that serves as the upper support region threshold, together with the step size Δ = 2xclip/N, assuming that y2 = Δ/2. Figure 1 illustrates the considered quantizer. According to this model, the real line is divided into two regions: an inner region, (−xclip, xclip), and an outer region, (−∞, −xclip) ∪ (xclip, ∞). Therefore, the MSE distortion is composed of two components. The first one, the inner distortion Di, can be evaluated as:

Di = 2 ∫ from 0 to xclip of (x − Δ/2)² p(x, 1) dx.    (3)

The second component, the outer distortion Do, can be evaluated as:

Do = 2 ∫ from xclip to ∞ of (x − Δ/2)² p(x, 1) dx.    (4)

Based on the last two expressions, the total distortion becomes

D = Di + Do.    (5)

In terms of xclip, it can be expressed as

D = 1 − xclip/√2 + xclip²/4.

In the case of binary quantization, the identity xclip = Δ holds.
It can be seen that the MSE distortion is a function of the parameter xclip. Accordingly, the optimal value that minimizes the MSE distortion can be obtained by setting the first derivative of the distortion with respect to xclip to zero:

dD/dxclip = −1/√2 + xclip/2 = 0,

which yields xclip = √2. Hence, the optimal representative level amounts to

y2(type 1) = xclip/2 = 1/√2.    (8)

The result obtained in (8) can also be verified by numerical simulation, as shown in Figure 2.
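The numerical verification mentioned above can be mirrored by a simple grid search over xclip (a sketch assuming the closed-form total distortion for type 1 derived in this subsection):

```python
import numpy as np

# Type-1 total distortion as a function of the clipping factor (y2 = x_clip/2):
# D(x_clip) = 1 - x_clip/sqrt(2) + x_clip^2/4 (unit-variance Laplacian source).
def d_type1(x_clip):
    return 1.0 - x_clip / np.sqrt(2.0) + x_clip ** 2 / 4.0

grid = np.linspace(0.01, 4.0, 40_000)
x_opt = grid[np.argmin(d_type1(grid))]

print(round(x_opt, 3))                  # 1.414, i.e. sqrt(2)
print(round(d_type1(np.sqrt(2.0)), 4))  # 0.5
```

The minimum lands at xclip = √2 with D = 0.5, in agreement with (8).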

Binary Quantizer Type 2
The second type of binary scalar quantizer, extensively used in neural network applications [27,33], is illustrated in Figure 3. The representative level in the positive part (0, ∞) is set to y2 = xclip = Δ, and the identity y1 = −y2 holds. The MSE distortion is given as

D = 2 ∫ from 0 to ∞ of (x − xclip)² p(x, 1) dx.

In terms of xclip, it can be expressed as

D = 1 − √2·xclip + xclip².

We can see that the MSE distortion is highly dependent on xclip (or Δ). By optimizing the MSE distortion with respect to xclip, we arrive at

y2(type 2) = xclip = 1/√2.    (9)

This is an intuitively expected result that can be confirmed by directly comparing Equations (3)–(5), (8), and (9), and further verified by numerical simulation, as shown in Figure 4. Note that for this specific case (i.e., using the optimal values of xclip), both binary quantizer type 1 and binary quantizer type 2 guarantee the same performance. However, it is also of interest to compare the performances of the discussed quantizers for an arbitrary value of xclip, which is investigated in the following subsection.


Quantizer Performance Evaluation
Figure 5, where the SQNR is plotted as a function of the parameter xclip, illustrates the performance of both binary quantizers for a given fixed value of xclip. The curves achieve the same maximum SQNR at the optimal values of xclip (√2 for the type 1 and 1/√2 for the type 2 quantizer), as expected. If we compare the performances for various xclip values, we can see that there is a region where one particular quantizer is more efficient. Thus, within the region xclip ∈ (0, xclip,t), where xclip,t = 0.94 denotes the point where the curves intersect, binary quantizer type 2 performs better than quantizer type 1. Observe that this range is narrower than the one where binary quantizer type 1 performs better. Observe also that the SQNR values achieved in these ranges are lower than the maximal one, and the SQNR may even take a negative value, meaning that the signal power is lower than the noise power. Finally, we can conclude that if a non-optimal design must be used, the better candidate is binary quantizer type 1.
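The crossing point can also be checked from the two closed-form distortions: equating 1 − xclip/√2 + xclip²/4 (type 1) with 1 − √2·xclip + xclip² (type 2) gives xclip,t = 4/(3√2) ≈ 0.94. A short sketch (assuming the closed forms above; not a reproduction of Figure 5 itself):

```python
import numpy as np

def sqnr_db(d):
    return 10.0 * np.log10(1.0 / d)

def d_type1(xc):  # y2 = xc/2
    return 1.0 - xc / np.sqrt(2.0) + xc ** 2 / 4.0

def d_type2(xc):  # y2 = xc
    return 1.0 - np.sqrt(2.0) * xc + xc ** 2

# The SQNR curves cross where D1 = D2, i.e. x_clip = 4/(3*sqrt(2)).
xc_t = 4.0 / (3.0 * np.sqrt(2.0))
print(round(xc_t, 2))                                  # 0.94
print(sqnr_db(d_type2(0.5)) > sqnr_db(d_type1(0.5)))   # True: type 2 better below the crossing
print(sqnr_db(d_type1(2.0)) > sqnr_db(d_type2(2.0)))   # True: type 1 better above it
```

Note also that d_type2(2.0) > 1, i.e., the type 2 SQNR is negative there, matching the observation about signal power falling below noise power.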


Analysis in a Wide Dynamic Range
In the previous section, we designed the binary quantizers for a particular (reference) value of the variance and optimized the parameters accordingly. Here, we investigate the performance of the optimized quantizers in a scenario where the input data variance differs from the reference one (σ² ≠ σref² = 1), i.e., mismatched quantization [36]. In this case, it is necessary to derive the appropriate expressions for performance evaluation using the Laplacian PDF p(x, σ) given in Equation (1). Accordingly, for binary quantizer type 1, the following expression for the distortion can be derived:

D1(σ) = σ² − σ·xclip/√2 + xclip²/4,    (12)

where xclip(σref) = xclip is the value determined for the reference variance σref². For binary quantizer type 2, the distortion takes the following form:

D2(σ) = σ² − √2·σ·xclip + xclip².    (13)

Clearly, the SQNR can be estimated using Equation (2). Figures 6 and 7 depict the SQNR in a wide range of input data variances for the optimal binary quantizer type 1 (xclip = √2) and binary quantizer type 2 (xclip = 1/√2), respectively. Note that the binary quantizers are not robust, since the desired SQNR is attained only in the variance-matched case (σ² = σref²), while, in the rest of the range, the SQNR drops significantly. In these figures, we also include the results for several arbitrarily chosen values of xclip (i.e., xclip = 2σref, 3σref, and 4σref). It can be seen that, by increasing xclip, the SQNR curves shift to the right, degrading the performance even more. Therefore, quantizers designed for a particular variance may not be useful in non-stationary data processing, such as speech coding applications, and require performance improvement.
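The mismatch behavior described above can be illustrated directly from Equations (12) and (2). A sketch for type 1 (assuming the closed form D1; the function name is illustrative):

```python
import numpy as np

def sqnr_type1_db(sigma, x_clip):
    # Mismatch distortion for type 1 (y2 = x_clip/2) and Laplacian input with
    # standard deviation sigma: D1 = sigma^2 - sigma*x_clip/sqrt(2) + x_clip^2/4.
    d = sigma ** 2 - sigma * x_clip / np.sqrt(2.0) + x_clip ** 2 / 4.0
    return 10.0 * np.log10(sigma ** 2 / d)

xc = np.sqrt(2.0)  # quantizer designed for sigma_ref = 1
for sigma in (0.1, 1.0, 10.0):
    # SQNR peaks at 3.01 dB for the matched case sigma = 1 and drops elsewhere.
    print(sigma, round(sqnr_type1_db(sigma, xc), 2))
```

For σ = 0.1 the SQNR is strongly negative, and for σ = 10 it falls well below the matched 3.01 dB, confirming the lack of robustness.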

Applications of Binary Quantizer
In this section, we discuss the application of binary quantizers to several research areas, including speech coding, image coding, and neural network compression.


Speech Coding
Speech belongs to the class of time-varying signals, where the variance tends to change over time [1,3]. Hence, implementing a binary quantizer designed for the reference variance to encode speech may not be the optimal choice, and adaptation is recommended. In this paper, we describe the implementation of a binary quantizer using two frame-wise adaptive techniques, i.e., PCM (pulse code modulation) and ADM.

PCM
The implementation of a binary quantizer in PCM is done using the forward adaptive speech coding algorithm [1,3–5] presented in Figure 8. The forward adaptation technique is introduced to improve the performance of the single quantizer (designed for a particular variance) in a wide dynamic range. The algorithm performs the following steps:


Step 1. Buffering. A group of M consecutive samples (i.e., one frame, xj(n), n = 1, …, M, j = 1, …, F) is stored within the buffer, where j is the frame index and F is the total number of frames.
Step 2. Variance estimation and quantization. For the stored frame, the variance is estimated by the following equation [1,3–5]:

σj² = (1/M) Σ from n = 1 to M of xj²(n).

The log-uniform quantizer is used for variance quantization, which performs uniform quantization in the logarithmic domain [3–5]. In particular, it quantizes the variance Vj (dB) = 10 log σj² to one of L allowed values, defined as

Vj,q = Vmin + (2J − 1)·ΔL/2,  J = 1, …, L,

where ΔL = (Vmax − Vmin)/L denotes the step size and Vmax and Vmin denote the maximal and minimal estimated variance values. As this information is required at the decoder side, it has to be transmitted once per frame by the index J, using log2 L bits.
Step 3. Adaptive binary quantization. The adaptive binary quantizer is obtained by multiplying the parameter of the binary quantizer designed for the reference variance by the factor g:

y2,j = g·y2,    (16)

where g is defined as

g = 10^(Vj,q/20),    (17)

i.e., the quantized frame standard deviation relative to σref = 1. Each frame sample is quantized using the adaptive binary quantizer, and the output is encoded with a one-bit codeword (index I).
Step 4. Repeat all previous steps until all frames are processed.
Figure 9 depicts the theoretical SQNR (determined by substituting Equation (16) into Equation (12) or (13)) of the forward adaptive binary quantizer for the optimal y2 value (y2 = 1/√2) and an L = 32-level log-uniform quantizer. We can see that the adaptive quantizer is robust, as an approximately constant SQNR is achieved in a wide dynamic range.
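The steps above can be sketched as follows (a simplified Python/NumPy illustration; the frame length, the dynamic range Vmin/Vmax = ±40 dB, and the function name are assumptions, and the transmission of the index J is not modeled):

```python
import numpy as np

def adaptive_binary_pcm(x, frame_len=160, L=32, v_min=-40.0, v_max=40.0,
                        y2=1.0 / np.sqrt(2.0)):
    """Forward-adaptive binary PCM sketch: per frame, estimate the variance,
    quantize it with an L-level log-uniform quantizer, scale the reference
    representative level y2 by the gain g, then binarize the samples."""
    delta_l = (v_max - v_min) / L
    out = np.empty_like(x, dtype=float)
    for start in range(0, len(x), frame_len):
        frame = x[start:start + frame_len]
        var = np.mean(frame ** 2) + 1e-12                 # Step 2: variance estimate
        v_db = 10.0 * np.log10(var)
        j = np.clip(np.floor((v_db - v_min) / delta_l), 0, L - 1)
        v_q = v_min + (2 * j + 1) * delta_l / 2.0          # log-uniform reproduction
        g = 10.0 ** (v_q / 20.0)                           # Step 3: gain factor
        out[start:start + frame_len] = np.where(frame >= 0, g * y2, -g * y2)
    return out

rng = np.random.default_rng(1)
x = rng.laplace(scale=3.0 / np.sqrt(2.0), size=1600)  # sigma = 3, mismatched source
y = adaptive_binary_pcm(x)
snr = 10.0 * np.log10(np.mean(x ** 2) / np.mean((x - y) ** 2))
print(round(snr, 1))  # close to the theoretical ~3 dB despite sigma != sigma_ref
```

Even though the input variance differs from the reference by almost 10 dB, the forward-adapted quantizer stays near the 3 dB matched performance, which is the robustness effect shown in Figure 9.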
The real data experiment was performed with the following sentence: The play seems dull and quite stupid. It was 3 s in length, spoken by a female speaker, sampled at 16 kHz, and extracted from the set of Harvard Psychoacoustic Sentences [37], a collection of phonetically balanced sentences that use specific phonemes with the same frequency of appearance as in the English language. The segmental signal-to-noise ratio (SNRseg) [1,3] was used as the performance measure; that is, the SNR was calculated over each speech frame and then averaged. The frame length was set to 10 ms (160 samples), and the log-uniform quantizer with L = 32 levels was used.
Figure 10 shows the SQNR over different speech frames for three scenarios. In the first scenario, we use the binary quantizer with the optimal representative level y2 = 1/√2 (see the magenta line). In the other two scenarios, the performance is investigated for binary quantizers with arbitrarily chosen representative levels. Thus, we set xclip as the maximal amplitude of the speech, denoted as xmax (xclip = xmax = 0.086 for the considered speech). This implies y2 = xclip = 0.043 for binary quantizer type 2 (see the red line) and y2 = xclip/2 = 0.23 for binary quantizer type 1 (see the black line). Observe that the optimally chosen level ensures the highest performance, while type 2 performs better than type 1 for the established parameter value. If we observe the theoretical results in Figure 5, we can confirm that for the value xclip = 0.043, binary quantizer type 2 is indeed better than type 1. Therefore, we can conclude that the experimental results for speech coding follow the theoretical model well.

Delta Modulation
The implementation of a binary quantizer in ADM with a first-order linear predictor (both the quantizer and the predictor are forward adaptive) is shown in Figure 11 and can be described by the following steps:

Step 1. Buffering. This is the same as in Step 1 of the algorithm in Section 5.1.1.
Step 2. Variance estimation and quantization.This is the same as in Step 2 of the algorithm in Section 5.1.1.
Step 3. Estimation of the correlation coefficient and quantization. The correlation coefficient, denoted as ρ, of the current jth frame is estimated as [1,7–9]

ρj = (Σ from n = 2 to M of xj(n)·xj(n − 1)) / (Σ from n = 1 to M of xj²(n)).

It is uniformly quantized to one of S available values, given by

ρK = ρmin + (2K − 1)·Δρ/2,  K = 1, …, S,    (19)

where Δρ = (ρmax − ρmin)/S denotes the step size and ρmax and ρmin denote the maximal and minimal estimated values of the correlation coefficient. This information is also required at the decoder side, and it has to be transferred once per frame by the index K, using log2 S bits.
Step 4. Determination of the prediction error. For the jth frame, the prediction error can be determined according to

ej(n) = xj(n) − x̃j(n),

where x̃j(n) = ρK·yj(n − 1) is the predicted sample value and yj(n) is the reconstructed value:

yj(n) = x̃j(n) + eq(n),

where eq(n) is the quantized value of ej(n).
Step 5. Adaptive binary quantization. For the jth frame, the scaling factor is given by

gj = g·√(1 − ρK²),

where g is defined by Equation (17) and ρK is given by Equation (19). The adaptive representative level is obtained as in Equation (16), with gj in place of g. The prediction error signal is quantized using the adaptive binary quantizer, and the output is encoded with a one-bit codeword (index I).
Step 6. Repeat all previous steps until all frames are processed.
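The ADM steps above can be sketched as follows (a simplified Python/NumPy illustration; for brevity, the gain and correlation coefficient are used unquantized here, whereas the paper quantizes both with 32 levels, and the function name and frame length are assumptions):

```python
import numpy as np

def adaptive_binary_adm(x, frame_len=160, y2=1.0 / np.sqrt(2.0)):
    """ADM sketch: forward-adaptive binary quantizer plus a first-order
    linear predictor operating on the reconstructed signal."""
    out = np.empty_like(x, dtype=float)
    for start in range(0, len(x), frame_len):
        frame = x[start:start + frame_len]
        g = np.sqrt(np.mean(frame ** 2) + 1e-12)             # Step 2: frame gain
        rho = np.sum(frame[1:] * frame[:-1]) / (np.sum(frame ** 2) + 1e-12)
        level = g * np.sqrt(max(1.0 - rho ** 2, 1e-6)) * y2  # Step 5: scaled level
        prev = 0.0
        for n, s in enumerate(frame):
            pred = rho * prev                 # Step 4: first-order prediction
            e = s - pred                      # prediction error
            eq = level if e >= 0 else -level  # binary quantization of the error
            prev = pred + eq                  # reconstruction
            out[start + n] = prev
    return out

# Correlated AR(1) test signal with rho = 0.9 and Laplacian innovations.
rng = np.random.default_rng(2)
w = rng.laplace(scale=1.0 / np.sqrt(2.0), size=3200)
x = np.empty_like(w)
x[0] = w[0]
for n in range(1, len(w)):
    x[n] = 0.9 * x[n - 1] + w[n]

y = adaptive_binary_adm(x)
snr = 10.0 * np.log10(np.mean(x ** 2) / np.mean((x - y) ** 2))
print(round(snr, 1))  # above the ~3 dB binary-PCM bound, thanks to prediction gain
```

For a strongly correlated input, the prediction gain lifts the SNR well above the roughly 3 dB ceiling of memoryless binary quantization, which is the motivation for combining the binary quantizer with ADM.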
A real data experiment was performed using the same speech signal, frame length, and quantizer parameters as in the case of PCM.In addition, both the log-uniform quantizer (for the frame variance) and the uniform quantizer (for the correlation coefficient) used 32 levels.
Figure 12 plots the SNR over different speech frames for ADM. Similar conclusions can be drawn as in the case of PCM; that is, binary quantizer type 2 performed better than type 1 for the established parameter value. We also observe that ADM (Figure 12) gives higher performance for both voiced and unvoiced frames in comparison with PCM (Figure 10).

Image Coding
The application of a binary quantizer in image coding was analyzed using a BTC algorithm, which we briefly describe.
The algorithm starts by dividing the input picture into a set of non-overlapping pixel blocks of size m × m. For each block, the mean value, denoted as xav(k), where k is the index of the block, is calculated as

xav(k) = (1/m²) Σ over i Σ over j of x(i, j),

where x(i, j) is the pixel intensity, i is the row index, and j is the column index. The mean value is uniformly quantized to one of S1 levels:

xav,q(k) = xav,min + (2l − 1)·Δav/2,  l = 1, …, S1,

where xav,min = 0, xav,max = 255, and Δav = (xav,max − xav,min)/S1 = 255/S1. For encoding, rav = log2 S1 bits are used.

In the next step, for the kth block, the mean value is subtracted from the original pixel intensities, xd(i,j) = x(i,j) − xav k, and the block with the new values is obtained; this block is the input to the binary quantizer. To perform efficient quantization, the quantizer has to be adjusted to the local block statistics. Thus, for the kth block, we estimate the variance as the mean of the squared values xd(i,j), and, as in the previous subsections, we define the binary quantizer for the kth block accordingly. Furthermore, the estimated block variance is uniformly quantized and encoded using rσ bits. A particular block value xd(i,j) is binarily quantized to the value xdq(i,j) ∈ {−y2 k, y2 k}.
The reconstructed pixel intensity for the kth block, denoted as xr(i,j), is obtained by adding the quantized block mean back to the quantized residual. The real data experiment was done using the standard grayscale test image Lena of size 512 × 512 pixels, which belongs to the USC-SIPI Image Database [38]. The bit rate R and the peak signal-to-quantization noise ratio (PSQNR) were used as performance measures [1,12,13]. The results for the BTC algorithm, with two different block sizes (4 × 4 and 8 × 8) and different rav and rσ values, adopting the optimal binary quantizer (y2 = 1/√2), are summarized in Table 1. It can be seen that a high PSQNR was obtained at satisfactorily low bit rates.
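The per-block procedure described above can be sketched compactly. This is an illustrative reading under stated assumptions: the helper `btc_block` is our name, the block mean is reconstructed at the cell midpoint, and the side information (quantized mean and block standard deviation) is used directly rather than encoded into a bit stream.

```python
import numpy as np

def btc_block(block, y2=1.0 / np.sqrt(2.0), S1=256):
    """One block of the BTC scheme with a binary quantizer.

    block -- m x m array of pixel intensities in [0, 255]
    y2    -- representative level of the unit-variance binary quantizer
    S1    -- number of uniform levels for the block mean
    Returns the reconstructed block.
    """
    # Uniform S1-level quantization of the block mean on [0, 255].
    delta = 255.0 / S1
    idx = min(int(block.mean() / delta), S1 - 1)
    x_av_q = (idx + 0.5) * delta
    # Mean-removed residual; its standard deviation scales the quantizer.
    xd = block - x_av_q
    sigma = np.sqrt(np.mean(xd ** 2))
    # One bit per pixel: each residual maps to +/- y2 * sigma.
    xd_q = np.where(xd >= 0.0, y2 * sigma, -y2 * sigma)
    # Reconstruction: quantized mean plus quantized residual.
    return np.clip(x_av_q + xd_q, 0.0, 255.0)
```

With m = 4 this costs 1 bit/pixel plus (rav + rσ)/16 bits of side information per pixel, which is where the compression ratio of 4 relative to 8 bits/pixel comes from.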
Moreover, in Table 2, we present the results in terms of the PSQNR for the BTC algorithm with a binary quantizer, using the optimal and non-optimal (i.e., arbitrarily selected) representative levels for the block size of 4 × 4. Thus, we adopted y2 = 1.5 and y2 = 3, which correspond to binary quantizer types 1 and 2, respectively. As expected, BTC performance was degraded when it used the non-optimal quantizer. However, we can see that a better PSQNR was provided when binary quantizer type 1 was employed. This is also in accordance with the theoretical results in Figure 5 since, for the same value of x clip = 3, binary quantizer type 1 performed better than type 2. In Figure 13, we depict the reconstructed images in the cases of an optimal (y2 = 1/√2) and non-optimal level (y2 = 3) for m = 4, rav = 8 bits/block, and rσ = 8 bits/block, showing better quality of reconstruction and reduced granular noise in the case of an optimally chosen level and confirming the above conclusions. Note that BTC with a binary quantizer achieved a compression ratio equal to 4 with respect to the original image (8 bits/pixel).


Neural Networks Compression
In this paper, the multilayer perceptron (MLP) neural network was employed to investigate the influence of a binary quantizer on performance, measured by prediction accuracy. As elaborated in Section 2, the motivation behind the binary quantization of neural network parameters was to provide model compression when compared to the full precision case, which is crucial for implementation in portable and edge computing devices with limited memory and processing power. MLP is a class of feedforward artificial neural networks that consists of three layers: an input layer, a hidden layer, and an output layer. Our goal was to apply quantization to the learned network weights (post-training quantization).
Training data was taken from the MNIST database [39], which contains 60,000 monochrome images of handwritten single digits sized 28 × 28 pixels. The network input was the image vector of size 28 × 28 = 784 pixels, and the number of hidden layer nodes was set to 128, whereas the output layer had 10 nodes, corresponding to the number of digits. Rectified linear unit (ReLU) and softmax activation functions were used in the hidden and output layers, respectively. The hyperparameters of the MLP neural network were as follows: regularization rate (L2) = 0.01, learning rate = 0.0005, and batch size = 128.
Figure 14 plots the training and validation accuracy, where the model was evaluated on a hold-out validation dataset after each epoch during training. We can observe that the neural network did not overfit, and the model converged after 20 epochs, achieving a prediction accuracy of 96.7%. Binary quantization was then applied to the learned neural network weights, which are given in matrix form with dimensions 784 × 128 for the weights between the input and hidden layers and 128 × 10 for the weights between the hidden and output layers. The distribution of the learned weights between the input and hidden layers is illustrated in Figure 15, showing that the Laplacian PDF is a good model.
To perform effective quantization, one should adapt the quantizer to the statistics of the input matrix. Hence, the mean value was estimated and subtracted from the original network weights. Information about the mean value was stored in full precision format (32-bit floating point). Furthermore, the standard deviation of the data was estimated, and the binary quantizer was adapted (scaled) accordingly; this information was also stored in full precision format. The mean-normalized network weights were quantized using the adaptive binary quantizer. To reconstruct the original network weights, the mean value is added back to the quantized mean-normalized network weights.
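The post-training quantization procedure above can be sketched as follows. The helper names (`binary_quantize_weights`, `binary_dequantize_weights`) are ours, and the default representative level y2 = 1/√2 corresponds to the optimal binary quantizer used in the paper.

```python
import numpy as np

def binary_quantize_weights(W, y2=1.0 / np.sqrt(2.0)):
    """Post-training binary quantization of a weight matrix.

    Mean and standard deviation are kept in full precision (two 32-bit
    floats of side information per matrix); every weight becomes one bit.
    Returns the sign bits and the side information needed to dequantize.
    """
    mu = float(W.mean())                  # stored in full precision
    Wn = W - mu                           # mean-normalized weights
    sigma = float(Wn.std())               # adapts (scales) the quantizer
    bits = (Wn >= 0.0).astype(np.uint8)   # 1 bit per weight
    return bits, mu, sigma

def binary_dequantize_weights(bits, mu, sigma, y2=1.0 / np.sqrt(2.0)):
    """Reconstruct weights as mu +/- y2 * sigma, per the stored sign bit."""
    return mu + np.where(bits == 1, y2 * sigma, -y2 * sigma)
```

For unit-variance Laplacian data, the level y2 = 1/√2 yields a distortion of σ²/2, i.e., an SQNR of about 3 dB, which is in line with the values reported for the real weight matrices.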
Table 3 gives the achieved accuracies for the MLP neural network with (using a binary quantizer with optimal and non-optimal representative levels) and without (denoted as full precision) quantization of the weights. Namely, the non-optimal representative levels of the binary quantizer are specified as xmax/2 (binary quantizer type 1) and xmax (binary quantizer type 2), where xmax denotes the maximal value in the network weight matrix (xmax = 0.45). The SQNR measure, indicating the efficiency of the observed binary models on the available real data, is also provided in Table 3.
It can be seen that the highest values of both performance measures, prediction accuracy (91.28%) and SQNR (4.287 dB), were attained in the case of the optimal binary quantizer (y2 = 1/√2). In comparison with the network weights stored in the full precision format, the accuracy dropped by approximately 5%, but the compression ratio was substantial and amounted to 32. Therefore, the decreased performance can be compensated for by the large compression, which may be crucial for implementation in devices with limited memory and processing power. It is also interesting to note that the non-optimal binary model used so far [27,33], which applies the maximal data value as the representative level (type 2), had an approximately 1.3% lower accuracy.
Finally, considering the analysis for the arbitrarily chosen level, Table 3 reveals, based on the SQNR values, that quantizer type 2 is more efficient than quantizer type 1. This can also be validated by the theoretical model in Figure 5 (see the performance for x clip = 0.45).
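The compression ratio of 32 quoted above can be checked with simple bit accounting; the treatment of side information (one mean and one standard deviation per weight matrix, each stored as a 32-bit float) is our assumption about what is counted.

```python
# Weight counts of the MLP above: 784 x 128 (input-hidden) and 128 x 10 (hidden-output).
n_weights = 784 * 128 + 128 * 10

# Full precision storage: 32 bits per weight.
full_bits = 32 * n_weights

# Binary quantization: 1 bit per weight, plus mean and standard deviation
# stored as 32-bit floats for each of the two weight matrices.
binary_bits = 1 * n_weights + 2 * 2 * 32

compression_ratio = full_bits / binary_bits
```

The side information is negligible against roughly 100,000 one-bit weights, so the ratio comes out just under 32, matching the factor reported in the paper.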

Conclusions
This paper has addressed a binary scalar quantizer for data with a Laplacian PDF, with applications in speech and image coding, as well as the compression of neural networks. Two types of binary quantizers have been designed, taking into account the effect of clipping and the determination of the optimal clipping factor. Detailed performance analysis has shown that both quantizers provide the same maximal SQNR in an optimal setting. However, in cases where a non-optimal design is necessary, binary quantizer type 1 is more efficient than binary quantizer type 2 for image compression, leading to better quality of the reconstructed image and reduced granular noise. Contrary to this finding, quantizer type 2 is more efficient for the compression of speech signals and neural network parameters.
Theoretical analysis of the considered models in a wide dynamic range has also been conducted, showing that their robustness is low, making them inefficient for non-stationary data. Therefore, a forward adaptive model has been introduced for non-stationary data. Several types of real-world data have been used to test the adaptive model, including speech, images, and neural network weights, confirming the applicability of the proposed models and, furthermore, showing excellent matching between the experimental and theoretical results.
Future research will include the application of the introduced adaptive methods to compression of not only neural network weights, but also activations or even gradients and, furthermore, to quantization of state-of-the-art deep neural networks, such as AlexNet, GoogleNet, ResNet, or VGG.
obtained in (8) can also be verified by numerical simulation, as shown in Figure 2.

Information 2020, 11, x FOR PEER REVIEW

Figure 5. Performance evaluation of binary quantizers type 1 and 2 for a given x clip.

Figure 6. Signal-to-quantization noise ratio (SQNR) as a function of input signal variance for binary quantizer type 1.

Figure 7. SQNR as a function of input signal variance for binary quantizer type 2.

Figure 8. Pulse code modulation algorithm with a binary quantizer.

Figure 9. SQNR of the forward adaptive binary quantizer (y2 = 1/√2) in a wide range of input data variances.

Figure 10. SQNR across speech frames in the case of PCM.

Figure 11. Adaptive delta modulation algorithm with a binary quantizer.

Figure 12. SQNR across speech frames in the case of adaptive delta modulation (ADM).

Figure 14. Learning curves for the considered multilayer perceptron (MLP) neural network.

Figure 15. Distribution of learned weights for the considered MLP neural network.

Table 2. Performance comparison of the block truncation coding algorithm using optimal and non-optimal binary quantizers, applied to the monochrome image of Lena.

Table 3. Prediction accuracy of the multilayer perceptron neural network for different representation levels of the binary quantizer.