Article

Switched 32-Bit Fixed-Point Format for Laplacian-Distributed Data

1 Faculty of Electronic Engineering, University of Niš, Aleksandra Medvedeva 4, 18000 Niš, Serbia
2 Faculty of Technical Sciences, University of Novi Sad, Trg Dositeja Obradovića 6, 21102 Novi Sad, Serbia
3 IHP—Leibniz-Institut für Innovative Mikroelektronik, Im Technologiepark 25, 15236 Frankfurt (Oder), Germany
* Author to whom correspondence should be addressed.
Information 2025, 16(7), 574; https://doi.org/10.3390/info16070574
Submission received: 18 April 2025 / Revised: 26 June 2025 / Accepted: 2 July 2025 / Published: 4 July 2025
(This article belongs to the Special Issue Signal Processing and Machine Learning, 2nd Edition)

Abstract

The 32-bit floating-point (FP32) format has many useful applications, particularly in computing and neural network systems. The classic 32-bit fixed-point (FXP32) format offers faster computation and lower computational cost, which benefits energy efficiency, but it often provides a lower quality of representation (i.e., precision), making it unsuitable for real deployment. In this paper, we propose a switched FXP32 format able to compete with or surpass the widely used FP32 format across a wide variance range. The format switches between the possible values of its key parameter according to the variance level of the data, modeled with the Laplacian distribution. Precision is analyzed using the signal-to-quantization noise ratio (SQNR) as a performance metric, introduced through the analogy between digital formats and quantization. Theoretical SQNR results over a wide variance range confirm the design objectives. Experimental and simulation results obtained using neural network weights further support the approach. The strong agreement between experiment, simulation, and theory indicates the efficiency of this proposal in encoding Laplacian data, as well as its potential applicability in neural networks.

1. Introduction

Besides computers, the most important application domain of the 32-bit floating-point (FP32) format [1] is neural networks (NNs). NNs serve as an efficient tool for handling challenging real-world problems, with notable results in natural language processing, image recognition, and speech recognition [2]. The FP32 format guarantees a high quality of representation, but its inherent complexity can be a disadvantage in some situations. For instance, FP32-based NNs are extremely challenging to deploy in resource-constrained environments, such as sensor nodes [3] and edge devices [4]. Its fixed-point counterpart, the 32-bit fixed-point (FXP32) format, exhibits notably lower implementation complexity but lacks sufficient dynamic range. Specifically, FXP32 can perform operations faster than FP32 and requires less computational overhead and memory, which is beneficial for real deployment. Refs. [5,6,7,8,9,10] provide advanced fixed-point solutions for NNs that support lower bit resolutions and are based on integer calculations in which the exponent is shared among multiple parameters. Conversely, FP32 requires an exponent for each parameter and, consequently, increases the energy cost on hardware. These methods differ in how the shared exponent is determined. For example, in [5], where the dynamic fixed-point (DFP) format was proposed, the exponent is determined from the input data set using its maximum value. The shifted dynamic fixed-point (S-DFP) format [6] modifies DFP [5] by adding a bias. All the mentioned procedures aim to increase the dynamic range for parameter representation and make the quantized NN competitive with the initial (FP32) one.
NN parameters (weights, activations, biases, etc.) follow various statistical distributions. For example, NN weights can be modeled with a Gaussian or Laplacian distribution [11,12]. Therefore, the performance of digital formats can be affected by the probability density function (PDF) of the data. Besides the dynamic range (defined by the largest and smallest representable values), the quality of representation (precision) is another performance indicator of a digital format. These indicators can change significantly with the resolution, which creates a practical need to measure the actual performance of digital formats. However, most available FXP solutions, e.g., [5,6,7,8], neither take the PDF of the data into account during design nor offer mechanisms to measure the quality loss compared to the initial FP32.
The mechanism for estimating the quality of representation of the FXP format was provided in a recent paper [13] that takes into account the PDF of the data and uses the signal-to-quantization noise ratio (SQNR) as a measure. This mechanism actually exploits the analogy between the FXP format and uniform scalar quantization. In that paper, a Laplacian data source was assumed, while the precision of the FXP32 format was analyzed in the variance range adapted to signal processing applications (40 dB wide relative to the reference variance). The SQNR results in [13] showed a strong dependence on the data variance, but also the possibility of outperforming FP32.
The dynamic range of a digital format can also be estimated with respect to the variance of the data. It can be defined as the range of variances over which a satisfactory SQNR is achieved. The increase in dynamic range of classic FXP formats in [14,15] was specifically achieved by improving the SQNR across a broad variance range. In [14], the authors suggested a two-stage adaptation method that relies on data preprocessing to enhance the FXP24 format for the Laplacian source. Input data with arbitrary variance are converted to unit-variance data in the first stage and scaled with an appropriate factor in the second stage, before applying the FXP24 format with the best-chosen key parameter n (the number of bits for the integer part of a real number). As a result, a constant SQNR is ensured for any variance. A different approach was employed in [15], where the FXP format was improved in a general way (i.e., for any resolution R) for a Gaussian PDF. It is based on the premise that n is not a predefined parameter of the FXP format, allowing it to vary between 0 and R − 1. To take advantage of the variety of n values, switching quantization [16,17] was employed. Recall that switching quantization is a popular method for improving the performance of a single (non-adaptive) quantizer: the variance range is partitioned into non-overlapping subranges, with a specially designed quantizer for each subrange. The strategy in [15] defines the subrange for each available n, using the estimated data variance as the switching rule. In this way, a nearly constant SQNR was ensured over a wide variance range. Furthermore, it was observed that the gain in dynamic range offered by such an approach increases with R.
This paper extends the methodology from [15] to the Laplacian PDF and aims to improve the classic FXP32. To the best of the authors’ knowledge, this approach has not previously been applied to improve FXP32. The benefits are two-fold. First, a substantial increase in dynamic range for the target bit resolution can be obtained by following the accomplishments from [15]. Second, based on the findings from [13], a solution able to surpass (in terms of precision) the standardized FP32 format [1] can be developed. Note that this strategy differs from the ones in [5,6]. Although the logic applied in [5,6] tries to adjust the dynamic range to the input data, it fails to account for the variance in the design process, which typically results in a suboptimal exponent choice and therefore lower performance. The proposed strategy allows for the optimal selection of the key FXP32 parameter over a variance range suitable for most practical applications. In brief, this paper delivers the following contributions:
  • Switched FXP32 quantization. This type of quantization chooses the optimal number of bits for the integer part n based on the variance level of data to be quantized, unlike the conventional FXP32, where that parameter is fixed. This provides the ability to dynamically track changes in data and ensure high performance.
  • Theoretical design. The design is performed using the analytical closed-form SQNR expression derived in this paper. It enables accurate calculation of the variance subrange, where a certain n acts, as well as the dynamic range.
  • Experimental and simulation analysis. It is performed to validate the theoretical switched FXP32 model. Several NN configurations and benchmark datasets are employed to obtain the weights and test the switched FXP32 model. This also indicates the potential for applying the proposed solution in NNs.
The rest of this paper is organized as follows. Section 2 describes the basics and equivalent quantizer model of the FXP32 format and also conducts the performance analysis from the perspective of SQNR. Section 3 is devoted to the switched FXP32 format and delivers theoretical design criteria and an expression for performance evaluation. Section 4 provides and discusses the theoretical, experimental, and simulation performance results. Section 5 concludes the paper.

2. The FXP32 Format

2.1. Basics and Equivalent Quantizer Model

The binary form of real data sample x displayed in FXP32 format is as follows [13]:
$$x = \left(s\, a_{n-1} \ldots a_1 a_0 \,.\, a_{-1} a_{-2} \ldots a_{-m}\right)_2, \quad R = 32, \tag{1}$$
where the resolution bits (R = 32) are distributed over n bits for the integer part (a_{n−1}, a_{n−2}, …, a_0), m bits for the fractional part (a_{−1}, a_{−2}, …, a_{−m}), and a bit s for the sign of x (i.e., R = n + m + 1). The decimal value of x is given by the following equation:
$$x = (-1)^s \sum_{i=-m}^{n-1} a_i\, 2^i. \tag{2}$$
Numbers displayed in FXP32 format can be positive and negative. Negative numbers are obtained as reflections of positive ones due to symmetry around zero. Specifically, there are a total of 2^31 positive FXP32 numbers, spaced apart by 2^{−m} = 2^{−(R−n−1)} = 2^{n−31}, with the largest representable value
$$\left(\underbrace{1\ldots 1}_{n}.\underbrace{1\ldots 1}_{m}\right)_2 = \sum_{i=-m}^{n-1} 2^i = 2^{-m} \sum_{i=0}^{n+m-1} 2^i = 2^{-m}\left(2^{n+m} - 1\right) \approx 2^n. \tag{3}$$
The quantizer equivalent to the FXP32 format structure is a zero-symmetric R = 32-bit uniform quantizer, called the FXP32 quantizer [13]. The upper support region threshold x_max = 2^n and the step size Δ = 2^{n−31} completely define the FXP32 quantizer, which applies the quantization rule given below:
$$\mathrm{FXP32}(x) = \begin{cases} \operatorname{sgn}(x)\, \Delta \left(\left\lfloor \dfrac{|x|}{\Delta} \right\rfloor + \dfrac{1}{2}\right) = \operatorname{sgn}(x)\, 2^{n-31} \left(\left\lfloor \dfrac{|x|}{2^{n-31}} \right\rfloor + \dfrac{1}{2}\right), & |x| \le x_{\max} = 2^n, \\[6pt] \operatorname{sgn}(x) \left(x_{\max} - \dfrac{\Delta}{2}\right) \approx \operatorname{sgn}(x)\, 2^n, & |x| > 2^n, \end{cases} \tag{4}$$
where ⌊·⌋ denotes rounding down to the nearest integer (the floor function). In this case, a zero-mean Laplacian PDF is used to statistically model the data samples [16,17]:
$$p(x,\sigma) = \frac{1}{\sigma\sqrt{2}} \exp\left(-\frac{\sqrt{2}\,|x|}{\sigma}\right), \tag{5}$$
where σ² is the data variance.
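For illustration, a minimal NumPy sketch of the quantization rule (4) is given below; the function name and vectorized form are our own and are not part of the original format specification.

```python
import numpy as np

def fxp32_quantize(x, n):
    """Apply the FXP32 quantization rule (4) for a given number of integer bits n.

    Samples inside the support region (-2**n, 2**n) are uniformly quantized with
    step 2**(n - 31); samples outside are clipped to the largest representable
    magnitude x_max - delta/2. In this sketch, exact zeros simply map to zero.
    """
    x = np.asarray(x, dtype=np.float64)
    delta = 2.0 ** (n - 31)                    # step size
    x_max = 2.0 ** n                           # upper support region threshold
    granular = np.sign(x) * delta * (np.floor(np.abs(x) / delta) + 0.5)
    overload = np.sign(x) * (x_max - delta / 2.0)
    return np.where(np.abs(x) <= x_max, granular, overload)
```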
The mean-squared error (MSE) distortion D will be used to estimate the error that results from quantizing the Laplacian data with the FXP32 quantizer. The MSE distortion D comprises two components, the granular distortion Dg and the overload distortion Dov, since data can fall both inside and outside the support region (−x_max, x_max). Starting from the standard expressions for a uniform quantizer’s Dg and Dov [16,17]:
$$D_g(\sigma) = 2\, \frac{\Delta^2}{12} \int_0^{x_{\max}} p(x,\sigma)\, dx, \tag{6}$$
$$D_{ov}(\sigma) = 2 \int_{x_{\max}}^{\infty} \left(x - x_{\max}\right)^2 p(x,\sigma)\, dx, \tag{7}$$
the following is obtained for the FXP32 quantizer:
$$D_g(\sigma,n) = \frac{2^{2n-64}}{3} \left(1 - \exp\left(-\frac{2^{\,n+1/2}}{\sigma}\right)\right), \tag{8}$$
$$D_{ov}(\sigma,n) = \sigma^2 \exp\left(-\frac{2^{\,n+1/2}}{\sigma}\right), \tag{9}$$
$$D(\sigma,n) = \sigma^2 \left[\frac{2^{2n-64}}{3\sigma^2} \left(1 - \exp\left(-\frac{2^{\,n+1/2}}{\sigma}\right)\right) + \exp\left(-\frac{2^{\,n+1/2}}{\sigma}\right)\right]. \tag{10}$$
In addition to distortion, quantizer performance is reflected by the SQNR, defined as follows [16,17]:
$$\mathrm{SQNR}(\sigma) = 10 \log_{10} \frac{\sigma^2}{D}. \tag{11}$$
Substituting (10) into (11), the FXP32 quantizer’s closed-form SQNR expression is obtained:
$$\mathrm{SQNR}(\sigma,n) = -10 \log_{10} \left[\frac{2^{2n-64}}{3\sigma^2} \left(1 - \exp\left(-\frac{2^{\,n+1/2}}{\sigma}\right)\right) + \exp\left(-\frac{2^{\,n+1/2}}{\sigma}\right)\right], \tag{12}$$
which, for a given σ², depends exclusively on n. Since n is not strictly defined by the FXP32 format, it can take any value between 0 and R − 1 = 31 (see (1)). Furthermore, σ² is commonly represented in logarithmic form as σ_dB = 20 log10(σ/σ_ref), where σ_ref² is the reference variance, commonly set to 1 [16]. This gives σ = 10^{σ_dB/20}, so (12) becomes
$$\mathrm{SQNR}(\sigma_{\mathrm{dB}},n) = -10 \log_{10} \left[\frac{2^{2n-64}}{3}\, 10^{-\sigma_{\mathrm{dB}}/10} \left(1 - \exp\left(-2^{\,n+1/2}\, 10^{-\sigma_{\mathrm{dB}}/20}\right)\right) + \exp\left(-2^{\,n+1/2}\, 10^{-\sigma_{\mathrm{dB}}/20}\right)\right]. \tag{13}$$
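A small Python sketch of the closed-form expression (13) is shown below; the function name is our own, and the function is reused in the sketches that follow.

```python
import numpy as np

def sqnr_fxp32_db(sigma_db, n):
    """Closed-form SQNR (13) of the FXP32 quantizer, in dB.

    sigma_db : data variance in logarithmic form, 20*log10(sigma/sigma_ref), sigma_ref = 1
    n        : number of bits for the integer part (0 <= n <= 31)
    """
    sigma = 10.0 ** (sigma_db / 20.0)
    t = 2.0 ** (n + 0.5) / sigma          # sqrt(2) * x_max / sigma
    d_rel = (2.0 ** (2 * n - 64) / (3.0 * sigma ** 2)) * (1.0 - np.exp(-t)) + np.exp(-t)
    return -10.0 * np.log10(d_rel)        # D(sigma, n) / sigma^2, inverted and in dB
```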

2.2. Performance Analysis Using SQNR

The established equivalence between the FXP32 format and the uniform FXP32 quantizer allows us to use the SQNR of the FXP32 quantizer as a performance measure of the FXP32 format. The higher the SQNR, the better the representational quality (precision) of the format. In particular, we investigate performance over a wide range of variances, which yields an SQNR curve rather than a single SQNR value for a single variance. This gives a more complete picture of how the FXP32 format behaves when dealing with variance-sensitive data.
Figure 1 plots the SQNR curves for several values of n, namely n = 0, 1, 2, 30, and 31 (the remaining possible n values are omitted for visibility reasons). Two observations can be made from Figure 1: (1) increasing n shifts the SQNR curve to the right, while the same maximum is reached in each instance; (2) the shift between the SQNR curves obtained for neighboring n values is constant. This constant shift is quantified by Lemma 1.
Lemma 1. 
The SQNR curve of the FXP32 quantizer provided for n + 1 is 6.02 dB away from that provided for n, i.e.:
$$\mathrm{SQNR}(\sigma_{\mathrm{dB}}, n+1) = \mathrm{SQNR}(\sigma_{\mathrm{dB}} - 6.02, n). \tag{14}$$
Proof of Lemma 1. 
According to (13), we have
$$\mathrm{SQNR}(\sigma_{\mathrm{dB}}, n+1) = -10 \log_{10} \left[\frac{2^2\, 2^{2n-64}}{3}\, 10^{-\sigma_{\mathrm{dB}}/10} \left(1 - \exp\left(-2 \cdot 2^{\,n+1/2}\, 10^{-\sigma_{\mathrm{dB}}/20}\right)\right) + \exp\left(-2 \cdot 2^{\,n+1/2}\, 10^{-\sigma_{\mathrm{dB}}/20}\right)\right]. \tag{15}$$
Note that 2 = 10^{6.02/20} and 2² = 10^{6.02/10}, so expression (15) becomes
$$\mathrm{SQNR}(\sigma_{\mathrm{dB}}, n+1) = -10 \log_{10} \left[\frac{2^{2n-64}}{3}\, 10^{-(\sigma_{\mathrm{dB}}-6.02)/10} \left(1 - \exp\left(-2^{\,n+1/2}\, 10^{-(\sigma_{\mathrm{dB}}-6.02)/20}\right)\right) + \exp\left(-2^{\,n+1/2}\, 10^{-(\sigma_{\mathrm{dB}}-6.02)/20}\right)\right] = \mathrm{SQNR}(\sigma_{\mathrm{dB}} - 6.02, n), \tag{16}$$
which completes the proof. □
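As a quick numerical check of the shift property (14) (our own addition, not part of the paper’s derivation), note that 6.02 dB is the rounded value of 20·log10(2) ≈ 6.0206 dB, so the equality holds exactly up to that rounding:

```python
import numpy as np

# Verify Lemma 1 numerically with the sqnr_fxp32_db sketch from above.
shift = 20.0 * np.log10(2.0)                        # ~6.0206 dB, rounded to 6.02 in (14)
for n in (0, 5, 15, 30):
    for sigma_db in (-20.0, 0.0, 40.0):
        lhs = sqnr_fxp32_db(sigma_db, n + 1)
        rhs = sqnr_fxp32_db(sigma_db - shift, n)
        assert abs(lhs - rhs) < 1e-9, (n, sigma_db)
```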
Let us denote by σ_dB,max^(n) = 20 log10 σ_max^(n) the variance at which the SQNR curve for a given n reaches its maximum. This parameter, important for further analysis, is determined iteratively, as Lemma 2 shows.
Lemma 2. 
The point σ_max^(n) at which the SQNR of the FXP32 quantizer reaches its maximum can be specified by
$$\sigma_{\max}^{(n)}(i) = \frac{2^{\,n+1/2}}{\ln\left[1 - \dfrac{2^{\,n-1/2}}{\sigma_{\max}^{(n)}(i-1)} + 3\, \sigma_{\max}^{(n)}(i-1)\, 2^{(127-2n)/2}\right]}. \tag{17}$$
Proof of Lemma 2. 
Let us define the function S as follows:
$$S = \frac{\sigma^2}{D(\sigma,n)} = \frac{\sigma^2}{\dfrac{2^{2n-64}}{3} \left(1 - \exp\left(-\dfrac{2^{\,n+1/2}}{\sigma}\right)\right) + \sigma^2 \exp\left(-\dfrac{2^{\,n+1/2}}{\sigma}\right)}. \tag{18}$$
Using the condition ∂S/∂σ |_{σ = σ_max^(n)} = 0, we get the following identity:
$$2^{-2(31-n)} \left[2^{\,n+1/2} + 2\, \sigma_{\max}^{(n)} \left(\exp\left(\frac{2^{\,n+1/2}}{\sigma_{\max}^{(n)}}\right) - 1\right)\right] = 3 \left(\sigma_{\max}^{(n)}\right)^2 2^{\,n+5/2}. \tag{19}$$
Finally, solving (19) for σ_max^(n), we have
$$\sigma_{\max}^{(n)} = \frac{2^{\,n+1/2}}{\ln\left[1 - \dfrac{2^{\,n-1/2}}{\sigma_{\max}^{(n)}} + 3\, \sigma_{\max}^{(n)}\, 2^{(127-2n)/2}\right]}. \tag{20}$$
The last equation can be solved iteratively, and the proof is finished. □
The iterative process (17) can be initialized with σ_max^(n)(0) = 10^{C/20}, where C = 151.9 dB is the SQNR score of the FP32 format [13], which will also be used in the performance comparison below.
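In practice the fixed-point iteration (17) converges within a few steps; a small sketch is given below (our own helper, with the iteration count chosen generously).

```python
import numpy as np

C_FP32_DB = 151.9  # SQNR score of the FP32 format [13]

def sigma_max(n, num_iter=50):
    """Iteratively solve (17) for the variance point where SQNR(sigma, n) peaks."""
    s = 10.0 ** (C_FP32_DB / 20.0)                 # initialization sigma_max^(n)(0)
    for _ in range(num_iter):
        s = 2.0 ** (n + 0.5) / np.log(
            1.0 - 2.0 ** (n - 0.5) / s + 3.0 * s * 2.0 ** ((127 - 2 * n) / 2.0)
        )
    return s
```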
For a given n, Figure 1 shows a rapid decrease in SQNR when moving away (to the left or right) from the point σ_dB,max^(n). This means that FXP32 is not robust to changes in the variance level (robustness is essential for processing variance-sensitive data). This is not an issue for the FP32 format, which maintains a stable SQNR over a wide variance range. Directly comparing the achieved SQNR values shows that FXP32 is able to achieve better scores than FP32, but only over a small range of σ_dB values. This means that FXP32 can represent data with higher precision than FP32 within a narrow variance range; hence, it is a great option for data confined to a low variance range. However, practical applications (e.g., neural networks) typically require a much wider variance range, so improving the FXP32 format is essential for effective use.
By increasing the SQNR in the wide variance range, this work achieves an extension of the FXP32 format’s dynamic range. The method exploits the switching strategy [15] but applies different design criteria, as will be explained in the next section.

3. Switched FXP32 Format

This section presents a switched FXP32 format that uses the full range of possible n values, in contrast to the classic FXP32 format, where n is fixed. Each n operates within a certain range of variances, while the data variance level defines which one is chosen (i.e., switching is accomplished variance-wise). Thus, switched FXP32 partitions the variance range into non-overlapping ranges and assigns the optimal n to each range. The identification of these non-overlapping variance ranges, also called subranges, is an essential task. The process applied here is distinct from that in [15], where the intersection points of the SQNR curves of neighboring n are used as subrange boundaries. This will be clarified next.
The design of the considered solution seeks to achieve the SQNR performance for each n equal to or higher than the FP32, which can be mathematically formulated as follows:
$$\mathrm{SQNR}(\sigma_{\mathrm{dB}}, n) \ge \mathrm{const.}, \quad \sigma_{\mathrm{dB}} \in \left[\sigma_{\mathrm{dB},L}^{(n)},\, \sigma_{\mathrm{dB},U}^{(n)}\right], \tag{21}$$
where σ_dB,L^(n) = 20 log10 σ_L^(n) and σ_dB,U^(n) = 20 log10 σ_U^(n) represent the lower and upper bounds of the subrange (the superscript n refers to the value of n), with 0 ≤ n ≤ 31. Clearly, we set const. = C = 151.9 dB.
We will use Figure 1 to explain the division of the variance range considering criterion (21). Note that the SQNR curve of the classic FXP32 format for each observed n intersects the SQNR curve of the FP32 format at two variance points, to the left and right of σ_dB,max^(n), denoted as σ_dB,l^(n) and σ_dB,r^(n), respectively. Thus, a general rule for creating non-overlapping subranges is introduced:
  • Select σ_dB,r^(n) as the upper bound, i.e., σ_dB,U^(n) = σ_dB,r^(n), 0 ≤ n ≤ 31;
  • Select the upper bound for the previous n as the lower bound for the current n, i.e., σ_dB,L^(n) = σ_dB,U^(n−1), 1 ≤ n ≤ 31, except for the case n = 0, where σ_dB,L^(0) = σ_dB,l^(0).
To determine σ_dB,U^(n), an iterative rule is proposed, as Lemma 3 indicates.
Lemma 3. 
The upper threshold of the subrange corresponding to n can be found with the following iterative rule:
$$\sigma_U^{(n)}(i) = \frac{2^{\,n+1/2}}{\ln\left[\dfrac{3\left(\sigma_U^{(n)}(i-1)\right)^2 - 2^{2n-64}}{3\left(\sigma_U^{(n)}(i-1)\right)^2 10^{-C/10} - 2^{2n-64}}\right]}. \tag{22}$$
Proof of Lemma 3. 
To prove this, we use expression (12) and apply the following condition:
$$\mathrm{SQNR}\left(\sigma = \sigma_U^{(n)}, n\right) = C, \tag{23}$$
that is,
$$-10 \log_{10} \left[\frac{2^{2n-64}}{3\left(\sigma_U^{(n)}\right)^2} \left(1 - \exp\left(-\frac{2^{\,n+1/2}}{\sigma_U^{(n)}}\right)\right) + \exp\left(-\frac{2^{\,n+1/2}}{\sigma_U^{(n)}}\right)\right] = C. \tag{24}$$
Solving (24) with respect to σ_U^(n), we obtain the following equation:
$$\sigma_U^{(n)} = \frac{2^{\,n+1/2}}{\ln\left[\dfrac{3\left(\sigma_U^{(n)}\right)^2 - 2^{2n-64}}{3\left(\sigma_U^{(n)}\right)^2 10^{-C/10} - 2^{2n-64}}\right]}, \tag{25}$$
which requires iterative solving, so the proof is completed. □
Appropriate initialization of (22) can be achieved with σ_U^(n)(0) = σ_max^(n), where σ_max^(n) is defined by (17).
Given that σ_U^(n) is a critical parameter, this paper also proposes a very accurate approximate expression for its calculation. According to Figure 1, the SQNR decreases for σ_dB ∈ [σ_dB,max^(n), σ_dB,U^(n)] because the component Dov has a greater effect on the total distortion than Dg. Therefore, the SQNR can be approximated as follows:
$$\mathrm{SQNR}_a(\sigma, n) = 10 \log_{10} \frac{\sigma^2}{D_{ov}} = 10 \log_{10} \exp\left(\frac{2^{\,n+1/2}}{\sigma}\right). \tag{26}$$
Now, the approximate σ_U^(n), denoted as σ_U^(n,a), can be obtained as the solution of the following equation:
$$\mathrm{SQNR}_a\left(\sigma = \sigma_U^{(n),a}, n\right) = C, \tag{27}$$
resulting in
$$\sigma_U^{(n),a} = \frac{2^{\,n+1/2}}{\ln\left(10^{C/10}\right)}. \tag{28}$$
The relative error |σ_U^(n) − σ_U^(n,a)| / σ_U^(n) × 100 [%], calculated for n ranging from 0 to 31, is illustrated in Figure 2. The approximate formula (28) is evidently very accurate, with an approximation error below 0.2%.
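The sketch below (our own helpers, reusing sigma_max and C_FP32_DB from above) computes the exact upper threshold via the iteration (22) and compares it with the closed-form approximation (28).

```python
import numpy as np

def sigma_upper(n, c_db=C_FP32_DB, num_iter=50):
    """Iteratively solve (22) for the upper subrange threshold sigma_U^(n)."""
    s = sigma_max(n)                                   # initialization sigma_U^(n)(0)
    for _ in range(num_iter):
        s2 = 3.0 * s * s
        s = 2.0 ** (n + 0.5) / np.log(
            (s2 - 2.0 ** (2 * n - 64)) / (s2 * 10.0 ** (-c_db / 10.0) - 2.0 ** (2 * n - 64))
        )
    return s

def sigma_upper_approx(n, c_db=C_FP32_DB):
    """Closed-form approximation (28) of sigma_U^(n)."""
    return 2.0 ** (n + 0.5) / np.log(10.0 ** (c_db / 10.0))

for n in (0, 8, 16, 31):
    exact, approx = sigma_upper(n), sigma_upper_approx(n)
    print(n, 20 * np.log10(exact), 100 * abs(exact - approx) / exact)  # dB threshold, error in %
```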
The switched FXP32 format can be described once the subranges have been determined. Its principle of operation depends on whether the input data have a zero or non-zero mean, as shown in Figure 3. Let x_i, i = 1, 2, …, M, denote the input data samples. The mean µ is calculated as follows [16]:
$$\mu = \frac{1}{M} \sum_{i=1}^{M} x_i. \tag{29}$$
If µ = 0, then the variance of the data should be estimated as follows [16]:
$$\sigma^2 = \frac{1}{M} \sum_{i=1}^{M} x_i^2, \tag{30}$$
which is further used to select n for the available data according to the following:
$$n = \begin{cases} 0, & \text{if } 10\log_{10}\sigma^2 \in \left[\sigma_{\mathrm{dB},L}^{(0)},\, \sigma_{\mathrm{dB},U}^{(0)}\right], \\ j, & \text{if } 10\log_{10}\sigma^2 \in \left(\sigma_{\mathrm{dB},U}^{(j-1)},\, \sigma_{\mathrm{dB},U}^{(j)}\right], \quad j = 1, \ldots, 31. \end{cases} \tag{31}$$
Figure 3. The flowchart of the proposed switched FXP32 format.
Since n is required in the decoding phase, it should be quantized and stored in memory using 32 bits. After this, the data samples are processed using (4), and the quantized data x_{i,q} are obtained. When µ ≠ 0, the aforementioned steps must be applied after subtracting the mean from the input data. In this scenario, 32 bits should also be used to quantize and store µ in memory. In the decoding phase, for zero-mean data the decoded samples are x_{i,d} = x_{i,q}, while in the other case x_{i,d} = x_{i,q} + µ. Algorithm 1 presents the switched FXP32 quantization procedure in more detail.
Algorithm 1: The switched FXP32 quantization procedure
Require: data samples x_i (i = 1, …, M), subrange thresholds σ_U^(n) (n = 0, …, 31)
% Encoding phase
1:  Estimate the mean µ using (29)
2:  if µ ≠ 0 then
3:    Subtract µ from the input data
4:  end if
5:  Calculate the variance σ² using (30)
6:  Apply the switching logic (31) to select n
7:  for i = 1 to M do
8:    Process x_i using (4) to obtain the quantized sample x_{i,q}
9:  end for
% Decoding phase
Require: quantized data samples x_{i,q} (i = 1, …, M), n, µ
1:  for i = 1 to M do
2:    if µ = 0 then
3:      Decode data as x_{i,d} = x_{i,q}
4:    else
5:      Decode data as x_{i,d} = x_{i,q} + µ
6:    end if
7:  end for
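For concreteness, a minimal NumPy sketch of Algorithm 1 is given below; it reuses fxp32_quantize and sigma_upper from the earlier sketches, and the function names and the dB bookkeeping are our own illustration rather than a reference implementation.

```python
import numpy as np

# Precompute the upper subrange thresholds in dB (cf. Table 1) once.
SIGMA_U_DB = np.array([20 * np.log10(sigma_upper(n)) for n in range(32)])

def switched_fxp32_encode(x):
    """Encoding phase of Algorithm 1: returns quantized samples, the selected n, and the mean."""
    x = np.asarray(x, dtype=np.float64)
    mu = x.mean()                                    # step 1, Eq. (29)
    if mu != 0.0:
        x = x - mu                                   # step 3
    var_db = 10 * np.log10(np.mean(x ** 2))          # step 5, Eq. (30), in dB
    n = int(np.searchsorted(SIGMA_U_DB, var_db))     # step 6, switching logic (31)
    n = min(n, 31)                                   # clamp to the last subrange
    xq = fxp32_quantize(x, n)                        # steps 7-9, Eq. (4)
    return xq, n, mu

def switched_fxp32_decode(xq, mu):
    """Decoding phase of Algorithm 1."""
    return xq if mu == 0.0 else xq + mu
```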
The SQNR will also be used to express the performance of the switched FXP32 format. As expected, it can be derived from the classic FXP32 format’s SQNR by taking the portions of the curves inside the subrange borders:
$$\mathrm{SQNR}(\sigma) = \begin{cases} \mathrm{SQNR}(\sigma, 0), & \text{if } 10\log_{10}\sigma^2 \in \left[\sigma_{\mathrm{dB},L}^{(0)},\, \sigma_{\mathrm{dB},U}^{(0)}\right], \\ \mathrm{SQNR}(\sigma, n), & \text{if } 10\log_{10}\sigma^2 \in \left(\sigma_{\mathrm{dB},U}^{(n-1)},\, \sigma_{\mathrm{dB},U}^{(n)}\right], \quad n = 1, \ldots, 31. \end{cases} \tag{32}$$

4. Results and Discussion

This section presents the theoretical performance (SQNR) results for the proposed switched FXP32, together with the experimental and simulation results based on real data.

4.1. Theoretical SQNR Results

Figure 4 illustrates the theoretical SQNR for the switched FXP32. It indicates the subranges where each n operates, while the boundaries of these subranges are determined and reported in Table 1. Figure 4 also shows other relevant parameters, such as the dynamic range limits σ_dB^min and σ_dB^max, the minimum (SQNR_min) and maximum (SQNR_max) values of the SQNR, and the SQNR dynamics ΔSQNR = SQNR_max − SQNR_min. In addition, δ_n, the width of each subrange, is also indicated. Note that each subrange, except the one to which n = 0 is assigned, has a constant width of 6.02 dB (this can also be concluded from Table 1). The concrete numerical values of these parameters are listed in Table 2.
Comparison with fixed-point baselines. As fixed-point baselines, we use the DFP and S-DFP formats [6]. Both DFP and S-DFP use r = 8 bits for the integer part and an 8-bit shared exponent. For DFP8, the shared exponent is given by e_s = e_m − (r − 2), where e_m = log2(max_i |x_i|) is the exponent of the absolute maximum data value, and its step size is Δ_DFP = 2^{e_s}. In the case of S-DFP8, the step size is Δ_S-DFP = 2^{e_m,mod}, where e_m,mod = e_m − bias − (r − 2) is the shared exponent and bias = 8. The performance of the DFP8 and S-DFP8 formats can be calculated using (6), (7), and (11), assuming that x_max^DFP = 2^7 Δ_DFP and x_max^S-DFP = 2^7 Δ_S-DFP; the results are presented in Figure 5 and Figure 6, respectively, for several values of σ_dB. Figure 5 reveals that DFP8 depends heavily on e_m for a particular data variance. Note that the e_m that is optimal for some variance (e.g., e_m = 0 for σ_dB = −10 dB) does not guarantee the maximal SQNR (nearly 35 dB [16]) of the DFP8 format across other variances. Thus, the rule above for e_m provides an optimal choice for some variance, but in most cases this will differ from the actual variance of the data. Nevertheless, even if the optimal e_m is chosen, the capabilities of DFP8 remain significantly below those of the proposed switched FXP32 for the same variance values (see Figure 4).
It is easy to see that the SQNR of the S-DFP8 format is a shifted version of the SQNR of the DFP8, so the conclusions drawn above can be extended to S-DFP8.
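As an illustration of how these baseline curves are obtained, a small sketch (our own, under the stated assumptions about the step size and x_max of DFP8/S-DFP8) evaluates their SQNR from (6), (7), and (11) for a Laplacian source.

```python
import numpy as np

def sqnr_dfp8_db(sigma_db, e_m, r=8, bias=None):
    """SQNR of DFP8 (bias=None) or S-DFP8 (bias=8) for a Laplacian source,
    using the granular/overload distortions (6), (7) and definition (11)."""
    sigma = 10.0 ** (sigma_db / 20.0)
    e_shared = e_m - (r - 2) if bias is None else e_m - bias - (r - 2)
    delta = 2.0 ** e_shared                   # step size
    x_max = 2.0 ** (r - 1) * delta            # support region threshold, 2^7 * delta
    t = np.sqrt(2.0) * x_max / sigma
    d_g = (delta ** 2 / 12.0) * (1.0 - np.exp(-t))
    d_ov = sigma ** 2 * np.exp(-t)
    return 10.0 * np.log10(sigma ** 2 / (d_g + d_ov))

# Example: DFP8 at sigma_dB = -10 dB for several shared-exponent choices.
print([round(sqnr_dfp8_db(-10.0, e_m), 2) for e_m in range(-4, 5)])
```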
Comparison with floating-point baselines. As floating-point baselines, we use the FP32 [13] and bfloat16 [18] formats, as shown in Figure 7.
In the provided figure, we have also added the SQNR curve of the classic FXP32 format, whose key parameter is selected in the following way:
$$n_{\mathrm{opt}} = \arg\max_{n} \frac{1}{m} \sum_{j=1}^{m} \mathrm{SQNR}(\sigma_j, n), \tag{33}$$
where m = 3000 is the number of variances σ_j taken from the range [σ_dB^min, σ_dB^max]. In other words, within the given variance range, such n should provide the best performance in terms of average SQNR (SQNR_av). Figure 8 reveals that n = 25 satisfies condition (33).
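A sketch of this selection (our own, using the sqnr_fxp32_db helper and the dynamic range limits reported in Table 2) is shown below.

```python
import numpy as np

# Grid of m = 3000 variance levels (in dB) spanning the dynamic range from Table 2.
sigma_db_grid = np.linspace(-45.5, 158.8, 3000)

# Average SQNR for every candidate n, then pick the best one, Eq. (33).
avg_sqnr = [np.mean([sqnr_fxp32_db(s, n) for s in sigma_db_grid]) for n in range(32)]
n_opt = int(np.argmax(avg_sqnr))   # expected to match the value n = 25 reported in Figure 8
```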
It is obvious from Figure 7 that the classic FXP32 (n = 25) reaches the same maximal SQNR, but it operates over a much narrower dynamic range than our proposal (16.1 dB vs. Σ_{n=0}^{31} δ_n = σ_dB^max − σ_dB^min = 204.3 dB). The included floating-point baselines provide a stable SQNR in the observed variance range, with FP32 achieving the best scores. Interestingly, the classic FXP32 achieves a better SQNR than the bfloat16 format across a relatively wide variance range. Compared to FP32 and bfloat16, the switched FXP32 format offers a competitive dynamic range, with gains in maximum SQNR of 15.9 dB and 112.2 dB, respectively. This confirms the high efficiency of the proposed solution.

4.2. Experimental SQNR Results

The experimental part is based on weights obtained from several NN configurations and databases. The networks MLPI (multi-layer perceptron) and CNNI (convolutional neural network) are applied to the MNIST database [19], while more complex MLPII and CNNII are applied to the CIFAR-10 database [20].
MLPI uses one hidden layer with 128 nodes, while its input and output layers use 784 and 10 nodes, respectively. The activation functions used in the hidden and output layers are ReLU and softmax, respectively. The hyperparameters adopt the following values: regularization rate = 0.3, learning rate = 0.0005, and batch size = 128. Training is performed over 50 epochs.
For CNNI we adopt the model from [12], which incorporates a convolutional layer, max pooling, a fully connected layer, and an output layer. The number of output filters in the convolutional layer is set to 32, with a kernel size of 3 × 3. The size of the pooling window is 2 × 2. A fully connected layer with 100 units, activated by the ReLU activation function, is placed on top of it, before the output layer. Dropout of 0.5 is applied to the fully connected layer. CNNI is trained in batches of size 128 across 10 epochs.
MLPII comprises five fully connected hidden layers with progressively decreasing dimensionality: two layers with 512 units, followed by two with 256 units, and one with 128 units. These are followed by a final output layer that predicts probabilities across the ten CIFAR-10 classes. To improve training dynamics and reduce overfitting, the architecture includes batch normalization (to stabilize and accelerate training) and dropout (as a form of regularization). Data augmentation is applied during training in the form of random cropping and horizontal flipping, while validation and test data are standardized using only pixel-wise normalization. The network is trained for 40 epochs using the Adam optimizer with a learning rate of 0.0001.
CNNII architecture follows a VGG-style design composed of three convolutional blocks. Each block contains two convolutional layers with ReLU activations and same-padding, followed by batch normalization to improve stability, max pooling for spatial down sampling, and dropout for regularization. The number of convolutional filters increases across blocks, from 64 to 128 and, finally, to 256. A fully connected stage with 512 units follows the convolutional blocks, again incorporating batch normalization and dropout. The network is trained for 30 epochs using the Adam optimizer with a learning rate of 0.001.
The weight histograms of the trained MLP and CNN networks are depicted in Figure 9. Observe that the distribution of the weights can be approximated with a Laplacian PDF (especially in Figure 9a–c). The presence of Laplacian-distributed data in NNs is thereby confirmed. For MLPI we used the weights between the input and hidden layer (100,352 in total), whose variance is σ_w² = 5.77 × 10^{−4} (σ_w,dB = −32.39 dB) (Figure 9a). In the case of CNNI, we analyzed the weights between the input and hidden layer of the fully connected part (540,800 in total), whose variance is σ_w² = 0.0034 (σ_w,dB = −24.67 dB) (Figure 9b). The weights in MLPII are taken from the first fully connected layer (1,572,864 in total), whose variance is σ_w² = 6.84 × 10^{−4} (σ_w,dB = −31.65 dB) (Figure 9c). Finally, from CNNII we also employ the weights from the fully connected part (2,097,152 in total), having a variance of σ_w² = 0.0079 (σ_w,dB = −21 dB) (Figure 9d). All the weight sets used have a mean value close to zero.
Table 3 gives the experimental SQNR for the proposed switched FXP32 and for the FP32 and DFP8 [6] baselines, calculated as follows [12]:
$$\mathrm{SQNR} = 10 \log_{10} \frac{\sum_{i=1}^{M} w_i^2}{\sum_{i=1}^{M} \left(w_i - w_{i,q}\right)^2}, \tag{34}$$
where w_i denotes the unquantized weight, w_{i,q} is the quantized weight, and M is the total number of weights. It is evident that the performance of the proposed switched FXP32 on the weights of all the considered MLP/CNN architectures is higher than that of the employed baselines. Note that even for weights whose distribution deviates slightly from the Laplacian model (the CNNII weights), making the design suboptimal, improvements are still obtained. For identical variance values, the achieved experimental SQNR results are in very good agreement with the theoretical ones displayed in Figure 7.
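A sketch of this measurement (our own, reusing the switched_fxp32_encode and switched_fxp32_decode helpers) is given below; the weights argument stands for any flattened weight array extracted from a trained network.

```python
import numpy as np

def empirical_sqnr_db(weights):
    """Quantize a weight vector with the switched FXP32 format and evaluate Eq. (34)."""
    w = np.asarray(weights, dtype=np.float64).ravel()
    wq, n, mu = switched_fxp32_encode(w)
    wd = switched_fxp32_decode(wq, mu)        # decoded weights (mean added back if needed)
    return 10.0 * np.log10(np.sum(w ** 2) / np.sum((w - wd) ** 2)), n
```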

4.3. Simulation SQNR Results

The simulation part uses the weights from the experimental part (e.g., from the MLPII network) as starting points and aims to verify the efficiency of the switched FXP32 format in situations where the variance of the weights changes. Figure 10 demonstrates the simulation SQNR for weights with variance in the range of −40 dB to 150 dB, provided using the following procedure:
(1) Scale the initial weights by k = 10^{(σ_dB − σ_w,dB)/20} to obtain new weights of variance σ_dB;
(2) Apply Algorithm 1 (Section 3) to the new weights;
(3) Calculate the SQNR using (34).
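This sweep can be sketched as follows (our own illustration, reusing the helpers above; mlp2_weights and its variance in dB are placeholders for the MLPII weights described in Section 4.2).

```python
import numpy as np

def simulate_sweep(weights, sigma_w_db, target_dbs):
    """Rescale the reference weights to each target variance (step 1), quantize them
    with Algorithm 1 (step 2), and record the resulting SQNR from Eq. (34) (step 3)."""
    results = []
    for sigma_db in target_dbs:
        k = 10.0 ** ((sigma_db - sigma_w_db) / 20.0)   # scaling factor, step (1)
        sqnr_db, n = empirical_sqnr_db(k * np.asarray(weights))
        results.append((sigma_db, sqnr_db, n))
    return results

# Hypothetical usage with the MLPII weights (sigma_w_dB = -31.65 dB, see Section 4.2):
# sweep = simulate_sweep(mlp2_weights, sigma_w_db=-31.65, target_dbs=np.arange(-40, 151, 10))
```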
Note that the switched FXP32 format preserves high SQNR values and outperforms the FP32 and DFP8 formats [6], while the simulated SQNR aligns well with the theoretical SQNR shown in Figure 7. This confirms the correctness of the theoretical design method. The fact that this approach achieves high efficiency in encoding various Laplacian-distributed data is a strong indicator of its potential usefulness in neural network applications.

4.4. Limitations

Although our method is effective, it exhibits certain shortcomings. Encoding data whose variance lies outside the format's dynamic range may be less efficient than with FP32. As a mitigation measure, a scaling method can be applied to shift the variance of the data into the desired range, with the data rescaled back at the end.

5. Conclusions

An effective method for improving the FXP32 format for Laplacian-distributed data was presented in this paper. Its main attribute is the ability to vary the number of bits for the integer part (denoted by n) depending on the estimated variance of the input data, unlike the classic FXP32 format, where this parameter is fixed. Due to the calculation of additional parameters, a slightly increased computational complexity is introduced compared to the classic variant. The theoretical design process was comprehensively explained, and the critical parameters were determined along with an expression for performance evaluation. Theoretical analysis over a wide variance range was performed using the SQNR as a performance metric, revealing a significant gain over existing floating-point (e.g., the standardized FP32) and fixed-point baselines. Experimental and simulation analyses based on neural network weights were further included to validate the theoretical results. Based on these encouraging findings, we believe that the presented switched FXP32 can be a useful solution in real-world systems where Laplacian-distributed data occur, e.g., neural networks. Future work will include its application in neural networks as an alternative to FP32, where considerable savings in computational and hardware resources are expected.

Author Contributions

Conceptualization, B.D. and Z.P.; methodology, B.D. and Z.P.; software, B.D., M.D., N.S. and M.A.; validation, B.D., Z.P. and M.D.; investigation, B.D., Z.P., M.D. and S.P.; resources, M.D.; data curation, S.P., N.S. and M.A.; writing—original draft preparation, B.D.; writing—review and editing, Z.P. and M.D.; visualization, B.D. and M.D.; supervision, Z.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ministry of Science, Technological Development and Innovation of the Republic of Serbia (grant number 451-03-136/2025-03/200102), as well as by the European Union’s Horizon 2023 research and innovation program through the AIDA4Edge Twinning project (grant ID 101160293).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
bfloat16: Brain floating-point format
CIFAR-10: Canadian Institute for Advanced Research database
CNN: Convolutional neural network
DFP: Dynamic fixed-point format
FP: Floating point
FP32: 32-bit floating-point format
FXP: Fixed-point format
FXP32: 32-bit fixed-point format
MNIST: Modified National Institute of Standards and Technology database
MLP: Multilayer perceptron
MSE: Mean squared error
NN: Neural network
PDF: Probability density function
S-DFP: Shifted dynamic fixed-point format
SQNR: Signal-to-quantization noise ratio

References

  1. IEEE 754-2019; Standard for Floating-Point Arithmetic. IEEE: Piscataway, NJ, USA, 2019.
  2. Mienye, I.D.; Swart, T.G.; Obaido, G. Recurrent neural networks: A comprehensive review of architectures, variants, and applications. Information 2024, 15, 517.
  3. Aymone, F.M.; Pau, D.P. Benchmarking in-sensor machine learning computing: An extension to the MLCommons-Tiny suite. Information 2024, 15, 674.
  4. He, F.; Ding, K.; Yan, D.; Li, J.; Wang, J.; Chen, M. A novel quantization and model compression approach for hardware accelerators in edge computing. Comput. Mater. Contin. 2024, 80, 3021–3045.
  5. Das, D.; Mellempudi, N.; Mudigere, D.; Kalamkar, D.; Avancha, S.; Banerjee, K.; Sridharan, S.; Vaidyanathan, K.; Kaul, B.; Georganas, E.; et al. Mixed precision training of convolutional neural networks using integer operations. arXiv 2018, arXiv:1802.00930.
  6. Sakai, Y.; Tamiya, Y. S-DFP: Shifted dynamic fixed point for quantized deep neural network training. Neural Comput. Appl. 2025, 37, 535–542.
  7. Kummer, L.; Sidak, K.; Reichmann, T.; Gansterer, W. Adaptive Precision Training (AdaPT): A Dynamic Fixed Point Quantized Training Approach for DNNs. In Proceedings of the 2023 SIAM International Conference on Data Mining (SDM23), Minneapolis, MN, USA, 27–29 April 2023.
  8. Alsuhli, G.; Sakellariou, V.; Saleh, H.; Al-Qutayri, M.; Mohammad, B.; Stouraitis, T. DFXP for DNN architectures. In Number Systems for Deep Neural Network Architectures: Synthesis Lectures on Engineering, Science, and Technology; Springer: Cham, Switzerland, 2024.
  9. Sungrae, K.; Hyun, K. Zero-centered fixed-point quantization with iterative retraining for deep convolutional neural network-based object detectors. IEEE Access 2021, 9, 20828–20839.
  10. Wu, S.; Li, G.; Chen, F.; Shi, L. Training and inference with integers in deep neural networks. arXiv 2018, arXiv:1802.04680.
  11. Banner, R.; Nahshan, Y.; Hoffer, E.; Soudry, D. ACIQ: Analytical clipping for integer quantization of neural networks. arXiv 2018, arXiv:1810.05723.
  12. Peric, Z.; Savic, M.; Simic, N.; Denic, B.; Despotovic, V. Design of a 2-bit neural network quantizer for Laplacian source. Entropy 2021, 23, 933.
  13. Peric, Z.; Savic, M.; Dincic, M.; Vucic, N.; Djosic, D.; Milosavljevic, S. Floating Point and Fixed Point 32-bits Quantizers for Quantization of Weights of Neural Networks. In Proceedings of the 12th International Symposium on Advanced Topics in Electrical Engineering (ATEE 2021), Bucharest, Romania, 25–27 March 2021.
  14. Peric, Z.; Dincic, M. Optimization of the 24-Bit fixed-point format for the Laplacian source. Mathematics 2023, 11, 568.
  15. Dincic, M.; Peric, Z.; Denic, D.; Denic, B. Optimization of the fixed-point representation of measurement data for intelligent measurement systems. Measurement 2023, 217, 113037.
  16. Jayant, N.C.; Noll, P. Digital Coding of Waveforms: Principles and Applications to Speech and Video; Prentice Hall: Englewood Cliffs, NJ, USA, 1984.
  17. Gersho, A.; Gray, R. Vector Quantization and Signal Compression; Kluwer Academic Publishers: New York, NY, USA, 1992.
  18. Burgess, N.; Milanovic, J.; Stephens, N.; Monachopoulos, K.; Mansell, D. Bfloat16 Processing for Neural Networks. In Proceedings of the IEEE 26th Symposium on Computer Arithmetic (ARITH 2019), Kyoto, Japan, 10–12 June 2019.
  19. Lecun, Y.; Cortez, C.; Burges, C. The MNIST Handwritten Digit Database. Available online: http://yann.lecun.com (accessed on 1 February 2025).
  20. Krizhevsky, A.; Nair, V. The CIFAR-10 and CIFAR-100 Dataset. 2019. Available online: https://www.cs.toronto.edu (accessed on 1 February 2025).
Figure 1. SQNR as a function of σdB for the FXP32 format, for various n.
Figure 2. Relative error for the approximate upper bound defined by (28).
Figure 4. SQNR of the switched FXP32 format across a wide range of variances.
Figure 5. SQNR vs. em for DFP8 for several data variance values.
Figure 6. SQNR vs. em for S-DFP8 for several data variance values.
Figure 7. Comparison of the switched FXP32 with different floating-point solutions.
Figure 8. The average SQNR versus n for the classic FXP32 format.
Figure 9. Histograms of weights of trained neural networks: (a) MLPI (MNIST database); (b) CNNI (MNIST database); (c) MLPII (CIFAR-10 database); (d) CNNII (CIFAR-10 database).
Figure 10. SQNR simulation results obtained for the weights of the MLPII network.
Table 1. The values of the subrange threshold σ_U^(n).

n | σ_U^(n) [dB] | n | σ_U^(n) [dB] | n | σ_U^(n) [dB] | n | σ_U^(n) [dB]
0 | −27.87 | 8 | 20.29 | 16 | 68.45 | 24 | 116.61
1 | −21.85 | 9 | 26.31 | 17 | 74.47 | 25 | 122.63
2 | −15.83 | 10 | 32.33 | 18 | 80.49 | 26 | 128.65
3 | −9.81 | 11 | 38.35 | 19 | 86.51 | 27 | 134.67
4 | −3.79 | 12 | 44.37 | 20 | 92.53 | 28 | 140.69
5 | 2.23 | 13 | 50.39 | 21 | 98.55 | 29 | 146.71
6 | 8.25 | 14 | 56.41 | 22 | 104.57 | 30 | 152.73
7 | 14.27 | 15 | 62.43 | 23 | 110.59 | 31 | 158.75
Table 2. The values of specific parameters of the switched FXP32.

δ_0 [dB] | δ_n [dB] | σ_dB^min [dB] | σ_dB^max [dB] | SQNR_max [dB] | SQNR_min [dB] | ΔSQNR [dB]
17.6 | 6.02 | −45.5 | 158.8 | 167.8 | 151.9 | 15.9
Table 3. Experimental SQNR for the switched FXP32, FP32, and DFP8 [6], obtained on weights from different NN architectures.

Format | MLPI weights | MLPII weights | CNNI weights | CNNII weights
Switched FXP32 | 165.05 dB (n = 0) | 165.70 dB (n = 0) | 158.01 dB (n = 1) | 164.38 dB (n = 2)
FP32 | 151.94 dB | 151.93 dB | 151.95 dB | 151.92 dB
DFP8 [6] | 32.59 dB | 27.31 dB | 34.29 dB | 37.92 dB