Switched 32-Bit Fixed-Point Format for Laplacian-Distributed Data
Abstract
1. Introduction
- Switched FXP32 quantization. This scheme chooses the optimal number of bits for the integer part, n, according to the variance of the data to be quantized, unlike conventional FXP32, where that parameter is fixed. It can therefore dynamically track changes in the data and maintain high performance.
- Theoretical design. The design relies on the analytical closed-form SQNR expression derived in this paper, which enables accurate calculation of the variance subrange over which a given n operates, as well as of the dynamic range.
- Experimental and simulation analysis. These validate the theoretical switched FXP32 model: several NN configurations and benchmark datasets are employed to obtain the weights on which the model is tested, indicating the potential of the proposed solution for NN applications.
2. The FXP32 Format
2.1. Basics and Equivalent Quantizer Model
2.2. Performance Analysis Using SQNR
3. Switched FXP32 Format
- Select the variance at which the SQNR of the FXP32 format with a given n is maximal as the upper bound of the corresponding subrange, i.e., σ2max,n, 0 ≤ n ≤ 31;
- Select the upper bound for the previous n as the lower bound for the current n, i.e., σ2min,n = σ2max,n−1, 1 ≤ n ≤ 31, except for the case n = 0, where the lower bound lies δ0 below σ2max,0 (see the sketch after this list).
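For concreteness, the following minimal Python sketch (not the authors' code) builds the 32 subrange upper bounds implied by the table in Section 4.1 — spaced 6.02 dB apart, starting from −27.87 dB at n = 0 — and selects n for a measured variance. The constant SIGMA2_MAX_DB and the helper select_n are illustrative names.

```python
import numpy as np

# Assumed subrange upper bounds sigma^2_max,n in dB: spaced 6.02 dB apart,
# starting at -27.87 dB for n = 0 (values match the table in Section 4.1).
SIGMA2_MAX_DB = -27.87 + 6.02 * np.arange(32)

def select_n(variance_db: float) -> int:
    """Switching logic sketch: pick the smallest n whose subrange upper
    bound covers the measured variance; larger variances map to n = 31."""
    idx = int(np.searchsorted(SIGMA2_MAX_DB, variance_db))
    return min(idx, 31)
```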
Algorithm 1: The switched FXP32 quantization procedure

Require: input data samples xi (i = 1, …, M); candidate parameters n = 0, …, 31
% Encoding phase
1: Estimate mean µ using (29)
2: if µ ≠ 0 then
3:   Subtract µ from input data
4: end if
5: Calculate variance σ2 using (30)
6: Apply the switching logic (31) to select n
7: while i ≤ M do
8:   Process data using (4) to obtain quantized samples xi,q
9: end while
% Decoding phase
Require: quantized data samples xi,q (i = 1, …, M), n, µ
1: while i ≤ M do
2:   if µ = 0 then
3:     Decode data as xi,d = xi,q
4:   else
5:     Decode data as xi,d = xi,q + µ
6:   end if
7: end while
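The two phases can then be sketched end to end in Python, under explicit assumptions: (29) and (30) are read as the sample mean and variance, (31) as the select_n logic sketched above, and (4) as uniform rounding with 1 sign bit, n integer bits, and 31 − n fractional bits. This is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np  # select_n from the previous sketch is assumed in scope

def fxp32_quantize(x, n):
    """Assumed reading of Eq. (4): uniform quantization with 1 sign bit,
    n integer bits and 31 - n fractional bits."""
    step = 2.0 ** (n - 31)                    # quantization step size
    limit = 2.0 ** n                          # edge of the dynamic range
    xq = np.round(x / step) * step            # round to the nearest level
    return np.clip(xq, -limit, limit - step)  # saturate out-of-range samples

def switched_fxp32_encode(x):
    """Encoding phase of Algorithm 1 (sketch)."""
    mu = x.mean()                             # step 1: estimate mean, cf. (29)
    if mu != 0:                               # steps 2-4: center the data
        x = x - mu
    var_db = 10 * np.log10(x.var())           # step 5: variance in dB, cf. (30)
    n = select_n(var_db)                      # step 6: switching logic, cf. (31)
    return fxp32_quantize(x, n), n, mu        # steps 7-9: quantize all samples

def switched_fxp32_decode(xq, mu):
    """Decoding phase of Algorithm 1 (sketch): re-add the mean if nonzero."""
    return xq if mu == 0 else xq + mu
```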
4. Results and Discussion
4.1. Theoretical SQNR Results
4.2. Experimental SQNR Results
4.3. Simulation SQNR Results
- (1) Scale the initial weights by the factor needed to set their variance to the target level σ2 [dB] (i.e., by the ratio of the target to the current standard deviation);
- (2) Apply Algorithm 1 (Section 3) on the new weights;
- (3) Calculate the SQNR using (34); a code sketch of these steps follows the list.
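Under the same assumptions as the sketches in Section 3, the three steps could look as follows; simulate_sqnr and sqnr_db are hypothetical helper names, and (34) is read as the usual ratio of signal power to quantization-noise power.

```python
import numpy as np  # reuses switched_fxp32_encode/decode from Section 3's sketch

def sqnr_db(x, xd):
    """Assumed reading of Eq. (34): signal power over noise power, in dB."""
    return 10 * np.log10(np.sum(x ** 2) / np.sum((x - xd) ** 2))

def simulate_sqnr(weights, target_var_db):
    """Rescale the weights to a target variance given in dB, run Algorithm 1,
    and measure the resulting SQNR."""
    target_var = 10 ** (target_var_db / 10)
    w = weights * np.sqrt(target_var / np.var(weights))  # step (1): rescale
    wq, n, mu = switched_fxp32_encode(w)                 # step (2): Algorithm 1
    wd = switched_fxp32_decode(wq, mu)                   # decode before comparing
    return sqnr_db(w, wd), n                             # step (3): SQNR via (34)
```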
4.4. Limitations
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
bfloat16 | Brain floating-point format |
CIFAR-10 | Canadian Institute for Advanced Research database |
CNN | Convolutional neural network |
DFP | Dynamic fixed-point format |
FP | Floating point |
FP32 | 32-bit floating-point format |
FXP | Fixed-point format |
FXP32 | 32-bit fixed-point format |
MNIST | Modified National Institute of Standards and Technology database |
MLP | Multilayer perceptron |
MSE | Mean squared error |
NN | Neural network |
PDF | Probability density function |
S-DFP | Shifted dynamic fixed-point format |
SQNR | Signal to quantization noise ratio |
References
1. IEEE 754-2019; Standard for Floating-Point Arithmetic. IEEE: Piscataway, NJ, USA, 2019.
2. Mienye, I.D.; Swart, T.G.; Obaido, G. Recurrent neural networks: A comprehensive review of architectures, variants, and applications. Information 2024, 15, 517.
3. Aymone, F.M.; Pau, D.P. Benchmarking in-sensor machine learning computing: An extension to the MLCommons-Tiny suite. Information 2024, 15, 674.
4. He, F.; Ding, K.; Yan, D.; Li, J.; Wang, J.; Chen, M. A novel quantization and model compression approach for hardware accelerators in edge computing. Comput. Mater. Contin. 2024, 80, 3021–3045.
5. Das, D.; Mellempudi, N.; Mudigere, D.; Kalamkar, D.; Avancha, S.; Banerjee, K.; Sridharan, S.; Vaidyanathan, K.; Kaul, B.; Georganas, E.; et al. Mixed precision training of convolutional neural networks using integer operations. arXiv 2018, arXiv:1802.00930.
6. Sakai, Y.; Tamiya, Y. S-DFP: Shifted dynamic fixed point for quantized deep neural network training. Neural Comput. Appl. 2025, 37, 535–542.
7. Kummer, L.; Sidak, K.; Reichmann, T.; Gansterer, W. Adaptive Precision Training (AdaPT): A Dynamic Fixed Point Quantized Training Approach for DNNs. In Proceedings of the 2023 SIAM International Conference on Data Mining (SDM23), Minneapolis, MN, USA, 27–29 April 2023.
8. Alsuhli, G.; Sakellariou, V.; Saleh, H.; Al-Qutayri, M.; Mohammad, B.; Stouraitis, T. DFXP for DNN architectures. In Number Systems for Deep Neural Network Architectures: Synthesis Lectures on Engineering, Science, and Technology; Springer: Cham, Switzerland, 2024.
9. Kim, S.; Kim, H. Zero-centered fixed-point quantization with iterative retraining for deep convolutional neural network-based object detectors. IEEE Access 2021, 9, 20828–20839.
10. Wu, S.; Li, G.; Chen, F.; Shi, L. Training and inference with integers in deep neural networks. arXiv 2018, arXiv:1802.04680.
11. Banner, R.; Nahshan, Y.; Hoffer, E.; Soudry, D. ACIQ: Analytical clipping for integer quantization of neural networks. arXiv 2018, arXiv:1810.05723.
12. Peric, Z.; Savic, M.; Simic, N.; Denic, B.; Despotovic, V. Design of a 2-bit neural network quantizer for Laplacian source. Entropy 2021, 23, 933.
13. Peric, Z.; Savic, M.; Dincic, M.; Vucic, N.; Djosic, D.; Milosavljevic, S. Floating Point and Fixed Point 32-bit Quantizers for Quantization of Weights of Neural Networks. In Proceedings of the 12th International Symposium on Advanced Topics in Electrical Engineering (ATEE 2021), Bucharest, Romania, 25–27 March 2021.
14. Peric, Z.; Dincic, M. Optimization of the 24-bit fixed-point format for the Laplacian source. Mathematics 2023, 11, 568.
15. Dincic, M.; Peric, Z.; Denic, D.; Denic, B. Optimization of the fixed-point representation of measurement data for intelligent measurement systems. Measurement 2023, 217, 113037.
16. Jayant, N.S.; Noll, P. Digital Coding of Waveforms: Principles and Applications to Speech and Video; Prentice Hall: Englewood Cliffs, NJ, USA, 1984.
17. Gersho, A.; Gray, R.M. Vector Quantization and Signal Compression; Kluwer Academic Publishers: New York, NY, USA, 1992.
18. Burgess, N.; Milanovic, J.; Stephens, N.; Monachopoulos, K.; Mansell, D. Bfloat16 Processing for Neural Networks. In Proceedings of the IEEE 26th Symposium on Computer Arithmetic (ARITH 2019), Kyoto, Japan, 10–12 June 2019.
19. LeCun, Y.; Cortes, C.; Burges, C.J.C. The MNIST Handwritten Digit Database. Available online: http://yann.lecun.com (accessed on 1 February 2025).
20. Krizhevsky, A.; Nair, V. The CIFAR-10 and CIFAR-100 Dataset. 2019. Available online: https://www.cs.toronto.edu (accessed on 1 February 2025).
n | σ2max,n [dB] | n | σ2max,n [dB] | n | σ2max,n [dB] | n | σ2max,n [dB] |
---|---|---|---|---|---|---|---|
0 | −27.87 | 8 | 20.29 | 16 | 68.45 | 24 | 116.61 |
1 | −21.85 | 9 | 26.31 | 17 | 74.47 | 25 | 122.63 |
2 | −15.83 | 10 | 32.33 | 18 | 80.49 | 26 | 128.65 |
3 | −9.81 | 11 | 38.35 | 19 | 86.51 | 27 | 134.67 |
4 | −3.79 | 12 | 44.37 | 20 | 92.53 | 28 | 140.69 |
5 | 2.23 | 13 | 50.39 | 21 | 98.55 | 29 | 146.71 |
6 | 8.25 | 14 | 56.41 | 22 | 104.57 | 30 | 152.73 |
7 | 14.27 | 15 | 62.43 | 23 | 110.59 | 31 | 158.75 |
δ0 [dB] | δn [dB] | σ2min [dB] | σ2max [dB] | SQNRmax [dB] | SQNRFP32 [dB] | ΔSQNR [dB] |
---|---|---|---|---|---|---|
17.6 | 6.02 | −45.5 | 158.8 | 167.8 | 151.9 | 15.9 |
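Assuming the column labels above, the entries are mutually consistent: the lower end of the supported variance range sits δ0 below the first subrange bound, and the reported gain is the difference between the switched FXP32 and FP32 SQNR figures, i.e., σ2min = −27.87 − 17.6 ≈ −45.5 dB and ΔSQNR = 167.8 − 151.9 = 15.9 dB.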
Format | MLPI Weights | MLPII Weights | CNNI Weights | CNNII Weights |
---|---|---|---|---|
Switched FXP32 | 165.05 dB (n = 0) | 165.70 dB (n = 0) | 158.01 dB (n = 1) | 164.38 dB (n = 2) |
FP32 | 151.94 dB | 151.93 dB | 151.95 dB | 151.92 dB |
DFP8 [6] | 32.59 dB | 27.31 dB | 34.29 dB | 37.92 dB |