Article
Peer-Review Record

The Impact of 8- and 4-Bit Quantization on the Accuracy and Silicon Area Footprint of Tiny Neural Networks

Electronics 2025, 14(1), 14; https://doi.org/10.3390/electronics14010014
by Paweł Tumialis 1,*, Marcel Skierkowski 2, Jakub Przychodny 3 and Paweł Obszarski 4
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Reviewer 4:
Submission received: 26 October 2024 / Revised: 12 December 2024 / Accepted: 19 December 2024 / Published: 24 December 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The authors propose an interesting topic in the current context of moving some of the processing to the edge to reduce the data traffic to the cloud. Indeed, there are many advantages of using artificial intelligence at the edge (e.g., privacy, response time, reliability). However, the paper needs to go through extensive research.

It is not clear which architecture or microcontrollers are targeted in this research. The 8- and 4-bit quantization makes sense if the final neural network is deployed on a microcontroller where resources are limited. The final paper should present results from running the proposed TinyML model on a microcontroller.

Depending on the selected microcontroller, there might be neural network processing resources that can help in the optimization process. This aspect was ignored during the research. Even if that is the authors' intent, the paper would benefit from such a comparison (e.g., a microcontroller with and without a CNN accelerator, or with the Helium extension). Many articles have presented the impact and benefits of exploiting the microcontroller architecture when using AI.

The three proposed neural networks (i.e., for network anomalies, image classification, and speech recognition) are not fully characterized. The authors should detail the network anomalies detected (their number, type, etc.), and the same approach should be applied to image classification and speech recognition. The presentation is too general and cannot be compared with other research. When evaluating AI performance, there are ways to present results that facilitate comparison with other research (mAP, Rank-K matching accuracy).

The authors must properly test the proposed networks by providing details of the power consumption, response time, frame rate for image classification, and maximum bandwidth for network anomalies, together with an analysis of accuracy vs. latency. The authors should also revise their results (i.e., Figure 4, Google Speech Commands), since quantization is expected to cause a drop in accuracy, yet the figure shows that the 32-bit accuracy is worse than the 8-bit and 4-bit accuracy.

The authors should verify their statement on line 245, as there are papers demonstrating the implementation of 4-bit quantization on microcontrollers for character recognition.

Author Response

Thank you for the review. This is our first paper, and we made the mistake of uploading a version to the system that does not contain any corrections. Therefore, we will only be able to include the changes in the second round of review.

Comments 1: Microcontrollers

Our study does not focus on microcontrollers but on dedicated ASICs. To measure the amount of silicon occupied by the neural network, its layers and mathematical operations were synthesized; that is, a circuit was created that physically implements these operations using logic gates. Therefore, we did not describe aspects related to microcontrollers. We indicated a few example applications of neural networks on microcontrollers only to show the typical sizes of networks used in such applications.

Comments 2: More details about datasets

It is true that the datasets have not been described in great detail, but there is a deliberate concept behind this. Our research focuses on the impact of reducing the size of neural networks, and of quantizing them, on the results. We do not focus on solving the research problems represented by the datasets; we use them to show the relationships related to scaling. Our descriptions convey that, for example, anomaly detection is a problem of classifying 41 features into 20 classes, and this is the most important information in the context of reducing the network. Of course, if our concept is wrong, we will supplement the descriptions with additional details.

Comments 3: Lacking AI metrics

Our research focused on the impact of network scaling on accuracy; therefore, the paper describes the impact on this metric only.

Comments 4: Details of power, bandwidth...

Good observation. Adding information about the power consumption of the ASIC and its bandwidth would significantly improve the quality of the publication. Unfortunately, there is a problem with implementing this part: it requires either expensive professional simulation software or access to a laboratory station with FPGA systems that would emulate the ASIC. As novice researchers, we cannot obtain the necessary tools, so we are not able to complete this part. The point is absolutely correct, though, and we will keep it in mind in future studies.

Comments 5: 32-bit accuracy is worse than 8-bit and 4-bit

At first glance, this looks like a mistake, but the value is the median accuracy over 8 training runs. During our research, we noticed that the training results were very unstable: our networks had fewer than 3k parameters and were very prone to falling into local minima. The individual best accuracy results for 32 bits and 4 bits differ in favor of the higher precision, but the 4-bit networks had a smaller spread of results, so their median is better. We will attach the results of a few training runs to better illustrate this issue.
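
To illustrate the effect (with hypothetical accuracy values, not the actual experimental data), a higher median can coexist with a lower best result when the spread of results differs:

```python
import statistics

# Hypothetical accuracies from 8 training runs; illustration only.
acc_fp32 = [0.71, 0.73, 0.74, 0.76, 0.78, 0.80, 0.84, 0.86]  # wide spread
acc_int4 = [0.78, 0.79, 0.79, 0.80, 0.80, 0.81, 0.81, 0.82]  # narrow spread

# fp32 wins on the best single run (0.86 vs 0.82), but int4 wins on
# the median (0.80 vs 0.77), matching the phenomenon described above.
print("fp32: best =", max(acc_fp32), " median =", statistics.median(acc_fp32))
print("int4: best =", max(acc_int4), " median =", statistics.median(acc_int4))
```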

Comments 6: Statement on line 245

Of course, there are publications about 4-bit quantization; however, there are fewer of them than about 8-bit. This is mainly because current market solutions support mainly 8-bit (e.g., Nvidia graphics cards, Intel NPUs, the Synopsys ARC NPX NPU IP). Some solutions also offer support for 4-bit models (e.g., the Cadence Neo NPU), but they are in the minority. Therefore, we believe that research on lower-precision quantization should be continued; perhaps it will influence the companies creating these products.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors
  1. What model downsampling techniques were primarily tested for their impact on accuracy in this study?
  2. How does 4-bit quantization limit the silicon area footprint?

Author Response

Thank you for the review. This is our first paper, and we made the mistake of uploading a version to the system that does not contain any corrections. Therefore, we will only be able to include the changes in the second round of review.

Comments 1: What model downsampling techniques were primarily tested for their impact on accuracy in this study?

Response 1: During the research, we tried both removing network layers and reducing the parameters of individual layers. Removing network layers had a very negative effect on the training results, so we settled on reducing individual layers. The specific changes are described in Table 1: mainly a reduction in the number of filters for the CNN layers and, for the network for the NSL-KDD problem, increasing the depth of the network while reducing the number of neurons in each layer; a sketch of both directions is shown below.
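
As a hedged illustration of these two downscaling directions in PyTorch (the exact layer sizes below are invented for the sketch, not taken from Table 1):

```python
import torch.nn as nn

# Direction 1: keep the CNN structure, shrink the number of filters.
cnn_small = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),   # e.g. 32 filters -> 8
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # e.g. 64 filters -> 16
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
)

# Direction 2 (NSL-KDD-style): a deeper but narrower MLP,
# 41 input features -> 20 classes, staying in the ~3k-parameter regime.
mlp_deep_narrow = nn.Sequential(
    nn.Linear(41, 24), nn.ReLU(),
    nn.Linear(24, 24), nn.ReLU(),
    nn.Linear(24, 20),
)

print(sum(p.numel() for p in mlp_deep_narrow.parameters()))  # 2108 parameters
```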

Comments 2: How does 4-bit quantization limit the silicon area footprint?

Response 2: The silicon occupancy study concerns an ASIC in which each layer and mathematical operation is physically built from logic gates. Reducing the network coefficients to 4 bits allowed us to shrink the multiplier modules, the number of flip-flops used, and the logic responsible for the activation functions; a sketch of what this means for the stored coefficients is given below.
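
A rough sketch of the coefficient-storage side of this (a generic symmetric quantization scheme, not necessarily the exact one used in the paper; in hardware, multiplier area additionally tends to grow superlinearly with operand width, so narrower operands also shrink the arithmetic logic):

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization of float weights to `bits`-bit ints."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit, 127 for 8-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

w = np.random.randn(3000).astype(np.float32)    # ~3k weights, like the models here
q4, s4 = quantize_symmetric(w, bits=4)

# Coefficient storage: 32-bit floats vs (packed) 4-bit integers.
print("fp32:", w.size * 32 // 8, "bytes")       # 12000 bytes
print("int4:", w.size * 4 // 8, "bytes")        # 1500 bytes
```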

Reviewer 3 Report

Comments and Suggestions for Authors

This study tested the effect of model downscaling techniques on accuracy. The main idea was to reduce neural network models to 3k parameters or less. The tests were conducted on three different neural network architectures in the context of three separate research problems, modeling real tasks for small networks. However, there are some questions:

1.      The Introduction should be revised thoroughly. Some core studies have not been included.

2.      Section II is too brief to understand. Please give more details on the study.

3.      Section III is too short. I cannot grasp the core meaning of the network training pipeline.

4.      Section IV presents the network tests. The tests are one-sided.

Author Response

Thank you for the review. This is our first paper, and we made the mistake of uploading a version to the system that does not contain any corrections. Therefore, we will only be able to include the changes in the second round of review.

Comments 1: The Introduction should be revised thoroughly. Some core studies have not been included.

Response 1: Of course, we will improve it. The work indeed lacks references to more similar solutions, and the publications cited should be newer. We hope that describing two newer publications will be sufficient to present the context of our research. Unfortunately, we will only be able to post the file with corrections in the next round of review.

Comments 2: Section II is too brief to understand. Please give more details on the study.

Response 2: It is true that the datasets have not been described in great detail, but there is a deliberate concept behind this. Our research focuses on the impact of reducing the size of neural networks, and of quantizing them, on the results. We do not focus on solving the research problems represented by the datasets; we use them to show the relationships related to scaling. Our descriptions convey that, for example, anomaly detection is a problem of classifying 41 features into 20 classes, and this is the most important information in the context of reducing the network. Of course, if our concept is wrong, we will supplement the descriptions with additional details.

Comments 3: Section III is too short. I cannot grasp the core meaning of the network training pipeline.

Response 3: We will describe the stages of our research in more detail. We think that the improved description, together with the diagram, will allow a good understanding of the course of the experiments.

Comments 4: Section IV presents the network tests. The tests are one-sided.

Response 4: Could you please explain what you mean by one-sided? The research used quantization-aware training (QAT) and knowledge distillation, and for each of them various training parameters were also tested, such as the learning rate, the optimizer, and the alpha for distillation (see the sketch of the distillation loss below). We calculated medians over all results, and the best results were included in the paper. By improving Section 2 of the publication, we will add information about the tested parameters; perhaps then this will be more visible.
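
A minimal sketch of a knowledge-distillation loss with the alpha parameter mentioned above (the standard Hinton-style formulation; the paper's exact loss may differ), in PyTorch:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    # Hard-label term: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    # Soft term: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # alpha trades off learning from the labels vs. from the teacher.
    return alpha * hard + (1.0 - alpha) * soft
```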

Reviewer 4 Report

Comments and Suggestions for Authors

The article deals with the topic of neural networks, which is currently at the peak of scientific interest. Specifically, the authors propose quantizing the weights to 8 (or 4) bits, which leads to a reduction in the size of the whole network. The authors try to show that, at the same time, the high accuracy of the results will be preserved.

On the technical side, the paper has serious shortcomings and needs significant additions to several chapters.

The introduction is too short and does not sufficiently describe the motivation for addressing the topic, nor its possible use in practice.

A Related Works section is completely missing; there the authors should present about 15-20 scientific papers that address similar issues.

The methodology is presented in an accessible way, including the formulas used. However, the design of the actual solution is interspersed with the results chapter, so it is necessary to clearly delimit the "Solution design" chapter from the "Results" chapter.

The paper has no "Discussion" chapter, in which the authors would compare their results with the results of related works, and no "Conclusion" chapter, in which the authors would summarize the whole research, highlight the most important results, and present their plans for future research.

Figures and graphs are rather simple, but acceptable.

The number of references is insufficient, only 16, and the years 2023-24 are completely missing. At least 30 new references should be added, of which at least 10 should be from the years 2022-24.

Author Response

Thank you for the review. This is our first paper, and we made the mistake of uploading a version to the system that does not contain any corrections. Therefore, we will only be able to include the changes in the second round of review.

Comments 1: The Introduction is too short, and a Related Works section is missing.

Response 1: We had combined these two parts into one section, so it seemed sufficient to us. However, the comment is very apt: they should indeed be separated and, above all, supplemented with a larger number of similar scientific papers. We will correct this in the next round of review.

Comments 2: "Solution design" and "Results"

Response 2: Since we are describing a study that had several separate aspects, we decided to create a separate chapter for each of them. We agree that it would be better to emphasize which chapters contain results. Would adding the word "results" to the titles of the chapters that contain results be sufficient?

Comments 3: "Discussion" and "Conclusion"

Response 3: We will supplement the work with a discussion and separate it from the conclusions.

Comments 4: The number of references is insufficient.

Response 4: While expanding the chapters according to the comments, we will cite a larger number of papers. We will make sure that these are mainly newer publications.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

I understand the authors' research intentions; however, their results are not validated through experiments. They were able to design the network and test it on a development machine (I assume a PC at this stage) without performing any tests in the final environment (with limited resources: memory, clock frequency) to prove the operation of the final concept.

If the results are so good, why not make the effort and upload the implementation on an embedded system and show the amount of memory used, the clock requirements for a given system response requirement. 

It is good to have network scaling as a metric while trying to maintain accuracy. However, it is more important to have a valid model, one whose results you can rely on or trust. That is why I find it important to compare the accuracy of the proposed model with that of other researchers and to present the results obtained on the test data against those of other researchers who used the same dataset.

Author Response

Comments 1: Experiment validation

Reply 1: Thank you very much for your comment. I have placed a description and a table at the end, containing the silicon area occupied by the model for the different quantizations. The network synthesis was performed for a 1 GHz clock, and the system is able to work with such a clock. The Fusion Compiler synthesis tool offers post-synthesis validation of the system, which showed that it needs 18 cycles to process one input. The mathematical operations within a layer are done in parallel (the model has only 3 hidden layers), so the system works very fast. Theoretically, with a slow clock, one cycle per layer would be enough, but because the clock speed is high, there must be intermediate buffers in the system so that there are no problems with signal timing. I hope that my changes address your comment.
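
As a back-of-the-envelope reading of these numbers (assuming one input is processed per 18-cycle pass, with no overlap between consecutive inputs): at a 1 GHz clock, 18 cycles correspond to 18 ns of latency per input, i.e., up to roughly 55 million inputs per second.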

Reviewer 3 Report

Comments and Suggestions for Authors

(1)    In line 27, “CNN” should be changed to its full name.

(2)    Figure 1 needs to be centered.

(3)    The number of references is too small and needs to be increased; relevant literature needs to be cited, such as “A heterogeneous streaming vehicle data access model for diverse IoT sensor monitoring network management”, “A heterogeneous access meta-model for efficient IoT remote sensing observation management: Taking precision agriculture as an example”, and “A heterogeneous key performance indicator metadata model for air quality monitoring in sustainable cities”.

Author Response

Comments 1: Line 27 “CNN” should be changed to full name.

Response 1: Changed to Convolutional Neural Network

Comments 2: Figure 1 needs to be centered.

Response 2: Done

Comments 3: Number of references is too small

Response 3: This comment coincided with a remark from another reviewer. I significantly increased the number of citations (to 40) and made sure to include publications newer than 2022. Thank you very much for the citation suggestions; unfortunately, they do not adequately address the issue described here, but they are worth remembering in the context of future research projects. I hope that my corrections have resolved your comment.

Reviewer 4 Report

Comments and Suggestions for Authors

Most of my comments were solved, but one major comment remains unsolved.

My original major comment:

The number of references is insufficient, only 16, and the years 2023-24 are completely missing. At least 30 new references should be added, of which at least 10 should be from the years 2022-24.

Your progress:

You increased the number of references from 16 to 17, so my major comment remains unsolved.

I expect at least 40-50 references, including 10 from the years 2022-24.

Author Response

Comments 1: The number of references is insufficient

Response 1: I increased the number of citations to 40 and made sure to include more recent publications; more than 10 of them are from 2022 or later. I hope that my corrections are sufficient and that the work now better presents the current state of the art.

Round 3

Reviewer 1 Report

Comments and Suggestions for Authors

I want to thank the authors for following the previous recommendations and trying to improve their paper.

After the third review, I still find the same issues in the research approach: there are not enough arguments to justify the results. I think the authors are on the right path, but more effort is needed in developing their model. For instance, the authors have compared their model for Google Speech Commands against TinySpeech-X. However, the authors of [35] also proposed TinySpeech-Z, which has only 2.7K parameters and an accuracy of 92.4%. These values are better than those obtained by the authors (accuracy of 84.5%, 2.8K parameters).

For CIFAR-10, the model has low accuracy compared to what has been reported in the literature. For instance, ThriftyNet [24] achieved 90% accuracy with 40K parameters. Moreover, in the ThriftyNet article [24], tests were also performed for lower numbers of parameters, and the results obtained there have better accuracy than what the current paper is proposing.

I understand the need for downsizing and for the models to require fewer resources while having low power consumption. However, the obtained models have to be reliable and provide good accuracy.

Author Response

Comments 1: ThriftyNet performed better in the original publication

Response 1: Thank you very much for this comment. Indeed, the training results for this network came out worse than in the original publication of the architecture. We therefore repeated the training once again and obtained a median result over eight runs of 84.4%. Of course, the median is lower than the best result because of the spread of the results; the best single result was 89.2%, which is very close to the original authors' results. This also improved the results of the quantized models, although at 4 bits the model still achieves mediocre results. Improving the experiments related to this model turned out to be very important.

Comments 2: Small and better models (like TinySpeech-Z)

Response 2: We agree that there are small models that achieve better results than those described in our publication. However, our goal was not to create the best model for a given problem but to show the impact of downscaling and quantization on model accuracy, and to present the benefits of this process. Better small models (e.g., TinySpeech-Z) are based on different architectures than ours, which is why they can scale better and achieve better results. Our models achieved the results they did, and they cannot be significantly improved without changing the architecture itself, which essentially boils down to creating a new model. From another perspective, accuracy above 80% is often sufficient in edge, embedded, IoT, and networked devices. Certainly, the field of TinyML will continue to develop, and over time better and better miniature models will appear.

Reviewer 4 Report

Comments and Suggestions for Authors

All my comments were solved; I am OK with this version of the article.

Author Response

Thank you for accepting the publication.

Round 4

Reviewer 1 Report

Comments and Suggestions for Authors

Based on the authors' responses, the final paper would benefit from the following observations:

 - In the results section, the authors should present a table comparing their results against those obtained in the literature, displaying the relevant information (e.g., accuracy, number of parameters, memory used, type of architecture or teaching method, quantization). A table of this kind should be made for each of the three models. Adding the tables will help the reader better understand the pros and cons of the technique.

 - If the purpose of the models is to be used at the edge, then some requirements should be provided: power consumption, response time, and model accuracy. Regarding power consumption, I can understand that not every article focuses on this sort of measurement; however, the response time and accuracy of the model should be provided, at least as threshold levels based on the literature of the last five years.

Author Response

Comments 1: A table comparing their results against the results obtained in literature

Response 1: I have included three tables in the publication, below the graphs, containing a summary and a comparison with models from the literature. To make the comparison meaningful, I report there not the median but the best result for our models, so that they can be compared with other publications; this also illustrates the problem of the spread of results for small networks. The tables also show the amount of memory needed for each network. I think they demonstrate the benefits of quantization very well.

Comments 2: Network latency

Response 2: This comment required some work. I decided that the best measure for estimating the speed of the individual networks would be to determine their number of MAC (multiply-accumulate) operations. For each network, I counted the operations occurring in the layers and presented them in a table. The speed of operation then depends only on the hardware used: some RISC microcontrollers or specialized DSPs need one clock cycle for an entire MAC operation, while others need several. Therefore, I cannot say specifically how long the network inference will take, but the information about the number of MACs is sufficient to estimate the operating time for any device; a sketch of such counting is given below.
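
A minimal sketch of MAC counting per layer (the layer shapes below are invented for illustration, not taken from the paper's actual architectures):

```python
def conv2d_macs(out_h, out_w, out_c, k, in_c):
    """MACs in a standard convolution: one k*k*in_c dot product per output value."""
    return out_h * out_w * out_c * k * k * in_c

def dense_macs(n_in, n_out):
    """MACs in a fully connected layer."""
    return n_in * n_out

# Hypothetical tiny CNN: two conv layers plus a classifier head.
total = (conv2d_macs(32, 32, 8, 3, 3)      # 221,184
         + conv2d_macs(16, 16, 16, 3, 8)   # 294,912
         + dense_macs(16, 10))             # 160
print("total MACs:", total)                # 516,256

# On a core that retires 1 MAC per cycle at 100 MHz, inference would take
# roughly total / 1e8 seconds, i.e. about 5.2 ms for this sketch.
```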

Round 5

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have implemented all the observations.
