Optimizing BFloat16 Deployment of Tiny Transformers on Ultra-Low Power Extreme Edge SoCs
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The submitted paper describes a methodology to reduce the size of Transformer architectures used in generative Artificial Intelligence. The authors build on existing Transformer architectures (MobileBERT, tinyViT, and tinyLLAMA2) and propose solutions to shrink their implementations by eliminating redundancies, so that they can run on hardware-resource-constrained devices, particularly Commercial Off-The-Shelf (COTS) RISC-V multi-core microcontrollers.
The paper is well written, and the main topic of research is well presented. Supported by several recent references, the introductory and related-work sections clearly present the main aspects of the area under research; the recency of the references shows that the authors are up to date with the topic. Since this is a very complex subject, with many details that can make it difficult for readers to understand exactly which improvements were made to the architectures, I would suggest a smoother transition between the related-work section, where the authors present the state of the art, and the developments in the next two sections (background and methods). If I understood correctly, it is in these two sections that the authors present their contribution to improving the Transformer architectures. I also suggest improving the titles of these two sections to make their contents clearer. The results seem to be well presented, with several studies that I am not able to analyze in detail. The conclusions are simple and summarize the results achieved in the described work.
Focusing on some issues that I would suggest the authors improve:
- To make the structure of the paper and the relationship between its topics easier to understand, I suggest the authors add a short paragraph at the end of the introductory section presenting the organization of the rest of the paper. For example: “In the next section we present… Later, in Section …, we describe… Before the conclusions, some results are discussed…”.
- A simple conceptual diagram would also be welcome in this type of paper, to clarify in a more abstract way the improvements made to the Transformer architectures. As a reader, I felt that many technical details are presented before an overall introduction to the main issues. I would suggest adding a conceptual figure before presenting the developments.
- Some figures are placed far from where they are referenced in the text. For example, line 192 references Figure 9, which only appears several pages later. In other cases, figures appear before their first reference in the text. If possible, I suggest following the general rule of referencing all figures and tables in the text before they appear in the document, and placing them near the text that references them.
- Table captions are usually placed above the tables rather than below them; bottom placement is the convention usually adopted for figures.
- Reference 12 is missing.
In conclusion, and despite the minor suggestions made above, I can say that this is an interesting, well-written paper, supported by its results.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The authors present a method and hardware/software optimizations to deploy Tiny Transformers on low-resource SoCs.
The claimed contributions are:
Proposal of a simple encoder pruning methodology to resize a generic Transformer inspired by LLM pruning methods;
The use of the Schraudolph approximation, along with parallelization, to accelerate the computation of the softmax functions;
Matrix multiplications optimized for a SIMD RISC-V architecture using hardware loops, loop unrolling, BFloat16 packing, and data tiling.
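For readers unfamiliar with the Schraudolph approximation mentioned in the second contribution, the following is a minimal C sketch of how it can replace exp() inside a softmax. The constants and function names are illustrative assumptions, not taken from the authors' library:

```c
#include <stdint.h>
#include <stddef.h>

/* Schraudolph (1999) fast exponential: writing i = A*x + B into the
 * bit pattern of an IEEE-754 float makes the exponent field encode
 * approximately e^x. For float32, A = 2^23 / ln(2) ~= 12102203 and
 * B = 127 * 2^23 - C, where C = 486411 is a bias term that reduces
 * the approximation error (relative error on the order of a few %). */
static inline float fast_exp(float x) {
    union { float f; int32_t i; } u;
    u.i = (int32_t)(12102203.0f * x) + 1064866805; /* 127*2^23 - 486411 */
    return u.f;
}

/* Softmax over n scores. Subtracting the running maximum keeps the
 * argument of fast_exp in a numerically safe range, exactly as in a
 * standard max-subtracted softmax. */
static void softmax_fast(const float *x, float *y, size_t n) {
    float m = x[0];
    for (size_t i = 1; i < n; i++)
        if (x[i] > m) m = x[i];
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        y[i] = fast_exp(x[i] - m);
        sum += y[i];
    }
    for (size_t i = 0; i < n; i++)
        y[i] /= sum;
}
```

On a multi-core cluster such as the one targeted in the paper, the three loops would additionally be split across cores; that parallelization, which the contribution refers to, is omitted in this single-core sketch.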
The paper is well written and quite well organized. The goal of the research presented in the paper is interesting but some points need to be enhanced or clarified.
line 144: "an efficient design that maintains an efficiency comparable to the baseline model" How is the "efficiency" defined here? Maybe it is a typo and the correct word would be "accuracy"?
Figure 6: I have trouble understanding what the triangles in the Ai matrices actually represent. Do they show the memory organization of the results produced by each core? If so, it seems to me that the cost of computing the addresses would not be negligible.
line 362: "we find that a brief fine-tuning phase on the entire new architecture consisting of a few training epochs in our chosen architectures .... can effectively restore accuracy to a level comparable to the original model." This sentence is too imprecise: which dataset is used, and with what batch size? What does "comparable" mean here: a difference of 1%, 5%, 10%, or more? Furthermore, how were the ranges of values of the learning rate and weight decay, which we see can be quite different for MobileBERT and TinyViT, obtained? Are these results reproducible?
line 387: "We compare a fully deployed kernel version on a larger L2 memory and a second version using a dynamic tiling scheme, introduced in section 4.3, to fit the L1 TCDM constraints described in section 4.1." How do you find the configuration that meets the memory size constraints: by trial and error, or by calculation?
I have some trouble understanding the choices of certain parameters, which differ between sections of the paper. For example:
In Figure 8: embedding and projection size of 512;
In Figure 10: the optimum seems to be an input sequence length of 64 and a projection size of 512;
line 435: "Input sequence length, embedding size, and single-head hidden dimension were all set to 64".
What is the methodology used to choose the right set of values?
In the caption of Figure 11: "... we compare to their int8 data precision performance." Does this mean that the precision of your library is the same as (or at least as good as) that of the int8 libraries, or simply that you have not evaluated the precision at this point?
line 462: "reduce memory requirements by 94.9% compared to the baseline model while improving latency by about 19× and keeping the accuracy error below 0.19, which can be considered an acceptable compromise considering the baseline model error is 0.084." Perhaps, but going from 0.084 to 0.19 still represents a relative increase in error of (0.19 − 0.084)/0.084 ≈ 126%.
Paragraph 5.4, "TinyViT Deployment": the conclusion seems to be that the method does not fit this classifier very well. Why is only the memory-footprint reduction given, and not the speedup factor?
There are a number of typos:
Line 127: Which number-naming system is used here, long scale or short scale? I assume the short scale, since the second number (2 trillion) does not exist in the long-scale system. In any case, and especially in scientific publications, the International System of Units recommends using metric prefixes to indicate magnitude, so these should probably read "50 giga tokens" and "2 tera training set".
line 154: "high accuracy. on IoT" — the dot after "accuracy" seems to be a typo.
lines 364-365: there is a formatting problem with the exponents.
Tables 5 and 6 and Figure 17 are misplaced.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
Explain the difference between pruning and quantization more clearly in the introduction section.
Include pseudocode to clarify the encoder pruning methodology, instead of Figure 4.
Include confusion matrices.
Discuss the variability of the power consumption.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have answered all my concerns, so I have no further requirements.