LW-ViT: The Lightweight Vision Transformer Model Applied in Offline Handwritten Chinese Character Recognition
Round 1
Reviewer 1 Report
Having large numbers of parameters, being computationally-intensive and needing large volumes of memory have been mentioned as disadvantages for both transformer-based methods (Abstract) and CNN-based methods (Line 53, Line 68). However, in the abstract, the authors state that these disadvantages make transformer-based methods less suitable for mobile platforms compared to CNNs. Please resolve this issue. In the current presentation of the paper, the sentence beginning at Line 70 does not clarify the point.
After Line 72, the discussion is suddenly aborted. The authors state a problem, but they do not talk about their solution(s)! The reader will be shocked and confused until they reach section 3.
Figures 4 and 5 are just dropped without any explanation. How should a reader identify the categories from Figure 4? How should they see the diversity in Figure 5? There might be many more questions as well.
Table 1: "Total number of dataset 104105", what does it mean?
Subsection 4.2: The authors argue that "In the course of conducting the experiments, the hyperparameters were modified several times in order to find the best recognition accuracy of the model,....". What about other metrics like precision, F1-Score and recall? Is there any tradeoff? Are you losing something to achieve accuracy? Have you measured? Can you provide the values in Table 3?
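As an illustration of the request above, the additional metrics could be computed alongside accuracy roughly as follows. This is only a minimal sketch, assuming per-sample true and predicted labels are available from the test run; the function name `macro_metrics` and the toy labels are hypothetical and are not taken from the manuscript.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall and F1.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was the truth but was missed

    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with two character classes (illustrative labels only).
result = macro_metrics(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Reporting these values next to accuracy in Table 3 would show whether tuning for accuracy came at the cost of per-class precision or recall.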
There are so many writing and grammar issues in the text. Some of them are mentioned below.
Title
I suggest the following modification. "A Lightweight Vision Transformer Model Applied to" --> "A Lightweight Vision Transformer Model for Application in"
Abstract
Regarding the sentence "To address these issues, a lightweight vision transformer (LW-ViT) model has been proposed with the aim of reducing the complexity of the transformer-based model", "LW-ViT" is not a standard term. It is used by the authors to name their method. So, the sentence should be modified as "To address these issues, a Vision Transformer (ViT) model named LightWeight Vision Transformer (LW-ViT) model has been proposed with the aim of reducing the complexity of the transformer-based model".
"The inspiration came from MobileViT, which reduces the number of parameters in the model by optimizing the MobileViT structure by reducing the number of transformer blocks and MV2 layer", the sentence is quite vague. Please consider rewriting.
Introduction
Line 28: "Handwritten Chinese character recognition, as a research hotspot in the field of pattern recognition, is a challenging task [3]." --> A statement like this requires more recent references indicating recent ongoing research around the topic. Moreover, since this is the first occurrence of Handwritten Chinese Character Recognition, you can introduce the abbreviation HCCR like this: "Handwritten Chinese Character Recognition (HCCR)". Then, the sentence beginning from Line 22: "Research into handwritten Chinese character recognition (HCCR)..." can be modified as "Research into HCCR...". Please note that capital letters should be used (in this case and all cases related to abbreviations). For example, "Handwritten Chinese Character Recognition" instead of "Handwritten Chinese character recognition" and "Artificial Intelligence (AI)" instead of "artificial intelligence (AI)".
Line 30: "Secondly, the complex structure of many Chinese characters with many strokes", this is an incomplete sentence.
In the sentence beginning from Line 33 and ending at Line 36, the word "respectively" is not needed.
"a large number of researchers and scholars [7,8]", more recent references are required.
"smoothing and denoising", can you give a reference?
Line 40: CNN is quite standard for Convolutional Neural Network. You can introduce it here and use it in the rest of the text: "Convolutional Neural Networks (CNNs)".
Line 50: "handwritten Chinese character recognition (HCCR)", the abbreviation is being defined for the second time.
Line 52: "Today, an increasing number of deep learning models [12] are being used to improve the performance of HCCR" --> "Today, an increasing number of deep learning models are being used to improve the performance of HCCR [12]."
Line 66: "Unlike convolutional neural networks (CNN),", as mentioned in my previous comments, the abbreviation should have been introduced earlier in the text.
Figures
The authors refer to Fig.1, Fig.2, etc., while the captions mention Figure 1, Figure 2, etc. Please be consistent.
To achieve better quality, I suggest that all figures be redone using a more professional tool.
Section 4
The title of this section is grammatically incorrect ("Experimental and Results"). I suggest "Experiment Methodology and Results" or something like that.
Tables
"The complete experimental results are shown in Table. 4.3": what is Table 4.3? What is Table 3?
Author Response
Please see the attachment.
Author Response File:  Author Response.pdf
Reviewer 2 Report
I find the post interesting and useful.
1. What steps are necessary (possible) to take for the most optimal deployment of the model on mobile devices?
2. Did the experiments also take place on another configuration (platform), or only on the one defined on lines 198-202?
3. Table 2 contains the adjusted parameters for the experiment. Is it possible to clearly see the original parameters and modified parameters somewhere?
4. The entire article is easy to understand, even for a less specialized professional audience. I would modify the experimental part to be even more comprehensible (one-sentence explanations of some terms) to maximize the interest and readability of the article.
I consider the used literature to be relevant to the given article.
Author Response
Please see the attachment.
Author Response File:  Author Response.pdf
Round 2
Reviewer 1 Report
I can see my comments sufficiently addressed.