Peer-Review Record

VisFormers—Combining Vision and Transformers for Enhanced Complex Document Classification

Mach. Learn. Knowl. Extr. 2024, 6(1), 448-463; https://doi.org/10.3390/make6010023

by Subhayu Dutta¹, Subhrangshu Adhikary²,* and Ashutosh Dhar Dwivedi³

Reviewer 1: Anonymous

Reviewer 2: George E. Tsekouras

Reviewer 3: Anonymous

Submission received: 9 December 2023 / Revised: 9 February 2024 / Accepted: 14 February 2024 / Published: 16 February 2024

(This article belongs to the Section Visualization)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The article discusses VisFormers, an approach that combines Vision and Transformers for Enhanced Complex Document Classification. The paper is nicely structured and prepared. It contains some fascinating ideas, but some remarks are to be mentioned.

More specific index terms must extend a list of keywords.

The paper’s structure description is missing in the Introduction.

The literature review in its current form needs to be appropriately structured and presented.

It is highly recommended that the theoretical background should be extended.

The paper needs to provide more contributions to the literature. The advantages and disadvantages of your method compared to similar approaches need to be clarified.

Please share a code (add a link to the developed software) to reproduce the system.

A separate section for the limitations of the approach should be added.

Comments on the Quality of English Language

Some moderate English corrections are required.

Author Response

Response: The authors are very thankful to the esteemed reviewer for spending time evaluating the manuscript and finding it interesting. Pointwise responses to all the concerns are provided below, and the necessary updates are highlighted in blue text in the revised manuscript.

More specific index terms must extend a list of keywords.

Response: The authors agree with the respected reviewer that more specific index terms should be used. Accordingly, the list of keywords has been updated.

Changes: VGG19, Transformers, Multi-Headed Neural Network, Optical Character Reader, Complex Document Classification

The paper's structure description is missing in the Introduction.

Response: The authors have revised the manuscript by adding the paper's structure in the Introduction.

Changes: See section 1 para 8.

The literature review in its current form needs to be appropriately structured and presented.

Response: The authors appreciate the reviewer's comment and have restructured the literature review accordingly. The revised section now follows a logical and organised flow of information.

Changes: See section 2, para 4 & 5.

It is highly recommended that the theoretical background should be extended.

Response: The authors agree with the esteemed reviewer's comment to extend the study's theoretical background. Accordingly, a detailed mathematical breakdown of the study has been added to the revised manuscript.

Changes: See section 3.5.

The paper needs to provide more contributions to the literature. The advantages and disadvantages of your method compared to similar approaches need to be clarified.

Response: The authors are thankful for the reviewer's insights. It is essential to analyse the contribution of the work compared to the literature. Accordingly, multiple sections of the manuscript are revised.

Changes: See section 2 para 4 & 5, table 3, table 4, table 5 and section 5.3.

Please share a code (add a link to the developed software) to reproduce the system.

Response: The authors have accordingly shared the codes for the experiment in the revised manuscript.

A separate section for the limitations of the approach should be added.

Response: The authors agree with the respected reviewer that a separate section for limitations should be included as this will make it easier for the researchers to improve the work in the future. Accordingly, in the revised manuscript, a section has been added that discusses the approach's limitations.

Changes: See section 5.4.

Reviewer 2 Report

Comments and Suggestions for Authors

1) The mathematical structure of the proposed approach is missing. What is the objective function when combining transformers with VGG16? What is the optimizer? What are the parameter values of the optimizer (I presume it is the Adam optimizer)? Some things have to be shown in a research paper.

2) Why did the authors decide to insert the VGG’s output in the next-to-last dense layer in Fig. 2?

3) A rigorous comparative analysis in terms of inference statistics is missing. The authors must provide the statistical analysis of their results.

Comments on the Quality of English Language

See above

Author Response

The mathematical structure of the proposed approach is missing. What is the objective function when combining transformers with VGG16? What is the optimiser? What are the parameter values of the optimiser (I presume it is the Adam optimiser)? Some things have to be shown in a research paper.

Response: The authors thank the esteemed reviewer for the comments suggested to improve the manuscript. All comments were carefully considered while revising the manuscript, and the updated text is highlighted in blue in the manuscript.

The authors appreciate the concern of the respected reviewer that a mathematical structure of the proposed approach needs to be included in the paper. Accordingly, in the revised paper, a thorough mathematical analysis of the algorithm has been provided.

Further, the authors would like to clarify that the proposed model has three objective functions. The first one is the transformer head, responsible for text processing; the second one is the VGG-19 head, responsible for processing the visual information of the documents; and the third one is the tail of the network, which combines the two heads.

The three objective functions do not share a single optimiser. The VGG-19 head is trained with the root mean square propagation (RMSprop) optimiser; the target for this network was the flattened vector of the image after resizing it to 8×8 pixels. The weights in the transformer head, as well as in the tail of the network, were updated with the Adam optimiser.
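For readers unfamiliar with the two update rules named above, the following is a minimal sketch of the textbook RMSprop and Adam steps applied to a single scalar weight. It is not the authors' training code: the learning rate, decay constants, the toy loss f(w) = w², and the helper names `rmsprop_step`, `adam_step` and `minimise` are all assumptions chosen for illustration, since the response does not state the hyperparameter values.

```python
import math

def rmsprop_step(w, grad, state, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSprop update for a single scalar weight (assumed hyperparameters)."""
    # Exponential moving average of squared gradients.
    state["v"] = rho * state["v"] + (1.0 - rho) * grad * grad
    return w - lr * grad / (math.sqrt(state["v"]) + eps)

def adam_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight (assumed hyperparameters)."""
    state["t"] += 1
    # Moving averages of the gradient and of the squared gradient.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialised averages.
    m_hat = state["m"] / (1.0 - beta1 ** state["t"])
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

def minimise(step, state, w=1.0, n_steps=200, **kwargs):
    """Minimise the toy loss f(w) = w**2 (gradient 2*w) with the given rule."""
    for _ in range(n_steps):
        w = step(w, 2.0 * w, state, **kwargs)
    return w

# Hypothetical usage: both rules move the weight from 1.0 towards the minimum at 0.
w_rms = minimise(rmsprop_step, {"v": 0.0}, lr=0.01)
w_adam = minimise(adam_step, {"t": 0, "m": 0.0, "v": 0.0}, lr=0.01)
```

The practical difference is that Adam adds a bias-corrected moving average of the gradient itself, while RMSprop only rescales the raw gradient by a running estimate of its magnitude.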

All of these points have been included in the revision of the manuscript.

Changes: See section 3.5.

Why did the authors decide to insert the VGG's output in the next-to-last dense layer in Fig. 2?

Response: The authors appreciate the concern of the respected reviewer. The authors would like to clarify that the transformer and VGG-19 networks were used as different objective functions, each with its own weight update rule (RMSprop for VGG-19 and Adam for the transformer). The output from the transformer head is a vector of 16 elements, each representing a different class. The output of the VGG-19 head, on the other hand, is a reduced feature map of 64 elements, which is intended to complement the output from the transformer head. Because the final decision must be made from these 64 nodes of the VGG-19 network together with the 16 nodes of the transformer network, one dense layer and an output layer had to be added. All these points have been considered while revising the manuscript.
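The combination step described in this response can be sketched as follows: the 16-element transformer output and the 64-element VGG-19 output are concatenated into one 80-element vector, which passes through one dense layer and an output layer. This is an illustrative sketch only, not the authors' implementation. The hidden width `N_HIDDEN`, the ReLU and softmax activations, the random placeholder weights and all function names are assumptions that the response does not specify.

```python
import math
import random

N_CLASSES = 16   # size of the transformer-head output, per the response
N_VISUAL = 64    # size of the VGG-19-head output (an 8x8 map, flattened)
N_HIDDEN = 32    # hypothetical width of the added dense layer

def dense(x, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(x):
    peak = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Random placeholder weights; a trained model would learn these."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def fuse(text_out, visual_out, hidden_layer, output_layer):
    """Concatenate both head outputs, then apply the dense and output layers."""
    joint = list(text_out) + list(visual_out)   # 16 + 64 = 80 values
    hidden = relu(dense(joint, *hidden_layer))
    return softmax(dense(hidden, *output_layer))

rng = random.Random(0)
hidden_layer = init_layer(N_CLASSES + N_VISUAL, N_HIDDEN, rng)
output_layer = init_layer(N_HIDDEN, N_CLASSES, rng)

# Stand-ins for the two head outputs of a single document.
text_out = softmax([rng.gauss(0.0, 1.0) for _ in range(N_CLASSES)])
visual_out = [rng.random() for _ in range(N_VISUAL)]

probs = fuse(text_out, visual_out, hidden_layer, output_layer)
predicted_class = max(range(N_CLASSES), key=probs.__getitem__)
```

Under these assumptions the tail sees both the per-class scores from the text head and the compressed visual features, so the added dense layer can weigh one against the other before the final 16-way decision.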

Changes: See section 3.5.

A rigorous comparative analysis in terms of inference statistics is missing. The authors must provide the statistical analysis of their results.

Response: The authors are thankful to the reviewer for this comment. A detailed statistical comparative analysis for the results has been added accordingly.

Changes: See table 4, table 5 and sections 5.2 and 5.3.

Reviewer 3 Report

Comments and Suggestions for Authors

There are currently a lot of programs used by companies, enterprises, and units whose main goal is to optimize the document circulation management system and ensure efficient control of the workflow and information of fairies.
These IRF, ERP, CRM, ... software packages analyze and process all entered data (recordings, scanned documents, photos, companies, email) and create one document in the selected structure.
I think that the impact of the described task on the quality of existing software on the market should be clearly presented.
To what extent do the results discussed in the article affect the software currently functioning on the market?

The idea behind this kind of program is to limit the circulation of traditional paper documents as far as possible and to bring order to the information transfer process. Thanks to the document circulation procedures, each case is handled in a planned, systematized way that is appropriate for the enterprise. This is a very effective solution for managing a construction investment and collecting documents in the BIM system for the stage of using a building or structure.
The question arises: how does the solution described in the article improve, and can it upgrade, programs already existing on the market?

I think that in the conclusions of the article, you can refer to a wider range of comparison of work results with the results of current programs on the market - and this will better illustrate the described issues.

Author Response

