Peer-Review Record

Receipt Recognition Technology Driven by Multimodal Alignment and Lightweight Sequence Modeling

Electronics 2025, 14(9), 1717; https://doi.org/10.3390/electronics14091717
by Jin-Ming Yu 1, Hui-Jun Ma 2,3,* and Jian-Lei Kong 2,3,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 4 March 2025 / Revised: 7 April 2025 / Accepted: 15 April 2025 / Published: 23 April 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Thank you for inviting me to review this manuscript. The title is "Financial Document Detection and Recognition Technology Driven by Multimodal Alignment and Lightweight Sequence Modelling". The research is up to date, and the findings and discussion are useful for those working in the same field. I have some suggestions and observations that I would like to share with the authors as follows:

Abstract

In line 13, the authors could change the present continuous "are facing" to the present tense.

The abstract is well written.

The authors could add some practical implications at the end of the abstract.

The authors may reduce the list of keywords to five, and the full forms should also be provided for CLIP and BiGRU.

Introduction

The first sentence of the introduction overlaps with the beginning of the introduction, making it less inviting. Please revise.

In line 40, please elaborate on “the complexity and intricacy of financial management”.

Please use a weaker modality (hedging) for the argument "no longer meet the demands of modern office…" in line 42.

Some examples can be given in line 47 for “inefficient and prone to errors”.

Some paragraphs in this section are quite short; the authors could merge the related paragraphs to achieve better paragraph development.

Authors can state clearer research questions in the Introduction section.

An overall structure of the manuscript should also be provided.

Literature Review

In line 147, the authors may add references to “Early research primarily relied on traditional… layouts”.

The authors can use a table to organise the relevant and significant studies. The table has several columns, such as authors, years, methodology, objectives and key findings. The table can be arranged in chronological or thematic order. Authors can place their current study in the last row so that readers can easily see the development of the topic in the field.

In line 179, some references should be added for "researchers have sought breakthroughs...".

A clear research gap can be included at the end of the literature review.

Methodology

An orientation paragraph can be added between section 3 and 3.1.

Authors can state the research approach before introducing the research process.

For Figure 1, the input and output pictures are a bit small and unclear.

In line 218, authors can elaborate on “key challenges”.

In line 232, please give a definition or elaborate on what “visual and textual semantics” are in your study.

In line 254 the authors can give some examples of “semantic features”.

The authors discuss the experimental results (e.g., lines 260-264, among others) in the methodology section; it would be better if this part were moved to the later findings section.

In line 421, the authors say, “particularly in the context of financial document processing where precision is critical”. The authors can include relevant studies on financial document processing and the element of “precision” in the literature review and introduction.

The authors may replace “Below” in line 433 with “In the following”.

Rewrite this sentence in a more scientific style: “It’s like compressing a bulky encyclopedia into a portable electronic dictionary, without losing key semantics” in lines 450-451.

Experiments

In line 498, what is “financial document understanding tasks”? This phrase is slightly problematic.

Please add a short orientation paragraph between Section 4.2 and 4.2.1.

Just a minor reminder: the authors can be more consistent with the layout style for the subheadings, e.g. “Hybrid Convolution-Transformer Block.” in line 333, “FBS (Frame Rate):” in line 532.

In Table 2, “Recognition Module” and “gradient accumulation” can be separated into two lines.

Figures 6 and 7 can be slightly larger.

The authors can revise this sentence, as it contains too many ideas: "This remarkable performance…. to 0.82" in lines 575-578.

The authors can elaborate more on the term “multimodal alignment” in line 624.

Please revise this sentence in lines 631 to 633 into one complete sentence “Robustness Analysis: Performance degradation …. sacrifice generation.”

In lines 645-646, the authors mention some future research directions. Would it be better if this part were moved to the end of the conclusion?

The font size of Table 10 is much smaller than the previous tables. Please keep the layout consistent.

Conclusion

The authors may include a clearer limitation and theoretical implications in the conclusion.

 

Comments on the Quality of English Language

The language is generally fluent. However, some minor editing may be required, particularly in the preparation of the tables and figures.

Author Response

Please refer to the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors

This manuscript proposes a financial document detection and recognition framework (CLIP-BiGRU) based on multimodal alignment and lightweight sequence modeling, addressing the needs of automated processing of financial documents, especially in edge computing scenarios. The study integrates CLIP and BiGRU, combined with quantization, pruning, and distillation compression strategies, achieving model lightweighting while maintaining high accuracy. It achieved excellent performance on the CORD dataset (F1=93.1%, CER=5.1%) and real-time inference speed of 25 FPS on Jetson AGX Orin. The manuscript is well-structured, with a clear research background and problem definition. It fully compares existing methods (CTPN+CRNN, EAST+CRNN, LayoutLMv3) and verifies the comprehensive advantages of CLIP-BiGRU in terms of accuracy, model size, and inference speed, demonstrating certain technical innovation and application value. The main contributions of this paper include: (1) Cross-modal semantic correction based on CLIP, effectively reducing OCR mis-recognition (e.g., confusion between "5" and "S"), lowering CER by 1.8%; (2) Lightweight sequence modeling based on BiGRU, combined with position-aware attention mechanism, enhancing sequence modeling capability for multi-field text while reducing parameter count, lowering WER by 2.3%; (3) Achieving model lightweighting through a three-stage compression strategy (quantization, pruning, knowledge distillation), compressing the model size to 18MB, and achieving low power consumption (below 12W) and high performance (25 FPS) on edge devices.

However, the manuscript still has room for improvement in terms of clarity of technical details, completeness of methodology, depth of experimental validation, and comparative analysis with existing research. Overall, the manuscript has high research value and application potential but requires major revisions to enhance its scientific rigor and readability. Here are our suggestions for the authors:

  1. It is recommended to add a description of the CORD dataset in the Introduction section to help readers better understand the research topic of the manuscript.
  2. In the Related Work section, it is suggested to add discussions comparing other possible mainstream methods, such as the application of Transformer-based models in financial document recognition. In addition, the authors should supplement this section with the latest research related to financial document recognition, such as structured information extraction methods based on graph neural networks (GNNs) or specialized models for handwritten text (e.g., HTR systems).
  3. Although the manuscript optimizes CLIP-BiGRU for edge devices (Jetson AGX Orin), it does not provide detailed explanations for potential computational bottlenecks on different hardware architectures of edge devices (e.g., Raspberry Pi without GPU support). It is recommended to provide additional analysis and descriptions.
  4. The authors validated the model's performance on 1000 real-world receipt samples in the experimental section. However, the manuscript does not provide specific application cases in the financial industry, such as deployment scenarios in banks or tax institutions, nor does it describe integration applications with real financial systems (e.g., reimbursement review, tax declaration systems). It is suggested that the manuscript should propose potential practical application directions for the results to enhance practical value of the research.

Author Response

Please refer to the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

Comments and Suggestions for Authors

The paper introduces CLIP-BiGRU, an innovative framework for financial document detection and recognition that combines multimodal alignment and lightweight sequence modeling. By integrating CLIP (Contrastive Language-Image Pretraining) with Bidirectional Gated Recurrent Units (BiGRU), the system achieves semantic correction through optimized image and text information processing. The framework employs a three-stage compression strategy (quantization, pruning, and distillation) to achieve impressive results on the CORD dataset: a detection F1 score of 93.1% and Character Error Rate of 5.1%. Most notably, the compressed model size of just 18MB enables real-time inference speeds of 25 FPS on edge devices while maintaining power consumption under 12W, making it a highly practical solution for automated financial document processing.

 

However, I have the following concerns:

 

(1) I respectfully suggest that the term "financial document" may not accurately represent the scope of this paper. Based on the dataset used for testing, CORD specifically contains receipts, which are indeed a type of financial document but represent only a subset of the broader category. Financial documents encompass many other types, including balance sheets (which report company assets, liabilities, and owners' equity), W-2 forms, cash flow statements, and various other documents. The authors should consider revising the paper to more precisely reflect the actual scope of their work. Additionally, the evaluation's reliance on the CORD dataset (1,000 images) alone may not adequately capture the diversity of real-world financial documents. Including validation across multiple datasets would provide more robust and comprehensive findings.

 

(2) A critical question regarding new research papers such as this is how the proposed models compare to state-of-the-art (SOTA) systems such as ChatGPT and DeepSeek. Can these SOTA models process receipt images and generate accurate responses as effectively as the proposed models? The authors' comparisons appear limited to EAST+CRNN and CTPN+CRNN combinations, without evaluating against SOTA models, particularly large foundation models with multimodal capabilities.

 

(3) While the model's compact size of approximately 18MB enables edge device deployment, which is a significant advantage, the authors did not sufficiently emphasize this valuable feature. Furthermore, the paper would benefit from a more comprehensive analysis of the trade-offs between model size, accuracy, and inference speed across various edge devices.

 

(4) Although the paper presents performance metrics, it would benefit from a more comprehensive analysis of failure cases and error patterns, particularly in challenging scenarios such as heavily distorted or poorly illuminated documents.

 

(5) The methodology section contains comprehensive technical details but would benefit from clearer architectural diagrams to better illustrate the system's workflow. While Figure 1 presents impressive content, the numerous arrows connecting various elements make it difficult to follow. We suggest reorganizing the layout and applying global optimizations to enhance readability and comprehension.

Author Response

Please refer to the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

I have reviewed the revised manuscript. The authors have provided detailed revisions and responses addressing the comments raised in the previous review round. In my opinion, the concerns have been satisfactorily addressed, and the manuscript is now suitable for publication. Therefore, I recommend that the manuscript be accepted for publication in its present form.

 
