Article
Peer-Review Record

Evaluation of Prompt Engineering on the Performance of a Large Language Model in Document Information Extraction

Electronics 2025, 14(11), 2145; https://doi.org/10.3390/electronics14112145
by Lun-Chi Chen 1, Hsin-Tzu Weng 2, Mayuresh Sunil Pardeshi 3, Chien-Ming Chen 4, Ruey-Kai Sheu 5 and Kai-Chih Pai 1,*
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 19 April 2025 / Revised: 15 May 2025 / Accepted: 22 May 2025 / Published: 24 May 2025
(This article belongs to the Special Issue Techniques and Applications in Prompt Engineering and Generative AI)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The integration of OCR (Amazon Textract) with LLMs (GPT-3.5 Turbo) for key information extraction (KIE) from invoices and shipping documents addresses a real-world industrial need, demonstrating high precision (95.5% on SROIE) and adaptability to noisy, multi-format data. However, several concerns remain:

The paper’s core approach (OCR + LLM + prompt engineering) builds on existing work (e.g., LayoutLMv3, DocLLM). While the APE integration is useful, it lacks a clear theoretical or methodological breakthrough compared to prior art.

Errors in fields like 20FT CONTAINERS NO (Table 6) suggest the LLM struggles with implicit reasoning (e.g., inferring container counts from 8x20’). Add a post-processing module (e.g., rule-based checks or a small classifier) to correct frequent LLM errors.
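A minimal sketch of what such a rule-based check could look like (the field name and the 8x20’ pattern come from Table 6; the function itself is a hypothetical illustration, not the authors' pipeline):

    import re

    def fix_container_count(raw_value):
        """Hypothetical rule-based correction for the '20FT CONTAINERS NO' field.

        If the LLM returns a quantity expression such as "8x20'" or "8 X 20FT"
        instead of the count itself, extract the leading number as the count.
        """
        match = re.match(r"\s*(\d+)\s*[xX]\s*20", str(raw_value))
        if match:
            return int(match.group(1))
        return raw_value  # leave other values untouched

    print(fix_container_count("8x20'"))     # -> 8
    print(fix_container_count("8 X 20FT"))  # -> 8
    print(fix_container_count("3"))         # -> "3" (unchanged)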

The contribution of Amazon Textract vs. LLM is unclear. Does Textract alone suffice for structured fields?

While anonymization is mentioned, no details are provided for sensitive data (e.g., invoice amounts, company names).

Equations (1)–(3) lack intuitive explanations. How does APE’s resample(U_k) (Section 3.1) differ from beam search?
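To make the question concrete: one common reading of APE-style resampling is an iterative Monte Carlo search that paraphrases the current best complete instructions, rather than extending partial token sequences step by step as beam search does. A minimal sketch under that assumption (llm_paraphrase and score_prompt are hypothetical helpers, not the authors' code), which the intuitive explanation of resample(U_k) should confirm or correct:

    def ape_resample(candidates, score_prompt, llm_paraphrase, top_k=3, variants=5):
        """Keep the top-scoring instructions and resample variations of them.

        Unlike beam search, which expands partial token sequences, this resamples
        *complete* prompts by asking an LLM for paraphrases of the current best
        candidates and re-scoring the enlarged pool on the next iteration.
        """
        survivors = sorted(candidates, key=score_prompt, reverse=True)[:top_k]
        new_pool = list(survivors)
        for prompt in survivors:
            for _ in range(variants):
                new_pool.append(llm_paraphrase(prompt))
        return new_pool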

Test if adding layout coordinates (from Textract) to LLM prompts improves performance.
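Textract already returns a bounding box for every LINE block, so this ablation only requires serializing those coordinates into the prompt. A minimal sketch using the standard boto3 Textract response fields (the prompt format is just an illustration):

    import boto3

    def lines_with_layout(image_bytes):
        """Return each detected text line with its normalized bounding-box position."""
        client = boto3.client("textract")
        response = client.detect_document_text(Document={"Bytes": image_bytes})
        lines = []
        for block in response["Blocks"]:
            if block["BlockType"] == "LINE":
                box = block["Geometry"]["BoundingBox"]
                lines.append((block["Text"], box["Left"], box["Top"]))
        return lines

    def layout_prompt(lines):
        """Serialize text plus (left, top) page coordinates for the LLM prompt."""
        body = "\n".join(f"[{left:.2f},{top:.2f}] {text}" for text, left, top in lines)
        return ("Extract the key fields from this document. Each line is prefixed "
                "with its normalized page position:\n" + body)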

How does APE’s performance scale with the number of prompt candidates (NUMBER_OF_PROMPTS in Table 4)? Are there diminishing returns?
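One way to answer this empirically is a simple sweep over the candidate-pool size, re-running the search at each setting and recording the best validation score. In the sketch below, run_ape and evaluate_prompt are caller-supplied stand-ins for the authors' APE search and scoring, not real functions from the paper:

    def sweep_number_of_prompts(run_ape, evaluate_prompt, task_examples,
                                settings=(5, 10, 20, 40, 80)):
        """Check for diminishing returns as the APE candidate pool grows."""
        results = {}
        for n in settings:
            best_prompt = run_ape(task_examples, number_of_prompts=n)
            results[n] = evaluate_prompt(best_prompt, task_examples)
        return results  # plateauing values would indicate diminishing returns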

The authors are recommended to cite https://www.mdpi.com/2673-2688/6/3/56. 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The paper proposes an Applied Key Information Extraction (KIE) pipeline using Amazon Web Services (AWS) Textract and an Automatic Prompt Engineer (APE) with large language models (LLMs) to improve document information extraction.

 

Here are some suggestions for minor revision:

  1. The paper could benefit from clearer explanations of the technical processes involved, particularly the integration of OCR and LLMs. More detailed diagrams or flowcharts could help in understanding the methodology.
  2. If possible, while the paper provides precision and accuracy metrics, a more detailed comparative analysis with existing methods would strengthen the argument for the proposed approach's superiority.
  3. For error analysis, the paper mentions error types and confidence scores but could delve deeper into specific error cases and how they were addressed or could be mitigated in future work.
  4. Regarding scalability, a discussion of how the approach extends to other document types and languages would be beneficial. The paper could explore how the method generalizes beyond the datasets used in the experiments.
  5. For future work, the paper could outline potential research directions, such as improving the adaptability of the model to different document layouts or exploring other LLMs and OCR tools for enhanced performance.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors
  1. The paper uses known methods like OCR, LLMs, APE, and IPC for a key information extraction task. But the main new idea is not clearly stated. It is not clear if the main point is the design of the Applied-KIE pipeline, the way the tests are done, or what is learned from comparing the different prompt types on real company data. The paper should clearly say what the new part is.
  2. The paper mostly uses GPT-3.5-Turbo. This model is cheaper to run. But the paper should briefly explain why it was chosen instead of newer models like GPT-4. GPT-4 is only used for checking outputs and for IPC prediction. The paper could also say how results might change if a larger or newer model was used.
  3. The paper explains APE and IPC mostly by pointing to other papers. It would help to give a bit more detail about how they work in this KIE task. For example, it should briefly explain how the “score function” is used to choose better prompts in APE.
  4. The “Error Counts” metric in Tables 5 and 6 is confusing. The values go up when performance gets better. For example, 0.9153 in Table 5 and 0.853 in Table 6. This does not match what most people expect from an error count, which should go down as results improve. This number may actually show accuracy or something like 1 minus the error rate. The paper should clearly explain what “Error Counts” means and how it is calculated.
  5. The confidence score calculation removes the single lowest and single highest probability values before scaling. The paper does not explain why it removes exactly one of each. It also does not say why this works well for all cases, no matter how many documents are used. The paper should explain why this choice was made and whether other options were tested (a sketch of one possible reading of this step follows this list).
  6. The paper reports precision and a metric called “Error Counts,” which is not clearly defined. It does not always show recall or F1-score, even though the formulas are given. Including recall and F1-score would give a better view of how well the model works.
  7. The human performance results are not fully clear. The text says each field has high accuracy, but file-level accuracy drops to about 70% when an error happens. It is not clear if 70% is the average for files with errors or if one field error lowers the whole file’s score to 70%. The paper should explain how this file-level accuracy is calculated. Table 6 also lists “Human” with precision scores per field, but the text talks about file-level results. It is not clear how the field-level human precision numbers were found. The paper should explain how these were measured to make the comparison fair.
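Regarding point 5, a minimal sketch of one possible reading of the trimming step (the reviewer's interpretation only, not the authors' actual formula): drop the single lowest and single highest probabilities, then min-max scale what remains.

    def trimmed_confidence(probs):
        """Drop the single lowest and single highest values, then min-max scale.

        With very short inputs (fewer than 4 values) the trimmed list degenerates,
        which is part of why trimming exactly one value on each side, regardless
        of how many documents are used, needs justification.
        """
        if len(probs) < 4:
            return list(probs)
        trimmed = sorted(probs)[1:-1]
        lo, hi = trimmed[0], trimmed[-1]
        if hi == lo:
            return [1.0 for _ in trimmed]
        return [(p - lo) / (hi - lo) for p in trimmed]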

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

This paper can be accepted.
