Open Access Article

Peer-Review Record

Efficient Prompt Optimization for Relevance Evaluation via LLM-Based Confusion Matrix Feedback

Appl. Sci. 2025, 15(9), 5198; https://doi.org/10.3390/app15095198

by Jaekeol Choi

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Jones Schaefer

Submission received: 31 March 2025 / Revised: 30 April 2025 / Accepted: 4 May 2025 / Published: 7 May 2025

(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper is interesting and the contribution is significant. I only have some minor suggestions as follows:

(1) Please move the discussion of existing work, such as the content of Subsection 3.2, to Section 2. It is better to present only your own work in a dedicated section such as Section 3.

(2) It would be better to use more tables and figures to show the configuration and results of your simulations and experiments. Long textual descriptions are not a very straightforward way to convey them.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The paper introduces a novel approach to optimizing prompts for large language models in the context of query-passage relevance evaluation. This method leverages a confusion matrix to provide structured feedback for refining prompts, aiming to improve the accuracy and efficiency of relevance judgments. While the paper presents an innovative idea, several aspects could be improved to strengthen the presentation.
1) The paper introduces the concept of using a confusion matrix for prompt optimization, but it lacks detailed explanations of how the confusion matrix is constructed and how the feedback mechanism works.
2) The paper uses Cohen's kappa and F1-score to evaluate the performance of the optimized prompts. It would be better to include additional metrics such as precision, recall, and accuracy for a more complete picture of the model's performance. These metrics would help readers better understand the trade-offs and overall effectiveness of the optimized prompts. In addition, a comparison of the computational costs is also critical to highlight the efficiency gains.
3) The paper compares APO-CF with APE, OPRO, and APO, but the discussion could benefit from a more detailed analysis of why APO-CF outperforms these methods. Provide a detailed analysis of the strengths and weaknesses of each baseline method. Discuss how APO-CF addresses the limitations of these methods. For instance, provide a case study that shows the original prompt, the confusion matrix values, the refined prompt, and the performance gains.
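
As an illustration of point 2), the metrics named there can all be derived from a single binary confusion matrix. The following is a minimal sketch, assuming binary relevance labels (relevant / not relevant); the function name and the counts in the usage line are hypothetical and are not taken from the paper.

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Derive standard metrics from binary confusion-matrix counts.

    tp/fp/fn/tn are counts of LLM judgments compared against reference
    labels. All names here are illustrative assumptions.
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_chance = (((tp + fp) / total) * ((tp + fn) / total)
                + ((fn + tn) / total) * ((fp + tn) / total))
    kappa = ((p_observed - p_chance) / (1 - p_chance)
             if p_chance != 1 else 0.0)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical counts, for illustration only.
example = metrics_from_confusion(tp=40, fp=10, fn=20, tn=30)
```

Reporting these values side by side makes the trade-off visible: a prompt can raise precision while lowering recall, which F1 or kappa alone would not show.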

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

I want to thank the author for the opportunity to read this article. The article presents an interesting topic and is very well constructed, but it requires some improvements:

- The specific motivations that led to this research are not clear in the introduction. The author presents previous research but needs to clarify the motivations.

- The objective of the article needs to be clear in the introduction.

- The author discusses the advantages and positive results at various points in the article. However, I suggest a more impartial textual approach, demonstrating each advantage and positive aspect of the proposal in a summarized and comparative way. I emphasize that it is important to present the advantages, but the language used can be improved.

- The 3-year dataset used needs to be justified.

- Lines 253-255: “For our experiments, we utilize GPT-4o, GPT-4o-mini, and DeepSeek-Chat as the large language models (LLMs). To access GPT-4o and GPT-4o-mini, we use the OpenAI API, while for DeepSeek-Chat, we rely on the DeepSeek API.” It is necessary to justify the language models selected.

- The Cohen’s kappa metric also needs to be referenced regarding its concept and usefulness for this research.

- How were the comparison models in section 4.3 defined? Was there any literature review for this?

- In the discussion section, the author only provides one reference for comparing and validating the research. This needs to be expanded to demonstrate the real advances of this research in comparison with other research.

- What were the limitations of this research? This needs to be made explicit in the conclusion.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

All concerns have been addressed.

Reviewer 3 Report

Comments and Suggestions for Authors

I congratulate the authors on the revision work carried out and on the form of the response letter presented. This reviewer's comments have been taken into account. The article can be accepted.
