Article
Peer-Review Record

Using Large Language Models for Goal-Oriented Dialogue Systems

Appl. Sci. 2025, 15(9), 4687; https://doi.org/10.3390/app15094687
by Leonid Legashev *, Alexander Shukhman, Vadim Badikov and Vladislav Kurynov
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 3 March 2025 / Revised: 16 April 2025 / Accepted: 22 April 2025 / Published: 23 April 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

I am against publishing the presented paper in a broad audience journal like Applied Sciences for the following reasons:

  1. The linguistic part (and the majority of figures: 1, 4, 5, 7, 9, 10), primarily based on Russian-language processing, will not be understandable to a general audience; thus, the study's applicability looks very narrow, and it could be more suitable for a computational linguistics journal.
  2. Only a few models can be deployed locally, so the implementation in the article's title is confusing; the existing commercial solutions were used instead.
  3. The dialogue objects tested are too naïve to judge the strengths and weaknesses of the reviewed models.
  4. While the graph in Fig. 6 conveys some information despite its messiness and poor readability, the one in Fig. 2 is entirely meaningless.
  5. The representativeness of the respondents does not allow valid inference due to the extremely low number of respondents; thus, the models' scores are not convincing.

Author Response

Comments 1: Only a few models can be deployed locally, so the implementation in the article's title is confusing; the existing commercial solutions were used instead.

Response 1: Thank you for pointing this out. The article title has been revised.

 

Comments 2: While the graph in Fig. 6 conveys some information despite its messiness and poor readability, the one in Fig. 2 is entirely meaningless.

Response 2: More detailed information about Figure 2 has been added to the text.

 

Comments 3: The representativeness of the respondents does not allow valid inference due to the extremely low number of respondents; thus, the models' scores are not convincing.

Response 3: Increasing the number of respondents and conducting additional research would be time-consuming. We believe the conducted research is sufficient to provide meaningful insights.

Reviewer 2 Report

Comments and Suggestions for Authors

Some suggestions for improving this article are summarized as follows:

1. For the Paper Title, avoid unspecific acronyms, like "LLM". It should be revised to clearly point out the application, the studied methodology, and under which modality both play together.

2. Section 1 (Introduction) is too long. The Literature Review part (such as Table 1) should be a separate section. Besides, I also suggest adding some updated literature from 2024 related to your work.

3. In Section 2 (Materials and Methods), the justification for the model choice is missing. A more detailed comparative analysis would strengthen the argument for the studied five LLMs.

4. In Section 2.2, it is suggested to provide a specific algorithm flowchart and pseudocode for your presented method. In addition, for Figure 3, the iterative operation process of the LLM-based dialogue agent is too simple and not specific enough.

5. In Section 3 (Results), beyond the MultiWOZ 2.2 dataset used in the current study, it would be better to validate the proposed method on a wider range of open datasets.

6. For your numerical experiments, a statistical analysis of the numerical results could be conducted.

7. Some images are too blurry to be clearly seen, such as Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, and Figure 10.

Comments on the Quality of English Language

The English needs further improvement and grammar correction; there are obvious formatting issues and several grammatical errors throughout the manuscript.

Author Response

Comments 1:  For the Paper Title, avoid unspecific acronyms, like "LLM". It should be revised to clearly point out the application, the studied methodology, and under which modality both play together.

Response 1: Thank you for pointing this out. The article title has been revised.

 

Comments 2: Section 1 (Introduction) is too long. The Literature Review part (such as Table 1) should be a separate section. Besides, I also suggest adding some updated literature from 2024 related to your work.

Response 2: The Introduction section has been divided into two parts, and more literature sources have been added to the review.

 

Comments 3:  In Section 2 (Materials and Methods), the justification for the model choice is missing. A more detailed comparative analysis would strengthen the argument for the studied five LLMs.

Response 3: A large number of new large language models appear every week; the selected models were relevant at the time of the study and remain so.

 

Comments 4: In Section 2.2, it is suggested to provide a specific algorithm flowchart and pseudocode for your presented method. In addition, for Figure 3, the iterative operation process of the LLM-based dialogue agent is too simple and not specific enough.

Response 4: Thank you for pointing this out. Pseudocode for both presented methods has been added to the text, and more detail has been added to Figure 3.

 

Comments 5: In Section 3 (Results), beyond the MultiWOZ 2.2 dataset used in the current study, it would be better to validate the proposed method on a wider range of open datasets.

Response 5: Additional experiments are time-consuming; due to technical issues with our AI server, we were able to add only the MANTiS dataset experiments within the tight deadline.

 

Comments 6:  For your numerical experiments, a statistical analysis of the numerical results could be conducted.

Response 6: We would be glad to address this comment, but we need more detail on what kind of statistical analysis should be performed.

 

Comments 7:  Some images are too blurry to be clearly seen, such as Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, and Figure 10.

Response 7: The blurriness is caused by conversion to PDF format. All figures were prepared in high quality and inserted into the text; all figures have also been uploaded separately in a .zip file.

Reviewer 3 Report

Comments and Suggestions for Authors

The study evaluates seven LLMs on intent mining and named entity recognition (NER) in Russian and English, comparing approaches with and without additional training on labeled dialogues.
How were the test prompts formulated, and were they manually created or extracted from existing datasets?
What thresholds or significance levels were used to determine performance improvements?
How many training samples were used, and what was the computational cost?
Was the fine-tuning process performed on both English and Russian datasets?
Were there cases where the heuristic approach failed, and how were these addressed?
How does the performance of the locally deployed models compare with cloud-based alternatives in a real-time setting?
Are there plans to release the training and evaluation scripts?
How can other researchers validate the results?
Improve the discussion section.

Author Response

Comments 1:  How were the test prompts formulated, and were they manually created or extracted from existing datasets?

Response 1: Test prompts were created manually based on our experience in prompt engineering.

 

Comments 2:  What thresholds or significance levels were used to determine performance improvements?

Response 2: Performance improvements were not part of the current study; the large language models were evaluated using standard metrics.

 

Comments 3:  How many training samples were used, and what was the computational cost?

Response 3: During the training, 8,437 dialogues consisting of 113,552 messages were used, with a total of 31 intents identified in the data.

Comments 4:  Was the fine-tuning process performed on both English and Russian datasets?

Response 4: The fine-tuning process was performed on the multilingual MultiWOZ dataset.

 

Comments 5: Were there cases where the heuristic approach failed, and how were these addressed?

Response 5: In these cases, the dialogue ended earlier than the original one.

 

Comments 6:  How does the performance of the locally deployed models compare with cloud-based alternatives in a real-time setting?

Response 6: A performance comparison of locally deployed models and cloud-based models was not part of the current research; we focused on comparing the two proposed methods on the locally deployed LLaMA model.

 

Comments 7:  Are there plans to release the training and evaluation scripts?

Response 7: The script is available for preliminary review at the link:

https://colab.research.google.com/drive/1tpgzObpuXBYoXUpXOfP5dOHnvcpaSUcz?usp=sharing

 

Comments 8:  How can other researchers validate the results?

Response 8: A universal notebook for validating the presented results is under development.

 

Comments 9: Improve the discussion section.

Response 9: The obtained results of the study are outlined in the discussion section.

Reviewer 4 Report

Comments and Suggestions for Authors

Thank you for inviting me to review this manuscript. The title is "Using Large Language Models for Goal-Oriented Dialogue Systems". The topic is interesting, and the results provide insight into the field. I have a few suggestions and observations that I would like to share with the authors:

Abstract

Authors can add research aims after the background.

Some theoretical and practical implications can be mentioned.

Introduction

Authors can split the sentences in lines 27-30 into two complete sentences.

Try not to use abbreviations, e.g. in line 34.

The authors can add some references in line 34 to say "one of the most popular areas of research is...".

Line 36 needs a direct quote for the definition of chatbot.

The introduction is a little too short. Authors can expand by adding more background, research objectives and also the overall structure of the article.

Literature Review

Why have the authors highlighted some headings in green? Please remove them accordingly.

The authors can improve the intext citations, e.g. in line 47 it should read "Addlesee et al. [1]" instead of "Addlesee A. et al. [1]".

For this section, the authors need to review these articles in more detail, e.g. similarities and differences between these papers and identify any trends.

Authors should review the tense used in the paragraph, e.g. present tense/past tense, and also plural or singular, e.g. Zhang, et al. [17] performs ....

A clearer research gap can be identified at the end of the literature review, and also how the present study addresses this research gap.

For the paragraph from lines 165 to 169, authors may move it to the end of the Introduction.

Methodology

Authors can talk about their research approach, e.g. a quantitative study or a mixed approach, before moving on to discuss their own data and models.

Authors can add some references to the paragraph from lines 171 to 181.

Please check the tense used in this section, e.g. in line 197 it should be "we compared two methods...".

The presentation of the code could be more organised, e.g. in lines 236 to 271.

Some expressions are very spoken, e.g. lines 272 and 284, "Let's implement three different versions" and "Let's select 20 dialogs from...". Please check the manuscript.

Findings

Please add an orientation paragraph between Section 4 and 4.1.

The words in figures 6, 7, 8, 9 and 10 are difficult to read. Please revise or enlarge the words.

Conclusion

The conclusion is missing. Please add it. The authors can use a table to organise the findings into bullet points so that the reader can easily make reference to them.

The reader would also expect to find limitations, areas for future research and theoretical and practical implications in the conclusion.

References

Intext citations in the text should be checked carefully, as should the tense of the reporting verbs.

Comments on the Quality of English Language

Language

Professional editing is required before publication.

Author Response

Reviewer 4

Comments 1: Abstract: Authors can add research aims after the background. Some theoretical and practical implications can be mentioned.

Response 1: Thank you for pointing this out. The abstract has been extended; new text in the article is highlighted in light blue.

 

Comments 2:  Introduction: Authors can split the sentences in lines 27-30 into two complete sentences. Try not to use abbreviations, e.g. in line 34. The authors can add some references in line 34 to say "one of the most popular areas of research is...". Line 36 needs a direct quote for the definition of chatbot. The introduction is a little too short. Authors can expand by adding more background, research objectives and also the overall structure of the article.

Response 2: Thank you for pointing this out. The introduction has been extended.

 

Comments 3:  Literature Review: Why have the authors highlighted some headings in green? Please remove them accordingly. The authors can improve the intext citations, e.g. in line 47 it should read "Addlesee et al. [1]" instead of "Addlesee A. et al. [1]". For this section, the authors need to review these articles in more detail, e.g. similarities and differences between these papers and identify any trends. Authors should review the tense used in the paragraph, e.g. present tense/past tense, and also plural or singular, e.g. Zhang, et al. [17] performs .... A clearer research gap can be identified at the end of the literature review, and also how the present study addresses this research gap. For the paragraph from lines 165 to 169, authors may move it to the end of the Introduction.

Response 3: The literature review has been inspected and revised.

 

Comments 4:  Methodology: Authors can talk about their research approach, e.g. a quantitative study or a mixed approach, before moving on to discuss their own data and models. Authors can add some references to the paragraph from lines 171 to 181. Please check the tense used in this section, e.g. in line 197 it should be "we compared two methods...". The presentation of the code could be more organised, e.g. in lines 236 to 271. Some expressions are very spoken, e.g. lines 272 and 284, "Let's implement three different versions" and "Let's select 20 dialogs from...". Please check the manuscript.

Response 4: In this study, our primary focus was on the large language models and datasets, which is why the methodology section starts with descriptions of the data and models. The other comments were addressed in the text.

 

Comments 5:  Findings: Please add an orientation paragraph between Section 4 and 4.1. The words in figures 6, 7, 8, 9 and 10 are difficult to read. Please revise or enlarge the words.

 

Response 5: Thank you for pointing this out. An orientation paragraph has been added to the text. The blurriness is caused by conversion to PDF format; all figures were prepared in high quality and inserted into the text, and all figures have also been uploaded separately in the Figures.zip file.

 

 

Comments 6: Conclusion: The conclusion is missing. Please add it. The authors can use a table to organise the findings into bullet points so that the reader can easily make reference to them. The reader would also expect to find limitations, areas for future research and theoretical and practical implications in the conclusion.

Response 6: A Conclusion section has been added.

 

Comments 7:  References: Intext citations in the text should be checked carefully, as should the tense of the reporting verbs.

Response 7: The citations have been checked.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

I remain opposed to publishing the paper, whose illustrative part remains unclear to a broad audience, in a journal like Applied Sciences. My comments were only partly addressed; the critical ones were simply skipped. The paper is still immature: the dialogues analysed are rather primitive for LLMs, there are no formal conclusions, the data availability section is misleading, etc.

Author Response

Comments 1: I remain opposed to publishing the paper, whose illustrative part remains unclear to a broad audience, in a journal like Applied Sciences. My comments were only partly addressed; the critical ones were simply skipped. The paper is still immature: the dialogues analysed are rather primitive for LLMs, there are no formal conclusions, the data availability section is misleading, etc.

Response 1: The illustrative part of the paper is clear to a broad audience, since all the prompts and model inferences are translated into English, which is a common practice in multilingual LLM studies. The dialogues analyzed in the study relate to the area of customer service, which is the main practical application of large language models in the chatbot domain. The MultiWOZ dataset is frequently used for testing models, as supported by other studies such as [35], [37] and [41]. Regarding the data availability section: as stated in the text, no new data were created in this study and data sharing is not applicable; the MultiWOZ 2.2 and MANTiS datasets are publicly available, and all the prompts are listed in the paper. The conclusions are clearly supported by the results obtained.

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have already responded to my previous questions and made significant improvements. I suggest that this submission could be accepted for publication as long as it meets the format requirements of Applied Sciences.

Author Response

Comments 1:  The authors have already responded to my previous questions and made significant improvements. I suggest that this submission could be accepted for publication as long as it meets the format requirements of Applied Sciences.

Response 1: Thank you for your comments.

Reviewer 3 Report

Comments and Suggestions for Authors

You could add more reflection on why some models perform better.

Author Response

Comments 1:  You could add more reflection on why some models perform better.

Response 1: Thank you for pointing this out. The discussion section has been improved.

Reviewer 4 Report

Comments and Suggestions for Authors

The authors have addressed most of my concerns.

I have no further comments. 

 
