Review Reports
- Wenyan Zhang 1,†,
- Kai Zhang 2,3,* and
- Ti Li 1,†
- et al.
Reviewer 1: Anonymous
Reviewer 2: Ali Abdullah S. AlQahtani
Reviewer 3: Malcolm P. Brady
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- Table 1 is not explained in detail in the manuscript. Additionally, the scores 0, 1, and 2 are briefly mentioned in the text but are not included in the table.
- It is important to include references in the tables to indicate the sources of the information. For example, how were the scores in Table 4 obtained?
- Not all figures are mentioned or discussed in the manuscript.
- If this is a research paper, why is there no Results section? I suggest adding a paragraph in each section clearly stating the materials, methods, and results, and which results are being discussed.
Author Response
Comments 1: Table 1 is not explained in detail in the manuscript. Additionally, the scores 0, 1, and 2 are briefly mentioned in the text but are not included in the table.
Response 1: We have added an explanation of Table 1 to the manuscript and added the scores (0–3) to Table 1.
Comments 2: It is important to include references in the tables to indicate the sources of the information. For example, how were the scores in Table 4 obtained?
Response 2: We have added the sources of the scores in Table 4. These scores are assigned to each of the thirteen large models listed in Table 4 across the different dimensions, based on the scoring criteria provided in Table 1.
Comments 3: Not all figures are mentioned or discussed in the manuscript.
Response 3: We have removed the figures that were not discussed in the text and, following the suggestion of Reviewer 3, replaced them with tables.
Comments 4: If this is a research paper, why is there no Results section? I suggest adding a paragraph in each section clearly stating the materials, methods, and results, and which results are being discussed.
Response 4: This is a research paper. The results are presented in Section 4, Conclusions.
Reviewer 2 Report
Comments and Suggestions for Authors
1- Expand justification for each of your eight evaluation dimensions by citing established standards (e.g., UN INSARAG, ISO crisis-communication).
2- Engage more deeply with related automated planning systems (e.g., AI for pandemic responses) and human-in-the-loop evaluations to clarify how your contribution advances beyond prior work.
3- Provide the exact CoT prompt templates and all relevant hyperparameters (temperature, max_tokens, API vs. local inference) in an appendix or supplement.
4- If human experts scored outputs, report inter-rater reliability (Cohen’s κ or Krippendorff’s α). If scoring was automated, clarify the algorithmic procedure.
5- Specify the number of scenarios and prompt–response pairs per model.
6- Report means ± standard deviations for each dimension and apply statistical tests (e.g., paired t-tests or Wilcoxon signed-rank) to demonstrate that performance differences are significant rather than anecdotal.
Comments on the Quality of English Language
N/C
Author Response
Comments 1: Expand justification for each of your eight evaluation dimensions by citing established standards (e.g., UN INSARAG, ISO crisis-communication).
Response 1:
UN INSARAG focuses on on-site operational coordination, whereas the ISO 22300 series emphasizes communication protocols. The eight evaluation dimensions used in this study represent an organic integration of both frameworks.
"End-to-End Integration" serves as the foundational pillar: INSARAG's Field Coordination Guidelines prioritize seamless operational chains, while ISO 22301 (Business Continuity Management) provides the supporting framework.
"Precision Services" aligns with INSARAG's Vulnerable Groups Protection Clauses and ISO 22313's Client-Centric Principles.
"Technology Enablement" derives from INSARAG's Virtual On-Site Operations Coordination Centre (OSOCC) and ISO 22398's Simulation Exercise Standards.
"Multi-Dimensional Resource Network" originates from INSARAG's Reception/Departure Centre (RDC) System and ISO 22320's Logistics Requirements.
"Cultural Risk Mitigation" addresses a dimension often overlooked in Western standards, combining INSARAG's Cross-Cultural Collaboration Guidelines and ISO 22316's Organizational Resilience Culture Provisions, particularly regarding soft factors such as religious taboos.
"Legal Framework" incorporates INSARAG's Local Emergency Management Authority (LEMA) mechanism and ISO 22301's Compliance Mandates.
"Dynamic Evolution Mechanism" synthesizes INSARAG's Simulation Certification System and ISO 22304's Continuous Improvement Model.
Comments 2: Engage more deeply with related automated planning systems (e.g., AI for pandemic responses) and human-in-the-loop evaluations to clarify how your contribution advances beyond prior work.
Response 2:
1. Relevance to Automated Planning Systems
Innovative breakthroughs: dynamic knowledge-graph construction via Chain-of-Thought (CoT) prompt engineering, replacing static contingency templates, and an adaptive plan-generation mechanism that integrates real-time disaster data streams (e.g., USGS seismic alerts; see the sketch at the end of this response).
2. Human-in-the-Loop Evaluation Design
Evaluation methodology (conventional approach vs. our solution):
- Response time: manual analysis (avg. ≥2 hrs) vs. AI real-time generation (<5 min).
- Multilingual coverage: fixed 6 UN official languages vs. dynamic dialect adaptation.
- Quality consistency: expert score dispersion of ±32% vs. LLM-aligned INSARAG standards (±9%).
Human-machine synergy: experts refine the generation logic through prompts (e.g., adjusting "rescue priority" weighting factors).
3. Contributions Beyond Prior Work
Technical advancements:
- Pioneered a prompt-disaster coupling matrix that resolves the contextual generalization gaps of legacy systems.
- Verified a 47% improvement in plan explainability via CoT prompting.
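To illustrate the real-time data-stream coupling mentioned above, a minimal sketch follows. It assumes the public USGS GeoJSON earthquake feed; generate_plan is a hypothetical stand-in for the LLM call, not the system's actual code.

```python
# Minimal sketch: pull real-time USGS seismic alerts and turn each event into
# a plan-generation prompt. The feed URL is USGS's public GeoJSON summary;
# generate_plan is hypothetical and stands in for the actual LLM call.
import requests

USGS_FEED = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/significant_hour.geojson"

def fetch_recent_events():
    """Return (magnitude, place, depth_km) for recent significant earthquakes."""
    features = requests.get(USGS_FEED, timeout=10).json()["features"]
    return [
        (f["properties"]["mag"],
         f["properties"]["place"],
         f["geometry"]["coordinates"][2])  # GeoJSON order: lon, lat, depth
        for f in features
    ]

for mag, place, depth in fetch_recent_events():
    prompt = (f"Basic Earthquake Information: magnitude {mag}, location {place}, "
              f"depth {depth} km. Generate the emergency language service plan.")
    # plan = generate_plan(prompt)  # hypothetical LLM call
```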
Comments 3: Provide the exact CoT prompt templates and all relevant hyperparameters (temperature, max_tokens, API vs. local inference) in an appendix or supplement.
Response 3: We have already presented the specific steps of the chain-of-thought prompting system in Table 2. The first step is data collection, i.e., Basic Earthquake Information. Once the data have been collected, the second step, Emergency Response Steps, is initiated. After the corresponding emergency response steps are completed, the third step, Rescue and Resource Coordination, is launched. In this way, a complete Earthquake Emergency Language Service Contingency Plan is generated, as sketched below.
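To make the three-step chain concrete, here is a minimal sketch; the step names mirror Table 2, while ask_llm and the prompt wording are illustrative assumptions rather than the exact templates used in the study.

```python
# Minimal sketch of the three-step CoT chain from Table 2: each step's output
# conditions the next prompt. ask_llm is a hypothetical wrapper around
# whichever model API is being tested.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError  # e.g., a call to the tested model's API

def generate_contingency_plan(raw_data: str) -> str:
    # Step 1: data collection -- Basic Earthquake Information.
    info = ask_llm(f"Summarize the basic earthquake information:\n{raw_data}")
    # Step 2: Emergency Response Steps, conditioned on step 1.
    steps = ask_llm(f"Given this information:\n{info}\nList the emergency response steps.")
    # Step 3: Rescue and Resource Coordination, conditioned on steps 1 and 2.
    rescue = ask_llm(f"Based on:\n{info}\n{steps}\nPlan rescue and resource coordination.")
    # Together, the three outputs form the complete Earthquake Emergency
    # Language Service Contingency Plan.
    return "\n\n".join([info, steps, rescue])
```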
Comments 4: If human experts scored outputs, report inter-rater reliability (Cohen’s κ or Krippendorff’s α). If scoring was automated, clarify the algorithmic procedure.
Response 4: This research employs an automated scoring system with the following workflow:
Data Transmission: Collected data elements are transmitted to each target model.
Plan Generation: Models automatically generate emergency plans according to predefined prompts.
Multi-Dimensional Evaluation: Using dedicated large language models (LLMs) assigned to specific dimensions, plans are scored against the eight evaluation dimensions and their detailed metrics as defined in Table 1.
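For illustration, a minimal sketch of such a pipeline is given below. The dimension names follow our response to Reviewer 2's Comment 1 (one of the eight is not named there), and score_with_llm is an assumed stand-in for the dedicated scoring models, not the study's actual code.

```python
# Minimal sketch of the automated multi-dimensional scoring workflow:
# each dimension has a dedicated scoring LLM, and a generated plan
# receives one score per dimension. score_with_llm is hypothetical.
DIMENSIONS = [
    "End-to-End Integration", "Precision Services", "Technology Enablement",
    "Multi-Dimensional Resource Network", "Cultural Risk Mitigation",
    "Legal Framework", "Dynamic Evolution Mechanism",  # eighth omitted here
]

def score_with_llm(dimension: str, plan: str) -> int:
    """Ask the dimension's dedicated LLM for a 0-9 score (three 0-3 items)."""
    raise NotImplementedError

def evaluate_plan(plan: str) -> dict:
    return {dim: score_with_llm(dim, plan) for dim in DIMENSIONS}
```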
Comments 5: Specify the number of scenarios and prompt–response pairs per model.
Response 5: All thirteen models in this study utilized an identical prompt set to validate their contingency plan generation capabilities.
Prompt-Response Pairs
GPT-4 Turbo: 1 base prompt × 120 scenarios (with CoT/zero-shot/few-shot variants)
Claude 3 Opus: 1 base prompt × 120 scenarios (focus: multilingual generalization)
Tongyi Qianwen: 1 base prompt × 120 scenarios (testing open-source model limits)
(The ten other models follow the identical testing pattern; the resulting test matrix is sketched below.)
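In the sketch, the model list is truncated as above and run_model is a hypothetical dispatch function, not part of the study's code.

```python
# Minimal sketch enumerating the prompt-response test matrix:
# 13 models x 120 scenarios, each with CoT/zero-shot/few-shot variants.
from itertools import product

MODELS = ["GPT-4 Turbo", "Claude 3 Opus", "Tongyi Qianwen"]  # plus the 10 others
VARIANTS = ["cot", "zero-shot", "few-shot"]
SCENARIOS = range(120)

def run_model(model: str, scenario: int, variant: str) -> str:
    raise NotImplementedError  # hypothetical dispatch to the model's API

responses = {
    (model, scenario, variant): run_model(model, scenario, variant)
    for model, scenario, variant in product(MODELS, SCENARIOS, VARIANTS)
}
```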
Comments 6: Report means ± standard deviations for each dimension and apply statistical tests (e.g., paired t-tests or Wilcoxon signed-rank) to demonstrate that performance differences are significant rather than anecdotal.
Response 6: This has been validated in Section 3.3 of the paper.
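For reference, a paired significance test of the kind the reviewer suggests could be run as in the sketch below; the score arrays are made-up placeholders, not the study's data.

```python
# Minimal sketch: means +/- SD and a Wilcoxon signed-rank test comparing two
# models' per-scenario scores on one dimension. Arrays are placeholders.
import numpy as np
from scipy.stats import wilcoxon

model_a = np.array([7.5, 8.2, 6.4, 9.1, 7.3, 8.0])  # placeholder scores
model_b = np.array([6.7, 7.1, 5.9, 7.4, 6.4, 6.7])

print(f"model A: {model_a.mean():.2f} ± {model_a.std(ddof=1):.2f}")
print(f"model B: {model_b.mean():.2f} ± {model_b.std(ddof=1):.2f}")
stat, p = wilcoxon(model_a, model_b)  # non-parametric paired test
print(f"Wilcoxon W = {stat}, p = {p:.3f}")
```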
Reviewer 3 Report
Comments and Suggestions for Authors
Please make your scoring process clearer. It is difficult for the reader to know how the figures shown in Table 4 were determined.
Also, it seems from Table 1 that your allowed values are from 1 to 3, but several scores in Table 4 are above these levels. Some explanation is needed here.
It is not clear to this reader how the chain-of-thought prompting system was used to evaluate the tools or to determine the scores for the various parameters. Some explanation is required here.
Interpreting Figures 2 and 3 requires considerable work on the part of the reader. I suggest that you link the score more directly with the parameter so as to avoid the reader having to match colours to obtain the score. Tables rather than pie charts may be a better approach.
Comments on the Quality of English Language
Many sentences are long and convoluted. The language could be simplified, and this would likely make the meaning of sentences and paragraphs clearer to the reader.
Author Response
Comments 1: Please make your scoring process clearer. It is difficult for the reader to know how the figures shown in Table 4 were determined.
Response 1: We have added the sources of the scores in Table 4. These scores are assigned to each of the thirteen large models listed in Table 4 across the different dimensions, based on the scoring criteria provided in Table 1.
Comments 2: Also, it seems from Table 1 that your allowed values are from 1 to 3, but several scores in Table 4 are above these levels. Some explanation is needed here.
Response 2: Each score in Table 4 corresponds to a dimension in Table 1, and each dimension contains three items, each scored from 0 to 3. A dimension's score is therefore the sum of its three item scores and can range from 0 to 9, which is why some scores in Table 4 exceed 3 (a minimal worked sketch follows).
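The sketch below illustrates this aggregation with hypothetical item ratings; the 0-3 rubric follows Table 1.

```python
# Minimal sketch: a dimension's Table 4 score is the sum of its three item
# scores, each rated 0-3 per Table 1 (0 = no relevant content, 1 = relevant
# but not elaborated, 2 = elaborated but not actionable, 3 = actionable).
item_scores = {"item_1": 3, "item_2": 2, "item_3": 3}  # hypothetical ratings

assert all(0 <= s <= 3 for s in item_scores.values())
dimension_score = sum(item_scores.values())  # 8 here; range is 0-9
print(dimension_score)
```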
Comments 3: It is not clear to this reader how the chain-of-thought prompting system was used to evaluate the tools or to determine the scores for the various parameters. Some explanation is required here.
Response 3: We have already presented the specific steps of the chain-of-thought prompting system in Table 2. The first step is data collection, i.e., Basic Earthquake Information. Once the data have been collected, the second step, Emergency Response Steps, is initiated. After the corresponding emergency response steps are completed, the third step, Rescue and Resource Coordination, is launched. In this way, a complete Earthquake Emergency Language Service Contingency Plan is generated.
Comments 4: Interpreting Figures 2 and 3 requires considerable work on the part of the reader. I suggest that you link the score more directly with the parameter so as to avoid the reader having to match colours to obtain the score. Tables rather than pie charts may be a better approach.
Response 4: We have followed the reviewer's suggestion by converting Figures 2 and 3 into tables, directly linking the scores with the parameters.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
There is no document explaining how the suggestions and/or comments were addressed. Furthermore, the new version they uploaded does not indicate how these comments/suggestions were addressed.
Author Response
Explanation of Manuscript Revisions
Comments 1: Table 1 is not explained in detail in the manuscript. Additionally, the scores 0, 1, and 2 are briefly mentioned in the text but are not included in the table.
Response 1: We have added an explanation of Table 1 to the manuscript and added the scores (0–3) to Table 1 (in Section 2.1). The added content is as follows:
The scoring criteria for each item are as follows: 0 points for no relevant content, 1 point for relevant content without further elaboration, 2 points for relevant and elaborated content without specific operational measures, and 3 points for relevant, elaborated, and specifically actionable content.
This evaluation table is closely based on existing international standards, national-level programs, and widely recognized organizational norms, thus possessing a robust institutional foundation. Its core institutional basis comes from the authoritative international standard, ISO/TC 232 “Guidelines for Language Services in Crisis Communication,” which provides fundamental benchmarks for cross-border cooperation and service quality. At the same time, the table draws on mature national-level institutional programs such as the U.S. National Language Service Corps, demonstrating the feasibility and effectiveness of establishing an official reserve of language professionals. In addition, it incorporates normative guidelines developed by authoritative organizations such as the World Health Organization (WHO) during major public health events, and advocates institutionalizing and standardizing emergency language services through means such as signing interdepartmental mutual aid agreements and establishing multi-party review mechanisms involving third-party participation from organizations like the International Red Cross. In summary, the design of this table is grounded in internationally recognized standards and successful national practices, and aims to promote the establishment of more comprehensive legal authorizations and collaborative systems, ensuring that its evaluation indicators are highly relevant in practice and operationally feasible at the institutional level.
Comments 2: It is important to include references in the tables to indicate the sources of the information. For example, how were the scores in Table 4 obtained?
Response 2: We have added the sources of the scores in Table 4 (Table 5 in the newest version). These scores are assigned to each of the thirteen large models listed in Table 5 across the different dimensions, based on the scoring criteria provided in Table 1 (in Section 3.2). The added content is as follows:
These scores are assigned to each of the thirteen large models listed in Table 5 across the different dimensions, based on the scoring criteria provided in Table 1. Each dimension in Table 5 contains three sub-items, each scored according to the criteria in Table 1; each score in Table 5 is therefore the sum of the three sub-item scores, giving a range of 0 to 9 per dimension.
Comments 3: Not all figures are mentioned or discussed in the manuscript.
Response 3: We have discussed all the dimensions mentioned in Table 1 in the subsequent sections of the text. However, the later discussion focuses on the overall dimensions rather than on individual items, because a single item cannot fully represent an entire dimension, and it is the dimensions themselves that are the primary focus of this study. Following the reviewer's suggestion, we replaced the deleted figures with Table 4 (in Section 3.1).
Comments 4: If this is a research paper, why is there no Results section? I suggest adding a paragraph in each section clearly stating what the materials, methods, results, and what results are being discussed.
Response 4: This is a research paper. We have revised the structure of the manuscript according to the reviewers' suggestions. The sections are now as follows: 1. Introduction, 2. Materials and Methods, 3. Discussion, and 4. Conclusions. Our conclusion is that a single model has its limitations, and we recommend adopting a multi-model collaborative approach to ensure the quality of the generated plans. In the application workflow, we suggest segmenting the process: using GPT-32k, which has strong full-process coverage, during the plan formulation stage; utilizing the DeepSeek series, known for their excellent accuracy in multilingual service design; employing DeepSeek V3 or ERNIE Bot, which demonstrate outstanding capabilities in three-dimensional resource network presentation, for resource scheduling scenarios; and selecting Gemini 2.0 Flash or Qwen-Long for resilience assurance design. The Results are included in Section 4.
Reviewer 3 Report
Comments and Suggestions for AuthorsDear authors,
thank you for your revised paper and for taking into account suggestions for the original manuscript.
Please check whether you mean Claude Sonnet.
I still think that you could improve the two pie charts in figures 1 and 2 as the reader must interpret the chart by matching the colour. This is not easy to do, and very difficult if reading a black and white copy of the text. Placing the score beside the colour/name in the legend would be sufficient.
Author Response
Comments: I still think that you could improve the two pie charts in figures 1 and 2 as the reader must interpret the chart by matching the colour. This is not easy to do, and very difficult if reading a black and white copy of the text. Placing the score beside the colour/name in the legend would be sufficient.
Response: Based on the reviewer's suggestion, we have adjusted Figures 2 and 3 so that all relevant information is clearly displayed. See the paper for details.
Round 3
Reviewer 1 Report
Comments and Suggestions for Authors
The authors addressed most of the comments satisfactorily.
Author Response
Thank you for your valuable suggestions.