Article
Peer-Review Record

Thinking Like an Expert: Aligning LLM Thought Processes for Automated Safety Modeling of High-Altitude Solar Drones

Drones 2025, 9(11), 780; https://doi.org/10.3390/drones9110780
by Qingran Su 1, Xingze Li 2,*, Yuming Ren 2, Bing Fu 2, Chunming Hu 2 and Yongfeng Yin 2,*
Reviewer 1:
Reviewer 2:
Reviewer 3:
Reviewer 4: Anonymous
Reviewer 5:
Submission received: 11 August 2025 / Revised: 31 October 2025 / Accepted: 5 November 2025 / Published: 9 November 2025
(This article belongs to the Special Issue Design and Flight Control of Low-Speed Near-Space Unmanned Systems)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Summary:
As the application of high-altitude solar drones expands, ensuring their safety is paramount. Traditional safety modeling, which relies on manual expert analysis, struggles to keep pace with rapid development cycles. While Large Language Models (LLMs) offer a path to automation, state-of-the-art reasoning frameworks like Graph of Thoughts (GoT) are too generic, lacking the domain-specific knowledge required for effective application. To address this gap, we introduce K-EGoT, a framework that grounds LLM reasoning in a verifiable, domain-specific knowledge base. Our method introduces a "Safety Rationale"—a mandatory, auditable link between LLM-generated model extensions and expert-curated safety principles. We then train a specialized model using a novel ’thought process alignment’ strategy, applying Direct Preference Optimization (DPO) to the quality of these rationales to ensure the model’s reasoning aligns with expert logic. On a high-fidelity dataset for the flight control-energy coupling problem, our 7B K-EGoT model achieved a Safety Extension Score (SES) of 92.7, significantly outperforming the 84.7 score from standard GoT prompting. Our work delivers a reliable and auditable solution for automated safety modeling for this critical class of drones.

Comments (major):
1. Please increase the size and resolution of the figures; the characters in all of them are too small to read.
2. Please explain in the captions of Figures 5 and 6 how the roadmaps should be read.
3. Please add some figures of high-altitude solar drones to help readers understand what this research focuses on.
4. If possible, please cite more relevant papers published in Drones.
5. The LLM Qwen2-7B was used in this research, but neither its architecture nor its key parameters are given. Please add at least one table covering these.
6. Long-chain and short-chain data should be identified separately for easier and clearer understanding.
7. What does RQ stand for? Please explain and spell it out in full.
8. The Discussion section is too short. Why is there only a Section 5.1? Where are the other parts?
9. What are Hold Altitude, Climb Altitude, and Descend Altitude in Figure 1, and why were these parameters set?
10. What do the blue junctions in Figure 6 mean? Please add an explanation to the caption.
11. What is the purpose of this research? It is not clearly stated. Please add an objectives section to explain it.
12. If the authors hyphenate terms such as Chain-of-Thought, please write Tree of Thoughts (ToT) and Graph of Thoughts (GoT) in the same style.
13. Please add the hardware (GPU or CPU) and software parameters used in this research; their absence leaves a large gap in understanding.

Author Response

Comments 1: Please extend the size and resolution of the figures, that the characters in all the figures are too small to read.

Response 1: Thank you for pointing this out. We agree with this comment. To address the issue that the characters in the figures are too small to read, we have adjusted the resolution of all figures. Specifically, we have increased the resolution of each figure to a level that ensures all text (including labels, legends, and annotations) within them is clearly visible and easy to read. Once again, thank you for your valuable guidance.


Comments 2: Please add the explanation of how to understand the roadmaps in the captions of Figure 5 & 6.

Response 2: Thank you for your valuable feedback. We have supplemented the captions of Figures 5 and 6 with explanations of how to read the roadmaps. For Figure 5 (Original functional state diagram), we extended the original annotation, "Original functional state diagram, which lacks energy aware safety constraints", with the following explanation: "The basic functional process presented in the diagram can be regarded as the original roadmap without safety constraints, and its core logic is developed along the main line of 'Start → Initialization → Initial Verification → Altitude Monitor'. Height adjustment is achieved through a single branch judgment (Climb=1/0), without involving coupling constraints between energy state and flight control. By tracking the connection relationship between state nodes and transition conditions, the process breakpoints that do not cover safety risks in the original design can be clarified." The complete revised annotation is located at lines 670-681 of the paper.

For Figure 6 (Extended functional state diagram generated by K-EGoT), we extended the original caption, "The blue junctions represent the sub state graph Input_Validation", with the following explanation: "The iter node implements energy state intervention on flight decisions, and readers can annotate the conditions of state transitions (such as Time judgment) through blue sub icons to understand how security constraints can be embedded into the original process and form a closed-loop protection." The complete revised caption is located on lines 682-693 of the paper. Once again, thank you for your valuable guidance.


Comments 3: Please add some figures about High-Altitude Solar Drones that can help the readers to understand what the objects this research is focusing on.

Response 3: Thank you for this comment. We agree that a visual aid would significantly help readers, especially those unfamiliar with this domain, to understand the subject of our research. To address this, we have now added a new figure (referenced as Figure 1, Page 3) in the Introduction section. This figure shows a typical example of a high-altitude solar drone. Furthermore, we have revised the accompanying text to explicitly describe its key morphological characteristics, such as the high aspect ratio (long, narrow wings) and the large-area photovoltaic solar arrays, and directly link this unique physical design to the core research challenge: the flight control-energy coupling. We believe this addition fully satisfies the reviewer's request by providing clear visual context.

Comments 4: If possible, please add more relative papers derived from Drones.

Response 4: Thank you very much for your valuable suggestions on the research of drone defect detection. In response to your feedback, we have comprehensively expanded the "2.2.2 Drone Defect Detection" section in lines 205-226 of the paper, adding the latest research results in the field of drone defect detection. By citing references 31-36, we systematically pointed out a key gap in the current research system for drone safety: the lack of solutions that organically integrate "design safety" and "operational safety". This supplement further strengthens the innovative value of the K-EGoT framework proposed in this article, which enhances the overall security of unmanned aerial vehicle systems from the source by embedding security checks and constraint mechanisms in the modeling phase. These modifications make the paper's discourse more complete, showcasing the achievements of existing research while pointing out the direction of this study. Thank you again for helping us improve the academic rigor and completeness of our paper.

Comments 5: The LLM Qwen2-7B was used in this research, but neither its architecture nor its key parameters are given. Please add at least one table covering these.

Response 5: Thank you for your valuable feedback. You noted that the Qwen2-7B language model was used in this study without a description of its architecture and key parameters, and asked for at least one related table. In the "4.1.2. Baselines" section (lines 506-510) of the paper, we have added "Table 1 Qwen2-7B Model Architecture and Core Parameters" (Page 14), which summarizes the key information of the Qwen2-7B model: model size (7 billion parameters), architecture type (decoder-only Transformer), decoder layers (28), hidden layer dimension (4096), number of attention heads (32), attention head dimension (128), and feed-forward network size (14336).

This structured table presents the architecture and core parameters of the Qwen2-7B model, supplies the parameter information that was previously missing, and helps readers understand the experimental foundations. Once again, thank you for your valuable guidance.
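As a minimal sketch, the figures listed in Response 5 can be collected in a small configuration object and checked for internal consistency. The class name `DecoderConfig` and the method `heads_match_hidden_size` are hypothetical names chosen for illustration; the values are copied from the response as quoted and have not been checked here against the official model configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    """Architecture figures as quoted in Response 5 (assumed, not verified)."""

    num_parameters: int = 7_000_000_000
    num_decoder_layers: int = 28
    hidden_size: int = 4096
    num_attention_heads: int = 32
    head_dim: int = 128
    ffn_size: int = 14336

    def heads_match_hidden_size(self) -> bool:
        # In a standard multi-head attention layout, the number of heads
        # times the per-head dimension equals the hidden size.
        return self.num_attention_heads * self.head_dim == self.hidden_size


config = DecoderConfig()
print(config.heads_match_hidden_size())
```

The check only confirms that the quoted head count, head dimension, and hidden size agree with one another; it says nothing about whether they match the released model.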

Comments 6: Long-chain and short-chain data should be identified separately for easier and clearer understanding.

Response 6: Thank you for your valuable feedback. We agree with the reviewer that the original description of the long-chain and short-chain data generation was intertwined, which could make it difficult to follow. To address this, we have significantly restructured the subsection 'Dataset Construction for High-Altitude Solar Drone Safety Scenarios' (Section 3.3.1, lines 335-352). In the revised version, we now explicitly separate the descriptions. We first introduce the long-chain data and its complete generation process. Following this, we use a new paragraph starting with 'Separately,' to distinctly introduce the short-chain data and its creation method. This new structure, which uses bolded text to clearly identify each dataset, directly satisfies the reviewer's request for a clearer, separate identification.

Comments 7: What is RQ short for? Please explain and add the full spelling.

Response 7: Thank you for your insightful question. The abbreviation "RQ" stands for "Research Question", which refers to the central or specific inquiry that a study seeks to investigate and explore. It serves as a guiding focus for the research, helping to define its scope, objectives, and direction. In our revised manuscript, we have replaced all instances of the abbreviation "RQ" with its full form, "Research Question", in order to enhance the overall clarity, readability, and accessibility of the text. We appreciate your attention to detail, and we hope this revision contributes to a more seamless reading experience.

Comments 8: The Discussion section is too short. Why is there only a Section 5.1? Where are the other parts?

Response 8: Thank you for this pointed feedback. We agree that the previous Discussion section was underdeveloped and that the '5.1' numbering was a structural error. We have addressed this by removing the confusing subsection numbering. The revised text is now contained within a single, unified Discussion on lines 771-798. We have substantially expanded the content. The revised section now opens with a new, in-depth paragraph. This paragraph synthesizes the main implications of our work, such as how our knowledge-enhancement strategy enables a 7B model to outperform general-purpose prompting and, most importantly, how the 'Safety Rationale' provides a crucial, auditable trail for safety engineering. Finally, to add the critical analysis and depth expected of a Discussion section, we have introduced three new dedicated subsections (formatted with bold text for clarity): Construct Validity, Internal Validity, and External Validity. This new structure allows us to formally analyze the threats to our study's validity and detail our specific mitigations, such as defending our dataset with the high Fleiss' Kappa score. This section has been transformed from a brief note into a thorough analysis. Once again, thank you for your valuable guidance.

Comments 9: What are Hold Altitude, Climb Altitude, Descend Altitude in Figure 1? And why these parameters were set?

Response 9: Thank you for your questions regarding the hold altitude, climb altitude, and descend altitude in the original Figure 1 (now Figure 2). These states, namely 'Climbing' (representing Climb Altitude), 'Descending' (representing Descend Altitude), and 'Holding' (representing Hold Altitude), are not intended to represent a complex, finalized engineering model. Instead, Figure 2 serves as a simplified, high-level example of a drone's basic control logic, presented in the standard style of a SysML state machine diagram.

We chose these specific states because they form a universally understood, fundamental example of drone operation (i.e., going up, going down, or staying level). The primary purpose of this figure is to provide a clear and simple baseline or starting point (the 'Original Diagram') for our automated safety extension. This simple diagram intentionally lacks the critical safety logic (like energy-awareness), which is precisely the problem our K-EGoT framework is designed to solve. Therefore, these states were set to illustrate the style of the input, allowing us to effectively demonstrate the 'before' (Figure 2, the 'Original Diagram') and 'after' (Figure 6, the 'Extended Diagram') of our method. Thank you again for your valuable questions.


Comments 10: What is the meaning of the blue junctions in Figure 6? Please add the explanations in the captions.

Response 10: Thank you for your valuable feedback. In response to your inquiry about the meaning of "blue junctions" in Figure 6 and your request for additional explanation in the caption, we have made targeted modifications to the caption of Figure 6 (Extended functional state diagram generated by K-EGoT). Based on the original caption, we have further supplemented the specific functions and role descriptions of the blue nodes: "The blue junctions in the figure are the core associated nodes of the sub state diagram 'Input_Validation', mainly responsible for two major functions. First, they serve as the convergence point for the results of the 'Static Data Validation' and 'Dynamic Data Validation' modules, integrating sensor data validity verification results. Second, they act as a trigger hub for fault handling: when anomalies are detected during data verification, the process can be directly guided into the 'Fault Diagnosis' and 'Fault Handling' sub modules to prevent invalid data from flowing into the main control link and affecting the safety of flight control energy coupling. This design is in line with the 'External Interfaces' safety criteria in the knowledge base (such as 'Dynamic Data Feature Verification') to ensure the reliability of input data." The revised complete caption is located on lines 682-693 of the paper. Thank you again for your valuable questions.
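The two roles of the blue junctions described in the caption (merging both validation results, then routing to either the main control link or fault handling) can be sketched as a small routing function. This is an illustrative reading of the caption under stated assumptions, not the authors' state diagram: the names `NextStep` and `input_validation_junction` are hypothetical, and the caption does not say how the two validation results are combined, so a simple logical AND is assumed.

```python
from enum import Enum


class NextStep(Enum):
    """Possible targets after the junction (names are illustrative)."""

    MAIN_CONTROL = "main_control"
    FAULT_DIAGNOSIS = "fault_diagnosis"


def input_validation_junction(static_ok: bool, dynamic_ok: bool) -> NextStep:
    """Merge static and dynamic validation results and route the flow.

    Assumption: data reaches the main control link only when both
    validation modules report valid data; any anomaly is routed to
    fault diagnosis and handling instead.
    """
    if static_ok and dynamic_ok:
        return NextStep.MAIN_CONTROL
    return NextStep.FAULT_DIAGNOSIS


# Example: a dynamic-data anomaly diverts the flow away from main control.
print(input_validation_junction(static_ok=True, dynamic_ok=False))
```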

Comments 11: What is the purpose of this research? It is not clear to understand it. Please add one section of objectives to explain.

Response 11: Thank you for this crucial feedback. We acknowledge that the research purpose and objectives were not stated as clearly as they could have been. To address this comment, we have substantially revised the Introduction to integrate a clear and explicit explanation of our objectives.

In the revised Introduction, we now clearly establish the central research problem: manual, expert-driven safety modeling for HALE drones is a critical bottleneck on lines 64-72 (as stated in the first paragraph: '...a labor-intensive endeavor, heavily reliant on the extensive experience of senior safety engineers...'). We then precisely define the specific gap that existing LLM-based solutions fail to address on lines 73-81 (in the second paragraph: '...advanced reasoning frameworks... are too general-purpose...' and '...optimization techniques... produce a correct result without ensuring the underlying reasoning is sound.').

Building on this identified gap, we explicitly state our primary research purpose on lines 81-90 ('Our work, the Knowledge-Enhanced Graph of Thoughts (K-EGoT), thus focuses on injecting verifiable domain knowledge into an existing advanced reasoning framework...'). To fully articulate our specific goals as requested, we conclude the Introduction with the bulleted list under 'The main contributions of this paper are as follows:'. This list now functions as a direct enumeration of our research objectives, such as proposing the 'Safety Rationale' (Objective 1) and demonstrating the efficacy of aligning the model's reasoning process (Objective 2). Thank you again for your valuable guidance.

Comments 12: If the authors want to use the characters like Chain-of-Thought, please also modify Tree of Thoughts (ToT) and Graph of Thoughts (GoT) to the same style.

Response 12: Thank you for pointing out this issue. We fully agree with your comment that maintaining a consistent style for similar concepts is crucial for readability, and we have made targeted revisions to unify the expression of these thought frameworks: We have adjusted the expression of all relevant concepts in the paper—specifically, changing "Chain-of-Thought" to "Chain of Thought", "Tree-of-Thoughts" to "Tree of Thoughts", and "Graph-of-Thoughts" to "Graph of Thoughts".  Once again, thank you for your valuable guidance.

Comments 13: Please add the hardware (GPU or CPU) and software parameters used in this research; their absence leaves a large gap in understanding.

Response 13: Thank you for your valuable feedback. In response to your request to supplement the hardware (GPU/CPU) and software parameters used in this study to eliminate comprehension barriers, we have added detailed hardware and software parameter descriptions in the "4.1.3. Implementation Details" section on lines 529-534. In terms of hardware, it is explicitly added that "all experiments were run on servers equipped with three NVIDIA GeForce RTX 3090 graphics cards (single card 24GB VRAM), with a CPU model of Intel Xeon W-2295 (3.0GHz, 18 cores 36 threads) and a memory capacity of 128GB DDR4-3200". This hardware parameter description is located on lines 728-731 of the paper.

Reviewer 2 Report

Comments and Suggestions for Authors

This paper presents a study which has significance and relevance. I note the highlighting of parts of the text; this may relate to the reference to anonymous reviewers by the authors. This has been ignored in this review.

The paper generally reads well and is [again generally] logically structured with a clear narrative (see my comments below). The M&M are set out c/w the results, assumptions, and limitations. I have comments:

  1. The introduction needs a paragraph setting out the paper structure.
  2. The format of the manuscript needs improvement - for example see page 8.
  3. The equation numbering must be referenced in all the related text.
  4. I found that there are figures which are too small - for example see Figures 1, 5, and 6 - and they need enlargement to improve the readability.
  5. There is a need to improve the citations for many Figures 3, 3, and 4
  6. It would be useful if the authors provide a scenario-based illustrative example of the proposal presented in this paper to demonstrate the utility of the proposal in the 'real-world'.
  7. A search of ScienceDirect using the search terms: "Flight Control of Low-Speed Near-Space Unmanned Systems" resulted in: 56 papers (2026), 1129 papers (2025), 950 papers (2024), 831 review articles , and 4493 research articles. The literature review needs improvement c/w the research method adopted in the literature search. Typical examples of the results are shown in the attached example papers.

In summary, I found this to be a good paper that will be of interest to the intended audience. However, revisions are required as noted in my points 1-7; once suitably revised and extended, the paper will in my view be potentially publishable, subject to the normal English grammar and formatting checks carried out in the proofing process.

 

Comments for author File: Comments.zip

Author Response

Comments 1: The introduction needs a paragraph setting out the paper structure.

Response 1: Thank you for your valuable feedback. In response to your suggestion to add a paragraph in the introduction section that elaborates on the structure of the paper, we have added a new paragraph specifically introducing the structure of the paper at the end of the "1. Introduction" chapter (lines 113-162 of the paper). The specific wording is: "The subsequent chapters of this paper will be developed according to the following logic: Chapter 2 'Background and Related Work' will outline the limitations of traditional security analysis methods, the current research status of drone defect detection, and the evolution and shortcomings of LLM inference paradigms; Chapter 3 'Intelligent Safety Modeling Method' will elaborate on the overall architecture of the K-EGoT framework, the construction process of the drone safety knowledge base, and the core pipeline of framework training and inference; Chapter 4 'Experiments and Analysis' verifies the overall effectiveness, component contributions, and generation quality of the K-EGoT framework through designing multiple sets of experiments, and analyzes computational efficiency; Chapter 5 'Discussion' will explore research insights, effectiveness threats, and cross disciplinary promotion paths; Chapter 6 'Conclusion' summarizes the core contributions, points out current limitations and proposes future research directions, and finally supplements information such as author contributions and funding support." At the same time, to ensure logical coherence in the introduction, we have adjusted the content before and after the new paragraph, integrating the brief description of "subsequent chapter organization logic" mentioned in the original introduction into the new paragraph, making the overall expression more systematic and complete.

 

Comments 2: The format of the manuscript needs improvement - for example see page 8.

Response 2: Thank you very much for carefully reviewing our manuscript and for pointing out this important issue. We sincerely appreciate your valuable feedback and constructive guidance, which has been instrumental in improving the overall quality and presentation of our paper. We have thoroughly revised the format of the manuscript to ensure it aligns more closely with the journal’s formatting guidelines and enhances its visual clarity and professionalism. Specifically, we have adjusted the size and resolution of all figures to improve their legibility and ensure they are presented in a consistent and publication-ready manner. Additionally, we have re-optimized the overall layout of the paper, including the arrangement of text, figures, and tables, to enhance readability and ensure a more balanced and coherent structure throughout the document.

 

Comments 3: The equation numbering must be referenced in all the related text.

Response 3: Thank you very much for your careful review and for raising this important point. We fully agree with your observation regarding the need for appropriate references to support the formulas presented in our paper. We have carefully reviewed the manuscript and added relevant and properly formatted references for each formula where applicable. These references either cite the original sources from which the formulas were derived or provide supporting literature that contextualizes their use within our study.

 

Comments 4: I found that there are figures which are too small - for example see Figures 1, 5, and 6 - and they need enlargement to improve the readability.

Response 4: Thank you for pointing this out. We agree with this comment. To address the issue that the characters in the figures are too small to read, we have adjusted the resolution of all figures. Specifically, we have increased the resolution of each figure to a level that ensures all text (including labels, legends, and annotations) within them is clearly visible and easy to read. Once again, thank you for your valuable guidance.

 

Comments 5: There is a need to improve the citations for many Figures 3, 3, and 4

Response 5: Thank you for pointing out this problem. We have addressed this comment by revising the manuscript to improve the descriptive quality of the in-text references to these figures. Previously, the text passively pointed to the figures (e.g., '...is shown in...'). We have updated this wording to be more active and descriptive, serving to guide the reader by summarizing what the figure illustrates.

For example, the text referencing Figure 3 on lines 350-352 has been revised to read: Figure 3 illustrates this rewriting process, where a rewriter model is prompted to compress a detailed, lengthy chain of thought ("Long Instruction") into a concise version ("Short Instruction") that retains the core logic.

Similarly, the reference to Figure 4 on lines 363-364 has been updated to: Figure 4 provides an example of the final data structure, contrasting the "long think" tag in the long-CoT dataset (left) with the "short think" tag in the short-CoT dataset (right). We believe these modifications make the connection between the text and the figures more explicit and improve the overall readability of the manuscript. Once again, thank you for your valuable guidance.

 

Comments 6: It would be useful if the authors provide a scenario-based illustrative example of the proposal presented in this paper to demonstrate the utility of the proposal in the 'real-world'.

Response 6: Thank you for your valuable feedback. In response to your suggestion to provide a scenario-based example that demonstrates the practicality of our proposal, we have introduced a real-life example, "Low Battery Emergency Cruise of High Altitude Solar Drones at Night", in the "4.3.3. Research Question 3: Qualitative and Failure Case Analysis" section on lines 665-693. The specific content is: "Taking the nighttime phase of a certain type of high-altitude solar drone performing a 72-hour continuous reconnaissance mission as an example, when the drone battery SOC (State of Charge) drops to 25% (below the safety threshold of 30%) and encounters sudden cloud cover (solar replenishment interruption), the K-EGoT framework automatically triggers dynamic security extension inference: first, the Answering Node generates an extension scheme based on the 'Flight Control Energy Coupling -1' criterion in the knowledge base, which prohibits climbing and initiates energy recovery. It synchronously generates a Safety Rationale that explicitly references the criterion ID and a risk explanation of 'low battery climbing can easily lead to energy depletion'; subsequently, the Evaluation Node verified the fit of the proposed solution with the knowledge base (KB_Ground score 0.92) and logical coherence (Coherence score 0.88); the final Aggregate Rationale Node integrates the evaluation results and embeds the rule of 'maintaining a level flight attitude with low battery at night' into the SysML state diagram to avoid dangerous maneuvers by drones due to misjudging energy states." This new scenario example is located on lines 665-694 of the paper.
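The safety rule in this scenario (no climbing when the state of charge is below the threshold and solar replenishment is interrupted) can be illustrated with a short guard function. This is a hedged sketch of the rule as the response describes it, not the generated SysML extension: `EnergyState`, `climb_permitted`, and `select_maneuver` are hypothetical names, the 30% threshold is taken from the response, and requiring both conditions (low SOC and no solar input) before blocking a climb is an assumption about how the criterion combines them.

```python
from dataclasses import dataclass

# 30% safety threshold quoted in the response.
SOC_SAFETY_THRESHOLD = 0.30


@dataclass(frozen=True)
class EnergyState:
    soc: float             # battery state of charge in the range 0.0-1.0
    solar_available: bool  # False at night or under cloud cover


def climb_permitted(state: EnergyState) -> bool:
    """Return False when SOC is below threshold with no solar replenishment.

    Assumption: both conditions must hold for the climb to be prohibited.
    """
    low_battery = state.soc < SOC_SAFETY_THRESHOLD
    return not (low_battery and not state.solar_available)


def select_maneuver(requested: str, state: EnergyState) -> str:
    """Replace a prohibited climb request with level flight."""
    if requested == "climb" and not climb_permitted(state):
        return "hold_level"
    return requested


# Scenario from the response: SOC at 25% with solar input interrupted.
print(select_maneuver("climb", EnergyState(soc=0.25, solar_available=False)))
```

In the described framework this kind of guard would be embedded as a transition condition in the state diagram and accompanied by a Safety Rationale citing the criterion ID; the sketch covers only the guard itself.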

 

Comments 7: A search of ScienceDirect using the search terms: "Flight Control of Low-Speed Near-Space Unmanned Systems" resulted in: 56 papers (2026), 1129 papers (2025), 950 papers (2024), 831 review articles , and 4493 research articles. The literature review needs improvement c/w the research method adopted in the literature search. Typical examples of the results are shown in the attached example papers.

Response 7: Thank you very much for your valuable suggestions on the research of drone defect detection. In response to your feedback, we have comprehensively expanded the "2.2.2 Drone Defect Detection" section in lines 205-226 of the paper, adding the latest research results in the field of drone defect detection. By citing references 31-36, we systematically pointed out a key gap in the current research system for drone safety: the lack of solutions that organically integrate "design safety" and "operational safety". This supplement further strengthens the innovative value of the K-EGoT framework proposed in this article, which enhances the overall security of unmanned aerial vehicle systems from the source by embedding security checks and constraint mechanisms in the modeling phase. These modifications make the paper's discourse more complete, showcasing the achievements of existing research while pointing out the direction of this study. Thank you again for helping us improve the academic rigor and completeness of our paper.

 

Reviewer 3 Report

Comments and Suggestions for Authors

While the introduction, main part, conclusion, and references are good, the following items must be noted:

  1. The main part of the article is weakly connected to solar drones, which are the application of the presented methods;
  2. The article does not show samples of knowledge representation structures for solar drones (while it contains general formulas of instructions - Figure 4);
  3. It is not clear from the article how the drone's intelligence (or decision-making system) works and what the mechanism of instruction processing is.

Author Response

Comments 1: The main part of the article is weakly connected to solar drones, which are the application of the presented methods;

Response 1: Thank you for pointing out this issue. We appreciate the opportunity to clarify the connection between our proposed method and the domain of high-altitude solar drones. We respectfully elaborate that the high-altitude solar drone domain is not merely a coincidental application of our method, but rather the fundamental driver of our methodological design. Our work is motivated by a core safety challenge that is profound and defining for this specific class of drones, which we identify in the Introduction (Section 1, lines 43-72) as the "flight control-energy coupling" problem.

Unlike conventional drones, high-altitude long-endurance (HALE) solar platforms are defined by their reliance on volatile solar power, extreme high-aspect-ratio morphology, and the need for long-endurance autonomous operation through day-night cycles. This creates an intricate, dynamic interplay between the Flight Control System (FCS) and the Energy Management System (EMS) that is the principal source of safety risk. Generic LLM reasoning frameworks (like the standard GoT) fail in this domain precisely because they lack the deep, specialized knowledge required to model this complex coupling. Our K-EGoT framework was specifically designed to solve this problem:

  • The "K" (Knowledge) in K-EGoT is not general-purpose knowledge. It represents the domain-specific "Drone Safety Knowledge Base" (detailed in Section 3.2 and the Appendices) that we meticulously curated to formalize the expert rules governing this flight control-energy coupling.
  • Our core innovation, the "Safety Rationale" (Section 3.1), is the mechanism that explicitly grounds the LLM's reasoning process in this specific drone safety knowledge base, ensuring every modeling decision is auditable against domain principles (e.g., criterion "Flight Control-Energy Coupling - 1" in the Appendix).
  • Our entire experimental design (Section 4), including the dataset, metrics (SES), and the qualitative case study (RQ3), is centered on evaluating the model's ability to resolve these specific high-altitude solar drone safety scenarios.

In summary, the main part of our article is deeply rooted in solving a unique and critical drone safety problem. A generic method could not have achieved this. To make this crucial link more explicit for the reader, we have revised the Introduction and Discussion sections to further emphasize that the "flight control-energy coupling" problem is the central challenge that necessitates the development of our knowledge-enhanced K-EGoT framework.

Thank you again for your valuable suggestion.

 

Comments 2: Article does not show samples of knowledge presentation structures for solar drones (while it contains general formulas of instructions- Figure 4);

Response 2: Thank you for this valuable feedback. We understand the importance of showing concrete examples of the knowledge structure, as this is the core of our "K-EGoT" framework. We have rewritten the appendix (Appendix A, Safety Criteria Knowledge Base), focusing on the knowledge base content that you asked about.

This appendix provides the complete, formalized knowledge base that our method relies on, presented in a structured format across Tables A1-A10.

As described in our methodology (Section 3.2, lines 284-325), this knowledge base is organized into the five-category taxonomy specific to this domain. For example:

  • Table A1 (line 858) shows the criteria for "External Interface Failure Modes."
  • Tables A2-A5 (line 859) detail the "Functional Logic" criteria.
  • Tables A6-A7 (line 860) detail the "Functional Hierarchy Failure Mode" criteria.
  • Tables A8-A9 (line 861) detail the "State/Mode Failure Mode Analysis" criteria.
  • Table A10 (line 862) provides the critical criteria for "Flight Control-Energy Coupling," which is central to solar drones.

Each criterion in these tables (e.g., "Coupling - 1: Degradation of Maneuverability under Energy Constraints") functions as a formal knowledge object with an ID and detailed description, serving as the verifiable "knowledge presentation structure" that our LLM is grounded against.
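As an illustration only (not taken from the manuscript), a criterion of this kind could be held as a small record keyed by its ID. The class name `SafetyCriterion`, its field names, and the helper `build_index` are hypothetical; only the five category names and the "Coupling - 1" example come from the response above, and the full criterion text in Appendix A is not reproduced here.

```python
from dataclasses import dataclass

# The five categories of the taxonomy named in the response (Tables A1-A10).
CATEGORIES = (
    "External Interface Failure Modes",
    "Functional Logic",
    "Functional Hierarchy Failure Mode",
    "State/Mode Failure Mode Analysis",
    "Flight Control-Energy Coupling",
)


@dataclass(frozen=True)
class SafetyCriterion:
    """Hypothetical record for one knowledge-base criterion."""
    criterion_id: str      # e.g. "Coupling - 1"
    category: str          # one of CATEGORIES
    title: str             # short name of the criterion
    description: str = ""  # detailed text lives in Appendix A; left empty here

    def __post_init__(self):
        # Reject criteria that do not belong to the five-category taxonomy.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")


def build_index(criteria):
    """Map criterion ID -> criterion so a cited ID can be looked up."""
    index = {}
    for criterion in criteria:
        if criterion.criterion_id in index:
            raise ValueError(f"duplicate ID: {criterion.criterion_id}")
        index[criterion.criterion_id] = criterion
    return index


coupling_1 = SafetyCriterion(
    criterion_id="Coupling - 1",
    category="Flight Control-Energy Coupling",
    title="Degradation of Maneuverability under Energy Constraints",
)
kb = build_index([coupling_1])
```

Keying the index by ID is what would let a "Safety Rationale" be checked against the knowledge base with a simple lookup.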

We believe the confusion may have arisen because Figure 4, as the reviewer correctly noted, illustrates the prompt template for generating training data (the long/short CoT rewriting), rather than the knowledge base structure itself. The knowledge in Appendix A is the source that the "Safety Rationale" is based on.

To make this vital information more accessible to the reader, we have added an explicit forward-reference (lines 323-325) at the end of Section 3.2 (Knowledge Base Curation), which now clearly directs readers to Appendix A for the complete, structured safety criteria.

Thank you again for helping us improve the paper's clarity.

 

Comments 3: It is not clear from article, how drone's intelligence (or decision-making system) work and what is mechanism of instruction processing.

Response 3: Thank you for this critical question. We would like to respectfully clarify that this paper is not about the drone's on-board, real-time intelligence or its run-time decision-making system (e.g., the autopilot). Our K-EGoT framework is an offline, design-stage engineering tool that uses LLMs to assist human safety engineers in creating and validating the safety model for the drone. Our framework's output is a SysML state machine diagram (like the one in Figure 5, Page 19), which serves as the formal blueprint or specification from which engineers will later implement that on-board, real-time intelligence. Therefore, our paper focuses on automating the safety modeling of this system before it is built. In the context of our K-EGoT framework, the "instruction" (or input) is the initial, high-level SysML state machine model provided by the engineer (as shown in Figure 5). The "mechanism of instruction processing" is our core K-EGoT reasoning pipeline (detailed in Section 3.3.2, lines 410-416, and Algorithm 1, Page 11). This mechanism iteratively:

  • Analyzes the input SysML model.
  • Generates candidate safety extensions and "Safety Rationales" (Answering Node).
  • Evaluates these extensions against the Drone Safety Knowledge Base (Evaluation Node).
  • Refines its understanding and prepares for the next step (Aggregate Rationale Node).

The final output of this "processing" is the complete, safety-extended SysML model, as shown in Figure 6. In short, K-EGoT is an AI framework that helps engineers design the safety logic. To resolve this ambiguity in the manuscript, we have revised the Introduction (Section 1, lines 43-72) and the framework overview (Section 3.1, lines 265-283) to state more explicitly that K-EGoT is a design-stage, automated safety modeling framework.
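The four steps listed above can be pictured as a simple loop. The following is a minimal sketch under stated assumptions, not Algorithm 1 from the manuscript: the function `run_pipeline`, the `Candidate` record, the stopping rule (`max_steps`, `accept_score`), and the three stub callables are all hypothetical stand-ins for the LLM-backed Answering, Evaluation, and Aggregate Rationale nodes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    extension: str      # proposed safety extension to the state machine
    rationale: str      # "Safety Rationale" citing a knowledge-base criterion
    score: float = 0.0  # filled in by the evaluation step


def run_pipeline(
    model: str,
    generate: Callable[[str], List[Candidate]],
    evaluate: Callable[[Candidate], float],
    aggregate: Callable[[str, List[Candidate]], str],
    max_steps: int = 3,         # hypothetical iteration budget
    accept_score: float = 0.9,  # hypothetical acceptance threshold
) -> Optional[Candidate]:
    """Iterate generate -> evaluate -> aggregate and keep the best candidate."""
    prompt = f"Analyze this SysML state machine: {model}"
    best: Optional[Candidate] = None
    for _ in range(max_steps):
        candidates = generate(prompt)           # Answering Node
        for cand in candidates:
            cand.score = evaluate(cand)         # Evaluation Node
        if candidates:
            top = max(candidates, key=lambda c: c.score)
            if best is None or top.score > best.score:
                best = top
        if best is not None and best.score >= accept_score:
            break
        prompt = aggregate(prompt, candidates)  # Aggregate Rationale Node
    return best


# Stub nodes standing in for LLM calls (purely illustrative).
def demo_generate(prompt: str) -> List[Candidate]:
    return [
        Candidate("Add LowEnergy state", "Cites Coupling - 1"),
        Candidate("Add generic Error state", "No criterion cited"),
    ]


def demo_evaluate(cand: Candidate) -> float:
    # Reward rationales that cite a known criterion ID.
    return 1.0 if "Coupling - 1" in cand.rationale else 0.2


def demo_aggregate(prompt: str, candidates: List[Candidate]) -> str:
    notes = "; ".join(f"{c.extension} ({c.score:.1f})" for c in candidates)
    return f"{prompt}\nPrevious attempts: {notes}"


best = run_pipeline("Cruise -> Climb", demo_generate, demo_evaluate, demo_aggregate)
```

With these stubs the loop stops after the first pass, because the candidate that cites a criterion already meets the hypothetical acceptance threshold.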

Thank you for helping us clarify this important distinction.

 

Reviewer 4 Report

Comments and Suggestions for Authors

The paper presents a novel framework that uses LLM reasoning in a verifiable, domain-specific knowledge base. Its main contributions include a two-phase optimization strategy combining supervised fine-tuning and rationale-based alignment, a curated safety knowledge base, and a dynamic reasoning loop for SysML model extension. Strengths of the work are shown in performance improvements over state-of-the-art baselines, and practical relevance for aerospace safety engineering.

However, there are still some issues/comments to be resolved:

  1. Provide clear explanations for all variables used in the equations throughout the paper to improve clarity and reproducibility:
    • Equation (1) mentions xi without explanation.
    • Line 254 mentions xi and yi without explanation.
    • Equation (3) mentions xi, yi and pl without explanations.
    • Equation (4) mentions xi, yi and ps without explanations.
    • Equation (6) has multiple variables without explanations.
    • Equation (7) mentions P without explanation.
    • Equation (9) mentions fagg without explanation.
    • Equation (10) mentions emin and emax without explanations.
    • Equation (11) mentions F1states and F1 transitions without explanations.
    • Equation (12) mentions KB_Grounding and Coherence without explanations.
    • Equation (14) mentions Ijson, Ischema and Iconnectivity without explanations (at least not with the same name).
  2. In Section 4.2.1, the paper introduces a 1–5 scale for expert evaluation but does not explicitly state whether a higher score indicates better performance. This should be clarified in the text, similar to how Table 1 implies that higher values are better.

Author Response

Comments 1: Provide clear explanations for all variables used in the equations throughout the paper to improve clarity and reproducibility: Equation (1) mentions xi without explanation. Line 254 mentions xi and yi without explanation. Equation (3) mentions xi, yi and pl without explanations. Equation (4) mentions xi, yi and ps without explanations. Equation (6) has multiple variables without explanations. Equation (7) mentions P without explanation. Equation (9) mentions fagg without explanation. Equation (10) mentions emin and emax without explanations. Equation (11) mentions F1states and F1 transitions without explanations. Equation (12) mentions KB_Grounding and Coherence without explanations. Equation (14) mentions Ijson, Ischema and Iconnectivity without explanations (at least not with the same name).

Response 1: Thank you for your valuable feedback. In response to your request for clear explanations of all variables involved in the formulas in the paper to improve clarity, we have systematically supplemented and improved the descriptions of the formulas and corresponding variables throughout the paper.

  • At Equation (1) (lines 358-359), we have added an explanation that "xi represents the input scenario prompt corresponding to the i-th data point, which is used to trigger the model to generate inference trajectories";
  • At Equations (3) and (4) (lines 372-373), we have added "yi represents the target output corresponding to the i-th data point, and xi represents the input scene prompt for the i-th data point" (lines 372-373) and "the long Chain of Thought trajectory pl into a shorter one ps" (line 357);
  • At Equation (6) (lines 387-405), we have added explanations of each symbol: the learnable parameters of the domain expert model; the DPO training dataset; the positive sample set corresponding to input x (which must meet the requirements of "SysML extension scheme is correct, safe and logically complete, and references knowledge base standards"); the negative sample set corresponding to input x; and a detailed explanation of the probabilities of the positive and negative samples generated by the model;
  • At Equation (7) (lines 419-421), we have added "P is the aggregation prompt passed by the parent node, including historical inference information and task guidance";
  • At Equation (9) (lines 436-444), we have added "fagg is an aggregation function used to fuse the antecedent prompt, the newly generated security basis, and the evaluation basis, to generate an optimization prompt for subsequent inference";
  • At Equation (10) (lines 462-465), we have added "emin is the minimum value of the exploration factor (at this time, the model prioritizes generating schemes based on verification safety logic to reduce randomness); emax is the maximum value of the exploration factor (at which point the model generates more candidate solutions to cover potential risks)";
  • At Equation (11) (lines 551-552), we have added "F1states is the F1 score of the state element (measuring the match between the generated state and the expert annotated state); F1transitions is the F1 score of the transition element (measuring the matching degree between the generated transition rule and the expert annotated transition rule)";
  • At Equation (12) (lines 559-567), we have added "KB_Grounding" as the knowledge base correlation index (calculated by parsing the 'security basis' to extract the referenced knowledge base criterion ID and verifying its correlation), and "Coherence" as the logical coherence index (scored by the LLM evaluator based on the integrity of the inference chain);
  • At Equation (14) (lines 570-573), we have added "Ijson is the validity metric for JSON format (a value of 1 indicates correct format, 0 indicates error); Ischema is the conformity index (range 0-1) between the generated results and the predefined data structure specifications; Iconnectivity is an indicator of the effectiveness of connections between state machine elements and other structures (range 0-1)".
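To make the idea behind the knowledge base correlation index concrete, the sketch below extracts cited criterion IDs from a rationale and checks them against the knowledge base. It rests on several assumptions that are not from the manuscript: that rationales cite criteria in square brackets (e.g. "[Coupling - 1]"), that the score is the fraction of cited IDs found in the knowledge base, and that an uncited rationale scores 0. The manuscript's metric also verifies the relevance of each citation, which this simplified membership check does not attempt.

```python
import re

# Hypothetical citation format: criterion IDs appear in square brackets.
CITATION = re.compile(r"\[([^\[\]]+)\]")


def kb_grounding(rationale: str, kb_ids: set) -> float:
    """Fraction of cited criterion IDs that exist in the knowledge base."""
    cited = [match.strip() for match in CITATION.findall(rationale)]
    if not cited:
        return 0.0  # assumption: an uncited rationale is treated as ungrounded
    valid = sum(1 for criterion_id in cited if criterion_id in kb_ids)
    return valid / len(cited)


known_ids = {"Coupling - 1"}
full = kb_grounding("Block climb at low SOC [Coupling - 1]", known_ids)
half = kb_grounding("Uses [Coupling - 1] and [Made-up - 9]", known_ids)
none = kb_grounding("No criterion is cited here", known_ids)
```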

 

Comments 2: In Section 4.2.1, the paper introduces a 1–5 scale for expert evaluation but does not explicitly state whether a higher score indicates better performance. This should be clarified in the text, similar to how Table 1 implies that higher values are better.

Response 2: Thank you very much for your valuable feedback. We have added a clarification to Section 4.2.1, "Expert Evaluation" (line 557), clearly stating that "the scoring is based on a 1-5 point scale, with higher scores indicating superior model performance".

 

Reviewer 5 Report

Comments and Suggestions for Authors

Comments for drones-3841881

Title: Thinking like an expert aligning LLM thought processes for automated safety modeling of high-altitude solar drones



In this paper, a new K-EGoT (Knowledge-Enhanced Graph of Thought) framework for the operation of high-altitude solar drones was proposed. However, the proposed method is not completely new: it is a technique that combines an existing LLM with the knowledge of drone safety domain experts.

Although it is evaluated as a timely study at a time when drone use and industry have become active recently, I hope that the paper will be reviewed once again by adding the following contents.

 

  1. The proposed K-EGoT method should be presented in more detail. What is the difference from the model in which the deep learning of domain experts is added to the existing LLM model? If the difference from the application of (LLM+expert knowledge) is not properly presented, the paper is evaluated as having no creativity. In other words, compared to the (LLM+expert knowledge learning model), performance evaluation indicators such as performance excellence, originality, reliability, and performance accuracy on test data must be additionally compared.
  2. The questions about the K-EGoT framework presented in Figure 1 are as follows. It explains the process of creating an extended state transition diagram from the (LLM+domain expert knowledge learning model). This methodology is not a new technique at all, and excellent methodologies such as Bayesian state transition and Markov state transition have been proposed and used so far. The difference and excellence of this must be evaluated quantitatively.
  3. Performance comparison results from more diverse perspectives should be attached.
  4. The additions as an appendix should be summarized and organized. It is not desirable to write the prompt in the paper as it is.
  5. The conclusion is too insufficient. The implications of the study and the direction of future research should be presented together.
  6. In the evaluation metrics of Equations 11 to 13, a clear basis for weighting (0.5, 0.3, 0.4, 0.6, 0.7, 0.3) must be provided.
  7. With an SES score of 92.7, the implications of the result of outperforming the existing GoT prompting method (84.7) should be more clearly explained.
  8. Please redraw the figures considering that the size of the letters expressed in the picture is too small to be seen.
Comments on the Quality of English Language

English should be improved.

Author Response

Comments 1: The proposed K-EGoT method should be presented in more detail. What is the difference from the model in which the deep learning of domain experts is added to the existing LLM model? If the difference from the application of (LLM+expert knowledge) is not properly presented, the paper is evaluated as having no creativity. In other words, compared to the (LLM+expert knowledge learning model), performance evaluation indicators such as performance excellence, originality, reliability, and performance accuracy on test data must be additionally compared.

Response 1: Thank you for pointing out this issue. We address your questions in two parts below.

  • The Core Innovations of our method K-EGoT (lines 81-97)

The core innovation of K-EGoT is the paradigm shift from "Behavioral Imitation" to "Thought Process Alignment". The prevalent approach of integrating Large Language Models (LLMs) with domain-specific knowledge is largely limited to the level of "behavioral imitation". This paradigm, which fine-tunes a model on expert-annotated data, trains the model to produce a final answer that resembles an expert's. The foundational flaw in this approach is that the reasoning process remains an unreliable "black box". The model may arrive at the correct answer through spurious correlations in the data, while its underlying logic remains flawed and untrustworthy. In safety-critical domains such as aviation, where safety certification requires traceability, an answer that cannot be verified and trusted holds little engineering value, even if it is correct.

Our K-EGoT framework addresses this challenge by introducing the core concept of "thought process alignment," representing a fundamental paradigm shift. The innovation of our method lies not in the mere act of "combining knowledge," but in "how this combination is achieved," ensuring that every step of the model's reasoning is as rigorous and evidence-based as that of a domain expert.

Specifically, our innovations are manifested across the following three dimensions, which collectively overcome the deficiencies of existing methods:

  • Addressing the "Black Box" Problem by Introducing an Auditable "Safety Rationale"
    • Shortcomings of Existing Methods: Outputs from conventional methods lack the explainability and traceability required to meet safety certification standards.
    • Our Approach: We mandate that for every reasoning step, the model must generate a "Safety Rationale" that explicitly references principles from the knowledge base. This externalizes the model's internal monologue into a transparent and auditable logical chain, which significantly reduces the "black box" character of the reasoning and enhances the credibility of the results.
  • Solving the "Correct for the Wrong Reasons" Problem by Shifting the Optimization Target from Behavior to Thought Process
    • Shortcomings of Existing Methods: Advanced methods like DPO-Behavioral still optimize for the correctness of the final output (behavior), which fails to guarantee that the model has internalized the deep safety logic (e.g., flight control-energy coupling constraints), as evidenced by the lower Rationale Quality (Srat=82.1) of DPO-Behavioral compared to K-EGoT (Srat=88.4).
    • Our Approach: We innovatively establish the quality of the "Safety Rationale" as the core criterion for Direct Preference Optimization (DPO). We reward not the "correct answer," but the "rigorous, reliable, and expert-aligned reasoning process". This "thought process alignment" compels the model to prioritize internalizing the expert's decision-making logic over merely fitting to superficial data patterns.
  • Overcoming Inflexibility in Complex Problem-Solving by Introducing Dynamic Reasoning and Adaptive Exploration
    • Shortcomings of Existing Methods: Linear or simple tree-like reasoning structures struggle to efficiently handle complex safety problems with coupled constraints.
    • Our Approach: We not only leverage the advanced non-linear exploration capabilities of the Graph of Thoughts (GoT) framework but also introduce a dynamic adjustment mechanism based on evaluation feedback. This allows the system to intelligently balance "exploration" (comprehensively searching for potential hazards) and "exploitation" (refining the most promising safety strategies). This endows the entire reasoning process with both breadth and depth, enabling it to identify robust optimal safety extension solutions for the flight control-energy coupling decision space (e.g., avoiding nighttime climbing with low battery SOC).
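For readers unfamiliar with DPO, the sketch below shows the standard per-pair DPO loss and one hypothetical way to form a preference pair by ranking candidates on rationale quality rather than on the final answer. This is the generic DPO objective, not necessarily the exact form of the manuscript's Equation (6); the function names, the `beta` default, and the pair-selection rule are assumptions made for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy being trained
    (logp_*) and under the frozen reference model (ref_logp_*).
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) == softplus(-margin), computed in a stable form.
    z = -margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def pick_pair(candidates):
    """Hypothetical pairing rule: prefer the best-rated rationale over the worst.

    Each candidate is a (text, rationale_quality) tuple.
    """
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    return ranked[0][0], ranked[-1][0]


chosen, rejected = pick_pair([
    ("extension citing Coupling - 1 with full reasoning", 0.9),
    ("same extension with no cited criterion", 0.3),
])
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)    # policy equals reference
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)   # policy favors the chosen sample
```

When the policy matches the reference the loss equals log 2, and it falls as the policy raises the likelihood of the preferred rationale relative to the rejected one.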

In summary, the true innovation of K-EGoT lies in its establishment of a mechanism that forces the LLM's reasoning process to be deeply bound to verifiable expert knowledge, and then sets this "reliability of thought" as the primary optimization objective. This enables us to utilize a moderately-sized model (Qwen2-7B) to generate solutions that are more reliable and trustworthy than those from general-purpose reasoning methods (e.g., standard GoT prompting with SES=84.7) on the same model architecture.

  • Comparison with other methods

To comprehensively and rigorously validate the innovation and effectiveness of the K-EGoT framework, our experimental design employs a hierarchical set of baselines. The selection of these baselines is intended to systematically address two core research questions:

  • Does our deep knowledge integration method demonstrate a significant advantage over state-of-the-art prompt engineering techniques that do not require fine-tuning?
  • Compared to conventional fine-tuning methods that also integrate expert knowledge, is our proposed "thought process alignment" strategy superior to the standard "behavioral alignment" approach?

Group 1: Prompting-Based Baselines (lines 511-518)

This group of methods represents the upper bound of leveraging an LLM's general-purpose reasoning capabilities through external instructions, without altering the model's weights.

  • Baseline Methods:
    • Chain of Thought (CoT)
    • Tree of Thought (ToT)
    • Graph of Thought (GoT)
  • Representativeness and Purpose:
    • Representativeness: These methods are recognized as the primary strategies in LLM reasoning, from foundational to state-of-the-art, representing the upper limits of general reasoning abilities.
    • Rationale for Selection: They were chosen to demonstrate a key argument: for highly specialized and structured tasks like high-altitude solar drone safety modeling, relying solely on an LLM's general-purpose reasoning capabilities has limitations.
  • What This Comparison Proves:
    The experimental results (Table 2) show that the fine-tuned methods consistently outperform the strongest prompting-based method, GoT. This indicates that for safety-critical domains, deep knowledge integration achieved through fine-tuning is necessary and more effective than general-purpose prompting techniques.

Group 2: LLM + Domain Knowledge Integration Methods (lines 519-527)

This group of methods represents different strategies for integrating expert knowledge into an LLM, with the core objective of comparing how knowledge is integrated for the best effect.

  • Baseline Methods:
    • Standard Supervised Fine-Tuning (Standard SFT)
    • SFT + DPO-Behavioral
  • Representativeness and Purpose:
    • Representativeness: SFT + DPO-Behavioral represents an advanced implementation of the "LLM+expert knowledge learning model" that the reviewer is concerned with. It uses a state-of-the-art technique (DPO) to align with expert preferences but follows the conventional approach of focusing only on the final result.
    • Rationale for Selection: A direct comparison between our K-EGoT and SFT + DPO-Behavioral allows us to clearly isolate and demonstrate the unique value contributed by our core innovation, "thought process alignment". The primary variable between these two methods is the optimization target for DPO: one optimizes for the final 'behavior,' while the other optimizes for our proposed 'Safety Rationale' (the thought process).
  • What This Comparison Proves:
    The experimental data shows that K-EGoT (92.7) outperforms SFT + DPO-Behavioral (89.4) in the overall score. The gap is particularly evident in the "Rationale Quality" and expert-rated "Trustworthiness" metrics. This provides evidence for our paper's central thesis: in safety-critical domains, ensuring the model "thinks correctly" (thought process alignment) is more important and effective than merely ensuring it "answers correctly" (behavioral alignment).

Our experimental design, through a direct comparison with the SFT + DPO-Behavioral model—a typical "LLM+expert knowledge" approach—clearly reveals the distinction and superiority of K-EGoT. While both methods utilize expert knowledge for fine-tuning, they differ in their core optimization objective. The SFT + DPO-Behavioral model aims to mimic the expert's final behavior, whereas K-EGoT aims to align with the expert's intrinsic thought process. Our evaluation metrics, particularly "Rationale Quality" and "Trustworthiness", allow us to quantify this difference. The experimental results show that K-EGoT performs better on these "process quality" related metrics, which directly illustrates that the superiority of our method is not just in achieving a higher score, but in generating a model with superior explainability, traceability, and trustworthiness. This demonstrates that K-EGoT is not a simple superposition of knowledge, but a deeper paradigm of knowledge integration oriented towards reliable reasoning processes. Once again, thank you for your valuable guidance.

 

Comments 2: The questions about the K-EGoT framework presented in Figure 1 are as follows. It explains the process of creating an extended state transition diagram from the (LLM+domain expert knowledge learning model). This methodology is not a new technique at all, and excellent methodologies such as Bayesian state transition and Markov state transition have been proposed and used so far. The difference and excellence of this must be evaluated quantitatively.

Response 2: We sincerely thank the reviewer for this insightful comment comparing our K-EGoT framework with classic formal methods like Bayesian and Markov state transitions. We fully agree on the importance of these formal methods in system safety engineering. We would like to clarify a key difference in positioning: our K-EGoT framework is not intended to replace these established verification techniques but rather to serve as a complementary tool that operates at a different, much earlier stage of the engineering lifecycle. Its core objective is not the probabilistic verification of a completed model, but rather the generative discovery and completion of missing safety design considerations from an incomplete, initial functional blueprint.

K-EGoT and traditional formal methods differ fundamentally in their core objectives, inputs, and outputs. Formal methods typically require a structurally complete system model with precise transition probabilities as input, with the goal of performing quantitative risk analysis and verification to output metrics such as probabilities and failure rates. In contrast, K-EGoT takes an initial, functional design diagram and a domain knowledge base as input. Its objective is qualitative model enhancement and safety extension, and its outputs are a structurally more complete, safety-enhanced design model and an auditable "Safety Rationale" that explains the reasoning behind each modification. Therefore, a direct quantitative comparison between these two approaches—for instance, comparing our SES score against a steady-state probability from a Markov chain—is methodologically inappropriate as they measure entirely different things.

We believe K-EGoT's contribution lies precisely in bridging the gap between initial design and formal verification. A more complete and safety-aware model generated by K-EGoT can, in the future, serve as a higher-quality input for traditional formal methods, enabling more accurate quantitative risk assessment. Consequently, the quantitative evaluation in our paper is focused on comparing K-EGoT against its true peers: other paradigms attempting to solve complex engineering reasoning tasks with LLMs (e.g., GoT, SFT + DPO-Behavioral). Our evaluation metrics, particularly "Rationale Quality" and "Trustworthiness", are specifically designed to measure K-EGoT's superiority in the unique task of "automated generation of trustworthy reasoning", which is a task that traditional formal methods are not designed to perform. Once again, thank you for your valuable guidance.

 

Comments 3: Performance comparison results from more diverse perspectives should be attached.

Response 3: Thank you for pointing out this issue. To provide a performance comparison from more diverse perspectives, we have added a new subsection after the experimental results (Section 4.3.4, lines 726-769): Computational Efficiency Analysis. This new subsection is dedicated to evaluating the "computational efficiency" of our K-EGoT framework and all baseline methods. This complements our original analysis, which primarily focused on model "quality" and "trustworthiness" (e.g., SES score).

In this new subsection, we introduce and compare four key metrics, presenting all data in a new table (Table: efficiency_results). These metrics include: 1) Total Training Time (hours), to quantify the one-time engineering cost required to produce a usable model; 2) Average Inference Latency (seconds), to measure the "thinking time" required for the model to process a single diagram; 3) Average Inference Calls, to reflect the algorithmic complexity of the reasoning process; and 4) Average Inference Tokens, as a measure of the total computational workload and potential cost.

By juxtaposing the detailed efficiency data (Table: efficiency_results) with our existing quality metrics (Table main_results), we provide a comprehensive analysis of the critical engineering trade-off between "quality" and "efficiency". This new analysis allows us to clearly demonstrate that K-EGoT's higher computational cost (e.g., 266.38 seconds of latency and 18.62 calls) is a direct result of its design. However, this additional computational expense is exchanged for the high-quality, auditable, and trustworthy results (e.g., an SES score of 92.7) that are crucial in safety-critical domains. For the problem we aim to solve, this expenditure is acceptable.

 

Comments 4: The additions as an appendix should be summarized and organized. It is not desirable to write the prompt in the paper as it is.

Response 4: Thank you for pointing out this issue. We have reorganized the appendix by removing the content related to prompt words from the main text and placing it in Appendix B. Once again, thank you for your valuable guidance.

 

Comments 5: The conclusion is too insufficient. The implications of the study and the direction of future research should be presented together.

Response 5: Thank you for pointing out this issue. We agree that the previous conclusion was insufficient in its elaboration on the study's implications and future research. Following this suggestion, we have substantially revised and expanded Section 6, "Conclusions" (lines 800-840). The revised section now comprehensively elaborates on the practical value (Implications) of our research and presents clear, specific future research directions. These two components are now closely integrated to provide a more complete and substantive summary.

In the revised conclusion, we have added a dedicated paragraph to elaborate on the practical value and implications of our study. We explicitly state that the value of the K-EGoT framework lies in shifting safety analysis from a "post-design verification activity" to the "initial stage of the design blueprint," thereby avoiding "high-cost design rework in the later stage". Furthermore, we highlight that the "audit trajectory" generated by the "Safety Rationale" mechanism meets the "strict transparency requirements" demanded by aviation safety certification. We also discuss the significance of achieving success on a 7B model, which is particularly relevant for engineering teams with "limited computing resources".

Following the discussion of implications, we now detail the current limitations and, based on them, propose five specific future research directions. These directions include: 1) developing more refined priority arbitration mechanisms to address conflicts between safety criteria; 2) integrating the framework output with the formal validation tool chain to meet industrial application needs; 3) enabling cross-domain promotion to fields like autonomous driving by reconstructing the knowledge base based on standards such as ISO 26262; 4) exploring the extension of modeling from the design phase to runtime safety monitoring; and 5) developing effective human-machine collaboration interfaces to efficiently support engineers in their review and refinement processes. We believe these detailed additions have fully addressed the reviewer's concerns and made the conclusion section more substantive.

 

Comments 6: In the evaluation metrics of Equations 11 to 13, a clear basis for weighting (0.5, 0.3, 0.4, 0.6, 0.7, 0.3) must be provided.

Response 6: Thank you for pointing out this issue. We fully agree that, in the original manuscript, we failed to provide a clear basis for the weighting coefficients used in Equations 11, 12, and 13. To address this, we have made revisions to Section 4.2, "Evaluation Metrics".

We have added dedicated explanatory paragraphs directly following each equation to detail the rationale for setting these specific weights. Specifically, for Equation 11 (the overall SES score) on lines 538-548, our new text clarifies that this weighting (0.5, 0.3, 0.2) is set based on "engineering priorities in safety-critical domains" and was "established in consultation with domain experts". We elucidate that: Correctness is assigned the highest weight (0.5), as the functional and structural accuracy of the model is the "foremost requirement" for safety applications; Rationale Quality (0.3) reflects our "core research objective" of ensuring an auditable reasoning process "aligned with expert logic"; and Structural Integrity (0.2) serves as a "foundational criterion" that is comparatively less critical.

Similarly, we have provided a clear basis for Equations (12) and (13). For Equation (12) (which measures Correctness), we explain that the higher weight for transitions (0.6, versus 0.4 for states) is based on expert consensus that the critical safety logic (such as the "flight control-energy coupling" constraint) is "primarily embedded within the transitions" (lines 553-556). For Equation (13) (which measures Rationale Quality), we clarify that the high weight for KB Grounding (0.7) emphasizes our study's focus on objective, auditable "traceability" (a "hard metric" aligning with safety standards like ARP-4761), while the "soft metric" Coherence (0.3) is considered supplementary (lines 559-567). We believe these detailed justifications have now fully addressed the reviewer's concern. Thank you for providing valuable guidance.
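For readers of this record, the weighting scheme described in Responses 6 can be sketched in a few lines of Python. This is only an illustration: it assumes that Equations 11 to 13 are plain weighted sums on a common 0-100 scale, which the response implies but does not state, and every function and variable name below is hypothetical rather than taken from the manuscript. The sub-scores used in the example call are made-up inputs, not results from the paper.

```python
import math

# Weights as quoted in the authors' response. The linear weighted-sum form
# of Equations 11-13 is an ASSUMPTION made for this sketch.
SES_WEIGHTS = {"correctness": 0.5, "rationale_quality": 0.3, "structural_integrity": 0.2}
CORRECTNESS_WEIGHTS = {"states": 0.4, "transitions": 0.6}
RATIONALE_WEIGHTS = {"kb_grounding": 0.7, "coherence": 0.3}


def weighted_sum(scores: dict, weights: dict) -> float:
    """Combine sub-scores with weights that must sum to 1."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must have the same keys")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * scores[name] for name in weights)


def safety_extension_score(states, transitions, kb_grounding, coherence,
                           structural_integrity) -> float:
    """Hypothetical SES computation: Eq. 12 and Eq. 13 feed into Eq. 11."""
    correctness = weighted_sum(
        {"states": states, "transitions": transitions}, CORRECTNESS_WEIGHTS)
    rationale_quality = weighted_sum(
        {"kb_grounding": kb_grounding, "coherence": coherence}, RATIONALE_WEIGHTS)
    return weighted_sum(
        {"correctness": correctness,
         "rationale_quality": rationale_quality,
         "structural_integrity": structural_integrity},
        SES_WEIGHTS)


if __name__ == "__main__":
    # Made-up sub-scores on a 0-100 scale, for illustration only.
    print(safety_extension_score(states=90, transitions=95, kb_grounding=100,
                                 coherence=80, structural_integrity=100))
```

Because each weight set sums to 1, the combined score stays on the same scale as its inputs, which makes the relative priorities (Correctness over Rationale Quality over Structural Integrity) easy to read off directly.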

 

Comments 7: With an SES score of 92.7, the implications of the result of outperforming the existing GoT prompting method (84.7) should be more clearly explained.

Response 7: Thank you for pointing out this issue. In the current manuscript, we have already elaborated on this point in detail across multiple sections, addressing the two key aspects the reviewer has highlighted.

First, regarding the implications of this score gap, we provide an in-depth analysis in three key parts of the manuscript (Section 4.3.1, Discussion, and Conclusion). In Section 4.3.1 "Overall Effectiveness Analysis" (lines 585-634), we explicitly state that this achievement marks a "shift from general heuristic reasoning to domain knowledge based verifiable reasoning". This result indicates that for specialized, safety-critical domains, "injecting domain knowledge through fine-tuning is far more critical than relying on the general-purpose reasoning of a prompted model". Following this, in the "Discussion" section (lines 771-780), we further clarify that the "core insight" is that "verifying the process of reasoning via the 'Safety Rationale' is more crucial than just verifying the final product". This approach turns the "opaque reasoning of LLMs into a transparent, auditable trail". Finally, in the "Conclusion" section (lines 814-825), we summarize this implication: our method successfully solves the "problem of the lack of domain foundation" when applying "general reasoning frameworks" to professional domains.

Second, regarding the technical means by which this performance improvement was achieved, we provide quantitative analysis in Section 4.3.1 and Section 4.3.2. In Section 4.3.1, our comparison of K-EGoT (92.7) with "SFT + DPO-Behavioral" (89.4) demonstrates that the score increase does not come from fine-tuning alone, but specifically from our novel "rationale-centric alignment" strategy. This result shows our approach is "substantially more effective than... simple behavioral fine-tuning".

In Section 4.3.2 "Component Contribution Analysis", we further quantify the contribution of each technical component via an "ablation study". The results (Table 3) "irrefutably prove" that "Rationale Alignment" is the "core driver" of our performance; removing this component causes the total SES score to drop by 7.6 points. Furthermore, the "Dynamic Reasoning" strategy (contributing 5.8 points) and the "GoT Reasoning" paradigm (contributing 3.5 points) are also shown to be key technical factors in the performance improvement. We believe that through this detailed analysis already present in the manuscript, we have thoroughly explained why and how our K-EGoT framework outperforms the standard GoT prompting method.
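The score gaps discussed in Response 7 can be reproduced from the figures quoted above with a short Python sketch. It uses only numbers stated in this response; the dictionary and function names are hypothetical, and the sketch deliberately does not derive any ablated-variant scores, since Table 3 is not reproduced in this record.

```python
# Safety Extension Scores (SES) quoted in the authors' response.
REPORTED_SES = {
    "K-EGoT": 92.7,
    "SFT + DPO-Behavioral": 89.4,
    "GoT prompting": 84.7,
}

# Per-component point contributions quoted from the ablation study.
REPORTED_COMPONENT_POINTS = {
    "Rationale Alignment": 7.6,
    "Dynamic Reasoning": 5.8,
    "GoT Reasoning": 3.5,
}


def gap_over(baseline: str, system: str = "K-EGoT") -> float:
    """SES difference between `system` and `baseline`, rounded to one decimal."""
    return round(REPORTED_SES[system] - REPORTED_SES[baseline], 1)


def rank_components(points: dict) -> list:
    """Components ordered from largest to smallest reported contribution."""
    return sorted(points.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for baseline in ("GoT prompting", "SFT + DPO-Behavioral"):
        print(f"K-EGoT vs {baseline}: +{gap_over(baseline)} SES points")
    for name, value in rank_components(REPORTED_COMPONENT_POINTS):
        print(f"{name}: {value} points")
```

Read this way, the 8.0-point gap over GoT prompting is larger than the 3.3-point gap over behavioral fine-tuning, which matches the authors' argument that fine-tuning alone does not account for the full improvement.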

 

Comments 8: Please redraw the figures considering that the size of the letters expressed in the picture is too small to be seen.

Response 8: Thank you for pointing this out. We agree with this comment. To address the issue that the characters in the figures are too small to read, we have adjusted the resolution of all figures. Specifically, we have increased the resolution of each figure to a level that ensures all text (including labels, legends, and annotations) within them is clearly visible and easy to read. Once again, thank you for your valuable guidance.

 

 

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

If the authors used any AI-generated content, please mention it in the manuscript.

Author Response

Comments 1: If the authors used any AI-generated content, please mention it in the manuscript.

Figure 1, for example, is likely AI-generated. If it is not, please cite its source. The content is acceptable for publication.

Response 1

We thank you for the valuable feedback and the opportunity to clarify these points.

  • We would like to formally state that no AI-generated content was used in the preparation of our manuscript.
  • Regarding Figure 1, we confirm that it is not AI-generated. It is a conceptual illustration sourced from the webpage: https://satelliteobservation.net/2023/01/10/france-debates-its-near-space-policy/. We have also archived the source webpage using the Internet Archive's Wayback Machine. The permanent archive link for our source is: https://web.archive.org/web/20251103025845/https://satelliteobservation.net/2023/01/10/france-debates-its-near-space-policy/
  • As requested, we have added a citation for Figure 1 in the revised manuscript's caption and in the main reference list. The full citation, including this permanent archive link, has been added to our reference list [1] (please see revised manuscript).

Once again, thank you for your valuable guidance.

Reviewer 2 Report

Comments and Suggestions for Authors

I have read the revised manuscript and the authors' response. While the manuscript is not structured as I would present it, this is more a matter of writing style than of errors in the paper structure. Accordingly, I can accept the revised manuscript.

One suggestion: the conclusion has substantive topics which would be better placed in the discussion section, with the conclusion limited just to closing observations. This is a minor revision which may be implemented in the final proof copy where the format and English grammar check will be implemented.

Author Response

Comments 1: The conclusion has substantive topics which would be better placed in the discussion section with the conclusion limited just to closing observations. This is a minor revision which may be implemented in the final proof copy where the format and English grammar check will be implemented.

Response 1: Thank you very much for your valuable feedback. In accordance with your recommendation, we have revised the structure of the paper as follows:

  1. We have moved the "substantive topics" previously discussed in the Conclusion, including our work's practical value (lines 781-792) and future work (lines 812-826), entirely into the Discussion section. As a result, the Discussion section now contains a comprehensive treatment of our research's significance, applications, validity threats, and future plans.

  2. Correspondingly, the Conclusions section (lines 828-841) has been streamlined and strictly limited to a "concluding summary" of the paper's core work and main findings.

We believe this revision has made the paper's structure clearer, more logical, and better aligned with academic norms. Thank you once again for your guidance!

Reviewer 3 Report

Comments and Suggestions for Authors

No more comments. The authors' answers made the article clearer.

Reviewer 4 Report

Comments and Suggestions for Authors

The authors have appropriately addressed all my comments.

Author Response

We would like to express our sincere gratitude to the reviewers for their valuable guidance and constructive comments during the publication process of our manuscript. Your insights have greatly helped improve the quality of this paper. Once again, we appreciate your time and professional support, and wish you all the best in your future work.

Reviewer 5 Report

Comments and Suggestions for Authors

The resubmitted article has been revised in accordance with the comments raised.

Sincerely

Reviewer.

Author Response

We would like to express our sincere gratitude to the reviewers for their valuable guidance and constructive comments during the publication process of our manuscript. Your insights have greatly helped improve the quality of this paper. Once again, we appreciate your time and professional support, and wish you all the best in your future work.
