Stepwise Soft Actor–Critic for UAV Autonomous Flight Control
Round 1
Reviewer 1 Report
In this work, the authors propose a stepwise soft actor-critic (SeSAC) method to address the problem of real-world UAV operation in various complex environments. The proposed method starts training with an easy mission and increments the difficulty of missions until the final goal is achieved. This is because learning to achieve the final goal from scratch requires a huge amount of exploration that cannot be easily achieved. The authors also propose a method to control a learning hyperparameter of the SeSAC algorithm and implement a positive buffer mechanism during training to enhance learning effectiveness. The proposed algorithm was verified in a six-degree-of-freedom flight environment with high-dimensional state and action spaces. The experimental results demonstrate that the proposed algorithm successfully completed missions in two challenging scenarios, one for disaster management and another for counterterrorism missions, while surpassing the performance of other baseline approaches.
The authors have addressed a good problem; however, the reviewer has the following major concerns:
1- The authors must highlight the parameters with respect to which the performance improvement is observed. This is mentioned neither in the abstract nor in the conclusion section of the manuscript. In the performance evaluation section, the authors claim that they obtain an improved score over the episodes; however, the reviewer is not able to understand the meaning of the score. Which parameters are involved in the score?
2- The conclusion section must be supported by the results analysis. In particular, the improved results, with their respective values, need to be highlighted.
3- How learning efficiency and stability are computed is not clear to the reviewer.
4- In the proposed method, it is not clear what the state, action, and reward values are.
5- Recent related works that utilize the RL framework in the context of UAV networks need to be discussed. For instance, the recent publications given below need to be included.
a) Energy and Throughput Management in Delay-Constrained Small-World UAV-IoT Network
b) Bayesian Optimization Enhanced Deep Reinforcement Learning for Trajectory Planning and Network Formation in Multi-UAV Networks
c) Improving Quality-of-Service in Cluster-Based UAV-Assisted Edge Networks
6- The quality of the obtained results needs to be improved.
7- More discussion of the obtained results needs to be provided.
Overall, this is a good work; however, a major revision is required.
Needs improvement.
Author Response
Response to Reviewer 1’s Comments
Dear Reviewer,
Thank you for the high-quality review and valuable comments. We believe your comments have helped to further develop our work into a substantially better manuscript by clarifying important concepts for the proposed method, and we sincerely appreciate the opportunity to do so. We highlighted the added or modified sentences in the marked copy of our revised manuscript in yellow. In what follows, we respond to all the issues you noted. Your comments appear in black, and our responses appear in red.
Comment 1: The authors must highlight the parameters with respect to which the performance improvement is observed. This is mentioned neither in the abstract nor in the conclusion section of the manuscript. In the performance evaluation section, the authors claim that they obtain an improved score over the episodes; however, the reviewer is not able to understand the meaning of the score. Which parameters are involved in the score?
Response 1: Firstly, we sincerely appreciate your feedback. In response to the concerns you raised, we have made revisions to various sections of the paper to convey information more clearly.
Highlighting Performance Improvement Parameters: We added more details in the conclusion about how our proposed SeSAC method demonstrates superior performance compared to traditional methods. Specifically, SeSAC exhibits faster convergence and a more stable learning curve. Notably, when using the first convergent episode as a metric for assessing learning efficiency, SeSAC achieved the desired score in just 660 episodes.
Definition of 'Score': Additionally, as you inquired about the meaning of 'score,' we have clarified this in the paper. The 'score' represents the cumulative reward received by the agent in one episode. In the PAM scenario, when the agent successfully completes the mission, it earns a reward of 500. On the contrary, if the mission fails, a reward of -100 is given. Moreover, at every timestep, the agent receives a distance-based reward depending on its proximity to the target.
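For illustration, below is a minimal sketch of how such a per-episode score could be accumulated, assuming the reward values described above (+500 for success, -100 for failure) and a generic distance-based shaping term; the function and variable names are hypothetical and not taken from the manuscript.

```python
def step_reward(prev_distance_km, curr_distance_km, done, success):
    """Illustrative per-timestep reward for the PAM scenario (values from the response above).

    The distance-based shaping term is an assumption: the agent is rewarded
    for moving closer to the target and penalized for moving away.
    """
    reward = prev_distance_km - curr_distance_km  # positive if the agent got closer
    if done:
        reward += 500.0 if success else -100.0    # terminal success / failure reward
    return reward

# The 'score' reported in the paper is the cumulative reward over one episode:
# score = sum(step_reward(...) for every timestep in the episode)
```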
Through these revisions, we aimed to address the queries you presented. Should you have any further questions or concerns, please don't hesitate to inform us. Once again, we're grateful for your invaluable insights.
Comment 2: The conclusion section must be supported by the results analysis. In particular, the improved results, with their respective values, need to be highlighted.
Response 2: We have made revisions to the conclusion section of our paper to ensure that it is directly supported by our results and provides a clear indication of our findings, as per your suggestion.
In the revised conclusion, we have emphasized our significant findings, including the superior performance of SeSAC compared to PPO and traditional SAC. The results, such as SeSAC converging to the desired score in just 660 episodes, SAC-P and SAC-PC converging at 1602 and 1951 episodes, respectively, and the impact of techniques like positive buffer, cool-down alpha, and stepwise learning, have been highlighted to demonstrate the effectiveness of SeSAC.
We believe that these adjustments make our conclusion more cohesive and reflective of our results analysis.
Comment 3: How learning efficiency and stability are computed is not clear to the reviewer.
Response 3: We provided a detailed explanation in the updated manuscript to clarify the measures of learning efficiency and stability.
Learning Efficiency: We measure learning efficiency primarily through the "First convergent episode" metric. This metric indicates the number of episodes it took for the agent to achieve a certain desired score or performance level for the first time. A lower number of episodes suggests higher learning efficiency, as the agent required fewer attempts to reach satisfactory performance. This measure can provide insights into how quickly an algorithm can potentially adapt to new or dynamic environments.
Learning Stability: Stability is gauged by the variability of the agent's performance over multiple episodes. Specifically, we investigated the deviation of scores following mission successes. A lower deviation indicates that the agent's performance is more consistent and less irregular, denoting a higher level of stability. This can be visually corroborated in Figures 8 and 11. Additionally, we used 'cumulative successes' as a metric to ascertain how often performance declines after the desired score is achieved. If the agent has been trained well, it will succeed in a greater number of episodes, which serves as a benchmark for its stability.
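As a rough sketch of how these metrics can be computed from a list of per-episode scores, assuming a fixed success threshold (the threshold value and names below are illustrative, not taken from the manuscript):

```python
import statistics

def learning_metrics(scores, success_threshold=500.0):
    """Illustrative computation of the metrics discussed above from per-episode scores."""
    successes = [s >= success_threshold for s in scores]
    # Learning efficiency: first episode at which the desired score was reached.
    first_convergent = next((i for i, ok in enumerate(successes, start=1) if ok), None)
    # Learning stability: total number of successful episodes, and the
    # spread of scores after the first success (lower deviation = more stable).
    cumulative_successes = sum(successes)
    post_success_scores = scores[first_convergent - 1:] if first_convergent else []
    deviation = statistics.pstdev(post_success_scores) if len(post_success_scores) > 1 else 0.0
    return first_convergent, cumulative_successes, deviation
```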
Comment 4: In the proposed method, it is not clear what the state, action, and reward values are.
Response 4: In response to your query regarding the state, action, and reward values in the proposed method, we've added Figure 3 to illustrate the basic state. Furthermore, we've updated the manuscript to include the reward values for both success and failure in the PAM and MTCM scenarios. Thank you for pointing out the need for further clarity.
Comment 5: Recent related works that utilize the RL framework in the context of UAV networks need to be discussed. For instance, the recent publications given below need to be included.
- a) Energy and Throughput Management in Delay-Constrained Small-World UAV-IoT Network
- b) Bayesian Optimization Enhanced Deep Reinforcement Learning for Trajectory Planning and Network Formation in Multi-UAV Networks
- c) Improving Quality-of-Service in Cluster-Based UAV-Assisted Edge Networks
Response 5: Thank you for recommending recent studies. Recognizing the significance of the contributions of the recommended works, we have incorporated a discussion on these papers into the introduction section of our manuscript. We are grateful for your suggestions, which have enhanced our paper.
Comment 6: The quality of the obtained results needs to be improved.
Response 6: To address your concern, we have conducted a comprehensive revision of the experimental results section in our manuscript. Here are the key changes and additions:
PAM Scenario Results: Table 5 presents detailed results for PPO, SAC, SAC-P, SAC-PC, and SeSAC in the context of the PAM scenario. We've expanded on the reward system and scoring criteria, highlighting the "Min score," "Max score," "Mean score," and "Cumulative successes" columns over 3,000 episodes. These metrics provide a comprehensive view of the agent's performance.
Role of Positive Buffer and Cool-down Alpha: In the updated results section, we emphasize the enhancing role of the positive buffer and cool-down alpha in SAC-PC. Specifically, we spotlighted the trend observed around the 2,000th episode, showcasing SAC-PC's potential for success in the PAM scenario.
Introduction of Key Metrics: To enhance clarity, we've defined essential metrics such as the score, cumulative success, and the first convergent episode. Collectively, these metrics present a holistic view of an agent's learning stability and efficiency.
Highlighting Cool-down Alpha's Impact: Our revision underscores the impact of integrating the cool-down alpha into SAC-PC, particularly its contribution to increased stability after achieving the desired score, as depicted in the comparison plots in Figure 11.
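The exact cool-down schedule is detailed in the manuscript rather than in this response; purely as an illustration, cooling down the SAC entropy temperature could look like the following sketch, where the decay factor and the trigger condition are assumptions:

```python
class CooldownAlpha:
    """Illustrative 'cool-down' of the SAC entropy temperature (alpha).

    The idea sketched here: keep exploration high early on, then gradually
    reduce alpha (e.g., after successful episodes) so the policy exploits
    what it has learned and the score stabilizes. The decay factor and
    minimum value are hypothetical.
    """
    def __init__(self, alpha=0.2, decay=0.99, alpha_min=0.01):
        self.alpha, self.decay, self.alpha_min = alpha, decay, alpha_min

    def update(self, episode_success):
        if episode_success:  # assumed trigger: cool down after a successful episode
            self.alpha = max(self.alpha * self.decay, self.alpha_min)
        return self.alpha
```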
Comment 7: More discussion of the obtained results needs to be provided.
Response 7: In the revised conclusion of our paper, we have accentuated the unique characteristics and the resultant outcomes of SeSAC. The successful completion of tasks using SeSAC, which were unattainable with conventional methods such as PPO and SAC, clearly delineates the advantages of our proposed methodology.
Particularly, the results from the MTCM experiment, where SeSAC converged to the desired score in just 660 episodes, serve as a prime example of its superior performance. Through this, we aimed to underscore how SeSAC's efficiency and benefits stand out compared to traditional techniques.
Author Response File: Author Response.docx
Reviewer 2 Report
Dear Editor;
The manuscript titled “Stepwise Soft Actor-Critic for UAV Autonomous Flight Control” has been evaluated meticulously from an academic perspective. In general, the article may be published upon your approval. However, there are some items requiring correction, as follows:
Comment - 1
Abbreviations should be written in their long form at first usage. In this manner,
six-degree-of-freedom flight…
The above sentence should be written in the following form:
six-degree-of-freedom (DOF) flight…
(Line 20)
Comment - 2
(Line-28,29,30)
UAVs are also mainly used for surveillance purposes. In this manner, while providing information about UAV usage areas, surveillance should be added.
Comment - 3
(Line-39-43)
Other than rotary-wing and fixed-wing UAVs, there are also many flexi-wing UAVs in the UAV industry. In this manner, the following sentence should be corrected as given below:
UAVs can be divided into rotary-wing UAVs, fixed-wing UAVs, and flexi-wing UAVs.
Comment - 4
Some similar studies from the literature are provided in the paragraph below.
However, the conclusions of those studies should also be provided.
Comment - 5
The long form of DRL should be written before DRL is used.
Comment – 6
The extra dot (.) should be deleted.
Comment – 7
The below-given paragraph is terrific. I want to congratulate the authors.
Comment – 8
The abbreviation's long form is unnecessary since it was used previously.
Comment – 9
Table 1.
In Table 1, the positions (states) of the UAVs were given. The positioning of the UAV also should be provided in the previous paragraph. How will UAVs have information about the positioning? For example, will an INS or GPS (or both) be used for navigation? It should be described.
Comment – 10
Figure 3.
The figure was labeled as left and right. Instead of this terminology,
Figure 3a and Figure 3b should be written separately.
Comment – 11
Figure 5.
Figure 5a and Figure 5b should be written separately.
Comment – 12
The position of an aircraft should be described with the following entries:
· Altitude
· Speed
· Bank angle (Roll)
· Vertical vector movement (Nose-up/nose-down)
In the following paragraphs, the bank angle value and climb/dive rate should be written in this manner. The flight should also be documented if it is assumed to be a level and straight flight.
Comment – 13
Figure 6.
Figure 6a and Figure 6b should be written separately.
Comment – 14
5. Conclusion
Other future studies may be offered for rotary-wing UAVs or flexi-wing UAVs.
Comment – 15
Author Contributions should be added to the article. A sample is provided below;
Comment – 16
Data Availability Statements should be added to the article. A sample is provided below;
Data Availability Statements
The datasets used or analyzed during the current study are available from the corresponding author upon reasonable request.
Best Regards
Comments for author File: Comments.pdf
Author Response
Response to Reviewer 2’s Comments
Dear Reviewer,
Thank you for the high-quality review and valuable comments. We believe your comments have helped to further develop our work into a substantially better manuscript by clarifying important concepts for the proposed method, and we sincerely appreciate the opportunity to do so. We highlighted the added or modified sentences in the marked copy of our revised manuscript in yellow. In what follows, we respond to all the issues you noted. Your comments appear in black, and our responses appear in red.
Comment 1: Abbreviations should be written in their long form at first usage. In this manner, six-degree-of-freedom flight…
The above sentence should be written in the following form: six-degree-of-freedom (DOF) flight…
Response 1: Thank you for your kind comments. We have revised the manuscript based on your suggestions and have also reviewed other parts of the paper.
Comment 2: UAVs are also mainly used for surveillance purposes. In this manner, while providing information about UAV usage areas, surveillance should be added.
Response 2: Thank you for pointing out the importance of surveillance as a primary application area for UAVs. We recognize this essential use-case and have incorporated information about the surveillance domain at the beginning of the introduction as one of the key areas where UAVs are extensively employed. Your feedback has greatly assisted in ensuring our manuscript provides a comprehensive overview of the diverse applications of UAVs.
Comment 3: Other than rotary-wing and fixed-wing UAVs, there are also many flexi-wing UAVs in the UAV industry. In this manner, the following sentence should be corrected as given below: UAVs can be divided into rotary-wing UAVs, fixed-wing UAVs, and flexi-wing UAVs.
Response 3: Based on your feedback, we reviewed the literature on flexi-wing UAVs and added relevant content between lines 49 and 52 of the revised manuscript. Your insights help ensure our paper remains up to date and accurately represents the broad spectrum of UAVs in the industry.
Comment 4: Some similar studies from the literature are provided in the paragraph below.
However, the conclusions of those studies should also be provided.
Response 4: In response to your feedback, we have incorporated the conclusions of the related studies and updated lines 69 to 85 of the manuscript accordingly.
Comment 5: The long form of DRL should be written before DRL is used.
Response 5: In response to your feedback, we have made the necessary modifications to line 93 of the manuscript. Thank you for bringing this to our attention.
Comment 6: The extra dot (.) should be deleted.
Response 6: Thank you for your thorough review of our manuscript. We have made the necessary corrections by removing the unnecessary dot.
Comment 7: The below-given paragraph is terrific. I want to congratulate the authors.
Response 7: Thank you for your kind words. Your praise is truly invigorating and serves as a great encouragement for us.
Comment 8: The abbreviation's long form is unnecessary since it was used previously.
Response 8: In response to your feedback, we have made the necessary modifications to line 231 of the manuscript.
Comment 9: In Table 1, the positions (states) of the UAVs were given. The positioning of the UAV also should be provided in the previous paragraph. How will UAVs have information about the positioning? For example, will an INS or GPS (or both) be used for navigation? It should be described.
Response 9: Thank you for highlighting concerns regarding the positioning system of the UAVs. In our revised manuscript, we have clarified this point. When selecting an agent for our experiments, we chose an INS/GPS-equipped fixed-wing aircraft from the 57 aircraft options offered by JSBSim. This choice ensures accurate and dependable navigation. We have included this information in lines 262-271 of our manuscript.
Comment 10: The figure was labeled as left and right. Instead of this terminology, Figure 3a and Figure 3b should be written separately.
Response 10: Thank you for your valuable feedback. As you suggested, we have revised the caption as follows:
Figure 4. (a) States for the relative position of the agent and target. (b) Examples of AA and HCA.
Comment 11: Figure 5a and Figure 5b should be written separately.
Response 11: Thank you for your valuable feedback. As you suggested, we have revised the caption as follows:
Figure 6. (a) PAM’s initial condition and mission success criteria. (b) MTCM’s initial condition and mission success criteria.
Comment 12: The position of an aircraft should be described with the following entries: Altitude, Speed, Bank angle (Roll), and Vertical vector movement (Nose-up/nose-down).
In the following paragraphs, the bank angle value and climb/dive rate should be written in this manner. The flight should also be documented if it is assumed to be a level and straight flight.
Response 12: Thank you for your feedback. Based on your suggestions, we have made revisions to lines 347, 348, 358, and 359 in the manuscript. In both the PAM and MTCM scenarios, we set the agent to start in a level and straight flight condition, with an altitude of 25,000 ft, a speed of 300 kts, and a bank angle of 0 degrees.
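For reference, such an initial straight-and-level condition can be set in JSBSim through its initial-condition properties before running the model; the snippet below is only a minimal sketch, and the aircraft model name is a hypothetical placeholder (the property names are standard JSBSim IC properties):

```python
import jsbsim

fdm = jsbsim.FGFDMExec(None)                  # flight dynamics model with default data paths
fdm.load_model('f16')                         # placeholder model name, not the authors' choice
fdm.set_property_value('ic/h-sl-ft', 25000)   # altitude: 25,000 ft
fdm.set_property_value('ic/vc-kts', 300)      # calibrated airspeed: 300 kts
fdm.set_property_value('ic/phi-deg', 0)       # bank angle: 0 deg (wings level)
fdm.set_property_value('ic/gamma-deg', 0)     # flight-path angle: 0 deg (level flight)
fdm.run_ic()                                  # apply the initial conditions
```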
Comment 13: Figure 6a and Figure 6b should be written separately.
Response 13: Thank you for your valuable feedback. As you suggested, we have revised the caption as follows:
Figure 7. Scenarios for SeSAC. (a) In PAM, the agent learns step-by-step from the initial goal, a target radius of 2.0 km, to the final goal of 0.1 km. (b) In MTCM, the agent learns step-by-step from the initial goal, a target distance of 3.5 km and an angle set to 45˚, to the final goal, a target distance of 2.4 km and an angle set to 12˚.
Comment 14: Other future studies may be offered for rotary-wing UAVs or flexi-wing UAVs.
Response 14: Thank you for your suggestion on potential future studies focusing on rotary-wing or flexi-wing UAVs. We concur with your insight, and in fact, in the conclusion section of our manuscript, we have highlighted this very point. We mentioned, "These results suggest that the approach applied to fixed-wing UAVs in this paper can be extended to other UAV types such as rotary-wing or flexi-wing UAVs, opening up possibilities for applications in various fields." This emphasizes that our proposed method can indeed be adapted and applied to a broader range of UAV configurations. We hope this addresses your feedback, and we appreciate your constructive input.
Comment 15: Author Contributions should be added to the article. A sample is provided below;
Response 15: Thank you for your valuable suggestion. Following your recommendation, we have added an "Author Contributions" section at the end of the manuscript, as follows:
Author Contributions: Conceptualization, J.H.B.; Methodology, H.J.H.; Writing – original draft, H.J.H.; Writing – review & editing, J.J.; Visualization, J.C.; Supervision, S.H.K. and C.O.K.; Project administration, C.O.K.
Comment 16: Data Availability Statements should be added to the article. A sample is provided below;
Data Availability Statements: The datasets used or analyzed during the current study are available from the corresponding author upon reasonable request.
Response 16: Thank you for your suggestion regarding the Data Availability Statements. We understand the importance of sharing data for academic transparency and reproducibility. However, due to the security policies of our research institute, we regret to inform you that we are unable to provide the datasets used in this study. We sincerely hope for your understanding on this matter and are committed to providing as much detail as possible within the article to support our findings.
Author Response File: Author Response.docx
Reviewer 3 Report
1. Some figures in the manuscript are a little blurry; please improve their clarity.
2. The contributions of the manuscript should be better summarized and listed.
3. I suggest the authors add the latest references to cover recently published papers on this domain/topic (machine learning methods for UAVs), such as "Reinforcement Learning-Based Security/Safety UAV System for Intrusion Detection Under Dynamic and Uncertain Target Movement," "Adversarial Attacks and Defenses for Deep Learning-based Unmanned Aerial Vehicles," and so on.
4. More in-depth comparison and analysis should be given in the manuscript.
Minor editing of English language is required.
Author Response
Response to Reviewer 3’s Comments
Dear Reviewer,
Thank you for the high-quality review and valuable comments. We believe your comments have helped to further develop our work into a substantially better manuscript by clarifying important concepts for the proposed method, and we sincerely appreciate the opportunity to do so. We highlighted the added or modified sentences in the marked copy of our revised manuscript in yellow. In what follows, we respond to all the issues you noted. Your comments appear in black, and our responses appear in red.
Comment 1: Some figures in the manuscript are a little blurry, please improve the clarity.
Response 1: Thank you for pointing out the clarity issues in some of the figures. In response to your feedback, we have not only enhanced the resolution of the blurry figures but also increased the size of the text within them to ensure readability. We appreciate your constructive feedback and have made the necessary adjustments to improve the manuscript.
Comment 2: The contributions of the manuscript should be better summarized and listed.
Response 2: In response to your suggestion, we have summarized the key contributions of our study and listed them in lines 112-124 of the manuscript.
Comment 3: I suggest the authors add the latest references to cover the recently published papers on this domain/topic (machine learning methods for UAVs), such as Reinforcement Learning-Based Security/Safety UAV System for Intrusion Detection Under Dynamic and Uncertain Target Movement, Adversarial Attacks and Defenses for Deep Learning-based Unmanned Aerial Vehicles, and so on.
Response 3: Thank you for recommending the inclusion of the latest papers in our domain. We sincerely appreciate your expertise and insight, which helps ensure our research remains current and comprehensive.
In response to your suggestion, we have updated our literature review and incorporated recent works, such as "Reinforcement Learning-Based Security/Safety UAV System for Intrusion Detection Under Dynamic and Uncertain Target Movement" and "Adversarial Attacks and Defenses for Deep Learning-based Unmanned Aerial Vehicles." These updates can be found on lines 53-68 of our manuscript.
Comment 4: More in-depth comparison and analysis should be given in the manuscript.
Response 4: Based on your feedback, we have made comprehensive revisions to the experimental results and conclusion sections of our manuscript. Here are the key changes and additions:
PAM Scenario Results: Table 5 presents detailed results for PPO, SAC, SAC-P, SAC-PC, and SeSAC in the context of the PAM scenario. We've expanded on the reward system and scoring criteria, highlighting the "Min score," "Max score," "Mean score," and "Cumulative successes" columns over 3,000 episodes. These metrics provide a comprehensive view of the agent's performance.
Role of Positive Buffer and Cool-down Alpha: In the updated results section, we emphasize the enhancing role of the positive buffer and cool-down alpha in SAC-PC. Specifically, we spotlighted the trend observed around the 2,000th episode, showcasing SAC-PC's potential for success in the PAM scenario.
Introduction of Key Metrics: To enhance clarity, we've defined essential metrics such as the score, cumulative success, and the first convergent episode. Collectively, these metrics present a holistic view of an agent's learning stability and efficiency.
Highlighting Cool-down Alpha's Impact: Our revision underscores the impact of integrating the cool-down alpha into SAC-PC, particularly its contribution to increased stability after achieving the desired score, as depicted in the comparison plots in Figure 11.
Author Response File: Author Response.docx
Reviewer 4 Report
The manuscript presents an engaging exploration into the application of the soft actor-critic method for guiding UAV movements in complex spaces. While the paper tackles a significant issue, it falls short in several key areas that are crucial for its readability and for propelling this line of research forward.
Major Concerns:
Problem Formalization: The paper falls short on providing a formal description of the problem at hand and neglects to provide the underlying physical movement formulas. To ensure a comprehensive understanding of the rest of the paper, a detailed and formal problem description, including assumptions about the relevant environment parameters, is needed.
Detailed Environmental Description: Important details missed in the paper include a clear depiction of the environment within which the drone operates. While state, action, and reward are later defined, these should be based on a prior detailed environmental description.
Methodology and Comparisons: The benchmark methods are not explained adequately, and the details about the simulation are insufficient. Please elucidate how the comparison was conducted. What does the "score" you measure for each method represent?
Conclusion: The conclusion is not detailed enough and does not provide a comprehensive overview of what was achieved and what remains to be addressed in future work.
Additional Revisions:
Abstract: The sentence, "This is because learning to achieve the final goal from scratch requires... that cannot be easily achieved" is out of context and needs clarification.
Background: "In this method"—the description here is not a method; it is a general description of the environment and the way the agent acts and interacts with it. The explanation of Equation (1) could be clearer.
Experiments: Please explain and give citations and details for the benchmark methods. Also, provide metrics you used for comparison.
Failure Reward: You measure success against failure; however, failure if the agent fails to achieve the goal should have a different reward than failure if it crashes.
Table 5: What does the score mean? What is the meaning of a negative score? What is the standard deviation of the score for each method?
Conclusion: This section should summarize what was the problem, how it was treated, and what were the achievements. However, the present conclusion is too brief and lacks sufficient explanation. Additionally, the statement, "The agent trained for a specific mission is unable to perform the task in new situations," demands further clarification. It's vital to elucidate what has been accomplished in relation to the initial goal (which was to enable the agent to function in new situations) and identify what challenges persist.
Supplementary Material: Providing access to the source code used in your research would greatly benefit readers. It would enhance transparency and reproducibility while also facilitating future research in this domain.
In conclusion, the paper could significantly benefit from addressing the aforementioned issues prior to resubmission. These improvements would greatly enhance the paper's clarity, relevance, and contribution to the field. Good luck.
Author Response
Response to Reviewer 4’s Comments
Dear Reviewer,
Thank you for the high-quality review and valuable comments. We believe your comments have helped to further develop our work into a substantially better manuscript by clarifying important concepts for the proposed method, and we sincerely appreciate the opportunity to do so. We highlighted the added or modified sentences in the marked copy of our revised manuscript in yellow. In what follows, we respond to all the issues you noted. Your comments appear in black, and our responses appear in red.
Comment 1: Problem Formalization: The paper falls short on providing a formal description of the problem at hand and neglects to provide the underlying physical movement formulas. To ensure a comprehensive understanding of the rest of the paper, a detailed and formal problem description, including assumptions about the relevant environment parameters, is needed.
Response 1: Thank you for the feedback regarding the need for a more thorough problem formalization.
To clarify the problem domain, we revised our manuscript to detail the states selected from JSBSim. JSBSim provides comprehensive information about the aircraft, encompassing aspects such as location, speed, engine condition, and the positions of control surfaces like ailerons and rudders. From this extensive pool of information, we specifically chose 10 states. Recognizing the significance of the aircraft's relationship with its target, we supplemented these with an additional 8 states, ensuring a holistic depiction of its spatial and dynamic interactions. Consequently, our learning process integrates a total of 18 states.
For a clearer depiction of the 10 primary states derived from JSBSim, we introduced Figure 3 in the manuscript. This figure visually illustrates each state and underscores its importance in the aircraft's operational dynamics.
Furthermore, responding to feedback about the ambiguity in our reward system, we refined the manuscript. We delineated the rewards assigned to both success and failure scenarios, making a clear connection between these rewards and the cumulative total, which we term as the "score".
In addition, it's pivotal to note that the foundational aerodynamics are rooted in JSBSim's dynamic model. For our experiments, we selected a fixed-wing aircraft equipped with INS/GPS from the 57 aircraft options provided by JSBSim. Our experiments utilized JSBSim's default atmospheric environment, mirroring the environmental parameters established by the 1976 U.S. Standard Atmosphere, and omitting meteorological phenomena like clouds or rainfall. To structure our learning environment, we leaned on pivotal libraries, including Python 3.8.5, JSBSim 1.1.5, PyTorch 1.9.0, and Gym 0.17.2.
Comment 2: Detailed Environmental Description: Important details missed in the paper include a clear depiction of the environment within which the drone operates. While state, action, and reward are later defined, these should be based on a prior detailed environmental description.
Response 2: Thank you for emphasizing the need for a more detailed description of the environment in which drones operate. We concur with your observations and have incorporated the details from lines 262-271 to better illustrate this aspect. Alongside the elaboration on JSBSim, we've appended a description of the JSBSim-Wrapper, which bridges Python and JSBSim, to the manuscript.
To begin with, the UAV in our experiment operates on JSBSim, an open-source flight dynamics model. JSBSim, a data-driven, 6-DOF flight dynamics model, stands out for its intricate ability to model flight dynamics and control. Its data-driven methodology allows for the classification of aircraft types and their accompanying equipment, like engines and radar, through an extensible markup language.
To ensure seamless integration between Python and JSBSim, we customized the JSBSim-Wrapper, optimizing it for our experimental setup. This wrapper is adept at extracting 51 types of information from JSBSim outputs, encompassing metrics like aircraft position, speed, engine status, and control surface positions, which include features like ailerons and rudder. Out of this vast array of data, we meticulously selected 10 states. In addition to this, we incorporated eight states based on target information, culminating in a total of 18 states for the learning process.
For our experimental needs, we selected an INS/GPS-equipped fixed-wing aircraft from the 57 options that JSBSim offers. All experiments were executed under JSBSim's default atmospheric conditions, mirroring the 1976 U.S. Standard Atmosphere, and excluded any meteorological anomalies like cloud cover or rainfall.
In terms of technical infrastructure, we've been consistent in leveraging key libraries, including Python 3.8.5, JSBSim 1.1.5, PyTorch 1.9.0, and Gym 0.17.2.
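To make the setup above concrete, a Gym-style wrapper around JSBSim typically exposes reset/step methods over the flight dynamics model; the following is a minimal sketch under the assumption of an 18-dimensional state (10 aircraft states plus 8 target-relative states) and continuous control-surface commands. It is not the authors' implementation, and the action layout and placeholder observation are assumptions.

```python
import gym
import numpy as np
from gym import spaces

class JSBSimEnvSketch(gym.Env):
    """Minimal illustration of a Gym wrapper around JSBSim; not the authors' implementation."""

    def __init__(self, fdm):
        self.fdm = fdm  # a jsbsim.FGFDMExec instance, created and configured elsewhere
        # 18 states: 10 aircraft states from JSBSim plus 8 target-relative states.
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(18,), dtype=np.float32)
        # Continuous commands (e.g., aileron, elevator, rudder, throttle) are assumed here.
        self.action_space = spaces.Box(-1.0, 1.0, shape=(4,), dtype=np.float32)

    def reset(self):
        self.fdm.run_ic()                              # re-apply the initial conditions
        return self._observe()

    def step(self, action):
        # Write the commands to JSBSim's flight-control properties.
        self.fdm.set_property_value('fcs/aileron-cmd-norm', float(action[0]))
        self.fdm.set_property_value('fcs/elevator-cmd-norm', float(action[1]))
        self.fdm.set_property_value('fcs/rudder-cmd-norm', float(action[2]))
        self.fdm.set_property_value('fcs/throttle-cmd-norm', float((action[3] + 1) / 2))
        self.fdm.run()                                 # advance the 6-DOF dynamics one step
        obs = self._observe()
        reward, done = 0.0, False                      # reward/termination logic omitted in this sketch
        return obs, reward, done, {}

    def _observe(self):
        # Placeholder: read the 10 aircraft states and append the 8 target-relative states.
        return np.zeros(18, dtype=np.float32)
```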
Comment 3: Methodology and Comparisons: The benchmark methods are not explained adequately, and the details about the simulation are insufficient. Please elucidate how the comparison was conducted. What does the "score" you measure for each method represent?
Response 3: We acknowledge the insufficient explanation of the PPO benchmark method in the initial manuscript. While we've elaborated on the SAC, which serves as a benchmark for SeSAC, in Section 2, we concur that PPO was not sufficiently detailed. To rectify this oversight, we have enhanced the description of PPO between lines 339-343 of the manuscript.
Regarding the comparison with other models, we ensured that the evaluation was conducted on a level playing field. All models were tested under the same conditions for 3,000 episodes. We recognize our oversight in not elucidating the definition of the "score" used in these comparisons and apologize for the confusion caused. To rectify this, we have added a comprehensive description of the "score" on lines 413-424 of the manuscript.
To clarify the "score": The score essentially represents the cumulative reward the agent garners within a single episode. In the context of the PAM scenario, a successful mission completion by the agent is rewarded with 500 points. In contrast, a mission failure results in a deduction of 100 points. Furthermore, as the agent progresses through each timestep, it earns distance rewards, which are contingent on its proximity to or distance from the target. If the agent continually strays away from its objective, culminating in a mission failure, it incurs a substantially negative score. The columns titled "Min score" and "Max score" in our results depict the minimum and maximum scores achieved during individual episodes, respectively. The "Mean score" represents an average of scores from the entire 3,000 episodes. A higher mean score suggests a greater frequency of successful episodes or consistent progress in the intended direction by the agent. Lastly, the "Cumulative successes" column quantifies the total successful episodes amassed over the 3,000-episode span. Both the cumulative successes and the mean score serve as potent indicators to gauge the stability and success of the learning process.
Comment 4: Conclusion: The conclusion is not detailed enough and does not provide a comprehensive overview of what was achieved and what remains to be addressed in future work.
Response 4: Following your suggestion, we have revised the paper to more directly support the research results and provide a clearer representation of our findings. Additionally, we have addressed the future research objectives in the conclusion section.
We highlighted SeSAC's notable performance, especially its capability to converge to the desired score in just 660 episodes. We also mentioned the potential scalability of our approach to different UAV types.
However, we acknowledged a limitation: agents trained for specific missions struggled in unfamiliar situations. To address this, we outlined our future research directions.
With these adjustments, we aim to clearly convey our research outcomes and forthcoming challenges.
Comment 5: Abstract: The sentence, "This is because learning to achieve the final goal from scratch requires... that cannot be easily achieved" is out of context and needs clarification.
Response 5: Thank you for highlighting this ambiguity in the abstract. In light of your feedback, we have revised that sentence for clarity.
Comment 6: Background: "In this method"—the description here is not a method; it is a general description of the environment and the way the agent acts and interacts with it. The explanation of Equation (1) could be clearer.
Response 6: Thank you for your insightful feedback on our manuscript. We agree with your comments and have made revisions to lines 133-136 of the manuscript to reflect your suggestions.
Comment 7: Experiments: Please explain and give citations and details for the benchmark methods. Also, provide metrics you used for comparison.
Response 7: Thank you for pointing out the need for more clarity on our benchmark methods and comparison metrics.
To address this, we've elaborated on the Proximal Policy Optimization (PPO) method in lines 339-343 and included the appropriate citation.
Regarding the metrics used for comparison, we've detailed them in lines 413-424. Central to our evaluation is the "score", which aggregates the rewards obtained by the agent in each episode. Metrics such as the "Min score" and "Max score" represent the range of scores, from the lowest to highest, achieved in individual episodes. The "Mean score" provides an average across 3,000 episodes, reflecting an uptick in successful missions or marked agent advancements. The "Cumulative successes" denotes the aggregate number of successful episodes over these 3,000 episodes. In conjunction, cumulative successes and the mean score act as robust indicators to gauge sustained learning progress.
Additionally, the concept of the first convergent episode sheds light on the efficiency of learning, highlighting the pace at which the set objectives were met.
Comment 8: Failure Reward: You measure success against failure; however, failure if the agent fails to achieve the goal should have a different reward than failure if it crashes.
Response 8: The term "failure reward" was designated to distinctly contrast with the "success reward." There are two types of failure scenarios in our experiment: one where the agent crashes onto the ground and the other where the agent overtakes its desired target. In both situations, the agent receives an identical reward of -100.
We recognize that our previous description might have been ambiguous. To clarify this, we have elaborated on this point in lines 370-372 of our manuscript.
Comment 9: Table 5: What does the score mean? What is the meaning of a negative score? What is the standard deviation of the score for each method?
Response 9: The score represents the cumulative rewards the agent receives in a single episode. Specifically, within the PAM scenario, an agent is rewarded 500 points upon successful mission completion. Conversely, a mission failure results in a deduction of 100 points. Moreover, during each timestep of the mission, the agent obtains a distance-based reward, which reflects its distance from the target. If the agent consistently deviates from the target path, culminating in a mission failure, it will accumulate a significantly negative score.
We have added more details about the score's definition and implications in lines 413-418 of the manuscript.
Comment 10: Conclusion: This section should summarize what was the problem, how it was treated, and what were the achievements. However, the present conclusion is too brief and lacks sufficient explanation. Additionally, the statement, "The agent trained for a specific mission is unable to perform the task in new situations," demands further clarification. It's vital to elucidate what has been accomplished in relation to the initial goal (which was to enable the agent to function in new situations) and identify what challenges persist.
Response 10: Thank you for pointing out the insufficiencies in our conclusion section. We acknowledge the oversight and have taken steps to revise the manuscript in alignment with your suggestions.
Drawing from our prior discussions, we've updated the conclusion to better encapsulate the problem we aimed to address, the methodology we adopted, and the achievements and limitations we observed.
Regarding the specific statement, "The agent trained for a specific mission is unable to perform the task in new situations," we have provided more clarity in lines 569-570 of our manuscript.
Comment 11: Supplementary Material: Providing access to the source code used in your research would greatly benefit readers. It would enhance transparency and reproducibility while also facilitating future research in this domain.
Response 11: Thank you for your suggestion regarding the provision of our source code as supplementary material. We fully understand the importance of transparency and reproducibility in research and recognize the value it would bring to readers and fellow researchers. However, due to security concerns within our research institute, we are currently unable to share the source code publicly. We deeply regret any inconvenience this may cause. We appreciate your understanding on this matter.
Author Response File: Author Response.docx
Round 2
Reviewer 1 Report
The authors have addressed all my concerns. The paper can now be accepted for the publication.
Author Response
Thank you for your review and feedback. We are pleased to hear that all concerns have been addressed. We appreciate your recommendation for the paper's acceptance for publication.
Reviewer 3 Report
The paper can be accepted. I have no comments.
Author Response
Thank you for your review. We appreciate your feedback and are glad to hear that the paper is acceptable for publication.
Reviewer 4 Report
The article is very interesting and has produced excellent results. I have a few minor revisions:
Line 136: Please add some words/references for the Markov decision process (MDP) and explain why this problem is an MDP.
Line 195-196: Kindly enhance the description of the two types of memory and their management.
Line 202: "The Critic consists of two main Q-networks and two corresponding target Q-networks." However, Figure 2 only displays one main Q network and one target Q network.
Algorithm 1, line 27: Kindly improve the indentation.
Table 1: Kindly insert a vertical line between the two columns.
Lines 336-338: Please specify the name of your proposed algorithm here (as mentioned in the experiments) in comparison to the baselines.
Lines 464, 465: For 2.0km and 0.5km, please add a space between the number and unit for consistency.
Figures 9, 11, 13: Please condense the figure caption.
Good luck.
Author Response
Comment 1: Line 136: Please add some words/references for the Markov decision process (MDP) and explain why this problem is an MDP.
Response 1: Thank you for pointing out that the relationship between the Markov decision process (MDP) and our reinforcement learning (RL) problem needs to be explained. Based on your feedback, we have revised lines 133-141 to provide a more comprehensive understanding and added references. Through these revisions, we aimed to address the queries you presented.
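For completeness, the standard formulation that the revised passage refers to can be summarized as follows; this is the generic MDP tuple and the entropy-regularized SAC objective, stated here for illustration rather than quoted from the manuscript:

```latex
% Generic MDP tuple and entropy-regularized SAC objective (standard formulation)
\begin{align}
  \mathcal{M} &= (\mathcal{S}, \mathcal{A}, p, r, \gamma), \qquad
  p(s_{t+1} \mid s_t, a_t), \qquad r(s_t, a_t) \\
  J(\pi) &= \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
\end{align}
```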
Comment 2: Line 195-196: Kindly enhance the description of the two types of memory and their management.
Response 2: Thank you for the feedback on the memory description.
In response to Comment 2, we have expanded on lines 197-207 to better detail our three memory buffers: the episode buffer, replay buffer, and positive buffer. We have clarified their individual roles, how they manage stored data, especially the removal of the oldest tuples when reaching capacity, and their sampling processes during training.
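As a rough illustration of the buffer management described above, the sketch below uses fixed-capacity FIFO buffers that drop the oldest tuples when full and are sampled during training; the capacities, the rule for copying successful episodes into the positive buffer, and the sampling ratio are assumptions for illustration only.

```python
import random
from collections import deque

class BufferSetSketch:
    """Illustrative episode / replay / positive buffers with oldest-first eviction."""

    def __init__(self, replay_capacity=100_000, positive_capacity=20_000):
        self.episode_buffer = []                                  # transitions of the current episode
        self.replay_buffer = deque(maxlen=replay_capacity)        # all past transitions
        self.positive_buffer = deque(maxlen=positive_capacity)    # transitions from successful episodes

    def store(self, transition):
        self.episode_buffer.append(transition)
        self.replay_buffer.append(transition)                     # deque evicts the oldest tuple when full

    def end_episode(self, success):
        if success:                                                # assumed rule: keep only successful episodes
            self.positive_buffer.extend(self.episode_buffer)
        self.episode_buffer.clear()

    def sample(self, batch_size, positive_ratio=0.25):
        # Mix ordinary replay samples with samples from the positive buffer (ratio is hypothetical).
        n_pos = min(int(batch_size * positive_ratio), len(self.positive_buffer))
        batch = random.sample(self.replay_buffer, batch_size - n_pos)
        if n_pos:
            batch += random.sample(self.positive_buffer, n_pos)
        return batch
```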
Comment 3: Line 202: "The Critic consists of two main Q-networks and two corresponding target Q-networks." However, Figure 2 only displays one main Q network and one target Q network.
Response 3: We have revised Figure 2 to more accurately depict the architecture of the Critic. To clearly represent the two networks that constitute both the main Q-network and the target Q-network, we have illustrated each network using separate boxes.
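To illustrate the structure that the revised Figure 2 now depicts, the standard SAC critic uses two independent Q-networks, each with its own target copy, and takes the minimum of the two target estimates (clipped double-Q); a generic PyTorch sketch follows, with layer sizes and dimensions as assumptions:

```python
import copy
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim=18, action_dim=4, hidden=256):   # dimensions are illustrative
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Two main Q-networks and two corresponding target Q-networks, as in standard SAC.
q1, q2 = QNetwork(), QNetwork()
q1_target, q2_target = copy.deepcopy(q1), copy.deepcopy(q2)

def min_target_q(state, action):
    # Clipped double-Q: take the element-wise minimum of the two target estimates.
    with torch.no_grad():
        return torch.min(q1_target(state, action), q2_target(state, action))
```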
Comment 4: Algorithm 1, line 27: Kindly improve the indentation.
Response 4: Based on your suggestion, we have improved the indentation in Algorithm 1 for better readability.
Comment 5: Table 1: Kindly insert a vertical line between the two columns.
Response 5: We have made the suggested changes to Table 1 by inserting a vertical line between the two columns. Furthermore, we have checked and made similar adjustments to all the tables in the paper.
Comment 6: Lines 336-338: Please specify the name of your proposed algorithm here (as mentioned in the experiments) in comparison to the baselines.
Response 6: In response to your feedback, we have updated lines 345-348 to include the name of our proposed algorithm, SeSAC, ensuring that it aligns with the references in the experiments section and provides a clear distinction from the baselines.
Comment 7: Lines 464, 465: For 2.0km and 0.5km, please add a space between the number and unit for consistency.
Response 7: We have made the necessary corrections to lines 464 and 465 as suggested, ensuring a space between the number and unit. Additionally, we have reviewed the entire manuscript to find similar discrepancies and rectified them for consistency.
Comment 8: Figures 9, 11, 13: Please condense the figure captions.
Response 8: Thank you for the suggestion. We have revised the captions for Figures 9, 11, and 13 to be more concise while retaining all essential information.
Author Response File: Author Response.docx