Article
Peer-Review Record

Anti-Interception Guidance for Hypersonic Glide Vehicle: A Deep Reinforcement Learning Approach

Aerospace 2022, 9(8), 424; https://doi.org/10.3390/aerospace9080424
by Liang Jiang 1,*, Ying Nan 1, Yu Zhang 2 and Zhihan Li 1
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 8 April 2022 / Revised: 26 July 2022 / Accepted: 1 August 2022 / Published: 4 August 2022
(This article belongs to the Section Aeronautics)

Round 1

Reviewer 1 Report

The authors propose a DRL-based anti-interception guidance system that can help a hypersonic glide vehicle (HGV) evade attacks from multiple interceptors. To do so, they model the problem as an MDP and propose a new algorithm, called RBT-DDPG (an extension of DDPG), in order to autonomously learn anti-interception guidance.

The authors seem to have full knowledge of the specific scientific area, and present three different categories of solutions to the problem of anti-interception guidance: procedural guidance, fly-around guidance, and active evading guidance. Moreover, they provide a detailed explanation of the weak points of those methods by presenting related works.

Then, they explain why they use DRL and present a few studies in the domain. However, to the best of their knowledge, there are no works on the application of DRL to HGV anti-interception guidance.

 

Goals

The goals are well-established, and are summarized here:

-     Describe the anti-interception guidance of an HGV as an optimization problem: They provide a detailed explanation of the problem, using equations from physics, and present how the system of equations can be cast and solved as an optimization problem.

-     Propose an extension of DDPG, RBT-DDPG, to solve the optimization problem: The proposed method achieves faster policy convergence and better performance in the training of DDPG. Specifically, they argue that the Actor network's parameters are affected by the loss provided by the Critic network, so the proposed method improves the update strategy of the Critic network (Q) by using repetitive training on batches whose loss exceeds a threshold (see the illustrative sketch below). (This seems like a form of clipping on the loss function; maybe they can rewrite this paragraph in Section 3.1 with more details.)
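For readers unfamiliar with the idea, the following is a minimal sketch of what such a repetitive-batch-training critic update could look like. It is written against a generic PyTorch setup; the function name rbt_critic_update, the network shape, the loss threshold, and the repeat cap are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: repeat the critic update on a batch while its
# loss stays above a threshold. Names, shapes, and hyperparameters are
# assumptions for illustration, not the authors' actual implementation.
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(critic.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def rbt_critic_update(state_action, td_target, loss_threshold=0.5, max_repeats=5):
    """Train the critic repeatedly on the same batch until its loss
    drops below the threshold or the repeat budget is exhausted."""
    for _ in range(max_repeats):
        loss = loss_fn(critic(state_action), td_target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() <= loss_threshold:  # batch learned well enough; stop early
            break
    return loss.item()

# Example call with random stand-ins for replay-buffer samples and TD targets.
batch = torch.randn(32, 8)    # 32 concatenated state-action vectors
targets = torch.randn(32, 1)  # 32 temporal-difference targets
print(rbt_critic_update(batch, targets))
```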

 

Comparison

The comparison is made against the standard version of DDPG, because there are no other works (to the best of their knowledge; I did not find any similar work either) that try to solve the same problem. From the reported results, there seems to be a significant difference in success rate between standard DDPG and the proposed variation of DDPG.

 

Presentation & Structure

The presentation is good and they seem to have knowledge of the domain. The language (grammar / vocabulary) is also good.  However, there are some points that they need to check:

-        To present the related work, they use the word "Reference" before citing a study. This should be changed to something like "…the authors…".

-        "Reinforcement learning (RL) is a model-free algorithm" and "Deep reinforcement learning(DRL), which combines the advantages of DNNs and RL, is a method close to general artificial intelligence" in Section 1. RL is not an algorithm, and the latter is a bold statement.

-        In Section 2, paragraph 2, an "R" seems to be missing before points (1) and (2).

-        Some more details must be taken care of by the authors, e.g., missing spaces in some cases and the use of the same format for all the equations.

 

Results

The authors provide an experimental comparison of the anti-interception success rate and the training episodes needed between DDPG and the proposed RBT-DDPG. In both cases, the proposed method outperforms the original DDPG, and the authors proceed to analyze the strategies learned by the method.

 

Author Response

Q1: To present the related work, they use the word "Reference" before citing a study. This should be changed to something like "…the authors…".

A1: We are extraordinarily respectful of these existing studies and have modified the format of the citations.

Q2: "Reinforcement learning (RL) is a model-free algorithm" and "Deep reinforcement learning(DRL), which combines the advantages of DNNs and RL, is a method close to general artificial intelligence" in Section 1. RL is not an algorithm, and the latter is a bold statement.

A2: We have removed this sentence in the revised manuscript.

Q3: In Section 2, paragraph 2, an "R" seems to be missing before points (1) and (2).

A3: We have corrected this typo in the revised manuscript.

Q4: Some more details must be taken care of by the authors, e.g., missing spaces in some cases and the use of the same format for all the equations.

A4: We have done a thorough proofreading.

Reviewer 2 Report

The paper presents an interesting application of a deep reinforcement learning method, namely the implementation of the DDPG algorithm for avoiding incoming interceptors. The methodology used is clear, albeit described in a somewhat confusing manner. The achieved results are presented and their effects shown. However, there are some issues that should be clarified before the paper can be selected for publication.

  1. The authors should perform proofreading of the paper. The paper contains a lot of typos, incomplete and awkward sentences, as well as silly mistakes: spaces are not observed before parentheses, words are written in capital letters in the middle of sentences, thoughts are left incomplete, and so on.
  2. The language in the paper is unnecessarily complex. Simple ideas are explained in confusing language, and in this reviewer's opinion too much background information is given. The introductions of the MDP and DDPG algorithms could just as easily have been a single reference. This just bloats the paper, and the authors' contribution is lost in the wall of text. Simple explanations are better.
  3. The paper's novelty is not clear. This should be clearly stated and expressed in the paper. As far as I can tell, the novelty is two-fold: the introduction of RBT and the introduction of DRL for HGV control. While the RBT method is interesting, HGV control in its simplest form is a goal-oriented collision avoidance problem with a single action dimension. This has already been solved by multiple other approaches with dynamic and static obstacles in higher-dimensional input and output spaces. Therefore, this seems like a simpler implementation and solution to an already solved problem. As such, the contribution here would be the specific implementation for HGVs, but that should be stated and explained.
  4. There is a lack of references regarding DDPG implementations. Most of the references are related to HGVs in different ways, but, as mentioned, similar implementations and problems have been solved mainly in mobile robotics. These references should also be discussed, along with how the proposed approach differs from them.
  5. The static input space dealing with exactly two incoming interceptors means that the method is very static, and the obtained results are not general and cannot be extended to real implementations. It is an interesting approach, and this problem is mentioned in the article, but it diminishes its scientific importance. Methods for making it more general should be discussed.
  6. The results seem to have been taken from the same samples as the training results, which makes the generalizability doubtful and the results unreliable. Results from more dynamic and previously untrained scenarios should also be presented.
  7. Most importantly, comparison with only DDPG is not sufficient. Other baselines and state-of-the-art methods should be compared. Moreover, the TD3 network is a well-known DDPG-based architecture that solves a problem similar to the one addressed by the introduced RBT. At least a comparison with that should be made.
  8. Figure captions need to be clear and complete descriptions of the image. The captions in the paper lack descriptive information about the images.

Author Response

Thank you very much for your attention and comments on our paper. We have revised the manuscript according to your advice.

Q1: The authors should perform proofreading of the paper. The paper contains a lot of typos, incomplete and awkward sentences, as well as silly mistakes: spaces are not observed before parentheses, words are written in capital letters in the middle of sentences, thoughts are left incomplete, and so on.

A1: We have done a thorough proofreading.

 

Q2: The language in the paper is unnecessarily complex. Simple ideas are explained in confusing language, and in this reviewer's opinion too much background information is given. The introductions of the MDP and DDPG algorithms could just as easily have been a single reference. This just bloats the paper, and the authors' contribution is lost in the wall of text. Simple explanations are better.

A2: Since DRL has not been studied in the area of anti-interception guidance, we would like to present DRL in a complete manner. We have tried to remove some complex and unnecessary statements in the revised manuscript. We ask to keep a small amount of content on MDP and DDPG: firstly, it does not take up much space, and secondly, queries about MDP-related concepts still appear in other review comments, so its deletion would seriously impact the explanation of RBT.

Q3: The paper's novelty is not clear. This should be clearly stated and expressed in the paper. As far as I can tell, the novelty is two-fold: the introduction of RBT and the introduction of DRL for HGV control. While the RBT method is interesting, HGV control in its simplest form is a goal-oriented collision avoidance problem with a single action dimension. This has already been solved by multiple other approaches with dynamic and static obstacles in higher-dimensional input and output spaces. Therefore, this seems like a simpler implementation and solution to an already solved problem. As such, the contribution here would be the specific implementation for HGVs, but that should be stated and explained.

A3: Unlike control objects such as cars, robots, or UAVs, most of the existing studies on anti-interception guidance focus on vertical maneuvers due to the complexity of HGV flight motion, and thus the action space is only one-dimensional. From the perspective of computational speed, traditional dynamic or static methods are not applicable to HGVs due to their highly dynamic characteristics and the limitations of on-board computational capability, which is reflected in the introduction and the last paragraph of Section 4.2. From the perspective of adaptability to the environment, the conventional approach (differential game) is also compared with the strategy chosen by DRL in Section 4.3.2. We have added content about novelty in the revised manuscript.

Q4: There is a lack of references regarding DDPG implementations. Most of the references are related to HGVs in different ways, but, as mentioned, similar implementations and problems have been solved mainly in mobile robotics. These references should also be discussed, along with how the proposed approach differs from them.

A4: In the new introductory section, we have added references regarding DDPG implementations and discussed how the proposed approach differs from them.

Q5: The static input space dealing with exactly two incoming interceptors means that the method is very static, and the obtained results are not general and cannot be extended to real implementations. It is an interesting approach, and this problem is mentioned in the article, but it diminishes its scientific importance. Methods for making it more general should be discussed.

A5: As mentioned in Remark 4, in most cases an HGV will face only two interceptors in the airspace delineated in this paper, which is a common interception rule adopted by national air defense systems. This paper at least demonstrates that the strategy is effective in most cases. In future work, we will study the many-to-many case.

Q6: The results seem to have been taken from the same samples as the training results, which makes the generalizability doubtful and the results unreliable. Results from more dynamic and previously untrained scenarios should also be presented.

A6: We demonstrate in Section 4.3 that DRL can obtain a feasible strategy over the whole range of initial conditions, as shown in Figure 9. The performance of both sides when the HGV actually performs penetration does not differ much from that in the training scenario (bearing in mind that it takes many years to develop a higher-performance HGV or interceptor). The initial conditions of the HGV are constrained by dynamic pressure, heat flux density, etc., and readers in the field of HGVs may not be interested in ultra-low-altitude or exo-atmospheric penetration, but only in anti-interception guidance at 25-60 km altitude.

Q7: Most importantly, comparison with only DDPG is not sufficient. Other baselines and state-of-the-art methods should be compared. Moreover, the TD3 network is a well-known DDPG-based architecture that solves a problem similar to the one addressed by the introduced RBT. At least a comparison with that should be made.

A7: We have tested the effect of the RBT mechanism on TD3. The results show that RBT-TD3 has a faster learning speed compared to the original TD3.

Q8: Figure captions need to be clear and complete descriptions of the image. The captions in the paper lack descriptive information about the images.

A8: Some descriptions have been moved from the body text to the captions.

Reviewer 3 Report

This article studies an interesting problem and it is well-organized.

Please check English language and style, minor spell check required.

Author Response

Q1: Please check English language and style, minor spell check required.

A1: In the revised manuscript, we have done a thorough proofreading.

Reviewer 4 Report

The manuscript presents a novel deep RL approach for anti-interception guidance. In particular, the authors compare the proposed RBT-DDPG algorithm to traditional DDPG, showing that the proposed algorithm can achieve greater and faster rewards than DDPG. I think this is an interesting and sound paper; however, in some parts it is not well organized and therefore hard to read. Additional effort in the presentation probably needs to be made in order to accept the paper. I have a few recommendations that might allow improving the manuscript.

1. The abstract lacks context. It is important that the authors start the paper by discussing the research area, the problem addressed, and why it is important to address that problem. Currently, the authors start by directly describing their approach.

2. Throughout the Introduction section, many articles are cited as Reference [X]. It is not very clear how these works relate to the particular problem addressed in this paper. Additionally, I believe it is important to give credit to previous authors. I would recommend using references such as Smith et al. [X] ...

3. Equation 1 shows only the right side of the equation; there is no left side or anything showing what it is equal to, such as x = { ....

4. On page 5, from "The initial state:" onwards, I really do not get the point of all these equations, regardless of whether some of them are used or referenced later in other sections. The problem is that no text explains this in Section 2.

5. It looks odd to me to have a different number of training episodes for DDPG and RBT-DDPG, as shown in Fig. 4. I understand that RBT-DDPG surpasses DDPG at 11714 episodes; however, it is not clear to me what the stopping criterion is.

6. The Conclusion section briefly discusses limitations and future work. I think using multi-agent DRL is an interesting idea that the authors should elaborate on. Additionally, RL is generally well known to suffer from long training times, as also shown in these results. Other approaches, such as expert advice [1, 2] or action affordance [3], might benefit RL convergence. Including a discussion of other approaches in the Conclusion section would make this stronger.

[1] Cruz et al. Agent-advising approaches in an interactive reinforcement learning scenario. 

[2] Bignold et al. Human engagement providing evaluative and informative advice for interactive reinforcement learning.

[3] Cruz et al. Learning contextual affordances with an associative neural architecture. 

 

 

Author Response

Thank you very much for your attention and comments on our paper. We have revised the manuscript according to your advice.

Q1: The abstract lacks context. It is important that the authors start the paper by discussing the research area, the problem addressed, and why it is important to address that problem. Currently, the authors start by directly describing their approach.

A1: We have added the research area and the problem to the abstract of the revised manuscript.

Q2: Throughout the Introduction section, many articles are cited as Reference [X]. It is not very clear how these works relate to the particular problem addressed in this paper. Additionally, I believe it is important to give credit to previous authors. I would recommend using references such as Smith et al. [X] ...

A2: We present many existing research articles on anti-interception guidance methods in the introduction. By presenting and analyzing them, it is possible to see why the existing anti-interception guidance methods cannot be applied in engineering or are not effective. Therefore, we introduce DRL, a brand-new method that has never been applied in the research area of anti-interception guidance. We are extraordinarily respectful of these existing studies and have modified the format of the citations.

Q3: Equation 1 shows only the right side of the equation; there is no left side or anything showing what it is equal to, such as x = { ....

A3: Equation 1 is a system of differential equations consisting of four equations. The left-hand side of each equation is the differential term. Since this system of differential equations is not written in matrix form, we do not feel it is necessary to add a term to the left of its brace.
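For illustration only, a generic planar point-mass model of a similar shape (not necessarily the exact Equation 1 in the paper, whose state variables and forces may differ) would read:

```latex
% Illustrative only: a generic planar point-mass model, not necessarily
% the paper's Equation 1. Each line has a differential term on the left,
% and a single brace groups the four equations.
\begin{equation}
\left\{
\begin{aligned}
\dot{V}      &= -\frac{D}{m} - g\sin\gamma,\\
\dot{\gamma} &= \frac{L}{mV} - \frac{g\cos\gamma}{V},\\
\dot{x}      &= V\cos\gamma,\\
\dot{h}      &= V\sin\gamma.
\end{aligned}
\right.
\end{equation}
```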

Q4: On page 5, from "The initial state:" onwards, I really do not get the point of all these equations, regardless of whether some of them are used or referenced later in other sections. The problem is that no text explains this in Section 2.

A4: We have added text explaining "The initial state" in the revised manuscript.

 

Q5: It looks odd to me to have a different number of training episodes for DDPG and RBT-DDPG, as shown in Fig. 4. I understand that RBT-DDPG surpasses DDPG at 11714 episodes; however, it is not clear to me what the stopping criterion is.

A5: We tried to show that even though DDPG takes more training episodes (about 7000 more than RBT-DDPG), its cumulative reward still fails to converge to a smaller variance, reflecting that DDPG is not only slower in training but also less robust. We adopt a consistent number of training episodes in the revised manuscript.

Q6: The Conclusion section briefly discusses limitations and future work. I think using multi-agent DRL is an interesting idea that the authors should elaborate on. Additionally, RL is generally well known to suffer from long training times, as also shown in these results. Other approaches, such as expert advice [1, 2] or action affordance [3], might benefit RL convergence. Including a discussion of other approaches in the Conclusion section would make this stronger.

A6: We have adopted these excellent suggestions in the revised manuscript.

Round 2

Reviewer 2 Report

The authors have addressed the pressing issues in the paper and improved the presentation significantly. The updated references also correspond to what would be required from an article tackling not only HGVs but specifically the neural network method of general collision avoidance. While a couple of minor grammatical errors are still present, I think this paper can be accepted for publication after going through them with the editor.
