Design of a Reinforcement Learning-Based Speed Compensator for Unmanned Aerial Vehicle in Complex Environments
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper presents a DDPG-based rotational speed compensator for UAV altitude control during shipboard landing operations. The work addresses an important practical problem in maritime UAV operations and demonstrates the application of deep reinforcement learning to UAV control systems. The authors provide comprehensive simulation studies and show improvements in altitude tracking performance compared to constant RPM strategies. However, the paper has several limitations that need to be addressed before publication, particularly regarding safety guarantees, cooperative control considerations, and computational efficiency.
Specific Comments:
- The paper only considers single UAV control without addressing the potential benefits of cooperative control strategies. Recent work by Xue et al. on "Cooperative Game-based Optimal Shared Control of Unmanned Aerial Vehicle" demonstrates that formulating UAV control as a cooperative non-zero-sum game between operators and UAV systems can significantly enhance performance while reducing operator workload. The authors should consider incorporating such cooperative game-theoretic frameworks to optimize the interaction between the DDPG controller and human operators during critical landing phases.
- While the paper addresses altitude control optimization, it fails to provide finite-time safety guarantees crucial for shipboard landing scenarios. The work by Li et al. on "Hierarchical Optimal Synchronization for Linear Systems via Reinforcement Learning: A Stackelberg-Nash Game Perspective" shows that incorporating barrier functions and finite-time stability analysis into reinforcement learning frameworks can ensure safety constraints are maintained. The proposed DDPG approach would benefit from similar finite-time safe reinforcement learning techniques to guarantee safe landing operations within specified time bounds.
- The continuous online learning nature of DDPG may lead to excessive computational overhead. The authors should consider implementing dynamic event-triggered mechanisms, as demonstrated in recent literature (e.g., Tan et al., 2025), to reduce the computational burden while maintaining control performance; a minimal sketch of such a triggering rule is appended at the end of this report. Such mechanisms have proven effective in UAV systems for minimizing unnecessary control updates while preserving system stability.
- This paper relies too heavily on simulation results and lacks sufficient experimental validation. The complex marine environment presents numerous challenges that may not be adequately reflected in simulation, including communication delays, sensor noise, and hardware limitations. More comprehensive experimental validation would strengthen the contribution of this paper. If experimental validation is impractical, please provide additional explanation.
- The paper lacks a rigorous theoretical analysis of the convergence guarantees and stability bounds of the proposed DDPG-based controller. If possible, the authors should provide Lyapunov stability analysis and convergence proofs, or at least a relevant discussion of stability.
- Although the paper compares variable RPM strategies with constant RPM strategies, it does not compare them with other advanced control methods (such as model predictive control, adaptive control, or other modern reinforcement learning algorithms). A more comprehensive comparative study would better highlight the advantages of the proposed method.
- Figure quality could be improved; some figures are too blurry (Figs. 4, 5, and 12).
- The reward function design section could benefit from a more detailed justification of the parameter choices.
While the paper addresses an important and practical problem in UAV control, significant revisions are required before it can be considered for publication.
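As a concrete illustration of the dynamic event-triggered mechanism mentioned in the third comment above, the following is a minimal sketch in which the compensator command is recomputed only when the tracking error has drifted sufficiently since the last trigger instant. The threshold form, sampling rate, and error signal are illustrative assumptions, not part of the paper under review.

```python
import numpy as np

def event_triggered_commands(error, dt=0.01, sigma=0.05, eta0=0.2, decay=0.1):
    """Return a boolean mask marking the steps at which the compensator
    command would be recomputed. Between trigger instants the previous
    command is held; the threshold combines a static margin (sigma) with a
    decaying term eta0 * exp(-decay * t)."""
    triggers = np.zeros(len(error), dtype=bool)
    last = error[0]
    triggers[0] = True
    for k in range(1, len(error)):
        threshold = sigma + eta0 * np.exp(-decay * k * dt)
        if abs(error[k] - last) >= threshold:   # event condition met
            triggers[k] = True
            last = error[k]
    return triggers

# Example: a decaying oscillatory altitude error sampled at 100 Hz.
t = np.arange(0.0, 15.0, 0.01)
err = np.exp(-0.2 * t) * np.sin(2 * np.pi * 0.5 * t)
mask = event_triggered_commands(err)
print(f"policy evaluations: {mask.sum()} of {len(t)} control steps")
```

With such a rule the actor network is evaluated only at the flagged steps, which is the source of the computational savings claimed for event-triggered schemes.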
Author Response
Please refer to the attachment for the modification description.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
• The RPM channel models rotor speed as an instantly commanded input. Motor dynamics, rotor inertia, and torque limits are omitted, and the reward penalizes only dΩ/dt, which tacitly assumes an actuator with effectively unbounded bandwidth, an incomplete abstraction given the physics of engines and governors (a sketch of the kind of actuator dynamics omitted here is given after this list of comments). Table 3 sets the domain from 24 to 30 rad/s, Figure 7b displays step-like changes at a time step of 0.01 s, and Section 3.1 describes a compensator issuing direct commands, so the commanded RPM can jump abruptly between consecutive steps. Together these choices make transfer to hardware uncertain and may inflate the apparent disturbance rejection.
• The definitions of ΔΩ and the action mapping are inconsistent. Section 3.1 states Ωc = ΩN + ΔΩ, so the speed oscillates around the rotor's nominal value in all modes of operation. Equation 35 instead maps the Tanh output directly onto [Ωmin, Ωmax] as ΔΩ. If ΩN is then added, as Section 3.1 implies, Ωc can exceed the permitted range; if the mapping already yields Ωc, the notation is wrong, and the ambiguity persists across the text (see the mapping sketch after this list). Figure 1 shows no explicit summation point for the compound command, a genuine ambiguity.
• Reward and training settings lack sensitivity checks. Coefficients αh1 through αh5 are much larger than αh, which creates strong stepwise rewards near zero error while the dynamic penalties remain small, and no ablation separates these effects. Table 4 sets a target network update factor of 0.99 and repeats an action noise parameter, a configuration seldom used in DDPG. Each episode lasts 15 seconds, yet the test step occurs at 30 seconds in Equation 36, which creates a distribution shift between training and evaluation. Figure 6 shows light blue spikes over a dark blue moving average, but no independent runs or variance are reported, leaving run-to-run variability unexamined.
• The environment model and test conditions are weakly grounded and display inconsistencies. Equation 8 claims laboratory provenance but offers no specific citation or sea-state parametrisation, which limits realism and prevents a reproducible stress definition. Equation 9 drives the atmospheric disturbance with white noise through transfer functions without an identified source model. Vertical velocity in Figures 7c and 8c approaches about −20 m/s, while Table 1 bounds the observations between −15 and 15 m/s. Sections 4.3 and 4.4 rely on deck motion prediction data, yet the prediction method is not described, and the claims of good control at high sea state lack quantification.
• The experimental evaluation lacks baselines and statistical reporting, which weakens the claims. Only a constant rotor speed configuration is compared; there is no standard maritime landing controller as a reference condition. Key metrics are absent, including success rate, number of trials, standard deviation, and confidence intervals. Figures 10 and 11 invoke a safety boundary, yet no evaluation criterion is specified. The Data Availability section states "not applicable", which hinders replication and limits the probative value of the results.
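Regarding the first bullet, a minimal sketch of the kind of rotor-speed actuator model whose omission is being flagged: a first-order engine/governor lag with a slew-rate limit between the commanded and achieved RPM. The time constant and rate limit below are illustrative assumptions, not values taken from the paper; only the 24 to 30 rad/s range follows Table 3.

```python
import numpy as np

def rotor_speed_response(omega_cmd, dt=0.01, tau=0.5,
                         rate_limit=2.0, omega_min=24.0, omega_max=30.0):
    """Pass a commanded rotor-speed sequence (rad/s) through a first-order
    engine/governor lag with a slew-rate limit, so that step-like commands
    cannot be tracked instantaneously. tau and rate_limit are assumed."""
    omega = float(np.clip(omega_cmd[0], omega_min, omega_max))
    achieved = []
    for cmd in omega_cmd:
        cmd = np.clip(cmd, omega_min, omega_max)
        d_omega = (cmd - omega) / tau                        # first-order lag
        d_omega = np.clip(d_omega, -rate_limit, rate_limit)  # torque/governor limit
        omega += d_omega * dt
        achieved.append(omega)
    return np.array(achieved)

# A 27 -> 30 rad/s step command: the achieved speed ramps instead of jumping.
cmd = np.r_[np.full(100, 27.0), np.full(200, 30.0)]
print(rotor_speed_response(cmd)[[99, 150, 299]])
```

Inserting such a model between the learned compensation and the plant would make the reported disturbance rejection depend on a physically achievable RPM response.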
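The ambiguity raised in the second bullet can be stated in a few lines: if the Tanh output is scaled directly onto [Ωmin, Ωmax] and ΩN is then added, the compound command can leave the permitted range, whereas scaling onto [Ωmin − ΩN, Ωmax − ΩN] keeps it inside. The following is a hypothetical illustration; the symbols follow Section 3.1 and Equation 35, and the nominal value ΩN = 27 rad/s is an assumption.

```python
OMEGA_MIN, OMEGA_MAX, OMEGA_N = 24.0, 30.0, 27.0  # rad/s; OMEGA_N is assumed

def delta_as_in_eq35(a):
    # Literal reading of Equation 35: Tanh output a in [-1, 1] mapped onto
    # [Omega_min, Omega_max].
    return OMEGA_MIN + 0.5 * (a + 1.0) * (OMEGA_MAX - OMEGA_MIN)

def delta_consistent_with_sec31(a):
    # Alternative reading: map onto [Omega_min - Omega_N, Omega_max - Omega_N],
    # so that Omega_c = Omega_N + delta stays inside the permitted range.
    lo, hi = OMEGA_MIN - OMEGA_N, OMEGA_MAX - OMEGA_N
    return lo + 0.5 * (a + 1.0) * (hi - lo)

a = 1.0                                           # extreme actor output
print(OMEGA_N + delta_as_in_eq35(a))              # 57.0 rad/s, outside [24, 30]
print(OMEGA_N + delta_consistent_with_sec31(a))   # 30.0 rad/s, on the bound
```

Stating explicitly which of the two readings Equation 35 intends, and showing the summation point in Figure 1, would resolve the inconsistency.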
Author Response
Please refer to the attachment for the modification description.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This paper designs a rotational speed compensator (RPM compensator) based on Deep Deterministic Policy Gradient (DDPG) to improve the altitude control performance of unmanned helicopters (UAVs) during shipboard landing in complex marine environments, and verifies its effectiveness through simulation.
Compared with conventional constant RPM control, the proposed method improves altitude tracking accuracy, responsiveness, and stability under disturbance, and demonstrates particularly favorable results in turbulent conditions and complex ship motion scenarios. This is a meaningful study that can contribute to enhancing the operational safety of shipborne UAVs. On the other hand, there are several concerns, and I recommend revision and resubmission after the following points are addressed.
1. The study is limited to simulation-based verification, without considering practical factors such as applicability in real-world environments, sensor/actuator delays, or wind speed measurement errors. A discussion of potential challenges and countermeasures for real-world implementation would enhance the persuasiveness of the work.
2. The comparison is only made with “constant RPM control,” and does not include other altitude control methods (e.g., PID with feedforward compensation, adaptive control, MPC). Without such comparisons, the superiority of the DDPG approach can only be partially evaluated.
3. The coefficients in the reward function, network structure, learning rates, and other hyperparameters are empirically set, but the tuning procedure and selection rationale are not stated. Including the parameter adjustment process (initial values, number of trials, evaluation criteria, adjustment strategy, etc.) and sensitivity analysis results would improve reproducibility and generalizability.
Author Response
Please refer to the attachment for the modification description.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
I have no further comments.
Author Response
The reviewers have no further comments. We appreciate the reviewers' previous suggestions.
Reviewer 2 Report
Comments and Suggestions for Authors
I read this paper as shifting the learning target in shipboard landing, away from the usual practice of holding rotor speed nearly constant while shaping altitude with collective and classical or model-based loops, toward a DDPG policy that learns a bounded rotor speed compensation integrated into a full landing stack with an online LSTM deck motion estimator, as laid out in Figure 1 on page 6 and in the action mapping of Equation 35 with bounds in Table 3 on page 17. In the turbulence tests the learned compensator trims the steady altitude ripple from about −2 to +2 m down to roughly −0.5 to +0.5 m, and it outperforms both a constant speed baseline and a single MPC setting; see Figure 8 and Figure 9 on pages 20 and 21. I view that focus on RPM compensation within the maritime landing loop as the core novelty.

I would encourage the authors to tighten several points. The controller commands rotor speed without a governor or engine torque model, so transfer to airframes that regulate RPM tightly remains uncertain. I ask the authors to specify the MPC horizon, constraints, and tuning to make the comparison fair. I also ask for an ablation of the reward design, which currently mixes staged bonuses near zero error with several penalties; see Table 2 on page 16. I suggest cleaning Table 4 on page 18, which repeats the target action noise entry and includes the line "Target network action output on 1", which needs definition. Finally, the claim on page 25 that online learning meets real-time requirements would benefit from timing and hardware numbers, and the data availability entry marked not applicable on page 26 limits independent checks.

Overall, I credit the work for embedding a learned RPM compensator in a realistic landing pipeline and for the gains it shows under simulated disturbance, and I ask for stronger realism and reporting to support adoption.
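On the real-time point, a minimal sketch of the kind of timing evidence requested: measuring the per-step wall-clock cost of one actor forward pass against the 0.01 s control period. The network sizes, input dimension, and framework below are assumptions for illustration, not taken from the paper, and the measured figure should of course be reported together with the hardware used.

```python
import time
import torch

# Placeholder actor; layer sizes and the 12-dimensional observation are assumed.
actor = torch.nn.Sequential(
    torch.nn.Linear(12, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 1), torch.nn.Tanh())
obs = torch.randn(1, 12)

# Warm up, then time repeated inference steps.
for _ in range(10):
    actor(obs)
n = 1000
t0 = time.perf_counter()
with torch.no_grad():
    for _ in range(n):
        actor(obs)
per_step_ms = (time.perf_counter() - t0) / n * 1e3
print(f"actor forward pass: {per_step_ms:.3f} ms per control step "
      f"(budget: 10 ms at a 0.01 s step)")
```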
Author Response
Please refer to the attached PDF for the revision notes.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have generally responded well to the reviewer's comments. Therefore, I recommend accepting this paper for publication.
Author Response
The reviewers have no further comments. We appreciate the reviewers' previous suggestions.
Round 3
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have made great efforts in responding to our concerns. Thank you for that; I believe the manuscript now reaches a level suitable for acceptance.
