Author Contributions
Conceptualization, Z.T. and B.Z.; methodology, Z.T.; software, Z.T.; validation, Z.T., S.H. and H.F.; formal analysis, Z.T.; investigation, Z.T.; resources, B.Z.; data curation, S.H.; writing—original draft preparation, Z.T.; writing—review and editing, W.Z.; visualization, H.F.; supervision, W.Z.; project administration, B.Z.; and funding acquisition, B.Z. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Safety-constrained dual-timescale reinforcement learning framework. The framework consists of the Agent (step-level and episode-level learning) and the Environment (Env). The quantities labeled in the diagram are the current state; the action generated by the Actor; the value function estimated by the Critic; the Lagrange multiplier that dynamically adjusts the safety penalty; the expected cumulative safety cost; the episode-level total cost and composite return, both fed back to the Critic; the step-level reward; the step-level cost (aggregated into the episode-level total cost); and the intermediate and final step PID gains, which are processed through safety mapping and exponential moving average (EMA) smoothing.
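The gain path described in this caption (safety mapping followed by EMA smoothing) can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes the raw Actor output lies in [-1, 1], uses an affine map into the box bounds listed in Table 2 as a simplified stand-in for the LMI safety domain, and uses a hypothetical smoothing factor `alpha`. The names `safety_map` and `ema_smooth` are illustrative.

```python
import numpy as np

# Box bounds copied from Table 2, in table row order.
LOWER = np.array([0.2, 1.0, 3.0, 0.05])
UPPER = np.array([4.5, 10.0, 25.0, 0.55])


def safety_map(action, lower=LOWER, upper=UPPER):
    """Map a raw action to intermediate gains inside [lower, upper].

    Assumption: the action is nominally in [-1, 1]; values outside are clipped,
    then mapped affinely so that -1 -> lower and +1 -> upper.
    """
    a = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
    return lower + 0.5 * (a + 1.0) * (upper - lower)


def ema_smooth(prev_gains, new_gains, alpha=0.1):
    """One exponential-moving-average step (alpha is a hypothetical value)."""
    prev_gains = np.asarray(prev_gains, dtype=float)
    new_gains = np.asarray(new_gains, dtype=float)
    return (1.0 - alpha) * prev_gains + alpha * new_gains


# Start from the box midpoint (a zero action), then apply two example actions.
gains = safety_map(np.zeros(4))
for raw_action in ([1.0, -1.0, 0.5, 2.0], [0.2, 0.0, -0.3, 0.1]):
    intermediate = safety_map(raw_action)    # intermediate gains
    gains = ema_smooth(gains, intermediate)  # final, smoothed gains
```

Because each EMA step is a convex combination of two points inside the box, the smoothed gains cannot leave the box as long as the initial gains start inside it.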
Figure 2.
Overall hierarchical framework encompassing LPV modeling, LMI safety domain construction, and two-phase RL training. Labels on the interconnections delineate the progression of mathematical models, parameter bounds, and control policies across the integrated steps.
Figure 3.
Pole-placement verification (fixed inner-loop parameters).
Figure 4.
Three-dimensional cross-section of the longitudinal outer-loop parameter feasible region.
Figure 5.
Reward-curve evolution across five-stage offline curriculum training. Different colors distinguish the five training stages; in each subplot, the solid line indicates the mean episode reward and the shaded region represents the standard deviation.
Figure 6.
LMI constraint-ablation comparison (Stage 5). (a) Episode-reward-training curves and (b) constraint violations per episode. Dashed lines indicate the mean over the last 1000 episodes.
Figure 7.
Stage 6: Gazebo online fine-tuning training curve. The shaded colored area represents the standard deviation of the episode reward. The smoothed episode reward (window = 2000 episodes) increases from the start to the end of training, demonstrating steady policy improvement under fixed-bias parameters and sensor noise.
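The caption does not state which smoothing method produces the curve; a plain moving average over complete windows is one common choice and is assumed here. The sketch below uses synthetic rewards purely for illustration (they are not data from the paper), and `moving_average` is a hypothetical helper name.

```python
import numpy as np


def moving_average(rewards, window):
    """Mean over each complete window of `window` consecutive episodes."""
    rewards = np.asarray(rewards, dtype=float)
    if window < 1 or window > rewards.size:
        raise ValueError("window must be between 1 and len(rewards)")
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the window fits entirely.
    return np.convolve(rewards, kernel, mode="valid")


# Synthetic episode rewards: a rising trend plus noise (illustration only).
rng = np.random.default_rng(0)
episode_rewards = np.linspace(-100.0, -50.0, 10_000) + rng.normal(0.0, 5.0, 10_000)
smoothed = moving_average(episode_rewards, window=2000)
```

With a 2000-episode window, the smoothed curve has `len(rewards) - 1999` points, so the first plotted value already averages the first 2000 episodes.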
Figure 8.
Performance comparison of longitudinal and lateral channel policy transfer. (a) Representative velocity tracking curves (wind speed 1.5 m/s); (b) RMSE mean ± standard deviation comparison across four control schemes; and (c) time-averaged gain value comparison. The longitudinal RL-PID RMSE is 2.503 m/s; after lateral transfer, it increases to 2.912 m/s (+16.3%), which is below the 20% feasibility threshold.
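The quoted percentage follows directly from the two RMSE values in the caption, assuming the increase is measured relative to the longitudinal RMSE:

```latex
\[
\frac{2.912 - 2.503}{2.503} = \frac{0.409}{2.503} \approx 0.163 = 16.3\% < 20\%
\]
```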
Figure 9.
High-speed step-response comparison. (a) Velocity response; (b) pitch angle; (c) pitch rate; and (d) adaptive-gain curves. The dashed line in (a) represents the reference command as indicated in the legend; the dashed lines in (b,c) represent the initial states; and the dashed line in (d) represents the fixed PID parameter values.
Figure 10.
Emergency-braking dynamic-response comparison. (a) Velocity response; (b) pitch angle; (c) pitch rate; and (d) adaptive-gain curves. The dashed line in (a) represents the reference command as indicated in the legend; the dashed lines in (b,c) represent the initial states; and the dashed line in (d) represents the fixed PID parameter values.
Figure 11.
Frequency-sweep-test response comparison. (a) Test A velocity response; (b) Test A pitch angle; (c) Test A adaptive-gain curves; (d) Test B velocity response; (e) Test B pitch angle; and (f) Test B adaptive-gain curves. Dashed lines denote reference commands in (a,d), initial states in (b,e), and Fixed-PID values in (c,f).
Figure 12.
Statistical distribution of mixed-trajectory test results. (a) Velocity RMSE; (b) velocity MAE; (c) maximum pitch angle; and (d) RMSE by trajectory type. Circles denote outliers, and black diamonds denote mean values.
Figure 13.
Computation–performance Pareto front across three benchmark scenarios. Error bars: mean ± std. Dashed grey lines connect Pareto-optimal points (circled). Lower-left is better.
Table 1.
Systematic comparison with representative related methods.
| Method | Safety Mechanism | Control Architecture | Policy Transfer | Distinction of This Work |
|---|---|---|---|---|
| Wang et al. [18] | No explicit constraints | RL outputs attitude commands | None | LMI safety domain for outer loop |
| Sönmez et al. [19] | Action clipping | RL predicts PD gains | None | LMI polytopic constraints; integral terms & transfer |
| Xue et al. [23] | Penalty reward | PPO outputs control commands | None | Fixed inner-loop PID; LMI safety domain |
| Saeed et al. [29] | LMI-LPV robust control | Fixed LPV gain scheduling | None | RL online adaptation within LMI domain |
| This work | LMI + Lagrangian PPO | Inner fixed + outer RL | Long.→Lat. transfer | — |
Table 2.
Longitudinal outer-loop parameter LMI safety domain.
| Parameter | Lower Bound | Upper Bound | Initial Value (Midpoint) |
|---|---|---|---|
| | 0.2 | 4.5 | 2.4 |
| | 1.0 | 10.0 | 5.5 |
| | 3.0 | 25.0 | 14.0 |
| | 0.05 | 0.55 | 0.3 |
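As a quick consistency check on the "Initial Value (Midpoint)" column, the sketch below recomputes each midpoint from the tabulated bounds. The rows are copied from Table 2 in order; the variable names are illustrative.

```python
# (lower, upper, initial) rows copied from Table 2, in table order.
TABLE_2 = [
    (0.2, 4.5, 2.4),
    (1.0, 10.0, 5.5),
    (3.0, 25.0, 14.0),
    (0.05, 0.55, 0.3),
]


def midpoint(lower, upper):
    """Arithmetic midpoint of a closed interval."""
    return 0.5 * (lower + upper)


midpoints = [midpoint(lo, hi) for lo, hi, _ in TABLE_2]

# Largest gap between the exact midpoint and the tabulated initial value.
max_gap = max(abs(m - init) for m, (_, _, init) in zip(midpoints, TABLE_2))

for (lo, hi, init), m in zip(TABLE_2, midpoints):
    print(f"[{lo}, {hi}] midpoint = {m:.3f}, tabulated initial = {init}")
```

The first row's exact midpoint is 2.35, so the tabulated 2.4 appears to be rounded to one decimal place; the other three rows match their midpoints.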
Table 3.
Five-stage offline curriculum training configuration.
| Stage | Reference Signal | Environment/Randomization | Training Objective | Steps |
|---|---|---|---|---|
| Stage 1 | Constant velocity m/s | No wind, widest constraints | Baseline tracking | 1.0 M |
| Stage 2 | Sinusoidal, 8 m/s, 0.3 Hz | No wind | Periodic dynamic tracking | 2.0 M |
| Stage 3 | Chirp, 0.1–0.5 Hz, 8–0 m/s | No wind | Wideband frequency response | 3.0 M |
| Stage 4 | Chirp (same as Stage 3) | Mild wind 0–2 m/s | Tracking under disturbance | 4.0 M |
| Stage 5 | Mixed chirp/sinusoidal/step | Full-domain rand., wind 0–2.5 m/s | Generalization | 5.0 M |
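For readers reproducing the schedule, the curriculum in Table 3 could be encoded as a simple configuration list. This is only a sketch: the `Stage` class and field names are hypothetical, the reference-signal descriptions are abbreviated, and the total assumes the Steps column gives a per-stage budget rather than a cumulative count, which the table does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    reference: str         # reference-signal type (abbreviated from Table 3)
    environment: str       # wind / randomization setting
    steps_millions: float  # value from the Steps column


CURRICULUM = [
    Stage("Stage 1", "constant velocity", "no wind, widest constraints", 1.0),
    Stage("Stage 2", "sinusoidal", "no wind", 2.0),
    Stage("Stage 3", "chirp", "no wind", 3.0),
    Stage("Stage 4", "chirp", "mild wind 0-2 m/s", 4.0),
    Stage("Stage 5", "mixed chirp/sinusoidal/step",
          "full-domain randomization, wind 0-2.5 m/s", 5.0),
]

# Total offline budget under the per-stage interpretation (an assumption).
total_steps_millions = sum(stage.steps_millions for stage in CURRICULUM)
print(f"Total offline training steps: {total_steps_millions} M")
```

Under that reading, the five stages sum to 15.0 M steps, with each stage receiving a larger budget than the one before it.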
Table 4.
Performance summary at the conclusion of each offline training stage.
| Stage | Environment | Final Reward | Mean Velocity Error |
|---|---|---|---|
| Stage 1 | Simplified simulation | ≈−570 | 0.41 m/s |
| Stage 2 | Simplified simulation | ≈−4750 | 1.03 m/s |
| Stage 3 | Simplified simulation | ≈−2870 | 1.24 m/s |
| Stage 4 | Simplified simulation | ≈−2940 | 1.24 m/s |
| Stage 5 | Simplified + domain rand. | ≈−5460 | 1.47 m/s |
Table 5.
LMI constraint-ablation study statistics (Stage 5, last 1000 episodes).
| Metric | With LMI | Without LMI | Difference |
|---|---|---|---|
| Constraint violations/ep | | | |
| Episode Reward | | | — † |
Table 6.
Step-response performance comparison (random wind 0–3 m/s; mean ± std). Bold values indicate the best performance.
| Metric | RL-PID | Fixed-PID | Improvement |
|---|---|---|---|
| Rise time (s) | | | |
| Settling time (s) | | | |
| Overshoot (%) | | | |
| Steady-state error (m/s) | | | |
| RMSE (m/s) | | | |
| MAE (m/s) | | | |
Table 7.
Emergency-braking performance comparison (random wind 0–3 m/s; mean ± std). Bold values indicate the best performance.
| Metric | RL-PID | Fixed-PID | Improvement |
|---|---|---|---|
| Braking time (s) | | | |
| Braking distance (m) | | | |
| Max. pitch angle (°) | | | |
Table 8.
Frequency-sweep-test performance comparison (random wind 0–3 m/s; mean ± std).
| Test Type | Metric | RL-PID | Fixed-PID | Improvement |
|---|---|---|---|---|
| Const.-amplitude | RMSE (m/s) | | | |
| | MAE (m/s) | | | |
| Const.-frequency | RMSE (m/s) | | | |
| | MAE (m/s) | | | |
Table 9.
Mixed-trajectory test overall statistics.
| Metric | RL-PID | Fixed-PID | Improvement |
|---|---|---|---|
| Velocity RMSE (m/s) | | | 15.0% |
| Velocity MAE (m/s) | | | 20.5% |
| Max. pitch angle (°) | | | |
Table 10.
Per-trajectory RMSE comparison (mean ± std, unit: m/s).
| Trajectory Type | n | RL-PID | Fixed-PID | Improvement |
|---|---|---|---|---|
| Constant velocity | 24 | | | |
| Sinusoidal | 24 | | | |
| Step | 30 | | | |
| Chirp | 22 | | | |
Table 11.
Summary of core performance improvements across four experiments.
| Experiment | Core Metric | RL-PID | Fixed-PID | Improvement |
|---|---|---|---|---|
| Step response | Overshoot (%) | | | 18.5% |
| | Steady-state error (m/s) | | | 62.8% |
| | MAE (m/s) | | | 17.1% |
| Emergency braking | Braking time (s) | | | 9.5% |
| | Braking distance (m) | | | 7.8% |
| Sweep (const.-amp.) | RMSE (m/s) | | | 7.5% |
| Sweep (const.-freq.) | RMSE (m/s) | | | 12.6% |
| Mixed | Velocity RMSE (m/s) | | | 15.0% |
| | Chirp RMSE (m/s) | | | 40.9% |