Article

Improving Control Performance of Tilt-Rotor VTOL UAV with Model-Based Reward and Multi-Agent Reinforcement Learning

by Muammer Ugur 1,2,* and Aydin Yesildirek 1
1 Mechatronics Engineering Department, Yildiz Technical University, Istanbul 34000, Turkey
2 Mechatronics Engineering Department, Kirklareli University, Kirklareli 39000, Turkey
* Author to whom correspondence should be addressed.
Aerospace 2025, 12(9), 814; https://doi.org/10.3390/aerospace12090814
Submission received: 17 July 2025 / Revised: 28 August 2025 / Accepted: 5 September 2025 / Published: 9 September 2025

Abstract

Tilt-rotor Vertical Takeoff and Landing Unmanned Aerial Vehicles (TR-VTOL UAVs) combine fixed-wing and rotary-wing configurations, offering optimized flight planning but presenting challenges due to their complex dynamics and uncertainties. This study investigates a multi-agent reinforcement learning (RL) control system utilizing Soft Actor-Critic (SAC) modules, which are designed to independently control each input with a tailored reward mechanism. By implementing a novel reward structure based on a dynamic reference response region, the multi-agent design improves learning efficiency by minimizing data redundancy. Compared to other control methods such as Actor-Critic Neural Networks (AC NN), Proximal Policy Optimization (PPO), Nonsingular Terminal Sliding Mode Control (NTSMC), and PID controllers, the proposed system shows at least a 30% improvement in transient performance metrics—including RMSE, rise time, settling time, and maximum overshoot—under both no wind and constant 20 m/s wind conditions, representing an extreme scenario to evaluate controller robustness. This approach has also reduced training time by 80% compared to single-agent systems, lowering energy consumption and environmental impact.

1. Introduction

Unmanned aerial vehicles (UAVs) have rapidly become integral across various industries, including surveillance, logistics, agriculture, and defense. This widespread adoption has driven the development of diverse UAV types, including fixed-wing, rotary-wing, vertical take-off and landing (VTOL), and tilt-rotor VTOL (TR-VTOL) platforms. TR-VTOL UAVs offer superior maneuverability but suffer from limited range and higher energy consumption [1]. Fixed-wing UAVs, on the other hand, are more energy-efficient and capable of longer range but require runways for take-off and landing, which limits their operational flexibility [2]. These constraints are particularly problematic in remote or military applications where infrastructure is sparse and continuous operation is essential [3].
TR-VTOL UAVs have emerged to address these trade-offs, combining the vertical flight capabilities of rotary-wing UAVs with the cruise efficiency of fixed-wing aircraft. Unlike conventional VTOLs that deactivate certain motors during mode transitions, TR-VTOLs utilize rotatable propulsion units that remain active in all phases, reducing structural load and vibration [4]. However, this hybrid configuration introduces additional control challenges. The transition between vertical and horizontal flight involves complex interactions among motor tilt dynamics, aerodynamic forces, and environmental disturbances, which complicate the control of both translational and rotational motion [5].
In response to these complexities, researchers have proposed various control strategies. Sheng et al. developed a compartmental aerodynamic model to characterize the dynamics of TR-VTOL UAVs and conducted a stability analysis in a simulation environment [6]. He et al. applied nonlinear sliding mode control to address external disturbances and parameter variations [7]. Similarly, Masuda et al. introduced an H-infinity-based control framework to enhance robustness against model uncertainties during the transition between flight modes [8]. In another study, Xie et al. proposed a robust attitude control method for a three-motor tilt-rotor UAV that achieves fixed-time convergence despite system uncertainties. Their control structure utilizes only input-output data within an actor-critic learning framework, enabling stable performance without full access to the system dynamics [9].
While model-based robust controllers offer improved performance in uncertain environments, their effectiveness relies heavily on accurate system models and tuning. In contrast, reinforcement learning (RL) has recently gained attention as a data-driven approach capable of learning optimal control policies through interaction with the environment. RL-based controllers can adapt to varying conditions and model uncertainties without requiring precise modeling. Several RL applications in UAV control have demonstrated promising results, including disturbance rejection, autonomous landing, and stabilization under complex conditions [10,11,12,13].
However, many of these approaches employ static reward functions, which may be insufficient for capturing the complexity of real-world dynamic environments. A static reward structure limits the agent’s ability to adapt its behavior when the operational context shifts. This limitation is particularly significant in TR-VTOL UAVs, where sudden aerodynamic changes, actuator nonlinearities, and time-varying disturbances are common [14,15].
To address this shortcoming, recent studies have explored the concept of dynamic reward functions, which allow the learning objective to evolve based on real-time performance feedback [16]. Nevertheless, most existing work on dynamic reward design is limited to basic UAV configurations and does not extend to tilt-rotor architectures [17]. Furthermore, the potential of multi-agent reinforcement learning (MARL) remains largely underutilized in this context. MARL enables the decomposition of complex control tasks into sub-agents, each responsible for a specific control axis such as roll, pitch, or yaw. This structure not only improves training convergence but also enhances control robustness by decoupling the policy learning process [18].
This study presents three main contributions. First, a multi-agent reinforcement learning structure was developed by assigning a separate learning agent to each control axis (roll, pitch, and yaw) of the TR-VTOL UAV, addressing its complex control requirements. This decomposition of control tasks enabled each agent to learn more quickly and efficiently, thereby reducing the overall training time and enhancing control performance. Second, a model-based dynamic reward function tailored to the vehicle’s dynamics was designed to guide the learning process by evaluating the system’s state at each training step. Compared to static reward structures, this approach improved both system stability and learning efficiency. Third, to simulate real-world challenges, random external disturbance forces were applied along the x, y, and z axes, allowing the system to demonstrate stable and reliable responses under dynamic environmental conditions. Collectively, these contributions show that the proposed method offers a robust control solution not only in terms of learning performance but also in terms of resilience to real-world uncertainties.
However, due to the complexity and risk of conducting real-world experiments on TR-VTOL UAVs, especially given the nonlinear dynamics and tilt transitions, this study was conducted in a simulation environment. Training reinforcement learning (RL) agents requires thousands of trial-and-error episodes, which is impractical and potentially unsafe in real-world settings. After validating the control framework in simulation, future work will focus on transferring the trained models to physical platforms through sim-to-real adaptation.

2. Modelling of VTOL

Although RL has found wide application in solving complex tasks for UAVs, training and adapting a policy for varying initial conditions and situations remain challenging. Policies learned in fixed environments using RL often prove fragile and prone to failure under changing environmental parameters. To address this, randomization techniques have been employed to develop the control policies for the TR-VTOL UAV [19]. The scenarios were designed taking into account random initial conditions, external disturbances, different axis angles, and variable height and speed inputs [20]. Throughout this process, both RL policies and the nominal reward function were continuously updated. The training process was completed after approximately 2500 episodes in the single-agent structure and approximately 120 episodes in the multi-agent structure, and these values are consistent with the scales commonly used in RL-based control literature to ensure policy stability [21].
RL, by its nature, relies on trial-and-error-based exploration, and agents initially perform stochastic and unpredictable actions while learning from environmental feedback. For TR-VTOL UAVs, such actions can increase the risk of hardware damage due to actuator saturation, sudden aerodynamic loads, or instability caused by rollover or ground effects. Even if the motors are oriented vertically and theoretically generate sufficient thrust for takeoff, the UAV may remain stationary due to inappropriate control signals from the RL agent. Excessive pitch or roll commands, ground effect disturbances, or center-of-gravity imbalances can redirect or counteract the vertical lift. In such cases, the actuators operate under high load without achieving the expected altitude change, leading to unnecessary stress on the hardware. Therefore, training was initiated in an environment where exploration is safe and repeatable, and disruptive effects can be controlled and incorporated into the simulation [22]. Training in a simulation environment not only reduces hardware damage but also shortens training time thanks to higher simulation speeds. It allows errors to be safely observed and analyzed, which contributes to improving the learning process. Additionally, the reproducible recreation of conditions enables the fair comparison of different algorithms and ensures methodological consistency [23].

2.1. Dynamic Equations of TR-VTOL

The dynamic equations of the aircraft are based on factors such as wing and body geometry, air density, airspeed, and the UAV’s lift force, drag, and directional stability. The thrust forces generated by the motors are calculated using thrust system equations. Based on the linear and angular momentum equations, the force and moment equations of the TR-VTOL UAVs are expressed as Equation (1) [24]:
\begin{bmatrix} F \\ M \end{bmatrix} = \begin{bmatrix} m I & 0 \\ 0 & J \end{bmatrix} \begin{bmatrix} \dot{v}_B \\ \dot{\omega}_B \end{bmatrix} + \begin{bmatrix} \omega_B \times m v_B \\ \omega_B \times J \omega_B \end{bmatrix}  (1)
Here, F and M represent the total external force and moment, I is the 3 × 3 identity matrix, J is the inertia matrix, m is the mass, and \dot{v}_B and \dot{\omega}_B are the linear and angular accelerations of the UAV in the body frame. The Euler angles indicate the orientation of the body frame with respect to the inertial frame. A schematic of the TR-VTOL UAV employed in this work is shown in Figure 1.
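As a minimal numerical sketch (not part of the original paper), Equation (1) can be evaluated directly once the mass and inertia matrix are known; the function below assumes the mass and diagonal inertia values listed in Table 1 and uses arbitrary illustrative velocities and accelerations.

```python
import numpy as np

def rigid_body_wrench(m, J, v_dot_B, w_dot_B, v_B, w_B):
    """Evaluate Equation (1): total force F and moment M in the body frame
    from the linear/angular accelerations plus the gyroscopic coupling terms."""
    F = m * v_dot_B + np.cross(w_B, m * v_B)   # F = m*v_dot + w x (m*v)
    M = J @ w_dot_B + np.cross(w_B, J @ w_B)   # M = J*w_dot + w x (J*w)
    return F, M

# Illustrative values: mass and inertia from Table 1, arbitrary small rates/accelerations.
m = 1.422                                  # kg
J = np.diag([0.02, 0.027, 0.046])          # kg*m^2
F, M = rigid_body_wrench(m, J,
                         v_dot_B=np.array([0.1, 0.0, -0.2]),
                         w_dot_B=np.array([0.05, 0.0, 0.0]),
                         v_B=np.array([15.0, 0.0, 0.0]),
                         w_B=np.array([0.0, 0.1, 0.0]))
print(F, M)
```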
The inertia moments of the UAV were calculated taking into account the masses and positions of all onboard components. Table S1 lists the masses of the main equipment on the TR-VTOL UAV and their x, y, z positions relative to the body coordinate system. Using this information, the contribution of each component to the body inertia was determined, and the total inertia matrix was assembled. The XFLR5 software (version 6.47) was employed to perform the inertia calculations, providing central inertia values by accounting for both the wing–body geometry and the spatial distribution of onboard equipment. Figure 2 shows the visual model of the UAV designed in the XFLR5 environment and the equipment layout. This method allows for a more realistic modelling of the UAV dynamics in simulation and control design. Detailed information on the masses, positions, and the calculation procedure is provided in Table S1.
The rotational transformation matrix that represents the vehicle's orientation with respect to the inertial frame, expressed in terms of Euler angles, is presented in Equation (2). In the transformation matrix, the shorthand notations C_α and S_α are used to denote cos α and sin α, respectively.
R_B^i = \begin{bmatrix} C_\psi C_\theta & C_\psi S_\theta S_\phi - S_\psi C_\phi & C_\psi S_\theta C_\phi + S_\psi S_\phi \\ S_\psi C_\theta & S_\psi S_\theta S_\phi + C_\psi C_\phi & S_\psi S_\theta C_\phi - C_\psi S_\phi \\ -S_\theta & C_\theta S_\phi & C_\theta C_\phi \end{bmatrix}  (2)
The total forces acting on the aircraft consist of the aerodynamic force (F_a), gravity (F_g), and propeller thrust (F_p). The total force acting on the UAV in the body frame is expressed in Equation (3):
F^B = F_g^B + F_p^B + F_a^B  (3)
The gravitational force in the body frame, denoted F_g^B, is presented in Equation (4):
F_g^B = R_i^B \begin{bmatrix} 0 \\ 0 \\ m g \end{bmatrix} = \begin{bmatrix} -m g S_\theta \\ m g C_\theta S_\phi \\ m g C_\theta C_\phi \end{bmatrix}  (4)
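The rotation matrix of Equation (2) and the gravity projection of Equation (4) can be checked numerically with a short sketch such as the one below; the Euler angles used in the example are arbitrary and only serve to confirm that R_i^B [0, 0, mg]^T reproduces the components of Equation (4).

```python
import numpy as np

def R_body_to_inertial(phi, theta, psi):
    """Rotation matrix R_B^i of Equation (2), built from Euler angles (rad)."""
    c, s = np.cos, np.sin
    return np.array([
        [c(psi)*c(theta), c(psi)*s(theta)*s(phi) - s(psi)*c(phi), c(psi)*s(theta)*c(phi) + s(psi)*s(phi)],
        [s(psi)*c(theta), s(psi)*s(theta)*s(phi) + c(psi)*c(phi), s(psi)*s(theta)*c(phi) - c(psi)*s(phi)],
        [-s(theta),       c(theta)*s(phi),                        c(theta)*c(phi)],
    ])

def gravity_body(m, phi, theta, psi, g=9.81):
    """Gravity in the body frame (Equation (4)): F_g^B = R_i^B [0, 0, m*g]^T."""
    R_iB = R_body_to_inertial(phi, theta, psi).T  # inverse of an orthonormal matrix is its transpose
    return R_iB @ np.array([0.0, 0.0, m * g])

print(gravity_body(1.422, phi=0.1, theta=0.2, psi=0.0))
# -> approximately [-m*g*sin(theta), m*g*cos(theta)*sin(phi), m*g*cos(theta)*cos(phi)]
```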
The motors are positioned at four separate points, with the center located at the center of gravity. The TR-VTOL UAV and its motor tilt configurations in hover, transition, and fixed-wing modes are illustrated in Figure 3.

2.1.1. Forces and Moments in Quad Mode

The expressions for the force and moments generated by each motor when the vehicle is in quad mode are given in Equations (5)–(9) [25]:
F_{p_i} = K_f \omega_i^2  (5)
M_i = K_m \omega_i^2  (6)
M_x = (F_2 + F_4 - F_1 - F_3)\, y  (7)
M_y = (F_1 + F_2 - F_3 - F_4)\, x  (8)
M_z = (F_1 + F_4 - F_2 - F_3)\, c  (9)
with \omega_i being the angular (rotational) velocity of motor i, K_f the thrust coefficient of the propellers, K_m the propeller torque coefficient, and x, y, c the moment arm lengths of the TR-VTOL UAV with respect to the y-axis and the x-axis and the force–moment scaling factor, respectively. M_x, M_y, M_z represent the moments generated about the x, y, z axes, respectively.
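A hedged sketch of Equations (5)–(9) is given below; the thrust and torque coefficients, arm lengths, and motor speeds are placeholder values chosen for illustration and are not the coefficients used in the paper.

```python
import numpy as np

def quad_mode_forces_moments(w, Kf, Km, x_arm, y_arm, c):
    """Per-motor thrusts and body moments in quad mode (Equations (5)-(9)).
    w: array of the four motor angular velocities."""
    F = Kf * w**2                      # Equation (5): thrust of each motor
    M_react = Km * w**2                # Equation (6): reaction torque of each motor
    F1, F2, F3, F4 = F
    Mx = (F2 + F4 - F1 - F3) * y_arm   # Equation (7): roll moment
    My = (F1 + F2 - F3 - F4) * x_arm   # Equation (8): pitch moment
    Mz = (F1 + F4 - F2 - F3) * c       # Equation (9): yaw moment (force-moment scaling factor c)
    return F, M_react, np.array([Mx, My, Mz])

# Illustrative call with assumed coefficients and arm lengths (not taken from the paper).
F, M_react, M = quad_mode_forces_moments(np.array([900.0, 905.0, 900.0, 895.0]),
                                         Kf=1.0e-5, Km=1.5e-7,
                                         x_arm=0.25, y_arm=0.25, c=0.02)
print(F, M)
```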
In practice, the coefficients K f and K m are typically determined through experimental measurements. Since the present study is simulation-based, with experimental validation planned for a subsequent phase, the thrust force F was obtained using the simplified dynamic thrust model proposed by Gabriel Staples [26], in combination with the specifications of the selected motor–propeller system. The moment coefficient K m was adopted from a similar motor–propeller study reported in the literature [27], due to its provision of validated data for comparable configurations. The primary focus of this study is on improving the overall control performance of the aircraft; therefore, the net thrust produced by the motor–propeller assembly is employed. Accordingly, the simplified dynamic thrust model forms the basis for evaluating overall aircraft control performance. It should be noted that this approach relies on momentum-based assumptions and neglects certain real-world phenomena, such as installation effects, propeller–wing interactions, and three-dimensional tip losses. In addition, it is acknowledged that the effective thrust coefficient K f may differ between rotary-wing and fixed-wing flight modes due to variations in inflow conditions, advance ratios, and installation effects. Such efficiency differences have been analytically discussed in the literature [28], where mode-dependent corrections were introduced to capture the influence of propeller–nacelle interactions and forward-flight aerodynamics. In this study, the primary focus is on developing the RL-based control framework. Since the present work is limited to simulations, detailed modeling of propeller efficiency was not pursued. Instead, simplified equations and constant coefficients were adopted to emphasize the control objective, with mode-specific efficiency variations to be considered in future work.
The dynamic thrust equation is provided in Equation (10).
F = 4.3924 \times 10^{-8} \, \mathrm{RPM} \, \frac{d^{3.5}}{\sqrt{\mathrm{pitch}}} \left( 4.234 \times 10^{-4} \, \mathrm{RPM} \cdot \mathrm{pitch} - V_0 \right)  (10)
where F (N) is the propeller thrust, d (in.) is the propeller diameter, RPM is the rotational speed, pitch (in.) is the propeller pitch, and V0 (m/s) is the forward flight velocity. The thrust–velocity characteristics of the selected motor–propeller combination are shown in Figure 4.
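The following sketch implements the simplified dynamic thrust model of Equation (10); the 10 × 6 in propeller and 9000 RPM operating point are assumed purely for illustration and are not the paper's selected motor–propeller combination.

```python
import numpy as np

def dynamic_thrust(rpm, d_in, pitch_in, v0):
    """Simplified dynamic thrust model of Equation (10) [26].
    d_in and pitch_in are the propeller diameter and pitch in inches,
    v0 is the forward flight speed in m/s; the result is in newtons."""
    return (4.3924e-8 * rpm * d_in**3.5 / np.sqrt(pitch_in)
            * (4.234e-4 * rpm * pitch_in - v0))

# Static thrust (v0 = 0) versus thrust at cruise speed for an assumed 10x6 in propeller at 9000 RPM.
print(dynamic_thrust(9000, 10, 6, 0.0), dynamic_thrust(9000, 10, 6, 17.24))
```

As the second value illustrates, the available thrust decreases with forward speed, which is the trend shown in Figure 4.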

2.1.2. Forces and Moments in Transition Mode

The motor force and position vectors are shown in Figure 3. In the transition mode, the motor’s rotation angle is denoted as λ. By varying the rotation angle (λ) between 0 and 90 degrees, the vehicle transitions between rotary wing mode and fixed wing mode. The position and rotation vectors of the motors are expressed in Equations (11) and (12), respectively.
\hat{r}_i = \begin{bmatrix} x_i & y_i & 0 \end{bmatrix}^T  (11)
r_{m_i} = \begin{bmatrix} \sin\lambda_i & 0 & \cos\lambda_i \end{bmatrix}^T, \quad 0 < \lambda_i < 90^\circ, \quad i = 1, \dots, 4  (12)
The force vector generated by each motor is given by Equation (13):
\mathbf{F}_{p_i} = F_{p_i}\, r_{m_i}  (13)
The total motor force is expressed as Equation (14):
F_p^B = \sum_{i=1}^{4} \mathbf{F}_{p_i}  (14)
The moment generated by each motor is given by Equation (15):
M_i = \hat{r}_i \times \mathbf{F}_{p_i}  (15)
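A compact sketch of Equations (11)–(15) is shown below; the motor positions, thrust magnitudes, and the 45° tilt angle are illustrative assumptions, not the vehicle's actual geometry.

```python
import numpy as np

def transition_thrust_and_moments(F_mag, tilt_deg, motor_xy):
    """Per-motor thrust vectors and moments for the transition mode (Equations (11)-(15)).
    F_mag: thrust magnitudes of the four motors, tilt_deg: tilt angles lambda_i in degrees,
    motor_xy: (x_i, y_i) positions of the motors in the body frame."""
    lam = np.radians(tilt_deg)
    # Equation (12): thrust direction of each motor as a function of its tilt angle.
    r_m = np.stack([np.sin(lam), np.zeros_like(lam), np.cos(lam)], axis=1)
    # Equation (13): per-motor force vectors; Equation (14): total thrust in the body frame.
    F_i = F_mag[:, None] * r_m
    F_total = F_i.sum(axis=0)
    # Equation (11): position vectors; Equation (15): per-motor moments.
    r_hat = np.column_stack([motor_xy, np.zeros(len(motor_xy))])
    M_i = np.cross(r_hat, F_i)
    return F_total, M_i

# Illustrative mid-transition case (45 degrees) with assumed arm positions and equal thrusts.
F_total, M_i = transition_thrust_and_moments(
    F_mag=np.array([5.0, 5.0, 5.0, 5.0]),
    tilt_deg=np.array([45.0, 45.0, 45.0, 45.0]),
    motor_xy=np.array([[0.25, 0.25], [0.25, -0.25], [-0.25, 0.25], [-0.25, -0.25]]))
print(F_total, M_i.sum(axis=0))
```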

2.1.3. Forces and Moments in Fixed-Wing Mode

The aerodynamic behavior of a fixed-wing UAV is determined by the forces and moments generated as a function of flight conditions and control surface deflections. These forces and moments, which include both longitudinal (lift, drag, pitching moment) and lateral–directional (side force, rolling moment, yawing moment) components, are functions of parameters such as angle of attack, sideslip angle, angular rates, and control surface deflections. These aerodynamic effects are modeled using well-established force and moment equations, which provide the basis for analyzing and controlling the motion of the vehicle [29]. The details of the aerodynamic force and moment equations used in this study are provided below.
Longitudinal Forces and Moments
The longitudinal forces and moments acting on the UAV are described by lift, drag, and pitching moment, which depend on the angle of attack, pitch rate, and elevator deflection. The governing equations are given in Equations (16)–(18).
F_{\mathrm{lift}} = \tfrac{1}{2} \rho V_a^2 S \, C_L(\alpha, q, \delta_e)  (16)
F_{\mathrm{drag}} = \tfrac{1}{2} \rho V_a^2 S \, C_D(\alpha, q, \delta_e)  (17)
M = \tfrac{1}{2} \rho V_a^2 S c \, C_m(\alpha, q, \delta_e)  (18)
Here, F_lift and F_drag are the lift and drag forces, M is the pitching moment, ρ is the air density, V_a is the relative flow velocity (airspeed) of the aircraft, α is the angle of attack, β is the sideslip angle, q is the pitch rate, δ_e is the elevator deflection, C_L, C_D, C_m are the lift, drag, and pitching-moment coefficients, S is the wing surface area, b is the wingspan, and c is the chord length.
Equations (16)–(18) model the contributions of elevator deflection to lift, drag, and aerodynamic moments based on aerodynamic data obtained from the XFLR5 software. These data are incorporated into the dynamic model within the MATLAB/Simulink (version 2023b) environment through prelookup tables. Consequently, variations in elevator deflection yield corresponding changes in forces and moments in a realistic manner, thereby capturing the longitudinal dynamic response of the UAV in the simulation environment. Figure 5 illustrates the implementation of longitudinal forces in the Simulink model and their integration via prelookup tables.
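The prelookup-table approach can be mirrored outside Simulink with simple one-dimensional interpolation, as in the sketch below; the coefficient-versus-angle-of-attack values are hypothetical placeholders (in the paper they come from XFLR5), while S, c, and the cruise airspeed are taken from Table 1, and the elevator and pitch-rate contributions are omitted for brevity.

```python
import numpy as np

# Hypothetical coefficient tables versus angle of attack (deg); in the paper these
# values come from XFLR5 and are stored in Simulink prelookup tables.
alpha_tab = np.array([-5.0, 0.0, 5.0, 10.0])
CL_tab    = np.array([-0.15, 0.35, 0.85, 1.25])
CD_tab    = np.array([0.045, 0.040, 0.055, 0.090])
Cm_tab    = np.array([0.05, 0.00, -0.06, -0.13])

def longitudinal_forces(alpha_deg, Va, rho=1.225, S=0.151, c=0.153):
    """Equations (16)-(18) with table-interpolated coefficients (elevator and
    pitch-rate contributions omitted for brevity)."""
    q_bar = 0.5 * rho * Va**2                     # dynamic pressure
    CL = np.interp(alpha_deg, alpha_tab, CL_tab)
    CD = np.interp(alpha_deg, alpha_tab, CD_tab)
    Cm = np.interp(alpha_deg, alpha_tab, Cm_tab)
    return q_bar * S * CL, q_bar * S * CD, q_bar * S * c * Cm  # lift, drag, pitch moment

print(longitudinal_forces(alpha_deg=4.0, Va=17.24))
```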
Lateral Forces and Moments
The lateral-directional forces and moments include side force, rolling moment, and yawing moment, depending on sideslip angle, roll rate, and aileron/rudder deflections. The governing equations are given in Equations (19)–(21):
Y = \tfrac{1}{2} \rho V_a^2 S \, C_Y(\beta, p, \delta_a, \delta_r)  (19)
L = \tfrac{1}{2} \rho V_a^2 S b \, C_l(\beta, p, \delta_a, \delta_r)  (20)
N = \tfrac{1}{2} \rho V_a^2 S b \, C_n(\beta, p, \delta_a, \delta_r)  (21)
Here, Y is the side force, L and N are the rolling and yawing moments, and p is the roll rate. δ_a and δ_r are the aileron and rudder deflections, respectively. C_Y, C_l, C_n are the side-force, rolling-moment, and yawing-moment coefficients, respectively.
In Equations (19)–(21), the contributions of aileron and rudder deflections to the rolling moment, yawing moment, and side force are modeled using aerodynamic data similarly obtained from XFLR5. These data are also transferred into prelookup tables within the Simulink environment. As a result, variations in aileron and rudder deflections lead to realistic calculations of the lateral forces and moments, enabling the simulation of the UAV's lateral dynamic responses. Figure 6 illustrates the integration of the lateral force and moment coefficients into the Simulink model.
The parameters used in the mathematical model and the aerodynamic coefficients obtained from the XFLR5 program are listed in Table 1.

3. RL-Based Multi-Agent SAC Control Framework

RL algorithms are artificial intelligence tools that continuously update policy parameters based on actions, observations, and rewards. The goal of an RL algorithm is to find the optimal policy that maximizes the long-term cumulative reward received during a task. Various models are used for discrete and continuous-time systems. Since the system operates in both continuous action and continuous observation spaces, models suitable for such domains were considered. Research has shown that the DDPG model is widely used in such environments, followed by TD3, PPO, SAC, and TRPO. In this study, the SAC algorithm was selected. SAC performs well in complex, continuous-space problems encountered in real-world applications. It employs an entropy regularizer, which not only maximizes rewards but also maintains policy stochasticity, thereby enhancing the exploration behavior of the agents [30]. Thanks to these features, it is commonly used in complex control tasks, robotic applications, and other continuous-space RL scenarios. The objective of the algorithm is presented in Equation (22).
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]  (22)
J(π): objective function measuring the performance of the policy; ρ_π: distribution of state–action pairs induced by the policy; r(s_t, a_t): reward of action a_t in state s_t; α: scaling (temperature) parameter for the entropy term; γ: discount factor for future rewards; H(π(·|s_t)): entropy of the action distribution of policy π in state s_t.
SAC utilizes three networks: the state value function (V), the soft Q function, and the policy function (π). The state value function is parameterized by ψ, the soft Q function by θ, and the policy function by ϕ.
The state value function calculates the total expected reward from a specific state.
J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}} \left[ \tfrac{1}{2} \Big( V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi} \big[ Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t) \big] \Big)^2 \right]  (23)
V_ψ(s_t): estimate of the value function for state s_t.
D: the experience (replay) buffer of stored transitions.
Q_θ(s_t, a_t): estimate of the Q-value function for the given state and action.
π_ϕ(a_t|s_t): the policy function, giving the probability of selecting action a_t in state s_t.
The soft Q function parameters are trained to minimize the residuals from the soft Bellman equation shown in Equations (24) and (25).
J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}} \left[ \tfrac{1}{2} \big( Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t) \big)^2 \right]  (24)
\hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p} \big[ V_{\bar{\psi}}(s_{t+1}) \big]  (25)
r(s_t, a_t): instantaneous reward obtained by taking action a_t in state s_t.
γ: discount factor used to compute the present value of future rewards.
V_ψ̄(s_{t+1}): the state value function evaluated at the next state using the target parameters ψ̄.
The SAC algorithm encourages exploration during policy updates by using an entropy regularizer. However, some SAC variants employ the KL divergence during policy updates to ensure that the new policy does not deviate too far from the old one, as shown in Equation (26). KL divergence is a metric that quantifies the difference between two probability distributions.
J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}} \left[ D_{\mathrm{KL}}\!\left( \pi(\cdot \mid s_t) \,\Big\|\, \frac{\exp\big(Q_\theta(s_t, \cdot)\big)}{Z_\theta(s_t)} \right) \right]  (26)
Z_θ(s_t): the partition function that normalizes the distribution; although it is generally intractable, it does not contribute to the gradient with respect to the new policy and can therefore be ignored.
Q_θ(s_t, ·): the Q-value function evaluated over the action distribution in state s_t.
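As a single-transition illustration of Equations (23)–(25), with made-up scalar values (in practice these quantities are expectations over minibatches sampled from the replay buffer, and the KL-based policy update of Equation (26) is handled separately via the reparameterization trick), the targets and losses can be written as follows.

```python
def sac_targets(r, v_next_target, log_pi, q_pred, v_pred, q_of_sampled_action, gamma=0.99):
    """Scalar illustration of the SAC targets in Equations (23)-(25) for one transition.
    v_next_target: V_psi_bar(s_{t+1}) from the target value network,
    log_pi: log pi_phi(a_t | s_t) for the sampled action,
    q_of_sampled_action: Q_theta(s_t, a_t) for the same sampled action."""
    q_hat = r + gamma * v_next_target            # Equation (25): soft Bellman target
    J_Q = 0.5 * (q_pred - q_hat) ** 2            # Equation (24): soft Q-function loss
    v_target = q_of_sampled_action - log_pi      # entropy-augmented value estimate
    J_V = 0.5 * (v_pred - v_target) ** 2         # Equation (23): value-function loss
    return q_hat, J_Q, J_V

print(sac_targets(r=-0.05, v_next_target=1.2, log_pi=-1.1,
                  q_pred=1.0, v_pred=1.1, q_of_sampled_action=1.05))
```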
The proposed control strategy employs the Soft Actor-Critic (SAC) algorithm, which uses an actor-critic architecture to optimize policy and value estimation. For a comprehensive description of the SAC algorithm, including the actor and critic update rules and implementation details, the reader is referred to Figure S1.
A key feature of SAC is entropy regularization. The policy is trained to maximize a trade-off between the expected return and the entropy, which measures the randomness of the policy. This is closely related to the exploration–exploitation trade-off: increasing entropy promotes more exploration, thereby accelerating learning, and also helps prevent the policy from prematurely converging to a poor local optimum. The agent used in training consists of two neural networks, the critic and the actor, whose structures are depicted in Figure 7. These networks serve as function approximators [31]. In our study, the critic takes the state and action as inputs and outputs a value estimate of future rewards; the temporal-difference (TD) error between this estimate and its target drives the critic update. The actor network takes the state observed by the agent as input, which represents the information received from the environment, and outputs the mean and standard deviation of the action distribution.
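A minimal sketch of this actor–critic structure is given below in PyTorch; the hidden-layer sizes and the three-dimensional observation are illustrative assumptions and do not reproduce the exact architecture of Figure 7.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q-network: takes the state and action, returns a scalar value estimate."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class Actor(nn.Module):
    """Policy network: takes the state, returns the mean and standard deviation of the action."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, state):
        h = self.body(state)
        return self.mean(h), self.log_std(h).exp()

# Example: a 3-dimensional observation (e.g. angle error, rate, integral term) and 1 action.
actor, critic = Actor(3, 1), Critic(3, 1)
s = torch.zeros(1, 3)
mu, std = actor(s)
q = critic(s, mu)
```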

3.1. Nominal Reference Dynamic Reward

One of the significant challenges in RL is designing the reward function. To shape the reward function effectively and achieve the desired results, a deep understanding of the system and expertise in the relevant field are essential. Additionally, this process involves various optimization problems. In this study, a nominal dynamic reward function was developed to determine rewards and penalties. During the simulation process, parameters such as the exceedance rate, rise time, and settling time were updated according to the initial and target values for each iteration, and the limits were adjusted based on these parameters in each iteration (Figure 8). Within the framework of these established limits, a penalty was applied in the event of a limit violation using the exterior penalty function.
The mathematical form of the function used to generate penalties when specific constraints are violated is provided in Equation (27):
P(x, w_h, w_g) = w_h \sum_{j=1}^{l} h_j(x)^2 + w_g \sum_{i=1}^{m} \max\big(0, g_i(x)\big)^2  (27)
x: the decision variables. w_h and w_g: weight coefficients used in the penalty function; they determine the contributions of the equality and inequality constraints to the penalty. h_j(x): the j-th equality constraint; equality constraints require the system to satisfy a given condition exactly. g_i(x): the i-th inequality constraint; inequality constraints ensure that the system stays within certain limits. The sums over j = 1, …, l and i = 1, …, m run over the equality and inequality constraints, respectively. max(0, g_i(x)) activates the penalty only when an inequality constraint is violated: if g_i(x) is greater than zero, the constraint is violated and a penalty is applied; otherwise, no penalty is applied. This reward function is applied separately to each axis of the aircraft. For example, the reward–penalty algorithm for the pitch angle is shown in Figure S2.
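A small sketch of the exterior penalty function of Equation (27) is shown below; the constraint functions, tolerance values, and weights are hypothetical examples of per-axis limit violations (e.g., exceeding the upper response envelope), not the limits generated from the reference response region of Figure 8.

```python
def exterior_penalty(x, h_funcs, g_funcs, w_h=1.0, w_g=1.0):
    """Exterior penalty function of Equation (27): quadratic penalties for violated
    equality constraints h_j(x) = 0 and inequality constraints g_i(x) <= 0."""
    p_eq = w_h * sum(h(x) ** 2 for h in h_funcs)
    p_ineq = w_g * sum(max(0.0, g(x)) ** 2 for g in g_funcs)
    return p_eq + p_ineq

# Hypothetical example for one axis: penalize leaving the reference response region,
# e.g. a peak above an upper envelope or a settling error outside a tolerance band.
upper_limit, tolerance = 1.05, 0.02
g_overshoot = lambda resp: resp["peak"] - upper_limit            # > 0 when the envelope is exceeded
g_settle = lambda resp: abs(resp["steady_error"]) - tolerance    # > 0 outside the tolerance band
penalty = exterior_penalty({"peak": 1.12, "steady_error": 0.01},
                           h_funcs=[], g_funcs=[g_overshoot, g_settle],
                           w_h=1.0, w_g=10.0)
print(penalty)   # only the overshoot term contributes in this example
```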

3.2. Overall Control of the TR-VTOL UAV RL Neural Network Implementation System

The control architecture of the system with RL is shown in Figure 9. Random desired references and initial conditions for the quad mode and fixed-wing flight mode were generated using a uniform distribution. Specifically, roll angles were sampled independently from [−0.4, 0.4] rad (≈±23°), pitch angles from [−0.3, 0.3] rad (≈±17°), and yaw angles (heading) from [−0.1, 0.1] rad (≈±5.7°). The initial values of roll, pitch, and yaw angles are also randomly selected within the same ranges. This allows moderate roll and pitch variations while keeping heading errors small, avoiding non-physical or infeasible flight states.
During training, an error signal is obtained by comparing randomly generated reference and initial values for the model. The RL agent makes inferences based on the observed values and the reward generated by the nominal dynamic equation, attempting to send the most appropriate command to the aircraft.
To accurately replicate real-world flight conditions during simulation, uniform random external disturbance forces were introduced to emulate the effects of unpredictable environmental interactions. These physical perturbations were applied directly to the UAV body in multiple directions to test the robustness of the control policy under dynamic and uncertain conditions. The quantitative parameters used to characterize these disturbances are summarized in Table 2.
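The episode randomization described above can be sketched as follows, assuming the uniform attitude ranges given in this section and the ±2 N body-axis disturbance forces of Table 2; the function name and dictionary layout are illustrative.

```python
import numpy as np

rng = np.random.default_rng()

def sample_episode_conditions():
    """Draw the random references, initial attitudes, and body-frame disturbance
    forces used for one training episode (ranges from Section 3.2 and Table 2)."""
    ranges = {"roll": 0.4, "pitch": 0.3, "yaw": 0.1}          # rad
    reference = {k: rng.uniform(-r, r) for k, r in ranges.items()}
    initial   = {k: rng.uniform(-r, r) for k, r in ranges.items()}
    disturbance_N = rng.uniform(-2.0, 2.0, size=3)            # +/- 2 N on the X, Y, Z body axes
    return reference, initial, disturbance_N

ref, init, dist = sample_episode_conditions()
print(ref, init, dist)
```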
Figure 9 illustrates the RL neural network implementation scheme, which includes the actor, critic, entropy, and temporal-difference (TD) error components of the SAC algorithm. As shown, the desired value is compared with the initial value of the model to generate an error signal. The Actor block represents the policy function, which selects actions in different states and improves decision-making by updating policy parameters during training. The critic block represents the Q-value function, estimating the quality of the Actor’s actions. The entropy term in SAC increases policy uncertainty, enabling the policy to explore a broader distribution of actions. The maximum entropy strategy not only maximizes the expected return but also encourages exploration by preventing the policy from becoming overly deterministic, thereby improving the exploration–exploitation balance. The TD error quantifies the discrepancy between the target Q-value and the Critic’s estimate; this error is used to update the critic network. Finally, the RL agent receives a reward signal from the nominal dynamic model, which provides feedback on the quality of the current system state.

4. Simulation Results

To evaluate the effectiveness of the proposed reinforcement learning framework, simulations were conducted for both single-agent and multi-agent implementations of the SAC algorithm. The training was carried out separately for quadcopter and fixed-wing flight modes. At the end of the training process, the controllers were assessed in terms of reward convergence speed and the final reward values achieved. The following subsections present the training outcomes, including the number of episodes and time required for convergence, as well as the corresponding reward evolution for each configuration.
In the single-agent case, training with the SAC algorithm lasted for 2500 episodes, corresponding to approximately 2 h and 21 min of simulation time. The agent achieved an average reward of −1.7, which represents the baseline performance for comparison with the multi-agent approach, as illustrated in Figure 10.
Although improvements were observed with the single-agent SAC algorithm, a multi-agent implementation was adopted to further enhance vehicle stabilization. One of the simulation results obtained from applying the multi-agent SAC algorithm is presented in Figure 11. Notably, the training process was observed to be 98% shorter compared to the single-agent scenario. The corresponding training parameters are summarized in the table below. The multi-agent training lasted 121 episodes, taking 3 min and 56 s, with the average rewards recorded as −0.03, −0.057, and −0.022. Vehicle control was successfully achieved using the RL agents trained in these separate sessions, demonstrating both faster convergence and improved stabilization performance.
The proposed multi-agent Soft Actor-Critic (MA-SAC) controller was trained separately for quad-mode and fixed-wing mode; the controllers provide attitude regulation about the roll, pitch and yaw axes in both modes. Figure 12 shows the roll tracking performance in quad-mode: the desired roll trajectory (reference) is closely tracked by the learned policy, with small transient deviations during setpoint changes and negligible steady-state error.
Figure 13 illustrates the pitch tracking performance in quad-mode. The desired pitch trajectory (reference) is followed closely, with small transient deviations during setpoint changes and negligible steady-state error.
Figure 14 shows the yaw tracking performance in quad-mode. The desired yaw trajectory (reference) is tracked accurately, with minor transient deviations during setpoint changes and minimal steady-state error.
Axis control in fixed-wing mode is shown in Figure 15, Figure 16 and Figure 17. Figure 15 illustrates the roll-axis tracking performance of the UAV in fixed-wing mode. As can be seen, the UAV generally follows the commanded roll angle, and no significant overshoot or delay occurs during setpoint changes. This indicates that, although slight deviations are observed in the roll axis, the control algorithm maintains accuracy and stability.
Figure 16 shows the pitch tracking performance in fixed-wing mode. The UAV accurately follows the desired pitch, staying within a ±ε band without overshoot during setpoint changes.
Figure 17 illustrates the yaw-axis tracking performance of the UAV in fixed-wing mode. As seen, the UAV follows the commanded yaw angle; however, the tracking range is slightly wider compared to the other axes. No significant overshoot is observed during setpoint changes. This indicates that the control algorithm provides stable yaw-axis tracking, but its accuracy is somewhat limited compared to the other axes.
The vertical take-off, transition to horizontal flight, fixed-wing motion, transition back to vertical flight, and vertical landing of the Tilt Rotor fixed-wing UAV were carried out as follows. In Q1, the vehicle performed a vertical take-off in quadcopter mode and reached the target altitude. To achieve the required speed for switching to fixed-wing mode, it accelerated forward while moving horizontally in quadcopter mode during Q2. In Q3, the vehicle maintained stable motion in fixed-wing mode. During Q4, preparations for vertical landing were conducted as the vehicle transitioned back to quadcopter mode. Finally, in Q5, the operation concluded with a vertical landing in quadcopter mode. The entire sequence is illustrated using two complementary visualizations: the altitude (H) profile in Figure 18 and the MATLAB UAV Animation (version 2023b) in Figure 19.
Figure 20 illustrates the fixed-wing motion of the TR-VTOL UAV during circular maneuvers. The vehicle executes smooth and stable turns in response to the commanded roll angles, forming circular paths without following a predefined reference trajectory. This visualization demonstrates the capability of the control system to maintain stability and orientation during continuous maneuvering, with only minor deviations caused by aerodynamic effects.
Beyond qualitative visualizations, a quantitative evaluation of the proposed controller was conducted by comparing its performance with existing approaches in both hover and fixed-wing modes. The following analysis presents detailed results in terms of RMSE, rise time, settling time, and overshoot. In the hover mode, the RMSE values of the roll, pitch, and yaw axes were compared with the AC NN and NTSMC methods [9]. As shown in Table 3, the proposed SAC controller reduced the roll RMSE from 0.153 (AC NN) to 0.058, corresponding to a 62.1% improvement; the pitch RMSE from 0.1758 (AC NN) to 0.083, corresponding to a 52.8% improvement; and the yaw RMSE from 0.311 (AC NN) to 0.070, corresponding to a 77.5% improvement. The percentage improvement was calculated as: (reference−proposed)/reference × 100.
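The improvement figures quoted above follow directly from the stated formula, as the short check below shows for the Table 3 hover-mode RMSE values.

```python
def improvement(reference, proposed):
    """Percentage improvement as defined in the text: (reference - proposed) / reference * 100."""
    return (reference - proposed) / reference * 100.0

# Reproducing the hover-mode RMSE improvements over AC NN reported in Table 3.
for axis, ref, sac in [("roll", 0.153, 0.058), ("pitch", 0.1758, 0.083), ("yaw", 0.311, 0.070)]:
    print(f"{axis}: {improvement(ref, sac):.1f}%")
# roll: 62.1%, pitch: 52.8%, yaw: 77.5%
```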
In the fixed-wing flight mode, the multi-agent SAC algorithm was compared with the PPO algorithm [33], a classical PID controller, and the method in [34] under both no-wind and (20 m/s) constant wind conditions. As shown in Table 4, the proposed SAC controller shows improved performance compared to PPO and PID, particularly in terms of settling time and rise time. In the no-wind case, the roll rise time was reduced from 0.265 s (PPO) to 0.158 s (SAC), corresponding to a 40.4% improvement, while the pitch rise time was reduced from 0.661 s (PPO) to 0.103 s (SAC), yielding an 84.4% improvement. Settling times were shortened by more than 80% in both roll and pitch compared with PPO and PID. While the SAC controller reduced rise time and settling time compared to PPO and PID, it did not achieve an improvement in overshoot performance. Instead, SAC maintained a small steady tracking error without exceeding the reference, whereas PPO exhibited overshoot of 21% and 24% for roll and pitch, respectively, and PID without RL showed overshoot of 4% and 17%. Under 20 m/s constant wind, the SAC algorithm maintained stable performance. As shown in Table 4, rise and settling times for the pitch axis were lower than those observed with PPO and PID controllers, while overshoot present in the other controllers was not observed with SAC. These results indicate that the proposed multi-agent SAC model provides stable transient responses under wind conditions.
Overall, the results confirm that the proposed multi-agent SAC controller achieves superior tracking accuracy and transient response compared to baseline methods.

5. Conclusions and Future Work

This study investigated the training of a TR-VTOL UAV using reinforcement learning (RL), emphasizing the importance of algorithm selection, reward function design, and the development of an appropriate training environment. One of the main contributions is the introduction of a nominal dynamic reward function. At each iteration of the training process, reward values were dynamically generated based on varying initial and target conditions. Training across diverse scenarios improved the stabilization performance of the TR fixed-wing UAV.
Another key innovation is the adoption of a multi-agent structure, where a separate agent is assigned to control each axis. This allows control tasks to be distributed among agents, enabling sensor data for each control input to be processed independently and reward values to be generated dynamically. As a result, the agents reached appropriate control values more efficiently during training. Since extensive iterations are required and direct real-world training may risk damaging the aircraft before the RL model is fully learned, the entire training procedure was conducted in a simulation environment.
For future work, the proposed control architecture will be further evaluated under varying wind and turbulence conditions to provide a more comprehensive performance assessment. In addition, real-flight experiments will be carried out to validate the practical applicability of the method. To improve computational efficiency, faster training algorithms and advanced optimization techniques will be investigated. Finally, the proposed approach will be extended to different VTOL configurations to assess its generalization capability in other complex flight tasks.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/aerospace12090814/s1, Table S1: Masses and Positions of the Main Equipment (Detailed information on the masses, positions, and the calculation procedure of the UAV components). Figure S1: Pseudocode representation of the Soft Actor-Critic (SAC) algorithm employed for controlling the TR-VTOL UAV. Figure S2: Axis-specific reward–penalty algorithm applied to the pitch angle.

Author Contributions

Conceptualization, M.U.; Methodology, M.U.; Software, M.U.; Validation, M.U., A.Y.; Formal analysis, M.U.; Investigation, M.U.; Resources, A.Y.; Data curation, M.U.; Writing—original draft preparation, M.U.; Writing—review and editing, M.U., A.Y.; Visualization, M.U.; Supervision, A.Y.; Project administration, M.U. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by the authors.

Data Availability Statement

All relevant data supporting the reported results are included in the manuscript and Supplementary Materials.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Misra, A.; Jayachandran, S.; Kenche, S.; Katoch, A.; Suresh, A.; Gundabattini, E.; Selvaraj, S.K.; Legesse, A.A. A Review on Vertical Take-Off and Landing (VTOL) Tilt-Rotor and Tilt Wing Unmanned Aerial Vehicles (UAVs). J. Eng. 2022, 2022, 1803638. [Google Scholar] [CrossRef]
  2. Chen, C.; Zhang, J.; Wang, N.; Shen, L.; Li, Y. Conversion Control of a Tilt Tri-Rotor Unmanned Aerial Vehicle with Modeling Uncertainty. Int. J. Adv. Robot. Syst. 2021, 18, 1–14. [Google Scholar] [CrossRef]
  3. Nahrendra, I.; Made, A.; Christian, T.; Byeongho, Y.; Eungchang Mason, L.; Hyun, M. Retro-RL: Reinforcing Nominal Controller with Deep Reinforcement Learning for Tilting-Rotor Drones. IEEE Robot Autom Lett. 2022, 7, 9004–9011. [Google Scholar] [CrossRef]
  4. Yathish, K.; Pk, S.; Mascarenhas, S.; Bali, H. The Design and Development of Transitional UAV Configuration. In Proceedings of the 2nd International Conference on Emerging Research in Civil, Bangalore, India, 25–26 July 2019. [Google Scholar]
  5. Lu, K.; Tian, H.; Zhen, P.; Lu, S.; Chen, R. Conversion flight control for tiltrotor aircraft via active disturbance rejection control. Aerospace 2022, 9, 155. [Google Scholar] [CrossRef]
  6. Sheng, H.; Zhang, C.; Xiang, Y. Mathematical Modeling and Stability Analysis of Tiltrotor Aircraft. Drones 2022, 6, 92. [Google Scholar] [CrossRef]
  7. He, G.; Li, Y.; Huang, H.; Wang, X. A Nonlinear Robust Sliding Mode Controller with Auxiliary Dynamic System for the Hovering Flight of a Tilt Tri-Rotor UAV. Appl. Sci. 2020, 10, 6551. [Google Scholar] [CrossRef]
  8. Masuda, K.; Uchiyama, K. Robust Control Design for Quad Tilt-Wing UAV. Aerospace 2018, 5, 17. [Google Scholar] [CrossRef]
  9. Xie, T.; Xian, B.; Gu, X. Fixed-time convergence attitude control for a tilt trirotor unmanned aerial vehicle based on reinforcement learning. ISA Trans. 2023, 132, 477–489. [Google Scholar] [CrossRef] [PubMed]
  10. Pi, C.-H.; Ye, W.-Y.; Cheng, S. Robust Quadrotor Control through Reinforcement Learning with Disturbance Compensation. Appl. Sci. 2021, 11, 3257. [Google Scholar] [CrossRef]
  11. Xia, K.; Huang, Y.; Zou, Y.; Zuo, Z. Reinforcement Learning Control for Moving Target Landing of VTOL UAVs With Motion Constraints. IEEE Trans. Ind. Electron. 2023, 71, 7735–7744. [Google Scholar] [CrossRef]
  12. Yang, R.; Du, C.; Zheng, Y.; Gao, H.; Wu, Y.; Fang, T. PPO-Based Attitude Controller Design for a Tilt Rotor UAV in Transition Process. Drones 2023, 7, 499. [Google Scholar] [CrossRef]
  13. Hsiao, F.-I. Reinforcement Learning Based Quadcopter Controller. Available online: https://web.stanford.edu/class/aa228/reports/2019/final62.pdf (accessed on 7 June 2024).
  14. Imran, I.H.; Wood, K.; Montazeri, A. Adaptive control of unmanned aerial vehicles with varying payload and full parametric uncertainties. Electronics 2024, 13, 347. [Google Scholar] [CrossRef]
  15. Define Observation and Reward Signals in Custom Environments. Available online: https://www.mathworks.com/help/reinforcement-learning/ug/define-reward-and-observation-signals.html (accessed on 3 June 2024).
  16. Generate Reward Function from a Model Verification Block for a Water Tank System. Available online: https://www.mathworks.com/help/reinforcement-learning/ug/generate-reward-fcn-from-verification-block-for-watertank.html (accessed on 3 June 2024).
  17. Ye, C.; Zhu, W.; Guo, S.; Bai, J. DQN-Based Shaped Reward Function Mold for UAV Emergency Communication. Appl. Sci. 2024, 14, 10496. [Google Scholar] [CrossRef]
  18. Kouzeghar, M.; Song, Y.; Meghjani, M.; Bouffanais, R. Multi-target pursuit by a decentralized heterogeneous uav swarm using deep multi-agent reinforcement learning. arXiv 2023, arXiv:2303.01799. [Google Scholar]
  19. Gain-Scheduled PID Autotuning a VTOL UAV During Forward and Backward Transition. Available online: https://www.mathworks.com/help/slcontrol/ug/gain-scheduled-control-vtol-uav.html (accessed on 10 July 2024).
  20. Mohanty, A.; Schneider, E. Tuning of an Aircraft Pitch PID Controller with Reinforcement Learning and Deep Neural Net. Available online: https://cs229.stanford.edu/proj2019aut/data/assignment_308832_raw/26643693.pdf (accessed on 3 June 2024).
  21. Richter, D.J.; Calix, R.A. Qplane: An open-source reinforcement learning toolkit for autonomous fixed wing aircraft simulation. In Proceedings of the 12th ACM Multimedia Systems Conference, Istanbul, Turkey, 21 September 2021. [Google Scholar]
  22. Cui, J.; Liu, Y.; Nallanathan, A. Multi-agent reinforcement learning-based resource allocation for UAV networks. IEEE Trans. Wirel. Commun. 2019, 19, 729–743. [Google Scholar] [CrossRef]
  23. Chan, J.H.; Liu, K.; Chen, Y.; Sagar, A.S.; Kim, Y.G. Reinforcement learning-based drone simulators: Survey, practice, and challenge. Artif. Intell. Rev. 2024, 57, 281. [Google Scholar] [CrossRef]
  24. Part 1: Key Concepts in RL. Available online: https://spinningup.openai.com/en/latest/spinningup/rl_intro.html (accessed on 20 February 2025).
  25. Zhang, F.; Lyu, X.; Wang, Y.; Gu, H.; Li, Z. Modeling and Flight Control Simulation of a Quad Rotor Tail-Sitter VTOL UAV. In Proceedings of the AIAA Modeling and Simulation Technologies Conference, Grapevine, TX, USA, 9–13 January 2017. [Google Scholar] [CrossRef]
  26. Propeller Static & Dynamic Thrust Calculation|Flite Test. Available online: https://www.flitetest.com/articles/propeller-static-dynamic-thrust-calculation (accessed on 17 August 2024).
  27. Kumar, R.; Bhargavapuri, M.; Deshpande, A.M.; Sridhar, S.; Cohen, K.; Kumar, M. Quaternion feedback based autonomous control of a quadcopter uav with thrust vectoring rotors. In Proceedings of the 2020 American Control Conference (ACC), Denver, CO, USA, 1–3 July 2020; IEEE: New York, NY, USA, 2020. [Google Scholar]
  28. He, C.; Chen, G.; Sun, X.; Li, S.; Li, Y. Geometrically compatible integrated design method for conformal rotor and nacelle of distributed propulsion tilt-wing UAV. Chin. J. Aeronaut. 2023, 36, 229–245. [Google Scholar] [CrossRef]
  29. Beard, R.W.; McLain, T.W. Small Unmanned Aircraft: Theory and Practice; Princeton University Press: Princeton, NJ, USA, 2012; Volume 317. [Google Scholar]
  30. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: New York, NY, USA, 2018. [Google Scholar]
  31. Soft Actor-Critic—Spinning Up Documentation. Available online: https://spinningup.openai.com/en/latest/algorithms/sac.html#quick-facts (accessed on 14 April 2024).
  32. Qing, A.; Santiago, S.; Chris, D.; Ashutosh, S.; Rahman, D.-M.A. Deep Reinforcement Learning-Based Resource Scheduler for Massive MIMO Networks. IEEE Trans. Mach. Learn. Commun. Netw. 2023, 1, 242–257. [Google Scholar] [CrossRef]
  33. Bøhn, E.; Coates, E.M.; Moe, S.; Johansen, T.A. Deep reinforcement learning attitude control of fixed-wing UAVs using proximal policy optimization. In Proceedings of the 2019 International Conference on Unmanned Aircraft Systems (ICUAS), Atlanta, GA, USA, 11–14 June 2019; IEEE: New York, NY, USA, 2019. [Google Scholar]
  34. Tahir, Z.; Waleed, T.; Saad Ali, L. State Space System Modelling of a Quad Copter UAV. arXiv 2019, arXiv:1908.07401. [Google Scholar] [CrossRef]
Figure 1. General Structure of a TR-VTOL UAV.
Figure 2. The weight distribution of a TR-VTOL UAV.
Figure 3. TR-VTOL UAV in Rotary-Wing, Transient and Fixed-Wing Modes.
Figure 4. Thrust Force—Airspeed Table.
Figure 5. Implementation of Longitudinal Force and Moment Coefficients Using Prelookup Tables in Simulink.
Figure 6. Implementation of Lateral Force and Moment Coefficients Using Prelookup Tables in Simulink.
Figure 7. (a) Critic Network, (b) Actor Network.
Figure 8. Creating Reference Response Regions.
Figure 9. RL neural network implementation scheme, including the actor, critic, and entropy-based learning mechanism of the SAC algorithm (adapted from [15,32]).
Figure 10. Training progress of the TR-VTOL UAV using the single-agent SAC algorithm.
Figure 11. Training progress of the TR-VTOL UAV using the multi-agent SAC algorithm.
Figure 12. Roll Axis Graph in Quad Mode.
Figure 13. Pitch Axis Graph in Quad Mode.
Figure 14. Yaw Axis Graph in Quad Mode.
Figure 15. Roll Axis Graph in Fixed Wing Mode.
Figure 16. Pitch Axis Graph in Fixed Wing Mode.
Figure 17. Yaw Axis Graph in Fixed Wing Mode.
Figure 18. Vertical Take-Off-Landing and Horizontal Movement of TR-VTOL UAV-V1.
Figure 19. Vertical Take-Off-Landing and Horizontal Movement of TR-VTOL UAV-V2.
Figure 20. Fixed Wing Orbit Control of TR-VTOL UAV.
Table 1. Parameters used.
Name | Description | Value
Ixx | Moment of inertia about the X-axis (kg·m²) | 0.02
Iyy | Moment of inertia about the Y-axis (kg·m²) | 0.027
Izz | Moment of inertia about the Z-axis (kg·m²) | 0.046
S | Surface area of wing (m²) | 0.151
b | Wingspan (m) | 1
c | Chord (m) | 0.153
AR | Wing aspect ratio | 6.638
W | Weight (kg) | 1.422
α | Angle of attack (deg) | 4
V | Velocity (m/s) | 17.24
CG | 33% MAC on thrust line | 0.241
CL | Lift coefficient | 0.596
CD | Drag coefficient | 0.04
Cmq | Pitch moment due to pitch rate | −0.041
Clp | Roll moment due to roll rate | −0.439
Clr | Roll moment due to yaw rate | 0.330
Cnp | Yaw moment due to roll rate | −0.265
Cnr | Yaw moment due to yaw rate | −0.048
Table 2. Parameters of Random Disturbance Forces.
Disturbance Type | Applied Axis | Magnitude | Distribution | Purpose
External Force | X, Y, Z (Body Axes) | ±2 N | Uniform | Evaluate control policy under physical disturbances
Table 3. Analysis of Control Root Mean Square Errors (RMSE), Hover Mode.
Experiments | Proposed (SAC) | AC NN | NTSMC
RMS Error of roll angle (°) | 0.058 | 0.153 | 0.484
RMS Error of pitch angle (°) | 0.083 | 0.1758 | 0.629
RMS Error of yaw angle (°) | 0.070 | 0.311 | 0.316
Table 4. Analysis of Control Root Mean Square Errors (RMSE), Fixed Wing Mode.
Experiments | Controller | Rise Time (s) φ | Rise Time (s) θ | Settling Time (s) φ | Settling Time (s) θ | Overshoot (%) φ | Overshoot (%) θ
No Wind | Proposed RL (SAC) | 0.158 | 0.103 | 0.267 | 0.254 | 0 | 0
No Wind | PPO RL | 0.265 | 0.661 | 1.584 | 1.663 | 21 | 24
No Wind | PID without RL | 1.344 | 0.228 | 2.050 | 1.364 | 4 | 17
Constant Wind (20 m/s) | Proposed RL (SAC) | 0.084 | 0.016 | 0.261 | 0.267 | 0 | 0
Constant Wind (20 m/s) | PPO RL | 0.166 | 1.792 | 2.903 | 3.280 | 122 | 93
Constant Wind (20 m/s) | PID without RL | 0.930 | 0.945 | 3.576 | 5.256 | 92 | 80
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
