Article

Enhancing UAS Integration in Controlled Traffic Regions Through Reinforcement Learning

by Joaquin Vico Navarro * and Juan Antonio Vila Carbó
Instituto Universitario de Investigación de Automática e Informática Industrial, Universitat Politècnica de València, Camí de Vera s/n, 46022 València, Spain
* Author to whom correspondence should be addressed.
Drones 2025, 9(6), 412; https://doi.org/10.3390/drones9060412
Submission received: 16 April 2025 / Revised: 3 June 2025 / Accepted: 4 June 2025 / Published: 6 June 2025
(This article belongs to the Section Innovative Urban Mobility)

Abstract

Controlled Traffic Regions (CTRs) around major airports pose an important challenge to Unmanned Aerial System (UAS) traffic management. Current regulations severely restrict UAS missions in these areas by confining them to segregated areas. This paper proposes allowing more ambitious UAS missions inside CTRs, such as paths across the CTR or between heliports within it, based on self-separation. This proposal faces two important problems: on the one hand, an adaptive response to the dynamic airspace reconfiguration of a CTR that does not necessarily terminate the flight, and on the other, self-managed conflict resolution that maintains traffic separation without the intervention of air traffic controllers. This paper proposes a solution named Reinforcement Learning Multi-Agent Separation Management (RL-MASM). It employs a multi-agent reinforcement learning system with a fully decentralized decision-making scheme, although it relies on a common information source about the environment. The proposed system is evaluated against classical control algorithms for obstacle avoidance to determine the potential benefits of AI-based methods. Results show that AI-based methods can benefit from knowing the intent of a UAS. This leads to fewer intrusions into no-fly zones and fewer collisions, and it also solves some scenarios that are challenging for classical control algorithms. From the aeronautical point of view, the proposed solution also introduces important advantages in terms of efficiency, scalability, and decentralization.

1. Introduction

Unmanned Aerial System (UAS) operations inside a Controlled Traffic Region (CTR) are of high interest, since these regions include most of the urban areas of big cities where UAS are intended to be used. A CTR is a controlled zone around an airport which usually has a high air-traffic density and includes different types of traffic: commercial flights flying under Instrument Flight Rules (IFR) [1], general aviation (small aircraft, helicopters, etc.) flying under Visual Flight Rules (VFR), and now UAS and other types of advanced air mobility that also need to be included. This makes a CTR an especially complex airspace where different types of traffic fly under different flight rules.
The safe integration of UAS traffic into different types of airspaces, especially CTRs, is one of the big challenges for the growth of unmanned air traffic. This problem must be addressed from several perspectives: legislative, technological, and organizational [2,3]. The permissiveness of the legislative perspective is usually in line with the degree of development achieved in the other two. For example, the legislation of most European Union member states [4] is currently quite restrictive and only allows UAS traffic in confined areas. This precludes UAS operations such as intracity transportation or inspection missions inside a CTR.
Both EUROCONTROL and the Federal Aviation Administration (FAA) have addressed the problem of integrating UAS into the airspace in a safe and standardized manner through the ConOps framework [5,6]. This framework defines key concepts like UAS types and categories, their operational environments, Unmanned Traffic Management (UTM), and their integration with Air Traffic Management (ATM). However, one of the most interesting aspects is the use of U-space services. These provide a set of services and technologies that enable drones to fly beyond the visual line of sight (BVLOS). U-space services are intended to be available through a real-time communication network. Among them, this paper assumes the presence of a U-space monitoring service that delivers traffic information in a given airspace. This service will rely on technological facilities based on the "electronic conspicuity" concept. On this basis, this work contributes to providing support for other U-space services, specifically No-Fly Zones (NFZs) and UTM. NFZs are restricted areas that UASs cannot enter; in our case, these areas are reserved for manned aviation. This restriction will be enforced through facilities like "geo-fencing". UTM, in turn, will ensure that drones can detect and avoid other aircraft, manage congestion, and maintain safe separation from other drones and from manned aviation.
The Dynamic Airspace Reconfiguration (DAR) [7] concept proposed by U-space is the basis of a first approach to deal with integrated operations. DAR refers to the process of adjusting or modifying the structure and boundaries of the airspace to optimize its use based on particular factors, such as current traffic demand or weather conditions. In our case, DAR is used to define a number of NFZs within a CTR where UAS operation is restricted, so that only manned aviation is allowed. These NFZs can be dynamically redefined to maintain safety based on traffic, weather, or contingencies, and to give UASs the maximum operational level. The proposal of this work provides support for DARs.
The DAR concept can be implemented using obstacle avoidance and collision avoidance techniques. These techniques have been extensively studied in robotics [8,9]. The Conflict Resolution (CR) strategies proposed in this paper are based on a field of Artificial Intelligence (AI) called Reinforcement Learning (RL) and, more precisely, on a sub-field of RL known as Multi-Agent Reinforcement Learning (MARL). In MARL, multiple agents interact in the same environment. Several MARL structures are possible depending on whether the relationship between the agents is cooperative, competitive, or mixed. Structures also depend on whether the agents can communicate with each other or not.
The main advantage of using MARL over control algorithms is their adaptability to complex and dynamic environments. They are designed to balance exploration (trying new actions to discover better strategies) and exploitation (relying on known strategies that have worked well in the past). This makes RL suitable for environments where optimal actions are not always clear and need to be discovered over time.
Once manned aircraft and UAS traffic have been separated using DAR, the remaining issue is how to keep unmanned traffic separated. Several CR strategies have been proposed in previous research [10,11,12], ranging from fully centralized to fully decentralized. This work advocates a decentralized strategy based on self-separation, similar to that used for VFR separation. Centralized schemes are generally more efficient, but suffer from reduced reliability under uncertainty and communication failures, and tend to require larger amounts of computation, leading to scaling issues. The proposed approach avoids relying on a centralized unit and enables each drone to execute the CR process independently, enhancing responsiveness.
The problem of CR among multiple autonomous mobile agents has been addressed using AI. Solutions can be broadly classified into pair-wise CR and Multi-Aircraft Conflict Resolution (MACR). However, using some heuristics, MACR can be transformed into a set of pair-wise CRs. One proposal for this transformation is to process the conflicts hierarchically, computing the envelope and restricting it in subsequent conflicts [13]. Solutions can also be classified according to their level of decentralization. The work in [14] uses neural network architectures that employ inter-agent communication for decision making. This scheme either requires being executed centrally, with resolutions broadcast to all agents, or a highly reliable peer-to-peer communication facility between UASs, which seems unrealistic to implement as a U-space service. On the other hand, decentralized CR algorithms are based on defining an observation space with a reduced number of intruders. Delimiting this set is usually based on heuristics [15]. This technique has been proven effective in manned aviation, but determining this observation space becomes more complex as traffic density increases and traffic with very different dynamics coexists in the airspace.
The main contribution of this work is to propose and validate a scheme for UAS operation in a CTR based on AI techniques. This scheme allows overcoming the current restriction of UASs to segregated areas. The work also contributes the definition of an AI simulation framework for drones. An important aspect of this definition is that the proposed architecture supports environments with a variable number of drones, allowing policies and observation spaces that are not constrained to a predetermined number of agents. This work also defines a number of key scenarios in which collision avoidance techniques should be validated and compares a reference method used in classical control for routing and collision avoidance with our proposed AI-based algorithm, detailing their performance and advantages.
The proposal separates UASs from manned aviation based on the DAR concept and uses self-separation to separate UASs from one another. The implementation relies on a U-space service, a central facility that provides traffic information, and assumes that every UAS is equipped with a device providing electronic conspicuity to support this service. It also assumes that every UAS has submitted a flight plan. The solution provides the optimal path for every UAS in a completely decentralized way, dynamically avoiding conflicts with DARs and other UASs.
This paper is organized as follows: Section 2 discusses the problem of organizing air traffic into a CTR and our proposal to integrate UASs in such a scenario. Section 3 describes the modeling of the components and the environment to be implemented as a MARL algorithm. Section 4 details the formulation of the problem in terms of the RL agent. Section 5 deals with the training process of the RL agents. Section 6 analyzes the sensitivity of the system to the reward function parameters. Section 7 presents the major contribution of this paper, validating the results through several critical scenarios and comparing the performance of the AI approach with traditional control techniques. Lastly, Section 8 contains the conclusions and considerations of future work.

2. Introducing UAS Traffic into CTRs

Figure 1 shows an example of the CTR of a Spanish airport (LEVC, Valencia), showing the routes followed by different traffic types and how the urban area of the main city is totally included in it. It also includes a large number of small cities in the metropolitan area. The CTR extends vertically from the terrain up to 6000 ft. Commercial and military traffic flying under IFR take off and land along routes converging on the runway axis. General aviation flying under VFR flies through corridors to join the Aerodrome (AD) traffic pattern in the approaches. Since the CTR is class-D, the airport tower performs positive control over IFR traffic and clears VFR traffic to enter the CTR, although VFR traffic is responsible for self-separating from the rest of the traffic. Airports with a high traffic density (as many in the USA have) are class-B, so the tower is responsible for separating both IFR and VFR traffic.
In this context, new types of traffic related to Urban Air Mobility (UAM), and especially UAS, need to be introduced. Some UAS missions can be enclosed inside a small, well-defined region. These types of missions can be easily managed using the corresponding permits to temporarily segregate an airspace that is not flown by manned aviation. This is the current situation. However, we want to introduce missions where UASs fly between different points scattered around the CTR, like hospitals, small heliports, or cities, where a UAS would fly a route defined by a mission plan. The mission plans would go through a deconfliction process in the strategic phase (pre-flight), which is beyond the scope of this paper. However, conflicts will also arise in the tactical phase due to uncertainty and the dynamic nature of the environment. As discussed in [16], an airspace reconfiguration can cause strategically deconflicted flight plans to no longer be conflict-free, so routes need to be modified to preserve the separations.
The proposed solution is based on separating UASs from manned aviation (IFR or VFR) using NFZs where only manned aviation can fly and dynamically reconfiguring them when necessary according to the DAR concept. Figure 1 shows these protected areas with well-defined horizontal and vertical limits. There might be some static areas that are always activated, but most of them are dynamically activated when necessary. The separation of one UAS from other UASs would be based on self-separation.
The implementation of the proposed scheme is based on MARL. It is a solution based on self-separation, with fully decentralized decision-making, in which each UAS is modeled as an RL agent. RL is a type of machine learning where an agent learns to make decisions by interacting with an environment; the goal is for the agent to maximize a reward signal by taking actions that lead to the best outcomes over time. In this way, agents can learn the "optimal" strategy in a highly dynamic environment.
However, some central source of information about the environment is assumed. An alternative to a centralized information source would be one in which this information is supplemented, or even fully replaced, by readings from onboard sensors providing information about surrounding traffic. These are usually referred to as Detect and Avoid (DAA) devices, but they are not required in our system. The observation space consists of environment information, encoded as an image, supplemented with a vector containing the kinematic state of the surrounding UASs. The environment image provides the following information: the layout of the NFZs, the last known position of the surrounding traffic, their protection volumes (separation minima), and their intents in the form of mission plans. This centralized information source could realistically be implemented using U-space services such as the Network Identification Service (NIS), which allows obtaining information about nearby aircraft, and the Flight Authorisation Service (FAS), which allows the submission of mission plan information to UTM.
Figure 2 shows the role of AI in this problem. In the example, a UAS is trying to reach the next waypoint in its mission plan when an NFZ, corresponding to a CTR corridor, is activated through DAR. Two extreme strategies are possible to reach the target: holding until the NFZ is hopefully deactivated and the target can be reached following a straight route, or avoiding the obstacle by skirting its perimeter along a longer trajectory. The adopted strategy will obviously depend on the goals that the reward function tries to optimize (distance, time, etc.). But even with a given reward function, it will also depend on the environment: the time that the NFZ remains activated or its size. That is something that the RL agent should learn through training, so the agents' training must also be carefully designed to optimize their behavior.

3. Simulation Environment

The Reinforcement Learning Multi-Agent Separation Management (RL-MASM) algorithm requires a simulation environment so the agents may interact with it and gather experiences to learn the optimal policy. This simulation environment is divided into four main blocks: the agents, the airspace, the separation model, and the mission plans. In this section, each block is formally defined.

3.1. Agent Modeling

The multi-agent separation management problem presented in this work is formally defined as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) [17]. Dec-POMDP is a model for decision-making across multiple agents in which each agent receives a partial observation of the current state of the environment and chooses an action accordingly. Dec-POMDP is a generalization of the Markov Decision Process (MDP).
Formally, Dec-POMDPs are defined by the tuple $\langle N, S, A, T, \Omega, O, R, \gamma \rangle$ [18], where $N$ is the set of $n$ agents, $S$ is the set of environment states, $A$ is the set of joint actions, $T$ is the transition probability function, $\Omega$ is the set of joint observations, $O$ is the observation probability function, $R$ is the reward function, and $\gamma$ is the discount factor.
The agents use a stochastic policy $\pi$ that maps their local observation to a distribution over their action space, $\pi: O_i \times A_i \rightarrow [0, 1]$. Through training, this policy is tuned to maximize the discounted expected cumulative reward $J(\pi)$:
$$J(\pi) = \mathbb{E}\left[\left.\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\,\right|\, \pi\right] \quad (1)$$
RL techniques are widely used to find the optimal policy in MDP environments. The solution presented here is based on the Proximal Policy Optimization (PPO) algorithm [19], which has been proven to work in cooperative environments with homogeneous agents [20].
The agent requires a model of the drone to simulate the interactions with the environment. The state $s$ of drone $i$ at time $t$ is represented in Equation (2). It is formed by the drone's position ($x_i$, $y_i$), its speed ($v_i$), and its heading ($\chi_i$).
$$s_i^t = \begin{bmatrix} x_i & y_i & v_i & \chi_i \end{bmatrix}^t \quad (2)$$
Equations (3) and (4) define the state transition for an action composed of a heading and speed change ($\Delta \chi_i$, $\Delta v_i$).
$$\begin{bmatrix} v_i \\ \chi_i \end{bmatrix}^{t+1} = \begin{bmatrix} \max\!\left(v_{min}, \min\!\left(v_{max},\, v_i^t + \Delta v_i^t\right)\right) \\ \left(\chi_i^t + \Delta \chi_i^t\right) \bmod 360 \end{bmatrix} \quad (3)$$
$$\begin{bmatrix} x_i \\ y_i \end{bmatrix}^{t+1} = \begin{bmatrix} x_i \\ y_i \end{bmatrix}^{t} + \begin{bmatrix} \cos(\chi_i)\, v_i \\ \sin(\chi_i)\, v_i \end{bmatrix}^{t+1} \quad (4)$$
All agents are homogeneous and share the same performance limits, with Table 1 showing the main parameters for the drone model. The values were derived from the typical drone outlined in [21].
This implementation allows drones to hover. Future work may consider incorporating drones with different performance models in the same environment, including both multi-copter and fixed-wing drones.
The drones start each episode stationary, with their position defined by the first waypoint on their mission plan as described in Section 3.4.
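For illustration, the state transition of Equations (2)-(4) can be sketched in code as follows. This is a minimal sketch assuming the performance limits of Table 1 and the 2 s simulation step of Section 5; the class and method names are ours, not the authors' implementation.

```python
import numpy as np

# Minimal sketch of the drone state transition in Equations (2)-(4).
V_MIN, V_MAX = 0.0, 15.0      # speed limits in m/s (Table 1)
DT = 2.0                      # seconds of simulated time per environment step

class DroneState:
    def __init__(self, x, y, v=0.0, chi=0.0):
        # Position (m), speed (m/s), and heading (degrees), as in Equation (2).
        self.x, self.y, self.v, self.chi = x, y, v, chi

    def step(self, dv, dchi):
        # Equation (3): clamp the new speed to the performance envelope and
        # wrap the new heading to [0, 360).
        self.v = max(V_MIN, min(V_MAX, self.v + dv))
        self.chi = (self.chi + dchi) % 360.0
        # Equation (4): advance the position using the updated speed and heading.
        rad = np.deg2rad(self.chi)
        self.x += np.cos(rad) * self.v * DT
        self.y += np.sin(rad) * self.v * DT
        return self
```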

3.2. Airspace Modeling

The actions of Air Traffic Control (ATC) are simulated by dynamically segregating areas of the airspace, defined by independently activated NFZs. Drones must exit and avoid entering any activated NFZ. In this paper, several models of the airspace have been defined for training and evaluation of the system.
In order to encourage generalization to any airspace configuration, the agents are trained on airspaces with randomly generated NFZs. $n_{obstacle}$ NFZs are created and randomly located; overlap between two or more NFZs is allowed. Each NFZ is a regular polygon with $n$ sides, where $n \in \mathbb{N}: 3 \le n \le obs\_sides_{max}$, and a circumradius $r \in \mathbb{R}: 0.5 \cdot obs\_size_{max} \le r \le obs\_size_{max}$.
This process is repeated at the start of each training episode. The airspace models used in the evaluation are detailed in Section 7.
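A possible sketch of this random NFZ generation, under our reading of the sampling ranges above, is shown below; the function and argument names are ours.

```python
import numpy as np

# Sketch of the random NFZ generation: regular polygons with a random number of
# sides, circumradius, centre, and orientation; overlaps between NFZs are allowed.
def generate_nfzs(n_obstacle, env_size, obs_sides_max=6, obs_size_max=4000.0, rng=None):
    rng = rng or np.random.default_rng()
    nfzs = []
    for _ in range(n_obstacle):
        n_sides = rng.integers(3, obs_sides_max + 1)             # 3 <= n <= obs_sides_max
        radius = rng.uniform(0.5 * obs_size_max, obs_size_max)   # circumradius range
        center = rng.uniform(0.0, env_size, size=2)
        phase = rng.uniform(0.0, 2 * np.pi)                      # random orientation
        angles = phase + 2 * np.pi * np.arange(n_sides) / n_sides
        vertices = center + radius * np.column_stack((np.cos(angles), np.sin(angles)))
        nfzs.append(vertices)                                    # (n_sides, 2) vertex array
    return nfzs
```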

3.3. Separation Modeling

The safe separation modeling follows the work presented in [22]. The authors present a solution that defines a loss of separation as the intersection of the separation threshold volumes of two drones, with different separation thresholds depending on several factors, such as the drone category, the current traffic density, and the performance of the Communications, Navigation, and Surveillance (CNS) infrastructure. The separation threshold distances used in this article were sourced from a similar urban use case in the BUBBLES project validation exercises, concretely S2E2 [23], and are presented in Table 2.
Drones are randomly given a category at the start of each episode.

3.4. Mission Plan Creation

Mission plans represent the intent of each agent, providing a set of targets or waypoints that the drone has to reach. The mission plans are important as they provide a desired trajectory that the agents must optimize and also inform the intent of the surrounding agents. Having the intent of the rest of the traffic allows the agents to plan ahead and avoid future conflicts.
Random mission plans start at a random point, and subsequent waypoints are created within $WP_{maxDist}$ of the previous waypoint. To ensure that the agents can finish the mission if an ATC segregates an area, the intermediate waypoints are only created outside of segregable areas. The first and last waypoints of the mission plan are the take-off and landing sites. A mission plan is completed when the drone reaches each of its waypoints in order.
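A sketch of this mission-plan creation is given below. The `in_segregable_area` predicate is assumed to be supplied by the airspace model, and the helper names are ours; as a simplification, the sketch keeps all sampled waypoints outside segregable areas.

```python
import numpy as np

# Sketch of random mission-plan creation: waypoints are chained at most
# WP_MAX_DIST apart and rejected while they fall outside the environment or
# inside a segregable area, so the mission remains flyable if an area is segregated.
WP_MAX_DIST = 7500.0   # metres (Table 3)

def random_mission_plan(n_waypoints, env_size, in_segregable_area, rng=None):
    rng = rng or np.random.default_rng()
    plan = [rng.uniform(0.0, env_size, size=2)]                  # take-off site
    while len(plan) < n_waypoints:
        angle = rng.uniform(0.0, 2 * np.pi)
        dist = rng.uniform(0.0, WP_MAX_DIST)
        wp = plan[-1] + dist * np.array([np.cos(angle), np.sin(angle)])
        if np.all((wp >= 0.0) & (wp <= env_size)) and not in_segregable_area(wp):
            plan.append(wp)
    return plan                                                   # list of (x, y) waypoints
```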

4. Reinforcement Learning Framework

In this section, we describe the components of the system used to solve the environment described in the previous section.
RL-MASM is agnostic to the total number of participants in the airspace, enabling decentralized decision-making that takes all relevant hazards into account and acts accordingly.

4.1. Observation Space

The observation space for the agents was designed with two objectives in mind. First, the observation space must convey enough information to the agent so that it is able to learn the optimal policy. Second, the observation space must be compatible with an indefinite number of agents, so that it may be deployed in environments with varying numbers of participants.
To satisfy these requirements, a hybrid observation space was defined. It is composed of an image containing information about the airspace surrounding the agent and a vector with information about the current state of the agent and its target. This is defined as $o^t = [o_a^t, o_s^t]$, with $o_a^t$ being the airspace image and $o_s^t$ being the state vector observation.
The $o_a^t$ observation is a three-channel, 256 by 256 pixel square image that represents a 5 km × 5 km area surrounding the agent, resulting in a resolution of 19.5 m/pixel. This image contains the next waypoints in the mission plan, the active NFZs in the area, and the position, speed, intent, and separation minima of other drones.
Taking into account the state and intents of surrounding agents allows the agent to optimize its trajectory, reducing conflicts and distance traveled. Figure 3 represents the image observation received by an agent, showing itself, an NFZ, and another drone. The image is centered on the current agent's position and aligned with its current direction of flight. The intent of the current agent appears as a green line towards the next waypoints. The separation minima of both drones are shown as blue areas, and the intents of the other drones are shown as fading blue lines. The red areas represent active NFZs.
This representation of the airspace offers an observation space that is not dependent on the number of participants, allowing for an indefinite number of participants in the airspace, as well as representing NFZs of indefinite shapes. Previous approaches only allow for a set number of intruders and use heuristics to extract the most relevant actors to use in the decision-making step [15].
The area covered by the image has been adapted to the performance and separation minima used in this article and may need to be enlarged for faster drones.
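As an illustration, a rasterization of $o_a^t$ in the spirit of Figure 3 could be sketched as follows. The channel assignment, coordinate transform, and drawing helpers are ours, not the authors' code; only the image size, coverage, and the kinds of elements drawn follow the description above.

```python
import numpy as np
from PIL import Image, ImageDraw

# Illustrative rasterization of the airspace observation: a 256 x 256, 3-channel
# image covering 5 km around the agent, with active NFZs on the red channel, the
# agent's own intent on the green channel, and other traffic (separation volume
# and intent) on the blue channel.
SIZE, SPAN = 256, 5000.0          # pixels, metres covered (~19.5 m/pixel)

def world_to_pixel(p, agent_pos, agent_heading_deg):
    # Translate to the agent frame and rotate so the agent's heading points "up".
    rad = np.deg2rad(90.0 - agent_heading_deg)
    rel = np.asarray(p, dtype=float) - np.asarray(agent_pos, dtype=float)
    rot = np.array([[np.cos(rad), -np.sin(rad)], [np.sin(rad), np.cos(rad)]]) @ rel
    return tuple((rot / SPAN + 0.5) * SIZE)

def render_observation(agent, waypoints, nfzs, others):
    img = Image.new("RGB", (SIZE, SIZE))
    draw = ImageDraw.Draw(img)
    to_px = lambda p: world_to_pixel(p, (agent.x, agent.y), agent.chi)
    for poly in nfzs:                                    # active NFZs -> red
        draw.polygon([to_px(v) for v in poly], fill=(255, 0, 0))
    pts = [to_px((agent.x, agent.y))] + [to_px(w) for w in waypoints]
    draw.line(pts, fill=(0, 255, 0), width=2)            # own intent -> green
    for o in others:                                     # other traffic -> blue
        cx, cy = to_px((o.x, o.y))
        r = o.sep_threshold / SPAN * SIZE                # separation volume in pixels
        draw.ellipse([cx - r, cy - r, cx + r, cy + r], fill=(0, 0, 255))
        if getattr(o, "waypoints", None):                # intent of the other drone
            draw.line([to_px((o.x, o.y))] + [to_px(w) for w in o.waypoints],
                      fill=(0, 0, 120), width=1)
    return np.asarray(img)                               # (256, 256, 3) uint8 array
```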
Information about the current state of the drone is also provided in $o_s^t$. Equation (5) defines this observation:
$$o_s^t = \begin{bmatrix} dist_{target}/env_{size} & \cos(\chi_i - \psi_i) & \sin(\chi_i - \psi_i) & v/v_{max} \end{bmatrix} \quad (5)$$
with $dist_{target}/env_{size}$ being the normalized distance to the next waypoint, $\chi_i - \psi_i$ the heading relative to the next waypoint, and $v/v_{max}$ the normalized velocity.
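A sketch of this state-vector observation follows; the helper names and the bearing convention are ours.

```python
import numpy as np

# Sketch of the state observation o_s of Equation (5) for a drone with heading
# chi (degrees) flying towards its next waypoint.
def state_observation(drone, waypoint, env_size=25000.0, v_max=15.0):
    dx, dy = waypoint[0] - drone.x, waypoint[1] - drone.y
    dist = np.hypot(dx, dy)                       # distance to the next waypoint
    psi = np.degrees(np.arctan2(dy, dx))          # bearing to the next waypoint
    rel = np.radians(drone.chi - psi)             # relative heading chi_i - psi_i
    return np.array([dist / env_size, np.cos(rel), np.sin(rel), drone.v / v_max],
                    dtype=np.float32)
```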

4.2. Action Space

As the environment proposed in this paper is limited to a 2D space, the action space is limited to the acceleration and turn rate. The action is formulated as $a_i(t) = [a_T, a_\chi]$, with $a_T, a_\chi \in \mathbb{R}$ and $-1 \le a \le 1$. $a_T$ acts over the acceleration and $a_\chi$ over the turn rate.
The acceleration and turn rate change for the state update are obtained by multiplying the action by the performance parameters of the drone, as in Equation (6):
$$\begin{bmatrix} \Delta v_i \\ \Delta \chi_i \end{bmatrix}^{t} = a_i^t \begin{bmatrix} \Delta v_{max} \\ \Delta \chi_{max} \end{bmatrix} \quad (6)$$
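In code form, this scaling amounts to the following trivial sketch, with the Table 1 limits as defaults (function name is ours):

```python
# Sketch of Equation (6): a normalized action in [-1, 1] is scaled to a speed and
# heading change within the drone's performance limits (Table 1).
def scale_action(a_thrust, a_turn, dv_max=1.0, dchi_max=17.0):
    # Returns (delta_v, delta_chi) to be applied in the state update of Equation (3).
    return a_thrust * dv_max, a_turn * dchi_max
```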

4.3. Reward Function

We developed a multi-objective reward function to describe the desired behavior of the agents. The reward function for agent $i$ is formulated in Equation (7). The reward is divided into the following sub-components: the reward for progressing on the mission plan ($r_i^{target}$), the avoidance reward ($r_i^{conflict}$), the nominal speed reward ($r_i^{vel}$), and the efficiency reward ($r_i^{eff}$).
$$r_i = r_i^{target} + r_i^{conflict} + r_i^{vel} + r_i^{eff} \quad (7)$$
The reward calculation uses the weights $W_{target}$, $W_{conf}$, $W_{vel}$, $W_{thrust}$, and $W_\chi$. The weights can be tuned to give priority to one objective over the others. Table 3 contains the weights used in the experiments.
The reward for progressing on the mission plan is defined in Equation (8). It is based on the change in distance to the next waypoint during the last step.
$$r_i^{target} = W_{target} \, \frac{dist_{i,target}^{t} - dist_{i,target}^{t+1}}{v_{nom}} \quad (8)$$
If the agent reaches a waypoint at the end of the step, $r_i^{target}$ is set to 1.
The avoidance reward ensures that the agents remain sufficiently separated during the episode. This reward penalizes the agent for getting closer to another agent or an NFZ. The discount function $f_d(dist)$ is used to give more importance to obstacles closer to the current drone.
The resulting $r_i^{conflict}$ is presented below:
$$r_i^{conflict} = W_{conf} \sum_{j \in N \setminus \{i\}} r_f \, f_d\!\left(dist_{ij}^{t+1}\right) \frac{dist_{ij}^{t} - dist_{ij}^{t+1}}{v_{max}} \quad (9)$$
$$r_f = \begin{cases} 1 & \text{if } dist_{ij}^{t} - dist_{ij}^{t+1} > 0 \\ 0.5 & \text{otherwise} \end{cases} \quad (10)$$
The distance to each obstacle $j$ ($dist_{ij}$) is calculated differently for drones and NFZs. For drones, the distance is measured from the exterior of their separation thresholds. For NFZs, the distance is measured between the center of the drone and the exterior of the NFZ. Figure 4a,b display examples of how this distance is calculated.
The reward is reduced by the factor $r_f$ when the agent is moving away from obstacle $j$, ensuring that an agent that approaches an obstacle and afterwards moves away from it accumulates a lower reward than one that never approaches it.
The discount function $f_d(dist)$ reduces the conflict penalty as the distance grows:
$$f_d(dist) = 1 - \frac{1}{1 + e^{-\alpha \, (dist - \beta)}} \quad (11)$$
This tunable logistic function is bounded to $[0, 1]$. We set the logistic growth rate $\alpha$ to 0.01 and the midpoint $\beta$ to 500.
The reward function has to be compatible with states in which there is an intrusion or a loss of separation, as these events may occur during training and evaluation. The path of highest reward in these cases is the shortest path to the resolution of the conflict. To further penalize these cases, $W_{conf}$ is added to $r_i^{conflict}$ once per instance of separation loss or intrusion until it is resolved.
The drones are encouraged to fly at the nominal speed by $r_i^{vel}$:
$$r_i^{vel} = W_{vel} \, \frac{\left| v_i^t - v_{nom} \right|}{v_{nom}} \quad (12)$$
The efficiency reward discourages changes to the current velocity or direction, preventing oscillations and promoting consistent trajectories.
$$r_i^{eff} = W_{thrust} \left| a_T^t \right| + W_{\chi} \left| a_\chi^t \right| \quad (13)$$
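The following sketch assembles the per-step reward of Equation (7) from these components, using the weights of Table 3. The helper names, the call signature, and the assumption that per-obstacle distances are precomputed are ours; the per-instance penalty for active separation losses or intrusions is omitted.

```python
import numpy as np

# Sketch of the per-step reward of Equations (7)-(13) with the Table 3 weights.
W_TARGET, W_CONF, W_VEL, W_THRUST, W_CHI = 0.75, -1.1, -0.15, -0.05, -0.05
V_NOM, V_MAX = 10.0, 15.0

def f_d(dist, alpha=0.01, beta=500.0):
    # Logistic distance discount (Equation (11)): ~1 for close obstacles, -> 0 far away.
    return 1.0 - 1.0 / (1.0 + np.exp(-alpha * (dist - beta)))

def step_reward(d_target_prev, d_target_next, reached_waypoint,
                obstacle_dists_prev, obstacle_dists_next, v, a_thrust, a_turn):
    # Progress towards the next waypoint (Equation (8)); 1 if a waypoint is reached.
    r_target = 1.0 if reached_waypoint else W_TARGET * (d_target_prev - d_target_next) / V_NOM
    # Conflict-avoidance term (Equations (9)-(10)) summed over drones and NFZs.
    r_conflict = 0.0
    for d_prev, d_next in zip(obstacle_dists_prev, obstacle_dists_next):
        r_f = 1.0 if d_prev - d_next > 0 else 0.5   # reduced when moving away
        r_conflict += W_CONF * r_f * f_d(d_next) * (d_prev - d_next) / V_MAX
    # Nominal-speed reward (Equation (12)) and efficiency reward (Equation (13)).
    r_vel = W_VEL * abs(v - V_NOM) / V_NOM
    r_eff = W_THRUST * abs(a_thrust) + W_CHI * abs(a_turn)
    return r_target + r_conflict + r_vel + r_eff
```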

4.4. Neural Network

The PPO algorithm uses an actor network that outputs the policy distribution and a critic network that estimates the value of the current state. In our case, both use the same central architecture. We have used a hybrid architecture to accommodate the observations ($o_a^t$, $o_s^t$). First, we implemented a custom feature extractor to obtain features from the image of the airspace. This feature extractor uses a series of two-dimensional convolutional layers (conv2D) with max pooling (MaxPool2d). Then, the drone state observation ($o_s^t$) is concatenated to the output of the last convolutional layer and passed through three subsequent fully connected layers of 256, 128, and 64 units with ReLU activation. Similar architectures have been used in the literature to process hybrid observations [24].
The action $a_i^t$ is sampled from the distribution generated by the actor network, and the critic network outputs the expected value of a given state.
Figure 5 summarizes the network architecture for the actor and critic.
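A PyTorch sketch of such a hybrid backbone is shown below. The sizes of the convolutional stack are illustrative, since the paper does not list them; only the 256/128/64 fully connected head follows the description above, and actor and critic heads would be attached on top as in Figure 5.

```python
import torch
import torch.nn as nn

# Illustrative hybrid backbone: a small CNN over the 3 x 256 x 256 airspace image,
# concatenated with the 4-element state vector, followed by 256/128/64 ReLU layers.
class HybridBackbone(nn.Module):
    def __init__(self, state_dim=4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():   # infer the flattened CNN output size
            n_cnn = self.cnn(torch.zeros(1, 3, 256, 256)).shape[1]
        self.mlp = nn.Sequential(
            nn.Linear(n_cnn + state_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )

    def forward(self, image, state):
        # image: (B, 3, 256, 256) airspace observation; state: (B, 4) vector o_s.
        features = self.cnn(image)
        return self.mlp(torch.cat([features, state], dim=1))
```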

5. RL-MASM Algorithm Training

With the environment and framework described in previous sections, the RL-MASM was trained using the PPO algorithm [19,25]. During training, a single policy is trained with experiences from all agents. Later, the agents are independently deployed during evaluation, following the Centralized Training Decentralized Execution paradigm.
RL-MASM was trained for 4.5 million total steps with 12 agents per environment. The parameters of the training environment are detailed in Table 4. Episodes are terminated when all agents have finished their missions or the maximum number of steps ($max_{steps}$) is reached.
Each environment step corresponds to 2 s of simulated time.
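A sketch of this training setup with Stable-Baselines3 [25] and the hyperparameters of Table 4 is shown below; `RLMASMEnv` is a hypothetical placeholder for the environment of Section 3 and is not shown here.

```python
from stable_baselines3 import PPO

# Sketch of the PPO training run with the Table 4 hyperparameters. With parameter
# sharing, a single policy is trained on the experiences of all agents and later
# deployed independently on each drone (Centralized Training, Decentralized Execution).
env = RLMASMEnv(n_agents=12)          # hypothetical environment wrapper
model = PPO(
    "MultiInputPolicy",               # handles the {image, state} dict observation
    env,
    learning_rate=3e-4,
    n_steps=9216,                     # rollout buffer size
    batch_size=512,
    gamma=0.99,                       # discount factor
    gae_lambda=0.95,                  # GAE parameter
    # policy_kwargs could supply the custom feature extractor of Section 4.4
)
model.learn(total_timesteps=4_500_000)
model.save("rl_masm_policy")
```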
The starting point for the reward weights was defined analytically and then refined through successive training iterations. The weights were selected to ensure that the reward values remained within the same order of magnitude throughout training, which helped minimize instability. Furthermore, in scenarios involving loss of separation, the reward for avoiding the other drone consistently outweighs the combined rewards of all other objectives.
In Section 6, we evaluate the effect each parameter has on the behavior of the system.
Figure 6 shows the total reward, intrusions, and losses of separation per episode gathered during training. The total reward per episode increases as the policy is trained, suggesting that the policy is learning from its interactions with the environment. The losses of separation and intrusions also decrease over time. These metrics count the number of simulation steps in which a drone was within the separation threshold of another drone or inside an NFZ, respectively.
An important point is that the initial conditions of the environment may lead to inevitable intrusions and losses of separation, independently of the performance of the agent. For example, agents may start the episode within an NFZ or in very close proximity to another agent. This forces the agents to learn how to solve these situations and, as the training progresses, their numbers are reduced until a baseline level is reached.

6. Reward Function Sensitivity Analysis

In this section, we evaluate the sensitivity of the system to the parameters of the reward function ($W_{target}$, $W_{conf}$, $W_{vel}$, $W_{thrust}$, and $W_\chi$). We performed 20 Monte Carlo training runs, varying the values of the reward weights. The values for each run were sampled from uniform distributions with the ranges shown in Table 5. These ranges were defined analytically.
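The sampling procedure can be sketched as follows (the dictionary keys are ours):

```python
import numpy as np

# Sketch of the Monte Carlo sampling for the sensitivity analysis: 20 training
# runs, each reward weight drawn uniformly from the ranges of Table 5.
rng = np.random.default_rng()
ranges = {
    "W_target": (0.5, 1.0),
    "W_conf": (-1.5, -0.25),
    "W_vel": (-0.25, 0.0),
    "W_thrust": (-0.25, 0.0),
    "W_chi": (-0.25, 0.0),
}
runs = [{name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
        for _ in range(20)]
```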
Figure 7 shows the linear correlation between the changes in the different parameters and the total distance traveled by all agents, the total number of separation losses, and the success rate. The success rate is defined as the ratio of episodes where all agents complete their flight plans.
The results indicate that, as expected, there is a trade-off between $W_{target}$ and $W_{conf}$ in terms of distance and losses of separation. Giving more weight to separation in the reward function leads to safer missions, but it negatively impacts the total distance and the success rate, as agents need to deviate from their mission plans to avoid other drones and NFZs.
The impact of the remaining parameters is negative on both metrics, as the behavior they enforce is not captured by the metrics collected. For the $W_{thrust}$ and $W_\chi$ parameters, runs without those terms show erratic flights with frequent changes in speed and heading. In the case of $W_{vel}$, the agents tend to hover or fly at their maximum speed. Although these behaviors perform well according to the presented metrics, they are undesirable for operators, as they affect the overall flight efficiency and increase the aircraft's maintenance cost.
The reward weights used in the evaluation are presented in Table 3.

7. Model Validation

Once the policy had been trained in a random environment, the policy was evaluated in a series of scenarios created to assess the performance of the separation management system. One of the goals of this study is to assess the performance of AI-based methods against classical control algorithms. For that reason, RL-MASM was compared to the classical control method Dynamic Window Approach (DWA) [26]. DWA is a non-communicating obstacle avoidance algorithm used extensively for mobile robot routing [27]. DWA offers great computational efficiency and robustness to changes in configuration parameters while not requiring explicit communication between drones.
Simulations with a basic agent that ignores NFZs and other participants were used as the baseline algorithm.
The algorithms were validated in the following environments:
  • Random Obstacle: This environment is similar to the training environment; however, for the evaluation, the NFZs may activate and deactivate dynamically as the episode progresses.
  • LEVC CTR with DAR: uses the Valencia CTR as defined in Airac 2310 [16]. This environment was chosen because it was well studied in the AURA project [16] as an example of UAS integration with ATM through DAR and offers interesting features for routing.
  • Converging Pattern: In this environment, the drones are distributed radially and have to traverse to the opposite side. It is a challenge for the conflict avoidance performance of the algorithms, as all mission plans intersect at the center of the environment. This environment has no NFZs.
  • Narrow Passage: In this environment, the agents must cross a narrow passage formed by two NFZs that remain active for the duration of the episode. It simulates a constrained navigation scenario with high congestion.
Figure 8 shows an example of each environment. In the figures, the NFZs are represented as golden shapes; each drone is represented by a triangle, with a blue circle around it, showing its separation threshold. The trajectory of each drone is shown in the same color as the drone, with its mission plan displayed as a line and the waypoints as squares. A bigger square is used to represent the next waypoint for each drone. In the CTR environment, the LEVC CTR profile is added for visualization purposes.
The validation is performed on 15 instances of each environment with the parameters defined in Table 6. The parameters used for DWA are in Table 7.
For each episode, the NFZs of the Random and CTR environments start with a 50% chance of being active and are periodically toggled with a randomized period $T_{act} \in \mathbb{R}: 250 \le T_{act} \le 750$ s. A random period is sampled per NFZ per episode.
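A sketch of this activation schedule follows; the function and variable names are ours.

```python
import numpy as np

# Sketch of the NFZ activation schedule in the Random and CTR evaluation
# environments: each NFZ starts active with probability 0.5 and toggles with its
# own period, sampled uniformly from [250, 750] s once per episode.
def nfz_schedule(n_nfz, episode_time, dt=2.0, rng=None):
    rng = rng or np.random.default_rng()
    active = rng.random(n_nfz) < 0.5                     # initial activation state
    periods = rng.uniform(250.0, 750.0, size=n_nfz)      # one random period per NFZ
    next_toggle = periods.copy()
    timeline = []
    for t in np.arange(0.0, episode_time, dt):
        due = t >= next_toggle
        active = np.where(due, ~active, active)          # toggle the NFZs that are due
        next_toggle = np.where(due, next_toggle + periods, next_toggle)
        timeline.append(active.copy())
    return np.array(timeline)                            # (n_steps, n_nfz) boolean matrix
```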
Several metrics were collected to evaluate the performance of the algorithms: (a) "Success rate", the ratio of episodes in which all agents complete their missions, reaching all the waypoints in the plan; (b) total distance traveled; (c) episode length; (d) the number of simulation steps intruding an NFZ; and (e) the total number of steps with separation loss. Table 8 shows the results of the validation with DWA and the proposed algorithm, as well as the basic algorithm used as a baseline. The table shows the mean distance traversed by all the agents in each episode and the number of simulation steps required by the agents to achieve their missions. If the maximum number of simulation steps is reached, the episode is stopped and marked as unsuccessful.
Episodes finish when all agents successfully complete their missions or when the maximum number of steps ($max_{steps}$) is reached.
The results show that RL-MASM is able to solve the environments at a higher rate than DWA, with a lower mean episode length. The DWA algorithm tends to deadlock at high traffic densities and with convex NFZ shapes. Such a situation can be observed in Figure 9a, which shows the DWA algorithm in the Narrow Passage environment. Due to the sizes of the drones' separation thresholds and the NFZs, none of the drones can progress without a loss of separation, and they enter a deadlock state, hindering the successful completion of the episode. In environments where the NFZs may deactivate, this effect leads to a lower total traveled distance but an increased episode length. This is especially significant in the CTR environment. In turn, the proposed algorithm can observe the NFZ and the intent of other agents, leveraging this information to optimize the path and complete the assigned mission, as can be seen in Figure 9b.
The Converging Pattern environment challenges the agents to plan ahead to avoid conflicts with other drones. The DWA method fails to complete some instances on time, as the drones become surrounded in the center and find themselves in deadlocks similar to those in the Narrow Passage environment, dramatically increasing the time required to solve the episode. The proposed algorithm is able to complete the missions within a lower timeframe while avoiding most of the conflicts, reducing the mean number of steps with losses of separation by 93% compared to no separation management. Figure 10a shows that the trajectories followed by the agents solve the scenario successfully, with some agents following the shortest path and others preemptively routing around the center to avoid conflicts.
Figure 10b shows the cumulative trajectories of the agents in the Narrow Passage environment. Although there is a small number of intrusions into the NFZs, agents are able to adapt their trajectories to the incoming traffic, often circling around to allow other drones to pass and thus avoiding potential deadlocks. The number of intrusions and losses of separation in this environment shows that there is still room for improvement, especially in situations with high traffic density.
As an example of the CTR environment, Figure 11a shows the initial configuration, and Figure 11b displays the end of the episode as the agents are reaching their last waypoints, with the tracks flown by each drone. Throughout the episode, the CTR NFZs have been toggled with random periods. This shows the agents preemptively modifying their trajectories to avoid potential losses of separation and exiting activated NFZs as soon as possible.
The proposed algorithm also outperforms DWA in the number of NFZ intrusions, at the cost of a higher number of separation losses. The results in Figure 12b show a marked improvement over the baseline in the number of separation losses, which is especially notable considering the density of the airspace.

8. Conclusions

This work presents RL-MASM, a novel framework for integrating heterogeneous traffic in a CTR based on multi-agent reinforcement learning, with a particular focus on the dynamic airspace reconfiguration scheme proposed by SESAR. The solution assumes a high level of autonomy of the participants, requiring minimal effort from ATC to maintain separation.
RL-MASM implements an observation space that represents the current airspace and nearby traffic, without constraints on the number of participants or predefined NFZ geometries. It was validated across diverse scenarios involving various airspace structures and traffic densities, including a recreation of the Valencia CTR, with a higher success rate and improved separation management when compared to DWA.
While current evaluations assume ideal communication conditions, real-world operations face latency, data loss, and sensor inaccuracies. Future work will address these challenges and explore vertical separation strategies to further enhance robustness.
As urban air mobility continues to expand, intelligent airspace management systems like RL-MASM will be critical to ensuring the safe and scalable integration of UAS worldwide.

Author Contributions

Conceptualization, J.V.N. and J.A.V.C.; methodology, J.V.N.; software, J.V.N.; validation, J.V.N.; formal analysis, J.V.N. and J.A.V.C.; investigation, J.V.N.; resources, J.V.N. and J.A.V.C.; data curation, J.V.N.; writing—original draft preparation, J.V.N.; writing—review and editing, J.A.V.C.; visualization, J.V.N.; supervision, J.A.V.C.; project administration, J.A.V.C.; funding acquisition, J.A.V.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was conducted with funding from Generalitat Valenciana under an ACIF grant (CIACIF/2021/489). Funding for open access charge: Universitat Politècnica de València.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript: AI: Artificial Intelligence, ATC: Air Traffic Control, ATM: Air Traffic Management, CNS: Communications, Navigation, and Surveillance, CR: Conflict Resolution, DAR: Dynamic Airspace Reconfiguration, Dec-POMDP: Decentralized Partially Observable Markov Decision Process, DWA: Dynamic Window Approach, FAA: Federal Aviation Administration, FAS: Flight Authorisation Service, IFR: Instrument Flight Rules, MACR: Multi-Aircraft Conflict Resolution, MARL: Multi-Agent Reinforcement Learning, MDP: Markov Decision Process, PPO: Proximal Policy Optimization, RL: Reinforcement Learning, RL-MASM: Reinforcement Learning Multi-Agent Separation Management, UAM: Urban Air Mobility, UAS: Unmanned Aerial System, UTM: Unmanned Traffic Management, VFR: Visual Flight Rules.

References

  1. ICAO. Annex 2—Rules of the Air; Technical Report; ICAO: Montreal, QC, Canada, 2025. [Google Scholar]
  2. Kleczatský, A.; Hulínská, v.; Kraus, J. Enabling UAS BVLOS flights in CTR. J. Phys. Conf. Ser. 2023, 2526, 012094. [Google Scholar] [CrossRef]
  3. Politi, E.; Purucker, P.; Larsen, M.; Reis, R.J.D.; Rajan, R.T.; Penna, S.D.; Boer, J.F.; Rodosthenous, P.; Dimitrakopoulos, G.; Varlamis, I.; et al. Enabling Technologies for the Navigation and Communication of UAS Operating in the Context of BVLOS. Electronics 2024, 13, 340. [Google Scholar] [CrossRef]
  4. EASA. Easy Access Rules for Standardised European Rules of the Air (SERA); Technical Report; EASA: Cologne, Germany, 2024. [Google Scholar]
  5. EUROCONTROL. U-Space Concept of Operations (CONOPS); Technical Report; SESAR Joint Undertaking: Brussels, Belgium, 2023. [Google Scholar]
  6. Federal Aviation Administration. UTM Concept of Operations Version 2.0 (UTM ConOps v2.0); Technical Report; NextGEN: Atlanta, GA, USA, 2020.
  7. ECAC. European Commission. Commission Implementing Regulation (EU) 2021/664 of 22 April 2021 on a Regulatory Framework for the U-Space (Text with EEA Relevance); Technical Report; ECAC: Neuilly-sur-Seine, France, 2021.
  8. Feng, S.; Sebastian, B.; Ben-Tzvi, P. A Collision Avoidance Method Based on Deep Reinforcement Learning. Robotics 2021, 10, 73. [Google Scholar] [CrossRef]
  9. Song, S.; Saunders, K.; Yue, Y.; Liu, J. Smooth Trajectory Collision Avoidance through Deep Reinforcement Learning. In Proceedings of the 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA), Nassau, Bahamas, 12–14 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 914–919. [Google Scholar] [CrossRef]
  10. Ribeiro, M.; Ellerbroek, J.; Hoekstra, J. Review of conflict resolution methods for manned and unmanned aviation. Aerospace 2020, 7, 79. [Google Scholar] [CrossRef]
  11. Bilimoria, K.; Lee, H.; Mao, Z.; Feron, E. Comparison of Centralized and Decentralized Conflict Resolution Strategies for Multiple-Aircraft Problems. In Proceedings of the 18th Applied Aerodynamics Conference, Denver, CO, USA, 14–17 August 2000; American Institute of Aeronautics and Astronautics Inc.: Reston, VA, USA, 2000. [Google Scholar] [CrossRef]
  12. Guan, X.; Lyu, R.; Shi, H.; Chen, J. A survey of safety separation management and collision avoidance approaches of civil UAS operating in integration national airspace system. Chin. J. Aeronaut. 2020, 33, 2851–2863. [Google Scholar] [CrossRef]
  13. Zhang, H.; Zhou, J.; Shi, Z.; Li, Y.; Zhang, J. DDQNC-P: A framework for civil aircraft tactical synergetic trajectory planning under adverse weather conditions. Chin. J. Aeronaut. 2024, 37, 434–457. [Google Scholar] [CrossRef]
  14. Dalmau, R.; Allard, E. Air Traffic Control Using Message Passing Neural Networks and Multi-Agent Reinforcement Learning. In Proceedings of the 10th SESAR Innovation Days (SID), Virtual Event, 7–10 December 2020. [Google Scholar]
  15. Chen, Y.; Hu, M.; Yang, L.; Xu, Y.; Xie, H. General multi-agent reinforcement learning integrating adaptive manoeuvre strategy for real-time multi-aircraft conflict resolution. Transp. Res. Part Emerg. Technol. 2023, 151, 104125. [Google Scholar] [CrossRef]
  16. Janisch, D.; Sánchez-Escalonilla, P.; Cervero, J.M.; Vidaller, A.; Borst, C. Exploring Tower Control Strategies for Concurrent Manned and Unmanned Aircraft Management. In Proceedings of the 2023 IEEE/AIAA 42nd Digital Avionics Systems Conference (DASC), Barcelona, Spain, 1–5 October 2023; pp. 1–10. [Google Scholar] [CrossRef]
  17. Littman, M.L. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994; Cohen, W.W., Hirsh, H., Eds.; Morgan Kaufmann: San Francisco, CA, USA, 1994; pp. 157–163. [Google Scholar] [CrossRef]
  18. Oliehoek, F.A.; Amato, C. A Concise Introduction to Decentralized POMDPs; Springer International Publishing: Cham, Switzerland, 2016. [Google Scholar] [CrossRef]
  19. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  20. Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of PPO in cooperative multi-agent games. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022. NIPS ’22. [Google Scholar]
  21. BUBBLES Consortium. Algorithm for Analysing the Collision Risk; Technical Report; Universitat Politècnica de València (UPV): Valencia, Spain, 2021. [Google Scholar]
  22. Vila Carbó, J.A.; Balbastre Tejedor, J.V.; Morcillo Pallarés, P.; Yuste Pérez, P. Risk-Based Method for Determining Separation Minima in Unmanned Aircraft Systems. J. Air Transp. 2023, 31, 57–67. [Google Scholar] [CrossRef]
  23. BUBBLES Consortium. Concept Validation Test Plan (CVALP); Technical Report; Universitat Politècnica de València (UPV): Valencia, Spain, 2022. [Google Scholar]
  24. Choi, J.; Lee, G.; Lee, C. Reinforcement learning-based dynamic obstacle avoidance and integration of path planning. Intell. Serv. Robot. 2021, 14, 663–677. [Google Scholar] [CrossRef] [PubMed]
  25. Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; Dormann, N. Stable-Baselines3: Reliable Reinforcement Learning Implementations. J. Mach. Learn. Res. 2021, 22, 1–8. [Google Scholar]
  26. Fox, D.; Burgard, W.; Thrun, S. The dynamic window approach to collision avoidance. IEEE Robot. Autom. Mag. 1997, 4, 23–33. [Google Scholar] [CrossRef]
  27. Cao, Y.; Mohamad Nor, N. An improved dynamic window approach algorithm for dynamic obstacle avoidance in mobile robot formation. Decis. Anal. J. 2024, 11, 100471. [Google Scholar] [CrossRef]
Figure 1. Structure of LEVC CTR.
Figure 2. Different strategies for an RL agent to deal with an NFZ which is dynamically activated and then deactivated.
Figure 3. Visualization of the observation $o_a^t$ for an agent.
Figure 4. Distance measurements for obstacles. (a) Distance calculation between UAS. (b) Distance calculation between UAS and NFZ.
Figure 5. Architecture of the actor and the critic. (a) Actor architecture. (b) Critic architecture.
Figure 6. Training results per episode with value in blue and exponentially weighted moving average in orange. (a) Reward. (b) Intrusions. (c) Sep. losses.
Figure 7. Correlation between changes in weight values and metrics: (a) total distance and number of losses of separation, and (b) success rate.
Figure 8. Validation scenarios. (a) LEVC CTR with DAR environment, with CTR overlaid for visualization. (b) Random obstacle environment. (c) Converging pattern environment. (d) Narrow passage environment.
Figure 9. Examples of the Narrow Passage environment. (a) DWA deadlock in the Narrow environment. (b) RL-MASM in the Narrow environment.
Figure 10. Cumulative trajectories of 15 episodes on the Converging Pattern (a) and Narrow Passage (b) environments.
Figure 11. CTR simulation. (a) Simulation step 0. (b) Simulation step 1295.
Figure 12. Validation metrics. (a) Total simulation steps within an NFZ per episode. (b) Total simulation steps in separation loss per episode. (c) Episode length (steps). (d) Total distance traveled per episode.
Table 1. Drone performance parameters.
Parameter | Description | Value
$\Delta v_{max}$ | Acceleration | 1 m/s²
$\Delta \chi_{max}$ | Track change rate | 17 °/s
$v_{min}$ | Minimum speed | 0 m/s
$v_{nom}$ | Nominal speed | 10 m/s
$v_{max}$ | Maximum speed | 15 m/s
Table 2. Horizontal separation distances.
Category | Threshold (m)
A1 | 102.00
A2 | 102.00
A3 | 198.36
SAIL I–II | 234.20
SAIL III–IV | 263.64
SAIL V–VI | 281.90
Certified (no passenger) | 471.01
Certified (passenger) | 640.19
Table 3. Environment parameters.
Parameter | Description | Value
$n_{agents}$ | Number of agents | 12
$max_{steps}$ | Step limit per episode | 1750
$env_{size}$ | Environment size | 25 km
$n_{obstacle}$ | Number of NFZs | 15
$obs\_sides_{max}$ | Maximum number of sides of an NFZ | 6
$obs\_size_{max}$ | Maximum size of random NFZs | 4 km
$WP_{maxDist}$ | Maximum distance between waypoints | 7.5 km
$W_{target}$ | Weight for target reward | 0.75
$W_{conf}$ | Weight for conflict avoidance reward | −1.1
$W_{vel}$ | Weight for nominal speed reward | −0.15
$W_{thrust}$ | Weight for efficiency reward for speed changes | −0.05
$W_{\chi}$ | Weight for efficiency reward for track changes | −0.05
Table 4. Training hyperparameters.
Hyperparameter | Value
Learning rate | 3 × 10⁻⁴
Number of training steps | 4,500,000
Batch size | 512
Rollout buffer | 9216
Discount factor | 0.99
Generalized Advantage Estimator (GAE) parameter | 0.95
Table 5. Weight ranges.
Parameter | Range
$W_{target}$ | 0.5 to 1
$W_{conf}$ | −1.5 to −0.25
$W_{vel}$ | −0.25 to 0
$W_{thrust}$ | −0.25 to 0
$W_{\chi}$ | −0.25 to 0
Table 6. Validation environment parameters.
Parameter | CTR with DAR | Random | Converging | Narrow
$n_{agents}$ | 15 | 15 | 10 | 8
$max_{steps}$ | 2500 | 2500 | 2500 | 2500
$env_{size}$ | 30 km | 30 km | 30 km | 25 km
Table 7. DWA parameters.
Parameter | Value
Simulation time | 10 s
Time step | 1 s
Velocity space samples | 7
Heading space samples | 7
Weight path | 0.75
Weight separation/intrusion | 1
Table 8. Performance metrics of the algorithms with the different environment types. Mean/σ are shown for distance and length.
Environment | Algorithm | Success Rate | Distance (km) | Length (Steps)
CTR with DAR | RL-MASM | 93% | 455.67/7.11 | 1683.60/329.03
CTR with DAR | DWA | 87% | 366.76/6.62 | 2121.13/229.89
CTR with DAR | Baseline | — | 299.22/5.4 | 774.87/21.41
Random | RL-MASM | 87% | 460.58/11.01 | 1800.47/407.16
Random | DWA | 73% | 370.13/7.36 | 2178.60/251.95
Random | Baseline | — | 297.31/5.83 | 767.53/30.90
Converging | RL-MASM | 100% | 165.88/6.82 | 1012.87/71.79
Converging | DWA | 87% | 180.69/8.81 | 2046.93/248.13
Converging | Baseline | — | 127.158/3.35 | 700.07/46.82
Narrow | RL-MASM | 100% | 125.80/4.53 | 964.67/50.59
Narrow | DWA | 67% | 133.23/9.30 | 1966.73/385.82
Narrow | Baseline | — | 103.51/3.63 | 636.60/30.02